# FNN-VAE for noisy time series forecasting

This post didn’t end up quite the way I’d imagined. A quick follow-up on the recent Time series prediction with FNN-LSTM, it was supposed to demonstrate how *noisy* time series (so common in practice) could profit from a change in architecture: instead of FNN-LSTM, an LSTM autoencoder regularized by false nearest neighbors (FNN) loss, use FNN-VAE, a variational autoencoder constrained by the same. However, FNN-VAE didn’t seem to handle noise better than FNN-LSTM. No plot, no post, then?

On the other hand, this isn’t a scientific study, with hypothesis and experimental setup all preregistered; all that really matters is whether there’s something useful to report. And it looks like there is.

Firstly, FNN-VAE, while on par performance-wise with FNN-LSTM, is far superior in that other meaning of “performance”: training goes a *lot* faster for FNN-VAE.

Secondly, while we don’t see much difference between FNN-LSTM and FNN-VAE, we *do* see a clear impact of using FNN loss. Adding in FNN loss strongly reduces mean squared error with respect to the underlying (denoised) series, especially in the case of VAE, but for LSTM as well. This is of particular interest with VAE, as it comes with a regularizer out-of-the-box, namely Kullback-Leibler (KL) divergence.

Of course, we don’t claim that similar results will always be obtained on other noisy series; nor did we tune any of the models “to death.” For what could be the intent of such a post but to show our readers interesting (and promising) ideas to pursue in their own experimentation?

## The context

This post is the third in a mini-series.

In Deep attractors: Where deep learning meets chaos, we explained, with a substantial detour into chaos theory, the idea of FNN loss, introduced in (Gilpin 2020). Please consult that first post for theoretical background and intuitions behind the technique.

The subsequent post, Time series prediction with FNN-LSTM, showed how to use an LSTM autoencoder, constrained by FNN loss, for forecasting (as opposed to reconstructing an attractor). The results were stunning: In multi-step prediction (12-120 steps, with that number varying by dataset), the short-term forecasts were drastically improved by adding in FNN regularization. See that second post for experimental setup and results on four very different, non-synthetic datasets.

Today, we show how to replace the LSTM autoencoder by a (convolutional) VAE. In light of the experimentation results, already hinted at above, it is completely plausible that the “variational” part is not even so important here: that a convolutional autoencoder with just MSE loss would have performed just as well on those data. In fact, to find out, it’s enough to remove the call to `reparameterize()` and multiply the KL component of the loss by 0. (We leave this to the reader, to keep the post at reasonable length.)

One last piece of context, in case you haven’t read the two previous posts and would like to jump in here directly. We’re doing time series forecasting; so why this talk of autoencoders? Shouldn’t we just be comparing an LSTM (or some other type of RNN, for that matter) to a convnet? In fact, the necessity of a latent representation is due to the very idea of FNN: The latent code is supposed to reflect the true attractor of a dynamical system. That is, if the attractor of the underlying system is roughly two-dimensional, we hope to find that just two of the latent variables have considerable variance. (This reasoning is explained in a lot of detail in the previous posts.)

## FNN-VAE

So, let’s start with the code for our new model.

The encoder takes the time series, of format `batch_size x num_timesteps x num_features` just like in the LSTM case, and produces a flat, 10-dimensional output: the latent code, which FNN loss is computed on.

```
library(tensorflow)
library(keras)
library(tfdatasets)
library(tfautograph)
library(reticulate)
library(purrr)

# convolutional encoder: four strided 1D convolutions, then flatten
vae_encoder_model <- function(n_timesteps,
                              n_features,
                              n_latent,
                              name = NULL) {
  keras_model_custom(name = name, function(self) {
    self$conv1 <- layer_conv_1d(kernel_size = 3,
                                filters = 16,
                                strides = 2)
    self$act1 <- layer_activation_leaky_relu()
    self$batchnorm1 <- layer_batch_normalization()
    self$conv2 <- layer_conv_1d(kernel_size = 7,
                                filters = 32,
                                strides = 2)
    self$act2 <- layer_activation_leaky_relu()
    self$batchnorm2 <- layer_batch_normalization()
    self$conv3 <- layer_conv_1d(kernel_size = 9,
                                filters = 64,
                                strides = 2)
    self$act3 <- layer_activation_leaky_relu()
    self$batchnorm3 <- layer_batch_normalization()
    # last convolution has n_latent filters and no nonlinearity
    self$conv4 <- layer_conv_1d(
      kernel_size = 9,
      filters = n_latent,
      strides = 2,
      activation = "linear"
    )
    self$batchnorm4 <- layer_batch_normalization()
    self$flat <- layer_flatten()

    # forward pass
    function (x, mask = NULL) {
      x %>%
        self$conv1() %>%
        self$act1() %>%
        self$batchnorm1() %>%
        self$conv2() %>%
        self$act2() %>%
        self$batchnorm2() %>%
        self$conv3() %>%
        self$act3() %>%
        self$batchnorm3() %>%
        self$conv4() %>%
        self$batchnorm4() %>%
        self$flat()
    }
  })
}
```

The decoder starts from this (flat) representation and decompresses it into a time series. In both encoder and decoder (de-)conv layers, parameters are chosen to handle a sequence length (`num_timesteps`) of 120, which is what we’ll use for prediction below.

```
# convolutional decoder: reshape the flat code, then four transposed convolutions
vae_decoder_model <- function(n_timesteps,
                              n_features,
                              n_latent,
                              name = NULL) {
  keras_model_custom(name = name, function(self) {
    # flat code -> sequence of length 1 with n_latent channels
    self$reshape <- layer_reshape(target_shape = c(1, n_latent))
    self$conv1 <- layer_conv_1d_transpose(kernel_size = 15,
                                          filters = 64,
                                          strides = 3)
    self$act1 <- layer_activation_leaky_relu()
    self$batchnorm1 <- layer_batch_normalization()
    self$conv2 <- layer_conv_1d_transpose(kernel_size = 11,
                                          filters = 32,
                                          strides = 3)
    self$act2 <- layer_activation_leaky_relu()
    self$batchnorm2 <- layer_batch_normalization()
    self$conv3 <- layer_conv_1d_transpose(
      kernel_size = 9,
      filters = 16,
      strides = 2,
      output_padding = 1
    )
    self$act3 <- layer_activation_leaky_relu()
    self$batchnorm3 <- layer_batch_normalization()
    # single output channel, no nonlinearity
    self$conv4 <- layer_conv_1d_transpose(
      kernel_size = 7,
      filters = 1,
      strides = 1,
      activation = "linear"
    )
    self$batchnorm4 <- layer_batch_normalization()

    # forward pass
    function (x, mask = NULL) {
      x %>%
        self$reshape() %>%
        self$conv1() %>%
        self$act1() %>%
        self$batchnorm1() %>%
        self$conv2() %>%
        self$act2() %>%
        self$batchnorm2() %>%
        self$conv3() %>%
        self$act3() %>%
        self$batchnorm3() %>%
        self$conv4() %>%
        self$batchnorm4()
    }
  })
}
```
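To see why these kernel sizes and strides fit a sequence length of 120, it can help to trace the temporal dimension through both stacks. The sketch below uses base R only and assumes the Keras defaults for these layers (`padding = "valid"`, no dilation); the helper names `conv_out()` and `deconv_out()` are ours, not part of keras.

```r
# Output length of a 1D convolution with "valid" padding (assumed default)
conv_out <- function(len, kernel, stride) floor((len - kernel) / stride) + 1

# Output length of a 1D transposed convolution with "valid" padding
deconv_out <- function(len, kernel, stride, output_padding = 0) {
  (len - 1) * stride + kernel + output_padding
}

# Encoder: (kernel_size, strides) of conv1 ... conv4
enc <- list(c(3, 2), c(7, 2), c(9, 2), c(9, 2))
len <- 120
enc_lengths <- numeric(0)
for (p in enc) {
  len <- conv_out(len, p[1], p[2])
  enc_lengths <- c(enc_lengths, len)
}

# Decoder: (kernel_size, strides, output_padding) of conv1 ... conv4
dec <- list(c(15, 3, 0), c(11, 3, 0), c(9, 2, 1), c(7, 1, 0))
len <- 1
dec_lengths <- numeric(0)
for (p in dec) {
  len <- deconv_out(len, p[1], p[2], p[3])
  dec_lengths <- c(dec_lengths, len)
}

print(enc_lengths)  # 59 27 10 1
print(dec_lengths)  # 15 53 114 120
```

Under these assumptions the encoder shrinks 120 timesteps to a single step with `n_latent` channels, which flattens to the 10-dimensional code, and the decoder grows that single step back to 120. Other sequence lengths would need different kernel sizes, strides or output padding.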

Note that even though we called these constructors `vae_encoder_model()` and `vae_decoder_model()`, there is nothing variational to these models per se; they are really just an encoder and a decoder, respectively. Metamorphosis into a VAE will happen in the training procedure; in fact, the only two things that will make this a VAE are going to be the reparameterization of the latent layer and the added-in KL loss.

Speaking of training, these are the routines we’ll call. The function to compute FNN loss, `loss_false_nn()`, can be found in both of the abovementioned predecessor posts; we kindly ask the reader to copy it from one of these places.

```
# to reparameterize encoder output before calling decoder
reparameterize <- function(mean, logvar = 0) {
  eps <- k_random_normal(shape = n_latent)
  eps * k_exp(logvar * 0.5) + mean
}

# loss has 3 components: NLL, KL, and FNN
# otherwise, this is just normal TF2-style training
train_step_vae <- function(batch) {
  with (tf$GradientTape(persistent = TRUE) %as% tape, {
    code <- encoder(batch[[1]])
    z <- reparameterize(code)
    prediction <- decoder(z)

    l_mse <- mse_loss(batch[[2]], prediction)
    # see loss_false_nn in 2 previous posts
    l_fnn <- loss_false_nn(code)
    # KL divergence to a standard normal
    l_kl <- -0.5 * k_mean(1 - k_square(z))
    # overall loss is a weighted sum of all 3 components
    loss <- l_mse + fnn_weight * l_fnn + kl_weight * l_kl
  })

  encoder_gradients <-
    tape$gradient(loss, encoder$trainable_variables)
  decoder_gradients <-
    tape$gradient(loss, decoder$trainable_variables)

  optimizer$apply_gradients(purrr::transpose(list(
    encoder_gradients, encoder$trainable_variables
  )))
  optimizer$apply_gradients(purrr::transpose(list(
    decoder_gradients, decoder$trainable_variables
  )))

  # update running means of the individual loss components
  train_loss(loss)
  train_mse(l_mse)
  train_fnn(l_fnn)
  train_kl(l_kl)
}

# wrap it all in autograph
training_loop_vae <- tf_function(autograph(function(ds_train) {
  for (batch in ds_train) {
    train_step_vae(batch)
  }

  tf$print("Loss: ", train_loss$result())
  tf$print("MSE: ", train_mse$result())
  tf$print("FNN loss: ", train_fnn$result())
  tf$print("KL loss: ", train_kl$result())

  train_loss$reset_states()
  train_mse$reset_states()
  train_fnn$reset_states()
  train_kl$reset_states()
}))
```
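As a reference point for the `l_kl` term, recall the standard closed-form KL divergence between a diagonal Gaussian and a standard normal prior. This is a textbook result, not something derived in this post, and the symbols $\mu_j$, $\sigma_j^2$ and $d$ (mean, variance and number of latent dimensions) are our notation:

$$
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, 1)\big) = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
$$

With `logvar = 0`, as in `reparameterize()` above, $\sigma_j^2 = 1$ and the expression reduces to $\frac{1}{2}\sum_j \mu_j^2$. The code appears to use a sampled, dimension-averaged stand-in for that quantity: since $z_j = \mu_j + \epsilon_j$ with $\epsilon_j \sim \mathcal{N}(0, 1)$, we have $\mathbb{E}[z_j^2] = \mu_j^2 + 1$, so $-\frac{1}{2}(1 - z_j^2)$ equals $\frac{1}{2}\mu_j^2$ in expectation.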

To finish up the model section, here is the actual training code. This is nearly identical to what we did for FNN-LSTM before.

```
# note: n_timesteps, batch_size, x_train, ds_train and ds_test are created
# in the data preparation code shown in the next section
n_latent <- 10L
n_features <- 1

encoder <- vae_encoder_model(n_timesteps,
                             n_features,
                             n_latent)
decoder <- vae_decoder_model(n_timesteps,
                             n_features,
                             n_latent)

mse_loss <-
  tf$keras$losses$MeanSquaredError(reduction = tf$keras$losses$Reduction$SUM)

train_loss <- tf$keras$metrics$Mean(name = 'train_loss')
train_fnn <- tf$keras$metrics$Mean(name = 'train_fnn')
train_mse <- tf$keras$metrics$Mean(name = 'train_mse')
train_kl <- tf$keras$metrics$Mean(name = 'train_kl')

fnn_multiplier <- 1 # default value used in nearly all cases (see text)
fnn_weight <- fnn_multiplier * nrow(x_train)/batch_size
kl_weight <- 1

# `lr` is named `learning_rate` in more recent keras releases
optimizer <- optimizer_adam(lr = 1e-3)

for (epoch in 1:100) {
  cat("Epoch: ", epoch, " -----------\n")
  training_loop_vae(ds_train)

  # per-dimension variance of the latent code on test data
  test_batch <- as_iterator(ds_test) %>% iter_next()
  encoded <- encoder(test_batch[[1]][1:1000])
  test_var <- tf$math$reduce_variance(encoded, axis = 0L)
  print(test_var %>% as.numeric() %>% round(5))
}
```
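The vector printed at the end of each epoch is the per-dimension variance of the latent code on test data; under FNN regularization we hope to see only a few entries clearly above zero. Here is a minimal base-R sketch of that summary, using a made-up code matrix in place of `encoder()` output. The matrix, the helper `latent_variances()` and the 0.01 threshold are illustrative assumptions, not values from the experiments.

```r
# Population variance per column, the same quantity that
# tf$math$reduce_variance(encoded, axis = 0L) computes
latent_variances <- function(codes) {
  apply(codes, 2, function(col) mean((col - mean(col))^2))
}

set.seed(42)
# Made-up stand-in for encoder output: 1000 samples, 10 latent dimensions,
# of which only the first two carry noticeable variance
codes <- cbind(
  rnorm(1000, sd = 1),
  rnorm(1000, sd = 0.3),
  matrix(rnorm(1000 * 8, sd = 0.001), ncol = 8)
)

v <- latent_variances(codes)
print(round(v, 5))

# Count dimensions above an arbitrary, illustrative threshold
n_active <- sum(v > 0.01)
print(n_active)
```

In practice, looking at the raw variances is usually more informative than any fixed cutoff; the count is only meant to show how the output could be read.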

## Experimental setup and data

The idea was to add white noise to a deterministic series. This time, the Roessler system was chosen, mainly for the prettiness of its attractor, apparent even in its two-dimensional projections.

Like we did for the Lorenz system in the first part of this series, we use `deSolve` to generate data from the Roessler equations.

```
library(deSolve)
library(tibble) # for as_tibble()

parameters <- c(a = .2,
                b = .2,
                c = 5.7)

initial_state <-
  c(x = 1,
    y = 1,
    z = 1.05)

# right-hand side of the Roessler equations
roessler <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dx <- -y - z
    dy <- x + a * y
    dz <- b + z * (x - c)
    list(c(dx, dy, dz))
  })
}

times <- seq(0, 2500, length.out = 20000)

roessler_ts <-
  ode(
    y = initial_state,
    times = times,
    func = roessler,
    parms = parameters,
    method = "lsoda"
  ) %>% unclass() %>% as_tibble()

# keep the first 10000 values of x and standardize them
n <- 10000
roessler <- roessler_ts$x[1:n]
roessler <- scale(roessler)
```

Then, noise is added, to the desired degree, by drawing from a normal distribution, centered at zero, with standard deviations varying between 1 and 2.5.

```
# add noise
noise <- 1 # also used 1.5, 2, 2.5
roessler <- roessler + rnorm(10000, mean = 0, sd = noise)
```
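Because `roessler` was standardized with `scale()` before noise is added, the signal has unit variance, so each noise standard deviation translates directly into a signal-to-noise ratio. The following is a small sketch of that arithmetic; the SNR framing is ours, as the post itself only reports standard deviations.

```r
noise_levels <- c(1, 1.5, 2, 2.5)

# Signal variance is 1 after scale(), so
# SNR = var(signal) / var(noise) = 1 / sd^2
snr <- 1 / noise_levels^2
snr_db <- 10 * log10(snr)

print(data.frame(noise_sd = noise_levels,
                 snr = round(snr, 3),
                 snr_db = round(snr_db, 2)))
```

Seen this way, even the lowest noise level puts as much variance into the noise as there is in the signal, and at the highest level the noise variance is more than six times the signal variance.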

Here you can compare the effects of not adding any noise (left), standard deviation-1 (middle), and standard deviation-2.5 (right) Gaussian noise:

Otherwise, preprocessing proceeds as in the previous posts. In the upcoming results section, we’ll compare forecasts not just to the “real” (that is, after noise addition) test split of the data, but also to the underlying Roessler system, which is the thing we’re really interested in. (Just that in the real world, we can’t do that check.) This second test set is prepared for forecasting just like the other one; to avoid duplication we don’t reproduce the code.

```
n_timesteps <- 120
batch_size <- 32

# build overlapping windows of length n_timesteps, one per start index;
# windows running past the end of x contain NAs and are dropped
gen_timesteps <- function(x, n_timesteps) {
  do.call(rbind,
          purrr::map(seq_along(x),
                     function(i) {
                       start <- i
                       end <- i + n_timesteps - 1
                       out <- x[start:end]
                       out
                     })
  ) %>%
    na.omit()
}

# windows span 2 * n_timesteps: first half is input, second half is target
train <- gen_timesteps(roessler[1:(n/2)], 2 * n_timesteps)
test <- gen_timesteps(roessler[(n/2):n], 2 * n_timesteps)

dim(train) <- c(dim(train), 1)
dim(test) <- c(dim(test), 1)

x_train <- train[ , 1:n_timesteps, , drop = FALSE]
y_train <- train[ , (n_timesteps + 1):(2*n_timesteps), , drop = FALSE]

ds_train <- tensor_slices_dataset(list(x_train, y_train)) %>%
  dataset_shuffle(nrow(x_train)) %>%
  dataset_batch(batch_size)

x_test <- test[ , 1:n_timesteps, , drop = FALSE]
y_test <- test[ , (n_timesteps + 1):(2*n_timesteps), , drop = FALSE]

ds_test <- tensor_slices_dataset(list(x_test, y_test)) %>%
  dataset_batch(nrow(x_test))
```
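To make the windowing concrete, here is what the same sliding-window logic yields on a toy vector. This is a base-R sketch that uses `lapply()` instead of `purrr::map()`; `make_windows()` is a hypothetical name, and the window of 4 (2 input steps, 2 target steps) stands in for the 240 (120 plus 120) used above.

```r
# Same idea as gen_timesteps(): one window per start index;
# windows that run past the end contain NAs and are dropped
make_windows <- function(x, window) {
  rows <- lapply(seq_along(x), function(i) x[i:(i + window - 1)])
  na.omit(do.call(rbind, rows))
}

x <- 1:6
w <- make_windows(x, 4)

inputs  <- w[, 1:2, drop = FALSE]  # first half of each window
targets <- w[, 3:4, drop = FALSE]  # second half: what we forecast

print(inputs)
print(targets)
```

Six observations give three complete windows, so consecutive training examples overlap heavily: each one is the previous one shifted by a single step.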

## Results

The LSTM used for comparison with the VAE described above is identical to the architecture employed in the previous post. While with the VAE, an `fnn_multiplier` of 1 yielded sufficient regularization for all noise levels, some more experimentation was needed for the LSTM: At noise levels 2 and 2.5, that multiplier was set to 5.

As a result, in all cases, there was one latent variable with high variance and a second one of minor importance. For all others, variance was close to 0.

*In all cases* here means: In all cases where FNN regularization was used. As already hinted at in the introduction, the main regularizing factor providing robustness to noise here seems to be FNN loss, not KL divergence. So for all noise levels, besides FNN-regularized LSTM and VAE models we also tested their non-constrained counterparts.

#### Low noise

Seeing how all models did superbly on the original deterministic series, a noise level of 1 can almost be treated as a baseline. Here you see sixteen 120-timestep predictions from both regularized models, FNN-VAE (dark blue), and FNN-LSTM (orange). The noisy test data, both input (`x`, 120 steps) and output (`y`, 120 steps), are displayed in (blue-ish) grey. In green, also spanning the whole sequence, we have the original Roessler data, the way they would look had no noise been added.

Despite the noise, forecasts from both models look excellent. Is this due to the FNN regularizer?

Looking at forecasts from their unregularized counterparts, we have to admit these do not look any worse. (For better comparability, the sixteen sequences to forecast were initially picked at random, but used to compare all models and conditions.)

What happens when we start to add more noise?

#### Substantial noise

Between noise levels 1.5 and 2, something changed, or became noticeable from visual inspection. Let’s jump directly to the highest-used level though: 2.5.

Here first are predictions obtained from the unregularized models.

Both LSTM and VAE get “distracted” a bit too much by the noise, the latter to an even higher degree. This leads to cases where predictions strongly “overshoot” the underlying non-noisy rhythm. This is not surprising, of course: They were *trained* on the noisy version; predicting fluctuations is what they learned.

Do we see the same with the FNN models?

Interestingly, we see a much better fit to the underlying Roessler system now! Especially the VAE model, FNN-VAE, surprises with a whole new smoothness of predictions; but FNN-LSTM turns up much smoother forecasts as well.

“Smooth, fitting the system…” By now you may be wondering: when are we going to come up with more quantitative assertions? If quantitative implies “mean squared error” (MSE), and if MSE is taken to be some divergence between forecasts and the true target from the test set, the answer is that this MSE doesn’t differ much between any of the four architectures. Put differently, it is mostly a function of noise level.

However, we could argue that what we’re really interested in is how well a model forecasts the underlying process. And there, we see differences.

In the following plot, we contrast MSEs obtained for the four model types (grey: VAE; orange: LSTM; dark blue: FNN-VAE; green: FNN-LSTM). The rows reflect noise levels (1, 1.5, 2, 2.5); the columns represent MSE in relation to the noisy (“real”) target (left) on the one hand, and in relation to the underlying system on the other (right). For better visibility of the effect, *MSEs have been normalized as fractions of the maximum MSE in a category*.

So, if we want to predict *signal plus noise* (left), it is not extremely critical whether we use FNN or not. But if we want to predict the signal only (right), with increasing noise in the data FNN loss becomes increasingly effective. This effect is far stronger for VAE vs. FNN-VAE than for LSTM vs. FNN-LSTM: The distance between the grey line (VAE) and the dark blue one (FNN-VAE) becomes larger and larger as we add more noise.
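The two evaluation views and the normalization can be sketched as follows. This is a base-R illustration with made-up toy vectors, not results from the experiments; `mse()` and `normalize_by_max()` are hypothetical helper names, and a sine wave stands in for the Roessler series.

```r
mse <- function(truth, pred) mean((truth - pred)^2)
normalize_by_max <- function(v) v / max(v)

set.seed(7)
clean <- sin(seq(0, 4 * pi, length.out = 120))  # stand-in for the underlying system
noisy <- clean + rnorm(120, sd = 1)             # stand-in for the noisy "real" target

# Two toy "forecasts": one smooth, one that chases noise
preds <- list(
  smooth = clean * 0.9,
  jumpy  = clean + rnorm(120, sd = 0.8)
)

# MSE against the noisy target and against the underlying signal
mse_vs_noisy <- sapply(preds, mse, truth = noisy)
mse_vs_clean <- sapply(preds, mse, truth = clean)

# Express each MSE as a fraction of the largest one in its group
print(round(normalize_by_max(mse_vs_noisy), 3))
print(round(normalize_by_max(mse_vs_clean), 3))
```

Here each group of MSEs is divided by its own maximum, which is one plausible reading of “normalized as fractions of the maximum MSE in a category”; the post does not spell out the exact grouping.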

## Summing up

Our experiments show that when noise is likely to obscure measurements from an underlying deterministic system, FNN regularization can strongly improve forecasts. This is the case especially for convolutional VAEs, and probably convolutional autoencoders in general. And if an FNN-constrained VAE performs as well, for time series prediction, as an LSTM, there is a strong incentive to use the convolutional model: It trains significantly faster.

With that, we conclude our mini-series on FNN-regularized models. As always, we’d love to hear from you if you were able to make use of this in your own work!

Thanks for reading!