Posit AI Blog: torch time series, take three: Sequence-to-sequence prediction
Today, we continue our exploration of multi-step time-series forecasting with torch. This post is the third in a series.

Initially, we covered the basics of recurrent neural networks (RNNs), and trained a model to predict the very next value in a sequence. We also found we could forecast quite a few steps ahead by feeding back individual predictions in a loop.

Next, we built a model “natively” for multi-step prediction. A small multi-layer perceptron (MLP) was used to project RNN output to several time points in the future.
Of the two approaches, the latter was the more successful. But conceptually, it has an unsatisfying touch to it: When the MLP extrapolates and generates output for, say, ten consecutive points in time, there is no causal relation between those. (Imagine a weather forecast for ten days that never got updated.)
Now, we’d like to try something more intuitively appealing. The input is a sequence; the output is a sequence. In natural language processing (NLP), this type of task is very common: It’s exactly the kind of situation we see with machine translation or summarization.
Quite fittingly, the types of models employed to these ends are named sequence-to-sequence models (often abbreviated seq2seq). In a nutshell, they split up the task into two components: an encoding and a decoding part. The former is done just once per input-target pair. The latter is done in a loop, as in our first try. But the decoder has more information at its disposal: At each iteration, its processing is based on the previous prediction as well as the previous state. That previous state will be the encoder’s when a loop is started, and its own ever thereafter.
Before discussing the model in detail, we need to adapt our data input mechanism.
We continue working with vic_elec, provided by tsibbledata.
Again, the dataset definition in the current post looks a bit different from the way it did before; it’s the shape of the target that differs. This time, y equals x, shifted to the left by one.
The reason we do this is owed to the way we are going to train the network. With seq2seq, people often use a technique called “teacher forcing” where, instead of feeding back its own prediction into the decoder module, you pass it the value it should have predicted. To be clear, this is done during training only, and to a configurable degree.
library(torch)
library(tidyverse)
library(tsibble)
library(tsibbledata)
library(lubridate)
library(fable)
library(zeallot)

# one week of half-hourly measurements as input; forecast the same number of steps
n_timesteps <- 7 * 24 * 2
n_forecast <- n_timesteps

# extract demand for a given year (and, optionally, a single month)
vic_elec_get_year <- function(year, month = NULL) {
  vic_elec %>%
    filter(year(Date) == year, month(Date) == if (is.null(month)) month(Date) else month) %>%
    as_tibble() %>%
    select(Demand)
}

elec_train <- vic_elec_get_year(2012) %>% as.matrix()
elec_valid <- vic_elec_get_year(2013) %>% as.matrix()
elec_test <- vic_elec_get_year(2014, 1) %>% as.matrix()

# standardization statistics are computed on the training set only
train_mean <- mean(elec_train)
train_sd <- sd(elec_train)

elec_dataset <- dataset(
  name = "elec_dataset",

  initialize = function(x, n_timesteps, sample_frac = 1) {
    self$n_timesteps <- n_timesteps
    self$x <- torch_tensor((x - train_mean) / train_sd)

    # number of valid starting positions
    n <- length(self$x) - self$n_timesteps - 1

    self$starts <- sort(sample.int(
      n = n,
      size = n * sample_frac
    ))
  },

  .getitem = function(i) {
    start <- self$starts[i]
    end <- start + self$n_timesteps - 1
    lag <- 1

    # the target is the input sequence, shifted to the left by one
    list(
      x = self$x[start:end],
      y = self$x[(start+lag):(end+lag)]$squeeze(2)
    )
  },

  .length = function() {
    length(self$starts)
  }
)
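To make the shifted target concrete, here is a minimal sketch on a toy vector. The values and the toy_ names are made up for illustration; only the indexing mirrors .getitem() above.

```r
# Toy stand-in for the standardized demand series (hypothetical values).
toy_series <- c(10, 20, 30, 40, 50, 60)
toy_timesteps <- 4

start <- 1
end <- start + toy_timesteps - 1
lag <- 1

toy_x <- toy_series[start:end]                  # input window
toy_y <- toy_series[(start + lag):(end + lag)]  # target window, one step later

# Apart from its last element, the target repeats the input, moved left by one.
overlap_matches <- identical(toy_y[1:3], toy_x[2:4])
print(toy_x)
print(toy_y)
print(overlap_matches)
```

Only the last element of the target is genuinely new to the network; all others also appear in the input, one position later.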
Dataset as well as dataloader instantiations can then proceed as before.
batch_size <- 32

train_ds <- elec_dataset(elec_train, n_timesteps, sample_frac = 0.5)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)

valid_ds <- elec_dataset(elec_valid, n_timesteps, sample_frac = 0.5)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)

test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)

Technically, the model consists of three modules: the aforementioned encoder and decoder, and the seq2seq module that orchestrates them.
Encoder
The encoder takes its input and runs it through an RNN. Of the two things returned by a recurrent neural network, outputs and state, so far we’ve only been using output. This time, we do the opposite: We throw away the outputs, and only return the state.
If the RNN in question is a GRU (and assuming that of the outputs, we take just the final time step, which is what we’ve been doing throughout), there really is no difference: The final state equals the final output. If it’s an LSTM, however, there is a second kind of state, the “cell state”. In that case, returning the state instead of the final output will carry more information.
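The GRU claim can be checked directly. The following is a small sketch, not part of the model code: batch size, sequence length, hidden size and the seed are arbitrary choices made for illustration. It assumes a single-layer nn_gru() with batch_first = TRUE, the same configuration the encoder uses by default.

```r
library(torch)

torch_manual_seed(777)  # arbitrary seed, for reproducibility only

# toy dimensions, chosen for illustration
toy_batch_size <- 4
toy_seq_len <- 10
toy_hidden_size <- 8

gru <- nn_gru(input_size = 1, hidden_size = toy_hidden_size, batch_first = TRUE)
toy_input <- torch_randn(toy_batch_size, toy_seq_len, 1)

res <- gru(toy_input)
output <- res[[1]]  # (batch_size, seq_len, hidden_size)
state <- res[[2]]   # (num_layers, batch_size, hidden_size)

final_output <- output[ , toy_seq_len, ]  # output at the last time step
final_state <- state[1, , ]               # state of the single layer

# for a single-layer GRU, both hold the same values
same <- torch_allclose(final_output, final_state)
print(same)
```

With more than one layer, the comparison would have to use the last layer’s state, since the outputs are those of the topmost layer.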
encoder_module <- nn_module(

  initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {
    self$type <- type

    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }
  },

  forward = function(x) {
    x <- self$rnn(x)

    # return last states for all layers
    # per layer, a single tensor for GRU, a list of 2 tensors for LSTM
    x <- x[[2]]
    x
  }
)
Decoder
In the decoder, just like in the encoder, the main component is an RNN. In contrast to previously-shown architectures, though, it does not just return a prediction. It also reports back the RNN’s final state.
decoder_module <- nn_module(

  initialize = function(type, input_size, hidden_size, num_layers = 1) {
    self$type <- type

    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    }

    # maps the RNN output for a single time step to a single prediction
    self$linear <- nn_linear(hidden_size, 1)
  },

  forward = function(x, state) {
    # input to forward:
    # x is (batch_size, 1, 1)
    # state is (1, batch_size, hidden_size)
    x <- self$rnn(x, state)

    # break up RNN return values
    # output is (batch_size, 1, hidden_size)
    # next_hidden is (1, batch_size, hidden_size) for a single-layer GRU
    c(output, next_hidden) %<-% x

    output <- output$squeeze(2)
    output <- self$linear(output)

    list(output, next_hidden)
  }
)
seq2seq module
seq2seq is where the action happens. The plan is to encode once, then call the decoder in a loop.
If you look back to decoder forward(), you see that it takes two arguments: x and state.
Depending on the context, x corresponds to one of three things: final input, preceding prediction, or prior ground truth.

The very first time the decoder is called on an input sequence, x maps to the final input value. This is different from a task like machine translation, where you would pass in a start token. With time series, though, we’d like to continue where the actual measurements stop.

In further calls, we want the decoder to continue from its most up-to-date prediction. It is only logical, thus, to pass back the preceding forecast.

That said, in NLP a technique called “teacher forcing” is commonly used to speed up training. With teacher forcing, instead of the forecast we pass the actual ground truth, the thing the decoder should have predicted. We do that only in a configurable fraction of cases and, naturally, only while training. The rationale behind this technique is that without this form of re-calibration, consecutive prediction errors can quickly erase any remaining signal.

state, too, is polyvalent. But here, there are just two possibilities: encoder state and decoder state.

The first time the decoder is called, it is “seeded” with the final state from the encoder. Note how this is the only time we make use of the encoding.

From then on, the decoder’s own previous state will be passed. Remember how it returns two values, forecast and state?
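Before looking at the torch implementation, the flow of x and state can be sketched in plain R. Everything here is a stand-in made up for illustration: toy_decoder() is a hypothetical function that simply averages its input and its state, and toy_decode() is not the module’s actual forward method. Only the control flow, one priming call followed by a loop that picks either ground truth or the previous prediction, follows the description above.

```r
# Hypothetical stand-in for the decoder: returns a "prediction" and a new "state".
toy_decoder <- function(x, state) {
  new_state <- 0.5 * state + 0.5 * x
  list(pred = new_state, state = new_state)
}

toy_decode <- function(last_input, encoder_state, y, n_forecast, teacher_forcing_ratio) {
  outputs <- numeric(n_forecast)

  # first call: final input value plus encoder state
  out <- toy_decoder(last_input, encoder_state)
  outputs[1] <- out$pred

  # further calls: ground truth or previous prediction, plus the decoder's own state
  for (t in seq_len(n_forecast)[-1]) {
    use_truth <- runif(1) < teacher_forcing_ratio
    x <- if (use_truth) y[t - 1] else out$pred
    out <- toy_decoder(x, out$state)
    outputs[t] <- out$pred
  }
  outputs
}

toy_target <- c(1, 2, 3, 4)
free_running <- toy_decode(0, 0, toy_target, n_forecast = 4, teacher_forcing_ratio = 0)
fully_forced <- toy_decode(0, 0, toy_target, n_forecast = 4, teacher_forcing_ratio = 1)
print(free_running)
print(fully_forced)
```

With a ratio of 0 the toy decoder only ever sees its own output; with a ratio of 1 every step after the first is fed the target value from the step before.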
seq2seq_module <- nn_module(

  initialize = function(type, input_size, hidden_size, n_forecast, num_layers = 1, encoder_dropout = 0) {
    self$encoder <- encoder_module(type = type, input_size = input_size,
                                   hidden_size = hidden_size, num_layers, encoder_dropout)
    self$decoder <- decoder_module(type = type, input_size = input_size,
                                   hidden_size = hidden_size, num_layers)
    self$n_forecast <- n_forecast
  },

  forward = function(x, y, teacher_forcing_ratio) {
    # prepare empty output
    outputs <- torch_zeros(dim(x)[1], self$n_forecast)$to(device = device)

    # encode current input sequence
    hidden <- self$encoder(x)

    # prime decoder with final input value and hidden state from the encoder
    out <- self$decoder(x[ , n_timesteps, , drop = FALSE], hidden)

    # decompose into predictions and decoder state
    # pred is (batch_size, 1)
    # state is (1, batch_size, hidden_size)
    c(pred, state) %<-% out

    # store first prediction
    outputs[ , 1] <- pred$squeeze(2)

    # iterate to generate remaining forecasts
    for (t in 2:self$n_forecast) {
      # call decoder on either ground truth or previous prediction, plus previous decoder state
      teacher_forcing <- runif(1) < teacher_forcing_ratio
      input <- if (teacher_forcing == TRUE) y[ , t - 1, drop = FALSE] else pred
      input <- input$unsqueeze(3)
      out <- self$decoder(input, state)

      # again, decompose decoder return values
      c(pred, state) %<-% out
      # and store current prediction
      outputs[ , t] <- pred$squeeze(2)
    }
    outputs
  }
)

net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, n_forecast = n_forecast)

# training RNNs on the GPU currently prints a warning that may clutter
# the console
# see https://github.com/mlverse/torch/issues/461
# alternatively, use
# device <- "cpu"
device <- torch_device(if (cuda_is_available()) "cuda" else "cpu")
net <- net$to(device = device)
The training procedure is mainly unchanged. We do, however, need to decide about teacher_forcing_ratio, the proportion of decoding steps at which we feed the ground truth instead of the previous prediction. In valid_batch(), this should always be 0, while in train_batch(), it’s up to us (or rather, experimentation). Here, we set it to 0.3.
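What a given ratio means in practice can be checked with a quick simulation. This is a sketch under assumed settings (the seed and the number of draws are arbitrary); it repeats the same comparison the forward pass makes at each decoding step, runif(1) < teacher_forcing_ratio, many times over.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

ratio <- 0.3
n_draws <- 10000

# one draw per simulated decoding step: TRUE means "feed the ground truth"
use_ground_truth <- runif(n_draws) < ratio
observed_fraction <- mean(use_ground_truth)

# with a ratio of 0, as used for validation and testing, no step is ever forced
never_forced <- !any(runif(n_draws) < 0)

print(observed_fraction)
print(never_forced)
```

The observed fraction lands close to the chosen ratio, and a ratio of 0 switches teacher forcing off entirely.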
optimizer <- optim_adam(net$parameters, lr = 0.001)

num_epochs <- 50

train_batch <- function(b, teacher_forcing_ratio) {
  optimizer$zero_grad()
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)

  loss <- nnf_mse_loss(output, target)
  loss$backward()
  optimizer$step()

  loss$item()
}

# no teacher forcing during validation
valid_batch <- function(b, teacher_forcing_ratio = 0) {
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)

  loss <- nnf_mse_loss(output, target)
  loss$item()
}

for (epoch in 1:num_epochs) {
  net$train()
  train_loss <- c()

  coro::loop(for (b in train_dl) {
    loss <- train_batch(b, teacher_forcing_ratio = 0.3)
    train_loss <- c(train_loss, loss)
  })

  cat(sprintf("\nEpoch %d, training: loss: %3.5f \n", epoch, mean(train_loss)))

  net$eval()
  valid_loss <- c()

  coro::loop(for (b in valid_dl) {
    loss <- valid_batch(b)
    valid_loss <- c(valid_loss, loss)
  })

  cat(sprintf("\nEpoch %d, validation: loss: %3.5f \n", epoch, mean(valid_loss)))
}

Epoch 1, training: loss: 0.37961
Epoch 1, validation: loss: 1.10699
Epoch 2, training: loss: 0.19355
Epoch 2, validation: loss: 1.26462
# ...
# ...
Epoch 49, training: loss: 0.03233
Epoch 49, validation: loss: 0.62286
Epoch 50, training: loss: 0.03091
Epoch 50, validation: loss: 0.54457

It’s interesting to compare performances for different settings of teacher_forcing_ratio. With a setting of 0.5, training loss decreases a lot more slowly; the opposite is seen with a setting of 0. Validation loss, however, is not affected significantly.
The code to inspect test-set forecasts is unchanged.
net$eval()

test_preds <- vector(mode = "list", length = length(test_dl))

i <- 1

coro::loop(for (b in test_dl) {
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio = 0)
  preds <- as.numeric(output)

  test_preds[[i]] <- preds
  i <<- i + 1
})

vic_elec_jan_2014 <- vic_elec %>%
  filter(year(Date) == 2014, month(Date) == 1)

# pad each forecast with NAs so it has the same length as the January series
test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_jan_2014) - n_timesteps - n_forecast))

test_pred2 <- test_preds[[408]]
test_pred2 <- c(rep(NA, n_timesteps + 407), test_pred2, rep(NA, nrow(vic_elec_jan_2014) - 407 - n_timesteps - n_forecast))

test_pred3 <- test_preds[[817]]
test_pred3 <- c(rep(NA, nrow(vic_elec_jan_2014) - n_forecast), test_pred3)

# undo standardization before plotting
preds_ts <- vic_elec_jan_2014 %>%
  select(Demand) %>%
  add_column(
    mlp_ex_1 = test_pred1 * train_sd + train_mean,
    mlp_ex_2 = test_pred2 * train_sd + train_mean,
    mlp_ex_3 = test_pred3 * train_sd + train_mean) %>%
  pivot_longer(-Time) %>%
  update_tsibble(key = name)

preds_ts %>%
  autoplot() +
  scale_colour_manual(values = c("#08c5d1", "#00353f", "#ffbf66", "#d46f4d")) +
  theme_minimal()
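The multiplication by train_sd and addition of train_mean in the plotting code undo the standardization applied in the dataset. As a minimal sketch of that round trip, using made-up demand values and toy_ names that are not part of the original code:

```r
# Hypothetical demand values standing in for the training data.
toy_train <- c(4000, 4500, 5000, 5500, 6000)

toy_mean <- mean(toy_train)
toy_sd <- sd(toy_train)

# what the dataset does on the way in
toy_scaled <- (toy_train - toy_mean) / toy_sd

# what the plotting code does on the way out
toy_restored <- toy_scaled * toy_sd + toy_mean

round_trip_ok <- isTRUE(all.equal(toy_restored, toy_train))
print(round_trip_ok)
```

Because the network is trained on standardized values, its forecasts live on that scale too, which is why the same two statistics are needed to bring them back to the original units.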
Comparing this to the forecast obtained from last time’s RNN-MLP combo, we don’t see much of a difference. Is this surprising? To me it is. If asked to speculate about the reason, I would probably say this: In all of the architectures we’ve used so far, the main carrier of information has been the final hidden state of the RNN (one and only RNN in the two previous setups, encoder RNN in this one). It will be interesting to see what happens in the last part of this series, when we augment the encoder-decoder architecture by attention.
Thanks for reading!
Photo by Suzuha Kozuki on Unsplash