Flexible time series forecasting using machine learning

Sherri Hadian
Jan 5, 2022

If you have researched time series forecasting for a project, you might share my opinion that the content on the web can be confusing and incomplete. You find lots of repeated content which rarely answers your basic questions.

That was my motivation for this article and the linked repository on GitHub: to address questions I had while researching general-purpose forecasting, to create easy-to-use modules for multivariate/multi-step forecasting, and to rank different methodologies on open datasets.

The following time-series terms are used repeatedly in this article. To be clear about what is meant by them, refer to the glossary below.

Glossary:

ARIMA and VAR: the Autoregressive Integrated Moving Average family of models, historically used by statisticians to forecast univariate time series. Vector Autoregression (VAR), or VARIMA, is a generalisation of ARIMA to multivariate series.

Autocorrelation: Correlation of a time series with a lagged version of itself.

Random walk series: A time series, often seen in the financial sector, in which each event depends only on its immediate past event plus a random step.

Single-step vs. multi-step prediction: Predicting one time-step into the future vs. predicting multiple time-steps into the future.

Stationarity: A time series is stationary if its statistical properties such as mean and variance are constant over time.

Univariate vs. Multivariate: A univariate series tracks a single variable over time; a multivariate series tracks multiple variables over time.

Introduction

Traditional statistical methods vs. Machine Learning (ML) methods:

This study [1] is often referenced to show that the traditional ARIMA family of models can produce better results than ML methods under certain conditions. The traditional statistical models are often applied to univariate time series to model autocorrelations linearly. The referenced article is also a univariate investigation.

There are, however, limitations to these traditional forecasting schemes. First: as mentioned, your model is linear in past events. Second: you need to rigorously investigate the stationarity, the order of lags, the fitted model and the residuals to ensure all is right. Third: by design your model predicts one event into the future, and if you wish to produce more, the models usually do so recursively and therefore lose accuracy over time.

When you have a multivariate series, the statistical scheme to use is VAR (or VARIMA), which has all the above restrictions.

If you have a large dataset, you can explore ML models that can handle multivariate series, can inherently predict multiple steps into the future and can take into account possible nonlinear autocorrelations as well.

The present work offers modules that facilitate the investigation of several families of neural networks and classic ML models on multivariate datasets for the general case of multi-step forecasting.

Framing the problem as a supervised learning problem:

In general machine learning problems, the separation between dependent and independent variables in the data is clear. The same is not true when facing time series data as you first need to frame the data as a supervised learning problem. If you search online for this, you will come across two main methods to formulate the problem:

1- There is a lazy cut-off approach seen on Kaggle or here in which the ordered univariate data is split at a certain point to create train and test sets. Point predictions are then made for each time series event as the target variable, given the date-time categorical features as independent variables. In some cases you see that lagged target variables have also been used.

The problem starts when you want to make multi-step forecasts into the future! Either you have purely date-time dummy variables, or you can only use one particular lag variable as far back as your forecast horizon. This is a very limiting strategy for building a supervised learning dataset from a time series.

2- The second method is a flexible sliding window approach. You walk through your data and create input and output variables of varying length, depending on how you want to solve the problem. This blog article has shown how a two-dimensional dataset can be created with the sliding window approach for multivariate time series to forecast multiple time-steps into the future.

A more general and elegant approach is shown in this Tensorflow blog article here. They introduce a “window generator” class for time series forecasting. The class allows varying input, output and offset sizes to produce a three dimensional Tensorflow dataset that includes past times (dimension 2) in multiple series (dimension 3).

The present article offers a Numpy generalisation of the window generator class from the Tensorflow blog that allows more flexibility in the choice of input and output variables.

schematic application of sliding window to create the dataset
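
To make the idea concrete, here is a minimal Numpy sketch of such a sliding window. It is only an illustration of the concept, not the repository's actual generator class, and it assumes each label window starts immediately after its input window:

```python
import numpy as np

def sliding_windows(series, input_width, label_width, stride=1):
    """Split a (time, features) array into 3D input/label windows.

    Returns X of shape (samples, input_width, n_features) and
    y of shape (samples, label_width, n_features), where each label
    window starts right after its input window.
    """
    series = np.asarray(series)
    total = input_width + label_width
    n_samples = (len(series) - total) // stride + 1
    X = np.stack([series[i * stride: i * stride + input_width]
                  for i in range(n_samples)])
    y = np.stack([series[i * stride + input_width: i * stride + total]
                  for i in range(n_samples)])
    return X, y

# toy example: 100 time steps of 3 series, 24-step inputs, 15-step labels
data = np.random.rand(100, 3)
X, y = sliding_windows(data, input_width=24, label_width=15)
print(X.shape, y.shape)  # (62, 24, 3) (62, 15, 3)
```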

Multiple forecasts into the future:

Time series statistical models often forecast one event in the future. To forecast multiple time-steps in the future, they recursively use the prediction as a past event and make the next prediction. Clearly the more time steps in the future, the larger the accumulated error.

Other than the recursive strategy, there are other strategies that can be applied to single-step models to forecast multiple steps ahead. A direct strategy is one in which independent models are trained on the same data to predict each future event separately [2][3]. You can also look here for a description of the recursive and direct strategies and other methods in between the two formulations.

Direct and recursive strategies can be applied to all models to produce multi-step forecasts. But there are machine learning (ML) models that can inherently produce vector outputs which in some cases are directly associated with the concept of time (Recurrent Neural Nets) and in other cases can indirectly produce multi-step forecasts.

In the present work I have used the three methodologies of built-in vector output, recursive and direct strategies to create multi-step forecasts and have compared the results.
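
As a rough sketch of the recursive strategy (assuming a single-step model whose predicted features match its input features, so each prediction can be appended to the window), a multi-step forecast could be produced like this:

```python
import numpy as np

def recursive_forecast(model, window, n_steps):
    """Produce an n_steps forecast by feeding predictions back as inputs.

    window: (input_width, n_features) array of the latest observations.
    model:  any single-step model whose predict() maps a batch of windows
            of shape (1, input_width, n_features) to (1, n_features).
    """
    window = np.array(window, copy=True)
    predictions = []
    for _ in range(n_steps):
        next_step = model.predict(window[np.newaxis, ...])[0]  # (n_features,)
        predictions.append(next_step)
        # slide the window: drop the oldest row, append the new prediction
        window = np.vstack([window[1:], next_step])
    return np.stack(predictions)  # (n_steps, n_features)
```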

(1) schematic window for a recursive forecast and (2) schematic window for a direct forecast

Methods and models:

In the linked repository, the window generator will be used in the design of flexible input/output size models from classic machine learning models to different families of neural networks. The models employed take multivariate series as input and can produce multi-step forecasts for each/all series.

The models (described below) in the neural networks (nn) module can produce single or multi-step outputs. The single output mode of the models can be used in the recursive or direct classes to produce multi-step outputs.

The recursive prediction class inherits from the data generator class, whereas the direct prediction class instantiates multiple window generators, one for each output time-step. It must be mentioned that one limitation of the recursive strategy is that the input feature columns and output label columns must be the same; if this condition is not met, a ValueError is raised.

All neural net models take the data window, generated by the sliding window generator, as an argument. Window attributes such as input width, label width and the number of input and label features are then used for creating size-flexible layers. The following are the models in the nn module:

  • naive_model: is the base model in this group. The performance of the other models will be compared with the naive model, and if they cannot outperform it, the time series may be a random walk series.
  • dense_model: A multi-layer perceptron is the simplest family of neural networks that can be used for time series forecasting. The fully connected layers do not inherently contain any concept of time, but they can easily learn non-linear autocorrelations. The flatten layer at the beginning allows for any input size. A dense layer with units equal to the window's number of label features multiplied by the label width, followed by a reshape layer at the end, ensures the output has the label width and number of features of interest (a minimal sketch follows this list).
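
A minimal Keras sketch of this flatten/dense/reshape pattern (illustrative parameter names, not the repository's exact code):

```python
import tensorflow as tf

def dense_model(input_width, n_features, label_width, n_label_features, units=64):
    """Flatten the input window, learn a non-linear mapping, reshape to the labels."""
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(input_width, n_features)),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(label_width * n_label_features),
        tf.keras.layers.Reshape((label_width, n_label_features)),
    ])

model = dense_model(input_width=24, n_features=3, label_width=15, n_label_features=3)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
```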

Convolutional networks: 1D convolutional layers can learn sequences by convolving multiple filters on them. The output size of convolutional layers depends on the input and the kernel size.

  • conv_model: is the simplest architecture, with the lowest number of learnable parameters and a single 1D conv layer. We can add one or two dense layers after the conv layer. Depending on the kernel size and the input width, you can predict distinct output sizes. The code will raise a ValueError if the input and output sizes do not match, with a hint on how to change the input width (see the sketch after this list).
  • conv_flex_model: is a more flexible design at the cost of more trainable parameters. We can have a flatten layer after the convolutional layer and add the dense and reshape layers, similar to the multi-layer perceptron model, to predict any output size flexibly.
  • conv_model_recursive_training: This model is a bit more involved. It applies the idea of recursive forecasting not at prediction time, but during training. Shared layers have been defined using the tensorflow.keras functional API. At each time-step a one-step prediction is made, the prediction is concatenated with the training data, and one time step is dropped from the beginning of the training examples to keep the shape of the data intact. Because the layers are shared, the overall number of trainable parameters is low compared with the previous convolutional model.
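
As an illustration of the simpler conv_model idea and the size constraint mentioned above (a sketch, not the repository code): without padding, a Conv1D layer outputs input_width - kernel_size + 1 time steps, so the input width has to be chosen to match the desired label width.

```python
import tensorflow as tf

def conv_model(input_width, n_features, label_width, n_label_features,
               filters=32, kernel_size=3):
    """One 1D conv layer followed by a dense head applied to each output step."""
    # without padding, the conv output has input_width - kernel_size + 1 steps
    if input_width - kernel_size + 1 != label_width:
        raise ValueError(
            f"With kernel_size={kernel_size}, set input_width to "
            f"{label_width + kernel_size - 1} to get {label_width} output steps.")
    return tf.keras.Sequential([
        tf.keras.layers.Conv1D(filters, kernel_size, activation="relu",
                               input_shape=(input_width, n_features)),
        tf.keras.layers.Dense(n_label_features),
    ])

# e.g. a 15-step forecast needs a 17-step input when kernel_size=3
model = conv_model(input_width=17, n_features=3, label_width=15, n_label_features=3)
```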

LSTM networks: The networks introduced so far did not have an inherent understanding of the time dimension. However, Recurrent Neural Networks (RNNs) and the special group of Long Short-Term Memory networks (LSTMs) contain a time dimension inherently and are flexible in input and output size. Here you can read more about LSTMs and here you can see how varied their architectures can be.

  • LSTM_model: is the simplest LSTM design, with one LSTM layer that only returns the output from the last cell and returns no sequences. The output will be a two-dimensional (samples, units) tensor. We can then use a dense layer with units equal to the number of label features multiplied by the label width, and a reshape layer as before, to produce the output size of interest.
  • LSTM_lambda_layer: is a “sequence to sequence” design. If you have an LSTM layer that returns all sequences (not just the last cell), you can use a time-distributed dense layer to apply the same dense layer to all time-steps (in Tensorflow 2.0 a simple dense layer after an LSTM with 3D output is time-distributed by default). The output then goes through a custom lambda layer to be reshaped to the output size of our labels.
  • LSTM_model_encoder_decoder: is another “sequence to sequence” network where we use two LSTM layers: an encoder with a “many to one” design and a decoder with a “one to many” design. The former is followed by a RepeatVector layer of size label width, and the latter by a dense layer with units equal to the number of label features (a minimal sketch appears after the last model below).

And finally:

  • LSTM_model_recursive_train: is similar to conv_model_recursive_training. Shared layers have been defined to train the LSTM model recursively for all output time-steps. The shared LSTM layer in this case returns its state as well as its output, meaning the LSTM cells use the state of previous cells and are trained autoregressively.
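
A minimal Keras sketch of the encoder-decoder design described above (illustrative only, not the repository's exact code):

```python
import tensorflow as tf

def lstm_encoder_decoder(input_width, n_features, label_width, n_label_features,
                         units=32):
    """Many-to-one encoder, RepeatVector bridge, one-to-many decoder."""
    return tf.keras.Sequential([
        # encoder: summarise the whole input window into one vector
        tf.keras.layers.LSTM(units, input_shape=(input_width, n_features)),
        # repeat that vector once per output time step
        tf.keras.layers.RepeatVector(label_width),
        # decoder: produce a sequence of label_width hidden states
        tf.keras.layers.LSTM(units, return_sequences=True),
        # map each decoder step to the label features (time-distributed by default)
        tf.keras.layers.Dense(n_label_features),
    ])

model = lstm_encoder_decoder(input_width=24, n_features=3,
                             label_width=15, n_label_features=3)
```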

Classic ML models:

One advantage of implementing the window generator in Numpy rather than Tensorflow is that it can also be used to train classic machine learning (ML) models. Some ML models are capable of producing two-dimensional vector outputs. We can apply an in_reshape function to the 3D data from the window generator to use the data as input for ML models, followed by an out_reshape function on the output to convert it back to 3D.
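
A rough sketch of what this reshaping can look like; the in_reshape/out_reshape names come from the repository, but the implementations below are my own assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def in_reshape(arr):
    """(samples, time, features) -> (samples, time * features)."""
    return arr.reshape(arr.shape[0], -1)

def out_reshape(arr, label_width, n_label_features):
    """(samples, label_width * features) -> (samples, label_width, features)."""
    return arr.reshape(-1, label_width, n_label_features)

# X_train/y_train are 3D arrays from the sliding window (toy data here)
X_train = np.random.rand(62, 24, 3)
y_train = np.random.rand(62, 15, 3)

reg = LinearRegression()
reg.fit(in_reshape(X_train), in_reshape(y_train))              # 2D in, 2D out
y_pred = out_reshape(reg.predict(in_reshape(X_train)), 15, 3)  # back to 3D
print(y_pred.shape)  # (62, 15, 3)
```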

The models_classic_ml (cm) module of the repository contains these reshape functions plus slightly modified direct and recursive classes from the nn module, such that they can be used with classic ML models.

The goal is to compare the performance of all models in vector output predictions as well as recursive and direct predictions. Therefore the nn and cm modules have been designed to have similar APIs that facilitate this comparison and plotting.

The usage of the modules is demonstrated in the Jupyter analytics notebook inside the notebooks folder of the repository.

Results and discussion:

As shown in the notebooks, a good practice before choosing the window parameters is to have a look at the autocorrelation (acf/pacf) plots of the series. You can make a more informed decision on the window input size as you can see which lags stand out and check the seasonality of the data. You can read here [4] and see here how to interpret these plots.
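
For example, with statsmodels the plots can be produced along these lines (the file path and column name are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

df = pd.read_csv("weather.csv")            # illustrative path to a prepared dataset
fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(df["p (mbar)"].dropna(), lags=48, ax=axes[0])   # prominent lags, seasonality
plot_pacf(df["p (mbar)"].dropna(), lags=48, ax=axes[1])
plt.tight_layout()
plt.show()
```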

Here I choose the same input and label features because we want to employ a recursive forecasting scheme as well as the others, and as mentioned this is the condition for the recursive strategy. We choose an input width that matches the seasonality of some of the series (variables) of interest (24 time-steps) and plan to forecast 15 time-steps. The following screenshot shows the window parameters:

screenshot of window parameters

In the Jupyter analytics notebook, you can experiment with the input and label width, window shift size, data shuffling and features to include until you get the best results as depicted by the metrics.

Note that the weather dataset used here is the prepared and engineered version from here. You can do your own cleaning, engineering and smoothing on the data prior to using the window generator. Also, it does not hurt to check the stationarity of the series beforehand. You may get better results from the ML models when working with stationary data, but it is not a necessary condition.
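
One quick way to check stationarity is the augmented Dickey-Fuller test from statsmodels (again with an illustrative path and column name):

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

df = pd.read_csv("weather.csv")                          # illustrative path
adf_stat, p_value, *_ = adfuller(df["p (mbar)"].dropna())
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")
# a small p-value (e.g. < 0.05) rejects the unit-root null, i.e. suggests stationarity
```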

figure(1): performance metrics for all models.

Figure (1) shows the performance metrics for all models from the nn module (except for the simple conv model, as it did not match the window input width of interest) and three classic ML models: linear regression, k-nearest neighbours regressor and extra trees regressor. For clarity, the values of mean squared error (MSE) and mean absolute error (MAE) are shown on top of their bar plots.

For the dataset (weather dataset) and the window parameters chosen, the LSTM family of models seems to have the best performance. The fact that among the classic models the linear model has done a seemingly good job may imply that non-linearity in the autocorrelations was not so important for this data. The extra trees regressor has done a worse job than the naive model, which may not be surprising at all, knowing that ensemble methods are not so good at extrapolation and forecasting (read here).

A similar behaviour is observed when you use the same models on the bike sharing dataset in the data folder of the repository.

figure (2): Loss diagrams for the nn models. The number of trainable parameters is shown in titles.

Figure (2) shows the loss diagrams for all the trained models from the nn module. The plot titles show the name of each model along with the number of trainable parameters. When the size of the data is small, attention should also be paid to the number of learned parameters, as complex models can overfit¹ on small data.

figure(3): (a) built-in multi-step forecast, (b) recursive and (c) direct forecasts on 4 well-performing models from the previous section.

Figure (3) shows four well-performing models chosen from figure (1): their predictions of the “p (mbar)” column on the last example from the test set are plotted (you can change which example to plot in the nn module's plotting functions and methods) using the different multi-step forecast strategies. The one plotted example is not representative of the average performance of the model (for that, the average MAEs are shown in the title), but rather of the trend and behaviour of the forecast.

It appears that in the vector output prediction (a), aside from the average performance, the models have forecast the trend of the time series pretty well. The recursive forecasts, however, as speculated, suffer from the accumulation of error as the time-steps into the future increase, evident in figure (3b). The direct forecasts (c), although they have a good overall metric, fail to capture the trend of future events. Because the models in the direct forecast predict events independently of each other, this behaviour is also expected.

A similar observation can be made on the bike sharing dataset. The built-in vector output seems to outperform the other types of multi-step forecasting, both in terms of the overall performance metrics and the prediction of the future trend. The direct forecast takes second place and the recursive forecast finishes third.

Although the datasets produced by the window generator (train, validation and test sets) are all input/label pairs, you can easily use the models to produce a genuinely unseen future prediction (with no labels) as well.

The future forecast notebook in the notebooks folder shows how it can be done for the three main methodologies of vector output, recursive and direct. The recursive and direct classes in both nn and cm modules have a future forecast method that produces multi-step forecasts in the future.
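
For the vector-output case the idea is simply to feed the most recent observations, with no labels, to the trained model. A minimal sketch, reusing the illustrative data array and a trained vector-output model from the earlier sketches:

```python
import numpy as np

# the last input_width observations form the only window; there are no labels
last_window = data[-24:]                               # (input_width, n_features)
future = model.predict(last_window[np.newaxis, ...])   # (1, label_width, n_features)
print(future.shape)
```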

Other datasets:

You can run the Jupyter analytics and models as is on your own dataset or change models as you wish in the nn module. Here you can find multivariate time series data to try.

The focus of the above analysis was on the more general multivariate/multi-step forecasting problem. You can just as easily use the code and the Jupyter analytics notebook for the special case of univariate and single-step forecasting.

Conclusion:

Using a flexible sliding window approach, time series were formulated as supervised learning problems. The window parameters were then used in designing arbitrary input and output size models. Aside from built-in multi-step forecasts, recursive and direct strategies were implemented and tested.

For the two datasets in this repository, the simple LSTM model and the LSTM encoder-decoder produce the best multi-step forecasts. In a comparison between the built-in vector output, recursive and direct strategies, the built-in multi-step forecasts are best at predicting the trend of the data.

Foot notes and References:

¹ For the weather dataset used here, data is large compared with the size of the networks and overfitting is not dramatic. Also at the time of writing this article Tensorflow 2.4 on Apple M1 is still buggy and sometimes gives “internal error” when dense layers are regularised, therefore I have commented out the regularisation on dense and convolutional layers in the nn module but feel free to add it back.

[1]- Makridakis S, Spiliotis E, Assimakopoulos V (2018) Statistical and Machine Learning forecasting methods: Concerns and ways forward. https://doi.org/10.1371/journal.pone.0194889

[2]- Ben Taieb S, Bontempi G, Atiya A, Sorjamaa A (2012) A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. https://doi.org/10.1016/j.eswa.2012.01.039

[3]- Ben Taieb S, Bontempi G, Le Borgne Y (2013) Machine Learning Strategies for Time Series Forecasting. https://doi.org/10.1007/978-3-642-36318-4_3

[4]- Sage publications, Learn About Time Series ACF and PACF in SPSS With Data From the USDA Feed Grains Database (1876–2015). https://dx.doi.org/10.4135/9781473995581
