Abstract
Long sequence time-series forecasting (LSTF) problems, such as weather forecasting, stock market forecasting, and power resource management, are widespread in the real world and demand models with high prediction accuracy. Recent studies have shown that, compared with other architectures, the Transformer is the most promising model structure for LSTF. The Transformer is permutation equivariant, which makes sequence position encoding an essential part of model training. Continuous dynamical models built with neural ordinary differential equations (neural ODEs) can encode sequence position information well. However, we find that neural ODEs have several limitations when applied to LSTF, namely the time cost problem, the baseline drift problem, and the information loss problem, so they cannot be applied to LSTF directly. To address these problems, we design a binary position encoding-based regularization model for long sequence time-series forecasting, named Seformer, with the following structure: 1) A binary position encoding mechanism, consisting of intra-block and inter-block position encoding. For intra-block position encoding, we design a simple ODE method that discretizes the continuous dynamical model, which reduces the time cost of computing neural ODEs while preserving their dynamical properties as much as possible. For inter-block position encoding, a chunked recursive form is adopted to alleviate the baseline drift caused by eigenvalue explosion. 2) An information transfer regularization mechanism: by regularizing the model's intermediate hidden variables and the encoder-decoder connection variables, we reduce information loss during training while ensuring the smoothness of the position information. Extensive experiments on six large-scale datasets show a consistent improvement of our approach over the baselines.
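To make the two position encoding mechanisms concrete, the following minimal sketch (our own illustration, not the authors' released code) shows how an intra-block position state can be evolved with a fixed-step Euler discretization of a learned dynamics function, and how the recursion is restarted at block boundaries in a chunked form. All names here (`DiscretizedODEPositionEncoding`, `block_shift`, the step size) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DiscretizedODEPositionEncoding(nn.Module):
    """Illustrative sketch: evolve a position state with a fixed-step Euler
    discretization of a learned dynamics function (intra-block), and restart
    the recursion at every block boundary with a bounded shift (inter-block)."""

    def __init__(self, d_model: int, block_size: int, step: float = 0.1):
        super().__init__()
        self.block_size = block_size
        self.step = step
        # learned dynamics f(p) approximating the continuous model dp/dt = f(p)
        self.dynamics = nn.Sequential(
            nn.Linear(d_model, d_model), nn.Tanh(), nn.Linear(d_model, d_model)
        )
        self.p0 = nn.Parameter(torch.zeros(d_model))          # initial state per block
        self.block_shift = nn.Parameter(torch.zeros(d_model))  # chunked inter-block offset

    def forward(self, seq_len: int) -> torch.Tensor:
        encodings = []
        state = self.p0
        for i in range(seq_len):
            if i > 0 and i % self.block_size == 0:
                # chunked recursion: restart intra-block dynamics,
                # carry only a bounded per-block shift across blocks
                state = self.p0 + (i // self.block_size) * self.block_shift
            # one explicit Euler step instead of a full neural-ODE solve
            state = state + self.step * self.dynamics(state)
            encodings.append(state)
        return torch.stack(encodings)  # (seq_len, d_model)
```

Under this sketch, the per-position cost is a single forward pass of a small network rather than an adaptive ODE solve, and the restart at block boundaries keeps the state from drifting over long sequences.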
Notes
The ETT dataset was acquired at https://github.com/zhouhaoyi/ETDataset
The ECL dataset was acquired at https://github.com/laiguokun/multivariate-time-series-data
The Weather dataset was acquired at https://www.ncei.noaa.gov/data/local-climatological-data/
The Traffic dataset was acquired at https://drive.google.com/drive/folders/1M3gTc1DSvnUFMI57p70VFH5MHhZh3wC8/
The Meteorological dataset was acquired at https://drive.google.com/drive/folders/1Xz84ci5YKWL6O2I-58ZsVe42lYIfqui1/
Acknowledgments
This work was supported by the Key Research and Development Program of Liaoning Province of China (2020JH2/10100039).
Appendices
Appendix A: F-distribution critical values
Standard tables provide F-distribution critical values only for 10 or fewer algorithms. We therefore extended the table up to 20 algorithms; the extended values are shown in Table 7 for a significance level of α = 0.05 and in Table 8 for α = 0.1.
Similarly, for both α = 0.05 and α = 0.1, standard tables list the qα values only for 10 or fewer algorithms. We extended these values up to 20 algorithms; the results are shown in Table 9.
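As a hedged illustration of how such extended tables can be generated, the sketch below (our own, not the authors' script) computes F-distribution critical values with SciPy, assuming the usual setup in which k algorithms are compared over N datasets and the test statistic is referred to an F distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom; the choice of 6 datasets is taken from the six datasets used in the paper and is otherwise illustrative.

```python
from scipy.stats import f

def f_critical(alpha: float, k: int, n_datasets: int) -> float:
    """Critical value of the F distribution with df1 = k - 1 and
    df2 = (k - 1) * (n_datasets - 1) at significance level alpha."""
    df1 = k - 1
    df2 = (k - 1) * (n_datasets - 1)
    return f.ppf(1.0 - alpha, df1, df2)

# Extend the table from 2 up to 20 algorithms, for alpha = 0.05 and 0.10.
for k in range(2, 21):
    print(k, round(f_critical(0.05, k, 6), 3), round(f_critical(0.10, k, 6), 3))
```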
Appendix B: Experimental hyperparameter design
Because our model is mainly compared with Transformer-based models, we fixed the hyperparameters of the Transformer architecture on each dataset to ensure the fairness of the experiments. The hyperparameters are shown in Table 10.
In Table 10, ‘prediction’ denotes the prediction length of the LSTF task, ‘seq_len’ the input sequence length, ‘label_len’ the start-token length, ‘pred_len’ the prediction sequence length, ‘d_model’ the model dimension, ‘n_heads’ the number of attention heads, ‘encoder_layers’ and ‘decoder_layers’ the numbers of encoder and decoder layers, ‘d_ff’ the dimension of the feed-forward network, ‘dropout’ the dropout ratio, ‘train_epochs’ the number of training epochs, ‘batch_size’ the batch size of the training input data, ‘patience’ the early-stopping patience, and ‘learning_rate’ the optimizer learning rate.
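For readers who want to reproduce the setup, such a configuration could be expressed as a simple dictionary like the sketch below; the field names follow Table 10, but every value shown here is a placeholder, not the setting actually used in the experiments.

```python
# Hypothetical values purely for illustration; the actual settings are those in Table 10.
config = {
    "seq_len": 96,          # input sequence length (placeholder)
    "label_len": 48,        # start-token length (placeholder)
    "pred_len": 24,         # prediction sequence length (placeholder)
    "d_model": 512,         # model dimension (placeholder)
    "n_heads": 8,           # number of attention heads (placeholder)
    "encoder_layers": 2,    # number of encoder layers (placeholder)
    "decoder_layers": 1,    # number of decoder layers (placeholder)
    "d_ff": 2048,           # feed-forward network dimension (placeholder)
    "dropout": 0.05,        # dropout ratio (placeholder)
    "train_epochs": 6,      # training epochs (placeholder)
    "batch_size": 32,       # training batch size (placeholder)
    "patience": 3,          # early-stopping patience (placeholder)
    "learning_rate": 1e-4,  # optimizer learning rate (placeholder)
}
```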
Combining these hyperparameter settings with the prediction results of each model on the datasets, Seformer substantially improves prediction performance while addressing the time cost, baseline drift, and information loss problems of neural-ODE-style position encoding. This supports the conjecture that position encoding, as the basic information from which the Transformer constructs global dependencies, is critical, and it confirms that improving prediction performance from the perspective of position encoding is a sound approach.