Seformer: a long sequence time-series forecasting model based on binary position encoding and information transfer regularization


Abstract

Long sequence time-series forecasting (LSTF) problems, such as weather forecasting, stock market forecasting, and power resource management, are widespread in the real world, and they demand models with high prediction accuracy. Recent studies have shown that the transformer architecture is the most promising model structure for LSTF problems. Because the transformer is permutation equivariant, sequence position encoding becomes an essential part of model training. Continuous dynamics models built for position encoding with neural ordinary differential equations (neural ODEs) can represent sequence position information well. However, we find that neural ODEs have several limitations when applied to the LSTF problem, namely high time cost, baseline drift, and information loss, so they cannot be applied directly. To address these problems, we design a binary position encoding-based regularization model for long sequence time-series prediction, named Seformer, with the following structure: 1) A binary position encoding mechanism comprising intrablock and interblock position encoding. For intrablock position encoding, we design a simple ODE method by discretizing the continuous dynamics model, which reduces the time cost of computing neural ODEs while preserving their dynamical properties as much as possible; for interblock position encoding, a chunked recursive form is adopted to alleviate the baseline drift caused by eigenvalue explosion. 2) An information transfer regularization mechanism: by regularizing the model's intermediate hidden variables as well as the encoder-decoder connection variables, we reduce information loss during training while keeping the position information smooth. Extensive experiments on six large-scale datasets show consistent improvements of our approach over the baselines.
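
To make the high-level description above more concrete, the following minimal sketch shows one way a two-level (intrablock/interblock) position encoding of this kind could be organised, with the intrablock codes generated by an explicitly discretized dynamics model. All names, the Euler discretization, and the plain block embedding standing in for the chunked recursion are our own illustrative assumptions; this is not the authors' implementation.

```python
import torch
import torch.nn as nn


class BinaryPositionEncodingSketch(nn.Module):
    """Illustrative sketch only: a two-level position encoding with an
    Euler-discretized dynamics model inside each block (hypothetical names)."""

    def __init__(self, d_model: int, block_size: int, max_blocks: int = 512, step: float = 0.1):
        super().__init__()
        self.block_size = block_size
        self.step = step
        # Learned vector field f approximating the continuous dynamics dp/dt = f(p).
        self.f = nn.Sequential(nn.Linear(d_model, d_model), nn.Tanh(),
                               nn.Linear(d_model, d_model))
        self.p0 = nn.Parameter(torch.randn(d_model))         # initial intrablock state
        self.block_emb = nn.Embedding(max_blocks, d_model)   # interblock (chunk-level) code

    def forward(self, seq_len: int) -> torch.Tensor:
        # Intrablock: explicit Euler steps p_{i+1} = p_i + h * f(p_i),
        # a cheap discretization of the neural-ODE position dynamics.
        states = [self.p0]
        for _ in range(self.block_size - 1):
            p = states[-1]
            states.append(p + self.step * self.f(p))
        intra = torch.stack(states)                           # (block_size, d_model)

        # Interblock: one code per chunk; the chunked recursive form of the paper
        # is replaced here by a plain embedding for brevity.
        n_blocks = (seq_len + self.block_size - 1) // self.block_size
        inter = self.block_emb(torch.arange(n_blocks, device=self.p0.device))

        # Each position receives its block's code plus its within-block code.
        pe = (inter.unsqueeze(1) + intra.unsqueeze(0)).reshape(-1, intra.size(-1))
        return pe[:seq_len]                                   # (seq_len, d_model)
```

As a usage example, `BinaryPositionEncodingSketch(d_model=512, block_size=24)(96)` returns a `(96, 512)` tensor that could be added to the input embeddings before the transformer layers.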


Notes

  1. The ETT dataset was acquired at https://github.com/zhouhaoyi/ETDataset

  2. The ECL dataset was acquired at https://github.com/laiguokun/multivariate-time-series-data

  3. The Weather dataset was acquired at https://www.ncei.noaa.gov/data/local-climatological-data/

  4. The Traffic dataset was acquired at https://drive.google.com/drive/folders/1M3gTc1DSvnUFMI57p70VFH5MHhZh3wC8/

  5. The Meteorological dataset was acquired at https://drive.google.com/drive/folders/1Xz84ci5YKWL6O2I-58ZsVe42lYIfqui1/


Acknowledgments

This work was supported by the Key Research and Development Program of Liaoning Province of China (2020JH2/10100039).

Author information

Corresponding author

Correspondence to Xiaofeng Zhou.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: F-distribution critical values

Published tables of the F-distribution critical values used in the Friedman test cover only 10 or fewer algorithms. We therefore extended them to 20 algorithms; the extended critical values are shown in Table 7 for the significance level α = 0.05 and in Table 8 for α = 0.1.

Table 7 Common critical values for the Friedman test (α = 0.05)
Table 8 Common critical values for the Friedman test (α = 0.1)
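
These extended values can also be reproduced numerically rather than read from printed tables. The short helper below is a sketch under the common assumption that the Friedman statistic is compared against an F distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom (the Iman-Davenport form), for k algorithms and N datasets; the function name and example values are ours.

```python
from scipy.stats import f


def friedman_f_critical(k: int, n_datasets: int, alpha: float = 0.05) -> float:
    """Critical value of the F-distributed Friedman (Iman-Davenport) statistic
    for k algorithms compared over n_datasets datasets (illustrative helper)."""
    df1 = k - 1
    df2 = (k - 1) * (n_datasets - 1)
    return f.ppf(1.0 - alpha, df1, df2)


# Example: critical value for 20 algorithms on 6 datasets at alpha = 0.05.
print(round(friedman_f_critical(20, 6), 3))
```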

Similarly, regardless of whether α = 0.05 or α = 0.1, published tables of qα values cover only 10 or fewer algorithms. We extended them to 20 algorithms; the results are shown in Table 9.

Table 9 Commonly used qα values in the Nemenyi test
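
The qα values can likewise be computed directly. The sketch below assumes the usual definition qα = q(α, k, ∞)/√2, where q(α, k, ∞) is the studentized range quantile for k groups with infinite degrees of freedom (approximated here by a very large df); it requires scipy ≥ 1.7, and the helper name is ours.

```python
import math

from scipy.stats import studentized_range  # available in scipy >= 1.7


def nemenyi_q_alpha(k: int, alpha: float = 0.05, df: float = 1e6) -> float:
    """q_alpha of the Nemenyi test for k algorithms: the studentized range
    quantile (infinite df approximated by a large df) divided by sqrt(2)."""
    return studentized_range.ppf(1.0 - alpha, k, df) / math.sqrt(2.0)


# The critical difference for N datasets then follows the standard formula
#   CD = q_alpha * sqrt(k * (k + 1) / (6 * N)).
print(round(nemenyi_q_alpha(20), 3))
```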

Appendix B: Experimental hyperparameter design

Because our model is mainly compared with Transformer-based models, we fixed the hyperparameters of the transformer architecture across the datasets to ensure fair experiments. The hyperparameters are shown in Table 10.

Table 10 Experimental hyperparameter settings

In Table 10, ‘prediction’ is the prediction length of the LSTF task; ‘seq_len’ is the input sequence length; ‘label_len’ is the start token length; ‘pred_len’ is the prediction sequence length; ‘d_model’ is the model dimension; ‘n_heads’ is the number of attention heads; ‘encoder_layers’ and ‘decoder_layers’ are the numbers of encoder and decoder layers; ‘d_ff’ is the dimension of the feed-forward network; ‘dropout’ is the dropout ratio; ‘train_epochs’ is the number of training epochs; ‘batch_size’ is the batch size of the training input data; ‘patience’ is the early-stopping patience; and ‘learning_rate’ is the optimizer learning rate.
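
For illustration only, these fields could be gathered into a single configuration object as sketched below; the class name and default values are placeholders of our own and are not the settings reported in Table 10.

```python
from dataclasses import dataclass


@dataclass
class ExperimentConfig:
    """Hypothetical container mirroring the hyperparameter fields described above;
    the defaults are placeholders, not the paper's reported settings."""
    seq_len: int = 96            # input sequence length
    label_len: int = 48          # start token length
    pred_len: int = 24           # prediction sequence length
    d_model: int = 512           # model dimension
    n_heads: int = 8             # number of attention heads
    encoder_layers: int = 2      # number of encoder layers
    decoder_layers: int = 1      # number of decoder layers
    d_ff: int = 2048             # dimension of the feed-forward network
    dropout: float = 0.05        # dropout ratio
    train_epochs: int = 10       # number of training epochs
    batch_size: int = 32         # batch size of the training input data
    patience: int = 3            # early-stopping patience
    learning_rate: float = 1e-4  # optimizer learning rate


print(ExperimentConfig())
```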

The above parameter comparisons, combined with the prediction results of each model on the datasets, show that our Seformer model greatly improves prediction performance by addressing the time cost, baseline drift, and information loss problems of neural-ODE-style position encoding. This supports the conjecture that position encoding is important as the base information from which the Transformer constructs global dependencies, and it confirms that improving prediction performance from the perspective of position encoding is a sound approach.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zeng, P., Hu, G., Zhou, X. et al. Seformer: a long sequence time-series forecasting model based on binary position encoding and information transfer regularization. Appl Intell 53, 15747–15771 (2023). https://doi.org/10.1007/s10489-022-04263-z
