Abstract
Long sequence time-series forecasting (LSTF) problems, such as weather forecasting, stock market forecasting, and power resource management, are widespread in the real world and demand models with high prediction accuracy. Recent studies have shown that, compared with other architectures, the Transformer is the most promising model structure for LSTF. The Transformer is permutation equivariant, which makes sequence position encoding an essential part of model training. Continuous dynamical models built with neural ordinary differential equations (neural ODEs) can encode sequence position information well. However, we find that neural ODEs have several limitations when applied to LSTF, namely the time cost problem, the baseline drift problem, and the information loss problem, so they cannot be applied to LSTF directly. To address these problems, we design a binary position encoding-based regularization model for long sequence time-series forecasting, named Seformer, with the following structure: 1) A binary position encoding mechanism, consisting of intra-block and inter-block position encoding. For intra-block position encoding, we design a simple ODE method that discretizes the continuous dynamical model, which reduces the time cost of computing neural ODEs while preserving their dynamical properties as much as possible. For inter-block position encoding, a chunked recursive form is adopted to alleviate the baseline drift caused by eigenvalue explosion. 2) An information transfer regularization mechanism: by regularizing the model's intermediate hidden variables and the encoder-decoder connection variables, we reduce information loss during training while ensuring the smoothness of the position information. Extensive experiments on six large-scale datasets show a consistent improvement of our approach over the baselines.
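To make the two position encoding mechanisms concrete, the following minimal sketch (our own illustration, not the authors' released code) shows how an intra-block position state can be evolved with a fixed-step Euler discretization of a learned dynamics function, and how the recursion is restarted at block boundaries in a chunked form. All names here (`DiscretizedODEPositionEncoding`, `block_shift`, the step size) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DiscretizedODEPositionEncoding(nn.Module):
    """Illustrative sketch: evolve a position state with a fixed-step Euler
    discretization of a learned dynamics function (intra-block), and restart
    the recursion at every block boundary with a bounded shift (inter-block)."""

    def __init__(self, d_model: int, block_size: int, step: float = 0.1):
        super().__init__()
        self.block_size = block_size
        self.step = step
        # learned dynamics f(p) approximating the continuous model dp/dt = f(p)
        self.dynamics = nn.Sequential(
            nn.Linear(d_model, d_model), nn.Tanh(), nn.Linear(d_model, d_model)
        )
        self.p0 = nn.Parameter(torch.zeros(d_model))          # initial state per block
        self.block_shift = nn.Parameter(torch.zeros(d_model))  # chunked inter-block offset

    def forward(self, seq_len: int) -> torch.Tensor:
        encodings = []
        state = self.p0
        for i in range(seq_len):
            if i > 0 and i % self.block_size == 0:
                # chunked recursion: restart intra-block dynamics,
                # carry only a bounded per-block shift across blocks
                state = self.p0 + (i // self.block_size) * self.block_shift
            # one explicit Euler step instead of a full neural-ODE solve
            state = state + self.step * self.dynamics(state)
            encodings.append(state)
        return torch.stack(encodings)  # (seq_len, d_model)
```

Under this sketch, the per-position cost is a single forward pass of a small network rather than an adaptive ODE solve, and the restart at block boundaries keeps the state from drifting over long sequences.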
Notes
The ETT dataset was acquired at https://github.com/zhouhaoyi/ETDataset
The ECL dataset was acquired at https://github.com/laiguokun/multivariate-time-series-data
The Weather dataset was acquired at https://www.ncei.noaa.gov/data/local-climatological-data/
The Traffic dataset was acquired at https://drive.google.com/drive/folders/1M3gTc1DSvnUFMI57p70VFH5MHhZh3wC8/
The Meteorological dataset was acquired at https://drive.google.com/drive/folders/1Xz84ci5YKWL6O2I-58ZsVe42lYIfqui1/
Acknowledgments
This work was supported by the Key Research and Development Program of Liaoning Province of China (2020JH2/10100039).
Appendices
Appendix A: F-distribution critical values
Standard tables provide F-distribution critical values only for 10 or fewer algorithms. We therefore extended the table up to 20 algorithms; the extended values are shown in Table 7 for a significance level of α = 0.05 and in Table 8 for α = 0.1.
Similarly, for both α = 0.05 and α = 0.1, standard tables list the qα values only for 10 or fewer algorithms. We extended these values up to 20 algorithms; the results are shown in Table 9.
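As a hedged illustration of how such extended tables can be generated, the sketch below (our own, not the authors' script) computes F-distribution critical values with SciPy, assuming the usual setup in which k algorithms are compared over N datasets and the test statistic is referred to an F distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom; the choice of 6 datasets is taken from the six datasets used in the paper and is otherwise illustrative.

```python
from scipy.stats import f

def f_critical(alpha: float, k: int, n_datasets: int) -> float:
    """Critical value of the F distribution with df1 = k - 1 and
    df2 = (k - 1) * (n_datasets - 1) at significance level alpha."""
    df1 = k - 1
    df2 = (k - 1) * (n_datasets - 1)
    return f.ppf(1.0 - alpha, df1, df2)

# Extend the table from 2 up to 20 algorithms, for alpha = 0.05 and 0.10.
for k in range(2, 21):
    print(k, round(f_critical(0.05, k, 6), 3), round(f_critical(0.10, k, 6), 3))
```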
Appendix B: Experimental hyperparameter design
Because our model is mainly compared with Transformer-based models, we fixed the hyperparameters of the Transformer architecture on each dataset to ensure the fairness of the experiments. The hyperparameters are shown in Table 10.
In Table 10, ‘prediction’ denotes the prediction length of the LSTF task, ‘seq_len’ the input sequence length, ‘label_len’ the start-token length, ‘pred_len’ the prediction sequence length, ‘d_model’ the model dimension, ‘n_heads’ the number of attention heads, ‘encoder_layers’ and ‘decoder_layers’ the numbers of encoder and decoder layers, ‘d_ff’ the dimension of the feed-forward network, ‘dropout’ the dropout ratio, ‘train_epochs’ the number of training epochs, ‘batch_size’ the batch size of the training input data, ‘patience’ the early-stopping patience, and ‘learning_rate’ the optimizer learning rate.
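For readers who want to reproduce the setup, such a configuration could be expressed as a simple dictionary like the sketch below; the field names follow Table 10, but every value shown here is a placeholder, not the setting actually used in the experiments.

```python
# Hypothetical values purely for illustration; the actual settings are those in Table 10.
config = {
    "seq_len": 96,          # input sequence length (placeholder)
    "label_len": 48,        # start-token length (placeholder)
    "pred_len": 24,         # prediction sequence length (placeholder)
    "d_model": 512,         # model dimension (placeholder)
    "n_heads": 8,           # number of attention heads (placeholder)
    "encoder_layers": 2,    # number of encoder layers (placeholder)
    "decoder_layers": 1,    # number of decoder layers (placeholder)
    "d_ff": 2048,           # feed-forward network dimension (placeholder)
    "dropout": 0.05,        # dropout ratio (placeholder)
    "train_epochs": 6,      # training epochs (placeholder)
    "batch_size": 32,       # training batch size (placeholder)
    "patience": 3,          # early-stopping patience (placeholder)
    "learning_rate": 1e-4,  # optimizer learning rate (placeholder)
}
```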
Combining these hyperparameter settings with the prediction results of each model on the datasets, Seformer substantially improves prediction performance while addressing the time cost, baseline drift, and information loss problems of neural-ODE-style position encoding. This supports the conjecture that position encoding, as the basic information from which the Transformer constructs global dependencies, is critical, and it confirms that improving prediction performance from the perspective of position encoding is a sound approach.