research-article

Missing value imputation on multidimensional time series

Published: 01 July 2021

Abstract

We present DeepMVI, a deep learning method for missing value imputation in multidimensional time-series datasets. Missing values are commonplace in decision support platforms that aggregate data over long time stretches from disparate sources, yet reliable data analytics calls for careful handling of missing data. One strategy is to impute the missing values, and a wide variety of algorithms exist, spanning simple interpolation, matrix factorization methods like SVD, statistical models like Kalman filters, and recent deep learning methods. We show that these often provide worse results on aggregate analytics than simply excluding the missing data.
DeepMVI expresses the distribution of each missing value conditioned on coarse and fine-grained signals along a time series, and on signals from correlated series at the same time. Instead of resorting to the linearity assumptions of conventional matrix factorization methods, DeepMVI harnesses a flexible deep network to extract and combine these signals in an end-to-end manner. To prevent over-fitting with high-capacity neural networks, we design a robust parameter training procedure that uses labeled data created by placing synthetic missing blocks around available indices. Our neural network uses a modular design with a novel temporal transformer with convolutional features, and kernel regression with learned embeddings.
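The idea of creating labeled training data from synthetic missing blocks can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the function name `make_synthetic_block`, the `max_len` parameter, and the uniform sampling of block position and length are all hypothetical choices made here for clarity.

```python
import numpy as np

def make_synthetic_block(values, observed, rng, max_len=10):
    """Hide one contiguous block of observed cells in a single series.

    values:   (n_series, n_times) float array
    observed: (n_series, n_times) bool array, True where a value is available
    Returns (masked_values, train_mask, targets): train_mask marks the cells
    hidden artificially, and targets holds their true values (the labels).
    """
    n_series, n_times = values.shape
    # Pick an available index to place the synthetic block around (assumed uniform).
    rows, cols = np.nonzero(observed)
    pick = rng.integers(len(rows))
    k, t = rows[pick], cols[pick]
    length = int(rng.integers(1, max_len + 1))
    start = max(0, t - length // 2)
    stop = min(n_times, start + length)

    train_mask = np.zeros_like(observed, dtype=bool)
    # Only cells that were really observed can serve as labels.
    train_mask[k, start:stop] = observed[k, start:stop]

    # The model input hides both the truly missing and the synthetically hidden cells.
    masked_values = np.where(train_mask | ~observed, np.nan, values)
    targets = values[train_mask]
    return masked_values, train_mask, targets
```

A training loop would call this repeatedly, feed `masked_values` to the model, and compute the loss only on the cells selected by `train_mask`, so that truly missing cells never act as labels.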
Experiments across ten real datasets and five different missing scenarios, comparing against seven conventional and three deep learning methods, show that DeepMVI is significantly more accurate, reducing error by more than 50% in more than half the cases compared to the best existing method. Although DeepMVI is slower than simpler matrix factorization methods, we justify the increased time overhead by showing that it provides significantly more accurate imputation, which in turn affects the quality of downstream analytics.
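The kernel regression with learned embeddings mentioned in the abstract can be pictured as a weighted average over other series at the same time step, with weights given by similarity between series embeddings. The sketch below is a generic Nadaraya-Watson estimator and only an assumption about the general shape of such a component: the function name `kernel_regression_impute`, the RBF kernel, and the `gamma` parameter are hypothetical, and the embeddings `emb` are fixed inputs here rather than learned.

```python
import numpy as np

def kernel_regression_impute(values, observed, emb, k, t, gamma=1.0):
    """Estimate values[k, t] from other series observed at time t.

    values:   (n_series, n_times) float array
    observed: (n_series, n_times) bool array, True where a value is available
    emb:      (n_series, d) array of series embeddings (learned in practice,
              fixed here for illustration)
    """
    usable = observed[:, t].copy()
    usable[k] = False                      # never use the target series itself
    others = np.flatnonzero(usable)
    if others.size == 0:
        return float("nan")                # no sibling series to borrow from
    # RBF kernel on embedding distance (an assumed kernel choice).
    sq_dist = np.sum((emb[others] - emb[k]) ** 2, axis=1)
    weights = np.exp(-gamma * sq_dist)
    return float(np.sum(weights * values[others, t]) / np.sum(weights))
```

Series whose embeddings lie close to that of series `k` dominate the estimate, which is one way to exploit signals from correlated series at the same time without assuming a linear low-rank structure.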

Published In

Proceedings of the VLDB Endowment  Volume 14, Issue 11
July 2021
732 pages
ISSN:2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 July 2021
Published in PVLDB Volume 14, Issue 11

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 40
  • Downloads (last 6 weeks): 2

Reflects downloads up to 23 Feb 2025

Cited By

  • (2024) Generalizable Data Cleaning of Tabular Data in Latent Space. Proceedings of the VLDB Endowment 17:13 (4786-4798). DOI: 10.14778/3704965.3704983. Online publication date: 1-Sep-2024
  • (2024) ImputeVIS: An Interactive Evaluator to Benchmark Imputation Techniques for Time Series Data. Proceedings of the VLDB Endowment 17:12 (4329-4332). DOI: 10.14778/3685800.3685867. Online publication date: 8-Nov-2024
  • (2024) Enriching Relations with Additional Attributes for ER. Proceedings of the VLDB Endowment 17:11 (3109-3123). DOI: 10.14778/3681954.3681987. Online publication date: 30-Aug-2024
  • (2024) A Multi-Scale Decomposition MLP-Mixer for Time Series Analysis. Proceedings of the VLDB Endowment 17:7 (1723-1736). DOI: 10.14778/3654621.3654637. Online publication date: 1-Mar-2024
  • (2024) Sensor-Aware Data Imputation for Time-Series Machine Learning on Low-Power Wearable Devices. ACM Transactions on Design Automation of Electronic Systems 30:1 (1-27). DOI: 10.1145/3698195. Online publication date: 7-Oct-2024
  • (2024) Iterative Time Series Imputation by Maintaining Dependency Consistency. ACM Transactions on Knowledge Discovery from Data 19:1 (1-24). DOI: 10.1145/3698107. Online publication date: 29-Sep-2024
  • (2024) In-Database Data Imputation. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639326. Online publication date: 26-Mar-2024
  • (2024) Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (3690-3701). DOI: 10.1145/3637528.3671873. Online publication date: 25-Aug-2024
  • (2024) Mining of Switching Sparse Networks for Missing Value Imputation in Multivariate Time Series. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2296-2306). DOI: 10.1145/3637528.3671760. Online publication date: 25-Aug-2024
  • (2023) Splitting Tuples of Mismatched Entities. Proceedings of the ACM on Management of Data 1:4 (1-29). DOI: 10.1145/3626763. Online publication date: 12-Dec-2023
