research-article

Missing value imputation on multidimensional time series

Authors: Parikshit Bansal, Prathamesh Deshpande, Sunita Sarawagi

Proceedings of the VLDB Endowment, Volume 14, Issue 11

Pages 2533 - 2545

https://doi.org/10.14778/3476249.3476300

Published: 01 July 2021

Abstract

We present DeepMVI, a deep learning method for missing value imputation in multidimensional time-series datasets. Missing values are commonplace in decision support platforms that aggregate data over long time stretches from disparate sources, and reliable data analytics calls for careful handling of them. One strategy is to impute the missing values, and a wide variety of algorithms exist for this, spanning simple interpolation, matrix factorization methods like SVD, statistical models like Kalman filters, and recent deep learning methods. We show that these often yield worse results on aggregate analytics than simply excluding the missing data.

DeepMVI expresses the distribution of each missing value conditioned on coarse and fine-grained signals along a time series, and on signals from correlated series at the same time. Instead of relying on the linearity assumptions of conventional matrix factorization methods, DeepMVI harnesses a flexible deep network to extract and combine these signals in an end-to-end manner. To prevent over-fitting with high-capacity neural networks, we design a robust training procedure that uses labeled data created from synthetic missing blocks around available indices. Our neural network uses a modular design with a novel temporal transformer with convolutional features, and kernel regression with learned embeddings.
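The idea of creating labeled training data from synthetic missing blocks can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `make_synthetic_block`, the array layout (series by time), the uniform choice of block position, and the use of NaN to mark hidden entries are all hypothetical choices made for the example.

```python
import numpy as np

def make_synthetic_block(values, observed, block_len, rng):
    """Hide one contiguous, fully observed block so that it can serve as a label.

    values:   (n_series, n_time) float array
    observed: (n_series, n_time) bool array, True where a value is available
    Returns (model_input, train_mask, target). train_mask marks the entries
    hidden on purpose; their true values are returned in target.
    """
    n_series, n_time = values.shape
    # Candidate (series, start) pairs whose whole block is observed.
    candidates = [
        (s, t)
        for s in range(n_series)
        for t in range(n_time - block_len + 1)
        if observed[s, t:t + block_len].all()
    ]
    if not candidates:
        raise ValueError("no fully observed block of the requested length")
    # Assumption: pick the block position uniformly at random.
    s, t = candidates[rng.integers(len(candidates))]
    train_mask = np.zeros_like(observed, dtype=bool)
    train_mask[s, t:t + block_len] = True
    # The model sees NaN for truly missing and for synthetically hidden entries.
    model_input = np.where(observed & ~train_mask, values, np.nan)
    target = values[train_mask]
    return model_input, train_mask, target

# Toy data: 2 series of length 10, with two truly missing entries in series 0.
rng = np.random.default_rng(0)
values = np.arange(20, dtype=float).reshape(2, 10)
observed = np.ones((2, 10), dtype=bool)
observed[0, 3:5] = False
model_input, train_mask, target = make_synthetic_block(values, observed, 3, rng)
```

A training loop would compare the model's predictions at `train_mask` against `target`, so the loss is computed only on entries whose true values are known.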

Experiments across ten real datasets and five different missing scenarios, comparing against seven conventional and three deep learning methods, show that DeepMVI is significantly more accurate, reducing error by more than 50% in more than half the cases compared to the best existing method. Although DeepMVI is slower than simpler matrix factorization methods, we justify the increased time overhead by showing that its significantly more accurate imputation ultimately affects the quality of downstream analytics.
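To make a statement such as "reducing error by more than 50%" concrete, the sketch below computes an imputation error on held-out entries and the relative reduction against a baseline. It assumes mean absolute error as the metric, which the abstract does not name; the helper names and the toy numbers are hypothetical and are not results from the paper.

```python
import numpy as np

def masked_mae(truth, imputed, eval_mask):
    """Mean absolute error over the entries that were hidden for evaluation."""
    return float(np.abs(truth[eval_mask] - imputed[eval_mask]).mean())

def relative_error_reduction(baseline_err, new_err):
    """Fraction by which new_err is lower than baseline_err (0.5 means 50%)."""
    return 1.0 - new_err / baseline_err

# Toy example: the last two entries were hidden and then imputed by two methods.
truth = np.array([1.0, 2.0, 3.0, 4.0])
eval_mask = np.array([False, False, True, True])
baseline_imputed = np.array([1.0, 2.0, 5.0, 8.0])   # absolute errors 2 and 4
candidate_imputed = np.array([1.0, 2.0, 4.0, 5.0])  # absolute errors 1 and 1

baseline_err = masked_mae(truth, baseline_imputed, eval_mask)
candidate_err = masked_mae(truth, candidate_imputed, eval_mask)
reduction = relative_error_reduction(baseline_err, candidate_err)
```

Here the baseline error is 3.0 and the candidate error is 1.0, so the relative reduction is about 0.67, which would count as a case with more than 50% error reduction.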

Published In

Proceedings of the VLDB Endowment Volume 14, Issue 11

July 2021

732 pages

ISSN: 2150-8097

Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)

Publisher

VLDB Endowment
