research-article

Data Imputation from the Perspective of Graph Dirichlet Energy

Authors:

Fugee TsungAuthors Info & Claims

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management

Pages 3237 - 3247

https://doi.org/10.1145/3627673.3679669

Published: 21 October 2024 Publication History

Abstract

Data imputation is a crucial task due to the widespread occurrence of missing data. Many methods adopt a two-step approach: initially crafting a preliminary imputation (the "draft") and then refining it to produce the final missing data imputation result, commonly referred to as "draft-then-refine". In our study, we examine this prevalent strategy through the lens of graph Dirichlet energy. We observe that a basic "draft" imputation tends to decrease the Dirichlet energy. Therefore, a subsequent "refine" step is necessary to restore the overall energy balance. Existing refinement techniques, such as the Graph Convolutional Network (GCN), often result in further energy reduction. To address this, we introduce a new framework, the Graph Laplacian Pyramid Network (GLPN). GLPN incorporates a U-shaped autoencoder and residual networks to capture both global and local details effectively. Through extensive experiments on multiple real-world datasets, GLPN consistently outperforms state-of-the-art methods across three different missing data mechanisms. The code is available at https://github.com/liguanlue/GLPN.

References

[1]

Saeed Anwar and Nick Barnes. 2020. Densely residual laplacian super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44, 3 (2020), 1192--1204.

[2]

Vincent Audigier, Franccois Husson, and Julie Josse. 2016. Multiple imputation for continuous variables using a Bayesian principal component analysis. Journal of statistical computation and simulation, Vol. 86, 11 (2016), 2140--2156.

[3]

Michal Bechny, Florian Sobieczky, Jürgen Zeindl, and Lisa Ehrlinger. 2021. Missing Data Patterns: From Theory to an Application in the Steel Industry. In 33rd International Conference on Scientific and Statistical Database Management. 214--219.

Digital Library

[4]

Dimitris Bertsimas, Colin Pawlowski, and Ying Daisy Zhuo. 2017. From predictive methods to missing data imputation: an optimization approach. The Journal of Machine Learning Research, Vol. 18, 1 (2017), 7133--7171.

Digital Library

[5]

Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. 2005. Protein function prediction via graph kernels. Bioinformatics, Vol. 21, suppl_1 (2005), i47--i56.

Digital Library

[6]

Shilpi Bose, Chandra Das, Tamaghna Gangopadhyay, and Samiran Chattopadhyay. 2013. A modified local least squares-based missing value estimation method in microarray gene expression data. In 2013 2nd International Conference on Advanced Computing, Networking and Security. IEEE, 18--23.

Digital Library

[7]

Shilpi Bose, Chandra Das, Tamaghna Gangopadhyay, and Samiran Chattopadhyay. 2013. A Modified Local Least Squares-Based Missing Value Estimation Method in Microarray Gene Expression Data. In 2013 2nd International Conference on Advanced Computing, Networking and Security, Mangalore, India, December 15--17, 2013. IEEE, 18--23.

Digital Library

[8]

Peter J Burt and Edward H Adelson. 1987. The Laplacian pyramid as a compact image code. In Readings in computer vision. Elsevier, 671--679.

[9]

Chen Cai and Yusu Wang. 2020. A note on over-smoothing for graph neural networks. arXiv preprint arXiv:2006.13318 (2020).

[10]

Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. 2010. A singular value thresholding algorithm for matrix completion. SIAM Journal on optimization, Vol. 20, 4 (2010), 1956--1982.

[11]

Zhiyong Cui, Kristian Henrickson, Ruimin Ke, and Yinhai Wang. 2019. Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting. IEEE Transactions on Intelligent Transportation Systems (2019).

[12]

Zhiyong Cui, Ruimin Ke, and Yinhai Wang. 2018. Deep Bidirectional and Unidirectional LSTM Recurrent Neural Network for Network-wide Traffic Speed Prediction. arXiv preprint arXiv:1801.02143 (2018).

[13]

Hiroshi De Silva and A Shehan Perera. 2016. Missing data imputation using Evolutionary k-Nearest neighbor algorithm for gene expression data. In 2016 Sixteenth International Conference on Advances in ICT for Emerging Regions (ICTer). IEEE, 141--146.

[14]

Arthur P Dempster, Nan M Laird, and Donald B Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), Vol. 39, 1 (1977), 1--22.

[15]

Tlamelo Emmanuel, Thabiso M. Maupong, Dimane Mpoeleng, Thabo Semong, Banyatsang Mphago, and Oteng Tabona. 2021. A survey on missing data in machine learning. J. Big Data, Vol. 8, 1 (2021), 140. https://doi.org/10.1186/s40537-021-00516--9

[16]

Hongyang Gao and Shuiwang Ji. 2019. Graph u-nets. In international conference on machine learning. PMLR, 2083--2092.

[17]

Ziqi Gao, Yifan Niu, Jiashun Cheng, Jianheng Tang, Lanqing Li, Tingyang Xu, Peilin Zhao, Fugee Tsung, and Jia Li. 2023. Handling missing data via max-entropy regularized graph autoencoder. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 7651--7659.

Digital Library

[18]

Pedro J García-Laencina, José-Luis Sancho-Gómez, and Aníbal R Figueiras-Vidal. 2010. Pattern classification with missing data: a review. Neural Computing and Applications, Vol. 19, 2 (2010), 263--282.

Digital Library

[19]

Alberto P García-Plaza, Víctor Fresno, Raquel Martínez Unanue, and Arkaitz Zubiaga. 2016. Using fuzzy logic to leverage HTML markup for web page representation. IEEE Transactions on Fuzzy Systems, Vol. 25, 4 (2016), 919--933.

Digital Library

[20]

Lovedeep Gondara and Ke Wang. 2018. Mida: Multiple imputation using denoising autoencoders. In Pacific-Asia conference on knowledge discovery and data mining. Springer, 260--272.

Digital Library

[21]

Trevor Hastie, Rahul Mazumder, Jason D Lee, and Reza Zadeh. 2015. Matrix completion and low-rank SVD via fast alternating least squares. The Journal of Machine Learning Research, Vol. 16, 1 (2015), 3367--3402.

Digital Library

[22]

Wenbing Huang, Yu Rong, Tingyang Xu, Fuchun Sun, and Junzhou Huang. 2020. Tackling over-smoothing for general graph convolutional networks. arXiv preprint arXiv:2008.09864 (2020).

[23]

O Ivanov, M Figurnov, and D Vetrov. 2019. Variational autoencoder with arbitrary conditioning. In 7th International Conference on Learning Representations, ICLR 2019.

[24]

Phimmarin Keerin and Tossapon Boongoen. 2021. Improved knn imputation for missing values in gene expression data. Computers, Materials and Continua, Vol. 70, 2 (2021), 4009--4025.

[25]

Ki-Yeol Kim, Byoung-Jin Kim, and Gwan-Su Yi. 2004. Reuse of imputed data in microarray analysis increases imputation efficiency. BMC bioinformatics, Vol. 5, 1 (2004), 1--9.

[26]

Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).

[27]

Thomas N Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations.

[28]

Trent Kyono, Yao Zhang, Alexis Bellot, and Mihaela van der Schaar. 2021. MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms. Advances in Neural Information Processing Systems, Vol. 34 (2021).

[29]

Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. 2017. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition. 624--632.

[30]

Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. 2018. Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE transactions on pattern analysis and machine intelligence, Vol. 41, 11 (2018), 2599--2613.

[31]

Kamakshi Lakshminarayan, Steven A Harp, and Tariq Samad. 1999. Imputation of missing data in industrial databases. Applied intelligence, Vol. 11, 3 (1999), 259--275.

[32]

Marine Le Morvan, Julie Josse, Erwan Scornet, and Gaël Varoquaux. 2021. What'sa good imputation to predict with missing values? Advances in Neural Information Processing Systems, Vol. 34 (2021).

[33]

Junhyun Lee, Inyeop Lee, and Jaewoo Kang. 2019. Self-attention graph pooling. In International conference on machine learning. PMLR, 3734--3743.

[34]

Jia Li, Jiajin Li, Yang Liu, Jianwei Yu, Yueting Li, and Hong Cheng. 2021. Deconvolutional Networks on Graph Data. Advances in Neural Information Processing Systems, Vol. 34 (2021).

[35]

Jia Li, Yu Rong, Hong Cheng, Helen Meng, Wenbing Huang, and Junzhou Huang. 2019. Semi-supervised graph classification: A hierarchical graph perspective. In The World Wide Web Conference. 972--982.

Digital Library

[36]

Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations.

[37]

Derek Lim, Felix Hohne, Xiuyu Li, Sijia Linda Huang, Vaishnavi Gupta, Omkar Bhalerao, and Ser Nam Lim. 2021. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods. Advances in Neural Information Processing Systems, Vol. 34 (2021), 20887--20902.

[38]

Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, and Doina Precup. 2022. Revisiting Heterophily For Graph Neural Networks. In Advances in Neural Information Processing Systems.

[39]

Pierre-Alexandre Mattei and Jes Frellsen. 2019. MIWAE: Deep generative modelling and imputation of incomplete data sets. In International conference on machine learning. PMLR, 4413--4423.

[40]

Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. 2010. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, Vol. 11 (2010), 2287--2322.

Digital Library

[41]

Yimeng Min, Frederik Wenkel, and Guy Wolf. 2020. Scattering gcn: Overcoming oversmoothness in graph convolutional networks. Advances in Neural Information Processing Systems, Vol. 33 (2020), 14498--14508.

[42]

Stephen A Mistler and Craig K Enders. 2017. A comparison of joint model and fully conditional specification imputation for multilevel missing data. Journal of Educational and Behavioral Statistics, Vol. 42, 4 (2017), 432--466.

[43]

Karthika Mohan, Judea Pearl, and Jin Tian. 2013. Graphical models for inference with missing data. Advances in neural information processing systems, Vol. 26 (2013).

[44]

Christopher Morris, Nils M Kriege, Kristian Kersting, and Petra Mutzel. 2016. Faster kernels for graphs with continuous attributes via hashing. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 1095--1100.

[45]

Boris Muzellec, Julie Josse, Claire Boyer, and Marco Cuturi. 2020. Missing data imputation using optimal transport. In International Conference on Machine Learning. PMLR, 7130--7140.

[46]

Francesco Orsini, Paolo Frasconi, and Luc De Raedt. 2015. Graph invariant kernels. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

[47]

Jiwoong Park, Minsik Lee, Hyung Jin Chang, Kyuewang Lee, and Jin Young Choi. 2019. Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6519--6528.

[48]

Neta Rabin and Dalia Fishelov. 2017. Missing data completion using diffusion maps and laplacian pyramids. In International Conference on Computational Science and Its Applications. Springer, 284--297.

[49]

Neta Rabin and Dalia Fishelov. 2019. Two directional Laplacian pyramids with application to data imputation. Advances in Computational Mathematics, Vol. 45, 4 (2019), 2123--2146.

Digital Library

[50]

Emanuele Rossi, Henry Kenlay, Maria I Gorinova, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael Bronstein. 2021. On the Unreasonable Effectiveness of Feature propagation in Learning on Graphs with Missing Node Features. arXiv preprint arXiv:2111.12128 (2021).

[51]

Donald B Rubin. 1976. Inference and missing data. Biometrika, Vol. 63, 3 (1976), 581--592.

[52]

Aude Sportisse, Claire Boyer, and Julie Josse. 2020. Estimation and imputation in probabilistic principal component analysis with missing not at random data. Advances in Neural Information Processing Systems, Vol. 33 (2020), 7067--7077.

[53]

Daniel J Stekhoven and Peter Bühlmann. 2012. MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics (2012).

[54]

Hibiki Taguchi, Xin Liu, and Tsuyoshi Murata. 2021. Graph convolutional networks for graphs containing missing features. Future Generation Computer Systems, Vol. 117 (2021), 155--168.

[55]

Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. 2001 a. Missing value estimation methods for DNA microarrays. Bioinformatics, Vol. 17, 6 (2001), 520--525.

[56]

Olga G. Troyanskaya, Michael N. Cantor, Gavin Sherlock, Patrick O. Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. 2001 b. Missing value estimation methods for DNA microarrays. Bioinform., Vol. 17, 6 (2001), 520--525.

[57]

Stef Van Buuren. 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical methods in medical research, Vol. 16, 3 (2007), 219--242.

[58]

Stef Van Buuren and Karin Groothuis-Oudshoorn. 2011. mice: Multivariate imputation by chained equations in R. Journal of statistical software, Vol. 45 (2011), 1--67.

[59]

Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying graph convolutional networks. In International conference on machine learning. PMLR, 6861--6871.

[60]

Yangyang Wu, Jun Wang, Xiaoye Miao, Wenjia Wang, and Jianwei Yin. 2023. Differentiable and scalable generative adversarial models for data imputation. IEEE Transactions on Knowledge and Data Engineering (2023).

Digital Library

[61]

Yuankai Wu, Dingyi Zhuang, Aurelie Labbe, and Lijun Sun. 2021. Inductive Graph Neural Networks for Spatiotemporal Kriging. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4478--4485.

[62]

Jinsung Yoon, James Jordon, and Mihaela Schaar. 2018. Gain: Missing data imputation using generative adversarial nets. In International conference on machine learning. PMLR, 5689--5698.

[63]

Jiaxuan You, Xiaobai Ma, Yi Ding, Mykel J Kochenderfer, and Jure Leskovec. 2020. Handling missing data with graph representation learning. Advances in Neural Information Processing Systems, Vol. 33 (2020), 19075--19087.

[64]

Xin Zheng, Yixin Liu, Shirui Pan, Miao Zhang, Di Jin, and Philip S Yu. 2022. Graph neural networks for graphs with heterophily: A survey. arXiv preprint arXiv:2202.07082 (2022).

[65]

Kaixiong Zhou, Xiao Huang, Daochen Zha, Rui Chen, Li Li, Soo-Hyun Choi, and Xia Hu. 2021. Dirichlet energy constrained learning for deep graph neural networks. Advances in Neural Information Processing Systems, Vol. 34 (2021).

[66]

Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. 2020. Beyond homophily in graph neural networks: Current limitations and effective designs. Advances in Neural Information Processing Systems, Vol. 33 (2020), 7793--7804.

Index Terms

Data Imputation from the Perspective of Graph Dirichlet Energy
1. Computing methodologies
  1. Artificial intelligence
  2. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

A new imputation method for small software project data sets

Effort prediction is a very important issue for software project management. Historical project data sets are frequently used to support such prediction. But missing data are often contained in these data sets and this makes prediction more difficult. ...
Imputation of missing values in multi-view data
Abstract
Data for which a set of objects is described by multiple distinct feature sets (called views) is known as multi-view data. When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This may lead ...
Highlights
- A new imputation method for multi-view data is introduced.
- The new method shows competitive results at a much lower computational cost.
- The new method allows state-of-the-art algorithms to be used in much larger data sets than ...
Multiple imputation and genetic programming for classification with incomplete data
GECCO '17: Proceedings of the Genetic and Evolutionary Computation Conference

Many industrial and research datasets suffer from an unavoidable issue of missing values. One of the most common approaches to solving classification with incomplete data is to use an imputation method to fill missing values with plausible values before ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management

October 2024

5705 pages

ISBN:9798400704369

DOI:10.1145/3627673

General Chairs:
Edoardo Serra
Boise State University, USA
,
Francesca Spezzano
Boise State University, USA

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 October 2024

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China Grant
the Nansha Key Area Science and Technology Project
the Guangzhou Industrial Information and Intelligent Key Laboratory Project

Conference

CIKM '24

Sponsor:

SIGIR

CIKM '24: The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

ID, Boise, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

CIKM '25

Sponsor:
sigir
sigir

The 34th ACM International Conference on Information and Knowledge Management

November 10 - 14, 2025

Seoul , Republic of Korea

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
104
Total Downloads

Downloads (Last 12 months)104
Downloads (Last 6 weeks)23

Reflects downloads up to 13 Feb 2025

Other Metrics

View Author Metrics

Citations

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten