
A Unified View of Causal and Non-causal Feature Selection

Published: 18 April 2021 Publication History

Abstract

In this article, we develop a unified view of causal and non-causal feature selection methods, filling a gap in research on the relation between the two types of methods. Based on the Bayesian network framework and information theory, we first show that causal and non-causal feature selection methods share the same objective: to find the Markov blanket of a class attribute, the theoretically optimal feature set for classification. We then examine the assumptions made by causal and non-causal feature selection methods when searching for the optimal feature set, and unify these assumptions by mapping them to restrictions on the structure of the Bayesian network model of the studied problem. We further analyze in detail how the structural assumptions lead to the different levels of approximation employed by the methods in their search, which in turn result in approximations in the feature sets found by the methods with respect to the optimal feature set. With the unified view, we can interpret the output of non-causal methods from a causal perspective and derive error bounds for both types of methods. Finally, we present a practical understanding of the relation between causal and non-causal methods through extensive experiments with synthetic data and various types of real-world data.
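As a minimal illustration of the Markov blanket idea the abstract builds on (a sketch for intuition, not code from the article), the following shows how the Markov blanket of a class node is read off a known Bayesian network structure: its parents, its children, and its spouses (the other parents of its children). Given the true network, every other variable is conditionally independent of the class given this set, which is why it is the theoretically optimal feature set. The DAG and node names below are hypothetical.

```python
def markov_blanket(dag, target):
    """dag: dict mapping each node to the set of its parents (a DAG).
    Returns the Markov blanket of `target`: parents, children, spouses."""
    parents = set(dag.get(target, set()))
    # children: nodes that list `target` among their parents
    children = {n for n, ps in dag.items() if target in ps}
    # spouses: other parents of the target's children
    spouses = set()
    for c in children:
        spouses |= set(dag[c]) - {target}
    return parents | children | spouses

# Hypothetical DAG: D -> A, A -> T, T -> C, B -> C (T is the class attribute)
dag = {
    "A": {"D"},
    "T": {"A"},
    "C": {"T", "B"},
    "B": set(),
    "D": set(),
}
print(sorted(markov_blanket(dag, "T")))  # -> ['A', 'B', 'C']
```

Note that D, an ancestor of T, is excluded: once the parent A is known, D carries no further information about T, which is the sense in which the Markov blanket is sufficient for classification.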

Supplementary Material

a63-yu-suppl.pdf (yu.zip)
Supplemental movie, appendix, image, and software files for "A Unified View of Causal and Non-causal Feature Selection"




    Published In

    ACM Transactions on Knowledge Discovery from Data  Volume 15, Issue 4
    August 2021
    486 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/3458847
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 18 April 2021
    Accepted: 01 November 2020
    Revised: 01 September 2020
    Received: 01 August 2019
    Published in TKDD Volume 15, Issue 4


    Author Tags

    1. Bayesian network
    2. causal feature selection
    3. Markov blanket
    4. mutual information
    5. non-causal feature selection

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Anhui Province Key Research and Development Plan
    • Australian Research Council Discovery (ARC) Projects
    • National Natural Science Foundation of China
    • National Key Research and Development Program of China
    • Open Project Foundation of Intelligent Information Processing Key Laboratory of Shanxi Province


    Article Metrics

    • Downloads (Last 12 months)177
    • Downloads (Last 6 weeks)22
    Reflects downloads up to 10 Nov 2024


    Citations

    Cited By

    • (2024) Data-Driven Causal Effect Estimation Based on Graphical Causal Modelling: A Survey. ACM Computing Surveys 56, 5 (2024), 1--37. DOI: 10.1145/3636423. Online publication date: 12-Jan-2024.
    • (2024) Causal Feature Selection With Dual Correction. IEEE Transactions on Neural Networks and Learning Systems 35, 1 (2024), 938--951. DOI: 10.1109/TNNLS.2022.3178075. Online publication date: Jan-2024.
    • (2024) Online Learning for Data Streams With Incomplete Features and Labels. IEEE Transactions on Knowledge and Data Engineering 36, 9 (2024), 4820--4834. DOI: 10.1109/TKDE.2024.3374357. Online publication date: 1-Sep-2024.
    • (2024) Gradient-Based Local Causal Structure Learning. IEEE Transactions on Cybernetics 54, 1 (2024), 486--495. DOI: 10.1109/TCYB.2023.3237635. Online publication date: Jan-2024.
    • (2024) Stable Learning via Triplex Learning. IEEE Transactions on Artificial Intelligence 5, 10 (2024), 5267--5276. DOI: 10.1109/TAI.2024.3404411. Online publication date: Oct-2024.
    • (2024) Stable Soft Sensor Modeling for Industrial Systems. In 2024 IEEE 7th International Conference on Industrial Cyber-Physical Systems (ICPS), 1--6. DOI: 10.1109/ICPS59941.2024.10640050. Online publication date: 12-May-2024.
    • (2024) Local Bayesian Network Structure Learning for High-Dimensional Data. In 2024 9th International Conference on Control and Robotics Engineering (ICCRE), 259--263. DOI: 10.1109/ICCRE61448.2024.10589754. Online publication date: 10-May-2024.
    • (2024) A novel ensemble causal feature selection approach with mutual information and group fusion strategy for multi-label data. International Journal of Intelligent Computing and Cybernetics 17, 4 (2024), 671--704. DOI: 10.1108/IJICC-04-2024-0144. Online publication date: 22-Jul-2024.
    • (2024) Discovering causally invariant features for out-of-distribution generalization. Pattern Recognition 150 (2024). DOI: 10.1016/j.patcog.2024.110338. Online publication date: 2-Jul-2024.
    • (2024) Continuous causal structure learning from incremental instances and feature spaces. Information Fusion 101 (2024). DOI: 10.1016/j.inffus.2023.101975. Online publication date: 1-Jan-2024.
