
A Unified View of Causal and Non-causal Feature Selection

Published: 18 April 2021 Publication History

Abstract

In this article, we develop a unified view of causal and non-causal feature selection methods, filling a gap in research on the relation between the two types of methods. Based on the Bayesian network framework and information theory, we first show that causal and non-causal feature selection methods share the same objective: to find the Markov blanket of a class attribute, the theoretically optimal feature set for classification. We then examine the assumptions made by causal and non-causal feature selection methods when searching for the optimal feature set, and unify these assumptions by mapping them to restrictions on the structure of the Bayesian network model of the studied problem. We further analyze in detail how the structural assumptions lead to the different levels of approximation employed by the methods in their search, which in turn result in approximations in the feature sets found by the methods with respect to the optimal feature set. With the unified view, we can interpret the output of non-causal methods from a causal perspective and derive error bounds for both types of methods. Finally, we present a practical understanding of the relation between causal and non-causal methods through extensive experiments with synthetic data and various types of real-world data.
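As a minimal illustration of the Markov blanket idea the abstract builds on (a sketch for intuition, not code from the article), the following shows how the Markov blanket of a class node is read off a known Bayesian network structure: its parents, its children, and its spouses (the other parents of its children). Given the true network, every other variable is conditionally independent of the class given this set, which is why it is the theoretically optimal feature set. The DAG and node names below are hypothetical.

```python
def markov_blanket(dag, target):
    """dag: dict mapping each node to the set of its parents (a DAG).
    Returns the Markov blanket of `target`: parents, children, spouses."""
    parents = set(dag.get(target, set()))
    # children: nodes that list `target` among their parents
    children = {n for n, ps in dag.items() if target in ps}
    # spouses: other parents of the target's children
    spouses = set()
    for c in children:
        spouses |= set(dag[c]) - {target}
    return parents | children | spouses

# Hypothetical DAG: D -> A, A -> T, T -> C, B -> C (T is the class attribute)
dag = {
    "A": {"D"},
    "T": {"A"},
    "C": {"T", "B"},
    "B": set(),
    "D": set(),
}
print(sorted(markov_blanket(dag, "T")))  # -> ['A', 'B', 'C']
```

Note that D, an ancestor of T, is excluded: once the parent A is known, D carries no further information about T, which is the sense in which the Markov blanket is sufficient for classification.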

Supplementary Material

a63-yu-suppl.pdf (yu.zip)
Supplemental movie, appendix, image, and software files for "A Unified View of Causal and Non-causal Feature Selection"




    Published In

    ACM Transactions on Knowledge Discovery from Data  Volume 15, Issue 4
    August 2021
    486 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/3458847
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 18 April 2021
    Accepted: 01 November 2020
    Revised: 01 September 2020
    Received: 01 August 2019
    Published in TKDD Volume 15, Issue 4


    Author Tags

    1. Bayesian network
    2. causal feature selection
    3. Markov blanket
    4. mutual information
    5. non-causal feature selection

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Anhui Province Key Research and Development Plan
    • Australian Research Council Discovery (ARC) Projects
    • National Natural Science Foundation of China
    • National Key Research and Development Program of China
    • Open Project Foundation of Intelligent Information Processing Key Laboratory of Shanxi Province


    Article Metrics

    • Downloads (Last 12 months)177
    • Downloads (Last 6 weeks)22
    Reflects downloads up to 10 Nov 2024


    Citations

    Cited By

    • (2024) Data-Driven Causal Effect Estimation Based on Graphical Causal Modelling: A Survey. ACM Computing Surveys 56, 5 (2024), 1--37. DOI: 10.1145/3636423. Online publication date: 12-Jan-2024.
    • (2024) Causal Feature Selection With Dual Correction. IEEE Transactions on Neural Networks and Learning Systems 35, 1 (2024), 938--951. DOI: 10.1109/TNNLS.2022.3178075. Online publication date: Jan-2024.
    • (2024) Online Learning for Data Streams With Incomplete Features and Labels. IEEE Transactions on Knowledge and Data Engineering 36, 9 (2024), 4820--4834. DOI: 10.1109/TKDE.2024.3374357. Online publication date: 1-Sep-2024.
    • (2024) Gradient-Based Local Causal Structure Learning. IEEE Transactions on Cybernetics 54, 1 (2024), 486--495. DOI: 10.1109/TCYB.2023.3237635. Online publication date: Jan-2024.
    • (2024) Stable Learning via Triplex Learning. IEEE Transactions on Artificial Intelligence 5, 10 (2024), 5267--5276. DOI: 10.1109/TAI.2024.3404411. Online publication date: Oct-2024.
    • (2024) Stable Soft Sensor Modeling for Industrial Systems. In 2024 IEEE 7th International Conference on Industrial Cyber-Physical Systems (ICPS), 1--6. DOI: 10.1109/ICPS59941.2024.10640050. Online publication date: 12-May-2024.
    • (2024) Local Bayesian Network Structure Learning for High-Dimensional Data. In 2024 9th International Conference on Control and Robotics Engineering (ICCRE), 259--263. DOI: 10.1109/ICCRE61448.2024.10589754. Online publication date: 10-May-2024.
    • (2024) A novel ensemble causal feature selection approach with mutual information and group fusion strategy for multi-label data. International Journal of Intelligent Computing and Cybernetics 17, 4 (2024), 671--704. DOI: 10.1108/IJICC-04-2024-0144. Online publication date: 22-Jul-2024.
    • (2024) Discovering causally invariant features for out-of-distribution generalization. Pattern Recognition 150 (2024). DOI: 10.1016/j.patcog.2024.110338. Online publication date: 2-Jul-2024.
    • (2024) Continuous causal structure learning from incremental instances and feature spaces. Information Fusion 101 (2024). DOI: 10.1016/j.inffus.2023.101975. Online publication date: 1-Jan-2024.
