Simple Stopping Criteria for Information Theoretic Feature Selection
Abstract
1. Introduction
- Efficient stopping criteria for feature selection are still missing in the literature. We present two rules based on the recently proposed matrix-based estimators of information theoretic descriptors of Rényi’s α-entropy in reproducing kernel Hilbert spaces (RKHS). The properties, utility, and possible applications of these estimators are still largely unknown to practitioners;
- Our criteria are extremely simple and flexible. First, the proposed tests and stopping criteria can be incorporated into any sequential feature selection procedure, such as mutual-information-based feature selection (MIFS) [4] and maximum-relevance minimum-redundancy (MRMR) [24]. Second, benefiting from the higher accuracy of the estimation, we demonstrate that one can simply use a threshold in a heuristic manner, which is very valuable for practitioners across different domains (a schematic sketch of such a thresholded selection loop follows this list).
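As a minimal sketch (not the implementation used in the paper), the loop below shows how a threshold-based stopping rule plugs into a generic sequential forward selection procedure. The estimator `cmi(x_k, X_sel, y)` stands for any conditional mutual information estimator that also accepts an empty conditioning set, and the threshold `eps` is a hypothetical user-chosen parameter.

```python
# Minimal sketch of a sequential forward selection loop with a threshold-based
# stopping rule. `cmi(x_k, X_sel, y)` stands for any estimator of the
# conditional mutual information I(x_k; y | X_sel) that also accepts an empty
# conditioning set; `eps` is a hypothetical user-chosen threshold.
import numpy as np

def forward_select(X, y, cmi, eps=1e-3, max_features=None):
    n_features = X.shape[1]
    remaining = list(range(n_features))
    selected = []
    while remaining and (max_features is None or len(selected) < max_features):
        # Score each remaining feature by the extra information it carries
        # about y, given the features that are already selected.
        scores = [cmi(X[:, [k]], X[:, selected], y) for k in remaining]
        best = int(np.argmax(scores))
        # Stopping rule: if even the best candidate adds (almost) nothing,
        # terminate the selection.
        if scores[best] < eps:
            break
        selected.append(remaining.pop(best))
    return selected
```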
2. Simple Stopping Criteria for Information Theoretic Feature Selection
2.1. Matrix-Based Rényi’s α-Entropy Functional and Its Multivariate Extension
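As an illustrative sketch only (not the authors’ code), the snippet below shows how the matrix-based Rényi’s α-entropy of Giraldo et al. [23] can be estimated from the eigenvalues of a normalized Gram matrix, and how the multivariate extension [13] forms joint entropies from normalized Hadamard products. The Gaussian kernel, its width `sigma`, and the default order α = 1.01 are assumed choices.

```python
# Illustrative sketch of the matrix-based Rényi's alpha-entropy estimator [23]
# and its multivariate (Hadamard-product) extension [13]. The Gaussian kernel,
# its width `sigma`, and the default order alpha = 1.01 are assumed choices.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def normalized_gram(X, sigma=1.0):
    """Gram matrix with unit trace, built from a Gaussian kernel."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    K = np.exp(-squareform(pdist(X, 'sqeuclidean')) / (2.0 * sigma ** 2))
    return K / K.shape[0]          # K_ii = 1, so dividing by n gives trace 1

def renyi_entropy(A, alpha=1.01):
    """S_alpha(A) = (1 - alpha)^(-1) * log2( sum_i lambda_i(A)^alpha )."""
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)   # guard tiny negatives
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def joint_entropy(mats, alpha=1.01):
    """Joint entropy of several variables via their normalized Hadamard product."""
    H = mats[0].copy()
    for A in mats[1:]:
        H = H * A                                      # element-wise product
    return renyi_entropy(H / np.trace(H), alpha)

def conditional_mi(A_x, A_y, A_s, alpha=1.01):
    """I_alpha(x; y | s) = S(x, s) + S(y, s) - S(s) - S(x, y, s)."""
    return (joint_entropy([A_x, A_s], alpha) + joint_entropy([A_y, A_s], alpha)
            - renyi_entropy(A_s, alpha) - joint_entropy([A_x, A_y, A_s], alpha))
```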
2.2. Stopping Criteria Based on Conditional Mutual Information
Algorithm 1 CMI-permutation
Input: Feature set S; selected feature subset; class labels y; feature selected in the current iteration; permutation number P; significance level.
Output: Decision (stop selection or continue selection).
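The body of Algorithm 1 is not reproduced above, so the snippet below is only a generic reconstruction of a permutation-based stopping test in the same spirit, not the authors’ exact procedure: the newly selected feature is permuted P times to emulate conditional independence, and selection stops when its observed CMI is not significantly larger than the permuted values. The estimator `cmi`, the permutation scheme, and the default significance level are assumptions.

```python
# Generic reconstruction of a permutation-based stopping test in the spirit of
# Algorithm 1 (CMI-permutation); not the authors' exact procedure. `cmi` is any
# conditional mutual information estimator; P and sig_level are assumptions.
import numpy as np

def cmi_permutation_stop(x_k, X_sel, y, cmi, P=100, sig_level=0.05, seed=None):
    """Return True ("stop selection") when the newly selected feature x_k does
    not carry significantly more information about y than chance, given X_sel."""
    rng = np.random.default_rng(seed)
    observed = cmi(x_k, X_sel, y)
    # Null distribution: shuffling x_k destroys its dependence on (X_sel, y),
    # emulating the hypothesis that x_k adds no information about y.
    null = np.array([cmi(x_k[rng.permutation(len(x_k))], X_sel, y)
                     for _ in range(P)])
    p_value = (1 + np.sum(null >= observed)) / (P + 1)
    return p_value > sig_level      # True -> stop, False -> continue selecting
```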
3. Experiments and Discussions
4. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
MI | Mutual information
CMI | Conditional mutual information
RKHS | Reproducing kernel Hilbert space
MB | Markov blanket
PDF | Probability density function
References
1. Vergara, J.R.; Estévez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2014, 24, 175–186.
2. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
3. Lu, Y.; Fan, Y.; Lv, J.; Noble, W.S. DeepPINK: Reproducible feature selection in deep neural networks. In Proceedings of the 2018 Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 8690–8700.
4. Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 1994, 5, 537–550.
5. Meyer, P.E.; Schretter, C.; Bontempi, G. Information-theoretic feature selection in microarray data using variable complementarity. IEEE J. Sel. Top. Signal Process. 2008, 2, 261–274.
6. Gurban, M.; Thiran, J.P. Information theoretic feature extraction for audio-visual speech recognition. IEEE Trans. Signal Process. 2009, 57, 4765–4776.
7. Eriksson, T.; Kim, S.; Kang, H.G.; Lee, C. An information-theoretic perspective on feature selection in speaker recognition. IEEE Signal Process. Lett. 2005, 12, 500–503.
8. Brown, G.; Pocock, A.; Zhao, M.J.; Luján, M. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 2012, 13, 27–66.
9. Fleuret, F. Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 2004, 5, 1531–1555.
10. Ross, B.C. Mutual information between discrete and continuous data sets. PLoS ONE 2014, 9, e87357.
11. Singha, S.; Shenoy, P.P. An adaptive heuristic for feature selection based on complementarity. Mach. Learn. 2018, 107, 1–45.
12. Vinh, N.X.; Zhou, S.; Chan, J.; Bailey, J. Can high-order dependencies improve mutual information based feature selection? Pattern Recognit. 2016, 53, 46–58.
13. Yu, S.; Giraldo, L.G.S.; Jenssen, R.; Principe, J.C. Multivariate extension of matrix-based Rényi’s α-order entropy functional. arXiv 2018, arXiv:1808.07912.
14. François, D.; Rossi, F.; Wertz, V.; Verleysen, M. Resampling methods for parameter-free and robust feature selection with mutual information. Neurocomputing 2007, 70, 1276–1288.
15. Gómez-Verdejo, V.; Verleysen, M.; Fleury, J. Information-theoretic feature selection for the classification of hysteresis curves. In International Work-Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2007; pp. 522–529.
16. Koller, D.; Sahami, M. Toward Optimal Feature Selection; The Stanford University InfoLab: Stanford, CA, USA, 1996.
17. Yaramakala, S.; Margaritis, D. Speculative Markov blanket discovery for optimal feature selection. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), Houston, TX, USA, 27–30 November 2005; pp. 809–812.
18. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
19. Vinh, N.X.; Chan, J.; Bailey, J. Reconsidering mutual information based feature selection: A statistical significance view. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI’14), Quebec City, QC, Canada, 27–31 July 2014; pp. 2092–2098.
20. Good, P. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses; Springer: Berlin/Heidelberg, Germany, 2013.
21. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138.
22. Campos, L.M. A scoring function for learning Bayesian networks based on mutual information and conditional independence tests. J. Mach. Learn. Res. 2006, 7, 2149–2187.
23. Giraldo, L.G.S.; Rao, M.; Principe, J.C. Measures of entropy from data using infinitely divisible kernels. IEEE Trans. Inf. Theory 2015, 61, 535–548.
24. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
25. Rényi, A. On Measures of Entropy and Information; Hungarian Academy of Sciences: Budapest, Hungary, 1961.
26. Principe, J.C. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives; Springer: Berlin/Heidelberg, Germany, 2010.
27. Ohya, M.; Petz, D. Quantum Entropy and Its Use; Springer: Berlin/Heidelberg, Germany, 2004.
28. Bhatia, R. Infinitely divisible matrices. Am. Math. Mon. 2006, 113, 221–235.
29. Müller-Lennert, M.; Dupuis, F.; Szehr, O.; Fehr, S.; Tomamichel, M. On quantum Rényi entropies: A new generalization and some properties. J. Math. Phys. 2013, 54, 122203.
30. Mosonyi, M.; Hiai, F. On the quantum Rényi relative entropies and related capacity formulas. IEEE Trans. Inf. Theory 2011, 57, 2474–2487.
31. Verdú, S. α-mutual information. In Proceedings of the 2015 IEEE Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015; pp. 1–6.
32. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–27.
33. Li, J.; Liu, H. Challenges of feature selection for big data analytics. IEEE Intell. Syst. 2017, 32, 9–15.
34. Martin, C.D.; Porter, M.A. The extraordinary SVD. Am. Math. Mon. 2012, 119, 838–851.
Dataset (total #F) | CMI-Heuristic: #F | acc. | rank | CMI-Permutation: #F | acc. | rank | MI-χ² [19]: #F | acc. | rank | MI-Permutation [14]: #F | acc. | rank | “Optimal”: #F | acc.
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
waveform (21) | 11 | 84.7 ± 1.8 | 1 | 4 | 78.3 ± 1.8 | 3 | 3 | 76.1 ± 1.8 | 4 | 5 | 80.2 ± 1.8 | 2 | 11 | 84.7 ± 1.8 |
breast (30) | 2 | 92.3 ± 1.7 | 1 | 2 | 92.3 ± 1.7 | 1 | 2 | 92.3 ± 1.7 | 1 | 2 | 92.3 ± 1.7 | 1 | 2 | 95.2 ± 1.2 |
heart (13) | 13 | 81.7 ± 3.5 | 4 | 4 | 80.4 ± 3.3 | 1 | 2 | 76.9 ± 3.8 | 3 | 4 | 80.4 ± 3.3 | 1 | 6 | 82.6 ± 3.0 |
spect (22) | 22 | 80.6 ± 3.6 | 4 | 11 | 82.1 ± 3.3 | 1 | 1 | 80.1 ± 3.3 | 3 | 7 | 81.1 ± 3.3 | 2 | 11 | 82.1 ± 3.3 |
ionosphere (34) | 15 | 83.3 ± 2.8 | 1 | 7 | 81.8 ± 2.8 | 2 | 1 | 76.7 ± 3.2 | 4 | 7 | 81.8 ± 2.8 | 2 | 33 | 85.3 ± 3.0 |
parkinsons (22) | 12 | 85.2 ± 3.7 | 1 | 4 | 85.0 ± 3.2 | 2 | 1 | 85.1 ± 3.5 | 4 | 4 | 85.0 ± 3.2 | 2 | 9 | 86.5 ± 3.4 |
semeion (256) | 59 | 86.1 ± 1.3 | 1 | 20 | 77.7 ± 1.5 | 2 | 4 | 49.6 ± 1.7 | 4 | 20 | 77.7 ± 1.5 | 2 | 73 | 93.3 ± 1.3 |
Lung (325) | 5 | 74.2 ± 7.7 | 3 | 10 | 73.9 ± 8.0 | 2 | 1 | 46.5 ± 7.5 | 4 | 13 | 79.1 ± 7.9 | 1 | 41 | 84.3 ± 6.5 |
Lympth (4026) | 6 | 81.3 ± 5.8 | 1 | 248 | 88.7 ± 6.1 | 3 | 2 | 62.8 ± 6.5 | 2 | 249 | 88.9 ± 6.2 | 4 | 70 | 90.7 ± 5.4 |
Madelon (500) | 3 | 69.5 ± 1.6 | 2 | 2 | 59.5 ± 1.6 | 3 | 4 | 76.7 ± 1.5 | 1 | 2 | 59.5 ± 1.6 | 3 | 4 | 76.7 ± 1.5 |
average rank | | | 1.9 | | | 2.0 | | | 3.0 | | | 2.0 | |
Dataset (total #F) | CMI-Heuristic: #F | acc. | rank | CMI-Permutation: #F | acc. | rank | MI-χ² [19]: #F | acc. | rank | MI-Permutation [14]: #F | acc. | rank | “Optimal”: #F | acc.
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
waveform (21) | 17 | 84.2 ± 1.9 | 1 | 4 | 78.2 ± 2.1 | 2 | 3 | 76.1 ± 2.0 | 4 | 4 | 78.2 ± 2.1 | 2 | 11 | 84.7 ± 1.8 |
breast (30) | 1 | 90.9 ± 1.5 | 2 | 1 | 90.9 ± 1.5 | 2 | 2 | 95.2 ± 1.2 | 1 | 1 | 90.9 ± 1.5 | 2 | 2 | 95.2 ± 1.2 |
heart (13) | 6 | 82.6 ± 3.0 | 1 | 1 | 55.0 ± 4.4 | 3 | 2 | 76.9 ± 3.8 | 2 | 1 | 55.0 ± 4.4 | 3 | 6 | 82.6 ± 3.0 |
spect (22) | 17 | 80.2 ± 4.2 | 1 | 2 | 78.2 ± 3.2 | 3 | 1 | 79.6 ± 3.2 | 4 | 3 | 79.1 ± 3.2 | 2 | 11 | 81.4 ± 4.2 |
ionosphere (34) | 18 | 84.7 ± 2.9 | 1 | 8 | 81.9 ± 3.5 | 2 | 1 | 76.4 ± 3.9 | 4 | 8 | 81.9 ± 3.5 | 2 | 33 | 85.3 ± 3.0 |
parkinsons (22) | 13 | 83.7 ± 3.7 | 1 | 4 | 84.6 ± 3.2 | 2 | 1 | 77.0 ± 5.0 | 4 | 3 | 84.6 ± 3.4 | 3 | 9 | 86.0 ± 3.9 |
semeion (256) | 53 | 85.5 ± 1.1 | 1 | 40 | 82.7 ± 1.4 | 2 | 4 | 49.6 ± 1.7 | 4 | 40 | 82.7 ± 1.4 | 2 | 73 | 93.3 ± 1.3 |
Lung (325) | 9 | 76.1 ± 7.9 | 2 | 5 | 74.2 ± 7.7 | 3 | 1 | 46.5 ± 7.5 | 4 | 11 | 73.0 ± 7.1 | 1 | 41 | 84.3 ± 6.5 |
Lympth (4026) | 8 | 86.0 ± 5.3 | 3 | 65 | 89.6 ± 5.5 | 1 | 2 | 62.8 ± 6.5 | 4 | 64 | 89.6 ± 5.5 | 2 | 70 | 90.7 ± 5.4 |
Madelon (500) | 3 | 69.5 ± 1.6 | 2 | 1 | 52.4 ± 1.7 | 3 | 4 | 76.7 ± 1.5 | 1 | 1 | 52.4 ± 1.7 | 3 | 4 | 76.7 ± 1.5 |
average rank | | | 1.5 | | | 2.3 | | | 3.2 | | | 2.2 | |
 | MI-χ² | MI-Permutation
---|---|---
CMI-heuristic | 0.0781 (1) | 0.5455 (0)
CMI-permutation | 0.0561 (1) | 0.9036 (0)
 | MI-χ² | MI-Permutation
---|---|---
CMI-heuristic | 0.0081 (1) | 0.0341 (1)
CMI-permutation | 0.0587 (1) | 0.7340 (0)
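The two small tables above list pairwise p-values between the proposed criteria and the baselines, with the parenthesized 1/0 indicating the binary test decision. The test statistic behind these values is not reproduced here; purely as an illustration, the sketch below shows how such paired comparisons over the ten datasets could be computed, assuming a Wilcoxon signed-rank test over per-dataset accuracies (the choice of test and the significance level are assumptions, not taken from the paper).

```python
# Illustration only: one way to obtain pairwise p-values like those above,
# assuming a Wilcoxon signed-rank test over the per-dataset accuracies of two
# stopping criteria. The choice of test is an assumption, not taken from the
# paper; `sig_level` is likewise a hypothetical threshold.
import numpy as np
from scipy.stats import wilcoxon

def pairwise_pvalue(acc_a, acc_b, sig_level=0.1):
    """Paired comparison of two methods evaluated on the same datasets."""
    stat, p = wilcoxon(np.asarray(acc_a), np.asarray(acc_b))
    return p, int(p < sig_level)    # (p-value, 1 if null rejected else 0)

# Hypothetical usage with accuracy vectors of equal length (one entry per dataset):
# p, reject = pairwise_pvalue(acc_cmi_heuristic, acc_mi_chi2)
```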
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).