
DOI: 10.5555/3692070.3693956

Contextual feature selection with conditional stochastic gates

Published: 21 July 2024

Abstract

Feature selection is a crucial tool in machine learning and is widely applied across various scientific disciplines. Traditional supervised methods generally identify a universal set of informative features for the entire population. However, feature relevance often varies with context, while the context itself may not directly affect the outcome variable. Here, we propose a novel architecture for contextual feature selection where the subset of selected features is conditioned on the value of context variables. Our new approach, Conditional Stochastic Gates (c-STG), models the importance of features using conditional Bernoulli variables whose parameters are predicted based on contextual variables. We introduce a hypernetwork that maps context variables to feature selection parameters to learn the context-dependent gates along with a prediction model. We further present a theoretical analysis of our model, indicating that it can improve performance and flexibility over population-level methods in complex feature selection settings. Finally, we conduct an extensive benchmark using simulated and real-world datasets across multiple domains, demonstrating that c-STG can lead to improved feature selection capabilities while enhancing prediction accuracy and interpretability.
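
The architecture described in the abstract is compact enough to illustrate directly. The snippet below is a minimal PyTorch sketch, not the authors' implementation: it assumes a Gaussian-based relaxation of the Bernoulli gates (as used in earlier stochastic-gates work), and the layer sizes, noise scale, and regression head are illustrative choices.

```python
# Minimal sketch of conditional stochastic gates (c-STG), assuming a Gaussian
# relaxation of the Bernoulli gates; all sizes and names here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalSTG(nn.Module):
    """Hypernetwork maps context -> per-feature gate parameters; a prediction
    model consumes the gated features."""

    def __init__(self, num_features, context_dim, hidden_dim=64, sigma=0.5):
        super().__init__()
        self.sigma = sigma  # noise scale of the relaxed gates (assumed value)
        # Hypernetwork: context variables -> one gate mean mu_d per feature.
        self.hypernet = nn.Sequential(
            nn.Linear(context_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_features),
        )
        # Prediction model applied to the gated feature vector.
        self.predictor = nn.Sequential(
            nn.Linear(num_features, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def gates(self, context):
        mu = self.hypernet(context)
        noise = self.sigma * torch.randn_like(mu) if self.training else 0.0
        # Relaxed Bernoulli gate: a shifted Gaussian clamped into [0, 1].
        z = torch.clamp(mu + 0.5 + noise, 0.0, 1.0)
        return z, mu

    def forward(self, x, context):
        z, mu = self.gates(context)
        return self.predictor(x * z), mu


def cstg_loss(pred, target, mu, sigma=0.5, lam=0.1):
    # Prediction loss plus the expected number of open gates:
    # P(z_d > 0) = Phi((mu_d + 0.5) / sigma), summed over features.
    open_prob = 0.5 * (1.0 + torch.erf((mu + 0.5) / (sigma * 2.0 ** 0.5)))
    return F.mse_loss(pred, target) + lam * open_prob.sum(dim=-1).mean()


# Example usage with synthetic shapes:
# model = ConditionalSTG(num_features=20, context_dim=3)
# pred, mu = model(torch.randn(8, 20), torch.randn(8, 3))
# loss = cstg_loss(pred, torch.randn(8, 1), mu)
```

Because the gate parameters depend only on the context variables, the learned mapping can be read out per context: at test time, the deterministic gate values clamp(mu + 0.5, 0, 1) indicate which features the model selects for that context, matching the abstract's goal of context-dependent, interpretable feature selection.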



Published In

ICML'24: Proceedings of the 41st International Conference on Machine Learning
July 2024
63010 pages

Publisher

JMLR.org

Publication History

Published: 21 July 2024

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%
