
DOI: 10.5555/3692070.3693956

Contextual feature selection with conditional stochastic gates

Published: 21 July 2024

Abstract

Feature selection is a crucial tool in machine learning and is widely applied across various scientific disciplines. Traditional supervised methods generally identify a universal set of informative features for the entire population. However, feature relevance often varies with context, while the context itself may not directly affect the outcome variable. Here, we propose a novel architecture for contextual feature selection where the subset of selected features is conditioned on the value of context variables. Our new approach, Conditional Stochastic Gates (c-STG), models the importance of features using conditional Bernoulli variables whose parameters are predicted based on contextual variables. We introduce a hypernetwork that maps context variables to feature selection parameters to learn the context-dependent gates along with a prediction model. We further present a theoretical analysis of our model, indicating that it can improve performance and flexibility over population-level methods in complex feature selection settings. Finally, we conduct an extensive benchmark using simulated and real-world datasets across multiple domains, demonstrating that c-STG can lead to improved feature selection capabilities while enhancing prediction accuracy and interpretability.
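
The architecture described in the abstract is compact enough to illustrate directly. The snippet below is a minimal PyTorch sketch, not the authors' implementation: it assumes a Gaussian-based relaxation of the Bernoulli gates (as used in earlier stochastic-gates work), and the layer sizes, noise scale, and regression head are illustrative choices.

```python
# Minimal sketch of conditional stochastic gates (c-STG), assuming a Gaussian
# relaxation of the Bernoulli gates; all sizes and names here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalSTG(nn.Module):
    """Hypernetwork maps context -> per-feature gate parameters; a prediction
    model consumes the gated features."""

    def __init__(self, num_features, context_dim, hidden_dim=64, sigma=0.5):
        super().__init__()
        self.sigma = sigma  # noise scale of the relaxed gates (assumed value)
        # Hypernetwork: context variables -> one gate mean mu_d per feature.
        self.hypernet = nn.Sequential(
            nn.Linear(context_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_features),
        )
        # Prediction model applied to the gated feature vector.
        self.predictor = nn.Sequential(
            nn.Linear(num_features, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def gates(self, context):
        mu = self.hypernet(context)
        noise = self.sigma * torch.randn_like(mu) if self.training else 0.0
        # Relaxed Bernoulli gate: a shifted Gaussian clamped into [0, 1].
        z = torch.clamp(mu + 0.5 + noise, 0.0, 1.0)
        return z, mu

    def forward(self, x, context):
        z, mu = self.gates(context)
        return self.predictor(x * z), mu


def cstg_loss(pred, target, mu, sigma=0.5, lam=0.1):
    # Prediction loss plus the expected number of open gates:
    # P(z_d > 0) = Phi((mu_d + 0.5) / sigma), summed over features.
    open_prob = 0.5 * (1.0 + torch.erf((mu + 0.5) / (sigma * 2.0 ** 0.5)))
    return F.mse_loss(pred, target) + lam * open_prob.sum(dim=-1).mean()


# Example usage with synthetic shapes:
# model = ConditionalSTG(num_features=20, context_dim=3)
# pred, mu = model(torch.randn(8, 20), torch.randn(8, 3))
# loss = cstg_loss(pred, torch.randn(8, 1), mu)
```

Because the gate parameters depend only on the context variables, the learned mapping can be read out per context: at test time, the deterministic gate values clamp(mu + 0.5, 0, 1) indicate which features the model selects for that context, matching the abstract's goal of context-dependent, interpretable feature selection.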



Published In

ICML'24: Proceedings of the 41st International Conference on Machine Learning
July 2024
63010 pages

Publisher

JMLR.org

Publication History

Published: 21 July 2024

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%
