Abstract
While deep neural networks (DNNs) have achieved impressive performance on a wide variety of tasks, their black-box nature hinders their applicability to high-risk decision-making fields. In such fields, besides accurate prediction, it is also desirable to provide interpretable insights into DNNs, e.g., by screening important features based on their contributions to predictive accuracy. To improve the interpretability of DNNs, this paper proposes a new feature selection algorithm for DNNs that integrates the knockoff technique with the distribution information of irrelevant features. With the help of knockoff features and the central limit theorem, we show that the statistic of an irrelevant feature follows a known Gaussian distribution under mild conditions. This information is used in hypothesis testing to discover the key features associated with the DNN. Empirical evaluations on simulated data demonstrate that the proposed method selects more truly informative features and attains higher F1 scores. The Friedman test and the post-hoc Nemenyi test are employed to validate the superiority of the proposed method. Finally, we apply the method to Coronal Mass Ejection (CME) data and uncover the key features that contribute to DNN-based CME arrival time prediction.
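To make the pipeline described above concrete, the following is a minimal, self-contained sketch of knockoff-based feature screening with a Gaussian null on the feature statistics. It is illustrative only and is not the authors' implementation: it substitutes absolute marginal correlation for the DNN-derived importance measure, constructs knockoffs for an identity-covariance Gaussian design (where an independent Gaussian copy is a valid knockoff), uses a robust (MAD) estimate of the null scale, and applies one-sided z-tests with Benjamini-Hochberg control. All variable names and the toy data-generating process are assumptions made for the sketch.

```python
# Minimal sketch (assumptions noted in the lead-in): knockoff-style feature
# screening with a Gaussian null on the statistics W_j = Z_j - Z~_j.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, k = 500, 50, 5                       # samples, features, true signals

# Toy data: Gaussian design with identity covariance and a sparse linear signal.
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 2.0
y = X @ beta + rng.standard_normal(n)

# For identity covariance, an independent Gaussian copy is a valid knockoff.
X_knock = rng.standard_normal((n, p))

# Importance scores: absolute marginal correlation as a stand-in for the
# DNN-based importance measure used in the paper.
z = np.abs(X.T @ y) / n
zk = np.abs(X_knock.T @ y) / n
W = z - zk                                  # knockoff feature statistics

# Gaussian null: estimate the null scale robustly (MAD), then run one-sided
# z-tests of H0: W_j ~ N(0, sigma^2) against W_j > 0.
sigma = np.median(np.abs(W)) / 0.6745
pvals = 1.0 - stats.norm.cdf(W / sigma)

# Benjamini-Hochberg procedure at level q on the resulting p-values.
q = 0.1
order = np.argsort(pvals)
thresh = q * np.arange(1, p + 1) / p
passed = np.nonzero(pvals[order] <= thresh)[0]
selected = np.sort(order[: passed[-1] + 1]) if passed.size else np.array([], dtype=int)
print("selected features:", list(selected))
```

With the toy signal above, the selected set should recover most of the first five features; in the paper's setting, the importance scores would instead come from the fitted DNN.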
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 12071166 and 11771130, and by the Fundamental Research Funds for the Central Universities of China under Grants 2662020LXQD002 and 2662019FW003. The corresponding author is Hong Chen.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zhao, X., Li, W., Chen, H. et al. Distribution-dependent feature selection for deep neural networks. Appl Intell 52, 4432–4442 (2022). https://doi.org/10.1007/s10489-021-02663-1