Abstract
This paper investigates sparse optimization problems with a sparse group structure, in which element-wise and group-wise sparsity are considered jointly. This model has proven effective in tasks such as feature selection, parameter estimation, and improving model interpretability. Central to our study is the \(\ell _0\) and \(\ell _{2,0}\) norm regularized model, which poses substantially greater computational challenges than its surrogate formulations. We begin by analyzing the optimality conditions of the sparse group optimization problem, leveraging the notion of a \(\gamma \)-stationary point, whose relationship to local and global minimizers is established. We then develop a novel subspace Newton algorithm for the sparse group \(\ell _0\) optimization problem and prove its global convergence as well as its local second-order convergence rate. Experimental results show that our algorithm outperforms several state-of-the-art solvers in both accuracy and computational efficiency.
Data availability
The data used in Sect. 4.1 are generated randomly by Matlab's built-in generation functions. The original image data in Sect. 4.2 are publicly available from: http://www0.cs.ucl.ac.uk/staff/b.jin/software/gpdasc.zip.
Acknowledgements
The authors would like to express their sincere appreciation to the editor and all reviewers for their valuable comments and suggestions, which have greatly improved this paper. The authors also thank fellow colleagues in the research group for their assistance, and the supervisors for their dedicated guidance. This work was supported by the National Key R&D Program of China (2021YFA1000403), the National Natural Science Foundation of China (Nos. U23B2012, 11991022), and the Fundamental Research Funds for the Central Universities.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Appendix A: Proof of Theorem 2.1
Lemma A.1
For any solution \(x\in {\textrm{PROX}}_{\lambda ,\mu }^{\gamma } (z)\), it holds that for all \(i\in {\textrm{supp}}(z)\), either \(x_i=z_i\) or \(x_i=0\), and for all \(i\notin {\textrm{supp}}(z)\), \(x_i=0\).
Proof
Define \( P_z(u):=\lambda \Vert u\Vert _{2,0}+\mu \Vert u\Vert _0+\frac{1}{2\gamma }\Vert u-z\Vert _2^2 \) and let \(x\in {\textrm{PROX}}_{\lambda ,\mu }^{\gamma } (z)=\arg \min _u P_z(u)\).
Assume there exists \(i\in {\textrm{supp}}(z)\) such that \(x_i\ne 0\) and \(x_i \ne z_i\). Construct a new vector \(\bar{x}\) by letting \(\bar{x}_i=z_i\ne 0\) and \(\bar{x}_l=x_l\) for \(l\ne i\). Since \(x\) and \(\bar{x}\) then have the same support, it follows immediately that
\( P_z(x)-P_z(\bar{x})=\frac{1}{2\gamma }(x_i-z_i)^2>0. \)
Hence, \(P_z(x) > P_z(\bar{x})\), which contradicts \(x\in \arg \min _u P_z(u)\). Consequently, for any \(i\in {\textrm{supp}}(z)\), either \(x_i=z_i\) or \(x_i=0\).
In a similar manner, assume there exists \(i\notin {\textrm{supp}}(z)\) such that \(x_i\ne 0\). Construct a new vector \(\bar{x}\) by letting \(\bar{x}_i=0=z_i\) and \(\bar{x}_l=x_l\) for \(l\ne i\). Since zeroing out \(x_i\) removes one nonzero element and strictly decreases the quadratic term, it follows immediately that
\( P_z(x)-P_z(\bar{x})\ge \mu +\frac{1}{2\gamma }x_i^2>0. \)
The same contradiction \(P_z(x) > P_z(\bar{x})\) is derived. Hence, for any \(i\notin {\textrm{supp}}(z)\), \(x_i=0\). \(\square \)
Lemma A.2
The set \(\overline{\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\), consisting of the solutions in \({\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\) with the most nonzero elements, is a singleton.
Proof
Define
\( P_{z_G}(u_{G_i}):=\lambda \Vert u_{G_i}\Vert _{2,0}+\mu \Vert u_{G_i}\Vert _0+\frac{1}{2\gamma }\Vert u_{G_i}-z_{G_i}\Vert _2^2, \)
where \(\Vert u_{G_i}\Vert _{2,0}=\left\{ \begin{aligned}&1,\quad \text{if}\ \Vert u_{G_i}\Vert _2 \ne 0, \\&0,\quad \text{if}\ \Vert u_{G_i}\Vert _2=0, \end{aligned}\right. \) and \(z_{G_i}\) denotes the subvector of \(z\) indexed by \(G_i\), as in the previous notation. We then have
\( P_z(u)=\sum _{i=1}^m\lambda \Vert u_{G_i}\Vert _{2,0}+\sum _{i=1}^m\mu \Vert u_{G_i}\Vert _0+\sum _{i=1}^m\frac{1}{2\gamma }\Vert u_{G_i}-z_{G_i}\Vert _2^2=\sum _{i=1}^m P_{z_G}(u_{G_i}), \quad (49) \)
where the first and second equalities hold because \(G_1,\dots ,G_m\) form a non-overlapping group partition, i.e., \(G_i\cap G_j=\emptyset \) for all \(i\ne j\) and \(\bigcup _{i=1}^m G_i=[n]\). Equation (49) implies that we can consider the minimization problem within each group.
Suppose \(x\) and \(x'\) are two different elements of \(\overline{\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\). By the definition of \(\overline{\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\) as the set of solutions in \({\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\) with the most nonzero elements, we have \(\Vert x\Vert _0 = \Vert x'\Vert _0\). Together with Lemma A.1, there must exist \(j \in [n]\) such that \(x_j =0\) while \(x'_j=z_j \ne 0\); otherwise \(x=x'\), contradicting \(x\ne x'\).
Construct a new vector \(\bar{x}\) by letting \(\bar{x}_j=x'_j\ne 0\) and \(\bar{x}_l=x_l\) for \(l\ne j\); then \(\Vert \bar{x}\Vert _0>\Vert x\Vert _0\). Suppose element \(j\) belongs to group \(G_i\). Since \(x\) and \(x'\) are solutions to \(P_z(u)\), it follows from (49) that both \(x_{G_i}\) and \(x'_{G_i}\) are solutions to \(\min _{u_{G_i}}P_{z_G}(u_{G_i})\). Again by (49), it can be verified that \(\bar{x}\) is a solution to \(P_z(u)\). But this is a contradiction: we have constructed a solution with more nonzero elements than \(x\) and \(x'\).
Hence, \(\overline{\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\) has to be a singleton. \(\square \)
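To make the separability (49) concrete, here is a minimal Python sketch (a sanity check, not the paper's code; the function names and the \(\frac{1}{2\gamma }\) scaling follow the definitions above) that evaluates \(P_z(u)\) directly and via the groupwise sum; the two values agree for any non-overlapping partition.

```python
import numpy as np

def P_z(u, z, groups, lam, mu, gamma):
    """Objective lam*||u||_{2,0} + mu*||u||_0 + (1/(2*gamma))*||u - z||_2^2."""
    l20 = sum(1 for G in groups if np.linalg.norm(u[G]) != 0)  # group-level l_{2,0} term
    return lam * l20 + mu * np.count_nonzero(u) + np.sum((u - z) ** 2) / (2.0 * gamma)

def P_z_groupwise(u, z, groups, lam, mu, gamma):
    """The same objective computed group by group, as in (49)."""
    total = 0.0
    for G in groups:
        l20 = 1.0 if np.linalg.norm(u[G]) != 0 else 0.0
        total += lam * l20 + mu * np.count_nonzero(u[G]) \
                 + np.sum((u[G] - z[G]) ** 2) / (2.0 * gamma)
    return total
```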
Lemma A.3
Given \(z\in \mathbb {R}^d\), let \(y=\textbf{H}(z,\sqrt{2\gamma \mu })\). Then y has the most nonzero elements among all the solutions to \(\min _{u\in \mathbb {R}^d}\{\mu \Vert u\Vert _0+\frac{1}{2\gamma }\Vert u-z\Vert _2^2\}.\)
Proof
Observing that
\( \min _{u\in \mathbb {R}^d}\Big \{\mu \Vert u\Vert _0+\frac{1}{2\gamma }\Vert u-z\Vert _2^2\Big \}=\sum _{j=1}^d\min _{u_j}\Big \{\mu \Vert u_j\Vert _0+\frac{1}{2\gamma }(u_j-z_j)^2\Big \}, \)
any solution \(u^*\in \arg \min _{u\in \mathbb {R}^d}\{\mu \Vert u\Vert _0+\frac{1}{2\gamma }\Vert u-z\Vert _2^2\}\) has the form:
\( u_j^*=\left\{ \begin{aligned}&z_j,\quad&\text{if}\ |z_j|>\sqrt{2\gamma \mu }, \\&z_j\ \text{or}\ 0,\quad&\text{if}\ |z_j|=\sqrt{2\gamma \mu }, \\&0,\quad&\text{if}\ |z_j|<\sqrt{2\gamma \mu }. \end{aligned}\right. \)
From the definition of the hard-thresholding operator (4), which keeps the tie entries \(|z_j|=\sqrt{2\gamma \mu }\), the conclusion of the lemma follows. \(\square \)
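As an illustration of Lemma A.3, here is a minimal sketch of the hard-thresholding step, assuming (as the lemma requires) that \(\textbf{H}(z,\tau )\) keeps the tie entries \(|z_j|=\tau \); the function name is ours, for illustration.

```python
import numpy as np

def hard_threshold(z, tau):
    """H(z, tau): keep z_j when |z_j| >= tau, set it to 0 otherwise.

    Using >= (rather than >) keeps the tie entries |z_j| = tau, which by
    Lemma A.3 yields the solution of min mu*||u||_0 + (1/(2*gamma))*||u - z||_2^2
    with the most nonzero elements (here tau = sqrt(2*gamma*mu)).
    """
    return np.where(np.abs(z) >= tau, z, 0.0)
```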
Lemma A.4
Let \(x\) be computed through the two-step projection (5a, 5b); then \(x\in {\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\).
Proof
We prove the lemma within each group \(G_i\). From the definition of the hard-thresholding operator, \(y_{G_i}=\textbf{H}(z_{G_i},\sqrt{2\gamma \mu })\) is a solution to
\( \min _{u_{G_i}}\Big \{\mu \Vert u_{G_i}\Vert _0+\frac{1}{2\gamma }\Vert u_{G_i}-z_{G_i}\Vert _2^2\Big \}. \quad (50) \)
(a). Suppose there exists \(j\in G_i\) such that \(|z_{G_i}|_j\ge \sqrt{2\gamma \mu }\). Then \(y_{G_i}\) is a nonzero vector, and the following equations hold:
\( \min _{u_{G_i}\ne \textbf{0}}P_{z_G}(u_{G_i})=\lambda +\min _{u_{G_i}}\Big \{\mu \Vert u_{G_i}\Vert _0+\frac{1}{2\gamma }\Vert u_{G_i}-z_{G_i}\Vert _2^2\Big \}=P_{z_G}(y_{G_i}). \)
Further, define \(\widetilde{G}_i:={\textrm{supp}}(y_{{G}_i})\ne \emptyset \) and \(\widetilde{G}_i^o:=G_i\backslash \widetilde{G}_i\); it follows that \(y_{\widetilde{G}_i}=z_{\widetilde{G}_i}\) and \(y_{\widetilde{G}_i^o}=\textbf{0}\). We then obtain
\( P_{z_G}(y_{G_i})-P_{z_G}(\textbf{0})=\lambda +\mu \Vert y_{G_i}\Vert _0+\frac{1}{2\gamma }\Vert z_{\widetilde{G}_i^o}\Vert _2^2-\frac{1}{2\gamma }\Vert z_{G_i}\Vert _2^2=\lambda +\mu \Vert y_{G_i}\Vert _0-\frac{1}{2\gamma }\Vert y_{G_i}\Vert _2^2, \quad (52) \)
where the last equality holds because \(\widetilde{G}_i\) captures all nonzero elements of \(y_{G_i}\), so that \(\Vert z_{G_i}\Vert _2^2-\Vert z_{\widetilde{G}_i^o}\Vert _2^2=\Vert z_{\widetilde{G}_i}\Vert _2^2=\Vert y_{G_i}\Vert _2^2\).
(a-1). If \(\Vert y_{G_i}\Vert _2\ge \sqrt{2\gamma (\lambda +\mu \Vert y_{G_i}\Vert _0)}\), then (5b) sets \(x_{G_i}=y_{G_i}\), and the quantity in (52) is nonpositive, i.e., \(P_{z_G}(y_{G_i})\le P_{z_G}(\textbf{0})\). It follows that \(x_{G_i}\) is a solution to \(\min _{u_{G_i}}P_{z_G}(u_{G_i})\). Hence, combining (49), it can be deduced that \(x\in {\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z).\)
(a-2). If \(\Vert y_{G_i}\Vert _2<\sqrt{2\gamma (\lambda +\mu \Vert y_{G_i}\Vert _0)}\), then (5b) sets \(x_{G_i}=\textbf{0}\), and we have from (52) that
\( P_{z_G}(\textbf{0})<P_{z_G}(y_{G_i})=\min _{u_{G_i}\ne \textbf{0}}P_{z_G}(u_{G_i}), \quad (53) \)
thus \(\textbf{0}\) is a solution to \(\min _{u_{G_i}}P_{z_G}(u_{G_i})\). Combining (49), it can be deduced that \(x\in {\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z).\)
(b). If \(|z_{G_i}|_j < \sqrt{2\gamma \mu }\) for all \(j \in G_i\), then \(y_{G_i}=\textbf{0}\). It follows that \(\sqrt{2\gamma (\lambda +\mu \Vert y_{G_i}\Vert _0)} > \Vert y_{G_i}\Vert _2=0\). Hence, (5b) sets \(x_{G_i}=\textbf{0}\). Besides, for any \(u_{G_i}\ne \textbf{0}\) we have
\( P_{z_G}(u_{G_i})=\lambda +\mu \Vert u_{G_i}\Vert _0+\frac{1}{2\gamma }\Vert u_{G_i}-z_{G_i}\Vert _2^2>\lambda +\frac{1}{2\gamma }\Vert z_{G_i}\Vert _2^2>\frac{1}{2\gamma }\Vert z_{G_i}\Vert _2^2=P_{z_G}(\textbf{0}), \quad (54) \)
where the first inequality holds because the only solution to (50) is the zero vector. Hence, \(x_{G_i}=\textbf{0}\in \arg \min P_{z_G}(u_{G_i})\). As a conclusion, \(x\in {\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z).\)
In summary, the two-step projection finds an element of \({\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z).\) \(\square \)
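Putting the two steps together, here is a minimal sketch of the two-step projection (5a, 5b), reusing hard_threshold from the sketch above; groups are index arrays forming a non-overlapping partition, and the names are illustrative rather than from the paper's code.

```python
import numpy as np

def two_step_projection(z, groups, lam, mu, gamma):
    """Return an element of PROX_{lam,mu}^{gamma}(z) via (5a)-(5b)."""
    # (5a): elementwise hard thresholding with threshold sqrt(2*gamma*mu).
    y = hard_threshold(z, np.sqrt(2.0 * gamma * mu))
    # (5b): keep group G_i only if ||y_{G_i}||_2 >= sqrt(2*gamma*(lam + mu*||y_{G_i}||_0)),
    # matching cases (a-1)/(a-2) above; otherwise the whole group is set to zero.
    x = np.zeros_like(z)
    for G in groups:
        yG = y[G]
        if np.linalg.norm(yG) >= np.sqrt(2.0 * gamma * (lam + mu * np.count_nonzero(yG))):
            x[G] = yG
    return x
```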
We now prove Theorem 2.1.
Proof of Theorem 2.1
For the first part of the theorem, \(x={\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\) implies \({\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\) is a singleton. We need only prove that \(x\) can be calculated from the two-step projection. Let \(x'\) satisfy (5a, 5b); Lemma A.4 guarantees \(x'\in {\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\). Hence, \(x=x'\), since \({\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\) is a singleton.
For the second part of the theorem, let \(x\) be calculated from the two-step projection. It has already been proved that \(\overline{\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\) is a singleton and \(x\in {\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\). All that remains is to prove that \(x\) has the most nonzero elements. Equivalently, letting \(\bar{x}=\overline{\textrm{PROX}}_{\lambda ,\mu }^{\gamma }(z)\), we need only show \({\textrm{supp}}(x)\supseteq {\textrm{supp}}(\bar{x})\). According to (49), it is necessary and sufficient to prove that \({\textrm{supp}}(x_{G_i})\supseteq {\textrm{supp}}(\bar{x}_{G_i})\) within each group \(G_i\).
Case 1. \(x_{G_i}\ne \textbf{0}\). We need only consider the case \(\bar{x}_{G_i}\ne \textbf{0}\); otherwise \({\textrm{supp}}(x_{G_i})\supseteq {\textrm{supp}}(\bar{x}_{G_i})\) holds trivially. Since both \(x_{G_i}\) and \(\bar{x}_{G_i}\) are nonzero and minimize \(P_{z_G}(u_{G_i})\), we obtain from
\( \min _{u_{G_i}\ne \textbf{0}}P_{z_G}(u_{G_i})=\lambda +\min _{u_{G_i}}\Big \{\mu \Vert u_{G_i}\Vert _0+\frac{1}{2\gamma }\Vert u_{G_i}-z_{G_i}\Vert _2^2\Big \} \)
that both \(x_{G_i}\) and \(\bar{x}_{G_i}\) are solutions to the minimization on the right-hand side of the above equation. Also note that \(x_{G_i}\ne \textbf{0}\) implies (5b) does not zero out the group; therefore \(x_{G_i}=y_{G_i}=\textbf{H}(z_{G_i},\sqrt{2\gamma \mu }).\) Utilizing Lemma A.3, it follows that \({\textrm{supp}}(x_{G_i})\supseteq {\textrm{supp}}(\bar{x}_{G_i}).\)
Case 2. \(x_{G_i} = \textbf{0}\). We claim \(\bar{x}_{G_i}= \textbf{0}\) and prove this by contradiction: suppose \(\bar{x}_{G_i}\ne \textbf{0}\). From (5a, 5b), we know that either case (2a): \(|z_{G_i}|_j < \sqrt{2\gamma \mu }\) for all \(j\in G_i\), or case (2b): \(\Vert y_{G_i}\Vert _2 < \sqrt{2\gamma (\lambda +\mu \Vert y_{G_i}\Vert _0)}\) holds. However, case (2a) implies (54) holds, which contradicts \(\textbf{0}\ne \bar{x}_{G_i}\in \arg \min _{u_{G_i}} P_{z_G}(u_{G_i})\), since the zero vector is then the only solution to \(\min _{u_{G_i}} P_{z_G}(u_{G_i})\); case (2b) implies (53) holds, which yields the same contradiction as case (2a).
In conclusion, \({\textrm{supp}}(x_{G_i})\supseteq {\textrm{supp}}(\bar{x}_{G_i})\) holds for each group, and thus \({\textrm{supp}}(x)\supseteq {\textrm{supp}}(\bar{x})\). The second part of the theorem is proved. \(\square \)
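Theorem 2.1 can also be checked numerically on small instances: by Lemma A.1 every solution takes values in \(\{0, z_i\}\) coordinatewise, so the proximal set can be enumerated exhaustively. Here is a sketch reusing P_z and two_step_projection from the earlier sketches; the instance below is ours, for illustration only.

```python
import itertools
import numpy as np

def prox_bruteforce(z, groups, lam, mu, gamma):
    """Enumerate all u with u_i in {0, z_i} (Lemma A.1); return the minimizers of P_z."""
    best, best_val = [], np.inf
    for mask in itertools.product([0.0, 1.0], repeat=len(z)):
        u = z * np.array(mask)
        val = P_z(u, z, groups, lam, mu, gamma)
        if val < best_val - 1e-12:
            best, best_val = [u], val
        elif abs(val - best_val) <= 1e-12:
            best.append(u)
    return best

# Tiny demo: the two-step projection lands in the proximal set and has the
# largest support among its elements, as Theorem 2.1 asserts.
z = np.array([1.5, 0.2, -0.8, 0.05])
groups = [np.array([0, 1]), np.array([2, 3])]
x = two_step_projection(z, groups, lam=0.3, mu=0.1, gamma=1.0)
sols = prox_bruteforce(z, groups, lam=0.3, mu=0.1, gamma=1.0)
assert any(np.allclose(x, s) for s in sols)
assert np.count_nonzero(x) == max(np.count_nonzero(s) for s in sols)
```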
Keywords
- Sparse group \(\ell _0\) optimization
- Subspace Newton method
- Global convergence
- Second-order convergence rate