Sparse classification: a scalable discrete optimization perspective

2218 Accesses
14 Citations
3 Altmetric
Explore all metrics

Abstract

We formulate the sparse classification problem of n samples with p features as a binary convex optimization problem and propose a outer-approximation algorithm to solve it exactly. For sparse logistic regression and sparse SVM, our algorithm finds optimal solutions for n and p in the 10,000 s within minutes. On synthetic data our algorithm achieves perfect support recovery in the large sample regime. Namely, there exists an $n_0$ such that the algorithm takes a long time to find an optimal solution and does not recover the correct support for $n<n_0$, while for $n\geqslant n_0$, the algorithm quickly detects all the true features, and does not return any false features. In contrast, while Lasso accurately detects all the true features, it persistently returns incorrect features, even as the number of observations increases. Consequently, on numerous real-world experiments, our outer-approximation algorithms returns sparser classifiers while achieving similar predictive accuracy as Lasso. To support our observations, we analyze conditions on the sample size needed to ensure full support recovery in classification. For k-sparse classification, and under some assumptions on the data generating process, we prove that information-theoretic limitations impose $n_0 < C \left( 2 + \sigma ^2\right) k \log (p-k)$, for some constant $C>0$.

Sparse Learning for Large-Scale and High-Dimensional Data: A Randomized Convex-Concave Optimization Approach

Fast and scalable Lasso via stochastic Frank–Wolfe methods with a convergence guarantee

Article 21 July 2016

Regularity Properties for Sparse Regression

Article 14 March 2016

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Sparse classification is a central problem in machine learning as it leads to more interpretable models. Given data $\{(\varvec{x}^{(i)}, y_i)\}_{i=1,\dots ,n}$ with $y_i \in \{-1,1\}$ and $\varvec{x}^{(i)} \in \mathbb {R}^p$, we aim at computing an estimator $\varvec{w}$ which minimizes an empirical loss $\ell $ subject to the constraint that its number of nonzero entries does not exceed k:

$$\begin{aligned} \min _{\varvec{w} \in \mathbb {R}^p,b \in \mathbb {R}} \sum _{i=1}^n \ell (y_i, \varvec{w}^\top \varvec{x}^{(i)} +b) \text{ s.t. } \Vert \varvec{w} \Vert _0 \leqslant k. \end{aligned}$$

(1)

Problem (1) is an NP-hard optimization problem (Natarajan, 1995). Thus, much of the literature has focused on heuristic proxies and replaced the 0 pseudo-norm with so-called sparsity-inducing convex norms (Bach et al., 2012). Even though regularization enforces robustness more than sparsity (Bertsimas & Copenhaver, 2018), the $L_1$-penalty formulation

$$\begin{aligned} \min _{\varvec{w} \in \mathbb {R}^p,b \in \mathbb {R}} \sum _{i=1}^n \ell (y_i, \varvec{w}^\top \varvec{x}^{(i)} +b) + \lambda \Vert \varvec{w} \Vert _1, \end{aligned}$$

(2)

known as Lasso (Tibshirani, 1996) is abundantly used in practice. Efficient numerical algorithms exist (Friedman et al., 2010), off-the-shelf implementations are publicly available (Friedman et al., 2013) and recovery of the true sparsity is theoretically guaranteed under some assumptions on the data. In regression problems with i.i.d. Gaussian measurements for instance, Wainwright (2009b) proved that Lasso recovers the k correct features with high probability (w.h.p. in short) for $n>(2k + \sigma ^2) \log p$ where $\sigma ^2$ is the variance of the noise, a phenomenon they refer to as phase transition in accuracy. On the other hand, recent works (Fan & Song, 2010; Bühlmann, 2011; Su et al., 2017) highlighted the difficulty for $L_1$-regularized estimators to select correct features without making false discoveries, considering Lasso as a good feature screening but a poor feature selection procedure. The innate shortcomings of the convex heuristics such as Lasso are by now so well documented that in recent years a lot of attention has been diverted to sparsity-inducing nonconvex regularization methods. Such penalization formulations consider

$$\begin{aligned} \min _{\varvec{w} \in \mathbb {R}^p,b \in \mathbb {R}} \sum _{i=1}^n \ell (y_i, \varvec{w}^\top \varvec{x}^{(i)} +b) + \lambda \cdot R(\varvec{w}) \end{aligned}$$

(3)

where R is a nonconvex penalty function with a unique minimizer at zero. Frank and Friedman (1993) proposed for instance the use of an $L_p$ pseudo-norm with $0<p<1$ as a continuous approximation to the 0 pseudo-norm for least-squares regression problems. Nonconvex penalty functions are very actively studied nowadays as they promise stronger statistical guarantees when compared to their $L_1$-based convex counterparts. Fan and Li (2001) propose a smoothly clipped absolute deviation (SCAD) penalty while Zhang (2010a) advance their minimax concave penalty (MCP) for general convex loss functions. Both penalties and their associated regression models enjoy a so-called “oracle property”, that is, in the asymptotic sense, they perform as well as if the support of the nonzero coefficients was known in advance. As the formulation (3) is non-convex numerical challenges can be expected. Indeed, Chen et al. (2019) show that similarly to (1) the nonconvex formulation (3) is unfortunately NP-hard to solve (even approximately). Nevertheless, many empirical studies (Fan & Li, 2001; Hunter & Li, 2005; Zou & Li, 2008; Breheny & Huang, 2011; Mazumder et al., 2011) have shown that despite the nonconvexity of these formulations stationary points can be computed efficiently based on gradient-based optimization methods and consistently return better regressors than their $L_1$-based counterparts do. In an impressive line of work Loh and Wainwright (2015) and Loh et al. (2017) establish that in fact all stationary points in (3) enjoy support recovery consistency even when the usual incoherence conditions required by $L_1$-based methods fail to hold. Adaptive approximation approaches which attempt to address the nonconvex formulation (3) by solving a sequence of convex relaxations are discussed by Zhang (2010b), Zhang et al. (2013) and Fan et al. (2018) and likewise result in empirically strong sparse regressors.

New research in numerical algorithms for solving the exact sparse formulation (1) has also flourished and demonstrated significant improvement on existing heuristics. Bertsimas et al. (2016) and Bertsimas and King (2017) made use of recent advances in mixed-integer optimization to solve sparse linear and logistic regression problems. Pilanci et al. (2015) applied a Lagrangian relaxation and random rounding procedure for linear regression and provide sufficient conditions for support recovery with their method. Hazimeh and Mazumder (2020) developed a cyclic coordinate descent strategy combined with local search to efficiently find local optima of the $L_0$-penalized ordinary least square regression problem, later extended to other loss functions (Dedieu et al., 2021). Recently, sparse linear regression for dimensions n and p in 100,000 s was exactly solved for the first time, using a cutting-plane algorithm (Bertsimas & Van Parys, 2020). Their method demonstrates a clear phase transition in accuracy as the sample size n increases and requires less data than Lasso to achieve full recovery. Simultaneously, they observed a phase transition in false discovery, that is, the number of incorrect features selected, and in computational time, which is unique to their method: they exhibited a threshold $n_0$ for which for $n<n_0$ their algorithm takes a long time to find the optimal solution and it is does not recover the correct support, while for $n\geqslant n_0$, the algorithm is very fast and accurately detects all the true features, but does not return any false features. We refer to Bertsimas et al. (2020) for a comprehensive empirical evaluation of the relative merits of $L_1$-based, cardinality-constrained, and non-convex formulations in accurately selecting the true support.

Besides algorithm-specific performance, any support recovery algorithm faces information-theoretic limitations as well (Wainwright, 2009a; Wang et al., 2010). In regression, recent work (Gamarnik & Zadik, 2017) indeed proved the existence of a sharp information-theoretic threshold $n^\star $: If $n<n^\star $, exact support recovery by any algorithm is impossible, while it is theoretically achievable for $n>n^\star $. Such results call for further research in learning algorithms in the regime $n^\star< n < (2k + \sigma ^2) \log p$ where Lasso fails but full recovery is achievable in principle.

Regarding accuracy, such phase transition phenomena are actually not specific to regression problems but are observed in many data analysis and signal processing contexts (Donoho & Stodden, 2006; Donoho & Tanner, 2009). Surprisingly, little if no work focused on classification problems specifically. Guarantees in terms of $\ell _2$ error have been obtained in the so-called 1-bit compressed sensing setting (Boufounos & Baraniuk, 2008; Gupta et al., 2010; Plan & Vershynin, 2013a; Jacques et al., 2013) but they do not precisely address the question of support recovery. In recent work, Scarlett and Cevher (2017) offer a comprehensive treatment of information theoretic limitations in support recovery for both linear and 1-bit compressed sensing, in some regimes of noise and sparsity. Sparse classification in itself has mainly been regarded as a feature selection problem (Dash & Liu, 1997) and greedy procedures such as Recursive Feature Elimination (Guyon et al., 2002) have shown the most successful.

In this paper, we formulate exact sparse classification as a binary convex optimization problem, propose a tractable outer-approximation algorithm and a stochastic variant to solve it in high dimensions. To provide insights on the algorithm’s performance and behavior, we also derive an information-theoretic sufficient condition for support recovery in classification that complements existing results from the 1-bit compressed sensing literature.

Contributions

The contributions of the present paper can be summarized as follows:

1.
Based on duality results for regularized classification, we formulate the exact sparse classification problem as a binary convex optimization problem and propose a tractable outer-approximation algorithm to solve it. Our approach generalizes the one in Bertsimas and Van Parys (2020), who address linear regression for which a closed-form solution exists and make extensive use of this closed-form solution. Our framework, however, extends to cases where a closed-form solution is not available and includes, in addition to linear regression, logistic regression and 1- or 2-norm SVM. We also propose a stochastic constraint generating process to improve scalability of the outer-approximation algorithm with respect to the number of samples n.
2.
We demonstrate the tractability and relevance of our algorithm in solving large binary classification problems with logistic and Hinge loss. Among others, we solve a real-world gene-selection problem with $n=1145$, $p=14{,}858$ and select four to ten times fewer genes than the $L_1$ heuristic with little compromise on the predictive power. On synthetic data, our algorithm can scale to data sets with up to $p=50{,}000$ features and prove optimality in less than a minute for low sparsity regimes ($k=5$). Our present algorithm and the concurrent paper of Dedieu et al. (2021) are, to the best of our knowledge, the only methods currently available that solve sparse classification problems to provable optimality in such high dimensions within minutes. We believe, however, that the tools developed in the present paper may be more versatile and broadly applicable than the tailored “integrality generation” technique of Dedieu et al. (2021). Finally, we observe that our stochastic constraint generating process reduces computational time by a factor 2–10 compared with the standard outer-approximation procedure for the hardest instances, namely instances where the number of samples n is not large enough for Lasso to achieve perfect recovery.
3.
We demonstrate empirically that cardinality-constrained estimators asymptotically achieve perfect support recovery on synthetic data: As n increases, the method accurately detects all the true features, just like Lasso, but does not return any false features, whereas Lasso does. In addition, we observe that the computational time of our algorithm decreases as more data is available. We exhibit a smooth transition towards perfect support recovery in classification settings, whereas it is empirically observed (Donoho & Stodden, 2006; Bertsimas and Van Parys 2020) and theoretically defined (Wainwright, 2009a; Gamarnik & Zadik, 2017) as a sharp phase transition in the case of linear regression.
4.
Intrigued by this smoother empirical behavior, we show that there exists a threshold $n_0$ on the number of samples such that if $n > n_0$, the underlying truth $\varvec{w}^\star $ minimizes the empirical error with high probability. Assuming $p \geqslant 2k$ and data is generated by $y_i = \text {sign}\left( {\varvec{w}^{\star }}^\top \varvec{x}^{(i)} + \varepsilon _i\right) , $ with $\varvec{x}^{(i)}$ i.i.d. Gaussian, $\varvec{w}^\star \in \{0,1\}^p$, supp$(\varvec{w}^{\star } )=k$ and $\varepsilon _i\sim \mathcal {N} (0, \sigma ^2)$, we prove that $n_0 < C \left( 2 + \sigma ^2\right) k \log (p-k)$, for some constant $C>0$ (Theorem 3). Our information-theoretic sufficient condition on support recovery parallels the one obtained by Wainwright (2009a) in regression settings, although discreteness of the outputs substantially modify the analysis and the scaling of the bound. It also scales as $k \log p$, thus agreeing with the sufficient conditions required by non-convex formulations (Loh et al., 2017), although their generative process is different and their result does not explicitly capture the impact of noise. In recent work, Scarlett and Cevher (2017) performed a similar analysis of support recovery in 1-bit compressed sensing. For low signal-to-noise ratios (SNR), they exhibit matching necessary and sufficient conditions, which suggests the existence of a phase transition. In high signal-to-noise regimes, however, they proved necessary conditions only. Our result is novel, in that it is not only valid for specific sparsity and noise regimes, but for all values of k and $\sigma $ (see Table 1 for a summary of their results and ours). In particular, our sufficient condition holds when the sparsity k scales linearly in the dimension p and the signal-to-noise ratio is high. In this regime, there is a $\sqrt{\log p}$ factor between our bound and the necessary condition from Scarlett and Cevher (2017). As represented in Fig. 1, there is an intermediate sample size regime where support recovery is neither provably impossible nor provably achievable, and could explained the smooth phase transition observed. Such an observation is made possible by the combination of both our results. Of course, this observation only suggests that a lack of phase transition is plausible. So is the existence of one and future work might improve on these bounds and close this gap.

Table 1 Summary of the necessary and sufficient conditions provided in Scarlett and Cevher (2017), as compared to the sufficient condition we provide in Theorem 3, depending on the sparsity and signal-to-noise ratio (SNR) regime

Full size table

Structure

We derive the binary convex optimization formulation for exact sparse classification, present an outer-approximation algorithm to solve it, and propose a stochastic constraint generating process in Sect. 2. We evaluate the performance of our method, both in terms of tractability and support recovery ability, in Sect. 3. Finally, we prove information-theoretic sufficient condition for support recovery in Sect. 4.

Notations

We reserve bold characters ($\varvec{x}$) to denote vectors. We denote by $\mathbf{e} $ the vector whose components are equal to one. If not specified, its dimension should be inferred from the context. The set $\mathcal {S}_k^p$ denotes the set

$$\begin{aligned} \mathcal {S}_k^p := \left\{ \varvec{s} \in \lbrace 0, 1 \rbrace ^p \, : \, \mathbf{e} ^\top \varvec{s} \leqslant k \right\} , \end{aligned}$$

which contains all binary vectors $\varvec{s}$ selecting at most k components from p possibilities. Assume $\varvec{y} \in \mathbb {R}^p$ is a vector and $\varvec{s} \in \mathcal {S}_k^p$, then $\varvec{y}_{\varvec{s}}$ denotes the sub-vector or sub-collection of $y_j$’s where $s_j=1$. We use $\Vert \varvec{x} \Vert _0$ to denote the number of elements of a vector which are nonzero. We use bold capitalized letters ($\varvec{X}$) for matrices. For a matrix $\varvec{X}$, $\varvec{X}_j$ denotes the jth column of $\varvec{X}$, while $\varvec{X}_{\varvec{s}}$ for $\varvec{s} \in \mathcal {S}_k^p$ is the submatrix obtained from the k columns of $\varvec{X}$ selected by $\varvec{s}$.

2 Dual framework

In this section, we use duality results to formulate a regularized version of the sparse classification problem (1) as a binary convex optimization problem and propose an outer-approximation approach to solve it efficiently.

2.1 Regularized classification

We first introduce the basic notation and recall some well-known properties for the case of non-sparse classification. Two very popular classification methods in machine learning are logistic regression and Support Vector Machine (SVM). Despite different motivations and underlying intuition, both methods lead to a similar formulation which can be addressed under the unifying lens of regularized classification:

$$\begin{aligned} \min _{\varvec{w} \in \mathbb {R}^p,b \in \mathbb {R}} \sum _{i=1}^n \ell (y_i, \varvec{w}^\top \varvec{x}^{(i)} +b) + \dfrac{1}{2 \gamma } \Vert \varvec{w} \Vert _2^2, \end{aligned}$$

(4)

where $\ell $ is an appropriate loss function and $\gamma $ a regularization coefficient.

In the logistic regression framework, the loss function is the logistic loss

$$\begin{aligned} \ell (y, u) = \log \left( 1+e^{-yu} \right) , \end{aligned}$$

and the objective function can be interpreted as the negative log-likelihood of the data plus a regularization term, which ensures strict convexity of the objective and existence of an optimal solution.

In the SVM framework, the loss function $\ell $ is the hinge loss:

$$\begin{aligned} \ell (y, u) = \max (0, 1-yu). \end{aligned}$$

Under proper normalization assumptions, the square norm $\Vert \varvec{w} \Vert _2^2$ relates to the notion of margin, which characterizes the robustness of the separating hyperplane $\lbrace \varvec{x} : \varvec{w}^\top \varvec{x} + b = 0 \rbrace $, while the loss part penalizes the data points which do no satisfy $y_i(\varvec{w}^\top \varvec{x}^{(i)} + b) \geqslant 1$, that is points which are misclassified or lie within the margin (Vapnik, 1998).

In addition, this general formulation (4) accounts for any loss function used in classification (e.g. 2-norm SVM) or even in regression problems. Throughout the paper we make the following assumption:

Assumption 1

The loss function $\ell (y,\cdot )$ is convex for $y \in \{-1,1\}$.

Note that both the logistic and Hinge loss functions satisfy the convexity Assumption 1. In classification, deeper results and insights can typically be obtained by adopting a dual perspective. Denoting $\varvec{X} = ({\varvec{x}^{(i)}}^\top )_{i=1,\ldots ,n} \in \mathbb {R}^{n \times p}$ the design matrix, we have:

Theorem 1

Under Assumption 1, strong duality holds for problem (4) and its dual is

$$\begin{aligned} \max _{\varvec{\alpha } \in \mathbb {R}^n : \mathbf{e} ^\top \varvec{\alpha } = 0} - \sum _{i=1}^n {{\hat{\ell }}}(y_i,\alpha _i) - \dfrac{\gamma }{2} \varvec{\alpha }^\top \varvec{X} \varvec{X}^\top \varvec{\alpha } \; , \end{aligned}$$

(5)

where $ {\hat{\ell }}(y, \alpha ) := \max _{u \in \mathbb {R}} u \alpha - \ell (y,u) $ is the Fenchel conjugate of the loss function $\ell $ (see Boyd & Vandenberghe, 2004, Chap. 3.3).

Table 2 summarizes some popular loss functions in classification and their corresponding Fenchel conjugates.

Table 2 Examples of loss functions and their corresponding Fenchel conjugates, as defined in Theorem 1 (Bach, 2009)

Full size table

Proof

For regularized classification, we have that

$$\begin{aligned} \min _{\varvec{w},b} \sum _{i=1}^n \ell (y_i, \varvec{w}^\top \varvec{x}^{(i)} +b) + \dfrac{1}{2 \gamma } \Vert \varvec{w} \Vert _2^2 =\min _{\varvec{w},b,\varvec{z}}&\, \sum _{i=1}^n \ell (y_i, z_i) + \dfrac{1}{2 \gamma } \Vert \varvec{w} \Vert _2^2 \\ \text{ s.t. }&\, z_i = \varvec{z}^\top \varvec{x}^{(i)} + b. \end{aligned}$$

By Assumption 1, the objective is convex, the optimization set is convex and Slater’s conditions hold (Boyd & Vandenberghe, 2004). Hence, strong duality holds and the primal is equivalent to the dual problem. To derive the dual formulation, we introduce Lagrange multipliers $\alpha _i$ associated with the equality constraints:

$$\begin{aligned}&\min _{\varvec{w},b, \varvec{z}} \sum _{i=1}^n \ell (y_i, z_i) + \dfrac{1}{2 \gamma } \Vert \varvec{w} \Vert _2^2 \text{ s.t. } z_i = \varvec{w}^\top \varvec{x}^{(i)} + b \\&\quad =\min _{\varvec{w},b, \varvec{z}} \sum _{i=1}^n \ell (y_i, z_i) + \dfrac{1}{2 \gamma } \Vert \varvec{w} \Vert _2^2 + \max _{\varvec{\alpha } \in \mathbb {R}^n} \sum _{i=1}^n \alpha _i (\varvec{w}^\top \varvec{x}^{(i)} + b - z_i)\\&\quad =\min _{\varvec{w},b, \varvec{z}} \max _{\varvec{\alpha }} \left( \sum _{i=1}^n \ell (y_i, z_i) - \alpha _i z_i \right) + \left( \dfrac{1}{2 \gamma } \Vert \varvec{w} \Vert ^2_{ 2} + \varvec{w}^\top \left[ \sum _i \alpha _i \varvec{x}^{(i)} \right] \right) + b \, \mathbf{e} ^\top \varvec{\alpha } \\&\quad =\max _{\varvec{\alpha }} \sum _{i=1}^n \min _{z_i} \left( \ell (y_i, z_i) - \alpha _i z_i \right) + \min _{\varvec{w}} \, \left( \dfrac{1}{2 \gamma } \Vert \varvec{w} \Vert ^2_{ 2} + \varvec{w}^\top \varvec{X}^\top \varvec{\alpha } \right) + \min _{b} \, b \, \mathbf{e} ^\top \varvec{\alpha }. \end{aligned}$$

Let us consider the three inner minimization problems separately. First,

$$\begin{aligned} \min _{z_i} \left( \ell (y_i, z_i) - \alpha _i z_i \right)&= -\max _{z_i} \left( \alpha _i z_i - \ell (y_i, z_i) \right) = -{\hat{\ell }}(y_i, \alpha _i). \end{aligned}$$

Then, $\tfrac{1}{2 \gamma } \Vert \varvec{w} \Vert ^2_{ 2} + \varvec{w}^\top \varvec{X}^\top \varvec{\alpha } $ is minimized at $\varvec{w}^\star $ satisfying: $ \tfrac{1}{\gamma } \varvec{w}^\star + \varvec{X}^\top \varvec{\alpha } = 0$. Hence

$$\begin{aligned} \min _{w} \left( \dfrac{1}{2 \gamma } \Vert \varvec{w} \Vert ^2_{ 2} + \varvec{w}^\top \varvec{X}^\top \varvec{\alpha } \right)&= - \dfrac{1}{2 \gamma } \Vert \varvec{w}^\star \Vert ^2_{ 2} = - \dfrac{\gamma }{2} \varvec{\alpha }^\top \varvec{XX}^\top \varvec{\alpha }. \end{aligned}$$

Finally, $\min _{b} \, b \, \mathbf{e} ^\top \varvec{\alpha }$ is bounded if and only if $\mathbf{e} ^\top \varvec{\alpha } = 0$, thus we obtain (5). $\square $

The derivation of the dual (5) reveals that the optimal primal variables $\varvec{w}^\star $ can be recovered from the dual variables $\varvec{\alpha }^\star $ via the relationship $\varvec{w}^\star = - \gamma \varvec{X}^\top \varvec{\alpha }^\star $. In other words, $\varvec{w}^\star $ is a linear combination of the data points $\varvec{X}$. Such an observation has historically led to the intuition that $\varvec{w}^\star $ was supported by some observed vectors $\varvec{x}^{(i)}$ and the name Support Vector Machine was coined (Cortes & Vapnik, 1995). Conversely, $\varvec{\alpha }^\star $ relates to the primal variables $\varvec{w}^\star $ via the relationship $\varvec{\alpha }^\star _i \in \partial \ell (y_i, {\varvec{w}^\star }^\top \varvec{x}^{(i)})$, where $\partial \ell (y_i, {\varvec{x}^{(i)}}^\top \varvec{w}^\star )$ denotes the sub-differential of the loss function $\ell (y_i,\cdot )$ evaluated at ${\varvec{w}^\star }^\top \varvec{x}^{(i)}$. If $\ell $ is differentiable, like the logistic loss, this relationship uniquely defines $\varvec{\alpha }^\star _i$.

Moreover, the dual point of view opens the door to non-linear classification using kernels (Schölkopf et al., 2001). The positive semi-definite matrix $\varvec{X} \varvec{X}^\top $, often referred to as the kernel or Gram matrix, is central in the dual problem (5) and could be replaced by any kernel matrix $\varvec{K}$ whose entries $K_{ij}$ encode some measure of similarity between inputs $\varvec{x}^{(i)}$ and $\varvec{x}^{(j)}$.

Numerical algorithms There is a rich literature on numerical algorithms for solving either the primal (4) or the dual (5) formulation for the regularized classification problem in the case of logistic regression and SVM. Gradient descent or Newton-Raphson methods are well-suited when the loss function is smooth. In addition, in the case where the dual problem is constrained to $\mathbf{e} ^\top \varvec{\alpha } = 0$, particular step size rules (Calamai & Moré, 1987; Bertsekas, 1982) or trust regions (Lin et al., 2008) can be implemented to cope with such linear constraints. When the loss function is not continuously differentiable, sub-gradient descent as proposed in the Pegasos algorithm (Shalev-Shwartz et al., 2011) provides an efficient optimization procedure. Within the machine learning community, coordinate descent methods have also received a lot of attention recently, especially in the context of regularized prediction, because of their ability to compute a whole regularization path at a low computational cost (Friedman et al., 2010). For coordinate descent algorithms specific to the regularized classification problem we address in this paper, we refer to (Hsieh et al., 2008; Yu et al., 2011; Keerthi et al., 2005).

Remark 1

Theorem 1 does not hold when the loss is not convex, for instance in the case of the empirical misclassification rate

$$\begin{aligned} \min _{\varvec{w},b} \sum _{i=1}^n \mathbf{1} _{y_i( \varvec{w}^\top \varvec{x}^{(i)} +b) < 0} + \dfrac{1}{2 \gamma } \Vert \varvec{w} \Vert _2^2. \end{aligned}$$

(6)

Indeed, strong duality does not hold. The objective value is clearly finite and nonnegative but since ${\hat{\ell }}(y, \alpha ) = +\infty $, the dual problem has cost $-\infty $. In general, if the loss function $\ell $ is not convex, the dual problem (5) is not equivalent to (4) but provides a valid lower bound.

2.2 Dual approach to sparse classification

Sparsity is a highly desirable property for statistical estimators, especially in high-dimensional regimes ($p\gg n$) such as the ones encountered in biological applications, where interpretability is crucial. A natural way to induce sparsity is to add a constraint on the number of nonzero coefficients of $\varvec{w}$ and solve:

(7)

Actually, (7) can be expressed as a convex binary optimization problem as stated in the following theorem:

Theorem 2

Problem (7) is equivalent to

$$\begin{aligned} \min _{\varvec{s} \in \mathcal {S}_k^p} c(\varvec{s}), \end{aligned}$$

(8)

where for any $\varvec{s} \in \{0,1\}^p$ we have $c(\varvec{s}) := \max _{\varvec{\alpha } \in \mathbb {R}^n: \mathbf{e} ^\top \varvec{\alpha } = 0} f(\varvec{\alpha },\varvec{s})$ with

$$\begin{aligned} f(\varvec{\alpha },\varvec{s}) := -\sum _{i=1}^n {\hat{\ell }}(y_i, \alpha _i) - \dfrac{\gamma }{2} \sum _{j=1}^n s_j \varvec{\alpha }^\top \varvec{X}_j \varvec{X}_j^\top \varvec{\alpha } . \end{aligned}$$

(9)

In particular, $c(\varvec{s})$ is convex over $[0,1]^p$.

Proof

We introduce an additional binary variable $\varvec{s} \in \{0,1\}^p$ encoding for the support of the sparse classifier $\varvec{w}$. With these notations

$$\begin{aligned} \varvec{w}^\top \varvec{x}^{(i)}&= \sum _{j=1}^p w_j x^{(i)}_{j} = \sum _{j: s_j =1} w_j x^{(i)}_{j} = \varvec{w}_{\varvec{s}}^\top \varvec{x}^{(i)}_{\varvec{s}}, \\ \Vert \varvec{w} \Vert _2^2&= \sum _{j=1}^p w_j^2 = \sum _{j: s_j =1} w_j^2 = \Vert \varvec{w}_{\varvec{s}} \Vert _2^2, \end{aligned}$$

the cardinality constraint on $\varvec{w}$ yields a linear constraint on $\varvec{s}$

$$\begin{aligned} \varvec{s}^\top \mathbf{e} \leqslant k, \end{aligned}$$

and (7) can be equivalently written as

$$\begin{aligned} \min _{\varvec{s} \in \mathcal {S}_k^p} \min _{w,b} \sum _{i=1}^n \ell (y_i, \varvec{w}_{\varvec{s}}^\top \varvec{x}^{(i)}_{\varvec{s}} +b) + \dfrac{1}{2 \gamma } \Vert \varvec{w}_{\varvec{s}} \Vert _2^2. \end{aligned}$$

Denoting $c(\varvec{s})$ the inner minimization problem, we end up solving the pure binary to-be-proved-convex optimization problem

$$\begin{aligned} \min _{\varvec{s} \in \mathcal {S}_k^p} c(\varvec{s}). \end{aligned}$$

In addition, $c(\varvec{s}) = \min _{\varvec{w},b} \sum _{i=1}^n \ell (y_i, \varvec{w}_{\varvec{s}}^\top \varvec{x}^{(i)}_{\varvec{s}} +b) + \dfrac{1}{2 \gamma } \Vert \varvec{w}_{\varvec{s}} \Vert _2^2 $ is an unconstrained regularized classification problem based on the features selected by $\varvec{s}$ only. Hence, Theorem 1 applies and

$$\begin{aligned} c(\varvec{s})&= \max _{\varvec{\alpha } \in \mathbb {R}^n} \sum _{i=1}^n -{\hat{\ell }}(y_i, \alpha _i) - \dfrac{\gamma }{2} \varvec{\alpha }^\top \varvec{X}_{\varvec{s}} \varvec{X}_{\varvec{s}}^\top \varvec{\alpha } \; \text{ s.t. } \; \mathbf{e} ^\top \varvec{\alpha } = 0. \end{aligned}$$

Since $\varvec{\alpha }^\top \varvec{X}_{\varvec{s}} \varvec{X}_{\varvec{s}}^\top \varvec{\alpha } = \sum _{j:s_j =1} \varvec{\alpha }^\top \varvec{X}_j \varvec{X}_j^\top \varvec{\alpha } = \sum _{j=1}^p s_j \varvec{\alpha }^\top \varvec{X}_j \varvec{X}_j^\top \varvec{\alpha }$, we obtain the desired formulation.

Finally, let us denote

$$\begin{aligned} f(\alpha , s) :=-\sum _{i=1}^n {\hat{\ell }}(y_i, \alpha _i) - \dfrac{\gamma }{2} \sum _{j=1}^n s_j \varvec{\alpha }^\top \varvec{X}_j \varvec{X}_j^\top \varvec{\alpha }. \end{aligned}$$

The function f is convex—indeed linear—in $\varvec{s}$ over $[0,1]^p$, so c is convex over $[0,1]^p$. $\square $

In practice, for a given support $\varvec{s}$, we evaluate the function $c(\varvec{s})$ by solving the maximization problem (9) with any of the numerical procedures presented in the previous section. In what follows, we need to calculate a sub-gradient of the function c as well. Using the dual maximizer $\varvec{\alpha }^\star (\varvec{s})$ in (9) at a support $\varvec{s}$, we can compute one at no additional computational cost. Indeed, it follows that

$$\begin{aligned} \dfrac{\partial c(\varvec{s}) }{\partial s_j} = - \dfrac{\gamma }{2} \varvec{\alpha }^{\star }(\varvec{s})^\top \varvec{X}_j \varvec{X}_j^\top \varvec{\alpha }^\star (\varvec{s}). \end{aligned}$$

2.3 Enhanced outer-approximation algorithm

We solve the convex integer optimizaton (CIO) problem (8) taking into account that we can readily compute $c(\varvec{s})$ and $\nabla c (\varvec{s})$ for any given $\varvec{s}$. None of the commercial solvers available are targeted to solve such CIO problems where there is no closed-form expression for $c(\varvec{s})$. We propose to adopt an outer-approximation approach similar to the one introduced by Duran and Grossmann (1986) for linear mixed-integer optimization problems.

We first reformulate (8) as a mixed-integer optimization problem in epigraph form:

$$\begin{aligned} \min _{\varvec{s} \in \mathcal {S}_k^p, \eta } \eta \text{ s.t. } \eta \geqslant c(\varvec{s}). \end{aligned}$$

By convexity, we can approximate $c(\varvec{s})$ by the supremum of its tangents at all points $\tilde{\varvec{s}} \in \mathcal {S}_k^p$, i.e., $c(\varvec{s}) \ge \max _{\tilde{\varvec{s}} \in \mathcal {S}_k^p} \, \left\{ c(\tilde{\varvec{s}}) + \nabla c(\tilde{\varvec{s}})^\top (\varvec{s} - \tilde{\varvec{s}}) \right\} $. For $\varvec{s} \in \mathcal {S}_k^p$, this lower approximation matches $c(\varvec{s})$ exactly. Hence, we can equivalently solve

$$\begin{aligned} \min _{\varvec{s} \in \mathcal {S}_k^p, \eta } \eta \text{ s.t. } \eta \geqslant c(\tilde{\varvec{s}}) + \nabla c(\tilde{\varvec{s}})^\top (\varvec{s} - \tilde{\varvec{s}}), \forall \tilde{\varvec{s}} \in \mathcal {S}_k^p. \end{aligned}$$

(10)

As described in Fletcher and Leyffer (1994) and Bonami et al. (2008), we find a solution to (10) by iteratively introducing the linear constraints defining (10), i.e., iteratively refining a piece-wise linear lower approximation of c. The solver structure is given in pseudocode in Algorithm 1.

Under Assumption 1, Algorithm 1 is guaranteed to terminate in a finite, yet exponential in the worst-case, number of steps and return an optimal solution of (8) (Fletcher & Leyffer, 1994). On this regard, Algorithm 1 resembles the simplex algorithm whose worst-case complexity is exponential, yet is very efficient in practice (Goldfarb, 1994).

Several enhancements have been proposed to improve the convergence speed of Algorithm 1. First, in its original form, Algorithm 1 requires solving a mixed-integer linear optimization problem at each iteration, which is computationally inefficient. Modern solvers however, such as Gurobi (Gurobi Optimization, 2016), IBM CPLEX (CPLEX, 2011), or GLPK,^{Footnote 1} can handle lazy callbacks, a feature that integrates the constraint generation procedure within a unique branch-and-bound enumeration tree, shared by all subproblems. We implemented Algorithm 1 in this fashion. In addition, decomposition schemes as Algorithm 1 benefit from performing a rich root node analysis, as advocated by Fischetti et al. (2017). In essence, a “rich” root node analysis consists of a good initial feasible solution $\varvec{s}^{(1)}$ (i.e, a good upper bound) and a set of initial constraints of the form $\eta \geqslant c(\varvec{s}^{(i)}) + \nabla c(\varvec{s}^{(i)})^\top (\varvec{s}-\varvec{s}^{(i)})$ to obtain tight lower bound as well. Regarding the warm-start $\varvec{s}^{(1)} $, we recommend using the Lasso estimator provided by the glmnet package (Friedman et al., 2013) or estimators obtained by rounding the solution of the Boolean relaxation of (8). We refer to Pilanci et al. (2015) for a theoretical analysis of the latter, and Bertsimas et al. (2020) and Atamturk and Gomez (2019) for efficient numerical algorithms to solve it. Regarding the lower-bound and the initial constraints, Fischetti et al. (2017) suggests using the constraints obtained from solving the Boolean relaxation of (8) via a similar outer-approximation procedure—in which case there are no binary variables and the technique is often referred to as Kelley’s algorithm (Kelley, 1960). We refrain from implementing this strategy in our case. Indeed, computing $c(\varvec{s})$ and $\nabla c(\varvec{s})$ reduces to solving a binary classification problem over the k features encoded by the support of $\varvec{s}$. As a result, the scalability of our approach largely relies on the fact that constraints are computed for sparse vectors $\varvec{s}$. When solving the Boolean relaxation, however, $\varvec{s}$ can be dense while satisfying the constraint $\varvec{s}^\top \mathbf{e} \leqslant k$. Instead, we compute a regularization path for the Lasso estimator up to a sparsity level of $k+1$ using the glmnet package, and initialize the outer approximation with the constraints obtained from these solutions.

Finally, we propose a stochastic version of the outer-approximation algorithm to improve the scalability with respect to n. At the incumbent solution $\varvec{s}^{(t)}$, we observe that we do not need to solve Problem (9) to optimality to obtain a valid linear lower-approximation of $c(\varvec{s})$. Indeed, any $\varvec{\alpha } \in \mathbb {R}^n \, : \, \mathbf{e} ^\top \varvec{\alpha } = 0$ yields

$$\begin{aligned} c(\varvec{s}) \geqslant -\sum _{i=1}^n {\hat{\ell }}(y_i, \alpha _i) - \dfrac{\gamma }{2} \sum _{j=1}^n s_j \varvec{\alpha }^\top \varvec{X}_j \varvec{X}_j^\top \varvec{\alpha }. \end{aligned}$$

Hence, we propose a strategy (Algorithm 2) to find a good candidate solution $\alpha $ and the corresponding lower approximation $c(\varvec{s}) \geqslant \tilde{c}(\varvec{s}^{(t)}) + \nabla \tilde{c}(\varvec{s}^{(t)})^\top (\varvec{s} - \varvec{s}^{(t)})$. Our strategy relies on the fact the primal formulation in $\varvec{w}_{\varvec{s}} \in \mathbb {R}^k$ only involves k decision variables, with $k < n$ or even $k \ll n$ in practice. As a result, one should not need the entire data set to estimate $\varvec{w}^\star (\varvec{s}^{(t)})$. Instead, we randomly select N out of n observations, estimate $\varvec{w}^\star (\varvec{s}^{(t)})$ on this reduced data set, and finally average the result over B subsamples. Typically, we take ${ N} = \max ( 10\% n, 2k)$ and $B=10$ in our experiments. Then, we estimate $\varvec{\alpha } \in \mathbb {R}^n$ by solving the first-order optimality conditions $\alpha _i \in \partial \ell (y_i, \varvec{w}^\top \varvec{x}^{(i)})$. Since our objective is to generate a constraint that tightens the current lower-approximation of $c(\varvec{s}^{(t)})$, we compare $\tilde{c}(\varvec{s}^{(t)})$ with our current estimate of $c(\varvec{s}^{(t)})$, $\eta ^{(t)}$. If $\eta ^{(t)} > \tilde{c}(\varvec{s}^{(t)})$, i.e., if the approximate constraint does not improve the approximation, we reject it and compute $c(\varvec{s}^{(t)}), \nabla c(\varvec{s}^{(t)}))$ exactly instead.

In the next section, we provide numerical evidence that Algorithm 1, both for logistic and hinge loss functions, is a scalable method to compute cardinality constrained classifiers and select more accurately features than $L_1$-based heuristics.

2.4 Practical implementation considerations

Sparse regularized classification (7) involves two hyperparameters—a regularization factor $\gamma $ and the sparsity k. For the regularization parameter $\gamma $, we fit its value using cross-validation among values uniformly distributed in the log-space: we start with a low value $\gamma _0$—typically $\gamma _0$ scaling as $1 / \max _i \Vert \varvec{x}^{(i)} \Vert ^2$ as suggested in Chu et al. (2015)—and inflate it iteratively by a factor two. We similarly tune k by simple hold-out cross-validation over a range of values. We use out-of-sample Area Under the receiving operator Curve as the validation criterion (to maximize). Although we did not implement it, Kenney et al. (2018) present a general blueprint consisting of warm-start strategies and a bisection search procedure that can tangibly accelerate the overall cross-validation loop.

3 Numerical experiments: scalability and support recovery

In this section, we evaluate the numerical performance of our method both in terms of scalability and quality of the features selected.

The computational tests were performed on a computer with Xeon @2.3 GhZ processors, 1 core, 8 GB RAM. Algorithms were implemented in Julia 1.0 (Lubin & Dunning, 2015), a technical computing language, and Problem (8) was solved with Gurobi 8.1.0 (Gurobi Optimization, 2016). We interfaced Julia with Gurobi using the modeling language JuMP 0.21.1 (Dunning et al., 2017). Our code is available on GitHub.^{Footnote 2} Since our sparse regularized formulation (7) comprises 2 hyperparameters, $\gamma $ and k, that control the degree of regularization and sparsity respectively, we compare our method with the ElasticNet formulation

$$\begin{aligned} \min _{\varvec{w} \in \mathbb {R}^p,b \in \mathbb {R}} \sum _{i=1}^n \ell (y_i, \varvec{w}^\top \varvec{x}^{(i)} +b) + \lambda \left[ (1-\alpha ) \Vert \varvec{w} \Vert _1 + \alpha \Vert \varvec{w} \Vert _2^2 \right] , \end{aligned}$$

(11)

which similarly contains 2 hyper parameters and can be computed efficiently by the glmnet package (Friedman et al., 2013). For a fair comparison, we cross-validate $\lambda $ and $\alpha $ using the same procedure as $\gamma $ and k described in Sect. 2.4. Our intention in this section is first and foremost to assess the validity of our numerical algorithms for computing the cardinality-constrained estimator (7). We refer to Bertsimas et al. (2020) and Hastie et al. (2020) for extensive numerical experiments and discussions on the statistical and practical relevance of each formulation, as well as comparison with non-convex penalties and alternatives.

3.1 Support recovery on synthetic data

We first consider synthesized data sets to assess the feature selection ability of our method compared it to a state-of-the-art $L_1$-based estimators.

3.1.1 Methodology

We draw $\varvec{x}^{(i)} \sim \mathcal {N}(0_p, \Sigma ), i=1,\dots ,n$ independent realizations from a p-dimensional normal distribution with mean $0_p$ and covariance matrix $\Sigma _{ij} = \rho ^{|i-j|}$. Columns of $\varvec{X}$ are then normalized to have zero mean and unit variance. We randomly sample a weight vector $\varvec{w}_{true} \in \lbrace -1, 0, 1 \rbrace ^p$ with exactly k nonzero coefficients. We draw $\varepsilon _i, i=1,\dots ,n,$ i.i.d. noise components from a normal distribution scaled according to a chosen signal-to-noise ratio

$$\begin{aligned} \sqrt{SNR} = \Vert \varvec{X} \varvec{w}_{true} \Vert _2 / \Vert \varvec{\varepsilon } \Vert _2. \end{aligned}$$

Finally, we construct $y_i$ as $y_i = \text {sign} \left( \varvec{w}_{true}^\top \varvec{x}^{(i)} + \varepsilon _i > 0\right) .$ This methodology enables us to produce synthetic data sets where we control the sample size n, feature size p, sparsity k, feature correlation $\rho $ and signal-to-noise ratio SNR.

3.1.2 Support recovery metrics

Given the true classifier $\varvec{w}_{true}$ of sparsity $k_{true}$, we assess the correctness of a classifier $\varvec{w}$ of sparsity k by its accuracy, i.e., the number of true features it selects

$$\begin{aligned} A(\varvec{w}) = |\lbrace j : w_j \ne 0, w_{true,j} \ne 0 \rbrace | \in \{0,\ldots ,k_{true}\}, \end{aligned}$$

and the false discovery, i.e., the number of false features it incorporates

$$\begin{aligned} F(\varvec{w}) = |\lbrace j : w_j \ne 0, w_{true,j} = 0 \rbrace | \in \{ 0,\ldots ,p\}. \end{aligned}$$

Obviously, $A(\varvec{w}) + F(\varvec{w}) = |\lbrace j : w_j \ne 0 \rbrace | = k$. A classifier $\varvec{w}$ is said to perfectly recover the true support if it selects the truth ($A(\varvec{w})=k_{true}$) and nothing but the truth ($F(\varvec{w})=0$ or equivalently $k = k_{true}$).

3.1.3 Selecting the truth ...

We first compare the performance of our algorithm for sparse regression with ElasticNet, when both methods are given the true number of features in the support $k_{true}$. As mentioned in the introduction, a key property in this context for any best subset selection method, is that it selects the true support as sample size increases, as represented in Fig. 2. From that perspective, both methods demonstrate a similar convergence: As n increases, both classifiers end up selecting the truth, with Algorithm 1 needing a somewhat smaller number of samples than Lasso.

Apart from the absolute number of true/false features selected, one might wonder whether the features selected are actually good features in terms of predictive power. In this metric, sparse regression significantly outperforms the $L_1$-based classifier, both in terms of Area Under the Curve (AUC) and misclassification rate, as shown on Fig. 3, demonstrating a clear predictive edge of exact sparse formulation.

In terms of computational cost, Lasso remains the gold standard: computational times for glmnet range from 0.05 to 0.48 s, while CIO terminates between 0.03 and 114.5 s on the same instances. Figure 4 represents the number of constraints (left panel) and computational time (right panel) required by our outer-approximation algorithm as the problem size n increases, $\gamma \times n$ being fixed. For low values of n, the number of constraints is in the thousands. Surprisingly, computational effort does not increase with n. On the contrary, having more observations reduces the number of constraints down to less than twenty. Intuitively, as n grows, there is more signal and the problem becomes easier. Computational time evolves similarly: the algorithm reaches the time limit (here, 3 minutes) when n is low, but terminates in a few seconds for high values. For sparse linear regression, Bertsimas and Van Parys (2020) observed a similar, yet even sharper, phenomenon which they referred to as a phase transition in computational complexity. The threshold value, however, increases with $\gamma $. In other words, when n is fixed, computational time increases as $\gamma $ increases (see Fig. 9 in Sect. 3.2), which corroborates the intuition that in the limit $\gamma \rightarrow 0$, $\varvec{w}^\star = 0$ is obviously optimal, while the problem can be ill-posed as $\gamma \rightarrow +\infty $. Since the right regularization parameter is unknown a priori, one needs to test high but sometimes relevant values of $\gamma $ for which the time limit is reached and the overall procedures terminates in minutes, while glmnet computes the entire regularization path for Lasso in less than a second. As for the choice of the time limit, it does not significantly impact the performance of the algorithm: As shown on Fig. 5, the algorithm quickly finds the optimal solution and much of the computational time is spent improving the lower bound, i.e., proving the solution is indeed optimal. We will further explore the numerical scalability of the outer-approximation algorithm in Sect. 3.2.

3.1.4 ... and nothing but the truth

In practice, however, the length of the true support $k_{true}$ is unknown a priori and is to be determined using cross-validation.

Given a data set with a fixed number of samples n and features p, we compute classifiers with different values of sparsity parameter k and choose the value which leads to the best accuracy on a validation set. Irrespective of the method, AUC as a function of sparsity k should have an inverted-U shape: if k is too small, not enough features are taken into account to provide accurate predictions. If k is too big, the model is too complex and overfits the training data. Hence, there is some optimal value $k^\star $ which maximizes validation AUC (equivalently, one could use misclassification rate instead of AUC as a performance metric). Figure 6 represents the evolution of the AUC on a validation set as sparsity k increases for Lasso and the exact sparse logistic regression. The exact CIO formulation leads to an optimal sparsity value $k^\star _{CIO}$ which is much closer to the truth than $k^\star _{Lasso}$, and this observation remains valid when n increases as shown on the left panel of Fig. 7. In addition, Fig. 7 also exposes a major deficiency of Lasso as a feature selection method: even when the number of samples increases, Lasso fails to select the relevant features only and returns a support $k^\star _{Lasso}$ much larger than the truth whereas $k^\star _{CIO}$ converges to $k_{true}$ quickly as n increases, hence selecting the truth and nothing but the truth.

As mentioned in the introduction, these shortcomings of Lasso-based feature selection have already been documented and motivated alternatives such as non-convex penalties. Extensive numerical comparisons, in regression and classification settings, presented Bertsimas et al. (2020) corroborate our findings, and demonstrate that non-convex penalties and $L_0$-based formulations both provide a substantial improvement over $L_1$-regularization. Yet, in their analysis, discrete optimization methods do have some edge over the non-convex penalties SCAD and MCP in terms of false discoveries.

3.2 Scalability on synthetic data

We use the opportunity of working with synthetic data to numerically assess how computational time is impacted by the size of the problem—as measured by the number of samples, n, and features, p—and the two hyper parameters $\gamma $ and k.

Impact of sample size n: As previously observed, computational time does not increase as n increases. On the contrary, for a fixed value of $\gamma \times n$, computational time decreases as n grows. As displayed on Fig. 4 for Hinge loss, this phenomenon is observed for both the standard and stochastic version of the outer-approximation algorithm. Observe that the stochastic variant slightly increases the number of constraints for low values of n, yet, since generating a constraint is less expensive, computational time is reduced by a factor 2 for the hardest instances. In this regard, the stochastic version of the outer-approximation algorithm is more beneficial when using the logistic loss compared with the Hinge loss, as depicted on Fig. 8. Indeed, the number of constraints is reduced by a factor of 5 and the computational time by a factor of 10. In our understanding, this substantial improvement in the logistic case is due to the fact that the relationship $\alpha _i \in \partial \ell (y_i, {\varvec{x}^{(i)}}^\top \varvec{w})$ uniquely defines $\alpha $ and hence leads to higher quality constraints than in the case of the Hinge loss where $\ell $ is not differentiable at 0.

Impact of regularization $\gamma $: As $\gamma $ increases, however, computational time and the number of constraints required sharply increases, as depicted in Fig. 9. This phenomenon is consistently observed for all problem sizes and loss functions. Since the proper value of $\gamma $ is unknown a priori and needs to be cross-validated, this could result in prohibitive overall computational time. However, two aspects need to be kept in mind: First, computational time, here, corresponds to time needed to certify optimality, although high-quality solutions—or even the optimal solution—can be found much faster. Second, we do not implement any warm-starting strategy between instances with different values of $\gamma $ within the grid search. Smarter grid search procedures (Kenney et al., 2018) could further accelerate computations.

Impact of feature size p and sparsity level k: The master problem (8) obviously depends on the number of features p and the sparsity level k through the feasible set $\mathcal {S}_k^p$. As a first-order approximation, the combinatorial complexity of the problem can be captured by the total number of potential supports of size k, $\left( {\begin{array}{c}p\\ k\end{array}}\right) $. So one could expect a linear dependency on p for k fixed, and an exponential dependency on k, p being fixed. This intuition is largely confirmed empirically, as reported in Table 3: In low sparsity regimes ($k=5$), we can solve instances with up to 50, 000 features under a minute, while denser problems are not solved to optimality after 30 minutes even for $p=5,000$. Our experiments also highlight the benefits from having a good initialization solution $s_1$ and using the stochastic constraint generation technique. As displayed in Table 3, together, these enhancements reduce computational time up to a factor of 10.

Table 3 Computational time (in seconds) of Algorithm 1 for large p and varying k

Full size table

3.3 Experiments on real-world data sets

We now illustrate the practical implications of sparse classification algorithms on real-world data, of various size and dimensions.

3.3.1 Over-determined regime $n>p$

We consider data sets from the UC Irvine Machine Learning Repository (available at https://archive.ics.uci.edu/ml/datasets.html), split them into a training and a test set ($80\% / 20\%$), calibrate a classifier on the training data, using cross-validation to fit the hyper-parameters k and $\gamma $ in the sparse case and $\lambda $ and $\alpha $ in the ElasticNet (Enet) case, and compare AUC on the test set for both methods. Characteristics of these data sets and experimental results are given in Table 4. Experiments clearly demonstrate that (a) our outer-approximation algorithm scales to data sets of practical relevance, (b) cardinality constrained formulations generally lead to sparser classifiers than ElasticNet (on 7 out 8 data sets) with comparable predictive power, with ElasticNet being more accurate on 3 out of 8 data sets. It suggests that features selected by our discrete optimization carry more relevant information that those obtained by ElasticNet. Yet, since these data sets contain a limited number of features p, they may not make a strong case for exact sparse classification methods compared to Lasso, the original problem being relatively sparse already. Therefore, we investigate the under-determined regime $n < p$ in the next section.

Table 4 Comparative results of $L_1$-regularized and $L_0$-constrained estimators on data sets from UCI ML Repository

Full size table

3.3.2 Under-determined regime $p>n$

Performance of sparse classification in the under-determined regime is crucial for two reasons: Since the amount of data available is limited, such a regime favors estimators which can efficiently recover the truth even when the sample size n is small with regard to p. More importantly, under-determined regimes occur in highly impactful applications, such as medical research. To show the direct implications of our method on this field of research, we used data from The Cancer Genome Atlas Research Network (available at http://cancergenome.nih.gov) on $n = 1145$ lung cancer patients. Actually, tumor types often have distinct subtypes, each of them having its own genetic signature. In our sample for instance, 594 patients ($51.9\%$) suffered from Adenocarcinoma while the remaining 551 patients ($48.1\%$) suffered from Squamous Cell Carcinoma. The data set consists of gene expression data for $p=14{,}858$ genes for each patient. We apply both sparse and Lasso classification to identify the most relevant genes to discriminate between the two subtypes and compile the results in Table 5. The first conclusion to be drawn from our results is that the exact sparse classification problem scales to problems of such size, which is far above data sets usually encountered in the gene selection academic literature. In addition, explicitly constraining sparsity of the classifier leads to much sparser, thus more interpretable results with little compromise on the predictive power: Sparse SVM reaches an AUC of 0.977 with only 38 features while the $L_1$-regularized classifier selects ten times more genes for a $+0.005$ gain in AUC.

Table 5 Comparative results of $L_1$-regularized and $L_0$-constrained estimators on the Lung Cancer data

Full size table

3.4 Summary and guidelines

All in all, these experiments demonstrate that (a) the $L_2$-regularized $L_0$-constrained formulation (7) provides an edge over $L_1$-based estimators such as Lasso or ElasticNet in terms of support recovery (accuracy and false detection rate) and downstream predictive power (AUC and misclassification rate), (b) though NP-hard in the worst case, solving the discrete optimization problem (7) can be routinely achieved within minutes for large problem sizes of interest (up to $p \approx 10{,}000\,\hbox {s}$). Though reasonable for most applications, these computational times are still one to two orders of magnitude higher than those of Lasso, which remains the gold standard in terms of scalability. In particular, the regularization parameter $\lambda $ of (2) can be cross-validated at no additional computational cost. Designing efficient cross-validation procedures for cardinality-based estimators is an active area of research (Kenney et al., 2018; Hazimeh & Mazumder, 2020). In practice, for a large data set, one could use Lasso as a feature screening method to exclude non-relevant covariates before applying a more computationally demanding feature selection method on a reduced subset of features.

4 Towards a theoretical understanding of asymptotic support recovery

As mentioned in the introduction, a large body of literature has provided information-theoretic limitations (Wainwright, 2009a; Wang et al., 2010; Gamarnik & Zadik, 2017) or theoretical guarantees of support recovery by some specific algorithms (Wainwright, 2009b; Pilanci et al., 2015) for sparse linear regression, which supported the empirical observation of a phase transition (Donoho & Stodden, 2006; Bertsimas & Van Parys, 2020). In classification, however, we did not observe such a sharp phenomenon. For instance, we observed a smooth convergence in accuracy in Fig. 2, while Bertsimas et al. (2020, Figure 1) experienced a sharp phase transition in low-noise linear regression. In this section, we provide some intuition on the theoretical mechanisms involved for perfect support recovery specific to classification. We prove an information-theoretic sufficient condition to achieve perfect support recovery and compare it with analogous bounds for linear regression, as well recent theoretical results on 1-bit compressed sensing (Jacques et al., 2013; Scarlett & Cevher, 2017).

4.1 Notations and assumptions

To simplify the analysis, we consider a stylized framework where the data is generated according to the equation

$$\begin{aligned} y_i = \text {sign}\left( {\varvec{x}^{(i)}}^\top \varvec{w}^{\star } + \varepsilon _i\right) , \end{aligned}$$

where $x_i$ are i.i.d. standard Gaussian random variables, $\varepsilon _i \sim \mathcal {N}(0,\sigma ^2)$, $\varvec{w}^{\star } \in \mathbb {R}^p$ with $\Vert \varvec{w}^{\star }\Vert _0=k$ and $\text {sign}(\lambda ) = 1$ if $\lambda >0$, $-1$, otherwise. Given a classifier $\varvec{w}$ predictions will be made according to the rule

$$\begin{aligned} {\hat{y}}_i(\varvec{w}) = \text {sign}\left( {\varvec{x}^{(i)}}^\top \varvec{w} \right) . \end{aligned}$$

It is obvious from the definition that for any $\varvec{w} \in \mathbb {R}^p$, $\text {sign}\left( \lambda \varvec{x}^\top \varvec{w} \right) = \text {sign}\left( \varvec{x}^\top \varvec{w} \right) $, for any $\lambda >0$. In other words, predictions made by a classifier are insensitive to scaling. As a consequence, the difference in prediction between two classifiers should demonstrate the same invariance and indeed only depends on the angle between the classifiers as formally stated in Lemma 1 (proof in “Appendix”). This observation does not hold for $\text {sign}\left( \varvec{x}^\top \varvec{w} + \varepsilon \right) $, because of the presence of noise.

Lemma 1

Assume $\varvec{x} \sim \mathcal {N}(\varvec{0}_p,\varvec{I}_p)$ and $\varepsilon \sim \mathcal {N}(0,\sigma ^2)$ are independent. Then, for any $\varvec{w},\varvec{w}' \in \mathbb {R}^p$ we have that

$$\begin{aligned} \mathbb {P}\left( \text {sign}\left( \varvec{x}^\top \varvec{w} \right) \ne \text {sign}\left( \varvec{x}^\top \varvec{w}'+ \varepsilon \right) \right)&= \dfrac{1}{\pi } \arccos \left( \dfrac{\varvec{w}^\top \varvec{w}'}{\Vert \varvec{w}\Vert \sqrt{\Vert \varvec{w}'\Vert ^2 + \sigma ^2}} \right) . \end{aligned}$$

(12)

We consider classifiers with binary entries $\varvec{w}^\star \in \{0,1\}^p$ only, similar to the work of Gamarnik and Zadik (2017) on sparse binary regression. Moreover, we learn the optimal classifier from the data by solving the minimization problem

$$\begin{aligned} \min _{\varvec{w} \in \{0,1\}^p} \dfrac{1}{n} \sum _{i=1}^n \mathbf{1} \left( {\hat{y}}_i(\varvec{w}) \ne y_i \right) \text{ s.t. } \Vert \varvec{w} \Vert _0 = k, \end{aligned}$$

(13)

where the loss function above corresponds to the empirical misclassification rate. Even though it is not a tractable loss function choice in practice, it demonstrates some interesting theoretical properties: it isolates the probabilistic model used to generate the data from the behavior of the optimal value. Indeed, for any classifier $\varvec{w}$, the empirical misclassification rate $\sum _{i=1}^n \mathbf{1} \left( {\hat{y}}_i(\varvec{w}) \ne y_i \right) $ follows a Binomial distribution, as the covariate data are independent. In addition, the problem (13) can be considered as the authentic formulation for binary classification, while other loss functions used in practice such as Hinge and logistic loss are only smooth proxies for the misclassification rate, used for their tractability and statistical consistency (Steinwart, 2002; Zhang, 2004).

4.2 Intuition and statement on sufficient conditions

For a given binary classifier $w \in \{0,1\}^p$ of sparsity k, the accuracy of the classifier (the number of true features it selects) is equal to the inner product of $\varvec{w}$ with $\varvec{w}^\star $:

$$\begin{aligned} A(\varvec{w}) = |\lbrace j : w_j \ne 0, \varvec{w}^\star _j \ne 0 \rbrace | = |\lbrace j : w_j =1, \varvec{w}^\star _j = 1 \rbrace | = \textstyle \sum _j w_j {w}^\star _j = \varvec{w}^\top \varvec{w}^\star . \end{aligned}$$

Consider a binary sparse classifier $\varvec{w}$, i.e., $\Vert \varvec{w} \Vert _0=k$, with accuracy $\varvec{w}^\top \varvec{w}^\star = \ell $. Then, it follows that the indicators $\mathbf{1} \left( {\hat{y}}_i(\varvec{w}) \ne y_i \right) $ are distributed as independent Bernoulli random variable sharing the success parameter

$$\begin{aligned} q (\ell ;k,\sigma ^2)&:= \mathbb {P}\left( \text {sign}\left( {\varvec{x}^{(i)}}^\top \varvec{w} \right) \ne \text {sign}\left( {\varvec{x}^{(i)}}^\top \varvec{w}^\star + \varepsilon _i \right) \right) ,\\&\displaystyle = \dfrac{1}{\pi } \text {arccos}\left( \dfrac{\ell }{\sqrt{k(k+\sigma ^2)}} \right) . \end{aligned}$$

The success parameter $q_\ell =q (\ell ;k,\sigma ^2)$ can be checked to be a decreasing concave function of $\ell $. That is, the more accurate our binary classifier $\varvec{w}$, the smaller the probability of misclassification. The previous should come as no surprise to anybody. The central limit theorem states that

$$\begin{aligned} \sqrt{n} \left[ \dfrac{1}{n} \sum _{i=1}^n \mathbf{1} \left( {\hat{y}}_i(\varvec{w}) \ne y_i \right) - q_l \right] \rightarrow \mathcal {N}\left( 0, {q_\ell (1-q_\ell )}\right) , \end{aligned}$$

as $n \rightarrow \infty $. In words, asymptotically in n, a given classifier $\varvec{w}$ will have an empirical misclassification rate close to $q_\ell $. Since $q_\ell $ is decreasing in $\ell $, the truth $\varvec{w}^\star $ for which $\ell =k$ should minimize the misclassification error among all possible supports. As observed empirically, the number of true features selected corresponds to the true sparsity when n is sufficiently large (see Fig. 2). Intuitively, n should be high enough such that the variance on the performance of each support $\tfrac{q_\ell (1-q_\ell )}{n}$ is small, taken into account that there are $\left( {\begin{array}{c}k\\ \ell \end{array}}\right) \left( {\begin{array}{c}p-k\\ k-\ell \end{array}}\right) $ possible supports with exactly $\ell $ correct features. In this case, it should be rather unlikely that the binary classifier with the smallest empirical misclassification rate is anything other than the ground truth $\varvec{w}^\star $. We will now make the previous intuitive argument more rigorous.

Because we aim at minimizing the misclassification rate, we are guaranteed to recover the true support $\varvec{w}^\star $ if there exists no other support $\varvec{w}$ with an empirical performance at least as good as the truth.

Theorem 3

We assume the data is generated according to the equation $ y_i = \text {sign}\left( {\varvec{x}^{(i)}}^\top \varvec{w}^{\star } + \varepsilon _i\right) $, where $x_i$ are i.i.d. standard random variables, $\varepsilon _i \sim \mathcal {N}(0,\sigma ^2)$, $\varvec{w}^{\star } \in \{0,1\}^p$ with $\Vert \varvec{w}^{\star }\Vert _0=k$. Given a classifier $\varvec{w}$, we denote ${\hat{y}}_i(\varvec{w}) := \text {sign}\left( {\varvec{x}^{(i)}}^\top \varvec{w} \right) $. For any two binary classifiers $\varvec{w}^1$ and $\varvec{w}^2$, let $\Delta (\varvec{w}^1,\varvec{w}^2)$ denote the difference between their empirical performance, i.e.,

$$\begin{aligned} \Delta (\varvec{w}^1,\varvec{w}^2) := \dfrac{1}{n}\sum _{i=1}^n \mathbf{1} \left( {\hat{y}}_i(\varvec{w}^1) \ne y_i \right) - \mathbf{1} \left( {\hat{y}}_i(\varvec{w}^2) \ne y_i \right) . \end{aligned}$$

Assume that $p \geqslant 2k$. Then there exist a threshold $n_0 >0$ such that for any $n > n_0$,

$$\begin{aligned} \mathbb {P}\left( \exists \varvec{w} \ne \varvec{w}^\star , \, \varvec{w} \in \{0,1\}^p, \, \Vert \varvec{w}\Vert _0=k \text{ s.t. } \Delta (\varvec{w},\varvec{w}^\star ) \leqslant 0 \right) \leqslant e^{-\frac{n-n_0}{2\pi ^2 k (\sigma ^2+2)}}. \end{aligned}$$

Moreover, we have $n_0 < C \left( 2 + \sigma ^2\right) k \log (p-k)$ for some absolute constant $C > 0$.

In other words, if $n \geqslant C \left( 2 + \sigma ^2\right) k \log (p-k) + 2 \pi k (\sigma ^2 +2) \log (1/\delta )$ for some $\delta \in (0,1)$, then

The proof of Theorem 3 is given in “Appendix”. From a high-level perspective, our sufficient condition on n relies on two ingredients: (a) the union bound, which accounts for the log-dependence in $p-k$ and which is found in similar results for regression and signal processing (Wainwright, 2009a; Wang et al., 2010) and (b) controlling of the individual probability $\mathbb {P}\left( \Delta (\varvec{w},\varvec{w}^\star ) \leqslant 0 \right) $ using large deviation bounds, which depends on the size of $\varvec{w}$ and $\varvec{w}^\star $, k, and the noise $\sigma ^2$.

Before comparing the claim of Theorem 3 with similar results from statistics and signal processing, let us remember that k, the sparsity level of the true classifier $\varvec{w}^\star $, is assumed to be known. To put this assumption in perspective with our previous simulations, our statement only concerns the best achievable accuracy when k is fixed.

4.3 Discussion

For regression, Gamarnik and Zadik (2017) proved that support recovery was possible from an information-theoretic point of view if

$$\begin{aligned} n > n^\star = \dfrac{2k \log p}{\log \left( \dfrac{2k}{\sigma ^2}+1\right) }. \end{aligned}$$

Note that our threshold $n_0$ for classification does not vanish in the low noise $\sigma ^2$ setting. This observation is not surprising: the output $y_i$ depending only on the sign of $\varvec{w}^\top \varvec{x}^{(i)}$ can be considered as inherently noisy. An observation already made by Scarlett and Cevher (2017).

As mentioned earlier, recent works in 1-bit compressed sensing have developed algorithms to recover sparse classifiers which provably recover the truth as the number of samples increases (Gupta et al., 2010; Plan & Vershynin, 2013a, b). In particular, Plan and Vershynin (2013b) formulate the problem of learning $\varvec{w}$ from the observations as a convex problem and establish bounds on the $\ell _2$ error $\Vert \varvec{w}-\varvec{w}^\star \Vert _2$, in the case of logistic regression. In particular, they show that $n > C' \, k \log (2p/k)$ is sufficient to achieve an $\ell _2$ reconstruction error which is bounded. In contrast with our result, they exhibited an algorithm able to achieve low error in terms of $\ell _2$ distance with fewer samples than what we proved to be an information-theoretic sufficient condition. Yet, this does not trivialize our result. $\ell _2$ consistency is a related but distinct criterion for support recovery. Even a good estimate of $\varvec{w}^\star $ in terms of $\ell _2$ distance might have a very different support. The fact that their rate for $\ell _2$ consistency is faster than the rate for support recovery suggests that achieving a good $\ell _2$ error is presumably an easier problem. Similar observations were made in the case of linear regression in Wainwright (2009a) and intuitively explained: given a good support recovery procedure, one can restrict the number of features, use standard methods to estimate the values of the $w_j$’s and hope to achieve a good $\ell _2$ error, while the reverse might not be true. Loh et al. (2017) prove that non-convex regularization achieves perfect support recovery with high probability if $n \gtrsim k \log p$, which matches our bound up to a multiplicative constant. Together, our results suggest that the information-theoretic bounds are achievable in practice via a discrete optimization formulation (1) or non-convex penalties (3). Yet, they define $\varvec{w}^\star $ implicitely as the solution $\min _{\varvec{w}} \mathbb {E}[\mathcal {L}_n(\varvec{w})]$, where $\mathcal {L}_n$ is the empirical loss over n observations, whereas we posit a generative model which explicitly depends on $\varvec{w}^\star $. As a result, their model has the merit of being more general and covering many different statistical learning problems (e.g., linear regression, generalized linear models, sparse inverse covariance estimation). Unlike ours, however, it does isolate explicitly the impact of noise and does not capture the loss of information inherent in classification where discrete signals are generated from continuous inputs.

Finally, Scarlett and Cevher (2017) proved similar sufficient conditions for support recovery in 1-bit compressed sensing and accompanied them with necessary conditions and constants $C >0$ as tight as possible. To that extent, their result (Corollary 3) might appear stronger than ours. However, their condition is valid only for low sparsity and low signal-to-noise regimes. Theorem 3, on the other hand, remains valid for all values of k and $\sigma $. In particular, it holds even if k scales linearly in p and $\sigma $ is low: a regime for which Scarlett and Cevher (2017) provide necessary (Corollary 4) but no sufficient conditions. More precisely, they prove that perfect support recovery cannot be achieved if the number of samples is below a threshold scaling as $p \sqrt{\log p}$, while our bound scales as $p \log p$ in this regime. Combined together, there is a $\sqrt{\log p}$ factor between necessary and sufficient conditions, which hints at the absence of a clear phase transition in this setting. As illustrated in Fig. 1, there is an intermediate sample size regime where support recovery is neither provably impossible nor achievable. Of course, this regime could be a deficiency of the proof techniques used, but in any case, we believe it constitutes an exciting direction for future research.

5 Conclusion

In this paper, we have proposed a tractable binary convex optimization algorithm for solving sparse classification. Although the problem is theoretically NP-hard, our algorithm scales for logistic regression and SVM in problems with n, p in 10, 000s within minutes. We also introduce a stochastic version of the constraint generation process which further reduces computational time by a factor 2 to 10 on numerical experiments. Comparing our method and Lasso-based estimates, we observe empirically that as n increases, the number of true features selected by both methods converges to the true sparsity. We support our observations with information-theoretic sufficient conditions, stating that support recovery is achievable as soon as $n > n_0$, with $n_0 < C \left( 2 + \sigma ^2\right) k \log (p-k)$ for some positive constant C. This sufficient information-theoretic condition echoes and complements existing results in the literature for 1-bit compressed sensing. Apart from accuracy, the exact sparse formulation has an edge over Lasso in the number of false features: as n increases, the number of false features selected by our method converges to zero, while this is not observed for Lasso. This phenomenon is also observed for classifying the type of cancer using gene expression data from the Cancer Genome Atlas Research Network with $n=1145$ lung cancer patients and $p=14{,}858$ genes. Sparse classification using logistic and hinge loss returns a classifier based on 90 and 38 genes respectively compared with 378.4 genes for ElasticNet with similar predictive accuracy.

Availability of data and materials

Data for the experiments are available at https://archive.ics.uci.edu/ml/datasets.html and http://cancergenome.nih.gov.

Code availability

Code available at https://github.com/jeanpauphilet/SubsetSelectionCIO.jl.

Notes

References

Atamturk, A., & Gomez, A. (2019). Rank-one convexification for sparse regression. arXiv preprint arXiv:1901.10334
Bach, F. (2009). High-dimensional non-linear variable selection through hierarchical kernel learning. arXiv preprint arXiv:0909.0844
Bach, F., Jenatton, R., Mairal, J., & Obozinski, G. (2012). Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning, 4(1), 1–106.
Bertsekas, D. P. (1982). Projected newton methods for optimization problems with simple constraints. SIAM Journal on Control and Optimization, 20(2), 221–246.
Article MathSciNet MATH Google Scholar
Bertsimas, D., & Copenhaver, M. S. (2018). Characterization of the equivalence of robustification and regularization in linear, median, and matrix regression. European Journal of Operations Research, 270(3), 931–942.
Article MATH Google Scholar
Bertsimas, D., & King, A. (2017). Logistic regression: From art to science. Statistical Science, 32, 367–384.
Article MathSciNet MATH Google Scholar
Bertsimas, D., & Van Parys, B. (2020). Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. Annals of Statistics, 48(1), 300–323.
Article MathSciNet MATH Google Scholar
Bertsimas, D., King, A., & Mazumder, R. (2016). Best subset selection via a modern optimization lens. Annals of Statistics, 44(2), 813–852.
Article MathSciNet MATH Google Scholar
Bertsimas, D., Pauphilet, J., & Van Parys, B. (2020). Sparse regression: Scalable algorithms and empirical performance. Statistical Science, 35(4), 555–578.
MathSciNet MATH Google Scholar
Bonami, P., Biegler, L. T., Conn, A. R., Cornuéjols, G., Grossmann, I. E., Laird, C. D., et al. (2008). An algorithmic framework for convex mixed integer nonlinear programs. Discrete Optimization, 5(2), 186–204.
Article MathSciNet MATH Google Scholar
Boufounos, P. T., & Baraniuk R. G. (2008). 1-bit compressive sensing. In 42nd annual conference on information sciences and systems, 2008. CISS 2008. (pp. 16–21). IEEE.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Book MATH Google Scholar
Breheny, P., & Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5(1), 232.
Article MathSciNet MATH Google Scholar
Bühlmann, P. (2011). Invited discussion on “regression shrinkage and selection via the lasso: A retrospective” (r. tibshirani). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3), 273–282.
Article MathSciNet Google Scholar
Calamai, P. H., & Moré, J. J. (1987). Projected gradient methods for linearly constrained problems. Mathematical Programming, 39(1), 93–116.
Article MathSciNet MATH Google Scholar
Chen, Y., Ye, Y., & Wang, M. (2019). Approximation hardness for a class of sparse optimization problems. Journal of Machine Learning Research, 20, 1–27.
MathSciNet MATH Google Scholar
Chu, B.-Y., Ho, C.-H., Tsai, C.-H., Lin, C.-Y., & Lin, C.-J. (2015). Warm start for parameter selection of linear classifiers. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 149–158). ACM.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Article MATH Google Scholar
Cramér, H. (2016). Mathematical methods of statistics (PMS-9) (Vol. 9). Princeton University Press.
Google Scholar
Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1(1–4), 131–156.
Article Google Scholar
Dedieu, A., Hazimeh, H., & Mazumder, R. (2021). Learning sparse classifiers: Continuous and mixed integer optimization perspectives. Journal of Machine Learning Research, 22(135), 1–47.
MathSciNet MATH Google Scholar
Donoho, D., & Stodden, V. (2006). Breakdown point of model selection when the number of variables exceeds the number of observations. In International joint conference on neural networks (pp. 1916–1921). IEEE.
Donoho, D., & Tanner, J. (2009). Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 367(1906), 4273–4293.
MathSciNet MATH Google Scholar
Dunning, I., Huchette, J., & Lubin, M. (2017). JuMP: A modeling language for mathematical optimization. SIAM Review, 59(2), 295–320.
Article MathSciNet MATH Google Scholar
Duran, M. A., & Grossmann, I. E. (1986). An outer-approximation algorithm for a class of mixed-integer nonlinear programs. Mathematical Programming, 36(3), 307–339.
Article MathSciNet MATH Google Scholar
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Article MathSciNet MATH Google Scholar
Fan, J., & Song, R. (2010). Sure independence screening in generalized linear models with np-dimensionality. Annals of Statistics, 38(6), 3567–3604.
Article MathSciNet MATH Google Scholar
Fan, J., Liu, H., Sun, Q., & Zhang, T. (2018). I-lamm for sparse learning: Simultaneous control of algorithmic complexity and statistical error. Annals of Statistics, 46(2), 814.
Article MathSciNet MATH Google Scholar
Fischetti, M., Ljubić, I., & Sinnl, M. (2017). Redesigning benders decomposition for large-scale facility location. Management Science, 63(7), 2146–2162.
Article Google Scholar
Fletcher, R., & Leyffer, S. (1994). Solving mixed integer nonlinear programs by outer approximation. Mathematical Programming, 66(1–3), 327–349.
Article MathSciNet MATH Google Scholar
Frank, L. L., & Friedman, J. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109–135.
Article MATH Google Scholar
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1.
Article Google Scholar
Friedman, J., Hastie, T., Tibshirani, R., et al. (2013). GLMNet: Lasso and elastic-net regularized generalized linear models. r package version 1.9-5.
Gamarnik, D., & Zadik, I. (2017). High-dimensional regression with binary coefficients. Estimating squared error and a phase transition. In Conference on Learning Theory (pp. 948–953). PMLR.
Goldfarb, D. (1994). On the complexity of the simplex method. In Advances in optimization and numerical analysis (pp. 25–38). Springer.
Gupta, A., Nowak, R., & Recht, B. (2010). Sample complexity for 1-bit compressed sensing and sparse classification. In 2010 IEEE international symposium on information theory proceedings (ISIT) (pp. 1553–1557). IEEE.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1–3), 389–422.
Article MATH Google Scholar
Hastie, T., Tibshirani, R., & Tibshirani, R. (2020). Best subset, forward stepwise or lasso? Analysis and recommendations based on extensive comparisons. Statistical Science, 35(4), 579–592.
MathSciNet MATH Google Scholar
Hazimeh, H., & Mazumder, R. (2020). Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Operations Research, 68(5), 1517–1537.
Article MathSciNet MATH Google Scholar
Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi, S. S., & Sundararajan, S. (2008). A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th international conference on machine learning (pp. 408–415). ACM.
Hunter, D., & Li, R. (2005). Variable selection using MM algorithms. Annals of Statistics, 33(4), 1617.
Article MathSciNet MATH Google Scholar
IBM ILOG CPLEX. (2011). Cplex users manual.
Inc. Gurobi Optimization. (2016). Gurobi optimizer reference manual. http://www.gurobi.com
Jacques, L., Laska, J. N., Boufounos, P. T., & Baraniuk, R. G. (2013). Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Transactions on Information Theory, 59(4), 2082–2102.
Article MathSciNet MATH Google Scholar
Keerthi, S. S., Duan, K. B., Shevade, S. K., & Poo, A. N. (2005). A fast dual algorithm for kernel logistic regression. Machine Learning, 61(1–3), 151–165.
Article MATH Google Scholar
Kelley, J. E., Jr. (1960). The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4), 703–712.
Article MathSciNet MATH Google Scholar
Kenney, A., Chiaromonte, F., & Felici, G. (2018). Efficient and effective$ l\_0 $feature selection. arXiv preprint arXiv:1808.02526
Lin, C.-J., Weng, R. C., & Keerthi, S. S. (2008). Trust region newton method for logistic regression. Journal of Machine Learning Research, 9(Apr), 627–650.
MathSciNet MATH Google Scholar
Loh, P.-L., & Wainwright, M. (2015). Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. The Journal of Machine Learning Research, 16(1), 559–616.
MathSciNet MATH Google Scholar
Loh, P.-L., Wainwright, M. J., et al. (2017). Support recovery without incoherence: A case for nonconvex regularization. Annals of Statistics, 45(6), 2455–2482.
Article MathSciNet MATH Google Scholar
Lubin, M., & Dunning, I. (2015). Computing in operations research using Julia. INFORMS Journal on Computing, 27(2), 238–248.
Article MathSciNet MATH Google Scholar
Mazumder, R., Friedman, J., & Hastie, T. (2011). Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495), 1125–1138.
Article MathSciNet MATH Google Scholar
Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2), 227–234.
Article MathSciNet MATH Google Scholar
Pilanci, M., Wainwright, M. J., & El Ghaoui, L. (2015). Sparse learning via Boolean relaxations. Mathematical Programming, 151(1), 63–87.
Article MathSciNet MATH Google Scholar
Plan, Y., & Vershynin, R. (2013a). One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8), 1275–1297.
Article MathSciNet MATH Google Scholar
Plan, Y., & Vershynin, R. (2013b). Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1), 482–494.
Article MathSciNet MATH Google Scholar
Scarlett, J., & Cevher, V. (2017). Limits on support recovery with probabilistic models: An information-theoretic framework. IEEE Transactions on Information Theory, 63(1), 593–620.
Article MathSciNet MATH Google Scholar
Schölkopf, B., Smola, A. J., Bach, F., et al. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press.
Google Scholar
Shalev-Shwartz, S., Singer, Y., Srebro, N., & Cotter, A. (2011). Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1), 3–30.
Article MathSciNet MATH Google Scholar
Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18(3), 768–791.
Article MathSciNet MATH Google Scholar
Su, W., Bogdan, M., & Candes, E. (2017). False discoveries occur early on the lasso path. Annals of Statistics, 45, 2133–2150.
Article MathSciNet MATH Google Scholar
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 58, 267–288.
MathSciNet MATH Google Scholar
Vapnik, V. (1998). The support vector method of function estimation. In Nonlinear modeling (pp. 55–85). Springer.
Wainwright, M. J. (2009a). Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Transactions on Information Theory, 55(12), 5728–5741.
Article MathSciNet MATH Google Scholar
Wainwright, M. J. (2009b). Sharp thresholds for high-dimensional and noisy sparsity recovery using-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5), 2183–2202.
Article MathSciNet MATH Google Scholar
Wang, W., Wainwright, M. J., & Ramchandran, K. (2010). Information-theoretic limits on sparse signal recovery: Dense versus sparse measurement matrices. IEEE Transactions on Information Theory, 56(6), 2967–2979.
Article MathSciNet MATH Google Scholar
Yu, H.-F., Huang, F.-L., & Lin, C.-J. (2011). Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning, 85(1–2), 41–75.
Article MathSciNet MATH Google Scholar
Zhang, C.-H. (2010a). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2), 894–942.
Article MathSciNet MATH Google Scholar
Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32, 56–85.
Article MathSciNet MATH Google Scholar
Zhang, T. (2010b). Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11(3), 1081–1107.
MathSciNet MATH Google Scholar
Zhang, T., et al. (2013). Multi-stage convex relaxation for feature selection. Bernoulli, 19(5B), 2277–2293.
Article MathSciNet MATH Google Scholar
Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4), 1509.
MathSciNet MATH Google Scholar

Download references

Acknowledgements

The authors would like to thank the associate editor and two anonymous referees for their careful review and comments which improved the clarity and quality of the paper.

Funding

Not applicable.

Author information

Authors and Affiliations

Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA, USA
Dimitris Bertsimas & Bart Van Parys
London Business School, London, UK
Jean Pauphilet

Authors

Dimitris Bertsimas
View author publications
You can also search for this author in PubMed Google Scholar
Jean Pauphilet
View author publications
You can also search for this author in PubMed Google Scholar
Bart Van Parys
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jean Pauphilet.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editor: James Cussens.

Appendix: Proof of the sufficient condition for support recovery

1.1 Preliminary results on orthant probabilities

Let us recall an analytical expression for the probability that a bivariate normal distribution assigns to the positive orthant.

Lemma 2

(Cramér, 2016, p. 290) Assume we are given a zero mean bivariate normal random variable $(n_1, n_2)$ with ${\mathbb {E}}[n_1^2] = {\mathbb {E}}[n_2^2]=1$ and covariance $\rho _{12} = {\mathbb {E}}[n_1 n_2]$. Then,

$$\begin{aligned} \mathbb {P} \left( n_1 \geqslant 0, n_2 \geqslant 0 \right) = \frac{1}{2 \pi } \left( \frac{\pi }{2} + \arcsin (\rho _{12}) \right) . \end{aligned}$$

By continuity of the density function of normal distributions, the probability of the positive orthant and its interior are equivalent. Lemma 1 is an almost direct consequence of the previous result. We give here its proof.

Proof of Lemma 1

We can separate the event of interest in two disjunct cases as

$$\begin{aligned}&\mathbb {P}\left( \text {sign}\left( \varvec{x}^\top \varvec{w} \right) \ne \text {sign}\left( \varvec{x}^\top \varvec{w}' + \varepsilon \right) \right) \\&\qquad = \mathbb {P}\left( \varvec{x}^\top \varvec{w} \leqslant 0, \varvec{x}^\top \varvec{w}'+ \varepsilon> 0 \right) + \mathbb {P}\left( \varvec{x}^\top \varvec{w} > 0, \varvec{x}^\top \varvec{w}' + \varepsilon \leqslant 0 \right) . \end{aligned}$$

Each term corresponds to the probability that a zero mean bivariate normal variable $(\varvec{x}^\top \varvec{w}, \varvec{x}^\top \varvec{w}' +\varepsilon )$ realizes in an appropriate orthant. We define the random variables $n_1 := \varvec{x}^\top \varvec{w}/\left\| \varvec{w}\right\| _2$ and $n_2 := (\varvec{x}^\top \varvec{w}' +\varepsilon )/\sqrt{\Vert \varvec{w}' \Vert _2^2 + \sigma ^2}$ and obtain

$$\begin{aligned} \mathbb {P}\left( \text {sign}\left( \varvec{x}^\top \varvec{w} \right) \ne \text {sign}\left( \varvec{x}^\top \varvec{w}' + \varepsilon \right) \right) = {\mathbb {P}}(n_1\le 0, n_2>0) + {\mathbb {P}}(n_1>0, n_2\le 0). \end{aligned}$$

We have that ${\mathbb {E}}[n_1^2]=1$, ${\mathbb {E}}[n_2^2]=1$ and $\rho _{12} ={\mathbb {E}}[n_1 n_2] = {\varvec{w}^\top \varvec{w}'}/{(\Vert \varvec{w} \Vert _2 \sqrt{\Vert \varvec{w}' \Vert _2^2 + \sigma ^2})}$. Using the analytical expressions of such orthant probabilities for bivariate normal random variables given in Lemma 2, we have hence $ \mathbb {P}\left( \text {sign}\left( \varvec{x}^\top \varvec{w} \right) \ne \text {sign}\left( \varvec{x}^\top \varvec{w}' + \varepsilon \right) \right) = \frac{1}{\pi } \left( \frac{\pi }{2} - \arcsin (\rho _{12}) \right) = \frac{1}{\pi } \arccos (\rho _{12}). $ $\square $

We will need a minor generalization of Lemma 2 to the three dimensional case in the proof of Theorem 3.

Lemma 3

(Cramér, 2016, p. 290) Assume we are given a zero mean trivariate normal random variable $(n_1, n_2, n_3)$ with ${\mathbb {E}}[n_1^2] = {\mathbb {E}}[n_2^2]= {\mathbb {E}}[n_3^2]=1$ among which we have covariances $\rho _{12} = {\mathbb {E}}[n_1 n_2]$, $\rho _{13} = {\mathbb {E}}[n_1 n_3]$ and $\rho _{23} = {\mathbb {E}}[n_2 n_3]$. Then,

$$\begin{aligned} \mathbb {P} \left( n_1 \geqslant 0, n_2 \geqslant 0, n_3 \geqslant 0 \right) = \frac{1}{4 \pi } \left( \frac{\pi }{2} + \arcsin (\rho _{12}) + \arcsin (\rho _{13}) + \arcsin (\rho _{23}) \right) . \end{aligned}$$

1.2 Comparative performance of a given support with the truth

We first prove a large deviation bound for $\mathbb {P}\left( \Delta (\varvec{w},\varvec{w}^\star ) \leqslant 0\right) $ for any given binary classifier $\varvec{w}$, depending on the number of true features it selects. The following result can be derived using Hoeffdings inequality as illustrated in its proof.

Lemma 4

Let $\varvec{w} \in \{0,1\}^p$ be a binary classifier such that $\Vert \varvec{w}\Vert _0 =k$ and $\varvec{w}^\top \varvec{w}^\star = \ell \in \{0,\ldots ,k\}$. Its misclassification rate with respect to the ground truth satisfies the exponential bound

$$\begin{aligned} \mathbb {P}\left( \Delta (\varvec{w},\varvec{w}^\star ) \leqslant 0\right)&\leqslant \exp \left( -n \frac{(k-\ell )^2}{2 \pi ^2 (k (k+\sigma ^2)-\ell ^2)}\right) . \end{aligned}$$

Proof of Lemma 4

Let us consider a binary classifier $\varvec{w}\in \{0,1\}^p$ with sparsity $\Vert \varvec{w}\Vert _0 =k$ and true features $\varvec{w}^\top \varvec{w}^\star = \ell $. We compare the empirical misclassification rate of $\varvec{w}$ with the performance of the true support $\varvec{w}^\star $. We take the misclassification rate with respect to the ground truth $\varvec{w}^\star $ as

$$\begin{aligned} \Delta (\varvec{w},\varvec{w}^\star ) = \dfrac{1}{n}\sum _{i=1}^n \mathbf{1} \left( {\hat{y}}_i(\varvec{w}) \ne y_i \right) - \mathbf{1} \left( {\hat{y}}_i(\varvec{w}^\star ) \ne y_i \right) =: \dfrac{1}{n}\sum _{i=1}^n Z_i \end{aligned}$$

which is composed of the sum of independent random variables $Z_i$ taking values in $\{-1, 0, 1\}$ such that

$$\begin{aligned} Z_i= \left\{ \begin{array}{ll} +1, &{} \text{ if } \; y_i = {\hat{y}}_i(\varvec{w}^\star ) \ne {\hat{y}}_i(\varvec{w}), \\ -1, &{} \text{ if } \; y_i = {\hat{y}}_i(\varvec{w}) \ne {\hat{y}}_i(\varvec{w}^\star ), \\ 0, &{} \text{ otherwise } \end{array}\right. \end{aligned}$$

Each random variable $y_i$, ${\hat{y}}_i(\varvec{w}^\star )$, ${\hat{y}}_i(\varvec{w})$ is the sign of the normally distributed quantities ${\varvec{x}^{(i)}}^\top \varvec{w}^\star + \varepsilon _i$, ${\varvec{x}^{(i)}}^\top \varvec{w}^\star $ and ${\varvec{x}^{(i)}}^\top \varvec{w}$ respectively. Let us define three zero mean random variables $n_1 = ({\varvec{x}^{(i)}}^\top \varvec{w} + e_i)/\sqrt{k + \sigma ^2}$, $n_2 = {\varvec{x}^{(i)}}^\top \varvec{w} / \sqrt{k}$ and $n_3 = {\varvec{x}^{(i)}}^\top \varvec{w}^\star /\sqrt{k}$. Their covariance structure is characterized as $\rho _{12}={\mathbb {E}}[n_1 n_2] = k/\sqrt{k(k+\sigma ^2)}$, $\rho _{23}={\mathbb {E}}[n_2 n_3] = \ell /k$ and $\rho _{13}={\mathbb {E}}[n_1 n_3] = \ell /\sqrt{k(k+\sigma ^2)}$. We can then express the probabilities of each value of $Z_i$ as tridimensional orthant probabilities for these three zero mean correlated normal random variables and use the analytical expression given in Lemma 3. We hence arrive at

$$\begin{aligned} \mathbb {P} \left( Z_i=1 \right)&=\mathbb {P}\left( {\varvec{x}^{(i)}}^\top \varvec{w}^\star + \epsilon _i\ge 0 , {\varvec{x}^{(i)}}^\top \varvec{w} \leqslant 0, {\varvec{x}^{(i)}}^\top \varvec{w}^\star \geqslant 0 \right) \\&\qquad + \mathbb {P}\left( {\varvec{x}^{(i)}}^\top \varvec{w}^\star + \epsilon _i\le 0 , {\varvec{x}^{(i)}}^\top \varvec{w} \geqslant 0, {\varvec{x}^{(i)}}^\top \varvec{w}^\star \leqslant 0 \right) \\&= \mathbb {P} \left( n_1\ge 0, n_2\le 0, n_3 \geqslant 0 \right) + \mathbb {P} \left( n_1 \leqslant 0, n_2\ge 0, n_3\le 0 \right) \\&= \dfrac{1}{2 \pi } \left[ \dfrac{\pi }{2} - \arcsin \left( \dfrac{\ell }{\sqrt{k(k+\sigma ^2)}}\right) + \arcsin \left( \dfrac{k}{\sqrt{k(k+\sigma ^2)}}\right) - \arcsin \left( \dfrac{\ell }{k}\right) \right] , \\ \end{aligned}$$

and equivalently

$$\begin{aligned} \mathbb {P} \left( Z_i=-1 \right)&= \mathbb {P}\left( {\varvec{x}^{(i)}}^\top \varvec{w}^\star + \epsilon _i\ge 0 , {\varvec{x}^{(i)}}^\top \varvec{w} \geqslant 0, {\varvec{x}^{(i)}}^\top \varvec{w}^\star \leqslant 0 \right) \\&\qquad + \mathbb {P}\left( {\varvec{x}^{(i)}}^\top \varvec{w}^\star + \epsilon _i\le 0 , {\varvec{x}^{(i)}}^\top \varvec{w} \leqslant 0, {\varvec{x}^{(i)}}^\top \varvec{w}^\star \geqslant 0 \right) \\&= \mathbb {P} \left( n_1\ge 0, n_2\ge 0, n_3\le 0 \right) + \mathbb {P} \left( n_1\le 0, n_2\le 0, n_3\ge 0 \right) \\&= \dfrac{1}{2 \pi } \left[ \dfrac{\pi }{2} + \arcsin \left( \dfrac{\ell }{\sqrt{k(k+\sigma ^2)}}\right) - \arcsin \left( \dfrac{k}{\sqrt{k(k+\sigma ^2)}}\right) - \arcsin \left( \dfrac{\ell }{k}\right) \right] . \end{aligned}$$

Evidently, we can characterize the probability of $Z_i=0$ as ${\mathbb {P}}(Z_i=0) = 1 - {\mathbb {P}}(Z_i=1) - {\mathbb {P}}(Z_i=-1)$. The mean of $Z_i$ is now easily found as the expression

$$\begin{aligned} \mathbb {E}[Z_i]&= \dfrac{1}{\pi } \left[ \arccos \left( \dfrac{\ell }{\sqrt{k(k+\sigma ^2)}}\right) - \arccos \left( \dfrac{k}{\sqrt{k(k+\sigma ^2)}}\right) \right] .\\ \end{aligned}$$

Concavity of the arccos function on the interval [0, 1] enables us to state the gradient inequalities

$$\begin{aligned} \arccos \left( \dfrac{k}{\sqrt{k(k+\sigma ^2)}}\right)&\leqslant \arccos \left( \dfrac{\ell }{\sqrt{k(k+\sigma ^2)}}\right) + \arccos \varvec{s}^\prime \left( \dfrac{\ell }{\sqrt{k(k+\sigma ^2)}}\right) \dfrac{k-\ell }{\sqrt{k(k+\sigma ^2)}} \\&= \arccos \left( \dfrac{\ell }{\sqrt{k(k+\sigma ^2)}}\right) - \sqrt{\dfrac{k(k+\sigma ^2)}{k(k+\sigma ^2) - \ell ^2}} \dfrac{k-\ell }{\sqrt{k(k+\sigma ^2)}} \\&= \arccos \left( \dfrac{\ell }{\sqrt{k(k+\sigma ^2)}}\right) - \dfrac{k-\ell }{\sqrt{k(k+\sigma ^2) - \ell ^2}}. \end{aligned}$$

We thus obtain a somewhat simple lower bound on the mean of $Z_i$

$$\begin{aligned} \mathbb {E}[Z_i]&\geqslant \dfrac{1}{\pi } \dfrac{k-\ell }{\sqrt{k(k+\sigma ^2)-\ell ^2}}. \end{aligned}$$

We now have all the ingredients to upper-bound the probability that $\varvec{w}$ performs strictly better than $\varvec{w}^\star $, in other words that $\Delta (\varvec{w},\varvec{w}^\star ) := \sum _{i=1}^n Z_i < 0$. Applying Hoeffding’s inequality for independent random variables supported on $[-1, 1]$, we have for any $t >0$

$$\begin{aligned} \mathbb {P}\left( \sum _{i=1}^n (Z_i - \mathbb {E}[Z_i] ) < -n t\right)&\leqslant \exp \left( - \dfrac{n t^2}{2}\right) , \end{aligned}$$

and taking $t = \mathbb {E}[Z]$, which is non negative for $\ell < k$ because arccos is decreasing on [0, 1], leads to $ \mathbb {P}\left( \Delta (\varvec{w},\varvec{w}^\star ) < 0\right) \leqslant \exp \left( - \frac{n}{2} \mathbb {E}[Z_i]^2 \right) . $ Substituting in the previous expression our lower bound for the mean ${\mathbb {E}}[Z_i]$ gives the desired result. $\square $

Remark 2

In the absence of noise ($\sigma =0$), the bound in Lemma 4 can be improved. Indeed, the truth makes no mistakes (${\hat{y}}_i(\varvec{w}^\star ) = y_i, \; \forall i$) and $\Delta (\varvec{w},\varvec{w}^\star ) \geqslant 0$ for any $\varvec{w}$. More precisely, here $\sum _{i=1}^n \mathbf{1} \left( {\hat{y}}_i(\varvec{w}) \ne y_i \right) $ is a binomial random variable with parameters n and $q(\ell ; k, 0)$. Therefore,

$$\begin{aligned} \mathbb {P} \left( \Delta (\varvec{w},\varvec{w}^\star ) \leqslant 0 \right) = \mathbb {P} \left( \Delta (\varvec{w},\varvec{w}^\star ) = 0 \right) = \left( 1 - \dfrac{1}{\pi } \arccos \left( \tfrac{\ell }{k} \right) \right) ^n. \end{aligned}$$

Using concavity of the logarithm and the arccos function successively yields the upper bound

$$\begin{aligned} \mathbb {P}\left( \Delta (\varvec{w},\varvec{w}^\star ) < 0\right) \leqslant \exp \left( - \dfrac{n}{\pi } \sqrt{\dfrac{k-\ell }{k+\ell }} \right) . \end{aligned}$$

However, such a refinement eventually modifies the result in Theorem 3 by a constant multiplicative factor only.

1.3 Proof Theorem 3

Proof of Theorem 3

We are interested in bounding the probability that the binary classifier with minimal empirical misclassification rate is any other than $\varvec{w}^\star $. We can characterize the probability of such event as $\mathbb {P}\left( \exists \varvec{w} \ne \varvec{w}^\star \text{ s.t. } \Delta (\varvec{w},\varvec{w}^\star ) \leqslant 0 \right) $. Evidently,

$$\begin{aligned} \mathbb {P}\left( \exists \varvec{w} \ne \varvec{w}^\star \text{ s.t. } \Delta (\varvec{w},\varvec{w}^\star ) \leqslant 0 \right) = \textstyle \sum _{\ell \in \{0, \dots , k-1\}} \mathbb {P}\left( \exists \varvec{w} \text{ s.t. } \varvec{w}^\top \varvec{w}^\star = \ell , \, \Delta (\varvec{w},\varvec{w}^\star ) \leqslant 0 \right) . \end{aligned}$$

Recall that there are exactly $\left( {\begin{array}{c}k\\ \ell \end{array}}\right) \left( {\begin{array}{c}p-k\\ k-l\end{array}}\right) $ distinct binary classifiers $\varvec{w}$ with accuracy $\varvec{w}^\top \varvec{w}^\star = \ell $. Combining a union bound and the bound from Lemma 4 yields

$$\begin{aligned} \mathbb {P}\left( \exists \varvec{w} \ne \varvec{w}^\star \text{ s.t. } \Delta (\varvec{w},\varvec{w}^\star ) \leqslant 0 \right)&\leqslant \sum _{\ell =0}^{k-1} \left( {\begin{array}{c}k\\ \ell \end{array}}\right) \left( {\begin{array}{c}p-k\\ k-\ell \end{array}}\right) \exp \left( -n \frac{(k-\ell )^2}{2 \pi ^2 (k (k+\sigma ^2)-\ell ^2)} \right) , \end{aligned}$$

(14)

$$\begin{aligned}&\leqslant k \cdot \max _{\ell =0,\ldots ,k-1} \left( {\begin{array}{c}k\\ \ell \end{array}}\right) \left( {\begin{array}{c}p-k\\ k-\ell \end{array}}\right) \exp \left( -n \frac{(k-\ell )^2}{2 \pi ^2 (k (k+\sigma ^2)-\ell ^2)}\right) . \end{aligned}$$

(15)

In order for the previous error probability to be bounded away from one, it suffices to take n greater than a threshold T

$$\begin{aligned} n > T : = \max _{\ell =0,\ldots ,k-1} \frac{2 \pi ^2 (k (k+\sigma ^2)-\ell ^2)}{(k-\ell )^2} \left[ \log k + \log \left( {\begin{array}{c}k\\ \ell \end{array}}\right) + \log \left( {\begin{array}{c}p-k\\ k-\ell \end{array}}\right) \right] . \end{aligned}$$

We can obtain a more interpretable sufficient condition by upper-bounding the threshold T. Assuming $p \geqslant 2 k$, $\left( {\begin{array}{c}k\\ \ell \end{array}}\right) = \left( {\begin{array}{c}k\\ k-\ell \end{array}}\right) \leqslant \left( {\begin{array}{c}p-k\\ k-\ell \end{array}}\right) $ and $k \leqslant \left( {\begin{array}{c}p-k\\ k-\ell \end{array}}\right) $, so that

$$\begin{aligned} T&\lesssim \max _{\ell =0,\ldots ,k-1} \frac{k (k+\sigma ^2)-\ell ^2}{(k-\ell )^2} \log \left( {\begin{array}{c}p-k\\ k-\ell \end{array}}\right) , \end{aligned}$$

where $\lesssim $ signifies that the inequality holds up to a multiplicative factor. Since $\log \left( {\begin{array}{c}p-k\\ k-\ell \end{array}}\right) \lesssim (k-\ell ) \log \left( \dfrac{p-k}{k-\ell }\right) $, we now have

$$\begin{aligned} T&\lesssim \max _{\ell =0,\ldots ,k-1} \frac{k (k+\sigma ^2)-\ell ^2}{(k-\ell )} \log \left( \dfrac{p-k}{k-\ell }\right) , \\ \end{aligned}$$

The maximum over $\ell $ in the right hand side of the previous inequality occurs when $\ell = k -1$. The previous observation yields hence that $T \lesssim (2+\sigma ^2) k \log (p-k)$. Finally, it is easy to verify that when n exceeds some threshold value $n_0 \geqslant T$, the inequality (15) yields

$$\begin{aligned} \mathbb {P}\left( \exists \varvec{w} \ne \varvec{w}^\star \text{ s.t. } \Delta (\varvec{w},\varvec{w}^\star ) \leqslant 0 \right)&\leqslant \max _{\ell =0,\ldots ,k-1} \exp \left( - \frac{(k-\ell )^2}{2 \pi ^2 (k (k+\sigma ^2)-\ell ^2)} (n-n_0)\right) , \\&\leqslant \exp \left( -\frac{n - n_0}{2 \pi ^2 k (2+\sigma ^2)}\right) . \end{aligned}$$

$\square $

Rights and permissions

Reprints and permissions

About this article

Cite this article

Bertsimas, D., Pauphilet, J. & Van Parys, B. Sparse classification: a scalable discrete optimization perspective. Mach Learn 110, 3177–3209 (2021). https://doi.org/10.1007/s10994-021-06085-5

Download citation

Received: 29 October 2020
Revised: 22 September 2021
Accepted: 23 September 2021
Published: 02 November 2021
Issue Date: December 2021
DOI: https://doi.org/10.1007/s10994-021-06085-5

Sparse classification: a scalable discrete optimization perspective

Abstract

Similar content being viewed by others

Sparse Learning for Large-Scale and High-Dimensional Data: A Randomized Convex-Concave Optimization Approach

Fast and scalable Lasso via stochastic Frank–Wolfe methods with a convergence guarantee

Regularity Properties for Sparse Regression

Explore related subjects

1 Introduction

Contributions

Structure

Notations

2 Dual framework

2.1 Regularized classification

Assumption 1

Theorem 1

Proof

Remark 1

2.2 Dual approach to sparse classification

Theorem 2

Proof

2.3 Enhanced outer-approximation algorithm

2.4 Practical implementation considerations

3 Numerical experiments: scalability and support recovery

3.1 Support recovery on synthetic data

3.1.1 Methodology

3.1.2 Support recovery metrics

3.1.3 Selecting the truth ...

3.1.4 ... and nothing but the truth

3.2 Scalability on synthetic data

3.3 Experiments on real-world data sets

3.3.1 Over-determined regime \(n>p\)

3.3.2 Under-determined regime \(p>n\)

3.4 Summary and guidelines

4 Towards a theoretical understanding of asymptotic support recovery

4.1 Notations and assumptions

Lemma 1

4.2 Intuition and statement on sufficient conditions

Theorem 3

4.3 Discussion

5 Conclusion

Availability of data and materials

Code availability

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendix: Proof of the sufficient condition for support recovery

Appendix: Proof of the sufficient condition for support recovery

1.1 Preliminary results on orthant probabilities

Lemma 2

Proof of Lemma 1

Lemma 3

1.2 Comparative performance of a given support with the truth

Lemma 4

Proof of Lemma 4

Remark 2

1.3 Proof Theorem 3

Proof of Theorem 3

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation