Journal of Machine Learning Research 11 (2010) 3183-3234 Submitted 11/09; Revised 7/10; Published 11/10

A Comparison of Optimization Methods and Software for
Large-scale L1-regularized Linear Classification

Guo-Xun Yuan r96042@csie.ntu.edu.tw


Kai-Wei Chang b92084@csie.ntu.edu.tw
Cho-Jui Hsieh b92085@csie.ntu.edu.tw
Chih-Jen Lin cjlin@csie.ntu.edu.tw
Department of Computer Science
National Taiwan University
Taipei 106, Taiwan

Editor: Sathiya Keerthi

Abstract
Large-scale linear classification is widely used in many areas. The L1-regularized form can
be applied for feature selection; however, its non-differentiability causes more difficulties in
training. Although various optimization methods have been proposed in recent years, these
have not yet been compared suitably. In this paper, we first broadly review existing meth-
ods. Then, we discuss state-of-the-art software packages in detail and propose two efficient
implementations. Extensive comparisons indicate that carefully implemented coordinate
descent methods are very suitable for training large document data.
Keywords: L1 regularization, linear classification, optimization methods, logistic regres-
sion, support vector machines, document classification

1. Introduction
Recently, L1-regularized classifiers have attracted considerable attention because they can
be used to obtain a sparse model. Given a set of instance-label pairs (x_i, y_i), i = 1, . . . , l, x_i ∈
R^n, y_i ∈ {−1, +1}, training an L1-regularized linear classifier involves the following uncon-
strained optimization problem:

\min_{w} \quad f(w) \equiv \|w\|_1 + C \sum_{i=1}^{l} \xi(w; x_i, y_i), \qquad (1)

where ‖·‖_1 denotes the 1-norm and ξ(w; x_i, y_i) is a non-negative (convex) loss function. The
regularization term ‖w‖_1 is used to avoid overfitting the training data. The user-defined
parameter C > 0 is used to balance the regularization and loss terms.
If ‖w‖_2 is used in (1), we obtain an L2-regularized classifier. Although L2 regulariza-
tion is used more commonly, an L1-regularized formula often produces a sparse w. Nonzero
elements help to select important features; in addition, the time required to produce pre-
dictions may be reduced. Considerable literature has been published on the advantages of
using L1 regularization; see, for example, Zhao and Yu (2006). However, an L1-regularized

©2010 Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh and Chih-Jen Lin.

form (1) is not differentiable regardless of its loss function. This drawback leads to greater
difficulty in solving the optimization problem. Therefore, certain considerations are required
to handle the non-differentiability.
Many loss functions can be used for linear classification. A commonly used one is logistic
loss:
\xi_{\log}(w; x, y) = \log\bigl(1 + e^{-y w^T x}\bigr). \qquad (2)
This loss function is twice differentiable. Note that minimizing logistic loss is equivalent to
maximizing the likelihood, whereas minimizing the regularized loss in (1) is equivalent to
maximizing the posterior with independent Laplace priors on the parameters. Two other
frequently used functions are the L1- and the L2-loss functions:

\xi_{L1}(w; x, y) = \max\bigl(1 - y w^T x,\, 0\bigr) \quad \text{and} \qquad (3)

\xi_{L2}(w; x, y) = \max\bigl(1 - y w^T x,\, 0\bigr)^2. \qquad (4)

Because of the max(·) operation, the L1-loss function is not differentiable. On the other
hand, L2 loss is differentiable, but not twice differentiable (Mangasarian, 2002). We refer
to (1) with logistic loss as L1-regularized logistic regression and (1) with L1/L2 loss as
L1-regularized L1-/L2-loss support vector machines (SVMs).
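As a concrete illustration (ours, not from the paper), the losses (2)-(4) and the objective (1) can be written in a few lines of NumPy; the function names below are our own.

```python
import numpy as np

def logistic_loss(w, X, y):
    # xi_log(w; x, y) = log(1 + exp(-y w^T x)), summed over all instances
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

def l1_loss(w, X, y):
    # xi_L1(w; x, y) = max(1 - y w^T x, 0)
    return np.sum(np.maximum(1 - y * (X @ w), 0))

def l2_loss(w, X, y):
    # xi_L2(w; x, y) = max(1 - y w^T x, 0)^2
    return np.sum(np.maximum(1 - y * (X @ w), 0) ** 2)

def objective(w, X, y, C, loss=logistic_loss):
    # f(w) = ||w||_1 + C * sum_i xi(w; x_i, y_i), problem (1)
    return np.abs(w).sum() + C * loss(w, X, y)
```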
In some applications, we require a bias term b (also called an intercept) in the loss
function; therefore, w^T x in (2)-(4) is replaced with w^T x + b. For example, the logistic loss
function becomes

\xi_{\log}(w, b; x, y) = \log\bigl(1 + e^{-y (w^T x + b)}\bigr).

The optimization problem then involves variables w and b:


\min_{w, b} \quad \|w\|_1 + C \sum_{i=1}^{l} \xi(w, b; x_i, y_i). \qquad (5)

Because b does not appear in the regularization term, most optimization methods used to
solve (1) can solve (5) as well. In this paper, except when b is required, we mainly
consider the formulation (1).
Many papers have proposed optimization methods for large-scale L1-regularized logistic
regression (i.e., using ξlog as the loss function). These studies did not consider L1- or L2-loss
functions because logistic loss has a probability interpretation and is twice differentiable.
These methods differ in various aspects such as the convergence speed, ease of implemen-
tation, and practical applicability. With so many available methods, it is important to
conduct a comprehensive comparison. Schmidt et al. (2007, 2009) compared optimization
methods for L1-regularized classifiers; however, they did not include some state-of-the-art
solvers. Moreover, their comparison is based on the number of function evaluations instead
of the actual running time. In this paper, we categorize and compare existing methods
for logistic regression. We also extend some methods to solve L2-loss SVMs. We exclude
L1 loss from the discussion because most methods for logistic regression or L2-loss SVMs
assume the differentiability of the loss functions. Readers interested in using L1-loss SVMs
can refer to Bradley and Mangasarian (1998), Zhu et al. (2004), Fung and Mangasarian
(2004), Mangasarian (2006), and references therein.


1.1 Basic Properties of (1)


In this paper, we assume that the loss function ξ(w; xi , yi ) is convex, differentiable, and
nonnegative. The proof presented in Appendix A shows that (1) attains at least one global
optimum. Unlike L2 regularization, which leads to a unique optimal solution, here, (1) may
possess multiple optimal solutions. We use w∗ to denote any optimal solution of (1). The
convexity of f (w) implies that all optimal solutions have the same function value, which is
denoted as f ∗ . For more information about the set of optimal solutions, see, for example,
Hale et al. (2008, Section 2).
From standard convex analysis, w∗ is optimal for (1) if and only if w∗ satisfies the
following optimality conditions:

\begin{cases}
\nabla_j L(w^*) + 1 = 0 & \text{if } w_j^* > 0,\\
\nabla_j L(w^*) - 1 = 0 & \text{if } w_j^* < 0,\\
-1 \le \nabla_j L(w^*) \le 1 & \text{if } w_j^* = 0,
\end{cases} \qquad (6)

where L(w) is the loss term in (1):


L(w) \equiv C \sum_{i=1}^{l} \xi(w; x_i, y_i). \qquad (7)

1.2 Organization and Notation


The remainder of this paper is organized as follows. In Section 2, we briefly survey existing
approaches. Section 3 lists state-of-the-art L1-regularized logistic regression packages com-
pared in this study. Sections 4–6 discuss these packages in detail; two of these (Sections
4.1.2 and 5.1) are our proposed implementations. In Section 7, we extend several methods
to train L2-loss SVMs. Section 8 describes the extensive experiments that were carried out.
Comparison results indicate that decomposition methods (Section 4) are very suitable for
large-scale document data. Finally, the discussions and conclusions are presented in Sec-
tion 9. A supplementary file including additional details and descriptions of experiments is
available at http://www.csie.ntu.edu.tw/~cjlin/papers/l1/supplement.pdf
In this study, we use consistent notation to describe several state-of-the-art methods
that are considered. The following table lists some frequently used symbols:

l: number of instances
n: number of features
i: index for a data instance
j: index for a data feature
k: iteration index for an optimization algorithm

We may represent training instances (xi , yi ), i = 1, . . . , l in the following form:


X = \begin{bmatrix} x_1^T \\ \vdots \\ x_l^T \end{bmatrix} \in R^{l \times n}
\quad\text{and}\quad
y = \begin{bmatrix} y_1 \\ \vdots \\ y_l \end{bmatrix} \in \{-1, +1\}^l.


For any vector v, we consider the following two representations for sub-vectors:

v_{t:s} \equiv [v_t, \ldots, v_s]^T \quad\text{and}\quad v_I \equiv [v_{i_1}, \ldots, v_{i_{|I|}}]^T,

where I = \{i_1, \ldots, i_{|I|}\} is an index set. Similarly,

X_{I,:} \equiv \begin{bmatrix} x_{i_1}^T \\ \vdots \\ x_{i_{|I|}}^T \end{bmatrix}. \qquad (8)

The function τ(s) gives the first derivative of the logistic loss function log(1 + e^s):

\tau(s) \equiv \frac{1}{1 + e^{-s}}. \qquad (9)
An indicator vector for the jth component is

e_j \equiv [\,\underbrace{0, \ldots, 0}_{j-1}, 1, 0, \ldots, 0\,]^T. \qquad (10)

We use ‖·‖ or ‖·‖_2 to denote the 2-norm and ‖·‖_1 to denote the 1-norm.

2. A Survey of Existing Methods


In this section, we survey existing methods for L1-regularized problems. In Sections 2.1–2.3,
we focus on logistic regression and L2-loss SVMs for data classification. Section 2.5 briefly
discusses works on regression problems using the least-square loss.

2.1 Decomposition Methods


Decomposition methods are a classical optimization approach. Because it is difficult to
update all variables simultaneously, at each iteration, we can choose a subset of variables
as the working set and fix all others. The resulting sub-problem contains fewer variables
and is easier to solve. The various decomposition methods that have been applied to solve
L1-regularized problems are categorized into two types according to the selection of the
working set.

2.1.1 Cyclic Coordinate Descent Methods


A simple coordinate descent method cyclically chooses one variable at a time and solves the
following one-variable sub-problem:

\min_{z} \quad g_j(z) \equiv f(w + z e_j) - f(w), \qquad (11)

where ej is defined in (10). This function gj (z) has only one non-differentiable point at
z = −wj . In optimization, the cyclic method for choosing working variables is often called
the Gauss-Seidel rule (Tseng and Yun, 2009).
Several works have applied coordinate descent methods to solve (1) with logistic loss.
Here, a difficulty arises in that sub-problem (11) does not have a closed-form solution.


Goodman (2004) assumed nonnegative feature values (i.e., xij ≥ 0, ∀i, j) and then approx-
imated gj (z) by a function Aj (z) at each iteration. Aj (z) satisfies Aj (z) ≥ gj (z), ∀z, and
Aj (0) = gj (0) = 0; therefore, minimizing Aj (z) may reduce the function value. Moreover,
there is a closed-form solution for minimizing Aj (z). Goodman (2004) actually studied a
problem with additional constraints wj ≥ 0; however, his approach can be extended to solve
the original L1-regularized logistic regression by taking the sign of wj into consideration.
Genkin et al. (2007) implemented a cyclic coordinate descent method called BBR to
solve L1-regularized logistic regression. BBR approximately minimizes sub-problem (11)
in a trust region and applies one-dimensional Newton steps. Balakrishnan and Madigan
(2005) reported an extension of BBR for online settings.
In Section 4.1.2, we propose a coordinate descent method by extending Chang et al.’s
(2008) approach for L2-regularized classifiers. Chang et al. (2008) approximately solved
the sub-problem (11) by a one-dimensional Newton direction with line search. Experiments
show that their approach outperforms a BBR variant for L2-regularized classification. There-
fore, for L1 regularization, this approach might be faster than BBR. Hereafter, we refer to
this efficient coordinate descent method as CDN (coordinate descent using one-dimensional
Newton steps).
Tseng and Yun (2009) broadly discussed decomposition methods for L1-regularized
problems. One of their approaches is a cyclic coordinate descent method. They consid-
ered a general cyclic setting so that several working variables are updated at each iteration.
We show that CDN is a special case of their general algorithms.
If we randomly select the working variable, then the procedure becomes a stochastic co-
ordinate descent method. Shalev-Shwartz and Tewari (2009, Algorithm 2) recently studied
this issue. Duchi and Singer (2009) proposed a similar coordinate descent method for the
maximum entropy model, which is a generalization of logistic regression.

2.1.2 Variable Selection Using Gradient Information


Instead of cyclically updating one variable, we can choose working variables based on the
gradient information.1 This method for selecting variables is often referred to as the Gauss-
Southwell rule (Tseng and Yun, 2009). Because of the use of gradient information, the
number of iterations is fewer than those in cyclic coordinate descent methods. However, the
cost per iteration is higher. Shevade and Keerthi (2003) proposed an early decomposition
method with the Gauss-Southwell rule to solve (1). In their method, one variable is chosen
at a time and one-dimensional Newton steps are applied; therefore, their method differs
from the cyclic coordinate descent methods described in Section 2.1.1 mainly in terms of
finding the working variables. Hsieh et al. (2008) showed that for L2-regularized linear
classification, maintaining the gradient for selecting only one variable at a time is not cost-
effective. Thus, for decomposition methods using the gradient information, a larger working
set should be used at each iteration.
In the framework of decomposition methods proposed by Tseng and Yun (2009), one
type of method selects a set of working variables based on the gradient information. The
working set can be large, and therefore, they approximately solved the sub-problem. For
the same method, Yun and Toh (2011) enhanced the theoretical results and carried out

1. Because f (w) is not differentiable, ∇j L(w) ± 1 is used according to the sign of wj .


experiments with document classification data. We refer to their method as CGD-GS because
the method described in Tseng and Yun (2009) is called “coordinate gradient descent” and
a Gauss-Southwell rule is used.

2.1.3 Active Set Methods

Active set methods are a popular optimization approach for linear-constrained problems.
For problem (1), an active set method becomes a special decomposition method because at
each iteration, a sub-problem over a working set of variables is solved. The main difference
from decomposition methods described earlier is that the working set contains all non-zero
variables. Therefore, an active set method iteratively predicts the correct split of zero
and non-zero elements in w. If the split is correct, then solving the sub-problem gives
the optimal values of the non-zero elements.
Perkins et al. (2003) proposed an active set method for (1) with logistic loss. This
implementation uses gradient information to predict w’s zero and non-zero elements.

2.2 Methods by Solving Constrained Optimization Problems

This type of method transforms (1) to equivalent constrained optimization problems. We


further classify them into two groups.

2.2.1 Optimization Problems with Smooth Constraints

We can replace w in (1) with w+ − w− , where w+ and w− are both nonnegative vectors.
Then, problem (1) becomes equivalent to the following bound-constrained optimization
problem:

\min_{w^+, w^-} \quad \sum_{j=1}^{n} w_j^+ + \sum_{j=1}^{n} w_j^- + C \sum_{i=1}^{l} \xi(w^+ - w^-; x_i, y_i) \qquad (12)

\text{subject to} \quad w_j^+ \ge 0, \; w_j^- \ge 0, \; j = 1, \ldots, n.

The objective function and constraints of (12) are smooth, and therefore, the problem can
be solved by standard bound-constrained optimization techniques. Schmidt et al. (2009)
proposed ProjectionL1 to solve (12). This is an extension of the “two-metric projection”
method (Gafni and Bertsekas, 1984). Limited-memory quasi Newton implementations, for
example, LBFGS-B by Byrd et al. (1995) and BLMVM by Benson and Moré (2001), require
function/gradient evaluations and use a limited amount of memory to approximate the Hes-
sian matrix. Kazama and Tsujii (2003) presented an example of using BLMVM for (12).
Lin and Moré’s (1999) trust region Newton method (TRON) can also be applied to solve
(12). In addition to function/gradient evaluations, TRON needs to calculate Hessian-vector
products for faster convergence. Lin et al. (2008) applied TRON to solve L2-regularized
logistic regression and showed that TRON outperforms LBFGS for document data. No pre-
vious work has applied TRON to solve L1-regularized problems, and therefore, we describe
this in Section 5.1.


Koh et al. (2007) proposed an interior point method to solve L1-regularized logistic
regression. They transformed (1) to the following constrained problem:

\min_{w, u} \quad \sum_{j=1}^{n} u_j + C \sum_{i=1}^{l} \xi(w; x_i, y_i) \qquad (13)

\text{subject to} \quad -u_j \le w_j \le u_j, \; j = 1, \ldots, n.

Equation (13) can be made equivalent to (12) by setting

w_j^+ = \frac{u_j + w_j}{2} \quad\text{and}\quad w_j^- = \frac{u_j - w_j}{2}.
To ensure that w and u are in the interior of the feasible region set, Koh et al. (2007) added
a log barrier to the objective function of (13) and applied a Newton method.

2.2.2 Optimization Problems with Non-smooth Constraints


It is well-known that for any choice of C in (1), there exists a corresponding K such that
(1) is equivalent to the following problem:

\min_{w} \quad \sum_{i=1}^{l} \xi(w; x_i, y_i) \qquad (14)

\text{subject to} \quad \|w\|_1 \le K.

See the explanation in, for example, Donoho and Tsaig (2008, Section 1.2). Notice that in
(14), the constraint is not smooth at {w | wj = 0 for some j}. However, (14) contains fewer
variables and constraints than (12). Lee et al. (2006) applied the LARS algorithm described
in Efron et al. (2004) to find a Newton direction at each step and then used a backtracking
line search to minimize the objective value. Kivinen and Warmuth’s (1997) concept of
exponentiated gradient (EG) can solve (14) with additional constraints wj ≥ 0, ∀j. In
a manner similar to the technique for constructing (12), we can remove the nonnegative
constraints by replacing w with w+ − w− . If k is the iteration index, EG updates w by the
following rule:
w_j^{k+1} = \frac{ w_j^k \exp\bigl( -\eta_k \nabla_j \sum_{i=1}^{l} \xi(w^k; x_i, y_i) \bigr) }{ Z_k },

where Z_k is a normalization term for maintaining ‖w^k‖_1 = K, ∀k, and η_k is a user-defined
learning rate. Duchi et al. (2008) applied a gradient projection method to solve (14). The
update rule is
w^{k+1} = \arg\min_{w} \Bigl\{ \bigl\| w^k - \eta_k \nabla \sum_{i=1}^{l} \xi(w^k; x_i, y_i) - w \bigr\| \;:\; \|w\|_1 \le K \Bigr\}. \qquad (15)

They developed a fast algorithm to project a solution to the closest point satisfying the con-
straint. They also considered replacing the gradient in (15) with a stochastic sub-gradient.
In a manner similar to (15), Liu et al. (2009) proposed a gradient projection method called


Lassplore and carefully addressed the selection of the learning rate η_k. However, they eval-
uated their method only on data with no more than 2,000 instances.^2
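For reference, the minimization in (15) is a Euclidean projection onto the L1-ball {w : ‖w‖_1 ≤ K}; the sketch below (ours) uses the simple sort-based O(n log n) variant rather than the faster algorithm of Duchi et al. (2008).

```python
import numpy as np

def project_l1_ball(v, K):
    """Euclidean projection of v onto {w : ||w||_1 <= K}, sort-based variant."""
    if np.abs(v).sum() <= K:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                      # sorted magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - K) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (css[rho] - K) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0)

# one gradient projection step of (15), assuming grad_loss(w) returns the loss gradient:
#   w_next = project_l1_ball(w - eta_k * grad_loss(w), K)
```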
Kim and Kim (2004) discussed a coordinate descent method to solve (14). They used
the gradient information to select an element wj for update. However, because of the
constraint kwk1 ≤ K, the whole vector w is normalized at each step. Thus, the setting
is very different from the unconstrained situations described in Section 2.1.1. Kim et al.
(2008) made further improvements to realize faster convergence.
Active set methods have also been applied to solve (14). However, in contrast to the
active set methods described in Section 2.1.3, here, the sub-problem at each iteration is
a constrained optimization problem. Roth (2004) studied general loss functions including
logistic loss.

2.3 Other Methods for L1-regularized Logistic Regression/L2-loss SVMs


We briefly review other methods in this section.

2.3.1 Expectation Maximization


Many studies have considered Expectation Maximization (EM) frameworks to solve (1)
with logistic loss (e.g., Figueiredo, 2003; Krishnapuram et al., 2004, 2005). These works
find an upper-bound function fˆ(w) ≥ f (w), ∀w, and perform Newton steps to minimize
fˆ(w). However, as pointed out by Schmidt et al. (2009, Section 3.2), the upper-bound
function fˆ(w) may not be well-defined at some wi = 0 and hence certain difficulties must
be addressed.

2.3.2 Stochastic Gradient Descent


Stochastic gradient descent methods have been successfully applied to solve (1). At each it-
eration, the solution is updated using a randomly selected instance. These types of methods
are known to efficiently generate a reasonable model, although they suffer from slow local
convergence. Under an online setting, Langford et al. (2009) solved L1-regularized prob-
lems by a stochastic gradient descent method. Shalev-Shwartz and Tewari (2009) combined
Langford et al.’s (2009) method with other techniques to obtain a new stochastic gradient
descent method for (1).

2.3.3 Quasi Newton Methods


Andrew and Gao (2007) proposed an Orthant-Wise Limited-memory quasi Newton (OWL-
QN) method. This method is extended from LBFGS (Liu and Nocedal, 1989), a limited
memory quasi Newton approach for unconstrained smooth optimization. At each iteration,
this method finds a sub-space without considering some dimensions with wj = 0 and obtains
a search direction similar to LBFGS. A constrained line search on the same sub-space is then
conducted and the property w_j^{k+1} w_j^k ≥ 0 is maintained. Yu et al. (2010) proposed a quasi
Newton approach to solve non-smooth convex optimization problems. Their method can
be used to improve the line search procedure in OWL-QN.
2. Although Table 1 in Liu et al. (2009) shows larger numbers, in Section 6 of the same paper, they stated
that “we use a total of 2,000 samples in the following experiments.”


2.3.4 Hybrid Methods

Shi et al. (2010) proposed a hybrid algorithm for (1) with logistic loss. They used a fixed-
point method to identify the set {j | w_j^* = 0}, where w^* is an optimal solution, and then
applied an interior point method to achieve fast local convergence.

2.3.5 Quadratic Approximation Followed By Coordinate Descent

Krishnapuram et al. (2005) and Friedman et al. (2010) replaced the loss term with a second-
order approximation at the beginning of each iteration and then applied a cyclic coordinate
descent method to minimize the quadratic function. We will show that an implementation
in Friedman et al. (2010) is efficient.

2.3.6 Cutting Plane Methods

Teo et al. (2010) implemented a bundle (cutting plane) method BMRM to handle non-
smooth loss functions. It includes an extension for L1 regularization.

2.3.7 Approximating L1 Regularization by L2 Regularization

Kujala et al. (2009) proposed the approximation of L1-regularized SVMs by iteratively


reweighting training data and solving L2-regularized SVMs. That is, using the current wj ,
they adjusted the jth feature of the training data and then trained an L2-regularized SVM
in the next step.

2.3.8 Solution Path

Several works have attempted to find the “solution path” of (1). The optimal solution of (1)
varies according to parameter C. It is occasionally useful to find all solutions as a function
of C; see, for example, Rosset (2005), Zhao and Yu (2007), Park and Hastie (2007), and
Keerthi and Shevade (2007). We do not provide details of these works because this paper
focuses on the case in which a single C is used.

2.4 Strengths and Weaknesses of Existing Methods

Although this paper will compare state-of-the-art software, we discuss some known strengths
and weaknesses of existing methods.

2.4.1 Convergence Speed

Optimization methods using higher-order information (e.g., quasi Newton or Newton meth-
ods) usually enjoy fast local convergence. However, they involve an expensive iteration.
For example, Newton methods such as TRON or IPM need to solve a large linear system
related to the Hessian matrix. In contrast, methods using or partially using the gradient
information (e.g., stochastic gradient descent) have slow local convergence although they
can more quickly decrease the function value in the early stage.


2.4.2 Implementation Efforts


Methods using higher-order information are usually more complicated. Newton methods
need to include a solver for linear systems. In contrast, methods such as coordinate de-
scent or stochastic gradient descent methods are easy to implement. They involve only
vector operations. Other methods such as expectation maximization are intermediate in
this respect.

2.4.3 Handling Large-scale Data


In some methods, the Newton step requires solving a linear system of n variables. Inverting
an n × n matrix is difficult for large n. Thus, one should not use direct methods (e.g.,
Gaussian elimination) to solve the linear system. Instead, TRON and IPM employ conjugate
gradient methods and Friedman et al. (2007) use coordinate descent methods. We observe
that for existing EM implementations, many consider direct inversions, and therefore, they
cannot handle large-scale data.

2.4.4 Feature Correlation


Methods that work on a block of variables at a time (e.g., decomposition methods) may
be more efficient if features are almost independent; however, they may be less efficient if
features are highly correlated.

2.4.5 Data Type


No method is the most suitable for all data types. A method that is efficient for one
application may be slow for another. This paper is focused on document classification, and
a viable method must be able to easily handle large and sparse features.

2.5 Least-square Regression for Signal Processing and Image Applications


Recently, L1-regularized problems have attracted considerable attention for signal process-
ing and image applications. However, they differ from (1) in several aspects. First, yi ∈ R,
and therefore, a regression instead of a classification problem is considered. Second, the
least-square loss function is used:

\xi(w; x_i, y_i) = (y_i - w^T x_i)^2. \qquad (16)

Third, in many situations, xi are not directly available. Instead, the product between the
data matrix X and a vector can be efficiently obtained through certain operators. We
briefly review some of the many optimization methods for least-square regression. If we use
formulation (14) with non-smooth constraints, the problem reduces to LASSO proposed
by Tibshirani (1996) and some early optimization studies include, for example, Fu (1998)
and Osborne et al. (2000a,b). Fu (1998) applied a cyclic coordinate descent method. For
least-square loss, the minimization of the one-variable sub-problem (11) has a closed-form
solution. Sardy et al. (2000) also considered coordinate descent methods, although they
allowed a block of variables at each iteration. Wu and Lange (2008) considered a coordinate
descent method, but used the gradient information for selecting the working variable at each


iteration. Osborne et al. (2000a) reported one of the earliest active set methods for L1-
regularized problems. Roth’s (2004) method for general losses (see Section 2.2.2) reduces to
this method if the least-square loss is used. Friedman et al. (2007) extended Fu’s coordinate
descent method to find a solution path. Donoho and Tsaig (2008) also obtained a solution
path. Their procedure requires solving a linear system with the matrix X_{:,J}^T X_{:,J}, where J is
a subset of {1, . . . , n}. Figueiredo et al. (2007) transformed the regression problem to a
bound-constrained formula in (12) and applied a projected gradient method. Wright et al.
(2009) proposed the iterative minimization of the sum of the L1 regularization term and
a quadratic approximation of the loss term. In the quadratic approximation, they used a
diagonal matrix instead of the Hessian of the loss term, and therefore, the minimization can
be easily carried out. Hale et al. (2008) proposed a fixed point method to solve (1) with
the least-square loss (16). Their update rule is generated from a fixed-point view; however,
it is very similar to a gradient projection algorithm.
The dual of LASSO and the dual of (1) have been discussed in Osborne et al. (2000b)
and Kim et al. (2007), respectively. However, thus far, there have been few optimization
methods for the dual problem. Tomioka and Sugiyama (2009) proposed a dual augmented
Lagrangian method for L1-regularized least-square regression that theoretically converges
super-linearly.
Most optimization methods for classification discussed in the earlier sections can handle
general smooth loss functions, and therefore, they can be applied to the regression problem.
However, data for classification applications may be very different from those for regression
applications. For example, in text classification, the numbers of instances and features are
both large and data are very sparse. However, in certain signal processing applications, the
number of instances may be much smaller than the number of features (i.e., l ≪ n) and the
data may be dense. Moreover, in classification, the parameter C is often selected by cross
validation; however, in signal processing, the selection may be application dependent. Few
optimization studies have investigated both types of applications. The interior point method
for logistic regression by solving (13) has been applied to the least-square regression problem
(Kim et al., 2007). Duchi et al. (2008) compared their gradient projection implementation
(see Section 2.2.2) with interior point methods using both logistic and least-square losses.
In Section 2.1.2, we mentioned a decomposition method CGD-GS by Yun and Toh (2011).
In the same paper, Yun and Toh have also investigated the performance of CGD-GS on
regression problems.
In this paper, we focus on data classification, and therefore, our conclusions may not
apply to regression applications. In particular, the efficient calculation between X and a
vector in some signal processing applications may afford advantages to some optimization
methods.

3. Methods and Software Included for Comparison

In the rest of this paper, we compare some state-of-the-art software BBR, SCD, CGD-GS,
IPM, BMRM, OWL-QN, Lassplore and GLMNET. We further develop two efficient implemen-
tations. One is a coordinate descent method (CDN) and the other is a trust region Newton
method (TRON). These packages are selected for two reasons. First, they are pub-


licly available. Second, they are able to handle large and sparse data sets. We categorize
these packages into three groups:
• decomposition methods,
• methods by solving constrained optimization problems,
• other methods,
and describe their algorithms in Sections 4–6. The comparison results are described in
Section 8. Note that classification (logistic regression and L2-loss SVMs) is our focus, and
therefore, software for regression problems using the least-square loss (16) is not considered.

4. Decomposition Methods
This section discusses three methods that sequentially or randomly choose variables for update,
and one method that uses gradient information for the working set selection.

4.1 Cyclic Coordinate Descent Methods


From the current solution w^k, a coordinate descent method updates one variable at a time
to generate w^{k,j} ∈ R^n, j = 1, . . . , n+1, such that w^{k,1} = w^k, w^{k,n+1} = w^{k+1}, and

w^{k,j} = \bigl[ w_1^{k+1}, \ldots, w_{j-1}^{k+1}, w_j^{k}, \ldots, w_n^{k} \bigr]^T \quad \text{for } j = 2, \ldots, n.

To update w^{k,j} to w^{k,j+1}, the following one-variable optimization problem is solved:

\min_{z} \quad g_j(z) = |w_j^{k,j} + z| - |w_j^{k,j}| + L_j(z; w^{k,j}) - L_j(0; w^{k,j}), \qquad (17)

where

L_j(z; w^{k,j}) \equiv L(w^{k,j} + z e_j).

For simplicity, hereafter, we use L_j(z) in most places. If the solution of (17) is d, then we
update the jth element by

w_j^{k,j+1} = w_j^{k,j} + d.
Typically, a coordinate descent method sequentially goes through all variables and then
repeats the same process; see the framework in Algorithm 1. We refer to the process of
updating every n elements (i.e., from w^k to w^{k+1}) as an outer iteration and the step of
updating one element (i.e., from w^{k,j} to w^{k,j+1}) as an inner iteration. Practically, we only
approximately solve (17), where several approaches are discussed below.
In the same way as the optimality condition (6) is obtained, for the one-variable
function g_j(z), z = 0 is optimal for (17) if and only if

\begin{cases}
L_j'(0) + 1 = 0 & \text{if } w_j^{k,j} > 0,\\
L_j'(0) - 1 = 0 & \text{if } w_j^{k,j} < 0,\\
-1 \le L_j'(0) \le 1 & \text{if } w_j^{k,j} = 0.
\end{cases} \qquad (18)

The optimality condition at z = 0 shows whether modifying w_j^{k,j} may decrease the function
value or not.


Algorithm 1 A framework of coordinate descent methods


1. Given w1 .
2. For k = 1, 2, 3, . . . (outer iterations)
(a) w^{k,1} = w^k.
(b) For j = 1, 2, . . . , n (inner iterations)
• Find d by solving the sub-problem (17) exactly or approximately.
• w^{k,j+1} = w^{k,j} + d e_j.
(c) w^{k+1} = w^{k,n+1}.

4.1.1 BBR
Genkin et al. (2007) propose a coordinate descent method BBR for (1) and (5) with logistic
loss. This method is extended from Zhang and Oles (2001) for L2-regularized logistic
regression. At each inner iteration, BBR approximately solves the sub-problem (17) by a
trust region Newton method. With the trust region ∆_j, it requires the step z to satisfy

|z| \le \Delta_j \quad \text{and} \quad w_j^{k,j} + z
\begin{cases}
\ge 0 & \text{if } w_j^{k,j} > 0,\\
\le 0 & \text{if } w_j^{k,j} < 0.
\end{cases} \qquad (19)

The first constraint confines the step size to be in the trust region, and ∆_j is updated at the
end of each inner iteration. Due to the non-differentiable point at z = -w_j^{k,j}, the second
constraint ensures that the function is differentiable in the search space.
To approximately solve (17), BBR minimizes a quadratic function upper-bounding the
function g_j(z) in the trust region. Though g_j(z) is not differentiable, by considering both
cases of w_j^{k,j} > 0 and w_j^{k,j} < 0, we obtain the following form:

g_j(z) = g_j(0) + g_j'(0) z + \frac{1}{2} g_j''(\eta z) z^2, \qquad (20)

where 0 < \eta < 1,

g_j'(0) \equiv
\begin{cases}
L_j'(0) + 1 & \text{if } w_j^{k,j} > 0,\\
L_j'(0) - 1 & \text{if } w_j^{k,j} < 0,
\end{cases}
\quad \text{and} \quad g_j''(\eta z) \equiv L_j''(\eta z). \qquad (21)

Notice that when w_j^{k,j} = 0, g_j(z) is non-differentiable at z = 0 and g_j'(0) is not well-defined.
We will discuss this situation later. BBR finds an upper bound U_j of g_j''(z) such that

U_j \ge g_j''(z), \quad \forall |z| \le \Delta_j.

Then \hat{g}_j(z) is an upper-bound function of g_j(z):

\hat{g}_j(z) \equiv g_j(0) + g_j'(0) z + \frac{1}{2} U_j z^2.

Any step z satisfying \hat{g}_j(z) < \hat{g}_j(0) leads to

g_j(z) - g_j(0) = g_j(z) - \hat{g}_j(0) \le \hat{g}_j(z) - \hat{g}_j(0) < 0,


Algorithm 2 BBR: Approximately solving the sub-problem by a trust region method


1. Given w^{k,j} and ∆_j.
2. Calculate U_j by (22).
3. Find a step d by (23).
4. Update ∆_j by ∆_j ← max(2|d|, ∆_j/2).

so the function value is decreased. If logistic loss (2) is used, BBR suggests setting U_j as

U_j \equiv C \sum_{i=1}^{l} x_{ij}^2 \, F\bigl( y_i (w^{k,j})^T x_i, \; \Delta_j |x_{ij}| \bigr), \qquad (22)

where

F(r, \delta) =
\begin{cases}
0.25 & \text{if } |r| \le \delta,\\
\dfrac{1}{2 + e^{|r|-\delta} + e^{\delta-|r|}} & \text{otherwise.}
\end{cases}

If w_j^{k,j} \ne 0, BBR minimizes \hat{g}_j(z) under the constraints (19) to obtain

d = \min\Bigl( \max\bigl( P(-\tfrac{g_j'(0)}{U_j}, \, w_j^{k,j}), \, -\Delta_j \bigr), \, \Delta_j \Bigr), \qquad (23)

where

P(z, w) \equiv
\begin{cases}
z & \text{if } \operatorname{sgn}(w + z) = \operatorname{sgn}(w),\\
-w & \text{otherwise.}
\end{cases}

Now consider the case of w_j^{k,j} = 0, where g_j'(0) is not well-defined at this point. If L_j'(0) + 1 <
0, by defining g_j'(0) \equiv L_j'(0) + 1, any 0 < z \le -g_j'(0)/U_j gives a smaller \hat{g}_j(z) than \hat{g}_j(0). We
thus obtain the new point by mapping -g_j'(0)/U_j back to the trust region. The situation
for L_j'(0) - 1 > 0 is similar. We do not need to handle the situation -1 \le L_j'(0) \le 1, as (18)
and w_j^{k,j} = 0 indicate that z = 0 is the minimum of g_j(z).
The procedure of BBR to approximately solve (17) is described in Algorithm 2. The
major cost is on calculating g_j'(0) and U_j. For logistic loss, the L_j'(0) needed for calculating
g_j'(0) is

L_j'(0) = C \sum_{i=1}^{l} y_i x_{ij} \bigl( \tau( y_i (w^{k,j})^T x_i ) - 1 \bigr), \qquad (24)

where \tau(\cdot) is defined in (9). From (22) and (24), the most expensive operation is on obtaining
w^T x_i, \forall i. A common trick for saving the running time is to store w^T x_i, \forall i and update
them according to

w^T x_i \leftarrow w^T x_i + d \cdot x_{ij}. \qquad (25)

If w^T x_i, \forall i are available, both (22) and (24) need O(l) operations. Because maintaining
w^T x_i via (25) also takes O(l), the cost per inner iteration is O(l).
Unfortunately, there is no convergence proof yet for the method BBR.
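To make Algorithm 2 concrete, the following is a rough dense-data sketch (ours, not the BBR package) of one inner iteration for logistic loss, assuming w^T x_i is stored in a vector Xw as suggested by (25).

```python
import numpy as np

def F_bbr(r, delta):
    # F(r, delta) from (22); r and delta may be arrays of length l
    return np.where(np.abs(r) <= delta,
                    0.25,
                    1.0 / (2.0 + np.exp(np.abs(r) - delta) + np.exp(delta - np.abs(r))))

def bbr_inner_step(j, w, Xw, X, y, C, delta_j):
    """One BBR inner iteration on coordinate j (dense data, logistic loss).
    Xw stores w^T x_i for all i and is updated as in (25)."""
    tau = 1.0 / (1.0 + np.exp(-y * Xw))                        # tau(y_i w^T x_i), see (9)
    Lp = C * np.sum(y * X[:, j] * (tau - 1.0))                 # L'_j(0), Equation (24)
    Uj = C * np.sum(X[:, j] ** 2 *
                    F_bbr(y * Xw, delta_j * np.abs(X[:, j])))  # Equation (22)
    if w[j] != 0:
        gp = Lp + 1.0 if w[j] > 0 else Lp - 1.0
        z = -gp / Uj
        if np.sign(w[j] + z) != np.sign(w[j]):                 # P(z, w_j) in (23)
            z = -w[j]
        d = min(max(z, -delta_j), delta_j)
    else:                                                      # w_j = 0: see the text after (23)
        if Lp + 1.0 < 0:
            d = min(max(-(Lp + 1.0) / Uj, -delta_j), delta_j)
        elif Lp - 1.0 > 0:
            d = min(max(-(Lp - 1.0) / Uj, -delta_j), delta_j)
        else:
            return 0.0, delta_j                                # z = 0 is optimal, see (18)
    w[j] += d
    Xw += d * X[:, j]                                          # maintain w^T x_i, Equation (25)
    return d, max(2.0 * abs(d), delta_j / 2.0)                 # trust region update, Algorithm 2
```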


4.1.2 Coordinate Descent Method Using One-dimensional Newton Directions (CDN)

BBR replaces the second derivative term in (20) with an upper bound U_j. If we instead keep
g_j''(0) and obtain the one-dimensional Newton direction at z = 0, the local convergence may
be faster. This issue has been studied in L2-regularized logistic regression/SVMs, where
BBR reduces to the approach by Zhang and Oles (2001), and Chang et al. (2008) showed
that a coordinate descent method using one-dimensional Newton directions is faster. Here,
we extend Chang et al.'s method to L1-regularized problems. The new approach, referred
to as CDN, is expected to be faster than BBR for a similar reason.
A Newton direction is obtained from minimizing the second-order approximation, but
g_j(z) is not differentiable due to the L1-regularization term. Thus, we consider only the
second-order approximation of the loss term L_j(z) and solve

\min_{z} \quad |w_j^{k,j} + z| - |w_j^{k,j}| + L_j'(0) z + \frac{1}{2} L_j''(0) z^2. \qquad (26)
This problem can be reduced to a form commonly referred to as “soft-thresholding” in
signal processing. We show in Appendix B that (26) has the following closed-form solution:

d =
\begin{cases}
-\dfrac{L_j'(0) + 1}{L_j''(0)} & \text{if } L_j'(0) + 1 \le L_j''(0) w_j^{k,j},\\[1ex]
-\dfrac{L_j'(0) - 1}{L_j''(0)} & \text{if } L_j'(0) - 1 \ge L_j''(0) w_j^{k,j},\\[1ex]
-w_j^{k,j} & \text{otherwise.}
\end{cases} \qquad (27)

Because (26) is only a quadratic approximation of f(w^{k,j} + z e_j) - f(w^{k,j}), the direction
d may not ensure the decrease of the function value. For convergence, Chang et al.
(2008) consider a line search procedure to find \lambda \in (0, 1) such that the step \lambda d satisfies the
sufficient decrease condition:

f(w^{k,j} + \lambda d e_j) - f(w^{k,j}) = g_j(\lambda d) - g_j(0) \le -\sigma (\lambda d)^2,

where \sigma is any constant in (0, 1). However, as pointed out by Tseng and Yun (2009), this
condition may not be suitable here due to the non-smooth regularization term \|w\|_1. We
follow Tseng and Yun (2009) and use a modified condition:

g_j(\lambda d) - g_j(0) \le \sigma \lambda \bigl( L_j'(0) d + |w_j^{k,j} + d| - |w_j^{k,j}| \bigr). \qquad (28)

To find \lambda, CDN adopts a backtracking line search to sequentially check \lambda = 1, \beta, \beta^2, \ldots,
where \beta \in (0, 1), until \lambda d satisfies (28). A description of how CDN approximately solves
the sub-problem (17) is given in Algorithm 3.
In Appendix D, we explain that Algorithm 3 falls into a class of Gauss-Seidel coordinate
descent methods in Tseng and Yun (2009). By showing that all required assumptions are
satisfied, we can directly enjoy some nice theoretical properties. First, following Lemma
5 in Tseng and Yun (2009), the line search procedure stops in a finite number of steps.
Second, for (1) with logistic loss, any limit point of {wk } is an optimum. Further, if the
loss function L(w) is strictly convex, the algorithm converges at least linearly. The proof
is in the supplementary file.


Algorithm 3 CDN: Approximately solving the sub-problem by Newton directions with line
search
1. Given w^{k,j}. Choose \beta \in (0, 1).
2. Calculate the Newton direction d by (27).
3. Compute \lambda = \max\{1, \beta, \beta^2, \ldots\} such that \lambda d satisfies (28).

We discuss the computational cost. To obtain the sub-problem (26), we must calculate
L_j'(0) and L_j''(0). For logistic loss, L_j'(0) is shown in (24) and

L_j''(0) = C \sum_{i=1}^{l} x_{ij}^2 \, \tau\bigl( y_i (w^{k,j})^T x_i \bigr) \bigl( 1 - \tau( y_i (w^{k,j})^T x_i ) \bigr). \qquad (29)

Similar to the situation in BBR, calculating w^T x_i, \forall i is the major cost here. We can apply
the same trick in (25) to maintain w^T x_i, \forall i. In our implementation, we maintain e^{w^T x_i}
instead:

e^{w^T x_i} \leftarrow e^{w^T x_i} \cdot e^{\lambda d x_{ij}}. \qquad (30)

The line search procedure needs to calculate g_j(\lambda d). From (2), the main cost is still on
obtaining (w + \lambda d e_j)^T x_i, \forall i, so the trick in (30) is again useful. If e^{w^T x_i}, \forall i are available,
from (24), (29) and (2), the cost per inner iteration is

(1 + \text{\# line search steps}) \times O(l).

To reduce the cost for line search, Chang et al. (2008) obtain a function ĝj (·) satisfying
ĝj (λd) > gj (λd) and check ĝj (λd) − gj (0) in (28). Calculating ĝj (λd) is much easier than
gj (λd), so we may avoid calculating the last gj (λd) in the line search procedure. In some
iterations, λ = 1 already satisfies (28), and therefore, this trick makes the cost of the line
search procedure negligible. We do not show details here; however, derivations can be found
in Fan et al. (2008, Appendix G).
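The pieces above can be assembled into one CDN inner iteration. The sketch below (ours, dense data, not the authors' implementation) combines the Newton direction (27), the sufficient decrease condition (28), and the maintenance of e^{w^T x_i} in (30).

```python
import numpy as np

def cdn_inner_step(j, w, expXw, X, y, C, sigma=0.01, beta=0.5, max_ls=20):
    """One CDN inner iteration on coordinate j (dense data, logistic loss).
    expXw stores exp(w^T x_i) for all i, maintained as in (30)."""
    exp_yXw = expXw ** y                                # exp(y_i w^T x_i)
    tau = exp_yXw / (1.0 + exp_yXw)                     # tau(y_i w^T x_i), see (9)
    Lp = C * np.sum(y * X[:, j] * (tau - 1.0))          # L'_j(0), Equation (24)
    Lpp = C * np.sum(X[:, j] ** 2 * tau * (1.0 - tau))  # L''_j(0), Equation (29)
    # closed-form Newton direction (27) ("soft-thresholding")
    if Lp + 1.0 <= Lpp * w[j]:
        d = -(Lp + 1.0) / Lpp
    elif Lp - 1.0 >= Lpp * w[j]:
        d = -(Lp - 1.0) / Lpp
    else:
        d = -w[j]
    # backtracking line search with the sufficient decrease condition (28)
    loss0 = C * np.sum(np.log1p(1.0 / exp_yXw))         # C * sum_i log(1 + exp(-y_i w^T x_i))
    lam = 1.0
    for _ in range(max_ls):
        z = lam * d
        loss = C * np.sum(np.log1p(np.exp(-y * z * X[:, j]) / exp_yXw))
        lhs = abs(w[j] + z) - abs(w[j]) + loss - loss0  # g_j(lam*d) - g_j(0)
        if lhs <= sigma * lam * (Lp * d + abs(w[j] + d) - abs(w[j])):
            break
        lam *= beta
    w[j] += lam * d
    expXw *= np.exp(lam * d * X[:, j])                  # maintain exp(w^T x_i), Equation (30)
    return lam * d
```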
Next we describe two implementation techniques to improve the convergence speed. The
first one is to use a random permutation of sub-problems. In Algorithm 1, we cyclically
consider variables to form one-variable sub-problems. Chang et al. (2008) show that solving
sub-problems in a random order may lead to faster convergence. That is, at the kth iteration,
we randomly permute \{1, 2, \ldots, n\} to \{\pi_k(1), \pi_k(2), \ldots, \pi_k(n)\} and update the elements of
w in the order of \{w_{\pi_k(1)}, w_{\pi_k(2)}, \ldots, w_{\pi_k(n)}\}.
The second implementation technique is shrinking. Past works such as Hsieh et al.
(2008), Joachims (1998), and Krishnapuram et al. (2005, Section 3.5) heuristically remove
some variables to obtain a smaller optimization problem. If w_j has remained zero for many
iterations, it is very possible that the optimal w_j^* is also zero. Therefore, we can remove
such variables. Our shrinking implementation is related to that in the software LIBSVM
(Chang and Lin, 2011). From the optimality condition (6),

-1 < \nabla_j L(w^*) < 1 \quad \text{implies} \quad w_j^* = 0. \qquad (31)

We prove in Appendix C the following theorem:


Theorem 1 Assume \{w^k\} globally converges to w^*. If -1 < \nabla_j L(w^*) < 1, then there is
K_j such that for all k \ge K_j,

-1 < \nabla_j L(w^{k,j}) < 1 \quad \text{and} \quad w_j^{k,j} = 0.

Using this theorem, we design the following shrinking strategy. Before updating w_j^{k,j} via
approximately solving the sub-problem (17), if

w_j^{k,j} = 0 \quad \text{and} \quad -1 + M^{k-1} < \nabla_j L(w^{k,j}) < 1 - M^{k-1}, \qquad (32)

we conjecture that w_j^{k,j} may remain zero and hence remove it from the optimization. We choose

M^{k-1} \equiv \frac{\max_j v_j}{l},

where

v_j \equiv
\begin{cases}
|\nabla_j L(w^{k-1,j}) + 1| & \text{if } w_j^{k-1,j} > 0,\\
|\nabla_j L(w^{k-1,j}) - 1| & \text{if } w_j^{k-1,j} < 0,\\
\max\bigl( \nabla_j L(w^{k-1,j}) - 1, \; -1 - \nabla_j L(w^{k-1,j}), \; 0 \bigr) & \text{if } w_j^{k-1,j} = 0.
\end{cases}

From (6), v_j, j = 1, \ldots, n, measure the violation of the optimality condition at the (k-1)st
iteration. The value M^{k-1} reflects how aggressively we remove variables. It is large in the
beginning, but approaches zero in the end.
The shrinking implementation introduces little extra cost. When updating the jth
component at the kth iteration, we calculate

\nabla_j L(w^{k,j}) = L_j'(0; w^{k,j})

for the direction d in (27), regardless of whether shrinking is implemented. Thus, the \nabla_j L(w^{k,j})
needed for checking (32) is already available. Moreover, it can be used to calculate v_j and
M^k, which are needed for the next iteration.
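A minimal sketch (ours) of the shrinking test: given the gradient of the loss term at coordinate j and the previous threshold M^{k-1}, it evaluates the violation v_j and the condition (32).

```python
def violation(wj, grad_j):
    # v_j as defined above, measuring how much (6) is violated at coordinate j
    if wj > 0:
        return abs(grad_j + 1.0)
    if wj < 0:
        return abs(grad_j - 1.0)
    return max(grad_j - 1.0, -1.0 - grad_j, 0.0)

def can_shrink(wj, grad_j, M_prev):
    # condition (32): w_j = 0 and -1 + M^{k-1} < grad_j L(w) < 1 - M^{k-1}
    return wj == 0 and -1.0 + M_prev < grad_j < 1.0 - M_prev

# after a full outer iteration, the threshold for the next one is
#   M_new = max_j violation(w[j], grad_L[j]) / l     (l = number of instances)
```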

4.1.3 Stochastic Coordinate Descent Method (SCD)


Shalev-Shwartz and Tewari (2009) propose a stochastic coordinate descent method (SCD) to
solve the bound-constrained problem in (12). At the kth iteration, SCD randomly chooses
a working variable from \{w_1^+, \ldots, w_n^+, w_1^-, \ldots, w_n^-\}. The one-variable sub-problem is

\min_{z} \quad g_j(z) \equiv z + L_j(z; w^{k,+} - w^{k,-}) - L_j(0; w^{k,+} - w^{k,-}),

subject to the non-negativity constraint

w_j^{k,+} + z \ge 0 \quad \text{or} \quad w_j^{k,-} + z \ge 0,

according to whether w_j^+ or w_j^- is the working variable. SCD considers a second-order
approximation similar to BBR:

\hat{g}_j(z) = g_j(0) + g_j'(0) z + \frac{1}{2} U_j z^2, \qquad (33)


Algorithm 4 SCD for L1-regularized logistic regression


1. Given (w^+, w^-) and U_j > 0.
2. While (w^+, w^-) is not optimal for (12)
(a) Select an element from \{w_1^+, \ldots, w_n^+, w_1^-, \ldots, w_n^-\}.
(b) Update w_j^+ or w_j^- by (36)-(37).

where

g_j'(0) =
\begin{cases}
1 + L_j'(0) & \text{for } w_j^+,\\
1 - L_j'(0) & \text{for } w_j^-,
\end{cases}
\quad \text{and} \quad U_j \ge g_j''(z), \; \forall z.

BBR considers U_j to be an upper bound of g_j''(z) only in the trust region, while SCD finds
a global upper bound of g_j''(z). For logistic regression, following (9) and (29), we have
\tau(\cdot)(1 - \tau(\cdot)) \le 0.25 and

U_j = 0.25 C \sum_{i=1}^{l} x_{ij}^2 \ge g_j''(z), \; \forall z. \qquad (34)

Shalev-Shwartz and Tewari (2009) assume -1 \le x_{ij} \le 1, \forall i, j, so a simpler upper bound is

U_j = 0.25 C l. \qquad (35)

Using the direction obtained by minimizing (33) and taking the non-negativity into
consideration, SCD updates w in the following way:

If w_j^+ is selected:

w_j^+ \leftarrow w_j^+ + \max\bigl( -w_j^+, \; -\tfrac{1 + L_j'(0)}{U_j} \bigr) \qquad (36)

Else:

w_j^- \leftarrow w_j^- + \max\bigl( -w_j^-, \; -\tfrac{1 - L_j'(0)}{U_j} \bigr) \qquad (37)

A description of SCD is in Algorithm 4.
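The updates (36)-(37) are simple enough to state in a few lines; the sketch below (ours) performs one SCD step for logistic loss using the per-feature bound (34).

```python
import numpy as np

def scd_step(j, sign, w_pos, w_neg, Xw, X, y, C):
    """One SCD update on w_j^+ (sign=+1) or w_j^- (sign=-1) for logistic loss.
    Xw stores (w^+ - w^-)^T x_i for all i."""
    tau = 1.0 / (1.0 + np.exp(-y * Xw))                 # tau(y_i w^T x_i), see (9)
    Lp = C * np.sum(y * X[:, j] * (tau - 1.0))          # L'_j(0) as in (24)
    Uj = 0.25 * C * np.sum(X[:, j] ** 2)                # global upper bound (34)
    if sign > 0:
        d = max(-w_pos[j], -(1.0 + Lp) / Uj)            # update (36)
        w_pos[j] += d
        Xw += d * X[:, j]
    else:
        d = max(-w_neg[j], -(1.0 - Lp) / Uj)            # update (37)
        w_neg[j] += d
        Xw -= d * X[:, j]
    return d
```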

4.2 CGD-GS: a Decomposition Method Using Gradients for Selecting Variables


Instead of updating one variable at a time, some decomposition methods choose a larger
working set J ⊂ N ≡ {1, . . . , n}. We discuss a Gauss-Southwell method for selecting J
(Tseng and Yun, 2009; Yun and Toh, 2011). This method, referred to as CGD-GS, can
handle any smooth loss function including ξlog .
Following the principle of decomposition methods, if wk is the current solution and J
is the set of working variables, one should solve the following sub-problem:

\min_{d} \quad L(w^k + d) - L(w^k) + \|w^k + d\|_1 - \|w^k\|_1

\text{subject to} \quad d_j = 0, \; \forall j \notin J.


Because it is still difficult to solve this sub-problem, CGD-GS considers a quadratic approx-
imation of the loss term:

\min_{d} \quad q_k(d) \equiv \nabla L(w^k)^T d + \frac{1}{2} d^T H d + \|w^k + d\|_1 - \|w^k\|_1 \qquad (38)

\text{subject to} \quad d_j = 0, \; \forall j \notin J,

where H is either \nabla^2 L(w^k) or its approximation. To ensure convergence, CGD-GS
conducts a backtracking line search to find \lambda such that \lambda d satisfies

f(w^k + \lambda d) - f(w^k) \le \sigma \lambda \bigl( \nabla L(w^k)^T d + \gamma d^T H d + \|w^k + d\|_1 - \|w^k\|_1 \bigr), \qquad (39)

where 0 < \sigma < 1 and 0 \le \gamma < 1. This condition is the same as (28) if \gamma = 0 and J
contains only one element. Tseng and Yun (2009) used (39) for both the cyclic selection
(Gauss-Seidel) and the selection using gradient information (Gauss-Southwell).
For selecting J using gradients, Tseng and Yun (2009) proposed two possible ways.^3
The first one, referred to as the Gauss-Southwell-r rule, requires J to satisfy

\|d(J)\|_\infty \ge v \|d(N)\|_\infty, \qquad (40)

where v \in (0, 1) is a constant and d(J) and d(N) are the solutions of (38) obtained by using
J and N as the working set, respectively. This condition connects the directions of solving
sub-problems using a subset and the full set. The other condition for selecting J is

q_k(d(J)) \le v \cdot q_k(d(N)), \qquad (41)

where v \in (0, 1) and q_k(d) is defined in (38). This condition is referred to as the Gauss-
Southwell-q rule. Algorithm 5 summarizes the procedure.
The remaining issues are how to solve the sub-problem (38) and how to obtain J satis-
fying (40) or (41). The CGD-GS implementation considers a diagonal matrix with positive
entries as H; for example, H_{jj} = \max(\nabla^2_{jj} L(w^k), \epsilon), where \epsilon is a small positive value. Then,
the sub-problem becomes |J| separable one-variable problems like (26). Each one-variable
problem has a simple closed-form solution. Further, it is easy to find indices satisfying (40)
or (41); for example, the rule (40) reduces to finding the larger elements of d(N). Tseng and
Yun (2009) proved that any limit point of \{w^k\} is an optimum of (1).
We discuss the computational cost for logistic regression. If H is diagonal, then solving
(38) takes only O(|J|) operations. Constructing (38) is more expensive because we need to
calculate

\nabla L(w) = C \sum_{i=1}^{l} \bigl( \tau( y_i w^T x_i ) - 1 \bigr) y_i x_i. \qquad (42)

Yun and Toh (2011) apply a trick similar to (30) and maintain e^{w^T x_i}, \forall i, but the gradient
calculation still needs O(ln). This high cost may not pay off, and therefore, Hsieh et al.
(2008) favor decomposition methods without maintaining the gradient. Finding the working
set J is another potentially expensive step, but it is cheap for a diagonal H.
3. Note that (40) indirectly uses gradient information. If the 1-norm term is not considered and H is
diagonal, then d(N ) is a “scaled” gradient with dj (N ) = −∇j L(wk )/Hjj . Thus, (40) selects indices
with larger scaled gradient values.


Algorithm 5 CGD-GS for L1-regularized logistic regression


1. Given w^1. Choose (40) or (41) as the strategy for selecting working sets. Given
0 < \beta, \sigma < 1 and 0 \le \gamma < 1.
2. For k = 1, 2, 3, . . .
• Choose an H and a working set J.
• Get d^k by solving the sub-problem (38).
• Compute \lambda = \max\{1, \beta, \beta^2, \ldots\} such that \lambda d^k satisfies (39).
• w^{k+1} = w^k + \lambda d^k.
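Since the CGD-GS implementation uses a diagonal H, each coordinate of the sub-problem (38) has the same closed form as (26)-(27), and the Gauss-Southwell-r rule (40) amounts to keeping the large components of d(N). A rough sketch (ours, not the CGD-GS code):

```python
import numpy as np

def direction_dN(w, grad_L, H_diag):
    """Solve (38) coordinate-wise with a diagonal H over the full set N."""
    d = np.empty_like(w)
    for j in range(len(w)):
        g, h, wj = grad_L[j], H_diag[j], w[j]
        # minimize g*z + 0.5*h*z^2 + |wj + z| - |wj|, same closed form as (27)
        if g + 1.0 <= h * wj:
            d[j] = -(g + 1.0) / h
        elif g - 1.0 >= h * wj:
            d[j] = -(g - 1.0) / h
        else:
            d[j] = -wj
    return d

def gauss_southwell_r(d_full, v=0.5):
    """A working set J satisfying (40): keep coordinates with |d_j| >= v * ||d(N)||_inf."""
    return np.nonzero(np.abs(d_full) >= v * np.abs(d_full).max())[0]
```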

5. Methods by Solving Constrained Optimization Problems


Section 2.2 lists several methods for L1-regularized classification by solving the bound-
constrained problems (12), (13), and (14). In this section, we discuss TRON, IPM, and
Lassplore in detail.

5.1 A Trust Region Newton Method (TRON)


We apply the trust region Newton method in Lin and Moré (1999) to solve (12). A previous
study of this method for L1-regularized logistic regression is by Lee (2008). For convenience,
in this section, we slightly abuse the notation by redefining

w \equiv \begin{bmatrix} w^+ \\ w^- \end{bmatrix} \in R^{2n}. \qquad (43)

Then, problem (12) can be written in the general form of bound-constrained problems:

\min_{w} \quad \bar{f}(w)
\quad \text{subject to} \quad w \in \Omega \equiv \{ w \mid l_j \le w_j \le u_j, \; \forall j \},

where \bar{f}(w) denotes the objective function of (12), and l and u are lower and upper bounds,
respectively.
At the kth iteration of the trust region Newton method, we have an iterate w^k, a size
\Delta_k of the trust region, and a quadratic model

q_k(d) \equiv \frac{1}{2} d^T \nabla^2 \bar{f}(w^k) d + \nabla \bar{f}(w^k)^T d

to approximate the value \bar{f}(w^k + d) - \bar{f}(w^k). Next, we find a step d^k by approximately
solving the following trust region problem:

\min_{d} \quad q_k(d)
\quad \text{subject to} \quad \|d\| \le \Delta_k, \; w^k + d \in \Omega. \qquad (44)

We then update w^k and \Delta_k by checking the ratio

\rho_k = \frac{ \bar{f}(w^k + d^k) - \bar{f}(w^k) }{ q_k(d^k) } \qquad (45)


Algorithm 6 A trust region method for L1-regularized logistic regression


1. Given w^1.
2. For k = 1, 2, 3, . . . (outer iterations)
• Approximately solve (44) and obtain a direction d^k; see Algorithm 7.
• Compute \rho_k via (45).
• Update w^k to w^{k+1} according to (46) and update \Delta_k to \Delta_{k+1}.

Algorithm 7 TRON: Finding a direction d^k by approximately solving (44)

1. Given 0 < \epsilon < 1. Find the Cauchy step d^{k,C} and the Cauchy point

w^{k,1} = w^k + d^{k,C} and d^{k,1} = d^{k,C}.

2. For t = 1, . . . , 2n + 1 (inner iterations)
• Find F_t and B_t (free/bounded sets) at w^{k,t} by (50).
• If F_t = \emptyset, then stop and return d^k = w^{k,t} - w^k.
• Approximately solve

\min_{v_{F_t}} \quad q_k(d^{k,t} + v)
\quad \text{subject to} \quad \|d^{k,t} + v\| \le \Delta_k, \; v_{B_t} = 0,

by Conjugate Gradient (CG) methods. Denote the solution as v^{k,t}.
• Projected line search on w^{k,t} + \lambda v^{k,t} to obtain w^{k,t+1} and d^{k,t+1}; see Equation
(55). We ensure that F_{t+1} \subset F_t and |F_{t+1}| < |F_t|.
• If one of the following situations occurs:

\| \nabla q_k(d^{k,t+1})_{F_t} \| \le \epsilon \| \nabla \bar{f}(w^k)_{F_t} \|,

or CG abnormally stops (explained in the text), then stop and return

d^k = w^{k,t+1} - w^k.

of the actual reduction in the function to the predicted reduction in the quadratic model.
The direction d^k is accepted if \rho_k is large enough:

w^{k+1} =
\begin{cases}
w^k + d^k & \text{if } \rho_k > \eta_0,\\
w^k & \text{if } \rho_k \le \eta_0,
\end{cases} \qquad (46)

where \eta_0 > 0 is a pre-specified value. The size \Delta_k of the trust region is then updated
according to the reduction of the function value. If the reduction is significant, then \Delta_k is
enlarged. Otherwise, we reduce \Delta_k. More details can be found in Lin and Moré (1999).
The framework of our trust region method is given in Algorithm 6. Earlier works applying
trust region methods for L2-regularized problems include, for example, Lin et al. (2008).
The algorithm here is more complicated due to the bound constraints.


5.1.1 Cauchy Point


A challenge for minimizing bound-constrained problems is to quickly identify bounded com-
ponents at an optimal solution (i.e., components which are upper- or lower-bounded). Be-
cause each wj can be either bounded or free, the number of combinations is exponential.
One commonly used approach takes the negative gradient direction to obtain a new point
and projects it back to the feasible region Ω. With a proper line search to ensure the re-
duction of the quadratic model qk (d), not only do we effectively guess the set of bounded
components at an optimum, but also the convergence is guaranteed. To be more precise,
we find a step size \lambda > 0 so that

q_k(d^{k,C}) \le q_k(0) + \sigma \nabla q_k(0)^T d^{k,C} \quad \text{and} \quad \|d^{k,C}\| \le \Delta_k, \qquad (47)

where

d^{k,C} = P[\, w^k - \lambda \nabla \bar{f}(w^k) \,] - w^k \qquad (48)

is called the Cauchy step in bound-constrained optimization and \sigma \in (0, 1/2) is a constant.
The projection operator P[\cdot] maps w^k - \lambda \nabla \bar{f}(w^k) back to the feasible region \Omega:

P[w_j] = \min\bigl( u_j, \max(w_j, l_j) \bigr), \qquad (49)

so some components become bounded. Although the negative gradient direction is projected
in (48), the resulting direction is still a descent direction (i.e., \nabla q_k(0)^T d^{k,C} < 0). Hence, one
can always find a small \lambda > 0 such that (47) is satisfied. The point w^{k,C} \equiv w^k + d^{k,C} is
referred to as the Cauchy point.
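A small sketch (ours) of the projection (49) and the backtracking search for a Cauchy step satisfying (47)-(48); the callable quad_model is assumed to evaluate q_k(d).

```python
import numpy as np

def project(w, lower, upper):
    # P[w_j] = min(u_j, max(w_j, l_j)), Equation (49)
    return np.minimum(upper, np.maximum(w, lower))

def cauchy_step(w, grad, quad_model, lower, upper, Delta,
                sigma=0.25, beta=0.5, lam=1.0, max_steps=30):
    """Backtrack on lambda until the Cauchy step d satisfies (47).
    grad is the gradient of f_bar at w, which equals grad q_k(0)."""
    d = np.zeros_like(w)
    for _ in range(max_steps):
        d = project(w - lam * grad, lower, upper) - w    # Equation (48)
        decrease_ok = quad_model(d) <= quad_model(np.zeros_like(d)) + sigma * (grad @ d)
        if np.linalg.norm(d) <= Delta and decrease_ok:
            break
        lam *= beta
    return d
```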

5.1.2 Newton Direction


Gradient descent methods suffer from slow convergence, so in (48) we should have used the
Newton direction. However, the second-order information is accurate only if there are no
bound constraints. Lin and Moré (1999) propose using Newton directions on the subspace
of the Cauchy point's free components. Recall that we find the Cauchy point to predict the
bounded elements at an optimum. We obtain the free/bounded sets at the Cauchy point

F \equiv F(w^{k,C}) = \{ j \mid l_j < w_j^{k,C} < u_j \} \quad \text{and} \quad B \equiv B(w^{k,C}) = \{ j \mid j \notin F \}, \qquad (50)

and find a Newton direction on the space F by solving

\min_{v_F} \quad q_k(d^{k,C} + v)
\quad \text{subject to} \quad \|d^{k,C} + v\| \le \Delta_k, \; v_B = 0. \qquad (51)

If the free set at the Cauchy point is close to that at an optimum, using a Newton direction
on this sub-space leads to fast convergence.
Because (51) does not enforce the feasibility of wk,C + v, one needs a projected line
search procedure similar to (47)–(48). Details are shown later in (55). The resulting point
may contain more bounded components than the Cauchy point. In this situation, Lin and
Moré (1999) continue to minimize a quadratic approximation on the new sub-space. They
thus generate inner iterates w^{k,1} = w^{k,C}, w^{k,2}, w^{k,3}, \ldots, until the free/bounded sets do


not change. If m inner iterations are taken, then the direction d^k for the kth trust region
iteration in Algorithm 6 is

d^k = w^{k,m+1} - w^k.
Details of our procedure are described in Algorithm 7. Because each inner iteration enlarges
the bounded set, the number of inner iterations is bounded by 2n, the number of variables.
In practice, very few inner iterations (one or two) are taken. Another reason to take inner
iterations is for the quadratic convergence proof in Lin and Moré (1999).
Let t be the inner-iteration index and B_t, F_t be bounded/free sets at w^{k,t}. With v_{B_t} = 0,

q_k(d^{k,t} + v) = \frac{1}{2} v_{F_t}^T \nabla^2 \bar{f}(w^k)_{F_t,F_t} v_{F_t} + \nabla q_k(d^{k,t})_{F_t}^T v_{F_t} + q_k(d^{k,t}), \qquad (52)

so minimizing q_k(d^{k,t} + v) is equivalent to solving the following linear system:

\nabla^2 \bar{f}(w^k)_{F_t,F_t} v_{F_t} = -\nabla q_k(d^{k,t})_{F_t}. \qquad (53)

To solve (53), we conduct conjugate gradient (CG) iterations until

\| \nabla^2 \bar{f}(w^k)_{F_t,F_t} v_{F_t} + \nabla q_k(d^{k,t})_{F_t} \| = \| \nabla q_k(d^{k,t+1})_{F_t} \| \le \epsilon \| \nabla \bar{f}(w^k)_{F_t} \| \qquad (54)

is satisfied, where \epsilon is a given positive constant. See Section 5.1.3 for reasons to choose CG.
CG may stop before reaching (54) if either the iterate causes our search direction to exceed
the trust region boundary or the singularity of the matrix \nabla^2 \bar{f}(w^k)_{F_t,F_t} is detected.
Once a direction v^{k,t} is identified, we conduct a projected line search to ensure the
feasibility and the sufficient decrease of q_k(d). This procedure is similar to (47)-(48) for
the Cauchy step. We find \lambda (e.g., by a backtrack line search) such that

w^{k,t+1} = P[\, w^{k,t} + \lambda v^{k,t} \,], \quad d^{k,t+1} = w^{k,t+1} - w^k, \quad \text{and} \qquad (55)
q_k(d^{k,t+1}) \le q_k(d^{k,t}) + \sigma \nabla q_k(d^{k,t})_{F_t}^T (d^{k,t+1} - d^{k,t})_{F_t},

where P[\cdot] is defined in (49) and \sigma is the same as that in (47).

5.1.3 Hessian-vector Product


CG is very suitable for solving the linear system (53) as it requires only Hessian-vector
products. For logistic regression,

\nabla^2 \bar{f}(w)_{F_t,F_t} = C \begin{bmatrix} X^T \\ -X^T \end{bmatrix}_{F_t,:} D \begin{bmatrix} X & -X \end{bmatrix}_{:,F_t}, \qquad (56)

where D is an l \times l diagonal matrix with

D_{ii} = \tau\bigl( y_i (w^+ - w^-)^T x_i \bigr) \bigl( 1 - \tau( y_i (w^+ - w^-)^T x_i ) \bigr). \qquad (57)

The matrix \nabla^2 \bar{f}(w)_{F_t,F_t} may be too large to be stored. If using CG, the Hessian-vector
product can be conducted by a sequence of matrix-vector products:

\nabla^2 \bar{f}(w^{k,t})_{F_t,F_t} v_{F_t} = C \begin{bmatrix} X^T \\ -X^T \end{bmatrix}_{F_t,:} \Bigl( D \bigl( \begin{bmatrix} X & -X \end{bmatrix}_{:,F_t} v_{F_t} \bigr) \Bigr). \qquad (58)


Thus, the memory problem is solved.


In Lin et al. (2008) for L2-regularized problems, Hessian-vector products are only needed
in CG, but here they are also used in the projected line search for calculating qk (dk,t+1 ) −
qk (dk,t ); see (52)
 and (55).
 Moreover, in (58) we use only a sub-matrix of the Hessian, so
|Ft | columns of [X −X] are needed. Because in general |Ft | is smaller than the number of
variables, calculating (58) may be faster than the product between the whole Hessian and
a vector. To quickly access X’s columns, storing the data matrix X in the column format
is more suitable. For a discussion of row versus column formats, see Lin et al. (2008,
Section 4.3).
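To make the sequence of matrix-vector products in (58) concrete, the following sketch computes the product using only the |Ft| needed columns of [X −X]. It assumes a dense numpy matrix X, the diagonal of D from (57) stored as a length-l array, and an index array F into the 2n variables of (12); the function name and dense storage are our own simplifications, not the actual C/C++ implementation, which works on sparse data.

import numpy as np

def hessian_vector_product(v_F, X, D_diag, F, C):
    # Column j of [X -X] equals X[:, j] for j < n and -X[:, j - n] otherwise.
    n = X.shape[1]
    sign = np.where(F < n, 1.0, -1.0)
    cols = np.where(F < n, F, F - n)
    Xbar_F = X[:, cols] * sign                  # only the |F| needed columns of [X -X]
    u = Xbar_F.dot(v_F)                         # [X -X]_{:,F} v_F
    return C * Xbar_F.T.dot(D_diag * u)         # C [X -X]^T_{F,:} ( D ( [X -X]_{:,F} v_F ) )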

5.1.4 Convergence
From Theorem 2.1 of Lin and Moré (1999), any limit point of {wk } is an optimum of (12).
For the local convergence rate, in Appendix E, we indicate that in general TRON can achieve
quadratic convergence.

5.2 An Interior Point Method (IPM)


Koh et al. (2007) proposed an interior point method to solve (13) with logistic loss. In (13),
we omit the bias term b, but Koh et al. (2007) included this term. They consider a log
barrier function so that (w, u) is an interior point of the feasible region:
 
φt (b, w, u) ≡ t ( Σ_{j=1}^{n} uj + C Σ_{i=1}^{l} ξ(w, b; xi , yi ) ) − Σ_{j=1}^{n} log(uj² − wj²),

where t > 0 is a parameter. The unique minimizer (b∗ (t), w∗ (t), u∗ (t)) under any given t
forms a curve called the “central path,” which approaches an optimal solution of (13) as
t → ∞. An interior point method thus alternately minimizes φt (b, w, u) and adjusts t.
From a set of (w, b) on the search path, we can construct a feasible solution to the dual
problem of (13) and evaluate the duality gap. The duality gap is guaranteed to converge
to 0 as we walk along the central path when t → ∞. Thus, we can check the duality gap
for the stopping condition.
At the kth iteration, using the current tk , interior point methods approximately minimize
φtk (b, w, u) by finding a Newton direction. The following linear system is solved:
 
∇2 φtk (bk , wk , uk ) [∆b; ∆w; ∆u] = −∇φtk (bk , wk , uk ). (59)
For logistic loss,
∇φt (b, w, u) = [ tC Σ_{i=1}^{l} ( τ(yi (wT xi + b)) − 1 ) yi ;
                  tC Σ_{i=1}^{l} ( τ(yi (wT xi + b)) − 1 ) yi xi + ( 2w1/(u1² − w1²), . . . , 2wn/(un² − wn²) )T ;
                  t en − ( 2u1/(u1² − w1²), . . . , 2un/(un² − wn²) )T ]


Algorithm 8 IPM for L1-regularized logistic regression

1. Given b1 and an interior point (w1 , u1 ). Let t1 = 1/(Cl).
2. For k = 1, 2, 3, . . .
• Obtain a Newton direction (∆b, ∆w, ∆u) by solving (59).
• Conduct a backtracking line search along this direction to ensure the sufficient decrease of φtk (·), yielding a step size λ.
• Update (bk+1 , wk+1 , uk+1 ) = (bk + λ∆b, wk + λ∆w, uk + λ∆u).
• Construct a dual feasible point and evaluate the duality gap η.
• Set tk+1 = max( µ min(2n/η, tk ), tk ) if λ ≥ smin , and tk+1 = tk otherwise, where µ and smin are constants.

and
∇2 φt (b, w, u) = [ tC yT D y     tC elT D X        0T ;
                    tC XT D el    tC XT D X + D1    D2 ;
                    0             D2                D1 ],

where τ (·) is defined in (9), en ∈ Rn and el ∈ Rl are the vectors of all ones, D is similar to
(57) but includes b, and D1 and D2 are n × n diagonal matrices:

(D1 )jj = 2(uj² + wj²)/(uj² − wj²)²   and   (D2 )jj = −4 uj wj /(uj² − wj²)².

Koh et al. (2007) apply preconditioned conjugate gradient methods (PCG) to solve (59)
with diagonal preconditioning.
For the convergence, a backtrack line search procedure is needed to ensure the sufficient
decrease of φtk (·). Koh et al. (2007) did not discuss details of their method’s convergence.
However, because interior point methods are a type of Newton method, they often enjoy
fast local convergence.
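As an illustration of the barrier function and the gradient formula above, the following sketch evaluates φt and ∇φt for logistic loss with dense numpy arrays. It is only a sketch under these assumptions; it is not the l1_logreg implementation, which handles sparse data and solves (59) with a preconditioned CG method, and the function names are ours.

import numpy as np

def tau(z):                                   # τ(z) = 1 / (1 + exp(−z)), as in (9)
    return 1.0 / (1.0 + np.exp(-z))

def phi_t(t, C, X, y, b, w, u):
    # Log-barrier objective φ_t(b, w, u) for L1-regularized logistic regression.
    z = y * (X.dot(w) + b)
    logistic_loss = np.log1p(np.exp(-z)).sum()
    return t * (u.sum() + C * logistic_loss) - np.log(u ** 2 - w ** 2).sum()

def grad_phi_t(t, C, X, y, b, w, u):
    # Gradient (g_b, g_w, g_u) matching the displayed formula.
    z = y * (X.dot(w) + b)
    r = tau(z) - 1.0                          # τ(y_i (w^T x_i + b)) − 1
    g_b = t * C * np.dot(r, y)
    g_w = t * C * X.T.dot(r * y) + 2.0 * w / (u ** 2 - w ** 2)
    g_u = t * np.ones_like(u) - 2.0 * u / (u ** 2 - w ** 2)
    return g_b, g_w, g_u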

5.3 Lassplore Method


Liu et al. (2009) apply Nesterov’s method (Nesterov, 2003) to solve (14). This method can
handle (14) with and without the bias term. For simplicity, we do not consider the bias
term. In addition to the sequence of iterations {wk }, for faster convergence, Nesterov’s
method uses another sequence of searching points {sk }, where

sk = wk + βk (wk − wk−1 ),

for some positive parameter βk . From sk , we obtain wk+1 by taking the negative gradient
direction:
wk+1 = sk − λk ∇L̄(sk ),


Algorithm 9 Lassplore for solving (14)


• Given w0 and w1 , α0 = 0.5, λ1 > 0, γ1 ≥ 0.
• For k = 1, 2, 3, . . .
1. While 1 do
– Compute αk ∈ (0, 1) as the root of ηk (α), βk by (60), and γk+1 by (61).
– Compute sk = wk + βk (wk − wk−1 ).
– Compute wk+1 by (63).
– If L̄(wk+1 ) satisfies (62)
goto Step 2.
Else
λk ← λk /2.
2. Find the initial λk+1 for the next iteration by an adaptive scheme.4

where
L̄(w) ≡ Σ_{i=1}^{l} ξ(w; xi , yi )

is the objective function of (14) and λk is the step size. Note that L̄(w) is different from
L(w) defined in (7) because the penalty parameter C is not needed in (14). Liu et al. (2009)
suggest estimating βk by
βk = γk (1 − αk−1 ) / ( αk−1 (γk + αk /λk ) ), (60)
where αk ∈ (0, 1) is the root of a quadratic function

ηk (α) ≡ α²/λk + γk α − γk

and γk ≥ 0 satisfies
γk+1 = (1 − αk )γk if k ≥ 1 and γ1 ≥ 0. (61)
We have αk ∈ (0, 1) because ηk (0) = −γk < 0 and ηk (1) = 1/λk > 0.
Lassplore applies an adaptive line search scheme that adjusts λk so that wk+1 satisfies
L̄(wk+1 ) ≤ L̄(sk ) + ∇L̄(sk )T (wk+1 − sk ) + (1/(2λk )) kwk+1 − sk k_2^2 . (62)

The new solution wk+1 generated from the above process may not satisfy the constraint in
(14), so Lassplore projects the solution to the feasible region:

wk+1 ≡ arg min_w { k(sk − λk ∇L̄(sk )) − wk | kwk1 ≤ K }. (63)

This projection is related to (15) in Section 2.2.2, but Lassplore applies the method in Liu
and Ye (2009) to efficiently compute (63).
A summary of Lassplore is given in Algorithm 9.
4. See Liu et al. (2009, Algorithm 3) for details.
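The projection (63) is the main non-trivial step of Algorithm 9. The sketch below uses the standard O(n log n) sort-based projection onto the L1-ball; Liu and Ye (2009) employ a more efficient procedure, but the projected point is the same. The variable names and dense numpy arrays are our own assumptions.

import numpy as np

def project_l1_ball(v, K):
    # Euclidean projection of v onto {w : ||w||_1 <= K}.
    if np.abs(v).sum() <= K:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]               # sorted magnitudes, descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > cssv - K)[0][-1]
    theta = (cssv[rho] - K) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# One gradient step followed by the projection (63), assuming grad_Lbar(s) returns ∇L̄(s):
#   s = w_k + beta_k * (w_k - w_km1)
#   w_next = project_l1_ball(s - lam_k * grad_Lbar(s), K)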


6. Three Other Methods for L1-regularized Logistic Regression


We describe details of three more methods because they are included in our comparison.

6.1 Orthant-Wise Limited-memory Quasi-Newton (OWL-QN)


LBFGS (Liu and Nocedal, 1989) is a limited memory quasi Newton method for unconstrained
smooth optimization. It cannot deal with (1) because of the non-differentiability. Andrew
and Gao (2007) modified LBFGS to solve (1) and named their method OWL-QN. Here,
we discuss how they handle the non-differentiability.
From earlier discussion, we know that (1) is differentiable in a region where the sign
of wj does not change. Thus, OWL-QN restricts the search space to such a region (called
orthant). At the kth iteration, it searches wk+1 on the space:

Ωk ≡ {w ∈ Rn | sgn(wj ) = skj , j = 1, . . . , n},

where
skj ≡ sgn(wjk ) if wjk ≠ 0, and skj ≡ sgn(−∇¯j f (wk )) otherwise, (64)
and

∇¯j f (w) ≡  L′j (w) + 1   if wj > 0 or (wj = 0 and L′j (w) + 1 < 0),
             L′j (w) − 1   if wj < 0 or (wj = 0 and L′j (w) − 1 > 0),      (65)
             0             otherwise

is defined as the pseudo gradient of f (w). In (64), if wjk = 0, we consider the space where
wj can be moved by taking the negative gradient direction. OWL-QN then approximately
minimizes a quadratic approximation of (1) in the search space Ωk :

min_d   f (wk ) + ∇¯f (wk )T d + (1/2) dT Hk d
subject to   wk + d ∈ Ωk ,      (66)

where Hk approximates the Hessian of f (wk ) by the first-order information gathered from
previous iterations. Details for getting Hk can be found in Liu and Nocedal (1989). To
approximately solve (66), OWL-QN finds the minimum of the quadratic objective function:

dk = −Hk^{-1} ∇¯f (wk ), (67)
obtains a direction d̄k by confining dk to the same orthant as −∇¯f (wk ):

d̄kj = dkj if sgn(dkj ) = sgn(−∇¯j f (wk )), and d̄kj = 0 otherwise, (68)

and then conducts a backtracking line search to find λ such that the sufficient decrease of
the function value is satisfied. Note that wk+1 must be in Ωk , so following (64),
wjk+1 = wjk + λd̄kj if sgn(wjk + λd̄kj ) = skj , and wjk+1 = 0 otherwise.


Algorithm 10 OWL-QN: An extension of LBFGS for L1-regularized logistic regression

1. Given w1 .
2. For k = 1, 2, 3, . . .
• Compute ∇¯f (wk ) by (65).
• Obtain Hk by information gathered from previous iterations and compute dk by (67).
• Compute d̄k by (68).
• Find wk+1 ∈ Ωk by a backtracking line search.

Algorithm 10 summarizes the procedure.
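The two L1-specific ingredients of Algorithm 10, the pseudo-gradient (65) and the orthant alignment (68), can be written compactly as below. This is a numpy sketch with our own function names, not the OWL-QN source code; grad_L denotes the component-wise gradient L′(w) of the loss term.

import numpy as np

def pseudo_gradient(w, grad_L):
    # Pseudo-gradient of f(w) = ||w||_1 + L(w); see (65).
    pg = np.zeros_like(w)
    pos, neg, zero = w > 0, w < 0, w == 0
    pg[pos] = grad_L[pos] + 1.0
    pg[neg] = grad_L[neg] - 1.0
    right = zero & (grad_L + 1.0 < 0)          # w_j = 0 but moving right decreases f
    left = zero & (grad_L - 1.0 > 0)           # w_j = 0 but moving left decreases f
    pg[right] = grad_L[right] + 1.0
    pg[left] = grad_L[left] - 1.0
    return pg

def align_direction(d, pg):
    # Confine d to the orthant of -pg; see (68).
    return np.where(np.sign(d) == np.sign(-pg), d, 0.0)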


Regarding the convergence, Yu et al. (2010, Appendix D) point out that the proof by
Andrew and Gao (2007) is flawed. Yu et al. (2010) give the convergence proof for a slightly
modified algorithm.

6.2 Generalized Linear Model with Elastic Net


Friedman et al. (2010) proposed GLMNET to handle least-square and log-linear losses with
L1/L2 regularization. Here, we discuss how GLMNET solves L1-regularized logistic regres-
sion. Although GLMNET can solve (5) with the bias term, for simplicity, we only indicate
how GLMNET solves (1).
Because of the twice differentiability of the logistic loss function, the gradient of L(w)
is shown in (42) and the Hessian is
∇2 L(w) = C XT DX, (69)

where D ∈ Rl×l is a diagonal matrix with


Dii = τ (yi wT xi ) ( 1 − τ (yi wT xi ) ) (70)
and τ (·) is defined in (9). See similar formulations derived earlier for TRON in (56) and (57).
Given the current solution wk , GLMNET considers a quadratic approximation of L(w). By
the second-order Taylor expansion,
f (wk + d) − f (wk )
= ( kwk + dk1 + L(wk + d) ) − ( kwk k1 + L(wk ) )
≈ ∇L(wk )T d + (1/2) dT ∇2 L(wk ) d + kwk + dk1 − kwk k1 .
Then, GLMNET solves the Newton-like system
min_d qk (d) ≡ ∇L(wk )T d + (1/2) dT ∇2 L(wk ) d + kwk + dk1 − kwk k1 (71)
by a cyclic coordinate descent method. Following the framework in Algorithm 1, d’s values
are sequentially updated by minimizing the following one-variable function:
gj (z) ≡ qk (d + zej ) − qk (d) = |wjk + dj + z| − |wjk + dj | + Bz + (1/2) Az²,


Algorithm 11 GLMNET for L1-regularized logistic regression


1. Given w1 .
2. For k = 1, 2, 3, . . .
• Let dk ← 0.
• While dk is not optimal for minimizing qk (d)
– For j = 1, . . . , n
∗ Solve the following one-variable problem by (27):

z̄ = arg min_z qk (dk + zej ) − qk (dk ).

∗ dkj ← dkj + z̄.


• wk+1 = wk + dk .

where B ≡ ∇j L(wk ) + Σ_t ∇2jt L(wk ) dt and A ≡ ∇2jj L(wk )
can be calculated using (42) and (69). It is easy to minimize gj (z) by (27).5
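For reference, the one-variable minimizer of gj (z) has the closed form derived in Appendix B, with wjk + dj playing the role of wj . The sketch below states it directly; as noted in footnote 5, the released GLMNET code uses a different but equivalent update, so this is only an illustration.

def one_variable_minimizer(v, B, A):
    # Minimize |v + z| + B z + (A/2) z^2 over z, where v = w_j^k + d_j and A > 0;
    # see Appendix B for the derivation.
    if B + 1.0 <= A * v:
        return -(B + 1.0) / A
    elif B - 1.0 >= A * v:
        return -(B - 1.0) / A
    else:
        return -v                               # step exactly to v + z = 0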
Because calculating the matrix D involves many exponential operations, GLMNET also
considers using an approximation of ∇2 L(w) and minimizes
qk (d) ≡ ∇L(wk )T d + (1/2) dT H d + kwk + dk1 − kwk k1 ,

where H ≡ 0.25 C XT I X and I is an identity matrix. That is, we use a cheaper but less
accurate approximation of f (wk + d) − f (wk ). A sketch of GLMNET is in Algorithm 11.
We briefly describe some of GLMNET’s implementation details. GLMNET applies a shrink-
ing technique to solve a smaller optimization problem than (71); see similar techniques for
decomposition methods in Section 4.1.2. Using a sparse representation of w and maintain-
ing an index set Ω to indicate the non-zero elements of d, GLMNET solves a smaller problem
by a coordinate descent method:

min_{dΩ} ∇L(wk )T d + (1/2) dT H d + kwk + dk1 − kwk k1 .
GLMNET conducts feature-wise normalization before solving the optimization problem.
That is, it solves (1) by replacing xi with x̃i , where
x̃ij ≡ (xij − x̄j )/σj , ∀i, j,   x̄j = (1/l) Σ_{i=1}^{l} xij , ∀j,   and   σj = sqrt( Σ_{i=1}^{l} xij² − x̄j² ), ∀j.

Notice that there is no guarantee that GLMNET converges to an optimum. Furthermore,
the function value may not decrease because GLMNET does not conduct a line search
procedure on the direction d. For minimizing the quadratic approximation qk (d), GLMNET
measures the relative step change in the successive coordinate descent iterations. Deciding
when to stop minimizing qk (d) is an issue because a strict stopping condition may already
cause a long running time for q1 (d).
5. In the GLMNET implementation, instead of finding z, a different but equivalent update is used to get the
new wjk + dj .


Algorithm 12 BMRM for L1-regularized logistic regression


1. Given w1 .
2. For k = 1, 2, 3, . . .
• Compute and store ak and bk by (73).
• Obtain wk+1 by solving the linear program (76).

6.3 Bundle Method


The bundle method is a cutting plane approach for minimizing non-smooth convex problems.
Teo et al. (2010) proposed a bundle method BMRM to handle non-differentiable loss func-
tions (e.g., L1 loss). They provide an extension to handle L1 regularization. Interestingly,
BMRM applies cutting planes only to the loss function, regardless of whether it is differ-
entiable or not. Therefore, for problem (1) with logistic loss, BMRM uses a non-smooth
method to handle the smooth loss function and some other ways for the non-smooth regu-
larization term kwk1 .
Let wk be the solution at the kth iteration. Using the convexity of the loss function,
BMRM builds a cutting plane (i.e., the first-order Taylor expansion) of L(w) at w = wk :

L(w) ≥ ∇L(wk )T (w − wk ) + L(wk ) = akT w + bk , ∀w, (72)

where
ak ≡ ∇L(wk ) and bk ≡ L(wk ) − aTk wk . (73)
If L(w) is non-differentiable, in (72), BMRM substitutes ∇L(w) with the sub-gradient of
L(w).
BMRM maintains all cutting planes from the earlier iterations to form a lower-bound
function for L(w):
L(w) ≥ L_k^CP (w) ≡ max_{1≤t≤k} atT w + bt , ∀w. (74)

BMRM obtains wk+1 by solving the following sub-problem:

min_w kwk1 + L_k^CP (w). (75)

Using (74) and the splitting of w in (12) by w = w+ − w− , Equation (75) can be reformu-
lated to the following linear programming problem:
min_{w+, w−, ζ}   Σ_{j=1}^{n} wj+ + Σ_{j=1}^{n} wj− + ζ
subject to   atT (w+ − w− ) + bt ≤ ζ,   t = 1, . . . , k,       (76)
             wj+ ≥ 0, wj− ≥ 0,   j = 1, . . . , n.

A summary of the BMRM approach is given in Algorithm 12. Teo et al. (2010, Appendix C)
indicated that because of the L1 regularization, the convergence result has not been fully
established yet.
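To illustrate the sub-problem (76), the sketch below assembles the cutting planes (73) into a linear program and solves it with scipy.optimize.linprog. BMRM itself calls GLPK; the variable ordering [w+, w−, ζ], the helper name, and the dense arrays are our own choices.

import numpy as np
from scipy.optimize import linprog

def bmrm_subproblem(A_cuts, b_cuts, n):
    # Rows of A_cuts are a_t = ∇L(w^t); b_cuts holds b_t = L(w^t) − a_t^T w^t.
    k = A_cuts.shape[0]
    c = np.concatenate([np.ones(2 * n), [1.0]])              # objective of (76)
    # a_t^T (w+ − w−) + b_t <= ζ   <=>   a_t^T w+ − a_t^T w− − ζ <= −b_t
    A_ub = np.hstack([A_cuts, -A_cuts, -np.ones((k, 1))])
    b_ub = -np.asarray(b_cuts, dtype=float)
    bounds = [(0, None)] * (2 * n) + [(None, None)]           # w+, w− >= 0; ζ free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    wp, wm = res.x[:n], res.x[n:2 * n]
    return wp - wm                                            # w^{k+1} = w+ − w−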


7. L1-regularized L2-loss Support Vector Machines


Previous sections have focused on L1-regularized logistic regression. Now we consider (1)
with the L2-loss function (4). The optimization problem can be rewritten as
min_w f (w) ≡ kwk1 + C Σ_{i∈I(w)} bi (w)², (77)

where
bi (w) ≡ 1 − yi wT xi and I(w) ≡ {i | bi (w) > 0}. (78)
Therefore, the sum of losses is
L(w) = C Σ_{i∈I(w)} bi (w)². (79)

In contrast to logistic loss, the L2-loss function is differentiable but not twice differentiable
(Mangasarian, 2002). Thus, some methods discussed in Sections 4–6 may not be directly
applicable because they use second-order information. However, as shown in Mangasarian
(2002, Section 3), (79) is twice differentiable at all but {w | bi (w) = 0 for some i}. More-
over, ∇L(w) is globally Lipschitz continuous, so a generalized Hessian exists everywhere.
Using a generalized Hessian, we may modify algorithms using the second-order information
for L1-regularized L2-loss SVMs. In the following two sections, we extend CDN and TRON
to solve (77).

7.1 CDN for L2-loss SVMs


To apply CDN for L2-loss SVMs, in the sub-problem (17), we have
Lj (z) = C Σ_{i∈I(wk,j + zej)} bi (wk,j + zej )².

For a second-order approximation similar to (26), we need L′j (0) and L″j (0):

L′j (0) = −2C Σ_{i∈I(wk,j)} yi xij bi (wk,j ).

Unfortunately, L″j (0) is not well-defined if there exists some i such that bi (wk,j ) = 0.
Following Chang et al. (2008), we consider the generalized second derivative:

2C Σ_{i∈I(wk,j)} xij². (80)

By replacing L″j (0) in (26) with the above quantity, we can easily obtain a direction d.
However, the value in (80) may be zero if xij = 0, ∀i ∈ I(wk,j ). To apply the convergence
result in Tseng and Yun (2009), we ensure the strict positivity by taking

max( 2C Σ_{i∈I(wk,j)} xij², ε ), (81)


where ε is a small positive value. Following the explanation in Appendix F, we have the finite
termination of the line search procedure and the asymptotic convergence of the function
value. If the loss function L(w) is strictly convex, the algorithm converges at least linearly.
The proof is in the supplementary file.
Like the situation for logistic regression, the major cost for finding the Newton direction
d and for the line search procedure is to calculate wT xi , ∀i ∈ I. We maintain bi (w), ∀i
using the same trick in (25). All other implementation techniques discussed in Section 4.1.2
for logistic regression can be applied here.
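A single coordinate step of CDN for (77) can be sketched as follows, using the generalized second derivative (81) and the closed-form minimizer of Appendix B. The dense arrays, the helper name, and the omission of the line search are our own simplifications; a line search as in CDN is still required before accepting the step.

import numpy as np

def cdn_l2svm_direction(w_j, j, X, y, b, C, eps=1e-12):
    # b[i] = b_i(w) = 1 − y_i w^T x_i is maintained by the caller.
    I = b > 0                                             # the index set I(w)
    Lp = -2.0 * C * np.sum(y[I] * X[I, j] * b[I])         # L'_j(0)
    Lpp = max(2.0 * C * np.sum(X[I, j] ** 2), eps)        # generalized second derivative (81)
    # Closed-form minimizer of |w_j + z| + Lp z + (Lpp/2) z^2; see Appendix B.
    if Lp + 1.0 <= Lpp * w_j:
        return -(Lp + 1.0) / Lpp
    elif Lp - 1.0 >= Lpp * w_j:
        return -(Lp - 1.0) / Lpp
    return -w_j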

7.2 TRON for L2-loss SVMs


We apply TRON to solve the bound-constrained problem (12) using L2 loss. Following the
notation in Section 5.1, w ∈ R2n is defined in (43) and f¯(w) is the objective function in
(12).
The quadratic model qk (d) requires the gradient and the Hessian of f¯(w). We have
∇f¯(w) = e + 2C ( X̄I,:T X̄I,: w − X̄I,:T yI ),

where e ∈ R2n is a vector of all ones, X̄ ≡ [X −X], and I is defined in (78). X̄I,: denotes
a sub-matrix including X̄’s rows corresponding to I; see (8). Note that bi (w) defined in
(78) is now calculated by
1 − yi (w1:n − w(n+1):2n )T xi .
Following Mangasarian (2002) and Fan et al. (2008, Appendix D), we consider the
generalized Hessian matrix:
2C X̄T D X̄,
where D ∈ Rl×l is a diagonal matrix with
Dii = 1 if bi (w) > 0, and Dii = 0 if bi (w) ≤ 0.
We then apply Algorithm 7 to approximately minimize qk (d). For the Hessian-vector prod-
uct, only instances in the index set I and variables in the free set F are considered. Thus,
the Hessian-vector product in (58) becomes
2C X̄I,FT DI,I ( X̄I,F vF ),
where v is a vector in R2n .
Regarding the convergence, from Theorem 2.1 of Lin and Moré (1999), any limit point
of {wk } is an optimal solution. However, without the twice differentiability, it is unclear if
the local quadratic convergence holds.
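The generalized Hessian-vector product above can be sketched with dense arrays as follows; X̄ = [X −X] is formed explicitly here only for brevity, and the restriction to the sets I and F is omitted, so this is merely an illustration of the 0/1 diagonal structure.

import numpy as np

def l2svm_hessian_vector(v, Xbar, y, w, C):
    # Generalized Hessian-vector product 2C X̄^T D (X̄ v) with D_ii = 1 if b_i(w) > 0.
    b = 1.0 - y * Xbar.dot(w)              # b_i(w), with w ∈ R^{2n} and X̄ = [X −X]
    d = (b > 0).astype(float)              # diagonal of D
    return 2.0 * C * Xbar.T.dot(d * (Xbar.dot(v)))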

8. Numerical Experiments
In Sections 4–7, we have described details of several large-scale optimization methods for
L1-regularized linear classification. In this section, we conduct experiments to investigate
their performances. We describe data sets and experimental settings first. Then, we com-
prehensively compare methods for logistic regression and L2-loss SVMs. Programs used in
this paper are available at http://www.csie.ntu.edu.tw/~cjlin/liblinear/exp.html.


Data set l n #nz


a9a 32,561 123 451,592
real-sim 72,309 20,958 3,709,083
news20 19,996 1,355,191 9,097,916
rcv1 677,399 47,236 49,556,258
yahoo-japan 176,203 832,026 23,506,415
yahoo-korea 460,554 3,052,939 156,436,656

Table 1: Statistics of data sets: l and n denote the numbers of instances and features in a
data set, respectively. The column #nz indicates the number of non-zero entries.

Data set | LR without bias: C, Density, Acc. | LR with bias: C, Density, Acc. | L2-loss SVM: C, Density, Acc.
a9a 4.0 94.3 85.2 2.0 89.3 85.3 0.5 90.2 85.2
real-sim 4.0 16.8 97.1 4.0 16.2 97.0 1.0 18.9 97.1
news20 64.0 0.2 95.1 64.0 0.2 94.9 64.0 0.7 96.1
rcv1 4.0 23.8 97.8 4.0 23.0 97.8 1.0 25.9 97.8
yahoo-japan 4.0 1.3 91.9 4.0 1.0 93.1 1.0 1.4 92.2
yahoo-korea 4.0 1.0 87.6 4.0 0.9 87.7 1.0 1.0 87.7

Table 2: The best parameter C, the model density (%), and the testing accuracy (%). We
conduct five-fold cross validation on the training set to select C in {2^k | k = −4, −3, . . . , 6}.
Using the selected C, we build a model to predict the testing set.

8.1 Data Sets


Table 1 lists data statistics. Most data sets include documents, where the numbers of both
features and instances are large. An exception is the problem a9a from UCI “adults” data
sets; it has l ≫ n. For other document sets, real-sim includes some Usenet articles, news20
is a collection of news documents, rcv1 is an archive of manually categorized newswire
stories from Reuters, and yahoo-japan/yahoo-korea are document data from Yahoo!. Except
yahoo-japan and yahoo-korea, other data sets are publicly available at http://www.csie.
ntu.edu.tw/~cjlin/libsvmtools/datasets/. For every document data set, instance-wise
normalization has been conducted so that the length of each instance is one.
To estimate the testing accuracy, a stratified selection is taken to split each data set
into one fifth for testing and the rest for training.
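The instance-wise normalization mentioned above simply rescales each row of the data matrix to unit Euclidean length; a minimal dense-numpy sketch is given below (the actual data sets are stored in sparse format, so this is only illustrative).

import numpy as np

def normalize_instances(X):
    # Scale each instance (row) to unit length; all-zero rows are left unchanged.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0
    return X / norms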

8.2 Experimental Settings


We consider the following implementations discussed in Sections 4–7.
• BBR: the cyclic coordinate descent method for logistic regression is described in Section
4.1.1. We download version 4.03 from http://www.bayesianregression.org/.
• CDN: the cyclic coordinate descent methods for logistic regression and L2-loss SVMs
are respectively described in Sections 4.1.2 and 7.1. To check the sufficient decrease
condition, we use σ = 0.01 and β = 1/2. For L2-loss SVM, we use ε = 10^−12 in (81). The


implementation is the same as that included in version 1.6 of our software LIBLINEAR
(http://www.csie.ntu.edu.tw/~cjlin/liblinear/).
In Section 4.1.2, we discuss the shrinking technique for CDN. We defer the investigation
of its effectiveness to Section 8.4. In all other places of the comparison, we run CDN with
the shrinking strategy.
• SCD: the stochastic coordinate descent method for logistic regression is described in
Section 4.1.3. The source code is available at http://ttic.uchicago.edu/~tewari/
code/scd/.
• CGD-GS: the block coordinate gradient descent method for logistic regression is described
in Section 4.2. Following Yun and Toh’s (2011) settings, we choose
H = diag[ min( max(∇2jj L(wk ), 10^−10 ), 10^10 ) ]_{j=1,...,n}

and apply the Gauss-Southwell-r rule to choose the working set J; see Equation (40).
Note that from Table 9 of Yun and Toh (2011), implementations using Gauss-Southwell-r
and Gauss-Southwell-q rules perform similarly on document data. We use default values
for all parameters; see Section 5.1 in Yun and Toh (2011). In particular, σ = 0.1, β = 1/2,
and γ = 0. The source code is available at http://www.math.nus.edu.sg/~matys/.
• TRON: the trust region Newton methods for logistic regression and L2-loss SVMs are
respectively described in Sections 5.1 and 7.2. In the projected line search (47) and
(55), we use σ = 0.01 and β = 0.1. For the stopping condition of the CG procedure in
Algorithm 7, we use ε = 0.1. The parameter η0 in (46) is 0.0001.
• IPM: the interior point method for logistic regression is described in Section 5.2. For small
and dense data sets, IPM solves a linear system in Algorithm 8 by Cholesky factorization.
For large and sparse data sets, it applies a preconditioned conjugate gradient method to
approximately solve the linear system. In the experiment, we only consider the large and
sparse setting. We use default parameters: σ = 0.01 and β = 1/2 for the line search
procedure. To update tk , we use smin = 1/2 and µ = 2. The source code (version 0.8.2)
is downloaded from http://www.stanford.edu/~boyd/l1_logreg/.
As w lies in the interior of the feasible region, every wj is non-zero after IPM stops. To
gain the sparsity, following the condition (31), Koh et al. (2007) assign those wj satisfying
|∇j L(w)| ≤ 0.9999 to zero. However, if we have not run enough iterations to obtain an
accurate solution, this modification of w may result in an erroneous model. We thus add
another condition |wj | < 1 in deciding if wj should be assigned to zero. We will address
this issue again in Section 8.3.
We find that because of a problem of not initializing an element in an array, the pre-
vious version (0.8.1) of IPM is two or three times slower than the latest version (0.8.2).
This observation indicates the difficulty in comparing software. Some minor issues may
significantly affect the conclusions.
• OWL-QN: the quasi Newton method is described in Section 6.1. The source code (version
1.1.2) is available at http://research.microsoft.com/en-us/um/people/jfgao/.
• GLMNET: the method is described in Section 6.2 for logistic regression. Following the
default setting, we use the full Hessian in (71) instead of an approximation. We gradually
reduce the stopping tolerance to obtain different models. The coordinate descent method
for minimizing the quadratic approximation qk (d) stops according to the same tolerance as
the overall stopping tolerance. The source code (version 1.5) is available at
http://cran.r-project.org/web/packages/glmnet/index.html.
• Lassplore: this method is described in Section 5.3. We check the 1-norm of the optimal
solution (obtained by other solvers) to calculate the value K in (14). The source code (ver-
sion 1.0) is available at http://www.public.asu.edu/~jye02/Software/lassplore.
• BMRM: this bundle method is described in Section 6.3. We apply it to both logistic regres-
sion and L2-loss SVMs. The linear programming problem (76) is solved via GNU Linear
Programming Kit (GLPK). The source code (version 2.2) and GLPK are available at http:
//users.rsise.anu.edu.au/~chteo/BMRM.html and http://www.gnu.org/software/
glpk/, respectively.
GLMNET is implemented in Fortran along with an R interface. CGD-GS and Lassplore
are primarily implemented in MATLAB, but expensive operations are coded in C/C++.
All other solvers are implemented in C/C++ with double precision.
Some implementations (SCD, TRON, OWL-QN, and BMRM) solve (1), while some (CGD-
GS, IPM, GLMNET, and Lassplore) consider the bias term and solve (5). BBR6 and our
CDN implementation can handle both (1) and (5). According to whether the bias term is
considered, we categorize methods into two groups for comparison.7 Moreover, for certain
solvers their formulations are scaled by a constant. We ensure that equivalent optimization
problems are solved.
Software such as IPM and GLMNET supports finding a solution path of various C values.
However, in our experiments, we focus on the performance of an algorithm using a fixed C.
For all methods, we set the initial solution w1 = 0.
The parameter C is chosen by five-fold cross validation (CV) on the training set. Using
models trained under the best C, we predict the testing set to obtain the testing accuracy.
The best C, the model density (the number of non-zero coefficients in w divided by n), and
the corresponding testing accuracy are recorded in Table 2.
In the rest of this section, we compare the training speed of solvers for logistic regression
and L2-loss SVMs by using the best parameter C of each data set. We run all experiments
on a 64-bit machine with Intel Xeon 2.0GHz CPU (E5504), 4MB cache, and 32GB main
memory. We use GNU C/C++/Fortran compilers (version 4.4.1) and ensure that for each
package the “-O3” optimization flag is set.

8.3 Comparing Methods for L1-regularized Logistic Regression


We separately compare solvers for optimization problems without/with the bias term. The
first group, which solves (1), includes BBR, CDN, SCD, TRON, OWL-QN, and BMRM. We
begin by showing in Figure 1 the relation between the function value and the training time.
In each figure, the x-axis is the training time and the y-axis is the relative difference to the
optimal function value:

(f (w) − f ∗ ) / f ∗ , (82)

6. BBR considers the bias term as a feature and allows users to use an option “-I” to specify weights for
individual features.
7. For simplicity, we apply BBR only to solve (1).


Figure 1: Relative difference between the objective function and the optimum value, versus
training time. Logistic regression without using the bias term is solved. Both x-axis and
y-axis are log-scaled. Panels: (a) a9a, (b) real-sim, (c) news20, (d) rcv1, (e) yahoo-japan, (f) yahoo-korea.

where f ∗ is obtained by running TRON with a strict stopping condition. Both x-axis and
y-axis are log-scaled. We draw a dotted reference line in Figure 1 to indicate the relative
error 0.1. From Figure 1, BBR and CDN can more quickly give a good solution than SCD,
TRON, OWL-QN, and BMRM. However, quasi Newton and Newton methods may have
faster local convergence. In Figures 1(c) and 1(d), TRON’s curve is almost vertical in the
end. This fast local convergence is mainly useful for problems such as a9a, for which BBR
and CDN are less competitive. We note that a9a is not a document data set and its number
of features is much smaller than the number of instances. Earlier studies for L2-regularized
classifiers have shown that coordinate descent methods are less competitive for this type of
data (Hsieh et al., 2008). The same situation seems to occur here for L1 regularization.
It is surprising that SCD is much slower than CDN and BBR because they are all coordi-
nate descent methods. We modify CDN to randomly select working variables, a stochastic
method similar to SCD; the resulting method is still much faster than SCD, suggesting that
the stochastic selection is not the culprit. It turns out that a too large upper bound of the
second derivative causes the slow convergence of SCD. When xij ∈ [−1, 1], SCD replaces
Σ_{i=1}^{l} xij² in (34) with l. Then its Uj in (35) is much larger than Uj in (22) for BBR.
Therefore, SCD’s step size −g′j (0)/Uj is often too small.
Equation (82) cannot be used for a practical stopping condition because f ∗ is not easily
available. Instead, we often rely on the gradient information. Now (1) is not differentiable,


Figure 2: The 2-norm of the minimum-norm sub-gradient (83) versus training time. Logistic
regression without using the bias term is solved. Both x-axis and y-axis are log-scaled.
Panels: (a) a9a, (b) real-sim, (c) news20, (d) rcv1, (e) yahoo-japan, (f) yahoo-korea.

so we follow the optimality condition (6) to define the minimum-norm sub-gradient:



∇jS f (w) ≡  ∇j L(w) + 1   if wj > 0,
             ∇j L(w) − 1   if wj < 0,
             sgn(∇j L(w)) max(|∇j L(w)| − 1, 0)   otherwise.

Since ∇S f (w) = 0 if and only if w is optimal, we can check k∇S f (w)k for the stopping
condition. Figure 2 shows the scaled 2-norm of the minimum-norm sub-gradient,
( l / ( min( Σ_{i:yi=1} 1, Σ_{i:yi=−1} 1 ) k∇S f (w1 )k ) ) k∇S f (w)k, (83)

along the training time. From Figure 2, CDN and BBR are the best in the early stage
of the procedure, while Newton and quasi Newton methods such as TRON and OWL-QN
have faster local convergence. This observation is consistent with Figure 1. Some methods
(e.g., decomposition methods) do not calculate ∇S f (w) in their procedures, so a gradient-
based stopping condition may introduce extra cost. However, these methods may be able to
use approximate gradient values obtained during the calculation. For example, coordinate
descent methods calculate only ∇j f (wk,j ), j = 1, . . . , n instead of ∇f (wk+1 ), but these
values can be directly used in a stopping condition; see details in Appendix F of Fan et al.
(2008).
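As an illustration, the minimum-norm sub-gradient and the scaled quantity (83) can be computed as below, given ∇L(w) and the labels y; the helper names are ours and the arrays are dense for simplicity.

import numpy as np

def min_norm_subgradient(w, grad_L):
    # Component-wise minimum-norm sub-gradient of f(w) = ||w||_1 + L(w).
    return np.where(w > 0, grad_L + 1.0,
           np.where(w < 0, grad_L - 1.0,
                    np.sign(grad_L) * np.maximum(np.abs(grad_L) - 1.0, 0.0)))

def scaled_subgradient_norm(w, grad_L, w1, grad_L_w1, y):
    # The scaled 2-norm (83), with w1 the initial point.
    l = len(y)
    denom = min((y == 1).sum(), (y == -1).sum()) * \
        np.linalg.norm(min_norm_subgradient(w1, grad_L_w1))
    return l / denom * np.linalg.norm(min_norm_subgradient(w, grad_L))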
In Figure 3, we investigate how the testing accuracy is improved along the training time.
The testing accuracy is the percentage of correct predictions on the testing set. Results show


Figure 3: The testing accuracy versus training time (log-scaled). Logistic regression without
using the bias term is solved. Curves of BMRM may not be shown because values are out
of the range. Panels: (a) a9a, (b) real-sim, (c) news20, (d) rcv1, (e) yahoo-japan, (f) yahoo-korea.

that BBR and CDN achieve the final accuracy more quickly. In addition, we are interested
in the progress of these methods in gaining sparsity. Figure 4 shows the number of
non-zero coefficients of w and the training time. For most data sets, BBR and CDN more
efficiently obtain a sparse solution.
Next, we compare methods solving (5) with the bias term. These solvers include CDN,
CGD-GS, IPM, Lassplore, and GLMNET. Although BBR can solve (5), we omit it in the
comparison as its performance is similar to that of solving (1). Following the comparison
for methods solving (1), we begin with checking the running time to reduce the relative
error of the function value and the scaled norm of the minimum-norm sub-gradient; see
(82) and (83), respectively. The results are given in Figures 5 and 6. The reference value f ∗
is obtained using IPM with a strict stopping condition. From Figure 5, CDN is the fastest,
IPM comes the second, and CGD-GS is the third. However, in Figure 6 for reducing the
gradient norm, IPM often surpasses CDN in the final stage.
GLMNET gives good performances in Figures 5(b)–5(d). In particular, it has fast local
convergence; see curves that are close to vertical in Figures 5(c) and 5(d). This fast local
convergence is due to using the quadratic approximation qk (d), which generates a Newton-
like direction. We find that the approximation of replacing D in (69) with a constant
diagonal matrix is effective for micro-array data used in Friedman et al. (2010). However,
for large document sets, using the exact Hessian is better. Unfortunately, GLMNET fails
to generate results for yahoo-japan and yahoo-korea after sufficient running time. We find
that setting an appropriate stopping tolerance for minimizing the quadratic approximation


Figure 4: The number of non-zero coefficients versus training time (log-scaled). Logistic
regression without using the bias term is solved. Curves of BMRM may not be shown
because values are out of the range. The solid horizontal line indicates the final number of
non-zero coefficients. Panels: (a) a9a, (b) real-sim, (c) news20, (d) rcv1, (e) yahoo-japan, (f) yahoo-korea.

in (71) is sometimes difficult. This issue might cause the problem in training yahoo-japan
and yahoo-korea. For IPM, the performance is in general competitive. Because IPM is a
type of Newton method, it has fast local convergence.
The result that CDN is faster than CGD-GS is worth further discussion. Tseng and
Yun (2009) show that for some least-square regression problems, an implementation similar
to CDN (i.e., a Gauss-Seidel rule for selecting working variables) is slower than CGD-GS,
which selects variables using the gradient information. This result is opposite to ours. Two
issues might cause the different results. First, problems are different. We aim at classifying
large and sparse document data, but they solve regression problems. Second, we implement
two techniques (permutation of sub-problems and shrinking) to improve the convergence.
Yun and Toh (2011) show that for some document data sets used here, their CGD-
GS is faster than IPM. However, our results indicate that IPM is better. This difference
is apparently because Yun and Toh (2011) ran an earlier version (0.8.1) of IPM. We
indicate in Section 8.2 that a minor problem causes this version to be two or
three times slower than the later version (0.8.2) used here.
Figure 7 indicates the testing accuracy versus the training time. We can see in Figures
7(e) and 7(f) that IPM’s accuracy may not improve as the training time increases. As we
mentioned in Section 8.2, this result is because we modify certain w elements to zero. In the
middle of the procedure, w is not close to an optimum yet, but many elements have satisfied
|∇j L(w)| ≤ 0.9999 and are trimmed to zero. Hence, the resulting accuracy may be worse


Figure 5: Relative difference between the objective function and the optimum value, versus
training time. Logistic regression with the bias term is solved. Both x-axis and y-axis are
log-scaled. Panels: (a) a9a, (b) real-sim, (c) news20, (d) rcv1, (e) yahoo-japan, (f) yahoo-korea.

Figure 6: The 2-norm of the minimum-norm sub-gradient (83) versus training time. Logistic
regression with the bias term is solved. Both x-axis and y-axis are log-scaled.
Panels: (a) a9a, (b) real-sim, (c) news20, (d) rcv1, (e) yahoo-japan, (f) yahoo-korea.


Figure 7: The testing accuracy versus training time (log-scaled). Logistic regression with
the bias term is solved. Panels: (a) a9a, (b) real-sim, (c) news20, (d) rcv1, (e) yahoo-japan, (f) yahoo-korea.

than that in the early stage of the procedure. In fact, due to IPM’s dense w throughout
iterations, to get the final sparsity and the testing accuracy, we need to accurately solve the
optimization problem.
Figure 8 presents the number of w’s non-zero coefficients. Similar to methods solving
(1), all methods here, except CGD-GS, have solutions with many non-zero coefficients in
the beginning and gradually gain the sparsity. In contrast, CGD-GS’s numbers of non-
zero coefficients in the whole optimization process are not much more than those of the
final solutions. We find that this nice property comes from using the gradient information for
selecting working variables. In contrast, without the gradient information, CDN wrongly
updates some elements of w to be non-zero in the early stage of the procedure.
In summary, for large document data, coordinate descent methods such as CDN perform
well in the early stage of the optimization procedure. As a result, they achieve the final
accuracy more quickly. However, for some applications, a correct sparsity pattern of the
model is important and a more accurate solution is sought. Then, GLMNET by combining
both Newton-type and coordinate descent approaches is useful.

8.4 Comparing Methods for L1-regularized L2-loss SVMs and Investigating CDN’s Shrinking Strategy
In Section 7, we have extended CDN and TRON to handle L2 loss, so methods included for
comparison are CDN, TRON, and BMRM. Note that we solve (77) without considering the
bias term.


Figure 8: The number of non-zero coefficients versus training time (log-scaled). Logistic
regression with the bias term is solved. The solid horizontal line indicates the final number
of non-zero coefficients. Panels: (a) a9a, (b) real-sim, (c) news20, (d) rcv1, (e) yahoo-japan, (f) yahoo-korea.

Following the experiment for logistic regression, we plot the relative difference to the
optimal function value in Figure 9 and the scaled norm of the minimum-norm sub-gradient
in Figure 10. The reference f ∗ is obtained by running TRON with a strict stopping condition.
One can see that CDN’s training time is the shortest among the three solvers and TRON is
better than BMRM. BMRM does not properly decrease the function value and the norm of
gradient on large data sets.
In Figures 9 and 10, we also present results of CDN without implementing the shrinking
strategy (denoted as CDN-NS). In most cases, the implementation with shrinking is only
slightly better. However, shrinking is effective if the sparsity is high (e.g., news20 and
yahoo-japan). In such a situation, most w components are zero. We can safely remove some
zero components and more efficiently solve smaller optimization problems.

9. Discussions and Conclusions

In Section 2.5, we briefly discuss optimization methods for L1-regularized least-square re-
gression. Some comparisons can be seen in, for example, the experiment section of Wright
et al. (2009) and Yun and Toh (2011). Note that an optimization method may perform dif-
ferently on classification and regression problems. For instance, Yun and Toh (2011) show
that CGD-GS is faster than CDN for regression problems, but here we have an opposite
observation for document classification.


Figure 9: Relative difference between the objective function and the optimum value, versus
training time. L2-loss SVM without using the bias term is solved. Both x-axis and y-axis
are log-scaled. Panels: (a) a9a, (b) real-sim, (c) news20, (d) rcv1, (e) yahoo-japan, (f) yahoo-korea.

Figure 10: The 2-norm of the minimum-norm sub-gradient (83) versus training time. L2-loss
SVM without using the bias term is solved. Both x-axis and y-axis are log-scaled.
Panels: (a) a9a, (b) real-sim, (c) news20, (d) rcv1, (e) yahoo-japan, (f) yahoo-korea.


Figure 11: A comparison of logistic regression solvers with regularization parameters 10C∗
and 0.1C∗. Both x-axis (relative difference between the objective function and the optimum
value) and y-axis (training time) are log-scaled. Panels: (a) rcv1 with 10C∗, (b) rcv1 with 0.1C∗.

Figures 1–8 indicate that CDN is faster for solving (1) than (5). Our past work (Huang
et al., 2010, Section 5) has indicated that the bias term may affect the running time of the
same optimization method. As the accuracy does not differ much, it seems that in general
the bias term should not be considered.
Among quasi Newton and Newton approaches, IPM and OWL-QN are faster than TRON.
This result seems to indicate that TRON suffers from the difficulty for finding w’s final non-
zero coefficients. However, IPM’s w elements are all non-zeros, so we must accurately solve
the optimization problem before trimming some elements to zero.
Figures 9 and 10 indicate that CDN/TRON for L2-loss SVMs is faster than CDN/TRON
for logistic regression. Each iteration of CDN/TRON for L2-loss SVMs is cheaper because
no exp/log operations are involved. Because the accuracy is similar, in general we prefer
L2-loss SVMs over logistic regression.
The choice of the regularization parameter C affects the performance of solvers. In
Figure 11, we present results on the rcv1 data set with regularization parameters 10C ∗ and
0.1C ∗ , respectively, where C ∗ = 4 is the best parameter obtained from cross validation; see
Table 2. Results show that all solvers take longer running time when C is large. Therefore,
one should avoid a value much larger than C ∗ by trying from a small C. Moreover, methods
using low-order (e.g., gradient only) information seem to be more sensitive to the change of
C. For example, with C ∗ and 0.1C ∗ we respectively observe in Figures 1(d) and 11(b) that
CDN is faster than OWL-QN, but with 10C ∗ , OWL-QN surpasses CDN in the final stage.
GLMNET iteratively considers quadratic approximations and applies coordinate descent
methods at each iteration. The discussion in Section 8.3 indicates that GLMNET’s speed
may be further improved if an adaptive stopping condition is properly designed for mini-
mizing each quadratic approximation.
It is challenging to efficiently solve L1-regularized linear classification problems. The 1-
norm term causes the non-differentiability of the objective function. In this paper, we review


many existing methods for logistic regression and L2-loss SVMs. We discuss some state-
of-the-art methods in detail. Our extensive comparison shows that carefully implemented
coordinate descent methods are effective for L1-regularized classification with large-scale
document data.

Acknowledgments

This work was supported in part by the National Science Council of Taiwan via the grant 98-
2221-E-002-136-MY3. The authors thank the associate editor, reviewers, Trevor Hastie, Robert
Tibshirani, and Jun Liu for valuable comments.

Appendix A. Existence of Optimal Solutions of (1)


Consider the following level set:

S ≡ {w | f (w) ≤ f (0)}.

We prove that S is compact (closed and bounded). Clearly, S is closed due to the continuity
of f (w). If S is not bounded, then there is a sequence {wk } ⊂ S such that kwk k1 → ∞.
Because we assume that ξ(w; xi , yi ) ≥ 0, f (wk ) ≥ kwk k1 . Then, f (wk ) → ∞ contradicts
f (wk ) ≤ f (0). Thus, S is a compact set. From Weierstrass’ Theorem, f (w) has at least
one minimum in S.

Appendix B. Solution of (26)


We consider the following general form with A > 0:

min_z |wj + z| + Bz + (1/2) Az². (84)
Clearly,

|wj + z| + Bz + (1/2) Az² = g1 (z) if z ≥ −wj , and g2 (z) if z ≤ −wj ,

where

g1 (z) ≡ wj + z + Bz + (1/2) Az² and g2 (z) ≡ −wj − z + Bz + (1/2) Az².
By checking if the minimum of the quadratic function occurs on the right or the left side
of −wj , we know
arg min_{z ≥ −wj} g1 (z) = −(B + 1)/A if B + 1 ≤ Awj , and −wj otherwise,

and

arg min_{z ≤ −wj} g2 (z) = −(B − 1)/A if B − 1 ≥ Awj , and −wj otherwise.


Because B + 1 ≤ Awj and B − 1 ≥ Awj cannot hold at the same time, and g1 (−wj ) =
g2 (−wj ), we can easily conclude that the solution of (84) is

−(B + 1)/A   if B + 1 ≤ Awj ,
−(B − 1)/A   if B − 1 ≥ Awj ,
−wj          otherwise.

Appendix C. Proof of Theorem 1


From the assumption that wk → w∗ and the way that w is cyclically updated, we have
wk,j → w∗ as well. The first result −1 < ∇j L(wk,j ) < 1 immediately follows from the
assumption −1 < ∇j L(w∗ ) < 1 and the continuity of ∇L(w).
We then focus on the second result to show that wjk,j = 0 after k is large enough. The
optimality condition (6) and the assumption −1 < ∇j L(w∗ ) < 1 imply that wj∗ = 0. From
the continuity of ∇2 L(w) and the compactness of the level set S proved in Appendix A, we
can define
M ≡ max{∇2jj L(w) | w ∈ S} ≥ 0. (85)
With the assumption −1 < ∇j L(w∗ ) < 1 and the property wjk,j → wj∗ = 0, for any
σ ∈ (0, 1), there exists an iteration index Kj such that for all k ≥ Kj ,
−1 + (M/(1 − σ)) |wjk,j | ≤ L′j (0; wk,j ) = ∇j L(wk,j ) ≤ 1 − (M/(1 − σ)) |wjk,j | (86)
and
{ w | kw − wk,j k ≤ kwk,j − w∗ k for some k ≥ Kj } ⊂ S.8 (87)

We prove that wjk,j = 0, ∀k ≥ Kj . Recall that in the decomposition method we use (27)
to obtain a direction d. From (86), we see that at k = Kj , the third case in (27) is taken.
That is, d = −wjk,j . If σ in (86) is chosen to be the constant used in the sufficient decrease
condition (28), we claim that d = −wjk,j satisfies (28) with λ = 1: If wjk,j ≥ 0,
 
gj (−wjk,j ) − gj (0) − σ( −wjk,j − L′j (0; wk,j ) wjk,j )
= (1 − σ)( −wjk,j − L′j (0; wk,j ) wjk,j ) + (1/2) L″j (z̃; wk,j )(wjk,j )² (88)
≤ (1 − σ)( −wjk,j − L′j (0; wk,j ) wjk,j ) + (1/2) M (wjk,j )² (89)
≤ −(1/2) M (wjk,j )² ≤ 0, (90)
where z̃ is between 0 and −wjk,j , Equation (88) follows from (20)–(21), Equation (89) is
from (87) and (85),9 and Equation (90) is from the first inequality of (86). The above result
8. We need f (w∗ ) < f (0). This property generally holds. If not, we can slightly enlarge S so that (87) and
all subsequent derivations still follow.
9. More precisely,
kwk,j + z̃ej − wk,j k ≤ |wjk,j | = |wjk,j − wj∗ | ≤ kwk,j − w∗ k.


is precisely the sufficient decrease condition (28) with λ = 1. The situation for wjk,j < 0
is similar. By taking the step d = −wjk,j , we have wjk+1,j = 0. At the (k + 1)st iteration,
wjk+1,j = 0 and (86) imply that the optimality condition (18) for the sub-problem has been
satisfied. Thus, the jth element of w remains zero in all subsequent iterations.

Appendix D. Convergence of CDN: Logistic Regression


If each time one variable is used, the sub-problem considered in Tseng and Yun (2009) is
min_z |wj + z| − |wj | + L′j (0) z + (1/2) Hz². (91)
Clearly, (26) is a special case of (91) by taking H = L″j (0). Therefore, we can apply results
proved in Tseng and Yun (2009).
To have the finite termination of the line search procedure, Tseng and Yun (2009, Lemma
5) require that there exists A > 0 such that

k∇L(w1 ) − ∇L(w2 )k ≤ Akw1 − w2 k, ∀w1 , w2 ∈ Rn (92)

and
L″j (0; w) > 0, (93)
where w is the current solution used to generate the direction d in (27). Equation (92)
means that ∇L(·) is globally Lipschitz continuous. For logistic regression, we have

k∇L(w1 ) − ∇L(w2 )k ≤ k∇2 L(w̃)kkw1 − w2 k,

where w̃ is between w1 and w2 . Moreover,

k∇2 L(w̃)k = CkX T D(w̃)Xk ≤ CkX T kkXk, (94)

where D(w̃) ∈ Rl×l is a diagonal matrix similar to (70). Because any diagonal element of
D(w̃) is less than one, we have the inequality in (94). Thus, we can use CkX T kkXk as the
constant A in (92).
For (93), from (29), L″j (0; w) in the level set S is lower bounded by a positive value.
The only exception is that L″j (0; w) = 0 when xij = 0, ∀i = 1, . . . , l. In this situation,
L′j (0; w) = 0 and the optimal wj∗ = 0. The one-variable function gj (z) is reduced to

gj (z) = |wj + z| − |wj |,

so d = −wj from (27) and d satisfies the sufficient decrease condition (28). Therefore, wj
becomes zero after the first iteration and is not changed after. It is like that the jth feature
has been removed in the beginning.
Next, we discuss the asymptotic convergence of function values. Tseng and Yun (2009)
impose certain conditions on choosing the working set; see Equation (12) in their paper. If
one variable is updated at a time, their condition is reduced to that between wk and wk+1 ,
one must go through n sub-problems covering all w1 , . . . , wn . This setting is exactly what
we do, regardless of using a sequential order of 1, . . . , n or a permutation π(1), . . . , π(n).


Tseng and Yun (2009) need an additional assumption for the asymptotic convergence
(see Assumption 1 in their paper):
0 < λmin ≤ H k,j ≤ λmax , ∀j = 1, . . . , n, k = 1, . . . , (95)
where λmin and λmax are positive constants, and H k,j = ∇2jj L(wk,j ) is the coefficient H
in (91) at wk,j . From (29) and the boundedness of the level set S (see Appendix A), this
assumption holds except that for some j, xij = 0, ∀i. For such j, ∇j L(w) = 0, ∀w. Hence,
wj becomes zero by (27) at the first iteration and is not changed subsequently. Because
wj = 0 is optimal in this situation, it is as if the jth variable had been removed from the
optimization. Therefore, (95) holds without problems.
Following Tseng and Yun (2009, Theorem 1(e)), any limit point of {wk } is an optimum
of (1) with logistic loss.

Appendix E. Convergence of TRON: Logistic Regression


Consider any limit point w̄ generated by {wk }. Lin and Moré (1999) proved that if
∇2 f¯(w̄)J,J is positive definite, (96)
where
J ≡ {j | ∇j f¯(w̄) = 0},
then the following two results hold:
1. {wk } globally converges to w̄.
2. {wk } quadratically converges. That is, there exist µ ∈ (0, 1) and a positive integer K
such that
kwk+1 − w̄k ≤ (1 − µ)kwk − w̄k2 , ∀k > K.
The remaining question is whether (96) holds. From (58), we have
∇2 f¯(w̄) = C [X −X]T D [X −X], (97)
where D is defined in (57). ∇2 f¯(w̄) cannot be positive definite if there exists 1 ≤ j ≤ n
such that {j, j + n} ⊂ J. The reason is that one can easily find a vector v ∈ R2n such that
v T ∇2 f¯(w̄)v = 0. For example, let vj = vj+n 6= 0 and other elements be zero. However, we
claim that {j, j + n} ⊂ J does not happen. Otherwise,
∇j f¯(w̄) = 1 + ∇j L(w̄1:n − w̄n+1:2n ) and
∇j+n f¯(w̄) = 1 − ∇j L(w̄1:n − w̄n+1:2n )
are both zero, but this situation is not possible.
From the optimality condition (6), J is the same as the following set:
{j | w̄j > 0 or w̄j = ∇j f¯(w̄) = 0}.
Therefore, if the solution is sparse and those w̄j = 0 or w̄j+n = 0, 1 ≤ j ≤ n satisfy
∇j f¯(w̄) > 0 if w̄j = 0, and ∇j+n f¯(w̄) > 0 if w̄j+n = 0,


(i.e., the optimization problem is non-degenerate), then |J| is small. From (97), ∇2 f¯(w̄)J,J
tends to be positive definite if |J| is small. Then, we can have the quadratic convergence.

Appendix F. Convergence of CDN: L2-loss SVM


To have the finite termination of the line search procedure, we need conditions similar to
(92) and (93). The condition (92) means that ∇L(w) is globally Lipschitz continuous. L2-
loss SVM satisfies this property (Mangasarian, 2002, Section 3). For (93), the setting in
(81) ensures that H in (91) is always positive. Therefore, the line search procedure stops
in a finite number of steps.
For the asymptotic convergence, we need again the condition (95). Our setting in (81)
meets this assumption, so any limit point of {wk } is an optimal solution of (77).

References
Galen Andrew and Jianfeng Gao. Scalable training of L1-regularized log-linear models.
In Proceedings of the Twenty Fourth International Conference on Machine Learning
(ICML), 2007.

Suhrid Balakrishnan and David Madigan. Algorithms for sparse linear classifiers in the
massive data setting. 2005. URL http://www.stat.rutgers.edu/~madigan/PAPERS/
sm.pdf.

Steven Benson and Jorge J. Moré. A limited memory variable metric method for bound
constrained minimization. Preprint MCS-P909-0901, Mathematics and Computer Science
Division, Argonne National Laboratory, Argonne, Illinois, 2001.

Paul S. Bradley and Olvi L. Mangasarian. Feature selection via concave minimization and
support vector machines. In Proceedings of the Twenty Fifth International Conference
on Machine Learning (ICML), 1998.

Richard H. Byrd, Peihung Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm
for bound constrained optimization. SIAM Journal on Scientific Computing, 16:1190–
1208, 1995.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines.
ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. Coordinate descent method for large-
scale L2-loss linear SVM. Journal of Machine Learning Research, 9:1369–1398, 2008.
URL http://www.csie.ntu.edu.tw/~cjlin/papers/cdl2.pdf.

David L. Donoho and Yaakov Tsaig. Fast solution of l1 minimization problems when the
solution may be sparse. IEEE Transactions on Information Theory, 54:4789–4812, 2008.

John Duchi and Yoram Singer. Boosting with structural sparsity. In Proceedings of the
Twenty Sixth International Conference on Machine Learning (ICML), 2009.


John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections
onto the L1-ball for learning in high dimensions. In Proceedings of the Twenty Fifth
International Conference on Machine Learning (ICML), 2008.

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression.
Annals of Statistics, 32:407–499, 2004.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLIN-
EAR: A library for large linear classification. Journal of Machine Learning Research, 9:
1871–1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf.

Mário A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 25:1150–1159, 2003.

Mário A. T. Figueiredo, Robert Nowak, and Stephen Wright. Gradient projection for sparse
reconstruction: Applications to compressed sensing and other inverse problems. IEEE
Journal on Selected Topics in Signal Processing, 1:586–598, 2007.

Jerome H. Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise
coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.

Jerome H. Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for gen-
eralized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22,
2010.

Wenjiang J. Fu. Penalized regressions: The bridge versus the lasso. Journal of Computa-
tional and Graphical Statistics, 7(3):397–416, 1998.

Glenn M. Fung and Olvi L. Mangasarian. A feature selection Newton method for support
vector machine classification. Computational optimization and applications, 28:185–202,
2004.

Eli M. Gafni and Dimitri P. Bertsekas. Two-metric projection methods for constrained
optimization. SIAM Journal on Control and Optimization, 22:936–964, 1984.

Alexandar Genkin, David D. Lewis, and David Madigan. Large-scale Bayesian logistic
regression for text categorization. Technometrics, 49(3):291–304, 2007.

Joshua Goodman. Exponential priors for maximum entropy models. In Proceedings of the
Annual Meeting of the Association for Computational Linguistics, 2004.

Elaine T. Hale, Wotao Yin, and Yin Zhang. Fixed-point continuation for l1-minimization:
Methodology and convergence. SIAM Journal on Optimization, 19(3):1107–1130, 2008.

Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and Sellamanickam Sun-
dararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings
of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.


Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin. Iterative scaling
and coordinate descent methods for maximum entropy. Journal of Machine Learning Re-
search, 11:815–848, 2010. URL http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_
journal.pdf.

Thorsten Joachims. Making large-scale SVM learning practical. In Bernhard Schölkopf,
Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods
– Support Vector Learning, pages 169–184, Cambridge, MA, 1998. MIT Press.

Jun’ichi Kazama and Jun’ichi Tsujii. Evaluation and extension of maximum entropy models
with inequality constraints. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing, pages 137–144, 2003.

S. Sathiya Keerthi and Shrish Shevade. A fast tracking algorithm for generalized
LARS/LASSO. IEEE Transactions on Neural Networks, 18(6):1826–1830, 2007.

Jinseog Kim, Yuwon Kim, and Yongdai Kim. A gradient-based optimization algorithm for
LASSO. Journal of Computational and Graphical Statistics, 17(4):994–1009, 2008.

Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitry Gorinevsky.
An interior point method for large-scale l1-regularized least squares. IEEE Journal on
Selected Topics in Signal Processing, 1:606–617, 2007.

Yongdai Kim and Jinseog Kim. Gradient LASSO for feature selection. In Proceedings of
the 21st International Conference on Machine Learning (ICML), 2004.

Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent
for linear predictors. Information and Computation, 132:1–63, 1997.

Kwangmoo Koh, Seung-Jean Kim, and Stephen Boyd. An interior-point method for large-
scale l1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–
1555, 2007. URL http://www.stanford.edu/~boyd/l1_logistic_reg.html.

Balaji Krishnapuram, Alexander J. Hartemink, Lawrence Carin, and Mário A. T.
Figueiredo. A Bayesian approach to joint feature selection and classifier design. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 26(6):1105–1111, 2004.

Balaji Krishnapuram, Lawrence Carin, Mário A. T. Figueiredo, and Alexander J.
Hartemink. Sparse multinomial logistic regression: fast algorithms and generalization
bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–
968, 2005.

Jussi Kujala, Timo Aho, and Tapio Elomaa. A walk from 2-norm SVM to 1-norm
SVM. Preprint available from http://www.cs.tut.fi/~kujala2/papers/1norm2norm.
pdf, 2009.

John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient.
Journal of Machine Learning Research, 10:771–801, 2009.

Cheng-Yu Lee. A comparison of optimization methods for large-scale l1-regularized logistic
regression. Master’s thesis, Department of Computer Science and Information Engineer-
ing, National Taiwan University, 2008.

Su-In Lee, Honglak Lee, Pieter Abbeel, and Andrew Y. Ng. Efficient l1 regularized lo-
gistic regression. In Proceedings of the Twenty-first National Conference on Artificial
Intelligence (AAAI-06), pages 1–9, Boston, MA, USA, July 2006.

Chih-Jen Lin and Jorge J. Moré. Newton’s method for large-scale bound constrained prob-
lems. SIAM Journal on Optimization, 9:1100–1127, 1999.

Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. Trust region Newton method for
large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, 2008.
URL http://www.csie.ntu.edu.tw/~cjlin/papers/logistic.pdf.

Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale
optimization. Mathematical Programming, 45(1):503–528, 1989.

Jun Liu and Jieping Ye. Efficient Euclidean projections in linear time. In Proceedings of the
Twenty Sixth International Conference on Machine Learning (ICML), 2009.

Jun Liu, Jianhui Chen, and Jieping Ye. Large-scale sparse logistic regression. In Proceedings
of The 15th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pages 547–556, 2009.

Olvi L. Mangasarian. A finite Newton method for classification. Optimization Methods and
Software, 17(5):913–929, 2002.

Olvi L. Mangasarian. Exact 1-norm support vector machines via unconstrained convex
differentiable minimization. Journal of Machine Learning Research, 7:1517–1530, 2006.

Yurii E. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer
Academic Publishers, 2003.

Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. A new approach to variable
selection in least squares problems. IMA Journal of Numerical Analysis, 20:389–404,
2000a.

Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. On the LASSO and its dual.
Journal of Computational and Graphical Statistics, 9(2):319–337, 2000b.

Mee Young Park and Trevor Hastie. L1 regularization path algorithm for generalized linear
models. Journal of the Royal Statistical Society Series B, 69:659–677, 2007.

Simon Perkins, Kevin Lacker, and James Theiler. Grafting: Fast, incremental feature
selection by gradient descent in function space. Journal of Machine Learning Research,
3:1333–1356, 2003.

Saharon Rosset. Following curved regularized optimization solution paths. In Lawrence K.
Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing
Systems 17, pages 1153–1160, Cambridge, MA, 2005. MIT Press.

Volker Roth. The generalized LASSO. IEEE Transactions on Neural Networks, 15(1):
16–28, 2004.

Sylvain Sardy, Andrew G. Bruce, and Paul Tseng. Block coordinate relaxation methods for
nonparametric signal denoising with wavelet dictionaries. Journal of Computational and
Graphical Statistics, 9:361–379, 2000.

Mark Schmidt, Glenn Fung, and Romer Rosales. Fast optimization methods for l1 regu-
larization: A comparative study and two new approaches. In Proceedings of European
Conference on Machine Learning, pages 286–297, 2007.

Mark Schmidt, Glenn Fung, and Romer Rosales. Optimization methods for l1-
regularization. Technical Report TR-2009-19, University of British Columbia, 2009.

Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1 regularized loss mini-
mization. In Proceedings of the Twenty Sixth International Conference on Machine Learn-
ing (ICML), 2009.

Shirish Krishnaj Shevade and S. Sathiya Keerthi. A simple and efficient algorithm for gene
selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.

Jianing Shi, Wotao Yin, Stanley Osher, and Paul Sajda. A fast hybrid algorithm for large
scale l1-regularized logistic regression. Journal of Machine Learning Research, 11:713–
741, 2010.

Choon Hui Teo, S.V.N. Vishwanathan, Alex Smola, and Quoc V. Le. Bundle methods for
regularized risk minimization. Journal of Machine Learning Research, 11:311–365, 2010.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society Series B, 58:267–288, 1996.

Ryota Tomioka and Masashi Sugiyama. Dual augmented Lagrangian method for efficient
sparse reconstruction. IEEE Signal Processing Letters, 16(12):1067–1070, 2009.

Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth
separable minimization. Mathematical Programming, 117:387–423, 2009.

Stephen J. Wright, Robert D. Nowak, and Mário A. T. Figueiredo. Sparse reconstruction by
separable approximation. IEEE Transactions on Signal Processing, 57:2479–2493, 2009.

Tong Tong Wu and Kenneth Lange. Coordinate descent algorithms for lasso penalized
regression. The Annals of Applied Statistics, 2(1):224–244, 2008.

Jin Yu, S.V.N. Vishwanathan, Simon Gunter, and Nicol N. Schraudolph. A quasi-Newton
approach to nonsmooth convex optimization problems in machine learning. Journal of
Machine Learning Research, 11:1–57, 2010.

Sangwoon Yun and Kim-Chuan Toh. A coordinate gradient descent method for l1-
regularized convex minimization. Computational Optimization and Applications, 48(2):
273–307, 2011.

Tong Zhang and Frank J. Oles. Text categorization based on regularized linear classification
methods. Information Retrieval, 4(1):5–31, 2001.

Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine
Learning Research, 7:2541–2563, 2006.

Peng Zhao and Bin Yu. Stagewise lasso. Journal of Machine Learning Research, 8:2701–
2726, 2007.

Ji Zhu, Saharon Rosset, Trevor Hastie, and Robert Tibshirani. 1-norm support vector ma-
chines. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances
in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
