18.657: Mathematics of Machine Learning
• Consider a differentiable, convex function f on the Euclidean ball B_{2,n} such that ‖∇f(x)‖∞ ≤ 1 for all x ∈ B_{2,n}. Since ‖v‖₂ ≤ √n ‖v‖∞ for any v ∈ Rⁿ, this implies that ‖∇f(x)‖₂ ≤ √n, and projected gradient descent converges to the minimum of f in B_{2,n} at rate √(n/k). Using the method of mirror descent, we can get a convergence rate of √(log(n)/k).
To get better rates of convergence in the optimization problem, we can use the Mirror Descent algorithm. The idea is to replace the Euclidean geometry by a geometry more pertinent to the problem at hand. We will define this new geometry through a function Φ(x), sometimes called a potential function, and we will use the Bregman projection, based on the Bregman divergence, to define the associated projection.
The geometric intuition behind the Mirror Descent algorithm is the following. The projected gradient descent described in the previous lecture works in any arbitrary Hilbert space H, so that the norm of vectors is associated with an inner product. Now, suppose we are interested in optimization in a Banach space D; in other words, the norm (or the measure of distance) that we use does not derive from an inner product. In this case, gradient descent does not even make sense, since the gradients ∇f(x) are elements of the dual space, and the update x − η∇f(x) cannot be performed. (Note that in the Hilbert space setting used for projected gradient descent, the dual space of H is isometric to H, so we did not have any such problem.)
The geometric insight of the Mirror Descent algorithm is that to perform the optimization in the primal space D, one can first map the point x ∈ D to the dual space D*, then perform the gradient update in the dual space, and finally map the updated point back to the primal space. Note that at each update step, the new point in the primal space D might be outside of the constraint set C ⊂ D, in which case it should be projected onto the constraint set C. The projection associated with the Mirror Descent algorithm is the Bregman projection, defined based on the notion of Bregman divergence.
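As a concrete illustration (an addition, not part of the original notes), here is a minimal Python sketch of this primal/dual loop; grad_Phi, grad_Phi_inv, and bregman_proj are hypothetical placeholders standing in for the mirror map machinery defined below.

```python
import numpy as np

def mirror_descent_step(x, g, eta, grad_Phi, grad_Phi_inv, bregman_proj):
    """One Mirror Descent update, following the primal/dual picture above."""
    theta = grad_Phi(x)        # map x from the primal space D to the dual space D*
    theta = theta - eta * g    # gradient update performed in the dual space
    y = grad_Phi_inv(theta)    # map the updated point back to the primal space
    return bregman_proj(y)     # Bregman projection onto the constraint set C

# Euclidean instance: Phi(x) = ||x||^2 / 2, so grad Phi and its inverse are the
# identity map, and the Bregman projection is the usual Euclidean projection
# (here onto the unit ball) -- recovering projected gradient descent.
proj_ball = lambda v: v / max(1.0, np.linalg.norm(v))
x_next = mirror_descent_step(
    np.array([0.6, 0.8]), np.array([1.0, -1.0]), 0.1,
    grad_Phi=lambda v: v, grad_Phi_inv=lambda v: v, bregman_proj=proj_ball,
)
print(x_next)  # same as projecting x - eta * g onto the ball
```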
We will use a convex open set D ⊂ Rⁿ whose closure contains the constraint set C ⊂ D̄. The Bregman divergence is the error term of the first-order Taylor expansion of the function Φ in D:

D_Φ(y, x) = Φ(y) − Φ(x) − ∇Φ(x)⊤(y − x).
Also, note that the function Φ(x) is said to be α-strongly convex w.r.t. a norm ‖·‖ if

Φ(y) − Φ(x) − ∇Φ(x)⊤(y − x) ≥ (α/2)‖y − x‖².
We used the following property of the Euclidean norm:

2a⊤b = ‖a‖² + ‖b‖² − ‖a − b‖²

in the proof of convergence of projected gradient descent, where we chose a = x_s − y_{s+1} and b = x_s − x*.
To prove the convergence of the Mirror Descent algorithm, we use the following property of the Bregman divergence in a similar fashion. This proposition shows that the Bregman divergence essentially behaves like the squared Euclidean norm:

Proposition 1: For all x, y, z ∈ D,

(∇Φ(x) − ∇Φ(y))⊤(x − z) = D_Φ(x, y) + D_Φ(z, x) − D_Φ(z, y).
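As a quick numerical sanity check (again an addition, not from the notes), the identity can be verified for, say, the negative-entropy potential that appears in the examples below:

```python
import numpy as np

# Verify Proposition 1 numerically for Phi(x) = sum_i x_i log(x_i) on positive vectors.
Phi = lambda v: np.sum(v * np.log(v))
grad_Phi = lambda v: 1.0 + np.log(v)
D = lambda yy, xx: Phi(yy) - Phi(xx) - grad_Phi(xx) @ (yy - xx)  # Bregman divergence

rng = np.random.default_rng(0)
x, y, z = rng.uniform(0.1, 1.0, size=(3, 4))  # three random positive vectors
lhs = (grad_Phi(x) - grad_Phi(y)) @ (x - z)
rhs = D(x, y) + D(z, x) - D(z, y)
print(np.isclose(lhs, rhs))  # True
```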
As described previously, the Bregman divergence is used in each step of the Mirror Descent algorithm to project the updated value onto the constraint set:

Π^Φ_C(x) = argmin_{z ∈ C∩D} D_Φ(z, x).

Moreover, this projection behaves like the Euclidean projection: writing π(y) = Π^Φ_C(y), for all z ∈ C ∩ D we have

D_Φ(z, π(y)) ≤ D_Φ(z, y).
Proof. Define π = Π^Φ_C(y) and h(t) = D_Φ(π + t(z − π), y). Since h(t) is minimized at t = 0 over [0, 1] (due to the definition of the projection), we have

0 ≤ h′(0) = (∇Φ(π) − ∇Φ(y))⊤(z − π).

Thus,

(∇Φ(π) − ∇Φ(y))⊤(π − z) ≤ 0.

Using Proposition 1 (with x = π), we know that

D_Φ(π, y) + D_Φ(z, π) − D_Φ(z, y) ≤ 0,

and since D_Φ(π, y) ≥ 0, this gives D_Φ(z, π(y)) ≤ D_Φ(z, y).
Theorem: Assume that f is convex and L-Lipschitz w.r.t. ‖·‖. Assume that Φ is α-strongly convex on C ∩ D w.r.t. ‖·‖, let R² = sup_{x∈C∩D} Φ(x) − min_{x∈C∩D} Φ(x), and take x₁ = argmin_{x∈C∩D} Φ(x) (assume that it exists). Then, Mirror Descent with η = (R/L)√(2α/k) gives

f(x̄) − f(x*) ≤ RL√(2/(αk))   and   f(x°) − f(x*) ≤ RL√(2/(αk)),

where x̄ = (1/k) Σ_{s=1}^k x_s is the averaged iterate and x° ∈ argmin_{1≤s≤k} f(x_s) is the best iterate.
Proof. Take x♯ ∈ C ∩ D. Similarly to the proof of projected gradient descent, we have:

f(x_s) − f(x♯) ≤ g_s⊤(x_s − x♯)                                                    (i)
              = (1/η) (ζ(x_s) − ζ(y_{s+1}))⊤(x_s − x♯)                             (ii)
              = (1/η) (∇Φ(x_s) − ∇Φ(y_{s+1}))⊤(x_s − x♯)                           (iii)
              = (1/η) [D_Φ(x_s, y_{s+1}) + D_Φ(x♯, x_s) − D_Φ(x♯, y_{s+1})]        (iv)
              ≤ (1/η) [D_Φ(x_s, y_{s+1}) + D_Φ(x♯, x_s) − D_Φ(x♯, x_{s+1})]        (v)
              ≤ ηL²/(2α) + (1/η) [D_Φ(x♯, x_s) − D_Φ(x♯, x_{s+1})],                (vi)
where (i) is due to convexity of the function f. Equations (ii) and (iii) are direct consequences of the Mirror Descent update (ζ = ∇Φ and ∇Φ(x_s) − ∇Φ(y_{s+1}) = ηg_s). Equation (iv) is the result of applying Proposition 1. Inequality (v) is a result of the fact that x_{s+1} = Π^Φ_C(y_{s+1}); thus, for x♯ ∈ C ∩ D, we have D_Φ(x♯, y_{s+1}) ≥ D_Φ(x♯, x_{s+1}).
We justify the following derivation to prove inequality (vi):

D_Φ(x_s, y_{s+1}) = Φ(x_s) − Φ(y_{s+1}) − ∇Φ(y_{s+1})⊤(x_s − y_{s+1})                   (a)
                  ≤ [∇Φ(x_s) − ∇Φ(y_{s+1})]⊤(x_s − y_{s+1}) − (α/2)‖y_{s+1} − x_s‖²      (b)
                  ≤ η‖g_s‖* ‖x_s − y_{s+1}‖ − (α/2)‖y_{s+1} − x_s‖²                      (c)
                  ≤ η²L²/(2α).                                                           (d)
Equation (a) is the definition of the Bregman divergence. To show inequality (b), we use the fact that Φ is α-strongly convex, which implies that Φ(y_{s+1}) − Φ(x_s) ≥ ∇Φ(x_s)⊤(y_{s+1} − x_s) + (α/2)‖y_{s+1} − x_s‖². According to the Mirror Descent algorithm, ∇Φ(x_s) − ∇Φ(y_{s+1}) = ηg_s; we then use Hölder's inequality to show that g_s⊤(x_s − y_{s+1}) ≤ ‖g_s‖* ‖x_s − y_{s+1}‖ and derive inequality (c). Finally, looking at the quadratic term ax − bx² for a, b > 0, it is not hard to show that max_x (ax − bx²) = a²/(4b), attained at x = a/(2b). We use this statement with x = ‖y_{s+1} − x_s‖, a = η‖g_s‖* ≤ ηL and b = α/2 to derive inequality (d).
Again, we use a telescoping sum to get

(1/k) Σ_{s=1}^k [f(x_s) − f(x♯)] ≤ ηL²/(2α) + D_Φ(x♯, x₁)/(kη),        (2.1)

where we used the fact that x₁ ∈ argmin_{C∩D} Φ(x) in the description of the Mirror Descent algorithm to prove ∇Φ(x₁)⊤(x♯ − x₁) ≥ 0, which yields D_Φ(x♯, x₁) ≤ Φ(x♯) − Φ(x₁) ≤ R². We optimize the right-hand side of equation (2.1) over η (the minimum of ηL²/(2α) + R²/(kη) is attained at η = (R/L)√(2α/k)) to get

(1/k) Σ_{s=1}^k [f(x_s) − f(x♯)] ≤ RL√(2/(αk)).

The claimed bounds follow, since f(x̄) ≤ (1/k) Σ_s f(x_s) by convexity (Jensen) and f(x°) ≤ (1/k) Σ_s f(x_s) by definition of the best iterate.
Note that with the right geometry, we can recover projected gradient descent as an instance of the Mirror Descent algorithm.
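For instance, instantiating the theorem in the Euclidean setup below (Φ(x) = ½‖x‖², α = 1) gives

f(x̄) − f(x*) ≤ RL√(2/k),

which is, up to constants, the projected gradient descent rate from the previous lecture.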
2.4.3 Remarks
The Mirror Descent algorithm is sometimes called Mirror Prox. We can write x_{s+1} as the Bregman projection x_{s+1} = argmin_{x∈C∩D} D_Φ(x, y_{s+1}). Thus, we have

x_{s+1} = argmin_{x ∈ C∩D} [η g_s⊤x + D_Φ(x, x_s)].
To get x_{s+1}, the first term on the right-hand side looks at the linear approximation of f close to x_s, in the direction determined by the subgradient g_s. If the function were linear, we would just minimize the linear approximation term. But if the function is not linear, the linear approximation is only valid in a small neighborhood around x_s; thus, we penalize moving away from x_s by adding the term D_Φ(x, x_s). We penalize by the squared norm when we choose D_Φ(x, x_s) = ½‖x − x_s‖², in which case we get back the projected gradient descent algorithm as an instance of the Mirror Descent algorithm.
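To make this concrete, in the unconstrained Euclidean case (C = R^d, D_Φ(x, x_s) = ½‖x − x_s‖²) the minimizer is obtained by setting the gradient to zero:

∇_x [η g_s⊤x + ½‖x − x_s‖²] = η g_s + x − x_s = 0   ⟹   x_{s+1} = x_s − η g_s,

which is exactly the gradient descent update.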
But if we choose a different divergence D_Φ(x, x_s), we change the geometry, and we can penalize differently in different directions depending on that geometry.
Thus, using the Mirror Descent algorithm, we can replace the 2-norm in the projected gradient descent algorithm by another norm, hoping to get a less constraining Lipschitz constant L. On the other hand, the choice of norm also affects the strong convexity parameter α of the potential, which may deteriorate. Thus, there is a trade-off in the improvement of the rate of convergence.
2.4.4 Examples
Euclidean Setup:
Φ(x) = ½‖x‖², D = R^d, ∇Φ(x) = ζ(x) = x. Thus, the updates will be the same as in gradient descent.

D_Φ(y, x) = ½‖y‖² − ½‖x‖² − x⊤y + ‖x‖²
          = ½‖x − y‖².

Thus, the Bregman projection with this potential function Φ(x) is the same as the usual Euclidean projection, and the Mirror Descent algorithm is exactly the same as the projected gradient descent algorithm, since it has the same update and the same projection operator. Note that α = 1, since D_Φ(y, x) ≥ ½‖x − y‖².
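A short numerical check of this computation (an illustrative addition):

```python
import numpy as np

# Check that D_Phi(y, x) = ||x - y||^2 / 2 for Phi(x) = ||x||^2 / 2.
Phi = lambda v: 0.5 * v @ v
rng = np.random.default_rng(1)
x, y = rng.standard_normal((2, 5))
D = Phi(y) - Phi(x) - x @ (y - x)  # Bregman divergence, using grad Phi(x) = x
print(np.isclose(D, 0.5 * np.linalg.norm(x - y) ** 2))  # True
```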
ℓ1 Setup:
We look at D = R^d_+ \ {0}.
Define Φ(x) to be the negative entropy, so that:

Φ(x) = Σ_{i=1}^d x_i log(x_i),    ζ(x) = ∇Φ(x) = (1 + log(x_i))_{i=1}^d.

Thus, looking at the update ∇Φ(y^{(s+1)}) = ∇Φ(x^{(s)}) − ηg_s, we get log(y_i^{(s+1)}) = log(x_i^{(s)}) − ηg_i^{(s)}, so that y_i^{(s+1)} = x_i^{(s)} exp(−ηg_i^{(s)}) for all i = 1, …, d. Thus, coordinate-wise,

y_{s+1} = x_s exp(−ηg_s),    x_{s+1} = y_{s+1}/|y_{s+1}|₁,

where the Bregman projection onto the simplex amounts to a simple normalization. We call this setup Exponentiated Gradient Descent, or Mirror Descent with multiplicative weights.

To analyze the rate of convergence, we want to study the ℓ1 norm on ∆_d. Thus, we have to show that, for some α, Φ is α-strongly convex w.r.t. |·|₁ on ∆_d. The Bregman divergence of this mirror map is given by

D_Φ(y, x) = KL(y, x) + Σ_i (x_i − y_i)
          = KL(y, x)
          ≥ ½ |x − y|₁²,

where we used the fact that x, y ∈ ∆_d to show Σ_i (x_i − y_i) = 0, and Pinsker's inequality to get the last bound. Thus, Φ is 1-strongly convex w.r.t. |·|₁ on ∆_d.
Remembering that Φ(x) = Σ_{i=1}^d x_i log(x_i) was defined to be the negative entropy, we know that −log(d) ≤ Φ(x) ≤ 0 for x ∈ ∆_d. Thus, R² = log(d). Assume now that f is L-Lipschitz w.r.t. |·|₁, i.e.,

‖g‖∞ ≤ L,  ∀g ∈ ∂f(x), ∀x ∈ ∆_d.

Then, Mirror Descent with η = (1/L)√(2 log(d)/k) gives

f(x̄_k) − f(x*) ≤ L√(2 log(d)/k)   and   f(x°_k) − f(x*) ≤ L√(2 log(d)/k).
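As an illustration (the objective below is an arbitrary toy choice, not from the notes), here is a sketch of the multiplicative-weights updates with the step size prescribed above:

```python
import numpy as np

# Mirror Descent with multiplicative weights on the simplex, using
# eta = (1/L) * sqrt(2 log(d) / k) as in the bound above.
d, k = 10, 2000
a = np.linspace(1.0, 2.0, d)
f = lambda x: 0.5 * np.sum(a * x**2)   # toy convex objective on Delta_d
grad = lambda x: a * x
L = 2.0                                # |grad f(x)|_inf <= max_i a_i <= 2 on Delta_d
eta = (1.0 / L) * np.sqrt(2.0 * np.log(d) / k)

x = np.ones(d) / d                     # x_1 = argmin of negative entropy on Delta_d
x_bar = np.zeros(d)
for _ in range(k):
    x_bar += x / k
    y = x * np.exp(-eta * grad(x))     # y_{s+1} = x_s exp(-eta g_s)
    x = y / y.sum()                    # Bregman (KL) projection = normalization
print(f(x_bar))                        # approaches the simplex minimum (about 0.07)
```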
Note that the dependence on the dimension in this rate is only logarithmic, whereas it would be polynomial with projected gradient descent. Thus, using Mirror Descent would be most beneficial.