18.657: Mathematics of Machine Learning
• Consider a differentiable, convex function f on the Euclidean ball B_{2,n} such that ‖∇f(x)‖∞ ≤ 1 for all x ∈ B_{2,n}. Since ‖v‖₂ ≤ √n ‖v‖∞ for any v ∈ Rⁿ, this implies that ‖∇f(x)‖₂ ≤ √n, and projected gradient descent converges to the minimum of f in B_{2,n} at rate √(n/k). Using the method of mirror descent, we can get a convergence rate of √(log(n)/k).
To get better rates of convergence in the optimization problem, we can use the Mirror Descent algorithm. The idea is to replace the Euclidean geometry by a geometry more pertinent to the problem at hand. We will define this new geometry through a function Φ(x), sometimes called a potential function, and we will use the Bregman projection, based on the Bregman divergence, to define the associated projection.
The geometric intuition behind the Mirror Descent algorithm is the following. The projected gradient descent described in the previous lecture works in any arbitrary Hilbert space H, so that the norm of vectors is associated with an inner product. Now, suppose we are interested in optimization in a Banach space D; in other words, the norm (or the measure of distance) that we use does not derive from an inner product. In this case, gradient descent does not even make sense, since the gradients ∇f(x) are elements of the dual space, and the update x − η∇f(x) cannot be performed. (Note that in the Hilbert space setting used for projected gradient descent, the dual space of H is isometric to H, so we did not have any such problem.)
The geometric insight of the Mirror Descent algorithm is that to perform the optimization in the primal space D, one can first map the point x ∈ D to the dual space D*, then perform the gradient update in the dual space, and finally map the updated point back to the primal space. Note that at each update step, the new point in the primal space D might be outside of the constraint set C ⊂ D, in which case it should be projected onto the constraint set C. The projection associated with the Mirror Descent algorithm is the Bregman projection, defined based on the notion of Bregman divergence.
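As a concrete illustration (an addition, not part of the original notes), here is a minimal Python sketch of this primal/dual loop; grad_Phi, grad_Phi_inv, and bregman_proj are hypothetical placeholders standing in for the mirror map machinery defined below.

```python
import numpy as np

def mirror_descent_step(x, g, eta, grad_Phi, grad_Phi_inv, bregman_proj):
    """One Mirror Descent update, following the primal/dual picture above."""
    theta = grad_Phi(x)        # map x from the primal space D to the dual space D*
    theta = theta - eta * g    # gradient update performed in the dual space
    y = grad_Phi_inv(theta)    # map the updated point back to the primal space
    return bregman_proj(y)     # Bregman projection onto the constraint set C

# Euclidean instance: Phi(x) = ||x||^2 / 2, so grad Phi and its inverse are the
# identity map, and the Bregman projection is the usual Euclidean projection
# (here onto the unit ball) -- recovering projected gradient descent.
proj_ball = lambda v: v / max(1.0, np.linalg.norm(v))
x_next = mirror_descent_step(
    np.array([0.6, 0.8]), np.array([1.0, -1.0]), 0.1,
    grad_Phi=lambda v: v, grad_Phi_inv=lambda v: v, bregman_proj=proj_ball,
)
print(x_next)  # same as projecting x - eta * g onto the ball
```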
We will use a convex open set D ⊂ Rⁿ whose closure contains the constraint set C ⊂ D̄. The Bregman divergence is the error term of the first-order Taylor expansion of the function Φ in D:

D_Φ(y, x) = Φ(y) − Φ(x) − ∇Φ(x)⊤(y − x).
Also, note that the function Φ(x) is said to be α-strongly convex w.r.t. a norm ‖·‖ if

Φ(y) − Φ(x) − ∇Φ(x)⊤(y − x) ≥ (α/2)‖y − x‖².
We used the following property of the Euclidean norm:

2a⊤b = ‖a‖² + ‖b‖² − ‖a − b‖²

in the proof of convergence of projected gradient descent, where we chose a = x_s − y_{s+1} and b = x_s − x*.
To prove the convergence of the Mirror Descent algorithm, we use the following property of the Bregman divergence in a similar fashion. This proposition shows that the Bregman divergence essentially behaves like the squared Euclidean norm:

Proposition 1: For all x, y, z ∈ D,

(∇Φ(x) − ∇Φ(y))⊤(x − z) = D_Φ(x, y) + D_Φ(z, x) − D_Φ(z, y).
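As a quick numerical sanity check (again an addition, not from the notes), the identity can be verified for, say, the negative-entropy potential that appears in the examples below:

```python
import numpy as np

# Verify Proposition 1 numerically for Phi(x) = sum_i x_i log(x_i) on positive vectors.
Phi = lambda v: np.sum(v * np.log(v))
grad_Phi = lambda v: 1.0 + np.log(v)
D = lambda yy, xx: Phi(yy) - Phi(xx) - grad_Phi(xx) @ (yy - xx)  # Bregman divergence

rng = np.random.default_rng(0)
x, y, z = rng.uniform(0.1, 1.0, size=(3, 4))  # three random positive vectors
lhs = (grad_Phi(x) - grad_Phi(y)) @ (x - z)
rhs = D(x, y) + D(z, x) - D(z, y)
print(np.isclose(lhs, rhs))  # True
```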
As described previously, the Bregman divergence is used in each step of the Mirror Descent algorithm to project the updated value onto the constraint set:

Π^Φ_C(x) = argmin_{z ∈ C∩D} D_Φ(z, x).

Moreover, this projection behaves like the Euclidean projection: writing π(y) = Π^Φ_C(y), for all z ∈ C ∩ D we have

D_Φ(z, π(y)) ≤ D_Φ(z, y).
Proof. Define π = Π^Φ_C(y) and h(t) = D_Φ(π + t(z − π), y). Since h(t) is minimized at t = 0 over [0, 1] (due to the definition of the projection), we have

0 ≤ h′(0) = (∇Φ(π) − ∇Φ(y))⊤(z − π).

Thus,

(∇Φ(π) − ∇Φ(y))⊤(π − z) ≤ 0.

Using Proposition 1 (with x = π), we know that

D_Φ(π, y) + D_Φ(z, π) − D_Φ(z, y) ≤ 0,

and since D_Φ(π, y) ≥ 0, this gives D_Φ(z, π(y)) ≤ D_Φ(z, y).
Theorem: Assume that f is convex and L-Lipschitz w.r.t. ‖·‖. Assume that Φ is α-strongly convex on C ∩ D w.r.t. ‖·‖, let R² = sup_{x∈C∩D} Φ(x) − min_{x∈C∩D} Φ(x), and take x₁ = argmin_{x∈C∩D} Φ(x) (assume that it exists). Then, Mirror Descent with η = (R/L)√(2α/k) gives

f(x̄) − f(x*) ≤ RL√(2/(αk))   and   f(x°) − f(x*) ≤ RL√(2/(αk)),

where x̄ = (1/k) Σ_{s=1}^k x_s is the averaged iterate and x° ∈ argmin_{1≤s≤k} f(x_s) is the best iterate.
Proof. Take x♯ ∈ C ∩ D. Similarly to the proof of projected gradient descent, we have:

f(x_s) − f(x♯) ≤ g_s⊤(x_s − x♯)                                                    (i)
              = (1/η) (ζ(x_s) − ζ(y_{s+1}))⊤(x_s − x♯)                             (ii)
              = (1/η) (∇Φ(x_s) − ∇Φ(y_{s+1}))⊤(x_s − x♯)                           (iii)
              = (1/η) [D_Φ(x_s, y_{s+1}) + D_Φ(x♯, x_s) − D_Φ(x♯, y_{s+1})]        (iv)
              ≤ (1/η) [D_Φ(x_s, y_{s+1}) + D_Φ(x♯, x_s) − D_Φ(x♯, x_{s+1})]        (v)
              ≤ ηL²/(2α) + (1/η) [D_Φ(x♯, x_s) − D_Φ(x♯, x_{s+1})],                (vi)
where (i) is due to convexity of the function f. Equations (ii) and (iii) are direct consequences of the Mirror Descent update (ζ = ∇Φ and ∇Φ(x_s) − ∇Φ(y_{s+1}) = ηg_s). Equation (iv) is the result of applying Proposition 1. Inequality (v) is a result of the fact that x_{s+1} = Π^Φ_C(y_{s+1}); thus, for x♯ ∈ C ∩ D, we have D_Φ(x♯, y_{s+1}) ≥ D_Φ(x♯, x_{s+1}).
We justify the following derivation to prove inequality (vi):

D_Φ(x_s, y_{s+1}) = Φ(x_s) − Φ(y_{s+1}) − ∇Φ(y_{s+1})⊤(x_s − y_{s+1})                   (a)
                  ≤ [∇Φ(x_s) − ∇Φ(y_{s+1})]⊤(x_s − y_{s+1}) − (α/2)‖y_{s+1} − x_s‖²      (b)
                  ≤ η‖g_s‖* ‖x_s − y_{s+1}‖ − (α/2)‖y_{s+1} − x_s‖²                      (c)
                  ≤ η²L²/(2α).                                                           (d)
Equation (a) is the definition of the Bregman divergence. To show inequality (b), we use the fact that Φ is α-strongly convex, which implies that Φ(y_{s+1}) − Φ(x_s) ≥ ∇Φ(x_s)⊤(y_{s+1} − x_s) + (α/2)‖y_{s+1} − x_s‖². According to the Mirror Descent algorithm, ∇Φ(x_s) − ∇Φ(y_{s+1}) = ηg_s; we then use Hölder's inequality to show that g_s⊤(x_s − y_{s+1}) ≤ ‖g_s‖* ‖x_s − y_{s+1}‖ and derive inequality (c). Finally, looking at the quadratic term ax − bx² for a, b > 0, it is not hard to show that max_x (ax − bx²) = a²/(4b), attained at x = a/(2b). We use this statement with x = ‖y_{s+1} − x_s‖, a = η‖g_s‖* ≤ ηL and b = α/2 to derive inequality (d).
Again, we use a telescoping sum to get

(1/k) Σ_{s=1}^k [f(x_s) − f(x♯)] ≤ ηL²/(2α) + D_Φ(x♯, x₁)/(kη),        (2.1)

where we used the fact that x₁ ∈ argmin_{C∩D} Φ(x) in the description of the Mirror Descent algorithm to prove ∇Φ(x₁)⊤(x♯ − x₁) ≥ 0, which yields D_Φ(x♯, x₁) ≤ Φ(x♯) − Φ(x₁) ≤ R². We optimize the right-hand side of equation (2.1) over η (the minimum of ηL²/(2α) + R²/(kη) is attained at η = (R/L)√(2α/k)) to get

(1/k) Σ_{s=1}^k [f(x_s) − f(x♯)] ≤ RL√(2/(αk)).

The claimed bounds follow, since f(x̄) ≤ (1/k) Σ_s f(x_s) by convexity (Jensen) and f(x°) ≤ (1/k) Σ_s f(x_s) by definition of the best iterate.
Note that with the right geometry, we can recover projected gradient descent as an instance of the Mirror Descent algorithm.
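For instance, instantiating the theorem in the Euclidean setup below (Φ(x) = ½‖x‖², α = 1) gives

f(x̄) − f(x*) ≤ RL√(2/k),

which is, up to constants, the projected gradient descent rate from the previous lecture.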
2.4.3 Remarks
The Mirror Descent algorithm is sometimes called Mirror Prox. We can write x_{s+1} as the Bregman projection x_{s+1} = argmin_{x∈C∩D} D_Φ(x, y_{s+1}). Thus, we have

x_{s+1} = argmin_{x ∈ C∩D} [η g_s⊤x + D_Φ(x, x_s)].
To get x_{s+1}, the first term on the right-hand side looks at the linear approximation of f close to x_s, in the direction determined by the subgradient g_s. If the function were linear, we would just minimize the linear approximation term. But if the function is not linear, the linear approximation is only valid in a small neighborhood around x_s; thus, we penalize moving away from x_s by adding the term D_Φ(x, x_s). We penalize by the squared norm when we choose D_Φ(x, x_s) = ½‖x − x_s‖², in which case we get back the projected gradient descent algorithm as an instance of the Mirror Descent algorithm.
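To make this concrete, in the unconstrained Euclidean case (C = R^d, D_Φ(x, x_s) = ½‖x − x_s‖²) the minimizer is obtained by setting the gradient to zero:

∇_x [η g_s⊤x + ½‖x − x_s‖²] = η g_s + x − x_s = 0   ⟹   x_{s+1} = x_s − η g_s,

which is exactly the gradient descent update.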
But if we choose a different divergence D_Φ(x, x_s), we change the geometry, and we can penalize differently in different directions depending on that geometry.
Thus, using the Mirror Descent algorithm, we can replace the 2-norm in the projected gradient descent algorithm by another norm, hoping to get a less constraining Lipschitz constant L. On the other hand, the choice of norm also affects the strong convexity parameter α of the potential, which may deteriorate. Thus, there is a trade-off in the improvement of the rate of convergence.
2.4.4 Examples
Euclidean Setup:
Φ(x) = ½‖x‖², D = R^d, ∇Φ(x) = ζ(x) = x. Thus, the updates will be the same as in gradient descent.

D_Φ(y, x) = ½‖y‖² − ½‖x‖² − x⊤y + ‖x‖²
          = ½‖x − y‖².

Thus, the Bregman projection with this potential function Φ(x) is the same as the usual Euclidean projection, and the Mirror Descent algorithm is exactly the same as the projected gradient descent algorithm, since it has the same update and the same projection operator. Note that α = 1, since D_Φ(y, x) ≥ ½‖x − y‖².
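A short numerical check of this computation (an illustrative addition):

```python
import numpy as np

# Check that D_Phi(y, x) = ||x - y||^2 / 2 for Phi(x) = ||x||^2 / 2.
Phi = lambda v: 0.5 * v @ v
rng = np.random.default_rng(1)
x, y = rng.standard_normal((2, 5))
D = Phi(y) - Phi(x) - x @ (y - x)  # Bregman divergence, using grad Phi(x) = x
print(np.isclose(D, 0.5 * np.linalg.norm(x - y) ** 2))  # True
```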
ℓ1 Setup:
We look at D = R^d_+ \ {0}.
Define Φ(x) to be the negative entropy, so that:

Φ(x) = Σ_{i=1}^d x_i log(x_i),    ζ(x) = ∇Φ(x) = (1 + log(x_i))_{i=1}^d.

Thus, looking at the update ∇Φ(y^{(s+1)}) = ∇Φ(x^{(s)}) − ηg_s, we get log(y_i^{(s+1)}) = log(x_i^{(s)}) − ηg_i^{(s)}, so that y_i^{(s+1)} = x_i^{(s)} exp(−ηg_i^{(s)}) for all i = 1, …, d. Thus, coordinate-wise,

y_{s+1} = x_s exp(−ηg_s),    x_{s+1} = y_{s+1}/|y_{s+1}|₁,

where the Bregman projection onto the simplex amounts to a simple normalization. We call this setup Exponentiated Gradient Descent, or Mirror Descent with multiplicative weights.

To analyze the rate of convergence, we want to study the ℓ1 norm on ∆_d. Thus, we have to show that, for some α, Φ is α-strongly convex w.r.t. |·|₁ on ∆_d. The Bregman divergence of this mirror map is given by

D_Φ(y, x) = KL(y, x) + Σ_i (x_i − y_i)
          = KL(y, x)
          ≥ ½ |x − y|₁²,

where we used the fact that x, y ∈ ∆_d to show Σ_i (x_i − y_i) = 0, and Pinsker's inequality to get the last bound. Thus, Φ is 1-strongly convex w.r.t. |·|₁ on ∆_d.
Remembering that Φ(x) = Σ_{i=1}^d x_i log(x_i) was defined to be the negative entropy, we know that −log(d) ≤ Φ(x) ≤ 0 for x ∈ ∆_d. Thus, R² = log(d). Assume now that f is L-Lipschitz w.r.t. |·|₁, i.e.,

‖g‖∞ ≤ L,  ∀g ∈ ∂f(x), ∀x ∈ ∆_d.

Then, Mirror Descent with η = (1/L)√(2 log(d)/k) gives

f(x̄_k) − f(x*) ≤ L√(2 log(d)/k)   and   f(x°_k) − f(x*) ≤ L√(2 log(d)/k).
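As an illustration (the objective below is an arbitrary toy choice, not from the notes), here is a sketch of the multiplicative-weights updates with the step size prescribed above:

```python
import numpy as np

# Mirror Descent with multiplicative weights on the simplex, using
# eta = (1/L) * sqrt(2 log(d) / k) as in the bound above.
d, k = 10, 2000
a = np.linspace(1.0, 2.0, d)
f = lambda x: 0.5 * np.sum(a * x**2)   # toy convex objective on Delta_d
grad = lambda x: a * x
L = 2.0                                # |grad f(x)|_inf <= max_i a_i <= 2 on Delta_d
eta = (1.0 / L) * np.sqrt(2.0 * np.log(d) / k)

x = np.ones(d) / d                     # x_1 = argmin of negative entropy on Delta_d
x_bar = np.zeros(d)
for _ in range(k):
    x_bar += x / k
    y = x * np.exp(-eta * grad(x))     # y_{s+1} = x_s exp(-eta g_s)
    x = y / y.sum()                    # Bregman (KL) projection = normalization
print(f(x_bar))                        # approaches the simplex minimum (about 0.07)
```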
Note that the dependence on the dimension in this rate is only logarithmic, whereas it would be polynomial with projected gradient descent. Thus, using Mirror Descent would be most beneficial.