
Homework 2

Mathematics for AI: AIT2005

Submit your work on https://courses.uet.vnu.edu.vn/.

Problem 1: Properties and examples of (sub)gradients


Prove the following properties of (sub)gradients and verify the examples below.

a) Show that ∂f(x) is a closed and convex set for any function f (not necessarily convex) and any
point x in its domain.

b) For a convex function f, show that if x ∈ U, where U is an open neighborhood of x contained in its domain, then

    f(y) ≥ f(x) + gᵀ(y − x), for all y ∈ U   ⇒   g ∈ ∂f(x).

In other words, if the tangent line inequality holds in a local open neighborhood of x, then it holds
globally.

c) For a convex function f and subgradients gx ∈ ∂f(x), gy ∈ ∂f(y), prove that

    (gx − gy)ᵀ(x − y) ≥ 0.

This property is called monotonicity of the subdifferential ∂f .

d) For f(x) = ∥x∥2, show that the set of all subgradients g ∈ Rn at a point x ∈ Rn is

    {x/∥x∥2}          if x ≠ 0,
    {v : ∥v∥2 ≤ 1}    if x = 0.
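
As a quick numerical sanity check (not a proof) for part d): away from the origin, f(x) = ∥x∥2 is differentiable with gradient x/∥x∥2, and the short Python sketch below (helper names are our own) compares this against central finite differences.

```python
import numpy as np

def f(x):
    # f(x) = ||x||_2
    return np.linalg.norm(x)

def finite_diff_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([3.0, -4.0])          # any point away from 0
print(x / np.linalg.norm(x))       # exact gradient: [ 0.6 -0.8]
print(finite_diff_grad(f, x))      # should agree to several decimal places
```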

e) For f(x) = max_{s∈S} fs(x), where each fs is convex, show that

    ∂f(x) ⊇ conv( ∪_{s : fs(x) = f(x)} ∂fs(x) ).

Problem 2: Properties and examples of proximal operators


We will inspect various properties and examples of proximal operators. Unless otherwise specified,
take h to be a convex function with domain dom(h) = Rn, let t > 0 be arbitrary, and consider its
associated proximal operator

    proxh,t(x) = argmin_z (1/(2t)) ∥x − z∥2² + h(z).
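
To make this definition concrete, here is a minimal Python sketch (function names are our own) for the simple choice h(z) = ∥z∥1, whose proximal operator is known in closed form to be coordinate-wise soft-thresholding at level t; a brute-force grid search over each coordinate confirms it matches the argmin above.

```python
import numpy as np

def prox_l1(x, t):
    """Closed-form prox of h(z) = ||z||_1: soft-threshold each coordinate at level t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_bruteforce_l1(x, t, half_width=5.0, num=100001):
    """Approximate prox_{h,t}(x) by minimizing (1/(2t))(z - x_i)^2 + |z| on a fine grid.
    Works coordinate-wise because the objective separates across coordinates."""
    out = np.empty_like(x)
    for i, xi in enumerate(x):
        z = np.linspace(xi - half_width, xi + half_width, num)
        obj = (z - xi) ** 2 / (2 * t) + np.abs(z)
        out[i] = z[np.argmin(obj)]
    return out

x = np.array([3.0, -0.5, 0.2])
print(prox_l1(x, t=1.0))             # [ 2. -0.  0.]
print(prox_bruteforce_l1(x, t=1.0))  # agrees up to the grid resolution
```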

a) Prove that proxh,t is a well-defined function on Rn , that is, each point x ∈ Rn gets mapped to a
unique value proxh,t (x).

b) Prove that proxh,t(x) = u if and only if

    h(y) ≥ h(u) + (1/t)(x − u)ᵀ(y − u),   for all y.

Hint: use subgradient optimality.

c) The proximal minimization algorithm (a special case of proximal gradient descent) repeats the
updates:

    x(k+1) = proxh,t(x(k)),   k = 1, 2, 3, . . . .

Write out these updates when applied to h(x) = (1/2) xᵀAx − bᵀx, where A ∈ Sn (the set of symmetric
n × n matrices). Show that this is equivalent to the iterative refinement algorithm for solving the
linear system Ax = b:

    x(k+1) = x(k) + (A + ϵI)⁻¹(b − Ax(k)),   k = 1, 2, 3, . . . .
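
For reference, a minimal NumPy sketch of this update (assuming A is symmetric positive definite so that the iterates converge; all names are our own) is given below. One can check that taking ϵ = 1/t makes it coincide with the proximal minimization updates above.

```python
import numpy as np

def iterative_refinement(A, b, eps, num_iters=100, x0=None):
    """Repeat x <- x + (A + eps*I)^{-1} (b - A x) to solve A x = b."""
    n = A.shape[0]
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    M = A + eps * np.eye(n)              # regularized matrix, fixed across iterations
    for _ in range(num_iters):
        x = x + np.linalg.solve(M, b - A @ x)
    return x

# Usage on a small positive definite system: the iterates approach A^{-1} b.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B.T @ B + np.eye(5)                  # symmetric positive definite
b = rng.standard_normal(5)
x = iterative_refinement(A, b, eps=1.0)
print(np.linalg.norm(A @ x - b))         # small residual
```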

d) For a matrix-variate function h, we define its proximal operator as

    proxh,t(X) = argmin_Z (1/(2t)) ∥X − Z∥F² + h(Z).


For h(X) = ∥X∥tr, show that the proximal operator evaluated at X = UΣVᵀ (an SVD of X)
is the so-called matrix soft-thresholding operator,

    proxh,t(X) = U Σt Vᵀ,   where Σt = diag((Σ11 − t)+, . . . , (Σnn − t)+),
and x+ = max{x, 0} denotes the positive part of x.
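
A minimal NumPy sketch of this matrix soft-thresholding operation (the function name is our own; it applies equally to rectangular X) is:

```python
import numpy as np

def matrix_soft_threshold(X, t):
    """prox_{h,t}(X) for h = trace norm: shrink each singular value by t, floor at 0."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) Vt
    s_shrunk = np.maximum(s - t, 0.0)                 # (sigma_i - t)_+
    return (U * s_shrunk) @ Vt                        # U diag(s_shrunk) Vt

# Usage: the singular values of the output are the shrunken singular values of X.
X = np.arange(12, dtype=float).reshape(3, 4)
print(np.linalg.svd(X, compute_uv=False))
print(np.linalg.svd(matrix_soft_threshold(X, t=1.0), compute_uv=False))
```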

Problem 3: Group lasso logistic regression


Suppose we have features X ∈ Rn×(p+1) that we divide into J groups:

    X = [ 1  X(1)  X(2)  · · ·  X(J) ],

where 1 = (1, . . . , 1) ∈ Rn and each X(j) ∈ Rn×pj . To achieve sparsity over groups of features, rather
than individual features, we can use a group lasso penalty. Write β = (β0 , β(1) , . . . , β(J) ) ∈ Rp+1 ,
where β0 is an intercept term and each β(j) ∈ Rpj . Consider the problem
    min_β  g(β) + λ Σ_{j=1}^{J} wj ∥β(j)∥2,        (1)

where g is a loss function and λ ≥ 0 is a tuning parameter. The penalty h(β) = λ Σ_{j=1}^{J} wj ∥β(j)∥2 is
called the group lasso penalty. A common choice is wj = √pj, which adjusts for the group size.

a) Derive the proximal operator proxh,t (β) for the group lasso penalty defined above.

b) Let y ∈ {0, 1}n be a vector of binary labels, and let g be the logistic loss

    g(β) = − Σ_{i=1}^{n} yi (Xβ)i + Σ_{i=1}^{n} log(1 + exp{(Xβ)i}).

Write out the steps for proximal gradient descent applied to the logistic group lasso problem (1) in
explicit detail.
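
As a rough guide (a sketch, not a full solution), one proximal gradient step for (1) can be written in Python as follows; the group-wise soft-thresholding used for the prox is the standard form of the group lasso proximal operator, the intercept β0 is left unpenalized, and the `groups` encoding (a list of index arrays, one per group) and all names are our own choices.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def logistic_grad(X, y, beta):
    """Gradient of g(beta) = sum_i [-y_i (X beta)_i + log(1 + exp((X beta)_i))]."""
    return X.T @ (sigmoid(X @ beta) - y)

def group_prox(beta, t, lam, groups, weights):
    """prox_{h,t}(beta): scale each group by (1 - t*lam*w_j / ||beta_(j)||_2)_+,
    leaving the intercept coordinate (not listed in any group) untouched."""
    out = beta.copy()
    for idx, w in zip(groups, weights):
        nrm = np.linalg.norm(beta[idx])
        out[idx] = 0.0 if nrm == 0.0 else max(0.0, 1.0 - t * lam * w / nrm) * beta[idx]
    return out

def prox_grad_step(X, y, beta, t, lam, groups, weights):
    """One proximal gradient update: gradient step on g, then prox of h."""
    return group_prox(beta - t * logistic_grad(X, y, beta), t, lam, groups, weights)

# Tiny synthetic usage with two groups of sizes 2 and 3 (plus an intercept).
rng = np.random.default_rng(0)
n = 20
groups = [np.array([1, 2]), np.array([3, 4, 5])]
weights = [np.sqrt(len(g)) for g in groups]
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, 5))])
y = rng.integers(0, 2, size=n).astype(float)
beta = prox_grad_step(X, y, np.zeros(6), t=1e-2, lam=1.0, groups=groups, weights=weights)
print(beta)
```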

c) Now we’ll use the logistic group lasso to classify a person’s age group from their movie ratings. The
movie ratings can be categorized into groups according to a movie’s genre (e.g., all ratings for action
movies can be grouped together). Load the training data in trainRatings.txt, trainLabels.txt;
the features have already been arranged into groups and you can find information about this in
groupTitles.txt, groupLabelsPerRating.txt. Solve the logistic group lasso problem (1) with
regularization parameter λ = 5 by running proximal gradient descent for 1000 iterations with fixed
step size t = 10−4 . Plot f (k) − f ⋆ versus k, where f (k) denotes the objective value at iteration k,
and use as an optimal objective value f ⋆ = 336.207. Make sure the plot is on a semi-log scale (where
the y-axis is in log scale).

d) (5 pts) Now implement Nesterov acceleration for the same problem. You should again run
accelerated proximal gradient descent for 1000 iterations with fixed step size t = 10−4 . As before,
produce a plot f (k) − f ⋆ versus k. Describe any differences you see in the criterion convergence curve.
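
One common way to write the accelerated updates is sketched below; this is a generic sketch of accelerated proximal gradient descent under our own naming, with `grad_g` and `prox_h` standing in for the logistic-loss gradient and group-lasso prox from the sketch above.

```python
import numpy as np

def accelerated_prox_grad(grad_g, prox_h, x0, t, num_iters=1000):
    """Proximal gradient descent with Nesterov-style momentum:
    v = x_{k-1} + ((k-2)/(k+1)) (x_{k-1} - x_{k-2}),  x_k = prox_h(v - t grad_g(v), t)."""
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(1, num_iters + 1):
        v = x + ((k - 2) / (k + 1)) * (x - x_prev) if k > 1 else x
        x_prev, x = x, prox_h(v - t * grad_g(v), t)
    return x

# Toy usage: g(x) = 0.5 ||x - c||^2 with h = 0 (so the prox is the identity).
c = np.array([1.0, -2.0, 3.0])
x_hat = accelerated_prox_grad(grad_g=lambda x: x - c,
                              prox_h=lambda z, t: z,
                              x0=np.zeros(3), t=0.5, num_iters=200)
print(x_hat)   # close to c
```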

Note: since it makes for an easier comparison, you can draw the convergence curves from (c), (d), (e)
on the same plot.

f) Finally, use the solution from accelerated proximal gradient descent in part (d) to make predictions
on the test set, available in testRatings.txt, testLabels.txt. What is the classification error?
Which movie genres are important for classifying whether a viewer is under 40 years old?
