Homework 2: Mathematics For AI: AIT2005
a) Show that ∂f(x) is a closed and convex set for any function f (not necessarily convex) and any point x in its domain.
b) For a convex function f, show that if x ∈ U, where U is an open neighborhood in its domain, and g satisfies f(y) ≥ f(x) + g^T(y − x) for all y ∈ U, then g ∈ ∂f(x). In other words, if the tangent line inequality holds in a local open neighborhood of x, then it holds globally.
c) For a convex function f, show that its subgradients are monotone: for any x, y in the domain of f and any g_x ∈ ∂f(x), g_y ∈ ∂f(y),

(g_x − g_y)^T (x − y) ≥ 0.
d) For f(x) = ∥x∥_2, show that all subgradients g ∈ R^n at a point x ∈ R^n are of the form

g ∈ {x/∥x∥_2} if x ≠ 0, and g ∈ {v : ∥v∥_2 ≤ 1} if x = 0.
a) Recall that for a convex function h and step size t > 0, the proximal operator is defined as prox_{h,t}(x) = argmin_z { (1/(2t))∥x − z∥_2^2 + h(z) }. Prove that prox_{h,t} is a well-defined function on R^n, that is, each point x ∈ R^n gets mapped to a unique value prox_{h,t}(x).
c) The proximal minimization algorithm (a special case of proximal gradient descent) repeats the updates:

x^{(k+1)} = prox_{h,t}(x^{(k)}), k = 1, 2, 3, . . . .

Write out these updates when applied to h(x) = (1/2)x^T Ax − b^T x, where A ∈ S^n. Show that this is equivalent to the iterative refinement algorithm for solving the linear system Ax = b:

x^{(k+1)} = x^{(k)} + (A + (1/t)I)^{-1}(b − Ax^{(k)}), k = 1, 2, 3, . . . .
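As a quick numerical sanity check of this equivalence (not part of the assignment), the sketch below compares one proximal minimization update with one iterative refinement step on a random positive definite system; all names and values are illustrative.

```python
import numpy as np

# Sketch: for h(x) = (1/2) x^T A x - b^T x, one proximal update
# solves (A + (1/t) I) z = b + x/t, which should coincide with the
# iterative refinement step x + (A + (1/t) I)^{-1} (b - A x).
rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)            # a random symmetric positive definite A
b = rng.standard_normal(n)
t = 0.1
x = rng.standard_normal(n)

prox_step = np.linalg.solve(A + np.eye(n) / t, b + x / t)
refine_step = x + np.linalg.solve(A + np.eye(n) / t, b - A @ x)

print(np.allclose(prox_step, refine_step))   # expected: True
```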
For h(X) = ∥X∥_tr, show that the proximal operator evaluated at X = UΣV^T (this is an SVD of X) is so-called matrix soft-thresholding,

prox_{h,t}(X) = UΣ_t V^T, where Σ_t = diag((Σ_{11} − t)_+, . . . , (Σ_{nn} − t)_+).
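A minimal NumPy sketch of this matrix soft-thresholding formula (the function name is ours, not prescribed by the assignment):

```python
import numpy as np

def matrix_soft_threshold(X, t):
    # prox of t * ||.||_tr at X: soft-threshold the singular values by t
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_t = np.maximum(s - t, 0.0)             # (Sigma_ii - t)_+
    return (U * s_t) @ Vt                    # U diag(s_t) V^T
```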
Consider a design matrix X = [1, X_{(1)}, . . . , X_{(J)}] ∈ R^{n×(p+1)}, where 1 = (1, . . . , 1) ∈ R^n and each X_{(j)} ∈ R^{n×p_j}. To achieve sparsity over groups of features, rather than over individual features, we can use a group lasso penalty. Write β = (β_0, β_{(1)}, . . . , β_{(J)}) ∈ R^{p+1}, where β_0 is an intercept term and each β_{(j)} ∈ R^{p_j}. Consider the problem
min_β g(β) + λ Σ_{j=1}^J w_j ∥β_{(j)}∥_2,    (1)
where g is a loss function and λ ≥ 0 is a tuning parameter. The penalty h(β) = λ Σ_{j=1}^J w_j ∥β_{(j)}∥_2 is called the group lasso penalty. A common choice for w_j is √p_j, to adjust for the group size.
a) Derive the proximal operator prox_{h,t}(β) for the group lasso penalty defined above.
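Assuming the derivation in part (a) yields the standard blockwise soft-thresholding (the intercept β_0 is not penalized, and each group is shrunk toward zero in Euclidean norm by tλw_j), a hedged sketch is below; `groups` (a list of integer index arrays into β, one per group, excluding the intercept index 0) and `w` are our own bookkeeping, not part of the assignment.

```python
import numpy as np

def prox_group_lasso(beta, t, lam, groups, w):
    # blockwise soft-thresholding:
    # beta_(j) <- (1 - t*lam*w_j / ||beta_(j)||_2)_+ * beta_(j)
    out = beta.copy()                        # intercept (index 0) passes through
    for j, idx in enumerate(groups):
        block = beta[idx]
        nrm = np.linalg.norm(block)
        shrink = max(1.0 - t * lam * w[j] / nrm, 0.0) if nrm > 0 else 0.0
        out[idx] = shrink * block
    return out
```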
b) Let y ∈ {0, 1}^n be a binary label vector, and let g be the logistic loss

g(β) = −Σ_{i=1}^n y_i (Xβ)_i + Σ_{i=1}^n log(1 + exp{(Xβ)_i}).
Write out the steps for proximal gradient descent applied to the logistic group lasso problem (1) in
explicit detail.
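One hedged sketch of those steps, reusing prox_group_lasso from above. Since g(β) = −y^T Xβ + Σ_i log(1 + exp{(Xβ)_i}), its gradient is ∇g(β) = X^T(σ(Xβ) − y), where σ is the sigmoid; each iteration takes a gradient step on g and then applies the group lasso prox. All names here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(beta, X, y):
    # gradient of the smooth part: X^T (sigmoid(X beta) - y)
    return X.T @ (sigmoid(X @ beta) - y)

def objective(beta, X, y, lam, groups, w):
    # f(beta) = g(beta) + h(beta); log(1 + e^z) computed stably as logaddexp(0, z)
    z = X @ beta
    g = -y @ z + np.logaddexp(0.0, z).sum()
    h = lam * sum(w[j] * np.linalg.norm(beta[idx]) for j, idx in enumerate(groups))
    return g + h

def prox_grad(X, y, lam, groups, w, t, iters):
    beta = np.zeros(X.shape[1])
    objs = []
    for _ in range(iters):
        beta = prox_group_lasso(beta - t * logistic_grad(beta, X, y),
                                t, lam, groups, w)
        objs.append(objective(beta, X, y, lam, groups, w))
    return beta, objs
```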
c) Now we’ll use the logistic group lasso to classify a person’s age group from their movie ratings. The
movie ratings can be categorized into groups according to a movie’s genre (e.g., all ratings for action
movies can be grouped together). Load the training data in trainRatings.txt, trainLabels.txt;
the features have already been arranged into groups and you can find information about this in
groupTitles.txt, groupLabelsPerRating.txt. Solve the logistic group lasso problem (1) with
regularization parameter λ = 5 by running proximal gradient descent for 1000 iterations with fixed
step size t = 10^{-4}. Plot f^{(k)} − f^⋆ versus k, where f^{(k)} denotes the objective value at iteration k, and use as the optimal objective value f^⋆ = 336.207. Make sure the plot is on a semi-log scale (where
the y-axis is in log scale).
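A sketch of part (c) end to end, building on the snippets above. The loaders assume plain whitespace-delimited text files, with groupLabelsPerRating.txt giving one integer group label per rating column; adjust to the actual file layout.

```python
import numpy as np
import matplotlib.pyplot as plt

ratings = np.loadtxt("trainRatings.txt")
y = np.loadtxt("trainLabels.txt")
labels = np.loadtxt("groupLabelsPerRating.txt").astype(int)

X = np.column_stack([np.ones(len(y)), ratings])      # leading intercept column
groups = [np.where(labels == g)[0] + 1               # +1 skips the intercept
          for g in np.unique(labels)]
w = [np.sqrt(len(idx)) for idx in groups]            # w_j = sqrt(p_j)

beta, objs = prox_grad(X, y, lam=5.0, groups=groups, w=w, t=1e-4, iters=1000)

f_star = 336.207
plt.semilogy(range(1, 1001), np.array(objs) - f_star)   # log-scale y-axis
plt.xlabel("iteration k")
plt.ylabel("f(k) - f*")
plt.show()
```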
d) (5 pts) Now implement Nesterov acceleration for the same problem. You should again run accelerated proximal gradient descent for 1000 iterations with fixed step size t = 10^{-4}. As before, produce a plot of f^{(k)} − f^⋆ versus k. Describe any differences you see in the criterion convergence curve.
Note: since it makes for an easier comparison, you can draw the convergence curves from (c), (d), (e)
on the same plot.
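For part (d), a hedged sketch of the accelerated iteration, using the standard (k − 2)/(k + 1) momentum weights and the helpers defined earlier:

```python
import numpy as np

def accel_prox_grad(X, y, lam, groups, w, t, iters):
    beta = beta_prev = np.zeros(X.shape[1])
    objs = []
    for k in range(1, iters + 1):
        v = beta + (k - 2) / (k + 1) * (beta - beta_prev)   # momentum step
        beta_prev = beta
        beta = prox_group_lasso(v - t * logistic_grad(v, X, y),
                                t, lam, groups, w)
        objs.append(objective(beta, X, y, lam, groups, w))
    return beta, objs
```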
f) Finally, use the solution from accelerated proximal gradient descent in part (d) to make predictions
on the test set, available in testRatings.txt, testLabels.txt. What is the classification error?
Which movie genres are important for classifying whether a viewer is under 40 years old?
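A sketch of part (f), assuming the test files share the training files’ format and that groupTitles.txt lists one genre title per group (both assumptions, check against the actual files):

```python
import numpy as np

test_ratings = np.loadtxt("testRatings.txt")
y_test = np.loadtxt("testLabels.txt")
X_test = np.column_stack([np.ones(len(y_test)), test_ratings])

# beta, groups, sigmoid come from the accelerated run sketched above
y_hat = (sigmoid(X_test @ beta) > 0.5).astype(float)     # threshold at 1/2
print("classification error:", np.mean(y_hat != y_test))

titles = open("groupTitles.txt").read().splitlines()     # assumed: one title per group
for j, idx in enumerate(groups):
    if np.linalg.norm(beta[idx]) > 0:                    # genres with nonzero groups
        print("selected genre:", titles[j])
```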