hw7 Sol
Instructions. Please write up your responses to the following problems clearly and concisely. We encourage
you to write up your responses using LATEX; we have provided a LATEX template, available on Canvas, to
make this easier. Submit your answers in PDF form to Canvas. We will not accept paper copies
of the homework.
Collaboration. You are allowed and encouraged to work together. You may discuss the homework to
understand the problem and reach a solution in groups up to size two students. However, each student
must write down the solution independently, and without referring to written notes from the joint session.
In addition, each student must write on the problem set the names of the people with whom
you collaborated. You must understand the solution well enough in order to reconstruct it by yourself.
(This is for your own benefit: you have to take the exams alone.)
(Figure: sensor range readings in meters, plotted against reading number.)
If the sensor fails when obtaining a reading, it returns a value with uniform probability in the range
[0, 5]. Otherwise, the sensor returns a value distributed according to N (µ, σ 2 ), where σ 2 is the variance of
the sensor readings and µ is the actual distance to the nearest obstacle. We will assume that σ 2 is specified
by the manufacturer of the sensor, and our goal is to determine µ given a set of readings x1 , . . . , xn .
The difficulty is that π0 , the failure rate of the sensor, is unknown ahead of time, and we want to avoid
biasing our estimate of µ with completely meaningless samples where the sensor failed. Therefore, we model
a sensor reading xi according to the outcome of a latent variable zi indicating
whether or not the sensor failed:
\[
p(z_i = 0) = \pi_0, \qquad
p(x_i \mid z_i = 1) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right), \qquad
p(x_i \mid z_i = 0) = \frac{1}{5}\,\mathbf{1}(0 \le x_i \le 5)
\]
This is a mixture model with two components: a Gaussian with mean µ and variance σ 2 , and a uniform over
[0, 5].
1. [4 points] Using the equations defined above, write out the expression for the marginal log likelihood
of a dataset x1 , . . . , xn given the parameters (this is what maximum likelihood is maximizing). Your
solution should be in terms of π0 , σ 2 , µ, and the xi . Hint: Remember to marginalize over the zi ’s.
Write your final answer as: log p(x1 , . . . , xn | µ, σ 2 , π0 ) = expression.
F SOLUTION:
\[
\log p(x_1, \ldots, x_n \mid \mu, \sigma^2, \pi_0) =
\sum_{i=1}^{n} \log\left[ \frac{\pi_0}{5}\,\mathbf{1}(0 \le x_i \le 5)
+ \frac{1-\pi_0}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right) \right]
\]
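For concreteness, this expression could be evaluated numerically as in the following sketch (the function and variable names are illustrative, not part of the assignment code):
\begin{verbatim}
function ll = sensor_loglik(x, mu, sigma2, pi0)
    % Sketch: marginal log likelihood of the sensor failure mixture.
    % x: column vector of readings; mu, sigma2, pi0: scalars.
    unif  = (pi0 / 5) .* (x >= 0 & x <= 5);              % failure component
    gauss = (1 - pi0) ./ sqrt(2*pi*sigma2) ...
            .* exp(-(x - mu).^2 ./ (2*sigma2));          % working-sensor component
    ll = sum(log(unif + gauss));                         % marginalize z_i, then sum the logs
end
\end{verbatim}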
F SOLUTION (E-step): By Bayes' rule,
\[
q(z_i = 0 \mid x_i) = \frac{p(x_i \mid z_i = 0)\, p(z_i = 0)}{\sum_{z} p(x_i \mid z_i = z)\, p(z_i = z)}
= \frac{\frac{\pi_0}{5}\,\mathbf{1}(0 \le x_i \le 5)}
       {\frac{\pi_0}{5}\,\mathbf{1}(0 \le x_i \le 5) + \frac{1-\pi_0}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right)}
\]
3. [4 points] (M-step for µ) Given the estimates of the posterior q(zi = 0 | xi ) and fixed π0 and σ 2 , now
write down the update of µ. Your solution can include any of the following terms: q(zi = 0 | xi ), π0 , σ 2
and the xi . Write your final answer as: µ = expression.
F SOLUTION: This is the same analysis as for Gaussian mixture models: we just update µ as the
weighted mean according to q:
\[
\mu = \frac{\sum_{i=1}^{n} \big(1 - q(z_i = 0 \mid x_i)\big)\, x_i}{\sum_{i=1}^{n} \big(1 - q(z_i = 0 \mid x_i)\big)}
\]
4. [4 points] (M-step for π0 ) Given the estimates of the posterior q(zi = 0 | xi ), updated µ, and
fixed σ 2 , now write down the update for π0 . Your solution can include any of the following terms:
q(zi = 0 | xi ), µ, σ 2 and the xi . Write your final answer as: π0 = expression.
F SOLUTION: This is again the same analysis as for Gaussian mixture models:
\[
\pi_0 = \frac{1}{n} \sum_{i=1}^{n} q(z_i = 0 \mid x_i).
\]
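Putting the E-step and the two M-step updates together, a complete EM loop for this model might look like the following sketch (it assumes x is a column vector of readings and sigma2 is the manufacturer-specified variance; names are illustrative):
\begin{verbatim}
function [mu, pi0] = sensor_em(x, sigma2, mu0, pi0_init, num_iters)
    % Sketch of EM for the sensor failure mixture (sigma2 known).
    mu = mu0; pi0 = pi0_init; n = numel(x);
    for t = 1:num_iters
        % E-step: posterior probability that each reading is a sensor failure.
        p_fail = (pi0 / 5) .* (x >= 0 & x <= 5);
        p_ok   = (1 - pi0) ./ sqrt(2*pi*sigma2) .* exp(-(x - mu).^2 ./ (2*sigma2));
        q0 = p_fail ./ (p_fail + p_ok);                  % q(z_i = 0 | x_i)
        % M-step: weighted mean for mu, average responsibility for pi0.
        mu  = sum((1 - q0) .* x) / sum(1 - q0);
        pi0 = sum(q0) / n;
    end
end
\end{verbatim}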
5. [4 points] Suppose we will now apply our EM procedure to the sample dataset below. If we initialize
µ = 0 and π0 = 1/100 (with parameter σ 2 = 1/2) we get the following distribution for q(zi = 0 | xi )
as a function of the observation xi :
Note that the plot of the data has been rotated to align with the plot of q(zi = 0 | xi ). In practice,
this initialization leads to µ incorrectly converging to a value close to zero on this dataset.
Briefly explain (2-3 sentences) why we might expect this failure to happen, given the above q(zi = 0 | xi )
distribution. Hint: Look at where the majority of points lie with respect to q(zi = 0 | xi ). How will this
affect the update for π0 and µ?
F SOLUTION: Most of the data points lie in the range [2, 4], and all of these would be considered “failures” under the above q(zi = 0 | xi) distribution. They therefore contribute (almost) nothing to the update of µ, so µ never moves away from zero.
F SOLUTION: PCA finds (orthogonal) directions of maximum variance. These are plotted below.
2. [2 points] If we project the data onto a single principal component, which component will allow a
decision stump to perfectly separate the data: the first, the second, or both?
F SOLUTION: The second axis is the only one that separates the data.
3. [2 points] Suppose instead we transform the data using a non-linear function of only one of the
original axes: either φ(x) = x_1^2 or φ(x) = x_2^2. Which will allow a decision stump to perfectly separate
the data?
Figure 2: Graph of Mystery.
1. [4 points] Suppose our data is generated according to a Gaussian mixture: first Y is sampled from
the set of classes {1, 2} with equal probability (1/2) of each class being selected, and then a scalar X is
sampled according to the Gaussian P(X | Y). Let 1 be the mean of class 1 and −1 be the mean of
class 2. What is the variance of X? Your answer should be in terms of σ^2.
F SOLUTION: The answer is 1 + σ^2. Recall that Var[X] = E[X^2] − E[X]^2. Then we have:
\[
E[X^2] = E_Y\big[E_X[X^2 \mid Y]\big]
= \tfrac{1}{2}\big(1^2 + \sigma^2\big) + \tfrac{1}{2}\big((-1)^2 + \sigma^2\big)
= 1 + \sigma^2.
\]
By similar reasoning, E[X] = (1/2)(1) + (1/2)(−1) = 0, so E[X]^2 = 0 and the overall variance is just 1 + σ^2.
2. [3 points] Now suppose a data point X' is generated according to the same process as above, but the
feature is uninformative, so the means of class 1 and class 2 are both zero. What is the variance of X'? Your
answer should be in terms of σ^2.
F SOLUTION: Because P(X' | Y = 1) and P(X' | Y = 2) are the same, we really just have that
P(X') = N(0, σ^2). Thus Var[X'] = σ^2.
3. [3 points] Suppose now we generate X as in part 1 and X' as in part 2 from a single Y. What is the
covariance between X and X'? Recall that covariance is defined as E[(X − E_X[X])(X' − E_{X'}[X'])].
F SOLUTION: Since E_X[X] = 0 and E_{X'}[X'] = 0,
\[
E[(X - 0)(X' - 0)] = E[XX']
= E_Y\!\big[E_X[X \mid Y]\, E_{X'}[X' \mid Y]\big]
= E_Y\!\big[E_X[X \mid Y]\, E_{X'}[X']\big]
= E_Y\!\big[E_X[X \mid Y] \cdot 0\big] = 0.
\]
4. [4 points] Suppose we generate a dataset (x1 , y1 ) . . . , (xn , yn ) by generating one informative feature
X1 as in part 1 and m non-informative features X2 , . . . , Xm+1 as in part 2 for each example. (An
example 2D dataset is shown in Figure 3.) What are the eigenvalues of the covariance matrix of the
data? Assume that n → ∞, so that the covariances are exactly the true covariances that you computed
in the previous questions. Hint: The eigenvalues of a diagonal matrix are the diagonal entries of the
matrix, and the eigenvectors are the standard basis vectors e_1 = [1 0 0 . . . 0]^T, etc.
F SOLUTION: By parts 1-3, the covariance matrix is diag(1 + σ 2 , σ 2 , . . . , σ 2 ). Therefore the first
eigenvalue is 1 + σ 2 , and the rest are σ 2 .
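A quick simulation can sanity-check parts 1 through 4; the sketch below draws data from the described mixture (with an illustrative choice of σ^2 and m) and inspects the sample covariance and its eigenvalues.
\begin{verbatim}
% Sketch: empirical check that the covariance is diag(1+sigma2, sigma2, ..., sigma2).
n = 100000; m = 4; sigma2 = 0.5;
y  = 2 * (rand(n,1) < 0.5) - 1;            % class label mapped to mean +1 or -1
X1 = y + sqrt(sigma2) * randn(n,1);        % informative feature (part 1)
Xu = sqrt(sigma2) * randn(n,m);            % m uninformative features (part 2)
X  = [X1, Xu];
C  = cov(X);                               % should approach diag(1+sigma2, sigma2, ...)
disp(sort(eig(C), 'descend'));             % largest eigenvalue ~ 1+sigma2, rest ~ sigma2
\end{verbatim}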
5. [3 points] Suppose we use a nearest-neighbor classifier on our data using the first principal component
only. Briefly explain (2-3 sentences) why the ratio of intra-class to inter-class distances in the space
defined by the first principal component will increase, decrease, or not change as m increases (as more
uninformative features are added to the data).
Figure 3: Sample dataset with only one informative dimension: x1 .
F SOLUTION: By the previous parts, the first principal component is always the first (and only informative) feature, since its eigenvalue 1 + σ^2 exceeds the σ^2 of every other direction. Therefore the projected distances, and hence the ratio of intra-class to inter-class distances, do not change as m increases.
6. [2 points] Based on your previous answer, how will running PCA (and keeping only 1 component)
before using K-NN affect the performance of K-NN as m increases? That is, will it increase, decrease,
or have no effect on the accuracy of K-NN?
F SOLUTION: It will improve (increase) accuracy: as m increases, K-NN on the raw data gets increasingly confused by the irrelevant dimensions, while running PCA first keeps only the informative direction and keeps K-NN on track.
1. [5 points] Using the top 2 PCA dimensions, display all the test letters “x” and “y”, using crosses
and circles respectively. An example of a plot of “x” vs “w” is shown in Figure 4 below.
2. [5 points] Plot the average (in-sample) reconstruction error ||X − X̂||_F of the digits as a function of
the number of principal components that are included. How many principal components are required
to get 90% reconstruction accuracy? (I.e., to explain 90% of the variance: ||X − X̂||_F / ||X − X̄||_F = 0.9,
where X̄ is a matrix every row of which is the average x, i.e., the average image.) Reminder: be sure to
subtract that mean from X before you find the principal components, and to add it back in for your
predictions. (This dataset is a modified version of the dataset at http://www.seas.upenn.edu/~taskar/ocr/,
which has been subsampled and slightly altered to simplify the processing.)
Figure 4: Plot of 'x' vs. 'w' characters from the top 2 PCA dimensions.
3. [10 points] Learn a Gaussian Naive Bayes classifier using MLE on the PCA-ed data and report
accuracy on the test set using 5, 10 and 20 top dimensions. You should be learning a mean and
variance for each class and each dimension. So 26 parameters for class priors, and 26 times (5,10 or
20) means plus 26 times (5,10, or 20) variances. In addition to reporting accuracies, write a brief
description (3-4 sentences) of how these results compare to the performance of Naive Bayes on the
original data. Please include your source code in your write-up, but do not include built-in
MATLAB functions.
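A sketch of the Gaussian Naive Bayes fit described above (it assumes Xtrain/ytrain are the PCA-projected training data and labels 1..26, and Xtest/ytest the correspondingly projected test data; names are illustrative):
\begin{verbatim}
% Sketch: Gaussian Naive Bayes with per-class, per-dimension mean and variance.
C = 26; [n, d] = size(Xtrain);
prior = zeros(C, 1); mu = zeros(C, d); v = zeros(C, d);
for c = 1:C
    Xc = Xtrain(ytrain == c, :);
    prior(c) = size(Xc, 1) / n;              % class prior (MLE)
    mu(c, :) = mean(Xc, 1);                  % per-class, per-dimension means
    v(c, :)  = var(Xc, 1, 1);                % per-class, per-dimension MLE variances
end
scores = zeros(size(Xtest, 1), C);
for c = 1:C                                  % log joint probability of each class
    ll = -0.5 * sum((Xtest - mu(c, :)).^2 ./ v(c, :) + log(2*pi*v(c, :)), 2);
    scores(:, c) = log(prior(c)) + ll;       % (implicit expansion, R2016b+)
end
[~, ypred] = max(scores, [], 2);
accuracy = mean(ypred == ytest);
\end{verbatim}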
4. [10 points] Learn a k-means classifier on the same PCA-ed data and report accuracy on the test set
using 5, 10 and 20 top dimensions. In addition to reporting accuracies, write a brief description (3-4
sentences) of how these results compare to the performance of the Gaussian Naive Bayes classifier.
Please include your source code in your write-up, but do not include built-in MATLAB
functions.
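One plausible way to turn k-means into a classifier, as a sketch: clusters get the majority training label, and test points take the label of the nearest centroid (kmeans and pdist2 are Statistics Toolbox functions; the variable names are illustrative):
\begin{verbatim}
% Sketch: k-means as a classifier via majority-vote cluster labels.
K = 26;                                      % e.g., one cluster per letter
[idx, centroids] = kmeans(Xtrain, K);        % Statistics Toolbox
cluster_label = zeros(K, 1);
for k = 1:K
    cluster_label(k) = mode(ytrain(idx == k));   % majority training label in cluster k
end
% Assign each test point to its nearest centroid, then use that cluster's label.
[~, nearest] = pdist2(centroids, Xtest, 'euclidean', 'Smallest', 1);
ypred = cluster_label(nearest(:));
accuracy = mean(ypred == ytest);
\end{verbatim}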
where µ1 , . . . , µK are the centroids of the clusters and rik are binary variables indicating whether data point
xi belongs to cluster k.
The optimization is NP-hard. Instead, as described in class, a greedy iterative clustering algorithm is
used to alternately update µk and rik to optimize J(µ, r):
– Update step: Set each centroid to be the mean of the points assigned to it:
\[
\arg\min_{\mu} J(\mu, r) \;\rightarrow\; \mu_k = \frac{\sum_i r_{ik}\, x_i}{\sum_i r_{ik}}
\]
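For reference, one iteration of this alternation (the usual nearest-centroid assignment rule followed by the update step above) could be coded as in the following sketch; the data matrix X and the current centroids mu are assumed given.
\begin{verbatim}
% Sketch: one assignment step + one update step of greedy k-means (Lloyd's algorithm).
% X: n x d data matrix; mu: K x d current centroids.
[n, d] = size(X); K = size(mu, 1);
% Assignment step: r_ik = 1 for the nearest centroid, 0 otherwise.
D = zeros(n, K);
for k = 1:K
    D(:, k) = sum((X - mu(k, :)).^2, 2);     % squared distance to centroid k (implicit expansion)
end
[~, assign] = min(D, [], 2);
% Update step: each centroid becomes the mean of its assigned points.
for k = 1:K
    if any(assign == k)
        mu(k, :) = mean(X(assign == k, :), 1);
    end
end
\end{verbatim}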
1. [4 points] The greedy iterative clustering algorithm converges when the assignments no longer change.
Is the algorithm guaranteed to converge in a finite number of steps? Briefly explain why or why not.
Hint: Think about the total possible number of assignments and the trend of the objective function after
each iteration.
F SOLUTION: Yes. There are only a finite number (K^n) of possible assignments, and the objective function
J(µ, r) never increases from one iteration to the next, so the algorithm cannot cycle through assignments and
must converge in a finite number of steps.
2. [7 points] Consider the toy 2D dataset in the following figure. We set K = 2 in this problem. The
* marker indicates the data point and the circle marker indicates the starting cluster centroids. Show
the update step and assignment step for each iteration until the algorithm converges. You can use as
many of the remaining figures provided as you need.
Please note that the first update step is given as the initialization. For each iteration, use one figure
to mark the updated centroids based on the assignment from the last iteration, then circle the data point
assignments based on the updated centroids. We expect you to understand the detailed procedure of k-means.
You should be able to compute the centroids manually with a calculator and assign data points
with geometric intuition. Please be sure to show the coordinates of each centroid in every iteration.
What to hand in. You can
• scan: use your phone or a scanner to take an image with your circles and include it in the .pdf
you hand in, or
• use a PDF tool like Adobe Acrobat to write directly on the PDF, or
• run a MATLAB program and just hand in the outputs at each iteration with the centroids and the
list of points in each cluster (but you should see graphically what is happening).
1. Train PCA and an auto-encoder on the big data (12,000 samples) (see train.mat, which has X_train
12,000 × 784 and Y_train 12,000 × 1).
Please use the MATLAB function pcacov on the covariance matrix of X to train coefficients for each principal
component (see pca_getpc.m for details). For PCA, choose the number of principal components so
as to keep 90% reconstruction accuracy. How many principal components do we need in this case? Use
that many principal components.
For the auto-encoder, the setup to train a model is basically the same as what we did in hw5, except
that in this homework we change the hidden layer dimension from 100 to 200 (see rbm.m for details).
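A sketch of how pcacov might be used to pick the number of components, reading the 90% criterion as 90% of cumulative explained variance (variable names are illustrative):
\begin{verbatim}
% Sketch: choose the number of principal components with pcacov.
Xbar = mean(X_train, 1);
Xc   = X_train - Xbar;                        % center (implicit expansion, R2016b+)
[coeff, latent, explained] = pcacov(cov(Xc)); % directions, eigenvalues, % variance explained
k = find(cumsum(explained) >= 90, 1);         % smallest k reaching 90% cumulative variance
Z = Xc * coeff(:, 1:k);                       % the PCA-ed data used in the later parts
\end{verbatim}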
F SOLUTION:
2. Now we will do logistic regression on 3 different inputs:
F SOLUTION:
• Original data (12,000): 89.56%
• PCA-ed data: 90.34%
• The auto-encoder outputs: 94.64%
3. Do the same thing using K-means, with k = 10 and 25 respectively, but run the code on the PCA-ed data and
the auto-encoder outputs only. Test your model on test.mat; what is the accuracy?
Also, compare the performance of K-means with that of logistic regression.
F SOLUTION: k = 10
• PCA-ed data: 59.91%
• The auto-encoder outputs: 63.93%
k = 25
• PCA-ed data: 73.90%
• The auto-encoder outputs: 75.63%
Figure 6: Solution of k-means.
(Blank answer figures follow, each labeled: update means and then circle corresponding clusters.)