hw7 Sol
Instructions. Please write up your responses to the following problems clearly and concisely. We encourage
you to write up your responses using LATEX; we have provided a LATEX template, available on Canvas, to
make this easier. Submit your answers in PDF form to Canvas. We will not accept paper copies
of the homework.
Collaboration. You are allowed and encouraged to work together. You may discuss the homework to
understand the problem and reach a solution in groups up to size two students. However, each student
must write down the solution independently, and without referring to written notes from the joint session.
In addition, each student must write on the problem set the names of the people with whom
you collaborated. You must understand the solution well enough in order to reconstruct it by yourself.
(This is for your own benefit: you have to take the exams alone.)
(Figure: sensor range readings in meters, plotted against reading number.)
If the sensor fails when obtaining a reading, it returns a value with uniform probability in the range
[0, 5]. Otherwise, the sensor returns a value distributed according to N (µ, σ 2 ), where σ 2 is the variance of
the sensor readings and µ is the actual distance to the nearest obstacle. We will assume that σ 2 is specified
by the manufacturer of the sensor, and our goal is to determine µ given a set of readings x1 , . . . , xn .
The difficulty is that π0 , the failure rate of the sensor, is unknown ahead of time, and we want to avoid
biasing our estimate of µ with completely meaningless samples where the sensor failed. Therefore, we model
a sensor reading xi according to the outcome of a latent variable zi indicating
whether or not the sensor failed:
\[
p(z_i = 0) = \pi_0, \qquad
p(x_i \mid z_i = 1) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right), \qquad
p(x_i \mid z_i = 0) = \frac{1}{5}\,\mathbf{1}(0 \le x_i \le 5)
\]
This is a mixture model with two components: a Gaussian with mean µ and variance σ 2 , and a uniform over
[0, 5].
1. [4 points] Using the equations defined above, write out the expression for the marginal log likelihood
of a dataset x1 , . . . , xn given the parameters (this is what maximum likelihood is maximizing). Your
solution should be in terms of π0 , σ 2 , µ, and the xi . Hint: Remember to marginalize over the zi ’s.
Write your final answer as: log p(x1 , . . . , xn | µ, σ 2 , π0 ) = expression.
F SOLUTION:
\[
\log p(x_1, \ldots, x_n \mid \mu, \sigma^2, \pi_0) =
\sum_{i=1}^{n} \log\left[ \frac{\pi_0}{5}\,\mathbf{1}(0 \le x_i \le 5)
+ \frac{1-\pi_0}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right) \right]
\]
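For concreteness, this expression could be evaluated numerically as in the following sketch (the function and variable names are illustrative, not part of the assignment code):
\begin{verbatim}
function ll = sensor_loglik(x, mu, sigma2, pi0)
    % Sketch: marginal log likelihood of the sensor failure mixture.
    % x: column vector of readings; mu, sigma2, pi0: scalars.
    unif  = (pi0 / 5) .* (x >= 0 & x <= 5);              % failure component
    gauss = (1 - pi0) ./ sqrt(2*pi*sigma2) ...
            .* exp(-(x - mu).^2 ./ (2*sigma2));          % working-sensor component
    ll = sum(log(unif + gauss));                         % marginalize z_i, then sum the logs
end
\end{verbatim}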
F SOLUTION (E-step): By Bayes' rule,
\[
q(z_i = 0 \mid x_i) = \frac{p(x_i \mid z_i = 0)\, p(z_i = 0)}{\sum_{z} p(x_i \mid z_i = z)\, p(z_i = z)}
= \frac{\frac{\pi_0}{5}\,\mathbf{1}(0 \le x_i \le 5)}
       {\frac{\pi_0}{5}\,\mathbf{1}(0 \le x_i \le 5) + \frac{1-\pi_0}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right)}
\]
3. [4 points] (M-step for µ) Given the estimates of the posterior q(zi = 0 | xi ) and fixed π0 and σ 2 , now
write down the update of µ. Your solution can include any of the following terms: q(zi = 0 | xi ), π0 , σ 2
and the xi . Write your final answer as: µ = expression.
F SOLUTION: This is the same analysis as for Gaussian mixture models: we just update µ as the
weighted mean according to q:
\[
\mu = \frac{\sum_{i=1}^{n} \big(1 - q(z_i = 0 \mid x_i)\big)\, x_i}{\sum_{i=1}^{n} \big(1 - q(z_i = 0 \mid x_i)\big)}
\]
4. [4 points] (M-step for π0 ) Given the estimates of the posterior q(zi = 0 | xi ), updated µ, and
fixed σ 2 , now write down the update for π0 . Your solution can include any of the following terms:
q(zi = 0 | xi ), µ, σ 2 and the xi . Write your final answer as: π0 = expression.
F SOLUTION: This is again the same analysis as for Gaussian mixture models:
\[
\pi_0 = \frac{1}{n} \sum_{i=1}^{n} q(z_i = 0 \mid x_i).
\]
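Putting the E-step and the two M-step updates together, a complete EM loop for this model might look like the following sketch (it assumes x is a column vector of readings and sigma2 is the manufacturer-specified variance; names are illustrative):
\begin{verbatim}
function [mu, pi0] = sensor_em(x, sigma2, mu0, pi0_init, num_iters)
    % Sketch of EM for the sensor failure mixture (sigma2 known).
    mu = mu0; pi0 = pi0_init; n = numel(x);
    for t = 1:num_iters
        % E-step: posterior probability that each reading is a sensor failure.
        p_fail = (pi0 / 5) .* (x >= 0 & x <= 5);
        p_ok   = (1 - pi0) ./ sqrt(2*pi*sigma2) .* exp(-(x - mu).^2 ./ (2*sigma2));
        q0 = p_fail ./ (p_fail + p_ok);                  % q(z_i = 0 | x_i)
        % M-step: weighted mean for mu, average responsibility for pi0.
        mu  = sum((1 - q0) .* x) / sum(1 - q0);
        pi0 = sum(q0) / n;
    end
end
\end{verbatim}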
5. [4 points] Suppose we will now apply our EM procedure to the sample dataset below. If we initialize
µ = 0 and π0 = 1/100 (with parameter σ 2 = 1/2) we get the following distribution for q(zi = 0 | xi )
as a function of the observation xi :
Note that the plot of the data has been rotated to align with the plot of q(zi = 0 | xi ). In practice,
this initialization leads to µ incorrectly converging to a value close to zero on this dataset.
Briefly explain (2-3 sentences) why we might expect this failure to happen, given the above q(zi = 0 | xi )
distribution. Hint: Look at where the majority of points lie with respect to q(zi = 0 | xi ). How will this
affect the update for π0 and µ?
F SOLUTION: Most of the data points lie in the range [2, 4], and all of these would be considered “failures” under the above q(zi = 0 | xi) distribution. They therefore contribute (almost) nothing to the update of µ, so µ never moves away from zero.
F SOLUTION: PCA finds (orthogonal) directions of maximum variance. These are plotted below.
2. [2 points] If we project the data onto a single principal component, which component will allow a
decision stump to perfectly separate the data: the first, the second, or both?
F SOLUTION: The second axis is the only one that separates the data.
3. [2 points] Suppose instead we transform the data using a non-linear function of only one of the
original axes: either φ(x) = x_1^2 or φ(x) = x_2^2. Which will allow a decision stump to perfectly separate
the data?
Figure 2: Graph of Mystery.
1. [4 points] Suppose our data is generated according to a Gaussian mixture: first Y is sampled from
the set of classes {1, 2} with equal probability (1/2) of each class being selected, and then a scalar X is
sampled according to the Gaussian P(X | Y). Let 1 be the mean of class 1 and −1 be the mean of
class 2. What is the variance of X? Your answer should be in terms of σ^2.
F SOLUTION: The answer is 1 + σ^2. Recall that Var[X] = E[X^2] − E[X]^2. Then we have:
\[
E[X^2] = E_Y\big[E_X[X^2 \mid Y]\big]
= \tfrac{1}{2}\big(1^2 + \sigma^2\big) + \tfrac{1}{2}\big((-1)^2 + \sigma^2\big)
= 1 + \sigma^2.
\]
By similar reasoning, E[X] = (1/2)(1) + (1/2)(−1) = 0, so E[X]^2 = 0 and the overall variance is just 1 + σ^2.
2. [3 points] Now suppose a data point X' is generated according to the same process as above, but the
feature is uninformative, so the means of class 1 and class 2 are both zero. What is the variance of X'? Your
answer should be in terms of σ^2.
F SOLUTION: Because P(X' | Y = 1) and P(X' | Y = 2) are the same, we really just have that
P(X') = N(0, σ^2). Thus Var[X'] = σ^2.
3. [3 points] Suppose now we generate X as in part 1 and X' as in part 2 from a single Y. What is the
covariance between X and X'? Recall that covariance is defined as E[(X − E_X[X])(X' − E_{X'}[X'])].
F SOLUTION: Since E_X[X] = 0 and E_{X'}[X'] = 0,
\[
E[(X - 0)(X' - 0)] = E[XX']
= E_Y\!\big[E_X[X \mid Y]\, E_{X'}[X' \mid Y]\big]
= E_Y\!\big[E_X[X \mid Y]\, E_{X'}[X']\big]
= E_Y\!\big[E_X[X \mid Y] \cdot 0\big] = 0.
\]
4. [4 points] Suppose we generate a dataset (x1 , y1 ) . . . , (xn , yn ) by generating one informative feature
X1 as in part 1 and m non-informative features X2 , . . . , Xm+1 as in part 2 for each example. (An
example 2D dataset is shown in Figure 3.) What are the eigenvalues of the covariance matrix of the
data? Assume that n → ∞, so that the covariances are exactly the true covariances that you computed
in the previous questions. Hint: The eigenvalues of a diagonal matrix are the diagonal entries of the
matrix, and the eigenvectors are the standard basis vectors e_1 = [1 0 0 . . . 0]^T, etc.
F SOLUTION: By parts 1-3, the covariance matrix is diag(1 + σ 2 , σ 2 , . . . , σ 2 ). Therefore the first
eigenvalue is 1 + σ 2 , and the rest are σ 2 .
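A quick simulation can sanity-check parts 1 through 4; the sketch below draws data from the described mixture (with an illustrative choice of σ^2 and m) and inspects the sample covariance and its eigenvalues.
\begin{verbatim}
% Sketch: empirical check that the covariance is diag(1+sigma2, sigma2, ..., sigma2).
n = 100000; m = 4; sigma2 = 0.5;
y  = 2 * (rand(n,1) < 0.5) - 1;            % class label mapped to mean +1 or -1
X1 = y + sqrt(sigma2) * randn(n,1);        % informative feature (part 1)
Xu = sqrt(sigma2) * randn(n,m);            % m uninformative features (part 2)
X  = [X1, Xu];
C  = cov(X);                               % should approach diag(1+sigma2, sigma2, ...)
disp(sort(eig(C), 'descend'));             % largest eigenvalue ~ 1+sigma2, rest ~ sigma2
\end{verbatim}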
5. [3 points] Suppose we use a nearest-neighbor classifier on our data using the first principal component
only. Briefly explain (2-3 sentences) why the ratio of intra-class to inter-class distances in the space
defined by the first principal component will increase, decrease, or not change as m increases (as more
uninformative features are added to the data).
Figure 3: Sample dataset with only one informative dimension: x1 .
F SOLUTION: By the previous parts, the first principal component is always the first (and only informative) feature, since its eigenvalue 1 + σ^2 exceeds the σ^2 of every other direction. Therefore the projected distances, and hence the ratio of intra-class to inter-class distances, do not change as m increases.
6. [2 points] Based on your previous answer, how will running PCA (and keeping only 1 component)
before using K-NN affect the performance of K-NN as m increases? That is, will it increase, decrease,
or have no effect on the accuracy of K-NN?
F SOLUTION: It will improve (increase) accuracy: as m increases, K-NN on the raw data gets increasingly confused by the irrelevant dimensions, while running PCA first keeps only the informative direction and keeps K-NN on track.
1. [5 points] Using the top 2 PCA dimensions, display all the test letters “x” and “y”, using crosses
and circles respectively. An example of a plot of “x” vs “w” is shown in Figure 4 below.
2. [5 points] Plot the average (in-sample) reconstruction error ||X − X̂||_F of the digits as a function of
the number of principal components that are included. How many principal components are required
to get 90% reconstruction accuracy? (I.e., to explain 90% of the variance: ||X − X̂||_F / ||X − X̄||_F = 0.9,
where X̄ is a matrix every row of which is the average x, i.e., the average image.) Reminder: be sure to
subtract that mean from X before you find the principal components, and to add it back in for your
predictions. (This dataset is a modified version of the dataset at http://www.seas.upenn.edu/~taskar/ocr/,
which has been subsampled and slightly altered to simplify the processing.)
Figure 4: Plot of 'x' vs. 'w' characters from the top 2 PCA dimensions.
3. [10 points] Learn a Gaussian Naive Bayes classifier using MLE on the PCA-ed data and report
accuracy on the test set using 5, 10 and 20 top dimensions. You should be learning a mean and
variance for each class and each dimension. So 26 parameters for class priors, and 26 times (5,10 or
20) means plus 26 times (5,10, or 20) variances. In addition to reporting accuracies, write a brief
description (3-4 sentences) of how these results compare to the performance of Naive Bayes on the
original data. Please include your source code in your write-up, but do not include built-in
MATLAB functions.
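A sketch of the Gaussian Naive Bayes fit described above (it assumes Xtrain/ytrain are the PCA-projected training data and labels 1..26, and Xtest/ytest the correspondingly projected test data; names are illustrative):
\begin{verbatim}
% Sketch: Gaussian Naive Bayes with per-class, per-dimension mean and variance.
C = 26; [n, d] = size(Xtrain);
prior = zeros(C, 1); mu = zeros(C, d); v = zeros(C, d);
for c = 1:C
    Xc = Xtrain(ytrain == c, :);
    prior(c) = size(Xc, 1) / n;              % class prior (MLE)
    mu(c, :) = mean(Xc, 1);                  % per-class, per-dimension means
    v(c, :)  = var(Xc, 1, 1);                % per-class, per-dimension MLE variances
end
scores = zeros(size(Xtest, 1), C);
for c = 1:C                                  % log joint probability of each class
    ll = -0.5 * sum((Xtest - mu(c, :)).^2 ./ v(c, :) + log(2*pi*v(c, :)), 2);
    scores(:, c) = log(prior(c)) + ll;       % (implicit expansion, R2016b+)
end
[~, ypred] = max(scores, [], 2);
accuracy = mean(ypred == ytest);
\end{verbatim}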
4. [10 points] Learn a k-means classifier on the same PCA-ed data and report accuracy on the test set
using 5, 10 and 20 top dimensions. In addition to reporting accuracies, write a brief description (3-4
sentences) of how these results compare to the performance of the Gaussian Naive Bayes classifier.
Please include your source code in your write-up, but do not include built-in MATLAB
functions.
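One plausible way to turn k-means into a classifier, as a sketch: clusters get the majority training label, and test points take the label of the nearest centroid (kmeans and pdist2 are Statistics Toolbox functions; the variable names are illustrative):
\begin{verbatim}
% Sketch: k-means as a classifier via majority-vote cluster labels.
K = 26;                                      % e.g., one cluster per letter
[idx, centroids] = kmeans(Xtrain, K);        % Statistics Toolbox
cluster_label = zeros(K, 1);
for k = 1:K
    cluster_label(k) = mode(ytrain(idx == k));   % majority training label in cluster k
end
% Assign each test point to its nearest centroid, then use that cluster's label.
[~, nearest] = pdist2(centroids, Xtest, 'euclidean', 'Smallest', 1);
ypred = cluster_label(nearest(:));
accuracy = mean(ypred == ytest);
\end{verbatim}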
where µ1 , . . . , µK are the centroids of the clusters and rik are binary variables indicating whether data point
xi belongs to cluster k.
The optimization is NP-hard. Instead, as described in class, a greedy iterative clustering algorithm is
used to alternately update µk and rik to optimize J(µ, r):
– Update step: Set each centroid to be the mean of the points assigned to it:
\[
\arg\min_{\mu} J(\mu, r) \;\rightarrow\; \mu_k = \frac{\sum_i r_{ik}\, x_i}{\sum_i r_{ik}}
\]
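For reference, one iteration of this alternation (the usual nearest-centroid assignment rule followed by the update step above) could be coded as in the following sketch; the data matrix X and the current centroids mu are assumed given.
\begin{verbatim}
% Sketch: one assignment step + one update step of greedy k-means (Lloyd's algorithm).
% X: n x d data matrix; mu: K x d current centroids.
[n, d] = size(X); K = size(mu, 1);
% Assignment step: r_ik = 1 for the nearest centroid, 0 otherwise.
D = zeros(n, K);
for k = 1:K
    D(:, k) = sum((X - mu(k, :)).^2, 2);     % squared distance to centroid k (implicit expansion)
end
[~, assign] = min(D, [], 2);
% Update step: each centroid becomes the mean of its assigned points.
for k = 1:K
    if any(assign == k)
        mu(k, :) = mean(X(assign == k, :), 1);
    end
end
\end{verbatim}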
1. [4 points] The greedy iterative clustering algorithm converges when the assignments no longer change.
Is the algorithm guaranteed to converge in a finite number of steps? Briefly explain why or why not.
Hint: Think about the total possible number of assignments and the trend of the objective function after
each iteration.
F SOLUTION: Yes. There are only a finite number (K^n) of possible assignments, and the objective function
J(µ, r) never increases from one iteration to the next, so the algorithm cannot cycle through assignments and
must converge in a finite number of steps.
2. [7 points] Consider the toy 2D dataset in the following figure. We set K = 2 in this problem. The
* marker indicates the data point and the circle marker indicates the starting cluster centroids. Show
the update step and assignment step for each iteration until the algorithm converges. You can use as
many of the remaining figures provided as you need.
Please note that the first update step is given as the initialization. For each iteration, use one figure
to mark the updated centroids based on the assignment from the last iteration, then circle the data point
assignments based on the updated centroids. We expect you to understand the detailed procedure of k-means.
You should be able to compute the centroids manually with a calculator and assign data points
with geometric intuition. Please be sure to show the coordinates of each centroid in every iteration.
What to hand in. You can
• scan: use your phone or a scanner to take an image with your circles and include it in the .pdf
you hand in, or
• use a PDF tool like Adobe Acrobat to write directly on the PDF, or
• run a MATLAB program and just hand in the outputs at each iteration with the centroids and the
list of points in each cluster (but you should see graphically what is happening).
1. Train PCA and an auto-encoder on the big data (12,000 samples) (see train.mat, which has X_train
12,000 × 784 and Y_train 12,000 × 1).
Please use the MATLAB function pcacov on the covariance matrix of X to train coefficients for each principal
component (see pca_getpc.m for details). For PCA, choose the number of principal components so
as to keep 90% reconstruction accuracy. How many principal components do we need in this case? Use
that many principal components.
For the auto-encoder, the setup to train a model is basically the same as what we did in hw5, except
that in this homework we change the hidden layer dimension from 100 to 200 (see rbm.m for details).
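A sketch of how pcacov might be used to pick the number of components, reading the 90% criterion as 90% of cumulative explained variance (variable names are illustrative):
\begin{verbatim}
% Sketch: choose the number of principal components with pcacov.
Xbar = mean(X_train, 1);
Xc   = X_train - Xbar;                        % center (implicit expansion, R2016b+)
[coeff, latent, explained] = pcacov(cov(Xc)); % directions, eigenvalues, % variance explained
k = find(cumsum(explained) >= 90, 1);         % smallest k reaching 90% cumulative variance
Z = Xc * coeff(:, 1:k);                       % the PCA-ed data used in the later parts
\end{verbatim}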
F SOLUTION:
2. Now we will do logistic regression on 3 different inputs:
F SOLUTION:
• Original data (12,000): 89.56%
• PCA-ed data: 90.34%
• The auto-encoder outputs: 94.64%
3. Do the same thing using K-means, with k = 10 and 25 respectively, but run the code on the PCA-ed data and
the auto-encoder outputs only. Test your model on test.mat; what is the accuracy?
Also, compare the performance of K-means with that of logistic regression.
F SOLUTION: k = 10
• PCA-ed data: 59.91%
• The auto-encoder outputs: 63.93%
k = 25
• PCA-ed data: 73.90%
• The auto-encoder outputs: 75.63%
Figure 6: Solution of k-means.
(Blank answer figures follow, each labeled: update means and then circle corresponding clusters.)