
Midterm Exam Solutions

CMU 10-601: Machine Learning (Spring 2016)

Feb. 29, 2016

Name:
Andrew ID:

START HERE: Instructions


• This exam has 17 pages and 5 Questions (page one is this cover page). Check to see if
any pages are missing. Enter your name and Andrew ID above.

• You are allowed to use one page of notes, front and back.

• Electronic devices are not permitted.

• Note that the questions vary in difficulty. Make sure to look over the entire exam
before you start and answer the easier questions first.

Question        Points   Score
1               20
2               20
3               20
4               20
5               20
Extra Credit    14
Total           114

1 Naive Bayes, Probability, and MLE [20 pts. + 2 Extra Credit]

1.1 Naive Bayes


You are given a data set of 10,000 students with their sex, height, and hair color. You are
trying to build a classifier to predict the sex of a student, so you randomly split the data
into a training set and a testing set. Here are the specifications of the data set:
• sex ∈ {male,female}
• height ∈ [0,300] centimeters
• hair ∈ {brown, black, blond, red, green}
• 3240 men in the data set
• 6760 women in the data set
Under the assumptions necessary for Naive Bayes (not the distributional assumptions you
might naturally or intuitively make about the dataset) answer each question with T or F
and provide a one sentence explanation of your answer:
(a) [2 pts.] T or F: As height is a continuous valued variable, Naive Bayes is not appropriate
since it cannot handle continuous valued variables.
Solution: False. Naive Bayes can handle both continuous and discrete features as long
as appropriate distributions are used for the class-conditional probabilities, for example
a Gaussian for continuous features and a Bernoulli or categorical distribution for discrete ones.
(b) [2 pts.] T or F: Since there is not a similar number of men and women in the dataset,
Naive Bayes will have high test error.
Solution: False. Since the data was randomly split, the training and testing sets will contain
approximately the same proportion of males and females, so this imbalance will not by itself
cause high testing error.
(c) [2 pts.] T or F: P (height|sex, hair) = P (height|sex).
Solution: True. This results from the conditional independence assumption required for
Naive Bayes.
(d) [2 pts.] T or F: P (height, hair|sex) = P (height|sex)P (hair|sex).
Solution: True. This results from the conditional independence assumption required for
Naive Bayes.
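
As an illustration (not part of the exam) of the model these assumptions describe, the sketch below fits a Naive Bayes classifier with a Gaussian class-conditional for height and a categorical class-conditional for hair on synthetic data; all numbers, including the per-class height distributions, are invented for the example.

import numpy as np

rng = np.random.default_rng(0)
hair_values = ["brown", "black", "blond", "red", "green"]

# Synthetic training data (labels: 0 = male, 1 = female); the distributions are invented.
n = 1000
y = rng.integers(0, 2, size=n)
height = np.where(y == 0, rng.normal(178, 8, n), rng.normal(165, 7, n))
hair = rng.choice(hair_values, size=n)

# MLE of the parameters under the Naive Bayes factorization
# P(height, hair | sex) = P(height | sex) P(hair | sex).
params = {}
for c in (0, 1):
    mask = (y == c)
    params[c] = {
        "prior": mask.mean(),
        "mu": height[mask].mean(),
        "sigma": height[mask].std() + 1e-9,
        "hair_probs": {h: (hair[mask] == h).mean() + 1e-9 for h in hair_values},
    }

def log_gaussian(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def predict(height_cm, hair_color):
    # Pick the class maximizing log prior + log P(height | sex) + log P(hair | sex).
    scores = {}
    for c in (0, 1):
        p = params[c]
        scores[c] = (np.log(p["prior"])
                     + log_gaussian(height_cm, p["mu"], p["sigma"])
                     + np.log(p["hair_probs"][hair_color]))
    return max(scores, key=scores.get)

print(predict(180, "brown"), predict(160, "blond"))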

1.2 Maximum Likelihood Estimation (MLE)


Assume we have a random sample that is Bernoulli distributed X1 , . . . , Xn ∼ Bernoulli(θ).
We are going to derive the MLE for θ. Recall that a Bernoulli random variable X takes
values in {0, 1} and has probability mass function given by
P(X; θ) = θ^X (1 − θ)^(1−X).

(a) [2 pts.] Derive the likelihood, L(θ; X1 , . . . , Xn ).


Solution:

L(θ; X_1, ..., X_n) = ∏_{i=1}^n p(X_i; θ)
                    = ∏_{i=1}^n θ^{X_i} (1 − θ)^{1−X_i}
                    = θ^{Σ_{i=1}^n X_i} (1 − θ)^{n − Σ_{i=1}^n X_i}.

Either of the final two expressions is acceptable.
(b) [2 pts.] Derive the following formula for the log likelihood:

ℓ(θ; X_1, ..., X_n) = (Σ_{i=1}^n X_i) log(θ) + (n − Σ_{i=1}^n X_i) log(1 − θ).

Solution:

ℓ(θ; X_1, ..., X_n) = log L(θ; X_1, ..., X_n)
                    = log [ θ^{Σ_{i=1}^n X_i} (1 − θ)^{n − Σ_{i=1}^n X_i} ]
                    = (Σ_{i=1}^n X_i) log(θ) + (n − Σ_{i=1}^n X_i) log(1 − θ).

(c) Extra Credit: [2 pts.] Derive the following formula for the MLE: θ̂ = (1/n) Σ_{i=1}^n X_i.

Solution: To find the MLE we solve d/dθ ℓ(θ; X_1, ..., X_n) = 0. The derivative is given by

d/dθ ℓ(θ; X_1, ..., X_n) = d/dθ [ (Σ_{i=1}^n X_i) log(θ) + (n − Σ_{i=1}^n X_i) log(1 − θ) ]
                         = (Σ_{i=1}^n X_i)/θ − (n − Σ_{i=1}^n X_i)/(1 − θ).

Next, we solve d/dθ ℓ(θ; X_1, ..., X_n) = 0:

(Σ_{i=1}^n X_i)/θ − (n − Σ_{i=1}^n X_i)/(1 − θ) = 0
⟺ (Σ_{i=1}^n X_i)(1 − θ) − (n − Σ_{i=1}^n X_i) θ = 0
⟺ Σ_{i=1}^n X_i − nθ = 0
⟺ θ̂ = (1/n) Σ_{i=1}^n X_i.
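
As a quick numerical sanity check (not part of the exam), the sketch below draws a Bernoulli sample and confirms that the sample mean matches the maximizer of the log likelihood found on a grid; the true parameter 0.3 and the sample size are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=500)               # X_1, ..., X_n ~ Bernoulli(0.3)

def log_likelihood(theta):
    # (sum_i X_i) log(theta) + (n - sum_i X_i) log(1 - theta)
    return x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta)

grid = np.linspace(0.001, 0.999, 9999)
theta_grid = grid[np.argmax(log_likelihood(grid))]  # numerical maximizer
theta_mle = x.mean()                                # closed-form MLE derived above
print(theta_grid, theta_mle)                        # both close to 0.3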

1.3 MAP vs MLE


Answer each question with T or F and provide a one sentence explanation of your
answer:

(a) [2 pts.] T or F: In the limit, as n (the number of samples) increases, the MAP and
MLE estimates become the same.
Solution: True. As the number of examples increases, the log-likelihood term grows in
magnitude (it is a sum over all n examples), while the log of the prior stays the same. In
the limit, the prior plays an insignificant role in the MAP estimate and the two estimates converge.

(b) [2 pts.] T or F: Naive Bayes can only be used with MAP estimates, and not MLE
estimates.
Solution: False. In Naive Bayes we need to estimate the distribution of each feature Xi
given the label Y . Any technique for estimating the distribution can be used, including
both the MLE and the MAP estimate.
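
A small sketch of point (a), assuming a Beta(2, 2) prior on θ (no prior is specified in the exam): the MAP estimate is the mode of the Beta posterior, and its gap from the MLE shrinks as n grows.

import numpy as np

rng = np.random.default_rng(2)
a, b, true_theta = 2.0, 2.0, 0.7
for n in (10, 100, 1000, 10000):
    x = rng.binomial(1, true_theta, size=n)
    mle = x.mean()                                   # maximizer of the likelihood
    map_est = (x.sum() + a - 1) / (n + a + b - 2)    # mode of the Beta(a + sum x, b + n - sum x) posterior
    print(n, round(mle, 4), round(map_est, 4))       # the gap shrinks as n increases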

1.4 Probability
Assume we have a sample space Ω. Answer each question with T or F. No justification
is required.

(a) [1 pt.] T or F: If events A, B, and C are disjoint then they are independent.
Solution: False. If they are disjoint, i.e. mutually exclusive, they are very dependent!
(b) [1 pt.] T or F: P(A|B) ∝ P(A)P(B|A) / P(A|B). (The sign '∝' means 'is proportional to'.)
Solution: False. P(A|B) = P(A)P(B|A) / P(B).
(c) [1 pt.] T or F: P(A ∪ B) ≤ P(A).
Solution: False. P(A ∪ B) ≥ P(A).

(d) [1 pt.] T or F: P(A ∩ B) ≥ P(A).

Solution: False. P(A ∩ B) ≤ P(A).

2 Errors, Errors Everywhere [20 pts.]

2.1 True Errors


Consider a classification problem on R^d with distribution D and target function c* : R^d → {±1}.
Let S be an i.i.d. sample drawn from the distribution D. Answer each question with T
or F and provide a one sentence explanation of your answer:

(a) [4 pts.] T or F: The true error of a hypothesis h can be lower than its training error on
the sample S.
Solution: True. If the sample S happens to favor part of the space where h makes
mistakes then the sample error will overestimate the true error. An extreme example is
when the hypothesis h has true error 0.5, but the training sample S contains a single
sample that h incorrectly classifies.

(b) [4 pts.] T or F: Learning theory allows us to determine with 100% certainty the true
error of a hypothesis to within any ε > 0 error.
Solution: False. There is always a small chance that the sample S is not representative
of the underlying distribution D, in which case the sample S may have no relationship
to the true error. The sample complexity bounds discussed in class show that this is
rare, but not impossible.
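
A simulation sketch of both points (the hypothesis, its true error of 0.3, and the sample size are invented): on a random sample of size 50, the training error of a fixed hypothesis is a Binomial(n, 0.3)/n random variable, so it often lands above the true error (part a) but is rarely far from it (part b).

import numpy as np

rng = np.random.default_rng(9)
true_error = 0.3     # probability that h disagrees with c* on a random example
n = 50
sample_errors = rng.binomial(n, true_error, size=10000) / n
print((sample_errors > true_error).mean())                  # frequently above the true error
print((np.abs(sample_errors - true_error) > 0.15).mean())   # but rarely far from it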

2.2 Training Sample Size


In this problem, we will consider the effect of training sample size n on a logistic regression
classifier with d features. The classifier is trained by optimizing the conditional log-likelihood.
The optimization procedure stops if the estimated parameters perfectly classify the training
data or they converge. The following plot shows the general trend for how the training and
testing error change as we increase the sample size n = |S|. Your task in this question is
to analyze this plot and identify which curve corresponds to the training and test error.
Specifically:

[Figure: plot of error (y-axis) versus training set size (x-axis), showing two curves labeled (i) and (ii).]

(a) [8 pts.] Which curve represents the training error? Please provide 1–2 sentences of
justification.
Solution: It is easier to correctly classify small training datasets. For example, if the
data contains just a single point, then logistic regression will always have zero training
error. On the other hand, we don’t expect a classifier learned from few examples to
generalize well, so for small training sets the true error is large. Therefore, curve (ii)
shows the general trend of the training error.

(b) [4 pts.] In one word, what does the gap between the two curves represent?
Solution: The gap between the two curves represents the amount of overfitting.
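
A sketch that reproduces the general trend of the plot using synthetic data and scikit-learn's LogisticRegression (the data distribution, noise level, and sample sizes are all invented, not the exam's): training error rises toward the test error as n grows, and the gap between them, the overfitting, shrinks.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
d = 10
w_true = rng.normal(size=d)

def sample(n):
    X = rng.normal(size=(n, d))
    # Noisy labels so neither error curve can reach zero for large n.
    y = (X @ w_true + rng.normal(scale=2.0, size=n) > 0).astype(int)
    return X, y

X_test, y_test = sample(5000)
for n in (20, 50, 100, 300, 1000, 3000):
    X_train, y_train = sample(n)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    train_err = 1 - clf.score(X_train, y_train)
    test_err = 1 - clf.score(X_test, y_test)
    print(n, round(train_err, 3), round(test_err, 3))  # train error rises, test error falls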

3 Linear and Logistic Regression [20 pts. + 2 Extra Credit]

3.1 Linear regression


Given that we have an input x and we want to estimate an output y, in linear regression
we assume the relationship between them is of the form y = wx + b + ε, where w and b are
real-valued parameters we estimate and ε represents the noise in the data. When the noise
is Gaussian, maximizing the likelihood of a dataset S = {(x1 , y1 ), . . . , (xn , yn )} to estimate
the parameters w and b is equivalent to minimizing the squared error:
arg min_{w,b} Σ_{i=1}^n (y_i − (w x_i + b))².

Consider the dataset S plotted in Fig. 1 along with its associated regression line. For
each of the altered data sets S^new plotted in Fig. 3, indicate which regression line (relative
to the original one) in Fig. 2 corresponds to the regression line for the new data set. Write
your answers in the table below.

Dataset          (a)   (b)   (c)   (d)   (e)
Regression line  (b)   (c)   (b)   (a)   (a)

Figure 1: An observed data set and its associated regression line.

Figure 2: New regression lines for altered data sets S^new.



(a) Adding one outlier to the original data set.
(b) Adding two outliers to the original data set.
(c) Adding three outliers to the original data set: two on one side and one on the other side.
(d) Duplicating the original data set.
(e) Duplicating the original data set and adding four points that lie on the trajectory of the original regression line.

Figure 3: New data sets S^new.
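
As a sanity check on rows (a) and (d) of the table (a sketch with synthetic data; the points and the outlier location are invented, not those in the figures): least squares is pulled toward an added outlier but is unchanged when the data set is duplicated.

import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 20)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.size)

def fit(x, y):
    # Least-squares slope and intercept, i.e. arg min over (w, b) of sum (y - (wx + b))^2.
    w, b = np.polyfit(x, y, deg=1)
    return round(w, 3), round(b, 3)

print("original:   ", fit(x, y))
# (a)-style change: one large outlier pulls the fitted line toward it.
print("one outlier:", fit(np.append(x, 5.0), np.append(y, 40.0)))
# (d)-style change: duplicating the data leaves the least-squares fit unchanged.
print("duplicated: ", fit(np.tile(x, 2), np.tile(y, 2)))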



3.2 Logistic regression


Given a training set {(x_i, y_i), i = 1, ..., n} where x_i ∈ R^d is a feature vector and y_i ∈ {0, 1}
is a binary label, we want to find the parameters ŵ that maximize the likelihood for the
training set, assuming a parametric model of the form

p(y = 1 | x; w) = 1 / (1 + exp(−w^T x)).

The conditional log likelihood of the training set is

ℓ(w) = Σ_{i=1}^n y_i log p(y = 1 | x_i; w) + (1 − y_i) log(1 − p(y = 1 | x_i; w)),

and the gradient is

∇ℓ(w) = Σ_{i=1}^n (y_i − p(y = 1 | x_i; w)) x_i.

(a) [5 pts.] Is it possible to get a closed form for the parameters ŵ that maximize the
conditional log likelihood? How would you compute ŵ in practice?
Solution: There is no closed form expression for maximizing the conditional log likeli-
hood. One has to consider iterative optimization methods, such as gradient descent, to
compute ŵ.
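
A minimal sketch of such an iterative method (not part of the exam): batch gradient ascent on the conditional log likelihood, using the gradient formula above on synthetic data with an invented true weight vector; the step size and iteration count are arbitrary.

import numpy as np

rng = np.random.default_rng(5)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(d)
lr = 0.1
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))       # p(y = 1 | x_i; w) for every training point
    grad = X.T @ (y - p)               # sum_i (y_i - p_i) x_i, the gradient given above
    w += lr * grad / n                 # ascent step (scaled by n for stability)
print(w)                               # roughly recovers w_true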

(b) [5 pts.] What is the form of the classifier output by logistic regression?
Solution: Given x, we predict ŷ = 1 if p(y = 1|x) ≥ p(y = 0|x). This reduces to
ŷ = 1 if w^T x ≥ 0, which is a linear classifier.

(c) [2 pts.] Extra Credit: Consider the case with binary features, i.e., x ∈ {0, 1}^d ⊂ R^d,
where feature x_1 is rare and happens to appear in the training set with only label 1.
What is ŵ_1? Is the gradient ever zero for any finite w? Why is it important to include
a regularization term to control the norm of ŵ?
Solution: If a binary feature is active only with label 1 in the training set, then maximizing
the conditional log likelihood drives the weight associated with that feature to infinity:
whenever the feature is observed we want to predict 1 irrespective of everything else, so
the corresponding gradient component is never zero for any finite w. This is undesirable
from the point of view of generalization, as most likely we do not believe this rare feature
carries that much information about class 1; the association is most likely a spurious
co-occurrence. Controlling the norm of the weight vector prevents these pathological cases.
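
A sketch illustrating this answer (synthetic data; the rare-feature frequency, step size, and regularization strength λ = 0.1 are invented): the weight on a feature seen only with label 1 keeps growing under plain maximum likelihood, while an L2 penalty keeps it bounded.

import numpy as np

rng = np.random.default_rng(6)
n = 500
x_common = rng.integers(0, 2, size=n).astype(float)
x_rare = np.zeros(n); x_rare[:5] = 1.0          # rare feature, active only 5 times
X = np.column_stack([x_common, x_rare])
y = (rng.random(n) < 0.5).astype(float)
y[:5] = 1.0                                     # the rare feature co-occurs only with label 1

def fit(lam, steps=20000, lr=0.5):
    w = np.zeros(2)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        grad = X.T @ (y - p) / n - lam * w      # gradient of (log likelihood)/n - (lam/2)*||w||^2
        w += lr * grad
    return w

print("no regularization:", fit(lam=0.0))   # rare-feature weight is large and would keep growing
print("L2 regularization:", fit(lam=0.1))   # rare-feature weight stays small and bounded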

4 SVM, Perceptron and Kernels [20 pts. + 4 Extra Credit]

4.1 True or False


Answer each of the following questions with T or F and provide a one line justification.
(a) [2 pts.] Consider two datasets D^(1) and D^(2), where D^(1) = {(x_1^(1), y_1^(1)), ..., (x_n^(1), y_n^(1))}
and D^(2) = {(x_1^(2), y_1^(2)), ..., (x_m^(2), y_m^(2))}, such that x_i^(1) ∈ R^{d_1} and x_i^(2) ∈ R^{d_2}.
Suppose d_1 > d_2 and n > m. Then the maximum number of mistakes a perceptron algorithm will make
is higher on dataset D^(1) than on dataset D^(2).

Solution: False. The maximum number of mistakes made by a perceptron depends on the margin γ
and the radius R of the training data, not on its dimension or size. The mistake bound is (R/γ)².

(b) [2 pts.] Suppose φ(x) is an arbitrary feature mapping from input x ∈ X to φ(x) ∈ R^N
and let K(x, z) = φ(x) · φ(z). Then K(x, z) will always be a valid kernel function.

Solution: True. K is a kernel if it is an inner product after applying some feature
transformation.

(c) [2 pts.] Given the same training data, in which the points are linearly separable, the
margin of the decision boundary produced by SVM will always be greater than or equal
to the margin of the decision boundary produced by Perceptron.

Solution: True. SVM explicitly maximizes margin; Perceptron does not differentiate
between decision boundaries as long as they lie within the margin of the training data.

4.2 Multiple Choice


(a) [3 pts.] If the data is linearly separable, SVM minimizes ‖w‖² subject to the constraints
∀i, y_i w · x_i ≥ 1. In the linearly separable case, which of the following may happen to the
decision boundary if one of the training samples is removed? Circle all that apply.

• Shifts toward the point removed Yes


• Shifts away from the point removed No
• Does not change Yes

(b) [3 pts.] Recall that when the data are not linearly separable, SVM minimizes ‖w‖² +
C Σ_i ξ_i subject to the constraints that ∀i, y_i w · x_i ≥ 1 − ξ_i and ξ_i ≥ 0. Which of the
following may happen to the size of the margin if the tradeoff parameter C is increased?
Circle all that apply.

• Increases No
• Decreases Yes

• Remains the same Yes

Proof of part (b):

Let w_0* = argmin_w ‖w‖² + c_0 Σ_i ξ_i and let w_1* = argmin_w ‖w‖² + c_1 Σ_i ξ_i, with
c_1 > c_0. We need to show ‖w_1*‖² ≥ ‖w_0*‖². Define ξ_i(0) to be the slack variables under
w_0* and ξ_i(1) to be the slack variables under w_1*.

We first show that for any w′ with ‖w′‖² < ‖w_0*‖², we have Σ_i ξ_i′ > Σ_i ξ_i(0), where ξ_i′
are the slack variables under w′.

By contradiction, assume ‖w′‖² < ‖w_0*‖² and Σ_i ξ_i′ ≤ Σ_i ξ_i(0). Then
‖w′‖² + c_0 Σ_i ξ_i′ < ‖w_0*‖² + c_0 Σ_i ξ_i(0), so w_0* ≠ argmin_w ‖w‖² + c_0 Σ_i ξ_i, a
contradiction. Thus for all w′ with ‖w′‖² < ‖w_0*‖², Σ_i ξ_i′ > Σ_i ξ_i(0).

Next, we show that if w_0* = argmin_w ‖w‖² + c_0 Σ_i ξ_i and w_1* = argmin_w ‖w‖² + c_1 Σ_i ξ_i,
then ‖w_1*‖² ≥ ‖w_0*‖².

By contradiction, assume ‖w_1*‖² < ‖w_0*‖². Since w_0* = argmin_w ‖w‖² + c_0 Σ_i ξ_i:

‖w_0*‖² + c_0 Σ_i ξ_i(0) ≤ ‖w_1*‖² + c_0 Σ_i ξ_i(1).

Since c_1 > c_0 and, by the claim above, Σ_i ξ_i(1) > Σ_i ξ_i(0), it follows that

‖w_0*‖² + c_1 Σ_i ξ_i(0) < ‖w_1*‖² + c_1 Σ_i ξ_i(1).

But w_1* = argmin_w ‖w‖² + c_1 Σ_i ξ_i. This yields a contradiction. Thus ‖w_1*‖² ≥ ‖w_0*‖².
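
An empirical sketch of this result using scikit-learn's SVC with a linear kernel on synthetic, slightly overlapping data (the data and the grid of C values are invented): the geometric margin 1/‖w‖ does not increase as C grows.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
n = 100
X = np.vstack([rng.normal([-2, 0], 1.0, size=(n, 2)),
               rng.normal([+2, 0], 1.0, size=(n, 2))])
y = np.array([-1] * n + [+1] * n)

for C in (0.01, 0.1, 1.0, 10.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(C, round(1 / np.linalg.norm(w), 3))   # geometric margin; non-increasing in C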

4.3 Analysis
(a) [4 pts.] In one or two sentences, describe the benefit of using the kernel trick.

Solution: The kernel trick lets us learn with features implicitly mapped into a higher-dimensional
space while avoiding the extra computational cost of computing that mapping explicitly.
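
A concrete sketch of this point (a standard degree-2 polynomial kernel example, not from the exam): the kernel value (x · z)² equals the inner product under an explicit feature map with d² coordinates, so the high-dimensional inner product can be computed without ever building the map.

import numpy as np

def phi(x):
    # Explicit quadratic feature map: all pairwise products x_i * x_j.
    return np.outer(x, x).ravel()

rng = np.random.default_rng(8)
x, z = rng.normal(size=5), rng.normal(size=5)
print(np.dot(phi(x), phi(z)))   # explicit mapping: O(d^2) features
print(np.dot(x, z) ** 2)        # kernel trick: O(d) work, same value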

(b) [4 pts.] The concept of margin is essential in both SVM and Perceptron. Describe why a
large margin separator is desirable for classification.
Solution: Separators with large margin will have low generalization errors with high
probability.

(c) [4 pts.] Extra Credit: Consider the dataset in Fig. 4. Under the SVM formulation in
section 4.2(a),

(1) Draw the decision boundary on the graph.



(2) What is the size of the margin?


(3) Circle all the support vectors on the graph.

Solution: The decision boundary is x_2 − 2.5 = 0. The size of the margin is 0.5. The support vectors are x2, x3, x6, and x7.

Figure 4: SVM toy dataset



5 Learning Theory [20 pts.]

5.1 True or False


Answer each of the following questions with T or F and provide a one line justification.

(a) [3 pts.] T or F: It is possible to label 4 points in R² in all possible 2⁴ ways via linear
separators in R².
Solution: F. The VC dimension of linear separators in R² is 3, hence no set of 4 points can
be shattered, i.e., labeled in all possible ways.
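
A brute-force sketch of this fact (assuming scipy is available; the point sets are invented examples): each labeling is checked for linear separability by solving the feasibility LP "find (w, b) with y_i(w · x_i + b) ≥ 1", and a set is shattered if every labeling is separable.

import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(points, labels):
    # Feasibility LP over variables (w1, w2, b): y_i (w . x_i + b) >= 1 for all i.
    A_ub = -(labels[:, None] * np.c_[points, np.ones(len(points))])
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(len(points)),
                  bounds=[(None, None)] * 3)
    return res.success

def shattered(points):
    points = np.asarray(points, dtype=float)
    return all(separable(points, np.array(lab))
               for lab in product([-1.0, 1.0], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # True: 3 points, all 2^3 labelings separable
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: e.g. the XOR labeling is not separable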

(b) [3 pts.] T or F: To show that the VC-dimension of a concept class H (containing
functions from X to {0, 1}) is d, it is sufficient to show that there exists a subset of X
with size d that can be labeled by H in all possible 2^d ways.
Solution: F. This is only a necessary condition. We also need to show that no subset
of X with size d + 1 can be shattered by H.

(c) [3 pts.] T or F: The VC dimension of a finite concept class H is upper bounded by
⌈log₂ |H|⌉.
Solution: T. For any finite set S, if H shatters S, then H needs to have at least 2^{|S|}
elements, which implies |S| ≤ ⌈log₂ |H|⌉.

(d) [3 pts.] T or F: The VC dimension of a concept class with infinite size is also infinite.
Solution: F. Consider all the half-spaces in R², which form a class of infinite cardinality
whose VC dimension is 3.

(e) [3 pts.] T or F: For every pair of classes H1, H2, if H1 ⊆ H2 and H1 ≠ H2, then
VCdim(H1) < VCdim(H2) (note that this is a strict inequality).
Solution: F. Let H1 be the collection of all half-spaces in R² with finite slopes and
let H2 be the collection of all half-spaces in R². Clearly H1 ⊆ H2 and H1 ≠ H2, but
VCdim(H1) = VCdim(H2) = 3.

(f) [3 pts.] T or F: Given a realizable concept class and a set of training instances, a
consistent learner will output a concept that achieves 0 error on the training instances.
Solution: T. Since the concept class is realizable, there is a concept in the class (e.g., the
target concept) that labels all training instances correctly; by definition, a consistent learner
outputs such a concept, which achieves 0 error on the training set.

5.2 VC dimension
Briefly explain in 2–3 sentences the importance of sample complexity and VC dimension
in learning with generalization guarantees.
Solution: Sample complexity guarantees quantify how many training samples we need
to see from the underlying data distribution D in order to guarantee that uniformly for
all hypotheses in the class of functions under consideration we have that their empirical
error rates are close to their true errors. This is important because we care about finding
a hypothesis of small true error, but we can only optimize over a fixed training sample.
VC bounds are one kind of sample complexity guarantee, where the bound depends on the
VC-dimension of the hypothesis class, and they are particularly useful when the class of
functions is infinite.

6 Extra Credit: Neural Networks [6 pts.]


In this problem we will use a neural network to classify the crosses (×) from the circles (◦) in
the simple dataset shown in Figure 5a. Even though the crosses and circles are not linearly
separable, we can break the examples into three groups, S1 , S2 , and S3 (shown in Figure 5a)
so that S1 is linearly separable from S2 and S2 is linearly separable from S3 . We will exploit
this fact to design weights for the neural network shown in Figure 5b in order to correctly
classify this training set. For all nodes, we will use the threshold activation function

φ(z) = 1 if z > 0, and φ(z) = 0 if z ≤ 0.

[Figure 5: (a) The dataset with groups S1, S2, and S3 plotted in the (x1, x2) plane (axes from 0 to 5). (b) The neural network architecture: inputs x1 and x2 feed hidden units h1 (weights w11, w12) and h2 (weights w21, w22), which feed the output unit (weights w31, w32).]

[Figure 6: NN classification. Three blank copies of the (x1, x2) plot for drawing decision boundaries: (a) sets S2 and S3, (b) sets S1 and S2, (c) sets S1, S2, and S3.]

(a) First we will set the parameters w11 , w12 and b1 of the neuron labeled h1 so that its
output h1 (x) = φ(w11 x1 + w12 x2 + b1 ) forms a linear separator between the sets S2 and
S3 .

(1) [1 pt.] On Fig. 6a, draw a linear decision boundary that separates S2 and S3 .
(2) [1 pt.] Write down the corresponding weights w11 , w12 , and b1 so that h1 (x) = 0 for
all points in S3 and h1 (x) = 1 for all points in S2 .
Solution: w11 = −1, w12 = 0, b1 = 3. With these parameters, we have w11 x1 +
w12 x2 + b1 > 0 if and only if −x1 > −3, which is equivalent to x1 < 3.

(b) Next we set the parameters w21 , w22 and b2 of the neuron labeled h2 so that its output
h2 (x) = φ(w21 x1 + w22 x2 + b2 ) forms a linear separator between the sets S1 and S2 .

(1) [1 pt.] On Fig. 6b, draw a linear decision boundary that separates S1 and S2 .
(2) [1 pt.] Write down the corresponding weights w21 , w22 , and b2 so that h2 (x) = 0 for
all points in S1 and h2 (x) = 1 for all points in S2 .
Solution: The provided line has a slope of −2 and crosses the x2 axis at the value
5. From this, the equation for the region above the line (those points for which
h2(x) = 1) is given by x2 ≥ −2x1 + 5 or, equivalently, x2 + 2x1 − 5 ≥ 0. Therefore,
w21 = 2, w22 = 1, b2 = −5.

(c) Now we have two classifiers h1 (to classify S2 from S3 ) and h2 (to classify S1 from S2 ). We
will set the weights of the final neuron of the neural network based on the results from h1
and h2 to classify the crosses from the circles. Let h3(x) = φ(w31 h1(x) + w32 h2(x) + b3).

(1) [1 pt.] Compute w31 , w32 , b3 such that h3 (x) correctly classifies the entire dataset.
Solution: Consider the weights w31 = w32 = 1 and b3 = −1.5. With these weights,
h3 (x) = 1 if h1 (x) + h2 (x) ≥ 1.5. For points in S1 and S3 either h1 or h2 is zero, so
they will be classified as 0, as required. For points in S2 , both h1 and h2 output 1,
so the point is classified as 1. This rule has zero training error.
(2) [1 pt.] Draw your decision boundary in Fig. 6c.
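
A sketch that wires up these weights (not part of the exam; the three test points are invented examples assumed to lie in S1, S2, and S3 respectively, since the actual training points are only given in the figure):

def phi(z):
    # Threshold activation from the problem statement.
    return 1.0 if z > 0 else 0.0

def network(x1, x2):
    h1 = phi(-1.0 * x1 + 0.0 * x2 + 3.0)    # w11 = -1, w12 = 0, b1 = 3: h1 = 1 iff x1 < 3
    h2 = phi(2.0 * x1 + 1.0 * x2 - 5.0)     # w21 = 2, w22 = 1, b2 = -5: h2 = 1 iff 2*x1 + x2 > 5
    return phi(1.0 * h1 + 1.0 * h2 - 1.5)   # w31 = w32 = 1, b3 = -1.5: fires only if h1 = h2 = 1

print(network(1, 1))   # assumed point in S1 -> 0.0
print(network(2, 3))   # assumed point in S2 -> 1.0
print(network(4, 3))   # assumed point in S3 -> 0.0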

Use this page for scratch work
