cs224n Practice Midterm 3 Solutions
Solution: C. The training loss always decreases, whether the model is overfitting
or underfitting. If the step size is too small, convergence is slow, but the training
loss still goes down. If the step size is too large, it can cause a bouncing effect
because we overshoot the optimum; this leads to an increase in training loss and
a decrease in training accuracy.
(b) (2 points) In which of the following cases will there be no update when doing SGD
with a max margin loss, J = max(0, 1 − s + sc ) where s is the true window score
and sc is the corrupt window score?
A. (s, sc ) = (1, 0.5)
B. (s, sc ) = (2, 0.5)
C. (s, sc ) = (1, 1.5)
D. (s, sc ) = (2, 1.5)
Solution: B. The max-margin loss is 0 (so there is no gradient and no update)
when s − sc ≥ 1, which holds only for option B.
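A quick numerical check of J = max(0, 1 − s + sc) for the four options:

# Evaluate the max-margin loss J = max(0, 1 - s + s_c) for each option.
options = {"A": (1, 0.5), "B": (2, 0.5), "C": (1, 1.5), "D": (2, 1.5)}

for name, (s, s_c) in options.items():
    J = max(0.0, 1.0 - s + s_c)
    # SGD makes no update exactly when the loss (and hence its gradient) is 0.
    print(f"{name}: J = {J}  ->  {'no update' if J == 0 else 'update'}")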
(d) (2 points) Which one of the following is a valid projective dependency parse?
(Figure: four candidate dependency parses (A)–(D) over the words ROOT W X Y Z; the dependency arcs are not reproduced in this extraction.)
Solution: C. In (A) Z has no head, in (B) W has two heads, and (D) is not
projective because ROOT→X and Y→W cross.
(e) (2 points) We have learned that dense word vectors learned through word2vec or
GloVe have many advantages over using sparse one-hot word vectors. Which of the
following is NOT an advantage dense vectors have over sparse vectors?
A. Models using dense word vectors generalize better to unseen words than
those using sparse vectors.
B. Models using dense word vectors generalize better to rare words than
those using sparse vectors.
C. Dense word vectors encode similarity between words while sparse vectors
do not.
D. Dense word vectors are easier to include as features in machine learning
systems than sparse vectors.
(g) (2 points) In order for the following TensorFlow code to run, which of the following
MUST be contained as keys of feed_dict?
import tensorflow as tf
W = tf.Variable(tf.random_normal((200, 20)))
b = tf.Variable(tf.zeros((20,)))
x = tf.placeholder(tf.float32, (32, 200))
y = tf.matmul(x, W) + b
prediction = tf.nn.softmax(y)
sess = tf.Session()
sess.run(tf.initialize_all_variables())
feed_dict = _________________________
sess.run([prediction], feed_dict=feed_dict)
A. x, W, b
B. only x
C. only prediction
D. x and label
(h) (2 points) A multi-layer neural network model trained using stochastic gradient
descent on the same dataset with different initializations for its parameters is
guaranteed to learn the same parameters. A. True B. False
Solution: False. The loss function for a multi-layer neural network is non-
convex and hence SGD is only guaranteed to converge to a local optimum.
(j) (2 points) Stochastic gradient descent results in a smoother convergence plot (loss
vs epochs) as compared to batch gradient descent. A. True B. False
Please write answers to the following questions in a sentence or two. Each part is worth
6 points.
(a) Dependency Parsing
i. (2 points) Can the neural dependency parser we learned in class correctly parse
the sentence “John saw a dog yesterday which was a Yorkshire terrier.”? Explain.
ii. (2 points) What is the ambiguity in parsing the sentence “There’s an awful
cost to getting a PhD that no one talks about”?
Solution: The phrase “that no one talks about” could either be referring
to the cost, “No one talks about the awful cost of getting a PhD” or to the
PhD, “Getting a PhD that no one talks about is awfully costly”.
iii. (2 points) Name at least two types of features that would be helpful to provide
as input to a neural dependency parser and explain why.
Solution: Here are some useful features that are indicative of transition
decisions:
• Word vectors for the top ns words on the stack. These are the words that the
next arc would connect, so they help identify which transition to take.
• Word vectors for the first nb words on the buffer. These provide lookahead
context for deciding whether to shift more words before attaching.
where θi is the parameter to be updated, ∇θi L is the gradient of the loss (cost)
with respect to θi, ε is a hyperparameter usually between 10^−8 and 10^−4, and η is a
hyperparameter similar to the step size in SGD. cachei is initialized to zero at the start
of the algorithm.
(i) State two differences between AdaGrad and SGD in terms of step size and (ii)
describe what effect they would have.
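For concreteness, a minimal NumPy sketch of the standard AdaGrad step matching the symbols described above (the function names and default values are illustrative assumptions):

import numpy as np

def adagrad_step(theta, grad, cache, eta=0.01, eps=1e-8):
    """One AdaGrad update: the effective step size is scaled per parameter
    by the accumulated squared gradients in `cache` (initialized to zero)."""
    cache += grad ** 2
    theta -= eta * grad / (np.sqrt(cache) + eps)
    return theta, cache

def sgd_step(theta, grad, eta=0.01):
    """Plain SGD uses the same global step size eta for every parameter."""
    return theta - eta * grad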
Solution:
i. (2 points) What would the gradient of the sigmoid be with respect to an input
that is very large?
Solution: Because the sigmoid is almost flat for very large values, its gra-
dient will be almost 0.
ii. (2 points) Why might this be a problem for training neural networks?
iii. (2 points) How does the ReLU activation ReLU(z) = max(0, z) solve this prob-
lem?
Solution: Because the ReLU is linear for positive inputs, it does not saturate for
large (positive) values, and hence freely allows gradients to flow back and change
weights in the network. However, it does not solve the problem for large negative
inputs, and ReLU units can “die” if a gradient update moves parameters into the
flat region.
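A small numerical illustration (using σ′(z) = σ(z)(1 − σ(z)) and ReLU′(z) = 1{z > 0}):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-20.0, -2.0, 0.0, 2.0, 20.0]:
    sig_grad = sigmoid(z) * (1.0 - sigmoid(z))   # saturates to ~0 for |z| large
    relu_grad = 1.0 if z > 0 else 0.0            # constant 1 for z > 0, 0 otherwise
    print(f"z = {z:6.1f}   sigmoid grad = {sig_grad:.2e}   ReLU grad = {relu_grad}")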
(Figure: an LSTM unrolled over the input word vectors, with a softmax layer on the last hidden state producing the label y.)
This model inputs word vectors to the LSTM model at each time step and uses the
last hidden state vector to predict the sentiment label (y).
Recall that in assignment 1 we used a simple “bag-of-vectors” model for sentiment
classification: we used the average of all the word vectors in a sentence to predict
the sentiment label.
i. (3 points) Name at least one benefit of the LSTM model over the bag-of-vectors
model.
ii. (3 points) If we chose to update our word vectors when training the LSTM
model on sentiment classification data, how would these word vectors differ
from ones not updated during training? Explain with an example. Assume that
the word vectors of the LSTM model were initialized using GloVe or word2vec.
Solution: The word vectors learned this way can capture more sentiment
information. Embeddings trained purely on co-occurrence statistics (word2vec
or GloVe) give words like ‘good’ and ‘bad’ very similar representations because
they appear in similar contexts; fine-tuning the vectors on sentiment labels
would push such antonyms apart.
(Figure: a feed-forward neural network language model predicting the next word X from the context “Green eggs and”, with context word vectors x1, x2, x3 feeding a hidden layer h and an output distribution ŷ.)

The context word vectors are concatenated into x = [x1; x2; . . . ; xN], and the model computes

h = tanh(W x + b)
ŷ = softmax(U h + d)
J = CE(y, ŷ), where CE(y, ŷ) = − Σi yi log(ŷi).
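For concreteness, a minimal NumPy sketch of this forward pass (all sizes below are illustrative assumptions, not taken from the exam):

import numpy as np

d, H, V, N = 50, 100, 10000, 3            # word dim, hidden size, vocab size, context size
rng = np.random.default_rng(0)

x = rng.standard_normal(N * d)            # x = [x1; x2; x3], concatenated context vectors
W, b = rng.standard_normal((H, N * d)), np.zeros(H)
U, dvec = rng.standard_normal((V, H)), np.zeros(V)

h = np.tanh(W @ x + b)                    # h = tanh(Wx + b)
scores = U @ h + dvec
y_hat = np.exp(scores - scores.max())     # softmax, shifted for numerical stability
y_hat /= y_hat.sum()

y = np.zeros(V); y[42] = 1.0              # one-hot target (index 42 is arbitrary)
J = -np.sum(y * np.log(y_hat))            # cross-entropy CE(y, y_hat)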
(a) (4 points) State two important differences between the NNLM and the CBOW model we
learned in class. Explain how each might affect the word vectors learned.
Solution:
• The CBOW model simply sums (averages) the context word vectors, while the NNLM
combines the context words non-linearly through its hidden layer. Thus the NNLM can
learn to treat “not good to” differently from “good to not”, etc.
ii. (3 points) Describe at least one way to change the model that would reduce
this complexity.
Solution: Let z1 = W x + b (so h = tanh(z1)) and z2 = U h + d (so ŷ = softmax(z2)).

δ1 = ∂CE/∂z2 = ŷ − y
δ2 = ∂CE/∂h = U^T δ1
δ3 = ∂CE/∂z1 = δ2 ◦ (1 − tanh(z1)^2)
∂CE/∂x = W^T δ3
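A finite-difference check of these formulas, using small illustrative sizes; the numeric gradient with respect to x should match W^T δ3:

import numpy as np

rng = np.random.default_rng(1)
Dx, H, V = 6, 5, 7                                   # small illustrative sizes
W, b = rng.standard_normal((H, Dx)), rng.standard_normal(H)
U, d = rng.standard_normal((V, H)), rng.standard_normal(V)
x = rng.standard_normal(Dx)
y = np.zeros(V); y[2] = 1.0                          # one-hot target

def loss(x):
    h = np.tanh(W @ x + b)
    s = U @ h + d
    y_hat = np.exp(s - s.max()); y_hat /= y_hat.sum()
    return -np.sum(y * np.log(y_hat)), h

J, h = loss(x)
delta1 = loss(x)[0] * 0 + (np.exp(U @ h + d - (U @ h + d).max()) / np.exp(U @ h + d - (U @ h + d).max()).sum()) - y   # dCE/dz2 = y_hat - y
delta2 = U.T @ delta1                                # dCE/dh
delta3 = delta2 * (1.0 - h ** 2)                     # dCE/dz1, since h = tanh(z1)
grad_x = W.T @ delta3                                # dCE/dx

eps = 1e-6
num = np.array([(loss(x + eps * np.eye(Dx)[i])[0] - loss(x - eps * np.eye(Dx)[i])[0]) / (2 * eps)
                for i in range(Dx)])
print(np.max(np.abs(num - grad_x)))                  # should be tiny (finite-difference agreement)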
In the autoencoder, the input word vectors x1, x2 ∈ R^{Dx×1} are first concatenated into a
single vector x = [x1; x2] ∈ R^{2Dx×1}, and the parent vector p is computed as:

p = max(0, W1 x + b1) ∈ R^{Dx×1}
During training, we use the parent vector p to reconstruct the input vectors:
x′ = [x′1; x′2] = W2 p + b2 ∈ R^{2Dx×1}
where x′1, x′2 ∈ R^{Dx×1} are the reconstructions. Correspondingly, a reconstruction loss
that computes the squared Euclidean distance between inputs and reconstructions is used during
training:

J1 = (1/2) ||x′ − x||^2 ∈ R.
For sentiment classification, the parent vector is used to predict a sentiment class ŷ for
the pair of the input words:
ŷ = W3 p + b3 ∈ R^{Dc×1},
J2 = CE(y, ŷ) ∈ R.
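A minimal NumPy sketch of one autoencoder step and both losses; the ReLU nonlinearity for p follows the 1{x > 0} hint in part (d), and the softmax before the cross-entropy, the sizes, and the variable names are illustrative assumptions:

import numpy as np

Dx, Dc = 4, 5                                     # illustrative sizes
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((Dx, 2 * Dx)), np.zeros(Dx)
W2, b2 = rng.standard_normal((2 * Dx, Dx)), np.zeros(2 * Dx)
W3, b3 = rng.standard_normal((Dc, Dx)), np.zeros(Dc)

x1, x2 = rng.standard_normal(Dx), rng.standard_normal(Dx)
x = np.concatenate([x1, x2])                      # x = [x1; x2]
p = np.maximum(0.0, W1 @ x + b1)                  # parent vector (ReLU assumed)
x_rec = W2 @ p + b2                               # reconstruction [x1'; x2']
J1 = 0.5 * np.sum((x_rec - x) ** 2)               # reconstruction loss

y = np.zeros(Dc); y[1] = 1.0                      # one-hot sentiment label
scores = W3 @ p + b3
y_hat = np.exp(scores - scores.max()); y_hat /= y_hat.sum()   # softmax assumed for a well-defined CE
J2 = -np.sum(y * np.log(y_hat))                   # classification loss CE(y, y_hat)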
(b) (2 points) How do the reconstructed vectors and reconstruction loss help in learning
the parent representation?
Solution: The reconstruction loss prevents the network from ignoring the con-
tribution of one of x1 or x2 . In other words, it forces the network to capture the
joint ‘meaning’ of x1 and x2 .
(c) (2 points) i. What is the shape of each weight and bias term: W1 , W2 , W3 , b1 , b2 , b3 ?
ii. How many parameters does the model have in total?
(d) Compute the following gradients for the reconstruction loss:
Hint: You can use the following notation
1{x > 0} = 1 if x > 0, and 0 otherwise.
Using it on a matrix performs the operation element-wise, e.g.
1{ [1 0; −1 3] > 0 } = [1 0; 0 1]
i. (2 points) δ1 = ∂J1/∂p =
ii. (2 points) δ2 = ∂J1/∂h = (where h is defined as h = W1 x + b1)
iii. (2 points) ∂J1/∂W1 =
iv. (2 points) ∂J1/∂b1 =
v. (2 points) ∂J1/∂x1 =
(e) (2 points) How would you construct a network using only copies of the autoencoder
above to predict a sentiment label for a whole sentence, not just a pair of words?
Remember that each sentence could be of arbitrary length.
Solution: Apply the autoencoder recursively: combine adjacent vectors into parent
vectors (which have the same dimension Dx as word vectors), feed those parents back
in as inputs, and repeat until the whole sentence is reduced to a single vector, which
is then used to predict the sentiment label. We would also consider answers
like averaging the predictions made on every pair of words to still be correct
(though not preferable), because such a network does not use any element other
than the autoencoders themselves.
ht = f (w1 xt + w2 ht−1 + b2 )
yt = g(w3 ht + b3 ),
Fill in weights (w1, w2, w3) and biases (b2, b3) so that the RNN initially outputs 0, but
as soon as it receives an input of 1, it switches to outputting 1 for all subsequent
time steps. For instance, the input 0001010 produces the output 0001111. The
hidden unit has an initial value of 0.
You don’t need to provide an explanation, but doing so may help you receive partial
credit.
Hint: In one possible solution, the hidden unit has an activation ht = 0 until there’s
an input xt = 1, at which point it switches to maintaining an activation of 1 forever.
The output unit always predicts the same as the hidden unit, i.e. yt = ht .
Solution: There are many possible solutions, but only one main scenario: the
output unit must predict the same as the hidden state.
This requires that:
b3 = 0
w3 + b3 = 1
b2 < 0
w2 + b2 ≥ 0
w1 + b2 ≥ 0
(The alternative scenario, in which the hidden unit starts out “on” and is switched off
by the first 1 while the output unit inverts it, would instead require
b3 = 1
w3 + b3 = 0
b2 < 0
w2 + b2 ≥ 0
w1 + w2 + b2 < 0
but it is ruled out here because the hidden unit’s initial value is 0.)
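A quick simulation with one concrete choice satisfying the first set of constraints (w1 = w2 = w3 = 1, b2 = −0.5, b3 = 0), assuming f is the hard threshold 1{z ≥ 0} and g is the identity; the exact forms of f and g from the problem statement are not shown in this excerpt:

def step(z):
    return 1.0 if z >= 0 else 0.0         # hard-threshold hidden unit (assumed form of f)

w1, w2, w3, b2, b3 = 1.0, 1.0, 1.0, -0.5, 0.0   # one choice satisfying the constraints

h, outputs = 0.0, []                      # hidden unit starts at 0
for x in [0, 0, 0, 1, 0, 1, 0]:
    h = step(w1 * x + w2 * h + b2)        # h switches to 1 at the first x = 1 and stays 1
    outputs.append(int(w3 * h + b3))      # y_t = g(w3 h_t + b3) with g = identity
print(outputs)                            # [0, 0, 0, 1, 1, 1, 1]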
(b) Now suppose that f is a sigmoid unit and that the input is always 0. The following
diagram shows how the next state ht+1 changes with the current state ht when
w2 = 3 and b2 = −1, i.e. ht+1 = σ(3ht − 1). The diagram also shows that ht+1 = ht
when ht = 0.80.
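As a stand-in for the diagram, a short numerical sketch of the iteration ht+1 = σ(3ht − 1):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for h0 in [0.1, 0.5, 0.9]:
    h = h0
    for _ in range(30):
        h = sigmoid(3.0 * h - 1.0)        # h_{t+1} = sigma(3 h_t - 1)
    print(f"h0 = {h0}: h_30 = {h:.4f}")   # all starting points converge to ~0.80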
i. (2 points) Would ht+1 be greater than or less than ht if ht = 0.5? Greater than.
ii. (2 points) Would ht+1 be greater than or less than ht if ht = 0.9? Less than.
iii. (2 points) What would ht+1 look like as we take increasingly longer sequences,
i.e. t → ∞? It converges to the fixed point ht = 0.80.
iv. (1 point) For what range of initial values h0 would this network experience
exploding gradients for long sequences? None.
v. (1 point) For what range of initial values h0 would this network experience
vanishing gradients for long sequences? All.
vi. (4 points) We have learned in class that ReLUs are good at preserving gradients
in deep neural networks. Would it be wise to replace the sigmoid activation
above with a ReLU activation unit? Explain.
Solution: No. When ht > 1/3, the ReLU unit has a slope of 3 (since w2 = 3), so the
network would experience exploding gradients. When ht < 1/3, the ReLU output and
its gradient are 0, so the network would experience vanishing gradients.
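The same kind of sketch for a ReLU recurrence ht+1 = max(0, 3ht − 1), tracking the accumulated derivative ∂hT/∂h0 (illustrative):

def relu_step(h):
    return max(0.0, 3.0 * h - 1.0)        # h_{t+1} = ReLU(3 h_t - 1)

for h0 in [0.4, 0.6]:
    h, grad = h0, 1.0
    for _ in range(10):
        grad *= 3.0 if h > 1.0 / 3.0 else 0.0   # local slope of the recurrence at h_t
        h = relu_step(h)
    print(f"h0 = {h0}: h_10 = {h:.1f}, accumulated gradient factor = {grad:.1f}")
    # h0 = 0.4 collapses to 0 (vanishing); h0 = 0.6 blows up with gradient factor 3^10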