CS224N: Natural Language Processing with Deep Learning

Winter 2017 Midterm Exam


This examination consists of 14 printed sides, 5 questions, and 100 points. The exam accounts
for 17% of your total grade. Please write your answers on the exam paper in the spaces provided.
You may use the back of a page if necessary but please mark this. You have 80 minutes
to complete the exam. Exams turned in after the end of the examination period will be either
penalized or not graded at all. The exam is closed book and allows only a single page of notes.
You are not allowed to: use a phone, laptop/tablet, calculator or spreadsheet, access the internet,
communicate with others, or use other programming capabilities. You must disable all networking
and radios (“airplane mode”).
If you are taking the exam remotely, please send us the exam by Tuesday, February 14 at
5:50 pm PDT as a scanned PDF copy to scpd-distribution@lists.stanford.edu.
Stanford University Honor Code: I attest that I have not given or received aid in this
examination, and that I have done my share and taken an active part in seeing to it that others as
well as myself uphold the spirit and letter of the Honor Code.

SUNet ID: Signature:

Name (printed):

Question Points Score


Multiple choice 20
Short questions 24
Feedforward neural network language models 18
Autoencoders 18
Recurrent neural networks 20
Total: 100

The standard of academic conduct for Stanford students is as follows:


1. The Honor Code is an undertaking of the students, individually and collectively: a. that they will not
give or receive aid in examinations; that they will not give or receive unpermitted aid in class work,
in the preparation of reports, or in any other work that is to be used by the instructor as the basis of
grading; b. that they will do their share and take an active part in seeing to it that they as well as
others uphold the spirit and letter of the Honor Code.
2. The faculty on its part manifests its confidence in the honor of its students by refraining from proctoring examinations and from taking unusual and unreasonable precautions to prevent the forms of dishonesty mentioned above. The faculty will also avoid, as far as practicable, academic procedures that create temptations to violate the Honor Code.
3. While the faculty alone has the right and obligation to set academic requirements, the students and
faculty will work together to establish optimal conditions for honorable academic work.
1. Multiple choice (20 points)
For each of the following questions, circle the letter of your choice. There is only ONE
correct choice. No explanations are required.
(a) (2 points) If your training loss increases with the number of epochs, which of the following could be a possible issue with the learning process?
A. Regularization is too low and model is overfitting
B. Regularization is too high and model is underfitting
C. Step size is too large
D. Step size is too small

Solution: C. The training loss always decreases whether the model is overfitting or underfitting. If the step size is too small, convergence is slow, but the training loss will still go down. If the step size is too large, it may cause a bouncing effect because we skip over and overshoot the optimal solution. This leads to an increase in training loss and a decrease in training accuracy.

(b) (2 points) In which of the following cases will there be no update when doing SGD with a max-margin loss, J = max(0, 1 − s + s_c), where s is the true window score and s_c is the corrupt window score?
A. (s, s_c) = (1, 0.5)
B. (s, s_c) = (2, 0.5)
C. (s, s_c) = (1, 1.5)
D. (s, s_c) = (2, 1.5)

Solution: B. The max-margin loss is 0 (so there is no update) when s − s_c ≥ 1, which is only true for option B.

(c) (2 points) The biggest advantage of neural transition-based dependency parsers over non-neural transition-based dependency parsers is that neural parsers:
A. Choose transitions using more words in the stack and buffer
B. Generate a larger class of dependency parses
C. Model a grammar whereas traditional parsers do not
D. Rely on dense feature representations

Solution: D. The main advantage of neural dependency parsers is that they use a dense feature representation instead of a sparse one. Neural and traditional parsers do not differ in what input information they can use or in what kinds of parses they can output (both can output any parse); they differ in how they represent the features they use.

(d) (2 points) Which one of the following is a valid projective dependency parse?

[Figure: four candidate dependency parses (A)–(D) over the words ROOT W X Y Z; the arcs are not reproducible in text.]

Solution: C. In (A) Z has no head, in (B) W has two heads, and (D) is not
projective because ROOT→X and Y→W cross.

(e) (2 points) We have learned that dense word vectors learned through word2vec or
GloVe have many advantages over using sparse one-hot word vectors. Which of the
following is NOT an advantage dense vectors have over sparse vectors?
A. Models using dense word vectors generalize better to unseen words than
those using sparse vectors.
B. Models using dense word vectors generalize better to rare words than
those using sparse vectors.
C. Dense word vectors encode similarity between words while sparse vectors
do not.
D. Dense word vectors are easier to include as features in machine learning
systems than sparse vectors.

Solution: A. Just like sparse representations, word2vec or GloVe do not have representations for unseen words and hence do not help in generalization.

(f) (2 points) Which of the following statements is INCORRECT?


A. Recurrent neural networks can handle a sequence of arbitrary length, while feedforward neural networks cannot.
B. Training recurrent neural networks is hard because of vanishing and exploding gradient problems.
C. Gradient clipping is an effective way of solving the vanishing gradient problem.
D. Gated recurrent units (GRUs) have fewer parameters than LSTMs.

Solution: C. Gradient clipping is only a solution for exploding gradient problems, not vanishing gradient ones.

(g) (2 points) In order for the following TensorFlow code to run, which of the following MUST be contained as keys of feed_dict?

import tensorflow as tf
W = tf.Variable(tf.random_normal((200, 20)))
b = tf.Variable(tf.zeros((20,)))
x = tf.placeholder(tf.float32, (32, 200))
y = tf.matmul(x, W) + b
prediction = tf.nn.softmax(y)

label = tf.placeholder(tf.float32, (32, 20))


cross_entropy = tf.reduce_mean(-tf.reduce_sum(label * tf.log(
prediction), reduction_indices=[1]))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(
cross_entropy)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
feed_dict = _________________________
sess.run([prediction], feed_dict=feed_dict)

A. x, W, b
B. only x
C. only prediction
D. x and label

Solution: B. Only the placeholder x is required to compute prediction. label is only required to compute cross_entropy and train_op. W, b and prediction are not placeholders.
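For concreteness, one way to fill in the blank (a sketch continuing the snippet above, assuming NumPy is available; the batch contents are arbitrary, since prediction depends only on x):

import numpy as np

# Any float32 array of shape (32, 200) works; prediction depends only on x.
x_batch = np.random.randn(32, 200).astype(np.float32)
feed_dict = {x: x_batch}
sess.run([prediction], feed_dict=feed_dict)

# Running train_op instead would additionally require the label placeholder:
# feed_dict = {x: x_batch, label: label_batch}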

(h) (2 points) A multi-layer neural network model trained using stochastic gradient descent on the same dataset with different initializations for its parameters is guaranteed to learn the same parameters. A. True B. False

Solution: False. The loss function for a multi-layer neural network is non-
convex and hence SGD is only guaranteed to converge to a local optimum.

(i) (2 points) If it parses a sentence correctly, a transition-based dependency parser must use a RIGHT-ARC transition at least once. A. True B. False

Solution: True. A RIGHT-ARC must be used to link ROOT to a dependent.

(j) (2 points) Stochastic gradient descent results in a smoother convergence plot (loss
vs epochs) as compared to batch gradient descent. A. True B. False

Solution: False. SGD results in noisier convergence plots compared to batch gradient descent.

2. Short questions (24 points)

Please write answers to the following questions in a sentence or two. Each part is worth
6 points.
(a) Dependency Parsing
i. (2 points) Can the neural dependency parser we learned in class correctly parse the sentence "John saw a dog yesterday which was a Yorkshire terrier."? Explain.

Solution: No. The sentence has a non-projective parse: "yesterday" is a dependent of "saw", while the clause "which was a Yorkshire terrier" is a dependent of "dog", so the two arcs cross. The transition-based parser we learned in class can only produce projective parses.

ii. (2 points) What is the ambiguity in parsing the sentence “There’s an awful
cost to getting a PhD that no one talks about”?

Solution: The phrase “that no one talks about” could either be referring
to the cost, “No one talks about the awful cost of getting a PhD” or to the
PhD, “Getting a PhD that no one talks about is awfully costly”.

iii. (2 points) Name at least two types of features that would be helpful to provide
as input to a neural dependency parser and explain why.

Solution: Here are some useful features that are indicative of transition decisions:

• Word vectors for the top n_s words on the stack. These help identify the candidate head and dependent of the next arc.

• Word vectors for the first n_b words in the buffer.

• Vector representations of those words' POS tags. These help disambiguate the role of the word or the word sense.

• Vector representations of the words' arc-labels.

(b) (6 points) Adagrad is a modification to the stochastic gradient descent algorithm used to update parameters. Updates are performed using the following formulas:

cache_i = cache_i + (∇_θi L)²
θ_i = θ_i − η ∇_θi L / (√cache_i + ε)

where θ_i is the parameter to be updated, ∇_θi L is the gradient of the loss (cost) with respect to θ_i, ε is a hyperparameter usually between 10⁻⁸ and 10⁻⁴, and η is a hyperparameter similar to the step size in SGD. cache_i is initialized to zero at the start of the algorithm.
(i) State two differences between AdaGrad and SGD in terms of step size and (ii)
describe what effect they would have.

Solution:

• Adagrad has a decreasing step size. This allows parameters to change quickly at first and then converge.

• Adagrad has different per-element step sizes. This allows parameters with small gradients to change more quickly than those with large gradients.
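As an illustrative sketch (not part of the exam), the AdaGrad update can be written in a few lines of NumPy; the toy loss L(θ) = ½||θ||² and the hyperparameter values below are arbitrary choices:

import numpy as np

def adagrad_step(theta, grad, cache, eta=0.1, eps=1e-8):
    # Accumulate squared gradients per element, then scale each element's step.
    cache += grad ** 2
    theta -= eta * grad / (np.sqrt(cache) + eps)
    return theta, cache

# Toy example: minimize L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([5.0, -3.0])
cache = np.zeros_like(theta)
for _ in range(100):
    grad = theta.copy()
    theta, cache = adagrad_step(theta, grad, cache)
print(theta)  # elements shrink toward 0; per-element step sizes decay over time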

(c) Consider the sigmoid activation function σ(z) = 1/(1 + e^(−z)):

i. (2 points) What would the gradient of the sigmoid be with respect to an input
that is very large?

Solution: Because the sigmoid is almost flat for very large values, its gra-
dient will be almost 0.

ii. (2 points) Why might this be a problem for training neural networks?

Solution: If units become saturated during training, the small gradients make it very hard to change the weights thereafter: in other words, it makes training much slower.

iii. (2 points) How does the ReLU activation ReLU(z) = max(0, z) solve this prob-
lem?

Solution: Because ReLU activations are linear for positive inputs, they do not saturate for large (positive) values, and hence freely allow gradients to change weights in the network. However, they do not solve the problem for large negative inputs, and ReLU units can "die" if a gradient update moves parameters into the flat region.
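A quick numerical illustration of these answers (not part of the exam): the sigmoid gradient σ(z)(1 − σ(z)) is essentially zero for large |z|, while the ReLU gradient is 1 for any positive input and 0 for negative inputs:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))   # ~4.5e-5 at |z| = 10
relu_grad = (z > 0).astype(float)                # 1 for z > 0, 0 otherwise
print(sigmoid_grad)
print(relu_grad)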

(d) A popular model used for sentiment classification is an LSTM model:

[Figure: an LSTM reads the words "The movie was okay" one per time step, and a softmax over the final hidden state produces the label y.]

This model inputs word vectors to the LSTM model at each time step and uses the
last hidden state vector to predict the sentiment label (y).
Recall that in assignment 1 we used a simple “bag-of-vectors” model for sentiment
classification: we used the average of all the word vectors in a sentence to predict
the sentiment label.
i. (3 points) Name at least one benefit of the LSTM model over the bag-of-vectors
model.

Solution: The LSTM model is able to integrate information from word ordering, e.g. "this was not an amazing fantastic movie", while the bag-of-vectors model cannot.

ii. (3 points) If we chose to update our word vectors when training the LSTM
model on sentiment classification data, how would these word vectors differ
from ones not updated during training? Explain with an example. Assume that
the word vectors of the LSTM model were initialized using GloVe or word2vec.

Solution: The word vectors learned using this method can capture more sentiment information. In the word2vec models that depend on co-occurrence counts, words like 'good' or 'bad' have very similar representations; representations learned through a sentiment classifier would learn different embeddings.

3. Feedforward neural network language models (18 points)


A feedforward neural network language model (NNLM) can be used as another archi-
tecture for training word vectors. This model tries to predict a word given the N words
that precede it. To do so, we concatenate the word vectors of N previous words and
send them through a single hidden layer of size H with a tanh nonlinearity and use a
softmax layer to make a prediction of the current word. The size of the vocabulary is
V . The model is trained using a cross entropy loss for the current word.
Let the word vectors of the N previous words be x1 , x2 , . . . , xN , each a column vector
of dimension D, and let y be the one-hot vector for the current word. The network is
specified by the equations below:
The dimensions of our parameters and variables are x ∈ R^{N·D}, W ∈ R^{H×(N·D)}, b ∈ R^{H}, h ∈ R^{H}, U ∈ R^{V×H}, d ∈ R^{V}, ŷ ∈ R^{V}.

[Figure: the NNLM reads the context "Green eggs and", feeding the word vectors x1, x2, x3 through the hidden layer h to produce the prediction ŷ.]

x = [x1; x2; … ; xN]   (concatenation of the N previous word vectors)
h = tanh(Wx + b)
ŷ = softmax(Uh + d)
J = CE(y, ŷ) = − Σ_i y_i log(ŷ_i)
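As an illustrative sketch (not part of the exam), the forward pass above can be written in a few lines of NumPy; the dimensions and parameter values below are arbitrary placeholders:

import numpy as np

N, D, H, V = 3, 50, 100, 10000   # context size, embedding dim, hidden dim, vocab size

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Random placeholder parameters and inputs.
W = np.random.randn(H, N * D) * 0.01
b = np.zeros(H)
U = np.random.randn(V, H) * 0.01
d = np.zeros(V)
xs = [np.random.randn(D) for _ in range(N)]   # word vectors of the N previous words
y = np.zeros(V); y[42] = 1.0                  # one-hot vector for the current word

x = np.concatenate(xs)                        # shape (N*D,)
h = np.tanh(W @ x + b)                        # shape (H,)
y_hat = softmax(U @ h + d)                    # shape (V,)
J = -np.sum(y * np.log(y_hat))                # cross-entropy loss
print(J)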

(a) (4 points) State two important differences between the NNLM and the CBOW model we learned in class. Explain how each might affect the word vectors learned.

Solution:

• The CBOW is trained to predict a center word given a context window that extends on both sides, while word vectors learned by the NNLM do not capture the context to the right of the word. Thus, the NNLM may not differentiate "word (carefully)" from "word (puzzle)".

• The CBOW model simply uses the sum of context words, while the NNLM model combines context words non-linearly. Thus the NNLM can learn to treat "not good to" differently from "good to not", etc.

(b) i. (3 points) What is the complexity of forward propagation in an NNLM for a single training example?

Solution: The forward propagation complexity for an NNLM is N × D for concatenating the word vectors, N × D × H to compute h, and H × V to compute ŷ from h: in total, O(NDH + HV). Typically, V ≫ ND, so the latter term dominates the forward-propagation computation.

ii. (3 points) Describe at least one way to change the model that would reduce
this complexity.

Solution: The complexity can be reduced by using negative sampling to compute the softmax, or by using the hierarchical softmax.

(c) (8 points) What is the gradient of J with respect to x, ∂J/∂x? Hint: tanh′(z) = 1 − tanh(z)².

Solution: Let z1 = Wx + b and z2 = Uh + d. Then

δ1 = ∂CE/∂z2 = ŷ − y
δ2 = ∂CE/∂h = U^T δ1
δ3 = ∂CE/∂z1 = δ2 ◦ (1 − tanh(z1)²)
∂CE/∂x = W^T δ3
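As a sanity check (not part of the exam), the result ∂J/∂x = W^T δ3 can be verified against a finite-difference estimate; the dimensions and parameters below are small arbitrary placeholders:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(x, W, b, U, d, y):
    h = np.tanh(W @ x + b)
    return -np.sum(y * np.log(softmax(U @ h + d)))

np.random.seed(0)
N, D, H, V = 2, 4, 5, 7
W = np.random.randn(H, N * D); b = np.random.randn(H)
U = np.random.randn(V, H);     d = np.random.randn(V)
x = np.random.randn(N * D)
y = np.zeros(V); y[3] = 1.0

# Analytic gradient from the solution above.
h = np.tanh(W @ x + b)
delta1 = softmax(U @ h + d) - y            # dJ/dz2
delta3 = (U.T @ delta1) * (1 - h ** 2)     # dJ/dz1
grad_x = W.T @ delta3                      # dJ/dx

# Central-difference numerical gradient.
eps = 1e-6
num = np.zeros_like(x)
for i in range(x.size):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps; xm[i] -= eps
    num[i] = (loss(xp, W, b, U, d, y) - loss(xm, W, b, U, d, y)) / (2 * eps)
print(np.max(np.abs(grad_x - num)))        # should be tiny (around 1e-8 or smaller)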

4. Autoencoders (18 points)


In class, we’ve learned that one way to combine word vectors is to simply add them
together. In this question, we’ll explore another approach: using an autoencoder.

Figure 2: Illustration of an autoencoder unit to combine two vectors

In the autoencoder, two input word vectors x1, x2 ∈ R^{Dx×1} are first concatenated into a single vector x = [x1; x2] ∈ R^{2Dx×1}, and the parent vector p can be computed as:

p = ReLU(W1 x + b1) ∈ R^{Dp×1},   where ReLU(x) = max(0, x),

and W1 can be decomposed as W1 = [W11 W12], so that W1 x = W11 x1 + W12 x2.

During training, we use the parent vector p to reconstruct the input vectors:

x′ = [x′1; x′2] = W2 p + b2 ∈ R^{2Dx×1},

where x′1, x′2 ∈ R^{Dx×1} are the reconstructions. Correspondingly, a reconstruction loss that computes the Euclidean distance between inputs and reconstructions is used during training:

J1 = (1/2) ||x′ − x||² ∈ R.
For sentiment classification, the parent vector is used to predict a sentiment class ŷ for
the pair of the input words:

ŷ = W3 p + b3 ∈ R^{Dc×1},

where Dc = 3 is the number of sentiment classes ('positive', 'neutral' and 'negative'). Here, the network is also trained using a cross-entropy loss:

J2 = CE(y, ŷ) ∈ R,

where y ∈ R^{Dc×1} is the one-hot label vector with y_k = 1 for the correct class k.

In total, the network is trained using the loss J = J1 + J2.
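As an illustrative sketch (not part of the exam), the forward computation and both losses can be written in NumPy as follows; the dimensions and parameter values are arbitrary placeholders, and applying a softmax before the cross entropy is an assumption made here so that J2 is well defined:

import numpy as np

Dx, Dp, Dc = 50, 60, 3

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Random placeholder parameters and inputs.
W1 = np.random.randn(Dp, 2 * Dx) * 0.01; b1 = np.zeros(Dp)
W2 = np.random.randn(2 * Dx, Dp) * 0.01; b2 = np.zeros(2 * Dx)
W3 = np.random.randn(Dc, Dp) * 0.01;     b3 = np.zeros(Dc)
x1, x2 = np.random.randn(Dx), np.random.randn(Dx)
y = np.array([0.0, 1.0, 0.0])            # one-hot sentiment label

x = np.concatenate([x1, x2])             # shape (2*Dx,)
p = relu(W1 @ x + b1)                    # parent vector, shape (Dp,)
x_rec = W2 @ p + b2                      # reconstruction, shape (2*Dx,)
J1 = 0.5 * np.sum((x_rec - x) ** 2)      # reconstruction loss

# The exam defines y_hat = W3 p + b3 and J2 = CE(y, y_hat); the softmax here is
# an added assumption so that the cross entropy is defined on a distribution.
y_hat = softmax(W3 @ p + b3)
J2 = -np.sum(y * np.log(y_hat))
J = J1 + J2
print(J1, J2, J)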
(a) (2 points) Give at least one example of how the sentiment predictions made by an
autoencoder model would differ from one made using a bag-of-vectors model like
the one in assignment 1.

Solution: The autoencoder model described above allows us to predict that "not bad" is actually neutral or positive, while the bag-of-vectors model would predict a more negative class.

(b) (2 points) How do the reconstructed vectors and reconstruction loss help in learning
the parent representation?

Solution: The reconstruction loss prevents the network from ignoring the con-
tribution of one of x1 or x2 . In other words, it forces the network to capture the
joint ‘meaning’ of x1 and x2 .

(c) (2 points) i. What is the shape of each weight and bias term: W1 , W2 , W3 , b1 , b2 , b3 ?
ii. How many parameters does the model have in total?

Solution: Shapes of all terms:

W1 ∈ R^{Dp×2Dx},  b1 ∈ R^{Dp×1}
W2 ∈ R^{2Dx×Dp},  b2 ∈ R^{2Dx×1}
W3 ∈ R^{Dc×Dp},   b3 ∈ R^{Dc×1}

Number of parameters = 4DxDp + Dp + 2Dx + DpDc + Dc

(d) Compute the following gradients for the reconstruction loss.
Hint: You can use the indicator notation

1{x > 0} = 1 if x > 0, and 0 otherwise.

Applying it to a matrix performs the operation element-wise, e.g.

1{[1 0; −1 3] > 0} = [1 0; 0 1].

i. (2 points) δ1 = ∂J1/∂p =

ii. (2 points) δ2 = ∂J1/∂h =   (where h := W1 x + b1)

iii. (2 points) ∂J1/∂W1 =

iv. (2 points) ∂J1/∂b1 =

v. (2 points) ∂J1/∂x1 =

You can use the space below as scratch space.

Solution: Let h = W1 x + b1; ◦ denotes element-wise multiplication. With J1 = (1/2) ||x′ − x||²:

δ1 = ∂J1/∂p = W2^T (x′ − x)
δ2 = ∂J1/∂h = δ1 ◦ 1{h > 0}
∂J1/∂W1 = δ2 x^T
∂J1/∂b1 = δ2
∂J1/∂x1 = (W1^T δ2)[1:Dx] = W11^T δ2
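The same chain of derivatives, transcribed into NumPy as a self-contained sketch with arbitrary placeholder values (◦ becomes element-wise *):

import numpy as np

np.random.seed(0)
Dx, Dp = 4, 5
W1 = np.random.randn(Dp, 2 * Dx); b1 = np.random.randn(Dp)
W2 = np.random.randn(2 * Dx, Dp); b2 = np.random.randn(2 * Dx)
x1, x2 = np.random.randn(Dx), np.random.randn(Dx)

x = np.concatenate([x1, x2])
h = W1 @ x + b1
p = np.maximum(0, h)                 # ReLU(h)
x_rec = W2 @ p + b2                  # x'
J1 = 0.5 * np.sum((x_rec - x) ** 2)

delta1 = W2.T @ (x_rec - x)          # dJ1/dp
delta2 = delta1 * (h > 0)            # dJ1/dh = delta1 * 1{h > 0}
grad_W1 = np.outer(delta2, x)        # dJ1/dW1 = delta2 x^T
grad_b1 = delta2                     # dJ1/db1
grad_x1 = (W1.T @ delta2)[:Dx]       # (W1^T delta2)[1:Dx] = W11^T delta2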

(e) (2 points) How would you construct a network using only copies of the autoencoder
above to predict a sentiment label for a whole sentence, not just a pair of words?
Remember that each sentence could be of arbitrary length.

Solution: We can combine the vectors recursively by combining every subsequent pair of words, and then combining each pair of parent vectors, etc. We can also combine the vectors linearly by combining the first two words to get a parent vector, and then combining every subsequent word with the running parent vector.
We considered answers like averaging each parent vector to be incorrect because this adds a mean-pooling layer to the network. However, we considered answers like averaging the predictions made on every pair of words to still be correct (though not preferable) because such a network does not use any element other than the autoencoders themselves.

5. Recurrent neural networks (20 points)


Recall that a recurrent neural network (RNN) takes in an input vector xt and a state
vector ht−1 and returns a new state vector ht and an output vector yt :

ht = f (w1 xt + w2 ht−1 + b2 )
yt = g(w3 ht + b3 ),

where f and g are activation functions.


(a) (8 points) The following diagram depicts a single RNN unit, where xt , ht−1 , ht and
yt are all scalars, as a state machine:

Suppose that f is a binary threshold unit and g is a linear unit:

f(x) = 0 if x < 0, and 1 if x ≥ 0;
g(x) = x.

Fill in weights (w1, w2, w3) and biases (b2, b3) so that the RNN initially outputs 0, but
as soon as it receives an input of 1, it switches to outputting 1 for all subsequent
time steps. For instance, the input 0001010 produces the output 0001111. The
hidden unit has an initial value of 0.
You don’t need to provide an explanation, but doing so may help you receive partial
credit.
Hint: In one possible solution, the hidden unit has an activation ht = 0 until there’s
an input xt = 1, at which point it switches to maintaining an activation of 1 forever.
The output unit always predicts the same as the hidden unit, i.e. yt = ht .

Solution: There are many possible solutions, but only one main scenario: the
output unit must predict the same as the hidden state.
This requires that:

b3 = 0
w3 + b3 = 1
b2 < 0
w2 + b2 ≥ 0
w1 + b2 ≥ 0,

so values like w1 = 1, w2 = 1, b2 = −1, w3 = 1, b3 = 0 would be considered correct.
Unfortunately, the original hint we provided was wrong given that the initial
value of the hidden state is 0: there is no value of ht for which the unit can flip
from 0 to 1 at the start, when the input is 0 AND flip from 1 to 0 when input
is 1 and stay there.
When grading this question, we have considered parameters that agree with the
hint, i.e. w3 = −1, b3 = 1, to also be correct. The remaining parameters were
partially marked if (a) you stated that h0 = 1 and had consistent parameters:

b3 = 1
w3 + b3 = 0
b2 < 0
w2 + b2 ≥ 0
w1 + w2 + b2 < 0,

OR (b) reasoned that no values of w1 , w2 , b2 are possible.
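As a quick check (not part of the exam), simulating the unit with the values w1 = 1, w2 = 1, b2 = −1, w3 = 1, b3 = 0 reproduces the required behaviour on the example input 0001010:

def step(x_t, h_prev, w1=1.0, w2=1.0, b2=-1.0, w3=1.0, b3=0.0):
    h_t = 1.0 if (w1 * x_t + w2 * h_prev + b2) >= 0 else 0.0   # binary threshold f
    y_t = w3 * h_t + b3                                        # linear output g
    return h_t, y_t

h = 0.0                       # initial hidden state
outputs = []
for x_t in [0, 0, 0, 1, 0, 1, 0]:
    h, y = step(x_t, h)
    outputs.append(int(y))
print(outputs)                # [0, 0, 0, 1, 1, 1, 1]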

(b) Now suppose that f is a sigmoid unit and that the input is always 0. The following
diagram shows how the next state ht+1 changes with the current state ht when
w2 = 3 and b2 = −1, i.e. ht+1 = σ(3ht − 1). The diagram also shows that ht+1 = ht
when ht = 0.80.
i. (2 points) Would ht+1 be greater than or less than ht if ht = 0.5? greater than
ii. (2 points) Would ht+1 be greater than or less than ht if ht = 0.9? less than
iii. (2 points) What would ht+1 look like as we take increasingly longer sequences,
i.e. t → ∞?

Solution: ht+1 would converge to 0.80 as t → ∞.

iv. (1 point) For what range of initial values h0 would this network experience exploding gradients for long sequences? None
v. (1 point) For what range of initial values h0 would this network experience vanishing gradients for long sequences? All

Solution: The plot above shows how ht is mapped to ht+1. Whenever ht < 0.8, the graph shows that ht+1 is greater than ht. Similarly, whenever ht > 0.8, ht+1 is less than ht. This shows that as t → ∞, ht converges to the fixed point 0.8.
In the long term, the gradients experienced by the network converge to the gradient of the map at ht = 0.8, which is clearly less than 1 (the curve crosses the line ht+1 = ht with a slope less than 1). At each time step, the network's gradients are multiplied by this value, so the network experiences vanishing gradients irrespective of where it started.

vi. (4 points) We have learned in class that ReLUs are good at preserving gradients
in deep neural networks. Would it be wise to replace the sigmoid activation
above with a ReLU activation unit? Explain.

Solution: No. When ht > 1/3, the ReLU unit has a slope of 3 and the network would experience exploding gradients. When ht < 1/3, the unit outputs 0 and has zero gradient, so the network would experience vanishing gradients.
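As a final illustration (not part of the exam), iterating the map ht+1 = σ(3ht − 1) in code shows both behaviours at once: the state converges to about 0.80 from any starting value, and the product of the per-step derivatives 3σ′(3ht − 1) shrinks toward zero, i.e. gradients vanish:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

h = 0.1                    # any starting value in (0, 1); try 0.5 or 0.99 as well
grad_product = 1.0         # product of d h_{t+1} / d h_t over the sequence
for t in range(50):
    h_next = sigmoid(3.0 * h - 1.0)
    grad_product *= 3.0 * h_next * (1.0 - h_next)   # local derivative 3 * sigma'(3h - 1)
    h = h_next
print(h)              # ~0.80, the fixed point
print(grad_product)   # ~0: gradients vanish over long sequences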
