
Neural Networks (2010/11)

Example exam, December 2010


In the final exam, four problems are to be solved within 3 hours. The use of
supporting material (books, notes, calculators) is not allowed. In each
of the four problems you can achieve up to 2.25 points, with a total maximum of
9 points. The grade will be 1 + the number of points.

1) Model neurons and networks

a) Consider a single neuron of the McCulloch-Pitts type. Define precisely
how its state of activity is determined from the neurons it is connected
to (the neighbors). Explain why a positive weight can be interpreted as
representing an excitatory synapse.

b) Consider a Hopfield model consisting of N McCulloch-Pitts type neurons
with activities S_j(t) ∈ {−1, +1} (j = 1, 2, . . . , N). Write down an update
equation that specifies S_i(t + 1) as a function of the neural activities in the
previous time step.

c) How is a set of patterns {ξ^μ ∈ ℝ^N} (μ = 1, 2, . . . , P) stored in the Hopfield
model? Explain in words how it can work as an associative memory.
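
The following sketch is merely an illustration of the setting of (b) and (c), not
part of the exam: Hebbian couplings built from random ±1 patterns and a parallel
McCulloch-Pitts update. The pattern statistics, the network size, the zero
self-coupling convention and the parallel update schedule are assumptions chosen
only for this demo.

    import numpy as np

    rng = np.random.default_rng(0)
    N, P = 100, 5                           # network size, number of patterns
    xi = rng.choice([-1, +1], size=(P, N))  # random binary patterns xi^mu

    # Hebbian couplings w_ij = (1/N) sum_mu xi_i^mu xi_j^mu, no self-coupling
    W = (xi.T @ xi) / N
    np.fill_diagonal(W, 0.0)

    def update(S, W):
        """One parallel McCulloch-Pitts step: S_i(t+1) = sign(sum_j w_ij S_j(t))."""
        return np.where(W @ S >= 0, 1, -1)

    # Associative recall: start from a corrupted version of pattern 0
    S = xi[0].copy()
    flip = rng.choice(N, size=10, replace=False)
    S[flip] *= -1
    for _ in range(10):
        S = update(S, W)
    print("overlap with the stored pattern:", (S @ xi[0]) / N)  # close to 1 if recalled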

2) Perceptron storage problem

Consider a set of data ID = {ξ^μ, S^μ} (μ = 1, . . . , P) where ξ^μ ∈ ℝ^N and
S^μ ∈ {+1, −1}. You can assume that the data is homogeneously linearly separable.

a) In this problem, you can assume that ID is homogeneously linearly separable.
Define the stability κ(w) of a perceptron solution w with respect to the
given set of data ID. Give a geometric interpretation and provide an
illustrative sketch. Explain in words why κ(w) quantifies the stability
of the perceptron output with respect to noise.
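
As an illustrative aside (not part of the exam text): the stability of a given
weight vector can be evaluated numerically as the smallest normalized margin over
the examples. The data set and the candidate vector below are placeholders
generated only for the demo.

    import numpy as np

    def stability(w, xi, S):
        """kappa(w) = min_mu S^mu (w . xi^mu) / |w|, the smallest margin."""
        return (S * (xi @ w) / np.linalg.norm(w)).min()

    # toy data, linearly separable by construction
    rng = np.random.default_rng(1)
    N, P = 10, 20
    w_gen = rng.normal(size=N)
    xi = rng.normal(size=(P, N))
    S = np.sign(xi @ w_gen)
    print(stability(w_gen, xi, S))   # positive, since w_gen separates the data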

b) Assume you have found two different solutions w^(1) and w^(2) of the
perceptron storage problem for data set ID. Assume furthermore that w^(1)
can be written as a linear combination

        w^(1) = Σ_{μ=1}^{P} x_μ ξ^μ S^μ   with x_μ ∈ ℝ,

whereas the difference vector w^(2) − w^(1) is orthogonal to all the vectors
ξ^μ ∈ ID.
Consider the stabilities of the competing solutions and prove (give precise
mathematical arguments) that κ(w^(1)) ≥ κ(w^(2)) holds true. What does
this result imply for the perceptron of optimal stability and potential
training algorithms?
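
The following numerical check is an illustration of the claim, not a proof, and
the data, the solution and the orthogonal direction are arbitrary constructions
for the demo: a component orthogonal to every ξ^μ leaves the numerators
S^μ (w · ξ^μ) unchanged while increasing |w|, so the smallest margin cannot grow.

    import numpy as np

    rng = np.random.default_rng(2)
    N, P = 10, 5
    xi = rng.normal(size=(P, N))            # example inputs (rows)
    S = rng.choice([-1.0, 1.0], size=P)     # binary labels

    def kappa(w):
        """Smallest normalized margin of w over the data set."""
        return (S * (xi @ w) / np.linalg.norm(w)).min()

    # w1 lies in the span of the examples and fulfils w1 . xi^mu = S^mu,
    # so it solves the storage problem (all margins are positive).
    w1 = np.linalg.pinv(xi) @ S

    # w2 = w1 + v with v orthogonal to every xi^mu (project a random
    # vector out of the row space of xi).
    r = rng.normal(size=N)
    v = r - np.linalg.pinv(xi) @ (xi @ r)
    w2 = w1 + v

    print(kappa(w1) >= kappa(w2))           # True: orthogonal additions cannot help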

3) Learning a linearly separable rule
Here we consider linearly separable data ID = {ξ^μ, S_R^μ} (μ = 1, . . . , P), where
noise-free labels S_R^μ = sign[w* · ξ^μ] are provided by a teacher vector
w* ∈ ℝ^N with |w*| = 1.
a) Define and explain the term version space precisely in this context; provide
a mathematical definition as a set of vectors and also a simplifying graphical
illustration. Give a brief argument why one can expect the perceptron of
maximum stability to display good generalization behavior.
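
Purely as an illustration (not part of the exam), version space membership can
also be checked numerically: a vector belongs to V(ID) iff it reproduces every
teacher label. The teacher and data below are arbitrary choices for the demo.

    import numpy as np

    rng = np.random.default_rng(3)
    N, P = 10, 20
    w_star = rng.normal(size=N)
    w_star /= np.linalg.norm(w_star)          # teacher vector, |w*| = 1
    xi = rng.normal(size=(P, N))
    S_R = np.sign(xi @ w_star)                # noise-free teacher labels

    def in_version_space(w):
        """True iff sign(w . xi^mu) = S_R^mu for all mu, i.e. w is in V(ID)."""
        return bool(np.all(np.sign(xi @ w) == S_R))

    print(in_version_space(w_star))              # True: the teacher generated the labels
    print(in_version_space(rng.normal(size=N)))  # almost certainly False for P/N = 2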
b) Define and explain the (Rosenblatt) Perceptron algorithm for a given set
of examples ID. Be precise, for instance by writing it in a few lines of
pseudocode. Also include a stopping criterion.
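
For concreteness, here is one possible few-line realization in Python instead of
pseudocode; it is only an illustration, and the 1/N scaling of the Hebbian step,
the update condition w · ξ^μ S^μ ≤ 0, and the limit on the number of sweeps are
conventions assumed for this sketch.

    import numpy as np

    def rosenblatt(xi, S, max_sweeps=100):
        """Rosenblatt perceptron training on examples xi (P x N) with labels S."""
        P, N = xi.shape
        w = np.zeros(N)
        for _ in range(max_sweeps):
            updated = False
            for mu in range(P):
                if w @ xi[mu] * S[mu] <= 0:       # example mu not (yet) correct
                    w += xi[mu] * S[mu] / N       # Hebbian update
                    updated = True
            if not updated:                       # stopping criterion: all
                return w                          # examples have positive margin
        return w                                  # no convergence within the limit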
c) While experimenting with the Rosenblatt perceptron in the practicals, your
partner has a brilliant idea: the use of a larger learning rate. His/her
argument: updating w by Hebbian terms of the form η ξ^μ S^μ with a large
η > 1 should give (I) faster convergence and (II) a better perceptron vector.
Are you convinced? Give precise arguments for your answer!
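
As a hedged numerical hint (not an answer key): if training starts from w = 0
and an update is triggered by the sign of w · ξ^μ S^μ, then rescaling every
Hebbian increment by η merely rescales w, so the same examples trigger updates
in the same order. The small demo below, with arbitrary data, only checks that
the resulting weight vectors are parallel.

    import numpy as np

    def train(xi, S, eta, sweeps=100):
        """Rosenblatt-style training with Hebbian increments scaled by eta."""
        w = np.zeros(xi.shape[1])
        for _ in range(sweeps):
            for mu in range(len(S)):
                if w @ xi[mu] * S[mu] <= 0:
                    w += eta * xi[mu] * S[mu]
        return w

    rng = np.random.default_rng(4)
    xi = rng.normal(size=(30, 8))
    S = np.sign(xi @ rng.normal(size=8))

    w_small, w_large = train(xi, S, eta=1.0), train(xi, S, eta=10.0)
    cos = w_small @ w_large / (np.linalg.norm(w_small) * np.linalg.norm(w_large))
    print(cos)   # 1.0 up to rounding: eta only rescales w, the classifier is unchanged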
Note: The following will be treated in January, but some of you might be able
to do (a) and (b) already, using the information on gradients etc. Problem 3c)
should get clearer in January, but I think you might solve it already, with a
little thinking :-)

4) Learning by gradient descent


Consider a feed-forward continuous neural network with an N-dim. input layer
and one very simple, linear unit with continuous output

        σ(ξ) = w · ξ ∈ ℝ.

Here, ξ denotes an N-dim. input vector and w is the adaptive weight vector.
a) Given a set of training examples, i.e. inputs ξ^μ with continuous labels
τ^μ ∈ ℝ, consider the quadratic error measure

        E(w) = (1/2) Σ_{μ=1}^{P} (σ(ξ^μ) − τ^μ)²

as a cost function for training. Derive a gradient descent learning step for
the adaptive weights with respect to the cost function E.
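
For illustration only (the derivation itself is the exercise): for the linear
unit above the gradient is ∇E(w) = Σ_μ (w · ξ^μ − τ^μ) ξ^μ, and one batch
gradient descent step could be coded as below. The data, the learning rate η
and the number of steps are placeholders chosen for the demo.

    import numpy as np

    def gradient_step(w, xi, tau, eta=0.005):
        """w <- w - eta * grad E  with  grad E = sum_mu (w.xi^mu - tau^mu) xi^mu."""
        grad = (xi @ w - tau) @ xi
        return w - eta * grad

    # toy usage with made-up, realizable data
    rng = np.random.default_rng(5)
    xi = rng.normal(size=(50, 10))
    tau = xi @ rng.normal(size=10)           # continuous labels
    w = np.zeros(10)
    for _ in range(300):
        w = gradient_step(w, xi, tau)
    print(np.mean((xi @ w - tau) ** 2))      # small: E(w) has been driven down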
b) What are the necessary conditions for a weight vector w to be a local
minimum of E? You don't have to discuss sufficient conditions here. Assume
some w does indeed satisfy the necessary conditions, but it is not a local
minimum. What else could w correspond to?
c) Discuss qualitatively (in words) the role of the step size or learning rate η
in the gradient descent algorithm. What can happen if η is (too) small
or (too) large, respectively? Do you have suggestions for how to deal with
these problems in practice?
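
An optional numerical illustration of what (c) asks about (all values are
arbitrary): for a one-dimensional quadratic cost, a tiny step size leads to slow
progress, while a step size beyond the stability limit makes gradient descent
oscillate and diverge.

    # Minimize E(w) = 0.5 * a * (w - w0)^2 by gradient descent; the curvature a,
    # the target w0 and the starting point are arbitrary choices for the demo.
    a, w0 = 4.0, 1.0
    for eta in (0.001, 0.4, 0.6):            # 2/a = 0.5 is the stability limit here
        w = 5.0
        for _ in range(50):
            w -= eta * a * (w - w0)          # gradient step
        print(f"eta={eta}: w after 50 steps = {w:.3g}")
    # eta=0.001: barely moved (too slow); eta=0.4: converged to w0; eta=0.6: diverged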
