Probability and

Information Theory
Lecture slides for Chapter 3 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
2016-09-26
Probability Mass Function

A probability distribution over many variables is known as a joint probability distribution. P(x = x, y = y) denotes the probability that x = x and y = y simultaneously. We may also write P(x, y) for brevity.

To be a probability mass function on a random variable x, a function P must satisfy the following properties:

• The domain of P must be the set of all possible states of x.

• ∀x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.

• Σ_{x∈x} P(x) = 1. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.

Example: uniform distribution. Consider a single discrete random variable x with k different states. We can place a uniform distribution on x (that is, make each of its states equally likely) by setting its probability mass function to

    P(x = x_i) = 1/k                                               (3.1)

for all i. We can see that this fits the requirements for a probability mass function. The value 1/k is positive because k is a positive integer. We also see that the probabilities sum to 1, so the distribution is properly normalized.
(Goodfellow 2016)
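The three properties above can be checked mechanically. This is a minimal sketch (the value of k is an arbitrary illustration) that builds the uniform PMF of equation 3.1 and asserts each property:

```python
# Minimal sketch: the uniform PMF P(x = x_i) = 1/k from equation 3.1,
# checked against the probability-mass-function properties.
k = 6  # e.g. a fair die; the choice of k is an assumption for illustration
pmf = {i: 1.0 / k for i in range(k)}  # domain: all k states of x

# Every probability lies in [0, 1].
assert all(0.0 <= p <= 1.0 for p in pmf.values())

# The distribution is normalized: the probabilities sum to 1.
assert abs(sum(pmf.values()) - 1.0) < 1e-12
```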
Probability Density Function

When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function. To be a probability density function, a function p must satisfy the following properties:

• The domain of p must be the set of all possible states of x.

• ∀x ∈ x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.

• ∫ p(x) dx = 1.

A probability density function p(x) does not give the probability of a specific state directly; instead the probability of landing inside an infinitesimal region with volume δx is given by p(x) δx. We can integrate the density function to find the actual probability mass of a set of points. Specifically, the probability that x lies in some set S is given by the integral of p(x) over that set. In the univariate example, the probability that x lies in the interval [a, b] is given by ∫_{[a,b]} p(x) dx.

Example: uniform distribution. For an example of a probability density over a real-valued state, consider a uniform distribution on an interval of the real numbers:

    u(x; a, b) = 1/(b − a)

where a and b are the endpoints of the interval, with b > a. The semicolon means "parametrized by"; x is the argument of the function, while a and b are parameters that define the function. We place no probability mass outside the interval, so u(x; a, b) = 0 for x outside [a, b]; within [a, b], u(x; a, b) = 1/(b − a). This is nonnegative everywhere and integrates to 1. We often denote that x follows the uniform distribution on [a, b] by writing x ∼ U(a, b).
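The PDF properties can be confirmed numerically. A minimal sketch (the endpoints a, b and the sub-interval are arbitrary examples), integrating u(x; a, b) with a simple midpoint rule:

```python
# Minimal sketch: the uniform density u(x; a, b) = 1/(b - a) on [a, b],
# integrated numerically to confirm it is normalized and that
# P(x in [c, d]) = (d - c)/(b - a) for a sub-interval [c, d] of [a, b].
a, b = 2.0, 5.0  # example endpoints (assumptions for illustration)

def u(x):
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, n=10_000):
    # simple midpoint rule
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

assert abs(integrate(u, a, b) - 1.0) < 1e-9            # normalized
assert abs(integrate(u, 3.0, 4.0) - 1.0 / 3.0) < 1e-9  # P(3 <= x <= 4)
```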
Marginal Probability with the Sum Rule

Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them. The probability distribution over the subset is known as the marginal probability distribution. The name "marginal probability" comes from the process of computing marginal probabilities on paper: when the values of P(x, y) are written in a grid with different values of x in rows and different values of y in columns, it is natural to sum across a row of the grid, then write P(x) in the margin of the paper just to the right of the row.

For example, suppose we have discrete random variables x and y, and we know P(x, y). We can find P(x) with the sum rule:

    ∀x ∈ x, P(x = x) = Σ_y P(x = x, y = y).                        (3.3)

For continuous variables, we need to use integration instead of summation:

    p(x) = ∫ p(x, y) dy.                                           (3.4)
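The sum rule of equation 3.3 is literally a row sum when the joint is stored as a grid. A minimal sketch (the joint probabilities are made-up numbers that sum to 1):

```python
# Minimal sketch of the sum rule (equation 3.3): store P(x, y) as a grid
# and sum across each row to get the marginal P(x) "in the margin".
joint = [
    [0.10, 0.20],   # P(x = 0, y = 0), P(x = 0, y = 1)
    [0.30, 0.40],   # P(x = 1, y = 0), P(x = 1, y = 1)
]  # example numbers, chosen only so that the entries sum to 1

marginal_x = [sum(row) for row in joint]   # P(x = x) = sum_y P(x, y)

assert abs(marginal_x[0] - 0.3) < 1e-12
assert abs(marginal_x[1] - 0.7) < 1e-12
assert abs(sum(marginal_x) - 1.0) < 1e-12  # still a valid distribution
```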


Conditional Probability

In many cases, we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability. We denote the conditional probability that y = y given x = x as P(y = y | x = x). This conditional probability can be computed with the formula

    P(y = y | x = x) = P(y = y, x = x) / P(x = x).                 (3.5)

The conditional probability is only defined when P(x = x) > 0. We cannot compute the conditional probability conditioned on an event that never happens.

It is important not to confuse conditional probability with computing what would happen if some action were undertaken. The conditional probability that a person is from Germany given that they speak German is quite high, but if a randomly selected person is taught to speak German, their country of origin does not change. Computing the consequences of an action is called making an intervention query. Intervention queries are the domain of causal modeling, which we do not explore in this book.
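Equation 3.5 can be computed directly from a joint table, with the marginal in the denominator obtained via the sum rule. A minimal sketch (the table entries are illustrative):

```python
# Minimal sketch of equation 3.5: P(y | x) = P(y, x) / P(x),
# computed from a small joint table (example numbers only).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}  # (x, y) -> P

def P_x(x):
    return sum(p for (xi, _), p in joint.items() if xi == x)

def P_y_given_x(y, x):
    px = P_x(x)
    if px == 0:
        # undefined when conditioning on an event that never happens
        raise ValueError("conditional undefined when P(x) = 0")
    return joint[(x, y)] / px

assert abs(P_y_given_x(1, 0) - 0.2 / 0.3) < 1e-12
```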
Chain Rule of Probability

Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable:

    P(x^(1), ..., x^(n)) = P(x^(1)) Π_{i=2}^{n} P(x^(i) | x^(1), ..., x^(i−1)).   (3.6)

This observation is known as the chain rule or product rule of probability. It follows immediately from the definition of conditional probability in equation 3.5. For example,

    P(a, b, c) = P(a | b, c) P(b | c) P(c).
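The three-variable example can be verified numerically on any joint distribution, since the conditionals are just ratios of marginals. A minimal sketch over a randomly generated joint on three binary variables:

```python
# Minimal sketch of the chain rule (equation 3.6) for three variables:
# P(a, b, c) = P(a | b, c) P(b | c) P(c), verified on a random joint table.
import itertools
import random

random.seed(0)
# Build a random joint distribution over three binary variables.
raw = {abc: random.random() for abc in itertools.product((0, 1), repeat=3)}
Z = sum(raw.values())
P = {abc: v / Z for abc, v in raw.items()}  # normalized joint P(a, b, c)

def marg(keep):
    """Marginalize P down to the variable positions listed in `keep`."""
    out = {}
    for abc, v in P.items():
        key = tuple(abc[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

P_bc = marg((1, 2))  # P(b, c)
P_c = marg((2,))     # P(c)

for (a, b, c), p_abc in P.items():
    p_a_given_bc = p_abc / P_bc[(b, c)]          # equation 3.5
    p_b_given_c = P_bc[(b, c)] / P_c[(c,)]
    assert abs(p_abc - p_a_given_bc * p_b_given_c * P_c[(c,)]) < 1e-12
```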

Independence and Conditional Independence

Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:

    ∀x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x) p(y = y).            (3.7)

Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:

    ∀x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z) p(y = y | z = z).   (3.8)

We can denote independence and conditional independence with compact notation: x⊥y means that x and y are independent, while x⊥y | z means that x and y are conditionally independent given z.

Expectation

The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P. For discrete variables this can be computed with a summation:

    E_{x∼P}[f(x)] = Σ_x P(x) f(x),                                 (3.9)

while for continuous variables, it is computed with an integral:

    E_{x∼p}[f(x)] = ∫ p(x) f(x) dx.                                (3.10)

When the identity of the distribution is clear from the context, we may simply write the name of the random variable that the expectation is over, as in E_x[f(x)]. If it is clear which random variable the expectation is over, we may omit the subscript entirely, as in E[f(x)]. By default, we can assume that E[·] averages over the values of all the random variables inside the brackets. Likewise, when there is no ambiguity, we may omit the square brackets.

Expectations are linear, for example:

    E_x[αf(x) + βg(x)] = αE_x[f(x)] + βE_x[g(x)],                  (3.11)

when α and β are not dependent on x.
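Linearity of expectation (equation 3.11) is easy to confirm on a small discrete distribution. A minimal sketch (the PMF, the functions f and g, and the constants are arbitrary examples):

```python
# Minimal sketch of equations 3.9 and 3.11: compute E[f(x)] by summation
# and confirm linearity E[a*f(x) + b*g(x)] = a*E[f(x)] + b*E[g(x)].
P = {0: 0.2, 1: 0.5, 2: 0.3}  # example PMF (an assumption)
f = lambda x: x * x
g = lambda x: 3 - x

def E(h):
    # equation 3.9: E[h(x)] = sum_x P(x) h(x)
    return sum(p * h(x) for x, p in P.items())

alpha, beta = 2.0, -1.5  # constants that do not depend on x
lhs = E(lambda x: alpha * f(x) + beta * g(x))
rhs = alpha * E(f) + beta * E(g)
assert abs(lhs - rhs) < 1e-12
```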
Variance and Covariance

The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution:

    Var(f(x)) = E[(f(x) − E[f(x)])^2].                             (3.12)

When the variance is low, the values of f(x) cluster near their expected value. The square root of the variance is known as the standard deviation.

The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:

    Cov(f(x), g(y)) = E[(f(x) − E[f(x)])(g(y) − E[g(y)])].         (3.13)

High absolute values of the covariance mean that the values change very much and are both far from their respective means at the same time. If the sign of the covariance is positive, then both variables tend to take on relatively high values simultaneously. If the sign of the covariance is negative, then one variable tends to take on a relatively high value at the times that the other takes on a relatively low value and vice versa. Other measures such as correlation normalize the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.

The notions of covariance and dependence are related, but are in fact distinct concepts. Independent variables have zero covariance, but zero covariance does not imply independence. For example, suppose we sample x from a uniform distribution over [−1, 1], and we choose the value of s to be 1 with probability one half; otherwise, we choose s to be −1. We can generate a random variable y by assigning y = sx. Then x and y are not independent, because x completely determines the magnitude of y. However, Cov(x, y) = 0.

The covariance matrix of a random vector x ∈ R^n is an n × n matrix, such that

    Cov(x)_{i,j} = Cov(x_i, x_j).                                  (3.14)

The diagonal elements of the covariance give the variance:

    Cov(x_i, x_i) = Var(x_i).                                      (3.15)

Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable. It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the random variable being equal to 1. It has the following properties:

    P(x = 1) = φ                                                   (3.16)
    P(x = 0) = 1 − φ                                               (3.17)
    P(x = x) = φ^x (1 − φ)^(1−x)                                   (3.18)
    E_x[x] = φ                                                     (3.19)
    Var_x(x) = φ(1 − φ)                                            (3.20)

Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite.
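The Bernoulli properties in equations 3.16 to 3.20 can all be recovered from the PMF of equation 3.18 by summing over the two states. A minimal sketch (the value of φ is an arbitrary example):

```python
# Minimal sketch of equations 3.16-3.20: the Bernoulli PMF and its
# mean and variance, checked by direct summation over the two states.
phi = 0.3  # example parameter in [0, 1]

def pmf(x):
    # equation 3.18: P(x = x) = phi^x (1 - phi)^(1 - x)
    return (phi ** x) * ((1 - phi) ** (1 - x))

assert abs(pmf(1) - phi) < 1e-12            # P(x = 1) = phi
assert abs(pmf(0) - (1 - phi)) < 1e-12      # P(x = 0) = 1 - phi

mean = sum(x * pmf(x) for x in (0, 1))
var = sum((x - mean) ** 2 * pmf(x) for x in (0, 1))
assert abs(mean - phi) < 1e-12              # E[x] = phi
assert abs(var - phi * (1 - phi)) < 1e-12   # Var(x) = phi(1 - phi)
```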
Gaussian Distribution

The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution. Parametrized by variance:

    N(x; µ, σ^2) = sqrt(1/(2πσ^2)) exp(−(1/(2σ^2))(x − µ)^2).      (3.21)

See figure 3.1 for a plot of the density function.

The two parameters µ ∈ R and σ ∈ (0, ∞) control the normal distribution. The parameter µ gives the coordinate of the central peak. This is also the mean of the distribution: E[x] = µ. The standard deviation of the distribution is given by σ, and the variance by σ^2.

When we evaluate the PDF, we need to square and invert σ. When we need to frequently evaluate the PDF with different parameter values, a more efficient way of parametrizing the distribution is to use a parameter β ∈ (0, ∞) to control the precision, or inverse variance, of the distribution. Parametrized by precision:

    N(x; µ, β^−1) = sqrt(β/(2π)) exp(−(1/2)β(x − µ)^2).            (3.22)

Normal distributions are a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two major reasons.
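Equations 3.21 and 3.22 describe the same density, related by β = 1/σ^2. A minimal sketch that implements both parametrizations and checks they agree at several points (the parameter values are arbitrary examples):

```python
# Minimal sketch of equations 3.21 and 3.22: the Gaussian density in the
# variance parametrization and the precision parametrization
# (beta = 1 / sigma^2); the two must agree pointwise.
import math

def normal_var(x, mu, sigma2):
    # equation 3.21
    return math.sqrt(1.0 / (2 * math.pi * sigma2)) * math.exp(
        -((x - mu) ** 2) / (2 * sigma2))

def normal_prec(x, mu, beta):
    # equation 3.22
    return math.sqrt(beta / (2 * math.pi)) * math.exp(
        -0.5 * beta * (x - mu) ** 2)

mu, sigma2 = 1.0, 4.0  # example parameters
for x in (-2.0, 0.0, 1.0, 3.5):
    assert abs(normal_var(x, mu, sigma2)
               - normal_prec(x, mu, 1.0 / sigma2)) < 1e-12
```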
Gaussian Distribution

[Plot of p(x) for the standard normal density: maximum at x = µ, inflection points at x = µ ± σ]

Figure 3.1: The normal distribution: The normal distribution N(x; µ, σ^2) exhibits a classic "bell curve" shape, with the x coordinate of its central peak given by µ, and the width of its peak controlled by σ. In this example, we depict the standard normal distribution, with µ = 0 and σ = 1.
Multivariate Gaussian

The normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model. Fully developing and justifying this idea requires more mathematical tools, and is postponed to section 19.4.2.

The normal distribution generalizes to R^n, in which case it is known as the multivariate normal distribution. It may be parametrized with a positive definite symmetric matrix Σ. Parametrized by covariance matrix:

    N(x; µ, Σ) = sqrt(1/((2π)^n det(Σ))) exp(−(1/2)(x − µ)^T Σ^−1 (x − µ)).   (3.23)

The parameter µ still gives the mean of the distribution, though now it is vector-valued. The parameter Σ gives the covariance matrix of the distribution. As in the univariate case, when we wish to evaluate the PDF several times for many different values of the parameters, the covariance is not a computationally efficient way to parametrize the distribution, since we need to invert Σ to evaluate the PDF. We can instead use a precision matrix β. Parametrized by precision matrix:

    N(x; µ, β^−1) = sqrt(det(β)/(2π)^n) exp(−(1/2)(x − µ)^T β (x − µ)).       (3.24)

We often fix the covariance matrix to be a diagonal matrix. An even simpler version is the isotropic Gaussian distribution, whose covariance matrix is a scalar times the identity matrix.
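For n = 2, equation 3.23 can be evaluated with an explicit 2×2 inverse and determinant. A minimal sketch (toy parameters) that also checks a known special case: with a diagonal covariance, the joint density factorizes into a product of univariate normal densities:

```python
# Minimal sketch of equation 3.23 for n = 2: evaluate the multivariate
# normal density with an explicit 2x2 covariance inverse/determinant,
# and check it reduces to a product of univariate densities when the
# covariance matrix is diagonal (independent components).
import math

def mvn_pdf_2d(x, mu, Sigma):
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # Sigma^-1
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # quadratic form (x - mu)^T Sigma^-1 (x - mu)
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** 2 * det)

def normal_pdf(x, mu, sigma2):
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(
        2 * math.pi * sigma2)

mu, Sigma = [1.0, -2.0], [[2.0, 0.0], [0.0, 0.5]]  # diagonal example
x = [0.5, -1.0]
expected = normal_pdf(x[0], mu[0], 2.0) * normal_pdf(x[1], mu[1], 0.5)
assert abs(mvn_pdf_2d(x, mu, Sigma) - expected) < 1e-12
```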
More Distributions

Exponential: in the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. To accomplish this, we can use the exponential distribution:

    p(x; λ) = λ 1_{x≥0} exp(−λx).                                  (3.25)

The exponential distribution uses the indicator function 1_{x≥0} to assign probability zero to all negative values of x.

Laplace: a closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point µ is the Laplace distribution:

    Laplace(x; µ, γ) = (1/(2γ)) exp(−|x − µ|/γ).                   (3.26)

Dirac: in some cases, we wish to specify that all of the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function, δ(x):

    p(x) = δ(x − µ).                                               (3.27)

The Dirac delta function is defined such that it is zero-valued everywhere except 0, yet integrates to 1. We can think of it as the limit point of a series of functions that put less and less mass on all points other than zero. By defining p(x) to be δ shifted by −µ we obtain an infinitely narrow and infinitely high peak of probability mass where x = µ.

Empirical Distribution

A common use of the Dirac delta distribution is as a component of an empirical distribution,

    p̂(x) = (1/m) Σ_{i=1}^{m} δ(x − x^(i)),                         (3.28)

which puts probability mass 1/m on each of the m points x^(1), ..., x^(m) forming a given collection of samples. The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables. For discrete variables, an empirical distribution can be conceptualized as a multinoulli distribution, with a probability associated to each possible input value that is simply equal to the empirical frequency of that value in the training set.
Mixture Distributions

It is also common to define probability distributions by combining simpler distributions into a mixture distribution. A mixture distribution is made up of several component distributions. On each trial, the choice of which component distribution generates the sample is determined by sampling a component identity from a multinoulli distribution:

    P(x) = Σ_i P(c = i) P(x | c = i)                               (3.29)

where P(c) is the multinoulli distribution over component identities.

We have already seen one example of a mixture distribution: the empirical distribution over real-valued variables is a mixture distribution with one Dirac component for each training example.

[Figure 3.2: samples from a Gaussian mixture with three components, plotted in the (x1, x2) plane]
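Sampling from a mixture follows equation 3.29 directly: first draw the component identity c from the multinoulli, then draw x from the chosen component. A minimal sketch with Gaussian components (all weights, means, and standard deviations are toy values):

```python
# Minimal sketch of equation 3.29: sample from a mixture by first
# drawing a component identity c from a multinoulli distribution, then
# sampling x from the chosen Gaussian component (toy parameters).
import random

random.seed(0)
weights = [0.2, 0.5, 0.3]   # P(c = i): a multinoulli over 3 components
means = [-4.0, 0.0, 5.0]    # per-component Gaussian means (examples)
stds = [1.0, 0.5, 2.0]      # per-component standard deviations (examples)

def sample_mixture():
    c = random.choices(range(3), weights=weights)[0]  # component identity
    return random.gauss(means[c], stds[c])            # x | c = c

samples = [sample_mixture() for _ in range(10)]
assert len(samples) == 10
assert all(isinstance(s, float) for s in samples)
```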


Logistic Sigmoid

[Plot of σ(x) for x from −10 to 10, rising smoothly from 0 to 1]

Figure 3.3: The logistic sigmoid function.

Commonly used to parametrize Bernoulli distributions.
Softplus Function

[Plot of ζ(x) for x from −10 to 10, rising from 0 toward the line y = x]

Figure 3.4: The softplus function.
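The two functions plotted above are σ(x) = 1/(1 + exp(−x)) and the softplus ζ(x) = log(1 + exp(x)). A minimal sketch implementing both and checking two standard identities, σ(x) + σ(−x) = 1 and ζ(x) − ζ(−x) = x:

```python
# Minimal sketch of the functions in figures 3.3 and 3.4: the logistic
# sigmoid sigma(x) = 1 / (1 + exp(-x)) and the softplus
# zeta(x) = log(1 + exp(x)), with two identities relating them.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return math.log1p(math.exp(x))  # log(1 + e^x), numerically stable

assert abs(sigmoid(0.0) - 0.5) < 1e-12          # sigma(0) = 1/2
for x in (-3.0, -1.0, 0.5, 2.0):
    # sigma(x) + sigma(-x) = 1
    assert abs(sigmoid(x) + sigmoid(-x) - 1.0) < 1e-12
    # zeta(x) - zeta(-x) = x
    assert abs(softplus(x) - softplus(-x) - x) < 1e-12
```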
le Bayes’ Rule
n a situation where we know P (y | x) and need to know
we also know P (x), we can compute the desired quantity

P (x)P (y | x)
P (x | y) = . (3.42)
P (y)
ppears in the formula, it is usually feasible to compute
, so we do not need to begin with knowledge of P (y).

70

(Goodfellow 2016)
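A minimal sketch of equation 3.42, with P(y) computed via the sum rule exactly as described above. The scenario and all numbers are illustrative assumptions, not data from any real test:

```python
# Minimal sketch of equation 3.42 with P(y) computed via the sum rule,
# P(y) = sum_x P(y | x) P(x), so no direct knowledge of P(y) is needed.
# Toy scenario (all numbers are illustrative assumptions): x is whether
# a part is defective, y is whether an inspection flags it.
P_x = {True: 0.01, False: 0.99}          # prior P(x)
P_y_given_x = {True: 0.95, False: 0.05}  # P(y = flagged | x)

P_y = sum(P_y_given_x[x] * P_x[x] for x in P_x)   # evidence, via sum rule
posterior = P_x[True] * P_y_given_x[True] / P_y   # P(x = defective | flagged)

assert abs(P_y - (0.95 * 0.01 + 0.05 * 0.99)) < 1e-12
assert 0.0 < posterior < 1.0
assert abs(posterior - 0.0095 / 0.059) < 1e-12
```

Note how small the posterior remains (about 0.16) despite the accurate test: the low prior P(x = defective) dominates, which is exactly the kind of reasoning Bayes' rule makes explicit.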
Change of Variables

Suppose we have two random variables x and y related by an invertible, differentiable function y = g(x). In the scalar case,

    p_x(x) = p_y(g(x)) |∂g(x)/∂x|.                                 (3.46)

In higher dimensions, the derivative generalizes to the determinant of the Jacobian matrix, the matrix with J_{i,j} = ∂x_i/∂y_j. Thus, for real-valued vectors x and y,

    p_x(x) = p_y(g(x)) |det(∂g(x)/∂x)|.                            (3.47)
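Equation 3.46 can be checked on the simplest possible case: x uniform on [0, 1] and y = g(x) = x/2, so y is uniform on [0, 1/2] with density 2 (the density must scale up by the Jacobian factor so that probability mass is conserved). A minimal sketch:

```python
# Minimal sketch of equation 3.46 in the scalar case: if x ~ U(0, 1) and
# y = g(x) = x / 2, then p_x(x) = p_y(g(x)) |dg/dx|, i.e. the density of
# y on [0, 1/2] must be 2 so that probability mass is conserved.
def p_x(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0   # uniform density on [0, 1]

def p_y(y):
    return 2.0 if 0.0 <= y <= 0.5 else 0.0   # density of y = x / 2

g = lambda x: x / 2.0
dg_dx = 0.5                                   # |dg(x)/dx|

for x in (0.1, 0.4, 0.9):
    assert abs(p_x(x) - p_y(g(x)) * dg_dx) < 1e-12
```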
Information Theory

We would like independent events to have additive information: for example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that it has come up as heads once. To satisfy these properties, we define the self-information of an event x = x to be

    I(x) = −log P(x).                                              (3.48)

We use log to mean the natural logarithm, with base e. Our I(x) is therefore written in units of nats. One nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats.

When x is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur.

Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:

    H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)],                     (3.49)

also denoted H(P). In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits (if the logarithm is base 2, otherwise the units are different) needed on average to encode symbols drawn from a distribution P. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. See figure 3.5 for a demonstration. When x is continuous, the Shannon entropy is known as the differential entropy.

If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

    D_KL(P‖Q) = E_{x∼P}[log (P(x)/Q(x))] = E_{x∼P}[log P(x) − log Q(x)].   (3.50)
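Equations 3.49 and 3.50 are short sums for discrete distributions. A minimal sketch (the two example distributions are arbitrary) that checks the qualitative claims above: near-deterministic distributions have low entropy, the uniform is maximal, and the KL divergence is nonnegative and asymmetric:

```python
# Minimal sketch of equations 3.48-3.50: Shannon entropy and the KL
# divergence for small discrete distributions, in nats (natural log).
import math

def entropy(P):
    # H(P) = -sum_x P(x) log P(x)   (equation 3.49)
    return -sum(p * math.log(p) for p in P if p > 0)

def kl(P, Q):
    # D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x))   (equation 3.50)
    return sum(p * (math.log(p) - math.log(q)) for p, q in zip(P, Q) if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.90, 0.04, 0.03, 0.03]   # nearly deterministic example

# Nearly deterministic distributions have low entropy; uniform is maximal.
assert entropy(peaked) < entropy(uniform)
assert abs(entropy(uniform) - math.log(4)) < 1e-12

# The KL divergence is nonnegative and asymmetric.
assert kl(peaked, uniform) >= 0.0
assert abs(kl(peaked, uniform) - kl(uniform, peaked)) > 1e-3
```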
Entropy of a Bernoulli Variable

[Plot: Shannon entropy in nats (vertical axis, 0.0 to 0.7) against the Bernoulli parameter p (horizontal axis, 0.0 to 1.0), peaking at p = 0.5]

Figure 3.5: This plot shows how distributions that are closer to deterministic have low Shannon entropy while distributions that are close to uniform have high Shannon entropy.
The KL Divergence is Asymmetric

Left panel: q* = argmin_q D_KL(p‖q). Right panel: q* = argmin_q D_KL(q‖p).

[Two panels plotting probability density against x, each comparing p(x) with the optimal q*(x)]

Figure 3.6: The KL divergence is asymmetric. Suppose we have a distribution p(x) and wish to approximate it with another distribution q(x).
Directed Model

[Graph over a, b, c, d, e with directed edges a→b, a→c, b→c, b→d, c→e]

Figure 3.7: A directed graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as

    p(a, b, c, d, e) = p(a) p(b | a) p(c | a, b) p(d | b) p(e | c).   (3.54)

This graph allows us to quickly see some properties of the distribution. For example, a and c interact directly, but a and e interact only indirectly via c.
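The factorization in equation 3.54 can be evaluated directly once each factor is stored as a conditional probability table. A minimal sketch with randomly generated tables over binary variables (the tables are arbitrary; the point is that the product of valid factors is automatically a valid joint):

```python
# Minimal sketch of equation 3.54: evaluate the directed factorization
# p(a, b, c, d, e) = p(a) p(b|a) p(c|a,b) p(d|b) p(e|c) from toy
# conditional probability tables, and check that it normalizes.
import itertools
import random

random.seed(0)

def random_cpt(n_parents):
    """Random conditional probability table: parents -> P(child = 1)."""
    return {ps: random.random()
            for ps in itertools.product((0, 1), repeat=n_parents)}

def bern(p1, v):
    """P(child = v) given P(child = 1) = p1."""
    return p1 if v == 1 else 1.0 - p1

p_a = random.random()     # P(a = 1)
cpt_b = random_cpt(1)     # P(b = 1 | a)
cpt_c = random_cpt(2)     # P(c = 1 | a, b)
cpt_d = random_cpt(1)     # P(d = 1 | b)
cpt_e = random_cpt(1)     # P(e = 1 | c)

def joint(a, b, c, d, e):
    # equation 3.54, factor by factor
    return (bern(p_a, a) * bern(cpt_b[(a,)], b) * bern(cpt_c[(a, b)], c)
            * bern(cpt_d[(b,)], d) * bern(cpt_e[(c,)], e))

total = sum(joint(*v) for v in itertools.product((0, 1), repeat=5))
assert abs(total - 1.0) < 1e-12   # the product is a valid joint distribution
```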
Undirected Model

[Graph over a, b, c, d, e with undirected edges among a, b, c, plus b–d and c–e]

Figure 3.8: An undirected graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as

    p(a, b, c, d, e) = (1/Z) φ^(1)(a, b, c) φ^(2)(b, d) φ^(3)(c, e).   (3.56)

This graph allows us to quickly see some properties of the distribution. For example, a and c interact directly, but a and e interact only indirectly via c.
