Chapter 02
Lecturer: McGreevy
How we count in 8.044: first one random variable, then two random variables, then 10²⁴
random variables.
A random variable (RV) is a quantity whose value can’t be predicted (with certainty)
given what we know. We have limited knowledge, in the form of a probability distribution.
Two basic sources of uncertainty (and hence RVs) in physics:
1. quantum mechanics (8.04)
2. ignorance (8.044)
This happens when we have insufficient information to fully specify the state of the
system, e.g. because there are just too many bits for us to keep track of. Suppose we
know P, V, T, E, S of a cylinder of gas – this determines the macrostate. The positions and
momenta x⃗, p⃗ of the 10²⁴ molecules are not specified by this. So we will use RVs to describe
the microstate.
Types of RVs:
• continuous
• discrete
• mixed: both continuous and discrete components (e.g. the energy spectrum of the H atom
has discrete bound-state energies below a continuum)
Probability theory can be based on an ensemble of similarly prepared systems.
(e.g. suppose we are interested in the air in the room. Imagine M copies of the room, fill
each with gas of N2 , O2 , ... and one Xe atom. x = position of the Xe atom.)
[Figure: M similarly prepared rooms, labeled 1, 2, 3, …, M; the Xe position in room n is xₙ.]
Make a histogram of how many of the rooms have each value of x, or rather have their
value of x in some interval of possible values. We define the probability density for x to be

p_x(ζ) ≡ lim_{M→∞, dζ→0} (number of systems with x ∈ [ζ, ζ + dζ]) / (M dζ).
A simple example of assembling a probability distribution: shoot arrows at a target.
Make a histogram of values of x: divide up the possible locations into bins of size dζ, and
count how many times the arrow has an x-position in each bin. Shoot M arrows.
In these plots, I am just showing raw numbers of hits in each bin on the vertical axis, and
I am fixing the bin size. In the three plots I’ve taken M = 500, 5000, 50000.
Claim: [histogram] → [smooth distribution] as M → ∞, dζ → 0.
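This claim is easy to check numerically. A minimal sketch (in Python with NumPy; taking the arrow's x-position to be gaussian is my assumption, not part of the notes): as M grows and the bins shrink, the normalized histogram approaches the smooth density.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_density(M, bin_size):
    """Shoot M 'arrows' (here: gaussian x-positions, an assumed choice),
    histogram them, and normalize so (counts/M)/bin_size estimates p(x)."""
    x = rng.normal(0.0, 1.0, size=M)
    edges = np.arange(-5.0, 5.0 + bin_size, bin_size)
    counts, edges = np.histogram(x, bins=edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, counts / (M * bin_size)

# As M -> infinity and d(zeta) -> 0, the histogram approaches the smooth
# density p(x) = exp(-x^2/2) / sqrt(2*pi).
centers, p_est = empirical_density(M=200_000, bin_size=0.1)
p_true = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi)
max_err = np.max(np.abs(p_est - p_true))
```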
Properties of px :
• Prob(x between ζ and ζ + dζ) = px (ζ)dζ. (So the probability of hitting any particular
point exactly is zero if p is smooth. But see below about delta functions.)
• p is real
• px (ζ) ≥ 0.
• Prob(x ∈ (a, b)) = ∫_a^b p_x(ζ) dζ

• The fact that the arrow always ends up somewhere means that the distribution is
normalized:
∫_{−∞}^{∞} p_x(ζ) dζ = 1.

• the units of p_x are 1/(units of x), so that p_x(ζ) dζ is dimensionless.
The probability density is a way of packaging what we do know about a RV. Another
convenient package for the same info is:
Cumulative probability
P(ζ) ≡ Prob(x < ζ) = ∫_{−∞}^{ζ} dζ′ p_x(ζ′).

(Note the serif on the capital P.) To get back the probability density:

p_x(ζ) = (d/dζ) P(ζ).
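A quick numerical check of this pair of relations (a sketch; using a gaussian as a convenient assumed density):

```python
import numpy as np

zeta = np.linspace(-6, 6, 2001)
p = np.exp(-zeta**2 / 2) / np.sqrt(2 * np.pi)   # density p_x(zeta), assumed gaussian

# Cumulative probability: trapezoid-rule integral of p from the left edge
# (effectively -infinity, since the gaussian tail at -6 is negligible).
P = np.concatenate([[0.0], np.cumsum(0.5 * (p[1:] + p[:-1]) * np.diff(zeta))])

# Differentiating the cumulative probability recovers the density.
p_back = np.gradient(P, zeta)
```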
The question we have to answer in this chapter of the course is: “Given p or P , what can
we learn?” How to determine p is the subject of 8.04 and the whole rest of 8.044. So in the
next few lectures, we'll have to cope with Probability Densities from Outer Space, i.e. I'm
not going to explain where they come from.
Example of a discrete density:
Repackage:
p_n(ζ) = Σ_{n=0}^{∞} p(n) δ(ζ − n) = Σ_{n=0}^{∞} (1 − a)ⁿ a δ(ζ − n).
(δ(x) here is the Dirac delta function, which has the property
that it vanishes for any x ≠ 0 but integrates to 1 around any
neighborhood of x = 0.) The point of this interlude is that all
manipulations can be done in the same formalism for the discrete and continuum cases; we
don't need to write separate equations for the discrete case.
What to do with a probability density? p(x) is a lot of information. Here’s how to extract
the bits which are usually most useful.
Averages

Mean: ⟨x⟩ ≡ ∫_{−∞}^{∞} x p(x) dx

Mean square: ⟨x²⟩ ≡ ∫_{−∞}^{∞} x² p(x) dx

nth Moment: ⟨xⁿ⟩ ≡ ∫_{−∞}^{∞} xⁿ p(x) dx

⟨f(x)⟩ ≡ ∫_{−∞}^{∞} f(ζ) p_x(ζ) dζ = ∫_{−∞}^{∞} f(x) p(x) dx
Meaning: make a histogram of f (x) instead of a histogram of x. What is the mean of this
histogram?
Standard deviation ≡ stdev ≡ √Variance

Note: Var(x) ≡ ⟨(x − ⟨x⟩)²⟩ = ∫_{−∞}^{∞} dx p(x) (x − ⟨x⟩)²
= ∫_{−∞}^{∞} dx p(x) x² − 2⟨x⟩ ∫_{−∞}^{∞} dx p(x) x + ⟨x⟩² ∫_{−∞}^{∞} dx p(x)
= ⟨x²⟩ − 2⟨x⟩⟨x⟩ + ⟨x⟩² = ⟨x²⟩ − ⟨x⟩².
Example: the Gaussian density (ubiquitous, and provably so)

p(x) = (1/√(2πσ²)) exp[−(x − a)²/(2σ²)]

The plots show the gaussian distribution for a = 4, σ = 1 and a = 4, σ = 1/2. The lines
appear at x = a − σ, a, a + σ. Visibly, a determines the center of the peak, and σ determines
its width.
⟨x⟩ = ∫_{−∞}^{∞} dx x p(x) = a.

⟨x²⟩ = ∫_{−∞}^{∞} dx x² p(x)
(substituting η ≡ x − a)
= (1/√(2πσ²)) ∫_{−∞}^{∞} dη (η² + 2ηa + a²) exp[−η²/(2σ²)]
(the η² term is evaluated in 2.1.2 below; the 2ηa term is an odd integral and vanishes;
the a² term gives a²⟨1⟩ = a²)
= σ² + a².
Gaussian density, cont’d
Please see Prof. Greytak’s probability notes (page 13), and the subsections below, and your
recitation instructors, for more on how to do the integrals. The conclusion here is that for
the gaussian distribution,
Var(x) = σ 2 , stdev(x) = σ.
Cumulative probability:

P(x) = (1/√(2πσ²)) ∫_{−∞}^{x} dζ e^{−(ζ−a)²/(2σ²)}.
This is an example to get us comfortable with the notation, and it’s an example which
arises over and over. Soon: why it arises over and over (Central Limit Theorem).
2.1.1 A trick for doing the gaussian integral by squaring it

Write the gaussian as p(x) = C e^{−x²/(2s²)} and square the normalization integral, then
go to polar coordinates:

1 = C² ∫ dx₁ ∫ dx₂ e^{−x₁²/(2s²)} e^{−x₂²/(2s²)} = C² ∫ r dr dθ e^{−r²/(2s²)} = 2πC² ∫₀^∞ r dr e^{−r²/(2s²)}

This is the square of a single gaussian integral. Let u = r²/(2s²), so r dr = s² du:

1 = 2πC²s² ∫₀^∞ du e^{−u} = 2πC²s² (−e^{−u})|₀^∞ = 2πC²s²

⇒ C = 1/√(2πs²).
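The conclusion of the trick, C = 1/√(2πs²) (equivalently ∫_{−∞}^{∞} e^{−x²/(2s²)} dx = √(2πs²)), can be confirmed by brute-force quadrature (a sketch; the grid half-width of 10 standard deviations is an arbitrary choice):

```python
import numpy as np

def gaussian_integral(s, half_width=10.0, n=100_001):
    """Riemann-sum estimate of the integral of exp(-x^2/(2 s^2)) over the
    real line; the tails beyond half_width*s are negligible."""
    x = np.linspace(-half_width * s, half_width * s, n)
    dx = x[1] - x[0]
    return float(np.sum(np.exp(-x**2 / (2 * s**2))) * dx)

s = 1.7
I = gaussian_integral(s)
C = 1.0 / np.sqrt(2 * np.pi * s**2)   # the claimed normalization constant
# C * I should equal 1, i.e. I = sqrt(2*pi*s^2).
```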
Another example: Poisson density
Imagine a situation where events occur over and over at random points on the real line.
This real line could describe e.g. the random times at which a geiger counter clicks from
a radioactive source, or random positions of galaxies. More precisely, suppose the events
satisfy the conditions:
1. In the limit dx → 0, the probability that one and only one event occurs between x and
x + dx is rdx with r independent of x. (r is a rate.)
2. The probability of an event occurring in some interval dx is independent of what
happens in other intervals. (Roughly: the clicks of the counter don’t care about each
other. Not: each radioactive decay triggers an avalanche...)
Under these circumstances, in a finite interval of length L (i.e. not an infinitesimal interval
like dx),
Prob(n events occur) ≡ p(n) = (1/n!) (rL)ⁿ e^{−rL}    (Poisson density)
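The two conditions can be simulated directly (a sketch: I scatter a fixed number of events uniformly on a long line, which realizes a constant rate with no correlations between intervals; all parameter values are arbitrary):

```python
import numpy as np
from math import factorial, exp

rng = np.random.default_rng(1)
r, L, T = 2.0, 1.5, 1000.0     # rate, counting window, total line length (L << T)
K = int(r * T)                 # total number of events scattered on the line
n_trials = 10_000

# Count how many events land in [0, L) in each trial.
counts = np.empty(n_trials, dtype=int)
for i in range(n_trials):
    events = rng.uniform(0.0, T, size=K)
    counts[i] = int(np.count_nonzero(events < L))

# Empirical frequencies of n = 0..11 events vs the Poisson formula.
p_emp = np.bincount(counts, minlength=12)[:12] / n_trials
p_pois = np.array([(r * L)**n * exp(-r * L) / factorial(n) for n in range(12)])
```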
Poisson density, cont'd

Normalize:
1 = ∫_{−∞}^{∞} dy p(y) = ∫_{−∞}^{∞} dy ( Σ_{n=0}^{∞} p(n) δ(y − n) )
= Σ_{n=0}^{∞} p(n) ∫_{−∞}^{∞} dy δ(y − n)    [each integral = 1]
= e^{−rL} Σ_{n=0}^{∞} (1/n!)(rL)ⁿ = e^{−rL} e^{rL} = 1.

⟨n⟩ = ∫_{−∞}^{∞} dy p(y) y = ∫_{−∞}^{∞} dy Σ_{n=0}^{∞} p(n) δ(y − n) y = Σ_{n=0}^{∞} n p(n)
= e^{−rL} Σ_{n=0}^{∞} (n/n!)(rL)ⁿ = e^{−rL} rL Σ_{n=1}^{∞} (1/(n − 1)!)(rL)^{n−1} = rL.
(The n = 0 term doesn't contribute; the remaining sum is e^{rL}.)

Or: differentiate the exponential series with respect to r,
(∂/∂r) [ e^{rL} = Σ_{n=0}^{∞} (1/n!)(rL)ⁿ ]
⇒ L e^{rL} = Σ_{n=0}^{∞} (n/n!)(rL)ⁿ (1/r),
and the sum on the right-hand side is what we want (up to factors of r and e^{−rL}).
Poisson density, cont’d
⟨n²⟩ = … = Σ_{n=0}^{∞} n² p(n) = … = (rL)(rL + 1).
Note that only the (dimensionless) combination rL appears in the distribution itself.
Rewrite:

p(n) = (1/n!) ⟨n⟩ⁿ e^{−⟨n⟩}.
Getting used to this abuse of notation is useful: n is the random variable; ⟨n⟩ is a number.
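A direct numerical check of these moments, summing the Poisson formula term by term (truncating the sum at n = 100, far beyond where the terms matter for the rL chosen here):

```python
import numpy as np
from math import factorial, exp

rL = 3.0
n = np.arange(0, 101)
p = np.array([rL**k * exp(-rL) / factorial(k) for k in n])

norm = float(p.sum())              # should be 1
mean = float((n * p).sum())        # should be rL
mean_sq = float((n**2 * p).sum())  # should be (rL)(rL + 1)
var = mean_sq - mean**2            # so Var(n) = rL as well
```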
[Plots of the Poisson density for rL = ⟨n⟩ = 1/2, 3, and 25.]
2.2 Two random variables
The extension of the previous discussion to encode partial information about two degrees of
freedom rather than one is pretty straightforward.
p_{x,y}(ζ, η) dζ dη ≡ Prob(ζ ≤ x < ζ + dζ and η ≤ y < η + dη)
As usual, the fact that some outcome must occur requires us to normalize the distribution
as
1 = ∫_{all possibilities} p = ⟨1⟩ = ∫_{−∞}^{∞} dζ ∫_{−∞}^{∞} dη p_{x,y}(ζ, η).
A new twist with two variables is the following: there are several different
ways to get a distribution for a single random variable from a joint distribution.
If we have no knowledge of y, we can get a probability for x by integrating over all y against
the joint distribution:

p_x(ζ) = ∫_{−∞}^{∞} dη p_{x,y}(ζ, η)
Example: “hockey puck” distribution

p(x, y) = { 1/π,  for x² + y² ≤ 1
          { 0,    for x² + y² > 1.

[top view: a disc of radius 1]

p(x) = ∫_{−∞}^{∞} dy p(x, y) = { ∫_{−√(1−x²)}^{√(1−x²)} dy (1/π) = (2/π)√(1 − x²),  for |x| ≤ 1
                               { 0,  for |x| > 1
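A numerical check of this squashed density (a sketch: integrate the disc density over y at fixed x by the trapezoid rule and compare to the formula):

```python
import numpy as np

def marginal(x, n=200_001):
    """Integrate p(x, y) = 1/pi over y at fixed x (trapezoid rule)."""
    y = np.linspace(-1.0, 1.0, n)
    integrand = np.where(x**2 + y**2 <= 1.0, 1.0 / np.pi, 0.0)
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(y)))

x = 0.3
p_num = marginal(x)
p_exact = (2.0 / np.pi) * np.sqrt(1.0 - x**2)   # the formula above
```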
Here is a second way to get a probability distribution for one variable from a joint distri-
bution for two. It is a little more delicate.
This means that p(ζ|y) must be proportional to px,y (ζ, y). There is no reason for the latter
quantity to be normalized as a distribution for ζ, however:
p_{x,y}(ζ, y) = c · p_x(ζ|y)
[left side: a slice, not normalized; c: to be determined; p_x(ζ|y): a normalized prob density for ζ]

∫_{−∞}^{∞} dζ p_{x,y}(ζ, y) = c ∫_{−∞}^{∞} dζ p_x(ζ|y)
[the left-hand side is p_y(η = y); the right-hand integral = 1]

⇒ c = p_y(η = y).
Translation into human language:

p(x|y) = p(x, y)/p(y)    (“Bayes' Theorem”)
The only content of this Theorem is: Slice the mountain, then normalize.
Slicing the other way allows Bayes to put his name on another formula:
p(x, y) = p(x|y) p(y).
Looked at this way, this relation is a prescription for building the joint probability moun-
tain from conditional probabilities p(x|y) and the single-variable probability p(y) for the
‘conditioning variable’ y.
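"Slice the mountain, then normalize" in code, on a discretized joint density (the particular mountain exp[−(x² + xy + y²)] is an arbitrary assumed example):

```python
import numpy as np

x = np.linspace(-2, 2, 401)
y = np.linspace(-2, 2, 401)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")

joint = np.exp(-(X**2 + X * Y + Y**2))    # an unnormalized "mountain"
joint /= joint.sum() * dx * dy            # normalize the joint density

p_y = joint.sum(axis=0) * dx              # squash over x: the marginal p(y)
j = 250                                   # grid index of the conditioning value of y
p_x_given_y = joint[:, j] / p_y[j]        # Bayes: slice, then normalize
```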
Returning to the hockey puck example:
Notice that as the conditioning variable y approaches 1, we become more and more certain
about the value of x.
With these defs, we can make a useful characterization of a joint distribution:
Statistical Independence
Two random variables are statistically independent (SI) if the joint distribution factorizes:
or, equivalently, if

p(x|y) = p(x, y)/p(y) = p(x), independent of y

(in words: the conditional probability for x given y is independent of the choice of y. Telling
me y leaves me as ignorant of x as before.) and

p(y|x) = p(x, y)/p(x) = p(y), independent of x
So: for SI RVs, knowing something about one variable gives no additional information about
the other. You are still just as ignorant about the other as you were before you knew anything
about the first one.
Now you have all the ingredients necessary to follow the discussion of the derivation of
the Poisson density in Prof. Greytak’s notes. You have seen a different derivation of this
distribution in recitation, which is reviewed in the final subsection of these notes (2.5).
Example: Jointly gaussian random variables

p(v₁, v₂) = (1/(2πσ²√(1 − ρ²))) exp[−(v₁² − 2ρv₁v₂ + v₂²)/(2σ²(1 − ρ²))]

v₁, v₂: random variables. v is for ‘voltage’ as we'll see later.
ρ, σ: parameters specifying the density. They satisfy σ > 0, −1 ≤ ρ ≤ 1.

First consider ρ = 0:

p(v₁, v₂) = (1/√(2πσ²)) e^{−v₁²/(2σ²)} · (1/√(2πσ²)) e^{−v₂²/(2σ²)} = p(v₁) · p(v₂),

a circularly symmetric gaussian mountain. In this case, v₁ and
v₂ are SI, as demonstrated by the last equality above. Slicing a
gaussian mountain gives gaussian slices.

Now consider ρ ≠ 0.

[Sketches of the mountain: for ρ → −1⁺ it concentrates along v₂ = −v₁; for ρ = 0 it is
circular; for ρ → 1⁻ it concentrates along v₂ = v₁.]
Reduction to a single variable, i.e. “squashing the mountain”:

p(v₁) = ∫_{−∞}^{∞} dv₂ p(v₁, v₂)
= (1/(2πσ²√(1 − ρ²))) exp[−v₁²/(2σ²)] ∫_{−∞}^{∞} dv₂ exp[−(v₂ − ρv₁)²/(2σ²(1 − ρ²))]    (1)

Here we completed the square in the exponent of p(v₁, v₂):

v₁² − 2ρv₁v₂ + v₂² = (v₂ − ρv₁)² − ρ²v₁² + v₁² = (v₂ − ρv₁)² + (1 − ρ²)v₁²

But then this is just a gaussian integral for v₂:

p(v₁) = (1/(2πσ²√(1 − ρ²))) exp[−v₁²/(2σ²)] ∫_{−∞}^{∞} dv₂ exp[−(v₂ − ρv₁)²/(2σ²(1 − ρ²))]
[the remaining integral = √(2πσ²(1 − ρ²))]
= (1/√(2πσ²)) exp[−v₁²/(2σ²)]    (2)

Information about the correlations between v₁ and v₂ (i.e. all data about the effect of ρ on
the joint distribution) is lost in the squashing process. This information is hiding in the
conditional density,

p(v₂|v₁) = p(v₁, v₂)/p(v₁).    (3)

But now the same manipulation of completing the square as above shows that this is

p(v₂|v₁) = (1/√(2πσ²)) (1/√(1 − ρ²)) exp[−(v₂ − ρv₁)²/(2σ²(1 − ρ²))]    (4)

This is a probability density for v₂ which is gaussian, with mean ρv₁ (remember v₁ is a
parameter labelling this distribution for the random variable v₂), with stdev = σ√(1 − ρ²).
Plots of p(v2 |v1 ) for various v1 :
Correlation function:

⟨v₁v₂⟩ ≡ ∫_{−∞}^{∞} dv₁ ∫_{−∞}^{∞} dv₂ v₁ v₂ p(v₁, v₂)

We could just calculate, but we've already done the hard part. Use Bayes here:

⟨v₁v₂⟩ = ∫_{−∞}^{∞} dv₁ ∫_{−∞}^{∞} dv₂ v₁ v₂ p(v₂|v₁) p(v₁)
= ∫_{−∞}^{∞} dv₁ v₁ p(v₁) ∫_{−∞}^{∞} dv₂ v₂ p(v₂|v₁)

The inner integral is the conditional mean, ρv₁, so ⟨v₁v₂⟩ = ρ ∫_{−∞}^{∞} dv₁ v₁² p(v₁) = ρσ².

ρ > 0 : v₁, v₂ correlated
ρ < 0 : v₁, v₂ anticorrelated
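A sampling check that ⟨v₁v₂⟩ = ρσ² (a sketch; generating pairs via v₂ = ρv₁ + √(1 − ρ²)·w, with w an independent gaussian, is my construction, not something from the notes, but it reproduces the joint density above):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, rho = 1.3, -0.6
n = 2_000_000

v1 = rng.normal(0.0, sigma, size=n)
w = rng.normal(0.0, sigma, size=n)
v2 = rho * v1 + np.sqrt(1 - rho**2) * w   # marginal of v2 is again N(0, sigma^2)

corr = float(np.mean(v1 * v2))            # should approach rho * sigma^2
```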
Whence this probability distribution?
Johnson noise: thermal noise in the voltage across an electric circuit with L, R, C (but no
batteries) at room temperature.
e.g.:
Why ρ < 0? Consider a circuit with a sharp resonance. The fluctuations can get a large
contribution from a particular resonant frequency ω. Then at a time separation
τ = period/2 = (2π/ω)·(1/2), v₁ > 0 means v₂ < 0. Hence a negative correlation function:
⟨v₁v₂⟩ ∝ ρ < 0.
2.3 Functions of a random variable
Consider a gas in thermal equilibrium. Suppose you know p(v), where v is the speed of one
atom.
Kinetic energy of an atom ≡ KE = (1/2) m v²
So: what is p(KE)?
Various methods, some fancier than others. Here is a very robust 3-step pictorial method:

A. Sketch the region R_η of values of x for which f(x) < η.

B. Integrate p_x(ζ) over the region found in A. This gives the cumulative probability for f:

P_f(η) = ∫_{R_η} dζ p_x(ζ).

C. Differentiate:

p_f(η) = (d/dη) P_f(η).
Example 1: Kinetic energy of an ideal gas.

We'll see much later in the course that a molecule or atom in an ideal gas has a velocity
distribution of the form:

p(v_x) = (1/√(2πσ²)) exp[−v_x²/(2σ²)]   with σ = √(kT/m)

(v_x is the x-component of the velocity.)

A: R_η = [−√(2η/m), √(2η/m)].

B: P_{KE_x}(η) = ∫_{−√(2η/m)}^{√(2η/m)} p_{v_x}(ζ) dζ

C: p_{KE_x}(η) = (d/dη) P_{KE_x} = (1/√(2mη)) p_{v_x}(√(2η/m)) − (−1/√(2mη)) p_{v_x}(−√(2η/m))
p(KE_x) = { (1/√(πmσ²·KE_x)) exp[−KE_x/(mσ²)],  for KE_x > 0
          { 0,  for KE_x < 0

Whence the divergence? As v → 0, dv/dKE_x → ∞ ⇒ pileup at v = 0.

[Figure: KE vs v; near v = 0 a fixed interval dKE corresponds to a large interval dv.]
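A sampling check of this result (a sketch; the values m = 2, σ = 1 are arbitrary; the first bins near the divergence are skipped, since a histogram bin averages the density over the bin):

```python
import numpy as np

rng = np.random.default_rng(3)
m, sigma = 2.0, 1.0
vx = rng.normal(0.0, sigma, size=1_000_000)
ke = 0.5 * m * vx**2                       # the function of the random variable

# Histogram KE_x, normalized to a density estimate, away from KE_x = 0.
bins = np.linspace(0.2, 4.0, 20)
counts, edges = np.histogram(ke, bins=bins)
centers = 0.5 * (edges[:-1] + edges[1:])
p_emp = counts / (len(ke) * np.diff(edges))

# The formula derived above: exp(-KE_x/(m sigma^2)) / sqrt(pi m sigma^2 KE_x).
p_formula = np.exp(-centers / (m * sigma**2)) / np.sqrt(np.pi * m * sigma**2 * centers)
```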
Note that pf (η) is in general a completely different function from px (ζ). This is the purpose
of the pedantic notation with the subscripts and the dummy variables. A more concise
statement of their relationship is
p_f(η) = ∫_{−∞}^{∞} dζ p_x(ζ) δ(η − f(ζ)).
Unpacking this neat formula is what’s accomplished by the 3-step graphical method. Let’s
work another example with that method.
Example 2: Hockey puck distribution again.
p(x, y) = { 1/π,  for x² + y² ≤ 1
          { 0,    for x² + y² > 1.

A: pick some r, θ. Sketch the region R_{r,θ} ≡ {(r′, θ′) | r′ < r, θ′ < θ}.

B:
P(r, θ) = ∫_{R_{r,θ}} dx dy p(x, y) = { ∫₀^r r′ dr′ ∫₀^θ dθ′ (1/π) = (1/π)(r²/2)θ = (θ/2π) r²,  for r < 1
                                      { θ/2π,  for r > 1.

It's just the fraction of the area of the disc taken up by R_{r,θ}.

C:
p(r, θ) = ∂_r ∂_θ P(r, θ) = { r/π,  r < 1
                            { 0,    r > 1
Note: p(r, θ) = p(r)p(θ). With this joint distribution r, θ are SI, although x, y are not.
2.4 Sums of statistically-independent random variables
Variables describing the macrostate of a gas in the thermodynamic limit = sums of ∼10²³
variables describing the individual particles.

Energy of an ideal gas: E = Σ_{i=1}^{10²³} KE_i
Claim: “all of thermodynamics is about sums of SI RVs”. (a more accurate claim: we can
get all of the effects of thermodynamics from sums of SI RVs.)
Consider N SI RVs x_i, i = 1..N. Let S_N ≡ Σ_{i=1}^{N} x_i. (Note: p_{x_i}(ζ) need not be independent
of i.)
⟨S_N⟩ = ∫ dx₁ … dx_N S_N p(x₁, x₂, …, x_N),   with S_N = x₁ + x₂ + … + x_N

= Σ_{i=1}^{N} ∫ dx_i x_i p(x_i) = Σ_{i=1}^{N} ⟨x_i⟩    (5)

Var(S_N) = Σ_{i=1}^{N} Var(x_i)    (6)
(The cross terms in Var(S_N) with i ≠ j factorize by statistical independence,
⟨x_i x_j⟩ = (∫ dx_i p(x_i) x_i)(∫ dx_j p(x_j) x_j) = ⟨x_i⟩⟨x_j⟩,
and cancel against the corresponding terms in ⟨S_N⟩².)

SO:
Var(sum) = sum(Variances) if RVs are SI

Crucial note: when there are correlations, this statement is not true. Suppose we are given
distributions for two variables (results of squashing): p(x₁), p(x₂).
“iid” RVs

Consider n SI RVs, x₁..x_n. If all the p(x_i) are the same, and the x_i are SI, these are (naturally)
called “independent and identically distributed” (“iid”). Let ⟨x_i⟩ ≡ ⟨x⟩, Var(x_i) ≡ σ². Then
as n → ∞,

⟨S_n⟩ ∝ n,   Var(S_n) ∝ n,   stdev(S_n) ∝ √n

⇒ stdev(S_n)/⟨S_n⟩ ∝ 1/√n. The distribution gets narrower as n → ∞.

Notes:

• This statement is still true if the p(x_i) are different, as long as no subset dominates
the mean. (i.e. it's not like: 1000 vars, 999 with ⟨x_i⟩ = 1 and one with x⋆ = 10⁵².)

• This is the basis of statements like: “The statistical error of the opinion poll was x”.
That means that ∼1/x² people were polled.
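The scaling laws above in a quick simulation (a sketch; the uniform distribution on [0, 1] is an arbitrary choice of iid variable, with ⟨x⟩ = 1/2 and σ² = 1/12):

```python
import numpy as np

rng = np.random.default_rng(4)

def sum_stats(n, trials=50_000):
    """Sample mean and variance of S_n = x_1 + ... + x_n, x_i uniform on [0, 1]."""
    s = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)
    return float(s.mean()), float(s.var())

mean_1, var_1 = 0.5, 1.0 / 12.0              # exact single-variable values
mean_100, var_100 = sum_stats(100)           # should be ~50 and ~100/12

# Relative width shrinks by 1/sqrt(n) relative to a single variable.
rel_width = np.sqrt(var_100) / mean_100
```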
Towards the central limit theorem.
So far we know hSn i, VarSn . What can we say about the shape of p(Sn )?
Answer first for two vars x, y. s ≡ x + y. Given p_{x,y}(ζ, η), what's p(s)? One way to do it
is:

p(s) = ∫ dζ dη p_{x,y}(ζ, η) δ(s − (ζ + η)).

C: The reason we don't do the integral is because we can use the Fundamental Theorem of
Calculus here:

p_s(α) = (d/dα) P_s(α) = ∫_{−∞}^{∞} dζ p_{x,y}(ζ, α − ζ).

Note that this result is completely general – we haven't assumed anything about the joint
distribution. For SI RVs the joint density factorizes, p_{x,y}(ζ, α − ζ) = p_x(ζ) p_y(α − ζ), and
the integral becomes a convolution.

In words: The probability distribution for the sum of SI RVs is the convolution of the two
original distributions.
[Mathy aside on convolutions: given two functions f(x), g(x), their convolution is defined
to be

(f ⊗ g)(x) ≡ ∫_{−∞}^{∞} dz f(z) g(x − z).
A few useful properties of this definition that you can show for your own entertainment:
f ⊗g =g⊗f
f ⊗ (g + h) = f ⊗ g + f ⊗ h
f ⊗ (g ⊗ h) = (f ⊗ g) ⊗ h
Fourier transform of convolution is multiplication. ]
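These identities are easy to verify on discrete sequences (a sketch using np.convolve; the continuum integral becomes a sum, and the grid spacing would multiply both sides equally, so it drops out of the comparisons):

```python
import numpy as np

# Three arbitrary sequences standing in for sampled functions.
f = np.array([1.0, 2.0, 0.5, -1.0])
g = np.array([0.3, 0.7, 1.1, 0.2])
h = np.array([2.0, -1.0, 1.0, 0.5])

comm = np.allclose(np.convolve(f, g), np.convolve(g, f))            # f*g = g*f
dist = np.allclose(np.convolve(f, g + h),
                   np.convolve(f, g) + np.convolve(f, h))           # distributivity
assoc = np.allclose(np.convolve(np.convolve(f, g), h),
                    np.convolve(f, np.convolve(g, h)))              # associativity
```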
Graphical example of convolution, with steps.
Lessons

1) Gaussian: Claim [algebra, Greytak]: if x and y are gaussian (means a, b; variances σ_x², σ_y²),
then p(s = x + y) = (p_x ⊗ p_y)(s) is gaussian. By our previous analysis
of SI RVs, we know its ⟨s⟩ and Var(s). So if it's gaussian:

p(s) = (1/√(2π(σ_x² + σ_y²))) exp[−(s − (a + b))²/(2(σ_x² + σ_y²))]

(the variances add)
2) Poisson: Claim [algebra, Greytak]: the sum of two SI Poisson RVs is also Poisson distributed,
with mean = sum of means = variance. (Recall that a Poisson distribution is fully specified
by the mean.)
3) Lorentzian:

p(x) = (1/π) Γ/((x − m)² + Γ²),   Γ, m parameters

Check normalization: ∫_{−∞}^{∞} dx p(x) = 1.
⟨x⟩ = m.
p(m ± Γ) = p(m)/2. So: Γ = “half-width at half-height” (or
just “width”).

Claim: the sum of two SI Lorentzian RVs is again a Lorentzian, with
⟨s⟩ = m₁ + m₂, Γ_sum = Γ₁ + Γ₂ – the widths add; they do not average out.
This is a useful exception to remember when listening to people who work in finance blather
on about how your money is safe because fluctuations will average out (for example when
they bundle together lots of mortgages each with a fat-tailed distribution of outcomes...)
and in the following.
Central Limit Theorem (CLT)
Let s_n = the sum of n SI RVs, which are identically distributed (so: iid) (⋆),
with mean ⟨x⟩ and variance σ_x² (which must be finite (⋆⋆)).
For large n (⋆⋆⋆),
p(s_n) can often be well-represented by (⋆⋆⋆⋆)
a gaussian, with mean n⟨x⟩ and variance nσ_x²:

p(s_n) ≈ (1/√(2πnσ_x²)) exp[−(s_n − n⟨x⟩)²/(2nσ_x²)]

(here ≈ means “can be well-represented by”)

Fine print

(⋆): The theorem also works if the RVs are not identically distributed, as long as no subset
dominates s.

(⋆⋆): Think of Lorentzians. Convolving many distributions with fat tails gives another
distribution with fat tails, i.e. the fluctuations about the mean are still large.

(⋆⋆⋆): The mathematicians among us would like us to say how large n has to be in order
to achieve a given quality of approximation by a gaussian. This is not our problem,
because n ∼ 10²³ gives an approximation which is good enough for anyone.

(⋆⋆⋆⋆): The reason for vagueness here is an innocent one; it's so that we can incorporate the
possibility of discrete variables. For example, flip a coin 1000 times. The probability
distribution for the total number of heads looks like a gaussian if you squint just a
little bit: the envelope of the histogram is gaussian. This is what we mean by “can be
well-represented by”. (More on this next.)

We will illustrate the origin of the CLT with an example. We'll prove (next) that it works
for many binomially-distributed SI RVs.
Flipping a coin (pset 1): a discrete example.

pset 1: p_N(n) = (1/2^N) × N!/(n!(N − n)!)

(1/2^N = 1/(total # of possible outcomes); N!/(n!(N − n)!) = # of ways to get n heads)

Let n = N/2 + ε; ε ≡ deviation from the mean. Since the distribution only has significant
support when ε ≪ N/2, we will make this approximation everywhere below.

p(N/2 + ε) = (1/2^N) · N!/((N/2 + ε)! (N/2 − ε)!)

The log of a function is always more slowly-varying than the function itself; this means that
it is better-approximated by its Taylor series; for this reason let's take the log of both sides.
Use Stirling's approximation (ln N! ≈ N ln N − N, valid for N ≫ 1):

ln p(N/2 + ε) ≈ N ln(1/2) + N ln N − N
− (N/2 + ε) ln(N/2 + ε) + (N/2 + ε)
− (N/2 − ε) ln(N/2 − ε) + (N/2 − ε)
Now expand about the mean (which is also the maximum) of the distribution, ε = 0, using
ln(N/2 ± ε) ≈ ln(N/2) ± 2ε/N − (1/2)(2ε/N)² + O(ε³/N³):

ln p(N/2 + ε) ≈ N ln(N/2)   [valid for ε ≪ N/2, N ≫ 1]
− (N/2 + ε)(ln(N/2) + 2ε/N − (1/2)(2ε/N)²)
− (N/2 − ε)(ln(N/2) − 2ε/N − (1/2)(2ε/N)²)    (8)

Therefore:

ln p(N/2 + ε) ≈ −(2ε²/N)(1 + O(ε/N)).

Comments: First, the reason that the order-ε terms cancel is that we are expanding around
the maximum of the distribution. The statement that it is the maximum means that the
derivative vanishes there – that derivative is exactly the linear term in the Taylor expansion
– and that the second derivative is negative there; this is an important sign. Second, the
nontrivial statement of the Central Limit Theorem here is not just that we can Taylor
expand the log of the distribution about the maximum. The nontrivial statement is that
the coefficients of terms of higher order than ε² in that Taylor expansion become small as
N → ∞. It is crucial here that the terms we are neglecting go like ε³/N².

p(N/2 + ε) ≈ exp[−ε²/(N/2)].

These expressions are valid for ε small compared to the mean (by O(ε³) I mean “terms of
order ε³ which we are ignoring”), but indeed the mean is N/2 ≫ 1.
p_N(n) ≈ 𝒩 exp[−(n − N/2)²/(N/2)] = 𝒩 exp[−(n − ⟨n⟩)²/(2σ_N²)]

(𝒩 is fixed by normalization; σ_N² = Var_N(n))

It's a gaussian with mean N/2 and variance Var_N(n) = σ_N² = N/4.

This is consistent with the CLT quoted above. For one coin flip, ⟨n⟩ = 1/2, σ₁² = Var(n) =
1/4.
The variance of the distribution of n after N flips is σ_n² = N σ₁² = N/4 (variance adds for
iid RVs).
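This is easy to confirm exactly (a sketch: compare the exact binomial pmf for N = 1000 flips to the gaussian with mean N/2 and variance N/4):

```python
import numpy as np
from math import comb

N = 1000
n = np.arange(0, N + 1)
# Exact coin-flip distribution p_N(n) = C(N, n) / 2^N.
p_binom = np.array([comb(N, k) for k in n], dtype=float) / 2.0**N

# Gaussian with the same mean and variance.
mean, var = N / 2.0, N / 4.0
p_gauss = np.exp(-(n - mean)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

max_err = float(np.max(np.abs(p_binom - p_gauss)))
```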
How quickly does the CLT become true as we increase N ?
N = 2 : S = x1 + x2
N = 4 : S = x1 + x2 + x3 + x4
Some more convolutions, in Tom Greytak’s notes. Consider the one-variable distribution:
p(x) = { (1/a) e^{−x/a},  for x ≥ 0
       { 0,               for x < 0

Let S = Σ_{i=1}^{n} x_i where each x_i is governed by the density p(x). For simplicity, set a = 1.
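A simulation of these sums (a sketch; tracking the skewness of the standardized sum is my choice of diagnostic: for one exponential it is 2, and for the sum of n it shrinks like 2/√n, so the distribution flows toward a gaussian):

```python
import numpy as np

rng = np.random.default_rng(5)

def standardized_skew(n, trials=100_000):
    """Skewness of (S - n)/sqrt(n) for S = sum of n exponentials with a = 1
    (so <x> = 1 and Var(x) = 1)."""
    s = rng.exponential(1.0, size=(trials, n)).sum(axis=1)
    z = (s - n) / np.sqrt(n)
    return float(np.mean(z**3))

skew_1 = standardized_skew(1)    # a single exponential: skewness 2
skew_64 = standardized_skew(64)  # should be near 2/sqrt(64) = 0.25
```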
An example from physics: energy of molecules in ideal gas
Then the CLT tells us what the whole distribution for E is:
p(E) = (1/√(2π·(3/2)N(kT)²)) exp[−(E − (3/2)NkT)²/(2·(3/2)N(kT)²)]
Very small fluctuations about the mean!
width/mean ∼ √Var/mean = kT√((3/2)N) / ((3/2)N kT) ∼ 1/√N ∼ 10^{−25/2}.
Our discussion of the CLT here assumed that the vars we were summing were SI. The CLT
actually still applies, as long as correlations are small enough.
Atoms in a real fluid are not SI. An occasion where correlations lead to important
fluctuations is at a critical point. See 8.08 or 8.333.
2.5 Epilog: an alternate derivation of the Poisson distribution
A drunk person is trying to get home from a bar at x = 0, and makes a series of steps
of length L down the (one-dimensional) street. Unfortunately, the direction of each step is
random, and uncorrelated with the previous steps: with probability p he goes to the right
and with probability q = 1 − p he goes to the left. Let’s ask: after N steps, what’s his
probability P(m) of being at x = mL?
Note that we’ve assumed all his steps are the same size, which has the effect of making
space discrete. Let’s restrict ourselves to the case where he moves in one dimension. This
already has many physical applications, some of which we’ll mention later.
What’s the probability that he gets |m| > N steps away? With N steps, the farthest away
he can get is |m| = N , so for |m| > N , P (m) = 0.
In the second step here we used the fact that the steps are statistically independent, so the
joint probability factorizes. nR is the number of steps to the right, i.e. the number of the xi
which equal R. Since the total number of steps is N , nL + nR = N , the net displacement
(in units of the step length L) is
m = n_R − n_L = 2n_R − N.
In asking about the drunk’s probability for reaching some location, we don’t care about
the order of the steps. There are many more ways to end up near the starting point than
far away. For example, with N = 3, the possibilities are
LLL m = −3
What’s the number of sequences for a given nL , nR ? The sequence is determined if we say
which of the steps is a R, so we have to choose nR identical objects out of N . The number
of ways to do this is
C(N, n_R) = N!/(n_R! n_L!) = C(N, n_L).
A way to think about this formula for the number of ways to arrange N = nR + nL of
which nR are indistinguishably one type and nL are indistinguishably another type, is: N !
is the total number of orderings if all the objects can be distinguished. Redistributing the
nR R-steps amongst themselves doesn’t change the pattern (there are nR ! such orderings),
so we must divide by this overcounting. Similarly redistributing the nL L-steps amongst
themselves doesn’t change the pattern (there are nL ! such orderings).
So

P(n_L, n_R) = (N!/(n_R! n_L!)) p^{n_R} q^{n_L}.
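The counting formula can be checked by brute force for small N (a sketch: enumerate every step sequence and tally how many have each number of R-steps):

```python
from itertools import product
from math import comb

N = 8
# Tally all 2^N step sequences by their number of rightward steps.
seq_counts = [0] * (N + 1)
for steps in product("LR", repeat=N):
    seq_counts[steps.count("R")] += 1

# The formula: C(N, n_R) = N! / (n_R! n_L!).
formula = [comb(N, nR) for nR in range(N + 1)]
```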
Note that the binomial formula is

(p + q)^N = Σ_{n=0}^{N} (N!/(n!(N − n)!)) pⁿ q^{N−n}.
We have shown that the probability that an event with probability p occurs n times in N
(independent) trials is
W_N(n) = (N!/(n!(N − n)!)) pⁿ (1 − p)^{N−n};
this is called the binomial distribution. 1 − p here is the probability that anything else
happens. So the analog of ”step to the right” could be ”a particular song is played on your
ipod in shuffle mode” and the analog of ”step to the left” is ”any other song comes on”.
For example, suppose you have 2000 songs on your ipod and you listen on shuffle by song;
then the probability of hearing any one song is p = 1/2000. Q: If you listen to N = 1000 songs
on shuffle, what's the probability that you hear a particular song n times?

The binomial distribution applies. But there are some simplifications we can make. First,
p itself is a small number, and N is large. Second, the probability will obviously be very
small for n ∼ N, so let's consider the limit n ≪ N. In this case, we can apply Stirling's
formula to the factorials:
W_N(n) ≈ (1/n!) (N^N/(N − n)^{N−n}) pⁿ (1 − p)^{N−n}
We can use N − n ∼ N except when there is a cancellation of order-N terms:
W_N(n) ≈ (1/n!) (N^N/N^{N−n}) pⁿ (1 − p)^{N−n} = (1/n!) Nⁿ pⁿ (1 − p)^{N−n}
Now we can Taylor expand in small p, using ln(1 − x) ≈ −x − x²/2 − x³/3 − …:

W_N(n) ≈ (1/n!) (Np)ⁿ e^{(N−n) ln(1−p)} ≈ (1/n!) (Np)ⁿ e^{−Np}.
This is called the Poisson distribution,
Poisson_μ(n) = (1/n!) μⁿ e^{−μ}.
Note that it only depends on the product μ = pN, which for our example is
pN = (1/2000)·1000 = 1/2. In this case, it looks like figure 2.5.2.
[Figure 2.5.2: Poisson₁/₂(n) for n = 0, …, 5 (vertical axis: expected counts out of 1000 songs).]
It may seem like your ipod is conspiring to play some songs multiple times and not play
others at all (I had this impression too until I thought about it), but it’s just because we
don’t have much intuition yet about the Poisson distribution. In fact, if we vary µ = N p, we
can make the probability that a given song is never heard much larger than the probability
that it is heard once; see figure 2.5.2.
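The numbers in this example are easy to check directly (a sketch: compare the exact binomial W_N(n) to the Poisson limit with μ = Np = 1/2, and confirm that "never heard" beats "heard once"):

```python
from math import comb, factorial, exp

N, p = 1000, 1.0 / 2000.0
mu = N * p   # = 1/2

def binom(n):
    """Exact binomial probability W_N(n)."""
    return comb(N, n) * p**n * (1 - p)**(N - n)

def poisson(n):
    """Poisson limit (1/n!) mu^n e^{-mu}."""
    return mu**n * exp(-mu) / factorial(n)

# The two distributions agree closely for small n.
max_err = max(abs(binom(n) - poisson(n)) for n in range(10))
```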
[Figure: Poisson_μ(0) and Poisson_μ(1) as functions of μ, for 0 ≤ μ ≤ 4.]