02 ProbIntro 2020 Annotated
Introduction to Probability
Ioannis Patras
with thanks to Mark Plumbley, Mike Davies, Kevin Murphy and Sam Roweis
School of EECS
Introduction
Probability Theory deals with phenomena that have some degree of
randomness/uncertainty/chance.
Example
The tossing of a coin.
Loose concept of probability:

Prob(Heads) = No. of Heads / No. of Trials = 140/250 = 0.56
Clearly coin tossing is not predictable; however, we can expect that in another 100 tosses we will see approximately

100 × Prob(Heads) = 56

heads.
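As an illustrative sketch (not from the original slides), we can reproduce this relative-frequency estimate by simulating coin tosses; the trial counts below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 250 tosses of a fair coin (1 = heads, 0 = tails).
tosses = rng.integers(0, 2, size=250)
prob_heads = tosses.mean()          # relative frequency: No. of Heads / No. of Trials
print(f"Estimated Prob(Heads) = {prob_heads:.2f}")

# Using that estimate, in another 100 tosses we expect roughly:
print(f"Expected heads in the next 100 tosses ≈ {100 * prob_heads:.0f}")
```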
Axioms of probability
These axioms are fairly intuitive, and are sufficient to define probability theory.
Mathematical Framework for Probability
Sample Space
We define the sample space to be, Ω, the space of all possible outcomes from an
experiment.
Die roll
Ω = {1, 2, . . . , 6}
Spinning pointer
Ω = {θ : 0 ≤ θ < 2π}
Note that this last sample space is not discrete.
Probability measure
We can now define a measure of probability P for each event E in A (i.e. each subset of Ω), satisfying:
1. P(E) ≥ 0 for every event E;
2. P(Ω) = 1;
3. if E1 ∩ E2 = ∅ then P(E1 ∪ E2) = P(E1) + P(E2)
Corollary: P(∅) = 0
Proof: Ω and ∅ are disjoint, so 1 = P(Ω) = P(Ω ∪ ∅) = P(Ω) + P(∅) = 1 + P(∅), hence P(∅) = 0.
Conditional Probability
The conditional probability of A given B is defined as

P(A|B) ≜ P(A ∩ B) / P(B)
Example:
Corollaries
• B becomes the new certain event, P(B|B) = 1.
• If A ∩ B = ∅ then P(A|B) = 0.
i.e. if A and B are mutually exclusive events,
and B has occurred, then A cannot have occurred.
• If B ⊂ A then P(A|B) = 1
i.e. if B is a subset of A,
and B occurred, then A must have occurred.
Important fact
P(·|B) is itself a valid probability measure.
Proof
The quantity P(A|B) satisfies all three probability axioms:
1. P(A|B) ≥ 0
2. P(Ω|B) = 1
3. if A ∩ C = ∅, P(A ∪ C |B) = P(A|B) + P(C |B)
Total Probability
Suppose the events A1, A2, . . . , An are such that:
1. Ai ∩ Aj = ∅, i ≠ j (mutually exclusive)
2. ∪i Ai = Ω (cover)
Then the Total Probability rule gives P(B) = Σi P(B|Ai) P(Ai).
Bayes Rule

P(A|B) = P(B|A) P(A) / P(B)

i.e.

‘Posterior Probability’ = ‘Likelihood’ × ‘Prior Probability for A’ / ‘Prior Probability for B’
What is the probability that a loan will not be repaid given that the system has identified it
as such?
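A minimal numerical sketch of this question using the rules above; the base rate of bad loans, the detection rate and the false-alarm rate are made-up assumptions, not figures from the slides.

```python
# Hypothetical rates (assumptions for illustration only)
p_bad = 0.05                 # P(A): prior probability that a loan is not repaid
p_flag_given_bad = 0.90      # P(B|A): system flags a loan that will not be repaid
p_flag_given_good = 0.10     # P(B|not A): false-alarm rate on good loans

# Total Probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_flag = p_flag_given_bad * p_bad + p_flag_given_good * (1 - p_bad)

# Bayes Rule: P(A|B) = P(B|A) P(A) / P(B)
p_bad_given_flag = p_flag_given_bad * p_bad / p_flag
print(f"P(not repaid | flagged) = {p_bad_given_flag:.3f}")   # ≈ 0.321 with these rates
```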
Independence
A key concept in statistics is independence: one event has no influence on the probability of another.
Events A and B are independent if
P(A ∩ B) = P(A) P(B), or equivalently P(A|B) = P(A), P(B|A) = P(B), etc.
Example
Tossing two fair coins (elements of the sample space are ordered pairs):
Ω = {HH, HT, TH, TT}, each outcome with probability 1/4.
In general, the relative-frequency view of repeated experiments only holds if the experiments are independent.
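As a small sketch (not from the slides), we can verify the definition of independence on the two-coin sample space by counting equally likely outcomes:

```python
from itertools import product

# Sample space of two fair coin tosses: ordered pairs, each with probability 1/4.
omega = list(product("HT", repeat=2))

A = {w for w in omega if w[0] == "H"}    # event: first toss is heads
B = {w for w in omega if w[1] == "H"}    # event: second toss is heads

p = lambda event: len(event) / len(omega)   # equally likely outcomes
print(p(A & B) == p(A) * p(B))   # True: P(A ∩ B) = P(A) P(B)
print(p(A & B) / p(B) == p(A))   # True: P(A|B) = P(A)
```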
Random Variables
For a discrete Random Variable X , P(X = x) is the probability that the RV X takes the
value x.
For a real Random Variable X, P(X ≤ x) is the probability that the RV X takes a value less than or equal to x. This is the Cumulative Distribution Function (cdf).
[Figure: the cdf P(X ≤ x) and the corresponding pdf p(x), plotted for 0 ≤ x ≤ 5]
Interpretation of p(x)
Properties of pdf
1. f(x) ≥ 0 (from the definition of F(x))
2. ∫_{−∞}^{∞} f(x) dx = 1 (probability of the whole space)
Example
When gambling, the expected rate of return is our main statistical interest.
(Here we ignore important practical considerations, like the available cash reserve.)
We sometimes write m_X = E{X} or µ_X = E{X} for the expectation (mean) of X.
Example
Roll many (large N) independent fair dice,
each die roll giving a number between 1 and 6.
(Remember: ‘fair’ means ‘equal probability of each outcome’.)
Assuming equal probabilities, we would expect:
m̂_N = (1/N) ∑_i x_i ≈ E{X} = ∑_{k=1}^{6} k/6 = 3.5
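A quick Monte Carlo sketch of this estimate (the number of rolls N is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
rolls = rng.integers(1, 7, size=N)         # N independent fair die rolls in {1, ..., 6}

m_hat = rolls.mean()                       # sample mean (1/N) * sum_i x_i
m_true = sum(k / 6 for k in range(1, 7))   # E{X} = sum_{k=1}^{6} k/6 = 3.5
print(f"sample mean = {m_hat:.3f}, expectation = {m_true}")
```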
Functions of an RV - Means and expectations
Suppose that g (x) is a function,
mapping real values x ∈ R to real values y = g (x) ∈ R.
We can define means and expectations of functions of a RV X :
E{Y} = E{g(X)} = ∫_{−∞}^{∞} g(x) p(x) dx
(note Y is a random variable itself and hence we are again considering the mean of an RV)
Example: p(x) is the pdf of the age of the population, and g(x) is the medical cost associated with age x.
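A minimal numerical sketch of E{g(X)} = ∫ g(x) p(x) dx. Rather than real age/cost data, it assumes X uniform on [0, 1] and g(x) = x², purely for illustration:

```python
import numpy as np

# Assumed example: X ~ Uniform(0, 1), so p(x) = 1 on [0, 1]; g(x) = x**2.
x = np.linspace(0.0, 1.0, 10_001)
p = np.ones_like(x)            # pdf of the uniform distribution on [0, 1]
g = x**2

e_gx = np.trapz(g * p, x)      # E{g(X)} = integral of g(x) p(x) dx
print(f"E{{g(X)}} ≈ {e_gx:.4f} (exact value 1/3)")
```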
Variance: σ_X² = E{(X − m_X)²} = E{X²} − E{X}²
Higher Order Statistics
Experiments often only need 2nd-order statistics, i.e. only the
mean m_X = E{X} and variance σ_X² = E{X²} − E{X}².
There are also Higher Order Statistics (HOS), often based upon polynomial expectations,
(E {X 3 }, E {X 4 }, . . .), e.g.:
Skewness = E{((X − m_X)/σ_X)³}
(this measures the asymmetry of the distribution).
Kurtosis = E{((X − m_X)/σ_X)⁴} − 3
(this measures how much of the probability mass lies within the tails).
Positive kurtosis is also termed ‘heavy-tailed’.
(‘Heavy’ with respect to a Gaussian: hence the −3).
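A sketch computing skewness and excess kurtosis from samples; the exponential distribution used here is an arbitrary choice that happens to be skewed and heavy-tailed (skewness 2, excess kurtosis 6):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=1_000_000)   # a skewed, heavy-tailed example

m, s = x.mean(), x.std()
z = (x - m) / s
skewness = np.mean(z**3)       # E{((X - m_X)/sigma_X)^3}, ≈ 2 for the exponential
kurtosis = np.mean(z**4) - 3   # excess kurtosis (0 for a Gaussian), ≈ 6 here
print(skewness, kurtosis)
```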
The Gaussian Distribution
aka the ‘normal distribution’, with probability density function (pdf):
p(x) = (1 / (σ√(2π))) e^(−(x−µ)²/(2σ²)),
[Figure: the Gaussian pdf plotted for −4 ≤ x ≤ 4]
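A sketch evaluating this pdf directly from the formula (µ and σ default to the standard Gaussian, an arbitrary choice):

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """p(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-4, 4, 801)
p = gaussian_pdf(x)            # standard Gaussian: mu = 0, sigma = 1
print(p.max())                 # ≈ 0.3989, the peak value at x = mu
print(np.trapz(p, x))          # ≈ 1: the pdf integrates to one
```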
Multiple Random Variables
Extend the idea of RVs to two or more dimensions,
e.g. RVs X (age) and Y (wealth)
RVs X and Y have respective distributions pX (x) and pY (y ) but this does not tell us
about the relation between X and Y .
Joint Density
p_XY(x, y) ≈ P({x < X ≤ x + ∆x}, {y < Y ≤ y + ∆y}) / (∆x ∆y)
The probability of any joint region D ⊂ R² (e.g. P({x1 < X ≤ x2}, {y1 < Y ≤ y2})) can be evaluated from the 2-dimensional integral:

P((X, Y) ∈ D) = ∫∫_D p_XY(x, y) dx dy
Conditional densities
The conditional density of x and y, given z, is a (normalised) slice of the p(x, y, z) cube.
(NB. Remember that ∫∫ p(x, y|z) dx dy = 1)
Total Probability and Bayes Rule
As before we now have Total Probability Rule:
p_Y(y) = ∫_{−∞}^{∞} p(x, y) dx
       = ∫_{−∞}^{∞} p(y|x) p(x) dx
and Bayes Rule:

p(y|x) = p(x|y) p(y) / p(x)
Independent RVs
Consider the sum Z = X + Y of two independent RVs X and Y. We note:

p_Z(z) = ∫_X p_XZ(x, z) dx = ∫_X p_Z|X(z|x) p_X(x) dx

Since Z = X + Y,

p_Z|X(z|x) = p_Y(z − x)

which gives the convolution:

p_Z(z) = ∫_X p_Y(z − x) p_X(x) dx
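A numerical sketch of this convolution, assuming X and Y are independent and uniform on [0, 1] (so p_Z should be the triangular density on [0, 2]); the discretisation step is arbitrary:

```python
import numpy as np

dx = 0.001
x = np.arange(0.0, 1.0, dx)
p_x = np.ones_like(x)          # pdf of X ~ Uniform(0, 1)
p_y = np.ones_like(x)          # pdf of Y ~ Uniform(0, 1)

# p_Z(z) = integral of p_Y(z - x) p_X(x) dx, evaluated as a discrete convolution
p_z = np.convolve(p_x, p_y) * dx
z = np.arange(len(p_z)) * dx

print(np.trapz(p_z, z))                  # ≈ 1: p_Z is a valid pdf
print(p_z[np.argmin(abs(z - 1.0))])      # ≈ 1: peak of the triangular density at z = 1
```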
Example: Expectation of sum of two independent RVs
To prove (which we used in the weak law of large numbers):
E {X + Y } = E {X } + E {Y }
Proof:

E{X + Y} = ∫_{X,Y} (x + y) p_XY(x, y) dx dy

Given independence, p_XY(x, y) = p_X(x) p_Y(y), so:

E{X + Y} = ∫_X ∫_Y (x + y) p_X(x) p_Y(y) dx dy
         = ∫_X ∫_Y x p_X(x) p_Y(y) dx dy + ∫_X ∫_Y y p_X(x) p_Y(y) dx dy
         = E{X} ∫_Y p_Y(y) dy + E{Y} ∫_X p_X(x) dx
         = E{X} + E{Y}
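A quick Monte Carlo check of this result; the particular distributions of X and Y are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=1_000_000)     # E{X} = 2
y = rng.uniform(0.0, 1.0, size=1_000_000)          # E{Y} = 0.5

print((x + y).mean())          # ≈ 2.5
print(x.mean() + y.mean())     # ≈ 2.5, matching E{X + Y} = E{X} + E{Y}
```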
Multi-Dimensional Expectations
Just as we defined E {x} for scalar RVs, we have the same for vector RVs:
m⃗_x = E{X⃗} = ∫_{x1} ∫_{x2} · · · ∫_{xN} x⃗ p_X⃗(x⃗) dx1 . . . dxN

The covariance Cov(X⃗) = E{(X⃗ − m⃗_X)(X⃗ − m⃗_X)ᵀ}; note that Cov(X⃗) is an N × N matrix.
For two N-d RVs X⃗ and Y⃗ we can define the cross-covariance:

Cov(X⃗, Y⃗) = E{(X⃗ − m⃗_X)(Y⃗ − m⃗_Y)ᵀ}
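A sketch estimating the mean vector and the N × N covariance matrix from samples; the 2-d Gaussian used to generate the samples is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
mean_true = np.array([1.0, -2.0])
cov_true = np.array([[2.0, 0.6],
                     [0.6, 1.0]])
samples = rng.multivariate_normal(mean_true, cov_true, size=100_000)  # rows are samples of a 2-d RV

m_hat = samples.mean(axis=0)              # estimate of the mean vector
cov_hat = np.cov(samples, rowvar=False)   # estimate of the N x N covariance matrix
print(m_hat)
print(cov_hat)
```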
Exercise
The N-dimensional (multivariate) Gaussian pdf:

p_X⃗(x⃗) = ( det(Cov(X⃗))^(−1/2) / (2π)^(N/2) ) exp( −½ (x⃗ − m⃗_x)ᵀ Cov(X⃗)^(−1) (x⃗ − m⃗_x) )
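A sketch evaluating the multivariate Gaussian pdf directly from the formula above; the mean and covariance below are arbitrary choices:

```python
import numpy as np

def mvn_pdf(x, m, cov):
    """N-dimensional Gaussian pdf evaluated from the formula above."""
    n = len(m)
    diff = x - m
    quad = diff @ np.linalg.inv(cov) @ diff               # (x - m)^T Cov^{-1} (x - m)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

m = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.5],
                [0.5, 1.0]])
print(mvn_pdf(np.array([0.0, 0.0]), m, cov))   # peak value 1 / (2π sqrt(det Cov)) ≈ 0.184
```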
1. Let Z = aX + bY. If X and Y are jointly Gaussian, then Z is also Gaussian, with m⃗_z = a m⃗_x + b m⃗_y.
2. If X and Y are jointly Gaussian and uncorrelated,
then they are also Independent.
So correlation measures dependence for Gaussian RVs.