
02 ProbIntro 2020 Annotated


ECS708: Machine Learning

Introduction to Probability

Ioannis Patras
with thanks to Mark Plumbley, Mike Davies, Kevin Murphy and Sam Roweis
School of EECS
Introduction
Probability Theory deals with phenomena that have some degree of
randomness/uncertainty/chance.

Example
The tossing of a coin.
Loose concept of probability:

Prob(Heads) = No. of Heads / No. of Trials = 140/250 = 0.56

Clearly coin tossing is not predictable; however, we can expect that given another 100
tosses we will see approximately

100 × Prob(Heads) = 56

heads
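A minimal simulation sketch of this relative-frequency idea (a rough illustration in Python with numpy; the 0.56 above is just an observed frequency):

import numpy as np

rng = np.random.default_rng(0)
tosses = rng.random(250) < 0.5        # True = heads for a fair coin
prob_heads = tosses.mean()            # relative-frequency estimate of Prob(Heads)
print(prob_heads)                     # close to 0.5 for a fair coin
print(100 * prob_heads)               # expected no. of heads in another 100 tosses, under this estimate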
Axioms of probability

An event is represented by a set A of possible outcomes.

Example: Die Roll


Some possible events:
{1}, {2}, {3}, {4}, {5}, {6}, (shorthand: ‘1’, ‘2’, ‘3’, etc.)
‘odd’ = {1, 3, 5},
‘even’ = {2, 4, 6},
‘less than 3’ = {1, 2}
Axioms of probability

Let A and B be possible events


Let Ω be the sample space (the set of all possible outcomes).

The axioms of probability can then be given as:

1. P(A) ≥ 0 (Events have nonnegative probability)


2. P(Ω) = 1 (The certain event)
3. P(A ∪ B) = P(A) + P(B) if A and B are mutually exclusive

For ‘P(A ∪ B)’ read ‘the probability that A occurs or B occurs’.


Mutually exclusive means that events A and B cannot both occur at once, i.e. A ∩ B = ∅.

These axioms are fairly intuitive, and are sufficient to define probability theory.
Mathematical Framework for Probability
Sample Space
We define the sample space to be, Ω, the space of all possible outcomes from an
experiment.

Examples of Sample Spaces


Coin toss
Ω = {H, T }

Die roll
Ω = {1, 2, . . . , 6}

Spinning pointer
Ω = {θ : 0 ≤ θ < 2π}
note this last space is not discrete.
Probability measure
We can now define a measure of probability for an element of A (i.e. a subset of Ω)

P is just a function from A to the interval [0, 1] that satisfies:

1. P(Ω) = 1;
2. if E1 ∩ E2 = ∅ then P(E1 ∪ E2 ) = P(E1 ) + P(E2 )

That is we have just reformulated the axioms of probability.

Construction of Probability Models


A complete probability model is defined by {Ω, A, P}.
Getting a probability model from a real system is more difficult. We will look at this later
in the course.
Exercise
Having defined probability, we can draw some corollaries:

1. P(∅) = 0
proof:

2. P(E1 ∪ E2 ) = P(E1 ) + P(E2 ) − P(E1 ∩ E2 )


proof:
Conditional probability

The conditional probability of A given B is defined to be:

P(A|B) ≜ P(A ∩ B) / P(B)

Example:

• Probability of being Greek (A) given being blond (B)

Caution: There is no time order between the events.
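For example, with the fair-die events defined earlier:
P(‘odd’ | ‘less than 3’) = P({1, 3, 5} ∩ {1, 2}) / P({1, 2}) = P({1}) / P({1, 2}) = (1/6) / (2/6) = 1/2.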


Conditional probability

Corollaries
• B becomes the new certain event, P(B|B) = 1.
• If A ∩ B = ∅ then P(A|B) = 0.
i.e. if A and B are mutually exclusive events,
and B has occurred, then A cannot have occurred.
• If B ⊂ A then P(A|B) = 1
i.e. if B is a subset of A,
and B occurred, then A must have occurred.
Important fact

P(· |B) defines a new probability model on (Ω, A).

Proof
The quantity P(A|B) satisfies all three probability axioms:

1. P(A|B) ≥ 0
2. P(Ω|B) = 1
3. if A ∩ C = ∅, P(A ∪ C |B) = P(A|B) + P(C |B)
Total Probability

Now let B be some other event. Then

P(B) = Σ_{k=1}^{n} P(B ∩ Ak) = Σ_{k=1}^{n} P(B|Ak) P(Ak)

where A1 , . . . , An are mutually exclusive events that cover Ω, that is

1. Ai ∩ Aj = ∅, i ≠ j (Mutually exclusive)
2. ∪ᵢ Ai = Ω (Cover)

This is called the Total Probability Rule.

Useful when we know the conditionals but not the marginal.
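A small numeric sketch of the rule (Python; the three events and their probabilities are made up for illustration):

# Total probability: P(B) = sum over k of P(B|Ak) P(Ak)
p_A = [0.3, 0.5, 0.2]            # P(Ak): mutually exclusive events covering the sample space
p_B_given_A = [0.6, 0.4, 0.1]    # P(B|Ak), the known conditionals
p_B = sum(pb * pa for pb, pa in zip(p_B_given_A, p_A))
print(p_B)                       # 0.3*0.6 + 0.5*0.4 + 0.2*0.1 = 0.40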


Bayes Rule

From the definition of Conditional Probability we have

P(A|B)P(B) = P(B ∩ A) = P(B|A)P(A)

Eliminating P(B ∩ A) = P(A ∩ B) we get


P(B|A) = P(A|B)P(B) / P(A)   and   P(A|B) = P(B|A)P(A) / P(B)

This is Bayes Rule.


Bayesian Inference
Bayes Rule is used a lot for inference.
The various elements are given special terms:

P(A|B) = P(B|A)P(A) / P(B)

‘Posterior Probability’ = ‘Likelihood’ × ‘Prior Probability for A’ / ‘Prior Probability for B’

• P(A) is prior probability of A before observation of B


• P(A|B) is posterior probability of A after B observed

Often B will constitute some observed data,


and A is a class to assign, or decision to be made based on B.

Difference between P(A|B) and P(B|A) can be very dramatic.


Exercise
A bank asks a machine learning engineer to build a system that decides whether or not to
approve loan applications
(0.1% of all loan applications will not be repaid).
The bank installs a system that recommends whether the loan should be approved or not.

The system is such that it will correctly identify loans that will not be repaid 99% of the
time.

Similarly, the system will correctly identify loans that will be repaid 99% of the time.

What is the probability that a loan will not be repaid given that the system has identified it
as such?
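One possible sketch of the calculation with Bayes Rule (Python; reading the two 99% figures as P(flagged | not repaid) = 0.99 and P(not flagged | repaid) = 0.99 is an assumption about the intended setup):

# A = loan will not be repaid, B = system flags the loan as "will not be repaid"
p_A = 0.001                 # prior: 0.1% of loans are not repaid
p_B_given_A = 0.99          # system catches 99% of bad loans
p_B_given_notA = 0.01       # system wrongly flags 1% of good loans
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)    # total probability
p_A_given_B = p_B_given_A * p_A / p_B                   # Bayes Rule
print(p_A_given_B)          # roughly 0.09: most flagged loans would in fact be repaid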
Independence
A key concept in statistics is independence: when one event has no influence on the
probability of another.

Definition: Events A and B are independent if:

P(A|B) = P(A)

or P(A ∩ B) = P(A)P(B) (alternative definition)

If A1 , A2 , A3 are all mutually independent then we have:

P(A1 ∩ A2 ∩ A3 ) = P(A1 )P(A2 )P(A3 )

Pair-wise independence is necessary but not sufficient for mutual independence.


Properties of independent events

If A and B are independent we have:

• Aᶜ and Bᶜ are independent


• Aᶜ and B, or A and Bᶜ, are independent

If A, B and C are mutually independent then:

• A and B ∩ C are independent


• A and B ∪ C are independent

etc.
Example
Tossing two fair coins (elements of sample space are ordered pairs):

Ω = {(H, H), (H, T ), (T , H), (T , T )}

assigning ‘classical’ probabilities:


P({(H, H)}) = 1/4,  P({(H, T)}) = 1/4,  etc.

With these probabilities, coin 1 / coin 2 events are independent:

P(coin 1 = H | coin 2 = H) = P({(H, H)}) / P(coin 2 = H) = (1/4) / (1/2) = 1/2 = P(coin 1 = H)
(The same holds for all other combinations of H and T .)
Hence the experiments are independent.

In general, the relative-frequency view of repeated experiments only holds if the
experiments are independent.
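A rough empirical check of this independence (a simulation sketch in Python with numpy):

import numpy as np

rng = np.random.default_rng(1)
n = 100_000
coin1 = rng.integers(0, 2, n)                      # 1 = heads
coin2 = rng.integers(0, 2, n)
p_joint = np.mean((coin1 == 1) & (coin2 == 1))     # estimate of P(coin 1 = H, coin 2 = H)
p1 = np.mean(coin1 == 1)
p2 = np.mean(coin2 == 1)
print(p_joint, p1 * p2)                            # both close to 0.25, as independence predicts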
Random Variables

A random variable (RV) X represents outcomes or states of the world. Instantiations are
usually written in lower case: x.

For a discrete Random Variable X , P(X = x) is the probability that the RV X takes the
value x.

For a real Random Variable X, P(X ≤ x) is the probability that the RV X takes a value
less than or equal to x. This is the Cumulative Distribution Function (cdf).

Probability mass (density) function (pdf) p(x) ≥ 0


Discrete case: p(x) = P(X = x)
Continuous case: p(x) = d/dx P(X ≤ x) ≈ P(x ≤ X < x + ∆x) / ∆x
Example pdf: The Uniform Distribution
[Figure: the cdf P(X ≤ x) (top) and the pdf p(x) (bottom) of a uniform distribution; x axis from 0 to 5.]
Interpretation of p(x)

Probability of ‘small interval’ event {x ≤ X < x + ∆x}:

p(x) ≈ P(x ≤ X < x + ∆x) / ∆x

This is exact in the limit ∆x → 0.

Properties of pdf
1. f(x) ≥ 0 (from the definition of F(x))
2. ∫_{−∞}^{∞} f(x) dx = 1 (probability of the whole space)

Remember: “Densities are for integrating.”


Ensemble Averages and Expectations
Often we are only interested in a partial characterization of the RV.

Example
When gambling the expected rate of return is our main statistical interest.
(Here we ignore important practical considerations, like available cash reserve.)

Hence we define ‘average’ properties.

There are two complementary definitions:

• in terms of ‘data’ (sample properties)


• in terms of the model (expectations)
Sample Mean and Expected Value
Sample Mean: Given N samples xi drawn (independently) from an experiment, we define
the mean to be:
m̂N = (x1 + x2 + . . . + xN) / N
This measures the ‘average’ of the samples.

Expected Value: Given a RV X with pdf p(x) we define:


E{X} = ∫_{−∞}^{∞} x · p(x) dx        if X is continuous

E{X} = Σ_x x · P(X = x)              if X is discrete (no pdf)

We sometimes write mX = E {X } or µX = E {X }.

Intuitively they are related (e.g. relative frequency) but how?


Law of Large Numbers
Roughly (and in many different ways!) as we have a very large number of data points
(N → ∞), we find that
m̂N → E {x}
This general idea is known as the ‘Law of Large Numbers’.
(More on this later. . . )

Example
Roll many (large N) independent fair dice,
each die roll giving a number between 1 and 6.
(Remember: ‘fair’ means ‘equal probability of each outcome’.)
Assuming equal probabilities, we would expect:
m̂N = (1/N) Σ_i xi ≈ E{x} = Σ_{k=1}^{6} k/6 = 3.5
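A minimal simulation of this example (Python with numpy):

import numpy as np

rng = np.random.default_rng(2)
rolls = rng.integers(1, 7, size=100_000)   # independent fair die rolls, values 1..6
print(rolls.mean())                        # sample mean, close to E{x} = 3.5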
Functions of an RV - Means and expectations
Suppose that g(x) is a function,
mapping real values x ∈ R to real values y = g(x) ∈ R.
We can define means and expectations of functions of an RV X:

E{Y} = E{g(X)} = ∫_{−∞}^{∞} g(x) p(x) dx

(note Y is a random variable itself and hence we are again considering the mean of an RV)

Example: p(x) the pdf of the age of the population, g (x) the medical costs associated
with age x

Again there is an analogous definition for sample mean of Y :


m̂N = (y1 + y2 + . . . + yN) / N
Variance
Another important statistic is one that measures the ‘spread’ of the outcomes. One such
statistic is the variance:

Var{X} = σX² = E{(X − E{X})²}

The quantity σX = √(Var{X}) is called the standard deviation.

Var{X} = ∫_{−∞}^{∞} (x − E{X})² pX(x) dx
       = ∫_{−∞}^{∞} (x² − 2x E{X} + E{X}²) pX(x) dx
       = ∫_{−∞}^{∞} x² pX(x) dx − 2E{X} ∫_{−∞}^{∞} x pX(x) dx + E{X}² ∫_{−∞}^{∞} pX(x) dx
       = E{X²} − 2E{X}² + E{X}²
       = E{X²} − E{X}²
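A quick numerical sanity check of the identity Var{X} = E{X²} − E{X}² (Python with numpy; the exponential distribution here is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=200_000)    # samples of some RV X
lhs = x.var()                                   # sample variance
rhs = (x**2).mean() - x.mean()**2               # sample E{X²} − E{X}²
print(lhs, rhs)                                 # agree up to floating-point error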
Higher Order Statistics
Experiments often only need 2nd order statistics, i.e. only the
mean mX = E{X} and variance σX² = E{X²} − E{X}².

There are also Higher Order Statistics (HOS), often based upon polynomial expectations
(E{X³}, E{X⁴}, . . .), e.g.:

Skewness = E{ ((X − mX)/σX)³ }

(this measures the lopsidedness of a distribution) and


Kurtosis = E{ ((X − mX)/σX)⁴ } − 3

(this measures how much of the probability mass lies within the tails).
Positive kurtosis is also termed ‘heavy-tailed’.
(‘Heavy’ with respect to a Gaussian: hence the −3).
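A rough sketch of computing these as sample statistics (Python; scipy.stats.skew and scipy.stats.kurtosis report the same quantities, kurtosis being excess kurtosis, i.e. with the −3 already applied):

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(4)
x = rng.standard_normal(100_000)          # Gaussian: skewness ≈ 0, excess kurtosis ≈ 0
z = (x - x.mean()) / x.std()              # standardised samples
print((z**3).mean(), skew(x))             # sample skewness, two ways
print((z**4).mean() - 3, kurtosis(x))     # sample excess kurtosis, two ways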
The Gaussian Distribution
aka the ‘normal distribution’, with probability density function (pdf):

p(x) = (1/(σ√(2π))) × e^{−(x−µ)²/(2σ²)},

written in terms of the mean, µ, and variance, σ².

[Figure: the Gaussian pdf (bell curve); x axis from −4 to 4, peak value ≈ 0.4.]
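A small sketch evaluating this density numerically (Python; checked against scipy.stats.norm.pdf):

import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = np.linspace(-4, 4, 9)
p = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(np.allclose(p, norm.pdf(x, loc=mu, scale=sigma)))   # True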
Multiple Random Variables
Extend the idea of RVs to two or more dimensions,
e.g. RVs X (age) and Y (wealth)

RVs X and Y have respective distributions pX (x) and pY (y ) but this does not tell us
about the relation between X and Y .

Joint Density
pXY(x, y) ≈ P({x < X ≤ x + ∆x}, {y < Y ≤ y + ∆y}) / (∆x ∆y)

The probability of any joint area D ⊂ R² (e.g. P({x1 < X ≤ x2}, {y1 < Y ≤ y2})) can be
evaluated from the 2-dimensional integral:

P((X, Y) ∈ D) = ∫∫_D pXY(x, y) dx dy
Joint Density

Joint probability of three random variables x, y, and z.


Joint and Marginal densities

Given a joint distribution pXY (x, y ) we define the marginal density:


pX(x) = ∫_{−∞}^{∞} pXY(x, y) dy        (1)

So: Area of slice through pXY (x, y ) at X = x is equal to pX (x).
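A discrete analogue of this marginalisation (Python with numpy; the 2 × 3 joint table is made up for illustration):

import numpy as np

# hypothetical joint P(X = i, Y = j); rows index x, columns index y
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
p_x = p_xy.sum(axis=1)    # marginal over y: P(X = i)
p_y = p_xy.sum(axis=0)    # marginal over x: P(Y = j)
print(p_x, p_y)           # [0.4 0.6] and [0.35 0.35 0.3]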


Relationship between Joint and Marginal densities

Marginal probability of x and y , when z is marginalised.


Conditional Densities
pY|X(y|x) = pXY(x, y) / pX(x)

Conditional probabilities of x and y, given z, are (normalised) slices in the p(x, y, z) cube.

(NB. Remember that ∫∫ p(x, y|z) dx dy = 1)
Total Probability and Bayes Rule
As before we now have Total Probability Rule:
pY(y) = ∫_{−∞}^{∞} p(x, y) dx = ∫_{−∞}^{∞} p(y|x) p(x) dx

We also have a density version of Bayes rule:

p(y |x)p(x) = p(x|y )p(y ) = p(x, y )

or
p(y|x) = p(x|y) p(y) / p(x)
Independent RVs

Recall: Events A and B are independent if P(A, B) = P(A)P(B).

We say the RVs X and Y are independent RVs if:

fXY(x, y) = fX(x) fY(y)        (2)


So joint density has a factorial (or product) form.

Example: Sum of two independent RVs
Let Z = X + Y and X and Y be independent.

Given pX (x) and pY (y ) how do we calculate pZ (z)?

We note:

pZ(z) = ∫_X pXZ(x, z) dx = ∫_X pZ|X(z|x) pX(x) dx

For fixed X = x, we get Z = x + Y or Y = Z − x.


So Z is simply a translation of Y :

pZ |X (z|x) = pY (z − x)

which gives:

pZ(z) = ∫_X pY(z − x) pX(x) dx
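A quick sketch of this convolution in the discrete case, where the integral becomes a sum (Python with numpy; two independent fair dice):

import numpy as np

p_x = np.full(6, 1/6)           # P(X = 1..6) for a fair die
p_y = np.full(6, 1/6)           # P(Y = 1..6)
p_z = np.convolve(p_x, p_y)     # P(Z = z) for Z = X + Y, z = 2..12
print(p_z)                      # peaks at 6/36 for z = 7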
Example: Expectation of sum of two independent RVs
To prove (which we used in the weak law of large numbers):

E {X + Y } = E {X } + E {Y }

Proof:

E{X + Y} = ∫∫_{X,Y} (x + y) pXY(x, y) dx dy

Given independence:
E{X + Y} = ∫_X ∫_Y (x + y) pX(x) pY(y) dx dy
         = ∫_X ∫_Y x pX(x) pY(y) dx dy + ∫_X ∫_Y y pX(x) pY(y) dx dy
         = ∫_X x pX(x) dx + ∫_Y y pY(y) dy
         = E{X} + E{Y}        QED
Exercise
Show that the same holds for the variance of the sum of independent RVs, i.e.

Var{X + Y } = Var{X } + Var{Y }


Covariance and correlations
The covariance of two RVs X and Y is:
Cov(X, Y) ≡ σXY = E{(X − mX)(Y − mY)} = E{XY} − mX mY

The (Pearson) correlation coefficient (sometimes just correlation) is:

ρXY = Cov(X, Y) / (σX σY)
which measures how linearly related the two variables are.

Fact: −1 ≤ ρXY ≤ 1. (Can you show this?)


If ρXY = 0, the RVs are said to be uncorrelated.
If ρXY = ±1 then Y is simply a linear rescaling of X .

NB: In e.g. signal processing, can also talk about correlation,


Cor(X , Y ) = E {XY }, so that Cov(X , Y ) = Cor(X , Y ) − mX mY .

Beware possible confusion in use of word “correlation”!
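A small numerical sketch (Python with numpy; np.cov and np.corrcoef compute the sample versions of these quantities):

import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(50_000)
y = 2.0 * x + rng.standard_normal(50_000)    # linearly related to x, plus noise
print(np.cov(x, y)[0, 1])                    # sample Cov(X, Y), close to 2
print(np.corrcoef(x, y)[0, 1])               # sample rho_XY, close to 2/sqrt(5) ≈ 0.89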


Independent vs Uncorrelated RVs
If X and Y are independent scalar RVs, we have:

pXY (x, y ) = pX (x) pY (y )

Hence E {XY } = E {X }E {Y } = mX mY . (Exercise: Show this).


Hence Cov(X , Y ) = E {XY } − mX mY = 0 and so also ρXY = 0

Therefore: Independent ⇒ Uncorrelated.

However: Uncorrelated ⇏ Independent


Covariance Cov(X , Y ) only measures 2nd order statistics. Independence is much stronger:
all nonlinear statistics must be unrelated.

Nevertheless, “uncorrelated” is much easier to calculate (and enforce!) than


“independent”, so is of great practical interest.
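A classic illustration of uncorrelated-but-dependent RVs, sketched numerically (Python with numpy; Y = X² with X symmetric about 0 is a standard textbook example):

import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(200_000)
y = x**2                                # fully determined by x, so clearly not independent
print(np.corrcoef(x, y)[0, 1])          # approximately 0: uncorrelated
# but higher-order statistics are related, e.g. E{X²Y} = E{X⁴} = 3 ≠ E{X²}E{Y} = 1
print(np.mean(x**2 * y), np.mean(x**2) * np.mean(y))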
N-dimensional Random Variables
Distributions, densities, etc. generalize to N dimensions.
Let X⃗ = [X1, . . . , XN]; then:

pX⃗(x⃗) = p([x1, . . . , xk] | [xk+1, . . . , xN]) f([xk+1, . . . , xN])

etc.

Multi-Dimensional Expectations
Just as we defined E {x} for scalar RVs, we have the same for vector RVs:
m⃗x = E{X⃗} = ∫_{x1} ∫_{x2} · · · ∫_{xN} x⃗ pX⃗(x⃗) dx1 . . . dxN

where the arrow denotes a vector-valued quantity.


Multi-Dimensional Variance: Covariance Matrix
The equivalent of multi-dimensional variance is more complex. There are several ways of
generalising E{X²} to N dimensions.

To capture the full generality we define the covariance matrix:


Cov(X⃗) = E{(X⃗ − m⃗X)(X⃗ − m⃗X)ᵀ}

Note that Cov(X⃗) is an N × N matrix.

For two N-d RVs X⃗ and Y⃗ we can define the cross-covariance:

Cov(X⃗, Y⃗) = E{(X⃗ − m⃗X)(Y⃗ − m⃗Y)ᵀ}

Exercise

Let X and Y be uncorrelated scalar RVs, and let Z⃗ = [X, Y].

Calculate the form of the covariance matrix Cov(Z⃗).
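A numerical sketch of this exercise (Python with numpy; the variances Var{X} = 1 and Var{Y} = 4 are hypothetical choices):

import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 100_000)    # Var{X} = 1
y = rng.normal(0.0, 2.0, 100_000)    # Var{Y} = 4, drawn independently of X
z = np.stack([x, y])                 # Z = [X, Y]
print(np.cov(z))                     # approximately diag(Var{X}, Var{Y}) = [[1, 0], [0, 4]]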


N-dimensional Gaussian RVs
The Gaussian distribution has a natural N-dimensional form:

pX⃗(x⃗) = (det(Cov(X⃗))^{−1/2} / (2π)^{N/2}) exp( −½ (x⃗ − m⃗x)ᵀ Cov(X⃗)⁻¹ (x⃗ − m⃗x) )

As for the scalar case, it is completely specified by
(1) the mean m⃗x, and (2) the covariance Cov(X⃗).

The N-d Gaussian also has the following interesting properties:

1. Let Z⃗ = aX⃗ + bY⃗.
   If X⃗ and Y⃗ are jointly Gaussian,
   then Z⃗ is also Gaussian, with m⃗z = a m⃗x + b m⃗y.
2. If X⃗ and Y⃗ are jointly Gaussian and uncorrelated,
   then they are also independent.
   So correlation measures dependence for Gaussian RVs.
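A small sketch evaluating this N-d density (Python; scipy.stats.multivariate_normal implements the same form and is used here as a check; the mean and covariance values are made up):

import numpy as np
from scipy.stats import multivariate_normal

mean = np.array([0.0, 1.0])
cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])
x = np.array([0.5, 0.5])
diff = x - mean
norm_const = (2 * np.pi) ** (len(x) / 2) * np.sqrt(np.linalg.det(cov))
p = np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm_const
print(np.isclose(p, multivariate_normal(mean, cov).pdf(x)))   # True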
