
02 ProbIntro 2020 Annotated


ECS708: Machine Learning

Introduction to Probability

Ioannis Patras
with thanks to Mark Plumbley, Mike Davies, Kevin Murphy and Sam Roweis
School of EECS
Introduction
Probability Theory deals with phenomena that have some degree of
randomness/uncertainty/chance.

Example
The tossing of a coin.
Loose concept of probability:

Prob(Heads) = No. of Heads / No. of Trials = 140/250 = 0.56

Clearly coin tossing is not predictable; however, we can expect that given another 100
tosses we will see approximately

100 × Prob(Heads) = 56

heads
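A minimal simulation sketch of this relative-frequency idea (a rough illustration in Python with numpy; the 0.56 above is just an observed frequency):

import numpy as np

rng = np.random.default_rng(0)
tosses = rng.random(250) < 0.5        # True = heads for a fair coin
prob_heads = tosses.mean()            # relative-frequency estimate of Prob(Heads)
print(prob_heads)                     # close to 0.5 for a fair coin
print(100 * prob_heads)               # expected no. of heads in another 100 tosses, under this estimate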
Axioms of probability

An event is represented by a set A of possible outcomes.

Example: Die Roll


Some possible events:
{1}, {2}, {3}, {4}, {5}, {6}, (shorthand: ‘1’, ‘2’, ‘3’, etc.)
‘odd’ = {1, 3, 5},
‘even’ = {2, 4, 6},
‘less than 3’ = {1, 2}
Axioms of probability

Let A and B be possible events


Let Ω be the sample space (the set of all possible outcomes).

The axioms of probability can then be given as:

1. P(A) ≥ 0 (Events have nonnegative probability)


2. P(Ω) = 1 (The certain event)
3. P(A ∪ B) = P(A) + P(B) if A and B are mutually exclusive

For ‘P(A ∪ B)’ read ‘the probability that A occurs or B occurs’.


Mutually exclusive means that events A and B cannot both occur at once, i.e. A ∩ B = ∅.

These axioms are fairly intuitive, and are sufficient to define probability theory.
Mathematical Framework for Probability
Sample Space
We define the sample space to be, Ω, the space of all possible outcomes from an
experiment.

Examples of Sample Spaces


Coin toss
Ω = {H, T }

Die roll
Ω = {1, 2, . . . , 6}

Spinning pointer
Ω = {θ : 0 ≤ θ < 2π}
note this last space is not discrete.
Probability measure
We can now define a measure of probability for an element of A (i.e. a subset of Ω)

P is just a function from A to the interval [0, 1] that satisfies:

1. P(Ω) = 1;
2. if E1 ∩ E2 = ∅ then P(E1 ∪ E2 ) = P(E1 ) + P(E2 )

That is we have just reformulated the axioms of probability.

Construction of Probability Models


A complete probability model is defined by {Ω, A, P}.
Getting a probability model from a real system is more difficult. We will look at this later
in the course.
Exercise
Having defined probability, we can draw some corollaries:

1. P(∅) = 0
proof:

2. P(E1 ∪ E2 ) = P(E1 ) + P(E2 ) − P(E1 ∩ E2 )


proof:
Conditional probability

The conditional probability of A given B is defined to be:

P(A|B) ≜ P(A ∩ B) / P(B)

Example:

• Probability of being Greek (A) given being blond (B)

Caution: There is no time order between the events.
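For example, with the fair-die events defined earlier:
P(‘odd’ | ‘less than 3’) = P({1, 3, 5} ∩ {1, 2}) / P({1, 2}) = P({1}) / P({1, 2}) = (1/6) / (2/6) = 1/2.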


Conditional probability

Corollaries
• B becomes the new certain event, P(B|B) = 1.
• If A ∩ B = ∅ then P(A|B) = 0.
i.e. if A and B are mutually exclusive events,
and B has occurred, then A cannot have occurred.
• If B ⊂ A then P(A|B) = 1
i.e. if B is a subset of A,
and B occurred, then A must have occurred.
Important fact

P(· |B) defines a new probability model on (Ω, A).

Proof
The quantity P(A|B) satisfies all three probability axioms:

1. P(A|B) ≥ 0
2. P(Ω|B) = 1
3. if A ∩ C = ∅, P(A ∪ C |B) = P(A|B) + P(C |B)
Total Probability

Now let B be some other event. Then

P(B) = Σ_{k=1}^{n} P(B ∩ Ak) = Σ_{k=1}^{n} P(B|Ak) P(Ak)

where A1 , . . . , An are mutually exclusive events that cover Ω, that is

1. Ai ∩ Aj = ∅, i ≠ j (Mutually exclusive)
2. ∪ᵢ Ai = Ω (Cover)

This is called the Total Probability Rule.

Useful when we know the conditionals but not the marginal.
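A small numeric sketch of the rule (Python; the three events and their probabilities are made up for illustration):

# Total probability: P(B) = sum over k of P(B|Ak) P(Ak)
p_A = [0.3, 0.5, 0.2]            # P(Ak): mutually exclusive events covering the sample space
p_B_given_A = [0.6, 0.4, 0.1]    # P(B|Ak), the known conditionals
p_B = sum(pb * pa for pb, pa in zip(p_B_given_A, p_A))
print(p_B)                       # 0.3*0.6 + 0.5*0.4 + 0.2*0.1 = 0.40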


Bayes Rule

From the definition of Conditional Probability we have

P(A|B)P(B) = P(B ∩ A) = P(B|A)P(A)

Eliminating P(B ∩ A) = P(A ∩ B) we get


P(B|A) = P(A|B)P(B) / P(A)   and   P(A|B) = P(B|A)P(A) / P(B)

This is Bayes Rule.


Bayesian Inference
Bayes Rule is used a lot for inference.
The various elements are given special terms:

P(A|B) = P(B|A)P(A) / P(B)

‘Posterior Probability’ = ‘Likelihood’ × ‘Prior Probability for A’ / ‘Prior Probability for B’

• P(A) is prior probability of A before observation of B


• P(A|B) is posterior probability of A after B observed

Often B will constitute some observed data,


and A is a class to assign, or decision to be made based on B.

Difference between P(A|B) and P(B|A) can be very dramatic.


Exercise
A bank asks a machine learning engineer to build a system that decides whether or not to
approve loan applications
(0.1% of all loan applications will not be repaid).
The bank installs a system that recommends whether the loan should be approved or not.

The system is such that it will correctly identify loans that will not be repaid 99% of the
time.

Similarly, the system will correctly identify loans that will be repaid 99% of the time.

What is the probability that a loan will not be repaid given that the system has identified it
as such?
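One possible sketch of the calculation with Bayes Rule (Python; reading the two 99% figures as P(flagged | not repaid) = 0.99 and P(not flagged | repaid) = 0.99 is an assumption about the intended setup):

# A = loan will not be repaid, B = system flags the loan as "will not be repaid"
p_A = 0.001                 # prior: 0.1% of loans are not repaid
p_B_given_A = 0.99          # system catches 99% of bad loans
p_B_given_notA = 0.01       # system wrongly flags 1% of good loans
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)    # total probability
p_A_given_B = p_B_given_A * p_A / p_B                   # Bayes Rule
print(p_A_given_B)          # roughly 0.09: most flagged loans would in fact be repaid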
Independence
A key concept in statistics is independence: when one event has no influence on the
probability of another.

Definition: Events A and B are independent if:

P(A|B) = P(A)

or P(A ∩ B) = P(A)P(B) (alternative definition)

If A1 , A2 , A3 are all mutually independent then we have:

P(A1 ∩ A2 ∩ A3 ) = P(A1 )P(A2 )P(A3 )

Pair-wise independence is necessary but not sufficient for mutual independence.


Properties of independent events

If A and B are independent we have:

• Aᶜ and Bᶜ are independent


• Aᶜ and B, or A and Bᶜ, are independent

If A, B and C are mutually independent then:

• A and B ∩ C are independent


• A and B ∪ C are independent

etc.
Example
Tossing two fair coins (elements of sample space are ordered pairs):

Ω = {(H, H), (H, T ), (T , H), (T , T )}

assigning ‘classical’ probabilities:


P({(H, H)}) = 1/4,  P({(H, T)}) = 1/4,  etc.

With these probabilities, coin 1 / coin 2 events are independent:

P(coin 1 = H | coin 2 = H) = P({(H, H)}) / P(coin 2 = H) = (1/4) / (1/2) = 1/2 = P(coin 1 = H)
(The same holds for all other combinations of H and T .)
Hence the experiments are independent.

In general, the relative-frequency view of repeated experiments only holds if the
experiments are independent.
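A rough empirical check of this independence (a simulation sketch in Python with numpy):

import numpy as np

rng = np.random.default_rng(1)
n = 100_000
coin1 = rng.integers(0, 2, n)                      # 1 = heads
coin2 = rng.integers(0, 2, n)
p_joint = np.mean((coin1 == 1) & (coin2 == 1))     # estimate of P(coin 1 = H, coin 2 = H)
p1 = np.mean(coin1 == 1)
p2 = np.mean(coin2 == 1)
print(p_joint, p1 * p2)                            # both close to 0.25, as independence predicts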
Random Variables

A random variable (RV) X represents outcomes or states of the world. Instantiations are
usually written in lower case: x.

For a discrete Random Variable X , P(X = x) is the probability that the RV X takes the
value x.

For a real Random Variable X, P(X ≤ x) is the probability that the RV X takes a value
less than or equal to x. This is the Cumulative Distribution Function (cdf).

Probability mass (density) function (pdf) p(x) ≥ 0


Discrete case: p(x) = P(X = x)
Continuous case: p(x) = d/dx P(X ≤ x) ≈ P(x ≤ X < x + ∆x) / ∆x
Example pdf: The Uniform Distribution
[Figure: the cdf P(X ≤ x) (top) and the pdf p(x) (bottom) of a uniform distribution; x axis from 0 to 5.]
Interpretation of p(x)

Probability of ‘small interval’ event {x ≤ X < x + ∆x}:

p(x) ≈ P(x ≤ X < x + ∆x) / ∆x

This is exact in the limit ∆x → 0.

Properties of pdf
1. f(x) ≥ 0 (from the definition of F(x))
2. ∫_{−∞}^{∞} f(x) dx = 1 (probability of the whole space)

Remember: “Densities are for integrating.”


Ensemble Averages and Expectations
Often we are only interested in a partial characterization of the RV.

Example
When gambling the expected rate of return is our main statistical interest.
(Here we ignore important practical considerations, like available cash reserve.)

Hence we define ‘average’ properties.

There are two complementary definitions:

• in terms of ‘data’ (sample properties)


• in terms of the model (expectations)
Sample Mean and Expected Value
Sample Mean: Given N samples xi drawn (independently) from an experiment, we define
the mean to be:
m̂N = (x1 + x2 + . . . + xN) / N
This measures the ‘average’ of the samples.

Expected Value: Given a RV X with pdf p(x) we define:


E{X} = ∫_{−∞}^{∞} x · p(x) dx        if X is continuous

E{X} = Σ_x x · P(X = x)              if X is discrete (no pdf)

We sometimes write mX = E {X } or µX = E {X }.

Intuitively they are related (e.g. relative frequency) but how?


Law of Large Numbers
Roughly (and in many different ways!) as we have a very large number of data points
(N → ∞), we find that
m̂N → E {x}
This general idea is known as the ‘Law of Large Numbers’.
(More on this later. . . )

Example
Roll many (large N) independent fair dice,
each die roll giving a number between 1 and 6.
(Remember: ‘fair’ means ‘equal probability of each outcome’.)
Assuming equal probabilities, we would expect:
m̂N = (1/N) Σ_i xi ≈ E{x} = Σ_{k=1}^{6} k/6 = 3.5
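A minimal simulation of this example (Python with numpy):

import numpy as np

rng = np.random.default_rng(2)
rolls = rng.integers(1, 7, size=100_000)   # independent fair die rolls, values 1..6
print(rolls.mean())                        # sample mean, close to E{x} = 3.5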
Functions of an RV - Means and expectations
Suppose that g(x) is a function,
mapping real values x ∈ R to real values y = g(x) ∈ R.
We can define means and expectations of functions of an RV X:

E{Y} = E{g(X)} = ∫_{−∞}^{∞} g(x) p(x) dx

(note Y is a random variable itself and hence we are again considering the mean of an RV)

Example: p(x) the pdf of the age of the population, g (x) the medical costs associated
with age x

Again there is an analogous definition for sample mean of Y :


m̂N = (y1 + y2 + . . . + yN) / N
Variance
Another important statistic is one that measures the ‘spread’ of the outcomes. One such
statistic is the variance:

Var{X} = σX² = E{(X − E{X})²}

The quantity σX = √(Var{X}) is called the standard deviation.

Var{X} = ∫_{−∞}^{∞} (x − E{X})² pX(x) dx
       = ∫_{−∞}^{∞} (x² − 2x E{X} + E{X}²) pX(x) dx
       = ∫_{−∞}^{∞} x² pX(x) dx − 2E{X} ∫_{−∞}^{∞} x pX(x) dx + E{X}² ∫_{−∞}^{∞} pX(x) dx
       = E{X²} − 2E{X}² + E{X}²
       = E{X²} − E{X}²
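A quick numerical sanity check of the identity Var{X} = E{X²} − E{X}² (Python with numpy; the exponential distribution here is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=200_000)    # samples of some RV X
lhs = x.var()                                   # sample variance
rhs = (x**2).mean() - x.mean()**2               # sample E{X²} − E{X}²
print(lhs, rhs)                                 # agree up to floating-point error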
Higher Order Statistics
Experiments often only need 2nd order statistics, i.e. only the
mean mX = E{X} and variance σX² = E{X²} − E{X}².

There are also Higher Order Statistics (HOS), often based upon polynomial expectations
(E{X³}, E{X⁴}, . . .), e.g.:

Skewness = E{ ((X − mX)/σX)³ }

(this measures the lopsidedness of a distribution) and


Kurtosis = E{ ((X − mX)/σX)⁴ } − 3

(this measures how much of the probability mass lies within the tails).
Positive kurtosis is also termed ‘heavy-tailed’.
(‘Heavy’ with respect to a Gaussian: hence the −3).
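A rough sketch of computing these as sample statistics (Python; scipy.stats.skew and scipy.stats.kurtosis report the same quantities, kurtosis being excess kurtosis, i.e. with the −3 already applied):

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(4)
x = rng.standard_normal(100_000)          # Gaussian: skewness ≈ 0, excess kurtosis ≈ 0
z = (x - x.mean()) / x.std()              # standardised samples
print((z**3).mean(), skew(x))             # sample skewness, two ways
print((z**4).mean() - 3, kurtosis(x))     # sample excess kurtosis, two ways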
The Gaussian Distribution
aka the ‘normal distribution’, with probability density function (pdf):

p(x) = (1/(σ√(2π))) × e^{−(x−µ)²/(2σ²)},

written in terms of the mean, µ, and variance, σ².

[Figure: the Gaussian pdf (bell curve); x axis from −4 to 4, peak value ≈ 0.4.]
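A small sketch evaluating this density numerically (Python; checked against scipy.stats.norm.pdf):

import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = np.linspace(-4, 4, 9)
p = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(np.allclose(p, norm.pdf(x, loc=mu, scale=sigma)))   # True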
Multiple Random Variables
Extend the idea of RVs to two or more dimensions,
e.g. RVs X (age) and Y (wealth)

RVs X and Y have respective distributions pX (x) and pY (y ) but this does not tell us
about the relation between X and Y .

Joint Density
pXY(x, y) ≈ P({x < X ≤ x + ∆x}, {y < Y ≤ y + ∆y}) / (∆x ∆y)

The probability of any joint area D ⊂ R² (e.g. P({x1 < X ≤ x2}, {y1 < Y ≤ y2})) can be
evaluated from the 2-dimensional integral:

P((X, Y) ∈ D) = ∫∫_D pXY(x, y) dx dy
Joint Density

Joint probability of three random variables x, y, and z.


Joint and Marginal densities

Given a joint distribution pXY (x, y ) we define the marginal density:


pX(x) = ∫_{−∞}^{∞} pXY(x, y) dy        (1)

So: Area of slice through pXY (x, y ) at X = x is equal to pX (x).
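A discrete analogue of this marginalisation (Python with numpy; the 2 × 3 joint table is made up for illustration):

import numpy as np

# hypothetical joint P(X = i, Y = j); rows index x, columns index y
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
p_x = p_xy.sum(axis=1)    # marginal over y: P(X = i)
p_y = p_xy.sum(axis=0)    # marginal over x: P(Y = j)
print(p_x, p_y)           # [0.4 0.6] and [0.35 0.35 0.3]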


Relationship between Joint and Marginal densities

Marginal probability of x and y , when z is marginalised.


Conditional Densities
pY|X(y|x) = pXY(x, y) / pX(x)

Conditional probabilities of x and y, given z, are (normalised) slices in the p(x, y, z) cube.

(NB. Remember that ∫∫ p(x, y|z) dx dy = 1)
Total Probability and Bayes Rule
As before we now have Total Probability Rule:
pY(y) = ∫_{−∞}^{∞} p(x, y) dx = ∫_{−∞}^{∞} p(y|x) p(x) dx

We also have a density version of Bayes rule:

p(y |x)p(x) = p(x|y )p(y ) = p(x, y )

or
p(y|x) = p(x|y) p(y) / p(x)
Independent RVs

Recall: Events A and B are independent if P(A, B) = P(A)P(B).

We say the RVs X and Y are independent RVs if:

fXY(x, y) = fX(x) fY(y)        (2)


So joint density has a factorial (or product) form.

Example: Sum of two independent RVs
Let Z = X + Y and X and Y be independent.

Given pX (x) and pY (y ) how do we calculate pZ (z)?

We note:

pZ(z) = ∫_X pXZ(x, z) dx = ∫_X pZ|X(z|x) pX(x) dx

For fixed X = x, we get Z = x + Y or Y = Z − x.


So Z is simply a translation of Y :

pZ |X (z|x) = pY (z − x)

which gives:

pZ(z) = ∫_X pY(z − x) pX(x) dx
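A quick sketch of this convolution in the discrete case, where the integral becomes a sum (Python with numpy; two independent fair dice):

import numpy as np

p_x = np.full(6, 1/6)           # P(X = 1..6) for a fair die
p_y = np.full(6, 1/6)           # P(Y = 1..6)
p_z = np.convolve(p_x, p_y)     # P(Z = z) for Z = X + Y, z = 2..12
print(p_z)                      # peaks at 6/36 for z = 7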
Example: Expectation of sum of two independent RVs
To prove (which we used in the weak law of large numbers):

E {X + Y } = E {X } + E {Y }

Proof:

E{X + Y} = ∫∫_{X,Y} (x + y) pXY(x, y) dx dy

Given independence:
E{X + Y} = ∫_X ∫_Y (x + y) pX(x) pY(y) dx dy
         = ∫_X ∫_Y x pX(x) pY(y) dx dy + ∫_X ∫_Y y pX(x) pY(y) dx dy
         = ∫_X x pX(x) dx + ∫_Y y pY(y) dy
         = E{X} + E{Y}        QED
Exercise
Show that the same holds for the variance of the sum of independent RVs, i.e.

Var{X + Y } = Var{X } + Var{Y }


Covariance and correlations
The covariance of two RVs X and Y is:
Cov(X, Y) ≡ σXY = E{(X − mX)(Y − mY)} = E{XY} − mX mY

The (Pearson) correlation coefficient (sometimes just correlation) is:

ρXY = Cov(X, Y) / (σX σY)
which measures how linearly related the two variables are.

Fact: −1 ≤ ρXY ≤ 1. (Can you show this?)


If ρXY = 0, the RVs are said to be uncorrelated.
If ρXY = ±1 then Y is simply a linear rescaling of X .

NB: In e.g. signal processing, can also talk about correlation,


Cor(X , Y ) = E {XY }, so that Cov(X , Y ) = Cor(X , Y ) − mX mY .

Beware possible confusion in use of word “correlation”!
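A small numerical sketch (Python with numpy; np.cov and np.corrcoef compute the sample versions of these quantities):

import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(50_000)
y = 2.0 * x + rng.standard_normal(50_000)    # linearly related to x, plus noise
print(np.cov(x, y)[0, 1])                    # sample Cov(X, Y), close to 2
print(np.corrcoef(x, y)[0, 1])               # sample rho_XY, close to 2/sqrt(5) ≈ 0.89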


Independent vs Uncorrelated RVs
If X and Y are independent scalar RVs, we have:

pXY (x, y ) = pX (x) pY (y )

Hence E {XY } = E {X }E {Y } = mX mY . (Exercise: Show this).


Hence Cov(X , Y ) = E {XY } − mX mY = 0 and so also ρXY = 0

Therefore: Independent ⇒ Uncorrelated.

However: Uncorrelated ⇏ Independent


Covariance Cov(X , Y ) only measures 2nd order statistics. Independence is much stronger:
all nonlinear statistics must be unrelated.

Nevertheless, “uncorrelated” is much easier to calculate (and enforce!) than


“independent”, so is of great practical interest.
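A classic illustration of uncorrelated-but-dependent RVs, sketched numerically (Python with numpy; Y = X² with X symmetric about 0 is a standard textbook example):

import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(200_000)
y = x**2                                # fully determined by x, so clearly not independent
print(np.corrcoef(x, y)[0, 1])          # approximately 0: uncorrelated
# but higher-order statistics are related, e.g. E{X²Y} = E{X⁴} = 3 ≠ E{X²}E{Y} = 1
print(np.mean(x**2 * y), np.mean(x**2) * np.mean(y))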
N-dimensional Random Variables
Distributions, densities, etc. generalize to N dimensions.
Let X⃗ = [X1, . . . , XN]; then:

pX⃗(x⃗) = p([x1, . . . , xk] | [xk+1, . . . , xN]) f([xk+1, . . . , xN])

etc.

Multi-Dimensional Expectations
Just as we defined E {x} for scalar RVs, we have the same for vector RVs:
m⃗x = E{X⃗} = ∫_{x1} ∫_{x2} · · · ∫_{xN} x⃗ pX⃗(x⃗) dx1 . . . dxN

where the arrow denotes a vector-valued quantity.


Multi-Dimensional Variance: Covariance Matrix
The equivalent of multi-dimensional variance is more complex. There are several ways of
generalising E{X²} to N dimensions.

To capture the full generality we define the covariance matrix:


Cov(X⃗) = E{(X⃗ − m⃗X)(X⃗ − m⃗X)ᵀ}

Note that Cov(X⃗) is an N × N matrix.

For two N-d RVs X⃗ and Y⃗ we can define the cross-covariance:

Cov(X⃗, Y⃗) = E{(X⃗ − m⃗X)(Y⃗ − m⃗Y)ᵀ}

Exercise

Let X and Y be uncorrelated scalar RVs, and let Z⃗ = [X, Y].

Calculate the form of the covariance matrix Cov(Z⃗).
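A numerical sketch of this exercise (Python with numpy; the variances Var{X} = 1 and Var{Y} = 4 are hypothetical choices):

import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 100_000)    # Var{X} = 1
y = rng.normal(0.0, 2.0, 100_000)    # Var{Y} = 4, drawn independently of X
z = np.stack([x, y])                 # Z = [X, Y]
print(np.cov(z))                     # approximately diag(Var{X}, Var{Y}) = [[1, 0], [0, 4]]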


N-dimensional Gaussian RVs
The Gaussian distribution has a natural N-dimensional form:

pX⃗(x⃗) = (det(Cov(X⃗))^{−1/2} / (2π)^{N/2}) exp( −½ (x⃗ − m⃗x)ᵀ Cov(X⃗)⁻¹ (x⃗ − m⃗x) )

As for the scalar case, it is completely specified by
(1) the mean m⃗x, and (2) the covariance Cov(X⃗).

The N-d Gaussian also has the following interesting properties:

1. Let Z⃗ = aX⃗ + bY⃗.
   If X⃗ and Y⃗ are jointly Gaussian,
   then Z⃗ is also Gaussian, with m⃗z = a m⃗x + b m⃗y.
2. If X⃗ and Y⃗ are jointly Gaussian and uncorrelated,
   then they are also independent.
   So correlation measures dependence for Gaussian RVs.
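A small sketch evaluating this N-d density (Python; scipy.stats.multivariate_normal implements the same form and is used here as a check; the mean and covariance values are made up):

import numpy as np
from scipy.stats import multivariate_normal

mean = np.array([0.0, 1.0])
cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])
x = np.array([0.5, 0.5])
diff = x - mean
norm_const = (2 * np.pi) ** (len(x) / 2) * np.sqrt(np.linalg.det(cov))
p = np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm_const
print(np.isclose(p, multivariate_normal(mean, cov).pdf(x)))   # True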
