Let n ≥ 1 and X1, X2, ..., Xn be an i.i.d. random sample from a population. Recall the sample mean
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.
In Chapter 3, while discussing discrete random variables, we considered a finite collection of random variables (X1, X2, ..., Xn). In Definition 3.2.7 we described how to define their joint distribution, and we used this to understand the multinomial distribution in Example 3.2.12.
In the continuous setting as well, there are many instances where it is relevant to study the joint distribution of a finite collection of random variables. Suppose X is a point chosen randomly in the unit ball in three dimensions. Then X has three coordinates, say X = (X1, X2, X3), where each Xi is a random variable taking values in (−1, 1). Moreover, they are dependent because we know that \sqrt{X_1^2 + X_2^2 + X_3^2} \le 1. It is therefore useful, and often necessary, to understand their "joint distribution". We have already seen the usefulness of the sample mean and sample variance, which are functions of X1, X2, ..., Xn. To understand the distributions of the sample mean and sample variance, the joint distribution of X1, X2, ..., Xn needs to be understood first. We define the joint distribution function first.
Definition 8.1.1. Let n ≥ 1 and X1 , X2 , . . . , Xn be random variables defined on the same probability space.
The joint distribution function F : Rn → [0, 1] is given by
F (x1 , x2 , . . . , xn ) = P (X1 ≤ x1 , X2 ≤ x2 , . . . , Xn ≤ xn ),
for x1 , x2 , . . . , xn ∈ R.
As in the single variable and two variable situations, the joint distribution function determines the entire joint distribution of X1, X2, ..., Xn. More precisely, if all the random variables were discrete, with Xi : S → Ti and Ti a countable subset of R for 1 ≤ i ≤ n, then from the joint distribution function one can determine
P(X_1 = t_1, X_2 = t_2, \ldots, X_n = t_n),
for all ti ∈ Ti , 1 ≤ i ≤ n. To understand the random variables in the continuous setting we need to set up
some notation.
Let n ≥ 1 and f : R^n → R be a non-negative function, piecewise-continuous in each variable, for which
\int_{\mathbb{R}^n} f(x_1, x_2, \ldots, x_n)\, dx_1 dx_2 \ldots dx_n = 1.
In this setting, the joint distribution of (X1, X2, ..., Xn) is determined by the joint density f. Using multivariable calculus, we can state and prove an analogue of Theorem 5.2.5 for random variables (X1, X2, ..., Xn) that have a joint density. In particular, since the joint densities are assumed to be piecewise continuous, the corresponding distribution functions are piecewise differentiable. Further, the joint distribution of the continuous random variables (X1, X2, ..., Xn) is completely determined by their joint distribution function F. That is, if we know F(x1, x2, ..., xn) for all x1, x2, ..., xn ∈ R, we can use multivariable calculus to differentiate F and find f. Integrating this joint density over an event A, we can then calculate P((X1, X2, ..., Xn) ∈ A).
As in the n = 2 case, one can recover the marginal density of each Xi, for i between 1 and n, by integrating over the other variables. So the marginal density of Xi at a is given by
f_{X_i}(a) = \int_{\mathbb{R}^{n-1}} f(x_1, \ldots, x_{i-1}, a, x_{i+1}, \ldots, x_n)\, dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_n.
Further, for n ≥ 3, we can deduce the joint density of any sub-collection of m ≤ n of the random variables by integrating over the other variables. For instance, if we were interested in the joint density of (X1, X3, X7) we would obtain
f_{X_1, X_3, X_7}(a_1, a_3, a_7) = \int_{\mathbb{R}^{n-3}} f(a_1, x_2, a_3, x_4, x_5, x_6, a_7, x_8, \ldots, x_n)\, dx_2\, dx_4\, dx_5\, dx_6\, dx_8 \ldots dx_n.
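As a quick sanity check of the marginalisation formula, the following short R sketch integrates out one variable numerically. The joint density f(x, y) = x + y on the unit square is our own illustrative choice, not an example from the text; its marginal is f_X(a) = a + 1/2.
# Numerical check of the marginalisation formula for a simple joint density.
# The density f(x, y) = x + y on (0,1) x (0,1) is an illustrative choice.
f <- function(x, y) x + y
marginal_X <- function(a) integrate(function(y) f(a, y), lower = 0, upper = 1)$value
a <- 0.3
c(numerical = marginal_X(a), exact = a + 0.5)   # the two values agree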
Suppose X1, X2, ..., Xn are random variables defined on a single sample space S with joint density f : R^n → R. Let g : R^n → R be a function of n variables for which g(X1, X2, ..., Xn) is defined on the range of the Xj's. Let B be an event in the range of g. Then, following the proof of Theorem 3.3.5, we can show that
P\big(g(X_1, X_2, \ldots, X_n) \in B\big) = \int_{\{(x_1, \ldots, x_n)\,:\, g(x_1, \ldots, x_n) \in B\}} f(x_1, x_2, \ldots, x_n)\, dx_1 dx_2 \ldots dx_n.
The above provides an abstract method of finding the distribution of the random variable Y = g(X1, X2, ..., Xn), but it might be difficult to calculate it explicitly. For n = 1, we discussed this question in detail in Section 5.3, and for n = 2 we explored how to find the distributions of sums and ratios of independent random variables (see Section 5.5). In a few cases this method can be extended by induction on n, but in general it is not possible. In Appendix B, Section B.2, we discuss the Jacobian method of finding the joint density of the transformed random variables.
The notion of independence also extends to finite collections of continuous random variables, as in the discrete setting. As discussed in Definition 3.2.3, a finite collection of continuous random variables X1, X2, ..., Xn is mutually independent if the events (Xj ∈ Aj) are mutually independent for all events Aj in the ranges of the corresponding Xj. As proved for the n = 2 case in Theorem 5.4.7, we can similarly deduce that if (X1, X2, X3, ..., Xn) are mutually independent continuous random variables with marginal densities f_{X_i}, then their joint density is given by
f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i), \qquad (8.1.1)
for xi ∈ R and 1 ≤ i ≤ n. Further, for any finite sub-collection {X_{i_1}, X_{i_2}, ..., X_{i_m}} of the above independent random variables, their joint density is given by
f_{X_{i_1}, X_{i_2}, \ldots, X_{i_m}}(a_1, a_2, \ldots, a_m) = \prod_{j=1}^{m} f_{X_{i_j}}(a_j). \qquad (8.1.2)
Let n ≥ 1 and let X1, X2, ..., Xn be an i.i.d. random sample from a population. Let F be the common distribution function. Let the Xi's be arranged in increasing order of magnitude, denoted by
X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}.
These ordered values are called the order statistics of the sample X1, X2, ..., Xn. For 1 ≤ r ≤ n, X(r) is called the r-th order statistic. One can compute F(r), the distribution function of X(r), for 1 ≤ r ≤ n in terms of n and F. We have
F_{(1)}(x) = P(X_{(1)} \le x) = 1 - P(X_{(1)} > x) = 1 - P\left(\cap_{i=1}^{n} \{X_i > x\}\right) = 1 - \prod_{i=1}^{n} P(X_i > x) = 1 - \prod_{i=1}^{n} \left(1 - P(X_i \le x)\right) = 1 - (1 - F(x))^n,
and
F_{(n)}(x) = P(X_{(n)} \le x) = P\left(\cap_{i=1}^{n} \{X_i \le x\}\right) = \prod_{i=1}^{n} P(X_i \le x) = (F(x))^n,
for x ∈ R.
If the distribution function F has a probability density function f, then each X(r) has a probability density function f(r). This can be obtained by differentiating F(r) and is given by
f_{(r)}(x) = \frac{n!}{(r-1)!\,(n-r)!}\, [F(x)]^{r-1}\, [1 - F(x)]^{n-r}\, f(x). \qquad (8.1.3)
Example 8.1.3. Let n ≥ 1 and let X1, X2, ..., Xn be an i.i.d. random sample from a population whose common distribution F is that of an Exponential(λ) random variable. Then we know that
F(x) = \begin{cases} 0 & x < 0 \\ 1 - e^{-\lambda x} & x \ge 0. \end{cases}
Therefore, using (8.1.3) and substituting for F as above, the densities of the order statistics are given by
f_{(r)}(x) = \begin{cases} n\, (e^{-\lambda x})^{n-1}\, \lambda e^{-\lambda x} & r = 1 \\ n \lambda e^{-\lambda x} (1 - e^{-\lambda x})^{n-1} & r = n \\ \frac{n!}{(r-1)!(n-r)!}\, \lambda e^{-\lambda x} (1 - e^{-\lambda x})^{r-1} (e^{-\lambda x})^{n-r} & 1 < r < n, \end{cases}
which simplifies to
f_{(r)}(x) = \begin{cases} n \lambda e^{-n\lambda x} & r = 1 \\ n \lambda e^{-\lambda x} (1 - e^{-\lambda x})^{n-1} & r = n \\ \frac{\lambda\, n!}{(r-1)!(n-r)!} (1 - e^{-\lambda x})^{r-1} (e^{-\lambda x})^{n-r+1} & 1 < r < n, \end{cases}
for x > 0. We note from the above that X(1), i.e., the minimum of the exponentials, is an Exponential(nλ) random variable. However, the other order statistics are not exponentially distributed.
In many applications one is interested in the range of values a random variable X assumes. A method to understand this is to sample X1, X2, ..., Xn i.i.d. with the distribution of X and examine the range R = X(n) − X(1). Suppose X has a probability density function f : R → R and distribution function F : R → [0, 1]. As before, we can calculate the joint density of (X(1), X(n)) by first computing their joint distribution function, using the i.i.d. nature of the sample and the definition of the order statistics: for x < y,
F_{X_{(1)}, X_{(n)}}(x, y) = P(X_{(n)} \le y) - P(X_{(1)} > x,\, X_{(n)} \le y) = [F(y)]^n - [F(y) - F(x)]^n.
Differentiating partially in x and y, we see that the joint density of (X(1), X(n)) is given by
f_{X_{(1)}, X_{(n)}}(x, y) = \begin{cases} n(n-1)\, f(x)\, f(y)\, [F(y) - F(x)]^{n-2} & x < y \\ 0 & \text{otherwise.} \end{cases} \qquad (8.1.4)
To calculate the distribution of R, we compute its distribution function. For r ≤ 0, P(R ≤ r) = 0, and for r > 0, using the above joint density of (X(1), X(n)) we have
P(R \le r) = P(X_{(n)} \le X_{(1)} + r) = \int_{-\infty}^{\infty} \int_{0}^{r} f_{X_{(1)}, X_{(n)}}(x, z + x)\, dz\, dx = \int_{0}^{r} \int_{-\infty}^{\infty} f_{X_{(1)}, X_{(n)}}(x, z + x)\, dx\, dz,
where we have made the change of variable y = z + x in the second step and changed the order of integration in the last step. Differentiating the above, we conclude that R has density given by
f_R(r) = \begin{cases} \int_{-\infty}^{\infty} f_{X_{(1)}, X_{(n)}}(x, r + x)\, dx & \text{if } r > 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (8.1.5)
Example 8.1.4. Let X1, X2, ..., Xn be i.i.d. Uniform(0, 1). The probability density function and distribution function of a Uniform(0, 1) random variable are given by
f(x) = \begin{cases} 1 & \text{if } x \in (0, 1) \\ 0 & \text{otherwise,} \end{cases} \qquad\text{and}\qquad F(x) = \begin{cases} 0 & \text{if } x \le 0 \\ x & \text{if } 0 < x < 1 \\ 1 & \text{if } x \ge 1. \end{cases}
Substituting these into (8.1.3) and (8.1.5), we see that X_{(r)} ∼ Beta(r, n − r + 1) for 1 ≤ r ≤ n and the range R ∼ Beta(n − 1, 2).
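A simulation along the same lines can be used to check the Beta claims above; in the R sketch below the values of n, r and the seed are illustrative choices.
# Simulation check: for i.i.d. Uniform(0,1) samples, X_(r) should follow
# Beta(r, n - r + 1) and the range should follow Beta(n - 1, 2).
set.seed(2)
n <- 10; r <- 3; reps <- 1e5
u   <- matrix(runif(n * reps), nrow = reps)
xr  <- apply(u, 1, function(row) sort(row)[r])     # r-th order statistic
rng <- apply(u, 1, max) - apply(u, 1, min)         # range X_(n) - X_(1)
grid <- seq(0.1, 0.9, by = 0.2)
round(cbind(emp_xr  = ecdf(xr)(grid),  beta_xr  = pbeta(grid, r, n - r + 1),
            emp_rng = ecdf(rng)(grid), beta_rng = pbeta(grid, n - 1, 2)), 3)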
In general we can also describe the joint distribution of all the order statistics. Suppose we have an i.i.d. sample X1, X2, ..., Xn with common distribution that of X. If X has a probability density function f : R → R, then one can show that the order statistics (X(1), X(2), ..., X(n)) have a joint density h : R^n → R given by
h(u_1, u_2, \ldots, u_n) = \begin{cases} n!\, f(u_1) f(u_2) \cdots f(u_n) & u_1 < u_2 < \cdots < u_n \\ 0 & \text{otherwise.} \end{cases}
Intuitively this is clear: any ordering u1 < u2 < ... < un "has a probability" f(u1)f(u2)...f(un), each of the Xi can take any of the uk's, and the total number of possible orderings is n!. A formal proof uses the Jacobian method and will be discussed in Appendix B.
8.1.2 χ², F and t
The χ², F and t distributions arise naturally when considering functions of i.i.d. normal random variables (X1, X2, X3, ..., Xn) for n ≥ 1. They are also useful in Hypothesis testing. We discuss these via three examples.
Example 8.1.5. (Chi-Square) Let n ≥ 1 and (X1, X2, X3, ..., Xn) be a collection of independent Normal random variables with mean 0 and variance 1. Then the joint density is given by
f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i) = \frac{1}{(\sqrt{2\pi})^{n}}\, e^{-\sum_{i=1}^{n} x_i^2/2},
for xi ∈ R and 1 ≤ i ≤ n.
Let Z = \sum_{i=1}^{n} X_i^2. We shall find the distribution of Z in two steps. First, clearly the range of X_1^2 is non-negative. The distribution function of X_1^2 at z ≥ 0 is given by
F_1(z) = P(X_1^2 \le z) = P(-\sqrt{z} \le X_1 \le \sqrt{z}) = \int_{-\sqrt{z}}^{\sqrt{z}} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = \int_{0}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-u/2}\, u^{-1/2}\, du,
where we have used the symmetry of the integrand and the substitution u = x^2 in the last step.
Comparing the integrand with the Gamma(α, λ) density defined in Definition 5.5.5 and using Exercise 5.5.10, we see that X_1^2 is distributed as a Gamma(1/2, 1/2) random variable. Using the calculation done in Example 5.5.6 for n = 2 and by induction, we have that Z = \sum_{i=1}^{n} X_i^2 is a Gamma(n/2, 1/2) random variable. This distribution is referred to as Chi-Square with n degrees of freedom. We define it precisely next.
Definition 8.1.6. (Chi-Square with n degrees of freedom) A random variable X whose distribution is Gamma(n/2, 1/2) is said to have the Chi-square distribution with n degrees of freedom (i.e., the number of parameters). Gamma(n/2, 1/2) is denoted by χ²_n and, as discussed earlier, it has density given by
f(x) = \frac{2^{-n/2}}{\Gamma(\frac{n}{2})}\, x^{\frac{n}{2} - 1} e^{-x/2} = \begin{cases} \frac{2^{-n/2}}{(\frac{n}{2} - 1)!}\, x^{\frac{n}{2} - 1} e^{-x/2} & \text{when } n \text{ is even,} \\ \frac{2^{\frac{n}{2} - 1}\, (\frac{n-1}{2})!}{(n-1)!\, \sqrt{\pi}}\, x^{\frac{n}{2} - 1} e^{-x/2} & \text{when } n \text{ is odd,} \end{cases}
when x > 0.
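The identification of the χ²_n distribution with Gamma(n/2, 1/2) can be verified numerically; in the following R sketch the value of n and the evaluation points are illustrative choices.
# The sum of n squared standard normals should match both the
# Gamma(n/2, 1/2) and the chi-square(n) distribution functions.
set.seed(3)
n <- 4; reps <- 1e5
z <- matrix(rnorm(n * reps), nrow = reps)
s <- rowSums(z^2)
x <- c(1, 2, 5, 10)
round(cbind(empirical = ecdf(s)(x),
            gamma     = pgamma(x, shape = n / 2, rate = 1 / 2),
            chisq     = pchisq(x, df = n)), 3)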
We shall show in the next subsection that the sample variance from a Normal population is (up to scaling) a Chi-square random variable. In the next chapter we shall construct a test to make inferences about the variances of two populations. In that context we shall compare sample variances, and this is where the F distribution arises naturally.
Example 8.1.7. (F-distribution) Suppose X1, X2, ..., X_{n_1} is an i.i.d. random sample from a Normal population with mean 0 and variance σ_1², and Y1, Y2, ..., Y_{n_2} is an i.i.d. random sample from a Normal population with mean 0 and variance σ_2². We have already seen in Example 8.1.5 that U = \sum_{i=1}^{n_1} \left(\frac{X_i}{\sigma_1}\right)^2 is a χ²_{n_1} random variable and V = \sum_{i=1}^{n_2} \left(\frac{Y_i}{\sigma_2}\right)^2 is a χ²_{n_2} random variable. Further, U and V are independent. Let
Z = \frac{U/n_1}{V/n_2}, \qquad\text{and let}\qquad Y = \frac{n_1}{n_2}\, Z = \frac{U}{V}.
As done in Example 5.5.10, the density of Y for y > 0 is given by
f_Y(y) = \frac{\Gamma(\frac{n_1 + n_2}{2})}{\Gamma(\frac{n_1}{2})\, \Gamma(\frac{n_2}{2})}\, \frac{y^{\frac{n_1}{2} - 1}}{(1 + y)^{\frac{n_1 + n_2}{2}}}.
Therefore,
F_Z(z) = P(Z \le z) = P\!\left(Y \le \frac{n_1}{n_2} z\right) = \int_{-\infty}^{\frac{n_1}{n_2} z} \frac{\Gamma(\frac{n_1 + n_2}{2})}{\Gamma(\frac{n_1}{2})\, \Gamma(\frac{n_2}{2})}\, \frac{y^{\frac{n_1}{2} - 1}}{(1 + y)^{\frac{n_1 + n_2}{2}}}\, dy = \int_{-\infty}^{z} \frac{n_1}{n_2}\, \frac{\Gamma(\frac{n_1 + n_2}{2})}{\Gamma(\frac{n_1}{2})\, \Gamma(\frac{n_2}{2})}\, \frac{(\frac{n_1}{n_2} u)^{\frac{n_1}{2} - 1}}{(1 + \frac{n_1}{n_2} u)^{\frac{n_1 + n_2}{2}}}\, du,
making the substitution u = \frac{n_2}{n_1} y in the last step (the integrand is understood to be 0 for negative arguments). Z is said to have the F(n_1, n_2) distribution, a widely used distribution in statistics called the F-distribution.
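The construction of Z can be checked against R's built-in F distribution; the degrees of freedom and the seed in the sketch below are illustrative choices.
# Z = (U/n1) / (V/n2) for independent chi-square variables U and V
# should follow the F(n1, n2) distribution.
set.seed(4)
n1 <- 5; n2 <- 8; reps <- 1e5
U <- rchisq(reps, df = n1)
V <- rchisq(reps, df = n2)
Z <- (U / n1) / (V / n2)
q <- c(0.5, 1, 2, 4)
round(cbind(empirical = ecdf(Z)(q), pf = pf(q, n1, n2)), 3)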
The distribution of the ratio of sample mean and sample variance plays an important role in Hypothesis
testing. This forms the motivation for the next example where the t distribution arises naturally.
Example 8.1.8. (t-distribution) Let X1 be a Normal random variable with mean 0 and variance 1, and let X2 be an independent χ²_n random variable. Let
Z = \frac{X_1}{\sqrt{X_2/n}}.
We wish to find the density of Z. Observe that U = Z^2 is given by \frac{X_1^2}{X_2/n}. Now, X_1^2 has the χ²_1 distribution (see Example 8.1.5). So, applying Example 8.1.7 with n1 = 1 and n2 = n, we find that U has the F(1, n) distribution. The density of U is given by
f_U(u) = \frac{1}{n}\, \frac{\Gamma(\frac{n+1}{2})}{\Gamma(\frac{1}{2})\, \Gamma(\frac{n}{2})}\, \frac{(\frac{u}{n})^{\frac{1}{2} - 1}}{(1 + \frac{u}{n})^{\frac{n+1}{2}}} = \frac{\Gamma(\frac{n+1}{2})}{\sqrt{n\pi}\, \Gamma(\frac{n}{2})}\, \frac{u^{-\frac{1}{2}}}{(1 + \frac{u}{n})^{\frac{n+1}{2}}},
for u > 0. On the other hand, since Z is a symmetric random variable (see Exercise 8.1.10), for u > 0,
P(U \le u) = P(Z^2 \le u) = P(-\sqrt{u} \le Z \le \sqrt{u}) = P(Z \le \sqrt{u}) - P(Z \le -\sqrt{u}) = P(Z \le \sqrt{u}) - P(Z \ge \sqrt{u}) = 2 P(Z \le \sqrt{u}) - 1.
Differentiating both sides in u gives
f_U(u) = \frac{1}{\sqrt{u}}\, f_Z(\sqrt{u}), \qquad\text{and hence}\qquad f_Z(z) = |z|\, f_U(z^2).
Therefore
f_Z(z) = |z|\, \frac{\Gamma(\frac{n+1}{2})}{\sqrt{n\pi}\, \Gamma(\frac{n}{2})}\, \frac{(z^2)^{-\frac{1}{2}}}{(1 + \frac{z^2}{n})^{\frac{n+1}{2}}} = \frac{\Gamma(\frac{n+1}{2})}{\sqrt{n\pi}\, \Gamma(\frac{n}{2})} \left(1 + \frac{z^2}{n}\right)^{-\frac{n+1}{2}},
for z ∈ R. Z is said to have the t-distribution with n degrees of freedom. We will denote this by the notation Z ∼ tn.
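Similarly, the construction of Z in this example can be compared with R's t distribution; the value of n and the seed below are illustrative choices.
# X1 / sqrt(X2 / n), with X1 standard normal and X2 an independent
# chi-square(n) variable, should follow the t distribution with n
# degrees of freedom.
set.seed(5)
n <- 6; reps <- 1e5
Z <- rnorm(reps) / sqrt(rchisq(reps, df = n) / n)
q <- c(-2, -1, 0, 1, 2)
round(cbind(empirical = ecdf(Z)(q), pt = pt(q, df = n)), 3)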
Let n ≥ 1 and X1, X2, ..., Xn be an i.i.d. random sample from a population having mean µ and variance σ². Consider the sample mean
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
and sample variance
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.
We have already seen in Theorem 7.2.2 that E [X ] = µ and in Theorem 7.2.4 that E [S 2 ] = σ 2 . It turns out
that it is not easy to understand the precise distribution of X or S 2 in general. However, this can be done
when the population is normally distributed. The main result of this section is the following: if X1, X2, ..., Xn is an i.i.d. random sample from a Normal population with mean µ and variance σ², then
(a) X̄ is a Normal(µ, σ²/n) random variable;
(b) \frac{(n-1)}{\sigma^2} S^2 has the χ²_{n-1} distribution; and
(c) X̄ and S² are independent random variables.
Proof - (a) follows from Theorem 6.3.13. The typical proof of (b) and (c) is via Helmert's transformation (see [Rao73]) and requires some knowledge of Linear Algebra. Here we will follow Kruskal's proof as illustrated in [Stig84]. The proof is by induction on the sample size n. For implementing the inductive step, we shall replace X̄ and S² with X̄_n and S²_n for the rest of the proof.
For n = 2, note that
\bar{X}_2 = \frac{X_1 + X_2}{2} \qquad\text{and}\qquad S_2^2 = \left(X_1 - \frac{X_1 + X_2}{2}\right)^2 + \left(X_2 - \frac{X_1 + X_2}{2}\right)^2 = \frac{(X_1 - X_2)^2}{2}. \qquad (8.1.6)
(a) Follows from Theorem 6.3.13.
(b) As X1 and X2 are independent Normal random variables with mean µ and variance σ², by Theorem 6.3.13, \frac{X_1 - X_2}{\sigma\sqrt{2}} is a Normal random variable with mean 0 and variance 1. Using Example 8.1.5, we know that S_2^2/\sigma^2 has the χ²_1 distribution, and this proves (b).
(c) From (8.1.6), X̄_2 is a function of X1 + X2 and S²_2 is a function of X1 − X2. Theorem 8.1.2 will imply that X̄_2 and S²_2 are independent if we show that X1 + X2 and X1 − X2 are independent. Let α, β ∈ R. Then, using Theorem 6.3.13 again, we have that α(X1 + X2) + β(X1 − X2) = (α + β)X1 + (α − β)X2 is normally distributed. As this is true for any α, β ∈ R, by Definition 6.4.1 (X1 + X2, X1 − X2) is a bivariate normal random variable. Using Theorem 6.2.2 (f) and (g), along with the fact that X1 and X2 are independent Normal random variables with mean µ and variance σ², we have
Cov[X_1 + X_2,\, X_1 - X_2] = Var[X_1] - Var[X_2] = \sigma^2 - \sigma^2 = 0.
From Theorem 6.4.3 we conclude that X1 + X2 and X1 − X2 are independent, which proves (c) for n = 2.
Now let k ≥ 2 and suppose that (a), (b) and (c) hold for samples of size k; we prove them for samples of size k + 1. Observe that
\bar{X}_{k+1} = \frac{k \bar{X}_k + X_{k+1}}{k+1} \qquad\text{and}\qquad \frac{k}{\sigma^2} S_{k+1}^2 = \frac{k-1}{\sigma^2} S_k^2 + \frac{k}{\sigma^2 (k+1)} (X_{k+1} - \bar{X}_k)^2. \qquad (8.1.8)
As
\sqrt{\frac{k}{(k+1)\sigma^2}}\, (X_{k+1} - \bar{X}_k) = \sqrt{\frac{k}{(k+1)\sigma^2}}\, X_{k+1} - \frac{1}{k} \sqrt{\frac{k}{(k+1)\sigma^2}}\, \sum_{i=1}^{k} X_i,
a routine calculation using Theorem 6.3.13 shows that \sqrt{\frac{k}{(k+1)\sigma^2}}\,(X_{k+1} - \bar{X}_k) is a standard normal random variable.
By the induction hypothesis, X̄_k and \frac{k-1}{\sigma^2} S_k^2 are independent. Since X1, ..., Xk, X_{k+1} are mutually independent, Theorem 8.1.2 implies that X_{k+1} is independent of X̄_k and \frac{k-1}{\sigma^2} S_k^2. Therefore,
\bar{X}_k,\ \frac{k-1}{\sigma^2} S_k^2,\ X_{k+1} \ \text{are mutually independent random variables.} \qquad (8.1.9)
Consequently, another application of Theorem 8.1.2 implies that \frac{k}{\sigma^2 (k+1)} (X_{k+1} - \bar{X}_k)^2 and \frac{k-1}{\sigma^2} S_k^2 are independent random variables. By the induction hypothesis \frac{k-1}{\sigma^2} S_k^2 has the χ²_{k-1} distribution, and by the calculation above \frac{k}{\sigma^2(k+1)} (X_{k+1} - \bar{X}_k)^2 is the square of a standard normal random variable and hence has the χ²_1 distribution (Example 8.1.5). Since, by (8.1.8), \frac{k}{\sigma^2} S_{k+1}^2 is the sum of these two independent random variables, it has the χ²_k distribution. This proves (b).
(c) To prove (c), it is enough to show that X̄_{k+1} and X_{k+1} − X̄_k are independent. The reason is the following:
(i) Theorem 8.1.2 then implies that X̄_{k+1} is independent of \frac{k}{\sigma^2(k+1)} (X_{k+1} - \bar{X}_k)^2;
(ii) X̄_{k+1} is a function of X_{k+1} and X̄_k. So (8.1.9) and Theorem 8.1.2 imply that X̄_{k+1} is independent of \frac{k-1}{\sigma^2} S_k^2, and also that \frac{k}{\sigma^2(k+1)} (X_{k+1} - \bar{X}_k)^2 is independent of \frac{k-1}{\sigma^2} S_k^2;
(iii) using (i) and (ii) we can conclude that X̄_{k+1}, \frac{k-1}{\sigma^2} S_k^2, and \frac{k}{\sigma^2(k+1)} (X_{k+1} - \bar{X}_k)^2 are mutually independent; and
(iv) finally, S^2_{k+1} is a function of \frac{k-1}{\sigma^2} S_k^2 and \frac{k}{\sigma^2(k+1)} (X_{k+1} - \bar{X}_k)^2 by (8.1.8). Then (iii) and Theorem 8.1.2 imply that S^2_{k+1} and X̄_{k+1} are independent.
Let α, β ∈ R. We have
\alpha \bar{X}_{k+1} + \beta (X_{k+1} - \bar{X}_k) = \left(\frac{\alpha}{k+1} - \frac{\beta}{k}\right) \sum_{i=1}^{k} X_i + \left(\frac{\alpha}{k+1} + \beta\right) X_{k+1}.
Theorem 6.3.13 will imply that α X̄_{k+1} + β(X_{k+1} − X̄_k) is a normally distributed random variable for any α, β ∈ R. So, by Definition 6.4.1, (X̄_{k+1}, X_{k+1} − X̄_k) is a bivariate normal random variable. Further, from Theorem 6.2.2 (f) and (g), we have
Cov[\bar{X}_{k+1},\, X_{k+1} - \bar{X}_k] = Cov\!\left[\frac{k \bar{X}_k + X_{k+1}}{k+1},\, X_{k+1} - \bar{X}_k\right] = \frac{1}{k+1} Var[X_{k+1}] + \frac{k-1}{k+1} Cov[\bar{X}_k, X_{k+1}] - \frac{k}{k+1} Var[\bar{X}_k] = \frac{\sigma^2}{k+1} + 0 - \frac{k}{k+1} \cdot \frac{\sigma^2}{k} = 0,
where we have used (8.1.9) in the last line. From Theorem 6.4.3 we conclude that X̄_{k+1} and X_{k+1} − X̄_k are independent. This proves (c) and completes the induction.
The following Corollary will be used in Chapter 9: with X1, X2, ..., Xn an i.i.d. random sample from a Normal population with mean µ and variance σ², the random variable
\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}
is a standard Normal random variable, the random variable
\frac{(n-1)}{\sigma^2}\, S^2
has the χ²_{n-1} distribution, and the two are independent.
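These facts can be illustrated by simulation; in the R sketch below the population parameters, sample size and seed are illustrative choices (a near-zero correlation between the simulated sample means and variances is of course only consistent with, not a proof of, independence).
# For Normal(mu, sigma^2) samples of size n: (n-1)S^2/sigma^2 should be
# chi-square(n-1), and the sample mean and variance should be unrelated.
set.seed(6)
n <- 10; mu <- 2; sigma <- 3; reps <- 1e5
x    <- matrix(rnorm(n * reps, mean = mu, sd = sigma), nrow = reps)
xbar <- rowMeans(x)
s2   <- apply(x, 1, var)
q <- c(3, 6, 9, 15)
round(cbind(empirical = ecdf((n - 1) * s2 / sigma^2)(q),
            chisq     = pchisq(q, df = n - 1)), 3)
round(cor(xbar, s2), 3)    # close to 0, consistent with independence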
exercises
Ex. 8.1.4. Let X1 , X2 , . . . , Xn be i.i.d. random variables having a common distribution function F : R →
[0, 1] and probability density function f : R → R. Let X(1) < X(2) < . . . < X(n) be the corresponding
order statistic. Show that for 1 ≤ i < j ≤ n, (X(i) , X(j ) ) has a joint density function given by
f_{X_{(i)}, X_{(j)}}(x, y) = \frac{n!}{(i-1)!\,(j-1-i)!\,(n-j)!}\, f(x) f(y)\, [F(x)]^{i-1}\, [F(y) - F(x)]^{j-1-i}\, [1 - F(y)]^{n-j},
for −∞ < x < y < ∞.
Ex. 8.1.5. Let X1, X2, ..., Xn be i.i.d. random variables having a common distribution X ∼ Uniform(0, 1). Let X(1) < X(2) < ... < X(n) be the corresponding order statistics. Show that \frac{X_{(1)}}{X_{(n)}} and X_{(n)} are independent random variables.
Ex. 8.1.6. Let {Ui : i ≥ 1} be a sequence of i.i.d. uniform (0, 1) random variables and let N ∼ Poisson(λ).
Find the distribution of V = min{U1 , U2 , . . . , UN +1 }.
Ex. 8.1.7. Let −∞ < a < b < ∞ and let X1, X2, ..., Xn be i.i.d. X ∼ Uniform(a, b). Find the probability density function of M = \frac{X_{(1)} + X_{(n)}}{2}.
Ex. 8.1.8. Let X1 , X2 be two independent standard normal random variables. Find the distribution of
Z = X(21) .
Ex. 8.1.9. Let X1 , X2 , . . . , Xn be i.i.d. Uniform (0, 1) random variables.
(a) Find the conditional distribution of X(n) | X(1) = x for some 0 < x < 1.
(b) Find E [X(n) | X(1) = x] and V ar [X(n) | X(1) = x].
Ex. 8.1.10. Suppose X is a symmetric continuous random variable. Let Y be a continuous random variable such that P(Y > 0) = 1. Show that \frac{X}{Y} is symmetric.
Ex. 8.1.11. Verify (8.1.3).
Ex. 8.1.12. Suppose X1, X2, ... are i.i.d. Cauchy(0, 1) random variables.
(a) Fix z ∈ R. Find a, b, c, d such that
\frac{1}{1 + x^2} \cdot \frac{1}{1 + (z - x)^2} = \frac{ax + b}{1 + x^2} + \frac{cx + d}{1 + (z - x)^2},
for all x ∈ R.
(b) Show that X1 + X2 ∼ Cauchy (0, 2).
(c) Use induction to show that X1 + X2 + . . . + Xn ∼ Cauchy (0, n).
(d) Use Lemma 5.3.2 to show that X̄_n ∼ Cauchy(0, 1).
Let X1, X2, ..., Xn be an i.i.d. random sample from a population with mean µ. Recall that we defined the sample mean X̄ = \frac{1}{n}\sum_{i=1}^{n} X_i and showed in Theorem 7.2.2 that E[X̄] = µ. We also discussed that X̄ could be considered as an estimate for µ. The result below makes this precise and is referred to as the weak law of large numbers. In the statement and proof of the Theorem below we shall denote X̄ by X̄_n to emphasise the dependence on n.
Theorem 8.2.1. (Weak Law of Large Numbers) Let X1, X2, ... be a sequence of i.i.d. random variables. Assume that X1 has finite mean µ and finite variance σ². Then for any ε > 0,
P(|\bar{X}_n - \mu| \ge \varepsilon) \to 0 \quad\text{as } n \to \infty.
Example 8.2.4. Let X1, X2, ..., Xn be i.i.d. random variables that are uniformly distributed over the interval (0, 1). We already know by the law of large numbers that X̄ converges to E(X1) = 1/2 in probability. Often we are interested in other functionals of the sample and their convergence properties. We illustrate one such example below.
Consider the n-th order statistic X(n) = max{X1, ..., Xn}. For any 0 < ε < 1,
P(|X_{(n)} - 1| \ge \varepsilon) = P(X_{(n)} \le 1 - \varepsilon) + P(X_{(n)} \ge 1 + \varepsilon) = P(X_{(n)} \le 1 - \varepsilon) + 0 = P\left(\cap_{i=1}^{n} \{X_i \le 1 - \varepsilon\}\right) = (1 - \varepsilon)^n.
Since (1 − ε)^n → 0 as n → ∞, X(n) converges to 1 in probability.
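The bound (1 − ε)^n goes to zero very quickly; the small R sketch below (the value of ε and the sample sizes are illustrative choices) tabulates it.
# P(|X_(n) - 1| >= eps) = (1 - eps)^n for Uniform(0,1) samples.
eps <- 0.05
n <- c(10, 50, 100, 500)
round(cbind(n = n, bound = (1 - eps)^n), 4)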
Consider now a sequence X1, X2, ... of i.i.d. random variables and an event A in their common range, and let p = P(X1 ∈ A). For n ≥ 1, define Yn = 1 if Xn ∈ A and Yn = 0 otherwise. Clearly the Yn are independent (as the Xn are) and, further, they are identically distributed since P(Yn = 1) = P(Xn ∈ A) = p. In particular, {Yn} is an i.i.d. Bernoulli(p) sequence of random variables. We readily observe (as done in Chapter 7) that
\bar{Y}_n = \frac{\sum_{i=1}^{n} Y_i}{n} = \frac{\#\{1 \le i \le n : X_i \in A\}}{n} = \hat{p}.
Hence the Weak Law of Large Numbers (applied to the sequence Yn) implies that the sample proportion p̂ converges to the true proportion p in probability. Consequently, as discussed earlier, this provides legitimacy to the relationship between probability and relative frequency.
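The convergence of the sample proportion can be seen directly in a small simulation; in the R sketch below the value of p, the sample sizes and the seed are illustrative choices.
# The proportion of successes in n Bernoulli(p) trials settles near p.
set.seed(7)
p <- 0.3
for (n in c(100, 1000, 10000)) {
  phat <- mean(rbinom(n, size = 1, prob = p))
  cat("n =", n, " sample proportion =", phat, "\n")
}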
exercises
Ex. 8.2.1. Let X, X1, X2, ..., Xn be i.i.d. random variables that are uniformly distributed over the interval (0, 1). Consider the first order statistic X(1) = min{X1, ..., Xn}. Show that X(1) converges to 0 in probability.
Ex. 8.2.2. Let X1, X2, ..., Xn, ... be i.i.d. random variables with finite mean and variance. Define
Y_n = \frac{2}{n(n+1)} \sum_{i=1}^{n} i X_i.
Show that Y_n \xrightarrow{p} E(X_1) as n → ∞.
When discussing a collection of random variables it makes sense to think of them as a sequence of objects, and as with any sequence in calculus we may ask whether the sequence converges in any way. We have already seen "convergence in probability" in the previous section. Here we will be interested in what is known as "convergence in distribution". This type of convergence plays a major role in understanding the limiting distribution of the sample mean (see the Central Limit Theorem, Theorem 8.4.1). We say that a sequence of random variables X1, X2, ... converges in distribution to a random variable X, and write
X_n \xrightarrow{d} X,
if F_{X_n}(x) \to F_X(x) as n → ∞ at every point x where F_X is continuous, where F_{X_n} and F_X denote the distribution functions of Xn and X respectively.
If X is the constant random variable for which P(X = 0) = 1, then X has distribution function
F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \ge 0. \end{cases}
It is not true that F_{X_n}(x) converges to F_X(x) at every x ∈ R, but the two agree at all points where F_X is continuous. Therefore the sequence X1, X2, ... converges in distribution to the constant random variable 0.
Note that this form of convergence does not generally guarantee that probabilities associated with X can
be derived as limits of probabilities associated with Xn . For instance, in the example above P (Xn = 0) = 0
for all n while P (X = 0) = 1. However, with a few additional assumptions a stronger claim may be made.
Theorem 8.3.3. Let fX1 , fX2 , . . . be the respective densities of continuous random variables X1 , X2 , . . . .
Suppose they converge in distribution to a continuous random variable X with density fX . Then for every
interval A we have P (Xn ∈ A) → P (X ∈ A).
Proof - Since X is a continuous random variable FX (x) is the integral of a density, and thus a continuous
function. Therefore convergence in distribution guarantees that FXn (x) converges to FX (x) everywhere.
Let A = (a, b) (and note that whether or not endpoints are included does not matter since all random
variables are taken to be continuous). Then
P(X_n \in A) = \int_a^b f_{X_n}(x)\, dx = F_{X_n}(b) - F_{X_n}(a) \to F_X(b) - F_X(a) = \int_a^b f_X(x)\, dx = P(X \in A).
The second theorem about moment generating functions that we will state, but leave unproven, is the
following:
Theorem 8.3.4. (M.G.F. Convergence Theorem) If X1 , X2 , . . . are a sequence of random variables whose
moment generating functions Mn (t) exist in an interval containing zero, and if Mn (t) → M (t) on that
interval where M (t) is the moment generating function of a random variable X, then Xn converges to X in
distribution.
To illustrate the use of this fact, consider an alternate proof of the limiting relationship between binomial
and Poisson random variables (See Theorem 2.2.2).
Example 8.3.5. Let X ∼ Poisson(λ) and let Xn ∼ Binomial(n, λ/n). Then Xn converges in distribution to X.
The moment generating function of a binomial variable was already computed in Example 6.3.7.
Therefore,
M_{X_n}(t) = \left(\frac{\lambda}{n}\, e^{t} + \left(1 - \frac{\lambda}{n}\right)\right)^{n} = \left(1 + \frac{\lambda(e^{t} - 1)}{n}\right)^{n}.
On the other hand,
M_X(t) = E[e^{tX}] = \sum_{j=0}^{\infty} e^{tj}\, P(X = j) = \sum_{j=0}^{\infty} e^{tj}\, \frac{\lambda^j e^{-\lambda}}{j!} = e^{\lambda e^t} \cdot e^{-\lambda} \cdot \sum_{j=0}^{\infty} \frac{(\lambda e^t)^j\, e^{-\lambda e^t}}{j!} = e^{\lambda(e^t - 1)},
where the series equals 1 since it is simply the sum of the probabilities of a Poisson(λe^t) random variable.
Since M_{X_n}(t) → M_X(t) (by Exercise 8.4.4), the m.g.f. convergence theorem (Theorem 8.3.4) implies that Xn converges in distribution to X. That is, Binomial(n, p) random variables converge in distribution to a Poisson(λ) distribution when p = λ/n and n → ∞.
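The convergence can also be seen numerically by comparing the two probability mass functions in R; the values of λ, n and k below are illustrative choices.
# Binomial(n, lambda/n) probabilities approach Poisson(lambda)
# probabilities as n grows.
lambda <- 4
k <- 0:8
round(rbind(binom_n50   = dbinom(k, size = 50,   prob = lambda / 50),
            binom_n5000 = dbinom(k, size = 5000, prob = lambda / 5000),
            poisson     = dpois(k, lambda)), 4)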
exercises
for all continuity points of FX : R → [0, 1] with FXn , FX being the distribution functions of Xn and X
respectively.
Ex. 8.3.2. Let Xn have the t-distribution with n degrees of freedom. Show that X_n \xrightarrow{d} X, where X has the standard Normal distribution.
Ex. 8.3.3. Let X_n \xrightarrow{d} X. Show that X_n^2 \xrightarrow{d} X^2.
Let n ≥ 1 and X1, X2, ..., Xn be an i.i.d. random sample from a population with mean µ and variance σ². Consider the sample mean
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.
As observed in Theorem 7.2.2, E(X̄) = µ and SD(X̄) = σ/√n. As discussed before, we might view this information as saying that X̄ is typically close to µ, up to an error of σ/√n, with high probability. As n → ∞, σ/√n → 0, and this indicates that X̄ approaches µ. We have already verified that X̄ converges in probability to µ courtesy of the weak law of large numbers (in fact, it converges with probability 1 by the strong law of large numbers).
To get a better understanding of the limiting distribution of X̄ we standardise it and consider
Y_n = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma}.
Finding the probabilities of events connected with Yn for each n exactly may not be possible in all cases
but one can find good approximate values. It turns out that for a large class of random variables the
distribution of Yn is close to that of the standard Normal random variable particularly for large n. This
remarkable fact is referred to as the Central Limit Theorem and we prove it next.
As done earlier, in the statement and proof of the Theorem below we shall denote X̄ by X̄_n to emphasise the dependence on n.
Theorem 8.4.1. (Central Limit Theorem) Let X1, X2, ... be i.i.d. random variables with finite mean µ, finite variance σ², and possessing a common moment generating function MX(·). Then
\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} Z, \qquad (8.4.1)
where Z ∼ Normal(0, 1).
Proof- Let Y_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}. We will verify that
\lim_{n \to \infty} M_{Y_n}(t) = e^{t^2/2}.
Now, using the definition of the moment generating function and some elementary algebra we have
M_{Y_n}(t) = E[\exp(t Y_n)] = E\!\left[\exp\!\left(t\, \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}\right)\right] = E\!\left[\exp\!\left(\frac{t\sqrt{n}}{\sigma}\left(\frac{\sum_{i=1}^{n} X_i}{n} - \mu\right)\right)\right] = E\!\left[\exp\!\left(\sum_{i=1}^{n} \frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right)\right] = E\!\left[\prod_{i=1}^{n} \exp\!\left(\frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right)\right]. \qquad (8.4.2)
Since X1, X2, ..., Xn are independent, the random variables \exp\!\left(\frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right), 1 ≤ i ≤ n, are also independent. From Exercises 7.2.2 and 7.2.3, they also have the same distribution. So, from the calculation in (8.4.2) and using Exercise 6.3.4 inductively, we have
M_{Y_n}(t) = E\!\left[\prod_{i=1}^{n} \exp\!\left(\frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right)\right] = \prod_{i=1}^{n} E\!\left[\exp\!\left(\frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right)\right] \quad\text{(using Theorem 6.3.9(a))}
= \left(E\!\left[\exp\!\left(\frac{t}{\sigma\sqrt{n}}(X_1 - \mu)\right)\right]\right)^{n}. \qquad (8.4.3)
Let U = \frac{X_1 - \mu}{\sigma}. As E(U) = 0 and E(U^2) = 1, we have M_U'(0) = 0 and M_U''(0) = 1. From Exercise 8.4.5, we have that for t ∈ R,
M_U(t) = 1 + \frac{t^2}{2} + g(t), \qquad (8.4.4)
where \lim_{s \to 0} g(s)/s^2 = 0. Therefore, from (8.4.3) and (8.4.4) we have
M_{Y_n}(t) = \left[M_U\!\left(\frac{t}{\sqrt{n}}\right)\right]^{n} = \left[1 + \frac{t^2}{2n} + g\!\left(\frac{t}{\sqrt{n}}\right)\right]^{n} = \left[1 + \frac{1}{n}\left(\frac{t^2}{2} + n\, g\!\left(\frac{t}{\sqrt{n}}\right)\right)\right]^{n}.
Using the fact that \frac{t^2}{2} + n\, g\!\left(\frac{t}{\sqrt{n}}\right) \to \frac{t^2}{2} and Exercise 8.4.4, it follows that
\lim_{n \to \infty} M_{Y_n}(t) = e^{t^2/2}.
Since e^{t^2/2} is the moment generating function of a standard Normal random variable, Theorem 8.3.4 implies that Yn converges in distribution to Z ∼ Normal(0, 1). This completes the proof. An equivalent restatement of the Central Limit Theorem is
\frac{S_n - n\mu}{\sqrt{n}\,\sigma} \xrightarrow{d} Z, \qquad (8.4.5)
where S_n = \sum_{i=1}^{n} X_i.
A typical application of the central limit theorem is to find approximate values of probabilities of events related to Sn or X̄. For instance, suppose we were interested in calculating P(a < Sn ≤ b), for a, b ∈ R, when n is large. We would proceed in the following way. We know from (8.4.5) that
P\!\left(\frac{S_n - n\mu}{\sqrt{n}\,\sigma} \le x\right) \to P(Z \le x) \qquad (8.4.6)
as n → ∞ for all x ∈ R.
Hence
P(a < S_n \le b) = P\!\left(\frac{a - n\mu}{\sqrt{n}\,\sigma} < \frac{S_n - n\mu}{\sqrt{n}\,\sigma} \le \frac{b - n\mu}{\sqrt{n}\,\sigma}\right) = P\!\left(\frac{S_n - n\mu}{\sqrt{n}\,\sigma} \le \frac{b - n\mu}{\sqrt{n}\,\sigma}\right) - P\!\left(\frac{S_n - n\mu}{\sqrt{n}\,\sigma} \le \frac{a - n\mu}{\sqrt{n}\,\sigma}\right)
\approx P\!\left(Z \le \frac{b - n\mu}{\sqrt{n}\,\sigma}\right) - P\!\left(Z \le \frac{a - n\mu}{\sqrt{n}\,\sigma}\right) \quad\text{(from (8.4.6), for large enough } n\text{)}
= P\!\left(\frac{a - n\mu}{\sqrt{n}\,\sigma} < Z \le \frac{b - n\mu}{\sqrt{n}\,\sigma}\right),
where we have used the notation ≈ to indicate that the right hand side is an approximation. Therefore we would conclude that for large n,
P(a < S_n \le b) \approx P\!\left(\frac{a - n\mu}{\sqrt{n}\,\sigma} < Z \le \frac{b - n\mu}{\sqrt{n}\,\sigma}\right). \qquad (8.4.7)
We would then use the R function pnorm() or Normal Tables (See Table D.2) to compute the right hand
side.
A similar computation would also yield
P(a < \bar{X} \le b) \approx P\!\left(\frac{\sqrt{n}(a - \mu)}{\sigma} < Z \le \frac{\sqrt{n}(b - \mu)}{\sigma}\right). \qquad (8.4.8)
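The quality of the approximation (8.4.8) can be examined by simulation; in the R sketch below the Exponential(1) population, the interval (a, b], the sample size and the seed are illustrative choices, not taken from the text.
# Comparing the normal approximation (8.4.8) for the sample mean with a
# simulated probability, for an Exponential(1) population (mu = sigma = 1).
set.seed(8)
n <- 100; mu <- 1; sigma <- 1; a <- 0.9; b <- 1.1; reps <- 2e4
xbar <- replicate(reps, mean(rexp(n, rate = 1)))
simulated <- mean(a < xbar & xbar <= b)
clt_approx <- pnorm(sqrt(n) * (b - mu) / sigma) - pnorm(sqrt(n) * (a - mu) / sigma)
round(c(simulated = simulated, clt_approx = clt_approx), 4)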
Example 8.4.3. Let Y be a random variable distributed as Gamma(100, 4). Suppose we were interested in finding P(20 < Y ≤ 30). If X1, X2, ..., X100 are independent Exponential(4) random variables, then Y and S100 = \sum_{i=1}^{100} X_i have the same distribution. Therefore, applying (8.4.7) with n = 100, µ = E(X1) = 1/4 and σ = SD(X1) = 1/4, so that nµ = 25 and \sqrt{n}\,σ = 2.5, we have
P(20 < Y \le 30) \approx P\!\left(\frac{20 - 25}{2.5} < Z \le \frac{30 - 25}{2.5}\right) = P(-2 < Z \le 2) = 2 P(Z \le 2) - 1.
Looking up Table D.2, we see that this value comes out to be approximately 2 × 0.9772 − 1 = 0.9544. A
more precise answer is given by R as
> 2 * pnorm(2) - 1
[1] 0.9544997
Using R, we can also compare this with the exact probability that we are approximating.
> pgamma(30, 100, 4) - pgamma(20, 100, 4)
[1] 0.9550279
Continuity Correction: Suppose X1, X2, X3, ... are all integer-valued random variables. Then S_n = \sum_{i=1}^{n} X_i is also an integer-valued random variable. Now,
P(S_n = k) = P(k - h < S_n \le k + h)
for all natural numbers k and 0 < h < 1. However, it is easy to see that two distinct values of h will lead to two different answers if we use the Normal approximation provided by the Central Limit Theorem, and that the approximate value increases with h. So it is customary to use h = 1/2 while computing such probabilities using the Normal approximation. So when X1, X2, X3, ... are all integer-valued random variables we use, for integers k,
P(S_n \le k) \approx P\!\left(Z \le \frac{k + \frac{1}{2} - n\mu}{\sqrt{n}\,\sigma}\right).
Example 8.4.4. Suppose Priya wishes to test whether a coin is fair. She tosses it 1000 times and decides to conclude that the coin is not fair if she observes at least 525 heads. Let Xi = 1 if the i-th toss results in a head and Xi = 0 otherwise, so that S_{1000} = \sum_{i=1}^{1000} X_i is the number of heads and p = P(X_1 = 1).
If the coin is fair, then p = 0.5 and E[X1] = 0.5, Var[X1] = 0.25, and therefore E[S1000] = 500 and SD[S1000] = \sqrt{250} = 15.8114. We want to approximate
P(S_{1000} \ge 525) = 1 - P(S_{1000} \le 524) \approx 1 - P\!\left(Z \le \frac{24}{15.8114}\right) = 1 - P(Z \le 1.52),
which can be computed using Table D.2 as 1 − 0.9357 = 0.0643, or using R as
> 1 - pnorm(24 / sqrt(250))
[1] 0.06452065
With the continuity correction, the approximate value would instead use z = 24.5/15.8114 = 1.55, giving
1 − 0.9394 = 0.0606 using Table D.2 or
> 1 - pnorm(24.5 / sqrt(250))
[1] 0.06062886
in R. We can also compute the exact probability that we are trying to approximate, namely P (S1000 ≥ 525),
in R as
> 1 - pbinom(524, 1000, 0.5)
[1] 0.06060713
As we can see, the continuity correction gives us a slightly better approximation. These calculations tell us that the probability of Priya's test reaching a false conclusion when the coin is fair is approximately 0.061. We shall examine the topic of Hypothesis testing, which is what Priya was attempting, in more detail in Chapter 9.
Example 8.4.5. We return to the Birthday problem. Suppose a small town has 1460 students. What is the probability that five or more students were born on independence day? Assume that birthrates are constant throughout the year and that each year has 365 days.
The probability that any given student was born on independence day is 1/365. So the exact probability that five or more students were born on independence day is
1 - \sum_{k=0}^{4} \binom{1460}{k} \left(\frac{1}{365}\right)^{k} \left(\frac{364}{365}\right)^{1460-k}.
In Example 2.2.1 we used the Poisson approximation with λ = 4 to estimate the above:
1 - \sum_{k=0}^{4} \binom{1460}{k} \left(\frac{1}{365}\right)^{k} \left(\frac{364}{365}\right)^{1460-k} \approx 1 - \left(e^{-4} + 4 e^{-4} + \frac{4^2}{2} e^{-4} + \frac{4^3}{6} e^{-4} + \frac{4^4}{24} e^{-4}\right) = 0.3711631.
We can do another approximation using the central limit theorem, which is typically called the normal approximation. For 1 ≤ i ≤ 1460, define
X_i = \begin{cases} 1 & \text{if the } i\text{-th person's birthday is on independence day} \\ 0 & \text{otherwise.} \end{cases}
Given the assumptions above on birthrates, we know the Xi are i.i.d. random variables distributed as Bernoulli(1/365). Note that S_{1460} = \sum_{i=1}^{1460} X_i is the number of people born on independence day, and we are interested in calculating
P(S_{1460} \ge 5).
Observe that E(X_1) = \frac{1}{365} and Var(X_1) = \frac{1}{365}\left(1 - \frac{1}{365}\right) = \frac{364}{365^2}. By the central limit theorem, we know that
P(S_{1460} \ge 5) = 1 - P(S_{1460} \le 4) = 1 - P(S_{1460} \le 4.5) \approx 1 - P\!\left(Z \le \frac{4.5 - 1460 \cdot \frac{1}{365}}{\sqrt{1460 \cdot \frac{364}{365^2}}}\right) = 1 - P\!\left(Z \le \frac{0.5}{1.9973}\right) \approx 0.401.
Recall from the calculations done in Example 2.2.1 that the exact answer for this problem is 0.3711629. So in this example the Poisson approximation seems to work better than the Normal approximation. This is due to the fact that more asymmetry in the underlying Bernoulli distribution worsens the normal approximation, just as it improves the Poisson approximation, as we saw in Figure 2.2.
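The three values discussed above (the exact binomial probability, the Poisson approximation and the normal approximation with continuity correction) can also be computed directly in R.
# Exact probability and the two approximations for P(S_1460 >= 5).
n <- 1460; p <- 1 / 365
exact   <- 1 - pbinom(4, size = n, prob = p)
poisson <- 1 - ppois(4, lambda = n * p)
normal  <- 1 - pnorm((4.5 - n * p) / sqrt(n * p * (1 - p)))
round(c(exact = exact, poisson = poisson, normal = normal), 4)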
exercises
Ex. 8.4.1. Suppose Sn is binomially distributed with parameters n = 200 and p = 0.3. Use the central limit theorem to find an approximation for P(99 ≤ Sn ≤ 101).
Ex. 8.4.2. Toss a fair coin 400 times. Use the central limit theorem to
(a) find an approximation for the probability of at most 190 heads.
(b) find an approximation for the probability of at least 70 heads.
(c) find an approximation for the probability of at least 120 heads.
(d) find an approximation for the probability that the number of heads is between 140 and 160.
Ex. 8.4.3. Suppose that the weight of an open packet of daal in a home is uniformly distributed between 200 and 600 gms. In a random survey of 64 homes, find the (approximate) probability that the total weight of the open packets is less than 25 kgs.
Ex. 8.4.4. Let {a_n}_{n≥1} be a sequence of real numbers such that a_n → a as n → ∞. Then
\lim_{n \to \infty} \left(1 + \frac{a_n}{n}\right)^{n} = e^{a}.
Ex. 8.4.5. Suppose U is a random variable (discrete or continuous) and M_U(t) = E(e^{tU}) exists for all t. Then show that
M_U(t) = 1 + t M_U'(0) + \frac{t^2}{2} M_U''(0) + g(t),
where \lim_{t \to 0} g(t)/t^2 = 0.
Ex. 8.4.6. Let {X_n}_{n≥1} be a sequence of i.i.d. random variables with X_1 ∼ Exponential(1). Find
\lim_{n \to \infty} P\!\left(\frac{n}{2} - \frac{\sqrt{n}}{2\sqrt{3}} \le \sum_{i=1}^{n} \left[1 - \exp(-X_i)\right] \le \frac{n}{2} + \frac{\sqrt{n}}{2\sqrt{3}}\right).
Ex. 8.4.7. Let a_n = \sum_{k=0}^{n} \frac{n^k}{k!}\, e^{-n}, n ≥ 1. Using the Central Limit Theorem, evaluate \lim_{n \to \infty} a_n.
Ex. 8.4.8. How many times should you toss a coin:
(a) to be at least 90% sure that your estimate of P(head) is within 0.1 of its true value?
(b) to be at least 90% sure that your estimate of P(head) is within 0.01 of its true value?
Ex. 8.4.9. To forecast the outcome of the election in which two parties are contesting, an internet poll via
Facebook is conducted. How many people should be surveyed to be at least 95% sure that the estimated
proportion is within 0.05 of the true value ?
Ex. 8.4.10. A medical study is conducted to estimate the proportion of people suffering from April allergies
in Bangalore. How many people should be surveyed to be at least 99% sure that the estimate is within 0.02
of the true value ?