Let n ≥ 1 and X1, X2, ..., Xn be an i.i.d. random sample from a population. Recall the sample mean
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.
In Chapter 3, while discussing discrete random variables, we considered a finite collection of random variables (X1, X2, ..., Xn). In Definition 3.2.7 we described how to define their joint distribution, and we used this to understand the multinomial distribution in Example 3.2.12.
In the continuous setting as well, there are many instances where it is relevant to study the joint distribution of a finite collection of random variables. Suppose X is a point chosen randomly in the unit ball in three dimensions. Then X has three coordinates, say X = (X1, X2, X3), where each Xi is a random variable taking values in (−1, 1). Moreover, they are dependent because we know that \sqrt{X_1^2 + X_2^2 + X_3^2} \le 1. It is therefore useful, and often necessary, to understand their "joint distribution". We have already seen the usefulness of the sample mean and sample variance, which are functions of X1, X2, ..., Xn. To understand the distributions of the sample mean and sample variance, the joint distribution of X1, X2, ..., Xn needs to be understood first. We define the joint distribution function first.
Definition 8.1.1. Let n ≥ 1 and X1 , X2 , . . . , Xn be random variables defined on the same probability space.
The joint distribution function F : Rn → [0, 1] is given by
F (x1 , x2 , . . . , xn ) = P (X1 ≤ x1 , X2 ≤ x2 , . . . , Xn ≤ xn ),
for x1 , x2 , . . . , xn ∈ R.
As in the single variable and two variable situations, the joint distribution function determines the entire joint distribution of X1, X2, ..., Xn. More precisely, if all the random variables were discrete, with Xi : S → Ti and Ti a countable subset of R for 1 ≤ i ≤ n, then from the joint distribution function one can determine
P(X_1 = t_1, X_2 = t_2, \ldots, X_n = t_n),
for all ti ∈ Ti , 1 ≤ i ≤ n. To understand the random variables in the continuous setting we need to set up
some notation.
Let n ≥ 1 and f : R^n → R be a non-negative function, piecewise-continuous in each variable, for which
\int_{\mathbb{R}^n} f(x_1, x_2, \ldots, x_n)\, dx_1 dx_2 \ldots dx_n = 1.
In this setting, the joint distribution of (X1, X2, ..., Xn) is determined by the joint density f. Using multivariable calculus, we can state and prove an analogue of Theorem 5.2.5 for random variables (X1, X2, ..., Xn) that have a joint density. In particular, since the joint densities are assumed to be piecewise continuous, the corresponding distribution functions are piecewise differentiable. Further, the joint distribution of the continuous random variables (X1, X2, ..., Xn) is completely determined by their joint distribution function F. That is, if we know F(x1, x2, ..., xn) for all x1, x2, ..., xn ∈ R, we can use multivariable calculus to differentiate F and find f. Integrating this joint density over an event A, we can then calculate P((X1, X2, ..., Xn) ∈ A).
As in the n = 2 case, one can recover the marginal density of each Xi, for i between 1 and n, by integrating over the other variables. So the marginal density of Xi at a is given by
f_{X_i}(a) = \int_{\mathbb{R}^{n-1}} f(x_1, \ldots, x_{i-1}, a, x_{i+1}, \ldots, x_n)\, dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_n.
Further, for n ≥ 3, we can deduce the joint density of any sub-collection of m ≤ n of the random variables by integrating over the other variables. For instance, if we were interested in the joint density of (X1, X3, X7) we would obtain
f_{X_1, X_3, X_7}(a_1, a_3, a_7) = \int_{\mathbb{R}^{n-3}} f(a_1, x_2, a_3, x_4, x_5, x_6, a_7, x_8, \ldots, x_n)\, dx_2\, dx_4\, dx_5\, dx_6\, dx_8 \ldots dx_n.
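As a quick sanity check of the marginalisation formula, the following short R sketch integrates out one variable numerically. The joint density f(x, y) = x + y on the unit square is our own illustrative choice, not an example from the text; its marginal is f_X(a) = a + 1/2.
# Numerical check of the marginalisation formula for a simple joint density.
# The density f(x, y) = x + y on (0,1) x (0,1) is an illustrative choice.
f <- function(x, y) x + y
marginal_X <- function(a) integrate(function(y) f(a, y), lower = 0, upper = 1)$value
a <- 0.3
c(numerical = marginal_X(a), exact = a + 0.5)   # the two values agree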
Suppose X1, X2, ..., Xn are random variables defined on a single sample space S with joint density f : R^n → R. Let g : R^n → R be a function of n variables for which g(X1, X2, ..., Xn) is defined on the range of the Xj's. Let B be an event in the range of g. Then, following the proof of Theorem 3.3.5, we can show that
P\big(g(X_1, X_2, \ldots, X_n) \in B\big) = \int_{\{(x_1, \ldots, x_n)\,:\, g(x_1, \ldots, x_n) \in B\}} f(x_1, x_2, \ldots, x_n)\, dx_1 dx_2 \ldots dx_n.
The above provides an abstract method of finding the distribution of the random variable Y = g(X1, X2, ..., Xn), but it might be difficult to calculate it explicitly. For n = 1, we discussed this question in detail in Section 5.3, and for n = 2 we explored how to find the distributions of sums and ratios of independent random variables (see Section 5.5). In a few cases this method can be extended by induction on n, but in general it is not possible. In Appendix B, Section B.2, we discuss the Jacobian method of finding the joint density of the transformed random variables.
The notion of independence also extends to finite collections of continuous random variables, as in the discrete setting. As discussed in Definition 3.2.3, a finite collection of continuous random variables X1, X2, ..., Xn is mutually independent if the events (Xj ∈ Aj) are mutually independent for all events Aj in the ranges of the corresponding Xj. As proved for the n = 2 case in Theorem 5.4.7, we can similarly deduce that if (X1, X2, X3, ..., Xn) are mutually independent continuous random variables with marginal densities f_{X_i}, then their joint density is given by
f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i), \qquad (8.1.1)
for xi ∈ R and 1 ≤ i ≤ n. Further, for any finite sub-collection {X_{i_1}, X_{i_2}, ..., X_{i_m}} of the above independent random variables, their joint density is given by
f_{X_{i_1}, X_{i_2}, \ldots, X_{i_m}}(a_1, a_2, \ldots, a_m) = \prod_{j=1}^{m} f_{X_{i_j}}(a_j). \qquad (8.1.2)
Let n ≥ 1 and let X1, X2, ..., Xn be an i.i.d. random sample from a population. Let F be the common distribution function. Let the Xi's be arranged in increasing order of magnitude, denoted by
X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}.
These ordered values are called the order statistics of the sample X1, X2, ..., Xn. For 1 ≤ r ≤ n, X(r) is called the r-th order statistic. One can compute F(r), the distribution function of X(r), for 1 ≤ r ≤ n in terms of n and F. We have
F_{(1)}(x) = P(X_{(1)} \le x) = 1 - P(X_{(1)} > x) = 1 - P\left(\cap_{i=1}^{n} \{X_i > x\}\right) = 1 - \prod_{i=1}^{n} P(X_i > x) = 1 - \prod_{i=1}^{n} \left(1 - P(X_i \le x)\right) = 1 - (1 - F(x))^n,
and
F_{(n)}(x) = P(X_{(n)} \le x) = P\left(\cap_{i=1}^{n} \{X_i \le x\}\right) = \prod_{i=1}^{n} P(X_i \le x) = (F(x))^n,
for x ∈ R.
If the distribution function F has a probability density function f, then each X(r) has a probability density function f(r). This can be obtained by differentiating F(r) and is given by
f_{(r)}(x) = \frac{n!}{(r-1)!\,(n-r)!}\, [F(x)]^{r-1}\, [1 - F(x)]^{n-r}\, f(x). \qquad (8.1.3)
Example 8.1.3. Let n ≥ 1 and let X1, X2, ..., Xn be an i.i.d. random sample from a population whose common distribution F is that of an Exponential(λ) random variable. Then we know that
F(x) = \begin{cases} 0 & x < 0 \\ 1 - e^{-\lambda x} & x \ge 0. \end{cases}
Therefore, using (8.1.3) and substituting for F as above, the densities of the order statistics are given by
f_{(r)}(x) = \begin{cases} n\, (e^{-\lambda x})^{n-1}\, \lambda e^{-\lambda x} & r = 1 \\ n \lambda e^{-\lambda x} (1 - e^{-\lambda x})^{n-1} & r = n \\ \frac{n!}{(r-1)!(n-r)!}\, \lambda e^{-\lambda x} (1 - e^{-\lambda x})^{r-1} (e^{-\lambda x})^{n-r} & 1 < r < n, \end{cases}
which simplifies to
f_{(r)}(x) = \begin{cases} n \lambda e^{-n\lambda x} & r = 1 \\ n \lambda e^{-\lambda x} (1 - e^{-\lambda x})^{n-1} & r = n \\ \frac{\lambda\, n!}{(r-1)!(n-r)!} (1 - e^{-\lambda x})^{r-1} (e^{-\lambda x})^{n-r+1} & 1 < r < n, \end{cases}
for x > 0. We note from the above that X(1), i.e., the minimum of the exponentials, is an Exponential(nλ) random variable. However, the other order statistics are not exponentially distributed.
In many applications one is interested in the range of values a random variable X assumes. A method to understand this is to sample X1, X2, ..., Xn i.i.d. with the distribution of X and examine the range R = X(n) − X(1). Suppose X has a probability density function f : R → R and distribution function F : R → [0, 1]. As before, we can calculate the joint density of (X(1), X(n)) by first computing their joint distribution function, using the i.i.d. nature of the sample and the definition of the order statistics: for x < y,
F_{X_{(1)}, X_{(n)}}(x, y) = P(X_{(n)} \le y) - P(X_{(1)} > x,\, X_{(n)} \le y) = [F(y)]^n - [F(y) - F(x)]^n.
Differentiating partially in x and y, we see that the joint density of (X(1), X(n)) is given by
f_{X_{(1)}, X_{(n)}}(x, y) = \begin{cases} n(n-1)\, f(x)\, f(y)\, [F(y) - F(x)]^{n-2} & x < y \\ 0 & \text{otherwise.} \end{cases} \qquad (8.1.4)
To calculate the distribution of R, we compute its distribution function. For r ≤ 0, P(R ≤ r) = 0, and for r > 0, using the above joint density of (X(1), X(n)) we have
P(R \le r) = P(X_{(n)} \le X_{(1)} + r) = \int_{-\infty}^{\infty} \int_{0}^{r} f_{X_{(1)}, X_{(n)}}(x, z + x)\, dz\, dx = \int_{0}^{r} \int_{-\infty}^{\infty} f_{X_{(1)}, X_{(n)}}(x, z + x)\, dx\, dz,
where we have made the change of variable y = z + x in the second step and changed the order of integration in the last step. Differentiating the above, we conclude that R has density given by
f_R(r) = \begin{cases} \int_{-\infty}^{\infty} f_{X_{(1)}, X_{(n)}}(x, r + x)\, dx & \text{if } r > 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (8.1.5)
Example 8.1.4. Let X1, X2, ..., Xn be i.i.d. Uniform(0, 1). The probability density function and distribution function of a Uniform(0, 1) random variable are given by
f(x) = \begin{cases} 1 & \text{if } x \in (0, 1) \\ 0 & \text{otherwise,} \end{cases} \qquad\text{and}\qquad F(x) = \begin{cases} 0 & \text{if } x \le 0 \\ x & \text{if } 0 < x < 1 \\ 1 & \text{if } x \ge 1. \end{cases}
Substituting these into (8.1.3) and (8.1.5), we see that X_{(r)} ∼ Beta(r, n − r + 1) for 1 ≤ r ≤ n and the range R ∼ Beta(n − 1, 2).
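A simulation along the same lines can be used to check the Beta claims above; in the R sketch below the values of n, r and the seed are illustrative choices.
# Simulation check: for i.i.d. Uniform(0,1) samples, X_(r) should follow
# Beta(r, n - r + 1) and the range should follow Beta(n - 1, 2).
set.seed(2)
n <- 10; r <- 3; reps <- 1e5
u   <- matrix(runif(n * reps), nrow = reps)
xr  <- apply(u, 1, function(row) sort(row)[r])     # r-th order statistic
rng <- apply(u, 1, max) - apply(u, 1, min)         # range X_(n) - X_(1)
grid <- seq(0.1, 0.9, by = 0.2)
round(cbind(emp_xr  = ecdf(xr)(grid),  beta_xr  = pbeta(grid, r, n - r + 1),
            emp_rng = ecdf(rng)(grid), beta_rng = pbeta(grid, n - 1, 2)), 3)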
In general we can also describe the joint distribution of all the order statistics. Suppose we have an i.i.d. sample X1, X2, ..., Xn with common distribution that of X. If X has a probability density function f : R → R, then one can show that the order statistics (X(1), X(2), ..., X(n)) have a joint density h : R^n → R given by
h(u_1, u_2, \ldots, u_n) = \begin{cases} n!\, f(u_1) f(u_2) \cdots f(u_n) & u_1 < u_2 < \cdots < u_n \\ 0 & \text{otherwise.} \end{cases}
Intuitively this is clear: any ordering u1 < u2 < ... < un "has a probability" f(u1)f(u2)...f(un), each of the Xi can take any of the uk's, and the total number of possible orderings is n!. A formal proof uses the Jacobian method and will be discussed in Appendix B.
8.1.2 χ², F and t
The χ², F and t distributions arise naturally when considering functions of i.i.d. normal random variables (X1, X2, X3, ..., Xn) for n ≥ 1. They are also useful in Hypothesis testing. We discuss these via three examples.
Example 8.1.5. (Chi-Square) Let n ≥ 1 and (X1, X2, X3, ..., Xn) be a collection of independent Normal random variables with mean 0 and variance 1. Then the joint density is given by
f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i) = \frac{1}{(\sqrt{2\pi})^{n}}\, e^{-\sum_{i=1}^{n} x_i^2/2},
for xi ∈ R and 1 ≤ i ≤ n.
Let Z = \sum_{i=1}^{n} X_i^2. We shall find the distribution of Z in two steps. First, clearly the range of X_1^2 is non-negative. The distribution function of X_1^2 at z ≥ 0 is given by
F_1(z) = P(X_1^2 \le z) = P(-\sqrt{z} \le X_1 \le \sqrt{z}) = \int_{-\sqrt{z}}^{\sqrt{z}} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = \int_{0}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-u/2}\, u^{-1/2}\, du,
where we have used the symmetry of the integrand and the substitution u = x^2 in the last step.
Comparing the integrand with the Gamma(α, λ) density defined in Definition 5.5.5 and using Exercise 5.5.10, we see that X_1^2 is distributed as a Gamma(1/2, 1/2) random variable. Using the calculation done in Example 5.5.6 for n = 2 and by induction, we have that Z = \sum_{i=1}^{n} X_i^2 is a Gamma(n/2, 1/2) random variable. This distribution is referred to as Chi-Square with n degrees of freedom. We define it precisely next.
Definition 8.1.6. (Chi-Square with n degrees of freedom) A random variable X whose distribution is Gamma(n/2, 1/2) is said to have the Chi-square distribution with n degrees of freedom (i.e., the number of parameters). Gamma(n/2, 1/2) is denoted by χ²_n and, as discussed earlier, it has density given by
f(x) = \frac{2^{-n/2}}{\Gamma(\frac{n}{2})}\, x^{\frac{n}{2} - 1} e^{-x/2} = \begin{cases} \frac{2^{-n/2}}{(\frac{n}{2} - 1)!}\, x^{\frac{n}{2} - 1} e^{-x/2} & \text{when } n \text{ is even,} \\ \frac{2^{\frac{n}{2} - 1}\, (\frac{n-1}{2})!}{(n-1)!\, \sqrt{\pi}}\, x^{\frac{n}{2} - 1} e^{-x/2} & \text{when } n \text{ is odd,} \end{cases}
when x > 0.
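The identification of the χ²_n distribution with Gamma(n/2, 1/2) can be verified numerically; in the following R sketch the value of n and the evaluation points are illustrative choices.
# The sum of n squared standard normals should match both the
# Gamma(n/2, 1/2) and the chi-square(n) distribution functions.
set.seed(3)
n <- 4; reps <- 1e5
z <- matrix(rnorm(n * reps), nrow = reps)
s <- rowSums(z^2)
x <- c(1, 2, 5, 10)
round(cbind(empirical = ecdf(s)(x),
            gamma     = pgamma(x, shape = n / 2, rate = 1 / 2),
            chisq     = pchisq(x, df = n)), 3)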
We shall show in the next subsection that the sample variance from a Normal population is (up to scaling) a Chi-square random variable. In the next chapter we shall construct a test to make inferences about the variances of two populations. In that context we shall compare sample variances, and this is where the F distribution arises naturally.
Example 8.1.7. (F-distribution) Suppose X1, X2, ..., X_{n_1} is an i.i.d. random sample from a Normal population with mean 0 and variance σ_1², and Y1, Y2, ..., Y_{n_2} is an i.i.d. random sample from a Normal population with mean 0 and variance σ_2². We have already seen in Example 8.1.5 that U = \sum_{i=1}^{n_1} \left(\frac{X_i}{\sigma_1}\right)^2 is a χ²_{n_1} random variable and V = \sum_{i=1}^{n_2} \left(\frac{Y_i}{\sigma_2}\right)^2 is a χ²_{n_2} random variable. Further, U and V are independent. Let
Z = \frac{U/n_1}{V/n_2}, \qquad\text{and let}\qquad Y = \frac{n_1}{n_2}\, Z = \frac{U}{V}.
As done in Example 5.5.10, the density of Y for y > 0 is given by
f_Y(y) = \frac{\Gamma(\frac{n_1 + n_2}{2})}{\Gamma(\frac{n_1}{2})\, \Gamma(\frac{n_2}{2})}\, \frac{y^{\frac{n_1}{2} - 1}}{(1 + y)^{\frac{n_1 + n_2}{2}}}.
Therefore,
F_Z(z) = P(Z \le z) = P\!\left(Y \le \frac{n_1}{n_2} z\right) = \int_{-\infty}^{\frac{n_1}{n_2} z} \frac{\Gamma(\frac{n_1 + n_2}{2})}{\Gamma(\frac{n_1}{2})\, \Gamma(\frac{n_2}{2})}\, \frac{y^{\frac{n_1}{2} - 1}}{(1 + y)^{\frac{n_1 + n_2}{2}}}\, dy = \int_{-\infty}^{z} \frac{n_1}{n_2}\, \frac{\Gamma(\frac{n_1 + n_2}{2})}{\Gamma(\frac{n_1}{2})\, \Gamma(\frac{n_2}{2})}\, \frac{(\frac{n_1}{n_2} u)^{\frac{n_1}{2} - 1}}{(1 + \frac{n_1}{n_2} u)^{\frac{n_1 + n_2}{2}}}\, du,
making the substitution u = \frac{n_2}{n_1} y in the last step (the integrand is understood to be 0 for negative arguments). Z is said to have the F(n_1, n_2) distribution, a widely used distribution in statistics called the F-distribution.
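The construction of Z can be checked against R's built-in F distribution; the degrees of freedom and the seed in the sketch below are illustrative choices.
# Z = (U/n1) / (V/n2) for independent chi-square variables U and V
# should follow the F(n1, n2) distribution.
set.seed(4)
n1 <- 5; n2 <- 8; reps <- 1e5
U <- rchisq(reps, df = n1)
V <- rchisq(reps, df = n2)
Z <- (U / n1) / (V / n2)
q <- c(0.5, 1, 2, 4)
round(cbind(empirical = ecdf(Z)(q), pf = pf(q, n1, n2)), 3)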
The distribution of the ratio of sample mean and sample variance plays an important role in Hypothesis
testing. This forms the motivation for the next example where the t distribution arises naturally.
Example 8.1.8. (t-distribution) Let X1 be a Normal random variable with mean 0 and variance 1, and let X2 be an independent χ²_n random variable. Let
Z = \frac{X_1}{\sqrt{X_2/n}}.
We wish to find the density of Z. Observe that U = Z^2 is given by \frac{X_1^2}{X_2/n}. Now, X_1^2 has the χ²_1 distribution (see Example 8.1.5). So, applying Example 8.1.7 with n1 = 1 and n2 = n, we find that U has the F(1, n) distribution. The density of U is given by
f_U(u) = \frac{1}{n}\, \frac{\Gamma(\frac{n+1}{2})}{\Gamma(\frac{1}{2})\, \Gamma(\frac{n}{2})}\, \frac{(\frac{u}{n})^{\frac{1}{2} - 1}}{(1 + \frac{u}{n})^{\frac{n+1}{2}}} = \frac{\Gamma(\frac{n+1}{2})}{\sqrt{n\pi}\, \Gamma(\frac{n}{2})}\, \frac{u^{-\frac{1}{2}}}{(1 + \frac{u}{n})^{\frac{n+1}{2}}},
for u > 0. On the other hand, since Z is a symmetric random variable (see Exercise 8.1.10), for u > 0,
P(U \le u) = P(Z^2 \le u) = P(-\sqrt{u} \le Z \le \sqrt{u}) = P(Z \le \sqrt{u}) - P(Z \le -\sqrt{u}) = P(Z \le \sqrt{u}) - P(Z \ge \sqrt{u}) = 2 P(Z \le \sqrt{u}) - 1.
Differentiating both sides in u gives
f_U(u) = \frac{1}{\sqrt{u}}\, f_Z(\sqrt{u}), \qquad\text{and hence}\qquad f_Z(z) = |z|\, f_U(z^2).
Therefore
f_Z(z) = |z|\, \frac{\Gamma(\frac{n+1}{2})}{\sqrt{n\pi}\, \Gamma(\frac{n}{2})}\, \frac{(z^2)^{-\frac{1}{2}}}{(1 + \frac{z^2}{n})^{\frac{n+1}{2}}} = \frac{\Gamma(\frac{n+1}{2})}{\sqrt{n\pi}\, \Gamma(\frac{n}{2})} \left(1 + \frac{z^2}{n}\right)^{-\frac{n+1}{2}},
for z ∈ R. Z is said to have the t-distribution with n degrees of freedom. We will denote this by the notation Z ∼ tn.
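Similarly, the construction of Z in this example can be compared with R's t distribution; the value of n and the seed below are illustrative choices.
# X1 / sqrt(X2 / n), with X1 standard normal and X2 an independent
# chi-square(n) variable, should follow the t distribution with n
# degrees of freedom.
set.seed(5)
n <- 6; reps <- 1e5
Z <- rnorm(reps) / sqrt(rchisq(reps, df = n) / n)
q <- c(-2, -1, 0, 1, 2)
round(cbind(empirical = ecdf(Z)(q), pt = pt(q, df = n)), 3)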
Let n ≥ 1 and X1, X2, ..., Xn be an i.i.d. random sample from a population having mean µ and variance σ². Consider the sample mean
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
and sample variance
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.
We have already seen in Theorem 7.2.2 that E [X ] = µ and in Theorem 7.2.4 that E [S 2 ] = σ 2 . It turns out
that it is not easy to understand the precise distribution of X or S 2 in general. However, this can be done
when the population is normally distributed. The main result of this section is the following: if X1, X2, ..., Xn is an i.i.d. random sample from a Normal population with mean µ and variance σ², then
(a) X̄ is a Normal(µ, σ²/n) random variable;
(b) \frac{(n-1)}{\sigma^2} S^2 has the χ²_{n-1} distribution; and
(c) X̄ and S² are independent random variables.
Proof - (a) follows from Theorem 6.3.13. The typical proof of (b) and (c) is via Helmert's transformation (see [Rao73]) and requires some knowledge of Linear Algebra. Here we will follow Kruskal's proof as illustrated in [Stig84]. The proof is by induction on the sample size n. For implementing the inductive step, we shall replace X̄ and S² with X̄_n and S²_n for the rest of the proof.
For n = 2, note that
\bar{X}_2 = \frac{X_1 + X_2}{2} \qquad\text{and}\qquad S_2^2 = \left(X_1 - \frac{X_1 + X_2}{2}\right)^2 + \left(X_2 - \frac{X_1 + X_2}{2}\right)^2 = \frac{(X_1 - X_2)^2}{2}. \qquad (8.1.6)
(a) Follows from Theorem 6.3.13.
(b) As X1 and X2 are independent Normal random variables with mean µ and variance σ², by Theorem 6.3.13, \frac{X_1 - X_2}{\sigma\sqrt{2}} is a Normal random variable with mean 0 and variance 1. Using Example 8.1.5, we know that S_2^2/\sigma^2 has the χ²_1 distribution, and this proves (b).
(c) From (8.1.6), X̄_2 is a function of X1 + X2 and S²_2 is a function of X1 − X2. Theorem 8.1.2 will imply that X̄_2 and S²_2 are independent if we show that X1 + X2 and X1 − X2 are independent. Let α, β ∈ R. Then, using Theorem 6.3.13 again, we have that α(X1 + X2) + β(X1 − X2) = (α + β)X1 + (α − β)X2 is normally distributed. As this is true for any α, β ∈ R, by Definition 6.4.1 (X1 + X2, X1 − X2) is a bivariate normal random variable. Using Theorem 6.2.2 (f) and (g), along with the fact that X1 and X2 are independent Normal random variables with mean µ and variance σ², we have
Cov[X_1 + X_2,\, X_1 - X_2] = Var[X_1] - Var[X_2] = \sigma^2 - \sigma^2 = 0.
From Theorem 6.4.3 we conclude that X1 + X2 and X1 − X2 are independent, which proves (c) for n = 2.
Now let k ≥ 2 and suppose that (a), (b) and (c) hold for samples of size k; we prove them for samples of size k + 1. Observe that
\bar{X}_{k+1} = \frac{k \bar{X}_k + X_{k+1}}{k+1} \qquad\text{and}\qquad \frac{k}{\sigma^2} S_{k+1}^2 = \frac{k-1}{\sigma^2} S_k^2 + \frac{k}{\sigma^2 (k+1)} (X_{k+1} - \bar{X}_k)^2. \qquad (8.1.8)
As
\sqrt{\frac{k}{(k+1)\sigma^2}}\, (X_{k+1} - \bar{X}_k) = \sqrt{\frac{k}{(k+1)\sigma^2}}\, X_{k+1} - \frac{1}{k} \sqrt{\frac{k}{(k+1)\sigma^2}}\, \sum_{i=1}^{k} X_i,
a routine calculation using Theorem 6.3.13 shows that \sqrt{\frac{k}{(k+1)\sigma^2}}\,(X_{k+1} - \bar{X}_k) is a standard normal random variable.
By the induction hypothesis, X̄_k and \frac{k-1}{\sigma^2} S_k^2 are independent. Since X1, ..., Xk, X_{k+1} are mutually independent, Theorem 8.1.2 implies that X_{k+1} is independent of X̄_k and \frac{k-1}{\sigma^2} S_k^2. Therefore,
\bar{X}_k,\ \frac{k-1}{\sigma^2} S_k^2,\ X_{k+1} \ \text{are mutually independent random variables.} \qquad (8.1.9)
Consequently, another application of Theorem 8.1.2 implies that \frac{k}{\sigma^2 (k+1)} (X_{k+1} - \bar{X}_k)^2 and \frac{k-1}{\sigma^2} S_k^2 are independent random variables. By the induction hypothesis \frac{k-1}{\sigma^2} S_k^2 has the χ²_{k-1} distribution, and by the calculation above \frac{k}{\sigma^2(k+1)} (X_{k+1} - \bar{X}_k)^2 is the square of a standard normal random variable and hence has the χ²_1 distribution (Example 8.1.5). Since, by (8.1.8), \frac{k}{\sigma^2} S_{k+1}^2 is the sum of these two independent random variables, it has the χ²_k distribution. This proves (b).
(c) To prove (c), it is enough to show that X̄_{k+1} and X_{k+1} − X̄_k are independent. The reason is the following:
(i) Theorem 8.1.2 then implies that X̄_{k+1} is independent of \frac{k}{\sigma^2(k+1)} (X_{k+1} - \bar{X}_k)^2;
(ii) X̄_{k+1} is a function of X_{k+1} and X̄_k. So (8.1.9) and Theorem 8.1.2 imply that X̄_{k+1} is independent of \frac{k-1}{\sigma^2} S_k^2, and also that \frac{k}{\sigma^2(k+1)} (X_{k+1} - \bar{X}_k)^2 is independent of \frac{k-1}{\sigma^2} S_k^2;
(iii) using (i) and (ii) we can conclude that X̄_{k+1}, \frac{k-1}{\sigma^2} S_k^2, and \frac{k}{\sigma^2(k+1)} (X_{k+1} - \bar{X}_k)^2 are mutually independent; and
(iv) finally, S^2_{k+1} is a function of \frac{k-1}{\sigma^2} S_k^2 and \frac{k}{\sigma^2(k+1)} (X_{k+1} - \bar{X}_k)^2 by (8.1.8). Then (iii) and Theorem 8.1.2 imply that S^2_{k+1} and X̄_{k+1} are independent.
Let α, β ∈ R. We have
\alpha \bar{X}_{k+1} + \beta (X_{k+1} - \bar{X}_k) = \left(\frac{\alpha}{k+1} - \frac{\beta}{k}\right) \sum_{i=1}^{k} X_i + \left(\frac{\alpha}{k+1} + \beta\right) X_{k+1}.
Theorem 6.3.13 will imply that α X̄_{k+1} + β(X_{k+1} − X̄_k) is a normally distributed random variable for any α, β ∈ R. So, by Definition 6.4.1, (X̄_{k+1}, X_{k+1} − X̄_k) is a bivariate normal random variable. Further, from Theorem 6.2.2 (f) and (g), we have
Cov[\bar{X}_{k+1},\, X_{k+1} - \bar{X}_k] = Cov\!\left[\frac{k \bar{X}_k + X_{k+1}}{k+1},\, X_{k+1} - \bar{X}_k\right] = \frac{1}{k+1} Var[X_{k+1}] + \frac{k-1}{k+1} Cov[\bar{X}_k, X_{k+1}] - \frac{k}{k+1} Var[\bar{X}_k] = \frac{\sigma^2}{k+1} + 0 - \frac{k}{k+1} \cdot \frac{\sigma^2}{k} = 0,
where we have used (8.1.9) in the last line. From Theorem 6.4.3 we conclude that X̄_{k+1} and X_{k+1} − X̄_k are independent. This proves (c) and completes the induction.
The following Corollary will be used in Chapter 9: with X1, X2, ..., Xn an i.i.d. random sample from a Normal population with mean µ and variance σ², the random variable
\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}
is a standard Normal random variable, the random variable
\frac{(n-1)}{\sigma^2}\, S^2
has the χ²_{n-1} distribution, and the two are independent.
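These facts can be illustrated by simulation; in the R sketch below the population parameters, sample size and seed are illustrative choices (a near-zero correlation between the simulated sample means and variances is of course only consistent with, not a proof of, independence).
# For Normal(mu, sigma^2) samples of size n: (n-1)S^2/sigma^2 should be
# chi-square(n-1), and the sample mean and variance should be unrelated.
set.seed(6)
n <- 10; mu <- 2; sigma <- 3; reps <- 1e5
x    <- matrix(rnorm(n * reps, mean = mu, sd = sigma), nrow = reps)
xbar <- rowMeans(x)
s2   <- apply(x, 1, var)
q <- c(3, 6, 9, 15)
round(cbind(empirical = ecdf((n - 1) * s2 / sigma^2)(q),
            chisq     = pchisq(q, df = n - 1)), 3)
round(cor(xbar, s2), 3)    # close to 0, consistent with independence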
exercises
Ex. 8.1.4. Let X1 , X2 , . . . , Xn be i.i.d. random variables having a common distribution function F : R →
[0, 1] and probability density function f : R → R. Let X(1) < X(2) < . . . < X(n) be the corresponding
order statistic. Show that for 1 ≤ i < j ≤ n, (X(i) , X(j ) ) has a joint density function given by
f_{X_{(i)}, X_{(j)}}(x, y) = \frac{n!}{(i-1)!\,(j-1-i)!\,(n-j)!}\, f(x) f(y)\, [F(x)]^{i-1}\, [F(y) - F(x)]^{j-1-i}\, [1 - F(y)]^{n-j},
for −∞ < x < y < ∞.
Ex. 8.1.5. Let X1, X2, ..., Xn be i.i.d. random variables having a common distribution X ∼ Uniform(0, 1). Let X(1) < X(2) < ... < X(n) be the corresponding order statistics. Show that \frac{X_{(1)}}{X_{(n)}} and X_{(n)} are independent random variables.
Ex. 8.1.6. Let {Ui : i ≥ 1} be a sequence of i.i.d. uniform (0, 1) random variables and let N ∼ Poisson(λ).
Find the distribution of V = min{U1 , U2 , . . . , UN +1 }.
Ex. 8.1.7. Let −∞ < a < b < ∞ and let X1, X2, ..., Xn be i.i.d. X ∼ Uniform(a, b). Find the probability density function of M = \frac{X_{(1)} + X_{(n)}}{2}.
Ex. 8.1.8. Let X1 , X2 be two independent standard normal random variables. Find the distribution of
Z = X(21) .
Ex. 8.1.9. Let X1 , X2 , . . . , Xn be i.i.d. Uniform (0, 1) random variables.
(a) Find the conditional distribution of X(n) | X(1) = x for some 0 < x < 1.
(b) Find E [X(n) | X(1) = x] and V ar [X(n) | X(1) = x].
Ex. 8.1.10. Suppose X is a symmetric continuous random variable. Let Y be a continuous random variable such that P(Y > 0) = 1. Show that \frac{X}{Y} is symmetric.
Ex. 8.1.11. Verify (8.1.3).
Ex. 8.1.12. Suppose X1, X2, ... are i.i.d. Cauchy(0, 1) random variables.
(a) Fix z ∈ R. Find a, b, c, d such that
\frac{1}{1 + x^2} \cdot \frac{1}{1 + (z - x)^2} = \frac{ax + b}{1 + x^2} + \frac{cx + d}{1 + (z - x)^2},
for all x ∈ R.
(b) Show that X1 + X2 ∼ Cauchy (0, 2).
(c) Use induction to show that X1 + X2 + . . . + Xn ∼ Cauchy (0, n).
(d) Use Lemma 5.3.2 to show that X̄_n ∼ Cauchy(0, 1).
Let X1, X2, ..., Xn be an i.i.d. random sample from a population with mean µ. Recall that we defined the sample mean X̄ = \frac{1}{n}\sum_{i=1}^{n} X_i and showed in Theorem 7.2.2 that E[X̄] = µ. We also discussed that X̄ could be considered as an estimate for µ. The result below makes this precise and is referred to as the weak law of large numbers. In the statement and proof of the Theorem below we shall denote X̄ by X̄_n to emphasise the dependence on n.
Theorem 8.2.1. (Weak Law of Large Numbers) Let X1, X2, ... be a sequence of i.i.d. random variables. Assume that X1 has finite mean µ and finite variance σ². Then for any ε > 0,
P(|\bar{X}_n - \mu| \ge \varepsilon) \to 0 \quad\text{as } n \to \infty.
Example 8.2.4. Let X1, X2, ..., Xn be i.i.d. random variables that are uniformly distributed over the interval (0, 1). We already know by the law of large numbers that X̄ converges to E(X1) = 1/2 in probability. Often we are interested in other functionals of the sample and their convergence properties. We illustrate one such example below.
Consider the n-th order statistic X(n) = max{X1, ..., Xn}. For any 0 < ε < 1,
P(|X_{(n)} - 1| \ge \varepsilon) = P(X_{(n)} \le 1 - \varepsilon) + P(X_{(n)} \ge 1 + \varepsilon) = P(X_{(n)} \le 1 - \varepsilon) + 0 = P\left(\cap_{i=1}^{n} \{X_i \le 1 - \varepsilon\}\right) = (1 - \varepsilon)^n.
Since (1 − ε)^n → 0 as n → ∞, X(n) converges to 1 in probability.
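The bound (1 − ε)^n goes to zero very quickly; the small R sketch below (the value of ε and the sample sizes are illustrative choices) tabulates it.
# P(|X_(n) - 1| >= eps) = (1 - eps)^n for Uniform(0,1) samples.
eps <- 0.05
n <- c(10, 50, 100, 500)
round(cbind(n = n, bound = (1 - eps)^n), 4)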
Consider now a sequence X1, X2, ... of i.i.d. random variables and an event A in their common range, and let p = P(X1 ∈ A). For n ≥ 1, define Yn = 1 if Xn ∈ A and Yn = 0 otherwise. Clearly the Yn are independent (as the Xn are) and, further, they are identically distributed since P(Yn = 1) = P(Xn ∈ A) = p. In particular, {Yn} is an i.i.d. Bernoulli(p) sequence of random variables. We readily observe (as done in Chapter 7) that
\bar{Y}_n = \frac{\sum_{i=1}^{n} Y_i}{n} = \frac{\#\{1 \le i \le n : X_i \in A\}}{n} = \hat{p}.
Hence the Weak Law of Large Numbers (applied to the sequence Yn) implies that the sample proportion p̂ converges to the true proportion p in probability. Consequently, as discussed earlier, this provides legitimacy to the relationship between probability and relative frequency.
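The convergence of the sample proportion can be seen directly in a small simulation; in the R sketch below the value of p, the sample sizes and the seed are illustrative choices.
# The proportion of successes in n Bernoulli(p) trials settles near p.
set.seed(7)
p <- 0.3
for (n in c(100, 1000, 10000)) {
  phat <- mean(rbinom(n, size = 1, prob = p))
  cat("n =", n, " sample proportion =", phat, "\n")
}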
exercises
Ex. 8.2.1. Let X, X1, X2, ..., Xn be i.i.d. random variables that are uniformly distributed over the interval (0, 1). Consider the first order statistic X(1) = min{X1, ..., Xn}. Show that X(1) converges to 0 in probability.
Ex. 8.2.2. Let X1, X2, ..., Xn, ... be i.i.d. random variables with finite mean and variance. Define
Y_n = \frac{2}{n(n+1)} \sum_{i=1}^{n} i X_i.
Show that Y_n \xrightarrow{p} E(X_1) as n → ∞.
When discussing a collection of random variables it makes sense to think of them as a sequence of objects, and as with any sequence in calculus we may ask whether the sequence converges in any way. We have already seen "convergence in probability" in the previous section. Here we will be interested in what is known as "convergence in distribution". This type of convergence plays a major role in understanding the limiting distribution of the sample mean (see the Central Limit Theorem, Theorem 8.4.1). We say that a sequence of random variables X1, X2, ... converges in distribution to a random variable X, and write
X_n \xrightarrow{d} X,
if F_{X_n}(x) \to F_X(x) as n → ∞ at every point x where F_X is continuous, where F_{X_n} and F_X denote the distribution functions of Xn and X respectively.
If X is the constant random variable for which P(X = 0) = 1, then X has distribution function
F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \ge 0. \end{cases}
It is not true that F_{X_n}(x) converges to F_X(x) at every x ∈ R, but the two agree at all points where F_X is continuous. Therefore the sequence X1, X2, ... converges in distribution to the constant random variable 0.
Note that this form of convergence does not generally guarantee that probabilities associated with X can
be derived as limits of probabilities associated with Xn . For instance, in the example above P (Xn = 0) = 0
for all n while P (X = 0) = 1. However, with a few additional assumptions a stronger claim may be made.
Theorem 8.3.3. Let fX1 , fX2 , . . . be the respective densities of continuous random variables X1 , X2 , . . . .
Suppose they converge in distribution to a continuous random variable X with density fX . Then for every
interval A we have P (Xn ∈ A) → P (X ∈ A).
Proof - Since X is a continuous random variable FX (x) is the integral of a density, and thus a continuous
function. Therefore convergence in distribution guarantees that FXn (x) converges to FX (x) everywhere.
Let A = (a, b) (and note that whether or not endpoints are included does not matter since all random
variables are taken to be continuous). Then
P(X_n \in A) = \int_a^b f_{X_n}(x)\, dx = F_{X_n}(b) - F_{X_n}(a) \to F_X(b) - F_X(a) = \int_a^b f_X(x)\, dx = P(X \in A).
The second theorem about moment generating functions that we will state, but leave unproven, is the
following:
Theorem 8.3.4. (M.G.F. Convergence Theorem) If X1 , X2 , . . . are a sequence of random variables whose
moment generating functions Mn (t) exist in an interval containing zero, and if Mn (t) → M (t) on that
interval where M (t) is the moment generating function of a random variable X, then Xn converges to X in
distribution.
To illustrate the use of this fact, consider an alternate proof of the limiting relationship between binomial
and Poisson random variables (See Theorem 2.2.2).
Example 8.3.5. Let X ∼ Poisson(λ) and let Xn ∼ Binomial(n, λ/n). Then Xn converges in distribution to X.
The moment generating function of a binomial variable was already computed in Example 6.3.7.
Therefore,
M_{X_n}(t) = \left(\frac{\lambda}{n}\, e^{t} + \left(1 - \frac{\lambda}{n}\right)\right)^{n} = \left(1 + \frac{\lambda(e^{t} - 1)}{n}\right)^{n}.
On the other hand,
M_X(t) = E[e^{tX}] = \sum_{j=0}^{\infty} e^{tj}\, P(X = j) = \sum_{j=0}^{\infty} e^{tj}\, \frac{\lambda^j e^{-\lambda}}{j!} = e^{\lambda e^t} \cdot e^{-\lambda} \cdot \sum_{j=0}^{\infty} \frac{(\lambda e^t)^j\, e^{-\lambda e^t}}{j!} = e^{\lambda(e^t - 1)},
where the series equals 1 since it is simply the sum of the probabilities of a Poisson(λe^t) random variable.
Since M_{X_n}(t) → M_X(t) (by Exercise 8.4.4), the m.g.f. convergence theorem (Theorem 8.3.4) implies that Xn converges in distribution to X. That is, Binomial(n, p) random variables converge in distribution to a Poisson(λ) distribution when p = λ/n and n → ∞.
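The convergence can also be seen numerically by comparing the two probability mass functions in R; the values of λ, n and k below are illustrative choices.
# Binomial(n, lambda/n) probabilities approach Poisson(lambda)
# probabilities as n grows.
lambda <- 4
k <- 0:8
round(rbind(binom_n50   = dbinom(k, size = 50,   prob = lambda / 50),
            binom_n5000 = dbinom(k, size = 5000, prob = lambda / 5000),
            poisson     = dpois(k, lambda)), 4)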
exercises
for all continuity points of FX : R → [0, 1] with FXn , FX being the distribution functions of Xn and X
respectively.
Ex. 8.3.2. Let Xn have the t-distribution with n degrees of freedom. Show that X_n \xrightarrow{d} X, where X has the standard Normal distribution.
Ex. 8.3.3. Let X_n \xrightarrow{d} X. Show that X_n^2 \xrightarrow{d} X^2.
Let n ≥ 1 and X1, X2, ..., Xn be an i.i.d. random sample from a population with mean µ and variance σ². Consider the sample mean
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.
As observed in Theorem 7.2.2, E(X̄) = µ and SD(X̄) = σ/√n. As discussed before, we might view this information as saying that X̄ is typically close to µ, up to an error of σ/√n, with high probability. As n → ∞, σ/√n → 0, and this indicates that X̄ approaches µ. We have already verified that X̄ converges in probability to µ courtesy of the weak law of large numbers (in fact, it converges with probability 1 by the strong law of large numbers).
To get a better understanding of the limiting distribution of X̄ we standardise it and consider
Y_n = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma}.
Finding the probabilities of events connected with Yn for each n exactly may not be possible in all cases
but one can find good approximate values. It turns out that for a large class of random variables the
distribution of Yn is close to that of the standard Normal random variable particularly for large n. This
remarkable fact is referred to as the Central Limit Theorem and we prove it next.
As done earlier, in the statement and proof of the Theorem below we shall denote X̄ by X̄_n to emphasise the dependence on n.
Theorem 8.4.1. (Central Limit Theorem) Let X1, X2, ... be i.i.d. random variables with finite mean µ, finite variance σ², and possessing a common moment generating function MX(·). Then
\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} Z, \qquad (8.4.1)
where Z ∼ Normal(0, 1).
Proof- Let Y_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}. We will verify that
\lim_{n \to \infty} M_{Y_n}(t) = e^{t^2/2}.
Now, using the definition of the moment generating function and some elementary algebra we have
M_{Y_n}(t) = E[\exp(t Y_n)] = E\!\left[\exp\!\left(t\, \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}\right)\right] = E\!\left[\exp\!\left(\frac{t\sqrt{n}}{\sigma}\left(\frac{\sum_{i=1}^{n} X_i}{n} - \mu\right)\right)\right] = E\!\left[\exp\!\left(\sum_{i=1}^{n} \frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right)\right] = E\!\left[\prod_{i=1}^{n} \exp\!\left(\frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right)\right]. \qquad (8.4.2)
Since X1, X2, ..., Xn are independent, the random variables \exp\!\left(\frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right), 1 ≤ i ≤ n, are also independent. From Exercises 7.2.2 and 7.2.3, they also have the same distribution. So, from the calculation in (8.4.2) and using Exercise 6.3.4 inductively, we have
M_{Y_n}(t) = E\!\left[\prod_{i=1}^{n} \exp\!\left(\frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right)\right] = \prod_{i=1}^{n} E\!\left[\exp\!\left(\frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right)\right] \quad\text{(using Theorem 6.3.9(a))}
= \left(E\!\left[\exp\!\left(\frac{t}{\sigma\sqrt{n}}(X_1 - \mu)\right)\right]\right)^{n}. \qquad (8.4.3)
Let U = \frac{X_1 - \mu}{\sigma}. As E(U) = 0 and E(U^2) = 1, we have M_U'(0) = 0 and M_U''(0) = 1. From Exercise 8.4.5, we have that for t ∈ R,
M_U(t) = 1 + \frac{t^2}{2} + g(t), \qquad (8.4.4)
where \lim_{s \to 0} g(s)/s^2 = 0. Therefore, from (8.4.3) and (8.4.4) we have
M_{Y_n}(t) = \left[M_U\!\left(\frac{t}{\sqrt{n}}\right)\right]^{n} = \left[1 + \frac{t^2}{2n} + g\!\left(\frac{t}{\sqrt{n}}\right)\right]^{n} = \left[1 + \frac{1}{n}\left(\frac{t^2}{2} + n\, g\!\left(\frac{t}{\sqrt{n}}\right)\right)\right]^{n}.
Using the fact that \frac{t^2}{2} + n\, g\!\left(\frac{t}{\sqrt{n}}\right) \to \frac{t^2}{2} and Exercise 8.4.4, it follows that
\lim_{n \to \infty} M_{Y_n}(t) = e^{t^2/2}.
Since e^{t^2/2} is the moment generating function of a standard Normal random variable, Theorem 8.3.4 implies that Yn converges in distribution to Z ∼ Normal(0, 1). This completes the proof. An equivalent restatement of the Central Limit Theorem is
\frac{S_n - n\mu}{\sqrt{n}\,\sigma} \xrightarrow{d} Z, \qquad (8.4.5)
where S_n = \sum_{i=1}^{n} X_i.
A typical application of the central limit theorem is to find approximate values of probabilities of events related to Sn or X̄. For instance, suppose we were interested in calculating P(a < Sn ≤ b), for a, b ∈ R, when n is large. We would proceed in the following way. We know from (8.4.5) that
P\!\left(\frac{S_n - n\mu}{\sqrt{n}\,\sigma} \le x\right) \to P(Z \le x) \qquad (8.4.6)
as n → ∞ for all x ∈ R.
Hence
P(a < S_n \le b) = P\!\left(\frac{a - n\mu}{\sqrt{n}\,\sigma} < \frac{S_n - n\mu}{\sqrt{n}\,\sigma} \le \frac{b - n\mu}{\sqrt{n}\,\sigma}\right) = P\!\left(\frac{S_n - n\mu}{\sqrt{n}\,\sigma} \le \frac{b - n\mu}{\sqrt{n}\,\sigma}\right) - P\!\left(\frac{S_n - n\mu}{\sqrt{n}\,\sigma} \le \frac{a - n\mu}{\sqrt{n}\,\sigma}\right)
\approx P\!\left(Z \le \frac{b - n\mu}{\sqrt{n}\,\sigma}\right) - P\!\left(Z \le \frac{a - n\mu}{\sqrt{n}\,\sigma}\right) \quad\text{(from (8.4.6), for large enough } n\text{)}
= P\!\left(\frac{a - n\mu}{\sqrt{n}\,\sigma} < Z \le \frac{b - n\mu}{\sqrt{n}\,\sigma}\right),
where we have used the notation ≈ to indicate that the right hand side is an approximation. Therefore we would conclude that for large n,
P(a < S_n \le b) \approx P\!\left(\frac{a - n\mu}{\sqrt{n}\,\sigma} < Z \le \frac{b - n\mu}{\sqrt{n}\,\sigma}\right). \qquad (8.4.7)
We would then use the R function pnorm() or Normal Tables (See Table D.2) to compute the right hand
side.
A similar computation would also yield
P(a < \bar{X} \le b) \approx P\!\left(\frac{\sqrt{n}(a - \mu)}{\sigma} < Z \le \frac{\sqrt{n}(b - \mu)}{\sigma}\right). \qquad (8.4.8)
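The quality of the approximation (8.4.8) can be examined by simulation; in the R sketch below the Exponential(1) population, the interval (a, b], the sample size and the seed are illustrative choices, not taken from the text.
# Comparing the normal approximation (8.4.8) for the sample mean with a
# simulated probability, for an Exponential(1) population (mu = sigma = 1).
set.seed(8)
n <- 100; mu <- 1; sigma <- 1; a <- 0.9; b <- 1.1; reps <- 2e4
xbar <- replicate(reps, mean(rexp(n, rate = 1)))
simulated <- mean(a < xbar & xbar <= b)
clt_approx <- pnorm(sqrt(n) * (b - mu) / sigma) - pnorm(sqrt(n) * (a - mu) / sigma)
round(c(simulated = simulated, clt_approx = clt_approx), 4)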
Example 8.4.3. Let Y be a random variable distributed as Gamma(100, 4). Suppose we were interested in finding P(20 < Y ≤ 30). If X1, X2, ..., X100 are independent Exponential(4) random variables, then Y and S100 = \sum_{i=1}^{100} X_i have the same distribution. Therefore, applying (8.4.7) with n = 100, µ = E(X1) = 1/4 and σ = SD(X1) = 1/4, so that nµ = 25 and \sqrt{n}\,σ = 2.5, we have
P(20 < Y \le 30) \approx P\!\left(\frac{20 - 25}{2.5} < Z \le \frac{30 - 25}{2.5}\right) = P(-2 < Z \le 2) = 2 P(Z \le 2) - 1.
Looking up Table D.2, we see that this value comes out to be approximately 2 × 0.9772 − 1 = 0.9544. A
more precise answer is given by R as
> 2 * pnorm(2) - 1
[1] 0.9544997
Using R, we can also compare this with the exact probability that we are approximating.
> pgamma(30, 100, 4) - pgamma(20, 100, 4)
[1] 0.9550279
Continuity Correction: Suppose X1, X2, X3, ... are all integer-valued random variables. Then S_n = \sum_{i=1}^{n} X_i is also an integer-valued random variable. Now,
P(S_n = k) = P(k - h < S_n \le k + h)
for all natural numbers k and 0 < h < 1. However, it is easy to see that two distinct values of h will lead to two different answers if we use the Normal approximation provided by the Central Limit Theorem, and that the approximate value increases with h. So it is customary to use h = 1/2 while computing such probabilities using the Normal approximation. So when X1, X2, X3, ... are all integer-valued random variables we use, for integers k,
P(S_n \le k) \approx P\!\left(Z \le \frac{k + \frac{1}{2} - n\mu}{\sqrt{n}\,\sigma}\right).
Example 8.4.4. Suppose Priya wishes to test whether a coin is fair. She tosses it 1000 times and decides to conclude that the coin is not fair if she observes at least 525 heads. Let Xi = 1 if the i-th toss results in a head and Xi = 0 otherwise, so that S_{1000} = \sum_{i=1}^{1000} X_i is the number of heads and p = P(X_1 = 1).
If the coin is fair, then p = 0.5 and E[X1] = 0.5, Var[X1] = 0.25, and therefore E[S1000] = 500 and SD[S1000] = \sqrt{250} = 15.8114. We want to approximate
P(S_{1000} \ge 525) = 1 - P(S_{1000} \le 524) \approx 1 - P\!\left(Z \le \frac{24}{15.8114}\right) = 1 - P(Z \le 1.52),
which can be computed using Table D.2 as 1 − 0.9357 = 0.0643, or using R as
> 1 - pnorm(24 / sqrt(250))
[1] 0.06452065
With the continuity correction, the approximate value would instead use z = 24.5/15.8114 = 1.55, giving
1 − 0.9394 = 0.0606 using Table D.2 or
> 1 - pnorm(24.5 / sqrt(250))
[1] 0.06062886
in R. We can also compute the exact probability that we are trying to approximate, namely P (S1000 ≥ 525),
in R as
> 1 - pbinom(524, 1000, 0.5)
[1] 0.06060713
As we can see, the continuity correction gives us a slightly better approximation. These calculations tell us that the probability of Priya's test reaching a false conclusion when the coin is fair is approximately 0.061. We shall examine the topic of Hypothesis testing, which is what Priya was attempting, in more detail in Chapter 9.
Example 8.4.5. We return to the Birthday problem. Suppose a small town has 1460 students. What is the probability that five or more students were born on independence day? Assume that birthrates are constant throughout the year and that each year has 365 days.
The probability that any given student was born on independence day is 1/365. So the exact probability that five or more students were born on independence day is
1 - \sum_{k=0}^{4} \binom{1460}{k} \left(\frac{1}{365}\right)^{k} \left(\frac{364}{365}\right)^{1460-k}.
In Example 2.2.1 we used the Poisson approximation with λ = 4 to estimate the above:
1 - \sum_{k=0}^{4} \binom{1460}{k} \left(\frac{1}{365}\right)^{k} \left(\frac{364}{365}\right)^{1460-k} \approx 1 - \left(e^{-4} + 4 e^{-4} + \frac{4^2}{2} e^{-4} + \frac{4^3}{6} e^{-4} + \frac{4^4}{24} e^{-4}\right) = 0.3711631.
We can do another approximation using the central limit theorem, which is typically called the normal approximation. For 1 ≤ i ≤ 1460, define
X_i = \begin{cases} 1 & \text{if the } i\text{-th person's birthday is on independence day} \\ 0 & \text{otherwise.} \end{cases}
Given the assumptions above on birthrates, we know the Xi are i.i.d. random variables distributed as Bernoulli(1/365). Note that S_{1460} = \sum_{i=1}^{1460} X_i is the number of people born on independence day, and we are interested in calculating
P(S_{1460} \ge 5).
Observe that E(X_1) = \frac{1}{365} and Var(X_1) = \frac{1}{365}\left(1 - \frac{1}{365}\right) = \frac{364}{365^2}. By the central limit theorem, we know that
P(S_{1460} \ge 5) = 1 - P(S_{1460} \le 4) = 1 - P(S_{1460} \le 4.5) \approx 1 - P\!\left(Z \le \frac{4.5 - 1460 \cdot \frac{1}{365}}{\sqrt{1460 \cdot \frac{364}{365^2}}}\right) = 1 - P\!\left(Z \le \frac{0.5}{1.9973}\right) \approx 0.401.
Recall from the calculations done in Example 2.2.1 that the exact answer for this problem is 0.3711629. So in this example the Poisson approximation seems to work better than the Normal approximation. This is due to the fact that more asymmetry in the underlying Bernoulli distribution worsens the normal approximation, just as it improves the Poisson approximation, as we saw in Figure 2.2.
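The three values discussed above (the exact binomial probability, the Poisson approximation and the normal approximation with continuity correction) can also be computed directly in R.
# Exact probability and the two approximations for P(S_1460 >= 5).
n <- 1460; p <- 1 / 365
exact   <- 1 - pbinom(4, size = n, prob = p)
poisson <- 1 - ppois(4, lambda = n * p)
normal  <- 1 - pnorm((4.5 - n * p) / sqrt(n * p * (1 - p)))
round(c(exact = exact, poisson = poisson, normal = normal), 4)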
exercises
Ex. 8.4.1. Suppose Sn is binomially distributed with parameters n = 200 and p = 0.3. Use the central limit theorem to find an approximation for P(99 ≤ Sn ≤ 101).
Ex. 8.4.2. Toss a fair coin 400 times. Use the central limit theorem to
(a) find an approximation for the probability of at most 190 heads.
(b) find an approximation for the probability of at least 70 heads.
(c) find an approximation for the probability of at least 120 heads.
(d) find an approximation for the probability that the number of heads is between 140 and 160.
Ex. 8.4.3. Suppose that the weight of an open packet of daal in a home is uniformly distributed between 200 and 600 gms. In a random survey of 64 homes, find the (approximate) probability that the total weight of the open packets is less than 25 kgs.
Ex. 8.4.4. Let {a_n}_{n≥1} be a sequence of real numbers such that a_n → a as n → ∞. Then
\lim_{n \to \infty} \left(1 + \frac{a_n}{n}\right)^{n} = e^{a}.
Ex. 8.4.5. Suppose U is a random variable (discrete or continuous) and M_U(t) = E(e^{tU}) exists for all t. Then show that
M_U(t) = 1 + t M_U'(0) + \frac{t^2}{2} M_U''(0) + g(t),
where \lim_{t \to 0} g(t)/t^2 = 0.
Ex. 8.4.6. Let {X_n}_{n≥1} be a sequence of i.i.d. random variables with X_1 ∼ Exponential(1). Find
\lim_{n \to \infty} P\!\left(\frac{n}{2} - \frac{\sqrt{n}}{2\sqrt{3}} \le \sum_{i=1}^{n} \left[1 - \exp(-X_i)\right] \le \frac{n}{2} + \frac{\sqrt{n}}{2\sqrt{3}}\right).
Ex. 8.4.7. Let a_n = \sum_{k=0}^{n} \frac{n^k}{k!}\, e^{-n}, n ≥ 1. Using the Central Limit Theorem, evaluate \lim_{n \to \infty} a_n.
Ex. 8.4.8. How many times should you toss a coin:
(a) to be at least 90% sure that your estimate of P(head) is within 0.1 of its true value?
(b) to be at least 90% sure that your estimate of P(head) is within 0.01 of its true value?
Ex. 8.4.9. To forecast the outcome of the election in which two parties are contesting, an internet poll via
Facebook is conducted. How many people should be surveyed to be at least 95% sure that the estimated
proportion is within 0.05 of the true value ?
Ex. 8.4.10. A medical study is conducted to estimate the proportion of people suffering from April allergies
in Bangalore. How many people should be surveyed to be at least 99% sure that the estimate is within 0.02
of the true value ?