Strong Law
1 Introduction
Probability Theory includes various theorems known as Laws of Large Numbers; for instance, see [Fel68, Hea71, Ros89]. Usually two major categories are distinguished: Weak Laws versus Strong Laws. Within these categories there are numerous subtle variants of differing generality. Also the Central Limit Theorems are often brought up in this context. Many introductory probability texts treat this topic superficially, and more than once their vague formulations are misleading or plainly wrong. In this note, we consider a special case to clarify the relationship between the Weak and Strong Laws. The reason for doing so is that I have not been able to find a concise formal exposition all in one place. The material presented here is certainly not new and was gleaned from many sources.

In the following sections, X_1, X_2, . . . is a sequence of independent and identically distributed random variables with finite expectation μ. We define the associated sequence X̄_n of partial sample means by

    X̄_n = (1/n) Σ_{i=1}^{n} X_i .
The Laws of Large Numbers make statements about the convergence of X̄_n to μ. Both laws relate bounds on sample size, accuracy of approximation, and degree of confidence. The Weak Laws deal with limits of probabilities involving X̄_n; the Strong Laws deal with probabilities involving limits of X̄_n. Especially the mathematical underpinning of the Strong Laws requires a careful approach ([Hea71, Ch. 5] is an accessible presentation).
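To make the convergence of partial sample means concrete before stating the laws, here is a small Python sketch (an editorial illustration, not part of the original note; the Uniform(0,1) distribution, the seed, and the sample size are arbitrary choices). It computes the partial sample means X̄_1, . . . , X̄_n of simulated i.i.d. draws and shows them settling near μ = 0.5:

```python
import random

def running_means(xs):
    """Return the partial sample means X-bar_1, ..., X-bar_n of the sequence xs."""
    means, total = [], 0.0
    for i, x in enumerate(xs, start=1):
        total += x
        means.append(total / i)
    return means

random.seed(42)
xs = [random.random() for _ in range(100_000)]  # i.i.d. Uniform(0,1); mu = 0.5
means = running_means(xs)

# The later partial means hover ever closer to mu = 0.5.
print(means[99], means[9_999], means[99_999])
```

Running it shows the typical behaviour: early partial means wander, late ones sit very close to μ.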
2 The Weak Law

2.1 Theorem (Weak Law of Large Numbers) For every ε > 0 we have

    lim_{n→∞} Pr( |X̄_n − μ| ≤ ε ) = 1 ,    (1)

or in words: X̄_n converges in probability to μ as n → ∞. On account of the definition of limit and the fact that probabilities are at most 1, Equation (1) can be rewritten as

    ∀ε>0 ∀δ>0 ∃N>0 ∀n≥N : Pr( |X̄_n − μ| ≤ ε ) ≥ 1 − δ .    (2)
The proof of the Weak Law is easy when the X_i's have a finite variance. It is most often based on Chebyshev's Inequality.

2.2 Theorem (Chebyshev's Inequality) Let X be a random variable with finite mean μ and finite variance σ². Then for all a > 0 we have

    Pr( |X − μ| ≥ a ) ≤ σ²/a² .

A slightly different way of putting it is this: for all a > 0, we have

    Pr( |X − μ| ≥ aσ ) ≤ 1/a² .
Thus, the probability that X deviates from its expected value by at least k standard deviations is at most 1/k². Chebyshev's Inequality is sharp when no further assumptions are made about X's distribution, but for practical applications it is often too sloppy. For example, the probability that X remains within 3σ of μ is at least 8/9, no matter what distribution X has. However, when X is known to have a normal distribution, this probability is in fact about 0.9973. We now prove the Weak Law when the variance is finite. Let σ² be the variance of each X_i. In that case, we have E X̄_n = μ and Var X̄_n = σ²/n. Let ε > 0. Substituting X, μ, σ, a := X̄_n, μ, σ/√n, ε in Chebyshev's Inequality then yields

    Pr( |X̄_n − μ| ≥ ε ) ≤ σ²/(nε²) .    (3)
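The 3σ comparison can be reproduced numerically. The following Python sketch (an editorial illustration; the helper `Phi` builds the standard normal distribution function from the error function in the standard library) evaluates both the distribution-free Chebyshev lower bound and the exact probability for a normal X:

```python
import math

def Phi(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

k = 3
chebyshev_lower = 1.0 - 1.0 / k**2   # Pr(|X - mu| < k*sigma) >= 1 - 1/k^2, any distribution
normal_exact = 2.0 * Phi(k) - 1.0    # the same probability when X is normal

print(chebyshev_lower, normal_exact)
```

The gap between 8/9 ≈ 0.889 and about 0.9973 is exactly the sloppiness the text refers to.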
Hence, for δ > 0 and for all n ≥ max{1, σ²/(δε²)} we have

    Pr( |X̄_n − μ| < ε ) ≥ 1 − δ ,

which proves the Weak Law in the form of Equation (2).
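Bound (3) can also be checked empirically. The sketch below (illustrative only; Uniform(0,1) variables, n = 1000, ε = 0.01, and the number of trials are arbitrary choices, not part of the proof) estimates Pr(|X̄_n − μ| ≥ ε) by simulation and compares it with σ²/(nε²):

```python
import random

random.seed(7)
n, trials = 1000, 2000
mu, var = 0.5, 1.0 / 12.0        # mean and variance of Uniform(0,1)
eps = 0.01

# Empirical estimate of Pr(|X-bar_n - mu| >= eps) over many independent samples.
exceed = 0
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    if abs(xbar - mu) >= eps:
        exceed += 1
empirical = exceed / trials
bound = var / (n * eps * eps)    # Chebyshev bound (3): sigma^2 / (n * eps^2)

print(empirical, bound)
```

As expected, the empirical frequency stays below the Chebyshev bound, and typically well below it, illustrating once more how conservative the bound is.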
2.3 Theorem (Central Limit Theorem) For all a and b with a ≤ b we have

    lim_{n→∞} Pr( a ≤ (X̄_n − μ)/(σ/√n) ≤ b ) = Φ(b) − Φ(a) ,    (4)

where Φ is the standard normal distribution function,

    Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−t²/2} dt .
Convergence in (4) is uniform in a and b. The Central Limit Theorem can be interpreted as stating that for large n, the random variable X̄_n approximately has a normal distribution with mean μ and standard deviation σ/√n. We now prove that the Central Limit Theorem implies the Weak Law of Large Numbers when 0 < σ < ∞. First observe that substituting a, b := −c/σ, c/σ in the Central Limit Theorem yields

    lim_{n→∞} Pr( |X̄_n − μ| ≤ c/√n ) = Φ(c/σ) − Φ(−c/σ) .    (5)
Let ε > 0 and δ > 0. Take c > 0 such that Φ(−c/σ) ≤ δ/3 (this is possible since Φ(z) → 0 as z → −∞) and take N such that c/√N ≤ ε and the limit in (5) is approached closer than δ/3 for all n ≥ N. We derive for n ≥ N (with hints placed between braces):

    Pr( |X̄_n − μ| ≤ ε )
  ≥   { monotonicity of Pr, using c/√n ≤ c/√N ≤ ε, on account of definition of N }
    Pr( |X̄_n − μ| ≤ c/√n )
  ≥   { definition of N }
    Φ(c/σ) − Φ(−c/σ) − δ/3
  =   { Φ(z) + Φ(−z) = 1 }
    1 − 2 Φ(−c/σ) − δ/3
  ≥   { definition of c }
    1 − δ
This concludes the proof. If convergence to the standard normal distribution is assumed to be good (much better than δ), then we can take bound N such that

    N ≥ ( (σ/ε) Φ⁻¹(1 − δ/2) )² .    (6)

Compare this to the bound N ≥ σ²/(δε²) on account of Chebyshev's Inequality. As an example, consider the case where we want to be 95% certain that the sample mean falls within (1/4)σ of μ; that is, δ = 0.05 and ε = σ/4. Chebyshev's Inequality yields N ≥ 16/0.05 = 320, and the standard normal approximation yields √N ≥ 4 × 1.96, or N ≥ 61.47. Thus, if the standard normal approximation is good, then our need is already fulfilled by the mean of 62 samples, instead of the 320 required by Chebyshev's Inequality.

I would like to emphasize the following points concerning the Central Limit Theorem.

- There exist estimates of how closely the standard normal distribution approximates the distribution of the sample mean. Consult [Fel71, Hea71] for the Berry-Esseen bound.
- If the X_i's themselves have a normal distribution, then so does the sample mean, and the approximation in the Central Limit Theorem is in fact exact.
- The more general versions of the Weak Law are not derivable from (more general versions of) the Central Limit Theorem.
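The two sample-size bounds in this example can be computed directly. In the Python sketch below (an editorial illustration; `Phi_inv` is a hypothetical helper that inverts the standard normal distribution function by bisection, not a routine from the note), the Chebyshev bound and bound (6) are evaluated for δ = 0.05 and ε = σ/4:

```python
import math

def Phi(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def Phi_inv(p):
    """Invert Phi by bisection (Phi is strictly increasing); illustrative helper."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

delta = 0.05
ratio = 4.0                                      # sigma/eps, for eps = sigma/4
n_chebyshev = ratio**2 / delta                   # N >= sigma^2 / (delta * eps^2) = 320
n_normal = (ratio * Phi_inv(1 - delta / 2))**2   # bound (6), about 61.47

print(math.ceil(n_chebyshev), math.ceil(n_normal))
```

This reproduces the 320-versus-62 comparison from the text.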
3 The Strong Law

3.1 Theorem (Strong Law of Large Numbers) We have

    Pr( lim_{n→∞} X̄_n = μ ) = 1 .    (7)
This is often abbreviated to X̄_n → μ a.s. as n → ∞, or in words: X̄_n converges almost surely to μ as n → ∞. One of the problems with such a law is the assignment of probabilities to statements involving infinitely many random variables. For that purpose, one needs a careful introduction of notions like sample space, probability measure, and random variable. See for instance [Tuc67, Hea71, Chu74a, LR79]. Using some Probability Theory, the Strong Law can be rewritten into a form with probabilities involving finitely many random variables only. We rewrite Equation (7) in a chain of equivalences:
    Pr( lim_{n→∞} X̄_n = μ ) = 1
  ≡   { definition of limit }
    Pr( ∀ε>0 ∃N>0 ∀n≥N : |X̄_n − μ| ≤ ε ) = 1    (8)
  ≡   { Note 1 }
    ∀ε>0 : Pr( ∃N>0 ∀n≥N : |X̄_n − μ| ≤ ε ) = 1    (9)
  ≡   { Note 2 }
    ∀ε>0 ∀δ>0 ∃N>0 : Pr( ∀n≥N : |X̄_n − μ| ≤ ε ) ≥ 1 − δ    (10)
  ≡   { Note 3 }
    ∀ε>0 ∀δ>0 ∃N>0 ∀r≥0 : Pr( ∀n: N≤n≤N+r : |X̄_n − μ| ≤ ε ) ≥ 1 − δ    (11)
Comparing Equations (2) and (10) we immediately infer the Weak Law from the Strong Law, which explains their names. In order to supply the notes to the above derivation, let (Ω, F, P) be an appropriate probability space for the random variables X_i, and define events A_ε, B_{εN}, and C_{εNr} for ε > 0, N > 0, and r ≥ 0 by

    A_ε    = { ω ∈ Ω | ∃N>0 ∀n≥N : |X̄_n(ω) − μ| ≤ ε }
    B_{εN}  = { ω ∈ Ω | ∀n≥N : |X̄_n(ω) − μ| ≤ ε }
    C_{εNr} = { ω ∈ Ω | ∀n: N≤n≤N+r : |X̄_n(ω) − μ| ≤ ε } .
These events satisfy the following monotonicity properties:

    A_ε ⊆ A_{ε'}   for ε ≤ ε' ,
    B_{εN} ⊆ B_{ε,N+1} ⊆ A_ε ,
    C_{εN,r+1} ⊆ C_{εNr} .
Therefore, on account of the continuity of probability measure P for monotonic chains of events, we have

    P( ∩_{m=1}^{∞} A_{1/m} ) = lim_{m→∞} P(A_{1/m}) ,    (12)
    P( ∪_{N=1}^{∞} B_{εN} ) = lim_{N→∞} P(B_{εN}) ,    (13)
    P( ∩_{r=0}^{∞} C_{εNr} ) = lim_{r→∞} P(C_{εNr}) .    (14)
Note 1. We derive:

    Pr( ∀ε>0 ∃N>0 ∀n≥N : |X̄_n − μ| ≤ ε ) = 1
  ≡   { definitions of Pr and A_ε, and set theory: ∩_{ε>0} A_ε = ∩_{m=1}^{∞} A_{1/m} }
    P( ∩_{m=1}^{∞} A_{1/m} ) = 1
  ≡   { (12) }
    lim_{m→∞} P(A_{1/m}) = 1
  ≡   { property of limits, using that P(A_{1/m}) is descending and at most 1 }
    ∀m>0 : P(A_{1/m}) = 1
  ≡   { see first two steps, also using monotonicity of P }
    ∀ε>0 : Pr( ∃N>0 ∀n≥N : |X̄_n − μ| ≤ ε ) = 1

Note 2. We derive for ε > 0:

    Pr( ∃N>0 ∀n≥N : |X̄_n − μ| ≤ ε ) = 1
  ≡   { definitions of Pr and B_{εN}, and set theory }
    P( ∪_{N=1}^{∞} B_{εN} ) = 1
  ≡   { (13) }
    lim_{N→∞} P(B_{εN}) = 1
  ≡   { definition of limit, using P(B_{εk}) ≤ 1 }
    ∀δ>0 ∃N>0 ∀k≥N : P(B_{εk}) ≥ 1 − δ
  ≡   { monotonicity of P, using B_{εN} ⊆ B_{εk} for k ≥ N }
    ∀δ>0 ∃N>0 : P(B_{εN}) ≥ 1 − δ
  ≡   { definitions of Pr and B_{εN} }
    ∀δ>0 ∃N>0 : Pr( ∀n≥N : |X̄_n − μ| ≤ ε ) ≥ 1 − δ

Note 3. We derive for ε > 0, δ > 0, and N > 0:

    Pr( ∀n≥N : |X̄_n − μ| ≤ ε ) ≥ 1 − δ
  ≡   { definitions of Pr and C_{εNr}, and set theory }
    P( ∩_{r=0}^{∞} C_{εNr} ) ≥ 1 − δ
  ≡   { (14) }
    lim_{r→∞} P(C_{εNr}) ≥ 1 − δ
  ≡   { property of limits, using that P(C_{εNr}) is descending in r }
    ∀r≥0 : P(C_{εNr}) ≥ 1 − δ
  ≡   { definitions of Pr and C_{εNr} }
    ∀r≥0 : Pr( ∀n: N≤n≤N+r : |X̄_n − μ| ≤ ε ) ≥ 1 − δ
In particular, [the Strong Law] shows that, with probability 1, for any positive value ε,

    | (Σ_{i=1}^{n} X_i)/n − μ |

will be greater than ε only a finite number of times.

Equation (10) can be recognized in [Hea71, p. 226]: Indeed for arbitrarily small ε > 0, δ > 0, and large N = N(ε, δ), . . . the [definition] of X̄_n → X a.s. . . . can be restated . . . as
    P( ∩_{n=N}^{∞} [ |X̄_n − X| ≤ ε ] ) ≥ 1 − δ
. . . Equation (11) resembles the definition in [Fel68, p. 259]: We say that the sequence X_k obeys the strong law of large numbers if to every pair ε > 0, δ > 0, there corresponds an N such that there is probability 1 − δ or better that for every r > 0 all r + 1 inequalities

    |S_n/n − m| < ε ,   n = N, N + 1, . . . , N + r

will be satisfied.
Compare this result to (3): it is the same upper bound, but now for a much larger event. This creates the impression that the Weak Law is not that much weaker than the Strong Law.
4 Concluding Remarks
We have looked at one special case to clarify the relationship between the Weak and the Strong Law of Large Numbers. The case was special in that we have assumed X_i to be a sequence of independent and identically distributed random variables with finite expectation, and that we have considered the convergence of partial sample means to the common expectation. Historically, it was preceded by more special cases, for instance, with the X_i restricted to Bernoulli variables. Nowadays these laws are treated in much more generality. My main interest was focused on Equations (8) through (11), which are equivalent but subtly different formulations of the Strong Law of Large Numbers. Furthermore, I have looked at constructive bounds related to the rate of convergence.

Let me conclude with some sobering quotes (also to be found in [Chu74b, p. 233]). Feller writes in [Fel68, p. 152]: [The weak law of large numbers] is of very limited interest and should be replaced by the more precise and more useful strong law of large numbers. In [Wae71, p. 98], van der Waerden writes: [The strong law of large numbers] scarcely plays a role in mathematical statistics.

Acknowledgment I would like to thank Fred Steutel for discussing the issues in this paper and for pointing out reference [HR80].
References
[Chu74a] Kai Lai Chung. A Course in Probability Theory. Academic Press, second edition, 1974.
[Chu74b] Kai Lai Chung. Elementary Probability Theory and Stochastic Processes. Springer, 1974.
[Eis69] Martin Eisen. Introduction to Mathematical Probability Theory. Prentice-Hall, 1969.
[Fel68] William Feller. An Introduction to Probability Theory and Its Applications, volume I. Wiley, third edition, 1968.
[Fel71] William Feller. An Introduction to Probability Theory and Its Applications, volume II. Wiley, second edition, 1971.
[Hea71] C. R. Heathcote. Probability: Elements of the Mathematical Theory. Unwin, 1971.
[HR80] A. H. Hoekstra and J. T. Runnenburg. Inequalities compared. Statistica Neerlandica, 34(2):67-82, 1980.
[LR79] R. G. Laha and V. K. Rohatgi. Probability Theory. Wiley, 1979.
[Ros89] Sheldon Ross. A First Course in Probability. Macmillan, third edition, 1989.
[Tuc67] Howard G. Tucker. A Graduate Course in Probability. Academic Press, 1967.
[Wae71] B. L. van der Waerden. Mathematische Statistik. Springer, third edition, 1971.