James–Stein Estimator: Introduction

By Benjamin R. Baer, Martin T. Wells, and George Casella†
Keywords: admissibility, Baranchik estimator, Bayes estimation, decision theory, empirical Bayes,
James–Stein estimator, multivariate normal, shrinkage estimation, squared error loss, unbiased risk
estimate

Abstract: For a multivariate normal distribution with independent components, the observed vector is an inadmissible estimator of the mean vector. The James–Stein estimator defines a statistic with lower squared error loss than the observed vector. It has led to research into families of minimax estimators, some of which are summarized.

Estimating the mean of a multivariate normal distribution is a central problem in statistics. The usual
estimator is the sample mean, which is the maximum-likelihood estimator, the uniformly minimum
variance unbiased estimator, the best invariant estimator, and minimax for nearly arbitrary symmetric
loss[1]. Initially, it was believed that the sample mean was also admissible, but Stein[2] showed that it is inadmissible under quadratic loss in any dimension greater than two. This revealed a surprising discrepancy, at first considered a paradox, between estimating the mean vector one component at a time and estimating it all at once, and it called into question classical estimation theory going back to Gauss and Legendre. The sense of paradox has since diminished, and Stein's phenomenon is now quite mainstream[3].
Let $X \mid \theta \sim \mathcal{N}_p(\theta, \sigma^2 I_p)$ be a draw from a $p$-dimensional normal distribution with mean $\theta$, where the components are independent with a common, known variance $\sigma^2$. The quality of an estimator $\delta$ is evaluated using the quadratic loss $L(\theta, \delta) = \|\delta - \theta\|^2 = \sum_{i=1}^{p} (\delta_i - \theta_i)^2$ and the risk $R(\theta, \delta) = E_\theta \|\delta(X) - \theta\|^2$. The usual estimator of the mean is denoted $\delta_0(X) = X$. James and Stein[4] introduced their celebrated shrinkage estimators by showing that a James–Stein estimator
$$\delta_a^{\mathrm{JS}}(X) = (1 - a\sigma^2/\|X\|^2)\,X \qquad (1)$$
has uniformly smaller risk than (i.e., dominates) $\delta_0(X)$ for $p \ge 3$ whenever $0 < a < 2(p-2)$. These estimators and their progeny are the focus of this article.
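As a quick illustration of the domination claim, the following sketch (not part of the original article; the dimension, variance, and mean vector are chosen arbitrarily) compares the Monte Carlo risk of $\delta_0(X) = X$ with that of the James–Stein estimator (1) using $a = p - 2$.

```python
import numpy as np

# Minimal sketch: Monte Carlo risk of delta_0(X) = X versus the
# James-Stein estimator (1) with a = p - 2.  All values are illustrative.
rng = np.random.default_rng(0)
p, sigma, n_rep = 10, 1.0, 20_000
theta = np.full(p, 0.5)                       # assumed true mean vector

X = theta + sigma * rng.standard_normal((n_rep, p))

a = p - 2
shrink = 1.0 - a * sigma**2 / np.sum(X**2, axis=1, keepdims=True)
delta_js = shrink * X                         # James-Stein estimator (1)

risk_mle = np.mean(np.sum((X - theta) ** 2, axis=1))        # approx. p * sigma^2
risk_js = np.mean(np.sum((delta_js - theta) ** 2, axis=1))  # smaller for p >= 3

print(f"risk of delta_0: {risk_mle:.3f}   risk of JS: {risk_js:.3f}")
```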

Cornell University, Ithaca, NY, USA


† Deceased

Update based on original article by G Casella, Wiley StatsRef: Statistics Reference Online © 2014 John Wiley & Sons, Ltd.


1 Some Intuition on the Possibility of Improvement

To develop some intuition for why estimators can improve over $\delta_0(X) = X$, some modern formulations of the estimation problem are helpful. These formulations all lead to James–Stein estimators, and the fact that several distinct arguments arrive at the same form lends the estimators broad support.
Suppose that the prior distribution $\pi$ of $\theta$ is such that the marginal density $m(x)$ of $X$ is finite. Brown[5] showed that the posterior mean is
$$E[\theta \mid X] = X + \sigma^2 \, \frac{\nabla m(X)}{m(X)} \qquad (2)$$
which is known as the Brown–Stein formula. The second term on the right-hand side is known as the shrinkage function. For this section, we consider linear shrinkage functions and the estimators $\delta_a(X) = (1-a)X$.
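To see formula (2) in action, here is a small numerical check, not part of the original article, for a scalar two-point prior with assumed support points and weights; it compares the posterior mean computed directly with the posterior mean computed via the Brown–Stein formula.

```python
import numpy as np
from scipy.stats import norm

# Check of the Brown-Stein identity (2) for a two-point prior on a scalar
# mean (p = 1 suffices to check the algebra).  Values are illustrative.
sigma = 1.0
thetas = np.array([-2.0, 3.0])        # prior support points (assumed)
weights = np.array([0.3, 0.7])        # prior probabilities (assumed)

x = 1.4                               # an observed value

# Marginal density m(x) = sum_j w_j * N(x; theta_j, sigma^2)
dens = weights * norm.pdf(x, loc=thetas, scale=sigma)
m = dens.sum()

# Posterior mean computed directly
post_mean_direct = (dens * thetas).sum() / m

# Posterior mean via Brown-Stein: x + sigma^2 * m'(x)/m(x),
# where m'(x) = sum_j w_j * N(x; theta_j, sigma^2) * (theta_j - x)/sigma^2
m_prime = (dens * (thetas - x) / sigma**2).sum()
post_mean_brown = x + sigma**2 * m_prime / m

print(post_mean_direct, post_mean_brown)   # the two values agree
```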
From a geometric point of view, the estimator $\delta_0(X) = X$ is "too long" to be an estimator of $\theta$ and needs to be shrunk. One way to see this is to recall that $E_\theta[\|X\|^2] = \|\theta\|^2 + p\sigma^2$. To shorten the estimator, we consider projections of the mean $\theta$ onto $X$. The resulting oracle estimator depends on the unknown target of inference $\theta$, so an estimator will be employed as an adaptive surrogate.
First, the risk
$$R(\theta, \delta_a) = E_\theta[\|(1 - a)X - \theta\|^2] = (1 - a)^2 p\sigma^2 + a^2 \|\theta\|^2$$
is minimized over the real number $a$ at $a_{\mathrm{opt},1} = p\sigma^2/(p\sigma^2 + \|\theta\|^2)$. This suggests an oracle version of (1), $\delta_{a_{\mathrm{opt},1}}(X) = (1 - p\sigma^2/(p\sigma^2 + \|\theta\|^2))X$. Since $E_\theta[\|X\|^2] = p\sigma^2 + \|\theta\|^2$, we may estimate the denominator of the shrinkage weight by $\|X\|^2$ and consequently estimate $a_{\mathrm{opt},1}$ with $\hat a_{\mathrm{opt},1} = p\sigma^2/\|X\|^2$. The resulting estimator is a James–Stein estimator, $\delta_{\hat a_{\mathrm{opt},1}}(X) = (1 - p\sigma^2/\|X\|^2)X = \delta_p^{\mathrm{JS}}(X)$.
This derivation suggests some interesting properties of James–Stein estimators. The accuracy of $\hat a_{\mathrm{opt},1}$ as an estimator of $a_{\mathrm{opt},1}$ improves as $p$ increases, which suggests that James–Stein estimators may have better risk properties for larger $p$. In addition, the identity $R(\theta, \delta_{a_{\mathrm{opt},1}}) = R(\theta, \delta_0)\,\|\theta\|^2/(p\sigma^2 + \|\theta\|^2)$ shows that the largest improvement over $\delta_0$ occurs when $\|\theta\|^2$ is close to zero.
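The sketch below (illustrative dimension, variance, and signal strengths only; not from the article) computes the oracle weight $a_{\mathrm{opt},1}$, its plug-in surrogate $p\sigma^2/\|X\|^2$, and the oracle risk ratio $\|\theta\|^2/(p\sigma^2 + \|\theta\|^2)$ for a few signal strengths, showing that the improvement is largest when $\|\theta\|^2$ is small.

```python
import numpy as np

# Compare the oracle shrinkage weight a_opt1 = p*sigma^2/(p*sigma^2 + ||theta||^2)
# with its plug-in surrogate p*sigma^2/||X||^2 for several signal strengths.
rng = np.random.default_rng(1)
p, sigma = 50, 1.0

for scale in (0.0, 0.5, 2.0):                 # illustrative signal strengths
    theta = np.full(p, scale)
    X = theta + sigma * rng.standard_normal(p)

    theta_norm2 = np.sum(theta**2)
    a_opt = p * sigma**2 / (p * sigma**2 + theta_norm2)
    a_hat = p * sigma**2 / np.sum(X**2)

    # Oracle risk ratio R(theta, delta_{a_opt}) / R(theta, delta_0)
    risk_ratio = theta_norm2 / (p * sigma**2 + theta_norm2)
    print(f"||theta||^2={theta_norm2:6.1f}  a_opt={a_opt:.3f}  "
          f"a_hat={a_hat:.3f}  oracle risk ratio={risk_ratio:.3f}")
```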
Second, the loss $L(\theta, \delta_a) = \|(1 - a)X - \theta\|^2$ is minimized over $a$ at $a_{\mathrm{opt},2} = (\|X\|^2 - X^{T}\theta)/\|X\|^2$. Since $E_\theta[\|X\|^2 - X^{T}\theta] = p\sigma^2$, estimating the numerator by $p\sigma^2$ produces $\hat a_{\mathrm{opt},2} = p\sigma^2/\|X\|^2$, and the resulting estimator is again the same James–Stein estimator. Notice that the calculations in these two approaches did not use normality.
Third, we consider a Bayesian model. Assume that $\theta \sim \mathcal{N}_p(0, \tau^2 I_p)$ for unknown $\tau^2$ and, as before, that $X \mid \theta \sim \mathcal{N}_p(\theta, \sigma^2 I_p)$ for known $\sigma^2$. Minimizing the posterior expected loss $E[L(\delta_a(X), \theta) \mid X]$ over $a$ produces the Bayes estimator, which by the Brown–Stein formula (2) is $\delta_{a_{\mathrm{opt},3}}(X) = (1 - \sigma^2/(\sigma^2 + \tau^2))X$, with $a_{\mathrm{opt},3} = \sigma^2/(\sigma^2 + \tau^2)$. To estimate the shrinkage weight, we use the uniformly minimum variance unbiased estimator of $1/(\sigma^2 + \tau^2)$, which is $(p-2)/\|X\|^2$ since marginally $X \sim \mathcal{N}_p(0, (\sigma^2 + \tau^2)I_p)$. This suggests the estimator $\delta_{p-2}^{\mathrm{JS}}(X) = (1 - (p-2)\sigma^2/\|X\|^2)X$. Notice that the scale parameter $\tau^2$ of the prior was estimated from the data, making this an empirical Bayes estimator.
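A short simulation under this hierarchical model (with illustrative values of $\sigma$ and $\tau$, not taken from the article) compares the oracle Bayes estimator, which requires knowing $\tau^2$, with the empirical Bayes James–Stein rule that plugs in $(p-2)/\|X\|^2$.

```python
import numpy as np

# Empirical Bayes reading of the James-Stein estimator.  Under the prior
# theta ~ N(0, tau^2 I), the Bayes rule shrinks by sigma^2/(sigma^2 + tau^2);
# the UMVUE (p-2)/||X||^2 of 1/(sigma^2 + tau^2) gives the plug-in JS rule.
rng = np.random.default_rng(2)
p, sigma, tau, n_rep = 20, 1.0, 2.0, 10_000   # illustrative values

theta = tau * rng.standard_normal((n_rep, p))           # draws from the prior
X = theta + sigma * rng.standard_normal((n_rep, p))

bayes = (1.0 - sigma**2 / (sigma**2 + tau**2)) * X      # oracle Bayes estimator
js = (1.0 - (p - 2) * sigma**2
      / np.sum(X**2, axis=1, keepdims=True)) * X        # empirical Bayes JS

bayes_risk = np.mean(np.sum((bayes - theta) ** 2, axis=1))
js_risk = np.mean(np.sum((js - theta) ** 2, axis=1))
print(f"Bayes risk (oracle tau): {bayes_risk:.2f}   empirical Bayes JS: {js_risk:.2f}")
```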

2 Stein’s Unbiased Risk Estimator

In this section, we consider estimators of the form $\delta(X) = X + \sigma^2 g(X)$, where the shrinkage function $g : \mathbb{R}^p \to \mathbb{R}^p$ satisfies $E_\theta[\|g(X)\|^2] < \infty$. This finiteness constraint is necessary and sufficient for the risk of $\delta(X)$ to be finite.
The derivation of an unbiased estimator of the risk of $\delta(X)$ is the main focus of this section, following the development due to Stein[6, 7]. The risk $R(\theta, \delta)$ equals
$$E_\theta\|X + \sigma^2 g(X) - \theta\|^2 = p\sigma^2 + \sigma^4 E_\theta[\|g(X)\|^2] + 2\sigma^2 E_\theta[(X - \theta)^{T} g(X)] \qquad (3)$$
and involves the mean $\theta$ only through the last summand, $E_\theta[(X - \theta)^{T} g(X)]$.
Amazingly, there exists a simple unbiased estimator of the last summand of (3), obtained from Stein's integration-by-parts formula. Let $X \sim \mathcal{N}_p(\theta, \sigma^2 I_p)$, and let $g : \mathbb{R}^p \to \mathbb{R}^p$ be weakly differentiable (see Appendix A.1 of Fourdrinier et al.[8]). Define the divergence $\operatorname{div} g(x) = \sum_{i=1}^{p} \partial_i g_i(x)$. Then Stein's lemma states that
$$E_\theta[(X - \theta)^{T} g(X)] = \sigma^2 E_\theta[\operatorname{div} g(X)]$$
provided that, for each $i = 1, \ldots, p$, either $E_\theta[|(X_i - \theta_i) g_i(X)|] < \infty$ or $E_\theta[|\partial_i g_i(X)|] < \infty$.
From Stein's lemma, it follows that $p\sigma^2 + \sigma^4(\|g(X)\|^2 + 2\operatorname{div} g(X))$ is an unbiased estimator of the risk $R(\theta, \delta)$. This estimator is known as Stein's unbiased risk estimator, or SURE. Indeed, since $X$ is complete and sufficient, SURE is also the minimum variance unbiased estimator of the risk. Since the risk of $\delta_0(X) = X$ equals $p\sigma^2$, it follows immediately that the estimator $\delta(X) = X + \sigma^2 g(X)$ is minimax if $\|g(X)\|^2 + 2\operatorname{div} g(X) \le 0$ and dominates $\delta_0(X)$ if, in addition, the inequality is strict on a set of positive measure.
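As a numerical sanity check (a sketch with assumed parameter values, not taken from the article), note that for the James–Stein estimator $g(x) = -ax/\|x\|^2$, so $\|g(x)\|^2 = a^2/\|x\|^2$ and $\operatorname{div} g(x) = -a(p-2)/\|x\|^2$, giving $\mathrm{SURE}(x) = p\sigma^2 + \sigma^4(a^2 - 2a(p-2))/\|x\|^2$. The snippet below averages SURE over simulated draws and compares it with the Monte Carlo risk.

```python
import numpy as np

# Verify numerically that SURE is unbiased for the risk of the
# James-Stein estimator.  Parameter values are illustrative.
rng = np.random.default_rng(3)
p, sigma, n_rep = 8, 1.0, 50_000
a = p - 2
theta = np.linspace(-1.0, 1.0, p)              # assumed true mean

X = theta + sigma * rng.standard_normal((n_rep, p))
norm2 = np.sum(X**2, axis=1)

# SURE for the James-Stein estimator, using the closed forms above
sure = p * sigma**2 + sigma**4 * (a**2 - 2 * a * (p - 2)) / norm2

# Monte Carlo risk of the same estimator
delta = (1.0 - a * sigma**2 / norm2)[:, None] * X
loss = np.sum((delta - theta) ** 2, axis=1)

print(f"mean SURE: {sure.mean():.3f}   Monte Carlo risk: {loss.mean():.3f}")
```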

3 Risk Properties of Estimators

We are now in a position to study the risk properties of James–Stein estimators. After doing so, we review
a more general class of estimators, some of which dominate 𝛿0 or other James–Stein estimators.
A James–Stein estimator $\delta_a^{\mathrm{JS}}(X)$ dominates $\delta_0(X) = X$ when $0 < a < 2(p-2)$ and is minimax when $0 \le a \le 2(p-2)$, for all $p \ge 3$; further, the uniformly optimal choice of $a$ is $a = p - 2$. The risk $R(0, \delta_{p-2}^{\mathrm{JS}})$ of the optimal James–Stein estimator at $\theta = 0$ is $2\sigma^2$ for all $p \ge 3$.
An undesirable property of James–Stein estimators is that the shrinkage factor becomes negative and arbitrarily large in magnitude as $\|X\|^2$ approaches zero. A possible remedy is to instead consider the positive part of the estimator, that is, $\delta_a^{\mathrm{JS}+}(X) = (1 - a\sigma^2/\|X\|^2)_+ X$. This positive-part James–Stein estimator $\delta_a^{\mathrm{JS}+}$ dominates its corresponding James–Stein estimator $\delta_a^{\mathrm{JS}}$ for any $a > 0$.
A more general modification of a James–Stein estimator is a Baranchik-type estimator[9], defined by $\delta_{a,r}^{B}(X) = (1 - a\sigma^2 r(\|X\|^2)/\|X\|^2)X$, where $r : \mathbb{R}_{\ge 0} \to \mathbb{R}$. For example, choosing $r_{+}(s) = \min\{s/(a\sigma^2), 1\}$ produces $\delta_{a,r_{+}}^{B}(X) = \delta_a^{\mathrm{JS}+}(X)$, a positive-part James–Stein estimator. It can be shown that a Baranchik-type estimator is minimax for $p \ge 3$ when $0 \le a \le 2(p-2)$, $0 \le r(\cdot) \le 1$, and $r(\cdot)$ is absolutely continuous and nondecreasing.
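The equivalence between the positive-part estimator and the Baranchik form with $r_{+}(s) = \min\{s/(a\sigma^2), 1\}$ is easy to check numerically; the values below are illustrative and not from the article.

```python
import numpy as np

# Check that the Baranchik form with r(s) = min(s/(a*sigma^2), 1) reproduces
# the positive-part James-Stein estimator.  Values are illustrative.
rng = np.random.default_rng(4)
p, sigma, a = 5, 1.0, 3.0
X = 0.3 * rng.standard_normal(p)        # small ||X|| to trigger the positive part

s = np.sum(X**2)
js_plus = max(1.0 - a * sigma**2 / s, 0.0) * X               # positive-part JS
r_plus = min(s / (a * sigma**2), 1.0)
baranchik = (1.0 - a * sigma**2 * r_plus / s) * X            # Baranchik form

print(np.allclose(js_plus, baranchik))  # True: the two estimators coincide
```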
Stein's lemma can also be used to show minimaxity of some Bayes estimators $E[\theta \mid X]$. The estimator in (2) is minimax if $\Delta\sqrt{m(x)} \le 0$ for all $x \in \mathbb{R}^p$, where $\Delta$ denotes the weak Laplacian, and $E_\theta[\|\nabla m(X)/m(X)\|^2] < \infty$.
Proofs and much further detail on all of the topics discussed here are available in Fourdrinier et al.[8]. Also discussed there are estimators that shrink toward subspaces, estimators that adaptively shrink toward one of several targets, estimators that do not rely on a known error variance, and risk properties under more general loss functions and error distributions.

Related Articles

Admissibility; Decision Theory: Biostatistical Aspects; Shrinkage Estimation; James–Stein Estimators: Theory.

References

[1] Lehmann, E.L. and Casella, G. (2006) Theory of Point Estimation, Springer, New York.
[2] Stein, C.M. (1956) Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution. Proc. 3rd
Berkeley Symposium, vol. 1, pp. 197–206.
[3] Efron, B. and Morris, C. (1977) Stein’s paradox in statistics. Sci. Am., 236 (5), 119–127.
[4] James, W. and Stein, C.M. (1961) Estimation with Quadratic Loss. Proc. 4th Berkeley Symposium, vol. 1, pp. 361–379.
[5] Brown, L.D. (1971) Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann. Math.
Statist., 42 (3), 855–903.
[6] Stein, C.M. (1973) Estimation of the Mean of a Multivariate Normal Distribution. Proc. Prague Symp. on Asymptotic
Statistics, pp. 345–381.
[7] Stein, C.M. (1981) Estimation of the mean of a multivariate normal distribution. Ann. Statist., 9 (6), 1135–1151.
[8] Fourdrinier, D., Strawderman, W.E., and Wells, M.T. (2018) Shrinkage Estimation, Springer, New York.
[9] Baranchik, A.J. (1970) A family of minimax estimators of the mean of a multivariate normal distribution. Ann. Math. Statist.,
41 (2), 642–645.

Wiley StatsRef: Statistics Reference Online, © 2014–2019 John Wiley & Sons, Ltd. This article is © 2019 John Wiley & Sons, Ltd. DOI: 10.1002/9781118445112.stat05872.pub2