James–Stein Estimator: Introduction
Estimating the mean of a multivariate normal distribution is a central problem in statistics. The usual estimator is the sample mean, which is the maximum-likelihood estimator, the uniformly minimum variance unbiased estimator, the best invariant estimator, and minimax for nearly arbitrary symmetric loss[1]. Initially, it was believed that the sample mean was also admissible, but Stein[2] showed that the sample mean is inadmissible under quadratic loss in any dimension greater than two. This revealed a surprising discrepancy, at first considered a paradox, between estimating the mean vector one component at a time and estimating all components at once, and it unsettled classical estimation theory going back to Gauss and Legendre. The sense of paradox has since diminished, and Stein's phenomenon is now quite mainstream[3].
Let $X \mid \theta \sim \mathcal{N}_p(\theta, \sigma^2 I_p)$ be a draw from a $p$-dimensional normal distribution with mean $\theta$, where the components are independent with a common, known variance $\sigma^2$. An estimator $\delta$ is evaluated using the quadratic loss $L(\theta, \delta) = \|\delta - \theta\|^2 = \sum_{i=1}^{p} (\delta_i - \theta_i)^2$ and the risk $R(\theta, \delta) = E_\theta \|\delta(X) - \theta\|^2$.
The usual estimator of the mean is denoted $\delta_0(X) = X$. James and Stein[4] introduced their celebrated shrinkage estimators by showing that a James–Stein estimator
$$\delta_a^{\mathrm{JS}}(X) = \left(1 - \frac{a\sigma^2}{\|X\|^2}\right) X \qquad (1)$$
has uniformly smaller risk than (i.e., dominates) $\delta_0(X)$ for $p \geq 3$ whenever $0 < a < 2(p-2)$. These estimators and their progeny are the focus of this article.
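As a numerical illustration of this dominance result, the following Monte Carlo sketch compares the risk of the sample mean with that of a James–Stein estimator; the dimension, variance, and true mean are illustrative choices, not values from the original article.

```python
import numpy as np

rng = np.random.default_rng(0)
p, sigma, n_rep = 10, 1.0, 200_000
theta = np.full(p, 0.5)          # hypothetical true mean
a = p - 2                        # lies inside (0, 2(p - 2))

X = theta + sigma * rng.standard_normal((n_rep, p))
norm2 = np.sum(X**2, axis=1, keepdims=True)
delta_js = (1 - a * sigma**2 / norm2) * X   # the estimator in (1)

risk_mean = np.mean(np.sum((X - theta)**2, axis=1))        # ≈ p * sigma^2 = 10
risk_js = np.mean(np.sum((delta_js - theta)**2, axis=1))   # strictly smaller
print(risk_mean, risk_js)
```

The same qualitative ordering holds for any choice of `theta`, which is the content of the dominance statement.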
Update based on original article by G Casella, Wiley StatsRef: Statistics Reference Online © 2014 John Wiley & Sons, Ltd.
Wiley StatsRef: Statistics Reference Online, © 2014–2019 John Wiley & Sons, Ltd. 1
This article is © 2019 John Wiley & Sons, Ltd.
DOI: 10.1002/9781118445112.stat05872.pub2
To develop some intuition for why estimators can improve on $\delta_0(X) = X$, some modern formulations of the estimation problem are helpful. Each of these approaches yields a James–Stein estimator, and such broad support suggests that these estimators hold promise.
Suppose that a prior distribution $\pi$ for $\theta$ is such that the marginal density $m(x)$ of $X$ is finite. Brown[5] showed that the posterior mean is
$$E[\theta \mid X] = X + \sigma^2 \frac{\nabla m(X)}{m(X)} \qquad (2)$$
which is known as the Brown–Stein formula. The second term on the right-hand side is known as the shrinkage function. For this section, we consider linear shrinkage functions and the estimators $\delta_a(X) = (1-a)X$.
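As a worked instance of (2), suppose the prior is $\theta \sim \mathcal{N}_p(0, \tau^2 I_p)$, so that the marginal distribution of $X$ is $\mathcal{N}_p(0, (\sigma^2 + \tau^2) I_p)$. Then the Brown–Stein formula reproduces the familiar linear shrinkage rule:

```latex
\begin{align*}
m(x) &\propto \exp\!\Bigl(-\tfrac{\|x\|^2}{2(\sigma^2+\tau^2)}\Bigr),
\qquad
\frac{\nabla m(x)}{m(x)} = -\frac{x}{\sigma^2+\tau^2},\\
E[\theta \mid X] &= X - \frac{\sigma^2}{\sigma^2+\tau^2}\,X
 = \Bigl(1 - \frac{\sigma^2}{\sigma^2+\tau^2}\Bigr)X,
\end{align*}
```

that is, a linear shrinkage estimator $\delta_a(X)$ with $a = \sigma^2/(\sigma^2+\tau^2)$.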
From a geometric point of view, the estimator $\delta_0(X) = X$ is "too long" to be an estimator of $\theta$ and needs to be shrunk. One way to see this is to recall that $E_\theta[\|X\|^2] = \|\theta\|^2 + p\sigma^2$. To shorten the estimator, we consider projections of the mean $\theta$ onto $X$. The resulting oracle estimator depends on the unknown target of inference $\theta$, so an estimator will be employed as an adaptive surrogate. First, minimizing the risk $R(\theta, \delta_a) = E_\theta\|(1-a)X - \theta\|^2 = (1-a)^2 p\sigma^2 + a^2\|\theta\|^2$ with respect to the real number $a$ yields the minimizer $a_{\mathrm{opt},1} = p\sigma^2/(p\sigma^2 + \|\theta\|^2)$. This suggests an oracle version of (1), $\delta_{a_{\mathrm{opt},1}}(X) = (1 - p\sigma^2/(p\sigma^2 + \|\theta\|^2))X$. We may directly estimate the denominator of the shrinkage function, and consequently estimate $a_{\mathrm{opt},1}$ by $\hat{a}_{\mathrm{opt},1} = p\sigma^2/\|X\|^2$. The resulting estimator is a James–Stein estimator, $\delta_{\hat{a}_{\mathrm{opt},1}}(X) = (1 - p\sigma^2/\|X\|^2)X = \delta_p^{\mathrm{JS}}(X)$.
This derivation suggests some interesting properties of James–Stein estimators. The quality of the estimator of $a_{\mathrm{opt},1}$ improves as $p$ increases, which suggests that James–Stein estimators may have better risk properties for larger $p$. In addition, the identity $R(\theta, \delta_{a_{\mathrm{opt},1}}) = R(\theta, \delta_0)\,\|\theta\|^2/(p\sigma^2 + \|\theta\|^2)$ shows that the largest improvement over $\delta_0$ occurs when $\|\theta\|^2$ is close to zero.
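The oracle risk identity can be checked numerically. In this sketch (the dimension and mean vector are arbitrary illustrations), the Monte Carlo risk of the oracle estimator is compared against the closed form $p\sigma^2\|\theta\|^2/(p\sigma^2 + \|\theta\|^2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
p, sigma, n_rep = 10, 1.0, 200_000
theta = np.full(p, 0.3)          # hypothetical mean; the oracle is allowed to use it
t2 = np.sum(theta**2)

a_opt = p * sigma**2 / (p * sigma**2 + t2)
X = theta + sigma * rng.standard_normal((n_rep, p))
delta = (1 - a_opt) * X          # oracle linear shrinkage

risk_mc = np.mean(np.sum((delta - theta)**2, axis=1))
risk_formula = p * sigma**2 * t2 / (p * sigma**2 + t2)
print(risk_mc, risk_formula)     # both ≈ 0.826
```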
Second, minimizing the loss $L(\theta, \delta_a) = \|(1-a)X - \theta\|^2$ with respect to $a$ yields the minimizer $a_{\mathrm{opt},2} = (\|X\|^2 - X^T\theta)/\|X\|^2$. Estimating the numerator by $p\sigma^2$, as before, produces $\hat{a}_{\mathrm{opt},2} = p\sigma^2/\|X\|^2$, and the resulting estimator is the same James–Stein estimator. Notice that the calculations in the two approaches above did not use normality.
Third, we consider a Bayesian model. Assume that $\theta \sim \mathcal{N}_p(0, \tau^2 I_p)$ for unknown $\tau^2$, and still that $X \mid \theta \sim \mathcal{N}_p(\theta, \sigma^2 I_p)$ for known $\sigma^2$. Minimizing the posterior expected loss $E[L(\theta, \delta_a(X)) \mid X]$ produces the Bayes estimator, which by the Brown–Stein formula is $\delta_{a_{\mathrm{opt},3}}(X) = (1 - \sigma^2/(\sigma^2 + \tau^2))X$, with $a_{\mathrm{opt},3} = \sigma^2/(\sigma^2 + \tau^2)$. To estimate the shrinkage function, we use the uniformly minimum variance unbiased estimator of $1/(\sigma^2 + \tau^2)$, which is $(p-2)/\|X\|^2$. This suggests the estimator $\delta_{p-2}^{\mathrm{JS}}(X) = (1 - (p-2)\sigma^2/\|X\|^2)X$. Notice that the scale parameter $\tau^2$ of the prior was estimated from the data, making this an empirical Bayes estimator.
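The unbiasedness behind the empirical Bayes step can be verified by simulation: marginally $X \sim \mathcal{N}_p(0, (\sigma^2 + \tau^2)I_p)$, and $(p-2)/\|X\|^2$ is unbiased for $1/(\sigma^2 + \tau^2)$. A sketch with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
p, sigma2, tau2, n_rep = 8, 1.0, 2.0, 400_000

# Marginally, X ~ N_p(0, (sigma2 + tau2) I_p)
X = np.sqrt(sigma2 + tau2) * rng.standard_normal((n_rep, p))
est = (p - 2) / np.sum(X**2, axis=1)

print(np.mean(est), 1 / (sigma2 + tau2))   # both ≈ 0.333
```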
In this section, we consider estimators of the form $\delta(X) = X + \sigma^2 g(X)$, where the shrinkage function $g : \mathbb{R}^p \to \mathbb{R}^p$ satisfies $E_\theta[\|g(X)\|^2] < \infty$. This finiteness constraint is necessary and sufficient for the risk
of $\delta(X)$ to be finite. The derivation of an unbiased estimator of the risk of $\delta(X)$ is the main focus of this section, following the development due to Stein[6,7]. The risk $R(\theta, \delta)$ equals
$$E_\theta\|X + \sigma^2 g(X) - \theta\|^2 = p\sigma^2 + \sigma^4 E_\theta[\|g(X)\|^2] + 2\sigma^2 E_\theta[(X - \theta)^T g(X)] \qquad (3)$$
and involves the mean $\theta$ only through the last summand, $E_\theta[(X - \theta)^T g(X)]$.
Amazingly, there exists a simple unbiased estimator of the last summand of (3), obtained from Stein's integration-by-parts formula. Let $X \sim \mathcal{N}_p(\theta, \sigma^2 I_p)$, and let $g : \mathbb{R}^p \to \mathbb{R}^p$ be weakly differentiable (see Appendix A.1 of Fourdrinier et al.[8]). Define the divergence $\operatorname{div} g(x) = \sum_{i=1}^{p} \partial_i g_i(x)$. Then Stein's lemma states that
$$E_\theta[(X - \theta)^T g(X)] = \sigma^2 E_\theta[\operatorname{div} g(X)]$$
provided that, for each $i = 1, \ldots, p$, either $E_\theta[|(X_i - \theta_i) g_i(X)|] < \infty$ or $E_\theta[|\partial_i g_i(X)|] < \infty$.
From Stein's lemma, it follows that $p\sigma^2 + \sigma^4(\|g(X)\|^2 + 2\operatorname{div} g(X))$ is an unbiased estimator of the risk $R(\theta, \delta(X))$. This estimator is known as Stein's unbiased risk estimator, or SURE. Indeed, since $X$ is complete sufficient, SURE is also the minimum variance unbiased estimator. Since the risk of $\delta_0(X) = X$ equals $p\sigma^2$, it follows immediately that the estimator $\delta(X) = X + \sigma^2 g(X)$ is minimax if $\|g(X)\|^2 + 2\operatorname{div} g(X) \leq 0$ and dominates $\delta_0(X)$ if the inequality is strict on a set of positive measure.
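For the James–Stein shrinkage function $g(x) = -a x/\|x\|^2$, one has $\|g(x)\|^2 = a^2/\|x\|^2$ and $\operatorname{div} g(x) = -a(p-2)/\|x\|^2$, so SURE is available in closed form. The following sketch (parameter values are illustrative) checks that the average of SURE, which never uses $\theta$, matches the Monte Carlo risk:

```python
import numpy as np

rng = np.random.default_rng(3)
p, sigma, n_rep = 6, 1.0, 300_000
a = p - 2
theta = np.linspace(-1.0, 1.0, p)   # hypothetical mean, unknown to SURE

X = theta + sigma * rng.standard_normal((n_rep, p))
norm2 = np.sum(X**2, axis=1)

# delta(X) = X + sigma^2 g(X) with g(x) = -a x / ||x||^2
delta = (1 - a * sigma**2 / norm2[:, None]) * X
true_risk = np.mean(np.sum((delta - theta)**2, axis=1))

# SURE = p sigma^2 + sigma^4 (||g||^2 + 2 div g), computed from X alone
sure = p * sigma**2 + sigma**4 * (a**2 - 2 * a * (p - 2)) / norm2
print(true_risk, np.mean(sure))     # agree up to Monte Carlo error
```

Because $a^2 - 2a(p-2) < 0$ here, every realization of SURE falls below $p\sigma^2$, illustrating the minimaxity condition above.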
We are now in a position to study the risk properties of James–Stein estimators. After doing so, we review
a more general class of estimators, some of which dominate 𝛿0 or other James–Stein estimators.
A James–Stein estimator $\delta_a^{\mathrm{JS}}(X)$ dominates $\delta_0(X) = X$ when $0 < a < 2(p-2)$ and is minimax when $0 \leq a \leq 2(p-2)$, for all $p \geq 3$; further, the uniformly optimal choice of $a$ is $a = p - 2$ for $p \geq 3$. The risk $R(0, \delta_{p-2}^{\mathrm{JS}})$ of the optimal James–Stein estimator is $2\sigma^2$ for all $p \geq 3$.
An undesirable property of James–Stein estimators is that the shrinkage factor $1 - a\sigma^2/\|X\|^2$ becomes negative and arbitrarily large in magnitude as $\|X\|^2$ approaches zero. A possible remedy is to take the positive part of this factor, that is, $\delta_a^{\mathrm{JS}+}(X) = (1 - a\sigma^2/\|X\|^2)^+ X$. This positive-part James–Stein estimator $\delta_a^{\mathrm{JS}+}$ dominates its corresponding James–Stein estimator $\delta_a^{\mathrm{JS}}$ for any $a > 0$.
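Both the dominance of the positive-part estimator and the risk value $2\sigma^2$ of the optimal James–Stein estimator at the origin can be seen in a simulation at $\theta = 0$, where the improvement is largest (the dimension is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
p, sigma, n_rep = 5, 1.0, 400_000
a = p - 2                            # the uniformly optimal choice
theta = np.zeros(p)

X = theta + sigma * rng.standard_normal((n_rep, p))
shrink = 1 - a * sigma**2 / np.sum(X**2, axis=1, keepdims=True)

risk_js = np.mean(np.sum((shrink * X)**2, axis=1))                 # ≈ 2 sigma^2
risk_js_plus = np.mean(np.sum((np.maximum(shrink, 0) * X)**2, axis=1))
print(risk_js, risk_js_plus)         # the positive part does strictly better
```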
A more general modification of a James–Stein estimator is a Baranchik-type estimator[9], defined by $\delta_{a,r}^{B}(X) = (1 - a\sigma^2 r(\|X\|^2)/\|X\|^2)X$, where $r : \mathbb{R}_{\geq 0} \to \mathbb{R}$. For example, choosing $r_+(s) = \min\{s/(a\sigma^2), 1\}$ produces $\delta_{a,r_+}^{B}(X) = \delta_a^{\mathrm{JS}+}(X)$, a positive-part James–Stein estimator. It can be shown that a Baranchik-type estimator is minimax for $p \geq 3$ when $0 \leq a \leq 2(p-2)$, $0 \leq r(\cdot) \leq 1$, and $r(\cdot)$ is absolutely continuous and nondecreasing.
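Taking $r_+(s) = \min\{s/(a\sigma^2), 1\}$, the Baranchik form reproduces the positive-part James–Stein estimator exactly, which is easy to confirm pointwise; $p$ and $a$ below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
p, a, sigma = 7, 5.0, 1.0
X = rng.standard_normal((1000, p))           # arbitrary evaluation points
s = np.sum(X**2, axis=1, keepdims=True)

r_plus = np.minimum(s / (a * sigma**2), 1.0)
baranchik = (1 - a * sigma**2 * r_plus / s) * X     # Baranchik with r = r_plus
js_plus = np.maximum(1 - a * sigma**2 / s, 0.0) * X # positive-part James–Stein
assert np.allclose(baranchik, js_plus)
```

When $s < a\sigma^2$ the factor $a\sigma^2 r_+(s)/s$ equals one, so the shrinkage coefficient is floored at zero, exactly as the positive part requires.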
Stein's lemma can also be used to show minimaxity of some Bayes estimators $E[\theta \mid X]$. The estimator in (2) is minimax if $\Delta\sqrt{m(x)} \leq 0$ for all $x \in \mathbb{R}^p$, where $\Delta$ denotes the weak Laplacian, and $E_\theta[\|\nabla m(X)/m(X)\|^2] < \infty$.
Proofs and much further detail on all of the topics discussed here are available in Fourdrinier et al.[8]. Also discussed there are estimators that shrink toward subspaces, estimators that adaptively shrink toward one of several targets, estimators that do not rely on a known error variance, and risk properties under more general loss functions and error distributions.
References
[1] Lehmann, E.L. and Casella, G. (2006) Theory of Point Estimation, Springer, New York.
[2] Stein, C.M. (1956) Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution. Proc. 3rd
Berkeley Symposium, vol. 1, pp. 197–206.
[3] Efron, B. and Morris, C. (1977) Stein’s paradox in statistics. Sci. Am., 236 (5), 119–127.
[4] James, W. and Stein, C.M. (1961) Estimation with Quadratic Loss. Proc. 4th Berkeley Symposium, vol. 1, pp. 361–379.
[5] Brown, L.D. (1971) Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann. Math.
Statist., 42 (3), 855–903.
[6] Stein, C.M. (1973) Estimation of the Mean of a Multivariate Normal Distribution. Proc. Prague Symp. on Asymptotic
Statistics, pp. 345–381.
[7] Stein, C.M. (1981) Estimation of the mean of a multivariate normal distribution. Ann. Statist., 9 (6), 1135–1151.
[8] Fourdrinier, D., Strawderman, W.E., and Wells, M.T. (2018) Shrinkage Estimation, Springer, New York.
[9] Baranchik, A.J. (1970) A family of minimax estimators of the mean of a multivariate normal distribution. Ann. Math. Statist.,
41 (2), 642–645.