Michael L. Stein, Interpolation of Spatial Data: Some Theory for Kriging. Springer Series in Statistics. Springer-Verlag, New York, 1999.
Advisors:
P. Bickel, P. Diggle, S. Fienberg, K. Krickeberg,
I. Olkin, N. Wermuth, S. Zeger
With 27 Illustrations
Springer
Michael L. Stein
Department of Statistics
University of Chicago
Chicago, IL 60637
USA
the mid 1980s I was seeking a way to obtain an asymptotic theory to sup-
port this general understanding. The asymptotic framework I had in mind
was to take more and more observations in a fixed and bounded region of
space, which I call fixed-domain asymptotics. Using this approach, I sus-
pected that it should generally be the case that only the behavior of the
semivariogram near the origin matters asymptotically for determining the
properties of kriging predictors. Unfortunately, I had no idea how to prove
such a result except in a few very special cases. However, I did know of
an example in which behavior away from the origin of the semivariogram
could have an asymptotically nonnegligible impact on the properties of
kriging predictors. Specifically, as described in 3.5, the semivariograms cor-
responding to exponential and triangular autocovariance functions have the
same behavior near the origin, but optimal linear interpolants under the
two models do not necessarily have similar asymptotic behavior. I believed
that there should be some mathematical formulation of the problem that
would exclude the "pathological" triangular autocovariance function and
would allow me to obtain a general theorem on asymptotic properties of
kriging predictors. Soon after arriving at the University of Chicago in the
fall of 1985, I was browsing through the library and happened upon Gaus-
sian Random Processes by Ibragimov and Rozanov (1978). I leafedthrough
the book and my initial reaction was to dismiss it as being too difficult for
me to read and in any case irrelevant to my research interests. Fortunately,
sitting among all the lemmas and theorems and corollaries in this book was
a single figure on page 100 showing plots of an exponential and triangu-
lar autocovariance function. The surrounding text explained how Gaussian
processes corresponding to these two autocovariance functions could have
orthogonal measures, which did not make an immediate impression on me.
However, the figure showing the two autocovariance functions stuck in my
mind and the next day I went back to the library and checked out the book.
I soon recognized that equivalence and orthogonality of Gaussian measures
was the key mathematical concept I needed to prove results connecting the
behavior of the semivariogram at the origin to the properties of kriging
predictors. Having devoted a great amount of effort to this topic in subse-
quent years, I am now more firmly convinced than ever that the confluence
of fixed-domain asymptotics and equivalence and orthogonality of Gaussian
measures provides the best mathematical approach for the study of kriging
based on estimated covariance structures. I would like to thank Ibragimov
and Rozanov for including that single figure in their work.
This monograph represents a synthesis of my present understanding of
the connections between the behavior of semivariograms at the origin, the
properties of kriging predictors and the equivalence and orthogonality of
Gaussian measures. Without an understanding of these connections, I be-
lieve it is not possible to develop a full appreciation of kriging. Although
there is a lot of mathematics here, I frequently discuss the repercussions of
the mathematical results on the practice of kriging. Readers whose main
interests are in the practice of kriging should consider skipping most of the
proofs on a first reading and focus on the statements of results and the
related discussions. Readers who find even the statements of the theorems
difficult to digest should carefully study the numerical results in Chap-
ters 3 and 6 before concluding that they can ignore the implications of this
work. For those readers who do plan to study at least some of the proofs, a
background in probability theory at the level of, say, Billingsley (1995) and
some familiarity with Fourier analysis and Hilbert spaces should be suffi-
cient. The necessary second-order theory of random fields is developed in
Chapter 2 and results on equivalence and orthogonality of Gaussian mea-
sures in Chapter 4. Section 1.3 provides a brief summary of the essential
results on Hilbert spaces needed here.
In selecting topics for inclusion, I have tried to stick to topics pertinent
to kriging about which I felt I had something worthwhile to say. As a con-
sequence, for example, there is little here about nonlinear prediction and
nothing about estimation for non-Gaussian processes, despite the impor-
tance of these problems. In addition, no mention is made of splines as a way
of interpolating spatial data, even though splines and kriging are closely
related and an extensive literature exists on the use of splines in statistics
(Wahba 1990). Thus, this monograph is not a comprehensive guide to sta-
tistical approaches to spatial interpolation. Part I of Cressie (1993) comes
much closer to providing a broad overview of kriging.
This work is quite critical of some aspects of how kriging is com-
monly practiced at present. In particular, I criticize some frequently used
classes of models for semivariograms and describe ways in which empirical
semivariograms can be a misleading tool for making inferences about semi-
variograms. Some of this criticism is based on considering what happens
when the underlying random field is differentiable and measurement errors
are negligible. In some areas of application, nondifferentiable random fields
and substantial measurement errors may be common, in which case, one
could argue that my criticisms are not so relevant to those areas. However,
what I am seeking to accomplish here is not to put forward a set of method-
ologies that will be sufficient in some circumscribed set of applications, but
to suggest a general framework for thinking about kriging that makes sense
no matter how smooth or rough the underlying random field is and whether or not there is nonnegligible measurement error. Furthermore, I contend that the
common assumption that the semivariogram of the underlying random field
behaves linearly in a neighborhood of the origin (which implies the random
field is not differentiable) is often made out of habit or ignorance and not
because it is justified.
For those who want to know what is new in this monograph, I provide a
summary here. All of 3.6 and 3.7, which study the behavior of predictions
with evenly spaced observations in one dimension as the spacing between
neighboring observations tends to 0, are new. Section 4.3 mixes old and
new results on the asymptotic optimality of best linear predictors under
Contents

Preface

1 Linear Prediction
1.1 Introduction
1.2 Best linear prediction
Exercises
1.3 Hilbert spaces and prediction
Exercises
1.4 An example of a poor BLP
Exercises
1.5 Best linear unbiased prediction
Exercises
1.6 Some recurring themes
The Matern model
BLPs and BLUPs
Inference for differentiable random fields
Nested models are not tenable
1.7 Summary of practical suggestions
Exercise

2.3 Elementary properties of autocovariance functions
Exercise
2.4 Mean square continuity and differentiability
Exercises
2.5 Spectral methods
Spectral representation of a random field
Bochner's Theorem
Exercises
2.6 Two corresponding Hilbert spaces
An application to mean square differentiability
Exercises
2.7 Examples of spectral densities on R
Rational spectral densities
Principal irregular term
Gaussian model
Triangular autocovariance functions
Matern class
Exercises
2.8 Abelian and Tauberian theorems
Exercises
2.9 Random fields with nonintegrable spectral densities
Intrinsic random functions
Semivariograms
Generalized random fields
Exercises
2.10 Isotropic autocovariance functions
Characterization
Lower bound on isotropic autocorrelation functions
Inversion formula
Smoothness properties
Matern class
Spherical model
Exercises
2.11 Tensor product autocovariances
Exercises

Exercises
3.5 Prediction with the wrong spectral density
Examples of interpolation
An example with a triangular autocovariance function
More criticism of Gaussian autocovariance functions
Examples of extrapolation
Pseudo-BLPs with spectral densities misspecified at high frequencies
Exercises
3.6 Theoretical comparison of extrapolation and interpolation
An interpolation problem
An extrapolation problem
Asymptotics for BLPs
Inefficiency of pseudo-BLPs with misspecified high frequency behavior
Presumed mses for pseudo-BLPs with misspecified high frequency behavior
Pseudo-BLPs with correctly specified high frequency behavior
Exercises
3.7 Measurement errors
Some asymptotic theory
Exercises
3.8 Observations on an infinite lattice
Characterizing the BLP
Bound on fraction of mse of BLP attributable to a set of frequencies
Asymptotic optimality of pseudo-BLPs
Rates of convergence to optimality
Pseudo-BLPs with a misspecified mean function
Exercises

B Symbols
References
Index
1
Linear Prediction
1.1 Introduction
This book investigates prediction of a spatially varying quantity based on
observations of that quantity at some set of locations. Although the notion
of prediction sometimes suggests the assessment of something that has not
yet happened, here I take it to mean the assessment of any random quantity
that is presently not known exactly. This work focuses on quantities that
vary continuously in space and for which observations are made without
error, although Sections 3.7, 4.2, 4.3, 6.6 and 6.8 do address some issues re-
garding measurement errors. Our goals are to obtain accurate predictions
and to obtain reasonable assessments of the uncertainty in these predic-
tions. The approach to prediction I take is to consider the spatially varying
quantity to be a realization of a real-valued random field, that is, a family
of random variables whose index set is $\mathbb{R}^d$.
Much of this work focuses on the properties of predictors that are linear
functions of the observations, although 1.4 describes a cautionary example
on the potential inefficiencies of "optimal" linear predictors. Section 1.2
defines and derives best linear prediction of random fields based on a fi-
nite number of observations. Section 1.3 briefly reviews some properties of
Hilbert spaces, which are a powerful tool for studying general linear predic-
tion problems. Section 1.5 considers best linear unbiased prediction, which
applies when the mean function of the random field is known up to a vec-
tor of linear parameters. Best linear unbiased prediction is frequently used
in spatial statistics where it is commonly called universal kriging. Section
1.6 summarizes some basic themes of this work and briefly considers how
these themes relate to practical issues in the prediction of random fields.
Section 1.7 succinctly states my main recommendations for the practice of
predicting random fields. Readers who can only spare 30 seconds on this
book might want to skip directly to 1.7.
Chapter 2 provides a detailed discussion of properties of random fields
relevant to this work. For now, let us introduce some essential definitions
and notation. For a random variable X, I use E(X) to indicate its ex-
pected value and var(X) for its variance. For random variables X and Y,
$\mathrm{cov}(X,Y) = E(XY) - E(X)E(Y)$ is the covariance of X and Y. Suppose $\{Z(x) : x \in \mathbb{R}^d\}$ is a real-valued random field on $\mathbb{R}^d$ and $x, y \in \mathbb{R}^d$. The
mean function of Z is EZ(x), which I often denote by m(x). The covari-
ance function is cov{ Z(x), Z(y)}, which I often denote by K(x, y). Finally,
a random field is Gaussian if all of its finite-dimensional distributions are
Gaussian (multivariate normal). See Appendix A for a brief summary of
results on multivariate normal distributions.
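Since a Gaussian random field is completely determined by its mean function m and covariance function K, it can be simulated at any finite set of locations by factoring the covariance matrix. The following sketch is my own illustration, not from the text; the exponential covariance and the grid of locations are arbitrary choices.

```python
import numpy as np

def simulate_gaussian_field(locs, mean_fn, cov_fn, rng=None):
    """Draw one realization of a Gaussian random field at `locs`.

    locs: (n, d) array of locations; mean_fn maps (n, d) -> (n,);
    cov_fn maps two location arrays to an (n, n) covariance matrix.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = mean_fn(locs)
    K = cov_fn(locs, locs)
    # Cholesky factor of K; the jitter guards against numerical semidefiniteness.
    L = np.linalg.cholesky(K + 1e-10 * np.eye(len(locs)))
    return m + L @ rng.standard_normal(len(locs))

# Illustrative choices: zero mean and an exponential covariance on [0, 1].
locs = np.linspace(0.0, 1.0, 200)[:, None]
mean_fn = lambda x: np.zeros(len(x))
cov_fn = lambda x, y: np.exp(-np.abs(x - y[:, 0]))  # K(x, y) = exp(-|x - y|)

z = simulate_gaussian_field(locs, mean_fn, cov_fn)
```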
Exercises
1 Show that the BLP is unique in the sense that if $\lambda_0 + \lambda^T Z$ and $\mu_0 + \mu^T Z$ are both BLPs for $Z(x_0)$, then $E(\lambda_0 + \lambda^T Z - \mu_0 - \mu^T Z)^2 = 0$.
2 Suppose $X_0$, $X_1$ and $X_2$ are random variables with mean 0 and variance 1, $\mathrm{cov}(X_0, X_1) = \mathrm{cov}(X_1, X_2) = \rho$ with $|\rho| \le 2^{-1/2}$ and $\mathrm{cov}(X_0, X_2) = 0$. Find the BLP of $X_0$ based on $X_1$ and $X_2$. Find the mse of the BLP. Note that unless $\rho = 0$, $X_2$ plays a role in the BLP despite the fact that it is uncorrelated with $X_0$. Why is there the restriction $|\rho| \le 2^{-1/2}$?
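The restriction $|\rho| \le 2^{-1/2}$ in Exercise 2 can be checked numerically. Below is a small sketch (my own, not part of the text) that solves the normal equations $Kc = k$ for the BLP weights and evaluates the mse; the value rho = 0.5 is arbitrary.

```python
import numpy as np

def blp_weights_and_mse(K, k, var0):
    """Weights and mse of the BLP of X0 from X = (X1, ..., Xn):
    the weights solve K c = k; mse = var(X0) - k^T K^{-1} k."""
    c = np.linalg.solve(K, k)
    return c, var0 - k @ c

# Setup of Exercise 2 with an illustrative value of rho.
rho = 0.5
K = np.array([[1.0, rho], [rho, 1.0]])   # cov matrix of (X1, X2)
k = np.array([rho, 0.0])                 # cov(X0, X1), cov(X0, X2)
c, mse = blp_weights_and_mse(K, k, var0=1.0)
# c = (rho, -rho^2) / (1 - rho^2); mse = 1 - rho^2/(1 - rho^2),
# which is nonnegative only when |rho| <= 2**-0.5.
```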
We are mostly concerned with Hilbert spaces for which scalar multiplication
is restricted to reals and the inner product is real.
For any subset X of a Hilbert space $\mathcal{H}$, the linear manifold spanned by X, denoted by $\mathcal{M}_0$, is the set of all linear combinations $a_1x_1 + \cdots + a_nx_n$ with n finite and $x_1, \ldots, x_n \in X$. The closed linear manifold spanned by X, denoted by $\mathcal{M}$, is just $\mathcal{M}_0$ together with its limit points under the metric defined by the inner product. Note that $\mathcal{M}$ is itself necessarily a Hilbert space. Any set whose closed linear manifold is $\mathcal{M}$ is called a basis for $\mathcal{M}$, so that X is automatically one basis for $\mathcal{M}$. In this work, we generally only consider Hilbert spaces with finite or countable bases. Every separable Hilbert space possesses a finite or countable basis (Exercise 4).
For studying prediction, the crucial concept is that of projection of an element of a Hilbert space onto a subspace. Suppose $\mathcal{H}$ is a Hilbert space and $\mathcal{G}$ a subspace. Given $h \in \mathcal{H}$, there exists a unique element $g \in \mathcal{G}$ such that
Exercises
4 Show that every separable Hilbert space has a finite or countable basis.
5 For a Hilbert space $\mathcal{H}$, a subspace $\mathcal{G}$ and $h \in \mathcal{H}$, show that there is a unique element $g \in \mathcal{G}$ such that (3) holds.
Exercises
8 Show that for Z as defined in this section, $K(s,t) = \lambda(2 - |s-t|)_+$.
FIGURE 1. A partial realization of the process Z described in 1.4. The xs on the horizontal axis indicate events of the Poisson process N. Values for Z(t) are plotted for $t \in R = [-2,-1] \cup [1,2]$.
for some p" where 0 is a matrix of zeroes. This set of linear equations has
a solution if and only if m(xo) E C(MT) (Exercise 12). If K and M are of
full rank, then
For making predictions about $Z(x_0)$, the natural Bayesian solution is to use the conditional distribution of $Z(x_0)$ given Z but averaging over the posterior of $\beta$ given Z. This distribution is known as the predictive distribution of $Z(x_0)$ (Zellner 1971, Section 2.8) and is given by (Exercise 17)
Exercises
11 Show that if $\hat\beta$ is the generalized least squares estimator for $\beta$, then for any fixed vector $q \in \mathbb{R}^p$, $q^T\hat\beta$ is the BLUP for $q^T\beta$. Since the quantity being predicted here is not random, $q^T\hat\beta$ is more commonly called the best linear unbiased estimator. Thus, we have that best linear unbiased estimation is just a special case of best linear unbiased prediction.
12 Show that if a LUP exists, then the BLUP exists and is unique in the
sense that the BLP was shown to be unique in Exercise 1.
relevant. Sections 3.4-3.6, 6.8 and 6.9 provide some comparisons between
interpolation and extrapolation.
This focus on interpolation leads to the second theme, which is that the
properties of interpolation schemes depend mainly on the local behavior of
the random field under study. In particular, 3.4-3.6 provide theoretical and
numerical evidence that the behavior of a random field over longer distances
is much less relevant when interpolating than when extrapolating. Ac-
cordingly, Chapter 2, which provides background material on second-order
properties of random fields, emphasizes their local behavior. Chapter 2 fo-
cuses on random fields on $\mathbb{R}^d$ with covariance functions K(x,y) depending only on x − y, in which case I call the function $K(x-y) = K(x,y)$ the autocovariance function of the random field. If the autocovariance function is continuous, then it can be written as the Fourier transform of a positive finite measure. In most cases of practical interest, this measure has a density with respect to Lebesgue measure known as the spectral density. More specifically, the spectral density f satisfies
$$K(x) = \int_{\mathbb{R}^d} e^{i\omega^T x} f(\omega)\,d\omega$$
for all $x \in \mathbb{R}^d$. It turns out that the local behavior of a random field is intimately related to how the spectral density f behaves for large values of $|\omega|$. Generally speaking, the more quickly the spectral density decreases as $|\omega|$ increases, the smoother the random field.
As in many areas of statistics, it is not possible to make much progress on
the theory of spatial interpolation from finite sample results. Thus, much of
the theory in the following chapters is asymptotic. The third theme of this
work is that the most appropriate asymptotic framework for problems of
spatial interpolation is to consider taking more and more observations in a
fixed region, which I call fixed-domain asymptotics. Most existing asymp-
totic theory concerning inference for stochastic processes and random fields
based on discrete observations allows the observation region to grow with
the number of observations, which I call increasing-domain asymptotics.
Chapter 3, and 3.3 in particular, detail my arguments for preferring fixed-
domain asymptotics for studying spatial interpolation. For now, I would
point out that if the goal is to develop a theory that shows the relation-
ship between the local behavior of a random field and the properties of
interpolation methods, then the fixed-domain approach is quite natural in
that the degree of differentiability of the random field, which is a funda-
mental aspect of its local behavior, plays a central role in any fixed-domain
asymptotic results.
The final theme is the connection between what aspects of a random
field model are important for purposes of spatial interpolation and what
aspects of the model can be well estimated from available data. This issue
is particularly crucial when using fixed-domain asymptotics because there
will commonly be parameters of models that cannot be consistently estimated.
See Journel and Huijbregts (1978, p. 167), Wackernagel (1995, p. 95) and Goovaerts (1997, p. 159) for
examples where such models are advocated or employed. However, because
all spherical semivariograms correspond to random fields with the same lo-
cal behavior, there is little to be gained for purposes of spatial interpolation
in employing such models. Furthermore, there is little hope of estimating
the parameters of such models with any degree of accuracy for datasets
of the size that generally occur in geological applications. I believe such
models would not be employed if users had a proper appreciation of the
inherent uncertainties in empirical semivariograms.
2.1 Preliminaries
This chapter provides the necessary background on random fields for under-
standing the subsequent chapters on prediction and inference for random
fields. The focus here is on weakly stationary random fields (defined later
in this section) and the associated spectral theory. Some previous exposure
to Fourier methods is assumed. A knowledge of the theory of characteristic
functions at the level of a graduate course in probability (see, for example,
Billingsley (1995), Chung (1974), or Feller (1971)) should, for the most
part, suffice. When interpolating a random field, the local behavior of the
random field turns out to be critical (see Chapter 3). Accordingly, this chap-
ter goes into considerable detail about the local behavior of random fields
and its relationship to spectral theory.
For a real random field Z on R with $E\{Z(x)^2\} < \infty$ for all $x \in R$, the covariance function $K(x,y) = \mathrm{cov}\{Z(x),Z(y)\}$ must satisfy
$$\sum_{j,k=1}^{n} c_j c_k K(x_j, x_k) \ge 0 \qquad (1)$$
for all finite n, all $x_1, \ldots, x_n \in R$ and all real $c_1, \ldots, c_n$, which follows by noting
Stationarity
If we do not make any assumptions restricting the class of random fields
we wish to consider, making inferences about its probability law from ob-
serving a single realization of the field is hopeless. A common simplifying
assumption is that the probabilistic structure in some sense looks similar
in different parts of R. Supposing $R = \mathbb{R}^d$ for instance, one way to define this concept is through strict stationarity: for all finite n, $x_1, \ldots, x_n \in \mathbb{R}^d$, $t_1, \ldots, t_n \in \mathbb{R}$ and $x \in \mathbb{R}^d$,
$$\Pr\{Z(x_1+x) \le t_1, \ldots, Z(x_n+x) \le t_n\} = \Pr\{Z(x_1) \le t_1, \ldots, Z(x_n) \le t_n\}.$$
A different type of stationarity is defined in terms of the first two moments of Z. Suppose the covariance function of Z depends on x and y only through x − y. Then there is a function K on $\mathbb{R}^d$, which I call the autocovariance function for Z, such that $\mathrm{cov}\{Z(x),Z(y)\} = K(x-y)$ for all x and y in $\mathbb{R}^d$. A random field is called weakly stationary if it has finite second moments, its mean function is constant and it possesses an autocovariance function. Note that a strictly stationary random field with finite second moments is also weakly stationary. For describing strength of associations between random variables it is more natural to consider correlations than covariances, so we sometimes make use of the autocorrelation function of a weakly stationary random field, defined as $C(x) = K(x)/K(0)$, assuming $K(0) > 0$.
Since weak stationarity is a less restrictive assumption than strict sta-
tionarity whenever the second moments are finite, it is tempting to claim in
practice that one is only assuming weak stationarity and then make infer-
ences that only depend on specifying the first two moments of the random
field. This temptation is perhaps encouraged by results in discrete time
series showing that certain asymptotic properties of the periodogram (the
squared modulus of the discrete Fourier transform of the observations) do
not depend on the time series being Gaussian (Priestley 1981, Section 6.2).
Isotropy
Stationarity can be thought of as an invariance property under the translation group of transformations of the coordinates. For a random field on $\mathbb{R}^d$, we can also consider invariance under rotations and reflections. I call a random field Z on $\mathbb{R}^d$ strictly isotropic if its finite-dimensional joint distributions are invariant under all rigid motions. That is, for any orthogonal d × d matrix H and any $x \in \mathbb{R}^d$,
$$\Pr\{Z(Hx_1+x) \le t_1, \ldots, Z(Hx_n+x) \le t_n\} = \Pr\{Z(x_1) \le t_1, \ldots, Z(x_n) \le t_n\}$$
for all finite n, $x_1, \ldots, x_n \in \mathbb{R}^d$ and $t_1, \ldots, t_n \in \mathbb{R}$. A random field on $\mathbb{R}^d$ is weakly isotropic if there exists a constant m and a function K on $[0,\infty)$ such that $m(x) = m$ and $\mathrm{cov}\{Z(x),Z(y)\} = K(|x-y|)$ for all $x, y \in \mathbb{R}^d$, where $|\cdot|$ indicates Euclidean distance. I call the function K on $[0,\infty)$ the isotropic autocovariance function for Z. Note that I am implicitly assuming a (strictly/weakly) isotropic random field is always (strictly/weakly) stationary. The isotropy condition amounts to assuming there is no reason to distinguish one direction from another for the random field under consideration. A simple but useful extension of isotropic random fields is to random fields that become isotropic after a linear transformation of coordinates. We say Z exhibits a geometric anisotropy if there exists an invertible matrix V such that Z(Vx) is isotropic (Journel and Huijbregts 1978, p. 177).
Exercise
1 Show that a Gaussian random field on $\mathbb{R}^d$ is strictly stationary if and only if it is weakly stationary. Show that a Gaussian random field on $\mathbb{R}^d$ is strictly isotropic if and only if it is weakly isotropic.
$$K(r) = \frac{1}{4\pi}\int_0^{2\pi}\!\!\int_0^{\pi} B(r\cos\phi)\sin\phi\,d\phi\,d\theta = \int_0^1 B(rt)\,dt,$$
where the $V_j$s are random unit vectors independent of $Y_1, \ldots, Y_n$. If, in addition, $V_1, \ldots, V_n$ are independent and uniformly distributed on the unit sphere, then Z has covariance function given by (2). For n large, a central limit effect should make at least the low-order finite-dimensional distributions approximately normal. There may be some advantages in choosing the $V_j$s more systematically to obtain a more evenly spaced distribution on $\partial b_d$. For example, for d = 3, Journel and Huijbregts (1978, p. 503) suggest taking n = 15 and the $V_j$s to be along the lines joining the midpoints of opposite edges of a regular icosahedron centered at the origin.
Note that the approximate Gaussianity of Z should hold even if the $Y_j$s
are not Gaussian due to the central limit theorem effect. Thus, the turning
bands method cannot be used directly to simulate a non-Gaussian random
field. For a random field Z such that, for example, log Z is Gaussian, we can
of course use turning bands to simulate log Z and then transform pointwise
to obtain Z. See Cressie (1993) for further discussion on simulating random
fields.
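A minimal sketch of a turning bands simulation in $\mathbb{R}^3$, assuming Gaussian line processes generated by Cholesky factorization; the function names and the exponential choice of B are mine, not from the text.

```python
import numpy as np

def turning_bands(locs, B, n_lines=15, rng=None):
    """Approximate an isotropic Gaussian field on R^3 by turning bands.

    locs: (m, 3) array of locations; B: autocovariance of the 1-D line
    processes Y_1, ..., Y_n; returns Z(x) = n^{-1/2} sum_j Y_j(x . V_j).
    """
    rng = np.random.default_rng() if rng is None else rng
    m = len(locs)
    z = np.zeros(m)
    for _ in range(n_lines):
        v = rng.standard_normal(3)
        v /= np.linalg.norm(v)              # uniform direction on the sphere
        t = locs @ v                        # projections onto the random line
        K = B(np.abs(t[:, None] - t[None, :]))
        L = np.linalg.cholesky(K + 1e-10 * np.eye(m))
        z += L @ rng.standard_normal(m)     # one Gaussian line process Y_j
    return z / np.sqrt(n_lines)

# Illustrative line covariance; the resulting field then has isotropic
# covariance K(r) = \int_0^1 B(rt) dt, as in (2).
z = turning_bands(np.random.default_rng(0).uniform(0, 1, (50, 3)),
                  B=lambda h: np.exp(-h))
```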
Exercise
2 In using the turning bands method to simulate an isotropic random field on $\mathbb{R}^3$ with K as its isotropic autocovariance function, show that B in step (i) of the algorithm is given by $(d/dr)\{rK(r)\}$.
for all finite n, all real $c_1, \ldots, c_n$ and all $x_1, \ldots, x_n \in \mathbb{R}^d$. This condition is necessary and sufficient for there to exist a weakly stationary random field with autocovariance function K by the same argument as given in 2.1 for positive definite functions on $\mathbb{R}^d \times \mathbb{R}^d$.
Some other properties of positive definite (p.d.) functions include the following.
If $K_1$ and $K_2$ are p.d., then $a_1K_1 + a_2K_2$ is p.d. for all nonnegative $a_1$ and $a_2$. (3)
If $K_1, K_2, \ldots$ are p.d. and $\lim_{n\to\infty} K_n(x) = K(x)$ for all $x \in \mathbb{R}^d$, then K is p.d. (4)
If $K_1$ and $K_2$ are p.d., then $K(x) = K_1(x)K_2(x)$ is p.d. (5)
The proofs of (3) and (4) are straightforward. The easiest way to prove (5) is to consider independent mean 0 Gaussian random fields $Z_1$ and $Z_2$ with autocovariance functions $K_1$ and $K_2$, respectively, and to show that K is the autocovariance function of the random field Z defined by $Z(x) = Z_1(x)Z_2(x)$.
Exercise
3 If $K_\theta$ is a p.d. autocovariance function on $\mathbb{R}^d$ for all $\theta \in \mathbb{R}$ and is continuous in $\theta$ for all x, show that $\int_{\mathbb{R}} K_\theta(x)\,\mu(d\theta)$ is p.d. if $\mu$ is a positive finite measure on $\mathbb{R}$ and $\int_{\mathbb{R}} K_\theta(0)\,\mu(d\theta) < \infty$.
The mean square continuity of Z does not imply that its realizations are continuous. The process Z in Section 1.4 is mean square continuous but Pr(Z is continuous on $\mathbb{R}$) = 0.
If K is continuous at 0, then it is continuous everywhere, which follows by noting
$$|K(x) - K(y)| = |\mathrm{cov}\{Z(x) - Z(y), Z(0)\}| \le [\mathrm{var}\{Z(x) - Z(y)\}\,\mathrm{var}\{Z(0)\}]^{1/2} = [2\{K(0) - K(x-y)\}K(0)]^{1/2} \to 0$$
as $x - y \to 0$,
so that −K″ is positive definite by (4). In Section 2.6 I prove that Z is mean square differentiable if and only if K″(0) exists and is finite, and that if Z is mean square differentiable, then Z′ has autocovariance function −K″. To define higher-order mean square derivatives, we say Z is m-times mean square differentiable if it is (m−1)-times mean square differentiable and $Z^{(m-1)}$ is mean square differentiable. By repeated application of the stated results in the preceding paragraph on the mean square differentiability of a process, it follows that Z is m-times mean square differentiable if and only if $K^{(2m)}(0)$ exists and is finite and, if so, the autocovariance function of $Z^{(m)}$ is $(-1)^m K^{(2m)}$.
The following example shows that Z can have analytic realizations and not be mean square differentiable. Let $Z(t) = \cos(Xt + Y)$, where X and Y are independent random variables with X following a standard Cauchy distribution (i.e., having density $1/\{\pi(1+x^2)\}$ for $x \in \mathbb{R}$) and Y following a uniform distribution on $[0, 2\pi)$.
Exercises
4 Find a p.d. autocovariance function on $\mathbb{R}$ that is discontinuous at t = −1, 0 and 1 and continuous elsewhere.
since the left side equals $E\big|\sum_{j=1}^{n} c_j Z(x_j) - E\sum_{j=1}^{n} c_j Z(x_j)\big|^2$. A function K satisfying this condition for all finite n, all $x_1, \ldots, x_n \in \mathbb{R}^d$ and all complex $c_1, \ldots, c_n$ is said to be a positive definite complex function on $\mathbb{R}^d$.
$$E\{M(\Delta_1)\overline{M(\Delta_2)}\} = 0.$$
I assume such a random measure exists; see, for example, Gihman and Skorohod (1974) for mathematical details. Next consider how to interpret the integral
which is a sum of the form given in (6). It is possible to show that for any x, $Z_n(x)$ converges in $L^2$ (Exercise 7) and the integral on the right side of (7) is defined to be this limit. More specifically, by defining $F_n(\Delta) = \sum_{j \in S_n} F(\Delta_n(j))\,1\{n^{-1}j \in \Delta\}$ we get
$$E\{Z_n(x)\overline{Z_n(y)}\} = \sum_{j \in S_n} \exp\{in^{-1}j^T(x-y)\}\,F(\Delta_n(j)) = \int_{\mathbb{R}^d} \exp\{i\omega^T(x-y)\}\,F_n(d\omega).$$
Since $F_n$ converges weakly to F (Exercise 8) and $\exp\{i\omega^T(x-y)\}$ is bounded and uniformly continuous for $\omega \in [-1,1]^d$,
Bochner's Theorem
It is easy to see that for any finite positive measure F, the function K given in (9) is p.d., since
$$\sum_{j,k=1}^{n} c_j\bar{c}_k K(x_j - x_k) = \sum_{j,k=1}^{n} c_j\bar{c}_k \int_{\mathbb{R}^d} \exp\{i\omega^T(x_j - x_k)\}\,F(d\omega) = \int_{\mathbb{R}^d} \Big|\sum_{j=1}^{n} c_j \exp(i\omega^T x_j)\Big|^2 F(d\omega) \ge 0.$$
Furthermore, all continuous positive definite complex functions are of the form (9) with F a positive measure of finite mass.
Theorem 1 (Bochner's Theorem). A complex-valued function K on $\mathbb{R}^d$ is the autocovariance function for a weakly stationary mean square continuous complex-valued random field on $\mathbb{R}^d$ if and only if it can be represented as in (9), where F is a positive finite measure.
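Bochner's theorem can be illustrated numerically: a function K built as in (9) from a positive finite measure produces a positive semidefinite covariance matrix at any finite set of points. The discrete measure below is an arbitrary choice of mine, not from the text.

```python
import numpy as np

# A discrete positive measure F: atoms at `freqs` with masses `masses`.
rng = np.random.default_rng(2)
freqs = rng.normal(size=20)
masses = rng.uniform(0.1, 1.0, 20)

def K(t):
    # K(t) = sum_k cos(w_k t) F_k, i.e., the Fourier transform of the
    # symmetrized measure, so K is real-valued.
    return np.sum(masses * np.cos(np.outer(t, freqs)), axis=1)

x = np.linspace(0, 1, 30)
C = K((x[:, None] - x[None, :]).ravel()).reshape(30, 30)
print(np.linalg.eigvalsh(C).min() >= -1e-10)   # True: C is p.s.d.
```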
consider the process Y defined by
$$Y(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{e^{i\omega_2 u} - e^{i\omega_1 u}}{iu}\,Z(t-u)\,du \qquad (10)$$
$$\phantom{Y(t)} = \int 1\{\omega_1 < \omega \le \omega_2\}\,e^{i\omega t}\,M(d\omega).$$
Exercises
7 Show that $Z_n(x)$ as defined in (8) converges in $L^2$. If you have trouble proving this, show that the subsequence $Z_{2^n}(x)$ converges in $L^2$.
8 For $Z_n$ as defined in (8), show that $F_n$ converges weakly to F. Show that (9) gives the autocovariance function for Z as defined in (7).
9 For $\omega_1 < \omega_2$, evaluate
$$\int_{-\infty}^{\infty} \frac{e^{i(\omega_2-\omega)u} - e^{i(\omega_1-\omega)u}}{iu}\,du.$$
$$\mathrm{cov}\Big\{\sum_{j=1}^{n} a_j Z(x_j), \sum_{k=1}^{m} b_k Z(y_k)\Big\} = \Big(\sum_{j=1}^{n} a_j \exp(i\omega^T x_j), \sum_{k=1}^{m} b_k \exp(i\omega^T y_k)\Big)_F.$$
Exercises
11 For a sequence of complex-valued functions $T_1, T_2, \ldots$ on $\mathbb{R}$ converging pointwise to the function T, prove that $T_n$ converges in $\mathcal{L}_{\mathbb{R}}(F)$ if and only if it converges to T in $\mathcal{L}_{\mathbb{R}}(F)$. Suggestion: use a subsequence argument similar to the one in the proof of Theorem 19.1 (the completeness of $L^p$ spaces) of Billingsley (1995).
12 For R = [0,1] and $K(t) = e^{-|t|}$, show that every element of $\mathcal{L}_R(F)$ can be written in the form $a + (1+i\omega)\int_0^1 e^{i\omega t}c(t)\,dt$ for some real constant a and real-valued function c that is square-integrable on [0,1]. This result is a special case of (1.3) of Ibragimov and Rozanov (1978, p. 30).
Gaussian model
A somewhat commonly used form for the autocovariance function of a smooth process on $\mathbb{R}$ is $K(t) = ce^{-\alpha t^2}$, for which the corresponding spectral density is $f(\omega) = \frac{1}{2}c(\pi\alpha)^{-1/2}e^{-\omega^2/(4\alpha)}$. Because of its functional form, it is sometimes called the Gaussian model (Journel and Huijbregts 1978).
FIGURE 1. Plots of $e^{-t^2/2}$ (solid line) and $e^{-|t|}(1+|t|)$ (dashed line).
FIGURE 2. Plot of the spectral density $f(\omega) = (1-\cos\omega)/\omega^2$ for the triangular autocovariance function $K(t) = \pi^{-1}(1-|t|)_+$.
Matern class
A class of autocovariance functions that I believe has considerable practical value is obtained from spectral densities of the form $f(\omega) = \phi(\alpha^2 + \omega^2)^{-\nu-1/2}$ for $\nu > 0$, $\phi > 0$ and $\alpha > 0$. The corresponding autocovariance function is
$$K(t) = \frac{\pi^{1/2}\phi}{2^{\nu-1}\Gamma(\nu+\frac{1}{2})\alpha^{2\nu}}(\alpha|t|)^{\nu}\mathcal{K}_{\nu}(\alpha|t|), \qquad (14)$$
where $\mathcal{K}_{\nu}$ is a modified Bessel function. For $\nu$ a noninteger,
$$K(t) = \sum_{j=0}^{m} b_j t^{2j} - \frac{\pi\phi}{\Gamma(2\nu+1)\sin(\nu\pi)}\,|t|^{2\nu} + O(|t|^{2m+2}) \quad \text{as } t \to 0 \qquad (15)$$
for every fixed t. By (4), $e^{-\beta|t|^{\delta}}$ is p.d. for $0 < \delta < 2$ and hence so is $e^{-|t|^{\delta}}$. Finally, $e^{-|t|^{\delta}}$ is not p.d. for $\delta > 2$, which can be seen by noting that the second derivative of this function is 0 at t = 0, which would imply $\mathrm{var}\{Z'(0)\} = 0$. Yaglom (1987b, p. 48) provides some historical notes on determining the positive definiteness of the function $e^{-|t|^{\delta}}$.
Exercises
13 Show that if f is the spectral density of a real process on $\mathbb{R}$ and is rational, then f can be written as in (12).
14 Show that the function $t^4\cos(t^{-3})$ (defined by continuity at t = 0) is of the form $b_0 + b_1t^2 + b_2|t|^3 + O(t^4)$ as $t \to 0$ but is not twice differentiable at 0.
15 Look up Theorem 2.3.1 of Lukacs (1970) and show that Theorem 2 follows.
16 Suppose Z is a weakly stationary process on $\mathbb{R}$ with analytic autocovariance K. Show that $\sum_{j=0}^{n} Z^{(j)}(0)t^j/j! \to Z(t)$ in $L^2$ as $n \to \infty$ for any t > 0.
17 Using the inversion formula (11), show that the spectral density corresponding to $K(t) = c(a-|t|)_+$ is $c\pi^{-1}\{1-\cos(\omega a)\}/\omega^2$.
18 Verify (15) and (16) by looking up the relevant series expansions for modified Bessel functions in, for example, Abramowitz and Stegun (1965). Give explicit expressions for $b_0, \ldots, b_m$ in both cases. Verify (17) and (18).
(19)
Pitman (1968) gives the proof for this general result. I only consider the special case for which $H(x) \sim cx^{-\tau}$ as $x \to \infty$ for some c > 0 and $0 < \tau < 2$. Note that the function $U_{2n}$ is generally a principal irregular part of U.
Using integration by parts,
$$\mu_0 - U(t) = t\int_0^{\infty} H(x)\sin tx\,dx,$$
so that
$$\frac{\mu_0 - U(t)}{H(1/t)} = \int_0^{\infty} \frac{H(x/t)}{H(1/t)}\sin x\,dx.$$
Since H is bounded and $H(x) \sim cx^{-\tau}$ as $x \to \infty$, there exists a finite constant A such that $H(x) \le A(1+x)^{-\tau}$ for all x > 0 and $H(x) \ge \frac{1}{2}cx^{-\tau}$ for all x sufficiently large. Hence, for all x > 0 and all t sufficiently small,
$$\frac{H(x/t)}{H(1/t)}|\sin x| \le \frac{2A}{c(t+x)^{\tau}},$$
which is integrable on [0, p] for any finite p. Furthermore, for any x > 0,
$$\lim_{t\downarrow 0}\frac{H(x/t)}{H(1/t)}\sin x = x^{-\tau}\sin x,$$
so that
$$\Big|\int_p^{\infty}\frac{H(x/t)}{H(1/t)}\sin x\,dx\Big| \le 2\pi\,\frac{H(p/t)}{H(1/t)} + \frac{1}{H(1/t)}\sum_{j}\Big|\int_{2\pi j}^{2\pi(j+1)} H\Big(\frac{x}{t}\Big)\sin x\,dx\Big|$$
Exercises
19 Show that $t^p(\log t)^q$ has index p as $t \to \infty$ for any real q.
20 In the proof of the special case of Theorem 3, use the Second Mean Value Theorem for Integrals (see Spivak (1980, p. 367) or Phillips (1984, p. 302) for a more general version) to provide an alternative argument to justify that $\int_p^{\infty}\{H(x/t)/H(1/t)\}\sin x\,dx$ can be made arbitrarily small by taking t small and p large. This is the argument Pitman (1968) uses. The Second Mean Value Theorem for Integrals was unfamiliar to me prior to my reading Pitman's paper but is a result well worth knowing.
21 If $H(x) \sim x^{-2}\log x$ as $x \to \infty$, find an explicit expression in terms of elementary functions for a function $U_2$ satisfying (19).
22 For the autocovariance function $e^{-|t|^{\delta}}$ on $\mathbb{R}$ with $0 < \delta < 2$, show that the corresponding spectral density is asymptotic to $C\omega^{-\delta-1}$ as $\omega \to \infty$ and find C as a function of $\delta$.
For $\sum_{j=1}^{n} c_j = 0$,
$$\int_{-\infty}^{\infty}\Big|\sum_{j=1}^{n} c_j \exp(i\omega t_j)\Big|^2 |\omega|^{-\alpha}\,d\omega = -\frac{\pi}{\Gamma(\alpha)\sin\{\frac{1}{2}\pi(\alpha-1)\}}\sum_{j,k=1}^{n} c_jc_k|t_j - t_k|^{\alpha-1},$$
so that the function
$$G(t) = -\frac{\pi|t|^{\alpha-1}}{\Gamma(\alpha)\sin\{\frac{1}{2}\pi(\alpha-1)\}}$$
behaves like an autocovariance function in that $\sum_{j,k=1}^{n} c_jc_kG(t_j-t_k) \ge 0$ whenever $\sum_{j=1}^{n} c_j = 0$. It is possible to show that for $1 < \alpha < 3$, there exist processes (nonstationary, of course) for which
$$\mathrm{var}\Big\{\sum_{j=1}^{n} c_jZ(t_j)\Big\} = -\phi\sum_{j,k=1}^{n} c_jc_k|t_j - t_k|^{\alpha-1}$$
whenever $\sum_{j=1}^{n} c_j = 0$. More generally, consider a positive symmetric measure F on $\mathbb{R}^d$ satisfying, for some nonnegative integer r,
$$\int_{\mathbb{R}^d}\min(1, |\omega|^{2r+2})\,F(d\omega) < \infty. \qquad (22)$$
For $x = (x_1,\ldots,x_d)$ and $\alpha = (\alpha_1,\ldots,\alpha_d)$, define $x^{\alpha} = \prod_{i=1}^{d} x_i^{\alpha_i}$ and let $D_r$ be the set of all d-tuples whose components are nonnegative integers summing to at most r. Then if $\sum_{j=1}^{n} c_jx_j^{\alpha} = 0$ for all $\alpha \in D_r$, $\int_{\mathbb{R}^d}|\sum_{j=1}^{n} c_j\exp(i\omega^Tx_j)|^2 F(d\omega) < \infty$, as required. Since the Fourier transform of F will not be defined in the ordinary sense when F has infinite mass
$$G(x) = \int_{b_d}\{\cos(\omega^Tx) - Q_r(\omega^Tx)\}\,F(d\omega) + \int_{b_d^c}\cos(\omega^Tx)\,F(d\omega), \qquad (23)$$
where $b_d$ is the ball of radius 1 centered at the origin and the superscript c indicates complement. Since $|\cos(\omega^Tx) - Q_r(\omega^Tx)| = O(|\omega|^{2r+2})$ for any fixed x, the first integral on the right side of (23) is well defined for F satisfying (22). Furthermore, if $\sum_{j=1}^{n} c_jx_j^{\alpha} = 0$ for all $\alpha \in D_r$, then
$$\sum_{j,k=1}^{n} c_jc_kG(x_j - x_k) = \int_{\mathbb{R}^d}\Big|\sum_{j=1}^{n} c_j\exp(i\omega^Tx_j)\Big|^2 F(d\omega). \qquad (24)$$
The choice of $b_d$ in (23) is arbitrary; we could just as well use any bounded region containing a neighborhood of the origin and (24) would still hold. Matheron (1973) shows that for any positive, symmetric measure F satisfying (22) there is a real random field Z for which $\mathrm{var}\{\sum_{j=1}^{n} c_jZ(x_j)\}$ is given by $\int_{\mathbb{R}^d}|\sum_{j=1}^{n} c_j\exp(i\omega^Tx_j)|^2 F(d\omega)$ whenever $\sum_{j=1}^{n} c_jx_j^{\alpha} = 0$ for all $\alpha \in D_r$. Matheron calls such a random field an intrinsic random function of order r, or r-IRF. A generalized autocovariance function for an r-IRF takes the form
$$G(x) = \int_{\mathbb{R}^d}\big[\cos(\omega^Tx) - Q_r(\omega^Tx)\,1\{|\omega| \le 1\}\big]\,F(d\omega) + P(x),$$
where F is a positive symmetric measure satisfying (22) and P is an even polynomial of order at most 2r + 2 that is conditionally positive definite of order r. It is trivially true that every even polynomial of order at most 2r is conditionally positive definite of order r, since for any such polynomial P, $\sum_{j,k=1}^{n} c_jc_kP(x_j - x_k) = 0$ whenever $\sum_{j=1}^{n} c_jx_j^{\alpha} = 0$ for all $\alpha \in D_r$. It follows that if G is a generalized autocovariance function for the r-IRF Z, then so is G plus any even polynomial of order at most 2r.
Semivariograms
In practice, the most frequently used class of IRFs is the 0-IRFs. For a 0-IRF Z with generalized autocovariance function G, $\mathrm{var}\{Z(x) - Z(y)\} = 2G(0) - 2G(x-y)$. Define the semivariogram $\gamma$ of a 0-IRF by $\gamma(x) = \frac{1}{2}\mathrm{var}\{Z(x) - Z(0)\}$. Then $-\gamma$ is a generalized autocovariance function for Z. The semivariogram is commonly used for modeling random fields in the geostatistical literature (Journel and Huijbregts 1978; Isaaks and Srivastava 1989; Cressie 1993). See Cressie (1988) for some historical notes on semivariograms. One reason for its popularity is that there is a convenient way to estimate $\gamma(x)$. For simplicity, suppose the observations $x_1, \ldots, x_n$ form some repeating pattern so that there are vectors x for which $x_i - x_j = x$ for many pairs of observations $x_i$ and $x_j$. For such a vector x, an unbiased estimator of $\gamma(x)$ is
$$\hat\gamma(x) = \frac{1}{2N_x}\sum_{x_i - x_j = x}\{Z(x_i) - Z(x_j)\}^2,$$
where $N_x$ is the number of pairs with $x_i - x_j = x$.
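A small sketch (mine, assuming observations on a regular one-dimensional grid, the simplest repeating pattern) of this method-of-moments estimator:

```python
import numpy as np

def empirical_semivariogram(z, max_lag):
    """Semivariogram estimates for a process observed on a regular 1-D
    grid: gamma_hat(h) = average of (z[i+h] - z[i])^2 / 2 over all pairs."""
    return np.array([0.5 * np.mean((z[h:] - z[:-h]) ** 2)
                     for h in range(1, max_lag + 1)])

# Example: estimate gamma at lags 1..20 from a simulated series z.
rng = np.random.default_rng(1)
z = np.cumsum(rng.standard_normal(500))   # random-walk-type 0-IRF
gamma_hat = empirical_semivariogram(z, max_lag=20)
```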
square integrable. From the time domain, proceeding formally,
$$\mathrm{var}\Big\{\int_{-\infty}^{\infty} h(t)Z(t)\,dt\Big\} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} h(s)h(t)\,\mathrm{cov}\{Z(s),Z(t)\}\,ds\,dt = c\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} h(s)h(t)\,\delta_{s-t}\,ds\,dt = c\int_{-\infty}^{\infty} h(t)^2\,dt.$$
From the spectral domain, letting $H(\omega) = \int_{-\infty}^{\infty} h(t)e^{i\omega t}\,dt$ and again proceeding formally,
$$\mathrm{var}\Big\{\int_{-\infty}^{\infty} h(t)Z(t)\,dt\Big\} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} h(s)h(t)\Big\{\frac{c}{2\pi}\int_{-\infty}^{\infty} e^{i\omega(s-t)}\,d\omega\Big\}\,ds\,dt = \frac{c}{2\pi}\int_{-\infty}^{\infty}\Big\{\int_{-\infty}^{\infty} h(s)e^{i\omega s}\,ds\Big\}\Big\{\int_{-\infty}^{\infty} h(t)e^{-i\omega t}\,dt\Big\}\,d\omega = \frac{c}{2\pi}\int_{-\infty}^{\infty} |H(\omega)|^2\,d\omega = c\int_{-\infty}^{\infty} h(t)^2\,dt,$$
where the last equality is by Parseval's relation. If $h(t) = 1\{a < t < b\}/(b-a)$, then $\mathrm{var}\{\int_{-\infty}^{\infty} h(t)Z(t)\,dt\} = c/(b-a)$, so that the variance of an average of Z over an interval tends to $\infty$ as the length of the interval tends to 0. This result is in line with my previous statement that white noise is not defined pointwise.
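This blow-up is easy to see in a simulation. The sketch below is my own; white noise is discretized with spacing dt so that each sample has variance c/dt, and the Monte Carlo variance of the interval average is compared with c/(b − a).

```python
import numpy as np

def avg_white_noise_var(length, c=1.0, dt=1e-3, reps=2000, rng=None):
    """Monte Carlo variance of the average of white noise (variance c per
    unit time, discretized with spacing dt) over an interval of given length."""
    rng = np.random.default_rng(0) if rng is None else rng
    k = int(length / dt)
    # Each sample has variance c/dt so that integrals mimic white noise.
    w = rng.standard_normal((reps, k)) * np.sqrt(c / dt)
    return w.mean(axis=1).var()

for ell in (1.0, 0.1, 0.01):
    print(ell, avg_white_noise_var(ell))   # close to c / ell = 1, 10, 100
```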
One way to think about white noise is as a generalized derivative of Brownian motion, so that the spectral density of white noise should be $\omega^2$ times the spectral density of Brownian motion, which is the case since, as we noted earlier, the spectral density for Brownian motion is proportional to $\omega^{-2}$. Equivalently, Brownian motion can be interpreted as an integral of white noise. This result is the continuous time analogue to a random walk being a sum of independent and identically distributed random variables.
Gel'fand and Vilenkin (1964) provide a rigorous development of random
fields that includes nonintegrability of the spectral density at both the
origin and infinity as special cases. Yaglom (1987a, Section 24) provides a
more accessible treatment of these topics.
Exercises
23 For $\sum_{j=1}^{n} c_j = 0$, show that (21) holds for $\omega$ in a neighborhood of the origin.
24 If $c_1, \ldots, c_n \in \mathbb{R}$ and $x_1, \ldots, x_n \in \mathbb{R}^d$ satisfy $\sum_{j=1}^{n} c_j\exp(i\omega^Tx_j) = O(|\omega|^{r+1})$ in $\omega$, show that (22) implies
Characterization
Suppose K(r), $r \ge 0$, is an isotropic autocovariance function in $\mathbb{R}^d$; that is, there exists a weakly isotropic complex-valued random field Z on $\mathbb{R}^d$ such that $\mathrm{cov}\{Z(x),Z(y)\} = K(|x-y|)$ for all $x, y \in \mathbb{R}^d$. For $x \in \mathbb{R}^d$, we have $K(|x|) = \overline{K(|-x|)} = \overline{K(|x|)}$, so that K must be real. So, if $Z(x) = V(x) + iW(x)$ with V and W real, then $\mathrm{cov}\{V(x),W(y)\} = \mathrm{cov}\{V(y),W(x)\}$.
$$K(|x|) = \int_{\mathbb{R}^d}\exp(i\omega^Tx)\,F(d\omega).$$
Since K(x) = K(−x), without loss of generality, we can take F to be symmetric about the origin. Furthermore, by isotropy,
$$K(r) = \int_{\partial b_d} K(r|x|)\,U(dx),$$
where U is the uniform probability measure on the unit sphere $\partial b_d$. Now
$$\int_{\partial b_d}\cos(r\,\omega^Tx)\,U(dx) = \frac{A_{d-1}}{A_d}\int_0^{\pi}\cos(r|\omega|\cos\phi)(\sin\phi)^{d-2}\,d\phi = \Gamma(d/2)\Big(\frac{2}{r|\omega|}\Big)^{(d-2)/2}J_{(d-2)/2}(r|\omega|),$$
so that
$$K(r) = 2^{(d-2)/2}\Gamma(d/2)\int_0^{\infty}(ru)^{-(d-2)/2}J_{(d-2)/2}(ru)\,dG(u), \qquad (26)$$
where G is nondecreasing, bounded on $[0,\infty)$ and G(0) = 0. The right side of (26) is known as the Hankel transform of order $\frac{1}{2}(d-2)$ of G. We have the following.
Theorem 5. For $d \ge 2$, a function K is a continuous isotropic autocovariance function for a random field on $\mathbb{R}^d$ if and only if it can be represented as in (26) with G nondecreasing, bounded on $[0,\infty)$ and G(0) = 0.
For d odd, (26) can be expressed in terms of elementary functions. For example, for d = 3, $K(r) = \int_0^{\infty}(ru)^{-1}\sin(ru)\,dG(u)$. It is often difficult to determine whether a given function can be written as in (26). Christakos (1984) and Pasenchenko (1996) give sufficient conditions for a function to be an isotropic autocovariance function that can be easily verified in some circumstances.
Let $D_d$ be the class of continuous isotropic autocovariance functions in d dimensions and $D_\infty = \bigcap_{d=1}^{\infty} D_d$ the class of functions that are continuous isotropic autocovariance functions in all dimensions. By considering a d-dimensional weakly isotropic random field along m coordinates, m < d, we obtain an m-dimensional weakly isotropic random field, so that $D_d \subset D_m$. Thus, $D_1 \supset D_2 \supset \cdots \supset D_\infty$.
To characterize the elements of $D_\infty$, define
$$\Lambda_d(t) = 2^{(d-2)/2}\Gamma(d/2)\,t^{-(d-2)/2}J_{(d-2)/2}(t) = \Gamma(d/2)\sum_{j=0}^{\infty}\frac{(-t^2/4)^j}{j!\,\Gamma(\frac{d}{2}+j)} = 1 - \frac{t^2}{2d} + \frac{t^4}{8d(d+2)} - \cdots,$$
so that for fixed t,
$$\lim_{d\to\infty}\Lambda_d\big((2d)^{1/2}t\big) = e^{-t^2}. \qquad (27)$$
This suggests that a function is in $D_\infty$ if and only if it has the representation
$$K(r) = \int_0^{\infty} e^{-r^2u^2}\,dG(u) \qquad (28)$$
with G nondecreasing and bounded on $[0,\infty)$. To see that every $K \in D_\infty$ is of this form, note that for each d there is a nondecreasing bounded function $G_d$ with
$$K(r) = \int_0^{\infty}\Lambda_d\{(2d)^{1/2}ru\}\,dG_d(u),$$
so that, taking a subsequence $\{d_j\}$ along which $\{G_{d_j}\}$ converges vaguely to some G,
$$\Big|K(r) - \int_0^{\infty} e^{-r^2u^2}\,dG(u)\Big| \le \Big|\int_0^{\infty}\Lambda_{d_j}\{(2d_j)^{1/2}ru\}\,dG_{d_j}(u) - \int_0^{\infty} e^{-r^2u^2}\,dG_{d_j}(u)\Big| + \Big|\int_0^{\infty} e^{-r^2u^2}\,dG_{d_j}(u) - \int_0^{\infty} e^{-r^2u^2}\,dG(u)\Big|.$$
The first term on the right side tends to 0 as $j \to \infty$ because of the uniform convergence in (27) and $\{G_{d_j}\}$ uniformly bounded. The second tends to 0 for any given r > 0 by the vague convergence of $\{G_{d_j}\}$ and $e^{-r^2u^2} \to 0$ as $u \to \infty$ (Chung 1974, Theorem 4.4.1). Since neither K(r) nor $\int_0^{\infty} e^{-r^2u^2}\,dG(u)$ depends on j, the two functions must be identical.
For d = 2,
$$C(r) \ge \inf_{s\ge0} J_0(s) \approx -0.403,$$
for d = 3,
$$C(r) \ge \inf_{s\ge0}\frac{\sin s}{s} \approx -0.218$$
and for $d = \infty$, $C(r) \ge 0$. For all finite d, the infimum of $\Lambda_d(s)$ is attained at a finite value of s, so the lower bounds on C are achievable (Exercise 30). For $d = \infty$, the bound cannot be achieved since $e^{-t^2} > 0$ for all t, so that K(r) > 0 for all r if $K \in D_\infty$ (unless K(0) = 0).
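These infima are easy to confirm numerically; the following lines (mine, using scipy.special.j0) reproduce the constants −0.403 and −0.218:

```python
import numpy as np
from scipy.special import j0

s = np.linspace(0.0, 50.0, 200001)
print(j0(s).min())                      # about -0.4028, near s = 3.83 (d = 2)
print((np.sin(s[1:]) / s[1:]).min())    # about -0.2172, near s = 4.49 (d = 3)
```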
Inversion formula
The relationship (26) has a simple inversion if $\int_0^{\infty} r^{d-1}|K(r)|\,dr < \infty$. In this case, there exists a nonnegative function f with $\int_0^{\infty} u^{d-1}f(u)\,du < \infty$ such that
$$f(u) = (2\pi)^{-d/2}u^{-(d-2)/2}\int_0^{\infty} J_{(d-2)/2}(ru)\,r^{d/2}K(r)\,dr$$
and
$$K(r) = (2\pi)^{d/2}r^{-(d-2)/2}\int_0^{\infty} J_{(d-2)/2}(ru)\,u^{d/2}f(u)\,du.$$
Smoothness properties
We have shown in 2.4 that an autocovariance function on $\mathbb{R}$ that is continuous at 0 is continuous everywhere. For continuous isotropic autocovariance functions in more than one dimension, we can make stronger statements about the smoothness of K(r) for $r \ge 0$. From standard properties of Bessel functions, $|J_\nu(t)| < C_\nu(1+|t|)^{-1/2}$ for some constant $C_\nu$ and all t, $t^{-\nu}J_\nu(t)$ is bounded and $(d/dt)\{t^{-\nu}J_\nu(t)\} = -t^{-\nu}J_{\nu+1}(t)$. From (26), for $d \ge 3$ and r > 0,
$$K'(r) = -2^{(d-2)/2}\Gamma(d/2)\int_0^{\infty} u(ru)^{-(d-2)/2}J_{d/2}(ru)\,dG(u),$$
$$|K(s) - K(t)| \le \int_0^{\infty}|J_0(su) - J_0(tu)|\,dG(u) \le C_1(s-t)\int_0^{u_0}\frac{u}{(1+tu)^{1/2}}\,dG(u) + \frac{2C_0}{(1+tu_0)^{1/2}}\{G(\infty) - G(0)\}$$

$$\int_0^1 J_0(ur)\,r(1-r)\,dr = u^{-1}J_1(u) - u^{-3}\int_0^u r^2J_0(r)\,dr + O(u^{-3}) = \Big(\frac{2}{\pi}\Big)^{1/2}u^{-5/2}\sin\Big(u - \frac{3\pi}{4}\Big) + O(u^{-3}), \qquad (30)$$
$$K'(r) = -\int_0^{\infty} uJ_1(ur)\,dG(u) \qquad (31)$$
for r > 0. An argument similar to the one proving the Lip($\frac12$) property of elements of $D_2$ yields that K′(r) is in Lip($\frac12$) on $[a,\infty)$ for any a > 0, as claimed (Exercise 36). I am unaware of a general result on what some degree of smoothness at 0 for an element of $D_d$ implies about its smoothness elsewhere. Thinking of an element of $D_d$ as being a function on $\mathbb{R}$ by setting K(−r) = K(r), then for any $\epsilon > 0$, I would expect K to have at least $\frac{1}{2}(d-1) - \epsilon$ more "derivatives" away from the origin than it does at the origin, where, for a positive noninteger t, a function is said to have t derivatives at a point if in some neighborhood of this point it has $\lfloor t\rfloor$ derivatives and this $\lfloor t\rfloor$th derivative is Lip($t - \lfloor t\rfloor$).
To see why I have included the $-\epsilon$ term in this conjecture, consider the following example. Pasenchenko (1996) shows that $K(r) = (1 - r^{1/2})_+$ is in $D_2$. This function is in Lip($\frac12$) in a neighborhood of 0 but K is not differentiable at 1. Hence, the proposition that $K \in D_d$ has $\frac{1}{2}(d-1)$ more derivatives away from the origin than at the origin is false using the definition of fractional derivatives given here. Of course, such a proposition may be true under a different definition of fractional differentiation.
Matern class
The Matern class of functions, which we saw in (14) to be positive definite on $\mathbb{R}$, are in fact all in $D_\infty$. This can be verified by using the inversion formula and (6.576.7) of Gradshteyn and Ryzhik (1994) to obtain the corresponding isotropic spectral density
$$f(u) = \phi(\alpha^2 + u^2)^{-\nu-d/2}, \qquad (32)$$
which is nonnegative and satisfies $\int_0^{\infty} f(u)u^{d-1}\,du < \infty$. This model is used
by Handcock (1989), Handcock and Stein (1993) and Handcock and Wallis
(1994). Matern (1960) appears to be the first author to have recommended
these functions as a sensible class of models for isotropic random fields
in any number of dimensions. I believe this class of models has much to
recommend it. As we show in Chapter 3, the smoothness of a random field
plays a critical role in interpolation problems. Furthermore, there is often
no basis for knowing a priori the degree of smoothness of some physical
process modeled as a random field. Thus, it is prudent to use classes of
models that allow for the degree of smoothness to be estimated from the
data rather than restricted a priori. The Matern model does allow for great
flexibility in the smoothness of the random field while still keeping the
number of parameters manageable.
It is often convenient to describe the smoothness of an isotropic random field through the principal irregular term of the isotropic autocovariance function. Extending the definition I gave in 2.7 for processes on $\mathbb{R}$ in the obvious way, I call g a principal irregular term for the isotropic autocovariance function K if $g(r)r^{-2n} \to 0$ and $|g(r)|r^{-2n-2} \to \infty$ as $r \downarrow 0$ and K is of the form $K(r) = \sum_{j=0}^{n} c_jr^{2j} + g(r) + o(|g(r)|)$ as $r \downarrow 0$. As in the one-dimensional setting, if $g(r) = b|r|^{\beta}$ is a principal irregular term for K, I call $\beta$ the power and b the coefficient of the principal irregular term. It follows from (15) and (16) that for the Matern model with $\nu$ a noninteger, $2\nu$ is the power of the principal irregular term and when $\nu$ is a positive integer, there is a principal irregular term proportional to $r^{2\nu}\log r$.
For statistical purposes, the parameterization in (32) may not be best. In particular, for fixed $\alpha$, f becomes more and more concentrated around the origin as $\nu$ increases. In particular, suppose $C_{\alpha,\nu}$ is the isotropic autocorrelation function corresponding to $f(\omega) = \phi(\alpha^2 + |\omega|^2)^{-\nu-d/2}$. Then $\lim_{\nu\to\infty} C_{\alpha,\nu}(r) = 1$ for all $r \ge 0$ (Exercise 39). One way to solve this problem is to use the class of models $f(\omega) = \phi\{\alpha^2(\nu + \frac{1}{2}d) + |\omega|^2\}^{-\nu-d/2}$ (see Exercise 39). Another possibility is the parameterization
$$g_\nu(u) = \frac{\sigma^2 c(\nu,\rho)}{\big(\frac{4\nu}{\rho^2} + u^2\big)^{\nu+d/2}}, \qquad (33)$$
where
$$c(\nu,\rho) = \frac{\Gamma(\nu + \frac{d}{2})(4\nu)^{\nu}}{\pi^{d/2}\Gamma(\nu)\rho^{2\nu}}.$$
The corresponding isotropic autocovariance function is
$$K(t) = \frac{\sigma^2}{\Gamma(\nu)2^{\nu-1}}\Big(\frac{2\nu^{1/2}t}{\rho}\Big)^{\nu}\mathcal{K}_{\nu}\Big(\frac{2\nu^{1/2}t}{\rho}\Big),$$
which has the nice property that it does not depend on d. In either parame-
terization, $\nu$ has the same interpretation as a measure of the differentiability of the random field. The parameter $\sigma^2$ in (33) is just var{Z(x)}. Finally, $\rho$ measures how quickly the correlations of the random field decay with distance. It is thus closely related to what is known as the practical range of an isotropic autocovariance function in the geostatistical literature, which is informally defined as the distance at which the correlations are nearly 0, say 0.05, for example (Journel and Huijbregts 1978, p. 164). Figure 3 plots the autocovariance functions corresponding to $g_\nu$ for $\sigma^2 = 1$, $\nu = 1$ and several values of $\rho$ and shows how the correlations decay more slowly as $\rho$ increases. Although $\alpha^{-1}$ has a similar interpretation in (32), $\rho$ has the attractive feature that its interpretation is largely independent of $\nu$, which is not the case for $\alpha$. To illustrate this point, Figure 4 plots the autocorrelation functions corresponding to $\rho = 1$ for $\nu$ = 1, 2 or 3 under (33) and Figure 5 plots the autocorrelation functions corresponding to $\alpha = 1$ for $\nu$ = 1, 2 or 3 under (32). The autocorrelation functions in Figure 4 are much more similar at longer distances than those in Figure 5. Another way to see that the interpretation of $\rho$ is only weakly dependent on $\nu$ is to consider the limit of (33) as $\nu \to \infty$. Specifically, for fixed $\sigma^2$, $\rho$ and u,
$$\lim_{\nu\to\infty} g_\nu(u) = \sigma^2\Big(\frac{\rho}{2\pi^{1/2}}\Big)^d e^{-\rho^2u^2/4}, \qquad (34)$$
the spectral density corresponding to the Gaussian autocovariance function $\sigma^2 e^{-t^2/\rho^2}$.
FIGURE 3. Plots of Matern autocovariance functions under the parameterization given in (33) with $\sigma^2 = 1$, $\nu = 1$ and several values of $\rho$. Solid line corresponds to $\rho = 2$, dashed line to $\rho = 1$ and dotted line to $\rho = 4$.
FIGURE 4. Plots of Matern autocovariance functions under the parameterization given in (33) with $\sigma^2 = 1$, $\rho = 2$ and several values of $\nu$. Solid line corresponds to $\nu = 1$, dashed line to $\nu = 2$ and dotted line to $\nu = 3$.
Matern class has no more parameters than this model, it provides much
greater range for the possible local behavior of the random field. The fact
that its use requires the calculation of a Bessel function does not create a serious obstacle to its adoption, as programs that calculate all manner of Bessel functions are readily available (Cody 1987). Section 6.5 provides
further discussion on the use of the Matern model.
FIGURE 5. Plots of Matern autocorrelation functions under the parameterization given in (32) with $\alpha = 1$ and several values of $\nu$. Solid line corresponds to $\nu = 1$, dashed line to $\nu = 2$ and dotted line to $\nu = 3$.
Spherical model
Perhaps the most commonly used model for isotropic autocovariance functions in geological and hydrological applications is the spherical: for positive constants c and $\rho$,
$$K(r) = \begin{cases} c\Big(1 - \dfrac{3r}{2\rho} + \dfrac{r^3}{2\rho^3}\Big), & r \le \rho\\ 0, & r > \rho. \end{cases} \qquad (35)$$
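For concreteness, here is a sketch (my own) of the Matern autocovariance in the $(\sigma^2, \nu, \rho)$ parameterization discussed above, together with the spherical model (35); it relies on scipy.special.kv for the modified Bessel function $\mathcal{K}_\nu$. Setting $\nu = 0.5$ recovers an exponential shape, one way to sanity-check the implementation.

```python
import numpy as np
from scipy.special import kv, gamma

def matern(t, sigma2=1.0, nu=1.0, rho=1.0):
    """Matern autocovariance in the (sigma^2, nu, rho) parameterization:
    smoothness nu and 'practical range' parameter rho."""
    t = np.asarray(t, dtype=float)
    u = 2.0 * np.sqrt(nu) * t / rho
    out = np.full_like(t, sigma2)                    # K(0) = sigma^2
    nz = u > 0
    out[nz] = sigma2 * u[nz] ** nu * kv(nu, u[nz]) / (gamma(nu) * 2 ** (nu - 1))
    return out

def spherical(r, c=1.0, rho=1.0):
    """Spherical autocovariance (35): c(1 - 3r/(2 rho) + r^3/(2 rho^3))
    for r <= rho and 0 beyond."""
    r = np.asarray(r, dtype=float)
    return np.where(r <= rho, c * (1 - 1.5 * r / rho + 0.5 * (r / rho) ** 3), 0.0)

t = np.linspace(0.0, 3.0, 7)
print(matern(t, nu=0.5), spherical(t))   # nu = 0.5 gives exp(-t / (rho/2)) shape
```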
Exercises
30 Using standard properties of Bessel functions given in, say, Chapter 9 of Abramowitz and Stegun (1965), show that for any $d \ge 2$, the infimum of $\Lambda_d(s)$ for $s \ge 0$ is attained at a finite value of s. Use this to show that if r > 0, there is an isotropic autocorrelation function $C \in D_d$ such that $C(r) = \inf_{s\ge0}\Lambda_d(s)$.
(ii) Show that the resulting function is continuous but not differentiable at 1 by using properties of hypergeometric functions given, for example, in Chapter 15 of Abramowitz and Stegun (1965).
33 Provide the details for (30).
34 By considering the correlation of $Z(\epsilon) - Z(0)$ and $Z(1+\epsilon) - Z(1)$ as $\epsilon \downarrow 0$, show that $\{(1-r)_+\}^{\gamma} \notin D_1$ if $0 < \gamma < 1$.
35 Verify (31).
36 Using (31), show that if $K \in D_2$ and K′(0+) exists and is finite, then K′(r) is in Lip($\frac12$) on $[a,\infty)$ for any a > 0.
37 (Pólya's criterion). Prove that if K is even and is continuous, nonnegative, nonincreasing and convex on $[0,\infty)$, then it is in $D_1$ by using (3), (4) and the fact that $(1-|t|)_+$ is in $D_1$.
38 For a weakly isotropic random field in $\mathbb{R}^d$, $d \ge 2$, the results given in this section do not resolve whether such functions must be continuous away from the origin. Read Crum (1956) and determine to what extent the results in this paper resolve this issue.
39 Show that if $C_{\alpha,\nu}$ is the isotropic autocorrelation function corresponding to $f(\omega) = \phi(\alpha^2 + |\omega|^2)^{-\nu-d/2}$, then $\lim_{\nu\to\infty} C_{\alpha,\nu}(t) = 1$ for all t. For $f(\omega) = \phi\{\alpha^2(\nu + \frac{1}{2}d) + |\omega|^2\}^{-\nu-d/2}$, find the limiting behavior of the corresponding isotropic autocorrelation function as $\nu \to \infty$ for fixed $\alpha$.
40 Consider a Poisson process N on $\mathbb{R}^3$ with constant intensity $\lambda$, so that for A a Borel subset of $\mathbb{R}^3$, N(A) is the number of events of the process in A. Let $Z(x) = N(b_3(r) + x)$, the number of events in the ball in $\mathbb{R}^3$ of radius r centered at x. Show that the isotropic autocovariance function of Z is of the form given in (35). Find the corresponding spectral density. Show that K as given in (35) is not in $D_4$.
Exercises
41 Suppose Z is a weakly stationary mean 0 random field on $\mathbb{R}^2$ with autocovariance function $K(u,v) = e^{-|u|-|v|}$ for u and v real. Show that the BLP of Z(0,0) based on observing Z(t,0) and Z(0,t) is $\lambda Z(0,t) + \lambda Z(t,0)$, where $\lambda = 1/(e^t + e^{-t}) = 1/(2\cosh t)$, and that the mse of the BLP is $\tanh t$. Next, consider adding a third observation at (t,t). Show that the BLP of Z(0,0) is $e^{-t}Z(0,t) + e^{-t}Z(t,0) - e^{-2t}Z(t,t)$ with mse $(1 - e^{-2t})^2$.
42 Show that the only real functions on $\mathbb{R}^d$ that are isotropic and factor into functions of each coordinate are of the form $c\exp(-\alpha|x|^2)$ for $\alpha$ and c real.
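The formulas in Exercise 41 are easy to spot-check numerically. The sketch below is mine (t = 0.7 is arbitrary); it solves the normal equations for the product covariance and compares with the stated answers.

```python
import numpy as np

def blp(K, k, var0):
    """Weights and mse of the BLP: solve K c = k; mse = var0 - k^T c."""
    c = np.linalg.solve(K, k)
    return c, var0 - k @ c

t = 0.7
cov = lambda a, b: np.exp(-abs(a[0] - b[0]) - abs(a[1] - b[1]))
x0 = (0.0, 0.0)

obs = [(t, 0.0), (0.0, t)]
K2 = np.array([[cov(a, b) for b in obs] for a in obs])
k2 = np.array([cov(a, x0) for a in obs])
c2, mse2 = blp(K2, k2, 1.0)
# c2 should equal (1/(2 cosh t), 1/(2 cosh t)); mse2 should equal tanh(t).

obs3 = obs + [(t, t)]
K3 = np.array([[cov(a, b) for b in obs3] for a in obs3])
k3 = np.array([cov(a, x0) for a in obs3])
c3, mse3 = blp(K3, k3, 1.0)
# c3 should be (e^{-t}, e^{-t}, -e^{-2t}); mse3 should be (1 - e^{-2t})^2.
```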
3
Asymptotic Properties of Linear
Predictors
3.1 Introduction
Suppose we observe a Gaussian random field Z with mean function m and
covariance function K at some set of locations. Call the pair (m, K) the
second-order structure of the random field. If (m, K) is known, then as
noted in 1.2, the prediction of Z at unobserved locations is just a matter
of calculation. To review, the conditional distribution of Z at an unob-
served location is normal with conditional mean that is a linear function of
the observations and constant conditional variance. In practice, (m, K) is
at least partially unknown and it is usually necessary to estimate (m, K)
from the same data we use to do the prediction. Thus, it might be natural
to proceed immediately to methods for estimating second-order structures
of Gaussian random fields. However, until we know something about the re-
lationship between the second-order structure and linear predictors, it will
be difficult to judge what is meant by a good estimate of the second-order
structure. In particular, it will turn out that it is possible to get (m, K)
nonnegligibly wrong and yet still get nearly optimal linear predictors. More
specifically, for a random field possessing an autocovariance function, if the
observations are tightly packed in a region in which we wish to predict the
random field, then the low frequency behavior of the spectrum has little
impact on the behavior of the optimal linear predictions.
One way to study the behavior of linear predictors when the second-
order structure is not perfectly known is to consider the behavior of linear
predictors that are optimal under some incorrect second-order structure.
This approach has been used in the atmospheric sciences (Daley 1991, Section 4.9) and in the geostatistical literature (Diamond and Armstrong 1984) as well as in my own work (Stein 1988, 1990a, 1990b, 1993, 1997, 1999 and Stein and Handcock 1989). I generally denote the actual second-order structure for a random field Z by $(m_0, K_0)$ and the incorrect second-order structure used to generate and evaluate linear predictors by $(m_1, K_1)$. I call linear predictors that are best under an incorrect model pseudo-BLPs.
To be more specific, suppose we observe Z for all x in some set $Q \subset R$ and wish to predict $Z(x_0)$, $x_0 \in R\setminus Q$. Define $e_j(Z(x_0), Q)$ to be the error of the best linear predictor if $(m_j, K_j)$ is the correct second-order structure and let $E_j$ indicate expected value under $(m_j, K_j)$. One measure as to how well predictions based on $K_1$ do when $K_0$ is the correct covariance function is $E_0\{e_1(Z(x_0),Q)^2\}/E_0\{e_0(Z(x_0),Q)^2\}$, the ratio of the mse of the suboptimal pseudo-BLP to that of the BLP. This ratio is necessarily at least 1. More specifically,
$$\frac{E_0e_1(Z(x_0),Q)^2}{E_0e_0(Z(x_0),Q)^2} = \frac{E_0\big[\{e_1(Z(x_0),Q) - e_0(Z(x_0),Q)\} + e_0(Z(x_0),Q)\big]^2}{E_0e_0(Z(x_0),Q)^2} = 1 + \frac{E_0\{e_1(Z(x_0),Q) - e_0(Z(x_0),Q)\}^2}{E_0e_0(Z(x_0),Q)^2}, \qquad (1)$$
which follows from the orthogonality of the error of a BLP with all linear
combinations of the observations. In addition to the quality of point pre-
dictions, another concern is the accuracy of assessments of mse. If we not
only compute our prediction under (ml, K 1 ) but also assess its mse under
this model, this amounts to presuming the mse is E1el(Z(XO),Q)2. The
quantity
E1el(Z(XO), Q)2
(2)
Eoel(Z(XO), Q)2
is then the ratio of the presumed mse of the pseudo-BLP to its actual mse.
If both (1) and (2) are near 1, then little is lost by using (ml, Kd instead
of the correct (mo, Ko), at least as far as predicting Z(xo) goes.
This chapter considers two ways of investigating the relationship between
second-order structures and linear prediction. One way is to study (1) and
(2) for various pairs of second-order structures, observations and predic-
tands. The second is to study the spectral characteristics of prediction
errors by making use of the correspondence between linear combinations
of values of a random field and linear combinations of complex exponen-
tials described in 2.6. It turns out that this second approach is helpful in
studying the first.
There are two basic themes to this chapter. One is the differences in
the behavior of pseudo-BLPs when interpolating (predicting at locations
"surrounded" by observations) and extrapolating (predicting at locations
outside the range of observations). The second is the lack of sensitivity of
predictions to misspecifications in the spectrum at low frequencies when
3.2 Finite sample results 59
O< a . f E I h2 E I h2 b
= ill E h 2 $ sup E h 2 = < 00. (3)
hE'H~ 0 hE'H~ 0
Then, as sets, 1tR(mO, Ko) = 1tR(mt, Kd, so call this set 1tR. The con-
dition (3) simplifies matters because now there is no need to worry about,
say, a BLP under 1tR(mo, Ko) not being an element of 1tR(ml, Kd. One
situation where (3) holds is if R = ]Rd, mo = mi and Ko and KI are auto-
covariance functions with corresponding spectral densities fo and It such
that fo/ It is bounded away from 0 and 00.
Under (3), we can give some simple bounds on the effects of using the
wrong second-order structure. Define ej(h, Q) to be the error of the BLP
of h E 1tR based on observing Z on Q and let 1t-Q be those elements h
of 1tR for which Eoeo(h, Q)2 > O. Equation (3) implies a simple bound for
the ratio in (2) on assessing mses of pseudo-BLPs:
E I el(h,Q)2 E I el(h,Q)2 b
a < inf < sup <. (4)
hE'H_Q EOel (h, Q)2 - hE'H_Q EOel (h, Q)2 -
1). Thus, the restriction to 1t-Q is not a significant one. Cleveland (1971)
obtains a more interesting bound for the ratio in (1) on the efficiency of
pseudo-BLPs.
(5)
Eo(h - P l h)2
Eo(h - P oh)2
= 1 + E O(P l h)2
= 1 + {Kl(h,9)}2
K l (g,9)
_1 + {,8lh19l + ,82h292}2
- ,8lgr + ,82g~
,8tgr(gt + hn + ,8~g~(g~ + h~) + 2,8l,82glg2(hl h2 + 9192)
(,8lgr + ,82g~)2
The fact that b - a is squared on the right side of (5) is worth noting.
Specifically, suppose a = 1 + E1 and b = 1 + E2, where both E1 and E2
are small, so that all variances are only slightly misspecified under (0, K 1 ).
Then the right side of (5) is approximately 1 + HE2 - (1)2. For example,
if -E1 = E2 = E, then 1 + HE2 - Et}2 = 1 + E2, which is much nearer to 1
than either a = 1 -E or b = 1 + E, the bounds in (4). Thus, we see that
slight misspecifications of the model can potentially have a much larger
effect on the evaluation of mses of pseudo-BLPs than on the efficiency of
the pseudo-BLPs.
One other simple finite sample result we can give is that if E oh 2 ~ E1h2
for all hE 'HR, then E1e~ ::; E1e~ ::; Eoe~ ::; Eoe~. Consequently,
Eoe~ Eoe~ 1
--2~--2~ , (6)
E 1e1 Eoeo
so that if variances under K1 are always smaller than under the correct K o,
the effect of this misspecification is greater on the evaluation of the mse
than on the efficiency of the prediction. There does not appear to be any
comparable result when K1 always gives larger variances than under Ko.
Both (6) and the comparison of the inequalities in (4) and (5) provide
some support for the general notion that misspecifying the covariance struc-
ture of a random field has a greater impact on evaluating mses than on
efficiency of point predictions, which has been noted as an empirical find-
ing by Starks and Sparks (1987). Many of the examples and much of the
asymptotic theory in the rest of this chapter also support this finding.
Exercise
1 Assuming (3) holds, show that Eoeo(h, Q)2 = 0 if and only if
Eoeo(h, Q)2 = E 1eo(h, Q)2 = E 1e1(h, Q)2 = EOe1 (h, Q)2 = O.
It follows that for a weakly stationary process, the behavior near the origin
of the auto covariance function is critical for interpolation. It turns out that
it is slightly more accurate to focus on the high frequency behavior of the
spectrum, although, as described in 2.6 and 2.8, the two are closely related.
A corollary of sorts of this consideration is that the low frequency behavior
of the spectrum should have little effect on interpolation. The results in the
rest of this chapter provide support for this notion. Furthermore, the results
show that focusing on the high frequency behavior works much better when
interpolating than extrapolating.
Some examples
As a first example, suppose f(w) = (1 + W 2 )-1 so that K(t) = 71'e- 1tl .
Consider the extrapolation problem of predicting Z(O) based on Z( -c5) for
some c5 > O. The BLP of Z(O) is e- li Z( -c5) with mse 71'(1 - e- 2li ) and the
BLP is unchanged if further observations are added at locations less than
-c5, which follows by noting cov{Z(t) , Z(O) _e- li Z( -c5)} = 0 for all t < -c5.
The prediction error is Z(O) - e- li Z( -c5) and the corresponding function
in £R.d(F) is Vli(W) = 1- e-(l+iw)li, so that
as 6 ! 0 (Exercise 4). For 6Tfj small, the fraction of the variance of the
prediction error attributable to [-Tfj, Tfj] is much smaller than for the
extrapolation problem.
As a second example, let us consider a smoother process: f(w) = (2 +
w2)-2 so that K(t) = 2- 3 / 27l"exp(-2 1/ 2Itj) (1 + 21/2Itl). For an extrapola-
tion problem, consider predicting Z(O) based on Z( -6j) for j = 1, ... , 10.
Taking B = [-T, T], Table 1 gives values of (7) for various values of T
and 6. It appears that for 6T not too large, (7) is very nearly proportional
to 6T, which is also what happened for the previous spectral density. For
the corresponding interpolation problem, predict Z(O) based on Z(6j) for
j = ±1, ... , ±1O. For fixed T, it now appears that (7) is proportional
to 65 for 6 sufficiently small, whereas (7) was proportional to 63 when
f(w) = (1 + W2)-1.
For both spectral densities, whether extrapolating or interpolating, the
fraction of variance of the prediction error attributable to the frequency
band [- T, T] is small when 6 and 6T are small. However, when extrapo-
lating, the rate of convergence appears to be linear in 6 regardless of the
smoothness of the process, and when interpolating, it is of order 63 for the
rougher process and appears to be of order 65 for the smoother process as
long as T is not too large. These results suggest that optimal interpolations
are only weakly affected by the low frequency behavior of the spectrum,
particularly for smoother processes. We return to this problem in greater
generality in 3.5.
3.4 Behavior of prediction errors in the frequency domain 65
Exercises
2 Verify (8).
3 Show that if f(w) = (1 + W 2 )-I, then the BLP of Z(O) based on
Z(8) and Z(-8) is ~sech(8){Z(8)+Z(-8)} with mse 7rtanh(8). In
addition, show that these results are unaffected by taking additional
observations outside [-8, 8J.
4 Verify (9).
5 Produce results similar to those in Table 1 for f(w) = (3 +w 2)-3.
TABLE 1. Values of (7) for B = [-T, T] with few) = (2 +W2)-2 when predicting
Z(O) based on Z( -6j) for j = 1, ... ,10 (extrapolation) and based on Z(6j) for
j = ±1, ... , ±1O (interpolation).
Extrapolation Interpolation
.l..T
211" 8 = 0.05 8 = 0.1 8 = 0.05 8 = 0.1
0.1 0.0100 0.0200 1.99 x 10- 8 6.31 X 10- 7
0.2 0.0200 0.0400 5.75 x 10- 8 1.83 X 10- 6
0.5 0.0500 0.100 7.95 x 10- 7 2.53 X 10- 5
1 0.100 0.200 1.60 x 10- 5 5.10 X 10- 4
2 0.200 0.400 4.52 x 10- 4 1.43 X 10- 2
5 0.499 0.913 4.20 x 10- 2 7.68 X 10- 1
66 3. Asymptotic Properties of Linear Predictors
Examples of interpolation
Suppose we observe a stationary process Z at 8j for j = ±1, ... , ±n and
wish to predict Z(O). Furthermore, let (mo, Ko) be the correct second-
order structure and (mi' Kd the presumed second-order structure. Table 2
gives results for 8 = 0.2,0.1 and 0.05, n = 2/8 and two different pairs of
autocovariance functions.
The first case compares Ko(t) = e- 1tl with Ki(t) = ~e-2Itl, for which
it is possible to give analytic answers (Exercise6). For now, just consider
the numerical results in Table 2. Note that the values of Eoei/ Eoe~ and
Ei ei/ Eoe~ are both near 1, especially for smaller 8, despite the fact that
a superficial look at Ko and Ki suggests the two autocovariance functions
are rather different; after all, Ko(O) = 1 and Ki(O) = ~. It is helpful
to consider series expansions in powers of It I of the two functions about
the origin: Ko(t) = 1 -It I + ~t2 + 0 (ItI3) and Ki(t) = ~ - It I + t 2 +
o (ItI3). We see that -It I is a principal irregular term for both functions.
The fact that the power of the principal irregular term is 1 for both Ko
and Ki is the key factor in making Eoei/ Eoe~ near 1 for 8 small. The
3.5 Prediction with the wrong spectral density 67
additional fact that the coefficient of the principal irregular term is -1 for
both models is what makes EleV Eoe~ near 1 for 8 small. So, for example,
if we had Ko(t) = e- Itl and K2(t) = e- 2Itl , then EoeV Eoe~ = EoeV Eoe~
but E2e~/ Eoe~ = 2El eV Eoe~ so that the values of E2e~/ Eoe~ are near 2
for 8 small, despite the fact that we now have Ko(O) = K 2(0) = 1. We can
also see the similarities of the autocovariance functions Ki(t) = aile-ailtl
with ao = 1 and al = 2 through the spectral densities: the spectral density
Ii corresponding to Ki is li(W) = l/{7r(a~ +w2)} = 7r- 1 W- 2 {1- a~w-2 +
O(w- 4 )} as W - t 00. Thus, for i = 0, 1, li(W) '" 7r- 1 W- 2 as W - t 00.
The second pair of second-order structures presented in Table 2 is
rno = rnl = 0, Ko(t) = e- ltl (l + ItI) and Kl(t) = ke-2Itl(1 + 21tl). Here we
see that EoeV Eoe~ is extremely close to 1, especially for smaller 8 and that
EleVEoe~ is quite close to 1, although not nearly as close as EoeVEoe~.
Again, the results support the notion that misspecifying the autocovari-
ance function mainly affects the evaluation of mses. To see why these two
auto covariance functions should yield similar linear predictions and assess-
ments of mses, note that Ko(t) = 1 - ~t2 + !ltl3 - kt4 + 0 (ItI5) and
Kl (t) = k- tt2 +! Itl3- tt4 + 0 (ItI5). For both auto covariance functions,
! Itl3 is a principal irregular term. Comparing the models in the spectral
domain, the spectral densities both are of the form 2/(7rw 4 ) + O(w- 6 ) as
W - t 00, so that the similar high frequency behavior for the two models is
readily apparent.
n+1
- - - {Z(l) + Z(-l)} (10)
4n+1
(Exercise 7). The fact that for all n, Z{±l) and Z{±{n-1)/n) appear in the
BLP, whereas Z(±2/n), ... , Z(±(n-2)/n) do not, should seem strange, but
it is a consequence of the lack of differentiability of Kl at 1. Furthermore,
68 3. Asymptotic Properties of Linear Predictors
as n ---4 00, whereas the mse of the BLP under Kl is asymptotically ~n-1
(Exercise 7). To compare BLPs under the two models, let ej(n) be the pre-
diction error using K j and observations Z(±k/n) for k = 1, ... , n. Then
as n ---4 00, Eoeo(n)2 '" n- 1, E1eo(n)2 '" n-l, E1e1(n)2 '" ~n-1 and
EOe1 (n)2 '" ~n-1 (Exercise 7). Thus, even asymptotically, there are non-
negligible differences between both the actual and presumed performances
of the linear predictors under the two models. Note though, that if one uses
Ko when K1 is the truth, the presumed mse is asymptotically correct; that
is, Eoeo(n)2 / E1eo(n)2 ---4 1 as n ---4 00.
The corresponding spectral densities for this situation are fo(w) =
71"-1(1 + W 2 )-1 and It(w) = 71"-1(1 - cosw)/w 2 , so that It(w)/fo(w) =
(1 - cosw){1 + O(w- 2 )} as w ---4 00, whereas in the previous two cases,
It (w)/ fo(w) converged to 1 as w ---4 00. It is instructive to look at the func-
tions corresponding to the prediction errors in this last example. Letting ej
be the function corresponding to ej, Figure 1 plots lej(w)12 for j = 0, 1 and
n = 5. Now 1t(271"k) = 0 for all nonzero integers k, so that e1 has a smaller
mse than eo under It because leI (w W partially matches the oscillations in
It, being small when It(w) is large and vice versa.
This example shows that under fixed-domain asymptotics, it is not
always the case that two autocovariance functions sharing a common prin-
cipal irregular term yield asymptotically the same BLPs. However, it is
still possible that two spectral densities behaving similarly at high frequen-
8.------------------------------------------,
O+---~~_r--=---_,--------,_--~~_r--=---~
truth when Ko was correct, our predictions will not be all that bad but
we may be wildly overoptimistic about their mses. On the other hand, if
K 1 is correct but Ko is presumed to be the auto covariance function of Z,
our predictions are severely inefficient for small {j but at least our presumed
mses are badly conservative, which is usually a more tolerable mistake than
being badly overoptimistic.
I strongly recommend not using auto covariance functions of the form
Ce- at2 to model physical processes. If the discussion of the theoretical
properties in 2.7 does not convince you of this, the mse of 4.79 x 10- 11 in
Table 3 for predicting Z(O) when observing Z(O.4j) for j = ±1, ... ,±20
should. Considering that var{Z(O)} = 1 and that the maximum correlation
between Z(O) and the observations is e- o.42 / 2 = 0.923, this mse is implau-
sibly small for any physical process. Unfortunately, a number of books on
geostatistics (Carr 1995; Christakos 1992; Isaaks and Srivastava 1989; Jour-
nel and Huijbregts 1978; and Kitanidis 1997) suggest Ce- at2 as a sensible
example of an auto covariance function for a mean square differentiable pro-
cess. Furthermore, the Gaussian model is the only model for differentiable
processes available for fitting semivariograms to spatial data in SAS (SAS
Institute, Inc. 1997, p. 626), S+SPATIALSTATS (Kaluzny, Vega, Cardoso
and Shelly 1998, p. 91) and VARIOWIN (Pannetier 1996, p. 50). Goovaerts
(1997) does recognize some of the serious problems with this model but does
not give any alternative models for mean square differentiable processes.
The Matern models (Sections 2.7 and 2.10) include a parameter that con-
trols the differentiability of the process and I recommend their adoption as
an alternative to Ce- at2 as a model for differentiable processes.
Examples of extrapolation
Let us next reconsider the two pairs of auto covariance functions in Ta-
ble 2 when there are only observations on one side of 0 so that we
are extrapolating rather than interpolating. Table 4 compares these two
pairs of second-order structures for predicting Z(O) based on Z( -oj),
j = 1,2, ... ,2/6. For the first case mo = ml = 0, Ko(t) = e-/ t / and
Kl(t) = ~e-2/t/, we see again that EoeVEoe~ and EleVEoe~ are both
near 1, especially for {j small. However, these ratios are not as close to 1 as
in the interpolation case, particularly so for EoeV Eoe~. Exercise 6 gives an-
alytic results for this problem. Note that the prediction problem itself is not
all that much easier in the interpolation case, since, as 6 1 0, Eoe~ rv 6 when
interpolating and Eoe~ rv 26 when extrapolating. What is true is that if we
use the model K(t) = a-1e- a /t /, or equivalently, f(w) = 1/{1r(o? + w2 )},
the value of a is much less critical when interpolating than extrapolating.
The difference between the extrapolation and interpolation problems is
more dramatic for the second pair of second-order structures with mo =
ml = 0, Ko(t) = e-/ t / (1 + It!) and Kl(t) = ge- 2/t / (1 + 2Itl). Table 4 shows
3.5 Prediction with the wrong spectral density 71
that the values of EoeV Eoe~ and El eV Eoe~ are not nearly as close to 1
when extrapolating as when interpolating.
Let us look more carefully at the transition between interpolation and
extrapolation. More specifically, consider what happens when there are
observations at -j/20 for j = 1, ... ,20 and at j /20 for j = 1, ... ,p for var-
ious values of p. For the exponential auto covariance functions a-le-altl,
the transition is immediate: for all p ~ 1, we get the same results as in
Table 2. Table 5 shows that the transition is more gradual for the second
case in Tables 2 and 4. We see that once p = 2, the mse of the optimal
predictor does not change much by adding further observations. The ef-
fect on the misspecification of the mse of the pseudo-BLP as measured by
E1eV Eoe~ -1 settles down at around p = 3. However, the loss of efficiency
in using the pseudo-BLP as measured by EoeV Eoe~ -1 drops dramatically
with every increase in p up through p = 5. Thus, we see that a prediction
location may need to be quite far from the boundary of the observation
region to be fully in the "interpolation" setting.
The spectral densities used are C(K)(K + w 2)-1< for K = 1, 2 and 3, where
C(K) is chosen to make var{Z(O)} = 1. Values for 8 of 0.2, 0.1 and 0.05
are considered. For all of these examples, no matter what the value of 8
and what two values of v correspond to the true and presumed spectral
densities, EoeU Eoe~ is larger when extrapolating than interpolating, par-
ticularly so for smaller 8. (It is not true that BoeU Eoe~ is always larger
when extrapolating; see Exercise 11 in 3.6 for an example.) Another ap-
parent pattern is that the penalty due to using the wrong spectral density
is smaller if the presumed spectral density decays more quickly at high fre-
quencies than the actual spectral density rather than the other way around.
1.0
y"
"""\"
\"
\"
,,"\"
,,",,"
0.5
" ,"'-
0.0 +-----r------.----=:;:=-=":::-"=-"~
o 1 2 3 4
t
FIGURE 2. Plots of auto covariance functions used in Tables 6 and 7. Solid line
corresponds to K = 1, dashed line to K = 2 and dotted line to K = 3.
3.5 Prediction with the wrong spectral density 73
TABLE 6. Mean squared errors for predicting Z{O) based on Z{6j) for j =
1, ... ,2/6 (extrapolation) and based on Z{±Oj) for j = 1, ... ,2/6 (interpolation)
under three spectral densities and for 6 = 0.2, 0.1 and 0.05. The three spectral
densities are f{w) = c{It){1t + w2 )-tt for It = 1,2,3, where c{lt) is chosen to make
var{Z{O)} = 1. See Figure 2 for plots of the autocovariance functions.
Presumed value of K
True K 1 2 3 6
Extrapolation
1 0.3295 0.653 1.999 0.2
0.1813 0.417 1.572 0.1
0.0952 0.238 1.002 0.05
2 8.72 x 10- 2 3.25 X 10- 2 5.59 X 10- 2 0.2
2.55 x 10- 2 5.33 X 10- 3 1.09 X 10- 2 0.1
6.92 x 10- 3 7.65 X 10- 4 1.74 X 10- 3 0.05
3 6.48 x 10- 2 8.75 X 10- 3 4.39 X 10- 3 0.2
1.80 x 10- 2 7.58 X 10- 4 2.24 X 10- 4 0.1
4.75 x 10- 3 5.59 X 10- 5 9.02 X 10- 6 0.05
Interpolation
1 0.1974 0.2483 0.3195 0.2
0.0997 0.1256 0.1620 0.1
0.0500 0.0630 0.0813 0.05
2 1.23 x 10- 2 6.13 X 10- 3 6.79 X 10- 3 0.2
1.71 x 10- 3 7.83 X 10- 4 8.68 X 10- 4 0.1
2.25 x 10- 4 9.84 X 10- 5 1.09 X 10- 4 0.05
3 3.40 x 10- 3 2.99 X 10- 4 2.43 X 10- 4 0.2
2.54 x 10- 4 9.82 X 10- 6 7.90 X 10- 6 0.1
1.73 x 10- 5 3.11 X 10- 7 2.49 X 10- 7 0.05
74 3. Asymptotic Properties of Linear Predictors
Exercises
6 For a process Z on JR, consider the two models (0, Ko) and (0, K 1 ),
where Ki(t) = ai1e-a,ltl. If Z is observed at ±8j for j = 1, ... , n, then
by a trivial extension of Exercise 3, the BLP of Z(O) under (0, K i ) is
~ sech(6ai){ Z(O) + Z( -o)}. Calculate Eie; and use the result to show
that as 0 ! 0,
Eie~ = 0 - i 2
03 + 0(04 )
and
so that
and
E 2 2 2
~ = 1 ao - a l 62 0(63)
E Oel2 + 3 +.
If, instead, Z is observed at just -OJ for j = 1, ... ,n, the BLP of Z(O)
under (0, K j ) is e- ajo Z(O). Show that as 0 ! 0,
Conclude that
1:
Z6(j) = Z(8j), so that ZI5 is a process on Z. Then for all integers j and k,
17r / 6 00
1:
= -7r/15 exp{ iw8(j - k)} e].;oo few + 27f£8- 1 ) dw
where
(11)
Thus, we can view P as the spectral density on (-7f, 7f] of the process Zti
on Z and, indeed, weakly stationary stochastic processes on the integers
are generally taken to have spectral distributions on (-7f, 7f]. To avoid any
possible ambiguities, I always use a -to indicate a spectral density on
(-7f,7f] for a process on Z.
An interpolation problem
Dropping the 8s for now, let Z be a mean 0 weakly stationary process on
the integers with spectral density ion (-7f, 7f] and consider the interpola-
78 3. Asymptotic Properties of Linear Predictors
tion problem of finding the BLP of Z(O) based On observing Z(j) for all
integers j ::j= O. Since the BLP will obviously have mean 0 (Ao = 0 in the nO-
tation of 1.2), we can restrict attention to predictors with mean 0 in order
to find the BLP. The space of linear predictors with mean 0 is equivalent
to the closed real linear manifold £-0(1) of functions exp(iwj) On (-7f, 7f]
for integers j ::j= 0 under the inner product defined by the spectral density
j. The BLP of Z(O) corresponds to the function H E £-o(]) satisfying
Drr {I - H(w)} exp( -iwj)](w)dw = 0 for all integers j ::j= O. By the com-
pleteness and orthogonality of the functions {exp(iwj)}jEZ in the space
of square integrable functions on (-7f, 7f] (see, for example, Akhiezer and
Glazman 1981), we must have {1- H(w)} ](w) constant almost everywhere
on (-7f, 7f]. Suppose 111 is integrable on (-7f, 7f], so that, in particular1
is positive almost everywhere on (-7f, 7f]. It follows that H must be of the
form H(w) = 1 + bl j(w) for almost every wE (-7f, 7f] for some constant b.
But HE £-0(1) implies Drr
H(w)dw = 0, so
27f
H(w) = 1 --
A
rr -
f(w) 1-rr f(II)-ldll
. (12)
The mean square prediction error is then 47f2/ J::rr ](w)-ldw. Furthermore,
when l(w)-l is not integrable, the mean square prediction error is 0 (Ex-
ercise 13). Hannan (1970, page 164) gives a generalization of these results
to vector-valued time series. Defining ei to be the prediction error under
Ii,
2 47f2 Drr{/0(w)ll1(w)2}dw
Eoe 1 = 2 ' (13)
{ J::rr ]1 (W )-ldw }
which we need to assess the effect of using the wrong spectral density for
prediction.
An extrapolation problem
Next consider the classical extrapolation problem of finding the BLP of
Z(O) when observing Z(j) for all negative integers j. This problem is math-
ematically more difficult than the interpolation case and I only present its
solution; details are provided in many sources including Hannan (1970),
Priestley (1981) and Yaglom (1962). Let £_(/) be the closed real linear
manifold of exp(iwj) for integers j < 0 with respect to the inner product
1:
1
defined by and define
:c "fa(j)zj = exp{c(o)
V 211" j=O
+ 2 "fc(j)zj},
j=l
1 /,.. -
c(j) = -4
11" _,..
exp( -iwj) logf(w) dw
for j =f 0 and
a
!<C. = exp{ c(O)}.
V 211"
For our purposes, we only need this result to obtain Eoe~. Using the
subscript k on ak, ak(j) and Ck(j) to indicate that these quantities are
ik,
i:
defined in terms of we have
Eoe~ = io(w)l~al(j)eXp(-iWj)I-2dw
1 171' N(w)
E oeo(ex,8)2 = 271'c8",-1 exp { 271' -71' log c~"'-1 dJ.JJ
} (14)
The second integral on the right side of (15) converges to J.:71' logry",(w) dw
since the integrand converges to logry",(w) for all w E (-71',71'] other than
w = 0 and is dominated by the integrable function Ilogry",(w)l. Applying
these results to (14) gives
Of course, the right side of (18) must be no greater than the right side of
(17), which can be directly verified by Jensen's inequality. A less trivial
consequence of (17) and (18) is that the mse is of order 8",-1 whether
extrapolating or interpolating.
We can now look at the behavior of the prediction errors in the spectral
domain. Let F be a positive finite measure on R with density f and let
C(F) be the class of functions square integrable with respect to F. Suppose
Vo(w; ex) is the function in C(F) corresponding to the prediction error for
the extrapolation problem with spacing 8 and spectral density f, so that
Ee(ex,8)2 = J~oo IVo(w;ex)1 2f(w)dw. It follows from Theorem 2 that
(Exercise 15). For two positive functions a and b on some set D, write
a(x) x b(x) if there exist positive finite constants Co and C1 such that Co ~
a(x)/b(x) ~ C1 for all xED. If few) x (1 + Iwl)-<>, then for Iwl < 7r6- 1 ,
(Exercise 15). For the interpolation problem, let Vo(w; in) be the function
in C(F) corresponding to the prediction error. From (12),
(Exercise 16). Equations (20) and (22) agree with (8) and (9) in 3.4 for
few) = (1 + W2)-1 and support the numerical results in Table 1 of 3.4 for
few) = (2+W 2)-2. For fixed T, when extrapolating, the rate of convergence
of the fraction of the variance of the prediction error attributable to the
frequencies [-T, TJ is linear in 0 as 6 ~ 0 irrespective of Q. In contrast,
when interpolating, this rate of convergence is of order 6<>+1 as 0 ~ o. Thus,
the low frequencies make much less of a contribution when interpolating,
especially if the process is smooth.
- ~ ex ~ -6}-6
{ 17r-7r 10g f1jg(w)
(w) dw 17r fo (w) dw
- 27r P 27r -7r !few)
82 3. Asymptotic Properties of Linear Predictors
and
) -1 exp {- 171"
TOI (ex, 6 '" "Ia1
1 log - - (W)}
(-) rh.J log 6- 1 (24)
7r 27r -71" "lao W
and fOT 0:1 < 0:0 - 1,
(25)
Furthermore, if fo/ Rand 1/ fo are in Ltoc' then fOT 0:1 > (0:0 - 1)/2,
(.
TOI In,u -
C) J::7I""Iao(W)"Ia1(W)-2dwJ::7I""Iao(W)-ldw
2 ' (26)
{J::7I" "Ia1 (w)-ldw}
(27)
Before proving these results, some comments are in order. For both in-
terpolation and extrapolation, we can use II rather than the correct fo and
still get the optimal rate of convergence for the mse as long as 0:1 is not
3.6 Theoretical comparison of extrapolation and interpolation 83
too much smaller than ao; that is, the process is not too much rougher
under h than under fo. However, for extrapolation, we need a1 > ao - 1
for this to hold, whereas for interpolation, we only need a1 > (ao - 1)/2.
A second point is that whenever the optimal rate is obtained, the limit
for either T01 (ex, 15) or T01 (in, 15) depends on fo and II only through ao
and a1 and is independent of the behavior of fo and II on any bounded
set. Thus, the range of problems for which the low frequency behavior has
asymptotically negligible effect as 15 ! 0 is larger when interpolating. When
a1 > ao - 1, so that both interpolation and extrapolation give the asymp-
totically best rate, Table 8 shows that the limit of TOl (in, 15) tends to be
much smaller than TOl (ex, 15). A reasonable conjecture is that for all ao
and a1> lim6!0 TOl (ex, t5)/T01 (in, 15) 2:: 1. We do know that this limit is +00
whenever a1 ::::; ao - 1.
Let us now return to Tables 6 and 7 and see how these results relate to
Theorem 3. First, as I have already noted, Theorem 3 does not directly
apply to the setting in Tables 6 and 7 in which there are only a finite
number of observations. However, numerical calculations show that the
results in these tables do not noticeably change by increasing n. For inter-
polation, Theorem 3 suggests that the ratios EoeV Eoe~ should tend to a
finite constant except when fo(w) = (3 + w2)-3 and h (w) = (1 + w2)-1,
in which case, it should be proportional to 15- 1 • The numerical results fit
these patterns well, particularly for a1 > ao, when the dependence of
EoeV Eoe~ on 15 is extremely weak. For extrapolating, Theorem 3 suggests
the ratios tend to a finite constant in those entries above the main diag-
onal of the table. When fo(w) = (3 + w2)-3 and h(w) = (2 + w2)-2, or
fo(w) = (2+W 2)-2 and h(w) = (1+W2)-1, then the ratio should grow like
15- 1 . When fo(w) = (3 + w2 )-3 and h(w) = (1 + w2 )-I, the ratio should
grow like 15- 3 • Although the numerical results roughly correspond to these
patterns, the agreement is not nearly as good as when interpolating. It
appears that the asymptotics "kick in" for larger values of 15 when interpo-
lating than when extrapolating, providing another argument for the greater
relevance of shrinking interval asymptotics for interpolation problems.
PROOF OF THEOREM 3. The proofs of (23) and (26) are similar to that
of (17) (Exercise 17). When a 1 < ao - 1, (25) follows from
15010 - 011 exp {~171" log ~~(w) dw} ---T C1 exp {~171" log 11
01 1 (w) dw}
271" -71" fo (w) Co 271" -71" 11
01 0 (w)
and
84 3. Asymptotic Properties of Linear Predictors
TABLE 8. Limiting values of rOt (in, 8) and rOl (ex, 8) as given by Theorem 3 for
various values of ao and at. For each ao, the largest value for at is ao + 4.8,
which facilitates comparisons of how these limits depend on at - ao as ao varies.
ao
2 4 6
al in ex in ex in ex
and
8-2 r !!(w) dw
J-1r 11 (W)2
~ roo lo(w)2 dw
Loo /I(w)
3.6 Theoretical comparison of extrapolation and interpolation 85
f '" .r..~(w)
-6
dw'" 2 C0 810g8- l ,
-",fl(w) Cl
which is a consequence of
f_'"'" I~8(w)
Jf(w)
-
1+
COlC1
Iwl/8
Idw = 0(8 log 8- 1) (29)
jf(w) = cl(8)8- 1 . f
)=-00
h (w +827fj )
in (30) and (31), where cl(8) = co(27f/8)<>'-<>o. If folh E L~oc' then for
0:1 > 0:0 - 1,
86 3. Asymptotic Properties of Linear Predictors
for a1 < ao - 1,
r01 (ex, 8)
- 1 rv (d 1 - do)2
47rC2
[1 7r
-7r
'fl,6(w)2 dw
'fla w()2
_~
27r
{1 7r
-7r
'fl,6(w) dw}2] 82 (,6-a)
'fla(w)
(32)
3.6 Theoretical comparison of extrapolation and interpolation 87
-00
{fo(w) fo(w)}
h(w) -I-log h(w) dw. (33)
(34)
.)
r01 (In,8 - 1 '"
{C 1 00
-00
{fo(w) - h(w)F
fo(w)h(w)2 dw
/111'
-11' 'f/a ()-1
W dw
} 8a+1
.
(35)
Theorem 6. Under the same conditions on fo and h as in Theorem 5,
for (3 < a + 1,
The proofs of (32) and (33) are given at the end of this section; (34)-(39)
are left as exercises. Note that the conditions given imply that all of the
integrals in (32)-(39) are well defined and finite. The condition fi(w) ;::::
(1 + Iwl)-a for all wand i = 0,1 is stronger than necessary but simplifies
the proofs considerably.
The results for interpolation and extrapolation have a number of features
in common and one major difference. In both cases, the relative increase in
mse due to using h rather than fo, given by r01 (-,8) - 1, is of order 82 (f3- a )
when (3 is not too large. Furthermore, again when (3 is not too large, for
88 3. Asymptotic Properties of Linear Predictors
of 8,
1 -2e- 3t + e- 4t
and
1 -e- 2t
(40)
E1er 1 - e- 4t
Eoer = 2(1 - 2e- 3t + e- 4t ) .
For t = 0.5, this gives Eoei/Eoe5 = 1.0901 and E1ei/Eoer = 0.6274. This
example is reconsidered in 4.4.
PROOF OF (32) AND (33). Let us first consider why these results are
plausible. Write
-6
~o (w) = 1 + 8{3-a R(w) + S6(W)
Jf(w) ,
where
R(w) = do - d1 • 1]{3(w)
C 1]a(w) ,
so that for any fixed w f. 0 in (-7r, 7r], S6(W) = 0(8{3-a). If S6(W) were
0(8{3-a) uniformly in w E (-7r, 7r], then we could say
suggesting
TOl(ex, 8) - 1 rv
82 ({3-a) [
47r
r R(W)2dw
L7r 1
- 27r
{r
L7r R(w)dw
}2] ,
which is the same as (32). This argument overlooks what happens for w
near 0 and gives a wrong result for f3 > Q: + ~. However, this calculation
does correctly suggest that frequencies not too near the origin contribute
a term of order 82({3-a) to TOl(ex,8) - 1. For (3 > Q: +~, 82({3-a) = 0(8),
which suggests the following heuristic justification for (33).
Tal (ex, 8)
90 3. Asymptotic Properties of Linear Predictors
X [1 1
+-2
71"
1 61/2 {
_6
~~(w)
-/j
11 (w)
-1
}]
dJJJ + 0(6)
1
1/ 2
= exp { - -
6 6- 1 2
/
log - - dJJJ }
lo(w)
271" _/j-1/2 It(w)
6 00
{/o(w) lo(w) } (42)
= 1 +271" -00 It (w) - 1 -log It (w) dJJJ + 0(6),
where I have used R(w) '" 6- 1 /i(w/6) uniformly for Iwl < 61/ 2 as 6 t
O. Note that it is important to combine the two integrals before letting
the limits of integration go to ±oo since IJ.:o [{fo(w)/ It(w)} - 1) dJJJl and
IJ~ooIOg{fo(w)/It (w)} dJJJl are +00 for (3 ~ a+ 1 but J~oo[{/o(w)/It (w)}-
1 -log{fo(w)/ It(w)})dw is finite (and nonnegative) for all (3 > a +~.
To provide a rigorous proof of (32), we need to consider the behavior of
S/j(w) more carefully for (3 < a+~. First, for Iwl < 6, S/j(w) «1+18/wl.B-0!
so that J~/j IS6(W)ldJJJ «6. Next, define
l~'W'~7r ISo(w)1 dw
« 8(3-0. 17r Iwlo.-(31 ~ 1(3 Ipg(w) - pf (w) Idw + 82«(3-0.)
= 0(8(3-0.),
Since i8(w) ~ N(w), the first term on the right side is 0(8) and
17r
8
log ~8(w) dw -
ff(w)
r {8(3-0. R(w) + S8(W) - ~82«(3-0.) R(W)2} dw
Jo
« 1 8
{8(3-0.IR(w)1 + IS8(W)1 + 82«(3-0.) R(W)2} dw
18 1 2
/
_0 ' / 2
log {~8(W) fr(W/8)} dw«
n(w) fo(w/8)
r
Jo
8 8Udw + 18
8
1 2
/ wUdw = 0(8)
and
«17r {8
8'/2
2 «(3-U) R(w)2 + S8(W)2} dw = 0(8).
Moreover,
1 8' / 2
_8 ' /2
log fo(w/8) dw
fr(w/8)
= 81 8- 1 / 2
_8-1/2
log fo(w) dw
fr(w)
92 3. Asymptotic Properties of Linear Predictors
so that
-0 0- 1 / 2
7r log ~o (w) dJ..; - 81 log fo(w) dJ..;
1
-7r Jt(W) _6- 1/ 2 h(w)
-2 r
}6 1 / 2
{8,8-aR(w) + 8 0 (w)}dJ..; =0(8).
Similarly,
6- 1 / 2 /
Now, L6-1/210g {Jo(w)1 h(w)} dJ..; = 0(8- 1 2), so
r01(ex,8) = [ 1- -2
7r
8 1 0- 1/2
_6-1/2
fo(w)
( dJ..;
log-f )
I w
-.! r
7r }61/2
{8,8-aR(w) + 8 6(w)} dJ..; + 0(8)]
+.!
7r
r
}01/2
{8,8-0 R(w) + 8 0(w)} dJ..; + 0(8)]
1 2
=1+~1 0- / {
fo(w) _1_log fo ( w) } dJ..;
27r _0-1/2 h(w) h(w)
_ [.! r
7r } 01/2
{8,8-a R(w) + 8 0 (w)} dw] 2 + 0(8)
= 1+ ~1 00
{fo(w) _ 1 _log fo(w) } dw + 0(8)
27r -00 h(w) h(w)
proving (33). o
Exercises
11 Give an example of a pair of auto covariance functions for a mean 0
weakly stationary process for which rOI(in, 1) = 00 but rOI(ex, 1) < 00.
12 Suppose Z is a mean 0 weakly stationary process with triangular au-
to covariance function K (t) = (1 - It I)+. For n a positive integer, find
the BLP for Z(O) when Z is observed at j In for all integers j -::f O. Do
the same when Z is observed at j In for all integers j < O.
3.6 Theoretical comparison of extrapolation and interpolation 93
/
11'
-11'
{ 1
P(w)+27f(7'~
}-1 - (7'~.
d!.JJ
Next, suppose that f(w) ,..,. cw-o. as w - 00 for some c > 0 and some 0: >
1. Then for any fixed w E (-7f, 7f] other than w = 0, p(w) ,..,. COo.-17]0.(w)
as 0 ! 0, where 7]0. is defined as in (16). Thus, a plausible conjecture is that
if (7'~ = 0(00.-1) as 6 ! 0, the measurement error will have asymptotically
96 3. Asymptotic Properties of Linear Predictors
471"2 C8o - 1
Eoeo(8)2 rv EOel(8)2 rv Elel(8)2 rv f~11" "lo«w)-ldw .
Eoeo(8)2 rv M o - 1 [
I: {I 271"
+ 2;C "lo(w) }
-1
dw
- 1] , (43)
I:
and if 8CT~ ~ 00 as 810, then
Exercises
28 Prove Theorem 7.
29 Prove (43).
30 Prove (44).
31 Prove (45) and (46). Show that (46) still holds if one observes Y6{j)
for all integers j.
32 Suppose that f is as in Theorem 7 and that observations are restricted
to those YoU) for which j satisfies 0 < 81J1 < a for some fixed a > O.
Show that if a~ = r / 8 for some fixed r > 0,
liminfEoeo(8)2 >
o!o
rjOO
-00 r
f{Wj{) rkv.
+ 211" W
r
JAd(li- 1)
eXP(-i8w Tj ){ L
kEZ d
H(w + 27r8- 1 k)r(w; k) - Hli(W)}Fo(dW) =0
3.8 Observations on an infinite lattice 99
for all j E Zd. By a basic theorem in multiple Fourier series (Stein and
Weiss 1971, p. 248),
Furthermore, if
(50)
then
IB lV(w)1 2 f(w) dw
sup IIVII} = Mo(F, B), (51)
100 3. Asymptotic Properties of Linear Predictors
0= [ V(W)Vli,B(W)f(w) dw
JR,d
= L IV(w)1 2 f(w) dw
+ E'
j
1B
V(w + 27r8- 1 j)Vo,B(W)f(w + 27r8- 1j)dw,
so that
L lV(w)1 2 f(w) dw
$ ~'
J
L I
IV(w + 27r8- 1 j)Vli,B(W) f(w + 27r8- 1j) dw
U( 2 8- 1 .) _ { 1, for j = 0, wE B€;
w + 7r J - 1 _l/r(w), for j =I- 0, wE BE
3.8 Observations on an infinite lattice 101
= F(B,) + L. -I}
{rL) f(w)dw
F(B f )
-< Mo F,B)
( - to
.
Thus, IB IU(w)1 2 f(w) dw/IIUII~ ~ Mo(F, B) - to, which implies (51) since
to can be taken arbitrarily small. 0
(53)
where ~d(Q) = E~ Ijl-a (Exercise 35). We see that for r fixed and 8 ! 0,
Mo (F, bd (r)) tends to 0 more quickly when f tends to 0 more quickly at
high frequencies. As an extreme example, suppose f has bounded support
B, in which case, the process is said to be bandlimited. If B is contained in
A d (8- 1 ) then Mo(F, B) = 0, which implies IB
lV(w)1 2 f(w) dw = O. Thus,
IIVII~ = 0 since f has 0 mass outside B so that Z may be recovered without
error at all x. We have just proven a simple version of what is known as
the sampling theorem for random fields (Jerri 1977).
uniformly similar linear predictions for small 8. Let 1Lli(Fi ) be the set of
those h in 1t(Fi ) for which Eiei(h, 8)2 > 0 and £_li(Fi ) is the corresponding
set of functions.
Theorem 10. For some c > 0, suppose h(w)/fo(w) --+ C as Iwl --+ 00,
fo ;: : : h, !1 is bounded away from 0 on any bounded set and !1 (w) --+ 0 as
Iwl --+ 00. Then
(54)
(55)
and
(56)
PROOF. The result for general c is a trivial consequence of the result for
c = 1, so assume c = 1. Since fo ;: : : h, 1t(Fd = 1t(Fo) as sets, so the left
sides of (54)-(56) are well defined. For H E £(Fi ), define
Lj fi(W + 211"8- 1j)H(w + 211"8- 1j)
Hli,F;(W) = Lj h(w + 211"8- 1j) - H(w),
:s; 1
Ad(li- l )
!1(w)It/J(w)II Hli,Fl (w)1 2dw + m(8-1)IIHli,FJ~l' (57)
1 Ad(li- l )
h(w)It/J(w)IIHli,Fl(w)12dw
1
J j
max(l, b) 2 2
::; h(w)'I/J(w) IHo,F, (w)1 dw, (60)
a Ad(o-')
so that for H E C._ O(F1 ),
IIHo,F, - Ho,FolI~o
{u(8- 1 ) _£(8- 1 )}2 2
< 2 {I + u(8- 1)} {I + £(8-1)} IIHo,FoliFo
+ 2 max(l, b)
a
r
JAd(o-')
h(w)'I/J(w? IHo,F,(w)1 2 dw. (61)
from (54) and (55) (see the proof of the asymptotic efficiency of the pseudo-
BLP in Theorem 8 of Chapter 4). The bound in (61) is needed to obtain
the rates of convergence in Theorem 12. 0
r
JAd(6-
b(w)u(w) IHo,F1 (W)1 2 dw
1
1)
+ D (~).e
7r
r
JAd(O-l )\bd(7r6- 1)
b(w) 1Ho,Fl (w)1 2 dw
~D 1 7r6- 1 0 .e
(1 + r)-.ep(r) dr + D (;;:) II Ho,F II~l' 1
where
p(r) = r
Jabd(r)
b(v) IH 6,F, (v)1 2 JL(dv)
and JL(dv) indicates the surface measure on 8bd (r). Defining P(r) =
J; p(s) ds, Theorem 9 implies P(r) ~ Mo(Fl. bd(r»IIHo,F"'~,. By defini-
tion, P(7rO- 1 ) ~ IIHo,F"'~l. Integrating by parts,
1 7r6- 1
(1 + r)-.ep(r) dr
3.8 Observations on an infinite lattice 105
= (1 + i) (3 f3
P(7r6- 1) + 10'11"0- (1 + r)-(3-1 P(r) dr
1
6 71"0- 1
~ { (:;;:) + f31
(3 }
(1 + r)-(3-1 Mo(FI, bd(r» dr IIHo,FJ}l.
(63)
a: = 2'Y.
PROOF. To prove (62), just apply Lemma 11 and the bound on 11/J1 to
(57). To prove (63), apply Lemma 11 and the bound on 1/J2 to (61). 0
Except possibly in the case a: = 'Y in (62) and a: = 2'Y in (63), these
bounds are sharp in the sense that there exist fo and JI satisfying the
stated conditions for which both conclusions are false if 0(·) is replaced
by 0(·) (Stein 1999). Stein (1999) also gives some analogous results for
a process observed unevenly on a bounded interval, but the arguments
are much more difficult and the conditions on the spectral densities are
somewhat restrictive. The general approach is similar to that taken here
in that the key result is a bound of the type given by Theorem 9 on the
fraction of the mse of a BLP attributable to a given range of frequencies.
decomposed into a term giving the effect of just misspecifying the covari-
ance function and a term giving the effect of just misspecifying the mean
function. In light of these decompositions, let us next consider the effect of
misspecifying just the mean function on linear prediction.
Theorem 13. Suppose (mo,Ko) = (mo,K) and (ml,Kl ) = (ml,K),
where K is an autocovariance function with spectrum F possessing spectral
density f bounded away from 0 on any bounded set and few) ---. 0 as
Iwl ---. 00. If m = ml - mo is square-integrable and of the form m(x) =
JIRd exp( -iw T x)~(w) dw, where
r 1~(wW
i[tdfew) dw < 00,
(66)
then
·
11m Eo{eo(h,8)-el(h,8)}2 0
sup = .
610 hE 1L 6(Fo) Eoeo(h,8)2
PROOF. It is a simple matter to show that EOel (h, 8)2 is unchanged
by subtracting the same fixed function from mo and ml, so there is
no loss in generality in taking mo identically O. Next, for any V E
C(F), the mean of the corresponding random variable in 'Jt(F) is
~d ~(w)V(w) dw (Exercise 37). By (66), for E > 0, we can choose r, so that
Jbd(r,)C 1~(w)12 f(w)-ldw < Eo Using Theorem 9, V E C(F) and orthogonal
to C6 (F) imply
+ 21bd(r,)c
1~(w)12
few)
dw r f(w)lV(w)1 2dw
i[td
r: r
absolutely continuous and has almost everywhere derivative m(p) satisfying
{m(P)(t) dt < 00
(see 1II.4.1 of Ibragimov and Rozanov (1978)). The fact that m must be
square integrable eliminates such obvious candidates for a mean function
as a nonzero constant. Results in 4.3 show that if the observations and
predictands are restricted to a bounded region, it is generally possible to
obtain asymptotically optimal predictions if the mean is misspecified by a
nonzero constant. Thus, for assessing the effect of a misspecified mean, the
infinite lattice setting is somewhat misleading as to what happens in the
fixed-domain setting.
Under stronger assumptions on ~ we can obtain rates of convergence in
Theorem 13. Specifically, suppose f(w) x (1 + Iwl)-<> for some a > d and
1~(w)12 / f(w) « (1 + Iwl)-d-'Y for some 'Y > 0, so that (66) holds. Larger
values of 'Y correspond to smoother mean functions m.
Theorem 14. If, in addition to the conditions of Theorem 13, f(w) x
(1 + Iwl)-<> and 1~(wW / f(w) « (1 + Iwl)-d-'Y for some 'Y > 0,
[
/JlRd ~(w)V(w) dwl2 :5 21 Jbd(1f6-
[ 1 ~(w)V(w) dW/ 2
)
[
IJbd(1f6-1 ~(w)V(w) dwl2
)c
{r ~(w)V(w) dW}2
lbd(7ro- 1 )
« { (69)
8a(log8)211V11~, a = 7.
Theorem 14 follows from (67)-(69). o
Exercises
33 In the proof of Theorem 9, verify that Mo(F, B) < 1 and B symmetric
imply Vo,B E .co(F).
34 In the proof of Theorem 9, verify that U E .c(F), U is orthogonal to
.co(F) and 11U11~ > o.
35 Prove (53).
36 Prove (64) and (65).
37 Using the definitions in Theorem 13, show that for any V E
.c(F), the mean of the corresponding random variable in 1f.(F) is
Irrt ~(w)V(w) dw.
d In addition, show that /flR ~(w)V(w) dW/
d is finite.
38 Suppose Z is a mean 0 weakly stationary process on IR with spectral
density f(w) = l{lwl < 7r}. Show that if Z is observed on Z then the
observations are all uncorrelated and yet perfect interpolation at all
t E IR is possible.
4
Equivalence of Gaussian Measures
and Prediction
4.1 Introduction
The basic message of the results of 3.8 is that for interpolating a mean 0
weakly stationary random field based on observations on an infinite square
lattice, the smaller the distance between neighboring observations in the
lattice, the less the low frequency behavior of the spectrum matters. This
suggests that if our goal is to interpolate our observations and we need to
estimate the spectral density from these same observations, we should focus
on getting the high frequency behavior of the spectral density as accurately
as possible while not worrying so much about the low frequency behavior.
Supposing that our observations and predictions will all take place in some
bounded region R, a useful first question to ask is what can be done if we
observe the process everywhere in R. Answering this question will put an
upper bound on what one can hope to learn from some finite number of
observations in R.
One simple way to formulate the question of what can be learned from
observations on R is to suppose that there are only two possible probability
measures for the process on R and to determine when one can tell which
measure is correct and which is not. For example, consider a mean 0 Gaus-
sian process on lR with two possible auto covariance functions: Ko(t) = e- 1tl
and Kl(t) = ~e-2Itl. Ifwe observe this process for all t E [0, T] with T < 00,
then it turns out that it is not possible to know for sure which autocovari-
ance function is correct, despite the fact that we have an infinite number
of observations. Fortunately, as demonstrated in 3.5, these models can give
110 4. Equivalence of Gaussian Measures and Prediction
(Exercise 1).
Since Gaussian random fields are determined by their first two moments,
all statements about their equivalence or orthogonality can also be written
in terms of the first two moments. For example, as a simple application of
(1), consider a random field Z on some set R with two possible Gaussian
measures Po and Pl' If there exists a sequence of linear combinations Yn =
Ej:l AjnZ(Xjn), Xln ,· .• , xrnn E R such that
(2)
or
(3)
then Po 1.. PI follows from (1) (Exercise 2). For example, suppose Z is a
mean 0 stationary Gaussian process on JR, R = [0,1], Ko(t) = e- 1tl and
KI(t) = e- 1tl (1 + ltD so that Z is mean square differentiable under PI but
not Po. Let Y n = Z(n- l ) - Z(O). Then the limit in (2) is 0 and Po 1.. Pl.
More generally, suppose Z is a mean 0 stationary Gaussian process on
JR with spectral density h under Pj and R = [0,1]. If Z is not infinitely
mean square differentiable under Jt and fo(w)/Jt(w) --.. 0 as w --.. 00, then
Po 1.. Pl. To prove this, note that the condition on Jt implies there exists a
positive integer p such that J:O
w2p Jt(w)dw = 00 (see Section 2.6). Define
the linear operator fl.. by fl..Z(t) = c l {Z(t + f) - Z(t)}. Then
E; {(.:I..)PZ(On' ~ E; { ,~ ~ m( -1)P-'Z(k')}'
Given fJ > 0, we can choose T such that fo(w)/ Jt(w) < fJ for Iwl > T. As
f! 0,
r
J1wl<T
{~sin(W2f)}2Ph(W)dw--"
f
r
J1wl<T
w2Ph(w)dw<00
for j = 0, 1 and
r
J1wl>T
{~sin (W2f) }2P Jt (w) dw
f
Thus,
since lo(w)/ hew) < 0 for Iwl > T, so Po 1- PI follows by the arbitrariness
of 0 and (2).
If mo = ml and 10 x h, then neither (2) nor (3) can happen. However,
we may still be able to prove orthogonality of Gaussian measures by con-
sidering sequences of sums of squares of linear combinations of Z. Suppose
Z is a stationary Gaussian process on [0,1] with auto covariance function
K satisfying K(t) = C - Dltl + o(ltl) as t ~ a for some D > O. Define
(4)
Then
=2n{K(0)-K(~) f
+ n-l
~(n - j) {2K
( j;;:) - K (j+l)
---:;;- - K (j_l)}2
---:;;- (5)
L2 L2
GR(O,Kd, since under GR(O, K o), Un -+ Do and under GR(O,Kd, Un -+
DI ·
Although assuming K j has a bounded second derivative on (0,1] sim-
°
plifies matters considerably, the following result implies that Kj(t) =
Cj - Djltl + o(ltl) as t -+ for j = 0,1 with Do and DI unequal and
positive is sufficient to conclude GR(O,Ko) ..1. GR(O,KI) for R any interval
of positive length (Exercise 4).
° °
D> and Un as defined in (4), var(Un) -+ as n -+ 00.
°
Theorem 1. For a mean 0 stationary Gaussian process on lR with auto-
covariance function satisfying K(t) = C - Dltl + o(ltl) as t -+ for some
1tR(m, K) be the closure of 1t~ with respect to the norm given by second
moments under (m, K). The continuity assumptions about mj and K j im-
ply that 1tR(mj, K j ) is separable for j = 0,1 (Exercise 10). If there is a
basis for 1tR(mo, Ko) and 1tR(ml, Kd that is linearly independent under
one of the covariance functions but not the other, then trivially Po ..1 PI, so
assume there exists hI, h2' ... in 1t~ forming a linearly independent basis
for both 1tR(mo, Ko) and 1tR(ml, K I ). For example, taking hi = Z(Xi)
with XI,X2, .•• dense in R yields a basis for 1tR(mj,Kj ) and if the XiS are
distinct, the hiS will commonly be linearly independent (although see Exer-
cise 16 in 2.7). Ibragimov and Rozanov (1978, Lemma 1 of Chapter 3) show
that two Gaussian measures on the a-algebra generated by hI, h 2, ... are
equivalent if and only if they are equivalent on the a-algebra generated by
Z(x) for X E R, so I do not explicitly consider the distinction between these
two a-algebras subsequently. To determine when Po and PI are equivalent
or orthogonal (or neither), it makes no difference if we subtract a sequence
of constants from the hiS, so without loss of generality, assume Eohi = 0 for
all i. Now we can linearly transform hI, . .. ,hn to hIn, ... ,hnn such that
for j, k = 1, ... , n,
(8)
EologPn =! t{-lOga~n -
2 k=l
+
0' kn
+ 1- (mkn)2},
akn
I n (1 -akn
2)2 + 2mkn
2
varO (1ogPn ) = 2" L..J
~
4 '
k=I a kn
116 4. Equivalence of Gaussian Measures and Prediction
EllogPn = "21~( 2
~ -log(}"kn + (}"kn2 - 1 + mkn
2)
k=l
and
varl(logpn) = ~ t
k=l
{(1- (}"~n)2 + 2(}"~nm%n}' (9)
= "2 Ln
1 2 1 2 mkn
(
(}"kn + (}"2 - 2 + mkn +~
2 )
. (10)
k=l kn kn
PROOF. (Ibragimov and Rozanov 1978, p. 76). From (10) and the mono-
tonicity of Tn, either infk,n (}"~n = 0 or sUPk,n (}"~n = 00 implies both that
Tn ~ 00 and, from (2), Po 1.. Pl. Thus, from now on, suppose
(11)
so that
and
n
;::: Tn ;::: L {(1 - (}"~n)2 + m%n} .
(12)
k=l
Define the event
By Chebyshev's inequality,
4.2 Equivalence and orthogonality of Gaussian measures 117
and
PI(An ) = 1 - PI (-logPn + Eo 10gPn 2: -~rn)
= 1 - PI (-logPn + EllogPn 2: ~rn)
2: 1 -4varl(~Ogpn) -t l.
rn
Thus Po 1- Pl. o
Lemma 3. If rn -t r < 00 then Po == Pl.
PROOF. (Ibragimov and Rozanov 1978, p. 77). Suppose there exists A E
.r such that Po (A) = 0 and PI (A) > O. Let P2 = Po + Pl. Then there exists
a sequence of events AI, A 2 , ... such that An is measurable with respect to
the a-field generated by hI' ... ' h n and P2 (A 0 An) -t 0 as n -t 00, where
o indicates symmetric difference (Exercise 13). Thus,
Po(A 0 An) -t 0 and PI (A 0 An) -t 0 as n -t 00. (13)
Consider.r~ = {0,0,An,A~}. For wE 0, define
for wEAn
for W E A~.
Then
Po(An) 1 - Po(An)
EllogXn = PI (An) log PI (An) + {I - PI (An)} log 1- PI(A n ) .
00.
°
By (13), PI(An ) -t PI(A) > and Po(An) -t Po(A) = 0 so -EllogXn-t
By Exercise 12, -Ellog Xn :S -Ellogp;;-l :S rn so that rn -t 00, which
yields a contradiction. Hence, we cannot have Po(A) = 0 and PI(A) > O.
Similarly, there cannot exist A E .r with PI (A) = 0 and Po(A) > o. The
lemma follows. 0
(12), Tn tends to a finite limit if and only if both E~=l (1 - (1~n)2 and
E~=l m~n are bounded in n. 0
if we set f( -j) = f(j). Under these conditions, for the sum (14) to exist
as an £2 limit of finite sums for all x, it is necessary and sufficient that
E f(j) < 00 and E j >O{J.t(j)2 + 1I(j)2} < 00. Indeed, by a relatively simple
version of Bochner's Theorem, it is not difficult to show that a function K
from IRd to the reals is a continuous positive definite function on IRd with
period 271" in each coordinate if and only if it can be written as in (15)
with all f(j)s nonnegative, f even and E f(j) < 00. We see that f is the
spectral density of the process with respect to counting measure on the
integer lattice.
The explicit representation in (14) of a periodic random field as a sum
of independent random variables makes it relatively easy to study. In par-
ticular, for two such Gaussian measures with mean functions mo and ml
having period 271" in each coordinate and auto covariance functions defined
by spectral densities fo and II on 'l.d, it is a simple matter to determine
their equivalence or orthogonality. First, as previously noted, we can as-
sume without loss of generality that the mean under Po is 0 and let J.t(j)
4.2 Equivalence and orthogonality of Gaussian measures 119
and v(j) be the Fourier coefficients for the mean under Pl. Letting j 1, h, ...
be some listing of those j in Zd for which j > 0, the sequence of random
variables X(O), Y(jd, X(jd, Y(j2), X(j2), ... forms a basis for the Hilbert
space of random variables generated by Z(x) for x E(0,2rrJd under the
inner product defined by 10. Defining Tn as in (10) and T = limn -+ oo Tn, we
get
T
' " [{/I(j) - 10(j)}2
= .~ lo(j)/I(j)
{(0)2
+ /-t J + v
(0)2}
J
{II}]
+
lo(j) /I(j) ,
JE ....
where /-t and v are taken to be even functions and yeO) = O. This definition
is appropriate even when some J;(j)s are 0 as long as % is defined to be 0
and a positive number over 0 is defined to be 00. By Theorem 4, Po and PI
are equivalent if T is finite and are otherwise orthogonal. If lo(j) ;::::: /I(j),
which is necessary for equivalence, then T < 00 if and only if
In one dimension, we can rewrite (16) in terms of the mean and au-
to covariance functions when lo(j) ;::::: /I(j) ;::::: (1 + j2)-p for j E Z and
some positive integer p. First consider the case where the autocovari-
ance functions are equal. Set Po = G(0,27r](0,K) and PI = G(0,27r](mI,K)
where K(t) = 'L'f=-oo I(j) cos(jt) and mI(t) = /-to + 'L'f=I {/-t(j) cos(jt) +
v(j) sin(jt)}. Then 'L'f=0{j.t(j)2 + V(j)2} /I(j) is finite if and only if
'L'f=0{/-t(j)2 + v(j)2}Pp is, which in turn is equivalent to m~P-I) existing
and being absolutely continuous on lR. with almost everywhere derivative
m~p) satisfying
(17)
(Exercise 14). Now consider the case where the means are both 0 and de-
fine k(t) = Ko(t) - KI(t). Then (16) holds if and only if 'LjEz{/I(j) -
IO(j)pj4 p < 00, which in turn holds if and only if k(2p-I) exists and is ab-
solutely continuous on lR. with almost everywhere derivative k(2p) satisfying
(Exercise 14)
(18)
fields but are considerably harder to prove. Let us first consider results in
the spectral domain. Define Qd to be those functions 1 : JRd - JR such that
I(w) X 14>(w)1 2 as Iwl- 00 (19)
for some function 4> that is the Fourier transform of a square-integrable
function with bounded support, where a(w) x b(w) as Iwl - 00 means
there exists a finite constant A such that a(w) x b(w) on Iwl > A. Then
for 10 E Qd and any bounded region R C JRd, GR(O,Ko) == GR(0,K1) if
for some C < 00,
tends to °
Lemma, which says that the Fourier transform of an integrable function
as its argument tends to ±oo (Stein and Weiss 1971, p. 2).
Now suppose that Ko and Kl are two autocovariance functions, and Ko is
analytic and has a spectral density. Then for any T > 0, GT(O, Ko) ==
GT(O,KI ) if and only if Ko = K I , where GT(m,K) = G[O,T](m,K)
(Ibragimov and Rozanov (1978, p. 95) or Exercise 18). The reason an-
alytic autocovariance functions are not excluded from (16) for periodic
processes is that the perfect extrapolation of the process from [0,211"] to JR,
although possible, does not provide any new information about the auto co-
variance function K, and hence K cannot be reconstructed with probability
one. Of course, (21) is not satisfied for a periodic process with continuous
autocovariance K unless K is identically 0.
Let us next consider spectral conditions for equivalence if only the mean
function is misspecified. For a closed region R C lRd and f bounded, Ya-
drenko (1983, p. 138) shows GR(O, K) == GR(ml, K) if and only if mi can
be extended to a square-integrable function on all of lRd whose Fourier
transform mi satisfies
[ I ml(W)1 2 d
lR,d f(w) W < 00.
If R = lR d , then there is no need to extend mI. Comparing this result to
°
Theorem 13 of 3.8, if f is bounded away from and 00 on bounded sets, we
see that GRd (0, K) == GR,d (ml' K) implies the uniform asymptotic optimal-
ity of pseudo-BLPs using the wrong mean function based on observations
at oj for all j E Zd as 0 ! 0.
In one dimension, analogous to (17) and (18) in the periodic setting, there
are results in the time domain on the equivalence of Gaussian measures.
Consider Gaussian measures GT(O, K) and GT(ml, K) for T > 0, where K
has spectral density f. If for some positive integer p,
f(w)w 2p x 1 as w --+ 00, (22)
(Ibragimov and Rozanov, 1978, p. 92), which can be compared with (17).
Since (22) means Z is exactly p - 1 times mean square differentiable under
G(O, K), loosely speaking, we see that (23) says that if the difference of
mean functions has one more derivative than the stochastic part of Z, the
two models are equivalent.
Next consider Gaussian measures GT(O, Ko) and GT(O, Kd for a sta-
tionary process Z on lR. Define k(t) = Ko(t) - Kl(t). For 10 satisfying
(22), GT(O, Ko) == GT(O, K 1 ) if and only if k(2p-l) exists and is absolutely
continuous on (-T, T) with almost everywhere derivative k(2p) satisfying
10r
T 2
{k(2 P )(t)} (T - t) dt < 00, (24)
°
functions be smoother than either of them separately. If, say, h(w) ~ lo(w)
for all w, then we can define independent mean Gaussian processes X and
Y with spectral densities 10 and h - 10, respectively, so that GT(O, Ko)
is the law of X on [0, T] and GT(O, Kd is the law of X + Y on [0, T].
In this case, (24) has the loose interpretation that we cannot distinguish
between X and X + Y if Y has one more derivative than X. Note that (24)
allows us to verify the claim of the preceding section that if Ko(t) = e- 1tl
and Kl (t) = (1 - Itl)+, then GT(O, Ko) == GT(O, K 1 ) if and only if T ::; 1
(Exercise 19).
Proof of Theorem 1
We now return to proving Theorem 1, stated earlier in this section. Let
F be the spectral distribution function for the process, so that if F is the
spectral measure, F(w) = F((-oo,w]). Recalling that Ad(r) = (-rrr,rrrjd,
e 1):- e 1 )
then
2K (~) - K K ~
= 2 (Xl eiwj / n (1- cos~) dF(w)
Loo n
= 2 r
JA1(n)
eiwj / n (1 - cos~)
n
dFn(w),
n-1
~(n-j) {2K
( j;) -K (j+l)
71 -K (j_l)}2
71 (26)
I
n
~(n
-
1
_ j)ei(W-v)j/n I
3=1
= Inei(w-v)/n{l_ ei(w-v)/n} - ei(w-v)/n{1 - ei(w-v)} I
{I - ei(w-v)/np
(27)
« 1 + nsin Iw2~v I
for V,W E A1(n) (Exercise 5). Thus,
n-1
~(n-j) {2K
( j;) -K (j+l)
71 -K (j_l)}2
71
«2"
n
1 1
A2(n)
W2v2
.
1 + nsm 2n
Iw-vl dFn(w)dFn(v).
- -
Suppose {en} and {dn } are positive sequences such that en -+ 00,
Cn = o(n1/2), dn -+ 00 and dn/en -+ 0 as n -+ 00. Divide A2(n)
into three regions: R1 = A 2(en); R2, the part of A2(n)\R1 for which
nsin I(w - v)/(2n)1 < d n ; and R 3 , the rest of A2(n) (see Figure 1). Now,
Cn = o(n 1/ 2 ) implies
r
l(o,'lrn]
wOdFn(w) = _w o Hn(W)I~n + 0: rn w o - 1Hn(w) dw
10
«no - 1 , (28)
4.2 Equivalence and orthogonality of Gaussian measures 125
I
1l"n
____ L __ _
1-1l"n
I
FIGURE 1. Regions of integration in proof of Theorem 1. Horizontal stripes
indicate R 1 , diagonal stripes indicate R2 and unstriped area is R3.
so that
1
«~d
n n
Since F( -t) '" Cit as t -+ 00, given € > 0, we can choose jo such that
21l'nF( -21l'jon) < € for all n sufficiently large. Again using Pitman's Taube-
rian theorem, it is possible to show that the first term on the right side of
(29) tends to 0 uniformly for all v E (c n - d n , 1l'n] (Exercise 7). Since € is
arbitrary, it follows that v{ Fn(v + dn ) - Fn(v - dn )} -+ 0 uniformly for
all v E (en - d n , 1l'n] (Exercise 7), so by (28), the contribution to (26) from
R2 tends to 0 as n -+ 00. Since the first term on the right side of (5) is
O(n- l ), the theorem follows. Exercise 25 outlines a time-domain proof of
Theorem 1.
Exercises
1 Prove (1).
2 Use (1) to show that (2) or (3) imply Po ..1 Pl.
3 Suppose Un is defined as in (4) and K(t) = C - Dltl + o(ltl) as t -+ 0
for some D > O. If K has a bounded second derivative on (0,1] then
EUn = D + O(n-l) and varUn = 2D 2n- 1 + O(n- 2). If, in addition,
K" is continuous on (0,1]' show that EUn = D + o:n- l + o(n- l )
and varUn = 2D 2n- 1 + (3n- 2 + o(n- 2) as n -+ 00 and give explicit
expressions for 0: and (3.
4 Use Theorem 1 to show that if Kj(t) = Cj - Djltl + o(ltl) as t -+ 0
for Dj > 0 for j = 0,1 with Do =/; D I , then GR(O, Ko) ..1 GR(O, Kd on
any interval of positive length.
5 Verify (27).
6 Verify the claim in the proof of Theorem 1 that Hn(w) « w- l for
0< w ~ 1l'n.
7 In the proof of Theorem 1, show that the first term on the right side of
(29) tends to 0 uniformly for all v E (en - d n , 1l'n]. Show that it follows
that v{ Fn(v + d n ) - Fn(v - d n )} -+ 0 as n -+ 00 uniformly for all
v E (en - dn , 1l'n].
8 For K(t) = (~ -Itlt, show that for Un as given in (4), varU2n '"
an- l and VarU2n+1 '" {3n- 1 as n -+ 00. Find 0: and (3.
9 For Wn as defined in (6) and Z a Gaussian process, show that as
n -+ 00, EoWn -+ -1, EI Wn -+ 0, varo Wn -+ 0 and varl Wn -+ 0 for
R = [0,2], mo = ml = 0, Ko(t) = (1 -Itl)+ and KI(t) = e- 1tl .
4.2 Equivalence and orthogonality of Gaussian measures 127
10 Show that for a mean square continuous random field Z on JRd, the
closed real linear manifold of Z(x) for x E JRd is a separable Hilbert
space.
11 Verify the expressions for logPn in (8) and for the mean and variance
of 10gPn under Po and P 1 in (9).
12 Consider probability spaces (n,:F, Po) and (n,:F, P 1 ) with P 1 abso-
lutely continuous with respect to Po and let P be the Radon-Nikodym
derivative of P 1 with respect to Po. If :F' c :F is a a-algebra on n,
show that
Eo[log{Eo(p-11:F')}] ~ 10g{Eo(P-l)}.
Note that to apply this result in the proof of Lemma 3, we should
take n = JRn, :F the Borel sets on JRn and :F' to be :F~ as defined in
Lemma 3.
13 In the proof of Lemma 3, verify that there exists a sequence of events
A 1, A 2 , ••• where An is measurable with respect to the a-field generated
by hi, . .. , h n such that P2 (A 0 An) --t a as n --t 00.
14 Consider a Gaussian process on JR with period 271". For a positive in-
teger P, suppose lo(j} ~ h(j) ~ (1 + P)-P for j E Z. Show that
GIR(a, Ko) = GIR(m, Ko) if and only if m(p-l) exists and is absolutely
continuous and has almost everywhere derivative satisfying (17). In ad-
dition, for k = Ko-Kl, show that GIR(a, Ko) = GIR(a, Kd ifand only if
k(2p-l) exists and is absolutely continuous and has almost everywhere
derivative satisfying (18).
15 Show by example that for d = 1 the integral in (2a) can be infinite and
yet GT(a, Ko) == GT(a, K 1) for some positive T.
16 Show that if I is a spectral density on JR and I(w) ~ w- a as w --t 00 for
some a > 1, then I satisfies (19). This result is (4.31) in Chapter III
of Ibragimov and Rozanov (1978).
17 Show that if a function on JR possesses a Laplace transform in a neigh-
borhood of the origin, then its Fourier transform is analytic on the real
line.
18 Applying (21), provide the details showing that if Ko and Kl are
two auto covariance functions on JR, Ko is analytic and has a spectral
density, then GT(a, Ko) == GT(a, K 1) for T > a if and only if Ko = K 1.
19 If Ko(t) = e- 1tl and Kl(t) = (l-ltl)+, show that (24) implies that
GT(a, Ko) == GT(a, Kd if and only if T ~ l.
20 If T > a, Ko(t) = e- 1tl and in a neighborhood of the origin, Kl(t) =
Kl (a) - ItI + Dltl'"Y + o(ltl'"Y) as t --t a for 'Y E (1,~] and some D 1= a,
show that GT(a, Ko) .L GT(a, Kl).
128 4. Equivalence of Gaussian Measures and Prediction
21 Show that (25) holds and hence that 1tn(mo, K o) is contained in the
Hilbert space generated by Y1, Y 2 , ....
22 If Z is not mean square continuous on R under G n(ml, Kd, show that
Gn(mo,Ko)..L Gn(ml,Kd and Gx (mo,Ko,0'3)..L Gx(ml,Kl'O'~)'
23 Show that it is possible to recover 0'3 with probability 1 from
Y17 Y2 , ... under G x (mo,Ko,0'3). Hence, show Gx(mo,Ko,0'3) ..L
GX(mb K17 an
if 0'3 ¥= O'~.
24 Suppose Gn(mo,Ko) == Gn(m17Kl)' Prove that Gx (mo,KO ,0'2)
Gx(m17 K 1 , 0'2). Suppose Gn(mo, Ko) ..L Gn(ml, K 1 ) and Z is mean
square continuous on R under both models. Prove that Gx (mo , Ko, 0'2)
is orthogonal to Gx (m1! K17 0'2).
(30)
As in the previous section (see (7)), let h ln , . .. ,hnn be a linear transfor-
mation of hI, ... ,hn such that for j, k = 1, ... ,n, Ko(hkn' hjn) = Okj and
KI(hkn,hjn ) = crJnOkj and set mkn = ml(h kn ).
The next theorem shows how to determine the equivalence or orthogo-
nality of Gaussian measures in terms of the bjks and J..I.js. It combines (2.20)
and the last equation on page 78 in Ibragimov and Rozanov (1978).
Theorem 7. Suppose varo h ;::::: varl h for h E 'H.0 . Then G{O, Ko) _
G(mI, Kd if and only if
00
and
00
LmJn=LJ.l;· (34)
j=l j=1
Define Q = 2::rk=l b;k and f3 = 2::;1 J.lJ, so that by (33) and (34),
2::7=1 (1 - O-;n)2 -+ Q and 2::7=1 mJn -+ f3 as n -+ 00. Theorem 4 and (12)
imply that Q and f3 are finite if and only if G(O,Ko) == G(m1,Kd. 0
(36)
(37)
and
(38)
4.3 Applications of equivalence of Gaussian measures to linear prediction 131
L
CXl
eo(1/;, n) = Cj1/;j.
j=n+l
Define bjk and {Lj as in (30). If Eoeo( 1/;, n)2 > 0, then as n -+ 00,
1/2
::; L L
CXl } CXl
{
b~k + {L~, (39)
j,k=n+l j=n+l
Theorem 10. Suppose (0, Ko) and (ml, Kt) are two possible second-order
structures on 11. 0 with G(O, Ko) == G(ml, Kt}. For n :::: 1,2, ... , let 11.n be
a sequence of subspaces of 11.(0, Ko) such that for any given h E 1i(0, K o ),
Eoeo(h, n)2 -+ 0 as n -+ 00. Then (35)-(38) hold.
(40)
and
On
lim "(Vj)2
n~oo~
= {3. (41)
j=l
Consider (40); the proof of (41) is left as an exercise (Exercise 28). As in (7),
let hI, h 2, ... be a linearly independent basis for fi(O, Ko) and hlp, ... , hpp
a linear transformation of hI"'" hp such that Ko(hjp , hkp) = 8jk and
K1 (h jp , hkp) = o"]p8jk for j, k :::; p. From Theorem 7, L~=1 (1 - a;p)2 con-
verges to a finite limit as p -+ 00; call this limit Q. Thus, given € > 0, we
can find p such that L~=1 (1 - a;p)2 > Q - €. Define qjp to be the BLP of
hjp based on fin and Qip the covariance matrix of (qfp"'" q;p) under K i .
Then Q op -+ I as n -+ 00 and Qfp converges to the diagonal matrix with
diagonal elements a~p, ... ,a;p (Exercise 29). Using (33) it can be shown
that
p
EleO(h,S)2 - E oeo(h,S)21
sup
h~S(O,K)
I Eoeo(h, S)2
= 00.
as p ~ 00. o
PI {lim sup
n->oo hE1t,tEIR
IPcf(h:::; t) - P{'(h :::; t)1 = o} = 1. (44)
(50)
sup Ih(w)
Iwl>c, cfo(w)
-11 < f.
Define
( ) _ {C- 1h (W) for Iwl ~ C.,
g. w - fo(w) for Iwl > C•.
Using the subscript f to indicate a calculation done assuming g. is the
spectral density,
Eoe1(h,n)2
= Eo [{ el (h, n) - e.(h, n)} + e.(h, n)]2
~Eo{e1(h,n)-e.(h,n)}2 (51)
+ 2 [Eo {el(h,n) - e.(h,n)}2 Eoe.(h,n)2f/ 2+ Eoe.(h,n)2.
By (20), G R(O, fo) == G R(O, g.), so by Theorem 8,
·
11m Eoe.(h,n)2 1
sup =. (52)
n-+CXl hE1t-n Eoeo(h, n)2
Set fO =! so that fo(w) ~ 2c- 1 h(w) for alllwi ~ C.o • Using GR(O,fo) ==
GR(O,g.o) and Exercise 2 in 4.2, there exists ao < 00 such that
Eoh2 ~ aoE.oh2 for all h E 'JiR(O, Ko).
For all f > 0, let us choose C. ~ C'o ' which we can always do. Then
g.o(w) ~ 2g.(w), so that Eoh 2 ~ 2aoE.h 2 for all f. By Theorem 10, for
4.3 Applications of equivalence of Gaussian measures to linear prediction 137
any particular f,
·
11m sup IE,e,(h,n)2 - 11 = 0
n->oo hE'H- n Eoeo(h, n)2
so that for all n sufficiently large,
Eo {el(h,n) - e,(h,n)}2 4 E, {el(h,n) - e,(h,n)}2
Eoeo(h,n)2 ~ 0:0 E,e,(h,n)2
for all h E 'H.- n , where 0:0 is independent of f. But by Theorem 1 of
Chapter 3,
(53)
By (51)-(53),
r {h(W)fo(w)
j'R,d
- fo(W)}2 dw < 00 (54)
define ei('I/J, n) to be the error of the BLP of'I/J based on YIn, ... , }jnn when
Z has second-order structure (0, Kd. Suppose Ko and KI are continuous
autocovariance functions with corresponding spectral densities fo and h for
a mean 0 random field Z on ]Rd. Assume fo E Qd and h(w)lfo(w) --+ 1
as Iwl --+ 00. If R is a bounded subset of]Rd such that all points in R are
limit points in Rand
then (49) and (50) hold with c = 1, where 1i-n is taken as all nonzero
elements of1iR(O,Ko).
Corollary 13 can be proven by first obtaining an analog to Theorem 8
for observations with measurement error and then essentially repeating the
proof of Theorem 12. The condition (56) just says that every point in R is
near a point in Xn when n is large.
Exercises
26 Prove Corollary 9.
27 Under the conditions of Theorem 8, show that for all h E 1t(0, K o),
Eoeo(h, n)2 -+ 0 as n -+ 00.
28 Prove (41).
29 In the proof of Theorem 8, show that Q(j -+ I as n -+ 00 and Q 1
converges to the matrix with elements u?p, . .. ,u~p along the diagonal
and 0 elsewhere.
30 Prove (42).
31 Prove that (40) follows from (42) and (43).
32 In the proof of Theorem 11, show that it is possible to construct an
orthonormal basis {rj}~l for S.L such that J.'j = mh) is finite for all
j.
33 Under the conditions of Theorem 11, show that Gs(O, K) == Gs(m, K)
and G'H.(O, K) ..1 G'H.(m, K) imply GS-L (0, K) ..1 GS-L (m, K).
34 Prove (50).
35 Show that (54) and (55) imply (49) and (50).
36 Suppose Z is a mean 0 weakly stationary random field on Rd with
actual auto covariance function Ko and presumed auto covariance func-
tion K 1 • For a bounded set R s;:; R d , define the function Kj(x,y) on
R x R by Kj(x,y) = Kj(x - y). Suppose that for j = 0, 1, K j can be
extended to a function on Rd x Rd such that Kj (x, y) depends only on
x - y and inld IKj(x, O)ldx < 00. Define
1
fJ(w) = (2rr)d
(
JlRd {.T}-
exp -zw x Kj(x, 0) dx.
(ii) Define the function ¢a(t) by taking it to have period 4a- I and
setting ¢a(t) = 1- altl for ItI ~ 2a- I . Show that ¢a is p.d. for all
a> O.
(iii) Define 'ljJ(t) = ~ J02¢a(t) da. Show that 'ljJ is p.d. and that 'l/J(t) =
I-ItI on [-1,1].
(iv) Show that 'ljJ has spectral density
A Bayesian version
It is possible to give an exact quantification of Jeffreys's law by taking
a Bayesian perspective. Let P = {Po : () E e} be a finite-dimensional
parametric family of distributions for (Y, Z). Suppose ((), Y, Z) have a
joint density with respect to Lebesgue measure and use P generically to
denote a marginal or conditional density, so that, in particular, p((}) is the
142 4. Equivalence of Gaussian Measures and Prediction
p(8, Z I Y)
E { log p( 8 IY)p( Z I Y)
I Y} .
This expression is 0 if 8 and Z are conditionally independent given Y, so
that both sides of (59) measure the conditional dependence of 8 and Z given
Y. To see why (59) can be viewed as a quantification of Jeffreys's law, we
need to take a closer look at both sides of this equality. For any particular
8 0 , I {p(Z 18 0 , Y),p(Z I V)} measures how far the predictive distribution
for Z diverges from the conditional distribution for Z we would obtain
if 8 = 8 0 were known. The left side of (59) is then just the average of
this divergence over all possible values of 8. Thus, the left side of (59)
measures how much information 8 contains about Z that is not already
contained in Y. Similarly, the right-hand side of (59) is a measure of how
much information Z contains about 8 that is not already contained in Y.
The conclusion I draw from (59) is a sharpening of Jeffreys's law: if the
quantity we wish to predict tells us very little new about the parameters
of our model, then our predictions will be close to those we would obtain
if we knew the true values of the parameters.
Let us now reexamine (40) in 3.6 in light of this result. To review,
suppose Z is a mean 0 Gaussian process on IR observed at Z(-j/n) for
j = O, ... ,n, Ko(t) = e- iti and Kl(t) = ~e-2iti. For any finite interval
R, GR(O, Ko) == GR(O, Kd, so that even if n is large, it will be difficult
to distinguish between these measures. Now consider predicting Z(t) for
t > O. Since prediction of Z(t) does not depend on n under either model,
denote the prediction error for Z(t) under Kj by ej(t). Using (40) in 3.6,
Figure 2 plots EOel(t)2/Eoeo(t)2 and Elel(t)2/Eoel(t)2 as functions of t.
Both functions are near 1 for t small, which, considering (58), implies that
Z (t) for t small does not provide much new information for distinguish-
ing the measures. For larger t, Z(t) does provide nonnegligible additional
information for distinguishing between the measures and this is reflected
particularly in Elel(t)2/Eoel(t)2, which tends to ~ as t ---+ 00. Thus, the
statement "things we shall never find much out about cannot be very im-
portant for prediction" (Dawid 1984) is incorrect in this setting because
4.4 Jeffreys's law 143
1.0
Ir------- .~.:-:-
.. -:-C
..C-.-:""7 --I
• • ..,.....~.~_ _ _
\
\
\
\
,,
,
0.5 .~~~------------
o 1 2 3 4
t
FIGURE 2. Ratios of mean squared errors for predicting Z(t) from Z(O) when
Ko(t) = e- iti and Kl (t) = ~e-2iti. Solid curve gives EoeV Eoe~ and dashed curve
EleVEoe~.
Exercises
38 Prove (57).
39 Prove (58).
5
Integration of Random Fields
5.1 Introduction
This chapter studies the prediction of integrals of random fields based on
observations on a lattice. The goal here is not to give a full exposition of
the topic (see Ritter (1995) for a more detailed treatment) but to make
two specific points about properties of systematic designs. The first is that
simple averages over observations from systematic designs can be very poor
predictors of integrals of random fields, especially in higher dimensions.
The second is that, at least for random fields that are not too anisotropic,
the problem with this predictor is the simple average aspect of it, not the
systematic design. These two points are of interest on their own, but they
are also critical to understanding a serious flaw in an argument of Matheron
(1971) purporting to demonstrate that statistical inference is "impossible"
for differentiable random fields (see 6.3).
Suppose Z is a mean 0 weakly stationary random field on ~d. Define
Om = {I, ... , m}d and let h be the vector of length d with each component
equal to ~. Consider predicting I(Z) = Iro,lld Z(x) dx based on observing
Z at m-l(j - h) for j E Om. This set of observations is called a centered
systematic sample because it places an observation at the center of each
cube of the form X~=l [m-l(ja - 1), m- l ja] for j = (jl, ... ,jd) E Om (see
Figure 1). A natural predictor of the integral is just the simple average
of the observations, Zm = m- d LjEQm Z(m-l(j - h)). Although it may
be natural, it is not necessarily a good predictor. Section 5.2 looks at the
asymptotic mse ofZ m as m ~ 00. Results in 5.3 and 5.4 show that if Z has
5.2 Asymptotic properties of simple average 145
I I I I
1111
1111
- -1- - -1- - -1- - , - - -
1111
- -1- - -1- - -1- - -1- - -
1
- _1- __ 1___ 1 ___ 1 __ _
1111
var{I(Z)-Zm} (1)
=[
lad
f(W)1 [
J[O,l]d
exp(iwTx)dx- E -;'ex
m
jEQm
P {im- wT (j_h)}1 dW.
1
2
146 5. Integration of Random Fields
It follows that
where
d sin2(~) { d (W )}2
gm(W) = f(w) ! ! m 2 sin2 (~) 1-!! sinc 2~ ,
where
sin2 (~)
+ 271'mj) IT (w~ )
d
gm(Wjj) = f(w 2' 2
O!=I m sm
WO! + 271'mjO!
A ( )}
1wE dm.
as m ~ 00, so that
(4)
as m ~ 00 for fixed w, where
d
J[td
r gm(Wj 0) dw « m- 4 (1 + (m)3- p ), (9)
where (m)q = m q for q =I- 0 and (m)O = logm (Exercise 6), yields
as m -+ 00. To simplify this result, note that for w E Ad(m) and j =I- 0,
_ IT sinc 2 (_W<» I
I ~gm-,-::(w.:..::..:.;j)--:-:-
few + 21l"mj) <>=1 2
1n
- <>=1 m 2 sin2 (~) <>=1 W<> + 21l"mj<>
+ 1- sinc 2 (~~)Il·
Now, for w E Ad(m) and j =I- 0,
d
II sin2 (~) lId. 2 (W<»
<>=1 m S1n
2 • (!£...)
2
2m
« <>=1
SIllC 2'
lI
d (-l)jo 2m sin (~) II /w<>/
.
W<> + 21l"mJ<> « . ...L -/·-1
m J<>
<>=1 <>:)aTO
5.2 Asymptotic properties of simple average 149
and
so that
and
lid gm(w;j) dw - f
IR Ad(m)
f(w + 21l"mj) IT
",=1
sinc2 (~",) dwl
« (mljD-pm -2 f Ad(m)
Iwl 2 nd
",=1
sinc2 (~",) dw
« (mljD-pm-21 w~ dw
O<Wd<···<Wl <7rm n!=l (1 + w~)
« m-p-1Ijl-p·
Theorem 2 then follows from (10). o
We can obtain a yet simpler result by making stronger assumptions about
f. The following is essentially a special case of Theorem 2 of Stein (I993c);
its proof is left as an exercise.
Theorem 3. Suppose f(w) x Iwl- P as Iwl -+ 00 for some p < 4 and
there exists a function I: Rd -+ ]R such that for any v E ]Rd and w =f. 0
t-co
lim t P f(v + tw) = I(w). (11)
Then
m P var {I(Z) - Zm} -+ (21l")d- p E'IO).
Note that (11) holds if, for example, f(w) rv IAwl- P as Iwl -+ 00 for some
nonsingular d x d matrix A, in which case I(w) = IAwl- p •
Exercises
1 For a weakly stationary, mean square continuous random field Z on
]Rd, show that I(Z) can be defined as an L2 limit of finite sums.
Show that the second term on the right side is O(m-4) but not o(m-4).
By combining these results with Theorem 5 in 5.4, show that Zm is
not asymptotically optimal when f(w) ~ Iwl- 4 as Iwl --+ 00 but does
have mse converging at the optimal rate.
8(t) = r L
JAd(t) j
I f(w + 271'tj)lV(wWdw
8(8- 1) as 81 o.
PROOF. I only consider the case v = 0 here as it simplifies the notation.
See Stein (1995a) for the more general case. The basic idea of the proof is
to show that there is a family of linear predictors depending on 8 that has
8(8- 1 ) as its asymptotic mse and then to show that the ELP cannot do
better asymptotically.
The following simple result is helpful. Suppose {c n } is a sequence of
nonnegative and measurable functions on Rd and {an} and {b n } are se-
quences of measurable complex functions on Rd such that flR d {la n (w)1 2 +
Ib n (w)l2}c n (w)dw < 00 for all nand
. flR d Ibn (w)l2c n (w) dw
hm =0.
n ...... oo flR d la n (w)1 2 cn (w) dw
Then
r
JAd(I5-1)C
f(w)lV(wWdw «8P r
JA d (I5-1)C
W(wWdw = o(8P), (13)
where the last step holds because v square integrable implies V is as well.
Then
(14)
152 5. Integration of Random Fields
so that
~ (
JAd(O-lY
f(w) { IUo(w)I-IUo(w) - Vo(w)I-IV(w)1 }2 dw
f"V 8(8- 1 ), (15)
( f(w)IUo(w) - "C6(wWdw
JAd(O-l)C
=(
JAd(O-l)
IV(w) - Vo(w)12 L:'f(w + 271"o-1j) dw
j
«8P 1Ad(O-l)
f(W)-21L:' f(w + 271"8- 1j) {V(w) - V(w + 271"8- 1j)
j
}1 2
dw
= o(8P ). (16)
Exercise 12 asks you to provide the details for (16). Theorem 4 follows from
(14) and (15). 0
where
Hence, Zm(v) is asymptotically optimal for p < 4. Stein (1995a) also shows
how to modify this predictor so that it is asymptotically optimal for p ~ 4
using a generalization of the procedure outlined in the next section.
Exercises
10 Prove (12).
11 Show that the function Uo defined in the proof of Theorem 4 is in
£0(1).
12 Provide the details for (16).
13 Show by example that the conclusion of Theorem 4 may not hold if
only f(w) x Iwl- P as Iwl - t 00 is assumed.
1 ~ {. -1 (. 1)} exp(~)sin(~)
m L.J exp zm v J - "2 = . (V)
j=1 msm 2m
154 5. Integration of Random Fields
for 1111 ~ m7r. Although there is more than one way to do this, let us
consider functions of the form
4>m(lI; a q ) = -
1
L exp
m {W J--
' (.
m
I)}
2
m j=1
+m ~ aj [{ill(j-~)}
-1 L.J exp + exp {ill(m-j+~)}] ,
j=1
m m
4>m(lI;aq ) = exp C;) [~ sin (i) {csc (2:) + 2 t,a j sinC: ~ II) }
+ ! cos (i) t, C: ~ II)] .
aj cos
Using
csc(t) = c 1+ t
l=1
(_I)t- 12(2 2l - 1 -1)B2lt2l-1
(2£)!
+ 0 (I t I2 k+ 1 )
and
q 22r- 1 1
"ak(2k
L.J - 1)2r-l = -- 2r B 2r £or r = 1 ..., ,8, (18)
k=1
then
I
I °I h( t )dt - Rk h()1 ~ (2k IB2k+21
2)' 2k+2 sup 1h (2k+2)()1
+.n
t
O:::;t::;I
where bj = n:=l (3jq' (3j = 1 if 2s + 2::::; j ::::; m -2s - 1 and (3j = (3m-j =
1 + aj for 1 ::::; j ::::; 2s + 1. Note that Zm,O = Zm. We then get
« J;.
r w
Im I
4s+4
IIsinc2 (
d W
201.) (1 + Iwl)-P dw
Ad(m) 01.=1
«m- 4s - 4 r
}O<Wd<",<Wl <7rm
(1 + wd 4s +2 - p IT 1:w01. dw
01.=2
2
« m - 4s - 4 (1 + (m)4s+3- P )
= o(m- P )
if 4s + 4> p.
r
d
var {I(Z) - Zm,s} rv
} Ad(m) j
IIsinc
L' f(w + 27rmj) 01.=1 2 (~OI.) dw.
Theorems 4 and 5 imply the following.
Corollary 6. If f(w) :::::: (1 + Iwl)-P and 4s +4 > p, then Zm,s is an
asymptotically optimal predictor for I(Z).
We also have an analogue to Theorem 3.
Theorem 7. If f satisfies (11), then for 4s + 4> p,
Exercises
14 Verify (19) for 1/.11 ::; 7rm.
15 Show that for q = 28 + 1, there is a unique solution to (17) and (18).
16 Show that if f(w) = o(lwl-d) as Iwl - t 00, then for 48 + 4 > d,
var{I(Z) - Zm,s} = o(m- d ). Thus, under this mild condition on f,
the mse ofthe BLP based on the centered systematic sample is o(m- d )
as m - t 00 and hence is better asymptotically than taking a uni-
form simple random sample on [O,I]d of size m d and averaging the
observations.
17 Continuation of 16. For d = 1, show by example that if no conditions
are placed on f, then m var{I(Z) - Zm} may not tend to 0 as m - t 00.
18 Prove Theorem 5.
19 Prove Theorem 7.
TABLE 1. Mean squared errors for predicting I(Z) when K(t) = exp( -4Itl)(1 +
41tl + 16t 2 /3).
K(t) = exp( -4Itl)(1 + 41tl + 16t 2 /3), for which the corresponding spectral
density is few) = 213 / {37r(16 +W2)3}. The results in Table 1 for predict-
ing feZ) = J01Z(t) dt at least qualitatively agree with the asymptotics.
Specifically, since p = 6 > 4, the simple average Zm is badly suboptimal,
particularly for larger m. The modified predictor Zrn,1 performs much bet-
ter, although even for m = 48, it has mse 14% larger than that of the
BLP Zrn. Note that all integrals required to obtain these results can be
computed analytically (Exercise 20), so that numerical integration is not
needed.
Theorem 5 shows that asymptotically there is no penalty for using Zm,s
with 8 larger than necessary. For finite m, using 8 too large does tend to
give larger mses. Table 2 shows what happens when predicting feZ) and the
auto covariance function is K(t) = e- 1tl . Here, few) ;: : : (1 + Iwl)-2 so that
Zrn,s is asymptotically optimal for all nonnegative integers 8. We see that
Zrn is very nearly optimal for all m considered, Zm,1 does somewhat worse
but is still within 2% of optimal even for m = 16, and Zm,2 does noticeably
worse, although it is within 5% of optimal for m = 48. Comparing Tables 1
and 2, it is apparent that the penalty for choosing 8 too small is much more
severe than for choosing 8 too large, as the asymptotic results predict.
The fact that it is possible to find an asymptotically optimal predictor
for f (Z) by choosing any integer 8 such that 48 + 4 > p and then us-
ing Zm,s indicates that prediction of integrals is particularly insensitive
to misspecification of the spectral density. The results of Chapter 4 show
that all prediction problems are insensitive to misspecification of low fre-
quency behavior under fixed-domain asymptotics. The results here indicate
that integrals may be predicted nearly optimally without knowing the high
frequency behavior of the spectral density well, either.
TABLE 2. Mean squared errors for predicting I(Z) when K(t) = e- 1tl .
Exercises
20 For K(t) = exp(-altl) (1 +altl + ~a2t2), find cov{Z(s),fo1 Z(t)dt}
for 0 ::; s ::; 1 and var { fo1 Z (t) dt}.
21 Reproduce the results in Table 1. Try to extend these results to larger
m. You may run into numerical problems for m not much larger than
64. For example, for m = 64, S-Plus gives the condition number (the
ratio of the largest to smallest eigenvalue) of the covariance matrix of
the observations as 4.6 x 109 and refuses to calculate its QR decom-
position due to its apparent near singularity. A good project would
be to develop methods other than using higher-precision arithmetic to
ameliorate these numerical difficulties.
6
Predicting With Estimated
Parameters
6.1 Introduction
Chapters 3 and 4 examined the behavior of pseudo-BLPs. Although the re-
sults given there provide an understanding of how linear predictors depend
on the spectral density of a stationary random field, they do not directly
address the more practically pertinent problem of prediction when parame-
ters of a model must be estimated from the same data that are available for
prediction. The reason I have avoidedprediction with estimated parameters
until now is that it is very hard to obtain rigorous results for this problem.
The basic difficulty is that once we have to estimate any parameters of the
covariance structure, "linear" predictors based on these estimates are no
longer actually linear since the coefficients of the predictors depend on the
data.
The sort of theory one might hope to develop is that, as the number of
observations increases, it is generally possible to obtain:
even when certain unknown parameters are estimated. Such general results
do exist for predicting future values of a time series observed on the integers
with finite-dimensional parameter spaces (Toyooka 1982 and Fuller 1996,
Section 8.5). Gidas and Murua (1997) prove that if a continuous time series
is observed at 8,28, ... ,To, where both 8- 1 and 8T tend to infinity, then (A)
and (B) are generally possible for predictions a fixed amount of time after
6.1 Introduction 161
To. The results in all of these works require that the unknown autocovari-
ance function can be consistently estimated as the number of observations
increases. Under fixed-domain asymptotics, the results on equivalence of
Gaussian measures in 4.2 show that there can quite naturally be parame-
ters that cannot be consistently estimated as the number of observations
increases. Indeed, Yakowitz and Szidarovszky (1985, Section 2.4) essentially
claim that the impossibility of consistently estimating the auto covariance
function based on observations in a fixed region implies that (A) and (B)
are unachievable. The results in Chapter 4 show that, at least for Gaussian
random fields, this line of reasoning is inadequate. Specifically, Theorems 8
and 10 in 4.3 demonstrate that there is no need to distinguish between
equivalent Gaussian measures in order to obtain asymptotically optimal
predictions.
These theorems do not by themselves imply (A) and (B). Indeed, direct
analogues to Theorems 8 and 10 in 4.3 will not generally be possible for
predictions based on estimated models. The problem, as I discuss in 6.8,
has to do with the uniformity of these results over all possible predictions.
If one restricts the class of predictands appropriately, then I expect that
rigorous results in support of (A) and (B) are obtainable. Putter and Young
(1998) provide the first step of an approach to proving (A) and (B) for
predictions based on estimated parameters, although much remains to be
done to obtain any such result when using, as I advocate, the Matern model
for the autocovariance function.
This chapter provides theorems, heuristic derivations, numerical calcula-
tions and a simulated example concerning the estimation of auto covariance
functions and prediction of random fields based on these estimates. Sec-
tion 6.2 describes Matheron's notion of microergodicity, which is closely
related to equivalence and orthogonality of measures and which plays a
crucial role in thinking about whether (A) and (B) should be possible un-
der fixed-domain asymptotics. Section 6.3 demonstrates a crucial flaw in an
argument due to Matheron (1971, 1989) that purports to show that (B) is
unachievable for predicting integrals of sufficiently smooth random fields.
Section 6.4 describes maximum likelihood and restricted maximum likeli-
hood estimation for the parameters of the covariance function of a Gaussian
random field. In many settings, as the number of observations increases,
maximum likelihood estimates are asymptotically normal with mean equal
to the true value of the parameter vector and covariance matrix given by
the inverse of the Fisher information matrix. Section 6.4 briefly describes
such standard asymptotic results and explains why they often do not hold
under fixed-domain asymptotics.
Section 6.5 advocates the Matern class as a canonical class of autocovari-
ance functions for spatial interpolation problems. Recall from 2.10 that the
general form of the Matern spectral density of an isotropic random field on
Rd is f(w) = ¢(0i. 2 + IwI 2 )-v-d/2. The critical parameter here is ZI, which
controls the degree of differentiability of the underlying random field. Any
162 6. Predicting With Estimated Parameters
class of models that does not include a parameter allowing for a varying
degree of differentiability of the random field is, in my opinion, untenable
for general usage when interpolating random fields. Thus, in particular, a
standard semivariogram model such as the spherical (see 2.10) should not
be used unless there is some credible a priori reason to believe the semi-
variogram must behave linearly near the origin. The same criticism applies
to the exponential model, even though it is a special case of the Matern
model with II = ~.
Section 6.6 investigates numerically the Fisher information matrix for
the parameters of the Matern model in various settings, including cases in
which there are measurement errors. An important finding of 6.6 is that
evenly spaced observations can lead to great difficulty in estimating the
parameters of the Matern model.
Theorem 1 in Section 6.7 derives fixed-domain asymptotic properties of
maximum likelihood estimates for a class of periodic random fields closely
related to the Matern class. I would expect that similar results hold for
estimating the parameters of the (nonperiodic) Matern class itself, but
cannot prove such a claim.
Section 6.8 considers some properties of the commonly used plug-in
method for prediction and assessment of mses, in which unknown parame-
ters of the auto covariance function are estimated and then these estimates
are treated as if they were the truth. In particular, I give an approximate
frequentist formulation of Jeffreys's law relating the additional informa-
tion a predictand has about unknown parameters beyond that contained
in the observations to the effect on the prediction of having to estimate
these parameters. This approximation should be compared to the exact
Bayesian formulation of Jeffreys's law given in 4.4. The approximation is
easily computed and provides the basis of a numerical study on the effect
of estimation on subsequent predictions.
Section 6.9 considers an example based on simulated data showing serious
problems with some commonly used methods in spatial statistics when the
process under investigation is differentiable.
Section 6.10 describes and advocates the Bayesian approach as the best
presently available method for accounting for the effect of the uncertainty
in the unknown parameters on predictions. However, it turns out that the
prior distributions on unknown parameters that are a necessary part of any
Bayesian analysis need to be chosen with some care.
is all auto covariance functions for which K" (0) exists and is finite, then
K"(O) is not microergodic. To see this, consider Ko(t) = e- 1tl (1 + Itl) and
K 1(t) = ~e-2Itl(1 +2It/), for which KO"(O) = -1 and K1"(0) = -~. Since,
by (24) in 4.2, GR(O,Ko) == GR(O,KI), K"(O) is not microergodic. If the
class of models is Ko(t) = (he-92Itl (1 +02It/), 8 = (0 1, (2) E (0,00) x (0, 00),
then -0 101 = K"(O) is still not microergodic, but 2010t = KIII(O+) is (Ex-
ercise 5). On the other hand, if the class of models is Ko(t) = OK(t) where
o E (0,00), K is an auto covariance function possessing a spectral density
and K"(O) exists, then 0 and hence KO"(O) is microergodic, which follows
from Exercise 6. However, assuming the auto covariance function is known
up to a scalar multiplier is highly restrictive. Finally, if Ko(t) = (he- 02t2
with 8 = ((h,02) E (0,00) x (0,00), then 8 is microergodic (Exercise 8)
and hence so is Ko"(O). This last example makes use of the unusual prop-
erties of processes with analytic auto covariance functions and should be
considered atypical.
Exercises
1 For a stochastic process Z observed on [0, 1], define the empirical
semivariogram for i'(t) for 0 ::; t < 1 by
i'(t) =()
2
1
1 t-
1
0
1 t
-
{Z(s + t) - Z(s)}2ds.
{ I( ) - } rP ~ ·-6 7rrP
var Z - Zn,l '" 167r5 n 6 L-,.J = 15 120n6 '
i=l '
where the last step uses 23.2.16 in Abramowitz and Stegun (1965). Thus,
if we can estimate rP consistently as n --+ 00, then we can obtain an asymp-
totically valid estimate of the mse of the asymptotically optimal predictor
Zn,l. I believe it is impossible to estimate ¢ consistently from the sample
semivariogram, although I do not know how to prove this claim. However, it
is possible to estimate rP consistently by taking an appropriately normalized
sum of squared third differences of the observations. More specifically, defin-
ing the operator 6, by 6,Z{t) = c1{Z{t + €) - Z{t)}, then f{w) '" rPw-6
as w --+ 00 implies
i: r
168 6. Predicting With Estimated Parameters
Exercises
10 For a weakly stationary process Z on JR with spectral density f and
auto covariance function K with J:x,
w2 f(w) dw < 00, show that
100
-00
W2 sin 2 (~)
2
f(w) dw = ~{K"(1) -
2
K"(O)}.
13 For ¢n as defined in (2), prove that var ¢n ---+ 0 as n -+ 00. The ar-
gument is somewhat reminiscent of Theorem 1 in Chapter 4, although
quite a bit simpler because of the assumption that f(w) '" ¢w- 6 as
6.4 Likelihood Methods 169
= 1 00
-00
4>6
w
{2nsin (2 n
W
)}
6
eiwtw + o(n2 )
uniformly in t.
(3)
where W(lJ) = MTK(lJ)-lM. Thus, the MLE of (lJ,{3) can be found by
maximizing
• n
l(lJ, (3(lJ)) = -2"log(27r) - ~ log det{K(lJ)} (4)
- ~ZT {K(lJ)-l - K(lJ)-lMW(lJ)-lMTK(lJ)-l}Z
170 6. Predicting With Estimated Parameters
(Exercise 14). Maximizing the likelihood over some parameters while hold-
ing others fixed is called profiling and the function l(O, 13(0)) is called the
profile log likelihood for 0 (McCullagh and NeIder 1989, p. 254).
Gaussian assumption
The likelihood functions given in the previous subsections all assume that
the random field is Gaussian. This is a highly restrictive assumption so
that it is reasonable to be concerned about the performance of likelihood-
based methods based on a Gaussian model when the random field is in fact
not Gaussian. In particular, such methods will generally perform poorly if
there are even a small number of aberrant observations. However, methods
that are functions of the empirical semivariogram such as least squares and
generalized least squares (Cressie 1993, Section 2.6) will also be sensitive to
aberrant values even though they do not explicitly assume that the random
field is Gaussian. Cressie and Hawkins (1980) and Hawkins and Cressie
(1984) describe "robust" procedures for estimating semivariograms that
are less sensitive to distributional assumptions than procedures based on
the empirical semivariogram. However, these procedures do not fully take
into account the dependencies in the data and thus may be considerably
less precise than likelihood-based estimates when the Gaussian assumption
is tenable. A good topic for future research would be the development
of models and computational methods for calculating likelihood functions
for non-Gaussian random fields. Diggle, Tawn and Moyeed (1998) make
an important step in this direction and demonstrate that Markov Chain
172 6. Predicting With Estimated Parameters
Monte Carlo methods provide a suitable computational tool for some non-
Gaussian models.
Computational issues
One potentially serious obstacle to employing likelihood methods is com-
puting the likelihood function. In general, if there are n observations,
calculating the determinant of W(O) and quadratic forms in W(O)-l each
require O(n 3 ) calculations. In particular, for irregularly scattered observa-
tions in more than one dimension, an O(n 3 ) calculation is usually necessary
to calculate the values of the likelihood function exactly. Thus, if there
are more than several hundred observations, exact likelihood calculations
are often infeasible. However, if the observations are on a regular lattice,
then it is possible to compute the likelihood function exactly with fewer
calculations (Zimmerman 1989). In this setting, it is also possible to use
spectral methods to approximate the likelihood (Whittle 1954; Guyon 1982;
Dahlhaus and Kiinsch 1987; and Stein 1995c), in which case, the approx-
imate likelihood can be calculated very efficiently by making use of the
fast Fourier transform (Press, Flannery, Teukolsky and Vetterling 1992,
Chapter 12).
Vecchia (1988) describes a general method for efficiently approximating
the likelihood function for spatial data. Let P(Zl, ... , zn) be the joint density
of (Z(Xl)' ... , Z(xn)) evaluated at (Zl' ... ' zn) and write other joint and
conditional densities similarly. Next, write
n
P(Zl, ... ,Zn) =p(zd I1p(Zj I zl, ... ,zj-d
j=2
and then approximate p(Zj I Zl, ... , Zj-l) by the conditional density of
Z(Xj) given just the min(m,j -1) observations among Z(Xl)' ... ' Z(Xj_l)
that are nearest to Xj in Euclidean distance, where m is much smaller
than n. The smaller the value of m, the more efficient the computation but
the worse the approximation to the true joint density. The ordering of the
observations affects the results, but Vecchia (1988) found this effect to be
small in the examples he studied and suggests ordering by the values of
anyone of the coordinate axes of the observation locations.
A somewhat related method for approximating the likelihood is to divide
the observation region into some number of subregions, calculate the like-
lihood for each subregion separately and then multiply these likelihoods
together. Similar to Vecchia's procedure, smaller subregions lead to eas-
ier computation but worse approximations of the likelihood. Stein (1986)
recommended such a procedure for minimum norm quadratic estimators
(Rao 1973), which also require computing quadratic forms of n x n in-
verse covariance matrices. I would recommend using subregions containing
at least 100 observations, in which case, it should be feasible to carry out
6.4 Likelihood Methods 173
(7)
Exercises
14 Verify (4).
15 Verify (5).
16 Show that if one defines the likelihood in terms of g((J), where g is an
invertible function on e, then g(O) is an MLE for g((J) if and only if 0
is an MLE for (J. Thus, MLEs are invariant under arbitrary invertible
transformations of the parameters.
17 Suppose a random vector X of length n has a density with respect
to Lebesgue measure p(. I (J) depending on a parameter (J. If Y =
h(X), where h is an invertible function from IRn to IRn possessing
continuous first partial derivatives, show that i((J; Y) - i((J; X) does
not depend on (J and, hence, it does not matter whether we use X or
Y in finding an MLE for (J. Thus, MLEs are invariant under smooth
invertible transformations of the observations.
The next three exercises review basic properties about likelihood func-
tions and provide a heuristic justification of (7). Assume throughout
these exercises that {P(J : (J E e} is a class of probability models for
the observations with true value (Jo and that switching the order of
differentiation and integration is permissible.
18 Show that E(JoS((Jo) = O. Show that I((Jo) = E(Joi((Jo).
19 Consider random vectors X and Y whose joint distribution is speci-
fied by P(J. Show that I((Jo; X) + I((Jo; Y) - I((Jo; (X, V)) is positive
semidefinite, where the argument after the semicolon in I(·;·) indi-
cates the observations for which the Fisher information matrix is to be
calculated. Show that I((Jo; X) + I((Jo; Y) = I((Jo; (X, V)) if X and
Y are independent for all (J.
176 6. Predicting With Estimated Parameters
20 Suppose Xl, X 2 , ••. is as in the last subsection and that the subscript
n generically indicates quantities based on X n . By taking a first-order
Taylor series in the score function about 8 0 , give a heuristic argument
showing that any consistent sequence of solutions On of the score equa-
tions should satisfy On ~ 8 0 + i n (80 )-ISn(80 ). As n -+ 00, suppose
In(80)-lin (80 ) -+ I in probability (a weak law of large numbers) and
In(80)-1/2Sn(80) ~ N(O, I) (a central limit theorem). Show that (7)
plausibly follows.
Since one would never leave out an overall scale parameter, the presence
of ¢ in the model is also essential. In addition, although Theorem 8 in
4.3 implies that one could leave J1. out of the model with asymptotically
negligible effect, it is hard to argue for arbitrarily taking the mean of Z
to be 0 unless there is some substantive reason to believe that it is. The
serious issue is whether the parameter a is helpful, since it has negligible
impact on the high frequency behavior of the spectral density. The results
in Chapters 3 and 4 show that varying a will have little effect on inter-
polations if the observations are sufficiently dense. Furthermore, in three
or fewer dimensions, a cannot be consistently estimated based on obser-
vations in a fixed domain, which follows from (20) in 4.2. Indeed, Wahba
(1990) essentially argues that a should just be set to o. This leaves us with
the model for the spectral density of ¢lwl- 2v - d , which is not integrable in
a neighborhood of the origin for v > O. Thus, this function is not a spectral
density for a stationary random field. It is, however, the spectral density
of an IRF (intrinsic random function) of order L2vJ (see 2.9).
Although leaving a out of the model is a defensible position, there are a
number of reasons why I mildly favor its inclusion. First, the mathematical
arguments for excluding a are asymptotic and hence should not be con-
sidered universally compelling. Particularly for predictands located near or
outside the boundaries of the observation region, the value of a can matter
substantially. Furthermore, if the correlations of the random field die out at
a distance much shorter than the dimensions of the domain of the observa-
tions, it may be possible to obtain a decent estimate of a. Handcock, Meier
and Nychka (1994) give an example concerning measurements of electrical
conductivity in soil that provides clear evidence of the need for positive a
both to fit the covariance structure of the data well and to provide sensible
interpolations. Second, if the available observations include a substantial
measurement error, then I suspect that badly misspecifying the low fre-
quency behavior of the spectral density could lead to serious bias in ML or
REML estimates of v even for moderately large sample sizes. Measurement
error makes estimating the high frequency behavior of a random field much
more difficult, so that the low frequency behavior can then have a larger
influence on parameter estimates. This greater influence may produce sub-
stantially biased ML estimates of the high frequency behavior if the low
frequency behavior is poorly specified. An example of severe systematic
error in ML estimates due to misspecification of a model at low frequencies
when the underlying process is deterministic is given in Section 6.3 of Stein
(1993b). Using the Matern model of course does not guarantee that the low
frequency behavior of the spectral density is correctly specified. However,
allowing a to be estimated from the data does provide substantial addi-
tional flexibility to the model while only adding one parameter. Further
study of this issue is in order.
My final reason for including a is that I find it somewhat unnatural to
link the high frequency behavior of the spectral density and the order of the
178 6. Predicting With Estimated Parameters
polynomial of the mean of the random field, which setting a-: = 0 requires.
Specifically, for the spectral density ¢ Iw 1- 2 v-d, the corresponding random
field must be an IRF of order at least l/J J and hence its mean is implicitly
taken to be a polynomial of order at least l/J J with unknown coefficients
(see 2.9). I would prefer to be able to assume the mean is constant no
matter how large /J is.
The fact that the order of the polynomial mean must increase with /J if
one sets a-: = 0 causes a bit of difficulty with REML estimation of ¢ and /J.
Specifically, suppose one models Z as a Gaussian IRF with spectral density
¢lwl- 2v - d , where the order r of the IRF is the lowest feasible: r = l/JJ.
Then the number of linearly independent contrasts out of n observations
is n - (dt~JJ), assuming this number is nonnegative (Exercise 21). This
number jumps downward as /J increases at each integer value of /J, which
means that the likelihood of the contrasts for, say, /J = 0.5 is not based on
the same information as for any /J > 1. If one is fairly certain a priori that
II < 1, then this problem does not arise.
On the whole, I would advise leaving a-: in the model. However, if ex-
amination of the likelihood function yields no substantial evidence against
a-: = 0, one can then set a-: = 0, adopt the appropriate order IRF model and
end up with a slightly more parsimonious model for the covariance struc-
ture. As long as all predictions are interpolations, I do not see that much
harm can come from doing so. Furthermore, certain numerical difficulties
that may occur with Matern autocovariance functions when /J is large can
be avoided by using a-: = O. More specifically, for the Matern model with
/J large, the principal irregular term of the autocovariance function (see
2.7) is dominated by many "regular" terms (even order monomials) in a
neighborhood of the origin, which may lead to numerical inaccuracies when
calculating likelihood functions or BL UPs based on this model.
Exercise
21 Show that the number of monomials of order at most p in d dimensions
is (dt P).
ror and then some with measurement error. In interpreting the results, it
is helpful to keep in mind that a mainly affects the low frequency behavior
of Z whereas ¢ and II both have a critical impact on the high frequency
behavior of Z.
For a Gaussian random vector with known mean 0, the Fisher infor-
mation matrix takes on a fairly simple form. Specifically, if Y follows a
N(O, E(6)) distribution, then the jkth element of I(6) is
Ljk(6) = ~ tr {E(6)-lEj(9)E(9)-lEk(9)}, (8)
where Ej(6) = 8E(9)/8()j (Exercise 22). To carry out this calculation
for observations from a Gaussian random field under the Matern model
requires differentiating the modified Bessel function IC v with respect to
II. This can be conveniently donewhen II is an integer (Abramowitz
and Stegun 1965, 9.6.45). In particular, (8/811)IC v (t)lv=1 = r1ICo(t) and
(8/811)IC,At)lv=2 = 2r 2ICo(t) + 2r l IC1(t).
The sets of observation locations on IR I consider include 40 or 80 observa-
tions, varying levels of spacings between observations, and evenly spaced or
randomly located observations. Specifically, in the evenly spaced case, there
are observations at 8, 28, ... ,n8 for n = 40 or 80, where 8 ranges between
0.02 and 1. When the sample size is 40, the randomly located observations
were generated from a single realization of 40 independent and uniformly
distributed random variables on [0,40]. Figure 1 shows these 40 values,
which I denote by tt, ... , t40' When I refer to 40 random locations with
spacing 8, I mean the set of observation locations {8tl' ... ,8t40}' When I
refer to 80 random observations with spacing 8, I mean the set of locations
{8tl,'" ,8t40}U{8(h + 40), ... , 8(t40 + 40)}. The reason for repeating and
shifting the initial 40 locations rather than generating an independent set
of 40 random locations on [40,80] is to make the cases of 40 and 80 ran-
dom locations more readily comparable. In particular, by Exercise 19 and
the stationarity of Z, repeating the same pattern twice yields a value of I
for 80 observations that is at most double the value for 40 observations.
The extent to which this value is not doubled then measures the degree
of redundancy in the information in the two halves of the 80 observation
sample.
o 10 20 30 40
FIGURE 1. Locations of random observations on [0,40].
180 6. Predicting With Estimated Parameters
2.0
1.5
1.0 \
\
0.5
\
,,
" "-
0.0
0 2 4 6 8
t
FIGURE 2. Plots of Matern autocovariance functions used in examples. Solid
line corresponds to 8 = (1,1,1) and dashed line to 8 = (24,2,2).
5000
®
EEl ®
X EEl ®
+ X
EEl ®
1000
+ X
~ ®
500 +
Iv
®
+ X
EEl
X
+
100 EEl
50 +
100
EEl
®
50
~ +
X
~ t
Ia
10 ®
"*
®
5 *
®
*
*
0.02 0.05 0.1 0.2 0.5 1
spacing fJ
FIGURE 3. Diagonal values of Fisher information matrix for Matern model with
(c/>,v,o.) = (1,1,1).
+ indicates 40 evenly spaced observations with spacing 6.
EEl indicates 80 evenly spaced observations with spacing 6.
x indicates 40 randomly placed observations on [0,406].
® indicates the same 40 randomly placed observations on [0,406] together with
each of these observation locations plus 406, for a total of 80 observations.
5 +
EB
+
T<P + +
1 + + EB x
@ @ ®
x
0.5 ffi
® ®
® ® ®
®
1 -r
0.5 EB
+
0.1 EB
IV 0.05 + x
+ EB x
+ ®
+ EB x
®
0.01 EB x
x
~ ®
0.005 ®
®
®
1
+
+x
0.5 +x
~ + + + EB
TO. x
~ EB x
® x
® EB
®
0.1 ® ®
0.02 0.05 0.1 0.2 0.5 1
spacing 0
FIGURE 4. Diagonal values of inverse Fisher information matrix for Matern
model with (¢, v, a) = (1,1, I). Symbols have same meaning as in Figure 3.
184 6. Predicting With Estimated Parameters
0.98 EB
~,fI ~
0.96
0.94 ®
1 EB
Ee
®
$ ~
0.8
~,& Et ~
+
+ EB ®
x
0.6 ~ ®
®
Ee
0.8 $
3
fI,&
+
Et
~
+ EB
0.6 EB x ®
x ®
6()
functions for these two models are plotted in Figure 2. The values of 7" I
consider are 10- 4 ,10- 3 ,10- 2 ,10- 1 and 1. Although the value 10- 4 may
seem small, note that it means the standard deviation of the measurement
error divided by the standard deviation of the process is 0.7%, which strikes
me as quite plausible for many physical quantities.
The reason for including two different values of v is to see whether in-
creasing r has more effect on our ability to estimate larger or smaller values
of v. Theoretical results in Stein (1993a) and the intuition that estimating
the degree of differentiability of a random process with noisy observations
should be harder for smoother processes suggest that r should have more
of an impact on the ability to estimate v when 1/ is 2 rather than 1. Results
in Tables 1 and 2 support this expectation: for evenly spaced observations,
when r goes from 10-4 to 10- 2 , ZV increases by a factor of 1.55 when v = 1
but by a factor of 5.13 when v = 2.
On the other hand, r is much easier to estimate when 1/ = 2 than when
1/ = 1, especially for smaller r and evenly spaced observations. In particular,
for evenly spaced observations and r = 10- 4 , IT /r2 is 1,893 for v = 1 and
1.507 for 1/ = 2. This large value for IT /r2 for v = 1 suggests that these
data provide essentially no information for distinguishing the true value
for r of 10- 4 from either r = 0 or much larger values such as r = 10- 3 .
Fortunately, in this case we have f(w) '" w- 3 as w ~ 00, so that 0: = 3 in
the notation of Theorem 7 of 3.7, and since r/oo.- 1 = 0.01 is small, this
theorem suggests that at least for certain predictions, acting as if r = 0
will produce nearly optimal predictors.
The other diagonal elements of Z depend on r as should be expected.
Specifically, parameters that are more related to high frequency behavior
should be more affected by increasing r than parameters affecting mostly
low frequency behavior. The results in Tables 1 and 2 are in line with this
186 6. Predicting With Estimated Parameters
TABLE 1. Diagonal values of Fisher information matrix and its inverse for
(cP, v, a) = (1,1,1) for various values of r based on 80 observations with spacing
6 = 0.1. Results for evenly spaced observations indicated by F and random design
indicated by R (see Figure 3 for details).
r
10- 4 10- 3 10- 2 10- 1 1
rjJ2I", F 39.33 34.51 19.70 8.996 4.002
R 33.33 24.47 14.96 7.947 3.717
v2Iv F 1130 927.0 358.5 72.64 11.09
R 1225 662.6 241.1 61.19 10.17
0: 2Iet F 10.83 10.83 10.77 10.53 9.517
R 10.81 10.80 10.73 10.44 9.285
r2Ir F 0.005481 0.4036 7.625 22.48 31.71
R 2.862 7.736 17.10 26.17 32.72
Conclusions
One overall pattern that emerges from these calculations is that random
designs can often yield better parameter estimates than evenly spaced de-
signs of comparable density, sometimes dramatically so. However, if our
6.6 Fisher information matrix under the Matern model 187
TABLE 2. Diagonal values of Fisher information matrix and its inverse for
(¢, v, 0:) = (24,2,2) for various values of T. Observation locations are same as
in Table l.
r
10- 4 10- 3 10- 2 10- 1 1
¢2I<p F 30.02 19.05 11.94 7.299 4.038
R 20.41 14.89 10.33 6.657 3.745
v 2Iv F 2791 1223 501.1 193.4 67.27
R 1579 831.5 394.7 166.4 60.47
0i. 2IOI. F 58.91 58.54 57.54 54.84 46.63
R 58.64 58.22 57.17 53.97 44.84
r2IOI. F 2.196 12.12 22.05 28.72 33.14
R 14.22 20.42 25.74 30.20 33.76
Exercises
22 Verify (8).
23 Repeat the numerical calculations in Tables 1 and 2 for different values
of the spacing 8. How do the comparisons between the fixed and ran-
dom designs change with 8? How do the comparisons between II = 1
and II = 2 change with 8?
Note that Z(w) has period 27T in each coordinate, so there is no loss in
information in restricting attention to frequencies in (-7T, 7Tjd. Consider
further restricting to just frequencies w in (-7T, 7Tjd of the form 27Tm- 1 p,
which is equivalent to considering only p E Bm, where Bm = { L~(m- -
l)J, - L~(m - l)J + 1, ... , L~mJ} d. Now, for j E Qm,
= rr
u=1
d
exp{i7rm- 1(qu - pun (10)
where the integrand is defined by continuity for those w for which the
denominator is O. Here and subsequently in this section, a subscript u
indicates the uth component of a vector so that, for example, Pu is the uth
component of p.
Periodic case
If Z has period 27rm6' in each coordinate, a great simplification occurs in
(10). This periodicity implies that F is a discrete measure placing all of
its mass on points (m6') -1 k for k E 7ld , so that Fo puts all of its mass
on points of the form 27r(m6')-1p for p E Bm. Since n:=l
sin 2 (~m6'wu)
has a zero of order 2d at all such points, the only way (10) can be
nonzero for w = 27r(m8)-1r, r E B m , is if sin {7rm-1(ru - = 0 and pun
sin {7rm-1(ru - qu)} = 0 for u = 1, ... , d, which for p, q, r E Bm can only
occur if p = q = r. Thus, for p, q E B m ,
for p 2: 0 and
Xp = L Z(6'j) sin(27rm- 1pTj)
jEQm
190 6. Predicting With Estimated Parameters
for P < O. Using Z(w) = Z( -w ), it follows that from {Xp : P E Bm} one
can recover {Z(21Tm-lp) : p E Bm} and hence the original observations.
Thus, by Exercise 17 in 6.4, we can find the MLE for 0 by maximizing
the likelihood with {Xp : p E Bm} as the vector of observations. Using
the cosine transform for p ;?: 0 and the sine transform for p < 0 in the
definition of Xp is notationally convenient, since the real random variables
Xp for p E Bm are then uncorrelated with mean 0 and
var(Xp) = ~m2dF6(21Tm-lP)fm(P) (11)
for p EBm , where fm(P) = 1 unless all components of 2p are 0 or m, in
which case, fm(P) = 2 (Exercise 26).
Suppose Z is a stationary mean 0 Gaussian random field on IRd with
period 21T in each coordinate and spectral measure with mass 1(J(j) for
j E 7i} and no mass elsewhere for 0 E e. If Z is observed at 21Tm- 1j for
j E (1m, then {Xp : P E Bm} is a one-to-one function of the vector of m d
observations Zm. Furthermore, the Xps are independent mean 0 Gaussian
random variables with variances given by (11), so by Exercise 17 in 6.4,
the log likelihood for 0 is of the form
where the constant does not depend on 0 and hence has no impact on the
maxima of the function. Call 8m a maximum likelihood estimator (MLE) of
o if it maximizes (12) for 0 E e. Suppose that 1(J(j) = f/J(a 2 + IjI2)-II-d/2,
0= (f/J, v,a) and e = (0, (0)3. We know that the MLE cannot be consistent
for any function of 0 that is not microergodic and we expect that it is
consistent for any function of 0 that is microergodic. From (20) in 4.2, we
see that f/J and v are microergodic in any number of dimensions but that a
is microergodic if and only if d ~ 4.
Asymptotic results
This section provides asymptotic results for Zm = Zm(OO), the Fisher in-
formation matrix for 0 based on Zm. Let 8m be an MLE for 0 based on
Zm. As indicated in 6.4, we might then expect
Let us first give exact expressions for the elements of $\mathcal I_m$ in the current setting. Defining
$$e_i(p, m, \alpha, \nu) = \sum_{j\in\mathbb Z^d}(\alpha^2 + |p+mj|^2)^{-\nu-d/2}\log^i(\alpha^2 + |p+mj|^2),$$
we have
$$\operatorname{var}\left(\frac{\partial\ell}{\partial\phi}\right) = \frac{m^d}{2\phi^2}, \qquad \operatorname{var}\left(\frac{\partial\ell}{\partial\nu}\right) = \frac12\sum_{p\in B_m}\left\{\frac{e_1(p,m,\alpha,\nu)}{e_0(p,m,\alpha,\nu)}\right\}^2,$$
$$\operatorname{var}\left(\frac{\partial\ell}{\partial\alpha}\right) = \frac12\alpha^2(2\nu+d)^2\sum_{p\in B_m}\left\{\frac{e_0(p,m,\alpha,\nu+1)}{e_0(p,m,\alpha,\nu)}\right\}^2,$$
$$\operatorname{cov}\left(\frac{\partial\ell}{\partial\phi},\frac{\partial\ell}{\partial\nu}\right) = -\frac{1}{2\phi}\sum_{p\in B_m}\frac{e_1(p,m,\alpha,\nu)}{e_0(p,m,\alpha,\nu)},$$
$$\operatorname{cov}\left(\frac{\partial\ell}{\partial\phi},\frac{\partial\ell}{\partial\alpha}\right) = -\frac{\alpha(2\nu+d)}{2\phi}\sum_{p\in B_m}\frac{e_0(p,m,\alpha,\nu+1)}{e_0(p,m,\alpha,\nu)}$$
and
$$\operatorname{cov}\left(\frac{\partial\ell}{\partial\nu},\frac{\partial\ell}{\partial\alpha}\right) = \frac{\alpha(2\nu+d)}{2}\sum_{p\in B_m}\frac{e_1(p,m,\alpha,\nu)\,e_0(p,m,\alpha,\nu+1)}{e_0(p,m,\alpha,\nu)^2}.$$
Define
$$g_\nu(x) = \frac{\sum_{j\in\mathbb Z^d}|x+j|^{-2\nu-d}\log|x+j|}{\sum_{j\in\mathbb Z^d}|x+j|^{-2\nu-d}} \qquad\text{and}\qquad h_\nu(x) = \frac{\sum_{j\in\mathbb Z^d}|x+j|^{-2\nu-d-2}}{\sum_{j\in\mathbb Z^d}|x+j|^{-2\nu-d}},$$
where expectations such as $Eg_\nu$ and $Eh_\nu$ below are with respect to x uniformly distributed on the unit cube.
For d = 4,
$$\mathcal I_m^{-1} \sim \operatorname{var}(g_\nu)^{-1}R_m, \tag{15}$$
where
$$r_m(1,1) = \frac{2\phi^2\log^2 m}{m^4}, \qquad r_m(1,2) = \frac{\phi\log m}{m^4}, \qquad r_m(2,2) = \frac{1}{2m^4},$$
and
$$(2m^d)^{1/2}\begin{bmatrix}\dfrac{\hat\phi - \phi}{2\phi} - (\log m + Eg_\nu)(\hat\nu - \nu)\\[4pt] \operatorname{var}(g_\nu)^{1/2}(\hat\nu - \nu)\end{bmatrix} \xrightarrow{d} N(0, I).$$
If the mean of Z were an unknown constant and we used REML to estimate θ, then Theorem 1 would also apply to the Fisher information matrix for the contrasts. Specifically, the likelihood of the contrasts just leaves the term p = 0 out of the sums over $B_m$ in (12) and it follows that the asymptotic results in Theorem 1 are unchanged (although the result in Exercise 37 for d = 4 changes slightly).
If the observations are made with independent Gaussian measurement errors with mean 0 and constant variance $\sigma^2$, common sense and the results in 6.6 suggest that the estimation of $(\phi, \nu, \alpha)$ should be more difficult than when there is no measurement error. However, the discussion in 4.2 suggests that whatever parameters can be consistently estimated when there is no measurement error can still be consistently estimated when there is. In the present setting, the $X_p$'s, $p \in B_m$, are still independent Gaussians when there are measurement errors, so it is in principle possible to carry out an asymptotic analysis similar to the one in Theorem 1. Rather than doing that, I just indicate how we might expect the rates of convergence to change by considering results in Stein (1993). In that work, I essentially studied the d = 1 case here except that the parameter α was omitted. Specifically, I assumed that $f_\theta(j) = \phi|j|^{-2\nu-1}$ for $j \ne 0$ and $f_\theta(0) = 0$. If there are no measurement errors, a minor modification of Theorem 1 here shows that the diagonal elements of the inverse Fisher information matrix for $(\phi, \nu)$ are of the orders of magnitude $m^{-1}\log^2 m$ and $m^{-1}$ as in (16) for d = 1. However, if there are measurement errors of variance $\sigma^2$ (independent of m), then the diagonal elements of the inverse Fisher information matrix for $(\phi, \nu, \sigma^2)$ are of the orders $m^{-1/(2\nu)}\log^2 m$, $m^{-1/(2\nu)}$ and $m^{-1}$, respectively. Thus, the convergence rates for $\hat\phi$ and $\hat\nu$ are lower when there is measurement error and the effect is more severe for larger values of ν.
PROOF OF THEOREM 1. As an example of how to approximate the elements of $\mathcal I_m$ as m increases, consider $\operatorname{cov}(\partial\ell/\partial\phi, \partial\ell/\partial\nu)$. First note that $e_1(0,m,\alpha,\nu)/e_0(0,m,\alpha,\nu) \ll \log m$. Next, for $p \ne 0$, (22) applies (Exercise 32). Similarly, (23) holds for some $\epsilon > 0$ when $d \ge 5$ (Exercise 33). The case d = 4 requires particular care. It is possible to show that
$$\operatorname{cov}\left(\frac{\partial\ell}{\partial\phi},\frac{\partial\ell}{\partial\alpha}\right) \to -\frac{\alpha(2\nu+1)}{2\phi}\sum_{p=-\infty}^{\infty}\frac{1}{\alpha^2+p^2}$$
for d = 1 and
$$m^{2-d}\operatorname{cov}\left(\frac{\partial\ell}{\partial\phi},\frac{\partial\ell}{\partial\alpha}\right) = -\frac{\alpha(2\nu+d)Eh_\nu}{2\phi} + O(m^{-\epsilon})$$
for $d \ge 3$, and
$$\mathcal I_m(3,3) = \tfrac12 m^{d-4}\alpha^2(2\nu+d)^2Eh_\nu^2.$$
It follows that for some $\epsilon > 0$ (Exercise 35),
$$\det(\mathcal I_m) = \frac{m^{3d-4}\alpha^2(2\nu+d)^2}{2\phi^2}\left[Eg_\nu^2\,Eh_\nu^2 - (Eg_\nu)^2Eh_\nu^2 - Eg_\nu^2(Eh_\nu)^2 - \{E(g_\nu h_\nu)\}^2 + 2\,Eg_\nu\,Eh_\nu\,E(g_\nu h_\nu)\right] + O(m^{3d-4-\epsilon}), \tag{28}$$
which is nonnegative by the Cauchy-Schwarz inequality and can be shown to be positive (Exercise 36). Equation (14) of Theorem 1 follows.
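The nonnegativity of the bracketed expression in (28) can be checked directly; it is in fact a covariance determinant (a verification I am supplying, not text from the book):
$$Eg_\nu^2\,Eh_\nu^2 - (Eg_\nu)^2Eh_\nu^2 - Eg_\nu^2(Eh_\nu)^2 - \{E(g_\nu h_\nu)\}^2 + 2\,Eg_\nu\,Eh_\nu\,E(g_\nu h_\nu) = \operatorname{var}(g_\nu)\operatorname{var}(h_\nu) - \operatorname{cov}(g_\nu, h_\nu)^2 \ge 0,$$
the inequality following from the Cauchy-Schwarz inequality applied to the centered variables $g_\nu - Eg_\nu$ and $h_\nu - Eh_\nu$.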
Next consider d = 4. The only element of $\mathcal I_m$ as given in (26) that is now incorrect is $\operatorname{var}(\partial\ell/\partial\alpha)$, which we have shown is $\alpha^2(2\nu+4)^2\pi^2\log m + o(\log m)$. It follows that (15) holds, which completes the proof.
Exercises
24 Verify (9).
25 Verify (10).
26 Verify (11).
27 Derive the covariance matrix of the score function for the setting in
this section.
28 Verify that (18) holds if (7) is true.
29 Verify (19).
30 Verify (20).
31 Verify (21).
32 Verify (22).
33 Verify (23).
34 Verify (24) and show that (25) follows.
35 Verify that (27) follows from (26).
36 Verify (28) and show that the result is positive for all ν > 0.
37 (Only for the truly brave or truly foolish.) Show that for d = 4,
$$\sum_{p\in B_m}\left\{\frac{e_0(p,m,\alpha,\nu+1)}{e_0(p,m,\alpha,\nu)}\right\}^2 = 2\pi^2\log m + A_\nu + B_\nu + C + O(m^{-\epsilon}) \tag{29}$$
for some $\epsilon > 0$, where C is a constant depending only on α and, finally,
$$B_\nu = \int_{C_4}\left\{\frac{2u_{2\nu+6}(x)}{|x|^{2\nu+6}u_{2\nu+4}(x)^2} + \frac{u_{2\nu+6}(x)^2}{u_{2\nu+4}(x)^2} - \frac{2}{|x|^{2\nu+8}u_{2\nu+4}(x)} - \frac{1}{|x|^4}\right\}dx,$$
where $u_\eta(x) = \sum_{j\in\mathbb Z^4}|x+j|^{-\eta}$. To obtain (29), prove the following results:
$$\sum_{p\in B_m}\left\{\frac{e_0(p,m,\alpha,\nu+1)}{e_0(p,m,\alpha,\nu)}\right\}^2 - \int_{mC_4}\left\{\frac{e_0(x,m,\alpha,\nu+1)}{e_0(x,m,\alpha,\nu)}\right\}^2dx = A_\nu + O(m^{-\epsilon}),$$
$$\int_{mC_4}\left\{\frac{e_0(x,m,\alpha,\nu+1)}{e_0(x,m,\alpha,\nu)}\right\}^2dx - \int_{mC_4}(\alpha^2+|x|^2)^{-2}\,dx = B_\nu + O(m^{-\epsilon})$$
and
$$\int_{mC_4}(\alpha^2+|x|^2)^{-2}\,dx = 2\pi^2\log m + C + O(m^{-\epsilon})$$
for some $\epsilon > 0$. Use (29) to show that $1 - \operatorname{corr}(\hat\phi, \hat\nu) \sim \gamma/\log^2 m$ and find an expression for γ.
(32)
where $E_0$ indicates expectation under the true model. Suppose the estimator $\hat\theta$ depends on Z only through its contrasts, which I denote by Y. Zimmerman and Cressie (1992) point out that existing procedures, including ML and REML, do yield estimates that are functions of the contrasts, and I assume that this is the case in the remainder of this section. It follows that $e(\hat\theta) - e(\theta_0)$ is also a function of the contrasts and is hence independent of $e(\theta_0)$. Thus
$$E_0e(\hat\theta)^2 = M(\theta_0) + E_0\{e(\hat\theta) - e(\theta_0)\}^2, \tag{33}$$
so that $E_0e(\hat\theta)^2 \ge M(\theta_0)$. It may seem obvious that the BLUP should have a smaller mse than a plug-in predictor, which replaces the true $\theta_0$ with an estimator. However, (33) does require that $e(\theta_0)$ and $e(\hat\theta) - e(\theta_0)$ be uncorrelated. Although assuming Z to be Gaussian is stronger than necessary for this uncorrelatedness to hold (Christensen 1991, Section 6.5), it is not difficult to construct examples of non-Gaussian processes for which $E_0e(\hat\theta)^2 < M(\theta_0)$ (see Exercise 38).
Before proceeding to more difficult problems, it is worth noting that in the simple setting where the autocovariance function K is known up to a scalar multiple, there is a satisfactory finite sample frequentist solution to the prediction problem. More specifically, suppose $K_\theta = \theta K$, the rank of $(m(x_1)\ \cdots\ m(x_n))$ is p and $\hat\theta$ is the REML estimate of θ, given in 6.4. Since the BLUP of $Z(x_0)$ does not depend on θ, the EBLUP and the BLUP are the same. It is then a simple extension of standard results on prediction intervals in regression problems with independent Gaussian errors (Seber 1977, Section 5.3) to show that $e(\hat\theta)/M(\hat\theta)^{1/2}$ follows a t distribution with n − p degrees of freedom, which can be used to give exact frequentist prediction intervals.
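A sketch of this exact interval in the $K_\theta = \theta K$ setting follows; the function and variable names are mine, and the formulas are the standard universal kriging and REML expressions. Here k0 and K00 are the cross-covariance vector and the predictand variance under K (that is, with θ = 1), X is the regression matrix and m0 the regression vector at the prediction location.

```python
import numpy as np
from scipy import stats

def exact_prediction_interval(z, X, K, k0, m0, K00, level=0.95):
    # cov(Z) = theta * K with K known up to the scalar theta, mean = X beta.
    # The BLUP does not depend on theta and e(theta_hat)/M(theta_hat)^{1/2}
    # follows a t distribution with n - p degrees of freedom.
    n, p = X.shape
    Ki = np.linalg.inv(K)
    A = X.T @ Ki @ X
    beta = np.linalg.solve(A, X.T @ Ki @ z)          # GLS regression estimate
    resid = z - X @ beta
    theta_hat = resid @ Ki @ resid / (n - p)         # REML estimate of theta
    c = m0 - X.T @ Ki @ k0
    lam = Ki @ (k0 + X @ np.linalg.solve(A, c))      # universal kriging weights
    pred = lam @ z                                   # BLUP = EBLUP here
    M1 = K00 - k0 @ Ki @ k0 + c @ np.linalg.solve(A, c)  # kriging mse at theta = 1
    half = stats.t.ppf(0.5 + level / 2, df=n - p) * np.sqrt(theta_hat * M1)
    return pred - half, pred + half
```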
More generally, there is no entirely satisfactory frequentist solution to making inferences based on EBLUPs. Harville and Jeske (1992) and Zimmerman and Cressie (1992) consider a more sophisticated method for estimating $E_0e(\hat\theta)^2$ than the plug-in estimator of mse $M(\hat\theta)$ in (31). Their method involves three separate approximations. First, they derive an exact relationship between $E_0e_0^2$ and $E_0e(\hat\theta)^2$ that holds under highly restrictive conditions and then assume this relationship is approximately true more generally. Next, they further approximate this result as a function of $\theta_0$ using Taylor series. Finally, they replace $\theta_0$ in this expression by $\hat\theta$. Unfortunately, Zimmerman and Cressie (1992) report simulation results showing that when neighboring observations are strongly correlated, this approach can sometimes produce worse answers than $M(\hat\theta)$.
To carry out a simulation study such as the one in Zimmerman and Cressie (1992), it is only necessary to simulate the observations and not the predictands. To be more specific, consider approximating $E_0e(\hat\theta)^2$ via simulation under some given Gaussian model for a random field Z, some set of observations and a particular predictand. Use the subscript 0 to indicate the true model. Calculate $E_0e_0^2$ once and for all. Simulate n realizations of the observations under the true model. For the jth realization, let $\hat\theta(j)$ be the estimator for θ, $e_0(j)$ the error of the BLUP and $e(\hat\theta(j), j)$ the error of
the EBLUP. One can then estimate the distribution function of the EBLUP's error by
$$\widehat{\Pr}(t) = \frac1n\sum_{j=1}^n\Phi\left(\frac{t - \{e(\hat\theta(j), j) - e_0(j)\}}{M(\theta_0)^{1/2}}\right),$$
which requires only the differences $e(\hat\theta(j), j) - e_0(j)$, and these are functions of the observations alone.
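A hedged sketch of that Monte Carlo estimate (the names are mine): given the simulated differences between EBLUP and BLUP errors, which depend only on the observations, and the BLUP mse $M(\theta_0)$,

```python
import numpy as np
from scipy.stats import norm

def error_cdf_estimate(tgrid, diffs, M0):
    # diffs[j] = e(theta_hat(j), j) - e_0(j), computable from the jth simulated
    # set of observations alone; M0 = M(theta_0), the mse of the BLUP.
    return np.array([norm.cdf((t - diffs) / np.sqrt(M0)).mean() for t in tgrid])
```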
$$D(\theta_0, \hat\theta; Z \mid Y) = E_0\log\frac{f(Z\mid Y;\theta_0)}{f(Z\mid Y;\hat\theta)}, \tag{35}$$
the Kullback divergence of the plug-in conditional density from the conditional density evaluated at $\theta_0$. Note that the right side of (35) is an expectation over both Y and Z. The main results of this section are two plausible approximations to $D(\theta_0, \hat\theta; Z \mid Y)$.
For random vectors W and X with joint distribution indexed by a parameter θ, let $\mathcal I(\theta; X)$ be the Fisher information matrix for θ when X is observed. Furthermore, define $S(\theta; Z \mid Y)$ to be the score function for θ based on the conditional density of Z given Y and $i(\theta; Z \mid Y)$ the corresponding conditional information. If
$$\frac{E_0\{e(Z;\hat\theta) - e(Z;\theta_0)\}^2}{M(\theta_0)} \text{ is small}, \tag{40}$$
then
$$D(\theta_0, \hat\theta; Z \mid Y) \approx \frac{1}{4M(\theta_0)^2}E_0\{M(\hat\theta) - M(\theta_0)\}^2 + \frac{1}{2M(\theta_0)}E_0\{e(Z;\hat\theta) - e(Z;\theta_0)\}^2.$$
Expanding the log conditional density in θ about $\theta_0$,
$$D(\theta_0, \hat\theta; Z \mid Y) \approx -E_0\left\{(\hat\theta - \theta_0)^TS(\theta_0; Z \mid Y)\right\} + \frac12E_0\left\{(\hat\theta - \theta_0)^Ti(\theta_0; Z \mid Y)(\hat\theta - \theta_0)\right\} + \frac12E_0\left[\left\{(\hat\theta - \theta_0)^TS(\theta_0; Z \mid Y)\right\}^2\right]. \tag{42}$$
Now $\hat\theta$ is a function of Y, so
$$E_0\left\{(\hat\theta - \theta_0)^TS(\theta_0; Z \mid Y)\right\} = 0 \tag{44}$$
(Exercise 42). Next, (Y, Z) Gaussian implies (45) for $E_0\{(\hat\theta - \theta_0)^Ti(\theta_0; Z \mid Y)(\hat\theta - \theta_0)\}$. Furthermore,
$$E_0\left\{\log\frac{M(\hat\theta)}{M(\theta_0)}\right\} \approx E_0\left\{\frac{M(\hat\theta) - M(\theta_0)}{M(\theta_0)}\right\} - \frac12E_0\left\{\frac{M(\hat\theta) - M(\theta_0)}{M(\theta_0)}\right\}^2. \tag{48}$$
Next, since $e(Z;\theta_0)$ is independent of Y and hence independent of $e(Z;\hat\theta) - e(Z;\theta_0)$ and $M(\hat\theta)$,
$$E_0\left\{\frac{M(\theta_0)}{M(\hat\theta)}\right\} \approx 1 - E_0\left\{\frac{M(\hat\theta) - M(\theta_0)}{M(\theta_0)}\right\} + E_0\left\{\frac{M(\hat\theta) - M(\theta_0)}{M(\theta_0)}\right\}^2, \tag{50}$$
and by (40),
Numerical results
Plausible Approximation 1 suggests that it would be instructive to compute $\Delta_I = \operatorname{tr}\{\mathcal I(\theta_0; Y)^{-1}\mathcal I(\theta_0; Z \mid Y)\}$ in various settings to learn about the effect of estimation on subsequent predictions. As I discussed in the previous subsection, I believe that $\Delta_I$ is at least qualitatively informative even in situations where not all components of θ are microergodic so that the argument for Plausible Approximation 1 does not apply. This subsection numerically examines the behavior of $\Delta_I$ for mean 0 stationary Gaussian processes on $\mathbb R$ under two particular Matern models.
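One plausible way to compute $\Delta_I$ numerically is via the Gaussian covariance-parameter information formula $\mathcal I_{ij} = \tfrac12\operatorname{tr}(\Sigma^{-1}\Sigma_i\Sigma^{-1}\Sigma_j)$. The sketch below is my own, not the book's: it takes Y to be the observations themselves with known mean (rather than contrasts), uses finite-difference derivatives, and assumes user-supplied covariance builders cov_obs (observations only) and cov_joint (observations plus predictand).

```python
import numpy as np

def fisher_cov(theta, cov_builder, eps=1e-6):
    # I_ij = 0.5 tr(S^{-1} S_i S^{-1} S_j) for a mean-zero Gaussian vector whose
    # covariance matrix S(theta) is returned by cov_builder; the derivatives
    # S_i are approximated by central differences
    theta = np.asarray(theta, dtype=float)
    Si = np.linalg.inv(cov_builder(theta))
    dS = []
    for i in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        dS.append((cov_builder(tp) - cov_builder(tm)) / (2 * eps))
    k = len(theta)
    info = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            info[i, j] = 0.5 * np.trace(Si @ dS[i] @ Si @ dS[j])
    return info

def delta_I(theta, cov_obs, cov_joint):
    # Delta_I = tr{ I(theta; Y)^{-1} I(theta; Z | Y) },
    # with I(theta; Z | Y) = I(theta; (Y, Z)) - I(theta; Y)
    IY = fisher_cov(theta, cov_obs)
    IYZ = fisher_cov(theta, cov_joint)
    return float(np.trace(np.linalg.solve(IY, IYZ - IY)))
```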
We consider interpolation and extrapolation problems in which the observations are, for the most part, evenly spaced with distance δ between neighboring observations and the predictand is a distance δ′ from the nearest observation, where δ′ is either δ or 0.5δ. Figure 7 shows some results when δ′ = δ. More specifically, suppose there are observations at δj for j = 1, …, 40 and 42, …, 81 and we wish to predict at 41δ (interpolation) or 0 (extrapolation). For various values of δ and the Matern models $(\phi, \nu, \alpha) = (1,1,1)$ or $(24,2,2)$, Figure 7 gives the values for $\Delta_I$. Several general trends emerge. First, $\Delta_I$ depends much more strongly on δ when interpolating than when extrapolating, with the result that $\Delta_I$ is smaller when interpolating for smaller δ but is generally larger for larger δ. Furthermore, the difference between the cases ν known and unknown is larger when interpolating. In particular, for smaller δ, when interpolating, $\Delta_I$ is quite near to 0.0125, which is what we would get if only φ were unknown. Thus, to the extent that Plausible Approximation 1 is relevant in this setting, the additional effect of uncertainty about α on interpolation is quite small when δ is small.
Figure 8 considers observations at δj for j = 1, …, 80, in which case, there is no way to have an interpolation problem in which the distance from the predictand to the nearest observation equals the distance between neighboring observations. Instead, I consider interpolating at 40.5δ and, to have a comparable extrapolation problem in which the predictand is 0.5δ from the nearest observation, predicting at 0.5δ. Thus, we now have δ′ = 0.5δ. Let us first consider ν unknown. In this case, when interpolating, $\Delta_I$ is in all instances considerably greater than when δ′ = δ. However, when extrapolating, $\Delta_I$ sometimes increases and sometimes decreases from the results with δ′ = δ. Indeed, we now have that when ν is unknown, $\Delta_I$ is always larger when interpolating than extrapolating in the cases examined here. When ν is known, the values of $\Delta_I$ are generally quite similar to those for δ′ = δ whether interpolating or extrapolating.
Considering the numerical results in 3.5 and the theorems in 3.6 on the effect of misspecifying the spectral density on interpolation and extrapolation, the results here showing that $\Delta_I$ is often larger when interpolating than extrapolating need some explanation. To review, 3.6 studied the effect of misspecifying the spectral density on predicting at 0 based on observations either at δj for all negative integers j (extrapolation) or at δj for all nonzero integers j (interpolation). As δ ↓ 0, results in 3.6 show that the effect of misspecifying the spectral density at either high or low frequencies on the actual mse of a pseudo-BLP is smaller when interpolating than when extrapolating (Theorems 3 and 5 in 3.6). Furthermore, the effect of misspecifying the spectral density at low frequencies on the assessment of mse of pseudo-BLPs is also smaller when interpolating (Theorem 6 in 3.6). However, because of the difficulty of comparing spectral densities with different high frequency behavior when evaluating mses, evaluating the effects of such misspecifications on the assessment of mses when interpolating and extrapolating is problematic, although Theorem 4 in 3.6 attempts to address this problem. Thus, when ν is unknown, so that the estimated high frequency behavior of the spectral density will be different from the actual high frequency behavior, we should not necessarily expect $\Delta_I$ to be smaller when interpolating than when extrapolating.
FIGURE 7. Values of $\Delta_I$ for the Matern model with $(\phi, \nu, \alpha) = (1,1,1)$ or $(24,2,2)$, plotted against the spacing δ (δ = 0.02 to 1). In the top figure, all parameters are considered unknown and in the bottom figure ν is considered known and φ and α unknown. The observations are at δj for j = 1, …, 40 and j = 42, …, 81. The predictands are at 41δ (interpolation) and 0 (extrapolation).
+ indicates $(\phi,\nu,\alpha) = (1,1,1)$ and the predictand at 41δ.
⊕ indicates $(\phi,\nu,\alpha) = (24,2,2)$ and the predictand at 41δ.
× indicates $(\phi,\nu,\alpha) = (1,1,1)$ and the predictand at 0.
⊗ indicates $(\phi,\nu,\alpha) = (24,2,2)$ and the predictand at 0.
If only φ is unknown, then $\Delta_I$ = 0.0125 in all cases.
FIGURE 8. Values of $\Delta_I$ for the Matern model with $(\phi, \nu, \alpha) = (1,1,1)$ or $(24,2,2)$, plotted against the spacing δ (δ = 0.02 to 1). In the top figure, all parameters are considered unknown and in the bottom figure ν is considered known and φ and α unknown. The observations are at δj for j = 1, …, 80. The predictands are at 40.5δ (interpolation) and 0.5δ (extrapolation).
+ indicates $(\phi,\nu,\alpha) = (1,1,1)$ and the predictand at 40.5δ.
⊕ indicates $(\phi,\nu,\alpha) = (24,2,2)$ and the predictand at 40.5δ.
× indicates $(\phi,\nu,\alpha) = (1,1,1)$ and the predictand at 0.5δ.
⊗ indicates $(\phi,\nu,\alpha) = (24,2,2)$ and the predictand at 0.5δ.
No results are given for $(\phi,\nu,\alpha) = (24,2,2)$ and δ = 0.02 due to numerical difficulties. For $(\phi,\nu,\alpha) = (24,2,2)$, δ = 1 and the predictand at 40.5δ, $\Delta_I$ = 0.6278, which is omitted from the top figure.
TABLE 3. Relative increases in mse due to using the Matern model with parameters ν = α = y instead of the correct values ν = α = 2. Observations and predictands are as in Figures 7 and 8 with δ = 0.2. Distance from predictand to nearest observation is denoted by δ′, so that δ′ = 0.2 for the setting in Figure 7 and δ′ = 0.1 for the setting in Figure 8.

          Interpolation                     Extrapolation
  y     δ′ = 0.2        δ′ = 0.1        δ′ = 0.2        δ′ = 0.1
 1.6    2.57 × 10⁻²     8.24 × 10⁻³     5.59 × 10⁻²     6.04 × 10⁻²
 1.7    1.32 × 10⁻²     4.10 × 10⁻³     3.00 × 10⁻²     3.18 × 10⁻²
 1.8    5.43 × 10⁻³     1.62 × 10⁻³     1.27 × 10⁻²     1.33 × 10⁻²
 1.9    1.26 × 10⁻³     3.65 × 10⁻⁴     3.06 × 10⁻³     3.14 × 10⁻³
 2      0               0               0               0
 2.1    1.10 × 10⁻³     3.00 × 10⁻⁴     2.84 × 10⁻³     2.84 × 10⁻³
 2.2    4.11 × 10⁻³     1.10 × 10⁻³     1.10 × 10⁻²     1.09 × 10⁻²
 2.3    8.72 × 10⁻³     2.27 × 10⁻³     2.40 × 10⁻²     2.34 × 10⁻²
 2.4    1.46 × 10⁻²     3.72 × 10⁻³     4.14 × 10⁻²     4.00 × 10⁻²
optimal point predictions, that is, replacing the supremum over $\mathcal H_n$ in Theorems 8, 10 and 12 in 4.3 by a supremum over all Z(x) for x ∈ R. However, I do not think uniform asymptotically correct assessment of mses is possible even if one restricts to point predictions. The problem is that for any fixed set of observations, as the predictand location tends towards one of the observations, even a tiny error in $\hat\nu$ will lead to unboundedly large relative errors in the assessment of mses. If there are measurement errors, then I believe it is possible to obtain uniformly asymptotically correct assessment of mses, since the problems in assessing mse for a predictand very near to an observation should no longer occur. The approach taken in Putter and Young (1998) may be helpful in solving these problems, but the results in that work are not nearly strong enough to provide any answers at present.
Exercises
38 Consider the setting of Exercise 6 in 2.4 and assume, for simplicity,
that the mean of Z is known to be 0. Argue that under any reasonable
choice for the estimated autocovariance function of the process, the
plug-in predictor has a smaller mse than the BLP.
39 Verify (36).
40 Show that (38) is an equality if the covariance matrix of (Y, Z) is known and its mean vector is linear in θ.
41 Suppose X and Y are random vectors whose joint distribution is in the parametric family $P_\theta$ for some $\theta \in \Theta$. Let $S(\theta; (X, Y))$ be the score function for θ based on the observation (X, Y). Under suitable regularity conditions, prove $E_{\theta_0}\{S(\theta_0; (X, Y)) \mid X\} = 0$ with probability 1. This result can be thought of as expressing a martingale property of the score function. Show that (43) follows.
42 Verify (44) and (45).
43 Verify (47).
44 Verify (49).
45 State and provide heuristic derivations of Plausible Approximations 1 and 2 appropriate for BLUPs, EBLUPs and REML estimates assuming θ contains just the unknown parameters for the covariance structure.
6.9 An instructive example of plug-in prediction

The example considered in this section demonstrates some of the problems that can occur when predicting smooth random fields.
Suppose Z is a stationary Gaussian process on $\mathbb R$ with mean 0 and autocovariance function $K(t) = e^{-0.4|t|}(1 + 0.4|t|)$, so that Z is exactly once mean square differentiable (see 2.7). We observe this process at the 20 locations −9.5, −8.5, …, 8.5, 9.5 and wish to predict it at −10, −9, …, 10 and at ±10.5 when the mean of Z and its autocovariance function are unknown.
Figure 9 plots the simulated values of the observations and indicates the lo-
cations of the predictands; the actual simulated values are given in Table 4.
We see that our predictions include both interpolations and extrapolations.
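A realization like the one in Figure 9 and Table 4 can be drawn as follows. This is a sketch I am adding; the seed and names are arbitrary, so it will not reproduce Table 4 exactly.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed
K = lambda t: np.exp(-0.4 * np.abs(t)) * (1 + 0.4 * np.abs(t))
obs = np.arange(-9.5, 10.0, 1.0)        # the 20 observation locations
pred = np.concatenate((np.arange(-10.0, 11.0, 1.0), [-10.5, 10.5]))
locs = np.concatenate((obs, pred))
Sigma = K(locs[:, None] - locs[None, :])
z = rng.multivariate_normal(np.zeros(len(locs)), Sigma)
z_obs, z_pred = z[:len(obs)], z[len(obs):]   # observations and held-out predictands
```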
The empirical semivariogram (see 2.9) is a staple of the kriging literature
as a tool for selecting models for semivariograms and estimating the pa-
rameters in these models. Figure 10 plots the empirical semivariogram up
to distance 10 for the simulated realization in Figure 9. Distances greater
than 10 are not plotted because of the severe lack of reliability of em-
pirical semivariograms at distances more than half the dimensions of the
observation region, which corresponds to common geostatistical practice
(Journel and Huijbregts 1978, p. 194). Figure 10 also plots the actual semi-
variogram. It is critical to note the rather large and apparently systematic
differences between the actual and empirical semivariograms at the shorter
distances. Far from being unusual, this phenomenon should be expected
in light of the strong correlations that exist in the empirical semivari-
ogram at different distances. For example, using $\hat\gamma$ to indicate the empirical semivariogram, $\operatorname{corr}\{\hat\gamma(1), \hat\gamma(2)\} = 0.981$, $\operatorname{corr}\{\hat\gamma(1), \hat\gamma(3)\} = 0.938$ and $\operatorname{corr}\{\hat\gamma(2), \hat\gamma(3)\} = 0.880$ (Exercise 46). Thus, the empirical semivariogram
can appear quite regular and still be substantially in error. Of course, the
fact that the empirical semivariogram has correlated values is well known
(Cressie 1985, 1993), but I believe that the consequences of these poten-
tially large correlations are not generally sufficiently appreciated. If one had a regression problem in which the observations were equal to the underlying regression function plus independent errors, and the observations appeared as smooth as the empirical semivariogram in Figure 10, then it would be sound to conclude that the regression function could be well estimated. It is a difficult psy-
chological adjustment to look at Figure 10 and recognize that the strong
correlations present can easily yield a smooth empirical semivariogram so
different from the actual semivariogram.
Considering the apparent quadratic behavior at the origin of the empirical semivariogram in Figure 10, it would now be within the realm of accepted present practice in spatial statistics to fit a Gaussian semivariogram, $\gamma(t; \phi, \alpha) = \phi(1 - e^{-\alpha t^2})$, to the data. Although Goovaerts (1997)
recommends never using the Gaussian semivariogram without a measure-
ment error term, note that there is no measurement error term here and
the true semivariogram is quadratic near the origin. Since, as noted in 3.5,
software in spatial statistics commonly includes the Gaussian as the only
semivariogram model that is quadratic near the origin, it would be hard
FIGURE 9. Simulated realization of Gaussian process with mean 0 and autocovariance function $K(t) = e^{-0.4|t|}(1 + 0.4|t|)$. The ×s on the horizontal axis are the locations of predictands and the vertical bars are the locations of the observations.
TABLE 4. Simulated values of process pictured in Figure 9. The last three rows
are the additional observations used towards the end of this section.
t Z(t) t Z(t)
-9.5 2.3956811 0.5 0.4109609
-8.5 2.2767195 1.5 0.4647669
-7.5 1.9736058 2.5 1.2113779
-6.5 1.8261141 3.5 1.9055446
-5.5 1.3136954 4.5 2.1154852
-4.5 0.2550507 5.5 1.7372076
-3.5 -0.0741740 6.5 0.8333657
-2.5 0.2983559 7.5 0.2932142
-1.5 0.4023333 8.5 -0.1024508
-0.5 0.4814850 9.5 0.0926624
-0.25 0.4267716
0.0 0.4271087
0.25 0.4461579
to fault the practitioner who adopted the Gaussian model in this case. Of course, the Gaussian model is severely in error in the sense that it implies Z has analytic realizations when in fact the process has only one derivative. Nevertheless, it is instructive to compare plug-in predictors based on the REML estimate and an "eyeball" estimate that fits the empirical semivariogram very well at the shorter distances.
If we suppose Z is Gaussian with mean μ and semivariogram of the form $\phi(1 - e^{-\alpha t^2})$ with $(\mu, \phi, \alpha)$ unknown, the REML estimate of $\theta = (\phi, \alpha)$ is $\hat\theta = (0.667, 0.247)$. Figure 11 replots the empirical semivariogram together with $\gamma(t; \hat\theta)$. The REML estimate yields a fitted semivariogram with slightly larger curvature near the origin than the empirical semivariogram. Figure 11 also plots $\gamma(t; \tilde\theta)$ for my eyeball estimate $\tilde\theta = (1, 0.12)$; this eyeball estimate tracks the empirical semivariogram closely at the shorter distances.
FIGURE 10. Empirical and actual semivariograms for data shown in Figure 9. Smooth curve is the actual semivariogram and +s are the empirical semivariogram.
FIGURE 11. Empirical and estimated semivariograms for data shown in Figure
9. Using the Gaussian model for the semivariogram, solid line indicates REML
estimate and dashed line indicates eyeball estimate.
Figure 12 shows the BLUPs under the true model as well as the plug-in predictors or EBLUPs using the Gaussian model and $\hat\theta$ or $\tilde\theta$ to estimate θ. Near the middle of the observation region, both EBLUPs are almost identical to the BLUPs. As one gets nearer to the edges of the observation region and, particularly, outside the observation region, the EBLUPs, especially using the eyeball estimate $\tilde\theta$, can be substantially different from the BLUPs. Define $e_0(t)$ as the error of the BLUP of Z(t) and $e(t; \hat\theta)$ and $e(t; \tilde\theta)$ as the errors of the EBLUPs of Z(t). Take $E_0$ to mean expectation under the true model for Z and let y be the observed value of the contrasts Y of the observations. Define $M(t; \theta)$ as in (30) with t taking the place of $x_0$, so that, for example, $M(t; \hat\theta)$ is the plug-in estimate of the mse when predicting Z(t) using the REML estimate $\hat\theta$. Finally, take
$$C(t; \theta, y) = E_0\{e(t; \theta)^2 \mid Y = y\}, \tag{52}$$
the mse of the prediction of Z(t) conditional on the observed contrasts.
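If (52) is read as the mse conditional on the contrasts, it decomposes conveniently: the BLUP is the conditional mean of Z(t) given the contrasts, and the difference between any two linear unbiased predictors is itself a contrast, so $C(t;\theta,y) = M(t;\theta_0) + \{\text{BLUP} - \text{EBLUP}\}^2$ evaluated at the data. The sketch below relies on that decomposition; it and all names in it are my own, offered under that reading.

```python
import numpy as np

def uk_weights(Sig, X, sig0, m0):
    # universal kriging weights: Sig = cov of observations, sig0 = cov(obs, Z(t)),
    # X = regression matrix of the observations, m0 = regression vector at t
    Si = np.linalg.inv(Sig)
    A = X.T @ Si @ X
    c = m0 - X.T @ Si @ sig0
    return Si @ (sig0 + X @ np.linalg.solve(A, c))

def conditional_mse(z, X, m0, Sig0, sig00, K00, SigF, sig0F):
    # C(t; theta, y) = M(t; theta_0) + {BLUP(z) - EBLUP(z)}^2; quantities with
    # suffix 0 use the true covariance, suffix F the fitted one
    lam0 = uk_weights(Sig0, X, sig00, m0)
    lamF = uk_weights(SigF, X, sig0F, m0)
    Si = np.linalg.inv(Sig0)
    A = X.T @ Si @ X
    c = m0 - X.T @ Si @ sig00
    M0 = K00 - sig00 @ Si @ sig00 + c @ np.linalg.solve(A, c)
    return M0 + float(lam0 @ z - lamF @ z) ** 2
```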
Cross-validation
One method that is sometimes suggested for diagnosing misfits of semivariograms is cross-validation (Cressie 1993, pp. 101-104). Specifically, suppose we have observations $Z(x_1), \ldots, Z(x_n)$ and plan to predict at other locations using an EBLUP where the mean of Z is taken to be an unknown constant.
FIGURE 12. BLUPs and EBLUPs for simulation in Figure 9. The •s are observed values, ∘s are BLUPs, +s are EBLUPs using the REML estimate $\hat\theta$ and ×s are EBLUPs using the eyeball estimate $\tilde\theta$.
TABLE 5. Performance of two EBLUPs when using the Gaussian model for the semivariogram. The two parameter estimates are the REML $\hat\theta$ and an eyeball estimate $\tilde\theta$ (see Figure 11). $M(t; \hat\theta)$ is, for example, the plug-in value for the mse of the EBLUP based on $\hat\theta$ (see (31)) and $C(t; \tilde\theta, y)$ is the conditional mse of the EBLUP based on $\tilde\theta$ (see (52)). All figures other than in the first column are 1,000 times their actual values, so that, for example, $M(10.5; \theta_0)$ equals 0.0766. Rows for positive t show only the conditional mses; by the symmetry of the design, the M columns for t and −t are identical.

   t      M(t;θ₀)   C(t;θ̂,y)   M(t;θ̂)    C(t;θ̃,y)   M(t;θ̃)
 -10.5    76.6      147         91.2       259         5.86
  10.5              436                    9067
 -10      18.2      28.1        12.2       38.9        0.414
  10                97.2                   1071
  -9      3.31      3.79        0.447      4.00        4.53 × 10⁻³
   9                9.25                   35.3
  -8      2.78      2.81        0.106      2.82        3.75 × 10⁻⁴
   8                4.10                   7.98
  -7      2.74      2.80        0.0489     2.80        7.21 × 10⁻⁵
   7                3.00                   3.90
  -6      2.74      2.85        0.0315     2.86        2.27 × 10⁻⁵
   6                2.75                   2.80
  -5      2.74      2.81        0.0244     2.80        9.91 × 10⁻⁶
   5                2.98                   2.81
  -4      2.74      3.38        0.0210     3.36        5.48 × 10⁻⁶
   4                2.98                   2.85
  -3      2.74      2.96        0.0192     2.94        3.61 × 10⁻⁶
   3                3.03                   2.91
  -2      2.74      2.75        0.0182     2.75        2.74 × 10⁻⁶
   2                2.82                   2.78
  -1      2.74      2.88        0.0177     2.91        2.35 × 10⁻⁶
   1                2.83                   2.87
   0      2.74      2.99        0.0176     3.05        2.23 × 10⁻⁶
TABLE 6. Performance of two EBLUPs when using the Matern model for the semivariogram. The two parameter estimates are $\hat\theta$, the REML, and $\hat\theta(15)$, obtained by arbitrarily setting the parameter ν to 15 and then maximizing the likelihood of the contrasts over the other parameters. As in Table 5, all figures other than in the first column are 1,000 times the actual figures, and rows for positive t show only the conditional mses.

   t      M(t;θ₀)   C(t;θ̂,y)   M(t;θ̂)   C(t;θ̂(15),y)   M(t;θ̂(15))
 -10.5    76.6      102         94.7      135             92.4
  10.5              145                   316
 -10      18.2      22.1        16.0      26.6            13.2
  10                33.6                  71.5
  -9      3.31      3.56        1.20      3.73            0.584
   9                4.63                  7.52
  -8      2.78      2.79        0.607     2.81            0.173
   8                3.01                  3.68
  -7      2.74      2.82        0.505     2.81            0.0986
   7                2.77                  2.89
  -6      2.74      2.89        0.481     2.87            0.0754
   6                2.78                  2.77
  -5      2.74      2.77        0.475     2.80            0.0662
   5                2.94                  3.03
  -4      2.74      3.18        0.474     3.35            0.0621
   4                2.88                  3.00
  -3      2.74      2.87        0.473     2.94            0.0601
   3                2.91                  3.04
  -2      2.74      2.75        0.473     2.75            0.0593
   2                2.79                  2.82
  -1      2.74      2.85        0.473     2.88            0.0589
   1                2.82                  2.83
   0      2.74      2.94        0.473     3.00            0.0587
observation region, the plug-in mses are quite similar whether one uses the Gaussian model and REML or the Matern model with ν = 15 and REML. However, when interpolating near the middle of the observation region, the Gaussian model gives plug-in mses less than 1/3 as large as the Matern model with ν = 15.
The large uncertainty about ν combined with its critical impact on assessment of mses of interpolants implies that it is essentially impossible to
FIGURE 13. Profile log likelihood of the contrasts for ν using the 20 simulated observations. The • indicates $(\hat\nu, pl(\hat\nu))$.
obtain defensible mses from these data without strong prior information about ν. The fact that the plot of pl(ν) alerts us to this situation is a great strength of using the Matern model together with likelihood methods. Note, though, that it is essential to study the likelihood function and not just find its maximum.
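For readers who want to reproduce this kind of plot, here is a sketch of computing pl(ν) under the Matern model with an unknown constant mean. The parameterization, function names and optimizer settings are my choices, not the book's.

```python
import numpy as np
from scipy.special import kv, gammaln
from scipy.optimize import minimize

def matern_cov(r, sig2, rho, nu):
    # Matern covariance: sig2 * 2^{1-nu}/Gamma(nu) * a^nu * K_nu(a), a = sqrt(2 nu) r/rho
    r = np.abs(r)
    out = np.full(r.shape, sig2)
    m = r > 0
    a = np.sqrt(2 * nu) * r[m] / rho
    out[m] = sig2 * np.exp((1 - nu) * np.log(2) - gammaln(nu) + nu * np.log(a)) * kv(nu, a)
    return out

def reml_loglik(params, nu, t, z):
    # log likelihood of the contrasts (REML), up to an additive constant
    sig2, rho = np.exp(params)                 # optimize on the log scale
    S = matern_cov(t[:, None] - t[None, :], sig2, rho, nu)
    X = np.ones((len(t), 1))                   # unknown constant mean
    Si = np.linalg.inv(S)
    A = X.T @ Si @ X
    beta = np.linalg.solve(A, X.T @ Si @ z)
    resid = z - X @ beta
    _, ld1 = np.linalg.slogdet(S)
    _, ld2 = np.linalg.slogdet(A)
    return -0.5 * (ld1 + ld2 + resid @ Si @ resid)

def profile_loglik(nu, t, z):
    # pl(nu): maximize the REML log likelihood over (sig2, rho) at fixed nu
    obj = lambda p: -reml_loglik(p, nu, t, z)
    return -minimize(obj, x0=np.zeros(2), method="Nelder-Mead").fun
```

Evaluating profile_loglik on a grid of ν values and plotting gives a curve like Figure 13.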
Considering the numerical results in 6.6, the fact that the likelihood function provides so little information about ν is not surprising. As those results show, evenly spaced observations make estimation of ν particularly difficult. I simulated three additional observations at −0.25, 0 and 0.25 (see Table 4) and for these 23 observations, recomputed the profile log likelihood of the contrasts (see Figure 14). We now obtain the much better estimate for ν of 1.796 and, in addition, have strong evidence against large values for ν. In particular, $\exp\{pl(1.796) - pl(\infty)\} = 1.27 \times 10^5$. Table 7 shows that the plug-in predictors now perform well for both interpolations and extrapolations and the plug-in estimates of mse are all reasonable. Table 7 also shows properties of the plug-in predictors if the Gaussian model is fit to these data using REML. Figure 15 plots the estimated semivariograms under both the Matern and Gaussian models. Note that the estimated Gaussian model has far greater curvature near the origin after the additional three points are included in the analysis. I leave it to the reader to explain why the plug-in estimates of mse under the Gaussian model are now far too conservative for the predictions at ±4, …, ±10 but are still badly overoptimistic at ±1 (Exercise 48).
Conclusions
The reader could argue that I have only shown the detailed results of
one simulation, which could be misleading. This is possible, although the
simulation shown was the first one I ran. To the reader who doubts the
FIGURE 14. Profile log likelihood of the contrasts for ν using the original 20 simulated observations plus the three additional observations. The • indicates $(\hat\nu, pl(\hat\nu))$.
FIGURE 15. True and estimated semivariograms using 3 additional observations.
The solid line indicates the truth, the dashed line the REML estimate under the
Matern model and the dotted line the REML estimate under the Gaussian model.
observations that have smaller spacing than the rest of the observations can
dramatically improve the estimation of the semivariogram.
Exercises
46 If the random vector Z of length n has distribution N(O, E), show
that for fixed, symmetric n x n matrices A and B, cov(ZT AZ,
ZTBZ) = 2tr(AEBE) (see Appendix A for relevant results on mul-
tivariate normal distributions). Using this result, write a computer
program to calculate the correlations for the empirical semivariogram
i' at distances 1,2, ... , 10 for n observations evenly spaced 1 unit apart.
Use your program to calculate these correlations when n = 20,40 and
60 and (i) K(t) = e-o.4Itl (1 + 0.4Itl), (ii) K(t) = e-o.2Itl (1 + 0.21tl) and
(iii) K(t) = e-O.75t2. Comment on the results. See Genton (1998) for
further examples of the correlations in empirical semivariograms.
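A minimal sketch of such a program (my own). It writes $\hat\gamma(h) = Z^TA_hZ$ with $A_h$ the appropriate quadratic-form matrix and applies the covariance identity above, which requires Z mean-zero Gaussian; the increments in $\hat\gamma$ remove a constant mean in any case.

```python
import numpy as np

def semivariogram_corr(n, K, hs=range(1, 11)):
    # correlation matrix of the empirical semivariogram at lags hs for n
    # observations at 0, 1, ..., n-1 with autocovariance function K
    t = np.arange(n)
    Sigma = K(np.abs(t[:, None] - t[None, :]).astype(float))
    def A(h):
        # gamma_hat(h) = Z' A_h Z with A_h = sum of difference outer products
        M = np.zeros((n, n))
        for i in range(n - h):
            d = np.zeros(n)
            d[i + h], d[i] = 1.0, -1.0
            M += np.outer(d, d)
        return M / (2.0 * (n - h))
    As = {h: A(h) for h in hs}
    cov_q = lambda Ah, Ak: 2.0 * np.trace(Ah @ Sigma @ Ak @ Sigma)
    C = np.array([[cov_q(As[h], As[k]) for k in hs] for h in hs])
    s = np.sqrt(np.diag(C))
    return C / np.outer(s, s)

K1 = lambda t: np.exp(-0.4 * t) * (1 + 0.4 * t)
R = semivariogram_corr(20, K1)
print(R[0, 1], R[0, 2], R[1, 2])   # compare with 0.981, 0.938, 0.880 in the text
```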
47 Suppose we model Z as a Gaussian random field with isotropic autocovariance function from the Matern model and mean function known except for a vector of linear regression coefficients. We observe Z at some finite set of locations and let pl(ν) be the profile log likelihood of the contrasts as a function of ν. Show that
$$\lim_{\nu\to\infty}pl(\nu) = \ell(\hat\xi_1, \hat\xi_2),$$
where $\ell(\xi_1, \xi_2)$ is the log likelihood of the contrasts under the model $K(r) = \xi_1e^{-\xi_2r^2}$ for the isotropic autocovariance function and $(\hat\xi_1, \hat\xi_2)$ is the REML estimate for $(\xi_1, \xi_2)$.
48 Tell a story explaining the last two columns of Table 7.
with p(θ) the prior density for θ. Although some scientists and statisti-
cians are uncomfortable with basing inferences on what is necessarily a
$$c(\nu, \rho) = \frac{\Gamma(\nu + \frac{d}{2})(4\nu)^\nu}{\pi^{d/2}\Gamma(\nu)\rho^{2\nu}}$$
predictive distribution for $Z(x_0)$ is then also not integrable. The same problem would occur for any improper prior of the form $p(\sigma, \nu, \rho) = p(\sigma, \rho)p(\nu)$ with p(ν) not integrable at ∞. A similar but more subtle problem can occur when using a prior whose marginal density for ρ is not integrable at ∞ (see Exercise 50).
For the random field $Z(x) = m(x)^T\beta + \varepsilon(x)$, where ε is a mean 0 isotropic Gaussian random field with spectral density $g_\theta$ as given by (56) and β is a vector of length q of unknown regression coefficients, Handcock and Wallis (1994) suggest the prior density
$$p(\beta, \eta) = \frac{1}{\sigma(1+\rho)^2(1+\nu)^2} \tag{57}$$
on $\mathbb R^q \times (0,\infty)^3$ (Exercise 51). Thus, for every fixed β and σ, p(β, η) is an integrable function of ρ and ν. Exercise 52 asks you to show that if there are at least q + 1 observations, the posterior density for β and η obtained using (57) as the prior is proper.
In using the prior (57), one should bear in mind that, unlike ν, ρ is not dimensionless and has units of distance. Therefore, the meaning of the marginal prior density $p(\rho) = (1+\rho)^{-2}$ depends on the units used to measure distance. If we want our results to be the same whether we measure distances in meters or kilometers, we should normalize distances in some manner. One possible normalization is to set the distance between the two most distant observations to 1. In conjunction with (57), this normalization provides an "automatic" prior that could be employed by users who are either ill-equipped for or uninterested in developing a prior that reflects their knowledge of the random field under study.
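As a concrete reading of this normalization (a sketch; the names are mine):

```python
import numpy as np

def normalized_locations(x):
    # scale coordinates so the two most distant observations are distance 1 apart
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        d = np.abs(x[:, None] - x[None, :])
    else:
        d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return x / d.max()

def log_prior57(sig, rho, nu):
    # log of (57) up to an additive constant: p is proportional to
    # 1/{sig (1 + rho)^2 (1 + nu)^2}, with rho measured in normalized units
    return -np.log(sig) - 2.0 * np.log1p(rho) - 2.0 * np.log1p(nu)
```

Adding log_prior57 to a (restricted) log likelihood evaluated at the normalized locations gives a log posterior, up to a constant, that does not depend on the original units of distance.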
Exercises
49 Consider the random field $Z(x) = m(x)^T\beta + \varepsilon(x)$, where ε is a mean 0 Gaussian random field with covariance function from some model $K_\theta$. For a vector of observations $Z = (Z(x_1), \ldots, Z(x_n))^T$ and the improper prior density $p(\beta, \theta) = 1$, show that (6) gives the logarithm of the marginal posterior density for θ (Harville 1974).
50 This problem gives a simple example of how one can end up with an improper posterior density by using an improper marginal prior density on the parameter ρ in (56). Suppose Z is a stationary Gaussian process with spectral density
$$g_{\sigma,\rho}(\omega) = \frac{\sigma}{\pi\rho(\rho^{-2} + \omega^2)}.$$
where for i = 1, 2, $\mu_i$ has length $q_i$ and for i, j = 1, 2, $\Sigma_{ij}$ is a $q_i \times q_j$ matrix. Then the conditional distribution of $X_1$ given $X_2 = x_2$ is $N(\mu_1 + \Sigma_{12}\Sigma_{22}^-(x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^-\Sigma_{21})$, where $\Sigma_{22}^-$ is any generalized inverse of $\Sigma_{22}$ (for an invertible matrix, the generalized inverse is unique and equals the ordinary inverse).
Finally, if $X = (X_1, \ldots, X_n)^T$ is $N(0, \Sigma)$ and $\sigma_{ij}$ is the ijth element of Σ, then for $i, j, k, \ell = 1, \ldots, n$, $E(X_iX_jX_k) = 0$ and $E(X_iX_jX_kX_\ell) = \sigma_{ij}\sigma_{k\ell} + \sigma_{ik}\sigma_{j\ell} + \sigma_{i\ell}\sigma_{jk}$.
Appendix B
Symbols
Sets
$A^c$        the complement of A
$A\backslash B$        for B ⊂ A, those elements in A that are not in B; equals $A \cap B^c$
$A\triangle B$        the symmetric difference of A and B; equals $(A \cup B)\backslash(A \cap B)$
$\mathbb R^d$        d-dimensional Euclidean space
$\mathbb Z^d$        d-dimensional integer lattice
$\times_{i=1}^d A_i$        for subsets $A_1, \ldots, A_d$ of $\mathbb R$, the subset of $\mathbb R^d$ whose elements have ith component in $A_i$ for i = 1, …, d; also written $A_1 \times \cdots \times A_d$
Relationships
∼        for functions f and g on some set R, write f(t) ∼ g(t) as t → t₀ if f(t)/g(t) → 1 as t → t₀
≪        for real-valued functions f and g on R, write f(t) ≪ g(t) if there exists C finite such that |f(t)| ≤ Cg(t) for all t ∈ R; same as f(t) = O(g(t))
≍        for nonnegative functions f and g on R, write f(t) ≍ g(t) if f(t) ≪ g(t) and g(t) ≪ f(t); write f(t) ≍ g(t) as t → t₀ if, given any sequence t₁, t₂, … such that tᵢ → t₀, there exists N finite such that f(tᵢ) ≍ g(tᵢ) for all i > N
⊥        orthogonal; can refer either to the orthogonality of two elements in a Hilbert space (have inner product 0) or to the orthogonality of two probability measures
≡        equivalence for probability measures
Abbreviations
BLP best linear predictor
BLUP best linear unbiased predictor
EBLUP estimated best linear unbiased predictor
IRF intrinsic random function
LUP linear unbiased predictor
MLE maximum likelihood estimator
mse mean squared error
p.d. positive definite
REML restricted maximum likelihood
Miscellaneous
Σ′        for j ∈ $\mathbb Z^d$, the sum over all elements of $\mathbb Z^d$ except the origin
References