Factor Models
Seton Leonard
Preliminary Draft
© Seton Leonard 2018
Introduction
This short book is primarily concerned with estimating factor models and is accompanied by
R and C++/Armadillo code to implement the routines described herein. These estimation
routines are: (1) maximum likelihood estimation of factor models following Watson and
Engle (1983) and (2) Bayesian estimation of factor models by simulation. All routines
are designed to accept noisy and/or missing data; extracting meaningful information from
noisy data is, after all, the primary purpose of the filtering and smoothing algorithms upon
which factor models are based. Dedicating a book to dynamic factor models may sound
narrow, so it is worth noting at the outset that these models encompass a wide array of time
series tools, and every model can be adapted to backcasting (predicting past information
that was not observed), nowcasting (predicting contemporaneous unobserved information),
and forecasting (predicting future realizations).
In its simplest form a dynamic factor model is described by two equations: a measurement equation yt = Hxt + εt, written in stacked form as

(1) yt = H̃zt + εt

and a transition equation

(2) zt = Azt−1 + et
In the above zt is a vector of stacked factors such that zt = [x′t x′t−1 . . . x′t−p+1]′ and H̃ = HJ where J is a helper matrix. A is the companion form of the more familiar VAR model
(3) xt = B [x′t−1 x′t−2 . . . x′t−p]′ + et
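As a concrete sketch of the companion form, the following builds A from the VAR coefficient matrices B1, . . . , Bp (Python rather than the book's R/C++ code; the function name is mine):

```python
import numpy as np

def companion(B_list):
    """Stack VAR(p) coefficient matrices B_1, ..., B_p (each m x m) into the
    companion matrix A used in the transition equation (2)."""
    m, p = B_list[0].shape[0], len(B_list)
    A = np.zeros((m * p, m * p))
    A[:m, :] = np.hstack(B_list)          # first block row: [B_1 ... B_p]
    A[m:, :-m] = np.eye(m * (p - 1))      # identity blocks shift the lags down
    return A
```

For p = 1 this simply returns B1; for p > 1 the lower identity blocks carry xt−1, . . . , xt−p+1 forward one period.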
Thus equipped we could, for example, write down a VAR model for noisy and/or missing data. Taking a simple three variable, three lag example we would have the observation equation

yt = H J zt + εt

with yt = [y1,t y2,t y3,t]′, H = I3, J = [I3 03 03], H̃ = HJ, and zt = [x′t x′t−1 x′t−2]′.
yt − a1 yt−1 = εt + θεt−1
yt = xt
though the state space form of any model is never unique. As an aside, this formulation
of the ARMA(1,1) would make it easy to allow for measurement error in yt by simply
including an error term in the observation equation, that is, letting yt = xt + εt .
¹ Note that I am a bit loose with the dimensionality of et. In the case where we're using the companion form of the transition equation et just has a bunch of zeros (m × (p − 1) of them) under the first m elements.
This book begins with some basic Bayesian statistics in chapter 1 that form the build-
ing blocks of our state space framework. Chapter 2 then introduces basic Kalman filtering
and smoothing, followed in chapter 3 by Durbin and Koopman (2012)'s much more computationally efficient disturbance smoothing. Chapters 4 onwards deal with estimating state space models. Chapter 4 briefly discusses estimation using principal components, though
other approaches (maximum likelihood or Bayesian simulation) tend to yield better results. Chapter 5 presents Watson and Engle (1983)'s maximum likelihood approach (I do not cover simply plugging the likelihood function into a numerical optimization routine; it is easy to do but very slow compared with Watson and Engle (1983)'s EM algorithm). Chapter 6 presents Bayesian estimation by simulation for uniform frequency
data. Finally, chapter 7 introduces mixed frequency models.
All of the notation in this book matches that in my R and C++ code so that hopefully
the code is easy to follow. This text is intentionally short and to the point, geared towards
implementation of state space models as opposed to derivations or theory. To get a better handle on, and more depth in, factor models, Bayesian filtering and smoothing, and time series analysis more generally, Sarkka (2013), Durbin and Koopman (2012), Hamilton (1994), Lutkepohl (2007), and Koop et al. (2007) are good resources. As is no doubt already
obvious I assume the reader is familiar with matrix algebra. I also assume some familiarity
with statistics, econometrics, and calculus.
Chapter 1

The Necessary Bayesian Statistics
Bayesian econometrics is founded on Bayes’ theorem, a simple equation which states that
the probability of an event A conditional on B equals the probability of B conditional on
A times the unconditional probability of A over the unconditional probability of B, that
is,
(1.1) P(A|B) = P(B|A)P(A) / P(B)
This result can be easily derived from the definition of conditional probability, which states
that the probability of A conditional on B equals the probability of A and B occurring
together over the probability of B, that is
(1.2) P(A|B) = P(A ∩ B) / P(B)

The same holds for densities:

(1.3) f(y|x) = f(x, y) / f(x)
That is, the distribution of y conditional on x equals the joint distribution of x and y over
the distribution of x. The derivation of Bayes’ theorem, which we use in estimating the
P(A = 1|B = 1) = P(B = 1|A = 1)P(A = 1) / P(B = 1)
              = 0.01 / 0.109
              = 0.0917
1.1 Basic Examples
P(A = 1|B = 1, 1) = P(B = 1, 1|A = 1)P(A = 1) / P(B = 1, 1)
                  = 0.01 / (0.1 × 0.1 × 0.99 + 1 × 1 × 0.01)
                  = 0.5025
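The arithmetic can be checked in a few lines. The values below are implied by the denominators above: prior P(A = 1) = 0.01, P(B = 1|A = 1) = 1, and P(B = 1|A = 0) = 0.1 (a sketch; the variable names are mine):

```python
# Numerical check of the two Bayes' theorem calculations above.
p_A = 0.01          # prior probability that A = 1
p_B_given_A1 = 1.0  # P(B = 1 | A = 1)
p_B_given_A0 = 0.1  # P(B = 1 | A = 0)

# One positive signal
posterior_1 = (p_B_given_A1 * p_A) / (
    p_B_given_A1 * p_A + p_B_given_A0 * (1 - p_A))
# Two independent positive signals
posterior_2 = (p_B_given_A1 ** 2 * p_A) / (
    p_B_given_A1 ** 2 * p_A + p_B_given_A0 ** 2 * (1 - p_A))

print(round(posterior_1, 4))  # 0.0917
print(round(posterior_2, 4))  # 0.5025
```

Note how a second positive signal moves the posterior from under ten percent to over one half: the prior is quickly overwhelmed by data.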
The numerator (the conditional distribution times our prior) is, for θ = 0 (recall θ can only take values 0 or 1)

fY(0|θ = 0)f(θ) = (1/√(2π)) (1/2)

thus for θ = 0

fθ(0|Y = 0) = [(1/√(2π))(1/2)] / [(1/√(2π))(1/2) + (1/√(2π)) e^(−1/2) (1/2)] = 1 / (1 + e^(−1/2))

For θ = 1

fθ(1|Y = 0) = [(1/√(2π)) e^(−1/2) (1/2)] / [(1/√(2π))(1/2) + (1/√(2π)) e^(−1/2) (1/2)] = e^(−1/2) / (1 + e^(−1/2))
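Numerically, these two posterior probabilities can be verified directly (a sketch; names are mine):

```python
import math

# Posterior for the two-point prior example: θ ∈ {0, 1} with prior 1/2 each,
# one observation Y = 0 drawn from N(θ, 1).
def norm_pdf(y, mean):
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2 * math.pi)

num0 = norm_pdf(0.0, 0.0) * 0.5   # numerator for θ = 0
num1 = norm_pdf(0.0, 1.0) * 0.5   # numerator for θ = 1
post0 = num0 / (num0 + num1)      # = 1 / (1 + e^(-1/2))
post1 = num1 / (num0 + num1)      # = e^(-1/2) / (1 + e^(-1/2))

print(round(post0, 4), round(post1, 4))  # 0.6225 0.3775
```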
where θ0 is our prior mean and λ⁻¹ our prior standard deviation. Suppose also that we observe a vector X that follows a normal distribution

fX(X|θ) = (2πσ²)^(−N/2) exp{ −(1/(2σ²)) Σi (xi − θ)² }
In the second term above the only parts we care about are those which relate to the parameter θ; the rest goes into the constant of integration which ensures ∫f(θ|X)dθ = 1, thus we can simply write the above term as

(1.5) fθ(θ|X) ∝ exp{ −(1/2) [ λ²(θ² − 2θθ0) + (1/σ²)(−2Σi xi θ + Nθ²) ] }
            = exp{ −(1/(2σ²)) [ (λσ)²(θ² − 2θθ0) − 2Σi xi θ + Nθ² ] }
The trick now is to try and re-write equation 1.5 as a normal distribution. The easiest way to proceed is to guess and check (assuming, of course, that we can make a good guess of the posterior). To this end suppose

(1.6) fθ(θ|X) ∝ exp{ −[(N + (λσ)²)/(2σ²)] [θ − (N + (λσ)²)⁻¹(Σi xi + (λσ)²θ0)]² }
Multiplying out the square in (1.6), the term involving (Σi xi + (λσ)²θ0)²/(N + (λσ)²) does not contain our parameter of interest θ, so we can dump it into the constant of integration and keep only the relevant part of the above term,

fθ(θ|X) ∝ exp{ −(1/(2σ²)) [ (N + (λσ)²)θ² − 2θ(Σi xi + (λσ)²θ0) ] }
        = exp{ −(1/(2σ²)) [ (λσ)²(θ² − 2θθ0) − 2Σi xi θ + Nθ² ] }

which matches (1.5), confirming our guess.
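The posterior mean and variance read off equation (1.6) can be sketched numerically (Python; the data and variable names here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal-normal updating as in (1.6): prior θ ~ N(θ0, 1/λ²), data x_i ~ N(θ, σ²).
theta0, lam = 1.0, 2.0       # prior mean and λ (prior sd = 1/λ)
sigma = 1.5                  # known standard deviation of the data
x = rng.normal(0.5, sigma, size=200)
N = x.size

# Posterior mean and variance from equation (1.6)
post_mean = (x.sum() + (lam * sigma) ** 2 * theta0) / (N + (lam * sigma) ** 2)
post_var = sigma ** 2 / (N + (lam * sigma) ** 2)

print(post_mean, x.mean(), post_var)
```

With N large the posterior mean is pulled toward the sample mean and away from the prior; the posterior variance shrinks at rate 1/N.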
1.2 Bayesian Linear Regression

Suppose the linear regression model

(1.7) y = Xβ + ε

or, stacking multivariate observations,

(1.8) Y = XB′ + ε

With the normal prior π(β|σ) ∝ exp{ −(1/(2σ))(β − β0)′Λ0(β − β0) }, the exponential term of the posterior is proportional to

exp{ −(1/(2σ)) [ β′X′Xβ + β′Λ0β − 2β′X′y − 2β′Λ0β0 ] }

These terms match those in equation (1.9), thus our posterior for β is distributed

(1.10) β ∼ N( (X′X + Λ0)⁻¹(X′y + Λ0β0), σ(X′X + Λ0)⁻¹ )
This derivation assumes we know σ already, which of course will not be the case in
practice.
³ Specifying the variance of our prior in terms of the variance of shocks keeps the algebra much cleaner but is not strictly necessary. Also note that like Σ, σ denotes variance, not standard deviation as is often the case.
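The conjugate posterior in (1.10) is easy to sketch for simulated data (Python rather than the book's R/C++ code; the names below are mine, and σ is taken as known):

```python
import numpy as np

rng = np.random.default_rng(1)

# Posterior (1.10) for y = Xβ + ε with known error variance σ (sigma2 here)
# and prior β ~ N(β0, σ Λ0⁻¹).
T, k = 500, 3
beta_true = np.array([1.0, -2.0, 0.5])
sigma2 = 1.0
X = rng.normal(size=(T, k))
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=T)

beta0 = np.zeros(k)      # prior mean
Lambda0 = np.eye(k)      # prior precision scale

LambdaT = X.T @ X + Lambda0
post_mean = np.linalg.solve(LambdaT, X.T @ y + Lambda0 @ beta0)
post_var = sigma2 * np.linalg.inv(LambdaT)

print(post_mean)  # close to beta_true once T is large
```

With T = 500 observations the prior contributes little and the posterior mean sits close to the OLS estimate.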
β = vec(B)
Multiplying this expression out (and ignoring terms that don't contain β) leaves us with

f(β|X, y, σ) ∝ exp{ −(1/2) [ β′(V0)⁻¹β − 2y′Σ⁻¹Xβ + β′X′Σ⁻¹Xβ − 2β′(V0)⁻¹β0 ] }

Using the property for Kronecker products that for conformable matrices (A ⊗ B)(C ⊗ D) = AC ⊗ BD this simplifies to

f(β|X, y, σ) ∝ exp{ −(1/2) [ β′(V0)⁻¹β − 2y′(X ⊗ σ⁻¹)β + β′(X′X ⊗ σ⁻¹)β − 2β′(V0)⁻¹β0 ] }
in which case we can write the posterior as

f(β|X, y, σ) ∝ exp{ −(1/2) [β − V1⁻¹((X′ ⊗ σ⁻¹)y + V0⁻¹β0)]′ V1 [β − V1⁻¹((X′ ⊗ σ⁻¹)y + V0⁻¹β0)] }

where

V1 = (X′X ⊗ σ⁻¹ + V0⁻¹)
In the future we can use a matrix normal distribution for our multivariate normal models
which avoids the hassle of having to vectorize and then simplify our model, though it is
hopefully useful to illustrate the vectorized version at least once.
(1.11) −(1/(2σ)) [ y′y − 2y′Xβ + β′X′Xβ + β′Λ0β − 2β′Λ0β0 + β0′Λ0β0 + s0 ]
If we define ΛT ≡ X′X + Λ0 then

(1.12) [β − ΛT⁻¹(X′y + Λ0β0)]′ ΛT [β − ΛT⁻¹(X′y + Λ0β0)] =
       β′ΛTβ − 2β′(X′y + Λ0β0) + (X′y + Λ0β0)′ΛT⁻¹(X′y + Λ0β0)
Comparing terms in 1.12 and 1.11 shows that we can re-write the exponential terms of the
posterior as
(1.13) −(1/(2σ)) [ (β − ΛT⁻¹(X′y + Λ0β0))′ ΛT (β − ΛT⁻¹(X′y + Λ0β0)) +
       y′y − (X′y + Λ0β0)′ΛT⁻¹(X′y + Λ0β0) + β0′Λ0β0 + s0 ]
The first line in 1.13 forms the normal part of our normal-inverse gamma posterior; the
second line contains terms that will go into the posterior scale parameter of the inverse
gamma distribution. We can re-write the scale parameter for the posterior by noting that
the posterior mean for β is βT = ΛT⁻¹(X′y + Λ0β0).
As previously, the trick is to write our posterior as a sum of squares. Using the fact that tr(A) + tr(B) = tr(A + B) and dealing first with exponential terms, our result is the matrix equivalent to equation (1.11):

(1.15) −(1/2) tr[ (Y′Y − 2Y′XB + B′X′XB + B′Λ0B − 2B0′Λ0B + B0′Λ0B0 + V0) Σ⁻¹ ]
⁴ Note that in this section I will write B as its transpose for ease of notation. That is, B in this section corresponds to its transpose elsewhere. This is due to the fact that we will write our stacked model of observations as Y = XB + ε.
⁵ I use the vectorized form β = vec(B′) as this stacks B over rows, thus the resulting vector β0 corresponds to the covariance matrix Λ0 ⊗ Σ and to X = X ⊗ Ik as defined in the derivation for a simple normal-normal multivariate linear regression.
Again, comparing terms in 1.16 and 1.15 shows that we can re-write the exponential terms of the posterior as

(1.17) −(1/2) tr{ [ (B − ΛT⁻¹(X′Y + Λ0B0))′ ΛT (B − ΛT⁻¹(X′Y + Λ0B0)) +
       Y′Y − (X′Y + Λ0B0)′ΛT⁻¹(X′Y + Λ0B0) + B0′Λ0B0 + V0 ] Σ⁻¹ }
so that our final result is the matrix equivalent to our result in the univariate case, that is,

(1.18) f(B, Σ|Y, X) ∝ |Σ|^(−m/2) exp{ −(1/2) tr[ (B − ΛT⁻¹(X′Y + Λ0B0))′ ΛT (B − ΛT⁻¹(X′Y + Λ0B0)) Σ⁻¹ ] }
       × |Σ|^(−(ν0+k+T+1)/2) exp{ −(1/2) tr[ ((Y − XBT)′(Y − XBT) + (BT − B0)′Λ0(BT − B0) + V0) Σ⁻¹ ] }

where BT = ΛT⁻¹(X′Y + Λ0B0). Thus our posterior for B conditional on Σ and the data is

f(B|Σ, X, Y) ∼ MN( ΛT⁻¹(X′Y + Λ0B0), ΛT, Σ )
1.3 Summing Up
These results will form the basis of our Bayesian estimation routines for dynamic factor
models. In chapter six we will construct posterior densities for parameters of our model
by simulation. These simulations will use the conditional densities we have derived above.
In summary, for a normal-inverse gamma conjugate prior our prior for β is
π(β|σ) ∝ σ^(−k/2) exp{ −(1/(2σ)) (β − β0)′Λ0(β − β0) }
yt = BXt + εt

so that β = vec(B) and thus the model distribution (again in vectorized form) is

f(y|β, σ, X) ∝ exp{ −(1/2) (y − Xβ)′Σ⁻¹(y − Xβ) }

where X = X ⊗ Ik. Then the posterior for β conditional on σ is

f(β|σ, X, y) ∼ N( vec[ΛT⁻¹(X′Y + Λ0B0)], (ΛT ⊗ σ⁻¹)⁻¹ )
Chapter 2

The Kalman Filter

Kalman filtering, named for Kalman (1960), is a means of estimating the time series model
(2.1) yt = Hxt + εt
(2.2) xt = Axt−1 + et
where xt is an unobserved state or states, yt observed outcomes, and εt and et are normally distributed error terms with the covariance matrix

Cov([e′t ε′t]′) = [ Q 0 ; 0 R ]

States, observed
outcomes, and error terms can be either scalars or vectors. The Kalman filter works by first
predicting an outcome xt|t−1 where the subscripts indicate the prediction of xt based on
all observations from the initial period until period t − 1 and then updating this prediction
using the outcome from period t. Formally, we first predict
(2.3) p(xt|y1:t−1) = ∫ p(xt|xt−1) p(xt−1|y1:t−1) dxt−1
2.1 Preliminaries
Suppose we know that the scalars x and y follow the multivariate normal distribution
(2.5) [x y]′ ∼ N( [µ m]′, Σ )
where Σ = [ P C ; C′ S ] and we're interested in determining the distribution f(x|y). Using the definition of a conditional distribution we know that

f(x|y) = f(x, y) / f(y)
E(x|y) = µ + CS −1 (y − m)
and
V ar(x|y) = P − CS −1 C 0
These results are essentially all we need to derive the Kalman filter.
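These two formulas can be checked by simulation: for jointly normal variables, regressing x on y recovers exactly CS⁻¹ and the conditional variance (a Python sketch; the numbers are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(2)

# Check E(x|y) = µ + C S⁻¹(y − m) and Var(x|y) = P − C S⁻¹C′ by simulation.
mu, m = 1.0, -1.0
P, C, S = 2.0, 0.8, 1.5                # Var(x), Cov(x, y), Var(y)
cov = np.array([[P, C], [C, S]])
draws = rng.multivariate_normal([mu, m], cov, size=200_000)
x, y = draws[:, 0], draws[:, 1]

slope = np.cov(x, y)[0, 1] / np.var(y)     # ≈ C S⁻¹
resid = x - (mu + slope * (y - m))         # deviations from E(x|y)
print(slope, np.var(resid))                # ≈ 0.533 and ≈ 1.573
```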
seem since we never in fact observe xt (or xt−1 ). Therefore we need to define a new variable,
call it µt|t , which is our expected value of xt given observations y1:t . Define the variance of
xt|t as Pt|t and the variance of xt|t−1 (xt given observations 1 : t − 1) as Pt|t−1 . Then the
joint distribution for xt|t−1 and yt is
[ xt|t−1 ; yt ] ∼ N( [ Aµt−1|t−1 ; H(Aµt−1|t−1) ], Σt )

where

Σt = [ Pt|t−1 Ct ; C′t St ]
From the results in section 2.1 we can immediately calculate the expected value of xt|t−1
(which is unobserved) given yt (which is observed), E(xt|t−1 |yt ) = Aµt−1|t−1 + Ct St−1 (yt −
H(Aµt−1|t−1 )), as well as V ar(xt|t−1 |yt ) = Pt|t−1 − Ct St−1 Ct0 . However, we still need to
derive the values for Σ. For notational convenience write xt|t−1 as simply xt, which is distinct from xt|t. Writing
we then have
Pt|t−1 = Et|t−1[ xt x′t − 2xt µ′t−1|t−1 A′ + Aµt−1|t−1 µ′t−1|t−1 A′ ]
       = Et|t−1[ (Axt−1 + vt)(Axt−1 + vt)′ − 2(Axt−1 + vt)µ′t−1|t−1 A′ + Aµt−1|t−1 µ′t−1|t−1 A′ ]
       = Et|t−1(Axt−1 x′t−1 A′) + Et|t−1(vt v′t) − Aµt−1|t−1 µ′t−1|t−1 A′
       = APt−1|t−1 A′ + Q
St = HPt|t−1 H′ + R
And finally

Ct = Et|t−1[ xt y′t − xt µ′t−1|t−1 A′H′ − Aµt−1|t−1 y′t + Aµt−1|t−1 µ′t−1|t−1 A′H′ ]
   = Et|t−1[ (Axt−1 + vt)(H(Axt−1 + vt) + wt)′ − 2(Axt−1 + vt)µ′t−1|t−1 A′H′ + Aµt−1|t−1 µ′t−1|t−1 A′H′ ]
   = Et|t−1(Axt−1 x′t−1 A′H′) + Et|t−1(vt v′t)H′ − Aµt−1|t−1 µ′t−1|t−1 A′H′
   = Pt|t−1 H′
Thus equipped we can write the Kalman filter as follows. Our prediction for the mean
and variance of xt (without conditioning on yt ) is
µt|t−1 = Aµt−1|t−1
Pt|t−1 = APt−1|t−1 A0 + Q
Our prediction for the mean and variance of yt given observations 1 : t − 1 (before yt is
observed), denoted yt|t−1 , is
mt|t−1 = Hµt|t−1
St = HPt|t−1 H 0 + R
Our prediction for the covariance between xt (again, without using yt ) and yt|t−1 is
Ct = Pt|t−1 H 0
Note that the Kalman gain combines this covariance and the estimated variance of yt|t−1
and is typically written as Kt = Ct St−1 . The above equations, called the prediction step,
give us our prior. Our posterior estimates for the mean and variance of xt given yt (recall yt is observed), called the updating step, are

µt|t = µt|t−1 + Ct St⁻¹ (yt − mt|t−1)
Pt|t = Pt|t−1 − Ct St⁻¹ C′t
where f (yt |y1:t−1 ) is normally distributed with mean mt|t−1 and variance St . Denoting
the predictive error calculated by the Kalman filter in each period as νt = yt|t − mt|t−1 we
can thus write the likelihood of our observables as
L = ∏t=1..T (2π)^(−k/2) |St|^(−1/2) exp{ −(1/2) ν′t St⁻¹ νt }
so that the log likelihood, which we typically use for any maximization problem, is
(2.7) l = κ − (1/2) Σt=1..T log(|St|) − (1/2) Σt=1..T ν′t St⁻¹ νt
where κ does not contain any parameters of interest and can thus be ignored in the max-
imization problem. The log likelihood is remarkably easy to calculate in practice as both
νt and St are calculated in each period by the Kalman filter.
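The prediction and updating steps plus the log likelihood in (2.7) fit in a short function. This is a sketch in Python rather than the book's R/C++ code; the function and variable names are mine:

```python
import numpy as np

def kalman_filter(y, A, H, Q, R, mu0, P0):
    """Kalman filter for y_t = H x_t + eps_t, x_t = A x_t-1 + e_t.

    y is (T, k); returns filtered means and the log likelihood of (2.7),
    here including the constant kappa."""
    T, k = y.shape
    mu, P = mu0, P0
    mus = np.zeros((T, len(mu0)))
    loglik = 0.0
    for t in range(T):
        # Prediction step
        mu_pred = A @ mu
        P_pred = A @ P @ A.T + Q
        # Predicted observables
        m = H @ mu_pred
        S = H @ P_pred @ H.T + R
        C = P_pred @ H.T                    # Cov(x_t, y_t | y_1:t-1)
        # Updating step
        nu = y[t] - m                       # prediction error nu_t
        K = C @ np.linalg.inv(S)            # Kalman gain K_t = C_t S_t^-1
        mu = mu_pred + K @ nu
        P = P_pred - K @ S @ K.T
        mus[t] = mu
        sign, logdet = np.linalg.slogdet(S)
        loglik += -0.5 * (k * np.log(2 * np.pi) + logdet
                          + nu @ np.linalg.solve(S, nu))
    return mus, loglik
```

As the text notes, νt and St fall out of the filter for free, so evaluating the likelihood costs essentially nothing beyond filtering itself.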
2.4 A Kalman Smoother
(2.8) [ xt|y1:t ; xt+1|y1:t ] ∼ N( [ µt|t ; µt+1|t ], [ Pt|t Pt|tA′ ; APt|t Pt+1|t ] )
where Pt+1|t = APt|t A0 + Q. Using the same results for a joint normal distribution we used
to derive the Kalman filter we then have¹

E(xt|xt+1, y1:t) = µt|t + gt (xt+1 − µt+1|t)
Var(xt|xt+1, y1:t) = Pt|t − gt Pt+1|t g′t

where gt = Pt|t A′ Pt+1|t⁻¹. Since we do not in fact observe xt+1 we need to slightly modify
the above equations. Using the law of iterated expectations for the first
(2.9) E(xt |y1:T ) = E(E(xt |xt+1 , y1:t )|y1:T ) = µt|t + gt (µt+1|T − µt+1|t )
V ar(xt |y1:T ) = E(V ar(xt |xt+1 , y1:t )|y1:T ) + V ar(E(xt |xt+1 , y1:t )|y1:T )
(2.10) = Pt|t − gt Pt+1|t gt0 + gt Pt+1|T gt0
= Pt|t − gt (Pt+1|t − Pt+1|T )gt0
Equations (2.9) and (2.10) form our smoother. We begin using our last filtered value for
µT |T and PT |T and iterate backwards to the first period.
¹ The result for Pt|T comes from the fact that

Var(xt|xt+1, y1:t) = Pt|t − Pt|t A′ Pt+1|t⁻¹ A Pt|t
                  = Pt|t − Pt|t A′ Pt+1|t⁻¹ Pt+1|t Pt+1|t⁻¹ A Pt|t
                  = Pt|t − gt Pt+1|t g′t
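Equations (2.9) and (2.10) translate into a short backward recursion over the stored filtered moments (a Python sketch; the names are mine, and the filtered means and variances are assumed to come from a Kalman filter pass):

```python
import numpy as np

def kalman_smoother(mus_f, Ps_f, A, Q):
    """Backward recursion (2.9)-(2.10) given filtered means mus_f[t] = µ_t|t
    and variances Ps_f[t] = P_t|t; starts from the last filtered values and
    iterates back to the first period."""
    T = len(mus_f)
    mus_s, Ps_s = mus_f.copy(), Ps_f.copy()
    for t in range(T - 2, -1, -1):
        P_pred = A @ Ps_f[t] @ A.T + Q                 # P_t+1|t
        g = Ps_f[t] @ A.T @ np.linalg.inv(P_pred)      # g_t
        mus_s[t] = mus_f[t] + g @ (mus_s[t + 1] - A @ mus_f[t])
        Ps_s[t] = Ps_f[t] - g @ (P_pred - Ps_s[t + 1]) @ g.T
    return mus_s, Ps_s
```

Note the inversion of Pt+1|t at every step, which is exactly the cost the disturbance smoother of chapter 3 avoids.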
µt|t−1 = Aµt−1|t−1
(2.11) mt|t−1 = Hµt|t−1
µt|t = µt|t−1 + CS −1 (yt|t − mt|t−1 )
This is the steady state Kalman filter. To obtain these steady state values, we can
simply run the system of difference equations that determine the relevant variances and
covariances until convergence. This system is
Pt|t−1 = APt−1|t−1 A0 + Q
St = HPt|t−1 H 0 + R
(2.12)
Ct = Pt|t−1 H 0
Pt|t = Pt|t−1 − Ct St−1 Ct0
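Running the system (2.12) to convergence takes only a few lines (a sketch; the function name and tolerances are mine):

```python
import numpy as np

def steady_state_gain(A, H, Q, R, tol=1e-10, max_iter=10_000):
    """Iterate the covariance recursions (2.12) until P_t|t converges, then
    return the steady state Kalman gain K = C S^-1 and P_t|t."""
    P = Q.copy()                          # any positive semi-definite start
    for _ in range(max_iter):
        P_pred = A @ P @ A.T + Q
        S = H @ P_pred @ H.T + R
        C = P_pred @ H.T
        P_new = P_pred - C @ np.linalg.solve(S, C.T)
        if np.max(np.abs(P_new - P)) < tol:
            break
        P = P_new
    return C @ np.linalg.inv(S), P_new
```

Once K has converged, the steady state filter (2.11) runs with no matrix inversions at all, which is why it is attractive for very long series.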
2.6 An Example
Suppose we have a model described by

xt = [ 1 −.5 ; .1 .7 ] xt−1 + et

yt = [ .5 1 ; −1 2 ; 1 −1 ; 1 −.5 ] xt + εt
where xt is a 2 × 1 vector of unobserved factors, yt is a 4 × 1 vector of observed data,
et ∼ N(0, I2), and εt ∼ N(0, I4), and we would like to construct the unobserved factors xt
from the observed data.
In this case I already know the parameters A, Q, H, and R. Thus all that remains is to specify initial conditions. x0 = [0] is a natural but arbitrary choice. So that this initial guess has little influence on the model I begin with a large initial factor variance, in this case

P0 = [ 10⁵ 0 ; 0 10⁵ ]
Figure 2.1 plots the results for 200 observations.
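The exercise is easy to replicate. The sketch below simulates 200 observations from the model above and filters them with the diffuse P0 (Python rather than the book's R/C++ code; the seed and names are mine):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate the section 2.6 example, then filter with diffuse P0 = 1e5 * I2.
A = np.array([[1.0, -0.5], [0.1, 0.7]])
H = np.array([[0.5, 1.0], [-1.0, 2.0], [1.0, -1.0], [1.0, -0.5]])
Q, R = np.eye(2), np.eye(4)
T = 200

x = np.zeros((T, 2)); y = np.zeros((T, 4))
for t in range(T):
    x_prev = x[t - 1] if t > 0 else np.zeros(2)
    x[t] = A @ x_prev + rng.multivariate_normal(np.zeros(2), Q)
    y[t] = H @ x[t] + rng.multivariate_normal(np.zeros(4), R)

mu, P = np.zeros(2), 1e5 * np.eye(2)
filtered = np.zeros((T, 2))
for t in range(T):
    mu_pred, P_pred = A @ mu, A @ P @ A.T + Q
    S = H @ P_pred @ H.T + R
    C = P_pred @ H.T
    K = C @ np.linalg.inv(S)
    mu = mu_pred + K @ (y[t] - H @ mu_pred)
    P = P_pred - K @ S @ K.T
    filtered[t] = mu

print(np.corrcoef(filtered[:, 0], x[:, 0])[0, 1])  # high: the filter tracks x1
```

With four observed series loading on two factors, the filtered estimate of x1,t tracks the true factor closely, as figure 2.1 suggests.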
Figure 2.1: True series x1,t versus the series estimated from observations yt

2.7 Missing Observations
When observations are missing, we simply remove the corresponding rows of yt, H, and R in each period, so that the dimensions of these matrices will automatically correspond to the observed series. The same is true of the dimensionality
of terms used in the disturbance smoother in chapter 3. From a programming point of
view, the only downside to the changing dimensions of H, R and the terms calculated
using each is storage. For example, if we wish to save the gain Kt and the prediction error
νt = yt − Hµt|t−1 at each period t we will need a data type that can handle different dimensions for each observation. In C++/Armadillo this means using field<mat>, and
in R a list as opposed to a cube or array, but this disadvantage is slight and does not
appreciably detract from the speed or efficiency of calculations.
Note that, in the rest of this text, I do not include time subscripts on parameters such as H and R that may change dimension from one period to the next due to missing data; this keeps the notation clean and avoids confusion with objects such as observations or factors whose values change over time.
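One way to sketch the row-dropping bookkeeping is with boolean masks standing in for field<mat> (Python; the function name and NaN convention for missing values are mine):

```python
import numpy as np

def update_with_missing(mu_pred, P_pred, y_t, H, R):
    """One Kalman updating step where y_t may contain NaNs for missing
    series: rows of H and R are dropped so all dimensions match the
    observed entries."""
    obs = ~np.isnan(y_t)
    if not obs.any():                    # nothing observed: prior = posterior
        return mu_pred, P_pred
    H_t = H[obs, :]                      # keep only observed rows
    R_t = R[np.ix_(obs, obs)]
    nu = y_t[obs] - H_t @ mu_pred        # prediction error, reduced dimension
    S = H_t @ P_pred @ H_t.T + R_t
    K = P_pred @ H_t.T @ np.linalg.inv(S)
    return mu_pred + K @ nu, P_pred - K @ S @ K.T
```

When every series is missing the prediction simply passes through, which is how the filter interpolates over gaps in the data.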
Chapter 3

The Durbin-Koopman Disturbance Smoother
I begin this chapter by re-stating our basic factor model with a slight generalization, al-
lowing for exogenous predetermined variables nt . In practice, these will typically take the
form of seasonal factors; Chapter 5 includes an example of estimating seasonal adjustments
for covariance stationary data by maximum likelihood. Thus our measurement equation
becomes
(3.1) yt = Hxt + M nt + εt
(3.2) zt = Azt−1 + et
where εt and et are normally distributed error terms with the covariance matrix

Cov([e′t ε′t]′) = [ Q 0 ; 0 R ]
In the above yt are noisy observations, zt stacked factors zt = [x′t x′t−1 . . . x′t−p+1]′ with p lags, and nt predetermined, exogenous variables.
Given the parameters H, A, M , Q, and R estimates of factors or estimates of missing
series in yt derive from the Kalman filter and smoother. I re-state the Kalman filter here
¹ One could also include exogenous variables in the transition equation, but I do not here as what I have in mind are seasonal adjustments to observations.
as
zt|t−1 = Azt−1|t−1
Pt|t−1 = APt−1|t−1 A0 + Q
yt|t−1 = H̃zt|t−1 + M nt
St = H̃Pt|t−1 H̃ 0 + R
Ct = Pt|t−1 H̃ 0
zt|t = zt|t−1 + Ct St−1 (yt|t − yt|t−1 )
Pt|t = Pt|t−1 − Ct St−1 Ct0
and the Kalman smoother as
zt|T = zt|t + gt (zt+1|T − zt+1|t)
(3.3)
Pt|T = Pt|t − gt (Pt+1|t − Pt+1|T )gt0
−1
where gt = Pt|t A0 Pt+1|t . H̃ in the above incorporates a helper matrix J to extract, in
the simplest example, contemporaneous factors from zt . That is, J = Im 0 0 . . . and
H̃ = HJ. In the above notation xt|t−1 refers to our estimate of xt given observations
through period t − 1 while xt|t is our estimate of xt given observations through t, and xt|T
is our estimate of xt conditional on all available data through period T . Note that in the
above Kt = Ct St−1 is the Kalman gain, νt = yt|t − yt|t−1 is the prediction error, and thus
Kt νt is our forecast update.
Smoothing here requires we store the following matrices (or vectors) from the filter
at every period t: Pt|t , Pt+1|t , zt|t , and zt+1|t . Additionally, we must invert the matrix
Pt+1|t at each iteration of the smoother. This can be computationally burdensome for even
moderately sized models. Consider an example with four factors and five lags. In this case
we will be inverting the 20 × 20 matrix Pt+1|t in each time period. For mixed frequency
models (see chapter 7) inverting Pt+1|t can become prohibitive. For a daily/weekly/monthly
model with three factors and three lags at a monthly (30 day) frequency Pt+1|t will be a
270 × 270 matrix. The Durbin-Koopman disturbance smoother allows us to avoid these
cumbersome calculations resulting in much faster and more efficient model estimation.
(3.4) νt = yt − H̃zt|t−1 − M nt
(3.5) St = H̃Pt|t−1 H̃ 0 + R
(3.6) Kt = APt|t−1 H̃ 0 St−1
(3.7) zt+1|t = Azt|t + Kt νt
(3.8) Lt = A − Kt H̃
(3.9) Pt+1|t = APt|t L0t + Q
Note that in this notation the Kalman gain in (3.7) updates the forecast for next period
factors; factoring A out of equation (3.8) yields the same forecast update as in chapter
2’s more standard Kalman filter. Similarly, multiplying out the elements of (3.9) yields
Pt+1|t = APt|t−1 A0 − ACt St−1 Ct0 A0 + Q. For the above notation, the disturbance smoother
is
The backward recursion itself runs from t = T to t = 1 with rT = 0,

rt−1 = H̃′St⁻¹νt + L′t rt,    et|T = Q rt

Note that in order to recover smoothed factors we must iterate rt back to r0. Equipped with et|T we can recover smoothed estimates of the factors by initializing our smoothed factors as

z1|T = z1|0 + P1|0 r0

or, since we typically set z0 to zero, simply z1|T = P1|0 r0, and iterating forward again using

zt+1|T = A zt|T + et|T
In this case we must store the matrices (or vectors) νt , H̃, St−1 , Lt , and potentially Kt .
Since filtering already requires computing St⁻¹, disturbance smoothing requires no additional matrix inversions and will thus be much more computationally efficient.
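The full filter-plus-disturbance-smoother loop can be sketched compactly. This follows the standard Durbin and Koopman (2012) recursions rather than the book's own code (Python; the function and variable names are mine):

```python
import numpy as np

def dk_smooth(y, A, H, Q, R, z0, P0):
    """Disturbance smoothing for y_t = H z_t + eps_t, z_t = A z_t-1 + e_t.

    The forward pass stores nu_t, S_t^-1 and L_t from equations (3.4)-(3.9);
    the backward pass then needs no further matrix inversions. Returns the
    smoothed states z_t|T."""
    T = y.shape[0]; m = len(z0)
    z, P = A @ z0, A @ P0 @ A.T + Q            # z_1|0 and P_1|0
    zs_pred, Ps_pred, nus, Sinvs, Ls = [], [], [], [], []
    for t in range(T):
        nu = y[t] - H @ z                      # (3.4)
        S = H @ P @ H.T + R                    # (3.5)
        Sinv = np.linalg.inv(S)
        K = A @ P @ H.T @ Sinv                 # (3.6)
        L = A - K @ H                          # (3.8)
        zs_pred.append(z); Ps_pred.append(P)
        nus.append(nu); Sinvs.append(Sinv); Ls.append(L)
        z = A @ z + K @ nu                     # (3.7)
        P = A @ P @ L.T + Q                    # (3.9)
    # Backward pass: r_T = 0, r_t-1 = H' S_t^-1 nu_t + L_t' r_t
    r = np.zeros(m)
    rs = [None] * T
    for t in range(T - 1, -1, -1):
        r = H.T @ Sinvs[t] @ nus[t] + Ls[t].T @ r
        rs[t] = r
    # Smoothed states: z_t|T = z_t|t-1 + P_t|t-1 r_t-1
    return np.array([zs_pred[t] + Ps_pred[t] @ rs[t] for t in range(T)])
```

Only νt, St⁻¹ and Lt are carried between the two passes, so memory and computation both scale far better than the smoother of chapter 2 for large state vectors.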
Chapter 4

Estimation via Principal Components
By far the simplest and fastest way to estimate the factor model outlined in equations (1)
and (2) is by principal components, though this approach has several major drawbacks.
Perhaps most importantly, principal components solves a static problem. That is, when
using principal components we find factors by looking at the measurement equation in
isolation ignoring the inter-temporal correlations implied by the transition equation. Ad-
ditional problems with principal components estimates are (1) data must be square and
complete, a large drawback when using a methodology particularly adept at handling miss-
ing and noisy observations and (2) principal components limits our ability to fix parameters
or apply prior beliefs to parameter estimates. Despite these drawbacks, when we have a
data set without missing observations principal components provides a very quick and easy
way to estimate a factor model. I begin this chapter by examining reduced rank regressions
before moving on to principal components as a special case. Once we have our principal
component estimates of factors we can estimate the transition equation by OLS and simply
plug these results into the Kalman filter and, if desired, smoother to obtain our final factor
estimates.
The reduced rank regression model is

(4.1) Y = BγX + ε

where both B and γ are parameter matrices to estimate. If we want the least squares estimates of B and γ then we need to minimize

(4.2) min tr{ (Y − BγX)(Y − BγX)′ }

where tr denotes the trace of the covariance matrix for ε. Note that equation (4.1) will be
observationally equivalent for

Y = (Bθ)(θ⁻¹γ)X + ε

where Ba = Bθ and γa = θ⁻¹γ are some alternative B and γ, thus we need to impose some sort of normalization on γ. It is convenient to impose

(4.3) γXX′γ′ = Im
From a simple OLS model we already know that given γ our estimate for B that minimizes
our loss function (4.2) will be
B = Y X 0 γ 0 (γXX 0 γ 0 )−1
which, given our normalization (4.3), is just Y X′γ′. Plugging this into our loss function we have

min tr{ Y Y′ − 2Y X′γ′γXY′ + Y X′γ′γXX′γ′γXY′ }
Since the first term does not include γ we can re-write this minimization problem as a con-
strained maximization, using the property of a matrix trace that tr(ABC) = tr(CAB) =
tr(BCA) so that tr(Y X 0 γ 0 γXY 0 ) = tr(γXY 0 Y X 0 γ 0 ),
L = tr{ γXY′Y X′γ′ + Λ(Im − γXX′γ′) }
where the constraint comes from our normalization (4.3) and Λ is a diagonal matrix of
Lagrange multipliers. The first order condition for this problem (see the appendix for
useful matrix derivatives) is
2γXY 0 Y X 0 − 2ΛγXX 0 = 0
or
γXY 0 Y X 0 (XX 0 )−1 − Λγ = 0
so that the solution for γ is the left eigenvectors of XY 0 Y X 0 (XX 0 )−1 . Since left eigenvectors
are a bit annoying (Matlab automatically computes right eigenvectors) we can simply take
the transpose of this result
so that γ′ is given by the (right) eigenvectors of (XX′)⁻¹XY′Y X′ associated with the largest m eigenvalues, the diagonal elements of the diagonal matrix Λ. Recall that the Lagrange multiplier is a measure of how strongly our constraint binds. We choose the eigenvectors associated with the largest eigenvalues because those vectors give the factors γX which will have the largest impact on our objective function tr{ γXY′Y X′γ′ }. Once we have γ we can simply compute B = Y X′γ′.
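A sketch of this solution in Python (scipy's generalized symmetric eigensolver handles the γXX′γ′ = Im normalization directly; the function and variable names are mine):

```python
import numpy as np
from scipy.linalg import eigh

def reduced_rank_regression(Y, X, m):
    """Solve min tr{(Y - BγX)(Y - BγX)'} s.t. γXX'γ' = I_m, as in (4.2)-(4.3).

    Y is (k, T), X is (n, T). The first order condition is the generalized
    eigenproblem XY'YX' v = λ XX' v, whose eigenvectors are XX'-orthonormal,
    matching the normalization; γ stacks the top-m eigenvectors as rows."""
    XX = X @ X.T
    M = X @ Y.T @ Y @ X.T
    vals, vecs = eigh(M, XX)              # eigenvalues in ascending order
    gamma = vecs[:, ::-1][:, :m].T        # top-m eigenvectors, as rows of γ
    B = Y @ X.T @ gamma.T                 # B = YX'γ' under the normalization
    return B, gamma
```

A useful sanity check is that with m equal to the number of regressors the fitted values BγX coincide with the unrestricted OLS fit, since Bγ then spans the full coefficient space.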
4.2 Principal Components as a Special Case

Setting X = Y in the problem above, (XX′)⁻¹XY′Y X′ reduces to Y Y′, in which case the solution for γ′ is the (right) eigenvectors of Y Y′ associated with the largest m eigenvalues. We can actually derive principal components from a slightly more
general problem. Suppose we want to minimize the loss function
(4.7) min tr{ (Y − γF)(Y − γF)′ }
where F can be any set of explanatory variables. Thus, unlike the problem in section 4.1,
we are not restricting what the set of explanatory variables can be (previously it was X).
We again need to impose a normalization on γ since Y = (γθ)(θ⁻¹F) + ε will be observationally equivalent to the model in (4.7). In this case the normalization

(4.8) Im = γ′γ
is convenient. Minimizing first over F , the solution for the factors given γ is
F = (γ 0 γ)−1 γ 0 Y
Plugging this back into our minimization problem, equation (4.7), we have

min tr{ Y Y′ − 2γγ′Y Y′ + γγ′Y Y′γγ′ }
Using the property of a matrix trace that tr(ABC) = tr(CAB) = tr(BCA) and our normalization, equation (4.8), this result becomes

min tr{ Y Y′ − Y′γγ′Y }

which, again using our normalization, we can write as the constrained maximization problem

L = tr{ γ′Y Y′γ + Λ(I − γ′γ) }
where again we have used the properties of a matrix trace. Thus the trace of ΣY Y is the
sum of its eigenvalues.² With one principal component, that associated with the largest eigenvalue denoted λ1, our scaled loss function is

tr{ (1/T)(Y − γγ′Y)(Y − γγ′Y)′ } = tr{ (1/T)(Y Y′ − 2γγ′Y Y′ + γγ′Y Y′γγ′) }
  = tr{ (1/T)(Y Y′ − γγ′Y Y′) }
  = tr{ (1/T)(Y Y′ − γ′Y Y′γ) }
  = tr{ ΓΛΓ′ − γ′ΓΛΓ′γ }
  = tr{ Λ − [ λ1 0 · · · ; 0 0 · · · ; ⋮ ⋮ ⋱ ] }
² From our first order condition for m factors using the scaled loss function, ΣYY γ − γΛ = 0, we can write the system of equations for the complete set of factors when m = k as ΣYY Γ = ΓΛ where Λ is a diagonal matrix of the eigenvalues of ΣYY and the columns of Γ are the associated eigenvectors. Thus ΣYY = ΓΛΓ′ where we maintain our normalization Γ′Γ = Ik, and since Γ is now square it is also true that ΓΓ′ = Ik.
where the last equality comes from the fact that γ 0 Γ is a matrix of zeros with a one in
the upper left corner, as is Γ0 γ. Thus the first principal component reduces our scaled loss
function by λ1 . Similarly, the first two principal components will reduce our scaled loss
function by λ1 + λ2 .PThe usual interpretation of this result is that the first m principal
m
λi
components explain Pi=1k of the variance of Y , where by variance of Y we mean the sum
i=1 λi
of the diagonal of the covariance matrix.
4.3 Summing Up
Beginning with our measurement equation

Yt = Hxt + εt
if our set of k covariance stationary observables Yt does not have missing observations then we can estimate the k × m matrix of loadings H by principal components as the (right) eigenvectors associated with the m largest eigenvalues of ΣYY, where ΣYY is the (typically scaled to have ones on the diagonal) covariance matrix for Yt, that is, ΣYY = (Y − Ȳ)(Y − Ȳ)′. Due to the normalizing assumption Im = H′H in equation (4.8) our estimate of the factors xt is

x̂t = H′Yt
and residuals are given by
ε̂t = Yt − H x̂t
allowing us to estimate R. Typically we will use only the diagonal elements of R as Doz
et al. (2012) show that this model, the approximate factor model, even if misspecified still
consistently estimates factors. Using the factors x̂t we can estimate the transition equation
by OLS for parameters A and Q. These are nearly all we will need to run the Kalman
filter and, if desired, smoother. The only things that remain are initial values x0 and P0.
Because x0 is not known, it is good practice to initiate the filter with diffuse values, that
is, specifying a large initial factor variance P0 . For example, we might begin our filter with
x0 = 0m and P0 = 10⁵ Im.³
³ In the case of more than one lag p in the transition equation we would have x0 = 0m×p and P0 = 10⁵ Im×p.
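The whole recipe of section 4.3 fits in a few lines (a Python sketch rather than the book's R/C++ code; the function name and the standardization details are mine):

```python
import numpy as np

def pc_factors(Y, m):
    """Estimate loadings H and factors by principal components as in
    section 4.3. Y is (T, k); each series is demeaned and standardized so
    that Sigma_YY has ones on the diagonal. Returns H (k, m) with
    H'H = I_m and the factors x̂_t = H'Y_t stacked over t."""
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    Sigma = Z.T @ Z / len(Z)                 # scaled covariance matrix
    vals, vecs = np.linalg.eigh(Sigma)       # eigenvalues in ascending order
    H = vecs[:, ::-1][:, :m]                 # eigenvectors of the m largest
    factors = Z @ H                          # x̂_t = H'Y_t
    return H, factors
```

From here one would regress the factors on their own lags by OLS for A and Q, take (the diagonal of) the residual covariance of Z − factors @ H.T for R, and hand everything to the Kalman filter with the diffuse x0 and P0 above.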
Chapter 5

Maximum Likelihood Estimation via Watson and Engle (1983)
Maximum likelihood estimation of state space models using numerical techniques such as the function optim() in R provides an easy to program method for estimating model parameters. To use a numerical method we can simply write a function that returns the log likelihood described in equation (2.7) and search for parameters that maximize it. The great drawback of numerical methods is that they are computationally intensive and thus slow to converge and infeasible for larger models. Watson and Engle (1983) provide an alternative approach that overcomes these shortcomings of numerical routines.
yt = Hxt + εt
xt = Bzt−1 + et
where zt is defined as

zt = [ x′t x′t−1 . . . x′t−p+1 ]′

and p denotes the number of lags in the transition equation. Then the moment matrices we need to estimate are, for B: E(zt−1 z′t−1) and E(xt z′t−1); for H: E(xt x′t) and E(yt x′t); for Q: E(et e′t); and for R: E(εt ε′t). Proper estimation of the above moment matrices allows us to then compute maximum likelihood estimates of the model parameters.
In brief, the approach works by noting that, where xt|T, Pt|T, Ct|T, and so on are the values calculated at the jth iteration of the algorithm (I suppress the superscript j to keep the notation cleaner and use zt|T to denote our estimate of the factors zt given observations 1 : T)

(5.1) E(zt−1 z′t−1) = E[ (zt−1|T + (zt−1 − zt−1|T))(zt−1|T + (zt−1 − zt−1|T))′ ]

(5.2) E(xt z′t−1) = E[ (xt|T + (xt − xt|T))(zt−1|T + (zt−1 − zt−1|T))′ ]

which we estimate as

(1/T)[ Σt xt|T x′t|T + Σt Pˣt|T ]

(5.4) E(yt x′t) = E[ yt (xt|T + (xt − xt|T))′ ] = E(yt x′t|T)
5.1. THE ALGORITHM 41
giving the moment matrices for H; and for the covariance matrices Q and R

(5.5) et = (xt|T − Bzt−1|T) + (xt − xt|T) − B(zt−1 − zt−1|T)

which we estimate as

(1/T)[ Σt vt|T v′t|T + Σt Pˣt|T + Σt BPt−1|T B′ − Σt BCt|T − Σt C′t|T B′ ]

and

(1/T)[ Σt εt|T ε′t|T + Σt HPˣt|T H′ ]
As suggested by Watson and Engle (1983) we can calculate the correlations E[(xt − xt|T)(zt−1 − zt−1|T)′] by including an extra lag of xt in the state vector of the Kalman filter and smoother. For example, if we wanted to estimate a model with three lags we would write the transition equation as

[ xt ; xt−1 ; xt−2 ; xt−3 ] = [ B1 B2 B3 0 ; I 0 0 0 ; 0 I 0 0 ; 0 0 I 0 ] [ xt−1 ; xt−2 ; xt−3 ; xt−4 ] + [ vt ; 0 ; 0 ; 0 ]

For this three lag example Pt|T is the upper left 3 × 3 block submatrix of Pᵂᴱt|T in equation
5.2 Identification

Consider the model

(5.8) yt = Hxt + εt
      xt = Bxt−1 + et

Factor estimates and the likelihood function for the model in (5.8) would be identical to those for the rotated model

(5.9) yt = (HA−1)(Axt) + εt
      Axt = (ABA−1)(Axt−1) + Aet

for any invertible matrix A.
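The observational equivalence of the original and rotated models can be checked numerically; the snippet below is an illustrative numpy sketch with an arbitrary invertible rotation A:

```python
import numpy as np

# Check that rotating factors by any invertible A leaves the common
# component (and hence the likelihood) unchanged.
rng = np.random.default_rng(1)
T, m, n = 100, 2, 4
H = rng.standard_normal((n, m))          # loadings
x = rng.standard_normal((T, m))          # factors (rows are periods)
A = np.array([[2.0, 0.3], [-0.5, 1.0]])  # an arbitrary invertible rotation

x_rot = x @ A.T                          # rotated factors A x_t
H_rot = H @ np.linalg.inv(A)             # compensating loadings H A^{-1}

# fitted common components H x_t and (H A^{-1})(A x_t) coincide
assert np.allclose(x @ H.T, x_rot @ H_rot.T)
```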
Identification may be important for interpreting our model.2 It is, for maximum likelihood and Bayesian estimation, also important for estimation.3 Identification requirements for maximum likelihood estimation are less stringent than for Bayesian estimation. Bayesian estimation by simulation relies on the assumption that we are drawing factors and parameters from the same distribution at each iteration. On the other hand, because our convergence criterion for maximum likelihood is based on the likelihood function, parameter estimates can drift between the models in (5.8) and (5.9) as the likelihood function for the two will be identical. What is important is that we scale the model at each iteration of the algorithm; otherwise our parameter estimates may become arbitrarily large or small, thereby breaking the algorithm. There is no one way to scale or identify our model; for a discussion of several identification possibilities see Stock and Watson (2016). To scale the model we could enforce H′H = I as with principal components or even diag(H′H) = I. Perhaps the simplest identification technique, however, is to set the top m × m sub-matrix of H, where m is the number of factors, to be an identity matrix (this is called “naming factors” identification). In this case our model will not just be scaled, but fully identified (thus
2 See Stock and Watson (2016) for a more comprehensive discussion of interpretation and identification issues.
3 For principal component estimation identification is dealt with by the normalization for H in equation (4.8).
we could and will use it for Bayesian estimation as well). This strategy has the added advantage of giving our model a simple interpretation: the first factor describes filtered movements of the first variable in common with the entire data set, the second factor describes filtered movements of the second variable in common with the entire data set, and so on.
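In numpy, the naming-factors normalization amounts to one matrix inversion. This illustrative sketch (with assumed shapes, not the book's code) shows that it leaves the common component unchanged:

```python
import numpy as np

# Naming-factors identification: rotate so the top m x m block of H is I_m.
rng = np.random.default_rng(2)
n, m, T = 6, 2, 50
H = rng.standard_normal((n, m))        # unrestricted loadings
X = rng.standard_normal((T, m))        # factors, time in rows
M = H[:m, :]                           # top m x m block of H
H_new = H @ np.linalg.inv(M)           # top block becomes the identity
X_new = X @ M.T                        # factors rotated to compensate

assert np.allclose(H_new[:m, :], np.eye(m))
assert np.allclose(X @ H.T, X_new @ H_new.T)   # common component unchanged
```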
Initial Values
As with principal components we will not know initial values z0 with which to initiate the
algorithm. As smoothing produces estimates of z0 one possibility is to begin with z0 = [0]
and then update z0 at each iteration of the algorithm. However, there is no particularly
good theoretical justification for this approach and it may result in the likelihood function
getting worse (that is, smaller) from one iteration to the next. A better approach is to again begin with diffuse values, for example setting z0 = [0] and P0 to be the identity matrix times a large number (large relative to the actual variance of shocks).
5.3 An Example: Seasonally Adjusting Stationary Data

As an example, suppose we observe a single stationary series yt composed of an underlying autoregressive process xt, a seasonal component MNt, where Nt is a vector of seasonal dummies and M the corresponding loadings, and noise:
(5.10) y t = x t + M N t + εt
Furthermore, we will need to add a lag to the transition equation in order to estimate the adjustment factors for the transition equation. One possible initial guess for M is to regress yt on Nt by OLS and to calculate our initial guess for r from the residuals. We can
Figure 5.1: Factor Model Seasonally Adjusted Spanish IP, MoM percent change
also form initial estimates for b and r by OLS using the observed data yt. Finally, we will need to write our transition equation with an extra lag in companion form as

(5.11)  xt    =  b 0   xt−1  +  et
        xt−1     1 0   xt−2     0

Equipped with our initial guesses for parameters and the initial conditions [x0 x−1]′ = [0 0]′ and P0 = 10^6 × I2 the algorithm proceeds as follows:
1. Run the Kalman smoother for the current iteration value of parameters. If the
likelihood does not improve from the last iteration then the algorithm has converged;
otherwise continue.
2. Using yt, zt|T = [x′t|T x′t−1|T]′, and Pt|T for the augmented model, re-estimate the parameters following equations (5.1) through (5.6).
These two steps alternate until the likelihood converges. Figure 5.1 illustrates seasonally
adjusted IP estimated as outlined above against the raw data. Programs to estimate the
model are contained in the file ML Seasonal Example.
Chapter 6

Bayesian Estimation by Simulation

6.1 Sampling

Chapters 1, 2 and 3 present the main results we will use to estimate our model. Estimation begins with an initial guess for parameters. Given this initial guess, we would like to draw factors from the model
(6.1) yt = Hxt + εt
(6.2) xt = Bxt−1 + et
where

[et; εt] ∼ N( [0; 0], [Q 0; 0 R] )
One possible approach is to

1. Run the Kalman filter forward through the data to obtain xt|t and Pt|t for every period.

2. Draw x̃T from N(xT|T, PT|T).

3. Iterate backwards one period using the standard Kalman smoother. That is, calculate the mean and variance of xt conditional on x̃t+1 and observations 1 : t.

4. Draw x̃t from this conditional (normal) distribution.

5. Repeat these iterations back to the initial period to generate draws of x̃t for every period.
we must invert Pt+1|t at each step of the iteration. For a steady state model, that is, one in which the variance of factors does not change, this is not a big drawback as we can calculate the gain g once (it will be the same in every period for a steady state model) and proceed. If the factor variance is not constant (this will always be the case if yt has missing observations), calculating P^{−1}_{t+1|t} becomes computationally burdensome, particularly in
1 I use the case of one lag in the transition equation for explanatory purposes here as the notation is simple. For more lags our model becomes

yt = H̃zt + εt
zt = Azt−1 + et

where typically we will have H̃ = [H 0 0 . . .], zt = [x′t x′t−1 . . . x′t−p+1]′ where p is the number of lags, and A is the companion form of B in the more familiar vector autoregression xt = Bzt−1 + et. Note also that I am somewhat loose with the definition of et as when we have more than one lag this vector includes zeros as “shocks” to lagged variables in zt to make dimensions agree.
models with many lags as P has dimension mp × mp where m is the number of factors and p the number of lags. For this more general case, Durbin and Koopman (2012) propose
a much more computationally efficient method for drawing factors. Their key result for
normally distributed data, presented earlier in Durbin and Koopman (2001), is a method
for drawing from the conditional distribution f (x|y) when doing so directly is burdensome
but drawing from the joint normal distribution f (x, y) and calculating the expected values
E(x|y) is computationally more efficient. To borrow their notation, we can draw x+ , y +
from f (x, y), calculate x̂ = E(x|y) (the filtered and smoothed factors from our true ob-
servations y), and calculate x̂+ = E(x+ |y + ) (the filtered and smoothed observations from
our simulated observations y + ). Then the draw from our desired distribution f (x|y) is
x̃ = x̂ + x+ − x̂+ . The values x̂ are the posterior means for the factors conditional on the
data y, and x+ − x̂+ are the simulated zero mean shocks. For our purposes the authors
point out that this approach can be made even more efficient as follows:
Algorithm 1

1. Begin with a draw for x0, εt at every period t, and et at every period t and iterate equations (6.1) and (6.2) forward to obtain a draw for factors x+t and a draw for observations y+t.

2. Construct the artificial series y∗t = yt − y+t.

3. Obtain x̂∗t by filtering and smoothing y∗t using the disturbance smoother described in chapter 3.

4. Our draw for the factors is then x̃t = x̂∗t + x+t.
An important question is what to use as our draw for x0 . As previously, the best way
to avoid issues of this initial vector of factors is to begin our algorithm with a large factor
variance, that is, allow for uncertainty regarding x0 . Using diffuse values to initialize the
filter and smoother allows us to simply plug in x0 = 0 and proceed with algorithm 1.
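The identity x̃ = x̂ + x+ − x̂+ can be verified in a toy setting. Below is an illustrative sketch for a single jointly normal pair (x, y) with correlation ρ, where E(x|y) = ρy and Var(x|y) = 1 − ρ²; the Monte Carlo draws recover exactly this conditional distribution:

```python
import numpy as np

# Durbin-Koopman style draws from f(x | y) using only draws from the joint
# f(x, y) and the conditional mean E(x | y), in a scalar Gaussian toy model.
rng = np.random.default_rng(3)
rho = 0.8
y_obs = 1.5                                  # the single "observed" data point
N = 200_000

x_plus = rng.standard_normal(N)              # draw (x+, y+) from the joint
y_plus = rho * x_plus + np.sqrt(1 - rho**2) * rng.standard_normal(N)

x_hat = rho * y_obs                          # x^ = E(x | y) for the real data
x_hat_plus = rho * y_plus                    # x^+ = E(x+ | y+), per simulation
x_tilde = x_hat + x_plus - x_hat_plus        # draws from f(x | y)

# The draws match the exact conditional distribution N(rho*y, 1 - rho^2)
assert abs(x_tilde.mean() - rho * y_obs) < 0.01
assert abs(x_tilde.var() - (1 - rho**2)) < 0.01
```

In the factor model the same logic applies period by period, with the Kalman smoother supplying the conditional means.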
This excellent algorithm provides a quick, efficient, and easy to code method for sam-
pling factors. Once we have a draw for factors, we can then draw parameters based on
these simulated factors. Posterior distributions for parameters given factors are presented
in chapter 1. Because we will be using an exact factor model, that is, we will treat the
covariance matrix for shocks to observations R as diagonal, we can estimate each diagonal
element of R and each row of H using a normal-inverse gamma conjugate prior.
Because our draw for Rjj , the j th diagonal of the covariance matrix for shocks to
the observation equation and thus the variance of shocks to the j th observed series, is
conditional on the observations and factors only, while our draw for Hj , the j th row of H,
is conditional on Rjj , we will begin by drawing variances. Specifically, for series j we draw
Rjj from
(6.3) f(Rjj |X̃, yj) ∼ IG( (yj − X̃βT)′(yj − X̃βT) + (βT − h0)′Λ0(βT − h0) + s0, T + ν0 )

In the above βT = (X̃′X̃ + Λ0)−1(X̃′yj + Λ0h0). s0 is our prior scale parameter: our prior for the variance of shocks when multiplied by ν0. ν0 is our prior “degrees of freedom”, that is, ν0 determines how aggressively we shrink Rjj. Λ0 is a diagonal matrix determining the tightness of our prior for Hj, denoted h0. Typically Λ0 will be an identity matrix times a (scalar) tightness parameter, with larger values corresponding to a tighter prior, and h0 will be zero for all rows of H, though we could use different priors for each row if desired. Once we have a draw for Rjj, we draw row j of H from

(6.4) f(Hj |Rjj, X̃, yj) ∼ N( Λ−1T(X̃′yj + Λ0h0), Rjj Λ−1T )

where ΛT = X̃′X̃ + Λ0. Note that the posterior mean for Hj is just βT as defined above.
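For a single series j, the draws from (6.3) and (6.4) can be sketched as follows. This is an illustrative numpy version with simulated X̃ and yj, and one common IG parameterization (shape (T + ν0)/2, scale equal to half the sum in (6.3)); the book's own code may parameterize the prior differently.

```python
import numpy as np

# Draw R_jj and H_j for one series from the normal-inverse-gamma posterior.
rng = np.random.default_rng(4)
T, m = 300, 2
X = rng.standard_normal((T, m))          # drawn factors X~
yj = X @ np.array([1.0, -0.5]) + 0.3 * rng.standard_normal(T)

h0 = np.zeros(m)                         # prior mean for H_j
Lam0 = np.eye(m)                         # prior tightness
s0, nu0 = 1.0, 5.0                       # prior scale and degrees of freedom

LamT = X.T @ X + Lam0
betaT = np.linalg.solve(LamT, X.T @ yj + Lam0 @ h0)   # posterior mean

# (6.3): R_jj | X~, y_j ~ IG(.), drawn via the reciprocal of a gamma draw
resid = yj - X @ betaT
scale = resid @ resid + (betaT - h0) @ Lam0 @ (betaT - h0) + s0
Rjj = (scale / 2) / rng.gamma((T + nu0) / 2)

# (6.4): H_j | R_jj, X~, y_j ~ N(betaT, R_jj * LamT^{-1})
Hj = rng.multivariate_normal(betaT, Rjj * np.linalg.inv(LamT))
```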
Our draws for Q (the covariance of shocks to the transition equation) and B are nearly identical except that, because Q contains off-diagonal elements, we will use a normal-inverse Wishart conjugate prior. Beginning again with covariances, we draw Q from

(6.5) f(Q|X̃, Y) ∼ IW( (Y − X̃BT)′(Y − X̃BT) + (BT − B0)′Λ0(BT − B0) + V0, ν0 + T )

where ΛT = X̃′X̃ + Λ0 and BT = Λ−1T(X̃′Y + Λ0B0), so that the posterior mean (in matrix format) is just BT. Given Q, we then draw B (stacked as a vector) from

(6.6) f(vec(B)|Q, X̃, Y) ∼ N( vec(BT), Q ⊗ Λ−1T )
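numpy has no built-in inverse-Wishart sampler (scipy.stats.invwishart is an alternative), but one can be sketched via the Bartlett decomposition. This helper is an illustrative implementation, drawing W ∼ Wishart(ν, S⁻¹) and returning W⁻¹ ∼ IW(S, ν):

```python
import numpy as np

def draw_inv_wishart(rng, nu, S):
    """Draw from IW(S, nu) via a Bartlett-decomposed Wishart draw on S^{-1}."""
    p = S.shape[0]
    L = np.linalg.cholesky(np.linalg.inv(S))
    A = np.zeros((p, p))
    # chi-square degrees of freedom nu, nu-1, ..., nu-p+1 on the diagonal
    A[np.diag_indices(p)] = np.sqrt(rng.chisquare(nu - np.arange(p)))
    A[np.tril_indices(p, -1)] = rng.standard_normal(p * (p - 1) // 2)
    W = L @ A @ A.T @ L.T              # W ~ Wishart(nu, S^{-1})
    return np.linalg.inv(W)            # W^{-1} ~ IW(S, nu)

rng = np.random.default_rng(5)
S = np.array([[2.0, 0.3], [0.3, 1.0]])   # posterior scale matrix, as in (6.5)
Q = draw_inv_wishart(rng, nu=50, S=S)
```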
Once we have a draw for parameters we can again draw factors. We continue these
iterations until our posterior distributions converge to a stationary distribution, and then
draw a sufficient number of parameters to suit our needs, that is, estimating posterior
means and variances of the parameters themselves. This whole process is re-stated in the
following algorithm:

Algorithm 2

1. Begin with an initial guess for the parameters H, R, B, and Q.

2. Draw factors X̃ using Algorithm 1.

3. Conditional on the drawn factors X̃, draw

a) Rjj from the distribution in (6.3) and Hj from the distribution in (6.4) for every series j.

b) Q from the distribution in (6.5) and B from the distribution in (6.6).

4. Repeat steps 2 and 3 until posterior distributions converge.

5. Once distributions converge, repeat steps 2 and 3 storing draws for R, H, Q, and B at each iteration.
The question of when the distribution has converged is not straightforward; see Robert and Casella (2010) for a discussion of convergence criteria. The software associated with this book draws a default 500 samples in a burn-in loop before moving on to step 5 above and storing output. Convergence criteria can then be applied to the stored output to determine whether the burn-in period must be increased.
6.2 Identification
Bayesian estimation of dynamic factor models is complicated by the fact that factors are
not identified. Identifying factors ensures that we are in fact drawing from the same
distributions in each step of Algorithm 2. The issue of identification is described in section
5.2. The simplest, though not the only, solution to identification of Bayesian dynamic factor models is what Stock and Watson (2011) call naming factors identification, that is, setting the top m × m submatrix of H (where m is the number of factors) to be the identity matrix. Employing this identification scheme in our sampling iterations requires that we create a temporary m × m H matrix for the first m series in Y, which I will call M. Based on our draw for factors X̃, draw the elements of Rjj, j ∈ {1, . . . , m}, and M for the first m series from distributions (6.3) and (6.4) respectively. To enforce that the top m × m submatrix of H is Im we then use M to normalize our draw for the factors X̃. This turns out to be
quite simple. Using the observation equation, our normalization is

yt,1:m = (M M−1) M x̃t,old + εt = Im x̃t,new + εt

so that x̃t,new = M x̃t,old, or, stacking time across rows,2

X̃new = X̃old M′
2 We postmultiply here as X̃old is the matrix of simulated factors with time indexed by rows.
Because we have not yet estimated the transition equation (we do that after normalizing
factors) or any other elements of H and R, this is all that is needed to identify our model.
For a model with p lags, where zt = [x′t x′t−1 . . . x′t−p+1]′, this normalization is applied to each m × m block of the stacked factors, that is, Z̃new = Z̃old (Ip ⊗ M)′.
6.3 An Example: Bayesian vs. ML Estimation for Simulated Data

As an example, consider data simulated from a ten-series, three-factor model,

(6.8) yt = Hxt + εt, where

H = [ 1     0     0
      0     1     0
      0     0     1
      1     0    −1
      1    −1     0.5
      0.5   0.5   0.5
      0.5   1     1
      0    −1     1
      0     0.5  −1
      0     0.5   1 ]

where

εt ∼ N(0, I10)

and

et ∼ N( 0, [1 0 0; 0 1 −0.5; 0 −0.5 1] )
Table 6.1: Mean squared error by estimation method for T = 50, T = 100, and T = 200
Note that the true value of the first observed series, y1,t − ε1,t, is just the first factor x1,t, and that this factor is independent of the other two factors.
My objective in this section will be to estimate the “true” data y^true_t = yt − εt, or simply y^true_t = Hxt, based on the observed series yt. Note that this true series is never observed.
In addition to the noise εt I drop observations from the first three series. Thus for these
series I have five observations followed by five missing values, then five observations and
so on. Simulations proceed by

1. Drawing factors and shocks and constructing the observed series yt

2. Removing observations from the first three series as described above

3. Estimating the model by maximum likelihood and by Bayesian simulation

4. Calculating the mean squared error (1/T) Σt (ŷt − y^true_t)² for each observed series and averaging over the 10 series

for 1000 repetitions.
Table 6.1 presents the results when I estimate a model with three factors and two lags. As
is evident from the table Bayesian estimation yields slightly better results in terms of mean
squared error. However, this comes at the cost of longer run time as we are simulating
entire distributions, not simply estimating a few parameters. Note, however, that Bayesian
estimation may be faster for models with large numbers of lags, such as mixed frequency
models, as inverting the covariance matrix for predicted factors Pt+1|t becomes increasingly
burdensome as p gets large.
Chapter 7
An Introduction to Mixed
Frequency Models
Because state space models are so adept at handling missing data they are particularly well suited to mixed frequency data sets in which, for example, a quarterly variable will
not be observed for two out of three months. The software accompanying this book does
not include routines for mixed frequency data as each case tends to be fairly specific and
components of routines will need to be tailored to the application. Instead, I provide a
brief introduction to simply outline how one might incorporate mixed frequency data into
the models discussed thus far.
Suppose first that our model is in log levels and, as a concrete example, that frequencies
are either monthly or quarterly. Denote ytq the log of a quarterly observation in month t
and ytm the log of a monthly observation in month t. Then
(7.1) exp(y^q_t) = exp(y^m_t) + exp(y^m_{t−1}) + exp(y^m_{t−2})

The difficulty lies in the fact that equation (1) is linear in the log variables while equation (7.1) is not; to overcome this issue simply take a linear approximation of (7.1) yielding

(7.2) y^q_t = (1/3)(y^m_t + y^m_{t−1} + y^m_{t−2})
Plugging equation (1) into the above yields the linear state space structure:

(7.3) y^q_t = (1/3)Hxt + (1/3)Hxt−1 + (1/3)Hxt−2 + εt
Note that this requires that the model include at least three lags of factors, although one
need not estimate coefficients on factors with more than one lag in the transition equation.
To put the model into log differences we begin with equation (7.2) and note that what
we observe, ∆ytq , is
(7.4) y^q_t − y^q_{t−3} = (1/3)(y^m_t − y^m_{t−3}) + (1/3)(y^m_{t−1} − y^m_{t−4}) + (1/3)(y^m_{t−2} − y^m_{t−5})
                        = (1/3)Δy^m_t + (2/3)Δy^m_{t−1} + Δy^m_{t−2} + (2/3)Δy^m_{t−3} + (1/3)Δy^m_{t−4}
This is the result presented in Mariano and Murasawa (2003). Unlike the levels case, we
now need to include at least four lags of the factors.
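The (1/3, 2/3, 1, 2/3, 1/3) weights in (7.4) arise because each term y^m_{t−i} − y^m_{t−3−i} telescopes into three one-month differences, so the weights are the convolution of (1, 1, 1)/3 with (1, 1, 1); a one-line numpy check confirms this:

```python
import numpy as np

# Mariano-Murasawa weights: the quarterly averaging filter (1,1,1)/3
# convolved with the three-month telescoping (1,1,1) of each difference.
w = np.convolve(np.ones(3) / 3, np.ones(3))
assert np.allclose(w, [1/3, 2/3, 1, 2/3, 1/3])
```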
Estimates of H and R now must take into account the structure of the data in (7.2) for
level data or (7.4) for differenced data. A simple way to proceed is to define a new helper
matrix Jq for low frequency data. Maintaining the monthly/quarterly structure and the
three lag model by way of example, the helper matrix for quarterly data, assuming we are
dealing with data in levels, will be
(7.5) Jq = [ (1/3)Im  (1/3)Im  (1/3)Im ]

while for differenced data it will be

(7.6) Jq = [ (1/3)Im  (2/3)Im  Im  (2/3)Im  (1/3)Im ]
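Helper matrices of this form are conveniently built with a Kronecker product; an illustrative numpy sketch:

```python
import numpy as np

m = 3   # number of factors

# (7.5): quarterly levels load equally on three monthly factor values
Jq_level = np.kron(np.ones(3) / 3, np.eye(m))

# (7.6): quarterly differences use the Mariano-Murasawa weights
Jq_diff = np.kron(np.array([1, 2, 3, 2, 1]) / 3, np.eye(m))

assert Jq_level.shape == (m, 3 * m)
assert Jq_diff.shape == (m, 5 * m)
```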
The difficulty with a more general structure, monthly/daily data for example, is two
fold. First, the number of high frequency periods (days) in a low frequency period (months)
gets large. For differenced data our helper matrix for monthly variables, assuming 31 days in the month, becomes the m × 61m matrix

(7.7) Jq = [ (1/31)Im  (2/31)Im  . . .  Im  . . .  (2/31)Im  (1/31)Im ]
Thus the vector of factors zt must include at least 61 lags. Second, the number of days in
the month is not constant. Nor is the number of days in February constant from one year
to the next. Thus our helper matrix Jq must change depending on the number of days in
the current month. Though conceptually simple, these facts can become computationally
difficult. To illustrate the point, our daily/monthly data requires 61 lags in zt. If our model includes a modest three factors, then zt has 183 elements, a huge number for a literature in which there are typically only a few latent factors.
The structure of the transition equation is an additional consideration when the number
of high frequency periods (days) in a low frequency period (months) is large. The frequency
for factors is the highest frequency in the model — in our daily/monthly example factors
are thus daily. However, assuming we are interested in nowcasting or forecasting a low
frequency (monthly) variable we would like to include, in our monthly-daily example,
at least a month of lags in the transition equation. This raises the problem of over-parameterization: with three factors and 30 lags we will be estimating 90 parameters in
the transition equation excluding the covariances in Q. Our solution is to again introduce
a helper matrix which, for the transition equation, we call JB . We then specify the number
of lags at each frequency in the model. For a daily/weekly/monthly model we then have pd daily lags, pw weekly lags, and pm monthly lags (in the transition equation we use the term month to refer to 30 days regardless of the actual number of days in the month). For a model with two factors and pd = 1, pw = 1, and pm = 1 our transition equation can be described by

B = [ b^d_11  b^d_12  b^w_11  b^w_12  b^m_11  b^m_12
      b^d_21  b^d_22  b^w_21  b^w_22  b^m_21  b^m_22 ]

JB = [ 1     0     0     0     0     0     0     . . .
       0     1     0     0     0     0     0     . . .
       1/7   0     1/7   0     1/7   0     1/7   . . .
       0     1/7   0     1/7   0     1/7   0     . . .
       1/30  0     1/30  0     1/30  0     1/30  . . .
       0     1/30  0     1/30  0     1/30  0     . . . ]

where b^w_21 refers to the multiplier of factor 1 on factor 2 at a weekly frequency, and finally

xt+1 = B JB zt + et
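The structure of JB can be generated programmatically. The following illustrative numpy sketch assumes lag-major block ordering, matching zt = [x′t−1 x′t−2 . . .]′, and builds the daily, weekly, and monthly rows:

```python
import numpy as np

m, n_lags = 2, 30    # two factors, 30 daily lags stacked in z_t

def lag_average(n_avg):
    """Row block averaging the first n_avg daily lag blocks of z_t."""
    w = np.zeros(n_lags)
    w[:n_avg] = 1.0 / n_avg
    return np.kron(w, np.eye(m))

JB = np.vstack([
    lag_average(1),     # daily "lag": x_{t-1} itself
    lag_average(7),     # weekly lag: average of the last 7 days
    lag_average(30),    # monthly lag: average of the last 30 days
])

assert JB.shape == (3 * m, n_lags * m)
# B is then m x 3m and the transition is x_{t+1} = B @ JB @ z_t + e_t
```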
Equipped with the (sparse) helper matrices described above our model fits into the
smoothing and filtering algorithms described in chapter 3. The only complication is that
H̃t = HJt changes from one month to the next as the number of days in a month changes.
This does not present any theoretical difficulties but requires careful coding when writing
estimation routines.
References
Doz, C., Giannone, D., and Reichlin, L., 2012. A Quasi-Maximum Likelihood Approach
for Large, Approximate Dynamic Factor Models. Review of Economics and Statistics,
94(4):1014–1024.
Durbin, J. and Koopman, S., 2001. An efficient and simple simulation smoother for state space time series analysis. Computing in Economics and Finance 2001 52, Society for Computational Economics.
Durbin, J. and Koopman, S. J., 2012. Time Series Analysis by State Space Methods. Oxford
University Press.
Hamilton, J., 1994. Time Series Analysis. Princeton University Press.
Kalman, R., 1960. A new approach to linear filtering and prediction problems. Transactions of the ASME - Journal of Basic Engineering, 82:35–45.
Koop, G., Poirier, D., and Tobias, J., 2007. Bayesian Econometric Methods. Cambridge
University Press.
Lütkepohl, H., 2007. New Introduction to Multiple Time Series Analysis. Springer.
Mariano, R. S. and Murasawa, Y., 2003. A new coincident index of business cycles based
on monthly and quarterly series. Journal of Applied Econometrics, 18(4):427–443.
Robert, C. P. and Casella, G., 2010. Monte Carlo Statistical Methods. Springer Publishing
Company, Incorporated.
Särkkä, S., 2013. Bayesian Filtering and Smoothing. Cambridge University Press.
Stock, J. H. and Watson, M. W., 2011. Dynamic Factor Models. In Clements, M. P. and
Hendry, D. F., editors, The Oxford Handbook of Economic Forecasting. Oxford, Oxford
Handbooks.
Stock, J. H. and Watson, M. W., 2016. Factor Models and Structural Vector Autore-
gressions in Macroeconomics. In Taylor, J. B. and Uhlig, H., editors, Handbook of
Macroeconomics, volume 2A, pages 415–525. North-Holland.
Watson, M. W. and Engle, R. F., 1983. Alternative Algorithms for the Estimation of
Dynamic Factor, Mimic and Varying Coefficient Regression Models. Journal of Econo-
metrics, 23(3):385–400.