
Practical Implementation of Dynamic

Factor Models

Seton Leonard
Preliminary Draft
© Seton Leonard 2018
Contents

1 The Necessary Bayesian Statistics
  1.1 Basic Examples
  1.2 Bayesian Linear Regression
  1.3 Summing Up

2 The Kalman Filter
  2.1 Preliminaries
  2.2 The Kalman Filter
  2.3 The Likelihood Function
  2.4 A Kalman Smoother
  2.5 The Steady State Filter
  2.6 An Example
  2.7 Missing Observations

3 The Durbin-Koopman Disturbance Smoother
  3.1 Durbin Koopman (2001) Filtering and Smoothing

4 Estimation via Principal Components
  4.1 Reduced Rank Regression
  4.2 Principal Components as a Special Case
  4.3 Summing Up

5 Maximum Likelihood Estimation via Watson and Engle (1983)
  5.1 The Algorithm
  5.2 Practical Issues
  5.3 An Example: Seasonally Adjusting Stationary Data

6 Bayesian Estimation by Simulation
  6.1 Sampling
  6.2 Identification
  6.3 An Example: Bayesian vs. ML Estimation for Simulated Data

7 An Introduction to Mixed Frequency Models

References

Index
Introduction

This short book is primarily concerned with estimating factor models and is accompanied by
R and C++/Armadillo code to implement the routines described herein. These estimation
routines are: (1) maximum likelihood estimation of factor models following Watson and
Engle (1983) and (2) Bayesian estimation of factor models by simulation. All routines
are designed to accept noisy and/or missing data; extracting meaningful information from
noisy data is, after all, the primary purpose of the filtering and smoothing algorithms upon
which factor models are based. Dedicating a book to dynamic factor models may sound
narrow, so it is worth noting at the outset that these models encompass a wide array of time
series tools, and every model can be adapted to backcasting (predicting past information
that was not observed), nowcasting (predicting contemporaneous unobserved information),
and forecasting (predicting future realizations).
In its simplest form a dynamic factor model is described by two equations: a measurement equation

yt = Hxt + εt

and a transition equation


xt = Bxt−1 + et

where yt is observed noisy data, xt are (typically) unobserved factors, H is a matrix of


factor loadings, and B is a matrix of parameters that determines factor dynamics.
With a slight generalization on the above every model in this book can be described
by these two equations: denote the measurement equation as

(1) yt = H̃zt + εt

and the transition equation as

(2) zt = Azt−1 + et

In the above zt is a vector of stacked factors such that zt = [x′t x′t−1 . . . x′t−p+1]′ where p is the number of lags in the model. H̃ = HJ where H is a matrix of loadings and J is a helper matrix. A is the companion form of the more familiar VAR model

(3) xt = B [x′t−1 x′t−2 · · · x′t−p]′ + et

Thus equipped we could, for example, write down a VAR model for noisy and/or missing data. Taking a simple three variable, three lag example we would have the observation equation

yt = H J zt + εt

where yt = [y1,t y2,t y3,t]′, H = I3, J = [I3 03 03], and zt = [x′t x′t−1 x′t−2]′, and the transition equation

xt = B zt−1 + et

where xt = [x1,t x2,t x3,t]′, B = [B1 B2 B3], and zt−1 = [x′t−1 x′t−2 x′t−3]′. Since we already know H in this case, the parameters to estimate are B and the covariance matrices of the shocks εt ∼ N([0], R) and et ∼ N([0], Q).¹
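The pieces of the three variable, three lag example can be assembled in a few lines. This is an illustrative sketch in Python rather than the book's accompanying R/C++ code, and the random B simply stands in for estimated VAR parameters:

```python
import numpy as np

# Illustrative sketch (not the book's accompanying R/C++ code): assembling
# the companion matrix A and the loading matrix H~ = HJ for the three
# variable, three lag example above. The random B stands in for estimates.
k, p = 3, 3                                  # variables and lags
rng = np.random.default_rng(0)
B = rng.normal(scale=0.1, size=(k, k * p))   # B = [B1 B2 B3]

# Companion form: B in the top block row, identities shifting the lags down
A = np.zeros((k * p, k * p))
A[:k, :] = B
A[k:, :-k] = np.eye(k * (p - 1))

# H = I_3 and the helper matrix J picks x_t out of z_t = [x_t' x_{t-1}' x_{t-2}']'
H = np.eye(k)
J = np.hstack([np.eye(k), np.zeros((k, k * (p - 1)))])
H_tilde = H @ J

print(H_tilde.shape)                          # → (3, 9)
print(np.max(np.abs(np.linalg.eigvals(A))))   # spectral radius; < 1 means stationary
```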

As a second example, one way to write an ARMA(1,1) model

yt − a1 yt−1 = εt + θεt−1

in state space form would be with the observation equation

yt = xt

and the transition equation


      
xt a θ xt−1 ε
= + t
εt 0 0 εt−1 εt

though the state space form of any model is never unique. As an aside, this formulation
of the ARMA(1,1) would make it easy to allow for measurement error in yt by simply
including an error term in the observation equation, that is, letting yt = xt + εt .
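A quick check (my own Python sketch, not the book's code) that this state space form reproduces the ARMA(1,1) recursion when both are fed the same shocks:

```python
import numpy as np

# Check that the state space form above reproduces the ARMA(1,1) recursion
# y_t = a1*y_{t-1} + eps_t + theta*eps_{t-1} given identical shocks.
rng = np.random.default_rng(1)
a1, theta, T = 0.7, 0.4, 200
eps = rng.normal(size=T)

# Direct ARMA recursion
y = np.zeros(T)
for t in range(1, T):
    y[t] = a1 * y[t - 1] + eps[t] + theta * eps[t - 1]

# State space recursion with state [x_t, eps_t]'
F = np.array([[a1, theta], [0.0, 0.0]])
s = np.array([0.0, eps[0]])       # initial state [x_0, eps_0]
x = np.zeros(T)
for t in range(1, T):
    s = F @ s + np.array([eps[t], eps[t]])
    x[t] = s[0]                   # observation equation: y_t = x_t

print(np.allclose(y, x))          # → True
```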
¹ Note that I am a bit loose with the dimensionality of et. When we use the companion form of the transition equation, et just has a bunch of zeros (m × (p − 1) of them) under the first m elements.

This book begins with some basic Bayesian statistics in chapter 1 that form the building blocks of our state space framework. Chapter 2 then introduces basic Kalman filtering and smoothing, followed in chapter 3 by Durbin and Koopman (2012)'s much more computationally efficient disturbance smoothing. Chapters 4 through 6 deal with estimating state space models. Chapter 4 briefly discusses estimation using principal components, though other approaches (maximum likelihood or Bayesian simulation) tend to yield better results. Chapter 5 presents Watson and Engle (1983)'s maximum likelihood approach (I do not cover simply plugging the likelihood function into a numerical optimization routine; it is easy to do but very slow compared with Watson and Engle (1983)'s EM algorithm). Chapter 6 presents Bayesian estimation by simulation for uniform frequency data. Finally, chapter 7 introduces mixed frequency models.
All of the notation in this book matches that in my R and C++ code so that the code is hopefully easy to follow. This text is intentionally short and to the point, geared towards implementation of state space models as opposed to derivations or theory. For a better handle on and more depth in factor models, Bayesian filtering and smoothing, and time series analysis more generally, Sarkka (2013), Durbin and Koopman (2012), Hamilton (1994), Lutkepohl (2007), and Koop et al. (2007) are good resources. As is no doubt already obvious, I assume the reader is familiar with matrix algebra. I also assume some familiarity with statistics, econometrics, and calculus.
Chapter 1

The Necessary Bayesian Statistics

Bayesian econometrics is founded on Bayes’ theorem, a simple equation which states that
the probability of an event A conditional on B equals the probability of B conditional on
A times the unconditional probability of A over the unconditional probability of B, that
is,

(1.1) P(A|B) = P(B|A)P(A) / P(B)

This result can be easily derived from the definition of conditional probability, which states
that the probability of A conditional on B equals the probability of A and B occurring
together over the probability of B, that is

(1.2) P(A|B) = P(A ∩ B) / P(B)

Using equation 1.2 we have


P (A|B)P (B) = P (A ∩ B)
and
P (B|A)P (A) = P (A ∩ B)
which, when combined, yields equation 1.1.
We can also write Bayes’ theorem in terms of probability density functions (or proba-
bility mass functions for discrete distributions) by again using the definition of conditional
probability

(1.3) f(y|x) = f(x, y) / f(x)

That is, the distribution of y conditional on x equals the joint distribution of x and y over
the distribution of x. The derivation of Bayes’ theorem, which we use in estimating the


distribution of parameters conditional on observed data (called the posterior distribution


of the parameters), is the same as above:
f (y|x)f (x) = f (x, y)
and
f (x|y)f (y) = f (x, y)
thus
(1.4) f(y|x) = f(x|y)f(y) / f(x)
In Bayesian jargon, f(y) is our prior distribution (or just prior) for y and f(y|x) is our posterior distribution (or just posterior) for y. The distribution f(x|y) is the conditional distribution of x given y (for example the distribution of the data x conditional on the parameters y) and f(x) is the marginal distribution of x, typically calculated as f(x) = ∫ f(x, y)dy = ∫ f(x|y)f(y)dy.

1.1 Basic Examples


Example 1: Testing for Illnesses
Testing for illnesses provides a morbid but illustrative example of Bayes rule. Suppose
1% of the population has an illness (but that there are no observable symptoms), that if
an individual has the illness he always tests positive, and that the probability of a false
positive is 10%. Denoting A = 1 as having the illness and B = 1 as testing positive then

A = 1 with prob. 0.01, 0 with prob. 0.99

and

(B|A = 1) = 1 with prob. 1, 0 with prob. 0
(B|A = 0) = 1 with prob. 0.1, 0 with prob. 0.9
Then
P (B = 1|A = 1)P (A = 1) = 1 × 0.01
and
P (B = 1) = P (B = 1|A = 1)P (A = 1) + P (B = 1|A = 0)P (A = 0)
= 1 × 0.01 + 0.1 × 0.99
= 0.109
thus the probability that an individual who tests positive once is in fact ill is

P(A = 1|B = 1) = P(B = 1|A = 1)P(A = 1) / P(B = 1)
= 0.01 / 0.109
= 0.0917

If an individual tests positive twice, the probability that he is in fact ill is

P(A = 1|B = 1, 1) = P(B = 1, 1|A = 1)P(A = 1) / P(B = 1, 1)
= 0.01 / (0.1 × 0.1 × 0.99 + 1 × 1 × 0.01)
= 0.5025
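Both calculations above take only a few lines to reproduce (an illustrative Python sketch, not the book's code):

```python
# Reproducing the illness-testing example with Bayes rule.
p_ill = 0.01             # P(A = 1)
p_pos_ill = 1.0          # P(B = 1 | A = 1)
p_pos_healthy = 0.10     # P(B = 1 | A = 0), the false positive rate

# One positive test
p_pos = p_pos_ill * p_ill + p_pos_healthy * (1 - p_ill)
post1 = p_pos_ill * p_ill / p_pos
print(round(p_pos, 3), round(post1, 4))      # → 0.109 0.0917

# Two (conditionally independent) positive tests
p_pos2 = p_pos_ill**2 * p_ill + p_pos_healthy**2 * (1 - p_ill)
post2 = p_pos_ill**2 * p_ill / p_pos2
print(round(post2, 4))                       # → 0.5025
```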

Example 2: Normal data with a Bernoulli prior


This question was on Mark Watson's Gerzensee midterm.¹ Suppose some data Y follows a
normal distribution with mean θ and standard deviation 1 and that θ follows a Bernoulli
distribution with p = 0.5, that is, we know θ is either 1 or 0 and we attach equal probability
to each outcome. Suppose we observe a single draw of Y = 0. What is our posterior for θ?
To answer this question we need the fact that

fY(Y|θ) = (1/√(2π)) exp{−(1/2)(Y − θ)²}

fθ(θ) = 0 with prob. 0.5, 1 with prob. 0.5

Then, evaluating fY(0|θ = 0) = 1/√(2π) and fY(0|θ = 1) = (1/√(2π)) e^(−1/2), we have that the denominator in Bayes rule for Y = 0 is

fY(0) = fY(0|θ = 0)fθ(0) + fY(0|θ = 1)fθ(1)
= (1/√(2π))(1/2) + (1/√(2π)) e^(−1/2) (1/2)

The numerator (the conditional distribution times our prior) is, for θ = 0 (recall θ can only take values 0 or 1)

fY(0|θ = 0)fθ(0) = (1/√(2π))(1/2)

thus for θ = 0

fθ(0|Y = 0) = [(1/√(2π))(1/2)] / [(1/√(2π))(1/2) + (1/√(2π)) e^(−1/2)(1/2)] = 1 / (1 + e^(−1/2))

For θ = 1

fθ(1|Y = 0) = [(1/√(2π)) e^(−1/2)(1/2)] / [(1/√(2π))(1/2) + (1/√(2π)) e^(−1/2)(1/2)] = e^(−1/2) / (1 + e^(−1/2))

thus our posterior distribution for θ is

f(θ|Y = 0) = 0 with prob. 1/(1 + e^(−1/2)), 1 with prob. e^(−1/2)/(1 + e^(−1/2))

¹ Gerzensee is an economics program for first year Swiss PhDs.
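A quick numerical check of this posterior (my own Python sketch, not part of the book's code):

```python
import math

# Numerical check of the Bernoulli-prior example above.
def f_Y(y, theta):
    """Normal density with mean theta and standard deviation 1."""
    return math.exp(-0.5 * (y - theta) ** 2) / math.sqrt(2 * math.pi)

prior = {0: 0.5, 1: 0.5}
y_obs = 0.0

# Denominator of Bayes rule: f_Y(0) = sum over theta of f_Y(0|theta) * prior
f_y = sum(f_Y(y_obs, th) * p for th, p in prior.items())
post = {th: f_Y(y_obs, th) * p / f_y for th, p in prior.items()}

print(abs(post[0] - 1 / (1 + math.exp(-0.5))) < 1e-12)                   # → True
print(abs(post[1] - math.exp(-0.5) / (1 + math.exp(-0.5))) < 1e-12)      # → True
```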

Example 3: Normal-Normal Conjugate Density


Suppose we believe a parameter θ follows a normal distribution

fp(θ) = (λ/√(2π)) exp{−(λ²/2)(θ − θ0)²}

where θ0 is our prior mean and 1/λ our prior standard deviation. Suppose also that we observe a vector X that follows a normal distribution

fX(X|θ) = (2πσ²)^(−N/2) exp{−(1/(2σ²)) Σi (xi − θ)²}

which we could alternatively write as

fX(X|θ) = (2π)^(−N/2) |Σ|^(−1/2) exp{−(1/2)(X − θ)′Σ⁻¹(X − θ)}

where Σ = σ²IN and N is the number of observations or elements of X. (Note here that I have assumed the standard deviation of the data σ is known. This assumption facilitates deriving and computing the posterior for θ, but we will properly specify a prior for both θ and σ in section 1.2.) To derive the posterior distribution of θ it is sufficient to show that this distribution also has the form of a normal; we need not worry about the constant fX(X) in the denominator of Bayes rule as this only ensures that the resulting distribution integrates to one, that is, ∫ fθ(θ|X)dθ = 1. Looking only at the numerator of Bayes rule, fX(X|θ)fp(θ), we have
fX (X|θ)fp (θ), we have
fθ(θ|X) ∝ exp{−(λ²/2)(θ − θ0)²} exp{−(1/(2σ²)) Σi (xi − θ)²}
∝ exp{−(1/2)[λ²(θ² − 2θθ0 + θ0²) + (1/σ²)(Σi xi² − 2 Σi xi θ + Nθ²)]}

In the second term above the only parts we care about are those which relate to the parameter θ; the rest goes into the constant of integration which ensures ∫ f(θ|X)dθ = 1, thus we can simply write the above term as

(1.5) fθ(θ|X) ∝ exp{−(1/2)[λ²(θ² − 2θθ0) + (1/σ²)(−2 Σi xi θ + Nθ²)]}
= exp{−(1/(2σ²))[(λσ)²(θ² − 2θθ0) − 2 Σi xi θ + Nθ²]}

The trick now is to try and re-write equation 1.5 as a normal distribution. The easiest way to proceed is to guess and check (assuming, of course, that we can make a good guess of the posterior). To this end suppose

(1.6) fθ(θ|X) ∝ exp{−((N + (λσ)²)/(2σ²)) [θ − (N + (λσ)²)⁻¹(Σi xi + (λσ)²θ0)]²}

Multiplying out equation 1.6 we have

fθ(θ|X) ∝ exp{−(1/(2σ²))[(N + (λσ)²)θ² − 2θ(Σi xi + (λσ)²θ0) + (Σi xi + (λσ)²θ0)²/(N + (λσ)²)]}

Again, since the term (Σi xi + (λσ)²θ0)²/(N + (λσ)²) does not contain our parameter of interest θ we can dump it into the constant of integration and keep only the relevant part of the above term,

fθ(θ|X) ∝ exp{−(1/(2σ²))[(N + (λσ)²)θ² − 2θ(Σi xi + (λσ)²θ0)]}
= exp{−(1/(2σ²))[(λσ)²(θ² − 2θθ0) − 2 Σi xi θ + Nθ²]}

which is equation 1.5. Thus our posterior distribution of θ given the realization of the data X is also normal with mean (N + (λσ)²)⁻¹(Σi xi + (λσ)²θ0) and variance σ²/(N + (λσ)²). Notice that as the standard deviation of our prior becomes smaller our posterior variance also shrinks (λ is therefore sometimes referred to as a "shrinkage parameter") but our posterior mean becomes biased. In the extreme case that λ → ∞ our posterior mean is our prior mean θ0 and our posterior variance is zero. Thus by specifying a prior we are reducing the variance of our parameter estimates at the cost of introducing a bias.
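The conjugate result is easy to verify numerically. The following is my own sketch (not the book's code): the closed-form posterior mean and variance are compared against a brute-force evaluation of prior times likelihood on a fine grid.

```python
import numpy as np

# Closed-form normal-normal posterior vs. brute-force grid integration.
rng = np.random.default_rng(2)
theta0, lam, sig = 1.0, 2.0, 1.5           # prior mean, 1/lam = prior sd, known sd
x = rng.normal(loc=0.5, scale=sig, size=50)
N = x.size

shrink = (lam * sig) ** 2                   # the (lambda * sigma)^2 term
mean_closed = (x.sum() + shrink * theta0) / (N + shrink)
var_closed = sig**2 / (N + shrink)

# Brute force: evaluate the unnormalized posterior on a grid and normalize
grid = np.linspace(-2.0, 3.0, 20001)
log_post = (-0.5 * lam**2 * (grid - theta0) ** 2
            - 0.5 * ((x[:, None] - grid[None, :]) ** 2).sum(0) / sig**2)
w = np.exp(log_post - log_post.max())
w /= w.sum()
mean_grid = (grid * w).sum()
var_grid = ((grid - mean_grid) ** 2 * w).sum()

print(abs(mean_closed - mean_grid) < 1e-4)   # → True
print(abs(var_closed - var_grid) < 1e-4)     # → True
```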

1.2 Bayesian Linear Regression


Bayesian linear regression will become a sort of all-purpose tool with which we will estimate both the observation equation and the transition equation of our factor model when we come to estimation by simulation in chapter 6. Our model in this section is a simple linear regression. In the univariate case we have

(1.7) y = Xβ + ε

where y, the dependent variable, is a T × 1 vector, X is a T × m matrix of explanatory variables, β is an m × 1 vector of unknown parameters, and ε is a T × 1 vector of shocks with distribution²

εt ∼ N(0, σ)
In the multivariate case our model is

(1.8) Y = XB 0 + ε

in which Y is a T × k matrix of k dependent variables (observations t are indexed by row though the index here need not necessarily be time), X is a T × m matrix of m explanatory

² I use σ here to denote variance, not standard deviation as is often the case, since for the multivariate model Σ also denotes variance.

variables, B is an m × k matrix of unknown parameters, and ε is a T × k matrix of shocks with distribution

εt ∼ N([0], Σ)

In each section below I begin with the derivation of the posterior when we assume the covariance of shocks is known before moving to normal-inverse gamma or normal-inverse Wishart conjugate priors which include posteriors for the covariance of shocks.

Simple Univariate Linear Regression


Assuming initially that the distribution of shocks is known and beginning with the univariate case, our prior for β is³

π(β) ∝ exp{−(1/(2σ))(β − β0)′Λ0(β − β0)}

where β0 is our prior for β and Λ0, our prior precision for β, defines the strength of our prior beliefs. The distribution for our model in (1.7) is

f(y|β, σ) ∝ exp{−(1/(2σ))(y − Xβ)′(y − Xβ)}

so that our posterior is

f(β|y, X, σ) ∝ exp{−(1/(2σ))[(y − Xβ)′(y − Xβ) + (β − β0)′Λ0(β − β0)]}

Multiplying out the terms in square brackets above and ignoring those which do not contain β and thus become part of the constant of integration we have

(1.9) f(β|y, X, σ) ∝ exp{−(1/(2σ))[−2y′Xβ + β′X′Xβ + β′Λ0β − 2β′Λ0β0]}

Letting Λn = (X′X + Λ0) and again only keeping track of terms which do not form part of our constant of integration we have

f(β|y, X, σ) ∝ exp{−(1/(2σ))[β − Λn⁻¹(X′y + Λ0β0)]′Λn[β − Λn⁻¹(X′y + Λ0β0)]}
∝ exp{−(1/(2σ))[β′X′Xβ + β′Λ0β − 2β′X′y − 2β′Λ0β0]}

These terms match those in equation (1.9), thus our posterior for β is distributed

(1.10) β ∼ N((X′X + Λ0)⁻¹(X′y + Λ0β0), σ(X′X + Λ0)⁻¹)

This derivation assumes we know σ already, which of course will not be the case in practice.

³ Specifying the variance of our prior in terms of the variance of shocks keeps the algebra much cleaner but is not strictly necessary. Also note that, like Σ, σ denotes variance, not standard deviation as is often the case.
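Equation (1.10) can be sketched directly in code (illustrative Python, not the book's accompanying routines). With Λ0 = 0 the posterior mean is exactly OLS; with a very tight prior it is pulled to β0:

```python
import numpy as np

# Posterior mean and covariance of equation (1.10) with known sigma.
rng = np.random.default_rng(3)
T, m = 200, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta_true = np.array([1.0, -0.5])
sigma = 1.0                                  # known shock variance
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma), size=T)

def posterior(beta0, Lambda0):
    Lam_n = X.T @ X + Lambda0
    mean = np.linalg.solve(Lam_n, X.T @ y + Lambda0 @ beta0)
    cov = sigma * np.linalg.inv(Lam_n)
    return mean, cov

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
mean_flat, _ = posterior(np.zeros(m), np.zeros((m, m)))    # no prior information
mean_tight, _ = posterior(np.zeros(m), 1e6 * np.eye(m))    # very tight prior at 0

print(np.allclose(mean_flat, beta_ols))      # → True
print(np.max(np.abs(mean_tight)) < 1e-2)     # → True
```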

Simple Multivariate Linear Regression


To derive the posterior for our multivariate linear regression when we assume the distribution of shocks to the model εt ∼ N(0, σ) is known, begin by writing the model for an observation t as

yt = Bxt + εt

and define the vectorization of B as

β = vec(B)

Our prior for β is

π(β|σ) ∝ exp{−(1/2)(β − β0)′V0⁻¹(β − β0)}

where V0⁻¹ = Λ ⊗ σ⁻¹ and Λ determines our prior tightness, that is, the strength of our prior beliefs. The distribution of our model requires a few more definitions. Let y be the vector of observations yt stacked over time (that is, y = vec(Y′)) and 𝐗 be the associated matrix of explanatory variables. Explicitly, 𝐗 = X ⊗ Ik where k is the number of observed variables in yt. Then

f(y|β, σ, X) ∝ exp{−(1/2)(y − 𝐗β)′Σ⁻¹(y − 𝐗β)}

where Σ⁻¹ = IT ⊗ σ⁻¹. Our posterior is then

f(β|X, y, σ) ∝ exp{−(1/2)[(β − β0)′V0⁻¹(β − β0) + (y − 𝐗β)′Σ⁻¹(y − 𝐗β)]}

Multiplying this expression out (and ignoring terms that don't contain β) leaves us with

f(β|X, y, σ) ∝ exp{−(1/2)[β′V0⁻¹β − 2y′Σ⁻¹𝐗β + β′𝐗′Σ⁻¹𝐗β − 2β′V0⁻¹β0]}

Using the property for Kronecker products that for conformable matrices (A ⊗ B)(C ⊗ D) = AC ⊗ BD this simplifies to

f(β|X, y, σ) ∝ exp{−(1/2)[β′V0⁻¹β − 2y′(X ⊗ σ⁻¹)β + β′(X′X ⊗ σ⁻¹)β − 2β′V0⁻¹β0]}

in which case we can write the posterior as

f(β|X, y, σ) ∝ exp{−(1/2)[β − V1⁻¹((X′ ⊗ σ⁻¹)y + V0⁻¹β0)]′V1[β − V1⁻¹((X′ ⊗ σ⁻¹)y + V0⁻¹β0)]}

where

V1 = X′X ⊗ σ⁻¹ + V0⁻¹

or, using our definition of V0⁻¹ = Λ ⊗ σ⁻¹,

V1 = (X′X + Λ) ⊗ σ⁻¹

That is, our posterior for β is normally distributed

β ∼ N(V1⁻¹[(X′ ⊗ σ⁻¹)y + (Λ ⊗ σ⁻¹)β0], V1⁻¹)

In the future we can use a matrix normal distribution for our multivariate normal models which avoids the hassle of having to vectorize and then simplify our model, though it is hopefully useful to illustrate the vectorized version at least once.
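The vectorization underlying this derivation is easy to check numerically. A sketch of my own (not the book's code): with Y = XB′, so that Y′ = BX′, the stacked vector y = vec(Y′) satisfies y = (X ⊗ Ik)vec(B). NumPy stores arrays row-major, so vec (column stacking) is `flatten(order="F")`:

```python
import numpy as np

# Check that vec(Y') = (X kron I_k) vec(B) when Y = X B'.
rng = np.random.default_rng(4)
T, m, k = 7, 3, 2
X = rng.normal(size=(T, m))
B = rng.normal(size=(k, m))                 # k x m coefficient matrix

Y = X @ B.T                                 # T x k matrix of observations
y = Y.T.flatten(order="F")                  # vec(Y'): the y_t stacked over time
XX = np.kron(X, np.eye(k))                  # the Kronecker-expanded regressor matrix
beta = B.flatten(order="F")                 # vec(B)

print(np.allclose(y, XX @ beta))            # → True
```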

Univariate Linear Regression with Normal-Inverse Gamma Prior


The previous two subsections have assumed the variances of shocks to our models are
known. This simplifies the derivations but is not a very realistic assumption. These
variances and, in the multivariate case, covariance matrices are particularly important
when we come to filtering and smoothing in chapter 3 as they determine how aggressively
we should update our estimates of unobserved factors based on observables. Getting good
estimates of the scale of shocks is therefore a high priority. My conjugate prior (meaning
the prior and posterior have the same form) for the univariate case models the parameters
β as normally distributed while my prior for σ follows an inverse gamma distribution.
Explicitly,  
− k2 1 0
π(β|σ) ∝ (σ) exp − (β − β0 ) Λ0 (β − β0 )

and ν0
n s o
0
π(σ) ∝ σ 2 −1 exp −

In the above inverse gamma distribution s0 is our prior scale parameter. Loosely inter-
preted, this is our prior for the residual sum of squares in our model. ν0 is our prior scale
parameter, which we can interpret as the number of “prior observations”. Increasing ν0
will force our posterior for σ towards ν10 s0 .
As we model innovations εt as normally distributed, our model has the form

f(y|β, σ, X) ∝ σ^(−T/2) exp{−(1/(2σ))(y − Xβ)′(y − Xβ)}

so our posterior is the product f(y|β, σ, X) π(β|σ) π(σ):

f(β, σ|y, X) ∝ σ^(−T/2) exp{−(1/(2σ))(y − Xβ)′(y − Xβ)}
× σ^(−k/2) exp{−(1/(2σ))(β − β0)′Λ0(β − β0)}
× σ^(−ν0/2−1) exp{−s0/(2σ)}

Beginning by multiplying out the exponential terms we have

(1.11) −(1/(2σ))[y′y − 2y′Xβ + β′X′Xβ + β′Λ0β − 2β′Λ0β0 + β0′Λ0β0 + s0]

If we define ΛT ≡ X′X + Λ0 then

(1.12) [β − ΛT⁻¹(X′y + Λ0β0)]′ΛT[β − ΛT⁻¹(X′y + Λ0β0)] =
β′ΛTβ − 2β′(X′y + Λ0β0) + (X′y + Λ0β0)′ΛT⁻¹(X′y + Λ0β0)

Comparing terms in 1.12 and 1.11 shows that we can re-write the exponential terms of the posterior as

(1.13) −(1/(2σ))[ [β − ΛT⁻¹(X′y + Λ0β0)]′ΛT[β − ΛT⁻¹(X′y + Λ0β0)] +
y′y − (X′y + Λ0β0)′ΛT⁻¹(X′y + Λ0β0) + β0′Λ0β0 + s0 ]

The first line in 1.13 forms the normal part of our normal-inverse gamma posterior; the second line contains terms that will go into the posterior scale parameter of the inverse gamma distribution. We can re-write the scale parameter for the posterior by noting that the posterior mean for β is βT = ΛT⁻¹(X′y + Λ0β0). Thus

(1.14) (y − XβT)′(y − XβT) + (βT − β0)′Λ0(βT − β0) =
y′y − 2y′XβT + βT′X′XβT + βT′Λ0βT − 2βT′Λ0β0 + β0′Λ0β0 =
y′y − 2(y′X + β0′Λ0)βT + βT′ΛTβT + β0′Λ0β0

where X′X + Λ0 = ΛT. Using the definitions of βT and ΛT,

βT′ΛTβT − (y′X + β0′Λ0)βT = 0

so that collecting all the terms in our posterior we have

f(β, σ|y, X) ∝ σ^(−k/2) exp{−(1/(2σ))[β − ΛT⁻¹(X′y + Λ0β0)]′ΛT[β − ΛT⁻¹(X′y + Λ0β0)]}
× σ^(−(ν0+T)/2−1) exp{−(1/(2σ))[(y − XβT)′(y − XβT) + (βT − β0)′Λ0(βT − β0) + s0]}
That is, our posterior is a normal-inverse gamma distribution such that

f(β|σ, X, y) ∼ N(ΛT⁻¹(X′y + Λ0β0), σΛT⁻¹)

where ΛT = (X′X + Λ0) and f(σ|X, y) follows an inverse gamma distribution with scale parameter sT = (y − XβT)′(y − XβT) + (βT − β0)′Λ0(βT − β0) + s0 and νT = T + ν0.
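The sum-of-squares identity (1.14) behind the scale parameter can be checked numerically. This is my own illustrative sketch (simulated data and variable names are mine, not the book's):

```python
import numpy as np

# Numerical check of identity (1.14) and the posterior IG parameters.
rng = np.random.default_rng(5)
T, m = 60, 3
X = rng.normal(size=(T, m))
y = X @ np.array([0.5, -1.0, 0.2]) + rng.normal(size=T)
beta0 = np.zeros(m)
Lambda0 = 2.0 * np.eye(m)
s0, nu0 = 1.0, 4.0

Lam_T = X.T @ X + Lambda0
beta_T = np.linalg.solve(Lam_T, X.T @ y + Lambda0 @ beta0)

# Left side: residual and prior sums of squares evaluated at beta_T
lhs = ((y - X @ beta_T) @ (y - X @ beta_T)
       + (beta_T - beta0) @ Lambda0 @ (beta_T - beta0))
# Right side: the expanded form in (1.14)
rhs = (y @ y - 2 * (y @ X + beta0 @ Lambda0) @ beta_T
       + beta_T @ Lam_T @ beta_T + beta0 @ Lambda0 @ beta0)
print(np.isclose(lhs, rhs))                  # → True

# The posterior inverse gamma parameters are then
s_T, nu_T = lhs + s0, T + nu0
print(s_T > 0, nu_T)                         # → True 64.0
```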

Multivariate Linear Regression with Normal-Inverse Wishart Prior


To derive the posterior for our multivariate normal model

Yt = B′Xt + εt

or, for all observations,

Y = XB + ε

we will use the multivariate equivalent of our normal-inverse gamma conjugate prior for the univariate case, that is, a matrix normal-inverse Wishart distribution.⁴ Accordingly, our priors are

π(Σ) ∼ IW(V0, ν0)

and

π(B|Σ) ∼ MN(B0, Λ0⁻¹, Σ)

which is equivalent to the vectorized form where β = vec(B′)⁵

π(β|Σ) ∼ N(β0, Λ0⁻¹ ⊗ Σ)

Given these priors and the distribution for our model

f(Y|B, Σ, X) ∼ MN(XB, IT, Σ)

we can write our posterior as the product π(Σ) π(B|Σ) f(Y|B, Σ, X):

f(B, Σ|Y, X) ∝ |Σ|^(−(ν0+k+1)/2) exp{−(1/2)tr(V0Σ⁻¹)}
× |Σ|^(−m/2) exp{−(1/2)tr[(B − B0)′Λ0(B − B0)Σ⁻¹]}
× |Σ|^(−T/2) exp{−(1/2)tr[(Y − XB)′(Y − XB)Σ⁻¹]}

As previously, the trick is to write our posterior as a sum of squares. Using the fact that tr(A) + tr(B) = tr(A + B) and dealing first with the exponential terms, our result is the matrix equivalent to equation (1.11):

(1.15) −(1/2)tr[(Y′Y − 2Y′XB + B′X′XB + B′Λ0B − 2B0′Λ0B + B0′Λ0B0 + V0)Σ⁻¹]

⁴ Note that in this section I will write B as its transpose for ease of notation. That is, B in this section corresponds to its transpose elsewhere. This is due to the fact that we will write our stacked model of observations as Y = XB + ε.
⁵ I use the vectorized form β = vec(B′) as this stacks B over rows; thus the resulting vector β0 corresponds to the covariance matrix Λ0⁻¹ ⊗ Σ and to 𝐗 = X ⊗ Ik as defined in the derivation for a simple normal-normal multivariate linear regression.

As previously, define ΛT = X′X + Λ0. We can again propose the sum of squares portion of our solution as

(1.16) [B − ΛT⁻¹(X′Y + Λ0B0)]′ΛT[B − ΛT⁻¹(X′Y + Λ0B0)] =
B′ΛTB − 2B′(X′Y + Λ0B0) + (X′Y + Λ0B0)′ΛT⁻¹(X′Y + Λ0B0)

Again, comparing terms in 1.16 and 1.15 shows that we can re-write the exponential terms of the posterior as

(1.17) −(1/2)tr{[B − ΛT⁻¹(X′Y + Λ0B0)]′ΛT[B − ΛT⁻¹(X′Y + Λ0B0)]Σ⁻¹ +
[Y′Y − (X′Y + Λ0B0)′ΛT⁻¹(X′Y + Λ0B0) + B0′Λ0B0 + V0]Σ⁻¹}

so that our final result is the matrix equivalent to our result in the univariate case, that is,

(1.18) f(B, Σ|Y, X) ∝ |Σ|^(−m/2) exp{−(1/2)tr[(B − ΛT⁻¹(X′Y + Λ0B0))′ΛT(B − ΛT⁻¹(X′Y + Λ0B0))Σ⁻¹]}
× |Σ|^(−(ν0+k+T+1)/2) exp{−(1/2)tr[((Y − XBT)′(Y − XBT) + (BT − B0)′Λ0(BT − B0) + V0)Σ⁻¹]}

where BT = ΛT⁻¹(X′Y + Λ0B0). Thus our posterior for B conditional on Σ and the data is

f(B|Σ, X, Y) ∼ MN(ΛT⁻¹(X′Y + Λ0B0), ΛT⁻¹, Σ)

and our posterior for Σ conditional on the data is

f(Σ|X, Y) ∼ IW((Y − XBT)′(Y − XBT) + (BT − B0)′Λ0(BT − B0) + V0, ν0 + T)

where ΛT = X′X + Λ0 and BT = ΛT⁻¹(X′Y + Λ0B0). Note that for the practical purpose of drawing B in programs to simulate these posterior distributions we will need to use the vectorized form β = vec(B′) of

f(β|Σ, X, Y) ∼ N(vec{[ΛT⁻¹(X′Y + Λ0B0)]′}, ΛT⁻¹ ⊗ Σ)

1.3 Summing Up
These results will form the basis of our Bayesian estimation routines for dynamic factor
models. In chapter six we will construct posterior densities for parameters of our model
by simulation. These simulations will use the conditional densities we have derived above.
In summary, for a normal-inverse gamma conjugate prior our prior for β is

π(β|σ) ∝ σ^(−k/2) exp{−(1/(2σ))(β − β0)′Λ0(β − β0)}

and our prior for sigma is

π(σ) ∝ σ^(−ν0/2−1) exp{−s0/(2σ)}

With the model distribution

f(y|β, σ, X) ∝ σ^(−T/2) exp{−(1/(2σ))(y − Xβ)′(y − Xβ)}

the posterior distribution for β conditional on σ and the data X and y is

f(β|σ, X, y) ∼ N(ΛT⁻¹(X′y + Λ0β0), σΛT⁻¹)

where ΛT = (X′X + Λ0) and f(σ|X, y) follows an inverse gamma distribution with scale parameter sT = (y − XβT)′(y − XβT) + (βT − β0)′Λ0(βT − β0) + s0 and νT = T + ν0.
In the case of the multivariate model with a normal-inverse Wishart conjugate prior our prior for β conditional on σ is,⁶ in vectorized form,

π(β|σ) ∝ exp{−(1/2)(β − β0)′V0⁻¹(β − β0)}

where V0⁻¹ = Λ0 ⊗ σ⁻¹ and Λ0 determines our prior tightness. Using a prior scale parameter of Ik our prior for σ ∼ IW(Ik, ν0) is

π(σ) ∝ |σ|^(−(ν0+k+1)/2) exp{−(1/2)tr(σ⁻¹)}

Our model is⁷

yt = BXt + εt

so that β = vec(B) and thus the model distribution (again in vectorized form) is

f(y|β, σ, X) ∝ exp{−(1/2)(y − 𝐗β)′Σ⁻¹(y − 𝐗β)}

where 𝐗 = X ⊗ Ik. Then the posterior for β conditional on σ is

f(β|σ, X, y) ∼ N(vec{[ΛT⁻¹(X′Y + Λ0B0′)]′}, ΛT⁻¹ ⊗ σ)

where ΛT = X′X + Λ0 and f(σ|X, y) follows an inverse-Wishart distribution with scale parameter ST = Ik + (Y − XBT′)′(Y − XBT′) + (BT − B0)Λ0(BT − B0)′ and νT = ν0 + T. Note here that BT = [ΛT⁻¹(X′Y + Λ0B0′)]′ is the k × m matrix formed by stacking βT and similarly B0 is the stacked form of β0.
⁶ In using the vectorized form of the problem I use σ to denote the k × k covariance matrix of shocks εt and Σ to denote IT ⊗ σ. In using the matrix normal distribution Σ is sufficient to denote the k × k covariance matrix of shocks as using the Kronecker product is not necessary.
⁷ Note here that B is back to its normal k × m orientation as opposed to its definition in the previous section.
Chapter 2

The Kalman Filter

Kalman filtering, named for Kalman (1960), is a means of estimating the time series model

(2.1) yt = Hxt + εt

(2.2) xt = Axt−1 + et

where xt is an unobserved state or states, yt observed outcomes, and εt and et are normally distributed error terms with the covariance matrix

Cov([et εt]′) = [Q 0; 0 R]

States, observed outcomes, and error terms can be either scalars or vectors. The Kalman filter works by first predicting an outcome xt|t−1, where the subscripts indicate the prediction of xt based on all observations from the initial period until period t − 1, and then updating this prediction using the outcome from period t. Formally, we first predict

(2.3) p(xt|y1:t−1) = ∫ p(xt|xt−1) p(xt−1|y1:t−1) dxt−1

and then update this prediction based on the current observation yt

(2.4) p(xt|y1:t) ∝ p(yt|xt) p(xt|y1:t−1)

In this way equation 2.4 is a Bayesian estimate of our unobserved states xt using equation 2.3 as our prior. Before writing down what the prediction and updating equations will in fact be when our model follows 2.1 and 2.2, it's worth looking at a few features of the multivariate normal distribution.

2.1 Preliminaries
Suppose we know that the scalars x and y follow the multivariate normal distribution

(2.5) [x y]′ ∼ N([µ m]′, Σ)

where Σ = [P C; C S] and we're interested in determining the distribution f(x|y). Using the definition of a conditional distribution we know that

f(x|y) = f(x, y) / f(y)

or, since f(y) is a normalizing constant,

f(x|y) ∝ |Σ|^(−1/2) exp{−(1/2)[x − µ, y − m] Σ⁻¹ [x − µ, y − m]′}

The inverse of the covariance matrix is Σ⁻¹ = (1/(PS − C²))[S −C; −C P], thus the exponential terms are

−(1/(2(PS − C²)))[(x − µ)²S + (y − m)²P − 2(x − µ)(y − m)C]

We can re-write this expression as

−(1/(2(PS − C²)))[(y − m)²P + (x − µ − (y − m)CS⁻¹)S(x − µ − (y − m)CS⁻¹) − (y − m)²C²S⁻¹]

or, dumping the terms which don't contain our parameter of interest x into the normalizing constant,

(2.6) −(1/2)(x − (µ + (y − m)CS⁻¹)) P̃⁻¹ (x − (µ + (y − m)CS⁻¹))

where P̃ = P − C²S⁻¹. f(x|y) is therefore normally distributed with mean µ + (y − m)CS⁻¹ and variance P̃ = P − C²S⁻¹. This result generalizes to the case in which x and y are vectors with covariance matrix Σ = [P C; C′ S] as

E(x|y) = µ + CS⁻¹(y − m)

and

Var(x|y) = P − CS⁻¹C′
These results are essentially all we need to derive the Kalman filter.
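The conditioning formulas are easy to verify numerically in the scalar case (my own check, not from the book): f(x, y)/f(y) should coincide with the normal density implied by the conditional mean and variance.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Verify E(x|y) = mu + C S^{-1}(y - m) and Var(x|y) = P - C S^{-1} C'.
mu, m = 1.0, -0.5
P, S, C = 2.0, 1.5, 0.8
Sigma = np.array([[P, C], [C, S]])

y = 0.3
cond_mean = mu + C / S * (y - m)
cond_var = P - C**2 / S

joint = multivariate_normal(mean=[mu, m], cov=Sigma)
f_y = norm(loc=m, scale=np.sqrt(S)).pdf(y)

for x in [-1.0, 0.0, 2.5]:
    direct = joint.pdf([x, y]) / f_y          # f(x, y) / f(y)
    implied = norm(loc=cond_mean, scale=np.sqrt(cond_var)).pdf(x)
    print(np.isclose(direct, implied))        # → True (three times)
```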

2.2 The Kalman Filter


To use the results from section 2.1 requires two elements. The first is our model, equations 2.1 and 2.2. The second is the joint distribution of xt|t−1 and yt; this is not as obvious as it may seem since we never in fact observe xt (or xt−1). Therefore we need to define a new variable, call it µt|t, which is our expected value of xt given observations y1:t. Define the variance of xt|t as Pt|t and the variance of xt|t−1 (xt given observations 1 : t − 1) as Pt|t−1. Then the joint distribution for xt|t−1 and yt is

[xt|t−1 yt]′ ∼ N([Aµt−1|t−1; H(Aµt−1|t−1)], Σt)

where

Σt = [Pt|t−1 Ct; Ct′ St]

From the results in section 2.1 we can immediately calculate the expected value of xt|t−1 (which is unobserved) given yt (which is observed), E(xt|t−1|yt) = Aµt−1|t−1 + CtSt⁻¹(yt − H(Aµt−1|t−1)), as well as Var(xt|t−1|yt) = Pt|t−1 − CtSt⁻¹Ct′. However, we still need to derive the values for Σt. For notational convenience write xt|t−1 as simply xt, which is distinct from xt|t. Writing

Σt = Et|t−1 [ (xt − Aµt−1|t−1)(xt − Aµt−1|t−1)′, (xt − Aµt−1|t−1)(yt − HAµt−1|t−1)′ ;
(yt − HAµt−1|t−1)(xt − Aµt−1|t−1)′, (yt − HAµt−1|t−1)(yt − HAµt−1|t−1)′ ]

where the four blocks are Pt|t−1, Ct, Ct′, and St respectively, we then have

Pt|t−1 = Et|t−1[xt xt′ − 2xt µ′t−1|t−1 A′ + Aµt−1|t−1 µ′t−1|t−1 A′]
= Et|t−1[(Axt−1 + et)(Axt−1 + et)′ − 2(Axt−1 + et)µ′t−1|t−1 A′ + Aµt−1|t−1 µ′t−1|t−1 A′]
= Et|t−1(Axt−1 x′t−1 A′) + Et|t−1(et et′) − Aµt−1|t−1 µ′t−1|t−1 A′
= APt−1|t−1 A′ + Q

since Pt−1|t−1 = Et−1|t−1(xt−1 x′t−1) − µt−1|t−1 µ′t−1|t−1. Similarly

St = HPt|t−1 H′ + R

And finally

Ct = Et|t−1[xt yt′ − xt µ′t−1|t−1 A′H′ − Aµt−1|t−1 yt′ + Aµt−1|t−1 µ′t−1|t−1 A′H′]
= Et|t−1[(Axt−1 + et)(H(Axt−1 + et) + εt)′ − 2(Axt−1 + et)µ′t−1|t−1 A′H′ + Aµt−1|t−1 µ′t−1|t−1 A′H′]
= Et|t−1(Axt−1 x′t−1 A′)H′ + Et|t−1(et et′)H′ − Aµt−1|t−1 µ′t−1|t−1 A′H′
= Pt|t−1 H′

Thus equipped we can write the Kalman filter as follows. Our prediction for the mean and variance of x_t (without conditioning on y_t) is

    µ_{t|t−1} = A µ_{t−1|t−1}
    P_{t|t−1} = A P_{t−1|t−1} A' + Q

Our prediction for the mean and variance of y_t given observations 1 : t − 1 (before y_t is observed), denoted y_{t|t−1}, is

    m_{t|t−1} = H µ_{t|t−1}
    S_t = H P_{t|t−1} H' + R

Our prediction for the covariance between x_t (again, without using y_t) and y_{t|t−1} is

    C_t = P_{t|t−1} H'

Note that the Kalman gain combines this covariance and the estimated variance of y_{t|t−1} and is typically written as K_t = C_t S_t^{-1}. The above equations, called the prediction step, give us our prior. Our posterior estimates for the mean and variance of x_t given y_t (recall y_t is observed), called the updating step, are

    µ_{t|t} = µ_{t|t−1} + C_t S_t^{-1} ( y_t − m_{t|t−1} )

    P_{t|t} = P_{t|t−1} − C_t S_t^{-1} C_t'
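The prediction and updating steps above can be sketched in a few lines for the scalar case (one factor and one observable, so A, H, Q, and R are all scalars). This is an illustrative sketch, not the accompanying R/C++ code:

```python
def kalman_filter(y, A, H, Q, R, mu0=0.0, P0=1e5):
    """Scalar Kalman filter: returns the filtered moments (mu_{t|t}, P_{t|t})."""
    mu, P = mu0, P0
    out = []
    for y_t in y:
        # Prediction step (our prior)
        mu_pred = A * mu                # mu_{t|t-1}
        P_pred = A * P * A + Q          # P_{t|t-1}
        m_pred = H * mu_pred            # m_{t|t-1}
        S = H * P_pred * H + R          # S_t
        C = P_pred * H                  # C_t
        # Updating step (our posterior); K = C / S is the Kalman gain
        K = C / S
        mu = mu_pred + K * (y_t - m_pred)   # mu_{t|t}
        P = P_pred - K * C                  # P_{t|t}
        out.append((mu, P))
    return out

filtered = kalman_filter([1.2, 0.8, 1.1, 0.9], A=0.9, H=1.0, Q=0.1, R=0.5)
```

With a diffuse prior (large P0) the first update essentially sets µ_{1|1} to the first observation, and the filtered variances settle quickly in later periods.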

2.3 The Likelihood Function



We can write the likelihood of observing y_1, y_2, . . . , y_T as

    f(y_{1:T}) = f(y_T | y_{1:T−1}) f(y_{1:T−1}) = f(y_T | y_{1:T−1}) f(y_{T−1} | y_{1:T−2}) f(y_{1:T−2}) = ∏_{t=1}^{T} f(y_t | y_{1:t−1})

where f(y_t | y_{1:t−1}) is normally distributed with mean m_{t|t−1} and variance S_t. Denoting the prediction error calculated by the Kalman filter in each period as ν_t = y_t − m_{t|t−1} we can thus write the likelihood of our observables as

    L = ∏_{t=1}^{T} (2π)^{−k/2} |S_t|^{−1/2} exp( −(1/2) ν_t' S_t^{-1} ν_t )

so that the log likelihood, which we typically use for any maximization problem, is

(2.7)    l = κ − (1/2) ∑_{t=1}^{T} log|S_t| − (1/2) ∑_{t=1}^{T} ν_t' S_t^{-1} ν_t

where κ does not contain any parameters of interest and can thus be ignored in the max-
imization problem. The log likelihood is remarkably easy to calculate in practice as both
νt and St are calculated in each period by the Kalman filter.
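In code, the log likelihood can be accumulated inside the filter loop at essentially no extra cost; a scalar sketch (names illustrative, with the constants that make up κ kept explicit):

```python
import math

def kalman_loglik(y, A, H, Q, R, mu0=0.0, P0=1e5):
    """Scalar Kalman filter that accumulates the exact log likelihood of y."""
    mu, P, ll = mu0, P0, 0.0
    for y_t in y:
        mu_pred, P_pred = A * mu, A * P * A + Q
        S = H * P_pred * H + R
        nu = y_t - H * mu_pred                      # prediction error nu_t
        # One term of equation (2.7), constant included
        ll += -0.5 * (math.log(2 * math.pi) + math.log(S) + nu * nu / S)
        K = P_pred * H / S
        mu, P = mu_pred + K * nu, P_pred - K * H * P_pred
    return ll

ll = kalman_loglik([1.2, 0.8, 1.1, 0.9], A=0.9, H=1.0, Q=0.1, R=0.5)
```

A numerical optimizer (for example optim() in R, as discussed in chapter 5) can simply maximize a function like this over the parameters.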

2.4 A Kalman Smoother


What the Kalman filter of the previous section delivers are estimates of xt|t , that is, an
estimate of xt given observations from period 1 through t. However, if the states of the
model are autocorrelated then presumably observations realized after period t also contain
information about states in period t. The Kalman smoothers provide a means of employing
this information so that our final estimate of states becomes xt|T , that is, an estimate of
xt given observations from period 1 through T . There are several approaches to Kalman
smoothing; I’ll outline the simple and popular Rauch-Tung-Striebel smoother. The process
begins by running the Kalman filter and saving the values for Pt|t−1 , Pt|t , µt|t−1 , and of
course µt|t . The key to the smoother is the fact that f (xt |xt+1 , y1:T ) = f (xt |xt+1 , y1:t ),
which states that if we know xt+1 then further realizations of observables after period t do
not add any additional information. We can summarize the relationship between xt|t and
xt+1|t as

(2.8)    [ x_t | y_{1:t} ; x_{t+1} | y_{1:t} ] ~ N( [ µ_{t|t} ; µ_{t+1|t} ], [ P_{t|t}   P_{t|t} A' ; A P_{t|t}   P_{t+1|t} ] )

where P_{t+1|t} = A P_{t|t} A' + Q. Using the same results for a joint normal distribution we used to derive the Kalman filter we then have¹

    E(x_t | x_{t+1}, y_{1:t}) = µ_{t|t} + g_t ( x_{t+1} − µ_{t+1|t} )

    Var(x_t | x_{t+1}, y_{1:t}) = P_{t|t} − g_t P_{t+1|t} g_t'

where g_t = P_{t|t} A' P_{t+1|t}^{-1}. Since we do not in fact observe x_{t+1} we need to slightly modify
the above equations. Using the law of iterated expectations for the first

(2.9)    µ_{t|T} = E(x_t | y_{1:T}) = E( E(x_t | x_{t+1}, y_{1:t}) | y_{1:T} ) = µ_{t|t} + g_t ( µ_{t+1|T} − µ_{t+1|t} )

Using the law of iterated variance for the second

    Var(x_t | y_{1:T}) = E( Var(x_t | x_{t+1}, y_{1:t}) | y_{1:T} ) + Var( E(x_t | x_{t+1}, y_{1:t}) | y_{1:T} )
(2.10)              = P_{t|t} − g_t P_{t+1|t} g_t' + g_t P_{t+1|T} g_t'
                    = P_{t|t} − g_t ( P_{t+1|t} − P_{t+1|T} ) g_t'

Equations (2.9) and (2.10) form our smoother. We begin using our last filtered value for
µT |T and PT |T and iterate backwards to the first period.
¹ The result for P_{t|T} comes from the fact that

    Var(x_t | x_{t+1}, y_{1:t}) = P_{t|t} − P_{t|t} A' P_{t+1|t}^{-1} A P_{t|t}
                               = P_{t|t} − P_{t|t} A' P_{t+1|t}^{-1} P_{t+1|t} P_{t+1|t}^{-1} A P_{t|t}
                               = P_{t|t} − g_t P_{t+1|t} g_t'
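The backward recursion of equations (2.9) and (2.10) can be sketched for the scalar case, reusing the quantities the filter stored (illustrative names, not the accompanying code):

```python
def rts_smoother(filtered, predicted, A):
    """Scalar Rauch-Tung-Striebel smoother.

    filtered:  list of (mu_{t|t}, P_{t|t}) from the Kalman filter
    predicted: list of (mu_{t|t-1}, P_{t|t-1}) from the Kalman filter
    """
    T = len(filtered)
    smoothed = [None] * T
    smoothed[-1] = filtered[-1]                    # start from mu_{T|T}, P_{T|T}
    for t in range(T - 2, -1, -1):
        mu_f, P_f = filtered[t]
        mu_p, P_p = predicted[t + 1]               # mu_{t+1|t}, P_{t+1|t}
        g = P_f * A / P_p                          # g_t
        mu_s, P_s = smoothed[t + 1]
        smoothed[t] = (mu_f + g * (mu_s - mu_p),          # equation (2.9)
                       P_f - g * (P_p - P_s) * g)         # equation (2.10)
    return smoothed

# Toy filter output for two periods
sm = rts_smoother(filtered=[(1.0, 0.4), (1.1, 0.3)],
                  predicted=[(0.0, 1.0), (0.9, 0.46)], A=0.9)
```

Smoothing leaves the final period untouched and pulls earlier estimates toward later information, reducing their variance.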

2.5 The Steady State Filter


Note that in the above Kalman filter neither Pt|t , St , nor Ct depend on the realization of
yt or the expected values of xt (they do, however, depend on the number of series observed
each period). Thus, if our series is covariance stationary, the number of observations
remains the same each period, and if we happen to know the long run value of Ct , call it
C, and St , call it S, we could simplify our Kalman filter as

    µ_{t|t−1} = A µ_{t−1|t−1}
(2.11)    m_{t|t−1} = H µ_{t|t−1}
    µ_{t|t} = µ_{t|t−1} + C S^{-1} ( y_t − m_{t|t−1} )

This is the steady state Kalman filter. To obtain these steady state values, we can
simply run the system of difference equations that determine the relevant variances and
covariances until convergence. This system is

    P_{t|t−1} = A P_{t−1|t−1} A' + Q
    S_t = H P_{t|t−1} H' + R
(2.12)
    C_t = P_{t|t−1} H'
    P_{t|t} = P_{t|t−1} − C_t S_t^{-1} C_t'
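Iterating this system to convergence is straightforward; a scalar sketch (illustrative names):

```python
def steady_state(A, H, Q, R, tol=1e-12, max_iter=10_000):
    """Iterate the scalar version of system (2.12) to its fixed point."""
    P = 1.0                              # any positive starting value
    for _ in range(max_iter):
        P_pred = A * P * A + Q           # P_{t|t-1}
        S = H * P_pred * H + R           # S_t
        C = P_pred * H                   # C_t
        P_new = P_pred - C * C / S       # P_{t|t}
        if abs(P_new - P) < tol:
            return C, S
        P = P_new
    raise RuntimeError("steady state did not converge")

C, S = steady_state(A=0.9, H=1.0, Q=0.1, R=0.5)
```

The returned C and S can then be plugged directly into the steady state filter (2.11).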

2.6 An Example
Suppose we have a model described by

    x_t = [ 1  −.5 ; .1  .7 ] x_{t−1} + e_t

    y_t = [ .5  1 ; −1  2 ; 1  −1 ; 1  −.5 ] x_t + ε_t

where x_t is a 2 × 1 vector of unobserved factors, y_t is a 4 × 1 vector of observed data, e_t ∼ N(0, I_2), and ε_t ∼ N(0, I_4), and we would like to construct the unobserved factors x_t from the observed data.
In this case I already know the parameters A, Q, H, and R. Thus all that remains is to specify initial conditions. x_0 = [0] is a natural but arbitrary choice. So that this initial guess has little influence on the model I begin with a large initial factor variance, in this case

    P_0 = [ 10^5  0 ; 0  10^5 ]
Figure 2.1 plots the results for 200 observations.

Figure 2.1: True series x1,t versus the series estimated from observations yt

2.7 Missing Observations


There are several ways to handle missing observations for the filtering and smoothing
algorithms in this book. I describe here the approach used in the accompanying R and
C++ code.
Using the example from the previous section, suppose the second series in y_t is not observed in a given period t. We can simply re-write our model for the data we do observe by dropping the rows of H and the rows and columns of R corresponding to the missing data. The dimensions of the unobserved factors x_t remain the same, thus this does not present any problems in updating our factor predictions from one period to the next. Explicitly, if the second series of y_t were missing in period t then our measurement equation would become

    y_t = H_t x_t + ε_t    with    H_t = [ .5  1 ; 1  −1 ; 1  −.5 ]

and R becomes a 3 × 3 identity matrix.


By re-casting the dimensions of H_t and R_t depending on which series of data are observed, our calculation of the gain

    K_t = P_{t|t−1} H_t' ( H_t P_{t|t−1} H_t' + R_t )^{-1}

will automatically correspond to the observed series. The same is true of the dimensionality of terms used in the disturbance smoother in chapter 3. From a programming point of view, the only downside to the changing dimensions of H, R and the terms calculated using each is storage. For example, if we wish to save the gain K_t and the prediction error ν_t = y_t − H_t µ_{t|t−1} at each period t we will need a data type that can handle different dimensions for each observation. In C++/Armadillo this means using field<mat>, and in R a list as opposed to a cube or array, but this disadvantage is slight and does not appreciably detract from the speed or efficiency of calculations.
Note that in the rest of this text I do not include time subscripts on parameters such
as H and R that may change dimension from one period to the next due to missing data
in order to keep the notation clean and avoid confusion with objects such as observations
or factors with values that change over time.
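The row-dropping logic itself is simple to express; a Python sketch (the accompanying code does this in R and C++/Armadillo, and the helper name here is hypothetical):

```python
def subset_measurement(H, R_diag, observed):
    """Drop the rows of H and the R entries belonging to unobserved series.

    H:        list of rows, one per observable series
    R_diag:   diagonal of R, one entry per series
    observed: booleans, True where the series is observed in period t
    """
    H_t = [row for row, obs in zip(H, observed) if obs]
    R_t = [var for var, obs in zip(R_diag, observed) if obs]
    return H_t, R_t

# The example above: series 2 of y_t is missing in period t
H = [[0.5, 1.0], [-1.0, 2.0], [1.0, -1.0], [1.0, -0.5]]
H_t, R_t = subset_measurement(H, [1.0] * 4, [True, False, True, True])
```

Every filter quantity built from H_t and R_t then automatically has the right dimensions for that period.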
Chapter 3

The Durbin-Koopman Disturbance Smoother

I begin this chapter by re-stating our basic factor model with a slight generalization, al-
lowing for exogenous predetermined variables nt . In practice, these will typically take the
form of seasonal factors; Chapter 5 includes an example of estimating seasonal adjustments
for covariance stationary data by maximum likelihood. Thus our measurement equation
becomes

(3.1) yt = Hxt + M nt + εt

The transition equation (in companion form) is unchanged,1 that is

(3.2) zt = Azt−1 + et

where ε_t and e_t are normally distributed error terms with the covariance matrix

    Cov[ e_t ; ε_t ] = [ Q  0 ; 0  R ]

 
In the above y_t are noisy observations, z_t = [ x_t' x_{t−1}' . . . x_{t−p}' ]' stacks the factors with p lags, and n_t are predetermined, exogenous variables.
Given the parameters H, A, M , Q, and R estimates of factors or estimates of missing
series in yt derive from the Kalman filter and smoother. I re-state the Kalman filter here

1
One could also include exogenous variables in the transition equation, but I do not here as what I have
in mind are seasonal adjustments to observations.


as

    z_{t|t−1} = A z_{t−1|t−1}
    P_{t|t−1} = A P_{t−1|t−1} A' + Q
    y_{t|t−1} = H̃ z_{t|t−1} + M n_t
    S_t = H̃ P_{t|t−1} H̃' + R
    C_t = P_{t|t−1} H̃'
    z_{t|t} = z_{t|t−1} + C_t S_t^{-1} ( y_t − y_{t|t−1} )
    P_{t|t} = P_{t|t−1} − C_t S_t^{-1} C_t'

and the Kalman smoother as

    z_{t|T} = z_{t|t} + g_t ( z_{t+1|T} − z_{t+1|t} )
(3.3)
    P_{t|T} = P_{t|t} − g_t ( P_{t+1|t} − P_{t+1|T} ) g_t'
where g_t = P_{t|t} A' P_{t+1|t}^{-1}. H̃ in the above incorporates a helper matrix J to extract, in the simplest example, contemporaneous factors from z_t. That is, J = [ I_m 0 0 . . . ] and H̃ = HJ. In the above notation z_{t|t−1} refers to our estimate of z_t given observations through period t − 1, while z_{t|t} is our estimate of z_t given observations through t, and z_{t|T} is our estimate of z_t conditional on all available data through period T. Note that in the above K_t = C_t S_t^{-1} is the Kalman gain, ν_t = y_t − y_{t|t−1} is the prediction error, and thus K_t ν_t is our forecast update.
Smoothing here requires we store the following matrices (or vectors) from the filter
at every period t: Pt|t , Pt+1|t , zt|t , and zt+1|t . Additionally, we must invert the matrix
Pt+1|t at each iteration of the smoother. This can be computationally burdensome for even
moderately sized models. Consider an example with four factors and five lags. In this case
we will be inverting the 20 × 20 matrix Pt+1|t in each time period. For mixed frequency
models (see chapter 7) inverting Pt+1|t can become prohibitive. For a daily/weekly/monthly
model with three factors and three lags at a monthly (30 day) frequency Pt+1|t will be a
270 × 270 matrix. The Durbin-Koopman disturbance smoother allows us to avoid these cumbersome calculations, resulting in much faster and more efficient model estimation.

3.1 Durbin-Koopman (2001) Filtering and Smoothing


Sticking with the above notation, Durbin and Koopman (2001) write the Kalman filter as

(3.4)    ν_t = y_t − H̃ z_{t|t−1} − M n_t
(3.5)    S_t = H̃ P_{t|t−1} H̃' + R
(3.6)    K_t = A P_{t|t−1} H̃' S_t^{-1}
(3.7)    z_{t+1|t} = A z_{t|t−1} + K_t ν_t
(3.8)    L_t = A − K_t H̃
(3.9)    P_{t+1|t} = A P_{t|t−1} L_t' + Q

Note that in this notation the Kalman gain in (3.7) updates the forecast for next period's factors; factoring A out of equation (3.7) yields the same forecast update as in chapter 2's more standard Kalman filter. Similarly, multiplying out the elements of (3.9) yields P_{t+1|t} = A P_{t|t−1} A' − A C_t S_t^{-1} C_t' A' + Q. In this notation, the disturbance smoother is

(3.10)    r_{t−1} = H̃' S_t^{-1} ν_t + L_t' r_t

with r_T = 0, iterating backwards from t = T, from which we can recover the vectors of errors (disturbances) as

(3.11)    ε_{t|T} = R ( S_t^{-1} ν_t − K_t' r_t )
          e_{t|T} = Q r_{t−1}

Note that in order to recover smoothed factors we must iterate r back to r_0. Equipped with e_{t|T} we can recover smoothed estimates of the factors by initializing our smoothed factors as

    z_{1|T} = z_{1|0} + P_{1|0} r_0

or, since we typically set z_0 to zero, simply z_{1|T} = P_{1|0} r_0, and iterating forward again using

    z_{t|T} = A z_{t−1|T} + e_{t|T}

In this case we must store the matrices (or vectors) ν_t, H̃, S_t^{-1}, L_t, and potentially K_t. Since filtering already requires computing S_t^{-1}, disturbance smoothing requires no additional matrix inversions and will thus be much more computationally efficient.
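A scalar sketch of the whole Durbin-Koopman pass, forward filter, backward r recursion, and recovery of smoothed states, may make the bookkeeping concrete. Names are illustrative, the exogenous term M n_t is omitted, and r_t is indexed as in Durbin and Koopman, with r_T = 0 and the recursion running down to r_0:

```python
def dk_smooth(y, A, H, Q, R, a1=0.0, P1=1e5):
    """Scalar Durbin-Koopman filter and fast state smoother (M * n_t omitted)."""
    T = len(y)
    a, P = a1, P1                       # z_{1|0} and P_{1|0}
    a_pr, P_pr, vs, Fs, Ls = [], [], [], [], []
    for y_t in y:                       # forward pass, equations (3.4)-(3.9)
        a_pr.append(a); P_pr.append(P)
        v = y_t - H * a                 # nu_t
        F = H * P * H + R               # S_t
        K = A * P * H / F               # K_t
        L = A - K * H                   # L_t
        vs.append(v); Fs.append(F); Ls.append(L)
        a = A * a + K * v               # z_{t+1|t}
        P = A * P * L + Q               # P_{t+1|t}
    # Backward pass, equation (3.10): r_{t-1} = H S_t^{-1} nu_t + L_t r_t
    rs = [0.0] * (T + 1)                # rs[t] holds r_t, rs[T] = 0
    for t in range(T - 1, -1, -1):
        rs[t] = H * vs[t] / Fs[t] + Ls[t] * rs[t + 1]
    # Recover smoothed states: z_{1|T} = z_{1|0} + P_{1|0} r_0, then roll forward
    z = [0.0] * T
    z[0] = a_pr[0] + P_pr[0] * rs[0]
    for t in range(1, T):
        z[t] = A * z[t - 1] + Q * rs[t]
    return z

z_smooth = dk_smooth([1.0, 0.5, 1.2], A=0.9, H=1.0, Q=0.1, R=0.5)
# With a single observation the smoothed state must equal the filtered one
z_one = dk_smooth([1.0], A=0.9, H=1.0, Q=0.1, R=0.5, a1=0.0, P1=1.0)
```

No matrix inversion appears beyond the division by F that the filter needs anyway, which is exactly the computational advantage claimed above.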
Chapter 4

Estimation via Principal Components

By far the simplest and fastest way to estimate the factor model outlined in equations (1)
and (2) is by principal components, though this approach has several major drawbacks.
Perhaps most importantly, principal components solves a static problem. That is, when
using principal components we find factors by looking at the measurement equation in
isolation ignoring the inter-temporal correlations implied by the transition equation. Ad-
ditional problems with principal components estimates are (1) data must be square and
complete, a large drawback when using a methodology particularly adept at handling miss-
ing and noisy observations and (2) principal components limits our ability to fix parameters
or apply prior beliefs to parameter estimates. Despite these drawbacks, when we have a
data set without missing observations principal components provides a very quick and easy
way to estimate a factor model. I begin this chapter by examining reduced rank regressions
before moving on to principal components as a special case. Once we have our principal
component estimates of factors we can estimate the transition equation by OLS and simply
plug these results into the Kalman filter and, if desired, smoother to obtain our final factor
estimates.

4.1 Reduced Rank Regression


Suppose we observe a k × t set of data Y that we wish to explain using a s × t set of data
X, but that the number of series in X (denoted by s) is very large and that we think the
relationship between X and Y can in fact be summarized by the data in an m × t matrix
γX. That is, we would like to reduce the rank of X before estimating its relationship with
Y . Assuming we want to model a linear relationship, we can write this problem succinctly
as
(4.1) Y = BγX + 


where both B and γ are parameter matrices to estimate. If we want the least squares
estimates of B and γ then we need to minimize
(4.2)    min tr{ (Y − BγX)(Y − BγX)' }

where the term inside the braces is εε' and tr denotes the trace of the covariance matrix for ε. Note that equation (4.1) will be observationally equivalent for

    Y = (Bθ)(θ^{-1}γ)X + ε

where B^a = Bθ and γ^a = θ^{-1}γ are some alternative B and γ, thus we need to impose some sort of normalization on γ. It is convenient to impose

(4.3) γXX 0 γ 0 = Im

From a simple OLS model we already know that given γ our estimate for B that minimizes our loss function (4.2) will be

    B = Y X'γ' ( γXX'γ' )^{-1}

which, given our normalization (4.3), is just Y X'γ'. Plugging this into our loss function we have

    min tr{ Y Y' − 2Y X'γ'γXY' + Y X'γ'γXX'γ'γXY' }
which, given (4.3), is just

(4.4)    min tr{ Y Y' − Y X'γ'γXY' }

Since the first term does not include γ we can re-write this minimization problem as a constrained maximization, using the property of a matrix trace that tr(ABC) = tr(CAB) = tr(BCA) so that tr(Y X'γ'γXY') = tr(γXY'Y X'γ'),

    L = tr{ γXY'Y X'γ' + Λ( I_m − γXX'γ' ) }

where the constraint comes from our normalization (4.3) and Λ is a diagonal matrix of Lagrange multipliers. The first order condition for this problem (see the appendix for useful matrix derivatives) is

    2γXY'Y X' − 2ΛγXX' = 0

or

    γXY'Y X'(XX')^{-1} − Λγ = 0

so that the solution for γ is the left eigenvectors of XY'Y X'(XX')^{-1}. Since left eigenvectors are a bit annoying (Matlab automatically computes right eigenvectors) we can simply take the transpose of this result

(4.5)    (XX')^{-1}XY'Y X'γ' − γ'Λ = 0

so that γ' is given by the (right) eigenvectors of (XX')^{-1}XY'Y X' associated with the largest m eigenvalues, the diagonal elements of Λ (Λ is a diagonal matrix). Recall that the Lagrange multiplier is a measure of how strongly our constraint binds.¹ We choose the eigenvectors associated with the largest eigenvalues because those vectors give the factors γX which will have the largest impact on our objective function tr{ γXY'Y X'γ' }. Once we have γ we can simply compute B = Y X'γ'.

4.2 Principal Components as a Special Case


Principal components is a special case of reduced rank regressions in which X = Y . That
is, γY is a collection of m < k series that we believe is sufficient to explain Y . If X = Y
then equation (4.5) simply becomes

(4.6) Y Y 0γ0 − γ0Λ = 0

in which case the solution for γ 0 is the (right) eigenvectors of Y Y 0 associated with the
largest m eigenvalues. We can actually derive principal components from a slightly more
general problem. Suppose we want to minimize the loss function

(4.7)    min tr{ (Y − γF)(Y − γF)' }

where F can be any set of explanatory variables. Thus, unlike the problem in section 4.1,
we are not restricting what the set of explanatory variables can be (previously it was X).
We again need to impose a normalization of γ since Y = γθθ−1 F +  will be observationally
equivalent to the model in (4.7). In this case the normalization

(4.8) Im = γ 0 γ

is convenient. Minimizing first over F , the solution for the factors given γ is

F = (γ 0 γ)−1 γ 0 Y

or, using (4.8),


F = γ0Y
¹ Specifically, Λ is the partial derivative of our objective function with respect to the constraint, in this case I_m.

Plugging this back into our minimization problem, equation (4.7), we have

    min tr{ Y Y' − 2γγ'Y Y' + γγ'Y Y'γγ' }

Using the cyclic property of a matrix trace, tr(ABC) = tr(BCA) = tr(CAB), and our normalization, equation (4.8), this result becomes

    min tr{ Y Y' − Y'γγ'Y }

which, again using our normalization, we can write as the constrained maximization problem

    L = tr{ γ'Y Y'γ + Λ( I − γ'γ ) }

with first order condition

    Y Y'γ − γΛ = 0

thus γ is the eigenvectors of Y Y' associated with the m largest eigenvalues (note that this γ is the transpose of the previous one we derived for reduced rank regressions). It is fairly simple in this context to calculate how much each principal component, in this case each row of γ'Y, reduces our original loss function, equation (4.7).
by 1/T and assume Y has zero mean our loss function is simply the trace of the covariance matrix of Y, denoted tr(Σ_YY). We can re-write this trace using the eigendecomposition of Σ_YY as²

    tr{ ΓΛΓ' } = tr{ Γ'ΓΛ } = tr{ Λ }


where again we have used the properties of a matrix trace. Thus the trace of Σ_YY is the sum of its eigenvalues. With one principal component, that associated with the largest eigenvalue denoted λ_1, our scaled loss function is

    tr{ (1/T)(Y − γγ'Y)(Y − γγ'Y)' }
      = tr{ (1/T)( Y Y' − 2γγ'Y Y' + γγ'Y Y'γγ' ) }
      = tr{ (1/T)( Y Y' − γγ'Y Y' ) }
      = tr{ (1/T) Y Y' } − tr{ (1/T) γ'Y Y'γ }
      = tr{ ΓΛΓ' } − tr{ γ'ΓΛΓ'γ }
      = tr{ Λ } − λ_1
² From our first order condition for m factors using the scaled loss function, Σ_YY γ − γΛ = 0, we can write the system of equations for the complete set of factors when m = k as Σ_YY Γ = ΓΛ where Λ is a diagonal matrix of the eigenvalues of Σ_YY and the columns of Γ are the associated eigenvectors. Thus Σ_YY = ΓΛΓ' where we maintain our normalization Γ'Γ = I_k, and since Γ is now square it is also true that ΓΓ' = I_k.

where the last equality comes from the fact that γ'Γ is a matrix of zeros with a one in the upper left corner, as is Γ'γ. Thus the first principal component reduces our scaled loss function by λ_1. Similarly, the first two principal components will reduce our scaled loss function by λ_1 + λ_2. The usual interpretation of this result is that the first m principal components explain

    ( ∑_{i=1}^{m} λ_i ) / ( ∑_{i=1}^{k} λ_i )

of the variance of Y, where by the variance of Y we mean the sum of the diagonal of the covariance matrix.
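Since only the leading eigenvectors are needed, they can be computed without a full eigendecomposition, for example by power iteration. A small pure-Python sketch for two series and one component (illustrative names, not the accompanying code):

```python
def first_pc(Y, iters=500):
    """Leading eigenvector and eigenvalue of Y Y' for 2 x T data, by power iteration."""
    # Form the 2 x 2 moment matrix M = Y Y'
    m00 = sum(a * a for a in Y[0])
    m11 = sum(b * b for b in Y[1])
    m01 = sum(a * b for a, b in zip(Y[0], Y[1]))
    g = [1.0, 1.0]
    for _ in range(iters):
        # Multiply by M and renormalize; g converges to the top eigenvector
        g = [m00 * g[0] + m01 * g[1], m01 * g[0] + m11 * g[1]]
        norm = (g[0] ** 2 + g[1] ** 2) ** 0.5
        g = [g[0] / norm, g[1] / norm]
    lam = m00 * g[0] ** 2 + 2 * m01 * g[0] * g[1] + m11 * g[1] ** 2  # Rayleigh quotient
    return g, lam

# Two series that are nearly proportional: one component should dominate
Y = [[1.0, -2.0, 0.5, 1.5, -1.0],
     [1.1, -1.9, 0.4, 1.6, -0.9]]
gamma, lam1 = first_pc(Y)
# Share of tr(Y Y') explained by the first component: lam1 over the sum of eigenvalues
share = lam1 / (sum(a * a for a in Y[0]) + sum(b * b for b in Y[1]))
```

Because tr(Y Y') equals the sum of the eigenvalues, the explained share is simply λ_1 divided by that trace, which for two nearly proportional series is close to one.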

4.3 Summing Up

Beginning with our measurement equation

    Y_t = H x_t + ε_t

if our set of k covariance stationary observables Y_t does not have missing observations then we can estimate the k × m matrix of loadings H by principal components as the (right) eigenvectors associated with the m largest eigenvalues of Σ_YY, where Σ_YY is the (typically scaled to have ones on the diagonal) covariance matrix for Y_t, that is, Σ_YY = (1/T)(Y − Ȳ)(Y − Ȳ)'. Due to the normalizing assumption I_m = H'H in equation (4.8) our estimate of the factors x_t is

    x̂_t = H'Y_t

and residuals are given by

    ε̂_t = Y_t − H x̂_t
allowing us to estimate R. Typically we will use only the diagonal elements of R as Doz
et al. (2012) show that this model, the approximate factor model, even if misspecified still
consistently estimates factors. Using the factors x̂t we can estimate the transition equation
by OLS for parameters A and Q. These are nearly all we will need to run the Kalman
filter and, if desired, smoother. The only thing that remains are initial values x0 and P0 .
Because x0 is not known, it is good practice to initiate the filter with diffuse values, that
is, specifying a large initial factor variance P0 . For example, we might begin our filter with
x0 = 0m and P0 = 105 Im .3

³ In the case of more than one lag p in the transition equation we would have x_0 = 0_{m·p} and P_0 = 10^5 I_{m·p}.
Chapter 5

Maximum Likelihood Estimation via Watson and Engle (1983)

Maximum likelihood estimation of state space models using numerical techniques, such as the function optim() in R, provides an easy-to-program method for estimating model parameters. To use a numerical method we can simply write a function that returns the log likelihood described in equation (2.7) and search for parameters that maximize it. The great drawback of numerical methods is that they are computationally intensive and thus slow to converge and infeasible for larger models. Watson and Engle (1983) provide an alternative approach that overcomes these shortcomings of numerical routines.

5.1 The Algorithm


Watson and Engle (1983) propose an iterative scheme, an expectation-maximization (EM) algorithm, for estimating state space models. The insight of Watson and Engle (1983), for the purpose of estimating the model described in equations (1) and (2), is that, though we do not observe the true factors, we can still calculate the necessary moment matrices for A, H, Q, and R using an appropriate adjustment and, moreover, we can use the standard Kalman smoother to calculate this adjustment. I begin by re-stating the model for clarity. The measurement equation for k observables and m factors is

yt = Hxt + εt

and the transition equation is

xt = Bzt−1 + et

where z_t is defined as

    z_t = [ x_t ; x_{t−1} ; . . . ; x_{t−p+1} ]

and p denotes the number of lags in the transition equation. Then the moment matrices we need to estimate are, for B: E(z_{t−1} z_{t−1}') and E(x_t z_{t−1}'); for H: E(x_t x_t') and E(y_t x_t'); for Q: E(e_t e_t'); and for R: E(ε_t ε_t'). Proper estimation of the above moment matrices allows us to then compute maximum likelihood estimates of the model parameters.
In brief, the approach works by noting that, where x_{t|T}, P_{t|T}, C_{t|T}, and so on are the values calculated at the jth iteration of the algorithm (I suppress the superscript j to keep the notation cleaner and use z_{t|T} to denote our estimate of the factors z_t given observations 1 : T),

(5.1)    E(z_{t−1} z_{t−1}') = E[ ( z_{t−1|T} + (z_{t−1} − z_{t−1|T}) )( z_{t−1|T} + (z_{t−1} − z_{t−1|T}) )' ]

which we can estimate as

    (1/T) [ ∑_t z_{t−1|T} z_{t−1|T}' + ∑_t P_{t−1|T} ]

and

(5.2)    E(x_t z_{t−1}') = E[ ( x_{t|T} + (x_t − x_{t|T}) )( z_{t−1|T} + (z_{t−1} − z_{t−1|T}) )' ]

which we can estimate as

    (1/T) [ ∑_t x_{t|T} z_{t−1|T}' + ∑_t C_{t|T} ]

giving the moment matrices for B;


 
(5.3) E(xt x0t ) = E (xt|T + (xt − xt|T ))(xt|T + (xt − xt|T ))0

which we estimate as
1 hX X i
xt|T x0t|T + x
Pt|T
T t t

 0 
(5.4) E(yt x0t ) = E yt xt|T + (xt − xt|T )
= E(yt x0t|T )

giving the moment matrices for H; and for the covariance matrices Q and R

(5.5)    e_t = ( x_{t|T} − B z_{t−1|T} ) + ( x_t − x_{t|T} ) − B ( z_{t−1} − z_{t−1|T} )

so that, writing e_{t|T} = x_{t|T} − B z_{t−1|T} and C_{t|T} = E[ (x_t − x_{t|T})(z_{t−1} − z_{t−1|T})' ], we estimate E(e_t e_t') as

    (1/T) [ ∑_t e_{t|T} e_{t|T}' + ∑_t P^x_{t|T} + ∑_t B P_{t−1|T} B' − ∑_t C_{t|T} B' − ∑_t B C_{t|T}' ]

and

(5.6)    ε_t = ( y_t − H x_{t|T} ) − H ( x_t − x_{t|T} )

so that, writing ε_{t|T} = y_t − H x_{t|T}, we estimate E(ε_t ε_t') as

    (1/T) [ ∑_t ε_{t|T} ε_{t|T}' + ∑_t H P^x_{t|T} H' ]

As suggested by Watson and Engle (1983) we can calculate the correlations E[ (x_t − x_{t|T})(z_{t−1} − z_{t−1|T})' ] by including an extra lag of x_t in the state vector of the Kalman filter and smoother. For example, if we wanted to estimate a model with three lags we would write the transition equation as

    [ x_t ; x_{t−1} ; x_{t−2} ; x_{t−3} ] = [ B_1  B_2  B_3  0 ; I  0  0  0 ; 0  I  0  0 ; 0  0  I  0 ] [ x_{t−1} ; x_{t−2} ; x_{t−3} ; x_{t−4} ] + [ v_t ; 0 ; 0 ; 0 ]

Denote the variance of this augmented vector of states in period t as

(5.7)    P^{WE}_{t|T} = [ σ_{t,t}  σ_{t,t−1}  σ_{t,t−2}  σ_{t,t−3} ;
                          σ_{t−1,t}  σ_{t−1,t−1}  σ_{t−1,t−2}  σ_{t−1,t−3} ;
                          σ_{t−2,t}  σ_{t−2,t−1}  σ_{t−2,t−2}  σ_{t−2,t−3} ;
                          σ_{t−3,t}  σ_{t−3,t−1}  σ_{t−3,t−2}  σ_{t−3,t−3} ]

For this three lag example P_{t|T} is the upper left 3 × 3 submatrix¹ of P^{WE}_{t|T} in equation (5.7); C_{t|T} refers to the upper right 1 × 3 submatrix of P^{WE}_{t|T}, and P^x_{t|T} refers to element σ_{t,t}. This gives all the results needed to estimate the moment matrices above. The EM algorithm proceeds by alternately estimating the unobserved factors via the Kalman filter and smoother and estimating the parameters of the model as outlined above until the log likelihood function converges.
¹ Of course each element σ_{i,j} is itself an m × m matrix so in fact P_{t|T} will have dimension 3m × 3m.
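For the scalar model y_t = h x_t + ε_t, x_t = b x_{t−1} + e_t the moment matrices above collapse to sums, and one EM pass is short enough to sketch in full. Two simplifications relative to the text, both assumptions of this sketch: h and r are held fixed (only b and q are re-estimated), and the lag-one covariance is obtained from the identity Cov(x_{t+1}, x_t | y_{1:T}) = g_t P_{t+1|T} rather than from the augmented-state device:

```python
import math
import random

def em_step(y, b, q, h, r, x0=0.0, P0=1e5):
    """One EM pass for the scalar model y_t = h*x_t + eps_t, x_t = b*x_{t-1} + e_t.

    Returns updated (b, q) plus the log likelihood at the incoming parameters.
    """
    T = len(y)
    # Kalman filter, storing the moments the smoother needs
    x, P, ll = x0, P0, 0.0
    xf, Pf, xp, Pp = [], [], [], []
    for y_t in y:
        x_pr, P_pr = b * x, b * P * b + q          # prediction step
        S = h * P_pr * h + r
        nu = y_t - h * x_pr
        ll += -0.5 * (math.log(2 * math.pi) + math.log(S) + nu * nu / S)
        K = P_pr * h / S                           # updating step
        x, P = x_pr + K * nu, P_pr - K * h * P_pr
        xp.append(x_pr); Pp.append(P_pr); xf.append(x); Pf.append(P)
    # Rauch-Tung-Striebel smoother plus lag-one covariances Cov(x_{t+1}, x_t | y_{1:T})
    xs, Ps, Cs = xf[:], Pf[:], [0.0] * T
    for t in range(T - 2, -1, -1):
        g = Pf[t] * b / Pp[t + 1]
        xs[t] = xf[t] + g * (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] - g * (Pp[t + 1] - Ps[t + 1]) * g
        Cs[t + 1] = g * Ps[t + 1]                  # g_t * P_{t+1|T}
    # M-step: scalar versions of the moment sums above (transitions 2..T)
    num = sum(xs[t] * xs[t - 1] + Cs[t] for t in range(1, T))
    den = sum(xs[t - 1] ** 2 + Ps[t - 1] for t in range(1, T))
    b_new = num / den
    q_new = sum((xs[t] - b_new * xs[t - 1]) ** 2 + Ps[t]
                + b_new * b_new * Ps[t - 1] - 2 * b_new * Cs[t]
                for t in range(1, T)) / (T - 1)
    return b_new, q_new, ll

# Simulate data from known parameters, then iterate EM from a poor starting guess
random.seed(1)
b_true, q_true, h, r = 0.8, 0.3, 1.0, 0.2
x, y = 0.0, []
for _ in range(300):
    x = b_true * x + random.gauss(0.0, q_true ** 0.5)
    y.append(h * x + random.gauss(0.0, r ** 0.5))

b, q, lls = 0.1, 1.0, []
for _ in range(100):
    b, q, ll = em_step(y, b, q, h, r)
    lls.append(ll)
```

Each pass returns the log likelihood at the incoming parameters, so the sequence of likelihoods can be monitored for convergence, exactly as the algorithm above prescribes.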
5.2 Practical Issues
Identification
An important practical issue for either maximum likelihood or Bayesian estimation of factor
models is identifying the model. The issue arises as the factors xt are never observed. Thus
the model

yt = Hxt + εt
(5.8)
xt = Bxt−1 + et

is observationally equivalent to the model

    y_t = ( Hθ^{-1} )( θx_t ) + ε_t
    θx_t = ( θBθ^{-1} )( θx_{t−1} ) + θe_t

for any invertible matrix θ, where Hθ^{-1}, θx_t, θBθ^{-1}, and θe_t play the roles of H, x_t, B, and e_t.

Factor estimates and the likelihood function for the model in (5.8) would be identical to
those for the model

yt = Hxt + εt
(5.9)
xt = Bxt−1 + et

Identification may be important for interpreting our model.2 It is, for maximum likelihood
and Bayesian estimation, also important for estimation.3 Identification requirements for
maximum likelihood estimation are less stringent than for Bayesian estimation. Bayesian
estimation by simulation relies on the assumption we are drawing factors and parameters
from the same distribution at each iteration. On the other hand, because our convergence
criteria for maximum likelihood is in the likelihood function, parameter estimates can
drift between the models in (5.8) and (5.9) as the likelihood function for the two will be
identical. What is important is that we scale the model at each iteration of the algorithm;
otherwise our parameter estimates may become arbitrarily large or small thereby breaking
the algorithm. There is no one way to scale or identify our model; for a discussion of
several identification possibilities see Stock and Watson (2016). To scale the model we
could enforce H'H = I as with principal components or even diag(H'H) = I. Perhaps the simplest identification technique, however, is to set the top m × m sub-matrix of H, where m is the number of factors, to be an identity matrix (this is called "naming factors" identification). In this case our model will not just be scaled, but be fully identified (thus
2
See Stock and Watson (2016) for a more comprehensive discussion of interpretation and identification
issues.
3
For principal component estimation identification is dealt with by the normalization for H in equation
(4.8).
5.3. AN EXAMPLE: SEASONALLY ADJUSTING STATIONARY DATA 43

we could and will use it for Bayesian estimation as well). This strategy has the added advantage of giving our model a simple interpretation: the first factor describes filtered movements of the first variable in common with the entire data set, the second factor describes filtered movements of the second variable in common with the entire data set, and so on.
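For a single factor the naming-factors normalization is just a rescaling with θ = H_1; a minimal sketch (illustrative names, not the accompanying code):

```python
def name_factor(H, x):
    """Rescale a one-factor model so that the first loading equals one.

    Uses theta = H[0]: H* = H / theta and x*_t = theta * x_t, which leaves the
    fitted values H x_t unchanged.
    """
    theta = H[0]
    H_star = [h_i / theta for h_i in H]
    x_star = [theta * x_t for x_t in x]
    return H_star, x_star

H_star, x_star = name_factor(H=[2.0, -1.0, 0.5], x=[0.3, -0.2, 0.1])
```

The fitted values H x_t are unchanged by this rescaling, which is exactly the observational equivalence exploited above.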

Initial Values
As with principal components we will not know initial values z_0 with which to initiate the algorithm. As smoothing produces estimates of z_0, one possibility is to begin with z_0 = [0] and then update z_0 at each iteration of the algorithm. However, there is no particularly good theoretical justification for this approach and it may result in the likelihood function getting worse (that is, smaller) from one iteration to the next. A better approach is to again begin with diffuse values, for example setting z_0 = [0] and P_0 to be the identity matrix times a large number (large relative to the actual variance of shocks).

5.3 An Example: Seasonally Adjusting Stationary Data


As an example of using maximum likelihood via Watson and Engle (1983) we could seasonally adjust noisy data using our factor model with exogenous regressors. This example will specifically consider month-on-month percent changes in Spanish industrial production (IP). Our observation equation is then

(5.10) y t = x t + M N t + εt

where yt is observed raw IP data, xt is unobserved adjusted industrial production, and


M is a matrix of parameters to estimate multiplying a dummy for each month contained
in N . Thus N is a T × 12 matrix of predetermined variables where T is the number of
observations of Spanish IP. To keep the example simple, I will use only one lag in the
transition equation, that is,
    x_t = b x_{t−1} + e_t

To use the Watson and Engle (1983) algorithm we will need:

1. An initial guess for M and r = var(εt )

2. An initial guess for b and q = var(et )

3. Diffuse initial conditions for x0 and P0

Figure 5.1: Factor Model Seasonally Adjusted Spanish IP, MoM percent change

Furthermore, we will need to add a lag to the transition equation in order to calculate the lag covariances needed to estimate the transition equation's parameters. One possible initial guess for M is to regress y_t on N_t by OLS and to calculate our initial guess for r from the residuals. We can also form initial estimates for b and q by OLS using the observed data y_t. Finally, we will need to write our transition equation with an extra lag in companion form as
      
(5.11)    [ x_t ; x_{t−1} ] = [ b  0 ; 1  0 ] [ x_{t−1} ; x_{t−2} ] + [ e_t ; 0 ]

Equipped with our initial guesses for parameters and the initial conditions [x_0 x_{−1}]' = [0 0]' and P_0 = 10^6 × I_2 the algorithm proceeds as follows:

1. Run the Kalman smoother for the current iteration's value of the parameters. If the likelihood does not improve from the last iteration then the algorithm has converged; otherwise continue.

2. Using y_t, z_{t|T} = [ x_{t|T} x_{t−1|T} ]', and P_{t|T} for the augmented model, re-estimate the parameters following equations (5.1) through (5.6)

These two steps alternate until the likelihood converges. Figure 5.1 illustrates seasonally
adjusted IP estimated as outlined above against the raw data. Programs to estimate the
model are contained in the file ML Seasonal Example.
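Because the monthly dummy columns of N are orthogonal, the OLS initial guess for M reduces to monthly means of y_t. A toy sketch of that first step (illustrative names and made-up data, not the actual Spanish IP series):

```python
def initial_seasonal_guess(y, months):
    """OLS of y on 12 monthly dummies; with orthogonal dummies the coefficient
    for month j is just the mean of y in month j.

    y:      observations (e.g. MoM percent changes)
    months: month indices 1..12, same length as y
    """
    M = []
    for j in range(1, 13):
        vals = [y_t for y_t, mth in zip(y, months) if mth == j]
        M.append(sum(vals) / len(vals) if vals else 0.0)
    resid = [y_t - M[mth - 1] for y_t, mth in zip(y, months)]
    r = sum(e * e for e in resid) / len(resid)   # initial guess for var(eps_t)
    return M, r

# Two years of toy data with a strong January effect
months = [mth for _ in range(2) for mth in range(1, 13)]
y = [3.0 if mth == 1 else 0.5 for mth in months]
M, r = initial_seasonal_guess(y, months)
```

These starting values for M and r are then refined jointly with b and q by the EM iterations above.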
Chapter 6

Bayesian Estimation by Simulation

Bayesian estimation by simulation is a third approach to estimating factor models which


for many applications will be the best option. There are of course theoretical reasons
why one might prefer Bayesian statistics generally: people rarely encounter a situation in life in which they have no prior beliefs, even if these beliefs are strongly biased. Specific to our purpose of estimating factor models, using prior beliefs may have practical applications. For example, we may believe unobserved factors transition smoothly from one period to the next despite noisy observations. We could incorporate
this belief into our model by biasing the parameters of the transition equation B towards
zero and/or increasing the value of ν0 , the degrees of freedom in our inverse-Wishart prior
for the variance of shocks in the transition equation.
I should note at the outset that this chapter only discusses one approach to estimation
by simulation. In the general case, one can simulate posterior distributions using a range
of mixing distributions — distributions which determine the probability of accepting a
draw from our sampling distribution. We will use the simplest possible mixing distribution
and accept draws from sampling distributions with probability one. This approach, Gibbs
sampling, may not always be the most efficient but is simple to code and generally works
well for the purpose of estimating the models in this book. For a much richer discussion
of Monte Carlo methods see Robert and Casella (2010).

6.1 Sampling

Chapters 1, 2 and 3 present the main results we will use to estimate our model. Estimation
begins with an initial guess for parameters. Given this initial guess, we would like to draw


factors from our model1

(6.1) yt = Hxt + εt

(6.2) xt = Bxt−1 + et

where
\[ \begin{bmatrix} e_t \\ \varepsilon_t \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} Q & 0 \\ 0 & R \end{bmatrix} \right) \]
One possible approach is to

1. Use the standard Kalman filter to obtain an estimate of \(x_{T|T}\) and \(P_{T|T}\) describing
our posterior distribution for factors in the final period.

2. Draw \(\tilde{x}_T \sim N(x_{T|T}, P_{T|T})\)

3. Iterate backwards one period using the standard Kalman smoother. That is, calculate

\[ x_{T-1|T} = x_{T-1|T-1} + g_{T-1}(\tilde{x}_T - x_{T|T-1}) \]
\[ P_{T-1|T} = P_{T-1|T-1} - g_{T-1}(P_{T|T-1} - P_{T|T})g_{T-1}' \]

4. Draw \(\tilde{x}_{T-1} \sim N(x_{T-1|T}, P_{T-1|T})\)

5. Repeat these iterations back to the initial period to generate draws of x̃t for every
period.
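The five steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the book's companion C++/R code; the function name `backward_sample` is my own. Note that once we condition on the draw \(\tilde{x}_{t+1}\), the variance for the period-t draw simplifies to \(P_{t|t} - g_t P_{t+1|t} g_t'\).

```python
import numpy as np

def backward_sample(y, H, B, Q, R, x0, P0, rng):
    """One draw of the factor path x | y by forward filtering, backward sampling.
    y: (T, k); H: (k, m); B: (m, m); Q, R: shock covariances. Dense data only."""
    T, m = y.shape[0], B.shape[0]
    xf = np.zeros((T, m)); Pf = np.zeros((T, m, m))   # filtered x_{t|t}, P_{t|t}
    xp = np.zeros((T, m)); Pp = np.zeros((T, m, m))   # predicted x_{t|t-1}, P_{t|t-1}
    x, P = x0, P0
    for t in range(T):
        x, P = B @ x, B @ P @ B.T + Q                        # predict
        xp[t], Pp[t] = x, P
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)         # Kalman gain
        x, P = x + K @ (y[t] - H @ x), P - K @ H @ P         # update
        xf[t], Pf[t] = x, P
    draws = np.zeros((T, m))
    draws[-1] = rng.multivariate_normal(xf[-1], Pf[-1])      # step 2
    for t in range(T - 2, -1, -1):                           # steps 3-5
        g = Pf[t] @ B.T @ np.linalg.inv(Pp[t + 1])           # gain g_t
        mean = xf[t] + g @ (draws[t + 1] - xp[t + 1])
        var = Pf[t] - g @ Pp[t + 1] @ g.T                    # variance given the draw
        draws[t] = rng.multivariate_normal(mean, (var + var.T) / 2)
    return draws
```

The backward loop is where the cost shows up: one inversion of \(P_{t+1|t}\) per period, which motivates the Durbin and Koopman alternative discussed next.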

This approach is simple. However, as
\[ g_t = P_{t|t} A' P_{t+1|t}^{-1} \]

we must invert Pt+1|t at each step of the iteration. For a steady state model, that is,
one in which the variance of factors does not change, this is not a big drawback as we
can calculate g once (it will be the same in every period for a steady state model) and
proceed. If the factor variance is not constant (this will always be the case if \(y_t\) has missing
observations), calculating \(P_{t+1|t}^{-1}\) becomes computationally burdensome, particularly in
1 I use the case of one lag in the transition equation for explanatory purposes here as the notation is
simple. For more lags our model becomes
\[ y_t = \tilde{H}z_t + \varepsilon_t \]
\[ z_t = Az_{t-1} + e_t \]
where typically we will have \(\tilde{H} = [H\ 0\ 0\ \ldots]\) and \(z_t = [x_t\ x_{t-1}\ \ldots\ x_{t-p+1}]\) where p is the number of
lags, and A is the companion form of B in the more familiar vector autoregression \(x_t = Bz_t + e_t\). Note also
that I am somewhat loose with the definition of \(e_t\) as when we have more than one lag this vector includes
zeros as “shocks” to lagged variables in \(z_t\) to make dimensions agree.

models with many lags as P has dimension m × p where m is the number of factors and
p the number of lags. For this more general case, Durbin and Koopman (2012) propose
a much more computationally efficient method for drawing factors. Their key result for
normally distributed data, presented earlier in Durbin and Koopman (2001), is a method
for drawing from the conditional distribution f (x|y) when doing so directly is burdensome
but drawing from the joint normal distribution f (x, y) and calculating the expected values
E(x|y) is computationally more efficient. To borrow their notation, we can draw x+ , y +
from f (x, y), calculate x̂ = E(x|y) (the filtered and smoothed factors from our true
observations y), and calculate x̂+ = E(x+ |y + ) (the filtered and smoothed factors from
our simulated observations y + ). Then the draw from our desired distribution f (x|y) is
x̃ = x̂ + x+ − x̂+ . The values x̂ are the posterior means for the factors conditional on the
data y, and x+ − x̂+ are the simulated zero mean shocks. For our purposes the authors
point out that this approach can be made even more efficient as follows:

Algorithm 1

1. Begin with a draw for \(x_0\), \(\varepsilon_t\) at every period t, and \(e_t\) at every period t and iterate
equations (6.1) and (6.2) forward to obtain a draw for factors \(x_t^+\) and a draw for
observations \(y_t^+\).

2. Obtain yt∗ = yt − yt+ .

3. Obtain x̂∗t by filtering and smoothing yt∗ using the disturbance smoother described
in chapter 3.

4. We can then calculate our simulated factors as
\[ \tilde{x}_t = \hat{x}_t^* + x_t^+ \]
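Algorithm 1 can be sketched as follows. This is an illustrative NumPy version, not the book's companion C++/R code; a plain Kalman filter with an RTS smoother stands in for the chapter 3 disturbance smoother (both deliver the same smoothed means), and the function names are my own.

```python
import numpy as np

def kalman_smooth(y, H, B, Q, R, x0, P0):
    """Kalman filter + RTS smoother; returns the smoothed means x_{t|T}."""
    T, m = y.shape[0], B.shape[0]
    xf = np.zeros((T, m)); Pf = np.zeros((T, m, m))
    xp = np.zeros((T, m)); Pp = np.zeros((T, m, m))
    x, P = x0, P0
    for t in range(T):
        x, P = B @ x, B @ P @ B.T + Q
        xp[t], Pp[t] = x, P
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x, P = x + K @ (y[t] - H @ x), P - K @ H @ P
        xf[t], Pf[t] = x, P
    xs = xf.copy()
    for t in range(T - 2, -1, -1):
        g = Pf[t] @ B.T @ np.linalg.inv(Pp[t + 1])
        xs[t] = xf[t] + g @ (xs[t + 1] - xp[t + 1])
    return xs

def draw_factors(y, H, B, Q, R, rng):
    """One draw x~ | y following Algorithm 1, plugging in x0 = 0."""
    T, m, k = y.shape[0], B.shape[0], H.shape[0]
    x_plus = np.zeros((T, m)); y_plus = np.zeros((T, k))
    x = np.zeros(m)                                   # diffuse initialisation, x0 = 0
    for t in range(T):                                # step 1: simulate the model
        x = B @ x + rng.multivariate_normal(np.zeros(m), Q)
        x_plus[t] = x
        y_plus[t] = H @ x + rng.multivariate_normal(np.zeros(k), R)
    y_star = y - y_plus                               # step 2
    x_hat_star = kalman_smooth(y_star, H, B, Q, R,    # step 3
                               np.zeros(m), 1e6 * np.eye(m))
    return x_hat_star + x_plus                        # step 4
```

Only smoothed means are needed, which is why the disturbance smoother of chapter 3 — which never inverts \(P_{t+1|t}\) — makes this so much cheaper than backward sampling.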

An important question is what to use as our draw for x0 . As previously, the best way
to avoid issues of this initial vector of factors is to begin our algorithm with a large factor
variance, that is, allow for uncertainty regarding x0 . Using diffuse values to initialize the
filter and smoother allows us to simply plug in x0 = 0 and proceed with algorithm 1.
This excellent algorithm provides a quick, efficient, and easy to code method for sam-
pling factors. Once we have a draw for factors, we can then draw parameters based on
these simulated factors. Posterior distributions for parameters given factors are presented
in chapter 1. Because we will be using an exact factor model, that is, we will treat the
covariance matrix for shocks to observations R as diagonal, we can estimate each diagonal
element of R and each row of H using a normal-inverse gamma conjugate prior.
Because our draw for Rjj , the j th diagonal of the covariance matrix for shocks to
the observation equation and thus the variance of shocks to the j th observed series, is

conditional on the observations and factors only, while our draw for Hj , the j th row of H,
is conditional on Rjj , we will begin by drawing variances. Specifically, for series j we draw
Rjj from
 
(6.3)  \[ f(R_{jj}|\tilde{X}, y_j) \sim IG\left( (y_j - \tilde{X}\beta_T)'(y_j - \tilde{X}\beta_T) + (\beta_T - h_0)'\Lambda_0(\beta_T - h_0) + s_0,\ T + \nu_0 \right) \]

In the above \(\beta_T = (\tilde{X}'\tilde{X} + \Lambda_0)^{-1}(\tilde{X}'y_j + \Lambda_0 h_0)\). \(s_0\) is our prior scale parameter — our
prior for the variance of shocks when multiplied by ν0 . ν0 is our prior “degrees of freedom”,
that is, ν0 determines how aggressively we shrink Rjj . Λ0 is a diagonal matrix determining
tightness on our prior for Hj , denoted h0 . Typically Λ0 will be an identity matrix times
a (scalar) tightness parameter, with larger values corresponding to a tighter prior, and h0
will be zero for all rows of H, though we could use different priors for each row if desired.
Once we have a draw for \(R_{jj}\), we draw row j of H from

(6.4)  \[ f(H_j|R_{jj}, \tilde{X}, y_j) \sim N\left( \Lambda_T^{-1}(\tilde{X}'y_j + \Lambda_0 h_0),\ R_{jj}\Lambda_T^{-1} \right) \]

where \(\Lambda_T = \tilde{X}'\tilde{X} + \Lambda_0\). Note that the posterior mean for \(H_j\) is just \(\beta_T\) as defined above.
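As a concrete illustration of (6.3) and (6.4), the following NumPy sketch draws \(R_{jj}\) and then \(H_j\) for one series. The function name and the shape/scale inverse-gamma convention used below (shape \((T+\nu_0)/2\), scale equal to half the first argument of (6.3)) are my own assumptions, not the book's companion code; inverse-gamma conventions differ across texts.

```python
import numpy as np

def draw_Rjj_Hj(X, yj, h0, Lambda0, s0, nu0, rng):
    """Draw (R_jj, H_j) for one observed series from the normal-inverse-gamma
    posterior, conditional on simulated factors X (T x m) and data yj (T,)."""
    T = X.shape[0]
    LambdaT = X.T @ X + Lambda0
    betaT = np.linalg.solve(LambdaT, X.T @ yj + Lambda0 @ h0)   # posterior mean
    resid = yj - X @ betaT
    scale = resid @ resid + (betaT - h0) @ Lambda0 @ (betaT - h0) + s0
    # inverse-gamma draw via 1 / gamma; numpy's gamma takes (shape, scale)
    Rjj = 1.0 / rng.gamma((T + nu0) / 2.0, 2.0 / scale)
    Hj = rng.multivariate_normal(betaT, Rjj * np.linalg.inv(LambdaT))
    return Rjj, Hj
```

With a tight prior (large \(\Lambda_0\)) the draw for \(H_j\) shrinks towards \(h_0\); with a large \(\nu_0\) the draw for \(R_{jj}\) shrinks towards \(s_0/\nu_0\).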
Our draws for Q (the covariance of shocks to the transition equation) and B are nearly
identical except that, because Q contains off-diagonal elements, we will use a normal-inverse
Wishart conjugate prior. Beginning again with covariances, we draw Q from
 
(6.5)  \[ f(Q|\tilde{X}, Y) \sim IW\left( (Y - \tilde{X}B_T)'(Y - \tilde{X}B_T) + (B_T - B_0)'\Lambda_0(B_T - B_0) + V_0,\ \nu_0 + T \right) \]

where \(B_T = (\tilde{X}'\tilde{X} + \Lambda_0)^{-1}(\tilde{X}'Y + \Lambda_0 B_0)\). Again, \(\Lambda_0\) determines the tightness on our
prior for B, denoted B0 . Typically we will use a zero prior for B, though using a random
walk prior is also popular. V0 is our prior scale parameter (our prior for the covariance to
shocks when multiplied by ν0 ) and ν0 determines how aggressively we shrink Q. In most
programming languages, including the C++ and R code associated with this book, we will
need to use the vectorized form of our posterior for B. Thus we draw \(\beta = vec(B')\) from

(6.6)  \[ f(\beta|Q, \tilde{X}, Y) \sim N\left( vec\left(\Lambda_T^{-1}(\tilde{X}'Y + \Lambda_0 B_0)\right),\ \Lambda_T^{-1} \otimes Q \right) \]

where \(\Lambda_T = \tilde{X}'\tilde{X} + \Lambda_0\) and our posterior mean (in matrix format) is just \(B_T\) as defined
above.
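(6.5) and (6.6) can be sketched the same way. The inverse-Wishart draw below uses the fact that, for integer degrees of freedom, \(Q^{-1}\) can be drawn as a sum of outer products of \(N(0, S^{-1})\) vectors; the function name and the row-major vec convention are my own assumptions, not the book's companion code.

```python
import numpy as np

def draw_Q_B(X, Y, B0, Lambda0, V0, nu0, rng):
    """Draw (Q, B) from the normal-inverse-Wishart posterior of the VAR
    Y = X Beta + E, i.e. x_t = B z_{t-1} + e_t with regressors in the rows of X.
    B0 is the prior mean in regression (q x m) orientation; nu0 must be an
    integer for the Wishart construction below."""
    T, m = Y.shape
    LambdaT = X.T @ X + Lambda0
    BT = np.linalg.solve(LambdaT, X.T @ Y + Lambda0 @ B0)       # posterior mean
    resid = Y - X @ BT
    S = resid.T @ resid + (BT - B0).T @ Lambda0 @ (BT - B0) + V0
    # inverse-Wishart: Q^{-1} ~ Wishart(nu0 + T, S^{-1}), built from normal draws
    Z = rng.multivariate_normal(np.zeros(m), np.linalg.inv(S), size=nu0 + T)
    Q = np.linalg.inv(Z.T @ Z)
    # matrix-normal draw: row covariance LambdaT^{-1}, column covariance Q;
    # the row-major vec of the (q, m) coefficient matrix then has covariance
    # LambdaT^{-1} (kron) Q, as in (6.6)
    cov = np.kron(np.linalg.inv(LambdaT), Q)
    beta = rng.multivariate_normal(BT.flatten(), cov)
    B = beta.reshape(BT.shape).T                                # book orientation (m x q)
    return Q, B
```

For the many-lag case, q = m × p and only the B block of the companion matrix is drawn; the identity rows of the companion form are fixed.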
Once we have a draw for parameters we can again draw factors. We continue these
iterations until our posterior distributions converge to a stationary distribution, and then
draw a sufficient number of parameters to suit our needs, that is, estimating posterior
means and variances of the parameters themselves. This whole process is re-stated in the
following algorithm:

Algorithm 2

1. Begin with a guess for parameters.

2. Given parameters, sample factors following Algorithm 1.

3. Given factors obtained in step (2), draw:

a) Rjj from the distribution in (6.3) and Hj from the distribution in (6.4) for every
series j.
b) Q from the distribution in (6.5) and B from the distribution in (6.6).

4. Repeat steps 2 and 3 until distributions converge to stationary distributions.

5. Once distributions converge repeat steps 2 and 3 storing draws for R, H, Q, and B
at each iteration.

The question of when the distribution has converged is not straightforward — see
Robert and Casella (2010) for a discussion of convergence criteria. The software associated
with this book draws a default 500 samples in a burn loop before moving on to step 5
above and storing output. Convergence criteria can then be applied to the stored output
to determine whether burn time must be increased.

6.2 Identification
Bayesian estimation of dynamic factor models is complicated by the fact that factors are
not identified. Identifying factors ensures that we are in fact drawing from the same
distributions in each step of Algorithm 2. The issue of identification is described in section
5.2. The simplest, though not the only, solution to identification of Bayesian dynamic
factor models is what Stock and Watson (2011) call naming factors identification, that is,
setting the top m × m submatrix of H (where m is the number of factors) to be the identity
matrix. Employing this identification scheme in our sampling iterations requires that we
create a temporary m × m H matrix for the first m series in Y , which I will call M . Based
on our draw for factors X̃, draw the elements of Rjj , j ∈ (1, m) and M for the first m series
from distributions (6.3) and (6.4) respectively. To enforce that the top m × m submatrix
of H is Im we then use M to normalize our draw for the factors X̃. This turns out to be
quite simple. Using the observation equation, our normalization is
\[ y_{t,1:m} = \underbrace{MM^{-1}}_{I_m}\underbrace{M\tilde{x}_{t,old}}_{\tilde{x}_{t,new}} + \varepsilon_t \]

Thus for a model with one lag, our normalization is just2

\[ \tilde{X}_{new} = \tilde{X}_{old}M' \]

2 We postmultiply by \(M'\) here as \(\tilde{X}_{old}\) is the matrix of simulated factors with time indexed by rows,
so each row transforms as \(\tilde{x}'_{t,new} = \tilde{x}'_{t,old}M'\)

Because we have not yet estimated the transition equation (we do that after normalizing
factors) or any other elements of H and R, this is all that is needed to identify our model.
For a model with p lags where
\[ z_t = \begin{bmatrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-p+1} \end{bmatrix} \]
This normalization becomes

(6.7)  \[ \tilde{Z}_{new} = \tilde{Z}_{old}(I_p \otimes M)' \]
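To make the normalization concrete: with time in the rows of \(\tilde{Z}\) the rotation enters transposed, and the loadings rotate by the inverse, leaving fitted values unchanged. A small NumPy check (all values hypothetical):

```python
import numpy as np

# naming-factors normalisation: rotate simulated factors so the loadings of the
# first m series become I_m; the fit H z_t is unchanged by the rotation
rng = np.random.default_rng(1)
m, p, T = 2, 3, 50
M = rng.standard_normal((m, m))                  # draw for the first m loadings
Z_old = rng.standard_normal((T, m * p))          # time in rows, p lags of factors

Z_new = Z_old @ np.kron(np.eye(p), M).T          # equation (6.7), rows transform

# loadings rotate by M^{-1}, so fitted values are preserved:
H_old = rng.standard_normal((5, m))
H_new = H_old @ np.linalg.inv(M)
X_old, X_new = Z_old[:, :m], Z_new[:, :m]
assert np.allclose(X_old @ H_old.T, X_new @ H_new.T)
# and the top m x m block of the new loadings is the identity:
assert np.allclose(M @ np.linalg.inv(M), np.eye(m))
```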

6.3 An Example: Bayesian vs. ML Estimation for Simulated Data
As a first example I simulate ten series (k = 10) from a factor model with observation
equation

(6.8)  \[ y_t = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & -1 \\ 1 & -1 & 0.5 \\ 0.5 & 0.5 & 0.5 \\ 0.5 & 1 & 1 \\ 0 & -1 & 1 \\ 0 & 0.5 & -1 \\ 0 & 0.5 & 1 \end{bmatrix} x_t + \varepsilon_t \]

and transition equation

(6.9)  \[ x_t = \begin{bmatrix} 0.4 & 0 & 0 & 0.3 & 0 & 0 & 0.1 & 0 & 0 \\ 0 & 0.3 & 0.2 & 0 & 0.2 & 0.1 & 0 & 0.1 & 0 \\ 0 & 0.2 & 0.4 & 0 & 0 & 0.2 & 0 & 0.1 & 0.1 \end{bmatrix} \begin{bmatrix} x_{t-1} \\ x_{t-2} \\ x_{t-3} \end{bmatrix} + e_t \]

where
\[ \varepsilon_t \sim N(0, I_{10}) \]
and
\[ e_t \sim N\left(0, \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & -0.5 \\ 0 & -0.5 & 1 \end{bmatrix}\right) \]

        T = 50   T = 100   T = 200
Bayes    0.42     0.34      0.30
ML       0.44     0.36      0.33

Table 6.1: MSE for Bayesian vs. ML Estimation

Note that the first observed series \(y_t^1\) is just the first factor \(x_t^1\) and that this factor is
independent of the other two factors.
My objective in this section will be to estimate the “true” data \(y_t^{true} = y_t - \varepsilon_t\), or simply
\(y_t^{true} = Hx_t\), based on the observed series \(y_t\). Note that this true series is never observed.
In addition to the noise εt I drop observations from the first three series. Thus for these
series I have five observations followed by five missing values, then five observations and
so on. Simulations proceed by

1. Simulating data according to equations (6.1) and (6.2)

2. Dropping blocks of five observations from the first three series

3. Estimating the model by simulation as outlined in chapter 1 and by maximum like-


lihood as outlined in chapter 5

4. Calculating the mean squared error 1/T (ŷt − yttrue )2 for all observed series and
P
averaging over the 10 series for 1000 repetitions

Table 6.1 presents the results when I estimate a model with three factors and two lags. As
is evident from the table Bayesian estimation yields slightly better results in terms of mean
squared error. However, this comes at the cost of longer run time as we are simulating
entire distributions, not simply estimating a few parameters. Note, however, that Bayesian
estimation may be faster for models with large numbers of lags, such as mixed frequency
models, as inverting the covariance matrix for predicted factors Pt+1|t becomes increasingly
burdensome as p gets large.
Chapter 7

An Introduction to Mixed
Frequency Models

Because state space models are so adept at handling missing data they are particularly
well suited to mixed frequency data sets in which, for example, a quarterly variable will
not be observed for two out of three months. The software accompanying this book does
not include routines for mixed frequency data as each case tends to be fairly specific and
components of routines will need to be tailored to the application. Instead, I provide a
brief introduction to simply outline how one might incorporate mixed frequency data into
the models discussed thus far.
Suppose first that our model is in log levels and, as a concrete example, that frequencies
are either monthly or quarterly. Denote ytq the log of a quarterly observation in month t
and ytm the log of a monthly observation in month t. Then
(7.1)  \[ e^{y_t^q} = e^{y_t^m} + e^{y_{t-1}^m} + e^{y_{t-2}^m} \]

The difficulty lies in the fact that the observation equation is linear in the log variables while
equation (7.1) is not; to overcome this issue simply take a linear approximation of (7.1) yielding

(7.2)  \[ y_t^q = \frac{1}{3}\left(y_t^m + y_{t-1}^m + y_{t-2}^m\right) \]

Plugging the observation equation into the above yields the linear state space structure:

(7.3)  \[ y_t^q = \frac{1}{3}Hx_t + \frac{1}{3}Hx_{t-1} + \frac{1}{3}Hx_{t-2} + \varepsilon_t \]

Note that this requires that the model include at least three lags of factors, although one
need not estimate coefficients on factors with more than one lag in the transition equation.


To put the model into log differences we begin with equation (7.2) and note that what
we observe, ∆ytq , is

(7.4)  \[ y_t^q - y_{t-3}^q = \frac{1}{3}(y_t^m - y_{t-3}^m) + \frac{1}{3}(y_{t-1}^m - y_{t-4}^m) + \frac{1}{3}(y_{t-2}^m - y_{t-5}^m) \]
\[ = \frac{1}{3}\Delta y_t^m + \frac{2}{3}\Delta y_{t-1}^m + \Delta y_{t-2}^m + \frac{2}{3}\Delta y_{t-3}^m + \frac{1}{3}\Delta y_{t-4}^m \]
This is the result presented in Mariano and Murasawa (2003). Unlike the levels case, we
now need to include at least four lags of the factors.
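The tent-shaped weights in (7.4) are easy to verify numerically; the following check uses arbitrary illustrative values:

```python
import numpy as np

# check (7.4): the average three-period difference equals the tent-shaped
# weighted sum of monthly first differences
rng = np.random.default_rng(2)
y = rng.standard_normal(10)                      # a monthly log-level series
dy = np.diff(y, prepend=0.0)                     # dy[t] = y[t] - y[t-1]
t = 9
lhs = ((y[t] - y[t-3]) + (y[t-1] - y[t-4]) + (y[t-2] - y[t-5])) / 3.0
w = np.array([1/3, 2/3, 1.0, 2/3, 1/3])          # weights on dy_t, ..., dy_{t-4}
rhs = w @ dy[t-4:t+1][::-1]
assert np.isclose(lhs, rhs)
```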
Estimates of H and R now must take into account the structure of the data in (7.2) for
level data or (7.4) for differenced data. A simple way to proceed is to define a new helper
matrix Jq for low frequency data. Maintaining the monthly/quarterly structure and the
three lag model by way of example, the helper matrix for quarterly data, assuming we are
dealing with data in levels, will be

(7.5)  \[ J_q = \begin{bmatrix} \frac{1}{3}I_m & \frac{1}{3}I_m & \frac{1}{3}I_m \end{bmatrix} \]

or, for quarterly variables in first differences

(7.6)  \[ J_q = \begin{bmatrix} \frac{1}{3}I_m & \frac{2}{3}I_m & I_m & \frac{2}{3}I_m & \frac{1}{3}I_m \end{bmatrix} \]
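Constructing these helper matrices is a one-liner; `make_Jq` below is a hypothetical helper name, not part of the book's companion code:

```python
import numpy as np

def make_Jq(m, weights):
    """Helper J_q = [w_0 I_m, w_1 I_m, ...] mapping stacked factor lags to one
    low-frequency observation."""
    return np.hstack([w * np.eye(m) for w in weights])

m = 3
Jq_level = make_Jq(m, [1/3, 1/3, 1/3])               # equation (7.5)
Jq_diff = make_Jq(m, [1/3, 2/3, 1.0, 2/3, 1/3])      # equation (7.6)
# applied to z_t = [x_t, x_{t-1}, x_{t-2}], the levels version averages the lags:
z = np.concatenate([np.full(m, 1.0), np.full(m, 2.0), np.full(m, 3.0)])
assert np.allclose(Jq_level @ z, 2.0)
```

The quarterly rows of the observation equation then use \(H J_q\) in place of H.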

The difficulty with a more general structure, monthly/daily data for example, is two
fold. First, the number of high frequency periods (days) in a low frequency period (months)
gets large. For differenced data our helper matrix for monthly variables, assuming 31 days
in the month, becomes the \(m \times 61m\) matrix

(7.7)  \[ J_q = \begin{bmatrix} \frac{1}{31}I_m & \frac{2}{31}I_m & \ldots & \frac{2}{31}I_m & \frac{1}{31}I_m \end{bmatrix} \]

Thus the vector of factors zt must include at least 61 lags. Second, the number of days in
the month is not constant. Nor is the number of days in February constant from one year
to the next. Thus our helper matrix Jq must change depending on the number of days in
the current month. Though conceptually simple, these facts can become computationally
difficult. To illustrate the point, our daily/monthly data requires 61 lags in zt . If our model
includes a modest three factors, then zt has 183 elements, a huge number for a literature
in which there are typically only a few latent factors.
The structure of the transition equation is an additional consideration when the number
of high frequency periods (days) in a low frequency period (months) is large. The frequency
for factors is the highest frequency in the model — in our daily/monthly example factors
are thus daily. However, assuming we are interested in nowcasting or forecasting a low
frequency (monthly) variable we would like to include, in our monthly-daily example,
at least a month of lags in the transition equation. This raises the problem of over-
parameterization: with three factors and 30 lags we will be estimating 90 parameters in

the transition equation excluding the covariances in Q. Our solution is to again introduce
a helper matrix which, for the transition equation, we call JB . We then specify the number
of lags at each frequency in the model. For a daily/weekly/monthly model we then have
\(p_d\) daily lags, \(p_w\) weekly lags, and \(p_m\) monthly lags (in the transition equation we use the
term month to refer to 30 days regardless of the actual number of days in the month). For
a model with two factors and \(p_d = 1\), \(p_w = 1\), and \(p_m = 1\) our transition equation
can be described by

\[ B = \begin{bmatrix} b_{11}^d & b_{12}^d & b_{11}^w & b_{12}^w & b_{11}^m & b_{12}^m \\ b_{21}^d & b_{22}^d & b_{21}^w & b_{22}^w & b_{21}^m & b_{22}^m \end{bmatrix} \]

\[ J_B = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & \ldots \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & \ldots \\ \frac{1}{7} & 0 & \frac{1}{7} & 0 & \frac{1}{7} & 0 & \frac{1}{7} & \ldots \\ 0 & \frac{1}{7} & 0 & \frac{1}{7} & 0 & \frac{1}{7} & 0 & \ldots \\ \frac{1}{30} & 0 & \frac{1}{30} & 0 & \frac{1}{30} & 0 & \frac{1}{30} & \ldots \\ 0 & \frac{1}{30} & 0 & \frac{1}{30} & 0 & \frac{1}{30} & 0 & \ldots \end{bmatrix} \]

where \(b_{21}^w\) refers to the multiplier of factor 1 on factor 2 at a weekly frequency, and finally

\[ x_{t+1} = BJ_B z_t + e_t \]
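A sketch of building \(J_B\) for this daily/weekly/monthly example, with one lag per frequency; `make_JB` and the hard-coded spans (1, 7, 30) are my own illustrative choices, not the book's companion code:

```python
import numpy as np

def make_JB(m, spans=(1, 7, 30)):
    """Transition-equation helper J_B: one block-row per frequency, averaging
    the most recent `span` high-frequency lags of the m factors.
    spans=(1, 7, 30) is the daily/weekly/monthly case with one lag each."""
    n_lags = max(spans)                              # z_t stacks x_t, ..., x_{t-n_lags+1}
    JB = np.zeros((m * len(spans), m * n_lags))
    for i, s in enumerate(spans):
        for lag in range(s):                         # weight 1/s on each of the s lags
            JB[i*m:(i+1)*m, lag*m:(lag+1)*m] = np.eye(m) / s
    return JB

JB = make_JB(2)                                      # 6 x 60 for two factors
```

B then has one m × m coefficient block per frequency, keeping the parameter count at \(m^2(p_d + p_w + p_m)\) rather than one block per daily lag.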

Equipped with the (sparse) helper matrices described above our model fits into the
smoothing and filtering algorithms described in chapter 3. The only complication is that
H̃t = HJt changes from one month to the next as the number of days in a month changes.
This does not present any theoretical difficulties but requires careful coding when writing
estimation routines.
References

Doz, C., Giannone, D., and Reichlin, L., 2012. A Quasi-Maximum Likelihood Approach
for Large, Approximate Dynamic Factor Models. Review of Economics and Statistics,
94(4):1014–1024.
Durbin, J. and Koopman, S., April 2001. An efficient and simple simulation smoother for
state space time series analysis. Computing in Economics and Finance 2001 52, Society
for Computational Economics.
Durbin, J. and Koopman, S. J., 2012. Time Series Analysis by State Space Methods. Oxford
University Press.
Hamilton, J., 1994. Time Series Analysis. Princeton University Press.
Kalman, R., 1960. A new approach to linear filtering and prediction problems. Transactions
of the ASME - Journal of Basic Engineering, 82:35–45.
Koop, G., Poirier, D., and Tobias, J., 2007. Bayesian Econometric Methods. Cambridge
University Press.
Lutkepohl, H., 2007. New Introduction to Multiple Time Series Analysis. Springer.
Mariano, R. S. and Murasawa, Y., 2003. A new coincident index of business cycles based
on monthly and quarterly series. Journal of Applied Econometrics, 18(4):427–443.
Robert, C. P. and Casella, G., 2010. Monte Carlo Statistical Methods. Springer Publishing
Company, Incorporated.
Sarkka, S., 2013. Bayesian Filtering and Smoothing. Cambridge University Press.
Stock, J. H. and Watson, M. W., 2011. Dynamic Factor Models. In Clements, M. P. and
Hendry, D. F., editors, The Oxford Handbook of Economic Forecasting. Oxford, Oxford
Handbooks.
Stock, J. H. and Watson, M. W., 2016. Factor Models and Structural Vector Autore-
gressions in Macroeconomics. In Taylor, J. B. and Uhlig, H., editors, Handbook of
Macroeconomics, volume 2A, pages 415–525. North-Holland.


Watson, M. W. and Engle, R. F., 1983. Alternative Algorithms for the Estimation of
Dynamic Factor, Mimic and Varying Coefficient Regression Models. Journal of
Econometrics, 23(3):385–400.
Index

Approximate factor model, 37
Bayes’ theorem, 9
Bayesian estimation by simulation, 45
Conditional distribution, 9
Conditional distribution, normally distributed data, 21
Disturbance smoother, 29
Exact factor model, 47
Gibbs sampling, 45
Identification, 42, 49
Initial values, 43
Kalman Filter, 21
Kalman Smoother, 25
Likelihood function, Kalman filter, 24
Linear regression, 13
Linear Regression, multivariate, 15, 18
Linear Regression, univariate, 14, 16
Maximum likelihood estimation, 39
Mixed frequency factor models, 53
Monte Carlo estimation, 45
Normal-inverse Wishart conjugate prior, 18
Normal-inverse gamma conjugate prior, 16
Normal-normal conjugate density, 12
Normal-normal conjugate prior (linear regression), 14
Normal-normal conjugate prior (multivariate linear regression), 15
Posterior, 10
Principal components, 35
Prior, 10
Reduced rank regression, 33
Seasonal adjustment by maximum likelihood, 43
Steady state Kalman filter, 26