
Preface

This book has been developed over the years and has been used in a compulsory course in econometrics at the University of Copenhagen. The starting point is an assumed basic knowledge of linear regression models at the level of, e.g., Wooldridge (2006). In addition, these notes have been used in a summer school arranged by the United Nations University, UNU-WIDER, and Universidade Eduardo Mondlane, Maputo, Mozambique.
The book is still in development, and comments on the choice of material and presentation are most welcome. The present version has benefitted from comments from Mette Ejrnæs, Rasmus Søndergaard Pedersen, Anders Rahbek, Frederik Vilandt Rasmussen, Morten Nyboe Tabor, as well as many students over the years.

Heino Bohn Nielsen


September 1, 2023.
Contents

1 Characteristic Features of Economic Time Series
  1.1 Main Features
  1.2 Stochastic Processes and Stationarity
  1.3 Measuring Time Dependence
  1.4 Transformations to Stationarity
2 Linear Regression Models for Economic Time Series
  2.1 The Linear Regression Model
  2.2 The OLS Estimator and its Properties
  2.3 Formulation and Misspecification Testing
  2.4 Empirical Example
  2.5 Summary and Practical Remarks
3 Introduction to Likelihood Theory
  3.1 The Likelihood Function
  3.2 The Maximum Likelihood Estimator
  3.3 Properties of the MLE
  3.4 Example: The AR(1) Model
  3.5 Three Classical Test Principles
  3.6 Conclusion and Main Points
4 Univariate Models for Stationary Economic Time Series
  4.1 Estimating Dynamic Effects
  4.2 Stationarity and Weak Dependence
  4.3 Moving Average Models
  4.4 Autoregressive Models
  4.5 ARMA and ARIMA Models
  4.6 Estimation and Model Selection
  4.7 Univariate Forecasting
  4.8 Further Readings
  4.A Moving Average Solution for the AR(p)
5 The Autoregressive Distributed Lag Model
  5.1 The Model
  5.2 Dynamic- and Long-Run Multipliers
  5.3 Error-Correction Model
  5.4 General Case
  5.5 Conditional Forecasts
  5.6 Further Readings
6 Analysis of Vector Autoregressive Models
  6.1 Introduction
  6.2 The VAR Model
  6.3 MA Solution and Stationarity Condition
  6.4 Conditioning and Single-Equations
  6.5 Estimation and Inference
  6.6 Impulse-Responses and Structural VARs
  6.7 Forecasting
  6.8 Granger Causality
  6.9 Further Readings
  6.A Eigenvalues and Eigenvectors
7 Non-Stationary Time Series and Unit Root Testing
  7.1 Stationary and Non-Stationary Time Series
  7.2 Non-Stationarity in Economic Time Series
  7.3 Stationary and Unit Root Autoregressions
  7.4 Deterministic Terms
  7.5 Testing for a Unit Root
  7.6 Dickey-Fuller Test with a Trend Term
  7.7 Further Issues in Unit Root Testing
  7.8 Further Readings
  7.A Solution for an I(1) Process
8 Analysis of Non-Stationary and Co-integrated Time Series
  8.1 Introduction and Main Statistical Tools
  8.2 Mathematical Structure of Co-integration
  8.3 How is the Equilibrium Sustained?
  8.4 Introduction to Estimation and Inference
  8.5 Estimation Based on a Static Regression
  8.6 Dynamic Regression Models
  8.7 Concluding Remarks and Further Readings
9 The Co-integrated Vector Autoregression
  9.1 The Vector Error Correction Model
  9.2 Inference
  9.3 Test for the Co-integration Rank
  9.4 The Moving Average Solution
  9.5 Summary
  9.6 Further Issues
  9.7 Concluding Remarks and Further Readings
10 Modelling Volatility in Financial Data: Introduction to ARCH
  10.1 Changing Volatility in Time Series
  10.2 The ARCH Model Defined
  10.3 A Test for No-ARCH Effects and Misspecification Testing
  10.4 Generalized ARCH (GARCH) Models
  10.5 Volatility Forecasts
  10.6 Extensions to the Basic Model
  10.7 Multivariate ARCH Models
  10.8 Concluding Remarks
11 Introduction to Regime-Switching Models
  11.1 Introduction
  11.2 Threshold Model
  11.3 Smooth Transition Model
  11.4 Markov Switching Model
  11.5 More on Linearity Testing
  11.A Filter Algorithm for MS Models
12 State-Space Models and the Kalman Filter
  12.1 The Linear State-Space Model
  12.2 The Kalman Filter
  12.A Kalman Filter
13 Instrumental Variables and GMM Estimation
  13.1 Introduction
  13.2 Method of Moments Estimation
  13.3 GMM Estimation
  13.4 Weight-Matrix Estimation
  13.5 Test of Overidentifying Conditions
  13.6 Empirical Examples
  13.7 Further Readings
  13.A Quasi-Maximum-Likelihood Estimation
  13.B Linear IV Estimation and 2SLS
14 Introduction to Vector and Matrix Differentiation
  14.1 Conventions for Scalar Functions
  14.2 Conventions for Vector Functions
  14.3 Some Special Functions
  14.4 The Linear Regression Model
References
Chapter 1

Characteristic Features of Economic Time Series

This chapter introduces some key concepts in the analysis of economic time series. First, we present examples of characteristic features of economic time series. We then present some underlying assumptions on the way time series data are sampled and give a heuristic introduction to the theory of stochastic processes. This leads to the definition of stationarity, and we discuss how a time series which is not stationary can sometimes be made stationary by means of simple transformations. The assumption of stationarity turns out to be important for the application of a law of large numbers, such that the well-known results from cross-sectional regression analysis carry over to the time series case.

1.1 Main Features


Most data in macroeconomics and finance come in the form of time series. A time series is a set of observations

y_1, y_2, \ldots, y_t, \ldots, y_T,   (1.1)

where the index t represents time, such that the observations have a natural temporal ordering. Sometimes we also write \{y_t\}_{t=1}^{T}. Most time series in economics, and all time series considered in this course, are observed at fixed intervals, such that the distances between successive time points, t and t+1, are constant. This section presents some characteristic features of economic time series.

Time Dependence. One characteristic feature of many economic time series is a clear dependence over time, and there is often a non-zero correlation between observations at time t and t-h for small or moderate values of h. As an example, Figure 1.1 (A) shows monthly data for the US unemployment rate in percent of the labour force, 1948(1)-2018(8). The development in unemployment is relatively sluggish, and a reasonable conjecture on the unemployment rate in a given month (for example the most recent, u_{2018(8)}) would be a function of the unemployment rate in the previous month, u_{2018(7)}. Often we say that there is a high degree of persistence in the time series.

The temporal ordering of the observations implies that the observation for 2018(7) always precedes the observation for 2018(8), and technically u_{t-1} is predetermined when u_t is generated. This suggests that the past of u_t can be included in the information set in the analysis of u_t, and a natural object of interest in a time series analysis would be a model for u_t given the past, e.g. the conditional expectation,

E(u_t | u_{t-1}, u_{t-2}, \ldots, u_1).   (1.2)

Throughout, it will be important to distinguish between the conditional expectation and the unconditional expectation, E(u_t).

The idea of time dependence, and an introduction to simple models for dependent time series, is given below and in Chapter 2. A more detailed treatment of the mathematical structure of the models is given in Chapter 4.

Trends. Another characteristic feature shared by many time series is a tendency to be trending. As an example, consider in Figure 1.1 (B) the quarterly Danish productivity, 1971(1)-2005(2), compiled as the log of real output per hour worked. In this case the trend represents the underlying economic growth, and the slope of the trend is by and large constant and approximately equal to 0.5% from quarter to quarter.

Long-Run Co-Movements. A third interesting feature is a tendency for some time series to move together in the longer run. Graph (C) shows time series for the log of real disposable income, y_t, and the log of real consumption, c_t, for the private sector in Denmark, 1971(1)-2017(3). The movements from quarter to quarter sometimes differ markedly, but the slow underlying movements seem to be related. This suggests that there could be some sort of equilibrium relation governing the movements of consumption and income in the long run.

A stylized economic theory may suggest that while both y_t and c_t have an upward drift due to the growth in productivity and real wages, the savings rate, y_t - c_t, should be stable. That implies that the short-term movements in the consumption and income series in Figure 1.1 (C) may differ, but they are tied together in the longer run and do not move too far apart.
The phenomenon of co-movements between the long-term developments in economic time series is known as co-integration between time series with unit roots. The topic of unit-root non-stationarity is discussed in Chapter 7, while the analysis of co-integrated economic time series is outlined in Chapter 8, based on single-equation regression models, and in Chapter 9, based on the more general vector autoregression.

Figure 1.1: Examples of macro-economic and financial time series. (A) US unemployment rate; (B) Danish productivity (logs); (C) Danish income and consumption (logs); (D) Consumer prices, cloth and footwear; (E) Stock price of Pandora A/S; (F) Daily change in the S&P 500 index (%).



Seasonality. Another feature that characterizes some economic time series is a tendency towards systematic movements over the calendar. Examples include a specific pattern of energy use over the day, different sales over the weekdays, or different production over the calendar months or quarters. Figure 1.1 (D) shows monthly time series observations of consumer prices for cloth and footwear taken from the Danish consumer price index, cpi_t, as well as a 12-month moving average, calculated as

movavg_t = \frac{1}{12} \sum_{i=0}^{11} cpi_{t-i}.   (1.3)

The picture shows marked movements from month to month, with systematically lower prices in January/February and July/August, reflecting winter and summer sales. The 12-month moving average removes most of the seasonal variation, because each moving average contains one observation of each month. The moving average is a simple example of seasonal adjustment, and the analysis of seasonally adjusted time series sometimes makes it easier to focus on other underlying features, e.g. business cycle movements.

We may note that the time series in Figure 1.1 (A)-(C) are all seasonally adjusted versions. These have been seasonally adjusted by the statistical offices using some version of the leading seasonal adjustment procedure developed by the United States Census Bureau (denoted X11, X12, or X13). This allows us to focus on the time persistence and other features of the data, and not on the seasonal movements.
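As a small illustration, the moving average in (1.3) is straightforward to compute with pandas; the sketch below is ours, and the series cpi is a simulated stand-in for the actual consumer price data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range("2001-01-01", periods=120, freq="MS")   # monthly observations
sales = np.tile([0, -2, 0, 0, 0, 0, 0, -2, 0, 0, 0, 0], 10)   # stylized Feb/Aug sales
cpi = pd.Series(100 + 0.1 * np.arange(120) + sales + rng.normal(0, 0.2, 120),
                index=dates)

# 12-month moving average as in (1.3): the mean of cpi_t, cpi_{t-1}, ..., cpi_{t-11},
# so each average contains exactly one observation of each calendar month.
movavg = cpi.rolling(window=12).mean()
```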

Level Shifts and Structural Breaks. Figure 1.1 (E) shows the daily stock price of the Danish jewelry firm Pandora A/S for the period July 1, 2018, to September 27, 2018, and it is an example of a structural break. Before August 7, the time series path exhibits a relatively flat movement around 430 Danish kroner, but then it suddenly falls to approximately 325. Later the stock price recovers slightly but continues to move at a lower level than before. The reason is that Pandora A/S, on August 7, announced a reduced expectation to turnover in 2018 and 2019.

This series is an example of a structural break, where some event, here the lowering of expectations, changes the level of the time series. In economics, level shifts often occur as a result of changes in legislation, disruptions from the adoption of new technology, or other marked changes in the structure of the economy.

Volatility Clustering. For many financial time series, not only the mean but also the variance is of interest. This is because the variance is related to measures of the uncertainty or risk of a particular variable. A characteristic feature of this type of data is a tendency for the variance, or volatility, to change over time, reflecting periods of rest and unrest in financial markets. Figure 1.1 (F) shows percentage changes from day to day in the Standard & Poor's 500 index, often abbreviated as the S&P 500, for the period January 1, 1996, to February 27, 2018. A visual inspection suggests that the volatility is much larger in some periods than in others.

The phenomenon of changing volatility is known as volatility clustering, and it is often modelled using the so-called autoregressive conditional heteroskedastic (ARCH) model. We return to this topic in Chapter 10.
One primary goal of time series analysis is to understand the characteristics presented in Figure 1.1. As you might expect from the graphs, there is not a single tool which is suitable in all cases, and the econometric literature for time series data has developed specialized tools applicable in different situations. The applicability of the tools depends on the fulfillment of certain assumptions, which have to be checked in each situation.

The main goal of the first two chapters in this book is to understand when the linear regression model, as given by

y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \ldots + \beta_k x_{kt} + \epsilon_t, \quad t = 1, 2, \ldots, T,   (1.4)

where y_t is the variable of interest and x_{1t}, x_{2t}, \ldots, x_{kt} are explanatory variables, can be safely applied to time series data, and in which situations the linear regression should be used with care. In cases where the linear regression model is not applicable, other, more elaborate, techniques have to be employed. This will be covered in later chapters.

1.2 Stochastic Processes and Stationarity


To be able to build statistical models for economic time series, and to use probability theory to quantify the uncertainty of estimated parameters, the main assumption underlying time series analysis is that the observation at time t, y_t, is a realization of a random variable, y_t. Note that the standard notation does not distinguish between the random variable and a realization, but it should be clear from the context. We often refer to an observed time series with a fixed number of observations as y_t, t = 0, 1, 2, \ldots, T, while for the random variable we may use y_t, t ∈ N, where N denotes the infinite set of non-negative integers, N = {0, 1, 2, \ldots}, or y_t, t ∈ Z, where Z denotes the set of integers, Z = {\ldots, -2, -1, 0, 1, 2, \ldots}, depending on the context.

Taken as a whole, the observed time series in (1.1) is a realization of a sequence of random variables, \{y_t\}_{t=1}^{T}, often referred to as a stochastic process.
Here we notice an important difference between cross-section data and time series data. Recall that in the cross-section case we think of a data set, \{x_i\}_{i=1}^{N}, as being sampled as N independent draws from a large population; and if N is sufficiently large we can characterize the distribution by the sample moments, e.g. the mean and variance. In the time series context, on the other hand, we are faced with T random variables, \{y_t\}_{t=1}^{T}, and only one realization from each. In general, therefore, we have no hope of characterizing the distributions corresponding to each of the random variables, unless we impose additional restrictions. Figure 1.2 (A) illustrates the idea of a general stochastic process, where the distributions differ from time to time. It is obvious that based on a single realized time series we cannot say much about the underlying stochastic process.
A realization of a stochastic process is just a sample path of T real numbers; and if history took a different course we would have observed a different sample path. If we could rerun history a number of times, M say, we would have M realized sample paths corresponding to different states of nature. Letting a superscript (m) denote the realizations (m = 1, 2, \ldots, M) we would have M observed time series:

Realization 1:   y_1^{(1)}, y_2^{(1)}, \ldots, y_t^{(1)}, \ldots, y_T^{(1)}
Realization 2:   y_1^{(2)}, y_2^{(2)}, \ldots, y_t^{(2)}, \ldots, y_T^{(2)}
   \vdots
Realization m:   y_1^{(m)}, y_2^{(m)}, \ldots, y_t^{(m)}, \ldots, y_T^{(m)}      (1.5)
   \vdots
Realization M:   y_1^{(M)}, y_2^{(M)}, \ldots, y_t^{(M)}, \ldots, y_T^{(M)}.

For each point in time, t, we would then have a cross-section of M random draws, y_t^{(1)}, y_t^{(2)}, \ldots, y_t^{(m)}, \ldots, y_t^{(M)}, from the same distribution. This cross-section is not drawn from a fixed population, but is drawn from a hypothetical population of possible outcomes, corresponding to a particular distribution in Figure 1.2 (A).

Figure 1.2: Stochastic processes and realized time series.



Often we are interested in the unconditional mean, E(y_t), which we could estimate with the sample average

\hat{E}(y_t) = \frac{1}{M} \sum_{m=1}^{M} y_t^{(m)},   (1.6)

provided that we had observed the M realized sample paths. The cross-sectional mean in (1.6) is sometimes referred to as the ensemble mean, and it is the mean of a particular distribution in Figure 1.2. This concept is fundamentally different from the time average of a particular realized sample path, e.g.

\bar{y}_T = \frac{1}{T} \sum_{t=1}^{T} y_t^{(1)}.   (1.7)

Notice that when we analyze a single realization, we ignore the superscript and use the notation y_t = y_t^{(1)}.
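The difference between (1.6) and (1.7) can be illustrated by simulation. The sketch below is our own construction: it generates M sample paths from a stationary first-order autoregression (introduced formally later) and compares the ensemble mean at a fixed t with the time average of a single path:

```python
import numpy as np

rng = np.random.default_rng(0)
M, T, phi = 500, 200, 0.7            # number of paths, path length, AR coefficient

# Simulate M independent sample paths of a stationary process with mean zero.
y = np.zeros((M, T))
for t in range(1, T):
    y[:, t] = phi * y[:, t - 1] + rng.normal(size=M)

ensemble_mean = y[:, -1].mean()      # average across realizations at fixed t, cf. (1.6)
time_average = y[0, :].mean()        # average over time for one realization, cf. (1.7)
print(ensemble_mean, time_average)   # both are close to E(y_t) = 0 here
```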

1.2.1 Stationarity
Of course, it is not possible in economics to generate more realizations of history. But if the distribution of the random variable y_t remains unchanged over time, then we can think of the T observations, y_1, \ldots, y_T, as drawn from the same distribution; and we can make inference on the underlying distribution of y_t based on observations from different points in time. The property that the distribution of y_t is the same for all t is referred to as stationarity. Formally, the definition is given as follows:

Definition 1.1 (strict stationarity): A time series, y_1, y_2, \ldots, y_t, \ldots, y_T, is strictly stationary if the joint distributions of the s + 1 random variables

(y_t, y_{t+1}, \ldots, y_{t+s})  and  (y_{t+h}, y_{t+1+h}, \ldots, y_{t+s+h})

are the same for all s ∈ N and all h ∈ N.

Strict stationarity implies that all structures and characteristics do not depend on the location on the time axis, and all moments are constant over time. Another concept focuses on the first two moments and requires only constancy of the mean and variance:

Definition 1.2 (weak stationarity): A time series, y_1, y_2, \ldots, y_t, \ldots, y_T, is weakly stationary if

E(y_t) = \mu
V(y_t) = E((y_t - \mu)^2) = \gamma_0
cov(y_t, y_{t-h}) = E((y_t - \mu)(y_{t-h} - \mu)) = \gamma_h  for h ∈ N,

for all values of t.

Note that the mean, \mu, and the variance, \gamma_0, are the same for all t, and the covariance between y_t and y_{t-h}, denoted \gamma_h, only depends on the distance between the observations, h. Since the definition is related only to means and covariances, a weakly stationary time series is sometimes referred to as covariance stationary or second-order stationary.
The idea of a stationary stochastic process is illustrated in Figure 1.2 (B). Here each new observation, y_t, contains information on the same distribution, and we can use all observations to estimate the common mean, \mu. Notice that the realized time series are identical in graphs (A) and (B), and for a small number of observations it is often difficult to distinguish stationary from non-stationary time series.

Remark 1.1 (finite moments): In many cases it is required that the variance of y_t is finite, \gamma_0 < ∞, as this allows the application of central limit theorems to scaled averages of y_t, t = 1, 2, \ldots, T. In other cases, application of limit results requires finite higher-order moments, e.g. bounded fourth-order moments, E(y_t^4) < ∞. Although moment requirements are important for particular models, the precise requirements are not always emphasized in this introductory book.

1.2.2 Weak Dependence


Recall that to show consistency of the estimators in a regression model we use the law of large numbers (LLN), stating that the sample average converges (in probability) to the population mean. And to derive the asymptotic distribution of the estimator, such that we can test hypotheses, we use the central limit theorem (CLT), stating that the appropriately normalized sample average converges (in distribution) to a normal distribution. In regression models for identically and independently distributed (i.i.d.) observations this is reasonably straightforward and the simplest versions of the LLN and CLT apply, see Wooldridge (2006, p. 774 ff.) or Nielsen (2017, Appendix C).

In a time series setting, where the i.i.d. assumption is unreasonable, things are more complicated. There exist more advanced versions of the LLN and the CLT, however, that allow the analysis of dependent observations, and in cases where such results apply, most of the results derived for regression models for i.i.d. data carry over to the analysis of time series. Two main assumptions are needed. The first important assumption is stationarity, which replaces the cross-sectional assumption of identical distributions. That assumption ensures that the observations originate from the same distribution.

The second assumption is comparable to the assumption of independence, but is less stringent. In particular, the LLN and CLT can be extended to allow y_t to depend on y_{t-h}, but the dependence cannot be too strong. We make the following assumption:

Definition 1.3 (weak dependence): A time series, y_1, y_2, \ldots, y_t, \ldots, y_T, is weakly dependent if y_t and y_{t-h} become approximately independent for h → ∞.¹

The assumption of weak dependence ensures that each new observation contains some new information on E(y_t). The interested reader is referred to Hayashi (2000, Section 2.2) for a more detailed discussion. The sufficient conditions on the models to ensure stationarity and weak dependence are typically very similar, and all the stationary processes in this course are also weakly dependent.

Under the assumptions of stationarity and weak dependence the time average, \bar{y}_T in (1.7), is a consistent estimator of the ensemble mean, E(y_t) = \mu, and most of the results from i.i.d. regression carry over to the time series case. If the assumptions fail to hold, the standard results from regression analysis cannot be applied; and the most important distinction in time series econometrics is whether the time series of interest are stationary and weakly dependent or not.
To illustrate the kinds of mechanisms that are at play, Figure 1.3 shows one realization of 200 i.i.d. observations with \mu = 0 in (A), a stationary stochastic process with \mu = 0 in (C), and a non-stationary stochastic process in (E). The right-hand column illustrates the LLN by showing the average of the first T observations,

\bar{y}_T = \frac{1}{T} \sum_{t=1}^{T} y_t = \frac{1}{T} (y_1 + y_2 + \ldots + y_T),   (1.8)

for increasing values of T. For i.i.d. observations the average clearly converges to the true zero mean. The same is the case for the dependent but stationary process, although the fluctuations are larger. Note, however, that the LLN does not apply to the non-stationary case. Here the time dependence is too strong and the average, \bar{y}_T, has no tendency to converge to zero.
¹ The formulation here is not precise, but is meant to capture the idea of the mixing concept from probability theory, see e.g. Davidson (2001, p. 70).
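An experiment along the lines of Figure 1.3 is easy to replicate; in the sketch below (our own parameter choices) the running average converges for the i.i.d. and stationary autoregressive series, but not for the random walk:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2000
eps = rng.normal(size=T)

iid = eps.copy()                       # cf. (A): i.i.d. observations
ar1 = np.zeros(T)                      # cf. (C): stationary but dependent process
for t in range(1, T):
    ar1[t] = 0.8 * ar1[t - 1] + eps[t]
rw = np.cumsum(eps)                    # cf. (E): non-stationary random walk

# Running sample averages as in (1.8), evaluated at T = 100 and T = 2000.
n = np.arange(1, T + 1)
for name, y in [("i.i.d.", iid), ("stationary", ar1), ("non-stationary", rw)]:
    ybar = np.cumsum(y) / n
    print(name, ybar[99], ybar[1999])
```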
Figure 1.3: Time series of 200 observations for (A) an i.i.d. process, (C) a dependent but stationary process, and (E) a non-stationary process. Graphs (B), (D), and (F) show the sample averages of y_1, \ldots, y_T, that is \bar{y}_T = \frac{1}{T} \sum_{t=1}^{T} y_t, as a function of T = 1, 2, \ldots, 2000.

1.3 Measuring Time Dependence


It follows from Definition 1.1 and Definition 1.2 that a stationary time series, y_t, fluctuates around a constant level, the mean \mu = E(y_t). For a stationary time series, \mu can be seen as the equilibrium value of y_t, while deviations from the mean, y_t - \mu, can be interpreted as deviations from equilibrium. In terms of economics, the existence of an equilibrium requires that there are some forces in the economy that pull y_t towards the equilibrium level. We say that the variable is mean reverting or equilibrium correcting. If the variable, y_t, is hit by a shock in period t such that y_t is pushed out of equilibrium, stationarity implies that the variable should revert to equilibrium, and a heuristic characterization of a stationary process is that a shock to the process y_t has only transitory effects.
The fact that a time series adjusts back to the mean does not imply that the deviations from equilibrium cannot be systematic. Stationarity requires that the unconditional distribution of y_t is constant, but the distribution given the past, e.g. the distribution of y_t | y_{t-1}, may depend on y_{t-1}, such that y_t and y_{t-1} are correlated.

Another way to phrase this is in terms of forecasts: You may think of the expectation E(y_t | I_{t-1}) as the best forecast of y_t given the information set

I_{t-1} = \{y_{t-1}, y_{t-2}, \ldots\}.   (1.9)

In this respect the mean of an unconditional distribution in Figure 1.2 (B) corresponds to the best forecast based on an empty information set, I_{t-1} = ∅, i.e. in a situation where we have not observed the history of the process, y_{t-1}, y_{t-2}, \ldots, y_1. If y_t has a tendency to fluctuate in a systematic manner and we have seen that observation four is below average, y_4 < \mu, that could suggest that the best forecast of the next observation, the conditional expectation E(y_5 | y_4), is also likely to be smaller than \mu. This highlights the important difference between conditional and unconditional expectations.

In terms of economics the deviations from equilibrium could reflect business cycles; we expect business cycles to be systematic but to revert at some point in time.

1.3.1 The Autocorrelation Function


One way to characterize the time dependence in a time series, y_t, t ∈ Z, also known as the persistence, is by the correlation between y_t and y_{t-h}, defined as

corr(y_t, y_{t-h}) = \frac{cov(y_t, y_{t-h})}{\sqrt{V(y_t) V(y_{t-h})}}.   (1.10)

If the correlation is positive we denote it positive autocorrelation, and in a graph of the time series it is visible as a tendency of a large observation, y_t, to be followed by another large observation, y_{t+1}, and vice versa. Negative autocorrelation is visible as a tendency of a large observation to be followed by a small observation.

Under stationarity it holds that cov(y_t, y_{t-h}) = \gamma_h only depends on h, and the variance is constant, V(y_t) = V(y_{t-h}) = \gamma_0. For a stationary process the formula can therefore be simplified, and we define the autocorrelation function (ACF) as

\rho_h = \frac{cov(y_t, y_{t-h})}{V(y_t)} = \frac{\gamma_h}{\gamma_0}, \quad h ∈ Z.   (1.11)

The term autocorrelation function refers to the fact that we consider \rho_h as a function of the lag length h. It follows from the definition that \rho_0 = 1, that the autocorrelations are symmetric, \rho_{-h} = \rho_h, and that they are bounded, -1 ≤ \rho_h ≤ 1. For i.i.d. observations we have that \rho_h = 0 for all h ≠ 0. For a stationary and weakly dependent time series the autocorrelations could be non-zero for a number of lags, but we expect \rho_h to approach zero as h increases. In most cases the convergence to zero is relatively fast.
For a given data set, the sample autocorrelations can be estimated by, e.g.,

\tilde{\rho}_h = \frac{(T-h)^{-1} \sum_{t=h+1}^{T} (y_t - \bar{y})(y_{t-h} - \bar{y})}{T^{-1} \sum_{t=1}^{T} (y_t - \bar{y})^2}   (1.12)

or, alternatively, by

\hat{\rho}_h = \frac{(T-h)^{-1} \sum_{t=h+1}^{T} (y_t - \bar{y})(y_{t-h} - \bar{y})}{(T-h)^{-1} \sum_{t=h+1}^{T} (y_{t-h} - \bar{y})^2}.   (1.13)

The first estimator, \tilde{\rho}_h, is the most efficient as it uses all the available observations in the denominator. For convenience it is sometimes preferred to discard the first h observations in the denominator and use the estimator \hat{\rho}_h, which is just the OLS estimator in the regression model

y_t = c + \hat{\rho}_h y_{t-h} + residual.

There are several formulations of the variance of the estimated autocorrelation function. The simplest result is that if the true correlations are all zero, \rho_1 = \rho_2 = \ldots = 0, then the asymptotic distribution of \hat{\rho}_h is normal with variance V(\hat{\rho}_h) = T^{-1}. A 95% confidence band for \hat{\rho}_h is therefore given by ±1.96/\sqrt{T}.
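As a minimal sketch, the estimator in (1.12) can be coded directly; the example series below is our own choice, and packaged routines (e.g. statsmodels.tsa.stattools.acf) offer similar functionality:

```python
import numpy as np

def acf_tilde(y, nlags):
    """Sample autocorrelations as in (1.12), with the full-sample variance
    in the denominator."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    dev = y - y.mean()
    denom = np.sum(dev**2) / T
    rho = np.empty(nlags + 1)
    for h in range(nlags + 1):
        rho[h] = np.sum(dev[h:] * dev[:T - h]) / (T - h) / denom
    return rho

rng = np.random.default_rng(2)
e = rng.normal(size=500)
y = e + 0.5 * np.concatenate([[0], e[:-1]])   # illustrative moving-average series
print(acf_tilde(y, nlags=5))
print(1.96 / np.sqrt(len(y)))                 # 95% band under rho_1 = rho_2 = ... = 0
```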
A complementary measure of the time dependence is the so-called partial autocorrelation function (PACF), which is the correlation between y_t and y_{t-h} conditional on the intermediate values, i.e.

\alpha_h = corr(y_t, y_{t-h} | y_{t-1}, \ldots, y_{t-h+1}).

One way to understand the PACF is the following: If y_t is correlated with y_{t-1}, then y_{t-1} is correlated with y_{t-2}. This implies that y_t and y_{t-2} are correlated by construction, but some of the effect is indirect (it goes through y_{t-1}). The PACF measures the direct relation between y_t and y_{t-2}, after the indirect effect via y_{t-1} is removed. The PACF can be estimated as the OLS estimator \hat{\alpha}_h in the regression

y_t = c + \phi_1 y_{t-1} + \ldots + \phi_{h-1} y_{t-h+1} + \alpha_h y_{t-h} + residual,

where the intermediate lags are included. If \alpha_1 = \alpha_2 = \ldots = 0, it again holds that V(\hat{\alpha}_h) = T^{-1}, and a 95% confidence band is given by ±1.96/\sqrt{T}. For a weakly dependent time series, the partial autocorrelation function, \alpha_h, should also approach zero as h increases.
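A regression-based sketch of the PACF (our own implementation, with y a numpy array) simply adds one lag at a time and records the coefficient on the longest lag:

```python
import numpy as np

def pacf_ols(y, nlags):
    """PACF at lags 1..nlags: the OLS coefficient on y_{t-h} in a regression of
    y_t on a constant and y_{t-1}, ..., y_{t-h}, cf. the regression in the text."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    out = []
    for h in range(1, nlags + 1):
        Y = y[h:]
        lags = [y[h - j:T - j] for j in range(1, h + 1)]   # y_{t-1}, ..., y_{t-h}
        X = np.column_stack([np.ones(T - h)] + lags)
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        out.append(beta[-1])                               # coefficient on y_{t-h}
    return np.array(out)
```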
In the next section we consider examples of the estimated ACF, and later in the
course we discuss more precisely how the ACF and PACF convey information on the
properties of the data.

1.4 Transformations to Stationarity


A brief look at the time series in Figure 1.1 suggests that not all economic time series are stationary. The US unemployment rate in Figure 1.4 (A) seems to fluctuate around a constant level. The deviations are extremely persistent, however, and the unemployment rate is above the constant level for most of the 20-year period 1975-1995. If the time series is considered stationary, the speed of adjustment towards the equilibrium level is very low; and it is probably a more reasonable assumption that the expected value of the unemployment rate is not constant. An economic interpretation of this finding could be that there have been systematic changes in the natural rate of unemployment over the decades, and that the actual unemployment rate is not adjusting towards a constant level but towards a time-varying NAIRU.

The persistence of the time series itself translates into very large and significant autocorrelations. Graph (B) indicates that the ACF remains positive for a very long lag length, and the correlation between u_t and u_{t-15} is around 0.6. Whether the unemployment rate actually corresponds to a sample path from a stationary stochastic process is an empirical question, and the literature contains many different tests that can be used to distinguish stationary from non-stationary time series.

1.4.1 De-Trending
The Danish productivity level in Figure 1.1 (B) has a clear positive drift and is hence non-stationary. The trend appears to be very systematic, however, and it seems that productivity fluctuates around the linear trend with a constant variance. One reason could be that there is a constant autonomous increase in productivity of around 2% per year; and equilibrium is defined in terms of this trend, such that the deviations of productivity from the underlying trend are stationary.

Figure 1.4: Examples of time series transformations to stationarity. (A) US unemployment rate; (B) ACF for (A); (C) Danish productivity (log) minus trend; (D) ACF for (C); (E) change in Danish consumption (log); (F) ACF for (E); (G) Danish savings rate (log); (H) ACF for (G).



Figure 1.4 (C) shows the deviation of productivity from a linear trend. The deviations are calculated as the estimated residual, \hat{\epsilon}_t, in the linear regression on a constant and a trend,

log(prod_t) = \mu + \delta t + \epsilon_t,   (1.14)

where log(prod_t) is the log of measured productivity. The deviations from the trend, \hat{\epsilon}_t = log(prod_t) - \hat{\mu} - \hat{\delta} t, are still systematic, reflecting labour hoarding and other business-cycle effects, but there seems to be a clear reversion of productivity to the trending mean. This is also reflected in the ACF of the deviations, which dies out relatively fast.
A non-stationary time series, q_t, that becomes stationary after a deterministic linear trend has been removed is denoted trend stationary. One way to think about a trend-stationary time series is that the stochastic part of the process is stationary, but the stochastic fluctuations appear around a trending deterministic component, \mu + \delta t. Deterministic de-trending, as in the regression (1.14), is one possible way to transform a non-stationary time series to stationarity.
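A minimal sketch of the de-trending regression (1.14), with a simulated trend-stationary series standing in for the productivity data:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 140
trend = np.arange(T, dtype=float)
logprod = 0.2 + 0.005 * trend + rng.normal(0, 0.02, T)   # hypothetical log productivity

# OLS regression of log(prod_t) on a constant and a linear trend, cf. (1.14).
X = np.column_stack([np.ones(T), trend])
coef, *_ = np.linalg.lstsq(X, logprod, rcond=None)
deviation = logprod - X @ coef    # estimated residuals: deviations from the trend
```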
A similar attempt at removing the non-stationarity implied by a level shift could have been made for the case of the Pandora stock price in Figure 1.1 (E). Here we could have defined a dummy variable

D_t = I(t ≥ T_0),   (1.15)

where I(·) is the indicator function taking the value one if the expression is true and zero otherwise, and where T_0 is chosen as the breakpoint August 7. For a known breakpoint, D_t is a deterministic variable, and we could look at the estimated residuals from the regression

Pandora_t = \mu + \delta D_t + \epsilon_t,   (1.16)

where Pandora_t denotes the daily stock price.
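In the same spirit, a sketch of the level-shift regression (1.15)-(1.16); the prices and breakpoint below are illustrative, not the actual Pandora data:

```python
import numpy as np

rng = np.random.default_rng(4)
T, T0 = 60, 25                                    # sample size and known breakpoint
price = np.where(np.arange(T) < T0, 430.0, 325.0) + rng.normal(0, 5, T)

D = (np.arange(T) >= T0).astype(float)            # step dummy D_t = I(t >= T0)
X = np.column_stack([np.ones(T), D])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
resid = price - X @ coef                          # residuals net of the level shift
```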

1.4.2 Transformation by Differencing

The aggregate Danish consumption in Figure 1.1 (C) is also clearly non-stationary due to the upward drift. In this case, the deviations from a deterministic trend are still very persistent, and the de-trended consumption also looks non-stationary. An alternative way to remove the non-stationarity is to consider the first difference, \Delta c_t = c_t - c_{t-1}. The changes in Danish consumption, \Delta c_t, depicted in Figure 1.4 (E), fluctuate around a constant level and could look stationary. The persistence is not very strong, which is also reflected in the ACF in graph (F). A variable, c_t, which itself is non-stationary but where the first difference, \Delta c_t, is stationary, is denoted difference stationary. Often it is also referred to as integrated of first order, or I(1), because it behaves like a stationary variable that has been integrated (or cumulated) once.

Whether it is most appropriate to transform a variable to stationarity using deterministic de-trending or by first differencing has been subject to much debate in the economics and econometrics literature. The main difference is whether the stochastic component of the variable is stationary or not, that is, whether the stochastic shocks to the process have transitory effects only (the trend-stationary case), or whether shocks can have permanent effects (the difference-stationary case). The formal test of this hypothesis will be considered in later chapters.
We may note that the first difference of a variable in logs has a special interpretation. As an example, let c_t = log(cons_t) measure the log of consumption. It holds that

\Delta c_t = c_t - c_{t-1} = log\left(\frac{cons_t}{cons_{t-1}}\right) = log\left(1 + \frac{cons_t - cons_{t-1}}{cons_{t-1}}\right) ≈ \frac{cons_t - cons_{t-1}}{cons_{t-1}},

where the last approximation is good if the growth rate is close to zero. Therefore, \Delta c_t can be interpreted as the relative growth rate of cons_t; and changes in c_t are interpretable as percentage changes (divided by 100).
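A quick numerical check of the approximation (the consumption levels are made up):

```python
import numpy as np

cons = np.array([100.0, 102.0, 104.5])   # hypothetical consumption levels
dlog = np.diff(np.log(cons))             # Delta c_t = log(cons_t) - log(cons_{t-1})
growth = np.diff(cons) / cons[:-1]       # exact relative growth rate
print(dlog)                              # approx. [0.0198 0.0242]
print(growth)                            # approx. [0.0200 0.0245] -- close for small rates
```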

1.4.3 Co-integration
A third way to remove non-stationarity is by taking linear combinations of several variables. In Figure 1.1 (C) the developments of income and consumption have many similarities, and the savings rate, s_t = y_t - c_t, depicted in Figure 1.4 (G), is much more stable. It seems to fluctuate around a constant level with peaks corresponding to known business cycle episodes in the Danish economy. The time dependence dies out relatively fast, cf. graph (H), and the savings rate may correspond to a stationary process.

The property that a linear combination of non-stationary variables becomes stationary is denoted co-integration. The interpretation is that the variables themselves, here y_t and c_t, move in a non-stationary manner, but they are tied together in an equilibrium relation, and the dynamics of the economy ensure that the savings rate only shows transitory deviations from the equilibrium level. In general, we may have p variables, x_t = (x_{1t}, x_{2t}, \ldots, x_{pt})', that are all non-stationary, but a particular linear combination, e.g.

\beta' x_t = \beta_1 x_{1t} + \ldots + \beta_p x_{pt},

may be stationary, where \beta = (\beta_1, \ldots, \beta_p)' is a vector of parameters.
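A small simulation (entirely our own construction) illustrates the idea: two series share a common stochastic trend, and the linear combination with \beta = (1, -1)' is stationary:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 500
trend = np.cumsum(rng.normal(size=T))     # common stochastic trend (random walk)

y = trend + rng.normal(0, 0.5, T)         # "income": non-stationary
c = trend - 0.1 + rng.normal(0, 0.5, T)   # "consumption": shares the same trend
s = y - c                                 # "savings rate": beta = (1, -1)'

print(s.mean(), s.std())                  # fluctuates around 0.1; no trend left
```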
Co-integration is the main tool in time series analysis of non-stationary variables, and it plays an important role in the econometric literature. All econometric textbooks have chapters devoted to the analysis of co-integrated variables. We return to the topic of co-integration in Chapter 8 and in Chapter 9.
Chapter 2

Linear Regression Models for Economic Time Series

This chapter gives a short repetition of the linear regression model, with the notation adapted to time series and with a focus on the sufficient conditions for consistency and asymptotic normality of the OLS estimator. We briefly discuss the issues of model formulation, misspecification testing, and forecasting.

2.1 The Linear Regression Model


Let y_t be a variable of interest, and let x_t be a k × 1 dimensional vector of explanatory variables. To model y_t as a function of x_t we consider the linear regression model:

Assumption 2.1 (linear regression model): The relationship between y_t and x_t is linear in the parameters:

y_t = x_t'\beta + \epsilon_t = x_{1t}\beta_1 + x_{2t}\beta_2 + \ldots + x_{kt}\beta_k + \epsilon_t,   (2.1)

for t = 1, 2, \ldots, T.

In many applications the first explanatory variable is a constant term, x_{1t} = 1, in which case \beta_1 is the intercept of the regression.

We will sometimes refer to the left-hand-side variable, y_t, as the regressand, the dependent variable, or the endogenous variable. The right-hand-side variables, x_t, are sometimes referred to as explanatory variables, regressors, covariates, and, under some specific assumptions, exogenous variables. Finally, we say that (2.1) is a regression of y_t on x_t.

2.1.1 Interpretation and the Ceteris Paribus Assumption


So far the regression model in (2.1) is a tautology, and it does not say anything about the relationship between y_t and x_t. For any set of observations (y_t, x_t')' and any parameter value b, the residual term can be chosen as \epsilon_t = y_t - x_t'b to satisfy (2.1). To make the equation informative about the relationship between y_t and x_t, we therefore have to impose restrictions on the behavior of \epsilon_t that allow us to determine a unique value of \beta from equation (2.1). This is, loosely speaking, what is called identification in econometrics. And a condition for being able to estimate the parameter consistently, such that the estimator \hat{\beta} converges to the true value \beta as T → ∞, is that the parameter is identified.

To achieve identification in a linear regression, we think of the model (2.1) as representing the conditional expectation,

E(y_t | x_t) = x_t'\beta,   (2.2)

and we make the following assumption:

Assumption 2.2 (predeterminedness): For the error term in the model (2.1) it holds that

E(\epsilon_t | x_t) = 0.   (2.3)

Under Assumption 2.2, we can think of a parameter \beta_j as the marginal effect on the expected value of y_t of a change in x_{jt}, i.e. as the partial derivative

\frac{\partial}{\partial x_{jt}} E(y_t | x_t) = \beta_j.   (2.4)

We therefore interpret \beta_j as the effect of a marginal change in the variable x_{jt}, holding the remaining variables in x_t constant; this is known as the ceteris paribus assumption. If (2.3) is fulfilled, we refer to the regressors as being predetermined. As we will see in §2.2 below, predeterminedness is a sufficient condition for identification of \beta in a linear regression model.

Note that the assumption in (2.3) is not an innocuous technicality. It states that all information that is relevant for the relationship between x_t and y_t has been included in the model. Firstly, this excludes that a variable in x_t depends on y_t through some feedback mechanism operating at time t. If this is the case, there exist two equations linking y_t and x_t, and in our simple regression there is no way to say which one we would actually obtain. Secondly, remember that if a relevant variable is excluded from the regression model, then it will be picked up by the error term. For (2.3) to be true we need any such variable to be unrelated to x_t.

2.1.2 Properties of Conditional Expectations


The interpretation of the linear regression is intimately linked to the conditional expectation. First note that it is always possible to decompose a stochastic variable, y_t, into a conditional expectation and an error term with conditional expectation zero, i.e. for any vector x_t,

y_t = E(y_t | x_t) + \epsilon_t,   (2.5)

where E(\epsilon_t | x_t) = 0. In general, E(y_t | x_t) is some nonlinear function of x_t. The central assumptions in the linear regression model are that x_t includes all the relevant conditioning information and that the functional form of the conditional expectation is linear, E(y_t | x_t) = x_t'\beta. At this point we emphasize three important properties of the conditional expectation.

Firstly, we have the well-known result that for some function g(·),

E(g(x_t) | x_t) = g(x_t),   (2.6)

meaning that if we condition on x_t then we can treat the stochastic variable as non-random.

Secondly, the condition (2.3) implies that \epsilon_t is uncorrelated with any function of x_t, and therefore also uncorrelated with the individual variables, x_{1t}, x_{2t}, \ldots, x_{kt}. The condition (2.3) therefore also states that the functional form of the regression model has been correctly specified; no non-linear effects have been neglected.

Thirdly, E(\epsilon_t | x_t) = 0 implies an unconditional zero expectation, E(\epsilon_t) = 0. This is an example of a result called the law of iterated expectations, which states that

E(E(y_t | x_t) | x_t, z_t) = E(y_t | x_t)   (2.7)
E(E(y_t | x_t, z_t) | x_t) = E(y_t | x_t).   (2.8)

It is easy to follow the intuition in the result (2.7): Recall that E(y_t | x_t) = g(x_t) is some general function of x_t, and since all the information in x_t is also contained in the larger information set w_t = (x_t', z_t')', the conditional expectation of g(x_t) is the function itself. To understand the result in (2.8) it is informative to think of the conditional expectation as a prediction. The result states that we cannot improve the prediction given the small information set, E(y_t | x_t), by first forming the prediction using a larger information set, E(y_t | x_t, z_t), and then trying to forecast that best prediction using only x_t. The general result is that it is always the smallest information set that dominates.
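The result (2.8) can be checked numerically when the conditioning variables are discrete, since conditional expectations are then simple group means. The sketch below (data-generating process of our own choosing) estimates both sides of (2.8):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 200_000
x = rng.integers(0, 2, n)                     # discrete conditioning variables
z = rng.integers(0, 2, n)
y = 1.0 * x + 0.5 * z + rng.normal(size=n)

df = pd.DataFrame({"x": x, "z": z, "y": y})
e_xz = df.groupby(["x", "z"])["y"].transform("mean")   # estimate of E(y | x, z)
lhs = e_xz.groupby(df["x"]).mean()                     # estimate of E(E(y|x,z) | x)
rhs = df.groupby("x")["y"].mean()                      # estimate of E(y | x)
print(lhs.values, rhs.values)                          # agree, illustrating (2.8)
```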

2.1.3 Examples of Time Series Regressions


Depending on the variables included in the vector of regressors, x_t, some interesting interpretations of the linear regression in (2.1) emerge.

Static Regression. As a first example, let the vector of regressors contain k explanatory variables dated at the same point in time as the left-hand-side variable in equation (2.1). Then the linear regression is called a static regression,

y_t = x_t'\beta + \epsilon_t.   (2.9)

The interpretation of the relationship is therefore like a comparative static exercise, and \beta measures the expected effect on y_t of a change in x_t.

Autoregression. Next recall that, due to the temporal ordering of the time series observations, past events can be treated as given in the analysis of current events. Since many economic time series seem to depend on their own past, it is natural to include the lagged values, y_{t-1}, y_{t-2}, \ldots, in the explanation of the current value. As an example we can let x_t = y_{t-1}, and the regression model is given by

y_t = \theta y_{t-1} + \epsilon_t.   (2.10)

A model where the properties of y_t are characterized as a function of only its own past is denoted a univariate time series model, and the specific model in (2.10), where y_t depends only on y_{t-1}, is denoted a first-order autoregressive model, or an AR(1). A higher-order autoregressive model, an AR(p) model, is defined similarly.
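As a minimal sketch, the autoregression (2.10) can be estimated by regressing y_t on y_{t-1}; the data below are simulated with \theta = 0.6 (the numbers are our own choices):

```python
import numpy as np

rng = np.random.default_rng(7)
T, theta = 400, 0.6
y = np.zeros(T)
for t in range(1, T):
    y[t] = theta * y[t - 1] + rng.normal()

# OLS in (2.10): regress y_t on the predetermined regressor y_{t-1}.
theta_hat = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)
print(theta_hat)          # close to 0.6 for moderately large T
```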

Autoregressive Distributed Lag Model. The dynamic structure of the regression model can easily be more complex than (2.10), with lagged values of both the regressand, y_t, and the regressors, x_t. As an example, consider the dynamic regression model

y_t = \theta_1 y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \epsilon_t,   (2.11)

where y_t is modelled as a function of y_{t-1}, x_t, and x_{t-1}. This model is denoted an autoregressive distributed lag, or ADL, model. It is particularly useful for characterizing dynamic adjustment properties, such as the most likely dynamic impacts on y_t, y_{t+1}, y_{t+2}, \ldots following a shock to x_t. The ADL model can easily be extended to include more lags and to the case where x_t is a vector of regressors.

We return to the dynamic interpretation of the ADL model in Chapter 5.

2.2 The OLS Estimator and its Properties


One way to derive the ordinary least squares (OLS) estimator of \beta in the linear regression in Assumption 2.1 is by appealing to the so-called method of moments (MM) estimation principle, see also Wooldridge (2006). Here we briefly review the MM estimation principle and its relation to the identification of the parameters.

The conditional zero mean in Assumption 2.2 states that x_t does not contain information on the expected value of \epsilon_t. This implies in particular that x_t and \epsilon_t are uncorrelated, i.e. that

E(x_t \epsilon_t) = 0.   (2.12)

This holds from the properties of the conditional expectation, as

E(x_t \epsilon_t) = E(E(x_t \epsilon_t | x_t)) = E(x_t E(\epsilon_t | x_t)) = 0.

We will refer to (2.12) as a set of moment conditions. Inserting the expression for the error term, \epsilon_t = y_t - x_t'\beta, yields a system of k equations to determine the k parameters in \beta, and if there is a unique solution we say that the system identifies the parameters. In particular we have that

E(x_t (y_t - x_t'\beta)) = 0, or
E(x_t y_t) - E(x_t x_t')\beta = 0.

If the k × k matrix E(x_t x_t') is non-singular, it can be inverted to give the solution

\beta = (E(x_t x_t'))^{-1} E(x_t y_t),   (2.13)

which is the population parameter. We formulate the invertibility of E(x_t x_t') as an assumption:

Assumption 2.3 (no perfect collinearity): The regressors x_t in Assumption 2.1 are not perfectly collinear, i.e. the matrix E(x_t x_t') is non-singular.

From a given finite sample of y_t and x_t (t = 1, 2, \ldots, T) we cannot compute the expectations in (2.13), and the idea of the MM estimation principle is to replace the expectations by sample averages, which defines the well-known OLS estimator,

\hat{\beta} = \left( T^{-1} \sum_{t=1}^{T} x_t x_t' \right)^{-1} \left( T^{-1} \sum_{t=1}^{T} x_t y_t \right).   (2.14)

For the step from (2.13) to (2.14) to work, we need a law of large numbers (LLN) to apply, implying that sample averages converge in probability to the expectations, i.e.

T^{-1} \sum_{t=1}^{T} x_t y_t \overset{p}{\to} E(x_t y_t)  and  T^{-1} \sum_{t=1}^{T} x_t x_t' \overset{p}{\to} E(x_t x_t').   (2.15)
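As a sketch, (2.14) translates directly into matrix code; everything below (data and names) is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(8)
T = 500
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])   # k = 3, incl. constant
beta_true = np.array([1.0, 0.5, -0.2])
y = X @ beta_true + rng.normal(size=T)

# Sample moments replacing the expectations in (2.13):
Sxx = X.T @ X / T                        # T^{-1} sum of x_t x_t'
Sxy = X.T @ y / T                        # T^{-1} sum of x_t y_t
beta_hat = np.linalg.solve(Sxx, Sxy)     # the OLS estimator (2.14)
```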

To ensure that a LLN applies to the sample averages as in (2.15), we have to impose additional assumptions on the data. A detailed statistical analysis of the time series model is somewhat complicated, and the presentation below focuses on the intuition and the importance of the assumptions. A more rigorous and technically demanding coverage, including proofs of the theorems, can be found in Davidson (2001). Hamilton (1994) also goes through the calculations for many relevant time series regressions, and a very informative discussion of time series regressions is found in Hayashi (2000).

There are several ways to formulate the assumptions necessary for the LLN in the time series case, see e.g. Davidson (2001). In this course we use the following formulation:

Assumption 2.4 (stationarity and weak dependence): Consider a time series y_t and the k × 1 vector time series x_t in Assumption 2.1. For the process z_t = (y_t, x_t')', make the following assumptions:
(i) z_t has a stationary distribution;
(ii) z_t is weakly dependent.

For the analysis of cross-sectional data it is common to assume that the observations are identically and independently distributed (i.i.d.), see e.g. assumption MLR.2 in Wooldridge (2006). That assumption is typically too restrictive for the case of time series observations, but most of the results for the regression model continue to hold for time series satisfying Assumption 2.4. The idea is that 'identical distributions' is replaced by the assumption of stationarity, while 'independence' is replaced by the assumption of weak dependence.

2.2.1 Consistency
A first requirement for an estimator is that it is consistent, such that the estimator converges to the true value as we get more and more observations (i.e. for $T \to \infty$):

Theorem 2.1 (consistency): Consider the linear regression model in Assumption 2.1. Under Assumption 2.2 (predeterminedness), Assumption 2.3 (no perfect collinearity), and Assumption 2.4 (stationarity and weak dependence), it holds that the OLS estimator in (2.14) is a consistent estimator of the true value, i.e.
$$\hat\beta \overset{p}{\to} \beta \quad \text{as } T \to \infty.$$

An alternative notation for convergence in probability is
$$\operatorname*{plim}_{T\to\infty} \hat\beta = \beta. \qquad (2.16)$$
The intuition for the result is straightforward and follows the steps in the derivation in §2.2. First, we need the moment conditions to be satisfied. Next we need a LLN to apply to the sample averages such that $\hat\beta \overset{p}{\to} \beta$, and Assumption 2.4 is sufficient to ensure the LLN. A sketch of the proof for consistency of OLS in a simple time series regression is given as follows:

Sketch of Proof: Consider the regression model in Assumption 2.1 with a single regressor,
$$y_t = x_t\beta + \epsilon_t, \quad t = 1, 2, \dots, T. \qquad (2.17)$$
Insert the expression for $y_t$ in the formula for the OLS estimator in (2.14) to obtain
$$\hat\beta = \frac{T^{-1}\sum_{t=1}^T y_t x_t}{T^{-1}\sum_{t=1}^T x_t^2} = \frac{T^{-1}\sum_{t=1}^T (x_t\beta + \epsilon_t)x_t}{T^{-1}\sum_{t=1}^T x_t^2} = \beta + \frac{T^{-1}\sum_{t=1}^T \epsilon_t x_t}{T^{-1}\sum_{t=1}^T x_t^2}. \qquad (2.18)$$
We look at the behavior of the last term as $T \to \infty$.
If a LLN applies, i.e. under Assumption 2.4, the limit of the denominator is given by
$$\operatorname*{plim}_{T\to\infty} \frac{1}{T}\sum_{t=1}^T x_t^2 = E(x_t^2). \qquad (2.19)$$
This needs to be invertible, and from Assumption 2.3 it holds that $E(x_t^2) > 0$. For the numerator, we get
$$\operatorname*{plim}_{T\to\infty} \frac{1}{T}\sum_{t=1}^T \epsilon_t x_t = E(\epsilon_t x_t) = 0,$$
where the last equality follows from Assumption 2.2 of predeterminedness. Combining the results, we obtain
$$\operatorname*{plim}_{T\to\infty} \hat\beta = \beta + \frac{\operatorname{plim} T^{-1}\sum_{t=1}^T \epsilon_t x_t}{\operatorname{plim} T^{-1}\sum_{t=1}^T x_t^2} = \beta + \frac{0}{\sigma_x^2} = \beta,$$
which shows the consistency of the OLS estimator, $\hat\beta$.

It should be emphasized that Theorem 2.1 gives sufficient conditions for consistency. The conditions are not necessary, however, and in the analysis of non-stationary variables some estimators are consistent by other arguments than the one used to obtain Theorem 2.1. As an example, it turns out that the estimator $\hat\theta$ of $\theta$ in the first-order autoregressive model (2.10) is consistent even if $y_t$ does not fulfill Assumption 2.4.

2.2.2 What is OLS Actually Estimating? The Case with Omitted Variables
Recall that the moment condition in (2.12) is implied by predeterminedness,
$$E(\epsilon_t \mid x_t) = 0,$$
which implies that the regression model represents the conditional expectation, $E(y_t \mid x_t) = x_t'\beta$. To gain insight into the mechanics of OLS and the interpretation of the moment condition, it is natural to ask what OLS is actually estimating if the moment condition does not hold.
To illustrate this, consider a linear equation,
$$y_t = \beta_1 + x_t\beta_2 + w_t\beta_3 + u_t, \qquad (2.20)$$
where $y_t$ is generated by two explanatory variables. We assume that the requirements for OLS are fulfilled in (2.20), such that $E(u_t \mid x_t, w_t) = 0$, and that all variables are stationary and weakly dependent. Now suppose that we incorrectly omit the variable $w_t$ and consider the linear regression
$$y_t = \beta_1 + x_t\beta_2 + \epsilon_t, \qquad (2.21)$$
which is misspecified due to the omitted variable, $w_t$. The regression error term is given by $\epsilon_t = w_t\beta_3 + u_t$, and based on the arguments above, the OLS estimator in the misspecified model, $\tilde\beta_2$ say, is inconsistent when
$$E(\epsilon_t x_t) = E((w_t\beta_3 + u_t)x_t) \neq 0, \qquad (2.22)$$
i.e. when $x_t$ and $w_t$ are correlated.
To see what OLS is estimating, remember that for a stochastic variable of interest, $y_t$, and a set of conditioning variables, $x_t$, we may always decompose $y_t$ into a conditional expectation given $x_t$ and a remainder term with conditional expectation zero. For the misspecified regression model above we may write
$$y_t = E(y_t \mid x_t) + v_t, \quad \text{where } E(v_t \mid x_t) = 0. \qquad (2.23)$$
Assuming that the conditional expectation is a linear function, $E(y_t \mid x_t) = b_1 + x_t b_2$, we get the model
$$y_t = b_1 + b_2 x_t + v_t. \qquad (2.24)$$
The OLS moment condition, $E(v_t \mid x_t) = 0$, is fulfilled by construction for (2.24), and the OLS estimator, $\hat b_2$ (which is identical to $\tilde\beta_2$ from the misspecified model),
is a consistent estimator of $b_2$. This parameter has the interpretation of a partial derivative,
$$\operatorname{plim} \tilde\beta_2 = b_2 = \frac{\partial E(y_t \mid x_t)}{\partial x_t} \neq \frac{\partial E(y_t \mid x_t, w_t)}{\partial x_t} = \beta_2. \qquad (2.25)$$
The intuitive point is that OLS consistently estimates the parameters in the linear conditional expectation. The problem is that the structural economic equation stated in (2.20) does not correspond to the conditional expectation in (2.24), and the model will not identify the economically relevant parameter, $\beta_2$. To put it differently, the problem is not that the mechanics of the OLS estimator are invalid; the problem is that it consistently estimates an irrelevant quantity.
This means that the discussion of the validity of the moment condition can be reformulated as a discussion of which variables to include in the model, i.e. which conditional expectation is the most meaningful. In some situations, the danger is to omit relevant variables, such that $\hat\beta$ will be an inconsistent estimator of the relevant true parameter; in other situations the danger may be to condition on too many variables, such that the effect of the variable of interest, $x_t$, is already explained by other included variables.

2.2.3 Unbiasedness and Finite Sample Bias


Consistency is a minimal requirement for an estimator. A more ambitious requirement is that of unbiasedness, which is often quoted for OLS in regressions for i.i.d. data. To obtain unbiasedness, consider the following assumption:

Assumption 2.5 (strict exogeneity): For the linear regression model in Assumption 2.1, it holds that
$$E(\epsilon_t \mid x_1, x_2, \dots, x_t, \dots, x_T) = 0. \qquad (2.26)$$

Observe that the assumption of strict exogeneity in (2.26), which implies zero correlation between the error term, $\epsilon_t$, and past, current, and future values of $x_t$, is stronger than Assumption 2.2 of predeterminedness, which only pertains to the current $x_t$. Many introductory textbooks analyzing the i.i.d. case, e.g. Wooldridge (2006, MLR.4), do not distinguish between predeterminedness and strict exogeneity: because the observations $(y_t, x_t')'$ are independent, it does not matter whether we condition on $x_t$ or on the entire time series, $\{x_t\}_{t=1}^T$. For time series data, however, there is a big difference, and while strict exogeneity is often a reasonable assumption for cross-sectional data, where the randomly sampled cross-sectional units are independent, it is a very strong assumption for time series data. In particular, it requires that there are no feedback effects from $y_t$ to any future value $x_{t+h}$, $h > 0$.
As an example, consider a simple linear Phillips curve, where the unemployment rate determines wage growth,
$$\text{WageGrowth}_t = \beta_0 + \beta_1\,\text{UnemploymentRate}_t + \epsilon_t, \quad t = 1, 2, \dots, T. \qquad (2.27)$$
It is likely that the unemployment rate is predetermined in (2.27), because current wage growth does not affect the unemployment rate directly within one quarter or one month. It is highly unlikely, however, that the unemployment rate is strictly exogenous, i.e. that wage growth does not affect the unemployment rate in the future. One could imagine that high wage growth would deteriorate international competitiveness, and lower exports induced by higher wage growth would increase future unemployment rates. This would be a violation of strict exogeneity.
If, nevertheless, we are willing to make the assumption of strict exogeneity, we can state the following result:

Theorem 2.2 (unbiasedness): Consider the linear regression model in Assumption 2.1. Under Assumption 2.3 (no perfect collinearity), Assumption 2.4 (stationarity and weak dependence), and the additional Assumption 2.5 (strict exogeneity), it holds that the OLS estimator in (2.14) is unbiased, i.e.
$$E(\hat\beta) = \beta. \qquad (2.28)$$

We give a brief sketch of the proof:

Sketch of Proof: Consider again the estimator in (2.18),
$$\hat\beta = \beta + \frac{T^{-1}\sum_{t=1}^T \epsilon_t x_t}{T^{-1}\sum_{t=1}^T x_t^2}. \qquad (2.29)$$
Recall that the unconditional expectation of a ratio is not the ratio of the expectations. Instead, take expectations conditional on $x_1, \dots, x_t, \dots, x_T$; under Assumption 2.5 it holds that
$$E(\hat\beta \mid x_1, x_2, \dots, x_t, \dots, x_T) = \beta + \frac{T^{-1}\sum_{t=1}^T x_t E(\epsilon_t \mid x_1, x_2, \dots, x_t, \dots, x_T)}{T^{-1}\sum_{t=1}^T x_t^2} = \beta.$$
From the rules for conditional expectations we may also write the result unconditionally,
$$E(\hat\beta) = E(E(\hat\beta \mid x_1, x_2, \dots, x_t, \dots, x_T)) = \beta,$$
which shows (2.28).
Whereas consistency is an asymptotic property, prevailing as $T \to \infty$, unbiasedness is a finite sample property, stating that the expectation of the estimator equals the true value for all sample lengths.
To illustrate that Assumption 2.5 of strict exogeneity is often unrealistic in time series settings, consider the first-order autoregressive model in (2.10), i.e.
$$y_t = \theta y_{t-1} + \epsilon_t. \qquad (2.30)$$
Due to the structure of the time series, it might be reasonable to assume that $\epsilon_t$ is uncorrelated with lagged values of the explanatory variable, $y_{t-1}, y_{t-2}, \dots, y_1$. But since $y_t$ is a function of $\epsilon_t$, it is clear that $\epsilon_t$ cannot be uncorrelated with current and future values of the explanatory variable, i.e. $y_t, y_{t+1}, \dots, y_T$.
In general, we have the following auxiliary result:

Corollary 2.3 (estimation bias in dynamic models): In a regression model including the lagged dependent variable, the OLS estimator is in general not unbiased.

As an example, it can be shown that the OLS estimator of the autoregressive coefficient $\theta$ in the AR(1) model (2.30) is biased towards zero. The derivation of the bias is technically demanding, and instead we present a Monte Carlo simulation to illustrate the idea.

Bias of OLS in an AR(1) Model. To illustrate the bias of OLS we use a Monte Carlo simulation. As the data generating process (DGP) we use the AR(1) model
$$y_t = \theta y_{t-1} + \epsilon_t, \quad t = 1, 2, \dots, T, \qquad (2.31)$$
with an autoregressive parameter of $\theta = 0.9$ and $\epsilon_t \overset{d}{=} N(0,1)$. We generate $M = 5000$ time series with a sample length $T$, i.e. $y_1^{(m)}, y_2^{(m)}, \dots, y_T^{(m)}$ for $m = 1, 2, \dots, M$. For each time series we apply OLS to the regression model (2.31) and get the estimate $\hat\theta_m$.
To characterize the estimator we calculate the Monte Carlo average across the $M$ replications,
$$\text{average}(\hat\theta) = \frac{1}{M}\sum_{m=1}^M \hat\theta_m,$$
such that the bias of the estimator is $\text{average}(\hat\theta) - \theta$. The results are reported in Figure 2.1 (A) for sample lengths $T \in \{10, 15, \dots, 100\}$. The confidence bands, $\text{average}(\hat\theta) \pm 2\cdot\text{sd}(\hat\theta)$, measure the dispersion of $\hat\theta$ across replications. The mean is lower than the true value for all sample lengths. For a very small sample length of $T = 10$ the average of the OLS estimator is $0.794$.
[Figure 2.1 appears here, with two panels: (A) the mean of the OLS estimate in the AR(1) model, with bands average ± 2·sd, for sample lengths T = 10, ..., 100; (B) the bias of the autoregressive parameter for θ ∈ {0, 0.3, 0.5, 0.7, 0.9}.]

Figure 2.1: Bias of the OLS estimator in an AR(1) model.

To illustrate how the bias depends on the autoregressive parameter, Figure 2.1 (B) reports the bias for other values, $\theta \in \{0, 0.3, 0.5, 0.7, 0.9\}$. If the DGP is static, $\theta = 0$, the estimator is unbiased. If $\theta > 0$ the estimator is downward biased, with a bias that increases with $\theta$.
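A minimal version of the Monte Carlo experiment above could be coded along the following lines (a Python/numpy sketch; the initial value $y_0 = 0$ is our own simplifying choice, so the exact averages may differ slightly from those reported in the text and in Figure 2.1):

    import numpy as np

    rng = np.random.default_rng(0)

    def ols_ar1(y):
        # OLS estimate of theta in y_t = theta*y_{t-1} + eps_t (no constant).
        return np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)

    theta, T, M = 0.9, 10, 5000
    estimates = np.empty(M)
    for m in range(M):
        y = np.zeros(T + 1)                # initial value y_0 = 0
        eps = rng.normal(size=T + 1)
        for t in range(1, T + 1):          # generate the AR(1) recursion (2.31)
            y[t] = theta * y[t - 1] + eps[t]
        estimates[m] = ols_ar1(y)

    print(estimates.mean())  # noticeably below 0.9: the downward bias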

2.2.4 Asymptotic Distribution


To make inference on the estimator, i.e. to test hypotheses on the parameter, we need a way to approximate the distribution of $\hat\beta$. To derive this we need to impose further restrictions on the error terms.

Theorem 2.4 (asymptotic distribution): Consider the linear regression model in Assumption 2.1. Suppose that Assumption 2.2 (predeterminedness), Assumption 2.3 (no perfect collinearity), and Assumption 2.4 (stationarity and weak dependence) hold, and assume in addition that (i) $\epsilon_t$ is homoskedastic,
$$E(\epsilon_t^2 \mid x_t) = \sigma^2, \qquad (2.32)$$
and that (ii) $\epsilon_t$ has no serial correlation,
$$E(\epsilon_t \epsilon_s \mid x_t, x_s) = 0, \quad \text{for all } t \neq s. \qquad (2.33)$$
Then the OLS estimator in (2.14) is asymptotically normally distributed,
$$\sqrt{T}(\hat\beta - \beta) \overset{d}{\to} N(0, \Sigma), \qquad (2.34)$$
with
$$\Sigma = \sigma^2 E(x_t x_t')^{-1}, \qquad (2.35)$$
as $T \to \infty$.
To interpret the result for the asymptotic variance, $\Sigma$, consider the case of a single regressor with mean zero. In this case,
$$\Sigma = \frac{\sigma^2}{E(x_t^2)} = \frac{V(\epsilon_t)}{V(x_t)},$$
which is a measure of the noise-to-information ratio: the more noise relative to information, the higher the variance. Note that the variance of the estimator is $V(\hat\beta) = T^{-1}\Sigma$, which converges to zero at the rate $T$. The result implies that we can test hypotheses on $\beta$. Inserting natural estimators for $\sigma^2$ and $E(x_t x_t')$, the distributional result in (2.34) can be written as
$$\hat\beta \overset{a}{\sim} N\left(\beta,\ \hat\sigma^2 \Big(\sum_{t=1}^T x_t x_t'\Big)^{-1}\right), \qquad (2.36)$$
which is again similar to the formula for the cross-sectional case. It is worth emphasizing that the asymptotic normality is the result of a central limit theorem (CLT), and it does not require normality of the error term, $\epsilon_t$.

Heteroskedasticity. If the homoskedasticity assumption (i) in Theorem 2.4 is violated, the asymptotic normality in (2.34) still holds, but the variance formula in (2.35) has to be replaced by the robust version, which can be consistently estimated by
$$\hat\Sigma = \Big(\sum_{t=1}^T x_t x_t'\Big)^{-1} \Big(\sum_{t=1}^T \hat\epsilon_t^2 x_t x_t'\Big) \Big(\sum_{t=1}^T x_t x_t'\Big)^{-1}, \qquad (2.37)$$
where $\hat\epsilon_t = y_t - x_t'\hat\beta$ denotes the estimated residual. This is also known from regression models for cross-sectional data, see e.g. Wooldridge (2006, Chapter 8).
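As a sketch, the robust variance estimator in (2.37) and the corresponding standard errors can be computed as follows (Python/numpy; this is the simplest 'sandwich' variant, and the function name is ours):

    import numpy as np

    def ols_with_robust_se(y, x):
        # OLS estimates with heteroskedasticity-robust standard errors, cf. (2.37).
        xx = x.T @ x                              # sum of x_t x_t'
        beta_hat = np.linalg.solve(xx, x.T @ y)
        resid = y - x @ beta_hat
        meat = (x * resid[:, None] ** 2).T @ x    # sum of resid_t^2 x_t x_t'
        bread = np.linalg.inv(xx)
        cov_robust = bread @ meat @ bread         # the sandwich formula
        return beta_hat, np.sqrt(np.diag(cov_robust))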

No serial correlation. The precise formulation of assumption (ii) in (2.33) is a little difficult to interpret. An alternative way to relate to the condition of no serial correlation is to think of a model for the conditional expectation of $y_t$ given the entire joint history of $y_t$ and $x_t$. If it holds that
$$E(y_t \mid x_t, y_{t-1}, x_{t-1}, y_{t-2}, x_{t-2}, \dots, y_1, x_1) = E(y_t \mid x_t) = x_t'\beta, \qquad (2.38)$$
i.e. that $x_t$ contains all relevant information in the available information set, $x_t, y_{t-1}, x_{t-1}, y_{t-2}, x_{t-2}, \dots, y_1, x_1$, then we refer to the regression model as a complete dynamic model. Assuming the complete dynamic model in (2.38) is practically the same as the no-serial-correlation assumption in (2.33); the idea is that there is no systematic information in the past of $y_t$ and $x_t$ which has not been used in the construction of the regression model.
If assumption (ii) in Theorem 2.4 of no serial correlation is violated, it means that consecutive error terms are correlated, such that $\text{cov}(\epsilon_t, \epsilon_{t-h}) \neq 0$ for some $h > 0$. In this case we say that there is autocorrelation of the error term. A simple example of a model with autocorrelation of order one could be
$$y_t = x_t'\beta + \epsilon_t \quad \text{where} \quad \epsilon_t = \alpha\epsilon_{t-1} + v_t,$$
and where $v_t$ is independent over time. These two equations can be combined, however, to yield
$$y_t - \alpha y_{t-1} = x_t'\beta - \alpha x_{t-1}'\beta + (\epsilon_t - \alpha\epsilon_{t-1}), \qquad (2.39)$$
or equivalently
$$y_t = \alpha y_{t-1} + x_t'\beta - \alpha x_{t-1}'\beta + v_t, \qquad (2.40)$$
where the error term $v_t = \epsilon_t - \alpha\epsilon_{t-1}$ is now serially uncorrelated. This indicates that autocorrelation can typically be removed by including more lags of $y_t$ and $x_t$, and the resulting model (2.40) is effectively an ADL model.
Alternatively, the problem of autocorrelation may be solved by including variables that were wrongly omitted in the initial model. If an explanatory variable is omitted, its effect will be picked up by the residual, and if the omitted variable has persistent movements, it will show up as residual correlation. One example is if the level of $y_t$ changes at some point in time and the DGP contains a level shift. If we do not account for the break in the regression, then the predicted values, $\hat y_t = x_t'\hat\beta$, will correspond to an average of the levels before and after the shift. As a consequence, the residuals are mainly positive before the break and mainly negative after the break (or the opposite). That, again, results in residual autocorrelation. In this case the solution is to try to identify the shift and to account for it in the regression model, e.g. by including a dummy variable.

Consequences of autocorrelation. If the model suffers from autocorrelation, this will not in general violate the assumptions for Theorem 2.1, and OLS remains consistent as long as the explanatory variables, $x_t$, are contemporaneously uncorrelated with the error term. As for heteroskedasticity, however, the standard formula for the variance in (2.36) is no longer valid. The asymptotic normality may still hold, and in the spirit of the heteroskedasticity-robust standard errors in (2.37), it is possible to find a consistent estimate of the correct covariance matrix under autocorrelation, the so-called heteroskedasticity-and-autocorrelation-consistent (HAC) standard errors. This is discussed in e.g. Verbeek (2017, Section 4.10.2) and Stock and Watson (2003, pp. 504-507).
If the model includes a lagged dependent variable, however, autocorrelation of the error term will violate the assumption in (2.12). To see this, consider an AR(1) model
like (2.10), and assume that the error term exhibits autocorrelation of first order, i.e. that $\epsilon_t$ follows a first-order autoregressive model,
$$\epsilon_t = \rho\epsilon_{t-1} + v_t, \qquad (2.41)$$
where $v_t$ is an i.i.d. error term. Consistency requires that $E(\epsilon_t y_{t-1}) = 0$, but that is clearly not satisfied, since both $y_{t-1}$ and $\epsilon_t$ depend on $\epsilon_{t-1}$. We have the following result:

Corollary 2.5 (inconsistency of OLS): If a regression model includes the lagged dependent variable, the OLS estimator is in general not consistent in the presence of autocorrelation of the error term.

This means, in practice, that no-autocorrelation is an important design criterion for dynamic regression models.

2.3 Formulation and Misspecification Testing


In this section we briefly outline an empirical strategy for dynamic modelling, with the explicit goal of finding a model that is dynamically complete and describes the main features of the data. The empirical strategy includes tests of the underlying assumptions, and below we also list some of the most important misspecification tests in time series econometrics.
So far we have assumed knowledge of the list of relevant regressors, $x_t$. In reality we need a way to choose these variables, and often economic theory is helpful in pointing out potential explanations for the variable of interest, $y_t$. From Theorem 2.1 we know that the estimator $\hat\beta$ is consistent for any true value $\beta$. So if we include a redundant regressor (i.e. a variable with true parameter $\beta_i = 0$), then we will be able to detect it, as $\hat\beta_i \overset{p}{\to} 0$ for $T \to \infty$. If, on the other hand, we leave out an important variable (with $\beta_i \neq 0$), then the remaining estimators will not be consistent in general. This asymmetry suggests that it is advisable to start with a larger model and then to simplify it by removing insignificant variables. This is the so-called general-to-specific principle. The opposite specific-to-general principle is dangerous, because if the initial model is too restricted (by leaving out an important variable), then estimation and inference will in general be invalid.
We can never prove that a model is correctly specified; but we can estimate a model and test for indications of misspecification in known directions, and if the model passes the tests, then we have no indications that the model is misspecified, and we may think of the model as representing the main features of the data.
Below we present a number of standard misspecification tests, and in §2.4 we consider an empirical example.

2.3.1 Test for No-Autocorrelation


Residual autocorrelation can indicate many types of misspecification of a model, and because autocorrelation may imply inconsistency, the test for no autocorrelation should be routinely applied in all time series regressions. The most commonly used test for the null hypothesis of no autocorrelation is the so-called Breusch-Godfrey Lagrange multiplier (LM) test. As an example, we consider the test for no first-order autocorrelation in the regression model (2.1). This is done by running the auxiliary regression
$$\hat\epsilon_t = x_t'\delta + \rho\hat\epsilon_{t-1} + u_t, \qquad (2.42)$$
where $\hat\epsilon_t$ is the estimated residual from (2.1) and $u_t$ is a new error term. The original explanatory variables, $x_t$, are included in (2.42) to allow for the fact that $x_t$ is not necessarily strictly exogenous and may be correlated with $\hat\epsilon_{t-1}$.
The null hypothesis of no autocorrelation corresponds to $\rho = 0$, and can be tested by the $t$-ratio on $\hat\rho$, which follows an $N(0,1)$ asymptotically. Alternatively, we can compute the LM test statistic, $\xi_{AR} = T \cdot R^2$, where $R^2$ is the coefficient of determination in the auxiliary regression (2.42). Note that the residual $\hat\epsilon_t$ is orthogonal to the explanatory variables $x_t$, so any explanatory power in the auxiliary regression must be due to the included lagged residual, $\hat\epsilon_{t-1}$. Under the null hypothesis the statistic is asymptotically distributed as
$$\xi_{AR} = T \cdot R^2 \overset{d}{\to} \chi^2(1). \qquad (2.43)$$
Note that the auxiliary regression needs one additional initial observation. It is customary to insert a zero at the beginning of the series of residuals, i.e. $\hat\epsilon_0 = 0$, and estimate the auxiliary regression for the same sample as the original model.
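A sketch of the Breusch-Godfrey statistic based on the auxiliary regression (2.42) could look as follows (Python/numpy; the implementation details are ours, and the returned statistic is compared with the χ²(1) distribution, e.g. the 5% critical value 3.84):

    import numpy as np

    def breusch_godfrey(y, x):
        # LM test for no first-order autocorrelation: xi_AR = T * R^2, cf. (2.43).
        T = len(y)
        beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]
        resid = y - x @ beta_hat
        resid_lag = np.concatenate([[0.0], resid[:-1]])  # insert epsilon_hat_0 = 0
        z = np.column_stack([x, resid_lag])              # regressors of (2.42)
        gamma = np.linalg.lstsq(z, resid, rcond=None)[0]
        u = resid - z @ gamma
        r2 = 1.0 - np.sum(u ** 2) / np.sum((resid - resid.mean()) ** 2)
        return T * r2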

2.3.2 Test for No-Heteroskedasticity


To test the assumption of no heteroskedasticity in (2.32), we use an LM test against heteroskedasticity of an unknown form (due to White). It involves the auxiliary regression of the squared residuals on the original regressors and their squares:
$$\hat\epsilon_t^2 = \gamma_0 + x_{1t}\gamma_1 + \dots + x_{kt}\gamma_k + x_{1t}^2\delta_1 + \dots + x_{kt}^2\delta_k + u_t. \qquad (2.44)$$
The null hypothesis is unconditional homoskedasticity, $\gamma_1 = \dots = \gamma_k = \delta_1 = \dots = \delta_k = 0$, and the alternative is that the variance of $\epsilon_t$ depends on $x_{it}$ or the squares $x_{it}^2$ for some $i = 1, 2, \dots, k$, i.e. that at least one of the parameters $\gamma_1, \dots, \gamma_k, \delta_1, \dots, \delta_k$ is non-zero. Again the test is based on the LM statistic, $\xi_{HET} = T \cdot R^2$, which is distributed as $\chi^2(2k)$ under the null. Sometimes a more general test is also considered, in which the auxiliary regression is augmented with all the non-redundant cross terms, $x_{it}x_{jt}$.

2.3.3 Test for Correct Functional Form: RESET


To test the functional form of the regression, the so-called RESET test can be used. The idea is to consider the auxiliary regression model
$$\hat\epsilon_t = x_t'\delta + \gamma\hat y_t^2 + u_t, \qquad (2.45)$$
where $\hat y_t = x_t'\hat\beta$ is the predicted value from the original regression. The null hypothesis of correct specification is $\gamma = 0$. The alternative is that the square of $\hat y_t = x_t'\hat\beta$ has been wrongly omitted, which indicates that the original functional form is incorrect and could be improved by including powers of linear combinations of the explanatory variables, $x_t$. The RESET test statistic is given by $\xi_{RESET} = T \cdot R^2$, and it is distributed as $\chi^2(1)$ under the null hypothesis.
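Analogously, a sketch of the RESET statistic based on (2.45) could be (Python/numpy; the function name and implementation details are ours):

    import numpy as np

    def reset_test(y, x):
        # RESET test: xi_RESET = T * R^2 ~ chi2(1) under correct specification.
        T = len(y)
        beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]
        fitted = x @ beta_hat
        resid = y - fitted
        z = np.column_stack([x, fitted ** 2])   # augment with squared fitted values
        gamma = np.linalg.lstsq(z, resid, rcond=None)[0]
        u = resid - z @ gamma
        r2 = 1.0 - np.sum(u ** 2) / np.sum((resid - resid.mean()) ** 2)
        return T * r2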

2.3.4 Test for Normality of the Error Term


The derived results for the regression model hold without assuming normality of the error term. It is still a good idea, however, to thoroughly examine the residuals from a regression model, and the normal distribution is a natural benchmark for comparison. The main reason to focus on the Gaussian distribution is that the asymptotic normal approximation to the distribution of $\hat\beta$ is known to be better if $\epsilon_t$ is close to normal². Furthermore, under normality of the error terms, least squares estimation coincides with maximum likelihood (ML) estimation, which implies that $\hat\beta$ is asymptotically efficient.
It is always a good starting point to plot the residuals to get a first visual impression of possible deviations from normality. If some of the residuals fall outside an interval of, say, three standard errors, it might be an indication that an extraordinary event has taken place. If a big residual at time $T_0$ corresponds to a known shock, the observation may be accounted for by including a dummy variable with the value 1 at $T_0$ and zero otherwise. Similarly, it is also useful to plot a histogram of the estimated residuals and compare it with the normal distribution.
A formal way of comparing a distribution with the normal is to calculate skewness (S), which measures the asymmetry of the distribution, and kurtosis (K), which measures the proportion of probability mass located in the tails of the distribution. Let $\epsilon_t$ be the error term of the regression model, with $E(\epsilon_t) = 0$, and let $\hat\epsilon_t$ be the corresponding estimated residual.

² If the error term, $\epsilon_t$, has a Gaussian distribution, then the estimator, $\hat\beta$, conditional on the regressors, has an exact Gaussian distribution, see e.g. Wooldridge (2006, Chapter 4).
Skewness (S) and kurtosis (K) are defined as the third and fourth central moments,
$$S = E[(\epsilon_t/\sigma)^3] \quad \text{and} \quad K = E[(\epsilon_t/\sigma)^4], \qquad (2.46)$$
with sample counterparts given by
$$S_T = T^{-1}\sum_{t=1}^T \left(\frac{\hat\epsilon_t - \bar\epsilon}{\hat\sigma}\right)^3 \quad \text{and} \quad K_T = T^{-1}\sum_{t=1}^T \left(\frac{\hat\epsilon_t - \bar\epsilon}{\hat\sigma}\right)^4, \qquad (2.47)$$
where $\bar\epsilon = T^{-1}\sum_{t=1}^T \hat\epsilon_t$ (typically zero if the model contains a constant term) and $\hat\sigma^2 = T^{-1}\sum_{t=1}^T (\hat\epsilon_t - \bar\epsilon)^2$.
The normal distribution is symmetric and has a skewness of $S = 0$. The normal distribution also has a kurtosis measure of $K = 3$, and $K - 3$ is often referred to as excess kurtosis. If $K$ is larger than three, the distribution has 'fat' tails in the sense that more probability mass is located in the tails. Under the assumption of normality, $H_0$, it holds that the estimated skewness and kurtosis are asymptotically normal (due to a central limit theorem),
$$\sqrt{T}\,S_T \overset{d}{\to} N(0,6) \quad \text{and} \quad \sqrt{T}(K_T - 3) \overset{d}{\to} N(0,24), \qquad (2.48)$$
such that
$$\xi_S = \frac{T}{6} S_T^2 \overset{d}{\to} \chi^2(1) \quad \text{and} \quad \xi_K = \frac{T}{24}(K_T - 3)^2 \overset{d}{\to} \chi^2(1) \quad \text{under } H_0.$$
Because $S_T$ and $K_T$ are asymptotically independent, the joint test for $S = (K - 3) = 0$ can be based on
$$\xi_{JB} = \xi_S + \xi_K \overset{d}{\to} \chi^2(2) \quad \text{under } H_0, \qquad (2.49)$$
which is known as the Jarque-Bera test.
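The Jarque-Bera statistic in (2.49) is easily computed from the estimated residuals, as in the following sketch (Python/numpy; the function name is ours):

    import numpy as np

    def jarque_bera(resid):
        # Normality test from sample skewness and kurtosis: JB ~ chi2(2) under H0.
        T = len(resid)
        z = (resid - resid.mean()) / resid.std()  # standardized residuals
        S = np.mean(z ** 3)                       # sample skewness, cf. (2.47)
        K = np.mean(z ** 4)                       # sample kurtosis, cf. (2.47)
        return T / 6 * S ** 2 + T / 24 * (K - 3) ** 2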

2.3.5 Testing for Parameter Instability


To test for parameter instability, Hansen (1992) suggests a statistic based on the constancy over time of the first-order condition for the OLS estimator. Recall first that the first-order conditions for the OLS estimates, $\hat\beta_i$, $i = 1, 2, \dots, k$, in the model (2.1) are given by
$$0 = \sum_{t=1}^T x_{it}\hat\epsilon_t, \quad i = 1, 2, \dots, k,$$
with $\hat\epsilon_t = y_t - x_t'\hat\beta$, and similarly for $\hat\sigma^2$:
$$0 = \sum_{t=1}^T (\hat\epsilon_t^2 - \hat\sigma^2).$$
Defining $f_{i,t} = x_{it}\hat\epsilon_t$, $i = 1, 2, \dots, k$, and $f_{k+1,t} = \hat\epsilon_t^2 - \hat\sigma^2$, we can write this as
$$0 = \sum_{t=1}^T \begin{pmatrix} f_{1,t} \\ \vdots \\ f_{k,t} \\ f_{k+1,t} \end{pmatrix} = \sum_{t=1}^T f_t.$$
The statistic is based on the cumulated first-order conditions,
$$S_t = \sum_{j=1}^t f_j,$$
which should fluctuate close to zero if the parameters are stable.
The test statistic, seeking to identify whether the fluctuations for parameter $i$ in $S_t = (S_{1,t}, \dots, S_{i,t}, \dots, S_{k+1,t})'$ are larger than expected, is given by
$$L_i = \frac{\sum_{t=1}^T S_{i,t}^2}{T\,V_i}, \quad \text{with } V_i = \sum_{t=1}^T f_{i,t}^2,$$
where $V_i$ is the corresponding variance. Likewise, the joint statistic for all parameters is given by
$$L = \frac{1}{T}\sum_{t=1}^T S_t' V^{-1} S_t, \quad \text{with } V = \sum_{t=1}^T f_t f_t'.$$
The asymptotic distribution of $L_i$ is non-standard, but critical values are given in Hansen (1992), and many software packages, including OxMetrics, automatically indicate the significance of the statistic. Note that the stability statistic is not calculated in the presence of dummy variables.
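For illustration, the individual fluctuation statistic $L_i$ can be computed from the OLS output along the following lines (a Python/numpy sketch for the coefficient conditions only; recall that the statistic must be compared with the non-standard critical values tabulated in Hansen (1992)):

    import numpy as np

    def hansen_Li(x, resid, i):
        # Fluctuation statistic for the stability of coefficient i.
        f = x[:, i] * resid          # first-order condition contributions f_{i,t}
        S = np.cumsum(f)             # cumulated contributions S_{i,t}
        V = np.sum(f ** 2)           # variance term V_i
        return np.sum(S ** 2) / (len(resid) * V)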

Remark 2.1 (recursive estimation): An alternative diagnostic approach to parameter instability is to estimate the relevant model for different sub-samples, e.g. using data $\{y_t, x_t\}_{t=T_0}^{T_1}$ and $\{y_t, x_t\}_{t=T_2}^{T_3}$ for some chosen sample boundaries, $T_0$, $T_1$, $T_2$, $T_3$, with $1 \le T_i \le T$. In some situations, the time points are suggested by political or institutional changes, and the approach is easy to implement.
If little or no a priori information is available, a simple alternative is the concept of recursive estimation, which is very useful in analyzing the structural stability of an estimated model. Recursive estimation is done by estimating the model for increasing samples $\{y_t, x_t\}_{t=1}^{\tau}$, where $\tau$ takes the values
$$\tau \in \{T_{\min}, T_{\min}+1, T_{\min}+2, \dots, T\}. \qquad (2.50)$$
That is, we first estimate the model with only the first $T_{\min}$ effective observations, and then we successively add a new observation and re-estimate the model. For each sample we do an OLS estimation and obtain all the usual statistics. Afterwards we can consider the sample paths of the different statistics calculated for each sample. For example, we can consider the estimated coefficients for the expanding samples,
$$\hat\beta(\tau) \quad \text{for} \quad \tau \in \{T_{\min}, T_{\min}+1, T_{\min}+2, \dots, T\}. \qquad (2.51)$$
The model is estimated under the assumption of constant coefficients, which implies that the graphs of $\hat\beta_i(\tau)$, $i = 1, 2, \dots, k$, should not fluctuate too much. According to the theory, the recursively calculated confidence bands for the parameters should decrease, and the point estimates should converge towards a constant. The minimum number of observations, $T_{\min}$, should be chosen such that the parameters can be reliably estimated. One rule of thumb in the literature is to set $T_{\min}$ to approximately $2k$, but, depending on the data, more observations may be needed.
Other statistics from the recursive estimation could also be considered, and there exist a number of formal tests for parameter constancy based on the recursive diagnostics.
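As an illustration of Remark 2.1, the recursive coefficient paths in (2.51) could be computed as follows (a Python/numpy sketch; plotting the columns of the returned array against τ gives the recursive graphs discussed in the remark):

    import numpy as np

    def recursive_ols(y, x, t_min):
        # OLS coefficient paths for expanding samples t = 1, ..., tau, cf. (2.50).
        T, k = x.shape
        paths = np.full((T, k), np.nan)
        for tau in range(t_min, T + 1):
            paths[tau - 1] = np.linalg.lstsq(x[:tau], y[:tau], rcond=None)[0]
        return paths  # row tau-1 holds beta_hat(tau)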

2.4 Empirical Example


As an empirical example we estimate a model for house prices in Denmark based on quarterly data for 1971(1)-2017(3). We consider a data set including the first difference of the logarithm of the house price, denoted $\Delta q_t$. As potential regressors to explain the development in house prices, we include changes in consumer prices, measured as the change in the log of the private consumption deflator, $\Delta p_t$, changes in the log of real disposable income for households, $\Delta y_t$, and changes in the after-tax interest rate, $\Delta r_t$, where the interest rate is given by $r_t = (1 - \tau_t)R_t$, with $\tau_t$ the average tax rate and $R_t$ the yield on a 30-year mortgage bond. These variables are suggested by a simple economic theory of housing demand.
The four time series are shown in Figure 2.2 (A)-(D). The series for house price changes in graph (A) is relatively volatile, but it also shows some systematic movements that we want to model. We recognize the big house price changes in the mid-eighties and around 2006, and the pronounced fall after the financial crisis. The time series behavior of the inflation rate in (B) is a bit problematic because of a level shift in the beginning of the eighties. Overall, however, the first-differenced time series in Figure 2.2 (A)-(D) look stationary, and it does not seem unreasonable to invoke Assumption 2.4. A more thorough analysis would probably look into the nature of the shift in the level of inflation.
If we believe that interest rates ($\Delta r_t$), income ($\Delta y_t$), and consumer prices ($\Delta p_t$) are predetermined relative to house prices ($\Delta q_t$), we may consider the linear regression
[Figure 2.2 appears here, with eight panels: (A) changes in house prices; (B) changes in consumer prices; (C) changes in real disposable income; (D) changes in after-tax interest rate; (E) estimated residuals from model (I); (F) histogram of residuals from model (II); (G) forecast based on model (III); (H) in-sample prediction based on model (III).]

Figure 2.2: Time series data for house prices and potential determinants.

model. Because we do not know the dynamic structure, i.e. how long time it takes
before interest rates and income changes are transmitted to house prices, we begin
with a large general model with three lags of each variables,
X
3 X
3 X
3 X
3
qt = + i qt i + i rt i + 'i pt i + i yt i + t: (2.52)
i=1 i=0 i=0 i=0

The regression model has 16 regressors, with

xt = (1; qt 1 ; :::; qt 3 ; rt ; :::; rt 3 ; pt ; :::; pt 3 ; yt ; :::; yt 3 )0

and
0
=( ; 1 ; :::; 3 ; 0 ; :::; 3 ; '0 ; :::; '3 ; 0 ; :::; 3) :

The predeterminedness assumption is not testable and it requires, for example,


no-reverse causality, such that house prices do not a¤ect consumer prices, interest
rates, and income at time t. This is not trivial and there could be reverse e¤ects e.g.
via the housing component of the consumer price index. In addition, it requires that
any omitted variable from the model is uncorrelated with the included regressors,
rt , pt , and yt . Again this has be defended in each case. In our case, candidates
for omitted variables could be expected capital gains in the housing marked and
developments in the stock of houses. We do not want to dwell anymore with the
moment conditions, just emphasize the point that the assumptions are not trivial
and should be discussed.
Running this regression for the full sample, 1972(1) 2017(3), produces the es-
timates reported in column (I) in Table 2.1, where the numbers in parentheses are
t ratios for the hypothesis that i = 0. In this equation R2 = 0:63, indicating that
63% of the variation in house price changes is explained by the regressors. Although it
is tempting to discuss signi…cance of the coe¢ cients, we cannot attach much weight
to the t ratios at this point because we do not know whether the assumptions in
Theorem 2.4 are ful…lled.
The estimated residuals from model (I) are depicted in Figure 2.2 (E). We note
some relatively large residuals, and to account for those we include to additional
dummy variables, taking the values one in 1993(3) and 2008(4), respectively. Es-
timates from the model augmented with the two dummy variables are reported in
column (II). We note that most estimates are only marginally changed.
To test the null hypothesis of no-autocorrelation we use the Breusch-Godfrey test
based on the auxiliary regression in (2.42), where ^t denote the estimated residuals.
From Table 2.1 we see that the inclusion of the two dummies has a large impact
on the test for autocorrelation, and the test is borderline rejected in model (II). We
nevertheless continue with the model, and hope to …nd a simpli…ed representation of
the model, which is dynamically complete.
The LM test for no heteroskedasticity is clearly rejected, indicating that the residuals are heteroskedastic. Here we cannot remove the heteroskedasticity by changing the model, and instead we report $t$-ratios that are robust to heteroskedasticity.
Figure 2.2 (F) shows the histogram of the estimated residuals, and there seems to be a nice correspondence with the normal distribution. Looking more formally at the residuals, we find $S_T = -0.131$, indicating a slight skewness to the left compared to the normal distribution, while the measure of excess kurtosis is $K_T - 3 = 0.257$. The test for $S_T = K_T - 3 = 0$ can be constructed as the Jarque-Bera statistic, which is $1.03$ and not significant in a $\chi^2(2)$ distribution.
Based on the heteroskedasticity-robust $t$-ratios, we observe that many of the coefficients in model (II) are not significantly different from zero. We therefore simplify the model using the general-to-specific approach, by successively restricting insignificant coefficients to zero. We begin by excluding the regressors with the lowest $t$-ratios and continue until all coefficients are significant. The final specific model may depend on the order of the reduction steps, and it may be a good idea to consider different sequences of reductions. Some software packages, including Stata, OxMetrics, and R, have automatic algorithms that consider many different search paths. Some introductory theory for the automatic reduction algorithms is given in Hendry and Nielsen (2007).
The preferred model for the changes in house prices is reported in column (III). We observe first that the misspecification tests are fine for the preferred model, except for heteroskedasticity, which we handle by using robust $t$-ratios. We see a strong effect from the contemporaneous change in the interest rate, and $\hat\beta_0 = -3.038$ is clearly significant. The coefficient indicates that a one percentage point fall in the interest rate raises house prices in the first quarter by over 3 percent on average, other things equal. There is also a transmission from consumer price inflation, indicating a pass-through of $\hat\varphi_0 = 0.278$. Somewhat surprisingly, the effect from changes in income is close to zero in all models, and income does not enter the final model.
To conclude, the preferred model appears to be well specified and economically interpretable. This does not imply, however, that we have reached a final model. A main drawback of the model is that all variables have been transformed to stationarity by first differencing, and all information in the levels of the variables is eliminated. An alternative would have been to look for co-integration between the variables in levels, and these techniques will be introduced later.
                                  (I)          (II)         (III)
    Constant       δ           0.00107      0.00131      0.00114
                               (0.488)      (0.726)      (0.741)
    Δr_t           β_0        -3.16        -3.12        -3.04
                              (-6.93)      (-8.08)      (-7.56)
    Δr_{t-1}       β_1         0.153        0.321          .
                               (0.291)      (0.673)
    Δr_{t-2}       β_2         0.378        0.399          .
                               (0.712)      (0.853)
    Δr_{t-3}       β_3         0.464        0.465          .
                               (0.883)      (1.08)
    Δq_{t-1}       α_1         0.520        0.527        0.466
                               (5.30)       (5.99)       (5.44)
    Δq_{t-2}       α_2         0.177        0.180        0.179
                               (1.79)       (2.02)       (2.68)
    Δq_{t-3}       α_3         0.0195       0.0243         .
                               (0.228)      (0.331)
    Δp_t           φ_0         0.365        0.306        0.278
                               (1.72)       (1.60)       (2.40)
    Δp_{t-1}       φ_1        -0.112       -0.0778         .
                              (-0.576)     (-0.4531)
    Δp_{t-2}       φ_2         0.195        0.225          .
                               (0.951)      (1.23)
    Δp_{t-3}       φ_3        -0.286       -0.298          .
                              (-1.55)      (-1.81)
    Δy_t           γ_0         0.0335       0.00662        .
                               (0.518)      (0.121)
    Δy_{t-1}       γ_1         0.0629       0.0395         .
                               (0.860)      (0.635)
    Δy_{t-2}       γ_2         0.00561     -0.0127         .
                               (0.0804)    (-0.212)
    Δy_{t-3}       γ_3         0.0268       0.0194         .
                               (0.499)      (0.407)
    I:1993(3)                    .          0.0494       0.0429
                                            (12.1)       (17.7)
    I:2008(4)                    .         -0.0484      -0.0503
                                           (-9.73)      (-17.1)

    σ̂                          0.0164       0.0156       0.0156
    R²                          0.626        0.664        0.640
    Log-lik.                  501.362      511.169      504.996
    No autocorrelation          0.181        5.81         0.00193
                               [0.67]       [0.02]       [0.97]
    No heteroskedasticity      47.6         61.6         47.3
                               [0.02]       [0.00]       [0.00]
    Normality                   2.89         1.03         1.38
                               [0.24]       [0.60]       [0.50]
    Functional form             3.34         2.13         2.54
                               [0.07]       [0.14]       [0.11]

Table 2.1: Modelling changes in house prices, Δq_t, by OLS for t = 1972(1)-2017(3), T = 183. Numbers in parentheses are heteroskedasticity-robust t-ratios, while numbers in square brackets are p-values for misspecification tests.
[Figure 2.3 appears here: a flow chart linking 'Idea/problem', 'Data', 'Model', and 'Results', with the requirements that the economic question is meaningful, the data are relevant, the assumptions on the data are fulfilled, the model is relevant, the assumptions on the model are fulfilled, the economic interpretation is convincing, and, via probability theory, the calculated results are reliable (consistency? efficiency? asymptotic normality?).]

Figure 2.3: Reliability of statistical analyses.

2.5 Summary and Practical Remarks


This chapter has presented the linear regression model for time series data and given some advice on model building. The main ideas are presented in the flow chart in Figure 2.3. Usually we begin with some economic idea or problem to analyze, and for the empirical analysis we choose some data to represent the economic variables. Based on the characteristics of the data, as well as on the economic problem in mind, we then formulate a statistical model, and calculations based on the model generate the quantitative results of the analysis.
The fundamental question is when the results, and the conclusions emerging from an interpretation of the results, can actually be trusted. This requires two things:

(1) That the logic and interpretation of the model are convincing; and
(2) that the calculated results are reliable.

For the first issue, we need (a) that the economic question is meaningful, (b) that the data are relevant measures representing the economic variables, (c) that the model is relevant in order to answer the question, and (d) that the assumptions of the model are fulfilled. For the linear regression we therefore have to argue that the linearity assumption is reasonable and that the assumption of predeterminedness holds, such that the coefficients have meaningful interpretations. These arguments are fundamental and should always be part of the presentation of a statistical analysis.
For the second issue, we have to ensure that the relevant results from probability theory apply, such that estimators are consistent, unbiased, asymptotically normal, etc. For this to hold, we need to argue that a law of large numbers and a central limit theorem exist for the relevant functions of the data, e.g. that $T^{-1}\sum_{t=1}^T x_t x_t' \overset{p}{\to} E(x_t x_t')$. This requires assumptions on the data, e.g. stationarity and weak dependence, as well as assumptions on the model.
A statistical analysis is therefore not only a matter of finding estimators, but also a matter of convincing ourselves that the assumptions for the statistical model and the data are fulfilled. Some assumptions can be tested; others have to be substantiated by economic arguments. In this process it is necessary to be very critical. If you cannot convince yourself that your results are valid and meaningful, you will have a very hard time convincing other people.
Chapter 3

Introduction to Likelihood Theory

This chapter briefly introduces the main points of the statistical analysis based on the likelihood function and generalizes the framework from the simple case of independently and identically distributed (i.i.d.) data to also allow dependent observations in the form of time series. We consider the first-order autoregressive model as the main example, and later in the course we look at other models. Verbeek (2017) gives an introduction to the likelihood analysis and considers some examples. For a more detailed introduction to likelihood-based estimation with a focus on independently and identically distributed data, see Nielsen (2017). For more thorough theoretical analyses also introducing the treatment of time series, see e.g. Hamilton (1994), Hayashi (2000) or Davidson (2001).

3.1 The Likelihood Function


Consider a set of observed data, $\{y_t\}_{t=1}^T$, where $y_t$ may be a vector, $y_t \in \mathbb{R}^p$. We want to construct a statistical model for $y_t$, and the starting point for the likelihood analysis is to consider $y_t$ as a realization of a stochastic variable. As is normal in the analysis of time series data, the notation in this chapter will not distinguish between a random variable and a realization, so please keep in mind that $y_t$ could denote both, depending on the context.
We then assume that the joint distribution of the $T$ random variables is known and characterized by the probability function (p.f.) or probability density function (p.d.f.) given by
$$f(y_1, \dots, y_T \mid \theta), \qquad (3.1)$$
where $\theta \in \mathbb{R}^k$ contains the $k$ parameters that characterize the specified distribution. Typically, not all values of the parameters are allowed, and we write that $\theta \in \Theta$, where $\Theta \subseteq \mathbb{R}^k$ is the admissible parameter space.

Example 3.1 (parameter space): If $p = 1$, and the distribution is assumed to be the Gaussian distribution, i.e.
$$y_t \overset{d}{=} N(\mu, \omega), \quad t = 1, 2, \dots, T,$$
the parameter vector would be $\theta = (\mu, \omega)'$, containing the mean, $\mu$, and the variance, $\omega$. Because the variance has to be positive, the parameter space is given by
$$\theta \in \Theta = \{ (\mu, \omega)' \in \mathbb{R}^2 \mid \omega > 0 \}.$$
Often the notation $\omega = \sigma^2$ is used to impose the positivity of $\omega$.

The joint density in (3.1) takes the parameters as known and evaluates the probability or likelihood of a certain set of values for $\{y_t\}_{t=1}^T$ given $\theta$. In practice the situation is the reverse, because the data set is observed while the parameters are unknown. To emphasize this shift, we make the following definitions:

Definition 3.1 (likelihood function): Let $f(y_1, \dots, y_T \mid \theta)$ be an assumed joint density function for $\{y_t\}_{t=1}^T$. The likelihood function is defined as
$$L(\theta \mid y_1, \dots, y_T) = f(y_1, \dots, y_T \mid \theta),$$
i.e. the joint density considered as a function of $\theta$. The log-likelihood function is
$$\log L(\theta \mid y_1, \dots, y_T) = \log f(y_1, \dots, y_T \mid \theta).$$

Sometimes we suppress the dependence on the data and simply write $\log L_T(\theta)$ to indicate that we have used $T$ observations, or simply $\log L(\theta)$. Recall that $\log(\cdot)$ is a monotone transformation, so $L_T(\theta)$ and $\log L_T(\theta)$ are maximized at the same value of $\theta$. In the likelihood analysis, we use the log-likelihood function to find an estimate of the parameter by choosing the value of $\theta$ that maximizes the likelihood. The likelihood is interpretable as the probability³ of the observed data given the model, and the maximum likelihood estimate is chosen to make the observed data as likely as possible.

³ If the data have a discrete sample space, e.g. binary data or count data, the probability function $f(y \mid \theta)$ is identical to the probability of observing a certain $y$. For continuous random variables, the density is proportional to the probability of observing an outcome in a small neighborhood of $y$, see e.g. Nielsen (2017, p. 52).

Definition 3.2 (statistical model): A statistical model for the data $\{y_t\}_{t=1}^T$ is defined by the log-likelihood function $\log L_T(\theta)$ and the parameter space $\Theta$, with $\theta \in \Theta$.

Remark 3.1 (likelihood function as a random variable): We may consider the likelihood function as a function of the data or as a function of the corresponding random variables. In the latter case, the likelihood function becomes a random variable. This is convenient, because we may use results from probability theory (and in particular limiting results for stochastic variables) to characterize the results we obtain.

Construction of the likelihood function depends on our assumptions on the data. Here we consider three different cases:

3.1.1 Independently and Identically Distributed Data


If we are willing to assume that the $T$ random variables $\{y_t\}_{t=1}^T$ are independent and follow the same distribution, we say that they are i.i.d. Then it holds that the joint density is the product of the individual densities,
$$f(y_1, \dots, y_T \mid \theta) = \prod_{t=1}^T f(y_t \mid \theta), \qquad (3.2)$$
where $f(y \mid \theta)$ is the density function for a single random variable evaluated at some $y$. We call
$$\ell_t(\theta) = \ell(\theta \mid y_t) = f(y_t \mid \theta) \qquad (3.3)$$
the contribution to the likelihood from observation $t$, with $t = 1, \dots, T$, and the likelihood function has a multiplicative structure,
$$L(\theta \mid y_1, \dots, y_T) = \prod_{t=1}^T \ell(\theta \mid y_t), \qquad (3.4)$$
such that the log-likelihood function is a sum,
$$\log L(\theta \mid y_1, \dots, y_T) = \sum_{t=1}^T \log(\ell(\theta \mid y_t)). \qquad (3.5)$$
This structure is important later, because it makes it easy to find the derivative of the log-likelihood function for optimization. We also want to apply limit results for stochastic variables, i.e. the law of large numbers and the central limit theorem, to derivatives of the likelihood function, and these limit results apply to scaled sums of random variables.

Example 3.2 (i.i.d. likelihood function): Consider a sample of count data observations, $\{y_t\}_{t=1}^T$, and assume that they are drawn from a Poisson distribution with intensity parameter $\lambda$,
$$y_t \overset{d}{=} \text{Poisson}(\lambda).$$
This means that
$$\lambda = E(y_t) = V(y_t) > 0 \qquad (3.6)$$
for all $t = 1, \dots, T$, and the parameter space is given by $\Theta = \{\lambda \in \mathbb{R} \mid \lambda > 0\}$. The likelihood contribution is given by
$$\ell(\lambda \mid y_t) = f(y_t \mid \lambda) = \frac{\lambda^{y_t}\exp(-\lambda)}{y_t!},$$
which is the Poisson p.f. It follows that
$$\log \ell(\lambda \mid y_t) = -\lambda + y_t\log(\lambda) - \log(y_t!),$$
and given independence, the log-likelihood function is given by
$$\log L(\lambda \mid y_1, \dots, y_T) = \sum_{t=1}^T \left(-\lambda + y_t\log(\lambda) - \log(y_t!)\right) = -T\lambda + \log(\lambda)\sum_{t=1}^T y_t - \sum_{t=1}^T \log(y_t!), \qquad (3.7)$$
which we could maximize to find the estimate. Observe that the last term does not depend on the parameter and could be disregarded in the likelihood analysis.
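As an illustration, the log-likelihood in (3.7) can be evaluated and maximized by a simple grid search (a Python/numpy sketch; the count data are made up for the example, and the parameter-free log(y_t!) term is dropped, as noted above):

    import numpy as np

    def poisson_loglik(lam, y):
        # Log-likelihood (3.7) without the parameter-free log(y_t!) term.
        return -len(y) * lam + np.log(lam) * np.sum(y)

    y = np.array([0, 2, 1, 3, 1, 0, 2])        # made-up count data
    grid = np.linspace(0.1, 5.0, 491)          # grid for lambda > 0
    loglik = [poisson_loglik(lam, y) for lam in grid]
    print(grid[np.argmax(loglik)], y.mean())   # maximizer is close to the sample mean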

3.1.2 Conditionally i.i.d. Data


Sometimes we have a conditional model in mind, and we want to build a statistical model for $y_t$ given $x_t$. If we are willing to assume that the conditional distributions are identical, we define the likelihood terms from the conditional density,
$$\ell_t(\theta) = \ell(\theta \mid y_t, x_t) = f(y_t \mid x_t; \theta).$$
If we also assume independence, we again get a multiplicative structure, and the log-likelihood function is again a sum,
$$\log L(\theta \mid y_1, \dots, y_T; x_1, \dots, x_T) = \sum_{t=1}^T \log(\ell_t(\theta)).$$
This does not imply that the marginal distributions of $y_t$ are identical for all $t$; they are only assumed identical conditional on $x_t$. As an example, all men in Denmark do not have the same income, and the marginal distribution of income is not the same for all men; in particular, the expected income and the variance could differ. If, however, we condition on their education, age, place of residence, etc., they could have the same distribution.

Example 3.3 (likelihood function for a conditional model): Consider a sample of observations, $\{y_t, x_t\}_{t=1}^T$, where $y_t \in \mathbb{R}$ and $x_t \in \mathbb{R}^m$ is a vector of $m$ exogenous explanatory variables. Consider a linear regression model,
$$y_t = x_t'\beta + \epsilon_t, \quad t = 1, 2, \dots, T. \qquad (3.8)$$
For maximum likelihood estimation, we make an assumption on the distribution of $\epsilon_t$, e.g. that
$$\epsilon_t \mid x_t \overset{d}{=} N(0, \sigma^2) \qquad (3.9)$$
for all $t = 1, 2, \dots, T$. This is equivalent to
$$y_t \mid x_t \overset{d}{=} N(x_t'\beta, \sigma^2),$$
such that the conditional mean is linear, $E(y_t \mid x_t) = x_t'\beta$, and the model is conditionally homoskedastic, $V(y_t \mid x_t) = \sigma^2$.
Observe that from (3.9) it follows that
$$E(\epsilon_t \mid x_t) = 0, \qquad (3.10)$$
which is the well-known assumption of predeterminedness in the linear regression.
Because of the assumption of normality, the likelihood contribution of observation $t$ is given by the Gaussian density,
$$\ell(\theta \mid y_t, x_t) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_t - x_t'\beta)^2}{2\sigma^2}\right),$$
with $\theta = (\beta', \sigma^2)'$, and under the assumption of conditional independence, the log-likelihood function is given by the sum
$$\log L(\theta \mid y_1, \dots, y_T; x_1, \dots, x_T) = \sum_{t=1}^T \left( -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_t - x_t'\beta)^2}{2\sigma^2} \right),$$
which we could maximize with respect to $\theta = (\beta', \sigma^2)'$ (over the parameter space given by $\Theta = \mathbb{R}^m \times \,]0, \infty[$) to find the estimates.

3.1.3 Time Series Data


For most time series, we cannot assume that $y_t$ and $y_s$ are independent for all $t \neq s$. In this case it does not hold that the joint density factorizes into a product of marginal densities as in (3.2), and we do not get the multiplicative form directly.
In time series models, however, the object of interest is often the properties of $y_t$ given the past, e.g. in terms of the conditional expectation, $E(y_t \mid y_1, \dots, y_{t-1})$, or the conditional variance, $V(y_t \mid y_1, \dots, y_{t-1})$. In this case, we may still factorize the likelihood function and get a multiplicative structure.
Recall from the definition of a conditional density that we can factorize a joint density into a conditional and a marginal density:
$$f(y_1, y_2, \dots, y_T \mid \theta) = f(y_T \mid y_1, \dots, y_{T-1}; \theta) \cdot f(y_1, y_2, \dots, y_{T-1} \mid \theta).$$
Next we factorize $f(y_1, y_2, \dots, y_{T-1} \mid \theta)$ as
$$f(y_1, y_2, \dots, y_{T-1} \mid \theta) = f(y_{T-1} \mid y_1, \dots, y_{T-2}; \theta) \cdot f(y_1, y_2, \dots, y_{T-2} \mid \theta),$$
and repeating the factorization, we have a multiplicative structure,
$$f(y_1, \dots, y_T \mid \theta) = f(y_1, y_2, \dots, y_p \mid \theta) \prod_{t=p+1}^T f(y_t \mid y_1, \dots, y_{t-1}; \theta).$$
If the model under consideration has the property that $y_t$ depends on $p$ lagged values, $y_{t-1}, \dots, y_{t-p}$, then the $T - p$ conditional densities,
$$f(y_t \mid y_1, \dots, y_{t-1}; \theta) = f(y_t \mid y_{t-1}, y_{t-2}, \dots, y_{t-p}; \theta),$$
each represent a model equation for $y_t \mid y_{t-1}, y_{t-2}, \dots, y_{t-p}$.
The density of the marginal distribution, $f(y_1, y_2, \dots, y_p \mid \theta)$, is different from the $T - p$ conditional densities. Because $y_t$ depends on $p$ lagged values, this density cannot be factorized in a way where the terms represent model equations. To have identical terms, we therefore define the likelihood function in terms of the density conditional on the $p$ initial values, $y_1, y_2, \dots, y_p$, that is,
$$L(\theta \mid y_1, \dots, y_T) = \frac{f(y_1, \dots, y_T \mid \theta)}{f(y_1, \dots, y_p \mid \theta)} = f(y_{p+1}, \dots, y_T \mid y_1, \dots, y_p; \theta) = \prod_{t=p+1}^T f(y_t \mid y_1, \dots, y_{t-1}; \theta). \qquad (3.11)$$
This has a multiplicative structure with $T - p$ identical terms, and the log-likelihood function will be a sum,
$$\log L(\theta \mid y_1, \dots, y_T) = \sum_{t=p+1}^T \log \ell_t(\theta), \qquad (3.12)$$
with $\ell_t(\theta) = f(y_t \mid y_1, \dots, y_{t-1}; \theta)$.

Example 3.4 (likelihood function for an autoregression): Consider the autoregressive model of order one, AR(1), as given by
$$y_t = \theta y_{t-1} + \epsilon_t, \quad t = 1, 2, \dots, T,$$
with the initial value, $y_0$, given. Again we make an assumption on the distribution of $\epsilon_t$, and the standard choice is that $\epsilon_t \mid y_{t-1} \overset{d}{=} N(0, \sigma^2)$ for all $t = 1, 2, \dots, T$. This is very similar to the linear regression model in Example 3.3, and the likelihood contribution of observation $t$ is
$$\ell_t(\theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_t - \theta y_{t-1})^2}{2\sigma^2}\right),$$
with parameters $(\theta, \sigma^2)'$ and parameter space given by $\Theta = \mathbb{R}\, \times \,]0, \infty[$. Because of the sequential factorization of the joint density, we get that the log-likelihood function conditional on the initial value, $y_0$, is given by
$$\log L(\theta, \sigma^2 \mid y_0, y_1, \dots, y_T) = \sum_{t=1}^T \left( -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_t - \theta y_{t-1})^2}{2\sigma^2} \right).$$
The likelihood function for a Gaussian AR(1) is similar to that of the linear regression, but the multiplicative structure of the likelihood is here based on a sequential factorization into conditional densities rather than on an assumption of independence.
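For illustration, the conditional log-likelihood of the Gaussian AR(1) can be evaluated as follows (a Python/numpy sketch; the function treats θ and σ² as free arguments and conditions on the first element of the supplied series as the initial value):

    import numpy as np

    def ar1_loglik(theta, sigma2, y):
        # Gaussian AR(1) log-likelihood, conditional on the initial value y[0].
        resid = y[1:] - theta * y[:-1]
        T = len(resid)
        return -T / 2 * np.log(2 * np.pi * sigma2) - np.sum(resid ** 2) / (2 * sigma2)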

3.2 The Maximum Likelihood Estimator


The maximum likelihood estimate and the corresponding estimator are de…ned as
follows:

Definition 3.3 (estimate and estimator): For a statistical model de…ned by


the log-likelihood function log L( j y1 ; :::; yT ) and parameter space , the maximum
likelihood estimate is de…ned as
^(y1 ; :::; yT ) = arg max log L( j y1 ; :::; yT ): (3.13)
2

The maximum likelihood estimator is the corresponding function of the stochastic


variables.

Observe that the estimator is a random variable, while the estimate is a vector of
numbers. Because fyt gTt=1 are realizations of the random variables, we think of the
60 Introduction to Likelihood Theory

estimate as a realization of the corresponding estimator. In most cases we suppress


the dependence on the data or random variables and we simply write ^T or ^. Both
estimate and estimator are typically abbreviated as the MLE.

3.2.1 Some Additional Notation


Consider the log-likelihood function given by
$$\log L_T(\theta) = \sum_{t=1}^T \log \ell_t(\theta).$$
We define the score vector, evaluated at some point in the parameter space, $\theta$, as the first derivative,
$$S_T(\theta) = \frac{\partial \log L_T(\theta)}{\partial\theta} = \sum_{t=1}^T \frac{\partial \log \ell_t(\theta)}{\partial\theta} = \sum_{t=1}^T s_t(\theta),$$
where $s_t(\theta)$ is the score contribution. Because $\theta$ is a $k$-dimensional vector, the score vector is also $k$-dimensional,
$$s_t(\theta) = \begin{pmatrix} \partial\log\ell_t(\theta)/\partial\theta_1 \\ \vdots \\ \partial\log\ell_t(\theta)/\partial\theta_k \end{pmatrix}.$$
The score vector, $S_T(\theta)$, is a function of the data $\{y_t\}_{t=1}^T$, and we could have emphasized that with the notation, e.g.
$$S_T(\theta) = S(\theta \mid y_1, \dots, y_T),$$
but we have suppressed that to simplify the presentation. Sometimes we will consider the score as a random variable, but we will still use the same notation.
The first-order conditions for the MLE, $\hat\theta_T$, are given by the $k$ so-called likelihood equations,
$$S_T(\hat\theta_T) = 0, \qquad (3.14)$$
which implicitly define the MLE⁴. To find the MLE, we can try to solve these $k$ equations with $k$ unknowns, but in practice this may be difficult or impossible.

⁴ In some cases, the likelihood function is maximized at the boundary of the parameter space, in which case (3.14) does not hold. We disregard such cases in the following.

The second derivative of the log-likelihood function is called the Hessian matrix,
$$H_T(\theta) = \frac{\partial^2 \log L_T(\theta)}{\partial\theta\,\partial\theta'} = \sum_{t=1}^T \frac{\partial^2 \log \ell_t(\theta)}{\partial\theta\,\partial\theta'} = \sum_{t=1}^T H_t(\theta),$$
3.2 The Maximum Likelihood Estimator 61

where
$$H_t(\theta) = \frac{\partial^2 \log \ell_t(\theta)}{\partial \theta \partial \theta'} \tag{3.15}$$
is the contribution to the Hessian from observation $t$. The Hessian matrix is of dimension $k \times k$ and symmetric, containing all second derivatives,
$$H_t(\theta) = \begin{pmatrix} \frac{\partial^2 \log \ell_t(\theta)}{\partial \theta_1 \partial \theta_1} & \cdots & \frac{\partial^2 \log \ell_t(\theta)}{\partial \theta_1 \partial \theta_k} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 \log \ell_t(\theta)}{\partial \theta_k \partial \theta_1} & \cdots & \frac{\partial^2 \log \ell_t(\theta)}{\partial \theta_k \partial \theta_k} \end{pmatrix}.$$
If the Hessian matrix is negative definite at $\hat{\theta}_T$, the solution to (3.14), $\hat{\theta}_T$, is known to be a local maximum of the log-likelihood function.

As you will see below, the Hessian contribution (3.15) plays an important role in calculating the variance of $\hat{\theta}_T$, and for later use, we define the information matrix as the negative of the expected Hessian contribution,
$$I(\theta) = -E(H_t(\theta)). \tag{3.16}$$
Often we need to evaluate the first and second derivatives at a certain point in the parameter space, e.g. at $\hat{\theta}_T$, and to simplify the notation we use
$$S_T(\hat{\theta}_T) = \frac{\partial \log L_T(\hat{\theta}_T)}{\partial \theta} = \left. \frac{\partial \log L_T(\theta)}{\partial \theta} \right|_{\theta = \hat{\theta}_T},$$
where we first take derivatives to find the expression as a function of $\theta$, and next evaluate that function at $\theta = \hat{\theta}_T$.
Example 3.5 (mle for the poisson model): For the Poisson example, the log-likelihood function is given in Example 3.2 as
$$\log L(\lambda \mid y_1, \ldots, y_T) = \sum_{t=1}^{T} \log \ell_t(\lambda) \quad \text{with} \quad \log \ell_t(\lambda) = -\lambda + y_t \log(\lambda) - \log(y_t!).$$
The score contribution is the first derivative,
$$s_t(\lambda) = \frac{\partial \log \ell_t(\lambda \mid y_t)}{\partial \lambda} = \frac{y_t}{\lambda} - 1,$$
and the score is
$$S_T(\lambda) = \sum_{t=1}^{T} s_t(\lambda) = \sum_{t=1}^{T} \left[ \frac{y_t}{\lambda} - 1 \right] = \frac{1}{\lambda} \sum_{t=1}^{T} y_t - T.$$
The first-order condition, which defines the MLE, is therefore
$$S_T(\hat{\lambda}_T) = \frac{\sum_{t=1}^{T} y_t}{\hat{\lambda}_T} - T = 0,$$
and the MLE is given by the sample average,
$$\hat{\lambda}_T = \frac{1}{T} \sum_{t=1}^{T} y_t. \tag{3.17}$$
The Hessian contribution is the second derivative,
$$H_t(\lambda) = \frac{\partial^2 \log \ell_t(\lambda)}{\partial \lambda \partial \lambda} = \frac{\partial s_t(\lambda)}{\partial \lambda} = -\frac{y_t}{\lambda^2}, \tag{3.18}$$
such that the Hessian is
$$H_T(\lambda) = \sum_{t=1}^{T} H_t(\lambda) = -\frac{\sum_{t=1}^{T} y_t}{\lambda^2}.$$
We note that the Hessian is negative for all $\lambda$, and $\hat{\lambda}$ is a maximum of the likelihood function.

For the Poisson model we know that $E(y_t) = \lambda$, and we could find the information as
$$I(\lambda) = -E(H_t(\lambda)) = E\left( \frac{y_t}{\lambda^2} \right) = \frac{E(y_t)}{\lambda^2} = \frac{1}{\lambda}.$$
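As a quick numerical illustration, the following minimal simulation sketch (not part of the original example; the true value $\lambda_0 = 3$ and all variable names are illustrative) verifies that the MLE is the sample average and that the observed information, $-T^{-1}\sum_t H_t(\hat{\lambda})$, equals $1/\hat{\lambda}$:

```python
# Sketch: verify the Poisson MLE and information numerically (illustrative values).
import numpy as np

rng = np.random.default_rng(42)
lam0 = 3.0                                # assumed true value lambda_0
y = rng.poisson(lam0, size=10_000)

lam_hat = y.mean()                        # MLE from (3.17): the sample average
H_t = -y / lam_hat**2                     # Hessian contributions from (3.18)
obs_info = -H_t.mean()                    # observed information
print(lam_hat, obs_info, 1 / lam_hat)     # obs_info and 1/lam_hat coincide
```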
3.2.2 Approaches to Estimation

There are different strategies for finding the MLE.

(1) One approach is to seek to maximize the log-likelihood function analytically by solving the first order conditions in (3.14) as in Example 3.5. This would give a closed form solution for the estimator, which makes it straightforward to calculate the estimate without numerical problems. The closed form also makes it easier to characterize the properties of the estimator as a function of the random variables. In many cases, however, this approach is not feasible, because there is no known solution to the first order conditions.

(2) Alternatively, we can apply grid search, which is a brute-force technique that tries different values over a pre-specified grid. The grid can then be made finer in the promising areas of the parameter space, until the estimate is obtained with the desired precision. Grid search is easy to implement in small models, but is very inefficient in models with many parameters.

(3) A final approach is numerical optimization of $\log L(\theta)$. This can be done by giving a starting value, $\theta^{(0)}$, and then moving the sequence $\{\theta^{(j)}\}_{j=1,2,\ldots}$ in a direction that increases the likelihood until $\theta^{(j)} \approx \theta^{(j-1)}$. Many different optimization algorithms are available. Numerical optimization is often implemented using numerical approximation of the first and second derivatives. For the score vector, we may use the finite difference approximation
$$S_T(\theta) = \frac{\partial \log L_T(\theta)}{\partial \theta} \approx \frac{\log L_T(\theta + h) - \log L_T(\theta - h)}{2h},$$
for some small step-length $h$; a small numerical sketch is given after this list. A simple introduction to numerical optimization, using the Newton-Raphson optimization algorithm, is given in Nielsen (2017, Chapter 5).
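To make approach (3) concrete, here is a minimal sketch (not from the text; the simulated data and all names are illustrative, and numpy/scipy are assumed available) that maximizes the Poisson log-likelihood from Example 3.5 with Newton-Raphson steps based on finite-difference derivatives:

```python
# Sketch: Newton-Raphson with finite-difference score and Hessian (illustrative).
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)
y = rng.poisson(3.0, size=500)

def loglik(lam):
    # log L_T(lambda) = sum_t [ -lambda + y_t log(lambda) - log(y_t!) ]
    return np.sum(-lam + y * np.log(lam) - gammaln(y + 1))

def score(lam, h=1e-5):
    # central finite-difference approximation of S_T(lambda)
    return (loglik(lam + h) - loglik(lam - h)) / (2 * h)

def hess(lam, h=1e-5):
    return (score(lam + h) - score(lam - h)) / (2 * h)

lam = 1.0                                  # starting value lambda^(0)
for _ in range(50):                        # lambda^(j+1) = lambda^(j) - S/H
    step = score(lam) / hess(lam)
    lam -= step
    if abs(step) < 1e-10:
        break
print(lam, y.mean())                       # matches the closed-form MLE
```

In this simple one-parameter case the numerical optimum reproduces the closed-form solution from Example 3.5, which is a useful check of the optimizer.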
3.3 Properties of the MLE

We want to state the properties of the maximum likelihood estimator. Sometimes, we have a closed form solution for the estimator, as in Example 3.5 where the ML estimator was given by the sample average, $\hat{\lambda}_T = \frac{1}{T} \sum_{t=1}^{T} y_t$. In this case we can derive the properties of the estimator by direct calculation of the limit of $\hat{\lambda}_T$ as $T \to \infty$. This is similar to the asymptotic analysis of the method of moments estimator in the linear regression model, see Wooldridge (2006).

In the general case, however, the MLE is only implicitly defined as the solution to the first order condition,
$$S_T(\hat{\theta}_T) = 0, \tag{3.19}$$
i.e. without having a closed form solution.

To characterize the properties of the estimator in the general case where it is only implicitly defined, the starting point is to linearize the score function, $S_T(\hat{\theta}_T)$, around the true value, $\theta_0$, via a first-order Taylor expansion. This yields
$$S_T(\hat{\theta}_T) = S_T(\theta_0) + H_T(\theta_0)(\hat{\theta}_T - \theta_0) + R_T, \tag{3.20}$$
where we have used that
$$H_T(\theta_0) = \frac{\partial}{\partial \theta'} S_T(\theta_0).$$
The last term in (3.20), $R_T$, is a remainder term involving higher order derivatives.

Using the first order condition (3.19) and disregarding the remainder term, we solve for $\hat{\theta}_T$ to get
$$\hat{\theta}_T - \theta_0 \approx [-H_T(\theta_0)]^{-1} S_T(\theta_0) = \left[ -\frac{1}{T} \sum_{t=1}^{T} H_t(\theta_0) \right]^{-1} \left[ \frac{1}{T} \sum_{t=1}^{T} s_t(\theta_0) \right], \tag{3.21}$$
where the approximation is a result of disregarding the remainder term and where we have used the expressions for $H_T(\theta_0)$ and $S_T(\theta_0)$ in terms of sums of contributions for $t = 1, 2, \ldots, T$.

The expression in (3.21) can be used to characterize the properties of the MLE, by appealing to limit results for the different scaled sums. In particular, we may use a central limit theorem to show that the estimator is asymptotically normal as $T \to \infty$. To obtain the results, we make the following high-level assumptions:
Assumption 3.1 (derivatives): Consider the log-likelihood function $\log L_T(\theta)$ with $\theta \in \Theta$, and let $\theta_0$ be an interior point of $\Theta$. Assume that the likelihood function has three continuous derivatives, such that:

(i) The score contributions obey a law of large numbers,
$$\frac{1}{T} \sum_{t=1}^{T} s_t(\theta_0) \overset{p}{\to} E(s_t(\theta_0)) = 0. \tag{3.22}$$

(ii) The score contributions obey a central limit theorem,
$$\frac{\sqrt{T}}{T} \sum_{t=1}^{T} s_t(\theta_0) \overset{d}{\to} N(0, J(\theta_0)). \tag{3.23}$$

(iii) The Hessian contributions obey a law of large numbers,
$$-\frac{1}{T} \sum_{t=1}^{T} H_t(\theta_0) \overset{p}{\to} I(\theta_0), \quad \text{with } I(\theta_0) \text{ invertible.} \tag{3.24}$$

(iv) The third derivative is bounded by a constant, such that
$$\max_{i,j,k} \left| \frac{1}{T} \sum_{t=1}^{T} \frac{\partial^3 \log \ell_t(\theta)}{\partial \theta_i \partial \theta_j \partial \theta_k} \right| \leq C_T \overset{p}{\to} C < \infty, \tag{3.25}$$
for $\theta$ in a small neighborhood of $\theta_0$.

Because we can find the derivatives of the log-likelihood function as explicit functions of the data, these somewhat abstract assumptions on the derivatives can be translated into primitive assumptions on the behavior of the data. In principle this has to be done on a case-by-case basis for each new suggested model, and, importantly, the derived assumptions on the data can be checked for each application of the model in terms of misspecification testing. We will consider an example below, and more examples later in the course.⁵

⁵ Strictly speaking, Assumption 3.1 (i) could be omitted, because it is implied by (ii)-(iv). Here we have chosen to keep it as an intuitive argument for consistency of the estimator.
Note that for all cases considered in this course, the $k \times k$ matrix $J(\theta_0)$ used in the limit in (3.23) is just the variance of the score contributions, i.e.
$$E(s_t(\theta_0) s_t(\theta_0)') = J(\theta_0). \tag{3.26}$$
Also note that the abstract condition in (3.25) ensures that the remainder term, $R_T$, is negligible in the asymptotic analysis. We discuss this briefly below.

The properties of the MLE are given in the following theorem:
Theorem 3.1 (properties of the mle): If the likelihood function is correctly specified, such that the joint density of $\{y_t\}_{t=1}^{T}$ is given by
$$f(y_1, \ldots, y_T \mid \theta_0), \tag{3.27}$$
with $\theta \in \Theta$ and with the true value $\theta_0$ in the interior of the parameter space $\Theta$, and if Assumption 3.1 holds, then the maximum likelihood estimator in Definition 3.3 has the following properties:

(a) The MLE is consistent, such that for $T \to \infty$,
$$\hat{\theta}_T \overset{p}{\to} \theta_0. \tag{3.28}$$

(b) The MLE is asymptotically normal,
$$\sqrt{T}(\hat{\theta}_T - \theta_0) \overset{d}{\to} N(0, \Sigma(\theta_0)) \tag{3.29}$$
as $T \to \infty$, where the asymptotic variance is given by
$$\Sigma(\theta_0) = I(\theta_0)^{-1}. \tag{3.30}$$

(c) The MLE is asymptotically efficient: any other consistent and asymptotically normal estimator has asymptotic variance larger than or equal to $I(\theta_0)^{-1}$, such that the variance is $I(\theta_0)^{-1} + Q$ with $Q$ positive semidefinite.
Some remarks are in order:

Remark 3.2 (speed of convergence): The result in (3.29) gives $\sqrt{T}$ as the speed of convergence. For some types of data and some estimators, the speed of convergence may be slower or faster. In our case, $\sqrt{T}$ convergence is a result of the central limit theorem in Assumption 3.1 (ii).

Remark 3.3 (asymptotic distribution): Sometimes, the asymptotic distribution in (3.29) is written as
$$\hat{\theta}_T \overset{a}{\sim} N\left( \theta_0, T^{-1} \Sigma(\theta_0) \right), \tag{3.31}$$
where we can see that the variance of the estimator,
$$V(\hat{\theta}_T) = T^{-1} \Sigma(\theta_0),$$
converges to zero with the speed of $T^{-1}$. Also observe that the asymptotic distribution depends on the true value of the parameter, $\theta_0$.
In practical applications, the data is never generated exactly by $f(y_1, \ldots, y_T \mid \theta_0)$. Instead, $L_T(\theta) = f(y_1, \ldots, y_T \mid \theta)$ represents a model, often in quite idealized form, of the process that generated the data. The main point is to build models that characterize as well as possible the main important features of the data. As a consequence, the likelihood analysis is based on a distributional assumption which is only approximately correct. As an example, we may assume a Gaussian distribution although the data indicates that the distribution of $\{y_t\}_{t=1}^{T}$ is likely to have fatter tails. If this is the case, we will call the maximizer of $\log L(\theta)$ the pseudo maximum likelihood estimator, PMLE, or the quasi maximum likelihood estimator, QMLE. The QMLE is obviously not efficient, since we could have done better if we knew the correct likelihood, but we nevertheless still have the following results:
Theorem 3.2 (quasi maximum likelihood estimator): If the likelihood function $L(\theta)$ is only approximately correct, such that $E(s_t(\theta_0)) = 0$, and if Assumption 3.1 holds, properties (a) and (b) in Theorem 3.1 still hold, with $\Sigma(\theta_0)$ replaced by
$$\Omega(\theta_0) = I(\theta_0)^{-1} J(\theta_0) I(\theta_0)^{-1}. \tag{3.32}$$
The variance in (3.32) is called the sandwich formula or the robust variance.
3.3.1 Estimating the Asymptotic Variance

To use the results in Theorem 3.1 and Theorem 3.2, we need to estimate $\Sigma(\theta_0)$ in (3.30) or $\Omega(\theta_0)$ in (3.32).

If we have a closed form expression for $I(\theta)$ and $J(\theta)$ from the analytical derivatives, it is natural to replace $\theta_0$ with the estimate, $\hat{\theta}_T$, and use
$$\Sigma(\hat{\theta}_T) = I(\hat{\theta}_T)^{-1} \tag{3.33}$$
and
$$\Omega(\hat{\theta}_T) = I(\hat{\theta}_T)^{-1} J(\hat{\theta}_T) I(\hat{\theta}_T)^{-1}. \tag{3.34}$$
Because $\hat{\theta}_T$ is a consistent estimator for $\theta_0$, inference based on the estimated variance will still be asymptotically valid.

Alternatively, and in particular when estimation is based on numerical optimization of the likelihood function, we may also use the sample averages rather than the expectations in (3.16) and (3.26), that is
$$\hat{I}(\hat{\theta}_T) = -\frac{1}{T} \sum_{t=1}^{T} \frac{\partial^2 \log \ell_t(\hat{\theta}_T)}{\partial \theta \partial \theta'} \tag{3.35}$$
and
$$\hat{J}(\hat{\theta}_T) = \frac{1}{T} \sum_{t=1}^{T} s_t(\hat{\theta}_T) s_t(\hat{\theta}_T)'. \tag{3.36}$$
The estimate $\hat{I}(\hat{\theta}_T)$ is called the observed information and $\hat{J}(\hat{\theta}_T)$ is called the outer product of the gradients (OPG).

In practice it is often useful to compare the MLE variance (3.33) and the QMLE variance (3.34). If they are roughly similar, it is an indication that the likelihood function is correctly specified, while big differences indicate that the likelihood function does not represent the distribution of the data well. In the latter case, you should always use the robust QMLE variance for inference, or perhaps try to improve the statistical model.
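The following minimal sketch (not from the text; it reuses the Poisson model, where $s_t$ and $H_t$ are known analytically, with illustrative values) computes the observed information (3.35), the OPG (3.36), and the resulting ML and sandwich standard errors, so the comparison described above can be made:

```python
# Sketch: MLE standard error vs. robust (sandwich) standard error (illustrative).
import numpy as np

rng = np.random.default_rng(1)
y = rng.poisson(3.0, size=2_000)
T, lam = y.size, y.mean()                  # lam is the MLE

s_t = y / lam - 1                          # score contributions
I_hat = np.mean(y / lam**2)                # observed information (3.35)
J_hat = np.mean(s_t**2)                    # OPG (3.36)

se_ml = np.sqrt(1 / (T * I_hat))           # from Sigma = I^{-1}, (3.33)
se_qml = np.sqrt(J_hat / (T * I_hat**2))   # from the sandwich (3.34)
print(se_ml, se_qml)                       # similar when the model is well specified
```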
3.3.2 Some Details of the Derivations

The results above are stated using matrices, because $\theta$ is a $k$-dimensional vector. To simplify notation, we derive the results for the univariate case, $k = 1$.
Consistency. Proofs of consistency are generally complicated, and here we only give a brief outline of the arguments.

We know that $\hat{\theta}_T$ maximizes $\log L_T(\theta)$ and solves $S_T(\hat{\theta}_T) = 0$ for all $T$. From Assumption 3.1 (i) the law of large numbers states that, as $T \to \infty$,
$$\frac{1}{T} S_T(\theta_0) = \frac{1}{T} \sum_{t=1}^{T} s_t(\theta_0) \overset{p}{\to} E(s_t(\theta_0)), \tag{3.37}$$
and
$$E(s_t(\theta_0)) = 0. \tag{3.38}$$
The question then is if we can be sure that the limit of $\hat{\theta}_T$, which solves $S_T(\theta) = 0$ for finite $T$, is also the solution to the limit equation $E(s_t(\theta)) = 0$. For this, the law of large numbers in (3.37) is not sufficient because it only holds point-wise. The assumption therefore has to be extended to include also higher order derivatives, as in Assumption 3.1, see Jensen and Rahbek (2004), or has to be replaced by a stronger requirement of uniform convergence, see e.g. Ploberger (2010) or Greene (2008).
Expected Score. For the quasi likelihood analysis, we assume that $E(s_t(\theta_0)) = 0$. For the correctly specified model, this holds automatically. To show this, assume that the likelihood function is correctly specified, such that the joint density of $\{y_t\}_{t=1}^{T}$ is given by $f(y_1, \ldots, y_T \mid \theta_0)$. Since $s_t(\theta_0)$ is just a function of the random variable, $y_t$, it follows by direct calculation that
$$\begin{aligned} E(s_t(\theta_0)) &= \int \frac{\partial \log \ell(\theta_0 \mid y_t)}{\partial \theta} f(y_t \mid \theta_0) \, dy \\ &= \int \frac{1}{\ell(\theta_0 \mid y_t)} \frac{\partial \ell(\theta_0 \mid y_t)}{\partial \theta} f(y_t \mid \theta_0) \, dy \\ &= \int \frac{\partial f(y_t \mid \theta_0)}{\partial \theta} \, dy \\ &= \frac{\partial}{\partial \theta} \int f(y_t \mid \theta_0) \, dy \\ &= \frac{\partial}{\partial \theta} 1 = 0. \end{aligned} \tag{3.39}$$
Here all integrals run over the support of $y$. The third step follows because $\ell(\theta_0 \mid y) = f(y \mid \theta_0)$. The fourth step requires the regularity condition that differentiation and integration can be interchanged, which is typically met if the support of $y$ does not depend on the parameters.
Asymptotic Distribution. To show asymptotic normality, consider the Taylor expansion in (3.20) and insert the first order condition, $S_T(\hat{\theta}_T) = 0$, to get
$$0 = S_T(\theta_0) + H_T(\theta_0)(\hat{\theta}_T - \theta_0) + R_T.$$
The remainder term is given by higher order derivatives, and from the mean value theorem it holds that there exists a $\tilde{\theta}$ between $\theta_0$ and $\hat{\theta}_T$ such that
$$R_T = \frac{\partial^3 \log L_T(\tilde{\theta})}{\partial \theta^3} \cdot \frac{(\hat{\theta}_T - \theta_0)^2}{2}.$$
Now insert the sums for the first two derivatives, $S_T(\theta_0) = \sum_{t=1}^{T} s_t(\theta_0)$ and $H_T(\theta_0) = \sum_{t=1}^{T} H_t(\theta_0)$, and multiply with $\sqrt{T}/T$:
$$0 = \frac{\sqrt{T}}{T} \sum_{t=1}^{T} s_t(\theta_0) + \frac{1}{T} \sum_{t=1}^{T} H_t(\theta_0) \cdot \sqrt{T}(\hat{\theta}_T - \theta_0) + \frac{\sqrt{T}}{T} R_T. \tag{3.40}$$
If we disregard the remainder term for a second, and solve for $\sqrt{T}(\hat{\theta}_T - \theta_0)$, we get
$$\sqrt{T}(\hat{\theta}_T - \theta_0) = \frac{\frac{\sqrt{T}}{T} \sum_{t=1}^{T} s_t(\theta_0)}{-\frac{1}{T} \sum_{t=1}^{T} H_t(\theta_0)}. \tag{3.41}$$
From Assumption 3.1 (ii) it holds that $\frac{\sqrt{T}}{T} \sum_{t=1}^{T} s_t(\theta_0) \overset{d}{\to} N(0, J(\theta_0))$, and from Assumption 3.1 (iii) it holds that $-\frac{1}{T} \sum_{t=1}^{T} H_t(\theta_0) \overset{p}{\to} I(\theta_0)$. Now we use that if $A_T \overset{d}{\to} A$, where $A$ is a random variable, and $B_T \overset{p}{\to} b$, where $b \neq 0$ is a constant, then $A_T / B_T \overset{d}{\to} A/b$, and it follows that
$$\sqrt{T}(\hat{\theta}_T - \theta_0) = \frac{\frac{\sqrt{T}}{T} \sum_{t=1}^{T} s_t(\theta_0)}{-\frac{1}{T} \sum_{t=1}^{T} H_t(\theta_0)} \overset{d}{\to} \frac{N(0, J(\theta_0))}{I(\theta_0)},$$
which we can write as
$$\sqrt{T}(\hat{\theta}_T - \theta_0) \overset{d}{\to} N\left( 0, I(\theta_0)^{-1} J(\theta_0) I(\theta_0)^{-1} \right). \tag{3.42}$$
This gives the general robust variance
$$\Omega(\theta_0) = I(\theta_0)^{-1} J(\theta_0) I(\theta_0)^{-1}.$$
Remainder Term. We need to show that the remainder term in (3.40) does not matter for the distribution as $T \to \infty$. By rearranging, we have
$$\frac{\sqrt{T}}{T} R_T = \frac{\sqrt{T}}{T} \cdot \frac{\partial^3 \log L_T(\tilde{\theta})}{\partial \theta^3} \cdot \frac{(\hat{\theta}_T - \theta_0)^2}{2} = \left( \frac{1}{T} \sum_{t=1}^{T} \frac{\partial^3 \log \ell_t(\tilde{\theta})}{\partial \theta^3} \right) \frac{\hat{\theta}_T - \theta_0}{2} \cdot \sqrt{T}(\hat{\theta}_T - \theta_0).$$
This expression seems to suggest that we get another contribution to the distribution of $\sqrt{T}(\hat{\theta}_T - \theta_0)$ from the third derivative, and if we solve (3.40) including the remainder for $\sqrt{T}(\hat{\theta}_T - \theta_0)$, the denominator in (3.41) would include a new term,
$$\sqrt{T}(\hat{\theta}_T - \theta_0) = \frac{\frac{\sqrt{T}}{T} \sum_{t=1}^{T} s_t(\theta_0)}{-\frac{1}{T} \sum_{t=1}^{T} H_t(\theta_0) - \underbrace{\frac{1}{T} \sum_{t=1}^{T} \frac{\partial^3 \log \ell_t(\tilde{\theta})}{\partial \theta^3}}_{A} \cdot \underbrace{\frac{\hat{\theta}_T - \theta_0}{2}}_{B}}.$$
Note, however, that because $\hat{\theta}_T$ is consistent it holds that $B = (\hat{\theta}_T - \theta_0)/2 \overset{p}{\to} 0$. Next, because $\tilde{\theta}$ is between $\hat{\theta}_T$ and $\theta_0$, $\tilde{\theta}$ is in a small neighborhood of $\theta_0$, and the third derivative is bounded by Assumption 3.1 (iv), such that the term $A$ is smaller than some constant $C_T \overset{p}{\to} C$. Therefore, we conclude that the extra term, $A \cdot B$, converges to zero and $\frac{\sqrt{T}}{T} R_T$ does not matter for the distribution of $\hat{\theta}_T$.
Information Matrix Equality. Under correct specification, it holds that
$$J(\theta_0) = I(\theta_0),$$
such that the variance simplifies to the inverse information.

To show this, use again
$$E(s_t(\theta_0)) = \int \frac{\partial \log \ell(\theta_0 \mid y_t)}{\partial \theta} f(y_t \mid \theta_0) \, dy = 0,$$
and take derivatives to get
$$\int \left( \frac{\partial \log \ell(\theta_0 \mid y_t)}{\partial \theta} \frac{\partial f(y_t \mid \theta_0)}{\partial \theta} + \frac{\partial^2 \log \ell(\theta_0 \mid y_t)}{\partial \theta \partial \theta} f(y_t \mid \theta_0) \right) dy = 0,$$
or
$$\int \frac{\partial \log \ell(\theta_0 \mid y_t)}{\partial \theta} \frac{\partial f(y_t \mid \theta_0)}{\partial \theta} \, dy = -\int \frac{\partial^2 \log \ell(\theta_0 \mid y_t)}{\partial \theta \partial \theta} f(y_t \mid \theta_0) \, dy.$$
For the left hand side, we know that
$$\frac{\partial \log \ell(\theta_0 \mid y_t)}{\partial \theta} = \frac{1}{\ell(\theta_0 \mid y_t)} \frac{\partial \ell(\theta_0 \mid y_t)}{\partial \theta} = \frac{1}{f(y_t \mid \theta_0)} \frac{\partial f(y_t \mid \theta_0)}{\partial \theta}.$$
Inserting $\partial f(y_t \mid \theta_0)/\partial \theta = f(y_t \mid \theta_0) \cdot \partial \log \ell(\theta_0 \mid y_t)/\partial \theta$ we get
$$\begin{aligned} \int \frac{\partial \log \ell(\theta_0 \mid y_t)}{\partial \theta} \frac{\partial \log \ell(\theta_0 \mid y_t)}{\partial \theta} f(y_t \mid \theta_0) \, dy &= -\int \frac{\partial^2 \log \ell(\theta_0 \mid y_t)}{\partial \theta \partial \theta} f(y_t \mid \theta_0) \, dy \\ E\left( \frac{\partial \log \ell(\theta_0 \mid y_t)}{\partial \theta} \frac{\partial \log \ell(\theta_0 \mid y_t)}{\partial \theta} \right) &= -E\left( \frac{\partial^2 \log \ell(\theta_0 \mid y_t)}{\partial \theta \partial \theta} \right) \\ J(\theta_0) &= I(\theta_0). \end{aligned}$$

The fact that $J(\theta_0) = I(\theta_0)$ holds if the distributional assumption is correct can be used to judge the degree of misspecification. If the standard errors of the parameter estimates calculated from the MLE variance formula, $\Sigma(\hat{\theta}_T) = I(\hat{\theta}_T)^{-1}$, and from the robust QMLE variance formula,
$$\Omega(\hat{\theta}_T) = I(\hat{\theta}_T)^{-1} J(\hat{\theta}_T) I(\hat{\theta}_T)^{-1},$$
are close, it suggests that the model is probably correctly specified. If the differences between the standard errors are large, it indicates some degree of misspecification.
3.4 Example: The AR(1) Model

Consider the autoregressive model of order one in Example 3.4, as given by
$$y_t = \theta y_{t-1} + \varepsilon_t, \quad t = 1, 2, \ldots, T, \tag{3.43}$$
where we condition the analysis on $y_0$ and where
$$\varepsilon_t \mid y_{t-1} \overset{d}{=} N(0, \sigma^2). \tag{3.44}$$
The parameters of the model are given by $\theta = (\theta, \sigma^2)'$ with a parameter space
$$\Theta = \{(\theta, \sigma^2)' \in \mathbb{R}^2 \mid \sigma^2 > 0\}.$$
Alternatively, we could also have taken $\sigma$ as a parameter instead of $\sigma^2$. We denote by $\theta_0 = (\theta_0, \sigma_0^2)'$ the true values of the parameters.

We discuss the statistical properties of $y_t$ generated from the AR(1) model in more detail later in the course; for now we just state that the process $y_t$ is stationary and weakly dependent if $|\theta_0| < 1$. Taking expectations we get
$$E(y_t) = \theta_0 E(y_{t-1}) + E(\varepsilon_t),$$
and under stationarity, $E(y_t) = E(y_{t-1})$, such that the unconditional expectation of $y_t$ is given by
$$E(y_t) = \frac{E(\varepsilon_t)}{1 - \theta_0} = 0,$$
which is well defined for $\theta_0 \neq 1$. Likewise for the variance,
$$V(y_t) = V(\theta_0 y_{t-1}) + V(\varepsilon_t).$$
Because $V(y_t) = V(y_{t-1})$ and $V(\varepsilon_t) = \sigma_0^2$, we get
$$V(y_t) = \theta_0^2 V(y_t) + \sigma_0^2,$$
or
$$V(y_t) = E(y_t^2) = \frac{\sigma_0^2}{1 - \theta_0^2}. \tag{3.45}$$
3.4.1 The Likelihood Function and the MLE

Based on the assumption of Gaussian errors, the log-likelihood function is
$$\log L(\theta, \sigma^2) = \sum_{t=1}^{T} \ell_t(\theta, \sigma^2)$$
with
$$\ell_t(\theta, \sigma^2) = -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(y_t - \theta y_{t-1})^2}{2\sigma^2}.$$
The score contributions are given by the first derivatives,
$$s_t(\theta, \sigma^2) = \begin{pmatrix} \frac{\partial \log \ell_t(\theta, \sigma^2)}{\partial \theta} \\ \frac{\partial \log \ell_t(\theta, \sigma^2)}{\partial \sigma^2} \end{pmatrix} = \begin{pmatrix} \frac{y_{t-1}(y_t - \theta y_{t-1})}{\sigma^2} \\ -\frac{1}{2\sigma^2} + \frac{(y_t - \theta y_{t-1})^2}{2\sigma^4} \end{pmatrix},$$
which is a $2 \times 1$ vector. The first order conditions are therefore given by the 2 equations with 2 unknowns,
$$S_T(\hat{\theta}) = \sum_{t=1}^{T} s_t(\hat{\theta}, \hat{\sigma}^2) = \begin{pmatrix} \sum_{t=1}^{T} \frac{y_{t-1}(y_t - \hat{\theta} y_{t-1})}{\hat{\sigma}^2} \\ -\frac{T}{2\hat{\sigma}^2} + \sum_{t=1}^{T} \frac{(y_t - \hat{\theta} y_{t-1})^2}{2\hat{\sigma}^4} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$
For $\hat{\sigma}^2 > 0$, the first condition is satisfied where
$$\sum_{t=1}^{T} y_{t-1}(y_t - \hat{\theta} y_{t-1}) = \sum_{t=1}^{T} y_{t-1} y_t - \hat{\theta} \sum_{t=1}^{T} y_{t-1}^2 = 0, \tag{3.46}$$
which directly gives a closed form solution for the estimator,
$$\hat{\theta} = \frac{\sum_{t=1}^{T} y_{t-1} y_t}{\sum_{t=1}^{T} y_{t-1}^2}. \tag{3.47}$$
We recognize this as being identical to the OLS estimator and conclude that in a regression model with Gaussian errors, OLS is the maximum likelihood estimator. This is not very surprising. The first order condition in (3.46) can be written as
$$\frac{1}{T} \sum_{t=1}^{T} y_{t-1} \hat{\varepsilon}_t = 0,$$
where $\hat{\varepsilon}_t = y_t - \hat{\theta} y_{t-1}$ is the estimated residual. But this is the sample counterpart to the OLS moment condition $E(y_{t-1} \varepsilon_t) = 0$, see also (3.10).

The second condition yields
$$\frac{T}{2\hat{\sigma}^2} = \sum_{t=1}^{T} \frac{\hat{\varepsilon}_t^2}{2\hat{\sigma}^4},$$
and we solve to find the MLE,
$$\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} \hat{\varepsilon}_t^2.$$
This is slightly different from the OLS estimator of the variance, which typically uses $T - 1$ instead of $T$ in the denominator.
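A minimal simulation sketch (not from the text; the parameter values and names are illustrative) confirms the closed-form MLE (3.47) and the variance estimator that divides by $T$:

```python
# Sketch: simulate an AR(1) and compute the MLE in closed form (illustrative).
import numpy as np

rng = np.random.default_rng(7)
theta0, T = 0.5, 1_000
y = np.zeros(T + 1)                        # y[0] is the initial value y_0 = 0
for t in range(1, T + 1):
    y[t] = theta0 * y[t - 1] + rng.standard_normal()

y_lag, y_cur = y[:-1], y[1:]
theta_hat = np.sum(y_lag * y_cur) / np.sum(y_lag**2)   # OLS = MLE, (3.47)
e_hat = y_cur - theta_hat * y_lag
sigma2_hat = np.mean(e_hat**2)             # MLE of sigma^2 divides by T, not T-1
print(theta_hat, sigma2_hat)
```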
3.4.2 Variance of the MLE

The second derivatives of the log-likelihood contributions are given by
$$\frac{\partial^2 \log \ell_t(\theta, \sigma^2)}{\partial \theta \partial \theta} = \frac{\partial}{\partial \theta} \left( \frac{y_{t-1}(y_t - \theta y_{t-1})}{\sigma^2} \right) = -\frac{y_{t-1}^2}{\sigma^2}$$
$$\frac{\partial^2 \log \ell_t(\theta, \sigma^2)}{\partial \theta \partial \sigma^2} = \frac{\partial}{\partial \sigma^2} \left( \frac{y_{t-1}(y_t - \theta y_{t-1})}{\sigma^2} \right) = -\frac{y_{t-1}(y_t - \theta y_{t-1})}{\sigma^4} = -\frac{y_{t-1} \varepsilon_t}{\sigma^4}$$
$$\frac{\partial^2 \log \ell_t(\theta, \sigma^2)}{\partial \sigma^2 \partial \sigma^2} = \frac{\partial}{\partial \sigma^2} \left( -\frac{1}{2\sigma^2} + \frac{(y_t - \theta y_{t-1})^2}{2\sigma^4} \right) = \frac{1}{2\sigma^4} - \frac{\varepsilon_t^2}{\sigma^6}$$
$$\frac{\partial^2 \log \ell_t(\theta, \sigma^2)}{\partial \sigma^2 \partial \theta} = \frac{\partial}{\partial \theta} \left( -\frac{1}{2\sigma^2} + \frac{(y_t - \theta y_{t-1})^2}{2\sigma^4} \right) = -\frac{y_{t-1}(y_t - \theta y_{t-1})}{\sigma^4} = -\frac{y_{t-1} \varepsilon_t}{\sigma^4}.$$
Using that $E(\varepsilon_t) = 0$, $E(\varepsilon_t^2) = \sigma_0^2$, and $E(\varepsilon_t y_{t-1}) = 0$, we can find the $2 \times 2$ information matrix:
$$I(\theta_0, \sigma_0^2) = -E \begin{pmatrix} -\frac{y_{t-1}^2}{\sigma_0^2} & -\frac{y_{t-1} \varepsilon_t}{\sigma_0^4} \\ -\frac{y_{t-1} \varepsilon_t}{\sigma_0^4} & \frac{1}{2\sigma_0^4} - \frac{\varepsilon_t^2}{\sigma_0^6} \end{pmatrix} = \begin{pmatrix} \frac{E(y_{t-1}^2)}{\sigma_0^2} & 0 \\ 0 & \frac{1}{2\sigma_0^4} \end{pmatrix}.$$
Observe that the information matrix is block diagonal. Using the result for the unconditional variance, $E(y_{t-1}^2) = \sigma_0^2/(1 - \theta_0^2)$ from (3.45), we get
$$\Sigma(\theta_0) = \frac{\sigma_0^2}{E(y_{t-1}^2)} = \frac{\sigma_0^2}{\sigma_0^2/(1 - \theta_0^2)} = 1 - \theta_0^2.$$
If the model is correctly specified, we can test hypotheses on $\theta$ using
$$\hat{\theta} \overset{a}{\sim} N\left( \theta_0, T^{-1} \Sigma(\theta_0) \right) = N\left( \theta_0, T^{-1}(1 - \theta_0^2) \right).$$
For a given sample, we can estimate the variance of the MLE by inserting the estimator, $\Sigma(\hat{\theta}) = 1 - \hat{\theta}^2$, or we can use the observed information based on the sample average,
$$\hat{\Sigma}(\hat{\theta}) = \left( T^{-1} \hat{\sigma}^{-2} \sum_{t=1}^{T} y_{t-1}^2 \right)^{-1}, \quad \text{such that} \quad \widehat{V}(\hat{\theta}) = T^{-1} \hat{\Sigma}(\hat{\theta}) = \hat{\sigma}^2 \left( \sum_{t=1}^{T} y_{t-1}^2 \right)^{-1}. \tag{3.48}$$
3.4.3 Variance of the QMLE

If we are in doubt about the assumptions for the model, we may use instead the variance (3.32) of the quasi maximum likelihood estimator. Using that
$$s_t(\theta_0) = \begin{pmatrix} \frac{y_{t-1}(y_t - \theta_0 y_{t-1})}{\sigma_0^2} \\ -\frac{1}{2\sigma_0^2} + \frac{(y_t - \theta_0 y_{t-1})^2}{2\sigma_0^4} \end{pmatrix} = \begin{pmatrix} \frac{y_{t-1} \varepsilon_t}{\sigma_0^2} \\ -\frac{1}{2\sigma_0^2} + \frac{\varepsilon_t^2}{2\sigma_0^4} \end{pmatrix},$$
we find the variance of the score by direct calculation,
$$J(\theta_0, \sigma_0^2) = E(s_t(\theta_0) s_t(\theta_0)') = E \begin{pmatrix} \frac{\varepsilon_t^2}{\sigma_0^4} y_{t-1}^2 & \left( \frac{\varepsilon_t^2}{2\sigma_0^6} - \frac{1}{2\sigma_0^4} \right) y_{t-1} \varepsilon_t \\ \left( \frac{\varepsilon_t^2}{2\sigma_0^6} - \frac{1}{2\sigma_0^4} \right) y_{t-1} \varepsilon_t & \left( -\frac{1}{2\sigma_0^2} + \frac{\varepsilon_t^2}{2\sigma_0^4} \right)^2 \end{pmatrix} = \begin{pmatrix} E\left( \frac{\varepsilon_t^2}{\sigma_0^4} y_{t-1}^2 \right) & 0 \\ 0 & E\left( -\frac{1}{2\sigma_0^2} + \frac{\varepsilon_t^2}{2\sigma_0^4} \right)^2 \end{pmatrix},$$
where we use that $E(y_{t-1} \varepsilon_t) = 0$, so that the off-diagonal terms vanish. The estimator of the upper left corner would be
$$E\left( \frac{\varepsilon_t^2}{\sigma_0^4} y_{t-1}^2 \right) \approx \frac{1}{T} \sum_{t=1}^{T} \frac{\hat{\varepsilon}_t^2}{\hat{\sigma}^4} y_{t-1}^2. \tag{3.49}$$
Now, if the model is correctly specified, such that $E(\varepsilon_t^2) = \sigma_0^2$ for $t = 1, \ldots, T$, this would converge to $\sigma_0^{-2} E(y_{t-1}^2)$ and coincide with the entry in $I(\theta_0, \sigma_0^2)$, as expected.

If the model has heteroskedasticity, on the other hand, the QMLE variance estimator corresponding to $\theta$ would be the upper left corner of
$$\hat{\Omega} = \hat{I}(\hat{\theta}_T)^{-1} \hat{J}(\hat{\theta}_T) \hat{I}(\hat{\theta}_T)^{-1},$$
which, due to block diagonality, gives the estimated variance of $\hat{\theta}$ as
$$T^{-1} \hat{\Omega}_{\theta} = \left( \hat{\sigma}^{-2} \sum_{t=1}^{T} y_{t-1}^2 \right)^{-1} \left( \sum_{t=1}^{T} \frac{\hat{\varepsilon}_t^2}{\hat{\sigma}^4} y_{t-1}^2 \right) \left( \hat{\sigma}^{-2} \sum_{t=1}^{T} y_{t-1}^2 \right)^{-1} = \left( \sum_{t=1}^{T} y_{t-1}^2 \right)^{-1} \left( \sum_{t=1}^{T} \hat{\varepsilon}_t^2 y_{t-1}^2 \right) \left( \sum_{t=1}^{T} y_{t-1}^2 \right)^{-1}, \tag{3.50}$$
which is the heteroskedasticity robust variance formula for the OLS estimator, see e.g. Wooldridge (2006, Chapter 8). The conclusion is that the QMLE variance automatically makes inference robust to heteroskedasticity.
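A minimal sketch (not from the text; the helper name is illustrative) collecting the two variance estimates for $\hat{\theta}$: the ML-based formula (3.48) and the sandwich formula (3.50), the latter coinciding with the White heteroskedasticity-robust OLS variance:

```python
# Sketch: ML and robust variance estimates for the AR(1) coefficient (illustrative).
import numpy as np

def ar1_variances(y):
    """y contains y_0, y_1, ..., y_T; returns theta_hat and the two variances."""
    y_lag, y_cur = y[:-1], y[1:]
    theta_hat = np.sum(y_lag * y_cur) / np.sum(y_lag**2)
    e = y_cur - theta_hat * y_lag
    syy = np.sum(y_lag**2)
    var_ml = np.mean(e**2) / syy                      # (3.48)
    var_robust = np.sum(e**2 * y_lag**2) / syy**2     # (3.50), White form
    return theta_hat, var_ml, var_robust
```

If the two variances differ markedly in a given sample, that is the signal of misspecification, e.g. heteroskedasticity, discussed above.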
Example 3.6 (robust standard errors): As an empirical example we reconsider the model for house prices in Denmark based on quarterly data for 1971(1)-2017(3) taken from Chapter 2 on regression for time series data. To ensure stationarity, we consider a data set including the first differences, with Δq_t being the change in the logarithm of the house price, Δp_t the change in the log of the private consumption deflator, and Δr_t the change in the after-tax interest rate.

                           t-ratios based on different variance formulas
             Estimates   Information         OPG matrix          Sandwich
                         I(θ̂)⁻¹              J(θ̂)⁻¹              I(θ̂)⁻¹J(θ̂)I(θ̂)⁻¹
                         (3.33)              (3.36)              (3.34)
Constant     0.000913    0.521               0.462               0.554
Δr_t         3.13        10.0                12.6                7.83
Δq_{t-1}     0.480       8.09                11.2                5.61
Δq_{t-2}     0.170       2.87                3.11                2.59
Δp_t         0.288       2.55                2.46                2.51
σ̂           0.0167
Log-lik.     496.223

Table 3.1: Modelling changes in house prices, Δq_t, for t = 1972(1)-2017(3).

The preferred model, excluding dummy variables, is reproduced in Table 3.1. Although the conclusions regarding statistical significance are the same, there are some differences between the variance formulas. This is in line with the finding of heteroskedasticity in the model and suggests the use of the robust standard errors.
3.4.4 Discussion of the Assumptions

Now consider the assumptions on the derivatives in Assumption 3.1, which are sufficient conditions for consistency and asymptotic normality. To simplify notation, we assume $\sigma_0^2$ to be known, such that the parameter is $\theta \in \mathbb{R}$.

First, we need a law of large numbers to apply to the first and second derivatives, see Assumption 3.1 (i) and (iii), i.e.
$$\frac{1}{T} \sum_{t=1}^{T} s_t(\theta_0) = \frac{1}{T} \sum_{t=1}^{T} \frac{y_{t-1} \varepsilon_t}{\sigma_0^2} \overset{p}{\to} \frac{E(y_{t-1} \varepsilon_t)}{\sigma_0^2} = 0$$
$$-\frac{1}{T} \sum_{t=1}^{T} H_t(\theta_0) = \frac{1}{T} \sum_{t=1}^{T} \frac{y_{t-1}^2}{\sigma_0^2} \overset{p}{\to} I(\theta_0).$$
There are many versions of the law of large numbers, but it holds for example for i.i.d. observations, known from the cross sectional case. For our time series example of the AR(1), the law of large numbers holds for stationary and weakly dependent processes. Our results above therefore require that the stationarity condition for the AR(1) model is satisfied, i.e. that $|\theta_0| < 1$.

Next, we need a central limit theorem for the score contribution, see Assumption 3.1 (ii),
$$\frac{\sqrt{T}}{T} \sum_{t=1}^{T} s_t(\theta_0) = \frac{\sqrt{T}}{T} \sum_{t=1}^{T} \frac{y_{t-1} \varepsilon_t}{\sigma_0^2} \overset{d}{\to} N(0, J(\theta_0)). \tag{3.51}$$
This is more complicated, but it again holds for i.i.d. observations, or for stationary and weakly dependent processes with finite fourth order moments, see e.g. Hamilton (1994) for the precise details.

Finally, consider the third derivative in Assumption 3.1 (iv),
$$\frac{\partial^3 \log \ell_t(\theta)}{\partial \theta^3} = \frac{\partial}{\partial \theta} \left( -\frac{y_{t-1}^2}{\sigma_0^2} \right) = 0.$$
This is zero for the AR(1) (and also for the linear regression model), in which case Assumption 3.1 (iv) is automatically satisfied. In this case the second order Taylor expansion in (3.20) has no remainder term, $R_T = 0$.

We may conclude that the asymptotic results in Theorem 3.1 (or Theorem 3.2) hold if (i) the AR(1) model is stationary, $|\theta_0| < 1$, (ii) the model is correctly specified or the score has expectation zero, $E(y_{t-1} \varepsilon_t) = 0$, and (iii) the fourth order moments are finite. Observe that the condition on the score is violated if the error terms are autocorrelated.
3.5 Three Classical Test Principles

After estimation, we are often interested in testing particular hypotheses formulated as restrictions on the parameters. To illustrate, consider a null hypothesis of interest, $H_0$, and an alternative, $H_A$, formulated as
$$H_0: R'\theta_0 = q \quad \text{against} \quad H_A: R'\theta_0 \neq q,$$
where $R$ is a $k \times j$ matrix that imposes $j$ linear restrictions on the parameters. Each of these cases corresponds to a statistical model, and we let $\hat{\theta}$ denote the MLE for the unrestricted model, while $\tilde{\theta}$ denotes the MLE obtained after imposing the restriction, i.e. the restricted estimates. Observe that the model under $H_0$ is a special case of the unrestricted model, $H_U$, and we say that the models are nested,
$$H_0 \subset H_U.$$

Example 3.7 (formulation of restrictions): As an example of formulating restrictions, assume $k = 3$ and $\theta = (\theta_1, \theta_2, \theta_3)'$. The $j = 2$ restrictions $\theta_1 = \theta_2$ and $2\theta_2 - \theta_3 = 0$ could be imposed using
$$H_0: \begin{pmatrix} 1 & -1 & 0 \\ 0 & 2 & -1 \end{pmatrix} \begin{pmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \end{pmatrix} = \begin{pmatrix} \theta_1 - \theta_2 \\ 2\theta_2 - \theta_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

Below, we discuss three different test principles:

(1) The Wald test uses the estimates from the unrestricted model and looks at the distance $R'\hat{\theta} - q$ normalized by an appropriate covariance matrix. The illustration in Figure 3.1 shows the likelihood function (maximized at $\hat{\theta}$) and the hypothesis $H_0: \theta = \theta_0$. The Wald test of $H_0$ is based on the horizontal distance, $\hat{\theta} - \theta_0$.

(2) The likelihood ratio (LR) test uses estimates obtained both under $H_0$, i.e. $\tilde{\theta}$, and unrestrictedly, $\hat{\theta}$, and it is based on the loss in likelihood, $\log L(\hat{\theta}) - \log L(\tilde{\theta})$, i.e. the vertical distance in Figure 3.1.

(3) Finally, the Lagrange multiplier (LM) or score test is based solely on the restricted estimate, $\tilde{\theta}$: it checks whether the first order condition $S_T(\tilde{\theta}) = 0$ is significantly violated, where $S_T(\theta)$ is the score function of the unrestricted model.

Which test to use is typically chosen by convenience. The likelihood ratio test is the most efficient in general, because it uses all information both under the null and under the alternative. If possible, this test should be preferred. The Wald test, on the other hand, is very convenient if we have estimated the general model and want to test if it can be simplified, or if all parameters are statistically significant. Finally, the LM test is convenient if the general models are difficult to estimate and we want to check our preferred model against misspecification in different directions. Then we may perform misspecification tests without ever estimating the complicated general models.
3.5.1 Wald Test

The Wald test requires only estimation of the unrestricted model. From the properties of the MLE, we know that $\hat{\theta} \overset{a}{\sim} N(\theta_0, T^{-1}\Sigma)$, see (3.31), such that
$$R'\hat{\theta} \overset{a}{\sim} N(R'\theta_0, T^{-1} R'\Sigma R).$$
If the null hypothesis is true, it holds that $R'\theta_0 = q$, and $R'\hat{\theta} \overset{a}{\sim} N(q, T^{-1} R'\Sigma R)$, and a natural test statistic is
$$W_T(R'\theta_0 = q) = T (R'\hat{\theta} - q)' (R'\hat{\Sigma}R)^{-1} (R'\hat{\theta} - q),$$
where we may use some consistent estimator, $\hat{\Sigma}$, for the variance, $\Sigma$. If the null hypothesis is true, and if Assumption 3.1 holds for the model, it holds that $W_T(R'\theta_0 = q) \overset{d}{\to} \chi^2(j)$, where $j$ is the number of restrictions imposed.

[Figure 3.1: Illustration of the three test principles, the Wald test (W), the likelihood ratio test (LR), and the Lagrange multiplier test (LM).]
Remark 3.4 (wald statistic as a distance): Recall that for a vector
$$x = (x_1, x_2, \ldots, x_j)' \in \mathbb{R}^j,$$
the Euclidean norm, $\|x\|$, is given by
$$\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_j^2} = \sqrt{x'x}.$$
Noting that $x'x = x' I_j x$, where $I_j$ is the $j \times j$ identity matrix, we may think of $\|x\|$ as the norm of $x$ with respect to $I_j$, denoted $\|x\|_{I_j}$. In general, we may consider, for any positive definite $\Lambda$,
$$\|x\|_{\Lambda} = \sqrt{x' \Lambda x} \geq 0.$$
The norm induces a distance (metric) between any two vectors $x, y \in \mathbb{R}^j$,
$$\|x - y\|_{\Lambda} = \sqrt{(x - y)' \Lambda (x - y)}.$$
It follows that the Wald statistic, $W_T(R'\theta_0 = q)$, is the squared distance between $R'\hat{\theta}$ and $q$ given by the metric with respect to the weight matrix $T(R'\hat{\Sigma}R)^{-1}$, i.e.
$$W_T(R'\theta_0 = q) = \|R'\hat{\theta} - q\|^2_{T(R'\hat{\Sigma}R)^{-1}} = T (R'\hat{\theta} - q)' (R'\hat{\Sigma}R)^{-1} (R'\hat{\theta} - q).$$
Remark 3.5 (t-statistic): Note that for $j = 1$, the usual $t$-ratio for $H_0: \theta_i = b$ is given by
$$t_{\theta_i = b} = \frac{\hat{\theta}_i - b}{\sqrt{V(\hat{\theta}_i)}}, \quad \text{with} \quad t_{\theta_i = b} \overset{d}{\to} N(0, 1),$$
and the Wald statistic is just the square of the $t$-ratio, $W_T(R'\theta_0 = q) = t^2_{\theta_i = b}$. The Wald test is based on the squared distance and is therefore a two-sided test. The usual $t$-ratio, on the other hand, can be used also for one-sided hypothesis testing.
Remark 3.6 (robust wald test): One advantage of the Wald test is that it is easily made robust to model misspecification, i.e. based on QMLE estimates, using $\Omega$ instead of $\Sigma$. This robustification is typically harder for, e.g., the LR test. An example is the linear regression model, where robust $t$-statistics are easily calculated using the heteroskedasticity robust standard errors rather than the usual OLS standard errors.
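A minimal sketch (not from the text; the function name and inputs are illustrative) of the Wald statistic as a reusable function; for a robust Wald test, $\hat{\Sigma}$ would simply be replaced by the sandwich estimate $\hat{\Omega}$:

```python
# Sketch: Wald statistic W_T = T (R'th - q)' (R' S R)^{-1} (R'th - q) (illustrative).
import numpy as np

def wald_statistic(theta_hat, Sigma_hat, R, q, T):
    """R is k x j (j restrictions); returns W_T, chi^2(j) under H_0."""
    d = R.T @ theta_hat - q
    V = R.T @ Sigma_hat @ R
    return float(T * d @ np.linalg.solve(V, d))
```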
3.5.2 Likelihood Ratio (LR) Test

For the LR test, estimation is required both under $H_0$ and for the unrestricted model, and the statistic is a measure of the fall in likelihood,
$$LR = -2 \log \left( \frac{L(\tilde{\theta})}{L(\hat{\theta})} \right) = -2 \left( \log L(\tilde{\theta}) - \log L(\hat{\theta}) \right),$$
where $L(\tilde{\theta})$ and $L(\hat{\theta})$ are the two likelihood values. Under the null hypothesis (and under Assumption 3.1) the LR statistic is asymptotically distributed as $LR \overset{d}{\to} \chi^2(j)$, see Nielsen (2017, Appendix E) for a derivation.

The LR test is again a two-sided test. For a single hypothesis, $H_0: \theta_i = b$, a one-sided version is the signed likelihood ratio statistic,
$$\omega_{\theta_i = b} = \text{sign}(\hat{\theta}_i - b) \sqrt{LR(\theta_i = b)} \overset{d}{\to} N(0, 1),$$
which is parallel to a $t$-test.

Because the LR test is based on two separate estimations, it is important to ensure that the models to be compared are actually nested, $H_0 \subset H_U$, such that $H_0$ is a special case of the unrestricted model.

Remark 3.7 (non-robustness of the lr-test): In general, the LR test is not robust to misspecification. If the asymptotic covariance of $\hat{\theta}_T$ is the sandwich matrix $\Omega \neq \Sigma$, then $LR$ is not asymptotically $\chi^2(j)$ distributed.
Example 3.8 (wald and likelihood ratio tests): Consider a regression model for stationary data,
$$\Delta c_t = \beta_0 + \beta_1 \Delta y_t + \beta_2 \Delta r_t + \beta_3 \Delta w_t + \varepsilon_t,$$
where $c$ is the log of consumption, $y$ is the log of income, $r$ is the interest rate, and $w$ is the log of real wealth. Assuming correct specification, the estimation results are given in Table 3.2, with $t$-values in parentheses.

                      M0          M1          M2
Constant (×1000)      0.155       0.173       2.322
                      (0.0954)    (0.107)     (1.40)
Δy_t                  0.197       0.196       0.2021
                      (3.13)      (3.13)      (2.99)
Δr_t                  0.238       0           1.030
                      (0.27)                  (1.18)
Δw_t                  0.572       0.559       0
                      (4.26)      (4.45)
log-likelihood        324.545     324.506     315.825
R²                    0.209       0.208       0.086
T                     121         121         121

Table 3.2: Modelling changes in consumption, Δc_t.

The Wald test for the hypothesis that the interest rate is not needed is simply the $t$-ratio in model M0, given by $t_{\beta_2 = 0} = 0.27$. The critical value of the $N(0,1)$ distribution is 1.96 and we cannot reject the hypothesis. The corresponding hypothesis that wealth is not needed gives a statistic of $t_{\beta_3 = 0} = 4.26$, which is clearly rejected. The squared statistics are given by $W_T(\beta_2 = 0) = 0.073$ and $W_T(\beta_3 = 0) = 18.15$, which are asymptotically $\chi^2(1)$.

The likelihood ratio statistic for $\beta_2 = 0$ is given by twice the fall in log-likelihood, i.e. $LR(\beta_2 = 0) = 2(324.545 - 324.506) = 0.078$, which is very close to the Wald statistic. Similarly, the likelihood ratio statistic for $\beta_3 = 0$ is given by $LR(\beta_3 = 0) = 2(324.545 - 315.825) = 17.44$, again close to the Wald statistic.

Importantly, the models in M1 and M2 are not nested and cannot be compared by a likelihood ratio test. The reason is that we cannot impose a restriction on M1 to obtain M2 or vice versa.
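The statistics in Example 3.8 are simple arithmetic on the table entries, as the following few lines (using only the numbers from Table 3.2) confirm:

```python
# Check of the statistics in Example 3.8 from the Table 3.2 values.
lr_b2 = 2 * (324.545 - 324.506)   # LR(beta_2 = 0) = 0.078
lr_b3 = 2 * (324.545 - 315.825)   # LR(beta_3 = 0) = 17.44
w_b2 = 0.27**2                    # Wald(beta_2 = 0) = t^2 = 0.073
w_b3 = 4.26**2                    # Wald(beta_3 = 0) = t^2 = 18.15
```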
3.5.3 Lagrange Multiplier (LM) Test

Let $S_T(\theta)$ be the score function of the unrestricted model. Recall that the score is zero at the unrestricted estimate,
$$S_T(\hat{\theta}) = \sum_{t=1}^{T} s_t(\hat{\theta}) = 0.$$
If the hypothesis of interest is true, e.g. $R'\theta_0 = q$, it should also hold that
$$S_T(\tilde{\theta}) = \sum_{t=1}^{T} s_t(\tilde{\theta}) \approx 0, \tag{3.52}$$
where $\tilde{\theta}$ is the estimate obtained under the hypothesis. We can therefore test the hypothesis by considering the quadratic form
$$LM_T = \left( \sum_{t=1}^{T} s_t(\tilde{\theta}) \right)' \left( V\left( \sum_{t=1}^{T} s_t(\tilde{\theta}) \right) \right)^{-1} \left( \sum_{t=1}^{T} s_t(\tilde{\theta}) \right).$$
Under the null hypothesis, this is distributed as a $\chi^2(j)$. Note that the quadratic form is of dimension $k$, but $k - j$ elements are unrestricted and $\sum s_t(\tilde{\theta}) = 0$ for these elements. Therefore the quadratic form has a $\chi^2(j)$ distribution and not a $\chi^2(k)$.

To find $V\left( \sum_{t=1}^{T} s_t(\tilde{\theta}) \right)$, we note that the variance of the individual score, $s_t(\theta)$, is
$$J(\theta_0) = E[s_t(\theta_0) s_t(\theta_0)'],$$
which can be estimated by the outer product of the gradients,
$$\hat{J}(\tilde{\theta}) = \frac{1}{T} \sum_{t=1}^{T} s_t(\tilde{\theta}) s_t(\tilde{\theta})'.$$
The estimated variance of $\sum_{t=1}^{T} s_t(\tilde{\theta})$ is therefore
$$T \hat{J}(\tilde{\theta}) = \sum_{t=1}^{T} s_t(\tilde{\theta}) s_t(\tilde{\theta})',$$
and the LM statistic can be written as
$$LM_T = \left( \sum_{t=1}^{T} s_t(\tilde{\theta}) \right)' \left( \sum_{t=1}^{T} s_t(\tilde{\theta}) s_t(\tilde{\theta})' \right)^{-1} \left( \sum_{t=1}^{T} s_t(\tilde{\theta}) \right). \tag{3.53}$$
LM Tests by Auxiliary Regressions. In practice, the LM statistics are often calculated as $T \cdot R^2$, where $R^2$ is the coefficient of determination in an auxiliary regression. To see why this is the case, define the matrices
$$\underset{(T \times k)}{S} = \begin{pmatrix} s_1(\tilde{\theta})' \\ \vdots \\ s_T(\tilde{\theta})' \end{pmatrix} \quad \text{and} \quad \underset{(T \times 1)}{\iota} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}.$$
Then it holds that
$$\underset{(1 \times 1)}{\iota'\iota} = T, \quad \underset{(k \times 1)}{S'\iota} = \sum_{t=1}^{T} s_t(\tilde{\theta}), \quad \text{and} \quad \underset{(k \times k)}{S'S} = \sum_{t=1}^{T} s_t(\tilde{\theta}) s_t(\tilde{\theta})',$$
and the LM statistic can be calculated as
$$LM_T = \left( \sum_{t=1}^{T} s_t(\tilde{\theta}) \right)' \left( \sum_{t=1}^{T} s_t(\tilde{\theta}) s_t(\tilde{\theta})' \right)^{-1} \left( \sum_{t=1}^{T} s_t(\tilde{\theta}) \right) = \iota' S (S'S)^{-1} S' \iota.$$
Now consider an auxiliary regression model given by
$$\iota = S\beta + \text{residual}, \tag{3.54}$$
where the regressand is just a vector of unities, $\iota = (1, 1, \ldots, 1)' \in \mathbb{R}^T$, and the regressors are the score contributions evaluated at $\tilde{\theta}$. The OLS estimator and the predicted values are given by, respectively,
$$\hat{\beta} = (S'S)^{-1} S'\iota \quad \text{and} \quad \hat{\iota} = S\hat{\beta} = S(S'S)^{-1} S'\iota,$$
and the LM statistic can then be written as
$$LM_T = \iota' S (S'S)^{-1} S'S (S'S)^{-1} S'\iota = T \cdot \frac{\hat{\iota}'\hat{\iota}}{\iota'\iota} = T \cdot R^2,$$
where $R^2$ is the coefficient of determination from the (somewhat strange) auxiliary regression in (3.54).
Remark 3.8 (auxiliary regressions): Sometimes, alternative auxiliary regressions are used for the LM tests as an alternative to (3.54). To illustrate, consider the linear regression model
$$y_t = x_t'\beta + \varepsilon_t, \quad t = 1, 2, \ldots, T, \tag{3.55}$$
with
$$\varepsilon_t \mid x_t \overset{d}{=} N(0, \sigma^2).$$
To test for omitted variables, $w_t \in \mathbb{R}^m$, we could use the auxiliary regression as in equation (3.54),
$$1 = \tilde{\varepsilon}_t x_t'\delta_0 + \tilde{\varepsilon}_t w_t'\delta_1 + \text{residual}, \tag{3.56}$$
where the left hand side is just the constant unity. Alternatively, we can use the estimated residuals, $\tilde{\varepsilon}_t$, from (3.55) and run the auxiliary regression including the original and the proposed new regressors:
$$\tilde{\varepsilon}_t = x_t'\delta_0 + w_t'\delta_1 + \text{residual}. \tag{3.57}$$
In both cases, the LM statistic is $LM_T = T \cdot R^2$, which is asymptotically $\chi^2(m)$, where $m$ is the dimension of the omitted variable $w_t$. The two versions are asymptotically equivalent.

Similarly, the well-known Breusch-Godfrey test for no first-order autocorrelation is based on one of the auxiliary regressions
$$1 = \tilde{\varepsilon}_t x_t'\delta_0 + \tilde{\varepsilon}_t \tilde{\varepsilon}_{t-1}\delta_1 + \text{residual} \tag{3.58}$$
$$\tilde{\varepsilon}_t = x_t'\delta_0 + \tilde{\varepsilon}_{t-1}\delta_1 + \text{residual}, \tag{3.59}$$
which is simply a test for the omitted variable $\varepsilon_{t-1}$. In both cases, the test statistic is $LM_T = T \cdot R^2$, which is asymptotically $\chi^2(1)$. To test for higher order autocorrelation the auxiliary regression includes more lags, $\tilde{\varepsilon}_{t-1}, \tilde{\varepsilon}_{t-2}, \ldots, \tilde{\varepsilon}_{t-m}$.

Finally, the typical Breusch-Pagan test for no heteroskedasticity is based on the regression
$$\tilde{\varepsilon}_t^2 = x_t'\gamma + \text{residual}, \tag{3.60}$$
with LM statistic $LM_T = T \cdot R^2$.
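A minimal sketch (not from the text; the helper name is illustrative) of the Breusch-Godfrey test computed from the auxiliary regression (3.59):

```python
# Sketch: Breusch-Godfrey LM test for first-order autocorrelation (illustrative).
import numpy as np

def breusch_godfrey(e, X):
    """e: residuals from (3.55); X: T x k regressor matrix; returns LM = T * R^2."""
    Z = np.column_stack([X[1:], e[:-1]])      # x_t and the lagged residual e_{t-1}
    u = e[1:]
    beta = np.linalg.lstsq(Z, u, rcond=None)[0]
    resid = u - Z @ beta
    r2 = 1 - resid @ resid / ((u - u.mean()) @ (u - u.mean()))
    return u.size * r2                        # asymptotically chi^2(1) under H_0
```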
3.6 Conclusion and Main Points

We will use the likelihood principle for the analysis of most models in this course, and this chapter gave a reiteration of some of the principles and theoretical results. The main points include:

(1) The likelihood function is based on an assumed joint density of the data, $L_T(\theta) = f(y_1, \ldots, y_T \mid \theta)$, and a statistical model consists of the (log-) likelihood function and the parameter space, $\Theta$, with $\theta \in \Theta$. The assumptions for the analysis can and should be tested.

(2) For time series data, the multiplicative form of the likelihood function is implied by a sequential factorization based on the conditional densities instead of the marginal densities.

(3) Under certain assumptions, the MLE is consistent, asymptotically normal and has the smallest possible variance.

(4) The assumptions involve convergence of the score and the Hessian, and boundedness of the third derivative.

(5) If the likelihood function is only an approximation, but it still holds that $E(s_t(\theta)) = 0$, the QMLE is consistent and asymptotically normal, but with a larger (robust) variance.

(6) Hypotheses on the parameters can be tested using the Wald test, the likelihood ratio test, and the Lagrange multiplier test principle. Which to use is typically decided from convenience.
Chapter 4

Univariate Models for Stationary Economic Time Series

This chapter introduces some popular classes of single-equation dynamic time series models and goes into details with the mathematical structure and the economic interpretation of the models. To introduce the ideas, we first consider the moving average (MA) model, which is probably the simplest dynamic model. Next, we consider the popular autoregressive (AR) model and present conditions for this model to generate stationary time series. We also discuss the relationship between AR and MA models and introduce mixed ARMA models allowing for both AR and MA terms.

4.1 Estimating Dynamic Effects

A typical feature of time series data is a pronounced time dependence, and it follows that shocks may have dynamic effects. Below we present a number of popular time series models for the estimation of dynamic responses after an impulse to the model. As a starting point it is useful to distinguish univariate from multivariate models.

4.1.1 Univariate Models

Univariate models consider a single time series, $y_1, y_2, \ldots, y_T$, and model the systematic variation in $y_t$ as a function of its own past, e.g. the conditional expectation,
$$E(y_t \mid y_{t-1}, y_{t-2}, \ldots),$$
or the conditional variance,
$$V(y_t \mid y_{t-1}, y_{t-2}, \ldots).$$
An important example considered in detail below is the so-called autoregressive model, where the conditional expectation is assumed to be a linear function of the conditioning set, e.g.
$$y_t = \delta + \theta_1 y_{t-1} + \varepsilon_t, \tag{4.1}$$
where $\varepsilon_t$ is the deviation from the conditional mean. Although the univariate approach is limited, it nevertheless serves two important purposes.

The first purpose is to serve as a descriptive tool to characterize the dynamic properties of a time series. We may for example be interested in the strength of the time dependence, or persistence, of the time series, and we often want to assess whether the main assumption of stationarity is likely to be fulfilled. In empirical applications the univariate description often precedes a more elaborate multivariate analysis of the data.

The second purpose is forecasting: To forecast $y_{T+1}$ at time $T$ based on a multivariate model for $y_t$ conditional on $x_t$ it is obviously necessary to know $x_{T+1}$, which is generally not in the information set at time $T$. Univariate models offer the possibility of forecasts of $y_t$ based solely on its own past. These forecasts are simple extrapolations based on the systematic variation in the past.

Below we look at two classes of univariate models: We first introduce the moving average model in §4.3. This is the simplest class of dynamic models, and the conditions for stationarity are straightforwardly obtained.

We next present the autoregressive model in §4.4 and derive the stationarity condition by referring to the results for MA models. We emphasize the relationship between the two models and present the ARMA class of mixed models allowing for both autoregressive and moving-average terms. We continue in §4.6 and §4.7 to discuss estimation and forecasting.
4.1.2 Single-Equation Multivariate Models

An alternative to univariate models is to consider a model for $y_t$ given an information set including other explanatory variables, the vector $x_t$ (say). These models are obviously more interesting from an economic point of view, and they allow the derivation of dynamic multipliers,
$$\frac{\partial y_t}{\partial x_t}, \ \frac{\partial y_{t+1}}{\partial x_t}, \ \frac{\partial y_{t+2}}{\partial x_t}, \ldots$$
An important example could be the analysis of monetary policy, in which case $x_t$ could be the policy interest rate and $y_t$ the variable of interest, e.g. unemployment or inflation. In Chapter 5 we look at a model for $y_t$ conditional on $x_t$ and the past, i.e.
$$E(y_t \mid y_{t-1}, y_{t-2}, \ldots, x_t, x_{t-1}, x_{t-2}, \ldots). \tag{4.2}$$
Assuming that the conditional expectation is a linear function produces a model of the form
$$y_t = \delta + \theta_1 y_{t-1} + \phi_0 x_t + \phi_1 x_{t-1} + \varepsilon_t. \tag{4.3}$$
This so-called autoregressive distributed lag (ADL) model is the workhorse in single-equation dynamic modelling.

The single equation models are based on an assumed causal structure, i.e. that it is $x_t$ which determines $y_t$ and not the other way around. This may be natural in some cases, but can also be controversial.
4.1.3 Multiple-Equation Models

If the causal direction between variables is uncertain, we may also use multiple-equation tools, where we let $y_t$ and $x_t$ be determined jointly by their past, e.g. in terms of the expectation of the vector $Z_t = (y_t, x_t)'$ conditional on the past:
$$E(Z_t \mid Z_{t-1}, Z_{t-2}, \ldots) = E\left( \begin{pmatrix} y_t \\ x_t \end{pmatrix} \,\middle|\, y_{t-1}, y_{t-2}, \ldots, x_{t-1}, x_{t-2}, \ldots \right). \tag{4.4}$$
Assuming again a linear structure we may write the two equations as, e.g.,
$$y_t = \delta_1 + \pi_{11} y_{t-1} + \pi_{12} x_{t-1} + \varepsilon_{1t}$$
$$x_t = \delta_2 + \pi_{21} y_{t-1} + \pi_{22} x_{t-1} + \varepsilon_{2t},$$
and the statistical analysis could analyze the two equations jointly.

Using vector notation, the two equations may be written as
$$\begin{pmatrix} y_t \\ x_t \end{pmatrix} = \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix} + \begin{pmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \end{pmatrix} \begin{pmatrix} y_{t-1} \\ x_{t-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix}, \tag{4.5}$$
or simply
$$Z_t = \delta + \Pi Z_{t-1} + \varepsilon_t.$$
This generalization of the autoregressive model in (4.1) is known as the vector autoregressive (VAR) model, and it will be covered in Chapter 6.
4.2 Stationarity and Weak Dependence

Recall from the likelihood theory that consistency of the maximum likelihood estimator (MLE) requires a law of large numbers (LLN), stating that the sample average of derivatives of the likelihood function (that are functions of the data) converges in probability. Likewise, the asymptotic normality of the estimator requires a central limit theorem (CLT), stating that the appropriately normalized sample average converges in distribution to a normal distribution.

In models for identically and independently distributed (i.i.d.) observations this is straightforward and the simplest versions of the LLN and CLT apply, see e.g. Wooldridge (2006, p. 774 ff.). In a time series setting, where the i.i.d. assumption is often not fulfilled, things are more complicated. More advanced versions of the LLN and the CLT exist, however, that allow the analysis of dependent observations. For the versions we consider here, two main assumptions are needed: The first important assumption is stationarity, which replaces the cross-sectional assumption of identical distributions. That assumption ensures that the observations originate from the same distribution. The second assumption is weak dependence, which replaces the assumption of independence, see the introduction in Chapter 1.
4.3 Moving Average Models

Possibly the simplest class of univariate time series models is the moving average (MA) model. To discuss the MA model we first define an i.i.d. error process:

Definition 4.1 (i.i.d. error process): A process $\{\varepsilon_t\}$ is called an i.i.d. error process if it has mean zero, $E(\varepsilon_t) = 0$, $\varepsilon_t$ and $\varepsilon_s$ are independent for all $t \neq s$, and all $\varepsilon_t$ have the same distribution. In most cases we require that $E(\varepsilon_t^2) = \sigma^2 < \infty$, and use the notation:
$$\varepsilon_t \text{ is i.i.d.}(0, \sigma^2). \tag{4.6}$$
$\{\varepsilon_t\}$ is often referred to as a white noise process.

In the time series literature, $\varepsilon_t$ is sometimes called an innovation or a shock. Recall that a (weakly) stationary stochastic process is characterized by constant mean, variance, and autocovariances (unconditionally), and the white noise process, $\varepsilon_t$, is obviously stationary.
4.3.1 Finite MA-Models

Next we define the moving average model of order $q$, MA($q$):

Definition 4.2 (moving average model): The moving average model of order $q$, MA($q$), is given by the equation
$$y_t = \delta + \varepsilon_t + \alpha_1 \varepsilon_{t-1} + \alpha_2 \varepsilon_{t-2} + \cdots + \alpha_q \varepsilon_{t-q}, \quad t = 1, 2, \ldots, T, \tag{4.7}$$
where $\varepsilon_t$ is i.i.d.$(0, \sigma^2)$.

The equation in (4.7) defines $y_t$ as a moving average of $q$ past shocks for $t = 1, 2, \ldots, T$. This means that we need $q$ initial values for the unobserved error process, and it is customary to assume that
$$\varepsilon_{-(q-1)} = \varepsilon_{-(q-2)} = \cdots = \varepsilon_{-1} = \varepsilon_0 = 0. \tag{4.8}$$
The specific model in equation (4.7) includes a constant term, but the deterministic specification could be made more general and the equation could include e.g. a linear trend or seasonal dummies. We note that $\alpha_i$ is the response on $y_t$ of an impulse to $\varepsilon_{t-i}$, and the sequence of $\alpha$'s is often referred to as the impulse response function.

The stochastic process $y_t$ can be characterized directly from (4.7). The unconditional expectation is given by
$$E(y_t) = E(\delta + \varepsilon_t + \alpha_1 \varepsilon_{t-1} + \alpha_2 \varepsilon_{t-2} + \cdots + \alpha_q \varepsilon_{t-q}) = \delta, \tag{4.9}$$
which is just the constant term in (4.7). The variance can be derived by inserting (4.7) in the definition,
$$\begin{aligned} \gamma_0 = V(y_t) &= E\left( (y_t - \delta)^2 \right) \\ &= E\left( (\varepsilon_t + \alpha_1 \varepsilon_{t-1} + \alpha_2 \varepsilon_{t-2} + \cdots + \alpha_q \varepsilon_{t-q})^2 \right) \\ &= E(\varepsilon_t^2) + \alpha_1^2 E(\varepsilon_{t-1}^2) + \alpha_2^2 E(\varepsilon_{t-2}^2) + \cdots + \alpha_q^2 E(\varepsilon_{t-q}^2) \\ &= \left( 1 + \alpha_1^2 + \alpha_2^2 + \cdots + \alpha_q^2 \right)\sigma^2. \end{aligned} \tag{4.10}$$
The third line in the derivation follows from the independent distribution of $\varepsilon_t$, such that all covariances are zero, $E(\varepsilon_t \varepsilon_{t-h}) = 0$ for $h \neq 0$. This implies that the variance of the sum is the sum of the variances. The autocovariances of $y_t$ can be found in a similar way, noting again that all cross terms are zero in expectation, i.e.
$$\begin{aligned} \gamma_1 = \text{cov}(y_t, y_{t-1}) &= E((y_t - \delta)(y_{t-1} - \delta)) \\ &= E\left( (\varepsilon_t + \alpha_1 \varepsilon_{t-1} + \cdots + \alpha_q \varepsilon_{t-q})(\varepsilon_{t-1} + \alpha_1 \varepsilon_{t-2} + \cdots + \alpha_q \varepsilon_{t-q-1}) \right) \\ &= (\alpha_1 + \alpha_2 \alpha_1 + \alpha_3 \alpha_2 + \cdots + \alpha_q \alpha_{q-1})\sigma^2, \\ \gamma_2 = \text{cov}(y_t, y_{t-2}) &= E((y_t - \delta)(y_{t-2} - \delta)) \\ &= E\left( (\varepsilon_t + \alpha_1 \varepsilon_{t-1} + \cdots + \alpha_q \varepsilon_{t-q})(\varepsilon_{t-2} + \alpha_1 \varepsilon_{t-3} + \cdots + \alpha_q \varepsilon_{t-q-2}) \right) \\ &= (\alpha_2 + \alpha_3 \alpha_1 + \alpha_4 \alpha_2 + \cdots + \alpha_q \alpha_{q-2})\sigma^2, \\ &\ \ \vdots \\ \gamma_q = \text{cov}(y_t, y_{t-q}) &= \alpha_q \sigma^2, \end{aligned}$$
while the autocovariances are zero for larger lag lengths, $\gamma_h = \text{cov}(y_t, y_{t-h}) = 0$ for $h > q$.

We note that the mean and variance are constant, and the autocovariance $\gamma_h$ depends on $h$ but not on $t$. By this we conclude that the MA($q$) process is stationary by construction. The intuition is that $y_t$ is a linear combination of stationary terms, and with constant weights the properties of $y_t$ are independent of $t$.

A common way to characterize the properties of the time series is to use the autocorrelation function, ACF. It follows from the covariances that the ACF is given by the sequence
$$\rho_h = \frac{\gamma_h}{\gamma_0} = \frac{\alpha_h + \alpha_{h+1}\alpha_1 + \alpha_{h+2}\alpha_2 + \cdots + \alpha_q \alpha_{q-h}}{1 + \alpha_1^2 + \alpha_2^2 + \cdots + \alpha_q^2}, \quad \text{for } h \leq q, \tag{4.11}$$
while $\rho_h = 0$ for $h > q$. We note that the MA($q$) process has a memory of $q$ periods.

To illustrate the appearance of MA processes, Figure 4.1 reports some simulated series and their theoretical and estimated autocorrelation functions. We note that the appearance of the MA($q$) process depends on the order of the process, $q$, and on the parameters, $\alpha_1, \ldots, \alpha_q$, but the shocks to the white noise process are in all cases recognizable, and the properties do not change fundamentally. Also note that the MA($q$) process has a memory of $q$ periods, in the sense that it takes $q$ time periods before the effect of a shock $\varepsilon_t$ has disappeared.

A memory of exactly $q$ periods may be difficult to rationalize in many economic settings, and the pure MA model is not used very often in econometric applications.
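Although the pure MA model is rarely used empirically, its ACF is easy to verify by simulation. A minimal sketch (not from the text; the parameter values are illustrative) for an MA(1), where (4.11) gives $\rho_1 = \alpha_1/(1 + \alpha_1^2)$ and $\rho_h = 0$ for $h > 1$:

```python
# Sketch: empirical vs. theoretical ACF of an MA(1) process (illustrative values).
import numpy as np

rng = np.random.default_rng(3)
alpha1, T = 0.8, 20_000
eps = rng.standard_normal(T + 1)
y = eps[1:] + alpha1 * eps[:-1]              # y_t = eps_t + alpha_1 * eps_{t-1}

def acf(x, h):
    x = x - x.mean()
    return np.sum(x[h:] * x[:-h]) / np.sum(x**2)

print(acf(y, 1), alpha1 / (1 + alpha1**2))   # both close to 0.49
print(acf(y, 2), acf(y, 3))                  # close to zero: memory of q = 1 period
```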
4.3.2 Infinite MA-Models

The arguments for the finite MA-process can be extended to the infinite moving average process, MA($\infty$). In this case, however, we have to ensure that the variance of $y_t$ is bounded,
$$V(y_t) = \left( 1 + \alpha_1^2 + \alpha_2^2 + \cdots \right)\sigma^2 < \infty, \tag{4.12}$$
which requires that $\sum_{j=0}^{\infty} \alpha_j^2 < \infty$ such that the infinite sum converges.

[Figure 4.1: Examples of simulated MA(q) processes: (A) y_t = ε_t; (B) ACF for (A); (C) y_t = ε_t + 0.80·ε_{t-1}; (D) ACF for (C); (E) y_t = ε_t − 0.80·ε_{t-1}; (F) ACF for (E); (G) y_t = ε_t + ε_{t-1} + 0.8·ε_{t-2} + 0.4·ε_{t-3} − 0.6·ε_{t-4} − ε_{t-5}; (H) ACF for (G). Black bars indicate the theoretical ACF while grey bars indicate the estimated ACF. Horizontal lines are the 95% confidence bounds for zero autocorrelations derived for an i.i.d. process.]
4.4 Autoregressive Models

Next, define the autoregressive model of order $p$, AR($p$):

Definition 4.3 (autoregressive model): The autoregressive model with $p$ lags is defined by the equation
$$y_t = \delta + \theta_1 y_{t-1} + \theta_2 y_{t-2} + \cdots + \theta_p y_{t-p} + \varepsilon_t, \quad t = 1, 2, \ldots, T, \tag{4.13}$$
where $\varepsilon_t$ is i.i.d.$(0, \sigma^2)$. We use again the convention that the equation in (4.13) holds for observations $y_1, y_2, \ldots, y_T$, which means that we have also observed the $p$ previous values, $y_{-(p-1)}, y_{-(p-2)}, \ldots, y_{-1}, y_0$; they are referred to as initial values for the equation.

The definition implies that we can think of the systematic part as the best linear prediction of $y_t$ given the past,
$$E(y_t \mid y_{t-1}, y_{t-2}, \ldots) = E(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_{t-p}) = \delta + \theta_1 y_{t-1} + \theta_2 y_{t-2} + \cdots + \theta_p y_{t-p}.$$
We note that the first $p$ lags capture all the information in the past, and that the conditional expectation is a linear function of the information set.

4.4.1 The AR(1) Model

The analysis of the autoregressive model is more complicated than the analysis of moving average models, and to simplify the analysis we focus on the autoregressive model with $p = 1$ lag, the so-called first order autoregressive, AR(1), model:
$$y_t = \delta + \theta y_{t-1} + \varepsilon_t. \tag{4.14}$$
In this case the first lag captures all the information in the past. Here we have included a constant term, but the model could easily be extended to more general deterministic specifications.

The only exogenous (or forcing) variable in (4.14) is the error term, and the development of the time series $y_t$ is determined solely by the sequence of innovations $\varepsilon_1, \ldots, \varepsilon_T$. To make this point explicit, we can find the solution for $y_t$ in terms of the
innovations and the initial value. To do this we recursively substitute the expressions for $y_{t-h}$, $h = 1, 2, \ldots$, to obtain the solution
$$\begin{aligned} y_t &= \delta + \theta y_{t-1} + \varepsilon_t \\ &= \delta + \theta(\delta + \theta y_{t-2} + \varepsilon_{t-1}) + \varepsilon_t \\ &= (1 + \theta)\delta + \varepsilon_t + \theta \varepsilon_{t-1} + \theta^2 y_{t-2} \\ &= (1 + \theta)\delta + \varepsilon_t + \theta \varepsilon_{t-1} + \theta^2 (\delta + \theta y_{t-3} + \varepsilon_{t-2}) \\ &= (1 + \theta + \theta^2)\delta + \varepsilon_t + \theta \varepsilon_{t-1} + \theta^2 \varepsilon_{t-2} + \theta^3 y_{t-3} \\ &\ \ \vdots \\ &= (1 + \theta + \theta^2 + \cdots + \theta^{t-1})\delta + \varepsilon_t + \theta \varepsilon_{t-1} + \theta^2 \varepsilon_{t-2} + \cdots + \theta^{t-1} \varepsilon_1 + \theta^t y_0. \end{aligned} \tag{4.15}$$
We see that $y_t$ is given by a deterministic term, a moving average of past innovations, and a term involving the initial value. Due to the moving average structure the solution is often referred to as the moving average representation of $y_t$.

If we for a moment make the abstract assumption that the process $y_t$ started in the remote infinite past, then we may state the solution as an infinite sum,
$$y_t = (1 + \theta + \theta^2 + \theta^3 + \cdots)\delta + \varepsilon_t + \theta \varepsilon_{t-1} + \theta^2 \varepsilon_{t-2} + \cdots, \tag{4.16}$$
which we recognize as an infinite moving average process. It follows from the result for infinite MA processes that the AR(1) process is stationary if the MA terms converge to zero such that the infinite sum converges. That is the case if $|\theta| < 1$, which is known as the stationarity condition for an AR(1) model. In the analysis of the AR(1) model below we assume stationarity and impose this condition. We emphasize that while a finite MA process is always stationary, stationarity of the AR process requires conditions on the parameters.

The properties of the time series $y_t$ can again be found directly from the moving average representation. The expectation of $y_t$ given the initial value is
$$E(y_t \mid y_0) = (1 + \theta + \theta^2 + \theta^3 + \cdots + \theta^{t-1})\delta + \theta^t y_0,$$
where the last term involving the initial value converges to zero as $t$ increases, $\theta^t y_0 \to 0$. The unconditional mean is the expectation of the convergent geometric series in (4.16), i.e.
$$E(y_t) = \frac{\delta}{1 - \theta} = \mu. \tag{4.17}$$
This is not the constant term of the model, $\delta$; it also depends on the autoregressive parameter, $\theta$. We hasten to note that the constant unconditional mean is not defined if $\theta = 1$, but that case is ruled out by the stationarity condition.
The unconditional variance can also be found from the solution in (4.16). Using the definition we obtain
$$\begin{aligned} \gamma_0 = V(y_t) &= E\left( (y_t - \mu)^2 \right) \\ &= E\left( (\varepsilon_t + \theta \varepsilon_{t-1} + \theta^2 \varepsilon_{t-2} + \theta^3 \varepsilon_{t-3} + \cdots)^2 \right) \\ &= \sigma^2 + \theta^2 \sigma^2 + \theta^4 \sigma^2 + \theta^6 \sigma^2 + \cdots \\ &= (1 + \theta^2 + \theta^4 + \theta^6 + \cdots)\sigma^2, \end{aligned}$$
which is again a convergent geometric series with the limit
$$\gamma_0 = \frac{\sigma^2}{1 - \theta^2}. \tag{4.18}$$
The autocovariances can also be found from (4.16):
$$\begin{aligned} \gamma_1 = \text{cov}(y_t, y_{t-1}) &= E((y_t - \mu)(y_{t-1} - \mu)) \\ &= E\left( (\varepsilon_t + \theta \varepsilon_{t-1} + \theta^2 \varepsilon_{t-2} + \cdots)(\varepsilon_{t-1} + \theta \varepsilon_{t-2} + \theta^2 \varepsilon_{t-3} + \cdots) \right) \\ &= \theta \sigma^2 + \theta^3 \sigma^2 + \theta^5 \sigma^2 + \theta^7 \sigma^2 + \cdots \\ &= \theta \gamma_0, \end{aligned}$$
where we again use that the covariances are zero: $E(\varepsilon_t \varepsilon_{t-h}) = 0$ for $h \neq 0$. Likewise it follows that $\gamma_h = \text{cov}(y_t, y_{t-h}) = \theta^h \gamma_0$. The autocorrelation function, ACF, of a stationary AR(1) is given by
$$\rho_h = \frac{\gamma_h}{\gamma_0} = \frac{\theta^h \gamma_0}{\gamma_0} = \theta^h, \tag{4.19}$$
which is an exponentially decreasing function if $|\theta| < 1$. It is a general result that the autocorrelation function goes exponentially to zero for a stationary autoregressive time series.

Graphically, the results imply that a stationary time series will fluctuate around a constant mean with a constant variance. Non-zero autocorrelations imply that consecutive observations are correlated and the fluctuations may be systematic, but over time the process will not deviate too much from the unconditional mean. This is often phrased as the process being mean reverting, and we also say that the process has an attractor, defined as a steady state level to which it will eventually return: in this case the unconditional mean, $\mu$.

Figure 4.2 shows examples of first order autoregressive processes. Note that the appearance depends more fundamentally on the autoregressive parameter. For the non-stationary case, $\theta = 1$, the process wanders arbitrarily up and down with no attractor; this is known as a random walk. If $|\theta| > 1$ the process for $y_t$ is explosive. We note that there may be marked differences between the true and estimated ACF in small samples.
[Figure 4.2: Examples of simulated AR(p) processes: (A) y_t = 0.5·y_{t-1} + ε_t; (B) ACF for (A); (C) y_t = 0.9·y_{t-1} + ε_t; (D) ACF for (C); (E) y_t = −0.9·y_{t-1} + ε_t; (F) ACF for (E); (G) y_t = y_{t-1} + ε_t; (H) y_t = 1.05·y_{t-1} + ε_t. Black bars indicate the theoretical ACF while grey bars indicate the estimated ACF. Horizontal lines are the 95% confidence bounds for zero autocorrelations derived for an i.i.d. process.]
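The exponential decay of the AR(1) autocorrelations in (4.19) is likewise easy to check by simulation; a minimal sketch (not from the text; the values are illustrative):

```python
# Sketch: the ACF of a stationary AR(1) decays as rho_h = theta^h (illustrative).
import numpy as np

rng = np.random.default_rng(5)
theta, T = 0.9, 20_000
y = np.zeros(T)
for t in range(1, T):
    y[t] = theta * y[t - 1] + rng.standard_normal()

def acf(x, h):
    x = x - x.mean()
    return np.sum(x[h:] * x[:-h]) / np.sum(x**2)

print([round(acf(y, h), 2) for h in (1, 2, 5)])   # roughly 0.90, 0.81, 0.59
print([round(theta**h, 2) for h in (1, 2, 5)])
```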
4.4.2 Lag Polynomials and Characteristic Roots


A useful tool in time series analysis is the lag-operator, L, that has the property that it lags a variable one period, i.e.

Ly_t = y_{t-1} and L²y_t = y_{t-2}.

The lag-operator is related to the well-known first-difference operator, Δ = 1 − L, and

Δy_t = (1 − L)y_t = y_t − y_{t-1}.

Using the lag-operator we can write the AR(1) model as

y_t − θy_{t-1} = δ + ε_t
(1 − θL)y_t = δ + ε_t,

where θ(z) = 1 − θz is a (first order) polynomial in z ∈ R. The characteristic equation is then defined as the polynomial equation,

θ(z) = 1 − θz = 0,    (4.20)

and the solution, z₁, is denoted the characteristic root,

z₁ = θ^{-1},

which is just the inverse coefficient.


The usual results for convergent geometric series also hold for expressions involving the lag-operator, and if |θ| < 1 it holds that

1 + θL + θ²L² + θ³L³ + … → 1/(1 − θL) = θ^{-1}(L),    (4.21)

where the right hand side is called the inverse polynomial. The inverse polynomial θ^{-1}(L) is infinite and it exists if the terms on the left hand side converge to zero, i.e. if |θ| < 1.

Using the inverse polynomial gives an alternative to recursive substitution. For the AR(1) model we may write

θ(L)y_t = δ + ε_t,

and if |θ| < 1,

y_t = θ^{-1}(L)(δ + ε_t)
    = (1 + θL + θ²L² + θ³L³ + …)(δ + ε_t)
    = (1 + θ + θ² + θ³ + …)δ + ε_t + θε_{t-1} + θ²ε_{t-2} + …,
where we have used that δ is constant and Lδ = δ. This result is identical to (4.16).
Using formulations in terms of lag-polynomials and roots, we may say that the stationarity condition for the AR(1) model is that the inverse characteristic root, φ₁ = z₁^{-1}, is smaller than unity in absolute value, or that the characteristic polynomial, θ(z), can be inverted.

4.4.3 The AR(p) Model


The results for a general AR(p) model in (4.13) are most easily presented in terms of the lag-polynomial. In particular, the AR(p) model

y_t = δ + θ₁y_{t-1} + θ₂y_{t-2} + … + θ_p y_{t-p} + ε_t,

can be written as

y_t − θ₁y_{t-1} − θ₂y_{t-2} − … − θ_p y_{t-p} = δ + ε_t
θ(L)y_t = δ + ε_t,    (4.22)

where

θ(z) = 1 − θ₁z − θ₂z² − … − θ_p z^p, z ∈ C,

is the autoregressive polynomial. The polynomial for the AR(p) model is of degree p and has p roots, and for p ≥ 2 the roots may be complex numbers of the form, for j = 1, 2, …, p,

z_j = r_j ± c_j·i ∈ C, with r_j, c_j ∈ R.    (4.23)

Here i is the unit imaginary number with the property i² = −1 (or i = √−1) and C denotes the set of complex numbers.

Remark 4.1 (complex numbers): For the complex number, z ∈ C, as given by

z = r ± ci, i = √−1,

the number r ∈ R is called the real part of z and c ∈ R is called the imaginary part, and the number is typically illustrated as a two-dimensional vector, with the real part as the first component and the imaginary part as the second component, (r, c)′. This coordinate system is referred to as the complex plane. The modulus of the complex number is the length (the Euclidean norm) of the vector and is given by

‖z‖ = ‖r + ci‖ = √(r² + c²),    (4.24)

which plays an important role in the analysis below.


Calculation with complex numbers may be unfamiliar but the rules imply that for two complex numbers z₁ = r₁ + c₁i and z₂ = r₂ + c₂i, addition is given by

z₁ + z₂ = (r₁ + c₁i) + (r₂ + c₂i) = (r₁ + r₂) + (c₁ + c₂)i,    (4.25)

while multiplication gives

z₁·z₂ = (r₁ + c₁i)·(r₂ + c₂i) = (r₁r₂ − c₁c₂) + (r₁c₂ + r₂c₁)i.    (4.26)

We will not use these results further, and we will not do calculations involving complex numbers by hand, but we frequently use the inverse of the complex root, z = r + ci ≠ 0, defined by

z^{-1} = 1/z = 1/(r + ci) = r/(r² + c²) − c/(r² + c²)·i.    (4.27)

Often the inverse root, φ = z^{-1}, is compared with the complex unit circle, i.e. a circle in the complex plane with radius equal to one.

For the AR(p) model, the polynomial θ(z) is of degree p, and if we evaluate the polynomial in z = 1 we get

θ(1) = 1 − θ₁ − θ₂ − … − θ_p,

which is one minus the sum of the coefficients. Based on the roots we can factorize the polynomial as

θ(z) = 1 − θ₁z − θ₂z² − … − θ_p z^p = (1 − φ₁z)(1 − φ₂z)⋯(1 − φ_p z),    (4.28)

where φ_j = z_j^{-1} denotes an inverse root. If an inverse root is smaller than unity, ‖φ_j‖ < 1, it holds that the corresponding factor can be inverted as in (4.21):

1/(1 − φ_j z) = 1 + φ_j z + φ_j²z² + φ_j³z³ + …    (4.29)

Based on the factorization in (4.28), it therefore holds that the characteristic polynomial for the AR(p) model, θ(z), is invertible if each of the factors are invertible, i.e. if all the inverse roots are smaller than one in absolute value,

‖φ_j‖ < 1, j = 1, 2, …, p,

which is the stationarity condition for the AR(p). The inverse polynomial is the product of the inverse factors and it has infinitely many terms,

θ^{-1}(z) = 1/(1 − φ₁z) · 1/(1 − φ₂z) ⋯ 1/(1 − φ_p z)
          = (1 + φ₁z + φ₁²z² + φ₁³z³ + …)⋯(1 + φ_p z + φ_p²z² + φ_p³z³ + …)
          = 1 + c₁z + c₂z² + c₃z³ + c₄z⁴ + …,    (4.30)
where the coefficients can be found as complicated functions of the autoregressive parameters:

c₁ = θ₁
c₂ = c₁θ₁ + θ₂
c₃ = c₂θ₁ + c₁θ₂ + θ₃
c₄ = c₃θ₁ + c₂θ₂ + c₁θ₃ + θ₄,

etc., where we insert θ_h = 0 for h > p. A brief derivation of the coefficients is given in Appendix §4.A.
The inverse polynomial in (4.30) shows that the AR(p) model can be written as an infinite moving average model, MA(∞), i.e.

y_t = θ^{-1}(L)(δ + ε_t)
    = (1 + c₁L + c₂L² + c₃L³ + c₄L⁴ + …)(δ + ε_t)
    = (1 + c₁ + c₂ + c₃ + c₄ + …)δ + ε_t + c₁ε_{t-1} + c₂ε_{t-2} + c₃ε_{t-3} + …,    (4.31)

where we again have used that Lδ = δ. Based on the MA(∞) representation we note that the AR(p) model is stationary if the moving average coefficients converge to zero. We can recap the stationarity condition for an AR(p) model as follows:

Theorem 4.1 (stationarity of the ar(p)): The AR(p) model in (4.13) is stationary and weakly dependent if and only if the autoregressive polynomial,

θ(z) = 1 − θ₁z − θ₂z² − … − θ_p z^p, z ∈ C,

is invertible, i.e. that the p roots of the autoregressive polynomial are larger than one in absolute value, ‖z_j‖ > 1, or that the inverse roots are smaller than one, ‖φ_j‖ < 1, j = 1, 2, …, p. For the AR(1) this holds if −1 < θ < 1. The model has a unit root if

θ(1) = 1 − Σ_{i=1}^p θ_i = 0.

For the AR(1) this holds if θ = 1.

The moving average coefficients measure the dynamic impact of a shock to the process,

∂y_t/∂ε_t = 1, ∂y_t/∂ε_{t-1} = c₁, ∂y_t/∂ε_{t-2} = c₂, …,

and the sequence of MA-coefficients, c₁, c₂, c₃, …, is also known as the impulse-responses of the process. Note that the stationarity condition implies that the impulse response function dies out eventually, and the process is weakly dependent.
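As an illustration, the inverse roots and the implied impulse responses are easy to compute numerically. The sketch below (Python, with arbitrarily chosen AR(2) coefficients) factorizes the autoregressive polynomial with numpy and builds the MA coefficients from the recursion for c₁, c₂, … given above:

```python
import numpy as np

theta = np.array([1.2, -0.4])    # arbitrary AR(2): theta_1, theta_2
p = len(theta)

# Roots of theta(z) = 1 - theta_1*z - theta_2*z^2; numpy wants the
# coefficients ordered from the highest power of z down to the constant.
roots = np.roots(np.array([-theta[1], -theta[0], 1.0]))
print("inverse roots:", 1 / roots, " moduli:", np.abs(1 / roots))

# Impulse responses from the recursion c_j = c_{j-1}*theta_1 + c_{j-2}*theta_2 + ...,
# with c_0 = 1 and theta_h = 0 inserted for h > p.
c = [1.0]
for j in range(1, 21):
    c.append(sum(c[j - 1 - i] * theta[i] for i in range(min(j, p))))
print("first MA coefficients:", np.round(c[:6], 3))
```

For this parametrization the inverse roots are complex with modulus below one, so the process is stationary and the impulse responses die out.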
Again we can find the properties of the process from the MA-representation. As an example, the constant mean is given by

μ = E(y_t) = (1 + c₁ + c₂ + c₃ + c₄ + …)δ = δ/θ(1) = δ/(1 − θ₁ − θ₂ − … − θ_p).

For this to be defined we require that z = 1 is not a root of the autoregressive polynomial, but that is ensured by the stationarity condition. The variance is given by

γ₀ = V(y_t) = (1 + c₁² + c₂² + c₃² + c₄² + …)σ².

4.4.4 Autocorrelations and the Yule-Walker Equations


The presented approach for calculation of autocovariances and autocorrelations is totally general, but it is sometimes difficult to apply by hand because it requires that the MA-representation is derived. An alternative way to calculate autocorrelations is based on the so-called Yule-Walker equations, which are obtained directly from the model equation. To illustrate, consider the AR(2) model given by

y_t = δ + θ₁y_{t-1} + θ₂y_{t-2} + ε_t.    (4.32)

First we find the mean by taking expectations,

E(y_t) = δ + θ₁E(y_{t-1}) + θ₂E(y_{t-2}) + E(ε_t).

Assuming stationarity, E(y_t) = E(y_{t-1}), it follows that

μ = E(y_t) = δ/(1 − θ₁ − θ₂).

Next we define a new process as the deviation from the mean, ỹ_t = y_t − μ, so that

ỹ_t = θ₁ỹ_{t-1} + θ₂ỹ_{t-2} + ε_t.    (4.33)

Now remember that V(y_t) = E[(y_t − μ)²] = E(ỹ_t²) = V(ỹ_t), and we do all calculations for (4.33) rather than (4.32). If we multiply both sides of (4.33) with ỹ_t and take expectations, we find

E(ỹ_t²) = θ₁E(ỹ_{t-1}ỹ_t) + θ₂E(ỹ_{t-2}ỹ_t) + E(ε_t ỹ_t)
γ₀ = θ₁γ₁ + θ₂γ₂ + σ²,    (4.34)

where we have used the definitions and that E(ε_t ỹ_t) = E(ε_t(θ₁ỹ_{t-1} + θ₂ỹ_{t-2} + ε_t)) = σ². If we multiply instead with ỹ_{t-1}, ỹ_{t-2}, and ỹ_{t-3}, we obtain

E(ỹ_t ỹ_{t-1}) = θ₁E(ỹ_{t-1}ỹ_{t-1}) + θ₂E(ỹ_{t-2}ỹ_{t-1}) + E(ε_t ỹ_{t-1})
γ₁ = θ₁γ₀ + θ₂γ₁    (4.35)
and

E(ỹ_t ỹ_{t-2}) = θ₁E(ỹ_{t-1}ỹ_{t-2}) + θ₂E(ỹ_{t-2}ỹ_{t-2}) + E(ε_t ỹ_{t-2})
γ₂ = θ₁γ₁ + θ₂γ₀    (4.36)

and, finally,

E(ỹ_t ỹ_{t-3}) = θ₁E(ỹ_{t-1}ỹ_{t-3}) + θ₂E(ỹ_{t-2}ỹ_{t-3}) + E(ε_t ỹ_{t-3})
γ₃ = θ₁γ₂ + θ₂γ₁.    (4.37)

The set of equations (4.34)-(4.37) is known as the Yule-Walker equations. To find the variance we can substitute γ₁ and γ₂ into (4.34) and solve. This is a bit tedious, however, and will not be done here. To find the autocorrelations, ρ_h = γ_h/γ₀, just divide the Yule-Walker equations with γ₀ to obtain

ρ₁ = θ₁ + θ₂ρ₁
ρ₂ = θ₁ρ₁ + θ₂
⋮
ρ_h = θ₁ρ_{h-1} + θ₂ρ_{h-2} for h ≥ 3.

By collecting terms, we get the autocorrelation function:

ρ₁ = θ₁/(1 − θ₂)
ρ₂ = θ₁²/(1 − θ₂) + θ₂
⋮
ρ_h = θ₁ρ_{h-1} + θ₂ρ_{h-2} for h ≥ 3.
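A minimal sketch of this recursion in Python (for an arbitrary stationary AR(2) parametrization, chosen purely for illustration):

```python
import numpy as np

theta1, theta2 = 1.2, -0.4    # arbitrary stationary AR(2) parameters

# Autocorrelations from the Yule-Walker recursion
rho = np.empty(21)
rho[0] = 1.0
rho[1] = theta1 / (1 - theta2)
for h in range(2, 21):
    rho[h] = theta1 * rho[h - 1] + theta2 * rho[h - 2]

print(np.round(rho[:8], 3))
```

Note that with ρ₀ = 1 the recursion reproduces the closed-form expression for ρ₂ given above.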

4.5 ARMA and ARIMA Models


At this point it is worth emphasizing the duality between AR and MA models. We have seen that a stationary AR(p) model can always be represented by an MA(∞) model because the autoregressive polynomial θ(z) can be inverted. We may also write the MA model using a lag polynomial,

y_t = α(L)ε_t,

where α(z) = 1 + α₁z + … + α_q z^q is a polynomial. If α(z) is invertible (i.e. if all the inverse roots of α(z) = 0 are smaller than one), then the MA(q) model can also
be represented by an AR(∞) model. As a consequence we may approximate an MA model with an autoregression with many lags; or we may alternatively represent a long autoregression with a shorter MA model.
The AR(p) and MA(q) models can also be combined into a so-called autoregressive
moving average, ARMA(p,q), model:

Definition 4.4 (arma model): The class of autoregressive moving average models, ARMA(p,q), is defined by

y_t = θ₁y_{t-1} + … + θ_p y_{t-p} + ε_t + α₁ε_{t-1} + … + α_q ε_{t-q},    (4.38)

for t = 1, 2, …, T, with ε_t being i.i.d.(0, σ²) and conditional on initial values.

The ARMA model is a very flexible class of models that is capable of representing many different patterns of autocovariances. Again we can write the model in terms of lag-polynomials as

θ(L)y_t = α(L)ε_t,    (4.39)

and there may exist both an AR-representation,

α(L)^{-1}θ(L)y_t = ε_t,

and an MA-representation,

y_t = θ(L)^{-1}α(L)ε_t,

both with infinitely many terms. In this way we may think of the ARMA model as a parsimonious representation of a given autocovariance structure, i.e. the representation that uses as few parameters as possible.

Remark 4.2 (explanatory variables in ARMA models): Logically, there are two different ways to include deterministic terms and explanatory variables in the ARMA model (4.39), here written in terms of variables x_t and z_t with coefficients β and γ:

θ(L)(y_t + βx_t) = α(L)(ε_t + γz_t).    (4.40)

The variables in x_t are called additive, because they are added directly to y_t without a dynamic propagation. The variables in z_t, on the other hand, enter in the same way as the model innovations, and are therefore propagated in the same way.
As an example, consider a dummy variable D_t = I(t = T₀) taking the value one at time T₀. If this is included as additive (x_t), the interpretation is that the observation for y_{T₀} is wrong and should be replaced by y_{T₀} + β. If it is included as an innovational dummy (z_t), the interpretation is that there was a particularly large shock at time T₀ and this shock, γD_t, is propagated through the system in the same way as the normal innovations, ε_t, i.e. via the moving average structure.
For y_t to be stationary it is required that the inverse roots of the characteristic equation are all smaller than one. If there is a root at unity, a so-called unit root φ₁ = 1, then the factorized polynomial can be written as

θ(z) = (1 − z)(1 − φ₂z)⋯(1 − φ_p z).    (4.41)

Noting that Δ = 1 − L is the first difference operator, the model can be written in terms of the first differenced variable

θ*(L)Δy_t = α(L)ε_t,    (4.42)

where the polynomial θ*(z) is defined as the last p − 1 factors in (4.41). The general model in (4.42) is referred to as an integrated ARMA model or an ARIMA(p,d,q), with p stationary autoregressive roots, d first differences, and q MA terms.
We return to the properties of unit root processes and the testing for unit roots
later in the course.

Example 4.1 (danish house prices): To illustrate the construction of ARIMA models we consider the real Danish house prices, 1972:1-2004:2, defined as the log of the house price index divided with the consumer price index. Estimating a second order autoregressive model yields

p_t = 0.0034 + 1.545·p_{t-1} − 0.565·p_{t-2} + ε̂_t,

with the autoregressive polynomial given by

θ(z) = 1 − 1.545·z + 0.565·z².

The p = 2 inverse roots of the polynomial are given by φ₁ = 0.953 and φ₂ = 0.592 and we can factorize the polynomial as

θ(z) = 1 − 1.545·z + 0.565·z² = (1 − 0.953·z)(1 − 0.592·z).

We do not want to test for unit roots at this point, and we assume without testing that the first root is unity, using (1 − 0.953·L) ≈ (1 − L) = Δ. Estimating an AR(1) model for Δp_t, which is the same as an ARIMA(1,1,0) model for p_t, we get the following results

Δp_t = 0.0008369 + 0.544·Δp_{t-1},

where the second root is basically unchanged.
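The factorization in the example is easily verified numerically; a small sketch (Python, using numpy) under the estimates above:

```python
import numpy as np

# theta(z) = 1 - 1.545*z + 0.565*z^2, coefficients from highest power down
roots = np.roots([0.565, -1.545, 1.0])
print("roots:", roots)               # approx. 1.68 and 1.05
print("inverse roots:", 1 / roots)   # approx. 0.592 and 0.953
```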

4.6 Estimation and Model Selection


If we are willing to assume a specific distributional form for the error process, ε_t, then it is natural to estimate the parameters using maximum likelihood. The most popular assumption in macro-econometrics is the assumption of Gaussian errors, and the AR(p) model may be written as

y_t = δ + θ₁y_{t-1} + θ₂y_{t-2} + … + θ_p y_{t-p} + ε_t, ε_t ~ N(0, σ²), t = 1, 2, …, T.

In this case the conditional distribution of y_t given the past is normally distributed, and the log-likelihood function has the well-known form

log L(δ, θ₁, …, θ_p, σ²) = −(T/2)·log(2πσ²) − Σ_{t=1}^T ε_t²/(2σ²),    (4.43)

where

ε_t = y_t − δ − θ₁y_{t-1} − θ₂y_{t-2} − … − θ_p y_{t-p},

see also the discussion in Chapter 2. Observe that the autoregressive model is in fact a linear regression model and the ML estimator coincides with the OLS estimator for this case.
For the moving average model the idea is the same, but it is more complicated to express the likelihood function in terms of observed data. Consider for illustration the MA(1) model given by

y_t = δ + ε_t + αε_{t-1}.

To express the sequence of error terms, ε₁, …, ε_T, as a function of the observed data, y₁, …, y_T, we solve recursively for the error terms

ε₁ = y₁ − δ
ε₂ = y₂ − δ − αε₁ = y₂ − δ − αy₁ + αδ
ε₃ = y₃ − δ − αε₂ = y₃ − δ − αy₂ + α²y₁ + αδ − α²δ
ε₄ = y₄ − δ − αε₃ = y₄ − δ − αy₃ + α²y₂ − α³y₁ + αδ − α²δ + α³δ
⋮

where we have assumed a zero initial value, ε₀ = 0. The resulting log-likelihood is a complicated non-linear function of the parameters (δ, α, σ²)′ but it can be maximized using numerical algorithms to produce the ML estimates.
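As an illustration, a minimal sketch of this conditional (on ε₀ = 0) Gaussian log-likelihood for the MA(1) in Python could look as follows; in practice one would rely on a packaged estimator rather than this bare-bones version, and the simulated data and starting values are arbitrary:

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, y):
    """Negative Gaussian log-likelihood of an MA(1), conditional on eps_0 = 0."""
    delta, alpha, sigma2 = params
    if sigma2 <= 0:
        return np.inf
    eps_lag = 0.0                      # assumed zero initial value
    eps = np.empty_like(y)
    for t in range(len(y)):            # recursive computation of eps_t
        eps[t] = y[t] - delta - alpha * eps_lag
        eps_lag = eps[t]
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + np.sum(eps ** 2) / (2 * sigma2)

# Simulate an MA(1) and maximize the likelihood numerically
rng = np.random.default_rng(0)
e = rng.standard_normal(500)
y = 0.5 + e[1:] + 0.4 * e[:-1]         # delta = 0.5, alpha = 0.4

res = minimize(neg_loglik, x0=[0.0, 0.0, 1.0], args=(y,), method="Nelder-Mead")
print(res.x)                           # estimates of (delta, alpha, sigma2)
```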

Remark 4.3 (choice of distribution): Other distributional assumptions than the normal can also be used, and in financial econometrics it is often preferred to rely on error distributions with more probability mass in the tails of the distribution, i.e. with a higher probability of extreme observations. A popular choice is the Student t(v) distribution, where v is the number of degrees of freedom. Low degrees of freedom give a heavy-tailed distribution, and for v → ∞ the t(v) distribution approaches the Gaussian distribution. In a likelihood analysis we often treat v as a parameter and estimate the degrees of freedom in the t distribution.

4.6.1 Model Selection


In empirical applications it is necessary to choose the lag orders, p and q, for the ARMA model. If we have a list of potential models, e.g.

ARMA(1,1): y_t − θy_{t-1} = δ + ε_t + αε_{t-1}
AR(1): y_t − θy_{t-1} = δ + ε_t
MA(1): y_t = δ + ε_t + αε_{t-1},
then we should find a way to choose the most relevant one for a given data set. This is known as model selection in the literature. There are two different approaches: general-to-specific (GETS) testing and model selection based on information criteria.
If the models are nested, i.e. if one model is a special case of a larger model, then we may use standard likelihood ratio (LR) testing to evaluate the reduction from the large to the small model. In the above example, it holds that the AR(1) model is nested within the ARMA(1,1), written as AR(1) ⊂ ARMA(1,1), where the reduction imposes the restriction α = 0. This restriction can be tested by a LR test, and we may reject or accept the reduction to an AR(1) model. Likewise it holds that MA(1) ⊂ ARMA(1,1) and we may test the restriction θ = 0 using a LR test to judge the reduction from the ARMA(1,1) to the MA(1).
If both reductions are accepted, however, it is not clear how to choose between the AR(1) model and the MA(1) model. The models are not nested (you cannot impose a restriction on one model to get the other) and standard test theory will not apply. In practice we can estimate both models and compare the fit of the models and the outcome of misspecification testing, but a formal approach based on hypothesis testing is rather complicated.

Remark 4.4 (cancelling roots): In pure AR(p) models and in regression models, the suggested starting point for the GETS search is typically a quite large model with many lags, to ensure that the initial model includes the DGP. That is often not a good idea for GETS in ARMA models due to the problem of cancelling roots.
To understand the problem, consider an i.i.d. time series given by

y_t = ε_t, t = 1, 2, …, T.    (4.44)
Now imagine an autoregressive coefficient, θ, with |θ| < 1, and an MA coefficient, α = −θ, and the ARMA model

y_t = θy_{t-1} + ε_t + αε_{t-1},

that we write as

(1 − θL)y_t = (1 + αL)ε_t.    (4.45)

As α = −θ the two terms in parentheses cancel, and (4.45) is just an overly complicated way of representing the i.i.d. process, and the parameters are not identified. If we begin with an ARMA(p,q) model, with p and q large, there may be a large chance of pairs of cancelling roots. Instead it is suggested to start with a small or moderate model, e.g. ARMA(2,2), ARMA(3,1) or ARMA(4,1), and use autocorrelation tests to check if more lags seem to be needed. In any case, it is important to look at likelihood values for different models and not only t-statistics, as the latter may not be very reliable for models with poorly identified coefficients.

Instead of hypothesis testing as a way of finding the best model, an alternative approach is called model selection, and it is valid also for non-nested models with the same regressand. Recall that the more parameters we allow in a model, the smaller is the residual variance and the higher is the likelihood. To obtain a parsimonious model we therefore want to balance the model fit against the complexity of the model. This balance can be measured by a so-called information criterion that takes the log-likelihood and subtracts a penalty for the number of parameters, i.e.

IC = log σ̂² + penalty(T, #parameters).

A small value indicates a more favorable trade-off, and model selection could be based on minimizing the information criteria. Different criteria have been proposed based on different penalty functions. Three important examples are the Akaike, the Hannan-Quinn, and Schwarz' Bayesian criteria, defined as, respectively,

AIC = log σ̂² + 2k/T    (4.46)
HQ = log σ̂² + 2k·log(log(T))/T    (4.47)
BIC = log σ̂² + k·log(T)/T,    (4.48)
where k is the number of estimated parameters, e.g. k = p + q + 1. The idea of the model selection is to choose the model with the smallest information criterion, i.e. the best combination of fit and parsimony. Unlike hypothesis testing, which by construction favours the null hypothesis in the sense that it is only rejected if there
is enough evidence, information criteria treat all models equally–simply choosing the model with the best tradeoff between fit and complexity.
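As a sketch of how such a comparison can be carried out in practice, the Python snippet below (using the statsmodels ARIMA estimator on simulated data, purely as an illustration) fits a few candidate orders and reports the criteria as computed by the package; note that packages typically report AIC/BIC on the −2·logL + penalty scale, which for Gaussian likelihoods ranks models in the same way as (4.46)-(4.48):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulate an AR(1) series as a stand-in for real data
rng = np.random.default_rng(1)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.7 * y[t - 1] + rng.standard_normal()

# Fit a few candidate orders and compare fit and information criteria
for order in [(2, 0, 2), (2, 0, 0), (1, 0, 1), (1, 0, 0), (0, 0, 1)]:
    res = ARIMA(y, order=order, trend="c").fit()
    print(order, "logL=%8.2f  AIC=%8.2f  BIC=%8.2f" % (res.llf, res.aic, res.bic))
```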
It is worth emphasizing that different information criteria will not necessarily give the same preferred model, and the model selection may not agree with the GETS testing. In practice it is therefore often difficult to make firm choices, and with several candidate models it is a sound principle to ensure that the conclusions you draw from an analysis are robust to the chosen model.
A final and less formal approach to identification of p and q in the ARMA(p,q) model is based directly on the shape of the autocorrelation function. Recall that the autocorrelation function is defined as

ACF(h) = corr(y_t, y_{t-h}).

Similarly we may define the partial autocorrelation function, PACF, as the autocorrelation conditional on intermediate lags, i.e.

PACF(h) = corr(y_t, y_{t-h} | y_{t-1}, y_{t-2}, …, y_{t-h+1}).

It follows directly from the definition that an AR(p) model will have p significant partial autocorrelations and the PACF is zero for lags h > p; at the same time it holds that the ACF exhibits an exponential decay. For the MA(q) model we observe the reverse pattern: We know that the first q entries in the ACF are non-zero while the ACF is zero for h > q; and if we write the MA model as an AR(∞) we expect the PACF to decay exponentially. The AR and MA models are therefore mirror images, and by looking at the ACF and PACF we could get an idea on the appropriate values for p and q. This methodology is known as the Box-Jenkins identification procedure and it was very popular in times when ARMA models were hard to estimate. With today's computers it is probably easier to test formally on the parameters than informally on their implications in terms of autocorrelation patterns.

Example 4.2 (danish consumption-income ratio): To illustrate model selection we consider the log of the Danish quarterly consumption-to-income ratio, 1971:1-2003:2. This is also the inverse savings rate. The time series is illustrated in Figure 4.3 (A), and the autocorrelation functions are given in (B). The PACF suggests that the first autoregressive coefficient is strongly significant while the second is more borderline. Due to the exponential decay of the ACF, implied by the autoregressive coefficient, it is hard to assess the presence of MA terms; this is often the case with the Box-Jenkins identification. We therefore estimate an ARMA(2,2),

y_t − θ₁y_{t-1} − θ₂y_{t-2} = δ + ε_t + α₁ε_{t-1} + α₂ε_{t-2},

and all sub-models obtained by imposing restrictions on the parameters (θ₁, θ₂, α₁, α₂)′. Estimation results and summary statistics for the models are given in Table 4.1. All
[Figure 4.3 about here. Panels: (A) Consumption-income ratio; (B) autocorrelation functions (ACF and PACF) for (A); (C) forecast from an AR(2); (D) AR(2) and ARMA(1,1) forecasts compared with actual values.]
Figure 4.3: Example based on Danish consumption-income ratio.

three information criteria are minimized for the purely autoregressive AR(2) model, but the value for the mixed ARMA(1,1) is by and large identical. The reductions from the ARMA(2,2) to these two models are easily accepted by LR tests. Both models seem to give a good description of the covariance structure of the data, and based on the output from Table 4.1 it is hard to make a firm choice.

4.7 Univariate Forecasting


It is straightforward to forecast with univariate ARMA models, and the simple struc-
ture allows you to produce forecasts based solely on the past of the process. The
obtained forecasts are plain extrapolations from the systematic part of the process
and they will contain no economic insight. In particular it is very rarely possible to
predict turning points, i.e. business cycle changes, based on a single time series. The
forecast may nevertheless be helpful in analyzing the direction of future movement in
a time series, all other things equal.
ARMA(p,q)       (2,2)     (2,1)     (2,0)     (1,2)     (1,1)     (1,0)     (0,2)     (0,1)     (0,0)
θ₁              1.418     0.573     0.536     0.833     0.857     0.715       ...       ...       ...
               (4.28)    (1.70)    (6.36)    (12.2)    (15.2)    (11.7)
θ₂             −0.516     0.224     0.251       ...       ...       ...       ...       ...       ...
              (−1.82)    (0.89)    (2.95)
α₁             −0.899    −0.040       ...    −0.304    −0.301       ...     0.577     0.487       ...
              (−2.80)   (−0.11)             (−2.78)   (−3.06)              (6.84)    (8.36)
α₂              0.308       ...       ...     0.085       ...       ...     0.397       ...       ...
               (2.13)                        (0.95)                        (5.75)
μ              −0.094    −0.094    −0.094    −0.094    −0.094    −0.094    −0.094    −0.094    −0.095
              (−11.1)   (−9.77)   (−9.87)   (−9.87)   (−9.44)   (−12.6)   (−52.3)   (−24.9)   (−30.4)
log-likelihood 300.822  300.395   300.389   300.428   299.993   296.174   287.961   274.720   249.826
BIC            −4.441    −4.472    −4.509    −4.472    −4.503    −4.482    −4.318    −4.152    −3.806
HQ             −4.506    −4.524    −4.548    −4.525    −4.542    −4.508    −4.357    −4.178    −3.819
AIC            −4.551    −4.560    −4.575    −4.560    −4.569    −4.526    −4.384    −4.196    −3.828
Normality      [0.70]    [0.83]    [0.84]    [0.81]    [0.84]    [0.80]    [0.44]    [0.12]    [0.16]
No-autocorr.   [0.45]    [0.64]    [0.72]    [0.63]    [0.66]    [0.20]    [0.00]    [0.00]    [0.00]

Table 4.1: Estimation results for the ARMA(2,2) and sub-models. Figures in parentheses are t-ratios. Figures in square brackets are p-values for misspecification tests. Estimation is done using the ARFIMA package in PcGive, which uses a slightly more complicated treatment of initial values than that presented in the present text.
The object of interest is a prediction of y_{T+h} given the information up to time T. Formally we define the information set available at time T as I_T = {y₁, …, y_{T-1}, y_T}, and we define the optimal predictor as the conditional expectation

y_{T+h|T} = E(y_{T+h} | I_T).    (4.49)

To illustrate the idea we consider the case of an ARMA(1,1) model,

y_t = δ + θy_{t-1} + ε_t + αε_{t-1},

for t = 1, 2, …, T. To forecast the next observation, y_{T+1}, we write the equation

y_{T+1} = δ + θy_T + ε_{T+1} + αε_T,

and the best prediction is the conditional expectation of the right-hand-side. We note that y_T and ε_T = y_T − δ − θy_{T-1} − αε_{T-1} are in the information set at time T, while the best prediction of future shocks is zero, E(ε_{T+h} | I_T) = 0 for h > 0. We find the predictions

y_{T+1|T} = E[δ + θy_T + ε_{T+1} + αε_T | I_T] = δ + θy_T + αε_T
y_{T+2|T} = E[δ + θy_{T+1} + ε_{T+2} + αε_{T+1} | I_T] = δ + θE[y_{T+1} | I_T] = δ + θy_{T+1|T}
y_{T+3|T} = E[δ + θy_{T+2} + ε_{T+3} + αε_{T+2} | I_T] = δ + θE[y_{T+2} | I_T] = δ + θy_{T+2|T}
⋮

We note that the error term, ε_T, affects the first period forecast due to the MA(1) structure, and after that the first order autoregressive process takes over and the forecasts will converge exponentially towards the unconditional expectation, μ.
In practice we replace the true parameters with the estimators, δ̂, θ̂, and α̂, and the true errors with the estimated residuals, ε̂₁, …, ε̂_T, which produces the feasible forecasts, ŷ_{T+h|T}.
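A minimal sketch of this forecast recursion in Python (with made-up parameter values standing in for the estimates) could be:

```python
import numpy as np

delta, theta, alpha = 0.1, 0.8, 0.3    # made-up ARMA(1,1) parameter values
y_T, eps_T = 1.5, 0.2                  # last observation and residual
H = 12

fc = np.empty(H)
fc[0] = delta + theta * y_T + alpha * eps_T   # one-step forecast uses eps_T
for h in range(1, H):
    fc[h] = delta + theta * fc[h - 1]         # thereafter a pure AR recursion

print(np.round(fc, 3))                 # converges towards delta/(1 - theta)
```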

4.7.1 Forecast Errors


The forecasts above are point forecasts, i.e. the best point predictions given by the model. Often it is of interest to assess the variances of these forecasts, and produce confidence bounds or distributions of the forecasts.
To analyze forecast errors, consider first an MA(q) model

y_t = μ + ε_t + α₁ε_{t-1} + α₂ε_{t-2} + … + α_q ε_{t-q}.
The sequence of forecasts is given by

y_{T+1|T} = μ + α₁ε_T + α₂ε_{T-1} + … + α_q ε_{T-q+1}
y_{T+2|T} = μ + α₂ε_T + … + α_q ε_{T-q+2}
y_{T+3|T} = μ + α₃ε_T + … + α_q ε_{T-q+3}
⋮
y_{T+q|T} = μ + α_q ε_T
y_{T+q+1|T} = μ,

where the information set improves the predictions for q periods. The corresponding forecast errors are given by the error terms not in the information set

y_{T+1} − y_{T+1|T} = ε_{T+1}
y_{T+2} − y_{T+2|T} = ε_{T+2} + α₁ε_{T+1}
y_{T+3} − y_{T+3|T} = ε_{T+3} + α₁ε_{T+2} + α₂ε_{T+1}
⋮
y_{T+q} − y_{T+q|T} = ε_{T+q} + α₁ε_{T+q-1} + … + α_{q-1}ε_{T+1}
y_{T+q+1} − y_{T+q+1|T} = ε_{T+q+1} + α₁ε_{T+q} + … + α_q ε_{T+1}.

We can find the variances of the forecasts as the expected squared forecast errors, i.e.

FEV(1) = E(ε²_{T+1} | I_T) = σ²
FEV(2) = E((ε_{T+2} + α₁ε_{T+1})² | I_T) = (1 + α₁²)σ²
FEV(3) = E((ε_{T+3} + α₁ε_{T+2} + α₂ε_{T+1})² | I_T) = (1 + α₁² + α₂²)σ²
⋮
FEV(q) = E((ε_{T+q} + α₁ε_{T+q-1} + … + α_{q-1}ε_{T+1})² | I_T) = (1 + α₁² + α₂² + … + α²_{q-1})σ²
FEV(q+1) = E((ε_{T+q+1} + α₁ε_{T+q} + … + α_q ε_{T+1})² | I_T) = (1 + α₁² + α₂² + … + α_q²)σ².

Note that the forecast error variance increases with the forecast horizon, and that the forecast variances converge to the unconditional variance of y_t, see equation (4.10). This result just reflects that the information set, I_T, is useless for predictions in the remote future: The best prediction will be the unconditional mean, μ, and the uncertainty is the unconditional variance, FEV(∞) = γ₀.
Assuming that the error term is normally distributed, we may produce 95% confidence bounds for the forecasts as

y_{T+h|T} ± 1.96·√FEV(h),

where 1.96 is the 97.5% quantile of the normal distribution. Alternatively we may give a full distribution of the forecasts as N(y_{T+h|T}, FEV(h)), also known as the density forecast.
To derive the forecast error variance for AR and ARMA models we just write the models in their MA(∞) form and use the derivations above. For the case of a simple AR(1) model the infinite MA representation is given in (4.16), and the forecast error variances are found to be

FEV(1) = σ², FEV(2) = (1 + θ²)σ², FEV(3) = (1 + θ² + θ⁴)σ², …,

where we again note that the forecast error variances converge to the unconditional variance, γ₀.
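A short sketch of the resulting point forecasts and 95% bands for an AR(1) (with made-up parameter values, and using the closed form y_{T+h|T} = μ + θ^h(y_T − μ)) might look like:

```python
import numpy as np

delta, theta, sigma2 = 0.0, 0.8, 1.0     # made-up AR(1) parameters
y_T, H = 2.0, 10
mu = delta / (1 - theta)                 # unconditional mean

h = np.arange(1, H + 1)
fc = mu + theta ** h * (y_T - mu)        # point forecasts y_{T+h|T}
fev = sigma2 * (1 - theta ** (2 * h)) / (1 - theta ** 2)   # FEV(h)
lower = fc - 1.96 * np.sqrt(fev)         # 95% confidence bounds
upper = fc + 1.96 * np.sqrt(fev)

for row in zip(h, fc, fev, lower, upper):
    print("h=%2d  fc=%6.3f  FEV=%5.3f  [%6.3f, %6.3f]" % row)
```

Here the FEV expression is the finite geometric sum σ²(1 + θ² + … + θ^{2(h-1)}), which converges to γ₀ = σ²/(1 − θ²) as h grows.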

Example 4.3 (danish consumption-income ratio): For the Danish consumption-income ratio, we saw that the AR(2) model and the ARMA(1,1) gave by and large identical in-sample results. Figure 4.3 (C) shows the out-of-sample forecast for the AR(2) model. We note that the forecasts are very smooth, just describing an exponential convergence back to the unconditional mean–the attractor in the stationary model. The shaded area is the distribution of the forecast, and the widest band corresponds to 95% confidence. Graph (D) compares the forecasts from the AR(2) and the ARMA(1,1) models as well as their 95% confidence bands. The two sets of forecasts are very similar and for practical use it does not matter which one we choose; this is reassuring as the choice between the models was very difficult.

Remark 4.5 (measuring forecast accuracy): Consider a model and a produced dynamic forecast, y_{T+h|T} for h = 1, 2, …, H, where H is the maximum forecast horizon. With actual realizations given by y_{T+h}, the forecast error is

e_h = y_{T+h} − y_{T+h|T}, h = 1, 2, …, H.

In order to evaluate the forecast, it is normal to calculate the average forecast error to see if the forecast is systematically biased,

BIAS = (1/H)·Σ_{h=1}^H e_h.    (4.50)

To measure the overall forecast accuracy, we choose a loss function measuring the severity of a forecast error, L(e_h). A classical choice is the root mean squared error, RMSE, as given by

RMSE = √((1/H)·Σ_{h=1}^H e_h²) = √((1/H)·Σ_{h=1}^H L₂(e_h)),    (4.51)

with the loss function defined as L₂(e_h) = e_h². The RMSE is directly comparable to the in-sample standard deviation of the estimated residuals, σ̂, and if the RMSE is larger than σ̂, it is a signal that the out-of-sample behavior of the model is worse than the in-sample fit, which is normally the case.
Because of the quadratic function in the L₂(·) loss in RMSE, a few large forecast errors are heavily punished, and an alternative measure, which gives a smaller weight to large errors, is the mean absolute error, MAE, as defined by

MAE = (1/H)·Σ_{h=1}^H |e_h| = (1/H)·Σ_{h=1}^H L₁(e_h),    (4.52)

with loss function L₁(e_h) = |e_h|, where |·| denotes the absolute value. Which of these measures to prefer in applications depends on the nature of the forecasts, and in particular the utility function of the forecaster.
Sometimes, researchers also consider the relative error, e.g. the mean absolute percentage error, MAPE, as defined by

MAPE = (100/H)·Σ_{h=1}^H |(y_{T+h} − y_{T+h|T})/y_{T+h}|.    (4.53)

The interpretation of MAPE is straightforward, but can be misleading if the actual series, {y_{T+h}}_{h=1}^H, has some observations close to zero.
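A hedged sketch of these accuracy measures in Python (the example numbers are arbitrary):

```python
import numpy as np

def forecast_accuracy(actual, forecast):
    """BIAS, RMSE, MAE and MAPE for a sequence of forecasts."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    e = actual - forecast                      # forecast errors e_h
    return {
        "BIAS": e.mean(),
        "RMSE": np.sqrt((e ** 2).mean()),
        "MAE": np.abs(e).mean(),
        "MAPE": 100 * np.abs(e / actual).mean(),   # misleading near zero
    }

print(forecast_accuracy([1.0, 1.2, 0.9, 1.1], [0.9, 1.1, 1.0, 1.2]))
```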

Remark 4.6 (test of equal forecast accuracy): Now imagine that we have two series of forecasts,

y^A_{T+h|T} and y^B_{T+h|T} for h = 1, 2, …, H,

produced by two competing models and two corresponding forecast errors,

u_h = y_{T+h} − y^A_{T+h|T} and v_h = y_{T+h} − y^B_{T+h|T}.

To formally compare the two forecasts, we choose a loss function, e.g. L₂(·) or L₁(·) above, and calculate the loss differential for h = 1, 2, …, H:

d_h = L₂(u_h) − L₂(v_h) = u_h² − v_h²

or

d_h = L₁(u_h) − L₁(v_h) = |u_h| − |v_h|.

Now, the hypothesis of equal forecast performance corresponds to the hypothesis

H₀: E(d_h) = μ_d = 0,    (4.54)

against the two-sided alternative, μ_d ≠ 0, that one of the forecast series is more accurate. Diebold and Mariano (1995) suggest a test based on the average loss differential

d̄ = (1/H)·Σ_{h=1}^H d_h,    (4.55)

in particular the t-type statistic

DM = d̄/√V(d̄) = (H^{-1}Σ_{h=1}^H d_h)/√(V(H^{-1}Σ_{h=1}^H d_h)),    (4.56)

where V(H^{-1}Σ_{h=1}^H d_h) is the variance of d̄. It holds that under H₀, DM →d N(0, 1).
The complication of the Diebold-Mariano statistic is that the terms d₁, d₂, …, d_H are typically correlated and the variance of the sum in the denominator is not the sum of the variances. As a consequence, the variance has to take the covariances into account and is calculated using the so-called long-run variance of d_h, which we formally define in §13.4.
In practice the test can be calculated using a linear regression

d_h = μ_d + η_h, h = 1, 2, …, H,    (4.57)

where H₀ is given by μ_d = 0 and DM is simply the t-statistic for μ_d = 0, calculated with standard errors that are robust to autocorrelation–the so-called heteroskedasticity and autocorrelation consistent, HAC, standard errors.
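A sketch of this regression-based version of the test, using statsmodels (a hypothetical loss differential series is used purely for illustration, and the HAC lag truncation is an arbitrary choice):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
d = 0.1 + rng.standard_normal(24)      # hypothetical loss differentials d_h

# Regress d_h on a constant; the t-statistic on the constant, computed with
# HAC (Newey-West) standard errors, is the Diebold-Mariano statistic.
res = sm.OLS(d, np.ones_like(d)).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print("DM statistic:", res.tvalues[0], " p-value:", res.pvalues[0])
```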

4.8 Further Readings


The analysis of univariate time series is treated in most textbooks. Enders (2004) gives an excellent treatment of the univariate models based on the theory for difference equations. Other treatments include Lütkepohl and Krätzig (2004).
Appendix:

4.A Moving Average Solution for the AR(p)


To derive the moving average solution

y_t = (1 + c₁ + c₂ + c₃ + c₄ + …)δ + ε_t + c₁ε_{t-1} + c₂ε_{t-2} + c₃ε_{t-3} + …,

for an AR(p) model it is convenient to rewrite the model as an equivalent AR(1) model for a vector of variables, a VAR(1), and use recursive substitution in the VAR(1). Here we consider an AR(3),

y_t = δ + θ₁y_{t-1} + θ₂y_{t-2} + θ₃y_{t-3} + ε_t,

but the solution method is totally general.
First define the vector of variables, Z_t = (y_t, y_{t-1}, y_{t-2})′, and consider the following VAR(1) model

⎛ y_t     ⎞   ⎛ δ ⎞   ⎛ θ₁  θ₂  θ₃ ⎞ ⎛ y_{t-1} ⎞   ⎛ ε_t ⎞
⎜ y_{t-1} ⎟ = ⎜ 0 ⎟ + ⎜ 1   0   0  ⎟ ⎜ y_{t-2} ⎟ + ⎜ 0   ⎟.    (4.58)
⎝ y_{t-2} ⎠   ⎝ 0 ⎠   ⎝ 0   1   0  ⎠ ⎝ y_{t-3} ⎠   ⎝ 0   ⎠

The first equation is identical to the AR(3) while the two lower equations are just identities, stating that y_{t-1} = y_{t-1} and y_{t-2} = y_{t-2}. We have now written the AR(3) model as the first equation of the VAR(1) model

Z_t = Δ + ΠZ_{t-1} + u_t,    (4.59)

with

Δ = (δ, 0, 0)′, Π = ⎛ θ₁  θ₂  θ₃ ⎞, and u_t = (ε_t, 0, 0)′.
                    ⎜ 1   0   0  ⎟
                    ⎝ 0   1   0  ⎠
To find the MA solution, we use recursive substitution in (4.59). The first step yields

Z_t = Δ + ΠZ_{t-1} + u_t
    = Δ + Π(Δ + ΠZ_{t-2} + u_{t-1}) + u_t
    = (I₃ + Π)Δ + u_t + Πu_{t-1} + Π²Z_{t-2},

where the square Π² = ΠΠ is now the product of 3×3 matrices. Doing one more substitution gives

Z_t = (I₃ + Π)Δ + u_t + Πu_{t-1} + Π²Z_{t-2}
    = (I₃ + Π)Δ + u_t + Πu_{t-1} + Π²(Δ + ΠZ_{t-3} + u_{t-2})
    = (I₃ + Π + Π²)Δ + u_t + Πu_{t-1} + Π²u_{t-2} + Π³Z_{t-3},

and likewise,

Z_t = (I₃ + Π + Π² + Π³)Δ + u_t + Πu_{t-1} + Π²u_{t-2} + Π³u_{t-3} + Π⁴Z_{t-4}.    (4.60)

Because y_t is given by the first entry in Z_t and because u_t = (ε_t, 0, 0)′, we can directly find the moving average coefficients {c₁, c₂, c₃, …} from (4.60) as the upper left elements of the sequence {Π, Π², Π³, …}. We find from Π that

c₁ = θ₁.    (4.61)

Next,

Π² = ⎛ θ₁  θ₂  θ₃ ⎞⎛ θ₁  θ₂  θ₃ ⎞ = ⎛ θ₁²+θ₂   θ₃+θ₁θ₂   θ₁θ₃ ⎞
     ⎜ 1   0   0  ⎟⎜ 1   0   0  ⎟   ⎜ θ₁       θ₂        θ₃   ⎟
     ⎝ 0   1   0  ⎠⎝ 0   1   0  ⎠   ⎝ 1        0         0    ⎠

and

c₂ = θ₁² + θ₂ = c₁θ₁ + θ₂.

Likewise,

Π³ = ⎛ θ₃+θ₁θ₂+θ₁(θ₁²+θ₂)   θ₁θ₃+θ₂(θ₁²+θ₂)   θ₃(θ₁²+θ₂) ⎞
     ⎜ θ₁²+θ₂               θ₃+θ₁θ₂           θ₁θ₃       ⎟
     ⎝ θ₁                   θ₂                θ₃         ⎠

and

c₃ = θ₁(θ₁² + θ₂) + θ₁θ₂ + θ₃
   = c₂θ₁ + c₁θ₂ + θ₃.
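A numerical sketch of this companion-matrix construction (Python, with arbitrary AR(3) coefficients chosen for illustration) might be:

```python
import numpy as np

theta = [0.5, 0.3, -0.2]                 # arbitrary AR(3) coefficients
Pi = np.array([theta, [1, 0, 0], [0, 1, 0]], dtype=float)  # companion matrix

# The MA coefficients c_j are the upper-left (1,1) elements of Pi**j
P = np.eye(3)
for j in range(1, 7):
    P = P @ Pi
    print("c_%d = %.4f" % (j, P[0, 0]))
```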
Chapter 5

The Autoregressive
Distributed Lag Model
for Stationary Time Series

This chapter introduces the class of autoregressive models with additional explanatory variables, the so-called autoregressive distributed lag (ADL) models. Due to the dynamics of the model, the interpretation of regression coefficients is somewhat complicated, and for interpretation we derive the dynamic multipliers and the steady-state solution. In this chapter, we focus on the case of stationary and weakly dependent time series, but the tools are used again in Chapter 8 for unit-root time series in the presence of co-integration.

5.1 The Model


The models considered in Chapter 4 were all univariate in the sense that they focus
on a single variable, yt . In most situations, however, we are interested in the interre-
lationship between variables. As an example we could be interested in the dynamic
e¤ects on a particular variable of interest, yt , as the response to an intervention in
another variables, xt , i.e. in the dynamic multipliers
@yt @yt+1 @yt+2
; ; ; :::;
@xt @xt @xt
and a univariate model will not su¢ ce for this purpose. Often the object of interest is
the conditional mean of yt given yt 1 ; yt 2 ; :::; xt ; xt 1 ; xt 2 ; ::: and assuming linearity
Heino Bohn Nielsen, University of Copenhagen, September 1, 2023.
118 The Autoregressive Distributed Lag Model

of the conditional mean, we get the so-called autoregressive distributed lag (ADL)
model:
y t = + 1 y t 1 + 0 xt + 1 x t 1 + t ; (5.1)
for t = 1; 2; :::; T , with E( t j yt 1 ; yt 2 ; :::; xt ; xt 1 ; xt 2 :::) = 0.
Sometimes, we make the stronger assumption that (i) ε_t is independently and identically distributed, i.i.d.(0, σ²), ruling out also heteroskedasticity, or, for the likelihood analysis, that (ii) ε_t given the past is homoskedastic Gaussian. These additional assumptions, (i)-(ii), may be relevant for likelihood-based estimation and for the derivation of some of the properties of y_t, e.g. second order moments. In most cases, however, they are not paramount. Heteroskedasticity and non-normality do not change the economic interpretation of the model, and the only change in the statistical analysis is that the analysis is interpreted as a quasi-likelihood analysis, with the implied use of the sandwich variance formula.
By referring to the results for an AR(1) process, we immediately see that the process y_t is stationary if |θ₁| < 1 and x_t is a stationary process. The first condition excludes unit roots in the equation (5.1), while the second condition states that the forcing variable, x_t, is also stationary. In this case, the standard results for estimation and inference hold for this model.
Before we go into details with the dynamic properties of the ADL model, we want to emphasize that the model class is quite general and contains other relevant models as special cases. The AR(1) model analyzed above prevails if β₀ = β₁ = 0. A static regression with i.i.d. errors is obtained if θ₁ = β₁ = 0. A static regression with AR(1) autocorrelated errors is obtained if β₁ = −θ₁β₀. Finally, a model in first differences is obtained if θ₁ = 1 and β₁ = −β₀. Whether these special cases are relevant can be analyzed with Wald or LR testing on the coefficients.
To simplify notation, the model in (5.1) includes only one explanatory variable, x_t, and only one lag in y_t and x_t. The model can easily be extended to more general cases, however, and with two explanatory variables, x_t and z_t, and with k_y, k_x, and k_z lags, respectively, the model would read

y_t = δ + Σ_{i=1}^{k_y} θ_i y_{t-i} + Σ_{i=0}^{k_x} β_i x_{t-i} + Σ_{i=0}^{k_z} γ_i z_{t-i} + ε_t.    (5.2)

The model can also be extended to include a more elaborate deterministic specification, e.g. a linear trend, seasonal dummies, or intervention dummies for outlying observations. In the analysis below, we mainly focus on the simple case in (5.1).
5.2 Dynamic- and Long-Run Multipliers


To derive the dynamic multiplier for the model in (5.1), we write the equations for observations y_t, y_{t+1}, y_{t+2}, y_{t+3}, etc.,

y_t = δ + θ₁y_{t-1} + β₀x_t + β₁x_{t-1} + ε_t
y_{t+1} = δ + θ₁y_t + β₀x_{t+1} + β₁x_t + ε_{t+1}
y_{t+2} = δ + θ₁y_{t+1} + β₀x_{t+2} + β₁x_{t+1} + ε_{t+2}
y_{t+3} = δ + θ₁y_{t+2} + β₀x_{t+3} + β₁x_{t+2} + ε_{t+3}.

The dynamic multipliers, recalling that only x_t changes, are calculated as the derivatives wrt. x_t:

∂y_t/∂x_t = β₀
∂y_{t+1}/∂x_t = θ₁·(∂y_t/∂x_t) + β₁ = θ₁β₀ + β₁
∂y_{t+2}/∂x_t = θ₁·(∂y_{t+1}/∂x_t) = θ₁(θ₁β₀ + β₁)
∂y_{t+3}/∂x_t = θ₁·(∂y_{t+2}/∂x_t) = θ₁²(θ₁β₀ + β₁)
⋮
∂y_{t+k}/∂x_t = θ₁^{k-1}(θ₁β₀ + β₁).

Under the stationarity condition, |θ₁| < 1, shocks have only transitory effects, ∂y_{t+k}/∂x_t → 0 as k → ∞. We think of the sequence of multipliers as the impulse-responses to a temporary change in x_t, and for stationary variables it is natural that the long-run effect is zero. To illustrate the flexibility of the ADL model with one lag, Figure 5.1 (A) reports examples of dynamic multipliers. In all cases the contemporaneous impact is ∂y_t/∂x_t = 0.8, but the dynamic profile can be fundamentally different depending on the parameters.
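A small sketch of these multipliers in Python (parameter values as in the first parametrization of Figure 5.1 (A)):

```python
import numpy as np

theta1, beta0, beta1 = 0.5, 0.8, 0.2   # ADL(1,1) parameters as in panel (A)
K = 25

mult = np.empty(K)
mult[0] = beta0                        # contemporaneous impact
mult[1] = theta1 * beta0 + beta1
for k in range(2, K):
    mult[k] = theta1 * mult[k - 1]     # exponential decay thereafter

print(np.round(mult[:6], 3))
print("sum of multipliers:", round(mult.sum(), 3),
      " vs (beta0+beta1)/(1-theta1) =", (beta0 + beta1) / (1 - theta1))
```

The sum of the dynamic multipliers approximates the long-run multiplier derived below.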
Now consider a permanent shift in x_t, i.e. where E(x_t) changes. To find the long-run multiplier we take expectations in (5.1) to obtain

y_t = δ + θ₁y_{t-1} + β₀x_t + β₁x_{t-1} + ε_t
E(y_t) = δ + θ₁E(y_{t-1}) + β₀E(x_t) + β₁E(x_{t-1}).
If the processes are stationary, it holds that E(y_t) = E(y_{t-1}) and E(x_t) = E(x_{t-1}), and collecting terms we obtain

E(y_t)(1 − θ₁) = δ + (β₀ + β₁)E(x_t)
E(y_t) = δ/(1 − θ₁) + (β₀ + β₁)/(1 − θ₁)·E(x_t).    (5.3)

Based on (5.3) we find the long-run multiplier as the derivative

∂E(y_t)/∂E(x_t) = (β₀ + β₁)/(1 − θ₁).    (5.4)

It also holds that the long-run multiplier is the sum of the short-run effects, i.e.

∂y_t/∂x_t + ∂y_{t+1}/∂x_t + ∂y_{t+2}/∂x_t + … = β₀ + (θ₁β₀ + β₁) + θ₁(θ₁β₀ + β₁) + θ₁²(θ₁β₀ + β₁) + …
  = β₀(1 + θ₁ + θ₁² + …) + β₁(1 + θ₁ + θ₁² + …)
  = (β₀ + β₁)/(1 − θ₁).

We may also define a steady state as y_t = y_{t-1} and x_t = x_{t-1}. Inserting that in the ADL model we get the so-called long-run solution:

y_t = δ/(1 − θ₁) + (β₀ + β₁)/(1 − θ₁)·x_t = μ + λx_t.    (5.5)

We see that the steady state derivative is exactly the long-run multiplier.
Figure 5.1 (B) reports examples of cumulated dynamic multipliers, which picture the convergence towards the long-run solutions. The first period impact is in all cases ∂y_t/∂x_t = 0.8, while the long-run impact is λ, which depends on the parameters of the model.

Example 5.1 (danish income and consumption): To illustrate the impact of income changes on changes in consumption, we consider time series for the growth from quarter to quarter in Danish private aggregate consumption, c_t = Δlog(CONS_t) say, and private disposable income, y_t = Δlog(INC_t), for 1971:1-2003:2, see Figure 5.1 (C). To analyze the interdependencies between the variables we estimate an ADL model with one lag and obtain the equation

ĉ_t = 0.003 − 0.312·c_{t-1} + 0.244·y_t + 0.055·y_{t-1},    (5.6)
     (2.01)  (−3.59)         (3.88)      (0.84)

where the numbers in parentheses are t-ratios. Apart from a number of outliers, the model appears to be well specified, and the misspecification tests for no-autocorrelation cannot be rejected. The residuals are reported in graph (D). The long-run solution is given by (5.5) with coefficients given by

μ = δ/(1 − θ₁) = 0.003/(1 + 0.312) = 0.0024 and λ = (β₀ + β₁)/(1 − θ₁) = (0.244 + 0.055)/(1 + 0.312) = 0.227.

The standard errors of μ̂ and λ̂ are complicated functions of the original variances and covariances, but OxMetrics supplies them automatically, and we find that t_{μ=0} = 2.02 and t_{λ=0} = 3.00, so both long-run coefficients are significantly different from zero at a 5% level. The impulse responses, ∂c_t/∂y_t, ∂c_{t+1}/∂y_t, …, are reported in Figure 5.1 (E) together with 95% confidence bands derived from the variance of the parameter estimates. We note that the contemporaneous impact is 0.244 from (5.6). The cumulated response is also presented in graph (E); it converges to the long-run multiplier, λ = 0.227, which in this case is very close to the first period impact.

5.3 Error-Correction Model


There exists a convenient formulation of the model that incorporates the long-run solution directly. In particular, we can rewrite the model as

y_t = δ + θ₁y_{t-1} + β₀x_t + β₁x_{t-1} + ε_t
y_t − y_{t-1} = δ + (θ₁ − 1)y_{t-1} + β₀x_t + β₁x_{t-1} + ε_t
y_t − y_{t-1} = δ + (θ₁ − 1)y_{t-1} + β₀(x_t − x_{t-1}) + (β₀ + β₁)x_{t-1} + ε_t
Δy_t = δ + (θ₁ − 1)y_{t-1} + β₀Δx_t + (β₀ + β₁)x_{t-1} + ε_t.    (5.7)

The idea is that the levels appear only once, and the model can be written on the form

Δy_t = β₀Δx_t − (1 − θ₁)(y_{t-1} − μ − λx_{t-1}) + ε_t,    (5.8)

where

μ = δ/(1 − θ₁) and λ = (β₀ + β₁)/(1 − θ₁)

refer to (5.5). The representation in (5.8) is known as the error-correction model (ECM) and it has a very natural interpretation in terms of the long-run steady state solution and the dynamic adjustment towards equilibrium.
First note that

y_{t-1} − y*_{t-1} = y_{t-1} − μ − λx_{t-1}
[Figure 5.1 about here. Panels: (A) dynamic multipliers in an ADL(1,1) for the parametrizations y_t = 0.5·y_{t-1} + 0.8·x_t + 0.2·x_{t-1} + ε_t, y_t = 0.9·y_{t-1} + 0.8·x_t + 0.2·x_{t-1} + ε_t, y_t = 0.5·y_{t-1} + 0.8·x_t + 0.8·x_{t-1} + ε_t, and y_t = 0.5·y_{t-1} + 0.8·x_t − 0.6·x_{t-1} + ε_t; (B) cumulated dynamic multipliers from (A); (C) growth in Danish consumption and income; (D) standardized residuals from the ADL model; (E) dynamic multipliers (lag-weights) and cumulated dynamic multipliers; (F) forecasts from the ADL model together with actual values.]
Figure 5.1: (A)-(B): Examples of dynamic multipliers for an ADL model with one lag. (C)-(F): Empirical example based on Danish income and consumption.

is the deviation from steady state in the previous period. For the steady state to be sustained the variables have to eliminate the deviations and move towards the long-run solution. This tendency is captured by the coefficient −(1 − θ₁) < 0, and if y_{t-1} is above the steady state value then Δy_t will be affected negatively, and there is a tendency for y_t to move back towards the steady state. We say that y_t error-corrects or equilibrium-corrects, and y_t = μ + λx_t is the equilibrium value or the attractor for y_t.
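As a small numerical illustration, the mapping from the ADL parameters to the error-correction form is direct; a sketch using the point estimates from Example 5.1:

```python
delta, theta1 = 0.003, -0.312          # ADL estimates from Example 5.1
beta0, beta1 = 0.244, 0.055

adjustment = -(1 - theta1)             # loading on the lagged disequilibrium
mu = delta / (1 - theta1)              # long-run level
lam = (beta0 + beta1) / (1 - theta1)   # long-run multiplier

print("adjustment = %.3f, mu = %.4f, lambda = %.3f" % (adjustment, mu, lam))
# close to the values 0.0024 and 0.227 reported in the text (rounding)
```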
We can estimate the ECM model and obtain the alternative representation directly. We note that the models (5.1), (5.7), and (5.8) are equivalent, and we can always calculate from one representation to the other. It is important to note, however, that the versions in (5.1) and (5.7) are linear regression models that can be estimated using OLS, while the solved ECM in (5.8) is non-linear and requires a different estimation procedure. The choice of which model to estimate depends on the purpose of the analysis and on the software used. The analysis in OxMetrics focuses on the standard ADL model, and the long-run solution and the dynamic multipliers are automatically supplied. In other software packages it is sometimes more convenient to estimate the ECM form directly.

Example 5.2 (danish income and consumption): If we estimate the linear error-correction form for the Danish income and consumption data series we obtain the equation,

Δĉ_t = 0.244·Δy_t − 1.312·c_{t-1} + 0.298·y_{t-1} + 0.0031.
      (3.88)      (−15.1)         (2.87)          (2.01)

Note that the results are equivalent to the results from the ADL model in (5.6). We can find the long-run multiplier as λ = 0.298/1.312 = 0.227. Using maximum likelihood (with a normal error distribution) we can also estimate the solved ECM form to obtain

Δĉ_t = 0.244·Δy_t − 1.312·(c_{t-1} − 0.0024 − 0.227·y_{t-1}),
      (3.88)      (−15.1)           (1.02)    (3.00)

where we again recognize the long-run solution.

5.4 General Case


To state the results for the general case of more lags, it is again convenient to use lag-polynomials. The model with k_y and k_x lags,

y_t = δ + Σ_{i=1}^{k_y} θ_i y_{t-i} + Σ_{i=0}^{k_x} β_i x_{t-i} + ε_t,    (5.9)

is therefore written as

θ(L)y_t = δ + β(L)x_t + ε_t,

where the polynomials are defined as

θ(L) = 1 − θ₁L − θ₂L² − … − θ_{k_y}L^{k_y} and β(L) = β₀ + β₁L + β₂L² + … + β_{k_x}L^{k_x}.

Under stationarity we can write the model as

y_t = θ^{-1}(L)β(L)x_t + θ^{-1}(L)δ + θ^{-1}(L)ε_t,    (5.10)
in which there are infinitely many terms in both x_t and ε_t. Writing the combined polynomial as

c(L) = θ^{-1}(L)β(L) = c₀ + c₁L + c₂L² + …,    (5.11)

the dynamic multipliers, also called lag-weights, are given by the sequence c₀, c₁, c₂, … This is parallel to the result for the AR(1) model stated above.

Remark 5.1 (deriving lag-weights): To understand the result in (5.11), consider the case of a single lag,

θ(L) = 1 − θL and β(L) = β₀ + β₁L.

Then it holds that

θ^{-1}(L) = 1/(1 − θL) = 1 + θL + θ²L² + θ³L³ + θ⁴L⁴ + …,

see e.g. (4.29). This gives

c(L) = θ^{-1}(L)β(L)
     = (1 + θL + θ²L² + θ³L³ + θ⁴L⁴ + …)(β₀ + β₁L)
     = β₀ + (β₁ + θβ₀)L + θ(β₁ + θβ₀)L² + θ²(β₁ + θβ₀)L³ + …
     = c₀ + c₁L + c₂L² + c₃L³ + …,

compare the result in §5.2.
Since the lag-weights are functions of the parameters, the uncertainty of the estimated lag-weights,

ĉ₀ = β̂₀, ĉ₁ = β̂₁ + β̂₀θ̂, ĉ₂ = θ̂(β̂₁ + β̂₀θ̂), etc.,

can be derived from the covariance matrix of the estimated parameters. This allows construction of confidence bands, such that their significance can be judged.

If we take expectations in (5.10) we get the equivalent to (5.3),

E[y_t] = θ^{-1}(L)δ + θ^{-1}(L)β(L)E[x_t].

Now recall that E[y_t] = E[y_{t-1}] = LE[y_t] and E[x_t] = E[x_{t-1}] = LE[x_t] are constant due to stationarity, and it holds that

θ^{-1}(L)β(L)E[x_t] = θ(1)^{-1}β(1)E[x_t].

The long-run solution can be found as the derivative

∂E(y_t)/∂E(x_t) = β(1)/θ(1) = (β₀ + β₁ + β₂ + … + β_{k_x})/(1 − θ₁ − θ₂ − … − θ_{k_y}).

This is also the sum of the dynamic multipliers, c₀ + c₁ + c₂ + …


5.4.1 Summary
To conclude, observe that with several lags the interpretation of the individual coefficients, (θ₁, …, θ_{k_y}, β₀, …, β_{k_x}), is somewhat complicated. We know that β₀ is the contemporaneous impact,

β₀ = ∂y_t/∂x_t,

but there is no similar regression interpretation for β₁, because it is not natural to consider the derivative of y_t with respect to x_{t-1} leaving x_t unchanged. We therefore interpret the results via the parameters in the long-run solution

y_t = μ + λx_t = δ/(1 − θ₁ − θ₂ − … − θ_{k_y}) + (β₀ + β₁ + β₂ + … + β_{k_x})/(1 − θ₁ − θ₂ − … − θ_{k_y})·x_t.    (5.12)

We can also write the error-correction form

Δy_t = β₀Δx_t − α(y_{t-1} − μ − λx_{t-1}) + Σ_{i=1}^{k_y-1} θ̃_i Δy_{t-i} + Σ_{i=1}^{k_x-1} β̃_i Δx_{t-i} + ε_t,    (5.13)

where

α = 1 − θ₁ − θ₂ − … − θ_{k_y}

and θ̃₁, …, θ̃_{k_y-1}, β̃₁, …, β̃_{k_x-1} are functions of the original parameters. In the error-correction form we can interpret β₀ as the contemporaneous impact, μ and λ as defining the long-run solution, and α as the speed of adjustment, i.e. the part of the disequilibrium removed in the next period.
The remaining parameters add extra dynamics needed to fit the properties of the data, and they can be interpreted by looking at the profile of the dynamic multipliers, ∂y_t/∂x_t, ∂y_{t+1}/∂x_t, …, similar to Figure 5.1 (E).

5.5 Conditional Forecasts


From the conditional ADL model we can also produce forecasts of y_{T+k}. This is parallel to the univariate forecasts presented above and we will not go into details here. Note, however, that to forecast y_{T+k} we need to know x_{T+k}. For the simple model with one lag,

y_t = θ₁y_{t-1} + β₀x_t + β₁x_{t-1} + ε_t,    (5.14)

the information set for the forecast is given by

I_T = {y_T, x_T, y_{T-1}, x_{T-1}, …},    (5.15)

and the optimal one-period forecast will be

y_{T+1|T} = E(y_{T+1} | I_T)
          = E(θ₁y_T + β₀x_{T+1} + β₁x_T + ε_{T+1} | I_T)
          = θ₁y_T + β₀E(x_{T+1} | I_T) + β₁x_T,    (5.16)

where we have used that y_T and x_T are included in the information set at time T and that E(ε_{T+1} | I_T) = 0.
In (5.16), x_{T+1} is not included in the information set I_T, and to calculate the forecast we need a forecast of the conditioning variable, x_{T+1}, given I_T. The forecast for x_{T+1} could be obtained using a univariate time series model for {x_t}_{t=1}^T, e.g. an AR(1), and the forecast

x_{T+1|T} = E(x_{T+1} | I_T),    (5.17)

could then be inserted in (5.16), such that

y_{T+1|T} = θ₁y_T + β₀x_{T+1|T} + β₁x_T.    (5.18)

Alternatively, we could specify likely scenarios for x_{T+h}, h = 1, 2, …, and calculate the most likely outcome for y_{T+h} given these scenarios.

Example 5.3 (danish income and consumption): To illustrate dynamic forecasting with the ADL model we reestimate the ADL model for the Danish consumption growth conditional on income growth. We now estimate for the sample 1971:2-1997:2 and retain the most recent 24 observations (6 years of quarterly data) for post-sample analysis. We forecast the observations (conditional on the observed observations for y_t) and compare the forecasts with the actual observations for c_t. For the reduced sample we obtain the results

ĉ_t = 0.003 − 0.298·c_{t-1} + 0.300·y_t + 0.055·y_{t-1},
     (1.78)  (−3.09)         (4.24)      (0.72)

which are very similar to the full sample results. Figure 5.1 (F) reports the forecasts and the actual observations. The forecasts do not seem to be very informative in this case, which just reflects that the noise in the time series is very large compared to the systematic variation. That makes forecasting very difficult. For more persistent time series it is often easier to predict the direction of future movements.

Remark 5.2 (comparing conditional forecasts): To compare conditional forecasts from different competing models, the tools in §4.7.1 again apply.
5.6 Further Readings


A very detailed analysis of the ADL is model is given in Hendry (1995), who also goes
through many special cases of the ADL model. There is also detailed coverage for
both stationary and non-stationary variables in Lütkepohl and Krätzig (2004) and
Lütkepohl (2005) but the technical level is higher than this course.
Chapter 6

Analysis of Vector
Autoregressive Models

Vector autoregressive models are generalizations of univariate autoregressive models to the case with several variables, Z_t = (z_{1t}, z_{2t}, …, z_{pt})′ ∈ R^p, and they are natural starting points for empirical analyses when the causal relationships between variables are unknown. This chapter introduces some classical tools for the analysis of stationary vector autoregressions. Later in the course, we revisit the vector autoregression for the case of unit-root non-stationary variables and consider the implications of co-integration.

6.1 Introduction
So far in the course, the considered time series models were either univariate, as the first-order autoregressive, AR(1), model for the variable y_t ∈ R,

y_t = δ + θy_{t-1} + ε_t, for t = 1, 2, …, T,    (6.1)

with ε_t serially uncorrelated and y₀ given, or single-equation conditional models, as the autoregressive distributed lag (ADL) model for y_t ∈ R conditional on x_t ∈ R^m, e.g. for m = 1,

y_t = δ + θy_{t-1} + β₀x_t + β₁x_{t-1} + ε_t,    (6.2)

with y₀ and x₀ given. Recall that the approach in (6.2) requires that the contemporaneous causal direction is known a priori, i.e. that we know that it is x_t that determines y_t and not the other way around. This assumption may sometimes be
appropriate, but in many cases reverse causality–or a simultaneous determination of y_t and x_t–cannot be ruled out.
In this chapter we generalize the framework to consider models for a vector of variables, Z_t = (z_{1t}, z_{2t}, …, z_{pt})′ ∈ R^p, given their joint past, collected in the information set

I_{t-1} = {Z_{t-1}, Z_{t-2}, Z_{t-3}, …} = {z_{1t-1}, …, z_{pt-1}, z_{1t-2}, …, z_{pt-2}, …}.

For this class of models, we do not impose a specific causal structure; we simply formulate a model for the dynamic properties of the vector Z_t. To simplify notation, most of the presentation below will focus on the case Z_t = (y_t, x_t)′ ∈ R², but the theory holds for arbitrary dimensions, Z_t ∈ R^p.

6.2 The VAR Model


Let Z_t = (z_{1t}, …, z_{pt})′ ∈ R^p be a p-dimensional vector. The vector autoregressive model of order k, denoted VAR(k), is defined as the p equations:

Z_t = δ + Π₁Z_{t-1} + Π₂Z_{t-2} + … + Π_k Z_{t-k} + ε_t, t = 1, 2, …, T,    (6.3)

with ε_t serially uncorrelated and conditional on the k initial values Z₀, Z₋₁, …, Z₋₍k₋₁₎. For the likelihood analysis, we assume that

ε_t | Z_{t-1}, Z_{t-2}, …, Z_{t-k} ~ N(0, Ω),

where Ω is symmetric and positive definite. Observe that δ is a p×1 vector while the autoregressive coefficients, Π₁, …, Π_k, are all p×p matrices. In total, the model has p + kp² parameters in the conditional mean plus p(p + 1)/2 parameters in the symmetric covariance matrix, Ω.

Example 6.1 (a bivariate var(1)): Consider the case p = 2 and Z_t = (y_t, x_t)′, and assume a model with k = 1 lag. The VAR(1) model may be written as

⎛ y_t ⎞   ⎛ δ₁ ⎞   ⎛ π₁₁  π₁₂ ⎞ ⎛ y_{t-1} ⎞   ⎛ ε₁t ⎞
⎝ x_t ⎠ = ⎝ δ₂ ⎠ + ⎝ π₂₁  π₂₂ ⎠ ⎝ x_{t-1} ⎠ + ⎝ ε₂t ⎠,    (6.4)

with ε_t | I_{t-1} ~ N(0, Ω). Of course, we may also write the model as two separate equations,

y_t = δ₁ + π₁₁y_{t-1} + π₁₂x_{t-1} + ε₁t    (6.5)
x_t = δ₂ + π₂₁y_{t-1} + π₂₂x_{t-1} + ε₂t,    (6.6)

as long as we remember that the error terms, ε₁t and ε₂t, may be correlated,

Ω = E_{t-1}⎛ ε₁t ⎞(ε₁t, ε₂t) = ⎛ E_{t-1}(ε₁tε₁t)  E_{t-1}(ε₁tε₂t) ⎞ = ⎛ ω₁₁  ω₁₂ ⎞,
           ⎝ ε₂t ⎠             ⎝ E_{t-1}(ε₂tε₁t)  E_{t-1}(ε₂tε₂t) ⎠   ⎝ ω₂₁  ω₂₂ ⎠

where E_{t-1}(·) = E(· | I_{t-1}) denotes the expectation conditional on the past. By symmetry it holds that ω₂₁ = ω₁₂, and the parameters of the model are given by

θ = {δ₁, δ₂, π₁₁, π₁₂, π₂₁, π₂₂, ω₁₁, ω₂₁, ω₂₂},

i.e. 9 parameters.
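A sketch of how such a bivariate VAR(1) can be simulated and estimated in Python (using the statsmodels VAR class; all parameter values below are arbitrary illustration choices):

```python
import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(3)
delta = np.array([0.1, 0.2])
Pi = np.array([[0.5, 0.1],
               [0.2, 0.4]])
L = np.linalg.cholesky([[1.0, 0.3],    # Omega with correlated errors
                        [0.3, 1.0]])

T = 500
Z = np.zeros((T, 2))
for t in range(1, T):
    eps = L @ rng.standard_normal(2)   # correlated shocks (eps_1t, eps_2t)
    Z[t] = delta + Pi @ Z[t - 1] + eps

res = VAR(Z).fit(1)                    # estimate the VAR(1)
print(res.params)                      # constants and lag coefficients

S = res.sigma_u                        # estimated error covariance (Omega)
print(S / np.sqrt(np.outer(np.diag(S), np.diag(S))))   # residual correlations
```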

Observe that the VAR is a reduced form, where all regressors are predetermined–in the sense that they are dated at time t−1 or earlier (and therefore measurable with respect to I_{t-1}). It also holds that all included variables in Z_t are endogenous with no prior assumptions on the causal contemporaneous direction between the variables.
The equations in (6.4) model the dynamic effects and summarize the autocovariances of the data. The simultaneous effects, i.e. all effects happening within one time period, are not directly parametrized in (6.4). The simultaneous effect between y_t and x_t is still a part of the model, however, but it is described by the error covariance, ω₁₂. To understand the relationship between the error covariance and the contemporaneous effects, imagine a situation where x_t causally affects y_t within one time period, which we here denote

x_t ⇒ y_t.    (6.7)

Now consider an increase of ε₂t that we will interpret as an unexpected shock to x_t. Then x_t increases and because of the true causal structure in (6.7), y_t also changes. As none of the (predetermined) variables on the right hand side of equation (6.5) changes, the change in y_t is translated into a change in the residual, ε₁t. We therefore observe that the two errors, ε₂t and ε₁t, move together and will therefore be correlated. By the simple correlation alone, however, we cannot identify the causal relationship in (6.7), because the opposite causal link would have given an identical observed correlation.

Example 6.2 (empirical var(1) for consumption): As an empirical example
of a VAR(1) model, we consider Danish quarterly observations covering 1974(2)-2017(3)
for the log of real aggregate private consumption, $c_t$, the log of private real
disposable income, $y_t$, and the log of real wealth including housing equity, $w_t$. To
impose stationarity, we model differences from the year before (yearly growth rates),
i.e.

$$\Delta_4 c_t = c_t - c_{t-4}, \qquad \Delta_4 y_t = y_t - y_{t-4} \qquad\text{and}\qquad \Delta_4 w_t = w_t - w_{t-4},$$

and let $Z_t = (\Delta_4 c_t, \Delta_4 y_t, \Delta_4 w_t)'$. Estimates from a VAR(1) are given by

$$\begin{pmatrix} \Delta_4 c_t \\ \Delta_4 y_t \\ \Delta_4 w_t \end{pmatrix} =
\begin{pmatrix} \underset{(2.06)}{0.00369} \\ \underset{(2.80)}{0.00760} \\ \underset{(1.44)}{0.00841} \end{pmatrix} +
\begin{pmatrix} \underset{(7.58)}{0.483} & \underset{(2.59)}{0.125} & \underset{(4.64)}{0.0550} \\
\underset{(1.01)}{0.0971} & \underset{(5.45)}{0.400} & \underset{(0.325)}{0.00584} \\
\underset{(-3.75)}{-0.780} & \underset{(1.75)}{0.277} & \underset{(24.7)}{0.953} \end{pmatrix}
\begin{pmatrix} \Delta_4 c_{t-1} \\ \Delta_4 y_{t-1} \\ \Delta_4 w_{t-1} \end{pmatrix} +
\begin{pmatrix} \hat\epsilon_{1t} \\ \hat\epsilon_{2t} \\ \hat\epsilon_{3t} \end{pmatrix},$$

where numbers in parentheses are $t$-ratios for the hypotheses of zero coefficients.
Based on the $t$-ratios (that under regularity conditions have standard normal
distributions, as discussed below) we observe that consumption depends on lagged
consumption, income and wealth–both with the expected positive coefficients. Income
is largely autoregressive, and effects from lagged consumption and wealth are
insignificant. Wealth is also highly autoregressive, but depends significantly and
negatively on lagged consumption. This is most likely a savings effect: large
consumption last quarter implies that wealth in this quarter deteriorates.

The contemporaneous effects, i.e. effects within one quarter, are given in terms
of the covariances. Here we report the correlation matrix of the estimated residuals,
$\hat\epsilon_t$, given by

$$\begin{pmatrix} 1 & 0.228 & 0.180 \\ 0.228 & 1 & 0.132 \\ 0.180 & 0.132 & 1 \end{pmatrix}.$$

The strongest correlation is between consumption and income. The correlation is
positive, suggesting a positive co-movement within one quarter. Theory would suggest
that higher disposable income triggers higher consumption, but the estimated
correlation is not directly informative on the direction of causality. To check whether
the correlation is significant, we use the fact that if the true correlation is zero,
$\rho = 0$, then the asymptotic distribution of the estimated correlation, $\hat\rho$, is given by

$$\hat\rho \stackrel{d}{\to} N(0, T^{-1}),$$

and we use $1.96/\sqrt{T}$ as the critical value for the estimated correlation. In the present
example we have $T = 174$ effective observations and $1.96 \cdot 174^{-1/2} = 0.149$, such that
an estimated correlation of 0.228 is significantly different from zero.

Remark 6.1 (deterministic terms): The VAR(k) model in (6.3) includes a constant
term to allow the time series to have nonzero levels. The model can easily be
extended to include more elaborate deterministic specifications. As an example, a
VAR(1) model with a linear deterministic trend, and quarterly dummies to account
for seasonal differences between the four quarters, would be written as

$$Z_t = \delta_0 + \delta_1 t + \Pi_1 Z_{t-1} + \phi_1 D_{1t} + \phi_2 D_{2t} + \phi_3 D_{3t} + \epsilon_t, \qquad t = 1, 2, \ldots, T,$$

where the parameters $\delta_0$, $\delta_1$, $\phi_1$, $\phi_2$, and $\phi_3$ are all $p \times 1$ vectors, $D_{it}$ is a dummy
variable taking the value one in quarter $i$, $i = 1, 2, 3$, while $t$ is the trend regressor,
taking the values $1, 2, 3, \ldots, t, \ldots, T$.

Remark 6.2 (var-x models): The VAR model for the endogenous variables in
$Z_t \in \mathbb{R}^p$ can also be extended to include exogenous stochastic variables, $X_t \in \mathbb{R}^n$,
assuming a priori that the causality runs from $X_t$ to $Z_t$. This is called a VAR-X
model and can be written as

$$Z_t = \delta_0 + \Pi_1 Z_{t-1} + \Gamma_0 X_t + \Gamma_1 X_{t-1} + \epsilon_t, \qquad t = 1, 2, \ldots, T,$$

where the equation conditions on both contemporaneous and lagged variables, $X_t$ and
$X_{t-1}$. Here we have that the parameters $\Gamma_0$ and $\Gamma_1$ are both $p \times n$ matrices.

A common situation where a VAR-X model is useful is the case of a small open
economy like Denmark, where we have an idea that foreign variables should enter, but
that Denmark is too small to affect the foreign variables. In this case we may include
foreign variables as exogenous, such that they affect the Danish variables but are not
affected themselves.

This model is a straightforward generalization of the ADL, and for $p = n = 1$ the
VAR-X is just the ADL model.

6.3 MA Solution and Stationarity Condition


To characterize the dynamic properties of time series generated from the VAR model
and to discuss conditions under which the time series are stationary, we derive the
moving average solution for the VAR process. This is similar to the analysis of the
univariate autoregression, but adapted to matrix notation.

6.3.1 The VAR(1) Model


To simplify notation we consider first the VAR(1) model,

$$Z_t = \delta + \Pi_1 Z_{t-1} + \epsilon_t, \qquad\text{with } Z_0 \text{ given.} \tag{6.8}$$

Using recursive substitution, similar to the univariate AR(1), we find

$$\begin{aligned}
Z_t &= \delta + \Pi_1(\delta + \Pi_1 Z_{t-2} + \epsilon_{t-1}) + \epsilon_t \\
&= (I_p + \Pi_1)\delta + \epsilon_t + \Pi_1\epsilon_{t-1} + \Pi_1^2 Z_{t-2} \\
&= (I_p + \Pi_1)\delta + \epsilon_t + \Pi_1\epsilon_{t-1} + \Pi_1^2(\delta + \Pi_1 Z_{t-3} + \epsilon_{t-2}) \\
&= (I_p + \Pi_1 + \Pi_1^2)\delta + \epsilon_t + \Pi_1\epsilon_{t-1} + \Pi_1^2\epsilon_{t-2} + \Pi_1^3 Z_{t-3},
\end{aligned}$$

where $I_p$ is the $p$-dimensional identity matrix, $\Pi_1^2 = \Pi_1\Pi_1$ is the square of the $p \times p$
matrix, and $\Pi_1^3 = \Pi_1\Pi_1\Pi_1$. Continuing the recursive substitution, we end with the
MA solution,

$$Z_t = (I_p + \Pi_1 + \Pi_1^2 + \ldots + \Pi_1^{t-1})\delta + \epsilon_t + \Pi_1\epsilon_{t-1} + \Pi_1^2\epsilon_{t-2} + \ldots + \Pi_1^{t-1}\epsilon_1 + \Pi_1^t Z_0, \tag{6.9}$$

which consists of a deterministic part, a moving average of the error terms, $\epsilon_t, \epsilon_{t-1}, \ldots, \epsilon_1$,
as well as a contribution from the initial values.

From (6.9) we can find the expected value,

$$E(Z_t \mid Z_0) = (I_p + \Pi_1 + \Pi_1^2 + \ldots + \Pi_1^{t-1})\delta + \Pi_1^t Z_0, \tag{6.10}$$

where we have used that $E(\epsilon_t \mid Z_0) = 0$. If it holds that the matrix powers

$$\Pi_1, \quad \Pi_1^2 = \Pi_1\Pi_1, \quad \Pi_1^3 = \Pi_1\Pi_1\Pi_1, \quad\text{etc.} \tag{6.11}$$

converge to zero, then the effect of the initial value vanishes, $\Pi_1^t Z_0 \to 0$ as $t \to \infty$. In
this case, the first part is also convergent, such that

$$E(Z_t \mid Z_0) = (I_p + \Pi_1 + \Pi_1^2 + \ldots + \Pi_1^{t-1})\delta + \Pi_1^t Z_0 \to (I_p - \Pi_1)^{-1}\delta = E(Z_t), \tag{6.12}$$

similar to the result for univariate AR(1) models.

The question is how we can verify conditions that ensure that the matrix powers
in (6.11) converge. Recall that for the univariate case, $p = 1$, the stability condition
is that the autoregressive coefficient is smaller than one in absolute value, and we
want to find the corresponding condition for the case of a VAR model, $p > 1$.

To illustrate the difficulty, consider first an example:

Example 6.3 (convergence of matrix powers): Consider the case $p = 2$ and
the VAR coefficient matrix given by

$$\Pi_1 = \begin{pmatrix} a & b \\ c & d \end{pmatrix},$$

for some coefficients $\{a, b, c, d\}$. We find the matrix powers to be

$$\Pi_1^2 = \begin{pmatrix} a^2 + bc & ab + bd \\ ac + cd & d^2 + bc \end{pmatrix}$$
$$\Pi_1^3 = \begin{pmatrix} a^3 + 2abc + bcd & b(a^2 + ad + d^2 + bc) \\ c(a^2 + ad + d^2 + bc) & d^3 + 2bcd + abc \end{pmatrix}$$
$$\Pi_1^4 = \begin{pmatrix} a^4 + 3a^2bc + 2abcd + b^2c^2 + bcd^2 & b(a + d)(a^2 + d^2 + 2bc) \\ c(a + d)(a^2 + d^2 + 2bc) & a^2bc + 2abcd + b^2c^2 + 3bcd^2 + d^4 \end{pmatrix},$$

and it is not easy to find conditions on $\{a, b, c, d\}$ such that these matrices converge
towards zero.
Instead we use that a $p \times p$ matrix $\Pi_1$ can be written using the spectral
decomposition

$$\Pi_1 = V \Lambda V^{-1}, \tag{6.13}$$

with

$$V = (v_1, \ldots, v_i, \ldots, v_p) \qquad\text{and}\qquad \Lambda = \begin{pmatrix} \lambda_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \lambda_p \end{pmatrix}.$$

Here $\lambda_1, \lambda_2, \ldots, \lambda_p$ are the $p$ eigenvalues of $\Pi_1$, i.e. the $p$ solutions to the eigenvalue
problem

$$|\Pi_1 - \lambda_i I_p| = 0, \tag{6.14}$$

and $v_i$, $i = 1, 2, \ldots, p$, are the corresponding eigenvectors. The eigenvalue problem
in (6.14) is essentially a $p$-dimensional polynomial equation and the eigenvalues can
therefore be complex numbers. A brief introduction to eigenvalues and eigenvectors
is given in Appendix §6.A at the end of this chapter.

Based on the decomposition in (6.13) we observe that

$$\Pi_1^2 = \Pi_1 \Pi_1 = V \Lambda V^{-1} V \Lambda V^{-1} = V \Lambda^2 V^{-1}.$$

Similarly

$$\Pi_1^3 = \Pi_1 \Pi_1 \Pi_1 = V \Lambda V^{-1} V \Lambda V^{-1} V \Lambda V^{-1} = V \Lambda^3 V^{-1},$$

and in general

$$\Pi_1^k = V \Lambda^k V^{-1}.$$

Powers of diagonal matrices are simple,

$$\Lambda^k = \begin{pmatrix} \lambda_1^k & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \lambda_p^k \end{pmatrix},$$

and it is easy to see that $\Lambda^k$ is convergent, $\Lambda^k \to 0$ exponentially fast, if

$$|\lambda_i| < 1 \qquad\text{for } i = 1, 2, \ldots, p. \tag{6.15}$$

We conclude that the VAR(1) model, seen as a difference equation, is stable if the
eigenvalues are smaller than one in absolute value. Because the eigenvalues are
complex in general, we say that the eigenvalues have to be inside the complex unit circle.⁶

⁶ Formally, the derivation above is not totally general, because the spectral decomposition only
holds if all eigenvalues are distinct. If some eigenvalues, $\lambda_i$ and $\lambda_j$, coincide, however, there is a
slightly more general decomposition–known as the Jordan decomposition–that will give a similar
result.
Stability implies that the mean, variance, and higher order moments are constant,
and the process $Z_t$ is stationary. The condition on the eigenvalues also implies that
the autocorrelation function decays to zero exponentially fast, such that $Z_t$ is also
weakly dependent. We state the result as follows:

Theorem 6.1: The VAR(1) model in (6.8) is stable if the eigenvalues of $\Pi_1$ are
inside the unit circle. Then $Z_t$ is stationary and weakly dependent.

Example 6.4 (stable var): To illustrate the result, consider the VAR(1) model

$$\begin{pmatrix} z_{1t} \\ z_{2t} \end{pmatrix} = \begin{pmatrix} 0.7 & 0.1 \\ 0.3 & 0.8 \end{pmatrix} \begin{pmatrix} z_{1t-1} \\ z_{2t-1} \end{pmatrix} + \begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \end{pmatrix}.$$

The eigenvalue problem is then

$$|\Pi_1 - \lambda_i I_p| = \left|\begin{pmatrix} 0.7 & 0.1 \\ 0.3 & 0.8 \end{pmatrix} - \lambda_i \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\right|
= \begin{vmatrix} 0.7 - \lambda_i & 0.1 \\ 0.3 & 0.8 - \lambda_i \end{vmatrix}
= (0.7 - \lambda_i)(0.8 - \lambda_i) - 0.3 \cdot 0.1
= \lambda_i^2 - 1.5\lambda_i + 0.53.$$

The solutions are

$$\lambda_i = \begin{cases} 0.930 & \text{with } v = (0.398,\ 0.907)' \\ 0.570 & \text{with } v = (0.609,\ -0.793)', \end{cases}$$

and we conclude that this VAR is stable, such that $Z_t$ is a stationary process.

Example 6.5 (unstable var): Next consider the following, slightly modified, VAR(1),

$$\begin{pmatrix} z_{1t} \\ z_{2t} \end{pmatrix} = \begin{pmatrix} 0.7 & 0.1 \\ 0.3 & 0.9 \end{pmatrix} \begin{pmatrix} z_{1t-1} \\ z_{2t-1} \end{pmatrix} + \begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \end{pmatrix}.$$

The eigenvalue problem in this case is

$$|\Pi_1 - \lambda_i I_p| = (0.7 - \lambda_i)(0.9 - \lambda_i) - 0.3 \cdot 0.1 = \lambda_i^2 - 1.6\lambda_i + 0.6,$$

and the solutions are

$$\lambda_i = \begin{cases} 1.0 & \text{with } v = (0.316,\ 0.949)' \\ 0.6 & \text{with } v = (0.707,\ -0.707)'. \end{cases}$$

We conclude that this VAR(1) is not stable, because there is a root at unity. In
this case $Z_t$ is not stationary (in fact it is a unit root process and will behave like a
bivariate random walk).
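Checking the stability condition numerically is a one-line eigenvalue computation. A minimal Python sketch for the two coefficient matrices in Examples 6.4 and 6.5 could be:

```python
import numpy as np

Pi_64 = np.array([[0.7, 0.1],
                  [0.3, 0.8]])   # Example 6.4
Pi_65 = np.array([[0.7, 0.1],
                  [0.3, 0.9]])   # Example 6.5

for name, Pi in [("Example 6.4", Pi_64), ("Example 6.5", Pi_65)]:
    lam = np.linalg.eigvals(Pi)
    stable = np.all(np.abs(lam) < 1)   # condition (6.15): all |lambda_i| < 1
    print(name, np.round(lam, 3), "stable:", stable)
```

Because the eigenvalues can be complex, the modulus `np.abs` is the relevant magnitude.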
6.3.2 The VAR(k) Model

We want to generalize the stationarity condition for the VAR(1) to also cover the
case of VAR(k) models, $k > 1$. As an illustration, consider the case of a VAR(3),

$$Z_t = \delta + \Pi_1 Z_{t-1} + \Pi_2 Z_{t-2} + \Pi_3 Z_{t-3} + \epsilon_t. \tag{6.16}$$

The results above did not depend on the error term, $\epsilon_t$, nor on the constant, $\delta$. We
therefore write the VAR(3) as a VAR(1) and apply the result. In particular, we set
$W_t = (Z_t', Z_{t-1}', Z_{t-2}')' \in \mathbb{R}^{3p}$ and look at the companion form

$$\underbrace{\begin{pmatrix} Z_t \\ Z_{t-1} \\ Z_{t-2} \end{pmatrix}}_{W_t} = \underbrace{\begin{pmatrix} \delta \\ 0 \\ 0 \end{pmatrix}}_{\tilde\delta} + \underbrace{\begin{pmatrix} \Pi_1 & \Pi_2 & \Pi_3 \\ I_p & 0 & 0 \\ 0 & I_p & 0 \end{pmatrix}}_{\tilde\Pi_1} \underbrace{\begin{pmatrix} Z_{t-1} \\ Z_{t-2} \\ Z_{t-3} \end{pmatrix}}_{W_{t-1}} + \underbrace{\begin{pmatrix} \epsilon_t \\ 0 \\ 0 \end{pmatrix}}_{\tilde\epsilon_t}. \tag{6.17}$$

Here the first equation replicates the VAR(3), while the next two equations are
identities, $Z_{t-1} = Z_{t-1}$ and $Z_{t-2} = Z_{t-2}$. The new equation for $W_t$ is a VAR(1) of
dimension $3p$, which is stable if the eigenvalues of the companion matrix, $\tilde\Pi_1$, are
inside the unit circle. We state the result as follows:

Theorem 6.2 (stability of a var(k)): The VAR(k) model in (6.3) is stable if
the eigenvalues of the companion matrix

$$\begin{pmatrix} \Pi_1 & \Pi_2 & \Pi_3 & \cdots & \Pi_k \\ I_p & 0 & 0 & \cdots & 0 \\ 0 & I_p & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & I_p & 0 \end{pmatrix}$$

are inside the unit circle. Then $Z_t$ is stationary and weakly dependent. In this case,
the moving-average representation is given by

$$Z_t = \epsilon_t + C_1 \epsilon_{t-1} + C_2 \epsilon_{t-2} + \ldots + C_{t-1} \epsilon_1 + C_0, \tag{6.18}$$

where $C_0$ is a function of the constant term and the initial values and the sequence
of matrices $\{C_1, C_2, C_3, \ldots\}$ converges to zero exponentially fast.
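A sketch of the companion-matrix check for a general VAR(k); the function names are mine, introduced for the illustration:

```python
import numpy as np

def companion_matrix(Pi):
    """Stack the p x p matrices Pi_1, ..., Pi_k of a VAR(k) into the
    (kp x kp) companion matrix of Theorem 6.2."""
    k, p = len(Pi), Pi[0].shape[0]
    top = np.hstack(Pi)                                            # block row (Pi_1, ..., Pi_k)
    below = np.hstack([np.eye(p * (k - 1)), np.zeros((p * (k - 1), p))])
    return np.vstack([top, below])

def is_stable(Pi):
    """Stability: all eigenvalues of the companion matrix inside the unit circle."""
    return bool(np.all(np.abs(np.linalg.eigvals(companion_matrix(Pi))) < 1))
```

For $k = 1$ the function reduces to the eigenvalue check of Theorem 6.1.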

6.4 Conditioning and Single-Equations

Consider again the bivariate case $Z_t = (y_t, x_t)'$. The VAR model for $Z_t$ represents
the (Gaussian) density

$$f(y_t, x_t \mid I_{t-1}; \theta),$$

where $I_{t-1} = \{y_{t-1}, x_{t-1}, \ldots, y_{t-k}, x_{t-k}\}$ denotes the information set available at time
$t-1$. Remember that we can always factorize the joint density (omitting the reference
to parameters for simplicity)

$$f(y_t, x_t \mid I_{t-1}) = f(y_t \mid x_t, I_{t-1}) \cdot f(x_t \mid I_{t-1}), \tag{6.19}$$

corresponding to a conditional density, $f(y_t \mid x_t, I_{t-1})$, and a marginal density,
$f(x_t \mid I_{t-1})$. To understand the form of the conditional and marginal model, we state the
following results for conditioning in a multivariate Gaussian distribution:

Theorem 6.3 (conditioning in a gaussian distribution): Consider the multivariate
Gaussian distribution as given by

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \stackrel{d}{=} N(\mu, \Sigma) \qquad\text{with}\qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \quad\text{and}\quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$

and $\Sigma_{21} = \Sigma_{12}'$ from symmetry. It holds that the conditional process is Gaussian,

$$x_1 \mid x_2 \stackrel{d}{=} N(\mu_{1.2}, \Sigma_{1.2}),$$

where the conditional mean is a linear function of the conditioning set,

$$\mu_{1.2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2) = (\mu_1 - \Sigma_{12}\Sigma_{22}^{-1}\mu_2) + \Sigma_{12}\Sigma_{22}^{-1}x_2,$$

and $\omega = \Sigma_{12}\Sigma_{22}^{-1}$ is the OLS regression coefficient. In addition,

$$\Sigma_{1.2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

We can write the model as

$$\begin{pmatrix} x_1 - \omega x_2 \\ x_2 \end{pmatrix} = \begin{pmatrix} \mu_1 - \omega\mu_2 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \epsilon_1 - \omega\epsilon_2 \\ \epsilon_2 \end{pmatrix},$$

or

$$\begin{pmatrix} 1 & -\omega \\ 0 & 1 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} \mu_1 - \omega\mu_2 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \end{pmatrix},$$

with $\epsilon_i = x_i - \mu_i$, $u_1 = \epsilon_1 - \omega\epsilon_2$ and $u_2 = \epsilon_2$, where

$$\mathrm{cov}(u_1, u_2) = E(u_1 u_2) = E((\epsilon_1 - \omega\epsilon_2)\epsilon_2) = \Sigma_{12} - \omega\Sigma_{22} = \Sigma_{12} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22} = 0,$$

by construction.

For the Gaussian VAR model, observe that the conditional density corresponds
to an ADL model and that the marginal model is just a VAR equation. This
decomposition allows us to reinterpret the covariance $\omega_{12}$ as a causal regression effect,

$$\frac{\partial E(y_t \mid x_t, I_{t-1})}{\partial x_t} = \omega = \omega_{12}\omega_{22}^{-1}. \tag{6.20}$$

The causal direction in (6.19), $x_t \rightarrow y_t$, is postulated, however, and this structural
form is controversial. We could equally well have postulated the opposite decomposition,
implying the reverse causal interpretation, $y_t \rightarrow x_t$, and in this case the
regression effect would be $\omega_{21}\omega_{11}^{-1}$.

Example 6.6 (conditional modelling): To illustrate, consider consumption and
income from Example 6.2, i.e. $Z_t = (\Delta_4 c_t, \Delta_4 y_t)'$. We estimate the unrestricted
reduced form of a VAR(2) (the full estimation output is not reproduced here). The
estimated covariance matrix is given by

$$\hat\Omega = \begin{pmatrix} \hat\omega_{11} & \hat\omega_{12} \\ \hat\omega_{21} & \hat\omega_{22} \end{pmatrix} = \begin{pmatrix} 4.7485 & 1.3810 \\ 1.3810 & 9.5685 \end{pmatrix} \cdot 10^{-4}.$$

Now we assume that the causal direction runs from income to consumption,
$\Delta_4 y_t \rightarrow \Delta_4 c_t$, and we include income, $\Delta_4 y_t$, as a regressor in the equation for
consumption, $\Delta_4 c_t$, estimated by OLS (again, the output is not reproduced here).

Firstly, observe that the likelihood of the new model is the same as the likelihood
of the reduced form; it is just the result of a different factorization. Secondly, the
equation for income is unchanged; this is just the marginal equation from the reduced
form. Thirdly, the equation for consumption is now conditional on income in the same
period, and is therefore an ADL model. The coefficient to income is a function of the
covariance in the reduced form,

$$\hat\omega = \frac{\hat\omega_{12}}{\hat\omega_{22}} = \frac{1.3810}{9.5685} = 0.14433,$$

and we would now interpret the correlation as a causal regression effect. Finally, the
error terms in the two equations are now uncorrelated by construction,

$$\hat\Omega = \begin{pmatrix} 4.5493 & 0 \\ 0 & 9.5685 \end{pmatrix} \cdot 10^{-4},$$

cf. the results for conditioning in a Gaussian distribution.
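The mapping from the reduced-form covariance to the conditional model is simple arithmetic; a sketch replicating the numbers in Example 6.6:

```python
import numpy as np

Omega = np.array([[4.7485, 1.3810],
                  [1.3810, 9.5685]]) * 1e-4   # reduced-form covariance from Example 6.6

omega = Omega[0, 1] / Omega[1, 1]             # conditional (ADL) coefficient: 0.14433
var_cond = Omega[0, 0] - Omega[0, 1] ** 2 / Omega[1, 1]   # conditional variance, approx. 4.5493e-4

print(round(omega, 5), var_cond)
```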

The relationship between the VAR model and the conditional ADL model illustrates
the conditions under which the conditional ADL is a useful framework for empirical
modelling. In particular, we need three main assumptions:

(1) We need to know the contemporaneous causal direction, e.g. $x_t \rightarrow y_t$.
(2) We are only interested in the coefficients in the conditional model, $f(y_t \mid x_t, I_{t-1}; \varphi)$, and not in the coefficients in the marginal model, $f(x_t \mid I_{t-1}; \lambda)$.
(3) The variable of interest is univariate, $y_t \in \mathbb{R}$.

If these are not fulfilled, we should keep the VAR model to avoid losing information.
If the first two conditions are satisfied, but the variable of interest is a vector, $y_t \in \mathbb{R}^p$,
we could estimate a VAR-X model as in Remark 6.2.

6.5 Estimation and Inference

To discuss estimation of the VAR model, consider $Z_t = (z_{1t}, \ldots, z_{pt})' \in \mathbb{R}^p$ and the
VAR(k) with $k = 1$:

$$Z_t = \delta + \Pi_1 Z_{t-1} + \epsilon_t, \qquad t = 1, 2, \ldots, T. \tag{6.21}$$

To formulate the likelihood function, we assume a Gaussian distribution,

$$\epsilon_t \mid Z_{t-1} \stackrel{d}{=} N(0, \Omega),$$

such that the full set of parameters is given by $\theta = \{\delta, \Pi_1, \Omega\}$.
Similar to the univariate case, we use sequential factorization of the joint density,
$f(Z_1, \ldots, Z_T \mid Z_0; \theta) = \prod_{t=1}^T f(Z_t \mid Z_{t-1}; \theta)$, such that the log-likelihood function is
given by

$$\log L(\theta) = \sum_{t=1}^T \log \ell_t(\theta) = \sum_{t=1}^T \log f(Z_t \mid Z_{t-1}; \theta).$$

The multivariate Gaussian density is given by

$$f(Z_t \mid Z_{t-1}; \theta) = (2\pi)^{-\frac{p}{2}} |\Omega|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(Z_t - \delta - \Pi_1 Z_{t-1})' \Omega^{-1} (Z_t - \delta - \Pi_1 Z_{t-1})\right),$$

such that the log-likelihood function is

$$\log L(\theta) = -\frac{Tp}{2}\log(2\pi) - \frac{T}{2}\log|\Omega| - \frac{1}{2}\sum_{t=1}^T (Z_t - \delta - \Pi_1 Z_{t-1})' \Omega^{-1} (Z_t - \delta - \Pi_1 Z_{t-1}).$$

Solving the first order condition for the maximum of the likelihood function, we
find that

$$(\hat\delta, \hat\Pi_1) = \left(\sum_{t=1}^T Z_t \tilde Z_t'\right)\left(\sum_{t=1}^T \tilde Z_t \tilde Z_t'\right)^{-1},$$

with $\tilde Z_t = (1, Z_{t-1}')'$. As in the AR(1) case we recognize this as the OLS estimator.
Similarly, the estimator of the residual covariance matrix is given by

$$\hat\Omega = \frac{1}{T}\sum_{t=1}^T \left(Z_t - \hat\delta - \hat\Pi_1 Z_{t-1}\right)\left(Z_t - \hat\delta - \hat\Pi_1 Z_{t-1}\right)'.$$

The VAR model is just a multi-equation regression model and it should not be
surprising that the Gaussian maximum likelihood estimator coincides with OLS.
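As a sketch, the OLS/ML computation for a VAR(1) can be written in a few lines, given a $T \times p$ data matrix Z (the function and variable names are mine):

```python
import numpy as np

def estimate_var1(Z):
    """OLS (= Gaussian ML) estimation of Z_t = delta + Pi_1 Z_{t-1} + eps_t."""
    Y = Z[1:]                                        # Z_t for t = 1, ..., T
    X = np.hstack([np.ones((len(Y), 1)), Z[:-1]])    # Z_tilde_t = (1, Z_{t-1}')'
    B = np.linalg.lstsq(X, Y, rcond=None)[0]         # (1+p) x p stack of (delta, Pi_1)
    delta, Pi1 = B[0], B[1:].T
    resid = Y - X @ B
    Omega = resid.T @ resid / len(Y)                 # ML estimator, no df correction
    return delta, Pi1, Omega
```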

6.5.1 Inference

If the VAR model is stable, such that $Z_t$ is a stationary and weakly dependent process,
the maximum likelihood estimators are normally distributed, and test statistics will
have standard normal and $\chi^2$ distributions. In particular, the following result holds:

Theorem 6.4 (distribution of the MLE): Consider the VAR(k) model in (6.3).
If all eigenvalues of the companion matrix are inside the unit circle, and if $\epsilon_t$ is
independently and identically distributed, i.i.d.$(0, \Omega)$, then

$$\sqrt{T}\,\mathrm{vec}\left((\hat\delta, \hat\Pi_1, \ldots, \hat\Pi_k) - (\delta, \Pi_1, \ldots, \Pi_k)\right) \stackrel{d}{\to} N(0, W \otimes \Omega),$$

where $W = E(\tilde Z_t \tilde Z_t')^{-1}$ with $\tilde Z_t = (1, Z_{t-1}', \ldots, Z_{t-k}')'$.

The notation is slightly involved. In particular, it uses the vectorization function,
$\mathrm{vec}(A)$, which stacks the columns of the matrix $A$ into a long vector. This vector has
a Gaussian distribution with a variance formulated in terms of the Kronecker product,
$\otimes$, with the property that for an $m \times n$ matrix $A$ and a $p \times q$ matrix $B$,

$$A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix}.$$

We do not cover all details here; we just note that the result implies that all the
usual $t$-ratios are distributed as $N(0, 1)$ and all Wald tests or likelihood ratio tests
for hypotheses on the parameters have limiting $\chi^2$ distributions.

Example 6.7 (lag-length determination): Consider a case with $Z_t \in \mathbb{R}^3$, which
is assumed to be stationary, and the VAR(3) model

$$Z_t = \delta + \Pi_1 Z_{t-1} + \Pi_2 Z_{t-2} + \Pi_3 Z_{t-3} + \epsilon_t,$$

for $t = 1, 2, \ldots, T$. To test whether a VAR(2) is sufficient for the data, we could
estimate the VAR(2) model for the same effective sample,

$$Z_t = \delta + \Pi_1 Z_{t-1} + \Pi_2 Z_{t-2} + \epsilon_t,$$

and calculate the likelihood ratio statistic $LR(k=2 \mid k=3)$ as twice the difference
in log-likelihoods. Because $\hat\Pi_3$ is asymptotically Gaussian, we have that if the null
hypothesis is true,

$$LR(k=2 \mid k=3) \stackrel{d}{\to} \chi^2(9),$$

where the degrees of freedom equal the number of restrictions imposed, i.e. the number
of parameters in the $3 \times 3$ matrix $\Pi_3$. We could also estimate a VAR(1) model,

$$Z_t = \delta + \Pi_1 Z_{t-1} + \epsilon_t.$$

Now it would hold that

$$LR(k=1 \mid k=2) \stackrel{d}{\to} \chi^2(9),$$

and, likewise,

$$LR(k=1 \mid k=3) \stackrel{d}{\to} \chi^2(18).$$

This is totally parallel to the univariate case.
6.6 Impulse-Responses and Structural VARs 143

Example 6.8 (lag-length determination): For the bivariate consumption and
income model in Example 6.6, we estimated a VAR(2) and got a log-likelihood value of
816.989. Estimating also a VAR(1) model (output not reproduced here) yields a
log-likelihood of 815.924. The likelihood ratio statistic is therefore given by

$$LR(k=1 \mid k=2) = 2 \cdot (816.989 - 815.924) = 2.13.$$

This is not significant in a $\chi^2(4)$, and we conclude that a VAR(1) is enough to describe
the dynamics of the bivariate system.
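The corresponding p-value is easily computed; a sketch using scipy:

```python
from scipy.stats import chi2

loglik_var2, loglik_var1 = 816.989, 815.924
LR = 2 * (loglik_var2 - loglik_var1)      # = 2.13
p_value = chi2.sf(LR, df=4)               # 4 restrictions: the 2 x 2 matrix Pi_2 = 0
print(round(LR, 2), round(p_value, 2))    # p-value approx. 0.71, so the VAR(1) is not rejected
```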

Remark 6.3 (robust inference): As in the univariate case, inference in the Wald
tests in the VAR model can be made robust to heteroskedasticity and non-normality
by using the robust variance of the coefficient estimates, i.e. the sandwich formula.
Robust standard errors are routinely supplied by most software packages.

Remark 6.4 (misspecification testing): The misspecification tests known from
the univariate autoregressions can also be generalized to the case of VAR models, and
they should routinely be considered to ensure that the model assumptions are satisfied.

6.6 Impulse-Responses and Structural VARs

Consider the reduced form VAR model,

$$Z_t = \delta + \Pi_1 Z_{t-1} + \Pi_2 Z_{t-2} + \ldots + \Pi_k Z_{t-k} + \epsilon_t, \qquad \epsilon_t \mid I_{t-1} \stackrel{d}{=} N(0, \Omega),$$

where the contemporaneous effects, i.e. the effects within one period, are captured
in the covariance matrix, $\Omega$.

The dynamic interpretation of the conditional mean parameters, $\{\delta, \Pi_1, \ldots, \Pi_k\}$,
is sometimes difficult, and a simple idea is to illustrate the dynamic properties of
the system using an impulse-response analysis. We make a shock to $\epsilon_t$, and look at
the dynamic propagation to the variables over time. Recalling the moving average
representation in (6.18),

$$Z_t = \epsilon_t + C_1 \epsilon_{t-1} + C_2 \epsilon_{t-2} + \ldots + C_{t-1} \epsilon_1 + C_0, \tag{6.22}$$

it is straightforward to see that the impulse-responses are just the moving average
coefficients,

$$\frac{\partial Z_t}{\partial \epsilon_t'} = I_p, \qquad \frac{\partial Z_{t+1}}{\partial \epsilon_t'} = C_1, \qquad \frac{\partial Z_{t+2}}{\partial \epsilon_t'} = C_2, \quad \ldots \tag{6.23}$$

This is typically presented as $p \times p$ graphs called impulse-response functions.
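For a VAR(1), the moving-average coefficients are just matrix powers, $C_i = \Pi_1^i$, so the reduced-form responses in (6.23) can be computed recursively; a minimal sketch:

```python
import numpy as np

def reduced_form_irf(Pi1, horizons=20):
    """Reduced-form impulse responses C_0 = I_p, C_i = Pi_1^i for a VAR(1), cf. (6.23)."""
    C = [np.eye(Pi1.shape[0])]            # impact response is the identity matrix
    for _ in range(horizons):
        C.append(Pi1 @ C[-1])             # C_i = Pi_1 C_{i-1}
    return np.array(C)                    # entry [i, j, k]: response of variable j at
                                          # horizon i to a unit shock in equation k
```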
The reduced form impulse-response function is useful in order to illustrate the
speed of transition in the system, but the economic interpretation is a little more
complicated. The problem is that if $\Omega$ is not diagonal, then a shock to $z_{1t}$ alone, i.e.
$e_1 = (1, 0, \ldots, 0)'$, is not necessarily reasonable, because this shock would historically
be accompanied also by a shock to the other variables. In our example with consumption
and income, $Z_t = (\Delta_4 c_t, \Delta_4 y_t)'$, we may imagine an unexpected shock to income,
$(0, 1)'$, and look at the propagation. Such a shock, however, has historically been
associated with a contemporaneous change also in consumption, which we do not allow
for in our analysis.
Most impulse-response analyses therefore prefer to find a parametrization of the
model where the covariance is diagonal, such that the simple shocks make sense
historically. Recall from §6.4 that this can be accomplished using conditioning,
corresponding to the so-called structural form

$$A_0 Z_t = a + A_1 Z_{t-1} + A_2 Z_{t-2} + \ldots + A_k Z_{t-k} + u_t, \tag{6.24}$$

where $a = A_0\delta$, $A_1 = A_0\Pi_1$, $A_2 = A_0\Pi_2, \ldots, A_k = A_0\Pi_k$, and the covariance of $u_t$
is diagonal, see Example 6.6. If the system has more than two variables, $p > 2$, we
use sequential conditioning, which corresponds to $A_0$ being lower triangular. This
is called a causal chain and is identical to a certain causal order of the variables.
For the $p = 3$ dimensional system in Example 6.2 we could choose the ordering of
the variables given by $Z_t = (\Delta_4 y_t, \Delta_4 w_t, \Delta_4 c_t)'$, and the causal chain would give a
sequential VAR(1) model

$$\underbrace{\begin{pmatrix} 1 & 0 & 0 \\ -\omega_{21} & 1 & 0 \\ -\omega_{31} & -\omega_{32} & 1 \end{pmatrix}}_{A_0}
\begin{pmatrix} \Delta_4 y_t \\ \Delta_4 w_t \\ \Delta_4 c_t \end{pmatrix} =
\underbrace{\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}}_{A_1}
\begin{pmatrix} \Delta_4 y_{t-1} \\ \Delta_4 w_{t-1} \\ \Delta_4 c_{t-1} \end{pmatrix} +
\begin{pmatrix} u_{1t} \\ u_{2t} \\ u_{3t} \end{pmatrix},$$

such that we allow $\Delta_4 y_t \rightarrow \Delta_4 w_t$ and $(\Delta_4 y_t, \Delta_4 w_t) \rightarrow \Delta_4 c_t$, but no contemporaneous
effect from consumption to income and wealth. To interpret, we may write the
equations as

$$\Delta_4 y_t = (A_{11}, A_{12}, A_{13}) Z_{t-1} + u_{1t}$$
$$\Delta_4 w_t = \omega_{21}\Delta_4 y_t + (A_{21}, A_{22}, A_{23}) Z_{t-1} + u_{2t}$$
$$\Delta_4 c_t = \omega_{31}\Delta_4 y_t + \omega_{32}\Delta_4 w_t + (A_{31}, A_{32}, A_{33}) Z_{t-1} + u_{3t}.$$

The model is now called a structural VAR (SVAR) and most impulse-response
analyses are based on SVARs. Because the shocks to the SVAR are orthogonal by
construction, $\mathrm{cov}(u_{it}, u_{jt}) = 0$ for $i \neq j$, the impulse-responses from the SVAR are
often called the orthogonalized impulse responses or the structural impulse responses.

Technically, we can estimate the SVAR coefficients by OLS in the conditional
equations, or we can choose $A_0 = D^{-1}$, where $D$ is the Choleski decomposition of the
covariance matrix, i.e. the lower triangular matrix such that $\Omega = DD'$.

The causal chain is easy to implement and has the property that the impulse-responses
are consistent with the observed historical covariance structure. On the
other hand, the ordering of the variables may be controversial, because it corresponds
to an assumed causal structure. For $p = 3$ variables there are $3! = 6$ different causal
orderings. Some of these may be ruled out by economic reasoning, but there could
be more than one relevant candidate left, and their implied impulse-responses would
differ.
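A sketch of the orthogonalized responses based on the Choleski factorization $\Omega = DD'$, reusing the reduced_form_irf function from the sketch above:

```python
import numpy as np

def orthogonalized_irf(Pi1, Omega, horizons=20):
    """Structural responses B_i = C_i D for a VAR(1), where D = chol(Omega),
    so that A_0 = D^{-1}. Column k of B_i is the response at horizon i to a
    one-standard-deviation structural shock u_k."""
    D = np.linalg.cholesky(Omega)              # lower triangular, Omega = D D'
    C = reduced_form_irf(Pi1, horizons)        # reduced-form coefficients C_i
    return np.array([Ci @ D for Ci in C])
```

Changing the ordering of the variables before the factorization corresponds to choosing a different causal chain.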

Example 6.9 (danish consumption): Consider again the $p = 3$ variables in
Example 6.2, with the ordering given by

$$Z_t = (\Delta_4 y_t, \Delta_4 w_t, \Delta_4 c_t)'.$$

The reduced form impulse-responses are given in red in Figure 6.1, where the shock is
scaled to have a magnitude of one standard deviation. This is common practice, because
a one-standard-deviation shock can be interpreted as a shock of a normal magnitude.

We observe that the impact in the first period of the reduced-form impulse-responses
is the unit matrix, $I_3$ (in the graph scaled by the standard deviations of the
shocks), assuming no contemporaneous impacts within the first quarter.

Next, in blue in Figure 6.1, are given the orthogonalized impulse-responses based
on the assumed causal chain. Here we see that the shock to income has contemporaneous
effects on wealth and consumption, while the shock to wealth has a contemporaneous
effect on consumption, corresponding to the triangular system. The magnitude
of the shock is again one standard deviation. The orthogonalized impulse responses
correspond to simple impulse responses in a structural VAR (estimation output not
reproduced here) in which the covariance matrix is diagonal. We note that the
contemporaneous effect from income to consumption is positive. In this case, the
differences between the impulse-response functions are minor.

Remark 6.5 (inference on impulse responses): Impulse responses are functions
of the parameters, and the uncertainty on the impulse responses can be derived
from the covariance matrix of the estimated parameters. This allows construction of
confidence bands on the impulse responses, such that their significance can be judged.

Remark 6.6 (interpretation of structural shocks): Structural impulses can
be seen as the response to a shock to a single error, $\epsilon_{1t}$ say, allowing for contemporaneous
effects in $z_{1t}$ as well as in the other variables, $z_{2t}, \ldots, z_{pt}$, see the Danish
consumption example above.

Alternatively, we can think of the structural impulse responses as the effect on the
variables in the system of a shock to a particular combined error term–known as a
structural shock. As an example, consider the reduced form VAR(1)

$$Z_t = \delta + \Pi_1 Z_{t-1} + \epsilon_t, \tag{6.25}$$

with impulse responses given by the coefficients in the moving average representation,

$$Z_t = \epsilon_t + C_1 \epsilon_{t-1} + C_2 \epsilon_{t-2} + \ldots + C_{t-1} \epsilon_1 + C_0, \tag{6.26}$$
[Figure 6.1 here: a 3 x 3 panel of impulse-response functions showing the effects on
$\Delta_4 y_t$, $\Delta_4 w_t$ and $\Delta_4 c_t$ of unexpected shocks to each of the three equations, with
reduced form and structural form responses plotted together.]

Figure 6.1: Reduced form (red) and orthogonalized (blue) impulse-responses. Effects
on the $p = 3$ variables from unexpected shocks to the $p = 3$ equations.

i.e. $\{I_p, C_1, C_2, \ldots\}$ with $C_i = \Pi_1^i$, $i = 1, 2, \ldots$

The corresponding structural form is given by

$$A_0 Z_t = A_0\delta + A_0\Pi_1 Z_{t-1} + A_0\epsilon_t = a + A_1 Z_{t-1} + u_t, \tag{6.27}$$

where the structural shocks, $u_t$, have the property that they are contemporaneously
uncorrelated, $E(u_t u_t') = I_p$. It holds that

$$u_t = A_0 \epsilon_t \qquad\text{or}\qquad \epsilon_t = A_0^{-1} u_t, \tag{6.28}$$

such that the orthogonal impulse responses are given by the coefficients in

$$Z_t = A_0^{-1} u_t + C_1 A_0^{-1} u_{t-1} + C_2 A_0^{-1} u_{t-2} + \ldots + C_{t-1} A_0^{-1} u_1 + C_0
= B_0 u_t + B_1 u_{t-1} + B_2 u_{t-2} + \ldots + B_{t-1} u_1 + C_0, \tag{6.29}$$

i.e. the sequence $\{B_0, B_1, B_2, \ldots\}$ with $B_i = C_i A_0^{-1}$ for $i = 0, 1, 2, \ldots$

The structural shocks are linear combinations of the reduced form residuals, $u_t = A_0\epsilon_t$,
and from their construction and their impact on the variables in the system,
researchers are sometimes able to relate them to shocks suggested by economic theory,
e.g. demand shocks, supply shocks, productivity shocks, monetary policy shocks,
etc.
6.7 Forecasting

Because the VAR model is a reduced form, it is straightforward to use it for forecasting
future values of the time series, $Z_{T+h}$ for $h = 1, 2, \ldots$, given the information up
to time $T$. Formally we define the information set available at time $T$ as $I_T =
\{Z_T, Z_{T-1}, Z_{T-2}, \ldots\}$, and we want to find the prediction in terms of the conditional
expectation

$$Z_{T+h|T} = E(Z_{T+h} \mid I_T).$$

This is similar to forecasts based on univariate autoregressive models.

As an illustration, consider the VAR(1) model,

$$Z_t = \delta + \Pi_1 Z_{t-1} + \epsilon_t, \qquad t = 1, 2, \ldots, T.$$

To forecast observation $T+1$, we use

$$Z_{T+1|T} = E(Z_{T+1} \mid I_T) = E(\delta + \Pi_1 Z_T + \epsilon_{T+1} \mid I_T) = \delta + \Pi_1 Z_T.$$

Likewise, we get

$$Z_{T+2|T} = E(Z_{T+2} \mid I_T) = E(\delta + \Pi_1 Z_{T+1} + \epsilon_{T+2} \mid I_T) = \delta + \Pi_1 Z_{T+1|T}$$
$$Z_{T+3|T} = E(Z_{T+3} \mid I_T) = E(\delta + \Pi_1 Z_{T+2} + \epsilon_{T+3} \mid I_T) = \delta + \Pi_1 Z_{T+2|T},$$

where we have used $E(Z_{T+1} \mid I_T) = Z_{T+1|T}$ and $E(Z_{T+2} \mid I_T) = Z_{T+2|T}$ to get the
forecast recursion. For the stationary model, the forecasts will converge exponentially
towards the unconditional expectation, $(I_p - \Pi_1)^{-1}\delta$.

To implement forecasting in practice, we replace the true parameters with the
estimates, $\hat\delta$ and $\hat\Pi_1$.

We can also find the forecast errors and the corresponding forecast error variance.
For the one step ahead forecast, we get the forecast error

$$\nu_1 = Z_{T+1} - Z_{T+1|T} = \epsilon_{T+1},$$

and the forecast error variance,

$$\mathrm{FEV}(1) = E(\nu_1 \nu_1' \mid I_T) = E(\epsilon_{T+1}\epsilon_{T+1}' \mid I_T) = \Omega.$$

For longer forecast horizons, the forecast error can be found from the moving average
representation as in the univariate case in §4.7.1, and for horizon $h$ we find

$$\nu_h = Z_{T+h} - Z_{T+h|T} = \epsilon_{T+h} + C_1 \epsilon_{T+h-1} + C_2 \epsilon_{T+h-2} + \ldots + C_{h-1} \epsilon_{T+1},$$
such that the variances of the forecast errors are given by

$$\mathrm{FEV}(1) = \Omega$$
$$\mathrm{FEV}(2) = \Omega + C_1 \Omega C_1'$$
$$\vdots$$
$$\mathrm{FEV}(h) = \Omega + C_1 \Omega C_1' + C_2 \Omega C_2' + \ldots + C_{h-1} \Omega C_{h-1}'.$$

Under the assumption that the error term is normally distributed, we may produce
95% confidence bounds for the forecasts as the point forecast, $Z_{T+h|T}$, $\pm 1.96$
times the standard deviation of the forecasts, i.e. the square root of the diagonal
elements of $\mathrm{FEV}(h)$. Alternatively we may give a full distribution of the forecasts as
$N(Z_{T+h|T}, \mathrm{FEV}(h))$.
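A sketch of the forecast recursion and the forecast error variances for an estimated VAR(1) (the function name is mine):

```python
import numpy as np

def var1_forecast(Z_T, delta, Pi1, Omega, horizons):
    """Point forecasts Z_{T+h|T} and forecast error variances FEV(h) for a VAR(1)."""
    forecasts, fev = [], []
    Z_hat, V, C = Z_T, np.zeros_like(Omega), np.eye(len(Z_T))
    for _ in range(horizons):
        Z_hat = delta + Pi1 @ Z_hat      # Z_{T+h|T} = delta + Pi_1 Z_{T+h-1|T}
        V = V + C @ Omega @ C.T          # FEV(h) = FEV(h-1) + C_{h-1} Omega C_{h-1}'
        C = Pi1 @ C                      # next moving-average coefficient
        forecasts.append(Z_hat)
        fev.append(V.copy())
    return np.array(forecasts), np.array(fev)
```

The 95% bounds then follow as the point forecast plus/minus 1.96 times the square root of the diagonal of each FEV(h).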

Example 6.10 (forecasts): Forecasts for 24 quarters and the corresponding
confidence fans for the bivariate consumption and income model in Example 6.8 are
reported in Figure 6.2.

Remark 6.7 (forecast error variance decomposition): For the structural VAR
model, formulated in terms of $u_t = A_0\epsilon_t$ with covariance $I_p$ instead of $\epsilon_t$ with
covariance $\Omega$, the forecast error can be written as

$$\nu_h = Z_{T+h} - Z_{T+h|T} = B_0 u_{T+h} + B_1 u_{T+h-1} + B_2 u_{T+h-2} + \ldots + B_{h-1} u_{T+1}, \tag{6.30}$$

and the forecast error variance for horizon $h$ is given by

$$\mathrm{FEV}(h) = B_0 B_0' + B_1 B_1' + B_2 B_2' + \ldots + B_{h-1} B_{h-1}'. \tag{6.31}$$

For the variable $z_j$, $j = 1, 2, \ldots, p$, this is equal to

$$\mathrm{FEV}_j(h) = \sum_{i=0}^{h-1} \left(B_{i,j1}^2 + B_{i,j2}^2 + \ldots + B_{i,jp}^2\right),$$

where $B_{i,jk}$ is element $(j, k)$ in the matrix $B_i$. We may also calculate the contribution
from the structural shock, $u_k$, $k = 1, 2, \ldots, p$, as

$$\mathrm{FEV}_j^k(h) = \sum_{i=0}^{h-1} B_{i,jk}^2, \tag{6.32}$$

and we have the decomposition

$$\frac{\mathrm{FEV}_j^1(h)}{\mathrm{FEV}_j(h)} + \frac{\mathrm{FEV}_j^2(h)}{\mathrm{FEV}_j(h)} + \ldots + \frac{\mathrm{FEV}_j^p(h)}{\mathrm{FEV}_j(h)} = 1. \tag{6.33}$$

This so-called forecast error variance decomposition can be used to assess the importance
of different structural shocks to the forecast error variance.
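Given the structural moving-average matrices $B_i$, the decomposition in (6.32)-(6.33) is a few lines of code; a sketch:

```python
import numpy as np

def fevd(B, h):
    """Forecast error variance decomposition at horizon h.
    B: array of structural MA matrices B_0, ..., B_{h-1}, shape (h, p, p).
    Returns a p x p matrix whose row j gives the shares of shocks u_1, ..., u_p
    in the forecast error variance of variable j; rows sum to one, cf. (6.33)."""
    contrib = np.sum(B[:h] ** 2, axis=0)               # entry (j, k): sum_i B_{i,jk}^2
    return contrib / contrib.sum(axis=1, keepdims=True)
```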
[Figure 6.2 here: two panels, (A) private consumption and (B) income, showing the
observed yearly growth rates together with dynamic forecasts and confidence fans at
the end of the sample.]

Figure 6.2: Dynamic forecasts for the yearly growth rates in consumption, $\Delta_4 c_t$,
and income, $\Delta_4 y_t$.

6.8 Granger Causality

The discussion of causality within the framework of the structural VAR was related to
identification of simultaneous effects and interpreting symmetric correlation as causal
regression effects. A different notion of causality was suggested by Granger (1969)
with reference to forecasts and forecast errors. We give the definition for the bivariate
case:

Definition 6.1: Let $\nu_h$ be the forecast error obtained by forecasting $y_{T+h}$ based on
$I_T = \{y_T, x_T, y_{T-1}, x_{T-1}, \ldots\}$ and let $\tilde\nu_h$ be the forecast error obtained by forecasting
$y_{T+h}$ based on only $\tilde I_T = \{y_T, y_{T-1}, \ldots\}$. The series $x$ is said to be Granger causal for
$y$ if for some forecasting horizon $h$ it holds that

$$E(\nu_h^2 \mid I_T) < E(\tilde\nu_h^2 \mid I_T).$$

This definition of causality is based on the assertion that the cause comes prior
to the effect and that the causal variable has to be helpful in forecasting. Consider a
bivariate VAR(2) model like

$$\begin{pmatrix} y_t \\ x_t \end{pmatrix} = \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix} + \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}\begin{pmatrix} y_{t-1} \\ x_{t-1} \end{pmatrix} + \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}\begin{pmatrix} y_{t-2} \\ x_{t-2} \end{pmatrix} + \begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \end{pmatrix}.$$

Granger causality holds if and only if lags of $x_t$ enter the right hand side of the
equation for $y_t$, and we can test the hypothesis of no Granger causality by the
restriction

$$x \nrightarrow y: \quad A_{12} = B_{12} = 0.$$

Likewise we can test that $y$ does not Granger cause $x$ by the hypothesis

$$y \nrightarrow x: \quad A_{21} = B_{21} = 0.$$

For the model with $k = 2$ lags, both hypotheses involve 2 restrictions that can be
tested using either likelihood ratio tests or Wald tests. Both types of statistics will
be asymptotically $\chi^2(2)$.
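Such tests are available in standard software. As a sketch using the VAR class in statsmodels (the DataFrame `data` is an assumed input holding the two series, and exact argument names may differ across versions):

```python
from statsmodels.tsa.api import VAR

# data: pandas DataFrame with columns 'dc' and 'dy' (assumed to be available)
results = VAR(data[['dc', 'dy']]).fit(2)      # bivariate VAR(2)

# H0: dy does not Granger-cause dc, i.e. A12 = B12 = 0 (2 restrictions)
test = results.test_causality(caused='dc', causing=['dy'], kind='wald')
print(test.summary())                         # Wald statistic, asymptotically chi2(2)
```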

Example 6.11 (granger causality): For the bivariate consumption and income
model in Example 6.8, the unrestricted reduced form is given by

$$\begin{pmatrix} \Delta_4 c_t \\ \Delta_4 y_t \end{pmatrix} =
\begin{pmatrix} \underset{(1.94)}{0.00338} \\ \underset{(2.96)}{0.00803} \end{pmatrix} +
\begin{pmatrix} \underset{(11.1)}{0.628} & \underset{(2.88)}{0.144} \\ \underset{(1.39)}{0.113} & \underset{(5.62)}{0.401} \end{pmatrix}
\begin{pmatrix} \Delta_4 c_{t-1} \\ \Delta_4 y_{t-1} \end{pmatrix} +
\begin{pmatrix} \hat\epsilon_{1t} \\ \hat\epsilon_{2t} \end{pmatrix},$$

with $t$-ratios in parentheses. To test $\Delta_4 y \nrightarrow \Delta_4 c$ we only need to look at the
upper right element in $\hat\Pi_1$, and we can test the hypothesis using a $t$-test. The
$t$-statistic for the hypothesis is 2.88, which is clearly significant compared to a $N(0, 1)$
distribution. We therefore reject the hypothesis and conclude that income Granger
causes consumption. The same conclusion is obtained from a LR test, where the statistic
is 8.245, which is clearly significant in a $\chi^2(1)$ distribution.

To test $\Delta_4 c \nrightarrow \Delta_4 y$ we look at the lower left element in $\hat\Pi_1$, and the $t$-statistic
for that hypothesis is 1.39. Here we cannot reject the hypothesis and we conclude that
consumption does not Granger cause income. Again the LR test gives similar results.

Remark 6.8 (granger causality for p > 2): The extension of Granger non-causality
to systems of higher dimensions, $p > 2$, is conceptually simple, but harder
to implement in practice. The reason is that in a system with more variables, e.g.
$Z_t = (y_t, x_t, z_t)'$, it holds that $x$ may help in the forecast of $y$ even if it does not enter
the equation for $y$. This happens because $x$ may be helpful in predicting $z$, which may
enter the equation for $y$. As a result, the necessary restrictions are more complicated
for $p > 2$.

6.9 Further Readings


Most time series textbooks include sections on VAR models, including Verbeek (2017),
Hendry (1995), and Enders (2004). There is a detailed coverage of VAR models for
both stationary and non-stationary variables in Lütkepohl and Krätzig (2004) and
Lütkepohl (2005) but the technical level is higher than this chapter. Structural VAR
models are covered in all details in Lütkepohl and Kilian (2017).

Appendix:

6.A Eigenvalues and Eigenvectors


Let $A$ be a square ($p \times p$) matrix, $A \in \mathbb{R}^{p \times p}$. We can think of $A$ as defining a mapping

$$\mathbb{R}^p \mapsto \mathbb{R}^p: \quad y = Ax.$$

An example could be the VAR(1) without error term,

$$x_t = A x_{t-1},$$

called the skeleton of the VAR(1) model. To see if the mapping is stable, i.e. contracts
to a single point, or explosive, it pays to look at a vector, $v_i$, that comes out parallel
to itself, i.e.

$$A v_i = \lambda_i v_i,$$

where $\lambda_i$ is a scalar. This implies

$$(A - \lambda_i I_p) v_i = 0. \tag{6.34}$$

The vector $v_i$ is called an eigenvector and $\lambda_i$ the corresponding eigenvalue.

The eigenvalue problem,

$$(A - \lambda_i I_p) v_i = 0,$$

states that $A - \lambda_i I_p$ is singular, because there is a linear combination of the columns
(defined by $v_i$) which is zero. It therefore holds that

$$|A - \lambda_i I_p| = 0. \tag{6.35}$$

The equation (6.35) is a polynomial of order $p$ and has $p$ solutions, such that

$$A v_i = \lambda_i v_i \qquad\text{for } i = 1, 2, \ldots, p.$$

Now collect the solutions into $p \times p$ matrices:

$$V = (v_1, \ldots, v_i, \ldots, v_p) \qquad\text{and}\qquad \Lambda = \begin{pmatrix} \lambda_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \lambda_p \end{pmatrix}.$$
It now holds that

$$A V = V \Lambda$$
$$(A v_1, \ldots, A v_i, \ldots, A v_p) = (v_1 \lambda_1, \ldots, v_i \lambda_i, \ldots, v_p \lambda_p).$$

For the eigenvalues it furthermore holds that

$$|A| = \lambda_1 \cdots \lambda_i \cdots \lambda_p = \prod_{i=1}^p \lambda_i$$
$$\mathrm{trace}(A) = \lambda_1 + \ldots + \lambda_i + \ldots + \lambda_p = \sum_{i=1}^p \lambda_i.$$

Suppose the $p$ eigenvalues are different; then $V^{-1}$ exists and from $AV = V\Lambda$ we can
diagonalize the matrix,

$$V^{-1} A V = \Lambda, \tag{6.36}$$

or we can find the spectral decomposition

$$A = V \Lambda V^{-1}. \tag{6.37}$$

If $A$ is further symmetric, then $\lambda_i$ is real and $V^{-1} = V'$ such that $V' V = I_p$.

Eigenvalues and Characteristic Roots. Eigenvalues are identical to the inverses
of the characteristic roots. To illustrate this, consider a VAR(2),

$$Z_t = \delta + \Pi_1 Z_{t-1} + \Pi_2 Z_{t-2} + \epsilon_t.$$

Using the lag operator $L$, such that $L Z_t = Z_{t-1}$, we may write the model as

$$(I_p - \Pi_1 L - \Pi_2 L^2) Z_t = \delta + \epsilon_t,$$

which defines the characteristic polynomial

$$|I_p - \Pi_1 z - \Pi_2 z^2| = 0. \tag{6.38}$$

The eigenvalue problem for the companion form gives

$$\left|\begin{pmatrix} \Pi_1 & \Pi_2 \\ I_p & 0 \end{pmatrix} - \lambda \begin{pmatrix} I_p & 0 \\ 0 & I_p \end{pmatrix}\right| = \left|\begin{pmatrix} \Pi_1 - \lambda I_p & \Pi_2 \\ I_p & -\lambda I_p \end{pmatrix}\right| = 0,$$

or

$$|\lambda^2 I_p - \lambda \Pi_1 - \Pi_2| = 0. \tag{6.39}$$

Observe that the characteristic roots, i.e. the solutions to (6.38), are inverse
eigenvalues, i.e. the solutions to (6.39), and it holds that $z_i = \lambda_i^{-1}$. If the eigenvalues
are inside the unit circle, the characteristic roots are outside the unit circle.
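The relation $z_i = \lambda_i^{-1}$ is easy to verify numerically; a sketch for a univariate AR(2) with illustrative (assumed) coefficients:

```python
import numpy as np

pi1, pi2 = 1.2, -0.35                       # illustrative AR(2) coefficients (assumed)

# eigenvalues of the companion matrix, cf. (6.39)
companion = np.array([[pi1, pi2],
                      [1.0, 0.0]])
lambdas = np.linalg.eigvals(companion)

# characteristic roots: solutions to 1 - pi1*z - pi2*z^2 = 0, cf. (6.38)
z = np.roots([-pi2, -pi1, 1.0])             # np.roots takes coefficients from highest power

print(np.sort(lambdas), np.sort(1 / z))     # identical up to ordering: z_i = 1/lambda_i
```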
Chapter 7

Non-Stationary Time Series
and Unit Root Testing

This chapter discusses some central issues in the analysis of non-stationary time
series. We begin by showing examples of different types of non-stationarity,
involving trends, level shifts, variance changes, and unit roots. The first three
deviations from stationarity may complicate the econometric analysis, but the
tools developed for stationary variables may be adapted to the new situation, e.g. by
including explanatory variables accounting for the non-stationarity. The presence of
unit roots, however, changes the asymptotic behavior of estimators and test statistics
fundamentally, and a different set of tools for unit root processes has to be applied.

We continue by illustrating the properties of a unit root time series, and discuss the
issue of unit root testing. In practical applications, testing for unit roots is particularly
important, because the conclusion determines what kind of tool-kit is appropriate
for a given problem: For stationary time series we can use the standard tools from
linear regression; for unit root time series we have to think about how to combine unit
root time series. The latter is called co-integration and is discussed in Chapter 8.

7.1 Stationary and Non-Stationary Time Series

First, recall from Chapter 1 and Chapter 4 that a time series,

$$\{y_1, y_2, \ldots, y_T\} \qquad\text{or}\qquad \{y_t\}_{t=1}^T,$$

is weakly stationary if the mean and variance are the same for all $t = 1, 2, \ldots, T$, and
if the autocovariance function, defined as

$$\gamma_s = \mathrm{cov}(y_t, y_{t-s}), \qquad s = 0, 1, 2, \ldots,$$

depends on $s$ but not on $t$. Also recall that if $y_t$ and $x_t$ are stationary and weakly
dependent time series, then the linear regression model

$$y_t = \beta_0 + \beta_1 x_t + \epsilon_t, \qquad t = 1, 2, \ldots, T, \tag{7.1}$$

can be analyzed using standard tools, and most of the results for regression on
independent and identically distributed (i.i.d.) observations hold. The technical argument
is that there exists a standard law of large numbers (LLN) and a central limit theorem
(CLT) for stationary and weakly dependent time series, resulting in estimators that are
consistent and asymptotically normally distributed. This is also reflected in the conditions
for consistency and asymptotic normality outlined for the analysis based on the likelihood
function in Assumption 3.1 and Theorem 3.1.
The same thing does not hold for non-stationary time series in general, and
econometric analysis of non-stationary time series should always be performed with care. In
macroeconomics this is particularly important, because many relevant observed time
series in economics and finance do not seem to be well characterized as stationary
processes.

This chapter gives an intuitive account of non-stationarity in economic time series.
In §7.2 we argue that time series can be non-stationary in many different ways,
and we present some typical deviations from the stationarity assumption, namely:
deterministic trends in §7.2.1, level shifts in §7.2.2, and changing variances in §7.2.3.
For each case we briefly discuss how the non-stationary time series could be treated
in practical applications, and it turns out that the required modifications to the usual
regression tools are minor. We proceed by introducing the concept of unit roots in
§7.2.4. The presence of a unit root more fundamentally changes the properties of the
time series, and the usual tools no longer apply.

In §7.3 we review the properties of a stationary autoregressive process and discuss
the implications of a unit root. This is extended in §7.4 to models that include
deterministic components, such as a constant and a linear trend term.

We then proceed in §7.5 to discuss the issue of unit root testing, i.e. how a unit
root process can be distinguished from a stationary process. This issue is particularly
important in applications, because it determines the kind of tools that we should
apply to the data: For stationary time series we can apply the usual tools from
regression and the interpretation is straightforward. For unit root processes the tools
should be modified and the econometric model should be interpreted in terms of
co-integration; we return to this in Chapter 8.
[Figure 7.1 here: four panels of simulated series, (A) a stationary and a trend-stationary
process, (B) a process with a level shift, (C) a process with a change in the variance,
and (D) a unit root process.]

Figure 7.1: Simulated examples of non-stationary time series.

7.2 Non-Stationarity in Economic Time Series

This section discusses some examples of non-stationarity typically observed in economic
data.

7.2.1 Deterministic Trends and Trend-Stationarity

Macro-economic variables are often trending by nature, i.e. they have a tendency
to systematically increase or decrease over time. As examples you could think of
productivity, GDP, consumption, prices etc. The trending behavior means that the
unconditional expectation changes over time, which is not in accordance with the
assumption of stationarity.

In some cases the trend is very systematic, such that the deviation from the trend
is a stationary variable. In this case we can analyze the deviation from trend, the
so-called de-trended variable, instead of the original one, and since the de-trended
process is stationary, the usual results apply.
158 Non-Stationary Time Series and Unit Root Testing

Definition 7.1 (trend stationarity): A time series

$$y_t = \mu_0 + \mu_1 t + \tilde y_t \tag{7.2}$$

is called trend-stationary if the de-trended process, $\tilde y_t = y_t - \mu_0 - \mu_1 t$, is stationary.
The process $y_t$ fluctuates in a stationary manner around a deterministic linear trend.

Example 7.1 (a trend-stationary ar(1)): As an example, consider the stationary
AR(1) process with a zero mean,

$$\tilde x_t = \theta \tilde x_{t-1} + \epsilon_t, \qquad t = 1, 2, \ldots, T, \tag{7.3}$$

with $|\theta| < 1$ and $\tilde x_0 = 0$, and a new process, $x_t$, defined as the stationary process plus
a linear trend term and a constant,

$$x_t = \tilde x_t + \mu_0 + \mu_1 t. \tag{7.4}$$

Since $\tilde x_t$ is a stationary process, $x_t$ is stationary around the trending mean,

$$E(x_t) = E(\tilde x_t) + \mu_0 + \mu_1 t = \mu_0 + \mu_1 t,$$

i.e. $x_t$ is trend-stationary. A single realization of $T = 200$ observations of the processes
$\tilde x_t$ and $x_t$ (with $\theta = 0.5$) is illustrated in Figure 7.1 (A).

The main point of a trend-stationary process is that the stochastic part is still
stationary, and the non-stationarity is deterministic. In an empirical analysis using
the regression (7.1) we could therefore de-trend the variables by running the two OLS
regressions:

$$y_t = \varphi_0 + \varphi_1 t + \text{residual} \qquad\text{and}\qquad x_t = \psi_0 + \psi_1 t + \text{residual}. \tag{7.5}$$

We then define the de-trended variables as the residuals,

$$\tilde y_t = y_t - \hat\varphi_0 - \hat\varphi_1 t \qquad\text{and}\qquad \tilde x_t = x_t - \hat\psi_0 - \hat\psi_1 t, \tag{7.6}$$

and consider the (stationary) linear regression

$$\tilde y_t = \beta_0 + \beta_1 \tilde x_t + \epsilon_t. \tag{7.7}$$

Alternatively, we could consider a regression augmented with a linear trend term, i.e.

$$y_t = \beta_0 + \beta_1 x_t + \beta_2 t + \epsilon_t. \tag{7.8}$$

The two-step approach and the approach of augmenting the regression with the trend
regressor give identical results due to the celebrated Frisch-Waugh-Lovell theorem.
In most cases, it is therefore more convenient to include the trend, $t$, as a regressor.
This approach of including a trend regressor in regression models to account for a
deterministic drift in the variables has also been suggested in earlier chapters.
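A sketch illustrating the numerical equivalence of the two-step de-trending in (7.5)-(7.7) and the augmented regression (7.8), on simulated data with assumed parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
t = np.arange(1, T + 1)
x = 0.5 + 0.02 * t + rng.standard_normal(T)      # trend-stationary regressor (illustrative DGP)
y = 1.0 + 0.01 * t + 0.8 * x + rng.standard_normal(T)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# two-step approach: de-trend both series, then regress, cf. (7.5)-(7.7)
D = np.column_stack([np.ones(T), t])
y_tilde = y - D @ ols(D, y)
x_tilde = x - D @ ols(D, x)
beta_two_step = ols(np.column_stack([np.ones(T), x_tilde]), y_tilde)[1]

# one-step approach: augment the regression with the trend, cf. (7.8)
beta_one_step = ols(np.column_stack([np.ones(T), x, t]), y)[1]

print(beta_two_step, beta_one_step)   # identical by the Frisch-Waugh-Lovell theorem
```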
A leading example of linear trends in economic variables is that productivity
increases over time, implying growth in GDP, consumption etc. To illustrate the idea
of trend-stationarity, we consider two examples below.

Example 7.2 (linear trend in productivity): Let $\mathrm{LPROD}_t$ denote the log of
Danish hourly productivity, 1971:1-2005:2, compiled as the log of real output
per hour worked. The time series is depicted in Figure 7.2 (A). The trend looks very
stable and the deviation from the trend looks like a stationary process. To estimate
an AR(1) model for the time series we augment the model with a linear trend,

$$\mathrm{LPROD}_t = \theta\, \mathrm{LPROD}_{t-1} + \delta + \gamma t + \epsilon_t,$$

and obtain the results reported in Table 7.1. The autoregressive parameter is $\hat\theta =
0.56$, and according to the misspecification tests, the model seems to be a reasonable
description of the data.

Example 7.3 (linear trend in consumption): Next, let $\mathrm{LCONS}_t$ be the log of
private aggregate consumption in Denmark, 1971:1-2005:2, see Figure 7.2 (B).
Again there is a positive trend, but it is less stable, and consumption shows large and
persistent deviations from the trend. From a visual inspection it is not clear that the
                                     Coefficient   Std.Error   t-value
 LPROD_{t-1}                         0.561273      0.07056     7.95
 Constant                            0.090890      0.01382     6.58
 t                                   0.002412      0.00039     6.15

 $\hat\sigma$ = 0.016907             log-likelihood = 366.09
 $R^2$ = 0.994                       T = 137

                                     Statistic   [p-val]   Distribution
 No autocorrelation of order 1-2     3.80        [0.15]    $\chi^2(2)$
 Normality                           2.52        [0.28]    $\chi^2(2)$
 No heteroskedasticity               6.53        [0.16]    $\chi^2(4)$
 Correct functional form (RESET)     2.66        [0.11]    F(1,133)

Table 7.1: Modelling $\mathrm{LPROD}_t$ by OLS for $t$ = 1971:2-2005:2.



deviations are stationary. If we estimate a second order autoregressive model allowing
for a linear trend, we obtain the results in Table 7.2, where the estimates of the two
inverse characteristic roots are given by

$$\hat\lambda_1 = 0.896 \qquad\text{and}\qquad \hat\lambda_2 = -0.233.$$

In §7.5 we will formally test whether the deviations are stationary, i.e. whether
$\mathrm{LCONS}_t$ is a trend-stationary process.

                                     Coefficient   Std.Error   t-value
 LCONS_{t-1}                         0.662660      0.08548     7.75
 LCONS_{t-2}                         0.208583      0.08578     2.43
 Constant                            0.763562      0.29720     2.57
 t                                   0.000436      0.00017     2.58

 $\hat\sigma$ = 0.017502             log-likelihood = 359.23
 $R^2$ = 0.983                       T = 136

                                     Statistic   [p-val]   Distribution
 No autocorrelation of order 1-2     3.57        [0.17]    $\chi^2(2)$
 Normality                           52.44       [0.00]    $\chi^2(2)$
 No heteroskedasticity               11.80       [0.07]    $\chi^2(6)$
 Correct functional form (RESET)     2.19        [0.14]    F(1,131)

Table 7.2: Modelling $\mathrm{LCONS}_t$ by OLS for $t$ = 1971:3-2005:2.

7.2.2 Level Shifts and Structural Breaks

Another type of non-stationarity in a time series arises if there is a change in the
unconditional mean at a given point in time. As an example, the mean of a time series
could be $\mu_1$ for the first half of the sample and $\mu_2$ for the second half. Such a case is
illustrated in Figure 7.1 (B). The level shift may be associated with a change in the
economic structures, e.g. institutional changes, changes in the definition or compilation
of the variables, or a switch from one regime to another. As a classical example you
could think of the German reunification in 1990, where the German Democratic Republic
(GDR) became part of the Federal Republic of Germany (FRG) to form the reunited
nation of Germany. Here, we might expect the time series to behave differently
for unified Germany as compared to GDR and FRG. In particular we expect a level
shift–simply because the number of inhabitants in the country changed.
[Figure 7.2 here: (A) the log of Danish productivity, (B) the log of Danish private
consumption, (C) the simulated distribution of $\hat\theta$ for $\theta = 0.5$ with a normal
approximation (s = 0.0389), and (D) the simulated distribution of $\hat\theta$ for $\theta = 1$
with a normal approximation (s = 0.00588).]

Figure 7.2: (A)-(B): Examples of non-stationary time series. (C)-(D): Distribution
of the OLS estimator, $\hat\theta$, in an AR(1) model when the true parameter is $\theta = 0.5$ and
$\theta = 1$. Simulated with $T = 500$ observations and 20000 replications.

From a modelling point of view we may consider the change in the mean as
deterministic (as the time of the change is known) and include a dummy variable in
the regression model. Defining a dummy variable,

$$D_t = I(t \geq T_0), \tag{7.9}$$

to be zero before observation $T_0$ and unity after, we can augment the regression model
(7.1) to take account of a level shift in $y_t$ or $x_t$:

$$y_t = \beta_0 + \beta_1 x_t + \beta_2 D_t + \epsilon_t. \tag{7.10}$$

If we think that the structural break is more fundamental, e.g. changing all
parameters in the model, we may want to model the two regimes separately. That
requires sufficient observations in both sub-samples.
7.2.3 Changing Variances

A third type of non-stationarity is related to changes in the variance. Figure 7.1 (C)
illustrates an example:

$$y_t = 0.5\, y_{t-1} + \epsilon_t,$$

where $\epsilon_t$ is distributed as $N(0, 1)$ for $t = 1, 2, \ldots, 100$, and as $N(0, 5)$ for $t = 101,
102, \ldots, 200$. Again the interpretation is that the time series covers different regimes,
where one regime appears to be more volatile than the other.

If the sub-samples are long enough, a natural solution is again to model the
regimes separately. An alternative solution is to try to model the changes in the
variance, which is covered later in the course using the class of autoregressive conditional
heteroskedastic (ARCH) models.

7.2.4 Unit Roots

The final type of non-stationarity presented here is generated by unit roots in
autoregressive models.

Example 7.4 (unit root): Figure 7.1 (D) illustrates a so-called random walk,

$$y_t = y_{t-1} + \epsilon_t, \qquad\text{with } \epsilon_t \text{ being i.i.d. } N(0, 1),$$

that has a unit root in the characteristic polynomial. Note from the graph that the
random walk has no attractor, and wanders arbitrarily far up and down. Unit root
processes seem to be a good description of the behavior of actual time series in many
cases, and it is the main focus in the rest of this chapter.

The most important complication from the introduction of unit root processes
is that standard versions of the LLN and CLT do not apply. Consequently, we have to
develop new statistical tools for the analysis of this case.

As an illustration, consider a small Monte Carlo simulation with data generating
process (DGP) given by

$$y_t = \theta y_{t-1} + \epsilon_t, \qquad \epsilon_t \stackrel{d}{=} N(0, 1),$$

for $t = 1, 2, \ldots, T$, and $y_0 = 0$. For each simulated time series, $m = 1, 2, \ldots, M$, we
estimate an AR(1) model and collect the OLS estimates, $\hat\theta_m$. In the simulation we
take $T = 500$ observations to illustrate the behavior in large samples, and $M = 20000$
replications.

Figure 7.2 (C) depicts the distribution of $\hat\theta$ when the true parameter is $\theta = 0.5$.
This is the standard stationary case, and the distribution is close to normal with a
standard deviation of $\mathrm{MCSD}(\hat\theta) \approx 0.04$. Figure 7.2 (D) depicts the distribution of $\hat\theta$
when the true parameter is $\theta = 1$, i.e. in the presence of a unit root. Note that the
distribution is highly skewed compared to the normal distribution, with a long left
tail. This reflects that the asymptotic distribution of $\hat\theta$ is non-normal in the unit root
case. Also note that the distribution is much more condensed (compare the scales
of graphs (C) and (D)) with $\mathrm{MCSD}(\hat\theta) \approx 0.006$. This reflects that the estimator is
consistent, i.e. $\mathrm{plim}(\hat\theta) = \theta$, and the convergence to the true value is much faster
for $\theta = 1$ than for $\theta = 0.5$. This phenomenon of fast convergence is referred to as
super-consistency.
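A sketch of this Monte Carlo experiment (with fewer replications than in the text, to keep it quick):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_theta_hat(theta, T=500, M=2000):
    """OLS estimates of theta in y_t = theta*y_{t-1} + eps_t over M replications."""
    estimates = np.empty(M)
    for m in range(M):
        eps = rng.standard_normal(T)
        y = np.zeros(T + 1)                                   # y_0 = 0
        for t in range(1, T + 1):
            y[t] = theta * y[t - 1] + eps[t - 1]
        estimates[m] = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])   # OLS without constant, matching the DGP
    return estimates

for theta in (0.5, 1.0):
    est = simulate_theta_hat(theta)
    print(theta, est.mean().round(4), est.std().round(4))     # the MC std is far smaller for theta = 1
```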

7.3 Stationary and Unit Root Autoregressions


In this section we discuss the properties of an autoregressive model under the sta-
tionarity condition and under the assumption of a unit root. To make the derivations
as simple as possible we focus on the …rst order autoregression, but parallel results
could have been derived for more general models. We …rst analyze the case with no
deterministic terms, and then discuss the interpretation of deterministic terms in the
model.

7.3.1 Stationary Autoregression


Consider the …rst order autoregressive, AR(1), model given by

yt = y t 1 + t; (7.11)

for t = 1; 2; :::; T , where t is an i.i.d.(0; 2 ) error term, and the initial value, y0 , is
given. The characteristic polynomial is given by (z) = 1 z, and the characteristic
1 1
root is z1 = with inverse root 1 = z1 = . Recall that the stationarity condition
is that the inverse root is located inside the unit circle, and it follows that the process
in (7.11) is stationary if j j < 1.
The solution to (7.11) in terms of the initial value and the error terms can be
found from recursive substitution, i.e.

yt = yt 1 + t

= ( yt 2 + t 1) + t
2
= t + t 1 + yt 2
2
= t + t 1 + ( yt 3 + t 2)
2 3
= t + t 1 + t 2 + yt 3
164 Non-Stationary Time Series and Unit Root Testing

..
.
2 t 1 t
= t + t 1 + t 2 + ::: + 1 + y0 : (7.12)

An important characteristic of this process is that a shock to $\epsilon_t$ has only transitory
effects, because $\theta^s$ goes to zero as $s$ increases. We say that the process has an
attractor, and if we could set all future shocks to zero, the process $y_t$ would converge
towards the attractor. In the present case the attractor is the unconditional mean,
$E(y_t) = 0$. Figure 7.3 (A) illustrates this idea by showing one realization of a
stationary AR(1) process. We note that the process fluctuates around a constant mean.
An extraordinarily large shock (at time $t = 50$) increases the process temporarily, but
the series will return and fluctuate around the attractor after some periods.

From the solution in (7.12) we can find the properties of $y_t$ directly. The mean is

$$E(y_t \mid y_0) = \theta^t y_0 \to 0 = E(y_t) \qquad\text{as } t \to \infty.$$

We note that the initial value affects the expectation in small samples, but the effect
vanishes for increasing $t$, such that the expectation is zero in the limit. Likewise, the
variance is found to be

$$V(y_t \mid y_0) = \sigma^2 + \theta^2\sigma^2 + \theta^4\sigma^2 + \ldots + \theta^{2(t-1)}\sigma^2 \to \frac{\sigma^2}{1 - \theta^2} = V(y_t),$$

and remember that the autocorrelation function is given by

$$\rho_s = \mathrm{corr}(y_t, y_{t-s}) = \theta^s,$$

which goes to zero for increasing $s$, cf. Figure 7.3 (C).

7.3.2 Autoregression with a Unit Root

Now consider the case where the autoregressive parameter in (7.11) is unity, $\theta = 1$,
i.e.

$$y_t = y_{t-1} + \epsilon_t. \tag{7.13}$$

Note that unity is now a root in the characteristic polynomial,

$$\theta(z) = 1 - z,$$

such that $\theta(1) = 0$, hence the name unit root process. The solution is given by (7.12)
with $\theta = 1$, i.e.

$$y_t = y_0 + \epsilon_1 + \epsilon_2 + \ldots + \epsilon_t = y_0 + \sum_{i=1}^t \epsilon_i. \tag{7.14}$$
7.3 Stationary and Unit Root Autoregressions 165

Note the striking differences between (7.12) and (7.14). First, the effect of the initial value, y_0, stays in the unit root process and does not disappear for increasing t. This means that

    E(y_t | y_0) = y_0,

and the initial value plays the role of a constant term. Secondly, the shocks to the process, ε_t, are accumulated to a random walk component, Σ ε_i. This is called a stochastic trend, and it implies that shocks to the process have permanent effects. This is illustrated in Figure 7.3 (B), where the large shock at time t = 50 increases the level of the series permanently. More generally we note that the process is moved around by the shocks with no attractor.
Thirdly, the variance now increases with t,

    V(y_t | y_0) = V(Σ_{i=1}^t ε_i | y_0) = t σ²,

and the process is clearly non-stationary. The first-differenced process, Δy_t = ε_t, is stationary, however, and the process y_t is often referred to as integrated of first order, I(1), meaning that it is a stationary process that has been integrated once. More generally, a time series is integrated of order d, I(d), if it contains d unit roots.
We also note that the covariance between y_t and y_{t-s} is given by

    cov(y_t, y_{t-s} | y_0) = E((y_t - y_0)(y_{t-s} - y_0) | y_0)
                            = E((ε_1 + ε_2 + ... + ε_t)(ε_1 + ε_2 + ... + ε_{t-s}) | y_0)
                            = (t - s)σ²,

and the autocorrelation is given by

    corr(y_t, y_{t-s} | y_0) = cov(y_t, y_{t-s} | y_0) / √(V(y_t | y_0) V(y_{t-s} | y_0))
                             = (t - s)σ² / √(tσ² · (t - s)σ²)
                             = (t - s)/√(t(t - s))
                             = √((t - s)/t),

which dies out very slowly with s. The autocorrelation function is illustrated for the unit root case in graph (D).
For autoregressive processes of higher order, AR(p) with p > 1, the solution (7.14) is generally more complicated, but an I(1) process can always be written as

    y_t = c Σ_{i=1}^t ε_i + c_0 ε_t + c_1 ε_{t-1} + c_2 ε_{t-2} + ... + A,    (7.15)

[Figure 7.3 here: (A) shock to a stationary process, θ = 0.8; (B) shock to a unit root process, θ = 1; (C) ACF for stationary process, θ = 0.8; (D) ACF for unit root process, θ = 1.]

Figure 7.3: Differences between stationary and non-stationary time series. (A) and (B) show one realization of a stationary and a non-stationary time series, respectively, and illustrate the temporary and permanent impact of the shocks. (C) and (D) show the estimated autocorrelation function.

where A is the contribution from the initial values and the sequence c_0, c_1, c_2, ... converges to zero exponentially fast. This is a bit technical to show, but the calculations for the AR(2) case are given in Appendix §7.A at the end of this chapter.

7.4 Deterministic Terms


The statistical model in (7.11) is only valid if the time series under analysis has a zero mean. This is rarely the case, and in practice it is always necessary to include a constant term, and sometimes it is also necessary to allow for a deterministic linear trend. Note from (7.14), however, that a unit root implies accumulation of the terms in the model, and the interpretation of the deterministic terms changes in the presence of a unit root.

Consider, as an example, a model with a constant term,

    y_t = δ + θ y_{t-1} + ε_t.

If |θ| < 1, the solution can be derived as

    y_t = θ^t y_0 + Σ_{i=0}^{t-1} θ^i ε_{t-i} + (1 + θ + θ² + ... + θ^{t-1})δ,    (7.16)

where the mean is given by

    E(y_t | y_0) = θ^t y_0 + (1 + θ + θ² + ... + θ^{t-1})δ → δ/(1 - θ) = E(y_t).
In the case of a unit root, θ = 1, we find the solution

    y_t = y_0 + Σ_{i=1}^t (δ + ε_i) = y_0 + δt + Σ_{i=1}^t ε_i,    (7.17)

where the constant term is accumulated to a deterministic linear trend, δt, while the initial value, y_0, plays the role of a constant term. The process in (7.17) is referred to as a random walk with drift. Note that if the constant term is zero, δ = 0, then the solution is the random walk,

    y_t = y_0 + Σ_{i=1}^t ε_i.    (7.18)

The joint hypothesis, θ = 1 and δ = 0, plays an important role in unit root testing. It holds in general that the deterministic terms in the model will accumulate, and a linear trend term in an autoregressive equation with a unit root corresponds to a quadratic trend in y_t.
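
The accumulation of the constant into a drift can be checked numerically. A minimal sketch (Python; δ = 0.2, θ = 0.5, T and the seed are illustrative values):

    import numpy as np

    rng = np.random.default_rng(1)
    T, delta, theta = 500, 0.2, 0.5
    eps = rng.standard_normal(T)

    y = np.zeros(T)                      # stationary case: |theta| < 1
    x = np.zeros(T)                      # unit root case: theta = 1
    for t in range(1, T):
        y[t] = delta + theta * y[t - 1] + eps[t]
        x[t] = delta + x[t - 1] + eps[t]

    print(y[T // 2:].mean())             # close to delta/(1 - theta) = 0.4
    print(x[-1] / T)                     # slope close to the drift delta = 0.2

The same constant δ produces a constant level in the stationary case and a linear trend in the unit root case.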

7.5 Testing for a Unit Root


To test for a unit root in a time series y_t, the idea is to estimate a statistical model, and then to test whether z = 1 is a root in the autoregressive polynomial, θ(z), i.e. whether

    θ(1) = 1 - θ_1 - θ_2 - ... - θ_p = 0.

The only thing which makes unit root testing different from hypothesis testing in stationary models is that the asymptotic distributions of the test statistics are not N(0,1) or χ²(1) in general. We say that the test statistics follow non-standard distributions.

Some textbooks and computer programs present unit root tests as a misspecification test that should be routinely applied to time series. This 'automatic' approach has the danger that the user forgets the properties of the models involved and the interpretation of the unit root test itself. It is therefore recommended to take a more standard approach, and to consider the unit root hypothesis as any other hypothesis in econometrics, bearing in mind that the asymptotic critical values for the test should not be taken from the standard N(0,1) or χ²(1) distributions.

The first step is to set up a statistical model for the data. To do so we have to determine which deterministic components we want to include, and we should ensure that the statistical model is an adequate representation of the structure in the time series, e.g. by applying the usual misspecification tests to the estimated model. Based on the statistical model we can then test for a unit root by comparing two hypotheses, H_0 and H_A say, bearing in mind the properties of the model under the null (H_0) and under the alternative (H_A).

7.5.1 Dickey-Fuller Test in an AR(1)


First consider an AR(1) model given by

    y_t = θ y_{t-1} + ε_t,    (7.19)

for t = 1, 2, ..., T. A unit root implies that θ = 1. The null hypothesis of a unit root is tested against a stationary alternative by comparing the hypotheses

    H_0: θ = 1 against H_A: -1 < θ < 1.

Note that the alternative is explicitly a stationary model, i.e. a one-sided hypothesis. A test could also be devised against an explosive alternative, which may be relevant if the purpose of the analysis is to search for the presence of bubbles, but that will not be discussed here.

An alternative but equivalent formulation is obtained by subtracting y_{t-1} on both sides,

    Δy_t = π y_{t-1} + ε_t,    (7.20)

where π = θ - 1 = -θ(1) is minus the characteristic polynomial evaluated in z = 1. The hypothesis θ(1) = 0 translates into

    H_0: π = 0 against H_A: -2 < π < 0.

The Dickey-Fuller (DF) test statistic is simply the t-ratio of H_0 in (7.19) or (7.20), i.e.

    DF = (θ̂ - 1)/se(θ̂) = π̂/se(π̂),
                       Test level: left-tail probability
    Distribution       1%        2.5%      5%        10%
    N(0,1)            -2.33     -1.96     -1.64     -1.28
    DF                -2.56     -2.23     -1.94     -1.62
    DF_c              -3.43     -3.12     -2.86     -2.57
    DF_l              -3.96     -3.66     -3.41     -3.13

Table 7.3: Asymptotic critical values for the Dickey-Fuller unit root test. This is the one-sided test for π = 0. Reproduced from Davidson and MacKinnon (1993). The left-tail probability, α, corresponds to the 100·α percentile of the asymptotic distribution.

where se(·) denotes the estimated standard error of the coefficient. The asymptotic distribution of the test under the null hypothesis of a unit root is not standard normal, cf. also the non-standard distribution of θ̂ for θ = 1 illustrated in Figure 7.2 (D). It follows a so-called Dickey-Fuller distribution, DF, which is tabulated in Table 7.3 and compared to the standard normal distribution in Figure 7.4 (A). The 5% asymptotic critical value in the DF distribution is -1.94, which is smaller than the corresponding -1.64 for a one-sided test in a standard normal distribution.

It is worth noting that the DF distribution is derived under the assumption that ε_t is i.i.d. If that is not the case, e.g. if there is autocorrelation, the statistical model could be augmented with more lags.
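
Computationally, the DF test is nothing more than the t-ratio on y_{t-1} in an OLS regression of Δy_t. A minimal sketch using statsmodels (the simulated random walk is only a placeholder; in an application y would be the observed series):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    y = np.cumsum(rng.standard_normal(250))   # a random walk, so H0 is true

    res = sm.OLS(np.diff(y), y[:-1]).fit()    # regression (7.20), no constant
    DF = res.tvalues[0]
    print(DF)      # compare with the DF row of Table 7.3, not with N(0,1)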

7.5.2 Asymptotic Analysis


To outline the asymptotic analysis for the unit root case, assume that the variance is fixed, σ₀² = 1, and focus on θ. In the unit root case we have that y_t is a random walk,

    y_t = Σ_{i=1}^t ε_i,

where y_0 = 0 for simplicity.

Assuming Gaussian errors, recall that the likelihood function for the AR(1) model in (7.19) is given by

    log L_T(θ) = Σ_{t=1}^T ℓ_t(θ),

with

    ℓ_t(θ) = -(1/2) log(2πσ₀²) - (y_t - θ y_{t-1})²/(2σ₀²) = -(1/2) log(2π) - (1/2)(y_t - θ y_{t-1})²,

where the last simplification follows from assuming σ₀² = 1. The score contribution is therefore the first derivative,

    s_t(θ) = ∂ℓ_t(θ)/∂θ = y_{t-1}(y_t - θ y_{t-1}),

and the MLE is given as the solution to the first-order condition,

    S_T(θ̂) = Σ_{t=1}^T y_{t-1}(y_t - θ̂ y_{t-1}) = 0,

which produces the OLS estimator,

    θ̂ = Σ_{t=1}^T y_{t-1} y_t / Σ_{t=1}^T y_{t-1}².    (7.21)

The second derivative of the log-likelihood contribution is given by

    ∂²ℓ_t(θ)/∂θ² = ∂/∂θ {y_{t-1}(y_t - θ y_{t-1})} = -y_{t-1}².

The results for asymptotic normality of θ̂ considered so far require that the Hessian converges in probability. In the unit root case, however, the average

    T^{-1} Σ_{t=1}^T y_{t-1}²    (7.22)

does not converge to a constant, because the variance V(y_t | y_0) grows with t, and a LLN does not apply. Instead it holds that

    T^{-2} Σ_{t=1}^T y_{t-1}² →_d ∫₀¹ B(u)² du,

where B(u) is a so-called Brownian motion defined on u ∈ [0,1], which is the continuous limit of a random walk. This result is fundamentally different from the usual law of large numbers. First, the denominator is T² rather than T, implying that the information on the parameter θ grows much faster. Secondly, the right-hand-side limit is stochastic and not a constant.

It can be shown that the score converges to a so-called stochastic integral, ∫₀¹ B(u)dB, and jointly the estimator has the complicated representation

    T(θ̂ - 1) →_d (∫₀¹ B(u)dB) / (∫₀¹ B(u)² du).    (7.23)

We note that the speed of convergence is T instead of the usual √T, and we say that the estimator is super-consistent. Likewise the DF distribution can be written as

    DF →_d (∫₀¹ B(u)dB) / (∫₀¹ B(u)² du)^{1/2}.    (7.24)

To fully understand these results, some training in probability theory is needed, but we will not pursue this here. An introduction is given in Patterson (2010).
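
The limit in (7.23) can nevertheless be made concrete without probability theory: simulating T(θ̂ - 1) under the null across replications approximates the ratio of the two Brownian functionals. A sketch (Python; T, the replication count, and the seed are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(3)
    T, reps = 500, 2000
    stats = np.empty(reps)
    for r in range(reps):
        y = np.cumsum(rng.standard_normal(T))
        num = np.sum(y[:-1] * np.diff(y))   # score term, approximates int B dB
        den = np.sum(y[:-1] ** 2)           # approximates T^2 * int B(u)^2 du
        stats[r] = T * num / den            # T*(theta_hat - 1), cf. (7.23)

    print(np.quantile(stats, [0.05, 0.50, 0.95]))   # heavily left-skewed

The resulting quantiles reproduce the skewed, mostly negative shape seen in Figure 7.2 (D).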

7.5.3 Dickey-Fuller Test in an AR(p)


The DF test is easily extended to an autoregressive model of order p. Here we consider the case of p = 3 lags:

    y_t = θ_1 y_{t-1} + θ_2 y_{t-2} + θ_3 y_{t-3} + ε_t.

We note again that a unit root in θ(z) = 1 - θ_1 z - θ_2 z² - θ_3 z³ corresponds to

    θ(1) = 1 - θ_1 - θ_2 - θ_3 = 0.

This hypothesis is straightforward to test using

    DF = (Σ_{i=1}^3 θ̂_i - 1) / se(Σ_{i=1}^3 θ̂_i),    (7.25)

provided that the applied software reports the sum Σ_{i=1}^3 θ̂_i and the corresponding standard error.
To avoid testing a restriction on 1 - θ_1 - θ_2 - θ_3, which involves all p = 3 parameters, the model can be rewritten as

    y_t - y_{t-1} = (θ_1 - 1)y_{t-1} + θ_2 y_{t-2} + θ_3 y_{t-3} + ε_t
    y_t - y_{t-1} = (θ_1 - 1)y_{t-1} + (θ_2 + θ_3)y_{t-2} + θ_3(y_{t-3} - y_{t-2}) + ε_t
    y_t - y_{t-1} = (θ_1 + θ_2 + θ_3 - 1)y_{t-1} + (θ_2 + θ_3)(y_{t-2} - y_{t-1}) + θ_3(y_{t-3} - y_{t-2}) + ε_t
    Δy_t = π y_{t-1} + c_1 Δy_{t-1} + c_2 Δy_{t-2} + ε_t,    (7.26)

where

    π = θ_1 + θ_2 + θ_3 - 1 = -θ(1)
    c_1 = -(θ_2 + θ_3)
    c_2 = -θ_3.

In equation (7.26) the hypothesis θ(1) = 0 corresponds to

    H_0: π = 0 against H_A: -2 < π < 0.    (7.27)



The test statistic is again the t-ratio for H_0 and it is denoted the augmented Dickey-Fuller (ADF) test. The asymptotic distribution is the same as for the DF test in an AR(1).

We note that it is only the test for π = 0 that follows the DF distribution, while tests related to c_1 and c_2 have standard asymptotics. The reason for this difference is that the hypothesis, c_1 = 0, does not introduce any unit roots, see §8.4.1 for a more detailed argument.

For practical purposes, a unit root test is therefore just performed as a test for π = 0 in the regression (7.26), where we include sufficient lags to ensure that the errors are i.i.d. To determine the number of lags, p, we can use the standard procedures. One approach is to use general-to-specific testing: one starts with a maximum lag length, p_max, and insignificant lags are then deleted. Some researchers prefer to remove the longest lags first and to avoid holes in the lag structure, but that is not necessary. Another possibility is to use information criteria to select the best model, as illustrated in the sketch below. In any case it is important to ensure that the model is well-specified before the unit root test is applied.
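
Most econometric software automates the ADF regression. For instance, the adfuller function in statsmodels estimates regression (7.26) and can choose the lag length by an information criterion. A minimal sketch, where the simulated random walk y is a placeholder for the series being tested (regression='c' adds the constant term discussed in the next subsection):

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    rng = np.random.default_rng(11)
    y = np.cumsum(rng.standard_normal(300))      # placeholder data

    stat, pval, usedlag, nobs, crit, icbest = adfuller(y, regression='c',
                                                       autolag='AIC')
    print(stat, usedlag, crit)   # DF t-ratio, chosen number of lags, critical values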

Remark 7.1 (summary tables): Some authors and software packages suggest calculating the DF test for all values of p, and looking at the whole range. This is presented as a robustness check, but the interpretation is not as simple as it sounds. If the regression model includes fewer lags than the true model, then there is autocorrelation by construction and the DF distribution is no longer valid. If, on the other hand, the regression model includes too many lags compared to the data generating process, the parameters are typically imprecisely estimated, which also deteriorates the test. In most situations in practice it is therefore recommended that the researcher carefully models the process, and performs the Dickey-Fuller test in the preferred model.

Remark 7.2 (what is a unit root?): A unit root is defined as a root, z_i, or inverse root, φ_i = z_i^{-1}, with |φ_i| = 1, i.e. on the boundary of the complex unit circle. In this course, we focus on the real unit roots, φ_i = ±1, and do not cover the analysis of other types of unit roots. Here we consider some possible examples of unit roots:

(1) Consider the positive real unit root, φ = 1; the leading case considered in the course. In this case the characteristic polynomial is

    θ(z) = 1 - φz = 1 - z,

and the process is given by

    θ(L)y_t = ε_t  ⇔  y_t = y_{t-1} + ε_t,

such that the process is a random walk, which seems to be similar to many economic time series.

(2) Consider the negative real unit root, φ = -1. In this case the characteristic polynomial is θ(z) = 1 - φz = 1 + z, and the process is given by

    y_t = -y_{t-1} + ε_t.

In this case the process jumps between negative and positive values and does not seem to have a behavior relevant to economic time series.

(3) Now consider the case of two unit roots, φ_1 = 1 and φ_2 = -1. In this case the characteristic polynomial is

    θ(z) = (1 - φ_1 z)(1 - φ_2 z) = (1 - z)(1 + z) = 1 - z²,

and the process is given by

    y_t = y_{t-2} + ε_t.

This process could be relevant for bi-annual data, where the summer observation is similar to the summer observation the year before and the winter observation is similar to the winter observation the year before. This is known as a (bi-annual) seasonal unit root.

(4) Finally, consider the case of four roots on the complex unit circle, φ_1 = 1, φ_2 = -1, φ_3 = i, and φ_4 = -i, with i = √(-1) the imaginary number. In this case the characteristic polynomial is

    θ(z) = (1 - φ_1 z)(1 - φ_2 z)(1 - φ_3 z)(1 - φ_4 z) = (1 - z)(1 + z)(1 - iz)(1 + iz) = 1 - z⁴,

and the process is given by

    y_t = y_{t-4} + ε_t.

This process could be relevant for quarterly data, where each quarterly observation is similar to the same quarter the year before. This is known as a quarterly unit root.

7.5.4 Dickey-Fuller Test with a Constant Term


In practice we always include deterministic variables in the model, and the unit root test has to be adapted to this situation. The DF regression with a constant term is given by

    Δy_t = δ + π y_{t-1} + c_1 Δy_{t-1} + c_2 Δy_{t-2} + ε_t.    (7.28)

[Figure 7.4 here: (A) the DF, DF_c, and DF_l densities compared with N(0,1); (B) the US unemployment rate, 1985-2005.]

Figure 7.4: (A) Dickey-Fuller distributions. (B) US unemployment rate.
The hypothesis of a unit root is unchanged, H_0: π = 0, and as a test statistic we can use the t-ratio

    DF_c = π̂/se(π̂).

There are two important things to note. First, the presence of the constant term in the regression changes the asymptotic distribution. The asymptotic distribution that allows for a constant, DF_c, is illustrated in Figure 7.4 (A), and the critical values are reported in Table 7.3. We note that the constant term shifts the distribution to the left and the 5% critical value is -2.86.

Secondly, under the null hypothesis, π = 0, the constant term accumulates to a linear trend, and the DF t-test actually compares the models in (7.16) and (7.17), i.e. a stationary model with a non-zero level against a random walk with drift. This is not a natural comparison, and it is an assumption of the unit root test that δ = 0 under the null hypothesis, π = 0, such that the drift disappears. This restriction, however, is not imposed in estimation or testing.
A more satisfactory solution is to impose the joint hypothesis

    H_0: π = δ = 0,

i.e. to compare (7.28) with the model

    Δy_t = c_1 Δy_{t-1} + c_2 Δy_{t-2} + ε_t.    (7.29)

The joint hypothesis can be tested by a LR test,

    LR(π = δ = 0) = -2 (log L_0 - log L_A),

                       Test level: right-tail probability
    Distribution       1%        2.5%      5%        10%
    χ²(2)              9.21      7.38      5.99      4.61
    DF²_c             12.73     10.73      9.14      7.50
    DF²_l             16.39     14.13     12.45     10.56

Table 7.4: Asymptotic critical values for the likelihood ratio unit root tests for π = δ = 0 and π = γ = 0, respectively. The right-tail probability, α, corresponds to the 100·(1 - α) percentile of the asymptotic distribution.

where log L_0 and log L_A denote the log-likelihood values from the models in (7.29) and (7.28), respectively. Due to the presence of a unit root under the null hypothesis, the LR test statistic has a non-standard distribution under the null hypothesis, referred to as DF²_c. Critical values for the LR test are given in Table 7.4. Note that the test of the joint hypothesis is two-sided and rejects for large values of LR.
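
The LR statistic only requires the log-likelihood values from the two OLS regressions. A sketch under the Gaussian likelihood (Python with statsmodels; the random walk y is a placeholder for the series under study):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    y = np.cumsum(rng.standard_normal(300))  # placeholder series
    dyf = np.diff(y)

    dy = dyf[2:]                             # dependent variable, Delta y_t
    X_A = np.column_stack([np.ones_like(dy), # constant, delta
                           y[2:-1],          # y_{t-1}, coefficient pi
                           dyf[1:-1],        # Delta y_{t-1}
                           dyf[:-2]])        # Delta y_{t-2}
    X_0 = X_A[:, 2:]                         # model (7.29) under H0: pi = delta = 0

    logL_A = sm.OLS(dy, X_A).fit().llf
    logL_0 = sm.OLS(dy, X_0).fit().llf
    LR = -2.0 * (logL_0 - logL_A)
    print(LR)                                # compare with the DF2_c row of Table 7.4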

Example 7.5 (unit root in unemployment): To illustrate the use of the DF test, consider in Figure 7.4 (B) the US unemployment rate, calculated as the number of unemployed in percentage of the labour force, 1985:6-2005:6. We denote the variable UNR_t. From an economic point of view many economists would object to a linear trend in the model, and we consider a regression with a constant term,

    ΔUNR_t = δ + π UNR_{t-1} + Σ_{i=1}^{p-1} c_i ΔUNR_{t-i} + ε_t.

To satisfactorily model the time series we use an AR(5), which means that we allow for four lagged first differences in the Dickey-Fuller regression. The results are reported in Table 7.5.

The augmented Dickey-Fuller test is just the t-test, t_{π=0} = -1.94. The asymptotic distribution is DF_c, with a 5% critical value of -2.86. Based on the DF t-test we therefore cannot reject the null hypothesis of a unit root, and we conclude that the time series for US unemployment is likely to be generated as a unit-root non-stationary process.

The likelihood ratio test for the joint hypothesis LR(π = δ = 0) can be obtained by performing the regression under the null,

    ΔUNR_t = -0.105 ΔUNR_{t-1} + 0.168 ΔUNR_{t-2} + 0.225 ΔUNR_{t-3} + 0.114 ΔUNR_{t-4} + ε̂_t,
             (-1.60)           (2.60)            (3.49)            (1.79)
                                   Coefficient   Std. Error   t-value
    UNR_{t-1}                      -0.016803      0.00868     -1.94
    ΔUNR_{t-1}                     -0.106844      0.06537     -1.63
    ΔUNR_{t-2}                      0.165137      0.06422      2.57
    ΔUNR_{t-3}                      0.228119      0.06451      3.54
    ΔUNR_{t-4}                      0.123870      0.06372      1.94
    Constant                        0.087189      0.04945      1.76

    σ̂ = 0.123707                  log-likelihood = 161.37
    R² = 0.102                     T = 236

                                       Statistic   [p-val]   Distribution
    No autocorrelation of order 1-2      5.40      [0.07]    χ²(2)
    Normality                            0.79      [0.67]    χ²(2)
    No heteroskedasticity               10.77      [0.38]    χ²(4)
    Correct functional form (RESET)      1.36      [0.24]    F(1,229)

Table 7.5: Modelling ΔUNR_t by OLS for t = 1985:11-2005:6.

where the numbers in parentheses are t-values. The log-likelihood value from this regression is log L_0 = 159.06, and the likelihood ratio test is given by

    LR(π = δ = 0) = -2 (log L_0 - log L_A) = -2 (159.06 - 161.37) = 4.62,

which is much smaller than the critical value of 9.14.

The conclusion that unemployment has a unit root has important consequences from an economic point of view. It implies that shocks to the labour market have permanent effects. An explanation could be that if people get unemployed, then they gradually lose their ability to work, and it is very difficult to be reemployed. This hypothesis is known as hysteresis in the labour market.

7.6 Dickey-Fuller Test with a Trend Term


If the variable in the analysis is trending, the relevant alternative to a unit root process is in many cases trend-stationarity. The test is in this case based on the regression model augmented with a trend,

    Δy_t = δ + γt + π y_{t-1} + c_1 Δy_{t-1} + c_2 Δy_{t-2} + ε_t.    (7.30)
The hypothesis of a unit root is still H_0: π = 0, and the DF t-test is again just the t-ratio

    DF_l = π̂/se(π̂).

The presence of a trend shifts the asymptotic distribution, DF_l, further to the left as illustrated in Figure 7.4 (A). By looking at the distribution we note that even if the true value of π is zero, π = 0, the estimate, π̂, is almost always negative. This reflects the large bias when the autoregressive parameter is close to one. From Table 7.3 we see that the 5% critical value is now -3.41.

It holds again that if π = 0 then γt is accumulated to produce a quadratic trend in the model for y_t. To avoid this we may consider the joint hypothesis, H_0: π = γ = 0, i.e. to compare (7.30) with the model under the null:

    Δy_t = δ + c_1 Δy_{t-1} + c_2 Δy_{t-2} + ε_t.    (7.31)

In the model under the null it still holds that δ is accumulated to a linear trend, which exactly matches the deterministic specification under the alternative (7.30). The test is a comparison between a trend-stationary model under the alternative and a random walk with drift under the null. It seems reasonable to allow the same deterministic components under the null and under the alternative, and the joint hypothesis, H_0, is in most cases preferable in empirical applications. The joint hypothesis can be tested by a LR test,

    LR(π = γ = 0) = -2 (log L_0 - log L_A),

where log L_0 and log L_A again denote the log-likelihood values from the two relevant models. The asymptotic distribution of LR(π = γ = 0) is DF²_l, with critical values reported in Table 7.4.

Example 7.6 (unit root in productivity): To formally test whether the Danish productivity in Example 7.2 is trend-stationary, we want to test for a unit root in the model in Table 7.1. One possibility is to use the t-test

    DF_l = (θ̂ - 1)/se(θ̂) = (0.561273 - 1)/0.07056 = -6.22,

from Table 7.1. This is much smaller than the critical value, and we clearly reject the null hypothesis of a unit root, thus concluding that productivity appears to be a trend-stationary process. We note that exactly the same result would have been obtained if we considered the transformed regression

    ΔLPROD_t = 0.091 + 0.0024·t - 0.439·LPROD_{t-1} + ε̂_t,
               (6.58)  (6.15)    (-6.22)

where we recognize the t-ratio in parentheses.


To test the joint hypothesis, H_0: π = γ = 0, we may run the regression under the null,

    ΔLPROD_t = 0.0057 + ε̂_t,
               (3.48)

which gives a log-likelihood value of 348.63. The likelihood ratio test is given by

    LR(π = γ = 0) = -2 (log L_0 - log L_A) = -2 (348.63 - 366.09) = 34.92,

which strongly rejects compared to the critical value of 12.45.

Example 7.7 (unit root in consumption): We also test the hypothesis that private consumption in Example 7.3 is trend-stationary. Since the regression in that case is an AR(2), there is no way to derive the test statistic from the output of Table 7.2 alone. Instead we run the equivalent regression in first differences,

    ΔLCONS_t = 0.764 + 0.0004·t - 0.129·LCONS_{t-1} - 0.209·ΔLCONS_{t-1} + ε̂_t,
               (2.57)  (2.58)    (-2.56)            (-2.43)

which produces the same likelihood as in Table 7.2, log L_A = 359.23. The Dickey-Fuller t-test is given by DF_l = -2.56, which is not significant in the DF_l distribution (but close to the critical value). We conclude that private consumption seems to behave as a unit-root non-stationary process.

To test the joint hypothesis, H_0: π = γ = 0, we use the regression under the null,

    ΔLCONS_t = 0.0046 - 0.274·ΔLCONS_{t-1} + ε̂_t,
               (2.97)   (-3.29)

with a log-likelihood value of 355.869. For the consumption series, the likelihood ratio test for a unit root is given by

    LR(π = γ = 0) = -2 (log L_0 - log L_A) = -2 (355.869 - 359.23) = 6.722,

and we conclude in favour of a unit root.

7.7 Further Issues in Unit Root Testing


This section contains some concluding remarks on unit root testing.

7.7.1 The Problem of Low Power


The decision on the presence of unit roots or not is important. From an economic point of view it is important to know whether shocks have permanent effects or not, and from a statistical point of view it is important to choose the appropriate statistical tools. It should be noted, however, that the decision is difficult in real-life situations. We often say that the unit root test has low power to distinguish a unit root from a large (but stationary) autoregressive root.
To illustrate the problem, consider two time series, generated as

    Δy_t = -0.2·y_{t-1} + 0.05·t + ε_t    (7.32)
    Δx_t = 0.25 + ε_t.    (7.33)

Process y_t is trend-stationary, while x_t is a random walk with drift. Figure 7.5 (A) depicts T = 100 observations from one realization of the processes. We note that the two series are very alike, and from a visual inspection it is impossible to distinguish the unit root from the trend-stationary process. This illustrates that in small samples, a trend-stationary process can be approximated by a random walk with drift, and vice versa. That makes unit root testing extremely difficult! Figure 7.5 (B) shows the same series, but now extended to T = 500 observations. For the long sample the difference is clear: y_t has an attractor, x_t does not.

Using equation (7.32) as a Monte Carlo DGP, we can illustrate the power of the unit root hypothesis, H_0: π = 0, in the model

    Δy_t = δ + γt + π y_{t-1} + ε_t,

i.e. how often π = 0 is rejected given that the true value is π = -0.2. Similarly we can use equation (7.33) as a DGP to illustrate the size of the test, i.e. how often π = 0 is rejected if it is true in the DGP. Figure 7.5 (C) depicts the size and power of H_0 as a function of the number of observations. All tests are performed at a 10% level, so we expect the size to converge to 10% and the power to converge to 100% as T diverges. We note that the actual size is too large in small samples, such that we reject a true hypothesis too often. As the number of observations increases, the actual size converges to 10%. The power is increasing relatively slowly in the number of observations. To reject the false unit root hypothesis 50% of the time, we need close to 100 observations. In small samples, e.g. T = 50, it is extremely difficult to tell the two processes, (7.32) and (7.33), apart.
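
The size and power points in Figure 7.5 (C) can be approximated with a short simulation. A sketch (Python with statsmodels; the rejection rule uses the asymptotic 10% critical value -3.13 for DF_l from Table 7.3, and the replication count is kept small for speed):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(13)

    def simulate(T, unit_root):
        y = np.zeros(T)
        for t in range(1, T):
            if unit_root:
                y[t] = 0.25 + y[t - 1] + rng.standard_normal()            # (7.33)
            else:
                y[t] = 0.05 * t + 0.8 * y[t - 1] + rng.standard_normal()  # (7.32)
        return y

    def df_trend(y):
        # DF regression with constant and trend, no augmentation lags
        T = len(y)
        X = np.column_stack([np.ones(T - 1), np.arange(1, T), y[:-1]])
        return sm.OLS(np.diff(y), X).fit().tvalues[2]

    for ur in (True, False):
        rej = np.mean([df_trend(simulate(100, ur)) < -3.13 for _ in range(500)])
        print('size' if ur else 'power', rej)

With T = 100 the rejection frequency under the stationary DGP (the power) is well below 100%, in line with the figure.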
[Figure 7.5 here: (A) a trend-stationary process Δy_t = -0.2·y_{t-1} + 0.05·t + ε_t and a random walk with drift Δx_t = 0.25 + ε_t, T = 100; (B) the same processes for T = 500; (C) rejection frequency of the DF test: power for θ = 0.8 and size for θ = 1.]

Figure 7.5: Low power of unit root tests.

There is a large literature that tries to construct unit root tests with a higher power than the Dickey-Fuller type test presented above; see Haldrup and Jansson (2006) for a survey. The most famous test is developed by Elliott, Rothenberg, and Stock (1996) (ERS), and this test is often used in applications. The idea is that the role of the initial value is fundamentally different for stationary and non-stationary variables, and (correct) assumptions on the distribution of the initial value may improve the power to distinguish stationary and unit root processes. In practice the ERS test is based on an initial de-trending of the variable, x̃_t = x_t - β̃_0 - β̃_1·t, where β̃_0 and β̃_1 are not OLS estimates, but more complicated estimates that take the initial value into account. The null hypothesis of a unit root is then tested on the de-trended variable, x̃_t, using a Dickey-Fuller type test. The implicit assumption on the initial value is not easy to test, however, and if it is not in accordance with the data, then the test may have lower power than the simple DF test conditional on the initial values, see Nielsen (2008).

7.7.2 Importance of Special Events


The core of a unit root test is to assess whether shocks have transitory or permanent effects, and that conclusion is very sensitive to a few large shocks.

As discussed above, a stationary time series with a level shift is no longer stationary. A unit root test applied to a time series like the one in Figure 7.1 (B) is therefore likely not to reject the unit root. From an intuitive point of view the process in graph (B) can be approximated by a random walk, while a stationary autoregression will never be able to change its level to track the observed time series. That will bias the conclusion towards the finding of unit roots. If an observed time series has a level shift, the only solution is to model it using dummy variables, and to test for a unit root in a model like (7.10). This is complicated, however, because the distribution of the unit root test depends on the presence of the dummy in the regression model, and new critical values have to be used.

If the time series has many large isolated outliers, the effect on the unit root test will be the opposite. Large outliers make the time series look more stable than it actually is, and that will bias the test towards stationarity. One solution may again be to model the outliers with dummy variables.

7.8 Further Readings


The literature on unit root testing is huge, and most references are far more technical than the present course. Textbooks on time series econometrics with sections on unit root testing include Patterson (2000), Enders (2004), Banerjee, Dolado, Galbraith, and Hendry (1993), and Hamilton (1994), where the latter is rather technical. The review of the unit root literature in Maddala and Kim (1998) is a good starting point for further reading.

Appendix:

7.A Solution for an I(1) Process


To illustrate the structure of the solution for an I(1) process of order higher than one, consider the AR(2) model,

    x_t = θ_1 x_{t-1} + θ_2 x_{t-2} + ε_t
    (1 - θ_1 L - θ_2 L²) x_t = ε_t.

Now factorize the characteristic polynomial,

    1 - θ_1 z - θ_2 z² = (1 - φ_1 z)(1 - φ_2 z),

where φ_1 and φ_2 are the inverse roots. If the process x_t is an I(1) process, it has exactly one unit root, and it holds that φ_1 = 1 and |φ_2| < 1, i.e.

    (1 - L)(1 - φ_2 L) x_t = ε_t.

Because |φ_2| < 1, the second factor is invertible,

    1/(1 - φ_2 z) = 1 + φ_2 z + φ_2² z² + φ_2³ z³ + ...,

and using that (1 - L) = Δ we get

    Δx_t = (1 - φ_2 L)^{-1} ε_t = ε_t + φ_2 ε_{t-1} + φ_2² ε_{t-2} + φ_2³ ε_{t-3} + ...,

where the sequence of coefficients φ_2, φ_2², φ_2³, ... converges to zero when |φ_2| < 1.
where the sequence of coe¢ cients 2 ; 22 ; 32 ; ::: converges to zero when j 2 j < 1.
Now first truncate the expression after ε_{t-3}, and rewrite the expression by repeatedly adding zero terms:

    Δx_t = ε_t + φ_2 ε_{t-1} + φ_2² ε_{t-2} + φ_2³ ε_{t-3}
         = ε_t + φ_2 ε_{t-1} + (φ_2² + φ_2³)ε_{t-2} + φ_2³(ε_{t-3} - ε_{t-2})
         = ε_t + (φ_2 + φ_2² + φ_2³)ε_{t-1} + (φ_2² + φ_2³)(ε_{t-2} - ε_{t-1}) + φ_2³(ε_{t-3} - ε_{t-2})
         = (1 + φ_2 + φ_2² + φ_2³)ε_t + (φ_2 + φ_2² + φ_2³)(ε_{t-1} - ε_t)
           + (φ_2² + φ_2³)(ε_{t-2} - ε_{t-1}) + φ_2³(ε_{t-3} - ε_{t-2}),

or

    Δx_t = c ε_t + c_0 Δε_t + c_1 Δε_{t-1} + c_2 Δε_{t-2},

where we are left with one term in levels, c ε_t, and the rest in first differences, c_i Δε_{t-i}. This way of rewriting the process also holds if we do not truncate, and we can write the process as

    Δx_t = ε_t + φ_2 ε_{t-1} + φ_2² ε_{t-2} + φ_2³ ε_{t-3} + ...
         = c ε_t + c_0 Δε_t + c_1 Δε_{t-1} + c_2 Δε_{t-2} + ...
         = c ε_t + c*(L) Δε_t,

where c = (1 + φ_2 + φ_2² + φ_2³ + ...) and c*(z) = c_0 + c_1 z + c_2 z² + ... is a complicated infinite polynomial where the coefficients c_1, c_2, c_3, ... converge to zero.
But this means that the level of the process can be found as

    x_t = Σ_{i=1}^t Δx_i + x_0
        = Σ_{i=1}^t (c ε_i + c*(L) Δε_i) + x_0
        = c Σ_{i=1}^t ε_i + Σ_{i=1}^t (c*(L) ε_i - c*(L) ε_{i-1}) + x_0
        = c Σ_{i=1}^t ε_i + Σ_{i=1}^t (S_i - S_{i-1}) + x_0,

where S_t = c*(L) ε_t is a stationary process. Most of the terms in the second sum cancel, and the expression simplifies to

    x_t = c Σ_{i=1}^t ε_i + S_t - S_0 + x_0,

which is a random walk component, c Σ_{i=1}^t ε_i, plus a stationary process, S_t = c_0 ε_t + c_1 ε_{t-1} + c_2 ε_{t-2} + ..., and contributions from the initial values, x_0 - S_0.
Chapter 8

Analysis of Non-Stationary and Co-integrated Time Series

In this chapter we discuss some important issues in regression models for non-stationary time series. It is illustrated how linear combinations of non-stationary time series are non-stationary in general, and co-integration is defined as the special case where a linear combination is stationary. We emphasize that relations between non-stationary variables can only be interpreted as defining an equilibrium if the variables co-integrate, and we discuss error-correction as the force that sustains the equilibrium relation. We then present some single-equation tools for co-integration analysis, namely the so-called Engle-Granger two-step procedure and co-integration analysis based on ADL models. We show how to estimate the co-integrating parameters and how to test the hypothesis of no co-integration.

8.1 Introduction and Main Statistical Tools


If two time series, x_{1t} and x_{2t} for t = 1, 2, ..., T, are both unit root non-stationary, then a linear combination, x_{1t} - β_2 x_{2t}, will also be non-stationary in general, and to model the relationship between them, we have so far considered first differences, Δx_{1t} = x_{1t} - x_{1,t-1} and Δx_{2t} = x_{2t} - x_{2,t-1}. The first-difference transformation eliminates the stochastic trend and allows applications of the law of large numbers and the central limit theorem, such that standard results for autoregressive distributed lag (ADL) models and vector autoregressive (VAR) models apply. The cost of the transformation is that much information in the levels of the processes is lost, and we can characterize the short-term relationship only.

In some cases, however, the linear combination x_{1t} - β_2 x_{2t} also eliminates the stochastic trends and becomes stationary. We then say that x_{1t} and x_{2t} are co-integrated, and the linear combination x_{1t} - β_2 x_{2t} can be thought of as defining a stable equilibrium between non-stationary and trending time series. This concept is not new or surprising, and examples are well-known in economics; a short-maturity interest rate, s_t, and a long-maturity interest rate, l_t, may both look non-stationary, while the interest rate spread, l_t - s_t, behaves in a more stable and stationary manner. In this example, the two interest rates co-integrate and the interest rate spread constitutes an equilibrium relationship. For the interest rate spread, the coefficient of co-integration, β_2 = 1, is known, and it is straightforward to build models (e.g. an ADL or VAR) including the co-integrating interest-rate spread.

Even if the coefficient β_2 is unknown, we can still build statistical models for the non-stationary variables that allow us to estimate β_2 and embed the co-integrating relation, x_{1t} - β_2 x_{2t}, as an equilibrium relationship, such that the non-stationary variables, x_{1t} and x_{2t}, adjust to sustain the equilibrium. This can be formulated as the error-correction form of an ADL model or in a VAR model.

Below we first discuss the mathematical structure of non-stationary variables and formally define co-integration. We then present an important distributional result that shows that many hypotheses can still be tested using standard asymptotics, even if the variables in the model are non-stationary. Next, we show how to perform a co-integration analysis of non-stationary variables based on an ADL model. This involves estimation of the co-integration parameter β_2 and a formal test of whether the non-stationary variables co-integrate or not. In the next chapter, we consider the implications of co-integration in a vector autoregression.

8.2 Mathematical Structure of Co-integration


In this section we look at linear combinations of unit root non-stationary time series and define the concept of co-integration. To simplify the notation we consider the case of p = 2 variables in most of the presentation below, but the discussion is easily extended to more variables.

Consider an autoregressive time series

    y_t = θ_1 y_{t-1} + θ_2 y_{t-2} + ... + θ_k y_{t-k} + ε_t.

Recall from Chapter 7 and, in particular, the derivation in Appendix §7.A, that if y_t has a unit root and is integrated of order one, I(1), then the moving average solution can always be written as a random walk component, a stationary process, and a contribution from the initial values, i.e.

    y_t = c Σ_{i=1}^t ε_i + c_0 ε_t + c_1 ε_{t-1} + c_2 ε_{t-2} + ... + c_{t-1} ε_1 + A
        = τ_t + S_t + A,    (8.1)

where τ_t = c Σ_{i=1}^t ε_i is a random walk, often called the stochastic trend of y_t, S_t = c_0 ε_t + c_1 ε_{t-1} + c_2 ε_{t-2} + ... + c_{t-1} ε_1 is a stationary process, and A depends on the initial values.
Now let x_{1t} and x_{2t} be two time series that are integrated of first order. We can write the two processes on the form

    x_{1t} = τ_{1t} + S_{1t} + A_1    (8.2)
    x_{2t} = τ_{2t} + S_{2t} + A_2.    (8.3)

Next define the linear combination, z_t = β'X_t, where X_t is a vector of variables, and β is a vector of weights in the linear combination, i.e.

    X_t = (x_{1t}, x_{2t})' and β = (1, -β_2)'.

Inserting (8.2) and (8.3), we can write the linear combination as

    z_t = β'X_t = x_{1t} - β_2 x_{2t}
        = (τ_{1t} - β_2 τ_{2t}) + (S_{1t} - β_2 S_{2t}) + (A_1 - β_2 A_2).    (8.4)

We note that z_t contains the random walk component, τ_{1t} - β_2 τ_{2t}, and in most cases z_t will also be I(1).

An exception from this result is if there exists a vector, β, such that z_t defined in (8.4) is a stationary process. This property is denoted co-integration.

Definition 8.1 (co-integration): Let X_t = (x_{1t}, x_{2t}, ..., x_{pt})' ∈ R^p be an I(1) vector of time series, t = 1, 2, ..., T. The variables in X_t are said to be co-integrated if there exists a co-integration vector, β, such that β'X_t is stationary. For p > 2, there may be more than one co-integration vector (β_1, β_2, ..., β_r), r < p.

Note that for co-integration to exist we need the stochastic trends, τ_{it}, to be common, i.e. generated by the same underlying random walk, τ_t = Σ_{i=1}^t ε_i, i.e.

    τ_{1t} = c_1 τ_t = c_1 Σ_{i=1}^t ε_i and τ_{2t} = c_2 τ_t = c_2 Σ_{i=1}^t ε_i.

If we choose β_2 = c_1/c_2 we have from (8.4) that

    z_t = β'X_t = (c_1 - (c_1/c_2)·c_2) τ_t + (S_{1t} - β_2 S_{2t}) + (A_1 - β_2 A_2),

where the first term is zero, the second term is a stationary process, and the last term depends on the initial values. The common stochastic trend, τ_t, cancels and z_t is a stationary process.

To interpret the relationship between x_{1t} and x_{2t}, i.e. β'X_t, we may solve for x_{1t} and consider a regression-type formulation,

    x_{1t} = β_2 x_{2t} + z_t,    (8.5)

where z_t is stationary by co-integration. The interpretation is now that if x_{2t} is increased with one unit, x_{1t} is increased with β_2 units in equilibrium, and we may talk about β_2 as the long-run effect on x_{1t} of a change in x_{2t}.

Because (8.5) describes the long-run relationship between the variables, it is natural to think of the co-integrating relation as defining an economic equilibrium: The variables themselves, x_{1t} and x_{2t}, wander arbitrarily far up and down due to the stochastic trends, but they never deviate too much from equilibrium. When the variables co-integrate, we can define x*_{1t} = β_2 x_{2t}, and we will refer to x*_{1t} as the equilibrium value of x_{1t}, and

    x_{1t} - x*_{1t} = x_{1t} - β_2 x_{2t}

is the stationary deviation from equilibrium. The equilibrium value can be interpreted as the value at which there is no inherent tendency for x_{1t} to move away, but it is important to realize that because the economy is continuously hit by shocks, the system will never settle down at x*_{1t}, and x_{1t} will not converge to x*_{1t} in any sense.

Remark 8.1 (deterministic terms): In the definition of co-integration above, we have left out any deterministic terms, such that z_t has expectation zero. We could easily extend the formulation such that E(z_t) = μ, considering

    x_{1t} = μ + β_2 x_{2t} + u_t = x*_{1t} + u_t,    (8.6)

where x*_{1t} = μ + β_2 x_{2t} is the equilibrium value and u_t is the deviation from equilibrium.

We can also extend to other specifications of deterministic variables. As an example we might believe that z_t is stationary around a deterministic linear trend. This would be the case if x_{1t} and x_{2t} contain both deterministic and stochastic trends, and the linear combination, β'X_t, cancels the stochastic trends but not the deterministic trends. To model this case we can extend (8.5) with a deterministic trend term, e.g.

    x_{1t} = μ_0 + μ_1 t + β_2 x_{2t} + u_t = x*_{1t} + u_t,    (8.7)

where x*_{1t} = μ_0 + μ_1 t + β_2 x_{2t} is the equilibrium value. The interpretation is that x_{1t} - β_2 x_{2t} is trend-stationary, i.e. stationary around the linear trend, μ_0 + μ_1 t. The deviation, u_t, is a mean-zero stationary process. Similarly, linear combinations could be stationary around other deterministic components, e.g. level shifts.

Remark 8.2 (normalization): In the discussion above, we have imposed a so-called normalization on the first coefficient in the co-integration vector,

    β = (1, -β_2)'.

This normalization is natural if we have a relation of the form (8.5) in mind. It is important to realize, however, that if β_2 ≠ 0, then we could equally well have chosen the opposite normalization,

    β̃ = (-β̃_1, 1)',

that would correspond to an equation with x_{2t} on the left-hand side,

    x_{2t} = β̃_1 x_{1t} + z̃_t,

and z̃_t = x_{2t} - β̃_1 x_{1t} is a stationary process.

8.2.1 Example of a Data Generating Process


As an example of a data generating process (DGP) that generates co-integrated variables, consider X_t = (c_t, y_t)' and the following system:

    c_t = μ + β_2 y_{t-1} + ε_{c,t}    (8.8)
    Δy_t = ε_{y,t},    (8.9)

where c_t denotes consumption, y_t denotes income and ε_{c,t} and ε_{y,t} are i.i.d. and uncorrelated error processes. Observe that c_t is a very simplified ADL model with a constant term.

To understand the dynamic structure, we first solve (8.9) to find the MA solution for income,

    y_t = Σ_{i=1}^t ε_{y,i} + y_0,    (8.10)

which is a random walk, and note that the lagged value is given by

    y_{t-1} = Σ_{i=1}^{t-1} ε_{y,i} + y_0 = Σ_{i=1}^t ε_{y,i} - ε_{y,t} + y_0.

We then solve (8.8) to find the MA solution for consumption,

    c_t = μ + β_2 y_{t-1} + ε_{c,t}
        = μ + β_2 (Σ_{i=1}^t ε_{y,i} - ε_{y,t} + y_0) + ε_{c,t}
        = μ + β_2 Σ_{i=1}^t ε_{y,i} + (ε_{c,t} - β_2 ε_{y,t}) + β_2 y_0.    (8.11)

Here β_2 y_0 is the initial value, ε_{c,t} - β_2 ε_{y,t} is a stationary process, and

    τ_t = Σ_{i=1}^t ε_{y,i}

is a random walk component. In this case the random walk components are the same in (8.10) and (8.11), and c_t and y_t are both integrated of order 1, I(1), but co-integrate with co-integration vector β = (1, -β_2)'. In particular we find

    z_t = β'X_t = c_t - β_2 y_t = μ + ε_{c,t} - β_2 ε_{y,t},

which is stationary with E(z_t) = μ. Note that in this example, the shock to income, ε_{y,t}, has permanent effects on both income and consumption, while the shock to consumption, ε_{c,t}, has only transitory effects.
We can also write the moving average solution on matrix form,

    (c_t, y_t)' = C (Σ_{i=1}^t ε_{c,i}, Σ_{i=1}^t ε_{y,i})' + C_0 (ε_{c,t}, ε_{y,t})' + (μ, 0)' + C (c_0, y_0)',

with C = (0, β_2; 0, 1) and C_0 = (1, -β_2; 0, 0), or simply

    X_t = C Σ_{i=1}^t ε_i + C_0 ε_t + μ̃ + A,    (8.12)

with ε_t = (ε_{c,t}, ε_{y,t})', μ̃ = (μ, 0)', and A collecting the initial values, which has the same form as the univariate case in (8.1). Now observe that the 2×2 matrix loading the random walk, C, has reduced rank, rank(C) = 1, and the co-integration vector, β = (1, -β_2)', has the property that it eliminates the random walk,

    β'C = (1, -β_2)(0, β_2; 0, 1) = (0, 0),

such that β'X_t is a stationary process.

To make the situation more realistic, the dynamics of both equations could of course be made more complicated.
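
The DGP (8.8)-(8.9) can be simulated in a few lines, and the stationarity of z_t = c_t - β_2 y_t checked directly. A sketch (Python; μ = 1 and β_2 = 0.8 are illustrative values, as are T and the seed):

    import numpy as np

    rng = np.random.default_rng(21)
    T, mu, beta2 = 500, 1.0, 0.8
    eps_c = rng.standard_normal(T)
    eps_y = rng.standard_normal(T)

    y = np.cumsum(eps_y)                 # income: random walk, eq. (8.9), y_0 = 0
    c = np.zeros(T)
    c[0] = mu + eps_c[0]                 # initialization of consumption
    for t in range(1, T):
        c[t] = mu + beta2 * y[t - 1] + eps_c[t]   # consumption, eq. (8.8)

    z = c - beta2 * y                    # beta'X_t: the common trend cancels
    print(z.mean(), z.std())             # fluctuates around mu, bounded variance

Both c and y wander without bound, while the linear combination z stays around μ.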
8.2 Mathematical Structure of Co-integration 191

8.2.2 Some Economic Examples


To illustrate the idea of co-integration and equilibrium relationships, we consider
some economic examples.

Example 8.1 (purchasing power parity): Let x_{1t} = log(E_t) denote the log of the bilateral exchange rate between Dollar and Euro (denominated as Dollar per Euro), and let x_{2t} = log(P_t^US) - log(P_t^EU) denote the corresponding difference between the logs of the consumer prices. Then

    z_t = x_{1t} - x_{2t} = log(E_t) - log(P_t^US) + log(P_t^EU) = log(E_t P_t^EU / P_t^US)

is the relative deviation from purchasing power parity (PPP) between the US and the Euro area. For most countries consumer prices and exchange rates appear non-stationary, and if the deviation from PPP is stationary we can think of PPP as a valid equilibrium relation for the parity between the US and the Euro area. In this case β = (1, -1)' would be a co-integrating vector for X_t = (x_{1t}, x_{2t})'. If, on the other hand, the deviation, z_t, is non-stationary, it means that the price differential can wander arbitrarily far from the PPP value and there is no equilibrium interpretation of the PPP.

Below we develop a strategy for testing for the presence of co-integration. This co-integration test can be used to analyze if the theory underlying PPP is empirically valid.

Example 8.2 (prices on the orange market): As an empirical example, Figure 8.1 (A) illustrates the price of organic and regular oranges, p_t^org and p_t^reg, in pence per lb., while graph (B) illustrates the price differential, p_t^org - p_t^reg. The individual prices in graph (A) are obviously non-stationary and a possible interpretation is that the non-stationarity is driven by stochastic trends. The prices show strong co-movements, however, and the price differential looks much more stable and could be a sample path from a stationary process. This suggests that the relation

    p_t^org = μ + p_t^reg + u_t

defines an equilibrium for the orange market, where μ is the additional price of organic oranges in equilibrium, and u_t is the deviation from equilibrium in period t. Note again that p_t^org - p_t^reg will not equal μ in any specific period and p_t^org - p_t^reg will not approach μ as T → ∞. The equilibrium concept refers to the fact that fluctuations of p_t^org - p_t^reg around μ will be stationary, as suggested by graph (B).

Example 8.3 (private consumption): Similarly, Figure 8.1 (C) illustrates the log of real private consumption in Denmark, c_t, the log of real disposable income, y_t, and the log of real private wealth including the value of owner-occupied housing, w_t (we have subtracted 2 from w_t in the graph to make the levels comparable). All three time series are clearly trending. The series for consumption and income have many similarities and co-move in some periods. Deviations from this pattern seem to occur primarily when there are large fluctuations in private wealth. People familiar with the Danish business cycle will recognize the peak in private wealth in 1986 as the result of a boom in the housing market, which apparently drove up the consumption-to-income ratio. The time series behavior, as well as simple economic theory, suggests that consumption depends on both income and wealth, and graph (D) depicts the deviation, u_t = c_t - c*_t, from a simple consumption function

    c_t = -0.404 + 0.364·y_t + 0.516·w_t + u_t.

We note that the deviation, u_t, looks much more stable than the variables themselves, suggesting that β = (1, -0.364, -0.516)' may be a co-integrating vector for X_t = (c_t, y_t, w_t)'. Whether the deviation, u_t, actually corresponds to a stationary process is a testable hypothesis to which we return in §8.6.3.

Example 8.4 (money demand): To estimate a long-run money demand relation we may consider the variables X_t = (m_t, y_t, r_t, b_t)', where m_t is the real money stock (in logs), y_t is real income (in logs), r_t is the short interest rate as a measure of the yield of holding money, while b_t is the bond rate measuring the yield on holdings alternative to money. Some theories suggest that in the long run the demand for money is given by

    m_t = y_t - ω (b_t - r_t),

such that money demand increases with the amount of transactions, measured by y_t, and decreases with the opportunity cost of holding money, b_t - r_t. This suggests that β = (1, -1, -ω, ω)' could be a co-integrating vector for the variables in X_t.

Alternatively, theories for the determination of interest rates would suggest that the spread between two interest rates with different maturities should be stable, and also the velocity, y_t - m_t, may be stationary. That suggests a different scenario with two co-integration relations:

    β_1'X_t = (0, 0, -1, 1)·(m_t, y_t, r_t, b_t)' = b_t - r_t
    β_2'X_t = (1, -1, 0, 0)·(m_t, y_t, r_t, b_t)' = m_t - y_t.

It is an empirical question which of the scenarios (if any) characterizes a data set.
[Figure 8.1 here: (A) price of oranges (pence per lb.); (B) price differential; (C) real consumption and income, logs; (D) deviation from consumption function.]

Figure 8.1: Examples of some possibly co-integrated series. (A): Price of organic oranges, p_t^org, and regular oranges, p_t^reg, measured in pence per lb. (B): The orange price differential, p_t^org - p_t^reg. (C): Real aggregate consumption, c_t, disposable income, y_t, and private wealth, w_t, in logs. (D): A linear combination of consumption, income and wealth given by u_t = c_t - 0.364·y_t - 0.516·w_t + 0.404.

8.3 How is the Equilibrium Sustained?


In the previous section we de…ned co-integration of a vector a variables, Xt 2 Rp ,
e.g. Xt = (x1t ; x2t )0 2 R2 , as the existence of a vector such that the combination,
zt = 0 Xt , is a stationary process, and we emphasized the relationship between the
statistical concept of co-integration and the economic concept of equilibrium.
Logically, an equilibrium requires the existence of some forces in the DGP, which
ensures that the non-stationary variables, x1t and x2t , do not move too far away from
equilibrium. In this section, we present error-correction as a way of describing these
forces, and we discuss how co-integration and error-correction are two complementary
ways of characterizing the same phenomenon.
As an introductory example, consider …rst the simple case in (8.8), where ct is
consumption and yt is income. The simple ADL model in (8.8) can be written also
as an error-correction model using the usual manipulations:

    c_t = μ + β_2 y_{t-1} + ε_{c,t}
    c_t - c_{t-1} = -c_{t-1} + μ + β_2 y_{t-1} + ε_{c,t}
    Δc_t = -(c_{t-1} - μ - β_2 y_{t-1}) + ε_{c,t}.    (8.13)

The expression in brackets is the deviation from equilibrium at time t - 1, where the equilibrium value is given by

    c*_{t-1} = μ + β_2 y_{t-1}.

The equation states that if consumption is larger than its equilibrium value, c_{t-1} > c*_{t-1}, there will be a downward pressure on consumption. This is known as error-correction or equilibrium correction.

Theorem 8.1 (Granger representation theorem): Two unit-root non-stationary variables, x_{1t} and x_{2t}, are co-integrated if and only if there exists an error-correction model for either x_{1t}, x_{2t} or both.

In most cases, the structure of error-correction is more complicated than the simple example in (8.13). As an example, the error-correction models for x_{1t} and x_{2t} could look like

    Δx_{1t} = α_1 (x_{1,t-1} - β_2 x_{2,t-1}) + Γ_11 Δx_{1,t-1} + Γ_12 Δx_{2,t-1} + ε_{1t}
    Δx_{2t} = α_2 (x_{1,t-1} - β_2 x_{2,t-1}) + Γ_21 Δx_{1,t-1} + Γ_22 Δx_{2,t-1} + ε_{2t},

where the contemporaneous effect is captured by corr(ε_{1t}, ε_{2t}). It is important to observe that x_{1t} and x_{2t} error correct to the same equilibrium,

    β'X_{t-1} = x_{1,t-1} - β_2 x_{2,t-1}.

Also observe that x_{1t} error corrects if α_1 < 0. To see this, imagine that x_{1,t-1} is above equilibrium such that x_{1,t-1} - β_2 x_{2,t-1} is positive. For x_{1t} to move towards the equilibrium we need Δx_{1t} < 0, which requires α_1 < 0. If x_{1t} error corrects, the magnitude of α_1 measures the proportion of the deviation that is corrected each period, and α_1 is sometimes referred to as the speed of adjustment. As an example, a value of α_1 = -0.5 would indicate that half of a deviation from equilibrium is removed each period. If β_2 > 0, the same line of argument implies that α_2 > 0 is consistent with error correction of x_{2t}. With β_2 < 0, error correction of x_{2t} requires that α_2 < 0.
To intuitively understand the link between co-integration and error correction, notice that under the maintained assumptions, Δx_{1t}, Δx_{1,t-1}, Δx_{2t}, Δx_{2,t-1}, ε_{1t}, and ε_{2t} are all stationary terms. Since x_{1t} and x_{2t} are assumed to be I(1), the two error-correction equations are only balanced in terms of the order of integration if the combination x_{1,t-1} - β_2 x_{2,t-1} is stationary, i.e. if the variables co-integrate. If the variables do not co-integrate, then it must hold that x_{1,t-1} - β_2 x_{2,t-1} is non-stationary, and the only way to balance the equations would be to have α_1 = α_2 = 0, removing error-correction.

We may also write the two error-correction equations as the so-called vector error-correction model,

    (Δx_{1t}, Δx_{2t})' = (α_1, α_2)' (x_{1,t-1} - β_2 x_{2,t-1}) + Γ_1 (Δx_{1,t-1}, Δx_{2,t-1})' + (ε_{1t}, ε_{2t})',

or simply

    ΔX_t = αβ'X_{t-1} + Γ_1 ΔX_{t-1} + ε_t,    (8.14)

where we have used the definitions

    α = (α_1, α_2)',  β = (1, -β_2)',  and  Γ_1 = (Γ_11, Γ_12; Γ_21, Γ_22).

We note that β is the co-integration vector and that the lagged deviation from the co-integrating relation,

    β'X_{t-1} = x_{1,t-1} - β_2 x_{2,t-1},    (8.15)

appears as an explanatory variable in both equations.

Example 8.5 (simulated co-integrated series): To illustrate the graphical implications of co-integration and error correction, we consider a simple model for two co-integrated variables,

    (Δx_{1t}, Δx_{2t})' = (-0.2, 0.1)' (x_{1,t-1} - x_{2,t-1}) + (ε_{1t}, ε_{2t})',    (8.16)

where ε_{1t} and ε_{2t} are independent standard normals, N(0,1). Here β = (1, -1)' is a co-integrating vector and both variables error correct, with speeds of adjustment given by α = (-0.2, 0.1)'. One realization of x_{1t} and x_{2t} (t = 1, 2, ..., 100) generated from the DGP in (8.16) is illustrated in Figure 8.2 (A). Notice the strong co-movement between the variables, which reflects that they have the same stochastic trend. Graph (B) depicts the deviation from the long-run relation,

    z_t = β'X_t = x_{1t} - x_{2t}.
[Figure 8.2 here: (A) two co-integrated variables; (B) deviation from equilibrium, β'x_t = x_{1t} - x_{2t}; (C) speed of adjustment; (D) cross-plot of x_{1t} against x_{2t}.]

Figure 8.2: Simulated series to illustrate co-integration and error-correction.

The series z_t is relatively persistent and is often above or below equilibrium for longer periods of time. This illustrates the moderately slow error-correction in (8.16).

In graph (C) we illustrate the speed of adjustment. We consider a large deviation z_t = β'X_t = 10 in a particular period and show the adjustment towards equilibrium in a situation where no shocks hit the system. In the present case the deviation from β'X_t is visible for approximately 10 periods and the convergence is exponential. It is the equilibrating force in graph (C) that ensures that the levels in graph (A) do not move too far apart.

Next, graph (D) depicts a cross-plot of x_{1t} on x_{2t}. The variables are non-stationary and will wander arbitrarily on the real axis. Co-integration (i.e. the force implied by error-correction) implies that the observations will never move too far from the equilibrium defined by the straight line.

Finally, observe that the most recent observation is far from equilibrium, β'X_{100} > 0. If we were to make an out-of-sample forecast of the series, X_{101}, X_{102}, ..., then we would conjecture that X_t would be drawn towards equilibrium, i.e. that either x_{1t} would decrease or that x_{2t} would increase to close the gap.
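
A sketch generating the system (8.16) (Python; the seed is arbitrary and T = 100 matches the figure):

    import numpy as np

    rng = np.random.default_rng(8)
    T = 100
    alpha = np.array([-0.2, 0.1])          # speeds of adjustment
    X = np.zeros((T, 2))                   # columns: x_{1t} and x_{2t}
    for t in range(1, T):
        z = X[t - 1, 0] - X[t - 1, 1]      # beta'X_{t-1} = x_{1,t-1} - x_{2,t-1}
        X[t] = X[t - 1] + alpha * z + rng.standard_normal(2)

    z_path = X[:, 0] - X[:, 1]             # the stationary equilibrium error
    print(z_path.std())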
Example 8.6 (prices on the orange market, continued): For the case of the organic and regular oranges, assume co-integration such that

    β'X_{t-1} = (1, -1)(p_{t-1}^org, p_{t-1}^reg)' = p_{t-1}^org - p_{t-1}^reg

is stationary. An estimation of the two error-correction equations gives

    Δp_t^org = 22.500 - 0.900·(p_{t-1}^org - p_{t-1}^reg) + ε̂_t^org
              (1.665)  (0.081)

    Δp_t^reg = 1.136 - 0.008·(p_{t-1}^org - p_{t-1}^reg) + ε̂_t^reg,
              (0.634)  (0.031)

where the numbers in parentheses are standard errors of the estimated coefficients. We can write the system as a vector error-correction model,

    (Δp_t^org, Δp_t^reg)' = (22.500, 1.136)' + (-0.900, -0.008)' (p_{t-1}^org - p_{t-1}^reg) + (ε̂_t^org, ε̂_t^reg)'.

Here we have α̂ = (-0.900, -0.008)', which characterizes the speed of adjustment towards equilibrium.

The organic orange price seems to error correct very strongly, removing most of the disequilibrium each month. The regular orange price, on the other hand, does not seem to error correct. The coefficient is negative, indicating a movement away from equilibrium, but it is very small and not significantly different from zero. A simple interpretation of this result is that the orange price is essentially determined on the large market for regular oranges. The price of organic oranges has to follow the price of regular oranges, with an additional premium of approximately 23 pence per lb. Note that changes in the price of regular oranges (i.e. a shock to ε_t^reg ceteris paribus) will be fully transmitted to the price of organic oranges after one month, while changes to the price of organic oranges (i.e. a shock to ε_t^org ceteris paribus) will not be transmitted to the market for regular oranges.

8.4 Introduction to Estimation and Inference


Above, the concepts of co-integration and error correction were introduced. In this section we discuss how the parameters in the co-integrating vector, $\beta = (1, \beta_2, ..., \beta_p)'$, and the parameters characterizing the error-correction can be estimated. As a tool, we first present some distributional results.

8.4.1 Main Distributional Results

Let $X_t = (x_{1t}, ..., x_{pt})' \in \mathbb{R}^p$ be a p-dimensional vector of variables we want to model and let $W_t = (w_{1t}, ..., w_{mt})' \in \mathbb{R}^m$ be an m-dimensional vector of exogenous explanatory variables. Consider a dynamic regression model of the form

$$X_t = \delta + \Pi_1 X_{t-1} + ... + \Pi_k X_{t-k} + \Gamma_0 W_t + \Gamma_1 W_{t-1} + ... + \Gamma_k W_{t-k} + \epsilon_t, \qquad (8.17)$$

where $\epsilon_t$ is an i.i.d. error term with covariance matrix $\Omega$. This model contains $n = p + kp^2 + (k+1)pm$ parameters,

$$\theta = \{\delta, \Pi_1, ..., \Pi_k, \Gamma_0, \Gamma_1, ..., \Gamma_k\},$$

plus the covariance matrix $\Omega$. Note that (8.17) is the VAR-X model in Remark 6.2, and by choosing p and m and restricting some of the parameter matrices to zero, the model includes most of the models we have considered so far, e.g. a univariate autoregressive model, an ADL model or a VAR model, as special cases.

Stationary Case. The main distributional result used so far is based on the law of large numbers and the central limit theorem for stationary and weakly dependent processes. The main result is that if $(X_t', W_t')'$ is a stationary and weakly dependent process, and the model is correctly specified, then all maximum likelihood estimators based on the model in (8.17) are consistent and asymptotically normal, such that for any parameter, $\theta_j$, $j = 1, 2, ..., n$, it holds that

$$\sqrt{T}\,(\hat\theta_j - \theta_j) \xrightarrow{d} N(0, \Sigma_j), \qquad (8.18)$$

where $\Sigma_j$ is the asymptotic variance, see Theorem 3.1 and Theorem 3.2. This result holds jointly for all parameters, which means that test statistics for hypotheses on individual parameters and on several parameters jointly have standard distributions, such as $N(0,1)$ and $\chi^2$. We have used this result several times, e.g. for:

(1) Lag-length determination in a univariate autoregression.
(2) Lag-length determination in a vector autoregression.
(3) Test for Granger non-causality in a vector autoregression.

Non-Stationary Case. Importantly, however, the imposed assumptions are sufficient for the result in (8.18) but not necessary. In particular, there could be other cases, where the time series are not stationary but estimators and test statistics may nevertheless have standard distributions. For the case of unit-root non-stationary processes, this topic has been addressed in Sims, Stock, and Watson (1990). They give the following general result:

Theorem 8.2 (Sims, Stock and Watson): Consider the model (8.17) under the assumption of no serial correlation of $\{\epsilon_t\}$ and the regularity conditions regarding sufficient finite moments. In the case where some of the time series in $X_t = (x_{1t}, ..., x_{pt})'$ or $W_t = (w_{1t}, ..., w_{mt})'$ are unit-root non-stationary, it holds that if the model can be reformulated in such a way that $\theta_j$ is a parameter to a stationary mean zero regressor, then

$$\sqrt{T}\,(\hat\theta_j - \theta_j) \xrightarrow{d} N(0, \Sigma_j), \qquad (8.19)$$

and tests on $\theta_j$ will have standard $N(0,1)$ and $\chi^2(1)$ distributions.
If two or more parameters, $\theta_j$ and $\theta_h$ say, are parameters to stationary mean zero regressors in the same reformulation of the model, then they are jointly Gaussian and a joint test on $\theta_j$ and $\theta_h$ will have a standard $\chi^2$ distribution.

The result may sound complicated, but we have used it already. Consider an augmented Dickey-Fuller test based on a univariate AR(3). Here the starting point is

$$y_t = \theta_1 y_{t-1} + \theta_2 y_{t-2} + \theta_3 y_{t-3} + \epsilon_t. \qquad (8.20)$$

If $y_t$ has a unit root, $\theta_1 + \theta_2 + \theta_3 = 1$, then it is not clear that we can perform tests on $\theta_2$ and $\theta_3$ to determine the relevant lag-length, because the process is not stationary. Recall, however, that we can rewrite the model as

$$y_t = \theta_1 y_{t-1} + (\theta_2 + \theta_3)\, y_{t-2} + \theta_3 (y_{t-3} - y_{t-2}) + \epsilon_t. \qquad (8.21)$$

Now $\theta_3$ is a coefficient to $y_{t-3} - y_{t-2} = -\Delta y_{t-2}$, which is a stationary process, and by Theorem 8.2 it holds that $\hat\theta_3$ is asymptotically Gaussian, and the t-test statistic for the lag-length, $t_{\theta_3 = 0}$, is asymptotically $N(0,1)$ under the null hypothesis. Likewise, the likelihood ratio statistic, $LR(\theta_2 = 0)$, is asymptotically $\chi^2(1)$ by a similar reparametrization.

Observe that we do not have to estimate the reformulated version in (8.21) to perform the test. It is enough that the version in (8.21) exists; then we know that we can make the test on $\theta_3$ (or $\theta_2$) in (8.20).

We can also write the model as, e.g.,

$$y_t = \theta_1 (y_{t-1} - y_{t-2}) + (\theta_1 + \theta_2 + \theta_3)\, y_{t-2} + \theta_3 (y_{t-3} - y_{t-2}) + \epsilon_t. \qquad (8.22)$$

Now $\theta_1$ and $\theta_3$ are coefficients to stationary regressors in the same reformulation, which shows that a joint test on $(\theta_1, \theta_3)$ will be $\chi^2(2)$. The same result holds for a joint test on $(\theta_2, \theta_3)$. We can never write the sum $\theta_1 + \theta_2 + \theta_3$ as a coefficient to a stationary regressor, however, which shows that the unit-root test statistic has a non-standard distribution.

We will use Theorem 8.2 below to discuss inference on parameters in different models.
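The claim that $t_{\theta_3 = 0}$ is asymptotically $N(0,1)$ even under a unit root can be checked numerically. A minimal simulation sketch (our own illustration, assuming numpy and statsmodels; not part of the original text): estimate (8.20) by OLS on simulated random walks, for which the true $\theta_3$ is zero, and inspect the spread of the t-ratios:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
tstats = []
for _ in range(1000):
    y = np.cumsum(rng.standard_normal(500))     # random walk: unit root, theta_3 = 0
    Y = y[3:]                                   # regression (8.20): y_t on three lags
    X = np.column_stack([y[2:-1], y[1:-2], y[:-3]])
    fit = sm.OLS(Y, X).fit()
    tstats.append(fit.tvalues[2])               # t-ratio on theta_3

print(np.std(tstats))   # close to 1, as implied by Theorem 8.2
```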

8.4.2 Co-integration Analysis in Practice

Below we consider three approaches for performing co-integration analyses.

The first approach sets out from a static regression and is based solely on the definition of co-integration. Its main advantage is its simplicity, and the fact that it does not require the regression to be a good approximation of the DGP. There are also severe disadvantages. First, it is not possible to perform hypothesis tests on the estimates of the long-run effects, and secondly, the use of a static regression runs the danger of producing inconsistent estimators if the variables do not co-integrate.

Next, §8.6 discusses the co-integration analysis based on a single-equation dynamic regression model, while Chapter 9 introduces the co-integration analysis based on the vector autoregressive model. These three approaches are the workhorses for applied co-integration analyses.

8.5 Estimation Based on a Static Regression


Recall that if a set of variables, $x_{1t}$ and $x_{2t}$, co-integrate, then there exist coefficients, $\mu$ and $\beta_2$, such that

$$x_{1t} = \mu + \beta_2 x_{2t} + u_t \qquad (8.23)$$

defines an equilibrium. It is natural to try to estimate $\beta_2$ in the static regression (8.23), which is also the approach suggested in the seminal paper of Engle and Granger (1987). They show that if $x_{1t}$ and $x_{2t}$ are I(1) and co-integrated, then the OLS estimator from (8.23), $\hat\beta_2$, is consistent for the true parameter, $\beta_2$. It is not postulated that the model in (8.23) is the DGP that generated the data, and it turns out that consistency of $\hat\beta_2$ holds even if the estimation model is misspecified relative to the DGP.

Consistency of the estimator tells you that $\hat\beta_2$ converges to $\beta_2$ as $T$ diverges. It turns out that the non-stationarity of the variables in $X_t$ affects the so-called rate of convergence, i.e. the speed at which the variance of $\hat\beta_2$ goes to zero. If $x_{1t}$ and $x_{2t}$ are stationary variables, we know that under usual conditions we have the result

$$\sqrt{T}\,(\hat\beta_2 - \beta_2) \xrightarrow{d} N(0, \Sigma_\beta),$$

where $\Sigma_\beta$ is the asymptotic variance of $\hat\beta_2$. The interpretation is that

$$V(\hat\beta_2) \approx \frac{\Sigma_\beta}{T},$$

such that the variance of the estimator approaches zero at the rate of $T^{-1}$. For co-integrated I(1) series, it holds that $T(\hat\beta_2 - \beta_2)$ converges, such that

$$V(\hat\beta_2) \approx \frac{\Sigma_\beta}{T^2},$$

which approaches zero at the faster rate of $T^{-2}$. This property is known as super-consistency of $\hat\beta_2$, see §7.5.2.

Unfortunately, the limiting distribution of $T(\hat\beta_2 - \beta_2)$ is unknown in general, which means that although we can obtain consistent estimators, we cannot test hypotheses on the long-run relationship using standard approaches.

Remark 8.3 (modifications of static ols): The econometric literature has suggested adjustments of the static regression such that the distribution of $\hat\beta_2$ becomes asymptotically normal, e.g. the fully modified OLS (FMOLS) estimator or the dynamic OLS estimator (DOLS), but this set of notes does not cover these suggestions.

In a co-integration analysis, the static regression (8.23) is sometimes referred to as the first step of an Engle-Granger two-step procedure, where the second step is a description of the dynamic adjustment towards equilibrium. Given the estimated co-integration parameters, we may define the so-called error correction term as the deviation from equilibrium,

$$\hat u_t = x_{1t} - \hat\mu - \hat\beta_2 x_{2t}. \qquad (8.24)$$

The second step of the Engle-Granger procedure is therefore to estimate an error-correction model given $\hat u_{t-1}$, e.g.

$$\Delta x_{1t} = \delta + \gamma_1 \Delta x_{1t-1} + \omega_0 \Delta x_{2t} + \omega_1 \Delta x_{2t-1} + \alpha \hat u_{t-1} + \epsilon_t, \qquad (8.25)$$

where we have assumed one lag in the first differences. All terms in the error correction model are now stationary, and standard inference procedures apply to all parameters, in the sense that t-ratios will follow standard normal distributions, $N(0,1)$, asymptotically.

Remark 8.4 (asymptotic inference): Under co-integration, $\hat u_t$ is a stationary process, and since the estimator, $\hat\beta_2$, converges to the true value, $\beta_2$, at the speed of $T$, the estimation uncertainty related to $\hat\beta_2$ can be disregarded in the asymptotic analysis of the short-run parameters (which converge at the rate $\sqrt{T}$), and inference in the model can be done as if $\hat u_{t-1}$ was a fixed regressor in a dynamic model.

Note that (8.25) is a model for $x_{1t}$ conditional on $x_{2t}$ and the past, thereby imposing a contemporaneous causal structure,

$$x_{2t} \rightarrow x_{1t}.$$

Alternatively, we could have estimated the reduced form for $\Delta x_{1t}$ and $\Delta x_{2t}$ given the past, leaving the contemporaneous relationship in the residual covariance, e.g. $corr(\epsilon_{1t}, \epsilon_{2t})$.

8.5.1 What if Variables do not Co-integrate?

Recall that co-integration is the special case where the stochastic trends in the individual variables cancel. From a logical point of view this is an exception, and it is interesting to ask for the properties of regression models with I(1) variables that do not co-integrate.

To discuss this case, assume that $x_{1t}$ and $x_{2t}$ are two unrelated I(1) variables. Both variables contain stochastic trends, but they are unrelated and do not co-integrate. Ideally we would like the static regression

$$x_{1t} = \mu + \beta_2 x_{2t} + u_t \qquad (8.26)$$

to reveal that $\beta_2 = 0$, at least asymptotically. This turns out not to be the case, however, and the estimator $\hat\beta_2$ is not consistent. Moreover, as $T \to \infty$ the t-ratio, $t_{\beta_2 = 0}$, will indicate a significant relation between $x_{1t}$ and $x_{2t}$. This is known as the spurious regression result. The problem is that when the variables do not co-integrate, $u_t$ is an I(1) process and standard results do not hold.

Example 8.7 (spurious regression): As an example of a spurious regression, consider two presumably unrelated I(1) variables, namely yearly data covering 1980-2000 for the log of real private consumption in Denmark, $cons_t$, and the log of the number of breeding cormorants in Denmark, $bird_t$. We estimate a static regression:

$$cons_t = \underset{(0.150)}{12.145} + \underset{(0.015)}{0.095}\, bird_t + \hat u_t.$$

The t-ratio for the hypothesis that there is no relation, $\beta_2 = 0$, is given by

$$t_{\beta_2 = 0} = \frac{0.095}{0.015} = 6.30,$$

which seems highly significant in a $N(0,1)$ distribution, apparently suggesting a clear positive relation between the number of birds and aggregate consumption! Furthermore, $R^2$ in the equation is 0.69, indicating that the number of breeding birds can account for a large proportion of the variation in consumption. These results are of course totally spurious, a simple consequence of the variables being I(1).
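The spurious regression phenomenon is easy to reproduce with simulated data. A minimal sketch (our own illustration, assuming numpy and statsmodels): regress one random walk on another, independent, random walk and note that the t-ratio is typically far from zero:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
T = 500
x1 = np.cumsum(rng.standard_normal(T))   # two independent random walks:
x2 = np.cumsum(rng.standard_normal(T))   # no relation, no co-integration

fit = sm.OLS(x1, sm.add_constant(x2)).fit()
print(fit.tvalues[1], fit.rsquared)      # 'significant' t-ratio despite no relation
```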

8.5.2 Testing for no Co-integration

Following the definition, a set of variables $X_t = (x_{1t}, x_{2t}, ..., x_{pt})'$ co-integrate with co-integration vector $\beta = (1, -\beta_2, -\beta_3, ..., -\beta_p)'$, if the linear combination

$$z_t = \beta' X_t = x_{1t} - \beta_2 x_{2t} - \beta_3 x_{3t} - ... - \beta_p x_{pt}$$

is stationary. It follows that the null hypothesis of no co-integration can be translated into the hypothesis of a unit root in $z_t$. This hypothesis can be tested using a conventional augmented Dickey-Fuller (ADF) test. Allowing $z_t$ to have a mean different from zero but no deterministic linear trend, the hypothesis of no co-integration can be tested as the hypothesis $H_0 : \pi = 0$ in the ADF regression with a constant term,

$$\Delta z_t = \delta + \sum_{i=1}^{k-1} c_i \Delta z_{t-i} + \pi z_{t-1} + \epsilon_t, \qquad (8.27)$$

where $\epsilon_t$ is an i.i.d. error term. The alternative to a unit root is stationarity, $H_A : -2 < \pi < 0$, and under the null of a unit root the ADF t-test statistic,

$$\tau_c = \frac{\hat\pi}{se(\hat\pi)},$$

follows a DF distribution. Critical values for the DF distribution are reproduced in part (A) of Table 8.1 in the row with zero estimated parameters.

If the relevant alternative to a unit root is trend-stationarity, the ADF regression (8.27) may be augmented with a linear trend term, and the test for no co-integration is the ADF test with a linear trend, $\tau_l$.

Example 8.8 (prices on the orange market, continued): As an example of a unit root test where the potential co-integration vector is known, reconsider the prices from the orange market. The potential stationary variable is the price differential

$$z_t = p^{org}_t - p^{reg}_t,$$

implying a co-integration vector $\beta = (1, -1)'$. To test the hypothesis of no co-integration, we test for a unit root in $z_t$. Setting up an ADF regression with a constant term and 5 lags of $\Delta z_t$ and deleting insignificant lags leads to the simple DF regression

$$\Delta z_t = \underset{(1.534)}{21.718} - \underset{(0.0750)}{1.082}\, z_{t-1} + \hat\epsilon_t.$$

The Dickey-Fuller test is given by the t-ratio,

$$\tau_c = \frac{-1.082}{0.075} = -14.43.$$

The 5% critical value for the case of a constant is $-2.86$, so we can easily reject the null of no co-integration. Also recall from Figure 8.1 (B) that the price differential looks extremely stable.
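When the co-integration vector is known, the test is just a standard ADF test on the linear combination, and it can be computed directly, e.g. with the adfuller routine in statsmodels. A minimal sketch (our own illustration with simulated stand-in series; replace them with the actual price data):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Stand-in data: replace with the actual organic and regular price series.
rng = np.random.default_rng(8)
p_reg = 100 + np.cumsum(rng.standard_normal(300))
p_org = p_reg + 20 + 2 * rng.standard_normal(300)   # premium plus stationary noise

z = p_org - p_reg                        # known co-integration vector (1, -1)'
tau_c, pval, *_ = adfuller(z, maxlag=5, regression="c", autolag="AIC")
print(tau_c, pval)                       # large negative tau_c rejects the unit root
```

Because the vector is known rather than estimated, the standard DF critical values used by adfuller apply here.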

If the co-integration vector has been estimated, the test procedure has to be modified to use the estimated residuals in (8.24). In particular, we test for no co-integration by testing whether the estimated residual, $\hat u_t$, contains a unit root. This test is translated into the hypothesis $H_0 : \pi = 0$ in the ADF regression

$$\Delta \hat u_t = \pi \hat u_{t-1} + \sum_{i=1}^{k-1} c_i \Delta \hat u_{t-i} + \epsilon_t. \qquad (8.28)$$

We note that since the estimated residual, $\hat u_t$, has a mean of zero, there is no constant term in the ADF regression (8.28). Nonetheless, the critical values for the ADF test depend on the deterministic specification of the static regression, e.g. whether (8.23) contains a constant or a linear trend.

The fact that the co-integrating vector $\hat\beta$ is estimated also changes the critical values for the ADF test, and the estimation uncertainty has to be taken into account. The intuition is that OLS applied to the static regression (8.23) will minimize the variance of $\hat u_t$, and graphically the estimated residuals will look as 'stationary as possible'. And the more explanatory variables we include in (8.23), i.e. the more parameters we estimate to I(1) variables, the smaller is the variance of $\hat u_t$, and the more stationary it will look. The test procedure has to account for this, and the critical values depend on the number of estimated parameters to I(1) variables in the regression (8.23). The asymptotic distributions of tests for no co-integration are illustrated in Figure 8.3. As the number of regressors in the static regression increases, the distribution of the ADF test statistic moves to the left. This reflects that the OLS procedure makes the variance of the estimated residual smaller and smaller. The critical values of the residual-based test are reproduced in Table 8.1 (A).

8.5.3 Summary
We summarize the Engle and Granger co-integration analysis as follows (a code sketch of the steps follows the list):

(1) Make sure the variables in $X_t$ are I(1) using Dickey-Fuller unit root tests.
(2) Estimate the static regression in (8.23), including a constant or a trend as suitable for the application. Recall that the estimators are super-consistent, but the distribution is unknown.
(3) Test for no co-integration on the estimated residuals in (8.24) using the Dickey-Fuller unit root test in (8.28). Make sure to use the critical values in Table 8.1 (A).
(4) Characterize the error correction using the dynamic regression in (8.25), potentially for all variables in $X_t$.
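A minimal sketch of steps (2)-(4) in Python (our own illustration on simulated data; assuming numpy and statsmodels, and with the augmentation structure chosen for brevity):

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in data: replace x1 and x2 with the actual I(1) series.
rng = np.random.default_rng(0)
T = 200
x2 = np.cumsum(rng.standard_normal(T))
x1 = 1.0 + 0.8 * x2 + rng.standard_normal(T)      # co-integrated with x2

# Step (2): static regression (8.23) by OLS.
static = sm.OLS(x1, sm.add_constant(x2)).fit()
u_hat = static.resid

# Step (3): DF regression (8.28) on the residuals, without a constant;
# compare the t-statistic to the critical values in Table 8.1 (A).
du = np.diff(u_hat)
adf = sm.OLS(du, u_hat[:-1]).fit()                # no augmentation lags for brevity
t_stat = adf.params[0] / adf.bse[0]

# Step (4): error-correction model (8.25) for the first differences.
dx1, dx2 = np.diff(x1), np.diff(x2)
Z = sm.add_constant(np.column_stack([dx1[:-1], dx2[1:], dx2[:-1], u_hat[1:-1]]))
ecm = sm.OLS(dx1[1:], Z).fit()
```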

(A) Residual-based (ADF) test for no co-integration

Number of est. par.       Constant in (8.27)          Constant and trend in (8.27)
to I(1) var. in (8.23)    1%      5%      10%          1%      5%      10%
0                       -3.43   -2.86   -2.57        -3.96   -3.41   -3.13
1                       -3.90   -3.34   -3.04        -4.32   -3.78   -3.50
2                       -4.29   -3.74   -3.45        -4.66   -4.12   -3.84
3                       -4.64   -4.10   -3.81        -4.97   -4.43   -4.15
4                       -4.96   -4.42   -4.13        -5.25   -4.72   -4.43
5                       -5.25   -4.71   -4.43        -5.52   -4.98   -4.70
6                       -5.51   -4.98   -4.70        -5.77   -5.23   -4.95
7                       -5.76   -5.23   -4.95        -6.00   -5.47   -5.20
8                       -6.00   -5.47   -5.19        -6.22   -5.69   -5.42

(B) PcGive test for no co-integration

Number of est. par.       Constant in (8.39)          Constant and trend in (8.39)
to I(1) var. in
long-run solution         1%      5%      10%          1%      5%      10%
1                       -3.79   -3.21   -2.91        -4.25   -3.69   -3.39
2                       -4.09   -3.51   -3.19        -4.50   -3.93   -3.62
3                       -4.36   -3.76   -3.44        -4.72   -4.14   -3.83
4                       -4.59   -3.99   -3.66        -4.93   -4.34   -4.03
5                       -4.80   -4.19   -3.87        -5.11   -4.52   -4.21
6                       -4.99   -4.38   -4.06        -5.29   -4.70   -4.38
7                       -5.17   -4.56   -4.23        -5.46   -4.86   -4.53
8                       -5.34   -4.73   -4.40        -5.61   -5.01   -4.69

Table 8.1: Asymptotic critical values for tests of no co-integration. The distribution depends on the number of estimated long-run parameters to I(1) variables. The critical values can be found in Davidson and MacKinnon (1993) and Ericsson and MacKinnon (2002).

[Figure 8.3: Asymptotic distributions of the residual-based test for no co-integration. The figure shows the DF distribution with a constant, $\tau_c$, for 1-7 estimated parameters in the static regression, together with the N(0,1) density; the distributions shift to the left as the number of estimated parameters increases.]

Example 8.9 (private consumption, continued): To illustrate estimation and inference on co-integrating coefficients, consider again the Danish quarterly consumption data: $X_t = (c_t, y_t, w_t)'$. Applying OLS to a static regression model for the 122 observations, 1973:1-2003:2, yields

$$c_t = \underset{(0.129)}{0.404} + \underset{(0.049)}{0.364}\, y_t + \underset{(0.044)}{0.516}\, w_t + \hat u_t, \qquad (8.29)$$

where the numbers in parentheses are standard errors. The estimates seem consistent with a simple consumption function in which consumption depends positively on income and wealth. We may note that a one per cent increase in income and wealth gives less than a one per cent increase in consumption, as $0.364 + 0.516 = 0.88$. Consequently, the consumption-income ratio will not be constant in a steady state, which may be regarded as unsatisfactory from an economic point of view. Note that t-ratios constructed from the reported standard errors in (8.29) do not follow a standard normal distribution and they are not suitable for testing. For example, we cannot test whether the sum of coefficients, 0.88, is significantly different from one.

To test whether the static regression of a consumption function in (8.29) corresponds to a co-integrating relation, we construct the estimated residual

$$\hat u_t = c_t - 0.404 - 0.364\, y_t - 0.516\, w_t,$$

which is depicted in Figure 8.1 (D). To test for no co-integration we use an ADF regression without deterministic terms. In the present case one lag is needed,

$$\Delta \hat u_t = \underset{(0.089)}{0.223}\, \Delta \hat u_{t-1} - \underset{(0.068)}{0.221}\, \hat u_{t-1} + \hat\epsilon_t,$$

and the test statistic is given by

$$\tau_c = \frac{-0.221}{0.068} = -3.27.$$

The 5% and 10% critical values for the case of a constant term and two estimated parameters in the static regression are $-3.74$ and $-3.45$, respectively, so we cannot reject the hypothesis of no co-integration. This is reflected in Figure 8.1 (D), where the deviations from the relation are relatively persistent. The deviations seem to be related to the business cycle, suggesting that the consumption-income ratio is pro-cyclical besides the wealth effects. To obtain stronger evidence of co-integration, one possible solution is to augment the model with a measure of the business cycle, e.g. a variable measuring the effects of unemployment.
Since the decision regarding co-integration was borderline, we continue to look at the error correction properties. In principle there may exist error correction models for $\Delta c_t$, $\Delta y_t$, and $\Delta w_t$, and starting with a model with two lags in the first differences and deleting insignificant lags produces the three equations:

$$\Delta \hat c_t = \underset{(0.002)}{0.001} - \underset{(0.077)}{0.195}\, \Delta c_{t-1} + \underset{(0.057)}{0.229}\, \Delta y_t + \underset{(0.117)}{0.426}\, \Delta w_t - \underset{(0.064)}{0.250}\, \hat u_{t-1}$$
$$\Delta \hat y_t = \underset{(0.002)}{0.002} + \underset{(0.118)}{0.433}\, \Delta c_t + \underset{(0.115)}{0.387}\, \Delta c_{t-1} - \underset{(0.087)}{0.353}\, \Delta y_{t-1} + \underset{(0.099)}{0.066}\, \hat u_{t-1}$$
$$\Delta \hat w_t = \underset{(0.001)}{0.003} + \underset{(0.060)}{0.232}\, \Delta c_t - \underset{(0.050)}{0.030}\, \hat u_{t-1}.$$

Note that only consumption corrects deviations from the long-run relation, with a speed of adjustment of $-0.25$, while $\Delta y_t$ and $\Delta w_t$ do not adjust significantly when the variables are out of equilibrium. The equations above are formulated conditional on contemporaneous values of the other variables.

An alternative approach would be to estimate the short-run adjustment in a VAR model, either a reduced form or a formulation subject to a theoretically suggested causal chain.

8.6 Dynamic Regression Models

An alternative to the co-integration analysis based on OLS estimation in the static regression (8.23) is to construct a dynamic time-series model and use this model as a basis for co-integration inference. The dynamic model is often a better approximation of the DGP, and may therefore imply estimators of the co-integrating coefficients that are superior to the results derived from the static regression. The overall idea is to construct the best possible description of the auto-covariance structure of the data by estimating an appropriate autoregressive distributed lag model, and to derive estimators of the co-integrating parameters from the long-run solution.

In particular, we could estimate the unrestricted ADL model by OLS, where the lag-lengths are set to eliminate residual autocorrelation, e.g. an ADL(2,2) model,

$$x_{1t} = \delta + \alpha_1 x_{1t-1} + \alpha_2 x_{1t-2} + \omega_0 x_{2t} + \omega_1 x_{2t-1} + \omega_2 x_{2t-2} + \epsilon_t. \qquad (8.30)$$

Recall that the unrestricted ADL model can be written as an error-correction model. In particular we can use the reformulations

$$x_{1t} - \alpha_1 x_{1t-1} - \alpha_2 x_{1t-2} = \Delta x_{1t} + \alpha_2 \Delta x_{1t-1} - (\alpha_1 + \alpha_2 - 1)\, x_{1t-1}$$
$$\omega_0 x_{2t} + \omega_1 x_{2t-1} + \omega_2 x_{2t-2} = \omega_0 \Delta x_{2t} - \omega_2 \Delta x_{2t-1} + (\omega_0 + \omega_1 + \omega_2)\, x_{2t-1},$$

to obtain the ECM form

$$\Delta x_{1t} = \delta + \gamma_1 \Delta x_{1t-1} + \psi_0 \Delta x_{2t} + \psi_1 \Delta x_{2t-1} + \pi x_{1t-1} + \kappa x_{2t-1} + \epsilon_t, \qquad (8.31)$$

where $\gamma_1 = -\alpha_2$, $\psi_0 = \omega_0$, $\psi_1 = -\omega_2$, $\pi = \alpha_1 + \alpha_2 - 1$, and $\kappa = \omega_0 + \omega_1 + \omega_2$. For both (8.30) and (8.31) the estimator of the co-integrating coefficient is given by the long-run solution,

$$x_{1t} = \mu + \beta_2 x_{2t}, \qquad (8.32)$$

with

$$\hat\beta_2 = \frac{\hat\omega_0 + \hat\omega_1 + \hat\omega_2}{1 - \hat\alpha_1 - \hat\alpha_2} = -\frac{\hat\kappa}{\hat\pi} \quad \text{and} \quad \hat\mu = \frac{\hat\delta}{1 - \hat\alpha_1 - \hat\alpha_2} = -\frac{\hat\delta}{\hat\pi}. \qquad (8.33)$$

The model in (8.31) is often referred to as the linear ECM form. Recall that we may also write the model with the long-run solution explicit as

$$\Delta x_{1t} = \gamma_1 \Delta x_{1t-1} + \psi_0 \Delta x_{2t} + \psi_1 \Delta x_{2t-1} + \alpha\,(x_{1t-1} - \beta_2 x_{2t-1}) + \epsilon_t. \qquad (8.34)$$

These formulations are equivalent, but (8.31) can be estimated with OLS while (8.34) is non-linear in the parameters and requires a more elaborate estimation procedure (e.g. maximum likelihood).
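The mapping from the ADL estimates to the long-run solution in (8.33) is a simple ratio. A minimal helper (our own illustration; the coefficient values below are hypothetical):

```python
import numpy as np

def long_run_solution(delta, alphas, omegas):
    """Long-run solution (8.32)-(8.33) of an ADL model:
    mu = delta/(1 - sum(alphas)), beta2 = sum(omegas)/(1 - sum(alphas))."""
    denom = 1.0 - np.sum(alphas)
    return delta / denom, np.sum(omegas) / denom

# Hypothetical ADL(2,2) estimates: delta, (alpha1, alpha2), (omega0, omega1, omega2).
mu_hat, beta2_hat = long_run_solution(0.5, (0.6, 0.1), (0.2, 0.1, 0.05))
```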

8.6.1 Inference
To discuss inference in the dynamic regression models (8.30), (8.31) or (8.34), we use the general results in Theorem 8.2. As an example, consider the ADL(2,2) model,

$$x_{1t} = \delta + \alpha_1 x_{1t-1} + \alpha_2 x_{1t-2} + \omega_0 x_{2t} + \omega_1 x_{2t-1} + \omega_2 x_{2t-2} + \epsilon_t. \qquad (8.35)$$

This can be rewritten such that each of the parameters, $\alpha_1, \alpha_2, \omega_0, \omega_1, \omega_2$, is a coefficient to a stationary regressor, one at a time, and as a consequence, t-tests will have standard $N(0,1)$ distributions. This suggests that we can determine the lag-length using standard asymptotics.

For the ECM form,

$$\Delta x_{1t} = \gamma_1 \Delta x_{1t-1} + \psi_0 \Delta x_{2t} + \psi_1 \Delta x_{2t-1} + \alpha\,\{x_{1t-1} - \beta_2 x_{2t-1}\} + \epsilon_t, \qquad (8.34)$$

we note that $\Delta x_{1t-1}$, $\Delta x_{2t}$, and $\Delta x_{2t-1}$ are stationary variables with mean zero, so estimators of the corresponding parameters, $\gamma_1$, $\psi_0$, and $\psi_1$, will follow a normal distribution. Given co-integration, the term $x_{1t-1} - \beta_2 x_{2t-1}$ is also stationary with mean zero, so the estimator of $\alpha$ will also follow a normal distribution.

Unfortunately, there is no way to rewrite the model so that $\beta_2$ is the coefficient to a stationary mean zero term, so $\hat\beta_2$ is not Gaussian distributed. If $x_{1t}$ is the only variable that error corrects, however, then all information on $\beta_2$ is present in the ADL equation, and the single-equation OLS estimator is identical to the maximum likelihood estimator in the vector error-correction model. It follows that $\hat\beta_2$ is asymptotically efficient, and the distribution of $\hat\beta_2$ is mixed normal, which is a Gaussian distribution with a random variance. The variance can be estimated, however, and test statistics calculated for hypotheses on the co-integrating coefficient; e.g. the t-ratio

$$t_{\beta_2 = b} = \frac{\hat\beta_2 - b}{se(\hat\beta_2)} \qquad (8.36)$$

follows a standard normal distribution asymptotically.

8.6.2 What if Variables do not Co-integrate?

A serious problem with the static regression was the possibility of spurious regression results. This problem is not present in a dynamic regression. As an example, consider the ECM form of the ADL model,

$$\Delta x_{1t} = \delta + \gamma_1 \Delta x_{1t-1} + \psi_0 \Delta x_{2t} + \psi_1 \Delta x_{2t-1} + \pi x_{1t-1} + \kappa x_{2t-1} + \epsilon_t. \qquad (8.37)$$

If the variables do not co-integrate, we would simply get

$$\hat\pi \xrightarrow{p} 0 \quad \text{and} \quad \hat\kappa \xrightarrow{p} 0, \qquad (8.38)$$

and the remaining parameter estimates, $\hat\gamma_1$, $\hat\psi_0$, and $\hat\psi_1$, would characterize the potential short-run relationship between the stationary first-differenced variables.

8.6.3 Testing for no Co-integration

To test the hypothesis of no co-integration in the dynamic regression, we use Theorem 8.1, which implies that the null hypothesis of no co-integration corresponds to the null of no error-correction. This observation has been used to construct several tests for whether variables co-integrate. The most convenient is based on the linear error-correction model, e.g.

$$\Delta x_{1t} = \delta + \gamma_1 \Delta x_{1t-1} + \psi_0 \Delta x_{2t} + \psi_1 \Delta x_{2t-1} + \pi x_{1t-1} + \kappa x_{2t-1} + \epsilon_t. \qquad (8.39)$$

Here we can test the hypothesis that $x_{1t}$ does not error correct, i.e. $H_0 : \pi = 0$, against the co-integrating alternative, $H_A : \pi < 0$. The test statistic is just the conventional t-ratio, given by

$$t_{\pi = 0} = \frac{\hat\pi}{se(\hat\pi)}. \qquad (8.40)$$

As for the residual-based test, the distribution of $t_{\pi=0}$ depends on the deterministic terms in the regression (8.39) as well as the number of estimated parameters to I(1) variables in the long-run solution. The asymptotic critical values are reproduced in part (B) of Table 8.1. This test appeared very early in the PcGive software package and is often referred to as the PcGive test for no co-integration.

Comparing the residual-based test for no co-integration with the test for no error-correction in the dynamic model, two things are worth noting. First, the test for no error-correction is based on the assumption that $x_{1t}$ is the only variable which error corrects to the potential co-integrating relation. This implies that we should test for no co-integration in the 'correct' error-correction model; in the present case that is the model for $x_{1t}$ and not the model for $x_{2t}$. In most cases, prior knowledge from economic theory suggests which equation to consider.

Secondly, the test for no error-correction of $x_{1t}$ is parallel to a test for no co-integration for a relation involving $x_{1t}$. Even if we cannot reject the hypothesis of no error-correction of $x_{1t}$, the other right hand side variables in levels, $x_{2t}, ..., x_{pt}$, may still co-integrate in a relation not involving $x_{1t}$.

8.6.4 Summary
We summarize the co-integration analysis based on a single-equation ADL model as
follows:

(1) Make sure the variables in Xt are I(1) using Dickey-Fuller unit root tests.

(2) Estimate the ADL model in (8.30), including a constant or a trend as suitable for the application. Use t-tests and standard asymptotics to simplify the model by deleting insignificant terms.
(3) Test for no co-integration using the PcGive test in (8.40). Make sure to use the critical values in Table 8.1 (B).
(4) Derive the long-run solution in (8.32) to characterize the equilibrium.
(5) Characterize the error correction based on the remaining parameters.

The main limitation of the single-equation approach is that we assume that only one variable error corrects ($x_{1t}$). If this is true, estimates from the ADL model will be asymptotically efficient. If other variables in $X_t$ also error correct, we lose information by only using one equation to estimate the parameters.

We also assume that there is only one co-integration relationship between the variables. This is fine for $p = 2$ variables, but for more than two variables there could exist more than one stationary linear combination, which makes the interpretation of the ADL results complicated.

Example 8.10 (private consumption, continued): Here, we perform the co-integration analysis for consumption based on an ADL model. Assuming at most three lags and deleting insignificant terms leads to the preferred ADL model

$$c_t = \underset{(0.093)}{0.080} + \underset{(0.092)}{0.544}\, c_{t-1} + \underset{(0.079)}{0.204}\, c_{t-2} + \underset{(0.060)}{0.240}\, y_t - \underset{(0.065)}{0.125}\, y_{t-1} + \underset{(0.124)}{0.401}\, w_t - \underset{(0.129)}{0.291}\, w_{t-1} + \hat\epsilon_t. \qquad (8.41)$$

According to misspecification tests, the model seems relatively well-behaved. No-autocorrelation of order 1 to 5 is not rejected with a p-value of 0.64, and no-ARCH of order 1 to 4 is not rejected with a p-value of 0.21.

Notice that the results obtained in the estimation of (8.41) can also be obtained by estimating the equivalent linear error-correction model, i.e.

$$\Delta c_t = \underset{(0.093)}{0.080} - \underset{(0.079)}{0.204}\, \Delta c_{t-1} + \underset{(0.060)}{0.240}\, \Delta y_t + \underset{(0.124)}{0.401}\, \Delta w_t - \underset{(0.065)}{0.251}\, c_{t-1} + \underset{(0.044)}{0.115}\, y_{t-1} + \underset{(0.046)}{0.110}\, w_{t-1} + \hat\epsilon_t. \qquad (8.42)$$

To test whether the linear error correction model in (8.42) suggests co-integration, we test for no error-correction using the t-ratio,

$$t_{\pi = 0} = \frac{-0.251}{0.065} = -3.86.$$

The 5% critical value is given in part (B) of Table 8.1 as $-3.51$, so we can borderline reject no co-integration. The different conclusions from the residual-based test and the PcGive test for no co-integration could be related to the fact that the dynamic regression model is a better approximation of the DGP.

[Figure 8.4: Impulse-response functions for a permanent change in income (A) and wealth (B), i.e. the accumulated values of $\partial c_{t+i}/\partial y_t$ and $\partial c_{t+i}/\partial w_t$ ($i = 0, 1, ..., 40$).]

Solving equation (8.41) for the static long-run solution yields

$$c_t = \underset{(0.357)}{0.320} + \underset{(0.146)}{0.458}\, y_t + \underset{(0.130)}{0.436}\, w_t + \hat u_t, \qquad (8.43)$$

where the long-run coefficients are derived from (8.41) as

$$\frac{0.240 - 0.125}{1 - 0.544 - 0.204} = 0.458 \quad \text{and} \quad \frac{0.401 - 0.291}{1 - 0.544 - 0.204} = 0.436, \qquad (8.44)$$

and where the standard errors of the co-integrating coefficients are complicated functions of the covariance matrix of the estimated parameters. Compared to the static regression, the estimated coefficient to income is somewhat higher, whereas the coefficient to private wealth is lower. We also note that the standard errors in (8.43), which can be used for testing hypotheses on the co-integrating coefficients, are much larger than the standard errors in (8.29).

The co-integrating coefficients in (8.44) can also be found as the long-run solution from the error correction model in (8.42): $0.115/0.251 = 0.458$ and $0.110/0.251 = 0.436$.

Based on the dynamic model, the sum of the coefficients is still below unity, $0.458 + 0.436 = 0.894$, but now we can test the hypothesis that it is actually unity. A Wald test for this hypothesis gives a test statistic of 5.26, corresponding to a p-value of 0.022 in a $\chi^2(1)$ distribution. We therefore reject the hypothesis and conclude that the sum of the coefficients seems to be significantly smaller than unity.

To illustrate the dynamic properties of the estimated co-integration model, Figure 8.4 shows the impulse-response functions for a permanent change in income and wealth, i.e. the accumulated values of $\partial c_{t+i}/\partial y_t$ and $\partial c_{t+i}/\partial w_t$ ($i = 0, 1, ..., 40$). For

disposable income the contemporaneous impact is 0.240, and there is a smooth convergence to the long-run impact of 0.458. A permanent change in private wealth has a contemporaneous effect on consumption of 0.401, which is not far from the long-run impact of 0.436. The convergence is not monotone, however, and the large contemporaneous impact is followed by a decrease in the next period and then a gradual convergence.
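The impulse-response functions in Figure 8.4 can be reproduced by iterating the ADL recursion (8.41) with a permanent one-unit change in the regressor. A minimal sketch using the reported point estimates (our own illustration):

```python
import numpy as np

def permanent_response(a1, a2, w0, w1, horizon=40):
    """Accumulated response of c to a permanent one-unit change in x,
    iterating c_t = a1*c_{t-1} + a2*c_{t-2} + w0*x_t + w1*x_{t-1}."""
    c = np.zeros(horizon + 3)              # c = 0 before the change
    for t in range(2, horizon + 3):
        x_t = 1.0                          # x jumps to 1 at t = 2 and stays there
        x_tm1 = 1.0 if t > 2 else 0.0
        c[t] = a1 * c[t - 1] + a2 * c[t - 2] + w0 * x_t + w1 * x_tm1
    return c[2:]

irf_income = permanent_response(0.544, 0.204, 0.240, -0.125)
irf_wealth = permanent_response(0.544, 0.204, 0.401, -0.291)
# irf_income[0] = 0.240, converging towards the long-run impact (0.458 in the text);
# irf_wealth[0] = 0.401, converging non-monotonically towards 0.436.
```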

8.7 Concluding Remarks and Further Readings

This chapter has discussed the statistical analysis of non-stationary and co-integrated time series. We have presented a number of single-equation tools for co-integration. The conceptually simplest approach is the Engle-Granger two-step estimation, but for practical purposes the co-integration analyses based on unrestricted ADL or ECM models are in most cases preferable. This also fits within the general-to-specific framework, in which we first find an appropriate statistical description of the data (the unrestricted ADL model), and afterwards test hypotheses to link the statistical model to economic theory (testing for co-integration and interpreting the long-run relationship).

The literature on co-integration analysis is huge, and most references are far more technical than the present chapter. An accessible introduction is Hendry and Juselius (2001). Alternative presentations of time series econometrics, including sections on single-equation co-integration analysis, are given in Patterson (2000) and Enders (2004). A classic reference on co-integration analysis based on the ADL model is the book by Banerjee, Dolado, Galbraith, and Hendry (1993). Maddala and Kim (1998) give a review of the literature on unit roots and co-integration. A specific reference for the test for no error-correction (with references to the earlier literature) is Ericsson and MacKinnon (2002). The classic reference for time series analysis in general, which includes rather technical sections on co-integration models, is Hamilton (1994).
Chapter 9

The Co-integrated Vector Autoregression

In this chapter we discuss some issues in the co-integration analysis based on the vector autoregressive (VAR) model. We begin by introducing the vector error-correction form of the VAR model and discuss the implication of unit roots on the parameters in the model. We then discuss how to test for no co-integration and how to test hypotheses on the co-integration vectors.

9.1 The Vector Error Correction Model

The starting point for performing a co-integration analysis using the vector error-correction model, as in Example 8.6, is an unrestricted vector autoregression. We let $X_t = (x_{1t}, ..., x_{pt})' \in \mathbb{R}^p$ be a p-dimensional vector of variables, and consider for simplicity a VAR(2):

$$X_t = \Pi_1 X_{t-1} + \Pi_2 X_{t-2} + \delta + \epsilon_t, \quad t = 1, 2, ..., T, \qquad (9.1)$$

conditional on the initial values, $X_0$ and $X_{-1}$, and with

$$\epsilon_t \mid X_{t-1}, X_{t-2} \overset{d}{=} N(0, \Omega).$$

Recall that this model has $p + 2p^2$ parameters in the conditional mean. We can rewrite the VAR equation using the lag operator, L, such that $L X_t = X_{t-1}$, to get

$$X_t - \Pi_1 L X_t - \Pi_2 L^2 X_t = \delta + \epsilon_t$$
$$(I_p - \Pi_1 L - \Pi_2 L^2)\, X_t = \delta + \epsilon_t$$
$$\Pi(L)\, X_t = \delta + \epsilon_t.$$

This defines $\Pi(z)$, which is a $p \times p$ matrix of polynomials. For the VAR model, the characteristic polynomial is defined using the determinant, $|\Pi(z)| = 0$. We note that

$$\Pi(1) = I_p - \Pi_1 - \Pi_2,$$

such that the VAR model has a unit root if

$$|\Pi(1)| = |I_p - \Pi_1 - \Pi_2| = 0. \qquad (9.2)$$

Similar to the univariate case, we can rewrite the VAR model in a vector error-correction form,

$$X_t = \Pi_1 X_{t-1} + \Pi_2 X_{t-2} + \delta + \epsilon_t$$
$$X_t - X_{t-1} = (\Pi_1 + \Pi_2 - I_p)\, X_{t-1} + \Pi_2 (X_{t-2} - X_{t-1}) + \delta + \epsilon_t$$
$$\Delta X_t = \Pi X_{t-1} + \Gamma_1 \Delta X_{t-1} + \delta + \epsilon_t, \qquad (9.3)$$

where $\Gamma_1 = -\Pi_2$ and where the matrix loading the levels is given by

$$\Pi = \Pi_1 + \Pi_2 - I_p = -\Pi(1). \qquad (9.4)$$

This shows that a unit root in the model for $X_t$ implies that $|\Pi| = 0$, such that the coefficient matrix $\Pi$ has reduced rank. This is parallel to the univariate case, $p = 1$, where $\Pi$ is $1 \times 1$ and a unit root requires that $\Pi$ has reduced rank, $\Pi = 0$.
A $p \times p$ matrix $\Pi$ with reduced rank, $r = \mathrm{rank}(\Pi) < p$, has only r linearly independent columns (and rows), while the remaining columns (and rows) are linear combinations of the r independent ones. Therefore we can write

$$\Pi = \alpha \beta',$$

where $\alpha$ is a $p \times r$ matrix containing the independent columns, while $\beta$ is a $p \times r$ matrix such that $\beta'$ contains the independent rows. As an example, consider the decomposition of the following reduced-rank matrix:

Example 9.1 (matrix decomposition): A reduced rank matrix, $\Pi$, can be decomposed. Consider here the case of a $3 \times 3$ matrix with rank 1:

$$\begin{pmatrix} 0.2 & 0.2 & 0.2 \\ 0.3 & 0.3 & 0.3 \\ 0.4 & 0.4 & 0.4 \end{pmatrix} = \begin{pmatrix} 0.2 \\ 0.3 \\ 0.4 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 \end{pmatrix} = \alpha \beta',$$

where $\alpha$ spans the column space and $\beta'$ the row space.
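The decomposition is easy to verify numerically. A minimal sketch (our own illustration, assuming numpy):

```python
import numpy as np

alpha = np.array([[0.2], [0.3], [0.4]])   # p x r = 3 x 1: independent column
beta = np.array([[1.0], [1.0], [1.0]])    # p x r = 3 x 1: beta' is the independent row
Pi = alpha @ beta.T                       # reconstructs the rank-1 matrix above

assert np.linalg.matrix_rank(Pi) == 1     # reduced rank, r = 1
```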

For the unit root case, we can therefore write the vector error correction equation as

$$\Delta X_t = \alpha \beta' X_{t-1} + \Gamma_1 \Delta X_{t-1} + \delta + \epsilon_t, \qquad (9.5)$$

where the coefficient to the levels has reduced rank. Three different cases are of interest:

Stationarity. If $X_t$ is a stationary process, then there are no unit roots, and

$$\Pi = -\Pi(1)$$

is a full rank matrix. The vector error-correction equation in (9.3) is a balanced equation, because the left and right hand sides are both stationary by definition.

Unit Roots and Co-integration. If $X_t$ is unit-root non-stationary, then the equation in (9.3) looks unbalanced, because the left hand side, $\Delta X_t$, is stationary while the right hand side contains non-stationary variables. We know that $\Pi = -\Pi(1)$ has reduced rank, however, and if the variables in $X_t$ co-integrate, such that $\beta' X_t$ is stationary, the equation in (9.5) is still balanced, because only the stationary combinations of $X_t$ enter the right-hand side. We see that the deviation from equilibrium the period before, $\beta' X_{t-1}$, enters and that the deviation is corrected by the first differences. The strength of the error correction is given by the speed of adjustment, $\alpha$.

Unit Roots and no Co-integration. If $X_t$ is a unit-root non-stationary process and the series do not co-integrate, then there are no stationary combinations of the levels to include in the model, and the equation in (9.3) is only balanced if $\Pi = 0$. This corresponds to the case of no error-correction, $\alpha = 0$, in the equation in (9.5), which reflects the representation result in Theorem 8.1. In this case, the VAR model we started with simplifies to a VAR model in first differences,

$$\Delta X_t = \Gamma_1 \Delta X_{t-1} + \delta + \epsilon_t.$$

The core of the co-integration analysis based on the VAR model is to determine the rank of $\Pi$ and to characterize the equilibrium relations in $\beta$ and the structure of the error correction in $\alpha$.

Example 9.2 (prices on the orange market, continued): Consider the prices of organic and regular oranges from Example 8.6, $X_t = (p^{org}_t, p^{reg}_t)'$. For $p = 2$ and one lag, we write the model as

$$\begin{pmatrix} \Delta p^{org}_t \\ \Delta p^{reg}_t \end{pmatrix} = \begin{pmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \end{pmatrix} \begin{pmatrix} p^{org}_{t-1} \\ p^{reg}_{t-1} \end{pmatrix} + \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix} + \begin{pmatrix} \epsilon^{org}_t \\ \epsilon^{reg}_t \end{pmatrix},$$

and the information on the long-run properties is contained in the matrix

$$\Pi = \begin{pmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \end{pmatrix}.$$

(1) If $\Pi = 0$, then the equation is in first differences only. The variables are unit-root non-stationary and do not co-integrate.
(2) If the rank of $\Pi$ is one, we use $\Pi = \alpha\beta'$ and write the vector error correction model as

$$\begin{pmatrix} \Delta p^{org}_t \\ \Delta p^{reg}_t \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} \begin{pmatrix} 1 & -\beta_2 \end{pmatrix} \begin{pmatrix} p^{org}_{t-1} \\ p^{reg}_{t-1} \end{pmatrix} + \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix} + \begin{pmatrix} \epsilon^{org}_t \\ \epsilon^{reg}_t \end{pmatrix},$$

where we have normalized the co-integration vector on the first entry, $p^{org}_t$. Estimating this model produces

$$\begin{pmatrix} \Delta p^{org}_t \\ \Delta p^{reg}_t \end{pmatrix} = \begin{pmatrix} -0.900 \\ -0.008 \end{pmatrix} \begin{pmatrix} 1 & -1.03 \end{pmatrix} \begin{pmatrix} p^{org}_{t-1} \\ p^{reg}_{t-1} \end{pmatrix} + \begin{pmatrix} 22.500 \\ 1.136 \end{pmatrix} + \begin{pmatrix} \hat\epsilon^{org}_t \\ \hat\epsilon^{reg}_t \end{pmatrix}.$$

The interpretation is that

$$\beta' X_t = \begin{pmatrix} 1 & -1.03 \end{pmatrix} \begin{pmatrix} p^{org}_t \\ p^{reg}_t \end{pmatrix} = p^{org}_t - 1.03\, p^{reg}_t$$

is stationary, such that the equilibrium relationship is given by

$$p^{org}_t = 1.03\, p^{reg}_t + u_t,$$

where $u_t$ is a stationary deviation from equilibrium. We see that if the price of regular oranges increases one unit, the price of organic oranges increases 1.03 in equilibrium. If prices are out of equilibrium, we note that it is the price of organic oranges that adjusts, removing 90 percent of the deviation in each time period.
(3) If the rank of $\Pi$ is two, it implies that both prices are stationary and the polynomial $\Pi(z)$ has no root at unity.

9.2 Inference
Assume now that $X_t$ is a unit-root non-stationary process. For the VAR model in (9.1) and the reformulation in (9.3) we can use the result in Theorem 8.2 to show that the estimators of all parameters are still individually $\sqrt{T}$-consistent and asymptotically Gaussian, such that we can use standard $\chi^2$ inference to determine the lag-length of the VAR process. To show this, we should rewrite the equations such that the parameter we are interested in is a parameter to a stationary (first-difference) term. For the vector error-correction model in (9.5) it also holds that $\beta' X_{t-1}$ is stationary, and given co-integration, the estimator for the speed of adjustment, $\hat\alpha$, is also asymptotically Gaussian.

Remark 9.1 (Granger causality): Importantly, we cannot rewrite the equation such that both parameter matrices in (9.1), $\Pi_1$ and $\Pi_2$, are parameters to stationary terms at the same time. This means that joint tests involving all parameter matrices are dangerous. This includes the test for Granger non-causality, where the statistic is not $\chi^2$ in the presence of unit roots.

For $\beta$ the situation is complicated. We can never rewrite the model such that $\beta$ is a parameter to a stationary term, and $\hat\beta$ is not asymptotically Gaussian. It holds, however, that it is super-consistent, and $T(\hat\beta - \beta)$ converges to a distribution with mean zero. This distribution is known as a mixed Gaussian distribution, which is a Gaussian distribution where the variance is a random variable. For the case $\beta = (1, \beta_2)'$, you could imagine a result like:

$$T(\hat\beta_2 - \beta_2) \xrightarrow{d} N(0, \sigma^2), \quad \text{where } \sigma^2 \text{ is a random variable.}$$

We can consistently estimate $\sigma^2$ from the data, however, $\hat\sigma^2 \xrightarrow{p} \sigma^2$, and the t-test statistic

$$t_{\beta_2 = b} = \frac{\hat\beta_2 - b}{se(\hat\beta_2)} = \frac{\hat\beta_2 - b}{\sqrt{T^{-2}\hat\sigma^2}}$$

will have a standard $N(0,1)$ distribution. Likewise, Wald test statistics and likelihood ratio statistics involving the parameters in $\beta$ have asymptotic $\chi^2$ distributions under the null.

9.3 Test for the Co-integration Rank

Consider the case of $p = 2$ variables, such that $\Pi$ is a $2 \times 2$ matrix with 4 parameters. If $\Pi = 0$, such that the time series are unit-root non-stationary without co-integration, the number of parameters in $\Pi$ is zero. If the rank of $\Pi$ is $\mathrm{rank}(\Pi) = r = 1$, then we can decompose the matrix as

$$\Pi = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} \begin{pmatrix} 1 & \beta_2 \end{pmatrix},$$

which (because of the normalization) has three parameters to be estimated. This gives a sequence of three nested models:

$$H_0 : \Delta X_t = \Gamma_1 \Delta X_{t-1} + \delta + \epsilon_t$$
$$H_1 : \Delta X_t = \alpha \beta' X_{t-1} + \Gamma_1 \Delta X_{t-1} + \delta + \epsilon_t$$
$$H_2 : \Delta X_t = \Pi X_{t-1} + \Gamma_1 \Delta X_{t-1} + \delta + \epsilon_t$$

                              95 Percent Quantile
Distribution    p - r = 5    p - r = 4    p - r = 3    p - r = 2    p - r = 1
DF2                69.61        47.71        29.80        15.41         3.84
DF2c               76.81        53.94        35.07        20.16         9.14
DF2l               88.55        63.66        42.77        25.73        12.45

Table 9.1: Asymptotic critical values (at a five percent level) for the likelihood ratio test for the co-integration rank. p - r indicates the number of unit roots under the null hypothesis.

and we can test for the co-integration rank using likelihood ratio tests. In particular we can look at

$$LR(H_0 \mid H_2) = -2\,(\log L(H_0) - \log L(H_2)) \qquad (9.6)$$
$$LR(H_1 \mid H_2) = -2\,(\log L(H_1) - \log L(H_2)), \qquad (9.7)$$

where $\log L(H_i)$ denotes the maximized log-likelihood value, $i = 0, 1, 2$. These test statistics are known as trace test statistics in the co-integration literature, and because they involve unit-root behavior under the null, they will follow Dickey-Fuller-type distributions as $T \to \infty$. The distribution depends on the number of unit roots tested, i.e. the dimension of the system minus the co-integration rank under the null hypothesis, $p - r$. The distribution also depends on the included deterministic terms, and the five percent critical values for the case with a constant are reported as DF2 in Table 9.1.

In practice we first test for no co-integration using (9.6). If we cannot reject this hypothesis, we know that $\Pi = 0$ and we keep the model in first differences. If we reject the model $H_0$, we know that the variables have some stationary components. We then test for one co-integrating relationship using (9.7). If we cannot reject this, we keep the model and continue to interpret $\alpha$ and $\beta$. If we reject $H_1$, we end in the model $H_2$ and conclude that $\Pi$ has full rank and that the variables are stationary. An empirical example is given in Example 9.4 below.
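In practice, the sequence of trace tests and the estimation under reduced rank can be carried out with standard software. A minimal sketch using the VECM tools in statsmodels (our own illustration on simulated data; det_order=0 includes a constant term):

```python
import numpy as np
from statsmodels.tsa.vector_ar.vecm import VECM, select_coint_rank

# Simulated stand-in data: two I(1) series sharing one stochastic trend.
rng = np.random.default_rng(1)
trend = np.cumsum(rng.standard_normal(200))
X = np.column_stack([trend + rng.standard_normal(200),
                     0.5 * trend + rng.standard_normal(200)])

# Sequence of trace tests: stop at the smallest rank that is not rejected.
rank = select_coint_rank(X, det_order=0, k_ar_diff=1,
                         method="trace", signif=0.05)
print(rank.summary())

# Estimate the VECM for the selected rank; alpha and beta hold the
# adjustment coefficients and the co-integration vectors.
res = VECM(X, k_ar_diff=1, coint_rank=rank.rank, deterministic="co").fit()
print(res.alpha, res.beta)
```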

Remark 9.2 (unit root testing): If the model is univariate, $p = 1$, the rank test is similar in spirit to the Dickey-Fuller unit root test and is identical to the (two-sided) likelihood ratio test for a unit root. If $p > 1$ we can use the test sequence to determine how many stationary combinations there exist between the variables. Because the rank test procedure is a generalization of the unit root test, there is no reason to test

for the presence of unit roots in univariate models before the co-integration analysis.
Presence of unit roots will automatically be detected by the test procedure.

Remark 9.3 (including stationary variables): If we use a data vector $X_t = (x_{1t}, x_{2t})'$ where one variable has a unit root, $x_{1t}$ is I(1), while the other one, $x_{2t}$, is stationary, the co-integration analysis is still valid. We should then find that $\mathrm{rank}(\Pi) = 1$, with

$$\beta = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad \text{such that} \quad \beta' X_t = \begin{pmatrix} 0 & 1 \end{pmatrix} \begin{pmatrix} x_{1t} \\ x_{2t} \end{pmatrix} = x_{2t},$$

which is stationary. Our model would then be a model including the (stationary) level of $x_{2t}$ and only first differences of $x_{1t}$.

9.4 The Moving Average Solution

Given the co-integrated VAR model, e.g.

$$H_r : \Delta X_t = \alpha \beta' X_{t-1} + \Gamma_1 \Delta X_{t-1} + \epsilon_t,$$

it is possible to find the moving average solution for $X_t$. This is often referred to as the Granger representation, and it has the form

$$X_t = C \sum_{i=1}^{t} \epsilon_i + C_0 \epsilon_t + C_1 \epsilon_{t-1} + ... + C_{t-1} \epsilon_1 + A, \qquad (9.8)$$

see also (8.12), where C is a $p \times p$ matrix and $C_0, C_1, C_2, ...$ is a sequence of $p \times p$ matrix coefficients that converges to zero.

Because $\beta' C = 0$ it must hold that C has reduced rank, and we can decompose C into its $p - r$ independent columns and rows:

$$C = B D',$$

where $\beta' B = 0$. To be precise, it holds that $C = \beta_\perp(\alpha_\perp' \Gamma \beta_\perp)^{-1} \alpha_\perp'$, such that $B = \beta_\perp(\alpha_\perp' \Gamma \beta_\perp)^{-1}$ and $D = \alpha_\perp$, where $\beta_\perp$ is notation for a $p \times (p - r)$ matrix that is orthogonal to the $p \times r$ matrix $\beta$, and $\Gamma = I_p - \Gamma_1$. Inserting this expression, we get

$$X_t = B D' \sum_{i=1}^{t} \epsilon_i + C_0 \epsilon_t + C_1 \epsilon_{t-1} + ... + C_{t-1} \epsilon_1 + A. \qquad (9.9)$$

This expression allows us to find the shocks that have permanent effects on the data in $X_t$, namely the linear combinations, $D' \epsilon_t$, that are accumulated into the common stochastic trends

$$\sum_{i=1}^{t} D' \epsilon_i = \sum_{i=1}^{t} \alpha_\perp' \epsilon_i.$$

The remaining shocks have only transitory effects.

We can also see how the random walks affect the variables by considering the coefficients in $B = \beta_\perp(\alpha_\perp' \Gamma \beta_\perp)^{-1}$.

Example 9.3 (prices on the orange market, continued): For the estimated co-integrated VAR model for the orange prices in Example 9.2 we have $p = 2$ variables, a co-integration rank of $r = 1$ and therefore $p - r = 1$ stochastic trend.

In this case, the coefficients of the Granger representation are given by

$$B = \begin{pmatrix} 1.06 \\ 1.03 \end{pmatrix} \quad \text{and} \quad D = \begin{pmatrix} -0.00888 \\ 1 \end{pmatrix}.$$

This means that the shock to the orange market with permanent effect is

$$D' \epsilon_t = \begin{pmatrix} -0.00888 & 1 \end{pmatrix} \begin{pmatrix} \epsilon^{org}_t \\ \epsilon^{reg}_t \end{pmatrix} = \epsilon^{reg}_t - 0.00888\, \epsilon^{org}_t,$$

which is almost just the shock to the price of regular oranges. We also see that the stochastic trend loads into the variables with almost identical coefficients in B. This behavior closely corresponds to the parallel movements in Figure 8.1 (A).
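The coefficients B and D can be computed directly from the estimates. A minimal sketch (our own illustration, assuming numpy; since the orange model has one lag, $\Gamma = I_2$, and note that B and D are only identified up to scale while $C = BD'$ is invariant):

```python
import numpy as np

# Estimates from Example 9.2; the model has one lag, so Gamma = I_2.
alpha = np.array([[-0.900], [-0.008]])
beta = np.array([[1.0], [-1.03]])
Gamma = np.eye(2)

def perp(a):
    """Orthogonal complement of a p x r matrix, returned as p x (p - r)."""
    u, _, _ = np.linalg.svd(a)
    return u[:, a.shape[1]:]

a_perp, b_perp = perp(alpha), perp(beta)
B = b_perp @ np.linalg.inv(a_perp.T @ Gamma @ b_perp)
D = a_perp
C = B @ D.T    # the long-run impact matrix; invariant to the scaling of B and D
```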

9.5 Summary
We summarize the co-integration analysis based on the vector autoregression as follows:

(1) Estimate a well-specified VAR model for the variables in $X_t$, allowing for the relevant deterministic variables. Test statistics for lag-length determination have standard $\chi^2$ distributions.
(2) Determine the co-integration rank by looking at the sequence of nested models,

$$H_0 \subset H_1 \subset ... \subset H_r \subset ... \subset H_p.$$

Calculate the likelihood ratio (trace) test statistics

$$LR(H_0 \mid H_p),\ LR(H_1 \mid H_p),\ ...,\ LR(H_{p-1} \mid H_p),$$



and stop at the smallest model that is not rejected. The distributions of the trace test statistics are non-standard and depend on the deterministic specification.
(3) For the preferred model, $H_r$ say, characterize the equilibrium relationships, $\beta$, and the speed of adjustment, $\alpha$. Likelihood ratio statistics for testing hypotheses on $\alpha$ and $\beta$ have standard $\chi^2$ distributions.

Example 9.4 (private consumption, continued): Consider again the analysis for consumption, $X_t = (c_t, y_t, w_t)'$. To determine the lag-length of the VAR(k) model, we use LR tests and critical values from a $\chi^2$ distribution. We end with a VAR(2) model, and standard misspecification tests suggest that the model is reasonably well-specified, with no signs of autocorrelation. There are a few outliers that we could remove by inserting dummy variables, but we do not do so here.

The estimated model in levels, $\hat X_t = \hat\Pi_1 X_{t-1} + \hat\Pi_2 X_{t-2} + \hat\delta$, is given by

$$\hat\Pi_1 = \begin{pmatrix} 0.541 & 0.0246 & 0.151 \\ 0.269 & 0.600 & -0.0433 \\ -0.181 & 0.0693 & 0.997 \end{pmatrix}, \quad \hat\Pi_2 = \begin{pmatrix} 0.153 & 0.135 & -0.0342 \\ -0.348 & 0.356 & 0.147 \\ 0.0906 & 0.0515 & -0.0215 \end{pmatrix}, \quad \hat\delta = \begin{pmatrix} 0.0759 \\ 0.0996 \\ 0.0117 \end{pmatrix},$$

with t-ratios

$$\begin{pmatrix} 5.3 & 0.35 & 1.0 \\ 2.0 & 6.5 & -0.23 \\ -2.7 & 1.5 & 10 \end{pmatrix}, \quad \begin{pmatrix} 1.5 & 1.9 & -0.2 \\ -2.6 & 3.7 & 0.75 \\ 1.4 & 1.1 & -0.22 \end{pmatrix}, \quad \begin{pmatrix} 0.72 \\ 0.71 \\ 0.17 \end{pmatrix},$$

respectively. To start the co-integration analysis, we look at $\Pi = \Pi_1 + \Pi_2 - I$, which is estimated as

$$\hat\Pi = \begin{pmatrix} -0.306 & 0.160 & 0.117 \\ -0.0783 & -0.044 & 0.104 \\ -0.0904 & 0.121 & -0.0248 \end{pmatrix}, \quad t\text{-ratios} = \begin{pmatrix} -4.1 & 3.3 & 2.3 \\ -0.8 & -0.7 & 1.5 \\ -1.8 & 3.8 & -0.7 \end{pmatrix}.$$

It is not easy to see the rank directly, but the second row does not look very significant, indicating that the matrix has a zero row and therefore reduced rank.

The likelihood ratio tests for reduced rank are given in the following table:

Rank, r    Log-likelihood    LR(H_r | H_p)    5% critical value    p-value
0             978.259            30.08              29.80            0.046
1             988.938             8.72              15.41            0.399
2             993.239             0.12               3.84            0.727
3             993.300

We first test $r = 0$ against the unrestricted VAR with $r = 3$, and obtain the LR statistic $LR(H_0 \mid H_p) = 30.08$. This hypothesis involves $p - r = 3$ unit roots, and the critical value from Table 9.1 is 29.80. We conclude that the hypothesis is borderline rejected with a p-value of 0.046. We now know that $r \geq 1$. We next test $r \leq 1$ against $r = 3$. The statistic is $LR(H_1 \mid H_p) = 8.72$ and produces a p-value of 0.399. We therefore do not reject and keep the model with $r = 1$ as the preferred one.

The estimates based on this model are given by

$$\begin{pmatrix} \Delta \hat c_t \\ \Delta \hat y_t \\ \Delta \hat w_t \end{pmatrix} = \begin{pmatrix} -0.222 \\ 0.0234 \\ -0.135 \end{pmatrix} \begin{pmatrix} 1 & -0.812 & -0.141 \end{pmatrix} \begin{pmatrix} c_{t-1} \\ y_{t-1} \\ w_{t-1} \end{pmatrix} + ...,$$

with t-ratios $(-3.8, 0.30, -3.5)'$ on $\hat\alpha$ and $(-5.9, -1.1)$ on the estimated long-run coefficients, and where we have left out the constant and the additional lag of first differences. The estimate of the long-run relation suggests that in equilibrium

$$c_t = 0.812\, y_t + 0.141\, w_t,$$

such that higher income and wealth increase consumption in the long run. The coefficient to wealth is not very significant, with a t-ratio of $-1.1$, which is asymptotically standard normal. If we are away from equilibrium, we see from $\hat\alpha$ that consumption error corrects, removing 22% of the disequilibrium each quarter, while income does not error-correct. There is also correction in wealth, but wealth is moving away from equilibrium, indicating that the wealth component is destabilizing the economy.

The sum of long-run coefficients is $0.812 + 0.141 = 0.953$, which is quite close to one. We therefore test the hypothesis that the sum is one, producing the new set of estimates

$$\begin{pmatrix} \Delta \hat c_t \\ \Delta \hat y_t \\ \Delta \hat w_t \end{pmatrix} = \begin{pmatrix} -0.172 \\ 0.0358 \\ -0.123 \end{pmatrix} \begin{pmatrix} 1 & -0.965 & -0.035 \end{pmatrix} \begin{pmatrix} c_{t-1} \\ y_{t-1} \\ w_{t-1} \end{pmatrix} + ...,$$

with t-ratios $(-3.4, 0.50, -3.7)'$ and $(-6.7, -0.20)$, where the importance of wealth in private consumption has almost vanished. The test statistic for the hypothesis is 0.535, which is not significant in a $\chi^2(1)$-distribution. Because the error-correction in income is weak, we impose a further restriction on $\alpha$ and obtain

$$\begin{pmatrix} \Delta \hat c_t \\ \Delta \hat y_t \\ \Delta \hat w_t \end{pmatrix} = \begin{pmatrix} -0.188 \\ 0 \\ -0.126 \end{pmatrix} \begin{pmatrix} 1 & -0.936 & -0.064 \end{pmatrix} \begin{pmatrix} c_{t-1} \\ y_{t-1} \\ w_{t-1} \end{pmatrix} + ...,$$

with t-ratios $(-3.9, \cdot, -3.7)'$ and $(-6.6, -0.20)$.

The joint hypothesis that the long-run coefficients sum to one and a zero in $\alpha$ gives a test statistic of 0.797, which is not significant in a $\chi^2(2)$ distribution.

9.6 Further Issues

We conclude this chapter by discussing some further issues in the analysis of the co-integrated VAR model.

9.6.1 Normalizations and Identification

The parameters $\alpha$ and $\beta$ only enter the model and the likelihood function in terms of their product,

$$\Pi = \alpha \beta'.$$

This means that $\Pi$ (and therefore the model likelihood) is unchanged if we multiply $\alpha$ with a constant $\omega$ and divide $\beta$ with the same constant, i.e.

$$\tilde\Pi = \tilde\alpha \tilde\beta' = \alpha \omega^{-1} \omega \beta' = \alpha \beta' = \Pi.$$

For the case with $r = 1$ we solve this by imposing a normalization, and we fix one coefficient in $\beta$ to unity, e.g. for $p = 4$,

$$\beta = \begin{pmatrix} 1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix}.$$

When $r \geq 2$, we can still scale the columns of $\beta$, but we can also swap the order of the columns without changing the likelihood (if we change the order of the columns in $\alpha$ similarly). In addition, we may even replace a column $\beta_i$ with some linear combination of the columns in $\beta$, as long as we do not change the rank of $\beta$ and we make the necessary adjustment to $\alpha$. We say that we can perform column operations on $\beta$.

The column operations do not change the likelihood, but may enhance the interpretation. If the columns of the new version of $\beta$ impose different restrictions, such that each column has a unique structure, we say that $\beta$ is identified, and we can potentially attach names and economic interpretations to the individual columns. To illustrate, consider a simple example.

Example 9.5 (identification): Consider a market for fish, where we observe the quantity of fish, $Q_t$, the price of fish, $P_t$, the weather conditions for fishermen, $W_t$, and the price of meat, which is a close substitute for fish, $P^*_t$, i.e. the $p = 4$ variables,

$$X_t = (Q_t, P_t, W_t, P^*_t)'.$$

Imagine that we have performed a co-integration analysis and find $r = 2$ equilibrium relationships, such that $\beta$ is $4 \times 2$, i.e.

$$\beta = \begin{pmatrix} * & * \\ * & * \\ * & * \\ * & * \end{pmatrix},$$

where $*$ indicates an unrestricted coefficient.

As economists, we may want to find a demand curve for fish and a supply curve for fish. So far, however, we do not know if the first column in $\beta$ is demand or supply or a linear combination of the two. From theory, however, we may have the idea that the weather conditions affect the supply and not the demand, and reversely, that the price of substitute meat affects demand and not supply. This allows us to consider instead

$$\tilde\beta = \begin{pmatrix} 1 & 1 \\ * & * \\ * & 0 \\ 0 & * \end{pmatrix}.$$

The new $\tilde\beta$ can be obtained from the original $\beta$ by column operations and re-scaling, and if we make the necessary adjustments to $\alpha$, the two versions correspond to models with the same $\Pi$ and the same likelihood. The system with $\tilde\beta$, however, is easier to interpret, because we know that the first column is the supply curve for fish (because it excludes the price of substitute meat). The second column is the demand curve (because it excludes the weather variable).

We would say in this case that $\beta$ is identified.

9.6.2 Deterministic Terms


So far, we have considered co-integrated VAR models with very simple deterministic
speci…cations, e.g. a constant term in the vector autoregression. In general, the
interpretation of deterministic terms are complicated in dynamics models with unit
roots.
The Granger representation in §9.4 was given without deterministic terms and
shows how the shock, t , accumulates in the model. If we include a constant term, ,
9.6 Further Issues 227

it will appear in the same way as t, and we get the moving average solution

X
t
xt = C ( + i ) + C0 ( + t ) + C1 ( + t 1) + ::: + Ct 1 ( + 1) + A:
i=1

The full e¤ect of the constant is


X
t
C + (C0 + C1 + ::: + Ct 1 ) ; (9.10)
i=1

where the …rst part is a linear trend, C t, and the second part is a constant. We
conclude that inserting a constant term allows for a linear trend in the data, Xt .
Because 0 C = 0, the linear trend cancels is the co-integration relationships and 0 Xt
has no trend. We sometimes say that the model is not balanced in terms of the
deterministic speci…cation.
As a solution to get a balanced behavior of $X_t$ and $\beta'X_t$, a common practice is to insert deterministic terms with restricted coefficients. For the constant term we would use
$$\mu = \alpha\rho_0,$$
such that the accumulation vanishes,
$$C\sum_{i=1}^{t}\mu = C\alpha\rho_0\sum_{i=1}^{t}1 = 0,$$
because $C\alpha = 0$, see §9.4. In this case the data and the co-integrating relations all have constant terms but no trends. We could write this model as
$$\Delta X_t = \alpha\beta'X_{t-1} + \Gamma_1\Delta X_{t-1} + \alpha\rho_0 + \varepsilon_t = \alpha\,(\beta',\ \rho_0)\begin{pmatrix}X_{t-1}\\1\end{pmatrix} + \Gamma_1\Delta X_{t-1} + \varepsilon_t, \qquad (9.11)$$
and we sometimes say that we have restricted the constant to the co-integration space. The different treatment of the constant term changes the asymptotic distributions of the test statistics for the co-integration rank. For the model in (9.11), the critical values are reported as DF2c in Table 9.1. Note that for $p - r = 1$, this is identical to the distribution of the likelihood ratio statistic for the joint unit root hypothesis, $H_0$, discussed in Chapter 7.
To get a trend also in the co-integration relationships, $\beta'X_t$, e.g. to account for some homogenous growth not accounted for by the included variables, we could augment the model with a constant and a trend term,
$$\Delta X_t = \alpha\beta'X_{t-1} + \Gamma_1\Delta X_{t-1} + \mu + \delta t + \varepsilon_t. \qquad (9.12)$$

That would allow for a quadratic trend in the data, however, because the accumulation of a trend produces a quadratic trend. To avoid the quadratic trend, we could use a restricted trend term,
$$\mu + \delta t = \mu + \alpha\rho_1 t,$$
and write the model as
$$\Delta X_t = \alpha\beta'X_{t-1} + \Gamma_1\Delta X_{t-1} + \mu + \alpha\rho_1 t + \varepsilon_t = \alpha\,(\beta',\ \rho_1)\begin{pmatrix}X_{t-1}\\ t\end{pmatrix} + \Gamma_1\Delta X_{t-1} + \mu + \varepsilon_t. \qquad (9.13)$$
The constant term would then accumulate to a trend in the data, while $\alpha\rho_1 t$ would give a trend in $\beta'X_t$ without accumulating to a quadratic trend. For the model in (9.13), the critical values are reported as DF2l in Table 9.1.
Similar challenges prevail for including dummy variables for level shifts and outliers, as they will accumulate in the data. To allow a level shift in the data and in the co-integrating relations, the solution is to include a step dummy,
$$D_t = I(t \geq T_0) = \begin{cases} 0, & t < T_0 \\ 1, & t \geq T_0, \end{cases}$$
with a restricted coefficient, and to include the first differences of the dummy with unrestricted coefficients, e.g.
$$\Delta X_t = \alpha\,(\beta',\ \rho_2)\begin{pmatrix}X_{t-1}\\ D_{t-1}\end{pmatrix} + \Gamma_1\Delta X_{t-1} + \mu + \theta_0\Delta D_t + \theta_1\Delta D_{t-1} + \varepsilon_t. \qquad (9.14)$$

The asymptotic distribution depends on the presence of the dummy variable and on where in the sample the break occurs, i.e. on $T_0/T$, and critical values have to be found on a case-by-case basis. A detailed treatment of this topic is beyond the scope of this chapter.

9.7 Concluding Remarks and Further Readings

This chapter has discussed the statistical analysis of co-integration within the framework of a vector autoregression. This approach is conceptually more elegant than single-equation methods, and it avoids the assumption that only one variable error corrects. An introduction to vector error correction models and the analysis of co-integration in a VAR model is given in Juselius (2007), while the (very technical) theory is given in Johansen (1996).
Chapter 10

Modelling Volatility
in Financial Data: Introduction
to ARCH and GARCH

So far in this course, we have mainly discussed models for the conditional mean and we have focussed particularly on the structure of time dependence. In many applications within finance, however, there is a major interest also in the conditional variance of a time series, vaguely interpreted as the risk of holding certain assets. In this chapter we introduce a particular class of so-called autoregressive conditional heteroskedasticity (ARCH) models for the conditional variance. In 2003, Robert Engle was awarded the Nobel prize for the ARCH model, and since its introduction in Engle (1982) many interesting refinements have been proposed. In this chapter we focus on the main ideas, and, although interesting from a theoretical and empirical point of view, we only briefly mention some of the possible extensions. To give you a feeling for the behavior of financial data and the ARCH model, we emphasize the properties using an empirical example.

10.1 Changing Volatility in Time Series


Applications in financial economics are often interested in both the mean and the variance of investment returns, and sometimes in the entire return distribution. An intuitive reason is that an investor typically faces a trade-off between the return of the investment and the risk that it implies. Two classical examples are the following:




Example 10.1 (portfolio optimization): One interesting application where the mean-variance trade-off is obvious is within the area of portfolio management, i.e. how to combine bonds, stocks, etc. in a portfolio. A simple setup used in some theoretical and applied work is the so-called mean-variance utility function, where the expected utility of a stochastic portfolio return, $y$, is defined as
$$E[u(y)] = E(y) - \lambda V(y), \qquad (10.1)$$
where $u(\cdot)$ is the utility function, and $\lambda$ is the degree of risk aversion, measuring the explicit trade-off between the expected return, $E(y)$, and the associated variance, $V(y)$.

Example 10.2 (value at risk): An application where the whole distribution of returns is needed is the calculation of the so-called Value-at-Risk, VaR. The VaR is used in stress tests of banks and other financial institutions, and it measures the maximum loss of a given portfolio with stochastic return $y$, calculated at some prespecified confidence level. Technically, the 5% VaR of a portfolio is a return value, $\mathrm{VaR}_{0.05}(y)$, such that the probability of a loss larger than $\mathrm{VaR}_{0.05}(y)$ is exactly 5%, i.e.
$$\mathrm{prob}(y < \mathrm{VaR}_{0.05}(y)) = 0.05, \qquad (10.2)$$
and, by definition, it corresponds to the 5% quantile of the distribution of the return.

A stylized fact of the behavior of many financial time series is a tendency for the variance, or volatility, to be non-constant. In particular, as noted by Mandelbrot (1963):

“...large changes tend to be followed by large changes, of either sign, and small changes tend to be followed by small changes.”

In terms of economics (or the psychology of investors) we may think of a large shock to the return at time $t-1$, $y_{t-1}$. Then there is a high probability of another large shock at time $t$, $y_t$. The interpretation could be that the large shock at time $t-1$ has upset the market, and a period of larger uncertainty on the future direction of $y_t$ follows. This idea, often referred to as volatility clustering, is just a reflection of high and low market uncertainty. The non-constant (conditional) variance is known as heteroskedasticity, and this particular form of time-varying heteroskedasticity is known as autoregressive conditional heteroskedasticity or ARCH. One main insight of the ARCH model is the distinction between conditional and unconditional variance.

This is parallel to the distinction in autoregressive models between the conditional and unconditional mean.
In many applications in finance, the autocorrelation of returns is quite low. This can be seen as a weak indication of the validity of the efficient market hypothesis, because strong autocorrelations would make future returns predictable. Note, however, that the ARCH effects will make the squared returns (and also absolute returns) autocorrelated, and an informal way to examine a data set for potential ARCH effects is therefore to look at the squared or absolute returns and to calculate the autocorrelation function (ACF).

Example 10.3 (volatility of sp500 stock returns): As an example, let $y_t$ denote the daily log-return for the Standard & Poor's stock market index, calculated as
$$y_t = 100 \cdot \Delta\log(\mathrm{SP500}_t),$$
where $\mathrm{SP500}_t$ is the daily closing price for the period January 2, 1997, to February 27, 2018. The time series of returns is reported in Figure 10.1 (A). We note that the variance is high in some periods, e.g. around the financial crisis in 2008, while movements are much smaller in other periods. Notice that the periods of large variation entail both positive and negative returns, so whereas the mean seems by-and-large constant, the variance is non-constant. We say that there are clear ARCH effects.
Figure 10.1 (B) depicts the absolute returns for the SP500 index, $|y_t|$. In this graph the time dependence is clearly visible, reflecting the clusters of high variance. A similar picture is visible from squared returns, but time series graphs for squared returns have a tendency to be dominated by a few large returns, and the pattern is often more clear for absolute returns.
Next, Figure 10.1 (C) shows the ACF for the returns, $y_t$, and the absolute returns, $|y_t|$. Whereas the autocorrelations are generally small for the returns, they are much larger for absolute returns, again indicating ARCH effects.
To measure the volatility of returns, we can estimate the variance of returns for the whole sample using the usual formula for the unconditional empirical variance,
$$\hat{s}^2 = \frac{1}{T}\sum_{t=1}^{T}\hat{\varepsilon}_t^2, \qquad (10.3)$$
where $\hat\varepsilon_t = y_t - \hat\mu$ is the deviation of $y_t$ from the estimated mean $\hat\mu = T^{-1}\sum_{t=1}^{T} y_t$. To illustrate the expected variation in $\varepsilon_t$ we could draw approximate 95% confidence bands as $\pm 1.96\cdot\hat{s}$. These constant confidence bands are reported in Figure 10.1 (D) as the unconditional volatility. Notice that due to the non-constant variance, the constant bands are very poor measures of the uncertainty at a given point in time, and the observations outside the band seem to cluster. If we used $\hat{s}$ as a measure of

the risk of having a position in the SP500 index, we would in some periods markedly underestimate the risk, e.g. in 2008, while in other periods we would overestimate the risk, e.g. in 2017.
As a measure of the uncertainty at a given point in time, $t$, we could instead calculate the variance using a small window of observations, e.g. 25 observations (approximately the number of trading days in one month):
$$\frac{1}{25}\sum_{i=1}^{25}\hat{\varepsilon}_{t-i}^2. \qquad (10.4)$$
This measure of the conditional variance is also depicted in graph (D). According to the graph, the time-varying measure seems to be a better approximation of the variance at a given point in time.
The descriptive measure in (10.4) could be refined by changing the weights in the moving average, and the class of ARCH models we introduce below is a way of formally embedding this kind of changing confidence bands into the statistical model.
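As an illustration of (10.3) and (10.4) (added here, not in the original text), the following sketch computes both the unconditional variance and the 25-day rolling variance for a generic return series; the data are a placeholder:

```python
import numpy as np

def volatility_bands(y, window=25):
    """Unconditional and rolling estimates of the return variance, cf. (10.3)-(10.4)."""
    eps = y - y.mean()                       # deviations from the estimated mean
    s2_uncond = np.mean(eps**2)              # unconditional variance, (10.3)
    # rolling variance over the previous `window` squared deviations, (10.4)
    s2_roll = np.full(len(y), np.nan)
    for t in range(window, len(y)):
        s2_roll[t] = np.mean(eps[t - window:t]**2)
    return s2_uncond, s2_roll

y = np.random.default_rng(1).normal(size=500)     # placeholder for actual returns
s2, s2_t = volatility_bands(y)
bands = 1.96 * np.sqrt(s2)                        # constant 95% bands
bands_t = 1.96 * np.sqrt(s2_t)                    # time-varying 95% bands
```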

Remark 10.1 (returns and log-returns): Econometric analyses of financial data typically use log-returns,
$$y_t = 100 \cdot \Delta\log(\mathrm{SP500}_t) = 100 \cdot (\log(\mathrm{SP500}_t) - \log(\mathrm{SP500}_{t-1})), \qquad (10.5)$$
sometimes also referred to as continuously compounded returns. The scaling with 100, also applied in Example 10.3 above, is used in order to make the maximization of the likelihood function more stable and to make the log-returns comparable to percentage returns, defined as
$$\tilde{y}_t = 100 \cdot \left(\frac{\mathrm{SP500}_t}{\mathrm{SP500}_{t-1}} - 1\right).$$
For small returns, the difference is only minor, but log-returns have the advantage of being additive over time, such that the total return from time $t$ to time $t+h$ is simply
$$100\cdot(\log(\mathrm{SP500}_{t+h}) - \log(\mathrm{SP500}_t)) = y_{t+1} + y_{t+2} + \dots + y_{t+h}.$$
To accumulate percentage returns, we have to use
$$\frac{\mathrm{SP500}_{t+h}}{\mathrm{SP500}_t} - 1 = \frac{\mathrm{SP500}_{t+1}}{\mathrm{SP500}_t}\cdot\frac{\mathrm{SP500}_{t+2}}{\mathrm{SP500}_{t+1}}\cdots\frac{\mathrm{SP500}_{t+h}}{\mathrm{SP500}_{t+h-1}} - 1 = \left(1 + \frac{\tilde{y}_{t+1}}{100}\right)\left(1 + \frac{\tilde{y}_{t+2}}{100}\right)\cdots\left(1 + \frac{\tilde{y}_{t+h}}{100}\right) - 1,$$
where the formula is multiplicative rather than additive.
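A small numerical illustration (added, not part of the original text) of the additive accumulation of log-returns versus the multiplicative accumulation of percentage returns, using hypothetical prices:

```python
import numpy as np

prices = np.array([100.0, 102.0, 99.0, 103.0])      # hypothetical closing prices

log_ret = 100 * np.diff(np.log(prices))             # log-returns, cf. (10.5)
pct_ret = 100 * (prices[1:] / prices[:-1] - 1)      # percentage returns

# Total return over the whole period, two ways:
total_log = log_ret.sum()                           # additive accumulation
total_pct = (np.prod(1 + pct_ret / 100) - 1) * 100  # multiplicative accumulation

print(total_log)   # 100 * (log(103) - log(100)) ≈ 2.96
print(total_pct)   # 100 * (103/100 - 1) = 3.00
```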


[Figure 10.1 here. Panels: (A) Daily returns on SP500; (B) Absolute returns; (C) Autocorrelation function (returns and absolute returns); (D) Measures of volatility (return, unconditional volatility, conditional volatility).]

Figure 10.1: Returns and absolute returns for the SP500 stock market index from
January 2, 1997, to February 27, 2018. See details in Example 10.3.

10.2 The ARCH Model Defined


Let $y_t$ denote the variable of interest, where we think of $y_t$ as a (stationary) time series for, e.g., an asset return or an interest rate. We consider the following linear model for the conditional mean,
$$E(y_t \mid x_t) = x_t'\beta, \quad t = 1, 2, \dots, T,$$
which corresponds to the linear regression model,
$$y_t = x_t'\beta + \varepsilon_t, \qquad (10.6)$$
where $x_t$ is a vector of predetermined explanatory variables (possibly including lags of $y_t$), $\beta$ is a vector of parameters, and $E(\varepsilon_t \mid x_t) = 0$. In the finance literature, there is a long history of seeing returns as approximately uncorrelated and unpredictable over time, and it is difficult to model the conditional mean of an asset return. This fact is often seen as a weak form of market efficiency, because predictability would typically allow easy speculative gains. It implies that the equation in (10.6) is often difficult to specify, and $x_t'\beta$ only accounts for a small proportion of the variation in $y_t$. In many cases, the model for the conditional mean only includes a constant, such that
$$y_t = \mu + \varepsilon_t. \qquad (10.7)$$
In the presence of conditional heteroskedasticity in $\varepsilon_t$, OLS estimation of the equation for the conditional mean in (10.6) is consistent but inefficient. The sufficient condition for consistency of the OLS estimator in (10.6) is that $E(x_t\varepsilon_t) = 0$, and conditional heteroskedasticity will not in general violate that. OLS is no longer efficient, however: there exists a non-linear model that takes the ARCH effects into account, and the estimator in this model has a smaller variance.
To introduce a model for conditional heteroskedasticity, we specify an equation also for the conditional variance. We therefore first define the conditional variance:
$$\sigma_t^2 = E(\varepsilon_t^2 \mid I_{t-1}), \qquad (10.8)$$
where $I_{t-1} = \{y_{t-1}, y_{t-2}, \dots, x_t, x_{t-1}, x_{t-2}, \dots\}$ is the information set available at the beginning of period $t$ (i.e. including the predetermined variable $x_t$). The class of ARCH models consists of an equation for the conditional mean (10.6) augmented with an equation for the conditional variance, $\sigma_t^2$.
Engle (1982) suggests a statistical model for $\sigma_t^2$ which follows directly from the descriptive measure in (10.4). In particular, he suggests the ARCH(p) model for the conditional variance:
$$\sigma_t^2 = \varpi + \alpha_1\varepsilon_{t-1}^2 + \alpha_2\varepsilon_{t-2}^2 + \dots + \alpha_p\varepsilon_{t-p}^2, \qquad (10.9)$$

or
$$\sigma_t^2 = \varpi + \alpha(L)\varepsilon_t^2,$$
where $\alpha(L) = \alpha_1 L + \alpha_2 L^2 + \dots + \alpha_p L^p$ is a lag polynomial. Notice that the variance, $\sigma_t^2$, is a linear function of $p$ lagged squared residuals. To ensure a consistent model that generates a positive variance, we need to constrain the parameters, $\varpi > 0$ and $\alpha_i \geq 0$, $i = 1, 2, \dots, p$. The economic interpretation is straightforward: if there is a large shock at time $t-1$, i.e. if $\varepsilon_{t-1}^2$ is large, then the variance of the following shocks will also be large. Graphically, the width of the confidence bands depends on the squared magnitudes of the past shocks.
A common way to write the full model is the following:
$$y_t = x_t'\beta + \varepsilon_t$$
$$\varepsilon_t = \sigma_t z_t$$
$$\sigma_t^2 = \varpi + \alpha_1\varepsilon_{t-1}^2 + \alpha_2\varepsilon_{t-2}^2 + \dots + \alpha_p\varepsilon_{t-p}^2,$$
where $z_t$ is an i.i.d. error term with mean zero and unit variance. At each point in time there is a new independent shock to the system, $z_t$, which is scaled by $\sigma_t$ so that the ARCH error has a conditional variance of $E(\varepsilon_t^2 \mid I_{t-1}) = \sigma_t^2$, and the conditionally heteroskedastic shock, $\varepsilon_t$, drives the observed return process $y_t$ together with potential regressors, $x_t$.
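To make the data-generating mechanism concrete, here is a minimal simulation sketch (added, not from the text) of the ARCH(1) special case with a constant mean; the parameter values are illustrative, and the constant $\varpi$ is named `omega` in the code:

```python
import numpy as np

def simulate_arch1(T, mu=0.05, omega=0.2, alpha=0.5, seed=0):
    """Simulate y_t = mu + eps_t with eps_t = sigma_t * z_t and
    sigma_t^2 = omega + alpha * eps_{t-1}^2 (Gaussian z_t)."""
    rng = np.random.default_rng(seed)
    y = np.empty(T)
    eps_lag = 0.0
    for t in range(T):
        sigma2 = omega + alpha * eps_lag**2   # conditional variance
        eps = np.sqrt(sigma2) * rng.standard_normal()
        y[t] = mu + eps
        eps_lag = eps
    return y

y = simulate_arch1(1000)
# With alpha < 1 the unconditional variance is omega / (1 - alpha) = 0.4 here.
print(y.var(), 0.2 / (1 - 0.5))
```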

10.2.1 Interpretation

Another way to understand the model is to decompose $\varepsilon_t^2$ into the conditional expectation and the surprise in the squared innovations:
$$\varepsilon_t^2 = E(\varepsilon_t^2 \mid I_{t-1}) + v_t = \sigma_t^2 + v_t,$$
where it follows that $E(v_t \mid I_{t-1}) = 0$, such that $v_t$ is an uncorrelated (but not necessarily homoskedastic) sequence. Inserting $\sigma_t^2 = \varepsilon_t^2 - v_t$ into the ARCH(p) equation in (10.9) yields the equation
$$\varepsilon_t^2 = \varpi + \alpha_1\varepsilon_{t-1}^2 + \alpha_2\varepsilon_{t-2}^2 + \dots + \alpha_p\varepsilon_{t-p}^2 + v_t, \qquad (10.10)$$
which shows that the squared innovation, $\varepsilon_t^2$, follows an AR(p) process. Note that (10.10) is exactly the implication we use in the auxiliary regression (10.18) in the misspecification test.
Observe that $\varpi$ is not the unconditional variance. We can illustrate this by taking unconditional expectations to obtain
$$E(\varepsilon_t^2) = \varpi + \alpha_1 E(\varepsilon_{t-1}^2) + \dots + \alpha_p E(\varepsilon_{t-p}^2).$$
Assuming stationarity, this defines a constant unconditional variance of
$$\sigma^2 = \frac{\varpi}{1 - \alpha_1 - \dots - \alpha_p},$$
provided that the sum of the coefficients is less than one, $\alpha(1) = \sum_{i=1}^{p}\alpha_i < 1$. Note that whereas the ARCH model exhibits conditional heteroskedasticity, the ARCH process is unconditionally homoskedastic.

Remark 10.2 (fourth-order moment): The probability mass in the tails of a distribution is measured by the fourth-order moment. For the ARCH(1) case with $z_t \mid I_{t-1} \overset{d}{=} N(0,1)$, the unconditional fourth moment of $\varepsilon_t$ is given by
$$E(\varepsilon_t^4) = \frac{3\varpi^2(1 - \alpha^2)}{(1-\alpha)^2(1 - 3\alpha^2)},$$
which is finite for $1 - 3\alpha^2 > 0$, or $\alpha < \sqrt{1/3} \approx 0.577$. The kurtosis is
$$K = \frac{E(\varepsilon_t^4)}{E(\varepsilon_t^2)^2} = 3\,\frac{1 - \alpha^2}{1 - 3\alpha^2}.$$
Observe that for $\alpha > 0$ the distribution of $\varepsilon_t$ has $K > 3$ and fatter tails than the Gaussian distribution.
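As a quick numerical illustration (added here, not in the original text), plugging $\alpha = 0.5$ into the kurtosis formula gives
$$K = 3\cdot\frac{1 - 0.5^2}{1 - 3\cdot 0.5^2} = 3\cdot\frac{0.75}{0.25} = 9,$$
so even a moderate ARCH coefficient produces tails substantially fatter than the Gaussian benchmark of $K = 3$.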

10.2.2 Maximum Likelihood Estimation

Before we continue, let us briefly discuss the estimation of ARCH models. To keep the notation simple we consider an ARCH(1) model with
$$y_t = x_t'\beta + \varepsilon_t$$
$$\sigma_t^2 = \varpi + \alpha\varepsilon_{t-1}^2.$$
To perform a likelihood analysis we have to specify a distributional shape for $\varepsilon_t$. First consider conditional normality:
$$\varepsilon_t = \sigma_t z_t, \quad z_t \mid I_{t-1} \overset{d}{=} N(0,1),$$
or, alternatively, that $\varepsilon_t \mid I_{t-1} \overset{d}{=} N(0, \sigma_t^2)$. We can write the likelihood contribution as a function of observed data as
$$\ell(\beta, \varpi, \alpha \mid y_t, x_t, y_{t-1}, x_{t-1}, \dots) = \frac{1}{\sqrt{2\pi\sigma_t^2}}\exp\left(-\frac{\varepsilon_t^2}{2\sigma_t^2}\right) = \frac{1}{\sqrt{2\pi(\varpi + \alpha(y_{t-1} - x_{t-1}'\beta)^2)}}\exp\left(-\frac{(y_t - x_t'\beta)^2}{2(\varpi + \alpha(y_{t-1} - x_{t-1}'\beta)^2)}\right),$$
and maximize the sum of the log-likelihood contributions with respect to the parameters $\beta$, $\varpi$ and $\alpha$. The analytical analysis of the likelihood function is somewhat complicated and we cannot solve the likelihood equations analytically. Instead we use numerical optimization to find the ML estimators.
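The numerical optimization can be sketched as follows (an illustration added here, not part of the original text), for the constant-mean ARCH(1) case, using `scipy.optimize.minimize` on the negative Gaussian log-likelihood; initializing $\sigma_1^2$ at the unconditional variance is one common convention, assumed here:

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, y):
    """Negative Gaussian log-likelihood of a constant-mean ARCH(1) model."""
    mu, omega, alpha = params
    if omega <= 0 or alpha < 0:
        return np.inf                       # enforce the parameter constraints
    eps = y - mu
    sigma2 = np.empty_like(y)
    sigma2[0] = omega / (1 - alpha) if alpha < 1 else omega  # initialization
    sigma2[1:] = omega + alpha * eps[:-1]**2
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + eps**2 / sigma2)

y = np.random.default_rng(0).standard_normal(2000)  # placeholder for observed returns
res = minimize(neg_loglik, x0=[0.0, 0.1, 0.2], args=(y,), method="Nelder-Mead")
print(res.x)   # ML estimates of (mu, omega, alpha)
```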

Remark 10.3 (properties of the gaussian qmle): Consider a general class of conditional volatility models (including the ARCH models and the GARCH models considered below) defined by
$$y_t = \mu_t + \varepsilon_t, \quad \varepsilon_t = \sigma_t z_t, \qquad (10.11)$$
where $\mu_t$ and $\sigma_t^2$ are functions of a vector of parameters, $\theta$. Assume that $z_t$ is an i.i.d. process with $E(z_t) = 0$ and $E(z_t^2) = 1$.
Suppose that we estimate the model parameters, $\theta$, using the log-likelihood function based on the Gaussian distribution,
$$\log L_T(\theta) = \sum_{t=1}^{T}\log\ell_t(\theta) = \sum_{t=1}^{T}\log\left[\frac{1}{\sqrt{2\pi\sigma_t^2(\theta)}}\exp\left(-\frac{(y_t - \mu_t(\theta))^2}{2\sigma_t^2(\theta)}\right)\right]. \qquad (10.12)$$
Note that if $z_t$ is non-Gaussian, this is the quasi-log-likelihood function, and the estimator is given by
$$\hat\theta_{QML} = \arg\max_{\theta}\sum_{t=1}^{T}\log\ell_t(\theta). \qquad (10.13)$$
Under suitable conditions, see e.g. the general regularity conditions for QML estimation, we have consistency,
$$\hat\theta_{QML} \overset{p}{\to} \theta_0, \qquad (10.14)$$
and asymptotic normality,
$$\sqrt{T}(\hat\theta_{QML} - \theta_0) \overset{d}{\to} N(0, \mathcal{I}^{-1}\mathcal{J}\mathcal{I}^{-1}), \qquad (10.15)$$
where the QMLE has the usual sandwich asymptotic covariance matrix with
$$\mathcal{I} = -E\left(\frac{\partial^2\log\ell_t(\theta_0)}{\partial\theta\,\partial\theta'}\right) \quad\text{and}\quad \mathcal{J} = E\left(\frac{\partial\log\ell_t(\theta_0)}{\partial\theta}\,\frac{\partial\log\ell_t(\theta_0)}{\partial\theta'}\right).$$
If $z_t$ is Gaussian, then $\hat\theta_{QML}$ is the MLE. In this case $\mathcal{I} = \mathcal{J}$, such that the asymptotic covariance matrix is the usual inverse information,
$$\sqrt{T}(\hat\theta_{QML} - \theta_0) \overset{d}{\to} N(0, \mathcal{I}^{-1}). \qquad (10.16)$$

In place of the normal distribution, other forms of distributions can be used. Many applications use a more fat-tailed distribution to allow a larger proportion of extreme observations. A particularly convenient solution is to use a Student's $t(v)$ distribution, where the degrees of freedom, $v$, determines how fat the tails should be. Here $v$ can be treated as a parameter in the likelihood function and estimated jointly with the remaining parameters. Recall that as $v \to \infty$, the Student's $t(v)$ distribution converges to the Gaussian distribution.

Remark 10.4 (scaled t-distribution): For the interpretation of $\sigma_t^2$ as the conditional variance in the ARCH and GARCH model, it is important that $z_t$ has unit variance. A Student's $t(v)$ distributed variable, $\xi \overset{d}{=} t(v)$ say, has variance given by $V(\xi) = v/(v-2) > 1$. In practice, it is therefore assumed that $v > 2$, such that the variance of $z_t$ is finite, $V(z_t) < \infty$, and the applied $t(v)$-density is rescaled with the factor $[(v-2)/v]^{1/2}$ to ensure that $V(z_t) = 1$.
The formula for the scaled density, $t^*(v)$, for $\varepsilon_t$ with conditional mean zero and conditional variance $\sigma_t^2$, is given by
$$f(\varepsilon_t \mid \sigma_t^2, v) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\Gamma\left(\frac{v}{2}\right)\sqrt{\pi(v-2)\sigma_t^2}}\left(1 + \frac{\varepsilon_t^2}{\sigma_t^2(v-2)}\right)^{-\frac{v+1}{2}}, \qquad (10.17)$$
where $\Gamma(\cdot)$ is the so-called Gamma-function.



Example 10.4 (arch(p) model, sp500 stock returns): To illustrate the estimation of AR(2)-ARCH(p) models we consider a model given by the equations
$$y_t = \delta + \theta_1 y_{t-1} + \theta_2 y_{t-2} + \varepsilon_t$$
$$\varepsilon_t = \sigma_t z_t$$
$$\sigma_t^2 = \varpi + \alpha_1\varepsilon_{t-1}^2 + \alpha_2\varepsilon_{t-2}^2 + \dots + \alpha_p\varepsilon_{t-p}^2,$$
where we assume that the innovation is Gaussian, $z_t \mid I_{t-1} \overset{d}{=} N(0,1)$. The estimation results for different lag lengths, $p = 1, 2, 3, 5, 7, 9$, are reported in Table 10.1, and some diagnostic graphs for the ARCH(9) model are presented in Figure 10.2.
The results in Table 10.1 suggest that the ARCH effects are clearly significant. Unfortunately, it is quite hard to determine the appropriate number of lags in the conditional variance, and many lags seem to be needed. This is also clear from the sum of the ARCH coefficients,
$$\alpha(1) = \sum_{i=1}^{p}\alpha_i,$$
which increases with $p$. In addition, many coefficients have similar magnitudes and comparable significance. It is an often-observed weakness of the ARCH model that it is hard to precisely pin down the shape of the memory structure, and the estimated coefficients are often relatively unstable between models.
Regarding the conditional mean, the fitted value in Figure 10.2 (A) suggests that only a small fraction of the variation in returns is explained. The estimates in Table 10.1 suggest that the lags are borderline significant, but the test for no residual autocorrelation is only borderline rejected. This may be related to the fact that the test is very broad, measuring up to lag 72. The individual ACFs in Figure 10.2 (D) do not suggest that any particular lag is needed in the conditional mean.
Finally, Figure 10.2 (C) shows the estimated conditional standard deviation, $\hat\sigma_t$, indicating strong time variation. The standardized residuals in Figure 10.2 (B), calculated as
$$\hat{z}_t = \frac{\hat\varepsilon_t}{\hat\sigma_t},$$
seem much more homoskedastic than the returns, suggesting that the estimate of the conditional variance is reasonable.

                     ARCH(1)     ARCH(2)     ARCH(3)     ARCH(5)     ARCH(7)     ARCH(9)
y_1 (X)              0.1812      0.04276     0.03715     0.04224     0.04521     0.04287
                    (0.0590)    (0.0189)    (0.0174)    (0.0145)    (0.0142)    (0.0144)
y_2 (X)              0.04481     0.01871     0.03900     0.02462     0.02164     0.01609
                    (0.0255)    (0.0258)    (0.0200)    (0.0169)    (0.0158)    (0.0161)
Constant (X)         0.05007     0.06671     0.06489     0.06526     0.06952     0.06277
                    (0.0161)    (0.0144)    (0.0136)    (0.0122)    (0.0119)    (0.0117)
$\varpi$ (H)         0.9748      0.6645      0.4920      0.3186      0.2573      0.2146
                    (0.0423)    (0.0329)    (0.0280)    (0.0262)    (0.0273)    (0.0241)
$\alpha_1$ (H)       0.3803      0.2039      0.1479      0.09256     0.07376     0.06675
                    (0.0482)    (0.0270)    (0.0250)    (0.0248)    (0.0268)    (0.0258)
$\alpha_2$ (H)       .           0.3615      0.3411      0.2192      0.1865      0.1652
                                (0.0405)    (0.0414)    (0.0267)    (0.0274)    (0.0285)
$\alpha_3$ (H)       .           .           0.2181      0.1654      0.1254      0.1014
                                            (0.0257)    (0.0230)    (0.0204)    (0.0195)
$\alpha_4$ (H)       .           .           .           0.1869      0.1452      0.1142
                                                        (0.0231)    (0.0206)    (0.0203)
$\alpha_5$ (H)       .           .           .           0.1502      0.1163      0.0915
                                                        (0.0231)    (0.0185)    (0.0178)
$\alpha_6$ (H)       .           .           .           .           0.07984     0.06692
                                                                    (0.0165)    (0.0159)
$\alpha_7$ (H)       .           .           .           .           0.1235      0.09775
                                                                    (0.0248)    (0.0225)
$\alpha_8$ (H)       .           .           .           .           .           0.1064
                                                                                (0.0198)
$\alpha_9$ (H)       .           .           .           .           .           0.06972
                                                                                (0.0175)
$\alpha(1)$          0.380       0.565       0.707       0.814       0.850       0.880
Log-lik.          -8274.271   -7927.966   -7781.708   -7567.413   -7500.538   -7460.967
AIC                  3.111       2.982       2.927       2.847       2.823       2.809
HQ                   3.113       2.984       2.930       2.851       2.828       2.814
SC/BIC               3.118       2.989       2.936       2.858       2.836       2.825
Portmanteau, 1-72   [0.00]      [0.00]      [0.00]      [0.05]      [0.05]      [0.04]
No ARCH(1)          [0.00]      [0.06]      [0.20]      [0.95]      [0.69]      [0.63]
Normality           [0.00]      [0.00]      [0.00]      [0.00]      [0.00]      [0.00]

Table 10.1: Estimation of ARCH(p) models for SP500 stock market returns. Numbers in parentheses are robust standard errors. Numbers in square brackets are p-values for misspecification tests. All estimations are based on T = 5322 daily observations from 1997-01-06 to 2018-02-27. The label (X) indicates that a regressor enters the equation for the conditional mean, while (H) denotes the equation for the conditional variance.

[Figure 10.2 here. Panels: (A) Actual and fitted value; (B) Standardized residual; (C) Conditional standard deviation; (D) Residual autocorrelation, ACF.]

Figure 10.2: Diagnostic graphs for the ARCH(9) model for the SP500 returns.

10.3 A Test for No-ARCH Effects and Misspecification Testing

A test for no ARCH effects, e.g. $H_0: \alpha_1 = 0$ in the simple ARCH(1) model, is conceptually simple, but it is statistically complicated by the fact that $\alpha_1 = 0$ is a hypothesis on the boundary of the parameter space, $\alpha_1 \geq 0$, which is a violation of the regularity conditions for hypothesis testing. As a consequence, standard Wald and LR statistics do not follow $\chi^2$ distributions under $H_0$.
To test the hypothesis of no ARCH effects, it is therefore customary to use the LM test principle. In particular, we can use the standard Breusch-Pagan test for no-heteroskedasticity applied to this particular form of autoregressive conditional heteroskedasticity.
In the setting of a linear regression,
$$y_t = x_t'\beta + \varepsilon_t,$$
the null hypothesis of no ARCH effects implies zero correlation of the squared errors, see (10.10), and the hypothesis of no ARCH effects up to order $p$ can be tested using the auxiliary regression model
$$\hat\varepsilon_t^2 = \gamma_0 + \gamma_1\hat\varepsilon_{t-1}^2 + \gamma_2\hat\varepsilon_{t-2}^2 + \dots + \gamma_p\hat\varepsilon_{t-p}^2 + \text{error}, \qquad (10.18)$$
where $\hat\varepsilon_t$ is the estimated residual from the regression. The null hypothesis of no ARCH is given by
$$H_0: \gamma_1 = \gamma_2 = \dots = \gamma_p = 0, \qquad (10.19)$$
such that the expected value of $\hat\varepsilon_t^2$ (conditional on the past) is a constant, $\gamma_0$, for all $t$. The alternative is that at least one $\gamma_i$ is nonzero, $i = 1, 2, \dots, p$. We can use the familiar LM statistic,
$$\xi_{ARCH} = T\cdot R^2, \qquad (10.20)$$
where $R^2$ is the coefficient of determination from the auxiliary regression in (10.18). The statistic, $\xi_{ARCH}$, is asymptotically distributed as a $\chi^2(p)$ if the null is true.
It is important to note that the ARCH test also has power against residual autocorrelation. This is because autocorrelation in $\varepsilon_t$ will imply autocorrelation in $\varepsilon_t^2$ (while the opposite is not true in general). Before the ARCH test is applied, it is therefore important always to test for no-autocorrelation first. If the residuals are not autocorrelated, but the squared residuals are, that is interpreted as an indication of ARCH effects.

Example 10.5 (test for no-arch, sp500 stock return): Consider an AR(2) model for the conditional mean,
$$y_t = \underset{(0.0168)}{0.0275} - \underset{(0.0234)}{0.0715}\,y_{t-1} - \underset{(0.0294)}{0.0528}\,y_{t-2} + \hat\varepsilon_t,$$
estimated with OLS and with heteroskedasticity-robust standard errors in parentheses. The negative autoregressive coefficients indicate negative autocorrelation of returns and therefore some degree of predictability. The coefficients are quite small, however, and the coefficient of determination is low, with $R^2 = 0.00739$, meaning that the statistical model explains under one percent of the variation in returns. There are no signs of autocorrelation in the residuals, and a test for no-autocorrelation of order 1-2 is accepted with a p-value of 0.687.
To test for the presence of ARCH effects, we consider the auxiliary regression of squared residuals on $p = 5$ lagged squared residuals:
$$\hat\varepsilon_t^2 = \underset{(0.0613)}{0.462} + \underset{(0.0134)}{0.0513}\,\hat\varepsilon_{t-1}^2 + \underset{(0.0133)}{0.268}\,\hat\varepsilon_{t-2}^2 + \underset{(0.0138)}{0.0144}\,\hat\varepsilon_{t-3}^2 + \underset{(0.0133)}{0.138}\,\hat\varepsilon_{t-4}^2 + \underset{(0.0134)}{0.212}\,\hat\varepsilon_{t-5}^2 + \text{error}.$$
We note that many of the lags are statistically significant. The coefficient of determination is $R^2 = 0.219$ and the LM test statistic for no-ARCH is given by
$$\xi_{ARCH} = T\cdot R^2 = 5317 \cdot 0.219 = 1165.5,$$
which is highly significant in the asymptotic $\chi^2(5)$ distribution. This is a formal indication of ARCH effects, as also illustrated in the estimated ARCH models above.

To test for remaining ARCH effects in an estimated ARCH model, i.e. as a misspecification test for the lag length, $p$, and the functional form of the proposed ARCH model, we may use the same approach, here applied to the standardized residuals, i.e.
$$\hat{z}_t^2 = \gamma_0 + \gamma_1\hat{z}_{t-1}^2 + \gamma_2\hat{z}_{t-2}^2 + \dots + \gamma_p\hat{z}_{t-p}^2 + \text{error},$$
where $\hat{z}_t$ denotes the standardized estimated residual from the ARCH model, $\hat{z}_t = \hat\varepsilon_t/\hat\sigma_t$.

Example 10.6 (misspecification test, sp500 stock return): Statistics for no additional ARCH effects of order one are reported for the models in Table 10.1. The results suggest that there are no additional ARCH effects in $\hat{z}_t$ as long as at least $p = 3$ lags of $\varepsilon_t^2$ are included in the model.

10.4 Generalized ARCH (GARCH) Models

To resolve the problem with many lags in ARCH models, Bollerslev (1986) and Taylor (1986) suggested a generalized version of the ARCH model which economizes on the number of parameters. The simplest case is the very popular GARCH(1,1) model defined by the equation
$$\sigma_t^2 = \varpi + \alpha\varepsilon_{t-1}^2 + \beta\sigma_{t-1}^2,$$
where the lagged variance is included along with the squared innovation. A simple interpretation is that the lagged dependent variable allows for a more smooth development in the variance and a longer memory without including many parameters.
To understand the relationship between ARCH and GARCH models we can again use the definition $\sigma_t^2 = \varepsilon_t^2 - v_t$, and obtain that
$$\sigma_t^2 = \varpi + \alpha\varepsilon_{t-1}^2 + \beta\sigma_{t-1}^2$$
$$\varepsilon_t^2 - v_t = \varpi + \alpha\varepsilon_{t-1}^2 + \beta(\varepsilon_{t-1}^2 - v_{t-1})$$
$$\varepsilon_t^2 = \varpi + (\alpha + \beta)\varepsilon_{t-1}^2 + v_t - \beta v_{t-1}.$$
This suggests that the GARCH(1,1) implies an ARMA(1,1) structure for the squared innovation. We recall that an ARMA model can be seen as a restricted and parsimonious representation of an infinite AR model, and we can think of the GARCH model as a restricted infinite ARCH model. By repeated substitution we can write the GARCH model as
$$\begin{aligned}
\sigma_t^2 &= \varpi + \alpha\varepsilon_{t-1}^2 + \beta\sigma_{t-1}^2 \\
&= \varpi + \alpha\varepsilon_{t-1}^2 + \beta(\varpi + \alpha\varepsilon_{t-2}^2 + \beta\sigma_{t-2}^2) \\
&= \varpi(1 + \beta) + \alpha\varepsilon_{t-1}^2 + \alpha\beta\varepsilon_{t-2}^2 + \beta^2\sigma_{t-2}^2 \\
&\;\;\vdots \\
&= \varpi(1 + \beta + \beta^2 + \beta^3 + \dots) + \alpha\varepsilon_{t-1}^2 + \alpha\beta\varepsilon_{t-2}^2 + \alpha\beta^2\varepsilon_{t-3}^2 + \dots \\
&= \frac{\varpi}{1-\beta} + \alpha\sum_{j=1}^{\infty}\beta^{j-1}\varepsilon_{t-j}^2,
\end{aligned}$$
where we assume that the process started in the infinite past.
Also for the GARCH model we need $\sigma_t^2$ to be non-negative, and we constrain the coefficients to be non-negative. By looking at the ARMA representation we can also read off the condition under which $\varepsilon_t^2$ is covariance stationary, namely that $\alpha + \beta < 1$. In this case the unconditional variance is given by
$$\sigma^2 = E(\varepsilon_t^2) = \frac{\varpi}{1 - \alpha - \beta}.$$
The GARCH(1,1) model is extremely popular in applied work, but it can of course be generalized with more lags as the GARCH(p,q) model:
$$\sigma_t^2 = \varpi + \sum_{j=1}^{p}\alpha_j\varepsilon_{t-j}^2 + \sum_{j=1}^{q}\beta_j\sigma_{t-j}^2.$$
The condition for covariance stationarity is $\sum_{j=1}^{p}\alpha_j + \sum_{j=1}^{q}\beta_j < 1$.

10.4.1 Explanatory Variables in the Variance

Sometimes we have ideas for an exogenous variable, $d_t$, that may affect the conditional variance, and it is straightforward to extend the ARCH and GARCH model with explanatory variables. We would then write the model as
$$y_t = x_t'\beta + \varepsilon_t$$
$$\varepsilon_t = \sigma_t z_t$$
$$\sigma_t^2 = \varpi + \kappa d_t + \alpha\varepsilon_{t-1}^2 + \beta\sigma_{t-1}^2,$$
where we require that $\kappa d_t \geq 0$. The interpretation is that the conditional variance changes with $d_t$. As an example, $d_t$ could be a dummy variable taking the value one on the last trading day before a closing day and zero otherwise,
$$d_t = \begin{cases} 1, & \text{the last trading day before a closing day} \\ 0, & \text{otherwise.} \end{cases}$$
In this case, the unconditional variance is $\varpi/(1 - \alpha - \beta)$ on most trading days, but
$$\sigma^2 = \frac{\varpi + \kappa}{1 - \alpha - \beta}$$
on the last trading day before a closing day.

Example 10.7 (garch models, sp500 stock return): To illustrate, we estimate the AR(2)-GARCH(1,1) model, i.e.
$$y_t = \delta + \theta_1 y_{t-1} + \theta_2 y_{t-2} + \varepsilon_t$$
$$\varepsilon_t = \sigma_t z_t$$
$$\sigma_t^2 = \varpi + \alpha\varepsilon_{t-1}^2 + \beta\sigma_{t-1}^2,$$
and the results are reported in Table 10.2.
The results for GARCH model (1) are obtained by assuming normal innovations, $z_t \mid I_{t-1} \overset{d}{=} N(0,1)$. We note that the coefficient to the lagged variance is large and significant, and $\hat\alpha + \hat\beta = 0.989$ is quite close to one. The fact that the estimated GARCH models have roots close to unity is a common observation which is discussed in more detail below. In addition, Figure 10.3 (A) reports the estimated conditional standard deviation.
To illustrate the conditional standard deviation as a measure of market volatility, we compare in Figure 10.3 (B) with the so-called implied volatility index, VIX. The VIX is calculated from observed option prices and is often used as a measure of market volatility.⁷ Although the VIX and $\hat\sigma_t$ measure slightly different things ($\hat\sigma_t$ measures current volatility conditional on the past while the VIX index measures expected volatility in the near future), we observe a very close correspondence between the estimated volatility, $\hat\sigma_t$, and the implied volatility (VIX) index.
To improve the model, column (2) and column (3) in Table 10.2 insert dummy variables for the first trading day after a closing day, and the last trading day before a closing day, respectively. The dummy variables are included in the conditional mean and in the conditional variance, but the estimation results suggest that the effects are small and not significantly different from zero.
In all models, we note that the null of normality of the standardized residuals, $\hat{z}_t = \hat\varepsilon_t/\hat\sigma_t$, is strongly rejected (p-value of 0.00), which is mainly due to a large proportion of extreme observations. One refinement is to estimate the model with the assumption of a more general error distribution, and column (4) reports the results for a scaled Student's t-distribution, $z_t \mid I_{t-1} \overset{d}{=} t^*(v)$.

⁷ The VIX is sometimes called the index of fear. It is calculated as the market volatility in the near future that is in line with observed option prices, if the standard option pricing theory (Black-Scholes) holds.

                      (1)          (2)          (3)          (4)
y_1 (X)             0.04659      0.04661      0.04654      0.05360
                   (0.0144)     (0.0143)     (0.0143)     (0.0137)
y_2 (X)             0.01829      0.01805      0.01829      0.03719
                   (0.0161)     (0.0160)     (0.0159)     (0.0140)
Constant (X)        0.06133      0.05785      0.06128      0.07747
                   (0.0117)     (0.0136)     (0.0132)     (0.0106)
FirstTrade (X)      .            0.01616      .            .
                                (0.0291)
LastTrade (X)       .            .            0.0002076    .
                                             (0.0298)
$\varpi$ (H)        0.01753      0.01717      0.01640      0.008663
                   (0.00448)    (0.00955)    (0.00952)    (0.00248)
$\alpha$ (H)        0.1041       0.1041       0.1041       0.09453
                   (0.0124)     (0.0150)     (0.0149)     (0.0101)
$\beta$ (H)         0.8850       0.8849       0.8849       0.9050
                   (0.0117)     (0.0152)     (0.0151)     (0.00961)
FirstTrade (H)      .            0.001635     .            .
                                (0.0403)
LastTrade (H)       .            .            0.005317     .
                                             (0.0428)
Student-t df        .            .            .            6.288
                                                          (0.553)
$\alpha + \beta$    0.989        0.989        0.989        1.000
Log-lik.         -7441.460    -7441.289    -7441.444    -7322.939
AIC                 2.799        2.799        2.799        2.755
HQ                  2.801        2.803        2.803        2.758
SC/BIC              2.806        2.809        2.809        2.763
Portmanteau, 1-72  [0.05]       [0.05]       [0.05]       [0.04]
No ARCH(1)         [0.32]       [0.33]       [0.32]       [0.80]
Normality          [0.00]       [0.00]       [0.00]       .

Table 10.2: Estimation of variants of GARCH(1,1) models for the SP500 stock market returns. Numbers in parentheses are robust standard errors. Numbers in square brackets are p-values for misspecification tests. All estimations are based on T = 5322 daily observations from 1997-01-06 to 2018-02-27.

The estimated degrees of freedom is $\hat{v} = 6.3$, which gives a much more fat-tailed distribution than the Gaussian. It holds that the kurtosis of the normal distribution is $K = 3$, while it is $K = 3 + 6/(v-4) = 3 + 6/2.3 \approx 5.6$ for the $t(6.3)$ distribution. The latter is also close to the kurtosis of the standardized residuals from the basic GARCH(1,1) model. There are also small differences in the estimated parameters, but the estimated conditional standard deviation is almost identical (not shown).

[Figure 10.3 here. Panels: (A) Conditional standard deviation; (B) Conditional standard deviation and VIX index (scaled); (C) Conditional variance forecast; (D) News impact curves for the basic GARCH, threshold, and asymmetric models.]

Figure 10.3: Results for the analysis of the SP500 stock returns.

10.5 Volatility Forecasts

One important application of ARCH and GARCH models is the prediction of future volatility. We use
$$\sigma_{T+h|T}^2 = E(\sigma_{T+h}^2 \mid I_T)$$
to denote the forecast of volatility at time $T+h$ given the information set at time $T$.
To construct the forecast from an ARCH(1) we first use the fact $\varpi = \sigma^2(1-\alpha)$ to rewrite the equation in deviations from the mean:
$$\sigma_t^2 = \varpi + \alpha\varepsilon_{t-1}^2$$
$$\sigma_t^2 = \sigma^2(1-\alpha) + \alpha\varepsilon_{t-1}^2$$
$$\sigma_t^2 - \sigma^2 = \alpha(\varepsilon_{t-1}^2 - \sigma^2).$$
To forecast volatility for $T+1$ we find the best prediction
$$\sigma_{T+1|T}^2 = E(\sigma_{T+1}^2 \mid I_T) = E\left(\sigma^2 + \alpha(\varepsilon_T^2 - \sigma^2) \mid I_T\right) = \sigma^2 + \alpha(\varepsilon_T^2 - \sigma^2),$$
where we have used that $\varepsilon_T$ is in the information set at time $T$, such that $E(\varepsilon_T^2 \mid I_T) = \varepsilon_T^2$. Forecasts for the next periods are constructed as the recursion
$$\sigma_{T+2|T}^2 = E(\sigma_{T+2}^2 \mid I_T) = E\left(\sigma^2 + \alpha(\varepsilon_{T+1}^2 - \sigma^2) \mid I_T\right) = \sigma^2 + \alpha\left(E(\varepsilon_{T+1}^2 \mid I_T) - \sigma^2\right) = \sigma^2 + \alpha(\sigma_{T+1|T}^2 - \sigma^2),$$
$$\sigma_{T+3|T}^2 = E(\sigma_{T+3}^2 \mid I_T) = E\left(\sigma^2 + \alpha(\varepsilon_{T+2}^2 - \sigma^2) \mid I_T\right) = \sigma^2 + \alpha\left(E(\varepsilon_{T+2}^2 \mid I_T) - \sigma^2\right) = \sigma^2 + \alpha(\sigma_{T+2|T}^2 - \sigma^2),$$
etc. Note that the forecast will produce an exponential convergence towards the unconditional variance, $\sigma^2$.
For the GARCH(1,1) model we use a similar recursion. First write the model in deviations from the mean:
$$\sigma_t^2 = \varpi + \alpha\varepsilon_{t-1}^2 + \beta\sigma_{t-1}^2$$
$$\sigma_t^2 = \sigma^2(1-\alpha-\beta) + \alpha\varepsilon_{t-1}^2 + \beta\sigma_{t-1}^2$$
$$\sigma_t^2 - \sigma^2 = \alpha(\varepsilon_{t-1}^2 - \sigma^2) + \beta(\sigma_{t-1}^2 - \sigma^2).$$
The first-period forecast is given by
$$\sigma_{T+1|T}^2 = E(\sigma_{T+1}^2 \mid I_T) = E\left(\sigma^2 + \alpha(\varepsilon_T^2 - \sigma^2) + \beta(\sigma_T^2 - \sigma^2) \mid I_T\right) = \sigma^2 + \alpha(\varepsilon_T^2 - \sigma^2) + \beta(\sigma_T^2 - \sigma^2),$$
because $\varepsilon_T$ is in the information set at time $T$ and $\sigma_T^2$ can be calculated from the information set. The next-period forecast is
$$\begin{aligned}
\sigma_{T+2|T}^2 &= E(\sigma_{T+2}^2 \mid I_T) \\
&= E\left(\sigma^2 + \alpha(\varepsilon_{T+1}^2 - \sigma^2) + \beta(\sigma_{T+1}^2 - \sigma^2) \mid I_T\right) \\
&= \sigma^2 + \alpha\left(E(\varepsilon_{T+1}^2 \mid I_T) - \sigma^2\right) + \beta\left(E(\sigma_{T+1}^2 \mid I_T) - \sigma^2\right) \\
&= \sigma^2 + \alpha(\sigma_{T+1|T}^2 - \sigma^2) + \beta(\sigma_{T+1|T}^2 - \sigma^2) \\
&= \sigma^2 + (\alpha + \beta)(\sigma_{T+1|T}^2 - \sigma^2),
\end{aligned}$$
where we have used that $E(\varepsilon_{T+1}^2 \mid I_T) = \sigma_{T+1|T}^2$ and $E(\sigma_{T+1}^2 \mid I_T) = \sigma_{T+1|T}^2$. For longer horizons we find similarly that
$$\sigma_{T+h|T}^2 - \sigma^2 = (\alpha + \beta)(\sigma_{T+h-1|T}^2 - \sigma^2),$$
which is an exponential convergence with speed $\alpha + \beta$.
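The recursion is straightforward to implement; the following sketch (added, not from the text) produces $h$-step variance forecasts from a GARCH(1,1), given illustrative end-of-sample values:

```python
import numpy as np

def garch11_forecast(eps_T, sigma2_T, omega, alpha, beta, h=250):
    """h-step variance forecasts: exponential convergence to the unconditional variance."""
    sigma2_bar = omega / (1 - alpha - beta)          # unconditional variance
    fc = np.empty(h)
    fc[0] = sigma2_bar + alpha * (eps_T**2 - sigma2_bar) + beta * (sigma2_T - sigma2_bar)
    for i in range(1, h):
        fc[i] = sigma2_bar + (alpha + beta) * (fc[i - 1] - sigma2_bar)
    return fc

# Illustrative end-of-sample values:
forecasts = garch11_forecast(eps_T=1.5, sigma2_T=2.0, omega=0.02, alpha=0.10, beta=0.88)
```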

Example 10.8 (garch model forecasts, sp500 stock return): Figure 10.3 (C) reports one year of daily forecasts for the conditional variance of the GARCH(1,1) model in Table 10.2 (1). In principle, this is an exponential convergence to the unconditional variance (horizontal line), but in the present case $\hat\alpha + \hat\beta$ is close to unity and the convergence is quite slow.

Remark 10.5 (igarch phenomenon): It is a stylized feature of estimated GARCH models that the sum of the coefficients is close to one,
$$\hat\alpha(1) + \hat\beta(1) = \sum_{i=1}^{p}\hat\alpha_i + \sum_{i=1}^{q}\hat\beta_i \approx 1,$$
known as an integrated GARCH or IGARCH model. In this case the squared residuals follow a unit-root ARMA model, and the unit root has strange implications for the behavior of the model. In particular, it implies that the unconditional variance is not defined (or infinite), and a forecast for the conditional variance will replicate the forecast of a random walk (with drift), i.e. a linear trend!
The IGARCH model may be a valid characterization of the data, but a trending conditional variance is hard to maintain as a behavioral model for investors. As a consequence, the IGARCH phenomenon is often discussed as a sign of misspecification of the GARCH model. One problem could be that there are structural shifts in the unconditional variance, i.e. in the constant term of the GARCH equation. If this is not modelled, it will bias the roots of the ARMA model towards unity, as already seen for the analysis of unit roots in the conditional mean.
The asymptotic analysis of the IGARCH model is given in Nelson (1990). He demonstrates that the IGARCH model is actually strictly stationary, but the variance is infinite (so the process is not weakly stationary!). A result is that although the analysis looks strange, a test for a unit root in the variance, i.e. a likelihood ratio test for the IGARCH model against the GARCH, follows a standard $\chi^2$ distribution.

Remark 10.6 (variance targeting): An additional complication that prevails in the near-integrated GARCH case is that the unconditional variance is poorly estimated. A simple estimator of the unconditional variance is $\hat{s}^2$ from (10.3). The unconditional variance in the GARCH model is $\sigma^2 = \varpi/(1-\alpha-\beta)$, and for $\alpha + \beta \approx 1$ the denominator is close to zero and the estimate of the constant term, $\varpi$, also becomes very small. As a consequence the estimator $\hat\sigma^2 = \hat\varpi/(1-\hat\alpha-\hat\beta)$ can be very poor, and in the limiting case, $\hat\alpha + \hat\beta \to 1$, it does not exist. In practice, the estimate $\hat\sigma^2$ can be far from the estimate $\hat{s}^2$. Since the forecasts for the conditional variance describe a convergence back to the unconditional variance, a poor estimate of $\sigma^2$ may render the forecasts useless. A simple solution is to insert $\varpi = \sigma^2(1-\alpha-\beta)$ and write the GARCH equation as
$$\sigma_t^2 = \sigma^2(1-\alpha-\beta) + \alpha\varepsilon_{t-1}^2 + \beta\sigma_{t-1}^2.$$
Instead of estimating $(\sigma^2, \alpha, \beta)$ (which is equivalent to estimating $(\varpi, \alpha, \beta)$), we can fix $\sigma^2$ to some robustly estimated value (e.g. $\hat{s}^2$) and only estimate the parameters $(\alpha, \beta)$ from the likelihood function of the GARCH model. This is known as variance targeting.
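In code, variance targeting only changes the parameterization of the variance filter; a sketch (added, not from the text) that fixes the unconditional variance at the sample variance:

```python
import numpy as np

def garch11_filter_targeted(eps, alpha, beta, s2_target):
    """GARCH(1,1) filter with the constant fixed implicitly via omega = s2*(1-alpha-beta)."""
    omega = s2_target * (1 - alpha - beta)    # variance targeting
    sigma2 = np.empty_like(eps)
    sigma2[0] = s2_target
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1]**2 + beta * sigma2[t - 1]
    return sigma2

eps = np.random.default_rng(4).standard_normal(500)   # placeholder residuals
sigma2 = garch11_filter_targeted(eps, alpha=0.08, beta=0.90, s2_target=eps.var())
```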

10.6 Extensions to the Basic Model

There are many possible extensions of these basic models, and Bollerslev (2010) lists more than 100 acronyms related to different ARCH and GARCH specifications. Here we only consider a few different specifications.

10.6.1 Asymmetric ARCH and the News Impact Curve

The basic models have the feature that the sign of a shock does not matter, and positive and negative shocks have the same effects on the conditional variance. The effect of a shock $\varepsilon_{t-1}$ on the conditional variance, $\sigma_t^2$, is known as the news impact curve; and for the basic model the news impact curve is obviously symmetric.
In some cases it is reasonable to believe that negative shocks have a different impact from positive shocks, and the models can easily be adapted to this situation. In a famous paper, Glosten, Jagannathan, and Runkle (1993) suggest to model the asymmetric effects in the following simple way:
$$\sigma_t^2 = \varpi + \alpha\varepsilon_{t-1}^2 + \alpha^*\varepsilon_{t-1}^2\, I(\varepsilon_{t-1} < 0) + \beta\sigma_{t-1}^2,$$
where
$$I(x < 0) = \begin{cases} 1, & \text{if } x < 0 \\ 0, & \text{if } x \geq 0 \end{cases}$$
is the indicator function. We follow PcGive and call this a threshold asymmetric GARCH model, or TGARCH, but it is also referred to as the GJR model. We may write the model as two equations,
$$\sigma_t^2 = \begin{cases} \varpi + (\alpha + \alpha^*)\varepsilon_{t-1}^2 + \beta\sigma_{t-1}^2, & \text{if } \varepsilon_{t-1} < 0 \\ \varpi + \alpha\varepsilon_{t-1}^2 + \beta\sigma_{t-1}^2, & \text{if } \varepsilon_{t-1} \geq 0, \end{cases}$$
and we note that the news impact curve is now asymmetric. The impact of a squared residual is $\alpha$ for positive shocks and $\alpha + \alpha^*$ for negative values.
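The news impact curve is simple to trace out numerically; the sketch below (added, not from the text) evaluates $\sigma_t^2$ as a function of $\varepsilon_{t-1}$, holding $\sigma_{t-1}^2$ fixed, with illustrative parameter values; setting `alpha_star=0` recovers the symmetric GARCH curve:

```python
import numpy as np

def news_impact(eps_grid, omega, alpha, beta, alpha_star=0.0, sigma2_lag=1.0):
    """sigma_t^2 as a function of eps_{t-1}: threshold (GJR) news impact curve."""
    return (omega + alpha * eps_grid**2
            + alpha_star * eps_grid**2 * (eps_grid < 0)   # extra effect of bad news
            + beta * sigma2_lag)

eps_grid = np.linspace(-2, 2, 201)
nic_garch = news_impact(eps_grid, omega=0.02, alpha=0.10, beta=0.88)
nic_gjr = news_impact(eps_grid, omega=0.02, alpha=0.00, beta=0.88, alpha_star=0.16)
```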

If the distribution of $z_t$ is symmetric, such that there is a probability $\frac{1}{2}$ of a positive shock, the unconditional variance of the threshold GARCH model is
$$\sigma^2 = \frac{\varpi}{1 - \alpha - \alpha^*/2 - \beta},$$
provided that $\alpha + \alpha^*/2 + \beta < 1$.
A simple economic interpretation of the asymmetry is the so-called leverage effect: a negative return implies a lower value of the firm, and therefore a larger ratio of debt to market value, i.e. a higher leverage. And a higher leverage is often thought to be associated with a higher volatility. Alternatively, the asymmetry may simply reflect that negative shocks typically upset markets more than positive shocks, which would typically be the case if a majority of investors hold long positions, i.e. positive positions in the markets.

Example 10.9 (threshold model, sp500 stock returns): For the SP500 index, the estimates from the threshold model are reported in column (5) of Table 10.3 and compared with the symmetric model in column (4). We see that the effect on the variance of a positive shock is $\hat\alpha = 0$, while the effect of a negative shock is $\hat\alpha + \hat\alpha^* = 0.16$, indicating that only negative stock market shocks affect the conditional variance.⁸
The asymmetric news impact curve is compared to the basic symmetric news impact curve in Figure 10.3 (D). Using likelihood ratio tests and information criteria, the threshold effect is clearly significant.

An alternative form of asymmetry, suggested by Engle and Ng (1993), is
$$\sigma_t^2 = \varpi + \alpha(\varepsilon_{t-1} - \gamma)^2 + \beta\sigma_{t-1}^2,$$
where the news impact curve has the same slope for small and large values, but zero is no longer the neutral shock. We follow PcGive and refer to this as the asymmetric GARCH model, or AGARCH.

Example 10.10 (asymmetric model, sp500 stock returns): To illustrate the asymmetric model, column (6) of Table 10.3 shows the estimation results for the SP500 index. We see that the asymmetric effect is clearly significant, and the model is preferred over the basic model in column (4). The neutral shock is now a return of 0.5 pct. Comparing, using information criteria, the threshold model is preferred to the asymmetric model. Again, the news impact curve is shown in Figure 10.3 (D).

⁸ In this example, the model is estimated under the maintained assumption that $\alpha \geq 0$, and the MLE is obtained at the boundary of the parameter space.

                               (4)          (5)          (6)          (7)
y_1 (X)                      0.05360      0.04685      0.05290      0.05426
                            (0.0137)     (0.0129)     (0.0128)     (0.0137)
y_2 (X)                      0.03719      0.02498      0.03223      0.03800
                            (0.0140)     (0.0142)     (0.0140)     (0.0140)
Constant (X)                 0.07747      0.05310      0.04607      0.1023
                            (0.0106)     (0.0109)     (0.0103)     (0.0146)
$\log(\sigma_t^2)$ (X)       .            .            .            0.03890
                                                                   (0.0165)
$\varpi$ (H)                 0.008663     0.01263      0            0.009157
                            (0.00248)    (0.0031)     (...)        (0.00254)
$\alpha$ (H)                 0.09453      0            0.1034       0.09679
                            (0.0101)     (...)        (0.0108)     (0.0102)
$\beta$ (H)                  0.9050       0.9073       0.8763       0.9026
                            (0.00961)    (0.0114)     (0.0113)     (0.00969)
$\alpha^*$, threshold (H)    .            0.1631       .            .
                                         (0.0214)
$\gamma$, asymmetry (H)      .            .            0.4974       .
                                                      (0.000674)
Student-t df                 6.288        7.217        7.420        6.274
                            (0.553)      (0.745)      (0.766)      (0.552)
$\alpha + \alpha^*/2 + \beta$  1.000      0.989        0.980        0.999
Log-lik.                  -7322.939    -7241.557    -7253.190    -7319.836
AIC                          2.755        2.724        2.729        2.754
HQ                           2.758        2.728        2.732        2.757
SC/BIC                       2.763        2.734        2.739        2.764
Portmanteau, 1-72           [0.04]       [0.05]       [0.03]       [0.05]
No ARCH(1)                  [0.80]       [0.07]       [0.01]       [0.70]

Table 10.3: Estimation of extended GARCH(1,1) models for SP500 stock market returns. Numbers in parentheses are robust standard errors. Numbers in square brackets are p-values for misspecification tests. All estimations are based on T = 5322 daily observations from 1997-01-06 to 2018-02-27.

The two models can easily be combined into a more general specification, and a fully asymmetric GARCH(p,q) model could have the form
$$\sigma_t^2 = \varpi + \sum_{i=1}^{p}\left(\alpha_i(\varepsilon_{t-i} - \gamma_i)^2 + \alpha_i^*(\varepsilon_{t-i} - \gamma_i)^2\, I(\varepsilon_{t-i} < 0)\right) + \sum_{i=1}^{q}\beta_i\sigma_{t-i}^2,$$
which allows for a very elaborate specification of the news impact curve.

10.6.2 Exponential GARCH (EGARCH)

A different approach is taken in Nelson (1991), who specifies a model for the logarithm of the variance. In particular, he suggests the specification
$$\log\sigma_t^2 = \varpi + \theta z_{t-1} + \gamma(|z_{t-1}| - E(|z_{t-1}|)) + \beta\log\sigma_{t-1}^2, \qquad (10.21)$$
where $\gamma(|z_{t-1}| - E(|z_{t-1}|))$ is the magnitude effect, while $\theta z_{t-1}$ is the additional sign effect. Observe that $E(|z_t|)$ depends on the assumed distribution of $z_t$, and for the Gaussian distribution, $z_t \mid I_{t-1} \overset{d}{=} N(0,1)$, it holds that the expectation is given by $E(|z_t|) = \sqrt{2/\pi} \approx 0.798$.
There are several key differences between this so-called EGARCH model and the standard GARCH. First, the model is formulated for the log-variance, which implies that the variance is always positive by construction,
$$\sigma_t^2 = \exp\left(\varpi + \theta z_{t-1} + \gamma(|z_{t-1}| - E(|z_{t-1}|)) + \beta\log\sigma_{t-1}^2\right) > 0, \qquad (10.22)$$
and (unlike the GARCH case) there are no restrictions on the parameters, $\varpi$, $\theta$, $\gamma$, and $\beta$. This is sometimes an advantage if we want to include explanatory variables, such as $\kappa'x_t$, because we do not have to restrict the effect to be positive, compare §10.4.1.
Secondly, the variance in (10.21) is driven by the scaled residuals, $z_{t-1} = \varepsilon_{t-1}/\sigma_{t-1}$, and not the residuals $\varepsilon_{t-1}$. The interpretation is that a large shock, $\varepsilon_t$, increases the variance more if it comes at a time where the variance is small, because $z_t = \varepsilon_t/\sigma_t$ is then larger. We may say that unexpectedly large shocks increase the variance most. This also implies that the news impact curve is not constant, but rather changes over time as a function of $\sigma_{t-1}^2$. On average, the news impact curve is typically steeper in the EGARCH compared to the normal GARCH, because the exponential function is more extreme than the quadratic function.
Thirdly, the basic formulation includes $z_{t-1}$ in levels and not in squares. The innovation, $z_t$, may be both positive and negative, but because of the log-specification the variance is always positive.
Fourthly, the EGARCH model includes $z_{t-1}$ and $|z_{t-1}|$ and is therefore asymmetric by construction, cf. the magnitude and sign effects. In particular, a positive innovation with magnitude $z_{t-1} = a$ affects the log-variance with the coefficient $(\theta + \gamma)a$, while a negative innovation affects the log-variance with $(\theta - \gamma)a$. For stock market data, it is typically found that $\gamma > 0$ while $\theta < 0$, such that negative innovations have a larger effect on the variance.
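A sketch of the EGARCH(1,1) log-variance recursion (added, not from the text), for the Gaussian case where $E|z_t| = \sqrt{2/\pi}$; parameter values are illustrative:

```python
import numpy as np

def egarch11_filter(z, omega, theta, gamma, beta):
    """log sigma_t^2 = omega + theta*z_{t-1} + gamma*(|z_{t-1}| - E|z|) + beta*log sigma_{t-1}^2."""
    e_abs_z = np.sqrt(2 / np.pi)                  # E|z| under N(0,1)
    log_s2 = np.empty_like(z)
    log_s2[0] = omega / (1 - beta)                # start at the stationary mean level
    for t in range(1, len(z)):
        log_s2[t] = (omega + theta * z[t - 1]
                     + gamma * (np.abs(z[t - 1]) - e_abs_z)
                     + beta * log_s2[t - 1])
    return np.exp(log_s2)                         # conditional variances

z = np.random.default_rng(5).standard_normal(500) # placeholder standardized shocks
sigma2 = egarch11_filter(z, omega=-0.1, theta=-0.08, gamma=0.15, beta=0.97)
```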
To estimate the EGARCH model, we may assume that $z_t$ is i.i.d. Gaussian. Alternatively, we may again choose a distribution with more probability mass in the tails, and for the EGARCH model, Nelson (1991) suggests to use the so-called generalized error distribution (GED), with density given by
$$f(\varepsilon_t \mid \sigma_t^2, \nu) = \frac{\nu\,\exp\left(-\frac{1}{2}\left|\frac{\varepsilon_t}{\lambda\sigma_t}\right|^{\nu}\right)}{\lambda\,2^{(1+1/\nu)}\,\Gamma(1/\nu)\,\sigma_t} \quad\text{with}\quad \lambda = \left(\frac{2^{-2/\nu}\,\Gamma(1/\nu)}{\Gamma(3/\nu)}\right)^{1/2}, \qquad (10.23)$$
where $\sigma_t^2$ is the (conditional) variance and the shape parameter $\nu > 0$ determines the probability mass in the tails. If $\nu = 2$ the GED coincides with the Gaussian distribution. The distribution has fat tails compared to the Gaussian distribution for $0 < \nu < 2$, while it has thinner tails if $\nu > 2$. As $\nu \to \infty$, the GED converges towards the uniform distribution.⁹

10.6.3 ARCH in Mean

Most theoretical models in finance suggest a trade-off between risk and return; within a given time period investors require a larger expected return from an investment which is riskier, often referred to as a risk premium. Similar effects may also work over time, such that investors require a higher return in time periods where the volatility is asserted to be higher. There may also be effects in the opposite direction, see Glosten, Jagannathan, and Runkle (1993) for a discussion.
The argument above suggests that there should be a relationship between the conditional variance and the conditional mean. One way to test the hypothesis is to let the conditional mean depend on the variance, $\sigma_t^2$. It is not obvious what the functional form of the relationship should be, but some suggested examples replace the simple equation for the conditional mean (10.6) with
$$y_t = x_t'\beta + \lambda\sigma_t^2 + \varepsilon_t$$
$$y_t = x_t'\beta + \lambda\sigma_t + \varepsilon_t$$
$$y_t = x_t'\beta + \lambda\log(\sigma_t^2) + \varepsilon_t.$$
A positive estimate for $\lambda$ corresponds to a positive risk premium.

Example 10.11 (garch-in-mean, sp500 stock returns): To illustrate, consider the model for the SP500 return augmented with a GARCH-in-mean effect. Results for the most promising specification are reported in Table 10.3, model (7). We see that $\log(\sigma_t^2)$ is borderline significant with a t-ratio of $0.03890/0.0165 = 2.36$, suggesting a positive risk premium.

⁹ The unconditional variance of an i.i.d. GED distributed variable is finite for $\nu > 1$. One argument to prefer the GED over the Student's t-distribution is that the unconditional variance of the EGARCH is infinite if $z_t$ has a Student's t-distribution.

10.7 Multivariate ARCH Models

So far, the considered ARCH and GARCH models have been univariate, describing the conditional mean and variance of a single return series $y_t \in \mathbb{R}$, possibly conditional on some regressor, $x_t$.
To analyze the relationship between several financial time series, GARCH models can be extended to multivariate cases. As an example, we consider the bivariate case
$$Y_t = \begin{pmatrix} y_{1,t} \\ y_{2,t} \end{pmatrix} \in \mathbb{R}^2.$$
Here we could be interested in the conditional mean and variance of $y_{1,t}$ and $y_{2,t}$, as well as their covariance.
One situation where the covariance is of interest is within the area of portfolio choice. Imagine that we want to construct a portfolio consisting of $y_{1,t}$ and $y_{2,t}$, with weights given by $\omega$ and $(1-\omega)$, respectively. The portfolio return is then given by
$$\omega y_{1,t} + (1-\omega)y_{2,t},$$
and the conditional variance of the portfolio, measuring the risk, is given by
$$P_t = V(\omega y_{1,t} + (1-\omega)y_{2,t} \mid I_{t-1}) = \omega^2 V(y_{1,t} \mid I_{t-1}) + (1-\omega)^2 V(y_{2,t} \mid I_{t-1}) + 2\omega(1-\omega)\,\mathrm{cov}(y_{1,t}, y_{2,t} \mid I_{t-1}),$$
involving the conditional variances as well as the covariance.

10.7.1 The Constant Conditional Correlation Model

To formulate a multivariate ARCH model for $Y_t$, the starting point could be
$$\begin{pmatrix} y_{1,t} \\ y_{2,t} \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \end{pmatrix}, \qquad \begin{pmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \end{pmatrix} = \begin{pmatrix} \sigma_{1,t}^2 & \sigma_{12,t} \\ \sigma_{21,t} & \sigma_{2,t}^2 \end{pmatrix}^{1/2}\begin{pmatrix} z_{1,t} \\ z_{2,t} \end{pmatrix},$$
where $Z_t = (z_{1,t}, z_{2,t})'$ is $N(0, I_2)$, i.e. two independent innovations, and
$$\Omega_t = \begin{pmatrix} \sigma_{1,t}^2 & \sigma_{12,t} \\ \sigma_{21,t} & \sigma_{2,t}^2 \end{pmatrix} = \begin{pmatrix} V(y_{1,t} \mid I_{t-1}) & \mathrm{cov}(y_{1,t}, y_{2,t} \mid I_{t-1}) \\ \mathrm{cov}(y_{2,t}, y_{1,t} \mid I_{t-1}) & V(y_{2,t} \mid I_{t-1}) \end{pmatrix}$$
denotes the conditional covariance matrix.
Using matrix notation, we could write the setup as
$$Y_t = \mu + \varepsilon_t$$
$$\varepsilon_t = \Omega_t^{1/2} Z_t, \quad Z_t \mid I_{t-1} \overset{d}{=} N(0, I_2),$$
where $\Omega_t$ describes the time-varying conditional variances and covariance, and $\Omega_t^{1/2}$ denotes the symmetric matrix square root, such that $\Omega_t^{1/2}\Omega_t^{1/2} = \Omega_t$. In the literature, many different models have been suggested for the conditional covariance matrix $\Omega_t$. To be a valid model, the requirement is that $\Omega_t$ is a valid covariance matrix, i.e. that it is symmetric and positive definite, such that
$$\sigma_{12,t} = \sigma_{21,t}, \quad \sigma_{1,t}^2 > 0, \quad \sigma_{2,t}^2 > 0, \quad \sigma_{1,t}^2\sigma_{2,t}^2 - \sigma_{21,t}\sigma_{12,t} > 0.$$
A simple example is the so-called constant conditional correlation (CCC) model proposed in Bollerslev (1990). This model is based on the decomposition of the covariance matrix into standard deviations and correlations:
$$\Omega_t = \begin{pmatrix} \sigma_{1,t} & 0 \\ 0 & \sigma_{2,t} \end{pmatrix}\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\begin{pmatrix} \sigma_{1,t} & 0 \\ 0 & \sigma_{2,t} \end{pmatrix} = \begin{pmatrix} \sigma_{1,t}^2 & \rho\,\sigma_{1,t}\sigma_{2,t} \\ \rho\,\sigma_{1,t}\sigma_{2,t} & \sigma_{2,t}^2 \end{pmatrix}.$$
The CCC model is defined such that the conditional correlation is a constant, $\rho$, while the conditional variances can be any univariate GARCH model, e.g.
$$\sigma_{1,t}^2 = \varpi_1 + \alpha_1\varepsilon_{1,t-1}^2 + \beta_1\sigma_{1,t-1}^2$$
$$\sigma_{2,t}^2 = \varpi_2 + \alpha_2\varepsilon_{2,t-1}^2 + \beta_2\sigma_{2,t-1}^2.$$
The implied conditional covariance is given by
$$\sigma_{12,t} = \rho\,\sigma_{1,t}\sigma_{2,t},$$
while the conditional correlation is constant. If the assumption of a constant correlation is problematic, extensions exist that also make the correlation time-varying.

Example 10.12 (nasdaq and apple returns): As an example, consider the daily log-returns for the stock of the tech firm Apple and for the Nasdaq-100 index, in which Apple has a weight of around 10 percent, i.e.
$$Y_t = \begin{pmatrix} y_{1,t} \\ y_{2,t} \end{pmatrix} = \begin{pmatrix} 100\cdot\Delta\log(\mathrm{Apple}_t) \\ 100\cdot\Delta\log(\mathrm{Nasdaq}_t) \end{pmatrix}.$$
We consider daily data for the period 2006-2019. Figure 10.4 (A)-(B) shows the daily returns, where we observe a higher variation in the Apple stock return.
We estimate the CCC model for $Y_t$ assuming a conditionally Gaussian distribution, $Z_t \mid I_{t-1} \overset{d}{=} N(0, I_2)$, and allow for a VAR(1) specification of the conditional

[Figure 10.4 here. Panels: (A) Daily return, Apple; (B) Daily return, Nasdaq; (C) Conditional variance, Apple; (D) Conditional variance, Nasdaq; (E) Conditional covariance; (F) Minimal-variance weight to Apple (weight and 250-day moving average).]

Figure 10.4: Multivariate GARCH model for daily returns for Apple and Nasdaq, 2006-2019.

mean. The estimation results are reported in Figure 10.5. We observe that the autoregressive component in the conditional mean is not very significant and could have been removed. The constant term in the conditional mean of Apple is larger than in Nasdaq, suggesting a higher average return on the Apple stock.
All coefficients in the conditional variances are clearly significant, and we note that the variance of the Apple stock is much higher than the variance of the Nasdaq index. The constant correlation is estimated to $\hat\rho = 0.70$. Figure 10.4 (C)-(E) show graphs of the time-varying conditional variances and covariance.

Figure 10.5: Output from a Gaussian maximum likelihood estimation of a bivariate CCC model for Apple and Nasdaq daily returns, 2006-2019.

Example 10.13 (optimal portfolio choice): To apply the results from the bivariate ARCH model for daily returns on Nasdaq and Apple, we reconsider the problem of optimal portfolio choice. We consider the portfolio
$$\omega_t\,\mathrm{Apple}_t + (1-\omega_t)\,\mathrm{Nasdaq}_t,$$
where $\omega_t$ is the portfolio weight on $\mathrm{Apple}_t$ in period $t$, determined at time $t-1$, i.e. measurable with regard to the information set $I_{t-1}$.
The variance of the portfolio, given the information $I_{t-1}$, is
$$P_t = \omega_t^2\sigma_{1,t}^2 + (1-\omega_t)^2\sigma_{2,t}^2 + 2\omega_t(1-\omega_t)\sigma_{12,t},$$
with $\sigma_{1,t}^2$, $\sigma_{2,t}^2$, and $\sigma_{12,t}$ denoting the conditional variances and the conditional covariance, respectively.
Now assume that the investor is risk-averse and wants to minimize the variance of the portfolio. The first-order condition for the minimum-variance portfolio weight is given by
$$\frac{\partial P_t}{\partial\omega_t} = 2\omega_t\sigma_{1,t}^2 - 2(1-\omega_t)\sigma_{2,t}^2 + 2\sigma_{12,t} - 4\omega_t\sigma_{12,t} = 0,$$
such that the (time-varying) weight on $\mathrm{Apple}_t$ in the minimum-variance portfolio is given by
$$\omega_t = \frac{\sigma_{2,t}^2 - \sigma_{12,t}}{\sigma_{1,t}^2 + \sigma_{2,t}^2 - 2\sigma_{12,t}}.$$

Figure 10.4 (F) shows the calculated optimal weight based on the estimated variances and covariance, together with a 250 days moving average to smooth the results. Apple has a weight of 10 percent in the Nasdaq index, and we see that in most of the period, the optimal additional weight to Apple (in the minimum-variance portfolio) is actually negative.
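The optimal weight is straightforward to compute once the conditional moments are available. The following minimal Python sketch evaluates the minimum-variance weight; the numerical values are illustrative placeholders, not the estimated moments from the model above.

import numpy as np

def min_variance_weight(sig2_1, sig2_2, cov12):
    """Minimum-variance weight on asset 1:
    omega = (sig2_2 - cov12) / (sig2_1 + sig2_2 - 2*cov12).
    Accepts scalars or arrays of conditional moments."""
    return (sig2_2 - cov12) / (sig2_1 + sig2_2 - 2.0 * cov12)

# illustrative conditional moments for one day
print(min_variance_weight(sig2_1=4.0, sig2_2=1.5, cov12=1.2))

Applied elementwise to full paths of conditional variances and covariances (e.g. the output of the CCC sketch above), the function returns the entire time series of optimal weights.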

10.8 Concluding Remarks


This chapter has introduced the main ideas of ARCH and GARCH models and some extensions. There is an enormous literature in this field and the number of specific ARCH-type models is exploding. Excellent reviews of the early ARCH literature are given in Bollerslev, Chou, and Kroner (1992), Bollerslev, Engle, and Nelson (1994), and Bera and Higgins (1995); and they also contain many references to specific models and extensions.
Chapter 11

Introduction to
Regime-Switching Models

Most models for the conditional mean considered so far have been linear, or close to linear, in the parameters. In many of these cases maximum likelihood estimation has been equivalent to linear regression. We have considered primarily the mathematical structure of the models in order to understand the dynamic properties of the considered data.
In some cases, however, the behavior of the data is genuinely non-linear, and in this chapter we consider examples of some non-linear models that have been applied within the field of economics. We focus on the so-called regime-switching models, where the behavior of the time series of interest is different in different regimes. This could be macroeconomic data, where the impact of x on y is different in expansions and recessions, or financial data where the impact of x on y is different in bear and bull markets, inter alia.

11.1 Introduction
To illustrate the idea, we consider single-equation models for the conditional mean,
but the regime-switching framework may be equally relevant for VAR models or
volatility models. We start out with the autoregressive distributed lag model, ADL,
and focus in some sections on the simple autoregression; but remember that the
framework is easily extended to more complicated settings.

The regime-switching ADL model for two variables, \{y_t, x_t\}_{t=0}^T, is defined as:

y_t = (\delta_0 + \theta_0 y_{t-1} + \phi_0 x_t + \kappa_0 x_{t-1} + \sigma_0 z_t)(1 - s_t) + (\delta_1 + \theta_1 y_{t-1} + \phi_1 x_t + \kappa_1 x_{t-1} + \sigma_1 z_t) s_t,     (11.1)

for t = 1, 2, \ldots, T. Here, z_t is i.i.d.(0,1) and we condition on initial values (y_0, x_0). The dynamic properties are given by an ADL model with parameters \beta_0 = (\delta_0, \theta_0, \phi_0, \kappa_0, \sigma_0^2)' in the regime defined by s_t = 0 and by an ADL model with parameters \beta_1 = (\delta_1, \theta_1, \phi_1, \kappa_1, \sigma_1^2)' in the regime defined by s_t = 1.
Different mechanisms can be designed to determine the regime, s_t, i.e. the state of the economy. Either s_t can be a deterministic function of the information set,

s_t = G(y_{t-1}, y_{t-2}, \ldots, x_t, x_{t-1}, x_{t-2}, \ldots),

for some known function G(\cdot), in which case we refer to the model as having explained regime switching. One example is the threshold model, with s_t \in \{0, 1\}, such that the process switches abruptly between the two regimes, see e.g. the survey in Tong (2011) or Balke and Fomby (1997). Another example is the smooth transition model, with s_t a continuous variable, 0 \le s_t \le 1, such that the behavior is characterized by a weighted average of the two regimes, see e.g. Teräsvirta (1994).
Alternatively, st is represented by a stochastic variable, typically exogenous and
independent of the information set; and below we consider the case where st is an
exogenous two-state Markov chain, with st 2 f0; 1g, leading to the so-called Markov-
switching model.

Remark 11.1 (dimensions): The regime-switching models can be easily extended to include more than two variables or more lags for some of the variables. The functional form could also be different, e.g. an error-correction model or some non-linear specification.

Remark 11.2 (regime-invariant parameters): It may hold that some parameters are constant across regimes. As an example, the variance may be the same for the two regimes, \sigma_0^2 = \sigma_1^2. If all parameters are regime-invariant, the regime-switching model trivially reduces to a simple constant-parameter ADL model, and the state variable s_t is redundant.

Remark 11.3 (number of regimes): The illustrated model has two regimes. The
setup can be naturally extended to allow more than two regimes, e.g. with three macro
regimes: expansion, recession, and normal growth.

11.2 Threshold Model


The threshold model is defined by s_t being a function with discrete support, s_t \in \{0, 1, 2, \ldots, r\}, where r + 1 is the number of regimes. For two regimes, s_t \in \{0, 1\}, a simple example is the threshold autoregression (TAR) with state variable given by

s_t = I(y_{t-1} \le c),     (11.2)

i.e.

y_t = \begin{cases} \delta_0 + \theta_0 y_{t-1} + \sigma_0 z_t & \text{if } y_{t-1} \le c \\ \delta_1 + \theta_1 y_{t-1} + \sigma_1 z_t & \text{if } y_{t-1} > c. \end{cases}     (11.3)

This can easily be extended with more lags or additional explanatory variables. The interpretation is that the process dynamics are given by \beta_0 = (\delta_0, \theta_0, \sigma_0)' for low values of y_{t-1} and \beta_1 = (\delta_1, \theta_1, \sigma_1)' for high values of y_{t-1}.
An alternative model could have

s_t = I(\Delta y_{t-1} \le c),     (11.4)

such that behavior is different in cases of low growth, \Delta y_{t-1} \le c, and high growth, \Delta y_{t-1} > c.
A final prominent class of models would use

s_t = I(|y_{t-1}| \le c).     (11.5)

This model would define an inner regime, -c \le y_{t-1} \le c, and an outer regime, |y_{t-1}| > c.
Note that the setting could be extended, such that the variable driving the regime switches is not y_t but a different variable, e.g. s_t = I(x_{t-1} \le c). It could also be some other lag, y_{t-b} or x_{t-b} with b \ge 1, where b is the delay parameter.

Example 11.1 (trading costs): A leading example of s_t = I(|y_{t-1}| \le c) would be the case of trading costs. Assume that y_t is the deviation from an equilibrium relationship, which is sustained by arbitrage. Arbitrage trading may be costly and deviations from equilibrium smaller than the trading cost, c, are not corrected. But if the deviation exceeds c, arbitrage trading begins and the process is driven towards equilibrium. With an equilibrium mean of zero, this could lead to a model of the form

y_t = \begin{cases} y_{t-1} + \sigma_0 z_t & \text{if } |y_{t-1}| \le c \\ \theta_1 y_{t-1} + \sigma_1 z_t & \text{if } |y_{t-1}| > c, \end{cases}

with |\theta_1| < 1. This implies a random walk behavior, and no equilibrium correction, when |y_{t-1}| \le c, but equilibrium correction with |\theta_1| < 1 whenever the deviation is larger than the trading cost, |y_{t-1}| > c.

[Figure 11.1 about here. Panels: (A) Unemployment rate and natural rate; (B) Unemployment gap; (C) Yearly change in unemployment rate; (D) Likelihood grid over the threshold c.]

Figure 11.1: Unemployment data for the threshold autoregressive model.

11.2.1 Estimation of Threshold Models


Assuming a known distribution for the error term, it is straightforward to write the likelihood function for the threshold model. Because of the discrete nature of s_t, however, the likelihood function is not differentiable in the parameter c, standard algorithms for estimation do not work, and the standard results for asymptotic inference do not apply.
When the regime-switching variable is measurable with respect to the information set and c is known, however, the model can typically be written as a linear regression. As an example, consider the model in (11.3) with regime-invariant variance, \sigma_0^2 = \sigma_1^2 = \sigma^2. This model can be written as

y_t = (\delta_0 + \theta_0 y_{t-1}) I(y_{t-1} \le c) + (\delta_1 + \theta_1 y_{t-1}) I(y_{t-1} > c) + \epsilon_t
    = \delta_0 I(y_{t-1} \le c) + \theta_0 y_{t-1} I(y_{t-1} \le c) + \delta_1 I(y_{t-1} > c) + \theta_1 y_{t-1} I(y_{t-1} > c) + \epsilon_t
    = \begin{pmatrix} I(y_{t-1} \le c) \\ y_{t-1} I(y_{t-1} \le c) \\ I(y_{t-1} > c) \\ y_{t-1} I(y_{t-1} > c) \end{pmatrix}' \begin{pmatrix} \delta_0 \\ \theta_0 \\ \delta_1 \\ \theta_1 \end{pmatrix} + \epsilon_t
    = X(c)_t' \beta(c) + \epsilon_t,

with \epsilon_t = \sigma z_t. This is a standard linear regression, and the parameters in the conditional mean (given c), \beta(c), can be estimated by Gaussian MLE or OLS,

\hat\beta(c) = \left( \sum_{t=1}^T X(c)_t X(c)_t' \right)^{-1} \left( \sum_{t=1}^T X(c)_t y_t \right).     (11.6)

To estimate c, we may consider a grid over relevant values, c \in \{c_1, \ldots, c_N\}. Often this is implemented by estimating the model for all observed values of y_{t-1} in the sample, c \in \{y_0, y_1, y_2, \ldots, y_{T-1}\}, and choosing \hat{c} as the value that maximizes the likelihood. In practice we need some minimal number of observations in each regime, and instead of considering all values of y_{t-1}, we choose \hat{c} such that each regime contains at least 10 or 15 percent of the observations. A code sketch of the grid search is given below.
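Since, under Gaussian errors with a regime-invariant variance, the likelihood is maximized when the residual sum of squares is minimized, the following minimal Python sketch selects \hat{c} by minimizing the SSR of the regression (11.6); function names and the simulated placeholder data are illustrative only.

import numpy as np

def tar_grid_search(y, trim=0.15):
    """Grid-search estimation of the TAR(1) threshold c: for each candidate
    c, estimate the regime-split regression by OLS and keep the c that
    minimizes the SSR (equivalent to maximizing the Gaussian likelihood)."""
    ylag, yy = y[:-1], y[1:]
    cands = np.sort(ylag)
    lo, hi = int(trim * len(cands)), int((1.0 - trim) * len(cands))
    best_ssr, best_c, best_beta = np.inf, None, None
    for c in cands[lo:hi]:                       # each regime keeps >= trim share
        d = (ylag <= c).astype(float)
        X = np.column_stack([d, ylag * d, 1.0 - d, ylag * (1.0 - d)])
        beta, *_ = np.linalg.lstsq(X, yy, rcond=None)
        ssr = np.sum((yy - X @ beta) ** 2)
        if ssr < best_ssr:
            best_ssr, best_c, best_beta = ssr, c, beta
    return best_c, best_beta

rng = np.random.default_rng(1)
y = rng.standard_normal(300)    # placeholder data; use e.g. the unemployment gap
c_hat, beta_hat = tar_grid_search(y)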

Example 11.2 (tar model for the us unemployment): Let y_t denote the US unemployment rate in Figure 11.1 (A). We remove the persistent movements in unemployment by considering the deviations from the natural rate of unemployment, which is here estimated by an HP filter, see Figure 11.1 (A) for the natural rate and Figure 11.1 (B) for the deviation from the natural rate. We consider a threshold AR(1) model with regime variable given by

s_t = I(\Delta_4 y_{t-1} \le c),     (11.7)

where \Delta_4 y_{t-1} is the yearly change in the unemployment rate, see Figure 11.1 (C). To estimate the model, we initially assume that c = 0.4. That gives the regime classification indicated by the shading in the graphs in Figure 11.1. The OLS estimation results for the model are given by

y_t = X(0.4)_t' \hat\beta(0.4) + \hat\epsilon_t = \underset{(0.00680)}{0.0265}\, I(\Delta_4 y_{t-1} \le 0.4) + \underset{(0.0106)}{0.934}\, y_{t-1} I(\Delta_4 y_{t-1} \le 0.4) + \underset{(0.0224)}{0.160}\, I(\Delta_4 y_{t-1} > 0.4) + \underset{(0.0230)}{0.934}\, y_{t-1} I(\Delta_4 y_{t-1} > 0.4) + \hat\epsilon_t,     (11.8)

with standard errors in parentheses. The results indicate a much different mean unemployment rate in the two regimes, low growth and high growth, respectively, whereas the autoregressive coefficients are almost identical.
To find the MLE for c, we consider a grid covering all realizations of \Delta_4 y_{t-1}. The likelihood values are illustrated in Figure 11.1 (D). We observe a local maximum for c around 0.4, and the global maximum is obtained at c = 0.742. This value only leaves approximately 6% of the observations in the upper regime, \Delta_4 y_{t-1} > 0.742, and we therefore prefer the model with c = 0.4. The regression coefficients obtained for c = 0.742 are very similar to the results reported above and not shown.

11.3 Smooth Transition Model


The threshold model assumes that the time series behavior is always characterized by one of the regimes, e.g. s_t \in \{0, 1\} for the case of two regimes. The smooth transition model is based on the same type of description of the regimes, but the time series is at each t = 1, 2, \ldots, T characterized by a weighted average of the two regimes; we say there is a smooth transition between the regimes rather than the abrupt change in the threshold model.
For the simplest, homoskedastic AR(1) case, we write the smooth transition model as

y_t = (\delta_0 + \theta_0 y_{t-1})(1 - G(y_{t-1})) + (\delta_1 + \theta_1 y_{t-1}) G(y_{t-1}) + \epsilon_t,     (11.9)
where s_t = G(\cdot) is a continuous function of a regime-switching variable, in the example above given by y_{t-1}. A classical choice of the transition is the logistic function

G(x) = \frac{1}{1 + \exp(-\gamma (x - c))} \quad \text{with } \gamma > 0,     (11.10)

for some x in the information set, leading to the so-called logistic smooth transition model (LSTAR). The parameter c is the center of the logistic function while \gamma determines the smoothness of the transitions. For \gamma \to \infty, the LSTAR transition function s_t = G(y_{t-1}) converges to the TAR model with s_t = I(y_{t-1} \ge c).
Note that as x \to -\infty it holds that G(x) \to 0 and the model is characterized by the process with parameters \beta_0 = (\delta_0, \theta_0)'. As x \to \infty, on the other hand, the other extreme regime is relevant as G(x) \to 1 and the model is characterized by the parameters \beta_1 = (\delta_1, \theta_1)'. For other values of x the process is a weighted average.
Examples of the LSTAR transition function are given in Figure 11.2 (A)-(B).
An alternative is the so-called exponential smooth transition model (ESTAR) defined by s_t = G(y_{t-1}), with

G(x) = 1 - \exp(-\gamma (x - c)^2) \quad \text{with } \gamma > 0,     (11.11)

which allows for an inner and an outer regime, parallel to the TAR model given by s_t = I(|y_{t-1}| \le c). Examples of the ESTAR transition function are given in Figure 11.2 (C).
Again the model can be extended to switch between any two speci…cations for the
conditional mean and variance.

[Figure 11.2 about here. Panels: (A)-(B) LSTAR transition functions G(x) for different values of (\gamma, c); (C) ESTAR transition functions G(x); (D) Regime classification for the TAR and LSTAR models.]

Figure 11.2: (A)-(C): Examples of smooth transition functions. (D): Regime classifications for the TAR and STAR models.

11.3.1 Estimation of Smooth Transition Models


Because the smooth transition functions are differentiable in the parameters (\gamma, c)', the usual tools for numerical maximization of the likelihood function apply to the smooth transition models, and for stationary and weakly dependent data, inference on the parameters is also standard, such that t-statistics, Wald statistics and likelihood ratio statistics have standard limiting distributions, given some regularity conditions.
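As an illustration, the following minimal Python sketch evaluates and numerically maximizes the Gaussian log-likelihood of an LSTAR(1) model. The parametrization (log-transforms to keep \sigma and \gamma positive), the starting values, and the simulated placeholder data are choices made for this sketch only.

import numpy as np
from scipy.optimize import minimize

def lstar_negloglik(params, y):
    """Gaussian negative log-likelihood of an LSTAR(1) model.
    params = (d0, t0, d1, t1, log_sigma, log_gamma, c)."""
    d0, t0, d1, t1, ls, lg, c = params
    sig, gam = np.exp(ls), np.exp(lg)            # enforce sigma > 0, gamma > 0
    ylag, yy = y[:-1], y[1:]
    G = 1.0 / (1.0 + np.exp(-gam * (ylag - c)))  # logistic transition (11.10)
    mean = (d0 + t0 * ylag) * (1.0 - G) + (d1 + t1 * ylag) * G
    e = yy - mean
    return 0.5 * np.sum(np.log(2.0 * np.pi * sig**2) + e**2 / sig**2)

rng = np.random.default_rng(2)
y = rng.standard_normal(500)                     # placeholder data
x0 = np.array([0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0])   # crude starting values
res = minimize(lstar_negloglik, x0, args=(y,), method="BFGS")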

Example 11.3 (star model for the us unemployment): Reconsider the US unemployment rate. The LSTAR model delivers estimation results similar to the TAR model. The regime classification, i.e. the weights to the two regimes, \hat{s}_t, are reported in Figure 11.2 (D). Whereas the classification in the TAR case is a discrete decision, \hat{s}_t \in \{0, 1\}, the classification for the LSTAR case is represented by probabilities, i.e. the weights of the regimes.
The two estimations agree on the timing of regime shifts, but for the LSTAR model some periods are less certain, meaning that the actual coefficients for these periods are not given by one particular extreme regime, but rather by a weighted average.
Both approaches, the TAR and the STAR, have advantages and drawbacks, and at the end of the day, it is an empirical question which of the models fits the data best.

11.4 Markov Switching Model


A final class of regime-switching models treats the regime variable s_t \in \{0, 1\} as an unobserved exogenous random variable that jumps between the two (or more) regimes according to a set of fixed probabilities, see e.g. Hamilton (1989) or Hamilton (1990) for early applications of the Markov switching (MS) model in economics.
The process s_t is called a Markov chain if (s_t \mid s_{t-1}, \ldots, s_1) \overset{D}{=} (s_t \mid s_{t-1}), such that the distribution of s_t given the past is a function only of s_{t-1}. The dynamics of the unobserved process s_t is governed by the transition matrix,

P = \begin{pmatrix} P(s_t = 0 \mid s_{t-1} = 0) & P(s_t = 0 \mid s_{t-1} = 1) \\ P(s_t = 1 \mid s_{t-1} = 0) & P(s_t = 1 \mid s_{t-1} = 1) \end{pmatrix} = \begin{pmatrix} p_{0|0} & p_{0|1} \\ p_{1|0} & p_{1|1} \end{pmatrix} = \begin{pmatrix} p_{0|0} & 1 - p_{1|1} \\ 1 - p_{0|0} & p_{1|1} \end{pmatrix},

such that p_{i|j} = P(s_t = i \mid s_{t-1} = j) is the probability of being in state i at time t given that the process was in state j at time t-1. The last equality holds as the probabilities in each column of P sum to one, and the only parameters in the transition are p_{0|0} and p_{1|1}, with 0 < p_{0|0} < 1 and 0 < p_{1|1} < 1.

Remark 11.4 (exogenous regime switches): An important part of interpreting the model is to notice that the Markov process s_t is exogenous. Whether the market shifts from a bull regime to a bear regime, for example, is determined exogenously. Within the class of MS models, the transition probabilities could be allowed to be time-varying, possibly as a function of exogenous or predetermined variables. Nevertheless, the actual switching is exogenously determined.

11.4.1 Estimation of the Markov Switching Model


Because the MS model contains unobserved variables, the likelihood function is more complicated than for the models discussed above. To illustrate, consider the simple case

y_t \mid I_{t-1} = \begin{cases} \mu_0 + \sigma_0 z_t & \text{if } s_t = 0 \\ \mu_1 + \sigma_1 z_t & \text{if } s_t = 1, \end{cases}     (11.12)

with z_t being i.i.d. N(0,1) and I_{t-1} = \{y_{t-1}, y_{t-2}, \ldots, y_1\} the information set. Observe that this model has a switching mean and variance, and the parameters are given by \theta = \{\mu_0, \mu_1, \sigma_0^2, \sigma_1^2, p_{0|0}, p_{1|1}\}.
For given values of the parameters, \theta, the conditional distribution of y_t \mid I_{t-1} is given by N(\mu_0, \sigma_0^2) if s_t = 0 and N(\mu_1, \sigma_1^2) if s_t = 1. We write the two conditional densities as

f(y_t \mid s_t = 0, I_{t-1}; \theta) = (2\pi\sigma_0^2)^{-1/2} \exp\left( -\frac{(y_t - \mu_0)^2}{2\sigma_0^2} \right)     (11.13)
f(y_t \mid s_t = 1, I_{t-1}; \theta) = (2\pi\sigma_1^2)^{-1/2} \exp\left( -\frac{(y_t - \mu_1)^2}{2\sigma_1^2} \right),     (11.14)

and note that the density of y_t \mid I_{t-1} is just the weighted mixture of the two possible states,

f(y_t \mid I_{t-1}; \theta) = P(s_t = 0 \mid I_{t-1}; \theta) f(y_t \mid s_t = 0, I_{t-1}; \theta) + P(s_t = 1 \mid I_{t-1}; \theta) f(y_t \mid s_t = 1, I_{t-1}; \theta),

where the weights are the predictive probabilities, P(s_t = i \mid I_{t-1}; \theta), i \in \{0, 1\}, i.e. the best guess of the probability of regime i at time t given observations up to time t-1 and given \theta. The predictive probabilities are not to be confused with the simple transition probabilities, p_{i|j} = P(s_t = i \mid s_{t-1} = j).
Inserting, the log-likelihood function for the observations y_1, \ldots, y_T is given by

\log L_T(\theta) = \sum_{t=1}^T \log f(y_t \mid I_{t-1}; \theta) = \sum_{t=1}^T \log \left( \sum_{i=0}^1 P(s_t = i \mid I_{t-1}; \theta) f(y_t \mid s_t = i, I_{t-1}; \theta) \right).     (11.15)

Note that the conditional densities in (11.13)-(11.14) are trivial to evaluate given \theta, and the challenge is how to evaluate the predictive probabilities, P(s_t = i \mid I_{t-1}; \theta). One solution is a filtering algorithm that recursively evaluates the predictive probabilities, and we outline that in Appendix 11.A.
To estimate the parameters, we simply maximize the likelihood function in (11.15),

\hat\theta = \arg\max_\theta \log L_T(\theta),

subject to the restrictions

0 < p_{0|0} < 1, \quad 0 < p_{1|1} < 1, \quad \sigma_0^2 > 0, \quad \text{and } \sigma_1^2 > 0.     (11.16)

11.4.2 Predicted, Filtered, and Smoothed Probabilities

A by-product of the recursive algorithm that evaluates the likelihood function in (11.15) is that it also produces the predicted probabilities,

\pi_{t|t-1} = P(s_t = i \mid y_{t-1}, y_{t-2}, \ldots, y_1; \theta),     (11.17)

i.e. the best guess of the probability of being in the different regimes in the period to come. In many applications these probabilities are interesting in their own right. As an example, it could be of interest to predict, given the information set available today, whether the market tomorrow will most likely be a bear or a bull market.
Likewise, the filtered probabilities,

\pi_{t|t} = P(s_t = i \mid y_t, y_{t-1}, y_{t-2}, \ldots, y_1; \theta),     (11.18)

give the best guess of the stance of the current market given that the information in the most recent observation, y_t, has also been taken into account. This could be viewed as a real-time regime classification.
Finally, it is sometimes interesting to do retrospective analysis, i.e. to characterize the regime probabilities for a historical period given what we know today. In the macro-econometric literature, for example, researchers have tried to characterize the business cycle stance of an economy and have tried to label periods of recessions and expansions. To do so, define the so-called smoothed probabilities as

\pi_{t|T} = P(s_t = i \mid y_T, y_{T-1}, \ldots, y_t, \ldots, y_2, y_1; \theta),     (11.19)

i.e. the best in-sample prediction given the full set of observations, y_1, y_2, \ldots, y_T. Smoothed probabilities are calculated using a backward recursive algorithm, see e.g. Hamilton (1994) or Doornik (2013).

Example 11.4 (ms model for the us unemployment): Reconsider the US unemployment gap from the examples above, defined as the difference between the unemployment rate and the natural rate in Figure 11.3 (A). A two-regime Markov switching AR(1) model (with also a switching variance) produces the results

y_t = \begin{cases} \underset{(0.0502)}{0.203} + \underset{(0.0436)}{0.954}\, y_{t-1} + \underset{(0.0246)}{0.241}\, z_t & \text{if } s_t = 0 \\ \underset{(0.00681)}{0.0363} + \underset{(0.0103)}{0.935}\, y_{t-1} + \underset{(0.00496)}{0.147}\, z_t & \text{if } s_t = 1, \end{cases}

with standard errors in parentheses. Again there are markedly different means in the two regimes, while the difference in persistence is only minor. We also observe

[Figure 11.3 about here. Panels: (A) Unemployment rate and natural rate; (B) Unemployment gap and regimes; (C) Smoothed probabilities of regime 0; (D) Predicted probabilities from the LSTAR and MS models.]

Figure 11.3: Results for the Markov switching model.

that the variance in the high unemployment regime is much higher than in the low unemployment regime, with estimated standard deviations given by \hat\sigma_0 = 0.241 and \hat\sigma_1 = 0.147. The regime switching probabilities are given by

\hat{p}_{0|0} = \underset{(0.04316)}{0.874} \quad \text{and} \quad \hat{p}_{1|1} = \underset{(0.007067)}{0.981},

indicating that the process stays much longer in regime s_t = 1 than it does in regime s_t = 0. The probability of staying in regime s_t = 0 indicates that this regime is quite transitory: the probability of staying in it for 5 consecutive periods is 0.874^5 = 0.510, i.e. around 50 percent. The probability of staying in regime s_t = 1 for 5 periods is 0.981^5 = 0.909, i.e. above 90 percent.
Figure 11.3 (B) shows the unemployment gap and the regime classification of the MS model, with classification based on the smoothed probabilities, while Figure 11.3 (C) shows the smoothed probabilities of the regime s_t = 0 and the corresponding classification.
The MS model identifies the regimes where unemployment is rapidly increasing, similar to the results for the TAR and STAR, but in the case of the MS model, the economic reasoning for regime switches is not imposed on the model a priori, but chosen as the Markov chain, s_t, that maximizes the likelihood function for the observed unemployment gap. Figure 11.3 (D) shows the probabilities for the low growth regime, s_t = 0, of the LSTAR model and the comparable predicted probabilities from the MS model. Although the assumptions regarding the regime switching are markedly different, explained switching versus switching based on an exogenous Markov chain, the results are remarkably similar in this particular case.

11.5 More on Linearity Testing


One of the most important hypotheses to test is how many regimes are actually needed to characterize the data, e.g. whether the two-regime model is a significant improvement over a linear model. The test of this hypothesis is made complicated by the fact that under the null hypothesis of linearity, the parameters in regime 1, e.g. \{\mu_1, \sigma_1^2\} in the MS example, are not identified in the model. This violates the regularity conditions for likelihood ratio testing, and the LR statistic will not have a standard \chi^2 limiting distribution. Several solutions to the testing problem have been suggested, see e.g. Davies (1977), Andrews and Ploberger (1994) and Hansen (1996). Other authors have suggested that information criteria typically work satisfactorily for model selection, see e.g. Doornik (2013) and references therein.

Appendix:

11.A Filter Algorithm for MS Models


To find the predicted probabilities

P(s_t = i \mid I_{t-1}; \theta), \quad i \in \{0, 1\},

needed for the likelihood function, assume for a moment that we have already been able to calculate the filtered probabilities

P(s_{t-1} = i \mid I_{t-1}; \theta), \quad i \in \{0, 1\},     (11.20)

i.e. the regime probabilities at time t-1 given the information set including y_{t-1}, and stack the probabilities in a vector,

\pi_{t-1|t-1} = \begin{pmatrix} P(s_{t-1} = 0 \mid I_{t-1}; \theta) \\ P(s_{t-1} = 1 \mid I_{t-1}; \theta) \end{pmatrix}.     (11.21)

It is then easy to predict the probabilities for the next observation t; in particular,

P(s_t = 0 \mid I_{t-1}; \theta) = p_{0|0} P(s_{t-1} = 0 \mid I_{t-1}; \theta) + p_{0|1} P(s_{t-1} = 1 \mid I_{t-1}; \theta)
P(s_t = 1 \mid I_{t-1}; \theta) = p_{1|0} P(s_{t-1} = 0 \mid I_{t-1}; \theta) + p_{1|1} P(s_{t-1} = 1 \mid I_{t-1}; \theta),

or, using vectorized notation,

\pi_{t|t-1} = \begin{pmatrix} P(s_t = 0 \mid I_{t-1}; \theta) \\ P(s_t = 1 \mid I_{t-1}; \theta) \end{pmatrix} = P \pi_{t-1|t-1}.     (11.22)

This is known as the prediction step of the algorithm. Next note that if we can update the prediction \pi_{t|t-1} to produce \pi_{t|t}, then we are back at (11.21) but one time period ahead, and we have a recursive algorithm.
To achieve this, first note that by definition of a conditional probability,

P(s_t = 0 \mid I_{t-1}, y_t; \theta) = \frac{f(s_t = 0, y_t \mid I_{t-1}; \theta)}{f(y_t \mid I_{t-1}; \theta)}.     (11.23)

Here the numerator,

f(s_t = 0, y_t \mid I_{t-1}; \theta) = f(y_t \mid s_t = 0, I_{t-1}; \theta) P(s_t = 0 \mid I_{t-1}; \theta),


is the joint density of y_t and s_t = 0, which factorizes into the conditional density in (11.13) multiplied by the predictive probability in \pi_{t|t-1}. Likewise, the denominator is just the likelihood contribution from (11.15). Inserting produces

P(s_t = 0 \mid I_{t-1}, y_t; \theta) = \frac{f(s_t = 0, y_t \mid I_{t-1}; \theta)}{f(y_t \mid I_{t-1}; \theta)} = \frac{f(y_t \mid s_t = 0, I_{t-1}; \theta) P(s_t = 0 \mid I_{t-1}; \theta)}{\sum_{i=0}^1 f(y_t \mid s_t = i, I_{t-1}; \theta) P(s_t = i \mid I_{t-1}; \theta)},     (11.24)

where all terms are straightforward to evaluate given \theta. The calculation in (11.24) is known as the filtering step or the update step of the algorithm, and given an initial value for \pi_{0|0}, the recursion in (11.22) and (11.24) produces the weights required to evaluate the likelihood in (11.15) given \theta. Further details can be found in Hamilton (1994) or Doornik (2013).
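In code, the full recursion is only a few lines. The following minimal Python sketch implements the prediction step (11.22) and the update step (11.24) and evaluates the log-likelihood (11.15) for the two-state model (11.12); the function name, parameter values and data are illustrative placeholders.

import numpy as np
from scipy.stats import norm

def ms_filter(y, mu, sigma, P, pi0):
    """Filter for the two-state MS model (11.12).
    mu, sigma: length-2 arrays; P[i, j] = P(s_t = i | s_{t-1} = j),
    columns summing to one; pi0: initial filtered probabilities.
    Returns the log-likelihood, predicted and filtered probabilities."""
    T = len(y)
    pred = np.empty((T, 2))       # pi_{t|t-1}
    filt = np.empty((T, 2))       # pi_{t|t}
    loglik, p_filt = 0.0, pi0
    for t in range(T):
        p_pred = P @ p_filt                         # prediction step (11.22)
        dens = norm.pdf(y[t], loc=mu, scale=sigma)  # f(y_t | s_t = i, ...)
        joint = dens * p_pred
        f_y = joint.sum()                           # mixture density of y_t
        loglik += np.log(f_y)
        p_filt = joint / f_y                        # update step (11.24)
        pred[t], filt[t] = p_pred, p_filt
    return loglik, pred, filt

# illustrative parameter values (placeholders, not the estimates in the text)
P = np.array([[0.90, 0.05],
              [0.10, 0.95]])
rng = np.random.default_rng(3)
y = rng.standard_normal(200)
ll, pred, filt = ms_filter(y, mu=np.array([0.0, 1.0]),
                           sigma=np.array([1.0, 0.5]),
                           P=P, pi0=np.array([0.5, 0.5]))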
Chapter 12

State-Space Models
and the Kalman Filter

Most models considered so far in this book have been formulated in terms of observed variables, e.g. a regression of y_t on a set of observed regressors, x_t, possibly including lags of y_t. An alternative approach, considered in this chapter, is to formulate the model for y_t as a function of a number of unobserved or latent factors, x_t, and then postulate a dynamic mechanism for x_t. In this chapter we consider the so-called linear Gaussian state-space model, where y_t is a linear function of x_t, while the unobserved factor, x_t, follows a vector autoregression. Maximum likelihood estimation of the state-space model is performed by applying a recursive algorithm, including prediction and filtering as in the case of the Markov switching model in Chapter 11, known as the Kalman filter, originally suggested in Kalman (1960).

12.1 The Linear State-Space Model


The linear Gaussian state-space model is defined as follows:

Definition 12.1 (linear gaussian state-space model): Let y_t \in \mathbb{R}^p be an observed process and let x_t \in \mathbb{R}^q be a latent process, t = 1, 2, \ldots, T. The linear Gaussian state-space model is defined by the two equations:

y_t = \mu + A x_t + w_t     (12.1)
x_t = \Phi x_{t-1} + \eta_t,     (12.2)

where

\begin{pmatrix} w_t \\ \eta_t \end{pmatrix} \text{ is i.i.d. } N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} G & 0 \\ 0 & H \end{pmatrix} \right).

The initial values for the latent process are assumed to be given as x_0 \overset{D}{=} N(a, P), and the parameters of the model are given by

\theta = \{\mu, A, \Phi, G, H\}.

The equation in (12.1) is referred to as the measurement or observation equation and postulates a linear relationship between y_t and x_t, while w_t is a measurement error with variance-covariance matrix G. The VAR(1) equation in (12.2) is referred to as the process or state equation with innovation variance H. Standard textbook treatments of linear state-space models are given in, inter alia, Anderson and Moore (1979), Shumway and Stoffer (2000), Shumway and Stoffer (2011) or Durbin and Koopman (2001).
Note that x_t is a Markov chain, see also page 262, and the model is an example of a so-called hidden Markov model (HMM). The model postulates that the latent state, x_t, depends on the past only through x_{t-1}, while the observation, y_t, depends on the past only through x_t; it follows that y_t, conditional on x_t, is independent of y_{t-1}. This is often referred to as the conditional independence structure of the model, and can be illustrated graphically as:

        (12.2)       (12.2)       (12.2)        (12.2)       (12.2)
   ---> x_{t-2} ---> x_{t-1} --->  x_t  ---> x_{t+1} --->
           |            |            |            |
         (12.1)       (12.1)       (12.1)       (12.1)
           v            v            v            v
        y_{t-2}      y_{t-1}       y_t        y_{t+1}

Remark 12.1 (generalizations): The conditional independence structure of the model may be relaxed to build more elaborate state-space models. The model can also be generalized to include more lags in the equation for x_t, practically by using the VAR companion form in (12.2). Furthermore, the model can also be extended to include exogenous variables. The analysis of these models is typically more complicated than the basic setup, however, and will not be considered further.

Remark 12.2 (arma equivalence): The linear state-space model in (12.1) and (12.2) may look unfamiliar by explicitly introducing latent variables and by having two sets of innovations \{w_t, \eta_t\}_{t=1}^T for a single observed process \{y_t\}_{t=1}^T. This allows a very general interpretation of the model, and the state-space class is a flexible framework for building models in terms of latent variables. Importantly, however, the linear Gaussian state-space model is observationally equivalent to a vector ARMA model, and if we prefer, we may rewrite the structure as a reduced form ARMA. Reversely, we may use the state-space formulation (and the Kalman filter outlined below) to evaluate the highly non-linear log-likelihood function of the vector ARMA model.
To illustrate, consider the univariate case, p = q = 1, and set \mu = 0 for simplicity. Solving the measurement equation (12.1) for x_t, we find (for A \ne 0)

x_t = y_t / A - w_t / A.

Inserting this in the process equation (12.2) yields

x_t = \Phi x_{t-1} + \eta_t
(y_t / A - w_t / A) = \Phi (y_{t-1} / A - w_{t-1} / A) + \eta_t
y_t - \Phi y_{t-1} = w_t - \Phi w_{t-1} + A \eta_t,

which is an AR(1) model with a combined (moving-average) error term

u_t = w_t - \Phi w_{t-1} + A \eta_t.     (12.3)

As u_t is Gaussian by assumption, it is fully characterized by the first two moments, which are given by

E(u_t) = 0
V(u_t) = (1 + \Phi^2) G + A^2 H
\mathrm{cov}(u_t, u_{t-1}) = -\Phi G.

A standard ARMA(1,1) model for y_t would be given by

y_t - \Phi y_{t-1} = v_t + a v_{t-1} \quad \text{with } v_t \overset{D}{=} N(0, \sigma_v^2),     (12.4)

and parameters given by \psi = \{\Phi, a, \sigma_v^2\}. The corresponding moments of the right hand side, u_t = v_t + a v_{t-1}, are given by

E(u_t) = 0
V(u_t) = (1 + a^2) \sigma_v^2
\mathrm{cov}(u_t, u_{t-1}) = a \sigma_v^2.

These two structures are observationally equivalent if we choose a and \sigma_v^2 to solve:

(1 + \Phi^2) G + A^2 H = (1 + a^2) \sigma_v^2 \quad \text{and} \quad -\Phi G = a \sigma_v^2.

This can be generalized to more complicated settings, see e.g. Casals, Garcia-Hiernaux, and Jerez (2012) for a multivariate algorithm to go from a state-space formulation to a vector ARMA model.
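Given values of (\Phi, A, G, H), the two equations are easy to solve: substituting \sigma_v^2 = -\Phi G / a into the variance equation yields a quadratic in a, and the invertible root, |a| < 1, is selected. A minimal Python sketch with illustrative parameter values:

import numpy as np

def arma_from_state_space(phi, A, G, H):
    """Map univariate state-space parameters into the observationally
    equivalent ARMA(1,1) parameters (a, sigma_v^2) by matching V(u_t)
    and cov(u_t, u_{t-1}); a sketch under the assumptions of Remark 12.2."""
    m0 = (1.0 + phi**2) * G + A**2 * H      # V(u_t)
    m1 = -phi * G                           # cov(u_t, u_{t-1})
    if m1 == 0.0:
        return 0.0, m0                      # white-noise error: a = 0
    roots = np.roots([m1, -m0, m1])         # m1*a^2 - m0*a + m1 = 0
    a = roots[np.abs(roots) < 1][0].real    # pick the invertible root
    return a, m1 / a

a, sig2_v = arma_from_state_space(phi=0.8, A=1.0, G=0.5, H=1.0)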

Remark 12.3 (identification issue): By comparing the parameters of the state-space model, \theta = \{\mu, A, \Phi, G, H\}, with the parameters of the equivalent ARMA representation, \psi = \{\mu, \Phi, a, \sigma_v^2\}, we immediately see that the state-space representation has one more parameter than the ARMA model, and in practice we can only estimate as many parameters as there are parameters in the ARMA model, i.e. four in this case and not five.
The fact that the model is written with 'too many' parameters is referred to as an identification problem. The intuitive issue is that the scale of the latent process x_t is determined by the variance H. But as x_t is unobserved, we only have information on x_t through the observation y_t, and x_t is scaled with A in the measurement equation, meaning that there are two parameters that implicitly determine the scaling of y_t. To estimate the model, we therefore have to impose a restriction on the parameters. As an example, we could choose to restrict A, e.g. A = 1, or to restrict H, e.g. H = 1.

12.2 The Kalman Filter


Under the assumption of normality of (w_t', \eta_t')' and x_0 in Definition 12.1, we see that the joint process (x_1, y_1, x_2, y_2, \ldots, x_T, y_T) is multivariate Gaussian, and any conditional distribution is also Gaussian. This holds for the conditional distribution of y_t \mid y_{t-1}, \ldots, y_1, which is needed for the likelihood contribution, and the distribution of the latent process given past observations, x_t \mid y_{t-1}, \ldots, y_1. The Gaussian distributions are fully characterized by their expectation and variance, and the Kalman filter, suggested in Kalman (1960), is a recursive algorithm that for given values of the parameters, \theta = \{\mu, A, \Phi, G, H\}, produces estimators of the expectation and variance of the latent state variable and evaluates the likelihood.
Similar to the Markov switching model in Chapter 11, the filtering algorithm produces the predicted state variable, here denoted

x_{t|t-1} = E(x_t \mid y_1, \ldots, y_{t-1}),     (12.5)

which is needed for the evaluation of the likelihood, as well as the filtered state variable,

x_{t|t} = E(x_t \mid y_1, \ldots, y_t),     (12.6)

which gives the best prediction of the unobserved state variable given that the information in the most recent observation, y_t, has also been taken into account. Finally, it is sometimes interesting to do retrospective analyses, i.e. to estimate the state variables for a historical period given what is known today. This is achieved by the smoothed state variable

x_{t|T} = E(x_t \mid y_1, \ldots, y_T),     (12.7)

[Figure 12.1 about here. Panels: (A) Unobserved process x_t and observed series y_t; (B) Unobserved process x_t and filtered series.]

Figure 12.1: Simulated and filtered series from a linear state-space model.

i.e. the best in-sample prediction given the full set of observations, y_1, y_2, \ldots, y_T. Smoothed state variables are calculated using a further backward recursive algorithm, see, inter alia, Anderson and Moore (1979) and Shumway and Stoffer (2000). The derivation of the Kalman filter is given in Appendix 12.A.
Because the Kalman filter can evaluate the log-likelihood, \log L_T(\theta), for given parameters, \theta, we can also estimate the parameters by maximizing the likelihood,

\hat\theta = \arg\max_{\theta \in \Theta} \log L_T(\theta),     (12.8)

where each evaluation of the log-likelihood requires the application of the Kalman filter. This gives the MLE \hat\theta if (w_t', \eta_t')' are i.i.d. jointly Gaussian as stated, see Definition 12.1, while \hat\theta is the QMLE if this assumption does not apply.
From Caines (1988, Chapters 7-8) and Watson (1989), see also Ruiz (1994) and Dunsmuir (1979), it holds that \sqrt{T}(\hat\theta - \theta_0) is asymptotically Gaussian under the assumption that y_t is stationary and weakly dependent. Note that this requires that \Phi has eigenvalues inside the complex unit circle.

Example 12.1 (kalman filter): To illustrate estimation with the Kalman filter, consider a single time series \{y_t\}_{t=1}^T, generated from the following linear model

y_t = \mu + A x_t + w_t     (12.9)
x_t = \Phi x_{t-1} + \eta_t,     (12.10)

with parameters given by \mu = 0, A = 1, \Phi = 0.8, G = 0.5, H = 1 and with T = 500 observations.
Figure 12.1 (A) shows the simulated (but assumed unobserved) process, x_t, as well as the observed variable, y_t. To estimate the parameters we set A = 1 for identification and obtain the following estimates

\hat\mu = 0.316, \quad \hat\Phi = 0.758, \quad \hat{G} = 0.447, \quad \text{and } \hat{H} = 1.214,

not far from the true values. Figure 12.1 (B) shows the unobserved state x_t together with the estimate of the mean of the filtered process, x_{t|t}. Intuitively, as the measurement error, w_t, increases, i.e. G larger, the precision of the estimate would deteriorate.
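As an illustration, the following minimal Python sketch simulates the model (12.9)-(12.10) and runs the prediction and update recursions derived in Appendix 12.A for given parameter values; wrapping kalman_loglik in a numerical optimizer, as in (12.8), would deliver the (Q)MLE. All function names and values are illustrative.

import numpy as np

def simulate_ss(T, mu, A, phi, G, H, seed=0):
    """Simulate y_t = mu + A x_t + w_t, x_t = phi x_{t-1} + eta_t."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = phi * x[t - 1] + np.sqrt(H) * rng.standard_normal()
    y = mu + A * x + np.sqrt(G) * rng.standard_normal(T)
    return y, x

def kalman_loglik(y, mu, A, phi, G, H, a=0.0, P0=1e6):
    """Univariate Kalman filter: returns the log-likelihood (12.11) and the
    filtered states x_{t|t}; diffuse initialization via a large P0."""
    T = len(y)
    xf, Pf = a, P0
    xfilt = np.empty(T)
    ll = 0.0
    for t in range(T):
        xp, Pp = phi * xf, phi**2 * Pf + H       # prediction (12.12)-(12.13)
        yp, Fy = mu + A * xp, A**2 * Pp + G      # (12.14)-(12.15)
        v = y[t] - yp                            # prediction error
        ll += -0.5 * (np.log(2.0 * np.pi * Fy) + v**2 / Fy)
        K = Pp * A / Fy                          # Kalman gain (12.20)
        xf, Pf = xp + K * v, Pp - K * A * Pp     # update (12.19), (12.21)
        xfilt[t] = xf
    return ll, xfilt

y, x = simulate_ss(500, mu=0.0, A=1.0, phi=0.8, G=0.5, H=1.0)
ll, xfilt = kalman_loglik(y, mu=0.0, A=1.0, phi=0.8, G=0.5, H=1.0)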

Appendix:

12.A Kalman Filter


First, introduce the notation Y_{1:s} = (y_1, \ldots, y_s) for the history of y and use

x_{t|s} = E[x_t \mid Y_{1:s}]

to denote the estimated expectation of the states conditional on observed data up to time s, Y_{1:s}. We will refer to x_{t|t-1} as the predicted state while x_{t|t} is called the filtered state, cf. the prediction and filtering in the Markov switching model. In addition, we use the notation

P_{t|s}^x = V[x_t \mid Y_{1:s}]

for the variance of the states, and we note that because the distributions are assumed to be Gaussian, they are fully characterized by the first two moments. The Kalman filter derived below is a recursive calculation of the state mean and variance. We consider the filter for the case \mu = 0, but the formulas are easily extended to \mu \ne 0 by replacing y_t with y_t - \mu below.
Similarly to the notation for the states,

y_{t|t-1} = E[y_t \mid Y_{1:t-1}]

will denote the prediction for y_t given observations up to t-1, and P_{t|t-1}^y = V(y_t \mid Y_{1:t-1}) is the corresponding variance. The prediction error and its variance also allow evaluation of the likelihood function, which is given by
\log L_T(\theta) = \sum_{t=1}^T \log \ell_t(\theta) = \sum_{t=1}^T \log f(y_t \mid Y_{1:t-1}; \theta)
= -\frac{1}{2} \sum_{t=1}^T \left( \log |V(y_t \mid Y_{1:t-1})| + (y_t - E[y_t \mid Y_{1:t-1}])' V(y_t \mid Y_{1:t-1})^{-1} (y_t - E[y_t \mid Y_{1:t-1}]) \right)
= -\frac{1}{2} \sum_{t=1}^T \left( \log |P_{t|t-1}^y| + (y_t - y_{t|t-1})' (P_{t|t-1}^y)^{-1} (y_t - y_{t|t-1}) \right).     (12.11)

Observe that the normalizing constant in the Gaussian distribution has been omitted for simplicity, and that the parameters enter the likelihood function, \log L_T(\theta), through the model prediction, y_{t|t-1}, and the corresponding variance, P_{t|t-1}^y.

Prediction Step and Likelihood Evaluation: To build a recursive algorithm, assume that we have already calculated the filtered values, x_{t-1|t-1} and P_{t-1|t-1}^x, which fully characterize the (Gaussian) distribution of x_{t-1} \mid y_{t-1}, y_{t-2}, \ldots, y_1. The prediction step of the Kalman filter algorithm follows directly from the model equations. The state prediction follows from (12.2) as

x_{t|t-1} = E(x_t \mid Y_{1:t-1}) = \Phi x_{t-1|t-1}     (12.12)

and

P_{t|t-1}^x = E\left[ (x_t - x_{t|t-1})(x_t - x_{t|t-1})' \mid Y_{1:t-1} \right] = \Phi P_{t-1|t-1}^x \Phi' + H.     (12.13)

The measurement prediction follows from (12.1),

y_{t|t-1} = E(y_t \mid Y_{1:t-1}) = A x_{t|t-1},     (12.14)

while

P_{t|t-1}^y = E\left[ (y_t - y_{t|t-1})(y_t - y_{t|t-1})' \mid Y_{1:t-1} \right]
= E\left[ (A(x_t - x_{t|t-1}) + w_t)(A(x_t - x_{t|t-1}) + w_t)' \mid Y_{1:t-1} \right]
= A P_{t|t-1}^x A' + G.     (12.15)

In addition, the covariance is given by

P_{t|t-1}^{xy} = E\left[ (x_t - x_{t|t-1})(y_t - y_{t|t-1})' \mid Y_{1:t-1} \right]
= E\left[ (x_t - x_{t|t-1})(A(x_t - x_{t|t-1}) + w_t)' \mid Y_{1:t-1} \right]
= P_{t|t-1}^x A'.     (12.16)
From the prediction, y_{t|t-1}, and the variance, P_{t|t-1}^y, we can now evaluate one term in the log-likelihood function, \log f(y_t \mid Y_{1:t-1}; \theta) in (12.11), as

\log L_t(\theta) = -\frac{1}{2} \left( \log |P_{t|t-1}^y| + (y_t - y_{t|t-1})' (P_{t|t-1}^y)^{-1} (y_t - y_{t|t-1}) \right).     (12.17)

Update Step: It follows from the prediction step that the joint distribution of x_t and y_t conditional on Y_{1:t-1} is Gaussian,

\begin{pmatrix} x_t \\ y_t \end{pmatrix} \Big| Y_{1:t-1} \overset{D}{=} N\left( \begin{pmatrix} x_{t|t-1} \\ y_{t|t-1} \end{pmatrix}, \begin{pmatrix} P_{t|t-1}^x & P_{t|t-1}^{xy} \\ (P_{t|t-1}^{xy})' & P_{t|t-1}^y \end{pmatrix} \right) = N\left( \begin{pmatrix} x_{t|t-1} \\ A x_{t|t-1} \end{pmatrix}, \begin{pmatrix} P_{t|t-1}^x & P_{t|t-1}^x A' \\ A P_{t|t-1}^x & A P_{t|t-1}^x A' + G \end{pmatrix} \right).     (12.18)

The update step of the filter, which produces the filtered estimate of x_t conditional also on y_t (i.e. x_{t|t}), is found from (12.18) by deriving the conditional distribution of x_t \mid (Y_{1:t-1}, y_t). This conditional distribution is also Gaussian, with expectation given by

x_{t|t} = E(x_t \mid Y_{1:t})
= x_{t|t-1} + P_{t|t-1}^{xy} (P_{t|t-1}^y)^{-1} (y_t - y_{t|t-1})
= x_{t|t-1} + (P_{t|t-1}^x A')(A P_{t|t-1}^x A' + G)^{-1} (y_t - A x_{t|t-1})
= x_{t|t-1} + K_t (y_t - A x_{t|t-1}),     (12.19)

where the linear regression coefficient,

K_t = P_{t|t-1}^{xy} (P_{t|t-1}^y)^{-1} = (P_{t|t-1}^x A')(A P_{t|t-1}^x A' + G)^{-1},     (12.20)

is known as the Kalman gain. The update formula for the variance is, similarly, given by

P_{t|t}^x = V(x_t \mid Y_{1:t})
= P_{t|t-1}^x - P_{t|t-1}^{xy} (P_{t|t-1}^y)^{-1} (P_{t|t-1}^{xy})'
= P_{t|t-1}^x - P_{t|t-1}^x A' (A P_{t|t-1}^x A' + G)^{-1} A P_{t|t-1}^x
= P_{t|t-1}^x - K_t A P_{t|t-1}^x.     (12.21)
Now we have a recursive algorithm to go from time period t-1, i.e. x_{t-1|t-1} and P_{t-1|t-1}^x, to the next time period t, i.e. x_{t|t} and P_{t|t}^x, and each step produces the filtered state estimate and allows calculation of the likelihood contribution. To implement the filter, initial values

x_{0|0} = a \quad \text{and} \quad P_{0|0}^x = P

have to be chosen. If the process equation is stationary, one possible choice would be to take a and P from the invariant distribution, in which case they would depend explicitly on the parameters and could contribute to the likelihood. An alternative choice, which is possible also for non-stationary processes, is to use a so-called diffuse or uninformative initialization, i.e. choosing a = 0 and P as a big number (or matrix). This means that we assume to have almost no information on the initial value x_0, and in this case the first observation y_1 will totally dominate x_{1|1}.

Remark 12.4 (missing data): The Kalman filter and the calculation of the likelihood are straightforwardly extended to cover cases with missing data, e.g. where observations for some time periods are not recorded. The basic idea is that if y_t is not observed, then there is no new information on x_{t|t} and the update of x_{t|t-1} is therefore skipped, such that

x_{t|t} = \Phi x_{t-1|t-1} \quad \text{and} \quad P_{t|t}^x = \Phi P_{t-1|t-1}^x \Phi' + H,

respectively. Shumway and Stoffer (2011, Section 6.4) discuss this in detail and give suggestions for convenient implementations when only some of the p observations in the vector y_t are missing.

Remark 12.5 (forecasting): Forecasting is implemented in a similar way as the case of missing data, using the process and observation equations with the Kalman update gain set to zero, K_{T+h} = 0 for h = 1, 2, \ldots, H.
Chapter 13

Instrumental Variables
and Generalized Method
of Moments Estimation

Generalized method of moments (GMM) estimation is an alternative to the likelihood principle and it has been widely used since its introduction in econometrics by Hansen (1982). This chapter introduces the principle of GMM estimation and discusses some familiar estimators, OLS, IV, 2SLS and ML, as special cases. We focus on the intuition for the procedure, but GMM estimation is inherently technical and some details are discussed along the way. The chapter first presents the general theory. It then considers the special case of linear instrumental variables estimation and derives the well-known IV estimators as special cases. Finally, we present a GMM module for OxMetrics and discuss two empirical examples. One is the estimation of monetary policy rules; the other is the estimation of Euler equations for utility optimizing consumption.

13.1 Introduction
This chapter introduces the theory and implementation of the generalized method of
moments (GMM) estimation techniques and complements the coverage in Verbeek
(2017, Section 5.6). Many estimation techniques can be seen as special cases of GMM,
including the method of moments (MM), ordinary least squares (OLS), instrumental
variables (IV), two-stage least squares (2SLS), as well as maximum likelihood (ML)

and quasi maximum likelihood (QML). GMM can therefore be used as a unifying framework to derive and compare estimators.
From the first part of the course, we know that the maximum likelihood estimator is asymptotically efficient, i.e. it has the smallest possible variance in the class of consistent and asymptotically normal estimators. To obtain efficiency, however, we need a full description of the data generating process (DGP) such that the likelihood function is correctly specified.
GMM is a convenient alternative based on fewer assumptions, formulated as moment conditions also known from MM estimation. The framework of GMM allows a natural discussion of the minimal assumptions for consistency of an estimator, and it provides a way to produce robust hypothesis testing. As examples, it turns out that the heteroskedasticity-consistent variance formula for the OLS estimator, see e.g. Wooldridge (2006, Chapter 8), and the sandwich variance formula for the QML estimator appear as the variance of the GMM estimator.
The moment conditions for GMM estimation may come from different sources. In a regression model, e.g.

y_t = x_t' \beta + \epsilon_t, \quad t = 1, 2, \ldots, T,

the moment condition for the MM estimator comes from the assumption of a zero conditional mean or pre-determinedness, such that

E(x_t \epsilon_t) = 0.     (13.1)

The moment condition may also be the result of a theoretical economic argument, such as the behavior of an optimizing (rational) agent, e.g. in the form of Euler equations; we will see examples below.
Compared to the maximum likelihood (ML) principle applied earlier in the course, GMM estimation is based on fewer assumptions. To formulate a likelihood function for the non-linear regression model, for example, we need to specify the distribution for y_t \mid x_t, such that it accounts for all the main features of the data. GMM, on the other hand, is based solely on a set of moment conditions, e.g. (13.1).
On the one hand, GMM may make the model formulation easier, because we do not care about the entire distribution of y_t \mid x_t, but only the exogeneity (or pre-determinedness) of the regressors. That means that we do not need a full description of the DGP, but only a partial description of some features of the model.
On the other hand, the model control for GMM estimators is more complicated, because signs of misspecification are less obvious. In addition, the GMM estimator may be less efficient (i.e. have a larger asymptotic variance) than the ML estimator, because it is based on less a priori information.

Remark 13.1 (assumptions for gmm): GMM estimation is often closely related to economic theory, exploiting directly a moment condition implied by optimizing economic agents. Consistency of the GMM estimator requires that the moment conditions (and hence the economic theories) are true. So whereas the imposed statistical assumptions are typically minimal, the GMM estimator is derived under very strict economic assumptions: for example a representative agent, global optimization, rational expectations, etc.

13.2 Method of Moments Estimation


To introduce the idea of GMM estimation, we begin with a small motivating example.

13.2.1 Estimation of Rational Expectations Models


Consider a central bank that sets the short interest rate, r_t, to stabilize inflation, \pi_t. Because it takes time for the interest rate to affect inflation (the so-called outside lag) it could be optimal for the central bank to be forward-looking and act preemptively on the expected inflation in the next period, producing a monetary policy rule of the form

r_t = \beta + \gamma \pi_{t+1|t}^e,     (13.2)

where \pi_{t+1|t}^e is the expected inflation in period t+1 given the information available at time t. This is a simplified and forward-looking version of the monetary rule suggested in Taylor (1993).
Because the expectation, \pi_{t+1|t}^e, is unobserved, it is impossible to estimate the forward-looking policy rule (13.2) directly. If we assume that the central bank is rational, however, the optimal forecast of the central bank is the conditional expectation

\pi_{t+1|t}^e = E(\pi_{t+1} \mid I_t),

where I_t denotes the information set of the central bank at time t, before it sets the interest rate, r_t. Now we can decompose the actual inflation rate as the expected inflation plus an expectation error,

\pi_{t+1} = \pi_{t+1|t}^e + v_t = E(\pi_{t+1} \mid I_t) + v_t,

where E(v_t \mid I_t) = 0 by construction. Using that \pi_{t+1|t}^e = \pi_{t+1} - v_t, we can rewrite (13.2) as

r_t = \beta + \gamma \pi_{t+1} + \epsilon_t,     (13.3)

where

\epsilon_t = -\gamma v_t = -\gamma (\pi_{t+1} - E(\pi_{t+1} \mid I_t))     (13.4)

is a function of the expectation error.
At first sight, the equation in (13.3) looks like a linear regression model, where r_t depends on leaded inflation, \pi_{t+1}, and based on observed time series for r_t and \pi_t, an econometrician may try to estimate the parameters of interest, \theta = (\beta, \gamma)'. Unfortunately, it is clear from (13.4) that \epsilon_t is correlated with \pi_{t+1}, such that OLS is inconsistent.
Under the assumption of rational expectations, however, the expectations of the central bank are unbiased in the sense that it does not make systematic forecast errors, and therefore

E(\epsilon_t \mid I_t) = E(-\gamma v_t \mid I_t) = -\gamma E(v_t \mid I_t) = 0.

It therefore holds that for any vector of variables z_t \in I_t, we have that

E(\epsilon_t \mid z_t) = 0,

which implies the (unconditional) moment condition

E(z_t \epsilon_t) = E[z_t (r_t - \beta - \gamma \pi_{t+1})] = 0.     (13.5)

The moment condition in (13.5) turns out to be enough to estimate \beta and \gamma using MM or GMM. The information set available would typically include

\{r_{t-1}, r_{t-2}, \ldots, \pi_{t-1}, \pi_{t-2}, \ldots\},

and we could choose z_t as a subset of these, and potentially other pre-determined variables.

13.2.2 The Notation


In the rest of this chapter, we will refer to the observed variables entering the equation of interest, e.g. (13.3), as model variables, w_t = (r_t, \pi_{t+1})', and we will refer to z_t as the vector of instruments. The parameters of the model are given by \theta = (\beta, \gamma)', and we note that the condition in (13.5) is a function of the model variables, the instruments, and the parameters, i.e.

f(w_t, z_t, \theta) = z_t (r_t - \beta - \gamma \pi_{t+1}),

such that the moment condition can be formulated as

E[f(w_t, z_t, \theta)] = 0.

Below we first outline this general notation for MM and GMM estimation, and then we go through a number of examples of MM estimation.

13.2.3 Model Construction


Consider a model like the equation in (13.3) defined by the model variables, w_t, and the K parameters, \theta. Assume in addition that we have a number of instrumental variables, z_t.

Definition 13.1 (moment condition): A moment condition is a statement involving the data and the parameters,

g(\theta_0) = E[f(w_t, z_t, \theta_0)] = 0,     (13.6)

where \theta is a K-dimensional vector of parameters with true value \theta_0, and f(\cdot) is an R-dimensional vector of potentially non-linear functions.

In most applications the distinction between model variables, w_t, and instruments, z_t, is clear. If not, we can define f(y_t, \theta_0) where y_t includes all the observed data.
The R equations in (13.6) simply state that the expectation of the function f(w_t, z_t, \theta) is zero if evaluated at the true value \theta_0. For the rational expectations example in (13.3), the moment condition in (13.5) would correspond to

f(w_t, z_t, \theta) = z_t (r_t - \beta - \gamma \pi_{t+1}) = \begin{pmatrix} z_{1,t} \\ z_{2,t} \\ \vdots \\ z_{R,t} \end{pmatrix} (r_t - \beta - \gamma \pi_{t+1}),

and the moment conditions are given by

g(\theta_0) = E[f(w_t, z_t, \theta_0)] = E[z_t (r_t - \beta_0 - \gamma_0 \pi_{t+1})] = 0,

where \theta_0 = (\beta_0, \gamma_0)' contains the true values of the parameters.
If we knew the mathematical expectations, E(\cdot), then we could solve the equations in (13.6) to find \theta_0, and for the system to be well-defined the solution should be unique. The presence of a unique solution is called identification:

Definition 13.2 (identification): The moment conditions in (13.6) are said to identify the parameters in \theta_0 if there is a unique solution, such that E[f(w_t, z_t, \theta)] = 0 if and only if \theta = \theta_0.

For a given set of observations, w_t and z_t, t = 1, 2, \ldots, T, we cannot calculate the expectation, and it is natural to rely on sample averages. We define the analogous sample moments as

g_T(\theta) = \frac{1}{T} \sum_{t=1}^T f(w_t, z_t, \theta),     (13.7)

which contain the information in the data. Observe that because of sample uncertainty in the R sample moments, we have in general that

g_T(\theta_0) = \frac{1}{T} \sum_{t=1}^T f(w_t, z_t, \theta_0) \ne E[f(w_t, z_t, \theta_0)] = 0,

and we cannot find the true value of the parameter. Instead we find an estimator, \hat\theta, based on the sample moments. Three cases emerge:

(1) If R = K we say that the system is exactly identified, and we solve the R equations with K = R unknowns,

g_T(\hat\theta) = \frac{1}{T} \sum_{t=1}^T f(w_t, z_t, \hat\theta) = 0,     (13.8)

and the resulting estimator (if it exists) is referred to as the method of moments (MM) estimator.
(2) If R > K we have more equations than unknown parameters, and there is (in general) no solution to g_T(\theta) = 0. Instead we minimize a weighted sum of squares to find the GMM estimator, see §13.3 below.
(3) If R < K we have fewer equations than unknowns and solutions are not unique. In this case the parameter is not identified, and R \ge K is known as the order condition for identification.

To illustrate the idea of identification and the derivation of the MM estimator, §13.2.4 below considers a number of small examples.

Remark 13.2 (instrumental variables estimator): In many applications, the function in the moment condition has the specific form

f(w_t, z_t, \theta) = z_t \cdot u(w_t, \theta),

where an R \times 1 vector of instruments, z_t, is multiplied by the 1 \times 1 so-called disturbance term, u(w_t, \theta). We could think of u(w_t, \theta) as being the GMM equivalent of an error term, and the condition

g(\theta_0) = E[z_t u(w_t, \theta_0)] = 0     (13.9)

states that the instruments should be uncorrelated with the disturbance term of the model. The class of estimators derived from (13.9) is referred to as instrumental variables estimators.

13.2.4 MM Estimation by Examples


To illustrate the idea of moment conditions and the general notation, consider the
following examples of method of moments estimations; please read them carefully
and relate to the general notation.

Example 13.1 (mm estimator of the mean): Suppose that y_t is a random variable drawn from a population with expectation \mu_0, such that

E(y_t) = \mu_0,

or, alternatively,

E(y_t - \mu_0) = 0.

We use the notation

f(y_t, \mu_0) = y_t - \mu_0,

such that

g(\mu_0) = E[f(y_t, \mu_0)] = E[y_t - \mu_0] = 0.

Based on an observed sample, y_t, t = 1, 2, \ldots, T, we can construct the corresponding sample moment condition by replacing the expectation with the sample average:

g_T(\hat\mu) = \frac{1}{T} \sum_{t=1}^T (y_t - \hat\mu) = 0.     (13.10)

This is one equation with one unknown, and the MM estimator of the mean \mu_0 is the solution to (13.10), i.e.

\hat\mu_{MM} = \frac{1}{T} \sum_{t=1}^T y_t.

Note that the MM estimator is the sample average of y_t.

Example 13.2 (ols as an mm estimator): Consider the linear regression model

y_t = x_t' \beta_0 + \epsilon_t, \quad t = 1, 2, \ldots, T,     (13.11)

where y_t is a scalar variable to be explained and x_t is a K-dimensional vector of regressors. The parameter in the model is \beta and \beta_0 denotes the true value of \beta. From earlier courses, we know that the OLS estimator was derived from the assumption of a zero conditional mean, E(\epsilon_t \mid x_t) = 0, which allowed a natural interpretation of the parameter,

\beta_0 = \frac{\partial E(y_t \mid x_t)}{\partial x_t} = \frac{\partial (x_t' \beta_0)}{\partial x_t}.

This assumption implies the (unconditional) moment condition

E[x_t \epsilon_t] = E[x_t (y_t - x_t' \beta_0)] = 0.     (13.12)

Again we introduce the notation

f(y_t, x_t, \beta) = x_t (y_t - x_t' \beta),

and

g(\beta_0) = E[f(y_t, x_t, \beta_0)] = E[x_t (y_t - x_t' \beta_0)] = 0.

Defining the corresponding sample moment conditions,

g_T(\hat\beta) = \frac{1}{T} \sum_{t=1}^T x_t (y_t - x_t' \hat\beta) = \frac{1}{T} \sum_{t=1}^T x_t y_t - \frac{1}{T} \sum_{t=1}^T x_t x_t' \hat\beta = 0,

we have K equations with K unknowns, and the MM estimator can be derived as the unique solution:

\hat\beta_{MM} = \left( \frac{1}{T} \sum_{t=1}^T x_t x_t' \right)^{-1} \left( \frac{1}{T} \sum_{t=1}^T x_t y_t \right),     (13.13)

provided that \frac{1}{T} \sum_{t=1}^T x_t x_t' is non-singular such that the inverse exists. We recognize (13.13) as the OLS estimator, and recall the two conditions for identification: the moment conditions implied by predetermined regressors and the non-singularity implied by no perfect collinearity.

Example 13.3 (mm estimation in non-linear models): Consider an example where the model of interest is non-linear, e.g.

y_t = h(x_t, \beta_0) + \epsilon_t,

where h(x_t, \beta_0) = E(y_t \mid x_t) is some non-linear function, e.g.

h(x_t, \beta_0) = \exp(x_t' \beta_0).

Here \beta contains the parameters of the model and \beta_0 is the true value.
The zero conditional mean requirement would be the same as in (13.12), E(\epsilon_t \mid x_t) = 0, and estimation could be based on the moment condition

g(\beta_0) = E[x_t \epsilon_t] = E[x_t (y_t - h(x_t, \beta_0))] = E[x_t (y_t - \exp(x_t' \beta_0))] = 0.     (13.14)

The sample counterpart would be

g_T(\hat\beta) = \frac{1}{T} \sum_{t=1}^T x_t (y_t - \exp(x_t' \hat\beta)) = 0,

and we could solve the sample moment conditions (numerically) to obtain the MM estimate \hat\beta.

Example 13.4 (under-identification and non-consistency): Now we reconsider the estimation model in equation (13.11), but we assume that some of the variables in x_t are endogenous in the sense that they are correlated with the error term. In particular, we write the partitioned regression model:

y_t = x_{1t}' \beta_0 + x_{2t}' \delta_0 + \epsilon_t,

where the K_1 variables in x_{1t} are predetermined, while the K_2 = K - K_1 variables in x_{2t} are endogenous, i.e.

E(x_{1t} \epsilon_t) = 0 \quad (K_1 \times 1)     (13.15)
E(x_{2t} \epsilon_t) \ne 0 \quad (K_2 \times 1).     (13.16)

In this case OLS is known to be inconsistent.
As an MM estimator, the explanation is that we have K parameters, \theta_0 = (\beta_0', \delta_0')', but only K_1 < K moment conditions. The K_1 equations with K unknowns have no unique solution, so the parameters are not identified by the model.

Example 13.5 (simple iv estimator): Consider the estimation problem in Example 13.4, but now assume that there exist K_2 new variables, z_{2t}, that are correlated with x_{2t} but uncorrelated with the errors:

E(z_{2t} \epsilon_t) = 0.     (13.17)

The K_2 new moment conditions in (13.17) can replace (13.16). To simplify notation, we define

x_t = \begin{pmatrix} x_{1t} \\ x_{2t} \end{pmatrix} \quad (K \times 1) \quad \text{and} \quad z_t = \begin{pmatrix} x_{1t} \\ z_{2t} \end{pmatrix} \quad (K \times 1),

where x_t are the model variables and z_t are the instruments. We say that the predetermined variables are instruments for themselves, while the new instruments, z_{2t}, are instruments for x_{2t}. Using (13.15) and (13.17) we have K moment conditions:

g(\theta_0) = E[z_t \epsilon_t] = E[z_t (y_t - x_t' \theta_0)] = 0.

The corresponding sample moment conditions are given by

g_T(\hat\theta) = \frac{1}{T} \sum_{t=1}^T z_t (y_t - x_t' \hat\theta) = 0,

and the MM estimator is the unique solution:

\hat\theta_{MM} = \left( \frac{1}{T} \sum_{t=1}^T z_t x_t' \right)^{-1} \left( \frac{1}{T} \sum_{t=1}^T z_t y_t \right),

provided that the K \times K matrix \sum_{t=1}^T z_t x_t' can be inverted. This is the case if the new instruments are correlated with the endogenous variables; we say the instruments are relevant. Observe that this MM estimator coincides with the simple IV estimator.
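In code, the just-identified IV/MM estimator is two lines of linear algebra. The following minimal Python sketch uses simulated placeholder data with one endogenous regressor and one relevant instrument; all names and values are illustrative.

import numpy as np

def iv_mm(y, X, Z):
    """Just-identified IV/MM estimator: theta = (Z'X)^{-1} Z'y."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

# simulated placeholder data: x2 endogenous, z2 a relevant instrument
rng = np.random.default_rng(4)
T = 1000
z2 = rng.standard_normal(T)
u = rng.standard_normal(T)                           # common shock
x2 = 0.8 * z2 + 0.5 * u + rng.standard_normal(T)     # correlated with the error
eps = u + 0.3 * rng.standard_normal(T)
y = 1.0 + 2.0 * x2 + eps
X = np.column_stack([np.ones(T), x2])                # constant instruments itself
Z = np.column_stack([np.ones(T), z2])
print(iv_mm(y, X, Z))    # close to the true values (1, 2) in large samples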

Example 13.6 (maximum likelihood estimation): Now consider a log-likelihood function given by

\log L_T(\theta) = \sum_{t=1}^T \log \ell(\theta \mid y_t),

where \ell(\theta \mid y_t) is the likelihood contribution for observation t given the data, and \theta is the parameter. The first order conditions for the ML estimator, \hat\theta, are given by the likelihood equations,

S_T(\hat\theta) = \sum_{t=1}^T s_t(\hat\theta) = \sum_{t=1}^T \frac{\partial \log \ell(\hat\theta \mid y_t)}{\partial \theta} = 0.

Now note that these equations can be seen as a set of K sample moment conditions,

g_T(\hat\theta) = \frac{1}{T} \sum_{t=1}^T s_t(\hat\theta) = \frac{1}{T} \sum_{t=1}^T \frac{\partial \log \ell(\hat\theta \mid y_t)}{\partial \theta} = 0,     (13.18)

to which \hat\theta_{ML} is the unique MM solution. The population moment conditions corresponding to the sample moments in (13.18) are given by

g(\theta_0) = E[s_t(\theta_0)] = 0,     (13.19)

where s_t(\theta) = f(y_t, \theta) in the GMM notation. It follows that the ML estimator is the MM estimator based on the score equations as the moment conditions. Recall from Theorem 3.2 that (13.19) is exactly the minimal requirement for consistency of the QMLE.
13.2 Method of Moments Estimation 293

Example 13.7 (moment matching for an ma(2)): Consider the MA(2) model as given by

$$y_t = \epsilon_t + \theta_{1,0} \epsilon_{t-1} + \theta_{2,0} \epsilon_{t-2}, \quad t = 1, 2, ..., T, \tag{13.20}$$

with $\epsilon_t$ being a sequence of independent and identically distributed random variables,

$$\epsilon_t \mid I_{t-1} \sim N(0, \sigma_0^2), \quad \sigma_0^2 > 0,$$

where $I_t = \{y_t, y_{t-1}, y_{t-2}, ...\}$ denotes the information set. We note that $E(y_t) = 0$, and the parameters are given by $\theta = (\theta_1, \theta_2, \sigma^2)'$ with true value $\theta_0$.

To develop an MM estimation strategy, we calculate the unconditional variance,

$$E(y_t^2) = E((\epsilon_t + \theta_{1,0} \epsilon_{t-1} + \theta_{2,0} \epsilon_{t-2})^2) = E(\epsilon_t^2) + \theta_{1,0}^2 E(\epsilon_{t-1}^2) + \theta_{2,0}^2 E(\epsilon_{t-2}^2) = (1 + \theta_{1,0}^2 + \theta_{2,0}^2)\sigma_0^2,$$

as well as the unconditional autocovariances,

$$E(y_t y_{t-1}) = E((\epsilon_t + \theta_{1,0}\epsilon_{t-1} + \theta_{2,0}\epsilon_{t-2})(\epsilon_{t-1} + \theta_{1,0}\epsilon_{t-2} + \theta_{2,0}\epsilon_{t-3})) = (\theta_{1,0} + \theta_{1,0}\theta_{2,0})\sigma_0^2$$
$$E(y_t y_{t-2}) = E((\epsilon_t + \theta_{1,0}\epsilon_{t-1} + \theta_{2,0}\epsilon_{t-2})(\epsilon_{t-2} + \theta_{1,0}\epsilon_{t-3} + \theta_{2,0}\epsilon_{t-4})) = \theta_{2,0}\sigma_0^2,$$

while $E(y_t y_{t-i}) = 0$ for $i \geq 3$. Then the following population moment conditions hold:

$$g(\theta_0) = \begin{pmatrix} E(y_t^2) - (1 + \theta_{1,0}^2 + \theta_{2,0}^2)\sigma_0^2 \\ E(y_t y_{t-1}) - (\theta_{1,0} + \theta_{1,0}\theta_{2,0})\sigma_0^2 \\ E(y_t y_{t-2}) - \theta_{2,0}\sigma_0^2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \tag{13.21}$$

while the higher-order autocovariances are not informative on the parameters, as $E(y_t y_{t-i}) = 0$ for $i \geq 3$.

Using observations $\{y_t\}_{t=-1}^{T}$, the empirical moment conditions are given by

$$g_T(\hat{\theta}) = \begin{pmatrix} \frac{1}{T}\sum_{t=1}^{T} (y_t^2 - (1 + \hat{\theta}_1^2 + \hat{\theta}_2^2)\hat{\sigma}^2) \\ \frac{1}{T}\sum_{t=1}^{T} (y_t y_{t-1} - (\hat{\theta}_1 + \hat{\theta}_1\hat{\theta}_2)\hat{\sigma}^2) \\ \frac{1}{T}\sum_{t=1}^{T} (y_t y_{t-2} - \hat{\theta}_2\hat{\sigma}^2) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \tag{13.22}$$

which implicitly defines the parameter estimates, $\hat{\theta} = (\hat{\theta}_1, \hat{\theta}_2, \hat{\sigma}^2)'$. In this case there is a closed-form solution; alternatively, the estimates can be found using numerical methods.
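A numerical sketch of this moment-matching step, assuming simulated MA(2) data and a generic root finder (our own setup):

    import numpy as np
    from scipy.optimize import fsolve

    rng = np.random.default_rng(3)
    T = 5000
    theta1, theta2, sigma2 = 0.4, 0.2, 1.0
    e = rng.normal(scale=np.sqrt(sigma2), size=T + 2)
    y = e[2:] + theta1 * e[1:-1] + theta2 * e[:-2]

    m0 = np.mean(y * y)            # sample variance
    m1 = np.mean(y[1:] * y[:-1])   # first-order autocovariance
    m2 = np.mean(y[2:] * y[:-2])   # second-order autocovariance

    def g_T(p):
        # The three sample moment conditions in (13.22).
        t1, t2, s2 = p
        return [m0 - (1 + t1**2 + t2**2) * s2,
                m1 - (t1 + t1 * t2) * s2,
                m2 - t2 * s2]

    print(fsolve(g_T, x0=[0.1, 0.1, np.var(y)]))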
13.3 GMM Estimation

The case $R > K$ is referred to as over-identification, and the estimator is denoted the GMM estimator. In this case there are more equations than parameters, and

$$g_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} f(w_t, z_t, \theta) = 0$$

has no solution in general.
Instead we find an estimator by minimizing the distance from the vector $g_T(\theta)$ to zero. One possibility is to choose $\theta$ to minimize the simple distance corresponding to the sum of squares, $g_T(\theta)' g_T(\theta)$. That has the disadvantage of being dependent on the scaling of the moments (e.g. whether a price index is scaled so that 1980 = 100 or 1980 = 1). Instead, we minimize the weighted sum of squares, defined by the quadratic form

$$Q_T(\theta) = g_T(\theta)' W_T g_T(\theta), \tag{13.23}$$

where $W_T$ is an $R \times R$ symmetric and positive definite weight matrix that attaches weights to the individual moments. We can think of the matrix $W_T$ as reflecting the importance of the moments; alternatively we can think of $W_T$ as defining the metric for measuring the distance from $g_T(\theta)$ to zero. Note that the GMM estimator depends on the chosen weight matrix:

$$\hat{\theta}_{GMM}(W_T) = \arg\min_{\theta} \{ g_T(\theta)' W_T g_T(\theta) \}. \tag{13.24}$$

Since (13.23) is a quadratic form, it holds that $Q_T(\theta) \geq 0$. Equality holds (at the minimum) for the exactly identified case, where the weight matrix is redundant and the estimator $\hat{\theta}_{MM}$ is unique.

To derive the estimator in (13.24) we take the first derivative and solve the $K$ equations

$$\frac{\partial Q_T(\theta)}{\partial \theta} = \underset{(K \times 1)}{0}$$

for the $K$ unknown parameters in $\theta$. In some cases these equations can be solved analytically to produce the GMM estimator, $\hat{\theta}_{GMM}$, and we will see one example from a linear model below. If the function $f(w_t, z_t, \theta)$ is non-linear, however, it is in most cases not possible to find an analytical solution, and we have to rely on a numerical procedure for minimizing $Q_T(\theta)$.
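The following generic sketch illustrates the numerical route, assuming a user-supplied function moments(theta) that returns the $T \times R$ matrix with rows $f(w_t, z_t, \theta)$ (the convention and names are our own):

    import numpy as np
    from scipy.optimize import minimize

    def gmm_objective(theta, moments, W):
        g = moments(theta).mean(axis=0)   # g_T(theta), an R-vector
        return g @ W @ g                  # Q_T(theta) = g' W g, as in (13.23)

    def gmm_estimate(moments, theta0, W):
        res = minimize(lambda th: gmm_objective(th, moments, W), theta0, method="BFGS")
        return res.x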

13.3.1 Properties of the GMM Estimator

To discuss the properties of the GMM estimator, we make the following assumptions on the moment function $f(\cdot)$ and the data, which are very similar to the requirements on the derivatives of the likelihood function:

Assumption 13.1 (properties of the moment function): The data are such that the moment function and its derivatives obey the following conditions:

(1) A law of large numbers applies to $f(w_t, z_t, \theta_0)$, i.e.

$$g_T(\theta_0) = \frac{1}{T} \sum_{t=1}^{T} f(w_t, z_t, \theta_0) \overset{p}{\to} E[f(w_t, z_t, \theta_0)] = g(\theta_0)$$

for $T \to \infty$.

(2) A central limit theorem applies to $f(w_t, z_t, \theta_0)$, i.e. as $T \to \infty$,

$$\sqrt{T}\, g_T(\theta_0) = \frac{1}{\sqrt{T}} \sum_{t=1}^{T} f(w_t, z_t, \theta_0) \overset{d}{\to} N(0, S), \tag{13.25}$$

where $S$ is the asymptotic variance of $f(w_t, z_t, \theta_0)$.

(3) A law of large numbers applies to the derivative of the moment function,

$$D_T = \frac{1}{T} \sum_{t=1}^{T} \frac{\partial f(w_t, z_t, \theta_0)}{\partial \theta'} \overset{p}{\to} E\left[ \frac{\partial f(w_t, z_t, \theta_0)}{\partial \theta'} \right] = D.$$

(4) The second derivative of $f(w_t, z_t, \theta)$ is bounded by a constant in a small neighborhood of $\theta_0$.

For simplicity the assumption is formulated directly on $f(w_t, z_t, \theta)$, but it is a restriction on the behavior of the data, and the assumption can be translated into precise requirements on the data on a case-by-case basis. For independent and identically distributed data the assumption is fulfilled, while for time series we require stationarity and weak dependence, known also from the likelihood analysis.

Theorem 13.1 (consistency): Let the data obey Assumption 13.1. If the moment conditions are correct, $g(\theta_0) = 0$, then (under some regularity conditions):

$$\hat{\theta}_{GMM}(W_T) \overset{p}{\to} \theta_0 \quad \text{as } T \to \infty,$$

for all $W_T$ positive definite.

Different weight matrices produce different estimators, and Theorem 13.1 states that although they may differ for a given data set, they are all consistent! The intuition is the following: if a law of large numbers applies to $f(w_t, z_t, \theta)$, then the sample moment, $g_T(\theta)$, converges to the population moment, $g(\theta)$. And since $\hat{\theta}_{GMM}(W_T)$ makes $g_T(\theta)$ as close as possible to zero, it will be a consistent estimator of the solution to $g(\theta_0) = 0$. The requirement is that $W_T$ is positive definite, such that we put a positive and non-zero weight on all moment conditions. Otherwise we may throw important information away.

Theorem 13.2 (asymptotic distribution of gmm): Let the data obey Assumption 13.1. For a positive definite weight matrix $W_T$ with probability limit $W$, the asymptotic distribution of the GMM estimator is given by

$$\sqrt{T}\,(\hat{\theta}_{GMM} - \theta_0) \overset{d}{\to} N(0, V). \tag{13.26}$$

The asymptotic variance is given by

$$V = (D'WD)^{-1} D'WSWD\, (D'WD)^{-1}, \tag{13.27}$$

where

$$D = E\left[ \frac{\partial f(w_t, z_t, \theta_0)}{\partial \theta'} \right]$$

is the expected value of the $R \times K$ matrix of first derivatives of the moment function $f(\cdot)$.

A sketch of the derivation of the asymptotic properties of the GMM estimator is given in §13.3.4 below.

The expression for the asymptotic variance in (13.27) is quite complicated. It depends on the limit of the chosen weight matrix, $W$, and the expected derivative, $D$. For the latter, you could think of the derivative of the sample moments,

$$D_T = \frac{\partial g_T(\theta)}{\partial \theta'} = \frac{1}{T} \sum_{t=1}^{T} \frac{\partial f(w_t, z_t, \theta)}{\partial \theta'}, \tag{13.28}$$

and $D$ is the limit of $D_T$ for $T \to \infty$. The variance also depends on the asymptotic variance matrix of the moment functions, $S$, and many of the technicalities of GMM are related to the estimation of $S$.

13.3.2 Efficient GMM Estimation

It follows from Theorem 13.2 that the variance of the estimator depends on the weight matrix, $W_T$; some weight matrices produce precise estimators, while other weight matrices produce poor estimators with large variances. We want a systematic way of choosing the good estimators. In particular, we want to select a weight matrix, $W_T^{opt}$, that produces the estimator with the smallest possible asymptotic variance. This estimator is denoted the efficient (or optimal) GMM estimator.

It seems intuitive that moments with a small variance are very informative on the parameters and should have a large weight, while moments with a high variance should have a smaller weight. And it can be shown that the optimal weight matrix, $W_T^{opt}$, has the property that

$$\text{plim}\, W_T^{opt} = S^{-1}.$$

With an optimal weight matrix, $W_T \overset{p}{\to} W = S^{-1}$, the asymptotic variance in (13.26) simplifies to

$$V = (D'S^{-1}D)^{-1} D'S^{-1}SS^{-1}D\, (D'S^{-1}D)^{-1} = (D'S^{-1}D)^{-1}, \tag{13.29}$$

which is the smallest possible asymptotic variance.

Theorem 13.3 (asymptotic distribution of efficient gmm): The asymptotic distribution of the efficient GMM estimator is given in (13.26), with asymptotic variance (13.29).

To interpret the asymptotic variance in (13.29), we note that the best moment conditions are those for which $S$ is small and $D$ is large (in a matrix sense). A small $S$ means that the sample variation of the moment (or the noise) is small. $D$ is the derivative of the moment, so a large $D$ means that the moment condition is much violated if $\theta \neq \theta_0$, and the moment is very informative on the true values, $\theta_0$. This is also related to the curvature of the criterion function, $Q_T(\theta)$, similar to the interpretation of the expression for the variance of the ML estimator.

Hypothesis testing on $\hat{\theta}_{GMM}$ can be based on the asymptotic distribution:

$$\hat{\theta}_{GMM} \overset{a}{\sim} N(\theta_0, T^{-1}\hat{V}). \tag{13.30}$$

An estimator of the asymptotic variance is given by

$$\hat{V} = (D_T' S_T^{-1} D_T)^{-1}, \tag{13.31}$$

where $D_T$ is the sample average of the first derivatives in (13.28) and $S_T$ is an estimator of $S = T \cdot V(g_T(\theta))$. If the observations are independent, a consistent estimator is

$$S_T = \frac{1}{T} \sum_{t=1}^{T} f(w_t, z_t, \hat{\theta}) f(w_t, z_t, \hat{\theta})', \tag{13.32}$$

see the discussion of weight matrix estimation in §13.4.

Example 13.8 (interpretation of weight matrices): Consider the case of a linear model,

$$y_t = x_t'\beta + \epsilon_t, \quad t = 1, 2, ..., T.$$

Assume the existence of $R = 4$ instruments, $z_t = (z_{1t}, z_{2t}, z_{3t}, z_{4t})'$, which define the sample moment conditions

$$g_T(\beta) = \begin{pmatrix} g_1 \\ g_2 \\ g_3 \\ g_4 \end{pmatrix} = \begin{pmatrix} \frac{1}{T}\sum_{t=1}^{T} z_{1t}(y_t - x_t'\beta) \\ \frac{1}{T}\sum_{t=1}^{T} z_{2t}(y_t - x_t'\beta) \\ \frac{1}{T}\sum_{t=1}^{T} z_{3t}(y_t - x_t'\beta) \\ \frac{1}{T}\sum_{t=1}^{T} z_{4t}(y_t - x_t'\beta) \end{pmatrix},$$

where the dependence of $g_1, ..., g_4$ on $T$ and $\beta$ is suppressed.

Consider an example where the variances of the sample moments are given by

$$S = V(\sqrt{T}\, g_T(\beta)) = T \begin{pmatrix} V(g_1) & \mathrm{cov}(g_1,g_2) & \mathrm{cov}(g_1,g_3) & \mathrm{cov}(g_1,g_4) \\ \mathrm{cov}(g_1,g_2) & V(g_2) & \mathrm{cov}(g_2,g_3) & \mathrm{cov}(g_2,g_4) \\ \mathrm{cov}(g_1,g_3) & \mathrm{cov}(g_2,g_3) & V(g_3) & \mathrm{cov}(g_3,g_4) \\ \mathrm{cov}(g_1,g_4) & \mathrm{cov}(g_2,g_4) & \mathrm{cov}(g_3,g_4) & V(g_4) \end{pmatrix} = \begin{pmatrix} 2 & 0.8 & 0 & 0 \\ 0.8 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 3 \end{pmatrix}.$$

The optimal weight matrix is then given by

$$W^{opt} = S^{-1} = \begin{pmatrix} 0.735 & -0.588 & 0 & 0 \\ -0.588 & 1.471 & 0 & 0 \\ 0 & 0 & 1.0 & 0 \\ 0 & 0 & 0 & 0.333 \end{pmatrix},$$

and the criterion function is

$$Q_T(\beta) = \begin{pmatrix} g_1 & g_2 & g_3 & g_4 \end{pmatrix} W^{opt} \begin{pmatrix} g_1 \\ g_2 \\ g_3 \\ g_4 \end{pmatrix} = 0.735\,g_1^2 + 1.471\,g_2^2 - 1.176\,g_1 g_2 + g_3^2 + 0.333\,g_4^2.$$

Now observe the following:

(1) Higher variances give lower weights; compare the weights to $g_3$ and $g_4$.
(2) Weights adjust for the covariance of moments, here through the cross-term $g_1 g_2$.
(3) Positively correlated moments are downweighted.
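These numbers are easy to verify; a quick check (our own sketch):

    import numpy as np

    S = np.array([[2.0, 0.8, 0.0, 0.0],
                  [0.8, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 3.0]])
    W = np.linalg.inv(S)   # the off-diagonal -0.588 penalizes the g1-g2 covariance
    print(np.round(W, 3))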

13.3.3 Computational Issues

To obtain the efficient GMM estimator we need an optimal weight matrix. But note from (13.32) that the weight matrix in general depends on the parameters, so to estimate the optimal weight matrix we need a consistent estimator of $\theta_0$. This dependence suggests different estimation strategies.

First-step GMM estimator. Initially, we could choose some weight matrix, e.g. an identity matrix $W_{[1]} = I_R$, and find the first-step GMM estimator

$$\hat{\theta}_{[1]} = \arg\min_{\theta}\, g_T(\theta)' W_{[1]} g_T(\theta).$$

This estimator is consistent but inefficient.

Two-step efficient GMM estimator. To obtain an efficient estimator we could estimate the optimal weight matrix, $W_{[2]}^{opt}$, based on the consistent first-step estimator, $\hat{\theta}_{[1]}$. Given the optimal weight matrix, we can find the efficient GMM estimator

$$\hat{\theta}_{[2]} = \arg\min_{\theta}\, g_T(\theta)' W_{[2]}^{opt} g_T(\theta).$$

This procedure is denoted two-step efficient GMM. The estimator is not unique, as it depends on the choice of the initial weight matrix, $W_{[1]}$.

Iterated GMM estimator. Looking at the two-step procedure, it is natural to make another iteration: reestimate the optimal weight matrix, $W_{[3]}^{opt}$, based on $\hat{\theta}_{[2]}$, and then update the optimal estimator, $\hat{\theta}_{[3]}$. If we switch between estimating $W_{[\cdot]}^{opt}$ and $\hat{\theta}_{[\cdot]}$ until convergence (i.e. until the parameters do not change from one iteration to the next), we obtain the so-called iterated GMM estimator, which does not depend on the initial weight matrix, $W_{[1]}$.

The iterated GMM estimator is asymptotically equivalent to the two-step estimator. The intuition is that the estimators of $\theta$ and $W^{opt}$ are consistent, so for $T \to \infty$ the iterated GMM estimator will converge in two iterations. For a given data set, however, there may be gains from the iterative procedure; a sketch of the iteration is given below.
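A minimal sketch of the first-step/two-step/iterated logic, under the assumption of i.i.d. moments so that $S_T$ is estimated as in (13.32) (the function names and conventions are our own):

    import numpy as np
    from scipy.optimize import minimize

    def iterated_gmm(moments, theta0, max_iter=50, tol=1e-8):
        # moments(theta): T x R matrix with rows f(w_t, z_t, theta)
        R = moments(theta0).shape[1]
        W = np.eye(R)                             # first step: identity weights
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            obj = lambda th: moments(th).mean(0) @ W @ moments(th).mean(0)
            theta_new = minimize(obj, theta, method="BFGS").x
            F = moments(theta_new)
            S = F.T @ F / F.shape[0]              # (13.32), valid for independent data
            W = np.linalg.inv(S)                  # update toward the optimal weights
            if np.max(np.abs(theta_new - theta)) < tol:
                return theta_new, W               # iterated GMM estimate
            theta = theta_new
        return theta, W

Stopping after the first pass through the loop gives the two-step estimator; running to convergence gives the iterated estimator.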

Continuously updated GMM estimator. A third approach is to recognize from the outset that the weight matrix depends on the parameters, and to reformulate the GMM criterion as

$$Q_T(\theta) = g_T(\theta)' W_T(\theta)\, g_T(\theta),$$

and minimize this with respect to $\theta$. This procedure, called the continuously updated GMM estimator, see Hansen, Heaton, and Yaron (1996), can never be solved analytically, but it can be implemented on a computer using numerical optimization.

From the econometric literature it is not clear whether the iterated GMM estimator or the continuously updated GMM estimator is preferable in practice, but the asymptotic behavior of the two estimators is identical.

13.3.4 Derivations

To characterize the asymptotic properties of the GMM estimator, consider the first-order condition for the minimization of the GMM criterion function,

$$\frac{\partial Q_T(\hat{\theta})}{\partial \theta} = 0.$$

With $Q_T(\theta) = g_T(\theta)' W_T g_T(\theta)$, it follows that

$$\frac{\partial Q_T(\theta)}{\partial \theta} = 2\left( \frac{\partial g_T(\theta)}{\partial \theta'} \right)' W_T\, g_T(\theta) = 2 D_T(\theta)' W_T\, g_T(\theta), \tag{13.33}$$

where

$$D_T(\theta) = \frac{\partial g_T(\theta)}{\partial \theta'} \tag{13.34}$$

is the $R \times K$ matrix of first derivatives. Next, we use a first-order Taylor approximation^{10} of $g_T(\theta)$ around the true value $\theta_0$:

$$g_T(\theta) \approx g_T(\theta_0) + D_T(\theta_0)(\theta - \theta_0). \tag{13.35}$$

It therefore holds that

$$\frac{\partial Q_T(\theta)}{\partial \theta} = 2 D_T(\theta_0)' W_T\, g_T(\theta) = 2 D_T(\theta_0)' W_T\, g_T(\theta_0) + 2 D_T(\theta_0)' W_T D_T(\theta_0)(\theta - \theta_0).$$

The first-order condition for a minimum implies

$$D_T(\theta_0)' W_T\, g_T(\theta_0) + D_T(\theta_0)' W_T D_T(\theta_0)(\hat{\theta} - \theta_0) = 0, \tag{13.36}$$

and by rearranging terms, we get

$$\hat{\theta} = \theta_0 - (D_T(\theta_0)' W_T D_T(\theta_0))^{-1} D_T(\theta_0)' W_T\, g_T(\theta_0), \tag{13.37}$$

which expresses the estimator as the true value plus an estimation error.

^{10} The remainder term of the first-order Taylor approximation depends on the second derivative of the moment function. If the second derivative is bounded, as implied by Assumption 13.1, it will not affect the asymptotic behavior.
To discuss the asymptotic behavior we define the limits

$$D_T(\theta_0) \overset{p}{\to} D = E\left[ \frac{\partial f(w_t, z_t, \theta_0)}{\partial \theta'} \right] \quad \text{and} \quad W_T \overset{p}{\to} W.$$

Consistency then follows from

$$\hat{\theta} \overset{p}{\to} \theta_0 - (D'WD)^{-1} D'W\, g(\theta_0) = \theta_0,$$

where we have used that $g_T(\theta_0) \overset{p}{\to} g(\theta_0) = 0$.

To derive the asymptotic distribution, recall that $\sqrt{T}\, g_T(\theta_0) \overset{d}{\to} N(0, S)$. It follows directly that the asymptotic distribution of the estimator is given by

$$\sqrt{T}\,(\hat{\theta} - \theta_0) \overset{d}{\to} N(0, V), \tag{13.38}$$

where the asymptotic variance is

$$V = (D'WD)^{-1} D'WSWD\,(D'WD)^{-1}. \tag{13.39}$$

13.4 Weight-Matrix Estimation

The optimal weight matrix is given by $W_T^{opt} = S_T^{-1}$, where $S_T$ is a consistent estimator of

$$S = V[\sqrt{T}\, g_T(\theta_0)].$$

Using that $g_T(\theta_0) = \frac{1}{T}\sum_{t=1}^{T} f(w_t, z_t, \theta_0)$, we may write $S$ as^{11}

$$S = T \cdot V[g_T(\theta_0)] = T \cdot V\left[ \frac{1}{T}\sum_{t=1}^{T} f(w_t, z_t, \theta_0) \right] = \frac{1}{T} V\left[ \sum_{t=1}^{T} f(w_t, z_t, \theta_0) \right]. \tag{13.40}$$

How to construct this estimator depends on the properties of the data. If the data are independent, then the variance of the sum is the sum of the variances, and we get

$$S = \frac{1}{T}\sum_{t=1}^{T} V[f(w_t, z_t, \theta_0)] = \frac{1}{T}\sum_{t=1}^{T} E[f(w_t, z_t, \theta_0) f(w_t, z_t, \theta_0)'].$$

A natural estimator is

$$S_T = \frac{1}{T}\sum_{t=1}^{T} f(w_t, z_t, \hat{\theta}) f(w_t, z_t, \hat{\theta})'. \tag{13.41}$$

^{11} Here we write the result for fixed $T$ to focus on the individual terms, although strictly speaking $S$ is the asymptotic variance, i.e. the limit for $T \to \infty$.
This estimator is robust to heteroskedasticity by construction and is often referred to as the heteroskedasticity consistent (HC) variance estimator.

In the case of autocorrelation, $f(w_t, z_t, \theta_0)$ and $f(w_s, z_s, \theta_0)$ are correlated, and the variance of the sum in (13.40) is not the sum of variances but includes contributions from all the covariances:

$$S = \frac{1}{T}\sum_{t=1}^{T}\sum_{s=1}^{T} E[f(w_t, z_t, \theta_0) f(w_s, z_s, \theta_0)'].$$

This is the so-called long-run variance of $f(\cdot)$, and the estimators are referred to as the class of heteroskedasticity and autocorrelation consistent (HAC) variance estimators.

To describe the HAC estimators, first define the $R \times R$ sample covariance matrix between $f(w_t, z_t, \theta_0)$ and $f(w_{t-j}, z_{t-j}, \theta_0)$,

$$\hat{\Gamma}_j = \frac{1}{T}\sum_{t=j+1}^{T} f(w_t, z_t, \hat{\theta}) f(w_{t-j}, z_{t-j}, \hat{\theta})'.$$

The natural estimator of $S$ is then given by

$$S_T = \sum_{j=-T+1}^{T-1} \hat{\Gamma}_j = \hat{\Gamma}_0 + \sum_{j=1}^{T-1} (\hat{\Gamma}_j + \hat{\Gamma}_j'), \tag{13.42}$$

where $\hat{\Gamma}_0$ is the HC estimator in (13.41), and the last equality follows from the symmetry of the autocovariances, $\Gamma_{-j} = \Gamma_j'$.

Example 13.9 (univariate weight matrix): To illustrate, consider the univariate example, with $f_t = f(w_t, z_t, \theta_0) \in \mathbb{R}$. If $f_t$ and $f_s$ are correlated, we write the variance as

$$S = V\left[ \frac{1}{\sqrt{T}}\sum_{t=1}^{T} f_t \right] = T^{-1} V(f_1 + f_2 + ... + f_T) = T^{-1} E[(f_1 + f_2 + ... + f_T)(f_1 + f_2 + ... + f_T)]$$
$$= T^{-1}\{ E(f_1^2) + E(f_1 f_2) + ... + E(f_1 f_T) + E(f_2 f_1) + E(f_2^2) + ... + E(f_2 f_T) + \cdots + E(f_T f_1) + E(f_T f_2) + ... + E(f_T^2) \},$$

which includes all cross products. Now consider the autocovariances,

$$\gamma_j = \frac{1}{T}\sum_{t=j+1}^{T} E(f_t f_{t-j}),$$

with natural estimates given by

$$\hat{\gamma}_j = \frac{1}{T}\sum_{t=j+1}^{T} f_t f_{t-j}.$$

We can then write $S_T$ as the long-run variance,

$$S_T = \hat{\gamma}_0 + 2\hat{\gamma}_1 + 2\hat{\gamma}_2 + 2\hat{\gamma}_3 + \cdots + 2\hat{\gamma}_{T-1} = \hat{\gamma}_0 + 2\sum_{j=1}^{T-1}\hat{\gamma}_j.$$

Observe that $\hat{\gamma}_j$ is estimated using $T - j$ terms of the form $f_t f_{t-j}$, such that $\hat{\gamma}_0$ uses all $T$ terms, $f_1^2, f_2^2, ..., f_T^2$, while $\hat{\gamma}_{T-1}$ is based on a single term, $f_1 f_T$.

Observe that we cannot consistently estimate as many covariances as we have observations (see also Example 13.9, where $\hat{\gamma}_{T-1}$ is estimated from a single term), and the simple estimator in (13.42) is inconsistent.

If, however, we believe that $\Gamma_j = 0$ for $j \geq q$, then we can use the truncated estimator

$$S_T = \hat{\Gamma}_0 + \sum_{j=1}^{q-1} (\hat{\Gamma}_j + \hat{\Gamma}_j'). \tag{13.43}$$

For $q$ fixed (and if indeed $\Gamma_j = 0$ for $j \geq q$) this estimator is consistent as $T \to \infty$, but in finite samples it may not be positive definite.
An alternative is to put a weight $w_j$ on autocovariance $j$, and to let the weights go to zero as $j$ increases. This class of so-called kernel estimators can be written as

$$S_T = \hat{\Gamma}_0 + \sum_{j=1}^{T-1} w_j (\hat{\Gamma}_j + \hat{\Gamma}_j'), \tag{13.44}$$

where the weight, $w_j$, is a function of $j$ and $q$,

$$w_j = k(j, q).$$

The function $k(\cdot)$ is called the kernel function, and the constant $q$ is referred to as the bandwidth parameter. A simple, but often used, choice is the Bartlett kernel, where

$$w_j = k(j, q) = \begin{cases} 1 - \frac{|j|}{q} & \text{for } |j| < q \\ 0 & \text{for } |j| \geq q \end{cases}. \tag{13.45}$$

For this kernel the weights decrease linearly with $j$, and the weights are zero for $j \geq q$, see Figure 13.1. We can think of the bandwidth parameter $q - 1$ as the maximum order of autocorrelation taken into account by the estimator. This estimator is also known as the Newey-West estimator. Other kernel functions exist which let the weights go to zero following some smooth pattern.

Figure 13.1: Bartlett kernel weights with bandwidth parameter q = 6.

For a given kernel, the bandwidth has to be chosen. If the maximum order of autocorrelation is unknown, then the (asymptotically optimal) bandwidth can be estimated from the data in an automated procedure; this is implemented in many software programs.

Finally, note that the HAC covariance estimator can also be used for calculating standard errors for OLS estimates. Provided that the moment conditions are valid, such that the OLS estimator is actually consistent, the HAC covariance estimator makes hypothesis testing robust to autocorrelation. A sketch of the Bartlett-kernel estimator is given below.
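A minimal sketch of the Bartlett-kernel (Newey-West) estimator in (13.44)-(13.45), assuming the rows of F hold the estimated moments $f(w_t, z_t, \hat{\theta})$ (our own names):

    import numpy as np

    def newey_west(F, q):
        # F: T x R matrix whose row t is f(w_t, z_t, theta_hat)
        T, R = F.shape
        S = F.T @ F / T                    # Gamma_0, the HC term (13.41)
        for j in range(1, q):
            Gj = F[j:].T @ F[:-j] / T      # sample autocovariance Gamma_j
            w = 1 - j / q                  # Bartlett weight (13.45)
            S += w * (Gj + Gj.T)
        return S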

13.5 Test of Overidentifying Conditions

Recall that $K$ moment conditions were sufficient to obtain a MM estimator of the $K$ parameters in $\theta$. If the estimation is based on $R > K$ moment conditions, we can test the validity of the $R - K$ overidentifying moment conditions. The intuition is that by MM estimation we can set $K$ moment conditions equal to zero, but if all $R$ moment conditions are valid, then the remaining $R - K$ moments should also be close to zero. If a sample moment condition is far from zero, it indicates that it is violated by the data.

It follows from (13.25) that

$$g_T(\theta_0) \overset{a}{\sim} N(0, T^{-1}S).$$

If we use the optimal weights, $W_T^{opt} \overset{p}{\to} S^{-1}$, then $\hat{\theta}_{GMM} \overset{p}{\to} \theta_0$, and

$$J = T \cdot g_T(\hat{\theta}_{GMM})' W_T^{opt}\, g_T(\hat{\theta}_{GMM}) = T \cdot Q_T(\hat{\theta}) \overset{d}{\to} \chi^2(R - K). \tag{13.46}$$

This is the standard result that the square of a normal variable is $\chi^2$. The intuitive reason for the $R - K$ degrees of freedom (and not $R$, which is the dimension of $g_T(\theta)$) is that we have used $K$ parameters to minimize $Q_T(\theta)$. If we wanted, we could set $K$ moment conditions equal to zero, and they would not contribute to the test statistic.

The test is known as the J-test or the Hansen test for overidentifying restrictions. In linear models, the test is often referred to as the Sargan test. It is important to note that $J$ does not test the validity of the model per se; in particular, it is not a test of whether the underlying economic theory is correct. The test considers whether the $R - K$ overidentifying conditions are correct, given identification using $K$ moments.

We cannot see from $J$ directly which moments cause the rejection. If the test rejects, however, we may try to remove some moment conditions, reestimate, and reconsider the statistic. This gives some indication of the problematic moment conditions or problematic instruments.

It can be shown that if the second estimation is based on $R_1 < R$ moment conditions, with $R_1 \geq K$ for identification, then the difference in the J statistics is distributed as a $\chi^2(R - R_1)$ under the null of $R$ valid moment conditions. In practice it is important to base both estimations on the same estimate of $S$, i.e. the second estimation should reuse the relevant $R_1$ rows and columns of $S_T$ from the first estimation. These incremental J statistics are sometimes referred to as C tests.^{12}
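Given an efficient GMM fit, the statistic in (13.46) is immediate to compute; a sketch reusing the moments(·) convention from the sketches above (our own names):

    import numpy as np
    from scipy.stats import chi2

    def hansen_j_test(moments, theta_hat, W_opt, K):
        F = moments(theta_hat)          # T x R moments evaluated at the estimate
        T, R = F.shape
        g = F.mean(axis=0)
        J = T * g @ W_opt @ g           # the statistic in (13.46)
        p_value = chi2.sf(J, df=R - K)  # R - K overidentifying conditions
        return J, p_value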

13.6 Empirical Examples

In this section we present a number of empirical illustrations.

13.6.1 Software Installation

To illustrate the GMM methodology, I have written a small GMM module for OxMetrics. Due to the generality and the non-linearity of GMM, estimation always requires some degree of programming, and the practitioner has to make decisions on the details of the implementation. The OxMetrics module has similarities with the PcGive module for ML estimation, although it is less advanced.

To install the program, unpack the files in GMM_3_x.ZIP to a folder in which you are allowed to write, e.g.

    C:\Econometrics\GMM\    (13.47)

or similar.
^{12} The C test gives a way of implementing misspecification tests. If we have $E(\epsilon_t \mid x_t) = 0$ for identification, homoskedasticity would imply that $E(\epsilon_t^2 \mid x_t)$ is constant, which could be formulated as a moment condition.
as a moment condition.

Figure 13.2: Installation of the GMM module via OxPack.

Inside OxMetrics choose the OxPack module and select [Add/Remove Package...] from the [Package] menu. Now choose [Add], find the file GMM.ox you have just downloaded, and add it. Press [Done] to close the window, see Figure 13.2.

The GMM module should now be running, and you should see the message "GMM 3.x session started..." in the results window.

13.6.2 Simple Estimators

To illustrate the module we first consider some simple examples. We use a data set where $DC_t$ denotes the change in the log of Danish consumption and $DY_t$ denotes the change in the log of disposable income, for the sample 1971:2-2003:2.

OLS Estimation. To implement the OLS regression

$$DC_t = \beta_1 + \beta_2 DY_t + \epsilon_t$$

as a GMM estimator, we note that $w_t = (DC_t, 1, DY_t)'$ are model variables, $z_t = (1, DY_t)'$ are instruments, and the moment conditions are given by

$$E(\epsilon_t z_t) = 0.$$

In the GMM module we first choose the variables as in PcGive and set the relevant variables as model variables and instruments, see Figure 13.3.

Figure 13.3: Selecting variables for the GMM estimation.

We then get a window with options for the estimation, see Figure 13.4, and for the current example we choose Iterated GMM. The choice of weight matrix is not important for the exactly identified estimators, but it will affect the estimated variances. To reproduce the OLS results we choose the i.i.d. weight matrix. Because the regression model is linear, make sure to tick that the model is linear; this avoids the need for any programming.

In the next window, we choose the largest possible sample, 1971(2)-2003(2).

If we had not ticked that the model is linear, we would be asked to code the moment conditions manually, with a little programming. For the programming, the model variables are renamed as the $T \times 1$ column vectors Y[0], Y[1], and Y[2]; the instruments are renamed as the $T \times 1$ vectors Z[0] and Z[1]; while the parameters are denoted {Beta_1}, {Beta_2}, etc. Parameters may be renamed, as long as the names are enclosed in { }. The default in the programming window is the OLS-type moment conditions, see Figure 13.5. The first line defines the $T \times 1$ vector of residuals,

    Uvec = Y[0] - {Beta_1}*Y[1] - {Beta_2}*Y[2];

where element $t$ is just the residual $\epsilon_t = DC_t - \beta_1 - \beta_2 DY_t$. The second line defines the $T \times 2$ matrix of moments,

    Gmat = Uvec .* (Z[0] ~ Z[1]);



Figure 13.4: Options for GMM estimation.

in which row $t$ is given by $f_t' = \epsilon_t \cdot (1, DY_t)$. The notation .* is Ox code for element-by-element multiplication rather than matrix multiplication, and ~ is concatenation of column vectors. The results are reported in the first row of Table 13.1. We note that the criterion function is zero because the model is just identified.

To make the variances robust to heteroskedasticity, we simply redo the analysis with the HC weight matrix. We note that the estimates are the same, but the t-values are slightly different. We can also make the inference robust to autocorrelation by choosing the HAC weight matrix, which again changes the variances. Here we use a Bartlett kernel with $q = 12$ lags, so the weights in (13.44) are given by $w_1 = 11/12 = 0.917$, $w_2 = 10/12 = 0.833$, ..., $w_{11} = 1/12 = 0.083$, while $w_j = 0$ for $j \geq 12$. These GMM-type corrections of the OLS variance are standard in econometric software packages, and they are also available in PcGive.

Two-Stage Least Squares. To implement instrumental variables estimation we assume that $DY_t$ is endogenous and instrument it with the lags $DY_{t-1}, DY_{t-2}, DC_{t-1}, DC_{t-2}$. In this case we change the list of instruments to $z_t = (1, DY_{t-1}, DY_{t-2}, DC_{t-1}, DC_{t-2})'$, and the moment conditions to

    Gmat = Uvec .* (Z[0] ~ Z[1] ~ Z[2] ~ Z[3] ~ Z[4]);

The estimation is now over-identified with $R - K = 3$ over-identifying moment conditions. The optimal GMM estimator in the case of i.i.d. moments is Two-Stage Least Squares, reported in row 4 of Table 13.1.

Figure 13.5: Code for the GMM moment conditions.

Estimator            Weight   beta_1            beta_2            T     J       DF   p-val
OLS, Iterated GMM    IID      0.0024 (0.0016)   0.2100 (0.0611)   129   0.000
OLS, Iterated GMM    HC       0.0024 (0.0015)   0.2100 (0.0921)   129   0.000
OLS, Iterated GMM    HAC      0.0024 (0.0011)   0.2100 (0.1086)   129   0.000
IV, Iterated GMM     IID      0.0037 (0.0019)   0.1340 (0.1957)   127   9.081   3    0.028
IV, Iterated GMM     HC       0.0049 (0.0018)   0.1761 (0.1795)   127   6.018   3    0.111
IV, Iterated GMM     HAC      0.0039 (0.0011)   0.1715 (0.1149)   127   2.581   3    0.462

Table 13.1: GMM estimation of simple models. Standard errors in parentheses. 'IID' denotes independent and identically distributed moments. 'HC' denotes the estimator allowing for heteroskedasticity of the moments. 'HAC' denotes the estimator allowing for heteroskedasticity and autocorrelation; in the implementation of the HAC estimator we allow for autocorrelation of order 12 using the Bartlett kernel. 'DF' is the number of overidentifying moments for the Hansen test, J, and 'p-val' is the corresponding p-value.

The reason that the coefficients change so much is either that $DY_t$ is strongly endogenous and OLS is invalid, or that the instruments are weak. Looking at the first-stage estimates from the two-stage least squares estimation, we get^{13}

$$DY_t = \underset{(2.08)}{0.0047} - \underset{(-3.40)}{0.3212}\, DY_{t-1} - \underset{(-0.578)}{0.0538}\, DY_{t-2} + \underset{(1.58)}{0.2010}\, DC_{t-1} - \underset{(-1.20)}{0.1549}\, DC_{t-2},$$

with t-statistics in parentheses. The coefficient of determination is $R^2 = 0.12$, and the Wald F-test for all coefficients equal to zero is 4.345, producing a p-value of 0.003. This indicates that the identification is formally valid, but the instruments are probably relatively weak.

The weight matrix now plays a role for the optimal GMM estimators, and changing the weight matrix produces different estimators, see Table 13.1.

13.6.3 Optimal Monetary Policy

Now consider a more interesting application where IV and GMM are relevant. Many authors have suggested that monetary policy can be described by a reaction function in which the policy interest rate reacts to the deviation of expected future inflation from a constant target value, and to the output gap, i.e. the deviation of real activity from potential. Let $\pi_t$ denote the current inflation rate over the year before, and let $\pi^*$ denote the constant inflation target of the central bank. Furthermore, let $\tilde{y}_t = y_t - y_t^*$ denote a measure of the output gap. The reaction function for the policy rate $r_t$ can then be written as a simple so-called Taylor rule, see Taylor (1993):

$$r_t = \alpha_0 + \alpha_1 (E(\pi_{t+12} \mid I_t) - \pi^*) + \alpha_2 E(\tilde{y}_t \mid I_t), \tag{13.48}$$

where $\alpha_0$ is interpretable as the target value of $r_t$ in equilibrium. We have assumed that the relevant forecast horizon of the central bank is 12 months, and $E(\pi_{t+12} \mid I_t)$ is the best forecast of inflation one year ahead given the information set of the central bank, $I_t$. The forecast horizon should reflect the lag of the monetary transmission.

The parameter $\alpha_1$ is central in characterizing the behavior of the central bank. If $\alpha_1 > 1$ then the central bank will increase the real interest rate to stabilize inflation, while a reaction $\alpha_1 \leq 1$ is formally inconsistent with inflation stabilization.

The relevant central bank forecasts cannot be observed, and inserting observed values we obtain the model

$$r_t = \alpha_0^* + \alpha_1 \pi_{t+12} + \alpha_2 \tilde{y}_t + u_t, \tag{13.49}$$

where the constant term $\alpha_0^* = \alpha_0 - \alpha_1 \pi^*$ now includes the constant inflation target, $\pi^*$. Also note that the new error term contains the forecast errors:

$$u_t = \alpha_1 [E(\pi_{t+12} \mid I_t) - \pi_{t+12}] + \alpha_2 [E(\tilde{y}_t \mid I_t) - \tilde{y}_t]. \tag{13.50}$$
^{13} The first-stage estimates are obtained using a separate linear regression.

The model in (13.49) is a linear model in (ex post) observed quantities, $\pi_{t+12}$ and $\tilde{y}_t$, but we cannot apply simple linear regression because the error term $u_t$ is correlated with the explanatory variables. If we assume that the forecasts are rational, however, then all variables in the information set of the central bank at time $t$ should be uninformative on the forecast errors, and

$$E(u_t \mid I_t) = 0.$$

This zero conditional expectation implies the unconditional moment conditions

$$E(u_t z_t) = 0, \tag{13.51}$$

for all variables $z_t \in I_t$ included in the information set, and we can estimate the parameters in (13.48) by linear instrumental variables estimation. Using the model formulation, the moment conditions have the form

$$E[u_t z_t] = E[(r_t - \alpha_0^* - \alpha_1 \pi_{t+12} - \alpha_2 \tilde{y}_t) z_t] = 0,$$

for instruments $z_{1t}, ..., z_{Rt}$. We need at least $R = 3$ instruments to estimate the three parameters $\alpha = (\alpha_0^*, \alpha_1, \alpha_2)'$. As instruments we should choose variables that can explain the forecasts $E(\pi_{t+12} \mid I_t)$ and $E(\tilde{y}_t \mid I_t)$ while at the same time being uncorrelated with the disturbance term, $u_t$. Put differently, we could choose variables that the central bank uses in its forecasts, but which it does not react directly upon. As an example, the long-term interest rate is a potential instrument if it is informative on future inflation; but if the central bank reacts directly on movements of the bond rate, then the orthogonality condition in (13.51) is violated and the bond rate should have been included in the reaction function. In a time series model, lagged variables are always possible instruments, but in many cases they are relatively weak and often have to be augmented with other variables.

To illustrate estimation, we consider a data set for US monetary policy under Greenspan, with effective sample 1988:1-2005:8. We use the (average effective) Federal funds rate to measure the policy interest rate, $r_t$, and the CPI inflation rate year-over-year, $inf_t$, to measure $\pi_t$. As a measure of the output gap, $\tilde{y}_t = y_t - y_t^*$, we use the deviation of capacity utilization from its average, $capgap_t$, such that large values imply high activity; and we expect $\alpha_2 > 0$. The time series are illustrated in Figure 13.7. For most of the period the Federal funds rate in (A) seems to be positively related to the capacity utilization in (C). For some periods the effect from inflation is also visible, e.g. around the year 2000, where the temporary interest rate increase seems to be explained by movements in inflation.
To estimate the parameters we choose a set of instruments consisting of a constant term and lagged values of the interest rate, inflation, and capacity utilization. For the presented results we use lags 1-6 plus lags 9 and 12 of all variables:

$$z_t = (1, r_{t-1}, ..., r_{t-6}, r_{t-9}, r_{t-12}, \pi_{t-1}, ..., \pi_{t-6}, \pi_{t-9}, \pi_{t-12}, \tilde{y}_{t-1}, ..., \tilde{y}_{t-6}, \tilde{y}_{t-9}, \tilde{y}_{t-12})'.$$

That gives a total of $R = 25$ moment conditions to estimate the 3 parameters. The formulation window is given in Figure 13.6.
If we assume that the moments are i.i.d., then we can estimate the optimal weight matrix by (13.62), and the GMM estimator again simplifies to two-stage least squares. The estimation results are presented in row (M1) in Table 13.2. We note that $\alpha_1$ is significantly larger than one, indicating inflation stabilization, and there is a significant effect from the capacity utilization, $\alpha_2 > 0$. We have 22 overidentifying moment conditions, and the Hansen test for overidentification of $J = 105$ is distributed as a $\chi^2(22)$ under correct specification. The statistic is much larger than the 5% critical value of 33.9, and we conclude that some of the moment conditions are violated. The values of the Federal funds rate predicted by the reaction function are illustrated in graph (D) together with the actual Federal funds rate. We note that the observed interest rate is much more persistent than the prediction.

Allowing for heteroskedasticity of the moments produces the (iterated GMM) estimates reported in row (M2). These results are by and large identical to the results in row (M1).

The fact that $u_t$ includes a 12-month forecast error will automatically produce autocorrelation, and the optimal weight matrix should allow for autocorrelation up to lag 12. Using a HAC estimator of the weight matrix that allows autocorrelation of order 12 produces the results reported in row (M3). The parameter estimates are not too far from the previous models, although the estimated coefficient on inflation is a bit smaller. It is worth noting that the use of an autocorrelation consistent weight matrix makes the test for overidentification insignificant; the 22 overidentifying conditions are overall accepted for this specification.

Figure 13.6: Moment conditions for the forward-looking monetary policy analysis.

13.6.4 Interest Rate Smoothing

The estimated Taylor rules based on (13.49) are unable to capture the high persistence of the actual Federal funds rate. In the literature, many authors have suggested to reinterpret the Taylor rule as a target value, $r_t^*$, and to model the actual reaction function as a partial adjustment process:

$$r_t^* = \alpha_0 + \alpha_1 E(\pi_{t+12} \mid I_t) + \alpha_2 E(\tilde{y}_t \mid I_t)$$
$$r_t = (1 - \rho)\, r_t^* + \rho\, r_{t-1}.$$

The two equations can be combined to produce

$$r_t = (1 - \rho)\{\alpha_0 + \alpha_1 E(\pi_{t+12} \mid I_t) + \alpha_2 E(\tilde{y}_t \mid I_t)\} + \rho\, r_{t-1},$$

in which the actual interest rate depends on the lagged dependent variable. Replacing again expectations with actual observations, we obtain an empirical model

$$r_t = (1 - \rho)\{\alpha_0 + \alpha_1 \pi_{t+12} + \alpha_2 \tilde{y}_t\} + \rho\, r_{t-1} + u_t, \tag{13.52}$$

where the error term is given by (13.50) with $\alpha_i$ replaced by $\alpha_i(1 - \rho)$ for $i = 1, 2$. The parameters in (13.52), $\theta = (\alpha_0, \alpha_1, \alpha_2, \rho)'$, can be estimated by linear GMM using the conditions in (13.51) with

$$u_t = r_t - (1 - \rho)\{\alpha_0 + \alpha_1 \pi_{t+12} + \alpha_2 \tilde{y}_t\} - \rho\, r_{t-1}.$$

We note that the lagged Federal funds rate, $r_{t-1}$, is included in the information set at time $t$, so even if $r_{t-1}$ is now a model variable, it is still included in the list of instruments; we say that it is an instrument for itself. To estimate using the GMM module, we reformulate the moment conditions as

    Uvec = Y[0] - (1-{rho})*({alpha_0}*Y[1] + {alpha_1}*Y[3] + {alpha_2}*Y[4]) - {rho}*Y[2];

Rows (M4)-(M6) in Table 13.2 report the estimation results for the partial adjustment model (13.52).

        Weight   alpha_0           alpha_1           alpha_2            rho               T     J         DF   p-val
(M1)    IID      0.5529 (0.4476)   1.4408 (0.1441)   0.3938 (0.0401)                      212   105.576   22   0.000
(M2)    HC       0.4483 (0.3652)   1.5133 (0.1124)   0.3747 (0.0302)                      212   54.110    22   0.000
(M3)    HAC      1.1959 (0.7506)   1.3333 (0.2143)   0.3551 (0.0620)                      212   9.883     22   0.987
(M4)    IID      0.6483 (0.6469)   1.2905 (0.2093)   0.7108 (0.0732)    0.9213 (0.0102)   212   42.168    21   0.004
(M5)    HC       1.0957 (0.5923)   1.1881 (0.2052)   0.7254 (0.0561)    0.9240 (0.0094)   212   36.933    21   0.017
(M6)    HAC      0.8355 (0.7314)   1.7385 (0.2459)   1.0714 (0.1558)    0.9284 (0.0108)   212   10.352    21   0.974

Table 13.2: GMM estimation of monetary policy rules for the US. Standard errors in parentheses. 'IID' denotes independent and identically distributed moments. 'HC' denotes the estimator allowing for heteroskedasticity of the moments. 'HAC' denotes the estimator allowing for heteroskedasticity and autocorrelation; in the implementation of the HAC estimator we allow for autocorrelation of order 12 using the Bartlett kernel. 'DF' is the number of overidentifying moments for the Hansen test, J, and 'p-val' is the corresponding p-value.

Allowing for interest rate smoothing changes the estimated parameters somewhat. We first note that the sensitivity to the business cycle, $\alpha_2$, is markedly increased, to values in the range 3/4 to 1. The sensitivity to future inflation, $\alpha_1$, depends more on the choice of weight matrix, ranging now from about 1.2 to 1.7. We also note that the interest rate smoothing is very important: the coefficient on $r_{t-1}$ is very close to one, and the coefficient on the new information in $r_t^*$ is below 1/10. The predicted values are presented in graph (D), now capturing most of the persistence.

A coefficient on the lagged interest rate close to unity could reflect that the time series for $r_t$ is very close to behaving as a unit root process. If this is the case, then the tools presented here would not be valid, as Assumption 13.1 would be violated. This case is beyond the scope of this section.

Figure 13.7: Estimating reaction functions for US monetary policy for the Greenspan period. Panels: (A) Federal funds rate; (B) Inflation; (C) Capacity utilization; (D) Actual Federal funds rate together with the predicted values from models (M1) and (M6).

13.6.5 The C-CAPM Model

To illustrate a non-linear GMM estimation, we consider the (consumption-based) capital asset pricing (C-CAPM) model of Hansen and Singleton (1982). A representative agent is assumed to choose an optimal consumption path, $c_t, c_{t+1}, ...$, by maximizing the present discounted value of lifetime utility, i.e.

$$\max E\left( \sum_{s=0}^{\infty} \delta^s u(c_{t+s}) \,\Big|\, I_t \right),$$

where $u(c_{t+s})$ is the utility of consumption, $0 \leq \delta \leq 1$ is a discount factor, and $I_t$ is the information set at time $t$. The consumer can change the path of consumption relative to income by investing in a financial asset. Let $A_t$ denote the financial wealth at the end of period $t$ and let $r_t$ be the implicit interest rate of the financial position. Then a feasible consumption path must obey the budget constraint

$$A_{t+1} = (1 + r_{t+1}) A_t + y_{t+1} - c_{t+1},$$

where $y_t$ denotes labour income. The first-order condition for this problem is given by

$$u'(c_t) = \delta E(u'(c_{t+1}) R_{t+1} \mid I_t),$$

where $u'(\cdot)$ is the derivative of the utility function, and $R_{t+1} = 1 + r_{t+1}$ is the return factor.

To put more structure on the model, we assume a constant relative risk aversion (CRRA) utility function,

$$u(c_t) = \frac{c_t^{1-\gamma}}{1-\gamma}, \quad \gamma < 1,$$

with first derivative $u'(c_t) = c_t^{-\gamma}$. This formulation gives the explicit Euler equation:

$$c_t^{-\gamma} - \delta E(c_{t+1}^{-\gamma} R_{t+1} \mid I_t) = 0,$$

or alternatively

$$E\left( \delta \left( \frac{c_{t+1}}{c_t} \right)^{-\gamma} R_{t+1} - 1 \,\Big|\, I_t \right) = 0. \tag{13.53}$$

The zero conditional expectation in (13.53) implies the unconditional moment conditions

$$E[f(c_{t+1}, c_t, R_{t+1}, z_t, \delta, \gamma)] = E\left[ \left( \delta \left( \frac{c_{t+1}}{c_t} \right)^{-\gamma} R_{t+1} - 1 \right) z_t \right] = 0, \tag{13.54}$$

for all variables $z_t \in I_t$ included in the information set. The economic interpretation is that under rational expectations, a variable in the information set must be uncorrelated with the expectation error.

We recognize (13.54) as a set of moment conditions of a non-linear instrumental variables model. Since we have two parameters to estimate, $\theta = (\delta, \gamma)'$, we need at least $R = 2$ instruments in $z_t$ to identify $\theta$. Note that the specification is fully theory driven, it is non-linear, and it is not in a regression format. Moreover, the parameters we estimate are the "deep" parameters of the optimization problem.

To estimate the deep parameters, we have to choose a set of instruments, $z_t$. Possible instruments could be variables from the joint history of the model variables, and here we take the $3 \times 1$ vector:

$$z_t = \left( 1, \frac{c_t}{c_{t-1}}, R_t \right)'.$$

This choice would correspond to the three moment conditions

$$E\left[ \delta \left( \frac{c_{t+1}}{c_t} \right)^{-\gamma} R_{t+1} - 1 \right] = 0$$
$$E\left[ \left( \delta \left( \frac{c_{t+1}}{c_t} \right)^{-\gamma} R_{t+1} - 1 \right) \frac{c_t}{c_{t-1}} \right] = 0$$
$$E\left[ \left( \delta \left( \frac{c_{t+1}}{c_t} \right)^{-\gamma} R_{t+1} - 1 \right) R_t \right] = 0$$

Figure 13.8: Programming the GMM moment conditions.

for $t = 1, 2, ..., T$, but we could also extend with more lags.

To illustrate the procedures, we use a data set similar to Hansen and Singleton (1982), consisting of monthly data for real consumption growth, $c_t/c_{t-1}$, and the real return on stocks, $R_t$, for the US 1959:3-1978:12. We take $w_t = (c_t/c_{t-1}, R_t)'$ as model variables and $z_t = (1, c_{t-1}/c_{t-2}, R_{t-1})'$ as instruments, and use a formulation with

    Uvec = ({delta}*(Y[0].^(-{gamma})).*Y[1] - 1);
    Gmat = Uvec .* (Z[0] ~ Z[1] ~ Z[2]);

where .^ is the Ox operator for element-by-element power.
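For readers without the OxMetrics module, the same moment matrix can be coded generically; a sketch under the assumption that cgrowth holds consumption growth, R the return factor, and Z the $T \times 3$ instrument matrix, all aligned in time (our own names):

    import numpy as np

    def ccapm_moments(theta, cgrowth, R, Z):
        # Rows f_t = (delta * cgrowth_t^(-gamma) * R_t - 1) * z_t, as in (13.54).
        delta, gamma = theta
        u = delta * cgrowth ** (-gamma) * R - 1.0   # Euler-equation error
        return u[:, None] * Z                       # T x R matrix of moments

This $T \times R$ matrix can be passed to the iterated_gmm and hansen_j_test sketches given earlier.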


Rows (N1) (N3) in Table 13.3 report the estimation results for the nonlinear
instrumental variable model where the weight matrix allows for heteroskedasticity
of the moments. The models are estimated with, respectively, the two-step e¢ cient
GMM estimator, the iterated GMM estimator, and the continuously updated GMM
estimator; and the results are by and large identical. The discount factor is esti-
mated to be very close to unity, and the standard errors are relatively small. The
coe¢ cient of relative risk aversion, , on the other hand, is very poorly estimated,
with very large standard errors. For the iterated GMM estimation in model (N2)
the estimate is 1:0249 with a disappointing 95% con…dence interval of [ 2:70; 4:75].
We note that the Hansen test for the single overidentifying condition does not reject
correct speci…cation.
Rows (N4) (N6) report estimation results for models where the weight matrix is
robust to heteroskedasticity and autocorrelation. The results are basically unchanged.
We conclude that the used data set is not informative enough to empirically
identify the coe¢ cient of relative risk aversion, . One explanation could be that the
economic model is in fact correct, but that we need stronger instruments to identify

        Estimator   Weight   Lags   delta             gamma             T     J       DF   p-val
(N1)    2-Step      HC       1      0.9987 (0.0086)   0.8770 (3.6792)   237   0.434   1    0.510
(N2)    Iterated    HC       1      0.9982 (0.0044)   1.0249 (1.8614)   237   1.068   1    0.301
(N3)    CU          HC       1      0.9981 (0.0044)   0.9549 (1.8629)   237   1.067   1    0.302
(N4)    2-Step      HAC      1      0.9987 (0.0092)   0.8876 (4.0228)   237   0.429   1    0.513
(N5)    Iterated    HAC      1      0.9980 (0.0045)   0.8472 (1.8757)   237   1.091   1    0.296
(N6)    CU          HAC      1      0.9977 (0.0045)   0.7093 (1.8815)   237   1.086   1    0.297
(N7)    2-Step      HC       2      0.9975 (0.0066)   0.0149 (2.6415)   236   1.597   3    0.660
(N8)    Iterated    HC       2      0.9968 (0.0045)   0.0210 (1.7925)   236   3.579   3    0.311
(N9)    CU          HC       2      0.9958 (0.0046)   0.5526 (1.8267)   236   3.501   3    0.321
(N10)   2-Step      HAC      2      0.9970 (0.0068)   0.1872 (2.7476)   236   1.672   3    0.643
(N11)   Iterated    HAC      2      0.9965 (0.0047)   0.2443 (1.8571)   236   3.685   3    0.298
(N12)   CU          HAC      2      0.9952 (0.0048)   0.9094 (1.9108)   236   3.591   3    0.309

Table 13.3: Estimated Euler equations for the C-CAPM model. Standard errors in parentheses. '2-Step' denotes the two-step efficient GMM estimator, where the initial weight matrix is a unit matrix. 'Iterated' denotes the iterated GMM estimator. 'CU' denotes the continuously updated GMM estimator. 'Lags' is the number of lags in the instrument vector. 'DF' is the number of overidentifying moments for the Hansen test, J, and 'p-val' is the corresponding p-value.

the parameter. One possible solution is to extend the instrument list with more lags,

$$z_t = \left( 1, \frac{c_t}{c_{t-1}}, \frac{c_{t-1}}{c_{t-2}}, R_t, R_{t-1} \right)',$$

but the results in rows (N7)-(N12) indicate that more lags do not improve the estimates. We could try to improve the model by searching for more instruments, but that is beyond the scope of this example. A second possibility is that the economic model is not a good representation of the data. Some authors have suggested to extend the model to allow for habit formation in the Euler equation, but that is also beyond the scope of this section. A third possibility is that there is not enough variation in the data to identify the shape of the non-linear function in (13.54). In the data set it holds that $c_{t+1}/c_t$ and $R_{t+1}$ are close to unity. If the variance is small, it holds approximately that

$$\delta \left( \frac{c_{t+1}}{c_t} \right)^{-\gamma} R_{t+1} - 1 \approx \delta \cdot (1)^{-\gamma} \cdot 1 - 1,$$

which is equal to zero with a discount factor of $\delta = 1$ and (virtually) any value for $\gamma$.

13.7 Further Readings

A short and non-technical presentation of the GMM principle and applications in cross-sectional and time series models is given in Wooldridge (2001). The first applications of the methodology are found in Hansen and Singleton (1982) and Hansen and Singleton (1983), both based on a C-CAPM model. Many journal articles use the same framework, and according to the Social Sciences Citation Index the first paper is cited more than 500 times.

All presentations of the underlying theory are very technical. The textbook by Hayashi (2000) uses GMM as the organizing principle, and the first chapters of that book are readable. The asymptotic theory was first presented in Hansen (1982). The theory of GMM is also covered in the book edited by Mátyás (1999), which also contains many extensions, e.g. to non-stationary time series. Technical details on the estimation of HAC covariance matrices are given in Newey and West (1987) and Andrews (1991).

Appendices:

13.A Quasi-Maximum-Likelihood Estimation

The quasi-maximum likelihood estimator can also be seen as an example of GMM, using the score equations as moment conditions. In this case, the robust QMLE variance is obtained automatically.

Consider a log-likelihood function given by

$$\log L_T(\theta) = \sum_{t=1}^{T} \log \ell(\theta \mid y_t),$$

where $\ell(\theta \mid y_t)$ is the likelihood contribution for observation $t$ given the data. First-order conditions for the maximum likelihood estimator, $\hat{\theta}$, are given by the likelihood equations,

$$S_T(\hat{\theta}) = \sum_{t=1}^{T} s_t(\hat{\theta}) = \sum_{t=1}^{T} \frac{\partial \log \ell_t(\hat{\theta})}{\partial \theta} = 0.$$

Now note that these equations can be seen as a set of $K$ sample moment conditions,

$$g_T(\hat{\theta}) = T^{-1} \sum_{t=1}^{T} s_t(\hat{\theta}) = T^{-1} \sum_{t=1}^{T} \frac{\partial \log \ell_t(\hat{\theta})}{\partial \theta} = 0, \tag{13.55}$$

to which $\hat{\theta}_{ML}$ is the unique MM solution. The population moment conditions corresponding to the sample moments in (13.55) are given by

$$g(\theta_0) = E[s_t(\theta_0)] = 0, \tag{13.56}$$

where $s_t(\theta) = f(y_t, \theta)$ in the GMM notation.

The MM estimator, $\hat{\theta}_{MM}$, is the unique solution to (13.55), and it is known to be a consistent estimator of $\theta_0$ as long as the population moment conditions in (13.56) are true. This implies that even if the likelihood function, $\log L_T(\theta)$, is misspecified, the MM or QML estimator is consistent as long as the moment conditions (13.56) are satisfied. This shows that the ML estimator can be consistent even if the likelihood function is misspecified; we may say that the likelihood analysis shows some robustness to the specification.

It follows from the properties of GMM that the asymptotic variance of the QML estimator is not the inverse information, but is given by the robust form $(D' S^{-1} D)^{-1}$ from (13.29). Under correct specification of the likelihood function this expression can be shown to simplify to the inverse information, $I(\theta_0)^{-1}$. In a given application where we think that the likelihood function is potentially misspecified, it may be a good idea to base inference on the robust QML variance rather than the ML variance, and it is often a good idea to compare the two variances. A big difference may suggest that the likelihood function is misspecified.

13.B Linear IV Estimation and 2SLS

In this section we go through some of the details of GMM estimation for a linear regression model. The simplest case of the OLS estimator was considered in Example 13.2. Here we begin by restating the case of an exactly identified IV estimator, also considered in Example 13.5; we then extend to overidentified cases.

Exact Identification

Consider again the case considered in Example 13.4, i.e. a partitioned regression

$$y_t = x_{1t}'\beta_0 + x_{2t}'\gamma_0 + \epsilon_t, \quad t = 1, 2, ..., T,$$

where

$$E[x_{1t}\epsilon_t] = 0 \quad (K_1 \times 1) \tag{13.57}$$
$$E[x_{2t}\epsilon_t] \neq 0 \quad (K_2 \times 1). \tag{13.58}$$

The $K_1$ variables in $x_{1t}$ are predetermined, while the $K_2 = K - K_1$ variables in $x_{2t}$ are endogenous.

To obtain identification of the parameters we assume that there exist $K_2$ new variables, $z_{2t}$, that are correlated with $x_{2t}$ but uncorrelated with the errors:

$$E[z_{2t}\epsilon_t] = 0. \tag{13.59}$$

Using the notation

$$\underset{(K \times 1)}{x_t} = \begin{pmatrix} x_{1t} \\ x_{2t} \end{pmatrix}, \quad \underset{(K \times 1)}{z_t} = \begin{pmatrix} x_{1t} \\ z_{2t} \end{pmatrix} \quad \text{and} \quad \theta_0 = \begin{pmatrix} \beta_0 \\ \gamma_0 \end{pmatrix},$$

we have $K$ moment conditions:

$$g(\theta_0) = E[z_t\epsilon_t] = E[z_t(y_t - x_t'\theta_0)] = 0,$$

where $\epsilon_t = y_t - x_t'\theta_0$ is the error term from the linear regression model.

We can write the corresponding sample moment conditions as

$$g_T(\hat{\theta}) = \frac{1}{T}\sum_{t=1}^{T} z_t (y_t - x_t'\hat{\theta}) = \frac{1}{T} Z'(Y - X\hat{\theta}) = 0, \tag{13.60}$$

where capital letters denote the usual stacked matrices,

$$\underset{(T \times 1)}{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{pmatrix}, \quad \underset{(T \times K)}{X} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_T' \end{pmatrix}, \quad \text{and} \quad \underset{(T \times K)}{Z} = \begin{pmatrix} z_1' \\ z_2' \\ \vdots \\ z_T' \end{pmatrix}.$$

The MM estimator is the unique solution:

$$\hat{\theta}_{MM} = \left( \sum_{t=1}^{T} z_t x_t' \right)^{-1} \sum_{t=1}^{T} z_t y_t = (Z'X)^{-1} Z'Y,$$

provided that the $K \times K$ matrix $Z'X$ can be inverted. We note that if the number of new instruments equals the number of endogenous variables, then the GMM estimator coincides with the simple IV estimator.

Overidentification

Now assume that we want to introduce more instruments, and let $z_t = (x_{1t}', z_{2t}')'$ be an $R \times 1$ vector with $R > K$. In this case $Z'X$ is no longer invertible and the MM estimator does not exist. Now we have $R$ moments,

$$g_T(\theta) = \frac{1}{T}\sum_{t=1}^{T} z_t (y_t - x_t'\theta) = \frac{1}{T} Z'(Y - X\theta),$$

and we cannot solve $g_T(\theta) = 0$ directly. Instead, we derive the GMM estimator by minimizing the criterion function

$$Q_T(\theta) = g_T(\theta)' W_T\, g_T(\theta) = (T^{-1} Z'(Y - X\theta))' W_T (T^{-1} Z'(Y - X\theta)) = T^{-2}(Y'ZW_TZ'Y - 2\theta'X'ZW_TZ'Y + \theta'X'ZW_TZ'X\theta),$$

for some weight matrix $W_T$. We take the first derivative, and the GMM estimator is the solution to the $K$ equations

$$\frac{\partial Q_T(\theta)}{\partial \theta} = -2T^{-2} X'ZW_TZ'Y + 2T^{-2} X'ZW_TZ'X\theta = 0,$$

that is,

$$\hat{\theta}_{GMM}(W_T) = (X'ZW_TZ'X)^{-1} X'ZW_TZ'Y.$$

The estimator depends on the weight matrix, $W_T$. To estimate the optimal weight matrix, $W_T^{opt} = S_T^{-1}$, we use the estimator in (13.32), that is,

$$S_T = \frac{1}{T}\sum_{t=1}^{T} f(w_t, z_t, \hat{\theta}) f(w_t, z_t, \hat{\theta})' = \frac{1}{T}\sum_{t=1}^{T} \hat{\epsilon}_t^2 z_t z_t', \tag{13.61}$$

which allows for general heteroskedasticity of the disturbance term. The efficient GMM estimator is given by

$$\hat{\theta}_{GMM} = (X'ZS_T^{-1}Z'X)^{-1} X'ZS_T^{-1}Z'Y,$$

where we note that any scale factor in the weight matrix, e.g. $T^{-1}$, cancels.

For the asymptotic distribution, we recall that

$$\hat{\theta}_{GMM} \overset{a}{\sim} N(\theta_0, T^{-1}(D_T'S_T^{-1}D_T)^{-1}).$$

The derivative is given by

$$\underset{(R \times K)}{D_T} = \frac{\partial g_T(\theta_0)}{\partial \theta'} = \frac{\partial}{\partial \theta'}\left( \frac{1}{T}\sum_{t=1}^{T} z_t (y_t - x_t'\theta_0) \right) = -\frac{1}{T}\sum_{t=1}^{T} z_t x_t',$$

so the variance of the estimator becomes

$$V(\hat{\theta}_{GMM}) = T^{-1}(D_T' W_T^{opt} D_T)^{-1} = T^{-1}\left( \left( T^{-1}\sum_{t=1}^{T} x_t z_t' \right) \left( T^{-1}\sum_{t=1}^{T} \hat{\epsilon}_t^2 z_t z_t' \right)^{-1} \left( T^{-1}\sum_{t=1}^{T} z_t x_t' \right) \right)^{-1}$$
$$= \left( \sum_{t=1}^{T} x_t z_t' \left( \sum_{t=1}^{T} \hat{\epsilon}_t^2 z_t z_t' \right)^{-1} \sum_{t=1}^{T} z_t x_t' \right)^{-1}.$$

We recognize this expression as the heteroskedasticity consistent (HC) variance estimator of White. Using GMM with the allowance for heteroskedastic errors will thus automatically produce heteroskedasticity-consistent standard errors.
If we assume that the error terms are i.i.d., then the optimal weight matrix in (13.61) simplifies to

$$S_T = \frac{\hat{\sigma}^2}{T}\sum_{t=1}^{T} z_t z_t' = T^{-1}\hat{\sigma}^2 Z'Z, \tag{13.62}$$

where $\hat{\sigma}^2$ is a consistent estimator for $\sigma^2$. In this case the efficient GMM estimator becomes

$$\hat{\theta}_{GMM} = (X'ZS_T^{-1}Z'X)^{-1} X'ZS_T^{-1}Z'Y = \left( X'Z(Z'Z)^{-1}Z'X \right)^{-1} X'Z(Z'Z)^{-1}Z'Y,$$

since the scale factor $T^{-1}\hat{\sigma}^2$ cancels. Notice that this efficient GMM estimator is identical to the generalized IV estimator and the two-stage least squares (2SLS) estimator. This shows that the 2SLS estimator is the efficient GMM estimator if the error terms are i.i.d. The variance of the estimator is

$$V(\hat{\theta}_{GMM}) = T^{-1}(D_T'S_T^{-1}D_T)^{-1} = \hat{\sigma}^2 (X'Z(Z'Z)^{-1}Z'X)^{-1},$$

which again coincides with the 2SLS variance.
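A sketch of this 2SLS/efficient-GMM formula on simulated data (our own setup):

    import numpy as np

    rng = np.random.default_rng(4)
    T = 1000
    z2 = rng.normal(size=(T, 2))                         # two instruments, R = 3 > K = 2
    e = rng.normal(size=T)
    x2 = z2 @ np.array([0.7, 0.4]) + 0.5 * e + rng.normal(size=T)
    y = 1.0 + 0.5 * x2 + e

    X = np.column_stack([np.ones(T), x2])
    Z = np.column_stack([np.ones(T), z2])

    # 2SLS: (X'Z (Z'Z)^{-1} Z'X)^{-1} X'Z (Z'Z)^{-1} Z'Y, via first-stage fitted values
    Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    theta_2sls = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)
    print(theta_2sls)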


Chapter 14

Introduction to Vector and Matrix Differentiation

In this appendix we expand on Verbeek (2017) on matrix differentiation. We first present the conventions for derivatives of scalar and vector functions; then we present the derivatives of a number of special functions particularly useful in econometrics; and, finally, we apply the ideas to derive the ordinary least squares (OLS) estimator in a linear regression model. It should be emphasized that this appendix is cursory reading; the particular results needed in this course are indicated with a (*).

14.1 Conventions for Scalar Functions

Let $\beta = (\beta_1, ..., \beta_k)'$ be a $k \times 1$ vector and let $f(\beta) = f(\beta_1, ..., \beta_k)$ be a real-valued function that depends on $\beta$, i.e. $f(\beta): \mathbb{R}^k \mapsto \mathbb{R}$ maps the vector into a single number, $f(\beta)$. Then the derivative of $f(\beta)$ with respect to $\beta$ is defined as

$$\frac{\partial f(\beta)}{\partial \beta} = \begin{pmatrix} \frac{\partial f(\beta)}{\partial \beta_1} \\ \vdots \\ \frac{\partial f(\beta)}{\partial \beta_k} \end{pmatrix}. \tag{14.1}$$

This is a $k \times 1$ column vector with typical element given by the partial derivative $\frac{\partial f(\beta)}{\partial \beta_i}$. Sometimes this vector is referred to as the gradient. It is useful to remember that the derivative of a scalar function with respect to a column vector gives a column vector as the result.^{14}

Heino Bohn Nielsen, University of Copenhagen, September 1, 2023.

^{14} Note that Wooldridge (2006, p. 815) does not follow this convention, and lets $\frac{\partial f(\beta)}{\partial \beta}$ be a row vector.
Similarly, the derivative of a scalar function with respect to a row vector yields the $1 \times k$ row vector

$$\frac{\partial f(\beta)}{\partial \beta'} = \begin{pmatrix} \frac{\partial f(\beta)}{\partial \beta_1} & \cdots & \frac{\partial f(\beta)}{\partial \beta_k} \end{pmatrix}.$$

14.2 Conventions for Vector Functions


Now let 0 1
g1 ( )
B C
g( ) = @ ... A
gn ( )
be a vector function depending on = ( 1 ; :::; k )0 , i.e. g( ) : Rk 7 ! Rn maps the
k 1 vector into a n 1 vector, where gi ( ) = gi ( 1 ; :::; k ), i = 1; 2; :::; n, is a
real-valued function.
Since g( ) is a column vector it is natural to consider the derivative with respect
to a row vector, 0 , i.e.
0 1
@g1 ( ) @g1 ( )
@ 1 @ k
@g( ) B .. .. .. C
=B
@ . . . C;
A (14.2)
@ 0 @gn ( ) @gn ( )
@ 1 @ k

where each row, i = 1; 2; :::; n, contains the derivative of the scalar function gi ( ) with
respect to the elements in . The result is therefore a n k matrix of derivatives
with typical element (i; j) given by @g@i ( ) . If the vector function is de…ned as a row
j
vector, it is natural to take the derivative with respect to the column vector, .
We can note that it holds in general that
0
@ (g( )0 ) @g( )
= , (14.3)
@ @ 0
which in the case above is a k n matrix.
Applying the conventions in (14.1) and (14.2) we can define the Hessian matrix
of second derivatives of a scalar function $f(\beta)$ as
$$\frac{\partial^2 f(\beta)}{\partial \beta \partial \beta'} = \frac{\partial}{\partial \beta'} \left( \frac{\partial f(\beta)}{\partial \beta} \right) = \begin{pmatrix} \frac{\partial^2 f(\beta)}{\partial \beta_1 \partial \beta_1} & \cdots & \frac{\partial^2 f(\beta)}{\partial \beta_1 \partial \beta_k} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f(\beta)}{\partial \beta_k \partial \beta_1} & \cdots & \frac{\partial^2 f(\beta)}{\partial \beta_k \partial \beta_k} \end{pmatrix},$$
which is a $k \times k$ matrix with typical element $(i, j)$ given by the second derivative
$\frac{\partial^2 f(\beta)}{\partial \beta_i \partial \beta_j}$. Note that it does not matter if we first take the derivative with respect to the
column or the row.
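A corresponding sketch (again my own, with an arbitrary example function, assuming numpy) approximates the Hessian by central differences and confirms that it matches the analytic matrix and is symmetric:

```python
# Sketch: finite-difference Hessian of an illustrative scalar function
# f(beta) = beta1^2*beta2 + exp(beta2), compared with the analytic matrix.
import numpy as np

def f(beta):
    return beta[0] ** 2 * beta[1] + np.exp(beta[1])

def hessian_fd(f, beta, h=1e-4):
    # central-difference approximation to the k x k Hessian
    k, I = beta.size, np.eye(beta.size)
    H = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            H[i, j] = (f(beta + h * I[i] + h * I[j]) - f(beta + h * I[i] - h * I[j])
                       - f(beta - h * I[i] + h * I[j]) + f(beta - h * I[i] - h * I[j])) / (4 * h * h)
    return H

beta = np.array([1.0, 0.5])
H_exact = np.array([[2 * beta[1], 2 * beta[0]],
                    [2 * beta[0], np.exp(beta[1])]])
print(np.allclose(hessian_fd(f, beta), H_exact, atol=1e-5))   # True, and symmetric
```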

14.3 Some Special Functions


First, let $c$ be a $k \times 1$ vector and let $\beta$ be a $k \times 1$ vector of parameters. Next define
the scalar function $f(\beta) = c'\beta$, which maps the $k$ parameters into a single number.
It holds that
$$\frac{\partial (c'\beta)}{\partial \beta} = c. \tag{13.4*}$$
To see this, we can write the function as
$$f(\beta) = c'\beta = c_1 \beta_1 + c_2 \beta_2 + ... + c_k \beta_k.$$
Taking the derivative with respect to $\beta$ yields
$$\frac{\partial f(\beta)}{\partial \beta} = \begin{pmatrix} \frac{\partial (c_1 \beta_1 + c_2 \beta_2 + ... + c_k \beta_k)}{\partial \beta_1} \\ \vdots \\ \frac{\partial (c_1 \beta_1 + c_2 \beta_2 + ... + c_k \beta_k)}{\partial \beta_k} \end{pmatrix} = \begin{pmatrix} c_1 \\ \vdots \\ c_k \end{pmatrix} = c,$$
which is a $k \times 1$ vector as expected. Also note that since $c'\beta = \beta'c$, it holds that
$$\frac{\partial (\beta' c)}{\partial \beta} = c. \tag{13.5*}$$
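The result can also be confirmed symbolically; the sketch below is my own illustration, assumes the sympy library, and uses $k = 3$:

```python
# Sketch: symbolic check of (13.4*)/(13.5*), d(c'beta)/dbeta = c.
import sympy as sp

beta = sp.Matrix(sp.symbols('beta1:4'))   # (beta1, beta2, beta3)'
c = sp.Matrix(sp.symbols('c1:4'))
f = (c.T * beta)[0, 0]                    # scalar function c'beta
grad = sp.Matrix([sp.diff(f, b) for b in beta])   # column-vector convention
print(grad == c)                          # True
```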
Now, let $A$ be an $n \times k$ matrix and let $\beta$ be a $k \times 1$ vector of parameters. Furthermore
define the vector function $g(\beta) = A\beta$, which maps the $k$ parameters into $n$ function
values. $g(\beta)$ is an $n \times 1$ vector and the derivative with respect to $\beta'$ is the $n \times k$ matrix
given by
$$\frac{\partial (A\beta)}{\partial \beta'} = A. \tag{13.6*}$$
To see this, write the function as
$$g(\beta) = A\beta = \begin{pmatrix} A_{11} \beta_1 + A_{12} \beta_2 + ... + A_{1k} \beta_k \\ \vdots \\ A_{n1} \beta_1 + A_{n2} \beta_2 + ... + A_{nk} \beta_k \end{pmatrix},$$
and find the derivative
$$\frac{\partial g(\beta)}{\partial \beta'} = \begin{pmatrix} \frac{\partial (A_{11} \beta_1 + ... + A_{1k} \beta_k)}{\partial \beta_1} & \cdots & \frac{\partial (A_{11} \beta_1 + ... + A_{1k} \beta_k)}{\partial \beta_k} \\ \vdots & \ddots & \vdots \\ \frac{\partial (A_{n1} \beta_1 + ... + A_{nk} \beta_k)}{\partial \beta_1} & \cdots & \frac{\partial (A_{n1} \beta_1 + ... + A_{nk} \beta_k)}{\partial \beta_k} \end{pmatrix} = \begin{pmatrix} A_{11} & \cdots & A_{1k} \\ \vdots & \ddots & \vdots \\ A_{n1} & \cdots & A_{nk} \end{pmatrix} = A.$$
Similarly, if we consider the transposed function, $g(\beta)' = \beta' A'$, which is a $1 \times n$ row
vector, we can find the $k \times n$ matrix of derivatives as
$$\frac{\partial (\beta' A')}{\partial \beta} = A'. \tag{13.7*}$$

This is just an application of the result in (14.3).
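A symbolic check of (13.6*), with (13.7*) following by transposition, for a generic $2 \times 3$ matrix $A$ (my own illustration, assuming sympy):

```python
# Sketch: the Jacobian of g(beta) = A*beta with respect to beta' is A itself.
import sympy as sp

A = sp.Matrix(2, 3, lambda i, j: sp.Symbol(f'a{i + 1}{j + 1}'))
beta = sp.Matrix(sp.symbols('beta1:4'))
g = A * beta                              # 2 x 1 vector function
print(g.jacobian(beta) == A)              # True
```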


Finally, consider a quadratic function $f(\beta) = \beta' V \beta$ for some $k \times k$ matrix $V$. This
function maps the $k$ parameters into a single number. Here we find the derivative
as the $k \times 1$ column vector
$$\frac{\partial (\beta' V \beta)}{\partial \beta} = (V + V')\beta, \tag{13.8*}$$
or the row variant
$$\frac{\partial (\beta' V \beta)}{\partial \beta'} = \beta'(V + V'). \tag{13.9*}$$
If $V$ is symmetric this reduces to $2V\beta$ and $2\beta'V$, respectively. To see how this works,
consider the simple case $k = 3$ and write the function as
$$\begin{aligned}
\beta' V \beta &= \begin{pmatrix} \beta_1 & \beta_2 & \beta_3 \end{pmatrix} \begin{pmatrix} V_{11} & V_{12} & V_{13} \\ V_{21} & V_{22} & V_{23} \\ V_{31} & V_{32} & V_{33} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} \\
&= V_{11}\beta_1^2 + V_{22}\beta_2^2 + V_{33}\beta_3^2 + (V_{12} + V_{21})\beta_1\beta_2 + (V_{13} + V_{31})\beta_1\beta_3 + (V_{23} + V_{32})\beta_2\beta_3.
\end{aligned}$$
Taking the derivative with respect to $\beta$, we get
$$\begin{aligned}
\frac{\partial (\beta' V \beta)}{\partial \beta} &= \begin{pmatrix} \frac{\partial (\beta' V \beta)}{\partial \beta_1} \\ \frac{\partial (\beta' V \beta)}{\partial \beta_2} \\ \frac{\partial (\beta' V \beta)}{\partial \beta_3} \end{pmatrix} = \begin{pmatrix} 2V_{11}\beta_1 + (V_{12} + V_{21})\beta_2 + (V_{13} + V_{31})\beta_3 \\ 2V_{22}\beta_2 + (V_{12} + V_{21})\beta_1 + (V_{23} + V_{32})\beta_3 \\ 2V_{33}\beta_3 + (V_{13} + V_{31})\beta_1 + (V_{23} + V_{32})\beta_2 \end{pmatrix} \\
&= \begin{pmatrix} 2V_{11} & V_{12} + V_{21} & V_{13} + V_{31} \\ V_{12} + V_{21} & 2V_{22} & V_{23} + V_{32} \\ V_{13} + V_{31} & V_{23} + V_{32} & 2V_{33} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} \\
&= \left( \begin{pmatrix} V_{11} & V_{12} & V_{13} \\ V_{21} & V_{22} & V_{23} \\ V_{31} & V_{32} & V_{33} \end{pmatrix} + \begin{pmatrix} V_{11} & V_{21} & V_{31} \\ V_{12} & V_{22} & V_{32} \\ V_{13} & V_{23} & V_{33} \end{pmatrix} \right) \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} \\
&= (V + V')\beta.
\end{aligned}$$
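Numerically, the same result can be seen with a deliberately non-symmetric $V$; the sketch below (my own, assuming numpy) compares a finite-difference gradient with $(V + V')\beta$:

```python
# Sketch: the gradient of the quadratic form beta'V beta equals (V + V')beta.
import numpy as np

rng = np.random.default_rng(2)
k = 3
V = rng.normal(size=(k, k))               # deliberately not symmetric
beta = rng.normal(size=k)

f = lambda b: b @ V @ b
h = 1e-6
grad = np.array([(f(beta + h * np.eye(k)[i]) - f(beta - h * np.eye(k)[i])) / (2 * h)
                 for i in range(k)])
print(np.allclose(grad, (V + V.T) @ beta))   # True
```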

14.4 The Linear Regression Model


To illustrate the use of matrix differentiation consider the linear regression model
$$y_t = x_t'\beta + \epsilon_t, \quad t = 1, 2, ..., T,$$

where yt and t are scalars, xt 2 Rk , and 2 Rk . Here k is the number of explanatory


variables and T is the number of observations. We may also write this using matrix
notation,
Y =X + ,
where Y is a T 1 vector of stacked left-hand-side variables, X is a T k matrix of
explanatory variables, and is a T 1 vector of error terms, i.e.
0 1 0 0 1 0 1
y1 x1 1
B .. C B .. C B .. C
B . C B . C B . C
B C B C B C
B C B 0 C
Y = B yt C ; X = B xt C ; and =B C:
B t C
B .. C B .. C B .. C
@ . A @ . A @ . A
0
yT xT T

One way to motivate the ordinary least squares (OLS) principle is to choose the
estimator, $\hat\beta$, as the value of $\beta$ that minimizes the sum of squared residuals, i.e.
$$\hat\beta = \arg\min_{\beta} \sum_{t=1}^{T} \epsilon_t^2 = \arg\min_{\beta} \epsilon'\epsilon.$$

Looking at the function to be minimized, we find that
$$\begin{aligned}
\epsilon'\epsilon &= (Y - X\beta)'(Y - X\beta) \\
&= (Y' - \beta'X')(Y - X\beta) \\
&= Y'Y - Y'X\beta - \beta'X'Y + \beta'X'X\beta \\
&= Y'Y - 2Y'X\beta + \beta'X'X\beta,
\end{aligned}$$
where the last line uses the fact that $Y'X\beta$ and $\beta'X'Y$ are scalar variables, such that
$$\beta'X'Y = (\beta'X'Y)' = Y'X\beta.$$

Note that $\epsilon'\epsilon$ is a scalar function and taking the first derivative with respect to $\beta$
yields the $k \times 1$ vector
$$\frac{\partial (\epsilon'\epsilon)}{\partial \beta} = \frac{\partial \left( Y'Y - 2Y'X\beta + \beta'X'X\beta \right)}{\partial \beta} = -2X'Y + 2X'X\beta,$$
where we have used the results in (13.4*) and (13.8*) for $X'X$ symmetric. Solving
the $k$ equations,
$$\frac{\partial (\epsilon'\epsilon)}{\partial \beta} = -2X'Y + 2X'X\hat\beta = 0,$$
yields the OLS estimator
$$\hat\beta = (X'X)^{-1}X'Y,$$

provided that $X'X$ is non-singular and can be inverted.
To make sure that $\hat\beta$ is a minimum of $\epsilon'\epsilon$ and not a maximum, we should formally
ensure that the second derivative is positive definite. The $k \times k$ Hessian matrix of
second derivatives is given by
$$\frac{\partial^2 (\epsilon'\epsilon)}{\partial \beta \partial \beta'} = \frac{\partial (-2X'Y + 2X'X\beta)}{\partial \beta'} = 2X'X,$$
which is positive semi-definite by construction and positive definite given that $X'X$ is non-singular.
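Putting the pieces together, a minimal sketch (my own, with simulated data and illustrative names) computes $\hat\beta$ from the formula just derived and checks that the Hessian $2X'X$ is indeed positive definite:

```python
# Sketch: OLS via the normal equations X'X beta_hat = X'Y, solved without
# forming the explicit inverse, plus a positive-definiteness check of 2X'X.
import numpy as np

rng = np.random.default_rng(3)
T, k = 200, 3
X = rng.normal(size=(T, k))
Y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)                            # close to (1.0, 0.5, -2.0)

eigvals = np.linalg.eigvalsh(2 * X.T @ X)  # Hessian of the OLS criterion
print(np.all(eigvals > 0))                 # True when X has full column rank
```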
References

Anderson, B., and J. Moore (1979): Optimal Filtering. Prentice-Hall. 270, 273

Andrews, D. W. (1991): “Heteroskedasticity and Autocorrelation Consistent Co-


variance Matrix Estimation,”Econometrica, 59(3), 817–858. 315

Andrews, D. W., and W. Ploberger (1994): “Optimal Tests When a Nuisance


Parameter Is Present Only under the Alternative,” Econometrica, 62(6), 1383–
1414. 266

Balke, N. S., and T. B. Fomby (1997): “Threshold Cointegration,” International


Economic Review, 38(4). 256

Banerjee, A., J. J. Dolado, J. W. Galbraith, and D. F. Hendry (1993):


Co-Integration, Error-Correction, and the Econometric Analysis of Non-Stationary
Data. Oxford University Press, Oxford. 177, 209

Bera, A. K., and M. L. Higgins (1995): “On ARCH Models: Properties, Estima-
tion and Testing,”in Surveys in Econometrics, chapter 8, pp. 215–272. Blackwell,
Oxford. 254

Bollerslev, T. (1986): “Generalized Autoregressive Conditional Heteroskedastic-


ity,”Journal of Econometrics, 31(3), 307–327. 238

(1990): “Modelling the coherence in short-run nominal exchange rates: a


multivariate generalized ARCH model,” The Review of Economics and Statistics,
72, 498–505. 251

(2010): “Glossary to ARCH (GARCH),” in Volatility and Time Series


Econometrics: Essays in Honor of Robert Engle, ed. by T. Bollerslev, J. Russell,
and M. Watson, chap. 8. Oxford University Press, Oxford. 245

Bollerslev, T., R. Y. Chou, and K. F. Kroner (1992): “ARCH Modelling in


Finance - A Review of the Theory and Empirical Evidence,”Journal of Economet-
rics, 52, 5–59. 254

Bollerslev, T., R. F. Engle, and D. B. Nelson (1994): “ARCH models,”


Handbook of Econometrics, IV(313), 2959–3038. 254

Caines, P. (1988): Linear Stochastic Systems. Wiley, New York. 273

Casals, J., A. Garcia-Hiernaux, and M. Jerez (2012): “From General State-


Space to VARMAX Models,” Mathematics and Computers in Simulation, 82(5),
924–936. 271

Davidson, J. (2001): Econometric Theory. Blackwell, Oxford. 15, 28, 49

Davidson, R., and J. G. MacKinnon (1993): Estimation and Inference in Econo-


metrics. Oxford University Press, Oxford. 201

Davies, R. (1977): “Hypothesis Testing when a Nuisance Parameter is Present Only


Under the Alternative,”Biometrika, 64(2), 247–254. 266

Diebold, F., and R. Mariano (1995): “Comparing Predictive Accuracy,”Journal


of Business and Economic Statistics, 13, 253–263. 110

Doornik, J. A. (2013): Econometric Analysis with Markov-Switching Models. Tim-


berlake Consultants Ltd, London. 264, 266, 268

Dunsmuir, W. (1979): “A Central Limit Theorem for Parameter Estimation in


Stationary Time Series and its Applications to Models for a Signal Observed with
Noise,”Annals of Statistics, 7, 490–506. 273

Durbin, J., and S. J. Koopman (2001): Time Series Analysis by State Space
Methods. Oxford University Press, Oxford. 270

Elliott, G., T. J. Rothenberg, and J. H. Stock (1996): “Efficient Tests for
an Autoregressive Unit Root,”Econometrica, 64, 813–836. 175

Enders, W. (2004): Applied Econometric Time Series. John Wiley and Sons, 2nd
edn. 110, 147, 177, 209

Engle, R. F. (1982): “Autoregressive Conditional Heteroscedasticity with Estimates


of the Variance of United Kingdom In‡ation,”Econometrica, 50(4), 987–1008. 225,
230

Engle, R. F., and C. Granger (1987): “Co-integration and error correction:


representation, estimation, and testing,”Econometrica, 55, 251–276. 190, 196

Engle, R. F., and V. K. Ng (1993): “Measuring and Testing the Impact of News
on Volatility,” Journal of Finance, 48, 1749–1777. 246

Ericsson, N. R., and J. G. MacKinnon (2002): “Distributions of Error Cor-


rection Tests for Cointegration,” The Econometrics Journal, 5(2), 285–318. 201,
209

Glosten, L. R., R. Jagannathan, and D. Runkle (1993): “Relationship Be-


tween the Expected Value and the Volatility of the Nominal Excess Return on
Stocks,”Journal of Finance, 48(5), 1779–1802. 245, 249

Granger, C. W. J. (1969): “Investigating Causal Relations by Econometric Models


and Cross-spectral Methods,”Econometrica, 37(3), 424–438. 146

Greene, W. H. (2008): Econometric Analysis. Prentice Hall, Upper Saddle River,


New Jersey, 6th edn. 64

Haldrup, N., and M. Jansson (2006): “Improving Size and Power in Unit Root
Testing,” in Palgrave Handbook of Econometrics, Vol. 1, ed. by T. C. Mills, and
K. Patterson, chapter 7, pp. 252–277. Palgrave, New York. 175

Hamilton, J. D. (1989): “A new approach to the economic analysis of nonstationary


time series and the business cycle,”Econometrica, 57(2), 357–384. 262

(1990): “Analysis of time series subject to changes in regime,” Journal of


Econometrics, 45, 39–70. 262

(1994): Time Series Analysis. Princeton University Press, Princeton. 28,


49, 72, 177, 209, 264, 268

Hansen, B. E. (1992): “Testing for Parameter Instability in Linear Models,”Journal


of Policy Modeling, 14, 517–533. 40, 41

(1996): “Inference when a Nuisance Parameter is not Identified under the


Null Hypothesis,”Econometrica, 64(2), 413–430. 266

Hansen, L. P. (1982): “Large Sample Properties of Generalized Method of Moments


Estimators,” Econometrica, 50(4), 1029–1054. 279, 315

Hansen, L. P., J. Heaton, and A. Yaron (1996): “Finite-sample properties of


some alternative GMM estimators,”Journal of Business and Economic Statistics,
14(3), 262–280. 295

Hansen, L. P., and K. J. Singleton (1982): “Generalized Instrumental Variables


Estimation of Nonlinear Rational Expectations Models,” Econometrica, 50(5),
1269–1286. 310, 312, 314

(1983): “Stochastic Consumption, Risk Aversion, and the Temporal Behav-


ior of Asset Returns,”Journal of Political Economy, 91(2), 249–265. 314

Hayashi, F. (2000): Econometrics. Princeton University Press, Princeton. 15, 28,


49, 314

Hendry, D. F. (1995): Dynamic Econometrics. Oxford University Press, Oxford.


123, 147

Hendry, D. F., and K. Juselius (2001): “Explaining Cointegration Analysis: Part


II,”Energy Journal, 22(1), 75–120. 209

Hendry, D. F., and B. Nielsen (2007): Econometric Modeling: A Likelihood


Approach. Princeton University Press, Princeton. 46

Jensen, S. T., and A. Rahbek (2004): “Asymptotic Inference for Nonstationary


GARCH,”Econometric Theory, 20(6), 1203–1226. 63

Johansen, S. (1996): Likelihood-Based Inference in Cointegrated Vector Autoregres-


sive Models. Oxford University Press, Oxford, 2nd edn. 224

Juselius, K. (2007): The Cointegrated VAR model: Econometric Methodology and


Macroeconomic Applications. Oxford University Press, Oxford. 224

Kalman, R. E. (1960): “A New Approach to Linear Filtering and Prediction Prob-


lems,”Transactions of the ASME-Journal of Basic Engineering, 82(Series D), 35–
45. 269, 272

Lütkepohl, H. (2005): New Introduction to Multiple Time Series Analysis.


Springer-Verlag, Berlin. 123, 147

Lütkepohl, H., and L. Kilian (2017): Structural Vector Autoregressive Analysis.


Cambridge University Press, Cambridge. 147

Lütkepohl, H., and M. Krätzig (2004): Applied time Series Econometrics. Cam-
bridge University Press, Cambridge. 110, 123, 147

Maddala, G. S., and I.-M. Kim (1998): Unit Roots, Cointegration, and Structural
Change. Cambridge University Press, Cambridge. 177, 209

Mandelbrot, B. (1963): “The Variation of Certain Speculative Prices,”The Jour-


nal of Business, 36(4), 394–419. 226

Mátyás, L. (1999): Generalized Method of Moments Estimation. Cambridge Uni-


versity Press, Cambridge. 315

Nelson, D. B. (1990): “Stationarity and Persistence in the GARCH(1,1) Model,”


Econometric Theory, 6(3), 318–334. 244

(1991): “Conditional Heteroskedasticity in Asset Returns: A New Ap-


proach,”Econometrica, 59(2), 347–370. 248

Newey, W. K., and K. D. West (1987): “A Simple, Positive Semi-Definite, Het-


eroskedasticity and Autocorrelation Consistent Covariance Matrix,”Econometrica,
55(3), 703–708. 315

Nielsen, B. (2008): “Power of Tests for Unit Roots in the Presence of a Linear
Trend,”Oxford Bulletin of Economics and Statistics, 70, 619–644. 175

Nielsen, H. B. (2017): Introduction to Likelihood-Based Estimation and Inference.


Hans Reitzels Forlag, Copenhagen, 3rd edn. 14, 49, 50, 59, 75

Patterson, K. (2000): An Introduction to Applied Econometrics. A Time Series


Approach. Palgrave MacMillan, New York. 177, 209

(2010): A Primer for Unit Root Testing. Palgrave MacMillan, New York.
166

Ploberger, W. (2010): “Law(s) of Large Numbers,” in Macroeconometrics and


Time Series Analysis (The New Palgrave Economics Collection), ed. by S. N.
Durlauf, and L. E. Blume, pp. 158–162. Palgrave Macmillan. 64

Ruiz, E. (1994): “Quasi Maximum Likelihood Estimation of Stochastic Volatility


Models,” Journal of Econometrics, 63, 289–306. 273

Shumway, R. H., and D. S. Stoffer (2000): Time Series Analysis and Its Appli-
cations. Springer, New York. 270, 273

(2011): Time Series Analysis and Its Applications. Springer, 3rd edn. 270, 278

Sims, C. A., J. H. Stock, and M. W. Watson (1990): “Inference in Linear Time


Series Models with Some Unit Roots,”Econometrica, 58(1), 113–144. 194

Stock, J. H., and M. W. Watson (2003): Introduction to Econometrics. Addison


Wesley. 36

Taylor, J. B. (1993): “Discretion Versus Policy Rules in Practice,” Carnegie-


Rochester Conference Series on Public Policy, 39, 195–214. 281, 306

Taylor, S. J. (1986): Modelling Financial Time Series. Wiley, New York. 238

Teräsvirta, T. (1994): “Speci…cation, estimation, and evaluation of smooth tran-


sition autoregressive models,” Journal of the American Statistical Association,
89(425), 208–218. 256

Tong, H. (2011): “Threshold models in time series analysis - 30 years on,”Statistics


and its Interface, 4(2), 107–118. 256

Verbeek, M. (2017): A Guide to Modern Econometrics. John Wiley and Sons, 5th
edn. 36, 49, 123, 147, 279, 321

Watson, M. (1989): “Recursive Solution Methods for Dynamic Linear Rational


Expectations Models,”Journal of Econometrics, 41, 65–89. 273

Wooldridge, J. M. (2001): “Applications of Generalized Method of Moments


Estimation,”Journal of Economic Perspectives, 15(4), 87–100. 314

(2006): Introductory Econometrics: A Modern Approach. Thomson/South-


Western, Mason, Ohio, 3rd edn. 14, 27, 28, 31, 35, 39, 59, 70, 84, 280, 321
