European Journal of Operational Research 200 (2010) 498–507
Stochastics and Statistics
Nonlinearity, data-snooping, and stock index ETF return predictability
Jian Yang a,*, Juan Cabrera b, Tao Wang c
a The Business School, University of Colorado Denver, Denver, CO 80217, USA
b Department of Economics, The Graduate Center of CUNY, New York, NY 10016, USA
c Department of Economics, Queens College and the Graduate Center of CUNY, Flushing, NY 11367, USA
Article info
Article history:
Received 12 September 2007
Accepted 9 January 2009
Available online 16 January 2009
Keywords:
iShares
Random walk
Nonlinear models
Forecasting evaluation
Reality check
Abstract
This paper examines daily return predictability for eighteen international stock index ETFs. Out-of-sample tests are conducted using linear and various popular nonlinear models, with both statistical and economic criteria for model comparison. The main results show evidence of predictability for six of the eighteen ETFs. A simple linear autoregression model and a nonlinear-in-variance GARCH model, but not several popular nonlinear-in-mean models, help outperform the martingale model. Allowing for data-snooping bias using White’s Reality Check also substantially weakens otherwise apparently strong predictability.
© 2009 Elsevier B.V. All rights reserved.
1. Introduction
Asset return predictability has been one of the most important
topics in financial research. Inference on asset return predictability carries important implications for practitioners, for example,
for the design of portfolio management strategies. Numerous earlier studies have examined the short-horizon predictability of stock market returns based on past returns. In this
regard, since the variance ratio test was originally developed by
Lo and MacKinlay (1988), it has been widely used in testing the
random walk hypothesis in international stock markets (e.g., Kim
and Singal, 2000; Chaudhuri and Wu, 2003; Patro and Wu, 2004)
and foreign exchange markets (e.g., Tabak and Lima, 2009).1
However, the popular variance ratio test used in the above studies (as well as the traditional autocorrelation test; see, e.g., Chordia et al. (2005)) assumes linearity and tests only serial uncorrelatedness rather than the martingale difference property (Hsieh, 1991; McQueen and Thorley, 1991; Hong and Lee, 2003). A nonlinear time series can have zero autocorrelation but a non-zero mean conditional on its past history (i.e., it is predictable based on the past history). That is, the variance ratio test may fail to capture predictable nonlinearities in mean (if any) and could yield misleading conclusions in favor of the martingale (or, loosely, random walk) hypothesis.
* Corresponding author. Tel.: +1 303 556 5852; fax: +1 303 556 5899.
E-mail address: jian.yang@ucdenver.edu (J. Yang).
1 The terms “random walk” and “martingale” have been used interchangeably in the efficient capital markets literature. However, it is the martingale property (or unpredictability) of security prices that is of essential interest to this huge body of literature (Fama, 1991; Granger, 1992). Strictly speaking, the innovation series is independent and identically distributed for a “random walk”, while it is a martingale difference sequence for a “martingale.”
0377-2217/$ - see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.ejor.2009.01.009
This study examines daily return predictability of international
stock index exchange-traded funds (ETFs) during the period of
1996–2006. We present the first comprehensive study on the martingale behavior of recently popular international stock index ETFs
(loosely in the context of weak-form market efficiency). ETFs rank among the most successful financial innovations of all time: there were over 300 ETFs with more than $400 billion of assets as of December 2006. A defining characteristic of ETFs is the ease of active intraday trading and their high daily turnover, which is particularly appealing to investors who demand short-term liquidity and trade in large lots (Poterba and Shoven, 2002). International stock index ETFs presumably provide an attractive investment vehicle for US investors to explore potential investment opportunities abroad.
Surprisingly, while many issues such as diversification potentials
and herding behaviors on the ETFs have been examined (e.g.,
Pennathur et al., 2002; Gleason et al., 2004), the important issue
of their short-horizon predictability has not yet been investigated.
Also noteworthy, daily stock index ETF prices are transaction prices
which would not suffer from the notorious non-synchronous trading problem of daily stock market indexes (Ahn et al., 2002), which
plagues numerous studies using such data.2
2 Although international stock index ETFs are designed to track each country's stock market index, there may be substantial tracking errors for the ETFs, partly due to their considerable exposure to the US market (e.g., Pennathur et al., 2002). Nevertheless, the predictability of the ETFs as a new financial instrument remains interesting in itself. To the extent that the international stock index ETFs track the performance of international stock indexes, the evidence from this study could also shed more light on the short-horizon predictability of international stock market indexes.
We seek to contribute to the literature in the following important aspects. First, we take the model selection approach (e.g.,
Swanson and White, 1995, 1997), rather than the more traditional
hypothesis testing approach as taken in the variance ratio test (or
the autocorrelation test). As discussed in Swanson and White
(1995, 1997), unlike the traditional hypothesis testing approach,
the model selection approach does not require the specification
of a correct model for its valid application. By contrast, earlier
empirical findings based on variance ratio tests are quite sensitive
to potential model misspecification. More important to this study,
it allows us to focus directly on the issue of predictability at hand:
out-of-sample forecasting performance. Arguably, out-of-sample
evidence bears directly on predictability and is important to
mitigate the concern of in-sample model overfitting, particularly
for nonlinear models. This is also well in line with Granger’s (1992,
p. 11) observation that ‘‘only out-of-sample evaluation is relevant
and, to some extent, avoids these difficulties (due to data mining).”
By contrast, all the cited studies above only focus on in-sample
evidence (and also typically fail to allow for potential nonlinearity-in-mean).
Further, similar to Hong and Lee (2003), Moreno and Olmeda
(2007), Yang et al. (2008) and Tabak and Lima (2009), this study
presents out-of-sample evidence based on both statistical and economic criteria. With the notable exception of Ratner and Leal
(1999) and Moreno and Olmeda (2007), few earlier studies on
international stock market random walk behavior have considered
economic criteria as measured by magnitude of trading returns
and particularly the direction of forecasted price changes, which
have practical value to investors and other decision-makers.
Second, we extend the literature by applying a number of nonlinear models that allow for both potential nonlinearity-in-mean
and nonlinearity-in-variance. As noted earlier, the cited studies
above using variance ratio tests (and autocorrelation tests) do not
allow for nonlinearity-in-mean. Theoretically, as discussed in McQueen and Thorley (1991), the existence of fads or rational speculative bubbles suggests the possibility of nonlinear patterns in stock returns.
Or, if the world is governed by a not-too-complex chaotic process,
it should have short-term nonlinear predictability (in mean) but
not linear predictability (Hsieh, 1991, p. 1845). Further, in a survey
on the random walk test literature, Granger (1992, p. 11) concludes
that ‘‘benefits can arise. . .especially from considering non-linear
models.” Toward this end, this study considers several popular
nonlinear models to more comprehensively explore potential
nonlinearities in mean, in addition to the more commonly used nonlinear-in-variance models (i.e., GARCH) (see, e.g., Hsieh, 1991,
1993).3 In fact, some variants of the popular nonlinear models used
in most previous studies (e.g., Hsieh, 1991; Gencay, 1998; Harris and
Kucukozmen, 2001; Monoyios and Sarno, 2002; Hong and Lee, 2003;
Moreno and Olmeda, 2007; Yang et al., 2008) are used in this study.4
Finally, model comparisons in this study are improved relative
to previous studies by using White’s (2000) novel test to address
the concern of data-snooping bias (i.e., spuriously superior
predictive ability of some complex models due to chance).5 When several forecast models using the same data are compared, it is crucial to take into account the dependence among these models; ignoring it may result in misleading inference due to data-snooping bias. While the overfitting problem of nonlinear models is well recognized in the literature, relatively few earlier studies in this line of the literature have addressed the data-snooping issue, which is
shown to be nontrivial in this study. The rest of this paper is organized as follows: Section 2 presents econometric methodology; Section 3 describes the data; Section 4 discusses the empirical results;
and finally, Section 5 concludes the paper.
2. Econometric methodology
To forecast ETF daily returns, we use various models for $E(Y_t \mid I_{t-1})$, where $Y_t$ represents the first difference of the logarithm of ETF daily closing prices and $I_{t-1}$ is the information set available at time $t-1$. We apply various popular nonlinear models to explore the possibility that daily ETF returns are not a martingale, and have conditional mean dependence in a complicated form (i.e., nonlinearity-in-mean) as well as dependence in second (or higher) moments (i.e., nonlinearity-in-variance). We certainly do not assume that the limited number of nonlinear models can capture all the nonlinearities. However, they do represent some of the most popular nonlinear models widely used in the literature thus far.
The martingale model $Y_t = \mu + \varepsilon_t$ is used as the benchmark for comparison with other models. Table A1 lists the various models examined in the paper, including the autoregressive model (AR(d)), generalized autoregressive conditional heteroskedasticity model (GARCH(p, q)), feedforward artificial neural network (NN(d, q)), functional coefficient model (FC(d, L)), nonparametric regression model (NP(k, m)), and some combinations of these models. The estimation of the AR(d) and GARCH(p, q) models is relatively standard, using the ordinary least squares method and the maximum likelihood method, respectively. We next briefly discuss how to implement the more complicated nonlinear models used in this study (i.e., neural network, functional coefficient and nonparametric models).6
2.1. The feedforward artificial neural network
Artificial neural networks have proven to be useful in capturing
nonlinearity-in-mean in forecasting financial time series. One of the
greatest advantages of neural networks over other commonly-used
nonlinear time series models is that neural networks can well
approximate a large class of functions. The basic structure of neural
networks combines many ‘basic’ nonlinear functions via a multilayer structure. Normally there is one intermediate, or hidden, layer
between the inputs and output. The intuition is that the explanatory variables simultaneously activate the units in the intermediate
layer through some function W and, subsequently, output is produced through some function U from the units in the intermediate
layer. The following equations summarize this approach:
\[
h_{i,t} = W\Big(c_{i0} + \sum_{j=1}^{m} c_{ij} X_{j,t}\Big), \qquad i = 1, \ldots, q,
\]
\[
Y_t = U\Big(b_0 + \sum_{i=1}^{q} b_i h_{i,t}\Big),
\]
or, more compactly,
\[
Y_t = U\Big(b_0 + \sum_{i=1}^{q} b_i W\Big(c_{i0} + \sum_{j=1}^{m} c_{ij} X_{j,t}\Big)\Big),
\]
where $X_{j,t}$ is the input or an independent variable, $h_{i,t}$ is the node or hidden unit in the intermediate or hidden layer, and $Y_t$ is the output or dependent variable. In this study, the independent variable $X_{j,t}$ coincides with the lagged dependent variable $Y_{t-j}$. The functions W and U can be arbitrarily chosen and still approximate a large class of functions given sufficiently large numbers of units in the intermediate layer.
3 Note that there is a debate about whether there exists predictable nonlinearity-in-mean in US stock market indexes. For example, although Hsieh (1991) finds little nonlinearity-in-mean in US stock market prices, Gencay (1998) reports nonlinear-in-mean predictability for similar indexes.
4 As in many earlier studies, a caveat here is that the inference should still be interpreted in light of the limited number of models we examine in this study. In general, martingale means the existence of neither linear nor nonlinear dependence, and we have to test all possible nonlinear dependence to rule out the martingale property of stock returns, which is practically impossible.
5 As discussed in Campbell et al. (1997, pp. 523–524), the problems of overfitting and data-snooping are related but different. A typical symptom of overfitting is an excellent in-sample fit but poor out-of-sample performance, while data-snooping refers to excellent but spurious out-of-sample performance.
6 Also note that some of these models are special cases of the others. For example, the AR(1) model is a special case of the NN(1, 5) model. Nevertheless, in this study the forecasting results of the NN(1, 5) model are systematically worse than the results of the AR(1) model. This, however, may simply indicate rather weak nonlinearity-in-mean in the dataset, which causes the more complicated NN(1, 5) model to perform poorly while the more parsimonious AR(1) model performs rather well in out-of-sample forecasting.
In this study we use single layer feedforward neural networks
(e.g., Lee et al., 1993; Gencay, 1998; Hong and Lee, 2003), which
is the most basic but perhaps most commonly used neural network
in economic and financial applications. In this case, the input variables are connected to multiple nodes (or hidden units), and at
each node they are weighted (differently) and transformed by
the same activation function W. The output of each node is then
weighted again by bi and summed and transformed by a second
activation function U.
Following the literature (e.g., Gencay, 1998; Hong and Lee, 2003), we choose the logistic function for the function W and the identity function for the function U. Coefficients for the NN(d, q) model are estimated using nonlinear least squares via the Newton–Raphson algorithm. The final equation we estimate is as follows:
\[
E(Y_t \mid I_{t-1}) = b_0 + \sum_{j=1}^{d} b_j Y_{t-j} + \sum_{i=1}^{q} d_i \, G\Big(c_{0i} + \sum_{j=1}^{d} c_{ji} Y_{t-j}\Big),
\]
where $G(z) = (1 + e^{-z})^{-1}$ is the logistic function chosen for W, $I_{t-1}$ is the information set available at $t-1$, and $Y_t$ is the dependent variable (i.e., ETF returns).
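As a minimal sketch of how the NN(d, q) conditional mean above can be evaluated once its coefficients are known, the following Python fragment computes the linear part plus the q logistic hidden units. The function and argument names are illustrative assumptions, not the authors' code, and no estimation step (nonlinear least squares) is shown.

```python
import math

def logistic(z):
    """Logistic activation G(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def nn_conditional_mean(lags, b, d, c):
    """Evaluate b_0 + sum_j b_j*Y_{t-j} + sum_i d_i*G(c_0i + sum_j c_ji*Y_{t-j}).

    lags : [Y_{t-1}, ..., Y_{t-d}]            (hypothetical argument layout)
    b    : [b_0, b_1, ..., b_d]               linear part
    d    : [d_1, ..., d_q]                    hidden-unit output weights
    c    : q rows [c_0i, c_1i, ..., c_di]     one row per hidden unit
    """
    linear = b[0] + sum(bj * y for bj, y in zip(b[1:], lags))
    hidden = sum(
        di * logistic(ci[0] + sum(cji * y for cji, y in zip(ci[1:], lags)))
        for di, ci in zip(d, c)
    )
    return linear + hidden

# With all hidden-unit weights d_i set to zero, NN(1, 5) collapses to AR(1).
ar_like = nn_conditional_mean([0.02], b=[0.001, 0.1], d=[0.0] * 5, c=[[0.0, 1.0]] * 5)
```

The last line illustrates the nesting noted in the footnotes: switching off the hidden units leaves only the AR(1) terms.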
2.2. The functional coefficient model
The functional coefficient model introduced by Cai et al. (2000)
is a new semiparametric nonlinear time series model with timevarying and state-dependent coefficients. It includes threshold
autoregression models, smooth transition regression, and many
other regime switching models as special cases. The basic model
can be expressed as follows:
\[
E(Y_t \mid I_{t-1}) = a_0(U_t) + \sum_{j=1}^{d} a_j(U_t) Y_{t-j},
\]
where $\{(Y_t, U_t)'\}$ is a bivariate stationary process. The smoothing variable $U_t$ may be chosen as a function of the explanatory variable vector $Y_{t-j}$ or as a function of other variables. In our forecasts of ETF returns using past returns, $U_t$ is chosen as the difference between the log index price at time $t-1$ ($p_{t-1}$) and the moving average of the log prices over the most recent L periods at time $t-1$, or:
\[
U_t = p_{t-1} - L^{-1} \sum_{j=1}^{L} p_{t-j}.
\]
In this paper, following the literature (e.g., Gencay, 1998, 1999) and
the common practice of technical analysis, we chose L = 200. Traders often use Ut as a buy or sell signal based on its sign, which reveals information on changes in direction, i.e. the moving average
rule. Thus, the model might be well suited to forecasting the direction of price movements.
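The smoothing variable can be computed directly from a list of log prices. The sketch below assumes a zero-indexed list in which `log_prices[s]` holds $p_s$; the function name and this indexing convention are illustrative choices, not part of the paper.

```python
def smoothing_variable(log_prices, t, L=200):
    """U_t = p_{t-1} - (1/L) * sum_{j=1}^{L} p_{t-j}; requires t >= L."""
    if t < L:
        raise ValueError("need at least L past log prices before time t")
    window = log_prices[t - L:t]          # p_{t-L}, ..., p_{t-1}
    return log_prices[t - 1] - sum(window) / L

# Tiny illustration with L = 3: the latest log price is 3.0, the 3-period average is 2.0.
u = smoothing_variable([1.0, 2.0, 3.0, 4.0], t=3, L=3)
```

A positive value means the latest log price sits above its moving average, which is the situation the moving average rule reads as a buy signal.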
Following Cai et al. (2000), we estimate the term $\{a_j(U_t)\}$ nonparametrically using a local linear estimator. We approximate $a_j(U_t)$ locally (when $U_t$ is close to $u$) by $a_j(U_t) \approx a_j + b_j(U_t - u)$. The local linear estimator at point $u$ is $\hat{a}_j(u) = \hat{a}_j$, and $\{(\hat{a}_j, \hat{b}_j)\}$ are chosen by minimizing the sum of locally weighted squares defined as:
\[
\sum_{t=1}^{N} \big[Y_t - a_j - b_j (U_t - u)\big]^2 K_h(U_t - u),
\]
where $K_h(\cdot)$ is the kernel function used as weights for points that are included to estimate $\{(\hat{a}_j, \hat{b}_j)\}$. We use the normal density as the kernel function, and h is the smoothing parameter or the bandwidth of the window of the kernel function, which is determined by the modified leave-one-out least squares cross-validation method proposed in Cai et al. (2000).
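To make the locally weighted least squares step concrete, the following sketch solves the minimization for a single coefficient function, i.e. it fits $a + b(U_t - u)$ to $Y_t$ with Gaussian kernel weights. This is a deliberately simplified, one-function illustration under that assumption; the full FC(d, L) model would carry one such pair per lag, and the names used here are hypothetical.

```python
import math

def gaussian_kernel(v, h):
    """K_h(v): normal density with standard deviation h."""
    return math.exp(-0.5 * (v / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def local_linear_fit(ys, us, u, h):
    """Minimise sum_t [Y_t - a - b*(U_t - u)]^2 * K_h(U_t - u) over (a, b)."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for y, ut in zip(ys, us):
        x = ut - u
        w = gaussian_kernel(x, h)
        s0 += w
        s1 += w * x
        s2 += w * x * x
        t0 += w * y
        t1 += w * x * y
    det = s0 * s2 - s1 * s1
    if det == 0.0:
        raise ValueError("kernel weights are degenerate at this point")
    # Closed-form solution of the two weighted normal equations.
    b_hat = (s0 * t1 - s1 * t0) / det
    a_hat = (t0 - b_hat * s1) / s0
    return a_hat, b_hat

# On exactly linear data y = 2 + 3u, the local fit at u = 0.3 recovers a = 2.9, b = 3.
us_demo = [i / 10 for i in range(-10, 11)]
ys_demo = [2.0 + 3.0 * v for v in us_demo]
a_hat, b_hat = local_linear_fit(ys_demo, us_demo, u=0.3, h=0.5)
```

The intercept `a_hat` plays the role of the estimated coefficient function at the point u, while `b_hat` estimates its local slope.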
2.3. The nonparametric kernel regression model
Because nonlinearities in the conditional mean may be complicated and cannot be expressed explicitly, it is desirable to use nonparametric regression to estimate the model without specifying the functional form. We use the well-known kernel regression (with some improvements on bandwidth selection to maximize the forecasting power) for estimation and forecasting. In general, a nonparametric regression model can be expressed as:
\[
E(Y_t \mid I_{t-1}) = g(Y_{t-1}, Y_{t-2}, \ldots, Y_{t-j}).
\]
As mentioned above with respect to the nonparametric estimator of $a_j(U_t)$ in the functional coefficient model, $g(\cdot)$ can be estimated by local linear regression. At each point $y_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-j})$, we can approximate $g(\cdot)$ locally by a linear function $g(Y) = a + (Y - y)'b$. We can also approximate $g(y)$ locally simply by a constant function $g(y) = a$ (i.e., the local constant estimator), which is the approach taken here. The local constant estimator is relatively simple to implement and has been widely used in applied research. Compared to other estimators, it has also drawn the most theoretical attention and thus has clear theoretical properties for estimation and inference of nonparametric models. The local constant estimator at point $y$ is given by $\hat{g}(y) = \hat{a}$, where $\hat{a}$ minimizes the sum of locally weighted squares:
\[
\sum_{t=1}^{N} [Y_t - a]^2 \prod_{s=1}^{j} K_{h_s}(Y_{t-s} - y_{t-s}),
\]
where $\prod_{s=1}^{j} K_{h_s}(Y_{t-s} - y_{t-s})$ is the product kernel, $K_{h_s}$ is the univariate kernel function, and $h = (h_1, \ldots, h_j)$ is chosen by the leave-one-out cross-validation procedure. The smoothing parameter h is the most important parameter in nonparametric estimation. An inappropriately chosen h will give poor in-sample and out-of-sample predictions. Traditional nonparametric forecasting uses the h that minimizes the in-sample sum of squared errors to forecast the next-period value based on previous in-sample data. However, while this h is optimal for all in-sample data, it may not be the best h for out-of-sample forecasting. Consequently, we use a modified method to select the smoothing parameter.7
7 We thank Qi Li for making the suggestion.
Our modified approach consists of finding the best h for out-of-sample forecasting, denoted h*, and making forecasts based on this h*. For example, suppose that we have data points x1–x100 and that we want to forecast x101. The traditional approach is to find the h that minimizes the in-sample sum of squared errors over the 100 data points (x1–x100) and then use this h and these data points to forecast x101. We propose the following modified nonparametric forecasting methodology. For a candidate h, we use data points x1–x80 to forecast x81, data points x2–x81 to forecast x82, ..., and data points x20–x99 to forecast x100. We find the h* that minimizes the sum of squared errors of the out-of-sample forecasts of points x81–x100 and use this h* and data points x21–x100 to make our final forecast of x101. In this procedure, we have two parameters to establish: (1) the out-of-sample evaluation length k, which is set equal to 20 ($\hat{x}_{81}$–$\hat{x}_{100}$) in the example, and (2) the regression length m, which is set equal to 80 in the example. Hence, we denote the model as NP(k, m), where the parameters (k, m) are important to the forecasting performance of this modified nonparametric regression model. We thus experiment with different evaluation lengths in our study, and their impact does not appear to be substantial. Therefore, in the tables presented below, we only discuss the results based on a particular combination.
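The NP(k, m) procedure can be sketched as follows. For brevity the example assumes a single lag, a Gaussian kernel, and a user-supplied grid of candidate bandwidths; these choices and all function names are illustrative assumptions rather than the authors' implementation.

```python
import math

def nw_forecast(window, h):
    """Local constant (Nadaraya-Watson) one-step forecast using one lag.

    The kernel's normalising constant is omitted because it cancels in the ratio.
    """
    x0 = window[-1]
    num = den = 0.0
    for prev, nxt in zip(window[:-1], window[1:]):
        w = math.exp(-0.5 * ((prev - x0) / h) ** 2)
        num += w * nxt
        den += w
    if den == 0.0:                         # every weight underflowed to zero
        return sum(window) / len(window)
    return num / den

def select_bandwidth(series, k, m, grid):
    """Pick the h in grid with the smallest squared error over the last k forecasts."""
    data = series[-(k + m):]
    best_h, best_sse = None, float("inf")
    for h in grid:
        sse = 0.0
        for i in range(k):
            forecast = nw_forecast(data[i:i + m], h)   # window of length m
            sse += (data[i + m] - forecast) ** 2        # next point is the target
        if sse < best_sse:
            best_h, best_sse = h, sse
    return best_h

def np_forecast(series, k, m, grid):
    """Final forecast from the last m points, using the selected bandwidth h*."""
    h_star = select_bandwidth(series, k, m, grid)
    return nw_forecast(series[-m:], h_star), h_star

# Alternating series: after -1 the next value is +1, which a small bandwidth captures.
demo = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
f_demo, h_demo = np_forecast(demo, k=20, m=80, grid=[0.1, 10.0])
```

With 100 observations, k = 20 and m = 80, the loop reproduces the example in the text: windows x1–x80 through x20–x99 are scored against x81–x100, and the final forecast uses x21–x100.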
Finally, it has also been argued that no single forecasting model
performs well for all time periods and under all different criteria, as
the pattern of ETF returns can vary over time and may not follow a
simple data generating process. In order to improve the predictability, we closely follow Hong and Lee (2003) and combine several
forecasting models. More specifically, we pool forecasts from the
AR(1), GARCH(1, 1), NN(1, 5), FC(1, 200), and NP(200, 400) models
to forecast the conditional mean of price changes.8
Denoting these five models as models 1, ..., 5, respectively, the combined model is given by:
\[
\hat{Y}_t = \sum_{k=1}^{5} \omega_{kt} \hat{Y}_{kt},
\]
where the weight $\omega_{kt}$ is determined as follows:
\[
\omega_{kt} = \frac{\exp\Big[-\lambda_t \sum_{s=1}^{t-1} (Y_s - \hat{Y}_{ks})^2\Big]}{\sum_{j=1}^{5} \exp\Big[-\lambda_t \sum_{s=1}^{t-1} (Y_s - \hat{Y}_{js})^2\Big]},
\]
with $\lambda_t = 1/(2S_t^2)$, where $S_t^2$ is the sample variance of $\{Y_s\}$, s runs from 1 to $t-1$, and $\hat{Y}_{ks}$ is the out-of-sample prediction by model k. Intuitively, $\omega_{kt}$ gives higher weight to model k if its predictions were better than those of the other models in previous forecasting exercises, as measured by the mean squared forecast error (MSFE) criterion.
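A small sketch of the weighting scheme may help. It assumes the exponent carries a negative sign (so that smaller past squared errors earn larger weights, as the text describes) and uses the n − 1 divisor for the sample variance; both points, and the function name, are assumptions of this illustration.

```python
import math

def combination_weights(actual, forecasts):
    """Weights proportional to exp(-lambda * past sum of squared forecast errors).

    actual    : realised values Y_1, ..., Y_{t-1} (must not all be equal)
    forecasts : one list of past out-of-sample forecasts per model
    """
    n = len(actual)
    mean = sum(actual) / n
    s2 = sum((y - mean) ** 2 for y in actual) / (n - 1)   # sample variance
    lam = 1.0 / (2.0 * s2)
    sse = [sum((y - f) ** 2 for y, f in zip(actual, fk)) for fk in forecasts]
    base = min(sse)            # subtracting a constant cancels in the ratio
    raw = [math.exp(-lam * (e - base)) for e in sse]
    total = sum(raw)
    return [r / total for r in raw]

# Model 0 forecasts perfectly, model 1 always predicts 0.5.
past = [0.0, 1.0, 0.0, 1.0]
w_demo = combination_weights(past, [[0.0, 1.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]])
```

The combined forecast is then the weighted sum of the individual models' current forecasts using these weights.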
8 The Combined II forecasts pool forecasts from all 5 of these models, while the Combined I forecasts exclude the forecast of the GARCH(1, 1) model.
3. Data description
The dataset consists of daily return observations for eighteen international stock index ETFs from CRSP. These ETFs are traded on the US market and designed to mimic the underlying indices they represent. They are readily available to US investors who want to get access to international stock markets without the involvement of currency exchange. More specifically, we use the daily closing prices on iShares exchange-traded funds (ETFs) that track a chosen market index.9 These markets have been divided into two groups: developed and emerging markets. The developed market ETFs include Australia, Canada, France, Germany, Italy, Japan, Netherlands, Spain, Switzerland, United Kingdom, and the United States. The emerging market ETFs include Brazil, Hong Kong, Korea, Malaysia, Mexico, Singapore, and Taiwan. The time period covered for developed market indices spans from April 1, 1996 to August 25, 2006. Among emerging market ETFs, for Hong Kong, Malaysia, Mexico, and Singapore, the starting period is January 4, 1999; for Taiwan, the starting period is June 23, 2000; for South Korea, the starting period is May 12, 2000; and for Brazil, the starting period is July 14, 2000. The use of daily data is appropriate for the purpose of this study and similar to many previous studies (Hsieh, 1991; Gencay, 1998). Unlike higher-frequency intraday data, daily ETF data avoid the microstructure effects which are usually present in intraday dynamics. As thoroughly discussed in Hsieh (1991, p. 1848), high-frequency tick-by-tick data may capture bid-ask bounces and other dependencies which are caused by the market microstructure. These “artificial” dependencies will be picked up by any good test of nonlinear dynamics. The financial economist must increase the sampling interval in order to average out these “artificial” dependencies. Monoyios and Sarno (2002) also argue that the use of daily data can easily allow for a longer time span of the time series, which is much more important than the number of observations per se for modeling nonlinear dynamics related to lower-frequency properties of the data. In addition, the number of daily observations is large enough to allow efficient in-sample estimation and out-of-sample forecasting evaluation. A limited number of observations tends to produce poor fit and inferior predictability, which could bias results against rejecting the martingale hypothesis.
9 For the US, the ETF chosen for the S&P 500 index is the SPY because it has a much higher trading volume than the iShares S&P 500 ETF. Also, all ETF returns are already adjusted for dividends.
4. Empirical results
In order to produce out-of-sample forecasts, we use a rolling regression technique. Suppose there are N observations in the sample, where N = R + P. At time t, we use a rolling sample of R observations, with each model estimated using various linear and nonlinear methods, to produce a one-step-ahead forecast, $\hat{Y}_{t+1}$. Therefore, we can generate a sequence of P one-step-ahead forecasts, which is used to evaluate each of the models under consideration. Swanson and White (1995, 1997) suggest that the rolling regression technique can further allow for the (potentially nonlinear) relation between the current and past returns to evolve across time.
Applying four forecasting evaluation criteria to the sequence of out-of-sample forecasts, we investigate the forecasting ability of each model relative to the benchmark martingale model. The four evaluation criteria used here are:
\[
\mathrm{MSFE} = P^{-1} \sum_{t=R}^{N-1} \big(Y_{t+1} - \hat{Y}_{t+1}\big)^2,
\]
\[
\mathrm{MAFE} = P^{-1} \sum_{t=R}^{N-1} \big|Y_{t+1} - \hat{Y}_{t+1}\big|,
\]
\[
\mathrm{MFTR} = P^{-1} \sum_{t=R}^{N-1} \mathrm{sign}(\hat{Y}_{t+1})\, Y_{t+1},
\]
\[
\mathrm{MCFD} = P^{-1} \sum_{t=R}^{N-1} \mathbf{1}\big[\mathrm{sign}(\hat{Y}_{t+1})\, \mathrm{sign}(Y_{t+1}) > 0\big],
\]
where $\mathbf{1}[\cdot]$ is the indicator function and $\mathrm{sign}(\cdot)$ is defined by $\mathrm{sign}(\hat{Y}_{t+1}) = 1$ if $\hat{Y}_{t+1} \geq 0$ and $\mathrm{sign}(\hat{Y}_{t+1}) = -1$ if $\hat{Y}_{t+1} < 0$.
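The four criteria translate almost line by line into code. The sketch below takes the P realised returns and the matching one-step-ahead forecasts as plain lists; the function name and the dictionary return type are illustrative choices.

```python
def sign(x):
    """sign(x) = 1 if x >= 0, and -1 otherwise (the convention stated above)."""
    return 1 if x >= 0 else -1

def forecast_criteria(actual, forecast):
    """Return MSFE, MAFE, MFTR and MCFD for P out-of-sample forecasts."""
    P = len(actual)
    pairs = list(zip(actual, forecast))
    return {
        "MSFE": sum((y - f) ** 2 for y, f in pairs) / P,
        "MAFE": sum(abs(y - f) for y, f in pairs) / P,
        # Trade in the forecast direction and earn the realised return.
        "MFTR": sum(sign(f) * y for y, f in pairs) / P,
        # Share of periods in which the forecast direction was correct.
        "MCFD": sum(1 for y, f in pairs if sign(f) * sign(y) > 0) / P,
    }

crit = forecast_criteria(
    actual=[0.01, -0.02, 0.03, -0.01],
    forecast=[0.02, 0.01, 0.01, -0.03],
)
```

Smaller MSFE and MAFE values indicate better statistical accuracy, whereas larger MFTR and MCFD values indicate better economic performance.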
Similar to Hong and Lee (2003), the two statistical criteria, mean squared forecast error and mean absolute forecast error (MSFE and MAFE), are complemented with two economic criteria, mean forecast trading return and mean correct forecast direction (MFTR and MCFD). Both MFTR and MCFD can be particularly informative to profit-maximizing investors. Because stock returns are volatile, forecast errors can be quite large from period to period, and the statistical accuracy of forecasts (as measured by MSFE and MAFE) does not necessarily imply economic accuracy in terms of maximizing investor profits. Investors may base their trading decisions on maximizing profits rather than minimizing forecasting errors. Furthermore, accurate forecasts of the direction of price changes may be equally important or even more important to investors than the magnitude of the changes, as they can be easily translated into profits. Granger (1992) emphasizes that, in this case, it is also desirable to compute economic measures of forecast accuracy, e.g., MFTR and MCFD. Many other authors (e.g., Leitch and Tanner, 1991; Hong and Lee, 2003) have made similar points in the context of forecasting asset prices. Hence, the use of multiple
criteria in this study provides a more comprehensive perspective on the predictability of stock returns.

Table 1
Forecast evaluation results for developed markets – MSFE.

| MSFE | AU | CA | GE | IT | JP | SW | NE | SP | FR | UK | US |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Benchmark | 1.175 | 1.021 | 1.341 | 0.933 | 1.837 | 1.081 | 1.086 | 1.067 | 0.998 | 0.844 | 0.492 |
| AR(1) | 1.039 | 1.017 | 0.994 | 0.995 | 1.005 | 0.973 | 1.001 | 1.000 | 0.996 | 0.990 | 0.998 |
| P1 | 1.00 | 1.00 | 0.04 | 0.08 | 0.79 | 0.00 | 0.64 | 0.46 | 0.06 | 0.18 | 0.18 |
| P2 | 1.00 | 0.99 | 0.05 | 0.09 | 0.78 | 0.00 | 0.67 | 0.48 | 0.06 | 0.16 | 0.18 |
| GARCH(1, 1) | 0.998 | 0.998 | 0.996 | 0.998 | 1.000 | 0.998 | 0.997 | 0.998 | 0.998 | 0.998 | 0.999 |
| P1 | 0.30 | 0.23 | 0.12 | 0.21 | 0.35 | 0.12 | 0.08 | 0.19 | 0.21 | 0.11 | 0.36 |
| P2 | 0.59 | 0.50 | 0.08 | 0.11 | 0.68 | 0.00 | 0.27 | 0.45 | 0.10 | 0.16 | 0.35 |
| NN(1, 5) | 1.007 | 1.001 | 1.001 | 1.018 | 1.014 | 1.053 | 1.004 | 0.997 | 1.030 | 1.034 | 1.052 |
| P1 | 0.71 | 0.53 | 0.53 | 0.97 | 0.97 | 1.00 | 0.68 | 0.32 | 1.00 | 0.99 | 1.00 |
| P2 | 0.75 | 0.72 | 0.34 | 0.38 | 0.85 | 0.03 | 0.55 | 0.59 | 0.42 | 0.37 | 0.65 |
| FC(1, 200) | 1.043 | 1.012 | 1.027 | 1.006 | 1.010 | 1.001 | 1.016 | 1.017 | 1.009 | 1.026 | 1.006 |
| P1 | 1.00 | 0.96 | 0.99 | 0.73 | 0.84 | 0.53 | 0.97 | 0.97 | 0.91 | 0.92 | 0.91 |
| P2 | 0.82 | 0.79 | 0.57 | 0.55 | 0.90 | 0.04 | 0.68 | 0.69 | 0.56 | 0.53 | 0.74 |
| NP(200, 400) | 1.018 | 1.006 | 1.001 | 1.013 | 1.008 | 0.988 | 1.005 | 1.004 | 1.008 | 1.006 | 1.004 |
| P1 | 0.99 | 0.88 | 0.62 | 0.98 | 0.85 | 0.14 | 0.92 | 0.91 | 0.98 | 0.73 | 0.83 |
| P2 | 0.86 | 0.83 | 0.59 | 0.64 | 0.95 | 0.04 | 0.70 | 0.72 | 0.61 | 0.56 | 0.81 |
| Combined I | 1.001 | 0.998 | 0.990 | 0.998 | 0.999 | 0.978 | 0.998 | 0.997 | 1.001 | 0.993 | 1.002 |
| P1 | 0.49 | 0.31 | 0.02 | 0.30 | 0.39 | 0.00 | 0.29 | 0.16 | 0.59 | 0.22 | 0.76 |
| P2 | 0.86 | 0.82 | 0.34 | 0.64 | 0.90 | 0.04 | 0.71 | 0.63 | 0.62 | 0.56 | 0.81 |
| Combined II | 0.999 | 0.995 | 0.990 | 0.997 | 0.997 | 0.979 | 0.996 | 0.996 | 0.999 | 0.991 | 1.000 |
| P1 | 0.44 | 0.11 | 0.00 | 0.11 | 0.24 | 0.01 | 0.07 | 0.06 | 0.33 | 0.11 | 0.52 |
| P2 | 0.86 | 0.62 | 0.33 | 0.64 | 0.84 | 0.04 | 0.65 | 0.55 | 0.62 | 0.56 | 0.81 |

Notes: (1) The data are daily data from April 1, 1996 to August 25, 2006. (2) P1 is the bootstrap p-value for comparing a single model with the martingale model (the benchmark model) using White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models with the martingale model, where the null hypothesis is that the best of the first k models has no superior predictive power over the martingale model. (3) AR, NN, FC, NP are various models under consideration. For the benchmark model, the MSFEs are in levels ($\times 10^{-4}$). For all other models, they are MSFE ratios relative to that of the benchmark model. The smaller the MSFE, the better the predictive ability of a model.

Table 2
Forecast evaluation results for developed markets – MAFE.

| MAFE | AU | CA | GE | IT | JP | SW | NE | SP | FR | UK | US |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Benchmark | 0.838 | 0.792 | 0.904 | 0.764 | 1.062 | 0.814 | 0.799 | 0.787 | 0.779 | 0.708 | 0.548 |
| AR(1) | 1.008 | 1.006 | 0.997 | 0.997 | 0.998 | 0.985 | 1.001 | 0.999 | 0.999 | 0.991 | 1.000 |
| P1 | 0.90 | 0.93 | 0.09 | 0.08 | 0.28 | 0.01 | 0.63 | 0.38 | 0.19 | 0.07 | 0.46 |
| P2 | 0.89 | 0.94 | 0.10 | 0.08 | 0.29 | 0.00 | 0.64 | 0.41 | 0.19 | 0.06 | 0.46 |
| GARCH(1, 1) | 0.996 | 0.996 | 0.997 | 0.999 | 1.000 | 0.999 | 0.999 | 0.999 | 1.000 | 0.999 | 0.996 |
| P1 | 0.02 | 0.01 | 0.04 | 0.20 | 0.33 | 0.13 | 0.18 | 0.38 | 0.41 | 0.32 | 0.01 |
| P2 | 0.29 | 0.13 | 0.10 | 0.08 | 0.29 | 0.00 | 0.44 | 0.60 | 0.33 | 0.06 | 0.02 |
| NN(1, 5) | 1.003 | 1.000 | 1.003 | 1.014 | 1.008 | 1.034 | 1.004 | 1.001 | 1.016 | 1.020 | 1.034 |
| P1 | 0.63 | 0.52 | 0.68 | 1.00 | 0.96 | 1.00 | 0.78 | 0.62 | 1.00 | 1.00 | 1.00 |
| P2 | 0.47 | 0.33 | 0.37 | 0.31 | 0.50 | 0.02 | 0.68 | 0.77 | 0.62 | 0.14 | 0.35 |
| FC(1, 200) | 1.014 | 1.002 | 1.011 | 1.004 | 1.002 | 0.995 | 1.014 | 1.008 | 1.006 | 1.005 | 1.001 |
| P1 | 0.99 | 0.66 | 0.97 | 0.81 | 0.66 | 0.23 | 1.00 | 0.96 | 0.96 | 0.74 | 0.71 |
| P2 | 0.53 | 0.42 | 0.56 | 0.47 | 0.65 | 0.03 | 0.78 | 0.84 | 0.72 | 0.24 | 0.39 |
| NP(200, 400) | 1.008 | 1.004 | 1.000 | 1.005 | 1.003 | 0.996 | 1.003 | 1.002 | 1.006 | 1.000 | 1.006 |
| P1 | 0.99 | 0.91 | 0.54 | 0.90 | 0.74 | 0.27 | 0.89 | 0.90 | 0.98 | 0.54 | 0.99 |
| P2 | 0.55 | 0.44 | 0.60 | 0.54 | 0.73 | 0.04 | 0.83 | 0.88 | 0.79 | 0.25 | 0.42 |
| Combined I | 1.000 | 1.000 | 0.994 | 0.999 | 0.997 | 0.988 | 0.999 | 0.999 | 1.001 | 0.992 | 1.002 |
| P1 | 0.51 | 0.46 | 0.02 | 0.31 | 0.19 | 0.01 | 0.32 | 0.28 | 0.76 | 0.07 | 0.93 |
| P2 | 0.55 | 0.44 | 0.30 | 0.54 | 0.68 | 0.04 | 0.83 | 0.81 | 0.79 | 0.25 | 0.42 |
| Combined II | 0.997 | 0.997 | 0.994 | 0.998 | 0.998 | 0.988 | 0.998 | 0.999 | 1.000 | 0.992 | 1.000 |
| P1 | 0.10 | 0.07 | 0.01 | 0.12 | 0.18 | 0.01 | 0.12 | 0.20 | 0.64 | 0.05 | 0.55 |
| P2 | 0.55 | 0.44 | 0.29 | 0.54 | 0.68 | 0.04 | 0.72 | 0.77 | 0.79 | 0.25 | 0.42 |

Notes: (1) The data are daily data from April 1, 1996 to August 25, 2006. (2) P1 is the bootstrap p-value for comparing a single model with the martingale model (the benchmark model) using White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models with the martingale model, where the null hypothesis is that the best of the first k models has no superior predictive power over the martingale model. (3) AR, NN, FC, NP are various models under consideration. For the benchmark model, the MAFEs are in levels ($\times 10^{-2}$). For all other models, they are MAFE ratios relative to that of the benchmark model. The smaller the MAFE, the better the predictive ability of a model.
As mentioned above, it is important to have an adequately
large number of observations to estimate the model
parameters efficiently. In other words, the size of R must be reasonably large.
On the other hand, the size of P must also be large enough to detect
the differences in forecasting performance across models. Given
the number of observations in our data (N = 2619 and N = 1924
for developed and most emerging markets, respectively), an appropriate or balanced choice for R can be expressed by the ratio
R:P = 2:1 (see footnote 10).
10. We also conducted the analysis based on the ratio R:P = 1:1. The results are
qualitatively similar and available upon request.
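The sample split implied by R:P = 2:1 can be illustrated with a short sketch. The helper name `split_sample` and the floor rounding are assumptions made here for illustration; the text does not state the exact values of R and P.

```python
def split_sample(n_obs, r_parts=2, p_parts=1):
    """Split n_obs observations into an estimation sample R and a forecast sample P.

    The ratio R:P is r_parts:p_parts. Floor rounding of R is an assumption.
    """
    r = n_obs * r_parts // (r_parts + p_parts)
    p = n_obs - r
    return r, p


if __name__ == "__main__":
    # Sample sizes quoted in the text for developed and most emerging markets.
    for label, n in [("developed", 2619), ("emerging", 1924)]:
        r, p = split_sample(n)
        print(f"{label}: N = {n}, R = {r}, P = {p}")
```

Under these assumptions, roughly one third of each sample is reserved for out-of-sample evaluation.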
Tables 1–4 report the results for the developed markets and Tables 5–8 report the results for the emerging markets. Each table
covers one of the forecasting evaluation criteria, in the order presented above. For example, Table 1 reports the out-of-sample forecast results using the MSFE for the eleven developed countries
under consideration. All forecast results are based on an R:P ratio
(regression sample length to total out-of-sample forecast length) equal to
2:1. Each table also contains two distinct p-values, P1 and P2,
based on White’s (2000) Reality Check test. White’s (2000) test
addresses the dangerous practice of data-snooping, or data
re-usage, for the purpose of inference. He constructs a method for
testing the hypothesis that the best model encountered during a
specification search has no predictive superiority over the benchmark model. His method permits data-snooping to
be undertaken with some degree of confidence that one will not
mistake results generated by chance for genuinely “good” results.
Table 3
Forecast evaluation results for developed markets – MFTR.
               AU     CA     GE     IT     JP     SW     NE     SP     FR     UK     US
Benchmark      0.090  0.094  0.017  0.075  0.019  0.019  0.053  0.086  0.075  0.053  0.026
AR(1)          0.003  0.017  0.083  0.087  0.028  0.162  0.012  0.033  0.064  0.055  0.033
  P1           0.98   1.00   0.04   0.37   0.21   0.01   0.09   0.91   0.63   0.50   0.40
  P2           0.98   1.00   0.03   0.37   0.23   0.00   0.10   0.90   0.67   0.47   0.42
GARCH(1, 1)    0.090  0.094  0.080  0.075  0.032  0.073  0.072  0.086  0.075  0.068  0.036
  P1           1.00   1.00   0.01   1.00   0.66   0.01   0.02   1.00   1.00   0.06   0.26
  P2           0.50   0.51   0.04   0.37   0.29   0.00   0.03   0.50   0.53   0.40   0.51
NN(1, 5)       0.054  0.022  0.003  0.018  0.001  0.055  0.025  0.040  0.011  0.034  0.031
  P1           0.77   0.91   0.37   0.95   0.36   0.90   0.06   0.89   0.96   0.95   0.93
  P2           0.68   0.69   0.08   0.57   0.42   0.01   0.03   0.72   0.72   0.57   0.64
FC(1, 200)     0.022  0.042  0.017  0.045  0.037  0.112  0.026  0.015  0.012  0.053  0.001
  P1           0.99   0.96   0.22   0.78   0.18   0.03   0.26   0.98   0.96   0.47   0.83
  P2           0.75   0.77   0.09   0.64   0.42   0.01   0.03   0.77   0.79   0.61   0.70
NP(200, 400)   0.052  0.090  0.051  0.008  0.026  0.107  0.050  0.048  0.029  0.034  0.017
  P1           0.93   0.57   0.71   0.96   0.25   0.06   0.51   0.87   0.80   0.66   0.86
  P2           0.82   0.84   0.11   0.69   0.45   0.01   0.03   0.82   0.81   0.64   0.74
Combined I     0.025  0.047  0.081  0.044  0.041  0.138  0.011  0.088  0.068  0.066  0.035
  P1           0.96   0.87   0.04   0.77   0.17   0.01   0.08   0.48   0.58   0.39   0.41
  P2           0.84   0.86   0.12   0.70   0.44   0.02   0.04   0.83   0.83   0.65   0.75
Combined II    0.089  0.086  0.099  0.073  0.038  0.135  0.027  0.069  0.053  0.067  0.034
  P1           0.54   0.62   0.02   0.52   0.17   0.01   0.03   0.71   0.78   0.37   0.37
  P2           0.87   0.88   0.07   0.73   0.44   0.02   0.04   0.85   0.84   0.66   0.76
Notes: (1) The data are daily data from April 1, 1996 to August 25, 2006. (2) P1 is the bootstrap p-value for comparing a single model with the martingale model (the
benchmark model) using White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing parameter q = 0.75. P2 is the bootstrap reality check p-value for
comparing k models with the martingale model, where the null hypothesis is that the best of the first k models has no superior predictive power over the martingale model.
(3) AR, NN, FC, NP are various models under considerations. The larger MFTR, the better predictive ability of a model.
Table 4
Forecast evaluation results for developed markets – MCFD.
               AU     CA     GE     IT     JP     SW     NE     SP     FR     UK     US
Benchmark      0.552  0.557  0.491  0.523  0.494  0.502  0.443  0.523  0.512  0.497  0.539
AR(1)          0.522  0.497  0.522  0.520  0.513  0.558  0.481  0.507  0.498  0.518  0.519
  P1           0.90   1.00   0.09   0.56   0.20   0.01   0.03   0.78   0.77   0.20   0.79
  P2           0.91   1.00   0.08   0.53   0.19   0.01   0.04   0.77   0.75   0.20   0.81
GARCH(1, 1)    0.552  0.557  0.530  0.523  0.492  0.522  0.513  0.523  0.512  0.506  0.549
  P1           1.00   1.00   0.02   1.00   0.54   0.04   0.01   1.00   1.00   0.09   0.20
  P2           0.50   0.49   0.07   0.48   0.23   0.01   0.01   0.47   0.49   0.20   0.43
NN(1, 5)       0.524  0.499  0.498  0.466  0.483  0.453  0.491  0.498  0.494  0.483  0.470
  P1           0.90   0.99   0.38   0.98   0.66   0.96   0.02   0.89   0.74   0.68   1.00
  P2           0.71   0.66   0.11   0.68   0.36   0.03   0.01   0.69   0.68   0.33   0.56
FC(1, 200)     0.507  0.528  0.502  0.507  0.508  0.532  0.453  0.481  0.487  0.511  0.516
  P1           0.98   0.95   0.29   0.76   0.27   0.13   0.32   0.96   0.91   0.29   0.87
  P2           0.76   0.74   0.12   0.73   0.44   0.04   0.01   0.76   0.75   0.37   0.62
NP(200, 400)   0.527  0.540  0.494  0.477  0.501  0.528  0.460  0.508  0.478  0.509  0.452
  P1           0.96   0.86   0.41   0.97   0.42   0.15   0.17   0.78   0.89   0.33   1.00
  P2           0.82   0.81   0.14   0.77   0.47   0.04   0.01   0.81   0.79   0.40   0.66
Combined I     0.508  0.507  0.532  0.491  0.513  0.540  0.486  0.513  0.498  0.522  0.507
  P1           0.99   0.99   0.04   0.91   0.21   0.05   0.02   0.69   0.74   0.16   0.90
  P2           0.84   0.83   0.14   0.78   0.48   0.04   0.01   0.82   0.81   0.36   0.67
Combined II    0.549  0.543  0.525  0.511  0.512  0.540  0.493  0.507  0.493  0.522  0.533
  P1           0.58   0.79   0.05   0.75   0.22   0.06   0.00   0.87   0.87   0.15   0.61
  P2           0.87   0.85   0.15   0.80   0.48   0.04   0.01   0.84   0.83   0.37   0.69
Notes: (1) The data are daily data from April 1, 1996 to August 25, 2006. (2) P1 is the bootstrap p-value for comparing a single model with the martingale model (the
benchmark model) using White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing parameter q = 0.75. P2 is the bootstrap reality check p-value for
comparing k models with the martingale model, where the null hypothesis is that the best of the first k models has no superior predictive power over the martingale model.
(3) AR, NN, FC, NP are various models under considerations. The larger MCFD, the better predictive ability of a model.
Table 5
Forecast evaluation results for emerging markets – MSFE.
               HK     MA     SI     TW     MX     SK     BR
Benchmark      1.139  0.845  1.228  1.926  2.216  2.433  4.590
AR(1)          0.985  0.991  0.989  1.014  1.006  1.001  1.000
  P1           0.12   0.07   0.24   0.89   0.99   0.61   0.48
  P2           0.09   0.07   0.22   0.92   0.97   0.61   0.47
GARCH(1, 1)    0.998  1.000  0.997  1.001  0.996  1.000  0.998
  P1           0.17   0.30   0.12   0.78   0.16   0.52   0.31
  P2           0.09   0.07   0.22   0.90   0.21   0.74   0.53
NN(1, 5)       1.012  1.003  1.052  1.050  1.012  1.020  1.023
  P1           0.79   0.64   1.00   0.99   0.89   0.93   0.84
  P2           0.24   0.21   0.45   0.96   0.49   0.88   0.76
FC(1, 200)     0.986  0.990  0.994  1.012  1.019  1.002  1.002
  P1           0.19   0.18   0.35   0.91   0.97   0.62   0.58
  P2           0.34   0.32   0.49   0.97   0.67   0.95   0.81
NP(200, 400)   1.000  0.990  1.002  1.008  1.002  1.007  1.002
  P1           0.52   0.10   0.60   0.82   0.68   0.86   0.64
  P2           0.34   0.34   0.53   0.98   0.72   0.98   0.83
Combined I     0.977  0.987  0.987  1.006  0.999  0.998  0.997
  P1           0.02   0.02   0.15   0.76   0.33   0.26   0.32
  P2           0.15   0.20   0.47   0.98   0.72   0.90   0.75
Combined II    0.978  0.988  0.986  1.001  0.997  0.998  0.995
  P1           0.01   0.03   0.08   0.56   0.14   0.22   0.18
  P2           0.15   0.20   0.45   0.98   0.72   0.90   0.67
Notes: (1) The data are daily data from January 4, 1999 to August 25, 2006 for most
of the emerging markets under consideration. (2) P1 is the bootstrap p-value for
comparing a single model with the martingale model (the benchmark model) using
White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing
parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models
with the martingale model, where the null hypothesis is that the best of the first k
models has no superior predictive power over the martingale model. (3) AR, NN, FC,
NP are various models under considerations. The smaller MSFE, the better predictive ability of a model.
Table 7
Forecast evaluation results for emerging markets – MFTR.
               HK     MA     SI     TW     MX     SK     BR
Benchmark      0.016  0.024  0.018  0.033  0.147  0.126  0.158
AR(1)          0.068  0.072  0.106  0.016  0.068  0.137  0.183
  P1           0.07   0.14   0.06   0.28   0.98   0.36   0.33
  P2           0.05   0.12   0.06   0.28   0.99   0.38   0.32
GARCH(1, 1)    0.085  0.052  0.079  0.021  0.147  0.125  0.158
  P1           0.03   0.11   0.07   0.09   0.00   0.78   0.00
  P2           0.04   0.14   0.07   0.32   0.49   0.38   0.32
NN(1, 5)       0.009  0.035  0.004  0.076  0.013  0.025  0.046
  P1           0.44   0.88   0.64   0.67   0.97   0.86   0.89
  P2           0.06   0.29   0.12   0.46   0.73   0.68   0.57
FC(1, 200)     0.094  0.046  0.057  0.034  0.040  0.093  0.141
  P1           0.01   0.35   0.25   0.22   0.94   0.71   0.55
  P2           0.05   0.32   0.13   0.41   0.82   0.78   0.67
NP(200, 400)   0.050  0.063  0.066  0.084  0.144  0.110  0.158
  P1           0.10   0.18   0.22   0.72   0.55   0.81   0.46
  P2           0.05   0.36   0.16   0.46   0.89   0.85   0.71
Combined I     0.107  0.059  0.096  0.004  0.115  0.098  0.218
  P1           0.02   0.21   0.11   0.36   0.84   0.81   0.25
  P2           0.03   0.37   0.17   0.48   0.92   0.87   0.54
Combined II    0.111  0.072  0.121  0.034  0.155  0.121  0.172
  P1           0.01   0.14   0.04   0.49   0.31   0.56   0.40
  P2           0.02   0.37   0.11   0.49   0.87   0.89   0.55
Notes: (1) The data are daily data from January 4, 1999 to August 25, 2006 for most
of the emerging markets under consideration. (2) P1 is the bootstrap p-value for
comparing a single model with the martingale model (the benchmark model) using
White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing
parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models
with the martingale model, where the null hypothesis is that the best of the first k
models has no superior predictive power over the martingale model. (3) AR, NN, FC,
NP are various models under considerations. The larger MFTR, the better predictive
ability of a model.
Table 6
Forecast evaluation results for emerging markets – MAFE.
               HK     MA     SI     TW     MX     SK     BR
Benchmark      0.817  0.710  0.852  1.078  1.122  1.222  1.665
AR(1)          0.990  0.992  0.986  1.011  1.005  1.000  1.002
  P1           0.09   0.02   0.04   0.95   1.00   0.52   0.77
  P2           0.08   0.02   0.05   0.95   0.99   0.49   0.79
GARCH(1, 1)    1.001  1.001  0.997  1.002  0.994  0.999  0.996
  P1           0.61   0.88   0.03   0.90   0.01   0.23   0.06
  P2           0.08   0.02   0.05   0.95   0.02   0.43   0.10
NN(1, 5)       1.008  1.002  1.028  1.030  1.005  1.014  1.013
  P1           0.80   0.63   0.99   0.99   0.86   0.95   0.93
  P2           0.18   0.06   0.19   0.98   0.13   0.68   0.43
FC(1, 200)     0.987  0.990  0.989  1.010  1.009  1.001  1.002
  P1           0.11   0.05   0.07   0.94   0.94   0.67   0.62
  P2           0.16   0.06   0.20   0.99   0.27   0.79   0.58
NP(200, 400)   1.000  0.994  1.002  1.007  0.999  1.002  0.997
  P1           0.55   0.12   0.61   0.82   0.31   0.78   0.22
  P2           0.16   0.08   0.22   0.99   0.28   0.87   0.60
Combined I     0.986  0.991  0.990  1.007  0.999  0.999  0.999
  P1           0.02   0.01   0.05   0.90   0.34   0.37   0.41
  P2           0.14   0.08   0.22   0.99   0.28   0.87   0.60
Combined II    0.987  0.992  0.990  1.004  0.997  0.999  0.998
  P1           0.01   0.01   0.05   0.83   0.07   0.26   0.22
  P2           0.14   0.08   0.22   0.99   0.28   0.87   0.60
Notes: (1) The data are daily data from January 4, 1999 to August 25, 2006 for most
of the emerging markets under consideration. (2) P1 is the bootstrap p-value for
comparing a single model with the martingale model (the benchmark model) using
White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing
parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models
with the martingale model, where the null hypothesis is that the best of the first k
models has no superior predictive power over the martingale model. (3) AR, NN, FC,
and NP are the various models under consideration. The smaller the MAFE, the better the predictive ability of a model.
Table 8
Forecast evaluation results for emerging markets – MCFD.
               HK     MA     SI     TW     MX     SK     BR
Benchmark      0.476  0.479  0.498  0.477  0.561  0.549  0.564
AR(1)          0.523  0.505  0.544  0.474  0.526  0.545  0.557
  P1           0.03   0.09   0.03   0.52   0.99   0.62   0.70
  P2           0.04   0.09   0.03   0.51   0.99   0.62   0.70
GARCH(1, 1)    0.498  0.490  0.526  0.481  0.561  0.549  0.564
  P1           0.13   0.13   0.02   0.34   1.00   0.32   1.00
  P2           0.05   0.09   0.03   0.58   0.48   0.64   0.49
NN(1, 5)       0.477  0.463  0.491  0.459  0.517  0.512  0.515
  P1           0.47   0.68   0.58   0.68   0.99   0.89   1.00
  P2           0.09   0.22   0.07   0.72   0.69   0.81   0.72
FC(1, 200)     0.533  0.505  0.531  0.497  0.510  0.536  0.548
  P1           0.01   0.13   0.06   0.22   0.98   0.77   0.73
  P2           0.04   0.26   0.08   0.46   0.80   0.89   0.81
NP(200, 400)   0.514  0.498  0.505  0.477  0.557  0.542  0.560
  P1           0.05   0.20   0.40   0.49   0.62   0.88   0.71
  P2           0.05   0.32   0.10   0.51   0.87   0.95   0.86
Combined I     0.533  0.505  0.540  0.479  0.542  0.532  0.564
  P1           0.02   0.14   0.05   0.46   0.94   0.94   0.50
  P2           0.05   0.34   0.11   0.52   0.89   0.96   0.88
Combined II    0.537  0.510  0.551  0.481  0.563  0.545  0.566
  P1           0.00   0.07   0.01   0.42   0.35   0.72   0.38
  P2           0.04   0.23   0.06   0.53   0.88   0.97   0.84
Notes: (1) The data are daily data from January 4, 1999 to August 25, 2006 for most
of the emerging markets under consideration. (2) P1 is the bootstrap p-value for
comparing a single model with the martingale model (the benchmark model) using
White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing
parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models
with the martingale model, where the null hypothesis is that the best of the first k
models has no superior predictive power over the martingale model. (3) AR, NN, FC,
and NP are the various models under consideration. The larger the MCFD, the better the predictive ability of a model.
For our purpose, P1 is the bootstrap p-value for comparing a single
model to the benchmark model, which is the martingale model
$Y_t = \mu + \varepsilon_t$. P2 is the bootstrap reality check p-value for comparing
the k models to the benchmark model. The value for P2 in the table
is the bootstrap reality check p-value for the null hypothesis that
the best of the first k models has no superior predictive ability over
the benchmark model. Of course, the last P2 value (in the last row
of the table) checks whether the best of all the models under consideration has superior predictive ability over the martingale model.
The difference between each P1 and the last P2 gives an estimate
of data-snooping bias. Sullivan et al. (1999) and Qi and Wu
(2006) used White’s methodology to examine the data-snooping issue in technical trading rules.
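The way P1 and P2 arise from one set of bootstrap draws can be sketched as follows. This is a minimal illustration in the spirit of White’s (2000) Reality Check, not the authors’ code. It assumes a stationary bootstrap in which q is the probability of starting a new block, and it takes as input a hypothetical array `f` of performance differentials relative to the benchmark (positive values mean the model did better in that period). The function names and the shared-draw design are choices made here.

```python
import numpy as np


def stationary_bootstrap_indices(n, q, rng):
    """Draw n resampled time indices; with probability q a new block starts."""
    idx = np.empty(n, dtype=int)
    idx[0] = rng.integers(n)
    for t in range(1, n):
        if rng.random() < q:
            idx[t] = rng.integers(n)          # start a new block at a random date
        else:
            idx[t] = (idx[t - 1] + 1) % n     # continue the current block, wrapping
    return idx


def reality_check_pvalues(f, n_boot=1000, q=0.75, seed=0):
    """Return (p1, p2) for a (K, P) array of performance differentials.

    p1[k]: bootstrap p-value for model k alone against the benchmark.
    p2[k]: reality check p-value for the best of the first k + 1 models.
    """
    f = np.asarray(f, dtype=float)
    n_models, n_periods = f.shape
    rng = np.random.default_rng(seed)
    f_bar = f.mean(axis=1)
    stat_single = np.sqrt(n_periods) * f_bar
    stat_best = np.maximum.accumulate(stat_single)
    exceed_single = np.zeros(n_models)
    exceed_best = np.zeros(n_models)
    for _ in range(n_boot):
        idx = stationary_bootstrap_indices(n_periods, q, rng)
        # Recentred bootstrap statistic for every model.
        boot = np.sqrt(n_periods) * (f[:, idx].mean(axis=1) - f_bar)
        exceed_single += boot > stat_single
        exceed_best += np.maximum.accumulate(boot) > stat_best
    return exceed_single / n_boot, exceed_best / n_boot
```

Because the sketch reuses the same draws for both statistics, `p1[0]` equals `p2[0]` exactly; the small differences between P1 and P2 for the first model in the tables suggest the reported values came from separate bootstrap runs.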
Tables 1 and 2 report the results for the 11 developed markets using the
statistical criteria MSFE and MAFE. For the benchmark model, the
MSFE and MAFE are in levels (×10^-4 and ×10^-2, respectively). For
all other models, they are ratios relative to that of the benchmark model. For Table 1, the results show that, except for Spain
with the NN(1, 5) model and Switzerland with the NP(200, 400)
model, all MSFE ratios for the three nonlinear-in-mean models
(NN(1, 5), FC(1, 200) and NP(200, 400)) are above 1. Therefore, none
of the nonlinear-in-mean models outperforms the benchmark.
These findings are consistent with previous studies (e.g., Hsieh
(1991)) that show poor forecasting performance of nonlinear-in-mean models relative to the benchmark martingale model in
terms of statistical criteria. On the other hand, when evaluated
alone, each of the remaining four models (AR(1), GARCH(1, 1),
and the two combinations) in some cases reveals predictive ability superior to that of the benchmark. Note that the Combined II forecasts pool forecasts from all individual models: AR(1),
GARCH(1, 1), NN(1, 5), FC(1, 200) and NP(200, 400), while the Combined I forecasts exclude the forecast of the GARCH(1, 1) model.
Based on the MSFE criterion and the P1 statistics, the AR(1) and
the Combined II models show the most forecasting power, as they
are able to beat the martingale model for four out of the 11 countries. Note that the Combined II forecasts perform better than the
Combined I forecasts. This result suggests
the importance of using GARCH models to allow for nonlinearity
in volatility. The superiority of these four models (albeit moderate)
as measured by the MSFE can be seen most clearly in the case of
Switzerland. All four models are able to beat the benchmark at
the 5% level of significance (except for the GARCH model, which
has a P1 value of 12%). However, with allowance for data-snooping
bias, the P2 in the last row suggests that the best forecasting model
among the 7 models is no better than the martingale model, except
for Switzerland, where the AR(1) model clearly beats the benchmark
model.
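The ratio-based reporting used in Tables 1 and 2 can be sketched in a few lines. The function names are illustrative, and the equal-weight average used for the combined forecast is an assumption made here, since this section does not state the pooling weights.

```python
import numpy as np


def msfe(actual, forecast):
    """Mean squared forecast error."""
    err = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    return float(np.mean(err ** 2))


def mafe(actual, forecast):
    """Mean absolute forecast error."""
    err = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs(err)))


def ratio_to_benchmark(loss_fn, actual, forecast, benchmark_forecast):
    """Loss of a model divided by the loss of the benchmark; below 1 favours the model."""
    return loss_fn(actual, forecast) / loss_fn(actual, benchmark_forecast)


def combine_equal_weight(forecasts):
    """Pool several forecast series by simple averaging (an assumed weighting)."""
    return np.mean(np.vstack(forecasts), axis=0)
```

A ratio below 1 only indicates a lower average loss in the sample; whether the improvement is statistically meaningful is what the P1 and P2 values address.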
The results obtained using the MAFE as the evaluation criterion
(Table 2) are very similar to those for the MSFE. The Combined II
model, when evaluated as a single model, shows forecasting ability superior to that of the benchmark for five countries, which is mostly
attributable to the AR(1), the GARCH(1, 1), or both. All three
nonlinear-in-mean models fail to outperform the martingale model for all the markets. The GARCH models, however, show better
predictive ability when evaluated by the MAFE than by the
MSFE. Nevertheless, with further allowance for data-snooping bias,
the apparently good performance of the Combined II model disappears, again with the only exception of Switzerland, where the
AR(1) model as the best model outperforms the benchmark at
the 5% level (with a P2 value of 0.04).
Tables 3 and 4 report the results using the economic criteria for
all developed countries. All results for these two measures are in
levels. The meaning of these results is straightforward. The MFTR
shows the daily profit (in percent) generated by the forecasts
of the model, and the MCFD shows the proportion of all directional
changes correctly predicted by the model. For example, in the case
of Switzerland, the AR(1) model generates a profit of 0.162% per
trading day on average (or equivalently 0.162% × 251 ≈ 40.7% per year with 251
trading days) during the out-of-sample period (before allowance
for transaction costs) and correctly predicts 55.8% of the directions
of changes, which is mostly attributable to the superior performance of the AR(1) model. The results based on the MFTR (Table
3) suggest some evidence of superior predictive ability for the 3
nonlinear-in-mean models (see footnote 11). The NN model generates statistically
significant profit (i.e., 0.025% per trading day) in the case of the Netherlands. The FC and nonparametric models are both able to beat the
predictive power of the benchmark model in the Swiss stock market.
However, for most other countries, the nonlinear-in-mean models
do not outperform the benchmark model. On the other hand, for
three countries, Germany, Switzerland, and the Netherlands, the results reveal that both AR(1) and GARCH(1, 1) are able to improve on
the forecasts of the martingale model. The numbers from the combined forecasts as well as the reality check test statistic P2 also confirm the superiority of the AR(1) and GARCH(1, 1) over the
benchmark model for those 3 countries.
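The two economic criteria and the annualisation quoted above can be sketched as follows. The formulas are a common way of defining a mean forecast trading return and a mean correct forecast direction, and they are assumptions here because the defining equations do not appear in this section. The function names are illustrative.

```python
import numpy as np


def mftr(actual, forecast):
    """Mean forecast trading return: long if the forecast is positive, short if negative."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.sign(forecast) * actual))


def mcfd(actual, forecast):
    """Share of periods in which the forecast and the realised return have the same sign."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.sign(forecast) * np.sign(actual) > 0))


def annualize_simple(daily_pct, trading_days=251):
    """Scale a daily percentage return to a yearly figure without compounding."""
    return daily_pct * trading_days


if __name__ == "__main__":
    # The Swiss AR(1) figure quoted in the text: 0.162% per day over 251 days.
    print(round(annualize_simple(0.162), 1))
```

The annualised figure is a simple multiple of the daily mean, which matches the 40.7% quoted in the text; a compounded return would be somewhat higher.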
The results based on the MCFD criterion are similar to those
based on the MFTR in that the 3 nonlinear-in-mean models generally cannot forecast the direction of the changes. Only the NN
model is able to outperform the benchmark in the Netherlands market, correctly forecasting directional changes in prices 49.1% of the
time, 4.8 percentage points more often than the martingale model. Again, for the
three countries, Germany, Switzerland, and the Netherlands, the results reveal that both AR(1) and GARCH(1, 1) are able to improve on
the forecasts of the martingale model. The numbers from the combined forecasts as well as the reality check P2 in the last row also
confirm the result.
Overall, there is very limited evidence for predictability based
on nonlinear-in-mean models. Among the 11 developed markets,
only 3 countries, Germany, Switzerland, and the Netherlands, show
strong predictability from the AR(1), GARCH(1, 1) and combined
models based on the four statistical and economic criteria. The results based on the statistical criteria for Germany and the Netherlands,
however, are not as strong as those for Switzerland because their reality check P2 values are insignificant.
The results for the seven emerging markets in Tables 5–8 are largely
similar to those of the developed markets. Using statistical evaluation criteria (see Tables 5 and 6), our findings suggest that even
without allowance for data-snooping bias, nonlinear-in-mean
models generally cannot outperform the benchmark, except that
the FC model for Malaysia and Singapore outperforms the benchmark based on the MAFE. The models that perform the best are
again the AR(1), GARCH(1, 1), Combined I, and Combined II (the
GARCH model, however, does not outperform the benchmark for
any country when measured by the MSFE). Furthermore, using the
MAFE (instead of the MSFE) as the evaluation criterion provides
stronger evidence of predictability in emerging markets. For example, the AR model is able to beat the benchmark in only one market
when measured by the MSFE. The predictive ability of this model,
however, significantly improves if we use the MAFE to measure
forecasting errors. Overall, the statistical evaluation criteria show
that without allowance for data-snooping bias, for up to four ETF
indices, Hong Kong, Malaysia, Singapore, and perhaps Mexico,
the Combined II model, driven mostly by the AR(1) or GARCH(1, 1)
model predictions, is able to outperform the benchmark. Again,
the allowance for data-snooping bias substantially changes the picture: the only last-row P2 below 10% is for Malaysia with the MAFE criterion.
11. Closely following Fama (1991) and Gencay (1998), we do not explicitly allow for
transaction costs in the evaluation of trading rule performance of various models.
Although there are surely positive information and trading costs, according to Fama
(1991), the researcher instead should focus on the more interesting task of laying out
the evidence on the adjustment of prices to various kinds of information (e.g., past
returns in this study). Also note that some evidence for nonlinear-in-mean predictability would be even weaker after this consideration of transaction costs, which
reinforces the main point of this study.
The economic evaluation criteria in Tables 7 and 8 show, as in
the case of developed countries, that nonlinear-in-mean models
do not outperform the benchmark except in a few cases. In the case
of Hong Kong, both the FC and NP models (each as a single model) outperform the benchmark under both the MFTR and MCFD criteria, while
only the FC model outperforms the benchmark under MCFD for Singapore. In this case, we also find some evidence of superior forecasting ability of the FC model over both the AR and GARCH models.
Still, the AR and GARCH models outperform the benchmark in
some markets. In particular, the AR model outperforms the benchmark for Hong Kong and Singapore based on both MFTR and MCFD,
and for Malaysia based on MCFD. When evaluated alone, the
GARCH model outperforms the benchmark in 5 out of 7 countries
based on MFTR. Overall, based on economic criteria, there remains
strong evidence after allowance for data-snooping bias (i.e., based
on last-row P2 values) of predictability for Hong Kong
and Singapore, in addition to Malaysia as suggested by one of the
statistical criteria (i.e., MAFE).
5. Conclusions
This study investigates the martingale behavior of eighteen
stock market index ETFs based on out-of-sample forecasts. In
addition to a linear model, this paper employs several popular nonlinear models to explore potential nonlinearity in asset returns more comprehensively. Using both statistical and economic
criteria, we find some evidence against the martingale hypothesis.
Among the 18 ETF stock indices, three out of 11 developed markets
(Germany, the Netherlands, and Switzerland) and three out of seven
emerging markets (Hong Kong, Singapore and Malaysia) show predictability in terms of either statistical or economic criteria, or
both. However, most of this evidence comes from the linear model
and the nonlinear-in-variance GARCH model, while the popular
nonlinear-in-mean models (neural network, semiparametric functional coefficient model, nonparametric kernel regression) generally do not help much. This finding confirms the in-sample
evidence of Hsieh (1991, 1993) and Harris and Kucukozmen
(2001) in the out-of-sample context, and it is in line with Moreno
and Olmeda (2007) but differs from others (e.g., Gencay, 1998,
1999; Hong and Lee, 2003; Yang et al., 2008). Certainly, differences among the financial markets under study might account for such different findings. It is also important to note that the allowance for
data-snooping bias using White’s Reality Check renders the apparently
strong predictability in many markets tenuous, and particularly undermines the otherwise impressive performance of forecast
combinations. Hence, the findings of the paper underscore the
importance of allowing for data-snooping in addition to the well-known overfitting problem of nonlinear models.
Finally, our study also contrasts with earlier works (e.g., Patro
and Wu, 2004) on international stock market predictability
using the variance ratio test. For example, Patro and Wu (2004)
(see their Table 2) show that ten out of the eighteen developed
markets exhibit in-sample (linear) daily return predictability. Our
results suggest that, despite a more thorough examination with nonlinear models and multiple evaluation criteria, once the counteracting effect of data-snooping bias is considered, the predictability of
daily international stock market indexes might not even be as
widespread as previously thought.
Acknowledgements
We thank Qi Li, Xiaojing Su, and particularly three anonymous referees and the editor Lorenzo Peccati for many helpful
comments.
Appendix A
Table A1
The summary of models.
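Name                        Models for $E(Y_t \mid I_{t-1})$ and $\mathrm{sign}[E(Y_t \mid I_{t-1})]$
Benchmark                   $E(Y_t \mid I_{t-1}) = \mu$
1. AR(d)                    $E(Y_t \mid I_{t-1}) = \beta_0 + \sum_{j=1}^{d} \beta_j Y_{t-j}$
2. GARCH(p, q)              $E(Y_t \mid I_{t-1}) = \mu$, where $\sigma_t^2 = \omega + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2 + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2$
3. NN(d, q)                 $E(Y_t \mid I_{t-1}) = \beta_0 + \sum_{j=1}^{d} \beta_j Y_{t-j} + \sum_{i=1}^{q} \delta_i G(\gamma_{0i} + \sum_{j=1}^{d} \gamma_{ji} Y_{t-j})$, $G(z) = (1 + e^{-z})^{-1}$
4. FC(d, L)                 $E(Y_t \mid I_{t-1}) = \alpha_0(U_t) + \sum_{j=1}^{d} \alpha_j(U_t) Y_{t-j}$, where $U_t = Y_{t-1} - L^{-1} \sum_{j=1}^{L} Y_{t-j}$
5. NP(k, m)                 $E(Y_t \mid I_{t-1}) = g(Y_{t-1}, Y_{t-2})$
6. Combined I (1, 3, 4, 5)  AR(1), NN(1, 5), FC(1, 200) and NP(200, 400)
7. Combined II (1–5)        AR(1), GARCH(1, 1), NN(1, 5), FC(1, 200) and NP(200, 400)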
Notes: The benchmark model is the martingale model. AR(d) is the autoregression
model. GARCH(p, q) is the generalized autoregressive conditional heteroskedasticity
model. NN (d, q) is the neural network model. FC is the functional coefficient model
of Cai et al. (2000). NP is the nonparametric model estimated by the kernel estimation approach. For NP(k, m) models the smoothing parameter h is used in nonparametric estimation for minimizing k period out-of-sample.
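The conditional-mean formulas in Table A1 can be made concrete with a short sketch. It only evaluates the AR(d) and NN(d, q) mean functions and the FC state variable for given parameter values; estimation is not shown. The function names and argument layout are illustrative assumptions, and the minus signs in the logistic function and in the FC state variable are the standard forms, which the extracted text does not show explicitly.

```python
import numpy as np


def logistic(z):
    """G(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def ar_mean(lags, beta0, beta):
    """AR(d) conditional mean: beta0 + sum_j beta_j * Y_{t-j}; lags[0] is Y_{t-1}."""
    return beta0 + float(np.dot(beta, lags))


def nn_mean(lags, beta0, beta, delta, gamma0, gamma):
    """NN(d, q) conditional mean: the AR part plus q logistic hidden units.

    gamma has shape (q, d); gamma0 and delta have length q.
    """
    hidden = logistic(np.asarray(gamma0) + np.asarray(gamma) @ np.asarray(lags))
    return ar_mean(lags, beta0, beta) + float(np.dot(delta, hidden))


def fc_state(y_hist, window):
    """FC state variable: Y_{t-1} minus the mean of the last `window` observations.

    y_hist is ordered from oldest to newest and ends at Y_{t-1}.
    """
    y_hist = np.asarray(y_hist, dtype=float)
    return float(y_hist[-1] - np.mean(y_hist[-window:]))
```

Setting all hidden-unit weights `delta` to zero reduces the NN mean to the AR mean, which shows how the neural network nests the linear model.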
References
Ahn, D., Boudoukh, J., Richardson, M., Whitelaw, R.F., 2002. Partial adjustment or
stale prices? Implications from stock index and futures return autocorrelations.
Review of Financial Studies 15, 655–689.
Cai, Z., Fan, J., Yao, Q., 2000. Functional-coefficient regression models for nonlinear
time series. Journal of the American Statistical Association 95, 941–956.
Campbell, J., Lo, A., MacKinlay, C., 1997. The Econometrics of Financial Markets.
Princeton University Press, Princeton, New Jersey.
Chaudhuri, K., Wu, Y., 2003. Random walk versus breaking trend in stock prices:
Evidence from emerging markets. Journal of Banking and Finance 27,
575–592.
Chordia, T., Roll, R., Subrahmanyam, A., 2005. Evidence on the speed of convergence
to market efficiency. Journal of Financial Economics 76, 271–292.
Fama, E.F., 1991. Efficient capital markets: II. Journal of Finance 46, 1575–
1617.
Gencay, R., 1998. The predictability of security returns with simple technical
trading rules. Journal of Empirical Finance 5, 347–359.
Gencay, R., 1999. Linear, nonlinear and essential foreign exchange rate prediction
with simple trading rules. Journal of International Economics 47, 91–107.
Gleason, K.C., Mathur, I., Peterson, M.A., 2004. Analysis of intraday herding behavior
among the sector ETFs. Journal of Empirical Finance 11, 681–694.
Granger, C.W.J., 1992. Forecasting stock market prices: Lessons for forecasters.
International Journal of Forecasting 8, 3–13.
Harris, R.D.F., Kucukozmen, C.C., 2001. Linear and nonlinear dependence in Turkish
equity returns and its consequences for financial risk management. European
Journal of Operational Research 134, 481–492.
Hong, Y.M., Lee, T.H., 2003. Inference on predictability of foreign exchange rates via
generalized spectrum and nonlinear time series models. Review of Economics
and Statistics 85, 1048–1062.
Hsieh, D.A., 1991. Chaos and nonlinear dynamics: Application to financial markets.
Journal of Finance 46, 1839–1877.
Hsieh, D.A., 1993. Implications of nonlinear dynamics for financial risk
management. Journal of Financial and Quantitative Analysis 28, 41–64.
Kim, E.H., Singal, V., 2000. Stock market openings: Experience of emerging
economies. Journal of Business 73, 25–66.
Lee, T.H., White, H., Granger, C.W.J., 1993. Testing for neglected nonlinearity in time
series models: A comparison of neural network methods and alternative tests.
Journal of Econometrics 56, 269–290.
Leitch, G., Tanner, E., 1991. Economic forecast evaluation: Profits versus
conventional error measures. American Economic Review 81, 580–590.
Lo, A.W., MacKinlay, A.C., 1988. Stock market prices do not follow random walks:
Evidence from a simple specification test. Review of Financial Studies 1,
41–66.
McQueen, G., Thorley, S., 1991. Are stock returns predictable? A test using Markov
chains. Journal of Finance 46, 239–263.
Monoyios, M., Sarno, L., 2002. Mean reversion in stock index futures markets: A
nonlinear analysis. Journal of Futures Markets 22, 285–314.
Moreno, D., Olmeda, I., 2007. Is the predictability of emerging and developed stock
markets really exploitable? European Journal of Operational Research 182, 436–
454.
Patro, D.K., Wu, Y., 2004. Predictability of short-horizon returns in international
equity markets. Journal of Empirical Finance 11, 553–584.
Pennathur, A.K., Delcoure, N., Anderson, D., 2002. Diversification benefits of ishares
and closed-end country funds. Journal of Financial Research 25, 541–557.
Poterba, J.M., Shoven, J.B., 2002. Exchange-traded funds: A new
investment option for taxable investors. American Economic Review
92, 422–427.
Qi, M., Wu, Y., 2006. Technical trading-rule profitability, data snooping, and reality
check: Evidence from the foreign exchange market. Journal of Money, Credit
and Banking 38, 2135–2158.
Ratner, M., Leal, R.P.C., 1999. Tests of technical trading strategies in the emerging
equity markets of Latin America and Asia. Journal of Banking and Finance 23,
1887–1905.
Sullivan, R., Timmermann, A., White, H., 1999. Data-snooping, technical
trading rule performance, and the bootstrap. Journal of Finance 54,
1647–1691.
Swanson, N.R., White, H., 1995. A model selection approach to assessing the
information in the term structure using linear models and artificial neural
networks. Journal of Business Economics and Statistics 13, 265–275.
Swanson, N.R., White, H., 1997. A model selection approach to real time
macroeconomic forecasting using linear models and artificial neural
networks. Review of Economics and Statistics 79, 540–550.
Tabak, B.M., Lima, E.J.A., 2009. Market efficiency of Brazilian exchange rate:
Evidence from variance ratio statistics and technical trading rules. European
Journal of Operational Research 194, 814–820.
White, H., 2000. A reality check for data snooping. Econometrica 68, 1097–1126.
Yang, J., Su, X., Kolari, J.W., 2008. Do Euro exchange rates follow a martingale? Some
out-of-sample evidence. Journal of Banking and Finance 32, 729–740.