
Cabrera-yang-wang 2010 ejor

European Journal of Operational Research 200 (2010) 498–507
journal homepage: www.elsevier.com/locate/ejor

Stochastics and Statistics

Nonlinearity, data-snooping, and stock index ETF return predictability

Jian Yang a,*, Juan Cabrera b, Tao Wang c

a The Business School, University of Colorado Denver, Denver, CO 80217, USA
b Department of Economics, The Graduate Center of CUNY, New York, NY 10016, USA
c Department of Economics, Queens College and the Graduate Center of CUNY, Flushing, NY 11367, USA

Article info
Article history: Received 12 September 2007; Accepted 9 January 2009; Available online 16 January 2009
Keywords: iShares; Random walk; Nonlinear models; Forecasting evaluation; Reality check

Abstract
This paper examines daily return predictability for eighteen international stock index ETFs. Out-of-sample tests are conducted based on linear and various popular nonlinear models, using both statistical and economic criteria for model comparison. The main results show evidence of predictability for six of eighteen ETFs. A simple linear autoregression model and a nonlinear-in-variance GARCH model, but not several popular nonlinear-in-mean models, help outperform the martingale model. Allowing for data-snooping bias using White's Reality Check also substantially weakens otherwise apparently strong predictability.
© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Asset return predictability has been one of the most important topics in financial research. The inference on asset return predictability carries important implications for practitioners, for example, for the design of portfolio management strategies. Numerous earlier studies have examined the short-horizon predictability of stock market returns based on past returns.
In this regard, since the variance ratio test was originally developed by Lo and MacKinlay (1988), it has been widely used in testing the random walk hypothesis in international stock markets (e.g., Kim and Singal, 2000; Chaudhuri and Wu, 2003; Patro and Wu, 2004) and foreign exchange markets (e.g., Tabak and Lima, 2009).1 However, the popular variance ratio test used in the above studies (as well as the traditional autocorrelation test, see, e.g., Chordia et al. (2005)) assumes linearity and only tests serial uncorrelatedness rather than martingale difference (Hsieh, 1991; McQueen and Thorley, 1991; Hong and Lee, 2003). A nonlinear time series can have zero autocorrelation but a non-zero mean conditional on its past history (i.e., it is predictable based on the past history). That is, the variance ratio test may fail to capture predictable nonlinearities in mean (if any) and could yield misleading conclusions in favor of the martingale (or loosely random walk) hypothesis.

* Corresponding author. Tel.: +1 303 556 5852; fax: +1 303 556 5899. E-mail address: jian.yang@ucdenver.edu (J. Yang).

1 The terms "random walk" and "martingale" have been used interchangeably in the efficient capital markets literature. However, it is the martingale property (or unpredictability) of security prices that is of essential interest to this huge body of the literature (Fama, 1991; Granger, 1992). Strictly speaking, the innovations series is independent and identically distributed for a "random walk", while it is a martingale difference sequence for a "martingale."

0377-2217/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.ejor.2009.01.009

This study examines daily return predictability of international stock index exchange-traded funds (ETFs) during the period of 1996–2006. We present the first comprehensive study on the martingale behavior of recently popular international stock index ETFs (loosely in the context of weak-form market efficiency).
ETFs are one of the most successful financial innovations of all time: there were over 300 ETFs with more than $400 billion of assets as of December 2006. A defining characteristic of ETFs is the ease of active intraday trading and their high daily turnover, which is particularly appealing to investors who demand short-term liquidity and trade in large lots (Poterba and Shoven, 2002). International stock index ETFs presumably provide an attractive investment vehicle for US investors to explore potential investment opportunities abroad. Surprisingly, while many issues such as the diversification potential and herding behavior of ETFs have been examined (e.g., Pennathur et al., 2002; Gleason et al., 2004), the important issue of their short-horizon predictability has not yet been investigated. Also noteworthy, daily stock index ETF prices are transaction prices, which do not suffer from the notorious non-synchronous trading problem of daily stock market indexes (Ahn et al., 2002) that plagues numerous studies using such data.2

2 Although international stock index ETFs are designed to track each country's stock market index, there may be substantial tracking errors for the ETFs, partly due to their considerable exposure to the US market (e.g., Pennathur et al., 2002). Nevertheless, the predictability of the ETFs as a new financial instrument remains interesting in itself. To the extent that the international stock index ETFs track the performance of international stock indexes, the evidence from this study could also shed more light on the short-horizon predictability of international stock market indexes.

We seek to contribute to the literature in the following important aspects. First, we take the model selection approach (e.g., Swanson and White, 1995, 1997), rather than the more traditional hypothesis testing approach as taken in the variance ratio test (or the autocorrelation test).
As discussed in Swanson and White (1995, 1997), unlike the traditional hypothesis testing approach, the model selection approach does not require the specification of a correct model for its valid application. By contrast, earlier empirical findings based on variance ratio tests are quite sensitive to potential model misspecification. More important to this study, the model selection approach allows us to focus directly on the issue of predictability at hand: out-of-sample forecasting performance. Arguably, out-of-sample evidence bears directly on predictability and is important to mitigate the concern of in-sample model overfitting, particularly for nonlinear models. This is also well in line with Granger's (1992, p. 11) observation that "only out-of-sample evaluation is relevant and, to some extent, avoids these difficulties (due to data mining)." By contrast, all the cited studies above only focus on in-sample evidence (and also typically fail to allow for potential nonlinearity-in-mean). Further, similar to Hong and Lee (2003), Moreno and Olmeda (2007), Yang et al. (2008) and Tabak and Lima (2009), this study presents out-of-sample evidence based on both statistical and economic criteria. With the notable exceptions of Ratner and Leal (1999) and Moreno and Olmeda (2007), few earlier studies on international stock market random walk behavior have considered economic criteria as measured by the magnitude of trading returns and particularly the direction of forecasted price changes, which have practical value to investors and other decision-makers.

Second, we extend the literature by applying a number of nonlinear models that allow for both potential nonlinearity-in-mean and nonlinearity-in-variance. As noted earlier, the cited studies above using variance ratio tests (and autocorrelation tests) do not allow for nonlinearity-in-mean.
Theoretically, as discussed in McQueen and Thorley (1991), the existence of fads or rational speculative bubbles suggests the possibility of nonlinear patterns in stock returns. Or, if the world is governed by a not-too-complex chaotic process, it should have short-term nonlinear predictability (in mean) but not linear predictability (Hsieh, 1991, p. 1845). Further, in a survey on the random walk test literature, Granger (1992, p. 11) concludes that "benefits can arise ... especially from considering non-linear models." Toward this end, this study considers several popular nonlinear models to more comprehensively explore potential nonlinearities in mean, in addition to the more commonly used nonlinear-in-variance models (i.e., GARCH) (see, e.g., Hsieh, 1991, 1993).3 In fact, some variants of the popular nonlinear models used in most previous studies (e.g., Hsieh, 1991; Gencay, 1998; Harris and Kucukozmen, 2001; Monoyios and Sarno, 2002; Hong and Lee, 2003; Moreno and Olmeda, 2007; Yang et al., 2008) are used in this study.4

Finally, model comparisons in this study are improved relative to previous studies by using White's (2000) novel test to address the concern of data-snooping bias (i.e., spuriously superior predictive ability of some complex models due to chance).5 When several forecast models using the same data are compared, it is crucial to take into account the dependence among these models; ignoring it may result in misleading inference due to data-snooping bias. While the overfitting problem of nonlinear models is well known in the literature, relatively few earlier studies in this line of the literature have addressed the data-snooping issue, which is shown to be nontrivial in this study.

The rest of this paper is organized as follows: Section 2 presents econometric methodology; Section 3 describes the data; Section 4 discusses the empirical results; and finally, Section 5 concludes the paper.

2.
Econometric methodology

To forecast ETF daily returns, we use various models for $E(Y_t \mid I_{t-1})$, where $Y_t$ represents the first difference of the logarithm of ETF daily closing prices and $I_{t-1}$ is the information set available at time $t-1$. We apply various popular nonlinear models to explore the possibility that daily ETF returns are not a martingale and instead have conditional mean dependence in a complicated form (i.e., nonlinearity-in-mean) and dependence in second (or higher) moments (i.e., nonlinearity-in-variance). We certainly do not assume that the limited number of nonlinear models can capture all the nonlinearities. However, they do represent some of the most popular nonlinear models widely used in the literature thus far. The martingale model $Y_t = \mu + \varepsilon_t$ is used as the benchmark for comparison with other models.

Table A1 lists the various models examined in the paper, including the autoregressive model (AR(d)), generalized autoregressive conditional heteroskedasticity model (GARCH(p, q)), feedforward artificial neural network (NN(d, q)), functional coefficient model (FC(d, L)), nonparametric regression model (NP(k, m)), and some combinations of these models. The estimation of the AR(d) and GARCH(p, q) models is relatively standard, using the ordinary least squares method and the maximum likelihood method, respectively. We next briefly discuss how to implement the more complicated nonlinear models used in this study (i.e., neural network, functional coefficient and nonparametric models).6

2.1. The feedforward artificial neural network

Artificial neural networks have proven to be useful in capturing nonlinearity-in-mean in forecasting financial time series. One of the greatest advantages of neural networks over other commonly used nonlinear time series models is that neural networks can well approximate a large class of functions. The basic structure of neural networks combines many 'basic' nonlinear functions via a multilayer structure.
Normally there is one intermediate, or hidden, layer between the inputs and output. The intuition is that the explanatory variables simultaneously activate the units in the intermediate layer through some function $\Psi$ and, subsequently, output is produced through some function $\Phi$ from the units in the intermediate layer. The following equations summarize this approach:

$$h_{i,t} = \Psi\Big(\gamma_{i0} + \sum_{j=1}^{m} \gamma_{ij} X_{j,t}\Big), \quad i = 1, \ldots, q,$$

$$Y_t = \Phi\Big(\beta_0 + \sum_{i=1}^{q} \beta_i h_{i,t}\Big),$$

or, more compactly,

$$Y_t = \Phi\Big(\beta_0 + \sum_{i=1}^{q} \beta_i \Psi\Big(\gamma_{i0} + \sum_{j=1}^{m} \gamma_{ij} X_{j,t}\Big)\Big),$$

where $X_{j,t}$ is the input or an independent variable, $h_{i,t}$ is the node or hidden unit in the intermediate or hidden layer, and $Y_t$ is the output or dependent variable. In this study, the independent variable $X_{j,t}$ coincides with the lagged dependent variable $Y_{t-j}$. The functions $\Psi$ and $\Phi$ can be arbitrarily chosen and still approximate a large class of functions given sufficiently large numbers of units in the intermediate layer.

In this study we use single-layer feedforward neural networks (e.g., Lee et al., 1993; Gencay, 1998; Hong and Lee, 2003), which are the most basic but perhaps most commonly used neural networks in economic and financial applications. In this case, the input variables are connected to multiple nodes (or hidden units), and at each node they are weighted (differently) and transformed by the same activation function $\Psi$. The output of each node is then weighted again by $\beta_i$, summed, and transformed by a second activation function $\Phi$. Following the literature (e.g., Gencay, 1998; Hong and Lee, 2003), we choose the logistic function for $\Psi$ and the identity function for $\Phi$. Coefficients for the NN(d, q) model are estimated using nonlinear least squares via the Newton–Raphson algorithm. The final equation we estimate is as follows:

$$E(Y_t \mid I_{t-1}) = \beta_0 + \sum_{j=1}^{d} \beta_j Y_{t-j} + \sum_{i=1}^{q} \delta_i G\Big(\gamma_{0i} + \sum_{j=1}^{d} \gamma_{ji} Y_{t-j}\Big),$$

where $G(z) = (1 + e^{-z})^{-1}$ is the logistic function chosen for $\Psi$, $I_{t-1}$ is the information set available at $t-1$, and $Y_t$ is the dependent variable (i.e., ETF returns).

3 Note that there is a debate about whether there exists predictable nonlinearity-in-mean in US stock market indexes. For example, although Hsieh (1991) finds little nonlinearity-in-mean in US stock market prices, Gencay (1998) reports nonlinear-in-mean predictability for similar indexes.

4 Like many earlier studies, a caveat here is that the inference should still be interpreted in light of the limited number of models we examine in this study. In general, martingale means the existence of neither linear nor nonlinear dependence, and we would have to test all possible nonlinear dependence to rule out the martingale property of stock returns, which is practically impossible.

5 As discussed in Campbell et al. (1997, p. 523–524), the problems of overfitting and data-snooping are related but different. A typical symptom of overfitting is an excellent in-sample fit but poor out-of-sample performance, while data-snooping refers to excellent but spurious out-of-sample performance.

6 Also note that some of these models are special cases of the others. For example, the AR(1) model is a special case of the NN(1, 5) model. Nevertheless, in this study the forecasting results of the NN(1, 5) model are systematically worse than the results of the AR(1) model. This, however, may simply indicate rather weak nonlinearity-in-mean in the dataset, which causes the more complicated NN(1, 5) model to perform poorly while the more parsimonious AR(1) model performs rather well in out-of-sample forecasting.

2.2. The functional coefficient model

The functional coefficient model introduced by Cai et al. (2000) is a new semiparametric nonlinear time series model with time-varying and state-dependent coefficients. It includes threshold autoregression models, smooth transition regression, and many other regime switching models as special cases. The basic model can be expressed as follows:

$$E(Y_t \mid I_{t-1}) = a_0(U_t) + \sum_{j=1}^{d} a_j(U_t) Y_{t-j},$$

where $\{(Y_t, U_t)'\}$ is a bivariate stationary process. The smoothing variable $U_t$ may be chosen as a function of the explanatory variable vector $Y_{t-j}$ or as a function of other variables. In our forecasts of ETF returns using past returns, $U_t$ is chosen as the difference between the log index price at time $t-1$ ($p_{t-1}$) and the moving average of the log prices over the most recent $L$ periods at time $t-1$, or:

$$U_t = p_{t-1} - L^{-1} \sum_{j=1}^{L} p_{t-j}.$$

In this paper, following the literature (e.g., Gencay, 1998, 1999) and the common practice of technical analysis, we choose $L = 200$. Traders often use $U_t$ as a buy or sell signal based on its sign, which reveals information on changes in direction, i.e., the moving average rule. Thus, the model might be well suited to forecasting the direction of price movements.

Following Cai et al. (2000), we estimate the term $\{a_j(U_t)\}$ nonparametrically using a local linear estimator. We approximate $a_j(U_t)$ locally (when $U_t$ is close to $u$) by $a_j(U_t) = a_j + b_j(U_t - u)$. The local linear estimator at point $u$ is $\hat{a}_j(u) = \hat{a}_j$, and $\{(\hat{a}_j, \hat{b}_j)\}$ are chosen by minimizing the sum of locally weighted squares defined as:

$$\sum_{t=1}^{N} \big[Y_t - a_j - b_j(U_t - u)\big]^2 K_h(U_t - u),$$

where $K_h(\cdot)$ is the kernel function used as weights for points that are included to estimate $\{(\hat{a}_j, \hat{b}_j)\}$. We use the normal density as the kernel function, and $h$ is the smoothing parameter or the bandwidth of the window of the kernel function, which is determined by the modified leave-one-out least squares cross-validation method proposed in Cai et al. (2000).

2.3.
The nonparametric kernel regression model

Because nonlinearities in the conditional means may be complicated and cannot be expressed explicitly, it is desirable to use nonparametric regression to estimate the model without specifying the forms of functions. Again, we use the well-known kernel regression (with some improvements on bandwidth selection to maximize the forecasting power) for estimation and forecasting. In general, a nonparametric regression model can be expressed as:

$$E(Y_t \mid I_{t-1}) = g(Y_{t-1}, Y_{t-2}, \ldots, Y_{t-j}).$$

As mentioned above with respect to the nonparametric estimator of $a_j(U_t)$ in the functional coefficient model, $g(\cdot)$ can be estimated by local linear regression. At each point $y_t = \{y_{t-1}, y_{t-2}, \ldots, y_{t-j}\}$, we can approximate $g(\cdot)$ locally by a linear function $g(Y) = a + (Y - y)'b$. We can also approximate $g(y)$ locally simply by a constant function $g(y) = a$ (i.e., the local constant estimator), which is the approach taken here. The local constant estimator is relatively simple to implement and has been widely used in applied research. Compared to other estimators, it has also drawn the most theoretical attention and thus has clear theoretical properties for estimation and inference of nonparametric models. The local constant estimator at point $y$ is given by $\hat{g}(y) = \hat{a}$, where $\hat{a}$ minimizes the sum of locally weighted squares:

$$\sum_{t=1}^{N} [Y_t - a]^2 \prod_{s=1}^{j} K_{h_s}(Y_{t-s} - y_{t-s}),$$

where $\prod_{s=1}^{j} K_{h_s}(Y_{t-s} - y_{t-s})$ is the product kernel, $K_{h_s}$ is the univariate kernel function, and $h = (h_1, \ldots, h_j)$ is chosen by the leave-one-out cross-validation procedure.

The smoothing parameter $h$ is the most important parameter in nonparametric estimation. An inappropriately chosen $h$ will give poor in-sample and out-of-sample prediction. Traditional nonparametric forecasting uses the $h$ that minimizes the in-sample sum of squared errors to forecast the next-period value based on previous in-sample data.
However, while this $h$ is optimal for all in-sample data, it may not be the best $h$ for out-of-sample forecasting. Consequently, we use a modified method to select the smoothing parameter.7

7 We thank Qi Li for making the suggestion.

Our modified approach consists of finding the best $h$ for out-of-sample forecasting, denoted $h^*$, and making forecasts based on this $h^*$. For example, suppose that we have data points $x_1$–$x_{100}$ and that we want to forecast $x_{101}$. The traditional approach is to find the $h$ that minimizes the in-sample sum of squared errors over the 100 data points (based on $x_1$–$x_{100}$) and then use this $h$ and these data points (i.e., $x_1$–$x_{100}$) to forecast $x_{101}$. We propose the following modified nonparametric forecasting methodology. For a given candidate $h$, we use data points $x_1$–$x_{80}$ to forecast $x_{81}$, data points $x_2$–$x_{81}$ to forecast $x_{82}$, ..., and data points $x_{20}$–$x_{99}$ to forecast $x_{100}$. We find the $h^*$ that minimizes the sum of squared errors of the out-of-sample forecasts of points $x_{81}$–$x_{100}$ and use this $h^*$ and data points $x_{21}$–$x_{100}$ to make our final forecast of $x_{101}$. In this procedure, we have two parameters to establish: (1) the out-of-sample evaluation length $k$, which is set equal to 20 ($\hat{x}_{81}$–$\hat{x}_{100}$) in the example, and (2) the regression length $m$, which is set equal to 80 in the example. Hence, we denote the model as NP(k, m), where the parameters (k, m) are important to the forecasting performance of this modified nonparametric regression model. We thus experiment with different evaluation lengths, and their impact does not appear to be substantial in this study. Therefore, in the tables presented below, we only discuss the results based on a particular combination.

Finally, it has also been argued that no single forecasting model performs well for all time periods and under all different criteria, as the pattern of ETF returns can vary over time and may not follow a simple data generating process.
In order to improve the predictability, we closely follow Hong and Lee (2003) and combine several forecasting models. More specifically, we pool forecasts from the AR(1), GARCH(1, 1), NN(1, 5), FC(1, 200), and NP(200, 400) models to forecast the conditional mean of price changes.8 Denoting these five models as models 1, ..., 5, respectively, the combined model is given by:

$$\hat{Y}_t \equiv \sum_{k=1}^{5} \omega_{kt} \hat{Y}_{kt},$$

where the weight $\omega_{kt}$ is determined as follows:

$$\omega_{kt} \equiv \frac{\exp\Big[-\lambda_t \sum_{s=1}^{t-1} (Y_s - \hat{Y}_{ks})^2\Big]}{\sum_{j=1}^{5} \exp\Big[-\lambda_t \sum_{s=1}^{t-1} (Y_s - \hat{Y}_{js})^2\Big]},$$

with $\lambda_t = 1/(2S_t^2)$, where $S_t^2$ is the sample variance of $\{Y_s\}$, $s$ runs from 1 to $t-1$, and $\hat{Y}_{ks}$ is the out-of-sample prediction by model $k$. Intuitively, $\omega_{kt}$ gives higher weight to model $k$ if its predictions have been better than those of the other models in previous forecasting exercises, as measured by the mean squared forecast error (MSFE) criterion.

8 The Combined II forecasts pool forecasts from all 5 of these models, while the Combined I forecasts exclude the forecast of the GARCH(1, 1) model.

3. Data description

The dataset consists of daily return observations for eighteen international stock index ETFs from CRSP. These ETFs are traded on the US market and designed to mimic the underlying indices they represent. They are readily available to US investors who want to get access to international stock markets without the involvement of currency exchange. More specifically, we use the daily closing prices on iShares exchange-traded funds (ETFs) that track a chosen market index.9 These markets have been divided into two groups: developed and emerging markets. The developed market ETFs include Australia, Canada, France, Germany, Italy, Japan, Netherlands, Spain, Switzerland, United Kingdom, and the United States. The emerging market ETFs include Brazil, Hong Kong, Korea, Malaysia, Mexico, Singapore, and Taiwan. The time period covered for developed market indices spans from April 1, 1996 to August 25, 2006. Among emerging market ETFs, for Hong Kong, Malaysia, Mexico, and Singapore, the starting period is January 4, 1999; for Taiwan, the starting period is June 23, 2000; for South Korea, the starting period is May 12, 2000; and for Brazil, the starting period is July 14, 2000.

9 For the US, the ETF chosen for the S&P 500 index is the SPY because it has a much higher trading volume than the iShares S&P 500 ETF. Also, all ETF returns are already adjusted for dividends.

The use of daily data is appropriate for the purpose of this study and similar to many previous studies (Hsieh, 1991; Gencay, 1998). Unlike higher-frequency intraday data, daily ETF data avoid the microstructure effects which are usually present in intraday dynamics. As thoroughly discussed in Hsieh (1991, p. 1848), high-frequency tick-by-tick data may capture bid-ask bounces and other dependencies which are caused by the market microstructure. These "artificial" dependencies will be picked up by any good test of nonlinear dynamics. The financial economist must increase the sampling interval in order to average out these "artificial" dependencies. Monoyios and Sarno (2002) also argue that the use of daily data can easily allow for a longer time span of the time series, which is much more important than the number of observations per se to model nonlinear dynamics related to lower-frequency properties of the data. In addition, the number of daily observations is large enough to allow efficient in-sample estimation and out-of-sample forecasting evaluation. A limited number of observations tends to produce poor fit and inferior predictability, which could bias results against rejecting the martingale hypothesis.

4. Empirical results

In order to produce out-of-sample forecasts, we use a rolling regression technique. Suppose there are $N$ observations in the sample, where $N = R + P$. At time $t$, we use a rolling sample of $R$ observations, estimated using the various linear and nonlinear methods, to produce a one-step-ahead forecast, $\hat{Y}_{t+1}$. Therefore, we can generate a sequence of $P$ one-step-ahead forecasts which is used to evaluate each of the models under consideration. Swanson and White (1995, 1997) suggest that the rolling regression technique can further allow for the (potentially nonlinear) relation between the current and past returns to evolve across time. Applying four forecasting evaluation criteria to the sequence of out-of-sample forecasts, we investigate the forecasting ability of each model relative to the benchmark martingale model. The four evaluation criteria used here are:

$$\mathrm{MSFE} = P^{-1} \sum_{t=R}^{N-1} \big(Y_{t+1} - \hat{Y}_{t+1}\big)^2,$$

$$\mathrm{MAFE} = P^{-1} \sum_{t=R}^{N-1} \big|Y_{t+1} - \hat{Y}_{t+1}\big|,$$

$$\mathrm{MFTR} = P^{-1} \sum_{t=R}^{N-1} \mathrm{sign}(\hat{Y}_{t+1})\, Y_{t+1},$$

$$\mathrm{MCFD} = P^{-1} \sum_{t=R}^{N-1} \mathbf{1}\big[\mathrm{sign}(\hat{Y}_{t+1})\, \mathrm{sign}(Y_{t+1}) > 0\big],$$

where $\mathbf{1}[\cdot]$ is the indicator function, and $\mathrm{sign}(\hat{Y}_{t+1}) = 1$ if $\hat{Y}_{t+1} \geq 0$ and $\mathrm{sign}(\hat{Y}_{t+1}) = -1$ if $\hat{Y}_{t+1} < 0$.

Similar to Hong and Lee (2003), the two statistical criteria, mean squared forecast error and mean absolute forecast error (MSFE and MAFE), are complemented with two economic criteria, mean forecast trading return and mean correct forecast direction (MFTR and MCFD). Both MFTR and MCFD can be particularly informative to profit-maximizing investors. Because stock returns are volatile and forecast errors can be quite large from period to period, the statistical accuracy of forecasts (as measured by MSFE and MAFE) does not necessarily imply economic accuracy in terms of maximizing investor profits. Investors may base their trading decisions on maximizing profits rather than minimizing forecasting errors.
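The rolling scheme and the four criteria above can be illustrated with a minimal sketch. It assumes plain Python lists of returns and compares only an OLS AR(1) forecast with the rolling-mean (martingale) benchmark; the function names `sign`, `evaluate` and `rolling_forecasts` and the simulated series are hypothetical choices for illustration, not the estimation code used in the study.

```python
import random


def sign(x):
    """sign(x) = 1 if x >= 0, else -1, matching the convention in the text."""
    return 1.0 if x >= 0 else -1.0


def evaluate(actual, forecast):
    """Return MSFE, MAFE, MFTR and MCFD for paired one-step-ahead forecasts."""
    p = len(actual)
    pairs = list(zip(actual, forecast))
    return {
        "MSFE": sum((y - f) ** 2 for y, f in pairs) / p,
        "MAFE": sum(abs(y - f) for y, f in pairs) / p,
        "MFTR": sum(sign(f) * y for y, f in pairs) / p,
        "MCFD": sum(1.0 for y, f in pairs if sign(f) * sign(y) > 0) / p,
    }


def rolling_forecasts(y, r):
    """One-step-ahead forecasts from a rolling window of the last r observations.

    Returns (actual, benchmark_forecast, ar1_forecast) for t = r, ..., len(y) - 1.
    """
    actual, bench, ar1 = [], [], []
    for t in range(r, len(y)):
        window = y[t - r:t]
        mu = sum(window) / r                  # martingale model: Y_t = mu + e_t
        x, z = window[:-1], window[1:]        # AR(1): regress Y_s on Y_{s-1} by OLS
        xbar, zbar = sum(x) / len(x), sum(z) / len(z)
        sxx = sum((a - xbar) ** 2 for a in x)
        sxz = sum((a - xbar) * (b - zbar) for a, b in zip(x, z))
        slope = sxz / sxx if sxx > 0 else 0.0
        intercept = zbar - slope * xbar
        actual.append(y[t])
        bench.append(mu)
        ar1.append(intercept + slope * window[-1])
    return actual, bench, ar1


if __name__ == "__main__":
    # Simulated AR(1) "returns" (hypothetical data), N = 300 with R:P = 2:1.
    rng = random.Random(0)
    y = [0.0]
    for _ in range(299):
        y.append(0.1 * y[-1] + rng.gauss(0.0, 1.0))
    actual, bench, ar1 = rolling_forecasts(y, r=200)
    base, model = evaluate(actual, bench), evaluate(actual, ar1)
    print("MSFE ratio AR(1)/benchmark:", model["MSFE"] / base["MSFE"])
```

Reporting the model's MSFE as a ratio to the benchmark's, as in the last line, mirrors how the MSFE and MAFE tables are laid out: a ratio below one favors the model over the martingale benchmark.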
Furthermore, accurate forecasts of the direction of price changes may be equally important or even more important to investors than the magnitude of the changes, as they can be easily translated into profits. Granger (1992) emphasizes that, in this case, it is also desirable to compute economic measures of forecast accuracy, e.g., MFTR and MCFD. Many other authors (e.g., Leitch and Tanner, 1991; Hong and Lee, 2003) have made similar points in the context of forecasting asset prices. Hence, the use of multiple criteria in this study provides a more comprehensive perspective on the predictability of stock returns.

As mentioned above, it is important to have an adequately large number of observations to efficiently estimate the model parameters. In other words, the size of R must be reasonably large. On the other hand, the size of P must also be large enough to detect the differences in forecasting performance across models. Given the number of observations in our data (N = 2619 and N = 1924 for developed and most emerging markets, respectively), an appropriate or balanced choice for R can be expressed by the ratio R:P = 2:1.10

10 We also conducted the analysis based on the ratio R:P = 1:1. The results are qualitatively similar and available upon request.

Tables 1–4 report the results for the developed markets and Tables 5–8 report the results for the emerging markets. Each table contains one of the forecasting evaluation criteria in the order presented above. For example, Table 1 reports the out-of-sample forecast results using the MSFE for the eleven developed countries under consideration. All forecast results are based on an R:P ratio (regression length : total out-of-sample forecast length) equal to 2:1. Each table also contains two distinct p-values, P1 and P2, based on White's (2000) Reality Check test. White's (2000) test addresses the dangerous practice of data-snooping or data re-usage for the purpose of inference. He constructs a method for testing the hypothesis that the best model encountered during a specification search has no predictive superiority over the benchmark model. His method, however, permits data-snooping to be undertaken with some degree of confidence that one will not mistake results generated by chance for genuinely "good" results.

Table 1
Forecast evaluation results for developed markets – MSFE.

| MSFE | AU | CA | GE | IT | JP | SW | NE | SP | FR | UK | US |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Benchmark | 1.175 | 1.021 | 1.341 | 0.933 | 1.837 | 1.081 | 1.086 | 1.067 | 0.998 | 0.844 | 0.492 |
| AR(1) | 1.039 | 1.017 | 0.994 | 0.995 | 1.005 | 0.973 | 1.001 | 1.000 | 0.996 | 0.990 | 0.998 |
| P1 | 1.00 | 1.00 | 0.04 | 0.08 | 0.79 | 0.00 | 0.64 | 0.46 | 0.06 | 0.18 | 0.18 |
| P2 | 1.00 | 0.99 | 0.05 | 0.09 | 0.78 | 0.00 | 0.67 | 0.48 | 0.06 | 0.16 | 0.18 |
| GARCH(1, 1) | 0.998 | 0.998 | 0.996 | 0.998 | 1.000 | 0.998 | 0.997 | 0.998 | 0.998 | 0.998 | 0.999 |
| P1 | 0.30 | 0.23 | 0.12 | 0.21 | 0.35 | 0.12 | 0.08 | 0.19 | 0.21 | 0.11 | 0.36 |
| P2 | 0.59 | 0.50 | 0.08 | 0.11 | 0.68 | 0.00 | 0.27 | 0.45 | 0.10 | 0.16 | 0.35 |
| NN(1, 5) | 1.007 | 1.001 | 1.001 | 1.018 | 1.014 | 1.053 | 1.004 | 0.997 | 1.030 | 1.034 | 1.052 |
| P1 | 0.71 | 0.53 | 0.53 | 0.97 | 0.97 | 1.00 | 0.68 | 0.32 | 1.00 | 0.99 | 1.00 |
| P2 | 0.75 | 0.72 | 0.34 | 0.38 | 0.85 | 0.03 | 0.55 | 0.59 | 0.42 | 0.37 | 0.65 |
| FC(1, 200) | 1.043 | 1.012 | 1.027 | 1.006 | 1.010 | 1.001 | 1.016 | 1.017 | 1.009 | 1.026 | 1.006 |
| P1 | 1.00 | 0.96 | 0.99 | 0.73 | 0.84 | 0.53 | 0.97 | 0.97 | 0.91 | 0.92 | 0.91 |
| P2 | 0.82 | 0.79 | 0.57 | 0.55 | 0.90 | 0.04 | 0.68 | 0.69 | 0.56 | 0.53 | 0.74 |
| NP(200, 400) | 1.018 | 1.006 | 1.001 | 1.013 | 1.008 | 0.988 | 1.005 | 1.004 | 1.008 | 1.006 | 1.004 |
| P1 | 0.99 | 0.88 | 0.62 | 0.98 | 0.85 | 0.14 | 0.92 | 0.91 | 0.98 | 0.73 | 0.83 |
| P2 | 0.86 | 0.83 | 0.59 | 0.64 | 0.95 | 0.04 | 0.70 | 0.72 | 0.61 | 0.56 | 0.81 |
| Combined I | 1.001 | 0.998 | 0.990 | 0.998 | 0.999 | 0.978 | 0.998 | 0.997 | 1.001 | 0.993 | 1.002 |
| P1 | 0.49 | 0.31 | 0.02 | 0.30 | 0.39 | 0.00 | 0.29 | 0.16 | 0.59 | 0.22 | 0.76 |
| P2 | 0.86 | 0.82 | 0.34 | 0.64 | 0.90 | 0.04 | 0.71 | 0.63 | 0.62 | 0.56 | 0.81 |
| Combined II | 0.999 | 0.995 | 0.990 | 0.997 | 0.997 | 0.979 | 0.996 | 0.996 | 0.999 | 0.991 | 1.000 |
| P1 | 0.44 | 0.11 | 0.00 | 0.11 | 0.24 | 0.01 | 0.07 | 0.06 | 0.33 | 0.11 | 0.52 |
| P2 | 0.86 | 0.62 | 0.33 | 0.64 | 0.84 | 0.04 | 0.65 | 0.55 | 0.62 | 0.56 | 0.81 |

Notes: (1) The data are daily data from April 1, 1996 to August 25, 2006. (2) P1 is the bootstrap p-value for comparing a single model with the martingale model (the benchmark model) using White's (2000) test with 1000 bootstrap replications and a bootstrap smoothing parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models with the martingale model, where the null hypothesis is that the best of the first k models has no superior predictive power over the martingale model. (3) AR, NN, FC, NP are the various models under consideration. For the benchmark model, the MSFEs are in levels ($\times 10^{-4}$). For all other models, they are MSFE ratios relative to that of the benchmark model. The smaller the MSFE, the better the predictive ability of a model.

Table 2
Forecast evaluation results for developed markets – MAFE.

| MAFE | AU | CA | GE | IT | JP | SW | NE | SP | FR | UK | US |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Benchmark | 0.838 | 0.792 | 0.904 | 0.764 | 1.062 | 0.814 | 0.799 | 0.787 | 0.779 | 0.708 | 0.548 |
| AR(1) | 1.008 | 1.006 | 0.997 | 0.997 | 0.998 | 0.985 | 1.001 | 0.999 | 0.999 | 0.991 | 1.000 |
| P1 | 0.90 | 0.93 | 0.09 | 0.08 | 0.28 | 0.01 | 0.63 | 0.38 | 0.19 | 0.07 | 0.46 |
| P2 | 0.89 | 0.94 | 0.10 | 0.08 | 0.29 | 0.00 | 0.64 | 0.41 | 0.19 | 0.06 | 0.46 |
| GARCH(1, 1) | 0.996 | 0.996 | 0.997 | 0.999 | 1.000 | 0.999 | 0.999 | 0.999 | 1.000 | 0.999 | 0.996 |
| P1 | 0.02 | 0.01 | 0.04 | 0.20 | 0.33 | 0.13 | 0.18 | 0.38 | 0.41 | 0.32 | 0.01 |
| P2 | 0.29 | 0.13 | 0.10 | 0.08 | 0.29 | 0.00 | 0.44 | 0.60 | 0.33 | 0.06 | 0.02 |
| NN(1, 5) | 1.003 | 1.000 | 1.003 | 1.014 | 1.008 | 1.034 | 1.004 | 1.001 | 1.016 | 1.020 | 1.034 |
| P1 | 0.63 | 0.52 | 0.68 | 1.00 | 0.96 | 1.00 | 0.78 | 0.62 | 1.00 | 1.00 | 1.00 |
| P2 | 0.47 | 0.33 | 0.37 | 0.31 | 0.50 | 0.02 | 0.68 | 0.77 | 0.62 | 0.14 | 0.35 |
| FC(1, 200) | 1.014 | 1.002 | 1.011 | 1.004 | 1.002 | 0.995 | 1.014 | 1.008 | 1.006 | 1.005 | 1.001 |
| P1 | 0.99 | 0.66 | 0.97 | 0.81 | 0.66 | 0.23 | 1.00 | 0.96 | 0.96 | 0.74 | 0.71 |
| P2 | 0.53 | 0.42 | 0.56 | 0.47 | 0.65 | 0.03 | 0.78 | 0.84 | 0.72 | 0.24 | 0.39 |
| NP(200, 400) | 1.008 | 1.004 | 1.000 | 1.005 | 1.003 | 0.996 | 1.003 | 1.002 | 1.006 | 1.000 | 1.006 |
| P1 | 0.99 | 0.91 | 0.54 | 0.90 | 0.74 | 0.27 | 0.89 | 0.90 | 0.98 | 0.54 | 0.99 |
| P2 | 0.55 | 0.44 | 0.60 | 0.54 | 0.73 | 0.04 | 0.83 | 0.88 | 0.79 | 0.25 | 0.42 |
| Combined I | 1.000 | 1.000 | 0.994 | 0.999 | 0.997 | 0.988 | 0.999 | 0.999 | 1.001 | 0.992 | 1.002 |
| P1 | 0.51 | 0.46 | 0.02 | 0.31 | 0.19 | 0.01 | 0.32 | 0.28 | 0.76 | 0.07 | 0.93 |
| P2 | 0.55 | 0.44 | 0.30 | 0.54 | 0.68 | 0.04 | 0.83 | 0.81 | 0.79 | 0.25 | 0.42 |
| Combined II | 0.997 | 0.997 | 0.994 | 0.998 | 0.998 | 0.988 | 0.998 | 0.999 | 1.000 | 0.992 | 1.000 |
| P1 | 0.10 | 0.07 | 0.01 | 0.12 | 0.18 | 0.01 | 0.12 | 0.20 | 0.64 | 0.05 | 0.55 |
| P2 | 0.55 | 0.44 | 0.29 | 0.54 | 0.68 | 0.04 | 0.72 | 0.77 | 0.79 | 0.25 | 0.42 |

Notes: (1) The data are daily data from April 1, 1996 to August 25, 2006. (2) P1 is the bootstrap p-value for comparing a single model with the martingale model (the benchmark model) using White's (2000) test with 1000 bootstrap replications and a bootstrap smoothing parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models with the martingale model, where the null hypothesis is that the best of the first k models has no superior predictive power over the martingale model. (3) AR, NN, FC, NP are the various models under consideration. For the benchmark model, the MAFEs are in levels ($\times 10^{-2}$). For all other models, they are MAFE ratios relative to that of the benchmark model. The smaller the MAFE, the better the predictive ability of a model.

Table 3
Forecast evaluation results for developed markets – MFTR.
              AU     CA     GE     IT     JP     SW     NE     SP     FR     UK     US
Benchmark     0.090  0.094  0.017  0.075  0.019  0.019  0.053  0.086  0.075  0.053  0.026
AR(1)         0.003  0.017  0.083  0.087  0.028  0.162  0.012  0.033  0.064  0.055  0.033
  P1          0.98   1.00   0.04   0.37   0.21   0.01   0.09   0.91   0.63   0.50   0.40
  P2          0.98   1.00   0.03   0.37   0.23   0.00   0.10   0.90   0.67   0.47   0.42
GARCH(1, 1)   0.090  0.094  0.080  0.075  0.032  0.073  0.072  0.086  0.075  0.068  0.036
  P1          1.00   1.00   0.01   1.00   0.66   0.01   0.02   1.00   1.00   0.06   0.26
  P2          0.50   0.51   0.04   0.37   0.29   0.00   0.03   0.50   0.53   0.40   0.51
NN(1, 5)      0.054  0.022  0.003  0.018  0.001  0.055  0.025  0.040  0.011  0.034  0.031
  P1          0.77   0.91   0.37   0.95   0.36   0.90   0.06   0.89   0.96   0.95   0.93
  P2          0.68   0.69   0.08   0.57   0.42   0.01   0.03   0.72   0.72   0.57   0.64
FC(1, 200)    0.022  0.042  0.017  0.045  0.037  0.112  0.026  0.015  0.012  0.053  0.001
  P1          0.99   0.96   0.22   0.78   0.18   0.03   0.26   0.98   0.96   0.47   0.83
  P2          0.75   0.77   0.09   0.64   0.42   0.01   0.03   0.77   0.79   0.61   0.70
NP(200, 400)  0.052  0.090  0.051  0.008  0.026  0.107  0.050  0.048  0.029  0.034  0.017
  P1          0.93   0.57   0.71   0.96   0.25   0.06   0.51   0.87   0.80   0.66   0.86
  P2          0.82   0.84   0.11   0.69   0.45   0.01   0.03   0.82   0.81   0.64   0.74
Combined I    0.025  0.047  0.081  0.044  0.041  0.138  0.011  0.088  0.068  0.066  0.035
  P1          0.96   0.87   0.04   0.77   0.17   0.01   0.08   0.48   0.58   0.39   0.41
  P2          0.84   0.86   0.12   0.70   0.44   0.02   0.04   0.83   0.83   0.65   0.75
Combined II   0.089  0.086  0.099  0.073  0.038  0.135  0.027  0.069  0.053  0.067  0.034
  P1          0.54   0.62   0.02   0.52   0.17   0.01   0.03   0.71   0.78   0.37   0.37
  P2          0.87   0.88   0.07   0.73   0.44   0.02   0.04   0.85   0.84   0.66   0.76

Notes: (1) The data are daily data from April 1, 1996 to August 25, 2006. (2) P1 is the bootstrap p-value for comparing a single model with the martingale model (the benchmark model) using White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models with the martingale model, where the null hypothesis is that the best of the first k models has no superior predictive power over the martingale model. (3) AR, NN, FC, NP are various models under consideration. The larger the MFTR, the better the predictive ability of a model.

Table 4
Forecast evaluation results for developed markets – MCFD.

              AU     CA     GE     IT     JP     SW     NE     SP     FR     UK     US
Benchmark     0.552  0.557  0.491  0.523  0.494  0.502  0.443  0.523  0.512  0.497  0.539
AR(1)         0.522  0.497  0.522  0.520  0.513  0.558  0.481  0.507  0.498  0.518  0.519
  P1          0.90   1.00   0.09   0.56   0.20   0.01   0.03   0.78   0.77   0.20   0.79
  P2          0.91   1.00   0.08   0.53   0.19   0.01   0.04   0.77   0.75   0.20   0.81
GARCH(1, 1)   0.552  0.557  0.530  0.523  0.492  0.522  0.513  0.523  0.512  0.506  0.549
  P1          1.00   1.00   0.02   1.00   0.54   0.04   0.01   1.00   1.00   0.09   0.20
  P2          0.50   0.49   0.07   0.48   0.23   0.01   0.01   0.47   0.49   0.20   0.43
NN(1, 5)      0.524  0.499  0.498  0.466  0.483  0.453  0.491  0.498  0.494  0.483  0.470
  P1          0.90   0.99   0.38   0.98   0.66   0.96   0.02   0.89   0.74   0.68   1.00
  P2          0.71   0.66   0.11   0.68   0.36   0.03   0.01   0.69   0.68   0.33   0.56
FC(1, 200)    0.507  0.528  0.502  0.507  0.508  0.532  0.453  0.481  0.487  0.511  0.516
  P1          0.98   0.95   0.29   0.76   0.27   0.13   0.32   0.96   0.91   0.29   0.87
  P2          0.76   0.74   0.12   0.73   0.44   0.04   0.01   0.76   0.75   0.37   0.62
NP(200, 400)  0.527  0.540  0.494  0.477  0.501  0.528  0.460  0.508  0.478  0.509  0.452
  P1          0.96   0.86   0.41   0.97   0.42   0.15   0.17   0.78   0.89   0.33   1.00
  P2          0.82   0.81   0.14   0.77   0.47   0.04   0.01   0.81   0.79   0.40   0.66
Combined I    0.508  0.507  0.532  0.491  0.513  0.540  0.486  0.513  0.498  0.522  0.507
  P1          0.99   0.99   0.04   0.91   0.21   0.05   0.02   0.69   0.74   0.16   0.90
  P2          0.84   0.83   0.14   0.78   0.48   0.04   0.01   0.82   0.81   0.36   0.67
Combined II   0.549  0.543  0.525  0.511  0.512  0.540  0.493  0.507  0.493  0.522  0.533
  P1          0.58   0.79   0.05   0.75   0.22   0.06   0.00   0.87   0.87   0.15   0.61
  P2          0.87   0.85   0.15   0.80   0.48   0.04   0.01   0.84   0.83   0.37   0.69

Notes: (1) The data are daily data from April 1, 1996 to August 25, 2006. (2) P1 is the bootstrap p-value for comparing a single model with the martingale model (the benchmark model) using White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models with the martingale model, where the null hypothesis is that the best of the first k models has no superior predictive power over the martingale model. (3) AR, NN, FC, NP are various models under consideration. The larger the MCFD, the better the predictive ability of a model.

Table 5
Forecast evaluation results for emerging markets – MSFE.

              HK     MA     SI     TW     MX     SK     BR
Benchmark     1.139  0.845  1.228  1.926  2.216  2.433  4.590
AR(1)         0.985  0.991  0.989  1.014  1.006  1.001  1.000
  P1          0.12   0.07   0.24   0.89   0.99   0.61   0.48
  P2          0.09   0.07   0.22   0.92   0.97   0.61   0.47
GARCH(1, 1)   0.998  1.000  0.997  1.001  0.996  1.000  0.998
  P1          0.17   0.30   0.12   0.78   0.16   0.52   0.31
  P2          0.09   0.07   0.22   0.90   0.21   0.74   0.53
NN(1, 5)      1.012  1.003  1.052  1.050  1.012  1.020  1.023
  P1          0.79   0.64   1.00   0.99   0.89   0.93   0.84
  P2          0.24   0.21   0.45   0.96   0.49   0.88   0.76
FC(1, 200)    0.986  0.990  0.994  1.012  1.019  1.002  1.002
  P1          0.19   0.18   0.35   0.91   0.97   0.62   0.58
  P2          0.34   0.32   0.49   0.97   0.67   0.95   0.81
NP(200, 400)  1.000  0.990  1.002  1.008  1.002  1.007  1.002
  P1          0.52   0.10   0.60   0.82   0.68   0.86   0.64
  P2          0.34   0.34   0.53   0.98   0.72   0.98   0.83
Combined I    0.977  0.987  0.987  1.006  0.999  0.998  0.997
  P1          0.02   0.02   0.15   0.76   0.33   0.26   0.32
  P2          0.15   0.20   0.47   0.98   0.72   0.90   0.75
Combined II   0.978  0.988  0.986  1.001  0.997  0.998  0.995
  P1          0.01   0.03   0.08   0.56   0.14   0.22   0.18
  P2          0.15   0.20   0.45   0.98   0.72   0.90   0.67

Notes: (1) The data are daily data from January 4, 1999 to August 25, 2006 for most of the emerging markets under consideration. (2) P1 is the bootstrap p-value for comparing a single model with the martingale model (the benchmark model) using White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models with the martingale model, where the null hypothesis is that the best of the first k models has no superior predictive power over the martingale model. (3) AR, NN, FC, NP are various models under consideration. The smaller the MSFE, the better the predictive ability of a model.

Table 6
Forecast evaluation results for emerging markets – MAFE.

MAFE          HK     MA     SI     TW     MX     SK     BR
Benchmark     0.817  0.710  0.852  1.078  1.122  1.222  1.665
AR(1)         0.990  0.992  0.986  1.011  1.005  1.000  1.002
  P1          0.09   0.02   0.04   0.95   1.00   0.52   0.77
  P2          0.08   0.02   0.05   0.95   0.99   0.49   0.79
GARCH(1, 1)   1.001  1.001  0.997  1.002  0.994  0.999  0.996
  P1          0.61   0.88   0.03   0.90   0.01   0.23   0.06
  P2          0.08   0.02   0.05   0.95   0.02   0.43   0.10
NN(1, 5)      1.008  1.002  1.028  1.030  1.005  1.014  1.013
  P1          0.80   0.63   0.99   0.99   0.86   0.95   0.93
  P2          0.18   0.06   0.19   0.98   0.13   0.68   0.43
FC(1, 200)    0.987  0.990  0.989  1.010  1.009  1.001  1.002
  P1          0.11   0.05   0.07   0.94   0.94   0.67   0.62
  P2          0.16   0.06   0.20   0.99   0.27   0.79   0.58
NP(200, 400)  1.000  0.994  1.002  1.007  0.999  1.002  0.997
  P1          0.55   0.12   0.61   0.82   0.31   0.78   0.22
  P2          0.16   0.08   0.22   0.99   0.28   0.87   0.60
Combined I    0.986  0.991  0.990  1.007  0.999  0.999  0.999
  P1          0.02   0.01   0.05   0.90   0.34   0.37   0.41
  P2          0.14   0.08   0.22   0.99   0.28   0.87   0.60
Combined II   0.987  0.992  0.990  1.004  0.997  0.999  0.998
  P1          0.01   0.01   0.05   0.83   0.07   0.26   0.22
  P2          0.14   0.08   0.22   0.99   0.28   0.87   0.60

Notes: (1) The data are daily data from January 4, 1999 to August 25, 2006 for most of the emerging markets under consideration. (2) P1 is the bootstrap p-value for comparing a single model with the martingale model (the benchmark model) using White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models with the martingale model, where the null hypothesis is that the best of the first k models has no superior predictive power over the martingale model. (3) AR, NN, FC, NP are various models under consideration. The smaller the MAFE, the better the predictive ability of a model.

Table 7
Forecast evaluation results for emerging markets – MFTR.

              HK     MA     SI     TW     MX     SK     BR
Benchmark     0.016  0.024  0.018  0.033  0.147  0.126  0.158
AR(1)         0.068  0.072  0.106  0.016  0.068  0.137  0.183
  P1          0.07   0.14   0.06   0.28   0.98   0.36   0.33
  P2          0.05   0.12   0.06   0.28   0.99   0.38   0.32
GARCH(1, 1)   0.085  0.052  0.079  0.021  0.147  0.125  0.158
  P1          0.03   0.11   0.07   0.09   0.00   0.78   0.00
  P2          0.04   0.14   0.07   0.32   0.49   0.38   0.32
NN(1, 5)      0.009  0.035  0.004  0.076  0.013  0.025  0.046
  P1          0.44   0.88   0.64   0.67   0.97   0.86   0.89
  P2          0.06   0.29   0.12   0.46   0.73   0.68   0.57
FC(1, 200)    0.094  0.046  0.057  0.034  0.040  0.093  0.141
  P1          0.01   0.35   0.25   0.22   0.94   0.71   0.55
  P2          0.05   0.32   0.13   0.41   0.82   0.78   0.67
NP(200, 400)  0.050  0.063  0.066  0.084  0.144  0.110  0.158
  P1          0.10   0.18   0.22   0.72   0.55   0.81   0.46
  P2          0.05   0.36   0.16   0.46   0.89   0.85   0.71
Combined I    0.107  0.059  0.096  0.004  0.115  0.098  0.218
  P1          0.02   0.21   0.11   0.36   0.84   0.81   0.25
  P2          0.03   0.37   0.17   0.48   0.92   0.87   0.54
Combined II   0.111  0.072  0.121  0.034  0.155  0.121  0.172
  P1          0.01   0.14   0.04   0.49   0.31   0.56   0.40
  P2          0.02   0.37   0.11   0.49   0.87   0.89   0.55

Notes: (1) The data are daily data from January 4, 1999 to August 25, 2006 for most of the emerging markets under consideration. (2) P1 is the bootstrap p-value for comparing a single model with the martingale model (the benchmark model) using White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models with the martingale model, where the null hypothesis is that the best of the first k models has no superior predictive power over the martingale model. (3) AR, NN, FC, NP are various models under consideration. The larger the MFTR, the better the predictive ability of a model.

Table 8
Forecast evaluation results for emerging markets – MCFD.

              HK     MA     SI     TW     MX     SK     BR
Benchmark     0.476  0.479  0.498  0.477  0.561  0.549  0.564
AR(1)         0.523  0.505  0.544  0.474  0.526  0.545  0.557
  P1          0.03   0.09   0.03   0.52   0.99   0.62   0.70
  P2          0.04   0.09   0.03   0.51   0.99   0.62   0.70
GARCH(1, 1)   0.498  0.490  0.526  0.481  0.561  0.549  0.564
  P1          0.13   0.13   0.02   0.34   1.00   0.32   1.00
  P2          0.05   0.09   0.03   0.58   0.48   0.64   0.49
NN(1, 5)      0.477  0.463  0.491  0.459  0.517  0.512  0.515
  P1          0.47   0.68   0.58   0.68   0.99   0.89   1.00
  P2          0.09   0.22   0.07   0.72   0.69   0.81   0.72
FC(1, 200)    0.533  0.505  0.531  0.497  0.510  0.536  0.548
  P1          0.01   0.13   0.06   0.22   0.98   0.77   0.73
  P2          0.04   0.26   0.08   0.46   0.80   0.89   0.81
NP(200, 400)  0.514  0.498  0.505  0.477  0.557  0.542  0.560
  P1          0.05   0.20   0.40   0.49   0.62   0.88   0.71
  P2          0.05   0.32   0.10   0.51   0.87   0.95   0.86
Combined I    0.533  0.505  0.540  0.479  0.542  0.532  0.564
  P1          0.02   0.14   0.05   0.46   0.94   0.94   0.50
  P2          0.05   0.34   0.11   0.52   0.89   0.96   0.88
Combined II   0.537  0.510  0.551  0.481  0.563  0.545  0.566
  P1          0.00   0.07   0.01   0.42   0.35   0.72   0.38
  P2          0.04   0.23   0.06   0.53   0.88   0.97   0.84

Notes: (1) The data are daily data from January 4, 1999 to August 25, 2006 for most of the emerging markets under consideration. (2) P1 is the bootstrap p-value for comparing a single model with the martingale model (the benchmark model) using White’s (2000) test with 1000 bootstrap replications and a bootstrap smoothing parameter q = 0.75. P2 is the bootstrap reality check p-value for comparing k models with the martingale model, where the null hypothesis is that the best of the first k models has no superior predictive power over the martingale model. (3) AR, NN, FC, NP are various models under consideration. The larger the MCFD, the better the predictive ability of a model.

For our purpose, P1 is the bootstrap p-value for comparing a single model to the benchmark model, which is the martingale model $Y_t = \mu + \varepsilon_t$. P2 is the bootstrap reality check p-value for comparing the k models to the benchmark model. The value for P2 in the table is the bootstrap reality check p-value for the null hypothesis that the best of the first k models has no superior predictive ability over the benchmark model. Of course, the last P2 value (in the last row of the table) checks whether the best of all the models under consideration has superior predictive ability over the martingale model. The difference between each P1 and the last P2 gives an estimate of data-snooping bias. Sullivan et al. (1999) and Qi and Wu (2006) used White’s methodology to examine the data-snooping issue in technical trading rules.

Tables 1 and 2 report the results for 11 developed markets using the statistical criteria MSFE and MAFE. For the benchmark model, the MSFE and MAFE are in levels (×10^4 and ×10^2, respectively). For all other models, they are ratios relative to that of the benchmark model. For Table 1, the results show that except for Spain with the NN(1, 5) model, and Switzerland with the NP(200, 400) model, all MSFE ratios for the three nonlinear-in-mean models (NN(1, 5), FC(1, 200) and NP(200, 400)) are above 1. Therefore, none of the nonlinear-in-mean models outperforms the benchmark. These findings are consistent with previous studies (e.g., Hsieh (1991)) that show a poor forecasting performance of nonlinear-in-mean models relative to the benchmark martingale models in terms of statistical criteria. On the other hand, when evaluated alone, each of the remaining four models (AR(1), GARCH(1, 1), and the two combinations) in some cases reveals predictive ability superior to that of the benchmark. Note that the Combined II forecasts pool forecasts from all individual models: AR(1), GARCH(1, 1), NN(1, 5), FC(1, 200) and NP(200, 400), while the Combined I forecasts exclude the forecast of the GARCH(1, 1) model. Based on the MSFE criterion and the P1 statistics, the AR(1) and the Combined II models show the most forecasting power, as they are able to beat the martingale model for four out of the 11 countries.
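The roles of P1 and P2 described above can be illustrated with a minimal sketch of a reality-check style bootstrap. It assumes the stationary bootstrap (a new block starts at a random date with probability q, otherwise the next observation is taken), which is the resampling scheme White (2000) builds on. The function names, the input layout `f[j][t]` (performance of model j relative to the benchmark at forecast date t, with larger values better, e.g. benchmark loss minus model loss for MSFE and MAFE), and the toy data are illustrative assumptions, not the authors' code.

```python
import math
import random


def stationary_bootstrap_indices(n, q, rng):
    """Resample time indices; with probability q a new block starts at a random date."""
    idx = [rng.randrange(n)]
    for _ in range(n - 1):
        if rng.random() < q:
            idx.append(rng.randrange(n))       # start a new block
        else:
            idx.append((idx[-1] + 1) % n)      # continue the current block (wrap around)
    return idx


def reality_check_pvalues(f, q=0.75, n_boot=1000, seed=0):
    """Return (p1, p2) for relative-performance series f[j][t] (larger is better).

    p1[j]: model j tested alone against the benchmark.
    p2[j]: best of models 0..j tested against the benchmark.
    """
    rng = random.Random(seed)
    k, n = len(f), len(f[0])
    root_n = math.sqrt(n)
    fbar = [sum(row) / n for row in f]

    # Observed statistics: each model alone, and the running best of the first j+1 models.
    v_single = [root_n * m for m in fbar]
    v_best, running = [], -math.inf
    for v in v_single:
        running = max(running, v)
        v_best.append(running)

    hits_single = [0] * k
    hits_best = [0] * k
    for _ in range(n_boot):
        idx = stationary_bootstrap_indices(n, q, rng)
        running = -math.inf
        for j in range(k):
            fstar = sum(f[j][t] for t in idx) / n
            v_star = root_n * (fstar - fbar[j])   # recentred bootstrap statistic
            running = max(running, v_star)
            hits_single[j] += v_star > v_single[j]
            hits_best[j] += running > v_best[j]
    p1 = [h / n_boot for h in hits_single]
    p2 = [h / n_boot for h in hits_best]
    return p1, p2


# Toy data (hypothetical): model 0 clearly beats the benchmark, model 1 is clearly worse.
toy_rng = random.Random(1)
good = [1.0 + toy_rng.uniform(-0.1, 0.1) for _ in range(60)]
bad = [-1.0] * 60
p1, p2 = reality_check_pvalues([good, bad], q=0.75, n_boot=200, seed=0)
print(p1, p2)
```

In this toy case the single-model p-values are `[0.0, 1.0]`, while the cumulative p-values are `[0.0, 0.0]`, because the best of the first two models is still the strong first model. With real forecasts, comparing each `p1[j]` with `p2[-1]` gives the kind of data-snooping gap the text refers to.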
Note that the Combined II forecasts perform better than the Combined I (CI) forecasts. This result suggests the importance of using GARCH models to allow for nonlinearity in volatility. The superiority of these 4 models (albeit moderate) as measured by the MSFE can be more clearly seen in the case of Switzerland. All four models are able to beat the benchmark at the 5% level of significance (except for the GARCH model, which has a P1 value of 12%). However, with allowance for data-snooping bias, the P2 in the last row suggests that the best forecasting model among the 7 models is no better than the martingale model, except for Switzerland, where the AR(1) model clearly beats the benchmark model.

The results obtained using the MAFE as the evaluation criterion (Table 2) are very similar to those for the MSFE. The Combined II model, when evaluated as a single model, shows forecasting ability superior to that of the benchmark for five countries, which is mostly attributable to the AR(1) model, the GARCH(1, 1) model, or both. All three nonlinear-in-mean models fail to outperform the martingale model in all the markets. The GARCH models, however, show better predictive ability when evaluated by the MAFE than by the MSFE. Nevertheless, with further allowance for data-snooping bias, the apparently good performance of the Combined II model disappears, again with the only exception of Switzerland, where the AR(1) model as the best model outperforms the benchmark at the 5% level (with a P2 value of 0.04).

Tables 3 and 4 report the results using the economic criteria for all developed countries. All results for these two measures are in levels. The meaning of these results is straightforward. The MFTR shows the daily profit (in percentages) generated by the forecasts of the model, and the MCFD shows the percentage of all directional changes correctly predicted by the model.
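As a rough illustration of what the four criteria measure, the sketch below computes them from a series of realized returns and one-step-ahead forecasts. The formal definitions lie outside this passage, so the forms used here are assumptions based on the conventional definitions in this literature: mean squared and mean absolute forecast errors, the mean return from trading on the sign of the forecast, and the share of correctly predicted directions. The function names and the toy numbers are hypothetical.

```python
import math


def sign(x):
    """Return 1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def msfe(actual, forecast):
    """Mean squared forecast error."""
    return sum((y - f) ** 2 for y, f in zip(actual, forecast)) / len(actual)


def mafe(actual, forecast):
    """Mean absolute forecast error."""
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)


def mftr(actual, forecast):
    """Mean forecast trading return: go long or short according to the forecast sign."""
    return sum(sign(f) * y for y, f in zip(actual, forecast)) / len(actual)


def mcfd(actual, forecast):
    """Mean correct forecast direction: share of dates with the sign predicted correctly."""
    hits = sum(1 for y, f in zip(actual, forecast) if sign(f) * sign(y) > 0)
    return hits / len(actual)


# Hypothetical daily returns and forecasts (not taken from the paper).
actual = [0.01, -0.02, 0.03, -0.01]
forecast = [0.02, 0.01, 0.01, -0.02]

print(msfe(actual, forecast), mafe(actual, forecast))
print(mftr(actual, forecast), mcfd(actual, forecast))
```

For the statistical criteria a model is compared with the benchmark through the ratio of its MSFE or MAFE to the benchmark's, so values below 1 favor the model. For MFTR and MCFD larger values are better, and the tables report them in levels.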
For example, in the case of Switzerland, the AR(1) model generates a profit of 0.162% per trading day on average (or equivalently 40.7% per year with 251 trading days) during the out-of-sample period (before allowance for transaction costs) and correctly predicts 55.8% of the directions of changes, which is mostly attributable to the superior performance of the AR(1) model. The results based on the MFTR (Table 3) suggest some evidence of superior predictive ability for the 3 nonlinear-in-mean models.¹¹ The NN model generates a statistically significant profit (i.e., 0.025% per trading day) in the case of the Netherlands. The FC and nonparametric models are both able to beat the predictive power of the benchmark model in the Swiss stock market. However, for most other countries, the nonlinear-in-mean models do not outperform the benchmark model. On the other hand, for three countries, Germany, Switzerland, and the Netherlands, the results reveal that both AR(1) and GARCH(1, 1) are able to improve on the forecasts of the martingale model. The numbers from the combined forecasts as well as the reality check test statistic P2 also confirm the superiority of the AR(1) and GARCH(1, 1) over the benchmark model for those 3 countries.

The results based on the MCFD criterion are similar to those based on the MFTR in that the 3 nonlinear-in-mean models generally cannot forecast the direction of the changes. Only the NN model is able to outperform the benchmark in the Netherlands market, correctly forecasting directional changes in prices 49.1% of the time, 4.8 percentage points more often than the martingale model. Again, for the three countries, Germany, Switzerland, and the Netherlands, the results reveal that both AR(1) and GARCH(1, 1) are able to improve on the forecasts of the martingale model. The numbers from the combined forecasts as well as the reality check P2 in the last row also confirm the result. Overall, there is very limited evidence for predictability based on nonlinear-in-mean models.
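The two headline figures quoted in this passage can be checked with simple arithmetic. The annual figure appears to scale the daily profit linearly by the number of trading days; the compounded value is shown only for comparison and is not a number reported in the paper. The 44.3% benchmark hit rate is the Netherlands entry in Table 4.

```latex
\begin{align*}
0.162\% \times 251 &= 40.662\% \approx 40.7\% && \text{(simple scaling, no compounding)} \\
(1 + 0.00162)^{251} - 1 &\approx 50.1\% && \text{(compounded, for comparison only)} \\
49.1\% - 44.3\% &= 4.8 \text{ percentage points} && \text{(NN vs. benchmark MCFD, Netherlands)}
\end{align*}
```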
Among the 11 developed markets, only 3 countries, Germany, Switzerland, and the Netherlands, show strong predictability from the AR(1), GARCH(1, 1) and combined models based on the four statistical and economic criteria. The results based on the statistical criteria for Germany and the Netherlands, however, are not as strong as those for Switzerland because their reality check P2 values are insignificant.

The results for the seven emerging markets in Tables 5–8 are largely similar to those of the developed markets. Using statistical evaluation criteria (see Tables 5 and 6), our findings suggest that even without allowance for data-snooping bias, nonlinear-in-mean models generally cannot outperform the benchmark, except that the FC model for Malaysia and Singapore outperforms the benchmark based on the MAFE. The models that perform the best are again the AR(1), GARCH(1, 1), Combined I, and Combined II (the GARCH model, however, does not outperform the benchmark for any country when measured by the MSFE). Furthermore, using the MAFE (instead of the MSFE) as the evaluation criterion provides stronger evidence of predictability in emerging markets. For example, the AR model is able to beat the benchmark in only one market when measured by the MSFE. The predictive ability of this model, however, significantly improves if we use the MAFE to measure forecasting errors. Overall, the statistical evaluation criteria show that without allowance for data-snooping bias, for up to four ETF indices, Hong Kong, Malaysia, Singapore, and perhaps Mexico, the Combined II model, driven mostly by AR(1) or GARCH(1, 1) model predictions, is able to outperform the benchmark. Again, the allowance for data-snooping bias substantially changes the picture: the only last-row P2 below 10% is for Malaysia under the MAFE criterion.
¹¹ Closely following Fama (1991) and Gencay (1998), we do not explicitly allow for transaction costs in the evaluation of trading rule performance of various models. Although there are surely positive information and trading costs, according to Fama (1991), the researcher instead should focus on the more interesting task of laying out the evidence on the adjustment of prices to various kinds of information (e.g., past returns in this study). Also note that some evidence for nonlinear-in-mean predictability would be even weaker after this consideration of transaction costs, which reinforces the main point of this study.

The economic evaluation criteria in Tables 7 and 8 show, similar to the case of developed countries, that nonlinear-in-mean models do not outperform the benchmark except in a few cases. In the case of Hong Kong, both the FC and NP models (each as a single model) outperform the benchmark under both the MFTR and MCFD criteria, while only the FC model outperforms the benchmark under MCFD for Singapore. In this case, we also find some evidence of superior forecasting ability of the FC model over both the AR and GARCH models. Still, the AR and GARCH models outperform the benchmark in some markets. In particular, the AR model outperforms the benchmark for Hong Kong and Singapore based on both MFTR and MCFD, and for Malaysia based on MCFD. When evaluated alone, the GARCH model outperforms the benchmark in 5 out of 7 countries based on MFTR. Overall, based on economic criteria, there remains strong evidence after allowance for data-snooping bias (i.e., based on last-row P2 values) that there is predictability for Hong Kong and Singapore, in addition to Malaysia as suggested by one of the statistical criteria (i.e., MAFE).

5. Conclusions

This study investigates the martingale behavior of eighteen stock market index ETFs based on out-of-sample forecasts.
In addition to a linear model, this paper employs several popular nonlinear models to explore potential nonlinearity in asset returns more comprehensively. Using both statistical and economic criteria, we find some evidence against the martingale hypothesis. Among the 18 ETF stock indices, three out of 11 developed markets (Germany, the Netherlands, and Switzerland) and three out of seven emerging markets (Hong Kong, Singapore and Malaysia) show predictability in terms of either statistical or economic criteria, or both. However, most of this evidence comes from the linear model and the nonlinear-in-variance GARCH model, while the popular nonlinear-in-mean models (neural network, semiparametric functional coefficient model, nonparametric kernel regression) generally do not help much. This finding confirms the in-sample evidence of Hsieh (1991, 1993) and Harris and Kucukozmen (2001) in the out-of-sample context, and it is in line with Moreno and Olmeda (2007) but differs from others (e.g., Gencay, 1998, 1999; Hong and Lee, 2003; Yang et al., 2008). Certainly, differences in the financial markets under study might account for such different findings. It is also important to note that the allowance for data-snooping bias using White’s Reality Check renders apparently strong predictability in many markets tenuous, and in particular undermines the otherwise impressive performance of forecast combinations. Hence, the findings of the paper underscore the importance of allowing for data-snooping in addition to the well-known overfitting problem of nonlinear models.

Finally, our study also contrasts with earlier works (e.g., Patro and Wu, 2004) on international stock market predictability using the variance ratio test. For example, Patro and Wu (2004) (see their Table 2) show that ten out of the eighteen developed markets exhibit in-sample (linear) daily return predictability.
Our results suggest that despite more thorough examination with nonlinear models and multiple evaluation criteria, with the counteracting consideration of data-snooping bias, the predictability of daily international stock market indexes might not even be as widespread as previously thought.

Acknowledgements

We thank Qi Li, Xiaojing Su, and particularly three anonymous referees and the editor Lorenzo Peccati for many helpful comments.

Appendix A

Table A1
The summary of models.

Name                        Models for $E(Y_t \mid I_{t-1})$ and $\mathrm{sign}[E(Y_t \mid I_{t-1})]$
Benchmark                   $E(Y_t \mid I_{t-1}) = \mu$
1. AR(d)                    $E(Y_t \mid I_{t-1}) = \beta_0 + \sum_{j=1}^{d} \beta_j Y_{t-j}$
2. GARCH(p, q)              $E(Y_t \mid I_{t-1}) = \mu$, where $\sigma_t^2 = \omega + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2 + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2$
3. NN(d, q)                 $E(Y_t \mid I_{t-1}) = \beta_0 + \sum_{j=1}^{d} \beta_j Y_{t-j} + \sum_{i=1}^{q} \delta_i G\bigl(\gamma_{0i} + \sum_{j=1}^{d} \gamma_{ji} Y_{t-j}\bigr)$, with $G(z) = (1 + e^{-z})^{-1}$
4. FC(d, L)                 $E(Y_t \mid I_{t-1}) = a_0(U_t) + \sum_{j=1}^{d} a_j(U_t) Y_{t-j}$, where $U_t = Y_{t-1} - L^{-1} \sum_{j=1}^{L} Y_{t-j}$
5. NP(k, m)                 $E(Y_t \mid I_{t-1}) = g(Y_{t-1}, Y_{t-2})$
6. Combined I (1, 3, 4, 5)  AR(1), NN(1, 5), FC(1, 200) and NP(200, 400)
7. Combined II (1–5)        AR(1), GARCH(1, 1), NN(1, 5), FC(1, 200) and NP(200, 400)

Notes: The benchmark model is the martingale model. AR(d) is the autoregression model. GARCH(p, q) is the generalized autoregressive conditional heteroskedasticity model. NN(d, q) is the neural network model. FC is the functional coefficient model of Cai et al. (2000). NP is the nonparametric model estimated by the kernel estimation approach. For NP(k, m) models the smoothing parameter h is used in nonparametric estimation for minimizing k period out-of-sample.

References

Ahn, D., Boudoukh, J., Richardson, M., Whitelaw, R.F., 2002. Partial adjustment or stale prices? Implications from stock index and futures return autocorrelations. Review of Financial Studies 15, 655–689.
Cai, Z., Fan, J., Yao, Q., 2000. Functional-coefficient regression models for nonlinear time series. Journal of American Statistical Association 95, 941–956.
Campbell, J., Lo, A., MacKinlay, C., 1997. The Econometrics of Financial Markets. Princeton University Press, Princeton, New Jersey.
Chaudhuri, K., Wu, Y., 2003. Random walk versus breaking trend in stock prices: Evidence from emerging markets. Journal of Banking and Finance 27, 575–592.
Chordia, T., Roll, R., Subrahmanyam, A., 2005. Evidence on the speed of convergence to market efficiency. Journal of Financial Economics 76, 271–292.
Fama, E.F., 1991. Efficient capital markets: II. Journal of Finance 46, 1575–1617.
Gencay, R., 1998. The predictability of security returns with simple technical trading rules. Journal of Empirical Finance 5, 347–359.
Gencay, R., 1999. Linear, nonlinear and essential foreign exchange rate prediction with simple trading rules. Journal of International Economics 47, 91–107.
Gleason, K.C., Mathur, I., Peterson, M.A., 2004. Analysis of intraday herding behavior among the sector ETFs. Journal of Empirical Finance 11, 681–694.
Granger, C.W.J., 1992. Forecasting stock market prices: Lessons for forecasters. International Journal of Forecasting 8, 3–13.
Harris, R.D.F., Kucukozmen, C.C., 2001. Linear and nonlinear dependence in Turkish equity returns and its consequences for financial risk management. European Journal of Operational Research 134, 481–492.
Hong, Y.M., Lee, T.H., 2003. Inference on predictability of foreign exchange rates via generalized spectrum and nonlinear time series models. Review of Economics and Statistics 85, 1048–1062.
Hsieh, D.A., 1991. Chaos and nonlinear dynamics: Application to financial markets. Journal of Finance 46, 1839–1877.
Hsieh, D.A., 1993. Implications of nonlinear dynamics for financial risk management. Journal of Financial and Quantitative Analysis 28, 41–64.
Kim, E.H., Singal, V., 2000. Stock market openings: Experience of emerging economies. Journal of Business 73, 25–66.
Lee, T.H., White, H., Granger, C.W.J., 1993. Testing for neglected nonlinearity in time series models: A comparison of neural network methods and alternative tests. Journal of Econometrics 56, 269–290.
Leitch, G., Tanner, E., 1991. Economic forecast evaluation: Profits versus conventional error measures. American Economic Review 81, 580–590.
Lo, A.W., Mackinlay, A.C., 1988. Stock market prices do not follow random walks: Evidence from a simple specification test. Review of Financial Studies 1, 41–66.
Mcqueen, G., Thorley, S., 1991. Are stock returns predictable? A test using Markov chains. Journal of Finance 46, 239–263.
Monoyios, M., Sarno, L., 2002. Mean reversion in stock index futures markets: A nonlinear analysis. Journal of Futures Markets 22, 285–314.
Moreno, D., Olmeda, I., 2007. Is the predictability of emerging and developed stock markets really exploitable? European Journal of Operational Research 182, 436–454.
Patro, D.K., Wu, Y., 2004. Predictability of short-horizon returns in international equity markets. Journal of Empirical Finance 11, 553–584.
Pennathur, A.K., Delcoure, N., Anderson, D., 2002. Diversification benefits of ishares and closed-end country funds. Journal of Financial Research 25, 541–557.
Poterba, J.M., Shoven, J.B., 2002. Exchange-traded funds: A new investment option for taxable investors. American Economic Review 92, 422–427.
Qi, M., Wu, Y., 2006. Technical trading-rule profitability, data snooping, and reality check: Evidence from the foreign exchange market. Journal of Money, Credit and Banking 38, 2135–2158.
Ratner, M., Leal, R.P.C., 1999. Tests of technical trading strategies in the emerging equity markets of Latin America and Asia. Journal of Banking and Finance 23, 1887–1905.
Sullivan, R., Timmermann, A., White, H., 1999. Data-snooping, technical trading rule performance, and the bootstrap. Journal of Finance 54, 1647–1691.
Swanson, N.R., White, H., 1995. A model selection approach to assessing the information in the term structure using linear models and artificial neural networks. Journal of Business Economics and Statistics 13, 265–275.
Swanson, N.R., White, H., 1997. A model selection approach to real time macroeconomic forecasting using linear models and artificial neural networks. Review of Economics and Statistics 79, 540–550.
Tabak, B.M., Lima, E.J.A., 2009. Market efficiency of Brazilian exchange rate: Evidence from variance ratio statistics and technical trading rules. European Journal of Operational Research 194, 814–820.
White, H., 2000. A reality check for data snooping. Econometrica 68, 1097–1126.
Yang, J., Su, X., Kolari, J.W., 2008. Do Euro exchange rates follow a martingale? Some out-of-sample evidence. Journal of Banking and Finance 32, 729–740.