Time Series Momentum For Improved Factor Timing
Where r is the return of asset i, t indicates time period, σ is a volatility estimator for the asset
returns, α is the regression constant and β is the regression coefficient. Note that in order to
combine data across asset classes, we normalize returns by the previous month realized
volatility. While Baltas & Kosowski (2013) use a variance-covariance estimator which is robust
to cross-asset correlations, autocorrelation and heteroskedasticity, for simplicity, we only make
use of a heteroskedasticity and autocorrelation robust estimator. This implies that our standard
errors are underestimated, for which reason we don’t cite confidence intervals. We mainly
include the regression for illustrative purposes. Nevertheless, we would be remiss not to mention
that our results appear less significant than those reported in Baltas & Kosowski (2013),
which could be a result of windowing: as we show later, the TSMOM anomaly seems to have
deteriorated steadily after 2010.
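The regression described above can be sketched in the pooled form popularized by Moskowitz et al. (2012). This is a reconstruction from the symbol definitions in the text, not a verbatim copy of the original equation; the lag index h and the error term ε are assumed notation.

```latex
\frac{r^{i}_{t}}{\sigma^{i}_{t-1}}
  = \alpha + \beta_{h}\,\frac{r^{i}_{t-h}}{\sigma^{i}_{t-h-1}} + \varepsilon^{i}_{t}
```

One regression is run per monthly lag h, and the t-value of β_h is what a chart of t-values by lag would display.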
Figure 1: An illustration of the t-value of the AR coefficient of 60 separate regressions at different
monthly lags. Bars show the coefficient t-value, the red line is a 12-month windowed mean of the
t-values. The data is cross-asset-class monthly futures returns data normalized by asset volatility.
We use the 5 Fama-French factors from the mid-1960s to 2020 to determine what risk
factors each portfolio is exposed to. Note this does not include momentum. The graph below
shows historical portfolio value growth over time for each of these 5 factors (Market, Size,
Value, Operating Profit, Investment).
Figure 2: Historical portfolio value growth for each of the five Fama-French factors
These factors are all long/short and have historically had low risk and low returns. Returns range
from roughly 6% (excess market return) to only 1.5% (for size). Note that our sample begins in
1964, so some of the size premium documented in the literature is not captured in our
sample.
4 Strategy & Methodology
100% Equity
Why not invest in 100% equities? Assuming no costs, 100% equities would have returned
roughly 10% a year annualized (on a historical basis), but with a 15% average annual volatility
and a max drawdown of 55%. This implies a multiple (return divided by volatility) of 0.66 for
the strategy, and our factor exposure (using the FF 5-factor model) clearly shows that we carry
only market risk. We should be able to diversify these risks to produce better risk-adjusted returns!
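A minimal sketch of how such summary statistics could be computed from a series of periodic returns. It assumes that the "multiple" is annualized return divided by annualized volatility (consistent with the figures quoted), that returns are monthly, and that annualization is arithmetic; the function name and the toy inputs are hypothetical.

```python
import math

def performance_summary(returns, periods_per_year=12):
    """Annualized return, volatility, return/vol multiple and max drawdown."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    ann_return = mean * periods_per_year            # arithmetic annualization (assumption)
    ann_vol = math.sqrt(var * periods_per_year)
    # Max drawdown: largest peak-to-trough loss of the compounded value path.
    value, peak, max_dd = 1.0, 1.0, 0.0
    for r in returns:
        value *= 1.0 + r
        peak = max(peak, value)
        max_dd = max(max_dd, 1.0 - value / peak)
    return {
        "ann_return": ann_return,
        "ann_vol": ann_vol,
        "multiple": ann_return / ann_vol,
        "max_drawdown": max_dd,
    }

# Rounded headline figures for 100% equity: 10% return over 15% volatility.
equity_multiple = 0.10 / 0.15
toy_stats = performance_summary([0.1, -0.5, 0.2])
```

With the rounded inputs the ratio comes out near 0.67, in line with the 0.66 quoted once rounding of the underlying figures is allowed for.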
60/40 Portfolio
The next classic portfolio we should touch on is the 60/40 portfolio. We proxied this
portfolio by investing 60% in the FF market portfolio and 40% at the risk-free rate,
with no transaction fees or holding costs. While this portfolio only produces an 8%
average annualized return, it does so with a lower volatility of only 9% and a smaller max
drawdown of 36% than being invested in 100% equities. Overall the multiple is 0.89, so in many
ways the 60/40 does appear to be better diversified than the all-equity portfolio. The factor
exposures are still overwhelmingly dominated by market risk (see factor exposures below).
Equal Weight, Risk Parity, and Momentum on the Fama French 5 Factors
Next we wanted to achieve better risk adjusted returns by diversifying our factor
exposures. The best way to do this was to directly invest in combinations of the Fama French
risk factors themselves. We started with an equal weight of each of the 5 factors to achieve
maximum factor diversification. The equal weight portfolio produces a multiple of 1.09 and a
maximum drawdown of 17%, but does so at the cost of any real returns with an average annual
return of 3.7%.
We also applied a risk parity approach investing more in FF factors with lower volatility
in the last month. This tilted away from more volatile market risk (as one would anticipate) and
produced a multiple of 1.15 but with only 3.1% annualized returns. The diversified factor
exposures can be seen in the chart below.
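The risk parity weighting can be illustrated with inverse-volatility weights. This is a sketch under assumptions: the text only says that lower-volatility factors receive more weight, so the exact scheme (weights proportional to 1/σ, normalized to sum to one, with volatility taken from returns observed during the last month) is assumed, and the toy data are hypothetical.

```python
import math

def realized_vol(returns):
    """Sample standard deviation of a list of returns."""
    n = len(returns)
    mean = sum(returns) / n
    return math.sqrt(sum((r - mean) ** 2 for r in returns) / (n - 1))

def inverse_vol_weights(last_month_returns):
    """Map {factor: returns observed in the last month} to weights summing to 1."""
    inv = {name: 1.0 / realized_vol(rets) for name, rets in last_month_returns.items()}
    total = sum(inv.values())
    return {name: value / total for name, value in inv.items()}

# Hypothetical toy data: the market factor is twice as volatile as size,
# so it receives half the weight of size.
weights = inverse_vol_weights({
    "market": [0.02, -0.02, 0.02, -0.02],
    "size": [0.01, -0.01, 0.01, -0.01],
})
```

Because market risk is typically the most volatile of the five factors, a scheme of this kind tilts away from it, which matches the behaviour described above.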
Table 2: Risk factor decomposition of the 5 factor risk parity portfolio
Lastly, to try to get higher returns we looked at a momentum strategy on top of the FF 5
factor model to tilt our weights more heavily towards factors that had performed well in the
past year (excluding the last month to avoid the short-term reversal). This did increase returns
but only to 3.8% annualized with a multiple of 1.1.
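A sketch of the momentum signal described above: the cumulative return over the past year, skipping the most recent month. How the signal is turned into weights is not specified in the text, so the rank-proportional tilt below is purely an illustrative assumption, as are the function names and toy data.

```python
def momentum_12_1(monthly_returns):
    """Cumulative return over months t-12 .. t-2, skipping the latest month."""
    window = monthly_returns[-12:-1]      # 11 months, most recent month excluded
    growth = 1.0
    for r in window:
        growth *= 1.0 + r
    return growth - 1.0

def momentum_tilt_weights(history):
    """Weight factors in proportion to their momentum rank (assumed scheme)."""
    scores = {name: momentum_12_1(rets) for name, rets in history.items()}
    ordered = sorted(scores, key=scores.get)              # worst performer first
    total = len(ordered) * (len(ordered) + 1) / 2
    return {name: (rank + 1) / total for rank, name in enumerate(ordered)}

# Hypothetical toy data: "b" only rallied in the month that is skipped.
tilt = momentum_tilt_weights({
    "a": [0.01] * 12,
    "b": [0.0] * 11 + [0.5],
})
```

In the toy example factor "a" receives the larger weight, because the rally in "b" falls entirely in the excluded last month.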
We have seen that 100% equity is too concentrated and is only exposed to one risk factor
(the market). With the 60/40 portfolio we were able to achieve better risk-adjusted returns but
with lower absolute returns, and we still had only one risk factor. Explicitly diversifying across
risk factors brought us better risk-adjusted returns, but with much too low absolute returns. Moving on, our
paper proceeds to focus on a time series momentum strategy that aims to increase absolute
returns while minimizing risk factors.
We implement the TSMOM strategy using purchased futures data from Pinnacle Data
Corp CLC Database. The original data set consists of 98 contracts traded across FX, Fixed
Income, Commodities and Equity. The data ranges from 1969 for the oldest contracts all the way
to 2020. We only use the subset of the contracts that have complete data from 1990 to 2020,
which reduces the data set to 48 contracts. We chose this data set because it was the
data set of choice for Lim et al. (2019) when creating their deep momentum network. When
working with models that are difficult to fit and tune, like neural networks, having reportedly
successful data provides some hope that it is the ill-calibrated model that is failing and not the
poor quality of the data, a very powerful motivator when debugging custom TensorFlow loss
functions at 3am.
Since the raw data is a series of futures contracts with different maturities, some preprocessing
must be done to obtain a continuous price series for each asset. We use the backwards
ratio-adjusted method to do the rolling for each asset. Compared to other methods
(first-of-month, last-trading-day, most-liquid), the ratio-adjusted method does not produce jumps
in the continuous price series, which is a desirable feature when working with returns data.
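A minimal sketch of backward ratio adjustment, to make the "no jumps" property concrete. The input layout is an assumption for illustration: one price list per held contract, with consecutive lists overlapping on the roll date. The function name is hypothetical and this is not necessarily how the data vendor or the original pipeline performs the roll.

```python
def back_ratio_adjust(segments):
    """Stitch per-contract price segments into one continuous series.

    Each segment holds the prices of the contract held in that period.
    Consecutive segments overlap on the roll date: the last price of the
    old contract and the first price of the new one refer to the same day.
    """
    factor = 1.0
    adjusted = list(segments[-1])             # newest contract is left unscaled
    for k in range(len(segments) - 2, -1, -1):
        # Ratio of new to old contract price on the roll date.
        factor *= segments[k + 1][0] / segments[k][-1]
        # Drop the old contract's roll-date price: that day is already covered.
        adjusted = [p * factor for p in segments[k][:-1]] + adjusted
    return adjusted

# Hypothetical example: old contract at 100 -> 102, new contract at 51 -> 52.
continuous = back_ratio_adjust([[100.0, 102.0], [51.0, 52.0]])
```

In the example the old contract's history is rescaled by 51/102, so the stitched series preserves the 2% return earned on the old contract and shows no artificial jump at the roll.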
An important quality to consider about futures data is that the price not only has exposure
to shocks in the asset’s spot price but is also exposed to roll yield. Based on the movement of the
contract along the term structure, the holder of a futures contract may make or lose money
(depending on whether they are on the long or short leg and whether the term structure is in
contango or backwardation). The returns of a futures contract may thus be decomposed into two
components: the spot return and the roll yield.
While we didn’t have the time to disentangle the two return streams for this project, it
would be an interesting extension to model these two return streams separately. As an example,
for E-mini S&P 500 futures we could subtract the S&P 500 spot return from the futures return to
obtain the roll yield. We could then use term structure data to create a predictive model for the
roll yield separate from
the TSMOM we capture from the spot price. The combination of the two would give a more
granular view of the return dynamics and hopefully a more powerful forecasting model.
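The decomposition discussed above can be written compactly. This is an approximate identity stated under assumptions: the superscripts are assumed notation, and collateral interest and compounding cross-terms are ignored.

```latex
r^{\mathrm{fut}}_{t,t+1} \approx r^{\mathrm{spot}}_{t,t+1} + r^{\mathrm{roll}}_{t,t+1}
\quad\Longrightarrow\quad
r^{\mathrm{roll}}_{t,t+1} \approx r^{\mathrm{fut}}_{t,t+1} - r^{\mathrm{spot}}_{t,t+1}
```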
As a baseline TSMOM factor we use the sign-strategy presented by Moskowitz et al.
(2012). The strategy is described by the following formula:
Where s indicates the asset or contract being traded, t is the time period, σ_t is a backward-looking
volatility estimator for asset s, and σ_tgt is the target volatility. In plain English, this strategy is the
trading rule to go long in an asset if it has had a net positive return over the past 12 months and
go short otherwise. The exposure or leveraged weighting to the asset is such that a target
volatility σ_tgt is achieved. This weighting is highly dependent on the accuracy of the volatility
estimator σ_t. Different choices of the volatility estimator and the estimator’s effect on the
TSMOM strategy are discussed in Baltas & Kosowski (2013). We later return to this discussion
when inspecting deep momentum networks.
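The trading rule just described can be sketched for a single asset as follows. The function and argument names are hypothetical, and the default target volatility of 0.15 is a placeholder rather than a value taken from the text.

```python
def sign_tsmom_return(past_12m_return, next_return, sigma, sigma_tgt=0.15):
    """Next-period return of the sign-strategy for one asset.

    past_12m_return: cumulative return over the previous 12 months
    next_return:     realized return over the next period
    sigma:           backward-looking volatility estimate
                     (on the same annualization basis as sigma_tgt)
    sigma_tgt:       target volatility (placeholder default, an assumption)
    """
    direction = 1.0 if past_12m_return > 0 else -1.0    # long if positive, short otherwise
    leverage = sigma_tgt / sigma                        # scale exposure to the target volatility
    return direction * leverage * next_return

# A calm asset (7.5% vol) that has been falling is shorted with 2x leverage.
short_leg = sign_tsmom_return(-0.03, 0.01, sigma=0.075, sigma_tgt=0.15)
# A volatile asset (30% vol) that has been rising is held long at half size.
long_leg = sign_tsmom_return(0.08, 0.01, sigma=0.30, sigma_tgt=0.15)
```

The example shows why the volatility estimate matters so much: halving σ doubles the position, so an estimate that is too low translates directly into excess leverage.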
Since we are normalizing asset returns with volatility, we can create a cross-asset
TSMOM portfolio by arithmetically combining the risk-weighted returns and leveraging it to a
desired target volatility (essentially a risk-parity portfolio of single-asset TSMOM strategies).
This is done by taking the sum over all assets S:
Where N is the number of assets. To introduce a more consistent formalism, Lim et al. argue that
there are two components to any TSMOM strategy. First, a prediction of the trend, i.e. the
direction and/or magnitude of the next period returns. Second, given the trend estimate, the
position size to take for the next period bet. According to this formalism, the sign-strategy
described above would be encapsulated by:
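Using the symbols defined around this passage, the cross-asset portfolio return and the sign-strategy in trend and position-size form can be sketched as below. This is our reading of the formulation in Moskowitz et al. (2012) and Lim et al. (2019); the trend symbol Y and the superscript (s) for the asset are assumed notation.

```latex
% Cross-asset portfolio: average of volatility-scaled single-asset returns
r^{\mathrm{TSMOM}}_{t,t+1}
  = \frac{1}{N} \sum_{s=1}^{N} X^{(s)}_{t}\,
    \frac{\sigma_{\mathrm{tgt}}}{\sigma^{(s)}_{t}}\, r^{(s)}_{t,t+1}

% Sign-strategy: trend estimate, then position size
Y^{(s)}_{t} = r^{(s)}_{t-12,t},
\qquad
X^{(s)}_{t} = \operatorname{sign}\!\left(Y^{(s)}_{t}\right)
```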
While this basic strategy is effective at capturing TSMOM, it does lend itself to criticism.
To touch on the most important shortcomings of the strategy:
1. Position Sizing: The position sizing is determined fully by the volatility estimator σ_t,
which is backward looking. If the model had predictive ability, based on confidence and
size of the move, it should be able to do more optimal position sizing, taking larger risk
when greater return is expected.
3. Non-optimal: A last criticism is that this model isn’t optimized for any return metric.
Depending on our investment preferences, we would like to optimize the TSMOM factor
to have certain characteristics, for example maximum-return, minimum-risk-to-reward, or
minimum-downturn-to-reward. By optimizing with respect to these investment metrics,
we can design a TSMOM factor suited to the investor’s preferences.
The Deep Momentum Networks (DMN) proposed by Lim et al. (2019) address the
shortcomings of the basic sign-strategy. DMNs use a traditional neural network architecture to
map lagged time series features to an optimal next-period trend or size prediction.
In a traditional supervised learning problem, we have a labelled data set, i.e. explicitly
defined training targets. For our problem, this would mean that we have knowledge of the
optimal position size Y at all previous time steps and optimize a metric such as
mean-squared-error or mean-absolute-error to derive the optimal model. However, when
forming a trading strategy, we don’t have these labels. If we did, we would already have a model
to determine optimal position size, so why fit a model on top? In practice, optimal position sizing
is determined by the investor’s utility and risk preferences. So, instead of optimizing on known
targets and traditional loss functions, DMNs optimize their prediction on financial trading
metrics via a custom loss function.
We explore a loss function that optimizes the Sharpe ratio. The implementation isn’t limited
to this loss function: for example, an investor who is more concerned about downside risk with
respect to target returns may choose to optimize the Sortino ratio or drawdown. We define the
loss functions as follows:
Where s indicates the asset or contract, t the time period, r is the return, σ_{s,t} is a
backward-looking volatility estimator for asset s, and σ_tgt is the target volatility. The quantity
the DMN supplies to the loss function is the position size X ∈ [-1, 1]. Note that we thus let the
network jointly predict trend and position size. Positive values constitute a long position, negative
values a short position. By predicting and optimizing on position size, we aim for the DMN to
capture opportune times to take more or less risk.
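A Sharpe-ratio loss of the kind described can be sketched in plain Python. It follows our reading of the Sharpe loss in Lim et al. (2019): the negative annualized ratio of the mean to the standard deviation of volatility-scaled strategy returns, pooled over all asset and time pairs. The annualization factor of 252 assumes daily data, and the names are hypothetical; a Keras version would express the same arithmetic with tensor operations so that gradients can flow through the positions.

```python
import math

def sharpe_loss(positions, returns, vols, sigma_tgt=0.15, periods_per_year=252):
    """Negative annualized Sharpe ratio of volatility-scaled strategy returns.

    positions, returns and vols are equally long flat lists over all
    (asset, time) pairs: position X in [-1, 1], next-period return r,
    and backward-looking volatility estimate sigma.
    """
    captured = [x * sigma_tgt / s * r for x, r, s in zip(positions, returns, vols)]
    n = len(captured)
    mean = sum(captured) / n
    variance = sum(c * c for c in captured) / n - mean * mean
    return -mean * math.sqrt(periods_per_year) / math.sqrt(variance)

# Toy check: being long through a positive return gives a negative loss,
# and flipping every position flips the sign of the loss.
loss_long = sharpe_loss([1.0, 1.0], [0.02, 0.0], [0.15, 0.15], periods_per_year=1)
loss_short = sharpe_loss([-1.0, -1.0], [0.02, 0.0], [0.15, 0.15], periods_per_year=1)
```

Minimizing this quantity is equivalent to maximizing the Sharpe ratio, which is why no labelled target positions are needed.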
Since the main innovation of DMN is the use of custom loss functions, the black box
which fits between input and output is arbitrary. Indeed, we might be better-off using a simpler
machine learning model such as a support-vector-machine. But for the sake of the modelling
challenge, we proceed with the neural network approach.
The network architectures explored in Lim et al. (2019) are the feedforward
Multilayer-Perceptron (MLP), a convolution-based architecture, WaveNet (CNV), as well as a
Long-Short-Term-Memory Network (LSTM). They find, perhaps unsurprisingly, that LSTM
outperforms MLP and CNV. However, the simpler MLP beats CNV.
While LSTMs seem to be the better approach for a time series problem like this, they
require more extensive data preprocessing and tuning. In the end we settled for the MLP
architecture, which provides the best compromise between result quality, implementation
complexity, and convergence speed.
The MLP network is implemented using Keras. Tunable hyperparameters include the
dropout of the input layer, dropout of the hidden layers, number of hidden layers, number of
neurons per hidden layer, activation function for the input layer, activation function for the
hidden layers and the learning rate of the optimizer. We optimize the model hyperparameters
using hyperopt, a framework for Bayesian optimization of model parameters. For the
hyperparameter optimization we train each network on data from 1990-1995 and validate on data
from 1995-2000. We choose the model architecture with the smallest loss on the validation set,
which results in the architecture presented in Table 3 (parameters rounded).
To constrain the prediction interval to [-1, 1] we use the hyperbolic tangent (tanh) function as the
output layer activation. Furthermore, we use the Adam optimizer, which we found to converge
better than Stochastic Gradient Descent. We selected the batch size for maximum computational
speed on our specific compute / GPU machine. In general, we found that the data required few
training epochs: fewer than 10 epochs usually seemed sufficient, and most best-scoring models only
required two or three epochs of training. This points to the fact that these networks really aren’t
that “Deep”. While the DMN does have hidden layers, given the small model complexity, the
name “Deep Momentum Networks” seems somewhat sensational.
The nine input features used to fit the DMN model are outlined in Appendix A.
Where X is the position size, t indicates the time period, s the asset or contract, σ is the volatility
estimator for the asset, N is the number of assets in the portfolio. The σ related to DMN is an
additional backward looking scaling factor to make sure that the combined portfolio has a
desired target volatility.
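One plausible way to write the portfolio return described here, given the stated symbols, is shown below. This is an assumption-laden sketch rather than a verbatim equation: in particular, the placement of the additional scaling factor and the superscript notation are our reading of the description.

```latex
r^{\mathrm{DMN}}_{t,t+1}
  = \frac{\sigma_{\mathrm{tgt}}}{\sigma^{\mathrm{DMN}}_{t}} \cdot
    \frac{1}{N} \sum_{s=1}^{N} X^{(s)}_{t}\,
    \frac{\sigma_{\mathrm{tgt}}}{\sigma^{(s)}_{t}}\, r^{(s)}_{t,t+1}
```

The inner ratio scales each asset to the target volatility, while the outer ratio rescales the combined portfolio, whose volatility is otherwise reduced by diversification across assets.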
Due to volatility scaling, the assets are commonly traded on leverage. Leverage is not a
practical implementation concern for the strategy, as all traded contracts are futures, which are de
facto traded in accounts commonly allowing 20x leverage for institutional traders. However, we
do make the controversial assumption that we can meet the margin requirement in times of
stress. Our backtest methodology does not include margin calls or deleveraging of the account.
Incorporating this into the returns of the strategy would severely complicate our goal of creating
a pure TSMOM factor by deteriorating the signal which we are trying to encapsulate in the first
place. From an implementation perspective, this means that the investor or fund must have
enough liquidity to stomach the downturns, even during prolonged losing streaks.
For institutional investors, trading futures contracts has very little cost involved when
compared to other assets. There are no extra fees for being on the short leg of a contract and
exchange commissions or brokerage fees are small. However, when implementing any strategy,
particularly a quant factor strategy for a large fund, there could be significant implementation
shortfall involved. This means that we execute at a worse than desired price, consistently eating
into our returns. To be mindful of trading costs and market impact, we gauge cost relative to
portfolio turnover, which, in the context of volatility weighted portfolios, is defined by Baltas &
Kosowski (2015) as:
Where O is the turnover, c is a constant for transaction cost, X is the position size at t, σ is the
volatility estimator.
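A sketch of a turnover-based cost for one asset, built only from the symbols listed above. The functional form (cost proportional to the absolute change in the volatility-scaled position) follows our reading of the cost adjustment used by Lim et al. (2019) and is an assumption here; any additional scaling by the target volatility or averaging across assets is left out, and the names are hypothetical.

```python
def turnover_cost(x_now, sigma_now, x_prev, sigma_prev, c=0.0002):
    """Cost charged for changing the volatility-scaled position of one asset.

    x_now, x_prev:         position sizes at t and t-1
    sigma_now, sigma_prev: volatility estimates at t and t-1
    c:                     cost per unit of turnover (0.0002 = 2 basis points)
    """
    turnover = abs(x_now / sigma_now - x_prev / sigma_prev)
    return c * turnover

# Flipping from fully short to fully long in an asset with 10% volatility.
flip_cost = turnover_cost(1.0, 0.10, -1.0, 0.10)
# Holding the same scaled position costs nothing.
hold_cost = turnover_cost(0.5, 0.20, 0.5, 0.20)
```

Note that turnover can be non-zero even when X is unchanged, because a change in the volatility estimate alone forces a rebalance of the scaled position.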
We don’t consider potential interest accrual from excess cash in times when the TSMOM
strategy has decreased exposure.
5 Results and Analysis
We display the performance of the TSMOM factor with and without transaction costs in
Figure 3. For comparison, we display performance of the AQR cross-sectional momentum factor
as well as those of the baseline SIGN-TSMOM. The curve closely replicates the one produced by
Lim et al. (2019), which confirms that we are capturing the same qualitative factor behaviour as
they did with their DMN. Pure DMN-TSMOM easily outperforms the other strategies over the
backtest period. However, there seems to be a structural change around 2005 when the
DMN-TSMOM returns start deteriorating and volatility starts increasing. This effect worsens
over time and the DMN-TSMOM strategy posts flat returns post 2012. The AQR momentum
factor as well as SIGN-TSMOM post more consistent performance throughout the backtest
period. A plus for TSMOM strategies in general, however, is that they produce smaller downturns
than the AQR cross-sectional momentum. When applying a transaction cost of 2 basis points to
DMN-TSMOM we observe a significant reduction in performance. This is evidence of the fact
that momentum strategies, as produced by our DMN, have large turnover. To observe the effect
of a range of different transaction costs on the DMN-TSMOM factor, please refer to Appendix
B.
Next we revisited our Market, 60/40, and Fama French 5-Factor portfolios but added
TSMOM as a 6th factor to try to increase the absolute return. TSMOM significantly
outperformed the market (100% equity portfolio) and the 60/40 portfolio in all metrics of
risk/return that we chose to track. The time series of the returns can be seen in Figure 4 below.
Figure 4: Shows the material outperformance of TSMOM over traditional benchmarks
The Fama-French portfolios all exhibit significantly superior returns and multiples when
we added the TSMOM factor. Surprisingly, an equal weighting of each of the now 6 factors saw
the best performance. The drastic difference with and without TSMOM can be seen in Figure 5
below.
Figure 5: Shows the outperformance of Fama-French 5 factor equal weight portfolio when TSMOM is
added.
An interesting observation in favor of TSMOM is that both the DMN and the SIGN
strategies seem to have little-to-no sensitivity to the dot-com bust or the downturn of the great
financial crisis. Baltas & Kosowski (2015) confirm this result for time series momentum. The
strategy’s robustness is most likely attributable to two factors. Firstly, the diversification effect of
trading across asset classes (FX, fixed income, equity, commodities). Secondly, the ability to
cheaply take short positions via futures contracts. The remarkable robustness of the factor to
these two financial events increases its attractiveness as an addition to a factor portfolio.
Given the outstanding risk-adjusted returns, DMN-TSMOM appears a promising
contender for inclusion in a quant factor portfolio. However, there are a couple of caveats to
touch on. Firstly, the metrics don’t hold once 2 basis points of transaction cost are applied to the
strategy. However, we can’t conclusively say whether DMN-TSMOM net of transaction costs
is outperformed by AQR cross-sectional momentum or SIGN-TSMOM, as the latter
don’t include transaction costs. Further analysis would be required for such a statement. Secondly,
a major concern with DMN-TSMOM is that its return profile seems to have been deteriorating
over the past decade. It is possible that the trade has become crowded over time and the signal
destroyed, which would make the strategy unprofitable going forward. This wouldn’t be
surprising as the inputs for this strategy are very simple and available to all investors. What we
can say is that over the first 15 backtest years, DMN-TSMOM has had excellent performance.
Figure 6: Shapley value based feature importance of the final Deep Momentum Network model. Impact is
measured as the mean absolute Shapley value, which captures the average magnitude but not the direction
of feature impact.
Figure 7: Impact of the daily normalized returns feature on predictions as measured by Shapley value.
Strong positive returns indicate an increase in next-period model output, strong negative returns a
decrease.
6 Conclusion
We’ve constructed a time series momentum factor using Deep Momentum Networks by
combining nearly 50 futures contracts from several major asset classes into a strategy portfolio.
The factor outperforms baseline strategies and cross-sectional momentum on risk and reward
metrics such as Sharpe and Sortino ratio. It performs well in extreme periods such as the dot-com
bubble and the great financial crisis, which makes it an attractive diversification to an equity
factor portfolio. Each of the Fama-French factor-based reference portfolios was materially
improved when we added the time series momentum factor.
However, since 2005 the performance of the time series momentum factor has been
consistently deteriorating, and since 2012 has in fact been flat. The deterioration has manifested
itself in the form of decreasing expected returns and increasing volatility. Accounting for
transaction costs, the superior risk and return metrics of the strategy flounder. This is a consequence of
the high turnover associated with implementing a momentum strategy.
While the factor has a proven track record and demonstrably adds value to the
Fama-French factor portfolios, given the poor recent performance and high sensitivity to
transaction costs, any investor considering incorporating this strategy should be aware that it
comes with significant implementation risk.
References
Asness, C., Moskowitz, T. J. and Pedersen, L. H.: 2010, Value and momentum everywhere, AQR
Capital, University of Chicago, and National Bureau of Economic Research
Baltas, AN. and Kosowski, R.: 2013, Improving Time-Series Momentum Strategies: The Role of
Volatility Estimators and Trading Signals, CME Working Paper
Baltas, AN. and Kosowski, R.: 2015, Improving Time-Series Momentum Strategies: The Role of
Volatility Estimators and Trading Signals (Revised), CME Working Paper
Cameron, A. C., Gelbach, J. B. and Miller, D. L.: 2011, Robust inference with multiway
clustering, Journal of Business and Economic Statistics 29(2), 238–249
Gupta, T. and Kelly, B.: 2018, Factor Momentum Everywhere, Yale ICF Working Paper
No.2018-23
Lim, B., Zohren, S. and Roberts, S.: 2019, Enhancing Time-Series Momentum Strategies Using
Deep Neural Networks, The Journal of Financial Data Science Fall 2019, 1 (4), 19-38
Moskowitz, T., Ooi, Y. H. and Pedersen, L. H.: 2012, Time series momentum, Journal of
Financial Economics 104(2), 228 – 250
Thompson, S. B.: 2011, Simple formulas for standard errors that cluster by both firm and time,
Journal of Financial Economics 99(1), 1–10
Appendix A: DMN Feature Space
We use a mix of lagged returns, volatility estimates and technical momentum indicators
as our deep momentum network feature space:
Even though most features are already to some extent normalized, for convergence
purposes, we perform additional standardization on the data set by transforming each feature to
have zero mean and unit variance within the training set.
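The standardization step can be sketched as a z-score transform whose parameters are estimated on the training set only and then reused for later data, which avoids leaking information from the validation or test period. The function names are hypothetical and the use of the population standard deviation is an assumption.

```python
import math

def fit_standardizer(train_column):
    """Return (mean, std) estimated on the training set only."""
    n = len(train_column)
    mean = sum(train_column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in train_column) / n)
    return mean, std

def standardize(column, mean, std):
    """Apply a previously fitted z-score transform to any data split."""
    return [(v - mean) / std for v in column]

# Fit on training data, then apply the same parameters to unseen data.
train_feature = [1.0, 2.0, 3.0]
mu, sd = fit_standardizer(train_feature)
train_scaled = standardize(train_feature, mu, sd)
test_scaled = standardize([5.0], mu, sd)
```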
Appendix B: DMN-TSMOM Transaction Cost Impact
Figure 9: Performance of the DMN-TSMOM strategy with different levels of transaction cost. The y-axis is
on log scale and the series have been normalized to start at the same origin.