Abstract
Previous studies have shown that Internet search data is a new data source that can be used to predict the stock market. In this new, data-driven research field, the choice of data preprocessing method is crucial to prediction accuracy. This paper proposes a preprocessing method for Internet search data, the composite leading search index (CLSI), which consists of three steps: (a) keyword selection, (b) time difference measurement, and (c) leading index composition. We demonstrate the validity of CLSI by comparing its results with those of the search volume index (SVI), the method most commonly used in the previous literature. We build a time series model (TS) with error correction and a support vector regression model (SVR) for stock trend prediction, and combine them with the two preprocessing methods into four models for comparison: SVI–TS, CLSI–TS, SVI–SVR, and CLSI–SVR. We test these four models on the Chinese stock market, which is attracting a growing number of investors, and analyze the results on nine datasets: the stable, peak, and trough periods of the Shanghai Composite Index, the Shenzhen Composite Index, and the Hushen 300 index. The results show that, with either TS or SVR as the forecasting model, CLSI performs better than SVI on the majority of the test datasets and performs almost the same as SVI on the remaining ones. These results suggest that CLSI is a more efficient preprocessing method than SVI for Internet search data in stock trend prediction.
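The three CLSI steps named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how each step is computed, so everything below is an assumption made for illustration: lagged Pearson correlation stands in for the time difference measurement, a correlation threshold stands in for keyword selection, and a correlation-weighted average of standardized, shifted series stands in for the index composition. The function names `best_lead` and `composite_leading_index` and the parameters `max_lag` and `min_corr` are hypothetical and do not come from the paper.

```python
import numpy as np


def best_lead(search, target, max_lag=5):
    """Return (lag, corr) for the lead (in periods) at which `search`
    correlates most strongly with `target`. Assumption: lagged Pearson
    correlation is used as a stand-in for the paper's time difference measure."""
    search = np.asarray(search, dtype=float)
    target = np.asarray(target, dtype=float)
    best = (0, 0.0)
    for lag in range(1, max_lag + 1):
        # Pair search[t] with target[t + lag].
        r = np.corrcoef(search[:-lag], target[lag:])[0, 1]
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


def composite_leading_index(series_by_keyword, target, max_lag=5, min_corr=0.3):
    """Hypothetical composition of a leading index from keyword search series."""
    n = len(target)
    aligned_series, weights = [], []
    for _, series in series_by_keyword.items():
        s = np.asarray(series, dtype=float)
        lag, r = best_lead(s, target, max_lag)
        # Step (a), assumed rule: keep only keywords with a clear leading relation.
        if lag == 0 or abs(r) < min_corr:
            continue
        # Step (b): shift the standardized series so index[t] uses search[t - lag].
        z = (s - s.mean()) / s.std()
        aligned = np.full(n, np.nan)
        aligned[lag:] = np.sign(r) * z[:-lag]
        aligned_series.append(aligned)
        weights.append(abs(r))
    if not aligned_series:
        raise ValueError("no keyword passed the selection threshold")
    # Step (c), assumed rule: correlation-weighted average of the aligned series.
    w = np.array(weights) / np.sum(weights)
    return w @ np.vstack(aligned_series)
```

The leading periods of the returned index are NaN wherever a selected keyword has no observation far enough in the past. A forecasting model such as TS or SVR would then take the non-NaN part of the index as an input feature.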
Notes
Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html.
In optimization, if the KKT conditions are satisfied, the solution of the original (primal) optimization problem is identical to that of the dual problem. In applications of support vector machines (both classification and regression), the dual problem rather than the original problem is solved.
Acknowledgments
This work has been partially supported by the National Natural Science Foundation of China under Grants 71202115, 71172199, 71201143, and 70972104, the Beijing Natural Science Foundation under Grant 9143021, the Postdoctoral Science Foundation of China under Grant 2013T60158, and sponsorship from the China Scholarship Council (CSC).
Additional information
The authors are grateful to the anonymous reviewers and editors for their helpful comments and suggestions to improve the paper.
Cite this article
Liu, Y., Chen, Y., Wu, S. et al. Composite leading search index: a preprocessing method of internet search data for stock trends prediction. Ann Oper Res 234, 77–94 (2015). https://doi.org/10.1007/s10479-014-1779-z
DOI: https://doi.org/10.1007/s10479-014-1779-z