Abstract
Performance measurements obtained by dividing a single sample into training and test sets, e.g., by cross-validation, may not give an accurate picture of how a model developed from that sample will perform on the examples to which it is eventually applied. Because the training examples and the examples encountered at application time may be drawn from different distributions, such measurements can be misleading. In this study, two support vector machine models for predicting malaria incidence, developed from data for certain regions and time periods in Mozambique, are evaluated on data from novel regions and time periods, and the use of selection bias correction is investigated. It is observed that significant reductions in prediction error can be obtained with this correction, which strongly suggests that techniques of this kind should be employed whenever test data can be expected to be drawn from a distribution other than that of the training data.
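The idea of correcting for a mismatch between training and test distributions can be illustrated with a small sketch. The abstract does not say which correction method the study uses, so everything below is an assumption made for illustration: the weights are estimated with a simple shared histogram, a weighted least-squares line stands in for the support vector machine models, and the data are synthetic. The names `histogram_weights`, `weighted_line_fit`, and `run_demo` are hypothetical.

```python
import random
from collections import Counter


def histogram_weights(train_x, test_x, n_bins=10):
    """Estimate w(x) = p_test(x) / p_train(x) for each training input,
    using relative frequencies in histogram bins shared by both samples."""
    lo = min(min(train_x), min(test_x))
    hi = max(max(train_x), max(test_x))
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range

    def bin_of(x):
        return min(int((x - lo) / width), n_bins - 1)

    train_counts = Counter(bin_of(x) for x in train_x)
    test_counts = Counter(bin_of(x) for x in test_x)
    weights = []
    for x in train_x:
        b = bin_of(x)
        p_train = train_counts[b] / len(train_x)  # > 0: x itself lies in bin b
        p_test = test_counts[b] / len(test_x)     # 0 if no test input falls in bin b
        weights.append(p_test / p_train)
    return weights


def weighted_line_fit(xs, ys, ws):
    """Weighted least-squares fit of y = intercept + slope * x
    (a simple stand-in for a weighted learner, not an SVM)."""
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(model, xs, ys):
    """Mean squared error of a fitted line on the given points."""
    intercept, slope = model
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def run_demo(seed=0, n=2000):
    """Synthetic example: training inputs cover [0, 2], test inputs only [1.5, 2]."""
    rng = random.Random(seed)

    def target(x):
        return x * x + rng.gauss(0.0, 0.05)  # nonlinear target with small noise

    train_x = [rng.uniform(0.0, 2.0) for _ in range(n)]
    test_x = [rng.uniform(1.5, 2.0) for _ in range(n)]
    train_y = [target(x) for x in train_x]
    test_y = [target(x) for x in test_x]

    weights = histogram_weights(train_x, test_x)
    plain = weighted_line_fit(train_x, train_y, [1.0] * n)    # no correction
    corrected = weighted_line_fit(train_x, train_y, weights)  # reweighted training set
    return weights, mse(plain, test_x, test_y), mse(corrected, test_x, test_y)


weights, mse_plain, mse_corrected = run_demo()
print(f"test MSE without correction: {mse_plain:.4f}")
print(f"test MSE with correction:    {mse_corrected:.4f}")
```

In this toy setting the reweighted fit concentrates on the part of the input space where test inputs actually occur, so its test error should come out lower than that of the unweighted fit. Note that the weights are computed from test inputs only; no test labels are needed.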
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Zacarias, O.P., Boström, H. (2013). Generalization of Malaria Incidence Prediction Models by Correcting Sample Selection Bias. In: Motoda, H., Wu, Z., Cao, L., Zaiane, O., Yao, M., Wang, W. (eds) Advanced Data Mining and Applications. ADMA 2013. Lecture Notes in Computer Science, vol 8347. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53917-6_17
DOI: https://doi.org/10.1007/978-3-642-53917-6_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53916-9
Online ISBN: 978-3-642-53917-6
eBook Packages: Computer Science, Computer Science (R0)