Generalization of Malaria Incidence Prediction Models by Correcting Sample Selection Bias

Orlando P. Zacarias &
Henrik Boström

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8347)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Performance measurements obtained by dividing a single sample into training and test sets, e.g. by cross-validation, may not give an accurate picture of how a model developed from the sample will perform on the examples to which it is actually applied. Such measurements may be misleading when the training data and the data to which the model is applied are drawn from different distributions. In this study, two support vector machine models for predicting malaria incidence, developed from data for certain regions and time periods in Mozambique, are evaluated on data from novel regions and time periods, and the use of sample selection bias correction is investigated. Significant reductions in prediction error are observed when the correction is applied, strongly suggesting that techniques of this kind should be employed whenever the test data can be expected to come from a distribution other than that of the training data.
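To make the idea of selection bias correction concrete, the following is a minimal sketch of importance weighting, one common family of correction techniques. It is not the procedure used in the paper, whose details are not given in the abstract. The group labels, error values, and function names are hypothetical, and the sketch assumes that the shift between training and target data can be summarised by a single discrete covariate.

```python
from collections import Counter


def importance_weights(train_groups, target_groups):
    """Estimate w(g) = P_target(g) / P_train(g) for each group seen in training."""
    train_freq = Counter(train_groups)
    target_freq = Counter(target_groups)
    n_train, n_target = len(train_groups), len(target_groups)
    weights = {}
    for group, count in train_freq.items():
        p_train = count / n_train
        p_target = target_freq.get(group, 0) / n_target
        weights[group] = p_target / p_train
    return weights


def weighted_mean_abs_error(errors, groups, weights):
    """Importance-weighted mean absolute error, normalised by the total weight."""
    total_weight = sum(weights[g] for g in groups)
    weighted_sum = sum(weights[g] * abs(e) for e, g in zip(errors, groups))
    return weighted_sum / total_weight


# Hypothetical toy data: residuals of some model on held-out cases from the
# training sample, each tagged with a coarse covariate group.
train_groups = ["dry", "dry", "dry", "wet"]
train_errors = [1.0, 1.0, 1.0, 5.0]
# Unlabelled cases from the population where the model will be applied.
target_groups = ["wet", "wet", "wet", "dry"]

w = importance_weights(train_groups, target_groups)
naive = sum(abs(e) for e in train_errors) / len(train_errors)
corrected = weighted_mean_abs_error(train_errors, train_groups, w)
print(f"naive MAE: {naive:.2f}, importance-weighted MAE: {corrected:.2f}")
```

In this toy setting the unweighted estimate is about 2.0, while the weighted estimate is about 4.0, because the group that is rare in the training sample but common in the target data carries the larger errors. The same weights could in principle be passed to a learner that accepts per-example weights during training.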

Author information

Authors and Affiliations

Department of Computer and Systems Sciences, Stockholm University, Forum 100, SE-164 40, Kista, Sweden
Orlando P. Zacarias & Henrik Boström
Department of Mathematics and Informatics - Faculty of Science, Eduardo Mondlane University, Main Campus, P.O. Box 250, Maputo, Mozambique
Orlando P. Zacarias

Editor information

Editors and Affiliations

US Air Force Office of Scientific Research, 106-0032, Tokyo, Japan
Hiroshi Motoda
School of Computer Science and Technology, Zhejiang University, 310027, Hangzhou, China
Zhaohui Wu
Faculty of Engineering and Information Technology, University of Technology, Chippendale, 2008, Sydney, NSW, Australia
Longbing Cao
Department of Computing Science, Edmonton, University of Alberta, T6G 2E8, Canada
Osmar Zaiane
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Min Yao
School of Computer Science, Fudan University, 200433, Shanghai, China
Wei Wang

About this paper

Cite this paper

Zacarias, O.P., Boström, H. (2013). Generalization of Malaria Incidence Prediction Models by Correcting Sample Selection Bias. In: Motoda, H., Wu, Z., Cao, L., Zaiane, O., Yao, M., Wang, W. (eds) Advanced Data Mining and Applications. ADMA 2013. Lecture Notes in Computer Science, vol 8347. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53917-6_17

DOI: https://doi.org/10.1007/978-3-642-53917-6_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53916-9
Online ISBN: 978-3-642-53917-6
eBook Packages: Computer Science, Computer Science (R0)
