Generalization of Malaria Incidence Prediction Models by Correcting Sample Selection Bias

  • Conference paper
Advanced Data Mining and Applications (ADMA 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8347)

Abstract

Performance measurements obtained by dividing a single sample into training and test sets, e.g., by cross-validation, may not give an accurate picture of how a model developed from that sample will perform on the examples to which it is eventually applied. Such measurements may be misleading when the training sample and the examples encountered in application are drawn according to different distributions. In this study, two support vector machine models for predicting malaria incidence, developed from certain regions and time periods in Mozambique, are evaluated on data from novel regions and time periods, and the use of sample selection bias correction is investigated. It is observed that the correction yields significant reductions in prediction error, strongly suggesting that techniques of this kind should be employed whenever test data can be expected to be drawn from a distribution other than the one that generated the training data.
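
The general idea of correcting for a mismatch between the training and application distributions can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not describe their correction procedure, and the sketch swaps in a weighted least-squares line for the support vector machine and a simple histogram density ratio for the weight estimate. The data are synthetic, and every name (`importance_weights`, `weighted_line_fit`, the bin count, the sampling distributions) is an assumption made for illustration. Training inputs are drawn uniformly, target inputs are concentrated at larger values, and each training example is weighted by an estimate of p_target(x) / p_train(x) computed from unlabeled target inputs only.

```python
import math
import random


def bin_of(x, n_bins, lo, hi):
    """Index of the equal-width bin containing x (clamped to the last bin)."""
    width = (hi - lo) / n_bins
    return min(int((x - lo) / width), n_bins - 1)


def histogram(xs, n_bins, lo, hi):
    """Fraction of xs that falls into each bin."""
    counts = [0] * n_bins
    for x in xs:
        counts[bin_of(x, n_bins, lo, hi)] += 1
    return [c / len(xs) for c in counts]


def importance_weights(train_x, target_x, n_bins=10, lo=0.0, hi=2.0):
    """Histogram estimate of p_target(x) / p_train(x) for each training x."""
    p_train = histogram(train_x, n_bins, lo, hi)
    p_target = histogram(target_x, n_bins, lo, hi)
    # p_train is never zero here: each training x lies in its own bin.
    return [p_target[b] / p_train[b]
            for b in (bin_of(x, n_bins, lo, hi) for x in train_x)]


def weighted_line_fit(xs, ys, ws):
    """Closed-form weighted least squares for y = intercept + slope * x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def mse(intercept, slope, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the run is reproducible

# Synthetic data (assumption): the true relation is y = x^2 plus small noise.
train_x = [2.0 * rng.random() for _ in range(2000)]              # uniform on [0, 2)
target_x = [2.0 * math.sqrt(rng.random()) for _ in range(2000)]  # skewed toward 2
train_y = [x * x + rng.gauss(0.0, 0.05) for x in train_x]
target_y = [x * x + rng.gauss(0.0, 0.05) for x in target_x]

# Uncorrected model: every training example counts equally.
plain = weighted_line_fit(train_x, train_y, [1.0] * len(train_x))

# Corrected model: training examples are reweighted toward the target inputs.
weights = importance_weights(train_x, target_x)
corrected = weighted_line_fit(train_x, train_y, weights)

mse_plain = mse(plain[0], plain[1], target_x, target_y)
mse_corrected = mse(corrected[0], corrected[1], target_x, target_y)
print("target MSE, uncorrected:", mse_plain)
print("target MSE, corrected:  ", mse_corrected)
```

Only the target inputs, not the target labels, enter the weight estimate; the target labels are used solely to score the two fitted lines. The histogram estimator is chosen for brevity and is unlikely to scale well beyond a few input dimensions.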

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zacarias, O.P., Boström, H. (2013). Generalization of Malaria Incidence Prediction Models by Correcting Sample Selection Bias. In: Motoda, H., Wu, Z., Cao, L., Zaiane, O., Yao, M., Wang, W. (eds) Advanced Data Mining and Applications. ADMA 2013. Lecture Notes in Computer Science, vol 8347. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53917-6_17

  • DOI: https://doi.org/10.1007/978-3-642-53917-6_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-53916-9

  • Online ISBN: 978-3-642-53917-6

  • eBook Packages: Computer Science, Computer Science (R0)
