DOI: 10.1145/2896338.2896347

Feature Importance and Predictive Modeling for Multi-source Healthcare Data with Missing Values

Published: 11 April 2016

Abstract

With the rapid development of sensor technologies and the Internet of Things, research in the area of connected health is increasing in importance and complexity, with wide-reaching impacts for public health. As data sources such as mobile (wearable) sensors get cheaper, smaller, and smarter, important research questions can be answered by combining information from multiple data sources. However, integrating multiple heterogeneous data streams often results in a dataset with several empty cells or missing values. The challenge is to use such sparsely populated integrated datasets without compromising model performance. Naïve approaches to dataset modification, such as discarding observations or ad hoc replacement of missing values, often lead to misleading results. In this paper, we discuss and evaluate current best practices for modeling such data with missing values and then propose an ensemble-learning-based sparse-data modeling framework. We develop a predictive model using this framework and compare it with existing models using a study in a healthcare setting. Instead of generating a single score for variable/feature importance, our framework enables the user to understand the importance of a variable based on the existing data values and their localized impact on the outcome.
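The abstract's point about naïve handling of missing values can be illustrated with a minimal sketch. It is not the paper's framework: the two data sources, their values, and the helper names (`integrate`, `complete_cases`, `mean_impute`) are all hypothetical. It shows how integrating two sources leaves empty cells, how discarding incomplete observations shrinks the dataset, and how mean replacement understates the variability of a variable.

```python
from statistics import mean, pvariance

# Hypothetical readings keyed by minute; neither source covers every minute.
wearable_hr = {0: 62.0, 1: 64.0, 3: 90.0, 4: 88.0, 5: 61.0}
room_co2 = {0: 410.0, 2: 650.0, 3: 700.0, 5: 420.0, 6: 430.0}

def integrate(*sources):
    """Outer-join the sources on their keys; absent readings become None."""
    keys = sorted(set().union(*sources))
    return [tuple(src.get(k) for src in sources) for k in keys]

def complete_cases(rows):
    """Listwise deletion: keep only rows with no missing cell."""
    return [r for r in rows if None not in r]

def mean_impute(rows, col):
    """Ad hoc replacement: fill missing cells in one column with its mean."""
    observed = [r[col] for r in rows if r[col] is not None]
    fill = mean(observed)
    return [r[col] if r[col] is not None else fill for r in rows]

rows = integrate(wearable_hr, room_co2)
kept = complete_cases(rows)
observed_hr = [r[0] for r in rows if r[0] is not None]
imputed_hr = mean_impute(rows, 0)

print(f"integrated rows: {len(rows)}, complete rows: {len(kept)}")
print(f"HR variance observed: {pvariance(observed_hr):.1f}, "
      f"after mean imputation: {pvariance(imputed_hr):.1f}")
```

With this toy data, listwise deletion keeps fewer than half of the integrated rows, and the mean-imputed heart-rate column has a smaller variance than the observed readings, which is one way such shortcuts can distort downstream models.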

    Published In

    DH '16: Proceedings of the 6th International Conference on Digital Health Conference
    April 2016
    186 pages
    ISBN:9781450342247
    DOI:10.1145/2896338
    © 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Sponsors

    • UQAM: Université du Québec à Montréal

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data science with missing data
    2. mobile-sensors
    3. multi-source data
    4. well-being analysis

    Qualifiers

    • Research-article

    Funding Sources

    • U.S. General Services Administration

    Conference

    DH '16
    Sponsor:
    • UQAM
    DH '16: Digital Health 2016
    April 11 - 13, 2016
    Montréal, Québec, Canada

    Article Metrics

    • Downloads (last 12 months): 21
    • Downloads (last 6 weeks): 3
    Reflects downloads up to 19 Nov 2024

    Cited By

    • (2023) An overview of Business Intelligence research in healthcare organizations using a topic modeling approach. 2023 13th International Conference on Computer and Knowledge Engineering (ICCKE), 521-527. DOI: 10.1109/ICCKE60553.2023.10326258. Online publication date: 1-Nov-2023
    • (2020) Integrating Co-Clustering and Interpretable Machine Learning for the Prediction of Intravenous Immunoglobulin Resistance in Kawasaki Disease. IEEE Access 8, 97064-97071. DOI: 10.1109/ACCESS.2020.2996302. Online publication date: 2020
    • (2019) The Extent and Coverage of Current Knowledge of Connected Health: Systematic Mapping Study. Journal of Medical Internet Research 21:9, e14394. DOI: 10.2196/14394. Online publication date: 25-Sep-2019
    • (2019) Feature Selection: Multi-source and Multi-view Data Limitations, Capabilities and Potentials. 2019 29th International Telecommunication Networks and Applications Conference (ITNAC), 1-6. DOI: 10.1109/ITNAC46935.2019.9077992. Online publication date: Nov-2019
    • (2017) A Regularization Approach for Identifying Cumulative Lagged Effects in Smart Health Applications. Proceedings of the 2017 International Conference on Digital Health, 99-103. DOI: 10.1145/3079452.3079503. Online publication date: 2-Jul-2017
    • (2016) Indoor Environmental Effects on Individual Wellbeing. Proceedings of the 6th International Conference on Digital Health Conference, 167-168. DOI: 10.1145/2896338.2896376. Online publication date: 11-Apr-2016