DOI: 10.1145/2684200.2684288
research-article

Investigations into Missing Values Imputation Using Random Forests for Semi-supervised Data

Published: 04 December 2014

Abstract

This paper presents a revised procedure that uses random forests to impute missing values in semi-supervised data. The method allows missing data not only in the response variable but also in the predictor variables, and it can now handle any type of data: numerical values, unordered categories, and ordered categories. Evaluating the method on the Titanic data and eleven datasets from the UC Irvine repository, we found that it performed fairly well, and that naive median imputation was also suitable in these cases.
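The abstract does not spell out the naive median baseline, so the following is only a minimal sketch under stated assumptions: numeric columns are filled with the column median and categorical columns with the most frequent category, which is how `na.roughfix` in the R randomForest package behaves. The function name `median_impute`, the column names, and the toy records are hypothetical and are not taken from the paper or from the Titanic data.

```python
from collections import Counter
from statistics import median


def median_impute(rows, numeric_cols):
    """Fill None entries per column: median for numeric, mode otherwise.

    Assumption: this mirrors a "naive median imputation" baseline; the
    paper's exact procedure is not given in the abstract.
    """
    columns = list(rows[0].keys())
    fill = {}
    for col in columns:
        observed = [r[col] for r in rows if r[col] is not None]
        if not observed:
            continue  # no observed values, so the column is left as-is
        if col in numeric_cols:
            fill[col] = median(observed)
        else:
            fill[col] = Counter(observed).most_common(1)[0][0]
    # Build new records so the input rows are not modified.
    return [
        {c: (fill.get(c) if r[c] is None else r[c]) for c in columns}
        for r in rows
    ]


# Toy records, made up for illustration only.
passengers = [
    {"age": 22.0, "fare": 7.25, "embarked": "S"},
    {"age": None, "fare": 71.28, "embarked": "C"},
    {"age": 35.0, "fare": None, "embarked": "S"},
    {"age": 54.0, "fare": 51.86, "embarked": None},
]
completed = median_impute(passengers, numeric_cols={"age", "fare"})
```

In a semi-supervised setting the response variable could be passed through the same function as one more column. A random-forest approach would instead start from a rough fill like this one and refine it, which this sketch does not attempt.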




        Published In

        iiWAS '14: Proceedings of the 16th International Conference on Information Integration and Web-based Applications & Services
        December 2014
        587 pages

        In-Cooperation

        • @WAS: International Organization of Information Integration and Web-based Applications and Services
        • Johannes Kepler Univ Linz: Johannes Kepler Universität Linz

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. Ensemble learning
        2. R
        3. UCI machine learning repository
        4. data imputation
        5. missing data
        6. rfImpute

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Funding Sources

        • Grant-in-Aid for Scientific Research

        Conference

        iiWAS '14
