
Learning from labeled and unlabeled data: an empirical study across techniques and domains

Published: 01 March 2005

Abstract

There has been increased interest in devising learning techniques that combine unlabeled data with labeled data, i.e., semi-supervised learning. However, to the best of our knowledge, no study has been performed across various techniques and different types and amounts of labeled and unlabeled data. Moreover, most of the published work on semi-supervised learning techniques assumes that the labeled and unlabeled data come from the same distribution. It is possible for the labeling process to be associated with a selection bias such that the distributions of data points in the labeled and unlabeled sets are different. Not correcting for such bias can result in biased function approximation with potentially poor performance. In this paper, we present an empirical study of various semi-supervised learning techniques on a variety of datasets. We attempt to answer various questions, such as the effect of independence or relevance among features, the effect of the size of the labeled and unlabeled sets, and the effect of noise. We also investigate the impact of sample-selection bias on the semi-supervised learning techniques under study and implement a bivariate probit technique particularly designed to correct for such bias.
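For concreteness, here is a minimal sketch of self-training, one of the simplest semi-supervised schemes of the kind surveyed here: a base classifier is fit on the labeled set, pseudo-labels its most confident unlabeled points, and retrains. The classifier choice, confidence threshold, and function names below are illustrative assumptions, not the paper's setup.

```python
# Minimal self-training sketch (illustrative; not the paper's exact method).
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    """Iteratively move confidently pseudo-labeled points into the labeled set."""
    X_l, y_l, X_u = X_lab, y_lab, X_unlab
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)        # class probabilities per point
        conf = proba.max(axis=1)              # confidence of the top class
        keep = conf >= threshold
        if not keep.any():                    # nothing confident left to add
            break
        pseudo = clf.classes_[proba[keep].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[keep]])
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~keep]
    clf.fit(X_l, y_l)                         # final fit on the grown set
    return clf
```

The sample-selection correction the abstract mentions is in the spirit of Heckman's (1979) estimator: model which cases get labeled, then adjust the outcome model for that selection. The sketch below is the classic two-step version with a continuous outcome; the paper itself fits a bivariate probit, which models a binary outcome jointly with the selection equation, so treat this only as the standard approximation. Variable names are hypothetical.

```python
# Two-step Heckman-style selection correction (illustrative approximation).
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def heckman_two_step(Z, selected, X_sel, y_sel):
    """Z: selection covariates for all cases; selected: 1 if a case was labeled;
    X_sel, y_sel: outcome covariates and outcomes for the selected cases only."""
    # Step 1: probit model of the labeling/selection process.
    Zc = sm.add_constant(Z)
    probit = sm.Probit(selected, Zc).fit(disp=0)
    index = Zc @ probit.params                  # linear index z'gamma
    imr = norm.pdf(index) / norm.cdf(index)     # inverse Mills ratio
    # Step 2: outcome model on the selected cases, with the IMR as an
    # extra regressor to absorb the selection effect.
    X_aug = np.column_stack([sm.add_constant(X_sel), imr[selected == 1]])
    outcome = sm.OLS(y_sel, X_aug).fit()
    return probit, outcome
```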



Published In

Journal of Artificial Intelligence Research, Volume 23, Issue 1 (January 2005), 720 pages

Publisher: AI Access Foundation, El Segundo, CA, United States

Received: 01 June 2004; Published: 01 March 2005
