
Learning from labeled and unlabeled data: an empirical study across techniques and domains

Published: 01 March 2005

Abstract

There has been increased interest in devising learning techniques that combine unlabeled data with labeled data, i.e., semi-supervised learning. However, to the best of our knowledge, no study has been performed across various techniques and different types and amounts of labeled and unlabeled data. Moreover, most of the published work on semi-supervised learning techniques assumes that the labeled and unlabeled data come from the same distribution. It is possible for the labeling process to be associated with a selection bias such that the distributions of data points in the labeled and unlabeled sets are different. Not correcting for such bias can result in biased function approximation with potentially poor performance. In this paper, we present an empirical study of various semi-supervised learning techniques on a variety of datasets. We attempt to answer various questions, such as the effect of independence or relevance among features, the effect of the size of the labeled and unlabeled sets, and the effect of noise. We also investigate the impact of sample-selection bias on the semi-supervised learning techniques under study and implement a bivariate probit technique particularly designed to correct for such bias.
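For concreteness, here is a minimal sketch of self-training, one of the simplest semi-supervised schemes of the kind surveyed here: a base classifier is fit on the labeled set, pseudo-labels its most confident unlabeled points, and retrains. The classifier choice, confidence threshold, and function names below are illustrative assumptions, not the paper's setup.

```python
# Minimal self-training sketch (illustrative; not the paper's exact method).
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    """Iteratively move confidently pseudo-labeled points into the labeled set."""
    X_l, y_l, X_u = X_lab, y_lab, X_unlab
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)        # class probabilities per point
        conf = proba.max(axis=1)              # confidence of the top class
        keep = conf >= threshold
        if not keep.any():                    # nothing confident left to add
            break
        pseudo = clf.classes_[proba[keep].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[keep]])
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~keep]
    clf.fit(X_l, y_l)                         # final fit on the grown set
    return clf
```

The sample-selection correction the abstract mentions is in the spirit of Heckman's (1979) estimator: model which cases get labeled, then adjust the outcome model for that selection. The sketch below is the classic two-step version with a continuous outcome; the paper itself fits a bivariate probit, which models a binary outcome jointly with the selection equation, so treat this only as the standard approximation. Variable names are hypothetical.

```python
# Two-step Heckman-style selection correction (illustrative approximation).
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def heckman_two_step(Z, selected, X_sel, y_sel):
    """Z: selection covariates for all cases; selected: 1 if a case was labeled;
    X_sel, y_sel: outcome covariates and outcomes for the selected cases only."""
    # Step 1: probit model of the labeling/selection process.
    Zc = sm.add_constant(Z)
    probit = sm.Probit(selected, Zc).fit(disp=0)
    index = Zc @ probit.params                  # linear index z'gamma
    imr = norm.pdf(index) / norm.cdf(index)     # inverse Mills ratio
    # Step 2: outcome model on the selected cases, with the IMR as an
    # extra regressor to absorb the selection effect.
    X_aug = np.column_stack([sm.add_constant(X_sel), imr[selected == 1]])
    outcome = sm.OLS(y_sel, X_aug).fit()
    return probit, outcome
```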



Published In

Journal of Artificial Intelligence Research, Volume 23, Issue 1 (January 2005), 720 pages

Publisher: AI Access Foundation, El Segundo, CA, United States

Received: 01 June 2004; Published: 01 March 2005
