DOI: 10.1145/775047.775147
Article

B-EM: a classifier incorporating bootstrap with EM approach for data mining

Published: 23 July 2002

Abstract

This paper investigates the problem of augmenting labeled data with unlabeled data to improve classification accuracy. This is significant for many applications, such as image classification, where obtaining classification labels is expensive while large numbers of unlabeled examples are easily available. We investigate an Expectation Maximization (EM) algorithm for learning from labeled and unlabeled data. Unlabeled data boosts learning accuracy because it provides information about the joint probability distribution. A theoretical argument shows that the more unlabeled examples are combined in learning, the more accurate the result. We then introduce the B-EM algorithm, which combines EM with the bootstrap method to exploit large volumes of unlabeled data while avoiding prohibitive I/O cost. Experimental results over both synthetic and real data sets show that the proposed approach has satisfactory performance.
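
The abstract does not spell out the B-EM procedure, so the following is only a minimal sketch of the general idea under stated assumptions: a one-dimensional Gaussian class-conditional model, EM in which labeled points keep fixed class memberships, and small bootstrap samples of the unlabeled pool so that no single EM run has to scan all of it. The function names, the sample sizes, and the choice to average parameter estimates across bootstrap rounds are all hypothetical and are not taken from the paper.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian, evaluated elementwise."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def semi_supervised_em(x_lab, y_lab, x_unl, n_classes=2, n_iter=50):
    """EM for a 1-D Gaussian mixture where some points have known classes."""
    # Labeled points keep fixed one-hot responsibilities throughout.
    r_lab = np.eye(n_classes)[y_lab]
    # Initialise the parameters from the labeled data alone.
    priors = r_lab.mean(axis=0)
    means = np.array([x_lab[y_lab == k].mean() for k in range(n_classes)])
    variances = np.array([x_lab[y_lab == k].var() + 1e-3 for k in range(n_classes)])
    x_all = np.concatenate([x_lab, x_unl])
    for _ in range(n_iter):
        # E-step: posterior class probabilities for the unlabeled points only.
        lik = priors * gaussian_pdf(x_unl[:, None], means, variances)
        r_unl = lik / lik.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates over all points.
        r_all = np.vstack([r_lab, r_unl])
        weight = r_all.sum(axis=0)
        priors = weight / len(x_all)
        means = (r_all * x_all[:, None]).sum(axis=0) / weight
        variances = (r_all * (x_all[:, None] - means) ** 2).sum(axis=0) / weight + 1e-6
    return priors, means, variances

def bootstrap_em(x_lab, y_lab, x_unl, n_rounds=10, sample_size=500, seed=0):
    """Run EM on bootstrap samples of the unlabeled pool and average the results.

    Averaging is an assumption made for this sketch, not the paper's rule.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_rounds):
        # Draw a small sample of unlabeled points with replacement.
        idx = rng.integers(0, len(x_unl), size=sample_size)
        estimates.append(np.stack(semi_supervised_em(x_lab, y_lab, x_unl[idx])))
    avg = np.mean(estimates, axis=0)  # shape: (3, n_classes)
    return avg[0], avg[1], avg[2]

# Synthetic demo: two well-separated classes, 5 labels each, 10,000 unlabeled points.
rng = np.random.default_rng(42)
x_lab = np.concatenate([rng.normal(-2.0, 1.0, 5), rng.normal(2.0, 1.0, 5)])
y_lab = np.array([0] * 5 + [1] * 5)
x_unl = np.concatenate([rng.normal(-2.0, 1.0, 5000), rng.normal(2.0, 1.0, 5000)])
priors, means, variances = bootstrap_em(x_lab, y_lab, x_unl)
print("priors:", priors, "means:", means, "variances:", variances)
```

In this sketch each EM run touches only `sample_size` unlabeled points, which is one plausible way to read the abstract's remark about avoiding prohibitive I/O cost. Because the labeled points anchor each class, the averaged estimates refer to the same class in every round.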

Cited By

  • (2018) Incorporating large unlabeled data to enhance EM classification. Journal of Intelligent Information Systems 26:3 (211-226). DOI: 10.1007/s10844-006-0865-3. Online publication date: 27-Dec-2018
  • (2017) Antibody Exchange: Information Extraction of Biological Antibody Donation and a Web-Portal to Find Donors and Seekers. Data 2:4 (38). DOI: 10.3390/data2040038. Online publication date: 21-Nov-2017
  • (2016) Data-Driven Imputation Method for Traffic Data in Sectional Units of Road Links. IEEE Transactions on Intelligent Transportation Systems 17:6 (1762-1771). DOI: 10.1109/TITS.2016.2530312. Online publication date: Jun-2016

    Published In

    KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
    July 2002
    719 pages
    ISBN:158113567X
    DOI:10.1145/775047

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bootstrap method
    2. classification
    3. expectation maximization
    4. supervised and unsupervised learning

    Qualifiers

    • Article

    Conference

    KDD02
    Acceptance Rates

    KDD '02 Paper Acceptance Rate 44 of 307 submissions, 14%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
