DOI:10.1145/1183614.1183709
Article

Performance thresholding in practical text classification

Published: 06 November 2006

Abstract

In practical classification, there is often a mix of learnable and unlearnable classes and only a classifier above a minimum performance threshold can be deployed. This problem is exacerbated if the training set is created by active learning. The bias of actively learned training sets makes it hard to determine whether a class has been learned. We give evidence that there is no general and efficient method for reducing the bias and correctly identifying classes that have been learned. However, we characterize a number of scenarios where active learning can succeed despite these difficulties.
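
The deployment rule described in the abstract can be illustrated with a small sketch. This is not the authors' method: it assumes F1 as the performance measure, a hypothetical threshold of 0.8, and made-up held-out labels, and the function names are chosen only for illustration.

```python
from typing import Dict, List, Tuple


def f1_score(gold: List[bool], predicted: List[bool]) -> float:
    """F1 for one binary class; returns 0.0 when there are no true positives."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def deployable_classes(
    held_out: Dict[str, Tuple[List[bool], List[bool]]],
    min_f1: float = 0.8,  # hypothetical threshold, not taken from the paper
) -> List[str]:
    """Return the classes whose estimated F1 meets the threshold."""
    return sorted(
        name
        for name, (gold, predicted) in held_out.items()
        if f1_score(gold, predicted) >= min_f1
    )


# Toy held-out labels (gold, predicted) per class; all values are made up.
held_out = {
    "sports": ([True, True, False, False, True], [True, True, False, False, True]),
    "misc": ([True, False, True, False, False], [False, False, True, True, False]),
}
selected = deployable_classes(held_out)
print(selected)
```

Such an estimate is only as reliable as the sample it is computed on. If the labeled documents were chosen by active learning rather than at random, the estimate inherits the bias the abstract describes and may misjudge whether a class has been learned.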

      Published In

      CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management
      November 2006
      916 pages
      ISBN:1595934332
      DOI:10.1145/1183614
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. accuracy estimation
      2. active learning
      3. learnability
      4. practical text classification

      Qualifiers

      • Article

      Conference

      CIKM06: Conference on Information and Knowledge Management
      November 6 - 11, 2006
      Arlington, Virginia, USA

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
