Performance thresholding in practical text classification

Authors:

Hinrich Schütze,

Emre Velipasaoglu,

Jan O. Pedersen

CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management

Pages 662 - 671

https://doi.org/10.1145/1183614.1183709

Published: 06 November 2006

Abstract

In practical classification, there is often a mix of learnable and unlearnable classes and only a classifier above a minimum performance threshold can be deployed. This problem is exacerbated if the training set is created by active learning. The bias of actively learned training sets makes it hard to determine whether a class has been learned. We give evidence that there is no general and efficient method for reducing the bias and correctly identifying classes that have been learned. However, we characterize a number of scenarios where active learning can succeed despite these difficulties.

Index Terms

Performance thresholding in practical text classification
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval
    1. Document representation

Information & Contributors

Information

Published In

CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management

November 2006

916 pages

ISBN: 1595934332

DOI: 10.1145/1183614

General Chair: Philip S. Yu, IBM T.J. Watson Research Center (USA)

Program Chairs: Vassilis Tsotras, University of California-Riverside (USA); Edward Fox, Virginia Tech (USA); Bing Liu, University of Illinois at Chicago (USA)

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

CIKM06: Conference on Information and Knowledge Management

November 6 - 11, 2006

Arlington, Virginia, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 29

Total Downloads: 575

Downloads (Last 12 months): 6

Downloads (Last 6 weeks): 1

Reflects downloads up to 25 Nov 2024

Citations

Cited By

Martinis M, Chiara Z (2024) Natural Language Processing Approaches in Bioinformatics. Reference Module in Life Sciences. Online publication date: 2024.
https://doi.org/10.1016/B978-0-323-95502-7.00179-2

Guerra-Manzanares A, Bahsi H (2024) Experts still needed: boosting long-term android malware detection with active learning. Journal of Computer Virology and Hacking Techniques, 20:4 (901-918). Online publication date: 3-Oct-2024.
https://doi.org/10.1007/s11416-024-00536-y

Guerra-Manzanares A, Bahsi H (2023) On the application of active learning for efficient and effective IoT botnet detection. Future Generation Computer Systems, 141 (40-53). Online publication date: Apr-2023.
https://doi.org/10.1016/j.future.2022.10.024

Guerra-Manzanares A, Bahsi H (2023) On the Application of Active Learning to Handle Data Evolution in Android Malware Detection. Digital Forensics and Cyber Crime (256-273). Online publication date: 16-Jul-2023.
https://doi.org/10.1007/978-3-031-36574-4_15

Vilares Ferro M, Darriba Bilbao V, Ribadas Pena F, Graña Gil J (2022) Surfing the Modeling of pos Taggers in Low-Resource Scenarios. Mathematics, 10:19 (3526). Online publication date: 27-Sep-2022.
https://doi.org/10.3390/math10193526

Klemme F, Amrouch H (2022) Efficient Learning Strategies for Machine Learning-Based Characterization of Aging-Aware Cell Libraries. IEEE Transactions on Circuits and Systems I: Regular Papers, 69:12 (5233-5246). Online publication date: Dec-2022.
https://doi.org/10.1109/TCSI.2022.3201431

Chaudhuri A, Talukdar J, Su F, Chakrabarty K (2022) Functional Criticality Analysis of Structural Faults in AI Accelerators. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41:12 (5657-5670). Online publication date: Dec-2022.
https://doi.org/10.1109/TCAD.2022.3166108

Vilares Ferro M, Darriba Bilbao V, Vilares Ferro J (2022) Absolute convergence and error thresholds in non-active adaptive sampling. Journal of Computer and System Sciences. Online publication date: May-2022.
https://doi.org/10.1016/j.jcss.2022.05.002

Chaudhuri A, Talukdar J, Su F, Chakrabarty K (2020) Functional Criticality Classification of Structural Faults in AI Accelerators. 2020 IEEE International Test Conference (ITC) (1-5). Online publication date: 1-Nov-2020.
https://doi.org/10.1109/ITC44778.2020.9325272

Wang G, Hwang J, Rose C, Wallace F (2019) Uncertainty-Based Active Learning via Sparse Modeling for Image Classification. IEEE Transactions on Image Processing, 28:1 (316-329). Online publication date: 1-Jan-2019.
https://dl.acm.org/doi/10.1109/TIP.2018.2867913
