An extensive empirical study of feature selection metrics for text classification

Published: 01 March 2003

Abstract

Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing, and personalization. In text domains, effective feature selection is essential to make the learning task efficient and more accurate. This paper presents an empirical comparison of twelve feature selection methods (e.g., Information Gain) evaluated on a benchmark of 229 text classification problem instances gathered from Reuters, TREC, OHSUMED, etc. The results are analyzed from multiple goal perspectives (accuracy, F-measure, precision, and recall), since each is appropriate in different situations. The results reveal that a new feature selection metric, which we call 'Bi-Normal Separation' (BNS), outperformed the others by a substantial margin in most situations. This margin widened in tasks with high class skew, which is rampant in text classification problems and is particularly challenging for induction algorithms. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner faced with a single dataset who seeks to choose one (or a pair of) metrics that are most likely to yield the best performance. From this perspective, BNS was the top single choice for all goals except precision, for which Information Gain yielded the best result most often. This analysis also revealed, for example, that Information Gain and Chi-Squared have correlated failures, and so they work poorly together. When choosing optimal pairs of metrics for each of the four performance goals, BNS is consistently a member of the pair; e.g., for greatest recall, the pair BNS + F1-measure yielded the best performance on the greatest number of tasks by a considerable margin.
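
The abstract names Bi-Normal Separation without defining it. As a rough illustration only, the sketch below assumes the commonly cited form of the metric: the absolute difference between the inverse standard normal CDF of a feature's true positive rate and that of its false positive rate. The function name `bns`, its parameters, and the clipping constant `eps` are assumptions made for this example and are not taken from the text above.

```python
from statistics import NormalDist


def bns(tp, fp, pos, neg, eps=0.0005):
    """Illustrative Bi-Normal Separation score for a single feature.

    tp  -- positive-class documents containing the feature
    fp  -- negative-class documents containing the feature
    pos -- total positive-class documents
    neg -- total negative-class documents
    eps -- assumed clipping bound that keeps both rates strictly inside (0, 1)
    """
    inverse_normal_cdf = NormalDist().inv_cdf
    # The inverse normal CDF is unbounded at 0 and 1, so clip both rates.
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(inverse_normal_cdf(tpr) - inverse_normal_cdf(fpr))


# A word in 80 of 100 positives but only 100 of 1000 negatives separates the
# classes far better than a word in 50 of 100 positives and 400 of 1000 negatives.
strong = bns(tp=80, fp=100, pos=100, neg=1000)
weak = bns(tp=50, fp=400, pos=100, neg=1000)
```

Because the score depends only on the two rates rather than on raw counts, a feature that is rare overall but concentrated in a small positive class can still rank highly, which fits the abstract's remark about tasks with high class skew.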

Published In

The Journal of Machine Learning Research, Volume 3, 3/1/2003
1437 pages
ISSN:1532-4435
EISSN:1533-7928
Publisher

JMLR.org
