An extensive empirical study of feature selection metrics for text classification

Author:

George Forman

The Journal of Machine Learning Research, Volume 3

Pages 1289 - 1305

Published: 01 March 2003

Abstract

Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing, and personalization. In text domains, effective feature selection is essential to make the learning task efficient and more accurate. This paper presents an empirical comparison of twelve feature selection methods (e.g., Information Gain) evaluated on a benchmark of 229 text classification problem instances gathered from Reuters, TREC, OHSUMED, and other sources. The results are analyzed from multiple goal perspectives (accuracy, F-measure, precision, and recall), since each is appropriate in different situations. The results reveal that a new feature selection metric, which we call 'Bi-Normal Separation' (BNS), outperformed the others by a substantial margin in most situations. This margin widened in tasks with high class skew, which is rampant in text classification problems and is particularly challenging for induction algorithms. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner faced with a single dataset who seeks to choose one metric (or a pair of metrics) most likely to yield the best performance. From this perspective, BNS was the top single choice for all goals except precision, for which Information Gain yielded the best result most often. This analysis also revealed, for example, that Information Gain and Chi-Squared have correlated failures, and so they work poorly together. When choosing optimal pairs of metrics for each of the four performance goals, BNS is consistently a member of the pair; e.g., for greatest recall, the pair BNS + F1-measure yielded the best performance on the greatest number of tasks by a considerable margin.
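
The abstract names BNS and Information Gain without giving their formulas. As a rough illustration only, the Python sketch below scores a single term from its document counts. It assumes the commonly cited definition of BNS, |F⁻¹(tpr) − F⁻¹(fpr)|, where F⁻¹ is the inverse standard normal CDF, and it clips both rates to [0.0005, 0.9995] so the inverse stays finite. The function names, the argument layout, and the clipping constant are assumptions of this sketch; none of them is stated in the abstract.

```python
import math
from statistics import NormalDist


def bns(tp, fp, pos, neg, eps=0.0005):
    """Bi-Normal Separation score for one term (assumed definition).

    tp, fp: number of positive / negative documents containing the term.
    pos, neg: total number of positive / negative documents.
    eps: clipping constant (an assumption of this sketch).
    """
    inv = NormalDist().inv_cdf
    # Clip the rates away from 0 and 1; the inverse normal CDF is
    # unbounded at those endpoints.
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(inv(tpr) - inv(fpr))


def entropy(*counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(tp, fp, pos, neg):
    """Information Gain of the term's presence/absence about the class."""
    fn, tn = pos - tp, neg - fp
    n = pos + neg
    present, absent = tp + fp, fn + tn
    # Class entropy minus the entropy remaining after observing the term.
    h_cond = present / n * entropy(tp, fp) + absent / n * entropy(fn, tn)
    return entropy(pos, neg) - h_cond


if __name__ == "__main__":
    # Hypothetical skewed task: 100 positive and 1000 negative documents.
    # The term appears in 80 positives and 10 negatives.
    print("BNS:", bns(80, 10, 100, 1000))
    print("IG: ", information_gain(80, 10, 100, 1000))
```

Both functions take the same four counts, so ranking a vocabulary under either metric amounts to computing one score per term and sorting. The counts in the usage example are invented for illustration and do not come from the paper's benchmark.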



Published In

The Journal of Machine Learning Research, Volume 3

Issue date: 1 March 2003

1437 pages

ISSN: 1532-4435

EISSN: 1533-7928

Publisher

JMLR.org


