The impact of preprocessing on text classification

Authors:

Alper Kursat Uysal,

Serkan Gunal

Information Processing and Management: an International Journal, Volume 50, Issue 1

Pages 104 - 112

https://doi.org/10.1016/j.ipm.2013.08.006

Published: 25 November 2019

Abstract

Preprocessing is one of the key components in a typical text classification framework. This paper extensively examines the impact of preprocessing on text classification in terms of classification accuracy, text domain, text language, and dimension reduction. For this purpose, all possible combinations of widely used preprocessing tasks are comparatively evaluated on two domains, namely e-mail and news, and in two languages, namely Turkish and English. In this way, the contribution of the preprocessing tasks to classification success at various feature dimensions, the possible interactions among these tasks, and the dependency of these tasks on the respective languages and domains are comprehensively assessed. Experimental analysis on benchmark datasets reveals that choosing appropriate combinations of preprocessing tasks, rather than enabling or disabling them all, may provide significant improvement in classification accuracy, depending on the domain and language studied.
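The idea of evaluating every combination of preprocessing tasks can be illustrated with a minimal sketch. The abstract does not name the tasks, so the three switches below (`lowercase`, `remove_stopwords`, `stem`), the tiny stop-word set, and the toy `crude_stem` helper are all assumptions made for illustration and are not the authors' implementation. The sketch only enumerates the on/off configurations and shows how each one changes the tokens. It does not train or score a classifier.

```python
from itertools import product

# Hypothetical preprocessing switches; the abstract does not name the tasks.
TASKS = ("lowercase", "remove_stopwords", "stem")

# Tiny illustrative stop-word list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "of", "on", "is"}


def crude_stem(token):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text, lowercase, remove_stopwords, stem):
    """Tokenize on whitespace, then apply only the enabled tasks."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


def all_configurations():
    """Yield every on/off combination of the tasks as a dict of flags."""
    for flags in product((False, True), repeat=len(TASKS)):
        yield dict(zip(TASKS, flags))


if __name__ == "__main__":
    sample = "The impact of preprocessing on text classification"
    for config in all_configurations():
        print(config, preprocess(sample, **config))
```

With three binary switches there are 2^3 = 8 configurations. In a full experiment, each configuration would feed its tokens into feature selection and a classifier so that accuracy can be compared across configurations, domains, and languages.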

Published In

Information Processing and Management: an International Journal, Volume 50, Issue 1

January 2014

234 pages

ISSN: 0306-4573

Copyright © 2013 Elsevier Ltd.

Publisher

Pergamon Press, Inc.

United States


Cited By

Geisler N, Binnig C, Fekete J, Omidvar-Tehrani B, Rong K, Shraga R (2024). Towards Extending XAI for Full Data Science Pipelines. Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics, pp. 1-7. doi:10.1145/3665939.3665967. Online publication date: 14-Jun-2024.
https://dl.acm.org/doi/10.1145/3665939.3665967
Dou M, Tang J, Tiwari P, Ding Y, Guo F (2024). Drug–Drug Interaction Relation Extraction Based on Deep Learning: A Review. ACM Computing Surveys 56:6, pp. 1-33. doi:10.1145/3645089. Online publication date: 7-Feb-2024.
https://dl.acm.org/doi/10.1145/3645089
Mongardini A, La Morgia M, Jajodia S, Vincenzo Mancini L, Mei A (2024). DARD: Deceptive Approaches for Robust Defense Against IP Theft. IEEE Transactions on Information Forensics and Security 19, pp. 5591-5606. doi:10.1109/TIFS.2024.3402433. Online publication date: 17-May-2024.
https://dl.acm.org/doi/10.1109/TIFS.2024.3402433
Babanejad N, Davoudi H, Agrawal A, An A, Papagelis M (2024). The Role of Preprocessing for Word Representation Learning in Affective Tasks. IEEE Transactions on Affective Computing 15:1, pp. 254-272. doi:10.1109/TAFFC.2023.3270115. Online publication date: 1-Jan-2024.
https://dl.acm.org/doi/10.1109/TAFFC.2023.3270115
Okkalioglu M (2024). A novel redistribution-based feature selection for text classification. Expert Systems with Applications: An International Journal 246:C. doi:10.1016/j.eswa.2023.123119. Online publication date: 15-Jul-2024.
https://dl.acm.org/doi/10.1016/j.eswa.2023.123119
Teng T, Varathan K, Crestani F (2024). A comprehensive review of cyberbullying-related content classification in online social media. Expert Systems with Applications: An International Journal 244:C. doi:10.1016/j.eswa.2023.122644. Online publication date: 15-Jun-2024.
https://dl.acm.org/doi/10.1016/j.eswa.2023.122644
Bencheikh Lehocine M, Belhadef H (2024). Preprocessing-Based Approach for Prompt Intrusion Detection in SDN Networks. Journal of Network and Systems Management 32:4. doi:10.1007/s10922-024-09841-9. Online publication date: 16-Aug-2024.
https://dl.acm.org/doi/10.1007/s10922-024-09841-9
Kekül H, Ergen B, Arslan H (2024). Estimating vulnerability metrics with word embedding and multiclass classification methods. International Journal of Information Security 23:1, pp. 247-270. doi:10.1007/s10207-023-00734-7. Online publication date: 1-Feb-2024.
https://dl.acm.org/doi/10.1007/s10207-023-00734-7
Atzenhofer-Baumgartner F, Kovács T (2024). Is Text Normalization Relevant for Classifying Medieval Charters? Linking Theory and Practice of Digital Libraries, pp. 125-132. doi:10.1007/978-3-031-72440-4_12. Online publication date: 24-Sep-2024.
https://dl.acm.org/doi/10.1007/978-3-031-72440-4_12
Zou W, Zhang W, Tian Z, Wu W (2023). A hybrid model for text classification using part-of-speech features. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 45:1, pp. 1235-1249. doi:10.3233/JIFS-231699. Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.3233/JIFS-231699
