Pre-processing is considered the first step in text classification, and choosing the right pre-processing techniques can improve classification effectiveness. We experimentally compare 15 commonly used pre-processing techniques on two Twitter datasets. We employ three machine learning algorithms, namely Linear SVC, Bernoulli Naïve Bayes, and Logistic Regression, and report the classification accuracy and the resulting number of features for each pre-processing technique. Finally, we categorize the techniques according to their performance. We find that techniques such as stemming, removing numbers, and replacing elongated words improve accuracy, while others, such as removing punctuation, do not.
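To make the "resulting number of features" measure concrete, the following is a minimal sketch of three of the techniques named in the abstract, using only the Python standard library. It is not the authors' implementation: the function names, the four example tweets, the whitespace tokenizer, and the rule of collapsing runs of three or more repeated characters to two are all assumptions made for illustration. Stemming and the three classifiers are left out because they need external libraries.

```python
import re
import string


def remove_numbers(text: str) -> str:
    """Delete every run of digits."""
    return re.sub(r"\d+", "", text)


def replace_elongated(text: str) -> str:
    """Collapse any character repeated 3+ times down to 2 (a simplification)."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


def remove_punctuation(text: str) -> str:
    """Delete all ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def vocabulary_size(tweets, preprocess=None) -> int:
    """Count distinct lowercase whitespace tokens, i.e. bag-of-words features."""
    vocab = set()
    for tweet in tweets:
        if preprocess is not None:
            tweet = preprocess(tweet)
        vocab.update(tweet.lower().split())
    return len(vocab)


# Hypothetical tweets, invented for this example only.
tweets = [
    "This phone is coooool",
    "this phone is cool",
    "Waited 2 hours!",
    "waited 3 hours",
]

if __name__ == "__main__":
    print("baseline", vocabulary_size(tweets))
    for technique in (remove_numbers, replace_elongated, remove_punctuation):
        print(technique.__name__, vocabulary_size(tweets, technique))
```

Each technique shrinks the vocabulary by merging or dropping tokens. Whether that helps accuracy is an empirical question, which is what the paper's experiments address.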
© 2017 Springer International Publishing AG
Effrosynidis, D., Symeonidis, S., Arampatzis, A. (2017). A Comparison of Pre-processing Techniques for Twitter Sentiment Analysis. In: Kamps, J., Tsakonas, G., Manolopoulos, Y., Iliadis, L., Karydis, I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science(), vol 10450. Springer, Cham. https://doi.org/10.1007/978-3-319-67008-9_31
DOI: https://doi.org/10.1007/978-3-319-67008-9_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67007-2
Online ISBN: 978-3-319-67008-9
eBook Packages: Computer Science, Computer Science (R0)