A Comparison of Pre-processing Techniques for Twitter Sentiment Analysis

Dimitrios Effrosynidis,
Symeon Symeonidis &
Avi Arampatzis

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10450)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries


Abstract

Pre-processing is considered to be the first step in text classification, and choosing the right pre-processing techniques can improve classification effectiveness. We experimentally compare 15 commonly used pre-processing techniques on two Twitter datasets. We employ three machine learning algorithms, namely Linear SVC, Bernoulli Naïve Bayes, and Logistic Regression, and report the classification accuracy and the resulting number of features for each pre-processing technique. Finally, we categorize the techniques according to their performance. We find that techniques such as stemming, removing numbers, and replacing elongated words improve accuracy, while others, such as removing punctuation, do not.
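To make two of the techniques named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes plain regular expressions for removing numbers and for shortening elongated words (runs of three or more identical letters are cut to two), and a whitespace tokenizer for counting features. The function names and the toy tweets are hypothetical and chosen only for illustration; stemming is left out because it would need an external library.

```python
import re


def remove_numbers(text: str) -> str:
    """Delete every run of digits (assumed regex-based variant)."""
    return re.sub(r"\d+", "", text)


def replace_elongated(text: str) -> str:
    """Shorten runs of 3+ identical letters to 2, e.g. 'cooool' -> 'cool'.

    This is a simplification: it does not check the result against a
    dictionary, so some outputs may still be non-words.
    """
    return re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)


def preprocess(text: str) -> str:
    """Apply both techniques in sequence."""
    return remove_numbers(replace_elongated(text))


def vocabulary_size(tweets: list[str]) -> int:
    """Count distinct lowercased whitespace tokens (a proxy for feature count)."""
    return len({token for tweet in tweets for token in tweet.lower().split()})


# Hypothetical toy data, not taken from the paper's datasets.
tweets = [
    "This movie is cooool",
    "This movie is cool",
    "I waited 45 minutes",
    "I waited 20 minutes",
]
cleaned = [preprocess(t) for t in tweets]

print("features before:", vocabulary_size(tweets))
print("features after: ", vocabulary_size(cleaned))
```

On this toy input the distinct-token count drops after pre-processing, which is the kind of change in feature count the abstract says is reported alongside accuracy. Whether such a reduction helps or hurts a classifier is an empirical question that the sketch does not address.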



Author information

Authors and Affiliations

Database and Information Retrieval Research Unit, Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100, Xanthi, Greece
Dimitrios Effrosynidis, Symeon Symeonidis & Avi Arampatzis

Corresponding author

Correspondence to Dimitrios Effrosynidis.

Editor information

Editors and Affiliations

Faculteit der Geesteswetenschappen, Universiteit van Amsterdam, Amsterdam, The Netherlands
Jaap Kamps
Library & Information Center, University of Patras, Patras, Greece
Giannis Tsakonas
Aristotle University of Thessaloniki, Thessaloniki, Greece
Yannis Manolopoulos
Civil Engineering, University of Thrace, Kimmeria, Greece
Lazaros Iliadis
Informatics, Ionian University, Kerkyra, Greece
Ioannis Karydis

About this paper

Cite this paper

Effrosynidis, D., Symeonidis, S., Arampatzis, A. (2017). A Comparison of Pre-processing Techniques for Twitter Sentiment Analysis. In: Kamps, J., Tsakonas, G., Manolopoulos, Y., Iliadis, L., Karydis, I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham. https://doi.org/10.1007/978-3-319-67008-9_31

DOI: https://doi.org/10.1007/978-3-319-67008-9_31
Published: 02 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67007-2
Online ISBN: 978-3-319-67008-9
eBook Packages: Computer Science, Computer Science (R0)
