Abstract
For the task of hate speech and offensive language detection, this paper explores the potential advantages of using small datasets to develop efficient word embeddings for deep-learning models. We investigate the impact of feature vectors generated by four selected word embedding techniques (word2vec, wang2vec, fastText, and GloVe) applied to text corpora on the order of a billion tokens. After training the classifiers with these pre-trained word embeddings, we compare their classification performance against that of feature vectors generated from small, dataset-specific corpora on the order of thousands of tokens. Numerical examples show that the smallest word embeddings yield slightly lower accuracy but, in combination with shorter training times, lead to non-dominated solutions. This finding has an immediate application: training time can be reduced significantly at a small penalty in classification accuracy. We explore two ways to rank the studied alternatives, based on performance factors and on PROMETHEE-II scores. According to both rankings, GloVe is the best method for NILC embeddings, and fastText is the best method for dataset-specific embeddings. One would expect a dataset-specific word embedding to fit a particular dataset better, yielding shorter training and higher accuracy; however, the obtained results indicate that NILC embeddings achieve an equally good fit.
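The PROMETHEE-II ranking mentioned in the abstract can be sketched as follows. This is a minimal illustration only: the alternatives, weights, and accuracy/training-time figures are hypothetical placeholders rather than the paper's measurements, and the usual (0/1) preference function is assumed.

```python
# Minimal PROMETHEE-II sketch: rank embedding alternatives by net outranking flow.
# Criteria: classification accuracy (maximize) and CNN training time (minimize).
# All numbers below are illustrative, not results reported in the paper.

def promethee_ii(alternatives, weights, maximize):
    names = list(alternatives)
    n = len(names)

    def pref(a, b):
        # Usual-criterion preference: 1 if a is strictly better than b, else 0.
        score = 0.0
        for k, w in enumerate(weights):
            diff = alternatives[a][k] - alternatives[b][k]
            if not maximize[k]:          # for "minimize" criteria, smaller is better
                diff = -diff
            score += w * (1.0 if diff > 0 else 0.0)
        return score

    flows = {}
    for a in names:
        phi_plus = sum(pref(a, b) for b in names if b != a) / (n - 1)
        phi_minus = sum(pref(b, a) for b in names if b != a) / (n - 1)
        flows[a] = phi_plus - phi_minus   # net flow: higher is better

    return sorted(flows.items(), key=lambda kv: kv[1], reverse=True)

alternatives = {                # (accuracy, CNN training time in s) -- hypothetical
    "GloVe":    (0.84, 310.0),
    "fastText": (0.83, 120.0),
    "word2vec": (0.82, 150.0),
}
ranking = promethee_ii(alternatives, weights=[0.5, 0.5], maximize=[True, False])
print(ranking)
```

With these placeholder figures, fastText's shorter training time lets it outrank GloVe despite slightly lower accuracy, illustrating how the net flow trades off the two criteria.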
Notes
Training time involves CNN training, exclusively, and to stress that, we often refer to it as CNN training time throughout the manuscript. The word-embedding training is assumed to be performed a priori and only once: we either use pre-trained word embeddings, or train our own models. In either case, the training time of the word-embedding model is not taken into account in this study.
The term “word embeddings” refers to the representation of words as vectors of real numbers. These representations are expected to encode some information about the relative positions of words.
Tokens are atomic units of data used for text analysis. In general, any string delimited by spaces or punctuation marks is considered a token.
http://inf.ufrgs.br/~rppelle/hatedetector/, last accessed in October 23rd, 2020.
https://rdm.inesctec.pt/id/dataset/cs-2017-008, last accessed in October 23rd, 2020.
http://nilc.icmc.usp.br/embeddings, last accessed in October 24th, 2020.
https://github.com/wlin12/wang2vec, last accessed in October 24th, 2020.
https://fasttext.cc/docs/en/support.html, last accessed in October 24th, 2020.
https://github.com/stanfordnlp/GloVe, last accessed in October 24th, 2020.
https://github.com/samcaetano/hatespeech_detector, last accessed in July 25th, 2019.
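The notes above define tokens operationally as strings delimited by spaces or punctuation. A minimal sketch of such a delimiter-based tokenizer is shown below; it is illustrative only and not the preprocessing code used by the authors.

```python
import re

# Naive tokenizer matching the footnote's definition: any maximal run of
# word characters (delimited by whitespace or punctuation) is one token.
def tokenize(text):
    return re.findall(r"\w+", text, flags=re.UNICODE)

tokens = tokenize("Hate speech detection, in Portuguese!")
print(tokens)  # -> ['Hate', 'speech', 'detection', 'in', 'Portuguese']
```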
References
Abadi M, et al. (2016) Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467
Angiani G, Ferrari L, Fontanini T, Fornacciari P, Iotti E, Magliani F, Manicardi S (2016) A comparison between preprocessing techniques for sentiment analysis in Twitter. In: 2nd international workshop on knowledge discovery on the WEB. Cagliari, Italy
Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146
Brans J-P, Mareschal B (2005) PROMETHEE methods. In: Multiple criteria decision analysis: State of the art surveys. Springer, New York, pp 163–186
Cer D, Yang Y, Kong S-Y, Hua N, Limtiaco N, John RS, Constant N, Guajardo-Cespedes M, Yuan S, Tar C, Sung Y-H, Strope B, Kurzweil R (2018) Universal sentence encoder. arXiv:1803.11175
Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A (2017) Supervised learning of universal sentence representations from natural language inference data. In: Palmer M, Hwa R, Riedel S (eds) Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, September 9–11, 2017. Association for Computational Linguistics
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297
de Pelle R, Moreira V (2017) Offensive comments in the Brazilian Web: a dataset and baseline results. In: Brazilian Workshop on Social Network Analysis and Mining (BraSNAM), SP, Brazil
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, pp 4171–4186
Dhiman H, Deb D (2020) Multi-criteria decision-making: An overview. In: Studies in Systems, Decision and Control, vol 253. Springer, Singapore
Ezeibe C (2021) Hate speech and election violence in Nigeria. Journal of Asian and African Studies 56(4):919–935
Fortuna P (2017) Automatic detection of hatespeech in text: An overview of the topic and dataset annotation with hierarchical classes. Master’s thesis. https://hdl.handle.net/10216/106028, Faculty of Engineering, University of Porto. Porto, Portugal
Fortuna P, Nunes S (2018) A survey on automatic detection of hate speech in text. ACM Comput Surv 51(4):1–30
Fortuna P, Rocha da Silva J, Soler-Company J, Wanner L, Nunes S (2019) A hierarchically-labeled Portuguese hate speech dataset. In: Proceedings of the Third Workshop on Abusive Language Online. Association for Computational Linguistics, Italy, pp 94–104
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press. http://www.deeplearningbook.org
Hartmann N, Fonseca E, Shulby C, Treviso M, Rodrigues J, Aluísio S (2017) Portuguese word embeddings: Evaluating on word analogies and natural language tasks. In: Anais do XI Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, Porto Alegre, RS, Brasil, pp 122–131. SBC
Joulin A, Grave E, Bojanowski P, Mikolov T (2017) Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, Spain, pp 427–431
Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Qatar, pp 1746–1751
Leite J, Silva D, Bontcheva K, Scarton C (2020) Toxic language detection in social media for Brazilian Portuguese: New dataset and multilingual analysis. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, China, pp 914–924
Lima C, Dal Bianco G (2019) Extração de característica para identificação de discurso de ódio em documentos. In: Anais da XV escola regional de banco de dados, Porto Alegre, RS, Brasil, pp 61–70. SBC
Ling W, Dyer C, Black A, Trancoso I (2015) Two/Too simple adaptations of word2Vec for syntax problems. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Colorado, pp 1299–1304
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781v3
Pari C, Nunes G, Gomes J (2019) Avaliação de técnicas de word embedding na tarefa de detecção de discurso de ódio. In: Anais do XVI Encontro Nacional de Inteligência Artificial e Computacional, Porto Alegre, RS, Brasil, pp 1020–1031. SBC
Pennington J, Socher R, Manning C (2014) GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Qatar, pp 1532–1543
Peters M, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, Louisiana, pp 2227–2237
Petrolito R, Dell’Orletta F (2018) Word embeddings in sentiment analysis. In: Italian Conference on Computational Linguistics (CLiC-it), Turin, Italy
Poletto F, Basile V, Sanguinetti M, Bosco C, Patti V (2021) Resources and benchmark corpora for hate speech detection: a systematic review. Lang Resour Eval 55(2):477–523
Pugliero F (2018) Como o ódio viralizou no Brasil. Available at: https://www.dw.com/pt-br/como-o-odio-viralizou-no-brasil/a-45097506. Accessed: October 23rd
Rodrigues J, Branco A, Neale S, Silva J (2016) LX-DSemVectors: Distributional semantics models for Portuguese. In: 12th International Conference on Computational Processing of the Portuguese Language (PROPOR), Tomar, Portugal
Roy P, Tripathy A, Das T, Gao X-Z (2020) A framework for hate speech detection using deep convolutional neural network. IEEE Access 8:204951–204962
Sherstinsky A (2020) Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena 404:132306
Silva S, Serapião A (2018) Detecção de discurso de ódio em português usando CNN combinada a vetores de palavras. In: Symposium on Knowledge Discovery, Mining and Learning (KDMILE), São Paulo, Brazil
Spertus E (1997) Smokey: Automatic recognition of hostile messages. In: Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, AAAI’97/IAAI’97. AAAI Press, pp 1058–1065
Thireou T, Reczko M (2007) Bidirectional long short-term memory networks for predicting the subcellular localization of eukaryotic proteins. IEEE/ACM Trans Comput Biol Bioinformatics 4(3):441–446
Vargas F, de Góes F, Carvalho I, Benevenuto F, Pardo T (2021) Contextual lexicon-based approach for hate speech and offensive language detection. arXiv:2104.12265
Zampieri M, Malmasi S, Nakov P, Rosenthal S, Farra N, Kumar R (2019) SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). arXiv:1903.08983
Zhang Y, Wallace B (2015) A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv:1510.03820
Acknowledgment
This work was supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq – Brazil), research grants 432997/2018-0, 310841/2019-4, and 440074/2020-7, and by Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ – Brazil), research grants 210.364/2018 and 203.111/2018.
Cite this article
Soto, C.P., Nunes, G.M.S., Gomes, J.G.R.C. et al. Application-specific word embeddings for hate and offensive language detection. Multimed Tools Appl 81, 27111–27136 (2022). https://doi.org/10.1007/s11042-021-11880-2