OMCD: Offensive Moroccan Comments Dataset

  • Original Paper
Language Resources and Evaluation

Abstract

Offensive content, such as verbal attacks, demeaning comments, or hate speech, has become widespread on social media. Automatic detection of this content is considered an important and challenging task. Although several approaches have been proposed to address this challenge for high-resource languages, the detection of offensive content in Dialectal Arabic (DA) remains under-explored. Recently, the detection of offensive language in DA has gained increasing interest among researchers in Natural Language Processing (NLP). However, only a limited number of annotated datasets have been introduced for single or multiple coarse-grained dialects. In this paper, we introduce the Offensive Moroccan Comments Dataset (OMCD), the first dataset for offensive language detection in the Moroccan dialect. First, we present the data collection steps, the statistical analysis, and the annotation guidelines of the introduced dataset. Then, we evaluate several state-of-the-art Machine Learning (ML) and Deep Learning (DL) models on OMCD. Finally, we highlight the impact of emojis on the evaluated models for offensive language detection.

Notes

  1. https://socialblade.com/.

  2. The dataset is publicly available at: https://github.com/kabilessefar/OMCD-Offensive-Moroccan-Comments-Dataset.

Author information

Corresponding author

Correspondence to Kabil Essefar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Essefar, K., Ait Baha, H., El Mahdaouy, A. et al. OMCD: Offensive Moroccan Comments Dataset. Lang Resources & Evaluation 57, 1745–1765 (2023). https://doi.org/10.1007/s10579-023-09663-2

  • DOI: https://doi.org/10.1007/s10579-023-09663-2
