DOI: 10.1145/3639233.3639344 · NLPIR Conference Proceedings
Research article · Open access

Fine-Tuning BERT on Twitter and Reddit Data in Luganda and English

Published: 05 March 2024

Abstract

Deep learning techniques, driven by the Transformer architecture and models such as BERT, have found broad utility. While sentiment analysis is well established for high-resource languages, it remains largely unexplored for low-resource ones. Our focus is Luganda, a widely spoken Ugandan language with over 21 million speakers. We used three social-media datasets to train machine learning models as baselines and fine-tuned BERT for the deep-learning models. Our findings improve sentiment analysis in both Luganda and English, and our data-extraction approach supports the construction of domain-specific datasets. This research advances NLP for low-resource languages and aligns with global deep-learning initiatives.
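The abstract mentions classical machine learning baselines alongside the fine-tuned BERT models. As an illustration only (not the authors' implementation), the baseline setup can be sketched as a bag-of-words classifier over labeled social-media text; the snippet below is a minimal multinomial Naive Bayes in plain Python, and the toy code-mixed Luganda/English examples and label set are invented for demonstration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase whitespace tokenization; a real pipeline for tweets and
    # Reddit comments would also strip URLs, mentions, and punctuation.
    return text.lower().split()

class NaiveBayesSentiment:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        self.total = sum(self.label_counts.values())
        return self

    def predict(self, text):
        best_label, best_score = None, float("-inf")
        for label in self.label_counts:
            # log P(label) + sum over tokens of log P(token | label)
            score = math.log(self.label_counts[label] / self.total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy code-mixed training data, invented purely for illustration.
train_texts = [
    "webale nyo great service",   # "webale nyo" ~ "thank you very much"
    "this clinic was excellent",
    "terrible wait times very bad",
    "kibi nyo awful experience",  # "kibi nyo" ~ "very bad"
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesSentiment().fit(train_texts, train_labels)
print(model.predict("webale excellent service"))  # → positive
```

Such a baseline gives a reference point against which the fine-tuned BERT models' gains can be measured; the deep-learning side would replace the bag-of-words features with contextual embeddings.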



Published In

NLPIR '23: Proceedings of the 2023 7th International Conference on Natural Language Processing and Information Retrieval
December 2023, 336 pages
ISBN: 9798400709227
DOI: 10.1145/3639233
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. BERT
  2. Luganda
  3. Reddit
  4. Twitter
  5. fine-tuning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

NLPIR 2023
