DOI: 10.1145/3639233.3639344 · NLPIR Conference Proceedings
Research article · Open access

Fine-Tuning BERT on Twitter and Reddit Data in Luganda and English

Published: 05 March 2024

Abstract

Deep learning techniques, driven by the Transformer architecture and models such as BERT, have found broad utility. While sentiment analysis is well established for high-resource languages, it remains largely unexplored for low-resource ones. Our focus is Luganda, a widely spoken Ugandan language with over 21 million speakers. We used three social-media datasets to train machine learning models as baselines and fine-tuned BERT for the deep-learning models. Our findings improve sentiment analysis in both Luganda and English, and our data-extraction approach supports the construction of domain-specific datasets. This research advances NLP for low-resource languages and aligns with global deep-learning initiatives.
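The abstract mentions classical machine learning baselines alongside the fine-tuned BERT models. As an illustration only (not the authors' implementation), the baseline setup can be sketched as a bag-of-words classifier over labeled social-media text; the snippet below is a minimal multinomial Naive Bayes in plain Python, and the toy code-mixed Luganda/English examples and label set are invented for demonstration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase whitespace tokenization; a real pipeline for tweets and
    # Reddit comments would also strip URLs, mentions, and punctuation.
    return text.lower().split()

class NaiveBayesSentiment:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        self.total = sum(self.label_counts.values())
        return self

    def predict(self, text):
        best_label, best_score = None, float("-inf")
        for label in self.label_counts:
            # log P(label) + sum over tokens of log P(token | label)
            score = math.log(self.label_counts[label] / self.total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy code-mixed training data, invented purely for illustration.
train_texts = [
    "webale nyo great service",   # "webale nyo" ~ "thank you very much"
    "this clinic was excellent",
    "terrible wait times very bad",
    "kibi nyo awful experience",  # "kibi nyo" ~ "very bad"
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesSentiment().fit(train_texts, train_labels)
print(model.predict("webale excellent service"))  # → positive
```

Such a baseline gives a reference point against which the fine-tuned BERT models' gains can be measured; the deep-learning side would replace the bag-of-words features with contextual embeddings.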



Published In

NLPIR '23: Proceedings of the 2023 7th International Conference on Natural Language Processing and Information Retrieval
December 2023, 336 pages
ISBN: 9798400709227
DOI: 10.1145/3639233
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. BERT
  2. Luganda
  3. Reddit
  4. Twitter
  5. fine-tuning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

NLPIR 2023
