Identifying Citizen-Related Issues from Social Media Using LLM-Based Data Augmentation

  • Conference paper
Advanced Information Systems Engineering (CAiSE 2024)

Abstract

Social media platforms, such as Twitter, offer an accessible way for people to share information and perspectives on a wide range of topics. Such citizen discourse can be a valuable source of information, offering policymakers and researchers insights into public sentiment, needs, and suggestions, and guiding more informed and responsive planning and policy decisions. In this paper, we propose a novel approach that uses Large Language Models (LLMs) for data augmentation and multi-class classification to extract domain-specific data from tweets and identify issues raised by citizens, thus providing policymakers and social science researchers with valuable data for formulating effective plans and policies to improve services. The approach first collects data from Twitter using specific keywords and manually labels a subset of the acquired data. We then introduce a new data augmentation strategy in which an LLM leverages the initial human-labelled data to enhance text diversity and address imbalances in the dataset. Finally, we use the manually labelled and augmented data to fine-tune different LLMs to classify texts across multiple topics. We test our approach on the identification of issues in the cycling domain as a case study, detecting tweets across eleven categories associated with infrastructure, safety, and accidents. By fine-tuning BERT-based models and experimenting with zero- and few-shot prompts with GPT for tweet classification, we achieved an accuracy of up to 90.9%.
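The augmentation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the label names, the prompt wording, the helper names (`augmentation_plan`, `build_prompt`, `augment`), and the `generate` callable standing in for an LLM are all assumptions made for illustration. The sketch only shows one plausible way to use human-labelled examples as few-shot prompts and to generate enough synthetic texts to bring every class up to the size of the largest one.

```python
import random
from collections import Counter, defaultdict


def augmentation_plan(labels):
    """Return, per class, how many synthetic texts are needed to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


def build_prompt(label, examples, k=3, seed=0):
    """Build a few-shot prompt from up to k human-labelled examples of one class.

    The prompt wording is hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)
    shots = rng.sample(examples, min(k, len(examples)))
    lines = [f"Write one new tweet about the cycling issue '{label}'.", "Examples:"]
    lines += [f"- {text}" for text in shots]
    return "\n".join(lines)


def augment(dataset, generate, k=3):
    """Return the dataset plus synthetic (text, label) pairs that balance the classes.

    dataset:  list of (text, label) pairs labelled by humans.
    generate: callable mapping a prompt string to generated text
              (a hypothetical wrapper around whichever LLM is used).
    """
    by_label = defaultdict(list)
    for text, label in dataset:
        by_label[label].append(text)
    plan = augmentation_plan([label for _, label in dataset])
    synthetic = []
    for label, deficit in plan.items():
        for i in range(deficit):
            # Vary the seed so each request sees a different sample of examples.
            prompt = build_prompt(label, by_label[label], k=k, seed=i)
            synthetic.append((generate(prompt), label))
    return dataset + synthetic


# Toy data with made-up labels; a stub replaces the real LLM call.
toy = [
    ("Van parked in the bike lane again", "safety"),
    ("Close pass by a bus this morning", "safety"),
    ("No lights at the junction crossing", "safety"),
    ("Cycle track full of potholes", "infrastructure"),
]
balanced = augment(toy, generate=lambda prompt: "synthetic tweet")
```

With the stub in place, `balanced` holds the four original pairs plus two synthetic "infrastructure" pairs. In practice the `generate` callable would wrap a real model, and the generated texts would need checking for quality and duplication before fine-tuning.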

Supported by Dublin City University (DCU).

Acknowledgement

This work was supported by funding from the European Consortium of Innovative Universities Bicizen project. This work was also conducted with the financial support of the Science Foundation Ireland Centre for Research Training in Artificial Intelligence under Grant No. 18/CRT/6223.

Author information

Corresponding author

Correspondence to Vitor Gaboardi dos Santos.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

dos Santos, V.G., Santos, G.L., Lynn, T., Benatallah, B. (2024). Identifying Citizen-Related Issues from Social Media Using LLM-Based Data Augmentation. In: Guizzardi, G., Santoro, F., Mouratidis, H., Soffer, P. (eds) Advanced Information Systems Engineering. CAiSE 2024. Lecture Notes in Computer Science, vol 14663. Springer, Cham. https://doi.org/10.1007/978-3-031-61057-8_31

  • DOI: https://doi.org/10.1007/978-3-031-61057-8_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-61056-1

  • Online ISBN: 978-3-031-61057-8

  • eBook Packages: Computer Science, Computer Science (R0)
