Identifying Citizen-Related Issues from Social Media Using LLM-Based Data Augmentation

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14663))

Included in the following conference series:

International Conference on Advanced Information Systems Engineering

Abstract

Social media platforms, such as Twitter, offer an accessible way for people to share information and perspectives on a wide range of topics. Such citizen discourse can be a valuable source of information, offering policymakers and researchers insights into public sentiment, needs, and suggestions, and guiding more informed and responsive planning and policy decisions. In this paper, we propose a novel approach that uses Large Language Models (LLMs) for data augmentation and multi-class classification to extract domain-specific data from tweets and identify issues raised by citizens, thus providing policymakers and social science researchers with valuable data to formulate effective plans and policies for improving services. The approach first collects data from Twitter using specific keywords and manually labels a subset of the acquired data. We then introduce a new data augmentation strategy that employs an LLM and leverages the initial human-labelled data to enhance text diversity and address imbalances in the dataset. Finally, we use the manually labelled and augmented data to fine-tune different LLMs to classify texts across multiple topics. We test our approach on the identification of issues related to the cycling domain as a case study, detecting tweets across eleven categories associated with infrastructure, safety, and accidents. By fine-tuning BERT-based models and experimenting with zero- and few-shot prompts with GPT for tweet classification, we achieved an accuracy of up to 90.9%.
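
As a rough illustration of the augmentation step described above, the sketch below counts how many synthetic tweets each class would need to match the largest class and assembles a prompt from the human-labelled seeds. It is not the authors' implementation: the function names, the prompt wording, the example tweets, and the category names are all assumptions made for illustration, and the call to an actual LLM is left out.

```python
from collections import Counter


def augmentation_targets(labels):
    """Count how many synthetic tweets each class needs to reach the largest class size."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest - count for label, count in counts.items()}


def build_augmentation_prompt(label, seed_tweets, n_new):
    """Assemble a hypothetical prompt asking an LLM for new tweets like the labelled seeds."""
    seeds = "\n".join(f"- {tweet}" for tweet in seed_tweets)
    return (
        f"These tweets were manually labelled as '{label}':\n"
        f"{seeds}\n"
        f"Write {n_new} new tweets on the same issue, varying wording and tone."
    )


# Toy labelled subset; texts and category names are illustrative assumptions.
labelled = [
    ("Bike lane on Main St is blocked by parked cars again", "infrastructure"),
    ("No cycle parking anywhere near the station", "infrastructure"),
    ("Potholes all along the canal cycle path", "infrastructure"),
    ("Driver passed me with inches to spare this morning", "safety"),
]

targets = augmentation_targets([label for _, label in labelled])

# Only under-represented classes get an augmentation prompt.
prompts = {
    label: build_augmentation_prompt(
        label, [text for text, lab in labelled if lab == label], n_new
    )
    for label, n_new in targets.items()
    if n_new > 0
}
```

Balancing every class up to the size of the largest one is only one possible policy; the abstract states that augmentation addresses dataset imbalance but does not specify the target distribution.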

Supported by Dublin City University (DCU).

Acknowledgement

This work was supported by funding from the European Consortium of Innovative Universities Bicizen project. This work was also conducted with the financial support of the Science Foundation Ireland Centre for Research Training in Artificial Intelligence under Grant No. 18/CRT/6223.

Author information

Authors and Affiliations

Dublin City University, Dublin, Ireland
Vitor Gaboardi dos Santos, Guto Leoni Santos, Theo Lynn & Boualem Benatallah

Authors

Vitor Gaboardi dos Santos
Guto Leoni Santos
Theo Lynn
Boualem Benatallah

Corresponding author

Correspondence to Vitor Gaboardi dos Santos.

Editor information

Editors and Affiliations

University of Twente, Enschede, The Netherlands
Giancarlo Guizzardi
State University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
Flavia Santoro
University of Essex, Colchester, UK
Haralambos Mouratidis
University of Haifa, Haifa, Israel
Pnina Soffer

About this paper

Cite this paper

dos Santos, V.G., Santos, G.L., Lynn, T., Benatallah, B. (2024). Identifying Citizen-Related Issues from Social Media Using LLM-Based Data Augmentation. In: Guizzardi, G., Santoro, F., Mouratidis, H., Soffer, P. (eds) Advanced Information Systems Engineering. CAiSE 2024. Lecture Notes in Computer Science, vol 14663. Springer, Cham. https://doi.org/10.1007/978-3-031-61057-8_31

DOI: https://doi.org/10.1007/978-3-031-61057-8_31
Published: 03 June 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-61056-1
Online ISBN: 978-3-031-61057-8
eBook Packages: Computer Science (R0)
