DOI: 10.1145/3648188.3675153
Research article

Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research

Published: 10 September 2024

Abstract

Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics. However, access to these datasets is often restricted due to costs and platform regulations. As a result, acquiring datasets that span multiple platforms, which are crucial for a comprehensive understanding of the digital ecosystem, is particularly challenging. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms, aiming to match the quality of real datasets. We employ ChatGPT to generate synthetic data from a real dataset consisting of posts from three different social media platforms. We assess the lexical and semantic properties of the synthetic data and compare them with those of the real data. Our empirical findings suggest that using large language models to generate synthetic multi-platform social media data is promising. However, further enhancements are necessary to improve the fidelity of the outputs.
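To make the comparison step concrete, the following is a minimal sketch of how the lexical side of a real-versus-synthetic comparison could look. The abstract does not state which metrics the paper uses, so the two measures here (Jaccard overlap of vocabularies and cosine similarity of term-frequency vectors) are assumptions chosen for illustration, as are the function names and the toy posts. A semantic comparison would typically rely on sentence embeddings from a pretrained model, which is not shown.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: keeps hashtags and mentions as single tokens.
TOKEN_RE = re.compile(r"[#@]?\w+")


def tokenize(post):
    """Lowercase a post and split it into word, hashtag, and mention tokens."""
    return TOKEN_RE.findall(post.lower())


def term_counts(posts):
    """Count how often each token occurs across a list of posts."""
    counts = Counter()
    for post in posts:
        counts.update(tokenize(post))
    return counts


def jaccard(a, b):
    """Share of distinct tokens that the two corpora have in common."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy placeholder posts, invented for this sketch (not from the paper's data).
real_posts = ["Vote early in the #midterms", "Polls open at 7am #midterms"]
synthetic_posts = ["Remember to vote in the #midterms", "Polls close at 8pm"]

real = term_counts(real_posts)
synthetic = term_counts(synthetic_posts)
print("vocabulary Jaccard:", round(jaccard(real, synthetic), 3))
print("term-frequency cosine:", round(cosine(real, synthetic), 3))
```

Both scores lie between 0 and 1, with higher values indicating that the synthetic posts reuse more of the real corpus's vocabulary. In a multi-platform setting, the same comparison could be run separately per platform.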

    Published In

    HT '24: Proceedings of the 35th ACM Conference on Hypertext and Social Media
    September 2024
    415 pages
    ISBN: 9798400705953
    DOI: 10.1145/3648188
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. LLMs
    2. Social Media Research
    3. Synthetic Data

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    HT '24

    Acceptance Rates

    Overall Acceptance Rate 378 of 1,158 submissions, 33%
