Computer Science > Computation and Language

arXiv:2304.13861 (cs)

[Submitted on 26 Apr 2023 (v1), last revised 5 Feb 2024 (this version, v2)]

Title: The Parrot Dilemma: Human-Labeled vs. LLM-augmented Data in Classification Tasks

Authors: Anders Giovanni Møller, Jacob Aarup Dalsgaard, Arianna Pera, Luca Maria Aiello

Abstract: In the realm of Computational Social Science (CSS), practitioners often navigate complex, low-resource domains and face the costly and time-intensive challenges of acquiring and annotating data. We aim to establish a set of guidelines to address such challenges, comparing the use of human-labeled data with synthetically generated data from GPT-4 and Llama-2 in ten distinct CSS classification tasks of varying complexity. Additionally, we examine the impact of training data sizes on performance. Our findings reveal that models trained on human-labeled data consistently exhibit superior or comparable performance compared to their synthetically augmented counterparts. Nevertheless, synthetic augmentation proves beneficial, particularly in improving performance on rare classes within multi-class tasks. Furthermore, we leverage GPT-4 and Llama-2 for zero-shot classification and find that, while they generally display strong performance, they often fall short when compared to specialized classifiers trained on moderately sized training sets.

Comments:	Accepted at EACL 2024. 14 pages, 4 figures, 2 tables
Subjects:	Computation and Language (cs.CL); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
Cite as:	arXiv:2304.13861 [cs.CL]
	(or arXiv:2304.13861v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2304.13861

Submission history

From: Luca Maria Aiello
[v1] Wed, 26 Apr 2023 23:09:02 UTC (12,428 KB)
[v2] Mon, 5 Feb 2024 14:41:35 UTC (2,048 KB)
