Evaluation of Text Cluster Naming with Generative Large Language Models
Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 376–392
Pub. online: 26 August 2024
Type: Data Science In Action
Open Access
Received: 30 November 2023
Accepted: 20 July 2024
Published: 26 August 2024
Abstract
Text clustering can streamline many labor-intensive tasks, but it creates a new challenge: efficiently labeling and interpreting the clusters. Generative large language models (LLMs) are a promising option for automating the naming of text clusters, which could significantly accelerate workflows, especially in domains with large datasets and esoteric language. In this study, we assessed the ability of GPT-3.5-turbo to generate names for clusters of texts and compared these to human-generated cluster names. We clustered two benchmark datasets, each from a specialized domain: research abstracts and clinical patient notes. We generated names for each cluster using four prompting strategies (different ways of including information about the cluster in the prompt sent to the LLM). For both datasets, the best prompting strategy beat the manual approach across all quality domains, although name quality varied by prompting strategy and dataset. We conclude that practitioners should consider automated cluster naming to avoid bottlenecks, or when the scale of the effort is large enough to take advantage of the cost savings offered by automation, as detailed in our supplemental blueprint for using LLM cluster naming. To get the best performance, however, it is vital to run a small pilot comparing several prompting strategies to identify which one performs best on each project's unique data.
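To make the workflow described above concrete, the sketch below shows one way such a pipeline could be assembled in Python. Only the use of GPT-3.5-turbo comes from the study; the embedding model, clustering algorithm, prompt wording, and the name_clusters helper are illustrative assumptions, not the paper's implementation or its four prompting strategies.

    # Minimal sketch of an LLM cluster-naming pipeline (illustrative assumptions:
    # sentence-transformers embeddings, k-means clustering, the OpenAI Python
    # client, and a simple "show a few member texts" prompt).
    from openai import OpenAI
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    def name_clusters(texts, n_clusters=10, samples_per_cluster=5):
        # Embed the documents and group them into clusters.
        embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
        labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(embeddings)

        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        names = {}
        for k in range(n_clusters):
            members = [t for t, lab in zip(texts, labels) if lab == k]
            sample = "\n---\n".join(members[:samples_per_cluster])
            # One possible prompting strategy: show a few member texts and ask
            # for a short, descriptive cluster name.
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{
                    "role": "user",
                    "content": (
                        "The following texts belong to one cluster. "
                        "Propose a short, descriptive name for the cluster.\n\n" + sample
                    ),
                }],
            )
            names[k] = response.choices[0].message.content.strip()
        return names

In practice, as the abstract notes, several prompt variants (for example, keyword lists versus sampled texts) would be compared on a small pilot before committing to one for the full dataset.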