Abstract
The emergence of tools based on large language models (LLMs), such as OpenAI’s ChatGPT and Google’s Gemini, has garnered immense public attention owing to their advanced natural language generation capabilities. These remarkably natural-sounding tools can be highly useful for a wide range of tasks, but they also tend to produce false, erroneous or misleading content, commonly referred to as hallucinations. Moreover, LLMs can be misused to generate convincing, yet false, content and profiles on a large scale, posing a substantial societal challenge by potentially deceiving users and spreading inaccurate information. This makes fact-checking increasingly important. Despite their issues with factual accuracy, LLMs have shown proficiency in various subtasks that support fact-checking, which is essential for ensuring factually accurate responses. In light of these concerns, we explore issues related to factuality in LLMs and their impact on fact-checking. We identify key challenges, imminent threats and possible solutions related to these factuality issues, and we examine existing approaches and potential prospects for fact-checking. By analysing the factuality constraints within LLMs and their impact on fact-checking, we aim to contribute to a path towards maintaining accuracy at a time when generative artificial intelligence and misinformation converge.
Acknowledgements
M.C. is supported by the Institute for Basic Science (grant number IBS-R029-C2) and the National Research Foundation of Korea (grant number RS-2022-00165347). T.C. acknowledges the financial support of Wipro AI. I.A. is supported in part by the European Union (ERC, ExplainYourself, grant number 101077481). G.L.C. is supported by the National Science Foundation (grant numbers 2239194 and 2229885). E.F. and F.M. are partly supported by DARPA (award number HR001121C0169). F.M. is also partly supported by the Knight Foundation and Craig Newmark Philanthropies. G.Z.’s fact-checking project receives funding from the European Union through multiple grants and is part of Meta’s 3PFC Program. H.J. is partially supported by US DARPA SemaFor programme number HR001120C0123. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funders.
Author information
Authors and Affiliations
Contributions
I.A., T.B., M.C., T.C., G.L.C., D.C., R.D., E.F., S.H., A.H., E.H., H.J., F.M., R.M., P.N., D.S., S.S. and G.Z. contributed to conceptualizing, preparing and finalizing the manuscript. The author list is arranged alphabetically by surname. T.C. and S.S. led the writing of the initial draft of the manuscript. T.C. coordinated the entire project.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Mario Giulianelli and Hai-Tao Zheng for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Augenstein, I., Baldwin, T., Cha, M. et al. Factuality challenges in the era of large language models and opportunities for fact-checking. Nat Mach Intell 6, 852–863 (2024). https://doi.org/10.1038/s42256-024-00881-z