Computer Science > Computation and Language

arXiv:2310.18168 (cs)

[Submitted on 27 Oct 2023 (v1), last revised 6 Feb 2024 (this version, v5)]

Title: Personas as a Way to Model Truthfulness in Language Models

Authors: Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, He He

Abstract: Large language models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. While unintuitive from a classic view of LMs, recent work has shown that the truth value of a statement can be elicited from the model's representations. This paper presents an explanation for why LMs appear to know the truth despite not being trained with truth labels. We hypothesize that the pretraining data is generated by groups of (un)truthful agents whose outputs share common features, and that these agents form a (un)truthful persona. By training on this data, LMs can infer and represent the persona in their activation space. This allows the model to separate truth from falsehoods and controls the truthfulness of its generation. We show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that structures of the pretraining data are crucial for the model to infer the truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2310.18168 [cs.CL]
	(or arXiv:2310.18168v5 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.18168

Submission history

From: Javier Rando
[v1] Fri, 27 Oct 2023 14:27:43 UTC (1,829 KB)
[v2] Mon, 30 Oct 2023 13:20:39 UTC (1,829 KB)
[v3] Tue, 21 Nov 2023 09:19:03 UTC (2,001 KB)
[v4] Thu, 14 Dec 2023 15:14:13 UTC (2,001 KB)
[v5] Tue, 6 Feb 2024 09:04:04 UTC (2,004 KB)
