Computer Science > Computation and Language

arXiv:2302.07459 (cs)

[Submitted on 15 Feb 2023 (v1), last revised 18 Feb 2023 (this version, v2)]

Title: The Capacity for Moral Self-Correction in Large Language Models

Abstract: We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2302.07459 [cs.CL]
	(or arXiv:2302.07459v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2302.07459

Submission history

From: Nicholas Schiefer
[v1] Wed, 15 Feb 2023 04:25:40 UTC (430 KB)
[v2] Sat, 18 Feb 2023 21:30:27 UTC (431 KB)
