Computer Science > Computation and Language

arXiv:2306.13421 (cs)

[Submitted on 23 Jun 2023 (v1), last revised 21 Jul 2024 (this version, v2)]

Title: Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval

Abstract: Retrieval-augmented language models (LMs) have received much attention recently. However, typically the retriever is not trained jointly as a native component of the LM, but added post-hoc to an already-pretrained LM, which limits the ability of the LM and the retriever to adapt to one another. In this work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch and apply it to the task of modeling long texts. Given a recently generated text chunk in a long document, the LM computes query representations, which are then used to retrieve earlier chunks in the document, located potentially tens of thousands of tokens before. Information from retrieved chunks is fused into the LM representations to predict the next target chunk. We train the retriever component with a semantic objective, where the goal is to retrieve chunks that increase the probability of the next chunk, according to a reference LM. We evaluate RPT on four long-range language modeling tasks, spanning books, code, and mathematical writing, and demonstrate that RPT improves retrieval quality and subsequently perplexity across the board compared to strong baselines.
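The semantic objective described in the abstract can be illustrated with a small sketch: earlier chunks are ranked by how much each one raises a reference LM's log-probability of the next chunk. Everything named here is hypothetical and not taken from the paper: `split_into_chunks`, `rank_earlier_chunks`, and `toy_ref_logprob` are illustrative names, the toy scorer is a smoothed unigram count standing in for a real reference LM, and prepending a candidate chunk to the current chunk is only one assumed way of conditioning on it.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Chunk = Tuple[int, ...]


def split_into_chunks(tokens: Sequence[int], chunk_len: int) -> List[Chunk]:
    """Split a token sequence into consecutive chunks of at most chunk_len tokens."""
    return [tuple(tokens[i:i + chunk_len]) for i in range(0, len(tokens), chunk_len)]


def toy_ref_logprob(target: Chunk, context: Chunk, vocab_size: int = 100) -> float:
    """Hypothetical stand-in for a reference LM (assumption, not the paper's model).

    Scores the target with an add-one smoothed unigram distribution
    estimated from the context tokens.
    """
    counts = Counter(context)
    total = len(context)
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in target)


def rank_earlier_chunks(
    chunks: List[Chunk],
    query_idx: int,
    ref_logprob: Callable[[Chunk, Chunk], float],
) -> List[int]:
    """Rank chunks before query_idx by their gain in reference log-probability.

    The gain of candidate j is
    log p(next | candidate_j + current) - log p(next | current),
    where "next" is the chunk that follows the current one.
    """
    current = chunks[query_idx]
    target = chunks[query_idx + 1]
    base = ref_logprob(target, current)
    gains = [
        (ref_logprob(target, chunks[j] + current) - base, j)
        for j in range(query_idx)
    ]
    # Highest gain first; ties broken by earliest position.
    return [j for _, j in sorted(gains, key=lambda g: (-g[0], g[1]))]


if __name__ == "__main__":
    doc = [1, 2, 3, 4, 9, 9, 9, 9, 5, 6, 7, 8, 1, 2, 3, 4]
    chunks = split_into_chunks(doc, chunk_len=4)
    print(rank_earlier_chunks(chunks, query_idx=2, ref_logprob=toy_ref_logprob))
```

In this toy document the first chunk shares its tokens with the target chunk, so it ranks ahead of the unrelated second chunk. In a trained system, a ranking of this kind would serve as the supervision signal for the retriever rather than being computed at inference time.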

Comments:	Accepted to TACL 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2306.13421 [cs.CL]
	(or arXiv:2306.13421v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.13421

Submission history

From: Ohad Rubin
[v1] Fri, 23 Jun 2023 10:18:02 UTC (7,376 KB)
[v2] Sun, 21 Jul 2024 07:35:23 UTC (7,395 KB)
