Computer Science > Computation and Language

arXiv:2410.04335 (cs)

[Submitted on 6 Oct 2024]

Title: ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model

Authors: Shuhao Gu, Mengdi Zhao, Bowen Zhang, Liangdong Wang, Jijie Li, Guang Liu

Abstract: The tokenizer is an essential component of large language models (LLMs), and a tokenizer with a high compression rate can improve the model's representation and processing efficiency. However, a tokenizer cannot ensure a high compression rate in all scenarios, and an increase in the average input and output lengths increases the training and inference costs of the model. Therefore, it is crucial to find ways to improve the model's efficiency at minimal cost while maintaining the model's performance. In this work, we propose a method to improve model representation and processing efficiency by replacing the tokenizers of LLMs. We replace the model's input and output layers, reinitialize their parameters from the parameters of the original model, and train only these parameters while keeping all other parameters fixed. We conducted experiments on different LLMs, and the results show that our method can maintain the performance of the model after replacing the tokenizer, while significantly improving the decoding speed for long texts.
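The reinitialization step described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the initialization rule, so the one used here is an assumption: a new token that already exists in the old vocabulary copies its old embedding row, and any other new token takes the mean of the old embeddings of the pieces the old tokenizer splits it into. The names `init_new_embeddings`, `trainable_mask`, and the `embed`/`lm_head` parameter prefixes are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence


def init_new_embeddings(
    old_emb: List[List[float]],
    old_vocab: Dict[str, int],
    new_vocab: Sequence[str],
    old_tokenize: Callable[[str], List[str]],
) -> List[List[float]]:
    """Build an embedding table for new_vocab from the old table.

    Assumed rule (not confirmed by the abstract): copy the row for shared
    tokens, otherwise average the rows of the old-tokenizer pieces.
    """
    dim = len(old_emb[0])
    new_emb: List[List[float]] = []
    for token in new_vocab:
        if token in old_vocab:
            # Shared token: reuse the original parameters unchanged.
            new_emb.append(list(old_emb[old_vocab[token]]))
            continue
        # New token: split it with the old tokenizer and average the pieces.
        ids = [old_vocab[p] for p in old_tokenize(token) if p in old_vocab]
        if not ids:
            # No known piece at all: fall back to a zero vector.
            new_emb.append([0.0] * dim)
        else:
            new_emb.append(
                [sum(old_emb[i][d] for i in ids) / len(ids) for d in range(dim)]
            )
    return new_emb


def trainable_mask(
    param_names: Sequence[str],
    trainable_prefixes: tuple = ("embed", "lm_head"),
) -> Dict[str, bool]:
    """Mark only the input and output layers as trainable; freeze the rest."""
    return {name: name.startswith(trainable_prefixes) for name in param_names}
```

With a toy character-level old tokenizer (`list`), an old vocabulary `{"a": 0, "b": 1}`, and a new vocabulary `["a", "ab"]`, the token `"a"` keeps its old row and `"ab"` receives the average of the rows for `"a"` and `"b"`. The same table-building rule would apply to the output layer, and `trainable_mask` indicates which parameters a training loop would update under the freeze-everything-else setup the abstract describes.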

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2410.04335 [cs.CL]
	(or arXiv:2410.04335v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2410.04335

Submission history

From: Shuhao Gu
[v1] Sun, 6 Oct 2024 03:01:07 UTC (427 KB)
