Computer Science > Computation and Language

arXiv:2401.12522 (cs)

[Submitted on 23 Jan 2024 (v1), last revised 25 Jan 2024 (this version, v2)]

Title: BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

Authors: Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao

Abstract: Large language models (LLMs) commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and consequently extended latency. To mitigate this inefficiency, we present Bi-directional Tuning for lossless Acceleration (BiTA), a method that expedites LLMs via streamlined semi-autoregressive generation and draft verification. Inspired by the concept of prompt tuning, we enhance LLMs with a parameter-efficient design called bi-directional tuning, which gives them the capability of semi-autoregressive generation. Employing efficient tree-based decoding, the models perform draft candidate generation and verification in parallel, ensuring outputs identical to their autoregressive counterparts under greedy sampling. BiTA serves as a lightweight plug-in module, seamlessly boosting the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. With the proposed BiTA applied, LLaMA-2-70B-Chat achieves a 2.7$\times$ speedup on the MT-Bench benchmark. Extensive experiments confirm that our method surpasses state-of-the-art acceleration techniques.
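
The "lossless under greedy sampling" claim rests on draft verification: drafted tokens are kept only while they match what the model itself would have produced greedily. The following is a minimal sketch of that general idea, not BiTA's implementation. It omits bi-directional tuning and tree-based decoding, and every name in it (`toy_next_token`, `toy_draft`, `draft_and_verify`) is a hypothetical stand-in; in particular, the toy "model" is an arithmetic rule, and verification is done in a Python loop where a real system would score all draft positions in one parallel forward pass.

```python
from typing import Callable, List, Tuple

Token = int


def toy_next_token(seq: List[Token]) -> Token:
    """Hypothetical stand-in for an LLM's greedy (argmax) next-token choice."""
    return (3 * seq[-1] + len(seq)) % 7


def toy_draft(seq: List[Token], k: int) -> List[Token]:
    """Hypothetical drafter: proposes k tokens, deliberately wrong at some positions."""
    out: List[Token] = []
    cur = list(seq)
    for _ in range(k):
        tok = toy_next_token(cur)
        if len(cur) % 3 == 0:
            tok = (tok + 1) % 7  # inject a drafting error
        out.append(tok)
        cur.append(tok)
    return out


def greedy_decode(next_token: Callable[[List[Token]], Token],
                  prompt: List[Token], n_new: int) -> List[Token]:
    """Plain autoregressive greedy decoding: one token per step."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def draft_and_verify(next_token: Callable[[List[Token]], Token],
                     draft: Callable[[List[Token], int], List[Token]],
                     prompt: List[Token], n_new: int, k: int = 4
                     ) -> Tuple[List[Token], int]:
    """Accept the longest draft prefix that matches greedy output, then add one model token."""
    seq = list(prompt)
    target_len = len(prompt) + n_new
    steps = 0
    while len(seq) < target_len:
        steps += 1
        accepted: List[Token] = []
        for tok in draft(seq, k):
            expected = next_token(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the model's own token
                break
            accepted.append(tok)
        else:
            # Every drafted token matched; the model still contributes one more token.
            accepted.append(next_token(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):target_len], steps


prompt = [1, 2]
slow = greedy_decode(toy_next_token, prompt, 12)
fast, steps = draft_and_verify(toy_next_token, toy_draft, prompt, 12)
print(fast == slow, steps)
```

Because every accepted token is either confirmed or supplied by the greedy rule itself, the output matches plain greedy decoding regardless of draft quality; a poor drafter only reduces how many tokens are accepted per step.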

Comments:	An appendix has been included. Source code at this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2401.12522 [cs.CL]
	(or arXiv:2401.12522v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2401.12522

Submission history

From: Feng Lin
[v1] Tue, 23 Jan 2024 06:36:49 UTC (2,655 KB)
[v2] Thu, 25 Jan 2024 14:02:03 UTC (2,934 KB)
