Computer Science > Machine Learning

arXiv:2405.14852 (cs)

[Submitted on 23 May 2024 (v1), last revised 30 May 2024 (this version, v2)]

Title: PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression

Authors: Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik

Abstract: There has been significant interest in "extreme" compression of large language models (LLMs), i.e., to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices. Existing work has focused on improved one-shot quantization techniques and weight representations; yet, purely post-training approaches are reaching diminishing returns in terms of the accuracy-vs-bit-width trade-off. State-of-the-art quantization methods such as QuIP# and AQLM include fine-tuning (part of) the compressed parameters over a limited amount of calibration data; however, such fine-tuning techniques over compressed weights often make exclusive use of straight-through estimators (STE), whose performance is not well understood in this setting. In this work, we question the use of STE for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic study of quantization-aware fine-tuning strategies for LLMs. We propose PV-Tuning, a representation-agnostic framework that generalizes and improves upon existing fine-tuning strategies, and provides convergence guarantees in restricted cases. On the practical side, when used for 1-2 bit vector quantization, PV-Tuning outperforms prior techniques for highly performant models such as Llama and Mistral. Using PV-Tuning, we achieve the first Pareto-optimal quantization for Llama 2 family models at 2 bits per parameter.

Comments:	Preprint
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2405.14852 [cs.LG]
	(or arXiv:2405.14852v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2405.14852

Submission history

From: Denis Mazur
[v1] Thu, 23 May 2024 17:57:04 UTC (498 KB)
[v2] Thu, 30 May 2024 15:01:49 UTC (500 KB)
