Computer Science > Computer Vision and Pattern Recognition

arXiv:2408.06798 (cs)

[Submitted on 13 Aug 2024]

Title: Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning

Authors: Shibo Jie, Yehui Tang, Jianyuan Guo, Zhi-Hong Deng, Kai Han, Yunhe Wang

Abstract: Token compression expedites the training and inference of Vision Transformers (ViTs) by reducing the number of redundant tokens, e.g., by pruning inattentive tokens or merging similar tokens. However, when applied to downstream tasks, these approaches suffer a significant performance drop when the compression degrees are mismatched between the training and inference stages, which limits the application of token compression to off-the-shelf trained models. In this paper, we propose a model arithmetic framework to decouple the compression degrees between the two stages. In advance, we perform a fast parameter-efficient self-distillation stage on the pre-trained models to obtain a small plugin, called Token Compensator (ToCom), which describes the gap between models across different compression degrees. During inference, ToCom can be directly inserted into any downstream off-the-shelf model with any mismatched training and inference compression degrees to obtain universal performance improvements without further training. Experiments on over 20 downstream tasks demonstrate the effectiveness of our framework. On CIFAR100, fine-grained visual classification, and VTAB-1k, ToCom yields maximum improvements of 2.3%, 1.5%, and 2.0% in the average performance of DeiT-B, respectively. Code: this https URL
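
The "model arithmetic" idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes the gap between two compression degrees can be represented as a plain parameter offset, and it computes that offset by subtraction, whereas the abstract obtains the plugin through parameter-efficient self-distillation. The names `Params`, `compensator`, and `apply_compensator` are hypothetical.

```python
from typing import Dict, List

# Toy parameter container: layer name -> flat list of weights (assumption).
Params = Dict[str, List[float]]


def compensator(pre_src: Params, pre_tgt: Params) -> Params:
    """Offset between two versions of the same pre-trained model that differ
    only in compression degree. Hypothetical stand-in for the learned plugin."""
    return {
        name: [t - s for s, t in zip(pre_src[name], pre_tgt[name])]
        for name in pre_src
    }


def apply_compensator(downstream: Params, delta: Params) -> Params:
    """Add the offset to a downstream model's weights; no training involved.
    Layers missing from the offset (e.g. a task-specific head) are unchanged."""
    out: Params = {}
    for name, weights in downstream.items():
        offset = delta.get(name, [0.0] * len(weights))
        out[name] = [w + d for w, d in zip(weights, offset)]
    return out


# Pre-trained weights adapted to the source (training-time) and target
# (inference-time) compression degrees; the numbers are made up.
pre_src = {"w": [1.0, 2.0]}
pre_tgt = {"w": [1.5, 1.0]}
delta = compensator(pre_src, pre_tgt)

# A downstream model fine-tuned at the source degree, with its own head.
downstream = {"w": [3.0, 4.0], "head": [0.25]}
compensated = apply_compensator(downstream, delta)
```

The point of the sketch is that the offset is computed once from the pre-trained model and then reused across downstream models, which mirrors the abstract's claim that the plugin is inserted at inference without further training.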

Comments:	Accepted to ECCV 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2408.06798 [cs.CV]
	(or arXiv:2408.06798v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.06798

Submission history

From: Shibo Jie
[v1] Tue, 13 Aug 2024 10:36:43 UTC (1,071 KB)
