Visual Tuning

Published: 25 July 2024

Abstract

Fine-tuning visual models has been widely shown to achieve promising performance on many downstream visual tasks. With the rapid development of pre-trained visual foundation models, visual tuning has moved beyond the standard modus operandi of fine-tuning the whole pre-trained model or just the fully connected layer. Instead, recent advances can achieve performance superior to fully tuning the whole set of pre-trained parameters by updating far fewer parameters, enabling edge devices and downstream applications to reuse the increasingly large foundation models deployed on the cloud. With the aim of helping researchers get the full picture and future directions of visual tuning, this survey characterizes a large and thoughtful selection of recent works, providing a systematic and comprehensive overview of existing work and models. Specifically, it provides a detailed background of visual tuning and categorizes recent visual tuning techniques into five groups: fine-tuning, prompt tuning, adapter tuning, parameter tuning, and remapping tuning. Meanwhile, it offers some exciting research directions for prospective pre-training and various interactions in visual tuning.

1 Introduction

Since the wide adoption of Transformer models [72, 105] and the emergence of large foundation models [10, 294], the paradigm of deep learning in vision intelligence has been shifting toward adapting downstream tasks to foundation models. The astonishing performance of the recent Visual ChatGPT [244] is enabled by myriad computation resources in the pre-training process [12] and by human feedback during the tuning process. The pre-trained foundation model (i.e., GPT-3) shows strong capability but entails large storage space, around 800 GB, to store its 175B parameters [12], which makes it expensive to retrain independent model copies for different downstream tasks. Foundation models are expected to continue to scale up, and how to reuse them via parameter-efficient transfer learning (PETL) methods (prompt, prefix, adapter, etc.) has quickly become a research hotspot. In the past two years, taking inspiration from PETL methods in natural language processing (NLP) [136, 219, 294], numerous visual tuning techniques have been proposed for adapting downstream tasks to pre-trained vision or visual-language models.
In the era of increasingly large models, vision models have been scaled up from EfficientNet-based models [177] (480M parameters) to Transformer-based models [266] (2,100M parameters) and even larger scales such as 22B [43] and 562B parameters [50]. For such large models, PETL methods aim at making good reuse of the shared parameter weights (usually interpreted as the knowledge of large models) deployed on the cloud, saving storage overhead and empowering edge devices such as autonomous vehicles, drones, and robots that are constrained in computing and battery resources [273]. This practice is different from the modus operandi of transfer learning that either fully fine-tunes the whole model or just fine-tunes the task head (e.g., the last fully connected layer) [301].
Given the emergence of increasingly large models (i.e., foundation models), we are in a new paradigm of visual tuning that goes beyond tuning the entire model or the task head. How to effectively reuse the knowledge therein with PETL methods, with less memory usage and higher inference speed, is a hot topic in various vision tasks [130, 212, 213]. Starting with a detailed background in Section 2, this article provides an in-depth review of recent tuning advances in the vision domain, categorizing them into five common types and elaborating their current technical state with discussions in Section 3. Last but not least, we provide insights into future research directions that hold significant promise in Section 4, followed by a conclusion. To the best of our knowledge, this is the first comprehensive survey on visual tuning, which is important for helping researchers understand the mechanisms and trends of this practice.

2 Background

In the early days, machine learning methods relied on feature engineering such as SIFT [104], BRIEF [20], and ORB [193] to handle specific tasks, a practice later dominated by the deep learning paradigm [116] since the introduction of ImageNet [45]. Deep learning models [80, 112, 214] pre-trained on ImageNet are able to benefit various downstream vision tasks such as image recognition, object detection, and image segmentation via fine-tuning. Fine-tuning is the second step of typical transfer learning, which makes use of the knowledge acquired from the source domain to facilitate the learning process of the target domain [216, 301]. Given the promise of large-scale pre-trained models, visual tuning techniques beyond fine-tuning have attracted increasing research interest, leading to the visual tuning paradigm illustrated in Figure 1. This section elaborates on the background of visual tuning from five perspectives: theory, definition, model architecture, model pre-training, and model tuning.
Fig. 1. Illustration of visual tuning. A pre-trained foundation model can accumulate knowledge via various pre-training techniques by scaling up in terms of model size, data modalities, tasks, and so on. Given the pre-trained model, the focus of this survey is visual tuning, showing how to effectively reuse the knowledge of pre-trained models while concerning important aspects such as the number of tuned parameters, generalization ability, data efficacy, training memory, and inference memory.

2.1 Theories

In the 1990s, the machine learning community largely ignored neural networks and backpropagation due to concerns about overfitting and the potential for poor local minima. However, in the recent era of deep learning, these concerns have been greatly alleviated via advancements in theories and empirical experiences [116]. In this section, we present the fundamental theories that underpin the current state of visual tuning, exploring these theories from three distinct perspectives as follows.

2.1.1 Biological Perspective.

Just as the original convolutional neural network (CNN) architecture was inspired by the receptive fields in the visual cortex [91], learning models can be inspired or motivated by biological and neuroscience discoveries. In particular, researchers are working on enabling computer vision to have capabilities that are similar to human vision. First, human vision can efficiently process huge amounts of continuous visual streams. Regarding the intrinsic mechanism of this ability, classical biological findings suggest that humans perceive real-world scenes by contextualizing information from local parts (such as small edges) as a whole (i.e., subjective contours), which are respectively handled by cortical areas V1 and V2 [159]. It is also suggested that human vision is embodied and developed in interactive ecological environments [65]. This motivates researchers to work on effective solutions concerning aspects such as accuracy and efficiency. Second, humans are good at generalizing visual understanding to unseen brand-new scenes by reasoning about their physical and geometric properties [114]. This motivates emerging foundation models to be tested via increasingly challenging setups such as zero-shot learning, continual learning, multi-task learning, and so on.

2.1.2 Model Perspective.

Taking inspiration from human vision, the recently emerged paradigm of tuning large vision models aims to effectively reuse the knowledge in the large pre-trained model in a computation- and data-efficient way. Generative pre-trained large language models such as GPT-3 show significant continual performance improvements when the model size is scaled up from 0.1B to 175B parameters [12]. This observation is known as the scaling law: a larger pre-trained model benefits downstream tasks more, suggesting that adapting from a larger knowledge base can lead to better performance on downstream tasks. This scaling law has also been corroborated in recent literature [43, 137]. Sung et al. [212] explained the reduced training memory of PETL techniques from the perspective of backpropagation and further reduced their training memory by skipping the gradient traversal through the frozen backbone, taking a further step in the analysis of existing PETL techniques regarding training and inference memory.

2.1.3 Statistical Perspective.

Machine learning models are restricted by statistical assumptions such as the independent and identically distributed (i.i.d.) assumption, the law of large numbers, and the central limit theorem [44], which respectively lead practitioners to apply regularization techniques, collect large-scale datasets, and normalize the input data. In the era of large models, breaking these statistical boundaries becomes imaginable with encouraging recent progress (surveyed in Section 3), which intrinsically improves models’ generalization ability to out-of-distribution or long-tail data with less training data (from few-shot to zero-shot learning) and fewer tunable parameters. To guide tuning with statistical rules, some works have been proposed based on measurable domain bounds. For instance, Ye et al. [263] proposed the concept of an expansion function, quantifying regularization or bound restrictions as the “variation” between the source and target domains and the “informativeness” of a feature. Liu and Zhang [138] also attempted to measure the domain gap by using the test error. Zhang et al. [286] proposed to use the margin loss to replace the 0–1 loss for domain adaptation. The margin loss is expected to relax the restriction and provide a more informative generalization bound. Nilesh et al. [223] defined task diversity from a statistical perspective, providing generalization upper bounds on sample complexity for multi-task transfer learning.

2.2 Notation and Definition

In order to understand efficient fine-tuning, let’s start by defining domains, tasks, transfer learning, and other notations (Table 1 shows the notations used throughout this article). A joint distribution \(\mathcal {X} \times \mathcal {Y}\) can be expressed as \(P(X,Y)\) (i.e., \(P_{XY}\)), where \(\mathcal {X}\) and \(\mathcal {Y}\) represent its corresponding feature space and label space, respectively. (X and Y represent the observed instance set and its corresponding label set.) Given \(P_{XY}\), we refer \(P(X)\) (i.e., \(P_{X}\)) as the marginal distribution on X, \(P_{Y|X}\) the posterior distribution of Y, and \(P_{X|Y}\) the class-conditional distribution of X given Y.
Table 1. Notations Used in the Article
\(m\): Number of domains
\(P\): Distribution
\(\mathcal {D}\): Domain
\(S\): Source domain
\(T\): Target domain
\(\mathcal {X}\): Feature space
\(\mathcal {Y}\): Label space
\(X\): Instance set
\(Y\): Label set
\(N\): Number of samples in \(X\)
\(a\): Learnable vectors
\(M\): Number of prompts
\(\mathcal {T}\): Tokens of visual inputs
\(Z\): Feature propagated through the network
\(l\): Neural network layer
\(k\): Input size of a convolutional layer
\(d\): Output size of a convolutional layer
\(K\): Kernel size
\(G\): Group size
\(r\): Dimension of low rank
Definition 1 (Domain).
A domain \(\mathcal {D}=\lbrace \mathcal {X},P(X)\rbrace\) is defined by its feature space \(\mathcal {X}\) and a marginal distribution \(P(X)\), where X denotes an instance set defined as \(X = \lbrace x| x_i \in \mathcal {X} , i=1, \ldots ,n\rbrace\). A domain can be with or without labeling information.
Definition 2 (Task).
A task can be denoted as \(\mathcal {T}=\lbrace \mathcal {Y},f\rbrace\), where \(\mathcal {Y}\) and f represent a label space and a decision function, respectively. For the classification task of a source domain \(\mathcal {T^S}\), the goal is usually to predict the conditional distribution of instances, which can be denoted as \(f(x_j)=\lbrace P(y_k|x_j)|y_k \in \mathcal {Y}, k=1, \ldots , |\mathcal {Y}|\rbrace\). In this case, the task \(\mathcal {T^S}\) can be regarded as forming a typical source domain \(\mathcal {D^S}\) with labeling information, being denoted as \((\mathcal {D^S},\mathcal {T^S}) = \lbrace (x,y) | x_i \in \mathcal {X}^S, y_i \in \mathcal {Y}^S, i=1, \ldots , n^S \rbrace\).
Definition 3 (Transfer Learning).
Given \(m^S \in \mathbb {N}^+\) source domain(s) and \(m^T \in \mathbb {N}^+\) target domain(s), their corresponding task(s) can be denoted as \(\lbrace (\mathcal {D}^{S_i}, \mathcal {T}^{S_i}) | i=1, \ldots ,m^S \rbrace\) and \(\lbrace (\mathcal {D}^{T_j}, \mathcal {T}^{T_j}) | j=1, \ldots ,m^T \rbrace\), respectively. Transfer learning aims at improving the performance of decision functions \(f^{T_j}\) on the target domain(s) by making good use of the knowledge learned from the source domain(s).
Definition 4 (Parameter Efficient Fine-tuning).
Given \(m^S \in \mathbb {N}^+\) source domain(s) \(\lbrace (\mathcal {D}^{S_i}, \mathcal {T}^{S_i}) | i=1, \ldots ,m^S \rbrace\) and \(m^T \in \mathbb {N}^+\) target domain(s) \(\lbrace (\mathcal {D}^{T_j}, \mathcal {T}^{T_j}) | j=1, \ldots ,m^T \rbrace\) defined as in transfer learning, the goal of efficient fine-tuning \(\lbrace (\mathcal {D}^{T_j}, \mathcal {Y}^{T_j}, f^{T_j}, f^{S_i})| i=1, \ldots ,m^S, j=1, \ldots ,m^T \rbrace\) is to improve the performance of \(f^{T_j}\) by reusing \(\lbrace f^{S_i}|i=1, \ldots ,m^S \rbrace\) learned from the corresponding source distributions \(P_{XY}\). In particular, the parameters of \(f^{S_i}\) need to be frozen or tuned only in a small portion, while \(f^{T_j}\) denotes an extra small amount of model parameters that can be easily deployed on edge devices. In practice, for supervised or self-supervised pre-training, \(f^{S_i}\) can be learned from \(P_{Y|X}\) and \(P_X\), respectively. This definition is an extension of typical transfer learning [301], which covers multisource efficient fine-tuning.

2.3 Model Architecture

Pre-trained foundation models for vision have been surveyed in [294]; they have developed from CNN- and GAN-based models to recent Transformer-based models. We recommend readers refer to [294] for detailed pre-training strategies. This section briefly introduces the basic structures of these representative models: CNN-based, Transformer-based, and CNN+Transformer.
CNNs are among the most popular deep learning models, including AlexNet [112], VGGNet [205], Inception [215], ResNet [80], EfficientNet [217], and so on, and have been surveyed from time to time [2, 124]. EfficientNet is lightweight yet can achieve performance comparable to Transformer-based models via pre-trained initialization on various visual tasks such as image classification [269] and video understanding [14]. Beyond 2D CNNs, a couple of 3D CNN models such as C3D [222], I3D [21], S3D [250], and X3D [59] have been introduced for video understanding tasks. In addition, graph convolutional networks [257] have also been proposed for tasks such as exercise evaluation [13] and pose estimation [265].
The typical architecture of a Transformer model is structured with several basic Transformer layers. Each layer can be made of a varied number of Transformer blocks composed of a multi-head self-attention (Attention) module and a fully connected feed-forward network (FFN) implemented with a 2-layer multilayer perceptron (MLP). Layer normalization (LN) and residual connections are, respectively, performed before and after both the FFN and Attention modules. Building upon this basic structure, Transformers have come to dominate an increasing number of tasks [72, 105]. Early Transformer models for vision include the Vision Transformer (ViT) [49] and the Data-efficient Image Transformer (DeiT) [221], while representative variants include TNT [73], T2T [271], PVT [237], Swin-Vit [146], Video Swin Transformer [147], and CPVT [39].
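As a concrete reference for the layer structure just described, the following PyTorch sketch assembles one pre-norm Transformer block (multi-head self-attention plus a 2-layer MLP, each wrapped with layer normalization and a residual connection); the module names and dimensions are illustrative assumptions rather than the design of any specific model cited above.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm Transformer block: LN -> Attention -> residual, then LN -> FFN -> residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(                       # 2-layer MLP feed-forward network
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                               # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))
        return x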
Transformer models are well known for their ability to capture long-range dependencies of input data, whereas CNNs may be better at representing local features. Models combining Transformer and CNN can thus achieve better performance. Twins-SVT [38] proposed to add a positional encoding module, implemented via a 2D depth-wise convolution, in between the Transformer encoders, and designed global and local attention modules to improve the model’s representation ability, leading to improved performance on image-based tasks with slightly more model parameters. Representative methods combining CNN and Transformer are Shuffle [90], CMT [71], VOLO [272], and so on. Although they can achieve superior performance, how they can be used for visual tuning seems under-explored.

2.4 Model Pre-training

Pre-training methods can be roughly grouped into supervised and self-supervised ones. Early vision models were pre-trained via supervised learning on large-scale datasets such as ImageNet [45], JFT-300M [209], Kinetics [103], and so on. Building on the success of fine-tuning models pre-trained with supervised learning, larger-scale pre-training has been conducted recently. For example, Gato [188] uses multi-task learning with the supervision of varied tasks to enable the large model to acquire more knowledge for the adaptation of downstream tasks. Multi-label learning is used to pre-train a pure vision model that reaches 22B parameters [43], showing impressive performance on downstream visual tasks.
In the regime of supervised pre-training, the non-trivial annotation cost imposes a practical obstacle to scaling up the benefit of transfer learning. Alternatively, self-supervised learning on unannotated data can also make the models richer and potentially more useful [10]. The paradigm of fine-tuning models pre-trained via self-supervision brings the possibility of learning knowledge from unannotated data at a larger scale, which is enabled by advanced computing power, the Transformer model, and more data. Models pre-trained with self-supervised learning are termed “foundation models” by Bommasani et al. [10]. Recent notable examples include MAE [79, 220] in vision, and CLIP [183], ALIGN [93], Florence [270], BEiT [236], Gato [188], CoCa [266], SWAG [207], and so on for visual-language models. More recently, generative models such as NeRF [161] and Diffusion [173] have also been fine-tuned for better image or video generation, as in Latent-NeRF [160], DreamBooth [194], and Tune-a-video [248]. Following its initial success in NLP, this paradigm has started showing success in vision and various other realms such as climate science [163], protein design [229], intelligent transportation [201], and so on. Bommasani et al. [10] identified the key significance of foundation models as emergence regarding capability and homogenization regarding model, modality, tasks, and domains.

2.5 Model Tuning

Given the knowledge learned by pre-trained models, downstream tasks can greatly benefit from them. Early practices of fine-tuning include updating the whole set of parameters of the pre-trained model and tuning the task head only (e.g., the fully connected layer). Large language models such as GPT-3 [12] are pre-trained via meta-learning in an unsupervised manner, enabling them to handle a broad set of skills (the inner loop is termed “in-context learning”). Given this ability to cover multiple skills, the current leading paradigm in NLP is to adapt downstream tasks to the large language models, moving from “pre-train, fine-tune” to the learning paradigm of “pre-train, prompt, predict” [10].
On the one hand, a couple of recent works [167, 275] achieve promising performance on vision downstream tasks by fine-tuning visual-language models. However, according to the results in [167] and [81], fine-tuning visual-language models does not lead to results as good as fine-tuning supervised pre-trained vision models. In addition, pure vision models are also increasingly large (reaching 22B parameters) [43] and have gained great advances recently [29, 117, 188] with varied pre-training strategies [105, 294]. As such, proper pre-training and fine-tuning techniques for vision downstream tasks need further investigation.

3 Visual Tuning

To the best of our knowledge, there is no survey that systematically summarizes the recent state of visual tuning from the technical perspective. He et al. [78] analyzed different PETL tuning methods such as prompt-tuning, prefix-tuning, and adapters in the NLP domain, showing they are intrinsically similar (i.e., they introduce a certain amount of tunable parameters for adaptation). Taking parameter-efficient transfer learning methods in NLP into consideration, we group visual tuning methods into five categories according to their structures and motivations: fine-tuning, prompt tuning, adapter tuning, parameter tuning, and remapping tuning (see Table 2). In the remainder of this section, we introduce the five groups of tuning techniques with discussions of their advantages and disadvantages.
Table 2. A Comprehensive Review and Classification of Visual Tuning Methods

Fine-tuning. All parameters in the pre-trained model are updated in the tuning process. This method is widely regarded as an effective practice to achieve state-of-the-art performance on many vision benchmark datasets. However, as vision models continue to scale up, this fine-tuning method becomes less practicable due to the storage and training overhead.
Methods: CNN: VGGNet [205], Inception [215], ResNet [80], EfficientNet [217], C3D [222], I3D [21], S3D [250], X3D [59]; Transformer: ViT [49], DeiT [221], TNT [73], T2T [271], PVT [237], Swin-Vit [146], Video Swin Transformer [147], CPVT [39]; CNN and Transformer: Shuffle [90], CMT [71], VOLO [272]

Prompt Tuning. Prompt tuning unifies all downstream tasks into pre-trained tasks via designing a specific template to fully exploit the capabilities of foundation models. Prompt tuning usually learns few parameters and keeps pre-trained models frozen. In addition, the core mechanism of vision prompts aims at exploiting the potential of the upstream pre-trained model, so that it can perform the downstream task as well as possible with little labeled data.
Methods: Vision-driven Prompt: VPT [94], S-Prompting [239], DePT [64], ZegCLIP [298], ACT [48], PViT [83], TeCoA [156], EVP [247], ProSFDA [89], APT [11], PAT [267], LPT [47], PointCLIP [282], P2P [242], PromptGen [245], NOAH [288], PGN [148], FPTrans [278], FRPT [235], RePro [62], ViLD [68], LION [231]; Language-driven Prompt: CoOp [296], SubPT [153], MPA [28], ZegOT [109], X-CLIP [164], ProGrad [299], Berg et al. [8], PTP [285], LANIT [170], SgVA-CLIP [175], LASP [17], DualCoOp [210], PLOT [25], CPL [82], DeFo [230], GALIP [218], CoCoOp [295], PointCLIP V2 [300]; Vision-language Prompt: UPT [275], DPT [252], MaPLe [106], MVLPT [200], MetaPrompt [292], TPT [203]

Adapter Tuning. Adapter tuning is a class of techniques that inserts additional trainable parameters into a frozen pre-trained model to facilitate learning for downstream tasks. The advantage of this method is its lightweight nature and ease of plug-and-play insertion into the middle of a pre-trained network, making it widely applicable in many visual tasks.
Methods: Sequential Adapter: Res-adapt [186], EPM [187], DAN [192], LST [212], Conv-Adapter [26], Polyhistor [145], Pro-tuning [165], AMixer [185], Fit [204], TINA [158], RepAdapter [150], BDTL [123], ViTDet [122], Florence [270], SND [233], MK-Adapter [280], ADA [55], AIM [261], ST-Adapter [166], PEA [199], CAOA [224], HA [108], CLIP-Adapter [63], Tip-Adapter [281], BALLAD [154], MAGMA [53], VL-Adapter [213], Hierarchical3D [169], HyperPELT [290], SVL-Adapter [168], LAVISH [129], CrossModal-Adapter [95], MV-Adapter [277]; Parallel Adapter: ViT-Adapter [36], PESF-KD [184], AdaptMLP [31], Convpass [98], AMA [268], UniAdapter [149]; Mix Adapter: Consolidator [75], ETT [253], PATT [81], PALT [227], TVG [202], VQT [225]

Parameter Tuning. Parameter tuning aims to directly modify the model parameters (i.e., weights and biases). These methods can be grouped into three categories: bias part, weight part, and both. Common modification schemes are addition, decomposition, or tuning without extra parameters (i.e., directly tuning part of the parameters). Representative methods are bias tuning, LoRA, and Compacter.
Methods: Bias Part: Bitfit [274], Side Adapter [255], AdapterBias [60], DP-BiTFiT [15]; Weight Part: LoRA [87], MoSA [111], DyLoRA [227], DnA [96], Compacter [102], KAdaptation [81], PHM [276], PHNNs [67], TARP [85], FacT [99], KronA [52], DLDR [121], Aurora [232]; Weight and Bias: SSF [125]

Remapping Tuning. Remapping-based tuning is an approach that transfers the learned knowledge of a pre-existing model to a new downstream model. This technique has shown promising results in improving the performance of downstream models and can be categorized into three different types according to the use of the pre-trained model.
Methods: Knowledge Distillation: KD [84], Fitnet [191], Student [27], DFA [69], AdaIN [259], Normalized KD [254], Heterogeneous KD [172], DeiT [221], Manifold KD [76], Paraphrasing KD [107], RKD [171], AKDNet [141], SemCKD [23], HKD [297], Review [30], DKD [291]; Weight Remapping: Net2Net [32], EAS [18], N2N Learning [5], NASH [54], Path-level EAS [19], FNA [57], FNA++ [58]; Architecture Remapping: DARTS [131], DATA [22], DATA-GS [283], P-DARTS [34], DARTS+ [126], SGAS [118], SNAS [251], MiLeNAS [77], DARTS- [40]

Red and blue parts (in the accompanying figures) are tunable and frozen parameters, respectively.

3.1 Fine-tuning

We use fine-tuning to denote the standard practice of transfer learning, which either tunes the whole set of parameters of the pre-trained model or just tunes the task head. Many state-of-the-art methods adopt this practice to achieve impressive performance on vision benchmarks such as ImageNet [45], Kinetics [103], COCO [128], NTU RGB+D 120 [132], Human3.6M [92], and so on. Tuning the whole pre-trained model intrinsically initializes the learning process of the downstream task with the learned model weights, whereas tuning only the task head treats the pre-trained model as a feature extractor.
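To make the two standard practices concrete, the hedged PyTorch sketch below contrasts full fine-tuning with head-only tuning on a generic ImageNet pre-trained backbone; the choice of backbone and the number of classes are illustrative placeholders, not drawn from any experiment in this survey.

import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

def build_finetune_model(num_classes=10, head_only=False):
    # Load an ImageNet pre-trained backbone (any pre-trained model would do).
    model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
    if head_only:
        # Treat the backbone as a frozen feature extractor.
        for p in model.parameters():
            p.requires_grad = False
    # Replace the task head; its (new) parameters are trainable in both settings.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

full_model = build_finetune_model(head_only=False)   # all parameters are updated
head_model = build_finetune_model(head_only=True)    # only the new task head is updated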
The full fine-tuning strategy comes with obstacles for adapting large models to downstream tasks. First, it requires one to update and store separate model parameters for different downstream tasks, which can be expensive and infeasible when the foundation models become increasingly large. Second, it relies on high-quality downstream data and can hardly adapt to unseen scenarios that exhibit large distribution shifts [113], unlike the learning process of humans, who can learn from few samples and generalize well to new circumstances. This issue has been researched in directions such as zero-shot learning, few-shot learning, and continual learning [120]. Alternatively, fine-tuning only the downstream task head avoids updating the entire backbone model, but it usually leads to unsatisfactory experimental performance.

3.2 Prompt Tuning

Prompt-based learning was first introduced in NLP to efficiently adapt downstream language tasks to foundation models. Unlike the traditional “pre-training, fine-tuning” paradigm, which initializes the model with the pre-trained weight parameters and optimizes them under the guidance of downstream task-specific loss functions, prompt-based learning leverages textual prompts to reformulate various downstream tasks as the original pre-trained task. Inspired by prompt techniques in NLP, prompt tuning has also been introduced into the computer vision field. Specifically, vision prompt tuning can be divided into three groups, i.e., vision-driven prompt, language-driven prompt, and vision-language prompt.

3.2.1 Vision-driven Prompt.

Vision-driven prompt tuning [11, 47, 89, 148, 247, 267, 288, 298] has become a popular parameter-efficient way to transfer the remarkable generalization ability of pre-trained vision models to various downstream tasks. The research efforts on vision-driven prompt strategies can be roughly categorized into two groups, i.e., modifying inputs directly, and designing vision prompt sub-networks to produce vision prompts. Studies of the first family [48, 64, 83, 94, 156, 208, 239] usually tend to directly modify inputs, e.g., adding a set of learnable parameters to input images, which aims at modifying the input distribution and thereby bringing downstream tasks closer to the task solved during the original pre-training, as shown in Figure 2(a). Formally, the mathematical formulation can be described as
\begin{equation} \mathcal {P}_V = [\mathcal {T},{\bf {\it a}}_1, \cdots , {\bf {\it a}}_i, \cdots , {\bf {\it a}}_n], \end{equation}
(1)
where \(\mathcal {P}_V\) indicates the vision-driven prompts, \(\mathcal {T}\) denotes the embeddings of local image patches or the tokens output by the Transformer, and \({\bf {\it a}}_i\) is the ith learnable vector.
Fig. 2. Three different types of prompt methods. The red and blue parts are tunable and frozen parameters, respectively.
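A minimal PyTorch sketch of Equation (1) in the spirit of VPT: a handful of learnable prompt vectors is appended to the frozen model's token embeddings, and only these vectors (plus a task head) would be optimized; the embedding dimension and prompt length are illustrative assumptions.

import torch
import torch.nn as nn

class VisualPromptInput(nn.Module):
    """Append n learnable prompt vectors a_1, ..., a_n to the token embeddings T (Eq. (1))."""
    def __init__(self, embed_dim=768, num_prompts=10):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, tokens):                           # tokens: (batch, seq_len, embed_dim)
        batch = tokens.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([tokens, prompts], dim=1)       # [T, a_1, ..., a_n]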
Extensive existing works utilize the above principle to design vision prompts that steer frozen pre-trained vision models toward various downstream tasks. Concretely, VPT [94] plugs in only a few learnable parameters and regards them as part of the input tokens of the Transformer, which steers pre-trained vision models to perform various downstream tasks. Similar to VPT, DePT [64] also introduces learnable visual prompts into the vision Transformer and only optimizes these source-initialized prompts while keeping the vision Transformer frozen during adaptation. In addition, PViT [83] designs task-specific prompts by introducing a small set of specialized parameters to adapt a shared video Transformer backbone to synthetic scene tasks and a real video downstream task. LPT [47] optimizes shared prompts to explore the general features across the entire long-tailed dataset, and group-specific prompts to endow frozen pre-trained vision models with fine-grained discrimination ability.
As demonstrated by the above works, prompt learning enables pre-trained visual models to adapt to a variety of visual tasks in natural scenarios. However, prompt learning still has great potential in transferring visual knowledge of pre-trained vision models trained on natural scenes to downstream tasks that have large domain gaps. Recent studies have extended vision prompts from natural scene understanding to diverse vision tasks with huge domain discrepancies, such as point cloud analysis [242, 282], image generation [245], and even speech understanding [110]. Concretely, PointCLIP [282] converts the raw points into scatter depth maps by projecting them onto predefined image planes, termed a vision prompt, which effectively transfers the remarkable ability of the CLIP model. In addition, PointCLIP also narrows the modality discrepancy between unordered point clouds and visual images, thus offering a unique insight for processing vision tasks with significant domain gaps using prompt technology. P2P [242] proposes geometry-preserved projection and geometry-aware coloring operations to translate point cloud data into colorful images, which are regarded as vision prompts and further adapt the pre-trained vision model for various point cloud analysis tasks. These works show that vision-driven prompts can transfer pre-trained vision models from natural scenarios to various downstream tasks even with domain discrepancies.
Notably, the above works construct visual prompts in a simple manner (e.g., adding extra parameters to the inputs) yet make great progress in transferring the remarkable discrimination and generalization ability of pre-trained vision models. To further exploit the effectiveness of vision prompts, the other family of approaches [61, 62, 68, 148, 235, 278, 288] designs a sub-network to construct vision prompts, as shown in Figure 2(c). Specifically, the vision-driven prompts \(\mathcal {P}_V\) can be denoted as
\begin{equation} \mathcal {P}_V = \Phi (X, \boldsymbol {\theta }), \end{equation}
(2)
where \(\Phi (\cdot ,\cdot)\) denotes the designed sub-network that produces the vision prompts \(\mathcal {P}_V\), \(\boldsymbol {\theta }\) is the learnable parameters in \(\Phi (\cdot ,\cdot)\), and X is the input image. For instance, NOAH [288] combines adapter, prompt, and LoRA. It utilizes a neural architecture search (NAS) algorithm to learn the down-sampled dimension of adapters, the down-projection dimension of LoRA, and the learnable token length of prompts, leading to a better tradeoff between parameter efficiency and performance. PGN [148] learns to produce input-dependent prompts by selectively sampling, conditioned on the input image, from a commonly learned library of tokens. FRPT [235] explicitly zooms in on the discriminative regions of input images via a lightweight sampling network to obtain the vision prompts. RePro [62] localizes objects from videos as vision prompts utilizing a tracklet detector and further learns the correlation between subjects and objects according to the learned vision prompts. ViLD [68] generates multiple regions of interest based on a region proposal network, regarded as vision prompts, to align their visual embeddings and textual embeddings for open-vocabulary object detection. These works can produce appropriate prompts according to downstream tasks, thus effectively exploring the remarkable generalization and discrimination ability of pre-trained vision models. More importantly, compared to introducing learnable parameters directly, such designs (e.g., directly modifying the pixels) can improve the interpretability of the vision prompts.
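The second family can be sketched as a small prompt-generation network \(\Phi (X, \boldsymbol {\theta })\) in the sense of Equation (2); the tiny convolutional encoder below is a generic illustration of the idea, not the architecture of any cited method.

import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    """Phi(X, theta): map an input image to a small set of vision prompt tokens (Eq. (2))."""
    def __init__(self, embed_dim=768, num_prompts=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_prompts = nn.Linear(64, num_prompts * embed_dim)
        self.num_prompts, self.embed_dim = num_prompts, embed_dim

    def forward(self, images):                           # images: (batch, 3, H, W)
        feats = self.encoder(images)
        return self.to_prompts(feats).view(-1, self.num_prompts, self.embed_dim)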

3.2.2 Language-driven Prompt.

Recently, large-scale vision-language models have been pre-trained on extensive image-text pairs and focus on open-world visual concepts. Following the idea of prompt learning in NLP, most existing works tend to transfer large-scale vision-language models to various downstream vision tasks by designing appropriate language-driven prompts [8, 127, 153, 170, 175, 218, 285, 295, 300]. As shown in Figure 2(b), most works, such as CoOp [296], first learn unified or class-specific context vectors as language-driven prompts to adapt frozen pre-trained vision-language models to diverse vision tasks. Formally, the language-driven prompts can be formulated as below:
\begin{equation} \mathcal {P}_T = [{\bf {\it a}}_1, \cdots , {\bf {\it a}}_i, \cdots , {\bf {\it a}}_n, \langle \mathrm{class} \rangle ], \end{equation}
(3)
where \(\mathcal {P}_T\) denotes the language-driven prompts, \({\bf {\it a}}_i\) represents the ith learnable vector, n is the number of learnable vectors, and \(\langle \mathrm{class} \rangle\) is the class embedding. Extensive existing works follow this line of language-driven prompts and utilize prompts analogous to the one designed in CoOp to adapt to various downstream tasks, e.g., domain adaptation [28], semantic segmentation [109], video understanding [100, 164], and few-shot learning [299]. In addition, recent methods [17, 25, 82, 196, 210, 230] extend the original language-driven prompt and design multiple complementary language-driven prompts to better mine the task-specific knowledge from pre-trained vision-language models. The multiple complementary language-driven prompts \(\mathcal {P}_{MT}\) can be represented as
\begin{equation} \mathcal {P}_{MT} = [ \mathcal {P}_T^1, \cdots , \mathcal {P}_T^i, \cdots , \mathcal {P}_T^M]. \end{equation}
(4)
For instance, PLOT [25] learns multiple comprehensive prompts to capture different attributes of classes and aligns visual embeddings and multiple textual embeddings via optimizing the optimal transport distances between multiple prompts.
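A hedged sketch of Equations (3) and (4): learnable context vectors are concatenated with frozen class-name embeddings, and several such prompts can be kept in parallel; the dimensions, the number of prompts, and the random class embeddings are illustrative stand-ins, and the frozen text encoder of an actual vision-language model is omitted.

import torch
import torch.nn as nn

class LanguagePrompts(nn.Module):
    """M parallel prompts, each of the form [a_1, ..., a_n, <class>] as in Eqs. (3) and (4)."""
    def __init__(self, class_embeddings, num_context=16, num_prompts=4):
        super().__init__()
        embed_dim = class_embeddings.shape[1]
        # Learnable context vectors a_i, shared across classes; class embeddings stay frozen.
        self.context = nn.Parameter(torch.randn(num_prompts, num_context, embed_dim) * 0.02)
        self.register_buffer("class_emb", class_embeddings)     # (num_classes, embed_dim)

    def forward(self):
        M, n, d = self.context.shape
        C = self.class_emb.shape[0]
        ctx = self.context.unsqueeze(1).expand(M, C, n, d)
        cls = self.class_emb.unsqueeze(0).unsqueeze(2).expand(M, C, 1, d)
        return torch.cat([ctx, cls], dim=2)              # (M, C, n + 1, d) token sequences

prompts = LanguagePrompts(torch.randn(100, 512))()       # would be fed to a frozen text encoder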

3.2.3 Vision-language Prompt.

Vision-driven and language-driven prompts have also been explored jointly to simultaneously modify the vision and text inputs of pre-trained vision-language models, thereby transferring their discrimination and generalization ability by effectively aligning visual and textual embeddings [106, 200, 203, 234, 252, 275, 292]. For instance, UPT [275] designs a shared prompt network to produce the vision prompt and text prompt, thus narrowing the gap between visual representations and textual embeddings. DPT [252] simultaneously optimizes the visual and textual prompts from the vision and text input perspectives, which aims at modifying the textual classifier and visual representations of pre-trained vision-language models. TPT [203] introduces learnable text prompts with random vectors and category names, and designs vision prompts generated by randomly cropping input images. These methods can transfer pre-trained vision-language models to various downstream tasks from the perspective of both text and vision inputs.

3.2.4 Discussion.

It is well known that the quantity of labeled data largely determines the upper limit of a vision algorithm. Vision prompt learning usually focuses on solving the problem of few-shot or zero-shot learning, which allows the model to perform relatively well even without labeled data. Moreover, visual prompt learning integrates all subsequent tasks into the pre-training tasks by creating a distinct template. Through this approach, data from downstream tasks are converted into new inputs that leverage the inherent capacities of pre-trained models. In other words, the core mechanism of vision prompts aims at harnessing the capabilities of the upstream pre-trained model, allowing it to excel in downstream tasks even with minimal reliance on annotated data.
However, prompt-based tuning also suffers from some limitations that restrict its applicability in the real world. Firstly, a significant challenge facing prompt tuning is how to construct or highlight effective visual cues of the inputs and seamlessly integrate them with downstream tasks. This necessitates a profound understanding and solid technical expertise in both the original pre-training tasks and the downstream tasks. Additionally, the prompt-based tuning approach still demands substantial computational resources for model training and optimization, inevitably leading to increased training time and costs. Lastly, despite prompt tuning showcasing notable performance improvements in many tasks, its generalizability requires further exploration and validation when facing huge domain differences between the original pre-training tasks and downstream ones.
Despite these limitations, prompt tuning will assume an increasingly pivotal role in the realm of artificial intelligence. We posit that exploring three specific avenues could mitigate some of the current limitations. Firstly, prompt tuning lends itself to transparent and interpretable adjustments to the input prompts. This transparency enables researchers and practitioners to understand and validate the model’s decision-making process, and helps the model perform better against different data distributions or interferences. Furthermore, researchers and users can tailor prompts to steer the model’s attention toward specific features or classes of interest, making the model more usable for various applications and tasks. Prompt tuning thus contributes to the usability of deep learning models by facilitating model interpretability and controllability. Lastly, prompt tuning tends to promote consistency in model performance by enabling standardized methodologies for adjusting prompts across different datasets and tasks. This consistency ensures that models behave predictably and reliably in various scenarios, enhancing their overall usability and applicability.

3.3 Adapter Tuning

Adapter-based methods are a class of techniques that introduce additional trainable parameters into a frozen pre-trained model to facilitate learning for downstream tasks. In the NLP domain, adapters were first introduced by Houlsby et al. [86] as a means of achieving PETL. However, efficient adaptation, particularly in the field of computer vision, has received comparatively little attention. Initial efforts to develop adaptive methods for computer vision include incremental learning methods [192] and domain adaptation methods [186, 187]. Subsequently, adapters have garnered interest across domains and have been successfully applied in the computer vision field. Adapters provide a lightweight alternative to extensive model fine-tuning.
In this section, we sort out the existing vision-related adapter-based tuning methods, which can be roughly divided into three groups, i.e., sequential adapter, parallel adapter, and mix adapter, introduced one by one as follows.

3.3.1 Sequential Adapter.

Sequential adapter refers to the technique of inserting parameters into a sequential forward network, as shown in Figure 3(a), which typically includes a linear down projection, a non-linear activation function, an up projection, and a residual connection. This approach is commonly applied after the multi-head attention layer and/or the feed-forward layer to enhance model performance. In particular, given a d-dimensional input feature map \(Z^{(l)}\), the number of adapter parameters can be adjusted by a hyperparameter \(d_{\text{bottle}}\) \((d_{\text{bottle}}\ll d)\). The sequential adapter module first uses a down-projection (i.e., downsampling) with \({\bf {\it W}}_{\text{down}} \in \mathbb {R}^{d{\times }d_{\text{bottle}}}\) to project the feature to a lower-dimensional representation, followed by a ReLU activation function and an up-projection (i.e., upsampling) with \({\bf {\it W}}_{\text{up}} \in \mathbb {R}^{d_{\text{bottle}}{\times }d}\). The above formulation can be written as
\begin{equation} \hat{Z}^{(l)}=\text{ReLU}(\text{LN}(Z^{(l)}){\bf {\it W}}_{\text{down}}){\bf {\it W}}_{\text{up}}, \end{equation}
(5)
where \(\hat{Z}^{(l)}\) denotes the optimized features outputted by the sequential adapter.
Fig. 3. Three different types of adapter methods. Red and blue parts are tunable and frozen parameters, respectively.
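A minimal PyTorch sketch of the sequential adapter in Equation (5): layer-normalize, project down to the bottleneck dimension, apply ReLU, project back up, and add the residual connection described in the text; the bottleneck size is an illustrative choice.

import torch
import torch.nn as nn

class SequentialAdapter(nn.Module):
    """Bottleneck adapter of Eq. (5): down-projection, ReLU, up-projection, plus the residual path."""
    def __init__(self, dim=768, bottleneck=64):          # bottleneck << dim
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)           # W_down
        self.up = nn.Linear(bottleneck, dim)             # W_up
        nn.init.zeros_(self.up.weight)                   # start close to an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, z):                                # z: (batch, tokens, dim)
        return z + self.up(torch.relu(self.down(self.norm(z))))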
In sequential adapter strategies, research can be categorized into two groups: inserting residual blocks directly, and using parameter optimization techniques to minimize adapter size. Studies of the first group [26, 186, 192, 212] emerged early, before large-scale models. Res-adapt [186] involves a customized deep network with adapter residual modules to adapt to different visual domains in real time. Following Res-adapt, DAN [192] converges to comparable or higher performance with a fraction (typically 13%) of the parameters of standard fine-tuning. Recent work [212] introduces LST, which trains a separate ladder network using intermediate activations and shortcut connections to improve accuracy and reduce computational complexity. Additionally, Conv-Adapter [26] investigates feasible solutions to learn task-specific knowledge by adapting the intermediate features of each residual block using four variants.
EPM [187] suggests using universal parametric neural network families with limited parameters, while Polyhistor [145] decomposes a hyper-network into separate hyper-networks and factorizes adapter weight matrices. Additionally, Pro-tuning enriches the feature space with multiple prompt blocks [165], while AMixer captures long and short-term dependencies without self-attention [185]. Shysheya et al. propose Fit [204], which scales and shifts activations and uses a Naive Bayes final layer classifier for image classification. Marouf et al. introduce TINA [158], which iteratively reduces adapter size using a scoring function compared to neuron importance, improving overall model efficiency. Finally, Luo et al. propose RepAdapter [150], which uses re-parameterization of sparse structure to approach nearby projection weights, reducing model parameters while maintaining effectiveness and lightweight nature.
Adapters have become a popular technique for foundation tasks where the pre-training task is often image classification. However, other tasks such as high-level vision tasks [55, 122, 123, 134, 152, 270, 280], low-level vision tasks [224, 233], video understanding [166, 261, 270], and robotic control [199] all require designs that are tailored to their specific architectures in order to efficiently transfer learned parameters and achieve good performance through PETL. In addition to these task differences, recent research has proposed innovative ways to utilize adapters in different applications, such as BDTL and ViTDet [122, 123] adjusting a plain backbone with minimal adaptation for object detection, and Florence [270] incorporating universal visual-language representations for a wide range of tasks such as retrieval, classification, object detection, visual question answering, and action recognition. SND [233] uses a dynamic stacked network for image restoration, MK-Adapter [280] blends predictions for few-shot classification, and ADA [55] performs continual learning. AIM [261] and ST-Adapter [166] equip models with spatio-temporal reasoning for video understanding. PEA [199] addresses robotic manipulation limitations, and CAOA [224] optimizes image compression with adapters.
In the field of multi-modal learning, with the development of large-scale cross-modal pre-trained models, i.e., CLIP [183] and ALIGN [93], the adapter technique has been widely adopted, using designs analogous to the one mentioned above, to adapt to various downstream tasks for efficient fine-tuning [53, 63, 95, 129, 154, 168, 213, 277, 281, 290] with excellent results. HA [108] recommends general recipes for efficient multi-modal transfer learning. CLIP-Adapter [63] uses residual-style feature blending with an additional bottleneck adapter, while Tip-Adapter [281] enhances few-shot capability without backpropagation during training. MAGMA [53] combines visual and textual inputs for generative language models, and BALLAD [154] augments representations for long-tailed vision-language learning. Hierarchical3D [169] integrates multi-modal content into a textual summarizer, while VL-Adapter [213] adjusts pre-trained models with sequential adapter layers for cross-modal domains. HyperPELT [290] fine-tunes small modules using a shared hyper-network, while CrossModal-Adapter and MV-Adapter [95, 277] allow early cross-modal interactions. SVL-Adapter [168] combines vision-language pre-training and self-supervised representation learning, and LAVISH [129] migrates adapters for pre-trained ViTs to audio-visual tasks. These approaches demonstrate the versatility of adapters and their potential for various applications beyond traditional classification tasks in multi-modal learning.

3.3.2 Parallel Adapter.

The parallel adapter [31, 36, 98, 135, 149, 184, 268] has been proposed as a variant of the classic sequential adapter architecture, as shown in Figure 3(b). Here, activations are passed through the adapter module in parallel to the adapted sub-layer (i.e., the feed-forward or attention layer), as opposed to the established sequential order of computation. The parallel adapter module also uses a down-projection (i.e., downsampling) with \({\bf {\it W}}_{\text{down}} \in \mathbb {R}^{d{\times }d_{\text{bottle}}}\) to project the feature to a lower-dimensional representation, followed by a ReLU activation function and an up-projection (i.e., upsampling) with \({\bf {\it W}}_{\text{up}} \in \mathbb {R}^{d_{\text{bottle}}{\times }d}\), applied in parallel. Formally, the process of the parallel adapter can be described as:
\begin{equation} \hat{Z}^{(l)}=\text{ReLU}(\text{LN}(Z^{(l)}){\bf {\it W}}_{\text{down}}){\bf {\it W}}_{\text{up}}+\text{LN}(Z^{(l)}), \end{equation}
(6)
where \(\hat{Z}^{(l)}\) denotes the optimized features outputted by the parallel adapter.
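For contrast with the sequential form, a sketch of Equation (6): the same bottleneck transformation is computed from the normalized input and summed with it, running alongside the frozen attention or feed-forward sub-layer; the sizes are again illustrative.

import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    """Bottleneck branch of Eq. (6), applied in parallel to a frozen sub-layer."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)           # W_down
        self.up = nn.Linear(bottleneck, dim)             # W_up

    def forward(self, z):                                # z: (batch, tokens, dim)
        h = self.norm(z)                                 # LN(Z)
        return self.up(torch.relu(self.down(h))) + h     # adapter branch + normalized input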
The simplest application of adapters is to insert a module in parallel. ViT-Adapter [36] introduces image-related biases by a pre-training-free adapter, while PESF-KD [184] updates only the adapter for soft labels. AdaptMLP [31] adapts to large video action recognition using two parallel branches. Convpass [98] uses trainable convolutional blocks to improve inductive bias. AMA [268] restores 2D structure for each modality, and UniAdapter [149] unifies uni-modal and multi-modal adapters with partial weight sharing. These approaches demonstrate the versatility of adapter modules in various applications.

3.3.3 Mix Adapter.

The mix adapter [75, 202, 225, 246, 253, 264] introduces new parameters at different positions with a mixed architecture, as demonstrated in Figure 3(c), i.e., including the multi-head attention blocks in each Transformer layer.
PATT [264] explores efficient parameter techniques for video-based downstream tasks with a prefix-tuning module. ETT [253] uses attentive prefix tuning and domain residual adapters for few-shot learning. PALT [246] prunes adapters based on the lottery ticket hypothesis. VQT [225] aggregates intermediate features for parameter and memory-efficient transfer learning. Consolidator [75] structures tunable parts for efficient transfer learning with group-wise convolution. TVG [202] compares pre-trained models and adapters for video grounding tasks. These approaches demonstrate the versatility of efficient adapter techniques in various applications.

3.3.4 Discussion.

Adapter-based methods represent a popular PETL approach within vision and multi-modal learning, emphasizing the modification of a small set of parameters within a frozen backbone to address downstream tasks. This not only economizes on computational expense but also introduces a high degree of modularity. Such modularity facilitates the swift adaptation of pre-trained models to new tasks without necessitating significant architectural overhauls. Meanwhile, by focusing adaptation efforts on a concise set of parameters, adapter-based techniques maintain the integrity of the original model’s learned representations, thereby enhancing the model’s generalization capabilities across various tasks. Moreover, adapters introduce variability through methods like projecting down and up with intermediate non-linear layers, offering a range of model adjustments not typically available through direct parameter tuning.
However, adapter tuning has its limitations when compared with other methods. On one hand, adapter tuning lacks interpretable semantic meaning compared with prompt tuning. On the other hand, it can be slightly less parameter-efficient than parameter tuning methods such as LoRA. Compared with remapping methods, adapter tuning faces the challenge of deciding where to insert parameters (such as Transformer models’ attention and feed-forward modules, between the Transformer layers or blocks, etc.). Existing adapter tuning methods seem to have no consistent rule and simply insert parameters into specific layers.
We posit that exploring two specific avenues could mitigate some of the current limitations. Firstly, introducing more efficient operations could broaden the applicability of adapter-based methods across various communities. Not all layers of a foundation model may require adapters; a unified rule, akin to the scaling principles used in foundation models, could dictate their strategic placement, enhancing efficiency. Secondly, more adapter architectures can be studied. For instance, in the realm of NLP, there exist adapter architectures that exhibit promising performance in adapting to new tasks [176], which can be leveraged and applied to visual tuning. Furthermore, emerging integration techniques will likely enable adapters to achieve improved performance in practical applications.

3.4 Parameter Tuning

Parameter-based tuning directly modifies the parameters (either weights or biases) of the pre-trained model in a more aggressive manner. Given a specific layer, its weight term is multiplied with the feature map and its bias term is added to the feature map. As shown in Figure 4, this section introduces parameter-based methods according to which part of the parameters is tuned: the weight part, the bias part, or both. The techniques can be grouped into addition and decomposition. Existing works also term these techniques reparameterization-based methods [46, 150].
Fig. 4. Three types of parameter tuning. Red and blue parts are tunable and frozen parameters, respectively.
Consider a neural network layer with parameters (\(k,d,K,G\)), where \(k=C_{\text{in}}\) is the number of input channels, \(d=C_{\text{out}}\) is the number of output channels, K is the kernel size, and G is the group size. When \(G=1\), we have \({\bf {\it W}} \in \mathbb {R}^{d \times k \times K}\) and \({\bf {\it b}} \in \mathbb {R}^{d}\). A typical convolutional operation can then be denoted as
\begin{equation} Z^{(l)}=Z^{(l-1)}{\bf {\it W}} + {\bf {\it b}}, \end{equation}
(7)
where \(Z^{(l)}\) and \(Z^{(l-1)}\) denote the output and input features at the lth neural network layer. The group parameter G can be used to control the connections between inputs and outputs, leading to the weight becoming \({\bf {\it W}} \in \mathbb {R}^{d \times \frac{k}{G} \times K}\). For ease of explanation, we do not consider the kernel size and feature size but focus on the variable size. In the remainder of this section, parameter tuning methods are introduced based on three groups: bias part, weight part, and both.

3.4.1 Bias Part.

Bitfit [274], short for bias-term fine-tuning, only tunes the bias part of the pre-trained model (see Figure 4(a)) and can be represented as
\begin{equation} Z^{(l)}=Z^{(l-1)}{\bf {\it W}} + {\bf {\it b}}, \end{equation}
(8)
where the weight parameters \({\bf {\it W}}\) are frozen, and the bias \({\bf {\it b}}\) contains the parameters optimized in the tuning process. Without changing the bias of the pre-trained model, AdapterBias [60] adds to the bias term at the MLP layer by using a linear layer \(L\) with weight \(\boldsymbol {\alpha } \in \mathbb {R}^{d}\) and a tunable vector \({\bf {\it v}} \in \mathbb {R}^{r}\), which can be calculated as
\begin{equation} Z^{(l)}=Z^{(l-1)}{\bf {\it W}} + {\bf {\it b}}+{\bf {\it v}} \otimes \boldsymbol {\alpha }. \end{equation}
(9)
Xu et al. [255] introduced a side-tuning design with two branches: one for predicting mask proposals, and the other for predicting attention biases, which is applied to the CLIP model for semantic segmentation. It adds a bias term to the results of the Softmax layer of the attention module. Differentially Private Bias-Term Fine-Tuning (DP-BiTFiT) [15] proposed a differentially private version of bias-tuning. DP-BiTFiT uses the DP-SGD optimizer to make the bias terms private: it first aggregates bias gradient norms across all layers, then uses them to compute the clipping factor, adds Gaussian noise to the sum of clipped gradients, and descends on the bias terms. DP-BiTFiT essentially changes the way the bias term is optimized and achieves performance comparable to bias-tuning. DP-BiTFiT’s implementation is worth noting as it does not calculate the gradients for the pre-trained weights, which helps to save over \(60\%\) of training time.
Namazifar et al. [162] studied the role of the bias terms of the Transformer for NLP tasks. From a mathematical perspective with empirical verification, they conclude that the bias term of the key linear transformation is redundant and can be omitted without any impact on the attention module. Moreover, the bias term of the value linear transformation plays a more prominent role than the bias term of the query linear transformation.
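A hedged sketch of bias-only tuning in the spirit of BitFit (Eq. (8)): every weight of a pre-trained model is frozen and only the bias terms (plus a new task head) remain trainable; the torchvision ViT backbone and the 10-class head are illustrative placeholders.

import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads = nn.Linear(768, 10)                 # new task head for a 10-class downstream task

for name, param in model.named_parameters():
    # Only bias terms b and the task head are optimized; all weights W stay frozen.
    param.requires_grad = name.endswith("bias") or name.startswith("heads")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")      # a small fraction of the full model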

3.4.2 Weight Part.

Figure 4(b) shows models that tune the weight part of some layers. Given the parameter of a neural network layer with weight \({\bf {\it W}} \in \mathbb {R}^{d \times k}\), LoRA [87] learns parameters \({\bf {\it W}}_{\text{down}} \in \mathbb {R}^{d \times r}\) and \({\bf {\it W}}_{\text{up}} \in \mathbb {R}^{r \times k}\) on top of \({\bf {\it W}}\), which can be denoted as
\begin{equation} Z^{(l)}=Z^{(l-1)}+Z^{(l-1)}({\bf {\it W}}+{\bf {\it W}}_{\text{down}}{\bf {\it W}}_{\text{up}}). \end{equation}
(10)
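A minimal sketch of the low-rank update in Equation (10): the frozen linear weight \({\bf {\it W}}\) is augmented with the trainable product \({\bf {\it W}}_{\text{down}}{\bf {\it W}}_{\text{up}}\), so only the two small factors are optimized; the rank, dimensions, and the \(\alpha /r\) scaling (common LoRA practice, not prescribed by Equation (10)) are illustrative choices.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W plus a trainable low-rank branch W_down W_up (cf. Eq. (10))."""
    def __init__(self, in_dim=768, out_dim=768, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)           # pre-trained W (and bias), kept frozen
        self.base.weight.requires_grad = False
        self.base.bias.requires_grad = False
        self.down = nn.Linear(in_dim, rank, bias=False)  # W_down
        self.up = nn.Linear(rank, out_dim, bias=False)   # W_up
        nn.init.zeros_(self.up.weight)                   # the low-rank update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))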
The LoRA structure has been applied to an encoder-decoder model called motion style adapters (MoSA) [111]. MoSA uses a lightweight LoRA structure for adapting the motion style (e.g., pedestrians) from a source domain with sufficient labeled data to a target domain (e.g., cyclists). DyLoRA [227] proposes to truncate the rank parameters into multiple parts (i.e., ranks) and to optimize them sequentially without relying on a search mechanism.
Decomposition-and-Alignment (DnA) [96] uses GreBsmo (replaced with SVD in the implementation) to decompose the weight matrix \({\bf {\it W}} \in \mathbb {R}^{d \times k}\) into a low-rank form: \({\bf {\it W}}={\bf {\it UV}}+{\bf {\it S}}\), where \({\bf {\it U}}\in \mathbb {R}^{d \times r}\) is the “alignable” part, \({\bf {\it V}} \in \mathbb {R}^{r \times k}\) is the “fixed support” from the pre-trained model, and \({\bf {\it S}} \in \mathbb {R}^{d \times k}\) is the residual term. Two additional variables \(\Delta {\bf {\it U}}\) and \(\Delta {\bf {\it S}}\) are added to the decomposed \({\bf {\it W}}\), which can be denoted as
\begin{equation} Z^{(l)}=Z^{(l-1)}(({\bf {\it U}}+\Delta {\bf {\it U}}){\bf {\it V}}+{\bf {\it S}}+\Delta {\bf {\it S}}). \end{equation}
(11)
DnA still needs SVD to implement the GreBsmo algorithm, which brings additional complexity to the iterative optimization process.
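A rough sketch of the decompose-then-tune idea in Equation (11) is given below, with a plain truncated SVD standing in for GreBsmo and with dense increments for simplicity; the class name and hyperparameters are illustrative and do not reproduce the official DnA implementation.

import torch
import torch.nn as nn

class DnALinear(nn.Module):
    """Sketch of Equation (11): W is split into a low-rank part UV plus a
    residual S; only the increments Delta_U and Delta_S are trained."""

    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)   # stand-in for GreBsmo
        sqrt_s = S[:rank].sqrt()
        self.register_buffer("U", U[:, :rank] * sqrt_s)            # d x r, frozen
        self.register_buffer("V", sqrt_s[:, None] * Vh[:rank])     # r x k, frozen
        self.register_buffer("S_res", weight - self.U @ self.V)    # residual, frozen
        self.delta_U = nn.Parameter(torch.zeros_like(self.U))      # tuned
        self.delta_S = nn.Parameter(torch.zeros_like(self.S_res))  # tuned (dense here)

    def forward(self, x):
        w = (self.U + self.delta_U) @ self.V + self.S_res + self.delta_S
        return x @ w

layer = DnALinear(torch.randn(768, 384), rank=8)
out = layer(torch.randn(2, 768))                                   # -> (2, 384)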
Compacter [102], KAdaptation [81], and Aurora [232] use Kronecker products to decompose the weight parameter as \({\bf {\it W}}=\sum _{i=1}^{n} {\bf {\it A}}_i \otimes {\bf {\it B}}_i\), where \({\bf {\it A}}_i \in \mathbb {R}^{n \times n}\) and \({\bf {\it B}}_i \in \mathbb {R}^{\frac{d}{n} \times \frac{k}{n}}\), and tune one part of the decomposed term, \({\bf {\it B}}_i\), in a low-rank form \({\bf {\it B}}_i={\bf {\it u}}_i{\bf {\it v}}_i\) with \({\bf {\it u}}_i \in \mathbb {R}^{\frac{d}{n} \times r}\) and \({\bf {\it v}}_i \in \mathbb {R}^{r \times \frac{k}{n}}\), which can be represented as
\begin{equation} Z^{(l)}=Z^{(l-1)}\left(\sum _{i=1}^{n} {\bf {\it A}}_i \otimes {\bf {\it u}}_i{\bf {\it v}}_i\right). \end{equation}
(12)
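The following sketch shows how a weight of the form in Equation (12) can be parameterized with Kronecker products in PyTorch; it builds a stand-alone linear weight for brevity, whereas Compacter applies this parameterization to adapter projections and shares the \({\bf {\it A}}_i\) factors across layers. All names and sizes are illustrative assumptions.

import torch
import torch.nn as nn

class KroneckerLinear(nn.Module):
    """Sketch of Equation (12): W = sum_i A_i kron (u_i v_i), where A_i is
    n x n and the second factor is itself low rank."""

    def __init__(self, d: int, k: int, n: int = 4, rank: int = 1):
        super().__init__()
        assert d % n == 0 and k % n == 0
        self.A = nn.Parameter(torch.randn(n, n, n) * 0.02)           # n factors A_i
        self.u = nn.Parameter(torch.randn(n, d // n, rank) * 0.02)   # u_i
        self.v = nn.Parameter(torch.zeros(n, rank, k // n))          # v_i, zero init

    def weight(self):
        B = self.u @ self.v                                          # n x (d/n) x (k/n)
        return sum(torch.kron(self.A[i], B[i]) for i in range(self.A.shape[0]))

    def forward(self, x):
        return x @ self.weight()                                     # (batch, d) -> (batch, k)

layer = KroneckerLinear(d=768, k=768, n=4, rank=1)
out = layer(torch.randn(2, 768))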
The decomposition method using the Kronecker product is also known as a parameterized hypercomplex multiplication/convolution (PHM/PHC) layer [67, 276] and has been applied to varied tasks in vision and audio. PHM [276] inspires [3] to form a tunable weight with three terms \({\bf {\it z}}_i,{\bf {\it s}}_i\), and \({\bf {\it A}}_i\), which are added to the pre-trained weight for PETL on NLP tasks. FacT [99] considers two decomposition methods, FacT-TT and FacT-TK, using the Kronecker product and a multilinear generalization of the SVD (i.e., the Tucker model) [42], respectively. FacT-TK generally performs better than FacT-TT at the cost of slightly more parameters across 19 image-based tasks, and both require far fewer parameters than the basic LoRA method. Dynamic Linear Dimensionality Reduction (DLDR) [121] claims that optimizing only a low-dimensional subspace of a large model can achieve comparable performance; it uses SVD to decompose the weights and find the subspace to tune, reaching comparable performance after training for only a small number of epochs.
RepAdapter [150] builds on the LoRA structure and introduces a group-wise transformation [151] to reparameterize the weight term, interpreting its group-wise divided LoRA layers as a reparameterization process. It aims to reduce inference time and can be seamlessly integrated into most giant vision models via structural re-parameterization.
Similarly, in NLP, task-adaptive reparameterization (TARP) [85] uses the Kronecker product as a dynamic low-rank decomposition of the MLP module for domain adaptation. Kronecker Adapter (KronA) [52] also introduces the Kronecker product to improve the limited representation power of low-rank representations for NLP tasks.

3.4.3 Weight and Bias.

As illustrated in Figure 4(c), some methods modify parameters of both the weight and bias parts. Scale and Shift the deep Features (SSF) [125] modulates the weight and bias terms using two learnable vectors \(\boldsymbol {\gamma } \in \mathbb {R}^d\) and \(\boldsymbol {\beta } \in \mathbb {R}^d\), which can be represented as
\begin{equation} Z^{(l)}=\boldsymbol {\gamma } \odot ({\bf {\it W}}Z^{(l-1)}+{\bf {\it b}}) +\boldsymbol {\beta }, \end{equation}
(13)
where \(\boldsymbol {\gamma }\) and \(\boldsymbol {\beta }\) are, respectively, interpreted as scale and shift factors. Note that both \(\boldsymbol {\gamma }\) and \(\boldsymbol {\beta }\) are learnable vectors, which can be much smaller than the matrix variables of LoRA or the decomposed forms of DnA, Compacter, and KAdaptation.
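A minimal SSF-style wrapper following Equation (13) is sketched below; the wrapper name and the identity-style initialization (\(\boldsymbol {\gamma }\) as ones, \(\boldsymbol {\beta }\) as zeros) are illustrative assumptions, and in SSF such factors are inserted after every operation of interest rather than around a single layer.

import torch
import torch.nn as nn

class SSFWrapper(nn.Module):
    """Sketch of Equation (13): the frozen layer's output is rescaled by
    gamma and shifted by beta, the only trainable parameters."""

    def __init__(self, layer: nn.Linear):
        super().__init__()
        self.layer = layer
        for p in self.layer.parameters():            # freeze W and b
            p.requires_grad_(False)
        d = layer.out_features
        self.gamma = nn.Parameter(torch.ones(d))     # scale, identity init
        self.beta = nn.Parameter(torch.zeros(d))     # shift, zero init

    def forward(self, x):
        return self.layer(x) * self.gamma + self.beta

wrapped = SSFWrapper(nn.Linear(768, 768))
out = wrapped(torch.randn(2, 197, 768))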

3.4.4 Discussion.

Different from the prompt-based and adapter-based methods, parameter-based tuning can use fewer parameters to achieve a similar adaptation effect. On the tested image-based tasks, SSF [125] can even outperform adapter-based methods and VPT. SSF [125] extends the bias-tuning idea to the weight variable via an element-wise product. According to the analysis of SSF, it intrinsically modifies both the weight and bias variables, which can be interpreted as follows:
\begin{equation} Z^{(l)}=\boldsymbol {\gamma } \odot ({\bf {\it W}}Z^{(l-1)}+{\bf {\it b}}) +\boldsymbol {\beta } = (\boldsymbol {\gamma } \odot {\bf {\it W}})Z^{(l-1)} + \boldsymbol {\gamma } \odot {\bf {\it b}} + \boldsymbol {\beta }, \end{equation}
(14)
where \(\odot\) denotes the element-wise (Hadamard) product. Given the varied techniques available, Mao et al. [157] unified these methods with a gate mechanism. Back Razor [97] uses pruning techniques to drop activations during back-propagation, leading to sparse activations. This track of techniques will be further introduced in Section 3.5. In addition to Transformer-based structures, LoRA Winograd convolution [181] uses the LoRA mechanism to prune 3D CNN backbone models (e.g., C3D and R3D-18), accelerating the Winograd operation [115] with fewer trainable parameters.
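Because Equation (14) shows that the scale and shift can be absorbed into the frozen weight and bias, the learned factors can be folded back into the original layer so that inference incurs no extra operations; the function below is a rough sketch of such a merge for a linear layer (RepAdapter performs an analogous structural re-parameterization for its adapters), with illustrative names and values.

import torch

@torch.no_grad()
def merge_scale_shift(layer: torch.nn.Linear, gamma: torch.Tensor, beta: torch.Tensor):
    """Fold learned scale/shift factors into the frozen layer (cf. Equation (14)),
    so the tuned model runs with its original architecture at inference."""
    # PyTorch stores weight as (out_features, in_features), so scaling each
    # row realizes (gamma elementwise W) on the output channels.
    layer.weight.mul_(gamma[:, None])
    if layer.bias is not None:
        layer.bias.mul_(gamma).add_(beta)      # gamma elementwise b, plus beta
    return layer

layer = torch.nn.Linear(768, 768)
merged = merge_scale_shift(layer, gamma=torch.full((768,), 1.1), beta=torch.zeros(768))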
Although parameter-based tuning can be less expensive in terms of tuned parameters, it sometimes underperforms the former two families (i.e., prompt tuning and adapter tuning). This might be because fewer parameters can reduce the adaptation ability to a target domain with a large domain gap. Another limitation of existing parameter-based tuning methods is the lack of exploration based on the semantics of pre-trained models, leading to insufficient explainability. So far, most methods are tested on Transformer-based structures, while their effect on CNN-based structures remains underexplored.
In the future, we expect continued exploration along this track toward more aggressive parameter efficiency via further factorization of pre-trained models’ weight or bias terms. Meanwhile, visual semantics are expected to be considered based on different types of pre-trained models (i.e., foundation models pre-trained on varied levels of vision tasks: low-level, middle-level, and high-level). Parameter tuning can also be combined with other tuning techniques for better interpretability. In addition, existing methods can be extended to CNN-based backbones, which still outperform their Transformer-based counterparts on some specific tasks.

3.5 Remapping Tuning

Instead of directly fine-tuning or otherwise modifying the pre-existing model, remapping-based tuning is a category of techniques that transfer the knowledge learned by a pre-trained model to a new downstream model. Based on how the pre-trained model is utilized, i.e., its output, weights, or network architecture (see Figure 5), we discuss three forms of knowledge transfer in the following categories: knowledge distillation-based remapping, weight-based remapping, and architecture-based remapping.
Fig. 5.
Fig. 5. Three different types of remapping tuning methods. Red and blue parts are tunable and frozen parameters, respectively.

3.5.1 Knowledge Distillation.

Knowledge distillation aims at regularizing the downstream model by enforcing it to mimic the output of pre-trained models. Note that the output typically refers to the final response or intermediate features. Knowledge distillation is also an important model compression technique. In this section, we do not cover other model compression techniques such as network pruning [66, 155, 243], since they are not typically motivated by transferring knowledge from a teacher network to a student network.
The fundamental idea of knowledge distillation is to transfer the learned knowledge from a large pre-trained teacher model into a small student model by learning the network output or intermediate features of the teacher. Typically, knowledge is distilled from the teacher model to the student model using a soft target distribution for each training case. The probability \(q_i\) of class \(i\) can be formulated as
\begin{align} q_i = \frac{\exp (z_i/T)}{\sum _j \exp (z_j/T)}, \end{align}
(15)
where \(z_i\) is the \(i\)-th output logit of the teacher network and T is the temperature of the distillation process.
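A common way to turn Equation (15) into a training objective is to blend a temperature-scaled KL term on the teacher's soft targets with the usual cross-entropy on hard labels, as in the sketch below; the temperature, the blending weight, and the function name are illustrative rather than values fixed by the literature.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Response-based distillation sketch: match the teacher's softened
    distribution (Equation (15)) and the ground-truth labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)         # q_i from the teacher
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

loss = distillation_loss(torch.randn(8, 100), torch.randn(8, 100),
                         torch.randint(0, 100, (8,)))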
To the best of our knowledge, the work of [16] first introduced knowledge distillation to extract knowledge from a pre-existing model. They trained a compressed model with pseudo data labeled by an ensemble of models, incurring no significant loss in performance. This idea was extended to compress deep and wide networks into shallower ones in [6]. Hinton et al. [84] introduced the teacher-student knowledge distillation framework, where the student network is penalized based on the softened class distribution output by the teacher network.
One of the characteristics of deep neural networks is to obtain increasingly expressive power by learning hierarchical feature representations, as pointed out in [7]. Based on this theory, both the final response and the intermediate feature maps of the teacher network can be employed as the target for training the student model. To substantially exploit the information of intermediate layers, Fitnets [191] introduces intermediate-level hints of the teacher to facilitate training the student. It enforces the intermediate feature alignment between the teacher and student networks via the teacher’s intermediate feature maps as hints. Subsequently, a rich line of work is devoted to aligning the features indirectly [27, 69, 76, 107, 172, 221, 254, 259]. Concretely, Kim et al. [107] developed a factor transfer method that employs paraphrased intermediate features of the teacher as a factor, rendering the knowledge of the teacher network more understandable for the student network. Inspired by NAS [131], Guan et al. [69] developed a two-stage distillation approach that adopts the differentiable search strategy to simultaneously improve the efficiency and the effectiveness of knowledge distillation. Xu et al. [254] developed a feature-normalized distillation method by introducing a sample-specific correction factor for the replacement of the temperature, with the goal of suppressing the impact of noise resulting from the one-hot label. Passalis et al. [172] modeled the information flow of the teacher’s multiple intermediate layers and then trained a student model to match this information flow. To realize knowledge transfer for vision transformers, Touvron et al. [221] introduced a token-based distillation strategy termed DeiT, which enforces the student transformer to directly reproduce the label estimated by the pre-trained teacher network using a distillation token. Hao et al. [76] introduced a manifold distillation approach for vision transformers by substantially utilizing patch-level information.
There are also some extensions that further explore knowledge transfer patterns. Park et al. [171] proposed a relational knowledge distillation scheme that transfers mutual relations of outputs instead of individual outputs. As a generalization of vanilla knowledge distillation, they introduced distance-wise and angle-wise distillation losses to sufficiently extract the structural relations among data examples. Liu et al. [141] proposed an architecture-aware knowledge distillation approach termed AKD, with the goal of finding the optimal student networks for distilling a given teacher network. Chen et al. [23] studied the semantics of intermediate layers and employed an attention mechanism to automatically assign the soft layer association between teacher and student networks, which can reduce the impact of over-regularization during the training process. Zhou et al. [297] proposed holistic knowledge distillation with graph neural networks, where the holistic knowledge contains individual knowledge and relational knowledge [139, 174]. To integrate the two types of knowledge and refine their correlations, graph neural networks are adopted to learn holistic knowledge that supervises the student network by aggregating feature representations from correlated data examples. Chen et al. [30] proposed a residual distillation framework termed Review to effectively learn informative features from multi-level information in the teacher network. Review utilizes multiple layers in the teacher to guide the training of one layer in the student, with great performance gains. Zhao et al. [291] decomposed the traditional knowledge distillation loss into target-class and non-target-class knowledge distillation and then investigated their effects. Based on the observations, they found that the traditional knowledge distillation loss is a highly entangled formulation and introduced a decoupled method to facilitate knowledge distillation.
For application, most of the above methods focus on image classification. Furthermore, knowledge distillation also demonstrates promising results in more vision tasks, such as object detection [70, 119, 279, 293], image segmentation [140, 143, 241, 258], person re-identification [179, 189], super-resolution [4, 284], depth estimation [88, 240], and crowd counting [133].

3.5.2 Weight Remapping.

Rather than relying on the teacher’s output as supervision to train the student, weight remapping directly transfers the model weights from the teacher network to the student one. Specifically, assume that a teacher network is a function \(f(x; \boldsymbol {\theta })\) parameterized by \(\boldsymbol {\theta }\), where x is the network’s input. Weight remapping for a student network g reassembles a new set of parameters \(\boldsymbol {\theta }^{\prime }\) from the existing parameters \(\boldsymbol {\theta }\), such that
\begin{align} \forall x, f(x; \boldsymbol {\theta }) = g(x; \boldsymbol {\theta }^{\prime }). \end{align}
(16)
Net2Net [32] is a pioneering effort that rapidly transfers the knowledge stored in a pre-existing network into another network by remapping the weights of the pre-existing teacher network to the student. Subsequently, EAS [18] introduces the concept of weight remapping into NAS by exploring the search space according to a pre-existing network and reusing its weights. To handle variable-length architectures and consider the entire input architecture, EAS employs a bidirectional recurrent network [198] as the encoder network. In this way, the previously trained network can be further exploited to efficiently explore the architecture space and greatly accelerate the training of the new network. To efficiently compress the teacher network for knowledge transfer, Ashok et al. [5] proposed a reinforcement learning-based approach termed N2N learning, which models the conversion from a teacher network into a student network as a Markov Decision Process (MDP). N2N learning formulates the process of knowledge transfer as a two-stage action selection. In the first stage, a recurrent policy network selects a sequence of actions that keep or remove layers of the large teacher network. In the second stage, another policy network further shrinks each remaining layer to reach the desired compact configuration.
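The function-preserving property of Equation (16) can be illustrated with a Net2Net-style width expansion: hidden units are duplicated and the outgoing weights of the duplicates are rescaled so that the widened student computes exactly the same function as the teacher. The sketch below covers only two fully connected layers and simplifies the original method; all names are illustrative.

import torch
import torch.nn as nn

@torch.no_grad()
def net2wider(fc1: nn.Linear, fc2: nn.Linear, new_width: int):
    """Sketch of function-preserving weight remapping (cf. Equation (16))."""
    old_width = fc1.out_features
    extra = torch.randint(0, old_width, (new_width - old_width,))
    mapping = torch.cat([torch.arange(old_width), extra])        # remapping of hidden units
    counts = torch.bincount(mapping, minlength=old_width).float()

    wider_fc1 = nn.Linear(fc1.in_features, new_width)
    wider_fc1.weight.copy_(fc1.weight[mapping])                  # duplicate incoming rows
    wider_fc1.bias.copy_(fc1.bias[mapping])

    wider_fc2 = nn.Linear(new_width, fc2.out_features)
    wider_fc2.weight.copy_(fc2.weight[:, mapping] / counts[mapping])  # rescale outgoing columns
    wider_fc2.bias.copy_(fc2.bias)
    return wider_fc1, wider_fc2

fc1, fc2 = nn.Linear(16, 32), nn.Linear(32, 10)
w1, w2 = net2wider(fc1, fc2, new_width=48)
x = torch.randn(4, 16)
assert torch.allclose(fc2(torch.relu(fc1(x))), w2(torch.relu(w1(x))), atol=1e-5)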
Furthermore, some interesting weight remapping methods take the network path topology into consideration instead of merely adding or removing network layers. Elsken et al. [54] introduced a hill climbing-based approach named NASH, which can automatically search for the optimal student architecture. By using a series of alternative network morphisms, NASH can train the child networks with a short optimization process using cosine annealing. At each training step, NASH searches for the optimal architectures via a simple hill-climbing strategy [195]. Path-level EAS [19] enforces the meta-controller to change the topology of network connection paths while using function-preserving transformation operations to remap weights. To achieve this, Path-level EAS develops a bidirectional tree-structured meta-controller based on reinforcement learning, in order to enrich the architecture space with generalized multi-branch structures. Yang et al. [262] further proposed to customize networks efficiently by reassembling various pre-trained network blocks subject to downstream constraints.
Previous object detection and semantic segmentation approaches use network weights pre-trained on image classification for performance gains. However, one major challenge is that ImageNet pre-training typically incurs very high computation costs. To address this issue, Fang et al. [57] introduced a fast neural network adaptation approach dubbed FNA, which adapts a pre-trained network to a new task by modifying network attributes such as depth and kernel sizes. In this way, FNA can extend NAS techniques to object detection and semantic segmentation with negligible computation costs. Technically, FNA first designs a seed network by selecting a manually designed network pre-trained on ImageNet, such as [197], and then enlarges it to a super network. By applying the weight remapping technique, the seed network is used to assign the new model parameters. The follow-up work FNA++ [58] extends the weight remapping of FNA to one more task (i.e., human pose estimation) and more network architectures, including ResNet [80] and NAS networks with diverse widths, depths, and kernel sizes.

3.5.3 Architecture Remapping.

Architecture remapping refers to transferring knowledge about the network architecture from a pre-existing model. To the best of our knowledge, this line of work is mainly used in weight-sharing neural architecture search (NAS). Formally, \(\mathcal {F}\) denotes an architecture and \(\boldsymbol {\omega }\) denotes the weights of \(\mathcal {F}\). The goal of NAS is to find the optimal architecture \(\mathcal {F}^*\) that produces the best performance on the test set:
\begin{align} \mathcal {F}^* = \arg \max _{\mathcal {F}} \text{Eval}(\lbrace \mathcal {F}, \boldsymbol {\omega }\rbrace ;\mathcal {D}_\text{test}). \end{align}
(17)
Specifically, this type of NAS formulates the search space into an over-parameterized super-network, e.g., modeling the search space as multiple repeatable cells [41, 131, 178]. When transferring the searched architecture to downstream tasks, direct architecture transfer, which stacks several searched cells to form a downstream model and then retrains it on the downstream data, is the current mainstream scheme. Canonical examples include DARTS [131] and its variants [34, 40, 77, 118, 251, 256]. Direct architecture transfer has shown impressive results on downstream tasks.
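As a rough illustration of direct architecture transfer, the sketch below re-instantiates a searched genotype (reduced here to an ordered list of operation names, far simpler than a real DARTS cell) and stacks it into a downstream model that would then be retrained on the target data; the operation set, cell structure, and genotype are assumptions made for brevity.

import torch
import torch.nn as nn

OPS = {
    "conv3x3": lambda c: nn.Conv2d(c, c, 3, padding=1),
    "conv5x5": lambda c: nn.Conv2d(c, c, 5, padding=2),
    "skip":    lambda c: nn.Identity(),
}

class Cell(nn.Module):
    """Toy cell: applies the searched operations sequentially (real NAS cells form a DAG)."""
    def __init__(self, genotype, channels):
        super().__init__()
        self.ops = nn.ModuleList(OPS[name](channels) for name in genotype)

    def forward(self, x):
        for op in self.ops:
            x = torch.relu(op(x))
        return x

def build_downstream_model(genotype, num_cells=4, channels=64, num_classes=10):
    stem = nn.Conv2d(3, channels, 3, padding=1)
    cells = nn.Sequential(*[Cell(genotype, channels) for _ in range(num_cells)])
    head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, num_classes))
    return nn.Sequential(stem, cells, head)

searched_genotype = ["conv3x3", "skip", "conv5x5"]      # assumed output of a NAS run
model = build_downstream_model(searched_genotype)       # retrained on the downstream data
out = model(torch.randn(2, 3, 32, 32))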

3.5.4 Discussion.

Different from traditional transfer learning approaches, remapping tuning focuses on training a new downstream model isolated from the pre-existing model. Thus, remapping tuning methods have their own distinct advantages. In this line of work, knowledge distillation involves training a smaller student model to mimic the output or intermediate features of a teacher model. This method is advantageous for efficient model compression as well as flexible student architecture designs. Weight remapping directly transfers model weights from a teacher network to a student network. This approach is beneficial for speed and efficiency, as transferring weights can be faster than retraining a new model from scratch. Architecture remapping focuses on transferring network architecture knowledge, often used in weight-sharing neural architecture search. This method enables the transferability of architectures discovered in one task to other tasks, accelerating the development of new models for various applications.
While remapping tuning offers many advantages, there are also some limitations for different approaches. For instance, knowledge distillation may lead to a loss of information from the teacher model. It can also be sensitive to hyperparameters, such as temperature and weighting of different losses. Weight remapping is a simple and effective solution without manually adding constraints. However, this type of work typically struggles to obtain a lightweight student network compared with knowledge distillation. Architecture remapping faces challenges in designing an effective search space for NAS, which can be complex and time-consuming. Additionally, NAS methods often require significant computational resources to explore and evaluate a large number of candidate architectures, increasing the overall computational cost.
Their advantages and challenges also provide valuable insights for future advancements. For knowledge distillation, beyond learning from the pre-trained model’s output, its high flexibility suggests the potential to incorporate grounded information from the downstream tasks for distillation. For example, combining the pre-trained model’s output with a physics simulation could help guide the knowledge distillation process, resulting in an accurate and efficient student model for the downstream tasks [127]. Regarding weight remapping, a potential improvement is to combine with knowledge distillation to reduce the model size. As for architecture remapping, exploring stable and reusable modules from multiple pre-existing models could largely reduce the complexity of search space design, e.g., earlier layers of CNN are often reused for extracting lower-level visual features. This would help to flexibly and efficiently integrate various domain-specific models, based on the semantic associations among different domains.

4 Visual Tuning Future

To date, visual intelligence follows a transfer learning paradigm of pre-training and tuning, showing promising performance on numerous benchmarks. Vision contributes a large portion of the knowledge acquisition of human intelligence. However, due to the high dimensionality of vision data, machine vision suffers from a relatively small data scale compared with NLP and remains far behind general human vision. The future promise of intelligent vision will expand beyond competitive benchmark datasets, realizing transformative impacts on more domains via a multidisciplinary coevolutionary process. On one hand, we expect future pre-training techniques to play the role of knowledge acquisition and storage in a “collection-labeling-training-feedback” cycle system. On the other hand, future tuning concerns how to make use of the learned knowledge through more diversified interactions beyond the prompts of conversational systems. Along the way toward further understanding the mechanisms of deep neural network models and even the human brain, we discuss future works on vision from the perspectives of pre-training and tuning techniques.

4.1 Advanced Pre-training

Previous works use supervised or self-supervised methods to guide models to learn representations of our visual and visual-text world. The supervised pre-training method is a mainstream practice of the traditional transfer learning paradigm [216, 287, 301], while self-supervised pre-training scales pre-trained models up to foundation models (introduced in Section 2.4). Although encouraging progress has been made, as data continues to accumulate, we expect that future pre-training techniques will be able to constantly scale up the model size and improve the capabilities of foundation models. Here, we discuss the future directions of model pre-training from three perspectives: data, models, and optimization.

4.1.1 Data.

Quality data are the nourishment of foundation models. To realize the promise of future foundation models, they are expected to acquire more fundamental knowledge from open-world multi-modal data with the following characteristics:
Increasing scales: Concerning the data volume, large vision models that learn knowledge from large-scale datasets are empirically proven effective for adapting to downstream tasks via tuning techniques. However, compared with human vision, existing large-scale vision datasets remain far from the amount of data that humans learn from. In contrast, the situation in NLP is different, as large language models can be regarded as having a wide knowledge of the Internet, making them more capable on some NLP tasks (e.g., ChatGPT). To scale up the data volume, multi-modal data (e.g., image, video, audio, and text), multi-source data (e.g., Internet, generative models such as NeRF [161] and Diffusion [190, 260]), and multi-sensor data (e.g., different types of cameras, biomarkers, and ambient sensors) can be considered for training large models.
High quality: Before arbitrarily collecting large-scale data, determining what data to collect and how much is needed are essential concerns. Newly collected data can be redundant or noisy, respectively leading to limited or even negative effects on the model. Chen et al. [33] introduced the diversity rule at the level of feature representation. However, there remains a lack of investigation into quality at the data level for existing benchmark datasets, giving rise to research on topics such as out-of-distribution generalization and tolerance to noise (see Section 2.1.3). Further investigating measurable factors of data quality (e.g., the 16 dimensions summarized in [56]) and their corresponding consequences for large models can have a large impact on the machine intelligence community. Such findings will guide evidence-oriented data collection and effectively reduce the expensive labeling cost.
Security and privacy are always the priority throughout the life cycle of the data, especially for domains such as healthcare and finance when interacting with large models on the cloud [24]. Issues around cloud computing can be grouped into four aspects: (1) users’ control over the data, (2) authorized replication, (3) legal requirements, and (4) cloud subcontractors’ processing [211]. Protective actions can be taken at the data level to prevent attacks such as re-identification, dataset reconstruction, and tracing [101].

4.1.2 Models.

Given multimodal, multi-source, and multi-sensor data, large pre-trained vision models are expected to continuously accumulate knowledge from new data through an interpretable and secure mechanism.
Theoretical support: Training models with theoretical support from statistical and biological perspectives can make them more interpretable, explainable, and improvable. In the regime of large models, a number of recent works are motivated by theoretical definitions from the statistical perspective [138, 223, 263, 286], in which generalization bounds are used to guarantee efficient knowledge transfer. Beyond the statistical aspect, biological and neuroscience discoveries also benefit the development of deep neural networks and can provide more insights and inspire new ideas for future large vision models. However, the recent works [65, 114] discussed in Section 2.1.1 mainly lag behind one another, as they are intended to explain the empirical realities observed in each other’s domains rather than truly inspiring new ideas. Basic neural network connections are inspired by how brain neurons work, but it is not yet known exactly how the human brain learns new knowledge. As such, it is also not clear whether the knowledge acquired by existing large models via back-propagation can be effective. On one hand, we humans are sentient beings and acquire knowledge via multiple sensations: vision, sound, haptics, taste, and so on. On the other hand, the human brain is remarkably efficient, activating just a small portion of neurons to complete a task, whereas existing foundation models are not. So far, the intelligence we have built differs from the most intelligent and efficient machine in the world (i.e., the human brain). Understanding the brain can be the next turning point (i.e., toward artificial general intelligence), which also brings serious ethical issues.
Continuously updating: As introduced in Section 2.2, a model can be characterized by its domains and tasks with their feature space and label space. We expect that foundation models will scale up not only in parameters but also in domains and tasks. A single domain can have multiple tasks: models such as Gato [188] and Flamingo [1] are pre-trained with multiple tasks, where the former covers vision tasks while the latter even covers both NLP and vision tasks. Within a single task, a classification task needs to handle novel unseen classes, which is addressed by the continual learning paradigm (also known as lifelong learning). In contrast to batch learning, where all training data is available at once, continual learning represents a family of methods that accumulate knowledge and learn continuously with data arriving in sequential order [182]. Future, stronger vision and visual-language models will bring a more profound impact on other domains via the multidisciplinary coevolutionary process.
Security: Aside from privacy issues at the data level, large foundation models (i.e., at the model level) can also be vulnerable to attacks. Foundation models allow users to easily plug and unplug via APIs, which raises security and privacy concerns such as adversarial attacks and model inversion. Kaissis et al. [101] introduced the advantages of federated learning and provided an outlook for future work. Although federated learning can mitigate data-level privacy issues, it can be vulnerable to adversarial attacks [226]. Around privacy-preserving AI, adversarial attacks will attract more research in the near future.

4.1.3 Optimization.

Current foundation models are generally optimized with back-propagation and reinforcement learning from human feedback. Optimization itself depends on hardware devices, hyperparameter configurations, and algorithms, as follows:
Hardware: Recent large models are trained with GPUs, which is unaffordable for most researchers or small companies. Fortunately, the newly released NVIDIA Hopper H100 GPU [37] supports the FP8 format for accelerating compute-intensive Transformer models (around 9 times faster than the previous A100 GPU for training), bringing trillion-parameter models within reach of more researchers. Moreover, the inference of H100 can be up to 30 times faster than that of A100, making tuning a promising direction.
Hyperparameter configuration: In machine learning, hyperparameters such as initial learning rate, batch size, and task-specific parameters often considerably impact performance. To avoid the manual process of trial-and-error, hyperparameter optimization is a sub-field of automated machine learning, which aims at identifying a well-performing combination of hyperparameters. Simple techniques are grid or random search. Recent advances in hyperparameter optimization are evolution strategies, Bayesian optimization, Hyperband, and so on [9].
Algorithm: The combination of back-propagation and stochastic gradient descent remains the mainstream algorithm for optimizing foundation models toward statistical goals (e.g., the probability that a picture is identified as a cat). Meanwhile, reinforcement learning from human feedback brings in raw human opinions, a form of human-machine interaction that aligns pre-trained large models with more specific human-desired tasks.

4.2 Tuning Techniques

As introduced in Section 3, recent developments in visual tuning techniques can be regarded as originating from prompt tuning in the NLP domain and working toward the PETL direction. A number of adapter methods were then proposed, showing better performance than visual prompt methods but lacking interpretability. (It is expected that visual tuning techniques will be implemented on more existing benchmarks, their reformed versions, and emerging brand-new benchmarks, which will not be listed in this survey; readers are recommended to refer to the benchmarks of their target domains.) The bias-tuning and LoRA methods further reduced the number of parameters, leading to direct parameter tuning methods via addition or decomposition. More recent works are grouped as remapping tuning, among which NAS-based methods [74, 249] show an even more aggressive PETL manner. These techniques provide exciting research foundations for developing future prompts, leading to better use of the language and visual knowledge stored in large models via guidance and interaction, respectively. We discuss three core progressive interaction aspects, in which researchers will witness explosive development, as follows: interpretable prompts, conversational guidance, and diversified interactions.

4.2.1 Interpretable Prompt.

Prompt engineering will move from intuitive design toward more understandable and interpretable directions. Existing text or visual prompts are more like implicit guidance at a high level, describing what the downstream visual task is. As introduced in Section 3.2, many works attempted to learn prompts to facilitate visual downstream tasks. Despite some progress, they suffer from poor interpretability, i.e., it remains difficult to understand what prompts the network has learned. For example, some works (e.g., VPT) learn unordered token-based prompts, which cannot be visualized as an understandable prompt. Chen et al. [35] attempted to learn understandable prompts. Other tuning techniques, such as adapter-based, parameter-based, and remapping ones, also face the interpretability issue, as they intrinsically aim at reducing the number of tuned parameters for adapting downstream tasks to the large model. Hence, future research should answer questions such as: what are good text and vision prompts, and how should they be evaluated throughout the learning pipeline (from the input side to the output side); what is the relationship between vision and text prompts, and in what situations can visual and text prompts be mutually replaced; and how can explicit, consistent, and logical prompts be designed so that a large model adapts efficiently?

4.2.2 Conversational Guidance.

We observe that the development of visual tuning will lead to new jobs such as prompt engineers, who have expertise in providing guidance to large-scale visual-language models such as Sora [144]. Multi-round conversational systems can provide a natural platform that guides models to adapt toward desired task goals [244]. It is generally expected that vision models will homogenize with language models [1, 93, 183, 206, 207, 244, 266]. However, due to the fact that “a picture is worth a thousand words”, the development of visual tuning lags somewhat behind the success of large language models (detailed in Section 2.1.2). Specifically, concerning data complexity and scenario diversity, industrial applications in the vision domain (beyond common application scenarios such as autonomous vehicles, transportation recommendation [142], weather prediction [163], protein design [229], etc.) demand heavy customization based on specific task requirements. Given a tumor detection task, a prompt engineer will select or design good segmentation samples in multiple rounds of conversation with large models to improve some core steps of the task by referring to various agents [51] or tools [180], and eventually achieve acceptable results for production.

4.2.3 Diversified Interactions.

In addition to the interaction in a conversational system with text and visual prompts, interactions in vision can be more diversified. Humans can gradually build up evaluation standards themselves and then practice (i.e., learn or train, for a model) toward higher standards. Existing self-learning models have not set up mechanisms with progressively improving goals. We expect that, in the long run, universal AI or strong AI in a specific domain will evolve in the form of prompts, guidance, and diversified interactions. In recent works [238, 289], a segmentation sample is also used as a prompt to tell the model what task will be performed. Currently, interactions in image synthesis use text prompts and sketch images [190], enabling everyone to become a visual content creator. These visual interactions can be regarded as a kind of visual prompt based on the image. Images represent only a limited part of visual interaction scenarios, which can be regarded as tasks with static viewpoints, but they provide basic conditions for richer visual interaction. Current image-based interactions via prompts are also known as in-context learning, which aims to mimic the efficient visual understanding of the human brain and intrinsically narrows the search space of foundation models [238]. Beyond simply navigating large models to downstream tasks, there are more diversified interaction scenarios that provide plentiful egocentric visual interactions, such as robots, drones, and bionic robot dogs [50, 228]. These video data provide interactive ecological environments that enable the development of human vision mentioned in Section 2.1.1. Although tuning foundation models pre-trained via self-supervised learning indicates a promising future direction, future visual interactions will rely on advanced pre-training techniques (knowledge accumulation techniques) beyond currently tested ones based on generating masked pixels or contrastive learning. Promising long-term directions that enable diversified interactions involve emerging technologies such as brain-computer interfaces, quantum computing, event cameras, and so on. This will lead to new generalization capabilities on top of the future “collection-labeling-training-feedback” cycle system.

5 Conclusion

This survey summarized visual tuning techniques, particularly focusing on the recent state of visual tuning in the coming regime of large models. Starting from fine-tuning, the existing states of prompt tuning, adapter tuning, parameter tuning, and remapping tuning are systematically investigated and compared based on a comprehensive understanding of their technical details. Based on the expected emerging large models, future visual tuning directions are discussed from the perspectives of prompt, guidance, interaction, and optimization. We hope this first survey on the latest state of visual tuning will offer a new perspective to researchers in the era of large models, facilitating their research through a better understanding of the current state and a clearer grasp of the core future research challenges.

Acknowledgments

The authors would like to thank the editors and anonymous reviewers who helped improve this manuscript.

References

[1]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. 2022. Flamingo: A visual language model for few-shot learning. Proc. NeurIPS 35 (2022), 23716–23736.
[2]
Saad Albawi, Tareq Abed Mohammed, and Saad Al-Zawi. 2017. Understanding of a convolutional neural network. In Proceedings of the ICET. IEEE, 1–6.
[3]
Reinald Kim Amplayo, Kang Min Yoo, and Sang-Woo Lee. 2022. Attribute injection for pretrained language models: A new benchmark and an efficient method. In Proceedings of the COLING. 1051–1064.
[4]
Simone Angarano, Francesco Salvetti, Mauro Martini, and Marcello Chiaberge. 2023. Generative adversarial super-resolution at the edge with knowledge distillation. Engineering Applications of Artificial Intelligence 123 (2023), 106407.
[5]
Anubhav Ashok, Nicholas Rhinehart, Fares Beainy, and Kris M. Kitani. 2018. N2n learning: Network to network compression via policy gradient reinforcement learning. In Proceedings of the ICLR.
[6]
Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep?. In Proceedings of the NeurIPS.
[7]
Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Trans. Patt. Anal. Mach. Intell. 35, 8 (2013), 1798–1828.
[8]
Hugo Berg, Siobhan Mackenzie Hall, Yash Bhalgat, Wonsuk Yang, Hannah Rose Kirk, Aleksandar Shtedritski, and Max Bain. 2022. A prompt array keeps the bias away: Debiasing vision-language models with adversarial learning. (2022). Preprint at https://arxiv.org/abs/2203.11933
[9]
Bernd Bischl, Martin Binder, Michel Lang, Tobias Pielok, Jakob Richter, Stefan Coors, Janek Thomas, Theresa Ullmann, Marc Becker, Anne-Laure Boulesteix, et al. 2021. Hyperparameter optimization: Foundations, algorithms, best practices, and open challenges. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery (2021), e1484.
[10]
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. (2021). Preprint at https://arxiv.org/abs/2108.07258
[11]
Benjamin Bowman, Alessandro Achille, Luca Zancato, Matthew Trager, Pramuditha Perera, Giovanni Paolini, and Stefano Soatto. 2023. A-la-carte prompt tuning (apt): Combining distinct data via composable prompting. In Proc. CVPR. 14984–14993.
[12]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of the NeurIPS. 1877–1901.
[13]
XB Bruce, Yan Liu, Keith CC Chan, and Chang Wen Chen. 2024. EGCN++: A new fusion strategy for ensemble learning in skeleton-based rehabilitation exercise assessment. IEEE Trans. Patt. Anal. Mach. Intell. (2024), 1–16.
[14]
XB Bruce, Yan Liu, Xiang Zhang, Sheng-hua Zhong, and Keith CC Chan. 2022. Mmnet: A model-based multimodal network for human action recognition in rgb-d videos. IEEE Trans. Patt. Anal. Mach. Intell. (2022).
[15]
Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. 2022. Differentially private bias-term only fine-tuning of foundation models. In Workshop on Trustworthy and Socially Responsible Machine Learning (NeurIPS’22).
[16]
Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the ACM SIGKDD. 535–541.
[17]
Adrian Bulat and Georgios Tzimiropoulos. 2023. Lasp: Text-to-text optimization for language-aware soft prompting of vision & language models. In Proc. CVPR. 23232–23241.
[18]
Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Efficient architecture search by network transformation. In Proceedings of the AAAI.
[19]
Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. 2018. Path-level network transformation for efficient architecture search. In Proceedings of the ICML. PMLR, 678–687.
[20]
Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. 2010. Brief: Binary robust independent elementary features. In Proceedings of the ECCV. Springer, 778–792.
[21]
Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the CVPR. 6299–6308.
[22]
Jianlong Chang, Xinbang Zhang, Yiwen Guo, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. 2019. DATA: Differentiable architecture approximation. In Proceedings of the NeurIPS. 874–884.
[23]
Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. 2021. Cross-layer distillation with semantic calibration. In Proceedings of the AAAI. 7028–7036.
[24]
Deyan Chen and Hong Zhao. 2012. Data security and privacy protection issues in cloud computing. In Proceedings of the 2012 International Conference on Computer Science and Electronics Engineering. IEEE, 647–651.
[25]
Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. 2022. Prompt learning with optimal transport for vision-language models. (2022). Preprint at https://arxiv.org/abs/2210.01253
[26]
Hao Chen, Ran Tao, Han Zhang, Yidong Wang, Wei Ye, Jindong Wang, Guosheng Hu, and Marios Savvides. 2022. Conv-adapter: Exploring parameter efficient transfer learning for ConvNets. (2022). Preprint at https://arxiv.org/abs/2208.07463
[27]
Hanting Chen, Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. 2020. Learning student networks via feature embedding. IEEE Trans. Neur. Netw. Learn. Syst. 32, 1 (2020), 25–35.
[28]
Haoran Chen, Zuxuan Wu, and Yu-Gang Jiang. 2022. Multi-prompt alignment for multi-source unsupervised domain adaptation. In Proc. NeurIPS 36 (2024).
[29]
Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. 2020. Generative pretraining from pixels. In Proceedings of the ICML. PMLR, 1691–1703.
[30]
Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. 2021. Distilling knowledge via knowledge review. In Proceedings of the CVPR. 5008–5017.
[31]
Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. 2022. AdaptFormer: Adapting vision transformers for scalable visual recognition. Proc. NeurIPS 35 (2022), 16664–16678.
[32]
Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. 2016. Net2net: Accelerating learning via knowledge transfer. In Proceedings of the ICLR.
[33]
Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, and Zhangyang Wang. 2022. The principle of diversity: Training stronger vision transformers calls for reducing all levels of redundancy. In Proceedings of the CVPR. 12020–12030.
[34]
Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. 2019. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In Proceedings of the CVPR. 1294–1303.
[35]
Xiang Chen, Ningyu Zhang, Lei Li, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. 2022. Good visual guidance makes a better extractor: Hierarchical visual prefix for multimodal entity and relation extraction. North American Chapter of the Association for Computational Linguistics (2022).
[36]
Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. 2022. Vision transformer adapter for dense predictions. (2022). Preprint at https://arxiv.org/abs/2205.08534
[37]
Jack Choquette. 2023. NVIDIA hopper H100 GPU: Scaling performance. IEEE Micro (2023).
[38]
Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. 2021. Twins: Revisiting the design of spatial attention in vision transformers. In Proceedings of the NeurIPS. 9355–9366.
[39]
Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, and Chunhua Shen. 2023. Conditional positional encodings for vision transformers. In The Eleventh Proc. ICLR. https://openreview.net/forum?id=3KWnuT-R1bh
[40]
Xiangxiang Chu, Xiaoxing Wang, Bo Zhang, Shun Lu, Xiaolin Wei, and Junchi Yan. 2021. Darts-: Robustly stepping out of performance collapse without indicators. In Proceedings of the ICLR.
[41]
Xiangxiang Chu, Bo Zhang, and Ruijun Xu. 2021. Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search. In Proceedings of the CVPR. 12239–12248.
[42]
Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. 2000. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications 21, 4 (2000), 1253–1278.
[43]
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme Ruiz, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd Van Steenkiste, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Collier, Alexey A. Gritsenko, Vighnesh Birodkar, Cristina Nader Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetic, Dustin Tran, Thomas Kipf, Mario Lucic, Xiaohua Zhai, Daniel Keysers, Jeremiah J. Harmsen, and Neil Houlsby. 2023. Scaling vision transformers to 22 billion parameters. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 7480–7512. https://proceedings.mlr.press/v202/dehghani23a.html
[44]
Frederik Michel Dekking, Cornelis Kraaikamp, Hendrik Paul Lopuhaä, and Ludolf Erwin Meester. 2005. A Modern Introduction to Probability and Statistics: Understanding why and how. Vol. 488. Springer.
[45]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 CVPR. IEEE, 248–255.
[46]
Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. 2023. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence (2023), 1–16.
[47]
Bowen Dong, Pan Zhou, Shuicheng Yan, and Wangmeng Zuo. 2022. LPT: Long-tailed prompt tuning for image classification. In The Eleventh Proc. ICLR.
[48]
Runpei Dong, Zekun Qi, Linfeng Zhang, Junbo Zhang, Jianjian Sun, Zheng Ge, Li Yi, and Kaisheng Ma. 2022. Autoencoders as cross-modal teachers: Can pretrained 2D image transformers help 3D representation learning? In The Eleventh Proc. ICLR.
[49]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the ICLR.
[50]
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. 2023. PaLM-e: An embodied multimodal language model. In International Conference on Machine Learning. PMLR, 8469–8488.
[51]
Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, et al. 2024. Agent ai: Surveying the horizons of multimodal interaction. arXiv preprint arXiv:2401.03568 (2024).
[52]
Ali Edalati, Marzieh Tahaei, Ivan Kobyzev, Vahid Partovi Nia, James J. Clark, and Mehdi Rezagholizadeh. 2022. KronA: Parameter efficient tuning with kronecker adapter. In The Third Workshop on Efficient Natural Language and Speech Processing (ENLSP-III).
[53]
Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. 2022. MAGMA– multimodal augmentation of generative models through adapter-based finetuning. In Findings of the Association for Computational Linguistics: (EMNLP’22). 2416–2428.
[54]
Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. 2018. Simple and efficient architecture search for convolutional neural networks. In Proceedings of the ICLR, Workshop Track.
[55]
Beyza Ermis, Giovanni Zappella, Martin Wistuba, Aditya Rawal, and Cédric Archambeau. 2022. Continual learning with transformers for image classification. In Proceedings of the CVPR Workshops. 3774–3781.
[56]
Philip Evans. 2006. Scaling and assessment of data quality. Acta Crystallographica Section D: Biological Crystallography 62, 1 (2006), 72–82.
[57]
Jiemin Fang, Yuzhu Sun, Kangjian Peng, Qian Zhang, Yuan Li, Wenyu Liu, and Xinggang Wang. 2020. Fast neural network adaptation via parameter remapping and architecture search. In Proceedings of the ICLR.
[58]
Jiemin Fang, Yuzhu Sun, Qian Zhang, Kangjian Peng, Yuan Li, Wenyu Liu, and Xinggang Wang. 2020. FNA++: Fast network adaptation via parameter remapping and architecture search. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 9 (2020), 2990–3004.
[59]
Christoph Feichtenhofer. 2020. X3d: Expanding architectures for efficient video recognition. In Proc. CVPR. 203–213.
[60]
Chin-Lun Fu, Zih-Ching Chen, Yun-Ru Lee, and Hung-yi Lee. 2022. AdapterBias: Parameter-efficient token-dependent representation shift for adapters in NLP tasks. In Findings of the Association for Computational Linguistics: (NAACL’22). 2608–2621.
[61]
Yulu Gan, Yan Bai, Yihang Lou, Xianzheng Ma, Renrui Zhang, Nian Shi, and Lin Luo. 2023. Decorate the newcomers: Visual domain prompt for continual test time adaptation. In Proc. AAAI, Vol. 37, 7595–7603.
[62]
Kaifeng Gao, Long Chen, Hanwang Zhang, Jun Xiao, and Qianru Sun. 2022. Compositional prompt tuning with motion cues for Open-vocabulary video relation detection. In The Eleventh Proc. ICLR.
[63]
Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. 2024. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision 132, 2 (2024), 581–595.
[64]
Yunhe Gao, Xingjian Shi, Yi Zhu, Hao Wang, Zhiqiang Tang, Xiong Zhou, Mu Li, and Dimitris N. Metaxas. 2022. Visual Prompt Tuning for Test-time Domain Adaptation. (2022). Preprint at https://arxiv.org/abs/2210.04831
[65]
James J. Gibson. 2014. The Ecological Approach to Visual Perception: Classic Edition. Psychology press.
[66]
Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. 2014. Compressing deep convolutional networks using vector quantization. (2014). Preprint at https://arxiv.org/abs/1412.6115
[67]
Eleonora Grassucci, Aston Zhang, and Danilo Comminiello. 2022. PHNNs: Lightweight neural networks via parameterized hypercomplex convolutions. IEEE Trans. Neur. Netw. Learn. Syst. (2022).
[68]
Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. 2021. Open-vocabulary object detection via vision and language knowledge distillation. In Proc. ICLR.
[69]
Yushuo Guan, Pengyu Zhao, Bingxuan Wang, Yuanxing Zhang, Cong Yao, Kaigui Bian, and Jian Tang. 2020. Differentiable feature aggregation search for knowledge distillation. In Proceedings of the ECCV 16. Springer, 469–484.
[70]
Jianyuan Guo, Kai Han, Yunhe Wang, Han Wu, Xinghao Chen, Chunjing Xu, and Chang Xu. 2021. Distilling object detectors via decoupled features. In Proceedings of the CVPR. 2154–2164.
[71]
Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, and Chang Xu. 2022. Cmt: Convolutional neural networks meet vision transformers. In Proceedings of the CVPR. 12175–12185.
[72]
Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. 2022. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[73]
Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. 2021. Transformer in transformer. In Proceedings of the NeurIPS. 15908–15919.
[74]
Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. 2021. Dynamic neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[75]
Tianxiang Hao, Hui Chen, Yuchen Guo, and Guiguang Ding. 2023. Consolidator: Mergable adapter with group connections for visual adaptation. In Proceedings of the ICLR. https://openreview.net/forum?id=J_Cja7cpgW
[76]
Zhiwei Hao, Jianyuan Guo, Ding Jia, Kai Han, Yehui Tang, Chao Zhang, Han Hu, and Yunhe Wang. 2022. Learning efficient vision transformers via fine-grained manifold distillation. In Proceedings of the NeurIPS.
[77]
Chaoyang He, Haishan Ye, Li Shen, and Tong Zhang. 2020. Milenas: Efficient neural architecture search via mixed-level reformulation. In Proceedings of the CVPR. 11993–12002.
[78]
Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2022. Towards a unified view of parameter-efficient transfer learning. In Proceedings of the ICLR. Retrieved from https://openreview.net/forum?id=0RDcd5Axok
[79]
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the CVPR. 16000–16009.
[80]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the CVPR. 770–778.
[81]
Xuehai He, Chuanyuan Li, Pengchuan Zhang, Jianwei Yang, and Xin Eric Wang. 2023. Parameter-efficient model adaptation for vision transformers. In (AAAI’23/IAAI’23/EAAI’23). AAAI Press, Article 91, 9 pages.
[82]
Xuehai He, Diji Yang, Weixi Feng, Tsu-Jui Fu, Arjun Akula, Varun Jampani, Pradyumna Narayana, Sugato Basu, William Yang Wang, and Xin Eric Wang. 2022. CPL: Counterfactual prompt learning for vision and language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 3407–3418.
[83]
Roei Herzig, Ofir Abramovich, Elad Ben-Avraham, Assaf Arbelle, Leonid Karlinsky, Ariel Shamir, Trevor Darrell, and Amir Globerson. 2022. PromptonomyViT: Multi-task prompt learning improves video transformers using synthetic scene data. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 6803–6815.
[84]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. (2015). Preprint at https://arxiv.org/abs/1503.02531
[85]
Zejiang Hou, Julian Salazar, and George Polovets. 2022. Meta-learning the difference: Preparing large language models for efficient adaptation. Transactions of the Association for Computational Linguistics 10 (2022), 1249–1265.
[86]
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the ICML. PMLR, 2790–2799.
[87]
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In Proceedings of the ICLR. Retrieved from https://openreview.net/forum?id=nZeVKeeFYf9
[88]
Junjie Hu, Chenyou Fan, Hualie Jiang, Xiyue Guo, Yuan Gao, Xiangyong Lu, and Tin Lun Lam. 2023. Boosting LightWeight depth estimation via knowledge distillation. In Knowledge Science, Engineering and Management, Zhi Jin, Yuncheng Jiang, Robert Andrei Buchmann, Yaxin Bi, Ana-Maria Ghiran, and Wenjun Ma (Eds.). Springer Nature Switzerland, Cham, 27–39.
[89]
Shishuai Hu, Zehui Liao, and Yong Xia. 2022. ProSFDA: Prompt learning based source-free domain adaptation for medical image segmentation. (2022). Preprint at https://arxiv.org/abs/2211.11514
[90]
Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, and Bin Fu. 2021. Shuffle transformer: Rethinking spatial shuffle for vision transformer. (2021). Preprint at https://arxiv.org/abs/2106.03650
[91]
David H. Hubel and Torsten N. Wiesel. 1959. Receptive fields of single neurones in the cat’s striate cortex. The Journal of Physiology 148, 3 (1959), 574.
[92]
Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. 2013. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2013), 1325–1339.
[93]
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the ICML. PMLR, 4904–4916.
[94]
Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. 2022. Visual prompt tuning. In Proceedings of the ECCV. Springer, 709–727.
[95]
Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Jiwen Lu, Jie Zhou, Shiji Song, and Gao Huang. 2022. Cross-modal adapter for text-video retrieval. (2022). Preprint at https://arxiv.org/abs/2211.09623
[96]
Ziyu Jiang, Tianlong Chen, Xuxi Chen, Yu Cheng, Luowei Zhou, Lu Yuan, Ahmed Awadallah, and Zhangyang Wang. 2022. DnA: Improving few-shot transfer learning with low-rank decomposition and alignment. In Proceedings of the ECCV. Springer, 239–256.
[97]
Ziyu Jiang, Xuxi Chen, Xueqin Huang, Xianzhi Du, Denny Zhou, and Zhangyang Wang. 2022. Back razor: Memory-efficient transfer learning by self-sparsified backpropagation. In Proceedings of the NeurIPS.
[98]
Shibo Jie and Zhi-Hong Deng. 2022. Convolutional bypasses are better vision transformer adapters. (2022). Preprint at https://arxiv.org/abs/2207.07039
[99]
Shibo Jie and Zhi-Hong Deng. 2023. FacT: Factor-tuning for lightweight adaptation on vision transformer. In Proceedings of the AAAI, Vol. 37. 1060–1068.
[100]
Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. 2022. Prompting visual-language models for efficient video understanding. In Proceedings of the ECCV. Springer, 105–124.
[101]
Georgios A. Kaissis, Marcus R. Makowski, Daniel Rückert, and Rickmer F. Braren. 2020. Secure, privacy-preserving and federated machine learning in medical imaging. Nature Machine Intelligence 2, 6 (2020), 305–311.
[102]
Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. 2021. Compacter: Efficient low-rank hypercomplex adapter layers. In Proceedings of the NeurIPS. 1022–1035.
[103]
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The kinetics human action video dataset. (2017). Preprint at https://arxiv.org/abs/1705.06950
[104]
Yan Ke and Rahul Sukthankar. 2004. PCA-SIFT: A more distinctive representation for local image descriptors. In Proceedings of the CVPR. IEEE, II–II.
[105]
Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2022. Transformers in vision: A survey. ACM Computing Surveys 54, 10s (2022), 1–41.
[106]
Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. 2023. MaPLe: Multi-modal prompt learning. In Proceedings of the CVPR. 19113–19122.
[107]
Jangho Kim, SeongUk Park, and Nojun Kwak. 2018. Paraphrasing complex network: Network compression via factor transfer. In Proceedings of the NeurIPS.
[108]
Konwoo Kim, Michael Laskin, Igor Mordatch, and Deepak Pathak. 2021. How to adapt your large-scale vision-and-language model. (2021). Retrieved 14 Feb 2023 from https://openreview.net/forum?id=EhwEUb2ynIa
[109]
Kwanyoung Kim, Yujin Oh, and Jong Chul Ye. 2023. ZegOT: Zero-shot segmentation through optimal transport of text prompts. (2023). Preprint at https://arxiv.org/abs/2301.12171
[110]
Minsu Kim, Hyung-Il Kim, and Yong Man Ro. 2023. Prompt tuning of deep neural networks for speaker-adaptive visual speech recognition. (2023). Preprint at https://arxiv.org/abs/2302.08102
[111]
Parth Kothari, Danya Li, Yuejiang Liu, and Alexandre Alahi. 2023. Motion style transfer: Modular low-rank adaptation for deep motion forecasting. In Proceedings of The 6th Conference on Robot Learning (Proceedings of Machine Learning Research, Vol. 205), Karen Liu, Dana Kulic, and Jeff Ichnowski (Eds.). PMLR, 774–784. https://proceedings.mlr.press/v205/kothari23a.html
[112]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (2017), 84–90.
[113]
Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. 2021. Fine-tuning can distort pretrained features and underperform out-of-distribution. In Proceedings of the ICLR.
[114]
Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science 350, 6266 (2015), 1332–1338.
[115]
Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the CVPR. 4013–4021.
[116]
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[117]
Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, and Jianfeng Gao. 2022. Efficient self-supervised vision transformers for representation learning. In Proceedings of the ICLR.
[118]
Guohao Li, Guocheng Qian, Itzel C. Delgadillo, Matthias Muller, Ali Thabet, and Bernard Ghanem. 2020. SGAS: Sequential greedy architecture search. In Proceedings of the CVPR. 1620–1630.
[119]
Quanquan Li, Shengying Jin, and Junjie Yan. 2017. Mimicking very efficient network for object detection. In Proceedings of the CVPR. 6356–6364.
[120]
Tianjiao Li, Qiuhong Ke, Hossein Rahmani, Rui En Ho, Henghui Ding, and Jun Liu. 2021. Else-net: Elastic semantic network for continual action recognition from skeleton data. In Proceedings of the CVPR. 13434–13443.
[121]
Tao Li, Lei Tan, Zhehao Huang, Qinghua Tao, Yipeng Liu, and Xiaolin Huang. 2022. Low dimensional trajectory hypothesis is true: DNNs can be trained in tiny subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[122]
Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. 2022. Exploring plain vision transformer backbones for object detection. In Proceedings of the ECCV. Springer, 280–296.
[123]
Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaiming He, and Ross Girshick. 2021. Benchmarking detection transfer learning with vision transformers. (2021). Preprint at https://arxiv.org/abs/2111.11429
[124]
Zewen Li, Fan Liu, Wenjie Yang, Shouheng Peng, and Jun Zhou. 2021. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Transactions on Neural Networks and Learning Systems (2021).
[125]
Dongze Lian, Zhou Daquan, Jiashi Feng, and Xinchao Wang. 2022. Scaling and shifting your features: A new baseline for efficient model tuning. In Proceedings of the NeurIPS. Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.), Retrieved from https://openreview.net/forum?id=XtyeppctGgc
[126]
Hanwen Liang, Shifeng Zhang, Jiacheng Sun, Xingqiu He, Weiran Huang, Kechen Zhuang, and Zhenguo Li. 2019. DARTS+: Improved differentiable architecture search with early stopping. (2019). Preprint at https://arxiv.org/abs/1909.06035
[127]
Junfan Lin, Jianlong Chang, Lingbo Liu, Guanbin Li, Liang Lin, Qi Tian, and Chang-wen Chen. 2023. Being comes from not-being: Open-vocabulary text-to-motion generation with wordless training. In Proceedings of the CVPR.
[128]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the ECCV. Springer, 740–755.
[129]
Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, and Gedas Bertasius. 2023. Vision transformers are parameter-efficient audio-visual learners. In Proceedings of the CVPR. 2299–2309.
[130]
Zhaojiang Lin, Andrea Madotto, and Pascale Fung. 2020. Exploring versatile generative language model via parameter-efficient transfer learning. In Findings of the Association for Computational Linguistics. 441–459.
[131]
Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019. DARTS: Differentiable architecture search. In Proceedings of the ICLR.
[132]
Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C. Kot. 2019. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 10 (2019), 2684–2701.
[133]
Lingbo Liu, Jiaqi Chen, Hefeng Wu, Tianshui Chen, Guanbin Li, and Liang Lin. 2020. Efficient crowd counting via structured knowledge transfer. In Proceedings of the ACM MM. 2645–2654.
[134]
Lingbo Liu, Jiaqi Chen, Hefeng Wu, Guanbin Li, Chenglong Li, and Liang Lin. 2021. Cross-modal collaborative representation learning and a large-scale RGBT benchmark for crowd counting. In Proceedings of the CVPR. 4823–4833.
[135]
Lingbo Liu, Bruce XB Yu, Jianlong Chang, Qi Tian, and Chang-Wen Chen. 2022. Prompt-matched semantic segmentation. (2022). Preprint at https://arxiv.org/abs/2208.10159
[136]
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55, 9, Article 195 (Jan 2023), 35 pages.
[137]
Shiwei Liu and Zhangyang Wang. 2023. Ten lessons we have learned in the new “sparseland”: A short handbook for sparse neural network researchers. In ICLR 2023 Workshop on Sparsity in Neural Networks: On Practical Limitations and Tradeoffs Between Sustainability and Efficiency. ICLR. Spotlight.
[138]
Xiaobin Liu and Shiliang Zhang. 2022. Who is closer: A computational method for domain gap evaluation. Pattern Recognition 122 (2022), 108293.
[139]
Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, and Yunqiang Duan. 2019. Knowledge distillation via instance relationship graph. In Proceedings of the CVPR. 7096–7104.
[140]
Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. 2019. Structured knowledge distillation for semantic segmentation. In Proceedings of the CVPR. 2604–2613.
[141]
Yu Liu, Xuhui Jia, Mingxing Tan, Raviteja Vemulapalli, Yukun Zhu, Bradley Green, and Xiaogang Wang. 2020. Search to distill: Pearls are everywhere but not the eyes. In Proceedings of the CVPR. 7539–7548.
[142]
Yang Liu, Cheng Lyu, Zhiyuan Liu, and Jinde Cao. 2021. Exploring a large-scale multi-modal transportation recommendation system. Transportation Research Part C: Emerging Technologies 126 (2021), 103070.
[143]
Yifan Liu, Changyong Shu, Jingdong Wang, and Chunhua Shen. 2020. Structured knowledge distillation for dense prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
[144]
Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. 2024. Sora: A review on background, technology, limitations, and opportunities of large vision models. (2024). Preprint at https://arxiv.org/abs/2402.17177
[145]
Yen-Cheng Liu, Chih-Yao Ma, Junjiao Tian, Zijian He, and Zsolt Kira. 2022. Polyhistor: Parameter-efficient multi-task adaptation for dense vision tasks. In Proceedings of the NeurIPS. 36889–36901.
[146]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the ICCV. 10012–10022.
[147]
Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video swin transformer. In Proceedings of the CVPR. 3202–3211.
[148]
Jochem Loedeman, Maarten C. Stol, Tengda Han, and Yuki M. Asano. 2022. Prompt generation networks for efficient adaptation of frozen vision transformers. (2022). Preprint at https://arxiv.org/abs/2210.06466
[149]
Haoyu Lu, Mingyu Ding, Yuqi Huo, Guoxing Yang, Zhiwu Lu, Masayoshi Tomizuka, and Wei Zhan. 2024. UniAdapter: Unified parameter-efficient transfer learning for cross-modal modeling. (2024).
[150]
Gen Luo, Minglang Huang, Yiyi Zhou, Xiaoshuai Sun, Guannan Jiang, Zhiyu Wang, and Rongrong Ji. 2023. Towards efficient visual adaption via structural re-parameterization. (2023). Preprint at https://arxiv.org/abs/2302.08106
[151]
Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yan Wang, Liujuan Cao, Yongjian Wu, Feiyue Huang, and Rongrong Ji. 2022. Towards lightweight transformer via group-wise transformation for vision-and-language tasks. IEEE Transactions on Image Processing 31 (2022), 3386–3398.
[152]
Xiao Luo, Haixin Wang, Daqing Wu, Chong Chen, Minghua Deng, Jianqiang Huang, and Xian-Sheng Hua. 2023. A survey on deep hashing methods. ACM Transactions on Knowledge Discovery from Data 17, 1 (2023), 1–50.
[153]
Chengcheng Ma, Yang Liu, Jiankang Deng, Lingxi Xie, Weiming Dong, and Changsheng Xu. 2023. Understanding and mitigating overfitting in prompt tuning for vision-language models. IEEE Transactions on Circuits and Systems for Video Technology (2023).
[154]
Teli Ma, Shijie Geng, Mengmeng Wang, Jing Shao, Jiasen Lu, Hongsheng Li, Peng Gao, and Yu Qiao. 2021. A simple long-tailed recognition baseline via vision-language model. (2021). Preprint at https://arxiv.org/abs/2111.14745
[155]
Zeyu Ma, Yuhang Guo, Xiao Luo, Chong Chen, Minghua Deng, Wei Cheng, and Guangming Lu. 2022. DHWP: Learning high-quality short hash codes via weight pruning. In Proceedings of the ICASSP. IEEE, 4783–4787.
[156]
Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl Vondrick. 2022. Understanding zero-shot adversarial robustness for large-scale models. In Proceedings of the ICLR.
[157]
Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei Han, Scott Yih, and Madian Khabsa. 2022. UniPELT: A unified framework for parameter-efficient language model tuning. In Proceedings of Annual Meeting of the Association for Computational Linguistics. 6253–6264.
[158]
Imad Eddine Marouf, Enzo Tartaglione, and Stéphane Lathuilière. 2023. Tiny adapters for vision transformers. Retrieved 14 Feb 2023 from https://openreview.net/forum?id=V0Vo9eW2nzL
[159]
David Marr. 2010. Vision: A Computational Investigation Into the Human Representation and Processing of Visual Information. MIT Press.
[160]
Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. 2023. Latent-nerf for shape-guided generation of 3d shapes and textures. In Proceedings of the CVPR. 12663–12673.
[161]
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2021. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99–106.
[162]
Mahdi Namazifar, Devamanyu Hazarika, and Dilek Hakkani-Tur. 2023. Role of bias terms in dot-product attention. (2023). Preprint at https://arxiv.org/abs/2302.08626
[163]
Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K. Gupta, and Aditya Grover. 2023. ClimaX: A foundation model for weather and climate. (2023). Preprint at https://arxiv.org/abs/2301.10343
[164]
Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. 2022. Expanding language-image pretrained models for general video recognition. In Proceedings of the ECCV. Springer, 1–18.
[165]
Xing Nie, Bolin Ni, Jianlong Chang, Gaomeng Meng, Chunlei Huo, Zhaoxiang Zhang, Shiming Xiang, Qi Tian, and Chunhong Pan. 2022. Pro-tuning: Unified prompt tuning for vision tasks. arXiv:2207.14381. Retrieved from https://arxiv.org/abs/2207.14381
[166]
Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. 2022. ST-Adapter: Parameter-efficient image-to-video transfer learning. In Proceedings of the NeurIPS.
[167]
Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. 2022. ST-Adapter: Parameter-efficient image-to-video transfer learning. In Proceedings of the NeurIPS. 26462–26477.
[168]
Omiros Pantazis, Gabriel Brostow, Katherine Jones, and Oisin Mac Aodha. 2022. SVL-Adapter: Self-supervised adapter for vision-language pretrained models. In Proceedings of The 33rd British Machine Vision Conference. The British Machine Vision Association (BMVA).
[169]
Pinelopi Papalampidi and Mirella Lapata. 2022. Hierarchical3D adapters for long video-to-text summarization. arXiv:2210.04829. Retrieved from https://arxiv.org/abs/2210.04829
[170]
Jihye Park, Sunwoo Kim, Soohyun Kim, Seokju Cho, Jaejun Yoo, Youngjung Uh, and Seungryong Kim. 2023. LANIT: Language-driven image-to-image translation for unlabeled data. In Proceedings of the CVPR. 23401–23411.
[171]
Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. 2019. Relational knowledge distillation. In Proceedings of the CVPR. 3967–3976.
[172]
Nikolaos Passalis, Maria Tzelepi, and Anastasios Tefas. 2020. Heterogeneous knowledge distillation using information flow modeling. In Proceedings of the CVPR. 2339–2348.
[173]
William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. In Proceedings of the ICCV. 4195–4205.
[174]
Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. 2019. Correlation congruence for knowledge distillation. In Proceedings of the CVPR. 5007–5016.
[175]
Fang Peng, Xiaoshan Yang, and Changsheng Xu. 2023. SgVA-CLIP: Semantic-guided visual adapting of vision-language models for few-shot image classification. IEEE Transactions on Multimedia (2023).
[176]
Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 46–54.
[177]
Hieu Pham, Zihang Dai, Qizhe Xie, and Quoc V. Le. 2021. Meta pseudo labels. In Proceedings of the CVPR. 11557–11568.
[178]
Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. 2018. Efficient neural architecture search via parameters sharing. In Proceedings of the ICML. PMLR, 4095–4104.
[179]
Angelo Porrello, Luca Bergamini, and Simone Calderara. 2020. Robust re-identification by multiple views knowledge distillation. In Proceedings of the ECCV. Springer, 93–110.
[180]
Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, et al. 2023. Tool learning with foundation models. (2023). Preprint at https://arxiv.org/abs/2304.08354
[181]
Ziran Qin, Mingbao Lin, and Weiyao Lin. 2023. Low-rank Winograd transformation for 3D convolutional neural networks. (2023). Preprint at https://arxiv.org/abs/2301.11180
[182]
Haoxuan Qu, Hossein Rahmani, Li Xu, Bryan Williams, and Jun Liu. 2021. Recent advances of continual learning in computer vision: An overview. (2021). Preprint at https://arxiv.org/abs/2109.11369
[183]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the ICML. PMLR, 8748–8763.
[184]
Jun Rao, Xv Meng, Liang Ding, Shuhan Qi, Xuebo Liu, Min Zhang, and Dacheng Tao. 2023. Parameter-efficient and student-friendly knowledge distillation. IEEE Transactions on Multimedia (2023).
[185]
Yongming Rao, Wenliang Zhao, Jie Zhou, and Jiwen Lu. 2022. AMixer: Adaptive weight mixing for self-attention free vision transformers. In Proceedings of the ECCV. Springer, 50–67.
[186]
Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. In Proceedings of the NeurIPS.
[187]
Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2018. Efficient parametrization of multi-domain deep neural networks. In Proceedings of the CVPR. 8119–8127.
[188]
Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. 2022. A generalist agent. (2022). Preprint at https://arxiv.org/abs/2205.06175
[189]
Félix Remigereau, Djebril Mekhazni, Sajjad Abdoli, Rafael M. O. Cruz, Eric Granger, et al. 2022. Knowledge distillation for multi-target domain adaptation in real-time person re-identification. In Proceedings of the ICIP. IEEE, 3853–3557.
[190]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the CVPR. 10684–10695.
[191]
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2015. FitNets: Hints for thin deep nets. In Proceedings of the ICLR.
[192]
Amir Rosenfeld and John K. Tsotsos. 2018. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 3 (2018), 651–663.
[193]
Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the ICCV. IEEE, 2564–2571.
[194]
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2023. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the CVPR. 22500–22510.
[195]
Stuart J. Russell. 2010. Artificial Intelligence: A Modern Approach. Pearson Education, Inc.
[196]
Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. 2023. Prefix conditioning unifies language and label supervision. In Proceedings of the CVPR. 2861–2870.
[197]
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the CVPR. 4510–4520.
[198]
Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
[199]
Mohit Sharma, Claudio Fantacci, Yuxiang Zhou, Skanda Koppula, Nicolas Heess, Jon Scholz, and Yusuf Aytar. 2023. Lossless adaptation of pretrained vision models for robotic manipulation. In Proceedings of the ICLR. Retrieved from https://openreview.net/forum?id=5IND3TXJRb-
[200]
Sheng Shen, Shijia Yang, Tianjun Zhang, Bohan Zhai, Joseph E. Gonzalez, Kurt Keutzer, and Trevor Darrell. 2024. Multitask vision-language prompt tuning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 5656–5667.
[201]
Yifeng Shi, Feng Lv, Xinliang Wang, Chunlong Xia, Shaojie Li, Shujie Yang, Teng Xi, and Gang Zhang. 2023. Open-transmind: A new baseline and benchmark for 1st foundation model challenge of intelligent transportation. In Proceedings of the CVPR Workshop. 6327–6334.
[202]
Erica K. Shimomoto, Edison Marrese-Taylor, Hiroya Takamura, Ichiro Kobayashi, Hideki Nakayama, and Yusuke Miyao. 2022. Towards parameter-efficient integration of pre-trained language models in temporal video grounding. (2022). Preprint at https://arxiv.org/abs/2209.13359
[203]
Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. 2022. Test-time prompt tuning for zero-shot generalization in vision-language models. arXiv:2209.07511. Retrieved from https://arxiv.org/abs/2209.07511
[204]
Aliaksandra Shysheya, John F. Bronskill, Massimiliano Patacchiola, Sebastian Nowozin, and Richard E. Turner. 2023. FiT: Parameter efficient few-shot transfer learning for personalized and federated image classification. In Proceedings of the ICLR. Retrieved from https://openreview.net/forum?id=9aokcgBVIj1
[205]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Retrieved from https://arxiv.org/abs/1409.1556
[206]
Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. 2022. FLAVA: A foundational language and vision alignment model. In Proceedings of the CVPR. 15638–15650.
[207]
Mannat Singh, Laura Gustafson, Aaron Adcock, Vinicius de Freitas Reis, Bugra Gedik, Raj Prateek Kosaraju, Dhruv Mahajan, Ross Girshick, Piotr Dollár, and Laurens van der Maaten. 2022. Revisiting weakly supervised pre-training of visual perception models. In Proceedings of the CVPR. 804–814.
[208]
Kihyuk Sohn, Yuan Hao, José Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, and Lu Jiang. 2022. Visual prompt tuning for generative transfer learning. arXiv:2210.00990. Retrieved from https://arxiv.org/abs/2210.00990
[209]
Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. 2017. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the ICCV. 843–852.
[210]
Ximeng Sun, Ping Hu, and Kate Saenko. 2022. DualCoOp: Fast adaptation to multi-label recognition with limited annotations. arXiv:2206.09541. Retrieved from https://arxiv.org/abs/2206.09541
[211]
Yunchuan Sun, Junsheng Zhang, Yongping Xiong, and Guangyu Zhu. 2014. Data security and privacy in cloud computing. International Journal of Distributed Sensor Networks 10, 7 (2014), 190903.
[212]
Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. 2022. LST: Ladder side-tuning for parameter and memory efficient transfer learning. In Proceedings of the NeurIPS.
[213]
Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. 2022. VL-Adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the CVPR. 5227–5237.
[214]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the CVPR. 1–9.
[215]
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the CVPR. 2818–2826.
[216]
Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. 2018. A survey on deep transfer learning. In Proceedings of the International Conference on Artificial Neural Networks. Springer, 270–279.
[217]
Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the ICML. PMLR, 6105–6114.
[218]
Ming Tao, Bing-Kun Bao, Hao Tang, and Changsheng Xu. 2023. GALIP: Generative adversarial clips for text-to-image synthesis. arXiv:2301.12959. Retrieved from https://arxiv.org/abs/2301.12959
[219]
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2022. Efficient transformers: A survey. ACM Computing Surveys 55, 6 (2022), 1–28.
[220]
Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. 2022. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. arXiv:2203.12602. Retrieved from https://arxiv.org/abs/2203.12602
[221]
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. Training data-efficient image transformers and distillation through attention. In Proceedings of the ICML. PMLR, 10347–10357.
[222]
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the ICCV. 4489–4497.
[223]
Nilesh Tripuraneni, Michael Jordan, and Chi Jin. 2020. On the theory of transfer learning: The importance of task diversity. In Proceedings of the NeurIPS. 7852–7862.
[224]
Koki Tsubota, Hiroaki Akutsu, and Kiyoharu Aizawa. 2023. Universal deep image compression via content-adaptive optimization with adapters. In Proceedings of the WACV. 2529–2538.
[225]
Cheng-Hao Tu, Zheda Mai, and Wei-Lun Chao. 2022. Visual query tuning: Towards effective usage of intermediate representations for parameter and memory efficient transfer learning. arXiv:2212.03220. Retrieved from https://arxiv.org/abs/2212.03220
[226]
Dmitrii Usynin, Alexander Ziller, Marcus Makowski, Rickmer Braren, Daniel Rueckert, Ben Glocker, Georgios Kaissis, and Jonathan Passerat-Palmbach. 2021. Adversarial interference and its mitigations in privacy-preserving collaborative machine learning. Nature Machine Intelligence 3, 9 (2021), 749–758.
[227]
Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. 2022. DyLoRA: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. arXiv:2210.07558. Retrieved from https://arxiv.org/abs/2210.07558
[228]
Sai H. Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. 2024. ChatGPT for robotics: Design principles and model abilities. IEEE Access (2024).
[229]
Robert Verkuil, Ori Kabeli, Yilun Du, Basile I. M. Wicky, Lukas F. Milles, Justas Dauparas, David Baker, Sergey Ovchinnikov, Tom Sercu, and Alexander Rives. 2022. Language models generalize beyond natural proteins. bioRxiv (2022), 2022–12.
[230]
Feng Wang, Manling Li, Xudong Lin, Hairong Lv, Alexander G. Schwing, and Heng Ji. 2022. Learning to decompose visual features with latent textual prompts. arXiv:2210.04287. Retrieved from https://arxiv.org/abs/2210.04287
[231]
Haixin Wang, Jianlong Chang, Xiao Luo, Jinan Sun, Zhouchen Lin, and Qi Tian. 2023. LION: Implicit vision prompt tuning. arXiv:2303.09992. Retrieved from https://arxiv.org/abs/2303.09992
[232]
Haixin Wang, Xinlong Yang, Jianlong Chang, Dian Jin, Jinan Sun, Shikun Zhang, Xiao Luo, and Qi Tian. 2023. Mode approximation makes good vision-language prompts. arXiv:2305.08381. Retrieved from https://arxiv.org/abs/2305.08381
[233]
Haixin Wang, Tianhao Zhang, Muzhi Yu, Jinan Sun, Wei Ye, Chen Wang, and Shikun Zhang. 2020. Stacking networks dynamically for image restoration based on the Plug-and-Play framework. In Proceedings of the ECCV. Springer, 446–462.
[234]
Shijie Wang, Jianlong Chang, Haojie Li, Zhihui Wang, Wanli Ouyang, and Qi Tian. 2023. Open-set fine-grained retrieval via prompting vision-language evaluator. In Proceedings of the CVPR. 19381–19391.
[235]
Shijie Wang, Jianlong Chang, Zhihui Wang, Haojie Li, Wanli Ouyang, and Qi Tian. 2022. Fine-grained retrieval prompt tuning. arXiv:2207.14465. Retrieved from https://arxiv.org/abs/2207.14465
[236]
Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. 2022. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv:2208.10442. Retrieved from https://arxiv.org/abs/2208.10442
[237]
Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. 2021. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the ICCV. 568–578.
[238]
Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. 2022. Images speak in images: A generalist painter for in-context visual learning. arXiv:2212.02499. Retrieved from https://arxiv.org/abs/2212.02499
[239]
Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. 2022. S-prompts learning with pre-trained transformers: An Occam's razor for domain incremental learning. arXiv:2207.12819. Retrieved from https://arxiv.org/abs/2207.12819
[240]
Yiran Wang, Xingyi Li, Min Shi, Ke Xian, and Zhiguo Cao. 2021. Knowledge distillation for fast and accurate monocular depth estimation on mobile devices. In Proceedings of the CVPR. 2457–2465.
[241]
Yukang Wang, Wei Zhou, Tao Jiang, Xiang Bai, and Yongchao Xu. 2020. Intra-class feature variation distillation for semantic segmentation. In Proceedings of the ECCV. Springer, 346–362.
[242]
Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, and Jiwen Lu. 2022. P2P: Tuning pre-trained image models for point cloud analysis with point-to-pixel prompting. arXiv:2208.02812. Retrieved from https://arxiv.org/abs/2208.02812
[243]
Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning structured sparsity in deep neural networks. In Proceedings of the NeurIPS.
[244]
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv:2303.04671. Retrieved from https://arxiv.org/abs/2303.04671
[245]
Chen Henry Wu, Saman Motamed, Shaunak Srivastava, and Fernando De la Torre. 2022. Generative visual prompt: Unifying distributional control of pre-trained generative models. In Proceedings of the NeurIPS.
[246]
Jiarun Wu and Qingliang Chen. 2022. Pruning adapters with lottery ticket. Algorithms 15, 2 (2022), 63.
[247]
Junyang Wu, Xianhang Li, Chen Wei, Huiyu Wang, Alan Yuille, Yuyin Zhou, and Cihang Xie. 2022. Unleashing the power of visual prompting at the pixel level. arXiv:2212.10556. Retrieved from https://arxiv.org/abs/2212.10556
[248]
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. 2023. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the CVPR. 7623–7633.
[249]
Lingxi Xie, Xin Chen, Kaifeng Bi, Longhui Wei, Yuhui Xu, Lanfei Wang, Zhengsu Chen, An Xiao, Jianlong Chang, Xiaopeng Zhang, et al. 2021. Weight-sharing neural architecture search: A battle to shrink the optimization gap. ACM Computing Surveys 54, 9 (2021), 1–37.
[250]
Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. 2018. Rethinking spatiotemporal feature learning: Speed-accuracy tradeoffs in video classification. In Proceedings of the ECCV. 305–321.
[251]
Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. 2019. SNAS: Stochastic neural architecture search. In Proceedings of the ICLR.
[252]
Yinghui Xing, Qirui Wu, De Cheng, Shizhou Zhang, Guoqiang Liang, and Yanning Zhang. 2022. Class-aware visual prompt tuning for vision-language pre-trained model. arXiv:2208.08340. Retrieved from https://arxiv.org/abs/2208.08340
[253]
Chengming Xu, Siqian Yang, Yabiao Wang, Zhanxiong Wang, Yanwei Fu, and Xiangyang Xue. 2023. Exploring efficient few-shot adaptation for vision transformers. arXiv:2301.02419. Retrieved from https://arxiv.org/abs/2301.02419
[254]
Kunran Xu, Lai Rui, Yishi Li, and Lin Gu. 2020. Feature normalized knowledge distillation for image classification. In Proceedings of the ECCV. Springer, 664–680.
[255]
Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. 2023. Side adapter network for open-vocabulary semantic segmentation. arXiv:2302.12242. Retrieved from https://arxiv.org/abs/2302.12242
[256]
Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai Xiong. 2020. PC-DARTS: Partial channel connections for memory-efficient architecture search. In Proceedings of the ICLR.
[257]
Sijie Yan, Yuanjun Xiong, and Dahua Lin. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI.
[258]
Chuanguang Yang, Helong Zhou, Zhulin An, Xue Jiang, Yongjun Xu, and Qian Zhang. 2022. Cross-image relational knowledge distillation for semantic segmentation. In Proceedings of the CVPR. 12319–12328.
[259]
Jing Yang, Brais Martinez, Adrian Bulat, and Georgios Tzimiropoulos. 2020. Knowledge distillation via adaptive instance normalization. arXiv:2003.04289. Retrieved from https://arxiv.org/abs/2003.04289
[260]
Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. 2023. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys 56, 4 (2023), 1–39.
[261]
Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, and Mu Li. 2023. AIM: Adapting image models for efficient video action recognition. In Proceedings of the ICLR. https://openreview.net/forum?id=CIoSZ_HKHS7
[262]
Xingyi Yang, Daquan Zhou, Songhua Liu, Jingwen Ye, and Xinchao Wang. 2022. Deep model reassembly. In Proceedings of the NeurIPS. 25739–25753.
[263]
Haotian Ye, Chuanlong Xie, Tianle Cai, Ruichen Li, Zhenguo Li, and Liwei Wang. 2021. Towards a theoretical framework of out-of-distribution generalization. In Proceedings of the NeurIPS. 23519–23531.
[264]
Bruce X. B. Yu, Jianlong Chang, Lingbo Liu, Qi Tian, and Chang Wen Chen. 2022. Towards a unified view on visual parameter-efficient transfer learning. arXiv:2210.00788. Retrieved from https://arxiv.org/abs/2210.00788
[265]
Bruce X. B. Yu, Zhi Zhang, Yongxu Liu, Sheng-hua Zhong, Yan Liu, and Chang Wen Chen. 2023. GLA-GCN: Global-local adaptive graph convolutional network for 3D human pose estimation from monocular video. In Proceedings of the ICCV. 8818–8829.
[266]
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. 2022. CoCa: Contrastive captioners are image-text foundation models. arXiv:2205.01917. Retrieved from https://arxiv.org/abs/2205.01917
[267]
Shengming Yu, Zhaopeng Dou, and Shengjin Wang. 2023. Prompting and tuning: A two-stage unsupervised domain adaptive person re-identification method on vision transformer backbone. Tsinghua Science and Technology 28, 4 (2023), 799–810.
[268]
Zitong Yu, Rizhao Cai, Yawen Cui, Xin Liu, Yongjian Hu, and Alex Kot. 2023. Rethinking vision transformer and masked autoencoder in multimodal face anti-spoofing. arXiv:2302.05744. Retrieved from https://arxiv.org/abs/2302.05744
[269]
Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, and Wei Wu. 2021. Incorporating convolution designs into visual transformers. In Proceedings of the CVPR. 579–588.
[270]
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. 2021. Florence: A new foundation model for computer vision. arXiv:2111.11432. Retrieved from https://arxiv.org/abs/2111.11432
[271]
Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis E. H. Tay, Jiashi Feng, and Shuicheng Yan. 2021. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the ICCV. 558–567.
[272]
Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, and Shuicheng Yan. 2022. VOLO: Vision outlooker for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[273]
Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, et al. 2022. A roadmap for big model. arXiv:2203.14101. Retrieved from https://arxiv.org/abs/2203.14101
[274]
Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. 2022. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the ACL (Volume 2: Short Papers). 1–9.
[275]
Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. 2022. Unified vision and language prompt learning. arXiv:2210.07225. Retrieved from https://arxiv.org/abs/2210.07225
[276]
Aston Zhang, Yi Tay, Shuai Zhang, Alvin Chan, Anh Tuan Luu, Siu Hui, and Jie Fu. 2021. Beyond fully-connected layers with quaternions: Parameterization of hypercomplex multiplications with 1/n parameters. In Proceedings of the ICLR.
[277]
Bowen Zhang, Xiaojie Jin, Weibo Gong, Kai Xu, Zhao Zhang, Peng Wang, Xiaohui Shen, and Jiashi Feng. 2023. Multimodal video adapter for parameter efficient video text retrieval. arXiv:2301.07868. Retrieved from https://arxiv.org/abs/2301.07868
[278]
Jian-Wei Zhang, Yifan Sun, Yi Yang, and Wei Chen. 2022. Feature-proxy transformer for few-shot segmentation. arXiv:2210.06908. Retrieved from https://arxiv.org/abs/2210.06908
[279]
Linfeng Zhang and Kaisheng Ma. 2021. Improve object detection with feature-based knowledge distillation: Towards accurate and efficient detectors. In Proceedings of the ICLR.
[280]
Renrui Zhang, Hanqiu Deng, Bohao Li, Wei Zhang, Hao Dong, Hongsheng Li, Peng Gao, and Yu Qiao. 2022. Collaboration of pre-trained models makes better few-shot learner. arXiv:2209.12255. Retrieved from https://arxiv.org/abs/2209.12255
[281]
Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. 2021. Tip-Adapter: Training-free CLIP-Adapter for better vision-language modeling. arXiv:2111.03930. Retrieved from https://arxiv.org/abs/2111.03930
[282]
Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. 2022. PointCLIP: Point cloud understanding by CLIP. In Proceedings of the CVPR. 8552–8562.
[283]
Xinbang Zhang, Jianlong Chang, Yiwen Guo, Gaofeng Meng, Shiming Xiang, Zhouchen Lin, and Chunhong Pan. 2021. DATA: Differentiable architecture approximation with distribution guided sampling. IEEE Trans. Pattern Anal. Mach. Intell. 43, 9 (2021), 2905–2920.
[284]
Yiman Zhang, Hanting Chen, Xinghao Chen, Yiping Deng, Chunjing Xu, and Yunhe Wang. 2021. Data-free knowledge distillation for image super-resolution. In Proceedings of the CVPR. 7852–7861.
[285]
Yue Zhang, Hongliang Fei, Dingcheng Li, Tan Yu, and Ping Li. 2022. Prompting through prototype: A prototype-based prompt learning on pretrained vision-language models. arXiv:2210.10841. Retrieved from https://arxiv.org/abs/2210.10841
[286]
Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. 2019. Bridging theory and algorithm for domain adaptation. In Proceedings of the ICML. PMLR, 7404–7413.
[287]
Yu Zhang and Qiang Yang. 2021. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering (2021).
[288]
Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. 2022. Neural prompt search. arXiv:2206.04673. Retrieved from https://arxiv.org/abs/2206.04673
[289]
Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. 2023. What makes good examples for visual in-context learning? arXiv:2301.13670. Retrieved from https://arxiv.org/abs/2301.13670
[290]
Zhengkun Zhang, Wenya Guo, Xiaojun Meng, Yasheng Wang, Yadao Wang, Xin Jiang, Qun Liu, and Zhenglu Yang. 2022. Hyperpelt: Unified parameter-efficient language model tuning for both language and vision-and-language tasks. arXiv:2203.03878. Retrieved from https://arxiv.org/abs/2203.03878
[291]
Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. 2022. Decoupled knowledge distillation. In Proceedings of the CVPR. 11953–11962.
[292]
Cairong Zhao, Yubin Wang, Xinyang Jiang, Yifei Shen, Kaitao Song, Dongsheng Li, and Duoqian Miao. 2022. Learning domain invariant prompt for vision-language models. arXiv:2212.04196. Retrieved from https://arxiv.org/abs/2212.04196
[293]
Zhaohui Zheng, Rongguang Ye, Ping Wang, Dongwei Ren, Wangmeng Zuo, Qibin Hou, and Ming-Ming Cheng. 2022. Localization distillation for dense object detection. In Proceedings of the CVPR. 9407–9416.
[294]
Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. 2023. A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. arXiv:2302.09419. Retrieved from https://arxiv.org/abs/2302.09419
[295]
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022. Conditional prompt learning for vision-language models. In Proceedings of the CVPR. 16816–16825.
[296]
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022. Learning to prompt for vision-language models. International Journal of Computer Vision 130, 9 (2022), 2337–2348.
[297]
Sheng Zhou, Yucheng Wang, Defang Chen, Jiawei Chen, Xin Wang, Can Wang, and Jiajun Bu. 2021. Distilling holistic knowledge with graph neural networks. In Proceedings of the CVPR. 10387–10396.
[298]
Ziqin Zhou, Bowen Zhang, Yinjie Lei, Lingqiao Liu, and Yifan Liu. 2022. ZegCLIP: Towards adapting CLIP for zero-shot semantic segmentation. arXiv:2212.03588. Retrieved from https://arxiv.org/abs/2212.03588
[299]
Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. 2022. Prompt-aligned gradient for prompt tuning. arXiv:2205.14865. Retrieved from https://arxiv.org/abs/2205.14865
[300]
Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyao Zeng, Shanghang Zhang, and Peng Gao. 2022. PointCLIP V2: Adapting CLIP for powerful 3D open-world learning. arXiv:2211.11682. Retrieved from https://arxiv.org/abs/2211.11682
[301]
Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. 2020. A comprehensive survey on transfer learning. Proceedings of the IEEE 109, 1 (2020), 43–76.
