Introduction

The domain of NLP has seen a remarkable transformation in recent years, primarily driven by the introduction of LLMs1. Transformer-based Language Models (TLMs) initially brought about a revolution by demonstrating outstanding capabilities in capturing long-range dependencies2. However, the challenges of adapting TLMs to various tasks, combined with their resource-intensive training, led to the development of more powerful models, such as GPT-33. With billions of parameters, these LLMs have not only raised performance benchmarks across various tasks but have also extended their applications into novel domains, including creative writing and multimodal learning4.

Despite their notable achievements, an in-depth analysis of LLMs reveals several major limitations. The extensive computational resources required for their training raise questions about environmental sustainability and restrict accessibility to research facilities with sufficient equipment and resources5.

To address this issue, there has been an increasing focus on recent innovations in parameter-efficient fine-tuning methods6. Compared to retraining the entire model from scratch, fine-tuning LLMs has proven to be a faster and more efficient approach. Nevertheless, fine-tuning all parameters of an LLM remains challenging because of its vast size, typically billions of parameters; full fine-tuning therefore still requires extensive computational resources, much like pretraining.

In response to this challenge, adapter training has gained prominence as a more efficient approach7. This approach introduces domain-specific parameters, referred to as adapters, into pretrained models. These adapters, small neural networks, are strategically inserted within or between the layers of the pretrained model. During training, only the parameters of these added adapters are updated, while the parameters of the pretrained model remain unchanged7.

While adapters provide a simpler approach, they may not capture complex data patterns as effectively as fine-tuning the entire model. In addition, determining the optimal locations to insert adapters within the LLM can be challenging and may require experimentation. Nonetheless, prompt tuning complements adapter training by offering additional contextual information to guide the model’s understanding of the task at hand.

Prompt tuning8 does not modify the underlying structure of the model, potentially resulting in faster inference and lower resource consumption compared to adapters. Furthermore, prefix tuning9 has been proposed as a method to improve performance. Unlike prompt tuning, which updates only a portion of a single layer, prefix tuning consistently updates a section of every layer, and it has shown improved performance on downstream tasks.

The use of prompt and prefix tuning techniques8,9 can pose challenges in terms of the effectiveness and interpretability of the employed prompts or prefixes. These methods generally rely on trainable virtual tokens within an adapter, which carry no inherent semantic meaning and require extensive training to acquire domain-specific knowledge. Consequently, the performance of these techniques may be suboptimal, particularly on complex tasks, and considerable training is necessary to reach their full potential.

To overcome these challenges, we propose SK-Tuning, a novel approach to prompt and prefix tuning that improves the fine-tuning performance of LLMs. Unlike standard techniques that depend on random virtual tokens, SK-Tuning uses genuine, semantically rich prompts or prefixes for adapter training. By exploiting the LLM’s innate capacity to understand linguistic semantics, SK-Tuning integrates semantic knowledge directly from the prompt or prefix to improve performance.

LLMs display remarkable zero-shot capabilities, allowing them to perform tasks without explicit training, as shown in recent studies10. To make full use of these capabilities, SK-Tuning relies on the LLM’s ability to understand prompts or instructions in a zero-shot manner. This speeds up convergence during fine-tuning because we concentrate only on refining the semantic representation of the prompt or prefix.

The SK-Tuning method is presented in Fig. 1, which shows the stages for prompt and prefix tuning. First, the entire LLM is frozen to preserve its pretrained knowledge. Next, the frozen LLM is used to extract the semantic representation of the prompt or prefix text. This representation is then refined with a small adapter to improve its task-specific utility. Lastly, the refined representation is combined with the embedding of the input text, ensuring that the model effectively integrates both the semantic context provided by the prompt or prefix and the textual content of the input.

We perform wide-ranging experimental evaluations across a variety of downstream tasks, including sequence classification, token classification, and NLI, to empirically demonstrate the efficiency and effectiveness of SK-Tuning compared to traditional fine-tuning methods. Furthermore, we compare SK-Tuning with other parameter-efficient approaches, such as prompt tuning, prefix tuning, p-tuning, and LoRA, highlighting its unique advantages and contributions to the field of NLP.

The major contributions of this paper are summarized as follows:

  • This paper introduces SK-Tuning, a novel approach for fine-tuning LLMs using real, semantically meaningful prompt or prefix text.

  • SK-Tuning improves training efficiency and convergence speed by exploiting the inherent semantic understanding of prompt or prefix text through the LLM’s zero-shot capabilities, thereby allowing rapid adaptation to new tasks.

  • In numerous experiments covering a variety of tasks, including sequence classification, token classification, and NLI, SK-Tuning consistently exhibits significant improvements in performance metrics.

  • The study includes a comprehensive evaluation against other parameter-efficient methods like prompt tuning, prefix tuning, p-tuning, and LoRA, highlighting SK-Tuning’s superior effectiveness in terms of performance outcomes and computational efficiency.

  • SK-Tuning reduces computational requirements and the number of trainable parameters compared to traditional fine-tuning approaches, making it a more resource-efficient solution for adapting LLMs.

The structure of this paper is as follows: section “Related work” reviews related work, situating our approach within the broader domain of parameter-efficient fine-tuning methods. Section “Background study” provides a background study on existing tuning techniques, setting the stage for our proposed method. In section “SK-tuning procedure”, we detail the SK-Tuning procedure, explaining its methodology and implementation. Section “Experiments” presents our experiments, showcasing the performance improvements achieved through SK-Tuning across various tasks. Section “Ablation study” offers an ablation study to further analyze the contributions of each component within SK-Tuning, reinforcing the paper’s key contributions to NLP. Section “Discussion” provides a discussion on the implications and potential applications of SK-Tuning in practical settings. Section “Limitations” discusses the limitations and challenges experienced during the development and application of SK-Tuning. Finally, section “Conclusion” concludes the paper by summarizing the findings and highlighting future research directions in fine-tuning methods for LLMs.

Related work

The importance of parameter-efficient fine-tuning (PEFT) methods in the field of NLP is immense, considering the growing complexity of LLMs. These methods not only improve model performance but also significantly reduce computational and memory requirements, as demonstrated by recent academic research11,12,13,14,15. The effectiveness of PEFT techniques is being thoroughly evaluated on a range of NLP tasks, as shown in16. Moreover, an extensive body of research17,18,19,20,21,22,23 consistently indicates that PEFT strategies considerably enhance the performance of LLMs, even under limited-resource circumstances.

Prompt tuning is an approach that improves natural language understanding and generation tasks by fine-tuning learnable parameters within the model8. This technique enhances the model’s performance on specific tasks by fine-tuning prompts, thereby optimizing its output. Improvements in prompt tuning have been achieved through the use of residual connections to strengthen performance and stability24. The technique has also been extended to continual learning settings, as illustrated in recent research25,26. Current research focuses on dynamic prompt tuning, which adapts prompts in real time based on evolving contexts, as well as hierarchical prompt tuning, which provides multilevel control over the model’s responses27,28.

Prefix tuning is another powerful technique that adds learnable parameters as prefixes to the input of pre-trained models, enabling adaptation to different applications with minimal changes to the model itself21. This method enables efficient domain-specific fine-tuning without retraining the entire model, which is particularly valuable in resource-limited settings. Recent innovations introduce hierarchical prefix tuning, which organizes prefixes hierarchically to provide more detailed control over the model’s responses29. Additionally, dynamic prefix tuning allows for real-time adaptation based on the input context, thereby improving the flexibility and adaptability of the model30. Techniques such as MixPrompt31 and E2VPT32 have also been introduced to combine and optimize the use of input and key-value prompts, advancing the application of prefix tuning in natural language processing.

Low-rank adaptation (LoRA), first proposed in20, is a fine-tuning technique designed to optimize memory usage and has received considerable attention in the research community since its inception. Recent developments have expanded the range of applications for LoRA, particularly in multitask learning33,34,35. Practical applications of LoRA were further explored in36, while37 focused on optimizing its memory efficiency. A notable innovation, ReLoRA38, incorporates a full-rank warm-up phase, and19 proposed adaptive approaches that dynamically adjust the low-rank adaptation parameters. Additionally,39 presented the Low-Rank Kronecker Product (LoKr), and40 developed ResLoRA, which integrates residual pathways. Further contributions include the Low-Rank Hadamard Product (LoHa)41, as well as Orthogonal Finetuning (OFT) and OFT with butterfly factorization (BOFT)42,43, which use orthogonal matrices to transform pre-trained weight matrices, yielding significant improvements in both fine-tuning efficiency and performance.

Subspace learning has become a crucial area of research, focusing on optimizing model weights within a low-dimensional space and thereby providing computational efficiency and improved performance in various machine learning tasks44,45. This approach has been extensively used in meta-learning and continual learning frameworks44,45,46,47,48,49. Recent advances in adaptive subspace learning methods have demonstrated significant gains in generalization and robustness, especially in challenging environments50,51. Furthermore, the incorporation of subspace learning into neural architecture search has proven invaluable in identifying efficient and innovative architectures, optimizing both performance and resource utilization51,52,53. The efficacy of subspace learning is further highlighted in scenarios requiring rapid adaptation to new tasks with limited data, such as few-shot learning and online learning, where it enables robust model performance despite data limitations54.

Projected gradient descent (PGD) has been significantly improved by the development of advanced methodologies such as GaLore55. Unlike traditional approaches, which treat the objective function as a black box, GaLore exploits the structure of gradients within multilayer neural networks, providing a more extensive and effective optimization process56,57. This approach has shown notable improvements in the convergence rate of neural network training, particularly on high-dimensional datasets, while also improving stability during training58. Furthermore, GaLore addresses the challenges of gradient sparsity and redundancy, resulting in significant gains in training efficiency55. These innovations have not only strengthened the robustness of neural networks against adversarial attacks but also ensured more stable and reliable training dynamics, marking a noteworthy improvement in the field59,60,61.

Memory-efficient optimization is a pivotal area of research within the development of adaptive optimization algorithms, particularly for large-scale models where memory constraints are a significant challenge. Foundational studies62 have shown the effectiveness of quantization techniques and combined gradient computation in considerably reducing memory usage during training63. Building on these contributions, recent work has introduced hierarchical memory management systems that enable dynamic memory allocation and sparse gradient updates, further optimizing memory utilization64. Moreover,18 proposed a memory-efficient fine-tuning approach employing block-wise optimization strategies that dynamically adjust memory allocation, achieving superior performance across several benchmarks. In a similar vein,19 explored low-rank factorization techniques to compress model parameters effectively while preserving model accuracy. Collectively, these innovations support the deployment of large-scale models on resource-limited devices while ensuring computational efficiency and maintaining strong performance.

In contrast to previous techniques, our proposed SK-Tuning method introduces a novel strategy that utilizes authentic, semantically meaningful prompts or prefix texts during adapter training. This method capitalizes on the zero-shot capabilities of large language models (LLMs) and their fundamental understanding of linguistic semantics. As a result, SK-Tuning is designed to achieve faster convergence and enhance task performance. Through extensive experimental evaluations and comprehensive comparative analysis, we establish the superiority of SK-Tuning over existing fine-tuning techniques. These findings highlight the significant potential of SK-Tuning to advance fine-tuning methodologies in the field of NLP.

Background study

Prefix and prompt tuning are methods of adapting large pretrained language models to specific tasks or datasets with minimal updates to the model parameters. These techniques have gained prominence due to their efficiency and effectiveness, particularly in scenarios where updating the entire model is computationally expensive or impractical.

Prefix tuning

Prefix tuning involves prepending a sequence of tunable vectors, known as the prefix, to the input of each layer of the transformer model. Let us denote the transformer model as a function \(F\) that maps an input sequence \(x\) to an output \(y\), i.e., \(y = F(x)\). In prefix tuning, this mapping is modified to \(y = F(p \oplus x)\), where \(p\) represents the prefix and \(\oplus\) denotes concatenation.

Mathematically, if we consider a transformer model with \(K\) layers, where each layer \(k\) performs a transformation \(F_k\), the modified transformation with prefix becomes:

$$\begin{aligned} F'_k(p_k, x) = F_k(p_k \oplus x) \end{aligned}$$
(1)

where \(p_k\) is the prefix for layer \(k\). The prefixes \(\{p_1, p_2, ..., p_K\}\) are learnable parameters and are optimized during the training process.
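To make the per-layer modification in Eq. (1) concrete, the following PyTorch sketch prepends a trainable prefix to the hidden states entering each frozen layer. It is a minimal illustration of the formulation above rather than a specific library implementation; the layer interface and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class PrefixedEncoder(nn.Module):
    """Minimal sketch of Eq. (1): F'_k(p_k, x) = F_k(p_k ⊕ x), applied layer by layer."""
    def __init__(self, layers, prefix_len=20, d_model=768):
        super().__init__()
        self.layers = nn.ModuleList(layers)  # frozen transformer layers F_1..F_K
        # one trainable prefix p_k per layer
        self.prefixes = nn.ParameterList(
            [nn.Parameter(torch.randn(prefix_len, d_model) * 0.02) for _ in layers]
        )

    def forward(self, x):                                  # x: (batch, seq_len, d_model)
        for layer, p in zip(self.layers, self.prefixes):
            p_batch = p.unsqueeze(0).expand(x.size(0), -1, -1)
            h = layer(torch.cat([p_batch, x], dim=1))      # F_k(p_k ⊕ x)
            x = h[:, p.size(0):, :]                        # keep only input positions for the next layer
        return x
```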

Prompt tuning

Prompt tuning, on the other hand, leverages the concept of natural language prompts. Here, the model is fed a prompt that guides it to generate outputs tailored to a specific task. In mathematical terms, given a pretrained model \({\mathscr {M}}\), the objective is to find an optimal prompt \(p^*\) such that the model’s performance on a task \(T\) is maximized when the prompt is used as an input.

Formally, for a task \(T\) and a set of task-specific examples \(\{(x_i, y_i)\}\), prompt tuning aims to optimize the following:

$$\begin{aligned} p^* = \arg \max _p \sum _i \log {\mathscr {M}}(y_i | p \oplus x_i) \end{aligned}$$
(2)

This objective function maximizes the likelihood of the correct outputs \(y_i\) given the inputs \(x_i\) concatenated with the optimal prompt \(p^*\). Unlike prefix tuning, prompt tuning does not modify the internal workings of the model but rather influences its outputs through carefully crafted input sequences.
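As a point of reference for the objective in Eq. (2), the sketch below shows the common "soft prompt" realization of prompt tuning, in which only a small matrix of prompt embeddings is trained and prepended to the input embeddings of a frozen model. Names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prompt p prepended to input embeddings; the language model itself stays frozen."""
    def __init__(self, prompt_len=20, d_model=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, input_embeds):                       # input_embeds: (batch, n, d_model)
        p = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([p, input_embeds], dim=1)         # p ⊕ x_i, fed to the frozen model
```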

SK-tuning procedure

Fig. 1

SK-Tuning approaches for Prefix (left) and Prompt (right). The dashed line represents the optimization path during the backward pass to the trainable adapter. Notably, in the context of prompt tuning (right), the prohibition sign indicates that the forward pass is discontinued beyond a certain point. This is because we exclusively initialize layer-specific semantic information for the prompt, making continuation of the forward pass through the remaining layers unnecessary.

Problem definition

Consider a downstream task that utilizes a pretrained LLM denoted as \({\mathscr {M}}\). Let the training dataset be represented as \({\mathscr {D}} = \{(x_i, y_i)\}_{i=1}^N\), where each training example \((x_i, y_i)\) consists of an input text \(x_i\) and its associated true label \(y_i\).

Our primary objective is to fine-tune the pretrained LLM \({\mathscr {M}}\) for these downstream tasks while keeping the model parameters \(\Theta\) frozen. Specifically, we aim to achieve this fine-tuning by introducing a small number of additional parameters, referred to as adapters or transformations, which enable task-specific adaptation without the need to retrain the entire LLM. Each instance in our system is a pair \((x_i, y_i)\) that defines a specific task configuration.

Mathematically, our goal can be expressed as follows:

Given a pretrained LLM with parameters \(\Theta\) and a dataset \({\mathscr {D}}\), we seek to find a set of trainable parameters \(\Phi\) for the adapters or transformations, such that:

$$\begin{aligned} \Phi ^* = \arg \min _\Phi {\mathscr {L}}({\mathscr {M}}_{\Theta , \Phi }, {\mathscr {D}}) \end{aligned}$$
(3)

where \({\mathscr {M}}_{\Theta , \Phi }\) represents the fine-tuned model with frozen parameters \(\Theta\) and trainable adapters/transformations \(\Phi\), and \({\mathscr {L}}\) is a task-specific loss function that quantifies the alignment between model predictions and true labels across the training dataset \({\mathscr {D}}\).

Our objective is to correctly predict the labels \(y_i\) for the corresponding input texts \(x_i\) by training a small number of parameters (\(\Phi\)) without altering the pretrained model’s core architecture or parameters (\(\Theta\)). This approach aims to achieve parameter efficiency while tailoring the LLM to specific downstream tasks.

SK-tuning for prefix

SK-Tuning for Prefix enhances the versatility and performance of the pretrained LLM for downstream tasks by judiciously incorporating semantic knowledge from prefixes into the fine-tuning process. In the context of prefix-tuning a pretrained LLM \({\mathscr {M}}\), traditionally, a mapping function from virtual trainable tokens to the LLM’s layer representation is employed to generate the layer’s trainable parameters. However, in our proposed SK-Tuning approach, we adopt a different strategy. We leverage the power of a pretrained LLM \({\mathscr {M}}\), with its parameters frozen, to directly acquire semantic knowledge embeddings from the prefix tokens.

Let \(p\) denote a prefix comprising a sequence of semantic knowledge tokens of length \(l\), and let \(d\) be the hidden dimension of the LLM. Our objective is to extract the semantic hidden representation of the given input prefix \(p\) from each layer of the LLM. Let \(m\) represent the total number of layers, including the attention mechanisms; for each layer \(j\), we obtain its hidden representation \(h_j^p\). The representation is obtained as follows:

$$\begin{aligned} h^p = {\mathscr {M}}_{\Theta _{\text {frozen}}}(p) \in {\mathbb {R}}^{l \times d}. \end{aligned}$$
(4)

Next, we introduce a trainable adapter \({\mathscr {F}}\), parameterized by \(\Phi\), which takes \(h^p\) as input and yields a per-layer semantic projection \(z\):

$$\begin{aligned} z = {\mathscr {F}}_{\Phi }(h^p) \in {\mathbb {R}}^{m \times l \times d}. \end{aligned}$$
(5)

Now, we possess \(z\) as the semantic representation of the prefix tokens for every layer of \({\mathscr {M}}\). While processing the input \(x_i\) in \({\mathscr {M}}_{\Theta }\), we concatenate \(z_j\), for each \(j \in \{1, \ldots, m\}\), to the input of the \(j\)-th layer of \({\mathscr {M}}_{\Theta }\). This operation allows the \(j\)-th layer to access the corresponding semantic information from the prefix text.

Now, if \(r_i\) represents the final hidden output of \(x_i\), we can define:

$$\begin{aligned} r_i = {\mathscr {M}}_{\Theta _{\text {frozen}}}(x_i, z). \end{aligned}$$
(6)

Consider a task-specific module \({\mathscr {C}}\), parameterized by \(\zeta\), which embodies a downstream task:

$$\begin{aligned} o_i = {\mathscr {C}}_{\zeta }(r_i) \end{aligned}$$
(7)

Here, \(o_i\) represents the output of our task.

Our training objective aims to minimize the loss function \({\mathscr {L}}\), which quantifies the discrepancy between \(o_i\) and the target label \(y_i\), thereby indicating whether \(o_i\) correctly represents the label for \(x_i\):

$$\begin{aligned} \min _{\Phi , \zeta } {\mathscr {L}}\left( {\mathscr {C}}_{\zeta }({\mathscr {M}}_{\Theta _{\text {frozen}}}(x_i, {\mathscr {F}}_{\Phi }({\mathscr {M}}_{\Theta _{\text {frozen}}}(p)))), y_i\right) . \end{aligned}$$
(8)

This approach allows the fine-tuning process to concentrate explicitly on the representation and comprehension of labels, while simultaneously harnessing the intrinsic knowledge embedded within \(\Theta\). The adjustment of parameters \(\Phi\) and \(\zeta\) empowers the model to further refine its ability to map textual inputs to their corresponding labels.
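A compact sketch of this forward pass, following Eqs. (4)-(8), is given below. It is illustrative only: `frozen_lm` stands for the frozen backbone, the final hidden state over the prefix is taken as one reading of \(h^p\), and the `layer_context` argument is a hypothetical hook for injecting the per-layer projection \(z_j\); the actual injection mechanism depends on the underlying library.

```python
import torch
import torch.nn as nn

class PrefixAdapter(nn.Module):
    """Trainable adapter F_Phi: maps h^p (l x d) to per-layer projections z (m x l x d)."""
    def __init__(self, d_model, n_layers):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])

    def forward(self, h_p):                              # h_p: (l, d)
        return torch.stack([f(h_p) for f in self.proj])  # z: (m, l, d), Eq. (5)

def sk_prefix_forward(frozen_lm, adapter, task_head, prefix_ids, input_ids):
    with torch.no_grad():                                # Eq. (4): frozen pass over the prefix text
        h_p = frozen_lm(prefix_ids, output_hidden_states=True).hidden_states[-1][0]
    z = adapter(h_p)                                     # Eq. (5): trainable semantic projection
    r = frozen_lm(input_ids, layer_context=z)            # Eq. (6): hypothetical hook injecting z_j at layer j
    return task_head(r.last_hidden_state[:, 0])          # Eq. (7): task-specific output o_i
```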

SK-tuning for prompt

SK-Tuning for prompts involves a systematic process of semantic knowledge embedding, trainable adapter integration, concatenation, and a training objective. This approach allows for fine-tuning the pretrained LLM to effectively leverage semantic knowledge from prompts for improved performance in various downstream tasks. In the SK-Tuning framework for prompts, we focus on enhancing the capabilities of a pretrained LLM (\({\mathscr {M}}\)) by incorporating semantic knowledge from sequential prompt tokens, denoted as \(p\), of length \(l\). Let \({\mathscr {E}}\) represent the token embedding layer of \({\mathscr {M}}\), and consider \(e_p \in {\mathbb {R}}^{l \times d}\) and \(e_{x_i} \in {\mathbb {R}}^{n \times d}\) as the semantic embeddings for prompt \(p\) and input text \(x_i\), respectively. Here, \(n\) is the sequence length of input \(x_i\).

To obtain the semantic representation of the prompt \(p\) and input text \(x_i\), we utilize the pretrained token embedding layer \({\mathscr {E}}\) as follows:

$$\begin{aligned} e_p = {\mathscr {E}}(p) \thicksim {\mathscr {M}}_{\Theta _{\text {frozen}}} \end{aligned}$$
(9)

and

$$\begin{aligned} e_{x_i} = {\mathscr {E}}(x_i) \thicksim {\mathscr {M}}_{\Theta _{\text {frozen}}} \end{aligned}$$
(10)

This operation yields \(e_p\), which encapsulates the semantic information of the prompt, while \({\mathscr {M}}_{\Theta _{\text {frozen}}}\) ensures that the model parameters remain frozen during this process.

To further enhance the representation of the prompt, we introduce a trainable adapter, denoted as \({\mathscr {G}}\), which is parameterized by \(\gamma\). This adapter takes \(e_p\) as input and produces an updated embedding \(e_p^\prime\) as follows:

$$\begin{aligned} e_p^\prime = {\mathscr {G}}_{\gamma }(e_p) \in {\mathbb {R}}^{l \times d} \end{aligned}$$
(11)

The adapter \({\mathscr {G}}_{\gamma }\) serves as a mechanism to refine the semantic knowledge captured in \(e_p\) according to the specific downstream task requirements, allowing for fine-tuning without modifying the frozen model parameters.

The task head, denoted as \({\mathscr {C}}\), is designed to incorporate both the enhanced prompt representation \(e_p^\prime\) and the semantic embeddings of the input text \(e_{x_i}\). We achieve this through concatenation:

$$\begin{aligned} o_i = {\mathscr {C}}_{\zeta }({\mathscr {M}}_{\Theta _{\text {frozen}}}(e_p^\prime \oplus e_{x_i})) \end{aligned}$$
(12)

Here, \(\oplus\) represents the concatenation operation, and \(o_i\) serves as the output for the downstream task, allowing the model to leverage both prompt and input text information effectively.

The training objective for SK-Tuning of the prompt involves minimizing a loss function \({\mathscr {L}}\). This loss function quantifies the difference between the predicted output and the target label \(y_i\), reflecting the model’s performance on the specific task:

$$\begin{aligned} \min _{\gamma , \zeta } {\mathscr {L}}\left( {\mathscr {C}}_{\zeta }({\mathscr {M}}_{\Theta _{\text {frozen}}}({\mathscr {G}}_{\gamma }(e_p) \oplus e_{x_i})), y_i\right) \end{aligned}$$
(13)

Here, \(\gamma\) and \(\zeta\) denote the parameters of the adapter \({\mathscr {G}}\) and the task head \({\mathscr {C}}\) respectively.
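The sketch below traces this forward pass (Eqs. (9)-(13)): only the adapter \({\mathscr {G}}_{\gamma }\) and the task head \({\mathscr {C}}_{\zeta }\) are trainable, while the embedding layer and the LLM body remain frozen. Passing `inputs_embeds` follows the Hugging Face Transformers convention; the remaining names are illustrative assumptions.

```python
import torch

def sk_prompt_forward(frozen_lm, embed, adapter, task_head, prompt_ids, input_ids):
    with torch.no_grad():
        e_p = embed(prompt_ids)                      # Eq. (9): frozen prompt embedding
        e_x = embed(input_ids)                       # Eq. (10): frozen input-text embedding
    e_p_prime = adapter(e_p)                         # Eq. (11): trainable refinement G_gamma(e_p)
    fused = torch.cat([e_p_prime, e_x], dim=1)       # Eq. (12): e_p' ⊕ e_x
    r = frozen_lm(inputs_embeds=fused)               # frozen forward pass over the fused sequence
    return task_head(r.last_hidden_state[:, 0])      # task output o_i
```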

Algorithms

In this section, we describe two key algorithms that constitute the core of our proposed SK-Tuning approach for enhancing the fine-tuning of LLMs in the context of specific downstream tasks.

SK-tuning for prefix

The first algorithm, outlined in Algorithm 1 (SK-Tuning for Prefix), details the process of incorporating semantic knowledge from prefixes into the fine-tuning of a pretrained LLM. The algorithm takes as input a pretrained language model \({\mathscr {M}}\) with frozen parameters \(\Theta\), a prefix text \(p\), and a dataset \(\{(x_i, y_i)\}_{i=1}^N\). The trainable parameters \(\Phi\) and \(\zeta\) are initialized. For each input example \((x_i, y_i)\), the prefix text is processed through the frozen LLM to obtain \(h^p\), the hidden representation from each layer. This representation is then transformed using the trainable adapter \({\mathscr {F}}_{\Phi }\) to yield \(z\). Subsequently, the input text \(x_i\) is processed, incorporating the generated \(z\) for task-specific adaptation. Finally, the classification head \({\mathscr {C}}_{\zeta }\) computes the output \(o_i\) for the downstream task, the loss \({\mathscr {L}}(o_i, y_i)\) is computed, and the trainable parameters \(\Phi\) and \(\zeta\) are updated iteratively to minimize the loss.

Algorithm 1

SK-tuning for prefix

SK-tuning for prompt

The second algorithm, described in Algorithm 2 (SK-Tuning for Prompt), focuses on leveraging semantic knowledge from sequential prompt tokens to enhance fine-tuning. It takes as input a pretrained language model \({\mathscr {M}}\) with frozen parameters \(\Theta\), a prompt text \(p\), and a dataset \(\{(x_i, y_i)\}_{i=1}^N\). The trainable parameters \(\gamma\) and \(\zeta\) are initialized. For each input example \((x_i, y_i)\), the embeddings of the prompt text \(p\) and the input text \(x_i\) are obtained through the pretrained token embedding layer \({\mathscr {E}}\) while the core LLM parameters remain frozen. The prompt embedding \(e_p\) is then refined into an enhanced representation \(e_p^\prime\) using the trainable adapter \({\mathscr {G}}_{\gamma }\), and this representation is concatenated with the text embedding \(e_{x_i}\). The classification head \({\mathscr {C}}_{\zeta }\) computes the output \(o_i\) for the downstream task. As in the previous algorithm, the loss \({\mathscr {L}}(o_i, y_i)\) is computed, and the trainable parameters \(\gamma\) and \(\zeta\) are updated iteratively to minimize the loss.

Algorithm 2

SK-tuning for prompt
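To make the shared optimization step in Algorithms 1 and 2 concrete, the following sketch shows an illustrative training loop in which only the adapter and task-head parameters are optimized while the LLM parameters \(\Theta\) stay frozen. The forward function, optimizer settings, and loss are assumptions for illustration rather than the exact implementation.

```python
import torch

def train_sk_tuning(forward_fn, adapter, task_head, loader, epochs=5, lr=1e-3):
    # forward_fn is either sk_prefix_forward or sk_prompt_forward, partially applied
    params = list(adapter.parameters()) + list(task_head.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr, weight_decay=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x_i, y_i in loader:
            o_i = forward_fn(x_i)            # compute task output o_i
            loss = loss_fn(o_i, y_i)         # L(o_i, y_i)
            loss.backward()                  # gradients reach only Phi/gamma and zeta
            optimizer.step()
            optimizer.zero_grad()
```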

Experiments

Experimental setup

Our experiments use a computational setup with two NVIDIA H100 GPUs (80 GB VRAM each) and an Intel® Xeon® Gold 6448Y 2.1 GHz 32-core processor. The system includes 128 GB of RAM and a Dell 7.68 TB Enterprise NVMe Read Intensive Drive, providing the computational power and storage needed for efficient model training and evaluation.

We implemented our experiments using the PyTorch65 deep learning framework. Additionally, we leveraged the Transformers library developed by Hugging Face66, which offers a comprehensive set of tools and pretrained models for NLP tasks, facilitating the training and evaluation of LLMs on a variety of datasets.

The combination of these resources and software frameworks allowed us to conduct extensive experiments, enabling us to assess the performance and effectiveness of our proposed SK-Tuning approach across a range of downstream tasks.

LM results

Datasets: We evaluate SK-Tuning on CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, and RTE from the GLUE benchmark67. In Table 1 we report the Matthews correlation coefficient for CoLA, accuracy/F1 score for MRPC and QQP, Pearson/Spearman correlation for STS-B, average matched accuracy for MNLI, and accuracy for the other NLU tasks.

Table 1 Performance comparison of RoBERTa models on GLUE tasks: metrics include MCC for CoLA, accuracy for SST-2, accuracy/F1-score for MRPC and QQP, Pearson/Spearman correlation for STS-B, and Accuracy for MNLI, QNLI, and RTE.

Model selection & hyperparameters: For the GLUE benchmark, we fine-tune RoBERTa-base (\(RoB_B\), 125M parameters) and RoBERTa-large (\(RoB_L\), 355M parameters) from68. Dropout, attention dropout, and weight decay rates are uniformly set to 0.2 across all tasks. The initial learning rate was \(1 \times 10^{-4}\), subsequently reduced to \(2 \times 10^{-5}\) and \(2 \times 10^{-6}\). All datasets were trained for 10 epochs.
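For reference, a minimal setup for these GLUE runs might look like the sketch below, loading RoBERTa from Hugging Face Transformers with the dropout values stated above; the model identifier and label count are illustrative.

```python
from transformers import RobertaConfig, RobertaForSequenceClassification

config = RobertaConfig.from_pretrained(
    "roberta-base",                        # or "roberta-large" for RoB_L
    hidden_dropout_prob=0.2,               # dropout rate used across all tasks
    attention_probs_dropout_prob=0.2,      # attention dropout
    num_labels=2,                          # e.g. binary tasks such as SST-2
)
model = RobertaForSequenceClassification.from_pretrained("roberta-base", config=config)
```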

Results: Table 1 presents a detailed performance comparison of various parameter-efficient fine-tuning (PEFT) methods applied to two versions of the RoBERTa model on GLUE tasks, highlighting the SK-Tuning methods as particularly effective. These methods achieve competitive or superior performance across several metrics while using significantly fewer parameters, as few as 0.60M for \(RoB_B\) and 1.02M for \(RoB_L\). Notably, SK-Tuning (Prompt) and SK-Tuning (Prefix) consistently perform well across different task types, such as SST-2 and QQP, demonstrating a compelling balance between model efficiency and task performance. This efficiency makes SK-Tuning an attractive option for deployment in resource-constrained environments or where fast inference is crucial. The results underscore the potential of small, well-tuned models to match or even surpass the performance of larger, fully fine-tuned counterparts, suggesting a promising direction for future research in NLP model optimization.

LLM results

We conducted experiments on a diverse set of datasets to evaluate the performance of SK-Tuning across various NLP tasks, including sequence classification, token classification, and NLI. Our goal was to compare the performance of SK-Tuning with existing models on these tasks. Subsequently, we provide extensive details on the datasets utilized in our experiments.

Classification datasets

Sequence classification, a common task in NLP, involves labeling or categorizing text. In our study, we used four datasets: CoLA and SST-2 from the GLUE benchmark, along with the Emotion dataset and the Fake News Filipino dataset.

  • CoLA (https://huggingface.co/datasets/glue/viewer/cola/): The Corpus of Linguistic Acceptability (CoLA)69 consists of 10,657 sentences curated from 23 linguistic publications. Each sentence has been annotated by its original author for grammaticality or acceptability. The publicly available version of the dataset includes 9,594 sentences for training and validation, while 1,063 sentences are reserved for a separate held-out test set.

  • SST-2 (https://huggingface.co/datasets/sst2): The Stanford Sentiment Treebank is a dataset featuring fully labeled parse trees, enabling a comprehensive examination of how sentiment compositionally influences language. Derived from the dataset presented by Pang and Lee70, the corpus comprises 11,855 individual sentences extracted from movie reviews. Parsed with the Stanford parser, the dataset encompasses a total of 215,154 distinct phrases derived from these parse trees, each annotated by three human judges. Binary classification on complete sentences (negative or somewhat negative versus somewhat positive or positive, with neutral sentences excluded) is denoted by the dataset acronym SST-2. The publicly available version includes 67,349 sentences for training, 872 for validation, and 1,821 for testing.

  • Emotion (https://huggingface.co/datasets/dair-ai/emotion): Emotion is a dataset71 of English Twitter messages labeled with six basic emotions: anger, fear, joy, love, sadness, and surprise. The publicly available version includes 16,000 sentences for training, 2,000 for validation, and 2,000 for testing.

  • Fake News Filipino (https://huggingface.co/datasets/fake_news_filipino): A low-resource fake news detection dataset72 for the Filipino language. Comprising 3,206 labeled news samples, evenly divided between authentic and fabricated content, this dataset represents a pioneering effort. We partitioned the dataset into 70%, 10%, and 20% for training, validation, and testing, respectively.

Token classification datasets

Token classification involves labeling individual tokens within a sentence. Named Entity Recognition (NER) is a prevalent token classification task, aiming to assign labels to entities in a sentence, such as people, locations, or organizations. We used three token classification datasets: CoNLL 2003, NCBI Disease, and WikiAnn.

  • CoNLL 2003 (https://huggingface.co/datasets/conll2003): CoNLL-2003 is a named entity recognition dataset73 introduced within the CoNLL-2003 shared task on language-independent named entity recognition. The dataset comprises eight files covering two languages: English and German. We used the English portion and the “ner tags” as labels in our experiments. The publicly available version includes 14,041 training examples, 3,250 validation examples, and 3,453 test examples.

  • NCBI Disease (https://huggingface.co/datasets/ncbi_disease): The dataset74 includes annotations for disease names and concepts from the NCBI disease corpus, a compilation of 793 PubMed abstracts extensively annotated at both the mention and concept levels. There are three labels: 0 indicates no disease mention, 1 marks the first token of a disease, and 2 marks subsequent disease tokens. The publicly available version includes 5,433 training examples, 924 validation examples, and 941 test examples.

  • WikiAnn (https://huggingface.co/datasets/wikiann): WikiANN, also known as PAN-X, is a multilingual dataset75 for named entity recognition. It comprises Wikipedia articles annotated with location (LOC), person (PER), and organization (ORG) tags using the IOB2 format. This specific version aligns with the balanced train, validation, and test splits of 20,000, 10,000, and 10,000, respectively introduced by Rahimi et al. (2019), covering 176 out of the 282 languages featured in the original WikiANN corpus.

Entailment datasets

NLI involves determining the truth (entailment), falsity (contradiction), or undetermined status (neutral) of a “hypothesis” based on a provided “premise.” We used three datasets for this task: RTE, MRPC, and SNLI.

  • RTE (https://huggingface.co/datasets/glue/viewer/rte): The Recognizing Textual Entailment (RTE) datasets originate from a series of annual challenges on textual entailment. The benchmark creators amalgamated data from RTE176, RTE277, RTE378, and RTE579. Examples are constructed from news and Wikipedia text. To maintain consistency, the benchmark creators converted all datasets into a two-class split, collapsing neutral and contradiction into “not entailment” for the three-class datasets. The publicly available version includes 2,490 training examples, 277 validation examples, and 3,000 test examples.

  • MRPC (https://huggingface.co/datasets/glue/viewer/mrpc): The Microsoft Research Paraphrase Corpus (MRPC)80 comprises 5,801 pairs of sentences extracted from newswire articles. Human annotators have labeled each pair to indicate whether it is a paraphrase or not. The dataset is split into a training subset of 4,076 sentence pairs (2,753 of which are paraphrases) and a test subset of 1,725 pairs (1,147 of which are paraphrases).

  • SNLI (https://huggingface.co/datasets/snli): The Stanford NLI (SNLI) corpus81 is a collection of 570,000 human-written English sentence pairs labeled for balanced classification with the labels entailment, contradiction, and neutral. The corpus is designed to support the task of NLI. The publicly available version includes 550,152 training examples, 10,000 validation examples, and 10,000 test examples.

These datasets collectively cover a wide spectrum of NLP tasks, enabling comprehensive evaluations of SK-Tuning’s performance across various domains and challenges.

Large language models

In our analysis, we utilized multiple Large Language Models (LLMs) to obtain extensive and detailed results. Specifically, we employed Bloom 7b, Llama2 7b, Mistral 7b, Falcon 7b, and Phi-2 2.7b, each offering unique strengths and capabilities that complemented one another.

  • Bloom: A 7B parameter LLM from BigScience, trained on an extensive corpus of text and code. Bloom displays robust performance on various NLP tasks and offers several variants, including Bloom Text-to-Text and Bloom Code82.

  • Llama2: Meta AI has introduced Llama 2, its most advanced LLM to date. Llama 2 showcases a diverse array of capabilities and potential applications, with model sizes ranging from 7 billion to 70 billion parameters. This release provides access to both model weights and initial code for pretrained and fine-tuned Llama models, including variants such as Llama Chat (specialized for dialogue) and Code Llama (optimized for programming tasks)83.

  • Mistral: Mistral 7B is a freely available, open-source language model comprising 7.3 billion parameters that demonstrates exceptional performance. Released in September 2023, it exhibits competitive results in comparison to Meta’s LLaMA models, outperforming the 13B version on all benchmarks evaluated and equaling the 34B version on numerous metrics. Developed using the transformers architecture and accessible via BitTorrent and Hugging Face, Mistral 7B presents a robust and accessible option for researchers and developers seeking a high-performing LLM84.

  • Falcon: The Falcon Large Language Model (LLM) is a generative LLM designed to advance applications and use cases for future-proofing our world. Currently, the Falcon 180B, 40B, 7B, and 1.3B parameter artificial intelligence models, along with the high-quality REFINEDWEB dataset, constitute a comprehensive suite of offerings85.

  • Phi-2: Phi-2, the most recent small language model (SLM) developed by Microsoft Research, is a 2.7 billion parameter model that showcases superior reasoning and language understanding capabilities compared to its predecessors, Phi-1 and Phi-1.586. The model was trained on a diverse dataset comprising “textbook quality” web data and synthetic textbooks/exercises generated using GPT-3.5. Phi-2 exhibits exceptional performance in various tasks, including Python code generation87. It is noteworthy that Phi-2 surpasses the performance of models up to 25 times larger in size. Furthermore, Phi-2 has been released under an MIT License, permitting its use in commercial applications.

Baseline methods

We established the following baseline methods to evaluate the performance of our proposed approach:

  • Full fine-tuning: This methodology88 involves the adjustment of all parameters within the pretrained language model to adapt it to the specific task at hand. It functions as a comprehensive adaptation approach; however, it can be computationally intensive.

  • Prefix tuning: This lightweight method89 introduces trainable continuous vectors termed “prefixes” to the input of each transformer layer, while the original model parameters remain fixed. Prefix-tuning is predicated on the concept of prompting in language models, enabling ensuing tokens to attend to this prefix as if it were composed of “virtual tokens”. It presents a more efficient alternative to complete fine-tuning, particularly in low-data scenarios.

  • Prompt tuning: This method90 employs natural language prompts called “soft prompts” to guide the model’s behavior without modifying its internal parameters. This provides a flexible method to adapt models to different tasks without additional training.

  • P-tuning: This method91 is an optimized prompt tuning approach that exhibits efficacy across a diverse spectrum of model scales and natural language tasks. It addresses the suboptimal performance of prompt tuning when applied to pretrained models of typical size and aims to rectify its limitations, particularly its inefficacy on challenging sequence labeling tasks.

  • LoRA: LoRA20 (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that learns low-rank matrices to adapt the model while keeping most of its original parameters frozen. This study investigated LoRA with rank 2 and rank 4 to evaluate its capability to balance performance and efficiency; an illustrative configuration sketch appears after this list.

These baseline methods represent a diverse range of fine-tuning strategies, allowing us to measure the comparative performance of our proposed approach.
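As referenced in the LoRA item above, the following sketch shows how the rank-2 and rank-4 baselines could be configured with the Hugging Face peft library; the task type and the remaining values (alpha, dropout, target modules left at defaults) are assumptions that vary by model.

```python
from peft import LoraConfig, TaskType

# Rank-2 and rank-4 LoRA baselines; alpha/dropout values are illustrative.
lora_rank2 = LoraConfig(task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1)
lora_rank4 = LoraConfig(task_type=TaskType.SEQ_CLS, r=4, lora_alpha=16, lora_dropout=0.1)
```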

Evaluation metrics

Evaluation metrics measure the performance of a model on a specific dataset by comparing the model’s predictions with ground truth labels. Various tasks have specific metrics, and we used accuracy and F1 score in our experiments.

Accuracy

Accuracy is a metric that measures the overall correctness of a model’s predictions. It is calculated as the ratio of correct predictions to the total number of predictions made by the model:

$$\begin{aligned} \text {Accuracy} = \frac{\text {True Positives (TP) + True Negatives (TN)}}{\text {Total Predictions}} \end{aligned}$$

F1 score

The F1 score is the harmonic mean of precision and recall, providing a balanced measure considering both false positives and false negatives:

$$\begin{aligned} \text {F1 Score} = \frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision + Recall}} \end{aligned}$$

The F1 score ranges from 0 to 1, with higher values indicating better overall performance in terms of precision and recall.

In these formulas, TP, TN, FP, and FN denote the counts of true positives, true negatives, false positives, and false negatives, respectively, with Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
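For illustration, both metrics can be computed from predictions and gold labels as in the short sketch below; the use of scikit-learn and the weighted F1 averaging are assumptions, and any equivalent implementation works.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0]          # gold labels (toy example)
y_pred = [1, 0, 0, 1, 0]          # model predictions

accuracy = accuracy_score(y_true, y_pred)            # (TP + TN) / total predictions
f1 = f1_score(y_true, y_pred, average="weighted")    # harmonic mean of precision and recall
print(f"Accuracy: {accuracy:.3f}, F1: {f1:.3f}")
```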

Hyperparameters setting

In our experiments, we carefully selected hyperparameters to ensure consistent and effective training across various datasets and tasks.

We set the maximum sequence length to 128 for all datasets except RTE, where we did not impose a maximum sequence length.

Regarding learning rates, we employed the following values:

  • For the sequence classification datasets, the learning rate was set to \(1\times 10^{-3}\).

  • For the token classification datasets, a learning rate of \(1\times 10^{-5}\) was used.

  • For the NLI datasets, the learning rate was set to \(1\times 10^{-4}\).

In terms of the number of training epochs:

  • Sequence classification datasets were trained for 5 epochs.

  • Token classification datasets were trained for 10 epochs.

  • NLI datasets were trained for 10 epochs, with the exception of the SNLI dataset, which was trained for 2 epochs on each model.

For all our datasets, regardless of the task or tuning method (P-Tuning, Prefix Tuning, or Prompt Tuning), we consistently used 20 virtual tokens during training.
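For reference, the 20-virtual-token baselines can be expressed with the Hugging Face peft library roughly as follows; the sequence-classification task type is an assumption and changes for token classification.

```python
from peft import PrefixTuningConfig, PromptTuningConfig, PromptEncoderConfig, TaskType

# 20 virtual tokens for every baseline, as used across all datasets.
prefix_cfg  = PrefixTuningConfig(task_type=TaskType.SEQ_CLS, num_virtual_tokens=20)
prompt_cfg  = PromptTuningConfig(task_type=TaskType.SEQ_CLS, num_virtual_tokens=20)
ptuning_cfg = PromptEncoderConfig(task_type=TaskType.SEQ_CLS, num_virtual_tokens=20)
```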

We employ the Adaptive Moment Estimation with Weight Decay (ADAMW)92 optimizer for all experiments. The ADAMW optimizer is an improved version of the traditional ADAM optimizer, which incorporates weight decay directly into the optimization process to better handle regularization. Additionally, we set the weight decay value to 0.01 across all experiments to control regularization and ensure stable model training.

Result analysis

See Tables 2, 3, 4, 5 and 6.

Table 2 Sequence classification results for the bloom model.
Table 3 Sequence classification results for the Llama2 model.
Table 4 Sequence classification results for the Falcon model.
Table 5 Sequence classification results for the Mistral model.
Table 6 Sequence classification results for the Phi-2 model.

Sequence classification

In the realm of sequence classification, we conducted a comprehensive evaluation across various LLMs, including Bloom, Llama2, Falcon, Mistral, and Phi-2, employing different fine-tuning techniques. For each model, we examined the effectiveness of traditional approaches such as full fine-tuning, Prefix Tuning, Prompt Tuning, P-Tuning, Lora Rank 2, and Lora Rank 4, and compared them to our proposed SK-Tuning methods, for both Prefix and Prompt. Notably, SK-Tuning consistently outperforms traditional methods across different datasets, showcasing its superior efficiency and effectiveness. The performance of the various models on the different datasets is documented in Tables 2, 3, 4, 5, and 6.

On the “Fake News Filipino” dataset, SK-Tuning, especially when applied as SK-Tuning (Prefix), demonstrates remarkable performance improvements compared to traditional approaches. It achieves the highest accuracy and F1-score, emphasizing its capability to efficiently adapt LLMs to specific tasks while minimizing trainable parameters. On the “Emotion” dataset, SK-Tuning consistently outperforms other methods, indicating its robustness across different classification tasks. The same trend is observed on the “SST2” dataset, where SK-Tuning again achieves superior results. Lastly, on the “Cola” dataset, SK-Tuning (Prefix) and SK-Tuning (Prompt) consistently outperform other approaches, underscoring their potential for enhancing sequence classification tasks.

Comparatively, traditional methods like Prefix Tuning and Prompt Tuning, although efficient in terms of parameters compared to Fine-tuning, tend to lag behind SK-Tuning in terms of accuracy and F1-score. Furthermore, SK-Tuning requires fewer trainable parameters, making it an attractive choice for practitioners aiming to optimize performance while maintaining efficiency.

Token classification

In this comprehensive analysis of token classification across various datasets, we conducted an extensive evaluation of five distinct models: Bloom, Llama2, Falcon, Mistral, and Phi-2, while exploring a range of fine-tuning techniques to understand their impact on performance, as documented in Tables 7, 8, 9, 10 and 11. The datasets used for evaluation were CoNLL03, NCBI Disease, and WikiAnn, each representing different challenges in token classification.

Table 7 Token classification results for the Bloom model.
Table 8 Token classification results for the Llama2 model.
Table 9 Token classification results for the Falcon model.
Table 10 Token classification results for the Mistral model.
Table 11 Token classification results for the Phi-2 model.

First and foremost, we observed that Full Fine-tuning consistently achieved high accuracy across all models and datasets. However, it also required a substantial percentage of parameters, potentially making it less feasible for resource-constrained environments.

To address the trade-off between model efficiency and performance, we investigated several fine-tuning techniques. Prefix Tuning, Prompt Tuning, and P-Tuning, which involve introducing a small fraction of parameters, showcased mixed results. While these techniques achieved decent accuracy in some cases, they often lagged behind in terms of F1-score, indicating challenges in maintaining a balance between precision and recall.

Remarkably, Lora Rank 2 and Lora Rank 4, with a moderate percentage of parameters, consistently delivered a strong performance, especially in terms of the F1-score. These results underscore the importance of considering the architecture of the model when optimizing for token classification tasks, with Lora Rank models demonstrating their effectiveness.

Finally, SK-Tuning techniques, both Prefix and Prompt variants, stood out as noteworthy approaches. They required an extremely minimal percentage of additional parameters yet yielded competitive accuracy and remarkable F1 scores. This suggests that these techniques have the potential to strike a favorable balance between model efficiency and task effectiveness.

Entailment detection

The results of entailment detection using various models, including Bloom, Llama2, Falcon, Mistral, and Phi-2, are presented in Tables 12, 13, 14, 15 and 16. Across all three datasets (RTE, MRPC, SNLI), full fine-tuning consistently achieves the highest accuracy and F1-score, with the Bloom and Mistral models demonstrating remarkable results. This underscores the value of fine-tuning all of the model’s parameters to adapt to specific entailment tasks, as it allows the model to capture intricate patterns and nuances in the data.

Table 12 Entailment classification results for the Bloom model.
Table 13 Entailment classification results for the Llama2 model.
Table 14 Entailment classification results for the Falcon model.
Table 15 Entailment classification results for the Mistral model.
Table 16 Entailment classification results for the Phi-2 model.

In contrast, prefix tuning and prompt tuning techniques, which involve fine-tuning only a small fraction of the model’s parameters, tend to yield significantly lower accuracy and F1-scores. This suggests that limiting parameter updates to specific prefixes or prompts may not be sufficient for optimal entailment classification performance, as these methods may struggle to capture the diverse and complex relationships present in the data.

The Lora Rank 2 and Lora Rank 4 models deliver competitive results, particularly evident in the RTE dataset, where they outperform other techniques. This indicates that techniques like Lora Rank, which involve a moderate amount of parameter modification, can strike a balance between model adaptation and computational efficiency.

However, SK-Tuning, whether applied to prefixes or prompts, consistently performs well across datasets, demonstrating its effectiveness as an alternative fine-tuning strategy. SK-Tuning achieves strong results with a minimal increase in the number of parameters, making it a promising approach for entailment classification tasks where computational resources are a concern.

Ablation study

Efficiency

Figure 2 illustrates that SK-Tuning methods for Prefix and Prompt, demonstrate superior memory efficiency with the lowest memory cost among the compared PEFT methods, making them ideal for resource-constrained environments. Despite their minimal memory footprint, these methods maintain competitive training efficiency, balancing low parameter percentages with moderate training times, which highlights their effectiveness in achieving lightweight and fast fine-tuning. Compared to other methods like LoKr, LoHa, and LoRA, which show higher memory costs and varying degrees of training efficiency, SK-Tuning stands out as a robust approach that optimizes both memory and computational resources, making it particularly advantageous for scenarios where efficiency is paramount.

Fig. 2

Comparison of memory efficiency (left) and training efficiency (right) across various PEFT methods. S-Prefix and S-Prompt represent SK-Tuning applied to prefix tuning and prompt tuning, respectively. The left chart shows the memory cost in GB, highlighting the model weights and optimizations, while the right chart displays the percentage of parameters, total training time in hours, and iteration time per second.

Faster convergence with SK-tuning

In this section, we present an ablation study comparing the convergence speed and performance of SK-Tuning with traditional prompt and prefix tuning methods on three different downstream tasks: Token Classification, Sequence Classification, and NLI. We hypothesize that SK-Tuning, leveraging semantic knowledge, will lead to faster convergence due to the inherent zero-shot capabilities of LLMs93.

Accelerated convergence in token classification

In the context of token classification tasks, we conducted a comprehensive comparison between SK-Tuning and traditional tuning methods. We utilized two benchmark datasets, namely Wikiann and Conll03, both featuring token-level labels. Our primary objective was to analyze the convergence behavior, measured in terms of loss reduction, as training steps progressed.

Figure 3 visually presents the convergence trajectories for SK-Tuning and traditional methods. Notably, we observed a remarkable disparity in the convergence speed between these approaches. SK-Tuning, whether applied to prefixes or prompts, demonstrated a strikingly swift convergence compared to the conventional tuning method.

Fig. 3

Convergence comparison for token classification.

This accelerated convergence showcased in Fig. 3 serves as compelling evidence of the significant advantages brought about by the incorporation of semantic knowledge. It underscores the ability of SK-Tuning to facilitate rapid adaptation to the intricacies of token classification tasks, emphasizing the practical utility of this approach.

Accelerated convergence in sequence classification

For the evaluation of SK-Tuning in sequence classification tasks, we conducted a comparative analysis against traditional tuning methods. Our experimentation leveraged two benchmark datasets: Fake News and SST2, both featuring sequences with corresponding labels. Our primary objective was to assess the convergence performance, measured in terms of loss reduction, as the model underwent training iterations.

Figure 4 offers a visual representation of the convergence patterns observed during sequence classification. Notably, the results depicted in the figure demonstrate the accelerated convergence achieved with SK-Tuning when compared to conventional tuning methods.

Fig. 4

Convergence comparison for sequence classification.

The swift convergence illustrated in Fig. 4 underscores the significant advantages bestowed by the integration of semantic knowledge into the fine-tuning process. This enhancement enables the model to quickly adapt to the nuances of the specific sequence classification task, reaffirming the effectiveness of SK-Tuning in practical scenarios.

Accelerated convergence in NLI

In the realm of NLI tasks, we conducted a comparative analysis pitting SK-Tuning against traditional tuning methods. Our evaluation incorporated well-established datasets, including MRPC and SNLI, which consist of premise-hypothesis pairs and their corresponding entailment labels. The primary objective was to assess convergence speed, measured in terms of training steps.

Figure 5 visually illustrates the convergence dynamics observed during NLI tasks. Notably, the findings showcased in the figure reveal the expedited convergence achieved through SK-Tuning when compared to traditional tuning approaches.

Fig. 5
figure 5

Convergence comparison for NLI.

The swift convergence depicted in Fig. 5 underscores the substantial advantages conferred by the integration of semantic knowledge into the fine-tuning process. This augmentation enhances the model’s adaptability, enabling it to quickly grasp the nuances of NLI tasks and reaffirming the practical utility of SK-Tuning in advancing NLI model performance.

Our ablation study clearly demonstrates that SK-Tuning outperforms traditional prompt and prefix tuning methods in terms of convergence speed across a range of downstream tasks. The incorporation of semantic knowledge, along with the zero-shot capabilities of LLMs, contributes to faster task adaptation. Additionally, SK-Tuning consistently leads to better performance, as shown in subsequent sections.

Adapter layers

In this study, we investigate the impact of adapter layer complexity on the performance of fine-tuned models. Specifically, we analyze how increasing the complexity of adapter layers affects various factors, including the percentage of parameters, computational cost, and convergence speed. We conducted experiments using the Mistral 7B model on the SST2 dataset, and the results are presented in Table 17.

Table 17 Exploring the Trade-offs—adapter complexity vs. performance.

As shown in Table 17, increasing the number of adapter layers leads to a proportional increase in the number of trainable parameters. This added complexity comes at the cost of greater computational resources and slower convergence, while the resulting performance improvements remain marginal.

For instance, with a single adapter layer, the model has a relatively small number of trainable parameters, converges efficiently, and achieves high accuracy. As additional layers are introduced, the parameter count grows significantly, computational requirements escalate, and convergence slows substantially, yet the accuracy gained from these more complex adapters remains modest.

The observed trend suggests that as adapter-layer complexity increases, computational demands and training time also rise substantially. This can be attributed to the need for more extensive training to capture and leverage semantic information effectively.
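To make this trade-off concrete, the sketch below shows a generic bottleneck adapter whose depth can be varied and whose trainable parameters can be counted. The hidden and bottleneck sizes are assumptions for illustration and do not reproduce our exact adapter design.

```python
# Illustrative sketch (not the paper's exact architecture): a stacked bottleneck
# adapter used to examine the parameter/complexity trade-off from Table 17.
import torch.nn as nn

class StackedAdapter(nn.Module):
    def __init__(self, hidden_size=4096, bottleneck=64, num_layers=1):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, bottleneck),
                nn.ReLU(),
                nn.Linear(bottleneck, hidden_size),
            )
            for _ in range(num_layers)
        )

    def forward(self, hidden_states):
        for block in self.blocks:
            hidden_states = hidden_states + block(hidden_states)  # residual update
        return hidden_states

def trainable_params(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Each extra layer adds roughly 2 * hidden_size * bottleneck parameters:
# for n in (1, 2, 4):
#     print(n, trainable_params(StackedAdapter(num_layers=n)))
```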

Effect of prompt and prefix text

In this ablation study, we investigate the influence of prompt and prefix text length on the performance of SK-Tuning for sentiment classification using the SST-2 dataset. Our goal is to demonstrate that well-crafted prompt or prefix texts can outperform longer, less informative alternatives, despite the latter offering a larger number of trainable parameters.

We conducted experiments with various prompt and prefix texts and evaluated their corresponding accuracy on sentiment classification using the Mistral model, which has 7 billion parameters. Table 18 summarizes the results.

The results presented in Table 18 clearly illustrate that a concise and informative prompt text outperforms longer and less focused alternatives. Although longer prompts or prefixes provide more trainable parameters, our findings underscore the significance of crafting prompts that offer clear task instructions and context, resulting in enhanced model performance.

Table 18 Effect of prompt and prefix length on sentiment classification accuracy.

Furthermore, to visualize the relationship between the prompt text and the input text, we analyzed the attention scores of the last layer. Specifically, we used the prompt text ‘Classify the positive or negative sentiment of the text.’ in conjunction with the input texts ‘I love this movie.’ and ‘I hate this movie.’ The figures in Fig. 6 depict the attention scores, highlighting the sentimental connection between the prompt and the text. In the left figure, the prompt word ‘positive’ exhibits a strong attention score with ‘love’. Conversely, in the right figure, the prompt word ‘negative’ shows a pronounced attention score with ‘hate’. This observation suggests that including words such as ‘positive’ and ‘negative’ in the prompt text significantly aids the model in making informed decisions, thereby emphasizing the importance of crafting effective prompt texts.

Fig. 6
figure 6

Attentional Insights—exploring the Sentimental Connection between Prompt Text and Input Text. The left side of the figure reveals the attention scores between the prompt text, particularly the word ‘positive’, and the input text ‘I love this movie.’ On the right side, the attention scores depict the relationship between the prompt word ‘negative’ and the input text ‘I hate this movie.’ These attention patterns shed light on how well-crafted prompt texts enhance the model’s decision-making process.
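As a rough illustration of how such attention maps can be obtained, the sketch below extracts last-layer attention scores from a Hugging Face causal LM for the prompt and input texts above. The checkpoint name and the averaging over attention heads are assumptions, not the exact probing procedure used for Fig. 6.

```python
# Hedged sketch: inspecting last-layer attention between prompt and input tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

prompt = "Classify the positive or negative sentiment of the text."
text = "I love this movie."
inputs = tokenizer(prompt + " " + text, return_tensors="pt")

with torch.no_grad():
    attentions = model(**inputs, output_attentions=True).attentions  # one tensor per layer

# Average the heads of the last layer and read off the score between the token
# positions of 'positive' (in the prompt) and 'love' (in the input).
last_layer = attentions[-1].mean(dim=1)[0]          # shape: (seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# A high value at last_layer[index_of('love'), index_of('positive')] reflects
# the sentimental connection highlighted in Fig. 6.
```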

Discussion

We present a comprehensive comparison of our proposed SK-Tuning method with established parameter-efficient fine-tuning techniques, including Prompt Tuning, Prefix Tuning, P-Tuning, LoRA (rank 2), and LoRA (rank 4). Our evaluation encompassed a diverse set of downstream tasks across various domains within NLP. Notably, SK-Tuning for both prompts and prefixes consistently outperformed these traditional methods across several key metrics, including accuracy, F1 score, and parameter efficiency.

One of the key takeaways from our comparison is the remarkable performance gains achieved by SK-Tuning. In terms of accuracy and F1 score, SK-Tuning consistently delivered superior results across the spectrum of tasks. This improvement underscores the effectiveness of leveraging semantically meaningful information in the fine-tuning process, as opposed to relying on arbitrary virtual tokens.

Equally noteworthy is the efficiency of SK-Tuning. By minimizing the number of trainable parameters required for adaptation, our approach demonstrates a substantial reduction in computational resources while maintaining or even enhancing task performance. This efficiency is particularly crucial in practical applications, where resource constraints often play a significant role.

Another noteworthy aspect of our study is the extensive evaluation across five different pretrained LLMs: Bloom (7B), Falcon (7B), LLAMA2 (7B), Mistral (7B), and Phi2 (2.7B). Our results consistently indicate that SK-Tuning is a robust and versatile technique that can be applied to various LLM architectures, demonstrating its broad applicability and effectiveness across different model sizes and complexities.

Limitations

While SK-Tuning offers significant advantages in terms of performance and parameter efficiency, there are several key limitations that should be considered:

Training and inference time overhead

One of the primary limitations of SK-Tuning is the potential increase in training and inference time, since it utilizes the pretrained LLM twice during the forward pass: once to obtain semantic information from the prompt or prefix, and again to process the input data and produce the output. This dual usage of the LLM can lead to longer training and inference times.
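The sketch below illustrates, in simplified form, where this overhead comes from: one frozen pass encodes the prompt text and a second pass processes the task input. The helper names and the mean-pooling strategy are hypothetical and are only meant to convey the source of the extra cost, not our exact implementation.

```python
# Minimal sketch of the dual forward pass behind SK-Tuning's latency overhead.
import torch

@torch.no_grad()
def encode_prompt_semantics(llm, tokenizer, prompt_text):
    """First pass: the frozen LLM summarizes the real-text prompt (assumed pooling)."""
    ids = tokenizer(prompt_text, return_tensors="pt")
    hidden = llm(**ids, output_hidden_states=True).hidden_states[-1]
    return hidden.mean(dim=1)            # pooled semantic vector (assumption)

def forward_with_semantic_prefix(llm, adapter, tokenizer, prompt_text, input_text):
    """Second pass: the pooled prompt representation conditions the input pass."""
    prefix = adapter(encode_prompt_semantics(llm, tokenizer, prompt_text))
    ids = tokenizer(input_text, return_tensors="pt")
    # How `prefix` is injected (e.g. as prepended embeddings or past key values)
    # is method-specific; the point is simply that two LLM passes are required.
    return llm(**ids), prefix
```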

Dependency on pretrained models

SK-Tuning relies heavily on the quality and capabilities of the underlying pretrained LLM. The success of prompt or prefix text tuning is linked to the zero-shot capabilities of the LLM. If the pretrained model does not have a strong grasp of semantic knowledge or lacks certain linguistic skills, the effectiveness of SK-Tuning could be reduced; in such cases, significant additional training may be required before the model can accurately interpret the semantic meaning of the prompt or prefix text.

Semantic knowledge acquisition

The effectiveness of SK-Tuning depends on using prompts or prefixes that are meaningful. The more relevant the prompt is to the task, the better the performance, as described in the section “Effect of prompt and prefix text”. However, creating or finding these meaningful prompts can be difficult and might require specific domain knowledge. This challenge could limit how useful SK-Tuning is for certain tasks or datasets.

Tuning hyperparameters

Like other fine-tuning approaches, SK-Tuning involves hyperparameter tuning, including the design of the adapter architecture, the choice of semantic knowledge text, and the adjustment of task-specific modules. Identifying the optimal hyperparameters can be a time-consuming and computationally intensive process.

Conclusion

In conclusion, our work introduces SK-Tuning as a pioneering approach to fine-tuning LLMs for specific downstream tasks, with a strong emphasis on parameter efficiency. We have shown that traditional methods, relying on learnable virtual tokens in adapters while keeping the LLM’s core parameters frozen, often fall short in terms of both efficiency and performance.

SK-Tuning, on the other hand, revolutionizes the fine-tuning process by replacing arbitrary virtual tokens with real, semantically meaningful prefixes. This innovation allows LLMs to tap into their intrinsic semantic knowledge, significantly reducing the need for extensive training iterations. Our experimental results across a range of downstream tasks, including sequence classification, token classification, and NLI, provide compelling evidence that SK-Tuning outperforms traditional approaches. Notably, this improvement is achieved with a reduced number of trainable parameters, emphasizing the efficiency of our method.

By prioritizing parameter efficiency and harnessing the latent semantic understanding of LLMs, SK-Tuning opens up new possibilities for efficient model adaptation across various real-world applications. We believe that our approach holds great promise for advancing the field of NLP, offering researchers and practitioners a valuable tool for achieving enhanced task performance while optimizing computational resources. As LLMs continue to play a pivotal role in NLP, SK-Tuning represents a significant step forward in harnessing their full potential.