Introduction

The domain of NLP has seen a remarkable transformation in recent years, primarily driven by the introduction of LLMs1. Transformer-based Language Models (TLMs) initially brought about a revolution by demonstrating outstanding capabilities in capturing long-range dependencies2. However, the challenges of adapting TLMs to various tasks, combined with their resource-intensive training, led to the development of more powerful models, such as GPT-33. With billions of parameters, these LLMs have not only raised performance benchmarks across various tasks but have also extended their applications into novel domains, including creative writing and multimodal learning4.

Despite their notable achievements, an in-depth analysis of LLMs reveals several major limitations. The extensive computational resources required for their training raise questions about environmental sustainability and restrict accessibility to research facilities with sufficient equipment and resources5.

To address this issue, there has been an increasing focus on recent innovations in parameter-efficient fine-tuning methods6. Compared to retraining the entire model from scratch, fine-tuning LLMs has proven to be a faster and more efficient approach. Nevertheless, fine-tuning all parameters of an LLM remains challenging because of its vast size, typically billions of parameters; full fine-tuning therefore still requires extensive computational resources, much like pretraining.

In response to this challenge, adapter training has gained prominence as a more efficient approach7. This approach introduces domain-specific parameters, referred to as adapters, into pretrained models. These adapters, small neural networks, are strategically inserted within or between the layers of the pretrained model. During training, only the parameters of these added adapters are updated, while the parameters of the pretrained model remain unchanged7.

While adapters provide a simpler approach, they may not capture complex data patterns as effectively as fine-tuning the entire model. In addition, determining the optimal locations to insert adapters within the LLM can be challenging and may require experimentation. Nonetheless, prompt tuning complements adapter training by offering additional contextual information to guide the model’s understanding of the task at hand.

Prompt tuning8 does not modify the underlying structure of the model, potentially resulting in faster inference and lower resource consumption compared to adapters. Furthermore, prefix tuning9 has been proposed as a method to improve performance. Unlike prompt tuning, which updates only a portion of a single layer, prefix tuning consistently updates a section of every layer, and it has shown improved performance on downstream tasks.

The use of prompt and prefix tuning techniques8,9 can pose challenges in terms of the effectiveness and interpretability of the employed prompts or prefixes. These methods generally rely on trainable virtual tokens within an adapter, which carry no inherent semantic meaning and require extensive training to acquire domain-specific knowledge. Consequently, the performance of these techniques may be suboptimal, particularly on complex tasks, and considerable training is necessary to reach their full potential.

To overcome these challenges, we propose SK-Tuning, a novel approach to prompt and prefix tuning that improves the fine-tuning performance of LLMs. Unlike standard techniques that depend on random virtual tokens, SK-Tuning uses genuine, semantically rich prompts or prefixes for adapter training. By exploiting the LLM’s innate capacity to understand linguistic semantics, SK-Tuning integrates semantic knowledge directly from the prompt or prefix to improve performance.

LLMs display remarkable zero-shot capabilities, allowing them to perform tasks without explicit training, as shown in recent studies10. To make full use of these capabilities, SK-Tuning relies on the LLM’s ability to understand prompts or instructions in a zero-shot manner. This speeds up convergence during fine-tuning because we concentrate only on refining the semantic representation of the prompt or prefix.

The SK-Tuning method is presented in Fig. 1, which shows the stages for prompt and prefix tuning. First, the entire LLM is frozen to preserve its pretrained knowledge. Next, the frozen LLM is used to extract the semantic representation of the prompt or prefix text. This representation is then refined with a small adapter to improve its task-specific utility. Lastly, the refined representation is combined with the embedding of the input text, ensuring that the model effectively integrates both the semantic context provided by the prompt or prefix and the textual content of the input.

We perform wide-ranging experimental evaluations across a variety of downstream tasks, including sequence classification, token classification, and NLI, to empirically demonstrate the efficiency and effectiveness of SK-Tuning compared to traditional fine-tuning methods. Furthermore, we compare SK-Tuning with other parameter-efficient approaches, such as prompt tuning, prefix tuning, p-tuning, and LoRA, highlighting its unique advantages and contributions to the field of NLP.

The major contributions of this paper are summarized as follows:

  • This paper introduces SK-Tuning, a novel approach for fine-tuning LLMs using real, semantically meaningful prompt or prefix text.

  • SK-Tuning improves training efficiency and convergence speed by exploiting the inherent semantic understanding of prompt or prefix text through the LLM’s zero-shot capabilities, thereby allowing rapid adaptation to new tasks.

  • In numerous experiments covering a variety of tasks, including sequence classification, token classification, and NLI, SK-Tuning consistently exhibits significant improvements in performance metrics.

  • The study includes a comprehensive evaluation against other parameter-efficient methods like prompt tuning, prefix tuning, p-tuning, and LoRA, highlighting SK-Tuning’s superior effectiveness in terms of performance outcomes and computational efficiency.

  • SK-Tuning reduces computational requirements and the number of trainable parameters compared to traditional fine-tuning approaches, making it a more resource-efficient solution for adapting LLMs.

The structure of this paper is as follows: section “Related work” reviews related work, situating our approach within the broader domain of parameter-efficient fine-tuning methods. Section “Background study” provides a background study on existing tuning techniques, setting the stage for our proposed method. In section “SK-tuning procedure”, we detail the SK-Tuning procedure, explaining its methodology and implementation. Section “Experiments” presents our experiments, showcasing the performance improvements achieved through SK-Tuning across various tasks. Section “Ablation study” offers an ablation study to further analyze the contributions of each component within SK-Tuning, reinforcing the paper’s key contributions to NLP. Section “Discussion” provides a discussion on the implications and potential applications of SK-Tuning in practical settings. Section “Limitations” discusses the limitations and challenges experienced during the development and application of SK-Tuning. Finally, section “Conclusion” concludes the paper by summarizing the findings and highlighting future research directions in fine-tuning methods for LLMs.

Related work

The importance of parameter-efficient fine-tuning (PEFT) methods in the field of NLP is immense, considering the growing complexity of LLMs. These methods not only improve model performance but also significantly reduce computational and memory requirements, as demonstrated by recent academic research11,12,13,14,15. The effectiveness of PEFT techniques is being thoroughly evaluated on a range of NLP tasks, as shown in16. Moreover, an extensive body of research17,18,19,20,21,22,23 consistently indicates that PEFT strategies considerably enhance the performance of LLMs, even under limited-resource circumstances.

Prompt tuning is an approach that improves natural language understanding and generation tasks by fine-tuning learnable parameters within the model8. This technique enhances the model’s performance on specific tasks by fine-tuning prompts, thereby optimizing its output. Improvements in prompt tuning have been achieved through the use of residual connections to strengthen performance and stability24. The technique has also been extended to continual learning settings, as illustrated in recent research25,26. Current research focuses on dynamic prompt tuning, which adapts prompts in real time based on evolving contexts, as well as hierarchical prompt tuning, which provides multilevel control over the model’s responses27,28.

Prefix tuning is another powerful technique that adds learnable parameters as prefixes to the input of pre-trained models, enabling adaptation to different applications with minimal changes to the model itself21. This method enables efficient domain-specific fine-tuning without retraining the entire model, which is particularly valuable in resource-limited settings. Recent innovations introduce hierarchical prefix tuning, which organizes prefixes hierarchically to provide more detailed control over the model’s responses29. Additionally, dynamic prefix tuning allows for real-time adaptation based on the input context, thereby improving the flexibility and adaptability of the model30. Techniques such as MixPrompt31 and E2VPT32 have also been introduced to combine and optimize the use of input and key-value prompts, advancing the application of prefix tuning in natural language processing.

Low-rank adaptation (LoRA), first proposed in20, is a fine-tuning technique designed to optimize memory usage and has received considerable attention in the research community since its inception. Recent developments have expanded the range of applications for LoRA, particularly in multitask learning33,34,35. Practical applications of LoRA were further explored in36, while37 focused on optimizing its memory efficiency. A notable innovation, ReLoRA38, incorporates a full-rank warm-up phase, and19 proposed adaptive approaches that dynamically adjust the low-rank adaptation parameters. Additionally,39 presented the Low-Rank Kronecker Product (LoKr), and40 developed ResLoRA, which integrates residual pathways. Further contributions include the Low-Rank Hadamard Product (LoHa)41, as well as Orthogonal Finetuning (OFT) and OFT with butterfly factorization (BOFT)42,43, which use orthogonal matrices to transform pre-trained weight matrices, yielding significant improvements in both fine-tuning efficiency and performance.

Subspace learning has become a crucial area of research, focusing on optimizing model weights within a low-dimensional space and thereby providing computational efficiency and improved performance in various machine learning tasks44,45. This approach has been extensively used in meta-learning and continual learning frameworks44,45,46,47,48,49. Recent advances in adaptive subspace learning methods have demonstrated significant gains in generalization and robustness, especially in challenging environments50,51. Furthermore, the incorporation of subspace learning into neural architecture search has proven invaluable in identifying efficient and innovative architectures, optimizing both performance and resource utilization51,52,53. The efficacy of subspace learning is further highlighted in scenarios requiring rapid adaptation to new tasks with limited data, such as few-shot learning and online learning, where it enables robust model performance despite data limitations54.

Projected gradient descent (PGD) has been significantly improved by the development of advanced methodologies such as GaLore55. Unlike traditional approaches, which treat the objective function as a black box, GaLore exploits the structure of gradients within multilayer neural networks, providing a more extensive and effective optimization process56,57. This approach has shown notable improvements in the convergence rate of neural network training, particularly on high-dimensional datasets, while also improving stability during training58. Furthermore, GaLore addresses the challenges of gradient sparsity and redundancy, resulting in significant gains in training efficiency55. These innovations have not only strengthened the robustness of neural networks against adversarial attacks but also ensured more stable and reliable training dynamics, marking a noteworthy improvement in the field59,60,61.

Memory-efficient optimization is a pivotal area of research within the development of adaptive optimization algorithms, particularly for large-scale models where memory constraints are a significant challenge. Foundational studies62 have shown the effectiveness of quantization techniques and combined gradient computation in considerably reducing memory usage during training63. Building on these contributions, recent work has introduced hierarchical memory management systems that enable dynamic memory allocation and sparse gradient updates, further optimizing memory utilization64. Moreover,18 proposed a memory-efficient fine-tuning approach employing block-wise optimization strategies that dynamically adjust memory allocation, achieving superior performance across several benchmarks. In a similar vein,19 explored low-rank factorization techniques to compress model parameters effectively while preserving model accuracy. Collectively, these innovations support the deployment of large-scale models on resource-limited devices while ensuring computational efficiency and maintaining strong performance.

In contrast to previous techniques, our proposed SK-Tuning method introduces a novel strategy that utilizes authentic, semantically meaningful prompts or prefix texts during adapter training. This method capitalizes on the zero-shot capabilities of large language models (LLMs) and their fundamental understanding of linguistic semantics. As a result, SK-Tuning is designed to achieve faster convergence and enhance task performance. Through extensive experimental evaluations and comprehensive comparative analysis, we establish the superiority of SK-Tuning over existing fine-tuning techniques. These findings highlight the significant potential of SK-Tuning to advance fine-tuning methodologies in the field of NLP.

Background study

Prefix and prompt tuning are methods of adapting large pretrained language models to specific tasks or datasets with minimal updates to the model parameters. These techniques have gained prominence due to their efficiency and effectiveness, particularly in scenarios where updating the entire model is computationally expensive or impractical.

Prefix tuning

Prefix tuning involves prepending a sequence of tunable vectors, known as the prefix, to the input of each layer of the transformer model. Let us denote the transformer model as a function \(F\) that maps an input sequence \(x\) to an output \(y\), i.e., \(y = F(x)\). In prefix tuning, this mapping is modified to \(y = F(p \oplus x)\), where \(p\) represents the prefix and \(\oplus\) denotes concatenation.

Mathematically, if we consider a transformer model with \(K\) layers, where each layer \(k\) performs a transformation \(F_k\), the modified transformation with prefix becomes:

$$\begin{aligned} F'_k(p_k, x) = F_k(p_k \oplus x) \end{aligned}$$
(1)

where \(p_k\) is the prefix for layer \(k\). The prefixes \(\{p_1, p_2, ..., p_K\}\) are learnable parameters and are optimized during the training process.
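To make the per-layer modification in Eq. (1) concrete, the following PyTorch sketch prepends a trainable prefix to the hidden states entering each frozen layer. It is a minimal illustration of the formulation above rather than a specific library implementation; the layer interface and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class PrefixedEncoder(nn.Module):
    """Minimal sketch of Eq. (1): F'_k(p_k, x) = F_k(p_k ⊕ x), applied layer by layer."""
    def __init__(self, layers, prefix_len=20, d_model=768):
        super().__init__()
        self.layers = nn.ModuleList(layers)  # frozen transformer layers F_1..F_K
        # one trainable prefix p_k per layer
        self.prefixes = nn.ParameterList(
            [nn.Parameter(torch.randn(prefix_len, d_model) * 0.02) for _ in layers]
        )

    def forward(self, x):                                  # x: (batch, seq_len, d_model)
        for layer, p in zip(self.layers, self.prefixes):
            p_batch = p.unsqueeze(0).expand(x.size(0), -1, -1)
            h = layer(torch.cat([p_batch, x], dim=1))      # F_k(p_k ⊕ x)
            x = h[:, p.size(0):, :]                        # keep only input positions for the next layer
        return x
```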

Prompt tuning

Prompt tuning, on the other hand, leverages the concept of natural language prompts. Here, the model is fed a prompt that guides it to generate outputs tailored to a specific task. In mathematical terms, given a pretrained model \({\mathscr {M}}\), the objective is to find an optimal prompt \(p^*\) such that the model’s performance on a task \(T\) is maximized when the prompt is used as an input.

Formally, for a task \(T\) and a set of task-specific examples \(\{(x_i, y_i)\}\), prompt tuning aims to optimize the following:

$$\begin{aligned} p^* = \arg \max _p \sum _i \log {\mathscr {M}}(y_i | p \oplus x_i) \end{aligned}$$
(2)

This objective function maximizes the likelihood of the correct outputs \(y_i\) given the inputs \(x_i\) concatenated with the optimal prompt \(p^*\). Unlike prefix tuning, prompt tuning does not modify the internal workings of the model but rather influences its outputs through carefully crafted input sequences.
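As a point of reference for the objective in Eq. (2), the sketch below shows the common "soft prompt" realization of prompt tuning, in which only a small matrix of prompt embeddings is trained and prepended to the input embeddings of a frozen model. Names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prompt p prepended to input embeddings; the language model itself stays frozen."""
    def __init__(self, prompt_len=20, d_model=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, input_embeds):                       # input_embeds: (batch, n, d_model)
        p = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([p, input_embeds], dim=1)         # p ⊕ x_i, fed to the frozen model
```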

SK-tuning procedure

Fig. 1

SK-Tuning approaches for Prefix (left) and Prompt (right). The dashed line represents the optimization path during the backward pass to the trainable adapter. Notably, in the context of prompt tuning (right), the prohibition sign indicates that the forward pass is discontinued beyond a certain point. This is because we exclusively initialize layer-specific semantic information for the prompt, making continuation of the forward pass through the remaining layers unnecessary.

Problem definition

Consider a downstream task that utilizes a pretrained LLM denoted as \({\mathscr {M}}\). Let the training dataset be represented as \({\mathscr {D}} = \{(x_i, y_i)\}_{i=1}^N\), where each training example \((x_i, y_i)\) consists of an input text \(x_i\) and its associated true label \(y_i\).

Our primary objective is to fine-tune the pretrained LLM \({\mathscr {M}}\) for these downstream tasks while keeping the model parameters \(\Theta\) frozen. Specifically, we aim to achieve this fine-tuning by introducing a small number of additional parameters, referred to as adapters or transformations, which enable task-specific adaptation without the need to retrain the entire LLM. Each instance in our system is a pair \((x_i, y_i)\) that defines a specific task configuration.

Mathematically, our goal can be expressed as follows:

Given a pretrained LLM with parameters \(\Theta\) and a dataset \({\mathscr {D}}\), we seek to find a set of trainable parameters \(\Phi\) for the adapters or transformations, such that:

$$\begin{aligned} \Phi ^* = \arg \min _\Phi {\mathscr {L}}({\mathscr {M}}_{\Theta , \Phi }, {\mathscr {D}}) \end{aligned}$$
(3)

where \({\mathscr {M}}_{\Theta , \Phi }\) represents the fine-tuned model with frozen parameters \(\Theta\) and trainable adapters/transformations \(\Phi\), and \({\mathscr {L}}\) is a task-specific loss function that quantifies the alignment between model predictions and true labels across the training dataset \({\mathscr {D}}\).

Our objective is to correctly predict the labels \(y_i\) for the corresponding input texts \(x_i\) by training a small number of parameters (\(\Phi\)) without altering the pretrained model’s core architecture or parameters (\(\Theta\)). This approach aims to achieve parameter efficiency while tailoring the LLM to specific downstream tasks.

SK-tuning for prefix

SK-Tuning for Prefix enhances the versatility and performance of the pretrained LLM for downstream tasks by judiciously incorporating semantic knowledge from prefixes into the fine-tuning process. In the context of prefix-tuning a pretrained LLM \({\mathscr {M}}\), traditionally, a mapping function from virtual trainable tokens to the LLM’s layer representation is employed to generate the layer’s trainable parameters. However, in our proposed SK-Tuning approach, we adopt a different strategy. We leverage the power of a pretrained LLM \({\mathscr {M}}\), with its parameters frozen, to directly acquire semantic knowledge embeddings from the prefix tokens.

Let \(p\) denote a prefix comprising a sequence of semantic knowledge tokens of length \(l\), and let \(d\) be the hidden dimension of the LLM. Our objective is to extract the semantic hidden representation of the given input prefix \(p\) from each layer of the LLM. Let \(m\) represent the total number of layers, including the attention mechanisms; for each layer \(j\), we obtain its hidden representation \(h_j^p\). The representation is obtained as follows:

$$\begin{aligned} h^p = {\mathscr {M}}_{\Theta _{\text {frozen}}}(p) \in {\mathbb {R}}^{l \times d}. \end{aligned}$$
(4)

Next, we introduce a trainable adapter \({\mathscr {F}}\), parameterized by \(\Phi\), which takes \(h^p\) as input and yields a per-layer semantic projection \(z\):

$$\begin{aligned} z = {\mathscr {F}}_{\Phi }(h^p) \in {\mathbb {R}}^{m \times l \times d}. \end{aligned}$$
(5)

Now, we possess \(z\) as the semantic representation of the prefix tokens for every layer of \({\mathscr {M}}\). While processing the input \(x_i\) in \({\mathscr {M}}_{\Theta }\), we concatenate \(z_j\), for each \(j \in \{1, \ldots, m\}\), to the input of the \(j\)-th layer of \({\mathscr {M}}_{\Theta }\). This operation allows the \(j\)-th layer to access the corresponding semantic information from the prefix text.

Now, if \(r_i\) represents the final hidden output of \(x_i\), we can define:

$$\begin{aligned} r_i = {\mathscr {M}}_{\Theta _{\text {frozen}}}(x_i, z). \end{aligned}$$
(6)

Consider a task-specific module \({\mathscr {C}}\), parameterized by \(\zeta\), which embodies a downstream task:

$$\begin{aligned} o_i = {\mathscr {C}}_{\zeta }(r_i) \end{aligned}$$
(7)

Here, \(o_i\) represents the output of our task.

Our training objective aims to minimize the loss function \({\mathscr {L}}\), which quantifies the discrepancy between \(o_i\) and the target label \(y_i\), thereby indicating whether \(o_i\) correctly represents the label for \(x_i\):

$$\begin{aligned} \min _{\Phi , \zeta } {\mathscr {L}}\left( {\mathscr {C}}_{\zeta }({\mathscr {M}}_{\Theta _{\text {frozen}}}(x_i, {\mathscr {F}}_{\Phi }({\mathscr {M}}_{\Theta _{\text {frozen}}}(p)))), y_i\right) . \end{aligned}$$
(8)

This approach allows the fine-tuning process to concentrate explicitly on the representation and comprehension of labels, while simultaneously harnessing the intrinsic knowledge embedded within \(\Theta\). The adjustment of parameters \(\Phi\) and \(\zeta\) empowers the model to further refine its ability to map textual inputs to their corresponding labels.
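A compact sketch of this forward pass, following Eqs. (4)-(8), is given below. It is illustrative only: `frozen_lm` stands for the frozen backbone, the final hidden state over the prefix is taken as one reading of \(h^p\), and the `layer_context` argument is a hypothetical hook for injecting the per-layer projection \(z_j\); the actual injection mechanism depends on the underlying library.

```python
import torch
import torch.nn as nn

class PrefixAdapter(nn.Module):
    """Trainable adapter F_Phi: maps h^p (l x d) to per-layer projections z (m x l x d)."""
    def __init__(self, d_model, n_layers):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])

    def forward(self, h_p):                              # h_p: (l, d)
        return torch.stack([f(h_p) for f in self.proj])  # z: (m, l, d), Eq. (5)

def sk_prefix_forward(frozen_lm, adapter, task_head, prefix_ids, input_ids):
    with torch.no_grad():                                # Eq. (4): frozen pass over the prefix text
        h_p = frozen_lm(prefix_ids, output_hidden_states=True).hidden_states[-1][0]
    z = adapter(h_p)                                     # Eq. (5): trainable semantic projection
    r = frozen_lm(input_ids, layer_context=z)            # Eq. (6): hypothetical hook injecting z_j at layer j
    return task_head(r.last_hidden_state[:, 0])          # Eq. (7): task-specific output o_i
```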

SK-tuning for prompt

SK-Tuning for prompts involves a systematic process of semantic knowledge embedding, trainable adapter integration, concatenation, and a training objective. This approach allows for fine-tuning the pretrained LLM to effectively leverage semantic knowledge from prompts for improved performance in various downstream tasks. In the SK-Tuning framework for prompts, we focus on enhancing the capabilities of a pretrained LLM (\({\mathscr {M}}\)) by incorporating semantic knowledge from sequential prompt tokens, denoted as \(p\), of length \(l\). Let \({\mathscr {E}}\) represent the token embedding layer of \({\mathscr {M}}\), and consider \(e_p \in {\mathbb {R}}^{l \times d}\) and \(e_{x_i} \in {\mathbb {R}}^{n \times d}\) as the semantic embeddings for prompt \(p\) and input text \(x_i\), respectively. Here, \(n\) is the sequence length of input \(x_i\).

To obtain the semantic representation of the prompt \(p\) and input text \(x_i\), we utilize the pretrained token embedding layer \({\mathscr {E}}\) as follows:

$$\begin{aligned} e_p = {\mathscr {E}}(p) \thicksim {\mathscr {M}}_{\Theta _{\text {frozen}}} \end{aligned}$$
(9)

and

$$\begin{aligned} e_{x_i} = {\mathscr {E}}(x_i) \thicksim {\mathscr {M}}_{\Theta _{\text {frozen}}} \end{aligned}$$
(10)

This operation yields \(e_p\), which encapsulates the semantic information of the prompt, while \({\mathscr {M}}_{\Theta _{\text {frozen}}}\) ensures that the model parameters remain frozen during this process.

To further enhance the representation of the prompt, we introduce a trainable adapter, denoted as \({\mathscr {G}}\), which is parameterized by \(\gamma\). This adapter takes \(e_p\) as input and produces an updated embedding \(e_p^\prime\) as follows:

$$\begin{aligned} e_p^\prime = {\mathscr {G}}_{\gamma }(e_p) \in {\mathbb {R}}^{l \times d} \end{aligned}$$
(11)

The adapter \({\mathscr {G}}_{\gamma }\) serves as a mechanism to refine the semantic knowledge captured in \(e_p\) according to the specific downstream task requirements, allowing for fine-tuning without modifying the frozen model parameters.

The task head, denoted as \({\mathscr {C}}\), is designed to incorporate both the enhanced prompt representation \(e_p^\prime\) and the semantic embeddings of the input text \(e_{x_i}\). We achieve this through concatenation:

$$\begin{aligned} o_i = {\mathscr {C}}_{\zeta }({\mathscr {M}}_{\Theta _{\text {frozen}}}(e_p^\prime \oplus e_{x_i})) \end{aligned}$$
(12)

Here, \(\oplus\) represents the concatenation operation, and \(o_i\) serves as the output for the downstream task, allowing the model to leverage both prompt and input text information effectively.

The training objective for SK-Tuning of the prompt involves minimizing a loss function \({\mathscr {L}}\). This loss function quantifies the difference between the predicted output and the target label \(y_i\), reflecting the model’s performance on the specific task:

$$\begin{aligned} \min _{\gamma , \zeta } {\mathscr {L}}\left( {\mathscr {C}}_{\zeta }({\mathscr {M}}_{\Theta _{\text {frozen}}}({\mathscr {G}}_{\gamma }(e_p) \oplus e_{x_i})), y_i\right) \end{aligned}$$
(13)

Here, \(\gamma\) and \(\zeta\) denote the parameters of the adapter \({\mathscr {G}}\) and the task head \({\mathscr {C}}\) respectively.
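The sketch below traces this forward pass (Eqs. (9)-(13)): only the adapter \({\mathscr {G}}_{\gamma }\) and the task head \({\mathscr {C}}_{\zeta }\) are trainable, while the embedding layer and the LLM body remain frozen. Passing `inputs_embeds` follows the Hugging Face Transformers convention; the remaining names are illustrative assumptions.

```python
import torch

def sk_prompt_forward(frozen_lm, embed, adapter, task_head, prompt_ids, input_ids):
    with torch.no_grad():
        e_p = embed(prompt_ids)                      # Eq. (9): frozen prompt embedding
        e_x = embed(input_ids)                       # Eq. (10): frozen input-text embedding
    e_p_prime = adapter(e_p)                         # Eq. (11): trainable refinement G_gamma(e_p)
    fused = torch.cat([e_p_prime, e_x], dim=1)       # Eq. (12): e_p' ⊕ e_x
    r = frozen_lm(inputs_embeds=fused)               # frozen forward pass over the fused sequence
    return task_head(r.last_hidden_state[:, 0])      # task output o_i
```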

Algorithms

In this section, we describe two key algorithms that constitute the core of our proposed SK-Tuning approach for enhancing the fine-tuning of LLMs in the context of specific downstream tasks.

SK-tuning for prefix

The first algorithm, outlined in Algorithm 1 (SK-Tuning for Prefix), details the process of incorporating semantic knowledge from prefixes into the fine-tuning of a pretrained LLM. The algorithm takes as input a pretrained language model \({\mathscr {M}}\) with frozen parameters \(\Theta\), a prefix text \(p\), and a dataset \(\{(x_i, y_i)\}_{i=1}^N\). The trainable parameters \(\Phi\) and \(\zeta\) are initialized. For each input example \((x_i, y_i)\), the prefix text is processed through the frozen LLM to obtain \(h^p\), the hidden representation from each layer. This representation is then transformed using the trainable adapter \({\mathscr {F}}_{\Phi }\) to yield \(z\). Subsequently, the input text \(x_i\) is processed, incorporating the generated \(z\) for task-specific adaptation. Finally, the classification head \({\mathscr {C}}_{\zeta }\) computes the output \(o_i\) for the downstream task, the loss \({\mathscr {L}}(o_i, y_i)\) is computed, and the trainable parameters \(\Phi\) and \(\zeta\) are updated iteratively to minimize the loss.

Algorithm 1

SK-tuning for prefix

SK-tuning for prompt

The second algorithm, described in Algorithm 2 (SK-Tuning for Prompt), focuses on leveraging semantic knowledge from sequential prompt tokens to enhance fine-tuning. It takes as input a pretrained language model \({\mathscr {M}}\) with frozen parameters \(\Theta\), a prompt text \(p\), and a dataset \(\{(x_i, y_i)\}_{i=1}^N\). The trainable parameters \(\gamma\) and \(\zeta\) are initialized. For each input example \((x_i, y_i)\), the embeddings of the prompt text \(p\) and the input text \(x_i\) are obtained through the pretrained token embedding layer \({\mathscr {E}}\) while the core LLM parameters remain frozen. The prompt embedding \(e_p\) is then refined into an enhanced representation \(e_p^\prime\) using the trainable adapter \({\mathscr {G}}_{\gamma }\), and this representation is concatenated with the text embedding \(e_{x_i}\). The classification head \({\mathscr {C}}_{\zeta }\) computes the output \(o_i\) for the downstream task. As in the previous algorithm, the loss \({\mathscr {L}}(o_i, y_i)\) is computed, and the trainable parameters \(\gamma\) and \(\zeta\) are updated iteratively to minimize the loss.

Algorithm 2

SK-tuning for prompt
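To make the shared optimization step in Algorithms 1 and 2 concrete, the following sketch shows an illustrative training loop in which only the adapter and task-head parameters are optimized while the LLM parameters \(\Theta\) stay frozen. The forward function, optimizer settings, and loss are assumptions for illustration rather than the exact implementation.

```python
import torch

def train_sk_tuning(forward_fn, adapter, task_head, loader, epochs=5, lr=1e-3):
    # forward_fn is either sk_prefix_forward or sk_prompt_forward, partially applied
    params = list(adapter.parameters()) + list(task_head.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr, weight_decay=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x_i, y_i in loader:
            o_i = forward_fn(x_i)            # compute task output o_i
            loss = loss_fn(o_i, y_i)         # L(o_i, y_i)
            loss.backward()                  # gradients reach only Phi/gamma and zeta
            optimizer.step()
            optimizer.zero_grad()
```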

Experiments

Experimental setup

Our experiments use a computational setup with two NVIDIA H100 GPUs (80 GB VRAM each) and an Intel® Xeon® Gold 6448Y 2.1 GHz 32-core processor. The system includes 128 GB of RAM and a Dell 7.68 TB Enterprise NVMe Read Intensive Drive, providing the computational power and storage needed for efficient model training and evaluation.

We implemented our experiments using the PyTorch65 deep learning framework. Additionally, we leveraged the Transformers library developed by Hugging Face66, which offers a comprehensive set of tools and pretrained models for NLP tasks, facilitating the training and evaluation of LLMs on a variety of datasets.

The combination of these resources and software frameworks allowed us to conduct extensive experiments, enabling us to assess the performance and effectiveness of our proposed SK-Tuning approach across a range of downstream tasks.

LM results

Datasets: We evaluate SK-Tuning on CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, and RTE from the GLUE benchmark67. In Table 1 we report the Matthews correlation coefficient for CoLA, accuracy/F1 score for MRPC and QQP, Pearson/Spearman correlation for STS-B, average matched accuracy for MNLI, and accuracy for the other NLU tasks.

Table 1 Performance comparison of RoBERTa models on GLUE tasks: metrics include MCC for CoLA, accuracy for SST-2, accuracy/F1-score for MRPC and QQP, Pearson/Spearman correlation for STS-B, and Accuracy for MNLI, QNLI, and RTE.

Model selection & hyperparameters: For the GLUE benchmark, we fine-tune RoBERTa-base (\(RoB_B\), 125M parameters) and RoBERTa-large (\(RoB_L\), 355M parameters) from68. Dropout, attention dropout, and weight decay rates are uniformly set to 0.2 across all tasks. The initial learning rate was \(1 \times 10^{-4}\), subsequently reduced to \(2 \times 10^{-5}\) and \(2 \times 10^{-6}\). All datasets were trained for 10 epochs.
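For reference, a minimal setup for these GLUE runs might look like the sketch below, loading RoBERTa from Hugging Face Transformers with the dropout values stated above; the model identifier and label count are illustrative.

```python
from transformers import RobertaConfig, RobertaForSequenceClassification

config = RobertaConfig.from_pretrained(
    "roberta-base",                        # or "roberta-large" for RoB_L
    hidden_dropout_prob=0.2,               # dropout rate used across all tasks
    attention_probs_dropout_prob=0.2,      # attention dropout
    num_labels=2,                          # e.g. binary tasks such as SST-2
)
model = RobertaForSequenceClassification.from_pretrained("roberta-base", config=config)
```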

Results: Table 1 presents a detailed performance comparison of various parameter-efficient fine-tuning (PEFT) methods applied to two versions of the RoBERTa model on GLUE tasks, highlighting the SK-Tuning methods as particularly effective. These methods achieve competitive or superior performance across several metrics while using significantly fewer parameters, as few as 0.60M for \(RoB_B\) and 1.02M for \(RoB_L\). Notably, SK-Tuning (Prompt) and SK-Tuning (Prefix) consistently perform well across different task types, such as SST-2 and QQP, demonstrating a compelling balance between model efficiency and task performance. This efficiency makes SK-Tuning an attractive option for deployment in resource-constrained environments or where fast inference is crucial. The results underscore the potential of small, well-tuned models to match or even surpass the performance of larger, fully fine-tuned counterparts, suggesting a promising direction for future research in NLP model optimization.

LLM results

We conducted experiments on a diverse set of datasets to evaluate the performance of SK-Tuning across various NLP tasks, including sequence classification, token classification, and NLI. Our goal was to compare the performance of SK-Tuning with existing models on these tasks. Subsequently, we provide extensive details on the datasets utilized in our experiments.

Classification datasets

Sequence classification, a common task in NLP, involves labeling or categorizing text. In our study, we used four datasets: CoLA and SST-2 from the GLUE benchmark, along with the Emotion dataset and the Fake News Filipino dataset.

  • CoLA (https://huggingface.co/datasets/glue/viewer/cola/): The Corpus of Linguistic Acceptability (CoLA)69 consists of 10,657 sentences curated from 23 linguistic publications. Each sentence has been annotated by its original author for grammaticality or acceptability. The publicly available version of the dataset includes 9,594 sentences for training and validation, while 1,063 sentences are reserved for a separate held-out test set.

  • SST-2 (https://huggingface.co/datasets/sst2): The Stanford Sentiment Treebank is a dataset featuring fully labeled parse trees, enabling a comprehensive examination of how sentiment compositionally influences language. Derived from the dataset presented by Pang and Lee70, the corpus comprises 11,855 individual sentences extracted from movie reviews. Parsed with the Stanford parser, the dataset encompasses a total of 215,154 distinct phrases derived from these parse trees, each annotated by three human judges. Binary classification on complete sentences (negative or somewhat negative versus somewhat positive or positive, with neutral sentences excluded) is denoted by the dataset acronym SST-2. The publicly available version includes 67,349 sentences for training, 872 for validation, and 1,821 for testing.

  • Emotion (https://huggingface.co/datasets/dair-ai/emotion): Emotion is a dataset71 of English Twitter messages labeled with six basic emotions: anger, fear, joy, love, sadness, and surprise. The publicly available version includes 16,000 sentences for training, 2,000 for validation, and 2,000 for testing.

  • Fake News Filipino (https://huggingface.co/datasets/fake_news_filipino): A low-resource fake news detection dataset72 for the Filipino language. Comprising 3,206 labeled news samples, evenly divided between authentic and fabricated content, this dataset represents a pioneering effort. We partitioned the dataset into 70%, 10%, and 20% for training, validation, and testing, respectively.

Token classification datasets

Token classification involves labeling individual tokens within a sentence. Named Entity Recognition (NER) is a prevalent token classification task, aiming to assign labels to entities in a sentence, such as people, locations, or organizations. We used three token classification datasets: CoNLL 2003, NCBI Disease, and WikiAnn.

  • CoNLL 2003 (https://huggingface.co/datasets/conll2003): CoNLL-2003 is a named entity recognition dataset73 introduced within the CoNLL-2003 shared task on language-independent named entity recognition. The dataset comprises eight files covering two languages: English and German. We used the English portion and the “ner tags” as labels in our experiments. The publicly available version includes 14,041 training examples, 3,250 validation examples, and 3,453 test examples.

  • NCBI Disease (https://huggingface.co/datasets/ncbi_disease): The dataset74 includes annotations for disease names and concepts from the NCBI disease corpus, a compilation of 793 PubMed abstracts extensively annotated at both the mention and concept levels. There are three labels: 0 indicates no disease mention, 1 marks the first token of a disease, and 2 marks subsequent disease tokens. The publicly available version includes 5,433 training examples, 924 validation examples, and 941 test examples.

  • WikiAnn (https://huggingface.co/datasets/wikiann): WikiANN, also known as PAN-X, is a multilingual dataset75 for named entity recognition. It comprises Wikipedia articles annotated with location (LOC), person (PER), and organization (ORG) tags using the IOB2 format. This specific version aligns with the balanced train, validation, and test splits of 20,000, 10,000, and 10,000, respectively introduced by Rahimi et al. (2019), covering 176 out of the 282 languages featured in the original WikiANN corpus.

Entailment datasets

NLI involves determining the truth (entailment), falsity (contradiction), or undetermined status (neutral) of a “hypothesis” based on a provided “premise.” We used three datasets for this task: RTE, MRPC, and SNLI.

  • RTE (https://huggingface.co/datasets/glue/viewer/rte): The Recognizing Textual Entailment (RTE) datasets originate from a series of annual challenges on textual entailment. The benchmark creators amalgamated data from RTE176, RTE277, RTE378, and RTE579. Examples are constructed from news and Wikipedia text. To maintain consistency, the benchmark creators converted all datasets into a two-class split, collapsing neutral and contradiction into “not entailment” for the three-class datasets. The publicly available version includes 2,490 training examples, 277 validation examples, and 3,000 test examples.

  • MRPC (https://huggingface.co/datasets/glue/viewer/mrpc): The Microsoft Research Paraphrase Corpus (MRPC)80 comprises 5,801 pairs of sentences extracted from newswire articles. Human annotators have labeled each pair to indicate whether it is a paraphrase or not. The dataset is split into a training subset of 4,076 sentence pairs (2,753 of which are paraphrases) and a test subset of 1,725 pairs (1,147 of which are paraphrases).

  • SNLI (https://huggingface.co/datasets/snli): The Stanford NLI (SNLI) corpus81 is a collection of 570,000 human-written English sentence pairs labeled for balanced classification with the labels entailment, contradiction, and neutral. The corpus is designed to support the task of NLI. The publicly available version includes 550,152 training examples, 10,000 validation examples, and 10,000 test examples.

These datasets collectively cover a wide spectrum of NLP tasks, enabling comprehensive evaluations of SK-Tuning’s performance across various domains and challenges.

Large language models

In our analysis, we utilized multiple Large Language Models (LLMs) to obtain extensive and detailed results. Specifically, we employed Bloom 7b, Llama2 7b, Mistral 7b, Falcon 7b, and Phi-2 2.7b, each offering unique strengths and capabilities that complemented one another.

  • Bloom: A 7B parameter LLM from BigScience, trained on an extensive corpus of text and code. Bloom displays robust performance on various NLP tasks and offers several variants, including Bloom Text-to-Text and Bloom Code82.

  • Llama2: Meta AI has introduced Llama 2, its most advanced LLM to date. Llama 2 showcases a diverse array of capabilities and potential applications, with model sizes ranging from 7 billion to 70 billion parameters. This release provides access to both model weights and initial code for pretrained and fine-tuned Llama models, including variants such as Llama Chat (specialized for dialogue) and Code Llama (optimized for programming tasks)83.

  • Mistral: Mistral 7B is a freely available, open-source language model comprising 7.3 billion parameters that demonstrates exceptional performance. Released in September 2023, it exhibits competitive results in comparison to Meta’s LLaMA models, outperforming the 13B version on all benchmarks evaluated and equaling the 34B version on numerous metrics. Developed using the transformers architecture and accessible via BitTorrent and Hugging Face, Mistral 7B presents a robust and accessible option for researchers and developers seeking a high-performing LLM84.

  • Falcon: The Falcon Large Language Model (LLM) is a generative LLM designed to advance applications and use cases for future-proofing our world. Currently, the Falcon 180B, 40B, 7B, and 1.3B parameter artificial intelligence models, along with the high-quality REFINEDWEB dataset, constitute a comprehensive suite of offerings85.

  • Phi-2: Phi-2, the most recent small language model (SLM) developed by Microsoft Research, is a 2.7 billion parameter model that showcases superior reasoning and language understanding capabilities compared to its predecessors, Phi-1 and Phi-1.586. The model was trained on a diverse dataset comprising “textbook quality” web data and synthetic textbooks/exercises generated using GPT-3.5. Phi-2 exhibits exceptional performance in various tasks, including Python code generation87. It is noteworthy that Phi-2 surpasses the performance of models up to 25 times larger in size. Furthermore, Phi-2 has been released under an MIT License, permitting its use in commercial applications.

Baseline methods

We established the following baseline methods to evaluate the performance of our proposed approach:

  • Full fine-tuning: This methodology88 involves the adjustment of all parameters within the pretrained language model to adapt it to the specific task at hand. It functions as a comprehensive adaptation approach; however, it can be computationally intensive.

  • Prefix tuning: This lightweight method89 introduces trainable continuous vectors termed “prefixes” to the input of each transformer layer, while the original model parameters remain fixed. Prefix-tuning is predicated on the concept of prompting in language models, enabling ensuing tokens to attend to this prefix as if it were composed of “virtual tokens”. It presents a more efficient alternative to complete fine-tuning, particularly in low-data scenarios.

  • Prompt tuning: This method90 employs natural language prompts called “soft prompts” to guide the model’s behavior without modifying its internal parameters. This provides a flexible method to adapt models to different tasks without additional training.

  • P-tuning: This method91 is an optimized prompt tuning approach that exhibits efficacy across a diverse spectrum of model scales and natural language tasks. It addresses the suboptimal performance of prompt tuning when applied to pretrained models of typical size and aims to rectify its limitations, particularly its inefficacy on challenging sequence labeling tasks.

  • LoRA: LoRA20 (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that learns low-rank matrices to adapt the model while keeping most of its original parameters frozen. This study investigated LoRA with rank 2 and rank 4 to evaluate its capability to balance performance and efficiency; an illustrative configuration sketch appears after this list.

These baseline methods represent a diverse range of fine-tuning strategies, allowing us to measure the comparative performance of our proposed approach.
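As referenced in the LoRA item above, the following sketch shows how the rank-2 and rank-4 baselines could be configured with the Hugging Face peft library; the task type and the remaining values (alpha, dropout, target modules left at defaults) are assumptions that vary by model.

```python
from peft import LoraConfig, TaskType

# Rank-2 and rank-4 LoRA baselines; alpha/dropout values are illustrative.
lora_rank2 = LoraConfig(task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1)
lora_rank4 = LoraConfig(task_type=TaskType.SEQ_CLS, r=4, lora_alpha=16, lora_dropout=0.1)
```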

Evaluation metrics

Evaluation metrics measure the performance of a model on a specific dataset by comparing the model’s predictions with ground truth labels. Various tasks have specific metrics, and we used accuracy and F1 score in our experiments.

Accuracy

Accuracy is a metric that measures the overall correctness of a model’s predictions. It is calculated as the ratio of correct predictions to the total number of predictions made by the model:

$$\begin{aligned} \text {Accuracy} = \frac{\text {True Positives (TP) + True Negatives (TN)}}{\text {Total Predictions}} \end{aligned}$$

F1 score

The F1 score is the harmonic mean of precision and recall, providing a balanced measure considering both false positives and false negatives:

$$\begin{aligned} \text {F1 Score} = \frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision + Recall}} \end{aligned}$$

The F1 score ranges from 0 to 1, with higher values indicating better overall performance in terms of precision and recall.

In these formulas, TP, TN, FP, and FN denote the counts of true positives, true negatives, false positives, and false negatives, respectively, with Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
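For illustration, both metrics can be computed from predictions and gold labels as in the short sketch below; the use of scikit-learn and the weighted F1 averaging are assumptions, and any equivalent implementation works.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0]          # gold labels (toy example)
y_pred = [1, 0, 0, 1, 0]          # model predictions

accuracy = accuracy_score(y_true, y_pred)            # (TP + TN) / total predictions
f1 = f1_score(y_true, y_pred, average="weighted")    # harmonic mean of precision and recall
print(f"Accuracy: {accuracy:.3f}, F1: {f1:.3f}")
```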

Hyperparameters setting

In our experiments, we carefully selected hyperparameters to ensure consistent and effective training across various datasets and tasks.

We set the maximum sequence length to 128 for all datasets except RTE, where we did not impose a maximum sequence length.

Regarding learning rates, we employed the following values:

  • For the sequence classification datasets, the learning rate was set to \(1\times 10^{-3}\).

  • For the token classification datasets, a learning rate of \(1\times 10^{-5}\) was used.

  • For the NLI datasets, the learning rate was set to \(1\times 10^{-4}\).

In terms of the number of training epochs:

  • Sequence classification datasets were trained for 5 epochs.

  • Token classification datasets were trained for 10 epochs.

  • NLI datasets were trained for 10 epochs, with the exception of the SNLI dataset, which was trained for 2 epochs on each model.

For all our datasets, regardless of the task or tuning method (P-Tuning, Prefix Tuning, or Prompt Tuning), we consistently used 20 virtual tokens during training.
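For reference, the 20-virtual-token baselines can be expressed with the Hugging Face peft library roughly as follows; the sequence-classification task type is an assumption and changes for token classification.

```python
from peft import PrefixTuningConfig, PromptTuningConfig, PromptEncoderConfig, TaskType

# 20 virtual tokens for every baseline, as used across all datasets.
prefix_cfg  = PrefixTuningConfig(task_type=TaskType.SEQ_CLS, num_virtual_tokens=20)
prompt_cfg  = PromptTuningConfig(task_type=TaskType.SEQ_CLS, num_virtual_tokens=20)
ptuning_cfg = PromptEncoderConfig(task_type=TaskType.SEQ_CLS, num_virtual_tokens=20)
```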

We employ the Adaptive Moment Estimation with Weight Decay (ADAMW)92 optimizer for all experiments. The ADAMW optimizer is an improved version of the traditional ADAM optimizer, which incorporates weight decay directly into the optimization process to better handle regularization. Additionally, we set the weight decay value to 0.01 across all experiments to control regularization and ensure stable model training.

Result analysis

See Tables 2, 3, 4, 5 and 6.

Table 2 Sequence classification results for the bloom model.
Table 3 Sequence classification results for the Llama2 model.
Table 4 Sequence classification results for the Falcon model.
Table 5 Sequence classification results for the Mistral model.
Table 6 Sequence classification results for the Phi-2 model.

Sequence classification

In the realm of sequence classification, we conducted a comprehensive evaluation across various LLMs, including Bloom, Llama2, Falcon, Mistral, and Phi-2, employing different fine-tuning techniques. For each model, we examined the effectiveness of traditional approaches such as full fine-tuning, Prefix Tuning, Prompt Tuning, P-Tuning, Lora Rank 2, and Lora Rank 4, and compared them to our proposed SK-Tuning methods, for both Prefix and Prompt. Notably, SK-Tuning consistently outperforms traditional methods across different datasets, showcasing its superior efficiency and effectiveness. The performance of the various models on the different datasets is documented in Tables 2, 3, 4, 5, and 6.

On the “Fake News Filipino” dataset, SK-Tuning, especially when applied as SK-Tuning (Prefix), demonstrates remarkable performance improvements compared to traditional approaches. It achieves the highest accuracy and F1-score, emphasizing its capability to efficiently adapt LLMs to specific tasks while minimizing trainable parameters. On the “Emotion” dataset, SK-Tuning consistently outperforms other methods, indicating its robustness across different classification tasks. The same trend is observed on the “SST2” dataset, where SK-Tuning again achieves superior results. Lastly, on the “Cola” dataset, SK-Tuning (Prefix) and SK-Tuning (Prompt) consistently outperform other approaches, underscoring their potential for enhancing sequence classification tasks.

Comparatively, traditional methods like Prefix Tuning and Prompt Tuning, although efficient in terms of parameters compared to Fine-tuning, tend to lag behind SK-Tuning in terms of accuracy and F1-score. Furthermore, SK-Tuning requires fewer trainable parameters, making it an attractive choice for practitioners aiming to optimize performance while maintaining efficiency.

Token classification

In this comprehensive analysis of token classification across various datasets, we conducted an extensive evaluation of five distinct models: Bloom, Llama2, Falcon, Mistral, and Phi-2, while exploring a range of fine-tuning techniques to understand their impact on performance, as documented in Tables 7, 8, 9, 10 and 11. The datasets used for evaluation were CoNLL03, NCBI Disease, and WikiAnn, each representing different challenges in token classification.

Table 7 Token classification results for the Bloom model.
Table 8 Token classification results for the Llama2 model.
Table 9 Token classification results for the Falcon model.
Table 10 Token classification results for the Mistral model.
Table 11 Token classification results for the Phi-2 model.

First and foremost, we observed that Full Fine-tuning consistently achieved high accuracy across all models and datasets. However, it also required a substantial percentage of parameters, potentially making it less feasible for resource-constrained environments.

To address the trade-off between model efficiency and performance, we investigated several fine-tuning techniques. Prefix Tuning, Prompt Tuning, and P-Tuning, which involve introducing a small fraction of parameters, showcased mixed results. While these techniques achieved decent accuracy in some cases, they often lagged behind in terms of F1-score, indicating challenges in maintaining a balance between precision and recall.

Remarkably, Lora Rank 2 and Lora Rank 4, with a moderate percentage of parameters, consistently delivered a strong performance, especially in terms of the F1-score. These results underscore the importance of considering the architecture of the model when optimizing for token classification tasks, with Lora Rank models demonstrating their effectiveness.

Finally, SK-Tuning techniques, both Prefix and Prompt variants, stood out as noteworthy approaches. They required an extremely minimal percentage of additional parameters yet yielded competitive accuracy and remarkable F1 scores. This suggests that these techniques have the potential to strike a favorable balance between model efficiency and task effectiveness.

Entailment detection

The results of entailment detection using various models, including Bloom, Llama2, Falcon, Mistral, and Phi-2, are presented in Tables 12, 13, 14, 15 and 16. Across all three datasets (RTE, MRPC, SNLI), full fine-tuning consistently achieves the highest accuracy and F1-score, with the Bloom and Mistral models demonstrating remarkable results. This underscores the value of fine-tuning all of the model’s parameters to adapt to specific entailment tasks, as it allows the model to capture intricate patterns and nuances in the data.

Table 12 Entailment classification results for the Bloom model.
Table 13 Entailment classification results for the Llama2 model.
Table 14 Entailment classification results for the Falcon model.
Table 15 Entailment classification results for the Mistral model.
Table 16 Entailment classification results for the Phi-2 model.

In contrast, prefix tuning and prompt tuning techniques, which involve fine-tuning only a small fraction of the model’s parameters, tend to yield significantly lower accuracy and F1-scores. This suggests that limiting parameter updates to specific prefixes or prompts may not be sufficient for optimal entailment classification performance, as these methods may struggle to capture the diverse and complex relationships present in the data.

The Lora Rank 2 and Lora Rank 4 models deliver competitive results, particularly evident in the RTE dataset, where they outperform other techniques. This indicates that techniques like Lora Rank, which involve a moderate amount of parameter modification, can strike a balance between model adaptation and computational efficiency.

However, SK-Tuning, whether applied to prefixes or prompts, consistently performs well across datasets, demonstrating its effectiveness as an alternative fine-tuning strategy. SK-Tuning achieves strong results with a minimal increase in the number of parameters, making it a promising approach for entailment classification tasks where computational resources are a concern.

Ablation study

Efficiency

Figure 2 illustrates that SK-Tuning methods for Prefix and Prompt, demonstrate superior memory efficiency with the lowest memory cost among the compared PEFT methods, making them ideal for resource-constrained environments. Despite their minimal memory footprint, these methods maintain competitive training efficiency, balancing low parameter percentages with moderate training times, which highlights their effectiveness in achieving lightweight and fast fine-tuning. Compared to other methods like LoKr, LoHa, and LoRA, which show higher memory costs and varying degrees of training efficiency, SK-Tuning stands out as a robust approach that optimizes both memory and computational resources, making it particularly advantageous for scenarios where efficiency is paramount.

Fig. 2

Comparison of memory efficiency (left) and training efficiency (right) across various PEFT methods. S-Prefix and S-Prompt represent SK-Tuning applied to prefix tuning and prompt tuning, respectively. The left chart shows the memory cost in GB, highlighting the model weights and optimizations, while the right chart displays the percentage of parameters, total training time in hours, and iteration time per second.

Faster convergence with SK-tuning

In this section, we present an ablation study comparing the convergence speed and performance of SK-Tuning with traditional prompt and prefix tuning methods on three different downstream tasks: Token Classification, Sequence Classification, and NLI. We hypothesize that SK-Tuning, leveraging semantic knowledge, will lead to faster convergence due to the inherent zero-shot capabilities of LLMs93.

Accelerated convergence in token classification

In the context of token classification tasks, we conducted a comprehensive comparison between SK-Tuning and traditional tuning methods. We utilized two benchmark datasets, namely Wikiann and Conll03, both featuring token-level labels. Our primary objective was to analyze the convergence behavior, measured in terms of loss reduction, as training steps progressed.

Figure 3 visually presents the convergence trajectories for SK-Tuning and traditional methods. Notably, we observed a remarkable disparity in the convergence speed between these approaches. SK-Tuning, whether applied to prefixes or prompts, demonstrated a strikingly swift convergence compared to the conventional tuning method.

Fig. 3

Convergence comparison for token classification.

This accelerated convergence showcased in Fig. 3 serves as compelling evidence of the significant advantages brought about by the incorporation of semantic knowledge. It underscores the ability of SK-Tuning to facilitate rapid adaptation to the intricacies of token classification tasks, emphasizing the practical utility of this approach.

Accelerated convergence in sequence classification

For the evaluation of SK-Tuning in sequence classification tasks, we conducted a comparative analysis against traditional tuning methods. Our experimentation leveraged two benchmark datasets: Fake News and SST2, both featuring sequences with corresponding labels. Our primary objective was to assess the convergence performance, measured in terms of loss reduction, as the model underwent training iterations.

Figure 4 offers a visual representation of the convergence patterns observed during sequence classification. Notably, the results depicted in the figure demonstrate the accelerated convergence achieved with SK-Tuning when compared to conventional tuning methods.

Fig. 4

Convergence comparison for sequence classification.

The swift convergence illustrated in Fig. 4 underscores the significant advantages bestowed by the integration of semantic knowledge into the fine-tuning process. This enhancement enables the model to quickly adapt to the nuances of the specific sequence classification task, reaffirming the effectiveness of SK-Tuning in practical scenarios.

Accelerated convergence in NLI

In the realm of NLI tasks, we conducted a comparative analysis pitting SK-Tuning against traditional tuning methods. Our evaluation incorporated well-established datasets, including MRPC and SNLI, which consist of premise-hypothesis pairs and their corresponding entailment labels. The primary objective was to assess convergence speed, measured in terms of training steps.

Figure 5 visually illustrates the convergence dynamics observed during NLI tasks. Notably, the findings showcased in the figure reveal the expedited convergence achieved through SK-Tuning when compared to traditional tuning approaches.

Fig. 5
figure 5

Convergence comparison for NLI.

The swift convergence depicted in Fig. 5 underscores the substantial advantages conferred by the integration of semantic knowledge into the fine-tuning process. This augmentation enhances the model’s adaptability, enabling it to quickly grasp the nuances of NLI tasks and reaffirming the practical utility of SK-Tuning in advancing NLI model performance.

Our ablation study clearly demonstrates that SK-Tuning outperforms traditional prompt and prefix tuning methods in terms of convergence speed across a range of downstream tasks. The incorporation of semantic knowledge, along with the zero-shot capabilities of LLMs, contributes to faster task adaptation. Additionally, SK-Tuning consistently leads to better performance, as shown in subsequent sections.

Adapter layers

In this study, we investigate the impact of adapter layer complexity on the performance of fine-tuned models. Specifically, we analyze how increasing the complexity of adapter layers affects various factors, including the percentage of parameters, computational cost, and convergence speed. We conducted experiments using the Mistral 7B model on the SST2 dataset, and the results are presented in Table 17.

Table 17 Exploring the Trade-offs—adapter complexity vs. performance.

As shown in Table 17, increasing the number of adapter layers leads to a proportional increase in the number of trainable parameters. This added complexity comes at the cost of greater computational resources and slower convergence, while the resulting performance improvements remain marginal.

For instance, with a single adapter layer, the model has a relatively small number of trainable parameters, converges efficiently, and achieves high accuracy. As additional layers are introduced, the parameter count grows significantly, computational requirements escalate, and convergence slows substantially, yet the accuracy gained from these more complex adapters remains modest.

The observed trend suggests that as adapter-layer complexity increases, computational demands and training time also rise substantially. This can be attributed to the need for more extensive training to capture and leverage semantic information effectively.
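To make this trade-off concrete, the sketch below shows a generic bottleneck adapter whose depth can be varied and whose trainable parameters can be counted. The hidden and bottleneck sizes are assumptions for illustration and do not reproduce our exact adapter design.

```python
# Illustrative sketch (not the paper's exact architecture): a stacked bottleneck
# adapter used to examine the parameter/complexity trade-off from Table 17.
import torch.nn as nn

class StackedAdapter(nn.Module):
    def __init__(self, hidden_size=4096, bottleneck=64, num_layers=1):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, bottleneck),
                nn.ReLU(),
                nn.Linear(bottleneck, hidden_size),
            )
            for _ in range(num_layers)
        )

    def forward(self, hidden_states):
        for block in self.blocks:
            hidden_states = hidden_states + block(hidden_states)  # residual update
        return hidden_states

def trainable_params(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Each extra layer adds roughly 2 * hidden_size * bottleneck parameters:
# for n in (1, 2, 4):
#     print(n, trainable_params(StackedAdapter(num_layers=n)))
```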

Effect of prompt and prefix text

In this ablation study, we investigate the influence of prompt and prefix text length on the performance of SK-Tuning for sentiment classification using the SST-2 dataset. Our goal is to demonstrate that well-crafted prompt or prefix texts can outperform longer, less informative alternatives, despite the latter offering a larger number of trainable parameters.

We conducted experiments with various prompt and prefix texts and evaluated their corresponding accuracy on sentiment classification using the Mistral model, which has 7 billion parameters. Table 18 summarizes the results.

The results presented in Table 18 clearly illustrate that a concise and informative prompt text outperforms longer and less focused alternatives. Although longer prompts or prefixes provide more trainable parameters, our findings underscore the significance of crafting prompts that offer clear task instructions and context, resulting in enhanced model performance.

Table 18 Effect of prompt and prefix length on sentiment classification accuracy.

Furthermore, to visualize the relationship between the prompt text and the input text, we analyzed the attention scores of the last layer. Specifically, we used the prompt text ‘Classify the positive or negative sentiment of the text.’ in conjunction with the input texts ‘I love this movie.’ and ‘I hate this movie.’ The figures in Fig. 6 depict the attention scores, highlighting the sentimental connection between the prompt and the text. In the left figure, the prompt word ‘positive’ exhibits a strong attention score with ‘love’. Conversely, in the right figure, the prompt word ‘negative’ shows a pronounced attention score with ‘hate’. This observation suggests that including words such as ‘positive’ and ‘negative’ in the prompt text significantly aids the model in making informed decisions, thereby emphasizing the importance of crafting effective prompt texts.

Fig. 6
figure 6

Attentional Insights—exploring the Sentimental Connection between Prompt Text and Input Text. The left side of the figure reveals the attention scores between the prompt text, particularly the word ‘positive’, and the input text ‘I love this movie.’ On the right side, the attention scores depict the relationship between the prompt word ‘negative’ and the input text ‘I hate this movie.’ These attention patterns shed light on how well-crafted prompt texts enhance the model’s decision-making process.
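As a rough illustration of how such attention maps can be obtained, the sketch below extracts last-layer attention scores from a Hugging Face causal LM for the prompt and input texts above. The checkpoint name and the averaging over attention heads are assumptions, not the exact probing procedure used for Fig. 6.

```python
# Hedged sketch: inspecting last-layer attention between prompt and input tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

prompt = "Classify the positive or negative sentiment of the text."
text = "I love this movie."
inputs = tokenizer(prompt + " " + text, return_tensors="pt")

with torch.no_grad():
    attentions = model(**inputs, output_attentions=True).attentions  # one tensor per layer

# Average the heads of the last layer and read off the score between the token
# positions of 'positive' (in the prompt) and 'love' (in the input).
last_layer = attentions[-1].mean(dim=1)[0]          # shape: (seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# A high value at last_layer[index_of('love'), index_of('positive')] reflects
# the sentimental connection highlighted in Fig. 6.
```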

Discussion

We present a comprehensive comparison of our proposed SK-Tuning method with established parameter-efficient fine-tuning techniques, including Prompt Tuning, Prefix Tuning, P-Tuning, LoRA (rank 2), and LoRA (rank 4). Our evaluation encompassed a diverse set of downstream tasks across various domains within NLP. Notably, SK-Tuning for both prompts and prefixes consistently outperformed these traditional methods across several key metrics, including accuracy, F1 score, and parameter efficiency.

One of the key takeaways from our comparison is the remarkable performance gains achieved by SK-Tuning. In terms of accuracy and F1 score, SK-Tuning consistently delivered superior results across the spectrum of tasks. This improvement underscores the effectiveness of leveraging semantically meaningful information in the fine-tuning process, as opposed to relying on arbitrary virtual tokens.

Equally noteworthy is the efficiency of SK-Tuning. By minimizing the number of trainable parameters required for adaptation, our approach demonstrates a substantial reduction in computational resources while maintaining or even enhancing task performance. This efficiency is particularly crucial in practical applications, where resource constraints often play a significant role.

Another noteworthy aspect of our study is the extensive evaluation across five different pretrained LLMs: Bloom (7B), Falcon (7B), LLAMA2 (7B), Mistral (7B), and Phi2 (2.7B). Our results consistently indicate that SK-Tuning is a robust and versatile technique that can be applied to various LLM architectures, demonstrating its broad applicability and effectiveness across different model sizes and complexities.

Limitations

While SK-Tuning offers significant advantages in terms of performance and parameter efficiency, there are several key limitations that should be considered:

Training and inference time overhead

One of the primary limitations of SK-Tuning is the potential increase in training and inference time, since it utilizes the pretrained LLM twice during the forward pass: once to obtain semantic information from the prompt or prefix, and again to process the input data and produce the output. This dual usage of the LLM can lead to longer training and inference times.
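The sketch below illustrates, in simplified form, where this overhead comes from: one frozen pass encodes the prompt text and a second pass processes the task input. The helper names and the mean-pooling strategy are hypothetical and are only meant to convey the source of the extra cost, not our exact implementation.

```python
# Minimal sketch of the dual forward pass behind SK-Tuning's latency overhead.
import torch

@torch.no_grad()
def encode_prompt_semantics(llm, tokenizer, prompt_text):
    """First pass: the frozen LLM summarizes the real-text prompt (assumed pooling)."""
    ids = tokenizer(prompt_text, return_tensors="pt")
    hidden = llm(**ids, output_hidden_states=True).hidden_states[-1]
    return hidden.mean(dim=1)            # pooled semantic vector (assumption)

def forward_with_semantic_prefix(llm, adapter, tokenizer, prompt_text, input_text):
    """Second pass: the pooled prompt representation conditions the input pass."""
    prefix = adapter(encode_prompt_semantics(llm, tokenizer, prompt_text))
    ids = tokenizer(input_text, return_tensors="pt")
    # How `prefix` is injected (e.g. as prepended embeddings or past key values)
    # is method-specific; the point is simply that two LLM passes are required.
    return llm(**ids), prefix
```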

Dependency on pretrained models

SK-Tuning relies heavily on the quality and capabilities of the underlying pretrained LLM. The success of prompt or prefix text tuning is linked to the zero-shot capabilities of the LLM. If the pretrained model does not have a strong grasp of semantic knowledge or lacks certain linguistic skills, the effectiveness of SK-Tuning could be reduced; in such cases, significant additional training may be required before the model can accurately interpret the semantic meaning of the prompt or prefix text.

Semantic knowledge acquisition

The effectiveness of SK-Tuning depends on using prompts or prefixes that are meaningful. The more relevant the prompt is to the task, the better the performance, as described in the section “Effect of prompt and prefix text”. However, creating or finding these meaningful prompts can be difficult and might require specific domain knowledge. This challenge could limit how useful SK-Tuning is for certain tasks or datasets.

Tuning hyperparameters

Like other fine-tuning approaches, SK-Tuning involves hyperparameter tuning, including the design of the adapter architecture, the choice of semantic knowledge text, and the adjustment of task-specific modules. Identifying the optimal hyperparameters can be a time-consuming and computationally intensive process.

Conclusion

In conclusion, our work introduces SK-Tuning as a pioneering approach to fine-tuning LLMs for specific downstream tasks, with a strong emphasis on parameter efficiency. We have shown that traditional methods, relying on learnable virtual tokens in adapters while keeping the LLM’s core parameters frozen, often fall short in terms of both efficiency and performance.

SK-Tuning, on the other hand, revolutionizes the fine-tuning process by replacing arbitrary virtual tokens with real, semantically meaningful prefixes. This innovation allows LLMs to tap into their intrinsic semantic knowledge, significantly reducing the need for extensive training iterations. Our experimental results across a range of downstream tasks, including sequence classification, token classification, and NLI, provide compelling evidence that SK-Tuning outperforms traditional approaches. Notably, this improvement is achieved with a reduced number of trainable parameters, emphasizing the efficiency of our method.

By prioritizing parameter efficiency and harnessing the latent semantic understanding of LLMs, SK-Tuning opens up new possibilities for efficient model adaptation across various real-world applications. We believe that our approach holds great promise for advancing the field of NLP, offering researchers and practitioners a valuable tool for achieving enhanced task performance while optimizing computational resources. As LLMs continue to play a pivotal role in NLP, SK-Tuning represents a significant step forward in harnessing their full potential.