CN117094361A - Method for selecting a parameter-efficient fine-tuning module - Google Patents
Method for selecting a parameter-efficient fine-tuning module
- Publication number
- CN117094361A (Application No. CN202311352064.3A)
- Authority
- CN
- China
- Prior art keywords
- hidden state
- input sample
- parameter
- final
- efficient fine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 38
- 238000012512 characterization method Methods 0.000 claims abstract description 12
- 238000009966 trimming Methods 0.000 claims description 2
- 230000008569 process Effects 0.000 abstract description 5
- 230000009466 transformation Effects 0.000 description 4
- 230000000694 effects Effects 0.000 description 3
- 230000006870 function Effects 0.000 description 2
- 239000011159 matrix material Substances 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
Abstract
The invention provides a method for selecting a parameter-efficient fine-tuning module, which comprises the following steps: acquiring a candidate parameter-efficient fine-tuning module and a final hidden state of an input sample; constructing a parameter-efficient fine-tuning super-network and obtaining a final characterization of the input sample according to the final hidden state of the input sample, where the final characterization of the input sample is the product of a learning coefficient and the final hidden state of the input sample; and judging whether the learning coefficient is larger than a threshold value: if so, the candidate parameter-efficient fine-tuning module is selected; if not, it is discarded. The invention solves the problems of limited flexibility and high training cost of large-scale language model fine-tuning methods in the prior art.
Description
Technical Field
The invention relates to the technical field of language models, and in particular to a method for selecting a parameter-efficient fine-tuning module.
Background
Existing large-scale language models, though increasingly powerful, exhibit a certain general learning capability: by observing a few groups of examples, they can complete tasks they have never seen before to some extent. However, to support the different needs of customers in different application scenarios, it may not be possible to run exactly the same model for everyone, so the model may need to be customized on customer data. For example, a client may have privacy-protection requirements, so the input data is dialogue data that has been encrypted and whose text may look completely different; the model then needs to be customized in order to understand the dialogue content and formulate reply content. In order to use one large model base to meet the requirements of different customization tasks, a parameter-efficient fine-tuning (PEFT) method is needed.
Existing methods use only a single parameter-efficient fine-tuning method to fine-tune a large model for a given task, and the training process is computationally expensive.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a method for selecting a parameter-efficient fine-tuning module, solving the problems of limited flexibility and high training cost of large-scale language model fine-tuning methods in the prior art.
In order to achieve the above object, the present invention provides the following solutions:
A method of selecting a parameter-efficient fine-tuning module, comprising:
acquiring a candidate parameter-efficient fine-tuning module and a final hidden state of an input sample;
constructing a parameter-efficient fine-tuning super-network and obtaining a final characterization of the input sample according to the final hidden state of the input sample, where the final characterization of the input sample is the product of a learning coefficient and the final hidden state of the input sample;
judging whether the learning coefficient is larger than a threshold value: if so, the candidate parameter-efficient fine-tuning module is selected; if not, it is discarded.
Preferably, acquiring the candidate parameter-efficient fine-tuning module and the final hidden state of the input sample includes:
acquiring a first hidden state of the input sample;
obtaining a second hidden state of the input sample according to the first hidden state and a partial operation of the transformer layer;
obtaining a third hidden state of the input sample according to the second hidden state and the candidate parameter-efficient fine-tuning module;
and obtaining the final hidden state of the input sample according to the remaining operation of the transformer layer and the third hidden state of the input sample.
Preferably, the final hidden state of the input sample is expressed as:

$h'_i = f_2\big(m_i(f_1(h))\big)$

where $m_i$ is the candidate parameter-efficient fine-tuning module, $h$ is the first hidden state, $h'_i$ is the final hidden state, $f_1$ is the first functional expression (the transformer-layer operations before the module), and $f_2$ is the second functional expression (the remaining transformer-layer operations).
Preferably, the final characterization of the input sample is given by:

$\tilde{h}_i = a_i \cdot h'_i$

where $a_i$ is the learning coefficient.
Preferably, the learning coefficient is determined by a bernoulli random number, and the bernoulli random number is 0 or 1, wherein the probability of the bernoulli random number being 1 is 0.5.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
The invention provides a method for selecting a parameter-efficient fine-tuning module. By setting up a parameter-efficient fine-tuning super-network, multiple parameter-efficient fine-tuning methods are allowed to be used at the same time, and redundant parameter-efficient fine-tuning modules are eliminated according to the final hidden state of the input sample and a judgment on the learning coefficients. This reduces the training cost of the language model and lets different transformer layers select their own parameter-efficient fine-tuning modules, thereby obtaining better results.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of a method for selecting a parameter-efficient fine-tuning module according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of three known parameter-efficient fine-tuning methods according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of protection of the present invention.
The invention aims to provide a method for selecting a parameter-efficient fine-tuning module, which solves the problems of limited flexibility and high training cost of large-scale language model fine-tuning methods in the prior art.
In order that the above objects, features, and advantages of the present invention can be more readily understood, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the present invention provides a method for selecting a parameter-efficient fine-tuning module, which includes:
Step 100: acquiring a candidate parameter-efficient fine-tuning module and a final hidden state of an input sample;
Step 200: constructing a parameter-efficient fine-tuning super-network and obtaining a final characterization of the input sample according to the final hidden state of the input sample, where the final characterization of the input sample is the product of a learning coefficient and the final hidden state of the input sample;
Step 300: judging whether the learning coefficient is larger than a threshold value: if so, the candidate parameter-efficient fine-tuning module is selected; if not, it is discarded.
Specifically, assume that we already have a large-scale language model. This model may be one we pre-trained ourselves or an open-source model such as ChatGLM-6B; such models typically contain billions of parameters or more. When the model is trained, the pre-trained backbone is left unchanged, and only the parameters of the attached PEFT modules are updated.
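For illustration, a minimal PyTorch-style sketch of this setup is given below; the `peft` name filter and the `freeze_backbone` helper are assumptions introduced here for clarity, not part of the disclosed method.

```python
import torch

def freeze_backbone(model: torch.nn.Module, peft_keyword: str = "peft") -> None:
    """Freeze the pre-trained backbone; keep only (assumed) PEFT parameters trainable."""
    for name, param in model.named_parameters():
        # Parameters whose names contain the assumed PEFT keyword stay trainable;
        # everything belonging to the pre-trained backbone is frozen.
        param.requires_grad = peft_keyword in name

# Usage sketch: only the trainable (PEFT) parameters are handed to the optimizer.
# freeze_backbone(model)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```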
Further, acquiring the candidate parameter-efficient fine-tuning module and the final hidden state of the input sample includes:
acquiring a first hidden state of the input sample;
obtaining a second hidden state of the input sample according to the first hidden state and a partial operation of the transformer layer;
obtaining a third hidden state of the input sample according to the second hidden state and the candidate parameter-efficient fine-tuning module;
and obtaining the final hidden state of the input sample according to the remaining operation of the transformer layer and the third hidden state of the input sample.
Specifically, assume that for one candidate parameter-efficient fine-tuning module $m_i$, the hidden state of the input sample before entering the transformer layer is $h$ (the first hidden state). The operations of the transformer layer that are applied before reaching the position where $m_i$ is inserted are recorded as a function $f_1$; passing $h$ through $f_1$ yields the second hidden state $h_1 = f_1(h)$. The module $m_i$ then transforms $h_1$ into the third hidden state $h_2 = m_i(h_1)$. The remaining operations of the transformer layer, denoted as a function $f_2$, finally produce the final hidden state $h'_i = f_2(h_2)$.
The final hidden state of the input sample is therefore expressed as:

$h'_i = f_2\big(m_i(f_1(h))\big)$

where $m_i$ is the candidate parameter-efficient fine-tuning module, $h$ is the first hidden state, $h'_i$ is the final hidden state, $f_1$ is the first functional expression (the transformer-layer operations before the module), and $f_2$ is the second functional expression (the remaining transformer-layer operations).
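The hidden-state flow described above can be sketched as follows; the function names `f1` and `f2` and the `layer_with_peft` wrapper are illustrative assumptions consistent with the formula, not a verbatim implementation of the embodiment.

```python
import torch

def layer_with_peft(h: torch.Tensor, f1, f2,
                    peft_module: torch.nn.Module) -> torch.Tensor:
    """Compute h' = f2(m(f1(h))) for one candidate PEFT module m.

    f1: transformer-layer operations before the insertion point,
    f2: remaining transformer-layer operations.
    """
    h1 = f1(h)              # second hidden state
    h2 = peft_module(h1)    # third hidden state, produced by the candidate module
    return f2(h2)           # final hidden state
```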
This embodiment considers three known parameter-efficient fine-tuning methods, as shown in Fig. 2: LoRA, Prefix tuning, and Adapter tuning. LoRA modifies a parameter matrix through a trainable low-rank update; Prefix tuning splices randomly initialized trainable vectors onto the vector characterization of the sample; Adapter tuning modifies the hidden state output by each transformer layer.
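As a concrete example of one of these methods, a minimal LoRA-style module might look like the sketch below; the rank, scaling factor, and initialization are illustrative choices and are not fixed by this disclosure.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W plus a trainable low-rank update B @ A (LoRA sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # the pre-trained weight stays frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x  +  (B A) x * scaling, with only A and B trainable.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```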
Specifically, we first set up a parameter-efficient fine-tuning super-network; that is, we allow all six parameter-efficient fine-tuning modules in Fig. 2 (two Adapter positions, two Prefix positions, and two LoRA positions) to be used at the same time. The final characterization of the input sample (the final representation of the sample after passing through the transformer layer) is then:

$\tilde{h} = \sum_i a_i \cdot h'_i$

where $a_i$ is a learning coefficient whose value lies between 0 and 1 (a real-valued parameter passed through a sigmoid), i.e., each candidate module's contribution to the final representation is weighted by its learning coefficient. What we finally want to achieve is: given a threshold $\tau$, if $a_i$ is greater than $\tau$, the module $m_i$ is kept; otherwise $m_i$ is removed. Thus, by adjusting the threshold, redundant parameter-efficient fine-tuning modules can be eliminated.
The transformer layers of the pre-trained model act as a backbone, and the parameter-efficient fine-tuning modules are add-ons attached to that backbone. Both process and transform the hidden state of the sample so that it becomes better suited to producing useful outputs, such as the category of a sentence or how the next sentence should be written.
Further, the learning coefficient is determined by a Bernoulli random number, the Bernoulli random number is 0 or 1, and the probability of the Bernoulli random number being 1 is 0.5.
Specifically, consider the learning parameters $a_i$. These parameters are treated as part of the model parameters and are learned together with the parameters of the parameter-efficient fine-tuning modules; after training, a value of $a_i$ close to 0 indicates that the corresponding PEFT module is less important for the customization task. Although the $a_i$ parameters can be learned directly in this way, there is a problem: we ultimately want to screen out the less important PEFT modules, but the combination of PEFT modules used during training differs from the combination used in the end, so training $a_i$ directly leaves some of the modules insufficiently trained. As a result, the importance learned by the $a_i$ parameters may be inaccurate. To ensure that the $a_i$ parameters can truly express the importance of the PEFT modules, we propose the following regularization method. On each forward propagation, for each parameter $a_i$ we randomly sample a Bernoulli random number $z_i$; this random number takes the value 0 or 1, with probability $p = 0.5$ of being 1. The effective coefficient is then the product $z_i \cdot a_i$, so whether module $m_i$ is used in that forward pass is determined by $z_i$. Because of the randomness of the Bernoulli numbers, each forward propagation activates a different set of parameter-efficient fine-tuning modules and therefore produces a different result. The hidden states obtained from two different forward propagations are denoted $\tilde{h}^{(1)}$ and $\tilde{h}^{(2)}$. We believe the super-network must guarantee that the sample semantic representation it provides is sufficiently stable when different PEFT modules are used, because different transformer layers at different positions of the model should be able to select different PEFT modules. In this way the different PEFT modules are trained sufficiently, the coefficients $a_i$ are learned jointly through the regularization term described below and the training-data loss, and any module whose final $a_i$ is small is removed. We therefore add an additional regularization term on top of the training loss:
the method comprises the steps of carrying out a first treatment on the surface of the Wherein (1)>Is a regular term
That is, the requirement isAnd->Is as small as possible. By the constraint of this regularization term, part of the modules in our PEFT super-network can be trained sufficiently so that the corresponding +.>The parameters may truly reflect their importance.
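A sketch of the stochastic masking and the consistency regularizer is shown below; the squared-distance form of the regularizer and the `lambda_reg` weighting are assumptions, since the disclosure only requires the difference between the two hidden states to be small.

```python
import torch

def masked_gates(alpha: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Effective gates: a_i = sigmoid(alpha_i) multiplied by a Bernoulli(p) mask z_i."""
    z = torch.bernoulli(torch.full_like(alpha, p))
    return z * torch.sigmoid(alpha)

def consistency_regularizer(h_run1: torch.Tensor, h_run2: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between hidden states from two stochastic forward passes."""
    return (h_run1 - h_run2).pow(2).mean()

# Training-loss sketch (lambda_reg is an assumed weighting hyper-parameter):
# loss = task_loss + lambda_reg * consistency_regularizer(h_run1, h_run2)
```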
The beneficial effects of the invention are as follows:
the invention provides a method for selecting a parameter efficient fine-tuning module, which allows a plurality of parameter fine-tuning methods to be used by setting a parameter efficient fine-tuning super network, and eliminates redundant parameter efficient fine-tuning modules according to the final hidden state of an input sample and judgment of learning parameters, so that training consumption of a language model is reduced, and different transducer layers select corresponding parameter efficient fine-tuning modules to obtain better effects.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
Specific examples are used herein to explain the principles and implementations of the present invention; the above description of the embodiments is only intended to help understand the method of the present invention and its core ideas. Meanwhile, a person of ordinary skill in the art may make modifications to the specific implementations and application scope in light of the ideas of the present invention. In view of the foregoing, the content of this specification should not be construed as limiting the invention.
Claims (5)
1. A method of selecting a parameter-efficient fine-tuning module, comprising:
acquiring a candidate parameter-efficient fine-tuning module and a final hidden state of an input sample;
constructing a parameter-efficient fine-tuning super-network and obtaining a final characterization of the input sample according to the final hidden state of the input sample, the final characterization of the input sample being the product of a learning coefficient and the final hidden state of the input sample;
judging whether the learning coefficient is larger than a threshold value: if so, selecting the candidate parameter-efficient fine-tuning module, and if not, discarding the candidate parameter-efficient fine-tuning module.
2. The method of claim 1, wherein acquiring the candidate parameter-efficient fine-tuning module and the final hidden state of the input sample comprises:
acquiring a first hidden state of the input sample;
obtaining a second hidden state of the input sample according to the first hidden state and a partial operation of a transformer layer;
obtaining a third hidden state of the input sample according to the second hidden state and the candidate parameter-efficient fine-tuning module;
and obtaining the final hidden state of the input sample according to a remaining operation of the transformer layer and the third hidden state of the input sample.
3. The method of claim 1, wherein the final hidden state of the input sample is expressed as:

$h'_i = f_2\big(m_i(f_1(h))\big)$

wherein $m_i$ is the candidate parameter-efficient fine-tuning module, $h$ is the first hidden state, $h'_i$ is the final hidden state, $f_1$ is a first functional expression (the transformer-layer operations before the module), and $f_2$ is a second functional expression (the remaining transformer-layer operations).
4. The method of claim 3, wherein the final characterization of the input sample is given by:

$\tilde{h}_i = a_i \cdot h'_i$

wherein $a_i$ is the learning coefficient.
5. The method of claim 1, wherein the learning coefficient is determined by a Bernoulli random number, the Bernoulli random number takes the value 0 or 1, and the probability of the Bernoulli random number being 1 is 0.5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311352064.3A CN117094361B (en) | 2023-10-19 | 2023-10-19 | Method for selecting parameter efficient fine adjustment module |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311352064.3A CN117094361B (en) | 2023-10-19 | 2023-10-19 | Method for selecting parameter efficient fine adjustment module |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117094361A (en) | 2023-11-21
CN117094361B (en) | 2024-01-26
Family
ID=88772147
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311352064.3A Active CN117094361B (en) | 2023-10-19 | 2023-10-19 | Method for selecting parameter efficient fine adjustment module |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117094361B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200364574A1 (en) * | 2019-05-16 | 2020-11-19 | Samsung Electronics Co., Ltd. | Neural network model apparatus and compressing method of neural network model |
CN110580543A (en) * | 2019-08-06 | 2019-12-17 | 天津大学 | Power load prediction method and system based on deep belief network |
KR20220124389A (en) * | 2021-03-03 | 2022-09-14 | 에스케이 주식회사 | Method and provision system for finetuned model service using pretrain model |
CN114357172A (en) * | 2022-01-07 | 2022-04-15 | 北京邮电大学 | Rumor detection method based on ERNIE-BiGRU-Attention |
CN114676234A (en) * | 2022-02-22 | 2022-06-28 | 华为技术有限公司 | Model training method and related equipment |
CN116882474A (en) * | 2023-07-18 | 2023-10-13 | 平安科技(深圳)有限公司 | Fine tuning method, device, equipment and medium of pre-training model |
Non-Patent Citations (6)
Title |
---|
JIAMING HAN 等: "ImageBind-LLM: Multi-modality Instruction Tuning", 《ARXIV》, pages 1 - 24 * |
TAO JIANG 等: "Gaseous emission during the composting of pig feces from Chinese Ganqinfen system", 《CHEMOSPHERE》, vol. 90, no. 4, pages 1545 - 1551 * |
ZINIU LI 等: "ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models", 《ARXIV》, pages 1 - 20 * |
徐峰 等: "综合模块选择, 资源共享与任务调度的SoC设计方案搜索算法", 《计算机辅助设计与图形学学报》, vol. 21, no. 7, pages 1005 - 1010 * |
王亮亮 等: "基于深度学习的快速车辆检测算法", 《中国优秀硕士学位论文全文数据库工程科技Ⅱ辑(月刊)》, no. 1, pages 034 - 1013 * |
羽林小王子: "一文读懂:LoRA实现大模型LLM微调", pages 1 - 10, Retrieved from the Internet <URL:https://developer.aliyun.com/article/1257855> * |
Also Published As
Publication number | Publication date |
---|---|
CN117094361B (en) | 2024-01-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |