CN117217289A - Banking industry large language model training method - Google Patents
Banking industry large language model training method
- Publication number: CN117217289A
- Application number: CN202311299622.4A
- Authority: CN (China)
- Legal status: Pending
Abstract
The invention provides a training method for a banking-industry large language model, comprising the following steps: step S1: constructing a model training data set; step S2: training the model tokenizer on a bank vocabulary; step S3: building the large-model base through incremental training of the llama pre-trained model; step S4: performing instruction fine-tuning with prompt engineering; step S5: fine-tuning the large model with reinforcement learning. The large language model is trained on the business, knowledge and data of a specific bank. The resulting model understands professional vocabulary commonly used by banks, such as amounts, financial product names and handling procedures; instruction training with customer-service dialogue data and the bank's internal knowledge base gives the model capabilities such as customer-service question answering and retrieval-based question answering; and the model can be iterated rapidly as the business continues to evolve.
Description
Technical Field
The invention relates to the field of bank language model training, and in particular to a training method for a banking-industry large language model.
Background
Since OpenAI released ChatGPT in November 2022, this powerful language model has caused a tremendous sensation around the world. Its striking results and broad range of applications have thoroughly changed people's understanding of the potential of large models and ignited expectations for artificial intelligence across industries. Large models are deep learning models with very large numbers of parameters and complex structures, used in natural language processing and machine learning. They are designed to handle large-scale text data and can automatically learn and understand the semantics, grammar and context of language.
Existing large language models, whether trained by commercial companies or open-sourced by academic institutions, are trained mainly on general-purpose corpora. Their understanding of banking business is therefore shallow: they cannot support scenarios such as bank customer service or knowledge-base question answering and cannot meet the requirements of banking business.
Disclosure of Invention
In view of the above, the present invention has been made to provide a banking large language model training method that overcomes or at least partially solves the above-mentioned problems.
According to one aspect of the present invention, there is provided a training method for a banking-industry large language model, the training method comprising:
step S1: constructing a model training data set;
step S2: training the model tokenizer on a bank vocabulary;
step S3: building the large-model base through incremental training of the llama pre-trained model;
step S4: performing instruction fine-tuning with prompt engineering;
step S5: fine-tuning the large model with reinforcement learning.
Optionally, step S1, constructing the model training data set, specifically comprises:
building a pre-training data set, scoring the data with heuristic rules and a quality model, and filtering the data set at document and sentence granularity; deduplicating the full data at document and sentence granularity with locality-sensitive hashing;
building an instruction fine-tuning data set.
Optionally, step S2, training the model tokenizer on a bank vocabulary, specifically comprises:
compiling an in-house specialized dictionary from the knowledge base, the dictionary containing banking industry terms, financial product names and bank-specific numeric terms, about 4,300 entries in total;
loading the specialized vocabulary when training the tokenizer with SentencePiece so that specialized terms in the text are not split, the training data being the unsupervised pre-training text of section 1.1 and the vocabulary size being set to 25k;
merging the trained tokenizer with the original llama tokenizer by merging their vocabularies to obtain a combined tokenizer.
Optionally, step S3, building the large-model base through incremental training of the llama pre-trained model, specifically comprises:
further training the Llama 13b model;
the Llama model architecture uses a Transformer decoder architecture.
Optionally, step S4, performing instruction fine-tuning with prompt engineering, specifically comprises: fine-tuning the model with the LoRA approach.
Optionally, step S5, fine-tuning the large model with reinforcement learning, specifically comprises:
sampling outputs with multiple policies and collecting human feedback to form a reinforcement learning data set;
training a reward model based on the Bloom-7b model;
the goal of the reward model (RM) is to characterize whether the model's output looks good to humans;
the input is [prompt, text generated by the model] and the output is a scalar that characterizes the text quality;
the formula is as follows:
loss(r_θ) = -E_{(x, y_0, y_1, i)~D}[log(σ(r_θ(x, y_i) - r_θ(x, y_{1-i})))]
where x and y denote the POST and the SUMMARY respectively, r_θ denotes the value of the reward model with parameter θ, and σ denotes the sigmoid function;
the reward model receives a sequence of text and returns a scalar reward whose value corresponds to human preference;
the large model is then fine-tuned using the predictions of the trained reward model, and the model policy is optimized with the PPO algorithm.
The invention provides a training method for a banking-industry large language model, comprising the following steps: step S1: constructing a model training data set; step S2: training the model tokenizer on a bank vocabulary; step S3: building the large-model base through incremental training of the llama pre-trained model; step S4: performing instruction fine-tuning with prompt engineering; step S5: fine-tuning the large model with reinforcement learning. The large language model is trained on the business, knowledge and data of a specific bank. The resulting model understands professional vocabulary commonly used by banks, such as amounts, financial product names and handling procedures; instruction training with customer-service dialogue data and the bank's internal knowledge base gives the model capabilities such as customer-service question answering and retrieval-based question answering; and the model can be iterated rapidly as the business continues to evolve.
The foregoing is only an overview of the technical solution of the present invention; it is provided so that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features and advantages of the invention can be more readily appreciated.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a reinforcement learning process according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a self-instruct architecture according to an embodiment of the present invention;
FIG. 3 is a diagram comparing the tokenization results of the original llama tokenizer and the Chinese-trained tokenizer, provided by an embodiment of the present invention;
FIG. 4 is a diagram of a model architecture according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a LoRA structure according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terms "comprising" and "having" and any variations thereof in the description, claims and drawings of the embodiments of the invention are intended to cover a non-exclusive inclusion, for example of a series of steps or elements.
The technical scheme of the invention is further described in detail below with reference to the accompanying drawings and the examples.
As shown in fig. 1, a training method for a banking-industry large model comprises:
Step A: constructing the model training data set
Step B: training the model tokenizer on a bank vocabulary
Step C: building the large-model base through incremental training of the llama pre-trained model
Step D: performing instruction fine-tuning with prompt engineering
Step E: fine-tuning the large model with reinforcement learning
The method is guided by the task scenarios: a training data set is constructed from general corpora, financial corpora and the bank's internal text data; a tokenizer is trained with the BPE algorithm of SentencePiece; the large-model base is built by unsupervised training on top of the llama 13b model; supervised training is performed through instruction fine-tuning; and the output is aligned with a reinforcement learning method.
Application scenarios of large models in the banking industry include customer service and support, knowledge question answering, retrieval-based question answering, content moderation, assisted text writing, text classification and sentiment analysis, entity recognition, knowledge-triple extraction, and others. By training the model on a data set constructed for banking business scenarios and applying instruction fine-tuning and reinforcement learning training, the model better understands banking business semantics and outputs accurate and compliant answers.
The specific implementation steps are as follows:
building training data sets
The pre-training data set comprises about 50% general corpus, 40% financial corpus and 10% internal bank corpus. The general corpus includes task data such as multi-turn dialogue, entity recognition, text classification and text summarization, along with other general text; the financial corpus includes open-source financial corpora, finance and economics news, and listed-company financial reports, annual reports, research reports, announcements and other financial documents; the bank-internal data mainly consist of the bank knowledge base, customer-service dialogue data, internal process documents, and the various notices and specifications issued by the bank. The data are scored with heuristic rules and a quality model, and the data set is filtered at document and sentence granularity. Over the full data, near-duplicates are removed at document and sentence granularity with locality-sensitive hashing. After cleaning and filtering, the data set is stored in JSON format, one JSON file per text. The training data set is about 300 GB.
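As an illustration of the locality-sensitive-hashing deduplication step, a minimal sketch is given below using MinHash LSH from the open-source datasketch package; the whitespace shingling, similarity threshold and key naming are assumptions, not the patent's exact configuration.

```python
# MinHash-LSH near-duplicate filtering sketch (datasketch package assumed available).
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.split():          # simple whitespace shingling for illustration
        m.update(token.encode("utf-8"))
    return m

def dedup(docs, threshold=0.8, num_perm=128):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, doc in enumerate(docs):
        m = minhash(doc, num_perm)
        if not lsh.query(m):            # keep only if no near-duplicate is already indexed
            lsh.insert(str(i), m)
            kept.append(doc)
    return kept
```

The same routine can be applied at document and at sentence granularity by changing what is passed in as a "doc".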
The instruction fine-tuning data set is used for supervised training. It is built from about 50,000 items of open-source Chinese instruction sets, instruction sets adapted from the in-house customer-service FAQ and knowledge base, and in-house data annotated with the self-instruct method, totaling about 30,000 items. The data set is constructed in JSON format with three fields, instruction, input and output; input may be left empty. The instruction data include the Chinese portions of the open-source Stanford Alpaca, ShareGPT and BELLE instruction sets and, most importantly, exclusive instruction data constructed from banking task scenarios and in-house knowledge. Building the exclusive instruction data mainly involves constructing an instruction set from customer-service FAQ data, constructing instructions from in-house knowledge, and constructing instruction data for specific task scenarios.
Steps of the self-instruct method for generating the instruction set:
Design 175 instructions representing different tasks and write an (instruction, input, output) or (instruction, output) example for each, taking these 175 items as the seed pool.
Generate new instructions with the pre-trained model;
Determine, for each model-generated instruction, whether it is a classification task;
Generate instances with the pre-trained model;
Filter and post-process the model-generated data;
Add the filtered and post-processed data to the seed pool;
Repeat steps 2 to 6 until the seed pool contains enough data.
When generating instructions, 6 hand-written instructions are randomly drawn from the seed pool together with 2 instructions generated by the model in previous rounds, 8 instructions in total. These are fed to the model, which is asked to output a new instruction. Whether an instruction belongs to a classification task is then determined, mainly because classification tasks use a different sample template from non-classification tasks when instances are generated: 12 classification instructions and 19 non-classification instructions are randomly selected from the seed pool, the newly generated instruction is appended, and the model is asked to output whether the new instruction is a classification task. For data diversity, a newly generated instruction is added to the seed pool only if its ROUGE-L score against the instructions already in the pool is below 0.7; instructions that a language model cannot process, such as those involving images, pictures or graphics, are excluded; and when instances are generated for an instruction, instances with identical input but different output are filtered out. About 50,000 instruction data items were obtained with the self-instruct method under manual review. The self-instruct architecture is shown in FIG. 2.
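A minimal sketch of the generation-and-filter loop described above is shown below. The build_prompt, parse_instance and generate helpers are hypothetical stand-ins for the prompting templates and the pre-trained model call; only the ROUGE-L diversity filter uses a concrete library (rouge-score).

```python
# Sketch of the self-instruct loop; generate(), build_prompt(), parse_instance()
# are hypothetical helpers wrapping the pre-trained model and its prompt templates.
import random
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def is_novel(candidate, pool, threshold=0.7):
    # Diversity filter: ROUGE-L against every pooled instruction must stay below 0.7.
    return all(
        scorer.score(item["instruction"], candidate)["rougeL"].fmeasure < threshold
        for item in pool
    )

def self_instruct(seed_pool, generate, target_size=50_000):
    pool = list(seed_pool)  # 175 hand-written (instruction, input, output) seeds
    while len(pool) < target_size:
        # 6 hand-written seeds plus 2 items from the growing pool as demonstrations.
        demos = random.sample(seed_pool, 6) + random.sample(pool, 2)
        instruction = generate(build_prompt("new_instruction", demos))
        task_is_cls = generate(build_prompt("is_classification", [instruction]))
        raw_instance = generate(build_prompt("instance", [instruction], task_is_cls))
        instance = parse_instance(raw_instance)  # -> {"input": ..., "output": ...} or None
        if instance and is_novel(instruction, pool):
            pool.append({"instruction": instruction, **instance})
    return pool
```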
2. Training the tokenizer
Because the original LLaMA has very limited support for Chinese, containing only a few hundred Chinese tokens, the efficiency of encoding and decoding Chinese text is severely affected, so the Chinese vocabulary needs to be expanded on top of LLaMA. First, an in-house specialized dictionary is compiled from the knowledge base; it contains banking industry terms, financial product names, bank-specific numeric terms and the like, about 4,300 entries in total. Then, when training the tokenizer with SentencePiece, the specialized vocabulary is loaded so that specialized terms in the text are not split; the training data is the unsupervised pre-training text of section 1.1, and vocab_size is set to 25k. The trained tokenizer is merged with the original llama tokenizer by merging their vocabularies, giving a combined tokenizer with a vocabulary of about 57k. To accommodate the new tokens, the word embedding and language-model head matrices are resized from V x H to V' x H, where V = 32,000 is the original vocabulary size and V' is the merged vocabulary size (the newly trained tokenizer contributes a 25k vocabulary). The new rows are appended to the end of the original embedding matrix, ensuring that the embeddings of tokens in the original vocabulary are not affected. The number of tokens produced by the Chinese LLaMA tokenizer is roughly halved relative to the original LLaMA tokenizer; comparing the two, the encoded length with the Chinese tokenizer is clearly reduced, showing that the method is effective in improving the Chinese understanding and generation abilities of the LLaMA model. The Chinese LLaMA tokenizer is then used to pre-train the Chinese LLaMA model on the standard language-modeling task, predicting the next token autoregressively, which further improves the model's Chinese understanding and generation abilities. FIG. 3 compares the tokenization results of the original llama tokenizer and the Chinese-trained tokenizer.
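The tokenizer training and vocabulary merge can be sketched roughly as follows; the file paths are placeholders, and merging via add_tokens is a simplification of merging the underlying SentencePiece models, assumed here only for illustration.

```python
# Train a bank-domain SentencePiece tokenizer with protected special terms,
# then merge it with the original LLaMA tokenizer and resize the embeddings.
import sentencepiece as spm
from transformers import LlamaTokenizer, LlamaForCausalLM

special_terms = [w.strip() for w in open("bank_terms.txt", encoding="utf-8")]  # ~4,300 entries

spm.SentencePieceTrainer.train(
    input="pretrain_corpus.txt",         # unsupervised pre-training text (section 1.1)
    model_prefix="bank_sp",
    vocab_size=25_000,
    model_type="bpe",
    user_defined_symbols=special_terms,  # keep bank terms unsplit
)

base = LlamaTokenizer.from_pretrained("path/to/llama-13b")
bank = spm.SentencePieceProcessor(model_file="bank_sp.model")

# Append pieces that are not already in the LLaMA vocabulary (merge step, simplified).
new_pieces = [bank.id_to_piece(i) for i in range(bank.get_piece_size())
              if bank.id_to_piece(i) not in base.get_vocab()]
base.add_tokens(new_pieces)              # merged vocabulary, roughly 57k entries

model = LlamaForCausalLM.from_pretrained("path/to/llama-13b")
model.resize_token_embeddings(len(base))  # new rows appended; original embeddings untouched
```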
3. Llama 13b model retraining
The Llama model architecture uses the Transformer decoder architecture, with several optimizations in the details, including the following.
3.1 RMS Pre-Norm
RMSNorm (Root Mean Square Layer Normalization) is a variant of the usual LayerNorm that makes the loss smoother during gradient descent. The main difference from LayerNorm is that the mean-subtraction part (re-centering) is removed and only the variance part (re-scaling) is kept. LLaMA applies RMSNorm to the inputs of the attention layer and the MLP rather than to their outputs, which makes training more stable.
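For illustration, a minimal RMSNorm module matching the description above might look like this (LLaMA's own code uses an equivalent formulation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Root-mean-square layer norm: rescale by the RMS of the features, no mean subtraction.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms
```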
3.2 SwiGLU activation function
LLaMA replaces the original ReLU with SwiGLU. SwiGLU automatically adjusts the information flow through learned parameters according to the characteristics of the input data; concretely, it uses a feed-forward neural network (FFN) with a learnable gating mechanism, given by the formula:
FFN_SwiGLU(x, W, V, W_2) = (Swish(xW) ⊗ xV) W_2
The formula first passes the product of the input x and the weight matrix W through the Swish nonlinear activation, then multiplies the result element-wise with the product of the input x and the weight matrix V. This introduces a GLU-like "gate" between the output of the Swish activation and the second linear transformation: the value of the gate is computed from the original input x through the linear transformation V, so it can dynamically control the output of the Swish activation. Finally the result is multiplied by the weight matrix W_2.
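A minimal PyTorch sketch of this SwiGLU feed-forward block is shown below; the hidden dimension is left as a parameter and is an assumption of the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    # FFN_SwiGLU(x) = (Swish(x W) * x V) W2, implemented with three linear layers.
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden_dim, bias=False)   # gate branch (Swish-activated)
        self.v = nn.Linear(dim, hidden_dim, bias=False)   # value branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w(x)) * self.v(x))     # silu is Swish with beta = 1
```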
3.3 RoPE rotary position encoding
RoPE (Rotary Position Embedding) encodes position by rotation; the idea is to realize relative position encoding through an absolute-position-encoding formulation. Position encoding is particularly important for large models: because large models are trained on long text, the representation of long text and the model's ability to model long-range structure matter greatly. The benefit of RoPE is that it keeps the absolute position information of absolute position encoding while also making position information relative under the inner-product operation. Absolute position encoding has advantages such as fast computation, but its drawbacks are that extending the length is cumbersome and the absolute position itself carries little practical meaning. Relative position encoding, on the other hand, is meaningful for learning relations between tokens (for example, the probability of a relation between two far-apart tokens is small), so it usually gives better results; it is also easier to extend in length, because whatever the context size, only inputs within the maximum distance need to be attended to. Its drawback is that it is not as fast to compute as absolute position encoding.
Suppose first that, before position information is added, the original encoding vectors are two-dimensional row vectors q_m and k_n, where m and n are absolute positions. We now need a transformation that introduces m and n into q_m and k_n, i.e. we look for transformations
q̃_m = f(q, m), k̃_n = f(k, n)
That is, we design operations f(·, m) and f(·, n) for q and k respectively such that, after these operations, q̃_m and k̃_n carry the absolute position information of positions m and n.
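A compact sketch of rotary position embedding follows; it uses the interleaved-pair formulation for illustration and simplifies the head-dimension handling.

```python
import torch

def rope_frequencies(head_dim: int, seq_len: int, base: float = 10000.0):
    # theta_i = base^(-2i/d); angles m * theta_i for every position m.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)   # (seq_len, head_dim/2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (..., seq_len, head_dim); rotate each consecutive pair of dimensions by m * theta_i.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```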
3.4 Model structure construction
When building the Transformer, Llama builds it block by block: each Transformer block contains a self-attention (SA) module and an FFN, and the full Transformer network is built by stacking blocks.
3.4.1 Construction of the self-attention layer
The input x is passed through three linear layers to obtain x_q, x_k and x_v respectively;
rotary position encoding is applied to x_q and x_k;
x_k and x_v are cached;
the attention output is then computed as softmax(x_q x_k^T / sqrt(d)) x_v.
One implementation detail is the caching mechanism, designed to avoid recomputing token features during generation. When the features of the n-th token are computed, tokens 1 to n-1 are needed; that is, every generation step requires all previous information, and recomputing it from scratch each time would be very wasteful. Therefore the information for each position is computed only once and then cached.
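A single-head sketch of this cached self-attention, reusing the apply_rope helper above, might look as follows; it is illustrative rather than the patent's exact implementation, and omits multi-head splitting and masking.

```python
import math
import torch

class CachedSelfAttention(torch.nn.Module):
    # Single-head sketch: project to q/k/v, apply RoPE to q and k, append k/v to a cache,
    # and attend over everything cached so far (past positions are never recomputed).
    def __init__(self, dim: int):
        super().__init__()
        self.wq = torch.nn.Linear(dim, dim, bias=False)
        self.wk = torch.nn.Linear(dim, dim, bias=False)
        self.wv = torch.nn.Linear(dim, dim, bias=False)
        self.cache_k, self.cache_v = [], []

    def forward(self, x, cos, sin):
        # cos/sin must correspond to the absolute positions of the tokens in x.
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
        self.cache_k.append(k)
        self.cache_v.append(v)
        keys = torch.cat(self.cache_k, dim=-2)
        values = torch.cat(self.cache_v, dim=-2)
        scores = q @ keys.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ values
```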
3.4.2 Construction of the FFN layer
The FFN in LLaMA uses three fully connected layers to implement FFN_SwiGLU, namely
FFN(x) = W_2 (Swish(W_1 x) ⊗ W_3 x)
A self-attention module and an FFN are assembled into a transformer block, transformer blocks are stacked with torch's ModuleList, and together with the forward part this forms a complete Transformer (decoder) structure. The forward part takes tokens as input, first performs token embedding and then adds position information. For a decoder model, a mask is needed to prevent label leakage, so an upper-triangular mask matrix is constructed. The layer-by-layer computation through the Transformer blocks then follows.
3.4.3 Optimizer selection:
(1) The AdamW optimizer is used with hyper-parameters β1 = 0.9 and β2 = 0.95, a weight decay of 0.1 and gradient clipping at 1.0.
(2) A cosine learning-rate schedule is used, with the final learning rate equal to 10% of the maximum learning rate.
(3) 2,000 warmup steps are used, and the learning rate and batch size are adjusted with the model size.
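These choices can be wired up roughly as below; model, dataloader and the learning rate are placeholders, and the standard cosine schedule shown decays to zero, so pinning the final learning rate at 10% of the maximum would require a custom schedule.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1  # lr is a placeholder
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2000, num_training_steps=80_000
)

for step, batch in enumerate(dataloader):
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```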
The model architecture diagram is shown in fig. 4.
3.4.4 Model incremental training:
Because training large language models is expensive, high performance is also important when building them. To achieve high-performance training, the method trains on 8 x A100 (80 GB) GPUs and uses the following techniques:
Fused CUDA kernels: the fused CUDA kernels provided in xformers fuse multiple operations, reducing data transfer between the GPU and CPU and improving training efficiency. The memory_efficient_attention operation from Meta's open-source xformers library is used for the self-attention computation, which improves performance noticeably, by about 30%.
Parallelized training: multi-GPU parallel training is supported through the accelerate library plus DeepSpeed ZeRO Stage 3 with offload and activation checkpointing to speed up training.
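One way to express the ZeRO Stage 3 + offload setup is a DeepSpeed configuration along the following lines; the batch sizes and precision setting are illustrative assumptions.

```python
# Illustrative DeepSpeed ZeRO-3 + CPU offload configuration (dict form).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,       # placeholder
    "gradient_accumulation_steps": 32,         # placeholder
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
        "overlap_comm": True,
    },
    "gradient_clipping": 1.0,
}
```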
For the 13B model, the native PyTorch Llama implementation in transformers trains at about 1,378 tokens/s/GPU, while this method reaches 3,290 tokens/s/GPU, essentially matching the 3,370 tokens/s/GPU of the original Llama. Pre-training used 330B tokens, requiring roughly 43,000 GPU hours of training. Training ran for 80K steps in total, with a global batch size of 4M tokens, consistent with Llama.
4. Model fine-tuning
Model fine-tuning here refers to giving human instructions in natural-language form and tuning the pre-trained large model directly on a collection of NLP task sets. It improves the language model's performance on unseen tasks, i.e. its zero-shot capability: instruction tuning stimulates the model's understanding ability, allowing it to generalize better to unseen tasks by understanding what it is being asked to do.
The method fine-tunes the model with LoRA. LoRA, short for Low-Rank Adaptation of Large Language Models, is a parameter-efficient fine-tuning (PEFT) technique developed by Microsoft researchers to address the cost of fine-tuning large language models. Its basic principle is to freeze the pre-trained model weights and, with the original parameters frozen, add an extra network layer to the model and train only the newly added parameters. Because the number of new parameters is small, the cost of fine-tuning drops markedly while achieving an effect close to full-model fine-tuning.
A bypass is added beside the original PLM (pre-trained language model) that performs a down-projection followed by an up-projection, simulating the so-called intrinsic rank.
During training, the parameters of the PLM are fixed and only the down-projection matrix A and the up-projection matrix B are trained. The input and output dimensions of the model are unchanged, and at the output the BA product is added to the PLM's output.
A is initialized with a random Gaussian distribution and B with a zero matrix, which guarantees that the bypass matrix BA is still a zero matrix at the start of training.
Suppose a pre-trained model is to be fine-tuned on a downstream task; then the pre-trained parameters must be updated, which can be written as
W_0 + ΔW
where W_0 are the initial pre-trained parameters and ΔW is the update to be learned. With full-parameter fine-tuning, the number of updated parameters equals that of W_0, so fully fine-tuning a large language model is very expensive; LoRA only needs to fine-tune ΔW.
Concretely, let the pre-trained matrix be W_0 ∈ R^{d×k}. Its update can be written as
W_0 + ΔW = W_0 + BA,  B ∈ R^{d×r},  A ∈ R^{r×k}
where the rank r ≪ min(d, k).
During LoRA training, W_0 is fixed and only A and B are trainable parameters.
In the forward pass, W_0 and ΔW are multiplied by the same input x and the results are added:
h = W_0 x + ΔW x = W_0 x + BA x
This idea is somewhat similar to a residual connection, with the bypass update used to simulate the full fine-tuning process. Full fine-tuning can also be viewed as a special case of LoRA (when r equals k).
During inference, LoRA introduces almost no extra latency: only W = W_0 + ΔW needs to be computed. Combining LoRA with the Transformer is also simple, adding a single bypass in the computation of the QKV attention. FIG. 5 shows the LoRA structural framework.
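The update rule h = W_0 x + BA x can be written as a minimal PyTorch module as a sketch; the alpha/r scaling follows the usual LoRA convention, and the initialization matches the description above (Gaussian A, zero B).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # h = W0 x + (alpha / r) * B A x, with W0 frozen and only A, B trainable.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                          # freeze W0
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```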
The method uses Hugging Face's open-source PEFT library combined with DeepSpeed for fine-tuning training.
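A sketch of such a setup with the PEFT library is shown below; the rank, dropout and target modules are illustrative assumptions rather than the patent's settings.

```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("path/to/banking-llama-13b")  # placeholder path
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # bypass added on the attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA A/B matrices are trainable
```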
The maximum context length of a model fine-tuned from llama-7b is 2,048 tokens, yet applications such as long dialogues, summarizing long documents or long-horizon planning often exceed the preset context window, so a large model that can handle a longer context window better fits the bank's business scenarios. The method uses position interpolation (PI) to extend the context window of existing pre-trained LLMs, including LLaMA. PI directly shrinks the position indices so that the maximum position index matches the context-window limit of the pre-training phase. To accommodate more input tokens, position codes are interpolated at neighboring integer positions, exploiting the fact that position encoding can be applied to non-integer positions, which enables the context window to be expanded.
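Position interpolation can be sketched as a single rescaling of the position indices fed to RoPE, reusing the frequency idea above; the 2,048-token training window default comes from the description, and the target length is left as a parameter.

```python
import torch

def interpolated_rope_frequencies(head_dim, target_len, train_len=2048, base=10000.0):
    # PI: shrink position indices by train_len / target_len so the largest index
    # still falls inside the context window seen during pre-training.
    scale = train_len / target_len
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(target_len).float() * scale      # non-integer positions
    angles = torch.outer(positions, inv_freq)
    return angles.cos(), angles.sin()
```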
5. Reinforcement learning training
A large language model has strong natural-language understanding capability, but it must also be ensured that its output meets human expectations, so that its behavior is aligned with humans and it can understand human instructions and give answers that are helpful to humans.
Reinforcement learning is a paradigm that learns a policy from feedback. The reinforcement learning model (agent) interacts with the environment: for each given state s_t it takes an action a_t, obtains a reward r_t from the environment, and moves to the next state s_{t+1}; the process then repeats. After accumulating a series of such interaction experiences, the model adjusts its policy to maximize the reward obtained over the interaction. In this way the agent learns a policy for taking beneficial actions in a given state, which is the goal of reinforcement learning.
Reinforcement learning training steps:
5.1 Sample outputs with multiple policies and collect human feedback to form a reinforcement learning data set
5.2 Train a reward model based on the Bloom-7b model
The goal of the reward model (RM) is to characterize whether the model's output looks good to humans: the input is [prompt, text generated by the model] and the output is a scalar that characterizes the quality of the text. The formula is as follows:
loss(r_θ) = -E_{(x, y_0, y_1, i)~D}[log(σ(r_θ(x, y_i) - r_θ(x, y_{1-i})))]
where x and y denote the POST and the SUMMARY respectively, r_θ denotes the value of the reward model with parameter θ, and σ denotes the sigmoid function. The reward model receives a sequence of text and returns a scalar reward whose value corresponds to human preference. It can be modeled end-to-end with a language model, or with a modular system (for example ranking the outputs and then converting the ranking into rewards). This reward value is critical for seamlessly plugging into existing RL algorithms later on.
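The pairwise loss above translates directly into PyTorch; reward_model here is an assumed callable that returns one scalar per (prompt, response) pair.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_model, x, y_chosen, y_rejected):
    # loss = -log(sigmoid(r(x, y_chosen) - r(x, y_rejected))), averaged over the batch.
    r_chosen = reward_model(x, y_chosen)      # scalar reward per sample
    r_rejected = reward_model(x, y_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```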
5.3 Fine-tune the large model with the predictions of the trained reward model and optimize the model policy with the PPO algorithm
Since the fine-tuning task of the initial language model is modeled as a reinforcement learning (RL) problem, basic elements such as the policy, the action space and the reward function need to be defined. The policy is the language model itself, which receives a prompt as input and outputs a sequence of text (or a probability distribution over text); the action space is the set of all arrangements of vocabulary tokens over all output positions (a single position typically has on the order of 50k candidate tokens); the observation space is the set of possible input token sequences (i.e. prompts), which is obviously enormous, being all arrangements of vocabulary tokens over all input positions; and the reward function computes an initial reward from the RM model trained in the previous section, with a constraint term superimposed.
Let the vocabulary be Σ and the language model be ρ; then the probability distribution over a sequence of length n can be written as
ρ(x_0 ... x_{n-1}) = ∏_{0≤k<n} ρ(x_k | x_0 ... x_{k-1})
The input space is X = Σ^{≤m} and the output space is Y = Σ^{≤n}; for example, an input article x ∈ X might be 1,000 tokens long while y ∈ Y is a 100-token summary. The probability of generating summary y from article x can then be written as ρ(y|x) = ρ(xy)/ρ(x).
The policy is initialized as π = ρ and then updated with the PPO algorithm. Defining the reward function as r, the expected reward is expressed as
E_π[r] = E_{x~D, y~π(·|x)}[r(x, y)]
the PPO algorithm optimizing reward function calculating steps are as follows:
inputting the prompt X into the initial LM and the current fine-tuned LM to obtain output texts y1 and y2 respectively, and transmitting the text from the current strategy to the RM to obtain a scalar prize r θ 。
Comparing the generated text of the two models to calculate a penalty term for the difference, typically designed as a scaling of Kullback-Leibler (KL) divergence between the output word distribution sequences, i.e. r=r θ -beta rKL, wherein
This term is used to penalize the RL strategy to generate a large deviation from the initial model in each training batch to ensure that the model outputs reasonably consistent text. If this penalty term is removed, it may cause the model to generate messy code text in the optimization to fool the bonus model into providing a high bonus value.
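The penalized reward can be computed roughly as below; the beta coefficient and the per-token log-probability inputs are assumptions, and the KL term is approximated by the log-probability difference over the generated sequence, following the description rather than any specific library.

```python
import torch

def penalized_reward(r_theta, logprobs_policy, logprobs_ref, beta=0.02):
    # r = r_theta - beta * KL(policy || reference), with the KL term approximated
    # by the summed log-probability difference of the generated tokens.
    kl = (logprobs_policy - logprobs_ref).sum(dim=-1)   # sum over generated tokens
    return r_theta - beta * kl
```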
Finally, following the PPO algorithm, the reward objective is optimized on the current batch of data (which follows from PPO's on-policy nature). PPO is a trust-region optimization (Trust Region Optimization, TRO) algorithm that uses gradient constraints to ensure the update step does not destabilize the learning process; the A2C (synchronous advantage actor-critic) algorithm may also be used to optimize the gradient.
Training a large model with reinforcement learning (RL) significantly improves its performance and achieves striking results on a variety of tasks. The advantages include:
1. Adaptation to complex tasks: a large language model trained with reinforcement learning generalizes well and can be applied to a variety of complex natural-language-processing tasks, such as text generation, question-answering systems and sentiment analysis, without task-specific pre-training for each task.
2. Adaptive optimization: reinforcement learning can adjust model parameters dynamically, continually optimizing the model according to feedback from the task and the environment so that it performs better on specific tasks. While interacting with the environment, the model updates its policy through learning and feedback, achieving adaptive optimization.
3. Handling non-convex optimization problems: the training objective of a large language model is usually a non-convex optimization problem, and traditional optimization methods may get stuck in local optima. Reinforcement learning uses gradient-based optimization algorithms such as Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO), which are better suited to complex non-convex problems.
4. Surpassing human level: large language models trained with reinforcement learning have exceeded human-level performance on certain tasks. Such breakthroughs bring unprecedented opportunities to application domains and help realize higher levels of intelligent applications.
Beneficial effects: large language models trained with reinforcement learning have significant advantages in natural-language-processing tasks. They adapt to complex tasks, optimize adaptively, handle non-convex problems and can exceed human-level performance, laying a solid foundation for advancing the development and application of natural-language-processing technology.
The foregoing detailed description of the invention has been presented for purposes of illustration and description, and it should be understood that the invention is not limited to the particular embodiments disclosed, but is intended to cover all modifications, equivalents, alternatives, and improvements within the spirit and principles of the invention.
Claims (6)
1. A training method for a banking-industry large language model, characterized in that the training method comprises:
step S1: constructing a model training data set;
step S2: training the model tokenizer on a bank vocabulary;
step S3: building the large-model base through incremental training of the llama pre-trained model;
step S4: performing instruction fine-tuning with prompt engineering;
step S5: fine-tuning the large model with reinforcement learning.
2. The training method for a banking-industry large language model according to claim 1, characterized in that step S1, constructing the model training data set, specifically comprises:
building a pre-training data set, scoring the data with heuristic rules and a quality model, and filtering the data set at document and sentence granularity; deduplicating the full data at document and sentence granularity with locality-sensitive hashing;
building an instruction fine-tuning data set.
3. The training method for a banking-industry large language model according to claim 1, characterized in that step S2, training the model tokenizer on a bank vocabulary, specifically comprises:
compiling an in-house specialized dictionary from the knowledge base, the dictionary containing banking industry terms, financial product names and bank-specific numeric terms, about 4,300 entries in total;
loading the specialized vocabulary when training the tokenizer with SentencePiece so that specialized terms in the text are not split, the training data being the unsupervised pre-training text of section 1.1 and the vocabulary size being set to 25k;
merging the trained tokenizer with the original llama tokenizer by merging their vocabularies to obtain a combined tokenizer.
4. The training method for a banking-industry large language model according to claim 1, characterized in that step S3, building the large-model base through incremental training of the llama pre-trained model, specifically comprises:
further training the Llama 13b model;
the Llama model architecture uses a Transformer decoder architecture.
5. The training method for a banking-industry large language model according to claim 1, characterized in that step S4, performing instruction fine-tuning with prompt engineering, specifically comprises: fine-tuning the model with the LoRA approach.
6. The training method for a banking-industry large language model according to claim 1, characterized in that step S5, fine-tuning the large model with reinforcement learning, specifically comprises:
sampling outputs with multiple policies and collecting human feedback to form a reinforcement learning data set;
training a reward model based on the Bloom-7b model;
the goal of the reward model (RM) being to characterize whether the model's output looks good to humans;
the input being [prompt, text generated by the model] and the output being a scalar that characterizes the text quality;
the formula being as follows:
loss(r_θ) = -E_{(x, y_0, y_1, i)~D}[log(σ(r_θ(x, y_i) - r_θ(x, y_{1-i})))]
where x and y denote the POST and the SUMMARY respectively, r_θ denotes the value of the reward model with parameter θ, and σ denotes the sigmoid function;
the reward model receiving a sequence of text and returning a scalar reward whose value corresponds to human preference;
and fine-tuning the large model with the predictions of the trained reward model and optimizing the model policy with the PPO algorithm.