CN117217289A - Banking industry large language model training method - Google Patents
Banking industry large language model training method
- Publication number: CN117217289A
- Application number: CN202311299622.4A
- Authority: CN (China)
- Legal status: Pending
Abstract
The invention provides a training method for a banking-industry large language model, comprising the following steps: step S1: constructing a model training data set; step S2: training the model tokenizer on a bank vocabulary; step S3: building the large-model base through incremental training of the llama pre-trained model; step S4: performing instruction fine-tuning with prompt engineering; step S5: fine-tuning the large model with reinforcement learning. The large language model is trained on the business, knowledge and data of a specific bank. The resulting model understands professional vocabulary commonly used by banks, such as amounts, financial product names and handling procedures; instruction training with customer-service dialogue data and the bank's internal knowledge base gives the model capabilities such as customer-service question answering and retrieval-based question answering; and the model can be iterated rapidly as the business continues to evolve.
Description
Technical Field
The invention relates to the field of bank language model training, and in particular to a training method for a banking-industry large language model.
Background
Since OpenAI released ChatGPT in November 2022, this powerful language model has caused a tremendous sensation around the world. Its striking results and broad range of applications have thoroughly changed people's understanding of the potential of large models and ignited expectations for artificial intelligence across industries. Large models are deep learning models with very large numbers of parameters and complex structures, used in natural language processing and machine learning. They are designed to handle large-scale text data and can automatically learn and understand the semantics, grammar and context of language.
Existing large language models, whether trained by commercial companies or open-sourced by academic institutions, are trained mainly on general-purpose corpora. Their understanding of banking business is therefore shallow: they cannot support scenarios such as bank customer service or knowledge-base question answering and cannot meet the requirements of banking business.
Disclosure of Invention
In view of the above, the present invention has been made to provide a banking large language model training method that overcomes or at least partially solves the above-mentioned problems.
According to one aspect of the present invention, there is provided a training method for a banking-industry large language model, the training method comprising:
step S1: constructing a model training data set;
step S2: training the model tokenizer on a bank vocabulary;
step S3: building the large-model base through incremental training of the llama pre-trained model;
step S4: performing instruction fine-tuning with prompt engineering;
step S5: fine-tuning the large model with reinforcement learning.
Optionally, step S1, constructing the model training data set, specifically comprises:
building a pre-training data set, scoring the data with heuristic rules and a quality model, and filtering the data set at document and sentence granularity; deduplicating the full data at document and sentence granularity with locality-sensitive hashing;
building an instruction fine-tuning data set.
Optionally, step S2, training the model tokenizer on a bank vocabulary, specifically comprises:
compiling an in-house specialized dictionary from the knowledge base, the dictionary containing banking industry terms, financial product names and bank-specific numeric terms, about 4,300 entries in total;
loading the specialized vocabulary when training the tokenizer with SentencePiece so that specialized terms in the text are not split, the training data being the unsupervised pre-training text of section 1.1 and the vocabulary size being set to 25k;
merging the trained tokenizer with the original llama tokenizer by merging their vocabularies to obtain a combined tokenizer.
Optionally, step S3, building the large-model base through incremental training of the llama pre-trained model, specifically comprises:
further training the Llama 13b model;
the Llama model architecture uses a Transformer decoder architecture.
Optionally, step S4, performing instruction fine-tuning with prompt engineering, specifically comprises: fine-tuning the model with the LoRA approach.
Optionally, step S5, fine-tuning the large model with reinforcement learning, specifically comprises:
sampling outputs with multiple policies and collecting human feedback to form a reinforcement learning data set;
training a reward model based on the Bloom-7b model;
the goal of the reward model (RM) is to characterize whether the model's output looks good to humans;
the input is [prompt, text generated by the model] and the output is a scalar that characterizes the text quality;
the formula is as follows:
loss(r_θ) = -E_{(x, y_0, y_1, i)~D}[log(σ(r_θ(x, y_i) - r_θ(x, y_{1-i})))]
where x and y denote the POST and the SUMMARY respectively, r_θ denotes the value of the reward model with parameter θ, and σ denotes the sigmoid function;
the reward model receives a sequence of text and returns a scalar reward whose value corresponds to human preference;
the large model is then fine-tuned using the predictions of the trained reward model, and the model policy is optimized with the PPO algorithm.
The invention provides a training method for a banking-industry large language model, comprising the following steps: step S1: constructing a model training data set; step S2: training the model tokenizer on a bank vocabulary; step S3: building the large-model base through incremental training of the llama pre-trained model; step S4: performing instruction fine-tuning with prompt engineering; step S5: fine-tuning the large model with reinforcement learning. The large language model is trained on the business, knowledge and data of a specific bank. The resulting model understands professional vocabulary commonly used by banks, such as amounts, financial product names and handling procedures; instruction training with customer-service dialogue data and the bank's internal knowledge base gives the model capabilities such as customer-service question answering and retrieval-based question answering; and the model can be iterated rapidly as the business continues to evolve.
The foregoing is only an overview of the technical solution of the present invention; it is provided so that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features and advantages of the invention can be more readily appreciated.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a reinforcement learning process according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a self-instruct architecture according to an embodiment of the present invention;
FIG. 3 is a diagram comparing the tokenization results of the original llama tokenizer and the Chinese-trained tokenizer, provided by an embodiment of the present invention;
FIG. 4 is a diagram of a model architecture according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a LoRA structure according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terms "comprising" and "having" and any variations thereof in the description, claims and drawings of the embodiments of the invention are intended to cover a non-exclusive inclusion, for example of a series of steps or elements.
The technical scheme of the invention is further described in detail below with reference to the accompanying drawings and the examples.
As shown in fig. 1, a training method for a banking-industry large model comprises:
Step A: constructing the model training data set
Step B: training the model tokenizer on a bank vocabulary
Step C: building the large-model base through incremental training of the llama pre-trained model
Step D: performing instruction fine-tuning with prompt engineering
Step E: fine-tuning the large model with reinforcement learning
The method is guided by the task scenarios: a training data set is constructed from general corpora, financial corpora and the bank's internal text data; a tokenizer is trained with the BPE algorithm of SentencePiece; the large-model base is built by unsupervised training on top of the llama 13b model; supervised training is performed through instruction fine-tuning; and the output is aligned with a reinforcement learning method.
Application scenarios of large models in the banking industry include customer service and support, knowledge question answering, retrieval-based question answering, content moderation, assisted text writing, text classification and sentiment analysis, entity recognition, knowledge-triple extraction, and others. By training the model on a data set constructed for banking business scenarios and applying instruction fine-tuning and reinforcement learning training, the model better understands banking business semantics and outputs accurate and compliant answers.
The specific implementation steps are as follows:
building training data sets
The pre-training data set comprises about 50% general corpus, 40% financial corpus and 10% internal bank corpus. The general corpus includes task data such as multi-turn dialogue, entity recognition, text classification and text summarization, along with other general text; the financial corpus includes open-source financial corpora, finance and economics news, and listed-company financial reports, annual reports, research reports, announcements and other financial documents; the bank-internal data mainly consist of the bank knowledge base, customer-service dialogue data, internal process documents, and the various notices and specifications issued by the bank. The data are scored with heuristic rules and a quality model, and the data set is filtered at document and sentence granularity. Over the full data, near-duplicates are removed at document and sentence granularity with locality-sensitive hashing. After cleaning and filtering, the data set is stored in JSON format, one JSON file per text. The training data set is about 300 GB.
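As an illustration of the locality-sensitive-hashing deduplication step, a minimal sketch is given below using MinHash LSH from the open-source datasketch package; the whitespace shingling, similarity threshold and key naming are assumptions, not the patent's exact configuration.

```python
# MinHash-LSH near-duplicate filtering sketch (datasketch package assumed available).
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.split():          # simple whitespace shingling for illustration
        m.update(token.encode("utf-8"))
    return m

def dedup(docs, threshold=0.8, num_perm=128):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, doc in enumerate(docs):
        m = minhash(doc, num_perm)
        if not lsh.query(m):            # keep only if no near-duplicate is already indexed
            lsh.insert(str(i), m)
            kept.append(doc)
    return kept
```

The same routine can be applied at document and at sentence granularity by changing what is passed in as a "doc".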
The instruction fine-tuning data set is used for supervised training. It is built from about 50,000 items of open-source Chinese instruction sets, instruction sets adapted from the in-house customer-service FAQ and knowledge base, and in-house data annotated with the self-instruct method, totaling about 30,000 items. The data set is constructed in JSON format with three fields, instruction, input and output; input may be left empty. The instruction data include the Chinese portions of the open-source Stanford Alpaca, ShareGPT and BELLE instruction sets and, most importantly, exclusive instruction data constructed from banking task scenarios and in-house knowledge. Building the exclusive instruction data mainly involves constructing an instruction set from customer-service FAQ data, constructing instructions from in-house knowledge, and constructing instruction data for specific task scenarios.
Steps of the self-instruct method for generating the instruction set:
Design 175 instructions representing different tasks and write an (instruction, input, output) or (instruction, output) example for each, taking these 175 items as the seed pool.
Generate new instructions with the pre-trained model;
Determine, for each model-generated instruction, whether it is a classification task;
Generate instances with the pre-trained model;
Filter and post-process the model-generated data;
Add the filtered and post-processed data to the seed pool;
Repeat steps 2 to 6 until the seed pool contains enough data.
When generating instructions, 6 hand-written instructions are randomly drawn from the seed pool together with 2 instructions generated by the model in previous rounds, 8 instructions in total. These are fed to the model, which is asked to output a new instruction. Whether an instruction belongs to a classification task is then determined, mainly because classification tasks use a different sample template from non-classification tasks when instances are generated: 12 classification instructions and 19 non-classification instructions are randomly selected from the seed pool, the newly generated instruction is appended, and the model is asked to output whether the new instruction is a classification task. For data diversity, a newly generated instruction is added to the seed pool only if its ROUGE-L score against the instructions already in the pool is below 0.7; instructions that a language model cannot process, such as those involving images, pictures or graphics, are excluded; and when instances are generated for an instruction, instances with identical input but different output are filtered out. About 50,000 instruction data items were obtained with the self-instruct method under manual review. The self-instruct architecture is shown in FIG. 2.
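A minimal sketch of the generation-and-filter loop described above is shown below. The build_prompt, parse_instance and generate helpers are hypothetical stand-ins for the prompting templates and the pre-trained model call; only the ROUGE-L diversity filter uses a concrete library (rouge-score).

```python
# Sketch of the self-instruct loop; generate(), build_prompt(), parse_instance()
# are hypothetical helpers wrapping the pre-trained model and its prompt templates.
import random
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def is_novel(candidate, pool, threshold=0.7):
    # Diversity filter: ROUGE-L against every pooled instruction must stay below 0.7.
    return all(
        scorer.score(item["instruction"], candidate)["rougeL"].fmeasure < threshold
        for item in pool
    )

def self_instruct(seed_pool, generate, target_size=50_000):
    pool = list(seed_pool)  # 175 hand-written (instruction, input, output) seeds
    while len(pool) < target_size:
        # 6 hand-written seeds plus 2 items from the growing pool as demonstrations.
        demos = random.sample(seed_pool, 6) + random.sample(pool, 2)
        instruction = generate(build_prompt("new_instruction", demos))
        task_is_cls = generate(build_prompt("is_classification", [instruction]))
        raw_instance = generate(build_prompt("instance", [instruction], task_is_cls))
        instance = parse_instance(raw_instance)  # -> {"input": ..., "output": ...} or None
        if instance and is_novel(instruction, pool):
            pool.append({"instruction": instruction, **instance})
    return pool
```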
2. Training the tokenizer
Because the original LLaMA has very limited support for Chinese, containing only a few hundred Chinese tokens, the efficiency of encoding and decoding Chinese text is severely affected, so the Chinese vocabulary needs to be expanded on top of LLaMA. First, an in-house specialized dictionary is compiled from the knowledge base; it contains banking industry terms, financial product names, bank-specific numeric terms and the like, about 4,300 entries in total. Then, when training the tokenizer with SentencePiece, the specialized vocabulary is loaded so that specialized terms in the text are not split; the training data is the unsupervised pre-training text of section 1.1, and vocab_size is set to 25k. The trained tokenizer is merged with the original llama tokenizer by merging their vocabularies, giving a combined tokenizer with a vocabulary of about 57k. To accommodate the new tokens, the word embedding and language-model head matrices are resized from V x H to V' x H, where V = 32,000 is the original vocabulary size and V' is the merged vocabulary size (the newly trained tokenizer contributes a 25k vocabulary). The new rows are appended to the end of the original embedding matrix, ensuring that the embeddings of tokens in the original vocabulary are not affected. The number of tokens produced by the Chinese LLaMA tokenizer is roughly halved relative to the original LLaMA tokenizer; comparing the two, the encoded length with the Chinese tokenizer is clearly reduced, showing that the method is effective in improving the Chinese understanding and generation abilities of the LLaMA model. The Chinese LLaMA tokenizer is then used to pre-train the Chinese LLaMA model on the standard language-modeling task, predicting the next token autoregressively, which further improves the model's Chinese understanding and generation abilities. FIG. 3 compares the tokenization results of the original llama tokenizer and the Chinese-trained tokenizer.
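The tokenizer training and vocabulary merge can be sketched roughly as follows; the file paths are placeholders, and merging via add_tokens is a simplification of merging the underlying SentencePiece models, assumed here only for illustration.

```python
# Train a bank-domain SentencePiece tokenizer with protected special terms,
# then merge it with the original LLaMA tokenizer and resize the embeddings.
import sentencepiece as spm
from transformers import LlamaTokenizer, LlamaForCausalLM

special_terms = [w.strip() for w in open("bank_terms.txt", encoding="utf-8")]  # ~4,300 entries

spm.SentencePieceTrainer.train(
    input="pretrain_corpus.txt",         # unsupervised pre-training text (section 1.1)
    model_prefix="bank_sp",
    vocab_size=25_000,
    model_type="bpe",
    user_defined_symbols=special_terms,  # keep bank terms unsplit
)

base = LlamaTokenizer.from_pretrained("path/to/llama-13b")
bank = spm.SentencePieceProcessor(model_file="bank_sp.model")

# Append pieces that are not already in the LLaMA vocabulary (merge step, simplified).
new_pieces = [bank.id_to_piece(i) for i in range(bank.get_piece_size())
              if bank.id_to_piece(i) not in base.get_vocab()]
base.add_tokens(new_pieces)              # merged vocabulary, roughly 57k entries

model = LlamaForCausalLM.from_pretrained("path/to/llama-13b")
model.resize_token_embeddings(len(base))  # new rows appended; original embeddings untouched
```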
3. Llama 13b model retraining
The Llama model architecture uses the Transformer decoder architecture, with several optimizations in the details, including the following.
3.1 RMS Pre-Norm
RMSNorm (Root Mean Square Layer Normalization) is a variant of the usual LayerNorm that makes the loss smoother during gradient descent. The main difference from LayerNorm is that the mean-subtraction part (re-centering) is removed and only the variance part (re-scaling) is kept. LLaMA applies RMSNorm to the inputs of the attention layer and the MLP rather than to their outputs, which makes training more stable.
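For illustration, a minimal RMSNorm module matching the description above might look like this (LLaMA's own code uses an equivalent formulation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Root-mean-square layer norm: rescale by the RMS of the features, no mean subtraction.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms
```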
3.2 SwiGLU activation function
LLaMA replaces the original ReLU with SwiGLU. SwiGLU automatically adjusts the information flow through learned parameters according to the characteristics of the input data; concretely, it uses a feed-forward neural network (FFN) with a learnable gating mechanism, given by the formula:
FFN_SwiGLU(x, W, V, W_2) = (Swish(xW) ⊗ xV) W_2
The formula first passes the product of the input x and the weight matrix W through the Swish nonlinear activation, then multiplies the result element-wise with the product of the input x and the weight matrix V. This introduces a GLU-like "gate" between the output of the Swish activation and the second linear transformation: the value of the gate is computed from the original input x through the linear transformation V, so it can dynamically control the output of the Swish activation. Finally the result is multiplied by the weight matrix W_2.
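A minimal PyTorch sketch of this SwiGLU feed-forward block is shown below; the hidden dimension is left as a parameter and is an assumption of the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    # FFN_SwiGLU(x) = (Swish(x W) * x V) W2, implemented with three linear layers.
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden_dim, bias=False)   # gate branch (Swish-activated)
        self.v = nn.Linear(dim, hidden_dim, bias=False)   # value branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w(x)) * self.v(x))     # silu is Swish with beta = 1
```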
3.3 RoPE rotary position encoding
RoPE (Rotary Position Embedding) encodes position by rotation; the idea is to realize relative position encoding through an absolute-position-encoding formulation. Position encoding is particularly important for large models: because large models are trained on long text, the representation of long text and the model's ability to model long-range structure matter greatly. The benefit of RoPE is that it keeps the absolute position information of absolute position encoding while also making position information relative under the inner-product operation. Absolute position encoding has advantages such as fast computation, but its drawbacks are that extending the length is cumbersome and the absolute position itself carries little practical meaning. Relative position encoding, on the other hand, is meaningful for learning relations between tokens (for example, the probability of a relation between two far-apart tokens is small), so it usually gives better results; it is also easier to extend in length, because whatever the context size, only inputs within the maximum distance need to be attended to. Its drawback is that it is not as fast to compute as absolute position encoding.
Suppose first that, before position information is added, the original encoding vectors are two-dimensional row vectors q_m and k_n, where m and n are absolute positions. We now need a transformation that introduces m and n into q_m and k_n, i.e. we look for transformations
q̃_m = f(q, m), k̃_n = f(k, n)
That is, we design operations f(·, m) and f(·, n) for q and k respectively such that, after these operations, q̃_m and k̃_n carry the absolute position information of positions m and n.
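A compact sketch of rotary position embedding follows; it uses the interleaved-pair formulation for illustration and simplifies the head-dimension handling.

```python
import torch

def rope_frequencies(head_dim: int, seq_len: int, base: float = 10000.0):
    # theta_i = base^(-2i/d); angles m * theta_i for every position m.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)   # (seq_len, head_dim/2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (..., seq_len, head_dim); rotate each consecutive pair of dimensions by m * theta_i.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```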
3.4 Model structure construction
When building the Transformer, Llama builds it block by block: each Transformer block contains a self-attention (SA) module and an FFN, and the full Transformer network is built by stacking blocks.
3.4.1 Construction of the self-attention layer
The input x is passed through three linear layers to obtain x_q, x_k and x_v respectively;
rotary position encoding is applied to x_q and x_k;
x_k and x_v are cached;
the attention output is then computed as softmax(x_q x_k^T / sqrt(d)) x_v.
One implementation detail is the caching mechanism, designed to avoid recomputing token features during generation. When the features of the n-th token are computed, tokens 1 to n-1 are needed; that is, every generation step requires all previous information, and recomputing it from scratch each time would be very wasteful. Therefore the information for each position is computed only once and then cached.
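A single-head sketch of this cached self-attention, reusing the apply_rope helper above, might look as follows; it is illustrative rather than the patent's exact implementation, and omits multi-head splitting and masking.

```python
import math
import torch

class CachedSelfAttention(torch.nn.Module):
    # Single-head sketch: project to q/k/v, apply RoPE to q and k, append k/v to a cache,
    # and attend over everything cached so far (past positions are never recomputed).
    def __init__(self, dim: int):
        super().__init__()
        self.wq = torch.nn.Linear(dim, dim, bias=False)
        self.wk = torch.nn.Linear(dim, dim, bias=False)
        self.wv = torch.nn.Linear(dim, dim, bias=False)
        self.cache_k, self.cache_v = [], []

    def forward(self, x, cos, sin):
        # cos/sin must correspond to the absolute positions of the tokens in x.
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
        self.cache_k.append(k)
        self.cache_v.append(v)
        keys = torch.cat(self.cache_k, dim=-2)
        values = torch.cat(self.cache_v, dim=-2)
        scores = q @ keys.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ values
```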
3.4.2 Construction of the FFN layer
The FFN in LLaMA uses three fully connected layers to implement FFN_SwiGLU, namely
FFN(x) = W_2 (Swish(W_1 x) ⊗ W_3 x)
A self-attention module and an FFN are assembled into a transformer block, transformer blocks are stacked with torch's ModuleList, and together with the forward part this forms a complete Transformer (decoder) structure. The forward part takes tokens as input, first performs token embedding and then adds position information. For a decoder model, a mask is needed to prevent label leakage, so an upper-triangular mask matrix is constructed. The layer-by-layer computation through the Transformer blocks then follows.
3.4.3 Optimizer selection:
(1) The AdamW optimizer is used with hyper-parameters β1 = 0.9 and β2 = 0.95, a weight decay of 0.1 and gradient clipping at 1.0.
(2) A cosine learning-rate schedule is used, with the final learning rate equal to 10% of the maximum learning rate.
(3) 2,000 warmup steps are used, and the learning rate and batch size are adjusted with the model size.
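These choices can be wired up roughly as below; model, dataloader and the learning rate are placeholders, and the standard cosine schedule shown decays to zero, so pinning the final learning rate at 10% of the maximum would require a custom schedule.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1  # lr is a placeholder
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2000, num_training_steps=80_000
)

for step, batch in enumerate(dataloader):
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```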
The model architecture diagram is shown in fig. 4.
3.4.4 Model incremental training:
Because training large language models is expensive, high performance is also important when building them. To achieve high-performance training, the method trains on 8 x A100 (80 GB) GPUs and uses the following techniques:
Fused CUDA kernels: the fused CUDA kernels provided in xformers fuse multiple operations, reducing data transfer between the GPU and CPU and improving training efficiency. The memory_efficient_attention operation from Meta's open-source xformers library is used for the self-attention computation, which improves performance noticeably, by about 30%.
Parallelized training: multi-GPU parallel training is supported through the accelerate library plus DeepSpeed ZeRO Stage 3 with offload and activation checkpointing to speed up training.
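One way to express the ZeRO Stage 3 + offload setup is a DeepSpeed configuration along the following lines; the batch sizes and precision setting are illustrative assumptions.

```python
# Illustrative DeepSpeed ZeRO-3 + CPU offload configuration (dict form).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,       # placeholder
    "gradient_accumulation_steps": 32,         # placeholder
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
        "overlap_comm": True,
    },
    "gradient_clipping": 1.0,
}
```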
For the 13B model, the native PyTorch Llama implementation in transformers trains at about 1,378 tokens/s/GPU, while this method reaches 3,290 tokens/s/GPU, essentially matching the 3,370 tokens/s/GPU of the original Llama. Pre-training used 330B tokens, requiring roughly 43,000 GPU hours of training. Training ran for 80K steps in total, with a global batch size of 4M tokens, consistent with Llama.
4. Model fine-tuning
Model fine-tuning here refers to giving human instructions in natural-language form and tuning the pre-trained large model directly on a collection of NLP task sets. It improves the language model's performance on unseen tasks, i.e. its zero-shot capability: instruction tuning stimulates the model's understanding ability, allowing it to generalize better to unseen tasks by understanding what it is being asked to do.
The method fine-tunes the model with LoRA. LoRA, short for Low-Rank Adaptation of Large Language Models, is a parameter-efficient fine-tuning (PEFT) technique developed by Microsoft researchers to address the cost of fine-tuning large language models. Its basic principle is to freeze the pre-trained model weights and, with the original parameters frozen, add an extra network layer to the model and train only the newly added parameters. Because the number of new parameters is small, the cost of fine-tuning drops markedly while achieving an effect close to full-model fine-tuning.
A bypass is added beside the original PLM (pre-trained language model) that performs a down-projection followed by an up-projection, simulating the so-called intrinsic rank.
During training, the parameters of the PLM are fixed and only the down-projection matrix A and the up-projection matrix B are trained. The input and output dimensions of the model are unchanged, and at the output the BA product is added to the PLM's output.
A is initialized with a random Gaussian distribution and B with a zero matrix, which guarantees that the bypass matrix BA is still a zero matrix at the start of training.
Suppose a pre-trained model is to be fine-tuned on a downstream task; then the pre-trained parameters must be updated, which can be written as
W_0 + ΔW
where W_0 are the initial pre-trained parameters and ΔW is the update to be learned. With full-parameter fine-tuning, the number of updated parameters equals that of W_0, so fully fine-tuning a large language model is very expensive; LoRA only needs to fine-tune ΔW.
Concretely, let the pre-trained matrix be W_0 ∈ R^{d×k}. Its update can be written as
W_0 + ΔW = W_0 + BA,  B ∈ R^{d×r},  A ∈ R^{r×k}
where the rank r ≪ min(d, k).
During LoRA training, W_0 is fixed and only A and B are trainable parameters.
In the forward pass, W_0 and ΔW are multiplied by the same input x and the results are added:
h = W_0 x + ΔW x = W_0 x + BA x
This idea is somewhat similar to a residual connection, with the bypass update used to simulate the full fine-tuning process. Full fine-tuning can also be viewed as a special case of LoRA (when r equals k).
During inference, LoRA introduces almost no extra latency: only W = W_0 + ΔW needs to be computed. Combining LoRA with the Transformer is also simple, adding a single bypass in the computation of the QKV attention. FIG. 5 shows the LoRA structural framework.
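The update rule h = W_0 x + BA x can be written as a minimal PyTorch module as a sketch; the alpha/r scaling follows the usual LoRA convention, and the initialization matches the description above (Gaussian A, zero B).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # h = W0 x + (alpha / r) * B A x, with W0 frozen and only A, B trainable.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                          # freeze W0
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```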
The method uses Hugging Face's open-source PEFT library combined with DeepSpeed for fine-tuning training.
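A sketch of such a setup with the PEFT library is shown below; the rank, dropout and target modules are illustrative assumptions rather than the patent's settings.

```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("path/to/banking-llama-13b")  # placeholder path
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # bypass added on the attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA A/B matrices are trainable
```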
The maximum context length of a model fine-tuned from llama-7b is 2,048 tokens, yet applications such as long dialogues, summarizing long documents or long-horizon planning often exceed the preset context window, so a large model that can handle a longer context window better fits the bank's business scenarios. The method uses position interpolation (PI) to extend the context window of existing pre-trained LLMs, including LLaMA. PI directly shrinks the position indices so that the maximum position index matches the context-window limit of the pre-training phase. To accommodate more input tokens, position codes are interpolated at neighboring integer positions, exploiting the fact that position encoding can be applied to non-integer positions, which enables the context window to be expanded.
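Position interpolation can be sketched as a single rescaling of the position indices fed to RoPE, reusing the frequency idea above; the 2,048-token training window default comes from the description, and the target length is left as a parameter.

```python
import torch

def interpolated_rope_frequencies(head_dim, target_len, train_len=2048, base=10000.0):
    # PI: shrink position indices by train_len / target_len so the largest index
    # still falls inside the context window seen during pre-training.
    scale = train_len / target_len
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(target_len).float() * scale      # non-integer positions
    angles = torch.outer(positions, inv_freq)
    return angles.cos(), angles.sin()
```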
5. Reinforcement learning training
A large language model has strong natural-language understanding capability, but it must also be ensured that its output meets human expectations, so that its behavior is aligned with humans and it can understand human instructions and give answers that are helpful to humans.
Reinforcement learning is a paradigm that learns a policy from feedback. The reinforcement learning model (agent) interacts with the environment: for each given state s_t it takes an action a_t, obtains a reward r_t from the environment, and moves to the next state s_{t+1}; the process then repeats. After accumulating a series of such interaction experiences, the model adjusts its policy to maximize the reward obtained over the interaction. In this way the agent learns a policy for taking beneficial actions in a given state, which is the goal of reinforcement learning.
Reinforcement learning training steps:
5.1 Sample outputs with multiple policies and collect human feedback to form a reinforcement learning data set
5.2 Train a reward model based on the Bloom-7b model
The goal of the reward model (RM) is to characterize whether the model's output looks good to humans: the input is [prompt, text generated by the model] and the output is a scalar that characterizes the quality of the text. The formula is as follows:
loss(r_θ) = -E_{(x, y_0, y_1, i)~D}[log(σ(r_θ(x, y_i) - r_θ(x, y_{1-i})))]
where x and y denote the POST and the SUMMARY respectively, r_θ denotes the value of the reward model with parameter θ, and σ denotes the sigmoid function. The reward model receives a sequence of text and returns a scalar reward whose value corresponds to human preference. It can be modeled end-to-end with a language model, or with a modular system (for example ranking the outputs and then converting the ranking into rewards). This reward value is critical for seamlessly plugging into existing RL algorithms later on.
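The pairwise loss above translates directly into PyTorch; reward_model here is an assumed callable that returns one scalar per (prompt, response) pair.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_model, x, y_chosen, y_rejected):
    # loss = -log(sigmoid(r(x, y_chosen) - r(x, y_rejected))), averaged over the batch.
    r_chosen = reward_model(x, y_chosen)      # scalar reward per sample
    r_rejected = reward_model(x, y_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```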
5.3 Fine-tune the large model with the predictions of the trained reward model and optimize the model policy with the PPO algorithm
Since the fine-tuning task of the initial language model is modeled as a reinforcement learning (RL) problem, basic elements such as the policy, the action space and the reward function need to be defined. The policy is the language model itself, which receives a prompt as input and outputs a sequence of text (or a probability distribution over text); the action space is the set of all arrangements of vocabulary tokens over all output positions (a single position typically has on the order of 50k candidate tokens); the observation space is the set of possible input token sequences (i.e. prompts), which is obviously enormous, being all arrangements of vocabulary tokens over all input positions; and the reward function computes an initial reward from the RM model trained in the previous section, with a constraint term superimposed.
Let the vocabulary be Σ and the language model be ρ; then the probability distribution over a sequence of length n can be written as
ρ(x_0 ... x_{n-1}) = ∏_{0≤k<n} ρ(x_k | x_0 ... x_{k-1})
The input space is X = Σ^{≤m} and the output space is Y = Σ^{≤n}; for example, an input article x ∈ X might be 1,000 tokens long while y ∈ Y is a 100-token summary. The probability of generating summary y from article x can then be written as ρ(y|x) = ρ(xy)/ρ(x).
The policy is initialized as π = ρ and then updated with the PPO algorithm. Defining the reward function as r, the expected reward is expressed as
E_π[r] = E_{x~D, y~π(·|x)}[r(x, y)]
the PPO algorithm optimizing reward function calculating steps are as follows:
inputting the prompt X into the initial LM and the current fine-tuned LM to obtain output texts y1 and y2 respectively, and transmitting the text from the current strategy to the RM to obtain a scalar prize r θ 。
Comparing the generated text of the two models to calculate a penalty term for the difference, typically designed as a scaling of Kullback-Leibler (KL) divergence between the output word distribution sequences, i.e. r=r θ -beta rKL, wherein
This term is used to penalize the RL strategy to generate a large deviation from the initial model in each training batch to ensure that the model outputs reasonably consistent text. If this penalty term is removed, it may cause the model to generate messy code text in the optimization to fool the bonus model into providing a high bonus value.
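The penalized reward can be computed roughly as below; the beta coefficient and the per-token log-probability inputs are assumptions, and the KL term is approximated by the log-probability difference over the generated sequence, following the description rather than any specific library.

```python
import torch

def penalized_reward(r_theta, logprobs_policy, logprobs_ref, beta=0.02):
    # r = r_theta - beta * KL(policy || reference), with the KL term approximated
    # by the summed log-probability difference of the generated tokens.
    kl = (logprobs_policy - logprobs_ref).sum(dim=-1)   # sum over generated tokens
    return r_theta - beta * kl
```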
Finally, following the PPO algorithm, the reward objective is optimized on the current batch of data (which follows from PPO's on-policy nature). PPO is a trust-region optimization (Trust Region Optimization, TRO) algorithm that uses gradient constraints to ensure the update step does not destabilize the learning process; the A2C (synchronous advantage actor-critic) algorithm may also be used to optimize the gradient.
Training a large model with reinforcement learning (RL) significantly improves its performance and achieves striking results on a variety of tasks. The advantages include:
1. Adaptation to complex tasks: a large language model trained with reinforcement learning generalizes well and can be applied to a variety of complex natural-language-processing tasks, such as text generation, question-answering systems and sentiment analysis, without task-specific pre-training for each task.
2. Adaptive optimization: reinforcement learning can adjust model parameters dynamically, continually optimizing the model according to feedback from the task and the environment so that it performs better on specific tasks. While interacting with the environment, the model updates its policy through learning and feedback, achieving adaptive optimization.
3. Handling non-convex optimization problems: the training objective of a large language model is usually a non-convex optimization problem, and traditional optimization methods may get stuck in local optima. Reinforcement learning uses gradient-based optimization algorithms such as Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO), which are better suited to complex non-convex problems.
4. Surpassing human level: large language models trained with reinforcement learning have exceeded human-level performance on certain tasks. Such breakthroughs bring unprecedented opportunities to application domains and help realize higher levels of intelligent applications.
Beneficial effects: large language models trained with reinforcement learning have significant advantages in natural-language-processing tasks. They adapt to complex tasks, optimize adaptively, handle non-convex problems and can exceed human-level performance, laying a solid foundation for advancing the development and application of natural-language-processing technology.
The foregoing detailed description of the invention has been presented for purposes of illustration and description, and it should be understood that the invention is not limited to the particular embodiments disclosed, but is intended to cover all modifications, equivalents, alternatives, and improvements within the spirit and principles of the invention.
Claims (6)
1. A training method for a banking-industry large language model, characterized in that the training method comprises:
step S1: constructing a model training data set;
step S2: training the model tokenizer on a bank vocabulary;
step S3: building the large-model base through incremental training of the llama pre-trained model;
step S4: performing instruction fine-tuning with prompt engineering;
step S5: fine-tuning the large model with reinforcement learning.
2. The training method for a banking-industry large language model according to claim 1, characterized in that step S1, constructing the model training data set, specifically comprises:
building a pre-training data set, scoring the data with heuristic rules and a quality model, and filtering the data set at document and sentence granularity; deduplicating the full data at document and sentence granularity with locality-sensitive hashing;
building an instruction fine-tuning data set.
3. The training method for a banking-industry large language model according to claim 1, characterized in that step S2, training the model tokenizer on a bank vocabulary, specifically comprises:
compiling an in-house specialized dictionary from the knowledge base, the dictionary containing banking industry terms, financial product names and bank-specific numeric terms, about 4,300 entries in total;
loading the specialized vocabulary when training the tokenizer with SentencePiece so that specialized terms in the text are not split, the training data being the unsupervised pre-training text of section 1.1 and the vocabulary size being set to 25k;
merging the trained tokenizer with the original llama tokenizer by merging their vocabularies to obtain a combined tokenizer.
4. The training method for a banking-industry large language model according to claim 1, characterized in that step S3, building the large-model base through incremental training of the llama pre-trained model, specifically comprises:
further training the Llama 13b model;
the Llama model architecture uses a Transformer decoder architecture.
5. The training method for a banking-industry large language model according to claim 1, characterized in that step S4, performing instruction fine-tuning with prompt engineering, specifically comprises: fine-tuning the model with the LoRA approach.
6. The training method for a banking-industry large language model according to claim 1, characterized in that step S5, fine-tuning the large model with reinforcement learning, specifically comprises:
sampling outputs with multiple policies and collecting human feedback to form a reinforcement learning data set;
training a reward model based on the Bloom-7b model;
the goal of the reward model (RM) being to characterize whether the model's output looks good to humans;
the input being [prompt, text generated by the model] and the output being a scalar that characterizes the text quality;
the formula being as follows:
loss(r_θ) = -E_{(x, y_0, y_1, i)~D}[log(σ(r_θ(x, y_i) - r_θ(x, y_{1-i})))]
where x and y denote the POST and the SUMMARY respectively, r_θ denotes the value of the reward model with parameter θ, and σ denotes the sigmoid function;
the reward model receiving a sequence of text and returning a scalar reward whose value corresponds to human preference;
and fine-tuning the large model with the predictions of the trained reward model and optimizing the model policy with the PPO algorithm.