CN113569033A - Government affairs question generation method and device
Government affairs question generation method and device
- Publication number
- CN113569033A (application CN202110890114.8A)
- Authority
- CN
- China
- Prior art keywords
- model
- question
- data
- government affair
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
Abstract
The invention discloses a government affairs question generation method and device, applicable to the technical field of artificial intelligence. The method comprises the following steps: acquiring government affairs data, open-source question-answer data, and self-built government affairs question-answer text data; inputting the government affairs data into a pre-established UNILM network model for training to obtain a pre-trained model, wherein the UNILM network model is initialized in advance with BERT Chinese model parameters; inputting the open-source question-answer data and the self-built government affairs question-answer text data into the pre-trained model for fine-tuning to obtain a question generation model; and generating government affairs questions with the question generation model. The invention improves the quality of questions generated from Chinese data.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a government affairs question generation method and device.
Background
Question Generation (QG) refers to generating questions related to an article such that their answers can be found in the article. The main application scenarios are as follows: in dialogue systems, a chatbot can proactively raise questions to keep the interaction going; when building question-answering and machine reading comprehension datasets, automatic question generation reduces manual annotation work; and when constructing question-answer corpora, questions can be generated automatically, which helps to quickly build the raw question-answer data of a question-answering system. At present, the government affairs field lacks a unified question-answer corpus; to reduce manual annotation cost, question generation technology is applied to generate questions automatically from government affairs data, helping to construct such datasets.
Existing question generation methods perform poorly on Chinese data, so a government affairs question generation scheme that overcomes this shortcoming is urgently needed.
Disclosure of Invention
The embodiment of the invention provides a government affairs question generation method for improving the quality of questions generated from Chinese data, comprising the following steps:
acquiring government affairs data, open-source question-answer data, and self-built government affairs question-answer text data;
inputting the government affairs data into a pre-established UNILM network model for training to obtain a pre-trained model, wherein the UNILM network model is initialized in advance with BERT Chinese model parameters;
inputting the open-source question-answer data and the self-built government affairs question-answer text data into the pre-trained model for fine-tuning to obtain a question generation model;
and generating government affairs questions with the question generation model.
The embodiment of the invention also provides a government affairs question generation device for improving the quality of questions generated from Chinese data, comprising:
a data acquisition module for acquiring government affairs data, open-source question-answer data, and self-built government affairs question-answer text data;
a model training module for inputting the government affairs data into a pre-established UNILM network model for training to obtain a pre-trained model, wherein the UNILM network model is initialized in advance with BERT Chinese model parameters;
a model fine-tuning module for inputting the open-source question-answer data and the self-built government affairs question-answer text data into the pre-trained model for fine-tuning to obtain a question generation model;
and a question generation module for generating government affairs questions with the question generation model.
An embodiment of the present invention further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the government affairs question generation method when executing the program.
An embodiment of the present invention also provides a computer-readable storage medium storing a computer program for executing the above government affairs question generation method.
According to the embodiment of the invention, government affairs data, open-source question-answer data, and self-built government affairs question-answer text data are obtained; the government affairs data are input into a pre-established UNILM network model for training to obtain a pre-trained model, the UNILM network model having been initialized with BERT Chinese model parameters; the open-source question-answer data and the self-built government affairs question-answer text data are input into the pre-trained model for fine-tuning to obtain a question generation model; and government affairs questions are generated with the question generation model. Since no Chinese UNILM has been open-sourced, the embodiment initializes the UNILM network model with BERT Chinese parameters at construction, trains it on government affairs data to obtain a pre-trained model, and fine-tunes that model for the downstream task on a fusion of open-source question-answer data and self-built government affairs question-answer text data, which effectively improves both the model and the quality of questions generated from Chinese data.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort. In the drawings:
FIG. 1 is a schematic diagram of a government affairs question generation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a government affairs question generation method in an embodiment of the present invention;
FIG. 3 is a structural diagram of a government affairs question generation device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
First, terms used in the embodiments of the present invention are explained:
Question Generation (QG): generating related questions from an article, where the answers can be obtained from the article, are preset, or are computed.
NQG (Neural Question Generation): question generation with neural networks.
Word embedding: the collective name for a family of language models and representation techniques in natural language processing that embed a high-dimensional sparse space (with as many dimensions as there are words) into a low-dimensional dense vector space.
Pre-trained model: in NLP, a model that learns contextual text representations by predicting word tokens over a massive corpus, and can then be applied to specific downstream tasks. Deep learning is demanding of data, particularly labeled data; a pre-trained model brings strong representation power to various tasks and alleviates the lack of large labeled datasets.
As described above, current government affairs question generation methods perform poorly on Chinese data. Question generation methods divide broadly into rule-based and neural-network-based approaches. Rule-based methods extract the relevant entities of the target sentence and fill them into manually written templates (following rules and grammar), then select one or more of the most suitable templates with a ranking method to generate the question; the generated questions are fluent, but quality is low and the approach depends on hand-crafted templates. Neural approaches adopt a seq2seq model that encodes paragraphs and answers; because Chinese lacks related datasets, research and application results are few and the effect is poor. Pre-trained models, by contrast, have achieved dominant results across natural language understanding tasks.
In order to improve the quality of questions generated from Chinese data, an embodiment of the present invention provides a government affairs question generation method. As shown in fig. 1, the method may include:
step 101, acquiring government affairs data, open-source question-answer data, and self-built government affairs question-answer text data;
step 102, inputting the government affairs data into a pre-established UNILM network model for training to obtain a pre-trained model, wherein the UNILM network model is initialized in advance with BERT Chinese model parameters;
step 103, inputting the open-source question-answer data and the self-built government affairs question-answer text data into the pre-trained model for fine-tuning to obtain a question generation model;
and step 104, generating government affairs questions with the question generation model.
As shown in fig. 1, the embodiment of the present invention obtains government affairs data, open-source question-answer data, and self-built government affairs question-answer text data; inputs the government affairs data into a pre-established UNILM network model for training to obtain a pre-trained model, the UNILM network model having been initialized with BERT Chinese model parameters; inputs the open-source question-answer data and the self-built government affairs question-answer text data into the pre-trained model for fine-tuning to obtain a question generation model; and generates government affairs questions with the question generation model. Since no Chinese UNILM has been open-sourced, BERT Chinese parameters are adopted for initialization at construction time; the established UNILM network model is then trained on government affairs data to obtain the pre-trained model, and downstream fine-tuning fuses open-source question-answer data with self-built government affairs question-answer text data, effectively improving both the model and the quality of questions generated from Chinese data.
In this embodiment, government affairs data, open-source question-answer data, and self-built government affairs question-answer text data are obtained, and the government affairs data are input into the pre-established UNILM network model for training to obtain the pre-trained model, the UNILM network model being pre-established from BERT Chinese model parameters.
In this embodiment, inputting the government affairs data into the pre-established UNILM network model for training to obtain the pre-trained model comprises:
performing word segmentation on the government affairs data to obtain an input sequence comprising a plurality of words;
determining the joint word-vector representation, position vector, and text-segment information of the input sequence;
and inputting the joint word-vector representation, position vector, and text-segment information of the input sequence into the pre-established UNILM network model to output a text vector representation, wherein the UNILM network model comprises multiple Transformer layers, and each Transformer layer supports three language-modeling modes: a unidirectional language model, a bidirectional language model, and an end-to-end (sequence-to-sequence) language model. A minimal sketch of this input construction is given after this list.
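The following PyTorch sketch illustrates the input construction just described: the joint word vector, position vector, and text-segment embedding are summed before entering the Transformer stack. The sizes (vocabulary 21128, hidden width 768) and the token ids are illustrative assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn

class UnilmInputEmbedding(nn.Module):
    def __init__(self, vocab_size=21128, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)   # joint word vectors
        self.pos = nn.Embedding(max_len, hidden)      # position vectors
        self.seg = nn.Embedding(n_segments, hidden)   # text-segment information
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Sum of the three input signals, as described above.
        x = self.tok(token_ids) + self.pos(positions) + self.seg(segment_ids)
        return self.norm(x)

emb = UnilmInputEmbedding()
token_ids = torch.tensor([[101, 2769, 812, 102, 872, 102]])  # hypothetical ids
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1]])             # source vs. target segment
print(emb(token_ids, segment_ids).shape)                     # torch.Size([1, 6, 768])
```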
In this embodiment, inputting the government affairs data into the pre-established UNILM network model for training to obtain the pre-trained model further comprises:
after the input sequence is obtained, masking the input sequence with a self-attention mask matrix;
and determining the joint word-vector representation, position vector, and text-segment information of the input sequence comprises: determining the joint word-vector representation, position vector, and text-segment information of the masked input sequence.
The following describes the structure of the UNILM network model. First, the problem is defined as Q* = argmax_Q Prob(Q | P, A), where Q is the generated question, P is a paragraph, and A is a known answer. UNILM is a deep Transformer network: the UNILM network model comprises multiple Transformer layers, and pre-training uses three unsupervised language-modeling objectives: a bidirectional language model, a unidirectional language model, and an end-to-end language model (sequence-to-sequence LM). The model uses a parameter-shared Transformer network with specific self-attention mask matrices (self-attention masks) to control which context is visible during prediction. When fine-tuning for downstream tasks, the UNILM network model can be viewed as a unidirectional encoder, a bidirectional encoder, or an end-to-end model, to accommodate different downstream tasks (natural language understanding and generation). For the question generation task, the end-to-end structure takes the text and the answer span within the text as input and outputs the question. The UNILM network model has been comprehensively compared with the BERT model on the GLUE, SQuAD 2.0, and CoQA datasets, and it set new records on three natural language generation tasks: CNN/DailyMail summarization (ROUGE-L 40.63, an improvement of 2.16), question generation on CoQA (F1 82.5, an improvement of 37.1), and question generation on SQuAD (BLEU-4 22.88, an improvement of 6.5). A sketch of the sequence-to-sequence self-attention mask follows.
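As a concrete illustration of these self-attention masks, here is a minimal PyTorch sketch (an illustrative implementation, not code from the patent) of the sequence-to-sequence mask detailed in the next paragraph: source tokens see the whole source, and target tokens see the source plus the target prefix up to and including themselves.

```python
import torch

def seq2seq_attention_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    """Additive self-attention mask for UNILM's end-to-end (seq2seq) mode.

    Rows index query positions and columns key positions over the packed
    sequence [source | target]; 0 marks a visible position and -inf a
    blocked one (added to the attention logits before softmax).
    """
    n = src_len + tgt_len
    mask = torch.zeros(n, n)
    # Source queries must never attend to target keys.
    mask[:src_len, src_len:] = float("-inf")
    # Target queries: causal visibility inside the target segment only.
    mask[src_len:, src_len:] = torch.triu(
        torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
    return mask

print(seq2seq_attention_mask(3, 2))
```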
In a specific implementation, during the training stage of the UNILM network model, the input sequence is masked with the self-attention mask matrix: some words are randomly masked, and the target task is to restore them. The input X is a token sequence of a text segment; as in BERT, its representation combines word vectors, position encodings, and segment encodings, and the segment encodings distinguish the unidirectional, bidirectional, and end-to-end training modes. The backbone consists of 24 Transformer layers, the output of each layer being the input of the next. Each layer controls the attention range of each word through mask matrices, which allows the joint training of multiple training objectives. Given an input sequence x = x1 ... xn (each xi a word), a context-aware vector representation is obtained for each word through the multi-layer Transformer network. The input vector combines the joint word vector, position vector, and text-segment information of the input word sequence; it is fed into the multi-layer Transformer network, whose self-attention computes over the whole input sequence to obtain the representation of the text. When an input token is masked, the masking operation replaces it according to a predefined mask matrix. The resulting output vector of the Transformer network is fed into a softmax classifier to predict the masked word, and the parameters of the UNILM model are learned by minimizing the cross-entropy loss between the predicted words and the original words. In the end-to-end mode, the input has the structure "[SOS] S1 [EOS] S2 [EOS]": any word in S1 (the source segment) can attend to any word within S1, while a word in S2 (the target segment) can attend only to the preceding words of the target segment, to itself, and to all of S1. In the self-attention mask matrix of the end-to-end language model, the source-to-source block is all zeros, so all source words can attend to each other; the source-to-target block is set to negative infinity so that the source sequence cannot see the target sequence; and within the target block, the upper-right triangle is set to negative infinity and the remaining values to zero, so that each target word attends only to what precedes it and ignores what follows. A sketch of the masked-word training loss follows.
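The following sketch shows the cross-entropy objective just described, computed only at the randomly masked positions; the shapes and random tensors are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, original_ids, masked_positions):
    """Cross-entropy between the predicted words and the original
    (pre-mask) words, restricted to the masked positions."""
    active_logits = logits[masked_positions]         # [n_masked, vocab]
    active_targets = original_ids[masked_positions]  # [n_masked]
    return F.cross_entropy(active_logits, active_targets)

# Hypothetical shapes for illustration: batch 2, length 6, vocab 21128.
logits = torch.randn(2, 6, 21128)
original_ids = torch.randint(0, 21128, (2, 6))
masked_positions = torch.tensor([[False, True, False, False, True, False],
                                 [True, False, False, True, False, False]])
print(masked_lm_loss(logits, original_ids, masked_positions))
```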
In this embodiment, the open-source question-answer data and the self-built government affairs question-answer text data are input into the pre-trained model for fine-tuning to obtain the question generation model, and government affairs questions are generated with the question generation model.
In this embodiment, inputting the open-source question-answer data and the self-built government affairs question-answer text data into the pre-trained model for fine-tuning comprises:
determining a source sequence and a target sequence from the open-source question-answer data and the self-built government affairs question-answer text data;
masking the target sequence with a self-attention mask matrix;
and inputting the source sequence and the masked target sequence into the pre-trained model for fine-tuning.
In a specific implementation of pre-trained model fine-tuning, the UNILM model can serve downstream natural language understanding and natural language generation tasks, and fine-tuning adapts the model to the specific downstream task. Fine-tuning for natural language generation resembles pre-training in its use of self-attention masks. Let S1 and S2 denote the source sequence and the target sequence, respectively. Fine-tuning randomly masks a proportion of the words in the target sequence S2 and lets the model learn to recover them; the training objective is to maximize the likelihood of the masked words given the context. The question generation task is one kind of NLG task and can be expressed as an end-to-end problem: the first part is the input article and answer, and the second part is the generated question. By randomly masking words in the question, the model learns to recover them and thereby trains its parameters. We fine-tuned UNILM for 10 epochs on the training set, with a batch size of 32, a masking probability of 0.7, and a learning rate of 2e-5. Label smoothing was 0.1, and the other hyperparameters were the same as in pre-training. A sketch of this configuration follows.
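Collected into code, the stated hyperparameters look like the following sketch; the optimizer choice (AdamW) and the stand-in model are assumptions, since the patent does not name them.

```python
import torch
import torch.nn as nn

BATCH_SIZE = 32        # batch size
MASK_PROB = 0.7        # masking probability on the target sequence
LEARNING_RATE = 2e-5   # learning rate
LABEL_SMOOTHING = 0.1  # label smoothing
EPOCHS = 10            # fine-tuning epochs

model = nn.Linear(768, 21128)  # stand-in for the real UNILM prediction head
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
# Label-smoothed cross entropy over the masked-word predictions
# (the label_smoothing argument requires PyTorch >= 1.10).
criterion = nn.CrossEntropyLoss(label_smoothing=LABEL_SMOOTHING)
```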
A specific example is given below to illustrate the government affairs question generation scheme of the embodiment. As shown in fig. 2, the model training phase proceeds as follows. 1. Initialize the model parameters, selecting Google's official open-source BERT Chinese pre-trained parameters as the initialization: the representations are strong, the model structures are consistent, and the cost of training from scratch would be enormous (a loading sketch follows this paragraph). 2. Update the parameters: pre-train the model on roughly 3 GB of crawled government affairs data (government-related text, regardless of format) so that the model better represents words of the government affairs domain. 3. Pre-training uses the same tasks as the model's original training, with unidirectional, bidirectional, and end-to-end objectives.
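A sketch of step 1, assuming the Hugging Face `transformers` library (the patent names no tooling): because UNILM reuses BERT's architecture, the open-source Chinese BERT weights can initialize it directly, and the unidirectional / bidirectional / end-to-end modes then differ only in the self-attention masks, not in the parameters.

```python
from transformers import BertForMaskedLM, BertTokenizerFast

# Load Google's open-source Chinese BERT checkpoint as the initialization.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
print(model.config.num_hidden_layers, model.config.hidden_size)  # 12 768
```

Note that the public Chinese checkpoint is the 12-layer base model; the 24-layer backbone described earlier would require a correspondingly larger checkpoint.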
Model fine-tuning: fine-tune with open-source question-answer data (question-answer corpora based on the Baidu DuReader dataset and Wikipedia) and self-built government affairs question-answer data (5,000 manually annotated and constructed text pairs, each consisting of a question (Q) and an answer (A)). The two are mixed because the open-source data is plentiful but lacks domain knowledge while the self-built data is small, and experiments show that mixing them improves the model.
Question generation phase: question generation is a decoding process with the model; a passage and an answer span are input, encoded by the encoder, and decoded to generate a question. The target data is the domain data to be processed, here government affairs data. Data preprocessing: preprocess the article and the manually annotated answers to obtain the text segment and the answer's span within it. Model computation: encode the text segment and answer with the UNILM model, then generate the question through model computation and decoding. For the training objectives of the different language models, four cloze (fill-in-the-blank) tasks are designed. In a cloze task, some words are randomly masked with a mask matrix; the Transformer network then computes the corresponding output vectors, which are fed into a softmax classifier to predict the masked words. The goal of UNILM parameter optimization is to minimize the cross entropy between the predicted and true values of the masked words. A greedy-decoding sketch of the generation step follows.
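This sketch illustrates the decoding loop under stated assumptions: `model` maps token ids to per-position vocabulary logits under the seq2seq attention mask, the special-token ids are hypothetical, and greedy search stands in for whatever decoding strategy an implementation would actually use (the patent does not specify one).

```python
import torch

@torch.no_grad()
def generate_question(model, src_ids, max_len=64, eos_id=102, mask_id=103):
    """Greedy decoding: src_ids holds the encoded passage and answer span."""
    tgt = []
    for _ in range(max_len):
        # UNILM decodes by appending a [MASK] slot and predicting the word
        # that should fill it, given the source and the target prefix.
        inp = torch.tensor([src_ids + tgt + [mask_id]])
        logits = model(inp)                 # [1, seq_len, vocab]
        next_id = int(logits[0, -1].argmax())
        if next_id == eos_id:               # model predicts [EOS]: stop
            break
        tgt.append(next_id)
    return tgt                              # token ids of the generated question
```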
Government affairs question generation proceeds as follows:
(1) load the vocabulary and build the tokenizer;
(2) load the BERT Chinese model, configure the network parameters, and set up the UNILM network structure;
(3) update the parameters: load the government affairs data and pre-train the model, training on the new data with the UNILM pre-training method and updating the model parameters;
(4) fine-tune the downstream task: the task uses the end-to-end model structure and maximizes the objective function. Fine-tune on open-source question-answer data (question-answer corpora based on the Baidu DuReader dataset and Wikipedia) and self-built government affairs question-answer data (5,000 manually annotated and constructed text pairs of questions (Q) and answers (A)); the data are mixed because the open-source data is plentiful but lacks domain knowledge while the self-built data is small, and experiments show that mixing improves the model. Fine-tuning process: load the Q-A data described above. Question generation is an NLG task, so the end-to-end structure is selected, and fine-tuning resembles pre-training in its use of the self-attention mask. Let S1 and S2 denote the source and target sequences, and construct the input [SOS] S1 [EOS] S2 [EOS]. Fine-tuning randomly masks a proportion of the words in the target sequence so that the model learns to recover them; the training objective is to maximize the likelihood of the masked words given the context. This differs slightly from pre-training: during pre-training, words in both the source and the target sequences are randomly masked, that is, both ends participate in training, whereas during fine-tuning only the target sequence is masked, because fine-tuning focuses on the target end. Note that during fine-tuning the end marker [EOS] of the target may also be masked, so that the model learns to predict it and thus learns to end the NLG task automatically. A sketch of this target-only masking follows.
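The following sketch implements the target-only masking of step (4); the [MASK] id is an assumption, and the masking probability is the value quoted earlier.

```python
import torch

MASK_ID, MASK_PROB = 103, 0.7  # assumed [MASK] id; probability from above

def mask_target_only(input_ids: torch.Tensor, src_len: int):
    """input_ids packs '[SOS] S1 [EOS] S2 [EOS]'; src_len covers '[SOS] S1 [EOS]'.
    Only the target segment, including its closing [EOS], may be masked,
    so the model also learns to end generation."""
    masked = input_ids.clone()
    candidates = torch.zeros_like(input_ids, dtype=torch.bool)
    candidates[:, src_len:] = True  # S2 plus its end-of-sequence marker
    to_mask = candidates & (torch.rand(input_ids.shape) < MASK_PROB)
    masked[to_mask] = MASK_ID
    return masked, to_mask  # the loss is computed only at `to_mask` positions
```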
The government affairs question generation scheme of the embodiment can effectively generate questions from documents automatically, helping to solve corpus construction for government affairs question-answering systems, reducing the cost of manual corpus annotation, and saving time and money. Since no Chinese UNILM has been open-sourced, the open-source Chinese BERT model is selected for initialization, and training and fine-tuning on self-built government affairs data strengthen the model's representation of domain knowledge. Fusing general question generation data with self-built government affairs question generation data for downstream fine-tuning further improves the model. With the end-to-end structure, questions can be generated automatically in batches.
Based on the same inventive concept, embodiments of the present invention further provide a government affairs question generation device, as described below. Since the principle by which the device solves the problem is similar to that of the government affairs question generation method, its implementation can refer to the implementation of the method, and repeated details are omitted.
Fig. 3 is a block diagram of a government affairs question generation device according to an embodiment of the present invention. As shown in fig. 3, the device comprises:
a data acquisition module 301 for acquiring government affairs data, open-source question-answer data, and self-built government affairs question-answer text data;
a model training module 302 for inputting the government affairs data into a pre-established UNILM network model for training to obtain a pre-trained model, wherein the UNILM network model is initialized in advance with BERT Chinese model parameters;
a model fine-tuning module 303 for inputting the open-source question-answer data and the self-built government affairs question-answer text data into the pre-trained model for fine-tuning to obtain a question generation model;
and a question generation module 304 for generating government affairs questions with the question generation model.
In one embodiment, the model training module 302 is further configured to:
performing word segmentation on the government affairs data to obtain an input sequence, wherein the input sequence comprises a plurality of words;
determining the joint word-vector representation, position vector, and text-segment information of the input sequence;
and inputting the joint word-vector representation, position vector, and text-segment information of the input sequence into the pre-established UNILM network model to output a text vector representation, wherein the UNILM network model comprises multiple Transformer layers, and each Transformer layer supports a unidirectional language model, a bidirectional language model, and an end-to-end language model.
In one embodiment, the model training module 302 is further configured to:
masking the input sequence with a self-attention mask matrix after the input sequence is obtained;
wherein determining the joint word-vector representation, position vector, and text-segment information of the input sequence comprises: determining the joint word-vector representation, position vector, and text-segment information of the masked input sequence.
In one embodiment, the model fine tuning module 303 is further configured to:
determining a source sequence and a target sequence from the open-source question-answer data and the self-built government affairs question-answer text data;
masking the target sequence with a self-attention mask matrix;
and inputting the source sequence and the masked target sequence into the pre-trained model for fine-tuning.
In summary, the embodiment of the present invention obtains government affairs data, open-source question-answer data, and self-built government affairs question-answer text data; inputs the government affairs data into a pre-established UNILM network model for training to obtain a pre-trained model, the UNILM network model having been initialized with BERT Chinese model parameters; inputs the open-source question-answer data and the self-built government affairs question-answer text data into the pre-trained model for fine-tuning to obtain a question generation model; and generates government affairs questions with the question generation model. Since no Chinese UNILM has been open-sourced, BERT Chinese parameters are adopted for initialization at construction time, government affairs data are then used to train the established UNILM network model into a pre-trained model, and downstream fine-tuning fuses open-source question-answer data with self-built government affairs question-answer text data, effectively improving both the model and the quality of questions generated from Chinese data.
Based on the aforementioned inventive concept, as shown in fig. 4, the present invention further provides a computer device 400 comprising a memory 410, a processor 420, and a computer program 430 stored on the memory 410 and executable on the processor 420, wherein the processor 420 implements the aforementioned government affairs question generation method when executing the computer program 430.
Based on the aforementioned inventive concept, the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the aforementioned government affairs question generation method.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, intended to illustrate rather than limit its technical solutions, and the scope of protection is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions of some technical features within the technical scope of the present disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention and shall be covered by it. Therefore, the protection scope of the present invention shall be subject to the claims.
Claims (10)
1. A government affairs question generation method, comprising:
acquiring government affairs data, open-source question-answer data, and self-built government affairs question-answer text data;
inputting the government affairs data into a pre-established UNILM network model for training to obtain a pre-trained model, wherein the UNILM network model is initialized in advance with BERT Chinese model parameters;
inputting the open-source question-answer data and the self-built government affairs question-answer text data into the pre-trained model for fine-tuning to obtain a question generation model;
and generating government affairs questions with the question generation model.
2. The government affairs question generation method of claim 1, wherein inputting the government affairs data into the pre-established UNILM network model for training to obtain the pre-trained model comprises:
performing word segmentation on the government affairs data to obtain an input sequence, wherein the input sequence comprises a plurality of words;
determining the joint word-vector representation, position vector, and text-segment information of the input sequence;
and inputting the joint word-vector representation, position vector, and text-segment information of the input sequence into the pre-established UNILM network model to output a text vector representation, wherein the UNILM network model comprises multiple Transformer layers, and each Transformer layer supports a unidirectional language model, a bidirectional language model, and an end-to-end language model.
3. The government affairs question generation method of claim 2, wherein inputting the government affairs data into the pre-established UNILM network model for training to obtain the pre-trained model further comprises:
masking the input sequence with a self-attention mask matrix after the input sequence is obtained;
and wherein determining the joint word-vector representation, position vector, and text-segment information of the input sequence comprises: determining the joint word-vector representation, position vector, and text-segment information of the masked input sequence.
4. The government affairs question generation method of claim 1, wherein inputting the open-source question-answer data and the self-built government affairs question-answer text data into the pre-trained model for fine-tuning comprises:
determining a source sequence and a target sequence from the open-source question-answer data and the self-built government affairs question-answer text data;
masking the target sequence with a self-attention mask matrix;
and inputting the source sequence and the masked target sequence into the pre-trained model for fine-tuning.
5. A government affairs question generation device, comprising:
a data acquisition module for acquiring government affairs data, open-source question-answer data, and self-built government affairs question-answer text data;
a model training module for inputting the government affairs data into a pre-established UNILM network model for training to obtain a pre-trained model, wherein the UNILM network model is initialized in advance with BERT Chinese model parameters;
a model fine-tuning module for inputting the open-source question-answer data and the self-built government affairs question-answer text data into the pre-trained model for fine-tuning to obtain a question generation model;
and a question generation module for generating government affairs questions with the question generation model.
6. The government affairs question generation device of claim 5, wherein the model training module is further configured to:
perform word segmentation on the government affairs data to obtain an input sequence, wherein the input sequence comprises a plurality of words;
determine the joint word-vector representation, position vector, and text-segment information of the input sequence;
and input the joint word-vector representation, position vector, and text-segment information of the input sequence into the pre-established UNILM network model to output a text vector representation, wherein the UNILM network model comprises multiple Transformer layers, and each Transformer layer supports a unidirectional language model, a bidirectional language model, and an end-to-end language model.
7. The government affairs question generation device of claim 6, wherein the model training module is further configured to:
mask the input sequence with a self-attention mask matrix after the input sequence is obtained;
and determine the joint word-vector representation, position vector, and text-segment information of the masked input sequence.
8. The government affairs question generation device of claim 5, wherein the model fine-tuning module is further configured to:
determine a source sequence and a target sequence from the open-source question-answer data and the self-built government affairs question-answer text data;
mask the target sequence with a self-attention mask matrix;
and input the source sequence and the masked target sequence into the pre-trained model for fine-tuning.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium storing a computer program for executing the method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110890114.8A CN113569033A (en) | 2021-08-04 | 2021-08-04 | Government affairs question generation method and device
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110890114.8A CN113569033A (en) | 2021-08-04 | 2021-08-04 | Government affairs question generation method and device
Publications (1)
Publication Number | Publication Date |
---|---|
CN113569033A (en) | 2021-10-29
Family
ID=78170304
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110890114.8A Pending CN113569033A (en) | 2021-08-04 | 2021-08-04 | Government affair problem generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113569033A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114818699A (en) * | 2022-04-30 | 2022-07-29 | 一贯智服(杭州)技术有限公司 | Associated knowledge generation method, auxiliary labeling system and application |
CN115470781A (en) * | 2022-11-01 | 2022-12-13 | 北京红棉小冰科技有限公司 | Corpus generation method and device and electronic equipment |
CN117544508A (en) * | 2023-10-13 | 2024-02-09 | 北京六方云信息技术有限公司 | Network equipment configuration query method and device, terminal equipment and storage medium |
- 2021-08-04: application CN202110890114.8A filed in China; published as CN113569033A, status pending.
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020184068A1 (en) * | 2001-06-04 | 2002-12-05 | Krishnan Krish R. | Communications network-enabled system and method for determining and providing solutions to meet compliance and operational risk management standards and requirements |
CN110647619A (en) * | 2019-08-01 | 2020-01-03 | 中山大学 | Common sense question-answering method based on question generation and convolutional neural network |
US20210209139A1 (en) * | 2020-01-02 | 2021-07-08 | International Business Machines Corporation | Natural question generation via reinforcement learning based graph-to-sequence model |
CN111639163A (en) * | 2020-04-29 | 2020-09-08 | 深圳壹账通智能科技有限公司 | Problem generation model training method, problem generation method and related equipment |
CN111930914A (en) * | 2020-08-14 | 2020-11-13 | 工银科技有限公司 | Question generation method and device, electronic equipment and computer-readable storage medium |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114818699A (en) * | 2022-04-30 | 2022-07-29 | 一贯智服(杭州)技术有限公司 | Associated knowledge generation method, auxiliary labeling system and application |
CN114818699B (en) * | 2022-04-30 | 2024-09-03 | 一贯智服(杭州)技术有限公司 | Associated knowledge generation method, auxiliary labeling system and application |
CN115470781A (en) * | 2022-11-01 | 2022-12-13 | 北京红棉小冰科技有限公司 | Corpus generation method and device and electronic equipment |
CN115470781B (en) * | 2022-11-01 | 2023-03-14 | 北京红棉小冰科技有限公司 | Corpus generation method and device and electronic equipment |
CN117544508A (en) * | 2023-10-13 | 2024-02-09 | 北京六方云信息技术有限公司 | Network equipment configuration query method and device, terminal equipment and storage medium |
CN117544508B (en) * | 2023-10-13 | 2024-08-13 | 北京六方云信息技术有限公司 | Network equipment configuration query method and device, terminal equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |