CN111666374A - Method for integrating additional knowledge information into deep language model - Google Patents
Method for integrating additional knowledge information into deep language model
- Publication number
- CN111666374A (application CN202010410674.4A)
- Authority
- CN
- China
- Prior art keywords
- training
- entity
- information
- text
- language model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method for integrating additional knowledge information into a deep language model. Knowledge information is fused into the deep language model during pre-training: entities are tagged in a large-scale natural-language corpus, and negative samples are constructed by perturbing the natural-language text at the entity level, so as to strengthen the relation-matching module of a knowledge-base question-answering system. The training and knowledge integration of the model specifically comprise the following steps: constructing a vocabulary, identifying entity information, creating the negative samples required for training, and pre-training the deep language model. Compared with the prior art, the method introduces structured knowledge information into the parameters of the deep language model, so that the model can perform semantic understanding of the natural-language input text that incorporates factual structured information and achieves improved performance on the corresponding tasks; the method is simple, convenient, and efficient.
Description
Technical Field
The invention relates to the technical field of computer question-answering systems, in particular to a method for integrating additional knowledge information into a deep language model based on a knowledge graph.
Background
Natural language processing studies techniques for automatically processing, understanding, and generating natural language with a computer. Question answering is an important sub-field of natural language processing, whose goal is to have a computer automatically answer questions posed by a user. Question-answering systems come in several types, including systems based on reading comprehension, on community retrieval, and on knowledge bases. A reading-comprehension-based system answers a question by searching for potential answers within a given article passage. A community-retrieval-based system uses a retrieval engine to find possibly relevant answers or text passages in a community, then re-ranks all candidate answers according to the context and the given information to answer the user's question. A knowledge-base-based system determines a sub-graph search range in a given knowledge base, searches that sub-graph for the most relevant entities, and returns these candidate entities to the user as factual answers to the question.
A knowledge base is a database that stores factual information in a structured form. Typically, this factual information is maintained as "entity-predicate-entity" triples. Different entities are connected to one another through various relation predicates, forming a network graph structure known as a knowledge graph. Knowledge bases are widely used across many fields; in natural language processing, a knowledge base is often used to introduce additional factual information at the encoder stage, improving the performance of neural networks on natural language processing tasks. A minimal illustration of such triples is sketched below.
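The snippet below is an illustration only; the specific entities, predicates, and variable names are hypothetical and are not taken from the patent. It shows how a handful of "entity-predicate-entity" triples induce the graph structure described above.

```python
from collections import defaultdict

# Hypothetical "entity-predicate-entity" triples (illustrative only).
triples = [
    ("Jack", "works_for", "Microsoft"),
    ("Microsoft", "industry", "Software"),
    ("Google", "industry", "Software"),
]

# Connecting entities through relation predicates yields a graph:
# each entity maps to its outgoing (predicate, entity) edges.
graph = defaultdict(list)
for head, predicate, tail in triples:
    graph[head].append((predicate, tail))

print(graph["Jack"])  # [('works_for', 'Microsoft')]
```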
In recent years, deep language models such as BERT and ELMo have strongly influenced model design in natural language processing. These models have extremely large numbers of parameters and are pre-trained on enormous corpora; through unsupervised tasks such as language modeling, they automatically encode the grammatical and semantic information implicit in natural language. Taking BERT as an example, it uses a 12-layer Transformer with 768-dimensional hidden layers as its encoder and builds its training set from tens of millions of corpus entries crawled from multiple sources on the Internet. The word representations obtained in this way carry rich contextual information and have strong representational capability, whereas traditional word vectors carry neither contextual information nor such massive corpus information, so that with traditional vectors ideal performance can only be obtained by feeding them into an additional encoder and fine-tuning on the corresponding data set. The application of deep language models has greatly improved the accuracy of many sub-tasks in natural language processing and markedly improved models' ability to understand language.
Most deep language models in the prior art use only natural-language corpora as their training sets and therefore lack structured knowledge-base information.
Disclosure of Invention
The object of the invention is to provide, in view of the deficiencies of the prior art, a method for integrating additional knowledge information into a deep language model. The method pre-trains the deep language model with structured knowledge information fused in, so as to strengthen the relation-matching module of a knowledge-base question-answering system: entities are tagged in a large-scale natural-language corpus, negative samples are constructed by perturbing the natural-language text at the entity level, and the structured knowledge information is thereby introduced into the parameters of the deep language model. As a result, the model performs semantic understanding of the natural-language input text that incorporates factual structured information and achieves improved performance on the corresponding tasks; the method is simple, convenient, and efficient.
The specific technical solution for achieving the object of the invention is as follows: a method for integrating additional knowledge information into a deep language model, characterized in that knowledge information is fused into the deep language model during pre-training to strengthen the relation-matching module of a knowledge-base question-answering system, the training and knowledge integration of the model specifically comprising the following steps:
1) constructing a vocabulary mapping entity mentions to entities;
2) identifying entity information in the text using the constructed vocabulary;
3) perturbing the natural-language text to create the negative samples required for training;
4) pre-training the parameters of the deep language model with the newly created training set to strengthen the performance of the deep language model.
The entities covered by the vocabulary in step 1) are generally derived from a knowledge base or from text containing structured knowledge information, such as Wikipedia.
The text in step 2) comes from large-scale natural-language corpora crawled from the Internet, such as Wikipedia and the New York Times. The vocabulary constructed in step 1) is used to judge whether the text contains entities, and the recognized mentions are linked to the corresponding entities in the knowledge base to serve as training positive samples.
In step 3), the original corpus is perturbed according to the structured information obtained after entity linking.
The new training set in step 4) consists of the crawled original corpus and the negative samples created by the perturbation.
Compared with the prior art, the invention has the following advantages:
1) Practicability: by introducing structured knowledge information, the model can perform semantic understanding of the natural-language input text that incorporates factual structured information, improving performance in fields that require factual question answering.
2) Efficiency: in the training-corpus construction stage, no additional distant-supervision means or long sequence of complicated steps is needed to align the natural-language text with the structured triples of the knowledge base; complex engineering details are avoided and the labeling error rate is reduced at the same time.
3) Ease of use: neither the internal structure nor the external interface of the deep language model needs to be modified, and the trained model parameters can be imported directly into other scenarios. No modification of the model code is needed during training, so the method is convenient and quick to use.
Drawings
FIG. 1 is a model schematic of the present invention.
Detailed Description
The present invention is further illustrated by the following specific examples.
Example 1
Referring to FIG. 1, the invention uses a constructed vocabulary to tag entities in the corpus, perturbs the documents to construct positive and negative training samples, then judges the correctness of an input text with a classifier stacked on top of the deep language model and fine-tunes the model parameters. The resulting deep language model parameters are applied in a knowledge-base question-answering system to improve the performance of the corresponding tasks. The method of fusing knowledge information into the deep language model during pre-training specifically comprises the following steps:
Step one: building the entity-to-mention vocabulary
The vocabulary mapping entity mentions to entities is constructed from hyperlinked text rich in manual annotations, such as Wikipedia. For any entity e, all anchor-text mentions m_i under which it appears as a hyperlink are counted and taken as names of that entity, yielding a vocabulary used for entity identification and annotation. Besides counting the co-occurrences between entities and mentions, a series of related prior probabilities is also computed. First, the probability p_e(m) that a given mention m is used as an entity is counted; its value is the ratio between the number of times the mention appears as an entity link and the total number of times the mention appears in the text. A high probability indicates that the mention is very likely to denote an entity in the current text; a low probability indicates that the term is a common word and that the hyperlink labels may be noisy. In addition, the conditional probability p(e_j | m_i) of an entity e_j given the current mention m_i is counted. Because entities and mentions stand in a many-to-many relationship (the same entity may have multiple mentions, and the same mention may correspond to multiple entities), this conditional probability makes it possible to select the most appropriate entity according to the context when entity disambiguation is required. A minimal sketch of this counting procedure is given below.
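The sketch below illustrates how such anchor statistics could be collected; it is an illustration under assumptions. The input format (pairs of anchor text and linked entity, plus raw mention counts) and all function and variable names are hypothetical and not taken from the patent.

```python
from collections import defaultdict

def build_alias_table(anchor_pairs, plain_mention_counts):
    """Collect mention-to-entity statistics from hyperlink anchor texts.

    anchor_pairs: iterable of (mention, entity) pairs harvested from hyperlinks
        (e.g. Wikipedia anchors).
    plain_mention_counts: dict mapping mention -> total number of times the
        string occurs in the corpus, whether linked or not.
    """
    link_counts = defaultdict(lambda: defaultdict(int))  # mention -> entity -> count
    for mention, entity in anchor_pairs:
        link_counts[mention][entity] += 1

    table = {}
    for mention, entity_counts in link_counts.items():
        linked_total = sum(entity_counts.values())
        total = max(plain_mention_counts.get(mention, 0), linked_total)
        # p_e(m): how often this mention is used as an entity link at all
        p_entity = linked_total / total
        # p(e_j | m_i): conditional probability of each candidate entity given the mention
        candidates = {e: c / linked_total for e, c in entity_counts.items()}
        table[mention] = {"p_entity": p_entity, "candidates": candidates}
    return table
```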
Step two: identifying entity information in the text through the constructed vocabulary
The natural-language text is labeled with the constructed vocabulary, and all possible entities are found to serve as training positive samples. The labeled texts are taken from text automatically crawled from the Internet, including Wikipedia and newspaper websites such as the New York Times. Because the hyperlink information in web pages is largely missing and also noisy, a training set built directly from it would likely be of poor quality. Therefore, instead of using the hyperlinks in the web pages as entity labels, the texts are re-labeled with the constructed vocabulary.
During labeling, a dictionary tree, also called a prefix tree, is built from the constructed vocabulary, so that all occurrences of vocabulary entries in the texts can be matched quickly. After matching yields candidate entity mentions, the conditional probability of each candidate entity for the current mention is computed according to formula (a), in which E_m denotes the set of all potential candidate entities corresponding to the current mention m. If this conditional probability is lower than a preset threshold c, the currently matched candidate entity is considered noise and is filtered out. The result is natural text annotated with an entity sequence [e_1, e_2, e_3, ..., e_k] rich in structured information. A sketch of this matching and filtering procedure is given below.
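The following sketch is illustrative only and rests on assumptions: it reuses the hypothetical alias table from the previous sketch, assumes the candidate probabilities are the link-count ratios described in step one, and the threshold value and all names (build_trie, annotate_entities, threshold_c) are invented for illustration.

```python
def build_trie(alias_table):
    """Prefix tree (dictionary tree) over tokenized mentions for fast longest-match lookup."""
    trie = {}
    for mention in alias_table:
        node = trie
        for tok in mention.split():
            node = node.setdefault(tok, {})
        node["_end"] = mention  # marks that a complete mention ends at this node
    return trie

def annotate_entities(tokens, trie, alias_table, threshold_c=0.1):
    """Scan the token sequence, keep the longest mention match at each position,
    and keep the best candidate entity only if its conditional probability
    exceeds the threshold c; otherwise the match is treated as noise."""
    spans = []
    i = 0
    while i < len(tokens):
        node, match, j = trie, None, i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if "_end" in node:
                match = (i, j, node["_end"])
        if match:
            start, end, mention = match
            entity, prob = max(alias_table[mention]["candidates"].items(),
                               key=lambda kv: kv[1])
            if prob >= threshold_c:
                spans.append((start, end, entity))
            i = end
        else:
            i += 1
    return spans  # list of (start, end, entity) annotations
```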
Step three: creating negative examples for training
To perturb the natural-language text and create the negative samples required for training, the entity sequence [e_1, e_2, e_3, ..., e_k] found in the natural text is randomly shuffled, yielding a new entity sequence [e'_1, e'_2, e'_3, ..., e'_k] in order of appearance, which is then filled back into the original natural-language text to obtain a negative sample used in training. For example, the natural-language text "Jack to leave Microsoft to join Google as a senior scientist." becomes, after perturbation, "Scientist to leave Google to join Jack as a senior Microsoft." Obviously, the perturbed corpus is no longer a correct piece of natural language; the deep language model is pre-trained by exploiting exactly this property.
Besides random entity perturbation, the invention also introduces perturbation under type constraints: half of the entities in the sentence are replaced with entities of a different type and the other half with entities of the same type, to construct a negative sample. For example, under type-constrained perturbation the same example sentence "Jack to leave Microsoft to join Google as a senior scientist." becomes "Beijing to leave Alibaba to join China as a senior engineer." Perturbation with type constraints makes the model memorize not only the correspondence between entity-predicate-entity triples and natural language, but also the correspondence between entity types and natural-language text, further improving the performance of the deep language model on related tasks. A sketch of both perturbation strategies follows.
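The functions below are a minimal sketch of the two perturbation strategies, assuming entity spans in the format produced by the previous sketch and a hypothetical entity-type inventory (entity_type, entities_by_type); none of these names or structures come from the patent.

```python
import random

def shuffle_negative(tokens, spans):
    """Random entity perturbation: permute the entity mentions, keep all other text."""
    surfaces = [tokens[s:e] for s, e, _ in spans]
    random.shuffle(surfaces)
    out, prev = [], 0
    for (s, e, _), surface in zip(spans, surfaces):
        out += tokens[prev:s] + surface
        prev = e
    return out + tokens[prev:]

def typed_negative(tokens, spans, entity_type, entities_by_type):
    """Type-constrained perturbation: replace half of the entities with entities
    of a different type and the other half with entities of the same type."""
    flip = set(random.sample(range(len(spans)), len(spans) // 2))  # cross-type replacements
    out, prev = [], 0
    for idx, (s, e, ent) in enumerate(spans):
        etype = entity_type[ent]
        if idx in flip:
            other_types = [t for t in entities_by_type if t != etype]
            pool = entities_by_type[random.choice(other_types)] if other_types else entities_by_type[etype]
        else:
            pool = [x for x in entities_by_type[etype] if x != ent] or entities_by_type[etype]
        out += tokens[prev:s] + random.choice(pool).split()
        prev = e
    return out + tokens[prev:]
```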
Step four: pre-training deep language model parameters
The parameters of the deep language model are pre-trained with the newly created training set to strengthen the model's performance: unsupervised pre-training on the obtained training set blends the knowledge information into the model parameters. BERT is used as the deep language model; it employs a multi-layer Transformer as its encoder and encodes the input text layer by layer with a multi-head self-attention mechanism. With BERT as the encoder, the input text is converted into a sequence of vector representations according to formula (b):
[v_CLS, v_1, v_2, ..., v_k, v_SEP] = Enc([w_CLS, w_1, w_2, ..., w_k, w_SEP])    (b);
In the formula, w_k is a word in the input sequence and v_k is the vector corresponding to that word; w_CLS and w_SEP are the two special symbols CLS and SEP, and v_CLS and v_SEP are their corresponding vector representations, which are rich in contextual information and can be used directly as a fixed-length vector representation of the input text.
The first vector v_CLS in the vector sequence is input into a multi-layer classifier, and a score s_SET is calculated according to formula (c):
s_SET = MLP(v_CLS)    (c);
In the formula, s_SET is a one-dimensional scalar representing the classifier's score for the linguistic correctness of the current input text: a high score indicates that the input is close to natural-language text, while a low score indicates that the input has word-order problems. Training on positive and negative samples uses a hinge loss, and the parameters of the deep language model are updated according to formula (d):
L(w_pos, w_neg) = max(0, margin - s_SET + s'_SET)    (d);
In the formula, w_pos and w_neg are the input token sequences of the positive and negative samples, and s_SET and s'_SET are the classifier scores of the positive and negative samples, respectively.
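The PyTorch snippet below is only an illustrative sketch of formulas (b)-(d); it assumes the Hugging Face transformers BertModel API, and the class name, layer sizes, and variable names (KnowledgeScorer, pos_batch, neg_batch) are invented for illustration rather than taken from the patent.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class KnowledgeScorer(nn.Module):
    def __init__(self, margin=1.0, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)  # Enc(.) of formula (b)
        self.classifier = nn.Sequential(                      # MLP(.) of formula (c)
            nn.Linear(self.encoder.config.hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )
        self.margin = margin

    def score(self, input_ids, attention_mask):
        # v_CLS: hidden vector of the [CLS] token, used as the sentence representation
        v_cls = self.encoder(input_ids=input_ids,
                             attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.classifier(v_cls).squeeze(-1)

    def forward(self, pos_batch, neg_batch):
        s_pos = self.score(**pos_batch)  # score of the positive (natural) text
        s_neg = self.score(**neg_batch)  # score of the perturbed negative text
        # hinge loss of formula (d): max(0, margin - s_pos + s_neg)
        return torch.clamp(self.margin - s_pos + s_neg, min=0).mean()

# During the first training round the encoder parameters can be frozen,
# as described below, so that only the classifier is updated.
model = KnowledgeScorer()
for p in model.encoder.parameters():
    p.requires_grad = False
```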
The parameters of the deep language model are frozen during the first round of training; this prevents the upper-level classifier, while its own parameters are still unstable, from using erroneous prediction signals to adjust the deep language model and thereby degrading its accuracy. The negative samples are generated by perturbing the positive samples at the entity level, and deep-language-model parameters obtained by ordinary pre-training cannot effectively distinguish the positive samples from the negative ones, because from the perspective of the neural network the grammar of the two texts is nearly the same. After the perturbation the entity order is exchanged while the predicates between the entities remain unchanged, so the training forces the deep language model to memorize the correspondence between the structured information in the entity-predicate-entity triples and the natural-language input text, and to distinguish naturally produced corpora from perturbed, incorrect corpora. After pre-training is finished, the adjusted deep language model is applied to a knowledge-base question-answering system. In such a system, matching the semantic similarity between a text and a relation path in the knowledge base is an important step, and an encoder that can recognize the knowledge information in the input text effectively improves the performance of the network. The invention uses the deep language model as the encoder and fine-tunes on the specific training set starting from the adjusted network parameters instead of the initial ones. The adjusted network parameters perform better, improving accuracy on the WebQuestions test set from 64.9% to 65.3%.
The above examples are only for further illustration of the present invention and are not intended to limit the present invention, and all equivalent implementations of the present invention should be included in the scope of the claims of the present invention.
Claims (4)
1. A method for integrating additional knowledge information into a deep language model, characterized in that knowledge information is fused into the deep language model during pre-training to strengthen the relation-matching module of a knowledge-base question-answering system, the training and knowledge integration of the model comprising the following specific steps:
step one: building the entity-to-mention vocabulary
constructing a vocabulary mapping entity mentions to entities from a knowledge base or from hyperlinked text rich in manual annotations;
step two: identifying entity information in the text
labeling the natural-language text with the constructed vocabulary, identifying the entity information in the text, and linking it to the corresponding entities in the knowledge base to serve as training positive samples;
step three: creating the negative samples required for training
perturbing the natural-language text according to the structured information obtained after entity linking, so as to create the negative samples required for training;
step four: pre-training the deep language model parameters
pre-training the deep language model parameters with the created negative samples as part of a new training set, that is, performing unsupervised pre-training of the deep language model on the obtained training set so as to integrate the knowledge information into the model parameters and strengthen its performance.
2. The method of claim 1, wherein the entities in the vocabulary are derived from a knowledge base or from text containing structured knowledge information.
3. The method of claim 1, wherein the natural-language text is derived from text automatically crawled from the Internet.
4. The method of integrating additional knowledge information into a deep language model as claimed in claim 1, wherein the new training set is composed of the crawled original corpus and the negative samples created after perturbation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010410674.4A CN111666374A (en) | 2020-05-15 | 2020-05-15 | Method for integrating additional knowledge information into deep language model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010410674.4A CN111666374A (en) | 2020-05-15 | 2020-05-15 | Method for integrating additional knowledge information into deep language model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111666374A true CN111666374A (en) | 2020-09-15 |
Family
ID=72383648
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010410674.4A Pending CN111666374A (en) | 2020-05-15 | 2020-05-15 | Method for integrating additional knowledge information into deep language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111666374A (en) |
-
2020
- 2020-05-15 CN CN202010410674.4A patent/CN111666374A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108875051A (en) * | 2018-06-28 | 2018-11-23 | 中译语通科技股份有限公司 | Knowledge mapping method for auto constructing and system towards magnanimity non-structured text |
CN110083690A (en) * | 2019-04-10 | 2019-08-02 | 华侨大学 | A kind of external Chinese characters spoken language training method and system based on intelligent answer |
CN110990590A (en) * | 2019-12-20 | 2020-04-10 | 北京大学 | Dynamic financial knowledge map construction method based on reinforcement learning and transfer learning |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210406467A1 (en) * | 2020-06-24 | 2021-12-30 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for generating triple sample, electronic device and computer storage medium |
CN112328891A (en) * | 2020-11-24 | 2021-02-05 | 北京百度网讯科技有限公司 | Method for training search model, method for searching target object and device thereof |
CN112328891B (en) * | 2020-11-24 | 2023-08-01 | 北京百度网讯科技有限公司 | Method for training search model, method for searching target object and device thereof |
CN115495593A (en) * | 2022-10-13 | 2022-12-20 | 中原工学院 | Mathematical knowledge graph construction method based on big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115238101B (en) | Multi-engine intelligent question-answering system oriented to multi-type knowledge base | |
Neculoiu et al. | Learning text similarity with siamese recurrent networks | |
CN111931506B (en) | Entity relationship extraction method based on graph information enhancement | |
CN117171333B (en) | Electric power file question-answering type intelligent retrieval method and system | |
CN110175585B (en) | Automatic correcting system and method for simple answer questions | |
CN102663129A (en) | Medical field deep question and answer method and medical retrieval system | |
CN110765277B (en) | Knowledge-graph-based mobile terminal online equipment fault diagnosis method | |
CN114238653B (en) | Method for constructing programming education knowledge graph, completing and intelligently asking and answering | |
CN114416942A (en) | Automatic question-answering method based on deep learning | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
CN114036281B (en) | Knowledge graph-based citrus control question-answering module construction method and question-answering system | |
CN115392259B (en) | Microblog text sentiment analysis method and system based on confrontation training fusion BERT | |
CN112883175B (en) | Meteorological service interaction method and system combining pre-training model and template generation | |
CN115599902B (en) | Oil-gas encyclopedia question-answering method and system based on knowledge graph | |
CN112818106A (en) | Evaluation method of generating type question and answer | |
CN118093834B (en) | AIGC large model-based language processing question-answering system and method | |
CN111666374A (en) | Method for integrating additional knowledge information into deep language model | |
CN112632250A (en) | Question and answer method and system under multi-document scene | |
CN115599899A (en) | Intelligent question-answering method, system, equipment and medium based on aircraft knowledge graph | |
CN116561251A (en) | Natural language processing method | |
CN113971394A (en) | Text repeat rewriting system | |
CN117251455A (en) | Intelligent report generation method and system based on large model | |
CN113535897A (en) | Fine-grained emotion analysis method based on syntactic relation and opinion word distribution | |
Nugraha et al. | Typographic-based data augmentation to improve a question retrieval in short dialogue system | |
Lee | Natural Language Processing: A Textbook with Python Implementation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200915 |