CN111666374A - Method for integrating additional knowledge information into deep language model - Google Patents
Method for integrating additional knowledge information into deep language model
- Publication number
- CN111666374A (application CN202010410674.4A)
- Authority
- CN
- China
- Prior art keywords
- training
- entity
- information
- text
- language model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method for integrating additional knowledge information into a deep language model. Knowledge information is fused into the deep language model during pre-training: entities are tagged in a large-scale natural-language corpus, and negative samples are constructed by perturbing the natural-language text at the entity level, so as to strengthen the relation-matching module of a knowledge-base question-answering system. The training and knowledge integration of the model specifically comprise the following steps: constructing a vocabulary, identifying entity information, creating the negative samples required for training, and pre-training the deep language model. Compared with the prior art, the method introduces structured knowledge information into the parameters of the deep language model, so that the model can perform semantic understanding of the natural-language input text that incorporates factual structured information and achieves improved performance on the corresponding tasks; the method is simple, convenient, and efficient.
Description
Technical Field
The invention relates to the technical field of computer question-answering systems, in particular to a method for integrating additional knowledge information into a deep language model based on a knowledge graph.
Background
Natural language processing studies techniques for automatically processing, understanding, and generating natural language with a computer. Question answering is an important sub-field of natural language processing, whose goal is to have a computer automatically answer questions posed by a user. Question-answering systems come in several types, including systems based on reading comprehension, on community retrieval, and on knowledge bases. A reading-comprehension-based system answers a question by searching for potential answers within a given article passage. A community-retrieval-based system uses a retrieval engine to find possibly relevant answers or text passages in a community, then re-ranks all candidate answers according to the context and the given information to answer the user's question. A knowledge-base-based system determines a sub-graph search range in a given knowledge base, searches that sub-graph for the most relevant entities, and returns these candidate entities to the user as factual answers to the question.
A knowledge base is a database that stores factual information in a structured form. Typically, this factual information is maintained as "entity-predicate-entity" triples. Different entities are connected to one another through various relation predicates, forming a network graph structure known as a knowledge graph. Knowledge bases are widely used across many fields; in natural language processing, a knowledge base is often used to introduce additional factual information at the encoder stage, improving the performance of neural networks on natural language processing tasks. A minimal illustration of such triples is sketched below.
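The snippet below is an illustration only; the specific entities, predicates, and variable names are hypothetical and are not taken from the patent. It shows how a handful of "entity-predicate-entity" triples induce the graph structure described above.

```python
from collections import defaultdict

# Hypothetical "entity-predicate-entity" triples (illustrative only).
triples = [
    ("Jack", "works_for", "Microsoft"),
    ("Microsoft", "industry", "Software"),
    ("Google", "industry", "Software"),
]

# Connecting entities through relation predicates yields a graph:
# each entity maps to its outgoing (predicate, entity) edges.
graph = defaultdict(list)
for head, predicate, tail in triples:
    graph[head].append((predicate, tail))

print(graph["Jack"])  # [('works_for', 'Microsoft')]
```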
In recent years, deep language models such as BERT and ELMo have strongly influenced model design in natural language processing. These models have extremely large numbers of parameters and are pre-trained on enormous corpora; through unsupervised tasks such as language modeling, they automatically encode the grammatical and semantic information implicit in natural language. Taking BERT as an example, it uses a 12-layer Transformer with 768-dimensional hidden layers as its encoder and builds its training set from tens of millions of corpus entries crawled from multiple sources on the Internet. The word representations obtained in this way carry rich contextual information and have strong representational capability, whereas traditional word vectors carry neither contextual information nor such massive corpus information, so that with traditional vectors ideal performance can only be obtained by feeding them into an additional encoder and fine-tuning on the corresponding data set. The application of deep language models has greatly improved the accuracy of many sub-tasks in natural language processing and markedly improved models' ability to understand language.
Most deep language models in the prior art use only natural-language corpora as their training sets and therefore lack structured knowledge-base information.
Disclosure of Invention
The object of the invention is to provide, in view of the deficiencies of the prior art, a method for integrating additional knowledge information into a deep language model. The method pre-trains the deep language model with structured knowledge information fused in, so as to strengthen the relation-matching module of a knowledge-base question-answering system: entities are tagged in a large-scale natural-language corpus, negative samples are constructed by perturbing the natural-language text at the entity level, and the structured knowledge information is thereby introduced into the parameters of the deep language model. As a result, the model performs semantic understanding of the natural-language input text that incorporates factual structured information and achieves improved performance on the corresponding tasks; the method is simple, convenient, and efficient.
The specific technical solution for achieving the object of the invention is as follows: a method for integrating additional knowledge information into a deep language model, characterized in that knowledge information is fused into the deep language model during pre-training to strengthen the relation-matching module of a knowledge-base question-answering system, the training and knowledge integration of the model specifically comprising the following steps:
1) constructing a vocabulary mapping entity mentions to entities;
2) identifying entity information in the text using the constructed vocabulary;
3) perturbing the natural-language text to create the negative samples required for training;
4) pre-training the parameters of the deep language model with the newly created training set to strengthen the performance of the deep language model.
The entities covered by the vocabulary in step 1) are generally derived from a knowledge base or from text containing structured knowledge information, such as Wikipedia.
The text in step 2) comes from large-scale natural-language corpora crawled from the Internet, such as Wikipedia and the New York Times. The vocabulary constructed in step 1) is used to judge whether the text contains entities, and the recognized mentions are linked to the corresponding entities in the knowledge base to serve as training positive samples.
In step 3), the original corpus is perturbed according to the structured information obtained after entity linking.
The new training set in step 4) consists of the crawled original corpus and the negative samples created by the perturbation.
Compared with the prior art, the invention has the following advantages:
1) Practicability: by introducing structured knowledge information, the model can perform semantic understanding of the natural-language input text that incorporates factual structured information, improving performance in fields that require factual question answering.
2) Efficiency: in the training-corpus construction stage, no additional distant-supervision means or long sequence of complicated steps is needed to align the natural-language text with the structured triples of the knowledge base; complex engineering details are avoided and the labeling error rate is reduced at the same time.
3) Ease of use: neither the internal structure nor the external interface of the deep language model needs to be modified, and the trained model parameters can be imported directly into other scenarios. No modification of the model code is needed during training, so the method is convenient and quick to use.
Drawings
FIG. 1 is a model schematic of the present invention.
Detailed Description
The present invention is further illustrated by the following specific examples.
Example 1
Referring to FIG. 1, the invention uses a constructed vocabulary to tag entities in the corpus, perturbs the documents to construct positive and negative training samples, then judges the correctness of an input text with a classifier stacked on top of the deep language model and fine-tunes the model parameters. The resulting deep language model parameters are applied in a knowledge-base question-answering system to improve the performance of the corresponding tasks. The method of fusing knowledge information into the deep language model during pre-training specifically comprises the following steps:
Step one: building the entity-to-mention vocabulary
The vocabulary mapping entity mentions to entities is constructed from hyperlinked text rich in manual annotations, such as Wikipedia. For any entity e, all anchor-text mentions m_i under which it appears as a hyperlink are counted and taken as names of that entity, yielding a vocabulary used for entity identification and annotation. Besides counting the co-occurrences between entities and mentions, a series of related prior probabilities is also computed. First, the probability p_e(m) that a given mention m is used as an entity is counted; its value is the ratio between the number of times the mention appears as an entity link and the total number of times the mention appears in the text. A high probability indicates that the mention is very likely to denote an entity in the current text; a low probability indicates that the term is a common word and that the hyperlink labels may be noisy. In addition, the conditional probability p(e_j | m_i) of an entity e_j given the current mention m_i is counted. Because entities and mentions stand in a many-to-many relationship (the same entity may have multiple mentions, and the same mention may correspond to multiple entities), this conditional probability makes it possible to select the most appropriate entity according to the context when entity disambiguation is required. A minimal sketch of this counting procedure is given below.
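The sketch below illustrates how such anchor statistics could be collected; it is an illustration under assumptions. The input format (pairs of anchor text and linked entity, plus raw mention counts) and all function and variable names are hypothetical and not taken from the patent.

```python
from collections import defaultdict

def build_alias_table(anchor_pairs, plain_mention_counts):
    """Collect mention-to-entity statistics from hyperlink anchor texts.

    anchor_pairs: iterable of (mention, entity) pairs harvested from hyperlinks
        (e.g. Wikipedia anchors).
    plain_mention_counts: dict mapping mention -> total number of times the
        string occurs in the corpus, whether linked or not.
    """
    link_counts = defaultdict(lambda: defaultdict(int))  # mention -> entity -> count
    for mention, entity in anchor_pairs:
        link_counts[mention][entity] += 1

    table = {}
    for mention, entity_counts in link_counts.items():
        linked_total = sum(entity_counts.values())
        total = max(plain_mention_counts.get(mention, 0), linked_total)
        # p_e(m): how often this mention is used as an entity link at all
        p_entity = linked_total / total
        # p(e_j | m_i): conditional probability of each candidate entity given the mention
        candidates = {e: c / linked_total for e, c in entity_counts.items()}
        table[mention] = {"p_entity": p_entity, "candidates": candidates}
    return table
```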
Step two: identifying entity information in the text through the constructed vocabulary
The natural-language text is labeled with the constructed vocabulary, and all possible entities are found to serve as training positive samples. The labeled texts are taken from text automatically crawled from the Internet, including Wikipedia and newspaper websites such as the New York Times. Because the hyperlink information in web pages is largely missing and also noisy, a training set built directly from it would likely be of poor quality. Therefore, instead of using the hyperlinks in the web pages as entity labels, the texts are re-labeled with the constructed vocabulary.
During labeling, a dictionary tree, also called a prefix tree, is built from the constructed vocabulary, so that all occurrences of vocabulary entries in the texts can be matched quickly. After matching yields candidate entity mentions, the conditional probability of each candidate entity for the current mention is computed according to formula (a), in which E_m denotes the set of all potential candidate entities corresponding to the current mention m. If this conditional probability is lower than a preset threshold c, the currently matched candidate entity is considered noise and is filtered out. The result is natural text annotated with an entity sequence [e_1, e_2, e_3, ..., e_k] rich in structured information. A sketch of this matching and filtering procedure is given below.
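The following sketch is illustrative only and rests on assumptions: it reuses the hypothetical alias table from the previous sketch, assumes the candidate probabilities are the link-count ratios described in step one, and the threshold value and all names (build_trie, annotate_entities, threshold_c) are invented for illustration.

```python
def build_trie(alias_table):
    """Prefix tree (dictionary tree) over tokenized mentions for fast longest-match lookup."""
    trie = {}
    for mention in alias_table:
        node = trie
        for tok in mention.split():
            node = node.setdefault(tok, {})
        node["_end"] = mention  # marks that a complete mention ends at this node
    return trie

def annotate_entities(tokens, trie, alias_table, threshold_c=0.1):
    """Scan the token sequence, keep the longest mention match at each position,
    and keep the best candidate entity only if its conditional probability
    exceeds the threshold c; otherwise the match is treated as noise."""
    spans = []
    i = 0
    while i < len(tokens):
        node, match, j = trie, None, i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if "_end" in node:
                match = (i, j, node["_end"])
        if match:
            start, end, mention = match
            entity, prob = max(alias_table[mention]["candidates"].items(),
                               key=lambda kv: kv[1])
            if prob >= threshold_c:
                spans.append((start, end, entity))
            i = end
        else:
            i += 1
    return spans  # list of (start, end, entity) annotations
```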
Step three: creating negative examples for training
To perturb the natural-language text and create the negative samples required for training, the entity sequence [e_1, e_2, e_3, ..., e_k] found in the natural text is randomly shuffled, yielding a new entity sequence [e'_1, e'_2, e'_3, ..., e'_k] in order of appearance, which is then filled back into the original natural-language text to obtain a negative sample used in training. For example, the natural-language text "Jack to leave Microsoft to join Google as a senior scientist." becomes, after perturbation, "Scientist to leave Google to join Jack as a senior Microsoft." Obviously, the perturbed corpus is no longer a correct piece of natural language; the deep language model is pre-trained by exploiting exactly this property.
Besides random entity perturbation, the invention also introduces perturbation under type constraints: half of the entities in the sentence are replaced with entities of a different type and the other half with entities of the same type, to construct a negative sample. For example, under type-constrained perturbation the same example sentence "Jack to leave Microsoft to join Google as a senior scientist." becomes "Beijing to leave Alibaba to join China as a senior engineer." Perturbation with type constraints makes the model memorize not only the correspondence between entity-predicate-entity triples and natural language, but also the correspondence between entity types and natural-language text, further improving the performance of the deep language model on related tasks. A sketch of both perturbation strategies follows.
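The functions below are a minimal sketch of the two perturbation strategies, assuming entity spans in the format produced by the previous sketch and a hypothetical entity-type inventory (entity_type, entities_by_type); none of these names or structures come from the patent.

```python
import random

def shuffle_negative(tokens, spans):
    """Random entity perturbation: permute the entity mentions, keep all other text."""
    surfaces = [tokens[s:e] for s, e, _ in spans]
    random.shuffle(surfaces)
    out, prev = [], 0
    for (s, e, _), surface in zip(spans, surfaces):
        out += tokens[prev:s] + surface
        prev = e
    return out + tokens[prev:]

def typed_negative(tokens, spans, entity_type, entities_by_type):
    """Type-constrained perturbation: replace half of the entities with entities
    of a different type and the other half with entities of the same type."""
    flip = set(random.sample(range(len(spans)), len(spans) // 2))  # cross-type replacements
    out, prev = [], 0
    for idx, (s, e, ent) in enumerate(spans):
        etype = entity_type[ent]
        if idx in flip:
            other_types = [t for t in entities_by_type if t != etype]
            pool = entities_by_type[random.choice(other_types)] if other_types else entities_by_type[etype]
        else:
            pool = [x for x in entities_by_type[etype] if x != ent] or entities_by_type[etype]
        out += tokens[prev:s] + random.choice(pool).split()
        prev = e
    return out + tokens[prev:]
```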
Step four: pre-training deep language model parameters
The parameters of the deep language model are pre-trained with the newly created training set to strengthen the model's performance: unsupervised pre-training on the obtained training set blends the knowledge information into the model parameters. BERT is used as the deep language model; it employs a multi-layer Transformer as its encoder and encodes the input text layer by layer with a multi-head self-attention mechanism. With BERT as the encoder, the input text is converted into a sequence of vector representations according to formula (b):
[v_CLS, v_1, v_2, ..., v_k, v_SEP] = Enc([w_CLS, w_1, w_2, ..., w_k, w_SEP])    (b);
In the formula, w_k is a word in the input sequence and v_k is the vector corresponding to that word; w_CLS and w_SEP are the two special symbols CLS and SEP, and v_CLS and v_SEP are their corresponding vector representations, which are rich in contextual information and can be used directly as a fixed-length vector representation of the input text.
The first vector v_CLS in the vector sequence is input into a multi-layer classifier, and a score s_SET is calculated according to formula (c):
s_SET = MLP(v_CLS)    (c);
In the formula, s_SET is a one-dimensional scalar representing the classifier's score for the linguistic correctness of the current input text: a high score indicates that the input is close to natural-language text, while a low score indicates that the input has word-order problems. Training on positive and negative samples uses a hinge loss, and the parameters of the deep language model are updated according to formula (d):
L(w_pos, w_neg) = max(0, margin - s_SET + s'_SET)    (d);
In the formula, w_pos and w_neg are the input token sequences of the positive and negative samples, and s_SET and s'_SET are the classifier scores of the positive and negative samples, respectively.
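The PyTorch snippet below is only an illustrative sketch of formulas (b)-(d); it assumes the Hugging Face transformers BertModel API, and the class name, layer sizes, and variable names (KnowledgeScorer, pos_batch, neg_batch) are invented for illustration rather than taken from the patent.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class KnowledgeScorer(nn.Module):
    def __init__(self, margin=1.0, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)  # Enc(.) of formula (b)
        self.classifier = nn.Sequential(                      # MLP(.) of formula (c)
            nn.Linear(self.encoder.config.hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )
        self.margin = margin

    def score(self, input_ids, attention_mask):
        # v_CLS: hidden vector of the [CLS] token, used as the sentence representation
        v_cls = self.encoder(input_ids=input_ids,
                             attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.classifier(v_cls).squeeze(-1)

    def forward(self, pos_batch, neg_batch):
        s_pos = self.score(**pos_batch)  # score of the positive (natural) text
        s_neg = self.score(**neg_batch)  # score of the perturbed negative text
        # hinge loss of formula (d): max(0, margin - s_pos + s_neg)
        return torch.clamp(self.margin - s_pos + s_neg, min=0).mean()

# During the first training round the encoder parameters can be frozen,
# as described below, so that only the classifier is updated.
model = KnowledgeScorer()
for p in model.encoder.parameters():
    p.requires_grad = False
```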
The parameters of the deep language model are frozen during the first round of training; this prevents the upper-level classifier, while its own parameters are still unstable, from using erroneous prediction signals to adjust the deep language model and thereby degrading its accuracy. The negative samples are generated by perturbing the positive samples at the entity level, and deep-language-model parameters obtained by ordinary pre-training cannot effectively distinguish the positive samples from the negative ones, because from the perspective of the neural network the grammar of the two texts is nearly the same. After the perturbation the entity order is exchanged while the predicates between the entities remain unchanged, so the training forces the deep language model to memorize the correspondence between the structured information in the entity-predicate-entity triples and the natural-language input text, and to distinguish naturally produced corpora from perturbed, incorrect corpora. After pre-training is finished, the adjusted deep language model is applied to a knowledge-base question-answering system. In such a system, matching the semantic similarity between a text and a relation path in the knowledge base is an important step, and an encoder that can recognize the knowledge information in the input text effectively improves the performance of the network. The invention uses the deep language model as the encoder and fine-tunes on the specific training set starting from the adjusted network parameters instead of the initial ones. The adjusted network parameters perform better, improving accuracy on the WebQuestions test set from 64.9% to 65.3%.
The above examples are only for further illustration of the present invention and are not intended to limit the present invention, and all equivalent implementations of the present invention should be included in the scope of the claims of the present invention.
Claims (4)
1. A method for integrating additional knowledge information into a deep language model, characterized in that knowledge information is fused into the deep language model during pre-training to strengthen the relation-matching module of a knowledge-base question-answering system, the training and knowledge integration of the model comprising the following specific steps:
step one: building the entity-to-mention vocabulary
constructing a vocabulary mapping entity mentions to entities from a knowledge base or from hyperlinked text rich in manual annotations;
step two: identifying entity information in the text
labeling the natural-language text with the constructed vocabulary, identifying the entity information in the text, and linking it to the corresponding entities in the knowledge base to serve as training positive samples;
step three: creating the negative samples required for training
perturbing the natural-language text according to the structured information obtained after entity linking, so as to create the negative samples required for training;
step four: pre-training the deep language model parameters
pre-training the deep language model parameters with the created negative samples as part of a new training set, that is, performing unsupervised pre-training of the deep language model on the obtained training set so as to integrate the knowledge information into the model parameters and strengthen its performance.
2. The method of claim 1, wherein the entities in the vocabulary are derived from a knowledge base or from text containing structured knowledge information.
3. The method of claim 1, wherein the natural-language text is derived from text automatically crawled from the Internet.
4. The method of integrating additional knowledge information into a deep language model as claimed in claim 1, wherein the new training set is composed of the crawled original corpus and the negative samples created after perturbation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010410674.4A CN111666374A (en) | 2020-05-15 | 2020-05-15 | Method for integrating additional knowledge information into deep language model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010410674.4A CN111666374A (en) | 2020-05-15 | 2020-05-15 | Method for integrating additional knowledge information into deep language model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111666374A true CN111666374A (en) | 2020-09-15 |
Family
ID=72383648
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010410674.4A Pending CN111666374A (en) | 2020-05-15 | 2020-05-15 | Method for integrating additional knowledge information into deep language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111666374A (en) |
-
2020
- 2020-05-15 CN CN202010410674.4A patent/CN111666374A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108875051A (en) * | 2018-06-28 | 2018-11-23 | 中译语通科技股份有限公司 | Knowledge mapping method for auto constructing and system towards magnanimity non-structured text |
CN110083690A (en) * | 2019-04-10 | 2019-08-02 | 华侨大学 | A kind of external Chinese characters spoken language training method and system based on intelligent answer |
CN110990590A (en) * | 2019-12-20 | 2020-04-10 | 北京大学 | Dynamic financial knowledge map construction method based on reinforcement learning and transfer learning |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210406467A1 (en) * | 2020-06-24 | 2021-12-30 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for generating triple sample, electronic device and computer storage medium |
CN112328891A (en) * | 2020-11-24 | 2021-02-05 | 北京百度网讯科技有限公司 | Method for training search model, method for searching target object and device thereof |
CN112328891B (en) * | 2020-11-24 | 2023-08-01 | 北京百度网讯科技有限公司 | Method for training search model, method for searching target object and device thereof |
CN115495593A (en) * | 2022-10-13 | 2022-12-20 | 中原工学院 | Mathematical knowledge graph construction method based on big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115238101B (en) | Multi-engine intelligent question-answering system oriented to multi-type knowledge base | |
Neculoiu et al. | Learning text similarity with siamese recurrent networks | |
CN111931506B (en) | Entity relationship extraction method based on graph information enhancement | |
CN117171333B (en) | Electric power file question-answering type intelligent retrieval method and system | |
CN110175585B (en) | Automatic correcting system and method for simple answer questions | |
CN102663129A (en) | Medical field deep question and answer method and medical retrieval system | |
CN110765277B (en) | Knowledge-graph-based mobile terminal online equipment fault diagnosis method | |
CN114238653B (en) | Method for constructing programming education knowledge graph, completing and intelligently asking and answering | |
CN114416942A (en) | Automatic question-answering method based on deep learning | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
CN114036281B (en) | Knowledge graph-based citrus control question-answering module construction method and question-answering system | |
CN115392259B (en) | Microblog text sentiment analysis method and system based on confrontation training fusion BERT | |
CN112883175B (en) | Meteorological service interaction method and system combining pre-training model and template generation | |
CN115599902B (en) | Oil-gas encyclopedia question-answering method and system based on knowledge graph | |
CN112818106A (en) | Evaluation method of generating type question and answer | |
CN118093834B (en) | AIGC large model-based language processing question-answering system and method | |
CN111666374A (en) | Method for integrating additional knowledge information into deep language model | |
CN112632250A (en) | Question and answer method and system under multi-document scene | |
CN115599899A (en) | Intelligent question-answering method, system, equipment and medium based on aircraft knowledge graph | |
CN116561251A (en) | Natural language processing method | |
CN113971394A (en) | Text repeat rewriting system | |
CN117251455A (en) | Intelligent report generation method and system based on large model | |
CN113535897A (en) | Fine-grained emotion analysis method based on syntactic relation and opinion word distribution | |
Nugraha et al. | Typographic-based data augmentation to improve a question retrieval in short dialogue system | |
Lee | Natural Language Processing: A Textbook with Python Implementation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200915 |