
CN111666374A - Method for integrating additional knowledge information into deep language model - Google Patents

Method for integrating additional knowledge information into deep language model

Info

Publication number
CN111666374A
CN111666374A
Authority
CN
China
Prior art keywords
training
entity
information
text
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010410674.4A
Other languages
Chinese (zh)
Inventor
杨燕
郑淇
陈成才
贺樑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
East China Normal University
Shanghai Xiaoi Robot Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University, Shanghai Xiaoi Robot Technology Co Ltd
Priority to CN202010410674.4A
Publication of CN111666374A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for integrating additional knowledge information into a deep language model. Knowledge information is fused into the deep language model during pre-training: large-scale natural language corpora are annotated with entities, and entity-level perturbations of the natural language text are used to construct negative samples, thereby strengthening and improving the relation matching module in a knowledge-base question-answering system. The training and knowledge fusion of the model specifically comprise the following steps: constructing a vocabulary, identifying entity information, creating the negative samples required for training, and pre-training the deep language model. Compared with the prior art, the method introduces structured knowledge information into the parameters of the deep language model, so that the model can perform semantic understanding of factual structured information for natural language input text and improve performance on the corresponding tasks; the method is simple, convenient, and efficient.

Description

Method for integrating additional knowledge information into deep language model
Technical Field
The invention relates to the technical field of computer question-answering systems, in particular to a method for integrating additional knowledge information into a deep language model based on a knowledge graph.
Background
Natural language processing studies techniques for automatically processing, understanding, and generating natural language with a computer. The question-answering system is an important sub-field of natural language processing and aims to automatically answer questions input by a user. Question-answering systems come in several types, including systems based on reading comprehension, on community retrieval, and on knowledge bases. A reading-comprehension-based question-answering system answers questions by searching for potential answers in a given article snippet. A community-retrieval-based question-answering system queries possibly related answers or text segments in a community through a retrieval system and re-ranks all candidate answers according to the context and the given information. A knowledge-base-based question-answering system determines the query range of a search sub-graph in a given knowledge base, searches the sub-graph for the most relevant entities, and returns those candidate entities to the user as factual answers to the questions.
A knowledge base is a database that stores factual information in a structured form. Typically, this factual information is maintained as "entity-predicate-entity" triples. Different entities are connected through various relation predicates, forming a network graph structure known as a knowledge graph. Knowledge bases are widely used in many fields; in natural language processing, extra factual information is often introduced at the encoder stage via a knowledge base, improving the performance of neural networks on natural language processing tasks.
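For illustration only (not part of the patent), the following minimal Python sketch shows how such "entity-predicate-entity" triples can be stored and viewed as a graph; all entity and predicate names are invented for the example.

    # Minimal illustration: storing "entity-predicate-entity" triples and viewing
    # them as an adjacency structure, i.e. the network graph called a knowledge
    # graph above. All names here are invented.
    from collections import defaultdict

    triples = [
        ("Jack", "works_for", "Microsoft"),
        ("Microsoft", "headquartered_in", "Redmond"),
        ("Google", "founded_by", "Larry Page"),
    ]

    graph = defaultdict(list)
    for head, predicate, tail in triples:
        graph[head].append((predicate, tail))

    print(graph["Jack"])  # [('works_for', 'Microsoft')]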
In recent years, deep language models such as BERT and ELMo have greatly influenced model design in natural language processing. These models have extremely large numbers of parameters and are pre-trained on huge corpora, automatically encoding the latent grammatical and semantic information of natural language through unsupervised tasks such as language modeling. Taking BERT as an example, BERT uses a 12-layer Transformer with 768-dimensional hidden layers as its encoder, and its training set is constructed from tens of millions of pieces of corpus crawled from multiple sources on the Internet. The word representations obtained in this way contain rich contextual information and have strong representational capability; traditional word vectors, which lack contextual information and such massive corpora, achieve ideal performance only when an additional encoder is added and fine-tuned on the corresponding data set. The application of deep language models has greatly improved accuracy on many natural language processing subtasks and markedly improved the models' ability to understand language.
Most of the deep language models in the prior art use natural language corpora as training sets thereof, and have the problem of lacking structured knowledge base information.
Disclosure of Invention
The invention aims to provide, in view of the defects of the prior art, a method for integrating additional knowledge information into a deep language model. Structured knowledge information is fused into the deep language model through pre-training to strengthen the relation matching module of a knowledge-base question-answering system: large-scale natural language corpora are annotated with entities, entity-level perturbations of the natural language text are used to construct negative samples, and structured knowledge information is thereby introduced into the parameters of the deep language model, so that the model performs semantic understanding of factual structured information for natural language input text and improves performance on the corresponding tasks. The method is simple, convenient, and efficient.
The specific technical scheme for realizing the aim of the invention is as follows: a method for integrating additional knowledge information into a deep language model, characterized in that knowledge information is fused into the deep language model during pre-training to strengthen the relation matching module in a knowledge-base question-answering system, the training and knowledge fusion of the model comprising the following steps:
1) building an entity-mention vocabulary;
2) identifying entity information in the text through the constructed vocabulary;
3) perturbing the natural language text to create the negative samples required for training;
4) pre-training the deep language model parameters with the newly created training set to strengthen the performance of the deep language model.
The entities covered by the vocabulary in step 1) are generally derived from a knowledge base or from text containing structured knowledge information, such as Wikipedia.
The text in step 2) comes from large-scale natural language corpora crawled from the Internet, such as Wikipedia and the New York Times. The vocabulary constructed in step 1) is used to judge whether the text contains entities, which are linked to the corresponding entities in the knowledge base to form training positive samples.
In step 3), the original corpus is perturbed according to the structured information obtained after entity linking.
The new training set in step 4) consists of the crawled original corpus and the negative samples created by perturbation.
Compared with the prior art, the invention has the following advantages:
1) Practicality: by introducing structured knowledge information, the model can perform semantic understanding of factual structured information for natural language input text, improving performance in fields that require factual question answering.
2) Efficiency: during training-corpus construction, no additional distant-supervision means or large numbers of complicated steps are needed to align natural language text with the structured triple information of the knowledge base; complex engineering details are avoided, and the labeling error rate is reduced.
3) Ease of use: neither the internal structure nor the external interfaces of the deep language model need to be modified, and the trained model parameters can be imported directly for use in other scenarios. No modification of the model code is needed during training, making the method convenient and quick to use.
Drawings
FIG. 1 is a model schematic of the present invention.
Detailed Description
The present invention is further illustrated by the following specific examples.
Example 1
Referring to FIG. 1, the invention uses a constructed vocabulary to annotate entities in corpora, perturbs documents to construct positive and negative training samples, then judges the correctness of the input text with a classifier stacked on the deep language model and fine-tunes the model parameters; the resulting deep language model parameters are applied in a knowledge-base question-answering system to improve performance on the corresponding task. The method of fusing knowledge information into the deep language model for pre-training specifically comprises the following steps:
Step one: building the entity-mention vocabulary
A vocabulary mapping entity mentions to entities is constructed from hyperlink-rich, manually annotated text such as Wikipedia. For any entity e, all anchor-text mentions m_i under which it appears as a hyperlink are counted; these mentions are taken as names of the entity, and a vocabulary for entity identification and annotation is constructed from them. In addition to counting the mapping between entities and mentions, a series of related prior probabilities is computed. First, the probability p_e(m) that a given mention refers to an entity at all is computed; its value is the ratio between the number of times the mention appears as an entity and the total number of times the mention appears in the text. A high value indicates that the mention is very likely an entity in the current text; a low value indicates that the mention is a common word and the hyperlink labels may be noisy. In addition, the conditional probability p_m(e_j | m_i) of an entity e_j given the current mention m_i is computed. Because entities and mentions satisfy a many-to-many relationship (the same entity may have multiple mentions, and the same mention may correspond to multiple entities), this conditional probability allows the most appropriate entity to be selected according to context when entity disambiguation is required.
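A minimal Python sketch of this step follows (an assumed implementation, not the patent's code): it counts anchor-text hyperlinks to build the mention vocabulary and the two priors p_e(m) and p_m(e_j | m_i). The function and variable names are invented for the example.

    # Sketch: build the entity-mention vocabulary and prior probabilities from
    # hyperlink-annotated text. anchor_pairs is an iterable of
    # (mention_text, linked_entity) pairs taken from hyperlinks;
    # mention_token_counts gives the total occurrences of each mention string
    # anywhere in the raw text (hyperlinked or not).
    from collections import Counter, defaultdict

    def build_vocabulary(anchor_pairs, mention_token_counts):
        mention_entity_counts = defaultdict(Counter)  # mention -> Counter over entities
        mention_as_link = Counter()                   # times a mention appears as a hyperlink
        for mention, entity in anchor_pairs:
            mention_entity_counts[mention][entity] += 1
            mention_as_link[mention] += 1

        # p_e(m): probability that the mention string is used as an entity at all
        p_e = {m: mention_as_link[m] / mention_token_counts[m]
               for m in mention_as_link if mention_token_counts.get(m, 0) > 0}

        # p_m(e_j | m_i): probability of a particular entity given the mention
        p_m = {m: {e: c / sum(cnt.values()) for e, c in cnt.items()}
               for m, cnt in mention_entity_counts.items()}
        return p_e, p_m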
Step two: identifying entity information in the text through the constructed vocabulary
The natural language text is annotated with the constructed vocabulary, and all possible entities are found as training positive samples. The annotated texts are automatically crawled from the Internet, including Wikipedia and newspaper websites such as the New York Times. Since hyperlink information in web pages is largely missing and also noisy, a training set built directly from it would likely be of poor quality. Therefore, rather than using the hyperlinks in web pages as entity labels, the texts are re-annotated with the constructed vocabulary.
During annotation, a dictionary tree (also called a prefix tree, or trie) is built from the constructed vocabulary, so that all occurrences of vocabulary entries in the texts can be matched quickly. After matching yields candidate entity mentions, the conditional probability of each candidate entity for the current mention is calculated according to the following formula (a):
p_m(e_j | m) = count(m, e_j) / Σ_{e_k ∈ E_m} count(m, e_k)   (a)
in the formula: emRepresenting all potential candidate entity sets to which the current reference m corresponds.
If the conditional probability is lower than a preset threshold c, the currently matched candidate entity is considered noise and filtered out. This yields natural text annotated with an entity sequence [e_1, e_2, e_3, ..., e_k] that is rich in structured information.
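The sketch below (again an assumed implementation with invented names) illustrates the prefix-tree matching and the threshold filtering described above; the default value passed in for c is arbitrary.

    # Sketch: annotate tokenized text using a prefix tree (trie) built from the
    # vocabulary, keeping for each matched mention its most probable entity if
    # the conditional probability clears the threshold c.
    class TrieNode:
        def __init__(self):
            self.children = {}
            self.mention = None  # set when a full vocabulary entry ends here

    def build_trie(mentions):
        root = TrieNode()
        for m in mentions:
            node = root
            for token in m.split():
                node = node.children.setdefault(token, TrieNode())
            node.mention = m
        return root

    def annotate(tokens, trie, p_m, c=0.1):
        entities, i = [], 0
        while i < len(tokens):
            node, j, last = trie, i, None
            # longest match starting at position i
            while j < len(tokens) and tokens[j] in node.children:
                node = node.children[tokens[j]]
                j += 1
                if node.mention is not None:
                    last = (node.mention, j)
            if last is not None:
                mention, end = last
                entity, prob = max(p_m[mention].items(), key=lambda kv: kv[1])
                if prob >= c:                      # filter noisy candidates
                    entities.append((mention, entity, i, end))
                i = end
            else:
                i += 1
        return entities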
Step three: creating negative examples for training
To perturb the natural language text and create the negative samples required for training, the entity sequence [e_1, e_2, e_3, ..., e_k] in the natural text is randomly shuffled, yielding a new entity sequence [e'_1, e'_2, e'_3, ..., e'_k] in order of appearance, which is filled back into the original natural language text to obtain a negative sample for training. For example, the natural language text "Jack to leave Microsoft to join Google as a senior scientist." becomes, after perturbation, "Scientist to leave Google to join Jack as a senior Microsoft." The perturbed natural language corpus is obviously no longer a grammatically correct text, and the deep language model is pre-trained by exploiting this property.
Besides random entity perturbation, the invention also introduces perturbation with type constraints: half of the entities in the sentence are replaced with entities of a different type, and the other half with entities of the same type, to construct a negative sample. For example, the same example sentence "Jack to leave Microsoft to join Google as a senior scientist." becomes, after type-constrained perturbation, "Beijing to leave Alibaba to join China as a senior engineer." Perturbation with type constraints not only lets the model memorize the correspondence between entity-predicate-entity triples and natural language, but also the correspondence between entity types and natural language text, further improving the performance of the deep language model on related tasks.
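A Python sketch of the two perturbation strategies follows (an assumed implementation; the entity catalogs, span format, and helper names are invented for the example):

    # Sketch: build negative samples by (1) randomly shuffling the entity mentions
    # and (2) type-constrained replacement, where half of the entities are swapped
    # for entities of a different type and the other half for same-type entities.
    import random

    def shuffle_entities(tokens, spans):
        """spans: list of (start, end, entity) positions in tokens."""
        surfaces = [tokens[s:e] for s, e, _ in spans]
        random.shuffle(surfaces)
        out, prev = [], 0
        for (s, e, _), surface in zip(spans, surfaces):
            out.extend(tokens[prev:s])
            out.extend(surface)
            prev = e
        out.extend(tokens[prev:])
        return out

    def type_constrained_replace(tokens, spans, entities_by_type, entity_type):
        out, prev = [], 0
        for idx, (s, e, ent) in enumerate(spans):
            same_type = idx >= len(spans) // 2        # second half keeps the type
            t = entity_type[ent]
            if same_type:
                pool = entities_by_type[t]
            else:
                other_types = [k for k in entities_by_type if k != t] or [t]
                pool = entities_by_type[random.choice(other_types)]
            out.extend(tokens[prev:s])
            out.extend(random.choice(pool).split())
            prev = e
        out.extend(tokens[prev:])
        return out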
Step four: pre-training deep language model parameters
The parameters of the deep language model are pre-trained with the newly created training set to strengthen its performance; unsupervised pre-training on this training set fuses the knowledge information into the model parameters. BERT is used as the deep language model: it uses a multi-layer Transformer as an encoder, encoding the input text layer by layer with a multi-head self-attention mechanism. With BERT as the encoder, the input text is converted into a sequence of vectors according to the following formula (b):
[v_CLS, v_1, v_2, ..., v_k, v_SEP] = Enc([w_CLS, w_1, w_2, ..., w_k, w_SEP])   (b);
in the formula: w_k is a word in the input sequence; v_k is the vector corresponding to that word; w_CLS, w_SEP, v_CLS and v_SEP are the characters of the two special symbols CLS and SEP and their corresponding vector representations, which are rich in contextual information and can be used directly as fixed-length vector representations of the input text.
The first vector v_CLS in the vector sequence is input into a multi-layer classifier, and a score s_SET is calculated according to the following formula (c):
s_SET = MLP(v_CLS)   (c);
In the formula: s_SET is a one-dimensional scalar that represents the classifier's score for the linguistic correctness of the current input text; a high score indicates that the input is close to natural language text, and a low score indicates a word-order problem in the input. Positive and negative samples are trained with a hinge loss, and the parameters of the deep language model are updated according to the following formula (d):
L(w_pos, w_neg) = max(0, margin - s_SET + s'_SET)   (d);
in the formula: w_pos and w_neg are the input character sequences of the positive and negative samples; s_SET and s'_SET are the classifier scores of the positive and negative samples, respectively.
The parameters of the deep language model are frozen in the first round of training, which prevents the upper-layer classifier, whose parameters are not yet stable, from using wrong prediction information to adjust the deep language model and degrading its accuracy. The negative samples are generated from the positive samples by entity-level perturbation; deep language model parameters obtained by ordinary pre-training cannot effectively distinguish positive from negative samples, because from the perspective of the neural network the grammar of the two texts is close. After perturbation training, although the entity order is exchanged, the predicates between the entities remain unchanged, so the deep language model memorizes the structured information of the entity-predicate-entity triples together with the corresponding natural language input text and learns to distinguish naturally generated corpus from perturbed, incorrect corpus. After pre-training, the adjusted deep language model is applied to a knowledge-base question-answering system. In such a system, matching the semantic similarity between a text and a relation path in the knowledge base is an important step, and an encoder that can identify the knowledge information in the input text effectively improves the performance of the network. The present invention uses the deep language model as the encoder and fine-tunes on the specific training set starting from the adjusted network parameters instead of the initial ones. The adjusted network parameters perform better: performance on the WebQuestions test set improves from 64.9% to 65.3%.
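A minimal PyTorch sketch of this pre-training objective follows (an assumed setup, not the patent's code: the checkpoint name, MLP sizes, and margin value are placeholder choices):

    # Sketch: score the [CLS] vector with an MLP and train with the pairwise
    # margin loss of formula (d), freezing the encoder in the first round.
    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # placeholder checkpoint
    encoder = BertModel.from_pretrained("bert-base-chinese")
    scorer = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))  # MLP over v_CLS
    margin = 1.0  # placeholder margin value

    def cls_score(texts):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        v_cls = encoder(**batch).last_hidden_state[:, 0]   # vector at the [CLS] position
        return scorer(v_cls).squeeze(-1)                   # s_SET: one scalar per text

    def margin_loss(pos_texts, neg_texts):
        s_pos, s_neg = cls_score(pos_texts), cls_score(neg_texts)
        return torch.clamp(margin - s_pos + s_neg, min=0).mean()  # formula (d)

    # First training round: freeze the encoder so the still-unstable classifier
    # cannot push wrong gradients into the language model parameters.
    for p in encoder.parameters():
        p.requires_grad = False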
The above examples are only for further illustration of the present invention and are not intended to limit the present invention, and all equivalent implementations of the present invention should be included in the scope of the claims of the present invention.

Claims (4)

1. A method for integrating additional knowledge information into a deep language model, characterized in that knowledge information is fused into the deep language model during pre-training to strengthen and improve the performance of a relation matching module in a knowledge-base question-answering system, the training and knowledge fusion of the model comprising the following specific steps:
step one: building the entity-mention vocabulary
constructing a vocabulary mapping entity mentions to entities from a knowledge base or from hyperlink text rich in manual annotations;
step two: identifying entity information in the text
annotating the natural language text with the constructed vocabulary, identifying entity information in the text, and linking it to the corresponding entities in the knowledge base to obtain training positive samples;
step three: creating the negative samples required for training
perturbing the natural language text according to the structured information obtained after entity linking, and creating the negative samples required for training;
step four: pre-training the deep language model parameters
pre-training the deep language model parameters using the created negative samples as part of a new training set, and performing unsupervised pre-training on the deep language model with the obtained training set, so that knowledge information is fused into the model parameters and their performance is strengthened.
2. The method of claim 1, wherein the entity involved in the vocabulary is derived from a knowledge base or from text containing structured knowledge information.
3. The method of claim 1, wherein the natural language text is derived from text automatically crawled from the Internet.
4. The method of integrating additional knowledge information into a deep language model as claimed in claim 1, wherein the new training set consists of the crawled original corpus and the negative samples created by perturbation.
CN202010410674.4A 2020-05-15 2020-05-15 Method for integrating additional knowledge information into deep language model Pending CN111666374A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010410674.4A CN111666374A (en) 2020-05-15 2020-05-15 Method for integrating additional knowledge information into deep language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010410674.4A CN111666374A (en) 2020-05-15 2020-05-15 Method for integrating additional knowledge information into deep language model

Publications (1)

Publication Number Publication Date
CN111666374A true CN111666374A (en) 2020-09-15

Family

ID=72383648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010410674.4A Pending CN111666374A (en) 2020-05-15 2020-05-15 Method for integrating additional knowledge information into deep language model

Country Status (1)

Country Link
CN (1) CN111666374A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328891A (en) * 2020-11-24 2021-02-05 北京百度网讯科技有限公司 Method for training search model, method for searching target object and device thereof
US20210406467A1 (en) * 2020-06-24 2021-12-30 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for generating triple sample, electronic device and computer storage medium
CN115495593A (en) * 2022-10-13 2022-12-20 中原工学院 Mathematical knowledge graph construction method based on big data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875051A (en) * 2018-06-28 2018-11-23 中译语通科技股份有限公司 Knowledge mapping method for auto constructing and system towards magnanimity non-structured text
CN110083690A (en) * 2019-04-10 2019-08-02 华侨大学 A kind of external Chinese characters spoken language training method and system based on intelligent answer
CN110990590A (en) * 2019-12-20 2020-04-10 北京大学 Dynamic financial knowledge map construction method based on reinforcement learning and transfer learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875051A (en) * 2018-06-28 2018-11-23 中译语通科技股份有限公司 Knowledge mapping method for auto constructing and system towards magnanimity non-structured text
CN110083690A (en) * 2019-04-10 2019-08-02 华侨大学 A kind of external Chinese characters spoken language training method and system based on intelligent answer
CN110990590A (en) * 2019-12-20 2020-04-10 北京大学 Dynamic financial knowledge map construction method based on reinforcement learning and transfer learning

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210406467A1 (en) * 2020-06-24 2021-12-30 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for generating triple sample, electronic device and computer storage medium
CN112328891A (en) * 2020-11-24 2021-02-05 北京百度网讯科技有限公司 Method for training search model, method for searching target object and device thereof
CN112328891B (en) * 2020-11-24 2023-08-01 北京百度网讯科技有限公司 Method for training search model, method for searching target object and device thereof
CN115495593A (en) * 2022-10-13 2022-12-20 中原工学院 Mathematical knowledge graph construction method based on big data

Similar Documents

Publication Publication Date Title
CN115238101B (en) Multi-engine intelligent question-answering system oriented to multi-type knowledge base
Neculoiu et al. Learning text similarity with siamese recurrent networks
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN117171333B (en) Electric power file question-answering type intelligent retrieval method and system
CN110175585B (en) Automatic correcting system and method for simple answer questions
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
CN114238653B (en) Method for constructing programming education knowledge graph, completing and intelligently asking and answering
CN114416942A (en) Automatic question-answering method based on deep learning
CN112328800A (en) System and method for automatically generating programming specification question answers
CN114036281B (en) Knowledge graph-based citrus control question-answering module construction method and question-answering system
CN115392259B (en) Microblog text sentiment analysis method and system based on confrontation training fusion BERT
CN112883175B (en) Meteorological service interaction method and system combining pre-training model and template generation
CN115599902B (en) Oil-gas encyclopedia question-answering method and system based on knowledge graph
CN112818106A (en) Evaluation method of generating type question and answer
CN118093834B (en) AIGC large model-based language processing question-answering system and method
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN112632250A (en) Question and answer method and system under multi-document scene
CN115599899A (en) Intelligent question-answering method, system, equipment and medium based on aircraft knowledge graph
CN116561251A (en) Natural language processing method
CN113971394A (en) Text repeat rewriting system
CN117251455A (en) Intelligent report generation method and system based on large model
CN113535897A (en) Fine-grained emotion analysis method based on syntactic relation and opinion word distribution
Nugraha et al. Typographic-based data augmentation to improve a question retrieval in short dialogue system
Lee Natural Language Processing: A Textbook with Python Implementation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200915