CN116541535A - Automatic knowledge graph construction method, system, equipment and medium - Google Patents
- Publication number
- CN116541535A (application CN202310571923.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- sample data
- entity
- text
- knowledge
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/367 — Information retrieval of unstructured textual data; creation of semantic tools: ontology
- G06F16/35 — Information retrieval of unstructured textual data: clustering; classification
- G06F40/237 — Natural language analysis: lexical tools
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295 — Named entity recognition
- G06F40/30 — Semantic analysis
- G06N3/0442 — Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/045 — Combinations of networks
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a method, system, device, and medium for automatically constructing a knowledge graph, in the technical field of knowledge graph information extraction. The method comprises the following steps: performing knowledge modeling on a target vertical domain to obtain a knowledge model; dividing an initial data set into large sample data and small sample data; labeling the small sample data, and determining an entity dictionary from the labeled small sample data; performing text enhancement on the large sample data according to the entity dictionary; constructing a text data set from the labeled small sample data and the text-enhanced large sample data; training a named entity recognition model on the text data set; performing entity extraction on target text data with the trained named entity recognition model, and mapping the extraction results to triples according to the knowledge model to obtain triple instances; and constructing the knowledge graph from the triple instances. The invention enables automatic construction of vertical-domain knowledge graphs.
Description
Technical Field
The invention relates to the technical field of knowledge graph information extraction, and in particular to a method, system, device, and medium for automatically constructing a knowledge graph.
Background
Existing vertical-domain knowledge graph construction methods fall into two categories, manual construction and automatic construction. An automatic construction scheme mainly comprises two steps, knowledge modeling and information extraction, of which information extraction is the key to automatic construction of the knowledge graph. Information extraction includes named entity recognition (i.e., entity extraction) and entity relationship extraction (i.e., relation extraction).
Named entity recognition techniques mainly comprise rule-and-dictionary-based methods, statistical methods, deep learning methods, and the like. Rule-and-dictionary-based methods are the earliest approach to named entity recognition; they rely on manually crafted rules and a complete entity library, are prone to rule conflicts, and making the rules and dictionaries is time-consuming and labor-intensive. Statistical methods include hidden Markov models, maximum entropy models, support vector machines, and conditional random fields; they depend heavily on a corpus, so building a corpus for training and evaluating a vertical-domain named entity recognition model is the main factor restricting these methods. With the continuous improvement of computer performance, machine learning methods have been widely applied to natural language processing, including recurrent neural networks (RNN), long short-term memory networks (LSTM), and gated recurrent units (GRU). In addition, pre-trained models, represented by BERT, have become the state of the art of deep learning in recent years: a model is pre-trained on a large amount of text data and then fine-tuned on high-quality data to improve training efficiency and accuracy.
Unlike general-domain knowledge graphs, constructing a vertical-domain knowledge graph places higher requirements on data sources and data quality. However, vertical domains lag the general domain in data volume and available data sets, and a new entity extraction model must often be trained for each domain problem, discipline, and task requirement, so it is difficult to guarantee a sufficient data set for training the model; the training data typically depend on manual labeling, which is time-consuming and labor-intensive.
Disclosure of Invention
The invention aims to provide a method, system, device, and medium for automatically constructing a knowledge graph, so as to realize automatic construction of vertical-domain knowledge graphs.
In order to achieve the above object, the present invention provides the following solutions:
an automatic knowledge graph construction method comprises the following steps:
carrying out knowledge modeling on the target vertical field to obtain a knowledge model of the target vertical field; the knowledge model comprises entity types and entity relations; the entity relationship characterizes the relationship between entities;
acquiring an initial data set, and dividing the initial data set into large sample data and small sample data; the large sample data are the text data in the initial data set whose character count is larger than a first set value; the small sample data are the text data whose character count is larger than a third set value and smaller than or equal to a second set value; the first set value is larger than the second set value, and the second set value is larger than the third set value;
carrying out data annotation on the small sample data to obtain labeled small sample data, and determining an entity dictionary according to the labeled small sample data;
performing text enhancement on the large sample data according to the entity dictionary to obtain text enhanced large sample data;
constructing a text data set according to the labeled small sample data and the text-enhanced large sample data;
training a named entity recognition model according to the text data set; the named entity recognition model is a BERT-BiLSTM-CRF model;
acquiring target text data, performing entity extraction on the target text data by using the trained named entity recognition model, and performing triplet mapping on the extraction result according to the knowledge model to obtain triplet instances;
and constructing a knowledge graph of the target vertical field according to the triplet examples.
Optionally, the data labeling of the small sample data to obtain labeled small sample data, and the determining of an entity dictionary according to the labeled small sample data, specifically comprise:
performing data annotation on the small sample data by using the BIO annotation method to obtain the labeled small sample data, wherein B denotes the first character of an entity, I denotes the middle and end characters of an entity, and O denotes a non-entity character;
extracting entities from the labeled small sample data to obtain an entity dictionary; the entity dictionary includes the entity names and corresponding entity types in the small sample data.
Optionally, text enhancement is performed on the large sample data according to the entity dictionary to obtain text enhanced large sample data, which specifically includes:
performing word segmentation and stop word removal processing on the large sample data to obtain preprocessed large sample data;
carrying out semantic similarity calculation on the preprocessed large sample data and each entity name in the entity dictionary to obtain a similarity value corresponding to the preprocessed large sample data;
and carrying out data annotation on the preprocessed large sample data with the similarity value larger than the set similarity threshold value to obtain text enhanced large sample data.
Optionally, training a named entity recognition model according to the text data set, specifically including:
dividing the text data set into a plurality of sample subsets with equal sizes;
and training and verifying the named entity recognition model according to the sample subset by adopting a K-fold cross verification method to obtain a trained named entity recognition model.
Optionally, performing entity extraction on the target text data by using a trained named entity recognition model, and performing triplet mapping on an extraction result according to the knowledge model to obtain a triplet instance, which specifically includes:
performing entity extraction on the target text data sentence by using a trained named entity recognition model to obtain an extraction result; the extraction result comprises an entity name and a corresponding entity type in the target text data;
performing triplet mapping on the entity names in the extraction result according to the corresponding entity types and the entity relations in the knowledge model to obtain a triplet instance; the triplet instance includes a start entity name, an end entity name, and a relationship between the start entity and the end entity.
Optionally, the method further comprises:
storing the trained named entity recognition model as a pre-training model;
taking the target text data and the corresponding extraction result as a first expansion data set;
and training the pre-training model according to the first extended data set when the data volume of the first extended data set is larger than the set number so as to update the trained named entity recognition model.
Optionally, the method further comprises:
when a new manual annotation data set exists, the new manual annotation data set is used as a second expansion data set;
and training the pre-training model according to the second extended data set when the data volume of the second extended data set is larger than the set number so as to update the trained named entity recognition model.
An automatic knowledge graph construction system, comprising:
the knowledge modeling module is used for carrying out knowledge modeling on the target vertical field to obtain a knowledge model of the target vertical field; the knowledge model comprises entity types and entity relations; the entity relationship characterizes the relationship between entities;
the data set dividing module is used for acquiring an initial data set and dividing the initial data set into large sample data and small sample data; the large sample data are text data with the character number in the initial data set being larger than a first set value; the small sample data are text data with the character number in the initial data set being larger than a third set value and smaller than or equal to the second set value; the first set value is larger than the second set value, and the second set value is larger than the third set value;
the small sample data processing module is used for carrying out data annotation on the small sample data to obtain labeled small sample data, and determining an entity dictionary according to the labeled small sample data;
the large sample data processing module is used for carrying out text enhancement on the large sample data according to the entity dictionary to obtain text enhanced large sample data;
the text data set construction module is used for constructing a text data set according to the labeled small sample data and the text-enhanced large sample data;
the model training module is used for training a named entity recognition model according to the text data set; the named entity recognition model is a BERT-BiLSTM-CRF model;
the triplet instance generating module is used for acquiring target text data, extracting the entity of the target text data by using a trained named entity recognition model, and performing triplet mapping on an extraction result according to the knowledge model to obtain a triplet instance;
and the knowledge graph construction module is used for constructing a knowledge graph of the target vertical field according to the triplet examples.
An electronic device comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic device to execute the automatic knowledge graph construction method.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the knowledge graph automatic construction method described above.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the knowledge graph automatic construction method provided by the invention, the initial data set is divided into the small sample data and the large sample data, text enhancement is carried out on the large sample data according to the small sample data subjected to manual labeling, the text enhanced large sample data is obtained, and a sufficient amount of data can be obtained to train a named entity recognition model, so that the difficulties of complex knowledge in the vertical field, difficult acquisition of data resources, insufficient labeling of the data set and the like are overcome, the preparation efficiency of a model training set and the accuracy of entity extraction are improved, the named entity recognition model is trained by utilizing the data, entity extraction is carried out on target text data by utilizing the trained named entity recognition model, and the extraction result is mapped in a triplet mode according to the pre-established knowledge model, so that the automatic construction of the knowledge graph oriented to the vertical field can be realized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an automatic knowledge graph construction method provided by the invention;
FIG. 2 is a specific flowchart of the knowledge graph automatic construction method provided by the invention;
FIG. 3 is a diagram of an example knowledge modeling provided by the present invention;
FIG. 4 is a diagram of small sample data annotation examples provided by the present invention;
FIG. 5 is a diagram of a named entity recognition model provided by the present invention;
FIG. 6 is a diagram illustrating an exemplary triplet mapping provided by the present invention;
fig. 7 is a block diagram of the knowledge graph automatic construction system provided by the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a method, a system, equipment and a medium for automatically constructing a knowledge graph, so as to realize the automatic construction of the knowledge graph in the vertical field.
Specifically, to address the shortage of training data for named entity recognition models in vertical-domain knowledge graphs, the invention provides a knowledge graph construction method, system, device, and medium based on text enhancement and pre-trained models. The scheme mainly concerns named entity recognition; relationships between entities are obtained by mapping onto the schema layer established during knowledge modeling. On the one hand, the invention expands the data set through text enhancement and improves the extraction accuracy of the pre-trained model; on the other hand, it also treats the trained model as a pre-trained model: after new data are labeled, a new training data set is built and a new model is trained on that basis, dynamically updating the named entity recognition model so that it keeps pace with data updates.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
The embodiment of the invention provides an automatic knowledge graph construction method. As shown in fig. 1, the method includes:
step 101: carrying out knowledge modeling on the target vertical field to obtain a knowledge model of the target vertical field; the knowledge model comprises entity types and entity relations; the entity relationship characterizes a relationship between entities.
Step 102: acquiring an initial data set, and dividing the initial data set into large sample data and small sample data; the large sample data are text data with the character number in the initial data set being larger than a first set value; the small sample data are text data with the character number in the initial data set being larger than a third set value and smaller than or equal to the second set value; the first set value is larger than the second set value, and the second set value is larger than the third set value.
Step 103: and carrying out data labeling on the small sample data to obtain labeled small sample data, and determining an entity dictionary according to the labeled small sample data.
Preferably, the BIO labeling method is adopted to label the small sample data, obtaining labeled small sample data, wherein B denotes the first character of an entity, I denotes the middle and end characters of an entity, and O denotes a non-entity character; entities are then extracted from the labeled small sample data to obtain an entity dictionary, which includes the entity names and corresponding entity types in the small sample data.
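A minimal sketch of the BIO scheme, assuming character-level tagging (as used for Chinese text) and an illustrative entity with the type letter L from the embodiment's example; the function name and sample sentence are hypothetical:

```python
def bio_label(sentence, entities):
    """Tag each character of `sentence` with BIO labels.

    entities: list of (name, type_letter) pairs known to occur in the sentence.
    The first character of each entity gets B-<type>, the remaining characters
    I-<type>; everything else stays O.
    """
    tags = ["O"] * len(sentence)
    for name, etype in entities:
        start = sentence.find(name)
        while start != -1:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, start + len(name)):
                tags[i] = f"I-{etype}"
            start = sentence.find(name, start + len(name))
    return list(zip(sentence, tags))

tagged = bio_label("xcrankshafty", [("crankshaft", "L")])
```

Each (character, tag) pair then becomes one line of the labeled small-sample data set.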
Step 104: and carrying out text enhancement on the large sample data according to the entity dictionary to obtain text enhanced large sample data.
Preferably, word segmentation and stop word removal processing are carried out on the large sample data to obtain preprocessed large sample data; carrying out semantic similarity calculation on the preprocessed large sample data and each entity name in the entity dictionary to obtain a similarity value corresponding to the preprocessed large sample data; and carrying out data annotation on the preprocessed large sample data with the similarity value larger than the set similarity threshold value to obtain text enhanced large sample data.
Step 105: and constructing a text data set according to the small sample data with the marked and the text enhanced large sample data.
Step 106: training a named entity recognition model according to the text data set; the named entity recognition model is a BERT-BiLSTM-CRF model.
Preferably, step 106 specifically includes: dividing the text data set into a plurality of sample subsets with equal sizes; and training and verifying the named entity recognition model according to the sample subset by adopting a K-fold cross verification method to obtain a trained named entity recognition model.
Step 107: and obtaining target text data, extracting the target text data by using a trained named entity recognition model, and performing triplet mapping on an extraction result according to the knowledge model to obtain a triplet instance.
Preferably, step 107 specifically includes: acquiring target text data; performing entity extraction on the target text data sentence by using a trained named entity recognition model to obtain an extraction result; the extraction result comprises an entity name and a corresponding entity type in the target text data; performing triplet mapping on the entity names in the extraction result according to the corresponding entity types and the entity relations in the knowledge model to obtain a triplet instance; the triplet instance includes a start entity name, an end entity name, and a relationship between the start entity and the end entity.
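The triplet mapping of step 107 can be sketched as follows. The relation table keyed by (start type, end type) stands in for the relations R defined during knowledge modeling; the type letters follow the embodiment's P/W/L example, but the relation names (`has_component`, `has_part`) are hypothetical:

```python
# Assumed schema: which relation (if any) holds between two entity types.
RELATIONS = {("P", "W"): "has_component", ("W", "L"): "has_part"}

def map_triples(entities, relations=RELATIONS):
    """entities: list of (name, type) pairs extracted from a single sentence.

    For every ordered pair of distinct entities whose types match a schema
    entry, emit a (start_name, relation, end_name) triple instance.
    """
    triples = []
    for s_name, s_type in entities:
        for e_name, e_type in entities:
            rel = relations.get((s_type, e_type))
            if rel is not None and s_name != e_name:
                triples.append((s_name, rel, e_name))
    return triples
```

Sentence-level pairing keeps the mapping local, so only entities that co-occur in one sentence are linked.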
Step 108: and constructing a knowledge graph of the target vertical field according to the triplet examples.
Further, the method further comprises the steps of: storing the trained named entity recognition model as a pre-training model; taking the target text data and the corresponding extraction result as a first expansion data set; and training the pre-training model according to the first extended data set when the data volume of the first extended data set is larger than the set number so as to update the trained named entity recognition model.
Further, the method further comprises the steps of: when a new manual annotation data set exists, the new manual annotation data set is used as a second expansion data set; and training the pre-training model according to the second extended data set when the data volume of the second extended data set is larger than the set number so as to update the trained named entity recognition model.
The above steps will be described in detail below with reference to fig. 2, taking the construction of a process knowledge graph as an example.
Step S1: knowledge modeling. And (3) combing and classifying the knowledge in the vertical field according to the knowledge graph construction belonging to the field and the knowledge demand, and defining the types and attributes of the entities and the relations.
As shown in fig. 3, a specific method of knowledge modeling is as follows:
(1) Knowledge in the vertical domain is organized and classified according to the domain to which the knowledge graph belongs and the knowledge requirements, and the resulting entity types are stored as labels, i.e. L = {Label1, Label2, …, Labeln}, where Label1 … Labeln are the label names, n labels in total.
(2) The types and attributes of entities and relationships are defined, determining the relationships between different labels and the attributes contained in each label, stored as triples <S, R, E>. Here S denotes the start entity, S ∈ L, containing M attributes P_S = {p1, p2, …, pM}; R denotes the relationship set, R = {r1, r2, …, rT}, where r1 … rT are relationship names, T relationships in total; and E denotes the end entity, E ∈ L, containing N attributes P_E = {p1, p2, …, pN}. In this embodiment, only the entity Name is considered as an attribute of the entity, so P_S = {Name1, Name2, …, NameM} and P_E = {Name1, Name2, …, NameN}. Instantiating an entity type is the process of assigning the entity attribute, i.e. assigning an actual entity name to a label, which forms a node of the knowledge graph; instantiated attributes form the node's attributes; and defining the relationships among nodes forms the complete knowledge graph.
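The schema above can be sketched as a small data structure; all concrete label and relation names here are illustrative, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TripleSchema:
    """One <S, R, E> entry of the knowledge model: S, E ∈ L, R a relation name."""
    start_label: str
    relation: str
    end_label: str

# L: the label set; schema: the model layer of <S, R, E> triples.
labels = {"Product", "Assembly", "Part"}
schema = [
    TripleSchema("Product", "has_assembly", "Assembly"),
    TripleSchema("Assembly", "has_part", "Part"),
]

# Instantiating an entity type = assigning a real entity name to its label;
# the result is a graph node, and relations between nodes complete the graph.
node = {"label": "Part", "Name": "crankshaft"}
```

With only Name as an attribute, a node is fully determined by its label and its name, matching the embodiment's simplification.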
Step S2: data acquisition and partitioning. The acquired vertical field text dataset (i.e., the initial dataset) is partitioned into small sample data and large sample data.
The specific method for data acquisition and division is as follows:
(1) Data with the number of characters exceeding 100000 in the text data is defined as large sample data, and data with the number of characters greater than 8000 and less than or equal to 10000 is defined as small sample data.
(2) The acquired text data are shuffled at sentence granularity and randomly ordered; sentences are then drawn at random until the number of characters drawn matches the definition of the small sample data, and the remaining sentences constitute the large sample data.
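Step S2 (2) can be sketched as follows; the budget value reuses the embodiment's example threshold, and the function name and seed are hypothetical:

```python
import random

def split_corpus(sentences, small_budget=10000, seed=42):
    """Shuffle the corpus by sentence, then draw sentences until the
    character count reaches the small-sample budget; everything else
    becomes the large sample."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    small, large, count = [], [], 0
    for s in shuffled:
        if count + len(s) <= small_budget:
            small.append(s)
            count += len(s)
        else:
            large.append(s)
    return small, large
```

Fixing the random seed makes the split reproducible across runs, which matters when the same partition must feed both labeling and text enhancement.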
Step S3: and labeling the small sample data and obtaining the physical dictionary. And labeling entity types of the divided small sample data sets, and extracting entities from the labeled small sample data to obtain an entity dictionary.
As a specific implementation, the small sample data are labeled and the entity dictionary obtained as follows:
(1) The text data are labeled with the BIO labeling method: B denotes the first character of an entity, I denotes its middle and end characters, and O denotes a non-entity character, with the entity type carried by the B and I tags. As shown in fig. 4, for example, the defined entity types comprise product, assembly, and part, denoted by the letters P, W, and L respectively; B-P denotes the first character of a product entity, and I-P its middle and end characters.
(2) Entities are extracted from the labeled small sample data to obtain an entity dictionary storing the names and types of the entities, EntityData = {[Name1: Label1], [Name2: Label1], …, [Namem: Labeln]}, where Namem is the name of the m-th entity, Labeln is the n-th label, and [Namem: Labeln] means that the entity type corresponding to the m-th entity name is the n-th label. For example, EntityData = {[diesel engine: P], [reducer: W], [crankshaft: L], [cylinder head: L], [body: L], [piston: L], …}.
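Recovering EntityData from the BIO tags of step S3 (1) can be sketched as below; the tag sequence and text are illustrative:

```python
def extract_entity_dict(chars, tags):
    """Build {entity_name: label} from parallel character and BIO-tag lists.

    A B-<label> tag starts a new entity; I-<label> tags extend it; anything
    else (O) closes the entity currently being collected.
    """
    entity_dict, name, label = {}, "", None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if name:
                entity_dict[name] = label
            name, label = ch, tag[2:]
        elif tag.startswith("I-") and name:
            name += ch
        else:
            if name:
                entity_dict[name] = label
            name, label = "", None
    if name:  # flush an entity that runs to the end of the sequence
        entity_dict[name] = label
    return entity_dict
```

Merging the dictionaries of all labeled sentences yields the EntityData used for text enhancement in step S4.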
Step S4: text enhancement of large sample data. The method comprises the steps of performing text enhancement on a divided large sample data set, performing data preprocessing on the large sample data, including word segmentation and stop word removal, performing semantic similarity calculation on the processed data and each entity in an entity dictionary, setting a similarity threshold, and marking the data exceeding the similarity threshold.
As a specific embodiment, a specific method of text enhancement of large sample data is as follows:
(1) Preprocessing of the large sample data comprises word segmentation and stop-word removal. Taking sentences as units, let the large sample data be RawData = {S_1, S_2, …, S_i, …, S_m}, where S_i is a sentence; after word segmentation and stop-word removal, S_i = {w_1, w_2, …, w_j, …, w_n}, where w_j denotes a keyword in the sentence.
(2) Semantic similarity is computed between the preprocessed large sample data and each entity in the entity dictionary, i.e., each w_j (w_j ∈ S_i, S_i ∈ RawData) is compared with each Name_i (Name_i ∈ EntityData). Semantic similarity is measured by the Jaccard coefficient, which compares the similarity between finite sample sets: given two sets A and B, the Jaccard coefficient is defined as the ratio of the intersection size to the union size of the two sets, J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|). Here A is taken from RawData and B from EntityData. The larger the ratio, the higher the similarity; conversely, the lower the similarity.
(3) A similarity threshold is set, and data exceeding the threshold are labeled.
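Steps (2) and (3) can be sketched as a simple weak-labeling pass using a character-level Jaccard comparison against the entity dictionary. The helper names and the 0.5 threshold are illustrative assumptions, not values given in the description.

```python
def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|, computed here over character sets —
    one simple way to realise the similarity measure of step S4."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def weak_label(keywords, entity_dict, threshold=0.5):
    """Label each keyword with the entity type of its most similar
    dictionary entry, provided the similarity exceeds the threshold."""
    labelled = {}
    for w in keywords:
        best = max(entity_dict, key=lambda name: jaccard(w, name))
        if jaccard(w, best) > threshold:
            labelled[w] = entity_dict[best]
    return labelled

entity_dict = {"crankshaft": "L", "piston": "L", "reducer": "W"}
keywords = ["crankshafts", "pistons", "manual"]  # English stand-in keywords
print(weak_label(keywords, entity_dict))
# 'crankshafts' and 'pistons' are labelled; 'manual' falls below the threshold
```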
Step S5: text dataset construction and named entity recognition model training set partitioning. Combining the marked small sample data and the text enhanced large sample data to obtain a text data set, and dividing a training set and a verification set by a K-fold cross verification method.
As a specific embodiment, the specific method for constructing the text data set and partitioning the training set for the named entity recognition model is as follows:
The labeled small sample data are combined with the text-enhanced large sample data, and training and validation sets are partitioned by K-fold cross-validation: first, all data samples are divided into k equally sized subsets; the k subsets are then traversed in turn, each serving once as the validation set while all remaining samples form the training set, and the model is trained and evaluated; finally, the mean of the k evaluation scores is taken as the final evaluation metric.
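The K-fold procedure above can be sketched without any ML library; the fold assignment `samples[i::k]` and the stand-in metric are illustrative choices only.

```python
def k_fold_splits(samples, k=5):
    """Partition samples into k near-equal subsets; each subset serves once
    as the validation set, with all remaining samples as the training set."""
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        yield train, val

data = list(range(10))
scores = []
for train, val in k_fold_splits(data, k=5):
    assert len(train) + len(val) == len(data)   # every sample used exactly once
    scores.append(len(val) / len(data))         # stand-in for a real metric
final_score = sum(scores) / len(scores)         # mean over the k folds
print(final_score)
```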
Step S6: and (5) constructing a named entity recognition model. The BERT model provides a deep learning new model based on pre-training, so that model training efficiency is greatly improved, a two-way long-short-term memory network (BiLSTM) is an improvement of a long-short-term memory network, the problem of long-distance dependence in the vertical field can be solved, a Conditional Random Field (CRF) can further restrict model labeling capability, a BERT-BiLSTM-CRF model is constructed as a named entity recognition model oriented to the vertical field, a model structure is shown in fig. 5, model training and verification are carried out by using a divided training set and a verification set, and a general pre-training model is adopted for training in the primary training.
Step S7: information extraction and knowledge graph construction. And extracting the vertical field data to be extracted by using the trained and verified named entity recognition model, and performing triplet mapping according to the knowledge modeling result to form a knowledge graph.
As shown in fig. 6, the specific method for information extraction and knowledge graph construction is as follows:
(1) Entity extraction (i.e., named entity recognition) is performed on the vertical-domain data using the trained and validated named entity recognition model. Let the text data to be extracted be FinalData = {S_1, S_2, …, S_k, …, S_M}, where S_1 to S_M are the sentences of the text. Named entity recognition is carried out sentence by sentence, yielding the entities contained in each sentence: ExtractEntity = {[Name_1:Label_1], [Name_2:Label_1], [Name_3:Label_2], …, [Name_i:Label_j], …}, where Name_i is the i-th entity name and Label_j is the j-th entity tag, i.e., the entity type, consistent with the tag set L = {Label_1, Label_2, …, Label_n} defined in step S1.
(2) Triples are mapped according to the knowledge modeling result to form the knowledge graph. In the previous step, named entity recognition was performed on the text sentence by sentence; let the entity set extracted from the h-th sentence be S_h_ExtractEntity = {[Name_1:Label_1], [Name_2:Label_1], [Name_3:Label_2], …, [Name_i:Label_j], …}. The whole of S_h_ExtractEntity is traversed, and for each pair of labels it is checked whether a relation r ∈ R exists between the current Label and another Label. When such a relation exists, the extracted entities are mapped according to the entity-relation definition <S, R, E> from the knowledge model: if relation r_z exists between Label_X and Label_Y, then relation r_z holds between the entity Name_A corresponding to Label_X and the entity Name_B corresponding to Label_Y, giving the instantiated triple <Name_A, r_z, Name_B>, i.e., a triple instance. A knowledge graph of the target vertical domain can then be constructed from the set of triple instances.
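The triple mapping of step (2) can be sketched as a pairwise scan over the entities of one sentence. The relation table and the "contains" relation below are illustrative stand-ins for the knowledge model's actual relation set R.

```python
def map_triples(extracted, relations):
    """Pair entities within one sentence whose label pair has a defined
    relation in the knowledge model: relations maps (Label_X, Label_Y) -> r_z,
    and each match yields an instantiated triple <Name_A, r_z, Name_B>."""
    triples = []
    for name_a, label_a in extracted:
        for name_b, label_b in extracted:
            r = relations.get((label_a, label_b))
            if r is not None and name_a != name_b:
                triples.append((name_a, r, name_b))
    return triples

# hypothetical knowledge model: a product P contains a component W,
# and a component W contains a part L (labels follow the earlier example)
relations = {("P", "W"): "contains", ("W", "L"): "contains"}
extracted = [("diesel engine", "P"), ("speed reducer", "W"), ("crankshaft", "L")]
print(map_triples(extracted, relations))
```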
Step S8: the extracted data is supplemented to the text data set. After the entity extraction of the vertical field data is completed, the extracted data is used as a new data set, and when the number of the data sets reaches the minimum requirement of model training, the data sets are divided into a training set and a verification set again.
Step S9: and (5) storing the pre-training model. And (3) saving the named entity recognition model trained in the step (S6) as a pre-training model, training and verifying by using the data set generated in the step (S8) to obtain a new named entity recognition model, and returning to the step (S7) after the completion.
Step S10: and newly obtaining manual annotation data supplement. When a new manual labeling data set exists, the data set is saved similarly to the step S8, and after the number of the data sets reaches the minimum requirement of model training, the data set is divided into a training set and a verification set, and the step S9 is returned.
Embodiment 2
In order to execute the method corresponding to the above embodiment to achieve the corresponding functions and technical effects, an automatic knowledge graph construction system is provided below. As shown in fig. 7, the system includes:
the knowledge modeling module 701 is configured to perform knowledge modeling on a target vertical domain to obtain a knowledge model of the target vertical domain; the knowledge model comprises entity types and entity relations; the entity relationship characterizes a relationship between entities.
A data set dividing module 702, configured to obtain an initial data set, and divide the initial data set into large sample data and small sample data; the large sample data are text data with the character number in the initial data set being larger than a first set value; the small sample data are text data with the character number in the initial data set being larger than a third set value and smaller than or equal to the second set value; the first set value is larger than the second set value, and the second set value is larger than the third set value.
The small sample data processing module 703 is configured to perform data labeling on the small sample data, obtain small sample data with completed labeling, and determine an entity dictionary according to the small sample data with completed labeling.
The large sample data processing module 704 is configured to perform text enhancement on the large sample data according to the entity dictionary, so as to obtain text-enhanced large sample data.
A text data set construction module 705 for constructing a text data set from the noted small sample data and the text enhanced large sample data.
A model training module 706, configured to train a named entity recognition model according to the text data set; the named entity recognition model is a BERT-BiLSTM-CRF model.
And the triplet instance generating module 707 is configured to obtain target text data, perform entity extraction on the target text data by using a trained named entity recognition model, and perform triplet mapping on an extraction result according to the knowledge model to obtain a triplet instance.
A knowledge graph construction module 708, configured to construct a knowledge graph of the target vertical domain according to the triplet instance.
Embodiment 3
The embodiment of the invention also provides electronic equipment, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for running the computer program to enable the electronic equipment to execute the knowledge graph automatic construction method in the first embodiment. The electronic device may be a server.
In addition, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the knowledge graph automatic construction method in the first embodiment.
Aiming at problems such as the insufficient amount of labeled data in vertical domains, the difficulty of training automatic extraction models for unstructured knowledge graph information, and the difficulty of updating such models as the data change, the invention provides an automatic knowledge graph construction method based on text enhancement and transfer learning, which has the following advantages over the prior art:
1. The method expands the data set by text enhancement: the text-enhanced large sample data are derived from the manually labeled small sample data, so a sufficient amount of data can be obtained to train the named entity recognition model, improving the preparation efficiency of the training set as well as the precision and robustness of the model.
2. The invention builds the named entity recognition model BERT-BiLSTM-CRF on a pre-trained model, and a trained model can in turn serve as the pre-trained model for training a new one, thereby realizing dynamic updating of the named entity recognition model.
In summary, the invention provides an automatic knowledge graph construction method based on text enhancement and pre-trained models. Addressing the difficulties of complex vertical-domain knowledge, scarce data resources and insufficiently labeled data sets, it is suited to knowledge graph construction in vertical domains, and the named entity recognition model can be updated dynamically as the domain data set changes, adapting to evolving data resources and knowledge requirements.
In this specification, the embodiments are described in a progressive manner, each focusing on its differences from the others; for identical or similar parts, the embodiments may be referred to one another. Since the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The principles and embodiments of the present invention are described herein with reference to specific examples; the description is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may make changes to the specific embodiments and the scope of application in light of the idea of the present invention. In view of the foregoing, this description should not be construed as limiting the invention.
Claims (10)
1. The automatic knowledge graph construction method is characterized by comprising the following steps of:
carrying out knowledge modeling on the target vertical field to obtain a knowledge model of the target vertical field; the knowledge model comprises entity types and entity relations; the entity relationship characterizes the relationship between entities;
acquiring an initial data set, and dividing the initial data set into large sample data and small sample data; the large sample data are text data with the character number in the initial data set being larger than a first set value; the small sample data are text data with the character number in the initial data set being larger than a third set value and smaller than or equal to the second set value; the first set value is larger than the second set value, and the second set value is larger than the third set value;
carrying out data annotation on the small sample data to obtain small sample data with the completed annotation, and determining an entity dictionary according to the small sample data with the completed annotation;
performing text enhancement on the large sample data according to the entity dictionary to obtain text enhanced large sample data;
constructing a text data set according to the marked small sample data and the text enhanced large sample data;
training a named entity recognition model according to the text data set; the named entity recognition model is a BERT-BiLSTM-CRF model;
acquiring target text data, performing entity extraction on the target text data by using a trained named entity recognition model, and performing triplet mapping on an extraction result according to the knowledge model to obtain a triplet instance;
and constructing a knowledge graph of the target vertical field according to the triplet examples.
2. The automatic knowledge graph construction method according to claim 1, wherein the small sample data is subjected to data labeling, small sample data with completed labeling is obtained, and an entity dictionary is determined according to the small sample data with completed labeling, specifically comprising:
performing data annotation on the small sample data by using a BIO annotation method to obtain the small sample data after the annotation is completed; wherein B represents the initial character of the entity, I represents the middle and end characters of the entity, and O represents the non-entity character;
extracting entities from the small sample data with the marked, and obtaining an entity dictionary; the entity dictionary includes entity names and corresponding entity types in the small sample data.
3. The automatic knowledge graph construction method according to claim 1, wherein text enhancement is performed on the large sample data according to the entity dictionary to obtain text enhanced large sample data, specifically comprising:
performing word segmentation and stop word removal processing on the large sample data to obtain preprocessed large sample data;
carrying out semantic similarity calculation on the preprocessed large sample data and each entity name in the entity dictionary to obtain a similarity value corresponding to the preprocessed large sample data;
and carrying out data annotation on the preprocessed large sample data with the similarity value larger than the set similarity threshold value to obtain text enhanced large sample data.
4. The automatic knowledge graph construction method according to claim 1, wherein training a named entity recognition model according to the text data set specifically comprises:
dividing the text data set into a plurality of sample subsets with equal sizes;
and training and verifying the named entity recognition model according to the sample subset by adopting a K-fold cross verification method to obtain a trained named entity recognition model.
5. The automatic knowledge graph construction method according to claim 1, wherein the entity extraction is performed on the target text data by using a trained named entity recognition model, and the extraction result is subjected to triplet mapping according to the knowledge model, so as to obtain a triplet instance, and the method specifically comprises:
performing entity extraction on the target text data sentence by using a trained named entity recognition model to obtain an extraction result; the extraction result comprises an entity name and a corresponding entity type in the target text data;
performing triplet mapping on the entity names in the extraction result according to the corresponding entity types and the entity relations in the knowledge model to obtain a triplet instance; the triplet instance includes a start entity name, an end entity name, and a relationship between the start entity and the end entity.
6. The automatic knowledge-graph construction method according to claim 1, further comprising:
storing the trained named entity recognition model as a pre-training model;
taking the target text data and the corresponding extraction result as a first expansion data set;
and training the pre-training model according to the first extended data set when the data volume of the first extended data set is larger than the set number so as to update the trained named entity recognition model.
7. The automatic knowledge-graph construction method according to claim 6, further comprising:
when a new manual annotation data set exists, the new manual annotation data set is used as a second expansion data set;
and training the pre-training model according to the second extended data set when the data volume of the second extended data set is larger than the set number so as to update the trained named entity recognition model.
8. The automatic knowledge graph construction system is characterized by comprising:
the knowledge modeling module is used for carrying out knowledge modeling on the target vertical field to obtain a knowledge model of the target vertical field; the knowledge model comprises entity types and entity relations; the entity relationship characterizes the relationship between entities;
the data set dividing module is used for acquiring an initial data set and dividing the initial data set into large sample data and small sample data; the large sample data are text data with the character number in the initial data set being larger than a first set value; the small sample data are text data with the character number in the initial data set being larger than a third set value and smaller than or equal to the second set value; the first set value is larger than the second set value, and the second set value is larger than the third set value;
the small sample data processing module is used for carrying out data annotation on the small sample data to obtain small sample data with the marked small sample data, and determining an entity dictionary according to the small sample data with the marked small sample data;
the large sample data processing module is used for carrying out text enhancement on the large sample data according to the entity dictionary to obtain text enhanced large sample data;
the text data set construction module is used for constructing a text data set according to the small sample data with the marked and the large sample data with the enhanced text;
the model training module is used for training a named entity recognition model according to the text data set; the named entity recognition model is a BERT-BiLSTM-CRF model;
the triplet instance generating module is used for acquiring target text data, extracting the entity of the target text data by using a trained named entity recognition model, and performing triplet mapping on an extraction result according to the knowledge model to obtain a triplet instance;
and the knowledge graph construction module is used for constructing a knowledge graph of the target vertical field according to the triplet examples.
9. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to execute the knowledge-graph automatic construction method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the knowledge-graph automatic construction method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310571923.1A CN116541535A (en) | 2023-05-19 | 2023-05-19 | Automatic knowledge graph construction method, system, equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310571923.1A CN116541535A (en) | 2023-05-19 | 2023-05-19 | Automatic knowledge graph construction method, system, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116541535A true CN116541535A (en) | 2023-08-04 |
Family
ID=87457571
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310571923.1A Pending CN116541535A (en) | 2023-05-19 | 2023-05-19 | Automatic knowledge graph construction method, system, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116541535A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117196027A (en) * | 2023-11-07 | 2023-12-08 | 北京航天晨信科技有限责任公司 | Training sample generation method and device based on knowledge graph |
CN118170891A (en) * | 2024-05-13 | 2024-06-11 | 浙江大学 | Text information extraction method, device, equipment and readable storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200302234A1 (en) * | 2019-03-22 | 2020-09-24 | Capital One Services, Llc | System and method for efficient generation of machine-learning models |
CN111950264A (en) * | 2020-08-05 | 2020-11-17 | 广东工业大学 | Text data enhancement method and knowledge element extraction method |
CN113177124A (en) * | 2021-05-11 | 2021-07-27 | 北京邮电大学 | Vertical domain knowledge graph construction method and system |
CN114266254A (en) * | 2021-12-24 | 2022-04-01 | 上海德拓信息技术股份有限公司 | Text named entity recognition method and system |
CN115238093A (en) * | 2022-07-26 | 2022-10-25 | 上海航空工业(集团)有限公司 | Model training method and device, electronic equipment and storage medium |
WO2023064563A1 (en) * | 2021-10-15 | 2023-04-20 | Adrenalineip | Methods, systems, and apparatuses for processing sports-related data |
CN116070632A (en) * | 2022-11-25 | 2023-05-05 | 金茂云科技服务(北京)有限公司 | Informal text entity tag identification method and device |
- 2023-05-19: CN application CN202310571923.1A filed, published as CN116541535A, status active pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200302234A1 (en) * | 2019-03-22 | 2020-09-24 | Capital One Services, Llc | System and method for efficient generation of machine-learning models |
CN111950264A (en) * | 2020-08-05 | 2020-11-17 | 广东工业大学 | Text data enhancement method and knowledge element extraction method |
CN113177124A (en) * | 2021-05-11 | 2021-07-27 | 北京邮电大学 | Vertical domain knowledge graph construction method and system |
WO2023064563A1 (en) * | 2021-10-15 | 2023-04-20 | Adrenalineip | Methods, systems, and apparatuses for processing sports-related data |
CN114266254A (en) * | 2021-12-24 | 2022-04-01 | 上海德拓信息技术股份有限公司 | Text named entity recognition method and system |
CN115238093A (en) * | 2022-07-26 | 2022-10-25 | 上海航空工业(集团)有限公司 | Model training method and device, electronic equipment and storage medium |
CN116070632A (en) * | 2022-11-25 | 2023-05-05 | 金茂云科技服务(北京)有限公司 | Informal text entity tag identification method and device |
Non-Patent Citations (4)
Title |
---|
KAZUAKI KASHIHARA et al.: "Human-Machine Interaction for Improved Cybersecurity Named Entity Recognition Considering Semantic Similarity", SPRINGER LINK, vol. 1251, 25 August 2020 (2020-08-25), pages 347 - 361 *
LIU JIAXI: "Case element extraction model based on small-scale annotation", China Masters' Theses Full-text Database, Social Sciences I; Information Science and Technology, no. 8, 15 August 2022 (2022-08-15), pages 120 - 187 *
QIAN LINGFEI; CUI XIAOLEI: "Research on a domain knowledge graph construction method based on data augmentation", Journal of Modern Information, vol. 42, no. 3, 28 February 2022 (2022-02-28), pages 31 - 39 *
HUANG CHEN: "Medical named entity recognition method based on a small amount of labeled data", China Masters' Theses Full-text Database, Medicine and Health Sciences; Information Science and Technology, no. 1, 15 January 2023 (2023-01-15), pages 054 - 109 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117196027A (en) * | 2023-11-07 | 2023-12-08 | 北京航天晨信科技有限责任公司 | Training sample generation method and device based on knowledge graph |
CN117196027B (en) * | 2023-11-07 | 2024-02-02 | 北京航天晨信科技有限责任公司 | Training sample generation method and device based on knowledge graph |
CN118170891A (en) * | 2024-05-13 | 2024-06-11 | 浙江大学 | Text information extraction method, device, equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106776711B (en) | Chinese medical knowledge map construction method based on deep learning | |
CN107180045B (en) | Method for extracting geographic entity relation contained in internet text | |
CN111914096A (en) | Public transport passenger satisfaction evaluation method and system based on public opinion knowledge graph | |
CN116541535A (en) | Automatic knowledge graph construction method, system, equipment and medium | |
CN112800170A (en) | Question matching method and device and question reply method and device | |
CN109960786A (en) | Chinese Measurement of word similarity based on convergence strategy | |
CN113515632B (en) | Text classification method based on graph path knowledge extraction | |
CN114238653B (en) | Method for constructing programming education knowledge graph, completing and intelligently asking and answering | |
CN113360675A (en) | Knowledge graph specific relation completion method based on Internet open world | |
CN111368066B (en) | Method, apparatus and computer readable storage medium for obtaining dialogue abstract | |
CN103559193A (en) | Topic modeling method based on selected cell | |
CN116541472B (en) | Knowledge graph construction method in medical field | |
CN116756347B (en) | Semantic information retrieval method based on big data | |
CN114722833B (en) | Semantic classification method and device | |
CN113934835B (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
CN115905554A (en) | Chinese academic knowledge graph construction method based on multidisciplinary classification | |
CN107622047B (en) | Design decision knowledge extraction and expression method | |
CN113486143A (en) | User portrait generation method based on multi-level text representation and model fusion | |
CN112711944A (en) | Word segmentation method and system and word segmentation device generation method and system | |
CN117113094A (en) | Semantic progressive fusion-based long text similarity calculation method and device | |
CN115033706A (en) | Method for automatically complementing and updating knowledge graph | |
CN113128210B (en) | Webpage form information analysis method based on synonym discovery | |
CN112613318B (en) | Entity name normalization system, method thereof and computer readable medium | |
CN114969001A (en) | Database metadata field matching method, device, equipment and medium | |
CN112632284A (en) | Information extraction method and system for unlabeled text data set |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |