CN116541535A - Automatic knowledge graph construction method, system, equipment and medium - Google Patents
- Publication number
- CN116541535A (application CN202310571923.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- sample data
- entity
- text
- knowledge
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/367 — Information retrieval of unstructured textual data; creation of semantic tools: ontology
- G06F16/35 — Information retrieval of unstructured textual data: clustering; classification
- G06F40/237 — Natural language analysis: lexical tools
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295 — Named entity recognition
- G06F40/30 — Semantic analysis
- G06N3/0442 — Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/045 — Combinations of networks
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a method, system, device, and medium for automatically constructing a knowledge graph, in the technical field of knowledge graph information extraction. The method comprises the following steps: performing knowledge modeling on a target vertical domain to obtain a knowledge model; dividing an initial data set into large sample data and small sample data; labeling the small sample data, and determining an entity dictionary from the labeled small sample data; performing text enhancement on the large sample data according to the entity dictionary; constructing a text data set from the labeled small sample data and the text-enhanced large sample data; training a named entity recognition model on the text data set; performing entity extraction on target text data with the trained named entity recognition model, and mapping the extraction results to triples according to the knowledge model to obtain triple instances; and constructing the knowledge graph from the triple instances. The invention enables automatic construction of vertical-domain knowledge graphs.
Description
Technical Field
The invention relates to the technical field of knowledge graph information extraction, and in particular to a method, system, device, and medium for automatically constructing a knowledge graph.
Background
Existing vertical-domain knowledge graph construction methods fall into two categories, manual construction and automatic construction. An automatic construction scheme mainly comprises two steps, knowledge modeling and information extraction, of which information extraction is the key to automatic construction of the knowledge graph. Information extraction includes named entity recognition (i.e., entity extraction) and entity relationship extraction (i.e., relation extraction).
Named entity recognition techniques mainly comprise rule-and-dictionary-based methods, statistical methods, deep learning methods, and the like. Rule-and-dictionary-based methods are the earliest approach to named entity recognition; they rely on manually crafted rules and a complete entity library, are prone to rule conflicts, and making the rules and dictionaries is time-consuming and labor-intensive. Statistical methods include hidden Markov models, maximum entropy models, support vector machines, and conditional random fields; they depend heavily on a corpus, so building a corpus for training and evaluating a vertical-domain named entity recognition model is the main factor restricting these methods. With the continuous improvement of computer performance, machine learning methods have been widely applied to natural language processing, including recurrent neural networks (RNN), long short-term memory networks (LSTM), and gated recurrent units (GRU). In addition, pre-trained models, represented by BERT, have become the state of the art of deep learning in recent years: a model is pre-trained on a large amount of text data and then fine-tuned on high-quality data to improve training efficiency and accuracy.
Unlike general-domain knowledge graphs, constructing a vertical-domain knowledge graph places higher requirements on data sources and data quality. However, vertical domains lag the general domain in data volume and available data sets, and a new entity extraction model must often be trained for each domain problem, discipline, and task requirement, so it is difficult to guarantee a sufficient data set for training the model; the training data typically depend on manual labeling, which is time-consuming and labor-intensive.
Disclosure of Invention
The invention aims to provide a method, system, device, and medium for automatically constructing a knowledge graph, so as to realize automatic construction of vertical-domain knowledge graphs.
In order to achieve the above object, the present invention provides the following solutions:
an automatic knowledge graph construction method comprises the following steps:
carrying out knowledge modeling on the target vertical field to obtain a knowledge model of the target vertical field; the knowledge model comprises entity types and entity relations; the entity relationship characterizes the relationship between entities;
acquiring an initial data set, and dividing the initial data set into large sample data and small sample data; the large sample data are the text data in the initial data set whose character count is larger than a first set value; the small sample data are the text data whose character count is larger than a third set value and smaller than or equal to a second set value; the first set value is larger than the second set value, and the second set value is larger than the third set value;
carrying out data annotation on the small sample data to obtain labeled small sample data, and determining an entity dictionary according to the labeled small sample data;
performing text enhancement on the large sample data according to the entity dictionary to obtain text enhanced large sample data;
constructing a text data set according to the labeled small sample data and the text-enhanced large sample data;
training a named entity recognition model according to the text data set; the named entity recognition model is a BERT-BiLSTM-CRF model;
acquiring target text data, performing entity extraction on the target text data by using the trained named entity recognition model, and performing triplet mapping on the extraction result according to the knowledge model to obtain triplet instances;
and constructing a knowledge graph of the target vertical field according to the triplet examples.
Optionally, the data labeling of the small sample data to obtain labeled small sample data, and the determining of an entity dictionary according to the labeled small sample data, specifically comprise:
performing data annotation on the small sample data by using the BIO annotation method to obtain the labeled small sample data, wherein B denotes the first character of an entity, I denotes the middle and end characters of an entity, and O denotes a non-entity character;
extracting entities from the labeled small sample data to obtain an entity dictionary; the entity dictionary includes the entity names and corresponding entity types in the small sample data.
Optionally, text enhancement is performed on the large sample data according to the entity dictionary to obtain text enhanced large sample data, which specifically includes:
performing word segmentation and stop word removal processing on the large sample data to obtain preprocessed large sample data;
carrying out semantic similarity calculation on the preprocessed large sample data and each entity name in the entity dictionary to obtain a similarity value corresponding to the preprocessed large sample data;
and carrying out data annotation on the preprocessed large sample data with the similarity value larger than the set similarity threshold value to obtain text enhanced large sample data.
Optionally, training a named entity recognition model according to the text data set, specifically including:
dividing the text data set into a plurality of sample subsets with equal sizes;
and training and verifying the named entity recognition model according to the sample subset by adopting a K-fold cross verification method to obtain a trained named entity recognition model.
Optionally, performing entity extraction on the target text data by using a trained named entity recognition model, and performing triplet mapping on an extraction result according to the knowledge model to obtain a triplet instance, which specifically includes:
performing entity extraction on the target text data sentence by using a trained named entity recognition model to obtain an extraction result; the extraction result comprises an entity name and a corresponding entity type in the target text data;
performing triplet mapping on the entity names in the extraction result according to the corresponding entity types and the entity relations in the knowledge model to obtain a triplet instance; the triplet instance includes a start entity name, an end entity name, and a relationship between the start entity and the end entity.
Optionally, the method further comprises:
storing the trained named entity recognition model as a pre-training model;
taking the target text data and the corresponding extraction result as a first expansion data set;
and training the pre-training model according to the first extended data set when the data volume of the first extended data set is larger than the set number so as to update the trained named entity recognition model.
Optionally, the method further comprises:
when a new manual annotation data set exists, the new manual annotation data set is used as a second expansion data set;
and training the pre-training model according to the second extended data set when the data volume of the second extended data set is larger than the set number so as to update the trained named entity recognition model.
An automatic knowledge graph construction system, comprising:
the knowledge modeling module is used for carrying out knowledge modeling on the target vertical field to obtain a knowledge model of the target vertical field; the knowledge model comprises entity types and entity relations; the entity relationship characterizes the relationship between entities;
the data set dividing module is used for acquiring an initial data set and dividing the initial data set into large sample data and small sample data; the large sample data are text data with the character number in the initial data set being larger than a first set value; the small sample data are text data with the character number in the initial data set being larger than a third set value and smaller than or equal to the second set value; the first set value is larger than the second set value, and the second set value is larger than the third set value;
the small sample data processing module is used for carrying out data annotation on the small sample data to obtain labeled small sample data, and determining an entity dictionary according to the labeled small sample data;
the large sample data processing module is used for carrying out text enhancement on the large sample data according to the entity dictionary to obtain text enhanced large sample data;
the text data set construction module is used for constructing a text data set according to the labeled small sample data and the text-enhanced large sample data;
the model training module is used for training a named entity recognition model according to the text data set; the named entity recognition model is a BERT-BiLSTM-CRF model;
the triplet instance generating module is used for acquiring target text data, extracting the entity of the target text data by using a trained named entity recognition model, and performing triplet mapping on an extraction result according to the knowledge model to obtain a triplet instance;
and the knowledge graph construction module is used for constructing a knowledge graph of the target vertical field according to the triplet examples.
An electronic device comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic device to execute the automatic knowledge graph construction method.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the knowledge graph automatic construction method described above.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the knowledge graph automatic construction method provided by the invention, the initial data set is divided into the small sample data and the large sample data, text enhancement is carried out on the large sample data according to the small sample data subjected to manual labeling, the text enhanced large sample data is obtained, and a sufficient amount of data can be obtained to train a named entity recognition model, so that the difficulties of complex knowledge in the vertical field, difficult acquisition of data resources, insufficient labeling of the data set and the like are overcome, the preparation efficiency of a model training set and the accuracy of entity extraction are improved, the named entity recognition model is trained by utilizing the data, entity extraction is carried out on target text data by utilizing the trained named entity recognition model, and the extraction result is mapped in a triplet mode according to the pre-established knowledge model, so that the automatic construction of the knowledge graph oriented to the vertical field can be realized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an automatic knowledge graph construction method provided by the invention;
FIG. 2 is a specific flowchart of the knowledge graph automatic construction method provided by the invention;
FIG. 3 is a diagram of an example knowledge modeling provided by the present invention;
FIG. 4 is a diagram of small sample data annotation examples provided by the present invention;
FIG. 5 is a diagram of a named entity recognition model provided by the present invention;
FIG. 6 is a diagram illustrating an exemplary triplet mapping provided by the present invention;
fig. 7 is a block diagram of the knowledge graph automatic construction system provided by the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a method, a system, equipment and a medium for automatically constructing a knowledge graph, so as to realize the automatic construction of the knowledge graph in the vertical field.
Specifically, to address the shortage of training data for named entity recognition models in vertical-domain knowledge graphs, the invention provides a knowledge graph construction method, system, device, and medium based on text enhancement and pre-trained models. The scheme mainly concerns named entity recognition; relationships between entities are obtained by mapping onto the schema layer established during knowledge modeling. On the one hand, the invention expands the data set through text enhancement and improves the extraction accuracy of the pre-trained model; on the other hand, it also treats the trained model as a pre-trained model: after new data are labeled, a new training data set is built and a new model is trained on that basis, dynamically updating the named entity recognition model so that it keeps pace with data updates.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
The embodiment of the invention provides an automatic knowledge graph construction method. As shown in fig. 1, the method includes:
step 101: carrying out knowledge modeling on the target vertical field to obtain a knowledge model of the target vertical field; the knowledge model comprises entity types and entity relations; the entity relationship characterizes a relationship between entities.
Step 102: acquiring an initial data set, and dividing the initial data set into large sample data and small sample data; the large sample data are text data with the character number in the initial data set being larger than a first set value; the small sample data are text data with the character number in the initial data set being larger than a third set value and smaller than or equal to the second set value; the first set value is larger than the second set value, and the second set value is larger than the third set value.
Step 103: and carrying out data labeling on the small sample data to obtain labeled small sample data, and determining an entity dictionary according to the labeled small sample data.
Preferably, the BIO labeling method is adopted to label the small sample data, obtaining labeled small sample data, wherein B denotes the first character of an entity, I denotes the middle and end characters of an entity, and O denotes a non-entity character; entities are then extracted from the labeled small sample data to obtain an entity dictionary, which includes the entity names and corresponding entity types in the small sample data.
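A minimal sketch of the BIO scheme, assuming character-level tagging (as used for Chinese text) and an illustrative entity with the type letter L from the embodiment's example; the function name and sample sentence are hypothetical:

```python
def bio_label(sentence, entities):
    """Tag each character of `sentence` with BIO labels.

    entities: list of (name, type_letter) pairs known to occur in the sentence.
    The first character of each entity gets B-<type>, the remaining characters
    I-<type>; everything else stays O.
    """
    tags = ["O"] * len(sentence)
    for name, etype in entities:
        start = sentence.find(name)
        while start != -1:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, start + len(name)):
                tags[i] = f"I-{etype}"
            start = sentence.find(name, start + len(name))
    return list(zip(sentence, tags))

tagged = bio_label("xcrankshafty", [("crankshaft", "L")])
```

Each (character, tag) pair then becomes one line of the labeled small-sample data set.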
Step 104: and carrying out text enhancement on the large sample data according to the entity dictionary to obtain text enhanced large sample data.
Preferably, word segmentation and stop word removal processing are carried out on the large sample data to obtain preprocessed large sample data; carrying out semantic similarity calculation on the preprocessed large sample data and each entity name in the entity dictionary to obtain a similarity value corresponding to the preprocessed large sample data; and carrying out data annotation on the preprocessed large sample data with the similarity value larger than the set similarity threshold value to obtain text enhanced large sample data.
Step 105: and constructing a text data set according to the small sample data with the marked and the text enhanced large sample data.
Step 106: training a named entity recognition model according to the text data set; the named entity recognition model is a BERT-BiLSTM-CRF model.
Preferably, step 106 specifically includes: dividing the text data set into a plurality of sample subsets with equal sizes; and training and verifying the named entity recognition model according to the sample subset by adopting a K-fold cross verification method to obtain a trained named entity recognition model.
Step 107: and obtaining target text data, extracting the target text data by using a trained named entity recognition model, and performing triplet mapping on an extraction result according to the knowledge model to obtain a triplet instance.
Preferably, step 107 specifically includes: acquiring target text data; performing entity extraction on the target text data sentence by using a trained named entity recognition model to obtain an extraction result; the extraction result comprises an entity name and a corresponding entity type in the target text data; performing triplet mapping on the entity names in the extraction result according to the corresponding entity types and the entity relations in the knowledge model to obtain a triplet instance; the triplet instance includes a start entity name, an end entity name, and a relationship between the start entity and the end entity.
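The triplet mapping of step 107 can be sketched as follows. The relation table keyed by (start type, end type) stands in for the relations R defined during knowledge modeling; the type letters follow the embodiment's P/W/L example, but the relation names (`has_component`, `has_part`) are hypothetical:

```python
# Assumed schema: which relation (if any) holds between two entity types.
RELATIONS = {("P", "W"): "has_component", ("W", "L"): "has_part"}

def map_triples(entities, relations=RELATIONS):
    """entities: list of (name, type) pairs extracted from a single sentence.

    For every ordered pair of distinct entities whose types match a schema
    entry, emit a (start_name, relation, end_name) triple instance.
    """
    triples = []
    for s_name, s_type in entities:
        for e_name, e_type in entities:
            rel = relations.get((s_type, e_type))
            if rel is not None and s_name != e_name:
                triples.append((s_name, rel, e_name))
    return triples
```

Sentence-level pairing keeps the mapping local, so only entities that co-occur in one sentence are linked.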
Step 108: and constructing a knowledge graph of the target vertical field according to the triplet examples.
Further, the method further comprises the steps of: storing the trained named entity recognition model as a pre-training model; taking the target text data and the corresponding extraction result as a first expansion data set; and training the pre-training model according to the first extended data set when the data volume of the first extended data set is larger than the set number so as to update the trained named entity recognition model.
Further, the method further comprises the steps of: when a new manual annotation data set exists, the new manual annotation data set is used as a second expansion data set; and training the pre-training model according to the second extended data set when the data volume of the second extended data set is larger than the set number so as to update the trained named entity recognition model.
The above steps will be described in detail below with reference to fig. 2, taking the construction of a process knowledge graph as an example.
Step S1: knowledge modeling. And (3) combing and classifying the knowledge in the vertical field according to the knowledge graph construction belonging to the field and the knowledge demand, and defining the types and attributes of the entities and the relations.
As shown in fig. 3, a specific method of knowledge modeling is as follows:
(1) Knowledge in the vertical domain is organized and classified according to the domain to which the knowledge graph belongs and the knowledge requirements, and the resulting entity types are stored as labels, i.e. L = {Label1, Label2, …, Labeln}, where Label1 … Labeln are the label names, n labels in total.
(2) The types and attributes of entities and relationships are defined, determining the relationships between different labels and the attributes contained in each label, stored as triples <S, R, E>. Here S denotes the start entity, S ∈ L, containing M attributes P_S = {p1, p2, …, pM}; R denotes the relationship set, R = {r1, r2, …, rT}, where r1 … rT are relationship names, T relationships in total; and E denotes the end entity, E ∈ L, containing N attributes P_E = {p1, p2, …, pN}. In this embodiment, only the entity Name is considered as an attribute of the entity, so P_S = {Name1, Name2, …, NameM} and P_E = {Name1, Name2, …, NameN}. Instantiating an entity type is the process of assigning the entity attribute, i.e. assigning an actual entity name to a label, which forms a node of the knowledge graph; instantiated attributes form the node's attributes; and defining the relationships among nodes forms the complete knowledge graph.
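The schema above can be sketched as a small data structure; all concrete label and relation names here are illustrative, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TripleSchema:
    """One <S, R, E> entry of the knowledge model: S, E ∈ L, R a relation name."""
    start_label: str
    relation: str
    end_label: str

# L: the label set; schema: the model layer of <S, R, E> triples.
labels = {"Product", "Assembly", "Part"}
schema = [
    TripleSchema("Product", "has_assembly", "Assembly"),
    TripleSchema("Assembly", "has_part", "Part"),
]

# Instantiating an entity type = assigning a real entity name to its label;
# the result is a graph node, and relations between nodes complete the graph.
node = {"label": "Part", "Name": "crankshaft"}
```

With only Name as an attribute, a node is fully determined by its label and its name, matching the embodiment's simplification.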
Step S2: data acquisition and partitioning. The acquired vertical field text dataset (i.e., the initial dataset) is partitioned into small sample data and large sample data.
The specific method for data acquisition and division is as follows:
(1) Data with the number of characters exceeding 100000 in the text data is defined as large sample data, and data with the number of characters greater than 8000 and less than or equal to 10000 is defined as small sample data.
(2) The acquired text data are shuffled at sentence granularity and randomly ordered; sentences are then drawn at random until the number of characters drawn matches the definition of the small sample data, and the remaining sentences constitute the large sample data.
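Step S2 (2) can be sketched as follows; the budget value reuses the embodiment's example threshold, and the function name and seed are hypothetical:

```python
import random

def split_corpus(sentences, small_budget=10000, seed=42):
    """Shuffle the corpus by sentence, then draw sentences until the
    character count reaches the small-sample budget; everything else
    becomes the large sample."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    small, large, count = [], [], 0
    for s in shuffled:
        if count + len(s) <= small_budget:
            small.append(s)
            count += len(s)
        else:
            large.append(s)
    return small, large
```

Fixing the random seed makes the split reproducible across runs, which matters when the same partition must feed both labeling and text enhancement.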
Step S3: and labeling the small sample data and obtaining the physical dictionary. And labeling entity types of the divided small sample data sets, and extracting entities from the labeled small sample data to obtain an entity dictionary.
As a specific implementation, the small sample data are labeled and the entity dictionary obtained as follows:
(1) The text data are labeled with the BIO labeling method: B denotes the first character of an entity, I denotes its middle and end characters, and O denotes a non-entity character, with the entity type carried by the B and I tags. As shown in fig. 4, for example, the defined entity types comprise product, assembly, and part, denoted by the letters P, W, and L respectively; B-P denotes the first character of a product entity, and I-P its middle and end characters.
(2) Entities are extracted from the labeled small sample data to obtain an entity dictionary storing the names and types of the entities, EntityData = {[Name1: Label1], [Name2: Label1], …, [Namem: Labeln]}, where Namem is the name of the m-th entity, Labeln is the n-th label, and [Namem: Labeln] means that the entity type corresponding to the m-th entity name is the n-th label. For example, EntityData = {[diesel engine: P], [reducer: W], [crankshaft: L], [cylinder head: L], [body: L], [piston: L], …}.
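Recovering EntityData from the BIO tags of step S3 (1) can be sketched as below; the tag sequence and text are illustrative:

```python
def extract_entity_dict(chars, tags):
    """Build {entity_name: label} from parallel character and BIO-tag lists.

    A B-<label> tag starts a new entity; I-<label> tags extend it; anything
    else (O) closes the entity currently being collected.
    """
    entity_dict, name, label = {}, "", None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if name:
                entity_dict[name] = label
            name, label = ch, tag[2:]
        elif tag.startswith("I-") and name:
            name += ch
        else:
            if name:
                entity_dict[name] = label
            name, label = "", None
    if name:  # flush an entity that runs to the end of the sequence
        entity_dict[name] = label
    return entity_dict
```

Merging the dictionaries of all labeled sentences yields the EntityData used for text enhancement in step S4.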
Step S4: text enhancement of large sample data. The method comprises the steps of performing text enhancement on a divided large sample data set, performing data preprocessing on the large sample data, including word segmentation and stop word removal, performing semantic similarity calculation on the processed data and each entity in an entity dictionary, setting a similarity threshold, and marking the data exceeding the similarity threshold.
As a specific embodiment, a specific method of text enhancement of large sample data is as follows:
(1) Preprocessing of the large sample data comprises word segmentation and stop-word removal. Taking sentences as units, let the large sample data be RawData = {S_1, S_2, …, S_i, …, S_m}, where S_i is a sentence; after word segmentation and stop-word removal, S_i = {w_1, w_2, …, w_j, …, w_n}, where w_j denotes a keyword in the sentence.
(2) Semantic similarity is computed between the preprocessed large sample data and each entity in the entity dictionary, i.e., each w_j (w_j ∈ S_i, S_i ∈ RawData) is compared with each Name_i (Name_i ∈ EntityData). Semantic similarity is measured by the Jaccard coefficient, which compares the similarity between finite sample sets: given two sets A and B, the Jaccard coefficient is defined as the ratio of the intersection size to the union size of the two sets, J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|). Here A is taken from RawData and B from EntityData. The larger the ratio, the higher the similarity; conversely, the lower the similarity.
(3) A similarity threshold is set, and data exceeding the threshold are labeled.
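Steps (2) and (3) can be sketched as a simple weak-labeling pass using a character-level Jaccard comparison against the entity dictionary. The helper names and the 0.5 threshold are illustrative assumptions, not values given in the description.

```python
def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|, computed here over character sets —
    one simple way to realise the similarity measure of step S4."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def weak_label(keywords, entity_dict, threshold=0.5):
    """Label each keyword with the entity type of its most similar
    dictionary entry, provided the similarity exceeds the threshold."""
    labelled = {}
    for w in keywords:
        best = max(entity_dict, key=lambda name: jaccard(w, name))
        if jaccard(w, best) > threshold:
            labelled[w] = entity_dict[best]
    return labelled

entity_dict = {"crankshaft": "L", "piston": "L", "reducer": "W"}
keywords = ["crankshafts", "pistons", "manual"]  # English stand-in keywords
print(weak_label(keywords, entity_dict))
# 'crankshafts' and 'pistons' are labelled; 'manual' falls below the threshold
```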
Step S5: text dataset construction and named entity recognition model training set partitioning. Combining the marked small sample data and the text enhanced large sample data to obtain a text data set, and dividing a training set and a verification set by a K-fold cross verification method.
As a specific embodiment, the specific method for constructing the text data set and partitioning the training set for the named entity recognition model is as follows:
The labeled small sample data are combined with the text-enhanced large sample data, and training and validation sets are partitioned by K-fold cross-validation: first, all data samples are divided into k equally sized subsets; the k subsets are then traversed in turn, each serving once as the validation set while all remaining samples form the training set, and the model is trained and evaluated; finally, the mean of the k evaluation scores is taken as the final evaluation metric.
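The K-fold procedure above can be sketched without any ML library; the fold assignment `samples[i::k]` and the stand-in metric are illustrative choices only.

```python
def k_fold_splits(samples, k=5):
    """Partition samples into k near-equal subsets; each subset serves once
    as the validation set, with all remaining samples as the training set."""
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        yield train, val

data = list(range(10))
scores = []
for train, val in k_fold_splits(data, k=5):
    assert len(train) + len(val) == len(data)   # every sample used exactly once
    scores.append(len(val) / len(data))         # stand-in for a real metric
final_score = sum(scores) / len(scores)         # mean over the k folds
print(final_score)
```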
Step S6: and (5) constructing a named entity recognition model. The BERT model provides a deep learning new model based on pre-training, so that model training efficiency is greatly improved, a two-way long-short-term memory network (BiLSTM) is an improvement of a long-short-term memory network, the problem of long-distance dependence in the vertical field can be solved, a Conditional Random Field (CRF) can further restrict model labeling capability, a BERT-BiLSTM-CRF model is constructed as a named entity recognition model oriented to the vertical field, a model structure is shown in fig. 5, model training and verification are carried out by using a divided training set and a verification set, and a general pre-training model is adopted for training in the primary training.
Step S7: information extraction and knowledge graph construction. And extracting the vertical field data to be extracted by using the trained and verified named entity recognition model, and performing triplet mapping according to the knowledge modeling result to form a knowledge graph.
As shown in fig. 6, the specific method for information extraction and knowledge graph construction is as follows:
(1) Entity extraction (i.e., named entity recognition) is performed on the vertical-domain data using the trained and validated named entity recognition model. Let the text data to be extracted be FinalData = {S_1, S_2, …, S_k, …, S_M}, where S_1 to S_M are the sentences of the text. Named entity recognition is carried out sentence by sentence, yielding the entities contained in each sentence: ExtractEntity = {[Name_1:Label_1], [Name_2:Label_1], [Name_3:Label_2], …, [Name_i:Label_j], …}, where Name_i is the i-th entity name and Label_j is the j-th entity tag, i.e., the entity type, consistent with the tag set L = {Label_1, Label_2, …, Label_n} defined in step S1.
(2) Triples are mapped according to the knowledge modeling result to form the knowledge graph. In the previous step, named entity recognition was performed on the text sentence by sentence; let the entity set extracted from the h-th sentence be S_h_ExtractEntity = {[Name_1:Label_1], [Name_2:Label_1], [Name_3:Label_2], …, [Name_i:Label_j], …}. The whole of S_h_ExtractEntity is traversed, and for each pair of labels it is checked whether a relation r ∈ R exists between the current Label and another Label. When such a relation exists, the extracted entities are mapped according to the entity-relation definition <S, R, E> from the knowledge model: if relation r_z exists between Label_X and Label_Y, then relation r_z holds between the entity Name_A corresponding to Label_X and the entity Name_B corresponding to Label_Y, giving the instantiated triple <Name_A, r_z, Name_B>, i.e., a triple instance. A knowledge graph of the target vertical domain can then be constructed from the set of triple instances.
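The triple mapping of step (2) can be sketched as a pairwise scan over the entities of one sentence. The relation table and the "contains" relation below are illustrative stand-ins for the knowledge model's actual relation set R.

```python
def map_triples(extracted, relations):
    """Pair entities within one sentence whose label pair has a defined
    relation in the knowledge model: relations maps (Label_X, Label_Y) -> r_z,
    and each match yields an instantiated triple <Name_A, r_z, Name_B>."""
    triples = []
    for name_a, label_a in extracted:
        for name_b, label_b in extracted:
            r = relations.get((label_a, label_b))
            if r is not None and name_a != name_b:
                triples.append((name_a, r, name_b))
    return triples

# hypothetical knowledge model: a product P contains a component W,
# and a component W contains a part L (labels follow the earlier example)
relations = {("P", "W"): "contains", ("W", "L"): "contains"}
extracted = [("diesel engine", "P"), ("speed reducer", "W"), ("crankshaft", "L")]
print(map_triples(extracted, relations))
```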
Step S8: the extracted data is supplemented to the text data set. After the entity extraction of the vertical field data is completed, the extracted data is used as a new data set, and when the number of the data sets reaches the minimum requirement of model training, the data sets are divided into a training set and a verification set again.
Step S9: and (5) storing the pre-training model. And (3) saving the named entity recognition model trained in the step (S6) as a pre-training model, training and verifying by using the data set generated in the step (S8) to obtain a new named entity recognition model, and returning to the step (S7) after the completion.
Step S10: and newly obtaining manual annotation data supplement. When a new manual labeling data set exists, the data set is saved similarly to the step S8, and after the number of the data sets reaches the minimum requirement of model training, the data set is divided into a training set and a verification set, and the step S9 is returned.
Embodiment 2
In order to execute the method corresponding to the above embodiment to achieve the corresponding functions and technical effects, an automatic knowledge graph construction system is provided below. As shown in fig. 7, the system includes:
the knowledge modeling module 701 is configured to perform knowledge modeling on a target vertical domain to obtain a knowledge model of the target vertical domain; the knowledge model comprises entity types and entity relations; the entity relationship characterizes a relationship between entities.
A data set dividing module 702, configured to obtain an initial data set, and divide the initial data set into large sample data and small sample data; the large sample data are text data with the character number in the initial data set being larger than a first set value; the small sample data are text data with the character number in the initial data set being larger than a third set value and smaller than or equal to the second set value; the first set value is larger than the second set value, and the second set value is larger than the third set value.
The small sample data processing module 703 is configured to perform data labeling on the small sample data, obtain small sample data with completed labeling, and determine an entity dictionary according to the small sample data with completed labeling.
The large sample data processing module 704 is configured to perform text enhancement on the large sample data according to the entity dictionary, so as to obtain text-enhanced large sample data.
A text data set construction module 705 for constructing a text data set from the noted small sample data and the text enhanced large sample data.
A model training module 706, configured to train a named entity recognition model according to the text data set; the named entity recognition model is a BERT-BiLSTM-CRF model.
And the triplet instance generating module 707 is configured to obtain target text data, perform entity extraction on the target text data by using a trained named entity recognition model, and perform triplet mapping on an extraction result according to the knowledge model to obtain a triplet instance.
A knowledge graph construction module 708, configured to construct a knowledge graph of the target vertical domain according to the triplet instance.
Embodiment 3
The embodiment of the invention also provides electronic equipment, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for running the computer program to enable the electronic equipment to execute the knowledge graph automatic construction method in the first embodiment. The electronic device may be a server.
In addition, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the knowledge graph automatic construction method in the first embodiment.
Aiming at problems such as the insufficient amount of labeled data in vertical domains, the difficulty of training automatic extraction models for unstructured knowledge graph information, and the difficulty of updating such models as the data change, the invention provides an automatic knowledge graph construction method based on text enhancement and transfer learning, which has the following advantages over the prior art:
1. The method expands the data set by text enhancement: the text-enhanced large sample data are derived from the manually labeled small sample data, so a sufficient amount of data can be obtained to train the named entity recognition model, improving the preparation efficiency of the training set as well as the precision and robustness of the model.
2. The invention builds the named entity recognition model BERT-BiLSTM-CRF on a pre-trained model, and a trained model can in turn serve as the pre-trained model for training a new one, thereby realizing dynamic updating of the named entity recognition model.
In summary, the invention provides an automatic knowledge graph construction method based on text enhancement and pre-trained models. Addressing the difficulties of complex vertical-domain knowledge, scarce data resources and insufficiently labeled data sets, it is suited to knowledge graph construction in vertical domains, and the named entity recognition model can be updated dynamically as the domain data set changes, adapting to evolving data resources and knowledge requirements.
In this specification, the embodiments are described in a progressive manner, each focusing on its differences from the others; for identical or similar parts, the embodiments may be referred to one another. Since the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The principles and embodiments of the present invention are described herein with reference to specific examples; the description is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may make changes to the specific embodiments and the scope of application in light of the idea of the present invention. In view of the foregoing, this description should not be construed as limiting the invention.
Claims (10)
1. The automatic knowledge graph construction method is characterized by comprising the following steps of:
carrying out knowledge modeling on the target vertical field to obtain a knowledge model of the target vertical field; the knowledge model comprises entity types and entity relations; the entity relationship characterizes the relationship between entities;
acquiring an initial data set, and dividing the initial data set into large sample data and small sample data; the large sample data are text data with the character number in the initial data set being larger than a first set value; the small sample data are text data with the character number in the initial data set being larger than a third set value and smaller than or equal to the second set value; the first set value is larger than the second set value, and the second set value is larger than the third set value;
carrying out data annotation on the small sample data to obtain small sample data with the completed annotation, and determining an entity dictionary according to the small sample data with the completed annotation;
performing text enhancement on the large sample data according to the entity dictionary to obtain text enhanced large sample data;
constructing a text data set according to the marked small sample data and the text enhanced large sample data;
training a named entity recognition model according to the text data set; the named entity recognition model is a BERT-BiLSTM-CRF model;
acquiring target text data, performing entity extraction on the target text data by using a trained named entity recognition model, and performing triplet mapping on an extraction result according to the knowledge model to obtain a triplet instance;
and constructing a knowledge graph of the target vertical field according to the triplet examples.
2. The automatic knowledge graph construction method according to claim 1, wherein the small sample data is subjected to data labeling, small sample data with completed labeling is obtained, and an entity dictionary is determined according to the small sample data with completed labeling, specifically comprising:
performing data annotation on the small sample data by using a BIO annotation method to obtain the small sample data after the annotation is completed; wherein B represents the initial character of the entity, I represents the middle and end characters of the entity, and O represents the non-entity character;
extracting entities from the small sample data with the marked, and obtaining an entity dictionary; the entity dictionary includes entity names and corresponding entity types in the small sample data.
3. The automatic knowledge graph construction method according to claim 1, wherein text enhancement is performed on the large sample data according to the entity dictionary to obtain text enhanced large sample data, specifically comprising:
performing word segmentation and stop word removal processing on the large sample data to obtain preprocessed large sample data;
carrying out semantic similarity calculation on the preprocessed large sample data and each entity name in the entity dictionary to obtain a similarity value corresponding to the preprocessed large sample data;
and carrying out data annotation on the preprocessed large sample data with the similarity value larger than the set similarity threshold value to obtain text enhanced large sample data.
4. The automatic knowledge graph construction method according to claim 1, wherein training a named entity recognition model according to the text data set specifically comprises:
dividing the text data set into a plurality of sample subsets with equal sizes;
and training and verifying the named entity recognition model according to the sample subset by adopting a K-fold cross verification method to obtain a trained named entity recognition model.
5. The automatic knowledge graph construction method according to claim 1, wherein the entity extraction is performed on the target text data by using a trained named entity recognition model, and the extraction result is subjected to triplet mapping according to the knowledge model, so as to obtain a triplet instance, and the method specifically comprises:
performing entity extraction on the target text data sentence by using a trained named entity recognition model to obtain an extraction result; the extraction result comprises an entity name and a corresponding entity type in the target text data;
performing triplet mapping on the entity names in the extraction result according to the corresponding entity types and the entity relations in the knowledge model to obtain a triplet instance; the triplet instance includes a start entity name, an end entity name, and a relationship between the start entity and the end entity.
6. The automatic knowledge-graph construction method according to claim 1, further comprising:
storing the trained named entity recognition model as a pre-training model;
taking the target text data and the corresponding extraction result as a first expansion data set;
and training the pre-training model according to the first extended data set when the data volume of the first extended data set is larger than the set number so as to update the trained named entity recognition model.
7. The automatic knowledge-graph construction method according to claim 6, further comprising:
when a new manual annotation data set exists, the new manual annotation data set is used as a second expansion data set;
and training the pre-training model according to the second extended data set when the data volume of the second extended data set is larger than the set number so as to update the trained named entity recognition model.
8. The automatic knowledge graph construction system is characterized by comprising:
the knowledge modeling module is used for carrying out knowledge modeling on the target vertical field to obtain a knowledge model of the target vertical field; the knowledge model comprises entity types and entity relations; the entity relationship characterizes the relationship between entities;
the data set dividing module is used for acquiring an initial data set and dividing the initial data set into large sample data and small sample data; the large sample data are text data with the character number in the initial data set being larger than a first set value; the small sample data are text data with the character number in the initial data set being larger than a third set value and smaller than or equal to the second set value; the first set value is larger than the second set value, and the second set value is larger than the third set value;
the small sample data processing module is used for carrying out data annotation on the small sample data to obtain small sample data with the marked small sample data, and determining an entity dictionary according to the small sample data with the marked small sample data;
the large sample data processing module is used for carrying out text enhancement on the large sample data according to the entity dictionary to obtain text enhanced large sample data;
the text data set construction module is used for constructing a text data set according to the small sample data with the marked and the large sample data with the enhanced text;
the model training module is used for training a named entity recognition model according to the text data set; the named entity recognition model is a BERT-BiLSTM-CRF model;
the triplet instance generating module is used for acquiring target text data, extracting the entity of the target text data by using a trained named entity recognition model, and performing triplet mapping on an extraction result according to the knowledge model to obtain a triplet instance;
and the knowledge graph construction module is used for constructing a knowledge graph of the target vertical field according to the triplet examples.
9. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to execute the knowledge-graph automatic construction method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the knowledge-graph automatic construction method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310571923.1A CN116541535A (en) | 2023-05-19 | 2023-05-19 | Automatic knowledge graph construction method, system, equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310571923.1A CN116541535A (en) | 2023-05-19 | 2023-05-19 | Automatic knowledge graph construction method, system, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116541535A true CN116541535A (en) | 2023-08-04 |
Family
ID=87457571
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310571923.1A Pending CN116541535A (en) | 2023-05-19 | 2023-05-19 | Automatic knowledge graph construction method, system, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116541535A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117196027A (en) * | 2023-11-07 | 2023-12-08 | 北京航天晨信科技有限责任公司 | Training sample generation method and device based on knowledge graph |
CN118170891A (en) * | 2024-05-13 | 2024-06-11 | 浙江大学 | Text information extraction method, device, equipment and readable storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200302234A1 (en) * | 2019-03-22 | 2020-09-24 | Capital One Services, Llc | System and method for efficient generation of machine-learning models |
CN111950264A (en) * | 2020-08-05 | 2020-11-17 | 广东工业大学 | Text data enhancement method and knowledge element extraction method |
CN113177124A (en) * | 2021-05-11 | 2021-07-27 | 北京邮电大学 | Vertical domain knowledge graph construction method and system |
CN114266254A (en) * | 2021-12-24 | 2022-04-01 | 上海德拓信息技术股份有限公司 | Text named entity recognition method and system |
CN115238093A (en) * | 2022-07-26 | 2022-10-25 | 上海航空工业(集团)有限公司 | Model training method and device, electronic equipment and storage medium |
WO2023064563A1 (en) * | 2021-10-15 | 2023-04-20 | Adrenalineip | Methods, systems, and apparatuses for processing sports-related data |
CN116070632A (en) * | 2022-11-25 | 2023-05-05 | 金茂云科技服务(北京)有限公司 | Informal text entity tag identification method and device |
- 2023-05-19: CN application CN202310571923.1A filed, published as CN116541535A, status active pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200302234A1 (en) * | 2019-03-22 | 2020-09-24 | Capital One Services, Llc | System and method for efficient generation of machine-learning models |
CN111950264A (en) * | 2020-08-05 | 2020-11-17 | 广东工业大学 | Text data enhancement method and knowledge element extraction method |
CN113177124A (en) * | 2021-05-11 | 2021-07-27 | 北京邮电大学 | Vertical domain knowledge graph construction method and system |
WO2023064563A1 (en) * | 2021-10-15 | 2023-04-20 | Adrenalineip | Methods, systems, and apparatuses for processing sports-related data |
CN114266254A (en) * | 2021-12-24 | 2022-04-01 | 上海德拓信息技术股份有限公司 | Text named entity recognition method and system |
CN115238093A (en) * | 2022-07-26 | 2022-10-25 | 上海航空工业(集团)有限公司 | Model training method and device, electronic equipment and storage medium |
CN116070632A (en) * | 2022-11-25 | 2023-05-05 | 金茂云科技服务(北京)有限公司 | Informal text entity tag identification method and device |
Non-Patent Citations (4)
Title |
---|
KAZUAKI KASHIHARA et al.: "Human-Machine Interaction for Improved Cybersecurity Named Entity Recognition Considering Semantic Similarity", SPRINGER LINK, vol. 1251, 25 August 2020 (2020-08-25), pages 347 - 361 *
LIU JIAXI: "Case element extraction model based on small-scale annotation", China Masters' Theses Full-text Database, Social Sciences I; Information Science and Technology, no. 8, 15 August 2022 (2022-08-15), pages 120 - 187 *
QIAN LINGFEI; CUI XIAOLEI: "Research on a domain knowledge graph construction method based on data augmentation", Journal of Modern Information, vol. 42, no. 3, 28 February 2022 (2022-02-28), pages 31 - 39 *
HUANG CHEN: "Medical named entity recognition method based on a small amount of labeled data", China Masters' Theses Full-text Database, Medicine and Health Sciences; Information Science and Technology, no. 1, 15 January 2023 (2023-01-15), pages 054 - 109 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117196027A (en) * | 2023-11-07 | 2023-12-08 | 北京航天晨信科技有限责任公司 | Training sample generation method and device based on knowledge graph |
CN117196027B (en) * | 2023-11-07 | 2024-02-02 | 北京航天晨信科技有限责任公司 | Training sample generation method and device based on knowledge graph |
CN118170891A (en) * | 2024-05-13 | 2024-06-11 | 浙江大学 | Text information extraction method, device, equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106776711B (en) | Chinese medical knowledge map construction method based on deep learning | |
CN107180045B (en) | Method for extracting geographic entity relation contained in internet text | |
CN111914096A (en) | Public transport passenger satisfaction evaluation method and system based on public opinion knowledge graph | |
CN116541535A (en) | Automatic knowledge graph construction method, system, equipment and medium | |
CN112800170A (en) | Question matching method and device and question reply method and device | |
CN109960786A (en) | Chinese Measurement of word similarity based on convergence strategy | |
CN113515632B (en) | Text classification method based on graph path knowledge extraction | |
CN114238653B (en) | Method for constructing programming education knowledge graph, completing and intelligently asking and answering | |
CN113360675A (en) | Knowledge graph specific relation completion method based on Internet open world | |
CN111368066B (en) | Method, apparatus and computer readable storage medium for obtaining dialogue abstract | |
CN103559193A (en) | Topic modeling method based on selected cell | |
CN116541472B (en) | Knowledge graph construction method in medical field | |
CN116756347B (en) | Semantic information retrieval method based on big data | |
CN114722833B (en) | Semantic classification method and device | |
CN113934835B (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
CN115905554A (en) | Chinese academic knowledge graph construction method based on multidisciplinary classification | |
CN107622047B (en) | Design decision knowledge extraction and expression method | |
CN113486143A (en) | User portrait generation method based on multi-level text representation and model fusion | |
CN112711944A (en) | Word segmentation method and system and word segmentation device generation method and system | |
CN117113094A (en) | Semantic progressive fusion-based long text similarity calculation method and device | |
CN115033706A (en) | Method for automatically complementing and updating knowledge graph | |
CN113128210B (en) | Webpage form information analysis method based on synonym discovery | |
CN112613318B (en) | Entity name normalization system, method thereof and computer readable medium | |
CN114969001A (en) | Database metadata field matching method, device, equipment and medium | |
CN112632284A (en) | Information extraction method and system for unlabeled text data set |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |