CN113130025A - Entity relationship extraction method, terminal equipment and computer readable storage medium

Info

Publication number: CN113130025A
Application number: CN202010047654.5A
Authority: CN
Inventors: 唐琎; 覃若彬; 高琰; 王艳东
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2020-01-16
Filing date: 2020-01-16
Publication date: 2021-07-16
Anticipated expiration: 2040-01-16
Also published as: CN113130025B

Abstract

The invention discloses an entity relationship extraction method, a terminal device and a computer-readable storage medium. The method comprises the following steps: manually extracting, from an electronic medical record text database, a plurality of binary entity pairs that conform to a preset entity relationship to serve as seed instances; for each seed instance, searching the electronic medical record text database for sentences containing the seed instance and extracting the feature vectors of those sentences; clustering the seed instances based on the feature vectors; for each cluster, generating the corresponding extraction template from the seed instances and the feature vectors of their sentences; extracting candidate instances from the electronic medical record text database using the extraction templates; and calculating the confidence of each candidate instance from the entity relationship between the candidate instances and the extraction templates, and deciding according to that confidence whether to use the candidate instance as a new seed instance in the next iteration. The method can greatly improve the accuracy of entity relationship extraction from electronic medical records.

Description

Entity relationship extraction method, terminal equipment and computer readable storage medium

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a semi-supervised method for extracting entity relationships from medical electronic medical records, a terminal device and a computer-readable storage medium.

Background

In an age of increasing informatization and intelligence, medical and health services are likewise becoming information-based and intelligent, and electronic medical records play an increasingly important role in the medical and health field. Medical records are the records that medical personnel make of medical activities such as examination, diagnosis and treatment concerning the occurrence, development and outcome of a patient's disease. They are also the patient's medical health file, written in a specified format and to specified requirements by summarizing, organizing and comprehensively analyzing the collected data. Traditional paper medical records suffer from scattered storage, difficult retrieval, easy loss and hard-to-read handwriting, which makes them difficult to manage and use by modern means; electronic medical records are superior to paper records in content, availability and other respects. In recent years electronic medical records have come into ever wider use and are increasingly well understood. Effectively mining the large amount of clinical information they hold about patients, including numbers, text, tables, figures, images and other medical knowledge, and making use of that professional knowledge, is important to the development of the medical and health care industry.

Natural language processing methods are the main means of mining knowledge from medical texts, and the information extraction task mainly comprises named entity recognition (NER) and relation extraction (RE). In medical informatics, this task serves Clinical Decision Support (CDS) research for medical professionals. The present method is mainly directed at the relation extraction task.

Relation extraction is the natural language processing task of extracting named relationships between entities, that is, extracting the semantic relationships between the entities in a sentence that were labeled during entity recognition. Machine-learning-based relation extraction techniques are classified, according to how far the training data set depends on manual labeling during extraction, into supervised relation extraction, semi-supervised relation extraction, unsupervised relation extraction and open entity relation extraction.

1. Supervised relation extraction: supervised relation extraction is in essence a classification problem. It requires a large labeled training data set, from which machine learning identifies and classifies the entity relationship types in a text corpus. The feature-vector-based method extracts morphological, syntactic and relation-pattern information from the sentences of a text corpus, and quantizes and encodes the useful information; feature vectors and feature combinations can then be constructed, and an entity relationship extraction model (for example, an SVM or Winnow classifier) can be built by machine learning. The amount of manually annotated corpus required is the greatest weakness of supervised relation extraction, which makes the method unsuitable for processing massive corpora.

2. Weakly supervised relation extraction: weakly supervised relation extraction requires only a small annotated corpus and uses representative relation seeds as samples. The seeds from the labeled training data set are applied to a large-scale corpus, and new extraction patterns are extracted continually by an iterative method. The most widely used methods are bootstrapping, label propagation and active learning. Bootstrapping expands the seed set over multiple rounds starting from a limited seed sample, and obtains training instances through multiple iterations. Two representative bootstrapping systems are DIPRE and Snowball. The method places high demands on the initial relation seeds, each domain needs high-quality relation seeds, and studies show that the method has a low recall rate and poor portability.

3. Unsupervised relation extraction: unsupervised relation extraction requires neither a manually annotated corpus nor predefined entity relationships, and the automatic extraction of semantic relationships depends mainly on clustering the corpus. The method is highly portable across domains and can be used for large-scale information extraction. However, experimental research to date has not obtained ideal extraction results, and neither precision nor recall has improved significantly.

Semi-supervised relation extraction can make use of a large amount of unlabeled data and needs only a small number of manually annotated entity relationships. It can be used to extract entity relationships where an annotated corpus is lacking, and has shown advantages in relation extraction from electronic medical records.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a method, a terminal device and a computer-readable storage medium for extracting an entity relationship of a medical electronic medical record based on semi-supervision, which can greatly improve the accuracy of extracting the entity relationship of the electronic medical record.

In order to achieve the technical purpose, the invention adopts the following technical scheme:

an entity relationship extraction method, comprising the following steps:

step 1, manually extracting, from an electronic medical record text database, a plurality of binary entity pairs that conform to a preset entity relationship to serve as seed instances;

step 2, for each seed instance, searching the electronic medical record text database for sentences containing the seed instance, and extracting the feature vectors of the sentences;

step 3, clustering the seed instances based on the feature vectors; for each cluster, generating the extraction template corresponding to the cluster from the seed instances and the feature vectors of the sentences corresponding to the seed instances;

step 4, extracting candidate instances from the electronic medical record text database using the extraction templates obtained in step 3;

each extraction template can extract a group of several candidate instances, and several extraction templates can extract the same candidate instance;

step 5, adding new seed instances according to the confidence of the candidate instances;

step 5.1, for each extraction template obtained in step 3, calculating the confidence of the extraction template using the entity relationship between the template and the candidate instances it extracted;

step 5.2, for each candidate instance obtained in step 4, calculating the confidence of the candidate instance using the confidences of all the extraction templates that can extract it;

step 5.3, taking the candidate instances whose confidence is greater than the confidence threshold as new seed instances, and returning to step 2 to execute the next iteration until the preset number of iterations is reached.

In a more preferred technical solution, the method for calculating the confidence of each extracted template in step 5.1 is as follows:

the candidate instances extracted by the template are counted: if both entities of a candidate instance are the same as the 2 entities in the extraction template, the candidate instance is a positive extraction; if only 1 entity of the candidate instance is the same as an entity in the extraction template, the candidate instance is a negative extraction; if both entities of the candidate instance differ from the 2 entities in the extraction template, the candidate instance is an unknown extraction; then, from the numbers of positive, negative and unknown extractions, the confidence of the extraction template is calculated according to the following formula:

in the formula, Conf_ρ(P) represents the confidence of the template P; P, N and U respectively represent the numbers of positive, negative and unknown extractions corresponding to the template P; and W_ngt and W_unk are the weights of negative and unknown extractions, respectively;

the method for calculating the confidence of the candidate instance in step 5.2 is as follows:

in the formula, Conf_ι(i) is the confidence of candidate instance i; ξ is the set of all extraction templates that extracted candidate instance i; ξ_j is the extraction template with index j in the set ξ; C_i is the sentence containing candidate instance i; and sim(C_i, ξ_j) denotes the similarity between sentence C_i and extraction template ξ_j.
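The two confidence calculations can be sketched in Python as follows. The formulas themselves do not survive in this text, so the sketch assumes the forms used in Snowball-style bootstrapping systems, which are consistent with the symbols defined here: Conf_ρ(P) = P / (P + W_ngt·N + W_unk·U) for a template, and Conf_ι(i) = 1 − Π_j (1 − Conf_ρ(ξ_j)·sim(C_i, ξ_j)) for a candidate instance. The function names, the default weights, and the reading of "the entities in the extraction template" as the seed pairs behind the template are illustrative assumptions, not details confirmed by the source.

```python
from math import prod

def classify_extraction(candidate, template_seeds):
    """Label a candidate <body part, description> pair against the seed
    pairs behind a template (assumed interpretation of step 5.1)."""
    if candidate in template_seeds:
        return "positive"   # both entities match a seed pair
    body, desc = candidate
    if any(body == b or desc == d for b, d in template_seeds):
        return "negative"   # shares exactly one entity with a seed pair
    return "unknown"        # shares no entity with any seed pair

def template_confidence(candidates, template_seeds, w_ngt=2.0, w_unk=0.1):
    """Assumed form: P / (P + W_ngt * N + W_unk * U); weights are arbitrary."""
    labels = [classify_extraction(c, template_seeds) for c in candidates]
    p = labels.count("positive")
    n = labels.count("negative")
    u = labels.count("unknown")
    denom = p + w_ngt * n + w_unk * u
    return p / denom if denom else 0.0

def instance_confidence(template_scores):
    """template_scores: iterable of (Conf_rho(xi_j), sim(C_i, xi_j)) pairs.
    Assumed form: 1 - prod(1 - Conf_rho(xi_j) * sim(C_i, xi_j))."""
    return 1.0 - prod(1.0 - conf * sim for conf, sim in template_scores)

seeds = {("waist", "pain"), ("knee", "swelling")}
extracted = [("waist", "pain"), ("waist", "swelling"), ("head", "dizziness")]
conf_p = template_confidence(extracted, seeds)          # 1 / (1 + 2.0 + 0.1)
conf_i = instance_confidence([(0.5, 0.8), (0.5, 0.6)])  # 1 - 0.6 * 0.7
print(round(conf_p, 3), round(conf_i, 2))
```

Under the assumed instance formula, a candidate supported by several confident, similar templates scores higher than one supported by a single template, which is the behaviour step 5.2 calls for.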

In a more preferred technical solution, the candidate instances are all binary entity pairs that conform to the preset entity relationship and whose similarity to an extraction template is greater than a similarity threshold.

In a more preferred technical solution, the feature vector of each sentence is extracted as follows: dependency parsing is performed on the sentence, all dependency features of the binary entity pair in the sentence are extracted, the word vector of each dependency feature is obtained by the skip-gram method, and the average of all the word vectors is taken as the feature vector of the sentence.

In a more preferred technical solution, a single-pass algorithm is used to cluster the sentences.

In a more preferred technical solution, the binary entity pair that conforms to the preset entity relationship is <body part, medical description>.

In a more preferred technical solution, the electronic medical record text database is a txt document that contains a plurality of medical electronic medical record texts, has been split into sentences, and has had entity labeling applied to each sentence.

In a more preferred technical solution, the number of iterations is preset to 5.

The present invention also provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements any one of the methods described above when executing the computer program.

The invention also provides a computer-readable storage medium, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of the above.

Advantageous effects

Compared with the prior art, the invention has the beneficial effects that:

the method first uses a small number of seed instances to generate extraction templates, then extracts candidate instances from the electronic medical record text database according to the extraction templates, and finally calculates the confidence of each candidate instance from the entity relationship between the candidate instances and the extraction templates, deciding according to that confidence whether the candidate instance is used as a new seed instance in the next iteration. Semantic drift is thereby controlled: candidate instances with a low degree of correlation with the extraction templates are prevented from entering the next iteration as seed instances, where they would keep generating more relationship instances unrelated to the seed instances. The accuracy of entity relationship extraction from electronic medical records can therefore be greatly improved. In addition, only a small number of seed instances need to be provided to process a large amount of unlabeled data, the effect is good, and the development of the medical and health care industry can be better assisted.

Drawings

FIG. 1 is a flow chart of a method according to an embodiment of the present invention.

Detailed Description

The following describes embodiments of the present invention in detail, which are developed based on the technical solutions of the present invention, and give detailed implementation manners and specific operation procedures to further explain the technical solutions of the present invention.

This embodiment provides a semi-supervised method for extracting entity relationships from medical electronic medical records which, as shown in FIG. 1, comprises the following steps:

Step 1, data preprocessing:

A plurality of medical electronic medical record texts for training are acquired from a hospital and merged into one txt document; the document is then split into sentences; next, the sentences in the document are entity-labeled using the BiLSTM + CRF technique, attending to two entity types, BODYPART (body part) and DESCRIPTION (medical description), to obtain a sentence document; finally, a small number of binary entity pairs whose entity relationship is <body part, medical description>, such as <waist, pain>, are manually selected from the sentence document as seed instances.

Step 2, searching seed matching: for each seed instance, a sentence comprising the seed instance is searched in the text database of the electronic medical record, and a feature vector of the sentence is extracted.

Specifically, the sentence document is scanned; if both entities of a seed instance appear in the same sentence, dependency parsing is performed on that sentence S_i = {a_i1, a_i2, a_i3, ..., a_in}, and the dependency features a_iq shared by the two entities, that is, all dependency features of the binary entity pair in the sentence, are extracted. Word embedding is then performed with the skip-gram method to obtain the word vector corresponding to each dependency feature a_iq. Finally, the average of all these word vectors is taken as the feature vector V_i of sentence S_i.
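The averaging at the end of step 2 can be sketched as below. This is a minimal illustration that assumes the dependency features have already been extracted by a parser and that a trained skip-gram model is available as a plain dictionary from feature to vector; the function name, the toy three-dimensional vectors, and the decision to skip features without a vector are all assumptions made for the example.

```python
def sentence_feature_vector(dependency_features, embeddings):
    """Average the word vectors of a sentence's dependency features.

    dependency_features: list of feature strings a_iq for the entity pair.
    embeddings: dict mapping a feature to its word vector (list of floats),
                standing in for a trained skip-gram model.
    """
    vectors = [embeddings[f] for f in dependency_features if f in embeddings]
    if not vectors:
        raise ValueError("no dependency feature has a word vector")
    # Component-wise mean across all available word vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy vectors; real ones would come from skip-gram training on the corpus.
toy_embeddings = {
    "waist": [1.0, 0.0, 0.0],
    "appears": [0.0, 1.0, 0.0],
    "pain": [0.0, 0.0, 1.0],
}
v_i = sentence_feature_vector(["waist", "appears", "pain"], toy_embeddings)
print(v_i)
```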

Step 3, generating an extraction template: clustering the seed examples based on the feature vectors and by adopting a single-pass algorithm; and for each cluster, generating an extraction template corresponding to the cluster according to the seed example and the feature vector of the sentence corresponding to the seed example.

Specifically, all the seed instances are obtained, and the 1st seed instance is assigned to a new empty cluster. Each remaining seed instance is then traversed: the similarity between the seed instance and each cluster is calculated based on the feature vectors, and the seed instance is assigned to a cluster whose similarity is greater than or equal to the similarity threshold τ_sim; if the similarity between the seed instance and every cluster is below τ_sim, a new cluster is created and the seed instance is assigned to it. In the end each cluster contains a group of several seed instances. Wrong clusters are removed by manual supervision, and each remaining cluster Cl_j generates an extraction template P_j by averaging the feature vectors of its seed instances, the average serving as the feature vector of template P_j. In this embodiment, a cluster is considered wrong if the entity relationship of its seed instances does not conform to the preset entity relationship <body part, medical description>.

The similarity function between seed instance i_n and cluster Cl_j is sim(i_n, Cl_j). It is obtained by calculating the similarity score between seed instance i_n and every seed instance in cluster Cl_j: if more than half of these similarity scores are greater than the similarity threshold, the maximum similarity score is taken as the similarity between i_n and Cl_j; otherwise the similarity between i_n and Cl_j is set to 0. The similarity between two seed instances is calculated by the following formula:

sim(i_n, i_j) = sim(S_n, S_j) = cos(V_n, V_j);

where i_n and i_j denote two different seed instances, S_n and S_j denote the sentences in which i_n and i_j are located, V_n and V_j denote the feature vectors of sentences S_n and S_j, and cos(V_n, V_j) denotes the cosine similarity between V_n and V_j.
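Step 3 as a whole can be sketched in Python under a few stated assumptions: seed instances are represented only by their sentence feature vectors, a seed that qualifies for several clusters goes to the most similar one (the text does not say which cluster wins), and the manual removal of wrong clusters is left out. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity cos(V_n, V_j) between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_similarity(vec, cluster, tau_sim):
    """sim(i_n, Cl_j): the maximum score if more than half of the members
    score above tau_sim, otherwise 0."""
    scores = [cosine(vec, member) for member in cluster]
    above = sum(score > tau_sim for score in scores)
    return max(scores) if above > len(scores) / 2 else 0.0

def single_pass(vectors, tau_sim):
    """Single-pass clustering: each vector joins an existing cluster or
    starts a new one, in one sweep over the data."""
    clusters = []
    for vec in vectors:
        best, best_sim = None, 0.0
        for cluster in clusters:
            s = cluster_similarity(vec, cluster, tau_sim)
            if s > best_sim:          # assumption: prefer the closest cluster
                best, best_sim = cluster, s
        if best is not None and best_sim >= tau_sim:
            best.append(vec)
        else:
            clusters.append([vec])    # no cluster is similar enough
    return clusters

def template_vector(cluster):
    """Feature vector of template P_j: mean of the cluster's vectors."""
    return [sum(column) / len(cluster) for column in zip(*cluster)]

seed_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
clusters = single_pass(seed_vectors, tau_sim=0.8)
templates = [template_vector(c) for c in clusters]
print([len(c) for c in clusters])  # [2, 1]
```

Because each vector is visited once, the result depends on the order of the seed instances; that is inherent to single-pass clustering rather than a property of this sketch.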

Step 4, searching for candidate instances: using the extraction templates obtained in step 3, candidate instances are extracted from the electronic medical record text database, namely all binary entity pairs that conform to the preset entity relationship and whose similarity to an extraction template is greater than the similarity threshold;

each extraction template can extract a group of a plurality of candidate examples, and a plurality of extraction templates can extract the same candidate example.

Specifically, the method comprises the following steps:

step 4.1, scanning sentence documents, and collecting all sentences containing binary entity pairs which accord with the preset entity relationship;

step 4.2, traversing each sentence obtained in step 4.1: performing dependency syntactic analysis and other steps on the sentence according to the same method in the step 2 to extract a feature vector of the sentence; then, the similarity of the sentence and each extraction template is calculated based on the feature vectors: if the similarity between the sentence and any one of the extraction templates is greater than the similarity threshold, taking the binary entity pair in the sentence as a candidate example, and taking all the extraction templates with the similarity greater than the similarity threshold as the extraction templates of the candidate example;

step 4.3, after step 4.2 is completed, each candidate instance may correspond to a group of several extraction templates, and a group of several candidate instances may correspond to the same extraction template, that is: each extraction template can extract a group of a plurality of candidate examples, and a plurality of extraction templates can extract the same candidate example.
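Steps 4.1 to 4.3 can be sketched as follows. The sketch assumes each collected sentence is already reduced to its entity pair and feature vector, and that templates are held in a dictionary keyed by a hypothetical template name; the many-to-many mapping between candidates and templates described in step 4.3 falls out of the returned structure.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def extract_candidates(sentences, templates, tau_sim):
    """Match sentences against extraction templates.

    sentences: list of (entity_pair, feature_vector) for sentences that
               contain a pair of the preset relationship (step 4.1).
    templates: dict mapping a template name to its feature vector.
    Returns a dict mapping each candidate pair to the list of
    (template_name, similarity) entries that extracted it (step 4.2).
    """
    matches = defaultdict(list)
    for pair, vec in sentences:
        for name, template_vec in templates.items():
            s = cosine(vec, template_vec)
            if s > tau_sim:
                matches[pair].append((name, s))
    return dict(matches)

templates = {"P1": [1.0, 0.0], "P2": [0.8, 0.6]}
sentences = [
    (("knee", "swelling"), [1.0, 0.2]),
    (("head", "normal"), [0.0, 1.0]),
]
candidates = extract_candidates(sentences, templates, tau_sim=0.7)
print(sorted(candidates))
```

The stored similarities are the sim(C_i, ξ_j) values that the candidate-confidence calculation in step 5.2 needs, so keeping them here avoids recomputing them.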

Step 5, controlling semantic drift to add a new seed instance according to the confidence of the candidate instance;

step 5.1, for each extraction template obtained in step 3, the confidence of the extraction template is calculated using the entity relationship between the template and the candidate instances it extracted, that is, from the counts of positive, negative and unknown extractions defined above;

step 5.2, for each candidate example obtained in step 4, the confidence degrees of all the extracted templates which can extract the candidate example are used, and the confidence degree of the candidate example is calculated according to the following formula:

in the formula, Conf_ι(i) is the confidence of candidate instance i; ξ is the set of all extraction templates that extracted candidate instance i; ξ_j is the extraction template with index j in the set ξ; C_i is the sentence containing candidate instance i; and sim(C_i, ξ_j) denotes the similarity between sentence C_i and extraction template ξ_j;

step 5.3, the candidate instances whose confidence is greater than the confidence threshold τ_t are taken as new seed instances, and the method returns to step 2 to execute the next iteration; the process ends when the preset number of iterations is reached. In this embodiment, the preset number of iterations is 5.
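The outer loop of steps 2 to 5.3 can be sketched as a small driver. It assumes that new seed instances are added to the existing seed set rather than replacing it, which the text suggests but does not state outright, and it takes the per-iteration work (steps 2 to 5.2) as a caller-supplied function; `bootstrap`, `run_iteration` and the toy scores are hypothetical names and values.

```python
def bootstrap(seeds, run_iteration, tau_t, n_iterations=5):
    """Iterate seed expansion a fixed number of times.

    run_iteration(seeds) must return a dict mapping each candidate pair
    to its confidence Conf_iota(i), covering steps 2 to 5.2.
    """
    seeds = set(seeds)
    for _ in range(n_iterations):
        scored = run_iteration(seeds)
        # Step 5.3: keep only candidates above the confidence threshold.
        new_seeds = {pair for pair, conf in scored.items() if conf > tau_t}
        seeds |= new_seeds  # assumption: seeds accumulate across iterations
    return seeds

def toy_iteration(current_seeds):
    """Stand-in for steps 2 to 5.2 that returns fixed, made-up scores."""
    return {("knee", "swelling"): 0.9, ("head", "normal"): 0.2}

final_seeds = bootstrap({("waist", "pain")}, toy_iteration, tau_t=0.5)
print(sorted(final_seeds))
```

Filtering on τ_t before a candidate re-enters the loop is what limits semantic drift: a low-confidence pair never gets the chance to seed new templates.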

The present invention also provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described in the above method embodiments when executing the computer program.

The invention also provides a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method described in the above-mentioned method embodiments.

The entity relationship extraction method, terminal device and computer-readable storage medium of the embodiments of the invention first use a small number of seed instances to generate extraction templates, then extract candidate instances from the electronic medical record text database according to the extraction templates, and then calculate the confidence of each candidate instance from the entity relationship between the candidate instances and the extraction templates, deciding according to that confidence whether the candidate instance is used as a new seed instance in the next iteration. Semantic drift is thereby controlled: candidate instances with a low degree of correlation with the extraction templates are prevented from entering the next iteration as seed instances, where they would keep generating more instances unrelated to the seed instances. The accuracy of entity relationship extraction from electronic medical records is thus greatly improved. In addition, only a small number of seed instances need to be provided to process a large amount of unlabeled data, the effect is good, and the development of the medical and health care industry can be better assisted.

The above embodiments are preferred embodiments of the present application, and those skilled in the art can make various changes or modifications without departing from the general concept of the present application, and such changes or modifications should fall within the scope of the claims of the present application.

Claims

1. An entity relationship extraction method is characterized by comprising the following steps:

2. The method of claim 1, wherein the confidence level of each extracted template in step 5.1 is calculated by:

in the formula, Conf_ρ(P) represents the confidence of the template P; P, N and U respectively represent the numbers of positive, negative and unknown extractions corresponding to the template P; and W_ngt and W_unk are the weights of negative and unknown extractions, respectively;

in the formula, Conf_ι(i) is the confidence of candidate instance i; ξ is the set of all extraction templates that extracted candidate instance i; ξ_j is the extraction template with index j in the set ξ; C_i is the sentence containing candidate instance i; and sim(C_i, ξ_j) denotes the similarity between sentence C_i and extraction template ξ_j.

3. The method of claim 1, wherein the candidate instances refer to all pairs of binary entities matching a predetermined entity relationship, and the similarity between the pairs of binary entities and the extracted template is greater than a similarity threshold.

4. The method according to claim 1, wherein the specific process of extracting the feature vector of each sentence is as follows: analyzing the sentence according to the dependency syntax, extracting all dependency characteristics of the binary entity pairs in the sentence, extracting a word vector of each dependency characteristic by using a skip-gram method, and taking the average value of all the word vectors as the feature vector of the sentence.

5. The method of claim 1, wherein sentences are clustered using a single-pass algorithm.

6. The method of claim 1, wherein the pair of binary entities that conform to the predetermined entity relationship is < body part, medical description >.

7. The method as claimed in claim 6, wherein the electronic medical record text database is txt document which comprises a plurality of medical electronic medical record text data, is processed by sentence division, and is obtained by entity labeling processing on each sentence.

8. The method of claim 1, wherein the predetermined number of iterations is 5.

9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 8 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.