CN113130025A - Entity relationship extraction method, terminal equipment and computer readable storage medium - Google Patents
- Publication number
- CN113130025A
- Authority
- CN
- China
- Prior art keywords
- candidate
- extraction
- template
- seed
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Public Health (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Pathology (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Medical Treatment And Welfare Office Work (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an entity relationship extraction method, a terminal device and a computer-readable storage medium. The method comprises the following steps: manually extracting, from an electronic medical record text database, a plurality of binary entity pairs that conform to a preset entity relationship to serve as seed instances; for each seed instance, searching the electronic medical record text database for sentences that include the seed instance, and extracting the feature vectors of those sentences; clustering the seed instances based on the feature vectors; for each cluster, generating a corresponding extraction template from the seed instances and the feature vectors of their sentences; extracting candidate instances from the electronic medical record text database by using the extraction templates; and calculating the confidence of each candidate instance according to the entity relationship between the candidate instances and the extraction templates, and determining, according to the confidence, whether to use the candidate instance as a new seed instance for the next iteration. The method can greatly improve the accuracy of entity relationship extraction from electronic medical records.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a semi-supervised method for extracting entity relationships from medical electronic medical records, a terminal device and a computer-readable storage medium.
Background
In an increasingly information-driven and intelligent age, medical and health services are likewise becoming more information-based and intelligent, and electronic medical records are playing an increasingly important role in the medical and health field. Medical records are the records that medical personnel make of activities such as examination, diagnosis and treatment concerning the occurrence, development and outcome of a patient's disease. They are the patient's medical health file, written in a specified format and according to specified requirements after the collected data have been summarized, organized and comprehensively analyzed. Traditional paper medical records suffer from scattered storage, difficult retrieval, easy loss and hard-to-read handwriting, which makes them difficult to manage and use by modern means; electronic medical records are superior to paper records in content, availability and other respects. In recent years, electronic medical records have become more widely used and better understood. How to effectively mine the large amount of clinical information they hold about patients, such as numbers, text, tables, figures and images, and how to make use of this professional knowledge, play an important role in the development of the medical and health care industry.
Natural language processing methods are the main tools for mining knowledge from medical texts, and the information extraction task mainly comprises named entity recognition (NER) and relation extraction (RE). In medical informatics, this task supports Clinical Decision Support (CDS) research for medical professionals. The present method addresses the relation extraction task.
Relation extraction is the natural language processing task of extracting named relationships between entities, that is, the semantic relationships between the entities labeled in a sentence during entity recognition. According to how much the training data set depends on manual labeling, machine-learning-based relation extraction techniques are divided into supervised relation extraction, semi-supervised relation extraction, unsupervised relation extraction and open entity relation extraction.
1. Supervised relation extraction: supervised relation extraction is essentially classification. It requires a large labeled training data set, from which machine learning identifies and classifies the entity relationship types in a text corpus. Feature-vector-based methods extract morphological, syntactic and relation-pattern information from the sentences of a text corpus, and quantize and encode the useful information extracted from the sentences. Feature vectors and feature combinations can then be constructed, and an entity relationship extraction model (e.g., an SVM or Winnow classifier) can be built by machine learning. The amount of manually annotated corpus required is the greatest weakness of supervised relation extraction, which makes it unsuitable for processing massive corpora.
2. Weakly supervised relation extraction: weakly supervised relation extraction requires only a small annotated corpus and uses representative relation seeds. The seeds from the labeled training data set are applied to a large-scale corpus, and new extraction patterns are continuously obtained through iteration. The most widely used methods are bootstrapping, label propagation and active learning. Bootstrapping expands the seed set by repeatedly experimenting on a limited seed sample and obtains training examples through multiple iterations. Two representative bootstrapping systems are DIPRE and Snowball. This approach places high demands on the initial relation seeds, each domain needs high-quality relation seeds, and studies show that it has a low recall rate and poor portability.
3. Unsupervised relationship extraction: unsupervised relationship extraction does not require any manually annotated corpus and does not require predefined entity relationships, and the automatic extraction process of semantic relationships depends mainly on clustering the corpus. The method has strong portability in various fields and can be used for large-scale information extraction. However, the current experimental research has not obtained ideal extraction results, and the accuracy and the recall ratio are not obviously improved.
Semi-supervised relation extraction can exploit a large amount of unlabeled data and requires only a small number of manually annotated entity relationships. It can be used to extract entity relationships for which annotated corpora are lacking, and it has shown advantages in relation extraction from electronic medical records.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method, a terminal device and a computer-readable storage medium for extracting an entity relationship of a medical electronic medical record based on semi-supervision, which can greatly improve the accuracy of extracting the entity relationship of the electronic medical record.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
an entity relationship extraction method comprises the following steps:
step 1, manually extracting a plurality of binary entity pairs which conform to a preset entity relationship from an electronic medical record text database to serve as seed instances;
step 2, for each seed instance, searching the electronic medical record text database for sentences that include the seed instance, and extracting the feature vectors of the sentences;
step 3, clustering the seed instances based on the feature vectors; for each cluster, generating an extraction template corresponding to the cluster according to the seed instances and the feature vectors of the sentences corresponding to the seed instances;
step 4, extracting candidate instances from the electronic medical record text database by using the extraction templates obtained in step 3;
each extraction template can extract a group of several candidate instances, and several extraction templates can extract the same candidate instance;
step 5, adding new seed instances according to the confidence of the candidate instances;
step 5.1, for each extraction template obtained in step 3, calculating the confidence of the extraction template by using the entity relationship between the template and the candidate instances it extracted;
step 5.2, for each candidate instance obtained in step 4, calculating the confidence of the candidate instance by using the confidences of all the extraction templates that can extract it;
step 5.3, taking the candidate instances whose confidence is greater than the confidence threshold as new seed instances, returning to step 2, and executing the next iteration until the preset number of iterations is reached.
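The iteration over steps 2 to 5.3 can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the patented implementation: `bootstrap`, `run_round` and `toy_round` are hypothetical names, and `run_round` stands in for steps 2 to 5.2, returning a confidence for every candidate instance found from the current seeds.

```python
from typing import Callable, Dict, Set, Tuple

EntityPair = Tuple[str, str]  # e.g. ("waist", "pain")


def bootstrap(seeds: Set[EntityPair],
              run_round: Callable[[Set[EntityPair]], Dict[EntityPair, float]],
              tau_t: float,
              iterations: int = 5) -> Set[EntityPair]:
    """Grow the seed set for a preset number of iterations (step 5.3)."""
    seeds = set(seeds)  # copy so the caller's set is not modified
    for _ in range(iterations):
        # Hypothetical callback covering steps 2 to 5.2:
        # candidate instance -> confidence.
        scored = run_round(seeds)
        # Only candidates above the confidence threshold become new seeds.
        seeds |= {pair for pair, conf in scored.items() if conf > tau_t}
    return seeds


def toy_round(current_seeds: Set[EntityPair]) -> Dict[EntityPair, float]:
    """Stand-in for one extraction round with made-up confidences."""
    return {("knee", "swelling"): 0.9, ("head", "dizzy"): 0.3}


result = bootstrap({("waist", "pain")}, toy_round, tau_t=0.5)
```

With the made-up scores above, only the high-confidence pair joins the seed set, which is the mechanism the method relies on to limit semantic drift.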
In a more preferred technical solution, the method for calculating the confidence of each extraction template in step 5.1 is as follows:
the candidate instances extracted by the template are counted: if a candidate instance matches both entities in the extraction template, it is a positive extraction; if it matches only one entity in the extraction template, it is a negative extraction; if it matches neither entity in the extraction template, it is an unknown extraction; then, according to the numbers of positive, negative and unknown extractions, the confidence of the extraction template is calculated according to the following formula:
in the formula, $\mathrm{Conf}_{\rho}(P)$ represents the confidence of template $P$; $P$, $N$ and $U$ respectively represent the numbers of positive, negative and unknown extractions corresponding to template $P$; and $W_{ngt}$ and $W_{unk}$ are the weights of negative and unknown extractions, respectively;
the method for calculating the confidence of the candidate instance in step 5.2 is as follows:
in the formula, $\mathrm{Conf}_{\iota}(i)$ is the confidence of candidate instance $i$, $\xi$ is the set of all extraction templates of candidate instance $i$, $\xi_j$ is the extraction template with index $j$ in the set $\xi$, $C_i$ is the sentence containing candidate instance $i$, and $\mathrm{sim}(C_i, \xi_j)$ represents the similarity between sentence $C_i$ and extraction template $\xi_j$.
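The two formulas themselves are not reproduced in this text. A form that is consistent with the symbol definitions given here, and that is assumed to follow the confidence scoring commonly used in Snowball-style bootstrapping systems, would be the following; it is a reconstruction under that assumption, not the confirmed wording of the patent:

```latex
\mathrm{Conf}_{\rho}(P) = \frac{P}{P + W_{ngt}\cdot N + W_{unk}\cdot U}

\mathrm{Conf}_{\iota}(i) = 1 - \prod_{\xi_j \in \xi}
    \bigl(1 - \mathrm{Conf}_{\rho}(\xi_j)\cdot \mathrm{sim}(C_i, \xi_j)\bigr)
```

Under this reading, a template that produces many negative extractions is penalized through $W_{ngt}$, and a candidate instance gains confidence when several reliable templates each match its sentence closely.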
In a more preferred technical solution, the candidate examples refer to all pairs of binary entities that satisfy a preset entity relationship and have similarity greater than a similarity threshold with the extracted template.
In a more preferred technical scheme, the specific process of extracting the feature vector of each sentence is as follows: analyzing the sentence according to the dependency syntax, extracting all dependency characteristics of the binary entity pairs in the sentence, extracting a word vector of each dependency characteristic by using a skip-gram method, and taking the average value of all the word vectors as the feature vector of the sentence.
In a more preferred technical scheme, a single-pass algorithm is used to cluster sentences.
In a more preferred embodiment, the pair of binary entities that satisfy the predetermined entity relationship is < body part, medical description >.
In a more preferred technical scheme, the electronic medical record text database is a txt document that contains a plurality of medical electronic medical record texts, has been split into sentences, and has had entity labeling applied to each sentence.
In a more preferred technical solution, the number of iterations is preset to 5.
The present invention also provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements any one of the methods described above when executing the computer program.
The invention also provides a computer-readable storage medium, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of the above.
Advantageous effects
Compared with the prior art, the invention has the beneficial effects that:
according to the method, a small number of seed instances are first used to generate extraction templates; candidate instances are then extracted from the electronic medical record text database according to the extraction templates; finally, the confidence of each candidate instance is calculated according to the entity relationship between the candidate instances and the extraction templates, and the confidence determines whether the candidate instance is used as a new seed instance for the next iteration. Semantic drift can thereby be controlled: candidate instances with low relevance to the extraction templates are prevented from entering the next iteration as seed instances, where they would generate ever more relationship instances irrelevant to the seed instances. The accuracy of entity relationship extraction from electronic medical records can therefore be greatly improved. In addition, only a small number of seed instances need to be provided, so a large amount of unlabeled data can be processed with good results, which better assists the development of the medical and health care industry.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in detail, which are developed based on the technical solutions of the present invention, and give detailed implementation manners and specific operation procedures to further explain the technical solutions of the present invention.
The embodiment provides a medical electronic medical record entity relation extraction method based on semi-supervision, which comprises the following steps as shown in fig. 1:
step 1, preprocessing data;
a plurality of medical electronic medical record texts for training are acquired from a hospital, and all the data are merged into one txt document; the document is then split into sentences; next, the sentences in the document are entity-labeled using the BiLSTM+CRF technique, focusing on two entity types, BODYPART (body part) and DESCRIPTION (medical description), to obtain a sentence document; finally, a small number of binary entity pairs with the entity relationship <body part, medical description>, such as <waist, pain>, are manually selected from the sentence document as seed instances.
Step 2, searching seed matching: for each seed instance, a sentence comprising the seed instance is searched in the text database of the electronic medical record, and a feature vector of the sentence is extracted.
Specifically, the sentence document is scanned, and if the two entities of a seed instance appear in the same sentence, dependency syntax analysis is performed on that sentence $S_i = \{a_{i1}, a_{i2}, a_{i3}, \ldots, a_{in}\}$ to extract the dependency features $a_{iq}$ shared by the two entities, that is, all dependency features of the binary entity pair in the sentence; word embedding is then performed with the skip-gram method to obtain the word vector corresponding to each dependency feature $a_{iq}$; finally, the average of all these word vectors is taken as the feature vector $V_i$ of sentence $S_i$.
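The averaging step can be illustrated with a short sketch. It assumes the dependency features have already been extracted as tokens and that trained skip-gram embeddings are available as a plain dictionary; `sentence_feature_vector` and `toy_vectors` are hypothetical names, and the 2-dimensional vectors are made up for illustration.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def sentence_feature_vector(dependency_features: Sequence[str],
                            word_vectors: Dict[str, Vector]) -> Vector:
    """Average the word vectors of a sentence's dependency features."""
    # Features without an embedding are skipped (an assumption of this sketch).
    vectors = [word_vectors[f] for f in dependency_features if f in word_vectors]
    if not vectors:
        raise ValueError("no dependency feature has a word vector")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Toy embeddings standing in for trained skip-gram vectors.
toy_vectors = {"waist": [1.0, 0.0], "pain": [0.0, 1.0], "has": [1.0, 1.0]}
feature = sentence_feature_vector(["waist", "has", "pain"], toy_vectors)
```

Averaging keeps the sentence vector in the same space as the word vectors, so sentences and templates can later be compared directly with cosine similarity.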
Step 3, generating an extraction template: clustering the seed examples based on the feature vectors and by adopting a single-pass algorithm; and for each cluster, generating an extraction template corresponding to the cluster according to the seed example and the feature vector of the sentence corresponding to the seed example.
Specifically, all the seed instances are obtained, and the first seed instance is assigned to a new empty cluster. Each remaining seed instance is then traversed: the similarity between the seed instance and each cluster is calculated based on the feature vectors, and the seed instance is assigned to a cluster whose similarity is greater than or equal to the similarity threshold $\tau_{sim}$; if the similarity between the seed instance and every cluster is below the similarity threshold $\tau_{sim}$, a new cluster is created and the seed instance is assigned to it. In the end, each cluster comprises a group of several seed instances. Wrong clusters are removed by manual supervision, and each remaining cluster $Cl_j$ generates an extraction template $P_j$ by averaging the feature vectors of its seed instances, the averaged vector serving as the feature vector of template $P_j$. In this embodiment, a cluster is considered wrong if the entity relationship of its seed instances does not conform to the preset entity relationship, that is, to the relationship <body part, medical description>.
Wherein, the similarity function between seed instance $i_n$ and cluster $Cl_j$ is $\mathrm{sim}(i_n, Cl_j)$. It is obtained by computing the similarity score between seed instance $i_n$ and each seed instance in cluster $Cl_j$: if more than half of these similarity scores are greater than the similarity threshold, the maximum similarity score is taken as the similarity value between $i_n$ and $Cl_j$; otherwise the similarity value between $i_n$ and $Cl_j$ is set to 0. The similarity between two seed instances is calculated by the following formula:
$\mathrm{sim}(i_n, i_j) = \mathrm{sim}(S_n, S_j) = \cos(V_n, V_j)$;
wherein $i_n$ and $i_j$ represent two different seed instances, $S_n$ and $S_j$ respectively represent the sentences in which seed instances $i_n$ and $i_j$ are located, $V_n$ and $V_j$ respectively represent the feature vectors of sentences $S_n$ and $S_j$, and $\cos(V_n, V_j)$ represents the cosine similarity between feature vectors $V_n$ and $V_j$.
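The single-pass clustering and template generation described in step 3 can be sketched as follows. This is a minimal sketch with hypothetical function names; where the text leaves the choice open, it assumes a seed instance joins the most similar qualifying cluster, and it omits the manual removal of wrong clusters.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_similarity(vec: Vector, cluster: List[Vector], tau_sim: float) -> float:
    """Maximum score if more than half of the members exceed tau_sim, else 0."""
    scores = [cosine(vec, member) for member in cluster]
    above = sum(1 for s in scores if s > tau_sim)
    return max(scores) if above > len(scores) / 2 else 0.0


def single_pass(vectors: List[Vector], tau_sim: float) -> List[List[Vector]]:
    """Assign each vector to an existing cluster or open a new one."""
    clusters: List[List[Vector]] = []
    for vec in vectors:
        best, best_score = None, -1.0
        for cluster in clusters:
            score = cluster_similarity(vec, cluster, tau_sim)
            # Assumption: pick the most similar cluster that meets the threshold.
            if score >= tau_sim and score > best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([vec])  # the first vector always lands here
        else:
            best.append(vec)
    return clusters


def template_vector(cluster: List[Vector]) -> Vector:
    """Feature vector of an extraction template: mean of the cluster members."""
    dim = len(cluster[0])
    return [sum(v[k] for v in cluster) / len(cluster) for k in range(dim)]


clusters = single_pass([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], tau_sim=0.8)
templates = [template_vector(c) for c in clusters]
```

Because each vector is visited once and compared only with existing clusters, the result depends on the order of the seed instances, which is a known property of single-pass clustering.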
Step 4, searching candidate instances: using the extraction templates obtained in step 3, candidate instances are extracted from the electronic medical record text database, namely all binary entity pairs that conform to the preset entity relationship and whose similarity with an extraction template is greater than the similarity threshold;
each extraction template can extract a group of a plurality of candidate examples, and a plurality of extraction templates can extract the same candidate example.
Specifically, the method comprises the following steps:
step 4.1, scanning sentence documents, and collecting all sentences containing binary entity pairs which accord with the preset entity relationship;
step 4.2, traversing each sentence obtained in step 4.1: performing dependency syntactic analysis and other steps on the sentence according to the same method in the step 2 to extract a feature vector of the sentence; then, the similarity of the sentence and each extraction template is calculated based on the feature vectors: if the similarity between the sentence and any one of the extraction templates is greater than the similarity threshold, taking the binary entity pair in the sentence as a candidate example, and taking all the extraction templates with the similarity greater than the similarity threshold as the extraction templates of the candidate example;
step 4.3, after step 4.2 is completed, each candidate instance may correspond to a group of several extraction templates, and a group of several candidate instances may correspond to the same extraction template, that is: each extraction template can extract a group of a plurality of candidate examples, and a plurality of extraction templates can extract the same candidate example.
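Steps 4.1 to 4.3 can be sketched as a scan that records, for every entity pair, which templates matched it. The sketch assumes each sentence has already been reduced to an entity pair and a feature vector; `find_candidates` and the template names `P1` and `P2` are hypothetical, and keeping the highest score per template when a pair occurs in several sentences is an assumption of this sketch.

```python
import math
from typing import Dict, List, Sequence, Tuple

EntityPair = Tuple[str, str]
Vector = List[float]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_candidates(sentences: List[Tuple[EntityPair, Vector]],
                    templates: Dict[str, Vector],
                    tau_sim: float) -> Dict[EntityPair, Dict[str, float]]:
    """Map each candidate instance to the templates that extract it."""
    candidates: Dict[EntityPair, Dict[str, float]] = {}
    for pair, vec in sentences:
        for name, template_vec in templates.items():
            score = cosine(vec, template_vec)
            if score > tau_sim:
                matched = candidates.setdefault(pair, {})
                # Keep the best score if the pair was already seen.
                matched[name] = max(score, matched.get(name, 0.0))
    return candidates


sentences = [
    (("waist", "pain"), [1.0, 0.0]),
    (("knee", "swelling"), [0.0, 1.0]),
    (("head", "dizzy"), [1.0, 1.0]),
]
templates = {"P1": [1.0, 0.1], "P2": [0.1, 1.0]}
found = find_candidates(sentences, templates, tau_sim=0.8)
```

The returned mapping gives both views named in step 4.3: its keys are the candidate instances, and inverting it lists the candidate instances extracted by each template.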
Step 5, controlling semantic drift to add a new seed instance according to the confidence of the candidate instance;
step 5.1, for each extraction template obtained in step 3, calculating the confidence of the extraction template by using the entity relationship between the template and the candidate instances it extracted, specifically:
the candidate instances extracted by the template are counted: if a candidate instance matches both entities in the extraction template, it is a positive extraction; if it matches only one entity in the extraction template, it is a negative extraction; if it matches neither entity in the extraction template, it is an unknown extraction; then, according to the numbers of positive, negative and unknown extractions, the confidence of the extraction template is calculated according to the following formula:
in the formula, $\mathrm{Conf}_{\rho}(P)$ represents the confidence of template $P$; $P$, $N$ and $U$ respectively represent the numbers of positive, negative and unknown extractions corresponding to template $P$; and $W_{ngt}$ and $W_{unk}$ are the weights of negative and unknown extractions, respectively;
step 5.2, for each candidate instance obtained in step 4, the confidences of all the extraction templates that can extract the candidate instance are used to calculate the confidence of the candidate instance according to the following formula:
in the formula, $\mathrm{Conf}_{\iota}(i)$ is the confidence of candidate instance $i$, $\xi$ is the set of all extraction templates of candidate instance $i$, $\xi_j$ is the extraction template with index $j$ in the set $\xi$, $C_i$ is the sentence containing candidate instance $i$, and $\mathrm{sim}(C_i, \xi_j)$ represents the similarity between sentence $C_i$ and extraction template $\xi_j$;
step 5.3, the candidate instances whose confidence is greater than the confidence threshold $\tau_t$ are taken as new seed instances, and the method returns to step 2 to execute the next iteration until the preset number of iterations is reached, at which point the process ends; in the present embodiment, the preset number of iterations is set to 5.
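The confidence calculations of steps 5.1 and 5.2 can be sketched as below. The formulas are not reproduced in this text, so the sketch assumes a form consistent with the symbol definitions: template confidence as positives divided by the weighted total of extractions, and instance confidence as one minus the product of (1 - template confidence × similarity) over the matching templates. The function names and the weight values are hypothetical.

```python
from typing import Iterable, Tuple


def template_confidence(positive: int, negative: int, unknown: int,
                        w_ngt: float, w_unk: float) -> float:
    """Assumed form: P / (P + W_ngt * N + W_unk * U)."""
    denom = positive + w_ngt * negative + w_unk * unknown
    return positive / denom if denom else 0.0


def instance_confidence(matches: Iterable[Tuple[float, float]]) -> float:
    """Assumed form: 1 - prod(1 - Conf(template) * sim(sentence, template)).

    `matches` holds one (template confidence, similarity) pair per
    extraction template that extracted the candidate instance.
    """
    remaining = 1.0
    for conf, sim in matches:
        remaining *= 1.0 - conf * sim
    return 1.0 - remaining


# Illustrative numbers only: 3 positive, 1 negative, 2 unknown extractions.
conf_p = template_confidence(3, 1, 2, w_ngt=2.0, w_unk=0.5)
conf_i = instance_confidence([(conf_p, 1.0), (conf_p, 1.0)])
```

Under these assumed formulas, a candidate instance matched by two templates of confidence 0.5 scores higher than one matched by a single such template, so agreement between templates raises the chance of promotion to a seed instance.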
The present invention also provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described in the above method embodiments when executing the computer program.
The invention also provides a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method described in the above-mentioned method embodiments.
According to the entity relationship extraction method, the terminal device and the computer-readable storage medium in the embodiments of the invention, a small number of seed instances are first used to generate extraction templates; candidate instances are then extracted from the electronic medical record text database according to the extraction templates; finally, the confidence of each candidate instance is calculated according to the entity relationship between the candidate instances and the extraction templates, and the confidence determines whether the candidate instance is used as a new seed instance for the next iteration. Semantic drift is thereby controlled: candidate instances with low relevance to the extraction templates are prevented from entering the next iteration as seed instances, where they would generate ever more instances irrelevant to the seed instances, and the accuracy of entity relationship extraction from electronic medical records is greatly improved. In addition, only a small number of seed instances need to be provided, so a large amount of unlabeled data can be processed with good results, which better assists the development of the medical and health care industry.
The above embodiments are preferred embodiments of the present application, and those skilled in the art can make various changes or modifications without departing from the general concept of the present application, and such changes or modifications should fall within the scope of the claims of the present application.
Claims (10)
1. An entity relationship extraction method is characterized by comprising the following steps:
step 1, manually extracting a plurality of binary entity pairs which conform to a preset entity relationship from an electronic medical record text database to serve as seed instances;
step 2, for each seed instance, searching the electronic medical record text database for sentences that include the seed instance, and extracting the feature vectors of the sentences;
step 3, clustering the seed instances based on the feature vectors; for each cluster, generating an extraction template corresponding to the cluster according to the seed instances and the feature vectors of the sentences corresponding to the seed instances;
step 4, extracting candidate instances from the electronic medical record text database by using the extraction templates obtained in step 3;
each extraction template can extract a group of several candidate instances, and several extraction templates can extract the same candidate instance;
step 5, adding new seed instances according to the confidence of the candidate instances;
step 5.1, for each extraction template obtained in step 3, calculating the confidence of the extraction template by using the entity relationship between the template and the candidate instances it extracted;
step 5.2, for each candidate instance obtained in step 4, calculating the confidence of the candidate instance by using the confidences of all the extraction templates that can extract it;
step 5.3, taking the candidate instances whose confidence is greater than the confidence threshold as new seed instances, returning to step 2, and executing the next iteration until the preset number of iterations is reached.
2. The method of claim 1, wherein the confidence of each extraction template in step 5.1 is calculated as follows:
counting the candidate instances extracted by the template itself, wherein if a candidate instance is the same as both entities in the extraction template, the candidate instance is a positive extraction; if the candidate instance is the same as only 1 entity in the extraction template, the candidate instance is a negative extraction; if the candidate instance differs from both entities in the extraction template, the candidate instance is an unknown extraction; then, according to the numbers of positive extractions, negative extractions and unknown extractions, the confidence of the extraction template is calculated according to the following formula:
in the formula, $\mathrm{Conf}_\rho(P)$ represents the confidence of the template $P$; $P$, $N$ and $U$ respectively represent the numbers of positive, negative and unknown extractions corresponding to the template $P$; and $W_{ngt}$ and $W_{unk}$ are the weights of negative extractions and unknown extractions, respectively;
the method for calculating the confidence of the candidate instance in step 5.2 is as follows:
in the formula, $\mathrm{Conf}_\iota(i)$ is the confidence of the candidate instance $i$; $\xi$ is the set of all extraction templates that extract the candidate instance $i$; $\xi_j$ is the extraction template with index $j$ in the set $\xi$; $C_i$ is the sentence in which the candidate instance $i$ occurs; and $\mathrm{sim}(C_i, \xi_j)$ represents the similarity between the sentence $C_i$ and the extraction template $\xi_j$.
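The two formulas themselves are not reproduced in this text. The sketch below assumes the forms commonly used in Snowball/BREDS-style bootstrapping, which match the symbols defined above: Conf_ρ(P) = P / (P + W_ngt·N + W_unk·U) and Conf_ι(i) = 1 − Π_j (1 − Conf_ρ(ξ_j)·sim(C_i, ξ_j)). These forms, the function names and the default weights are assumptions, not values taken from the claims.

```python
from typing import Iterable, Tuple


def template_confidence(positive: int, negative: int, unknown: int,
                        w_ngt: float = 2.0, w_unk: float = 0.1) -> float:
    """Assumed form: P / (P + W_ngt * N + W_unk * U).

    The default weights are hypothetical placeholders.
    """
    denom = positive + w_ngt * negative + w_unk * unknown
    # A template with no counted extractions gets zero confidence.
    return positive / denom if denom > 0 else 0.0


def instance_confidence(matches: Iterable[Tuple[float, float]]) -> float:
    """Assumed form: 1 - prod_j (1 - Conf(xi_j) * sim(C_i, xi_j)).

    `matches` holds one (template confidence, similarity) pair for each
    template that extracted the candidate instance.
    """
    miss = 1.0  # probability-like score that every template is wrong
    for conf, sim in matches:
        miss *= 1.0 - conf * sim
    return 1.0 - miss
```

Under this assumed form, a candidate supported by several confident, similar templates scores higher than one supported by a single template.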
3. The method of claim 1, wherein the candidate instances are all binary entity pairs that conform to the preset entity relationship and whose similarity to the extraction template is greater than a similarity threshold.
4. The method according to claim 1, wherein the specific process of extracting the feature vector of each sentence is as follows: parsing the sentence by dependency syntax analysis, extracting all dependency features of the binary entity pair in the sentence, obtaining a word vector for each dependency feature by using a skip-gram method, and taking the average of all the word vectors as the feature vector of the sentence.
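The averaging step of claim 4 can be illustrated as follows. This is a minimal sketch: the dictionary `embeddings` is a hypothetical stand-in for the lookup table of a trained skip-gram model (for example, gensim's `Word2Vec` trained with `sg=1`), and the dependency features are assumed to have been extracted already.

```python
from typing import Dict, List, Sequence


def sentence_vector(features: Sequence[str],
                    embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the word vectors of a sentence's dependency features.

    Features missing from `embeddings` are skipped; an empty list is
    returned when no feature has a vector.
    """
    vectors = [embeddings[f] for f in features if f in embeddings]
    if not vectors:
        return []
    dim = len(vectors[0])
    # Component-wise mean over all available word vectors.
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
```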
5. The method of claim 1, wherein sentences are clustered using a single-pass algorithm.
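A single-pass clustering routine in the usual sense can be sketched as below. It assumes cosine similarity against a running-mean centroid and a hypothetical similarity threshold of 0.7; the claims do not specify the similarity measure or the threshold.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(vectors: Sequence[Sequence[float]],
                threshold: float = 0.7) -> List[List[int]]:
    """Assign each vector, in order, to the most similar existing cluster,
    or open a new cluster when no centroid reaches the threshold.

    Returns the member indices of each cluster.
    """
    centroids: List[List[float]] = []  # running mean of each cluster
    members: List[List[int]] = []      # input indices in each cluster
    for idx, vec in enumerate(vectors):
        best, best_sim = -1, threshold
        for c, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best < 0:
            centroids.append(list(vec))
            members.append([idx])
        else:
            members[best].append(idx)
            n = len(members[best])
            # Update the running mean with the newly added vector.
            centroids[best] = [(m * (n - 1) + x) / n
                               for m, x in zip(centroids[best], vec)]
    return members
```

Because each vector is visited once, the result depends on input order, which is a known property of single-pass clustering.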
6. The method of claim 1, wherein the binary entity pair that conforms to the preset entity relationship is <body part, medical description>.
7. The method as claimed in claim 6, wherein the electronic medical record text database is a txt document that comprises a plurality of electronic medical record texts and is obtained by splitting the texts into sentences and performing entity labeling on each sentence.
8. The method of claim 1, wherein the predetermined number of iterations is 5.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010047654.5A CN113130025B (en) | 2020-01-16 | 2020-01-16 | Entity relation extraction method, terminal equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113130025A (en) | 2021-07-16 |
CN113130025B (en) | 2023-11-24 |
Family
ID=76771765
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202010047654.5A | Active | CN113130025B (en) | 2020-01-16 | 2020-01-16 | Entity relation extraction method, terminal equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113130025B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050027664A1 (en) * | 2003-07-31 | 2005-02-03 | Johnson David E. | Interactive machine learning system for automated annotation of information in text |
US20190065576A1 (en) * | 2017-08-23 | 2019-02-28 | Rsvp Technologies Inc. | Single-entity-single-relation question answering systems, and methods |
CN109710932A (en) * | 2018-12-22 | 2019-05-03 | 北京工业大学 | A kind of medical bodies Relation extraction method based on Fusion Features |
CN110188193A (en) * | 2019-04-19 | 2019-08-30 | 四川大学 | A kind of electronic health record entity relation extraction method based on most short interdependent subtree |
Non-Patent Citations (2)
Title |
---|
SHIGUANG WANG et al.: "Pedestrian Detection via Body Part Semantic and Contextual Information With DNN", 《IEEE TRANSACTIONS ON MULTIMEDIA》, vol. 20, no. 11, pages 3148 - 3159, XP011691817, DOI: 10.1109/TMM.2018.2829602 * |
SHIGUANG WAN et al.: "PCN: Part and Context Information for Pedestrian Detection with CNNs", 《Published online: HTTPS://ARXIV.ORG/ABS/1804.04483V1》, pages 1 - 13 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113658652A (en) * | 2021-08-18 | 2021-11-16 | 四川大学华西医院 | Binary relation extraction method based on electronic medical record data text |
CN113658652B (en) * | 2021-08-18 | 2023-07-28 | 四川大学华西医院 | Binary relation extraction method based on electronic medical record data text |
CN113836924A (en) * | 2021-09-16 | 2021-12-24 | 东软集团股份有限公司 | Entity relationship extraction method and device, storage medium and electronic equipment |
CN114625880A (en) * | 2022-05-13 | 2022-06-14 | 上海帜讯信息技术股份有限公司 | Character relation extraction method, device, terminal and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113130025B (en) | 2023-11-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||