WO2021151353A1 - Medical entity relationship extraction method, apparatus, computer device, and readable storage medium - Google Patents

Medical entity relationship extraction method, apparatus, computer device, and readable storage medium

Info

Publication number
WO2021151353A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
data
processed
relationship
medical
Prior art date
Application number
PCT/CN2020/135082
Other languages
English (en)
French (fr)
Inventor
张圣
顾大中
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021151353A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/40 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for data related to laboratory analysis, e.g. patient specimen analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • This application relates to the field of natural language processing technology, and in particular to a method, device, computer equipment, and readable storage medium for extracting medical entity relationships.
  • Biomedical literature contains rich, cutting-edge biomedical knowledge and is an important knowledge resource for researchers in the biomedical field.
  • The inventors found that existing biomedical entity relationship knowledge bases are built largely by hand by experts, so their coverage of medical relationship knowledge is small and their scale is limited. As the volume of medical literature grows exponentially, relying only on experts to manually edit and organize knowledge cannot produce a complete medical relationship knowledge base; manual curation involves a heavy workload, low efficiency, and high cost.
  • The purpose of this application is to provide a medical entity relationship extraction method, device, computer equipment, and readable storage medium that solve the prior-art technical problem that manual extraction of medical entity relationships is time-consuming, laborious, and inefficient.
  • This application provides a method for extracting medical entity relationships, including:
  • This application also provides a medical entity relationship extraction device, including:
  • The acquisition module is configured to acquire medical texts and obtain multiple pieces of to-be-processed data based on the medical texts;
  • The entity recognition module is configured to use the first model to perform medical named entity recognition on each piece of to-be-processed data and obtain the corresponding entity recognition result;
  • The relationship recognition module is configured to extract entity relationships based on the entity recognition results and obtain entity pairs with entity relationships;
  • The generation module is configured to calculate the confidence of each entity pair based on the entity relationship, and to generate target data based on each entity pair, the entity relationship, and the corresponding confidence.
  • The present application also provides a computer device including a memory, a processor, and a computer program stored in the memory and runnable on the processor. The processor implements the above method when it executes the computer program.
  • The present application also provides a computer-readable storage medium comprising multiple storage media, each of which stores a computer program; when the computer programs stored in the multiple storage media are executed by a processor, they jointly implement the following method:
  • The present application realizes automatic extraction of medical entity relationships and solves the prior-art technical problem that manual extraction of medical entity relationships is time-consuming, laborious, and inefficient.
  • FIG. 1 is a flowchart of Embodiment 1 of the medical entity relationship extraction method according to this application;
  • FIG. 2 is a flowchart, in Embodiment 1, of using the first model to perform medical named entity recognition on each piece of to-be-processed data and obtaining the corresponding entity recognition result;
  • FIG. 3 is a flowchart, in Embodiment 1, of training the first model before it is used to perform medical named entity recognition on each piece of to-be-processed data;
  • FIG. 4 is a flowchart, in Embodiment 1, of performing entity relationship extraction based on the entity recognition result to obtain entity pairs with entity relationships;
  • FIG. 5 is a flowchart, in Embodiment 1, of performing entity relationship extraction based on the entity recognition result to obtain entity pairs with entity relationships;
  • FIG. 6 is a flowchart, in Embodiment 1, of calculating the confidence of the entity pair based on the entity relationship and generating target data based on each entity pair, the entity relationship, and the corresponding confidence;
  • FIG. 7 is a schematic diagram of the program modules of Embodiment 2, the medical entity relationship extraction device according to this application;
  • FIG. 8 is a schematic diagram of the hardware structure of the computer device in Embodiment 3 of this application.
  • The technical solution of this application can be applied to the fields of artificial intelligence, smart city, digital healthcare, blockchain, and/or big data technology to realize automatic extraction of medical entity relationships.
  • The data involved in this application, such as medical text, confidence, and/or target data, can be stored in a database or in a blockchain (for example, distributed storage through a blockchain); this application places no limit on this.
  • The medical entity relationship extraction method, device, computer equipment, and readable storage medium provided in this application are applicable to the field of natural language processing, and provide medical entity relationship extraction based on an acquisition module, an entity recognition module, a relationship recognition module, and a generation module.
  • This application obtains the to-be-processed data from the medical text through the acquisition module and processes it with the first model in the entity recognition module to obtain the entity recognition result, which covers 10 entity types. The relationship recognition module then performs entity relationship extraction on the entity recognition result to obtain entity pairs with entity relationships. The entity relationships comprise 150 types, each an association between any two entity categories generated based on the dependency relationship type. The generation module calculates the confidence of each entity pair, which is used to evaluate how strongly the pair is related, and generates the target data. In this way the entity relationships are extracted automatically, solving the prior-art technical problem that manual extraction of medical entity relationships is time-consuming, laborious, and inefficient.
  • The method for extracting medical entity relationships in this embodiment includes the following steps:
  • S100 Obtain a medical text, and obtain multiple pieces of to-be-processed data based on the medical text;
  • A large number of medical texts are used for entity relationship extraction. After the medical texts are obtained, they can be pre-screened and parsed according to preset rules, and each single sentence obtained is used as a piece of to-be-processed data. Each medical text contains multiple sentences.
  • The medical text is split at preset delimiters (such as periods and semicolons), each resulting sentence is filtered, and sentences that do not meet the preset conditions are removed to obtain the to-be-processed data.
  • The preset condition can be sentence length, etc. An example of the resulting to-be-processed data is: "The profile of the ACE makes it a therapeutic target for heart failure."
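The splitting and filtering in S100 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `to_pending_data`, the delimiter set, and the length thresholds are assumptions, and a plain split on periods will also break on abbreviations and decimals.

```python
import re

def to_pending_data(medical_text, delimiters=".;", min_chars=20, max_chars=500):
    """Split a text at the preset delimiters and keep sentences whose length
    lies in [min_chars, max_chars] (the thresholds are assumed values)."""
    # Split on whitespace that follows a delimiter, so each delimiter stays
    # attached to the sentence it ends.
    pattern = "(?<=[" + re.escape(delimiters) + "])\\s+"
    candidates = [part.strip() for part in re.split(pattern, medical_text)]
    # Filter step: drop sentences that do not meet the preset length condition.
    return [s for s in candidates if min_chars <= len(s) <= max_chars]

text = ("Short one. The profile of the ACE makes it a therapeutic target "
        "for heart failure. OK; BRCA1 is not associated with heart failure.")
pending = to_pending_data(text)
```

Here `pending` keeps the two full sentences and drops the fragments "Short one." and "OK;" because they fall below the assumed minimum length.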
  • S200 Use the first model to perform medical named entity recognition on each data to be processed, and obtain an entity recognition result corresponding to each data to be processed;
  • The first model includes a Bert-Embeding layer, a Bi-LSTM network, a Bi-GRU network, and a CRF network arranged in sequence.
  • The entity recognition result includes entities and entity types. The entity types are gene, disease, substance (chemical), protein, symptom, laboratory test (test), treatment plan (therapy, including surgery, chemotherapy, radiotherapy, immunotherapy, etc.), microorganism, immune factor, and biological pathway.
  • The input is the to-be-processed data, and the output is the information of the recognized multi-category medical entities.
  • Entity recognition uses BIO labeling, with three kinds of tags: B marks the starting position of an entity in the text, I marks the middle or end position of an entity, and O marks a token that is not part of any entity.
  • The B and I tags each have 10 categories, namely B-gene, ..., B-pathway and I-gene, ..., I-pathway.
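As an illustration of the tag set described above, the label vocabulary for a tagger can be generated from the 10 entity types. The short type names (for example `immune_factor`) and the helper name `build_bio_labels` are assumptions made for this sketch.

```python
# Short names for the 10 entity types; the exact spellings are assumed.
ENTITY_TYPES = ["gene", "disease", "chemical", "protein", "symptom",
                "test", "therapy", "microorganism", "immune_factor", "pathway"]

def build_bio_labels(entity_types):
    """Return the BIO label list: O, then B-<type> and I-<type> per type."""
    labels = ["O"]
    labels += ["B-" + t for t in entity_types]
    labels += ["I-" + t for t in entity_types]
    return labels

LABELS = build_bio_labels(ENTITY_TYPES)
# Integer ids are what a sequence tagger would predict for each token.
LABEL_TO_ID = {label: i for i, label in enumerate(LABELS)}
```

With 10 entity types this gives 21 labels: one O tag plus 10 B- tags and 10 I- tags.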
  • Using the first model to perform medical named entity recognition on each piece of to-be-processed data and obtain the corresponding entity recognition result includes the following steps (see FIG. 2):
  • S211 Obtain any data to be processed, and use the Bert-Embeding layer to vectorize the data to be processed to obtain a first vector;
  • The pre-trained model BERT is used to obtain the embedding representation of each word.
  • Pre-trained models such as BERT are pre-trained through the Masked LM task and the Next Sentence Prediction task, and the pre-trained model is then applied to the specific task through fine-tuning (a commonly used parameter-tuning method in machine learning and deep learning).
  • Word embeddings learned by the pre-trained model are more expressive than those obtained by training the word2vec network structure commonly used in the prior art. The Bert-Embeding layer yields the vector for each piece of to-be-processed data for subsequent semantic recognition.
  • This solution pre-trains the BERT model on a large corpus of medical literature so that it is adapted to natural language processing tasks in the medical field.
  • LSTM is a commonly used recurrent neural network; its bidirectional variant, Bi-LSTM, is used here. Bi-LSTM can learn the forward and backward semantics (i.e., the context semantics) of each word in the sentence.
  • The second vector obtained from the Bi-LSTM network is input to the Bi-GRU layer to obtain the hidden vector (that is, the third vector).
  • The first model uses a two-layer recurrent neural network: the first layer uses Bi-LSTM and the second layer uses Bi-GRU. This multi-layer recurrent network can learn deeper semantic representations.
  • S214 Input the third vector to the CRF layer, output a predicted tag sequence for entity recognition, and obtain an entity recognition result corresponding to the to-be-processed data according to the predicted tag sequence;
  • The outputs of S212 and S213 above are the predicted scores of each entity tag. These scores are used as the input to the CRF layer.
  • The CRF layer can add constraints to the final predicted tag sequence to ensure that it is legal; these constraints are learned automatically by the CRF layer during training. For example, the first word in a sentence always starts with the label "B-" or "O", never "I-". After CRF processing, the probability of illegal sequences in the predicted tag sequence is greatly reduced.
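To make the notion of a "legal" tag sequence concrete, the constraints can be written out as an explicit check. This is not the CRF layer itself, which learns such constraints as transition scores during training; it is a hand-written sketch, and the function name `is_legal_bio` is an assumption.

```python
def is_legal_bio(tags):
    """Return True if a tag sequence obeys the BIO constraints."""
    previous = "O"  # Treat the sentence start like an O tag.
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # I-x may only follow B-x or I-x of the same entity type.
            if previous not in ("B-" + entity_type, "I-" + entity_type):
                return False
        previous = tag
    return True
```

A sequence that opens with "I-gene", or that follows "B-gene" with "I-disease", is rejected, matching the example constraint in the text.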
  • For example, the recognized entities are ACE (gene) and heart failure (disease).
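Turning a predicted tag sequence into entities such as ACE (gene) and heart failure (disease) can be sketched as below. The whitespace tokenization, the hand-written tag sequence, and the function name `decode_entities` are assumptions for illustration; a real system would use the tokenizer and predictions of the first model.

```python
def decode_entities(tokens, tags):
    """Group BIO tags into (entity text, entity type) pairs."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # Continue the open entity of the same type.
            current.append(token)
        else:
            # O tag (or a stray I- tag): close any open entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

tokens = "The profile of the ACE makes it a therapeutic target for heart failure .".split()
tags = ["O", "O", "O", "O", "B-gene", "O", "O", "O", "O", "O", "O",
        "B-disease", "I-disease", "O"]
entities = decode_entities(tokens, tags)
```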
  • The first model in this solution differs from the existing structure, which feeds word2vec word embeddings into a single-layer recurrent neural network (RNN): it uses a two-layer recurrent network to further strengthen the model's learning and improve the accuracy of the entity recognition results.
  • Before the first model is used to perform medical named entity recognition on each piece of to-be-processed data and obtain the corresponding entity recognition result, the first model is trained. Referring to FIG. 3, training includes the following:
  • The entity tag includes an entity and an entity type; the entity tags cover the 10 medical entity types described above and are annotated using the BIO scheme.
  • S222 Input the training data into the Bert-Embeding layer for vectorization processing, and obtain a first processing vector corresponding to the training data;
  • S225 Input the third processing vector to the CRF layer, output a predicted tag sequence for entity recognition, and obtain a sample target result according to the predicted tag sequence;
  • Steps S222-S225 of the training process are the same as steps S211-S214 of the processing described above.
  • The pre-trained model BERT is used to obtain the embedding representation of each word, and the resulting word vectors are input to the Bi-LSTM and Bi-GRU layers in turn, so that semantic recognition is performed twice and the third processing vector is obtained.
  • The two-layer recurrent network (Bi-LSTM + Bi-GRU) can learn a deeper semantic representation, and its output is finally input to the CRF layer to obtain the entity recognition result.
  • S226 Compare the sample target result with the entity tag corresponding to the training data, and adjust the parameters of the first model until the training is completed, and a trained first model is obtained.
  • Step S200 performs medical named entity recognition. However, before entity relationship extraction is performed on the entity recognition result in step S300 to obtain entity pairs with entity relationships, abbreviated entity names need to be disambiguated.
  • The entity name list can be obtained by collecting the entities in the entity recognition result.
  • For example, the abbreviation HF can correspond to various entity names such as Heart failure, Hydrofluoric acid, Helical Factor, finger protein, and complement factor H.
  • S233 Search in the medical text based on each candidate entity name, and obtain a candidate entity name matching the medical text as the entity name corresponding to the abbreviated name;
  • Abbreviation disambiguation here relies on the self-consistency of the document: an abbreviated name generally appears in a medical text together with its non-abbreviated form, so the candidate entity names can be searched for in the medical text. For example, if an identified entity mention is HF and the non-abbreviated entity name Heart failure appears in the full text from which the to-be-processed data was taken, the entity corresponding to HF in that sentence is Heart failure rather than another entity abbreviated as HF, such as Hydrofluoric acid. This reduces the ambiguity caused by abbreviations.
  • The complete entity name then replaces the abbreviated name, which further improves the accuracy of the recognition result and also benefits the accuracy of the entity relationship extraction in the subsequent S300.
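The disambiguation steps can be sketched as a lookup followed by a full-text search. The dictionary-shaped entity database, the case-insensitive substring match, and the choice to keep the abbreviation unless exactly one candidate matches are assumptions of this sketch; the source does not specify how ties or misses are handled.

```python
def disambiguate(abbreviation, entity_database, full_text):
    """Resolve an abbreviated entity name against the full medical text."""
    # Candidate full names for this abbreviation (empty if unknown).
    candidates = entity_database.get(abbreviation, [])
    lowered = full_text.lower()
    # Keep the candidates that literally appear in the same document.
    matches = [name for name in candidates if name.lower() in lowered]
    # Assumed policy: resolve only when the match is unambiguous.
    return matches[0] if len(matches) == 1 else abbreviation

entity_database = {
    "HF": ["Heart failure", "Hydrofluoric acid", "Helical Factor",
           "complement factor H"],
}
document = "Heart failure (HF) is common. ACE is a therapeutic target for HF."
resolved = disambiguate("HF", entity_database, document)
```

For this document `resolved` is "Heart failure", because it is the only candidate that occurs in the text.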
  • The entity relationship is an association between any two entity categories, generated based on the dependency relationship type. The dependency relationship types are positive semantics, negative semantics (neg), and unclear semantics (unclear).
  • There are 50 relationships in total between pairs of entity types, and each takes one of the 3 semantics, giving 150 entity relationships in total, including but not limited to gene-gene, ..., gene-pathway; disease-protein, ..., disease-pathway; ...; immune factor-pathway.
  • Each relationship type is either affirmative or carries a neg or unclear suffix, such as gene-gene-neg and gene-gene-unclear.
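How relationship labels are composed from entity-type pairs and semantic suffixes can be sketched as below. The enumeration rule is an assumption: unordered pairs that include same-type pairs yield 55 pairs, which matches the figure of 55 relationship types given elsewhere in this text but not the 50 and 150 stated here, and the source does not say which pairs are excluded to reach 50.

```python
from itertools import combinations_with_replacement

# Short names for the 10 entity types; the exact spellings are assumed.
ENTITY_TYPES = ["gene", "disease", "chemical", "protein", "symptom",
                "test", "therapy", "microorganism", "immune_factor", "pathway"]
# Positive semantics carries no suffix; the other two are appended.
SEMANTIC_SUFFIXES = ["", "-neg", "-unclear"]

def relation_labels(entity_types):
    """Return (type pairs, full relation labels) for the given entity types."""
    pairs = [a + "-" + b
             for a, b in combinations_with_replacement(entity_types, 2)]
    labels = [pair + suffix for pair in pairs for suffix in SEMANTIC_SUFFIXES]
    return pairs, labels

pairs, labels = relation_labels(ENTITY_TYPES)
```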
  • In step S300, performing entity relationship extraction based on the entity recognition result to obtain entity pairs with entity relationships includes the following (see FIG. 5):
  • S310 Obtain an entity recognition result corresponding to any data to be processed, and obtain an entity pair and an entity type based on the entity recognition result;
  • S320 Identify the dependency relationship type of the data to be processed, where the dependency relationship type includes positive semantics, negative semantics, and undetermined semantics;
  • Identifying the dependency relationship type of the to-be-processed data uses dependency parsing from natural language processing. Dependency parsing explains the syntactic structure of a language unit by analyzing the dependencies among its components, and holds that the core verb of a sentence is the central component that governs the other components.
  • Positive semantics: "The profile of the ACE makes it a therapeutic target for heart failure." Two medical entities are identified, namely ACE (gene) and heart failure (disease). One piece of knowledge can be obtained from the sentence, and it is stored in the following record format: <ACE, heart failure, gene-disease>.
  • Negative semantics: "BRCA1 is not associated with heart failure." The medical entities are BRCA1 (gene) and heart failure (disease). The dependency analysis shows a negative semantics (neg), so the piece of knowledge <BRCA1, heart failure, gene-disease-neg> is obtained.
  • Undeterminable semantics: "However, whether GHRP have a beneficial effect on CHF is unclear." The recognized medical entities are GHRP (gene) and CHF (disease). The root node of the dependency tree of this sentence is "unclear", so the semantics cannot be determined, and the piece of knowledge <GHRP, CHF, gene-disease-unclear> is obtained.
  • S330 Generate an entity relationship according to the dependency relationship type, the entity pair, and the entity type, and obtain an entity pair with the entity relationship;
  • The entity pair and the entity relationship are spliced together to obtain the entity pair with the entity relationship, such as <GHRP, CHF, gene-disease-unclear>.
  • The entity relationship of the entity pair in each piece of to-be-processed data is determined by identifying the dependency relationship type of that data. This overcomes the prior-art limitation of only being able to determine entity relationships with affirmative semantics, and improves the accuracy of entity relationship determination.
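Steps S320 and S330 can be sketched with a pre-parsed sentence as input. Everything about the parse is an assumption: the (word, label) pairs are written by hand with labels loosely modelled on Stanford-style dependencies (`ROOT`, `neg`), no parser is invoked, and the word list `UNCERTAIN_ROOTS` is hypothetical. The rules simply mirror the three examples in the text.

```python
# Hypothetical root words that signal undeterminable semantics.
UNCERTAIN_ROOTS = {"unclear", "unknown", "uncertain"}

def semantics_from_parse(parsed_tokens):
    """Classify a sentence from (word, dependency label) pairs."""
    root = next((word for word, dep in parsed_tokens if dep == "ROOT"), "")
    if root.lower() in UNCERTAIN_ROOTS:
        return "unclear"
    if any(dep == "neg" for _, dep in parsed_tokens):
        return "neg"
    return "positive"

def build_triple(head, tail, head_type, tail_type, semantics):
    """Splice the entity pair and relationship into one record."""
    rel = head_type + "-" + tail_type
    if semantics != "positive":
        rel += "-" + semantics
    return (head, tail, rel)

# Hand-written partial parses of the three example sentences.
positive_parse = [("profile", "nsubj"), ("makes", "ROOT"), ("target", "xcomp")]
negative_parse = [("BRCA1", "nsubjpass"), ("is", "auxpass"), ("not", "neg"),
                  ("associated", "ROOT")]
unclear_parse = [("whether", "mark"), ("is", "cop"), ("unclear", "ROOT")]

triple = build_triple("GHRP", "CHF", "gene", "disease",
                      semantics_from_parse(unclear_parse))
```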
  • S400 Calculate the confidence of the entity pair based on the entity relationship, and generate target data based on each of the entity pairs, the entity relationship, and the corresponding confidence.
  • The confidence of the association of each extracted entity pair is evaluated. The higher the confidence, the more strongly the corresponding entity pair is associated.
  • In step S400, the confidence of the entity pair is calculated based on the entity relationship, and target data is generated based on each entity pair, the entity relationship, and the corresponding confidence, as follows:
  • Simple counting over all the extracted entity pairs and entity relationships yields the following data format: <head_entity, tail_entity, rel, nums, nums_neg, nums_unclear>.
  • head_entity and tail_entity represent the head and tail entities of the piece of knowledge, and rel represents the relationship type.
  • nums represents the number of positive-semantics extractions for the entity pair (that is, the frequency of occurrence in medical texts), nums_neg the number of negative-semantics extractions, and nums_unclear the number of unclear-semantics extractions.
  • For example, after this simple conversion the extracted entity pair <ACE, heart failure> has the form <ACE, heart failure, gene-disease, 964, 2, 6>. This means that for the entity pair <ACE, heart failure> the relationship type is gene-disease, the number of extractions with neg semantics is 2, the number with unclear semantics is 6, and the number with positive semantics is 964.
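The counting that produces the six-field record can be sketched as below, assuming the extracted knowledge arrives as (head, tail, rel) triples whose relationship label ends in `-neg` or `-unclear` for the non-positive cases. The function name `aggregate` and the small input are illustrative only.

```python
from collections import Counter

def aggregate(triples):
    """Count positive, neg, and unclear extractions per entity pair."""
    counts = {}
    for head, tail, rel in triples:
        if rel.endswith("-neg"):
            base, slot = rel[:-len("-neg")], "nums_neg"
        elif rel.endswith("-unclear"):
            base, slot = rel[:-len("-unclear")], "nums_unclear"
        else:
            base, slot = rel, "nums"
        # One Counter per (head, tail, base relation); missing slots read as 0.
        counts.setdefault((head, tail, base), Counter())[slot] += 1
    return [(h, t, r, c["nums"], c["nums_neg"], c["nums_unclear"])
            for (h, t, r), c in counts.items()]

triples = ([("ACE", "heart failure", "gene-disease")] * 3
           + [("ACE", "heart failure", "gene-disease-neg")]
           + [("ACE", "heart failure", "gene-disease-unclear")] * 2)
records = aggregate(triples)
```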
  • S420 Weight the occurrence frequency of each entity relationship of the entity pair using preset weights, and take the ratio of the weighted total to the total before weighting as the confidence of the entity pair;
  • the confidence calculation can be expressed as the following formula:
  • This score gives the confidence of each medical entity pair. The higher the score, the more likely the entity pair is related. Taking the above entity pair <ACE, heart failure> as an example, the corresponding confidence level is:
  • S430 Generate target data based on each of the entity pairs, entity relationships, and corresponding confidence.
  • The generated target data has the form <head_entity, tail_entity, rel, confidence>.
  • Taking the confidence from step S420 above as an example, the target data is <ACE, heart failure, gene-disease, 0.9928>.
  • The confidence calculation makes the extraction results more useful as a reference; subsequently, entity pairs suitable for different scenarios can be selected based on the confidence in the target data.
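The confidence formula itself is not reproduced in this text, so the sketch below is a guess at its shape rather than the applicant's formula. It assumes the confidence is the weighted sum of the three counts divided by their unweighted sum, with assumed weights of 1 for positive, -1 for negative, and 0.5 for unclear extractions. These weights are chosen only because they reproduce the worked example (964, 2, 6 giving 0.9928); other weightings produce the same rounded value.

```python
def confidence(nums, nums_neg, nums_unclear,
               w_pos=1.0, w_neg=-1.0, w_unclear=0.5):
    """Ratio of weighted to unweighted extraction counts (assumed form)."""
    total = nums + nums_neg + nums_unclear
    if total == 0:
        return 0.0  # No evidence for this pair at all.
    weighted = w_pos * nums + w_neg * nums_neg + w_unclear * nums_unclear
    return weighted / total

def to_target(record):
    """Convert a six-field count record into the four-field target record."""
    head, tail, rel, nums, nums_neg, nums_unclear = record
    score = round(confidence(nums, nums_neg, nums_unclear), 4)
    return (head, tail, rel, score)

target = to_target(("ACE", "heart failure", "gene-disease", 964, 2, 6))
```

Under these assumed weights the example evaluates to 965 / 972, which rounds to 0.9928.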
  • The above entity pairs, entity relationships, and corresponding target data can be uploaded to a blockchain for later use as reference samples or training samples. Uploading to the blockchain ensures their security and their fairness and transparency to users. User devices can download the summary information from the blockchain to verify whether the priority list has been tampered with. The voice file corresponding to the amount of data can also be downloaded from the blockchain for voice broadcast without a generation step, which effectively improves the efficiency of voice processing.
  • The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms.
  • A blockchain, essentially a decentralized database, is a series of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • The blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
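The idea of data blocks linked by cryptographic methods, and of checking whether stored records have been tampered with, can be illustrated with a hash chain. This is not a blockchain platform: there is no network, consensus mechanism, or distributed storage, and the block layout and function names are assumptions of the sketch.

```python
import hashlib
import json

def make_block(records, previous_hash):
    """Build a block whose hash covers its records and the previous hash."""
    payload = json.dumps({"records": records, "previous_hash": previous_hash},
                         sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"records": records, "previous_hash": previous_hash, "hash": digest}

def verify_chain(chain):
    """Return True if every block is intact and linked to its predecessor."""
    previous_hash = "0" * 64  # Conventional value for the first block.
    for block in chain:
        expected = make_block(block["records"], block["previous_hash"])["hash"]
        if block["previous_hash"] != previous_hash or block["hash"] != expected:
            return False
        previous_hash = block["hash"]
    return True

first = make_block([["ACE", "heart failure", "gene-disease", 0.9928]], "0" * 64)
second = make_block([["GHRP", "CHF", "gene-disease-unclear", 0.5]], first["hash"])
chain = [first, second]
intact = verify_chain(chain)
```

Changing any stored record afterwards makes `verify_chain` return False, because the recomputed hash no longer matches the stored one.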
  • In this solution, 10 medical entity types and 55 medical relationship types between different entities are preset, so the extracted entity pairs have high coverage and varied types.
  • The pre-trained model BERT is used to obtain word embeddings, and a multi-layer bidirectional recurrent neural network (Bi-LSTM, Bi-GRU) enhances the learning ability of the model and improves the accuracy of the entity recognition result.
  • This solution also retrieves entity names that have appeared in the medical text as the entity names corresponding to abbreviations, which solves the problem of ambiguous entity-name abbreviations.
  • Entity-pair relationships are determined through dependency analysis, including for to-be-processed data containing negative and indeterminate semantics. Finally, the association confidence of each extracted medical entity pair is calculated with the confidence scoring algorithm, which further improves the reference value and accuracy of the target data.
  • The medical entity relationship extraction device 5 of this embodiment includes an acquisition module 51, an entity recognition module 52, a relationship recognition module 53, and a generation module 54.
  • The acquisition module 51 is configured to obtain medical text and obtain multiple pieces of to-be-processed data based on the medical text;
  • The entity recognition module 52 is configured to use the first model to perform medical named entity recognition on each piece of to-be-processed data and obtain the corresponding entity recognition result;
  • The entity types include gene, disease, chemical, protein, symptom, test, therapy (including surgery, chemotherapy, radiotherapy, immunotherapy, etc.), microorganism, immune factor, and biological pathway.
  • the above-mentioned first model includes a Bert-Embeding layer, a Bi-LSTM network, a Bi-GRU network, and a CRF network arranged in sequence.
  • the relationship recognition module 53 is configured to extract entity relationships based on the entity recognition results, and obtain entity pairs with entity relationships;
  • the relationship recognition module 53 also includes the following:
  • the collection module 531 is configured to obtain an entity recognition result corresponding to any data to be processed, and obtain an entity pair and an entity type based on the entity recognition result;
  • the recognition module 532 is configured to recognize the dependency relationship type of the data to be processed, and the dependency relationship type includes positive semantics, negative semantics, and undeterminable semantics;
  • the relationship determination module 533 is configured to generate an entity relationship according to the dependency relationship type, the entity pair, and the entity type, and obtain an entity pair with the entity relationship.
  • the entity relationship includes any two entity category association relationships generated based on a dependency relationship type.
  • the generating module 54 is configured to calculate the confidence of the entity pair based on the entity relationship, and generate target data based on each of the entity pairs, the entity relationship, and the corresponding confidence.
  • the device also includes a disambiguation module 55, configured to: obtain a list of entity names based on the entity recognition result; obtain the abbreviated names in the entity name list, and obtain from the entity database the entity names corresponding to each abbreviated name as candidate entity names; search the medical text for each candidate entity name, and take the candidate entity name that matches the medical text as the entity name corresponding to the abbreviated name; and update the entity recognition result based on the entity name corresponding to the abbreviated name.
  • the data to be processed is obtained from the medical text by the acquisition module, and the first model in the entity recognition module processes it to obtain the entity recognition results, where the first model includes a Bert-Embedding layer, a Bi-LSTM network, a Bi-GRU network, and a CRF network arranged in sequence, and the entity recognition results cover 10 entity types.
  • the relationship recognition module then extracts entity relationships from the entity recognition results to obtain entity pairs with entity relationships, where the entity relationships include association relationships between any two entity types, generated based on the dependency relationship type.
  • finally, the generation module calculates the confidence of each entity pair, which is used to evaluate how strongly the pair is associated, and generates the target data.
  • using multi-layer bidirectional recurrent neural networks (LSTM, GRU) in this way strengthens the model's learning ability, achieves automatic extraction of entity relationships, and solves the technical problem in the prior art that manually extracting medical entity relationships is time-consuming, labor-intensive, and inefficient; the confidence score algorithm further improves the reference value and accuracy of the target data.
  • during entity recognition, the disambiguation module disambiguates abbreviated entity names using document self-consistency: entities that have appeared in the medical text are retrieved as the entities corresponding to the abbreviations, which resolves the ambiguity of abbreviated entity names and further improves the accuracy of the entity recognition results.
  • the present application also provides a computer device 6, which may comprise multiple computer devices.
  • the components of the medical entity relationship extraction device 5 of the second embodiment can be distributed across different computer devices 6.
  • the computer device 6 can be a smartphone, a tablet, a laptop, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server, or a server cluster composed of multiple servers) that executes the program, etc.
  • the computer device of this embodiment includes at least, but is not limited to, a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, some or all of the steps of the above method are implemented.
  • the computer device may also include a network interface and/or a medical entity relationship extraction device.
  • for example, the memory 61, the processor 62, the network interface 63, and the medical entity relationship extraction device 5 can be communicatively connected to each other through a system bus, as shown in FIG. 8.
  • FIG. 8 only shows a computer device with these components, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • the memory 61 includes at least one type of computer-readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the memory 61 may be an internal storage unit of a computer device, such as a hard disk or a memory of the computer device.
  • the memory 61 may also be an external storage device of the computer device, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device.
  • the memory 61 may also include both the internal storage unit of the computer device and its external storage device.
  • the memory 61 is generally used to store the operating system and the various application software installed on the computer device, such as the program code of the medical entity relationship extraction device 5 of the second embodiment.
  • the memory 61 may also be used to temporarily store various types of data that have been output or will be output.
  • the processor 62 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 62 is generally used to control the overall operation of the computer equipment.
  • the processor 62 is used to run the program code or process the data stored in the memory 61, for example, to run the medical entity relationship extraction device, so as to implement the medical entity relationship extraction method of the first embodiment.
  • the network interface 63 may include a wireless network interface or a wired network interface, and the network interface 63 is generally used to establish a communication connection between the computer device 6 and other computer devices 6.
  • the network interface 63 is used to connect the computer device 6 to an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer device 6 and the external terminal.
  • the network may be an intranet, the Internet, Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
  • FIG. 8 only shows the computer device 6 with components 61-63, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
  • the medical entity relationship extraction device 5 stored in the memory 61 can also be divided into one or more program modules, which are stored in the memory 61 and executed by one or more processors (in this embodiment, the processor 62) to carry out this application.
  • this application also provides a computer-readable storage system (computer-readable storage medium), which includes multiple storage media, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, server, app store, etc., on which computer programs are stored; when the programs are executed by the processor 62, the corresponding functions are realized.
  • the computer-readable storage medium of this embodiment is used to store the medical entity relationship extraction device, and when executed by the processor 62, the medical entity relationship extraction method of the first embodiment is implemented.
  • the storage medium involved in this application may be non-volatile or volatile.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

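A medical entity relationship extraction method and device, a computer device, and a readable storage medium, relating to the technical field of natural language processing. The method comprises: obtaining medical text, and obtaining multiple data items to be processed based on the medical text (S100); using a first model to perform medical named entity recognition on each data item to be processed, and obtaining the entity recognition result corresponding to each data item (S200); performing entity relationship extraction based on the entity recognition results, and obtaining entity pairs with entity relationships (S300); and calculating the confidence of the entity pairs based on the entity relationships, and generating target data based on each entity pair, its entity relationship, and the corresponding confidence (S400). This solves the problem in the prior art that manually extracting medical entity relationships is time-consuming, labor-intensive, and inefficient.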

Description

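Medical entity relationship extraction method and device, computer device, and readable storage medium
This application claims priority to Chinese patent application No. 202011123634.8, filed with the Chinese Patent Office on October 20, 2020 and entitled "Medical entity relationship extraction method and device, computer device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of natural language processing, and in particular to a medical entity relationship extraction method and device, a computer device, and a readable storage medium.
Background
The biomedical literature contains rich, cutting-edge biomedical knowledge and is an important knowledge base for researchers in the biomedical field. The inventors realized that entity relationships drawn from the biomedical literature are an important subject in building medical knowledge graphs, and are also the foundation of smart healthcare applications such as intelligent triage, consultation, and clinical decision support.
However, the inventors found that existing knowledge bases of relationships between biomedical entities are essentially built by expert labor. Their coverage of medical relationship knowledge is small and their scale is limited. As the volume of medical literature grows exponentially, relying solely on experts to manually edit and organize knowledge cannot produce a complete medical relationship knowledge base; the manual workload is heavy, the efficiency is low, and the cost is high.
Summary
The purpose of this application is to provide a medical entity relationship extraction method and device, a computer device, and a readable storage medium, to solve the technical problem in the prior art that manually extracting medical entity relationships is time-consuming, labor-intensive, and inefficient.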
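To achieve the above objective, this application provides a medical entity relationship extraction method, comprising:
obtaining medical text, and obtaining multiple data items to be processed based on the medical text;
using a first model to perform medical named entity recognition on each data item to be processed, and obtaining the entity recognition result corresponding to each data item to be processed;
performing entity relationship extraction based on the entity recognition results, and obtaining entity pairs with entity relationships;
calculating the confidence of the entity pairs based on the entity relationships, and generating target data based on each entity pair, its entity relationship, and the corresponding confidence.
To achieve the above objective, this application also provides a medical entity relationship extraction device, comprising:
an acquisition module, configured to obtain medical text and obtain multiple data items to be processed based on the medical text;
an entity recognition module, configured to use a first model to perform medical named entity recognition on each data item to be processed, and obtain the entity recognition result corresponding to each data item to be processed;
a relationship recognition module, configured to perform entity relationship extraction based on the entity recognition results, and obtain entity pairs with entity relationships;
a generation module, configured to calculate the confidence of the entity pairs based on the entity relationships, and generate target data based on each entity pair, its entity relationship, and the corresponding confidence.
To achieve the above objective, this application also provides a computer device, the computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following method when executing the computer program:
obtaining medical text, and obtaining multiple data items to be processed based on the medical text;
using a first model to perform medical named entity recognition on each data item to be processed, and obtaining the entity recognition result corresponding to each data item to be processed;
performing entity relationship extraction based on the entity recognition results, and obtaining entity pairs with entity relationships;
calculating the confidence of the entity pairs based on the entity relationships, and generating target data based on each entity pair, its entity relationship, and the corresponding confidence.
To achieve the above objective, this application also provides a computer-readable storage medium comprising multiple storage media, each storing a computer program, the computer programs stored on the multiple storage media jointly implementing the following method when executed by a processor:
obtaining medical text, and obtaining multiple data items to be processed based on the medical text;
using a first model to perform medical named entity recognition on each data item to be processed, and obtaining the entity recognition result corresponding to each data item to be processed;
performing entity relationship extraction based on the entity recognition results, and obtaining entity pairs with entity relationships;
calculating the confidence of the entity pairs based on the entity relationships, and generating target data based on each entity pair, its entity relationship, and the corresponding confidence.
This application achieves automatic extraction of medical entity relationships and solves the technical problem in the prior art that manually extracting medical entity relationships is time-consuming, labor-intensive, and inefficient.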
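Brief Description of the Drawings
FIG. 1 is a flowchart of Embodiment 1 of the medical entity relationship extraction method of this application;
FIG. 2 is a flowchart, in Embodiment 1 of the method, of using the first model to perform medical named entity recognition on each data item to be processed and obtaining the corresponding entity recognition results;
FIG. 3 is a flowchart, in Embodiment 1 of the method, of training the first model before it is used to perform medical named entity recognition on each data item to be processed;
FIG. 4 is a flowchart, in Embodiment 1 of the method, of the processing performed before entity relationship extraction based on the entity recognition results to obtain entity pairs with entity relationships;
FIG. 5 is a flowchart, in Embodiment 1 of the method, of performing entity relationship extraction based on the entity recognition results to obtain entity pairs with entity relationships;
FIG. 6 is a flowchart, in Embodiment 1 of the method, of calculating the confidence of the entity pairs based on the entity relationships and generating target data based on each entity pair, its entity relationship, and the corresponding confidence;
FIG. 7 is a schematic diagram of the program modules of Embodiment 2 of the medical entity relationship extraction device of this application;
FIG. 8 is a schematic diagram of the hardware structure of the computer device in Embodiment 3 of this application.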
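Reference numerals:
5. Medical entity relationship extraction device 51. Acquisition module 52. Entity recognition module
53. Relationship recognition module 531. Collection module 532. Recognition module
533. Relationship determination module 54. Generation module 55. Disambiguation module
6. Computer device 61. Memory 62. Processor 63. Network interface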
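Detailed Description
To make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only serve to explain this application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this application without creative effort fall within the scope of protection of this application.
It should be noted that, where there is no conflict, the embodiments of this application and the features in the embodiments can be combined with each other.
The technical solution of this application can be applied in the fields of artificial intelligence, smart cities, digital healthcare, blockchain, and/or big data to achieve automatic extraction of medical entity relationships. Optionally, the data involved in this application, such as the medical text, the confidence, and/or the target data, can be stored in a database or in a blockchain, for example through distributed blockchain storage; this application does not limit this.
The medical entity relationship extraction method and device, computer device, and readable storage medium provided by this application offer a medical entity relationship extraction method based on an acquisition module, an entity recognition module, a relationship recognition module, and a generation module. The acquisition module obtains the data to be processed from medical text; the first model in the entity recognition module processes the data to obtain entity recognition results covering 10 entity types; the relationship recognition module then extracts entity relationships from the entity recognition results to obtain entity pairs with entity relationships, where there are 150 entity relationships, namely association relationships between any two entity types generated based on the dependency relationship type; finally, the generation module calculates the confidence of each entity pair, which is used to evaluate how strongly the pair is associated, and generates the target data. This achieves automatic extraction of entity relationships and solves the technical problem in the prior art that manually extracting medical entity relationships is time-consuming, labor-intensive, and inefficient.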
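Embodiment 1
Referring to FIG. 1, the medical entity relationship extraction method of this embodiment is applied on the server side and includes the following steps:
S100: obtaining medical text, and obtaining multiple data items to be processed based on the medical text;
Specifically, this solution uses a large volume of medical text for entity relationship extraction. After the medical text is obtained, it can be pre-screened and parsed according to preset rules, and each resulting single sentence is taken as a data item to be processed; each medical text contains multiple data items to be processed. As an example, the medical text is split at preset marks (such as periods and semicolons), the resulting sentences are screened, and sentences that do not meet a preset condition are removed to obtain the data to be processed. The preset condition may be sentence length, etc. An example of a resulting data item is: "The profile of the ACE makes it a therapeutic target for heart failure."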
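The splitting and screening in step S100 could be sketched roughly as follows. This is only an illustration under assumptions: the delimiter set, the length bounds `MIN_CHARS` and `MAX_CHARS`, and the function name `split_into_items` are hypothetical, since the description only mentions "preset marks (such as periods and semicolons)" and a preset condition such as sentence length.

```python
import re
from typing import List

# Assumed delimiters: split after ., ;, ? or ! when followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.;?!])\s+")
# Assumed length filter; the source does not give concrete bounds.
MIN_CHARS, MAX_CHARS = 20, 500


def split_into_items(medical_text: str) -> List[str]:
    """Split a medical text into sentences and drop those failing the length filter."""
    pieces = (piece.strip() for piece in SENTENCE_END.split(medical_text))
    return [piece for piece in pieces if MIN_CHARS <= len(piece) <= MAX_CHARS]


if __name__ == "__main__":
    text = ("The profile of the ACE makes it a therapeutic target for heart failure. "
            "Short one. BRCA1 is not associated with heart failure.")
    for item in split_into_items(text):
        print(item)
```

Requiring whitespace after the delimiter keeps decimals such as 0.5 intact; a production splitter would also need to handle abbreviations such as "Fig." more carefully.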
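S200: using a first model to perform medical named entity recognition on each data item to be processed, and obtaining the entity recognition result corresponding to each data item to be processed;
In this solution, the first model includes a Bert-Embedding layer, a Bi-LSTM network, a Bi-GRU network, and a CRF network arranged in sequence. The entity recognition result includes entities and entity types, and the entity types include gene, disease, chemical (substance), protein, symptom, test (laboratory test), therapy (treatment, including surgery, chemotherapy, radiotherapy, immunotherapy, etc.), microorganism, immune factor, and pathway (biological pathway).
In the above implementation, the input is the data to be processed and the output is information on the recognized multi-category medical entities. Entity recognition uses BIO tagging with three kinds of tags: B marks the start position of an entity in the text, I marks a middle or end position of an entity, and O marks a token that is not part of any entity. In the multi-category medical entity recognition task of this solution, B and I each have 10 categories, namely B-gene, ..., B-pathway and I-gene, ..., I-pathway.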
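To make the BIO scheme concrete, the following minimal sketch builds the label set (10 B- labels, 10 I- labels, and O) and turns a tagged token sequence back into entities. The function name `decode_bio`, the whitespace tokenization, and the policy of closing a span on an inconsistent I- tag are illustrative assumptions, not details given in the source.

```python
from typing import List, Optional, Tuple

ENTITY_TYPES = ["gene", "disease", "chemical", "protein", "symptom",
                "test", "therapy", "microorganism", "immune factor", "pathway"]

# "O" plus one B- and one I- label per entity type: 21 labels in total.
LABELS = ["O"] + [f"{prefix}-{etype}" for prefix in ("B", "I") for etype in ENTITY_TYPES]


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from a BIO-tagged token sequence."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:  # "O", or an I- tag that does not continue the open span
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


if __name__ == "__main__":
    tokens = "The profile of the ACE makes it a therapeutic target for heart failure .".split()
    tags = ["O", "O", "O", "O", "B-gene", "O", "O", "O", "O", "O", "O",
            "B-disease", "I-disease", "O"]
    print(decode_bio(tokens, tags))
```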
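Specifically, using the first model to perform medical named entity recognition on each data item to be processed and obtaining the corresponding entity recognition results, referring to FIG. 2, includes the following steps:
S211: obtaining any data item to be processed, and vectorizing it with the Bert-Embedding layer to obtain a first vector;
This solution uses the pre-trained model BERT to obtain the embedding representation of each word. Pre-trained models such as BERT are pre-trained on the Masked LM task and the Next Sentence Prediction task, and the pre-trained model is then fine-tuned on the specific task (fine-tuning being a parameter-adjustment approach commonly used in machine learning and deep learning). The word embeddings learned by the pre-trained model perform better than those trained with the word2vec network structure commonly used in the prior art. The Bert-Embedding layer produces the vector corresponding to each data item to be processed for subsequent semantic recognition. In this solution, the BERT model was pre-trained on a large corpus of medical literature so that it is suited to natural language processing tasks in the medical domain.
S212: performing semantic recognition on the first vector with the Bi-LSTM network to obtain a second vector;
Specifically, in this solution the concatenated vector of each word from step S211 is input into the Bi-LSTM layer to obtain a hidden vector (the second vector). LSTM is a commonly used recurrent neural network; the Bi-LSTM used here is a bidirectional LSTM, which can learn the forward and backward semantics (that is, the contextual semantics) of each word in the sentence well.
S213: performing semantic recognition on the second vector with the Bi-GRU network to obtain a third vector;
Specifically, in this solution the second vector produced by the Bi-LSTM network is input into the Bi-GRU layer to obtain a hidden vector (the third vector). The first model uses two recurrent layers: the first is a Bi-LSTM and the second is a Bi-GRU. The multi-layer recurrent network of this model can learn a deeper semantic representation.
S214: inputting the third vector into the CRF layer, outputting a predicted tag sequence for entity recognition, and obtaining the entity recognition result corresponding to the data item to be processed according to the predicted tag sequence;
Specifically, in this solution the outputs of S212 and S213 above are the predicted scores for each entity tag, and these scores serve as the input to the CRF layer. The CRF layer can add constraints to the final predicted tag sequence to ensure that it is legal; these constraints are learned automatically by the CRF layer during training. For example, the first word in a sentence always starts with a "B-" or "O" tag rather than "I-". After processing by the CRF layer, the probability of an illegal sequence appearing in the predicted tag sequence is greatly reduced.
S215: obtaining another data item to be processed, and repeating steps S211-S214 above until the entity recognition result corresponding to each data item to be processed is obtained.
As an example in this solution, in "The profile of the ACE makes it a therapeutic target for heart failure.", two medical entities can be recognized: ACE (gene) and heart failure (disease).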
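The effect of the CRF layer's constraints in step S214 can be illustrated with a small decoder. This is a simplified stand-in and not the trained CRF of the source: a real CRF learns soft transition scores from data, whereas the sketch below hard-codes the BIO legality rule (an I-x tag may only follow B-x or I-x, and may not start a sentence) and runs Viterbi decoding over hypothetical per-token scores. The names `allowed` and `constrained_viterbi` are assumptions.

```python
from typing import Dict, List

NEG_INF = float("-inf")


def allowed(prev: str, curr: str) -> bool:
    """BIO legality: I-x may only follow B-x or I-x; 'START' marks the sentence start."""
    if curr.startswith("I-"):
        return prev in (f"B-{curr[2:]}", f"I-{curr[2:]}")
    return True


def constrained_viterbi(scores: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring legal tag sequence given per-token tag scores."""
    labels = list(scores[0])
    best = {lab: (scores[0][lab] if allowed("START", lab) else NEG_INF) for lab in labels}
    back: List[Dict[str, str]] = []
    for step in scores[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for curr in labels:
            # Best legal predecessor for this tag.
            prev_score, prev = max((best[p], p) for p in labels if allowed(p, curr))
            new_best[curr] = prev_score + step[curr]
            pointers[curr] = prev
        best = new_best
        back.append(pointers)
    tag = max(best, key=best.get)
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


if __name__ == "__main__":
    token_scores = [
        {"O": 0.1, "B-gene": 0.2, "I-gene": 0.9},
        {"O": 0.1, "B-gene": 0.1, "I-gene": 0.8},
        {"O": 0.9, "B-gene": 0.0, "I-gene": 0.0},
    ]
    print(constrained_viterbi(token_scores))
```

Picking the best tag independently per token would yield the illegal sequence I-gene, I-gene, O for these scores; the constrained decoder returns a legal sequence instead.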
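Unlike existing structures that take word2vec word embeddings as input to a single-layer recurrent neural network (RNN), the first model described in this solution uses a two-layer recurrent network to further strengthen the model's learning and improve the accuracy of the resulting entity recognition results.
Before the first model is used to perform medical named entity recognition on each data item to be processed, the first model is trained. Referring to FIG. 3, this includes the following:
S221: obtaining training samples, the training samples including multiple pieces of training data with entity labels;
The entity labels include entities and entity types; they cover the 10 medical entity types described above and are annotated using the BIO scheme.
S222: inputting the training data into the Bert-Embedding layer for vectorization to obtain a first processing vector corresponding to the training data;
S223: performing semantic recognition on the first processing vector with the Bi-LSTM network to obtain a second processing vector;
S224: performing semantic recognition on the second processing vector with the Bi-GRU network to obtain a third processing vector;
S225: inputting the third processing vector into the CRF layer, outputting a predicted tag sequence for entity recognition, and obtaining a sample target result according to the predicted tag sequence;
Specifically, steps S222-S225 in the above training process are the same as steps S211-S214 in the above recognition process: the pre-trained model BERT is used to obtain the embedding representation of each word in the training data, and the resulting word vectors are then fed in turn into the Bi-LSTM and Bi-GRU layers, which perform two rounds of semantic recognition to obtain the third processing vector. The two-layer recurrent network arrangement (Bi-LSTM + Bi-GRU) can learn a deeper semantic representation. Finally, the third processing vector is input into the CRF layer to obtain the entity recognition result (the sample target result) corresponding to the training data.
S226: comparing the sample target result with the entity labels corresponding to the training data, and adjusting the parameters of the first model until training is complete, to obtain the trained first model.
This solution trains the first model on a large number of training samples to ensure that its results are highly accurate.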
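Step S200 above performs medical named entity recognition. However, medical text often contains entity names in abbreviated form, and according to statistics each medical abbreviation corresponds to multiple medical entities, so unresolved abbreviations can easily generate a great deal of erroneous knowledge. Handling abbreviation disambiguation for medical entity names is therefore important. Accordingly, before step S300 performs entity relationship extraction based on the entity recognition results to obtain entity pairs with entity relationships, abbreviation disambiguation must be performed on the entity names. Referring to FIG. 4, this includes the following:
providing a preset entity database, the entity database containing the abbreviated names of multiple entities and the entity names corresponding to each abbreviated name;
S231: obtaining an entity name list based on the entity recognition results;
Specifically, the entity name list is obtained by collecting the entities in the entity recognition results.
S232: obtaining the abbreviated names in the entity name list, and obtaining from the entity database the entity names corresponding to each abbreviated name as candidate entity names;
As a non-limiting example, the abbreviated name HF can correspond to multiple entity names such as Heart failure, Hydrofluoric acid, Helical Factor, finger protein, and complement factor H.
S233: searching the medical text for each candidate entity name, and taking the candidate entity name that matches the medical text as the entity name corresponding to the abbreviated name;
This solution uses abbreviation disambiguation based on document self-consistency. Specifically, when an abbreviated name appears in a medical text, the corresponding non-abbreviated name generally also appears in it, so it suffices to search the medical text for candidate entity names that have appeared. For example, if a recognized entity mention is HF and the non-abbreviated entity name Heart failure appears in the full text that the data item comes from, then HF in that sentence corresponds to the entity Heart failure rather than to other entities that share the abbreviation HF, such as Hydrofluoric acid. This reduces the ambiguity caused by abbreviated names.
S234: updating the entity recognition results based on the entity name corresponding to the abbreviated name.
Specifically, after step S233 obtains the entity name corresponding to the abbreviated name, the abbreviated name is replaced with the full entity name. This further improves the accuracy of the recognition results and also benefits the accuracy of the entity relationship extraction in the subsequent step S300.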
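Steps S231-S234 might look roughly like the sketch below. The in-memory dictionary `ENTITY_DB` stands in for the preset entity database, and the case-insensitive substring match and the "first matching candidate wins" policy are assumptions; the source does not say how matching is performed or how several matching candidates would be ranked.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the preset entity database (abbreviation -> full names).
ENTITY_DB: Dict[str, List[str]] = {
    "HF": ["Heart failure", "Hydrofluoric acid", "Helical Factor", "complement factor H"],
}


def resolve_abbreviation(abbrev: str, medical_text: str,
                         entity_db: Dict[str, List[str]] = ENTITY_DB) -> Optional[str]:
    """Return the first candidate full name that also occurs in the same medical text."""
    lowered = medical_text.lower()
    for candidate in entity_db.get(abbrev, []):
        if candidate.lower() in lowered:
            return candidate
    return None


def update_entities(entities: List[Tuple[str, str]],
                    medical_text: str) -> List[Tuple[str, str]]:
    """Replace abbreviated entity names with the full name found in the text, if any."""
    return [(resolve_abbreviation(name, medical_text) or name, etype)
            for name, etype in entities]


if __name__ == "__main__":
    doc = "Heart failure (HF) is common. ACE inhibitors are used in HF."
    print(update_entities([("HF", "disease"), ("ACE", "gene")], doc))
```

Entities without a database entry, or whose candidates never appear in the text, are left unchanged in this sketch.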
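S300: performing entity relationship extraction based on the entity recognition results, and obtaining entity pairs with entity relationships;
Based on the above entity types, the entity relationships include association relationships between any two entity types, generated based on the dependency relationship type. The dependency relationship types include positive semantics, negative semantics (neg), and undeterminable semantics (unclear). There are 50 relationships between any two entity types, and each of these takes 3 forms, giving 150 entity relationships in total, including but not limited to gene-gene, ..., gene-pathway; disease-protein, ..., disease-pathway; ...; immune factor-pathway, and so on. Each relationship type has a neg, an unclear, or a positive variant, for example gene-gene-neg and gene-gene-unclear.
Specifically, performing entity relationship extraction based on the entity recognition results in step S300 to obtain entity pairs with entity relationships, referring to FIG. 5, includes the following:
S310: obtaining the entity recognition result corresponding to any data item to be processed, and obtaining entity pairs and entity types based on the entity recognition result;
S320: identifying the dependency relationship type of the data item to be processed, the dependency relationship types including positive semantics, negative semantics, and undeterminable semantics;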
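Specifically, the dependency relationship type of the data to be processed is identified using dependency parsing from natural language processing. Dependency parsing explains the syntactic structure of a language unit by analyzing the dependency relationships between its components, and holds that the core verb of a sentence is the central component that governs the other components. Examples follow. Positive semantics: "The profile of the ACE makes it a therapeutic target for heart failure.", in which two medical entities are recognized, ACE (gene) and heart failure (disease). From this, one piece of knowledge <ACE, heart failure, gene-disease> can be obtained, and the stored record format is: <ACE, heart failure, gene-disease>. Negative semantics: "BRCA1 is not associated with heart failure.", in which the recognized medical entities are BRCA1 (gene) and heart failure (disease). The dependency relationships of "associated" (Figure PCTCN2020135082-appb-000001) show negative semantics (neg), so one piece of knowledge <BRCA1, heart failure, gene-disease-neg> is obtained. Undeterminable semantics: "However, whether GHRP have a beneficial effect on CHF is unclear.", in which the recognized medical entities are GHRP (gene) and CHF (disease). The root node of the dependency parse of this sentence is "unclear", a word whose meaning is undeterminable, so one piece of knowledge <GHRP, CHF, gene-disease-unclear> is obtained.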
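S330: generating an entity relationship according to the dependency relationship type, the entity pair, and the entity types, and obtaining an entity pair with an entity relationship;
Specifically, as the examples in step S320 show, concatenating the entity pair with the entity relationship yields an entity pair with an entity relationship, for example <GHRP, CHF, gene-disease-unclear>.
S340: obtaining all entity pairs with entity relationships based on the entity recognition results corresponding to the respective data items to be processed.
In the above implementation, identifying the dependency relationship type of each data item determines the entity relationship of the entity pairs in it. This overcomes the limitation of the prior art, which can only determine entity relationships with positive semantics, and improves the accuracy with which entity relationships are determined.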
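Steps S310-S340 could be sketched as follows, assuming a dependency parse is already available as a list of (token, dependency label, head index) tuples. The source does not name a parser or its label set, so the `ROOT` and `neg` labels, the cue-word sets, the hand-written parse in the example, and the choice to order a pair by appearance in the sentence are all assumptions.

```python
from itertools import combinations
from typing import List, Tuple

# Assumed cue words; the source only gives "not" and "unclear" as examples.
NEGATION_CUES = {"not", "no", "never", "without"}
UNCERTAIN_ROOTS = {"unclear", "unknown", "uncertain"}

# A parsed token: (text, dependency label, index of its head); the root heads itself.
Token = Tuple[str, str, int]


def dependency_type(parse: List[Token]) -> str:
    """Return 'unclear', 'neg', or '' (positive semantics) for a parsed sentence."""
    root = next(i for i, (_, dep, _) in enumerate(parse) if dep == "ROOT")
    if parse[root][0].lower() in UNCERTAIN_ROOTS:
        return "unclear"
    for text, dep, head in parse:
        if head == root and (dep == "neg" or text.lower() in NEGATION_CUES):
            return "neg"
    return ""


def build_relations(entities: List[Tuple[str, str]],
                    parse: List[Token]) -> List[Tuple[str, str, str]]:
    """Pair up the recognized entities and attach the relation label to each pair."""
    suffix = dependency_type(parse)
    triples = []
    for (head, head_type), (tail, tail_type) in combinations(entities, 2):
        rel = f"{head_type}-{tail_type}" + (f"-{suffix}" if suffix else "")
        triples.append((head, tail, rel))
    return triples


if __name__ == "__main__":
    # Hand-written parse of "BRCA1 is not associated with heart failure."
    parse = [("BRCA1", "nsubjpass", 3), ("is", "auxpass", 3), ("not", "neg", 3),
             ("associated", "ROOT", 3), ("with", "prep", 3), ("heart", "compound", 6),
             ("failure", "pobj", 4), (".", "punct", 3)]
    print(build_relations([("BRCA1", "gene"), ("heart failure", "disease")], parse))
```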
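S400: calculating the confidence of the entity pairs based on the entity relationships, and generating target data based on each entity pair, its entity relationship, and the corresponding confidence.
A confidence score algorithm is applied to the extracted medical entity pairs to evaluate the confidence that each extracted entity pair is associated; the higher the confidence, the stronger the association of that entity pair.
Specifically, calculating the confidence of the entity pairs based on the entity relationships in step S400 and generating target data based on each entity pair, its entity relationship, and the corresponding confidence, referring to FIG. 6, includes the following steps:
S410: obtaining the frequency of occurrence in the medical text of each entity pair and its corresponding entity relationships;
Specifically, all extracted entity pairs and entity relationships can be converted by simple counting into the following data format: <head_entity, tail_entity, rel, nums, nums_neg, nums_unclear>, where head_entity and tail_entity are the head and tail entities of the piece of knowledge and rel is the relationship type. nums_neg is the number of times the entity pair was extracted with negative semantics (that is, its frequency of occurrence in the medical text), nums_unclear is the number of times it was extracted with undeterminable semantics, and nums is the number of times it was extracted with positive semantics. For example, the extracted entity pair <ACE, heart failure> is converted by simple counting into <ACE, heart failure, gene-disease, 964, 2, 6>, meaning that the relationship type of the entity pair is gene-disease and that it was extracted 2 times with neg semantics, 6 times with unclear semantics, and 964 times with positive semantics.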
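The "simple counting" of step S410 can be sketched with the standard library. The function name `aggregate` and the convention that negative and undeterminable relations carry the suffixes `-neg` and `-unclear` follow the examples in the description; everything else is an illustrative assumption.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str, str, int, int, int]


def aggregate(triples: List[Tuple[str, str, str]]) -> List[Record]:
    """Count (head, tail, rel) triples into <head, tail, rel, nums, nums_neg, nums_unclear>."""
    counts: Dict[Tuple[str, str, str], Counter] = defaultdict(Counter)
    for head, tail, rel in triples:
        if rel.endswith("-neg"):
            base, kind = rel[:-len("-neg")], "neg"
        elif rel.endswith("-unclear"):
            base, kind = rel[:-len("-unclear")], "unclear"
        else:
            base, kind = rel, "pos"
        counts[(head, tail, base)][kind] += 1
    return [(h, t, r, c["pos"], c["neg"], c["unclear"]) for (h, t, r), c in counts.items()]


if __name__ == "__main__":
    extracted = ([("ACE", "heart failure", "gene-disease")] * 3
                 + [("ACE", "heart failure", "gene-disease-neg")]
                 + [("ACE", "heart failure", "gene-disease-unclear")] * 2)
    print(aggregate(extracted))
```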
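S420: weighting the frequencies of occurrence corresponding to the entity relationships of the entity pair with preset weights, and taking the ratio of the weighted value to the unweighted value as the confidence of the entity pair;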
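Specifically, the confidence calculation can be expressed by the following formula.
For each extracted entity pair, the confidence Confidence is (Figure PCTCN2020135082-appb-000002, reconstructed here from the weighting described in step S420 and the worked example that follows):
$$\mathrm{Confidence} = \frac{\alpha_0 \cdot nums + \alpha_1 \cdot nums\_neg + \alpha_2 \cdot nums\_unclear}{nums + nums\_neg + nums\_unclear}$$
where α0, α1, and α2 are the corresponding weighting coefficients, set in this solution to α0 = 1, α1 = -1, and α2 = 0.5. This score is the confidence score of each medical entity pair; the larger the score, the more likely the entity pair is associated. Taking the above entity pair <ACE, heart failure> as an example, its confidence is (Figure PCTCN2020135082-appb-000003):
$$\mathrm{Confidence} = \frac{1 \times 964 + (-1) \times 2 + 0.5 \times 6}{964 + 2 + 6} = \frac{965}{972} \approx 0.9928$$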
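S430: generating target data based on each entity pair, its entity relationship, and the corresponding confidence.
In this solution, the generated target data has the form <head_entity, tail_entity, rel, confidence>. Taking the confidence from step S420 as an example, this gives <ACE, heart failure, gene-disease, 0.9928>. Calculating the confidence in this way further improves the reference value of the extraction results, and entity pairs suitable for different scenarios can later be selected based on the confidence in the target data.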
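Steps S420 and S430 can be sketched as below. The formula is an inference rather than a quotation: it reads "the ratio of the weighted value to the unweighted value" as a weighted sum of the three counts divided by their plain sum, with the stated weights 1, -1, and 0.5 applied to the positive, negative, and undeterminable counts respectively. That reading reproduces the 0.9928 given in the description for the counts 964, 2, and 6. The function names and the rounding to four decimals are assumptions.

```python
from typing import Tuple


def confidence(nums: int, nums_neg: int, nums_unclear: int,
               a0: float = 1.0, a1: float = -1.0, a2: float = 0.5) -> float:
    """Weighted sum of the three counts divided by their unweighted sum."""
    total = nums + nums_neg + nums_unclear
    if total == 0:
        raise ValueError("entity pair has no recorded occurrences")
    return (a0 * nums + a1 * nums_neg + a2 * nums_unclear) / total


def to_target(record: Tuple[str, str, str, int, int, int]) -> Tuple[str, str, str, float]:
    """Convert <head, tail, rel, nums, nums_neg, nums_unclear> to <head, tail, rel, confidence>."""
    head, tail, rel, nums, nums_neg, nums_unclear = record
    return (head, tail, rel, round(confidence(nums, nums_neg, nums_unclear), 4))


if __name__ == "__main__":
    print(to_target(("ACE", "heart failure", "gene-disease", 964, 2, 6)))
```

With these weights the score ranges from -1 (only negative mentions) to 1 (only positive mentions), so a downstream consumer could filter target data by a threshold chosen for its scenario.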
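The above entity pairs, entity relationships, and corresponding target data can be uploaded to a blockchain for later use as reference samples or training samples. Uploading to the blockchain ensures their security and their fairness and transparency to users. User equipment can download the digest information from the blockchain to verify whether the priority list has been tampered with, and the voice file corresponding to the amount data can later be downloaded from the blockchain for voice broadcast without a generation process, which effectively improves voice processing efficiency.
The blockchain referred to in this application is a new application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
This application presets 10 types of medical entity and 55 types of medical relationship between different entities, so the extracted entity pairs have high coverage and variety. The pre-trained model BERT is used to obtain word embeddings, and multi-layer bidirectional recurrent neural networks (Bi-LSTM, Bi-GRU) are used to strengthen the model's learning ability and improve the accuracy of the entity recognition results. This solution also retrieves entities that have appeared in the medical text as the entities corresponding to abbreviated entity names, which resolves the ambiguity of abbreviations; determines the relationship of entity pairs by dependency parsing, including for data items containing negative or undeterminable semantics; and finally uses the confidence score algorithm to calculate the confidence that each extracted medical entity pair is associated, further improving the reference value and accuracy of the target data.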
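Embodiment 2:
Referring to FIG. 7, a medical entity relationship extraction device 5 of this embodiment includes: an acquisition module 51, an entity recognition module 52, a relationship recognition module 53, and a generation module 54.
The acquisition module 51 is configured to obtain medical text and obtain multiple data items to be processed based on the medical text;
The entity recognition module 52 is configured to use the first model to perform medical named entity recognition on each data item to be processed, and obtain the entity recognition result corresponding to each data item to be processed;
The entity types include gene, disease, chemical (substance), protein, symptom, test (laboratory test), therapy (treatment, including surgery, chemotherapy, radiotherapy, immunotherapy, etc.), microorganism, immune factor, and pathway (biological pathway). The first model includes a Bert-Embedding layer, a Bi-LSTM network, a Bi-GRU network, and a CRF network arranged in sequence.
The relationship recognition module 53 is configured to perform entity relationship extraction based on the entity recognition results, and obtain entity pairs with entity relationships;
The relationship recognition module 53 further includes the following:
a collection module 531, configured to obtain the entity recognition result corresponding to any data item to be processed, and obtain entity pairs and entity types based on the entity recognition result;
a recognition module 532, configured to identify the dependency relationship type of the data item to be processed, the dependency relationship types including positive semantics, negative semantics, and undeterminable semantics;
a relationship determination module 533, configured to generate an entity relationship according to the dependency relationship type, the entity pair, and the entity types, and obtain an entity pair with an entity relationship.
The entity relationships include association relationships between any two of the entity types, generated based on the dependency relationship type.
The generation module 54 is configured to calculate the confidence of the entity pairs based on the entity relationships, and generate target data based on each entity pair, its entity relationship, and the corresponding confidence.
The device further includes a disambiguation module 55, configured to: obtain an entity name list based on the entity recognition results; obtain the abbreviated names in the entity name list, and obtain from the entity database the entity names corresponding to each abbreviated name as candidate entity names; search the medical text for each candidate entity name, and take the candidate entity name that matches the medical text as the entity name corresponding to the abbreviated name; and update the entity recognition results based on the entity name corresponding to the abbreviated name.
This technical solution is based on natural language processing, specifically semantic parsing within speech semantics. The acquisition module obtains the data to be processed from medical text, and the first model in the entity recognition module processes the data to obtain entity recognition results, where the first model includes a Bert-Embedding layer, a Bi-LSTM network, a Bi-GRU network, and a CRF network arranged in sequence, and the entity recognition results cover 10 entity types. The relationship recognition module then extracts entity relationships from the entity recognition results to obtain entity pairs with entity relationships, where the entity relationships include association relationships between any two entity types generated based on the dependency relationship type. Finally, the generation module calculates the confidence of each entity pair, which is used to evaluate how strongly the pair is associated, and generates the target data. Using multi-layer bidirectional recurrent neural networks (LSTM, GRU) in this way strengthens the model's learning ability, achieves automatic extraction of entity relationships, and solves the technical problem in the prior art that manually extracting medical entity relationships is time-consuming, labor-intensive, and inefficient; the confidence score algorithm further improves the reference value and accuracy of the target data. In this application, during entity recognition the disambiguation module also disambiguates abbreviated entity names using document self-consistency: entities that have appeared in the medical text are retrieved as the entities corresponding to the abbreviations, which resolves the ambiguity of abbreviated entity names and further improves the accuracy of the entity recognition results.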
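Embodiment 3:
To achieve the above objective, this application also provides a computer device 6, which may comprise multiple computer devices. The components of the medical entity relationship extraction device 5 of Embodiment 2 can be distributed across different computer devices 6. The computer device 6 can be a smartphone, tablet, laptop, desktop computer, rack server, blade server, tower server, or cabinet server (including an independent server, or a server cluster composed of multiple servers) that executes the program. The computer device of this embodiment includes at least, but is not limited to, a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, some or all of the steps of the above method are implemented. Optionally, the computer device may also include a network interface and/or a medical entity relationship extraction device. For example, the memory 61, the processor 62, the network interface 63, and the medical entity relationship extraction device 5 can be communicatively connected to each other through a system bus, as shown in FIG. 8. It should be pointed out that FIG. 8 only shows a computer device with these components, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
In this embodiment, the memory 61 includes at least one type of computer-readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 61 may be an internal storage unit of the computer device, such as its hard disk or memory. In other embodiments, the memory 61 may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device. Of course, the memory 61 may also include both the internal storage unit of the computer device and its external storage device. In this embodiment, the memory 61 is generally used to store the operating system and the various application software installed on the computer device, such as the program code of the medical entity relationship extraction device 5 of Embodiment 2. In addition, the memory 61 can also be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 62 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 62 is generally used to control the overall operation of the computer device. In this embodiment, the processor 62 is used to run the program code or process the data stored in the memory 61, for example to run the medical entity relationship extraction device, so as to implement the medical entity relationship extraction method of Embodiment 1.
The network interface 63 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 6 and other computer devices 6. For example, the network interface 63 is used to connect the computer device 6 to an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer device 6 and the external terminal. The network may be an intranet, the Internet, Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
It should be pointed out that FIG. 8 only shows the computer device 6 with components 61-63, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
In this embodiment, the medical entity relationship extraction device 5 stored in the memory 61 can also be divided into one or more program modules, which are stored in the memory 61 and executed by one or more processors (in this embodiment, the processor 62) to carry out this application.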
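Embodiment 4:
To achieve the above objective, this application also provides a computer-readable storage system (computer-readable storage medium), which includes multiple storage media, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, server, app store, etc., on which computer programs are stored; when the programs are executed by the processor 62, the corresponding functions are realized. The computer-readable storage medium of this embodiment is used to store the medical entity relationship extraction device, and when executed by the processor 62, implements the medical entity relationship extraction method of Embodiment 1.
Optionally, the storage medium involved in this application may be non-volatile or volatile.
The serial numbers of the above embodiments of this application are for description only and do not indicate the relative merits of the embodiments.
From the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
The above are only preferred embodiments of this application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect use in other related technical fields, is likewise included in the scope of patent protection of this application.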

Claims (20)

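  1. A medical entity relationship extraction method, comprising:
    obtaining medical text, and obtaining multiple data items to be processed based on the medical text;
    using a first model to perform medical named entity recognition on each data item to be processed, and obtaining the entity recognition result corresponding to each data item to be processed;
    performing entity relationship extraction based on the entity recognition results, and obtaining entity pairs with entity relationships;
    calculating the confidence of the entity pairs based on the entity relationships, and generating target data based on each entity pair, its entity relationship, and the corresponding confidence.
  2. The medical entity relationship extraction method according to claim 1, wherein using the first model to perform medical named entity recognition on each data item to be processed and obtaining the entity recognition result corresponding to each data item to be processed comprises:
    obtaining any data item to be processed, and vectorizing it with the Bert-Embedding layer to obtain a first vector;
    performing semantic recognition on the first vector with the Bi-LSTM network to obtain a second vector;
    performing semantic recognition on the second vector with the Bi-GRU network to obtain a third vector;
    inputting the third vector into the CRF layer, outputting a predicted tag sequence for entity recognition, and obtaining the entity recognition result corresponding to the data item to be processed according to the predicted tag sequence;
    obtaining another data item to be processed, and repeating the above until the entity recognition result corresponding to each data item to be processed is obtained.
  3. The medical entity relationship extraction method according to claim 1, wherein, before performing entity relationship extraction based on the entity recognition results and obtaining entity pairs with entity relationships, the method comprises:
    providing a preset entity database, the entity database containing the abbreviated names of multiple entities and the entity names corresponding to each abbreviated name;
    obtaining an entity name list based on the entity recognition results;
    obtaining the abbreviated names in the entity name list, and obtaining from the entity database the entity names corresponding to each abbreviated name as candidate entity names;
    searching the medical text for each candidate entity name, and taking the candidate entity name that matches the medical text as the entity name corresponding to the abbreviated name;
    updating the entity recognition results based on the entity name corresponding to the abbreviated name.
  4. The medical entity relationship extraction method according to claim 1, wherein performing entity relationship extraction based on the entity recognition results and obtaining entity pairs with entity relationships comprises:
    obtaining the entity recognition result corresponding to any data item to be processed, and obtaining entity pairs and entity types based on the entity recognition result;
    identifying the dependency relationship type of the data item to be processed, the dependency relationship types including positive semantics, negative semantics, and undeterminable semantics;
    generating an entity relationship according to the dependency relationship type, the entity pair, and the entity types, and obtaining an entity pair with an entity relationship;
    obtaining all entity pairs with entity relationships based on the entity recognition results corresponding to the respective data items to be processed.
  5. The medical entity relationship extraction method according to claim 1, wherein calculating the confidence of the entity pairs based on the entity relationships and generating target data based on each entity pair, its entity relationship, and the corresponding confidence comprises:
    obtaining the frequency of occurrence in the medical text of each entity pair and its corresponding entity relationships;
    weighting the frequencies of occurrence corresponding to the entity relationships of the entity pair with preset weights, and taking the ratio of the weighted value to the unweighted value as the confidence of the entity pair;
    generating target data based on each entity pair, its entity relationship, and the corresponding confidence.
  6. The medical entity relationship extraction method according to claim 1, wherein:
    before using the first model to perform medical named entity recognition on each data item to be processed and obtaining the entity recognition result corresponding to each data item to be processed, the method further comprises training the first model, comprising:
    obtaining training samples, the training samples including multiple pieces of training data with entity labels;
    wherein the entity labels include entities and entity types;
    inputting the training data into the Bert-Embedding layer for vectorization to obtain a first processing vector corresponding to the training data;
    performing semantic recognition on the first processing vector with the Bi-LSTM network to obtain a second processing vector;
    performing semantic recognition on the second processing vector with the Bi-GRU network to obtain a third processing vector;
    inputting the third processing vector into the CRF layer, outputting a predicted tag sequence for entity recognition, and obtaining a sample target result according to the predicted tag sequence;
    comparing the sample target result with the entity labels corresponding to the training data, and adjusting the parameters of the first model until training is complete, to obtain the trained first model.
  7. The medical entity relationship extraction method according to claim 1, wherein:
    the entity recognition result includes entities and entity types, the entity types including gene, disease, chemical, protein, symptom, laboratory test, therapy, microorganism, immune factor, and biological pathway; the entity relationships include associations between any two of the entity types, generated based on the dependency relationship type, and the dependency relationship types include positive semantics, negative semantics, and undeterminable semantics.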
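  8. A medical entity relationship extraction device, comprising:
    an acquisition module, configured to obtain medical text and obtain multiple data items to be processed based on the medical text;
    an entity recognition module, configured to use a first model to perform medical named entity recognition on each data item to be processed, and obtain the entity recognition result corresponding to each data item to be processed;
    a relationship recognition module, configured to perform entity relationship extraction based on the entity recognition results, and obtain entity pairs with entity relationships;
    a generation module, configured to calculate the confidence of the entity pairs based on the entity relationships, and generate target data based on each entity pair, its entity relationship, and the corresponding confidence.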
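  9. A computer device, wherein the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following method when executing the computer program:
    obtaining medical text, and obtaining multiple data items to be processed based on the medical text;
    using a first model to perform medical named entity recognition on each data item to be processed, and obtaining the entity recognition result corresponding to each data item to be processed;
    performing entity relationship extraction based on the entity recognition results, and obtaining entity pairs with entity relationships;
    calculating the confidence of the entity pairs based on the entity relationships, and generating target data based on each entity pair, its entity relationship, and the corresponding confidence.
  10. The computer device according to claim 9, wherein, when using the first model to perform medical named entity recognition on each data item to be processed and obtaining the entity recognition result corresponding to each data item to be processed, the processor specifically implements:
    obtaining any data item to be processed, and vectorizing it with the Bert-Embedding layer to obtain a first vector;
    performing semantic recognition on the first vector with the Bi-LSTM network to obtain a second vector;
    performing semantic recognition on the second vector with the Bi-GRU network to obtain a third vector;
    inputting the third vector into the CRF layer, outputting a predicted tag sequence for entity recognition, and obtaining the entity recognition result corresponding to the data item to be processed according to the predicted tag sequence;
    obtaining another data item to be processed, and repeating the above until the entity recognition result corresponding to each data item to be processed is obtained.
  11. The computer device according to claim 9, wherein, before performing entity relationship extraction based on the entity recognition results and obtaining entity pairs with entity relationships, the processor, when executing the computer program, is further used to implement:
    providing a preset entity database, the entity database containing the abbreviated names of multiple entities and the entity names corresponding to each abbreviated name;
    obtaining an entity name list based on the entity recognition results;
    obtaining the abbreviated names in the entity name list, and obtaining from the entity database the entity names corresponding to each abbreviated name as candidate entity names;
    searching the medical text for each candidate entity name, and taking the candidate entity name that matches the medical text as the entity name corresponding to the abbreviated name;
    updating the entity recognition results based on the entity name corresponding to the abbreviated name.
  12. The computer device according to claim 9, wherein, when performing entity relationship extraction based on the entity recognition results and obtaining entity pairs with entity relationships, the processor specifically implements:
    obtaining the entity recognition result corresponding to any data item to be processed, and obtaining entity pairs and entity types based on the entity recognition result;
    identifying the dependency relationship type of the data item to be processed, the dependency relationship types including positive semantics, negative semantics, and undeterminable semantics;
    generating an entity relationship according to the dependency relationship type, the entity pair, and the entity types, and obtaining an entity pair with an entity relationship;
    obtaining all entity pairs with entity relationships based on the entity recognition results corresponding to the respective data items to be processed.
  13. The computer device according to claim 9, wherein, when calculating the confidence of the entity pairs based on the entity relationships and generating target data based on each entity pair, its entity relationship, and the corresponding confidence, the processor specifically implements:
    obtaining the frequency of occurrence in the medical text of each entity pair and its corresponding entity relationships;
    weighting the frequencies of occurrence corresponding to the entity relationships of the entity pair with preset weights, and taking the ratio of the weighted value to the unweighted value as the confidence of the entity pair;
    generating target data based on each entity pair, its entity relationship, and the corresponding confidence.
  14. The computer device according to claim 9, wherein:
    before using the first model to perform medical named entity recognition on each data item to be processed and obtaining the entity recognition result corresponding to each data item to be processed, the processor is further used to train the first model, comprising:
    obtaining training samples, the training samples including multiple pieces of training data with entity labels;
    wherein the entity labels include entities and entity types;
    inputting the training data into the Bert-Embedding layer for vectorization to obtain a first processing vector corresponding to the training data;
    performing semantic recognition on the first processing vector with the Bi-LSTM network to obtain a second processing vector;
    performing semantic recognition on the second processing vector with the Bi-GRU network to obtain a third processing vector;
    inputting the third processing vector into the CRF layer, outputting a predicted tag sequence for entity recognition, and obtaining a sample target result according to the predicted tag sequence;
    comparing the sample target result with the entity labels corresponding to the training data, and adjusting the parameters of the first model until training is complete, to obtain the trained first model.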
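  15. A computer-readable storage medium comprising multiple storage media, each storing a computer program, wherein the computer programs stored on the multiple storage media jointly implement the following method when executed by a processor:
    obtaining medical text, and obtaining multiple data items to be processed based on the medical text;
    using a first model to perform medical named entity recognition on each data item to be processed, and obtaining the entity recognition result corresponding to each data item to be processed;
    performing entity relationship extraction based on the entity recognition results, and obtaining entity pairs with entity relationships;
    calculating the confidence of the entity pairs based on the entity relationships, and generating target data based on each entity pair, its entity relationship, and the corresponding confidence.
  16. The computer-readable storage medium according to claim 15, wherein, when using the first model to perform medical named entity recognition on each data item to be processed and obtaining the entity recognition result corresponding to each data item to be processed, the following is specifically implemented:
    obtaining any data item to be processed, and vectorizing it with the Bert-Embedding layer to obtain a first vector;
    performing semantic recognition on the first vector with the Bi-LSTM network to obtain a second vector;
    performing semantic recognition on the second vector with the Bi-GRU network to obtain a third vector;
    inputting the third vector into the CRF layer, outputting a predicted tag sequence for entity recognition, and obtaining the entity recognition result corresponding to the data item to be processed according to the predicted tag sequence;
    obtaining another data item to be processed, and repeating the above until the entity recognition result corresponding to each data item to be processed is obtained.
  17. The computer-readable storage medium according to claim 15, wherein, before performing entity relationship extraction based on the entity recognition results and obtaining entity pairs with entity relationships, the computer programs stored on the multiple storage media, when executed by a processor, are further used to jointly implement:
    providing a preset entity database, the entity database containing the abbreviated names of multiple entities and the entity names corresponding to each abbreviated name;
    obtaining an entity name list based on the entity recognition results;
    obtaining the abbreviated names in the entity name list, and obtaining from the entity database the entity names corresponding to each abbreviated name as candidate entity names;
    searching the medical text for each candidate entity name, and taking the candidate entity name that matches the medical text as the entity name corresponding to the abbreviated name;
    updating the entity recognition results based on the entity name corresponding to the abbreviated name.
  18. The computer-readable storage medium according to claim 15, wherein, when performing entity relationship extraction based on the entity recognition results and obtaining entity pairs with entity relationships, the following is specifically implemented:
    obtaining the entity recognition result corresponding to any data item to be processed, and obtaining entity pairs and entity types based on the entity recognition result;
    identifying the dependency relationship type of the data item to be processed, the dependency relationship types including positive semantics, negative semantics, and undeterminable semantics;
    generating an entity relationship according to the dependency relationship type, the entity pair, and the entity types, and obtaining an entity pair with an entity relationship;
    obtaining all entity pairs with entity relationships based on the entity recognition results corresponding to the respective data items to be processed.
  19. The computer-readable storage medium according to claim 15, wherein, when calculating the confidence of the entity pairs based on the entity relationships and generating target data based on each entity pair, its entity relationship, and the corresponding confidence, the following is specifically implemented:
    obtaining the frequency of occurrence in the medical text of each entity pair and its corresponding entity relationships;
    weighting the frequencies of occurrence corresponding to the entity relationships of the entity pair with preset weights, and taking the ratio of the weighted value to the unweighted value as the confidence of the entity pair;
    generating target data based on each entity pair, its entity relationship, and the corresponding confidence.
  20. The computer-readable storage medium according to claim 15, wherein:
    before using the first model to perform medical named entity recognition on each data item to be processed and obtaining the entity recognition result corresponding to each data item to be processed, the computer programs stored on the multiple storage media, when executed by a processor, are further used to jointly implement training of the first model, comprising:
    obtaining training samples, the training samples including multiple pieces of training data with entity labels;
    wherein the entity labels include entities and entity types;
    inputting the training data into the Bert-Embedding layer for vectorization to obtain a first processing vector corresponding to the training data;
    performing semantic recognition on the first processing vector with the Bi-LSTM network to obtain a second processing vector;
    performing semantic recognition on the second processing vector with the Bi-GRU network to obtain a third processing vector;
    inputting the third processing vector into the CRF layer, outputting a predicted tag sequence for entity recognition, and obtaining a sample target result according to the predicted tag sequence;
    comparing the sample target result with the entity labels corresponding to the training data, and adjusting the parameters of the first model until training is complete, to obtain the trained first model.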
PCT/CN2020/135082 2020-10-20 2020-12-10 Medical entity relationship extraction method and device, computer device, and readable storage medium WO2021151353A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011123634.8A CN112256828B (zh) 2020-10-20 2020-10-20 Medical entity relationship extraction method and device, computer device, and readable storage medium
CN202011123634.8 2020-10-20

Publications (1)

Publication Number Publication Date
WO2021151353A1 true WO2021151353A1 (zh) 2021-08-05

Family

ID=74245072

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/135082 WO2021151353A1 (zh) 2020-10-20 2020-12-10 Medical entity relationship extraction method and device, computer device, and readable storage medium

Country Status (2)

Country Link
CN (1) CN112256828B (zh)
WO (1) WO2021151353A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627186A (zh) * 2021-08-12 2021-11-09 平安科技(深圳)有限公司 基于人工智能的实体关系检测方法及相关设备
CN113792115A (zh) * 2021-08-17 2021-12-14 北京百度网讯科技有限公司 实体相关性确定方法、装置、电子设备及存储介质
CN113822420A (zh) * 2021-09-27 2021-12-21 闫鹏 基于容积二氧化碳图的死腔分数的模型建立方法及系统
CN113849658A (zh) * 2021-08-18 2021-12-28 广州国交润万交通信息有限公司 一种知识图谱构建方法及装置、存储介质和计算设备
CN113903420A (zh) * 2021-09-29 2022-01-07 清华大学 一种语义标签确定模型的构建方法、病历解析方法
CN114417875A (zh) * 2022-01-25 2022-04-29 腾讯科技(深圳)有限公司 数据处理方法、装置、设备、可读存储介质及程序产品
CN116110594A (zh) * 2022-12-02 2023-05-12 北京交通大学 基于关联文献的医学知识图谱的知识评价方法及系统
CN117290510A (zh) * 2023-11-27 2023-12-26 浙江太美医疗科技股份有限公司 文档信息抽取方法、模型、电子设备及可读介质

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011170B (zh) * 2021-02-25 2022-10-14 万翼科技有限公司 合同处理方法、电子设备及相关产品
CN112926333A (zh) * 2021-04-09 2021-06-08 平安科技(深圳)有限公司 实体识别方法、装置、电子设备及存储介质
CN113157866B (zh) * 2021-04-27 2024-05-14 平安科技(深圳)有限公司 一种数据分析方法、装置、计算机设备及存储介质
CN114781383A (zh) * 2022-05-05 2022-07-22 医渡云(北京)技术有限公司 特征数据提取方法及装置、可读存储介质、电子设备
WO2024042350A1 (zh) * 2022-08-24 2024-02-29 Evyd科技有限公司 医疗文本数据脱敏方法、装置、介质及电子设备
CN116108163B (zh) * 2023-04-04 2023-06-27 之江实验室 一种文本的匹配方法、装置、设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190220749A1 (en) * 2018-01-17 2019-07-18 Beijing Baidu Netcom Science And Technology Co., Ltd. Text processing method and device based on ambiguous entity words
CN110427611A (zh) * 2019-06-26 2019-11-08 深圳追一科技有限公司 文本处理方法、装置、设备及存储介质
CN110688854A (zh) * 2019-09-02 2020-01-14 平安科技(深圳)有限公司 命名实体识别方法、装置及计算机可读存储介质
CN111428036A (zh) * 2020-03-23 2020-07-17 浙江大学 一种基于生物医学文献的实体关系挖掘方法
CN111625659A (zh) * 2020-08-03 2020-09-04 腾讯科技(深圳)有限公司 知识图谱处理方法、装置、服务器及存储介质
CN111709240A (zh) * 2020-05-14 2020-09-25 腾讯科技(武汉)有限公司 实体关系抽取方法、装置、设备及其存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10409912B2 (en) * 2014-07-31 2019-09-10 Oracle International Corporation Method and system for implementing semantic technology
CN108460012A (zh) * 2018-02-01 2018-08-28 哈尔滨理工大学 一种基于gru-crf的命名实体识别方法
CN110083831B (zh) * 2019-04-16 2023-04-18 武汉大学 一种基于BERT-BiGRU-CRF的中文命名实体识别方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190220749A1 (en) * 2018-01-17 2019-07-18 Beijing Baidu Netcom Science And Technology Co., Ltd. Text processing method and device based on ambiguous entity words
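CN110427611A (zh) * 2019-06-26 2019-11-08 深圳追一科技有限公司 Text processing method, apparatus, device and storage medium
CN110688854A (zh) * 2019-09-02 2020-01-14 平安科技(深圳)有限公司 Named entity recognition method, apparatus and computer-readable storage medium
CN111428036A (zh) * 2020-03-23 2020-07-17 浙江大学 Entity relationship mining method based on biomedical literature
CN111709240A (zh) * 2020-05-14 2020-09-25 腾讯科技(武汉)有限公司 Entity relationship extraction method, apparatus, device and storage medium
CN111625659A (zh) * 2020-08-03 2020-09-04 腾讯科技(深圳)有限公司 Knowledge graph processing method, apparatus, server and storage medium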

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
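CN113627186A (zh) * 2021-08-12 2021-11-09 平安科技(深圳)有限公司 Artificial-intelligence-based entity relationship detection method and related device
CN113627186B (zh) * 2021-08-12 2023-12-22 平安科技(深圳)有限公司 Artificial-intelligence-based entity relationship detection method and related device
CN113792115B (zh) * 2021-08-17 2024-03-22 北京百度网讯科技有限公司 Entity relevance determination method, apparatus, electronic device and storage medium
CN113792115A (zh) * 2021-08-17 2021-12-14 北京百度网讯科技有限公司 Entity relevance determination method, apparatus, electronic device and storage medium
CN113849658A (zh) * 2021-08-18 2021-12-28 广州国交润万交通信息有限公司 Knowledge graph construction method and apparatus, storage medium and computing device
CN113822420A (zh) * 2021-09-27 2021-12-21 闫鹏 Method and system for establishing a dead space fraction model based on volumetric capnography
CN113822420B (zh) * 2021-09-27 2024-04-19 中国航天科工集团七三一医院 Method and system for establishing a dead space fraction model based on volumetric capnography
CN113903420A (zh) * 2021-09-29 2022-01-07 清华大学 Method for constructing a semantic label determination model, and medical record parsing method
CN114417875A (zh) * 2022-01-25 2022-04-29 腾讯科技(深圳)有限公司 Data processing method, apparatus, device, readable storage medium and program product
CN116110594A (zh) * 2022-12-02 2023-05-12 北京交通大学 Knowledge evaluation method and system for a medical knowledge graph based on associated literature
CN116110594B (zh) * 2022-12-02 2024-05-07 北京交通大学 Knowledge evaluation method and system for a medical knowledge graph based on associated literature
CN117290510B (zh) * 2023-11-27 2024-01-30 浙江太美医疗科技股份有限公司 Document information extraction method, model, electronic device and readable medium
CN117290510A (zh) * 2023-11-27 2023-12-26 浙江太美医疗科技股份有限公司 Document information extraction method, model, electronic device and readable medium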

Also Published As

Publication number Publication date
CN112256828B (zh) 2023-08-08
CN112256828A (zh) 2021-01-22

Similar Documents

Publication Publication Date Title
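WO2021151353A1 (zh) Medical entity relationship extraction method, apparatus, computer device and readable storage medium
CN112242187B (zh) Medical plan recommendation system and method based on knowledge graph representation learning
CN109670179B (zh) Named entity recognition method for medical record text based on an iterated dilated convolutional neural network
CN106874643B (zh) Method and system for automatically constructing a knowledge base from word vectors to assist diagnosis and treatment
CN111401066B (zh) Artificial-intelligence-based word classification model training method, word processing method and apparatus
CN110737758A (zh) Method and apparatus for generating a model
WO2023124837A1 (zh) Medical inquiry processing method, apparatus, device and storage medium
CN111026877A (zh) Knowledge verification model construction and analysis method based on probabilistic soft logic
CN115293161A (zh) Rational drug use system and method based on natural language processing and a drug knowledge graph
CN113657105A (zh) Medical entity extraction method, apparatus, device and medium based on vocabulary enhancement
CN116312915B (zh) Method and system for standardized association of drug terms in electronic medical records
CN115954072A (zh) Intelligent clinical trial protocol generation method and related apparatus
CN115858886B (zh) Data processing method, apparatus, device and readable storage medium
WO2024042348A1 (zh) Method, apparatus, medium and electronic device for structuring English medical text
CN113130025A (zh) Entity relationship extraction method, terminal device and computer-readable storage medium
CN116721699B (zh) Intelligent recommendation method based on tumor gene test results
Xing et al. Phenotype extraction based on word embedding to sentence embedding cascaded approach
CN116719840A (zh) Medical information push method based on post-structuring of medical records
Ebrahimi et al. Analysis of Persian Bioinformatics Research with Topic Modeling
CN116011450A (zh) Word segmentation model training method, system, device, storage medium and word segmentation method
CN116072308A (zh) Medical question answering method based on graph path search and semantic indexing, and related device
CN114238558A (zh) Quality inspection method, apparatus, storage medium and device for electronic medical records
CN118538401B (zh) Diabetes consultation interaction method and apparatus based on a large language model
CN115238700B (zh) Biomedical entity extraction method based on multi-task learning
CN117875319B (zh) Method, apparatus and electronic device for acquiring annotated data in the medical domain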

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
Ref document number: 20916683; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 EP: PCT application non-entry in European phase
Ref document number: 20916683; Country of ref document: EP; Kind code of ref document: A1