CN113095081A

CN113095081A - Disease identification method and device, storage medium and electronic device

Info

Publication number: CN113095081A
Application number: CN202110651159.XA
Authority: CN
Inventors: 蒋志燕; 陈诚; 黄石磊
Original assignee: Shenzhen Raisound Technology Co ltd
Current assignee: Shenzhen Raisound Technology Co ltd
Priority date: 2021-06-11
Filing date: 2021-06-11
Publication date: 2021-07-09

Abstract

The invention provides a disease identification method and device, a storage medium and an electronic device. The method comprises: acquiring disease data, wherein the disease data comprises disease name information and disease attribute information; training a first biomedical bidirectional encoder (BioBERT) model based on the disease data to obtain a second BioBERT model, wherein the second BioBERT model comprises a semantic relation between the disease name information and the disease attribute information; and performing disease identification using the second BioBERT model. By training the BioBERT model to construct the semantic relation between the disease name information and the disease attribute information, the invention realizes automatic identification and confirmation of diseases in medical systems such as medical inquiry systems, and solves the technical problem in the related art that identifying diseases in description text is inefficient.

Description

Disease identification method and device, storage medium and electronic device

Technical Field

The invention relates to the field of computers, in particular to a disease identification method and device, a storage medium and an electronic device.

Background

In the related art, diseases have a great influence on human society. Diseases are not only an important research topic in biomedicine but also appear frequently in biomedical natural language processing tasks. Disease knowledge is important for many health-related biomedical tasks, including health care question answering, medical diagnosis reasoning and disease entity identification. At present, however, semantic identification of diseases still relies on manual identification by experts; the identification efficiency is low, which hinders digitization.

In view of the above problems in the related art, no effective solution has been found at present.

Disclosure of Invention

The embodiment of the invention provides a disease identification method and device, a storage medium and an electronic device.

According to an embodiment of the present invention, there is provided a disease identification method, including: acquiring disease data, wherein the disease data comprises disease name information and disease attribute information; training a first biomedical bidirectional encoder (BioBERT) model based on the disease data to obtain a second BioBERT model, wherein the second BioBERT model comprises a semantic relation between the disease name information and the disease attribute information; and performing disease identification using the second BioBERT model.

Optionally, training the first BioBERT model based on the disease data to obtain the second BioBERT model includes: converting the disease data into sample vectors using a tokenizer, wherein the sample vectors comprise a training set, a verification set and a test set; extracting feature information from the sample vectors, wherein the feature information comprises: a start symbol, separator or end symbol of the description text, a first vocabulary identifier, a token identifier of the tokenizer, a mask identifier, a segment identifier and a token position, wherein the first vocabulary identifier identifies the vocabulary used by the tokenizer, the mask identifier indicates which elements of the sequence are tokens and which are padding elements, and the segment identifier identifies the sentence to which the description text belongs; and determining a loss function of the first BioBERT model, and training the first BioBERT model based on the loss function and the feature information to obtain the second BioBERT model.

Optionally, determining the loss function of the first BioBERT model comprises: setting the following disease loss function and attribute loss function:

[formula image: disease loss function]

[formula image: attribute loss function]

obtaining the loss function of the first BioBERT model by combining the disease loss function and the attribute loss function:

[formula image: combined loss function]

wherein β is a preset balance coefficient; λ is a preset weight coefficient; T represents the number of tokens corresponding to the disease name; t indexes the current token of the disease name; x_t represents the t-th token corresponding to the disease name; p(x_t | passage) represents the conditional probability of x_t given the target passage; z_t = w*y_t + b, where w represents a weight, y_t represents the output obtained for x_t after the embedding layer of the BERT model, and b represents a bias; n represents the number of attributes of the disease name; i indexes the current attribute of the disease name; y_t represents the label value, which is 0 or 1; and ln a_i represents the logarithm of the output value obtained for the i-th attribute a_i after the softmax layer of the model.

Optionally, acquiring the disease data comprises: determining a disease name, and collecting disease attribute information corresponding to the disease name, wherein the disease attribute information includes: disease information, etiology information, symptom information, diagnosis information, treatment information, prevention information, pathophysiology information and transmission information; adding a corresponding auxiliary description sentence to the disease attribute information, wherein the auxiliary description sentence indicates the disease name and the attribute name corresponding to the disease attribute information; and storing the disease attribute information and the corresponding auxiliary description sentences by category.

Optionally, the disease identification using the second BioBERT model comprises: receiving a list of candidate diagnoses of a first disease from a medical question and answer system, wherein the list of candidate diagnoses comprises a plurality of candidate diagnoses of the first disease; outputting first disease attribute information corresponding to the first disease by using the second BioBERT model; ranking the list of candidate diagnoses based on the first disease attribute information.

Optionally, the disease identification using the second BioBERT model comprises: receiving from the medical reasoning system prerequisite description information and hypothetical diagnosis information for the second disease; outputting second disease attribute information corresponding to the hypothetical diagnostic information using the second BioBERT model; and if the second disease attribute information is matched with the precondition description information, determining that the diagnosis reasoning between the precondition description information and the assumed diagnosis information is correct.

Optionally, the disease identification using the second BioBERT model comprises: receiving entity description text of a third disease from a disease recognition system; extracting a disease name text and a disease attribute text from the entity description text; outputting third disease attribute information corresponding to the disease name text by adopting the second BioBERT model; and if the third disease attribute information is matched with the disease attribute text, determining that the disease name text is the disease name in the entity description text.
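The three application scenarios above each compare attribute information produced by the second BioBERT model with another piece of text (a candidate diagnosis, a premise description, or an attribute passage). A minimal sketch of that comparison step follows. It assumes the model output is already available as plain text and substitutes a simple word-overlap (Jaccard) score for whatever matching criterion an actual implementation would use; the function names and the 0.5 threshold are hypothetical and do not come from the source.

```python
def jaccard(text_a: str, text_b: str) -> float:
    """Word-overlap similarity in [0, 1]; a stand-in for a learned matching score."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(candidates, attribute_text):
    """Sort candidate diagnoses from most to least similar to the attribute text."""
    return sorted(candidates, key=lambda c: jaccard(c, attribute_text), reverse=True)


def is_match(attribute_text, other_text, threshold=0.5):
    """Hypothetical matching rule: similarity at or above a preset threshold."""
    return jaccard(attribute_text, other_text) >= threshold


if __name__ == "__main__":
    attribute_text = "fever cough sore throat"  # assumed model output
    candidates = ["skin rash itching", "fever and cough", "sore throat fever cough"]
    print(rank_candidates(candidates, attribute_text))
```

The same `is_match` helper could serve the reasoning and entity-confirmation scenarios, where a single yes/no decision is needed rather than a ranking.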

Optionally, training the first BioBERT model based on the loss function and the feature information, and obtaining a second BioBERT model includes: training the first BioBERT model by using the characteristic information of the training set to obtain a first intermediate model; verifying the first intermediate model by using the characteristic information of the verification set to obtain a second intermediate model; adopting the characteristic information of the test set to test the second intermediate model, if the preset test parameters of the second intermediate model accord with preset conditions, outputting the second intermediate model as the second BioBERT model, wherein the preset test parameters comprise: accuracy, recall, F1 value.
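The test-stage check described above can be sketched as follows, assuming binary labels (1 for a correct disease/attribute pairing, 0 otherwise). The threshold values and function names are illustrative assumptions; the source only states that accuracy, recall and F1 must meet preset conditions. Precision is computed because F1 is defined from precision and recall.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def meets_preset_conditions(metrics, thresholds):
    """True if every listed metric reaches its (assumed) minimum value."""
    return all(metrics[name] >= minimum for name, minimum in thresholds.items())


if __name__ == "__main__":
    metrics = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
    print(metrics, meets_preset_conditions(metrics, {"accuracy": 0.5, "f1": 0.6}))
```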

According to another embodiment of the present invention, there is provided a disease identification device, including: an acquisition module, configured to acquire disease data, wherein the disease data comprises disease name information and disease attribute information; a training module, configured to train a first biomedical bidirectional encoder (BioBERT) model based on the disease data to obtain a second BioBERT model, wherein the second BioBERT model comprises a semantic relation between the disease name information and the disease attribute information; and an identification module, configured to perform disease identification using the second BioBERT model.

Optionally, the training module includes: a conversion unit, configured to convert the disease data into sample vectors using a tokenizer, wherein the sample vectors comprise a training set, a verification set and a test set; an extracting unit, configured to extract feature information from the sample vectors, wherein the feature information comprises: a start symbol, separator or end symbol of the description text, a first vocabulary identifier, a token identifier of the tokenizer, a mask identifier, a segment identifier and a token position, wherein the first vocabulary identifier identifies the vocabulary used by the tokenizer, the mask identifier indicates which elements of the sequence are tokens and which are padding elements, and the segment identifier identifies the sentence to which the description text belongs; and a training unit, configured to determine a loss function of the first BioBERT model and to train the first BioBERT model based on the loss function and the feature information to obtain the second BioBERT model.

Optionally, the training unit includes: a setting subunit, configured to set the following disease loss function and attribute loss function:

[formula image: disease loss function]

[formula image: attribute loss function]

a combination subunit, configured to obtain the loss function of the first BioBERT model by combining the disease loss function and the attribute loss function:

[formula image: combined loss function]

wherein β is a preset balance coefficient; λ is a preset weight coefficient; T represents the number of tokens corresponding to the disease name; t indexes the current token of the disease name; x_t represents the t-th token corresponding to the disease name; p(x_t | passage) represents the conditional probability of x_t given the target passage; z_t = w*y_t + b, where w represents a weight, y_t represents the output obtained for x_t after the embedding layer of the BERT model, and b represents a bias; n represents the number of attributes of the disease name; i indexes the current attribute of the disease name; y_t represents the label value, which is 0 or 1; and ln a_i represents the logarithm of the output value obtained for the i-th attribute a_i after the softmax layer of the model.

Optionally, the acquisition module includes: a collecting unit, configured to determine a disease name and collect disease attribute information corresponding to the disease name, wherein the disease attribute information includes: disease information, etiology information, symptom information, diagnosis information, treatment information, prevention information, pathophysiology information and transmission information; an adding unit, configured to add a corresponding auxiliary description sentence to the disease attribute information, wherein the auxiliary description sentence indicates the disease name and the attribute name corresponding to the disease attribute information; and a storage unit, configured to store the disease attribute information and the corresponding auxiliary description sentences by category.

Optionally, the identification module includes: a first receiving unit, configured to receive a candidate diagnosis list of a first disease from a question and answer system, wherein the candidate diagnosis list includes a plurality of candidate diagnosis information of the first disease; a first output unit configured to output first disease attribute information corresponding to the first disease using the second BioBERT model; a sorting unit for sorting the candidate diagnosis list based on the first disease attribute information.

Optionally, the identification module includes: a second receiving unit for receiving the prerequisite description information and the hypothesis diagnosis information of the second disease from the medical reasoning system; a second output unit configured to output second disease attribute information corresponding to the assumed diagnostic information using the second BioBERT model; a first determining unit, configured to determine that a diagnosis inference between the prerequisite description information and the hypothetical diagnosis information is correct if the second disease attribute information matches the prerequisite description information.

Optionally, the identification module includes: a third receiving unit for receiving an entity description text of a third disease from the disease recognition system; an extracting unit configured to extract a disease name text and a disease attribute text from the entity description text; a third output unit, configured to output third disease attribute information corresponding to the disease name text by using the second BioBERT model; and the second determining unit is used for determining the disease name text as the disease name in the entity description text if the third disease attribute information is matched with the disease attribute text.

Optionally, the training unit includes: the training subunit is used for training the first BioBERT model by using the characteristic information of the training set to obtain a first intermediate model; the verification subunit is used for verifying the first intermediate model by using the characteristic information of the verification set to obtain a second intermediate model; a testing subunit, configured to test the second intermediate model by using the feature information of the test set, and if a preset testing parameter of the second intermediate model meets a preset condition, output the second intermediate model as the second BioBERT model, where the preset testing parameter includes: accuracy, recall, F1 value.

According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.

According to the invention, disease data comprising disease name information and disease attribute information is acquired; a first biomedical bidirectional encoder (BioBERT) model is trained based on the disease data to obtain a second BioBERT model, which comprises a semantic relation between the disease name information and the disease attribute information; and the second BioBERT model is used for disease identification. By training the BioBERT model to construct the semantic relation between the disease name information and the disease attribute information, automatic identification and confirmation of diseases in medical systems such as medical inquiry systems is realized, and the technical problem in the related art that identifying diseases in description text is inefficient is solved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a block diagram of a hardware configuration of a computer according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method of identifying a disease according to an embodiment of the invention;

FIG. 3 is a flow diagram of an embodiment of the invention;

FIG. 4 is a block diagram of a disease recognition apparatus according to an embodiment of the present invention;

FIG. 5 is a block diagram of an electronic device according to an embodiment of the invention.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

The method provided by the first embodiment of the present application may be executed in a server, a computer, a mobile phone or a similar computing device. Taking execution on a computer as an example, FIG. 1 is a block diagram of the hardware structure of a computer according to an embodiment of the present invention. As shown in FIG. 1, the computer may include one or more (only one is shown in FIG. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input/output device 108. It will be appreciated by those of ordinary skill in the art that the configuration shown in FIG. 1 is illustrative only and does not limit the configuration of the computer described above. For example, the computer may include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.

The memory 104 may be used to store a computer program, for example, a software program and a module of application software, such as a computer program corresponding to a disease identification method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the above-mentioned method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to a computer through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network may include a wireless network provided by a communication provider of the computer. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which communicates with the internet wirelessly.

In the present embodiment, a disease identification method is provided. FIG. 2 is a flowchart of a disease identification method according to an embodiment of the present invention. As shown in FIG. 2, the flow includes the following steps:

step S202, disease data are obtained, wherein the disease data comprise disease name information and disease attribute information;

In this embodiment, the disease attribute information includes information on each attribute of the disease.

Step S204, training a first biomedical bidirectional encoder (BioBERT) model based on the disease data to obtain a second BioBERT model, wherein the second BioBERT model comprises a semantic relation between the disease name information and the disease attribute information;

The BioBERT model of this embodiment is a pre-trained language model that captures syntax, semantics and some common-sense knowledge, and can identify the semantic relation between a disease description text and the corresponding disease and attribute.

Step S206, performing disease identification using the second BioBERT model;

Optionally, the scheme of this embodiment may be applied to medical application scenarios such as health care question answering, medical diagnosis reasoning and disease entity identification.

Through the above steps, disease data comprising disease name information and disease attribute information is acquired; a first biomedical bidirectional encoder (BioBERT) model is trained based on the disease data to obtain a second BioBERT model, which comprises a semantic relation between the disease name information and the disease attribute information; and the second BioBERT model is used for disease identification. By training the BioBERT model to construct the semantic relation between the disease name information and the disease attribute information, automatic identification and confirmation of diseases in medical systems such as medical inquiry systems is realized, and the technical problem in the related art that identifying diseases in description text is inefficient is solved.

In one implementation of this embodiment, acquiring the disease data comprises: determining a disease name, and collecting disease attribute information corresponding to the disease name, wherein the disease attribute information includes: disease information, etiology information, symptom information, diagnosis information, treatment information, prevention information, pathophysiology information and transmission information; adding a corresponding auxiliary description sentence to the disease attribute information, wherein the auxiliary description sentence indicates the disease name and the attribute name corresponding to the disease attribute information; and storing the disease attribute information and the corresponding auxiliary description sentences by category.

A database is first selected, for example MeSH, Wikipedia, WebMD or NHS Choices, and the names of the target diseases (one or more) are determined from it. Each determined disease is then described from 8 attributes, and the corresponding disease attribute information is collected: disease information, causes, symptoms, diagnosis, treatment, prevention, pathophysiology and transmission. Because large-scale data collection may pick up interference data such as meaningless punctuation marks and escape characters, the collected data may contain dirty data. In this embodiment, regular expressions are used to filter the collected raw data, and the filtered data is then stored by category in a tsv file; optionally, the columns of the tsv file from left to right are the disease name, disease information, causes, symptoms, diagnosis, treatment, prevention, pathophysiology and transmission. During data collection, a passage of text may be extracted from a certain attribute of a certain disease, and both the disease name and the attribute item to which the passage belongs need to be known. Considering that the extracted passage does not necessarily contain the disease name or the attribute name, this embodiment constructs an auxiliary description sentence for each target disease to indicate the disease name and the attribute item, for example attribute item 1 for the first passage of the disease attribute information and attribute item 2 for the second passage.
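The collection step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the specific regular expressions, the wording of the auxiliary sentence template and the helper names (`clean_text`, `auxiliary_sentence`, `to_tsv`) are hypothetical, since the source does not give them.

```python
import csv
import io
import re

# Column order follows the left-to-right layout described in the text.
ATTRIBUTES = ["disease information", "causes", "symptoms", "diagnosis",
              "treatment", "prevention", "pathophysiology", "transmission"]


def clean_text(raw: str) -> str:
    """Filter interference data: escape characters and repeated punctuation."""
    text = re.sub(r"\\[ntr]", " ", raw)           # literal escape sequences such as \n
    text = re.sub(r"[\t\r\n]+", " ", text)        # real control whitespace
    text = re.sub(r"([^\w\s])\1+", r"\1", text)   # collapse runs of the same punctuation mark
    return re.sub(r"\s+", " ", text).strip()


def auxiliary_sentence(disease: str, attribute: str) -> str:
    """Assumed template naming the disease and the attribute a passage belongs to."""
    return f"This passage describes the {attribute} of {disease}."


def to_tsv(records: dict) -> str:
    """Serialize {disease name: {attribute: text}} as tab-separated rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["disease name"] + ATTRIBUTES)
    for name, attrs in records.items():
        writer.writerow([name] + [clean_text(attrs.get(a, "")) for a in ATTRIBUTES])
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_tsv({"influenza": {"symptoms": "fever,,  cough"}}))
    print(auxiliary_sentence("influenza", "symptoms"))
```

Writing to an in-memory buffer keeps the sketch free of file-system side effects; a real pipeline would write the same rows to a tsv file on disk.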

In some embodiments, training the first BioBERT model based on the disease data results in a second BioBERT model comprising:

S11, converting the disease data into sample vectors using a tokenizer, wherein the sample vectors comprise: a training set, a verification set and a test set;

After the disease data is collected, label information is set for the disease data, and the labeled disease data is then divided into three parts in a certain proportion (such as 8:1:1): (1) the training set, mainly used to train the semantic-association language model; (2) the verification set, mainly used to verify the training effect of the model and to select the model with the best training effect; (3) the test set, mainly used to test the learning ability of the model. The training set is stored in a train.tsv file, the verification set in a val.tsv file, and the test set in a test.tsv file.

In this embodiment, after the disease data is collected, it is divided as sample data into a training set, a verification set and a test set, and the sample data is then converted into vector features using the tokenizer.
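A minimal sketch of the 8:1:1 division follows. The shuffling, the fixed seed and the function name `split_dataset` are assumptions added for reproducibility of the illustration; the source specifies only the proportion and the three file names.

```python
import random


def split_dataset(rows, ratios=(8, 1, 1), seed=0):
    """Shuffle labeled rows and divide them into train/verification/test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # local RNG, so global random state is untouched
    total = sum(ratios)
    n_train = len(rows) * ratios[0] // total
    n_val = len(rows) * ratios[1] // total
    return {
        "train.tsv": rows[:n_train],
        "val.tsv": rows[n_train:n_train + n_val],
        "test.tsv": rows[n_train + n_val:],  # remainder goes to the test set
    }


if __name__ == "__main__":
    parts = split_dataset(range(100))
    print({name: len(part) for name, part in parts.items()})
```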

S12, extracting feature information from the sample vectors, wherein the feature information comprises: a start symbol, separator or end symbol of the description text, a first vocabulary identifier, a token identifier of the tokenizer, a mask identifier, a segment identifier and a token position, wherein the first vocabulary identifier identifies the vocabulary used by the tokenizer, the mask identifier indicates which elements of the sequence are tokens and which are padding elements, and the segment identifier identifies the sentence to which the description text belongs;

S13, determining a loss function of the first BioBERT model, and training the first BioBERT model based on the loss function and the feature information to obtain the second BioBERT model.

In one example, determining the loss function of the first BioBERT model comprises: setting the following disease loss function and attribute loss function:

[formula image: disease loss function]

[formula image: attribute loss function]

The loss function of the first BioBERT model is obtained by combining the disease loss function and the attribute loss function:

[formula image: combined loss function]

wherein β is a preset balance coefficient; λ is a preset weight coefficient; T represents the number of tokens corresponding to the disease name; t indexes the current token of the disease name; x_t represents the t-th token corresponding to the disease name; p(x_t | passage) represents the conditional probability of x_t given the target passage; z_t = w*y_t + b, where w represents a weight, y_t represents the output obtained for x_t after the embedding layer of the BERT model, and b represents a bias; n represents the number of attributes of the disease name; i indexes the current attribute of the disease name; y_t represents the label value, which is 0 or 1; and ln a_i represents the logarithm of the output value obtained for the i-th attribute a_i after the softmax layer of the model.
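Because the formulas themselves do not survive in this text, the following is only one plausible reading that is consistent with the symbol definitions above; it is not the published formula. It assumes that the disease loss is a token-level cross-entropy plus a reciprocal regularizer on z_t weighted by β (as the later description of the disease loss suggests), that the attribute loss is a standard cross-entropy over the n attributes, and that λ weights the attribute term in a simple sum. The label is written y_i here to avoid the clash with y_t, the embedding-layer output.

```latex
\begin{aligned}
L_{\mathrm{disease}} &= \sum_{t=1}^{T}\left[-\log p(x_t \mid \mathrm{passage}) + \frac{\beta}{z_t}\right],
  \qquad z_t = w\,y_t + b,\\
L_{\mathrm{attribute}} &= -\sum_{i=1}^{n} y_i \ln a_i,\\
L &= L_{\mathrm{disease}} + \lambda\, L_{\mathrm{attribute}}.
\end{aligned}
```

Under this reading, a larger z_t lowers the regularization term, which matches the stated goal of maximizing the pre-softmax values of the disease-name tokens. Other combinations, such as a convex mix with weights λ and 1 − λ, would fit the definitions equally well.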

In this example, the BioBERT model is trained on the collected disease data, yielding an enhanced BioBERT (the second BioBERT model) that incorporates more disease knowledge.

In the training process, feature extraction is performed first. The text data is converted into numerical data that a computer can process, i.e., text vectorization. This embodiment uses a tokenizer to process the text data. The tokenizer vectorizes text or converts text into sequences, which is the first step of text preprocessing: tokenization. Since a computer cannot understand the meaning of a word when processing language, each word (in Chinese, a single character or a phrase is treated as a word) is converted into a positive integer, so that a text becomes a sequence of positive integers. The tokenizer first checks whether the whole word is in the vocabulary; if not, it tries to decompose the word into the largest possible sub-words contained in the vocabulary (sub-words whose meaning is identical or close to that of the word), and as a last resort decomposes the word into individual characters. A word can therefore always be represented at least as the collection of its individual characters. In this embodiment, words of the text that are not in the vocabulary are decomposed into sub-word and character tokens, for which embeddings can then be generated, yielding the sample vectors.
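The decomposition described above corresponds to greedy longest-match-first sub-word tokenization. The sketch below illustrates it with a toy vocabulary; the vocabulary contents, the "##" continuation prefix (the convention used by BERT-style WordPiece vocabularies) and the function names are assumptions for illustration, not the tokenizer actually shipped with BioBERT.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest sub-words found in the vocabulary."""
    if word in vocab:
        return [word]
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:          # not even a single character matched
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab_list):
    """Map text to tokens and to positive-integer ids (positions in vocab_list)."""
    vocab = {token: index for index, token in enumerate(vocab_list)}
    tokens = [p for w in text.lower().split() for p in wordpiece(w, vocab)]
    return tokens, [vocab[t] for t in tokens]


if __name__ == "__main__":
    toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "the",
                 "hepat", "##itis", "##o", "##ma", "flu"]
    print(encode("the hepatitis", toy_vocab))
```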

Feature extraction is then performed on the processed text. Because BioBERT is a pre-trained model that expects input data in a specific format, the data must be packaged to include the following information: (a) the start, separator or end of the description text: special tokens marking the beginning ([CLS]) and the separation/end ([SEP]) of a sentence; (b) first vocabulary identifier: tokens that conform to the fixed vocabulary used by the model; (c) token identifier of the tokenizer: the token id in the model's tokenizer; (d) mask identifier: the mask id indicating which elements of the sequence are tokens and which are padding elements; (e) segment identifier: the segment id used to distinguish different sentences; (f) token position: the position embedding indicating the position of each token in the sequence.
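The packaging of items (a) to (f) can be sketched as below for a sentence pair, for example an auxiliary description sentence followed by a disease passage. The toy vocabulary, the dictionary keys and the function name `pack_inputs` are illustrative assumptions; an actual implementation would normally obtain these fields from the tokenizer that accompanies the pre-trained model.

```python
def pack_inputs(tokens_a, tokens_b, vocab, max_len):
    """Build token ids, mask ids, segment ids and positions for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len")
    # Segment 0 covers [CLS] + first sentence + [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    mask_ids = [1] * len(tokens)              # 1 = real token
    padding = max_len - len(tokens)
    input_ids += [vocab["[PAD]"]] * padding
    mask_ids += [0] * padding                 # 0 = padding element
    segment_ids += [0] * padding
    return {
        "tokens": tokens,
        "input_ids": input_ids,
        "mask_ids": mask_ids,
        "segment_ids": segment_ids,
        "position_ids": list(range(max_len)),
    }


if __name__ == "__main__":
    toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "flu": 4, "fever": 5}
    print(pack_inputs(["flu"], ["fever"], toy_vocab, max_len=8))
```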

Biomedical BERT models in the related art are pre-trained in the BERT manner using a masked language model (MLM) objective, in which the model predicts randomly masked tokens (15% of the tokens). However, the [MASK] token does not appear in tasks at the fine-tuning stage, which causes a discrepancy between the pre-training stage and the fine-tuning stage and affects the final result.

The selection of the loss function is also crucial, and the learning rate of the loss function is selected in this embodiment

The loss function jointly considers the disease and the attribute. The specific formula is as follows, where λ is a preset weight coefficient with a value in the range (0, 1), that is, the weight of the prediction loss for a given attribute.

；

For the disease loss function, this embodiment considers maximizing the original log(Z_t) before softmax normalization, so as to retain more useful information. The negative of this log term can be added to the cross-entropy loss as a regularization term, i.e., the disease loss function is as follows, where β is a parameter used to balance the two components:

；

Attribute prediction is a multi-classification problem, and its loss function is a cross-entropy loss function, which in its standard form is shown below, where n denotes the total number of samples, y denotes the actual label value, and a denotes the predicted output:

$$C = -\frac{1}{n}\sum_{x}\left[\,y\ln a + (1-y)\ln(1-a)\,\right]$$

The input data of the loss function is the output of a sigmoid function. When the sigmoid is used as the activation function, the cross-entropy loss function avoids the problem that weights are updated too slowly under the squared loss function, and it has the useful property that the weights are updated quickly when the error is large and slowly when the error is small.
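The update-speed property mentioned above can be checked numerically. The sketch below assumes the standard binary cross-entropy over sigmoid outputs and compares its gradient with respect to the pre-activation value z against that of a squared loss 0.5·(a − y)²; the function names are illustrative only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(ys, activations):
    """Mean cross-entropy; each activation must lie strictly in (0, 1)."""
    n = len(ys)
    total = sum(y * math.log(a) + (1 - y) * math.log(1 - a)
                for y, a in zip(ys, activations))
    return -total / n


def grad_cross_entropy(y, z):
    """d(loss)/dz for cross-entropy with a = sigmoid(z): simply a - y."""
    return sigmoid(z) - y


def grad_squared(y, z):
    """d(loss)/dz for 0.5 * (a - y)**2: (a - y) * a * (1 - a)."""
    a = sigmoid(z)
    return (a - y) * a * (1 - a)


# A badly wrong prediction: label 1, but the output is close to 0.
g_ce = grad_cross_entropy(1, -4.0)
g_sq = grad_squared(1, -4.0)
```

Because the cross-entropy gradient is proportional to the error a − y, a large error gives a large update, whereas the squared-loss gradient is damped by the factor a·(1 − a) when the sigmoid saturates.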

During model training, forward propagation is performed first, and optimization of the training process is also considered: Adam is selected as the optimizer, and back-propagation is used for optimization. An automatic back-propagation process is constructed for the forward propagation process, the medical BERT model (the first BioBERT model) and its relevant parameters are optimized, and the losses of the medical BERT model are reduced: reducing the disease loss improves the model's ability to recognize the disease type, and reducing the attribute loss improves the model's ability to recognize the attribute type, so that the semantic association between a disease and its attributes is better inferred.

Optionally, training the first BioBERT model based on the loss function and the feature information, and obtaining the second BioBERT model includes: training the first BioBERT model by using the characteristic information of the training set to obtain a first intermediate model; verifying the first intermediate model by using the characteristic information of the verification set to obtain a second intermediate model; and testing the second intermediate model by adopting the characteristic information of the test set, and outputting the second intermediate model as a second BioBERT model if the preset test parameters of the second intermediate model accord with preset conditions, wherein the preset test parameters comprise: accuracy, recall, F1 value.

The model is trained by utilizing the training set, and after the verification is completed by using the verification set, a trained model can be obtained. On this basis, the second BioBERT model is tested with the data of the test set, and three evaluation criteria of accuracy, recall rate and F1 value are used to quantify the quality of the model, so as to prepare the model for the subsequent use in downstream tasks such as health medical question answering, medical diagnosis reasoning and disease entity identification.
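A minimal sketch of the test-stage check follows. It computes accuracy, recall and the F1 value for a binary labelling and compares them against preset thresholds; the threshold values and function names are hypothetical, since the embodiment does not specify the preset conditions.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


def meets_preset_conditions(metrics, thresholds):
    """True if every metric reaches its (hypothetical) preset threshold."""
    return all(metrics[name] >= value for name, value in thresholds.items())


metrics = evaluate([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
accepted = meets_preset_conditions(
    metrics, {"accuracy": 0.7, "recall": 0.7, "f1": 0.7})
```

Only when the check passes would the second intermediate model be output as the second BioBERT model, in line with the optional step described above.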

The second BioBERT model in this example may be applied to downstream tasks: health care question answering, medical diagnosis reasoning and disease entity identification.

In one application scenario, the method is applied to health care question answering. Performing disease identification using the second BioBERT model includes: receiving a candidate diagnosis list of a first disease from a medical question-and-answer system, wherein the candidate diagnosis list comprises a plurality of pieces of candidate diagnosis information of the first disease; outputting first disease attribute information corresponding to the first disease by using the second BioBERT model; and sorting the candidate diagnosis list based on the first disease attribute information.

When a user searches for a disease in a search engine, multiple candidate answers appear, and the main task is to give the more accurate candidate answer. By outputting the disease attribute information of the disease with the second BioBERT model, it is possible to identify which attribute description of the disease a given answer in the search engine belongs to, and thus to select the correct answer from the candidate answers more accurately. For example, if ". RTT.. RTRT." (a disease descriptor) can be semantically linked to the diagnosis (attribute) of pneumonia (disease), the model can more accurately select the correct answer from the candidate answers in the candidate diagnosis list that include "RTRT PCR".
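The sorting step can be sketched as follows. The attribute text that the second BioBERT model would output is represented here by a plain string, and candidates are scored by simple token overlap; this scoring rule, the function name rank_candidates and the sample strings are stand-ins chosen for illustration, not the matching method of the embodiment.

```python
import re


def _tokens(text):
    """Lower-cased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_candidates(candidates, attribute_text):
    """Sort candidates by the share of their tokens found in attribute_text."""
    reference = _tokens(attribute_text)

    def score(candidate):
        toks = _tokens(candidate)
        return len(toks & reference) / len(toks) if toks else 0.0

    # sorted() is stable, so tied candidates keep their original order.
    return sorted(candidates, key=score, reverse=True)


# Hypothetical stand-in for the model's output for the "diagnosis" attribute.
attribute_text = "diagnosis is confirmed by chest x-ray and sputum culture"
ranked = rank_candidates(
    ["rest and fluids", "chest x-ray and sputum culture",
     "blood pressure check"],
    attribute_text,
)
```

In practice a learned similarity would presumably replace the token overlap, but the flow is the same: obtain attribute information for the disease, score each candidate against it, and reorder the list.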

In one application scenario, the method is applied to medical diagnosis reasoning. Performing disease identification using the second BioBERT model includes: receiving precondition description information and hypothetical diagnosis information of a second disease from a medical reasoning system; outputting second disease attribute information corresponding to the hypothetical diagnosis information by using the second BioBERT model; and if the second disease attribute information matches the precondition description information, determining that the diagnosis reasoning between the precondition description information and the hypothetical diagnosis information is correct.

Given a premise and a hypothesis, this embodiment determines the relationship between them and thereby whether the hypothesis holds. This is a classification problem and requires identifying the mutual information between the premise and the hypothesis. If the second BioBERT model can identify whether the disease attribute information in the precondition description information matches the disease name in the hypothetical diagnosis information, the relationship between the premise and the hypothesis can easily be output. For example, given the premise "she cannot speak but has good reading ability" and the hypothesis "the patient has aphasia", if the second BioBERT model can output the symptom description information of aphasia when aphasia (the disease name) is input, the relationship in this example can easily be predicted.

In one application scenario, the method is applied to disease entity identification. Performing disease identification using the second BioBERT model includes: receiving entity description text of a third disease from a disease recognition system; extracting a disease name text and a disease attribute text from the entity description text; outputting third disease attribute information corresponding to the disease name text by using the second BioBERT model; and if the third disease attribute information matches the disease attribute text, determining that the disease name text is the disease name in the entity description text.

This application scenario identifies which words in a text are disease names, or finds which diseases, and which attributes of those diseases, the text is semantically related to. For example, given the text "Myotonic dystrophy is caused by CTG expansion in the 3' untranslated region of the DM gene.", the second BioBERT model can semantically associate "CTG expansion" (disease attribute text) with the cause (attribute) of myotonic dystrophy (disease name text), and thereby determine that myotonic dystrophy is the disease name in the entity description text.

FIG. 3 is a flow diagram of an embodiment of the invention, comprising: a data acquisition and processing module, which mainly obtains the relevant information of the disease for processing by the subsequent modules; a model training module, which mainly injects the acquired disease information into a BERT model and evaluates the result; and a model application module, which mainly applies the trained model to the three tasks of health care question answering, medical diagnosis reasoning and disease entity identification.

The data of this scheme is acquired in a weakly supervised manner, which saves considerable time compared with manual labeling. Compared with existing biomedical BERT models, the method provided by this scheme can identify the semantic relation between a disease description text and the corresponding disease (Disease) and attribute (attribute).

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Example 2

In this embodiment, a disease recognition device is further provided for implementing the above embodiments and preferred embodiments, which have already been described and will not be described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.

Fig. 4 is a block diagram of a disease recognition apparatus according to an embodiment of the present invention. As shown in Fig. 4, the apparatus includes: an acquisition module 40, a training module 42 and an identification module 44, wherein,

an acquisition module 40, configured to acquire disease data, where the disease data includes disease name information and disease attribute information;

a training module 42, configured to train a first biomedical-based bidirectional encoder BioBERT model based on the disease data to obtain a second BioBERT model, where the second BioBERT model includes semantic relationships between the disease name information and the disease attribute information;

an identification module 44 for identifying a disease using the second BioBERT model.

And attribute loss function

：

；

；

A combination subunit, configured to obtain the loss function of the first BioBERT model by combining the disease loss function and the attribute loss function:

；

It should be noted that the above modules may be implemented by software or hardware. In the latter case, this may be done in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.

Example 3

Fig. 5 is a structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 5, the electronic device includes a processor 51, a communication interface 52, a memory 53 and a communication bus 54, where the processor 51, the communication interface 52, and the memory 53 complete communication with each other through the communication bus 54, and the memory 53 is used for storing a computer program;

the processor 51 is configured to implement the following steps when executing the program stored in the memory 53: acquiring disease data, wherein the disease data comprises disease name information and disease attribute information; training a first biomedicine-based bidirectional encoder BioBERT model based on the disease data to obtain a second BioBERT model, wherein the second BioBERT model comprises the semantic relation between the disease name information and the disease attribute information; and performing disease identification using the second BioBERT model.

And attribute loss function

：

；

；

；

The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the terminal and other equipment.

The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment provided by the present application, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to execute the method for identifying a disease as described in any of the above embodiments.

In a further embodiment provided by the present application, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of identifying a disease as described in any of the above embodiments.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired connection (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless connection (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and various other media capable of storing program code.

The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims

1. A method for identifying a disease, comprising:

acquiring disease data, wherein the disease data comprises disease name information and disease attribute information;

training a first biomedicine-based bidirectional encoder BioBERT model based on the disease data to obtain a second BioBERT model, wherein the second BioBERT model comprises the semantic relation between the disease name information and the disease attribute information;

performing disease identification using the second BioBERT model.

2. The method of claim 1, wherein training a first BioBERT model based on the disease data results in a second BioBERT model comprising:

converting the disease data into a sample vector using a tokenizer, wherein the sample vector comprises: a training set, a verification set and a test set;

extracting feature information from the sample vector, wherein the feature information comprises: a start symbol, a separator symbol or an end symbol of a description text, a first vocabulary mark, a mask mark, a segment mark and a token position of the tokenizer, wherein the first vocabulary mark is used for marking a vocabulary used by the tokenizer, the mask mark is used for indicating the tokens and the padding elements in a sequence, and the segment mark is used for marking the sentence to which the description text belongs;

and determining a loss function of the first BioBERT model, and training the first BioBERT model based on the loss function and the characteristic information to obtain a second BioBERT model.

3. The method of claim 2, wherein determining the loss function of the first BioBERT model comprises:

setting the following disease loss function and attribute loss function:

；

；

;

4. The method of claim 1, wherein obtaining disease data comprises:

determining a disease name, and collecting disease attribute information corresponding to the disease name, wherein the disease attribute information includes: disease information, etiology information, symptom information, diagnosis information, treatment information, prevention information, pathophysiology information, and propaganda information;

adding a corresponding auxiliary description sentence in the disease attribute information, wherein the auxiliary description sentence is used for indicating a disease name and an attribute name corresponding to the disease attribute information;

and storing the disease attribute information and the corresponding auxiliary description sentences in a classified manner.

5. The method of claim 1, wherein using the second BioBERT model for disease identification comprises:

receiving a list of candidate diagnoses of a first disease from a medical question and answer system, wherein the list of candidate diagnoses comprises a plurality of candidate diagnoses of the first disease;

outputting first disease attribute information corresponding to the first disease by using the second BioBERT model;

ranking the list of candidate diagnoses based on the first disease attribute information.

6. The method of claim 1, wherein using the second BioBERT model for disease identification comprises:

receiving from the medical reasoning system prerequisite description information and hypothetical diagnosis information for the second disease;

outputting second disease attribute information corresponding to the hypothetical diagnostic information using the second BioBERT model;

and if the second disease attribute information is matched with the precondition description information, determining that the diagnosis reasoning between the precondition description information and the assumed diagnosis information is correct.

7. The method of claim 1, wherein using the second BioBERT model for disease identification comprises:

receiving entity description text of a third disease from a disease recognition system;

extracting a disease name text and a disease attribute text from the entity description text;

outputting third disease attribute information corresponding to the disease name text by adopting the second BioBERT model;

and if the third disease attribute information is matched with the disease attribute text, determining that the disease name text is the disease name in the entity description text.

8. The method of claim 2, wherein training the first BioBERT model based on the loss function and the feature information, and wherein deriving a second BioBERT model comprises:

training the first BioBERT model by using the characteristic information of the training set to obtain a first intermediate model;

verifying the first intermediate model by using the characteristic information of the verification set to obtain a second intermediate model;

adopting the characteristic information of the test set to test the second intermediate model, if the preset test parameters of the second intermediate model accord with preset conditions, outputting the second intermediate model as the second BioBERT model, wherein the preset test parameters comprise: accuracy, recall, F1 value.

9. An apparatus for identifying a disease, comprising:

an acquisition module, configured to acquire disease data, wherein the disease data comprises disease name information and disease attribute information;

a training module, configured to train a first biomedicine-based bidirectional encoder BioBERT model based on the disease data to obtain a second BioBERT model, wherein the second BioBERT model comprises the semantic relation between the disease name information and the disease attribute information;

and the identification module is used for identifying diseases by adopting the second BioBERT model.

10. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 8 when executed.

11. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 8.