CN110609907A

CN110609907A - A random walk-based reasoning method for medical domain knowledge

Info

Publication number: CN110609907A
Application number: CN201910876121.5A
Authority: CN
Inventors: 张吉昕; 秦拯; 欧露; 颜俊; 陈浩; 欧博; 翟亚静
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2019-09-17
Filing date: 2019-09-17
Publication date: 2019-12-24

Abstract

The invention relates to a random walk-based knowledge reasoning method for the medical domain. It comprises three main parts: (1) a medical-domain named entity recognition method based on contextual character bigrams and information entropy; (2) a method for extracting relations between medical-domain entities based on predicate sentiment classification; and (3) a random walk-based reasoning method for medical-domain knowledge graphs. Using these methods, medical-domain named entities are recognized and the relations between them are extracted, so that a medical-domain knowledge graph is constructed automatically and reasoning over it is realized.

Description

A random walk-based reasoning method for medical domain knowledge

Technical field

The invention relates to the fields of knowledge engineering and machine learning, and in particular to a random walk-based knowledge reasoning method for the medical domain.

Background art

Knowledge graph technology is one of the key technologies in knowledge engineering and artificial intelligence, and is currently an active area of research. Machine learning techniques often suffer from two interpretability problems: the local relationships among features are hard to explain, and so is the global relationship between features and output. Knowledge graph technology, by contrast, represents the relations between knowledge entities as triples, which directly reflects the knowledge ontology and the associative logic between entities. This gives it good interpretability; it has received increasing attention from industry and has become one of the important foundations of artificial intelligence technology.

Knowledge graph technology mainly covers construction and reasoning. Construction mainly includes named entity recognition and relation extraction; reasoning mainly includes entity relation prediction and knowledge reasoning. Knowledge entities are recognized from the facts in text data, the relations between them are extracted, and a knowledge graph is built using the triple representation. The graph is then completed by mining and predicting relations that may exist between entities, and knowledge rules are extracted and reasoned over on the basis of the known relations in the graph.

The medical domain is knowledge-intensive and relies heavily on background knowledge of medicine and pharmacy. Representing this background knowledge as a knowledge graph provides important support for assistive intelligent applications in the medical domain. However, named entities, inter-entity relations, and knowledge logic in the medical domain have distinctive domain characteristics and differ considerably from those of the general domain. Targeted knowledge graph construction and reasoning techniques are therefore needed to support such applications.

Summary of the invention

The purpose of the invention is to solve the problem of automatically constructing and reasoning over medical knowledge graphs.

To this end, the invention proposes a random walk-based knowledge reasoning method for the medical domain, which mainly comprises three parts:

(1) A medical-domain named entity recognition method based on contextual character bigrams and information entropy;

(2) A method for extracting relations between medical-domain entities based on predicate sentiment classification;

(3) A random walk-based reasoning method for medical-domain knowledge graphs.

The details are as follows:

Method (1) is used to recognize medical-domain named entities, covering concepts such as drugs, diseases, symptoms, patient populations, and ingredients. Method (2) is used to extract positive and negative relations between these entities, such as "indicated for" and "contraindicated". The recognized entities and relations are used to construct a medical knowledge graph automatically, and method (3) is used to reason over it. Together, these methods realize automatic construction of the medical knowledge graph and knowledge reasoning in the medical domain.

(1) Medical-domain named entity recognition based on contextual character bigrams and information entropy.

A general corpus and a medical corpus are collected, and punctuation and stop words are removed. Two character transition probability matrices are built from the contexts in the medical corpus and the general corpus respectively; each element of a matrix is the transition frequency of a character context. Let Mat_medical be the contextual character transition probability matrix of the medical corpus and Mat_normal that of the general corpus, and let {c_i, c_{i+1}} be a pair of consecutive characters in the corpus. Computing the transition probability of {c_i, c_{i+1}} in each corpus gives the entries Mat_medical(c_i, c_{i+1}) and Mat_normal(c_i, c_{i+1}).
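The construction of the two matrices can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `transition_matrix`, the toy corpora, and the choice to normalize each row by the count of the first character are all assumptions (the text only says that each element is a transition frequency).

```python
from collections import Counter, defaultdict

def transition_matrix(sentences):
    """Estimate P(next char | current char) from consecutive character pairs.

    Row normalization is an assumption; the source only calls the entries
    transition frequencies.
    """
    pair_counts = Counter()
    first_counts = Counter()
    for s in sentences:
        for a, b in zip(s, s[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    matrix = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        matrix[a][b] = n / first_counts[a]
    return dict(matrix)

# Hypothetical toy corpora, assumed already stripped of punctuation and stop words.
medical = ["阿司匹林", "阿司匹林片"]
normal = ["阿姨", "公司", "司机"]

mat_medical = transition_matrix(medical)
mat_normal = transition_matrix(normal)

# The pair {阿, 司} is frequent in the medical corpus and absent from the general one.
print(mat_medical["阿"]["司"], mat_normal["阿"].get("司", 0.0))
```

A nested dictionary is used instead of a dense matrix because most character pairs never occur; missing entries are read as probability 0.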

Based on the two contextual character transition probability matrices, information entropy is used to measure how strongly each character context is specific to the medical domain. Because character transition probabilities in the general corpus are relatively stable, a character context whose transition probability in the medical corpus deviates significantly from that in the general corpus is judged to belong to a medical named entity.

The information entropy of the character transition probability is calculated according to the following formula. The value Entropy(c_i, c_{i+1}) marks whether {c_i, c_{i+1}} belongs to a medical-domain named entity: if Entropy(c_i, c_{i+1}) > t, where t is a threshold (t = 1), then {c_i, c_{i+1}} is a character context within a single named entity, and consecutive such character contexts are combined to form a medical named entity.
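The thresholding and merging step can be sketched as below. The entropy formula itself is not given in the text, so `score` is a hypothetical stand-in (a smoothed log-ratio of the medical to the general transition probability) and should not be read as the patent's formula; `merge_entities`, `eps`, and the toy matrices are likewise illustrative assumptions. Only the threshold t = 1 and the rule of joining consecutive character contexts come from the description.

```python
import math

def score(pair, mat_medical, mat_normal, eps=1e-6):
    """Hypothetical stand-in for Entropy(c_i, c_{i+1}); NOT the patent's formula.

    Returns a smoothed log2 ratio of medical to general transition probability.
    """
    a, b = pair
    p_med = mat_medical.get(a, {}).get(b, 0.0)
    p_norm = mat_normal.get(a, {}).get(b, 0.0)
    return math.log2((p_med + eps) / (p_norm + eps))

def merge_entities(text, mat_medical, mat_normal, t=1.0):
    """Join runs of consecutive character pairs whose score exceeds t."""
    entities, current = [], ""
    for a, b in zip(text, text[1:]):
        if score((a, b), mat_medical, mat_normal) > t:
            # Extend the running entity, or start a new one with both characters.
            current = current + b if current else a + b
        else:
            if current:
                entities.append(current)
            current = ""
    if current:
        entities.append(current)
    return entities

# Hypothetical toy matrices in the nested-dict form {c_i: {c_{i+1}: probability}}.
mat_medical = {"阿": {"司": 1.0}, "司": {"匹": 1.0}, "匹": {"林": 1.0}}
mat_normal = {"阿": {"姨": 1.0}, "司": {"机": 1.0}, "服": {"用": 1.0}, "用": {"阿": 0.5}}

found = merge_entities("服用阿司匹林", mat_medical, mat_normal)
print(found)
```

Passing a different scoring function in place of `score` would leave the merging logic unchanged, which is the part the description specifies.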

(2) Extraction of relations between medical-domain entities based on predicate sentiment classification.

The medical corpus is split at punctuation marks to obtain a set of short sentences, and the sentiment of a subset of these is labeled as positive, negative, or neutral. All labeled short sentences are segmented into Chinese words using a conditional random field with Viterbi decoding, and every word is vectorized using a word embedding method. The word vectors of a short sentence are combined by weighted averaging into a text vector, and a support vector machine is trained on the labeled text vectors to obtain a text sentiment classification model. This model is then used to classify all short sentences in the medical corpus, and those with clearly positive or negative sentiment are extracted.
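The sentence splitting and weighted-average steps can be sketched in plain Python as below. Word segmentation, embedding training, and the support vector machine are left out; the punctuation set, the function names, the default weight of 1.0, and the two-dimensional toy vectors are assumptions made for illustration.

```python
import re

def split_short_sentences(text):
    """Split at Chinese and ASCII punctuation, dropping empty pieces.

    The punctuation set is an assumption; the source only says 'punctuation'.
    """
    return [s for s in re.split(r"[，。；！？,.;!?]", text) if s.strip()]

def text_vector(words, word_vectors, weights=None):
    """Weighted average of the vectors of the known words in a short sentence."""
    weights = weights or {}
    known = [w for w in words if w in word_vectors]
    if not known:
        return None  # no known word, so no text vector can be formed
    total = sum(weights.get(w, 1.0) for w in known)
    dim = len(word_vectors[known[0]])
    return [
        sum(weights.get(w, 1.0) * word_vectors[w][k] for w in known) / total
        for k in range(dim)
    ]

sentences = split_short_sentences("孕妇禁用，儿童慎用。")

# Hypothetical 2-dimensional embeddings for two already-segmented words.
word_vectors = {"孕妇": [1.0, 0.0], "禁用": [0.0, 1.0]}
plain = text_vector(["孕妇", "禁用"], word_vectors)
weighted = text_vector(["孕妇", "禁用"], word_vectors, weights={"禁用": 3.0})
print(sentences, plain, weighted)
```

The resulting text vectors, paired with the positive, negative, or neutral labels, would form the training set for the classifier.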

The short sentences with positive or negative sentiment are segmented into words, the words are tagged with parts of speech, and the predicates (verbs) are extracted. If a short sentence contains two or more medical named entities, and these lie on either side of the predicate, or the predicate is the first or last word of the sentence, the entities on either side of the predicate are extracted and a relation is established between them. The positive or negative sentiment of the short sentence determines whether the relation is positive or negative.
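The extraction rule can be sketched as follows, assuming the sentence has already been segmented and tagged. The tag `"v"` for verbs, the function name `extract_relation`, and the handling of the head-word and tail-word cases (taking the two entities nearest the predicate) are interpretations of the description rather than details it states.

```python
def extract_relation(tagged, entities, polarity):
    """Return (entity_1, predicate, entity_2, polarity) or None.

    tagged:   list of (word, pos) pairs; "v" is assumed to mark verbs.
    entities: set of recognized medical named entities.
    polarity: +1 or -1, taken from the sentence-level sentiment class.
    """
    words = [w for w, _ in tagged]
    ent_pos = [i for i, w in enumerate(words) if w in entities]
    if len(ent_pos) < 2:
        return None  # the rule requires at least two entities
    for i, (word, pos) in enumerate(tagged):
        if pos != "v":
            continue
        left = [j for j in ent_pos if j < i]
        right = [j for j in ent_pos if j > i]
        if left and right:
            # Entities on both sides: take the nearest one on each side.
            return (words[left[-1]], word, words[right[0]], polarity)
        if i == 0 and len(right) >= 2:
            # Predicate is the first word: take the two entities that follow it.
            return (words[right[0]], word, words[right[1]], polarity)
        if i == len(tagged) - 1 and len(left) >= 2:
            # Predicate is the last word: take the two entities that precede it.
            return (words[left[-2]], word, words[left[-1]], polarity)
    return None

entities = {"孕妇", "四环素"}
middle = extract_relation([("孕妇", "n"), ("禁用", "v"), ("四环素", "n")], entities, -1)
head = extract_relation([("禁用", "v"), ("孕妇", "n"), ("四环素", "n")], entities, -1)
print(middle, head)
```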

Using the named entity recognition and relation extraction methods above, a medical knowledge graph KG(V, E, P) is constructed with the triple representation, where V is the set of vertices (medical entities), E is the set of edges between pairs of vertices (relations between entities), and P is the positive or negative attribute of each edge.

As shown in Figure 1, the concept relationship diagram of the knowledge graph, medical entity concepts include disease, symptom, drug, patient population, department, body part, and so on. Accordingly, the vertices of the knowledge graph include disease entities such as cold and gastritis; symptom entities such as cough and stomach pain; drug entities such as aspirin and cephalosporin; population entities such as infants and pregnant women; and body part entities such as head and chest. An edge represents the relation between two entities; for example, cold-cough and gastritis-stomach pain indicate that a cold causes cough and that gastritis causes stomach pain. Relations between medical entities are further divided into positive and negative. The edge cold-cough is positive because a cold causes cough, whereas the edge tetracycline-pregnant women is negative because tetracycline is contraindicated for pregnant women. Positive relations include "indicated for" and "causes"; negative relations include "use with caution" and "contraindicated". For example, given the passage "Gastritis is inflammation of the gastric mucosa... it usually presents as upper abdominal pain, nausea, vomiting... complications include bleeding, gastric ulcer...", the disease vertex extracted is {gastritis}, the symptom vertices extracted are {upper abdominal pain, nausea, vomiting, ..., bleeding, gastric ulcer, ...}, and the relations with their weights are {(gastritis, upper abdominal pain, 1.0), (gastritis, nausea, 1.0), (gastritis, vomiting, 1.0), ..., (gastritis, bleeding, 1.0), (gastritis, gastric ulcer, 1.0), ...}.
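One possible in-memory form of KG(V, E, P) is sketched below, populated with the gastritis and tetracycline examples from the description. The class name `KnowledgeGraph`, its methods, and the choice to store P as a +1 or -1 value on each directed edge are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeGraph:
    vertices: set = field(default_factory=set)  # V: medical entities
    edges: dict = field(default_factory=dict)   # E with P: (head, tail) -> +1 or -1

    def add(self, head, tail, sign=1):
        """Add a directed edge with a positive (+1) or negative (-1) attribute."""
        if sign not in (1, -1):
            raise ValueError("sign must be +1 or -1")
        self.vertices.update((head, tail))
        self.edges[(head, tail)] = sign

    def relations(self, head):
        """Return {tail: sign} for all edges leaving `head`."""
        return {t: s for (h, t), s in self.edges.items() if h == head}

kg = KnowledgeGraph()
for symptom in ["upper abdominal pain", "nausea", "vomiting", "bleeding", "gastric ulcer"]:
    kg.add("gastritis", symptom, 1)            # positive relation: causes
kg.add("tetracycline", "pregnant women", -1)   # negative relation: contraindicated

print(kg.relations("gastritis"))
```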

Knowledge reasoning over the medical knowledge graph is performed with a random walk method. Reasoning is converted into a graph traversal that starts from a limited set of clues (a few entities) and iteratively searches for inference results. Let V = {v_1, v_2, ..., v_n} be a set of entities; the candidate results that can be inferred from this set are computed according to the following formula.

Here score(v_i) is the score of entity v_i, In(v_i) is the in-degree of v_i, Out(v_i) is the out-degree of v_i, p_{j,i} is the attribute value of the edge between v_i and v_j (1 for a positive edge, -1 for a negative edge), and α is an empirical parameter (α = 0.85). At inference time, the known entities are initialized on the knowledge graph: the scores of their vertices are set to 1 and the scores of all other vertices are set to 0. The scores of all vertices are then computed iteratively by random walk and sorted. The entities whose vertices score highest are the candidate results that can be inferred from the known entities, and they are finally filtered according to the actual situation.
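The iteration can be sketched as below. Because the scoring formula is not given in the text, the update used here is an assumed PageRank-style rule chosen to match the symbol definitions: score(v_i) = (1 - α) · init(v_i) + α · Σ_j p_{j,i} · score(v_j) / Out(v_j), summed over edges v_j → v_i. The function name, the fixed iteration count, the restart term, and the toy edges (directed from clue to conclusion) are all assumptions.

```python
def random_walk_scores(edges, seeds, alpha=0.85, iterations=50):
    """Signed, seed-restarted random walk scoring (assumed update rule).

    edges: dict mapping (src, dst) -> +1 or -1 (the edge attribute p).
    seeds: the known entities, initialized to 1; all others start at 0.
    """
    vertices = {v for pair in edges for v in pair} | set(seeds)
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1
    init = {v: 1.0 if v in seeds else 0.0 for v in vertices}
    score = dict(init)
    for _ in range(iterations):
        nxt = {v: (1 - alpha) * init[v] for v in vertices}
        for (src, dst), sign in edges.items():
            # Each vertex passes its score along its out-edges, signed by p.
            nxt[dst] += alpha * sign * score[src] / out_degree[src]
        score = nxt
    return score

# Hypothetical toy graph built from entities named in the description.
edges = {
    ("cough", "cold"): 1,
    ("stomach pain", "gastritis"): 1,
    ("pregnant women", "tetracycline"): -1,
}
seeds = {"cough", "pregnant women"}
score = random_walk_scores(edges, seeds)

# Rank the non-seed vertices; high scores are candidates, negative scores are ruled out.
ranked = sorted((v for v in score if v not in seeds), key=score.get, reverse=True)
print(ranked[0], ranked[-1])
```

With these toy edges, the vertex reached through a positive edge from a seed ranks first, the vertex reached through a negative edge ranks last, and vertices unconnected to any seed stay at 0.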

Description of the drawings

Figure 1 is the concept relationship diagram of the knowledge graph.

Detailed description

The steps of the invention are as follows:

Step 1: Collect a general corpus and a medical corpus, and remove punctuation and stop words.

Step 2: From the character contexts {c_i, c_{i+1}} in the medical corpus and the general corpus, build the character transition probability matrices Mat_medical(c_i, c_{i+1}) and Mat_normal(c_i, c_{i+1}) respectively.

Step 3: Based on the two contextual character transition probability matrices, use information entropy to measure how strongly each character context is specific to the medical domain.

Step 4: If Entropy(c_i, c_{i+1}) > t, where t is a threshold (t = 1), then {c_i, c_{i+1}} is a character context within a single named entity; consecutive such character contexts are combined to form a medical named entity.

Step 5: Split the medical corpus at punctuation marks to obtain a set of short sentences, and label the sentiment of a subset of them as positive, negative, or neutral.

Step 6: Segment the short sentences into Chinese words and vectorize all words with a word embedding method. Combine the word vectors of each short sentence by weighted averaging into a text vector, and train a support vector machine on the labeled text vectors to obtain a text sentiment classification model. Use this model to classify the sentiment of the short sentences in the medical corpus.

Step 7: Use part-of-speech tagging to extract the predicate of each sentence. If the sentence contains two or more medical named entities, and these lie on either side of the predicate, or the predicate is the first or last word of the sentence, extract the entities on either side of the predicate and establish a relation between them. The positive or negative sentiment of the short sentence determines whether the relation is positive or negative.

Step 8: From the recognized medical named entities and the extracted relations, construct the medical knowledge graph using the triple representation.

Step 9: Perform knowledge reasoning over the medical knowledge graph with the random walk method. Reasoning is converted into a graph traversal that starts from a limited set of clues (a few entities) and iteratively searches for candidate inference results.

Claims

1. A random walk-based knowledge reasoning method for the medical domain, characterized by comprising:

(1) a medical-domain named entity recognition method based on contextual character bigrams and information entropy;

(2) a method for extracting relations between medical-domain entities based on predicate sentiment classification;

(3) a random walk-based reasoning method for medical-domain knowledge graphs.

2. The medical-domain named entity recognition method based on contextual character bigrams and information entropy according to claim 1, characterized in that, to address the inaccuracy of traditional named entity recognition methods caused by the non-smooth statistical behavior of the character contexts of medical-domain named entities, contextual character bigrams and information entropy are used to compare the statistical behavior of character contexts in the general domain with that in the medical domain, thereby recognizing medical-domain named entities.

3. The method for extracting relations between medical-domain entities based on predicate sentiment classification according to claim 1, characterized in that, because the positive and negative relations between medical-domain entities are related to the sentiment of the predicate between the entities, the relations between medical-domain entities are extracted by performing sentiment classification on the adjacent predicates.

4. The random walk-based reasoning method for medical-domain knowledge graphs according to claim 1, characterized in that, to address the difficulty of reasoning directly over a medical-domain knowledge graph whose associations between entities are dense and complex, a random walk method is used to reason over the medical knowledge in the knowledge graph.