CN113553840B

CN113553840B - Text information processing method, device, equipment and storage medium

Info

Publication number: CN113553840B
Application number: CN202110923275.2A
Authority: CN
Inventors: 姜逸文; 陈旭; 宋晓霞; 刘鸣谦; 洪平; 高玉杰; 黄智勇; 王琪; 孙嘉明
Original assignee: Winning Health Technology Group Co Ltd
Current assignee: Winning Health Technology Group Co Ltd
Priority date: 2021-08-12
Filing date: 2021-08-12
Publication date: 2024-10-01
Anticipated expiration: 2041-08-12
Also published as: CN113553840A

Abstract

The application provides a text information processing method, a device, equipment and a storage medium, and relates to the technical field of data processing. The initial knowledge extraction model comprises an initial entity type identification model and an initial entity relation extraction model, and the method comprises the following steps: labeling the original training text information according to the corresponding relation between the entity type and the entity name to obtain a first training sample, wherein the entity type comprises: standard entity type, attribute entity type, and value entity type; inputting the first training sample into the initial entity type recognition model, and training to obtain an entity type recognition model; labeling the first training sample according to the corresponding relation between entity types to obtain a second training sample; and inputting the second training sample into the initial entity relation extraction model, and training to obtain an entity relation extraction model. By applying the embodiment of the application, the accuracy of the knowledge extraction model comprising the entity type identification model can be improved.

Description

Text information processing method, device, equipment and storage medium

Technical Field

The present application relates to the technical field of medical texts, and in particular, to a text information processing method, apparatus, device and storage medium.

Background

With the rapid development of medical informatization, medical staff generally use electronic medical records to record important information in the process of diagnosing and treating patients. Because information in electronic medical records (which may be referred to as medical text information) is mostly stored in unstructured form, it is difficult to use directly in scenarios such as scientific research.

At present, unstructured medical text information can be processed through a pre-trained knowledge extraction model to obtain structured information, wherein the structured information comprises entities in the medical text information and relations among the entities.

The comprehensiveness of the training samples used for training the knowledge extraction model directly affects the accuracy of the knowledge extraction model, so how to construct comprehensive training samples is a technical problem to be solved at present.

Disclosure of Invention

The application aims to overcome the defects in the prior art and provide a text information processing method, a device, equipment and a storage medium, wherein the accuracy of a knowledge extraction model can be improved based on a constructed comprehensive training sample.

In order to achieve the above purpose, the technical scheme adopted by the embodiment of the application is as follows:

In a first aspect, an embodiment of the present application provides a text information processing method, where an initial knowledge extraction model includes an initial entity type recognition model and an initial entity relationship extraction model, and the method includes:

Labeling the original training text information according to the corresponding relation between the entity type and the entity name to obtain a first training sample, wherein the entity type comprises: a standard entity type, an attribute entity type, and a value entity type, wherein the attribute entity type and the value entity type are respectively used for characterizing the standard entity type, and the first training sample comprises: the original training text information and entity types corresponding to the entity names in the original training text information;

Inputting the first training sample into the initial entity type recognition model, and training to obtain an entity type recognition model;

Labeling the first training sample according to the corresponding relation among the entity types to obtain a second training sample, wherein the corresponding relation among the entity types comprises: a pointing relationship between a subject entity type and an object entity type, the second training sample comprising: the original training text information and the corresponding relation between entity names corresponding to the entity types in the original training text information;

and inputting the second training sample into the initial entity relation extraction model, and training to obtain an entity relation extraction model.

Optionally, labeling the first training sample according to the correspondence between entity types to obtain a second training sample, including:

And labeling the first training sample according to the corresponding relation between the entity types and the strength information between entity names corresponding to the entity types in the first training sample to obtain a second training sample.

Optionally, labeling the original training text information according to the correspondence between the entity type and the entity name to obtain a first training sample, including:

labeling the original training text information according to the corresponding relation between the entity type and the entity name to obtain an initial first training sample;

and if one entity name in the original training text information comprises a plurality of sub-entity names, deleting the entity type corresponding to each sub-entity name in the initial first training sample to obtain the first training sample.
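The removal of sub-entity labels described above can be sketched as follows. This is a minimal illustration only: the span format `(start, end, entity_type)`, the helper names `span_of` and `drop_sub_entities`, and the decomposition of "right upper limb" into sub-entity names are assumptions made for the example and are not specified in the application.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, entity_type), with end exclusive.
Span = Tuple[int, int, str]

def span_of(text: str, name: str, entity_type: str) -> Span:
    """Locate the first occurrence of an entity name and return its span."""
    start = text.find(name)
    return (start, start + len(name), entity_type)

def drop_sub_entities(spans: List[Span]) -> List[Span]:
    """Keep only spans that are not strictly contained in a longer span."""
    kept = []
    for s in spans:
        contained = any(
            o[0] <= s[0] and s[1] <= o[1] and (o[1] - o[0]) > (s[1] - s[0])
            for o in spans
        )
        if not contained:
            kept.append(s)
    return kept

text = "right upper limb numbness"
# Initial first training sample: the longest name and its sub-entity names
# are all labeled.
initial_spans = [
    span_of(text, "right upper limb", "body part"),
    span_of(text, "upper limb", "body part"),   # sub-entity name
    span_of(text, "limb", "body part"),         # sub-entity name
    span_of(text, "numbness", "clinical manifestation"),
]
first_sample_spans = drop_sub_entities(initial_spans)
kept_names = [text[start:end] for start, end, _ in first_sample_spans]
```

Under these assumptions only the outermost entity name keeps its label, so `kept_names` holds "right upper limb" and "numbness".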

Optionally, the method further comprises:

Inputting the target text information into the entity type recognition model, and outputting an entity set, wherein the entity set comprises: the entity names contained in the target text information and the entity types corresponding to the entity names, wherein the entity types comprise standard entity types, attribute entity types and value entity types;

And inputting the target text information and the entity set into the entity relation extraction model, and outputting an entity name pair, wherein the entity name pair comprises a subject entity name and an object entity name, and the subject entity name points to the object entity name.

Optionally, the inputting the target text information and the entity set into the entity relation extraction model outputs an entity name pair, including:

And inputting the target text information and the entity set into the entity relation extraction model, and outputting the entity name pair and the strength information between entity names contained in the entity name pair, wherein the entity name pair comprises a subject entity name and an object entity name, and the subject entity name points to the object entity name.

Optionally, after the inputting the target text information and the entity set into the entity relationship extraction model and outputting the entity name pair, the method further includes:

And constructing a knowledge graph according to the entity name pairs, taking the subject entity names and the object entity names in the entity name pairs as nodes in the knowledge graph respectively, and taking the relation between the subject entity names and the object entity names as edges in the knowledge graph.
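As a rough illustration of this construction, the sketch below builds a node set and a directed adjacency map from entity name pairs. The function name `build_knowledge_graph`, the in-memory representation and the sample pairs are assumptions for the example; the application does not prescribe a particular graph data structure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical pair format: (subject entity name, object entity name).
NamePair = Tuple[str, str]

def build_knowledge_graph(pairs: Iterable[NamePair]) -> Tuple[Set[str], Dict[str, List[str]]]:
    """Use entity names as nodes and subject-to-object pointing relations as edges."""
    nodes: Set[str] = set()
    edges: Dict[str, List[str]] = defaultdict(list)
    for subject, obj in pairs:
        nodes.update((subject, obj))
        edges[subject].append(obj)   # directed edge: subject points to object
    return nodes, dict(edges)

# Illustrative entity name pairs, not taken from a real model output.
pairs = [("patient", "body temperature"), ("body temperature", "38 ℃")]
nodes, edges = build_knowledge_graph(pairs)
```

A directed adjacency map is used here because each entity name pair has a pointing direction; a graph database could hold the same nodes and edges.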

Optionally, after constructing a knowledge graph according to the entity name pair, the method further includes:

Acquiring a corresponding entity name from a database storing graph data corresponding to the knowledge graph according to a knowledge acquisition instruction input by a user;

And displaying the entity names in the knowledge graph according to the display states corresponding to the entity names.

Optionally, after inputting the target text information into the entity type recognition model and outputting the entity set, the method further includes:

Performing statistics operation on the entity set to obtain a statistical result, wherein the statistical result comprises: the frequency of occurrence of each entity name and/or the frequency of occurrence of each entity type in the entity set;

And respectively sequencing the contents belonging to the same dimension in the statistical result to obtain a sequencing result.
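The statistics and sorting steps can be sketched with a frequency counter per dimension. The entity set below is made-up sample data, and the variable names are assumptions for the example.

```python
from collections import Counter

# Hypothetical entity set: (entity name, entity type) for each recognized mention.
entity_set = [
    ("patient", "patient subject"),
    ("body temperature", "attribute entity type"),
    ("38 ℃", "value entity type"),
    ("patient", "patient subject"),
    ("numbness", "clinical manifestation"),
]

# Count each dimension separately: entity names and entity types.
name_frequency = Counter(name for name, _ in entity_set)
type_frequency = Counter(entity_type for _, entity_type in entity_set)

# Sort the contents of each dimension by descending frequency.
ranked_names = name_frequency.most_common()
ranked_types = type_frequency.most_common()
```

Each dimension is ranked independently, so the most frequent entity name and the most frequent entity type can be read from the head of the respective list.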

In a second aspect, an embodiment of the present application further provides a text information processing apparatus, where an initial knowledge extraction model includes an initial entity type recognition model and an initial entity relationship extraction model, and the apparatus includes:

The first labeling module is used for labeling the original training text information according to the corresponding relation between the entity type and the entity name to obtain a first training sample, and the entity type comprises: a standard entity type, an attribute entity type, and a value entity type, wherein the attribute entity type and the value entity type are respectively used for characterizing the standard entity type, and the first training sample comprises: the original training text information and entity types corresponding to the entity names in the original training text information;

The first training module is used for inputting the first training sample into the initial entity type recognition model and training to obtain an entity type recognition model;

The second labeling module is configured to label the first training sample according to a correspondence between entity types, so as to obtain a second training sample, where the correspondence between entity types includes: a pointing relationship between a subject entity type and an object entity type, the second training sample comprising: the original training text information and the corresponding relation between entity names corresponding to the entity types in the original training text information;

And the second training module is used for inputting the second training sample into the initial entity relation extraction model and training to obtain the entity relation extraction model.

Optionally, the second labeling module is specifically configured to label the first training sample according to the correspondence between the entity types and strength information between entity names corresponding to the entity types in the first training sample, so as to obtain a second training sample.

Optionally, the first labeling module is specifically configured to label the original training text information according to a corresponding relationship between the entity type and the entity name, so as to obtain an initial first training sample; and if one entity name in the original training text information comprises a plurality of sub-entity names, deleting the entity type corresponding to each sub-entity name in the initial first training sample to obtain the first training sample.

Optionally, the apparatus further comprises:

The first output module is used for inputting the target text information into the entity type recognition model and outputting an entity set, and the entity set comprises: the entity names contained in the target text information and the entity types corresponding to the entity names, wherein the entity types comprise standard entity types, attribute entity types and value entity types;

And the second output module is used for inputting the target text information and the entity set into the entity relation extraction model and outputting an entity name pair, wherein the entity name pair comprises a subject entity name and an object entity name, and the subject entity name points to the object entity name.

Optionally, the second output module is specifically configured to input the target text information and the entity set into the entity relationship extraction model, and output the entity name pair and strength information between entity names included in the entity name pair, where the entity name pair includes a subject entity name and an object entity name, and the subject entity name points to the object entity name.

Optionally, the apparatus further comprises:

The building module is used for building a knowledge graph according to the entity name pairs, taking the subject entity names and the object entity names in the entity name pairs as nodes in the knowledge graph respectively, and taking the relation between the subject entity names and the object entity names as edges in the knowledge graph.

Optionally, the apparatus further comprises:

the acquisition module is used for acquiring corresponding entity names from a database storing graph data corresponding to the knowledge graph according to a knowledge acquisition instruction input by a user;

And the display module is used for displaying the entity names in the knowledge graph according to the display states corresponding to the entity names.

Optionally, the apparatus further comprises:

The statistics module is used for carrying out statistics operation on the entity set to obtain a statistics result, and the statistics result comprises: the frequency of occurrence of each entity name and/or the frequency of occurrence of each entity type in the entity set; and respectively sequencing the contents belonging to the same dimension in the statistical result to obtain a sequencing result.

In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a storage medium, and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating over the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the text information processing method of the first aspect described above.

In a fourth aspect, an embodiment of the present application provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the text information processing method of the first aspect described above.

The beneficial effects of the application are as follows:

The embodiment of the application provides a text information processing method, a device, equipment and a storage medium, wherein an initial knowledge extraction model comprises an initial entity type recognition model and an initial entity relation extraction model, and the method comprises the following steps: labeling the original training text information according to the corresponding relation between the entity type and the entity name to obtain a first training sample, wherein the entity type comprises: a standard entity type, an attribute entity type and a value entity type, wherein the attribute entity type and the value entity type are respectively used for characterizing the standard entity type, and the first training sample comprises: the original training text information and entity types corresponding to the entity names in the original training text information; inputting the first training sample into the initial entity type recognition model, and training to obtain an entity type recognition model; labeling the first training sample according to the corresponding relation between the entity types to obtain a second training sample, wherein the corresponding relation between the entity types comprises: a pointing relationship between the subject entity type and the object entity type, the second training sample comprising: the original training text information and the corresponding relation between entity names corresponding to entity types in the original training text information; and inputting the second training sample into the initial entity relation extraction model, and training to obtain an entity relation extraction model.

In the text information processing method provided by the embodiment of the application, the attribute entity type and the value entity type are added when labeling the original training text information, so that not only the entity names corresponding to the standard entity type, but also the entity names corresponding to the attribute entity type and to the value entity type, can be identified from the original training text information. The constructed first training sample can therefore reflect the content of the original training text information more comprehensively. Given a comprehensive first training sample, the accuracy of the entity type recognition model trained with it can be improved, which in turn improves the accuracy of the knowledge extraction model containing the entity type recognition model.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic structural diagram of an initial knowledge extraction model according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a knowledge extraction model according to an embodiment of the present application;

FIG. 3 is a flowchart of a text information processing method according to an embodiment of the present application;

FIG. 4 is a flowchart of another text information processing method according to an embodiment of the present application;

FIG. 5 is a flowchart of another text information processing method according to an embodiment of the present application;

FIG. 6 is a schematic diagram of converting unstructured target text information into structured graph data according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a knowledge graph according to an embodiment of the present application;

FIG. 8 is a flowchart of another text information processing method according to an embodiment of the present application;

FIG. 9 is a flowchart of another text information processing method according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a text information processing apparatus according to an embodiment of the present application;

FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application.

Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.

Before explaining the embodiments of the present application in detail, some terms appearing in the embodiments of the present application are explained first.

Entity type: entity types cover the main concepts involved in medical texts. The following description mainly covers 11 entity types, which can be divided into standard entity types, an attribute entity type and a value entity type. The standard entity types mainly comprise: body part (including the organs of the nine major systems of the human body, and broadly any biological subject of interest, including human tissues and cells); patient subject (the subject suffering from a disease, broadly the subject described in the medical text); clinical manifestation (symptoms and subjective abnormal sensations of the patient's body, as well as manifestations of the patient's body parts, etc.); examination (including imaging examinations, laboratory physical and chemical tests, etc.); disease diagnosis (including diagnoses of diseases defined in medicine, diagnoses made by doctors in clinical practice, classifications of diseases, etc.); treatment method (including therapeutic interventions on the patient's health condition and medical devices used to alleviate or eliminate the patient's diseases and abnormal symptoms, including surgery and treatment); drug name (broadly, the names of all drugs used to prevent, treat and diagnose a patient's disease); orientation (describing specific location information potentially involved in other entities); and time (describing the time or time period at which an event occurs). The attribute entity type broadly refers to all potential attributes describing a standard entity type.

Entity name: a specific instance of an entity type. For example, the entity names corresponding to the entity type "body part" may include the names of organs of the nine major systems of the human body, such as right upper limb, shoulder and neck, etc.; the entity names corresponding to the entity type "clinical manifestation" may include numbness, pain, etc.; the entity names corresponding to the "attribute entity type" may include body temperature, boundary, etc.; and the entity names corresponding to the "value entity type" may include specific values such as a size (for example, 0.8x0.6cm), clear, etc.

The application scenario of the application is described next. The application scene can be a scene of extracting index information required by clinical scientific research from an electronic medical record, wherein the information in the electronic medical record is medical text information stored in an unstructured form, and the medical text information is difficult to be directly used for clinical scientific research, so that unstructured medical text information needs to be converted into structured graph data.

Specifically, the initial knowledge extraction model may be trained on the constructed training samples to obtain the knowledge extraction model. FIG. 1 is a schematic structural diagram of the initial knowledge extraction model provided in the embodiment of the present application. As shown in FIG. 1, the initial knowledge extraction model 100 may include an initial entity type recognition model 101 and an initial entity relationship extraction model 102. Optionally, the two models may be trained as a whole, in which case the entity type recognition model and the entity relationship extraction model are obtained when a training stop condition is satisfied; or the two models may be trained separately to obtain the entity type recognition model and the entity relationship extraction model respectively. The present application does not limit this.

FIG. 1 illustrates the case in which the initial entity type recognition model 101 and the initial entity relationship extraction model 102 are trained separately. The initial entity type recognition model 101 is a deep neural network model based on BERT (Bidirectional Encoder Representations from Transformers) and a CRF (Conditional Random Field) decoder. Specifically, a first training sample 1011 for training the initial entity type recognition model 101 may be constructed as described in the following embodiments, and the initial entity type recognition model 101 is trained on the first training sample 1011 to obtain the entity type recognition model. The initial entity relationship extraction model 102 is a BERT-based deep neural network. Specifically, the first training sample 1011 may be labeled as described in the following embodiments to obtain a second training sample 1021, and the initial entity relationship extraction model 102 is trained on the second training sample 1021 to obtain the entity relationship extraction model.

The knowledge extraction model obtained by the training is then applied. FIG. 2 is a schematic structural diagram of a knowledge extraction model provided in an embodiment of the present application. As shown in FIG. 2, the knowledge extraction model 200 may include an entity type recognition model 201 and an entity relationship extraction model 202. The target text information 2011 is input into the entity type recognition model 201, which may output an entity set containing the entity names included in the target text information 2011 and the entity types corresponding to those entity names. The entity names may include entity names corresponding to the "attribute entity type" (attribute entity names) and entity names corresponding to the "value entity type" (value entity names). The entity set output by the entity type recognition model 201 and the target text information 2011 are then input into the entity relationship extraction model 202, which may output entity name pairs 2021. The entity name pairs 2021 include standard entity names, attribute entity names and/or value entity names, together with the pointing relationships between them. In this way, the knowledge extraction model can convert unstructured medical text information into structured graph data (entity name pairs).
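The data flow between the two models can be sketched as a two-stage pipeline. The sketch below is only an illustration of the interfaces: `toy_recognizer` and `toy_relation_extractor` are hypothetical stand-ins (a lexicon lookup and a type-pair filter), not the BERT-based models of the embodiment, and all names and sample data are assumptions.

```python
from typing import Callable, List, Tuple

Entity = Tuple[str, str]      # (entity name, entity type)
NamePair = Tuple[str, str]    # (subject entity name, object entity name)

def extract_knowledge(
    text: str,
    recognize: Callable[[str], List[Entity]],
    extract_relations: Callable[[str, List[Entity]], List[NamePair]],
) -> List[NamePair]:
    """Stage 1 yields the entity set; stage 2 receives the text plus the entity set."""
    entity_set = recognize(text)
    return extract_relations(text, entity_set)

def toy_recognizer(text: str) -> List[Entity]:
    # Stand-in for the entity type recognition model: plain lexicon lookup.
    lexicon = {
        "patient": "patient subject",
        "body temperature": "attribute entity type",
        "38 ℃": "value entity type",
    }
    return [(name, etype) for name, etype in lexicon.items() if name in text]

def toy_relation_extractor(text: str, entity_set: List[Entity]) -> List[NamePair]:
    # Stand-in for the relation extraction model: pair up entity names whose
    # types form an assumed (subject type, object type) pointing relation.
    allowed = {
        ("patient subject", "attribute entity type"),
        ("attribute entity type", "value entity type"),
    }
    return [
        (subj, obj)
        for subj, subj_type in entity_set
        for obj, obj_type in entity_set
        if (subj_type, obj_type) in allowed
    ]

name_pairs = extract_knowledge(
    "The patient has a body temperature of 38 ℃.",
    toy_recognizer,
    toy_relation_extractor,
)
```

Passing the models in as callables keeps the pipeline independent of how each stage is implemented, which mirrors the separate training of the two models.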

The text information processing method mentioned in the present application is exemplified below with reference to the accompanying drawings. Fig. 3 is a flow chart of a text information processing method according to an embodiment of the present application, where, as shown in fig. 3, the method may include:

S301, labeling original training text information according to a corresponding relation between an entity type and an entity name to obtain a first training sample, wherein the entity type comprises: standard entity type, attribute entity type, and value entity type.

The attribute entity type and the value entity type are respectively used for characterizing the characteristics of the standard entity type.

Specifically, medical text information is content described in natural language and is therefore unstructured data. Unstructured data is difficult to use directly in different clinical scientific research tasks or statistical analysis tasks, so a model capable of converting unstructured data into structured data needs to be trained first. This model may be called a knowledge extraction model, and the following mainly describes how to obtain it.

The original training text information is medical text information, which can be extracted from a medical corpus; it should be noted that the application does not limit the number of extracted medical texts. An entity type framework can be preset. The entity type framework comprises a plurality of entity types, each entity type has a corresponding relation with entity names, and the entity types in the framework can be updated according to the actual clinical scientific research task. For example, if a clinical research task involves the entity type "microorganism" and this entity type is not included in the entity type framework, the entity type "microorganism" may be added to the framework; if a clinical research task does not generally involve the entity type "drug name" and this entity type exists in the framework, the entity type "drug name" may be deleted from the framework. That is, the entity type framework is extensible and can be dynamically adjusted according to the requirements of the actual clinical scientific research task.
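One way to picture such an extensible framework is a small registry that maps each entity type to its known entity names. The class name `EntityTypeFramework` and its methods are hypothetical and chosen only for illustration; the entity name "Escherichia coli" is likewise an assumed example.

```python
from typing import Dict, Iterable, List, Set

class EntityTypeFramework:
    """Hypothetical registry mapping entity types to their known entity names."""

    def __init__(self, type_to_names: Dict[str, Iterable[str]]) -> None:
        self._types: Dict[str, Set[str]] = {
            etype: set(names) for etype, names in type_to_names.items()
        }

    def add_type(self, entity_type: str, names: Iterable[str] = ()) -> None:
        # Add a new entity type, or extend the names of an existing one.
        self._types.setdefault(entity_type, set()).update(names)

    def remove_type(self, entity_type: str) -> None:
        # Drop an entity type that the current research task does not need.
        self._types.pop(entity_type, None)

    def types(self) -> List[str]:
        return sorted(self._types)

framework = EntityTypeFramework({
    "body part": ["right upper limb", "shoulder and neck"],
    "drug name": [],
})
framework.add_type("microorganism", ["Escherichia coli"])  # task needs this type
framework.remove_type("drug name")                          # task does not use it
current_types = framework.types()
```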

The entity types included in the entity type framework comprise a standard entity type, an attribute entity type and a value entity type, and these three entity types are related to one another. The attribute entity type is used for characterizing the standard entity type; that is, the entity name corresponding to the attribute entity type (attribute entity name) is an attribute of the entity name corresponding to the standard entity type (standard entity name). The value entity type is also used for characterizing the standard entity type and refers to a concrete manifestation of the attribute entity type; that is, the entity name corresponding to the value entity type (value entity name) is the concrete value of the entity name corresponding to the attribute entity type (attribute entity name), and can be understood as the attribute value of the entity name corresponding to the standard entity type (standard entity name).

For example, assuming that the original training text information includes "the patient has a fever and the fever peak is 38 ℃", the entity names "patient", "fever peak" and "38 ℃" may be labeled with the corresponding entity types "patient subject", "attribute entity type" and "value entity type" respectively, according to the corresponding relation between entity types and entity names. The labels may also be written in shorthand, for example "patient", "attribute", "value". Other content in the original training text information can be labeled in the same way, finally yielding a first training sample, which may include the original training text information and the entity types corresponding to the entity names in the original training text information.
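The labeling step can be sketched as a lookup from entity names to entity types that records where each name occurs. The sample sentence, the mapping `name_to_type`, the function `label_text` and the dictionary layout of the first training sample are assumptions for illustration; the application does not fix a storage format.

```python
from typing import Dict, List, Tuple

# Hypothetical label format: (start, end, entity name, entity type), end exclusive.
Label = Tuple[int, int, str, str]

def label_text(text: str, name_to_type: Dict[str, str]) -> List[Label]:
    """Label every occurrence of each known entity name with its entity type."""
    labels: List[Label] = []
    for name, entity_type in name_to_type.items():
        start = text.find(name)
        while start != -1:
            labels.append((start, start + len(name), name, entity_type))
            start = text.find(name, start + len(name))
    return sorted(labels)   # order labels by position in the text

name_to_type = {
    "patient": "patient subject",
    "numbness": "clinical manifestation",
    "right upper limb": "body part",
}
text = "The patient reports numbness of the right upper limb."

# A first training sample keeps the text together with its labels.
first_training_sample = {"text": text, "labels": label_text(text, name_to_type)}
labeled_names = [name for _, _, name, _ in first_training_sample["labels"]]
```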

In one possible embodiment, word segmentation may first be performed on the original training text information by a dictionary matching module to obtain a plurality of entity names, and each entity name is then labeled as described above to obtain the first training sample. The dictionary matching module includes an entity dictionary containing a plurality of entity names that are standard terms. Segmenting the original training text information with the dictionary matching module can improve the segmentation accuracy and thus the accuracy of the first training sample.
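A dictionary-driven segmentation of this kind can be sketched with forward maximum matching, a common dictionary-based technique. The application does not state which matching algorithm the dictionary matching module uses, so the algorithm choice, the function name `match_entity_names` and the sample dictionary are assumptions.

```python
from typing import List, Set

def match_entity_names(text: str, entity_dictionary: Set[str]) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary entry."""
    if not entity_dictionary:
        return []
    max_len = max(len(name) for name in entity_dictionary)
    found: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in entity_dictionary:
                found.append(candidate)
                i += length
                break
        else:
            i += 1   # no dictionary entry starts here; move on one character
    return found

# Illustrative entity dictionary of standard terms.
entity_dictionary = {"right upper limb", "upper limb", "numbness", "pain"}
segments = match_entity_names("right upper limb numbness and pain", entity_dictionary)
```

Because the longest entry wins, "right upper limb" is matched as a whole and the shorter entry "upper limb" is not emitted separately.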

S302, inputting a first training sample into the initial entity type recognition model, and training to obtain an entity type recognition model.

The initial entity type recognition model and the initial entity relation extraction model are trained independently. The original training text information in the first training sample is used as the input of the initial entity type recognition model, and the entity type corresponding to each entity name in that text is used as its expected output. The initial entity type recognition model is trained on this basis, and the entity type recognition model is obtained when a training stop condition is met.

S303, labeling the first training sample according to the correspondence between entity types to obtain a second training sample, wherein the correspondence between entity types comprises: a pointing relationship between a subject entity type and an object entity type.

The correspondence between entity types may be pre-stored in an entity relationship framework, which mainly comprises a constraint table of relationships and a direction table of relationships. The constraint table stores the entity types that have a correspondence. For example, if there is a correspondence between "patient subject" and "attribute entity type", the two may be stored in association with each other in the constraint table; entity types that are not stored in association in the constraint table are considered to have no correspondence. The direction table stores the pointing relationship between entity types. Continuing the example, if the relationship between "patient subject" and "attribute entity type" starts from "patient subject" and points to "attribute entity type", then "patient subject" is the subject entity type and "attribute entity type" is the object entity type; if the direction is reversed, "patient subject" is the object entity type and "attribute entity type" is the subject entity type. The pointing relationship between any two entity types is unique, and the pointing relationships among all entity types must not form a loop. For example, a "patient subject" takes a certain "drug name", which relieves a certain "clinical manifestation", which in turn occurs in a certain "patient subject": here the pointing relationships form a loop, and such a loop is not permitted among the pointing relationships preset in the entity relationship framework.
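The two tables and the no-loop requirement can be sketched as follows. The table contents, the names `CONSTRAINT_TABLE`, `DIRECTION_TABLE`, `is_allowed` and `has_loop`, and the use of a depth-first search to detect loops are all assumptions for illustration; the application does not prescribe a storage format or a checking algorithm.

```python
# Hypothetical tables; the type names are taken from the examples in the text.
CONSTRAINT_TABLE = {
    frozenset({"patient subject", "attribute entity type"}),
    frozenset({"attribute entity type", "value entity type"}),
}
DIRECTION_TABLE = {
    ("patient subject", "attribute entity type"),
    ("attribute entity type", "value entity type"),
}

def is_allowed(subject_type, object_type):
    """True if the two types are related and the relation points subject -> object."""
    related = frozenset({subject_type, object_type}) in CONSTRAINT_TABLE
    return related and (subject_type, object_type) in DIRECTION_TABLE

def has_loop(direction_table):
    """Depth-first search for a loop among the pointing relationships."""
    graph = {}
    for subject_type, object_type in direction_table:
        graph.setdefault(subject_type, []).append(object_type)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # reached a type that is still on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, []))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))

# The loop described in the text: patient -> drug -> manifestation -> patient.
looping_table = {
    ("patient subject", "drug name"),
    ("drug name", "clinical manifestation"),
    ("clinical manifestation", "patient subject"),
}
```

A check such as `has_loop` could be run whenever the direction table is edited, so that a looping configuration is rejected before any labeling takes place.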

The correspondence between entity names can be derived from the correspondence between entity types together with the correspondence between entity types and entity names. The first training sample can then be labeled on that basis, that is, entity names that have a correspondence in the original training text information of the first training sample are associated with each other. Because the correspondence between entity types includes the correspondence between the attribute entity type or the value entity type and the standard entity type, as well as the correspondence between the attribute entity type and the value entity type, the attribute entity names, value entity names and standard entity names in the original training text information can likewise be associated. A second training sample is finally obtained, which may include the original training text information and the correspondence between the entity names in it, where the entity names may include standard entity names, attribute entity names and value entity names.
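One simple way to derive candidate name-level correspondences from type-level ones is to pair every two labeled entity names whose types appear in the direction table. The sketch below assumes this approach; the function `candidate_pairs` and its inputs are hypothetical, and in practice the candidates would still need to be confirmed against the text, since not every type-compatible pair is actually related.

```python
def candidate_pairs(entities, direction_table):
    """Pair entity names whose types have a subject -> object pointing relationship.

    entities: list of (entity_name, entity_type) taken from the first training sample
    """
    pairs = []
    for subject_name, subject_type in entities:
        for object_name, object_type in entities:
            if (subject_type, object_type) in direction_table:
                pairs.append((subject_name, object_name))
    return pairs

entities = [
    ("patient", "patient subject"),
    ("heat peak", "attribute entity type"),
    ("38 ℃", "value entity type"),
]
direction_table = {
    ("patient subject", "attribute entity type"),
    ("attribute entity type", "value entity type"),
}
second_sample_relations = candidate_pairs(entities, direction_table)
```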

S304, inputting the second training sample into an initial entity relation extraction model, and training to obtain an entity relation extraction model.

The original training text information in the second training sample and the entity type corresponding to each entity name in that text are used as the input of the initial entity relation extraction model, and the correspondence between the entity names in that text is used as its expected output. The initial entity relation extraction model is trained on this basis, and the entity relation extraction model is obtained when a training stop condition is met.

In summary, in the text information processing method provided by the application, the original training text information is labeled with the added attribute entity type and value entity type, so that entity names corresponding to the attribute entity type and the value entity type, and not only those corresponding to the standard entity type, can be identified in the original training text information. The first training sample so constructed reflects the content of the original training text information more comprehensively. Because the first training sample is more comprehensive, the accuracy of the entity type recognition model trained on it can be improved, and in turn the accuracy of the knowledge extraction model that contains the entity type recognition model.

Optionally, labeling the first training sample according to the correspondence between entity types to obtain a second training sample includes: labeling the first training sample according to the correspondence between entity types and the strength information between the entity names corresponding to the entity types in the first training sample, to obtain the second training sample.

In addition to the constraint table and the direction table, the entity relationship framework includes the type of each entity relationship, which is equivalent to the strength information between the entity names corresponding to the entity types. Different pairs of entity names carry specific semantics. For example, assuming that a correspondence exists between "patient subject" and "clinical manifestation", the semantic relationship between the corresponding entity names can be expressed as: the patient exhibits a certain clinical manifestation to some degree. The type of the relationship then refers to the degree of the clinical manifestation, which may include: not present, slight, moderate, severe, and so on. As another example, if a correspondence exists between "treatment method" and "disease diagnosis", the semantic relationship between the corresponding entity names can be uniformly expressed as: a certain treatment method relieves or treats a certain disease diagnosis to some degree, where the degree may correspond to strength information of several grades.

After the entity names that have a correspondence in the original training text information are determined according to the correspondence between entity types, the semantic strength information between those entity names can be identified. Specifically, the strength information may be divided into positive semantics, negative semantics and uncertain semantics. Continuing the example above, assume that the relationship between the entity names corresponding to "patient subject" and "clinical manifestation" in the original training text is that the patient exhibits a certain clinical manifestation, which is positive semantics; the strength information between the two is then positive. Grades such as slight and moderate can all be classified as positive semantics, so each is mapped to a positive semantic relationship whose actual meaning is that the patient exhibits the clinical manifestation. Representing relationships by strength information in this way makes the meaning between the entity names of each entity type explicit. It should be noted that the application does not limit the grading of the degree; that is, the entity relationship framework is extensible, and the types of entity relationships can be adjusted dynamically according to the requirements of the actual clinical research task.
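The mapping from graded degrees to the three semantic classes can be sketched as a lookup table. The text states that slight and moderate map to positive semantics; treating "severe" as positive, "not present" as negative and any unknown grade as uncertain are assumptions of this sketch, as are the names `Semantics` and `to_semantics`.

```python
from enum import Enum

class Semantics(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    UNCERTAIN = "uncertain"

# Extensible table: grades can be added or regrouped as the task requires.
GRADE_TO_SEMANTICS = {
    "not present": Semantics.NEGATIVE,  # assumption
    "slight": Semantics.POSITIVE,
    "moderate": Semantics.POSITIVE,
    "severe": Semantics.POSITIVE,       # assumption
}

def to_semantics(grade):
    """Map a graded degree to a semantic class; unknown grades are uncertain."""
    return GRADE_TO_SEMANTICS.get(grade, Semantics.UNCERTAIN)
```

Keeping the grades in a table rather than in code reflects the extensibility described above: a new grade only requires a new table entry.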

In one implementation, the strength information between the entity names corresponding to each entity type in the first training sample can be revised based on regular expressions, which can improve the accuracy of the second training sample.
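The application does not disclose the regular expressions used. Purely as a hypothetical illustration, a revision step might look for negation or uncertainty cue words in the sentence that contains the entity name pair; the English cue words and the function `revise_strength` below are invented for this sketch.

```python
import re

# Hypothetical cue words; real patterns would be written for the source language.
NEGATIVE_CUE = re.compile(r"\b(no|denies|without)\b", re.IGNORECASE)
UNCERTAIN_CUE = re.compile(r"\b(possible|suspected|may)\b", re.IGNORECASE)

def revise_strength(sentence, strength):
    """Override the labeled strength when the sentence contains a cue word."""
    if NEGATIVE_CUE.search(sentence):
        return "negative"
    if UNCERTAIN_CUE.search(sentence):
        return "uncertain"
    return strength
```

For example, a pair labeled "positive" in the sentence "The patient denies cough" would be revised to "negative", while a sentence without cue words keeps its original label.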

It can be seen that representing the type of an entity relationship by strength information greatly simplifies the construction of the second training sample while preserving the semantic relationship carried between entity names, which can improve the efficiency of training the initial entity relation extraction model.

Fig. 4 is a flowchart of another text information processing method according to an embodiment of the present application. As shown in fig. 4, optionally, labeling the original training text information according to the correspondence between the entity type and the entity name to obtain a first training sample includes:

S401, marking the original training text information according to the corresponding relation between the entity type and the entity name, and obtaining an initial first training sample.

S402, if one entity name in the original training text information comprises a plurality of sub-entity names, deleting entity types corresponding to the sub-entity names in the initial first training sample to obtain the first training sample.

The original training text information may be segmented according to the entity dictionary in the entity matching module to obtain the entity names it contains, and then labeled according to the correspondence between entity types and entity names to obtain an initial first training sample. The specific labeling manner is as described above and is not repeated here.

In one implementation, entity names of every granularity in the original training text information are labeled. If entity names overlap, each entity name of each granularity is labeled with its corresponding entity type; this labeling manner may be called nested labeling. For example, assume that the entity name "prostatic hyperplasia" appears in the original training text information and that it includes the sub-entity names "prostate" and "hyperplasia", where the entity type of "prostatic hyperplasia" is "disease diagnosis", the entity type of the sub-entity name "prostate" is "body part", and the entity type of the sub-entity name "hyperplasia" is "clinical manifestation". The "disease diagnosis" entity "prostatic hyperplasia" then nests two further entity types, namely "body part" and "clinical manifestation".

In another implementation, based on the longest matching principle, only the entity type labeled on the longest entity name is retained, and the entity types labeled on the sub-entity names it includes are deleted. Continuing the example, the entity type "disease diagnosis" of the entity name "prostatic hyperplasia" is retained, while the entity type "body part" of the sub-entity name "prostate" and the entity type "clinical manifestation" of the sub-entity name "hyperplasia" are deleted, thereby obtaining the first training sample. Alternatively, after the original training text information is segmented according to the entity dictionary in the entity matching module and the entity names it contains are obtained, it is first detected whether the entity names overlap; if so, only the longest entity name is kept, and the first training sample is then obtained according to the correspondence between each remaining entity name and its entity type.
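The deletion of sub-entity labels under the longest matching principle can be sketched as a filter over labeled spans. The tuple layout `(start, end, entity_type)` and the function `drop_nested` are assumptions for illustration; the offsets refer to the English string "prostatic hyperplasia" used in this sketch.

```python
def drop_nested(labels):
    """Keep only labels whose span is not contained in a strictly longer span.

    labels: list of (start, end, entity_type) with end exclusive
    """
    kept = []
    for start, end, entity_type in labels:
        contained = any(
            other_start <= start
            and end <= other_end
            and (other_end - other_start) > (end - start)
            for other_start, other_end, _ in labels
        )
        if not contained:
            kept.append((start, end, entity_type))
    return kept

# Nested labels for the text "prostatic hyperplasia" (21 characters).
initial_labels = [
    (0, 21, "disease diagnosis"),        # whole entity name
    (0, 9, "body part"),                 # sub-entity
    (10, 21, "clinical manifestation"),  # sub-entity
]
final_labels = drop_nested(initial_labels)
```

Only the outermost "disease diagnosis" label remains, which corresponds to step S402.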

The following mainly explains the process of applying the entity type recognition model and the entity relationship extraction model.

Fig. 5 is a flowchart of another text information processing method according to an embodiment of the present application.

As shown in fig. 5, the method may further include:

S501, inputting target text information into an entity type recognition model, and outputting an entity set, wherein the entity set comprises: the entity name and the entity type corresponding to the entity name contained in the target text information, wherein the entity type comprises a standard entity type, an attribute entity type and a value entity type.

The target text information is medical text information whose content is unstructured data. Referring to fig. 2, the target text information may be input into the entity type recognition model 201 in the knowledge extraction model 200. The entity type recognition model 201 can also recognize attribute entity names and value entity names in the target text information; that is, the entity set it outputs may include entity names corresponding to the "standard entity type", such as "patient"; entity names corresponding to the "attribute entity type", such as "heat peak" and "size"; and entity names corresponding to the "value entity type", such as "38 ℃" and "0.8x0.6cm".

S502, inputting the target text information and the entity set into the entity relation extraction model, and outputting an entity name pair, wherein the entity name pair comprises a subject entity name and an object entity name, and the subject entity name points to the object entity name.

The target text information and the entity set output by the entity type recognition model may be input together into the entity relation extraction model in the knowledge extraction model. Based on the target text information, the entity names in the entity set and their corresponding entity types, the entity relation extraction model identifies the entity name that serves as the subject and then the entity name that serves as the object. The two form an entity name pair, in which the entity name serving as the subject is called the subject entity name, the entity name serving as the object is called the object entity name, and the subject entity name points to the object entity name. Standard entity names, attribute entity names and value entity names may all appear among the subject entity names and object entity names.

It can be seen that the entity type recognition model trained on the first training sample can recognize the various entity names contained in the target text information, such as standard entity names, attribute entity names and value entity names. The entity relation extraction model trained on the second training sample can then extract the entity name pairs that have a relationship, that is, the pointing relationships among the standard entity names, attribute entity names and value entity names. The entity names contained in the target text information are therefore extracted more comprehensively, and the resulting structured graph data matches the unstructured target text information more closely; in other words, it reflects the content of the unstructured target text information more completely. Fig. 6 is a schematic diagram of converting unstructured target text information into structured graph data. In fig. 6, the content in a circular frame is a standard entity name in the unstructured target text information, such as fever or cough; the content in a diamond frame is an attribute entity name, such as heat peak or vomit; and the content in a square frame is a value entity name, such as 38.2 ℃ or clear.

Optionally, inputting the target text information and the entity set into the entity relationship extraction model, and outputting an entity name pair, including:

Inputting the target text information and the entity set into the entity relation extraction model, and outputting an entity name pair and the strength information between the entity names contained in the entity name pair, wherein the entity name pair comprises a subject entity name and an object entity name, and the subject entity name points to the object entity name.

After the entity type recognition model outputs the entity set, the entity set and the target text information may be input into an entity relation extraction model trained on a second training sample labeled with the strength information between the entity names corresponding to the entity types. This entity relation extraction model may output the entity name pairs that have a correspondence together with the strength information for each pair, where the strength information indicates the semantics between the two entity names of the pair.

Optionally, after inputting the target text information and the entity set into the entity relation extraction model and outputting the entity name pair, the method further includes: constructing a knowledge graph according to the entity name pairs, taking the subject entity names and the object entity names in the entity name pairs as nodes in the knowledge graph respectively, and taking the relationship between the subject entity names and the object entity names as edges in the knowledge graph.

The entity name pairs may be referred to as graph data; that is, the graph data includes entity names and the correspondences between them, and may be stored in an associated database. The knowledge graph consists of nodes and edges. The nodes are the entity names recognized by the entity type recognition model, that is, the entity names in the graph data, and each node may be associated with the entity type corresponding to its entity name and with the position of the entity name in the target text information. The edges are the entity name pairs with a correspondence extracted by the entity relation extraction model; each edge points from the subject entity name to the object entity name of the pair and represents the correspondence between the entity names in the graph data. Alternatively, after the entity relation extraction model outputs a plurality of entity name pairs, the pairs may be stored in a database in the form of graph data; the subject entity name and the object entity name of each pair are then extracted from the database, each subject entity name and each object entity name is used as a node of the knowledge graph, and the relationship between them is used as an edge of the knowledge graph.

In another embodiment, the strength information between the entity names contained in an entity name pair can also be added to the corresponding edge of the knowledge graph. Fig. 7 is a schematic structural diagram of a knowledge graph provided by an embodiment of the present application. As can be seen from fig. 7, if the strength information between "patient" and "fever" is "present", this indicates that the patient has a fever, and the heat peak is shown as 38.2 ℃.
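A knowledge graph of this kind can be sketched with plain dictionaries and lists. The function `build_knowledge_graph`, the node and edge layout, and the sample strength values on the attribute and value edges are assumptions for illustration; the application does not specify a graph library or storage schema.

```python
def build_knowledge_graph(entity_set, name_pairs):
    """Return (nodes, edges) for a directed graph.

    entity_set: iterable of (entity_name, entity_type)
    name_pairs: iterable of (subject_name, object_name, strength)
    """
    nodes = {name: {"entity_type": entity_type} for name, entity_type in entity_set}
    edges = []
    for subject, obj, strength in name_pairs:
        # Skip pairs that mention a name the recognition model did not output.
        if subject in nodes and obj in nodes:
            edges.append({"from": subject, "to": obj, "strength": strength})
    return nodes, edges

entity_set = [
    ("patient", "patient subject"),
    ("fever", "clinical manifestation"),
    ("heat peak", "attribute entity type"),
    ("38.2 ℃", "value entity type"),
]
name_pairs = [
    ("patient", "fever", "present"),
    ("fever", "heat peak", "present"),      # strength value assumed
    ("heat peak", "38.2 ℃", "present"),     # strength value assumed
]
nodes, edges = build_knowledge_graph(entity_set, name_pairs)
```

Each edge runs from the subject entity name to the object entity name and carries the strength information as an attribute, so the direction and the semantics of the relationship are both kept.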

Optionally, the knowledge graph may be displayed according to a preset display state. For example, the display size of each node in the knowledge graph may be determined by the number of times the corresponding entity name appears in the entity name pairs: the more often an entity name appears, the larger its node is displayed. As shown in fig. 7, the node corresponding to the entity name "patient" is displayed largest. Entity names belonging to the same entity type may be displayed in the same color. The application does not limit the specific display state of the knowledge graph.
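The size rule can be sketched by counting how often each entity name appears in the entity name pairs. The linear formula and the `base` and `step` parameters are arbitrary choices for this sketch, not values given by the application.

```python
from collections import Counter

def node_sizes(name_pairs, base=10, step=4):
    """Return a display size per entity name that grows with its pair count."""
    counts = Counter()
    for subject, obj in name_pairs:
        counts[subject] += 1
        counts[obj] += 1
    return {name: base + step * count for name, count in counts.items()}

pairs = [
    ("patient", "fever"),
    ("patient", "cough"),
    ("patient", "vomit"),
    ("fever", "heat peak"),
    ("heat peak", "38.2 ℃"),
]
sizes = node_sizes(pairs)
```

With these pairs, "patient" appears three times and therefore receives the largest size, consistent with the description of fig. 7.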

Fig. 8 is a flowchart of another text information processing method according to an embodiment of the present application. Optionally, as shown in fig. 8, after the knowledge graph is constructed according to the entity name pairs, the method further includes:

S801, according to a knowledge acquisition instruction input by a user, acquiring a corresponding entity name from a database storing graph data corresponding to the knowledge graph.

The graph data corresponding to the knowledge graph is pre-stored in an associated database. After a knowledge acquisition instruction input by a user is received, the graph data matching the instruction, which includes entity names, can be acquired from the database. For example, the user may search the database for index information that meets a requirement: assuming that the knowledge acquisition instruction input by the user is "what clinical manifestations does the patient have", the entity names corresponding to clinical manifestations may be acquired from the knowledge graph.
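The application does not name a database or a query language. As a minimal sketch, assuming the graph data is stored as an edge table in an in-memory SQLite database (the table name `edges`, its columns and the function `objects_of_type` are hypothetical), the example query could be answered as follows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (subject TEXT, object TEXT, object_type TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [
        ("patient", "fever", "clinical manifestation"),
        ("patient", "cough", "clinical manifestation"),
        ("fever", "heat peak", "attribute entity type"),
    ],
)

def objects_of_type(connection, subject, object_type):
    """Return object entity names of the given type that the subject points to."""
    cursor = connection.execute(
        "SELECT object FROM edges WHERE subject = ? AND object_type = ? ORDER BY object",
        (subject, object_type),
    )
    return [row[0] for row in cursor]

manifestations = objects_of_type(conn, "patient", "clinical manifestation")
```

Translating the user's natural-language instruction into the `subject` and `object_type` arguments is outside the scope of this sketch.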

S802, displaying the entity name in the knowledge graph according to the display state corresponding to the entity name.

The knowledge graph can be visually displayed on an interface, as shown in fig. 7, the display state of the entity name can be preset by taking the entity type as a dimension, for example, the entity name corresponding to the attribute entity type can be represented by a yellow display parameter, namely, the related node is displayed as yellow; the entity name corresponding to the "value entity type" may be represented by a blue display parameter, i.e., the associated node is displayed in blue. Therefore, the user can conveniently and intuitively know the information to be checked.

Fig. 9 is a flowchart of another text information processing method according to an embodiment of the present application. Optionally, as shown in fig. 9, after the target text information is input into the entity type recognition model and the entity set is output, the method further includes:

S901, performing a statistical operation on the entity set to obtain a statistical result, wherein the statistical result comprises: the frequency of occurrence of each entity name and/or the frequency of occurrence of each entity type in the entity set.

S902, sorting the contents belonging to the same dimension in the statistical result respectively to obtain a sorting result.

After the entity type recognition model outputs the entity set, the entity set can be input into an entity consistency detection module, which performs statistical analysis on the information in the entity set. Specifically, the number of occurrences of each entity name and of each entity type in the entity set can be counted. For example, suppose the entity set contains four entity categories (body part, clinical manifestation, attribute entity type and value entity type), where body part and clinical manifestation belong to the same dimension, that is, both are standard entity types. The occurrence counts of body part and clinical manifestation can then be sorted to obtain a sorting result, so that a user can view the information in the entity set in real time.
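Steps S901 and S902 can be sketched with the standard library `collections.Counter`. The function `entity_statistics` and the sample entity set are assumptions for illustration.

```python
from collections import Counter

def entity_statistics(entity_set):
    """Count entity names and entity types, each sorted by descending frequency.

    entity_set: list of (entity_name, entity_type), one item per occurrence
    """
    name_counts = Counter(name for name, _ in entity_set)
    type_counts = Counter(entity_type for _, entity_type in entity_set)
    return name_counts.most_common(), type_counts.most_common()

entity_set = [
    ("fever", "clinical manifestation"),
    ("cough", "clinical manifestation"),
    ("fever", "clinical manifestation"),
    ("lung", "body part"),
]
names_sorted, types_sorted = entity_statistics(entity_set)
```

Here the names and the types are sorted separately, which is one simple reading of sorting "contents belonging to the same dimension"; grouping by standard, attribute and value entity types before sorting would be a straightforward extension.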

Fig. 10 is a schematic structural diagram of a text information processing apparatus according to an embodiment of the present application. As shown in fig. 10, the apparatus includes:

The first labeling module 1001 is configured to label the original training text information according to a correspondence between an entity type and an entity name, so as to obtain a first training sample, where the entity type includes: standard entity type, attribute entity type, and value entity type;

the first training module 1002 is configured to input a first training sample into an initial entity type recognition model, and train to obtain an entity type recognition model;

the second labeling module 1003 is configured to label the first training sample according to a correspondence between entity types, to obtain a second training sample, where the correspondence between entity types includes: a pointing relationship between the subject entity type and the object entity type;

The second training module 1004 is configured to input the second training sample into an initial entity relationship extraction model, and train to obtain an entity relationship extraction model.

Optionally, the second labeling module 1003 is specifically configured to label the first training sample according to the correspondence between entity types and the strength information between entity names corresponding to each entity type in the first training sample, so as to obtain a second training sample.

Optionally, the first labeling module 1001 is specifically configured to label the original training text information according to a correspondence between the entity type and the entity name, so as to obtain an initial first training sample; if one entity name in the original training text information comprises a plurality of sub-entity names, deleting the entity type corresponding to each sub-entity name in the initial first training sample to obtain the first training sample.

Optionally, the apparatus further comprises:

The first output module is used for inputting the target text information into the entity type recognition model and outputting an entity set, wherein the entity set comprises: the entity names and entity types corresponding to the entity names contained in the target text information, wherein the entity types comprise standard entity types, attribute entity types and value entity types;

The second output module is used for inputting the target text information and the entity set into the entity relation extraction model and outputting an entity name pair, wherein the entity name pair comprises a subject entity name and an object entity name, and the subject entity name points to the object entity name.

Optionally, the second output module is specifically configured to input the target text information and the entity set into the entity relation extraction model, and output the entity name pair and the strength information between the entity names included in the entity name pair, where the entity name pair includes a subject entity name and an object entity name, and the subject entity name points to the object entity name.

Optionally, the apparatus further comprises:

The construction module is used for constructing a knowledge graph according to the entity name pairs, taking the subject entity names and the object entity names in the entity name pairs as nodes in the knowledge graph respectively, and taking the relationship between the subject entity names and the object entity names as edges in the knowledge graph.

Optionally, the apparatus further comprises:

The statistics module is used for carrying out statistics operation on the entity set to obtain a statistics result, wherein the statistics result comprises: the frequency of occurrence of each entity name and/or the frequency of occurrence of each entity type in the entity set; and respectively sequencing the contents belonging to the same dimension in the statistical result to obtain a sequencing result.

The foregoing apparatus is used for executing the method provided in the foregoing embodiment, and its implementation principle and technical effects are similar, and are not described herein again.

The above modules may be one or more integrated circuits configured to implement the above methods, for example: one or more application-specific integrated circuits (ASIC), one or more microprocessors, or one or more field-programmable gate arrays (FPGA). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU), or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 11, the electronic device may include a processor 1101, a storage medium 1102 and a bus 1103. The storage medium 1102 stores machine-readable instructions executable by the processor 1101. When the electronic device is running, the processor 1101 communicates with the storage medium 1102 through the bus 1103 and executes the machine-readable instructions to perform the steps of the above method embodiments. The specific implementation and technical effects are similar and are not repeated here.

Optionally, the present application further provides a storage medium, on which a computer program is stored, which when being executed by a processor performs the steps of the above-described method embodiments.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is merely a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or communication connection between devices or units may be indirect and may be electrical, mechanical, or of another form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.

The integrated units implemented in the form of software functional units described above may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform some of the steps of the methods according to the embodiments of the application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above description covers only the preferred embodiments of the present application and is not intended to limit the present application; those skilled in the art may make various modifications and variations to the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application. It should be noted that like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.

Claims

1. A method for processing text information, wherein an initial knowledge extraction model includes an initial entity type recognition model and an initial entity relationship extraction model, the method comprising:

inputting the second training sample into the initial entity relation extraction model, and training to obtain an entity relation extraction model;

labeling the first training sample according to the corresponding relation between entity types to obtain a second training sample, wherein the labeling comprises the following steps:

labeling the first training sample according to the corresponding relation between the entity types and the strength degree information between entity names corresponding to the entity types in the first training sample to obtain a second training sample;

the method further comprises the steps of:

2. The method of claim 1, wherein labeling the original training text information according to the correspondence between the entity type and the entity name to obtain the first training sample includes:

3. The method of claim 1, wherein said inputting the target text information and the entity set into the entity relationship extraction model, outputting an entity name pair, comprises:

4. The method of claim 3, wherein said inputting said target text information and said set of entities into said entity relationship extraction model, after outputting a pair of entity names, further comprises:

and constructing a knowledge graph according to the entity name pairs, respectively taking the subject entity names and the object entity names in the entity name pairs as nodes in the knowledge graph, and taking the relation between the subject entity names and the object entity names as edges in the knowledge graph.
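The graph construction recited in claim 4 can be illustrated with a minimal sketch. It assumes each entity name pair is available as a (subject entity name, relation, object entity name) triple. The names `EntityNamePair` and `build_knowledge_graph`, and the sample medical terms, are hypothetical and chosen only for illustration; the claims do not prescribe any particular data structure.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class EntityNamePair(NamedTuple):
    """Hypothetical container for one extracted entity name pair."""
    subject: str   # subject entity name
    relation: str  # relation between subject and object
    obj: str       # object entity name


def build_knowledge_graph(
    pairs: Iterable[EntityNamePair],
) -> Tuple[Set[str], Dict[Tuple[str, str], Set[str]]]:
    """Return (nodes, edges) built from entity name pairs.

    Nodes are the subject and object entity names. Edges map a
    (subject, object) node pair to the set of relation labels
    that connect them.
    """
    nodes: Set[str] = set()
    edges: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for pair in pairs:
        # Both ends of the pair become nodes in the graph.
        nodes.add(pair.subject)
        nodes.add(pair.obj)
        # The relation becomes a labeled edge from subject to object.
        edges[(pair.subject, pair.obj)].add(pair.relation)
    return nodes, dict(edges)


# Illustrative input only; these triples are not taken from the patent.
sample_pairs = [
    EntityNamePair("hypertension", "has_symptom", "headache"),
    EntityNamePair("hypertension", "treated_by", "amlodipine"),
]
nodes, edges = build_knowledge_graph(sample_pairs)
```

Storing edges as a mapping from node pairs to sets of relation labels lets the same two entity names carry more than one relation, which is one possible design choice among several.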

5. The method of claim 4, wherein after constructing a knowledge-graph from the entity name pairs, the method further comprises:

6. The method of claim 3, wherein after inputting the target text information into the entity type recognition model and outputting the entity set, the method further comprises:

7. A text information processing apparatus, wherein an initial knowledge extraction model includes an initial entity type recognition model and an initial entity relationship extraction model, the apparatus comprising:

the second training module is used for inputting the second training sample into the initial entity relation extraction model and training to obtain an entity relation extraction model;

the second labeling module is specifically configured to:

the second training module is further configured to:

8. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating over the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the text information processing method of any of claims 1-6.

9. A storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the text information processing method according to any of claims 1-6.