CN113628757A

CN113628757A - A method, system and device for obtaining temporal compound words in medical text based on lexical word formation

Info

Publication number: CN113628757A
Application number: CN202110970693.7A
Authority: CN
Inventors: 卢旭召; 李军; 周鹏程; 冯洪海; 魏亚举; 侯瑞辉
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2021-08-23
Filing date: 2021-08-23
Publication date: 2021-11-09

Abstract

The invention discloses a method, device and system for extracting temporal compound words from medical texts, and relates to the technical field of natural language processing and information extraction. The system includes a reading module, a computing module and a display module. The reading module reads the medical text into the system. The computing module includes a temporal compound word extraction unit, a temporal compound word update unit, and a new temporal compound word structure extraction unit. Extraction is based mainly on the word formation pattern "numerical words + time words + time collocation words", and some temporal compound word structures must first be summarized manually. The steps are as follows: a. read the medical text; b. extract temporal compound words using the temporal compound word structures; c. derive new temporal compound word structures from the extracted temporal compound words. If an iteration increases the number of existing entities, the iteration continues; otherwise it ends. The display module includes a storage unit and an output unit. Taking medical text as its starting point, the invention achieves accurate extraction of temporal compound words from medical text.

Description

Method, system and device for acquiring time compound words in medical text based on lexical word formation method

Technical Field

The invention relates to the technical field of natural language processing information extraction, in particular to a method, a device and a system for extracting time compound words from medical texts.

Background

In recent years, a large volume of medical text has accumulated on the internet. It consists mainly of treatises in professional textbooks, professional medical websites, medical classics, electronic medical records and medical research periodicals. These texts contain abundant medical data, chiefly information on the onset time, treatment time, etiology, symptoms, treatment and diagnosis of diseases. However, most of this data exists in semi-structured or unstructured form, and current natural language processing and information extraction technology is not yet mature enough to extract complete and accurate information from unstructured text. Existing companies and products have not been able to extract disease-related time compound words accurately. The invention analyzes the structure of the time compound words common in medical text, expresses that structure mathematically, and designs an iterative algorithm and program that can iteratively acquire accurate time compound words from medical text.

As computing has developed, text mining systems have been implemented. For example, patent application No. 201910701406.5 discloses a text mining method and system based on unstructured electronic medical records, comprising a text preprocessing module, a feature engineering module and an analysis and prediction module. The main features it extracts include symptoms, examination findings, radiotherapy and chemotherapy regimens, and efficacy evaluations. That patent segments hospitalization records by time node, extracts features through rule-base disease information extraction, and finally clusters the texts by unsupervised clustering. It segments by time node, but its time-node acquisition accuracy is only moderate, and it does not take the complete semantics of sentences into account. Its input text comprises only the medical history records in a hospital database, so its range of data sources is narrow.

Recognition tasks in the medical field face many difficulties, mainly in the following respects.

From the perspective of the extraction process:

The medical field typically contains a rich variety of entity categories.

An entity's context contains many different modifiers and qualifiers, which make the entity's boundaries harder to determine and delimit.

The entities to be extracted can usually be described in several different ways.

The length of a time compound word entity is often difficult to determine.

From the perspective of the extraction result:

Few time compound words are extracted: only thousands, or at most somewhat more than ten thousand, which falls short of the scale of tens of thousands to hundreds of thousands. The medical texts involved number only a few thousand, which falls short of the scale of tens of thousands.

Disclosure of Invention

The invention aims to provide a method, a device and a system for acquiring time compound words in medical text, so as to solve the problems set forth in the background above. Taking medical text as its starting point, the invention extracts the time compound word entities that the text contains.

To achieve the above object, the invention provides a method for extracting time compound words, which mainly comprises the following steps.

Step 1: acquire medical texts and manually summarize some time compound word structures.

Step 2: acquire time compound words from the medical text using the time compound word structures.

Step 3: remove impurities from the time compound words and merge the cleaned time compound words into the existing time compound word set.

Step 4: extract new time compound word structures based on the updated time compound word set.

Step 5: remove impurities from the time compound word structures, verify them, and merge the verified structures into the existing time compound word structure set.

Step 6: repeat Step 2 with the new time compound word structures until no new time compound word structure appears.

Step 7: perform a final impurity removal on the time compound words.
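The alternation between structures and words in Steps 1 to 7 can be illustrated with a minimal sketch. It assumes that a structure is represented as a regular expression string; the function names `bootstrap`, `extract` and `induce`, the `UNITS` string and the toy generalization rule are all hypothetical, since the text does not specify how structures are encoded or induced.

```python
import re
from typing import Callable, Iterable

def bootstrap(
    sentences: Iterable[str],
    seed_structures: set[str],
    extract: Callable[[str, set[str]], set[str]],
    induce: Callable[[set[str]], set[str]],
    clean_words: Callable[[set[str]], set[str]] = lambda w: w,
    verify_structures: Callable[[set[str]], set[str]] = lambda s: s,
    max_rounds: int = 20,
) -> tuple[set[str], set[str]]:
    """Alternate between extracting words and inducing structures."""
    sentences = list(sentences)
    structures = set(seed_structures)      # Step 1: manually summarized seeds
    words: set[str] = set()
    frontier = set(structures)             # structures not yet applied to the text
    for _ in range(max_rounds):
        if not frontier:                   # Step 6: stop when no new structure appears
            break
        found: set[str] = set()
        for sentence in sentences:         # Step 2: extract with the new structures
            found |= extract(sentence, frontier)
        words |= clean_words(found)        # Step 3: remove impurities, then merge
        candidates = induce(words)         # Step 4: derive structures from the words
        frontier = verify_structures(candidates) - structures  # Step 5
        structures |= frontier
    return clean_words(words), structures  # Step 7: final impurity removal

# Toy stand-ins (assumptions, not the patented procedures).
UNITS = "天周月年"

def extract(sentence: str, structures: set[str]) -> set[str]:
    return {m.group() for p in structures for m in re.finditer(p, sentence)}

def induce(words: set[str]) -> set[str]:
    patterns = set()
    for word in words:
        pattern = re.sub(r"\d+", lambda m: r"\d+", word)   # digits -> \d+
        pattern = re.sub(f"[{UNITS}]", f"[{UNITS}]", pattern)  # unit -> unit class
        patterns.add(pattern)
    return patterns

words, structures = bootstrap(
    ["患者3天前发热。", "2周前出现咳嗽。"], {r"\d+天前"}, extract, induce
)
print(sorted(words), sorted(structures))
```

In this toy run the seed structure finds only "3天前"; the structure induced from it then also finds "2周前", after which no new structure appears and the loop ends.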

Preferably, the medical texts obtained in Step 1 are unstructured medical texts of various kinds, such as hospital case records, treatises in professional textbooks, professional medical websites, medical classics, electronic medical records, and medical research periodicals.

Preferably, a regular expression is applied to the medical text that has been read in order to extract the Chinese sentences it contains.
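One possible reading of this step is sketched below. The text does not give the regular expression itself, so the sentence terminators, the CJK character range and the function name `chinese_sentences` are assumptions made for illustration.

```python
import re

# Assumption: a sentence runs up to a Chinese terminator (。！？；) or a line break,
# and counts as "Chinese" if it contains at least one CJK unified ideograph.
_SENTENCE = re.compile(r"[^。！？；\n]+[。！？；]?")
_HAS_CJK = re.compile(r"[\u4e00-\u9fff]")

def chinese_sentences(text: str) -> list[str]:
    """Return the sentences of `text` that contain Chinese characters."""
    sentences = []
    for match in _SENTENCE.finditer(text):
        sentence = match.group().strip()
        if sentence and _HAS_CJK.search(sentence):
            sentences.append(sentence)
    return sentences

sample = "患者3天前出现发热。Fig. 2 shows the result.\n咳嗽2周余，加重1天！"
print(chinese_sentences(sample))
```

The purely English fragment is dropped, while digits inside a Chinese sentence are kept because they are needed later as numerical words.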

Preferably, each semantic element is learned iteratively; that is, time compound words and time compound word structures are learned through repeated iterations.

Preferably, the accuracy of each semantic element is ensured as it is learned, so as to further improve the accuracy of the next round of extraction.

Corresponding to the method, the invention also provides a time compound word extraction system, which comprises the following units.

A text input unit, used by the system to read unstructured medical text.

A time compound word extraction unit, used to extract medical time compound word entities by means of the time compound word structures.

A time compound word update unit, used to update the existing time compound word set.

A new time compound word structure extraction unit, used to extract new time compound word structures by segmenting the time compound words and analyzing their parts of speech.

A storage unit, used for structured storage of the results, storing the extracted time compound words and time compound word structures in corresponding files.
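The new time compound word structure extraction unit can be pictured with the following sketch, which segments each word and joins the part-of-speech labels into a structure. The mini-lexicon, the label names `NUM`, `TIME`, `COLLOC` and `OTHER`, and the greedy longest-match segmenter are hypothetical stand-ins; a real word segmentation tool would supply its own tokens and tag set.

```python
from collections import Counter

# Hypothetical mini-lexicon: surface form -> part-of-speech label.
LEXICON = {
    "天": "TIME", "周": "TIME", "月": "TIME", "年": "TIME",
    "前": "COLLOC", "后": "COLLOC", "余": "COLLOC", "以来": "COLLOC",
    "半": "NUM", "两": "NUM",
}
MAX_ENTRY_LEN = 4

def segment(word: str) -> list[tuple[str, str]]:
    """Greedy longest-match segmentation; runs of digits are tagged NUM."""
    tokens, i = [], 0
    while i < len(word):
        if word[i].isdigit():
            j = i
            while j < len(word) and word[j].isdigit():
                j += 1
            tokens.append((word[i:j], "NUM"))
            i = j
            continue
        for length in range(min(MAX_ENTRY_LEN, len(word) - i), 0, -1):
            piece = word[i:i + length]
            if piece in LEXICON:
                tokens.append((piece, LEXICON[piece]))
                i += length
                break
        else:
            # No lexicon entry starts here: emit one unknown character.
            tokens.append((word[i], "OTHER"))
            i += 1
    return tokens

def structure_of(word: str) -> str:
    """Join the part-of-speech labels of a word into a structure string."""
    return "+".join(tag for _, tag in segment(word))

words = ["3天前", "2周以来", "半年余", "术后3天"]
structure_counts = Counter(structure_of(w) for w in words)
print(structure_counts.most_common())
```

Counting how often each structure occurs gives a natural place to apply a verification step, for example keeping only structures supported by several distinct words.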

Corresponding to the system, an embodiment of the invention provides a time compound word extraction device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the system for extracting time compound words from medical text is implemented.

Embodiments of the present invention provide a computer readable storage medium, which may store a computer program that, when executed by a processor, implements a system for extracting temporal compounds in medical text.

Compared with the prior art, the invention has the following advantages and beneficial effects.

(1) The invention realizes a method, a device and a system for extracting time compound words from medical texts, in which the processor can extract time compound words accurately by applying the constraints of different time compound word structures. It also handles well the problem that the entity length of time compound words in this field is otherwise difficult to process.

(2) The invention brings the number of extracted time compound words to the order of tens of thousands, and greatly improves precision and accuracy.

Drawings

FIG. 1 is a block diagram of the system of the present invention.

FIG. 2 is a flow chart of a temporal compound extraction method of the present invention.

FIG. 3 is a schematic flow chart of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings.

The invention provides a method for extracting time compound words from medical texts, which works mainly as follows: time compound words are identified through the time compound word structures provided by the invention, and the semantic elements in each structure strictly constrain the time compound word entity so that it can finally be extracted. Newly added semantic elements are then used for further learning and extraction, expanding the semantic element library. The method can be applied to unstructured medical texts of various kinds, such as treatises in professional textbooks, professional medical websites, medical classics, electronic medical records and medical research periodicals; it handles the large variation in the entity length of time compound words well, and it plays a vital role in research in related vertical fields.

Example one.

Referring to FIG. 1, which is a block diagram of a system for extracting time compound words from medical texts, the system includes the following components.

A time compound word extraction unit, used to extract time compound word entities by means of the time compound word structures.

Unstructured medical text is first input to the system through the text input unit in the reading module. Then, in the computing module, the corresponding entity words are extracted by combining the time compound word structures, using the time compound word extraction unit, the time compound word update unit and the new time compound word structure extraction unit. Finally, the extracted entities are stored in structured form by the storage unit in the display module.
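The final storage step might look like the sketch below. The text says only that words and structures are stored in corresponding files, so the JSON format, the file names and the function name `store_results` are assumptions.

```python
import json
import tempfile
from pathlib import Path

def store_results(words: set[str], structures: set[str], out_dir: str) -> tuple[Path, Path]:
    """Write words and structures to two separate UTF-8 JSON files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    words_path = out / "time_compound_words.json"            # hypothetical file name
    structures_path = out / "time_compound_structures.json"  # hypothetical file name
    # Sorting makes the output deterministic; ensure_ascii=False keeps Chinese readable.
    words_path.write_text(
        json.dumps(sorted(words), ensure_ascii=False, indent=2), encoding="utf-8"
    )
    structures_path.write_text(
        json.dumps(sorted(structures), ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return words_path, structures_path

with tempfile.TemporaryDirectory() as tmp:
    words_file, structures_file = store_results(
        {"3天前", "两周余"}, {"NUM+TIME+COLLOC"}, tmp
    )
    reloaded_words = json.loads(words_file.read_text(encoding="utf-8"))
    reloaded_structures = json.loads(structures_file.read_text(encoding="utf-8"))
print(reloaded_words, reloaded_structures)
```

Keeping the two sets in separate files lets a later run reload the structure set as its seed without re-reading every extracted word.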

Example two.

Referring to FIG. 3, which is a flowchart of a method for extracting time compound words from medical texts, the specific steps are as follows.

Step 1: acquire and store medical texts, and manually summarize some time compound word structures.

Step 3: remove impurities from the new time compound words.

Step 4: merge the cleaned time compound words into the existing time compound word set.

Step 5: extract new time compound word structures based on the updated time compound word set.

Step 6: remove impurities from the new time compound word structures and verify them.

Step 7: merge the verified time compound word structures into the existing time compound word structure set.

Step 8: acquire time compound words from the medical text using the updated time compound word structures.

Step 9: remove impurities from the new time compound words.

Step 10: merge the cleaned time compound words into the existing time compound word set.

Step 11: store the acquired time compound words and time compound word structures.

The method trains over multiple iterations while updating the threshold parameter settings, so as to obtain an optimal model.
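A minimal sketch of threshold-based impurity removal follows. The text does not define the thresholds, so the minimum frequency `min_count`, the maximum length `max_len`, the stop-word list and the function name `remove_impurities` are all assumptions; the stop-word check is also simplified to a substring test instead of operating on segmented tokens.

```python
from collections import Counter
from typing import Iterable

STOP_WORDS = frozenset({"的", "了", "和"})  # hypothetical stop-word list

def remove_impurities(
    candidates: Iterable[str],
    min_count: int = 2,      # assumed frequency threshold
    max_len: int = 8,        # assumed length threshold, in characters
    stop_words: frozenset[str] = STOP_WORDS,
) -> set[str]:
    """Keep candidates that are frequent, short enough, and free of stop words."""
    counts = Counter(candidates)
    kept = set()
    for word, count in counts.items():
        if count < min_count:
            continue                      # too rare to trust
        if len(word) > max_len:
            continue                      # suspiciously long span
        if any(stop in word for stop in stop_words):
            continue                      # contains a stop word
        kept.add(word)
    return kept

candidates = ["3天前", "3天前", "2周后", "1年的", "1年的"]
print(remove_impurities(candidates))
```

Updating the thresholds between iterations, as described above, would amount to calling the function with different `min_count` and `max_len` values in each round.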

In the method of the second embodiment for extracting time compound words from medical texts, the entities to be extracted are obtained by applying the word formation pattern "numerical words + time words + time collocation words", which improves the accuracy and precision of entity extraction and effectively solves the problem that the length of time compound words is difficult to determine.
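The word formation pattern can be written down directly as a regular expression, as in the sketch below. The specific numerals, time words and collocation words listed are illustrative assumptions, not the vocabulary used by the invention.

```python
import re

# Each component of "numerical word + time word + time collocation word".
# The word lists are small illustrative samples (assumptions).
NUMERAL = r"(?:\d+(?:\.\d+)?|[一二两三四五六七八九十百半数]+)"
TIME_WORD = r"(?:小时|分钟|天|日|周|月|年)"
COLLOCATION = r"(?:以来|前|后|余|内|来)"  # longer alternatives listed first

PATTERN = re.compile(NUMERAL + TIME_WORD + COLLOCATION)

def find_time_compounds(sentence: str) -> list[str]:
    """Return every span matching numeral + time word + collocation word."""
    return [match.group() for match in PATTERN.finditer(sentence)]

print(find_time_compounds("患者3天前出现发热，咳嗽两周余，术后1个月复查。"))
```

Because the three components are matched as one unit, the span boundaries are fixed by the pattern itself; "1个月" is not matched here because the measure word "个" is not part of this seed structure, which is the kind of gap that newly learned structures are meant to close.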

Example three.

The third embodiment of the invention provides a time compound word extraction device, which mainly includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the time compound word extraction method when executing the computer program.

The time compound word extraction device of the third embodiment includes an acquirer, a processor, a memory, and a computer program stored in the memory and executable on the processor, for example a time compound word extraction program. When executing the computer program, the processor implements the steps of the time compound word extraction method of the second embodiment, such as the steps shown in FIG. 2. Alternatively, when executing the computer program, the processor implements the functions of the modules or units in the device embodiments above, namely the text input unit, the time compound word extraction unit, the time compound word update unit, the new time compound word structure extraction unit and the storage unit.

The above description is only a preferred example of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for obtaining temporal compound words in medical text, characterized by comprising:

Step S100: obtaining medical text and manually summarizing partial temporal compound word structures;

Step S200: extracting temporal compound words, wherein temporal compound words are obtained from the medical text by means of the temporal compound word structures;

Step S300: removing impurities from the temporal compound words, and incorporating the cleaned temporal compound words into the existing temporal compound word set;

Step S400: extracting temporal compound word structures, wherein new temporal compound word structures are extracted based on the updated temporal compound word set;

Step S500: removing impurities from the temporal compound word structures, verifying them, and incorporating them into the existing temporal compound word structure set;

Step S600: based on the new temporal compound word structures, repeating step S200 until no new temporal compound word structure appears.

2. The method for extracting temporal compound words in medical text according to claim 1, characterized in that the impurity removal of the temporal compound words comprises performing word segmentation and stop-word filtering with the HanLP word segmentation tool, and then filtering with specific threshold screening conditions.

3. The method for extracting temporal compound words from medical text according to claim 1, wherein obtaining the medical text comprises obtaining the Chinese sentences in the unstructured text by means of regular expressions.

4. The method for extracting temporal compound words in medical text according to claim 1, characterized in that, when training the temporal compound word extraction model, multiple rounds of training are carried out iteratively, threshold parameter settings are introduced at the same time, and the optimal model is finally obtained after the parameters of the temporal compound word extraction model are tuned.

5. A device for extracting temporal compound words in medical text, comprising an acquirer, a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor carries out the steps of the method according to any one of claims 1-5 when executing the computer program.

6. A system for extracting temporal compound words in medical text, characterized in that the system comprises:

a medical text library, used to store unstructured medical text and the various entity sets;

a text input unit, used by the system to read unstructured medical text;

a time compound word extraction unit, used to extract time compound word entities by means of the time compound word structures;

a time compound word update unit, used to update the existing time compound word set;

a new time compound word structure extraction unit, used to extract new time compound word structures by segmenting the time compound words and then analyzing their parts of speech;

a storage unit, used for structured storage of the results, storing the extracted time compound words and time compound word structures in corresponding files;

a display unit, used to display the results of temporal compound word extraction.