Method, system and device for acquiring time compound words in medical text based on lexical word formation method
Technical Field
The invention relates to the technical field of information extraction in natural language processing, and in particular to a method, a device and a system for extracting time compound words from medical texts.
Background
In recent years, a large number of medical texts have accumulated on the internet. These texts mainly comprise professional textbooks, professional medical websites, medical classics, electronic medical records, and papers in medical research journals. They contain abundant medical data, mainly information on the onset time, treatment time, etiology, symptoms, treatment and diagnosis of diseases. However, most of this data exists in semi-structured or unstructured form, and current natural language processing and information extraction technology is not mature enough to extract complete and accurate information from unstructured text. Existing companies and products have not been able to extract disease-related time compound words accurately. The invention analyzes the structure of the time compound words commonly found in medical text, formalizes that structure mathematically, and designs an iterative algorithm and program that acquires accurate time compound words from medical text iteratively.
With the continuous development of computing, text mining systems have been implemented. For example, patent application No. 201910701406.5 discloses a text mining method and system based on unstructured electronic medical records, comprising a text preprocessing module, a feature engineering module, and an analysis and prediction module. The main features it extracts include symptoms, examination findings, radiotherapy and chemotherapy regimens, and efficacy evaluation. That patent segments hospitalization records by time nodes, extracts features through rule-base disease information extraction, and finally clusters the texts by unsupervised clustering. Because the records are segmented by time nodes, the accuracy with which those time nodes are acquired is limited, and the complete semantics of the sentences are not taken into account. In addition, the input text comprises only the medical history records in a hospital database, so the range of data sources is narrow.
Entity recognition in the medical field faces many difficulties, mainly in the following aspects.
From the perspective of the extraction process:
The medical field typically contains a rich variety of entity categories.
The context of an entity contains many different modifiers and qualifiers, which make the boundaries of the entity harder to determine and delimit.
The entities to be extracted can usually be described in several different ways.
The length of a time compound word entity is often difficult to determine.
From the perspective of the extraction results:
Few time compound words are extracted: only thousands to a little over ten thousand, short of the scale of tens of thousands to hundreds of thousands. The medical texts involved number only a few thousand, short of the scale of tens of thousands.
Disclosure of Invention
The invention aims to provide a method, a device and a system for acquiring time compound words in medical text, so as to solve the problems set forth in the background above. Taking medical text as its starting point, the invention extracts the time compound word entities it contains.
To achieve the above object, the present invention provides a method for extracting time compound words, which mainly comprises the following steps.
Step 1: acquire medical texts and manually summarize an initial set of time compound word structures.
Step 2: acquire time compound words from the medical texts by matching the time compound word structures.
Step 3: remove impurities from the acquired time compound words and merge them into the existing time compound word set.
Step 4: extract new time compound word structures based on the updated time compound word set.
Step 5: remove impurities from the new time compound word structures, verify them, and merge them into the existing time compound word structure set.
Step 6: return to step 2 with the new time compound word structures, and repeat until no new time compound word structure is found.
Step 7: finally, remove impurities from the time compound words.
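The loop formed by steps 2 to 6 can be illustrated as a fixed-point iteration. The following is a minimal sketch under stated assumptions, not the implementation of the invention: structures are modelled as regular expressions, and the helper names (`extract_words`, `induce_structures`, `remove_impurities`, `bootstrap`), the `NUMERAL` character class, and the length-based impurity rule are all hypothetical choices made for illustration.

```python
import re

# Assumed numeral class: Arabic digits plus a few Chinese numerals.
NUMERAL = "[0-9一二两三四五六七八九十半]+"

def extract_words(texts, structures):
    """Step 2: collect every substring matched by any known structure."""
    found = set()
    for text in texts:
        for pattern in structures:
            found.update(m.group(0) for m in re.finditer(pattern, text))
    return found

def remove_impurities(words, max_len=6):
    """Steps 3 and 7 (assumed rule): drop empty or overlong candidates."""
    return {w for w in words if 0 < len(w) <= max_len}

def induce_structures(words):
    """Steps 4-5 (assumed rule): generalize the leading numeral of each word."""
    structures = set()
    for word in words:
        match = re.match(NUMERAL, word)
        if match and match.end() < len(word):
            structures.add(NUMERAL + re.escape(word[match.end():]))
    return structures

def bootstrap(texts, seed_structures, seed_words=()):
    """Alternate word extraction and structure induction until nothing is new."""
    structures, words = set(seed_structures), set(seed_words)
    while True:
        words |= remove_impurities(extract_words(texts, structures))  # steps 2-3
        new_structures = induce_structures(words) - structures        # steps 4-5
        if not new_structures:                                        # step 6
            return remove_impurities(words), structures               # step 7
        structures |= new_structures
```

Because the word set and the structure set only grow and are bounded by the finite text, the loop terminates once a pass yields no new structure.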
Preferably, the medical texts obtained in step 1 are various unstructured medical texts, such as hospital cases, professional textbooks, professional medical websites, medical classics, electronic medical records, and papers in medical research journals.
Preferably, a regular expression is applied to the medical text that has been read in order to select the Chinese sentences it contains.
Preferably, each semantic element in the present invention is learned iteratively; that is, time compound words and time compound word structures are learned from each other in alternating iterations.
Preferably, the invention ensures the accuracy of each semantic element as it is learned, which in turn improves the accuracy of the next round of extraction.
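As an illustration of the regular-expression filtering described above, the following minimal sketch splits text into sentences and keeps those that are mostly Chinese. The sentence delimiters, the CJK range `\u4e00-\u9fff`, the `min_ratio` threshold, and the function name `chinese_sentences` are assumptions made for this example; the source does not specify the expression it uses.

```python
import re

# Assumed definitions: a Han character is in the basic CJK block, and a
# sentence ends at a Chinese or ASCII full stop-like mark.
HAN = re.compile(r"[\u4e00-\u9fff]")
SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")

def chinese_sentences(text, min_ratio=0.5):
    """Return the sentences whose share of Han characters is at least min_ratio."""
    kept = []
    for sentence in SENTENCE.findall(text):
        sentence = sentence.strip()
        if not sentence:
            continue
        han_count = len(HAN.findall(sentence))
        if han_count / len(sentence) >= min_ratio:
            kept.append(sentence)
    return kept
```

A ratio test is used rather than requiring every character to be Chinese, so that sentences containing digits (which time compound words often include) are retained.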
Corresponding to the method, the invention also provides a time compound word extraction system, which comprises the following units.
A text input unit, through which the system reads unstructured medical text.
A time compound word extraction unit, which extracts medical time compound word entities by matching time compound word structures.
A time compound word updating unit, which updates the existing time compound word set.
A new time compound word structure extraction unit, which extracts new time compound word structures by segmenting the time compound words and analyzing their parts of speech.
A storage unit, which stores the results in structured form, writing the extracted time compound words and time compound word structures to corresponding files.
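The new time compound word structure extraction unit listed above derives a structure from the segmentation and part-of-speech tags of a word. A minimal sketch of that idea follows. It assumes a hypothetical hand-written `LEXICON` with tags `m` (numeral), `q` (time unit) and `f` (direction word) and uses forward maximum matching as a stand-in segmenter; a real system would presumably rely on a full Chinese segmenter and tagger, which the source does not name.

```python
# Hypothetical mini-lexicon: token -> part-of-speech tag.
LEXICON = {
    "三": "m", "两": "m", "半": "m", "十": "m",
    "天": "q", "周": "q", "年": "q", "小时": "q", "个月": "q",
    "前": "f", "后": "f", "内": "f",
}
MAX_TOKEN_LEN = max(len(token) for token in LEXICON)

def segment(word):
    """Forward maximum matching; digit runs are tagged 'm', unknown chars 'x'."""
    tokens, i = [], 0
    while i < len(word):
        if word[i].isdigit():
            j = i
            while j < len(word) and word[j].isdigit():
                j += 1
            tokens.append((word[i:j], "m"))
            i = j
            continue
        for size in range(min(MAX_TOKEN_LEN, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if piece in LEXICON:
                tokens.append((piece, LEXICON[piece]))
                i += size
                break
        else:  # no lexicon entry starts here
            tokens.append((word[i], "x"))
            i += 1
    return tokens

def structure_of(word):
    """Represent a word's structure as its tag sequence, e.g. 'm+q+f'."""
    return "+".join(tag for _, tag in segment(word))

def new_structures(words, known_structures):
    """Structures seen in the word set that are fully tagged and not yet known."""
    candidates = {structure_of(w) for w in words}
    clean = {s for s in candidates if "x" not in s.split("+")}
    return clean - set(known_structures)
```

Discarding any candidate that contains an untagged token is one simple form of structure impurity removal; it is an assumption of this sketch.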
Corresponding to the system, an embodiment of the invention provides a time compound word extraction device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the system for extracting time compound words from medical text is implemented.
An embodiment of the invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the system for extracting time compound words from medical text.
Compared with the prior art, the invention has the following advantages and beneficial effects.
(1) The invention provides a method, a device and a system for extracting time compound words from medical texts, in which the processor can extract time compound words accurately because each time compound word structure constrains what may be matched. It also handles well the problem that the entity length of time compound words in the dependency field cannot be processed.
(2) With the invention, the number of extracted time compound words reaches the order of tens of thousands, and precision and accuracy are greatly improved.
Drawings
FIG. 1 is a block diagram of the system of the present invention.
FIG. 2 is a flow chart of the time compound word extraction method of the present invention.
FIG. 3 is a schematic flow chart of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings.
The invention provides a method for extracting time compound words from medical texts. Its main idea is as follows: time compound words are identified through the time compound word structures provided by the invention, and the semantic elements in each structure strictly constrain the time compound word entity, which makes the extraction possible. Newly added semantic elements are then used for further learning and extraction, which expands the semantic element library. The method can be applied to various unstructured medical texts, such as professional textbooks, professional medical websites, medical classics, electronic medical records, and papers in medical research journals. It handles the large variation in entity length among time compound words well, and plays an important role in research in related vertical fields.
Embodiment one.
Referring to FIG. 1, which is a block diagram of a system for extracting time compound words from medical texts, the system includes the following units.
A text input unit, through which the system reads unstructured medical text.
A time compound word extraction unit, which extracts time compound word entities by matching time compound word structures.
A time compound word updating unit, which updates the existing time compound word set.
A new time compound word structure extraction unit, which extracts new time compound word structures by segmenting the time compound words and analyzing their parts of speech.
A storage unit, which stores the results in structured form, writing the extracted time compound words and time compound word structures to corresponding files.
Unstructured medical text is first input to the system through the text input unit in the reading module. Then, in the calculation module, the time compound word structures are combined and the corresponding entity words are extracted through the time compound word extraction unit, the time compound word updating unit, and the new time compound word structure extraction unit. Finally, the extracted entities are stored in structured form through the storage unit in the display module.
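The data flow through these units can be sketched as a single pass from input file to stored results. This is an illustrative sketch only: the function names, the JSON file names `words.json` and `structures.json`, the use of regular expressions as structures, and the pluggable `induce` callable standing in for the new structure extraction unit are all assumptions, not details given in the source.

```python
import json
import re
from pathlib import Path

def read_text(path):
    """Text input unit: read unstructured medical text, one passage per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()

def extract_words(lines, structures):
    """Extraction unit: collect substrings matched by any structure (a regex)."""
    return {m.group(0) for line in lines for s in structures
            for m in re.finditer(s, line)}

def update_words(word_set, new_words):
    """Updating unit: merge newly extracted words into the existing set."""
    return word_set | new_words

def store(words, structures, out_dir):
    """Storage unit: write words and structures to separate JSON files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, items in (("words.json", words), ("structures.json", structures)):
        (out / name).write_text(
            json.dumps(sorted(items), ensure_ascii=False), encoding="utf-8")

def run_once(input_path, out_dir, structures, known_words=frozenset(),
             induce=lambda words: set()):
    """One pass through the units; `induce` is the new-structure unit."""
    lines = read_text(input_path)
    words = update_words(set(known_words), extract_words(lines, structures))
    structures = set(structures) | induce(words)
    store(words, structures, out_dir)
    return words, structures
```

Keeping the new-structure unit as a parameter lets the same pass be repeated with whatever induction rule is in use, until no new structure appears.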
Embodiment two.
Referring to FIG. 3, which is a flowchart of a method for extracting time compound words from medical texts, the specific steps are as follows.
Step 1: acquire and store medical texts, and manually summarize an initial set of time compound word structures.
Step 2: acquire time compound words from the medical texts by matching the time compound word structures.
Step 3: remove impurities from the new time compound words.
Step 4: merge the cleaned time compound words into the existing time compound word set.
Step 5: extract new time compound word structures based on the updated time compound word set.
Step 6: remove impurities from the new time compound word structures and verify them.
Step 7: merge the verified time compound word structures into the existing time compound word structure set.
Step 8: acquire time compound words from the medical texts by matching the updated time compound word structures.
Step 9: remove impurities from the new time compound words.
Step 10: merge the cleaned time compound words into the existing time compound word set.
Step 11: store the acquired time compound words and time compound word structures.
The method trains over multiple rounds in an iterative manner, updating the threshold parameter settings along the way to obtain an optimal model.
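The source does not say which thresholds are tuned. As one possible reading, the sketch below treats impurity removal as a minimum-frequency filter and selects the threshold that maximizes F1 against a small hand-labelled set. The frequency criterion, the F1 objective, and the names `remove_impurities`, `f1` and `best_threshold` are assumptions made for illustration.

```python
from collections import Counter

def remove_impurities(candidates, min_count):
    """Keep candidates that occur at least min_count times (assumed criterion)."""
    counts = Counter(candidates)
    return {word for word, n in counts.items() if n >= min_count}

def f1(predicted, gold):
    """Harmonic mean of precision and recall over two sets."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def best_threshold(candidates, gold, thresholds=(1, 2, 3)):
    """Pick the threshold whose filtered output scores highest against gold."""
    return max(thresholds,
               key=lambda t: f1(remove_impurities(candidates, t), gold))
```

In an iterative setting the threshold could be re-selected after each round, since the candidate counts change as new structures are added.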
In the method of embodiment two for extracting time compound words from medical texts, the entities to be extracted are obtained by applying the word formation rule "numeral + time collocation word". This improves the accuracy and precision of entity extraction and effectively addresses the problem that the length of a time compound word is difficult to determine.
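The word formation rule "numeral + time collocation word" can be expressed as a single pattern. The sketch below is an assumption-laden example: the numeral forms, the list of time units, and the optional direction suffix in `NUMERAL`, `UNIT` and `PATTERN` are illustrative choices and are not an inventory taken from the source.

```python
import re

# Assumed numeral forms: Arabic numbers (optionally decimal) or Chinese numerals.
NUMERAL = r"(?:\d+(?:\.\d+)?|[一二两三四五六七八九十百半几数]+)"
# Assumed time collocation words; longer alternatives are listed first.
UNIT = r"(?:个?(?:小时|星期|月)|分钟|秒|天|日|周|年)"
# Optional direction word after the unit, e.g. 前 (ago), 后 (later), 内 (within).
PATTERN = re.compile(NUMERAL + UNIT + r"[前后内]?")

def time_compounds(text):
    """Return every 'numeral + time collocation word' match in reading order."""
    return [m.group(0) for m in PATTERN.finditer(text)]
```

Because the pattern fixes where a match starts (the numeral) and where it may end (the unit or its suffix), matches of different lengths such as 3天前 and 1.5个小时 are delimited by the same rule.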
Embodiment three.
Embodiment three of the present invention provides a time compound word extraction device, which mainly includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the time compound word extraction method when executing the computer program.
The time compound word extraction device of embodiment three includes an acquirer, a processor, a memory, and a computer program stored in the memory and executable on the processor, for example a time compound word extraction program. When executing the computer program, the processor implements the steps of the time compound word extraction method of embodiment two, such as the steps of the method shown in FIG. 2. Alternatively, when executing the computer program, the processor implements the functions of the modules or units in the above device examples, namely the text input unit, the time compound word extraction unit, the time compound word updating unit, the new time compound word structure extraction unit, and the storage unit.
The above description is only a preferred example of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.