
CN113628757A - A method, system and device for obtaining temporal compound words in medical text based on lexical word formation - Google Patents

Publication number
CN113628757A
Authority
CN
China
Prior art keywords
compound word
temporal
compound
time
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110970693.7A
Other languages
Chinese (zh)
Inventor
卢旭召
李军
周鹏程
冯洪海
魏亚举
侯瑞辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University
Priority to CN202110970693.7A
Publication of CN113628757A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Primary Health Care (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Epidemiology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, device and system for extracting time compound words from medical texts, and relates to the technical field of natural language processing and information extraction. The invention comprises a reading module, a calculation module and a display module. In the reading module, the system reads the medical text. The calculation module comprises a time compound word extraction unit, a time compound word updating unit and a new time compound word structure extraction unit. Time compound words are extracted mainly according to the word-formation pattern "numeral + time word + time collocation word", and some time compound word structures must be summarized manually in advance. The steps are as follows: a. read the medical text; b. extract time compound words using the time compound word structures; c. derive time compound word structures from the time compound words. If the result contains more entities than the existing set, iteration continues; otherwise it ends. The display module comprises a storage unit and an output unit. Starting from medical text, the invention ultimately achieves accurate extraction of the time compound words it contains.

Description

Method, system and device for acquiring time compound words in medical text based on lexical word formation method
Technical Field
The invention relates to the technical field of natural language processing information extraction, in particular to a method, a device and a system for extracting time compound words from medical texts.
Background
In recent years, a large volume of medical text has accumulated on the internet. It mainly comprises treatises in professional textbooks, professional medical websites, medical classics, electronic medical records and medical research periodicals. These texts contain abundant medical data, chiefly the onset time, treatment time, etiology, symptoms, treatment and diagnosis of diseases. However, most of this data exists in semi-structured or unstructured form, and current natural language processing and information extraction technology is not mature enough to extract complete and accurate information from unstructured text. Existing companies and products have not been able to extract disease time compound words accurately. The invention analyzes the structure of the time compound words that commonly occur in medical text, expresses that structure mathematically, and designs an iterative algorithm and program that can iteratively acquire accurate time compound words from medical text.
With the continuous development of computers, text mining systems have been implemented. For example, patent application No. 201910701406.5 discloses a text mining method and system based on unstructured electronic medical records, which includes a text preprocessing module, a feature engineering module and an analysis and prediction module. The main features it extracts include symptoms, examination findings, radiotherapy and chemotherapy regimens, and efficacy evaluation. That patent uses time nodes to segment hospitalization records, extracts features through rule-base disease information extraction, and finally clusters the texts by unsupervised clustering. Because it segments by time node, its results depend on how accurately the time nodes are acquired, and it does not take the complete semantics of sentences into account. Its input text comprises only medical history records in a hospital database, so the range of data sources is narrow.
Recognition tasks in the medical field face many difficulties, mainly in the following respects.
From the point of view of the extraction process:
The medical field typically contains a rich variety of entity categories.
An entity's context often contains many different modifiers and qualifiers, which make the boundaries of the entity harder to determine and delimit.
The entities to be extracted can usually be described in several different ways.
The length of a time compound word entity is often difficult to determine.
From the point of view of the extraction result:
Few time compound words are extracted: only thousands to a little over ten thousand, short of the scale of tens of thousands to hundreds of thousands. The medical texts involved number only a few thousand, short of the scale of tens of thousands.
Disclosure of Invention
The invention aims to provide a method, a device and a system for acquiring time compound words in medical text, so as to solve the problems set forth in the Background above. Taking medical text as the starting point, the invention extracts the time compound word entities it contains.
To achieve this object, the invention provides a method for extracting time compound words, which mainly comprises the following steps.
Step 1: Acquire medical texts and manually summarize some time compound word structures.
Step 2: Based on the time compound word structures, extract time compound words from the medical texts.
Step 3: Remove impurities from the time compound words and merge them into the existing time compound word set.
Step 4: Extract new time compound word structures based on the updated time compound word set.
Step 5: Remove impurities from the new time compound word structures, verify them, and merge them into the existing time compound word structure set.
Step 6: Repeat Step 2 with the new time compound word structures until no new time compound word structure is found.
Step 7: Finally, remove impurities from the time compound words.
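The iteration in Steps 2 to 6 can be sketched as a simple bootstrapping loop. This is a minimal illustration rather than the patented implementation: the function name `bootstrap`, the two callables `extract_words` and `derive_structures`, and the `max_rounds` safety limit are all assumptions introduced here, and impurity removal and verification are assumed to happen inside the callables.

```python
from typing import Callable, Iterable, Set, Tuple


def bootstrap(
    texts: Iterable[str],
    seed_structures: Set[str],
    extract_words: Callable[[str, Set[str]], Set[str]],
    derive_structures: Callable[[Set[str]], Set[str]],
    max_rounds: int = 20,  # hypothetical safety limit, not in the source
) -> Tuple[Set[str], Set[str]]:
    """Alternate between word extraction and structure discovery."""
    corpus = list(texts)
    structures = set(seed_structures)   # Step 1: manually summarized seeds
    words: Set[str] = set()
    pending = set(seed_structures)      # structures not yet applied to the corpus

    for _ in range(max_rounds):
        if not pending:                 # Step 6: stop when nothing new was found
            break
        for text in corpus:             # Steps 2-3: extract and merge words
            words |= extract_words(text, pending)
        new_structures = derive_structures(words) - structures  # Step 4
        structures |= new_structures    # Step 5: merge new structures
        pending = new_structures
    return words, structures
```

Only the newly found structures are applied in each round, since earlier structures have already been run over the whole corpus.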
Preferably, the medical texts obtained in Step 1 are various unstructured medical texts, such as hospital case records, treatises in professional textbooks, professional medical websites, medical classics, electronic medical records and medical research periodicals.
Preferably, a regular expression is applied to the medical text that has been read in order to extract the Chinese sentences it contains.
Preferably, each semantic element is learned iteratively; that is, time compound words and time compound word structures are learned in alternation.
Preferably, the accuracy of each semantic element is ensured as it is learned, which further improves the accuracy of the next round of extraction.
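As an illustration of the regular-expression step mentioned above, the following sketch splits raw text into sentences and keeps those that are mostly Chinese. The source does not give the actual expression, so the patterns, the function name `chinese_sentences` and the `min_cjk_ratio` cut-off are assumptions.

```python
import re
from typing import List

# A sentence: a run of characters ending at a Chinese terminator or a line break.
_SENTENCE = re.compile(r"[^。！？；\n]+[。！？；]?")
# CJK Unified Ideographs (basic block only; an assumption for this sketch).
_CJK = re.compile(r"[\u4e00-\u9fff]")


def chinese_sentences(text: str, min_cjk_ratio: float = 0.5) -> List[str]:
    """Return the sentences whose share of Chinese characters is high enough."""
    kept = []
    for raw in _SENTENCE.findall(text):
        sentence = raw.strip()
        if not sentence:
            continue
        cjk_count = len(_CJK.findall(sentence))
        if cjk_count / len(sentence) >= min_cjk_ratio:
            kept.append(sentence)
    return kept
```

Digits and punctuation count against the ratio, so a threshold well below 1.0 is needed for sentences such as "咳嗽2周，加重1天。".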
Corresponding to the method, the invention also provides a time compound word extraction system, which comprises the following units.
A text input unit, through which the system reads unstructured medical text.
A time compound word extraction unit, which extracts medical time compound word entities by means of the time compound word structures.
A time compound word updating unit, which updates the existing time compound word set.
A new time compound word structure extraction unit, which extracts new time compound word structures by segmenting the time compound words and analyzing their parts of speech.
A storage unit, which stores results in structured form, saving the extracted time compound words and time compound word structures to corresponding files.
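To make the new time compound word structure extraction unit more concrete, the sketch below segments a time compound word and maps each piece to a part-of-speech tag, yielding a structure string. The claims name the HanLP tool for segmentation; this sketch deliberately substitutes a tiny hand-written lexicon and greedy longest-match segmentation, so `LEXICON`, `NUMERALS`, the tag letters M/T/C/X and both function names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical mini-lexicon: T = time word, C = time collocation word.
LEXICON = {
    "小时": "T", "分钟": "T", "年": "T", "月": "T", "周": "T", "天": "T", "日": "T",
    "前": "C", "后": "C", "余": "C", "内": "C", "来": "C",
}
# Characters treated as numerals (tag M); also an assumption.
NUMERALS = set("0123456789一二三四五六七八九十百半两数")


def tag(word: str) -> List[Tuple[str, str]]:
    """Greedy longest-match segmentation with a tag for every piece."""
    tokens = []
    i = 0
    while i < len(word):
        if word[i] in NUMERALS:
            j = i
            while j < len(word) and word[j] in NUMERALS:
                j += 1
            tokens.append((word[i:j], "M"))
            i = j
            continue
        for length in (2, 1):           # try the longer lexicon entry first
            piece = word[i:i + length]
            if piece in LEXICON:
                tokens.append((piece, LEXICON[piece]))
                i += len(piece)
                break
        else:
            tokens.append((word[i], "X"))  # unknown character
            i += 1
    return tokens


def structure_of(word: str) -> str:
    """Join the tags of a word into a structure string such as 'M+T+C'."""
    return "+".join(t for _, t in tag(word))
```

With this lexicon, a word such as "十余天" yields the structure "M+C+T", which differs from the seed pattern and would therefore count as a new structure candidate.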
Corresponding to the system, the embodiment of the invention provides a time compound word extraction device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein when the processor executes the program, the system for extracting the time compound words from the medical text is realized.
Embodiments of the present invention provide a computer readable storage medium, which may store a computer program that, when executed by a processor, implements a system for extracting temporal compounds in medical text.
Compared with the prior art, the invention has the following advantages and beneficial effects.
(1) The invention provides a method, a device and a system for extracting time compound words from medical texts, in which the processor extracts time compound words accurately by applying the constraints of different time compound word structures. It also handles well the problem, found in the dependency field, that the entity length of time compound words cannot be processed.
(2) The invention brings the number of extracted time compound words to the order of tens of thousands and greatly improves precision and accuracy.
Drawings
FIG. 1 is a block diagram of the system of the present invention.
FIG. 2 is a flow chart of a temporal compound extraction method of the present invention.
FIG. 3 is a schematic flow chart of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings.
The invention provides a method for extracting time compound words from medical texts. Time compound words are identified through the time compound word structures provided by the invention; the semantic elements in each structure strictly constrain the time compound word entity, which makes extraction possible. Newly added semantic elements are then used for further learning and extraction, expanding the semantic element library. The method can be applied to various unstructured medical texts, such as treatises in professional textbooks, professional medical websites, medical classics, electronic medical records and medical research periodicals; it handles the large variation in the entity length of time compound words well; and it plays a vital role in research in related vertical fields.
Embodiment one.
Fig. 1 is a block diagram of a system for extracting time compound words from medical texts, which includes the following components.
A text input unit, through which the system reads unstructured medical text.
A time compound word extraction unit, which extracts time compound word entities by means of the time compound word structures.
A time compound word updating unit, which updates the existing time compound word set.
A new time compound word structure extraction unit, which extracts new time compound word structures by segmenting the time compound words and analyzing their parts of speech.
A storage unit, which stores results in structured form, saving the extracted time compound words and time compound word structures to corresponding files.
Unstructured medical text is first input to the system through the text input unit in the reading module. In the calculation module, the time compound word structures are then applied, and the corresponding entity words are extracted through the time compound word extraction unit, the time compound word updating unit and the new time compound word structure extraction unit. Finally, the extracted entities are stored in structured form by the storage unit in the display module.
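The storage unit's role of saving the two result sets to corresponding files could be sketched as follows. The source does not specify a file format, so JSON, the file names and the function name `store_results` are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Iterable, Tuple


def store_results(
    words: Iterable[str],
    structures: Iterable[str],
    out_dir: str,
) -> Tuple[Path, Path]:
    """Write the extracted words and structures to two JSON files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical file names; the source only says "corresponding files".
    words_path = out / "time_compound_words.json"
    structures_path = out / "time_compound_structures.json"
    # Sort for stable output; ensure_ascii=False keeps Chinese text readable.
    words_path.write_text(
        json.dumps(sorted(set(words)), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    structures_path.write_text(
        json.dumps(sorted(set(structures)), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return words_path, structures_path
```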
Embodiment two.
Fig. 3 is a flowchart of a method for extracting time compound words from medical texts; the specific steps are as follows.
Step 1: Acquire and store medical texts, and manually summarize some time compound word structures.
Step 2: Based on the time compound word structures, extract time compound words from the medical texts.
Step 3: Remove impurities from the new time compound words.
Step 4: Merge the cleaned time compound words into the existing time compound word set.
Step 5: Extract new time compound word structures based on the updated time compound word set.
Step 6: Remove impurities from the new time compound word structures and verify them.
Step 7: Merge the verified time compound word structures into the existing time compound word structure set.
Step 8: Using the updated time compound word structures, extract time compound words from the medical texts.
Step 9: Remove impurities from the new time compound words.
Step 10: Merge the cleaned time compound words into the existing time compound word set.
Step 11: Store the acquired time compound words and time compound word structures.
The method trains over multiple iterations while updating the threshold parameter settings to obtain an optimal model.
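The impurity-removal steps above, which claim 2 describes as word segmentation, stop-word filtering and a threshold condition, might look like the sketch below. It is only an approximation: a substring check stands in for real segmentation, the stop-word list is invented, and the threshold is assumed to be a minimum occurrence count (`min_count`), which the source does not confirm.

```python
from collections import Counter
from typing import Iterable, Set

# Hypothetical stop words; a real list would come from a stop-word resource.
STOP_WORDS = frozenset({"的", "了", "和"})


def remove_impurities(
    candidates: Iterable[str],
    stop_words: Iterable[str] = STOP_WORDS,
    min_count: int = 2,  # assumed meaning of the "threshold parameter"
) -> Set[str]:
    """Keep candidates that are frequent enough and contain no stop word."""
    stops = set(stop_words)
    counts = Counter(candidates)
    kept = set()
    for word, count in counts.items():
        if count < min_count:
            continue                    # too rare: treated as noise
        if any(stop in word for stop in stops):
            continue                    # contains a stop word
        kept.add(word)
    return kept
```

Raising `min_count` trades recall for precision, which is one way the threshold could be tuned between iterations.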
In the method of the second embodiment, the entities to be extracted are obtained by applying the word-formation pattern "number words + time collocation words", which improves the accuracy and precision of entity extraction and effectively addresses the difficulty of determining the length of time compound words.
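A word-formation pattern of this kind can be turned into a matcher by giving each slot a character class. The sketch below uses the three-slot form stated in the abstract (numeral + time word + time collocation word) and compiles a structure string into a regular expression. The tag letters and the contents of `CLASSES` are assumptions chosen for the example, not the lexicons used by the invention.

```python
import re
from typing import List

# Hypothetical slot definitions: M = numeral, T = time word, C = collocation word.
CLASSES = {
    "M": r"(?:\d+(?:\.\d+)?|[一二三四五六七八九十百半两数]+)",
    "T": r"(?:小时|分钟|年|月|周|天|日)",
    "C": r"(?:前|后|余|内|来)",
}


def compile_structure(structure: str) -> re.Pattern:
    """Turn a structure such as 'M+T+C' into a compiled regular expression."""
    return re.compile("".join(CLASSES[tag] for tag in structure.split("+")))


def extract(sentence: str, structure: str) -> List[str]:
    """Return every substring of the sentence that fits the structure."""
    return [m.group(0) for m in compile_structure(structure).finditer(sentence)]
```

Because the pattern fixes where a match starts and ends, the length of each extracted entity follows from the structure itself rather than from a separate boundary decision.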
Embodiment three.
The third embodiment of the present invention provides a time compound word extraction device, which mainly includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the steps of the time compound word extraction method when executing the computer program.
The time compound word extraction device of the third embodiment of the present invention includes: an acquirer, a processor, a memory, and a computer program stored in and executable on the memory, such as: a temporal compound extraction program. The processor, when executing the computer program, implements the steps of the second embodiment of the time compound word extraction method, such as the steps of the time compound word extraction method shown in fig. 2. Or the processor, when executing the computer program, implements the functions of the modules or units in the above device examples, such as: the device comprises a text input unit, a time compound word extraction unit, a time compound word updating unit, a new time compound word structure extraction unit and a storage unit.
The above description is only a preferred example of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A method for obtaining time compound words in medical text, characterized by comprising: Step S100: obtaining medical text and manually summarizing some time compound word structures; Step S200: extracting time compound words, namely, based on the time compound word structures, obtaining time compound words from the medical text; Step S300: removing impurities from the time compound words and merging the cleaned time compound words into the existing time compound word set; Step S400: extracting time compound word structures, namely, extracting new time compound word structures based on the updated time compound word set; Step S500: removing impurities from the time compound word structures, verifying them, and merging them into the existing time compound word structure set; Step S600: based on the new time compound word structures, repeating Step S200 until there is no new time compound word structure.
2. The method for extracting time compound words from medical text according to claim 1, characterized in that the impurity removal of the time compound words performs word segmentation and stop-word filtering with the Hanlp word segmentation tool and then applies a specific threshold screening condition.
3. The method for extracting time compound words from medical text according to claim 1, characterized in that the medical text is obtained by using regular expressions to obtain the Chinese sentences in unstructured text.
4. The method for extracting time compound words from medical text according to claim 1, characterized in that, when the time compound word extraction model is trained, the model is trained multiple times in an iterative manner, threshold parameter settings are introduced, and the optimal model is finally obtained through parameter tuning.
5. A device for extracting time compound words from medical text, comprising an acquirer, a processor, a memory, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1-5.
6. A system for extracting time compound words from medical text, characterized in that the system comprises: a medical text library for storing unstructured medical text and the entity sets; a text input unit through which the system reads unstructured medical text; a time compound word extraction unit for extracting time compound word entities by means of the time compound word structures; a time compound word updating unit for updating the existing time compound word set; a new time compound word structure extraction unit for extracting new time compound word structures by segmenting the time compound words and then analyzing their parts of speech; a storage unit for structured storage of results, storing the extracted time compound words and time compound word structures in corresponding files; and a display unit for displaying the results of time compound word extraction.
CN202110970693.7A 2021-08-23 2021-08-23 A method, system and device for obtaining temporal compound words in medical text based on lexical word formation Pending CN113628757A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110970693.7A CN113628757A (en) 2021-08-23 2021-08-23 A method, system and device for obtaining temporal compound words in medical text based on lexical word formation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110970693.7A CN113628757A (en) 2021-08-23 2021-08-23 A method, system and device for obtaining temporal compound words in medical text based on lexical word formation

Publications (1)

Publication Number Publication Date
CN113628757A true CN113628757A (en) 2021-11-09

Family

ID=78387300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110970693.7A Pending CN113628757A (en) 2021-08-23 2021-08-23 A method, system and device for obtaining temporal compound words in medical text based on lexical word formation

Country Status (1)

Country Link
CN (1) CN113628757A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication