CN114841168A - Structured information processing method of imaging report text, lung disease monitoring method and system - Google Patents

Structured information processing method of imaging report text, lung disease monitoring method and system

Info

Publication number
CN114841168A
CN114841168A CN202210546120.6A CN202210546120A CN114841168A CN 114841168 A CN114841168 A CN 114841168A CN 202210546120 A CN202210546120 A CN 202210546120A CN 114841168 A CN114841168 A CN 114841168A
Authority
CN
China
Prior art keywords
report text
sentence
imaging
entity
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210546120.6A
Other languages
Chinese (zh)
Inventor
靳超
郭利
冯圣中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NATIONAL SUPERCOMPUTING CENTER IN SHENZHEN (SHENZHEN CLOUD COMPUTING CENTER)
Original Assignee
NATIONAL SUPERCOMPUTING CENTER IN SHENZHEN (SHENZHEN CLOUD COMPUTING CENTER)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NATIONAL SUPERCOMPUTING CENTER IN SHENZHEN (SHENZHEN CLOUD COMPUTING CENTER) filed Critical NATIONAL SUPERCOMPUTING CENTER IN SHENZHEN (SHENZHEN CLOUD COMPUTING CENTER)
Priority to CN202210546120.6A priority Critical patent/CN114841168A/en
Publication of CN114841168A publication Critical patent/CN114841168A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 ICT specially adapted for the handling or processing of medical images
    • G16H30/20 ICT specially adapted for the handling or processing of medical images for handling medical images, e.g. DICOM, HL7 or PACS

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The application relates to a structured information processing method for imaging report text, and to a lung disease monitoring method and system. The structured information processing method comprises the following steps: S11, further dividing the site and morphology features among the medical imaging entities into negative and positive types to obtain eight entity labels, namely a padding symbol, a sentence start symbol, a sentence end symbol, site-negative, site-positive, morphology-negative, morphology-positive and disease name, performing named entity recognition on the imaging report text based on the eight entity labels, and extracting the named entities as output in BIO format; and S12, based on the extracted named entities, filtering out redundant information labeled O and the entities labeled site-negative and morphology-negative, calculating a sentence vector and storing it in a database. The method extracts disease information efficiently and yields more expressive sentence vectors, so that more accurate clusters of similar cases can be obtained for spatio-temporal distribution analysis.

Description

Structured information processing method of imaging report text, lung disease monitoring method and system
Technical Field
The present application relates to technology for processing and analyzing medical texts, and more particularly to a structured information processing method and system for imaging report text, a lung disease monitoring method and system based on imaging report text, and an electronic device.
Background
Processing and analysis of medical texts has been a focus of research interest at home and abroad. The medical text contains abundant information such as admission records, pathological reports and imaging reports of patients and has an important guiding function for clinical diagnosis. Different from the analysis and application of English medical texts which are researched more abroad, the Chinese text analysis has great difficulty due to the unique property of Chinese language, namely no clear separator, no root word and prefix. Meanwhile, data codes of different hospitals are different, and methods for doctors to write electronic medical records are different, so that the electronic medical records contain a large number of meaningless punctuations and stop words, and the problems of non-uniform formats, ambiguous words, abbreviations, misspelling and the like exist, which bring great difficulty to subsequent text mining and analysis.
Medical text is mostly stored in a semi-structured manner, where unstructured text is more diverse in expressive power and presentation form, but not conducive to subsequent in-depth analysis. Therefore, extracting unstructured text information and realizing structured representation have important significance for text analysis. At present, a rule-based method, a deep learning-based method and the like are commonly used. Among them, the rule-based method requires extensive and deep professional knowledge, and relies on expert to construct rules for development, so that the labor cost is high. Deep learning based methods rely on labeled data and design of network structures for end-to-end training. The method has the advantages that the characteristics do not need to be screened manually, the labor cost of expert characteristic extraction is reduced, the generalization capability is strong, and the method receives wide attention at present.
In producing a structured representation of medical text, the most important step is named entity recognition. Named entity recognition extracts words with specific meanings from arbitrary text; these words serve as the structured information of the text for subsequent in-depth analysis. Approaches to named entity recognition generally fall into three categories: rule- and dictionary-based methods, traditional statistical methods, and neural network methods.
Dictionary-based methods rely on a dictionary of terms and use a matching algorithm for named entity recognition. For highly specialized texts such as medical texts, the scale and quality of the annotated corpus therefore strongly affect model performance. In medical text analysis practice, annotation guidelines and corpus construction are still being explored, including the annotation guidelines for Chinese electronic medical records formulated in 2015 with reference to the i2b2 2010 annotation guidelines. Rule-based methods analyze the whole text by constructing rule templates, with named entity recognition carried out by pattern matching. This approach is more intuitive and easier to maintain, but it requires domain experts to spend considerable time constructing rules, and building templates becomes very difficult when no explicit rules exist.
In response to the difficulties of dictionary- and rule-based methods, statistical machine learning methods have received a great deal of attention. Traditional machine learning methods for named entity recognition require large labeled data sets and cast the task as a classification problem. Commonly used sequence labeling models include hidden Markov models (HMM), maximum entropy models (ME), conditional random fields (CRF), support vector machines (SVM), and the like.
A neural network is a large, complex model formed by interconnecting many nonlinear units, and can represent complex nonlinear dynamics. It likewise depends on data-driven training with large amounts of data, has self-organizing, adaptive and self-learning capabilities, and is particularly suited to problems involving imprecise information in which many factors must be considered simultaneously. In the prior art, a neural network model based on CNN-BiLSTM-CRF has been proposed to reduce the workload of manual feature extraction: a convolutional neural network is used to train character vectors carrying morphological features and word vectors carrying semantic features, and the two are combined and fed into a BiLSTM-CRF model; see: Kupiec J. Robust part-of-speech tagging using a hidden Markov model [J]. Computer Speech & Language, 1992, 6(3): 225-242. To extract general features of the latent semantic and syntactic information in sentences, it has been proposed to integrate a language model and a sentence-level reading control gate into the BiLSTM-CRF model, where the reading control gate integrates the implicit information of the sentence and language model learning can represent richer latent features; see: Li L, Jiang Y. Integrating language model and reading control gate in BLSTM-CRF for biomedical named entity recognition [A]. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM) [C]. Kansas: IEEE Computer Society, 2017: 380-.
To address the difficulty of using document-level information at the sentence level, the prior art has also proposed introducing an attention mechanism into the BiLSTM-CRF model and using it to obtain a contextual representation of the current word over the full text; see: Yang Pei, Yang Zhihao, Luo Ling, et al. Chemical drug named entity recognition based on attention mechanism [J]. Journal of Computer Research and Development, 2018, 55(7): 1548-.
However, there is a lack in the art of methods and systems for information extraction for imaging report text to analyze lung disease characteristics for lung disease monitoring.
Disclosure of Invention
The technical problem to be solved by the present application is to provide a method and a system for processing structured information of an imaging report text, a method and a system for monitoring a lung disease based on the imaging report text, and an electronic device, aiming at the above-mentioned defects of the prior art.
In order to solve the technical problem, in a first aspect, the present application provides a structured information processing method for imaging report text, including the following steps:
S11, further dividing the site and morphology features among the medical imaging entities into negative and positive types to obtain eight entity labels, namely a padding symbol, a sentence start symbol, a sentence end symbol, site-negative, site-positive, morphology-negative, morphology-positive and disease name, performing named entity recognition on the imaging report text based on the eight entity labels, and extracting the named entities as output in BIO format;
S12, based on the named entities extracted in step S11, filtering out redundant information labeled O and the entities labeled site-negative and morphology-negative, calculating a sentence vector of the imaging report text, and storing the sentence vector in a database.
In an embodiment according to the first aspect of the present application, in step S11, a BERT-BiLSTM-CRF model is used to perform named entity recognition on the imaging report text.
In an embodiment of the first aspect of the present application, calculating the sentence vector in step S12 further includes: calculating the sentence vector from the word vectors of the words corresponding to the three entity labels retained after redundant information is removed, namely site-positive, morphology-positive and disease name, and setting a separate weight for the word vectors of each entity label when calculating the sentence vector, wherein the word vectors are the output of the BERT model in step S11.
In order to solve the technical problem, in a second aspect, the present application provides a method for monitoring a lung disease based on an imaging report text, including the following steps:
S21, processing lung imaging report texts with the above structured information processing method for imaging report text to construct a database;
S22, searching the database, calculating the similarity between the sentence vectors of the imaging report texts, and classifying all imaging report texts whose similarity is greater than a threshold as similar cases;
S23, analyzing the spatio-temporal distribution characteristics of all imaging report texts classified as similar cases.
In an embodiment according to the second aspect of the present application, the step S22 further includes:
acquiring n imaging report texts within a period of time;
calculating the cosine similarity among the n imaging report texts to obtain an n×n cosine similarity matrix;
constructing a weighted undirected graph over the imaging report texts from the cosine similarity matrix and a cosine similarity threshold, taking each imaging report text as a graph node, to obtain the weighted adjacency matrix A of the imaging report texts;
calculating the Laplacian matrix L of the weighted undirected graph, where $L = D - A$ and D is the degree matrix, an n×n diagonal matrix;
calculating the normalized Laplacian matrix $L' = D^{-1/2} L D^{-1/2}$, and using spectral clustering to find imaging report texts describing similar diseases, thereby obtaining several case clusters of similar texts.
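The matrix steps above can be sketched in plain Python as follows. This is only an illustrative sketch: the function names, the toy two-dimensional vectors and the threshold value 0.8 are assumptions, and the eigen-decomposition and clustering stages of spectral clustering are deliberately left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def graph_matrices(vectors, threshold):
    """Return (A, L, L_norm) for the thresholded cosine-similarity graph."""
    n = len(vectors)
    # Weighted adjacency A: keep similarities above the threshold, no self-loops
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(vectors[i], vectors[j])
            if s > threshold:
                A[i][j] = A[j][i] = s
    # Degree matrix D is diagonal, with D[i][i] equal to the i-th row sum of A
    deg = [sum(row) for row in A]
    # Laplacian L = D - A
    L = [[(deg[i] if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]
    # Normalized Laplacian L' = D^{-1/2} L D^{-1/2}
    # Assumption: isolated nodes (degree 0) are scaled by 0 to avoid division by zero
    d = [1.0 / math.sqrt(x) if x > 0 else 0.0 for x in deg]
    L_norm = [[d[i] * L[i][j] * d[j] for j in range(n)] for i in range(n)]
    return A, L, L_norm

# Toy sentence vectors (hypothetical values)
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
A, L, L_norm = graph_matrices(vectors, threshold=0.8)
```

In this toy input only the first two vectors are similar enough to be connected, so the third report ends up as an isolated node.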
In an embodiment according to the second aspect of the present application, the step S23 further includes:
extracting the set of words corresponding to the site-positive and morphology-positive entity labels of the imaging report texts assigned to the same case cluster, and using the set as the imaging feature representation of the disease;
and plotting the spatial distribution and the temporal distribution of the cases corresponding to the imaging report texts of the same case cluster, according to the visit time and hospital location of each case.
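As a rough sketch of the last step, the cases in one cluster can be counted by hospital and by visit date before plotting. The record fields (`cluster`, `hospital`, `visit_date`), the function name and all values below are made up for illustration; the application does not specify a record layout.

```python
from collections import Counter

# Made-up case records; field names and values are placeholders
cases = [
    {"cluster": 0, "hospital": "Hospital A", "visit_date": "2022-01-03"},
    {"cluster": 0, "hospital": "Hospital A", "visit_date": "2022-01-04"},
    {"cluster": 0, "hospital": "Hospital B", "visit_date": "2022-01-04"},
    {"cluster": 1, "hospital": "Hospital B", "visit_date": "2022-01-05"},
]

def cluster_distributions(cases, cluster_id):
    """Count the cases of one cluster per hospital (space) and per date (time)."""
    members = [c for c in cases if c["cluster"] == cluster_id]
    by_hospital = Counter(c["hospital"] for c in members)
    by_date = Counter(c["visit_date"] for c in members)
    return by_hospital, by_date

by_hospital, by_date = cluster_distributions(cases, cluster_id=0)
```

The two counters can then be passed to whatever plotting tool is in use to draw the spatial and temporal distributions.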
In order to solve the technical problem, in a third aspect, the present application provides a structured information processing system for imaging report text, comprising:
a named entity recognition module, used for further dividing the site and morphology features among the medical imaging entities into negative and positive types to obtain eight entity labels, namely a padding symbol, a sentence start symbol, a sentence end symbol, site-negative, site-positive, morphology-negative, morphology-positive and disease name, performing named entity recognition on the imaging report text based on the eight entity labels, and extracting the named entities as output in BIO format;
and a sentence vector construction module, used for filtering out redundant information labeled O and the entities labeled site-negative and morphology-negative based on the named entities extracted by the named entity recognition module, calculating a sentence vector of the imaging report text, and storing the sentence vector in a database.
In order to solve the technical problem, in a fourth aspect, the present application provides a pulmonary disease monitoring system based on an imaging report text, including:
the structured information processing system for imaging report text described above, used for performing structured information processing on lung imaging report texts to construct a database;
a similar case classification module, used for searching the database, calculating the similarity between the sentence vectors of the imaging report texts, and classifying all imaging report texts whose similarity is greater than a threshold as similar cases;
and a spatio-temporal feature analysis module, used for analyzing the spatio-temporal distribution characteristics of all imaging report texts classified as similar cases.
In order to solve the technical problem, in a fifth aspect, an electronic device is provided, which includes a processor and a memory, where the memory stores a computer program, and the computer program, when executed by the processor, implements the steps of the structured information processing method for imaging report text described above.
In order to solve the technical problem, in a sixth aspect, an electronic device is provided, which includes a processor and a memory, where the memory stores a computer program, and the computer program is executed by the processor to implement the steps of the method for monitoring a pulmonary disease based on an imaging report text as described above.
Implementing the structured information processing method for imaging report text, the lung disease monitoring method and system based on imaging report text, and the electronic device of the present application has the following beneficial effects. The structured information processing method distinguishes negative from positive descriptions in the entity labels during named entity recognition, so redundant wording and negative feature descriptions in the report text can be filtered efficiently while positive features are retained. The structured information processing method of the embodiments further calculates the sentence vector of the imaging report text with weights that depend on the named entity label type, yielding a more expressive sentence vector. The lung disease monitoring method applies the structured information processing method to a large number of lung imaging report texts, so that more accurate clusters of similar cases can be obtained for spatio-temporal distribution analysis.
Drawings
The present application will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flowchart of a structured information processing method for imaging report text according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the BERT-BiLSTM-CRF model employed in the present application;
FIG. 3 is a flowchart of a method for pulmonary disease monitoring based on imaging report text according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the logical structure of a system for pulmonary disease monitoring based on imaging report text according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a logical structure of an electronic device according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the logical structure of an electronic device according to another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad application. Also, the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The application relates to a structured information processing method for imaging report text, which uses named entity recognition to identify and extract entity information carrying the imaging features of a disease from the imaging report text, and constructs a more expressive sentence vector for the imaging report text. The application also relates to a lung disease monitoring method based on imaging report text, which monitors text similarity across a large number of imaging reports from lung X-ray, CT and MRI examinations and is used to analyze the spatio-temporal distribution characteristics of similar cases.
FIG. 1 shows a flowchart of a structured information processing method 10 for imaging report text according to an embodiment of the present application. Referring to FIG. 1, the structured information processing method 10 includes the following steps:
Step S11, further dividing the site and morphology features among the medical imaging entities into negative and positive types to obtain eight entity labels, namely a padding symbol, a sentence start symbol, a sentence end symbol, site-negative, site-positive, morphology-negative, morphology-positive and disease name, performing named entity recognition on the imaging report text based on the eight entity labels, and extracting the named entities as output in BIO format;
Step S12, based on the named entities extracted in step S11, filtering out redundant information labeled O and the entities labeled site-negative and morphology-negative, calculating a sentence vector of the imaging report text, and storing the sentence vector in a database.
In an embodiment of the present application, the structured information processing method 10 performs named entity recognition on the imaging report text in step S11. The named entities to be recognized are the labels assigned to the data. The labels include the sentence symbols PAD, CLS and SEP, where PAD is the padding symbol, CLS is the sentence start symbol and SEP is the sentence end symbol; the medical imaging entities of interest include site, morphology and disease name. Because an imaging report describes not only the disease but also normal tissues and organs, and the descriptions of normal tissues and organs interfere with the disease features, the above embodiment further refines the text labels through expert annotation, dividing the site and morphology features in the imaging report into negative and positive types. There are thus eight named entity labels in total: PAD, CLS, SEP, site-negative (par-neg), site-positive (par-pos), morphology-negative (mor-neg), morphology-positive (mor-pos) and disease name (dis). In addition, entities are labeled in BIO format, with one label per character: B indicates the first character of an entity or a single-character entity, I indicates a middle or final character of an entity, and O indicates a non-target character.
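To make the BIO labeling concrete, the following sketch groups per-character tags into entity spans. The tag spelling (`B-par-pos`, `I-mor-pos`), the helper name `decode_bio` and the example sentence with its tags are assumptions for illustration only; the application does not give the exact tag strings.

```python
def decode_bio(chars, tags):
    """Group per-character BIO tags into (entity_text, label) spans."""
    entities = []
    current_chars, current_label = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first
            if current_chars:
                entities.append(("".join(current_chars), current_label))
            current_chars, current_label = [ch], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label continues the open entity
            current_chars.append(ch)
        else:
            # "O" (or an I- tag that does not match) closes any open entity
            if current_chars:
                entities.append(("".join(current_chars), current_label))
            current_chars, current_label = [], None
    if current_chars:
        entities.append(("".join(current_chars), current_label))
    return entities

# Made-up example: "nodule seen in the right upper lobe"
chars = list("右肺上叶见结节")
tags = ["B-par-pos", "I-par-pos", "I-par-pos", "I-par-pos",
        "O", "B-mor-pos", "I-mor-pos"]
entities = decode_bio(chars, tags)
```

Here the site "右肺上叶" and the morphology "结节" come out as two positive entities, while the character labeled O is dropped.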
According to an embodiment of the present application, in step S11 the structured information processing method 10 performs named entity recognition on the imaging report text using a BERT-BiLSTM-CRF model based on the eight entity labels. The structure of the BERT-BiLSTM-CRF model is shown in FIG. 2. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model. For an input imaging report text, the model's vocab file is first consulted to obtain the ID of each character, digit or symbol, converting the input text into a sequence of numbers for subsequent calculation. The BERT model converts each character ID of an input text of sequence length n into a 768-dimensional vector, so that an ID sequence of shape [n, 1] as input yields a word embedding output of shape [n, 768]. The word vector sequence output by the BERT model is then fed into a BiLSTM (Bidirectional Long Short-Term Memory) network for further processing and extraction of context information.
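The vocabulary lookup can be pictured with a toy example. The vocabulary below is invented for illustration and its IDs do not match any real BERT vocab file; the `encode` helper and the `max_len` padding length are likewise assumptions.

```python
# Toy vocabulary (hypothetical IDs, not those of a real BERT vocab file)
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "肺": 4, "结": 5, "节": 6}

def encode(text, vocab, max_len):
    """Map each character to its ID, add CLS/SEP, then pad up to max_len."""
    tokens = ["[CLS]"] + list(text) + ["[SEP]"]
    # Characters missing from the vocabulary fall back to the [UNK] ID
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # No truncation in this sketch: longer inputs are returned unpadded
    ids += [vocab["[PAD]"]] * (max_len - len(ids))
    return ids

ids = encode("肺结节影", TOY_VOCAB, max_len=8)
```

The character "影" is not in the toy vocabulary, so it maps to the `[UNK]` ID, and the two trailing positions are filled with the `[PAD]` ID.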
The calculation functions of a standard LSTM model are as follows:
$$i_t = \sigma(W_{ix} x_t + W_{ir} y_{t-1} + W_{ic} c_{t-1} + b_i)$$
$$f_t = \sigma(W_{fx} x_t + W_{fr} y_{t-1} + W_{fc} c_{t-1} + b_f)$$
$$g_t = \sigma(W_{cx} x_t + W_{cr} y_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + g_t \odot i_t$$
$$o_t = \sigma(W_{ox} x_t + W_{or} y_{t-1} + W_{oc} c_t + b_o)$$
$$m_t = o_t \odot h(c_t)$$
$$y_t = W_{ym} m_t$$
where $\{W_{ix}, W_{ir}, W_{ic}, W_{fx}, W_{fr}, W_{fc}, W_{cx}, W_{cr}, W_{ox}, W_{or}, W_{oc}, W_{ym}\}$ and $\{b_i, b_f, b_c, b_o\}$ are the weight matrices and bias vectors of the neural network, which are computed and iteratively updated by gradient descent during training; $x_t$ is the input at the current position t, i.e. the word vector of length 768 at position t of the [n, 768] input sequence; $y_{t-1}$ is the output at position t-1, whose feature vector has the same length as the input, namely 768; $c_{t-1}$ is the main-line memory feature vector of the LSTM, of length 768; $i_t$, $f_t$, $g_t$, $c_t$, $o_t$ and $m_t$ are intermediate vectors; $y_t$ is the output of the LSTM module; $\sigma$ is the sigmoid activation function, and $\odot$ denotes element-wise multiplication. This LSTM module is called a forward LSTM, meaning that the feature vector at the current position t depends on information from the previous position t-1.
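A single step of these equations can be sketched in plain Python as below. The sketch follows the formulas as written in the text, including the sigmoid on $g_t$. The text does not define the function $h$, so `tanh` is used here purely as an assumption; the dictionary keys and the tiny 2-dimensional weights are also illustrative.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(*vs):
    """Element-wise sum of several equal-length vectors."""
    return [sum(xs) for xs in zip(*vs)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def hadamard(u, v):
    """Element-wise product."""
    return [a * b for a, b in zip(u, v)]

def lstm_step(x_t, y_prev, c_prev, W, b):
    """One forward step; W and b are dicts keyed by the subscripts in the text."""
    i_t = sigmoid(add(matvec(W["ix"], x_t), matvec(W["ir"], y_prev),
                      matvec(W["ic"], c_prev), b["i"]))
    f_t = sigmoid(add(matvec(W["fx"], x_t), matvec(W["fr"], y_prev),
                      matvec(W["fc"], c_prev), b["f"]))
    g_t = sigmoid(add(matvec(W["cx"], x_t), matvec(W["cr"], y_prev), b["c"]))
    c_t = add(hadamard(f_t, c_prev), hadamard(g_t, i_t))
    o_t = sigmoid(add(matvec(W["ox"], x_t), matvec(W["or"], y_prev),
                      matvec(W["oc"], c_t), b["o"]))
    # Assumption: h is taken to be tanh, which the text does not specify
    m_t = hadamard(o_t, [math.tanh(c) for c in c_t])
    y_t = matvec(W["ym"], m_t)
    return y_t, c_t

# Tiny 2-dimensional example: all weights zero except an identity output map
zeros = [[0.0, 0.0], [0.0, 0.0]]
keys = ["ix", "ir", "ic", "fx", "fr", "fc", "cx", "cr", "ox", "or", "oc"]
W = {k: zeros for k in keys}
W["ym"] = [[1.0, 0.0], [0.0, 1.0]]
b = {k: [0.0, 0.0] for k in ["i", "f", "c", "o"]}
y_t, c_t = lstm_step([1.0, -1.0], [0.0, 0.0], [0.0, 0.0], W, b)
```

With all gate weights at zero, every gate evaluates to 0.5, which makes the step easy to check by hand.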
The BiLSTM used in the present application is a bidirectional LSTM model comprising a forward LSTM and a backward LSTM. For the backward LSTM, the feature vector at position t depends on information from the following adjacent position t+1. In addition, the output of each sub-module in the BiLSTM model has length 768/2 = 384, and the outputs of the two LSTMs are finally concatenated into a feature vector of length 768. This sequence dependency is shown in the BiLSTM layer in FIG. 2. To prepare for the CRF layer, a linear layer is added after the output of the BiLSTM layer, reducing each feature vector of length 768 to 8 dimensions, corresponding to the eight named entity labels.
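The difference between the two directions can be illustrated with a toy recurrence. In the sketch below a running sum stands in for the LSTM cell, which is an assumption made only to keep the example short; the point is that the forward output at position t depends on earlier positions, the backward output on later ones, and the two are joined position by position.

```python
def run_direction(xs, step, reverse=False):
    """Apply a recurrent step in one direction, keeping outputs aligned to positions."""
    positions = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    state = 0.0
    outputs = [None] * len(xs)
    for t in positions:
        state = step(xs[t], state)
        outputs[t] = state
    return outputs

def toy_step(x, state):
    # Toy stand-in for an LSTM cell: a running sum of the inputs seen so far
    return x + state

xs = [1.0, 2.0, 3.0]
forward = run_direction(xs, toy_step)                  # sees positions 0..t
backward = run_direction(xs, toy_step, reverse=True)   # sees positions t..n-1
# Join the two directions at each position (concatenation in the real model)
combined = [(f, b) for f, b in zip(forward, backward)]
```

In the model described above each direction would produce a 384-dimensional vector per position, and joining them gives the 768-dimensional feature vector.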
The output of the BiLSTM is a sequence of shape [n, 8]: the feature vector at each position has length 8 and holds the scores of the 8 labels. The CRF further processes this sequence to generate labels in BIO format. A CRF (Conditional Random Field) is a probabilistic graphical model that, given input random variables, solves for the conditional probability distribution of the output random variables. The per-position scores for the 8 labels output by the BiLSTM do not take the association between words into account, whereas the CRF establishes the association between adjacent words by learning a transition matrix between labels. This transition matrix is denoted T and has dimensions 8×8; its element $T_{ij}$ represents the transition probability from label i to label j. Let P be the output of the BiLSTM layer, whose element $P_{ij}$ represents the probability of the j-th label for the i-th word in a sentence. For a sentence of length m, $S = (S_1, S_2, \ldots, S_m)$, suppose the predicted label sequence of the sentence is $y = (y_1, y_2, \ldots, y_m)$. The score of the label sequence is calculated by the following formula:
Figure BDA0003649077660000091
The score of the entire sequence equals the sum of the scores of each word in the sentence, where P is the score output by the BiLSTM and T is the transition matrix of the CRF layer. The probability is calculated with a softmax function. Letting y′ range over all possible label sequences, it takes the form:
$$P(y \mid S) = \frac{\exp(\mathrm{score}(S, y))}{\sum_{y'} \exp(\mathrm{score}(S, y'))}$$
The model is trained by maximum likelihood, maximizing the following log-likelihood:
$$\log P(y \mid S) = \mathrm{score}(S, y) - \log \sum_{y'} \exp(\mathrm{score}(S, y'))$$
Through the above steps, a text of length n is converted into named entity output in the corresponding BIO format.
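The CRF scoring and log-likelihood can be checked on a tiny example. The score formula itself appears only as an image in the source, so the form used below (emission scores summed over positions plus transition scores between adjacent labels) is an assumption based on the common BiLSTM-CRF formulation and on the description above. The normalizer is computed by enumerating every label sequence, which is only feasible for toy sizes; practical implementations typically use the forward algorithm instead.

```python
import itertools
import math

def sequence_score(P, T, y):
    """Assumed score: emission scores plus transitions between adjacent labels."""
    emit = sum(P[i][y[i]] for i in range(len(y)))
    trans = sum(T[y[i - 1]][y[i]] for i in range(1, len(y)))
    return emit + trans

def log_likelihood(P, T, y):
    """log P(y|S) = score(S, y) - log sum over y' of exp(score(S, y'))."""
    num_labels = len(T)
    all_scores = [sequence_score(P, T, cand)
                  for cand in itertools.product(range(num_labels), repeat=len(P))]
    # Log-sum-exp with the maximum subtracted for numerical stability
    top = max(all_scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in all_scores))
    return sequence_score(P, T, y) - log_z

# Toy sizes: 3 positions, 2 labels (the model in the text uses 8 labels)
P = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # hypothetical emission scores
T = [[0.5, -0.5], [-0.5, 0.5]]             # hypothetical transition scores
score = sequence_score(P, T, (0, 1, 0))
ll = log_likelihood(P, T, (0, 1, 0))
```

Exponentiating the log-likelihood of every possible label sequence and summing gives 1, which is a quick sanity check on the normalizer.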
According to an embodiment of the present application, in step S12 the structured information processing method 10 filters redundant information from the imaging report according to the extracted named entities, calculates a sentence vector of the imaging report text, and stores it in the database. Redundant information refers to words labeled O in the named entity output and entities labeled as negative, i.e. site-negative (par-neg) and morphology-negative (mor-neg); in the end only disease name information and entities labeled as positive, i.e. site-positive (par-pos) and morphology-positive (mor-pos), are retained. According to a specific embodiment of the present application, the sentence vector is calculated from the word vectors output by the BERT model in step S11 for these retained words, with a separate weight set for the word vectors of each entity label. With $W_{i,l}$ denoting the word vector at position i labeled l, the sentence vector is calculated as:
$$\mathrm{Sentence} = \sum_i \lambda_l W_{i,l}$$
where Sentence is the sentence vector representing all the disease-related information in the whole sentence, and $\lambda_l$ is the sentence vector weight for label l, set by experts according to statistical experience. Finally, the entities identified by named entity recognition and the sentence vector are stored in a database.
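The filtering and weighted sum can be sketched as follows. The weight values, the 2-dimensional toy vectors and the function name are assumptions for illustration; in the application the weights are set by experts and the word vectors are the 768-dimensional BERT outputs.

```python
# Labels retained after filtering, as described in the text
KEPT_LABELS = {"par-pos", "mor-pos", "dis"}

def sentence_vector(word_vectors, labels, weights):
    """Weighted sum of the word vectors whose label is retained."""
    dim = len(word_vectors[0])
    sentence = [0.0] * dim
    for vec, label in zip(word_vectors, labels):
        if label not in KEPT_LABELS:
            continue  # drops O, par-neg and mor-neg positions
        lam = weights[label]
        for k in range(dim):
            sentence[k] += lam * vec[k]
    return sentence

# Toy inputs (hypothetical values)
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
labels = ["par-pos", "mor-pos", "dis", "par-neg", "O"]
weights = {"par-pos": 1.0, "mor-pos": 2.0, "dis": 3.0}
vec = sentence_vector(word_vectors, labels, weights)
```

The two filtered positions contribute nothing, so the result is 1·[1, 0] + 2·[0, 1] + 3·[1, 1].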
According to the structured information processing method 10 for imaging report text of the embodiment of the present application, negative and positive descriptions are distinguished in the entity labels when named entity recognition is performed, so that redundant wording and negative feature expressions in the report text can be filtered out efficiently while positive features are retained. The structured information processing method 10 for imaging report text according to the above embodiment further calculates the sentence vector of the imaging report text according to the type of the named entity label, and thereby obtains a more expressive sentence vector.
Based on the structural information processing method of the imaging report text, the application also provides a lung disease monitoring method based on the imaging report text. Fig. 3 shows a flow chart of a method 20 for pulmonary disease monitoring based on imaging report text according to an embodiment of the present application. Referring to fig. 3, the method 20 for pulmonary disease monitoring based on imaging report text includes the following steps:
Step S21, processing lung imaging report texts with the structured information processing method 10 for imaging report text to construct a database;
Step S22, searching the database, calculating the similarity between the sentence vectors of the imaging report texts, and classifying all imaging report texts whose similarity is larger than a threshold as similar cases;
Step S23, analyzing the spatiotemporal distribution features of all imaging report texts classified as similar cases.
According to a specific embodiment of the present application, in step S21, the above-mentioned method 20 for monitoring a pulmonary disease based on an imaging report text uses the aforementioned method 10 for processing structured information of an imaging report text described with reference to fig. 1 and fig. 2 to process a large number of pulmonary imaging report texts (for example, pulmonary X-ray, CT, MRI detection reports, etc.), so as to obtain sentence vectors and construct a structured information database, and a specific implementation process thereof is not repeated.
In an embodiment of the present application, the pulmonary disease monitoring method 20 based on the imaging report text compares the similarity of the texts in step S22 to classify similar cases. Specifically, the text similarity is calculated from the sentence vector of each imaging report text: the cosine of the angle between two sentence vectors is used as the measure of text similarity, and the cosine similarity is calculated according to the following formula:
$$\cos(\theta_{ij}) = \frac{\mathrm{Sentence}_i \cdot \mathrm{Sentence}_j}{|\mathrm{Sentence}_i|\,|\mathrm{Sentence}_j|}$$
where cos(θ_ij) is the cosine similarity between texts i and j, and |·| denotes the modulus (norm) of a sentence vector. All imaging report texts whose similarity is larger than a certain threshold are classified as similar cases for the later analysis of the spatiotemporal distribution.
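As a small illustration of this comparison, the following Python sketch computes the cosine similarity of two sentence vectors and lists the pairs of texts whose similarity exceeds a threshold. The function names and the handling of zero vectors are assumptions made for the example, not details given in the application.

```python
import math


def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| |v|) for two sentence vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)


def similar_pairs(vectors, threshold):
    """Return index pairs (i, j), i < j, whose cosine similarity exceeds the threshold."""
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_similarity(vectors[i], vectors[j]) > threshold:
                pairs.append((i, j))
    return pairs
```

The threshold itself is a tuning parameter; the application does not state a value.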
In an embodiment of the present application, the method 20 for pulmonary disease monitoring based on imaging report text further includes, in step S22:
Step S221, acquiring the imaging report texts within a period of time; assuming there are n texts in total, they are labeled [0, 1, 2, …, n-1].
Step S222, calculating the cosine similarity between the n imaging report texts to obtain an n×n cosine similarity matrix. Because a text is not compared with itself, the diagonal elements of the matrix are 0, and the cosine similarity matrix is expressed as follows:
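$$\begin{bmatrix} 0 & \cos(\theta_{0,1}) & \cdots & \cos(\theta_{0,n-1}) \\ \cos(\theta_{1,0}) & 0 & \cdots & \cos(\theta_{1,n-1}) \\ \vdots & \vdots & \ddots & \vdots \\ \cos(\theta_{n-1,0}) & \cos(\theta_{n-1,1}) & \cdots & 0 \end{bmatrix}$$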
Step S223, taking each imaging report text as a graph node, and constructing a weighted undirected graph representation of the imaging report texts according to the cosine similarity matrix and the cosine similarity threshold. The specific operation is as follows: each entry of the cosine similarity matrix that is larger than the threshold cos(θ_thresh) is set to cos(θ_ij), indicating that there is an edge between the two texts and that the weight of the edge is cos(θ_ij); otherwise, the entry is set to 0, indicating that the two texts are unrelated. The weighted adjacency matrix A of the imaging report texts is thereby obtained, with A_ij calculated as follows:
$$A_{ij} = \begin{cases} \cos(\theta_{ij}), & \cos(\theta_{ij}) > \cos(\theta_{thresh}) \\ 0, & \text{otherwise} \end{cases}$$
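Steps S222 and S223 can be sketched together in Python: first the n×n cosine similarity matrix with a zero diagonal, then the thresholded weighted adjacency matrix. This is an illustrative sketch only; the function names are hypothetical and sentence vectors are represented as plain lists.

```python
import math


def cosine_matrix(vectors):
    """n x n cosine similarity matrix; the diagonal stays 0 (no self-comparison)."""
    n = len(vectors)
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            denom = norms[i] * norms[j]
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            # assumption: a zero vector has similarity 0 with every other text
            C[i][j] = C[j][i] = dot / denom if denom else 0.0
    return C


def weighted_adjacency(C, cos_thresh):
    """A_ij = cos(theta_ij) if it exceeds the threshold, otherwise 0."""
    return [[c if c > cos_thresh else 0.0 for c in row] for row in C]
```

Because the similarity matrix is symmetric, the resulting adjacency matrix describes an undirected graph whose edge weights are the retained cosine similarities.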
Then, in step S224, the Laplacian matrix L of the graph over the imaging report texts is calculated as L = D - A, where D is the degree matrix, a diagonal matrix of dimension n×n, calculated as follows:
$$D_{ij} = \begin{cases} \sum_{k} A_{ik}, & i = j \\ 0, & i \neq j \end{cases}$$
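A minimal Python sketch of step S224 follows. It assumes the usual definition of the degree matrix, in which each diagonal entry is the sum of the corresponding row of the weighted adjacency matrix A; the function names are hypothetical.

```python
def degree_matrix(A):
    """Diagonal n x n matrix whose (i, i) entry is the sum of row i of A."""
    n = len(A)
    return [[sum(A[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]


def laplacian(A):
    """Unnormalized graph Laplacian L = D - A."""
    D = degree_matrix(A)
    n = len(A)
    return [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]
```

A quick sanity check on any result is that every row of L sums to zero, since the diagonal entry equals the sum of the edge weights subtracted in the rest of the row.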
Next, in step S225, the normalized Laplacian matrix L′ = D^{-1/2} L D^{-1/2} is calculated, and spectral clustering is then used to find the imaging report texts that describe similar diseases, yielding a plurality of case clusters of similar texts. Specifically, the eigenvalues and eigenvectors of the normalized Laplacian matrix L′ are first extracted by eigenvalue decomposition, the eigenvectors providing a low-dimensional embedded representation of the text graph; the eigenvectors are then clustered with the K-means method to obtain the case clusters of similar texts.
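The spectral clustering of step S225 can be sketched with NumPy as follows. Several choices here are assumptions rather than details from the application: the number of clusters k is passed in by the caller, the embedding uses the eigenvectors of the k smallest eigenvalues, isolated nodes receive a zero scaling factor, and K-means is a plain Lloyd iteration with a fixed random seed. The function names are hypothetical.

```python
import numpy as np


def normalized_laplacian(A):
    """L' = D^{-1/2} (D - A) D^{-1/2} for a symmetric weighted adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    # Assumption: nodes with degree 0 get scaling factor 0 to avoid dividing by zero.
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.where(deg > 0, deg, 1.0)), 0.0)
    L = np.diag(deg) - A
    return inv_sqrt[:, None] * L * inv_sqrt[None, :]


def spectral_embedding(A, k):
    """Rows are low-dimensional node embeddings built from the k smallest eigenvectors."""
    _, eigenvectors = np.linalg.eigh(normalized_laplacian(A))  # eigenvalues ascending
    return eigenvectors[:, :k]


def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_clusters(A, k):
    """Cluster the nodes of the text graph into k case clusters."""
    return kmeans(spectral_embedding(A, k), k)
```

Each returned label identifies the case cluster of the corresponding imaging report text. In practice a library routine could replace the hand-written K-means; it is written out here only to keep the sketch self-contained.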
According to an embodiment of the present application, analyzing the spatiotemporal distribution features of all imaging report texts classified as similar cases in step S23 of the imaging report text-based lung disease monitoring method 20 specifically includes: extracting the imaging report texts assigned to the same case cluster, and taking the set of words corresponding to the part-positive and morphology-positive entity labels extracted in step S21 as the imaging feature representation of the disease; then, according to the visit time and hospital location of each case, plotting the spatial distribution and the temporal distribution of the cases corresponding to the imaging report texts assigned to the same case cluster, as the spatiotemporal distribution features of the disease. If the disease is infectious, its spatiotemporal distribution will show a strong correlation.
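The aggregation behind step S23 can be illustrated with a short Python sketch that groups cases by cluster, collects their positive feature words, and counts cases per visit date and per hospital. The record layout is hypothetical: the keys `visit_date`, `hospital` and `positive_terms` are assumptions made for the example, and the actual plotting of the distributions is left out.

```python
from collections import Counter, defaultdict


def cluster_features(cases, labels):
    """Per cluster: set of words tagged part-positive or morphology-positive."""
    features = defaultdict(set)
    for case, cluster in zip(cases, labels):
        features[cluster].update(case["positive_terms"])
    return features


def spatiotemporal_distribution(cases, labels):
    """Per cluster: case counts by visit date and by hospital location."""
    by_time = defaultdict(Counter)
    by_place = defaultdict(Counter)
    for case, cluster in zip(cases, labels):
        by_time[cluster][case["visit_date"]] += 1
        by_place[cluster][case["hospital"]] += 1
    return by_time, by_place
```

The two counters for a cluster are the raw data for its temporal and spatial plots; a cluster whose cases concentrate in a few adjacent dates and nearby hospitals is the pattern the text associates with an infectious disease.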
According to the lung disease monitoring method 20 based on the imaging report text of the embodiment of the present application, the structured information processing method 10 of the imaging report text is adopted to perform structured information processing on a large number of lung imaging report texts, so that more accurate similar case clusters can be obtained for performing spatio-temporal distribution feature analysis.
Based on the above-described structured information processing method for imaging report text and the imaging report text-based lung disease monitoring method, the present application also provides a structured information processing system for imaging report text, and an imaging report text-based lung disease monitoring system that uses this system to construct a structured information database. Fig. 4 is a schematic diagram illustrating the logical structure of a pulmonary disease monitoring system 40 based on imaging report text according to an embodiment of the present application. Referring to fig. 4, the imaging report text-based lung disease monitoring system 40 includes a structured information processing system 30 for imaging report text, a similar case classification module 41, and a spatiotemporal feature analysis module 42. The structured information processing system 30 for imaging report text is used for performing structured information processing on lung imaging report texts to build a database. The structured information processing system 30 for imaging report text further includes a named entity recognition module 31 and a sentence vector construction module 32. The named entity recognition module 31 is configured to further divide the part and morphology features among the imaging medical entities into negative and positive classes to obtain eight entity labels, namely a padding token, a sentence start token, a sentence end token, part-negative, part-positive, morphology-negative, morphology-positive and disease name, to perform named entity recognition on the imaging report text based on the eight entity labels, and to extract the named entity output in BIO format.
The sentence vector construction module 32 is configured to filter out, according to the named entities extracted by the named entity recognition module 31, redundant information marked as O and information whose entity labels are part-negative or morphology-negative, to calculate a sentence vector of the imaging report text, and to store the sentence vector in the database. For specific implementations of the named entity recognition module 31 and the sentence vector construction module 32, reference may be made to the foregoing detailed description of step S11 and step S12 of the structured information processing method 10 for imaging report text, and details are not repeated here. The similar case classification module 41 is configured to search the database constructed by the above-described structured information processing system 30 for imaging report text, calculate the similarity between the sentence vectors of the imaging report texts, and classify all imaging report texts whose similarity is greater than a threshold as similar cases. The spatiotemporal feature analysis module 42 is used to analyze the spatiotemporal distribution features of all imaging report texts classified as similar cases. For specific implementations of the similar case classification module 41 and the spatiotemporal feature analysis module 42, reference may be made to the foregoing detailed description of step S22 and step S23 of the imaging report text-based lung disease monitoring method 20, and further description is omitted here.
Based on the method 10 for processing structured information of an imaging report text, an electronic device 50 is also provided. Referring to fig. 5, the electronic device 50 comprises a processor 51 and a memory 52, the processor 51 and the memory 52 being communicatively connected. The memory 52 stores a computer program which, when executed by the processor 51, causes the processor 51 to implement the structured information processing method 10 of the imaging report text of the foregoing embodiment of the present application.
Based on the above-mentioned method 20 for monitoring lung diseases based on imaging report text, the present application further provides an electronic device 60. Referring to fig. 6, the electronic device 60 includes a processor 61 and a memory 62, the processor 61 and the memory 62 being communicatively coupled. The memory 62 stores a computer program that, when executed by the processor 61, causes the processor 61 to implement the imaging report text based lung disease monitoring method 20 of the previous embodiment of the present application.
The above description is only a preferred embodiment of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A structured information processing method for imaging report text, characterized by comprising the following steps:
S11, further dividing the part and morphology features among the imaging medical entities into negative and positive classes to obtain eight entity labels, namely a padding token, a sentence start token, a sentence end token, part-negative, part-positive, morphology-negative, morphology-positive and disease name, performing named entity recognition on the imaging report text based on the eight entity labels, and extracting the named entity output in BIO format;
S12, filtering out, according to the named entities extracted in step S11, redundant information marked as O and information whose entity labels are part-negative or morphology-negative, calculating a sentence vector of the imaging report text, and storing the sentence vector in a database.
2. The method according to claim 1, wherein in step S11, a BERT-BiLSTM-CRF model is used to perform named entity recognition on the imaging report text.
3. The method of claim 2, wherein calculating the sentence vector in step S12 further comprises: calculating the sentence vector from the word vectors of the words corresponding to the three entity labels retained after the redundant information is removed, namely part-positive, morphology-positive and disease name, and setting a respective weight for the word vectors corresponding to each entity label when calculating the sentence vector, wherein the word vectors are the output of the BERT model in step S11.
4. A pulmonary disease monitoring method based on an imaging report text is characterized by comprising the following steps:
S21, processing lung imaging report texts with the structured information processing method for imaging report text according to any one of claims 1-3, and constructing a database;
S22, searching the database, calculating the similarity between the sentence vectors of the imaging report texts, and classifying all imaging report texts whose similarity is larger than a threshold as similar cases;
S23, analyzing the spatiotemporal distribution features of all imaging report texts classified as similar cases.
5. The method according to claim 4, wherein the step S22 further comprises:
acquiring n imaging report texts within a period of time;
calculating the cosine similarity among the n imaging report texts to obtain an n×n cosine similarity matrix;
taking each imaging report text as a graph node, constructing a weighted undirected graph representation of the imaging report texts according to the cosine similarity matrix and a cosine similarity threshold, to obtain a weighted adjacency matrix A of the imaging report texts;
calculating the Laplacian matrix L of the weighted undirected graph representation of the imaging report texts, wherein L = D - A, D is the degree matrix, and D is a diagonal matrix of dimension n×n;
calculating the normalized Laplacian matrix L′ = D^{-1/2} L D^{-1/2}, and finding the imaging report texts describing similar diseases by spectral clustering to obtain a plurality of case clusters of similar texts.
6. The method according to claim 5, wherein the step S23 further comprises:
extracting the set of words corresponding to the part-positive and morphology-positive entity labels of the imaging report texts assigned to the same case cluster, and using the set as the imaging feature representation of the disease;
and plotting, according to the visit time and hospital location of each case, the spatial distribution and the temporal distribution of the cases corresponding to the imaging report texts of the same case cluster.
7. A structured information processing system for imaging report text, comprising:
a named entity recognition module, used for further dividing the part and morphology features among the imaging medical entities into negative and positive classes to obtain eight entity labels, namely a padding token, a sentence start token, a sentence end token, part-negative, part-positive, morphology-negative, morphology-positive and disease name, performing named entity recognition on the imaging report text based on the eight entity labels, and extracting the named entity output in BIO format;
and a sentence vector construction module, used for filtering out, according to the named entities extracted by the named entity recognition module, redundant information marked as O and information whose entity labels are part-negative or morphology-negative, calculating a sentence vector of the imaging report text, and storing the sentence vector in a database.
8. A system for pulmonary disease monitoring based on imaging report text, comprising:
the structured information processing system of imaging report text according to claim 7, for performing structured information processing on lung imaging report text to construct a database;
a similar case classification module, used for searching the database, calculating the similarity between the sentence vectors of the imaging report texts, and classifying all imaging report texts whose similarity is larger than a threshold as similar cases;
and a spatiotemporal feature analysis module, used for analyzing the spatiotemporal distribution features of all imaging report texts classified as similar cases.
9. An electronic device comprising a processor and a memory, said memory storing a computer program, wherein said computer program, when executed by the processor, implements the steps of the structured information processing method for imaging report text according to any one of claims 1 to 3.
10. An electronic device comprising a processor and a memory, said memory storing a computer program, wherein said computer program, when executed by the processor, performs the steps of the method for pulmonary disease monitoring based on imaging report text according to claim 4.
CN202210546120.6A 2022-05-18 2022-05-18 Structured information processing method of imaging report text, lung disease monitoring method and system Pending CN114841168A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210546120.6A CN114841168A (en) 2022-05-18 2022-05-18 Structured information processing method of imaging report text, lung disease monitoring method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210546120.6A CN114841168A (en) 2022-05-18 2022-05-18 Structured information processing method of imaging report text, lung disease monitoring method and system

Publications (1)

Publication Number Publication Date
CN114841168A true CN114841168A (en) 2022-08-02

Family

ID=82569105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210546120.6A Pending CN114841168A (en) 2022-05-18 2022-05-18 Structured information processing method of imaging report text, lung disease monitoring method and system

Country Status (1)

Country Link
CN (1) CN114841168A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118335275A (en) * 2024-06-13 2024-07-12 广州高通影像技术有限公司 Semantic understanding and information extraction method and device facing endoscope report

Similar Documents

Publication Publication Date Title
CN111540468B (en) ICD automatic coding method and system for visualizing diagnostic reasons
US10929420B2 (en) Structured report data from a medical text report
Kumar et al. An ensemble of fine-tuned convolutional neural networks for medical image classification
CN105404632B (en) System and method for carrying out serialized annotation on biomedical text based on deep neural network
CN111554360A (en) Drug relocation prediction method based on biomedical literature and domain knowledge data
CN111538845A (en) Method, model and system for constructing kidney disease specialized medical knowledge map
CN111949759A (en) Method and system for retrieving medical record text similarity and computer equipment
CN111858940A (en) Multi-head attention-based legal case similarity calculation method and system
CN109036577A (en) Diabetic complication analysis method and device
Dessì et al. A recommender system of medical reports leveraging cognitive computing and frame semantics
Ababneh Investigating the relevance of Arabic text classification datasets based on supervised learning
CN115293161A (en) Reasonable medicine taking system and method based on natural language processing and medicine knowledge graph
CN113535947B (en) Multi-label classification method and device for incomplete data with missing labels
CN115841861A (en) Similar medical record recommendation method and system
CN117393098A (en) Medical image report generation method based on visual priori and cross-modal alignment network
Lyakhova et al. Systematic review of approaches to detection and classification of skin cancer using artificial intelligence: Development and prospects
Chen et al. Imbalanced prediction of emergency department admission using natural language processing and deep neural network
CN113643825B (en) Medical case knowledge base construction method and system based on clinical key feature information
CN114841168A (en) Structured information processing method of imaging report text, lung disease monitoring method and system
CN117194604B (en) Intelligent medical patient inquiry corpus construction method
CN113343680A (en) Structured information extraction method based on multi-type case history texts
US11977952B1 (en) Apparatus and a method for generating a confidence score associated with a scanned label
Malgieri Ontologies, Machine Learning and Deep Learning in Obstetrics
Sonawane et al. A design and implementation of heart disease prediction model using data and ECG signal through hybrid clustering
US20240028831A1 (en) Apparatus and a method for detecting associations among datasets of different types

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination