CN111898528B - Data processing method, device, computer readable medium and electronic equipment - Google Patents

Info

Publication number
CN111898528B
Authority
CN
China
Prior art keywords
identification
data
identifier
main body
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010745286.1A
Other languages
Chinese (zh)
Other versions
CN111898528A (en
Inventor
苏晨
李斌
洪科元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010745286.1A
Publication of CN111898528A
Application granted
Publication of CN111898528B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/418 Document matching, e.g. of document images
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 ICT specially adapted for medical reports, e.g. generation or transmission thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Input (AREA)

Abstract

The application belongs to the technical field of artificial intelligence, and particularly relates to a data processing method, a data processing device, a computer readable medium and electronic equipment. The method comprises the following steps: acquiring an image to be processed that displays a data set, wherein the data set comprises at least one data object; performing text recognition on the image to be processed to obtain an object main body identifier and an object association identifier of the data object, and a set type of the data set; performing identification matching in a main body identification database according to the object main body identifier to obtain one or more identification bodies corresponding to the object main body identifier; and screening the identification bodies according to the object association identifier and the set type to obtain a target body, and establishing a mapping relation between the data object and the target body. The method can improve data processing efficiency and obtain more accurate data processing results.

Description

Data processing method, device, computer readable medium and electronic equipment
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a data processing method, a data processing device, a computer readable medium and electronic equipment.
Background
With the development of computer technology, electronic data analysis and data storage based on computer equipment have great advantages over traditional paper media, so that the data processing cost can be reduced and the data processing efficiency can be improved.
Taking a medical institution or physical examination institution as an example, information on the physical functions and health conditions of a user can be collected through various examination devices, so that health assessment or risk prediction can be performed based on the collected data. So that users can conveniently view and carry the results, the data collected by the examination devices is generally presented in the form of a paper test sheet. On this basis, the relevant data items can be extracted from the paper test sheet by manual entry or automatic identification for electronic data processing. Because the data items are numerous and vary in form, manual entry places high demands on the knowledge level of operators and consumes considerable labor and time. Automatic identification, in turn, is only suitable for scenes with relatively simple data content; for data that is incompletely captured or highly similar to other data, it suffers from poor identification accuracy and is prone to identification errors.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the application and thus may include information that does not form the prior art that is already known to those of ordinary skill in the art.
Disclosure of Invention
The application aims to provide a data processing method, a data processing device, a computer readable medium and electronic equipment, which at least overcome the technical problems of low processing efficiency, poor accuracy and the like in related technologies such as data extraction and data identification to a certain extent.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided a data processing method, including:
acquiring an image to be processed for displaying a data set, wherein the data set comprises at least one data object;
performing text recognition on the image to be processed to obtain an object main body identifier and an object association identifier of the data object, and a set type of the data set;
performing identification matching in a main body identification database according to the object main body identification to obtain one or more identification bodies corresponding to the object main body identification;
and screening the identification bodies according to the object association identifier and the set type to obtain a target body, and establishing a mapping relation between the data object and the target body.
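The four steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Ontology` record, the `SUBJECT_DB` contents, the function names, and the use of a plain dictionary in place of a real image and text recognition engine are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ontology:
    ontology_id: str
    set_type: str   # type of data set this ontology belongs to
    unit: str       # association identifier expected for this ontology


# Hypothetical main body identification database: printed name -> ontologies.
SUBJECT_DB = {
    "white blood cell count": [
        Ontology("WBC_BLOOD", "blood routine", "10^9/L"),
        Ontology("WBC_URINE", "urine routine", "/uL"),
    ],
}


def recognize(image):
    """Stand-in for text recognition; the 'image' is already a dict here."""
    return image["subject_id"], image["association_id"], image["set_type"]


def match(subject_id):
    """Identification matching: look up every ontology sharing the name."""
    return SUBJECT_DB.get(subject_id.lower(), [])


def screen(candidates, association_id, set_type):
    """Screening: keep the ontology that agrees on set type and unit."""
    for candidate in candidates:
        if candidate.set_type == set_type and candidate.unit == association_id:
            return candidate
    return None


def process(image):
    subject_id, association_id, set_type = recognize(image)
    target = screen(match(subject_id), association_id, set_type)
    # The mapping relation between the data object and the target body.
    return {"subject_id": subject_id,
            "target": target.ontology_id if target else None}


result = process({"subject_id": "White blood cell count",
                  "association_id": "10^9/L",
                  "set_type": "blood routine"})
```

The same printed name resolves to a different target once the set type or unit changes, which is the ambiguity the screening step is meant to remove.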
According to an aspect of an embodiment of the present application, there is provided a data processing apparatus including:
an image acquisition module configured to acquire an image to be processed for presenting a data set, the data set comprising at least one data object;
the text recognition module is configured to perform text recognition on the image to be processed to obtain an object main body identifier and an object association identifier of the data object, and a set type of the data set;
the identification matching module is configured to perform identification matching in a main body identification database according to the object main body identification so as to obtain one or more identification bodies corresponding to the object main body identification;
and the body screening module is configured to screen the identification body according to the object association identification and the set type to obtain a target body, and establish a mapping relation between the data object and the target body.
In some embodiments of the present application, based on the above technical solutions, the text recognition module includes:
a set text recognition unit configured to perform text recognition on the image to be processed to obtain text content of the data set, wherein the text content comprises data text fields forming the data object;
a text field classification unit configured to classify the data text field according to the distribution position of the data text field on the image to be processed so as to determine an object main body identifier and an object association identifier of the data object;
and the text content classification unit is configured to classify the text content to obtain the set type of the data set.
In some embodiments of the present application, based on the above technical solution, the set text recognition unit includes:
the line detection subunit is configured to perform line detection on the image to be processed to obtain a form line in the image to be processed;
the region dividing subunit is configured to divide the region of the image to be processed according to the form lines so as to obtain a data form region where the data set is located;
and the text recognition subunit is configured to perform text recognition on the data table area to obtain text contents of the data set.
In some embodiments of the present application, based on the above technical solution, the line detection subunit includes:
the pixel classification subunit is configured to classify the pixel points in the image to be processed based on the image semantics so as to determine the foreground pixel points where the image lines are located;
the image segmentation subunit is configured to carry out image segmentation on the image to be processed according to the foreground pixel points so as to obtain a foreground line image;
and the line fitting subunit is configured to perform line fitting on the foreground line image to obtain a table line in the image to be processed.
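As a rough illustration of the line detection subunits above, the sketch below classifies pixels, builds a foreground mask, and fits horizontal table lines. It is a simplification under stated assumptions: a fixed gray-level threshold stands in for the semantic pixel classifier, only horizontal lines are fitted, and the function names and the `min_fill` parameter are hypothetical.

```python
def foreground_mask(image, threshold=128):
    """Stand-in pixel classifier: dark pixels count as line foreground.

    The source describes classification based on image semantics; a plain
    threshold is swapped in here only to keep the example self-contained.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


def fit_horizontal_lines(mask, min_fill=0.8):
    """Treat a row as a table line when most of its pixels are foreground.

    Returns (y, x_start, x_end) for each detected line.
    """
    lines = []
    for y, row in enumerate(mask):
        if sum(row) / len(row) >= min_fill:
            xs = [x for x, value in enumerate(row) if value]
            lines.append((y, xs[0], xs[-1]))
    return lines


# A 5x6 grayscale image: rows 1 and 3 are ruled lines, row 2 holds one
# dark "text" pixel that should not be mistaken for a line.
image = [
    [255] * 6,
    [0] * 6,
    [255, 0, 255, 255, 255, 255],
    [0] * 6,
    [255] * 6,
]
table_lines = fit_horizontal_lines(foreground_mask(image))
```

The rows between consecutive detected lines then delimit candidate table regions for the subsequent text recognition.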
In some embodiments of the present application, based on the above technical solution, the text field classification unit includes:
an indication field acquisition subunit configured to acquire an identification indication field in the text content, the identification indication field including a subject identification indication field for indicating the subject identification of the object and an association identification indication field for indicating the association identification of the object;
an indication area determining subunit configured to determine, on the image to be processed, a subject identification indication area corresponding to the subject identification indication field and an associated identification indication area corresponding to the associated identification indication field;
a positional relationship determining subunit configured to determine a regional positional relationship between the data text field and the main body identification indication region and the association identification indication region according to a distribution position of the data text field on the image to be processed;
and the text field classifying subunit is configured to classify the data text field according to the area position relation so as to determine an object main body identifier and an object association identifier of the data object.
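One way to picture the position-based classification above is to assign each data text field to the indication region (column header) it sits beneath. The sketch below is an assumption-laden example: boxes are `(x0, y0, x1, y1)` tuples in pixel coordinates, "beneath with the largest horizontal overlap" is chosen here as the positional relationship, and all names are hypothetical.

```python
def horizontal_overlap(a, b):
    """Width of the overlap between two boxes along the x axis."""
    return max(0, min(a[2], b[2]) - max(a[0], b[0]))


def classify_field(field_box, indication_boxes):
    """Return the label of the indication region the field falls under.

    A field belongs to a region when it lies below the region and shares
    the largest horizontal overlap with it; None means no region applies.
    """
    best_label, best_overlap = None, 0
    for label, header_box in indication_boxes.items():
        is_below = field_box[1] >= header_box[3]
        overlap = horizontal_overlap(field_box, header_box)
        if is_below and overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label


# Hypothetical header regions for an "item name" and a "unit" column.
indication_boxes = {
    "subject": (0, 0, 100, 20),
    "association": (200, 0, 260, 20),
}
name_label = classify_field((5, 30, 90, 50), indication_boxes)
unit_label = classify_field((205, 30, 250, 50), indication_boxes)
stray_label = classify_field((120, 30, 150, 50), indication_boxes)
```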
In some embodiments of the present application, based on the above technical solution, the text content classification unit includes:
a feature extraction subunit configured to perform feature extraction on the text content to obtain content features of the text content;
a feature mapping subunit configured to perform a mapping process on the content features to predict classification probabilities of classifying the text content into a plurality of types of labels, respectively;
and a tag selection subunit configured to select a target tag from the plurality of type tags according to the classification probability, and determine the target tag as a set type of the data set.
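The three subunits above (feature extraction, mapping to classification probabilities, label selection) can be sketched with a keyword-count feature vector, a linear mapping, and a softmax. The keywords, weights, and labels are hypothetical values chosen for illustration; the source does not specify the model.

```python
import math

LABELS = ["blood routine", "urine routine"]
KEYWORDS = ["hemoglobin", "platelet", "urine", "specific gravity"]
# Hypothetical weights: one row per label, one column per keyword.
WEIGHTS = [
    [2.0, 2.0, -1.0, -1.0],
    [-1.0, -1.0, 2.0, 2.0],
]


def extract_features(text):
    """Content features: how often each keyword occurs in the text."""
    lowered = text.lower()
    return [float(lowered.count(keyword)) for keyword in KEYWORDS]


def softmax(scores):
    """Map raw scores to probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def classify(text):
    """Return the most probable set type and the full probability list."""
    features = extract_features(text)
    scores = [sum(w * f for w, f in zip(row, features)) for row in WEIGHTS]
    probabilities = softmax(scores)
    best = max(range(len(LABELS)), key=probabilities.__getitem__)
    return LABELS[best], probabilities


set_type, probabilities = classify("Hemoglobin 135 g/L; Platelet count 210")
```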
In some embodiments of the present application, based on the above technical solution, the identifier matching module includes:
an exact matching unit configured to perform matching detection in a subject identification database according to the subject identification to determine whether there is an exact matching identification identical to the subject identification in the subject identification database;
a first body determining unit configured to determine an identification body having a mapping relationship with the exact match identification as an identification body corresponding to the object body identification if the exact match identification is detected in the body identification database;
the fuzzy matching unit is configured to perform matching detection in the main body identification database according to the object main body identification if the accurate matching identification is not detected in the main body identification database so as to determine whether fuzzy matching identification within a preset text difference range with the object main body identification exists in the main body identification database;
and a second body determining unit configured to determine an identification body having a mapping relationship with the fuzzy match identification as an identification body corresponding to the object body identification if the fuzzy match identification is detected in the body identification database.
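The exact-then-fuzzy matching order described above can be sketched as follows. Levenshtein edit distance is used here as the text difference measure and `max_distance=2` as the preset range; both are assumptions, as are the database contents and function names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def match_subject(subject_id, database, max_distance=2):
    """Return ontologies for an exact match, else for fuzzy matches."""
    if subject_id in database:
        return list(database[subject_id])
    matches = []
    for known_id, ontologies in database.items():
        if edit_distance(subject_id, known_id) <= max_distance:
            matches.extend(ontologies)
    return matches


# Hypothetical main body identification database.
database = {
    "white blood cell count": ["WBC_BLOOD", "WBC_URINE"],
    "red blood cell count": ["RBC_BLOOD"],
}
exact = match_subject("white blood cell count", database)
fuzzy = match_subject("whife blood cell coun", database)  # OCR-style errors
```

Fuzzy matching runs only when the exact lookup fails, so correctly recognized names never pay the cost of the distance computation.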
In some embodiments of the present application, based on the above technical solution, the fuzzy matching unit includes:
a fuzzy matching model establishing subunit configured to establish a fuzzy matching model with a tree structure according to the text difference degree of each identifier in the main body identifier database;
and a fuzzy matching model traversing subunit configured to traverse the fuzzy matching model to determine whether fuzzy matching identifiers within a preset text difference range with the object main body identifier exist in each node of the fuzzy matching model.
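The source does not name the tree structure. A BK-tree is one well-known structure that organizes strings by their pairwise edit distance and supports range queries, so it is used below purely as an assumed example; the class, its methods, and the sample identifiers are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


class BKTree:
    """Tree whose edges are labeled with the distance to the parent word."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is (word, {edge_distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, max_distance):
        """Return sorted (distance, word) pairs within max_distance."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= max_distance:
                results.append((d, word))
            # Triangle inequality: only these subtrees can hold matches.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)


tree = BKTree(edit_distance)
for name in ["hemoglobin", "hematocrit", "platelet count",
             "white blood cell count"]:
    tree.add(name)
hits = tree.search("hemoglobln", 1)
```

Pruning by the triangle inequality means a query visits only part of the tree instead of comparing against every identifier in the database.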
In some embodiments of the present application, based on the above technical solution, the body screening module includes:
a set type screening unit configured to screen the identification ontology according to the set type to obtain a candidate ontology matched with the set type;
the association identifier searching unit is configured to search candidate association identifiers with mapping relation with the candidate ontology in an association identifier database;
an association identifier selection unit configured to select a target association identifier matched with the object association identifier from the candidate association identifiers;
and the target ontology determining unit is configured to determine a candidate ontology with a mapping relation with the target association identifier as a target ontology.
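The four screening units above can be read as a short filter chain. The following sketch assumes a single association identifier (a unit string) and two small lookup tables; the table contents and names are hypothetical.

```python
# Hypothetical mapping from ontology to the set type it belongs to.
ONTOLOGY_SET_TYPE = {
    "WBC_BLOOD": "blood routine",
    "WBC_URINE": "urine routine",
}
# Hypothetical association identifier database: ontology -> known units.
ASSOCIATION_DB = {
    "WBC_BLOOD": {"10^9/L"},
    "WBC_URINE": {"/uL", "/HPF"},
}


def screen_ontologies(identified, set_type, object_association_id):
    """Pick the target ontology from the identified ontologies."""
    # 1. Keep candidates whose set type matches the recognized set type.
    candidates = [o for o in identified
                  if ONTOLOGY_SET_TYPE.get(o) == set_type]
    for ontology in candidates:
        # 2. Look up the association identifiers mapped to the candidate.
        candidate_ids = ASSOCIATION_DB.get(ontology, set())
        # 3. Select the one matching the object association identifier.
        if object_association_id in candidate_ids:
            # 4. The candidate mapped to that identifier is the target.
            return ontology
    return None


target = screen_ontologies(["WBC_BLOOD", "WBC_URINE"],
                           "blood routine", "10^9/L")
```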
In some embodiments of the present application, based on the above technical solution, the object association identifier includes a unit identifier for representing a unit of measure of the data object and a range identifier for representing a range of values of the data object, and the target association identifier includes a target unit identifier matched with the unit identifier and a target range identifier matched with the range identifier.
In some embodiments of the present application, based on the above technical solution, the target ontology determining unit includes:
a first target ontology determining subunit configured to determine the candidate ontology as a target ontology if the target unit identifier and the target range identifier map to the same candidate ontology;
and a second target ontology determining subunit configured to determine, if the target unit identifier and the target range identifier are mapped to different candidate ontologies, a candidate ontology having a mapping relationship with the target unit identifier as a target ontology.
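The rule in the two subunits above, where the unit identifier takes precedence when unit and range disagree, can be sketched as below. The lookup tables, the example unit and range strings, and the extra `agreed` flag in the return value are hypothetical additions for illustration.

```python
# Hypothetical mappings from target unit / range identifiers to ontologies.
UNIT_TO_ONTOLOGY = {"%": "RDW_CV", "fL": "RDW_SD"}
RANGE_TO_ONTOLOGY = {"11.5-14.5": "RDW_CV", "37-54": "RDW_SD"}


def choose_target(unit_id, range_id):
    """Return (target ontology, whether unit and range agreed)."""
    by_unit = UNIT_TO_ONTOLOGY.get(unit_id)
    by_range = RANGE_TO_ONTOLOGY.get(range_id)
    if by_unit is not None and by_unit == by_range:
        # Unit and range map to the same candidate ontology.
        return by_unit, True
    # They map to different candidates: the unit identifier decides.
    return by_unit, False


agreeing = choose_target("%", "11.5-14.5")
conflicting = choose_target("fL", "11.5-14.5")
```

The agreement flag is not part of the described method; it is included only so a caller could log or review the conflicting cases.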
In some embodiments of the present application, based on the above technical solution, the entity screening module further includes:
the mapping relation establishing unit is configured to form the object value, the object main body identifier, the object association identifier and the target body of the data object into structural information with a mapping relation.
According to an aspect of the embodiments of the present application, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements a data processing method as in the above technical solutions.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the data processing method as in the above technical solution via execution of the executable instructions.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the data processing method as in the above technical solution.
According to the technical scheme provided by the embodiments of the application, diversified information such as the object main body identifier and object association identifier of the data object and the set type of the data set can be obtained by performing text recognition on the image to be processed. The object main body identifier is then used as the main information for identification matching to obtain the corresponding identification bodies, and the identification bodies are screened in combination with the object association identifier and the set type to obtain a more accurate target body. Because body recognition and matching of the data object draw on several kinds of text information, more accurate data processing results can be obtained while the data processing efficiency is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.
Fig. 2 schematically shows an example of an image of a medical examination sheet.
Fig. 3 schematically shows the principle of the technical solution of the present application in application scenarios such as health assessment and insurance underwriting.
Fig. 4 schematically illustrates a flow chart of steps of a data processing method in some embodiments of the application.
Fig. 5 schematically shows a flow chart of method steps for text recognition of an image to be processed in some embodiments of the application.
Fig. 6 schematically illustrates a flowchart of method steps for identity matching based on object body identity in some embodiments of the application.
Fig. 7 schematically illustrates a flowchart of method steps for screening identity ontologies in some embodiments of the present application.
Fig. 8 schematically shows a schematic diagram of a processing procedure in an application scenario of inspection sheet data processing according to an embodiment of the present application.
Fig. 9 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present application.
Fig. 10 schematically shows a block diagram of a computer system suitable for use in implementing embodiments of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Before introducing the technical scheme provided by the application, a brief description is first made of related technologies of artificial intelligence related to the technical scheme of the application.
Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operating/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer vision (CV) is the science of how to make machines "see". More specifically, it uses cameras and computers in place of human eyes to recognize, track and measure targets, and further performs graphic processing so that the result is an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, simultaneous localization and mapping, etc., as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph techniques, and the like.
Machine learning (ML) is a multi-domain interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
With the research and advancement of artificial intelligence technology, it is being studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical treatment and smart customer service. It is believed that, as the technology develops, artificial intelligence will be applied in more fields and will become increasingly valuable. The technical scheme of the application involves artificial intelligence technologies such as computer vision, natural language processing and machine learning, and is specifically described by the following embodiments.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.
As shown in fig. 1, system architecture 100 may include a terminal device 110, a network 120, and a server 130. Terminal device 110 may include various electronic devices such as smart phones, tablet computers, notebook computers, desktop computers, and the like. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server for providing cloud computing services. Network 120 may be a communication medium of various connection types capable of providing a communication link between terminal device 110 and server 130, and may be, for example, a wired communication link or a wireless communication link.
The system architecture in embodiments of the present application may have any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 130 may be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiment of the present application may be applied to the terminal device 110, or may be applied to the server 130, or may be implemented by the terminal device 110 and the server 130 together, which is not limited in particular.
For example, the technical scheme of the application can be applied to various business scenes that automatically extract data items and map them to ontologies, so that electronic data analysis and processing can then be carried out. Here, an ontology is a formalized, standardized description of the shared concepts of a specific field; by performing ontology mapping on a data item, the specific content it refers to and its physical meaning can be determined.
Taking a medical institution or physical examination institution as an example, the test equipment and related information systems usually come in many different models and types, and the same test item usually has several common aliases, so test items on different medical examination sheets may be printed under different names. Conversely, different test items may share a name: the "white blood cell count" on a "blood routine" examination sheet and the "white blood cell count" on a "urine routine" examination sheet appear under the same name, but their actual data meanings and uses differ, so they should be mapped to different data ontologies. In addition, a medical examination sheet may fail to print the complete test item name. For example, "red blood cell volume distribution width-coefficient of variation" and "red blood cell volume distribution width-standard deviation" are two different data items, but because of the printing format or a limit on the number of characters, both may be printed as "red blood cell volume distribution width-" on the examination sheet, and the corresponding data ontology cannot be determined from the test item name alone.
To address these problems in the application scene, the application provides a data processing method that draws on several kinds of data content. When applied to information extraction from a medical examination sheet, the method can efficiently and accurately map the data of each test item on the sheet to the corresponding test item ontology, so that the extracted data can be applied to downstream business scenes such as health assessment and risk prediction. Fig. 2 schematically shows an example of an image of a medical examination sheet, which presents the blood sample test results of a user in the form of a table containing item data for a plurality of test items; based on this item data, the health condition of the user can be evaluated and predicted.
Fig. 3 schematically shows the principle of the technical solution of the present application in application scenarios such as health assessment and insurance underwriting. As shown in Fig. 3, in the application scenario of health assessment, in order to comprehensively assess the physical health condition of a user and predict the risk of serious diseases, the data processing method of the present application can be implemented on terminal devices, for example in a personal health assistant mobile phone app or in the evaluation system of a physical examination institution.
Specifically, the existing test sheet 301 may be input to the text recognition engine 302, and the text and table information on the test sheet 301 may be automatically recognized by the text recognition engine 302. Then, the test result information on the test sheet can be extracted by the test result information extraction module 303. For example, for the first data item in Fig. 2, the following item information may be extracted:
Item name: white blood cell count (WBC)
Result: 8.64
Unit: 10^9/L
Reference interval: 4-10
After test item mapping, the data item may be mapped to the test item ontology "laboratory exam-blood routine-white blood cell count" (expressed here as a medical code that incorporates the category hierarchy), yielding structured test information 304 comprising the test item ontology and the corresponding item information. Such structured test information may be input into the health assessment engine 305, which can automatically assess the health condition and disease risk of the person tested.
Similarly, when an insurer evaluates the health condition of an insured person to decide whether to underwrite a policy, the same method as in the health assessment scenario may be used to obtain structured test information comprising the test item ontology and the corresponding item information. The structured test information is then input into the underwriting prediction engine 306, which evaluates the data and gives an underwriting prediction, such as declining coverage or adding coverage.
It should be noted that health assessment and insurance underwriting are only examples of scenarios to which the technical solution of the present application applies, and the application field of the present application is not limited thereto. The technical solution can in practice be applied to various scenarios in which computer equipment processes external data.
The following describes in detail the data processing method, the data processing apparatus, the computer readable medium, the electronic device and other technical schemes provided by the present application in connection with the specific embodiments.
Fig. 4 schematically illustrates a flow chart of steps of a data processing method in some embodiments of the application. The data processing method may be executed by the terminal device, may be executed by the server, or may be executed by both the terminal device and the server, which is not particularly limited in the embodiment of the present application. As shown in fig. 4, the data processing method may mainly include the following steps S410 to S440.
Step S410: acquiring an image to be processed that presents a data set comprising at least one data object.
Step S420: performing text recognition on the image to be processed to obtain the object body identifier and the object association identifier of the data object, and the set type of the data set.
Step S430: performing identification matching in a body identification database according to the object body identifier to obtain one or more identification bodies corresponding to the object body identifier.
Step S440: screening the identification bodies according to the object association identifier and the set type to obtain a target ontology, and establishing a mapping relationship between the data object and the target ontology.
In the data processing method of the embodiment of the application, diversified information such as the object main body identification, the object association identification and the set type of the data set of the data object can be obtained by carrying out text recognition on the image to be processed, then the object main body identification is taken as main information for identification matching to obtain a corresponding identification body, and then the identification body is screened by combining the object association identification and the set type to obtain a more accurate target body. According to the method, the body recognition and the matching are carried out on the data object by utilizing various diversified text information, so that the data processing efficiency can be improved, and meanwhile, a more accurate data processing result can be obtained.
The respective method steps in the data processing method are described in detail below.
In step S410, a to-be-processed image is acquired for presenting a data set comprising at least one data object.
Taking execution of the data processing method by the terminal device as an example, the image to be processed may be an image captured directly by an image acquisition device such as a camera on the terminal device, or an image received over a network from a server or another terminal device. The image to be processed presents a data set comprising at least one data object. As can be seen from the above description of the application scenario, the image to be processed in the embodiment of the present application may be a test sheet as shown in fig. 2, where the test sheet includes 24 test items, each test item is a data object, and the item table formed by the test items together is a data set.
In step S420, text recognition is performed on the image to be processed to obtain an object body identifier of the data object, an object association identifier, and a collection type of the data collection.
The object body identification is used to represent the body content of the data object, and may be, for example, the item name "white blood cell count (WBC)" shown in fig. 2. The object association identifier is information having an association relationship with the body content of the data object, and may be, for example, the measurement unit "10^9/L" and the reference interval "4-10" associated with the test item "white blood cell count (WBC)" shown in fig. 2. The set type is used to represent classification information for the data set; for example, the test item table in fig. 2 may be classified into the "blood routine" set type.
Fig. 5 schematically shows a flow chart of method steps for text recognition of an image to be processed in some embodiments of the application. As shown in fig. 5, on the basis of the above embodiment, in step S420, text recognition is performed on the image to be processed to obtain the object body identifier of the data object, the object association identifier, and the set type of the data set, and the following steps S510 to S530 may be further included.
Step S510: text recognition is performed on the image to be processed to obtain text content of the data set, wherein the text content comprises data text fields forming data objects.
In addition to the text of each data object in the data set, the image to be processed also contains other text outside the data set. For example, fig. 2 shows a table containing data for 24 test items, together with user identity information located above the table and shipping information located below the table. To obtain the text content corresponding to the data objects in the data table, this step may use the table lines to divide the image to be processed into regions. For example, the method of performing text recognition on the image to be processed to obtain the text content of the data set may include the following steps S511 to S513.
Step S511: performing line detection on the image to be processed to obtain the table lines in the image to be processed.
Step S512: dividing the image to be processed into regions according to the table lines to obtain the data table area where the data set is located.
Step S513: performing text recognition on the data table area to obtain the text content of the data set.
Taking the test sheet shown in fig. 2 as an example, line detection on the image to be processed yields the three horizontal lines and one vertical line contained in the image. The uppermost and lowermost horizontal lines divide the image to be processed into three image areas, and the middle area is the data table area where the data set is located. Performing text recognition on the image content within the data table area yields the text content of the data set.
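The region division described above can be sketched as follows. This is a minimal illustration, assuming line detection has already produced the vertical pixel coordinate of each horizontal table line; the function name `divide_regions` and the sample coordinates are hypothetical and do not come from the application.

```python
def divide_regions(horizontal_line_ys, image_height):
    """Split an image into three bands using the outermost horizontal table lines.

    horizontal_line_ys: y coordinates (in pixels) of the detected horizontal lines.
    Returns (top, bottom) pixel ranges for the area above the table, the data
    table area itself, and the area below the table.
    """
    if len(horizontal_line_ys) < 2:
        raise ValueError("need at least two horizontal lines to bound a table")
    ys = sorted(horizontal_line_ys)
    top, bottom = ys[0], ys[-1]
    return {
        "above_table": (0, top),
        "table": (top, bottom),
        "below_table": (bottom, image_height),
    }


# Hypothetical coordinates for a sheet with three horizontal lines.
regions = divide_regions([160, 120, 900], image_height=1000)
```

Only the band labelled `table` would then be passed to text recognition; the other two bands hold the text outside the data set.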
In some alternative embodiments of the present application, the method for performing line detection on the image to be processed may be to perform image segmentation on the image to be processed based on pixel points and perform line fitting by using the segmented image. Specifically, in step S511, the line detection is performed on the image to be processed to obtain the form line in the image to be processed, and the following steps S5111 to S5113 may be further included.
Step S5111: classifying the pixels in the image to be processed based on image semantics to determine the foreground pixels where the image lines are located.
Step S5112: performing image segmentation on the image to be processed according to the foreground pixels to obtain a foreground line image.
Step S5113: performing line fitting on the foreground line image to obtain the table lines in the image to be processed.
By performing semantic recognition on the image to be processed, all pixels in the image can be classified into two categories: foreground pixels, which correspond to the table lines in the image, and background pixels, which correspond to all content other than the table lines. The embodiment of the application can pre-train a neural-network-based semantic segmentation model and then use it to classify the pixels of the image to be processed. For example, ENet, an efficient neural network model based on a deep neural network architecture for real-time semantic segmentation, may be used. It mainly comprises an encoder-decoder network structure formed by a plurality of sequentially connected bottleneck modules, and has advantages such as few parameters, fast segmentation, and high segmentation accuracy. After the foreground pixels in the image to be processed have been determined by classification, image segmentation can be performed: removing the background pixels yields a foreground line image composed entirely of foreground pixels. Finally, line fitting is performed on the foreground pixels in the foreground line image to obtain the corresponding table lines; for example, the least squares method may be used to fit a function to the position coordinates of the pixels, giving the fitting function of each table line.
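As an illustration of the least-squares fitting step, the sketch below fits a straight line y = slope * x + intercept to the coordinates of foreground pixels. It assumes the pixels belonging to one table line have already been grouped, and it only handles lines that are not vertical; the function name `fit_line` and the sample pixels are hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept.

    points: list of (x, y) pixel coordinates belonging to a single table line.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        # All x values are equal: the line is vertical, so fit x against y instead.
        raise ValueError("vertical line: swap the coordinates before fitting")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical foreground pixels of a slightly tilted horizontal table line.
pixels = [(0, 100), (10, 101), (20, 102), (30, 103)]
slope, intercept = fit_line(pixels)
```

A slope close to zero indicates a horizontal table line; a vertical line would be fitted the same way after swapping the roles of x and y.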
The data table area where the data set is located can be determined by carrying out area division on the image to be processed based on the table lines, and text recognition can be carried out on the image content in the data table area to obtain the text content of the data set. The method for text recognition of the data table area may be, for example, optical character recognition (Optical Character Recognition, OCR). OCR technology refers to a process of determining the shape of a character by detecting dark and bright patterns in an image, and then translating the character shape into computer text by a character recognition method, and the text in an image format can be converted into a text format by using OCR technology.
In an embodiment of the application, the recognized text content of the data set comprises the data text fields that make up the data objects. Taking the test sheet shown in fig. 2 as an example, each test item in the figure is a data object, and each test item includes a plurality of text fields distributed horizontally. For example, the data object corresponding to the first test item includes four data text fields, namely "white blood cell count (WBC)", "8.64", "10^9/L" and "4-10".
Step S520: classifying the data text fields according to their distribution positions on the image to be processed to determine the object body identification and the object association identification of the data object.
Each data text field is distributed in different image positions on the image to be processed, and classification of the data text fields can be realized according to the position relation between the data text fields, so that the object main body identification and the object association identification of the data object are determined.
In some alternative embodiments of the present application, the method of classifying the text field of data to obtain the relevant identification may include the following steps S521 to S524.
Step S521: an identification indication field in the text content is obtained, wherein the identification indication field comprises a main body identification indication field for indicating the main body identification of the object and an association identification indication field for indicating the association identification of the object.
Taking the inspection sheet in fig. 2 as an example, the identification indication field may be a header portion in the data table, for example, the body identification indication field is a text field "item name" in the drawing, and the association identification indication field is a text field "unit" and "reference section" in the drawing.
Step S522: determining, on the image to be processed, a body identification indication area corresponding to the body identification indication field and an association identification indication area corresponding to the association identification indication field.
The body identification indication area may be a data column in which the body identification indication field "item name" is located, and the association identification indication area may be a data column in which the association identification indication field "unit" and "reference section" are located.
Step S523: determining the region position relationship between each data text field and the body identification indication area and the association identification indication area according to the distribution position of the data text field on the image to be processed.
The distribution position of each data text field on the image to be processed can be expressed as position coordinates in the image, and the region position relation between each data text field and the corresponding region can be determined based on the coordinate value relation between the position coordinates and the region coordinates of the main body identification indication region and the association identification indication region. The region positional relationship may include both a relationship located inside the region and a relationship located outside the region.
Step S524: classifying the data text fields according to the region position relationship to determine the object body identification and the object association identification of the data object.
According to the region position relationship, the text fields can be divided into three types: data text fields located in the body identification indication area, data text fields located in the association identification indication area, and other data text fields located outside both areas. A data text field located in the body identification indication area may be determined as the object body identification of a data object, such as the item names "white blood cell count (WBC)" and "red blood cell count (RBC)" shown in fig. 2. A data text field located in the association identification indication area may be determined as an object association identification of the data object, such as the unit entries "10^9/L" and "10^12/L" and the reference interval entries "4-10" and "3.5-5.5" shown in fig. 2.
By performing the above steps S521 to S524, classification of the data text field can be achieved, so that the object body identification and the object association identification of each data object can be determined.
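Steps S521 to S524 can be sketched as a position test against column regions. The example below is a simplified illustration, assuming each recognized field is reduced to its text and a horizontal centre coordinate and each indication area to a horizontal range; the function names and all coordinates are hypothetical.

```python
def in_region(x, region):
    """Return True if horizontal coordinate x lies inside the (x_min, x_max) region."""
    x_min, x_max = region
    return x_min <= x <= x_max


def classify_row(fields, body_region, association_regions):
    """Classify the text fields of one table row by the column region they fall in.

    fields: list of (text, x_center) pairs for one data object.
    body_region: (x_min, x_max) of the body identification indication area.
    association_regions: dict mapping a name such as "unit" to its (x_min, x_max).
    """
    result = {"body": None, "association": {}, "other": []}
    for text, x in fields:
        if in_region(x, body_region):
            result["body"] = text
            continue
        for name, region in association_regions.items():
            if in_region(x, region):
                result["association"][name] = text
                break
        else:
            # The field lies outside both kinds of indication area.
            result["other"].append(text)
    return result


# Hypothetical layout for the first row of the test sheet in fig. 2.
row = [("white blood cell count (WBC)", 80), ("8.64", 300), ("10^9/L", 450), ("4-10", 600)]
classified = classify_row(
    row,
    body_region=(0, 200),
    association_regions={"unit": (400, 500), "reference_interval": (550, 700)},
)
```

In this layout the result value "8.64" falls outside both indication areas and is therefore kept in the `other` group.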
Step S530: the text content is classified to obtain a collection type of the data collection.
By classifying the text content of a data set, the set type of the data set can be obtained. For example, the text content of the data set may be extracted and mapped by using a pre-trained text classification model to obtain a corresponding classification result. In some alternative embodiments of the present application, the method of classifying text contents may further include the following steps S531 to S533.
Step S531: performing feature extraction on the text content to obtain content features of the text content.
Feature extraction may be performed by vectorizing the text content with an embedding matrix, which yields content features in vector form that a neural network can compute on.
Step S532: the content features are mapped to predict classification probabilities for classifying the text content into a plurality of types of labels, respectively.
For different application scenarios, a plurality of different type labels can be preset. For example, for medical test sheets, type labels such as blood routine, urine routine, and liver function examination may be provided. After the content features are mapped, the classification probability that the text content belongs to each type label can be calculated; a higher classification probability indicates greater confidence in the predicted classification.
Step S533: selecting a target label from the plurality of type labels according to the classification probabilities, and determining the target label as the set type of the data set.
Based on the classification probabilities obtained in the previous step, one or more target labels can be selected. For example, the single type label with the largest classification probability may be selected as the target label, or one or more type labels whose classification probability exceeds a certain probability threshold may be selected as target labels. The set type of the data set is determined from the selected target label; for example, the data set in the test sheet shown in fig. 2 should be classified into the set type corresponding to "blood routine".
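The two selection strategies can be sketched as follows. The application does not specify how the mapping produces probabilities; this sketch assumes a softmax over per-label scores, and the scores, the threshold, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1 (assumed mapping)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the peak for stability
    total = sum(exps)
    return [e / total for e in exps]


def select_labels(labels, probabilities, threshold=None):
    """Pick the most probable label, or every label above the threshold if one is given."""
    if threshold is None:
        best = max(range(len(labels)), key=lambda i: probabilities[i])
        return [labels[best]]
    return [label for label, p in zip(labels, probabilities) if p > threshold]


labels = ["blood routine", "urine routine", "liver function examination"]
probabilities = softmax([2.0, 0.5, 0.1])  # hypothetical classifier scores

top_label = select_labels(labels, probabilities)
above_threshold = select_labels(labels, probabilities, threshold=0.15)
```

Selecting by threshold can return several labels or none at all, so a caller would need to decide how to handle both cases.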
In step S430, identity matching is performed in the body identity database according to the object body identity to obtain one or more identity bodies corresponding to the object body identity.
The subject identification database is a database that stores the mapping relationship between identification bodies and subject identifications. For example, in the health assessment scenario, the test item body of one test item may be expressed under several different names (i.e. aliases), and the subject identification database can store the mapping relationship between the test item body and its aliases. Based on this mapping relationship, the body corresponding to an alias can be queried using the alias, and the aliases corresponding to a body can be queried using the body.
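A two-way lookup of this kind can be sketched with a pair of in-memory indexes. This is an illustration only: the class name `SubjectIdDatabase` is hypothetical, and the urine-routine body code is an assumed example modelled on the blood-routine code given earlier.

```python
from collections import defaultdict


class SubjectIdDatabase:
    """In-memory mapping between test item bodies and their aliases."""

    def __init__(self):
        self._bodies_by_alias = defaultdict(set)
        self._aliases_by_body = defaultdict(set)

    def add(self, body, alias):
        """Register one alias for one body in both directions."""
        self._bodies_by_alias[alias].add(body)
        self._aliases_by_body[body].add(alias)

    def bodies_for(self, alias):
        """Return every body that the alias may refer to (possibly several)."""
        return set(self._bodies_by_alias.get(alias, ()))

    def aliases_for(self, body):
        """Return every alias stored for the body."""
        return set(self._aliases_by_body.get(body, ()))


BLOOD_WBC = "laboratory exam-blood routine-white blood cell count"
URINE_WBC = "laboratory exam-urine routine-white blood cell count"  # assumed code

db = SubjectIdDatabase()
db.add(BLOOD_WBC, "white blood cell count")
db.add(BLOOD_WBC, "WBC")
db.add(URINE_WBC, "white blood cell count")
```

Because one alias can map to several bodies, the lookup returns a set, which is what makes the later screening step necessary.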
FIG. 6 schematically illustrates a flowchart of method steps for identity matching based on object body identity in some embodiments of the application. As shown in fig. 6, on the basis of the above embodiment, performing identity matching in the body identity database according to the object body identity in step S430 to obtain one or more identity bodies corresponding to the object body identity may include the following steps S610 to S640.
Step S610: performing matching detection in the subject identification database according to the object body identification to determine whether the database contains an exact match identification identical to the object body identification.
Using the object body identification as a keyword, the subject identification database is searched and a text consistency check is performed to determine whether the database contains an exact match identification whose text is identical to that of the object body identification.
Step S620: if an exact match identification is detected in the subject identification database, determining the identification body that has a mapping relationship with the exact match identification as the identification body corresponding to the object body identification.
According to the match detection result of step S610, if an exact match identification is detected in the subject identification database, an identification body having a mapping relationship with the exact match identification in the subject identification database may be determined as an identification body corresponding to the subject identification. The number of identity bodies determined in this step may be one or more.
Step S630: if no exact match identification is detected in the subject identification database, performing matching detection in the subject identification database according to the object body identification to determine whether the database contains a fuzzy match identification whose text difference from the object body identification is within a preset text difference range.
If no exact match identification is detected, the subject identification database contains no subject identification whose text is identical to that of the object body identification. In this case, the matching precision can be relaxed appropriately so as to search the database for subject identifications that are highly similar to the object body identification. Specifically, a subject identification whose text difference is within the preset text difference range may be determined as the corresponding fuzzy match identification.
In some optional embodiments of the present application, a fuzzy matching model with a tree structure may be built according to the text differences between the identifications in the subject identification database, and the fuzzy matching model is then traversed to determine whether any of its nodes holds a fuzzy match identification within the preset text difference range of the object body identification. For example, a fuzzy matching model based on a Burkhard-Keller tree (BK-tree) may be built from the subject identification database. A BK-tree is a tree structure whose nodes are laid out according to the edit distance (Levenshtein distance) between the subject identifications in the database, where the edit distance is the minimum number of editing steps required to convert one character string into another. With the BK-tree-based fuzzy matching model, subject identifications whose edit distance from the object body identification is smaller than a preset distance threshold can be found quickly and used as fuzzy match identifications.
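A BK-tree lookup of this kind can be sketched as below. This is a minimal illustration rather than the implementation of the application: the class and function names are hypothetical, the stored aliases are sample data, and the search treats `max_distance` as an inclusive bound, which is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is [word, children], where children maps an edit distance to a child."""

    def __init__(self):
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [word, {}]
                return
            node = child

    def search(self, query, max_distance):
        """Return sorted (distance, word) pairs within max_distance of the query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = edit_distance(query, word)
            if d <= max_distance:
                results.append((d, word))
            # Triangle inequality: only children keyed within
            # [d - max_distance, d + max_distance] can hold a match.
            for key, child in children.items():
                if d - max_distance <= key <= d + max_distance:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for alias in ["white blood cell count", "red blood cell count", "platelet count", "hemoglobin"]:
    tree.add(alias)

matches = tree.search("white blod cell count", max_distance=2)
```

The pruning step is what makes the tree faster than comparing the query with every stored identification: subtrees whose key falls outside the allowed band are never visited.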
Step S640: if the fuzzy matching identification is detected in the main body identification database, the identification body with the mapping relation with the fuzzy matching identification is determined to be the identification body corresponding to the object main body identification.
According to the matching detection result of fuzzy matching in step S630, if a corresponding fuzzy matching identifier is found, the identifier body having a mapping relationship with the fuzzy matching identifier in the body identifier database may be determined as the identifier body corresponding to the subject body identifier. The number of identity bodies determined in this step may likewise be one or more.
Through executing the steps S610 to S640, the identification body can be acquired in the main body identification database through two stages of accurate matching and fuzzy matching, so that the acquisition efficiency of the identification body can be improved on one hand, and the problem of failure in body matching caused by incomplete collection of main body identifications in the main body identification database can be avoided on the other hand.
In step S440, the identification ontology is screened according to the object association identification and the set type to obtain a target ontology, and a mapping relationship between the data object and the target ontology is established.
If the identification body corresponding to the subject identification obtained by the matching detection in the body identification database is only one, the identification body can be directly taken as the target body. If the identification body corresponding to the object body identification obtained by matching detection in the body identification database comprises a plurality of identification bodies (namely two or more than two identification bodies), the identification bodies can be screened by utilizing the object association identification and the collection type so as to obtain a more accurate target body.
Fig. 7 schematically illustrates a flowchart of method steps for screening identity ontologies in some embodiments of the present application. As shown in fig. 7, based on the above embodiment, in step S440, the filtering of the identification body according to the object association identification and the collection type to obtain the target body may further include the following steps S710 to S740.
Step S710: screening the identification bodies according to the set type to obtain candidate bodies that match the set type.
For each identification body obtained by matching detection in the subject identification database, the corresponding set type can be determined, and identification bodies of the same type are generally aggregated in the same data set. Screening the identification bodies according to the set type of the data set therefore yields candidate bodies whose type matches.
Step S720: searching an association identification database for candidate association identifications that have a mapping relationship with the candidate bodies.
The associated identification database is a database for storing the mapping relation between the identification ontology and the associated identifications, and based on the mapping relation, the associated identifications corresponding to the identification ontology can be searched by utilizing the identification ontology, and meanwhile, the associated identifications can be used for searching the identification ontology corresponding to the associated identifications.
Step S730: selecting, from the candidate association identifications, a target association identification that matches the object association identification.
The object association identifier may include a unit identifier (e.g., a data column corresponding to a "unit" field in fig. 2) for indicating a unit of measure of the data object, and a range identifier (e.g., a data column corresponding to a "reference section" field in fig. 2) for indicating a range of values of the data object. Accordingly, the target association identifier includes a target unit identifier matched with the unit identifier and a target range identifier matched with the range identifier. The matching detection is carried out by utilizing the two different types of object association identifiers, so that the detection precision can be improved, and a better matching detection effect can be obtained.
Step S740: determining the candidate body that has a mapping relationship with the target association identification as the target body.
If the target unit identifier and the target range identifier are mapped to the same candidate ontology, determining the candidate ontology as a target ontology; and if the target unit identifier and the target range identifier are mapped to different candidate ontologies, determining the candidate ontologies with the mapping relation with the target unit identifier as the target ontologies.
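The rule above can be sketched as follows. This is an illustration under stated assumptions: the association database is modelled as a dictionary, the function name is hypothetical, ties are broken by taking the first candidate, and the urine-routine unit and range are made-up placeholders rather than clinical values.

```python
def select_target_body(candidates, unit, reference_range, association_db):
    """Pick a target body from candidates using the unit and range identifiers.

    association_db maps each body to its stored {"unit": ..., "range": ...}.
    Returns None when no candidate matches the unit.
    """
    unit_hits = [c for c in candidates if association_db[c]["unit"] == unit]
    range_hits = [c for c in candidates if association_db[c]["range"] == reference_range]
    both = [c for c in unit_hits if c in range_hits]
    if both:
        return both[0]       # unit and range map to the same candidate
    if unit_hits:
        return unit_hits[0]  # they disagree: the unit match takes precedence
    return None


BLOOD_WBC = "laboratory exam-blood routine-white blood cell count"
URINE_WBC = "laboratory exam-urine routine-white blood cell count"  # assumed code

association_db = {
    BLOOD_WBC: {"unit": "10^9/L", "range": "4-10"},
    URINE_WBC: {"unit": "/uL", "range": "0-25"},  # placeholder values
}
candidates = [BLOOD_WBC, URINE_WBC]

same_body = select_target_body(candidates, "10^9/L", "4-10", association_db)
unit_wins = select_target_body(candidates, "10^9/L", "0-25", association_db)
no_match = select_target_body(candidates, "mg/L", "4-10", association_db)
```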
The above steps S710 to S740 are performed to obtain a target ontology corresponding to the data object, and on this basis, a mapping relationship between the data object and the target ontology may be established, and specifically, the object value, the object body identifier, the object association identifier, and the target ontology of the data object may be formed into structural information having a mapping relationship. The object value is a specific storage value of the data object in the database, for example, may be a data column corresponding to the "result" field shown in fig. 2.
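The resulting structured information could be represented as a simple record. The sketch below uses the first data item of fig. 2 and the body code quoted earlier; the class name `StructuredTestItem` and its field names are hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class StructuredTestItem:
    """One data object together with the target body it has been mapped to."""
    target_body: str          # target ontology of the data object
    body_identifier: str      # object body identification as printed
    value: str                # object value from the "result" column
    unit: str                 # unit identifier
    reference_interval: str   # range identifier


item = StructuredTestItem(
    target_body="laboratory exam-blood routine-white blood cell count",
    body_identifier="white blood cell count (WBC)",
    value="8.64",
    unit="10^9/L",
    reference_interval="4-10",
)
record = asdict(item)  # plain dict, convenient for storage or downstream engines
```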
Fig. 8 schematically shows a schematic diagram of a processing procedure in an application scenario of inspection sheet data processing according to an embodiment of the present application. As shown in fig. 8, the method for data processing of the inspection sheet in the application scenario may include the following steps.
Step S801: Extract the name of a test item on a medical test sheet, the category of the test sheet table to which the test item belongs, and the unit and reference range of the test item.
Step S802: Search the alias library by the name of the test item to match test item aliases, and record the test item ontologies corresponding to the successfully matched aliases. This record is referred to below as the test item exact matching result.
If the exact matching result contains exactly 1 test item ontology, that ontology is output as the test item ontology to which the test item of the medical test sheet is mapped.
If the exact matching result contains 2 or more test item ontologies, step S803 is executed to help determine the ontology.
If the exact matching result contains 0 test item ontologies, step S805 is executed to help determine the ontology.
Step S803: Filter the test item ontologies in the matching result by the test sheet table category.
If 1 test item ontology remains after filtering, that ontology is output as the test item ontology to which the test item of the medical test sheet is mapped.
If 0 test item ontologies remain after filtering, the ontology mapping of the test item fails.
If 2 or more test item ontologies remain after filtering, step S804 is executed to help determine the ontology.
Step S804: Perform matching detection in the unit and reference range knowledge base according to the unit and reference range of the test item, so as to select one test item ontology as the ontology to which the test item of the test sheet is mapped.
In this step, an ontology whose unit and reference range both match is selected preferentially. If no ontology matches both, an ontology whose unit matches may be selected. If no ontology matches the unit either, an ontology whose reference range matches may be selected.
Step S805: If the exact matching result contains 0 test item ontologies, perform fuzzy matching in the alias library within a defined threshold, and record the test item ontologies corresponding to the successfully matched test item aliases. This record is referred to below as the test item fuzzy matching result.
If the fuzzy matching result contains 0 test item ontologies, the ontology mapping of the test item of the medical test sheet fails.
If the fuzzy matching result contains 1 or more test item ontologies, the exact matching result is replaced with the fuzzy matching result, and execution returns to step S803.
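The flow of steps S802 to S805 can be sketched as below. This is only an illustration under stated assumptions: the three toy dictionaries stand in for the alias library, the table category information, and the unit and reference range knowledge base; all names and values in them are hypothetical; and the standard-library `difflib.SequenceMatcher` similarity ratio with a threshold of 0.8 is swapped in for the fuzzy matching, which the text does not specify.

```python
from difflib import SequenceMatcher

# Hypothetical toy knowledge bases; real libraries would be far larger.
ALIAS_LIBRARY = {                  # alias -> test item ontologies
    "glucose": ["serum_glucose", "urine_glucose"],
    "wbc": ["white_blood_cell_count"],
}
ONTOLOGY_CATEGORY = {              # ontology -> test sheet table category
    "serum_glucose": "biochemistry",
    "urine_glucose": "urinalysis",
    "white_blood_cell_count": "blood routine",
}
UNIT_RANGE_KB = {                  # ontology -> (unit, reference range)
    "serum_glucose": ("mmol/L", "3.9-6.1"),
    "urine_glucose": ("mmol/L", "0-0.8"),
    "white_blood_cell_count": ("10^9/L", "4-10"),
}


def fuzzy_lookup(name, threshold=0.8):
    """S805: ontologies of aliases whose similarity to `name` reaches the threshold."""
    hits = []
    for alias, ontologies in ALIAS_LIBRARY.items():
        if SequenceMatcher(None, name, alias).ratio() >= threshold:
            hits.extend(o for o in ontologies if o not in hits)
    return hits


def select_by_unit_and_range(candidates, unit, ref_range):
    """S804: prefer a full match, then a unit match, then a reference range match."""
    checks = (
        lambda u, r: u == unit and r == ref_range,
        lambda u, r: u == unit,
        lambda u, r: r == ref_range,
    )
    for check in checks:
        for ontology in candidates:
            if check(*UNIT_RANGE_KB[ontology]):
                return ontology
    return None


def map_test_item(name, category, unit, ref_range):
    """Return the mapped test item ontology, or None if the mapping fails."""
    candidates = ALIAS_LIBRARY.get(name, [])           # S802: exact alias match
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        candidates = fuzzy_lookup(name)                # S805: fuzzy alias match
        if not candidates:
            return None
    # S803: keep only ontologies belonging to the table category of the sheet
    candidates = [o for o in candidates if ONTOLOGY_CATEGORY[o] == category]
    if len(candidates) <= 1:
        return candidates[0] if candidates else None
    return select_by_unit_and_range(candidates, unit, ref_range)   # S804
```

For example, under these toy data the misspelled name "glucos" has no exact alias, is fuzzily matched to "glucose", and is then narrowed to a single ontology by the table category.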
In this application scenario, the information systems of medical and health institutions are independent of one another, so the same test item may have different aliases on different test sheets. With this technical solution, the test item ontology can be found accurately, which assists interpretation of the test sheet. The medical knowledge base includes an alias library and a unit and reference range knowledge base. Additional information from the test sheet page is introduced to assist the ontology mapping of test items, which addresses the problem that the true test item ontology cannot be determined among several easily confused ontologies by relying on the test item name alone. Meanwhile, similar test item ontologies are filtered using the test sheet table category, which takes advantage of the multi-level nature of test items and of the fact that test item ontologies of the same category are generally gathered in one table, thereby improving the accuracy of the ontology mapping. With this technical solution, the test item information extracted from the test sheet is stored as structured information after ontology mapping, so that health information can be stored more completely and the need for medical staff to perform follow-up processing because of ambiguous test item names is avoided.
It should be noted that although the steps of the methods of the present application are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
The following describes embodiments of the apparatus of the present application that may be used to perform the data processing methods of the above-described embodiments of the present application. Fig. 9 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 9, the data processing apparatus 900 may mainly include:
an image acquisition module 910 configured to acquire an image to be processed for presenting a data set, the data set comprising at least one data object;
a text recognition module 920 configured to perform text recognition on the image to be processed to obtain an object main body identifier and an object association identifier of the data object, and a set type of the data set;
an identifier matching module 930 configured to perform identifier matching in a main body identifier database according to the object main body identifier to obtain one or more identifier ontologies corresponding to the object main body identifier;
an ontology filtering module 940 configured to filter the identifier ontologies according to the object association identifier and the set type to obtain a target ontology, and to establish a mapping relationship between the data object and the target ontology.
In some embodiments of the present application, based on the above embodiments, the text recognition module includes:
a set text recognition unit configured to perform text recognition on the image to be processed to obtain text content of the data set, wherein the text content comprises data text fields forming the data object;
a text field classification unit configured to classify the data text fields according to their distribution positions on the image to be processed, so as to determine the object main body identifier and the object association identifier of the data object;
and a text content classification unit configured to classify the text content to obtain the set type of the data set.
In some embodiments of the present application, based on the above embodiments, the set text recognition unit includes:
a line detection subunit configured to perform line detection on the image to be processed to obtain table lines in the image to be processed;
a region dividing subunit configured to divide the image to be processed into regions according to the table lines, so as to obtain the data table region where the data set is located;
and a text recognition subunit configured to perform text recognition on the data table region to obtain the text content of the data set.
In some embodiments of the present application, based on the above embodiments, the line detection subunit includes:
the pixel classification subunit is configured to classify the pixel points in the image to be identified based on the image semantics so as to determine the foreground pixel points where the image lines are located;
the image segmentation subunit is configured to carry out image segmentation on the image to be identified according to the foreground pixel points so as to obtain a foreground line image;
and the line fitting subunit is configured to perform line fitting on the foreground line image to obtain a table line in the image to be processed.
In some embodiments of the present application, based on the above embodiments, the text field classification unit includes:
an indication field obtaining subunit configured to obtain identifier indication fields in the text content, the identifier indication fields including a main body identifier indication field for indicating the object main body identifier and an association identifier indication field for indicating the object association identifier;
an indication area determining subunit configured to determine, on the image to be processed, a main body identifier indication area corresponding to the main body identifier indication field and an association identifier indication area corresponding to the association identifier indication field;
a position relation determining subunit configured to determine, according to the distribution position of the data text field on the image to be processed, the positional relationship between the data text field and the main body identifier indication area and between the data text field and the association identifier indication area;
and a text field classifying subunit configured to classify the data text field according to the positional relationship, so as to determine the object main body identifier and the object association identifier of the data object.
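One way to picture the position-based classification is the sketch below. It assumes, purely for illustration, that each recognized field comes with an axis-aligned bounding box, that the indication fields are table column headers, and that a data text field belongs to the header lying above it with the largest horizontal overlap. The `Box` type, the role names, and the overlap rule are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float   # image coordinates: y grows downward


def horizontal_overlap(a: Box, b: Box) -> float:
    """Length of the shared horizontal extent of two boxes (0 if disjoint)."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def classify_fields(headers: dict, fields: list) -> list:
    """Assign each data text field the role of the header column it sits under.

    headers maps a role (e.g. "main_body", "unit", "range") to the Box of its
    indication field; the result is a list of (text, role or None) pairs.
    """
    result = []
    for field in fields:
        best_role, best_overlap = None, 0.0
        for role, header in headers.items():
            if field.y0 < header.y1:        # field must lie below the header
                continue
            overlap = horizontal_overlap(field, header)
            if overlap > best_overlap:
                best_role, best_overlap = role, overlap
        result.append((field.text, best_role))
    return result
```

Here "unit" and "range" would both count as association identifiers, while "main_body" marks the object main body identifier; a field under no header is left unclassified.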
In some embodiments of the present application, based on the above embodiments, the text content classification unit includes:
a feature extraction subunit configured to perform feature extraction on the text content to obtain content features of the text content;
a feature mapping subunit configured to perform a mapping process on the content features to predict classification probabilities of classifying the text content into a plurality of type tags, respectively;
and the label selecting subunit is configured to select a target label from a plurality of type labels according to the classification probability and determine the target label as the set type of the data set.
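The mapping and label selection steps can be illustrated with a small sketch. It assumes that feature extraction has already produced one raw score per type label, and it uses a softmax to turn the scores into classification probabilities; the softmax, the function names, and the example labels are assumptions, since the source does not name a specific mapping.

```python
import math


def softmax(scores):
    """Map raw scores to probabilities that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_set_type(scores, labels):
    """Return the label with the highest classification probability."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical scores for three hypothetical table categories.
label, prob = select_set_type(
    [2.0, 0.5, -1.0], ["blood routine", "urinalysis", "biochemistry"]
)
```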
In some embodiments of the present application, based on the above embodiments, the identifier matching module includes:
an exact matching unit configured to perform matching detection in the main body identifier database according to the object main body identifier, to determine whether an exact matching identifier identical to the object main body identifier exists in the main body identifier database;
a first ontology determining unit configured to, if the exact matching identifier is detected in the main body identifier database, determine the identifier ontology having a mapping relationship with the exact matching identifier as the identifier ontology corresponding to the object main body identifier;
a fuzzy matching unit configured to, if no exact matching identifier is detected in the main body identifier database, perform matching detection in the main body identifier database according to the object main body identifier, to determine whether a fuzzy matching identifier whose text difference from the object main body identifier is within a preset range exists in the main body identifier database;
and a second ontology determining unit configured to, if the fuzzy matching identifier is detected in the main body identifier database, determine the identifier ontology having a mapping relationship with the fuzzy matching identifier as the identifier ontology corresponding to the object main body identifier.
In some embodiments of the present application, based on the above embodiments, the fuzzy matching unit includes:
a fuzzy matching model establishing subunit configured to establish a fuzzy matching model with a tree structure according to the text difference degree of each identifier in the main body identifier database;
and a fuzzy matching model traversing subunit configured to traverse the fuzzy matching model to determine whether fuzzy matching identifiers within a preset text difference range with the object main body identifier exist in each node of the fuzzy matching model.
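A tree-structured fuzzy matching model of this kind could, for instance, be a BK-tree keyed on edit distance. The source does not name a specific structure or distance, so the BK-tree, the Levenshtein distance, and every name below are assumptions chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, {distance_to_child: child_node})."""

    def __init__(self):
        self.root = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return                      # identifier already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query: str, max_dist: int) -> list:
        """Return sorted (distance, word) pairs with distance <= max_dist."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = edit_distance(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: only subtrees keyed within
            # [d - max_dist, d + max_dist] can contain a match.
            for key, child in children.items():
                if d - max_dist <= key <= d + max_dist:
                    stack.append(child)
        return sorted(results)
```

The pruning step is the reason for using a tree: most identifiers in the database are never compared with the query, because their subtree is excluded by the distance bound.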
In some embodiments of the present application, based on the above embodiments, the ontology filtering module includes:
the collection type screening unit is configured to screen the identification ontology according to the collection type to obtain a candidate ontology matched with the collection type;
the association identifier searching unit is configured to search candidate association identifiers with mapping relation with the candidate ontology in an association identifier database;
an association identifier selection unit configured to select a target association identifier matched with the object association identifier from the candidate association identifiers;
and a target ontology determining unit configured to determine a candidate ontology having a mapping relationship with the target association identifier as a target ontology.
In some embodiments of the present application, based on the above embodiments, the object association identifier includes a unit identifier for representing a unit of measure of the data object and a range identifier for representing a range of values of the data object, and the target association identifier includes a target unit identifier that matches the unit identifier and a target range identifier that matches the range identifier.
In some embodiments of the present application, based on the above embodiments, the target ontology determining unit includes:
a first target ontology determining subunit configured to determine a candidate ontology as the target ontology if the target unit identifier and the target range identifier are mapped to the same candidate ontology;
and a second target ontology determining subunit configured to determine, as the target ontology, a candidate ontology having a mapping relationship with the target unit identifier if the target unit identifier and the target range identifier are mapped to different candidate ontologies.
In some embodiments of the present application, based on the above embodiments, the ontology filtering module further includes:
the mapping relation establishing unit is configured to form the object value, the object main body identifier, the object association identifier and the target body of the data object into structural information with a mapping relation.
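As a rough picture of the resulting structured information, the record below groups the four pieces named above. The field names and example values are hypothetical, and the association identifier is split into a unit and a range only because that is how it is described for this embodiment.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StructuredRecord:
    object_value: str          # stored value of the data object, e.g. a result
    main_body_identifier: str  # name as printed on the sheet
    unit_identifier: str       # association identifier: unit of measure
    range_identifier: str      # association identifier: value range
    target_ontology: str       # ontology the data object is mapped to


# Hypothetical example row.
record = StructuredRecord("5.2", "GLU", "mmol/L", "3.9-6.1", "serum_glucose")
row = asdict(record)           # plain dict, ready to store or serialize
```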
Specific details of the data processing apparatus provided in the embodiments of the present application have been described in the corresponding method embodiments and are not repeated here.
Fig. 10 schematically shows a block diagram of a computer system of an electronic device for implementing an embodiment of the application.
It should be noted that, the computer system 1000 of the electronic device shown in fig. 10 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 10, the computer system 1000 includes a central processing unit (CPU) 1001, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage section 1008 into a random access memory (RAM) 1003. The RAM 1003 also stores various programs and data necessary for system operation. The CPU 1001, the ROM 1002, and the RAM 1003 are connected to each other via a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a cathode ray tube (CRT), a liquid crystal display (LCD), and a speaker; a storage section 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a local area network (LAN) card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as needed, so that a computer program read from it can be installed into the storage section 1008 as needed.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program carried on a computer readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 1009, and/or installed from the removable medium 1011. When executed by the CPU 1001, the computer program performs the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions that cause a computing device (which may be a personal computer, a server, a touch terminal, a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (14)

1. A method of data processing, comprising:
acquiring an image to be processed for displaying a data set, wherein the data set comprises at least one data object;
performing text recognition on the image to be processed to obtain an object main body identifier and an object association identifier of the data object, and a set type of the data set; the object main body identifier is used for representing main body content of the data object, and the object association identifier is information having an association relation with the main body content of the data object;
performing identification matching in a main body identification database according to the object main body identification to obtain one or more identification bodies corresponding to the object main body identification;
screening the identification ontology according to the set type to obtain a candidate ontology matched with the set type;
searching a candidate association identifier with a mapping relation with the candidate body in an association identifier database;
selecting a target associated identifier matched with the object associated identifier from the candidate associated identifiers;
and determining the candidate ontology with the mapping relation with the target association identifier as a target ontology, and establishing the mapping relation between the data object and the target ontology.
2. The method according to claim 1, wherein the performing text recognition on the image to be processed to obtain the object main body identifier and the object association identifier of the data object, and the set type of the data set, comprises:
performing text recognition on the image to be processed to obtain text content of the data set, wherein the text content comprises data text fields for forming the data object;
classifying the data text field according to the distribution position of the data text field on the image to be processed to determine an object main body identifier and an object association identifier of the data object;
and classifying the text content to obtain the collection type of the data collection.
3. The data processing method according to claim 2, wherein the text recognition of the image to be processed to obtain text content of the data set includes:
performing line detection on the image to be processed to obtain form lines in the image to be processed;
dividing the region of the image to be processed according to the form lines to obtain a data form region where the data set is located;
and carrying out text recognition on the data table area to obtain text contents of the data set.
4. A data processing method according to claim 3, wherein the performing line detection on the image to be processed to obtain a table line in the image to be processed comprises:
classifying the pixel points in the image to be identified based on the image semantics to determine the foreground pixel points where the image lines are located;
image segmentation is carried out on the image to be identified according to the foreground pixel points so as to obtain a foreground line image;
and performing line fitting on the foreground line image to obtain a table line in the image to be processed.
5. The method according to claim 2, wherein the classifying the data text field according to the distribution position of the data text field on the image to be processed to determine the object main body identifier and the object association identifier of the data object comprises:
acquiring an identification indication field in the text content, wherein the identification indication field comprises a main body identification indication field for indicating the main body identification of the object and an association identification indication field for indicating the association identification of the object;
determining a main body identification indication area corresponding to the main body identification indication field and an association identification indication area corresponding to the association identification indication field on the image to be processed;
determining the region position relation between the data text field and the main body identification indication region and between the data text field and the association identification indication region according to the distribution position of the data text field on the image to be processed;
and classifying the data text field according to the region position relation to determine an object main body identifier and an object association identifier of the data object.
6. The data processing method according to claim 2, wherein said classifying the text content to obtain a collection type of the data collection comprises:
extracting characteristics of the text content to obtain content characteristics of the text content;
mapping the content features to predict classification probabilities of classifying the text content into a plurality of types of labels, respectively;
and selecting a target label from the plurality of type labels according to the classification probability, and determining the target label as the set type of the data set.
7. The data processing method according to claim 1, wherein the performing identification matching in a main body identification database according to the object main body identification to obtain one or more identification bodies corresponding to the object main body identification comprises:
performing matching detection in a main body identification database according to the object main body identification to determine whether an accurate matching identification identical to the object main body identification exists in the main body identification database;
if the accurate matching identification is detected in the main body identification database, determining an identification body with a mapping relation with the accurate matching identification as an identification body corresponding to the object main body identification;
if the accurate matching identification is not detected in the main body identification database, carrying out matching detection in the main body identification database according to the object main body identification so as to determine whether a fuzzy matching identification within a preset text difference range with the object main body identification exists in the main body identification database;
and if the fuzzy matching identification is detected in the main body identification database, determining an identification body with a mapping relation with the fuzzy matching identification as an identification body corresponding to the object main body identification.
8. The method according to claim 7, wherein the performing a matching detection in the subject identification database according to the subject identification to determine whether there is a fuzzy matching identification with the subject identification within a preset text difference range in the subject identification database includes:
establishing a fuzzy matching model with a tree structure according to the text difference degree of each identifier in the main body identifier database;
traversing the fuzzy matching model to determine whether fuzzy matching identifiers within a preset text difference range with the object main body identifier exist in each node of the fuzzy matching model.
9. The data processing method according to claim 1, wherein the object association identifier includes a unit identifier for representing a unit of measure of the data object and a range identifier for representing a range of values of the data object, and the target association identifier includes a target unit identifier that matches the unit identifier and a target range identifier that matches the range identifier.
10. The data processing method according to claim 9, wherein the determining the candidate ontology having the mapping relation with the target association identifier as the target ontology includes:
if the target unit identifier and the target range identifier are mapped to the same candidate ontology, determining the candidate ontology as a target ontology;
and if the target unit identifier and the target range identifier are mapped to different candidate ontologies, determining the candidate ontologies with the mapping relation with the target unit identifier as target ontologies.
11. The method according to claim 1, wherein the establishing a mapping relationship between the data object and the target ontology includes:
and forming the object value, the object main body identifier, the object association identifier and the target body of the data object into structural information with a mapping relation.
12. A data processing apparatus, comprising:
an image acquisition module configured to acquire an image to be processed for presenting a data set, the data set comprising at least one data object;
the text recognition module is configured to perform text recognition on the image to be processed to obtain an object main body identifier, an object association identifier and a set type of the data set of the data object; the object main body identifier is used for representing main body content of the data object, and the object association identifier is information with association relation with the main body content of the data object;
the identification matching module is configured to perform identification matching in a main body identification database according to the object main body identification so as to obtain one or more identification bodies corresponding to the object main body identification;
the ontology screening module is configured to screen the identification ontology according to the set type to obtain a candidate ontology matched with the set type; search an association identifier database for a candidate association identifier having a mapping relation with the candidate ontology; select a target association identifier matched with the object association identifier from the candidate association identifiers; and determine the candidate ontology having the mapping relation with the target association identifier as a target ontology, and establish the mapping relation between the data object and the target ontology.
13. A computer readable medium having stored thereon a computer program which, when executed by a processor, implements the data processing method of any of claims 1 to 11.
14. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the data processing method of any one of claims 1 to 11 via execution of the executable instructions.
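For illustration only, the matching flow recited for the apparatus of claim 12 and the structured information of claim 11 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the class names, the function names, the in-memory dictionaries standing in for the main body identifier database and the association identifier database, the sample laboratory-style entries, and the use of exact string equality as the "matching" criterion are all hypothetical and do not come from the claims. Text recognition is assumed to have already produced the identifiers and the set type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ontology:
    ontology_id: str  # hypothetical unique key of an ontology
    set_type: str     # set type the ontology belongs to


@dataclass(frozen=True)
class DataObject:
    main_identifier: str         # object main body identifier from text recognition
    association_identifier: str  # object association identifier (here: a unit)
    value: str                   # object value


# Hypothetical stand-in for the main body identifier database:
# one recognized identifier may correspond to several ontologies.
MAIN_IDENTIFIER_DB = {
    "GLU": [
        Ontology("glucose-serum", "blood biochemistry"),
        Ontology("glucose-plasma", "blood biochemistry"),
        Ontology("glucose-urine", "urinalysis"),
    ],
}

# Hypothetical stand-in for the association identifier database:
# ontology id -> association identifiers mapped to that ontology.
ASSOCIATION_IDENTIFIER_DB = {
    "glucose-serum": {"mmol/L"},
    "glucose-plasma": {"mg/dL"},
    "glucose-urine": {"mmol/L"},
}


def find_target_ontology(obj, set_type,
                         main_db=MAIN_IDENTIFIER_DB,
                         assoc_db=ASSOCIATION_IDENTIFIER_DB) -> Optional[Ontology]:
    """Return the target ontology for obj, or None if nothing matches."""
    # Identifier matching: all ontologies registered for the main body identifier.
    identified = main_db.get(obj.main_identifier, [])
    # Screening: keep only ontologies whose set type matches the data set's type.
    candidates = [o for o in identified if o.set_type == set_type]
    # Association matching (simplified to exact equality): the first candidate
    # whose mapped association identifiers contain the object's identifier wins.
    for candidate in candidates:
        if obj.association_identifier in assoc_db.get(candidate.ontology_id, set()):
            return candidate
    return None


def build_structured_info(obj, target):
    """Combine the data object's fields and the target ontology into one record."""
    return {
        "object_value": obj.value,
        "object_main_identifier": obj.main_identifier,
        "object_association_identifier": obj.association_identifier,
        "target_ontology": target.ontology_id,
    }


sample = DataObject(main_identifier="GLU", association_identifier="mmol/L", value="5.2")
sample_target = find_target_ontology(sample, "blood biochemistry")
sample_record = build_structured_info(sample, sample_target)
```

In this sketch the set type first narrows three ontologies sharing the identifier "GLU" to two, and the association identifier then separates the remaining two. A practical system would likely replace the exact-equality checks with fuzzy or normalized matching to tolerate recognition errors.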
CN202010745286.1A 2020-07-29 2020-07-29 Data processing method, device, computer readable medium and electronic equipment Active CN111898528B (en)

Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
CN202010745286.1A CN111898528B (en) 2020-07-29 2020-07-29 Data processing method, device, computer readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111898528A (en) 2020-11-06
CN111898528B (en) 2023-11-10

Family

ID=73183714

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN202010745286.1A Active CN111898528B (en) 2020-07-29 2020-07-29 Data processing method, device, computer readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111898528B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113641583A (en) * 2021-08-16 2021-11-12 拉扎斯网络科技(上海)有限公司 Data processing method, data processing apparatus, electronic device, storage medium, and program product
CN115880300B (en) * 2023-03-03 2023-05-09 北京网智易通科技有限公司 Image blurring detection method, device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105468666A (en) * 2015-08-11 2016-04-06 中国科学院软件研究所 Video content visual analysis method based on map metaphor
CN108734089A (en) * 2018-04-02 2018-11-02 腾讯科技(深圳)有限公司 Identify method, apparatus, equipment and the storage medium of table content in picture file
JP2019040467A (en) * 2017-08-25 2019-03-14 キヤノン株式会社 Information processing apparatus and control method therefor
CN110796031A (en) * 2019-10-11 2020-02-14 腾讯科技(深圳)有限公司 Table identification method and device based on artificial intelligence and electronic equipment
CN111258995A (en) * 2020-01-14 2020-06-09 腾讯科技(深圳)有限公司 Data processing method, device, storage medium and equipment
CN111275038A (en) * 2020-01-17 2020-06-12 平安医疗健康管理股份有限公司 Image text recognition method and device, computer equipment and computer storage medium
CN111310682A (en) * 2020-02-24 2020-06-19 民生科技有限责任公司 Universal detection analysis and identification method for text file table
CN111461108A (en) * 2020-02-21 2020-07-28 浙江工业大学 Medical document identification method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060072778A1 (en) * 2004-09-28 2006-04-06 Xerox Corporation. Encoding invisible electronic information in a printed document

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Generation and grading of arduous MCQs using NLP and OMR detection using OpenCV; Sarjak Maniar et al.; 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT); 1-7 *
Structured Pathology Reporting for Cancer from Free Text: Lung Cancer Case Study; Anthony Nguyen et al.; electronic Journal of Health Informatics; Vol. 7, No. 1; 1-7 *
Research on feature-based table content recognition (基于特征的表格内容识别的研究); 李华桥; China Master's Theses Full-text Database (Information Science and Technology); No. 01; I138-1521 *

Also Published As

Publication number Publication date
CN111898528A (en) 2020-11-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant