CN114359912B

CN114359912B - Software page key information extraction method and system based on graph neural network

Info

Publication number: CN114359912B
Application number: CN202210279500.8A
Authority: CN
Inventors: 方明超; 高扬
Original assignee: Hangzhou Real Intelligence Technology Co ltd
Current assignee: Hangzhou Real Intelligence Technology Co ltd
Priority date: 2022-03-22
Filing date: 2022-03-22
Publication date: 2022-06-24
Anticipated expiration: 2042-03-22
Also published as: CN114359912A

Abstract

The invention belongs to the technical field of software page information extraction, and particularly relates to a software page key information extraction method and system based on a graph neural network. The method comprises S1, inputting the web page picture, and outputting all the text line coordinate information on the picture; s2, cutting out all text lines according to the obtained text line coordinate information and identifying to obtain character information of each text line; s3, combining the webpage picture, the text line coordinate information and the text line character information, and outputting the category of all text lines through a text line classification algorithm based on a graph neural network model; s4, performing key-value pair matching according to the category of the text line; and if the matching is successful, outputting the text information corresponding to the required key value pair. The system comprises a text line detection module, a text line identification module, a text line classification module and a text line key value pair matching module. The method has the characteristics of strong universality and capability of being applied to all software text types.

Description

Software page key information extraction method and system based on graph neural network

Technical Field

The invention belongs to the technical field of software page information extraction, and particularly relates to a software page key information extraction method and system based on a graph neural network.

Background

The RPA application scenario typically encounters the task of web page or software page specific text information extraction. The task needs to acquire all the text information on the page by means of Optical Character Recognition (OCR) technology, and then extract the required field content through some post-processing operations (such as regular matching according to keywords, etc.).

In recent years, with the development of the field of artificial intelligence, deep neural networks are widely applied in the field of OCR, such as document recognition, certificate recognition, bill recognition and the like. Compared with the traditional OCR recognition algorithm, the deep neural network can obviously improve the application range and recognition accuracy of OCR recognition. The most commonly used Convolutional Neural Networks (CNN) tend to focus on only local features of the image, ignoring the interrelationships before the local features. The graph neural network can regard local features of the image as graph nodes and learn the interrelation among the nodes. In some specific scenes such as software interfaces and the like, text lines on the images have great interrelations, and more useful information can be learned by using a graph neural network.

The key information extraction refers to extracting required specified field information from the image text. For example, specific field information such as name, gender, ethnicity, identification card number, etc. is extracted from the identification card picture. There are often many text messages on a general software interface, and only a few key text messages are useful in actual business. If all the useful key information is to be extracted from all the text information, a series of complicated post-processing methods, such as template matching and the like, need to be designed. When designing the template, the character information of the text line, the position information of the text line, etc. need to be considered. It takes a lot of labor cost and time cost to set different post-processing rules for different software interfaces.

One of the existing key information extraction methods is to determine whether a matching relationship exists between a template image and a character string of an image to be detected based on template matching according to a preset template. For example, after all text information on the picture is identified, some regular rules are set according to text features of the key fields to match with all text lines on the picture, and the text line successfully matched with the regular rule of the corresponding key field is the key information.

In addition, a deep neural network-based method is used for classifying all text boxes in the image extracted by the OCR algorithm. For example, if the picture to be tested is an identification card picture, all text boxes in the picture can be classified into categories such as name, nationality, date of birth, address and the like, so that the key information extraction is completed.

However, the method based on template matching is very dependent on the layout of the image text, and once the text layout of the image to be detected is inconsistent with the preset template text layout, the extraction of the key information is wrong or fails. In addition, the interface text layouts of different application software are different, and a universal matching template is difficult to design. For example, to extract a name field from a picture, it is generally necessary to design a matching pattern by first searching the field for the keyword "name" and then matching the text boxes of 2-3 Chinese characters from the text box on the right side of the "name" field. If the interface typesetting of certain software is not arranged from left to right but arranged from top to bottom, the actual name is below the keyword 'name'. In this case, the matching pattern set in the past cannot be applied. Therefore, the template matching based method is difficult to have good versatility.

The method based on deep neural network classification is to assign a category to all text lines in the picture. For example, to extract information from an identification card, all text line fields on the identification card can be classified into categories such as "name", "gender", "date of birth", "address", "identification card number", and the like. When a certain key field needs to be extracted, corresponding field information can be extracted only according to the corresponding category of the key field. This approach does not need to rely on a specific template, but does require all the categories to be unambiguous. The text types on different application software are very different, and all the categories are difficult to exhaust. Therefore, the deep neural network classification-based method can only be used for specific scenes, and is not very universal.

Based on the above problems, it is very important to design a method and a system for extracting key information of a software page based on a graph neural network, which have strong universality and can be applied to all software text types.

For example, chinese patent application No. CN201911163754.8 describes a method, an apparatus, a terminal device and a server for accessing a web page, the method includes: acquiring an access request of a target webpage; the access request carries preset keywords; acquiring the position information of the keywords in the target webpage and the page data of the target webpage; and displaying the page data of the target webpage according to the position information. Although the page data of the target webpage is displayed according to the position information of the keywords, the user can quickly find the relevant contents of the searched keywords in the target webpage, so that the user experience is improved, the method has the defect that the method can only be used in a specific scene and is not very universal.

Disclosure of Invention

The invention provides a software page key information extraction method and system based on a graph neural network, which have strong universality and can be applied to all software text types, and aims to solve the problems that the existing key information extraction method can only be used in specific scenes and does not have good universality in the prior art.

In order to achieve the purpose, the invention adopts the following technical scheme:

the software page key information extraction method based on the graph neural network comprises the following steps;

s1, the input webpage picture passes through a DBNet text detection algorithm, and all text line coordinate information on the webpage picture is output;

s2, cutting out all text lines and identifying according to the obtained text line coordinate information through a CRNN text identification algorithm to obtain character information of each text line;

s3, combining the input webpage picture and the obtained text line coordinate information and text line character information, and outputting the category of all text lines through a text line classification algorithm based on a graph neural network model;

s4, respectively extracting the text line coordinate information features and the text line character information features of any two text lines, fusing to obtain fusion features, and simultaneously performing key value pair matching by combining the categories of the text lines; and if the matching is successful, outputting the text information corresponding to all the required key value pairs.

Preferably, the categories of the text line described in step S3 include three categories of "key", "value", and "other".

Preferably, step S3 includes the steps of:

s31, extracting the characteristics of the web page picture by using a CNN backbone network, and processing the characteristics of all text lines into a uniform dimension by using an ROI Pooling layer; extraction of visual features of each text line with CNN + ROI Pooling

Extracting semantic features of text lines using long-short term memory networks LSTM

And combining the visual features

And semantic features

Fusing to obtain a fused feature

，

Representing the splicing operation, the formula is as follows:

s32, utilizing the fusion characteristics of each text line

Establishing a graph neural network model, and constructing an undirected graph by taking each text line as a graph node, wherein the undirected graph is represented as

Wherein

Representing the fused features of all lines of text,

weights representing edges of two nodes in the undirected graph;

constructing feature vectors considering spatial relationships between text lines

Wherein,

，

is shown as

The coordinates of the center point of the individual text line,

，

is shown as

The coordinates of the center point of the individual text line,

，

is shown as

The width and height of the individual lines of text,

，

denotes the first

Width and height of individual text lines;

and

representing the distance between two lines of text;

and

represents the aspect ratio of each of the two text lines;

and

representing the difference in aspect ratio between two lines of text.

Preferably, step S3 further includes the steps of:

s33, constructing a spatial relationship between two text lines

Wherein,

is a linear transformation for

The process of raising the vitamin content is carried out,

to represent

The process of the regularization is carried out,

a multi-layer neural network is represented.

Preferably, step S3 further includes the steps of:

s34, using the following formula to make an undirected graph

Node of

Iteration is carried out, the iteration times are hyper-parameters, and can be adjusted as required:

wherein,

a function of the ReLU activation is represented,

is a linear transformation that is a function of,

is shown as

In the second iteration

A graph node;

and S35, completing the construction of the graph neural network model.

Preferably, step S4 includes the steps of:

s41, extracting semantic features of the character information of each text line by using a long-short term memory network (LSTM)

Feature of text line coordinate information having four vertices for each text line

,

,

,

Fusing to get a fused feature

：

Wherein,

、

respectively represent the first

A line of text and

semantic features of individual text lines;

is shown as

Vertex coordinates of individual text lines;

is shown as

Vertex coordinates of individual text lines;

、

is shown as

Width and height of individual text lines;

、

is shown as

The width and height of individual text lines.

S42, fusing the fused features

Sending the two text lines to a classifier, and outputting the class of 0 when the two text lines do not belong to the same key value pair; when two text lines belong to the same key-value pair, the output category is 1.

The invention also provides a software page key information extraction system based on the graph neural network, which comprises the following steps:

the text line detection module is used for outputting all text line coordinate information on the webpage picture by a DBNet text detection algorithm on the input webpage picture;

the text line recognition module is used for cutting out all text lines and recognizing the text lines according to the obtained text line coordinate information through a CRNN text recognition algorithm to obtain character information of each text line;

the text line classification module is used for combining the input webpage picture with the obtained text line coordinate information and text line character information and outputting the categories of all the text lines through a text line classification algorithm based on a graph neural network model;

and the text line key value pair matching module is used for respectively extracting the text line coordinate information characteristics and the text line character information characteristics of any two text lines, fusing to obtain fusion characteristics, and meanwhile, matching the key value pairs according to the categories of the text lines.

Preferably, the software page key information extraction system based on the graph neural network further comprises;

and the key value pair output module is used for outputting text information corresponding to all required key value pairs when the key value pairs are successfully matched.

Preferably, the text line classification module further includes:

the graph neural network model module is used for constructing a graph neural network model;

and the classification module is used for outputting the categories of all text lines.

Compared with the prior art, the invention has the beneficial effects that: (1) the invention creatively applies the graph neural network to the extraction of the key information of the RPA application software, and can directly output all key value pairs in the software picture, thereby helping to extract the wanted key information and greatly reducing the complexity of searching the key information by manually setting rules in the later period; (2) the key information extraction method disclosed by the invention integrates the visual characteristics of the image, the semantic characteristics of the text and the position characteristics of the text line, so that the extraction accuracy of the key information is greatly improved; (3) the contrast learning method adopted by the key-value pair matching of the invention only needs a small amount of text box type labeling samples, thus having good key-value pair matching effect and strong system generalization.

Drawings

FIG. 1 is a flow chart of a method for extracting key information of a software page based on a graph neural network according to the present invention;

FIG. 2 is a functional architecture diagram of the software page key information extraction system based on graph neural network in the present invention;

FIG. 3 is a functional architecture diagram of the text line classification module of the present invention;

fig. 4 is a flowchart illustrating capturing a picture from an RPA to extracting key information according to an embodiment of the present invention.

Detailed Description

In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.

Example 1:

as shown in FIG. 1, the invention provides a software page key information extraction method based on a graph neural network, which comprises the following steps;

s3, combining the input web page picture with the obtained text line coordinate information and text line character information, and outputting the category of all text lines through a text line classification algorithm based on a graph neural network model;

Further, the categories of the text line described in step S3 include three categories of "key", "value", and "other".

The classification aims to extract all key values in the picture on one hand and filter out some invalid text lines on the other hand. A general classification network extracts visual features of an image through a series of convolution operations, and classifies pictures according to the visual features. However, the current task is to classify the text lines, the difference of the visual features of the text lines is not obvious, and the classification based on the visual features cannot have a good classification effect. The category of the text line has a strong relationship with semantic information and position information of the text line, some key information such as 'name', 'date' and the like are specific texts, and the 'value' is generally positioned at the right side or below the 'key'. Therefore, the classification accuracy of the text lines can be improved by taking the position information and the semantic information of the text lines as the input of the network.

As shown in fig. 3, step S3 includes the following steps:

s31, extracting the characteristics of the web page picture by using a CNN backbone network, and processing the characteristics of all text lines into a uniform dimension by using an ROI Pooling layer; visual feature extraction of each text line with CNN + ROI Pooling

And combining the visual features

And semantic features

Fusing to obtain a fused feature

，

Representing the splicing operation, the formula is as follows:

s32, utilizing the fusion characteristics of each text line

In which

Representing the fusion characteristics of all text lines and representing the weight of the edges of two nodes in the undirected graph;

Wherein,

，

is shown as

The coordinates of the center point of the individual text line,

，

is shown as

The coordinates of the center point of the individual text line,

，

is shown as

The width and height of the individual lines of text,

，

is shown as

Width and height of each text line;

and

representing the distance between two text lines;

and

represents the aspect ratio of each of the two text lines;

and

representing the difference in aspect ratio between two lines of text.

S33, constructing a spatial relationship between two text lines

Wherein,

is a linear transformation for

The dimension of the mixture is increased by a plurality of steps,

represent

The process of the regularization is carried out,

a multi-layer neural network is represented.

S34, using the following formula to make an undirected graph

Node of

Iteration is carried out, the iteration times are hyper-parameters, and can be adjusted according to needs:

wherein,

a function of the ReLU activation is represented,

is a linear transformation that is a function of,

is shown as

The first in the second iteration

A graph node;

and S35, completing the construction of the graph neural network model.

ROI Pooling is an operation that can process different dimensional features into the same dimension, and is ubiquitous in the mainstream two-stage target detection algorithm (e.g., fast RCNN).

Step S4 includes the steps of:

,

,

,

Fusing to obtain a fused feature

：

Wherein,

、

respectively represent

A line of text and

semantic features of individual text lines;

is shown as

Vertex coordinates of each text line;

is shown as

Vertex coordinates of individual text lines;

、

is shown as

Width and height of individual text lines;

、

is shown as

The width and height of individual text lines.

S42, fusing the fused features

Sending the two text lines to a classifier, and outputting the classification of 0 when the two text lines do not belong to the same key value pair; when two text lines belong to the same key-value pair, the output category is 1.

The invention divides the key information extraction into two steps, namely text line classification and text line key value pair matching. The text line classification is to classify all detected text lines into three categories: keys (keys), values (values) and others (other) do not need to distinguish specific key value categories, so that the universality is greatly enhanced, and the method can be applied to all software text types. The text line key value pair matching is to pair all keys and values, and bind each text line belonging to the category of "key" with the corresponding text line belonging to the category of "value", so that the corresponding value can be obtained as long as the key corresponding to certain key information is input.

As shown in fig. 2, the present invention further provides a software page key information extraction system based on the graph neural network, including:

and the text line key value pair matching module is used for respectively extracting the text line coordinate information characteristics and the text line character information characteristics of any two text lines, fusing to obtain fusion characteristics, and simultaneously performing key value pair matching by combining the categories of the text lines.

And the key value pair output module is used for outputting the text information corresponding to all the required key value pairs when the key value pairs are successfully matched.

Further, the text line classification module further includes:

Based on the technical scheme of the invention, in the specific implementation and operation process, the specific implementation flow of the invention is described by using the flow chart from capturing pictures by the RPA to extracting key information shown in FIG. 4.

As shown in fig. 4, the specific implementation flow is as follows:

1. capturing pictures of application software pages by using an RPA (resilient packet access) as input, and configuring names of key information fields needing to be output;

2. inputting the picture into a text detector, and detecting all text line coordinates in the picture;

3. cutting out all text lines from the original image according to the text line coordinates detected in the step 2, inputting the text lines into a text recognizer, and recognizing the character content of each text line;

4. inputting the original image, the coordinates of the text lines output by the text detector and the content of the text lines output by the text recognizer into a text line classifier to obtain the categories (keys, values and other) of all the text lines;

5. inputting each text line belonging to the key and all text lines belonging to the value into a key-value matcher for matching, and binding the current key and the value if matching is successful;

6. matching the name of the key according to the name of the key information field set in the step 1;

7. the "value" bound to it is output according to the "key" corresponding to the name.

The invention creatively applies the graph neural network to the extraction of the key information of the RPA application software, and can directly output all key value pairs in the software picture, thereby helping to extract the wanted key information and greatly reducing the complexity of searching the key information by manually setting rules in the later period; the key information extraction method disclosed by the invention integrates the visual characteristics of the image, the semantic characteristics of the text and the position characteristics of the text line, so that the extraction accuracy of the key information is greatly improved; the contrast learning method adopted by the key-value pair matching of the invention only needs a small amount of text box type labeling samples, thus having good key-value pair matching effect and strong system generalization.

The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.

Claims

1. The software page key information extraction method based on the graph neural network is characterized by comprising the following steps;

s4, respectively extracting the text line coordinate information features and the text line character information features of any two text lines, fusing to obtain fusion features, and simultaneously performing key value pair matching by combining the categories of the text lines; if the matching is successful, outputting text information corresponding to all required key value pairs;

the categories of the text line in step S3 include three categories of "key", "value", and "other";

step S3 includes the following steps:

And combining the visual features

And semantic features

Fusing to obtain a fused feature

，

Representing the splicing operation, the formula is as follows:

s32, utilizing the fusion characteristics of each text line

In which

Representing the fused features of all lines of text,

weights representing edges of two nodes in the undirected graph;

Wherein,

，

denotes the first

The coordinates of the center point of the individual text line,

，

denotes the first

The coordinates of the center point of the individual text line,

，

is shown as

The width and height of the individual lines of text,

，

denotes the first

Width and height of individual text lines;

and

representing the distance between two text lines;

and

represents the aspect ratio of each of the two text lines;

and

representing the difference in aspect ratio between two lines of text;

s33, constructing a spatial relationship between two text lines

Wherein,

is a linear transformation for

The dimension of the mixture is increased by a plurality of steps,

to represent

The process of the regularization is carried out,

representing a multi-layer neural network;

s34, using the following formula to make an undirected graph

Node of

Performing iteration with the number of iterations being a hyper-parameterCan be adjusted as required:

wherein,

a function of the ReLU activation is represented,

is a linear transformation that is a function of,

is shown as

In the second iteration

A graph node;

and S35, completing the construction of the graph neural network model.

2. The method for extracting the key information of the software page based on the graph neural network as claimed in claim 1, wherein the step S4 comprises the following steps:

s41, extracting semantic features of the text line character information of each text line by using a long-short term memory network LSTM

,

,

,

Fusing to obtain a fused feature

：

Wherein,

、

respectively represent the first

A line of text and

semantic features of individual text lines;

denotes the first

Vertex coordinates of each text line;

is shown as

Vertex coordinates of individual text lines;

、

denotes the first

Width and height of each text line;

、

denotes the first

Width and height of individual text lines;

s42, fusing the fused features

3. The software page key information extraction system based on the graph neural network is applied to the software page key information extraction method based on the graph neural network as claimed in any one of claims 1-2, and is characterized in that the software page key information extraction system based on the graph neural network comprises:

the text line detection module is used for outputting all text line coordinate information on the webpage picture by the DBNet text detection algorithm;

4. The software page key information extraction system based on the graph neural network as claimed in claim 3, further comprising;

5. The graph neural network-based software page key information extraction system of claim 3, wherein the text line classification module further comprises: