
CN114359912B - Software page key information extraction method and system based on graph neural network - Google Patents

Software page key information extraction method and system based on graph neural network

Info

Publication number
CN114359912B
CN114359912B
Authority
CN
China
Prior art keywords
text
text line
lines
neural network
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210279500.8A
Other languages
Chinese (zh)
Other versions
CN114359912A (en)
Inventor
方明超
高扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Real Intelligence Technology Co ltd
Original Assignee
Hangzhou Real Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Real Intelligence Technology Co ltd filed Critical Hangzhou Real Intelligence Technology Co ltd
Priority to CN202210279500.8A
Publication of CN114359912A
Application granted
Publication of CN114359912B
Legal status: Active

Abstract

The invention belongs to the technical field of software page information extraction, and particularly relates to a software page key information extraction method and system based on a graph neural network. The method comprises: S1, inputting a web page picture and outputting the coordinate information of all text lines on the picture; S2, cropping out all text lines according to the obtained coordinate information and recognizing them to obtain the character information of each text line; S3, combining the web page picture, the text line coordinate information, and the text line character information, and outputting the category of every text line through a text line classification algorithm based on a graph neural network model; S4, performing key-value pair matching according to the text line categories, and, if the matching succeeds, outputting the text information corresponding to the required key-value pairs. The system comprises a text line detection module, a text line recognition module, a text line classification module, and a text line key-value pair matching module. The method has strong universality and can be applied to all software text types.

Description

Software page key information extraction method and system based on graph neural network
Technical Field
The invention belongs to the technical field of software page information extraction, and particularly relates to a software page key information extraction method and system based on a graph neural network.
Background
RPA application scenarios frequently involve the task of extracting specific text information from a web page or software page. This task requires acquiring all the text information on the page by means of Optical Character Recognition (OCR) technology and then extracting the required field content through post-processing operations (such as regular-expression matching against keywords).
In recent years, with the development of artificial intelligence, deep neural networks have been widely applied in the OCR field, for example in document recognition, certificate recognition, and bill recognition. Compared with traditional OCR algorithms, deep neural networks significantly improve the application range and accuracy of OCR recognition. However, the most commonly used Convolutional Neural Networks (CNNs) tend to focus only on local features of the image, ignoring the interrelationships between those local features. A graph neural network can treat local image features as graph nodes and learn the interrelations among the nodes. In specific scenes such as software interfaces, the text lines on an image are strongly interrelated, and a graph neural network can learn more useful information from them.
Key information extraction refers to extracting specified field information from image text, for example extracting fields such as name, gender, ethnicity, and identification card number from an identity card picture. A typical software interface contains many pieces of text, of which only a few are useful in actual business. Extracting the useful key information from all the text requires designing a series of complicated post-processing methods, such as template matching, and designing the templates requires considering the character content of each text line, its position information, and so on. Setting different post-processing rules for different software interfaces costs considerable labor and time.
One existing key information extraction method determines, based on template matching with a preset template, whether a matching relationship exists between the template image and the character strings of the image under test. For example, after all text information on the picture is recognized, regular-expression rules are set according to the text features of the key fields and matched against all text lines on the picture; a text line that matches the rule of a key field is taken as that key information.
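To make this rule-based baseline concrete, the following is a minimal sketch of regex matching over OCR text lines; the field names, coordinates, and rules are illustrative assumptions, not taken from the patent:

```python
import re

# Hypothetical OCR output: (text, (x, y, w, h)) per detected text line.
ocr_lines = [
    ("姓名", (100, 40, 60, 20)),
    ("张三", (180, 40, 60, 20)),
    ("日期", (100, 80, 60, 20)),
    ("2022-03-22", (180, 80, 110, 20)),
]

# One hand-written rule per key field; every new layout needs new rules.
RULES = {
    "姓名": re.compile(r"^[\u4e00-\u9fa5]{2,3}$"),   # a 2-3 character Chinese name
    "日期": re.compile(r"^\d{4}-\d{2}-\d{2}$"),      # an ISO-style date
}

def match_right_of(keyword, rule, lines):
    """Locate the keyword line, then apply the rule to lines on roughly the
    same row to its right (the left-to-right assumption that breaks on
    top-to-bottom layouts, as discussed below)."""
    for text, (x, y, w, h) in lines:
        if text == keyword:
            hits = [(t, bx) for t, (bx, by, bw, bh) in lines
                    if abs(by - y) < h and bx > x and rule.match(t)]
            if hits:
                return min(hits, key=lambda hit: hit[1])[0]  # nearest on the right
    return None

print(match_right_of("姓名", RULES["姓名"], ocr_lines))  # -> 张三
```

Every new field or layout requires another hand-written rule of this kind, which is exactly the maintenance burden described below.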
Another approach uses a deep neural network to classify all text boxes extracted from the image by the OCR algorithm. For example, if the picture under test is an identity card picture, all text boxes in the picture can be classified into categories such as name, nationality, date of birth, and address, thereby completing the key information extraction.
However, the template matching-based method depends heavily on the layout of the image text: once the text layout of the image under test is inconsistent with the preset template layout, key information extraction goes wrong or fails. Moreover, the interface text layouts of different application software differ, and a universal matching template is difficult to design. For example, to extract a name field from a picture, a matching rule is generally designed by first searching for the keyword "name" and then matching a text box of 2-3 Chinese characters to the right of the "name" field. If the interface of some software is laid out not left-to-right but top-to-bottom, the actual name lies below the keyword "name", and the previously designed matching rule no longer applies. Template matching-based methods therefore struggle to achieve good versatility.
The deep neural network classification-based method assigns a category to every text line in the picture. For example, to extract information from an identity card, all text line fields on the card can be classified into categories such as "name", "gender", "date of birth", "address", and "identification card number". When a key field needs to be extracted, the corresponding field information is retrieved simply by its category. This approach does not rely on a specific template, but it does require that all categories be enumerated unambiguously. Text types differ greatly across application software, and it is difficult to exhaust all categories, so the deep neural network classification-based method can only be used in specific scenes and lacks generality.
Based on the above problems, it is very important to design a method and a system for extracting key information of a software page based on a graph neural network, which have strong universality and can be applied to all software text types.
For example, Chinese patent application No. CN201911163754.8 describes a method, an apparatus, a terminal device, and a server for accessing a web page. The method includes: acquiring an access request for a target web page, the access request carrying preset keywords; acquiring the position information of the keywords in the target web page and the page data of the target web page; and displaying the page data of the target web page according to the position information. Although displaying the page data according to the keyword positions lets the user quickly find the content related to the searched keywords and improves user experience, the method has the defect that it can only be used in a specific scene and lacks generality.
Disclosure of Invention
To solve the problem that existing key information extraction methods work only in specific scenes and lack generality, the invention provides a software page key information extraction method and system based on a graph neural network that has strong universality and can be applied to all software text types.
In order to achieve the purpose, the invention adopts the following technical scheme:
The software page key information extraction method based on the graph neural network comprises the following steps:
S1, passing the input web page picture through the DBNet text detection algorithm and outputting the coordinate information of all text lines on the web page picture;
S2, cropping out all text lines according to the obtained text line coordinate information and recognizing them with the CRNN text recognition algorithm to obtain the character information of each text line;
S3, combining the input web page picture with the obtained text line coordinate information and text line character information, and outputting the category of every text line through a text line classification algorithm based on a graph neural network model;
S4, extracting the coordinate information features and character information features of any two text lines, fusing them to obtain a fused feature, and performing key-value pair matching in combination with the text line categories; if the matching succeeds, outputting the text information corresponding to all the required key-value pairs.
Preferably, the categories of the text line described in step S3 include three categories of "key", "value", and "other".
Preferably, step S3 includes the steps of:
S31, extracting features of the web page picture with a CNN backbone network, and processing the features of all text lines into a uniform dimension with an ROI Pooling layer; the visual feature $v_i$ of each text line is extracted with CNN + ROI Pooling, the semantic feature $s_i$ of each text line is extracted with a long short-term memory network (LSTM), and the visual feature $v_i$ and the semantic feature $s_i$ are fused to obtain the fused feature $f_i$, where $\oplus$ denotes the concatenation operation:

$$f_i = v_i \oplus s_i$$
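As a rough illustration of step S31, the following PyTorch sketch fuses per-text-line visual and semantic features; the backbone, dimensions, and vocabulary size are illustrative assumptions rather than the patent's actual configuration:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class TextLineFusion(nn.Module):
    """Fuse per-text-line visual and semantic features (sketch of step S31)."""
    def __init__(self, vocab_size=6000, embed_dim=64, sem_dim=128, vis_dim=128):
        super().__init__()
        # Stand-in CNN backbone (two stride-2 convs downsample by 4).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, vis_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, sem_dim, batch_first=True)

    def forward(self, image, boxes, char_ids):
        # image: (1, 3, H, W); boxes: (N, 5) rows of [batch_idx, x1, y1, x2, y2].
        fmap = self.backbone(image)
        # ROI Pooling brings every text line to the same spatial size.
        pooled = roi_pool(fmap, boxes, output_size=(1, 4), spatial_scale=0.25)
        v = pooled.flatten(1)                    # visual feature v_i per text line
        # char_ids: (N, T) padded character indices of each text line's content.
        _, (h, _) = self.lstm(self.embed(char_ids))
        s = h[-1]                                # semantic feature s_i per text line
        return torch.cat([v, s], dim=1)          # f_i = v_i concatenated with s_i

model = TextLineFusion()
img = torch.randn(1, 3, 256, 256)
boxes = torch.tensor([[0, 10., 10., 120., 40.], [0, 10., 60., 200., 90.]])
chars = torch.randint(0, 6000, (2, 12))
fused = model(img, boxes, chars)                 # one fused feature row per text line
```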
S32, using the fused feature $f_i$ of each text line to establish a graph neural network model, constructing an undirected graph with each text line as a graph node, denoted $G = (V, E)$, where $V$ represents the fused features of all text lines and $E$ represents the weights of the edges between two nodes in the undirected graph;

constructing a feature vector that captures the spatial relationship between text lines:

$$r_{ij} = \left[\, x_i - x_j,\ y_i - y_j,\ \frac{w_i}{h_i},\ \frac{w_j}{h_j},\ \frac{w_i}{w_j},\ \frac{h_i}{h_j} \,\right]$$

where $(x_i, y_i)$ denotes the center point coordinates of the $i$-th text line, $(x_j, y_j)$ denotes the center point coordinates of the $j$-th text line, $w_i$ and $h_i$ denote the width and height of the $i$-th text line, and $w_j$ and $h_j$ denote the width and height of the $j$-th text line; $x_i - x_j$ and $y_i - y_j$ represent the distance between the two text lines; $w_i/h_i$ and $w_j/h_j$ represent the aspect ratio of each of the two text lines; $w_i/w_j$ and $h_i/h_j$ represent the difference in aspect ratio between the two text lines.
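A minimal sketch of the relation vector above; the exact feature ordering in the patent is rendered as images, so this follows the prose definitions:

```python
def relation_vector(box_i, box_j):
    """Spatial relation features r_ij between two text lines (step S32).
    Each box is (cx, cy, w, h): center point, width, height."""
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    return [
        xi - xj,   # horizontal distance between center points
        yi - yj,   # vertical distance between center points
        wi / hi,   # aspect ratio of text line i
        wj / hj,   # aspect ratio of text line j
        wi / wj,   # relative width: difference in aspect ratio
        hi / hj,   # relative height: difference in aspect ratio
    ]

# A "key" at (130, 50) and its "value" directly below it at (130, 90).
print(relation_vector((130, 50, 120, 30), (130, 90, 200, 30)))
```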
Preferably, step S3 further includes the steps of:
S33, constructing the spatial relationship $e_{ij}$ between two text lines:

$$u_{ij} = W_r\, r_{ij}$$

$$e_{ij} = \mathrm{MLP}\big(\mathrm{Norm}(u_{ij})\big)$$

where $W_r$ is a linear transformation used to raise the dimension of $r_{ij}$, $\mathrm{Norm}(\cdot)$ denotes normalization, and $\mathrm{MLP}$ denotes a multi-layer neural network.
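A possible reading of step S33 in PyTorch; the hidden sizes and the choice of LayerNorm for the normalization step are assumptions:

```python
import torch
import torch.nn as nn

class EdgeEmbedding(nn.Module):
    """Sketch of step S33: raise the dimension of r_ij with a linear map,
    normalize, then apply a multi-layer neural network."""
    def __init__(self, rel_dim=6, hidden=64, out_dim=64):
        super().__init__()
        self.lift = nn.Linear(rel_dim, hidden)    # W_r: dimension-raising transform
        self.norm = nn.LayerNorm(hidden)          # stand-in for the normalization step
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, r):                         # r: (E, rel_dim) relation vectors
        return self.mlp(self.norm(self.lift(r)))  # e_ij, one row per text line pair

edge_feats = EdgeEmbedding()(torch.randn(10, 6))  # 10 pairs -> (10, 64)
```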
Preferably, step S3 further includes the steps of:
S34, iterating the nodes $v_i^{(t)}$ of the undirected graph $G$ with the following formulas, where the number of iterations is a hyper-parameter that can be adjusted as required:

$$m_i^{(t)} = \sum_{j} e_{ij}\, v_j^{(t)}$$

$$v_i^{(t+1)} = \mathrm{ReLU}\big(W^{(t)}\, m_i^{(t)}\big)$$

where $\mathrm{ReLU}$ denotes the ReLU activation function, $W^{(t)}$ is a linear transformation, and $v_i^{(t)}$ denotes the $i$-th graph node at the $t$-th iteration;

and S35, completing the construction of the graph neural network model.
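One plausible implementation of the node iteration in S34, assuming scalar edge weights and a shared linear map per iteration (both assumptions, since the update formulas are image-rendered in the source):

```python
import torch
import torch.nn as nn

class GraphIteration(nn.Module):
    """Node update of step S34: each node aggregates edge-weighted messages
    from its neighbors, then passes through a linear map and ReLU. The number
    of iterations is a hyper-parameter."""
    def __init__(self, node_dim=64, iters=2):
        super().__init__()
        self.iters = iters
        self.linear = nn.Linear(node_dim, node_dim)

    def forward(self, v, e):
        # v: (N, D) node features f_i; e: (N, N) edge weights between text lines.
        for _ in range(self.iters):
            msg = e @ v                          # weighted aggregation over neighbors
            v = torch.relu(self.linear(msg))     # v_i^(t+1) = ReLU(W m_i^(t))
        return v

nodes = torch.randn(5, 64)                        # 5 text lines
edge_w = torch.softmax(torch.randn(5, 5), dim=1)  # illustrative edge weights
updated = GraphIteration()(nodes, edge_w)
```

Scalar edge weights are the simplest choice here; the patent's formulas may instead use the vector-valued edge features $e_{ij}$ from S33 directly.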
Preferably, step S4 includes the steps of:
S41, extracting the semantic feature $s_i$ of each text line's character information with a long short-term memory network (LSTM), and fusing it with the text line coordinate features (each text line has four vertex coordinates $p^{(1)}, p^{(2)}, p^{(3)}, p^{(4)}$) to obtain the fused feature $f_{ij}$:

$$f_{ij} = s_i \oplus s_j \oplus \Delta p_{ij}$$

where $\Delta p_{ij}$ denotes the vertex offsets $p_j - p_i$ normalized by the text line widths and heights; $s_i$ and $s_j$ respectively denote the semantic features of the $i$-th and $j$-th text lines; $p_i$ denotes the vertex coordinates of the $i$-th text line; $p_j$ denotes the vertex coordinates of the $j$-th text line; $w_i$ and $h_i$ denote the width and height of the $i$-th text line; $w_j$ and $h_j$ denote the width and height of the $j$-th text line.

S42, sending the fused feature $f_{ij}$ to a classifier, which outputs category 0 when the two text lines do not belong to the same key-value pair and category 1 when the two text lines belong to the same key-value pair.
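A sketch of the pair classifier in S41-S42; the semantic features are assumed to come from the LSTM described above, and the normalization of vertex offsets is one plausible reading of the image-rendered fusion formula:

```python
import torch
import torch.nn as nn

class KeyValueMatcher(nn.Module):
    """Binary pair classifier of steps S41-S42: class 1 means the two text
    lines belong to the same key-value pair, class 0 means they do not."""
    def __init__(self, sem_dim=128):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * sem_dim + 8, 64), nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, s_i, s_j, verts_i, verts_j, wh_i):
        # s_*: (B, sem_dim) LSTM semantic features; verts_*: (B, 4, 2) the four
        # vertex coordinates of each line; wh_i: (B, 2) width/height of line i.
        offset = (verts_j - verts_i) / wh_i.unsqueeze(1)      # normalized offsets
        f = torch.cat([s_i, s_j, offset.flatten(1)], dim=1)   # fused feature f_ij
        return self.classifier(f)

m = KeyValueMatcher()
logits = m(torch.randn(3, 128), torch.randn(3, 128),
           torch.randn(3, 4, 2), torch.randn(3, 4, 2), torch.rand(3, 2) + 1)
```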
The invention also provides a software page key information extraction system based on the graph neural network, which comprises:
the text line detection module is used for passing the input web page picture through the DBNet text detection algorithm and outputting the coordinate information of all text lines on the web page picture;
the text line recognition module is used for cutting out all text lines and recognizing the text lines according to the obtained text line coordinate information through a CRNN text recognition algorithm to obtain character information of each text line;
the text line classification module is used for combining the input webpage picture with the obtained text line coordinate information and text line character information and outputting the categories of all the text lines through a text line classification algorithm based on a graph neural network model;
and the text line key value pair matching module is used for respectively extracting the text line coordinate information characteristics and the text line character information characteristics of any two text lines, fusing to obtain fusion characteristics, and meanwhile, matching the key value pairs according to the categories of the text lines.
Preferably, the software page key information extraction system based on the graph neural network further comprises:
and the key value pair output module is used for outputting text information corresponding to all required key value pairs when the key value pairs are successfully matched.
Preferably, the text line classification module further includes:
the graph neural network model module is used for constructing a graph neural network model;
and the classification module is used for outputting the categories of all text lines.
Compared with the prior art, the invention has the following beneficial effects: (1) the invention creatively applies the graph neural network to key information extraction for RPA application software and can directly output all key-value pairs in a software picture, which helps extract the desired key information and greatly reduces the complexity of manually setting rules to search for key information later; (2) the key information extraction method integrates the visual features of the image, the semantic features of the text, and the position features of the text lines, greatly improving the accuracy of key information extraction; (3) the contrastive learning method adopted for key-value pair matching needs only a small number of text box category annotation samples, achieving a good key-value pair matching effect and strong system generalization.
Drawings
FIG. 1 is a flow chart of a method for extracting key information of a software page based on a graph neural network according to the present invention;
FIG. 2 is a functional architecture diagram of the software page key information extraction system based on graph neural network in the present invention;
FIG. 3 is a functional architecture diagram of the text line classification module of the present invention;
fig. 4 is a flowchart illustrating capturing a picture from an RPA to extracting key information according to an embodiment of the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
Example 1:
As shown in FIG. 1, the invention provides a software page key information extraction method based on a graph neural network, which comprises the following steps:
S1, passing the input web page picture through the DBNet text detection algorithm and outputting the coordinate information of all text lines on the web page picture;
S2, cropping out all text lines according to the obtained text line coordinate information and recognizing them with the CRNN text recognition algorithm to obtain the character information of each text line;
S3, combining the input web page picture with the obtained text line coordinate information and text line character information, and outputting the category of every text line through a text line classification algorithm based on a graph neural network model;
S4, extracting the coordinate information features and character information features of any two text lines, fusing them to obtain a fused feature, and performing key-value pair matching in combination with the text line categories; if the matching succeeds, outputting the text information corresponding to all the required key-value pairs.
Further, the categories of the text line described in step S3 include three categories of "key", "value", and "other".
The purpose of classification is, on the one hand, to extract all keys and values in the picture and, on the other hand, to filter out invalid text lines. A general classification network extracts visual features of an image through a series of convolution operations and classifies pictures according to those visual features. In the present task, however, text lines must be classified, and the differences in their visual features are not obvious, so classification based on visual features alone cannot achieve a good effect. The category of a text line is strongly related to its semantic information and position information: keys such as "name" and "date" are specific texts, and a "value" is generally positioned to the right of or below its "key". Therefore, taking the position information and semantic information of text lines as network inputs can improve text line classification accuracy.
As shown in fig. 3, step S3 includes the following steps:
S31, extracting features of the web page picture with a CNN backbone network, and processing the features of all text lines into a uniform dimension with an ROI Pooling layer; the visual feature $v_i$ of each text line is extracted with CNN + ROI Pooling, the semantic feature $s_i$ of each text line is extracted with a long short-term memory network (LSTM), and the visual feature $v_i$ and the semantic feature $s_i$ are fused to obtain the fused feature $f_i$, where $\oplus$ denotes the concatenation operation:

$$f_i = v_i \oplus s_i$$
S32, using the fused feature $f_i$ of each text line to establish a graph neural network model, constructing an undirected graph with each text line as a graph node, denoted $G = (V, E)$, where $V$ represents the fused features of all text lines and $E$ represents the weights of the edges between two nodes in the undirected graph;

constructing a feature vector that captures the spatial relationship between text lines:

$$r_{ij} = \left[\, x_i - x_j,\ y_i - y_j,\ \frac{w_i}{h_i},\ \frac{w_j}{h_j},\ \frac{w_i}{w_j},\ \frac{h_i}{h_j} \,\right]$$

where $(x_i, y_i)$ denotes the center point coordinates of the $i$-th text line, $(x_j, y_j)$ denotes the center point coordinates of the $j$-th text line, $w_i$ and $h_i$ denote the width and height of the $i$-th text line, and $w_j$ and $h_j$ denote the width and height of the $j$-th text line; $x_i - x_j$ and $y_i - y_j$ represent the distance between the two text lines; $w_i/h_i$ and $w_j/h_j$ represent the aspect ratio of each of the two text lines; $w_i/w_j$ and $h_i/h_j$ represent the difference in aspect ratio between the two text lines.
S33, constructing the spatial relationship $e_{ij}$ between two text lines:

$$u_{ij} = W_r\, r_{ij}$$

$$e_{ij} = \mathrm{MLP}\big(\mathrm{Norm}(u_{ij})\big)$$

where $W_r$ is a linear transformation used to raise the dimension of $r_{ij}$, $\mathrm{Norm}(\cdot)$ denotes normalization, and $\mathrm{MLP}$ denotes a multi-layer neural network.
S34, iterating the nodes $v_i^{(t)}$ of the undirected graph $G$ with the following formulas, where the number of iterations is a hyper-parameter that can be adjusted as required:

$$m_i^{(t)} = \sum_{j} e_{ij}\, v_j^{(t)}$$

$$v_i^{(t+1)} = \mathrm{ReLU}\big(W^{(t)}\, m_i^{(t)}\big)$$

where $\mathrm{ReLU}$ denotes the ReLU activation function, $W^{(t)}$ is a linear transformation, and $v_i^{(t)}$ denotes the $i$-th graph node at the $t$-th iteration;

and S35, completing the construction of the graph neural network model.
ROI Pooling is an operation that processes features of different dimensions into the same dimension and is ubiquitous in mainstream two-stage object detection algorithms (e.g., Faster R-CNN).
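For instance, torchvision's roi_pool brings two differently sized text line regions to a common feature size (the feature-map and box values below are arbitrary):

```python
import torch
from torchvision.ops import roi_pool

fmap = torch.randn(1, 8, 32, 32)               # a CNN feature map
boxes = torch.tensor([[0, 2., 2., 30., 10.],   # two differently sized text lines
                      [0, 4., 12., 12., 20.]])
pooled = roi_pool(fmap, boxes, output_size=(2, 4))
print(pooled.shape)                            # torch.Size([2, 8, 2, 4]): uniform size
```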
Step S4 includes the steps of:
S41, extracting the semantic feature $s_i$ of each text line's character information with a long short-term memory network (LSTM), and fusing it with the text line coordinate features (each text line has four vertex coordinates $p^{(1)}, p^{(2)}, p^{(3)}, p^{(4)}$) to obtain the fused feature $f_{ij}$:

$$f_{ij} = s_i \oplus s_j \oplus \Delta p_{ij}$$

where $\Delta p_{ij}$ denotes the vertex offsets $p_j - p_i$ normalized by the text line widths and heights; $s_i$ and $s_j$ respectively denote the semantic features of the $i$-th and $j$-th text lines; $p_i$ denotes the vertex coordinates of the $i$-th text line; $p_j$ denotes the vertex coordinates of the $j$-th text line; $w_i$ and $h_i$ denote the width and height of the $i$-th text line; $w_j$ and $h_j$ denote the width and height of the $j$-th text line.

S42, sending the fused feature $f_{ij}$ to a classifier, which outputs category 0 when the two text lines do not belong to the same key-value pair and category 1 when the two text lines belong to the same key-value pair.
The invention divides key information extraction into two steps: text line classification and text line key-value pair matching. Text line classification sorts all detected text lines into three categories, namely keys, values, and others, without distinguishing specific key-value categories; this greatly enhances generality and allows the method to be applied to all software text types. Text line key-value pair matching pairs all keys and values, binding each text line of the "key" category with the corresponding text line of the "value" category, so that the corresponding value can be obtained as long as the key of a given piece of key information is input.
As shown in fig. 2, the present invention further provides a software page key information extraction system based on the graph neural network, including:
the text line detection module is used for passing the input web page picture through the DBNet text detection algorithm and outputting the coordinate information of all text lines on the web page picture;
the text line recognition module is used for cutting out all text lines and recognizing the text lines according to the obtained text line coordinate information through a CRNN text recognition algorithm to obtain character information of each text line;
the text line classification module is used for combining the input webpage picture with the obtained text line coordinate information and text line character information and outputting the categories of all the text lines through a text line classification algorithm based on a graph neural network model;
and the text line key value pair matching module is used for respectively extracting the text line coordinate information characteristics and the text line character information characteristics of any two text lines, fusing to obtain fusion characteristics, and simultaneously performing key value pair matching by combining the categories of the text lines.
And the key value pair output module is used for outputting the text information corresponding to all the required key value pairs when the key value pairs are successfully matched.
Further, the text line classification module further includes:
the graph neural network model module is used for constructing a graph neural network model;
and the classification module is used for outputting the categories of all text lines.
Based on the technical scheme of the invention, in the specific implementation and operation process, the specific implementation flow of the invention is described by using the flow chart from capturing pictures by the RPA to extracting key information shown in FIG. 4.
As shown in fig. 4, the specific implementation flow is as follows (a condensed code sketch of the whole flow appears after the list):
1. capturing pictures of application software pages with RPA (Robotic Process Automation) as input, and configuring the names of the key information fields to be output;
2. inputting the picture into a text detector, and detecting all text line coordinates in the picture;
3. cutting out all text lines from the original image according to the text line coordinates detected in the step 2, inputting the text lines into a text recognizer, and recognizing the character content of each text line;
4. inputting the original image, the coordinates of the text lines output by the text detector and the content of the text lines output by the text recognizer into a text line classifier to obtain the categories (keys, values and other) of all the text lines;
5. inputting each text line belonging to the key and all text lines belonging to the value into a key-value matcher for matching, and binding the current key and the value if matching is successful;
6. matching the name of the key according to the name of the key information field set in the step 1;
7. the "value" bound to it is output according to the "key" corresponding to the name.
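The following is a condensed sketch of this seven-step flow; the interfaces of the detector, recognizer, classifier, and matcher callables are assumptions for illustration:

```python
def crop(img, box):
    """Crop a text line region from the screenshot (box = x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return img[y1:y2, x1:x2]

def extract_key_info(screenshot, wanted_keys, detector, recognizer,
                     classifier, matcher):
    """Condensed flow of Fig. 4. The four callables stand in for the DBNet
    detector, CRNN recognizer, GNN text line classifier, and key-value
    matcher described above; their exact interfaces are assumptions."""
    boxes = detector(screenshot)                              # step 2
    texts = [recognizer(crop(screenshot, b)) for b in boxes]  # step 3
    labels = classifier(screenshot, boxes, texts)             # step 4
    keys = [i for i, c in enumerate(labels) if c == "key"]
    values = [i for i, c in enumerate(labels) if c == "value"]
    bound = {}
    for k in keys:                                            # step 5: bind pairs
        for v in values:
            if matcher(boxes[k], texts[k], boxes[v], texts[v]):
                bound[texts[k]] = texts[v]
                break
    return {name: bound.get(name) for name in wanted_keys}    # steps 6-7
```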
The invention creatively applies the graph neural network to key information extraction for RPA application software and can directly output all key-value pairs in a software picture, which helps extract the desired key information and greatly reduces the complexity of manually setting rules to search for key information later. The key information extraction method integrates the visual features of the image, the semantic features of the text, and the position features of the text lines, greatly improving the accuracy of key information extraction. The contrastive learning method adopted for key-value pair matching needs only a small number of text box category annotation samples, achieving a good key-value pair matching effect and strong system generalization.
The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.

Claims (5)

1. The software page key information extraction method based on the graph neural network is characterized by comprising the following steps:
S1, passing the input web page picture through the DBNet text detection algorithm and outputting the coordinate information of all text lines on the web page picture;
S2, cropping out all text lines according to the obtained text line coordinate information and recognizing them with the CRNN text recognition algorithm to obtain the character information of each text line;
S3, combining the input web page picture with the obtained text line coordinate information and text line character information, and outputting the category of every text line through a text line classification algorithm based on a graph neural network model;
S4, extracting the coordinate information features and character information features of any two text lines, fusing them to obtain a fused feature, and performing key-value pair matching in combination with the text line categories; if the matching succeeds, outputting the text information corresponding to all the required key-value pairs;
the categories of the text line in step S3 include three categories of "key", "value", and "other";
step S3 includes the following steps:
S31, extracting features of the web page picture with a CNN backbone network, and processing the features of all text lines into a uniform dimension with an ROI Pooling layer; extracting the visual feature $v_i$ of each text line with CNN + ROI Pooling, extracting the semantic feature $s_i$ of each text line with a long short-term memory network LSTM, and fusing the visual feature $v_i$ and the semantic feature $s_i$ to obtain the fused feature $f_i$, wherein $\oplus$ denotes the concatenation operation:

$$f_i = v_i \oplus s_i$$

S32, using the fused feature $f_i$ of each text line to establish a graph neural network model, constructing an undirected graph with each text line as a graph node, denoted $G = (V, E)$, wherein $V$ represents the fused features of all text lines and $E$ represents the weights of the edges between two nodes in the undirected graph;

constructing a feature vector that captures the spatial relationship between text lines:

$$r_{ij} = \left[\, x_i - x_j,\ y_i - y_j,\ \frac{w_i}{h_i},\ \frac{w_j}{h_j},\ \frac{w_i}{w_j},\ \frac{h_i}{h_j} \,\right]$$

wherein $(x_i, y_i)$ denotes the center point coordinates of the $i$-th text line, $(x_j, y_j)$ denotes the center point coordinates of the $j$-th text line, $w_i$ and $h_i$ denote the width and height of the $i$-th text line, and $w_j$ and $h_j$ denote the width and height of the $j$-th text line; $x_i - x_j$ and $y_i - y_j$ represent the distance between the two text lines; $w_i/h_i$ and $w_j/h_j$ represent the aspect ratio of each of the two text lines; $w_i/w_j$ and $h_i/h_j$ represent the difference in aspect ratio between the two text lines;

S33, constructing the spatial relationship $e_{ij}$ between two text lines:

$$u_{ij} = W_r\, r_{ij}$$

$$e_{ij} = \mathrm{MLP}\big(\mathrm{Norm}(u_{ij})\big)$$

wherein $W_r$ is a linear transformation used to raise the dimension of $r_{ij}$, $\mathrm{Norm}(\cdot)$ denotes normalization, and $\mathrm{MLP}$ denotes a multi-layer neural network;

S34, iterating the nodes $v_i^{(t)}$ of the undirected graph $G$ with the following formulas, the number of iterations being a hyper-parameter that can be adjusted as required:

$$m_i^{(t)} = \sum_{j} e_{ij}\, v_j^{(t)}$$

$$v_i^{(t+1)} = \mathrm{ReLU}\big(W^{(t)}\, m_i^{(t)}\big)$$

wherein $\mathrm{ReLU}$ denotes the ReLU activation function, $W^{(t)}$ is a linear transformation, and $v_i^{(t)}$ denotes the $i$-th graph node at the $t$-th iteration;

and S35, completing the construction of the graph neural network model.
2. The method for extracting the key information of the software page based on the graph neural network as claimed in claim 1, wherein the step S4 comprises the following steps:
S41, extracting the semantic feature $s_i$ of each text line's character information with a long short-term memory network LSTM, and fusing it with the text line coordinate features (each text line has four vertex coordinates $p^{(1)}, p^{(2)}, p^{(3)}, p^{(4)}$) to obtain the fused feature $f_{ij}$:

$$f_{ij} = s_i \oplus s_j \oplus \Delta p_{ij}$$

wherein $\Delta p_{ij}$ denotes the vertex offsets $p_j - p_i$ normalized by the text line widths and heights; $s_i$ and $s_j$ respectively denote the semantic features of the $i$-th and $j$-th text lines; $p_i$ denotes the vertex coordinates of the $i$-th text line; $p_j$ denotes the vertex coordinates of the $j$-th text line; $w_i$ and $h_i$ denote the width and height of the $i$-th text line; $w_j$ and $h_j$ denote the width and height of the $j$-th text line;

S42, sending the fused feature $f_{ij}$ to a classifier, which outputs category 0 when the two text lines do not belong to the same key-value pair and category 1 when the two text lines belong to the same key-value pair.
3. The software page key information extraction system based on the graph neural network is applied to the software page key information extraction method based on the graph neural network as claimed in any one of claims 1-2, and is characterized in that the software page key information extraction system based on the graph neural network comprises:
the text line detection module is used for passing the input web page picture through the DBNet text detection algorithm and outputting the coordinate information of all text lines on the web page picture;
the text line recognition module is used for cutting out all text lines and recognizing the text lines according to the obtained text line coordinate information through a CRNN text recognition algorithm to obtain character information of each text line;
the text line classification module is used for combining the input webpage picture with the obtained text line coordinate information and text line character information and outputting the categories of all the text lines through a text line classification algorithm based on a graph neural network model;
and the text line key value pair matching module is used for respectively extracting the text line coordinate information characteristics and the text line character information characteristics of any two text lines, fusing to obtain fusion characteristics, and simultaneously performing key value pair matching by combining the categories of the text lines.
4. The software page key information extraction system based on the graph neural network as claimed in claim 3, further comprising;
and the key value pair output module is used for outputting text information corresponding to all required key value pairs when the key value pairs are successfully matched.
5. The graph neural network-based software page key information extraction system of claim 3, wherein the text line classification module further comprises:
the graph neural network model module is used for constructing a graph neural network model;
and the classification module is used for outputting the categories of all text lines.
CN202210279500.8A 2022-03-22 2022-03-22 Software page key information extraction method and system based on graph neural network Active CN114359912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210279500.8A CN114359912B (en) 2022-03-22 2022-03-22 Software page key information extraction method and system based on graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210279500.8A CN114359912B (en) 2022-03-22 2022-03-22 Software page key information extraction method and system based on graph neural network

Publications (2)

Publication Number Publication Date
CN114359912A CN114359912A (en) 2022-04-15
CN114359912B (en) 2022-06-24

Family

ID=81095001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210279500.8A Active CN114359912B (en) 2022-03-22 2022-03-22 Software page key information extraction method and system based on graph neural network

Country Status (1)

Country Link
CN (1) CN114359912B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117079288B (en) * 2023-10-19 2023-12-29 华南理工大学 Method and model for extracting key information for recognizing Chinese semantics in scene

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN112257841A (en) * 2020-09-03 2021-01-22 北京大学 Data processing method, device and equipment in graph neural network and storage medium
CN112464781A (en) * 2020-11-24 2021-03-09 厦门理工学院 Document image key information extraction and matching method based on graph neural network
CN114187595A (en) * 2021-12-14 2022-03-15 中国科学院软件研究所 Document layout recognition method and system based on fusion of visual features and semantic features

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11403488B2 (en) * 2020-03-19 2022-08-02 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for recognizing image-based content presented in a structured layout
CN114037985A (en) * 2021-11-04 2022-02-11 北京有竹居网络技术有限公司 Information extraction method, device, equipment, medium and product


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Graph-based Visual-Semantic Entanglement Network for Zero-shot Image Recognition; Yang Hu et al.; arXiv; 2021-06-14; pp. 1-15 *
Automatic summarization method based on primary-secondary relationship features; Zhang Ying et al.; Computer Science; 2020-06-15; pp. 16-21 *
Research on image text extraction technology based on deep learning; Jiang Liangwei et al.; Information Systems Engineering; 2020-03-20 (No. 03); pp. 89-90 *

Also Published As

Publication number Publication date
CN114359912A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
US8744196B2 (en) Automatic recognition of images
CN110717534B (en) Target classification and positioning method based on network supervision
CN111753120B (en) Question searching method and device, electronic equipment and storage medium
WO2020071558A1 (en) Business form layout analysis device, and analysis program and analysis method therefor
CN112464781A (en) Document image key information extraction and matching method based on graph neural network
CN111931859B (en) Multi-label image recognition method and device
Hu et al. Enriching the metadata of map images: a deep learning approach with GIS-based data augmentation
CN112381086A (en) Method and device for outputting image character recognition result in structured mode
CN113469067A (en) Document analysis method and device, computer equipment and storage medium
CN114359912B (en) Software page key information extraction method and system based on graph neural network
CN112966676B (en) Document key information extraction method based on zero sample learning
CN115063784A (en) Bill image information extraction method and device, storage medium and electronic equipment
CN113936764A (en) Method and system for desensitizing sensitive information in medical report sheet photo
CN113628181A (en) Image processing method, image processing device, electronic equipment and storage medium
CN115640401B (en) Text content extraction method and device
CN115858695A (en) Information processing method and device and storage medium
Vishwanath et al. Deep reader: Information extraction from document images via relation extraction and natural language
JP6896260B1 (en) Layout analysis device, its analysis program and its analysis method
CN116092100A (en) Text content extraction method and device
Rahul et al. Deep reader: Information extraction from document images via relation extraction and natural language
Akhter et al. Semantic segmentation of printed text from marathi document images using deep learning methods
Liao et al. Image-matching based identification of store signage using web-crawled information
Yadav et al. Rfpssih: reducing false positive text detection sequels in scenery images using hybrid technique
CN113591680B (en) Method and system for identifying longitude and latitude of geological picture drilling well
Khlif Multi-lingual scene text detection based on convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant