
CN112149663B - Image text extraction method, device and electronic device combining RPA and AI - Google Patents

Image text extraction method, device and electronic device combining RPA and AI

Info

Publication number
CN112149663B
CN112149663B
Authority
CN
China
Prior art keywords
detection frame
image
detection
processed
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010886737.3A
Other languages
Chinese (zh)
Other versions
CN112149663A (en)
Inventor
汪冠春
胡一川
褚瑞
李玮
田艳莉
王建周
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Laiye Network Technology Co Ltd
Laiye Technology Beijing Co Ltd
Original Assignee
Beijing Laiye Network Technology Co Ltd
Laiye Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Laiye Network Technology Co Ltd, Laiye Technology Beijing Co Ltd filed Critical Beijing Laiye Network Technology Co Ltd
Priority to CN202010886737.3A priority Critical patent/CN112149663B/en
Publication of CN112149663A publication Critical patent/CN112149663A/en
Application granted granted Critical
Publication of CN112149663B publication Critical patent/CN112149663B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/187Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Character Input (AREA)
  • Character Discrimination (AREA)

Abstract


The present application proposes a method, apparatus, electronic device and storage medium for extracting text from images by combining RPA and AI, belonging to the field of image processing technology. The method includes: performing target detection on an image to be processed to determine the position information and type of each detection frame contained in the image, wherein the type of each detection frame includes: character, non-character, beginning of a text line and end of a text line; merging the detection frames whose type is character according to the position information and type of each detection frame, to determine the text boxes contained in the image; and performing text recognition on each text box to determine the text contained in the image. Through this image text extraction method combining RPA and AI, different types of data content in the image can be determined with a single detection pass, which simplifies the image text extraction process and improves the efficiency of text extraction.

Description

Image text extraction method and apparatus combining RPA and AI, and electronic device
Technical Field
The present application relates to the field of automation technologies, and in particular, to a method and apparatus for extracting image text by combining RPA and AI, an electronic device, and a storage medium.
Background
Robotic Process Automation (RPA) simulates human operations on a computer through specific robot software and automatically executes process tasks according to rules.
Artificial Intelligence (AI) is a technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence.
With the development of AI, Optical Character Recognition (OCR) technology has been applied in various fields to help people reduce repetitive, inefficient work, especially work that requires transcribing text information into a computer. Combining RPA technology with OCR technology has become a new trend in the RPA field, helping enterprises process text image data more efficiently and improving working efficiency.
However, in the related art, for documents containing multiple types of content such as text, tables and red chapters (official red seals), multiple models are generally required to perform text extraction on the different types of content in sequence, so the text extraction process is complicated and the efficiency is low.
Disclosure of Invention
The application provides an image text extraction method, apparatus, electronic device and storage medium combining RPA and AI, to solve the problem in the related art that, for documents containing multiple types of content such as text, tables and red chapters, multiple models are usually needed to extract text from the different types of content in sequence, making the text extraction process complicated and inefficient.
The image text extraction method combining RPA and AI provided by the embodiment of the application comprises the following steps: performing target detection on an image to be processed to determine position information of each detection frame and the type of each detection frame contained in the image to be processed, wherein the type of each detection frame comprises: characters, non-characters, beginning of text line, end of text line; combining the detection frames with the types of characters according to the position information of each detection frame and the types of each detection frame so as to determine each text frame contained in the image to be processed; and carrying out character recognition on each text box to determine characters contained in the image to be processed.
Optionally, in a possible implementation manner of the embodiment of the first aspect of the present application, the performing object detection on the image to be processed to determine position information of each detection frame and a type of each detection frame included in the image to be processed specifically includes:
extracting a plurality of dimension features of each detection frame from the image to be processed respectively;
performing attention mechanism learning on the plurality of dimension features to acquire adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame;
And determining the type of each detection frame according to the adjacent frame information of each detection frame and the corresponding head and tail information of the text line.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, before the extracting, from the image to be processed, a plurality of dimension features of each detection frame respectively, the method further includes:
preprocessing the image to be processed to obtain a plurality of feature images corresponding to the image to be processed;
the extracting the plurality of dimension features of each detection frame from the image to be processed comprises the following steps:
And respectively extracting a plurality of dimension characteristics of each detection frame from the plurality of characteristic diagrams.
Optionally, in still another possible implementation manner of the embodiment of the first aspect of the present application, the extracting, from the image to be processed, a plurality of dimension features of each detection frame includes:
and carrying out convolution processing on the image to be processed by using at least two filters to obtain at least two dimensional characteristics of each detection frame, wherein receptive fields of the at least two filters are different.
Optionally, in a further possible implementation manner of the embodiment of the first aspect of the present application, before the learning of the attention mechanism on the multiple dimensional features to obtain the adjacent frame information of each detection frame and the head and tail information of the text line corresponding to each detection frame, the method further includes:
And splicing the plurality of dimensional features to generate the features of each detection frame.
Optionally, in a further possible implementation manner of the embodiment of the first aspect of the present application, before the learning of the attention mechanism on the multiple dimensional features to obtain the adjacent frame information of each detection frame and the head and tail information of the text line corresponding to each detection frame, the method further includes:
and carrying out normalization processing on the plurality of dimension features.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, the position information of each detection frame includes coordinates of each detection frame in a first direction and an offset in a second direction, and according to the position information of each detection frame and a type of each detection frame, the detection frames with a type of characters are combined to determine each text frame included in the image to be processed, which specifically includes:
If the type of any detection frame is the beginning of a text line, acquiring candidate detection frames matched with the second coordinate in the first direction and the first coordinate from the detection frames according to the first coordinate of the any detection frame in the first direction;
Acquiring adjacent detection frames adjacent to any detection frame in the second direction from the candidate detection frames according to the first offset of the any detection frame in the second direction;
and if the type of the adjacent detection frame is a character, merging the adjacent detection frame with any detection frame.
Optionally, in a further possible implementation manner of the embodiment of the first aspect of the present application, after the merging the detection frames with the type being a character according to the location information of each detection frame and the type of each detection frame to determine each text frame included in the image to be processed, the method further includes:
carrying out connected domain analysis on each text box to determine the shape of the connected domain corresponding to each text box;
And if the connected domain shape corresponding to any text box is circular, determining that the red chapter is contained in any text box.
In another aspect, an apparatus for extracting image text combining RPA and AI according to an embodiment of the present application includes: the first determining module is configured to perform object detection on an image to be processed, so as to determine position information of each detection frame and a type of each detection frame included in the image to be processed, where the type of each detection frame includes: characters, non-characters, beginning of text line, end of text line; the second determining module is used for combining the detection frames with the types of characters according to the position information of each detection frame and the types of each detection frame so as to determine each text frame contained in the image to be processed; and the third determining module is used for carrying out character recognition on each text box so as to determine characters contained in the image to be processed.
Optionally, in a possible implementation manner of the embodiment of the first aspect of the present application, the first determining module specifically includes:
The extraction unit is used for respectively extracting a plurality of dimension characteristics of each detection frame from the image to be processed;
The first acquisition unit is used for learning the attention mechanism of the plurality of dimension features so as to acquire adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame;
and the determining unit is used for determining the type of each detection frame according to the adjacent frame information of each detection frame and the corresponding text line head and tail information.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, the first determining module further includes:
The second acquisition unit is used for preprocessing the image to be processed to acquire a plurality of feature images corresponding to the image to be processed;
The extraction unit specifically comprises:
and the extraction subunit is used for respectively extracting a plurality of dimension features of each detection frame from the plurality of feature maps.
Optionally, in a further possible implementation manner of the embodiment of the first aspect of the present application, the extracting unit specifically includes:
And the acquisition subunit is used for carrying out convolution processing on the image to be processed by utilizing at least two filters so as to acquire at least two dimensional characteristics of each detection frame, wherein receptive fields of the at least two filters are different.
Optionally, in a further possible implementation manner of the embodiment of the first aspect of the present application, the first determining module further includes:
And the splicing unit is used for splicing the plurality of dimension characteristics to generate the characteristics of each detection frame.
Optionally, in a further possible implementation manner of the embodiment of the first aspect of the present application, the first determining module further includes:
And the normalization unit is used for performing normalization processing on the plurality of dimension features.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, the position information of each detection frame includes a coordinate of each detection frame in a first direction and an offset in a second direction, and the second determining module specifically includes:
a third obtaining unit, configured to obtain, when the type of any detection frame is a text line beginning, candidate detection frames matching with a second coordinate in a first direction and a first coordinate from the detection frames according to a first coordinate of the any detection frame in the first direction;
A fourth obtaining unit, configured to obtain, from the candidate detection frames, an adjacent detection frame adjacent to the arbitrary detection frame in the second direction according to a first offset of the arbitrary detection frame in the second direction;
and the merging unit is used for merging the adjacent detection frames with any detection frame when the type of the adjacent detection frames is characters.
Optionally, in a further possible implementation manner of the embodiment of the first aspect of the present application, the apparatus further includes:
a fourth determining module, configured to perform connected domain analysis on each text box, so as to determine a connected domain shape corresponding to each text box;
and the fifth determining module is used for determining that the red chapter is contained in any text box when the connected domain corresponding to the text box is circular in shape.
In another aspect, an embodiment of the present application provides an electronic device, including: the system comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor is used for realizing the method for extracting the image text combining the RPA and the AI when executing the program.
In a further aspect, the embodiment of the present application provides a computer readable storage medium, on which a computer program is stored, where the program is executed by a processor to implement a method for extracting image text combining RPA and AI as described above.
In a further aspect of the present application, a computer program is provided, which when executed by a processor, implements the method for extracting image text combining RPA and AI according to the embodiment of the present application.
According to the method, the device, the electronic equipment, the computer readable storage medium and the computer program for extracting the image characters combining the RPA and the AI, the position information of each detection frame and the type of each detection frame contained in the image to be processed are determined by carrying out target detection on the image to be processed, and the detection frames with the types of characters are combined according to the position information of each detection frame and the type of each detection frame so as to determine each text frame contained in the image to be processed, and then each text frame is subjected to character recognition so as to determine the characters contained in the image to be processed. Therefore, the type of each detection frame is determined while the target detection is carried out on the image to be processed, so that the data content of different types in the image can be determined through one-time detection, the process of image character extraction is simplified, and the character extraction efficiency is improved.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flow chart of an image text extraction method combining RPA and AI according to an embodiment of the application;
FIG. 2 is a schematic diagram showing the positions of an image to be processed and a detection frame;
FIG. 3 is a flowchart of another method for extracting image text combining RPA and AI according to an embodiment of the application;
FIG. 4 is a flowchart of another method for extracting image text combining RPA and AI according to an embodiment of the application;
Fig. 5 is a schematic structural diagram of an image text extraction device combining RPA and AI according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the like or similar elements throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present application and should not be construed as limiting the application.
To address the problems in the related art that, for documents containing text, tables, red chapters and other types of content, multiple models are generally needed to extract text from the different content types in sequence, making the text extraction process complicated and inefficient, the embodiment of the application provides an image text extraction method combining RPA and AI.
According to the image text extraction method combining the RPA and the AI, target detection is carried out on the image to be processed to determine the position information of each detection frame and the type of each detection frame contained in the image to be processed, the detection frames with the types of characters are combined according to the position information of each detection frame and the type of each detection frame to determine each text frame contained in the image to be processed, and then text recognition is carried out on each text frame to determine the text contained in the image to be processed. Therefore, the type of each detection frame is determined while the target detection is carried out on the image to be processed, so that the data content of different types in the image can be determined through one-time detection, the process of image character extraction is simplified, and the character extraction efficiency is improved.
The method, the device, the electronic equipment, the storage medium and the computer program for extracting the image text combining the RPA and the AI provided by the application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of an image text extraction method combining RPA and AI according to an embodiment of the present application.
As shown in fig. 1, the method for extracting the image text combining the RPA and the AI comprises the following steps:
step 101, performing object detection on an image to be processed to determine position information of each detection frame and a type of each detection frame included in the image to be processed, wherein the type of each detection frame includes: characters, non-characters, beginning of text line, end of text line.
It should be noted that RPA technology can understand existing applications on an electronic device through the user interface and automate repetitive, rule-based, high-volume routine operations, such as repeatedly reading mails, reading Office components, operating databases, web pages and client software, collecting data and performing tedious computation, so as to generate files and reports at scale, thereby greatly reducing labor cost and effectively improving office efficiency. Therefore, in the scenario of extracting image text, an RPA program can be configured in the electronic device used for extracting image text, so that the electronic device can automatically extract text from the acquired image according to the rules set in the RPA program.
In practical use, the method for extracting the image text combining the RPA and the AI of the embodiment of the application can be applied to any scene for extracting the text in the image, and the embodiment of the application is not limited to the method. For example, the method can be applied to the recording scene of paper files such as certificates, notes and the like.
The image to be processed may refer to an image acquired by the RPA robot. For example, when the image text extraction method combining RPA and AI is applied to a document-uploading scenario in an accounting department, the image to be processed can be an image of various expense documents, such as travel, transportation or entertainment receipts, uploaded by a user through an electronic device and acquired by the RPA robot.
The position information of the detection frame may include coordinates of each vertex of the detection frame in the image to be processed; or when the coordinate system corresponding to the image to be processed comprises a first direction and a second direction, the position information of the detection frame can also comprise the offset of the coordinate of the detection frame in the first direction and the coordinate of the detection frame in the second direction, so that the specific position of the detection frame in the image to be processed can be determined through the position information of the detection frame.
For example, fig. 2 is a schematic diagram of the positions of the image to be processed and a detection frame. Here, O is the origin of the coordinate system corresponding to the image to be processed 20, the Y axis is the first direction and the X axis is the second direction of that coordinate system; y1 is the coordinate of the detection frame 21 in the first direction and x1 is the offset of the detection frame 21 in the second direction, that is, the position information of the detection frame 21 is (y1, x1).
As a possible implementation manner, an OCR algorithm based on CTPN (Detecting Text in Natural Image with Connectionist Text Proposal Network) may be used to perform target detection on the image to be processed, so as to determine the position information and the type of each detection frame contained in the image to be processed. Specifically, a Transformer may be used within the CTPN framework in place of the Long Short-Term Memory (LSTM) network to perform target detection on the image to be processed and obtain text line association information, so that not only the position information of the detection frames corresponding to the targets in the image and whether the type of each detection frame is a character can be determined, but also whether each detection frame is the beginning of a text line, the end of a text line, and so on.
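For illustration only, the following is a minimal sketch of such a detection head in which a Transformer encoder takes the place of the LSTM and predicts, for every anchor, two position values and four type scores. It is not the patented model: the module structure, channel sizes, anchor count and the use of torch.nn.TransformerEncoder are all assumptions.

    # Hypothetical CTPN-style detection head with a Transformer encoder in
    # place of the BiLSTM; every size below is an illustrative assumption.
    import torch
    import torch.nn as nn

    class TextDetectionHead(nn.Module):
        def __init__(self, in_channels=512, d_model=256, num_anchors=10):
            super().__init__()
            self.reduce = nn.Conv2d(in_channels, d_model, 3, padding=1)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            # Per anchor: 2 position values and 4 type scores
            # (character / non-character / line beginning / line end).
            self.coord_head = nn.Linear(d_model, num_anchors * 2)
            self.score_head = nn.Linear(d_model, num_anchors * 4)

        def forward(self, feat):                        # feat: (B, C, H, W)
            x = self.reduce(feat)
            b, c, h, w = x.shape
            x = x.permute(0, 2, 3, 1).reshape(b * h, w, c)   # one row = one sequence
            x = self.encoder(x)                         # self-attention along each row
            coords = self.coord_head(x).reshape(b, h, w, -1, 2)
            scores = self.score_head(x).reshape(b, h, w, -1, 4)
            return coords, scores
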
And 102, merging the detection frames with the types of characters according to the position information of each detection frame and the types of each detection frame so as to determine each text frame contained in the image to be processed.
The text box contained in the image to be processed can comprise a complete and independent text in the image to be processed.
In the embodiment of the application, after the position information and type of each detection frame in the image to be processed are determined, the detection frames in the same line can be identified according to their position information, and whether a detection frame is the beginning of a text line can then be determined according to its type. If a detection frame A is the beginning of a text line, the next detection frame B that is in the same line as and adjacent to detection frame A is determined according to the position information of detection frame A. If the type of detection frame B is a character and detection frame B is the end of a text line, detection frame A and detection frame B can be combined into one text box; if the type of detection frame B is a character but it is not the end of a text line, detection frame A and detection frame B are combined, the next detection frame C adjacent to detection frame B is determined, and the above steps are repeated to decide whether detection frame C can be combined with detection frames A and B, until a detection frame D whose type is the end of a text line is reached, at which point all the detection frames between detection frame A and detection frame D are combined to generate one text box. Repeating the above steps determines all the text boxes included in the image to be processed.
And 103, performing character recognition on each text box to determine characters contained in the image to be processed.
In the embodiment of the application, after the text boxes included in the image to be processed are determined, any text recognition algorithm can be adopted to recognize the text in each text box so as to determine the text corresponding to each text box and further determine the text included in the image to be processed.
According to the image text extraction method combining the RPA and the AI, target detection is carried out on the image to be processed to determine the position information of each detection frame and the type of each detection frame contained in the image to be processed, the detection frames with the types of characters are combined according to the position information of each detection frame and the type of each detection frame to determine each text frame contained in the image to be processed, and then text recognition is carried out on each text frame to determine the text contained in the image to be processed. Therefore, the type of each detection frame is determined while the target detection is carried out on the image to be processed, so that the data content of different types in the image can be determined through one-time detection, the process of image character extraction is simplified, and the character extraction efficiency is improved.
In one possible implementation form of the application, multiple dimension features of the detection frame can be extracted to determine the type of the detection frame, so that the accuracy of image text extraction is further improved.
The method for extracting the image text combining the RPA and the AI according to the embodiment of the application is further described below with reference to FIG. 3.
Fig. 3 is a flowchart of another method for extracting image text by combining RPA and AI according to an embodiment of the present application.
As shown in fig. 3, the method for extracting the image text combining the RPA and the AI comprises the following steps:
step 201, performing object detection on the image to be processed to determine position information of each detection frame included in the image to be processed.
The specific implementation process and principle of the above step 201 may refer to the detailed description of the above embodiment, which is not repeated herein.
Step 202, extracting a plurality of dimension features of each detection frame from the image to be processed.
The plurality of dimension features are features that represent a detection frame in the image to be processed at different granularities.
In the embodiment of the application, convolution processing can be performed on the image to be processed with convolution kernels of different sizes to generate multiple dimension features of each detection frame in the image to be processed, where the result of convolving the image with one convolution kernel is used as one dimension feature of each detection frame; alternatively, convolution processing can be performed with convolution kernels of the same size but different convolution modes, where the convolution result corresponding to one convolution mode is one dimension feature of each detection frame. In this way, features of the image to be processed are extracted at different granularities, which further improves the accuracy of detection frame recognition.
As a possible implementation manner, the image to be processed may be convolved by different filters to generate different dimensional characteristics of each detection frame. That is, in one possible implementation manner of the embodiment of the present application, the step 202 may include:
And carrying out convolution processing on the image to be processed by using at least two filters to obtain at least two dimensional characteristics of each detection frame, wherein receptive fields of the at least two filters are different.
The filter may be a convolution kernel with a dilation (expansion) coefficient. For example, the filter may be 3×3 in size, with a dilation coefficient of 1, 2, 5, etc.
As a possible implementation manner, the image to be processed may be convolved by n1 filters with a size of 3×3 (i.e., a receptive field of 3×3) to generate n1-dimensional features of each detection frame in the image to be processed; dilated convolution may then be performed on the image to be processed by n2 filters with a size of 3×3 and a dilation coefficient of 2 (i.e., a receptive field of 7×7) to generate n2-dimensional features of each detection frame; finally, dilated convolution may be performed on the image to be processed by n3 filters with a size of 3×3 and a dilation coefficient of 5 (i.e., a receptive field of 19×19) to generate n3-dimensional features of each detection frame, so that (n1+n2+n3)-dimensional features of each detection frame in the image to be processed are obtained.
It should be noted that, in actual use, the specific values of n1, n2 and n3 may be determined according to actual needs, which is not limited in the embodiments of the present application. For example, n1 may be 256, n2 may be 128, and n3 may be 128.
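As a minimal sketch of this step (assuming PyTorch and the example channel counts n1=256, n2=128, n3=128; mapping the expansion coefficient onto the dilation argument is also an assumption), the parallel branches could look like:

    # Three parallel 3x3 branches with different dilation rates, so each
    # branch sees the input with a different receptive field.
    import torch.nn as nn

    class MultiDilationBranches(nn.Module):
        def __init__(self, in_channels=3):
            super().__init__()
            self.branch1 = nn.Conv2d(in_channels, 256, 3, padding=1, dilation=1)
            self.branch2 = nn.Conv2d(in_channels, 128, 3, padding=2, dilation=2)
            self.branch3 = nn.Conv2d(in_channels, 128, 3, padding=5, dilation=5)

        def forward(self, x):
            # padding = dilation keeps the spatial size unchanged for 3x3 kernels
            return self.branch1(x), self.branch2(x), self.branch3(x)
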
Furthermore, before extracting the multidimensional features of the image to be processed, the image to be processed can be preprocessed, and a feature map corresponding to the image to be processed is generated, so that the accuracy of identifying the image to be processed is further improved. That is, in one possible implementation manner of the embodiment of the present application, before the step 202, the method may further include:
preprocessing an image to be processed to obtain a plurality of feature images corresponding to the image to be processed;
accordingly, the step 202 may include:
and respectively extracting a plurality of dimension characteristics of each detection frame from the plurality of characteristic diagrams.
As a possible implementation manner, DenseNet121 may be used to perform feature extraction on the image to be processed to generate multiple feature maps corresponding to the image. For example, feature maps with a total size of 512×(pic_height/8)×(pic_width/8) may be generated, that is, 512 feature maps each of size (pic_height/8)×(pic_width/8), where pic_height is the height and pic_width is the width of the image to be processed. After the feature maps corresponding to the image to be processed are determined, convolution processing can be performed on them, and the multiple dimension features of each detection frame are extracted from the resulting feature maps.
Specifically, the same convolution processing as described above for the image to be processed may be applied to the feature maps, so as to extract the multiple dimension features of each detection frame from the feature maps. For example, if there are 512 feature maps corresponding to the image to be processed, the 512 feature maps can be convolved by n1 filters with a size of 3×3×512 to generate n1-dimensional features of each detection frame; dilated convolution can be performed on the 512 feature maps by n2 filters with a size of 3×3×512 and a dilation coefficient of 2 to generate n2-dimensional features of each detection frame; finally, dilated convolution can be performed on the 512 feature maps by n3 filters with a size of 3×3×512 and a dilation coefficient of 5 to generate n3-dimensional features of each detection frame, so that (n1+n2+n3)-dimensional features of each detection frame of the image to be processed can be generated.
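A rough way to obtain such feature maps is sketched below. The torchvision layer names and the truncation point are assumptions, chosen because cutting DenseNet-121 after its second dense block yields 512-channel maps at 1/8 of the input resolution, matching the sizes mentioned above.

    # Truncated DenseNet-121 backbone producing 512 x (H/8) x (W/8) feature maps.
    import torch
    import torch.nn as nn
    from torchvision import models

    backbone = models.densenet121(weights=None).features
    layers = []
    for name, module in backbone.named_children():
        layers.append(module)
        if name == "denseblock2":       # stop at stride 8 / 512 channels
            break
    feature_extractor = nn.Sequential(*layers)

    with torch.no_grad():
        fmap = feature_extractor(torch.randn(1, 3, 512, 512))
    print(fmap.shape)                   # expected: torch.Size([1, 512, 64, 64])
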
And 203, performing attention mechanism learning on the plurality of dimension features to acquire adjacent frame information of each detection frame and head and tail information of a text line corresponding to each detection frame.
In the embodiment of the application, the multiple dimension features can be learned by a target detection model comprising a multi-layer decoder and an attention mechanism, so as to acquire the adjacent frame information of each detection frame and the corresponding text line head and tail information, that is, the type information of the detection frames adjacent to each detection frame, and whether each detection frame is the beginning or the end of a text line.
As a possible implementation manner, when the image to be processed is recognized by the object detection model, if it is detected that k detection frames are included in the image to be processed, the object detection model may output 2k pieces of coordinate information for representing the position information of each detection frame (for example, the coordinate of each detection frame in the first direction and the offset in the second direction), and may output 4k pieces of score information for representing the category of each detection frame; that is, each detection frame may correspond to 4 pieces of score information, which respectively represent the probability that the detection frame is a character, the probability that it is a non-character, the probability that it is the beginning of a text line, and the probability that it is the end of a text line.
Furthermore, before the attention mechanism learning is performed on the plurality of dimensional features, the plurality of dimensional features can be fused to integrally identify the plurality of dimensional features, so that the accuracy of identifying the detection frame in the image to be processed is further improved. That is, in one possible implementation manner of the embodiment of the present application, before the step 203, the method may further include:
and splicing the plurality of dimension features to generate the features of each detection frame.
As a possible implementation manner, in order to enable the target detection model to combine features of different granularities of the image to be processed during recognition, the multiple dimension features of each detection frame can be spliced to generate the feature of each detection frame, that is, each detection frame is represented by a single integral feature vector. In this way, the feature of each detection frame contains feature information of the image to be processed at different granularities, which improves the accuracy of identifying the detection frames in the image to be processed.
For example, the image to be processed is convolved by 256 filters with a size of 3×3 to generate 256-dimensional features of each detection frame; dilated convolution is performed on the image to be processed by 128 filters with a size of 3×3 and a dilation coefficient of 2 to generate 128-dimensional features of each detection frame; and dilated convolution is performed by 128 filters with a size of 3×3 and a dilation coefficient of 5 to generate another 128-dimensional features of each detection frame. The 256-dimensional features and the two sets of 128-dimensional features can then be spliced to generate 512-dimensional features of each detection frame of the image to be processed.
As a possible implementation manner, after the multi-dimensional features are spliced, the spliced features can be compressed through a fully connected layer, so that the size of the spliced features and the amount of computation needed to process them are reduced. For example, a 512-dimensional spliced feature can be compressed to 256 dimensions through the fully connected layer.
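A minimal sketch of the splicing and compression described above, assuming per-box feature vectors of 256, 128 and 128 dimensions and the 512-to-256 compression mentioned in the example:

    import torch
    import torch.nn as nn

    # Placeholder per-detection-frame features from the three branches.
    f1, f2, f3 = torch.randn(100, 256), torch.randn(100, 128), torch.randn(100, 128)

    box_features = torch.cat([f1, f2, f3], dim=-1)   # spliced: (num_boxes, 512)
    compress = nn.Linear(512, 256)                   # fully connected compression
    box_features = compress(box_features)            # (num_boxes, 256)
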
Further, since features determined by different means may have different scales, features with small magnitudes are easily ignored, which affects the reliability of recognizing the image to be processed. That is, in one possible implementation manner of the embodiment of the present application, before the step 203, the method may further include:
And carrying out normalization processing on the plurality of dimension features.
As a possible implementation manner, after the multiple dimension features of each detection frame are determined in different manners, normalization processing may be performed on them so that all dimension features lie in the same numerical range, which reduces the influence of their differing scales on the recognition accuracy of the image to be processed.
As another possible implementation manner, after the multiple dimensional features are spliced, normalization processing may be performed on the spliced features.
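As a small illustration (the choice of the L2 norm is an assumption; the text only states that the features are normalized):

    import torch
    import torch.nn.functional as F

    box_features = torch.randn(100, 512)             # placeholder spliced features
    # L2 normalization puts every feature on a comparable numeric scale.
    box_features = F.normalize(box_features, p=2, dim=-1)
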
Step 204, determining the type of each detection frame according to the adjacent frame information of each detection frame and the corresponding text line head and tail information, wherein the type of each detection frame comprises: characters, non-characters, beginning of text line, end of text line.
In the embodiment of the application, after the adjacent detection frame information of each detection frame and the corresponding text line head and tail information in the image to be processed are determined through the attention mechanism, the type of each detection frame can be determined according to the adjacent detection frame information of each detection frame and the text line head and tail information.
As a possible implementation manner, the type of each detection frame may also be determined only according to the text line head and tail information corresponding to that detection frame. For example, if the 4 pieces of score information of detection frame A output by the target detection model are [0.99, 0, 1, 0], representing in order the probability that detection frame A is a character, a non-character, the beginning of a text line and the end of a text line, the type of the detection frame can be determined to be a character and the beginning of a text line.
As another possible implementation manner, the type of each detection frame may also be determined by the adjacent frame information of each detection frame and the corresponding text line head and tail information together. Specifically, the type of each detection frame can be determined according to the head and tail information of the text line corresponding to each detection frame, and then the type of each detection frame is checked according to the adjacent frame information of each detection frame, so as to assist in judging whether the determined type of the detection frame is accurate or not, and further improve the accuracy of determining the type of the detection frame.
For example, suppose the 4 pieces of score information of detection frame A output by the target detection model are [0.99, 0, 1, 0], representing in order the probability that detection frame A is a character, a non-character, the beginning of a text line and the end of a text line, so that the type of the detection frame is preliminarily determined to be a character and the beginning of a text line. If the score information of the adjacent frame B located before detection frame A is [0.9, 0.05, 0.1, 0.95] and the score information of the adjacent frame C located after detection frame A is [0.92, 0.1, 0.1, 0.1], then the probability that detection frame A is the beginning of a text line can be confirmed to be very high, and its type can be determined to be a character and the beginning of a text line.
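The score interpretation above can be sketched as follows (the 0.5 threshold and the data layout are assumptions):

    TYPES = ("character", "non_character", "line_begin", "line_end")

    def frame_types(scores, threshold=0.5):
        """scores: [p_character, p_non_character, p_line_begin, p_line_end]."""
        return {name for name, p in zip(TYPES, scores) if p >= threshold}

    box_a  = frame_types([0.99, 0.00, 1.00, 0.00])   # {'character', 'line_begin'}
    prev_b = frame_types([0.90, 0.05, 0.10, 0.95])   # the previous box ends a line
    if "line_begin" in box_a and "line_end" in prev_b:
        print("detection frame A very likely starts a new text line")
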
And step 205, merging the detection frames with the types of characters according to the position information of each detection frame and the types of each detection frame so as to determine each text frame contained in the image to be processed.
And 206, performing character recognition on each text box to determine characters contained in the image to be processed.
The specific implementation and principles of the steps 205-206 may refer to the detailed description of the embodiments, which is not repeated here.
According to the method for extracting the image characters combining the RPA and the AI, the plurality of dimension characteristics of each detection frame are extracted from the image to be processed respectively, attention mechanism learning is conducted on the plurality of dimension characteristics, adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame are obtained, then the type of each detection frame is determined according to the adjacent frame information of each detection frame and the corresponding text line head and tail information, and then the detection frames with the types of characters are combined according to the position information of each detection frame and the type of each detection frame, so that each text frame contained in the image to be processed is determined, and character recognition is conducted on each text frame, so that characters contained in the image to be processed are determined. Therefore, by extracting multidimensional features of different granularities of the image to be processed, each detection frame in the image to be processed is subjected to feature representation, so that different types of data contents in the image can be determined through one-time detection, the process of extracting the characters of the image is simplified, the efficiency of extracting the characters is improved, and the accuracy and the reliability of extracting the characters are further improved.
In one possible implementation form of the application, the detection frames of the character types can be combined according to the position information of the detection frames to determine each text frame included in the image, and connected domain analysis can be carried out on the text frames to realize the identification of red chapters in the image, so that the practicability and the universality of the image and text extraction are further improved.
The method for extracting the image text combining the RPA and the AI according to the embodiment of the present application is further described below with reference to fig. 4.
Fig. 4 is a flowchart of another method for extracting image text by combining RPA and AI according to an embodiment of the present application.
As shown in fig. 4, the method for extracting the image text combining the RPA and the AI comprises the following steps:
Step 301, performing object detection on an image to be processed to determine position information of each detection frame and a type of each detection frame included in the image to be processed, where the type of each detection frame includes: characters, non-characters, beginning of text line, end of text line.
The specific implementation process and principle of the above step 301 may refer to the detailed description of the above embodiments, which is not repeated herein.
In step 302, if the type of any detection frame is the beginning of the text line, a candidate detection frame matching the second coordinate in the first direction with the first coordinate is obtained from each detection frame according to the first coordinate in the first direction of any detection frame.
The position information of the detection frames comprises coordinates of each detection frame in a first direction and offset in a second direction. It should be noted that, the first direction may refer to a Y-axis direction of a coordinate system corresponding to the image to be processed, and the second direction may refer to an X-axis direction of the coordinate system corresponding to the image to be processed. Specific schematic diagrams may be explained with reference to fig. 2 and fig. 2 in the above embodiments, and are not repeated here.
In the embodiment of the application, if the type of a detection frame is the beginning of a text line, the detection frames that belong to the same independent piece of text can be determined from the detection frames located in the same line as it and then combined with it. Specifically, assuming that the type of detection frame A is the beginning of a text line, the detection frames whose second coordinate in the first direction differs from the first coordinate of detection frame A by less than a first threshold can be taken as candidate detection frames whose second coordinate matches the first coordinate, that is, candidate detection frames located in the same line as detection frame A.
It should be noted that, in actual use, the specific value of the first threshold may be determined according to the actual requirement and the height of the detection frame, which is not limited in the embodiment of the present application. For example, the first threshold may be 1/3 of the height of the detection frame.
Step 303, acquiring the adjacent detection frame adjacent to any detection frame in the second direction from the candidate detection frames according to the first offset of any detection frame in the second direction.
In the embodiment of the application, after the candidate detection frames located in the same line as the detection frame whose type is the beginning of a text line are determined, the detection frames adjacent to that detection frame can be determined according to its first offset in the second direction and the second offset of each candidate detection frame in the second direction. Specifically, assuming that detection frame A is of type beginning of a text line, if the difference between the second offset of a candidate detection frame B corresponding to detection frame A and the first offset of detection frame A is smaller than or equal to the width of the detection frame, candidate detection frame B may be determined to be an adjacent detection frame of detection frame A; otherwise, candidate detection frame B may be determined not to be adjacent to detection frame A.
Step 304, if the type of the adjacent detection frame is a character, merging the adjacent detection frame with any detection frame.
In the embodiment of the application, after determining the adjacent detection frames of the detection frames, if the type of the adjacent detection frames is a character, the adjacent detection frames and the detection frames can be combined. Then, in the same manner, an adjacent detection frame adjacent to the adjacent detection frame is determined, and it is determined whether or not the merging process can be performed. It can be understood that after traversing all the detection boxes in the image to be processed by the method, all the text boxes included in the image to be processed can be determined.
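The merging rule of steps 302-304 can be sketched as follows, using the thresholds stated above (same line if the difference in the first direction is below one third of the box height, adjacent if the difference in the second direction is at most one box width); the data layout is an assumption.

    def merge_text_boxes(boxes):
        """boxes: list of dicts with keys x, y, w, h and 'types' (a set of strings).
        Returns merged text boxes as (x, y, w, h) tuples."""
        text_boxes, used = [], set()
        for i, box in enumerate(boxes):
            if "line_begin" not in box["types"] or i in used:
                continue
            line, current = [box], box
            while "line_end" not in current["types"]:
                candidates = [
                    (j, c) for j, c in enumerate(boxes)
                    if j not in used and c is not current
                    and "character" in c["types"]
                    and abs(c["y"] - box["y"]) < box["h"] / 3        # same line
                    and 0 <= c["x"] - current["x"] <= current["w"]   # adjacent
                ]
                if not candidates:
                    break
                j, current = min(candidates, key=lambda jc: jc[1]["x"])
                used.add(j)
                line.append(current)
            left = min(c["x"] for c in line)
            right = max(c["x"] + c["w"] for c in line)
            text_boxes.append((left, box["y"], right - left, box["h"]))
        return text_boxes
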
And 305, carrying out connected domain analysis on each text box to determine the shape of the connected domain corresponding to each text box.
In the embodiment of the application, the red chapter contained in the image to be processed can also be identified by performing connected domain analysis on the text boxes. Specifically, since a text box containing character content is generally square, the connected domain generated when it is analyzed is also generally square. A red chapter, by contrast, is generally circular, usually contains different types of content such as characters and images, and its characters are not distributed in rows, so the red chapter portion is generally divided into a plurality of text boxes that share the same image characteristics. Therefore, when connected domain analysis is performed on each text box included in the image to be processed, the text boxes corresponding to the red chapter are connected together to form a complete connected domain, and the connected domain corresponding to the red chapter is usually circular.
And 306, if the connected domain corresponding to any text box is circular, determining that any text box contains a red chapter.
In the embodiment of the application, since the shape of a red chapter is generally circular while the connected domain of a text box containing ordinary characters is generally square, after connected domain analysis is performed on each text box in the image to be processed, a text box whose corresponding connected domain shape is circular can be determined to be a text box containing a red chapter.
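One way to approximate this connected-domain check is sketched below with OpenCV: the rectangles of a group of text boxes are rasterised into a mask, and a connected region whose convex hull fills most of its minimum enclosing circle is treated as circular. The roundness measure and its 0.7 threshold are assumptions; the text above only states that the connected domain of a red chapter is usually circular.

    import cv2
    import numpy as np

    def forms_circular_domain(boxes, image_shape, roundness_threshold=0.7):
        """boxes: list of (x, y, w, h) text boxes; True if their union forms a
        roughly circular connected domain (a likely red chapter / red seal)."""
        mask = np.zeros(image_shape[:2], dtype=np.uint8)
        for x, y, w, h in boxes:
            cv2.rectangle(mask, (x, y), (x + w, y + h), 255, thickness=-1)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        for contour in contours:
            hull = cv2.convexHull(contour)
            hull_area = cv2.contourArea(hull)
            (_, _), radius = cv2.minEnclosingCircle(hull)
            circle_area = np.pi * radius * radius + 1e-6
            # A ring of seal text boxes fills most of its enclosing circle,
            # whereas an ordinary text line (a long thin strip) does not.
            if hull_area / circle_area > roundness_threshold:
                return True
        return False
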
In step 307, text recognition is performed on each text box to determine the text contained in the image to be processed.
The specific implementation process and principle of the above step 307 may refer to the detailed description of the above embodiment, which is not repeated here.
According to the method for extracting image text combining RPA and AI provided by the embodiment of the application, the position information and the type of each detection frame contained in the image to be processed are determined; when a detection frame is of the type beginning of a text line, candidate detection frames are determined according to the first coordinate of that detection frame in the first direction and the second coordinates of the other detection frames in the first direction; the adjacent detection frame that is adjacent to the detection frame and of the type character is merged with it according to the first offset of the detection frame in the second direction and the types of the candidate detection frames; the red seal in the image to be processed is then determined through connected domain analysis, and text recognition is performed on each text box to determine the text contained in the image to be processed. Therefore, by performing connected domain analysis on the text boxes to identify the red seal contained in the image, different types of data content in the image can be determined through a single detection pass, which simplifies the image text extraction process, improves extraction efficiency, and further improves the practicability and universality of image text extraction.
In order to implement the above embodiments, the application further provides an image text extraction device combining RPA and AI.
Fig. 5 is a schematic structural diagram of an image text extraction device combining RPA and AI according to an embodiment of the present application.
As shown in fig. 5, the image-text extraction device 40 combining RPA and AI includes:
The first determining module 41 is configured to perform object detection on an image to be processed, so as to determine position information of each detection frame and a type of each detection frame included in the image to be processed, where the type of each detection frame includes: characters, non-characters, beginning of text line, end of text line;
a second determining module 42, configured to combine the detection boxes with the types being characters according to the position information of each detection box and the type of each detection box, so as to determine each text box included in the image to be processed;
and a third determining module 43, configured to perform text recognition on each text box to determine the text included in the image to be processed.
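For orientation, the following minimal Python sketch mirrors the three-module structure of the device 40 in Fig. 5; the detector, merger and recognizer objects passed in are placeholders, since the concrete models behind each module are not fixed here.

class ImageTextExtractionDevice:
    """Structural sketch of the RPA+AI image text extraction device of Fig. 5."""

    def __init__(self, detector, merger, recognizer):
        self.first_determining_module = detector    # target detection: box positions + types
        self.second_determining_module = merger     # merge character boxes into text boxes
        self.third_determining_module = recognizer  # text recognition on each text box

    def extract(self, image):
        boxes = self.first_determining_module(image)        # position info + type per frame
        text_boxes = self.second_determining_module(boxes)  # merged text boxes
        return [self.third_determining_module(image, tb) for tb in text_boxes]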
In practical use, the image text extraction device combining the RPA and the AI provided by the embodiment of the application can be configured in any electronic equipment to execute the image text extraction method combining the RPA and the AI.
According to the image text extraction device combining RPA and AI provided by the embodiment of the application, target detection is performed on the image to be processed to determine the position information and type of each detection frame contained in it; the detection frames of the type character are merged according to their position information and types to determine each text box contained in the image to be processed; and text recognition is then performed on each text box to determine the text contained in the image. Since the type of each detection frame is determined at the same time as target detection is performed on the image to be processed, different types of data content in the image can be determined through a single detection pass, which simplifies the image text extraction process and improves extraction efficiency.
In one possible implementation manner of the present application, the first determining module 41 specifically includes:
The extraction unit is used for respectively extracting a plurality of dimension characteristics of each detection frame from the image to be processed;
The first acquisition unit is used for performing attention mechanism learning on the plurality of dimension features so as to acquire adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame;
and the determining unit is used for determining the type of each detection frame according to the adjacent frame information of each detection frame and the corresponding text line head and tail information.
Further, in another possible implementation manner of the present application, the first determining module 41 further includes:
the second acquisition unit is used for preprocessing the image to be processed to acquire a plurality of feature images corresponding to the image to be processed;
correspondingly, the extraction unit specifically comprises:
And the extraction subunit is used for respectively extracting the plurality of dimension features of each detection frame from the plurality of feature maps.
Further, in still another possible implementation form of the present application, the extracting unit specifically includes:
And the acquisition subunit is used for carrying out convolution processing on the image to be processed by using at least two filters so as to acquire at least two dimensional characteristics of each detection frame, wherein receptive fields of the at least two filters are different.
Further, in still another possible implementation form of the present application, the first determining module 41 further includes:
and the splicing unit is used for splicing the plurality of dimension characteristics to generate the characteristics of each detection frame.
Further, in still another possible implementation form of the present application, the first determining module 41 further includes:
And the normalization unit is used for performing normalization processing on the plurality of dimension features.
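As a hedged illustration of the extraction, splicing and normalization units described above, the following PyTorch sketch pools per-box features produced by two filters with different receptive fields, concatenates (splices) them, normalizes the result, and applies self-attention across the boxes of one image so that adjacent-frame and line head/tail context can be aggregated. The channel sizes, kernel sizes, pooling choice and attention configuration are assumptions made for the example, not the application's prescribed architecture.

import torch
import torch.nn as nn

class MultiScaleBoxFeatures(nn.Module):
    def __init__(self, in_channels: int = 256, out_channels: int = 128):
        super().__init__()
        # Two filters with different receptive fields (3x3 vs 5x5).
        self.conv_small = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.conv_large = nn.Conv2d(in_channels, out_channels, kernel_size=5, padding=2)
        self.norm = nn.LayerNorm(2 * out_channels)
        self.attn = nn.MultiheadAttention(embed_dim=2 * out_channels, num_heads=4,
                                          batch_first=True)

    def forward(self, box_crops: torch.Tensor) -> torch.Tensor:
        # box_crops: (N, C, H, W) feature-map crops, one per detection frame.
        f_small = self.conv_small(box_crops).mean(dim=(2, 3))   # (N, out_channels)
        f_large = self.conv_large(box_crops).mean(dim=(2, 3))   # (N, out_channels)
        spliced = torch.cat([f_small, f_large], dim=1)          # splice the dimension features
        normed = self.norm(spliced)                             # normalized per-box features
        # Self-attention over all boxes of one image: each box attends to the others,
        # which is one way of gathering adjacent-frame and line head/tail context.
        seq = normed.unsqueeze(0)                               # (1, N, 2*out_channels)
        attended, _ = self.attn(seq, seq, seq)
        return attended.squeeze(0)                              # (N, 2*out_channels)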
Further, in another possible implementation manner of the present application, the position information of each detection frame includes a coordinate of each detection frame in a first direction and an offset of each detection frame in a second direction; correspondingly, the second determining module 42 specifically includes:
A third obtaining unit, configured to, when the type of any detection frame is the beginning of a text line, obtain from the detection frames candidate detection frames whose second coordinates in the first direction match the first coordinate, according to the first coordinate of that detection frame in the first direction;
A fourth acquisition unit configured to acquire, from the candidate detection frames, an adjacent detection frame adjacent to any one of the detection frames in the second direction, according to the first offset amount of the any one of the detection frames in the second direction;
And the merging unit is used for merging the adjacent detection frames with any detection frame when the type of the adjacent detection frames is characters.
Further, in still another possible implementation form of the present application, the above-mentioned image text extraction device 40 combining RPA and AI further includes:
a fourth determining module, configured to perform connected domain analysis on each text box, so as to determine a connected domain shape corresponding to each text box;
And the fifth determining module is used for determining that a red seal is contained in any text box when the shape of the connected domain corresponding to that text box is circular.
It should be noted that the explanation of the embodiment of the method for extracting image text combining RPA and AI shown in fig. 1,3 and 4 is also applicable to the device 40 for extracting image text combining RPA and AI of this embodiment, and will not be repeated here.
According to the image text extraction device combining RPA and AI provided by the embodiment of the application, a plurality of dimension features of each detection frame are extracted from the image to be processed; attention mechanism learning is performed on these features to acquire the adjacent frame information of each detection frame and the corresponding text line head and tail information; the type of each detection frame is then determined from this information; the detection frames of the type character are merged according to their position information and types to determine each text box contained in the image to be processed; and text recognition is performed on each text box to determine the text contained in the image. Therefore, by extracting multi-dimensional features of different granularities from the image to be processed and using them to represent each detection frame, different types of data content in the image can be determined through a single detection pass, which simplifies the image text extraction process, improves extraction efficiency, and further improves the accuracy and reliability of text extraction.
In order to achieve the above embodiment, the present application further provides an electronic device.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
As shown in fig. 6, the electronic device 200 includes:
a memory 210 and a processor 220, and a bus 230 connecting different components (including the memory 210 and the processor 220). The memory 210 stores a computer program, and the processor 220 executes the program to implement the method for extracting image text combining RPA and AI according to the embodiments of the application.
Bus 230 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic device 200 typically includes a variety of electronic device readable media. Such media can be any available media that is accessible by electronic device 200 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 210 may also include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 240 and/or cache memory 250. The electronic device 200 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 260 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard disk drive"). Although not shown in fig. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 230 via one or more data medium interfaces. Memory 210 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the application.
Program/utility 280 having a set (at least one) of program modules 270 may be stored in, for example, memory 210, such program modules 270 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 270 generally perform the functions and/or methods of the embodiments described herein.
The electronic device 200 may also communicate with one or more external devices 290 (e.g., keyboard, pointing device, display 291, etc.), one or more devices that enable a user to interact with the electronic device 200, and/or any device (e.g., network card, modem, etc.) that enables the electronic device 200 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 292. Also, electronic device 200 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 293. As shown, network adapter 293 communicates with other modules of electronic device 200 over bus 230. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 200, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 220 executes various functional applications and data processing by running programs stored in the memory 210.
It should be noted that, the implementation process and the technical principle of the electronic device in this embodiment refer to the foregoing explanation of the method for extracting the image text combining RPA and AI in this embodiment of the present application, and are not repeated here.
The electronic device provided by the embodiment of the application can execute the method for extracting image text combining RPA and AI described above: target detection is performed on the image to be processed to determine the position information and type of each detection frame contained in it; the detection frames of the type character are merged according to their position information and types to determine each text box contained in the image to be processed; and character recognition is then performed on each text box to determine the characters contained in the image. Since the type of each detection frame is determined at the same time as target detection is performed, different types of data content in the image can be determined through a single detection pass, which simplifies the image text extraction process and improves extraction efficiency.
In order to implement the above embodiments, the present application also proposes a computer-readable storage medium.
The computer readable storage medium stores a computer program, which when executed by a processor, implements the method for extracting image text combining RPA and AI according to the embodiment of the present application.
In order to achieve the foregoing embodiments, an embodiment of the present application provides a computer program, which when executed by a processor, implements the method for extracting image text combining RPA and AI according to the embodiment of the present application.
In alternative implementations, the present embodiments may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device and partly on the remote electronic device, or entirely on the remote electronic device or server. In the case of remote electronic devices, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external electronic device (e.g., connected through the Internet using an Internet service provider).
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (16)

1. A method for extracting image text combining RPA and AI, characterized by comprising the following steps:
Performing target detection on an image to be processed to determine position information of each detection frame and the type of each detection frame contained in the image to be processed, wherein the type of each detection frame comprises: characters, non-characters, beginning of text line, end of text line;
Combining the detection frames with the types of characters according to the position information of each detection frame and the types of each detection frame so as to determine each text frame contained in the image to be processed;
performing character recognition on each text box to determine characters contained in the image to be processed;
the target detection is performed on the image to be processed to determine the position information of each detection frame and the type of each detection frame contained in the image to be processed, and the method specifically comprises the following steps:
extracting a plurality of dimension features of each detection frame from the image to be processed respectively;
performing attention mechanism learning on the plurality of dimension features to acquire adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame;
And determining the type of each detection frame according to the adjacent frame information of each detection frame and the corresponding head and tail information of the text line.
2. The method of claim 1, further comprising, prior to said extracting the plurality of dimensional features of each of the inspection boxes from the image to be processed, respectively:
preprocessing the image to be processed to obtain a plurality of feature images corresponding to the image to be processed;
the extracting the plurality of dimension features of each detection frame from the image to be processed comprises the following steps:
And respectively extracting a plurality of dimension characteristics of each detection frame from the plurality of characteristic diagrams.
3. The method according to claim 1, wherein the extracting the plurality of dimension features of each detection frame from the image to be processed includes:
and carrying out convolution processing on the image to be processed by using at least two filters to obtain at least two dimensional characteristics of each detection frame, wherein receptive fields of the at least two filters are different.
4. The method of claim 1, further comprising, prior to learning the attention mechanism for the plurality of dimensional features to obtain adjacent box information for each of the detection boxes and beginning and ending information for a text line corresponding to each of the detection boxes:
And splicing the plurality of dimensional features to generate the features of each detection frame.
5. The method of any one of claims 1-4, further comprising, prior to said learning of the attention mechanism for the plurality of dimensional features to obtain adjacent box information for each of the test boxes and beginning and ending information for a text line corresponding to each of the test boxes:
and carrying out normalization processing on the plurality of dimension features.
6. The method according to any one of claims 1 to 4, wherein the position information of each detection frame includes coordinates of each detection frame in a first direction and an offset in a second direction, and the combining the detection frames with the type of characters according to the position information of each detection frame and the type of each detection frame to determine each text frame included in the image to be processed specifically includes:
If the type of any detection frame is the beginning of a text line, acquiring, from the detection frames, candidate detection frames whose second coordinates in the first direction match the first coordinate, according to the first coordinate of the detection frame in the first direction;
Acquiring adjacent detection frames adjacent to any detection frame in the second direction from the candidate detection frames according to the first offset of the any detection frame in the second direction;
and if the type of the adjacent detection frame is a character, merging the adjacent detection frame with any detection frame.
7. The method according to any one of claims 1 to 4, further comprising, after said combining the detection boxes of the type of characters based on the positional information of each of the detection boxes and the type of each of the detection boxes to determine each text box included in the image to be processed:
carrying out connected domain analysis on each text box to determine the shape of the connected domain corresponding to each text box;
And if the shape of the connected domain corresponding to any text box is circular, determining that a red seal is contained in the text box.
8. An image text extraction device combining RPA and AI, comprising:
The first determining module is configured to perform object detection on an image to be processed, so as to determine position information of each detection frame and a type of each detection frame included in the image to be processed, where the type of each detection frame includes: characters, non-characters, beginning of text line, end of text line;
The second determining module is used for combining the detection frames with the types of characters according to the position information of each detection frame and the types of each detection frame so as to determine each text frame contained in the image to be processed;
the third determining module is used for carrying out character recognition on each text box so as to determine characters contained in the image to be processed;
the first determining module specifically includes:
The extraction unit is used for respectively extracting a plurality of dimension characteristics of each detection frame from the image to be processed;
The first acquisition unit is used for performing attention mechanism learning on the plurality of dimension features so as to acquire adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame;
and the determining unit is used for determining the type of each detection frame according to the adjacent frame information of each detection frame and the corresponding text line head and tail information.
9. The apparatus of claim 8, wherein the first determination module further comprises:
the second acquisition unit is used for preprocessing the image to be processed to acquire a plurality of feature images corresponding to the image to be processed;
The extraction unit specifically comprises:
and the extraction subunit is used for respectively extracting a plurality of dimension features of each detection frame from the plurality of feature maps.
10. The device according to claim 8, wherein the extraction unit comprises in particular:
And the acquisition subunit is used for carrying out convolution processing on the image to be processed by utilizing at least two filters so as to acquire at least two dimensional characteristics of each detection frame, wherein receptive fields of the at least two filters are different.
11. The apparatus of claim 8, wherein the first determination module further comprises:
And the splicing unit is used for splicing the plurality of dimension characteristics to generate the characteristics of each detection frame.
12. The apparatus of any of claims 8-11, wherein the first determining module further comprises:
And the normalization unit is used for performing normalization processing on the plurality of dimension features.
13. The apparatus according to any one of claims 8 to 11, wherein the position information of each detection frame includes a coordinate of each detection frame in a first direction and an offset in a second direction, and the second determining module specifically includes:
a third obtaining unit, configured to, when the type of any detection frame is the beginning of a text line, obtain from the detection frames candidate detection frames whose second coordinates in the first direction match the first coordinate, according to the first coordinate of the detection frame in the first direction;
A fourth obtaining unit, configured to obtain, from the candidate detection frames, an adjacent detection frame adjacent to the arbitrary detection frame in the second direction according to a first offset of the arbitrary detection frame in the second direction;
and the merging unit is used for merging the adjacent detection frames with any detection frame when the type of the adjacent detection frames is characters.
14. The apparatus of any one of claims 8-11, further comprising:
a fourth determining module, configured to perform connected domain analysis on each text box, so as to determine a connected domain shape corresponding to each text box;
and the fifth determining module is used for determining that a red seal is contained in any text box when the shape of the connected domain corresponding to the text box is circular.
15. An electronic device, comprising: a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor implements the method for extracting image text combining RPA and AI according to any one of claims 1-7 when executing the program.
16. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method for extracting image text combining RPA and AI according to any one of claims 1-7.
CN202010886737.3A 2020-08-28 2020-08-28 Image text extraction method, device and electronic device combining RPA and AI Active CN112149663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010886737.3A CN112149663B (en) 2020-08-28 2020-08-28 Image text extraction method, device and electronic device combining RPA and AI

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010886737.3A CN112149663B (en) 2020-08-28 2020-08-28 Image text extraction method, device and electronic device combining RPA and AI

Publications (2)

Publication Number Publication Date
CN112149663A CN112149663A (en) 2020-12-29
CN112149663B true CN112149663B (en) 2024-11-15

Family

ID=73890014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010886737.3A Active CN112149663B (en) 2020-08-28 2020-08-28 Image text extraction method, device and electronic device combining RPA and AI

Country Status (1)

Country Link
CN (1) CN112149663B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651394B (en) * 2020-12-31 2023-11-14 北京一起教育科技有限责任公司 Image detection method and device and electronic equipment
CN112766892B (en) * 2021-01-11 2024-11-15 北京来也网络科技有限公司 Method, device and electronic device for capital allocation combining RPA and AI
CN112926420B (en) * 2021-02-09 2022-11-08 海信视像科技股份有限公司 Display device and menu character recognition method
CN115393837A (en) * 2021-05-25 2022-11-25 阿里巴巴新加坡控股有限公司 Image detection method, apparatus and storage medium
CN113778303A (en) * 2021-08-23 2021-12-10 深圳价值在线信息科技股份有限公司 Character extraction method and device and computer readable storage medium
CN113850258B (en) * 2021-09-25 2024-08-23 深圳爱莫科技有限公司 Method, system, equipment and storage medium for extracting Chinese character lines in document
CN114299478A (en) * 2021-12-14 2022-04-08 北京来也网络科技有限公司 Image processing method and device combining RPA and AI and electronic equipment
CN114419636A (en) * 2022-01-10 2022-04-29 北京百度网讯科技有限公司 Text recognition method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287952A (en) * 2019-07-01 2019-09-27 中科软科技股份有限公司 A kind of recognition methods and system for tieing up sonagram piece character

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105809164B (en) * 2016-03-11 2019-05-14 北京旷视科技有限公司 Character recognition method and device
CN108304835B (en) * 2018-01-30 2019-12-06 百度在线网络技术(北京)有限公司 character detection method and device
WO2020010547A1 (en) * 2018-07-11 2020-01-16 深圳前海达闼云端智能科技有限公司 Character identification method and apparatus, and storage medium and electronic device
CN109902622B (en) * 2019-02-26 2020-06-09 中国科学院重庆绿色智能技术研究院 Character detection and identification method for boarding check information verification
CN110442744B (en) * 2019-08-09 2022-11-04 泰康保险集团股份有限公司 Method and device for extracting target information in image, electronic equipment and readable medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287952A (en) * 2019-07-01 2019-09-27 中科软科技股份有限公司 A kind of recognition methods and system for tieing up sonagram piece character

Also Published As

Publication number Publication date
CN112149663A (en) 2020-12-29

Similar Documents

Publication Publication Date Title
CN112149663B (en) Image text extraction method, device and electronic device combining RPA and AI
CN109858555B (en) Image-based data processing method, device, equipment and readable storage medium
CN110008956B (en) Invoice key information positioning method, invoice key information positioning device, computer equipment and storage medium
US20200286402A1 (en) Method and system for assisting with math problem
CN112016638B (en) Method, device and equipment for identifying steel bar cluster and storage medium
CN111626177B (en) A PCB component identification method and device
US11403766B2 (en) Method and device for labeling point of interest
CN110084289B (en) Image annotation method and device, electronic equipment and storage medium
CN110188766B (en) Image main target detection method and device based on convolutional neural network
CN111191649A (en) Method and equipment for identifying bent multi-line text image
CN113762455B (en) Detection model training method, single word detection method, device, equipment and medium
CN113361643A (en) Deep learning-based universal mark identification method, system, equipment and storage medium
CN112509661A (en) Methods, computing devices, and media for identifying physical examination reports
JP2022185143A (en) Text detection method, and text recognition method and device
CN113762303B (en) Image classification method, device, electronic equipment and storage medium
CN111832551A (en) Text image processing method, device, electronic scanning device and storage medium
CN108992033B (en) Grading device, equipment and storage medium for vision test
CN110647826B (en) Method and device for acquiring commodity training picture, computer equipment and storage medium
CN112542163B (en) Intelligent voice interaction method, device and storage medium
CN110059180B (en) Article author identity recognition and evaluation model training method and device and storage medium
CN118366159A (en) Remote sensing image description method, device, electronic device and storage medium
CN117351505A (en) Information code identification method, device, equipment and storage medium
CN115620310B (en) Image recognition method, model training method, medium, device and computing equipment
CN111723799A (en) Coordinate positioning method, device, equipment and storage medium
CN110516094A (en) De-weight method, device, electronic equipment and the storage medium of class interest point data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 1902, 19th Floor, China Electronics Building, No. 3 Danling Road, Haidian District, Beijing

Applicant after: BEIJING LAIYE NETWORK TECHNOLOGY Co.,Ltd.

Applicant after: Laiye Technology (Beijing) Co.,Ltd.

Address before: 1902, 19 / F, China Electronics Building, 3 Danling Road, Haidian District, Beijing 100080

Applicant before: BEIJING LAIYE NETWORK TECHNOLOGY Co.,Ltd.

Country or region before: China

Applicant before: BEIJING BENYING NETWORK TECHNOLOGY Co.,Ltd.

GR01 Patent grant