CN115100659B - Text recognition method, device, electronic equipment and storage medium
- Publication number: CN115100659B
- Application number: CN202210665530.2A
- Authority: CN (China)
- Prior art keywords: text, module, correction, image, feature
- Legal status: Active
Classifications
- G06V30/1475 — Character recognition; image acquisition; aligning or centring of the image pick-up or image-field; inclination or skew detection or correction of characters or of the image to be recognised
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06V10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V30/18 — Character recognition; extraction of features or characteristics of the image
- G06V30/19 — Character recognition; recognition using electronic means
Abstract
The disclosure relates to a text recognition method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a text image; inputting the text image into a pre-trained text recognition model, wherein the text recognition model comprises a correction module, a feature extraction module and a decoding module, the correction module comprises a plurality of correction submodules and a first processing module, and the correction submodules respectively correspond to different correction methods; correcting the text image by using each of the plurality of correction submodules to obtain a plurality of corrected images; superimposing the plurality of corrected images in series by using the first processing module to obtain a first corrected image; performing feature extraction on the first corrected image by using the feature extraction module to obtain first feature information; generating, by using the decoding module, a first probability matrix corresponding to the text image based on the first feature information; and recognizing the characters in the text image according to the first probability matrix. The recognition accuracy is high.
Description
Technical Field
The present disclosure relates to the technical field of machine learning, and in particular to a text recognition method, a text recognition device, an electronic device, and a storage medium.
Background
With the development of machine learning, machine learning methods have gradually been adopted to recognize characters in text images. However, existing text recognition methods have several problems: character-based methods carry a relatively high character-labeling cost; sequence-based methods may miss characters or recognize them more than once; multiple tests must be carried out for different application scenarios to determine a suitable machine learning method; and the various kinds of information in a text image cannot be fully utilized. As a result, the accuracy of text recognition is relatively low.
Disclosure of Invention
In order to solve the above technical problems, the present disclosure provides a text recognition method, a device, an electronic apparatus, and a storage medium, which can make full use of various kinds of information in a text image and achieve higher text recognition accuracy.
According to an aspect of the present disclosure, there is provided a text recognition method including:
acquiring a text image, wherein the text image comprises at least one character;
inputting the text image into a pre-trained text recognition model, wherein the text recognition model comprises a correction module, a feature extraction module and a decoding module, the correction module comprises a plurality of correction submodules and a first processing module, and the correction submodules respectively correspond to different correction methods;
correcting the text image by using each correction submodule of the plurality of correction submodules to obtain a plurality of corrected images;
superimposing the plurality of corrected images in series by using the first processing module to obtain a first corrected image;
performing feature extraction on the first corrected image by using the feature extraction module to obtain first feature information;
generating, by using the decoding module, a first probability matrix corresponding to the text image based on the first feature information;
and recognizing the characters in the text image according to the first probability matrix to obtain a recognition result.
According to another aspect of the present disclosure, there is provided a text recognition apparatus including:
an acquisition unit configured to acquire a text image including at least one character;
The input unit is used for inputting the text image into a pre-trained text recognition model, wherein the text recognition model comprises a correction module, a feature extraction module and a decoding module, the correction module comprises a plurality of correction sub-modules and a first processing module, and the correction sub-modules respectively correspond to different correction methods;
The processing unit is used for correcting the text image by using each correction submodule of the plurality of correction submodules to obtain a plurality of corrected images; superimposing the plurality of corrected images in series by using the first processing module to obtain a first corrected image; performing feature extraction on the first corrected image by using the feature extraction module to obtain first feature information; and generating, by using the decoding module, a first probability matrix corresponding to the text image based on the first feature information.
The recognition unit is used for recognizing the characters in the text image according to the first probability matrix to obtain a recognition result.
According to another aspect of the present disclosure, there is provided an electronic device, including: a processor; and a memory storing a program, wherein the program comprises instructions that, when executed by the processor, cause the processor to perform the text recognition method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the text recognition method described above.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
A text image is acquired; the text image is input into a pre-trained text recognition model, wherein the text recognition model comprises a correction module, a feature extraction module and a decoding module, the correction module comprises a plurality of correction submodules and a first processing module, and the correction submodules respectively correspond to different correction methods; the text image is corrected by each correction submodule of the plurality of correction submodules to obtain a plurality of corrected images, and the plurality of corrected images are superimposed in series by the first processing module to obtain a first corrected image; feature extraction is performed on the first corrected image by the feature extraction module to obtain first feature information; a first probability matrix corresponding to the text image is generated by the decoding module based on the first feature information; and the characters in the text image are recognized according to the first probability matrix. In this way, various kinds of information in the text image can be fully utilized, and the recognition accuracy is high.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of an application scenario provided in an embodiment of the present disclosure;
FIG. 2 is a flowchart of a text recognition model training method provided in an embodiment of the present disclosure;
FIG. 3 is a network structure diagram of a text recognition model provided by an embodiment of the present disclosure;
FIG. 4 is a flowchart of a text recognition method provided in an embodiment of the present disclosure;
FIG. 5 is a flowchart of a text recognition method provided in an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a text image provided by an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a text recognition device provided in an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below. It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
In view of the above technical problems, the text recognition method provided by the present disclosure draws on the design idea of automatic architecture search: based on the text recognition paradigm, each module of the paradigm is designed to contain a plurality of candidate methods, so as to obtain a high-accuracy text recognition model that makes full use of various kinds of information in the text image, such as content, position, and character shape, and accurately recognizes the characters in the text image.
In particular, the text recognition method may be performed by a terminal or a server. Specifically, the terminal or the server can recognize characters in the text image through the text recognition model. The execution subject of the training method of the text recognition model and the execution subject of the text recognition method may be the same or different.
For example, in one application scenario, as shown in FIG. 1, which is a schematic diagram of an application scenario provided in an embodiment of the present disclosure, the server 12 trains a text recognition model. The terminal 11 acquires the trained text recognition model from the server 12 and recognizes characters in text images through the trained text recognition model. The target image may be captured by the terminal 11, acquired by the terminal 11 from another device, or obtained by the terminal 11 by performing image processing on a preset image, where the preset image may likewise be captured by the terminal 11 or acquired by the terminal 11 from another device. The other device is not specifically limited here.
In another application scenario, the server 12 trains the text recognition model, and then the server 12 recognizes characters in text images through the trained text recognition model. The manner in which the server 12 acquires the target image may be similar to the manner in which the terminal 11 acquires the target image described above, and is not repeated here.
In yet another application scenario, the terminal 11 trains the text recognition model. Further, the terminal 11 recognizes characters in the text image by the trained text recognition model.
It will be appreciated that the text recognition model training method and the text recognition method provided by the embodiments of the present disclosure are not limited to the several possible scenarios described above. Since the trained text recognition model is applied in the text recognition method, the text recognition model training method is described first, before the text recognition method.
Taking the text recognition model training by the server 12 as an example, a text recognition model training method, i.e., a training process of the text recognition model, will be described. It will be appreciated that the text recognition model training method is equally applicable to the scenario in which the terminal 11 trains a text recognition model.
FIG. 2 is a flowchart of a text recognition model training method provided in an embodiment of the present disclosure. Before a text image is acquired and the characters thereon are recognized, a text recognition model needs to be constructed and the constructed text recognition model needs to be trained. The method specifically includes the following steps S210 to S240 shown in FIG. 2:
s210, acquiring a sample image set, wherein the sample image set comprises a sample image and a text labeling result corresponding to the sample image.
It can be understood that the server acquires a sample image set, which serves as the training samples of the text recognition model. The sample image set includes a large number of sample images and the text labeling result corresponding to each sample image, where a text labeling result is the accurate labeling of the characters in a sample image. A sample image may be a single-line text image: a single-line text image may be acquired directly as a sample image, or a multi-line text image may be acquired and divided into multiple single-line text images by a text detection method, which are then used as sample images. The single line of text in a sample image may be straight, oblique, or curved, and a sample image may also be a commonly encountered blurred or photocopied text image. After a sample image is obtained, the text in it is labeled, that is, the character information of all the text on the sample image is annotated, so as to obtain the text labeling result corresponding to the sample image; the text labeling result includes a character sequence. During labeling, a dictionary is constructed from the characters in the text labeling results; the dictionary contains only individual characters and can be understood as the set of unique, non-repeated characters involved in the sample image set. A minimal sketch of this dictionary construction follows.
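The dictionary construction described above reduces to collecting the set of unique characters over all labels. The following is a minimal sketch; the sample labels and variable names are illustrative stand-ins, not data from the patent.

```python
# Build the character dictionary (unique character set) from the text labels.
sample_labels = ["hello world", "text 123", "recognition"]   # illustrative labels

charset = sorted(set("".join(sample_labels)))           # unique characters, no repeats
char_to_idx = {ch: i for i, ch in enumerate(charset)}   # character -> dictionary index
idx_to_char = {i: ch for ch, i in char_to_idx.items()}  # dictionary index -> character

print(len(charset), char_to_idx)
```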
S220, inputting the sample image into a pre-constructed text recognition model, and generating a plurality of text recognition results.
It can be appreciated that, on the basis of S210 above, the sample images in the sample image set are input into the pre-constructed text recognition model, and the text recognition model is trained based on each sample image and its text labeling result. Each sample image input into the text recognition model yields multiple text recognition results, that is, one sample image corresponds to multiple text recognition results.
Optionally, the plurality of text recognition results include a first recognition result output by a first decoding submodule in the decoding module, a second recognition result output by a second decoding submodule in the decoding module, and a third recognition result output by a third decoding submodule in the decoding module.
Referring to FIG. 3, for example, FIG. 3 is a network structure diagram of a text recognition model according to an embodiment of the present disclosure. The text recognition model 300 in FIG. 3 includes a correction module 310, a first feature extraction module 320, a second feature extraction module 330, and a decoding module 340. The correction module 310 includes a plurality of correction submodules that respectively correspond to different correction methods, that is, the correction submodules correct a text image based on different methods: correction submodule 1 corresponds to correction method 1, correction submodule 2 corresponds to correction method 2, and so on. For example, the correction module 310 includes 3 correction submodules: the first correction submodule 311 consists of 5 convolutional layers and 2 fully connected layers and obtains a corrected image based on a rigid transformation, which may be an affine transformation; the second correction submodule 312 consists of 8 convolutional layers, with the channel number of the last convolutional layer being 2; and the third correction submodule 313 likewise consists of 5 convolutional layers and 2 fully connected layers and obtains a corrected image based on a non-rigid transformation, which may be a thin plate spline (TPS) interpolation transformation. The first feature extraction module 320 includes a plurality of feature extraction submodules that respectively correspond to different feature extraction methods; here it includes 2 feature extraction submodules, denoted as a first feature extraction submodule 321 and a second feature extraction submodule 322. The first feature extraction submodule 321 is composed of a residual network module, where the residual network is a ResNet network; the ResNet backbone is composed of 4 convolutional blocks (Blocks), each Block is composed of several convolution operations, and the output of each Block is the input of the next Block. The second feature extraction submodule 322 is composed of an encoding module, which may be the sine-cosine encoding part of a Transformer model. The second feature extraction module 330 includes a plurality of feature mapping modules, each corresponding to a feature enhancement method; here it includes 2 feature mapping modules, denoted as a third feature mapping module 331 and a fourth feature mapping module 332. The third feature mapping module 331 is composed of a bidirectional recurrent network module, which may be a two-layer bidirectional Long Short-Term Memory (LSTM) network; the fourth feature mapping module 332 is composed of an identity transformation module, where the identity transformation leaves the input unchanged, that is, the output equals the input. The decoding module 340 includes a plurality of decoding submodules that respectively correspond to different decoding methods; here it includes 3 decoding submodules, denoted as a first decoding submodule 341, a second decoding submodule 342, and a third decoding submodule 343. The first decoding submodule 341 is composed of a self-attention layer and a recurrent network layer, where the recurrent network layer may be a gated recurrent unit (GRU); the second decoding submodule 342 is composed of 3 basic modules of a Transformer model; and the third decoding submodule 343 is composed of a fully connected layer.
It can be appreciated that after a sample image is input into the text recognition model, the internal flow of the text recognition model during training is as follows: the 3 correction submodules 311 to 313 in the correction module 310 respectively correct the sample image to obtain 3 corrected images, which are then superimposed in series as the input of the first feature extraction module 320; the 2 feature extraction submodules 321 to 322 in the first feature extraction module 320 respectively extract features from the serially superimposed corrected image to obtain 2 sets of feature maps, which are then combined into a new set of feature maps; the 2 feature mapping modules 331 to 332 in the second feature extraction module 330 respectively perform feature enhancement on the new feature maps to obtain 2 sets of feature maps, from which a new set of feature maps is obtained as the input of the decoding module 340; and the 3 decoding submodules 341 to 343 in the decoding module 340 respectively decode the new feature maps to obtain 3 decoding results, namely text recognition results. That is, the text recognition model generates 3 text recognition results during training. A schematic sketch of this wiring follows.
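The following schematic PyTorch sketch shows only the parallel-branch wiring described above (3 parallel rectifiers whose outputs are concatenated, 2 parallel extractors fused by softmax scores, 2 parallel mappers fused the same way, and 3 parallel decoders). Every submodule is reduced to a placeholder stub; layer sizes, channel counts, and the stub internals are illustrative assumptions, not the patent's architecture.

```python
import torch
import torch.nn as nn

class TextRecognitionModel(nn.Module):
    def __init__(self, num_classes=100):   # illustrative dictionary size
        super().__init__()
        self.rectifiers = nn.ModuleList(    # 3 parallel correction submodule stubs
            [nn.Conv2d(1, 1, 3, padding=1) for _ in range(3)])
        self.extractors = nn.ModuleList(    # 2 parallel feature extraction stubs
            [nn.Conv2d(3, 64, 3, padding=1) for _ in range(2)])
        self.mappers = nn.ModuleList(       # 2 parallel feature mapping stubs
            [nn.Identity() for _ in range(2)])
        self.decoders = nn.ModuleList(      # 3 parallel decoding submodule stubs
            [nn.Conv2d(64, num_classes, 1) for _ in range(3)])

    @staticmethod
    def fuse(outputs):
        # per-position softmax score over the branches, then weighted sum
        stacked = torch.stack(outputs, dim=0)
        return (torch.softmax(stacked, dim=0) * stacked).sum(dim=0)

    def forward(self, image):
        rectified = torch.cat([r(image) for r in self.rectifiers], dim=1)  # series superposition
        features = self.fuse([e(rectified) for e in self.extractors])     # merged feature map
        enhanced = self.fuse([m(features) for m in self.mappers])         # enhanced feature map
        return [d(enhanced) for d in self.decoders]                       # 3 recognition results

model = TextRecognitionModel()
outs = model(torch.randn(2, 1, 32, 128))   # list of 3 decoder outputs per batch
```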
S230, calculating a loss value according to the text recognition results and the text labeling results.
It can be understood that, on the basis of S220 above, each sample image corresponds to a plurality of text recognition results, and the loss value is calculated from the plurality of text recognition results corresponding to each sample image and the text labeling result corresponding to that sample image.
Optionally, the loss values include a first loss value, a second loss value, and a third loss value.
Optionally, calculating the loss value in S230 may include the following steps:
Calculating a first loss value from the first recognition result and the text labeling result by using a first loss function.
Calculating a second loss value from the second recognition result and the text labeling result by using the first loss function.
Calculating a third loss value from the third recognition result and the text labeling result by using a second loss function.
It can be appreciated that each sample image, for example sample image 1, has a corresponding text labeling result 1 and, output by the text recognition model, a first recognition result 1, a second recognition result 1, and a third recognition result 1. A first loss value is calculated from the first recognition result 1 output by the first decoding submodule 341 and the text labeling result 1 by using a first loss function, where the first loss function may be a multi-class cross-entropy loss function; a second loss value is calculated from the second recognition result 1 and the text labeling result 1 by using the first loss function; and a third loss value is calculated from the third recognition result 1 and the text labeling result 1 by using a second loss function, where the second loss function may be a CTC loss function. The sum of the first loss value, the second loss value, and the third loss value is then calculated to obtain a total loss value.
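A hedged sketch of this three-part loss is below: multi-class cross-entropy (the first loss function) on the outputs of decoding submodules 341 and 342, and CTC (the second loss function) on the output of submodule 343. All tensor shapes, the blank index, and the dictionary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
ctc = nn.CTCLoss(blank=0)   # index 0 reserved for the CTC blank (an assumption)

def total_loss(out1, out2, out3, targets, input_lens, target_lens):
    # out1, out2: (batch, seq_len, dict_size) logits from decoders 341/342
    # out3: (input_len, batch, dict_size) logits from the FC decoder 343
    l1 = ce(out1.flatten(0, 1), targets.flatten())
    l2 = ce(out2.flatten(0, 1), targets.flatten())
    l3 = ctc(out3.log_softmax(2), targets, input_lens, target_lens)
    return l1 + l2 + l3     # sum of the three loss values

B, T, C, T_in = 2, 8, 100, 32
loss = total_loss(torch.randn(B, T, C), torch.randn(B, T, C),
                  torch.randn(T_in, B, C),
                  torch.randint(1, C, (B, T)),               # label indices (no blank)
                  torch.full((B,), T_in, dtype=torch.long),  # CTC input lengths
                  torch.full((B,), T, dtype=torch.long))     # label lengths
```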
S240, updating network parameters of the text recognition model according to the loss value until the loss value is smaller than a preset threshold value, and outputting the text recognition model.
It can be understood that, on the basis of S230 above, the network parameters of each level of the text recognition model are updated according to the calculated total loss value until the variation of the loss value calculated by the loss function falls within a preset range, that is, the loss value remains substantially unchanged. This indicates that training of the text recognition model is complete, and the text recognition model is output.
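Steps S230 and S240 together amount to a standard gradient-descent loop. Below is a minimal sketch of that loop under stated assumptions: `model` returns the three decoder outputs, `loader` is an assumed DataLoader yielding batches shaped as in the loss sketch above, `total_loss` is the earlier sketch, and the threshold value is illustrative.

```python
import torch

def train(model, loader, threshold=0.05, max_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss = None
    for epoch in range(max_epochs):
        for images, targets, input_lens, target_lens in loader:
            out1, out2, out3 = model(images)          # three recognition results
            loss = total_loss(out1, out2, out3,
                              targets, input_lens, target_lens)
            optimizer.zero_grad()
            loss.backward()                           # backpropagate the total loss
            optimizer.step()                          # update network parameters
        if loss is not None and loss.item() < threshold:
            break                                     # S240 stopping condition
    return model
```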
The embodiment of the present disclosure provides a text recognition model training method in which each module of the text recognition model includes a plurality of submodules, each corresponding to a different processing method. In the process of training the text recognition model on a sample image set, the loss value is calculated from the text recognition result output by each decoding submodule of the decoding module. After repeated iterative training, an optimal method can be selected within each module, and the text recognition method with the highest recognition accuracy for the sample image set can be determined without combinatorially testing different method selections. This effectively improves the training speed of the text recognition model while ensuring that the model has high recognition accuracy.
FIG. 4 is a flowchart of a text recognition method according to an embodiment of the present disclosure. After the text recognition model is trained, the application stage of the text recognition model specifically includes the following steps S410 to S470 shown in FIG. 4:
S410, acquiring a text image, wherein the text image comprises at least one character.
It is understood that a text image is acquired, which may be a single-line text image including at least one character. In the application stage of the text recognition model, to ensure the accuracy of the text recognition result, the image input into the text recognition model needs to have the same structure as the sample images: for example, if the sample images input during training were single-line text images, then the text image to be recognized also needs to be a single-line text image. If the acquired text image is a multi-line text image, it can be processed in advance into a plurality of single-line text images, which are then input into the text recognition model in sequence for recognition; the method for processing the multi-line text image is not limited here.
S420, inputting the text image into a pre-trained text recognition model.
Optionally, the text recognition model includes a correction module, a feature extraction module and a decoding module, the correction module includes a plurality of correction submodules and a first processing module, and the plurality of correction submodules respectively correspond to different correction methods.
S430, correcting the text image by using each correction submodule of the plurality of correction submodules to obtain a plurality of corrected images.
S440, superimposing the plurality of corrected images in series by using the first processing module to obtain a first corrected image.
S450, performing feature extraction on the first corrected image by using a feature extraction module to obtain first feature information.
S460, generating a first probability matrix corresponding to the text image based on the first characteristic information by utilizing the decoding module.
Optionally, the feature extraction module includes a first feature extraction module and a second feature extraction module.
Optionally, in S450, feature extraction is performed on the first corrected image by using a feature extraction module to obtain first feature information, which may include the following steps:
and carrying out feature extraction on the first corrected image by using the first feature extraction module to obtain second feature information.
And carrying out data enhancement on the second characteristic information by using the second characteristic extraction module to obtain the first characteristic information.
Understandably, after the correction module outputs the first corrected image to the feature extraction module, the first feature extraction module in the feature extraction module performs feature extraction on the first corrected image to obtain the second feature information corresponding to the first corrected image; the second feature extraction module then receives the second feature information and performs data enhancement on it to obtain the first feature information. That is, the first feature extraction module and the second feature extraction module are connected in series.
Optionally, the first feature extraction module includes a plurality of feature extraction sub-modules, the plurality of feature extraction sub-modules respectively correspond to different feature extraction methods, a first feature extraction sub-module of the plurality of feature extraction sub-modules is composed of a residual network module, and a second feature extraction sub-module of the plurality of feature extraction sub-modules is composed of an encoding module.
Optionally, the feature extraction of the first corrected image by using the first feature extraction module to obtain the second feature information may include the following steps:
Performing feature mapping on the first corrected image by using the first feature extraction submodule, and compressing the height of the feature mapping result to a preset threshold to obtain a first feature map.
Encoding the first corrected image by using the second feature extraction submodule, and compressing the height of the output vector obtained by encoding to the preset threshold to obtain a second feature map.
Calculating first scores corresponding to the first feature map and the second feature map by using a first activation function layer in the first feature extraction module, and obtaining the second feature information from the first feature map, the second feature map, and the first scores.
It can be understood that the first feature extraction module includes a plurality of feature extraction submodules that respectively correspond to different feature extraction methods; the feature extraction submodules are connected in parallel, their input is the first corrected image, and their output is a feature map. Specifically, the first of the feature extraction submodules performs feature mapping on the first corrected image based on 4 convolutional blocks (Blocks), and the height of the resulting feature map is compressed to a preset threshold, where the preset threshold can be set by the user as required, for example 4. The second of the feature extraction submodules is the decoder part of a Transformer model in which only the sine-cosine encoding part is retained and 4 basic modules are used; the specific network structure of the Transformer model is not limited. The second feature extraction submodule performs feature mapping on the first corrected image and compresses the height of the feature map to the preset threshold to obtain the second feature map. In other words, the feature maps output by the different feature extraction submodules have the same height, which makes it convenient to obtain a new feature map from the several feature maps. After each feature extraction submodule in the first feature extraction module outputs its feature map, the value at each position of the first feature map is passed through a first activation function layer (a softmax layer) to calculate a first score, and the product of the value at each position of the first feature map and this first score is computed to obtain a first product, i.e., the value at each position is weighted. Likewise, the value at each position of the second feature map is passed through the same softmax layer to calculate another first score, and the product of the value at each position of the second feature map and this score gives a second product. The values of the first product and the second product are added position by position to obtain the combined second feature information, as sketched below.
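The height compression and per-position score fusion just described can be sketched as follows: two hypothetical branch outputs are pooled to the preset height threshold (4 here) and fused position by position. All shapes, channel counts, and the pooling choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

compress = nn.AdaptiveAvgPool2d((4, 32))        # preset height threshold = 4

f1 = compress(torch.randn(2, 512, 8, 32))       # first feature map  -> (2, 512, 4, 32)
f2 = compress(torch.randn(2, 512, 8, 32))       # second feature map -> (2, 512, 4, 32)

stacked = torch.stack([f1, f2], dim=0)          # (2 branches, B, C, H, W)
scores = torch.softmax(stacked, dim=0)          # first score per position and branch
second_feature_info = (scores * stacked).sum(0) # weighted, position-wise sum
```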
Optionally, the second feature extraction module includes a plurality of feature mapping modules, and a third feature mapping module of the plurality of feature mapping modules is composed of a bidirectional recurrent network module.
Optionally, performing feature extraction on the second feature information by using the second feature extraction module to obtain the first feature information may include the following steps:
and carrying out feature enhancement on the second feature information by using a third feature mapping module to obtain enhancement information.
And calculating a second score according to the enhancement information and the second feature information by using a second activation function layer in the second feature extraction module, and obtaining the first feature information according to the second score, the enhancement information and the second feature information.
It can be understood that the second feature extraction module is configured to perform data enhancement on the second feature information output by the first feature extraction module. The second feature extraction module includes a plurality of feature mapping modules connected in parallel, and the input of each feature mapping module is the second feature information. Specifically, the third feature mapping module performs feature enhancement on the second feature information to obtain enhancement information, where the enhancement information and the second feature information have the same dimensions; then, a weighted summation over the second feature information and the enhancement information is computed based on a second activation function layer (a softmax layer) to obtain the first feature information, which can be understood as a new feature map. The calculation of the second activation function layer is the same as that of the first activation function layer and is not repeated here; a sketch follows.
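A hedged sketch of this enhancement-plus-fusion step: a two-layer bidirectional LSTM (the third feature mapping module described above) preserves the feature dimension, so its output can be fused position by position with the identity branch using the same softmax weighting as before. All sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

hidden = 256
bilstm = nn.LSTM(input_size=2 * hidden, hidden_size=hidden,
                 num_layers=2, bidirectional=True, batch_first=True)

second_info = torch.randn(2, 32, 2 * hidden)     # (batch, time, feature) sequence
enhanced, _ = bilstm(second_info)                # enhancement info, same shape

stacked = torch.stack([enhanced, second_info], dim=0)
weights = torch.softmax(stacked, dim=0)          # second score per position
first_info = (weights * stacked).sum(dim=0)      # first feature information
```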
Optionally, the decoding module includes a plurality of decoding submodules that respectively correspond to different decoding methods; a first decoding submodule of the plurality of decoding submodules is composed of a self-attention layer and a recurrent network layer, and a second decoding submodule of the plurality of decoding submodules is composed of a depth module.
Optionally, generating, by the decoding module, the first probability matrix corresponding to the text image based on the first feature information in S460 may include the following steps:
Decoding based on the first feature information by using the first decoding submodule to generate a second probability matrix comprising semantic information and temporal information.
Mapping the first feature information into a continuous representation by using the second decoding submodule to generate a third probability matrix.
Calculating a third score from the second probability matrix and the third probability matrix by using a third activation function layer in the decoding module, and generating the first probability matrix corresponding to the text image from the third score, the second probability matrix, and the third probability matrix.
It can be understood that the decoding module includes a plurality of decoding submodules, each corresponding to one decoding method, that is, the decoding submodules are connected in parallel; the input of each decoding submodule is the first feature information, and the output is a probability matrix (a decoding result). Specifically, the first decoding submodule decodes the first feature information to generate the second probability matrix; the second decoding submodule decodes based on the first feature information to generate the third probability matrix; scores are calculated from the second probability matrix and the third probability matrix based on a softmax operation, and a new probability matrix, denoted as the first probability matrix, is obtained from the scores. The calculation is the same as that of the first activation function layer and is not repeated here. It can be understood that the decoding module in the text recognition model includes 3 decoding submodules and thus also includes a third decoding submodule, which is composed of a fully connected layer and whose decoding accuracy is lower than that of the first and second decoding submodules; therefore, the output of the third decoding submodule is only used to calculate the loss value when training the text recognition model, so as to improve the accuracy of the model, and the probability matrix output by the third decoding submodule is not used in the application stage of the text recognition model.
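A hedged sketch of this inference-time fusion: the second and third probability matrices (from decoding submodules 341 and 342) are scored with a softmax across the two branches and summed into the first probability matrix; the fully connected decoder 343 appears only because its output serves the CTC loss during training. All sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

dict_size, feat = 100, 512
features = torch.randn(2, 32, feat)                  # first feature information

probs2 = torch.randn(2, 32, dict_size).softmax(-1)   # second probability matrix
probs3 = torch.randn(2, 32, dict_size).softmax(-1)   # third probability matrix

stacked = torch.stack([probs2, probs3], dim=0)
scores = torch.softmax(stacked, dim=0)               # third score per position
first_probs = (scores * stacked).sum(dim=0)          # first probability matrix

fc_decoder = nn.Linear(feat, dict_size)              # third decoding submodule
ctc_logits = fc_decoder(features)                    # training-only (CTC loss) output
```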
S470, recognizing the characters in the text image according to the first probability matrix to obtain a recognition result.
It can be understood that, on the basis of step S460, after the first probability matrix output by the text recognition model is obtained, a greedy algorithm is applied to the first probability matrix to obtain the recognition result of the characters in the text image. The size of the first probability matrix matches the size of the dictionary constructed from the sample image set, that is, the probability values in the first probability matrix correspond one-to-one to the characters in the dictionary; for each character to be recognized in the text image, the first probability matrix contains its probability values over all characters in the dictionary. The greedy algorithm determines the highest target probability value in the first probability matrix, and the character stored at the position corresponding to the target probability value in the dictionary is determined as the recognition result of the character to be recognized.
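The greedy step can be sketched as follows; the probability matrix and the idx_to_char mapping are illustrative stand-ins (the latter in the spirit of the dictionary sketch earlier).

```python
import torch

def greedy_decode(first_prob_matrix, idx_to_char):
    # first_prob_matrix: (num_characters, dict_size), one row per character position
    indices = first_prob_matrix.argmax(dim=1)   # highest target probability value per row
    return "".join(idx_to_char[int(i)] for i in indices)

probs = torch.rand(5, 100)
probs = probs / probs.sum(dim=1, keepdim=True)         # normalise rows to probabilities
idx_to_char = {i: chr(ord("a") + i % 26) for i in range(100)}  # stand-in dictionary
print(greedy_decode(probs, idx_to_char))
```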
The embodiment of the present disclosure provides a text recognition method in which the constructed text recognition model includes a plurality of modules, each module includes a plurality of submodules, and each submodule corresponds to one processing method. The softmax operation is used to weigh the several methods within each module of the text recognition model, so that the advantages of the method corresponding to each submodule are exploited and its disadvantages are avoided as far as possible. That is, an optimal recognition approach can effectively be selected for each text image, while various kinds of information in the text image, such as content, position, and character shape, are fully utilized, thereby obtaining a recognition result with higher accuracy.
FIG. 5 is a flowchart of a text recognition method according to an embodiment of the present disclosure. Optionally, correcting the text image by using each correction submodule of the plurality of correction submodules to obtain a plurality of corrected images in S430 specifically includes the following steps S510 to S530 shown in FIG. 5:
It can be understood that the correction module includes a plurality of correction submodules connected in parallel, each with its own corresponding correction method; that is, different correction submodules correct the text image with different correction methods, and the corrected images obtained may differ. In other words, for different text images, different correction submodules achieve different correction effects.
S510, predicting a first number of reference point coordinates of the text image by using a first correction submodule of the plurality of correction submodules, and performing affine transformation on the text image according to the first number of reference point coordinates to obtain a second corrected image.
It can be appreciated that the first correction submodule of the plurality of correction submodules predicts the reference point coordinates of the input text image, specifically a first number of them, where the first number is determined by the user as required and may be, for example, 20; a homography matrix is then calculated based on the first number of reference point coordinates, and affine transformation is performed on the text image based on the homography matrix to obtain the second corrected image.
For example, referring to FIG. 6, FIG. 6 is a schematic diagram of a text image provided in an embodiment of the present disclosure. FIG. 6 includes a text image 610 and a second corrected image 620, where the text image 610 includes a single line of text 611 and reference points 612 of the single line of text, and the second corrected image 620 includes the corrected single line of text 621; only some of the predicted reference points are shown in the text image 610.
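The homography-then-warp step described above can be sketched with OpenCV as follows. The "predicted" reference points here are canonical points plus noise, standing in for the submodule's regression output, and all sizes are illustrative assumptions.

```python
import cv2
import numpy as np

h, w = 64, 256
xs = np.linspace(4, w - 4, 10)
canonical = np.concatenate([
    np.stack([xs, np.full(10, 8.0)], axis=1),        # top row of reference points
    np.stack([xs, np.full(10, h - 8.0)], axis=1),    # bottom row of reference points
]).astype(np.float32)
predicted = canonical + np.random.randn(20, 2).astype(np.float32) * 1.5

H, _ = cv2.findHomography(predicted, canonical, cv2.RANSAC)  # homography matrix
image = np.random.randint(0, 255, (h, w), dtype=np.uint8)    # stand-in text image
second_corrected = cv2.warpPerspective(image, H, (w, h))     # rectified image
```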
S520, calculating the offset of the text image by using a second correction submodule of the plurality of correction submodules, and adjusting the pixel value corresponding to each coordinate position of the text image according to the offset to obtain a third corrected image.
It can be appreciated that the second correction submodule calculates the xy offset of the text image, and the pixel value corresponding to each coordinate position of the text image is then adjusted according to the xy offset to obtain the third corrected image, as sketched below.
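A hedged sketch of this per-pixel offset resampling: a sampling grid is shifted by a 2-channel xy offset map (matching the 2 output channels of the submodule's last convolutional layer noted earlier). The zero offsets stand in for the network's prediction and leave the image unchanged; sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

B, H, W = 1, 64, 256
image = torch.rand(B, 1, H, W)                       # stand-in text image
offsets = torch.zeros(B, H, W, 2)                    # predicted xy offsets (stand-in)

ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
base = torch.stack([xs, ys], dim=-1).unsqueeze(0)    # (1, H, W, 2) grid in xy order
grid = base.expand(B, H, W, 2) + offsets             # shift each sampling coordinate
third_corrected = F.grid_sample(image, grid, align_corners=True)
```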
S530, predicting a second number of reference point coordinates of the text image by using a third correction submodule of the plurality of correction submodules, and performing thin plate spline interpolation transformation on the text image according to the second number of reference point coordinates to obtain a fourth corrected image.
It will be appreciated that the third correction submodule also predicts reference point coordinates of the text image, specifically a second number of them, and then performs a thin plate spline (TPS) interpolation transformation on the text image based on the second number of reference point coordinates to obtain the fourth corrected image.
It can be understood that after each correction submodule outputs a corrected image, the first processing module superimposes the plurality of corrected images in series to obtain the first corrected image. For example, if the size of each corrected image is 128×128×1, the first corrected image obtained by serially superimposing the 3 corrected images has a size of 128×128×3.
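The serial superposition itself is a channel-wise concatenation; a minimal sketch matching the 128×128×1 to 128×128×3 example above (with assumed stand-in images) is:

```python
import torch

img1 = torch.randn(1, 1, 128, 128)   # corrected image from submodule 311
img2 = torch.randn(1, 1, 128, 128)   # corrected image from submodule 312
img3 = torch.randn(1, 1, 128, 128)   # corrected image from submodule 313
first_corrected = torch.cat([img1, img2, img3], dim=1)   # (1, 3, 128, 128)
```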
The embodiment of the present disclosure provides a text recognition method in which the correction module of the text recognition model includes a plurality of correction submodules, each corresponding to one correction method. The text image is corrected with a plurality of correction methods to obtain a plurality of corrected images, and the final corrected image is obtained based on the plurality of corrected images. The accuracy of correcting the text image is therefore high, which further improves the recognition accuracy of the characters in the text image.
On the basis of the foregoing embodiments, FIG. 7 is a schematic structural diagram of a text recognition device provided in an embodiment of the present disclosure. The text recognition device provided in this embodiment of the present disclosure can execute the processing flow provided in the foregoing text recognition method embodiments. As shown in FIG. 7, the text recognition device 700 includes:
An acquisition unit 710, configured to acquire a text image including at least one character.
An input unit 720 for inputting the text image into the pre-trained text recognition model. The text recognition model comprises a correction module, a feature extraction module and a decoding module, wherein the correction module comprises a plurality of correction submodules and a first processing module, and the correction submodules respectively correspond to different correction methods.
A processing unit 730, configured to correct the text image by using each of the plurality of correction submodules to obtain a plurality of corrected images; superimpose the plurality of corrected images in series by using the first processing module to obtain a first corrected image; perform feature extraction on the first corrected image by using the feature extraction module to obtain first feature information; and generate, by using the decoding module, a first probability matrix corresponding to the text image based on the first feature information.
The recognition unit 740 is configured to recognize characters in the text image according to the first probability matrix, so as to obtain a recognition result.
Optionally, the plurality of corrected images includes a second corrected image, a third corrected image, and a fourth corrected image.
Optionally, the processing unit 730 is further configured to:
Predicting a first number of reference point coordinates of the text image by using a first correction submodule of the plurality of correction submodules, and performing affine transformation on the text image according to the first number of reference point coordinates to obtain a second corrected image;
calculating the offset of the text image by using a second correction submodule of the plurality of correction submodules, and adjusting the pixel value corresponding to each coordinate position of the text image according to the offset to obtain a third corrected image;
and predicting a second number of reference point coordinates of the text image by using a third correction submodule of the plurality of correction submodules, and performing thin plate spline interpolation transformation on the text image according to the second number of reference point coordinates to obtain a fourth corrected image.
Optionally, the feature extraction module includes a first feature extraction module and a second feature extraction module.
Optionally, the processing unit 730 is further configured to:
performing feature extraction on the first corrected image by using a first feature extraction module to obtain second feature information;
and carrying out data enhancement on the second characteristic information by using the second characteristic extraction module to obtain the first characteristic information.
Optionally, the first feature extraction module includes a plurality of feature extraction submodules, where the plurality of feature extraction submodules respectively correspond to different feature extraction methods; a first feature extraction submodule of the plurality of feature extraction submodules is composed of a residual network module; and a second feature extraction submodule of the plurality of feature extraction submodules is composed of an encoding module.
Optionally, the processing unit 730 is further configured to:
performing feature mapping on the first corrected image by using the first feature extraction submodule, and compressing the height of the feature mapping result to a preset threshold to obtain a first feature map;
encoding the first corrected image by using the second feature extraction submodule, and compressing the height of the output vector obtained by encoding to the preset threshold to obtain a second feature map;
and calculating first scores corresponding to the first feature map and the second feature map by using a first activation function layer in the first feature extraction module, and obtaining the second feature information from the first feature map, the second feature map, and the first scores.
Optionally, the second feature extraction module includes a plurality of feature mapping modules, and a third feature mapping module of the plurality of feature mapping modules is composed of a bidirectional recurrent network module.
Optionally, the processing unit 730 is further configured to:
performing feature enhancement on the second feature information by using a third feature mapping module to obtain enhancement information;
And calculating a second score according to the enhancement information and the second feature information by using a second activation function layer in the second feature extraction module, and obtaining the first feature information according to the second score, the enhancement information and the second feature information.
Optionally, the decoding module includes a plurality of decoding submodules that respectively correspond to different decoding methods; a first decoding submodule of the plurality of decoding submodules is composed of a self-attention layer and a recurrent network layer, and a second decoding submodule of the plurality of decoding submodules is composed of a depth module.
Optionally, the processing unit 730 is further configured to:
Decoding based on the first feature information by using the first decoding submodule to generate a second probability matrix comprising semantic information and temporal information;
mapping the first feature information into a continuous representation by using the second decoding submodule to generate a third probability matrix;
and calculating a third score from the second probability matrix and the third probability matrix by using a third activation function layer in the decoding module, and generating the first probability matrix corresponding to the text image from the third score, the second probability matrix, and the third probability matrix.
Optionally, the text recognition model may be trained by:
acquiring a sample image set, wherein the sample image set comprises a sample image and a text labeling result corresponding to the sample image;
inputting the sample image into a pre-constructed text recognition model to generate a plurality of text recognition results;
calculating a loss value according to the plurality of text recognition results and the text labeling result;
and updating network parameters of the text recognition model according to the loss value until the loss value is smaller than a preset threshold value, and outputting the text recognition model.
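Compressed into code, the recited procedure is an ordinary training loop that stops once the loss falls below the preset threshold. The optimizer, learning rate, threshold value, and epoch cap below are placeholders, not values from the disclosure.

```python
import torch

def train(model, loader, loss_fn, threshold=0.01, lr=1e-4, max_epochs=100):
    """Train until the loss value is smaller than the preset threshold.
    threshold, lr, and max_epochs are placeholder values."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        for sample_img, label in loader:
            outputs = model(sample_img)     # plurality of recognition results
            loss = loss_fn(outputs, label)  # vs. the text labeling result
            opt.zero_grad()
            loss.backward()
            opt.step()
            if loss.item() < threshold:     # stop criterion from the text
                return model
    return model
```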
Optionally, the plurality of text recognition results include a first recognition result output by a first decoding submodule in the decoding module, a second recognition result output by a second decoding submodule in the decoding module, and a third recognition result output by a third decoding submodule in the decoding module; the loss values include a first loss value, a second loss value, and a third loss value.
Optionally, calculating the loss value according to the plurality of text recognition results and the text labeling result includes:
calculating a first loss value according to the first recognition result and the text labeling result by adopting a first loss function;
calculating a second loss value according to the second recognition result and the text labeling result by adopting the first loss function;
and calculating a third loss value according to the third recognition result and the text labeling result by adopting a second loss function.
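A sketch of the three-way loss: the disclosure says only that the first two recognition results share one loss function and the third uses another. Taking the shared loss to be CTC, the other to be cross-entropy, and summing without weights are assumptions made here for concreteness.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)    # first loss function (assumed)
ce = nn.CrossEntropyLoss()                       # second loss function (assumed)

def total_loss(out1, out2, out3, targets, in_lens, tgt_lens, labels):
    """out1/out2: (T, B, V) logits for the CTC branches; out3: (B, V, T)
    logits for the cross-entropy branch; targets/labels encode the text
    labeling result in the forms the two losses expect."""
    l1 = ctc(out1.log_softmax(-1), targets, in_lens, tgt_lens)  # first loss
    l2 = ctc(out2.log_softmax(-1), targets, in_lens, tgt_lens)  # second loss
    l3 = ce(out3, labels)                                       # third loss
    return l1 + l2 + l3   # unweighted sum — an assumption
```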
The device provided in this embodiment follows the same implementation principle and achieves the same technical effects as the foregoing method embodiment; for brevity, where this device embodiment is silent, reference may be made to the corresponding content of the foregoing method embodiment.
The exemplary embodiments of the present disclosure also provide an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor; when executed, the computer program causes the electronic device to perform a method according to embodiments of the present disclosure.
The present disclosure also provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to embodiments of the disclosure.
Referring to fig. 8, a block diagram of an electronic device 800, which may be a server or a client of the present disclosure and which is an example of a hardware device applicable to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the electronic device 800; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit 807 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 808 may include, but is not limited to, magnetic disks and optical disks. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices over computer networks, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth™ devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above. For example, in some embodiments, the text recognition method or training method of the recognition network may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. In some embodiments, the computing unit 801 may be configured to perform a text recognition method or training method of the recognition network by any other suitable means (e.g., by means of firmware).
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The foregoing describes merely specific embodiments of the disclosure, provided to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown and described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method of text recognition, comprising:
acquiring a text image, wherein the text image comprises at least one character;
inputting the text image into a pre-trained text recognition model, wherein the text recognition model comprises a correction module, a feature extraction module and a decoding module, the correction module comprises a plurality of correction sub-modules and a first processing module, and the correction sub-modules respectively correspond to different correction methods;
correcting the text image by using each correction sub-module in the plurality of correction sub-modules to obtain a plurality of corrected images;
superimposing the plurality of corrected images in series by using the first processing module to obtain a first corrected image;
performing feature extraction on the first corrected image by using the feature extraction module to obtain first feature information;
generating a first probability matrix corresponding to the text image based on the first characteristic information by utilizing the decoding module;
recognizing characters in the text image according to the first probability matrix to obtain a recognition result;
wherein the correcting the text image by using each correction sub-module in the plurality of correction sub-modules to obtain a plurality of corrected images comprises:
predicting a first number of reference point coordinates of the text image by using a first correction submodule in the plurality of correction submodules, and carrying out affine transformation on the text image according to the first number of reference point coordinates to obtain a second corrected image;
calculating an offset of the text image by using a second correction submodule in the plurality of correction submodules, and adjusting the pixel value corresponding to each coordinate position of the text image according to the offset to obtain a third corrected image;
and predicting a second number of reference point coordinates of the text image by using a third correction submodule in the plurality of correction submodules, and performing thin-plate spline interpolation transformation on the text image according to the second number of reference point coordinates to obtain a fourth corrected image.
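Purely as an illustration of the correction branches just recited, the sketch below implements the first two in PyTorch and omits the thin-plate spline branch for brevity. Note two simplifications, both assumptions rather than the claimed design: the affine branch regresses the six transform parameters directly instead of predicting explicit reference points first, and the offset branch realizes the per-coordinate pixel adjustment as a predicted displacement field. The closing lines mimic the first processing module by concatenating the corrected images, one plausible reading of "superimposing in series".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineCorrector(nn.Module):
    """First correction submodule: spatial-transformer-style affine branch."""
    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 6),               # six affine parameters
        )
        # Initialize to the identity transform (a common STN trick).
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, img):
        theta = self.loc(img).view(-1, 2, 3)
        grid = F.affine_grid(theta, img.size(), align_corners=False)
        return F.grid_sample(img, grid, align_corners=False)  # second corrected image

class OffsetCorrector(nn.Module):
    """Second correction submodule: per-position offsets resample the image."""
    def __init__(self):
        super().__init__()
        self.off = nn.Conv2d(3, 2, 3, padding=1)   # (dx, dy) per pixel

    def forward(self, img):
        b, _, h, w = img.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
        offset = self.off(img).permute(0, 2, 3, 1) * 0.1   # keep shifts small
        return F.grid_sample(img, base + offset,
                             align_corners=False)  # third corrected image

img = torch.randn(2, 3, 32, 128)
# First processing module, read as channel-wise concatenation of the
# corrected images:
first_corrected = torch.cat([AffineCorrector()(img), OffsetCorrector()(img)], dim=1)
print(first_corrected.shape)  # torch.Size([2, 6, 32, 128])
```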
2. The method of claim 1, wherein the feature extraction module comprises a first feature extraction module and a second feature extraction module,
The feature extraction module performs feature extraction on the first corrected image to obtain first feature information, and the method includes:
performing feature extraction on the first corrected image by using the first feature extraction module to obtain second feature information;
and performing data enhancement on the second feature information by using the second feature extraction module to obtain the first feature information.
3. The method of claim 2, wherein the first feature extraction module comprises a plurality of feature extraction sub-modules, a first feature extraction sub-module of the plurality of feature extraction sub-modules being comprised of a residual network module, a second feature extraction sub-module of the plurality of feature extraction sub-modules being comprised of an encoding module,
The step of extracting the features of the first corrected image by using the first feature extraction module to obtain second feature information includes:
performing feature mapping on the first corrected image by using the first feature extraction sub-module, and compressing the height of a feature mapping result to a preset threshold value to obtain a first feature map;
encoding the first corrected image by using the second feature extraction sub-module, and compressing the height of an output vector obtained by encoding to the preset threshold value to obtain a second feature map;
and calculating a first score corresponding to the first feature map and the second feature map by using a first activation function layer in the first feature extraction module, and obtaining the second feature information according to the first feature map, the second feature map and the first score.
4. The method of claim 2, wherein the second feature extraction module comprises a plurality of feature mapping modules, a third feature mapping module of the plurality of feature mapping modules being composed of a bidirectional recurrent network module,
The data enhancement is performed on the second feature information by using the second feature extraction module to obtain first feature information, including:
performing feature enhancement on the second feature information by using the third feature mapping module to obtain enhancement information;
and calculating a second score according to the enhancement information and the second feature information by using a second activation function layer in the second feature extraction module, and obtaining the first feature information according to the second score, the enhancement information and the second feature information.
5. The method of claim 1, wherein the decoding module comprises a plurality of decoding submodules, a first decoding submodule of the plurality of decoding submodules being composed of a self-attention layer and a recurrent network layer, a second decoding submodule of the plurality of decoding submodules being composed of a depth module,
The generating, by the decoding module, a first probability matrix corresponding to the text image based on the first feature information includes:
decoding based on the first feature information by utilizing the first decoding submodule to generate a second probability matrix comprising semantic information and time information;
mapping the first feature information into a continuous representation by using the second decoding submodule to generate a third probability matrix;
and calculating a third score according to the second probability matrix and the third probability matrix by using a third activation function layer in the decoding module, and generating a first probability matrix corresponding to the text image according to the third score, the second probability matrix and the third probability matrix.
6. The method of claim 1, wherein the text recognition model is trained by:
acquiring a sample image set, wherein the sample image set comprises a sample image and a text labeling result corresponding to the sample image;
inputting the sample image into a pre-constructed text recognition model to generate a plurality of text recognition results;
calculating a loss value according to the plurality of text recognition results and the text labeling result;
and updating the network parameters of the text recognition model according to the loss value until the loss value is smaller than a preset threshold value, and outputting the text recognition model.
7. The method of claim 6, wherein the plurality of text recognition results include a first recognition result output by a first decoding submodule in the decoding module, a second recognition result output by a second decoding submodule in the decoding module, and a third recognition result output by a third decoding submodule in the decoding module, the loss values including a first loss value, a second loss value, and a third loss value,
the calculating a loss value according to the plurality of text recognition results and the text labeling result comprising:
calculating a first loss value according to the first recognition result and the text labeling result by adopting a first loss function;
calculating a second loss value according to the second recognition result and the text labeling result by adopting the first loss function;
and calculating a third loss value according to the third recognition result and the text labeling result by adopting a second loss function.
8. A text recognition device, comprising:
an acquisition unit configured to acquire a text image including at least one character;
The input unit is used for inputting the text image into a pre-trained text recognition model, wherein the text recognition model comprises a correction module, a feature extraction module and a decoding module, the correction module comprises a plurality of correction sub-modules and a first processing module, and the correction sub-modules respectively correspond to different correction methods;
The processing unit is used for correcting the text image by utilizing each correction sub-module in the plurality of correction sub-modules to obtain a plurality of corrected images; the plurality of corrected images are overlapped in series by utilizing the first processing module to obtain a first corrected image; performing feature extraction on the first corrected image by using the feature extraction module to obtain first feature information; generating a first probability matrix corresponding to the text image based on the first characteristic information by utilizing the decoding module;
The recognition unit is used for recognizing characters in the text image according to the first probability matrix to obtain a recognition result;
wherein the plurality of corrected images include a second corrected image, a third corrected image, and a fourth corrected image, and the processing unit is configured to:
predicting a first number of reference point coordinates of the text image by using a first correction submodule in the plurality of correction submodules, and carrying out affine transformation on the text image according to the first number of reference point coordinates to obtain the second corrected image; calculating an offset of the text image by using a second correction submodule in the plurality of correction submodules, and adjusting the pixel value corresponding to each coordinate position of the text image according to the offset to obtain the third corrected image; and predicting a second number of reference point coordinates of the text image by using a third correction submodule in the plurality of correction submodules, and performing thin-plate spline interpolation transformation on the text image according to the second number of reference point coordinates to obtain the fourth corrected image.
9. An electronic device, the electronic device comprising:
a processor; and
a memory in which a program is stored,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the text recognition method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the text recognition method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210665530.2A CN115100659B (en) | 2022-06-13 | 2022-06-13 | Text recognition method, device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115100659A CN115100659A (en) | 2022-09-23 |
CN115100659B true CN115100659B (en) | 2024-08-02 |
Family
ID=83290138
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210665530.2A Active CN115100659B (en) | 2022-06-13 | 2022-06-13 | Text recognition method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115100659B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116071759B (en) * | 2023-03-06 | 2023-07-18 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Optical character recognition method fusing GPT2 pre-training large model |
CN117765133B (en) * | 2024-02-22 | 2024-05-24 | 青岛海尔科技有限公司 | Correction method and device for generated text, storage medium and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111476731A (en) * | 2020-04-01 | 2020-07-31 | Oppo广东移动通信有限公司 | Image correction method, image correction device, storage medium and electronic equipment |
CN113191975A (en) * | 2021-04-29 | 2021-07-30 | 展讯通信(上海)有限公司 | Image distortion correction method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113785305B (en) * | 2019-05-05 | 2024-04-16 | 华为云计算技术有限公司 | Method, device and equipment for detecting inclined characters |
CN113435436A (en) * | 2021-06-03 | 2021-09-24 | 北京理工大学 | Scene character recognition method based on linear constraint correction network |
Similar Documents
Publication | Title |
---|---|
CN115100659B (en) | Text recognition method, device, electronic equipment and storage medium |
CN113254654B (en) | Model training method, text recognition method, device, equipment and medium |
CN112883968B (en) | Image character recognition method, device, medium and electronic equipment |
JP7384943B2 (en) | Training method for character generation model, character generation method, device, equipment and medium |
CN113205160B (en) | Model training method, text recognition method, model training device, text recognition device, electronic equipment and medium |
CN113313083B (en) | Text detection method and device |
CN113343958B (en) | Text recognition method, device, equipment and medium |
CN114022887B (en) | Text recognition model training and text recognition method and device, and electronic equipment |
CN112949649B (en) | Text image identification method and device and computing equipment |
CN114639096B (en) | Text recognition method, device, electronic equipment and storage medium |
CN114140802B (en) | Text recognition method and device, electronic equipment and storage medium |
CN114973229A (en) | Text recognition model training method, text recognition device, text recognition equipment and medium |
CN114758330A (en) | Text recognition method and device, electronic equipment and storage medium |
CN113516697A (en) | Image registration method and device, electronic equipment and computer-readable storage medium |
CN113887535B (en) | Model training method, text recognition method, device, equipment and medium |
CN115294581A (en) | Method and device for identifying error characters, electronic equipment and storage medium |
CN115565186A (en) | Method and device for training character recognition model, electronic equipment and storage medium |
CN114821560B (en) | Text recognition method and device |
CN113610064B (en) | Handwriting recognition method and device |
CN116309274B (en) | Method and device for detecting small target in image, computer equipment and storage medium |
CN114758331B (en) | Text recognition method, device, electronic equipment and storage medium |
CN113255689B (en) | Text line picture identification method, device and equipment |
CN113657353B (en) | Formula identification method and device, electronic equipment and storage medium |
CN114708581A (en) | Image processing method and device, electronic equipment and storage medium |
CN116798048A (en) | Text recognition method, device, equipment and storage medium |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |