
CN115984883A - A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network - Google Patents

A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network

Info

Publication number
CN115984883A
CN115984883A
Authority
CN
China
Prior art keywords
text
image
hindi
layer
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310032987.4A
Other languages
Chinese (zh)
Inventor
汪增福
李永瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Institutes of Physical Science of CAS
Original Assignee
Hefei Institutes of Physical Science of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Institutes of Physical Science of CAS filed Critical Hefei Institutes of Physical Science of CAS
Priority to CN202310032987.4A priority Critical patent/CN115984883A/en
Publication of CN115984883A publication Critical patent/CN115984883A/en
Pending legal-status Critical Current

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention discloses a Hindi image-text recognition method based on an enhanced Vision Transformer network, which comprises the following steps: 1, synthesizing Hindi text images and establishing a Hindi image-text recognition training dataset; 2, constructing a Hindi image-text recognition network; 3, calculating a loss function for each input image and training the Hindi image-text recognition network; and 4, performing text recognition on an arbitrary input image to be recognized using the trained Hindi image-text recognition network. After training, the network can recognize the text of Hindi text images that did not appear in the training data, and therefore has high practical value.

Description

A Hindi Image-Text Recognition Method Based on an Enhanced Vision Transformer Network

Technical Field

The present invention relates to the field of Hindi text image recognition, and in particular to a Hindi image-text recognition method based on an enhanced Vision Transformer network.

Background Art

Most current Hindi image-text recognition methods are based on deep learning. Deep learning is a data-driven approach, so the quantity and quality of training data are crucial to model performance. Hindi is a low-resource language, and real text image data for it is scarce. On the model side, most recognition networks designed specifically for Hindi use an encoder based on convolutional neural networks and a decoder based on connectionist temporal classification, a design that limits the performance of Hindi image-text recognition models.

Summary of the Invention

To overcome the shortcomings of the prior art, the present invention proposes a Hindi image-text recognition method based on an enhanced Vision Transformer network, with the aim of performing the Hindi image-text recognition task efficiently and thereby improving recognition accuracy on Hindi text images.

To achieve the above object, the present invention adopts the following technical scheme:

The Hindi image-text recognition method based on an enhanced Vision Transformer network of the present invention comprises the following steps:

Step 1: Construct a Hindi image-text recognition training dataset;

Hindi text images are synthesized from text-free background images and Hindi text content, and each piece of text content serves as the text label of its synthesized image, yielding a text image set X = [x1, x2, …, xi, …, xN] and a corresponding label set G = [g1, g2, …, gi, …, gN], where xi ∈ R^(H×W×3) denotes the i-th Hindi text image, gi denotes its text label, H and W denote the image height and width, and 3 is the number of channels;

Step 2: Construct a text image recognition network comprising an encoder and a decoder, which is used to obtain the prediction probability Pi of the i-th Hindi text image xi;

Step 3: Load the parameters of the ViTAEv2 basic module pre-trained on an image classification task, and use them to initialize the encoder parameters of the text image recognition network;

A cross-entropy loss function is constructed from the text label gi and the prediction probability Pi, and the initialized text image recognition network is trained by back-propagation until the loss converges, yielding the trained text image recognition network;

Step 4: Use the trained text image recognition network to recognize an arbitrary input Hindi text image, obtaining a prediction probability vector for the character at each position; for each position, the class with the maximum probability in that vector is selected as the text recognition result for the image.

The method of the present invention is further characterized in that the encoder module in Step 2 consists of R stacked ViTAEv2 basic modules, wherein the r-th ViTAEv2 basic encoding module comprises a main branch, a bypass branch, and a feed-forward encoding layer with a residual connection;

The main branch of the r-th ViTAEv2 basic encoding module comprises a multi-scale convolution layer, a layer normalization layer, and a multi-head attention layer;

The bypass branch of the r-th ViTAEv2 basic encoding module comprises a convolution layer, a batch normalization layer, and a sigmoid-weighted linear unit;

When r = 1, the i-th Hindi text image xi is input into the text image recognition network; the multi-scale convolution layer of the main branch of the r-th ViTAEv2 basic encoding module extracts features and outputs a multi-channel feature map Mi,r, which is reshaped into a feature sequence Si,r and then processed by the multi-head attention layer to obtain a feature sequence Oi,r;

The bypass branch of the r-th ViTAEv2 basic encoding module processes the i-th Hindi text image xi to obtain a feature map M′i,r, which is reshaped into a feature sequence S′i,r;

The corresponding elements of Oi,r and S′i,r are added to obtain the fused sequence fi,r, which is input into the feed-forward encoding layer of the r-th ViTAEv2 basic encoding module and processed in turn by a layer normalization operation and two fully connected layers with a rectified-linear activation, outputting the r-th visual feature sequence Pi,r;

When r = 2, 3, …, R, the (r-1)-th visual feature sequence Pi,r-1 is input into the r-th ViTAEv2 basic encoding module and processed to obtain the r-th visual feature sequence Pi,r; the R-th visual feature sequence Pi,R output by the R-th ViTAEv2 basic encoding module serves as the visual feature sequence output by the encoder module, whose j-th element is the j-th visual feature vector of the i-th Hindi text image xi; n denotes the total number of visual feature vectors.

The decoder in Step 2 is formed by a linear classification layer and K identical stacked Transformer basic decoding modules; the k-th Transformer basic decoding module comprises a masked multi-head self-attention layer, an encoder-decoder multi-head attention layer, and a feed-forward layer with a residual connection;

When k = 1, the text label gti is embedded to obtain the embedding result seqi, which is input into the k-th masked multi-head self-attention layer to obtain the k-th self-attention result sequence;

The visual feature sequence output by the encoder is input into the decoder, where the k-th encoder-decoder multi-head attention layer processes it together with the k-th self-attention result sequence to obtain the k-th cross-modal result sequence;

The k-th feed-forward layer with residual connection processes the k-th cross-modal result sequence to obtain the output sequence of the k-th Transformer basic decoding module;

When k = 2, 3, …, K, the output sequence of the (k-1)-th basic decoding module is input into the masked multi-head self-attention layer of the k-th Transformer basic decoding module; its output, together with the visual feature sequence, is processed by the k-th encoder-decoder multi-head attention layer and the feed-forward layer with residual connection to obtain the output sequence of the k-th Transformer basic decoding module; the K-th Transformer basic decoding module thus yields the K-th processing result sequence.

The linear classification layer makes predictions from the K-th processing result sequence, yielding the prediction probability Pi of the i-th Hindi text image xi; the t-th element of Pi is the character probability for the t-th position, and T denotes the maximum decoding length.

The present invention also provides an electronic device comprising a memory and a processor, wherein the memory stores a program enabling the processor to execute any of the above Hindi image-text recognition methods, and the processor is configured to execute the program stored in the memory.

The present invention also provides a computer-readable storage medium storing a computer program which, when run by a processor, executes the steps of any of the above Hindi image-text recognition methods.

Compared with the prior art, the present invention has the following beneficial effects:

1. The present invention designs a Hindi image-text recognition network that, after being trained on synthetic Hindi text images, can effectively recognize real Hindi text images in real-world scenes with improved recognition accuracy.

2. The present invention introduces the structure of the enhanced Vision Transformer network into the encoder of the Hindi recognition network, strengthening the encoder's modeling capability when extracting image features and thereby enabling the network to extract features better suited to text recognition.

3. Using the idea of transfer learning, the present invention transfers knowledge learned by part of the enhanced Vision Transformer network's parameters on image datasets from other tasks into Hindi image-text recognition, compensating for the scarcity of real Hindi text image data.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of the Hindi recognition method based on an enhanced Vision Transformer network according to the present invention;

FIG. 2 is a structural diagram of the Hindi image-text recognition network based on an enhanced Vision Transformer network according to the present invention.

DETAILED DESCRIPTION

In this embodiment, a Hindi image-text recognition method based on an enhanced Vision Transformer network, as shown in FIG. 1, comprises the following steps:

Step 1: Construct a Hindi image-text recognition training dataset;

Hindi text images are synthesized from text-free background images and Hindi text content, and each piece of text content serves as the text label of its synthesized image, yielding a text image set X = [x1, x2, …, xi, …, xN] and a corresponding label set G = [g1, g2, …, gi, …, gN], where xi ∈ R^(H×W×3) denotes the i-th Hindi text image, gi denotes its text label, H and W denote the image height and width, and 3 is the number of channels;
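The dataset construction above can be sketched as follows. This is an illustrative assumption, not the patent's actual pipeline: `render_text`, the word list, and the 32×128 image size are hypothetical stand-ins for the real Hindi text renderer and synthesis parameters.

```python
import numpy as np

H, W = 32, 128  # assumed image height and width

def render_text(background, text):
    """Hypothetical stand-in for the synthesis step that draws Hindi
    text onto a text-free background; here it only copies the image."""
    return background.copy()

rng = np.random.default_rng(0)
words = ["नमस्ते", "भारत", "हिंदी"]            # Hindi text content (toy list)
backgrounds = rng.integers(0, 256, size=(len(words), H, W, 3),
                           dtype=np.uint8)   # text-free backgrounds

# X is the text image set, G the corresponding label set:
# each word is rendered into an image and also kept as that image's label.
X = [render_text(bg, w) for bg, w in zip(backgrounds, words)]
G = list(words)
```

In a real pipeline `render_text` would rasterize the Devanagari string with a font library; the pairing of each image with its source string as the label is the part the patent specifies.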

Step 2: As shown in FIG. 2, construct a text image recognition network comprising an encoder and a decoder;

The encoder extracts and processes the information in the image that helps identify the characters it contains. The decoder takes the image features produced by the encoder as input and decodes the characters in the image one by one in an autoregressive manner.

Step 2.1: The encoder module consists of R stacked ViTAEv2 basic modules. Combining multi-head self-attention and convolution lets the ViTAEv2 basic module exploit both the strengths of convolution and the Transformer network's ability to model long-range dependencies. The r-th ViTAEv2 basic encoding module comprises a main branch, a bypass branch, and a feed-forward encoding layer with a residual connection;

The main branch of the r-th ViTAEv2 basic encoding module comprises a multi-scale convolution layer, a layer normalization layer, and a multi-head attention layer. The multi-head attention operation models long-range dependencies in the image and takes a sequence of vectors as input; if the preceding layer outputs a multi-channel feature map, it must first be reshaped into a sequence of vectors;

The bypass branch of the r-th ViTAEv2 basic encoding module comprises a convolution layer, a batch normalization layer, and a sigmoid-weighted linear unit. If the input from the previous layer is a sequence of vectors, the bypass branch first reshapes it into a multi-channel feature map before processing;
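The two shape conversions described above can be sketched in NumPy. This is a minimal sketch under assumed conventions (channel-first maps, row-major flattening); the helper names are not from the patent.

```python
import numpy as np

def map_to_sequence(feature_map):
    """(C, H, W) feature map -> (H*W, C) sequence of vectors,
    as needed before the multi-head attention layer."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).T

def sequence_to_map(seq, h, w):
    """(H*W, C) sequence -> (C, H, W) feature map, the inverse
    reshape needed before the bypass branch's convolution."""
    n, c = seq.shape
    assert n == h * w
    return seq.T.reshape(c, h, w)

m = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # C=2, H=3, W=4
s = map_to_sequence(m)           # shape (12, 2): one vector per pixel
m2 = sequence_to_map(s, 3, 4)    # round-trips back to (2, 3, 4)
```

The two functions are exact inverses, so the main branch and bypass branch can exchange representations without losing information.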

When r = 1, the i-th Hindi text image xi is input into the text image recognition network; the multi-scale convolution layer of the main branch of the r-th ViTAEv2 basic encoding module extracts features and outputs a multi-channel feature map Mi,r, which is reshaped into a feature sequence Si,r and then processed by the multi-head attention layer to obtain a feature sequence Oi,r;

The bypass branch of the r-th ViTAEv2 basic encoding module processes the i-th Hindi text image xi to obtain a feature map M′i,r, which is reshaped into a feature sequence S′i,r;

The corresponding elements of Oi,r and S′i,r are added to obtain the fused sequence fi,r, which is input into the feed-forward encoding layer of the r-th ViTAEv2 basic encoding module and processed in turn by a layer normalization operation and two fully connected layers with a rectified-linear activation, outputting the r-th visual feature sequence Pi,r;
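The fusion and feed-forward step above can be sketched in NumPy; the dimensions (sequence length 12, model width 8, hidden width 16) and the random weights are assumptions for illustration only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector in the sequence to zero mean, unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward_block(fused, w1, b1, w2, b2):
    """LayerNorm -> FC with ReLU -> FC, plus a residual connection."""
    h = layer_norm(fused)
    h = np.maximum(h @ w1 + b1, 0.0)   # first fully connected layer, ReLU
    h = h @ w2 + b2                    # second fully connected layer
    return fused + h                   # residual connection

rng = np.random.default_rng(0)
d, dff = 8, 16
o = rng.normal(size=(12, d))    # main-branch attention output (O_{i,r})
s = rng.normal(size=(12, d))    # bypass-branch sequence (S'_{i,r})
fused = o + s                   # element-wise fusion (f_{i,r})
p = feed_forward_block(fused, rng.normal(size=(d, dff)), np.zeros(dff),
                       rng.normal(size=(dff, d)), np.zeros(d))
```

The residual connection keeps the fused sequence's shape, so stacked ViTAEv2 modules can be chained directly.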

When r = 2, 3, …, R, the (r-1)-th visual feature sequence Pi,r-1 is input into the r-th ViTAEv2 basic encoding module and processed to obtain the r-th visual feature sequence Pi,r; the R-th visual feature sequence Pi,R output by the R-th ViTAEv2 basic encoding module serves as the visual feature sequence output by the encoder module, whose j-th element is the j-th visual feature vector of the i-th Hindi text image xi; n denotes the total number of visual feature vectors;

Step 2.2: The decoder is formed by a linear classification layer and K identical stacked Transformer basic decoding modules; the k-th Transformer basic decoding module comprises a masked multi-head self-attention layer, an encoder-decoder multi-head attention layer, and a feed-forward layer with a residual connection;

The decoder takes the visual feature sequence produced by the encoder as input and decodes the characters in the image one by one in an autoregressive manner. At each step, the characters predicted in previous steps are fed to the decoder as additional input for the current step.
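The autoregressive loop described above can be sketched in pure Python. `next_char_probs` is a hypothetical stand-in for a full decoder forward pass conditioned on the visual features; only the loop structure (feed previous predictions back in, stop at end-of-sequence or the maximum decoding length) reflects the patent's description.

```python
EOS = 0        # end-of-sequence token id (assumed convention)
MAX_LEN = 5    # maximum decoding length T (toy value)

def next_char_probs(prefix):
    """Toy stand-in for the decoder: emits character ids 1, 2, then EOS,
    regardless of the (unused) visual features."""
    table = {0: 1, 1: 2, 2: EOS}
    probs = [0.0] * 4
    probs[table.get(len(prefix) - 1, EOS)] = 1.0
    return probs

def greedy_decode():
    prefix = [EOS]                 # start token fed at the first step
    out = []
    for _ in range(MAX_LEN):
        probs = next_char_probs(prefix)
        c = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if c == EOS:
            break
        out.append(c)
        prefix.append(c)           # previous prediction becomes input
    return out
```

With the toy stand-in, `greedy_decode()` produces the character ids `[1, 2]` and then stops at the EOS token.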

When k = 1, the text label gti is embedded to obtain the embedding result seqi, which is input into the k-th masked multi-head self-attention layer to obtain the k-th self-attention result sequence. When the self-attention weights are computed, a mask is applied before the softmax(·) function to prevent information in the decoder from flowing from right to left, which would leak the labels;
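The masking step described above can be sketched in NumPy: positions above the diagonal of the score matrix are set to negative infinity before the softmax, so position t receives zero weight from any later position. A single-head sketch with zero scores is shown; real multi-head attention applies the same mask per head.

```python
import numpy as np

def causal_masked_softmax(scores):
    """Apply a causal mask before softmax so position t cannot attend
    to positions > t (prevents label leakage during training)."""
    T = scores.shape[-1]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal
    masked = np.where(mask, -np.inf, scores)
    e = np.exp(masked - masked.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

w = causal_masked_softmax(np.zeros((4, 4)))
# row 0 attends only to position 0; row 3 attends uniformly to all 4
```

Because exp(-inf) is exactly 0, masked positions contribute nothing to the weighted sum while each row still normalizes to 1.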

The visual feature sequence is input into the decoder, where the k-th encoder-decoder multi-head attention layer processes it together with the k-th self-attention result sequence to obtain the k-th cross-modal result sequence. Through encoder-decoder attention, the decoder computes an attention weight for every vector in the sequence output by the encoder: if a region of the image is important for decoding the current character, the vector corresponding to that region receives a high attention weight. By weighted summation, multi-head attention suppresses encoder-output features irrelevant to the current character and emphasizes those important to it;

The k-th feed-forward layer with residual connection processes the k-th cross-modal result sequence to obtain the output sequence of the k-th Transformer basic decoding module;

When k = 2, 3, …, K, the output sequence of the (k-1)-th basic decoding module is input into the masked multi-head self-attention layer of the k-th Transformer basic decoding module; its output, together with the visual feature sequence, is processed by the k-th encoder-decoder multi-head attention layer and the feed-forward layer with residual connection to obtain the output sequence of the k-th Transformer basic decoding module; the K-th Transformer basic decoding module thus yields the K-th processing result sequence;

The linear classification layer makes predictions from the K-th processing result sequence, yielding the prediction probability Pi of the i-th Hindi text image xi; the t-th element of Pi is the character probability for the t-th position, and T denotes the maximum decoding length. During training this probability is used to compute the loss value; at test time it is used to select the character predicted at the current step;
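The per-position prediction can be sketched in NumPy. The four-symbol charset and the logit values are toy assumptions standing in for the real character set and the linear classification layer's output.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

charset = ["<eos>", "क", "ख", "ग"]           # toy alphabet (assumption)
logits = np.array([[0.1, 3.0, 0.2, 0.3],     # T x |charset| scores, one row
                   [0.2, 0.1, 2.5, 0.4],     # per decoding position, as the
                   [2.0, 0.3, 0.1, 0.2]])    # linear layer would produce
probs = softmax(logits)                      # prediction probability vectors
pred = [charset[t] for t in probs.argmax(-1)]  # max-probability class per slot
text = "".join(c for c in pred if c != "<eos>")
```

Here the argmax picks क, ख, and then the end-of-sequence token, so the recognized text is "कख".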

Step 3: Load the parameters of the ViTAEv2 basic module pre-trained on an image classification task, and use them to initialize the encoder parameters of the text image recognition network;

A cross-entropy loss function is constructed from the text label gi and the prediction probability Pi, and the initialized text image recognition network is trained by back-propagation until the loss converges, yielding the trained text image recognition network;
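The cross-entropy objective can be sketched in NumPy as the mean negative log-likelihood of the labeled character at each decoding position; the probability table and targets are toy values.

```python
import numpy as np

def cross_entropy(pred_probs, target_ids):
    """Mean negative log-probability that the network assigns to the
    labeled character at each decoding position."""
    picked = pred_probs[np.arange(len(target_ids)), target_ids]
    return float(-np.log(picked).mean())

p = np.array([[0.7, 0.2, 0.1],     # position 1: probabilities over 3 classes
              [0.1, 0.8, 0.1]])    # position 2
loss = cross_entropy(p, np.array([0, 1]))  # labels: class 0, then class 1
# loss = -(ln 0.7 + ln 0.8) / 2 ≈ 0.2899
```

As the predicted probabilities of the labeled characters approach 1, the loss approaches 0, which is the convergence criterion used for training.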

Step 4: Use the trained text image recognition network to recognize an arbitrary input Hindi text image, obtaining a prediction probability vector for the character at each position; for each position, the class with the maximum probability in that vector is selected as the text recognition result for the image.

In this embodiment, an electronic device comprises a memory and a processor; the memory stores a program enabling the processor to execute the above Hindi image-text recognition method, and the processor is configured to execute the program stored in the memory.

In this embodiment, a computer-readable storage medium stores a computer program which, when run by a processor, executes the steps of the above Hindi image-text recognition method.

Claims (5)

1. A Hindi image-text recognition method based on an enhanced Vision Transformer network, characterized by comprising the following steps:
Step 1: Construct a Hindi image-text recognition training dataset;
Hindi text images are synthesized from text-free background images and Hindi text content, and each piece of text content serves as the text label of its synthesized image, yielding a text image set X = [x1, x2, …, xi, …, xN] and a corresponding label set G = [g1, g2, …, gi, …, gN], where xi ∈ R^(H×W×3) denotes the i-th Hindi text image, gi denotes its text label, H and W denote the image height and width, and 3 is the number of channels;
Step 2: Construct a text image recognition network comprising an encoder and a decoder, which is used to obtain the prediction probability Pi of the i-th Hindi text image xi;
Step 3: Load the parameters of the ViTAEv2 basic module pre-trained on an image classification task, and use them to initialize the encoder parameters of the text image recognition network;
A cross-entropy loss function is constructed from the text label gi and the prediction probability Pi, and the initialized text image recognition network is trained by back-propagation until the loss converges, yielding the trained text image recognition network;
Step 4: Use the trained text image recognition network to recognize an arbitrary input Hindi text image, obtaining a prediction probability vector for the character at each position; for each position, the class with the maximum probability in that vector is selected as the text recognition result for the image.
2. The Hindi image-text recognition method based on an enhanced vision transformer network according to claim 1, wherein the encoder module in step 2 consists of R stacked ViTAEv2 basic modules, the r-th ViTAEv2 basic encoding module comprising: a main branch, a bypass branch, and a feed-forward encoding layer with a residual connection; the main branch of the r-th ViTAEv2 basic encoding module comprises a multi-scale convolution layer, a layer-normalization layer, and a multi-head attention layer; the bypass branch of the r-th ViTAEv2 basic encoding module comprises a convolution layer, a batch-normalization layer, and a sigmoid-weighted linear unit; when r = 1, the i-th Hindi text image x_i is input into the text-image recognition network, and the multi-scale convolution layer of the main branch of the r-th ViTAEv2 basic encoding module in the encoder performs feature extraction, outputting a multi-channel feature map M_{i,r}, which is reshaped into a feature sequence S_{i,r}; after processing by the multi-head attention layer, the feature sequence O_{i,r} is obtained; the bypass branch of the r-th ViTAEv2 basic encoding module in the encoder processes the i-th Hindi text image x_i to obtain a feature map M′_{i,r}, which is reshaped into a feature sequence S′_{i,r}; adding the corresponding elements of O_{i,r} and S′_{i,r} yields the fused sequence f_{i,r}, which is input into the feed-forward encoding layer of the r-th ViTAEv2 basic encoding module and processed in turn by a layer-normalization operation and two fully connected layers with rectified-linear activation, outputting the r-th visual feature sequence P_{i,r}; when r = 2, 3, …, R, the (r−1)-th visual feature sequence P_{i,r−1} is input into the r-th ViTAEv2 basic encoding module of the encoder for processing, yielding the r-th visual feature sequence P_{i,r}; the R-th visual feature sequence P_{i,R} output by the R-th ViTAEv2 basic encoding module thus serves as the visual feature sequence V_i = {v_{i,1}, v_{i,2}, …, v_{i,n}} output by the encoder module, where v_{i,j} denotes the j-th visual feature vector of the i-th Hindi text image x_i and n denotes the total number of visual feature vectors.
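One basic encoding module of the kind described in claim 2 might be sketched in PyTorch as below. The patent gives no hyper-parameters, so the channel width (dim = 64), head count, dilation rates of the multi-scale convolution, and input size are all illustrative assumptions, not the patented configuration:

```python
import torch
import torch.nn as nn

class BasicEncodingBlock(nn.Module):
    """Sketch of one ViTAEv2-style basic encoding module: a main branch
    (multi-scale conv -> layer norm -> multi-head attention), a bypass
    branch (conv -> batch norm -> SiLU), element-wise fusion of the two
    branch sequences, and a feed-forward layer with residual connection."""

    def __init__(self, in_ch=3, dim=64, heads=4):
        super().__init__()
        # main branch: multi-scale convolution via three dilation rates
        self.multi_scale = nn.ModuleList([
            nn.Conv2d(in_ch, dim // 2, 3, stride=2, padding=d, dilation=d)
            for d in (1, 2, 3)])
        self.reduce = nn.Conv2d(3 * (dim // 2), dim, 1)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # bypass branch: conv -> batch norm -> sigmoid-weighted linear unit
        self.bypass = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim),
            nn.SiLU())
        # feed-forward encoding layer (layer norm + two FC layers with ReLU)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim), nn.ReLU(),
            nn.Linear(4 * dim, dim))

    def forward(self, x):
        # main branch: feature map M -> sequence S -> attention output O
        m = self.reduce(torch.cat([conv(x) for conv in self.multi_scale], 1))
        s = m.flatten(2).transpose(1, 2)        # (B, H*W, dim)
        o, _ = self.attn(self.norm(s), self.norm(s), self.norm(s))
        # bypass branch: feature map M' -> sequence S'
        s2 = self.bypass(x).flatten(2).transpose(1, 2)
        f = o + s2                              # element-wise fusion
        return f + self.ffn(f)                  # residual feed-forward

# toy run on a batch of 2 text images of (assumed) size 32x128
x = torch.randn(2, 3, 32, 128)
tokens = BasicEncodingBlock()(x)                # (2, 16*64, 64)
```

Stacking R such blocks (feeding each block's output sequence into the next, reshaped back to a feature map as needed) would give the encoder of claim 2.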
3. The Hindi image-text recognition method based on an enhanced vision transformer network according to claim 2, wherein the decoder in step 2 is formed by stacking a linear classification layer and K identical Transformer basic decoding modules, any k-th Transformer basic decoding module comprising: a masked multi-head self-attention layer, an encoder-decoder multi-head attention layer, and a feed-forward layer with a residual connection; when k = 1, the text label gt_i is embedded to obtain the embedding result seq_i, which is input into the k-th masked multi-head self-attention layer for processing, yielding the k-th self-attention result sequence Q_{i,k}; the visual feature sequence V_i is input into the decoder, and the k-th encoder-decoder multi-head attention layer processes Q_{i,k} and V_i, yielding the k-th cross-modal result sequence C_{i,k}; the k-th feed-forward layer with residual connection processes C_{i,k}, yielding the result sequence D_{i,k} of the k-th Transformer basic decoding module; when k = 2, 3, …, K, the result sequence D_{i,k−1} output by the (k−1)-th basic decoding module is input into the masked multi-head self-attention layer of the k-th Transformer basic decoding module, and the output, together with V_i, is processed by the k-th encoder-decoder multi-head attention layer and the feed-forward layer with residual connection, yielding the result sequence D_{i,k} of the k-th Transformer basic decoding module; the K-th Transformer basic decoding module thus yields the K-th result sequence D_{i,K}; the linear classification layer uses D_{i,K} for prediction and obtains the prediction probability P_i = {p_{i,1}, p_{i,2}, …, p_{i,T}} of the i-th Hindi text image x_i, where p_{i,t} denotes the probability of the character at the t-th position of the i-th Hindi text image x_i and T denotes the maximum decoding length.
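The decoder of claim 3 maps closely onto PyTorch's built-in Transformer decoder, whose layers already combine masked self-attention, encoder-decoder attention, and a residual feed-forward sub-layer. The sketch below makes that correspondence explicit; the vocabulary size, model width, K = 2, and maximum decoding length are illustrative assumptions:

```python
import torch
import torch.nn as nn

class HindiDecoder(nn.Module):
    """Sketch of the claim-3 decoder: K stacked Transformer decoding
    modules (masked multi-head self-attention, encoder-decoder attention,
    feed-forward with residual) followed by a linear classification layer."""

    def __init__(self, vocab=128, dim=64, heads=4, K=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)           # label embedding seq_i
        layer = nn.TransformerDecoderLayer(
            dim, heads, dim_feedforward=4 * dim, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=K)
        self.classify = nn.Linear(dim, vocab)           # linear classification

    def forward(self, labels, visual_seq):
        # causal mask so position t attends only to positions <= t
        T = labels.size(1)
        mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        seq = self.embed(labels)                        # (B, T, dim)
        d = self.decoder(seq, visual_seq, tgt_mask=mask)  # D_{i,K}: (B, T, dim)
        return self.classify(d)                         # per-position logits

# toy run: batch of 2, visual sequence of n = 1024 feature vectors, T = 16
dec = HindiDecoder()
visual = torch.randn(2, 1024, 64)
labels = torch.randint(0, 128, (2, 16))
logits = dec(labels, visual)                            # (2, 16, 128)
```

A softmax over the last dimension of `logits` gives the per-position probability vectors p_{i,t}; at inference, decoding proceeds autoregressively up to the maximum length T.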
4. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that enables the processor to execute the Hindi image-text recognition method according to any one of claims 1-3, and the processor is configured to execute the program stored in the memory. 5. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when run by a processor, executes the steps of the Hindi image-text recognition method according to any one of claims 1-3.
CN202310032987.4A 2023-01-10 2023-01-10 A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network Pending CN115984883A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310032987.4A CN115984883A (en) 2023-01-10 2023-01-10 A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310032987.4A CN115984883A (en) 2023-01-10 2023-01-10 A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network

Publications (1)

Publication Number Publication Date
CN115984883A true CN115984883A (en) 2023-04-18

Family

ID=85962386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310032987.4A Pending CN115984883A (en) 2023-01-10 2023-01-10 A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network

Country Status (1)

Country Link
CN (1) CN115984883A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117994861A (en) * 2024-03-21 2024-05-07 之江实验室 Video action recognition method and device based on multi-mode large model CLIP


Similar Documents

Publication Publication Date Title
CN106980683B (en) Blog text abstract generating method based on deep learning
CN109902293A (en) A text classification method based on local and global mutual attention mechanism
CN109948691A (en) Image description generation method and device based on deep residual network and attention
CN111046668A (en) Named Entity Recognition Method and Device for Multimodal Cultural Relic Data
CN108830287A (en) The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN111079532A (en) Video content description method based on text self-encoder
CN111444367B (en) Image title generation method based on global and local attention mechanism
CN117421591A (en) Multi-modal characterization learning method based on text-guided image block screening
CN111143563A (en) Text classification method based on fusion of BERT, LSTM and CNN
CN112650886B (en) Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN114549850B (en) A multimodal image aesthetic quality assessment method to solve the missing modality problem
CN110516530A (en) An image description method based on non-aligned multi-view feature enhancement
CN112070114A (en) Scene text recognition method and system based on Gaussian constrained attention mechanism network
CN114818721A (en) Event joint extraction model and method combined with sequence labeling
CN114612748B (en) A cross-modal video clip retrieval method based on feature decoupling
CN113128232A (en) Named entity recognition method based on ALBERT and multi-word information embedding
CN114299512B (en) A zero-shot seal character recognition method based on Chinese character root structure
CN114238649B (en) Language model pre-training method with common sense concept enhancement
CN107463928A (en) Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM
CN115147607A (en) Anti-noise zero-sample image classification method based on convex optimization theory
CN118312833A (en) Hierarchical multi-label classification method and system for travel resources
CN118070816A (en) Hybrid expert visual question answering method and system based on strong visual semantics
CN116841893A (en) Improved GPT 2-based automatic generation method and system for Robot Framework test cases
CN115984883A (en) A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network
CN118170836B (en) File knowledge extraction method and device based on structure priori knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination