
CN115984883A - A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network - Google Patents

A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network

Info

Publication number
CN115984883A
CN115984883A
Authority
CN
China
Prior art keywords
text
image
hindi
layer
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310032987.4A
Other languages
Chinese (zh)
Inventor
汪增福
李永瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Institutes of Physical Science of CAS
Original Assignee
Hefei Institutes of Physical Science of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Institutes of Physical Science of CAS filed Critical Hefei Institutes of Physical Science of CAS
Priority to CN202310032987.4A priority Critical patent/CN115984883A/en
Publication of CN115984883A publication Critical patent/CN115984883A/en
Pending legal-status Critical Current

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention discloses a Hindi image-text recognition method based on an enhanced Vision Transformer network, which comprises the following steps: 1, synthesizing Hindi text images and establishing a Hindi image-text recognition training dataset; 2, constructing a Hindi image-text recognition network; 3, calculating a loss function for each input image and training the Hindi image-text recognition network; and 4, performing text recognition on an arbitrary input image to be recognized using the trained Hindi image-text recognition network. After training, the network can recognize the text of Hindi text images that did not appear in the training data, and therefore has high practical value.

Description

A Hindi Image-Text Recognition Method Based on an Enhanced Vision Transformer Network

Technical Field

The present invention relates to the field of Hindi text image recognition, and in particular to a Hindi image-text recognition method based on an enhanced Vision Transformer network.

Background Art

Most current Hindi image-text recognition methods are based on deep learning. Deep learning is a data-driven approach, so the quantity and quality of training data are crucial to model performance. Hindi is a low-resource language, and real text image data for it is scarce. On the model side, most recognition networks designed specifically for Hindi use an encoder based on convolutional neural networks and a decoder based on connectionist temporal classification, a design that limits the performance of Hindi image-text recognition models.

Summary of the Invention

To overcome the shortcomings of the prior art, the present invention proposes a Hindi image-text recognition method based on an enhanced Vision Transformer network, with the aim of performing the Hindi image-text recognition task efficiently and thereby improving recognition accuracy on Hindi text images.

To achieve the above object, the present invention adopts the following technical scheme:

The Hindi image-text recognition method based on an enhanced Vision Transformer network of the present invention comprises the following steps:

Step 1: Construct a Hindi image-text recognition training dataset;

Hindi text images are synthesized from text-free background images and Hindi text content, and each piece of text content serves as the text label of its synthesized image, yielding a text image set X = [x1, x2, …, xi, …, xN] and a corresponding label set G = [g1, g2, …, gi, …, gN], where xi ∈ R^(H×W×3) denotes the i-th Hindi text image, gi denotes its text label, H and W denote the image height and width, and 3 is the number of channels;

Step 2: Construct a text image recognition network comprising an encoder and a decoder, which is used to obtain the prediction probability Pi of the i-th Hindi text image xi;

Step 3: Load the parameters of the ViTAEv2 basic module pre-trained on an image classification task, and use them to initialize the encoder parameters of the text image recognition network;

A cross-entropy loss function is constructed from the text label gi and the prediction probability Pi, and the initialized text image recognition network is trained by back-propagation until the loss converges, yielding the trained text image recognition network;

Step 4: Use the trained text image recognition network to recognize an arbitrary input Hindi text image, obtaining a prediction probability vector for the character at each position; for each position, the class with the maximum probability in that vector is selected as the text recognition result for the image.

The method of the present invention is further characterized in that the encoder module in Step 2 consists of R stacked ViTAEv2 basic modules, wherein the r-th ViTAEv2 basic encoding module comprises a main branch, a bypass branch, and a feed-forward encoding layer with a residual connection;

The main branch of the r-th ViTAEv2 basic encoding module comprises a multi-scale convolution layer, a layer normalization layer, and a multi-head attention layer;

The bypass branch of the r-th ViTAEv2 basic encoding module comprises a convolution layer, a batch normalization layer, and a sigmoid-weighted linear unit;

When r = 1, the i-th Hindi text image xi is input into the text image recognition network; the multi-scale convolution layer of the main branch of the r-th ViTAEv2 basic encoding module extracts features and outputs a multi-channel feature map Mi,r, which is reshaped into a feature sequence Si,r and then processed by the multi-head attention layer to obtain a feature sequence Oi,r;

The bypass branch of the r-th ViTAEv2 basic encoding module processes the i-th Hindi text image xi to obtain a feature map M′i,r, which is reshaped into a feature sequence S′i,r;

The corresponding elements of Oi,r and S′i,r are added to obtain the fused sequence fi,r, which is input into the feed-forward encoding layer of the r-th ViTAEv2 basic encoding module and processed in turn by a layer normalization operation and two fully connected layers with a rectified-linear activation, outputting the r-th visual feature sequence Pi,r;

When r = 2, 3, …, R, the (r-1)-th visual feature sequence Pi,r-1 is input into the r-th ViTAEv2 basic encoding module and processed to obtain the r-th visual feature sequence Pi,r; the R-th visual feature sequence Pi,R output by the R-th ViTAEv2 basic encoding module serves as the visual feature sequence output by the encoder module, whose j-th element is the j-th visual feature vector of the i-th Hindi text image xi; n denotes the total number of visual feature vectors.

The decoder in Step 2 is formed by a linear classification layer and K identical stacked Transformer basic decoding modules; the k-th Transformer basic decoding module comprises a masked multi-head self-attention layer, an encoder-decoder multi-head attention layer, and a feed-forward layer with a residual connection;

When k = 1, the text label gti is embedded to obtain the embedding result seqi, which is input into the k-th masked multi-head self-attention layer to obtain the k-th self-attention result sequence;

The visual feature sequence output by the encoder is input into the decoder, where the k-th encoder-decoder multi-head attention layer processes it together with the k-th self-attention result sequence to obtain the k-th cross-modal result sequence;

The k-th feed-forward layer with residual connection processes the k-th cross-modal result sequence to obtain the output sequence of the k-th Transformer basic decoding module;

When k = 2, 3, …, K, the output sequence of the (k-1)-th basic decoding module is input into the masked multi-head self-attention layer of the k-th Transformer basic decoding module; its output, together with the visual feature sequence, is processed by the k-th encoder-decoder multi-head attention layer and the feed-forward layer with residual connection to obtain the output sequence of the k-th Transformer basic decoding module; the K-th Transformer basic decoding module thus yields the K-th processing result sequence.

The linear classification layer makes predictions from the K-th processing result sequence, yielding the prediction probability Pi of the i-th Hindi text image xi; the t-th element of Pi is the character probability for the t-th position, and T denotes the maximum decoding length.

The present invention also provides an electronic device comprising a memory and a processor, wherein the memory stores a program enabling the processor to execute any of the above Hindi image-text recognition methods, and the processor is configured to execute the program stored in the memory.

The present invention also provides a computer-readable storage medium storing a computer program which, when run by a processor, executes the steps of any of the above Hindi image-text recognition methods.

Compared with the prior art, the present invention has the following beneficial effects:

1. The present invention designs a Hindi image-text recognition network that, after being trained on synthetic Hindi text images, can effectively recognize real Hindi text images in real-world scenes with improved recognition accuracy.

2. The present invention introduces the structure of the enhanced Vision Transformer network into the encoder of the Hindi recognition network, strengthening the encoder's modeling capability when extracting image features and thereby enabling the network to extract features better suited to text recognition.

3. Using the idea of transfer learning, the present invention transfers knowledge learned by part of the enhanced Vision Transformer network's parameters on image datasets from other tasks into Hindi image-text recognition, compensating for the scarcity of real Hindi text image data.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of the Hindi recognition method based on an enhanced Vision Transformer network according to the present invention;

FIG. 2 is a structural diagram of the Hindi image-text recognition network based on an enhanced Vision Transformer network according to the present invention.

DETAILED DESCRIPTION

In this embodiment, a Hindi image-text recognition method based on an enhanced Vision Transformer network, as shown in FIG. 1, comprises the following steps:

Step 1: Construct a Hindi image-text recognition training dataset;

Hindi text images are synthesized from text-free background images and Hindi text content, and each piece of text content serves as the text label of its synthesized image, yielding a text image set X = [x1, x2, …, xi, …, xN] and a corresponding label set G = [g1, g2, …, gi, …, gN], where xi ∈ R^(H×W×3) denotes the i-th Hindi text image, gi denotes its text label, H and W denote the image height and width, and 3 is the number of channels;
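The dataset construction above can be sketched as follows. This is an illustrative assumption, not the patent's actual pipeline: `render_text`, the word list, and the 32×128 image size are hypothetical stand-ins for the real Hindi text renderer and synthesis parameters.

```python
import numpy as np

H, W = 32, 128  # assumed image height and width

def render_text(background, text):
    """Hypothetical stand-in for the synthesis step that draws Hindi
    text onto a text-free background; here it only copies the image."""
    return background.copy()

rng = np.random.default_rng(0)
words = ["नमस्ते", "भारत", "हिंदी"]            # Hindi text content (toy list)
backgrounds = rng.integers(0, 256, size=(len(words), H, W, 3),
                           dtype=np.uint8)   # text-free backgrounds

# X is the text image set, G the corresponding label set:
# each word is rendered into an image and also kept as that image's label.
X = [render_text(bg, w) for bg, w in zip(backgrounds, words)]
G = list(words)
```

In a real pipeline `render_text` would rasterize the Devanagari string with a font library; the pairing of each image with its source string as the label is the part the patent specifies.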

Step 2: As shown in FIG. 2, construct a text image recognition network comprising an encoder and a decoder;

The encoder extracts and processes the information in the image that helps identify the characters it contains. The decoder takes the image features produced by the encoder as input and decodes the characters in the image one by one in an autoregressive manner.

Step 2.1: The encoder module consists of R stacked ViTAEv2 basic modules. Combining multi-head self-attention and convolution lets the ViTAEv2 basic module exploit both the strengths of convolution and the Transformer network's ability to model long-range dependencies. The r-th ViTAEv2 basic encoding module comprises a main branch, a bypass branch, and a feed-forward encoding layer with a residual connection;

The main branch of the r-th ViTAEv2 basic encoding module comprises a multi-scale convolution layer, a layer normalization layer, and a multi-head attention layer. The multi-head attention operation models long-range dependencies in the image and takes a sequence of vectors as input; if the preceding layer outputs a multi-channel feature map, it must first be reshaped into a sequence of vectors;

The bypass branch of the r-th ViTAEv2 basic encoding module comprises a convolution layer, a batch normalization layer, and a sigmoid-weighted linear unit. If the input from the previous layer is a sequence of vectors, the bypass branch first reshapes it into a multi-channel feature map before processing;
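The two shape conversions described above can be sketched in NumPy. This is a minimal sketch under assumed conventions (channel-first maps, row-major flattening); the helper names are not from the patent.

```python
import numpy as np

def map_to_sequence(feature_map):
    """(C, H, W) feature map -> (H*W, C) sequence of vectors,
    as needed before the multi-head attention layer."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).T

def sequence_to_map(seq, h, w):
    """(H*W, C) sequence -> (C, H, W) feature map, the inverse
    reshape needed before the bypass branch's convolution."""
    n, c = seq.shape
    assert n == h * w
    return seq.T.reshape(c, h, w)

m = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # C=2, H=3, W=4
s = map_to_sequence(m)           # shape (12, 2): one vector per pixel
m2 = sequence_to_map(s, 3, 4)    # round-trips back to (2, 3, 4)
```

The two functions are exact inverses, so the main branch and bypass branch can exchange representations without losing information.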

When r = 1, the i-th Hindi text image xi is input into the text image recognition network; the multi-scale convolution layer of the main branch of the r-th ViTAEv2 basic encoding module extracts features and outputs a multi-channel feature map Mi,r, which is reshaped into a feature sequence Si,r and then processed by the multi-head attention layer to obtain a feature sequence Oi,r;

The bypass branch of the r-th ViTAEv2 basic encoding module processes the i-th Hindi text image xi to obtain a feature map M′i,r, which is reshaped into a feature sequence S′i,r;

The corresponding elements of Oi,r and S′i,r are added to obtain the fused sequence fi,r, which is input into the feed-forward encoding layer of the r-th ViTAEv2 basic encoding module and processed in turn by a layer normalization operation and two fully connected layers with a rectified-linear activation, outputting the r-th visual feature sequence Pi,r;
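The fusion and feed-forward step above can be sketched in NumPy; the dimensions (sequence length 12, model width 8, hidden width 16) and the random weights are assumptions for illustration only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector in the sequence to zero mean, unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward_block(fused, w1, b1, w2, b2):
    """LayerNorm -> FC with ReLU -> FC, plus a residual connection."""
    h = layer_norm(fused)
    h = np.maximum(h @ w1 + b1, 0.0)   # first fully connected layer, ReLU
    h = h @ w2 + b2                    # second fully connected layer
    return fused + h                   # residual connection

rng = np.random.default_rng(0)
d, dff = 8, 16
o = rng.normal(size=(12, d))    # main-branch attention output (O_{i,r})
s = rng.normal(size=(12, d))    # bypass-branch sequence (S'_{i,r})
fused = o + s                   # element-wise fusion (f_{i,r})
p = feed_forward_block(fused, rng.normal(size=(d, dff)), np.zeros(dff),
                       rng.normal(size=(dff, d)), np.zeros(d))
```

The residual connection keeps the fused sequence's shape, so stacked ViTAEv2 modules can be chained directly.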

When r = 2, 3, …, R, the (r-1)-th visual feature sequence Pi,r-1 is input into the r-th ViTAEv2 basic encoding module and processed to obtain the r-th visual feature sequence Pi,r; the R-th visual feature sequence Pi,R output by the R-th ViTAEv2 basic encoding module serves as the visual feature sequence output by the encoder module, whose j-th element is the j-th visual feature vector of the i-th Hindi text image xi; n denotes the total number of visual feature vectors;

Step 2.2: The decoder is formed by a linear classification layer and K identical stacked Transformer basic decoding modules; the k-th Transformer basic decoding module comprises a masked multi-head self-attention layer, an encoder-decoder multi-head attention layer, and a feed-forward layer with a residual connection;

The decoder takes the visual feature sequence produced by the encoder as input and decodes the characters in the image one by one in an autoregressive manner. At each step, the characters predicted in previous steps are fed to the decoder as additional input for the current step.
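The autoregressive loop described above can be sketched in pure Python. `next_char_probs` is a hypothetical stand-in for a full decoder forward pass conditioned on the visual features; only the loop structure (feed previous predictions back in, stop at end-of-sequence or the maximum decoding length) reflects the patent's description.

```python
EOS = 0        # end-of-sequence token id (assumed convention)
MAX_LEN = 5    # maximum decoding length T (toy value)

def next_char_probs(prefix):
    """Toy stand-in for the decoder: emits character ids 1, 2, then EOS,
    regardless of the (unused) visual features."""
    table = {0: 1, 1: 2, 2: EOS}
    probs = [0.0] * 4
    probs[table.get(len(prefix) - 1, EOS)] = 1.0
    return probs

def greedy_decode():
    prefix = [EOS]                 # start token fed at the first step
    out = []
    for _ in range(MAX_LEN):
        probs = next_char_probs(prefix)
        c = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if c == EOS:
            break
        out.append(c)
        prefix.append(c)           # previous prediction becomes input
    return out
```

With the toy stand-in, `greedy_decode()` produces the character ids `[1, 2]` and then stops at the EOS token.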

When k = 1, the text label gti is embedded to obtain the embedding result seqi, which is input into the k-th masked multi-head self-attention layer to obtain the k-th self-attention result sequence. When the self-attention weights are computed, a mask is applied before the softmax(·) function to prevent information in the decoder from flowing from right to left, which would leak the labels;
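The masking step described above can be sketched in NumPy: positions above the diagonal of the score matrix are set to negative infinity before the softmax, so position t receives zero weight from any later position. A single-head sketch with zero scores is shown; real multi-head attention applies the same mask per head.

```python
import numpy as np

def causal_masked_softmax(scores):
    """Apply a causal mask before softmax so position t cannot attend
    to positions > t (prevents label leakage during training)."""
    T = scores.shape[-1]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal
    masked = np.where(mask, -np.inf, scores)
    e = np.exp(masked - masked.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

w = causal_masked_softmax(np.zeros((4, 4)))
# row 0 attends only to position 0; row 3 attends uniformly to all 4
```

Because exp(-inf) is exactly 0, masked positions contribute nothing to the weighted sum while each row still normalizes to 1.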

The visual feature sequence is input into the decoder, where the k-th encoder-decoder multi-head attention layer processes it together with the k-th self-attention result sequence to obtain the k-th cross-modal result sequence. Through encoder-decoder attention, the decoder computes an attention weight for every vector in the sequence output by the encoder: if a region of the image is important for decoding the current character, the vector corresponding to that region receives a high attention weight. By weighted summation, multi-head attention suppresses encoder-output features irrelevant to the current character and emphasizes those important to it;

The k-th feed-forward layer with residual connection processes the k-th cross-modal result sequence to obtain the output sequence of the k-th Transformer basic decoding module;

When k = 2, 3, …, K, the output sequence of the (k-1)-th basic decoding module is input into the masked multi-head self-attention layer of the k-th Transformer basic decoding module; its output, together with the visual feature sequence, is processed by the k-th encoder-decoder multi-head attention layer and the feed-forward layer with residual connection to obtain the output sequence of the k-th Transformer basic decoding module; the K-th Transformer basic decoding module thus yields the K-th processing result sequence;

The linear classification layer makes predictions from the K-th processing result sequence, yielding the prediction probability Pi of the i-th Hindi text image xi; the t-th element of Pi is the character probability for the t-th position, and T denotes the maximum decoding length. During training this probability is used to compute the loss value; at test time it is used to select the character predicted at the current step;
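The per-position prediction can be sketched in NumPy. The four-symbol charset and the logit values are toy assumptions standing in for the real character set and the linear classification layer's output.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

charset = ["<eos>", "क", "ख", "ग"]           # toy alphabet (assumption)
logits = np.array([[0.1, 3.0, 0.2, 0.3],     # T x |charset| scores, one row
                   [0.2, 0.1, 2.5, 0.4],     # per decoding position, as the
                   [2.0, 0.3, 0.1, 0.2]])    # linear layer would produce
probs = softmax(logits)                      # prediction probability vectors
pred = [charset[t] for t in probs.argmax(-1)]  # max-probability class per slot
text = "".join(c for c in pred if c != "<eos>")
```

Here the argmax picks क, ख, and then the end-of-sequence token, so the recognized text is "कख".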

Step 3: Load the parameters of the ViTAEv2 basic module pre-trained on an image classification task, and use them to initialize the encoder parameters of the text image recognition network;

A cross-entropy loss function is constructed from the text label gi and the prediction probability Pi, and the initialized text image recognition network is trained by back-propagation until the loss converges, yielding the trained text image recognition network;
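The cross-entropy objective can be sketched in NumPy as the mean negative log-likelihood of the labeled character at each decoding position; the probability table and targets are toy values.

```python
import numpy as np

def cross_entropy(pred_probs, target_ids):
    """Mean negative log-probability that the network assigns to the
    labeled character at each decoding position."""
    picked = pred_probs[np.arange(len(target_ids)), target_ids]
    return float(-np.log(picked).mean())

p = np.array([[0.7, 0.2, 0.1],     # position 1: probabilities over 3 classes
              [0.1, 0.8, 0.1]])    # position 2
loss = cross_entropy(p, np.array([0, 1]))  # labels: class 0, then class 1
# loss = -(ln 0.7 + ln 0.8) / 2 ≈ 0.2899
```

As the predicted probabilities of the labeled characters approach 1, the loss approaches 0, which is the convergence criterion used for training.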

Step 4: Use the trained text image recognition network to recognize an arbitrary input Hindi text image, obtaining a prediction probability vector for the character at each position; for each position, the class with the maximum probability in that vector is selected as the text recognition result for the image.

In this embodiment, an electronic device comprises a memory and a processor; the memory stores a program enabling the processor to execute the above Hindi image-text recognition method, and the processor is configured to execute the program stored in the memory.

In this embodiment, a computer-readable storage medium stores a computer program which, when run by a processor, executes the steps of the above Hindi image-text recognition method.

Claims (5)

1. A Hindi image-text recognition method based on an enhanced Vision Transformer network, characterized by comprising the following steps:
Step 1: Construct a Hindi image-text recognition training dataset;
Hindi text images are synthesized from text-free background images and Hindi text content, and each piece of text content serves as the text label of its synthesized image, yielding a text image set X = [x1, x2, …, xi, …, xN] and a corresponding label set G = [g1, g2, …, gi, …, gN], where xi ∈ R^(H×W×3) denotes the i-th Hindi text image, gi denotes its text label, H and W denote the image height and width, and 3 is the number of channels;
Step 2: Construct a text image recognition network comprising an encoder and a decoder, which is used to obtain the prediction probability Pi of the i-th Hindi text image xi;
Step 3: Load the parameters of the ViTAEv2 basic module pre-trained on an image classification task, and use them to initialize the encoder parameters of the text image recognition network;
A cross-entropy loss function is constructed from the text label gi and the prediction probability Pi, and the initialized text image recognition network is trained by back-propagation until the loss converges, yielding the trained text image recognition network;
Step 4: Use the trained text image recognition network to recognize an arbitrary input Hindi text image, obtaining a prediction probability vector for the character at each position; for each position, the class with the maximum probability in that vector is selected as the text recognition result for the image.
2. The Hindi image-text recognition method based on an enhanced vision transformer network according to claim 1, wherein the encoder module in step 2 consists of R stacked ViTAEv2 basic modules, the r-th ViTAEv2 basic encoding module comprising: a main branch, a bypass branch, and a feed-forward encoding layer with a residual connection; the main branch of the r-th ViTAEv2 basic encoding module comprises a multi-scale convolution layer, a layer-normalization layer, and a multi-head attention layer; the bypass branch of the r-th ViTAEv2 basic encoding module comprises a convolution layer, a batch-normalization layer, and a sigmoid-weighted linear unit; when r = 1, the i-th Hindi text image x_i is input into the text-image recognition network, and the multi-scale convolution layer of the main branch of the r-th ViTAEv2 basic encoding module in the encoder performs feature extraction, outputting a multi-channel feature map M_{i,r}, which is reshaped into a feature sequence S_{i,r}; after processing by the multi-head attention layer, the feature sequence O_{i,r} is obtained; the bypass branch of the r-th ViTAEv2 basic encoding module in the encoder processes the i-th Hindi text image x_i to obtain a feature map M′_{i,r}, which is reshaped into a feature sequence S′_{i,r}; adding the corresponding elements of O_{i,r} and S′_{i,r} yields the fused sequence f_{i,r}, which is input into the feed-forward encoding layer of the r-th ViTAEv2 basic encoding module and processed in turn by a layer-normalization operation and two fully connected layers with rectified-linear activation, outputting the r-th visual feature sequence P_{i,r}; when r = 2, 3, …, R, the (r−1)-th visual feature sequence P_{i,r−1} is input into the r-th ViTAEv2 basic encoding module of the encoder for processing, yielding the r-th visual feature sequence P_{i,r}; the R-th visual feature sequence P_{i,R} output by the R-th ViTAEv2 basic encoding module thus serves as the visual feature sequence V_i = {v_{i,1}, v_{i,2}, …, v_{i,n}} output by the encoder module, where v_{i,j} denotes the j-th visual feature vector of the i-th Hindi text image x_i and n denotes the total number of visual feature vectors.
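One basic encoding module of the kind described in claim 2 might be sketched in PyTorch as below. The patent gives no hyper-parameters, so the channel width (dim = 64), head count, dilation rates of the multi-scale convolution, and input size are all illustrative assumptions, not the patented configuration:

```python
import torch
import torch.nn as nn

class BasicEncodingBlock(nn.Module):
    """Sketch of one ViTAEv2-style basic encoding module: a main branch
    (multi-scale conv -> layer norm -> multi-head attention), a bypass
    branch (conv -> batch norm -> SiLU), element-wise fusion of the two
    branch sequences, and a feed-forward layer with residual connection."""

    def __init__(self, in_ch=3, dim=64, heads=4):
        super().__init__()
        # main branch: multi-scale convolution via three dilation rates
        self.multi_scale = nn.ModuleList([
            nn.Conv2d(in_ch, dim // 2, 3, stride=2, padding=d, dilation=d)
            for d in (1, 2, 3)])
        self.reduce = nn.Conv2d(3 * (dim // 2), dim, 1)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # bypass branch: conv -> batch norm -> sigmoid-weighted linear unit
        self.bypass = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim),
            nn.SiLU())
        # feed-forward encoding layer (layer norm + two FC layers with ReLU)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim), nn.ReLU(),
            nn.Linear(4 * dim, dim))

    def forward(self, x):
        # main branch: feature map M -> sequence S -> attention output O
        m = self.reduce(torch.cat([conv(x) for conv in self.multi_scale], 1))
        s = m.flatten(2).transpose(1, 2)        # (B, H*W, dim)
        o, _ = self.attn(self.norm(s), self.norm(s), self.norm(s))
        # bypass branch: feature map M' -> sequence S'
        s2 = self.bypass(x).flatten(2).transpose(1, 2)
        f = o + s2                              # element-wise fusion
        return f + self.ffn(f)                  # residual feed-forward

# toy run on a batch of 2 text images of (assumed) size 32x128
x = torch.randn(2, 3, 32, 128)
tokens = BasicEncodingBlock()(x)                # (2, 16*64, 64)
```

Stacking R such blocks (feeding each block's output sequence into the next, reshaped back to a feature map as needed) would give the encoder of claim 2.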
3. The Hindi image-text recognition method based on an enhanced vision transformer network according to claim 2, wherein the decoder in step 2 is formed by stacking a linear classification layer and K identical Transformer basic decoding modules, any k-th Transformer basic decoding module comprising: a masked multi-head self-attention layer, an encoder-decoder multi-head attention layer, and a feed-forward layer with a residual connection; when k = 1, the text label gt_i is embedded to obtain the embedding result seq_i, which is input into the k-th masked multi-head self-attention layer for processing, yielding the k-th self-attention result sequence Q_{i,k}; the visual feature sequence V_i is input into the decoder, and the k-th encoder-decoder multi-head attention layer processes Q_{i,k} and V_i, yielding the k-th cross-modal result sequence C_{i,k}; the k-th feed-forward layer with residual connection processes C_{i,k}, yielding the result sequence D_{i,k} of the k-th Transformer basic decoding module; when k = 2, 3, …, K, the result sequence D_{i,k−1} output by the (k−1)-th basic decoding module is input into the masked multi-head self-attention layer of the k-th Transformer basic decoding module, and the output, together with V_i, is processed by the k-th encoder-decoder multi-head attention layer and the feed-forward layer with residual connection, yielding the result sequence D_{i,k} of the k-th Transformer basic decoding module; the K-th Transformer basic decoding module thus yields the K-th result sequence D_{i,K}; the linear classification layer uses D_{i,K} for prediction and obtains the prediction probability P_i = {p_{i,1}, p_{i,2}, …, p_{i,T}} of the i-th Hindi text image x_i, where p_{i,t} denotes the probability of the character at the t-th position of the i-th Hindi text image x_i and T denotes the maximum decoding length.
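The decoder of claim 3 maps closely onto PyTorch's built-in Transformer decoder, whose layers already combine masked self-attention, encoder-decoder attention, and a residual feed-forward sub-layer. The sketch below makes that correspondence explicit; the vocabulary size, model width, K = 2, and maximum decoding length are illustrative assumptions:

```python
import torch
import torch.nn as nn

class HindiDecoder(nn.Module):
    """Sketch of the claim-3 decoder: K stacked Transformer decoding
    modules (masked multi-head self-attention, encoder-decoder attention,
    feed-forward with residual) followed by a linear classification layer."""

    def __init__(self, vocab=128, dim=64, heads=4, K=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)           # label embedding seq_i
        layer = nn.TransformerDecoderLayer(
            dim, heads, dim_feedforward=4 * dim, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=K)
        self.classify = nn.Linear(dim, vocab)           # linear classification

    def forward(self, labels, visual_seq):
        # causal mask so position t attends only to positions <= t
        T = labels.size(1)
        mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        seq = self.embed(labels)                        # (B, T, dim)
        d = self.decoder(seq, visual_seq, tgt_mask=mask)  # D_{i,K}: (B, T, dim)
        return self.classify(d)                         # per-position logits

# toy run: batch of 2, visual sequence of n = 1024 feature vectors, T = 16
dec = HindiDecoder()
visual = torch.randn(2, 1024, 64)
labels = torch.randint(0, 128, (2, 16))
logits = dec(labels, visual)                            # (2, 16, 128)
```

A softmax over the last dimension of `logits` gives the per-position probability vectors p_{i,t}; at inference, decoding proceeds autoregressively up to the maximum length T.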
4. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that enables the processor to execute the Hindi image-text recognition method according to any one of claims 1-3, and the processor is configured to execute the program stored in the memory. 5. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when run by a processor, executes the steps of the Hindi image-text recognition method according to any one of claims 1-3.
CN202310032987.4A 2023-01-10 2023-01-10 A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network Pending CN115984883A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310032987.4A CN115984883A (en) 2023-01-10 2023-01-10 A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310032987.4A CN115984883A (en) 2023-01-10 2023-01-10 A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network

Publications (1)

Publication Number Publication Date
CN115984883A true CN115984883A (en) 2023-04-18

Family

ID=85962386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310032987.4A Pending CN115984883A (en) 2023-01-10 2023-01-10 A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network

Country Status (1)

Country Link
CN (1) CN115984883A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117994861A (en) * 2024-03-21 2024-05-07 之江实验室 Video action recognition method and device based on multi-mode large model CLIP


Similar Documents

Publication Publication Date Title
CN106980683B (en) Blog text abstract generating method based on deep learning
CN109902293A (en) A text classification method based on local and global mutual attention mechanism
CN109948691A (en) Image description generation method and device based on deep residual network and attention
CN111046668A (en) Named Entity Recognition Method and Device for Multimodal Cultural Relic Data
CN108830287A (en) The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN111079532A (en) Video content description method based on text self-encoder
CN111444367B (en) Image title generation method based on global and local attention mechanism
CN117421591A (en) Multi-modal characterization learning method based on text-guided image block screening
CN111143563A (en) Text classification method based on fusion of BERT, LSTM and CNN
CN112650886B (en) Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN114549850B (en) A multimodal image aesthetic quality assessment method to solve the missing modality problem
CN110516530A (en) An image description method based on non-aligned multi-view feature enhancement
CN112070114A (en) Scene text recognition method and system based on Gaussian constrained attention mechanism network
CN114818721A (en) Event joint extraction model and method combined with sequence labeling
CN114612748B (en) A cross-modal video clip retrieval method based on feature decoupling
CN113128232A (en) Named entity recognition method based on ALBERT and multi-word information embedding
CN114299512B (en) A zero-shot seal character recognition method based on Chinese character root structure
CN114238649B (en) Language model pre-training method with common sense concept enhancement
CN107463928A (en) Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM
CN115147607A (en) Anti-noise zero-sample image classification method based on convex optimization theory
CN118312833A (en) Hierarchical multi-label classification method and system for travel resources
CN118070816A (en) Hybrid expert visual question answering method and system based on strong visual semantics
CN116841893A (en) Improved GPT 2-based automatic generation method and system for Robot Framework test cases
CN115984883A (en) A Hindi Image-Text Recognition Method Based on Enhanced Vision Transformer Network
CN118170836B (en) File knowledge extraction method and device based on structure priori knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination