CN114067237A - Video data processing method, device and equipment
- Publication number: CN114067237A
- Application number: CN202111264126.6A
- Authority: CN (China)
- Prior art keywords: text, video, value, detected, text detection
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06N3/048 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; activation functions
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
Abstract
The application provides a video data processing method, device and equipment, which relate to computer technology. The method comprises the following steps: acquiring a video to be detected, wherein the video to be detected comprises a plurality of texts; detecting text in the video to be detected according to a preset text detection model, wherein the text detection model is obtained by training a neural network model according to an attention mechanism and a preset shape perception loss function; and outputting a video containing text detection boxes according to the detected text, wherein the text detection boxes are used for marking the positions of the texts in the video. The method can solve the problem that accuracy and speed cannot both be achieved at the same time in text detection: it greatly improves the speed of text detection while achieving high-accuracy detection, is better suited to practical application, and solves the technical problem of low text detection efficiency.
Description
Technical Field
The present application relates to computer technology, and in particular, to a method, an apparatus, and a device for processing video data.
Background
At present, with the continuous development of computer vision technology, scene text detection (Scene Text Detection) technology is also continuously improving. Scene text detection refers to the task of marking texts in pictures or videos with visible text boxes. It is a basic and key task in the field of computer vision and a key preceding step for post-processing methods such as text recognition, text retrieval, license plate recognition, and text visual question answering. Therefore, texts in videos need to be recognized.
In the prior art, when text in a video is recognized, the text is usually detected with a deep-learning neural network model. Deep-learning-based text detection methods fall into two categories: regression-based text detection methods and segmentation-based text detection methods. Regression-based methods treat the text as the target to be detected and obtain the text detection box by direct regression; segmentation-based methods classify the pixels of an image, identify whether each pixel belongs to text, and obtain the final text detection box with a post-processing method.
However, the regression-based text detection methods in the prior art are limited by the shape of the text detection box and perform poorly on curved text, while the segmentation-based methods must obtain the final text detection box through a post-processing method, which slows down detection. Existing text detection methods therefore either detect poorly or detect slowly, resulting in low text detection efficiency.
Disclosure of Invention
The application provides a video data processing method, a video data processing device and video data processing equipment, which are used for solving the technical problem of low text detection efficiency.
In a first aspect, the present application provides a video data processing method, including:
acquiring a video to be detected, wherein the video to be detected comprises a plurality of texts;
detecting a text in the video to be detected according to a preset text detection model, wherein the text detection model is obtained by training a neural network model according to an attention mechanism and a preset shape perception loss function;
and outputting a video containing a text detection box according to the detected text, wherein the text detection box is used for marking the position of the text in the video.
Further, detecting the text in the video to be detected according to a preset text detection model, including:
detecting the text in the video to be detected by using a preset attention mechanism of a text detection model;
and determining the area of the pixel block where the text is located by using a preset shape-aware loss function.
Further, outputting a video containing a text detection box according to the detected text, comprising:
generating a text detection box with the area equal to that of the pixel block of the text according to the area of the pixel block of the text;
and outputting the video containing the text detection box according to the text detection box.
Further, the video to be detected is a real-time video or an offline video.
Further, the video to be detected is a real-time video; outputting a video containing a text detection box according to the detected text, wherein the video comprises:
outputting a real-time picture containing a text detection box corresponding to each frame of picture along with the playing of each frame of picture by the real-time video so as to obtain a video containing the text detection box; the text detection box is specifically used for marking the position of the text in each frame of picture.
Further, the video to be detected is an offline video; outputting a video containing a text detection box according to the detected text, wherein the video comprises:
outputting a video containing a text detection box according to the text detection box contained in each frame of picture of the offline video; the text detection box is specifically used for marking the position of the text in each frame of picture.
Further, the method further comprises:
acquiring a plurality of image data sets, wherein each image data set comprises a plurality of texts;
setting a shape-aware loss function; the shape-aware loss function comprises a text part loss function, a text core part loss function and a pixel vector part loss function;
and training a neural network model by using the image data set based on the shape perception loss function until the shape perception loss function obtains a minimum value, and obtaining a trained text detection model.
Further, the image dataset comprises horizontal text, oblique text and arbitrarily shaped text.
Further, setting a shape-aware loss function, comprising:
acquiring a first actual image containing a text tag value of a pixel where a text is located, a background tag value of a pixel where a background except the text is located in the image data set, and a first predicted image containing the text tag value and the background tag value, which is detected by the text detection model;
carrying out similarity comparison on the first actual image and a first predicted image to obtain a first similarity value between the text tag value and the background tag value in the first actual image and the text tag value and the background tag value in the first predicted image, and setting the first similarity value as the value of the text part loss function;
according to the first similarity value and the text label value of the pixel where the text is located, adjusting the text label value of the pixel where the text is located corresponding to the attention mechanism;
acquiring a second actual image containing a text tag value of a pixel where a text core is located, a background tag value of a pixel where a background except the text core is located in the image data set, and a second predicted image containing the text tag value and the background tag value detected by the text detection model;
carrying out similarity comparison on the second actual image and a second predicted image to obtain a second similarity value between the text tag value and the background tag value in the second actual image and the text tag value and the background tag value in the second predicted image, and setting the second similarity value as the value of the text core part loss function;
according to the second similar value and the text label value of the pixel where the text core is located, adjusting the text label value of the pixel where the text core is located corresponding to the attention mechanism;
determining a third similarity value according to the number of the texts in the image data set, the number of pixels in a pixel block where each text is located and the average value of the feature vectors of the pixels of each text, and setting the third similarity value as the value of the partial loss function of the pixel vector;
and adjusting the area of each text corresponding to the attention mechanism according to the third similarity value.
Further, training a neural network model by using the image data set based on the shape-aware loss function until the shape-aware loss function obtains a minimum value, to obtain a trained text detection model, including:
training a neural network model using the image dataset based on the shape-aware loss function;
determining the first similarity value, the second similarity value and the third similarity value included by the shape-aware loss function, and determining a weighted sum of the first similarity value, the second similarity value and the third similarity value according to a hyper-parameter corresponding to the second similarity value and a hyper-parameter corresponding to the third similarity value;
and when the weighted sum obtains the minimum value, obtaining a trained text detection model.
In a second aspect, the present application provides a video data processing apparatus comprising:
the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring a video to be detected, and the video to be detected comprises a plurality of texts;
the detection unit is used for detecting the text in the video to be detected according to a preset text detection model, wherein the text detection model is obtained by training a neural network model according to an attention mechanism and a preset shape perception loss function;
and the output unit is used for outputting a video containing a text detection box according to the detected text, wherein the text detection box is used for marking the position of the text in the video.
Further, the detection unit includes:
the detection module is used for detecting the text in the video to be detected by utilizing a preset attention mechanism of a text detection model;
and the first determining module is used for determining the area of the pixel block where the text is located by utilizing a preset shape-aware loss function.
Further, the output unit includes:
the generating module is used for generating a text detection box with the area equal to the area of the pixel block where the text is located;
and the output module is used for outputting the video containing the text detection box according to the text detection box.
Further, the video to be detected is a real-time video or an offline video.
Further, the video to be detected is a real-time video; the output unit is specifically configured to:
outputting a real-time picture containing a text detection box corresponding to each frame of picture along with the playing of each frame of picture by the real-time video so as to obtain a video containing the text detection box; the text detection box is specifically used for marking the position of the text in each frame of picture.
Further, the video to be detected is an offline video; the output unit is specifically configured to:
outputting a video containing a text detection box according to the text detection box contained in each frame of picture of the offline video; the text detection box is specifically used for marking the position of the text in each frame of picture.
Further, the apparatus further comprises:
a second acquisition unit configured to acquire a plurality of sets of image data, the sets of image data including a plurality of texts;
a setting unit for setting a shape-aware loss function; the shape-aware loss function comprises a text part loss function, a text core part loss function and a pixel vector part loss function;
and the training unit is used for training a neural network model by using the image data set based on the shape perception loss function until the shape perception loss function obtains a minimum value, and obtaining a trained text detection model.
Further, the image dataset comprises horizontal text, oblique text and arbitrarily shaped text.
Further, the setting unit includes:
a first obtaining module, configured to obtain a first actual image including a text tag value of a pixel where a text is located, a background tag value of a pixel where a background other than the text is located in the image data set, and a first predicted image including the text tag value and the background tag value detected by the text detection model;
a first setting module, configured to perform similarity comparison between the first actual image and a first predicted image to obtain a first similarity value between the text tag value and the background tag value in the first actual image and the text tag value and the background tag value in the first predicted image, and set the first similarity value as a value of the text portion loss function;
the first adjusting module is used for adjusting the text label value of the pixel where the text is located corresponding to the attention mechanism according to the first similar value and the text label value of the pixel where the text is located;
a second obtaining module, configured to obtain a second actual image that includes a text tag value of a pixel where a text kernel is located, a background tag value of a pixel where a background other than the text kernel is located in the image data set, and a second predicted image that includes the text tag value and the background tag value and is detected by the text detection model;
a second setting module, configured to perform similarity comparison between the second actual image and a second predicted image to obtain a second similarity value between the text tag value and the background tag value in the second actual image and the text tag value and the background tag value in the second predicted image, and set the second similarity value as a value of the text core loss function;
a second adjusting module, configured to adjust a text label value of a pixel where the text core is located corresponding to the attention mechanism according to the second similar value and the text label value of the pixel where the text core is located;
a third setting module, configured to determine a third similarity value according to the number of the texts in the image data set, the number of pixels in a pixel block where each text is located, and an average value of the feature vectors of the pixels of each text, and set the third similarity value as a value of the pixel vector partial loss function;
and the third adjusting module is used for adjusting the area of each text corresponding to the attention mechanism according to the third similarity value.
Further, the training unit includes:
the training module is used for training a neural network model by utilizing the image data set based on the shape-perceived loss function;
a second determining module, configured to determine the first similarity value, the second similarity value, and the third similarity value included in the shape-aware loss function, and determine a weighted sum of the first similarity value, the second similarity value, and the third similarity value according to a hyper-parameter corresponding to the second similarity value and a hyper-parameter corresponding to the third similarity value;
and the third determining module is used for obtaining a trained text detection model when the weighted sum obtains the minimum value.
In a third aspect, the present application provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program operable on the processor, and the processor implements the method of the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon computer-executable instructions for implementing the method of the first aspect when executed by a processor.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect.
The video data processing method, apparatus and device provided by the application acquire a video to be detected, wherein the video to be detected comprises a plurality of texts; detect text in the video to be detected according to a preset text detection model, wherein the text detection model is obtained by training a neural network model according to an attention mechanism and a preset shape perception loss function; and output a video containing text detection boxes according to the detected text, wherein the text detection boxes are used for marking the positions of the texts in the video. In this scheme, because the text detection model is obtained by training the neural network model according to the attention mechanism and the preset shape perception loss function, the text in the video to be detected can be detected with the preset text detection model, the positions of the texts in the video can be marked with text detection boxes, and the corresponding video containing the text detection boxes can be output. Introducing an attention mechanism when training the neural network model reduces the number of parameters and the amount of computation and raises the overall speed of the text detection method; introducing a preset shape perception loss function allows pixels of different texts to be distinguished more accurately and pixels of the same text to be gathered together, so that an optimized text detection model is obtained through training. Using this optimized model to detect the video to be detected solves the problem that accuracy and speed cannot both be achieved at the same time in text detection: high-accuracy text detection is achieved while the speed of text detection is greatly increased, the method is better suited to practical application, and the technical problem of low text detection efficiency is solved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic flowchart of a video data processing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another video data processing method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an attention module provided in the present application;
FIG. 4 is a schematic structural diagram of a feature deepening module and a feature fusion module provided in the present application;
fig. 5 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of another video data processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 8 is a block diagram of an electronic device according to an embodiment of the present application.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure.
In one example, with the continuous development of computer vision technology, scene text detection (Scene Text Detection) technology has also continuously improved. Scene text detection refers to the task of marking texts in pictures or videos with visible text boxes; it is a fundamental and key task in the field of computer vision and a key preceding step for post-processing methods such as text recognition, text retrieval, license plate recognition, and text visual question answering, so texts in videos need to be recognized. In the prior art, when text in a video is recognized, the text is usually detected with a deep-learning neural network model. Deep-learning-based text detection methods fall into two categories: regression-based methods and segmentation-based methods. Regression-based methods treat the text as the target to be detected and obtain the text detection box by direct regression; segmentation-based methods classify the pixels of an image, identify whether each pixel belongs to text, and obtain the final text detection box with a post-processing method. However, the regression-based methods are limited by the shape of the text detection box and perform poorly on curved text, while the segmentation-based methods must obtain the final text detection box through a post-processing method, which slows down detection. Existing text detection methods therefore either detect poorly or detect slowly, resulting in low text detection efficiency.
The application provides a video data processing method, a video data processing device and video data processing equipment, and aims to solve the technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a video data processing method according to an embodiment of the present application, as shown in fig. 1, the method includes:
101. the method comprises the steps of obtaining a video to be detected, wherein the video to be detected comprises a plurality of texts.
For example, the execution subject of this embodiment may be an electronic device, a terminal device, a video data processing apparatus or device, or another apparatus or device capable of executing this embodiment, which is not limited here. In this embodiment, the execution subject is described as an electronic device.
First, a video to be detected needs to be acquired. The video to be detected may be obtained by shooting, read from a memory, acquired from a web page, or received from another device. The video to be detected comprises a plurality of texts.
102. And detecting the text in the video to be detected according to a preset text detection model, wherein the text detection model is obtained by training a neural network model according to an attention mechanism and a preset shape perception loss function.
For example, the electronic device may train the neural network model according to an attention mechanism and a preset shape-aware loss function to obtain a text detection model, then input the video to be detected into the text detection model, and detect the video to be detected through the text detection model.
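As a purely illustrative sketch of this step (the patent does not specify an implementation; the function and parameter names below, such as detect_texts_in_frame, are assumptions), running the trained text detection model on one decoded frame might look like the following.

```python
# Hedged sketch only: the patent does not define an API; names and the assumed
# output shape of the model are illustrative assumptions.
import torch
import cv2

def detect_texts_in_frame(model: torch.nn.Module, frame_bgr, device="cuda"):
    """Run the trained text detection model on one video frame and return a
    per-pixel text probability map (H x W, values in [0, 1])."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        text_map = model(tensor.to(device))       # assumed output: (1, 1, H, W)
    return text_map.squeeze().sigmoid().cpu().numpy()
```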
103. And outputting a video containing a text detection box according to the detected text, wherein the text detection box is used for marking the position of the text in the video.
For example, the electronic device may output a video including a text detection box according to the detected text, where the text detection box is used to mark a position of the text in the video, and the text detection box may be a polygonal box such as a rectangle, a square, or the like.
In the embodiment of the application, the video to be detected is obtained, and the video to be detected comprises a plurality of texts. Text in the video to be detected is detected according to a preset text detection model, wherein the text detection model is obtained by training a neural network model according to an attention mechanism and a preset shape perception loss function. A video containing text detection boxes is output according to the detected text, wherein the text detection boxes are used for marking the positions of the texts in the video. In this scheme, because the text detection model is obtained by training the neural network model according to the attention mechanism and the preset shape perception loss function, the text in the video to be detected can be detected with the preset text detection model, the positions of the texts can be marked with text detection boxes, and the corresponding video containing the text detection boxes can be output. Introducing an attention mechanism when training the neural network model reduces the number of parameters and the amount of computation and raises the overall speed of the text detection method; introducing a preset shape perception loss function allows pixels of different texts to be distinguished more accurately and pixels of the same text to be gathered together, so that an optimized text detection model is obtained through training. Using this optimized model to detect the video to be detected solves the problem that accuracy and speed cannot both be achieved at the same time in text detection: high-accuracy text detection is achieved while the speed of text detection is greatly increased, the method is better suited to practical application, and the technical problem of low text detection efficiency is solved.
Fig. 2 is a schematic flowchart of another video data processing method according to an embodiment of the present application, and as shown in fig. 2, the method includes:
201. a plurality of image datasets is acquired, the image datasets including a plurality of texts therein.
In one example, the image dataset includes horizontal text, oblique text, and arbitrarily shaped text.
Illustratively, the electronic device acquires a plurality of image datasets, which may include videos, images, and the like. The image datasets cover several types of datasets, including ICDAR2013, ICDAR2015, CTW1500, and TotalText. The text detection boxes of the first two are defined as rectangular boxes: the ICDAR2013 dataset focuses on horizontal text, with detection boxes parallel to the image borders, while the ICDAR2015 dataset focuses on oblique text, with detection boxes allowed to tilt. The latter two focus on text of arbitrary shape, whose detection boxes can be arbitrarily shaped. Transverse and longitudinal comparisons on these different types of datasets reflect the applicability of the detection method in various situations.
202. The method comprises the steps of obtaining a first actual image containing a text tag value of a pixel where a text is located and a background tag value of a pixel where a background except the text is located in an image data set, and obtaining a first prediction image containing the text tag value and the background tag value, wherein the first prediction image is detected by a text detection model.
For example, the electronic device may process the image data set, mark a corresponding text label value for a pixel where a text is located, where the text label value is 1, mark a corresponding background label value for a pixel where a background other than the text is located in the image data set, where the background label value is 0, and then obtain a first actual image including the text label value and the background label value; and then detecting a text tag value and a background tag value in the image data set through a text detection model, and further obtaining a first predicted image containing the text tag value and the background tag value.
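The label assignment just described (text pixels assigned label value 1, background pixels assigned label value 0) could, for example, be realised by rasterising the annotated text polygons into a mask. The sketch below is a hypothetical illustration under that assumption, not the patent's implementation; the helper name build_text_label_mask is invented.

```python
# Illustrative sketch: build the "first actual image" (ground-truth text mask)
# from annotated text polygons; function and variable names are assumptions.
import numpy as np
import cv2

def build_text_label_mask(image_hw, text_polygons):
    """Return an H x W mask where pixels inside any text polygon get the
    text label value 1 and all remaining (background) pixels get 0."""
    h, w = image_hw
    mask = np.zeros((h, w), dtype=np.uint8)                        # background label value 0
    for poly in text_polygons:                                     # poly: N x 2 vertex array
        cv2.fillPoly(mask, [np.asarray(poly, dtype=np.int32)], 1)  # text label value 1
    return mask
```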
203. And carrying out similarity comparison on the first actual image and the first prediction image to obtain a first similarity value between the text tag value and the background tag value in the first actual image and the text tag value and the background tag value in the first prediction image, and setting the first similarity value as the value of the text part loss function.
For example, the text part loss function may reflect how well the text detection model has learned the position of the text. The text part loss function may employ a Dice loss function (Dice Loss). The electronic device may perform a similarity comparison between the first actual image and the first predicted image to obtain a first similarity value between the text tag value and the background tag value in the first actual image and those in the first predicted image, and set the first similarity value as the value of the text part loss function. The Dice loss function has the following form:
Loss_text = 1 − (2·Σ_i P_text(i)·G_text(i)) / (Σ_i P_text(i)² + Σ_i G_text(i)²)

wherein P_text(i) is the text label value at pixel i in the first predicted image, G_text(i) is the text label value at pixel i in the first actual image, and Loss_text represents the first similarity value.
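A minimal PyTorch sketch of such a Dice-style comparison between the predicted and the actual text maps is given below; it assumes the conventional Dice formulation, and the tensor shapes and function name are illustrative rather than taken from the patent.

```python
# Sketch of a Dice-style text-part loss, assuming the conventional Dice form;
# the patent's exact formulation may differ in detail.
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """pred, target: (N, H, W) maps of text label values (1 = text, 0 = background)."""
    pred = pred.reshape(pred.size(0), -1)
    target = target.reshape(target.size(0), -1)
    inter = (pred * target).sum(dim=1)
    denom = (pred * pred).sum(dim=1) + (target * target).sum(dim=1) + eps
    return 1.0 - (2.0 * inter + eps) / denom   # per-sample Loss_text
```

The same helper could also serve the text core part loss described below, applied to the kernel maps instead of the full text maps.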
204. And adjusting the text label value of the pixel where the text is located corresponding to the attention mechanism according to the first similar value and the text label value of the pixel where the text is located.
For example, the electronic device may determine accuracy of the text prediction by the text detection model according to the first similarity value, and when the accuracy is higher, may adjust a text label value of a pixel where the text is located corresponding to the attention mechanism according to a text label value of a pixel where the text is located, so that the attention mechanism focuses on the text label value of the pixel where the text is located.
205. And acquiring a second actual image containing a text tag value of a pixel where the text core is located, a background tag value of a pixel where the background except the text core is located in the image data set, and a second predicted image containing the text tag value and the background tag value detected by the text detection model.
For example, the electronic device may process the image data set, determine a text core in advance, mark a corresponding text label value for a pixel where the text core is located, where the text label value is 2, mark a corresponding background label value for a pixel where a background other than the text core is located in the image data set, where the background label value is 3, and then obtain a second actual image including the text label value and the background label value; and then detecting a text tag value and a background tag value in the image data set through a text detection model, and further obtaining a second predicted image containing the text tag value and the background tag value.
206. And carrying out similarity comparison on the second actual image and the second prediction image to obtain a second similarity value between the text tag value and the background tag value in the second actual image and the text tag value and the background tag value in the second prediction image, and setting the second similarity value as the value of the text core part loss function.
Exemplarily, the text core part loss function may reflect how well the text detection model has learned the position of the text core. The text core part loss function may also employ the Dice loss function. The electronic device may perform a similarity comparison between the second actual image and the second predicted image to obtain a second similarity value between the text tag value and the background tag value in the second actual image and those in the second predicted image, and set the second similarity value as the value of the text core part loss function. The Dice loss function has the following form:
Loss_kernel = 1 − (2·Σ_i P_kernel(i)·G_kernel(i)) / (Σ_i P_kernel(i)² + Σ_i G_kernel(i)²)

wherein P_kernel(i) is the text label value of the text core at pixel i in the second predicted image, G_kernel(i) is the text label value of the text core at pixel i in the second actual image, and Loss_kernel represents the second similarity value.
207. And adjusting the text label value of the pixel where the text core is located corresponding to the attention mechanism according to the second similar value and the text label value of the pixel where the text core is located.
Exemplarily, the electronic device may determine accuracy of the text core predicted by the text detection model according to the second similar value, and when the accuracy is higher, may adjust the text label value of the pixel where the text core is located corresponding to the attention mechanism according to the text label value of the pixel where the text core is located, so that the attention mechanism is focused on the text label value of the pixel where the text core is located.
208. And determining a third similarity value according to the number of texts in the image data set, the number of pixels in a pixel block where each text is located and the average value of the feature vectors of the pixels of each text, and setting the third similarity value as the value of the pixel vector partial loss function.
Illustratively, the pixel vector part loss function may reflect how well the text detection model learns to distinguish pixels belonging to different texts. The formula of the pixel vector part loss function is as follows:
Loss_embedding = L_agg + L_dis

wherein L_agg pulls the features of pixels belonging to the same text closer together and L_dis pushes the features of pixels belonging to different texts further apart; N denotes the number of texts contained in the image dataset, T_i denotes the number of pixels in the i-th text, μ_i denotes the mean of the feature vectors of all pixels in the i-th text, η and γ are distance thresholds set in advance as hyper-parameters, and W_scale(i) and W_dist(i,j) are shape-aware correlation coefficients, whose specific calculation formulas are as follows:

where h and w denote the height and width of the image in the image dataset, diag(T_i) denotes the diagonal length of text T_i, and centredist(T_i, T_j) denotes the distance between the centre points of texts T_i and T_j.
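The aggregation/discrimination structure of Loss_embedding described above can be sketched as follows. This is a heavily hedged illustration: the shape-aware coefficients W_scale(i) and W_dist(i,j) are replaced by constant placeholders because their exact expressions are not reproduced in this text, and the function name, signature, and default thresholds are assumptions.

```python
# Heavily hedged sketch of an aggregation + discrimination pixel-vector loss.
# w_scale and w_dist stand in for the shape-aware coefficients W_scale(i) and
# W_dist(i,j); they are fixed to 1.0 here because their exact formulas (based
# on text diagonals, image size and centre distances) are not reproduced above.
import torch

def embedding_loss(features, text_masks, eta=0.5, gamma=3.0):
    """features: (C, H, W) per-pixel feature vectors; text_masks: list of N boolean (H, W) masks."""
    c = features.size(0)
    flat = features.reshape(c, -1)                      # (C, H*W)
    mus, l_agg = [], features.new_tensor(0.0)
    for mask in text_masks:                             # pull same-text pixels together
        pix = flat[:, mask.reshape(-1)]                 # (C, number of pixels in T_i)
        mu = pix.mean(dim=1)
        mus.append(mu)
        w_scale = 1.0                                   # placeholder shape-aware weight
        dist = (pix - mu[:, None]).norm(dim=0)
        l_agg = l_agg + w_scale * torch.log1p(torch.clamp(dist - eta, min=0) ** 2).mean()
    l_dis, n = features.new_tensor(0.0), len(mus)
    for i in range(n):                                  # push different texts apart
        for j in range(n):
            if i != j:
                w_dist = 1.0                            # placeholder shape-aware weight
                d = (mus[i] - mus[j]).norm()
                l_dis = l_dis + w_dist * torch.log1p(torch.clamp(gamma - d, min=0) ** 2)
    if n > 1:
        l_dis = l_dis / (n * (n - 1))
    return l_agg / max(n, 1) + l_dis                    # Loss_embedding = L_agg + L_dis
```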
209. And according to the third similarity value, adjusting the area of each text corresponding to the attention mechanism.
Illustratively, the electronic device can determine an accuracy with which the text detection model predicts the text based on the third similarity value, and when the accuracy is high, the attention mechanism can be adjusted to distinguish different pixels, such that the attention mechanism focuses on distinguishing between belonging to different texts.
210. And training the neural network model by using the image data set based on the shape-aware loss function.
For example, the electronic device may train the neural network model with the four types of datasets in the image data set based on the shape-aware loss function, count accuracy, recall, F1 score, and detection speed as indexes, and compare the best model effect with and without each key module (the feature deepening module and the feature fusion module are replaced by multiple CNN + pooling layers, and the shape-aware loss function is replaced by setting its shape coefficient to 1). During training, the feature-extraction backbone network and the related configuration can be set: the backbone is a residual network ResNet-18, and the related configuration consists of the hyper-parameters and data-loading interfaces of each module of the system. The attention mechanism is used to guide the training of the feature deepening module and the feature fusion module and to strengthen the learning of image text features, and the neural network model is trained with the image dataset based on the shape-aware loss function, where the feature deepening module and the feature fusion module have a CNN + attention pooling structure.
As shown in fig. 3, fig. 3 is a schematic structural diagram of the attention module provided in the present application: a series of processing steps such as vector concatenation, pooling, and multi-layer CNN is applied to the low-level features and the deepened features to obtain deeper features. As shown in fig. 4, fig. 4 is a schematic structural diagram of the feature deepening module and the feature fusion module provided in the present application, comprising the original features, the feature deepening module, the feature fusion module, and the features obtained after fusion. A hierarchical feature map is extracted through the backbone network, and the attention mechanism, the feature deepening module, and the feature fusion module are introduced to improve the quality of the feature map. The core module of the attention mechanism (namely the attention module) is designed according to fig. 3: after the high-level features and the low-level features are combined, an attention coefficient is obtained by passing them through a pooling layer and several CNN layers followed by a sigmoid activation; the attention coefficient is multiplied by the original features, i.e., the useful information in the original features is selected by weighting, and the high-level features are combined with the result to serve as deeper features. The structure in fig. 4 hierarchically combines the attention module of fig. 3 with convolutional layers to obtain the feature deepening module and the feature fusion module applied in the system.
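A hedged PyTorch sketch of such an attention module follows. The description above only specifies combining high- and low-level features, obtaining an attention coefficient through a pooling layer and several CNN layers with a sigmoid activation, and multiplying it with the original features; the channel counts, kernel sizes, and the way the weighted features are merged below are assumptions.

```python
# Illustrative attention-module sketch following the description above:
# combine low- and high-level features, derive an attention coefficient via
# pooling + CNN layers + sigmoid, and reweight the original features.
# Channel and kernel sizes are assumptions, not taken from the patent.
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                   # pooling layer
            nn.Conv2d(2 * channels, channels, 1),      # several CNN layers
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),                              # attention coefficient in (0, 1)
        )

    def forward(self, low_feat, high_feat):
        combined = torch.cat([low_feat, high_feat], dim=1)
        coeff = self.attn(combined)                    # weight useful information
        return low_feat * coeff + high_feat            # deeper / deepened feature
```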
211. And determining a first similarity value, a second similarity value and a third similarity value which are included by the shape-perceived loss function, and determining a weighted sum of the first similarity value, the second similarity value and the third similarity value according to the hyperparameter corresponding to the second similarity value and the hyperparameter corresponding to the third similarity value.
For example, the electronic device may determine a first similarity value, a second similarity value, and a third similarity value included in the shape-aware loss function, and determine a weighted sum of the first similarity value, the second similarity value, and the third similarity value according to a hyper-parameter corresponding to the second similarity value and a hyper-parameter corresponding to the third similarity value, where the formula for calculating the weighted sum is as follows:
Loss = Loss_text + α·Loss_kernel + β·Loss_embedding

wherein α and β are hyper-parameters that balance the individual loss terms: α is the hyper-parameter corresponding to the second similarity value, and β is the hyper-parameter corresponding to the third similarity value.
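Expressed in code, the weighted combination is a one-liner; the default values for α and β below are placeholders, since concrete settings are not disclosed in this text.

```python
# Weighted sum of the three loss terms; alpha and beta are the balancing
# hyper-parameters from the description (their concrete values are assumptions).
def total_loss(loss_text, loss_kernel, loss_embedding, alpha=0.5, beta=0.25):
    return loss_text + alpha * loss_kernel + beta * loss_embedding
```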
212. And when the weighted sum obtains the minimum value, obtaining the trained text detection model.
Illustratively, the test results in terms of accuracy, recall, F1 score, and detection speed, with and without the feature deepening module, the feature fusion module, and the shape-aware loss function, are as follows:
table 1 ICDAR2013 dataset test results
Table 2 ICDAR2015 data set test results
TABLE 3 CTW1500 data set test results
TABLE 4 TotalText data set test results
From the test results of the above four data sets, it can be seen that:
in the aspect of speed, the loss functions of the feature deepening module, the feature fusion module and the shape perception bring great speed improvement for text detection, the detection time of one picture can be shortened by 4-6ms, and the speed improvement is realized by combining the feature deepening module and the feature fusion module, is not linearly superposed and is slightly increased by 4-7 ms.
In terms of metrics, the two modules have little adverse influence, so the method as a whole fluctuates within a relatively stable range; the accuracy stays at a high level of 85-90%, which fully meets the requirements of text detection in real-life scenarios, and using the two modules together brings roughly a 1% improvement in the comprehensive F1 score. Accordingly, when the weighted sum reaches its minimum value, the neural network model has converged, and the trained text detection model is obtained.
213. The method comprises the steps of obtaining a video to be detected, wherein the video to be detected comprises a plurality of texts.
In one example, the video to be detected is a real-time video or an offline video.
Illustratively, the real-time video may be a video input by a real-time camera, the offline video may be a cached video, and the like.
214. And detecting the texts in the video to be detected by using a preset attention mechanism of a text detection model.
For example, since the target data focused by the attention mechanism has been adjusted step by step according to the loss function in the process of training the text detection model, the electronic device may detect the text in the video to be detected by using the preset attention mechanism of the text detection model.
215. And determining the area of a pixel block where the text is located by using a preset shape-aware loss function.
For example, since the font of the text may be large or small, in order to achieve better detection effect, the electronic device may determine the area of the pixel block where the text is located by using a preset shape-aware loss function.
216. Generating a text detection box with the same area according to the area of the pixel block where the text is located; and outputting the video containing the text detection box according to the text detection box.
Step 216 includes two ways:
first mode of step 216: the video to be detected is a real-time video; outputting a real-time picture containing a text detection box corresponding to each frame of picture along with the playing of each frame of picture of the real-time video so as to obtain a video containing the text detection box; the text detection box is specifically used for marking the position of the text in each frame of picture.
Second mode of step 216: the video to be detected is an offline video; outputting a video containing a text detection box according to the text detection box contained in each frame of picture of the offline video; the text detection box is specifically used for marking the position of the text in each frame of picture.
For example, the electronic device may generate a text detection box with an area equal to that of a pixel block where the text is located, and then output a video including the text detection box according to the text detection box, where the video including the text detection box is output in the following two ways:
in the first mode, the video to be detected is a real-time video; when a video is input and played in real time according to devices such as a camera and the like, along with the playing of each frame of picture of the real-time video, the electronic equipment outputs a real-time picture which corresponds to each frame of picture and contains a text detection box so as to obtain the video containing the text detection box, wherein the text detection box is specifically used for marking the position of the text in each frame of picture.
In a second mode, the video to be detected is an offline video; when the offline video is played, the electronic device outputs the video containing the text detection box according to the text detection box contained in each frame of picture of the offline video, wherein the text detection box is specifically used for marking the position of the text in each frame of picture.
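A hedged end-to-end sketch of both output modes is given below, reusing the detect_texts_in_frame helper sketched earlier. Passing a camera index corresponds to the real-time mode and passing a file path to the offline mode; extracting detection boxes via thresholding and contours is an illustrative choice, not necessarily the patent's post-processing.

```python
# Hedged sketch: draw text detection boxes frame by frame for a real-time
# source (camera index) or an offline source (video file path). Box extraction
# via thresholding + contours is an illustrative assumption.
import cv2
import numpy as np

def output_video_with_boxes(model, source, score_thresh=0.5, device="cuda"):
    cap = cv2.VideoCapture(source)                      # 0 for a live camera, or a file path
    while True:
        ok, frame = cap.read()
        if not ok:                                      # offline video finished / stream closed
            break
        text_map = detect_texts_in_frame(model, frame, device)   # see earlier sketch
        text_map = cv2.resize(text_map, (frame.shape[1], frame.shape[0]))
        binary = (text_map > score_thresh).astype(np.uint8)
        contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        for cnt in contours:                            # one detection box per detected text region
            x, y, w, h = cv2.boundingRect(cnt)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow("text detection", frame)             # display the marked frame (or write to file)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```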
In the embodiment of the application, a plurality of image data sets are obtained, wherein each image data set comprises a plurality of texts; the method comprises the steps of obtaining a first actual image containing a text tag value of a pixel where a text is located and a background tag value of a pixel where a background except the text is located in an image data set, and obtaining a first prediction image containing the text tag value and the background tag value, wherein the first prediction image is detected by a text detection model. And carrying out similarity comparison on the first actual image and the first prediction image to obtain a first similarity value between the text tag value and the background tag value in the first actual image and the text tag value and the background tag value in the first prediction image, and setting the first similarity value as the value of the text part loss function. And adjusting the text label value of the pixel where the text is located corresponding to the attention mechanism according to the first similar value and the text label value of the pixel where the text is located. And acquiring a second actual image containing a text tag value of a pixel where the text core is located, a background tag value of a pixel where the background except the text core is located in the image data set, and a second predicted image containing the text tag value and the background tag value detected by the text detection model. And carrying out similarity comparison on the second actual image and the second prediction image to obtain a second similarity value between the text tag value and the background tag value in the second actual image and the text tag value and the background tag value in the second prediction image, and setting the second similarity value as the value of the text core part loss function. And adjusting the text label value of the pixel where the text core is located corresponding to the attention mechanism according to the second similar value and the text label value of the pixel where the text core is located. And determining a third similarity value according to the number of texts in the image data set, the number of pixels in a pixel block where each text is located and the average value of the feature vectors of the pixels of each text, and setting the third similarity value as the value of the pixel vector partial loss function. And according to the third similarity value, adjusting the area of each text corresponding to the attention mechanism. And training the neural network model by using the image data set based on the shape-aware loss function. And determining a first similarity value, a second similarity value and a third similarity value which are included by the shape-perceived loss function, and determining a weighted sum of the first similarity value, the second similarity value and the third similarity value according to the hyperparameter corresponding to the second similarity value and the hyperparameter corresponding to the third similarity value. And when the weighted sum obtains the minimum value, obtaining the trained text detection model. The method comprises the steps of obtaining a video to be detected, wherein the video to be detected comprises a plurality of texts. And detecting the texts in the video to be detected by using a preset attention mechanism of a text detection model. And determining the area of a pixel block where the text is located by using a preset shape-aware loss function. 
A text detection box with an area equal to that of the pixel block where the text is located is generated, and the video containing the text detection boxes is output according to the text detection boxes. In this way, the trained and optimized text detection model is used to detect the video to be detected, which solves the problem that accuracy and speed cannot both be achieved at the same time in text detection: high-accuracy text detection is achieved while the speed of text detection is greatly increased, the method is better suited to practical application, and the technical problem of low text detection efficiency is solved.
Fig. 5 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application, and as shown in fig. 5, the apparatus includes:
the first obtaining unit 51 is configured to obtain a video to be detected, where the video to be detected includes a plurality of texts.
The detecting unit 52 is configured to detect a text in the video to be detected according to a preset text detection model, where the text detection model is obtained by training the neural network model according to an attention mechanism and a preset shape-aware loss function.
And the output unit 53 is configured to output a video including a text detection box according to the detected text, where the text detection box is used to indicate a position of the text in the video.
The apparatus of this embodiment may execute the technical solution in the method, and the specific implementation process and the technical principle are the same, which are not described herein again.
Fig. 6 is a schematic structural diagram of another video data processing apparatus according to an embodiment of the present application, and based on the embodiment shown in fig. 5, as shown in fig. 6, the detecting unit 52 includes:
the detecting module 521 is configured to detect a text in the video to be detected by using a predetermined attention mechanism of the text detection model.
A first determining module 522, configured to determine the area of the pixel block where the text is located by using a preset shape-aware loss function.
In one example, the output unit 53 includes:
the generating module 531 is configured to generate a text detection box with an area equal to that of a pixel block where the text is located.
And an output module 532, configured to output a video including the text detection box according to the text detection box.
In one example, the video to be detected is a real-time video or an offline video.
In one example, the video to be detected is a real-time video; the output unit 53 is specifically configured to:
outputting a real-time picture containing a text detection box corresponding to each frame of picture along with the playing of each frame of picture of the real-time video so as to obtain a video containing the text detection box; the text detection box is specifically used for marking the position of the text in each frame of picture.
In one example, the video to be detected is an offline video; the output unit 53 is specifically configured to:
outputting a video containing a text detection box according to the text detection box contained in each frame of picture of the offline video; the text detection box is specifically used for marking the position of the text in each frame of picture.
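Purely to illustrate the real-time and offline modes described above, the loop below annotates each frame as it is read, reusing the `boxes_from_text_mask` and `draw_boxes` helpers sketched earlier. Here `detect_text_mask` is a hypothetical placeholder standing in for the trained text detection model, and the file names and frame rate are likewise assumptions.

```python
import cv2

def process_video(source, output_path=None, realtime=False, detect_text_mask=None):
    # source: camera index (real-time video) or video file path (offline video).
    cap = cv2.VideoCapture(source)
    writer = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = detect_text_mask(frame)                  # hypothetical model call
        annotated = draw_boxes(frame, boxes_from_text_mask(mask))
        if realtime:
            cv2.imshow("text detection", annotated)     # shown as each frame plays
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
        if output_path is not None:
            if writer is None:
                h, w = annotated.shape[:2]
                fourcc = cv2.VideoWriter_fourcc(*"mp4v")
                writer = cv2.VideoWriter(output_path, fourcc, 25.0, (w, h))
            writer.write(annotated)
    cap.release()
    if writer is not None:
        writer.release()
    cv2.destroyAllWindows()
```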
In one example, the apparatus further comprises:
a second acquiring unit 61, configured to acquire a plurality of image data sets, wherein each image data set includes a plurality of texts.
A setting unit 62, configured to set a shape-aware loss function; the shape-aware loss function includes a text part loss function, a text core part loss function, and a pixel vector part loss function.
And the training unit 63 is configured to train the neural network model by using the image data set based on the shape-aware loss function until the trained text detection model is obtained when the shape-aware loss function obtains the minimum value.
In one example, the image dataset includes horizontal text, oblique text, and arbitrarily shaped text.
In one example, the setting unit 62 includes:
the first obtaining module 621 is configured to obtain a first actual image containing a text tag value of a pixel where a text is located and a background tag value of a pixel where the background other than the text is located in the image data set, and a first predicted image, detected by the text detection model, containing the text tag value and the background tag value.
The first setting module 622 is configured to perform similarity comparison on the first actual image and the first predicted image to obtain a first similarity value between the text tag value and the background tag value in the first actual image and the text tag value and the background tag value in the first predicted image, and set the first similarity value as the value of the text part loss function.
The first adjusting module 623 is configured to adjust a text label value of a pixel where the text is located corresponding to the attention mechanism according to the first similarity value and the text label value of the pixel where the text is located.
A second obtaining module 624, configured to obtain a second actual image containing a text tag value of a pixel where the text core is located and a background tag value of a pixel where the background other than the text core is located in the image data set, and a second predicted image, detected by the text detection model, containing the text tag value and the background tag value.
A second setting module 625, configured to perform similarity comparison on the second actual image and the second predicted image to obtain a second similarity value between the text tag value and the background tag value in the second actual image and the text tag value and the background tag value in the second predicted image, and set the second similarity value as the value of the text core part loss function.
The second adjusting module 626 is configured to adjust the text label value of the pixel where the text core is located corresponding to the attention mechanism according to the second similarity value and the text label value of the pixel where the text core is located.
A third setting module 627, configured to determine a third similarity value according to the number of texts in the image data set, the number of pixels in the pixel block where each text is located, and the average value of the feature vectors of the pixels of each text, and set the third similarity value as the value of the pixel vector part loss function.
And a third adjusting module 628 for adjusting the area of each text corresponding to the attention mechanism according to the third similarity value.
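The third similarity value handled by the third setting module 627 depends on the number of texts, the pixels of each text's pixel block and the mean feature vector of those pixels. The following sketch shows one common way such a pixel vector part loss could be written (per-pixel embeddings pulled toward their instance mean); the tensor shapes, the function name and the margin `delta` are assumptions of this illustration rather than disclosed values.

```python
import torch

def pixel_vector_part_loss(embeddings: torch.Tensor, instance_masks: list, delta: float = 0.5) -> torch.Tensor:
    # embeddings: (C, H, W) per-pixel feature vectors produced by the network.
    # instance_masks: one (H, W) boolean mask per text in the image.
    if not instance_masks:
        return embeddings.new_zeros(())
    per_text = []
    for mask in instance_masks:
        vecs = embeddings[:, mask]                    # (C, n) vectors of this text's pixel block
        mean_vec = vecs.mean(dim=1, keepdim=True)     # average feature vector of the text
        dist = (vecs - mean_vec).norm(dim=0)          # distance of each pixel to that mean
        per_text.append(torch.clamp(dist - delta, min=0.0).pow(2).mean())
    # Averaged over the number of texts in the image.
    return torch.stack(per_text).mean()
```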
In one example, the training unit 63 includes:
a training module 631, configured to train the neural network model with the image data set based on the shape-aware loss function.
The second determining module 632 is configured to determine a first similarity value, a second similarity value, and a third similarity value included in the shape-aware loss function, and determine a weighted sum of the first similarity value, the second similarity value, and the third similarity value according to a hyper-parameter corresponding to the second similarity value and a hyper-parameter corresponding to the third similarity value.
And the third determining module 633 is configured to obtain a trained text detection model when the weighted sum obtains a minimum value.
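To make the weighted sum handled by modules 631 to 633 concrete, here is a minimal training sketch that reuses `text_and_core_losses` and `pixel_vector_part_loss` from the earlier sketches. It assumes a PyTorch model returning text maps, text core maps and per-pixel embeddings, a data loader that yields per-image instance masks, and two illustrative hyper-parameters attached to the second and third similarity values; all names and numeric values are placeholders, not the disclosed configuration.

```python
import torch

def shape_aware_loss(loss_text, loss_core, loss_pixel, lam_core=0.5, lam_pixel=0.25):
    # Weighted sum of the three partial losses; lam_core and lam_pixel stand in for the
    # hyper-parameters attached to the second and third similarity values.
    return loss_text + lam_core * loss_core + lam_pixel * loss_pixel

def train(model, loader, optimizer, epochs=10, ckpt="text_detection_model.pt"):
    best = float("inf")
    for _ in range(epochs):
        for images, actual_text, actual_core, instance_masks in loader:
            pred_text, pred_core, embeddings = model(images)   # assumed model outputs
            l_text, l_core = text_and_core_losses(pred_text, actual_text, pred_core, actual_core)
            l_pixel = torch.stack([
                pixel_vector_part_loss(emb, masks)
                for emb, masks in zip(embeddings, instance_masks)
            ]).mean()
            loss = shape_aware_loss(l_text, l_core, l_pixel)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best:                    # keep the parameters reaching the
                best = loss.item()                    # smallest weighted sum seen so far
                torch.save(model.state_dict(), ckpt)
```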
The apparatus of this embodiment may execute the technical solutions of the method embodiments described above; the specific implementation process and technical principles are the same and are not described here again.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and as shown in fig. 7, the electronic device includes: a memory 71, a processor 72;
the memory 71 stores therein a computer program that is executable on the processor 72.
The processor 72 is configured to perform the methods provided in the embodiments described above.
The electronic device further comprises a receiver 73 and a transmitter 74. The receiver 73 is used for receiving instructions and data transmitted from an external device, and the transmitter 74 is used for transmitting instructions and data to an external device.
Fig. 8 is a block diagram of an electronic device according to an embodiment of the present application; the electronic device may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.
The apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800 and the relative positioning of components such as the display and keypad of the device 800; the sensor assembly 814 may also detect a change in the position of the device 800 or of a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, communications component 816 further includes a Near Field Communications (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Embodiments of the present application also provide a non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method provided by the above embodiments.
An embodiment of the present application further provides a computer program product, where the computer program product includes: a computer program, stored in a readable storage medium, from which at least one processor of the electronic device can read the computer program, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any of the embodiments described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (10)
1. A method of processing video data, comprising:
acquiring a video to be detected, wherein the video to be detected comprises a plurality of texts;
detecting a text in the video to be detected according to a preset text detection model, wherein the text detection model is obtained by training a neural network model according to an attention mechanism and a preset shape-aware loss function;
and outputting a video containing a text detection box according to the detected text, wherein the text detection box is used for marking the position of the text in the video.
2. The method according to claim 1, wherein detecting the text in the video to be detected according to a preset text detection model comprises:
detecting the text in the video to be detected by using a preset attention mechanism of a text detection model;
and determining the area of the pixel block where the text is located by using a preset shape-aware loss function.
3. The method of claim 2, wherein outputting the video including the text detection box based on the detected text comprises:
generating, according to the area of the pixel block where the text is located, a text detection box with an area equal to that of the pixel block;
and outputting the video containing the text detection box according to the text detection box.
4. The method according to claim 1, wherein the video to be detected is a real-time video or an offline video.
5. The method according to claim 4, wherein the video to be detected is a real-time video, and the outputting a video containing a text detection box according to the detected text comprises:
outputting a real-time picture containing a text detection box corresponding to each frame of picture as each frame of picture of the real-time video is played, so as to obtain a video containing the text detection box; the text detection box is specifically used for marking the position of the text in each frame of picture.
6. The method according to claim 4, wherein the video to be detected is an offline video, and the outputting a video containing a text detection box according to the detected text comprises:
outputting a video containing a text detection box according to the text detection box contained in each frame of picture of the offline video; the text detection box is specifically used for marking the position of the text in each frame of picture.
7. The method according to any one of claims 1-6, further comprising:
acquiring a plurality of image data sets, wherein each image data set comprises a plurality of texts;
setting a shape-aware loss function; the shape-aware loss function comprises a text part loss function, a text core part loss function and a pixel vector part loss function;
and training a neural network model by using the image data sets based on the shape-aware loss function until the shape-aware loss function reaches a minimum value, so as to obtain a trained text detection model.
8. The method of claim 7, wherein the image dataset comprises horizontal text, oblique text, and arbitrarily shaped text.
9. The method of claim 7, wherein setting a shape-aware loss function comprises:
acquiring a first actual image containing a text tag value of a pixel where a text is located, a background tag value of a pixel where a background except the text is located in the image data set, and a first predicted image containing the text tag value and the background tag value, which is detected by the text detection model;
carrying out similarity comparison on the first actual image and a first predicted image to obtain a first similarity value between the text tag value and the background tag value in the first actual image and the text tag value and the background tag value in the first predicted image, and setting the first similarity value as the value of the text part loss function;
according to the first similarity value and the text label value of the pixel where the text is located, adjusting the text label value of the pixel where the text is located corresponding to the attention mechanism;
acquiring a second actual image containing a text tag value of a pixel where a text core is located, a background tag value of a pixel where a background except the text core is located in the image data set, and a second predicted image containing the text tag value and the background tag value detected by the text detection model;
carrying out similarity comparison on the second actual image and a second predicted image to obtain a second similarity value between the text tag value and the background tag value in the second actual image and the text tag value and the background tag value in the second predicted image, and setting the second similarity value as the value of the text core part loss function;
according to the second similarity value and the text label value of the pixel where the text core is located, adjusting the text label value of the pixel where the text core is located corresponding to the attention mechanism;
determining a third similarity value according to the number of the texts in the image data set, the number of pixels in a pixel block where each text is located and the average value of the feature vectors of the pixels of each text, and setting the third similarity value as the value of the pixel vector part loss function;
and adjusting the area of each text corresponding to the attention mechanism according to the third similarity value.
10. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory and is executable on the processor, and the processor implements the method of any one of claims 1 to 9 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111264126.6A CN114067237A (en) | 2021-10-28 | 2021-10-28 | Video data processing method, device and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114067237A true CN114067237A (en) | 2022-02-18 |
Family
ID=80235853
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111264126.6A Pending CN114067237A (en) | 2021-10-28 | 2021-10-28 | Video data processing method, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114067237A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110991440A (en) * | 2019-12-11 | 2020-04-10 | 易诚高科(大连)科技有限公司 | Pixel-driven mobile phone operation interface text detection method |
WO2021189889A1 (en) * | 2020-03-26 | 2021-09-30 | 平安科技(深圳)有限公司 | Text detection method and apparatus in scene image, computer device, and storage medium |
CN112232315A (en) * | 2020-12-11 | 2021-01-15 | 北京易真学思教育科技有限公司 | Text box detection method and device, electronic equipment and computer storage medium |
CN112818975A (en) * | 2021-01-27 | 2021-05-18 | 北京金山数字娱乐科技有限公司 | Text detection model training method and device and text detection method and device |
CN113255669A (en) * | 2021-06-28 | 2021-08-13 | 山东大学 | Method and system for detecting text of natural scene with any shape |
Non-Patent Citations (1)
Title |
---|
TIAN, Xuan; WANG, Ziya; WANG, Jianxin: "Food Label Text Detection Based on Semantic Segmentation" (基于语义分割的食品标签文本检测), Transactions of the Chinese Society for Agricultural Machinery (农业机械学报), no. 08, 19 August 2020 (2020-08-19) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |