CN105868758B - Method and device for detecting text area in image and electronic equipment - Google Patents
- Publication number
- CN105868758B (application CN201510030520.1A)
- Authority
- CN
- China
- Prior art keywords
- image
- text
- region
- pixels
- candidate
- Legal status (the legal status is an assumption and is not a legal conclusion)
- Active
Classifications
- Image Analysis (AREA)
Abstract
The application discloses a method and a device for detecting a text region in an image, and an electronic device. The method for detecting a text region in an image comprises the following steps: extracting candidate text line region images from a target image; judging, with a trained deep learning text/non-text classifier, whether each candidate text line region image is a text region, and marking the regions judged to be text regions; and merging the partitions marked as text regions to obtain the text region of the target image. The method supports text region detection for different types of images, different languages and scripts, and different font styles, so the technical scheme is general; it improves adaptability to the diversity of text line regions and resistance to noise interference, ensuring accurate detection results; and it greatly reduces the area the classifier must judge, improving detection speed.
Description
Technical Field
The application relates to the field of image detection, and in particular to a method and a device for detecting a text region in an image, and an electronic device.
Background
Text information in an image is key to understanding the image's content, and text recognition is a basic technology for realizing that understanding. Since text recognition presupposes that the text regions in an image have been located, the text regions must be detected first in order to understand the content of the image.
At present, two text region detection methods are commonly used. The first is based on MSER (Maximally Stable Extremal Regions) and an Adaboost classifier, and is implemented as follows: first, candidate text regions are extracted with MSER; then manually designed text-related features, such as character width variance and the aspect ratio of a candidate text region, are used together with a Metric Learning method to merge candidate text regions into candidate text line regions; finally, an Adaboost classifier filters the candidate text line regions, and the retained text line regions are the detected text regions. This method, however, has low accuracy. The second method uses a CNN (Convolutional Neural Network) model and is implemented as follows: first, positive sample images (containing text) and negative sample images (containing no text) are input into a CNN model to train a text/non-text classifier; then, in the detection stage, the input image is traversed with a sliding window, each window image intercepted by the sliding window is input into the pre-trained text/non-text classifier, and the classifier judges whether it is a positive or negative sample; windows judged positive are the detected text regions. However, to detect characters of different sizes, the sliding window must traverse the input image at multiple scales, generating on the order of hundreds of millions of window images for the classifier to judge, so the method is very time-consuming and slow.
In the prior art, whether an Adaboost classifier or a text/non-text classifier is used to filter the candidate text line regions, the filtering relies on feeding manually designed text features into the classifier. However, since the fonts and styles of characters in images vary widely and have no fixed form, no single feature, nor any fusion of several features, can detect all the varied characters that appear in images. Moreover, manually designed text feature rules require empirical thresholds to filter candidate text line regions, and different types of images may require different thresholds, so such rules cannot be applied universally to text region detection across image types.
In summary, because the prior art lacks generality, accuracy and speed, it cannot be applied to text region detection across different types of images, different languages and scripts, and different font styles; it cannot accurately detect text regions in images; and its detection process is time-consuming.
Disclosure of Invention
The application provides a method and a device for detecting a text region in an image, and an electronic device, aiming to solve the problems of lack of generality, low accuracy and low speed in the prior art.
The application provides a method for detecting a text area in an image, which comprises the following steps:
extracting candidate text line region images from the target image;
Judging whether the candidate text line region image is a text region or not by adopting a trained deep learning text/non-text classifier, and marking the region judged as the text region;
and merging the partitions marked as text areas to obtain the text area of the target image.
Optionally, the deep learning text/non-text classifier uses the Cuda-convnet framework.
Optionally, five hidden layers are arranged on the Cuda-convnet framework.
Optionally, the five hidden layers on the Cuda-convnet framework are, in order from input to output, a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, and a fully-connected layer.
Optionally, the method for judging whether the candidate text line region image is a text region by using the trained deep learning text/non-text classifier and marking the region judged as the text region includes:
Traversing the candidate text line region image by using a sliding window, and intercepting the candidate text line region image corresponding to the sliding window as a window image of the candidate text line region image;
Calculating the probability that each window image traversed is a text region through the deep learning text/non-text classifier;
and if the probability that the window image is the text region exceeds a preset threshold value, marking the region corresponding to the window image as the text region.
Optionally, in the step of marking the region corresponding to the window image as the text region if the probability that the window image is a text region exceeds a predetermined threshold, the threshold is obtained in the following manner:
recording the probability that each traversed window image is a text region;
and calculating, from the per-window probabilities, the average probability that the candidate text line region image is a text region, and taking the average probability as the predetermined threshold, or taking a probability value a predetermined amount above or below the average probability as the predetermined threshold.
Optionally, before the step of judging whether the candidate text line region image is a text region by using the trained deep learning text/non-text classifier, training the deep learning text/non-text classifier includes: providing a text image of n rows by m columns of pixels as a positive sample and a non-text image of n rows by m columns of pixels as a negative sample to the deep learning text/non-text classifier, wherein m and n are fixed integer values.
Optionally, the positive sample is a text image of 24 columns by 24 rows of pixels, and the negative sample is a non-text image of 24 columns by 24 rows of pixels.
Optionally, the extracting a candidate text line region image from the target image specifically includes:
Carrying out binarization processing on the target image to obtain a binary image of the target image;
and performing layout analysis on the binary image to obtain a candidate text line region image of the target image.
Optionally, the binarizing the target image to obtain a binary image of the target image specifically includes:
Receiving the target image;
calculating an edge image of the target image by adopting a Canny algorithm;
calculating a gray level image of the target image by adopting a color space conversion algorithm;
marking each edge pixel in the edge image, together with its 8-neighborhood pixels, as a foreground image pixel or a background image pixel according to their gray values in the gray image;
Marking other pixels except the edge pixel and 8 neighborhood pixels thereof in the edge image as unknown pixels;
marking all pixels in the unknown pixel region as the foreground image pixels or the background image pixels according to the distribution of the foreground image pixels and the background image pixels in the edge of the unknown pixel region;
and binarizing the edge image by taking the pixel points marked as the foreground image pixels in the edge image as foreground pixels to obtain a first binary image of the target image, and binarizing the edge image by taking the pixel points marked as the background image pixels in the edge image as foreground pixels to obtain a second binary image of the target image.
Optionally, the marking of each edge pixel in the edge image and its 8-neighborhood pixels as a foreground image pixel or a background image pixel according to their gray values in the gray image specifically includes:
acquiring the gray value of the selected edge pixel and the 8 neighborhood pixels in the gray image;
calculating the gray average value of the gray values of the selected edge pixels and the 8 adjacent pixels thereof;
And comparing the gray values of the selected edge pixels and the 8 neighborhood pixels with the gray average value in sequence, if the gray value of the compared pixel is smaller than the gray average value, marking the compared pixel as the foreground image pixel, otherwise, marking the compared pixel as the background image pixel.
Optionally, the marking all pixels in the unknown pixel region as the foreground image pixels or the background image pixels according to the distribution of the foreground image pixels and the background image pixels in the edge of the unknown pixel region specifically includes:
counting the number of foreground image pixels and the number of background image pixels in the edge of the unknown pixel region;
comparing the number of foreground image pixels and the number of background image pixels in the unknown pixel region edge;
if the number of the foreground image pixels is larger than that of the background image pixels, all the unknown pixels are marked as the foreground image pixels, and otherwise, the unknown pixels are marked as the background image pixels.
Optionally, the performing layout analysis on the binary image to obtain a candidate text line region image of the target image specifically includes: and performing layout analysis on the first binary image and the second binary image respectively to obtain a first candidate text line region image and a second candidate text line region image of the target image.
Optionally, the method for judging whether the candidate text line region image is a text region by using the trained deep learning text/non-text classifier includes: and respectively judging whether the first candidate text line region image and the second candidate text line region image are text regions by adopting a deep learning text/non-text classifier.
Optionally, the merging the partitions marked as text regions to obtain the text regions of the target image specifically includes:
Merging the partition marked as a text region in the first candidate text line region image into a first text region, and merging the partition marked as a text region in the second candidate text line region image into a second text region;
And combining the first text region and the second text region, and removing the region where the first text region and the second text region are overlapped to obtain the text region of the target image.
Optionally, the performing layout analysis on the binary image to obtain a candidate text line region image of the target image specifically includes:
Receiving the binary image;
performing connected domain analysis on the binary image to obtain a connected domain of the binary image;
Combining the connected domains overlapped in the binary image to obtain a candidate text region image of the target image;
And combining the candidate text region images in the binary image according to the position relation and the characteristic relation between the candidate text region images to obtain the candidate text line region image of the target image.
Optionally, the features of the candidate text region image include an aspect ratio of the candidate text region image and a color of the candidate text region image.
Optionally, after obtaining the candidate text line region image of the target image, outputting coordinates of the candidate text line region image in the target image, where the specific manner is as follows:
Calculating to obtain a circumscribed rectangle of the candidate text line region image;
and obtaining the position coordinates of the circumscribed rectangle in the target image, and taking the position coordinates as the coordinates of the candidate text line region image in the target image.
Optionally, the position coordinates of the circumscribed rectangle in the target image are represented in any one of the following manners:
Coordinate positions of four vertexes of the circumscribed rectangle;
The coordinate position of one vertex of the circumscribed rectangle and the length dimension of the circumscribed rectangle.
Optionally, the obtaining of the text region of the target image specifically includes:
and obtaining the coordinates of the text area in the target image through calculation.
Correspondingly, the present application further provides a device for detecting a text region in an image, comprising:
A candidate text line region image extracting unit for extracting a candidate text line region image from the target image;
A candidate text line region image judgment unit, configured to read the candidate text line region image provided by the candidate text line region image extraction unit, judge whether a region of the candidate text line region image is a text region by using a trained deep learning text/non-text classifier, and mark the region judged as the text region;
a text region obtaining unit, configured to read the partitions marked as text regions provided by the candidate text line region image determining unit, and merge the partitions marked as text regions to obtain the text region of the target image.
Optionally, the deep learning text/non-text classifier uses the Cuda-convnet framework.
Optionally, five hidden layers are arranged on the Cuda-convnet framework.
Optionally, the five hidden layers on the Cuda-convnet framework are, in order from input to output, a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, and a fully-connected layer.
Optionally, the method for judging whether the candidate text line region image is a text region by using the trained deep learning text/non-text classifier and marking the region judged as the text region includes:
traversing the candidate text line region image by using a sliding window, and intercepting the candidate text line region image corresponding to the sliding window as a window image of the candidate text line region image;
Calculating the probability that each window image traversed is a text region through the deep learning text/non-text classifier;
and if the probability that the window image is the text region exceeds a preset threshold value, marking the region corresponding to the window image as the text region.
Optionally, in the step of marking the region corresponding to the window image as the text region if the probability that the window image is a text region exceeds a predetermined threshold, the threshold is obtained in the following manner:
recording the probability that each traversed window image is a text region;
and calculating, from the per-window probabilities, the average probability that the candidate text line region image is a text region, and taking the average probability as the predetermined threshold, or taking a probability value a predetermined amount above or below the average probability as the predetermined threshold.
Optionally, the device further includes a sample providing unit configured to provide, before the judging of whether the candidate text line region image is a text region by the trained deep learning text/non-text classifier, text images of n rows by m columns of pixels as positive samples and non-text images of n rows by m columns of pixels as negative samples to the deep learning text/non-text classifier, wherein m and n are fixed integer values.
Optionally, the positive sample is a text image of 24 columns by 24 rows of pixels, and the negative sample is a non-text image of 24 columns by 24 rows of pixels.
Optionally, the extracting a candidate text line region image from the target image specifically includes:
carrying out binarization processing on the target image to obtain a binary image of the target image;
and performing layout analysis on the binary image to obtain a candidate text line region image of the target image.
Optionally, the binarizing the target image to obtain a binary image of the target image specifically includes:
Receiving the target image;
calculating an edge image of the target image by adopting a Canny algorithm;
Calculating a gray level image of the target image by adopting a color space conversion algorithm;
marking each edge pixel in the edge image, together with its 8-neighborhood pixels, as a foreground image pixel or a background image pixel according to their gray values in the gray image;
marking other pixels except the edge pixel and 8 neighborhood pixels thereof in the edge image as unknown pixels;
Marking all pixels in the unknown pixel region as the foreground image pixels or the background image pixels according to the distribution of the foreground image pixels and the background image pixels in the edge of the unknown pixel region;
And binarizing the edge image by taking the pixel points marked as the foreground image pixels in the edge image as foreground pixels to obtain a first binary image of the target image, and binarizing the edge image by taking the pixel points marked as the background image pixels in the edge image as foreground pixels to obtain a second binary image of the target image.
Optionally, the marking of each edge pixel in the edge image and its 8-neighborhood pixels as a foreground image pixel or a background image pixel according to their gray values in the gray image specifically includes:
Acquiring the gray value of the selected edge pixel and the 8 neighborhood pixels in the gray image;
Calculating the gray average value of the gray values of the selected edge pixels and the 8 adjacent pixels thereof;
and comparing the gray values of the selected edge pixels and the 8 neighborhood pixels with the gray average value in sequence, if the gray value of the compared pixel is smaller than the gray average value, marking the compared pixel as the foreground image pixel, otherwise, marking the compared pixel as the background image pixel.
Optionally, the marking all pixels in the unknown pixel region as the foreground image pixels or the background image pixels according to the distribution of the foreground image pixels and the background image pixels in the edge of the unknown pixel region specifically includes:
counting the number of foreground image pixels and the number of background image pixels in the edge of the unknown pixel region;
Comparing the number of foreground image pixels and the number of background image pixels in the unknown pixel region edge;
If the number of the foreground image pixels is larger than that of the background image pixels, all the unknown pixels are marked as the foreground image pixels, and otherwise, the unknown pixels are marked as the background image pixels.
Optionally, the performing layout analysis on the binary image to obtain a candidate text line region image of the target image specifically includes: and performing layout analysis on the first binary image and the second binary image respectively to obtain a first candidate text line region image and a second candidate text line region image of the target image.
Optionally, the method for judging whether the candidate text line region image is a text region by using the trained deep learning text/non-text classifier includes: and respectively judging whether the first candidate text line region image and the second candidate text line region image are text regions by adopting a deep learning text/non-text classifier.
Optionally, the merging the partitions marked as text regions to obtain the text regions of the target image specifically includes:
merging the partition marked as a text region in the first candidate text line region image into a first text region, and merging the partition marked as a text region in the second candidate text line region image into a second text region;
and combining the first text region and the second text region, and removing the region where the first text region and the second text region are overlapped to obtain the text region of the target image.
Optionally, the performing layout analysis on the binary image to obtain a candidate text line region image of the target image specifically includes:
receiving the binary image;
Performing connected domain analysis on the binary image to obtain a connected domain of the binary image;
combining the connected domains overlapped in the binary image to obtain a candidate text region image of the target image;
And combining the candidate text region images in the binary image according to the position relation and the characteristic relation between the candidate text region images to obtain the candidate text line region image of the target image.
Optionally, the features of the candidate text region image include an aspect ratio of the candidate text region image and a color of the candidate text region image.
Optionally, after obtaining the candidate text line region image of the target image, outputting coordinates of the candidate text line region image in the target image, where the specific manner is as follows:
calculating to obtain a circumscribed rectangle of the candidate text line region image;
and obtaining the position coordinates of the circumscribed rectangle in the target image, and taking the position coordinates as the coordinates of the candidate text line region image in the target image.
Optionally, the position coordinates of the circumscribed rectangle in the target image are represented in any one of the following manners:
coordinate positions of four vertexes of the circumscribed rectangle;
the coordinate position of one vertex of the circumscribed rectangle and the length dimension of the circumscribed rectangle.
Optionally, the obtaining of the text region of the target image specifically includes:
and obtaining the coordinates of the text area in the target image through calculation.
In addition, the present application also provides an electronic device, including:
A display;
A processor;
A memory for storing a text region detection file which, when executed by the processor, extracts candidate text line region images from a target image; judges whether each candidate text line region image is a text region by adopting a trained deep learning text/non-text classifier, and marks the regions judged to be text regions; and merges the partitions marked as text regions to obtain the text region of the target image.
Optionally, the deep learning text/non-text classifier uses the Cuda-convnet framework.
Optionally, five hidden layers are arranged on the Cuda-convnet framework.
Optionally, the five hidden layers on the Cuda-convnet framework are, in order from input to output, a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, and a fully-connected layer.
Optionally, the method for judging whether the candidate text line region image is a text region by using the trained deep learning text/non-text classifier and marking the region judged as the text region includes:
traversing the candidate text line region image by using a sliding window, and intercepting the candidate text line region image corresponding to the sliding window as a window image of the candidate text line region image;
calculating the probability that each window image traversed is a text region through the deep learning text/non-text classifier;
and if the probability that the window image is the text region exceeds a preset threshold value, marking the region corresponding to the window image as the text region.
Optionally, in the step of marking the region corresponding to the window image as the text region if the probability that the window image is the text region exceeds a predetermined threshold, the threshold is obtained in the following manner:
recording the probability that each traversed window image is a text region;
and calculating, from the per-window probabilities, the average probability that the candidate text line region image is a text region, and taking the average probability as the predetermined threshold, or taking a probability value a predetermined amount above or below the average probability as the predetermined threshold.
Optionally, before the step of judging whether the candidate text line region image is a text region by using the trained deep learning text/non-text classifier, training the deep learning text/non-text classifier includes: providing a text image of n rows by m columns of pixels as a positive sample and a non-text image of n rows by m columns of pixels as a negative sample to the deep learning text/non-text classifier, wherein m and n are fixed integer values.
Optionally, the positive sample is a text image of 24 columns by 24 rows of pixels, and the negative sample is a non-text image of 24 columns by 24 rows of pixels.
Compared with the prior art, the present application has the following advantages:
According to the method, the device and the electronic device for detecting a text region in an image provided by the application, candidate text line region images are extracted from the target image; a trained deep learning text/non-text classifier judges whether each candidate text line region image is a text region and marks the regions judged to be text regions; and the partitions marked as text regions are merged to obtain the text region of the target image. Because a trained deep learning text/non-text classifier judges the candidate text line region images, text region detection works for different types of images, different languages and scripts, and different font styles, so the technical scheme is general. Judging the candidate text line region image region by region improves adaptability to the diversity of text line regions and resistance to noise interference, ensuring accurate detection results. Extracting the candidate text line regions before judging whether each region is a text region greatly reduces the area the classifier must judge, improving detection speed.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments described in the present application, and those skilled in the art can obtain other drawings from them.
FIG. 1 is a flowchart of an embodiment of a method for detecting text regions in an image according to the present application;
FIG. 2 is a schematic diagram of an embodiment of an apparatus for detecting text regions in an image according to the present application;
FIG. 3 is a schematic diagram of an embodiment of an electronic device of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The application provides a method and a device for detecting text regions in an image, and an electronic device; specific embodiments of each are provided below.
FIG. 1 is a flowchart illustrating an embodiment of a method for detecting a text region in an image according to the present application. The method comprises the following steps:
Step S101: extracting candidate text line region images from the target image.
Before detecting text regions in an image, an image is selected as the target image to be detected and is input into the text region detection device, which receives the input target image. In the present application, the target image may be of various types, such as a natural scene image, an advertisement image, a merchandise image, a poster image, or a scanned document image. The text in the target image may be in different languages, and may be set in fonts of different styles, such as a conventional printed font or a PS-styled artistic font.
Since text region detection is a basic technology underlying image text recognition, it is often executed in the preprocessing stage of other processing algorithms, and therefore must be efficient enough for real-time processing, which requires improving detection speed. In the present application, to improve detection speed, candidate text line region images are first extracted from the target image. This removes obviously non-text regions from the target image, greatly reduces the area the subsequent classifier must judge, and improves detection speed.
It should be noted that, in this embodiment, the extracting of the candidate text line region image from the target image specifically includes: 1) carrying out binarization processing on the target image to obtain a binary image of the target image; 2) performing layout analysis on the binary image to obtain a candidate text line region image of the target image. Of course, while this embodiment extracts the candidate text line region image by the method described above, other embodiments may implement it by other methods.
1) Binarizing the target image to obtain a binary image can be implemented with an edge-based binarization method. That is, in this embodiment, binarizing the target image to obtain a binary image of the target image may specifically include: receiving the target image; calculating an edge image of the target image with the Canny algorithm; calculating a gray image of the target image with a color space conversion algorithm; marking each edge pixel in the edge image, together with its 8-neighborhood pixels, as a foreground image pixel or a background image pixel according to their gray values in the gray image; marking all other pixels in the edge image as unknown pixels; marking all pixels in each unknown pixel region as foreground image pixels or background image pixels according to the distribution of foreground and background image pixels along the edge of that region; and binarizing the edge image twice: once taking the pixels marked as foreground image pixels as foreground, to obtain a first binary image of the target image, and once taking the pixels marked as background image pixels as foreground, to obtain a second binary image of the target image.
In this embodiment, the marking of each edge pixel in the edge image and its 8-neighborhood pixels as foreground or background image pixels according to their gray values in the gray image may specifically include: acquiring the gray values, in the gray image, of the selected edge pixel and its 8-neighborhood pixels; calculating the average of these gray values; and comparing the gray value of each of these pixels with the average in turn: if the compared pixel's gray value is smaller than the average, marking it as a foreground image pixel, and otherwise as a background image pixel. For example, select an edge pixel P(i, j) from the edge image and look up the gray values, at the corresponding positions in the gray image, of the 9 pixels consisting of P(i, j) and its 8-neighborhood pixels; calculate the average of these 9 gray values, denoted Gy; then compare each of the 9 gray values with Gy in turn: if a pixel's gray value is smaller than Gy, mark it as a foreground image pixel (TEXT), and otherwise as a background image pixel (BACK).
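As a concrete illustration of the marking rule just described, the following sketch labels every Canny edge pixel and its 8-neighborhood by comparing each gray value against the mean Gy of the 3x3 block. The function name, the NumPy formulation, and the numeric label values are assumptions made for illustration; the patent specifies only the rule itself.

```python
import numpy as np

def mark_edge_pixels(edge, gray):
    """Label every edge pixel and its 8-neighbourhood as TEXT or BACK.

    A sketch of the rule above: for each edge pixel P(i, j), average the gray
    values of the 3x3 block centred on it; pixels darker than the mean become
    foreground (TEXT), the rest background (BACK). All other pixels keep the
    UNKNOWN label. Illustrative assumption, not code from the patent.
    """
    TEXT, BACK, UNKNOWN = 1, 0, -1
    h, w = edge.shape
    labels = np.full((h, w), UNKNOWN, dtype=np.int8)
    ys, xs = np.nonzero(edge)                  # coordinates of Canny edge pixels
    for i, j in zip(ys, xs):
        i0, i1 = max(i - 1, 0), min(i + 2, h)
        j0, j1 = max(j - 1, 0), min(j + 2, w)
        block = gray[i0:i1, j0:j1]
        gy = block.mean()                      # gray mean Gy of the 3x3 block
        labels[i0:i1, j0:j1] = np.where(block < gy, TEXT, BACK)
    return labels
```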
In this embodiment, the marking of all pixels in an unknown pixel region as foreground or background image pixels according to the distribution of foreground and background image pixels along the region's edge may specifically include: counting the number of foreground image pixels and background image pixels along the edge of the unknown pixel region; comparing the two counts; and, if there are more foreground image pixels than background image pixels, marking all the unknown pixels as foreground image pixels, and otherwise as background image pixels. Unknown pixels are marked UNKNOWN, so classifying them amounts to counting the BACK and TEXT pixels along the edge of each region marked UNKNOWN in the edge image: if the edge of an UNKNOWN region contains more BACK pixels than TEXT pixels, all pixels in the region are marked BACK, and otherwise TEXT. In this way, an edge image composed of the three labels TEXT, BACK and UNKNOWN becomes an edge image composed of only TEXT and BACK.
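The UNKNOWN-region resolution can be sketched the same way; here connected UNKNOWN regions are found with SciPy and each region takes the majority label found on its border. The use of scipy.ndimage and the helper name are assumptions, not the patent's implementation.

```python
import numpy as np
from scipy import ndimage

def resolve_unknown(labels):
    """Assign each connected UNKNOWN region the majority label (TEXT/BACK)
    found on its border, as described above. Illustrative sketch only."""
    TEXT, BACK, UNKNOWN = 1, 0, -1
    regions, n = ndimage.label(labels == UNKNOWN)
    for r in range(1, n + 1):
        mask = regions == r
        border = ndimage.binary_dilation(mask) & ~mask   # pixels touching the region
        n_text = np.count_nonzero(labels[border] == TEXT)
        n_back = np.count_nonzero(labels[border] == BACK)
        labels[mask] = BACK if n_back > n_text else TEXT
    return labels
```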
In the process of binarizing the target image, an edge image of the target image is first computed with the Canny algorithm (an edge detection algorithm); the edge image is then marked, according to gray values, as a ternary image composed of the three labels TEXT, BACK and UNKNOWN; the UNKNOWN pixels are classified; and finally the edge image is binarized twice, taking TEXT and BACK in turn as the foreground, to obtain the binary images of the target image. This process preserves fine characters in the target image well and removes many non-text regions. Of course, if the target image contains no fine text, or the fine text need not be detected, the binarization can also be implemented by other methods, such as the Niblack algorithm.
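Putting the pieces together, a minimal sketch of the whole edge-based binarization might look as follows, reusing the two helpers above. The OpenCV calls and the Canny thresholds are assumptions; the patent names only the Canny algorithm and a color space conversion.

```python
import cv2
import numpy as np

def binarize_both_polarities(bgr):
    """Sketch of the edge-based binarisation: Canny edges plus a grayscale
    conversion feed the marking routines above, and the final label map is
    thresholded twice, once with TEXT as foreground (dark text on a light
    ground) and once with BACK as foreground (light text on a dark ground).
    mark_edge_pixels/resolve_unknown are the sketches given earlier."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)       # color-space conversion
    edge = cv2.Canny(gray, 100, 200)                   # thresholds are assumptions
    labels = resolve_unknown(mark_edge_pixels(edge, gray))
    first = (labels == 1).astype(np.uint8) * 255       # TEXT as foreground
    second = (labels == 0).astype(np.uint8) * 255      # BACK as foreground
    return first, second
```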
2) Performing layout analysis on the binary image to obtain candidate text line region images of the target image can be implemented by analyzing connected domains and merging text regions based on the positional and feature relationships between them. That is, in this embodiment, the layout analysis specifically includes: receiving the binary image; performing connected domain analysis on the binary image to obtain its connected domains; merging overlapping connected domains in the binary image to obtain candidate text region images of the target image; and merging the candidate text region images according to the positional and feature relationships between them to obtain the candidate text line region images of the target image.
In this embodiment, the features of the candidate text region images include an aspect ratio of the candidate text region images and colors of the candidate text region images.
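A rough sketch of this layout analysis follows, under assumed merge criteria: box overlap for merging connected domains into candidate text regions, and a horizontal-gap plus height-ratio test for chaining regions into lines. The thresholds are illustrative assumptions standing in for the positional and feature relationships the patent leaves open.

```python
import cv2

def candidate_text_lines(binary):
    """Layout-analysis sketch: connected components become candidate boxes,
    overlapping boxes are merged, and horizontally adjacent boxes of similar
    height are chained into candidate text lines. Criteria are assumptions."""
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = [tuple(stats[i, :4]) for i in range(1, n)]  # (x, y, w, h); skip background

    def overlaps(a, b):
        ax, ay, aw, ah = a; bx, by, bw, bh = b
        return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

    def union(a, b):
        ax, ay, aw, ah = a; bx, by, bw, bh = b
        x, y = min(ax, bx), min(ay, by)
        return (x, y, max(ax + aw, bx + bw) - x, max(ay + ah, by + bh) - y)

    merged = True                   # merge overlapping components into regions
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if overlaps(boxes[i], boxes[j]):
                    boxes[i] = union(boxes[i], boxes[j])
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break

    boxes.sort(key=lambda b: b[0])  # chain nearby regions of similar height
    lines = []
    for b in boxes:
        if lines and b[0] - (lines[-1][0] + lines[-1][2]) < b[3] \
                and 0.5 < b[3] / lines[-1][3] < 2.0:
            lines[-1] = union(lines[-1], b)
        else:
            lines.append(b)
    return lines
```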
In this embodiment, after the candidate text line region image of the target image is obtained, the coordinates of the candidate text line region image in the target image are output as follows: calculating the circumscribed rectangle of the candidate text line region image; and obtaining the position coordinates of the circumscribed rectangle in the target image, which serve as the coordinates of the candidate text line region image in the target image. In this embodiment, the position coordinates of the circumscribed rectangle in the target image are represented in either of the following ways: the coordinate positions of the four vertices of the circumscribed rectangle; or the coordinate position of one vertex of the circumscribed rectangle together with the rectangle's dimensions.
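For the coordinate output, both representations mentioned above follow directly from an (x, y, w, h) box; a small sketch, with the helper name an assumption:

```python
def line_coordinates(line_box):
    """Derive the two coordinate representations described above for the
    circumscribed rectangle of a candidate text line: all four vertices, or
    one vertex plus the rectangle's dimensions."""
    x, y, w, h = line_box
    four_vertices = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
    vertex_plus_size = ((x, y), w, h)
    return four_vertices, vertex_plus_size
```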
This layout analysis method supports horizontal text lines, vertical text lines and text lines inclined at small angles, and is simple, stable and fast. Of course, where processing speed is not a constraint, the layout analysis may also use other methods, such as machine learning methods like Metric Learning, which cluster text lines by computing similarities between candidate text regions.
It should be noted that, when binarizing the target image yields two binary images, namely the first binary image and the second binary image, the layout analysis is performed on both. That is, in this embodiment, performing layout analysis on the binary image to obtain candidate text line region images of the target image specifically is: performing layout analysis on the first binary image and the second binary image respectively, to obtain a first candidate text line region image and a second candidate text line region image of the target image.
Step S102: judging whether the candidate text line region image is a text region by using a trained deep learning text/non-text classifier, and marking the regions judged to be text regions.
Step S101 obtains the candidate text line regions of the target image. To complete the detection of text regions in the target image, it must be judged whether each candidate text line region is a text region, so that the non-text regions among the candidates can be further removed.
To give the method provided by the present application generality, that is, applicability to text region detection in different types of images, different languages and scripts, and different font styles, a trained deep learning text/non-text classifier is used to judge whether the candidate text line region image is a text region. Through its multi-layer neural network structure, the deep learning text/non-text classifier automatically learns text-related features from the training samples; by enriching the diversity of text in the positive training images, a classifier adaptable to different forms and fonts can be trained. And because the deep learning text/non-text classifier needs no manually designed text features, its classification precision far exceeds that of other classifiers.
To make the method provided by the application accurate, when the candidate text line region image is judged, whether it is a text region is judged partition by partition, which improves adaptability to the diversity of text line regions and resistance to noise interference, ensuring accurate detection results.
Therefore, after the candidate text line regions of the target image are obtained, this step further adopts a trained deep learning text/non-text classifier to judge, partition by partition, whether the candidate text line region image is a text region, and marks the partitions judged to be text regions.
Regarding the deep learning text/non-text classifier adopted in this step: in this embodiment, the framework it uses may be Cuda-convnet, on which five hidden layers are arranged; from input to output, the five hidden layers are a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, and a fully-connected layer.
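Re-expressed in PyTorch for illustration (the patent itself names the Cuda-convnet framework), the five-hidden-layer topology might look as follows. Channel counts, kernel sizes and the grayscale input are assumptions; the patent fixes only the layer order and the 24x24 sample size.

```python
import torch
import torch.nn as nn

class TextNonTextNet(nn.Module):
    """Sketch of the conv -> pool -> conv -> pool -> fully-connected topology
    described above; hyperparameters are assumptions, not patent values."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),   # first convolution layer
            nn.ReLU(),
            nn.MaxPool2d(2),                              # first pooling layer: 24x24 -> 12x12
            nn.Conv2d(32, 64, kernel_size=5, padding=2),  # second convolution layer
            nn.ReLU(),
            nn.MaxPool2d(2),                              # second pooling layer: 12x12 -> 6x6
        )
        self.classifier = nn.Linear(64 * 6 * 6, 2)        # fully-connected layer: text / non-text

    def forward(self, x):                                 # x: (N, 1, 24, 24)
        x = self.features(x)
        return self.classifier(x.flatten(1))
```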
Regarding the deep learning text/non-text classifier used in this step, it should also be noted that, in this embodiment, before the step of judging whether the candidate text line region image partitions are text regions with the trained deep learning text/non-text classifier, training the deep learning text/non-text classifier includes: providing text images of n rows by m columns of pixels as positive samples and non-text images of n rows by m columns of pixels as negative samples to the deep learning text/non-text classifier, where m and n are fixed integer values. In this embodiment, the positive samples are text images of 24 columns by 24 rows of pixels, and the negative samples are non-text images of 24 columns by 24 rows of pixels.
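A minimal training-loop sketch for such a classifier, assuming a data loader that yields labeled 24x24 samples (label 1 for text, 0 for non-text); the optimizer and hyperparameters are illustrative only, not taken from the patent:

```python
import torch
import torch.nn.functional as F

def train_classifier(model, loader, epochs=10, lr=1e-3):
    """Train on batches of (N, 1, 24, 24) images with 0/1 labels.
    `loader`, the optimiser choice and all hyperparameters are assumptions."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(images), labels)
            loss.backward()
            opt.step()
    return model
```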
Regarding the partition method adopted in this step: it may be implemented with a sliding window. In that case, in this embodiment, judging with the trained deep learning text/non-text classifier whether each partition of the candidate text line region image is a text region, and marking the partitions judged to be text regions, specifically includes: traversing the candidate text line region image with a sliding window, and intercepting the part of the candidate text line region image covered by the sliding window as a window image; calculating, with the deep learning text/non-text classifier, the probability that each traversed window image is a text region; and, if the probability that a window image is a text region exceeds a predetermined threshold, marking the region corresponding to that window image as a text region.
In this embodiment, in the step of marking the region corresponding to the window image as a text region when its probability exceeds the predetermined threshold, the threshold may be obtained as follows: recording the probability that each traversed window image is a text region; calculating, from the per-window probabilities, the average probability that the candidate text line region image is a text region; and taking that average probability as the predetermined threshold, or taking a probability value a predetermined amount above or below the average as the predetermined threshold.
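The sliding-window judgment and the average-probability threshold can be sketched together as follows; the window size and stride are assumptions (the patent fixes neither), and the line image is assumed to be at least 24 pixels in each dimension:

```python
import torch

def mark_text_partitions(model, line_image, win=24, stride=8):
    """Traverse a candidate text line image with a sliding window, score each
    window with the classifier, take the mean window probability as the
    threshold (one of the options described above), and return the windows
    scoring above it. `line_image` is a (H, W) float tensor in [0, 1]."""
    h, w = line_image.shape
    windows, origins = [], []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            windows.append(line_image[y:y + win, x:x + win])
            origins.append((x, y))
    batch = torch.stack(windows).unsqueeze(1)             # (N, 1, 24, 24)
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)[:, 1]  # P(text) per window
    threshold = probs.mean()                              # average-probability threshold
    return [(x, y, win, win) for (x, y), p in zip(origins, probs) if p > threshold]
```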
It should be noted that, when performing layout analysis on the first binary image and the second binary image yields two candidate text line region images, the first candidate text line region image and the second candidate text line region image, this step specifically is, in this embodiment: adopting the deep learning text/non-text classifier to judge, partition by partition, whether the first candidate text line region image and the second candidate text line region image are text regions, respectively.
Step S103: merging the partitions marked as text regions to obtain the text region of the target image.
Step S102 yields the partitions of one or more candidate text line region images marked as text regions. To finally complete the detection of text regions in the target image, this step merges the partitions marked as text regions to obtain the text region of the target image. In sum: judging the candidate text line region images with a trained deep learning text/non-text classifier realizes text region detection for different types of images, different languages and scripts, and different font styles, so the technical scheme is general; judging the candidate text line region image partition by partition improves adaptability to the diversity of text line regions and resistance to noise interference, ensuring accurate detection results; and extracting the candidate text line regions before judging whether each region is a text region greatly reduces the area the classifier must judge, improving detection speed.
In this embodiment, the obtaining of the text region of the target image may specifically be: obtaining the coordinates of the text region in the target image through calculation. That is, the text region detected in the target image can finally be represented in the form of coordinates, so that the detected text region can be read out of the target image.
It should be noted that, when two candidate text line region images, the first and the second, are obtained by performing layout analysis on the first and second binary images respectively, this step specifically is, in this embodiment: merging the partitions marked as text regions in the first candidate text line region image into a first text region, and merging the partitions marked as text regions in the second candidate text line region image into a second text region; then combining the first text region and the second text region and removing the regions where they overlap, to obtain the text region of the target image.
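A sketch of this final merge, with an IoU test standing in for the patent's unspecified overlap criterion:

```python
def merge_text_regions(first_regions, second_regions, iou_thresh=0.5):
    """Union of the text regions from the first and second binary images,
    dropping second-image regions that substantially overlap a first-image
    region so each text area is reported once. Rectangles are (x, y, w, h);
    the IoU criterion and its threshold are assumptions."""
    def iou(a, b):
        ax, ay, aw, ah = a; bx, by, bw, bh = b
        ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        return inter / float(aw * ah + bw * bh - inter) if inter else 0.0

    merged = list(first_regions)
    for r in second_regions:
        if all(iou(r, kept) < iou_thresh for kept in merged):
            merged.append(r)
    return merged
```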
In the foregoing embodiment, a method for detecting a text region in an image is provided; correspondingly, the present application also provides a device for detecting a text region in an image. FIG. 2 is a schematic diagram of an embodiment of the device. Since the device embodiment is substantially similar to the method embodiment, it is described relatively simply; for relevant points, refer to the description of the method embodiment. The device embodiment described below is merely illustrative.
The device for detecting text areas in images of the embodiment comprises:
a candidate text line region image extracting unit 201 for extracting a candidate text line region image from the target image;
A candidate text line region image determining unit 202, configured to read the candidate text line region image provided by the candidate text line region image extracting unit, determine whether a region of the candidate text line region image is a text region by using a trained deep learning text/non-text classifier, and mark the region determined as the text region;
A text region obtaining unit 203, configured to read the partitions marked as text regions provided by the candidate text line region image determining unit, and merge the partitions marked as text regions to obtain the text region of the target image.
Optionally, the deep learning text/non-text classifier uses the Cuda-convnet framework.
Optionally, five hidden layers are arranged on the Cuda-convnet framework.
Optionally, the five hidden layers on the Cuda-convnet framework are, in order from input to output, a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, and a fully-connected layer.
Optionally, the method for judging whether the candidate text line region image is a text region by using the trained deep learning text/non-text classifier and marking the region judged as the text region includes:
Traversing the candidate text line region image by using a sliding window, and intercepting the candidate text line region image corresponding to the sliding window as a window image of the candidate text line region image;
calculating the probability that each window image traversed is a text region through the deep learning text/non-text classifier;
and if the probability that the window image is the text region exceeds a preset threshold value, marking the region corresponding to the window image as the text region.
Optionally, in the step of marking the region corresponding to the window image as the text region if the probability that the window image is the text region exceeds a predetermined threshold, the threshold is obtained in the following manner:
recording the probability that each traversed window image is a text region;
And calculating, from the per-window probabilities, the average probability that the candidate text line region image is a text region, and taking the average probability as the predetermined threshold, or taking a probability value a predetermined amount above or below the average probability as the predetermined threshold.
Optionally, the device further includes a sample providing unit configured to provide, before the judging of whether the candidate text line region image is a text region by the trained deep learning text/non-text classifier, text images of n rows by m columns of pixels as positive samples and non-text images of n rows by m columns of pixels as negative samples to the deep learning text/non-text classifier, wherein m and n are fixed integer values.
Optionally, the positive sample is a text image of 24 columns by 24 rows of pixels, and the negative sample is a non-text image of 24 columns by 24 rows of pixels.
Optionally, the extracting a candidate text line region image from the target image specifically includes:
Carrying out binarization processing on the target image to obtain a binary image of the target image;
and performing layout analysis on the binary image to obtain a candidate text line region image of the target image.
Optionally, the binarizing the target image to obtain a binary image of the target image specifically includes:
receiving the target image;
calculating an edge image of the target image by adopting a Canny algorithm;
Calculating a gray level image of the target image by adopting a color space conversion algorithm;
marking each edge pixel in the edge image, together with its 8-neighborhood pixels, as a foreground image pixel or a background image pixel according to their gray values in the gray image;
marking other pixels except the edge pixel and 8 neighborhood pixels thereof in the edge image as unknown pixels;
marking all pixels in the unknown pixel region as the foreground image pixels or the background image pixels according to the distribution of the foreground image pixels and the background image pixels in the edge of the unknown pixel region;
And binarizing the edge image by taking the pixel points marked as the foreground image pixels in the edge image as foreground pixels to obtain a first binary image of the target image, and binarizing the edge image by taking the pixel points marked as the background image pixels in the edge image as foreground pixels to obtain a second binary image of the target image.
Optionally, the marking of each edge pixel in the edge image and its 8-neighborhood pixels as a foreground image pixel or a background image pixel according to their gray values in the gray image specifically includes:
acquiring the gray value of the selected edge pixel and the 8 neighborhood pixels in the gray image;
calculating the gray average value of the gray values of the selected edge pixels and the 8 adjacent pixels thereof;
and comparing the gray values of the selected edge pixels and the 8 neighborhood pixels with the gray average value in sequence, if the gray value of the compared pixel is smaller than the gray average value, marking the compared pixel as the foreground image pixel, otherwise, marking the compared pixel as the background image pixel.
Optionally, the marking all pixels in the unknown pixel region as the foreground image pixels or the background image pixels according to the distribution of the foreground image pixels and the background image pixels in the edge of the unknown pixel region specifically includes:
Counting the number of foreground image pixels and the number of background image pixels in the edge of the unknown pixel region;
comparing the number of foreground image pixels and the number of background image pixels in the unknown pixel region edge;
if the number of the foreground image pixels is larger than that of the background image pixels, all the unknown pixels are marked as the foreground image pixels, and otherwise, the unknown pixels are marked as the background image pixels.
Optionally, the performing layout analysis on the binary image to obtain a candidate text line region image of the target image specifically includes: performing layout analysis on the first binary image and the second binary image respectively to obtain a first candidate text line region image and a second candidate text line region image of the target image.
Optionally, the judging whether the candidate text line region image is a text region by using the trained deep learning text/non-text classifier specifically includes: respectively judging, by using the deep learning text/non-text classifier, whether the first candidate text line region image and the second candidate text line region image are text regions.
Optionally, the merging the partitions marked as text regions to obtain the text region of the target image specifically includes:
merging the partitions marked as text regions in the first candidate text line region image into a first text region, and merging the partitions marked as text regions in the second candidate text line region image into a second text region;
and combining the first text region and the second text region, removing the overlap between them, to obtain the text region of the target image.
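A minimal sketch of this final merging step, assuming regions are axis-aligned rectangles (x, y, width, height) and that overlapping regions are detected by an intersection-over-union test whose threshold is an assumption:

```python
def merge_text_regions(first_regions, second_regions, iou_threshold=0.5):
    """Combine regions found in the two binary images; where a region from the
    second image overlaps one already kept, drop it as a duplicate."""
    def iou(a, b):
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        union = aw * ah + bw * bh - inter
        return inter / union if union else 0.0

    merged = list(first_regions)
    for region in second_regions:
        if all(iou(region, kept) < iou_threshold for kept in merged):
            merged.append(region)
    return merged
```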
Optionally, the performing layout analysis on the binary image to obtain a candidate text line region image of the target image specifically includes:
receiving the binary image;
performing connected domain analysis on the binary image to obtain a connected domain of the binary image;
merging the overlapping connected domains in the binary image to obtain candidate text region images of the target image;
and combining the candidate text region images in the binary image according to the position relations and feature relations between them to obtain the candidate text line region image of the target image.
Optionally, the features of the candidate text region image include an aspect ratio of the candidate text region image and a color of the candidate text region image.
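The layout analysis described above (connected-domain analysis, merging of overlapping domains, and grouping into text lines) might be sketched as follows. The specific grouping thresholds are assumptions for illustration, since the application names only position and feature relations such as aspect ratio and color; a single left-to-right reading order is also assumed.

```python
import cv2

def layout_analysis(binary):
    """Connected-domain analysis, merging of overlapping domains, and grouping
    into candidate text lines, per the steps above."""
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = [tuple(int(v) for v in stats[i][:4]) for i in range(1, n)]  # (x, y, w, h)

    def union(a, b):
        x, y = min(a[0], b[0]), min(a[1], b[1])
        return (x, y,
                max(a[0] + a[2], b[0] + b[2]) - x,
                max(a[1] + a[3], b[1] + b[3]) - y)

    def overlaps(a, b):
        return (a[0] <= b[0] + b[2] and b[0] <= a[0] + a[2] and
                a[1] <= b[1] + b[3] and b[1] <= a[1] + a[3])

    # Merge overlapping connected domains into candidate text regions.
    merging = True
    while merging:
        merging = False
        out = []
        while boxes:
            a = boxes.pop()
            for i, b in enumerate(boxes):
                if overlaps(a, b):
                    a = union(a, boxes.pop(i))
                    merging = True
                    break
            out.append(a)
        boxes = out

    # Chain horizontally adjacent regions of similar height and vertical
    # position into candidate text lines.
    boxes.sort(key=lambda b: b[0])
    lines = []
    for b in boxes:
        if lines:
            a = lines[-1]
            aligned = abs((a[1] + a[3] / 2) - (b[1] + b[3] / 2)) <= 0.5 * max(a[3], b[3])
            close = b[0] - (a[0] + a[2]) <= 1.5 * max(a[3], b[3])
            similar = a[3] and 0.5 <= b[3] / a[3] <= 2.0
            if aligned and close and similar:
                lines[-1] = union(a, b)
                continue
        lines.append(b)
    return lines
```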
Optionally, after the candidate text line region image of the target image is obtained, the coordinates of the candidate text line region image in the target image are output as follows:
calculating the circumscribed rectangle of the candidate text line region image;
and obtaining the position coordinates of the circumscribed rectangle in the target image, and taking these as the coordinates of the candidate text line region image in the target image.
Optionally, the position coordinates of the circumscribed rectangle in the target image are represented in either of the following ways:
the coordinate positions of the four vertices of the circumscribed rectangle;
or the coordinate position of one vertex of the circumscribed rectangle together with the width and height dimensions of the circumscribed rectangle.
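For illustration, both coordinate representations can be obtained from OpenCV's bounding-rectangle routine; the binary mask input format is an assumption.

```python
import cv2

def line_coordinates(line_mask):
    """Return both representations of the circumscribed rectangle of the
    non-zero pixels in a candidate text line mask."""
    points = cv2.findNonZero(line_mask)          # pixels of the line region
    x, y, w, h = cv2.boundingRect(points)        # one vertex plus dimensions
    vertices = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
    return (x, y, w, h), vertices
```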
Optionally, the obtaining of the text region of the target image specifically includes:
obtaining, through calculation, the coordinates of the text region in the target image.
An embodiment of the present application further provides an electronic device; fig. 3 is a schematic diagram of an embodiment of the electronic device of the present application. The electronic device of this embodiment includes:
A display 301;
a processor 302;
a memory 303, configured to store a text area detection file in an image, where the text area detection file in the image is executed by the processor to extract a candidate text line area image from a target image; judging whether the candidate text line region image is a text region or not by adopting a trained deep learning text/non-text classifier, and marking the region judged as the text region; and merging the partitions marked as text areas to obtain the text area of the target image.
Optionally, the deep learning text/non-text classifier adopts the Cuda-Convnet framework.
Optionally, five hidden layers are arranged on the Cuda-Convnet framework.
Optionally, the five hidden layers on the Cuda-Convnet framework are, in order from input to output, a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, and a fully connected layer.
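For illustration, an equivalent of this five-hidden-layer topology can be sketched in PyTorch (the application names the Cuda-Convnet framework instead; the filter counts, kernel sizes, and activations below are assumptions — only the layer order and the 24x24 input size come from the text):

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """Conv -> pool -> conv -> pool -> fully connected, for 24x24 gray patches."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),   # first convolution layer
            nn.ReLU(),
            nn.MaxPool2d(2),                              # first pooling layer: 24 -> 12
            nn.Conv2d(32, 64, kernel_size=5, padding=2),  # second convolution layer
            nn.ReLU(),
            nn.MaxPool2d(2),                              # second pooling layer: 12 -> 6
        )
        self.classifier = nn.Linear(64 * 6 * 6, 2)        # fully connected: text / non-text

    def forward(self, x):            # x: (batch, 1, 24, 24)
        return self.classifier(self.features(x).flatten(1))
```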
Optionally, the judging whether the candidate text line region image is a text region by using the trained deep learning text/non-text classifier, and marking the region judged to be a text region, specifically includes:
traversing the candidate text line region image with a sliding window, and intercepting the image covered by the sliding window as a window image of the candidate text line region image;
calculating, through the deep learning text/non-text classifier, the probability that each traversed window image is a text region;
and if the probability that a window image is a text region exceeds a predetermined threshold, marking the region corresponding to that window image as a text region.
Optionally, in the step of marking the region corresponding to the window image as a text region when its probability of being a text region exceeds a predetermined threshold, the threshold is obtained as follows:
recording the probability that each traversed window image is a text region;
and calculating, from these per-window probabilities, the average probability that the candidate text line region image is a text region, then taking either that average probability, or a probability value a predetermined amount above or below it, as the predetermined threshold.
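Putting the sliding-window traversal and the average-probability threshold together, a sketch might look as follows; the window stride, the threshold offset, and the reuse of the PyTorch classifier sketched earlier are assumptions.

```python
import numpy as np
import torch

def classify_line(model, line_img, win=24, stride=8, offset=0.0):
    """Score every 24x24 window of a candidate text line image and keep those
    whose text probability exceeds the average-probability threshold.
    Assumes line_img is a gray image at least `win` pixels tall and wide."""
    h, w = line_img.shape
    windows, origins = [], []
    for x in range(0, w - win + 1, stride):
        for y in range(0, h - win + 1, stride):
            windows.append(line_img[y:y + win, x:x + win])
            origins.append((x, y))
    if not windows:
        return []
    batch = torch.from_numpy(np.stack(windows)).float().unsqueeze(1) / 255.0
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)[:, 1].numpy()  # P(text) per window
    threshold = probs.mean() + offset   # average probability, optionally shifted
    return [(xy, float(p)) for xy, p in zip(origins, probs) if p > threshold]
```

Deriving the threshold from the per-line average, rather than fixing it globally, lets the decision adapt to how confident the classifier is across that particular candidate line.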
Optionally, before the step of determining whether the candidate text line region image is a text region by using the trained deep learning text/non-text classifier, training the deep learning text/non-text classifier includes: providing text images of n rows by m columns of pixels as positive samples and non-text images of n rows by m columns of pixels as negative samples to the deep learning text/non-text classifier, wherein m and n are fixed integer values.
Optionally, the positive sample is a text image of 24 rows by 24 columns of pixels, and the negative sample is a non-text image of 24 rows by 24 columns of pixels.
The method, apparatus, and electronic device for detecting a text region in an image provided by the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the application scope. In view of the above, the content of this description should not be construed as limiting the present application.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Claims (46)
1. A method for detecting a text area in an image, characterized by comprising the following steps:
Extracting candidate text line region images from the target image;
Judging whether the candidate text line region image is a text region or not by adopting a trained deep learning text/non-text classifier, and marking the region judged as the text region;
merging the partitions marked as text areas to obtain the text areas of the target image;
The extracting of the candidate text line region image from the target image includes: carrying out binarization processing on the target image to obtain a binary image of the target image; performing layout analysis on the binary image to obtain a candidate text line region image of the target image;
the binarizing the target image to obtain a binary image of the target image includes:
calculating an edge image and a gray image of the target image;
According to the gray value of the edge pixel and the 8 neighborhood pixels in the edge image in the gray image, marking the edge pixel and the 8 neighborhood pixels in the edge image as a foreground image pixel or a background image pixel; marking other pixels except the edge pixel and 8 neighborhood pixels thereof in the edge image as unknown pixels; marking all pixels in the unknown pixel region as the foreground image pixels or the background image pixels according to the distribution of the foreground image pixels and the background image pixels in the edge of the unknown pixel region;
And binarizing the edge image by taking the pixel points marked as the foreground image pixels in the edge image as foreground pixels to obtain a first binary image of the target image, and binarizing the edge image by taking the pixel points marked as the background image pixels in the edge image as foreground pixels to obtain a second binary image of the target image.
2. The method for detecting a text area in an image according to claim 1, wherein the deep learning text/non-text classifier adopts the Cuda-Convnet framework.
3. The method for detecting a text area in an image according to claim 2, wherein five hidden layers are arranged on the Cuda-Convnet framework.
4. The method for detecting a text area in an image according to claim 3, wherein the five hidden layers on the Cuda-Convnet framework are, in order from input to output, a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer and a fully connected layer.
5. The method according to claim 1, wherein the determining whether a sub-region of the candidate text line region image is a text region by using the trained deep learning text/non-text classifier, and marking the sub-region determined to be a text region, specifically comprises:
traversing the candidate text line region image by using a sliding window, and intercepting the candidate text line region image corresponding to the sliding window as a window image of the candidate text line region image;
Calculating the probability that each window image traversed is a text region through the deep learning text/non-text classifier;
and if the probability that the window image is the text region exceeds a preset threshold value, marking the region corresponding to the window image as the text region.
6. The method according to claim 5, wherein in the step of marking the region corresponding to the window image as the text region if the probability that the window image is the text region exceeds a predetermined threshold, the threshold is obtained as follows:
Recording the probability that each traversed window image is a text region;
and calculating, from these per-window probabilities, the average probability that the candidate text line region image is a text region, then taking either that average probability, or a probability value a predetermined amount above or below it, as the predetermined threshold.
7. The method according to any one of claims 1 to 6, wherein before the step of determining whether a sub-region of the candidate text line region image is a text region by using the trained deep learning text/non-text classifier, training the deep learning text/non-text classifier comprises: providing text images of n rows by m columns of pixels as positive samples and non-text images of n rows by m columns of pixels as negative samples to the deep learning text/non-text classifier, wherein m and n are fixed integer values.
8. The method according to claim 7, wherein the positive sample is a text image with 24 columns by 24 rows of pixels, and the negative sample is a non-text image with 24 columns by 24 rows of pixels.
9. the method for detecting a text region in an image according to claim 1, wherein the binarizing the target image to obtain a binary image of the target image further comprises:
receiving the target image;
Calculating an edge image of the target image by adopting a Canny algorithm;
and calculating the gray level image of the target image by adopting a color space conversion algorithm.
10. The method according to claim 1, wherein the labeling of the edge pixels and their 8-neighborhood pixels in the edge image as foreground image pixels or background image pixels according to the gray scale values of the edge pixels and their 8-neighborhood pixels in the edge image in the gray scale image specifically comprises:
Acquiring the gray value of the selected edge pixel and the 8 neighborhood pixels in the gray image;
Calculating the gray average value of the gray values of the selected edge pixels and the 8 adjacent pixels thereof;
And comparing the gray values of the selected edge pixels and the 8 neighborhood pixels with the gray average value in sequence, if the gray value of the compared pixel is smaller than the gray average value, marking the compared pixel as the foreground image pixel, otherwise, marking the compared pixel as the background image pixel.
11. the method according to claim 1, wherein the labeling all pixels in the unknown pixel region as the foreground image pixels or the background image pixels according to the distribution of the foreground image pixels and the background image pixels in the edge of the unknown pixel region specifically comprises:
counting the number of foreground image pixels and the number of background image pixels in the edge of the unknown pixel region;
comparing the number of foreground image pixels and the number of background image pixels in the unknown pixel region edge;
if the number of the foreground image pixels is larger than that of the background image pixels, all the unknown pixels are marked as the foreground image pixels, and otherwise, the unknown pixels are marked as the background image pixels.
12. the method for detecting text regions in images according to claim 1, wherein the performing layout analysis on the binary image to obtain candidate text line region images of the target image specifically comprises: and performing layout analysis on the first binary image and the second binary image respectively to obtain a first candidate text line region image and a second candidate text line region image of the target image.
13. The method according to claim 12, wherein the determining whether a sub-region of the candidate text line region image is a text region by using the trained deep learning text/non-text classifier specifically comprises: respectively judging, by using the deep learning text/non-text classifier, whether the first candidate text line region image and the second candidate text line region image are text regions.
14. the method according to claim 13, wherein the merging the partitions marked as text regions to obtain the text region of the target image comprises:
merging the partition marked as a text region in the first candidate text line region image into a first text region, and merging the partition marked as a text region in the second candidate text line region image into a second text region;
And combining the first text region and the second text region, and removing the region where the first text region and the second text region are overlapped to obtain the text region of the target image.
15. the method for detecting text regions in images according to claim 1, wherein performing layout analysis on the binary image to obtain candidate text line region images of the target image specifically includes:
Receiving the binary image;
Performing connected domain analysis on the binary image to obtain a connected domain of the binary image;
Combining the connected domains overlapped in the binary image to obtain a candidate text region image of the target image;
and combining the candidate text region images in the binary image according to the position relation and the characteristic relation between the candidate text region images to obtain the candidate text line region image of the target image.
16. the method of claim 15, wherein the features of the candidate text region image comprise an aspect ratio of the candidate text region image and a color of the candidate text region image.
17. The method for detecting text regions in images according to claim 15, wherein after obtaining the candidate text line region image of the target image, outputting coordinates of the candidate text line region image in the target image by:
Calculating to obtain a circumscribed rectangle of the candidate text line region image;
and obtaining the position coordinates of the circumscribed rectangle in the target image, and taking the position coordinates as the coordinates of the candidate text line region image in the target image.
18. The method according to claim 17, wherein the position coordinates of the circumscribed rectangle in the target image are represented by any one of the following ways:
Coordinate positions of four vertexes of the circumscribed rectangle;
the coordinate position of one vertex of the circumscribed rectangle and the length dimension of the circumscribed rectangle.
19. the method for detecting text regions in an image according to claim 1, wherein the obtaining the text regions of the target image specifically includes:
And obtaining the coordinates of the text area in the target image through calculation.
20. An apparatus for detecting a text region in an image, comprising:
a candidate text line region image extracting unit for extracting a candidate text line region image from the target image;
a candidate text line region image judgment unit, configured to read the candidate text line region image provided by the candidate text line region image extraction unit, judge whether a region of the candidate text line region image is a text region by using a trained deep learning text/non-text classifier, and mark the region judged as the text region;
A text region obtaining unit, configured to read the partitions marked as text regions provided by the candidate text line region image determining unit, merge the partitions marked as text regions, and obtain a text region of the target image;
the extracting of the candidate text line region image from the target image includes: carrying out binarization processing on the target image to obtain a binary image of the target image; performing layout analysis on the binary image to obtain a candidate text line region image of the target image;
The binarizing the target image to obtain a binary image of the target image includes:
calculating an edge image and a gray image of the target image;
according to the gray value of the edge pixel and the 8 neighborhood pixels in the edge image in the gray image, marking the edge pixel and the 8 neighborhood pixels in the edge image as a foreground image pixel or a background image pixel; marking other pixels except the edge pixel and 8 neighborhood pixels thereof in the edge image as unknown pixels; marking all pixels in the unknown pixel region as the foreground image pixels or the background image pixels according to the distribution of the foreground image pixels and the background image pixels in the edge of the unknown pixel region;
And binarizing the edge image by taking the pixel points marked as the foreground image pixels in the edge image as foreground pixels to obtain a first binary image of the target image, and binarizing the edge image by taking the pixel points marked as the background image pixels in the edge image as foreground pixels to obtain a second binary image of the target image.
21. The apparatus according to claim 20, wherein the deep learning text/non-text classifier adopts the Cuda-Convnet framework.
22. The apparatus for detecting a text area in an image according to claim 21, wherein five hidden layers are arranged on the Cuda-Convnet framework.
23. The apparatus according to claim 22, wherein the five hidden layers on the Cuda-Convnet framework are, in order from input to output, a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer and a fully connected layer.
24. The apparatus according to claim 20, wherein the determining whether a sub-region of the candidate text line region image is a text region by using the trained deep learning text/non-text classifier, and marking the sub-region determined to be a text region, comprises:
traversing the candidate text line region image by using a sliding window, and intercepting the candidate text line region image corresponding to the sliding window as a window image of the candidate text line region image;
calculating the probability that each window image traversed is a text region through the deep learning text/non-text classifier;
and if the probability that the window image is the text region exceeds a preset threshold value, marking the region corresponding to the window image as the text region.
25. the apparatus according to claim 24, wherein in the step of marking the region corresponding to the window image as the text region if the probability that the window image is the text region exceeds a predetermined threshold, the threshold is obtained as follows:
recording the probability that each traversed window image is a text region;
and calculating, from these per-window probabilities, the average probability that the candidate text line region image is a text region, then taking either that average probability, or a probability value a predetermined amount above or below it, as the predetermined threshold.
26. The apparatus according to any one of claims 20 to 25, wherein before the step of determining whether a sub-region of the candidate text line region image is a text region by using the trained deep learning text/non-text classifier, training the deep learning text/non-text classifier comprises: a sample providing unit configured to provide text images of n rows by m columns of pixels as positive samples and non-text images of n rows by m columns of pixels as negative samples to the deep learning text/non-text classifier, wherein m and n are fixed integer values.
27. the device according to claim 26, wherein the positive sample is a text image of 24 columns by 24 rows of pixels, and the negative sample is a non-text image of 24 columns by 24 rows of pixels.
28. the apparatus according to claim 20, wherein said binarizing the target image to obtain a binary image of the target image, further comprises:
Receiving the target image;
Calculating an edge image of the target image by adopting a Canny algorithm;
And calculating the gray level image of the target image by adopting a color space conversion algorithm.
29. the apparatus according to claim 20, wherein the labeling of the edge pixels and their 8-neighborhood pixels in the edge image as foreground image pixels or background image pixels according to the gray scale values of the edge pixels and their 8-neighborhood pixels in the edge image in the gray scale image specifically comprises:
acquiring the gray value of the selected edge pixel and the 8 neighborhood pixels in the gray image;
Calculating the gray average value of the gray values of the selected edge pixels and the 8 adjacent pixels thereof;
and comparing the gray values of the selected edge pixels and the 8 neighborhood pixels with the gray average value in sequence, if the gray value of the compared pixel is smaller than the gray average value, marking the compared pixel as the foreground image pixel, otherwise, marking the compared pixel as the background image pixel.
30. The apparatus according to claim 20, wherein the labeling all pixels in the unknown pixel region as the foreground image pixels or the background image pixels according to the distribution of the foreground image pixels and the background image pixels in the edge of the unknown pixel region specifically comprises:
Counting the number of foreground image pixels and the number of background image pixels in the edge of the unknown pixel region;
Comparing the number of foreground image pixels and the number of background image pixels in the unknown pixel region edge;
if the number of the foreground image pixels is larger than that of the background image pixels, all the unknown pixels are marked as the foreground image pixels, and otherwise, the unknown pixels are marked as the background image pixels.
31. The apparatus for detecting text regions in images according to claim 20, wherein the performing layout analysis on the binary image to obtain candidate text line region images of the target image specifically comprises: and performing layout analysis on the first binary image and the second binary image respectively to obtain a first candidate text line region image and a second candidate text line region image of the target image.
32. The apparatus according to claim 31, wherein the determining whether a sub-region of the candidate text line region image is a text region by using the trained deep learning text/non-text classifier specifically comprises: respectively judging, by using the deep learning text/non-text classifier, whether the first candidate text line region image and the second candidate text line region image are text regions.
33. The apparatus according to claim 32, wherein the merging the partitions marked as text regions to obtain the text region of the target image comprises:
Merging the partition marked as a text region in the first candidate text line region image into a first text region, and merging the partition marked as a text region in the second candidate text line region image into a second text region;
And combining the first text region and the second text region, and removing the region where the first text region and the second text region are overlapped to obtain the text region of the target image.
34. the apparatus according to claim 20, wherein the performing layout analysis on the binary image to obtain the candidate text line region image of the target image specifically includes:
Receiving the binary image;
Performing connected domain analysis on the binary image to obtain a connected domain of the binary image;
Combining the connected domains overlapped in the binary image to obtain a candidate text region image of the target image;
and combining the candidate text region images in the binary image according to the position relation and the characteristic relation between the candidate text region images to obtain the candidate text line region image of the target image.
35. the apparatus according to claim 34, wherein the features of the candidate text region image comprise an aspect ratio of the candidate text region image and a color of the candidate text region image.
36. The apparatus for detecting text-in-image area according to claim 34, wherein after obtaining the candidate text line area image of the target image, the coordinates of the candidate text line area image in the target image are output by:
calculating to obtain a circumscribed rectangle of the candidate text line region image;
And obtaining the position coordinates of the circumscribed rectangle in the target image, and taking the position coordinates as the coordinates of the candidate text line region image in the target image.
37. the apparatus according to claim 36, wherein the position coordinates of the circumscribed rectangle in the target image are represented by any one of:
coordinate positions of four vertexes of the circumscribed rectangle;
the coordinate position of one vertex of the circumscribed rectangle and the length dimension of the circumscribed rectangle.
38. the apparatus for detecting text regions in an image according to claim 20, wherein the obtaining the text regions of the target image specifically includes:
And obtaining the coordinates of the text area in the target image through calculation.
39. An electronic device, characterized in that the electronic device comprises:
a display;
A processor;
A memory for storing a text region detection file in an image, the text region detection file in the image when executed by the processor extracting candidate text line region images from a target image; judging whether the candidate text line region image is a text region or not by adopting a trained deep learning text/non-text classifier, and marking the region judged as the text region; merging the partitions marked as text areas to obtain the text areas of the target image;
the extracting of the candidate text line region image from the target image includes: carrying out binarization processing on the target image to obtain a binary image of the target image; performing layout analysis on the binary image to obtain a candidate text line region image of the target image;
The binarizing the target image to obtain a binary image of the target image includes:
Calculating an edge image and a gray image of the target image;
According to the gray value of the edge pixel and the 8 neighborhood pixels in the edge image in the gray image, marking the edge pixel and the 8 neighborhood pixels in the edge image as a foreground image pixel or a background image pixel; marking other pixels except the edge pixel and 8 neighborhood pixels thereof in the edge image as unknown pixels; marking all pixels in the unknown pixel region as the foreground image pixels or the background image pixels according to the distribution of the foreground image pixels and the background image pixels in the edge of the unknown pixel region;
And binarizing the edge image by taking the pixel points marked as the foreground image pixels in the edge image as foreground pixels to obtain a first binary image of the target image, and binarizing the edge image by taking the pixel points marked as the background image pixels in the edge image as foreground pixels to obtain a second binary image of the target image.
40. The electronic device of claim 39, wherein the deep learning text/non-text classifier adopts the Cuda-Convnet framework.
41. The electronic device of claim 40, wherein five hidden layers are arranged on the Cuda-Convnet framework.
42. The electronic device of claim 41, wherein the five hidden layers on the Cuda-Convnet framework are, in order from input to output, a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer and a fully connected layer.
43. The electronic device according to claim 39, wherein the determining whether a sub-region of the candidate text line region image is a text region by using the trained deep learning text/non-text classifier, and marking the sub-region determined to be a text region, specifically comprises:
traversing the candidate text line region image by using a sliding window, and intercepting the candidate text line region image corresponding to the sliding window as a window image of the candidate text line region image;
calculating the probability that each window image traversed is a text region through the deep learning text/non-text classifier;
And if the probability that the window image is the text region exceeds a preset threshold value, marking the region corresponding to the window image as the text region.
44. The electronic device according to claim 43, wherein in the step of marking the region corresponding to the window image as the text region if the probability that the window image is the text region exceeds a predetermined threshold, the threshold is obtained by:
Recording the probability that each traversed window image is a text region;
and calculating, from these per-window probabilities, the average probability that the candidate text line region image is a text region, then taking either that average probability, or a probability value a predetermined amount above or below it, as the predetermined threshold.
45. The electronic device of any one of claims 39-44, wherein training the deep learning text/non-text classifier is performed before the step of determining whether a sub-region of the candidate text line region image is a text region by using the trained deep learning text/non-text classifier, and comprises: providing text images of n rows by m columns of pixels as positive samples and non-text images of n rows by m columns of pixels as negative samples to the deep learning text/non-text classifier, wherein m and n are fixed integer values.
46. the electronic device of claim 45, wherein the positive examples are text images of 24 columns by 24 rows of pixels and the negative examples are non-text images of 24 columns by 24 rows of pixels.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510030520.1A CN105868758B (en) | 2015-01-21 | 2015-01-21 | method and device for detecting text area in image and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510030520.1A CN105868758B (en) | 2015-01-21 | 2015-01-21 | method and device for detecting text area in image and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105868758A CN105868758A (en) | 2016-08-17 |
CN105868758B true CN105868758B (en) | 2019-12-17 |
Family
ID=56623015
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510030520.1A Active CN105868758B (en) | 2015-01-21 | 2015-01-21 | method and device for detecting text area in image and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105868758B (en) |
Families Citing this family (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106446899A (en) * | 2016-09-22 | 2017-02-22 | 北京市商汤科技开发有限公司 | Text detection method and device and text detection training method and device |
CN108108731B (en) * | 2016-11-25 | 2021-02-05 | 中移(杭州)信息技术有限公司 | Text detection method and device based on synthetic data |
CN108121988B (en) * | 2016-11-30 | 2021-09-24 | 富士通株式会社 | Information processing method and device, and information detection method and device |
CN106846339A (en) * | 2017-02-13 | 2017-06-13 | 广州视源电子科技股份有限公司 | Image detection method and device |
EP3619680B1 (en) | 2017-05-05 | 2022-01-05 | Beijing Didi Infinity Technology And Development Co., Ltd. | System and method for image re-orientation |
CN106951893A (en) * | 2017-05-08 | 2017-07-14 | 奇酷互联网络科技(深圳)有限公司 | Text information acquisition methods, device and mobile terminal |
CN107145888A (en) * | 2017-05-17 | 2017-09-08 | 重庆邮电大学 | Video caption real time translating method |
CN108304839B (en) * | 2017-08-31 | 2021-12-17 | 腾讯科技(深圳)有限公司 | Image data processing method and device |
CN108304761A (en) | 2017-09-25 | 2018-07-20 | 腾讯科技(深圳)有限公司 | Method for text detection, device, storage medium and computer equipment |
CN108255555B (en) * | 2017-12-26 | 2019-08-13 | 平安科技(深圳)有限公司 | A kind of system language switching method and terminal device |
CN108304835B (en) * | 2018-01-30 | 2019-12-06 | 百度在线网络技术(北京)有限公司 | character detection method and device |
CN110345954A (en) * | 2018-04-03 | 2019-10-18 | 奥迪股份公司 | Navigation system and method |
CN108647681B (en) * | 2018-05-08 | 2019-06-14 | 重庆邮电大学 | A kind of English text detection method with text orientation correction |
CN108564084A (en) * | 2018-05-08 | 2018-09-21 | 北京市商汤科技开发有限公司 | character detecting method, device, terminal and storage medium |
CN108805116B (en) * | 2018-05-18 | 2022-06-24 | 浙江蓝鸽科技有限公司 | Image text detection method and system |
CN109002768A (en) * | 2018-06-22 | 2018-12-14 | 深源恒际科技有限公司 | Medical bill class text extraction method based on the identification of neural network text detection |
CN109189965A (en) * | 2018-07-19 | 2019-01-11 | 中国科学院信息工程研究所 | Pictograph search method and system |
CN109389150B (en) * | 2018-08-28 | 2022-04-05 | 东软集团股份有限公司 | Image consistency comparison method and device, storage medium and electronic equipment |
CN109389110B (en) * | 2018-10-11 | 2021-03-19 | 北京奇艺世纪科技有限公司 | Region determination method and device |
CN109684895B (en) * | 2018-12-06 | 2022-03-18 | 苏州易泰勒电子科技有限公司 | Ternary image processing method for electronic display label |
CN109726722B (en) * | 2018-12-20 | 2020-10-02 | 上海众源网络有限公司 | Character segmentation method and device |
CN109726661B (en) * | 2018-12-21 | 2021-12-17 | 网易有道信息技术(北京)有限公司 | Image processing method and apparatus, medium, and computing device |
CN109685055B (en) * | 2018-12-26 | 2021-11-12 | 北京金山数字娱乐科技有限公司 | Method and device for detecting text area in image |
CN109740482A (en) * | 2018-12-26 | 2019-05-10 | 北京科技大学 | A kind of image text recognition methods and device |
CN109919146A (en) * | 2019-02-02 | 2019-06-21 | 上海兑观信息科技技术有限公司 | Picture character recognition methods, device and platform |
CN111612003B (en) * | 2019-02-22 | 2024-08-20 | 北京京东尚科信息技术有限公司 | Method and device for extracting text in picture |
CN109993161B (en) * | 2019-02-25 | 2021-08-03 | 众安信息技术服务有限公司 | Text image rotation correction method and system |
CN111639639B (en) * | 2019-03-01 | 2023-05-02 | 杭州海康威视数字技术股份有限公司 | Method, device, equipment and storage medium for detecting text area |
CN110046616B (en) * | 2019-03-04 | 2021-05-25 | 北京奇艺世纪科技有限公司 | Image processing model generation method, image processing device, terminal device and storage medium |
CN109919157A (en) * | 2019-03-28 | 2019-06-21 | 北京易达图灵科技有限公司 | A kind of vision positioning method and device |
CN110032969B (en) * | 2019-04-11 | 2021-11-05 | 北京百度网讯科技有限公司 | Method, apparatus, device, and medium for detecting text region in image |
CN110427946B (en) * | 2019-07-04 | 2021-09-03 | 天津车之家数据信息技术有限公司 | Document image binarization method and device and computing equipment |
CN110442521B (en) * | 2019-08-02 | 2023-06-27 | 腾讯科技(深圳)有限公司 | Control unit detection method and device |
US11120216B2 (en) * | 2019-09-20 | 2021-09-14 | International Business Machines Corporation | Selective deep parsing of natural language content |
CN110688949B (en) * | 2019-09-26 | 2022-11-01 | 北大方正集团有限公司 | Font identification method and apparatus |
US11380116B2 (en) | 2019-10-22 | 2022-07-05 | International Business Machines Corporation | Automatic delineation and extraction of tabular data using machine learning |
CN110970132B (en) * | 2019-11-01 | 2023-06-16 | 广东炬海科技股份有限公司 | Illness state early warning system based on mobile nursing |
CN111027560B (en) * | 2019-11-07 | 2023-09-29 | 浙江大华技术股份有限公司 | Text detection method and related device |
CN113536858A (en) * | 2020-04-20 | 2021-10-22 | 阿里巴巴集团控股有限公司 | Image recognition method and system |
CN111738250B (en) * | 2020-08-26 | 2020-12-01 | 北京易真学思教育科技有限公司 | Text detection method and device, electronic equipment and computer storage medium |
CN113298054B (en) * | 2021-07-27 | 2021-10-08 | 国际关系学院 | Text region detection method based on embedded spatial pixel clustering |
CN115525371A (en) * | 2022-05-20 | 2022-12-27 | 北京字跳网络技术有限公司 | Image semantic alignment method and device, electronic equipment and storage medium |
CN116935393B (en) * | 2023-07-27 | 2024-10-18 | 中科微至科技股份有限公司 | Method and system for extracting package surface information based on OCR technology |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101266654A (en) * | 2007-03-14 | 2008-09-17 | 中国科学院自动化研究所 | Image text location method and device based on connective component and support vector machine |
CN103377379A (en) * | 2012-04-27 | 2013-10-30 | 佳能株式会社 | Text detection device and method and text information extraction system and method |
CN103632159A (en) * | 2012-08-23 | 2014-03-12 | 阿里巴巴集团控股有限公司 | Method and system for training classifier and detecting text area in image |
CN104050471A (en) * | 2014-05-27 | 2014-09-17 | 华中科技大学 | Natural scene character detection method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040066538A1 (en) * | 2002-10-04 | 2004-04-08 | Rozzi William A. | Conversion of halftone bitmaps to continuous tone representations |
2015-01-21: CN CN201510030520.1A patent/CN105868758B/en — status: Active
Non-Patent Citations (1)
Title |
---|
Research on Text Localization and Extraction in Color Images; Wang Lei; China Master's Theses Full-text Database, Information Science and Technology; 2007-10-15 (No. 04); main text pp. 1-55 *
Also Published As
Publication number | Publication date |
---|---|
CN105868758A (en) | 2016-08-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105868758B (en) | method and device for detecting text area in image and electronic equipment | |
CN109726643B (en) | Method and device for identifying table information in image, electronic equipment and storage medium | |
CN110689037B (en) | Method and system for automatic object annotation using deep networks | |
TWI744283B (en) | Method and device for word segmentation | |
CN106156766B (en) | Method and device for generating text line classifier | |
Khare et al. | A blind deconvolution model for scene text detection and recognition in video | |
CN108805116B (en) | Image text detection method and system | |
CN113158808A (en) | Method, medium and equipment for Chinese ancient book character recognition, paragraph grouping and layout reconstruction | |
CN113705576B (en) | Text recognition method and device, readable storage medium and equipment | |
CN110598566A (en) | Image processing method, device, terminal and computer readable storage medium | |
CN113591746B (en) | Document table structure detection method and device | |
CN112085022A (en) | Method, system and equipment for recognizing characters | |
CN106326921B (en) | Text detection method | |
Yadav et al. | Text extraction in document images: highlight on using corner points | |
Ayesh et al. | A robust line segmentation algorithm for Arabic printed text with diacritics | |
CN115439866A (en) | Method, equipment and storage medium for identifying table structure of three-line table | |
Chen et al. | A knowledge-based system for extracting text-lines from mixed and overlapping text/graphics compound document images | |
Gui et al. | A fast caption detection method for low quality video images | |
CN114581928A (en) | Form identification method and system | |
CN118135584A (en) | Automatic handwriting form recognition method and system based on deep learning | |
Ghosh et al. | Scene text understanding: recapitulating the past decade | |
CN115019310B (en) | Image-text identification method and equipment | |
CN110084117B (en) | Document table line detection method and system based on binary image segmentation projection | |
CN112580624A (en) | Method and device for detecting multidirectional text area based on boundary prediction | |
CN111241897A (en) | Industrial checklist digitization by inferring visual relationships |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||