
CN112101386A - Text detection method and device, computer equipment and storage medium - Google Patents

Text detection method and device, computer equipment and storage medium

Info

Publication number
CN112101386A
CN112101386A
Authority
CN
China
Prior art keywords
character
picture
scale
detected
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011020108.9A
Other languages
Chinese (zh)
Other versions
CN112101386B (en)
Inventor
郭双双
李斌
龚星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011020108.9A priority Critical patent/CN112101386B/en
Publication of CN112101386A publication Critical patent/CN112101386A/en
Application granted granted Critical
Publication of CN112101386B publication Critical patent/CN112101386B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Character Input (AREA)

Abstract

The application relates to a text detection method and apparatus, a computer device and a storage medium. The method comprises: acquiring a picture to be detected; extracting picture features of the picture to be detected at different scales; determining, from the picture features of each scale, character-related information for each character in the picture to be detected at that scale; and integrating the character-related information from all scales to obtain a character detection result for each character in the picture to be detected. By extracting picture features at multiple scales, the method detects characters of different sizes in natural scenes and avoids the inaccurate text detection that inconsistent picture sizes and similar factors would otherwise cause.

Description

Text detection method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a text detection method, an apparatus, a computer device, and a storage medium.
Background
Text detection in natural scenes locates text information in an image so that character recognition can subsequently be performed on the detected text-box pictures containing characters, ultimately yielding structured information. Text detection is widely used in image retrieval, video analysis, automatic driving and other fields, and its accuracy affects the accuracy of subsequent steps such as text recognition and information structuring, so developing an effective text detection algorithm for natural scenes is very important.
However, real natural scenes typically contain many uncontrollable interference factors, such as variations in image brightness, distortion of the captured image, non-uniform text scale, curved text and occlusion by foreign objects, which keep text detection in natural scenes a difficult task. In recent years, with the rapid development of deep learning, text detection in natural scenes has made dramatic progress.
Commonly used text detection is currently realized through artificial intelligence techniques, mainly a text detection model and a character recognition model. This approach usually demands a high degree of precision: if the position of a single character is located incorrectly, the performance of subsequent character recognition and information structuring suffers severely. Moreover, because of the limitations and complexity of shooting conditions in real scenes, character images captured by different users with different hardware differ greatly, chiefly in brightness, size, angle and occlusion, so text detection results are easily inaccurate.
Disclosure of Invention
In view of the above, it is necessary to provide a text detection method, an apparatus, a computer device and a storage medium capable of improving the accuracy of the detection result.
A text detection method, the method comprising:
acquiring a picture to be detected;
extracting picture characteristics of the picture to be detected based on different scales;
determining character related information of each character in the picture to be detected corresponding to each scale according to the picture characteristics of each scale;
and integrating the relevant information of each character corresponding to each scale to obtain the character detection result of each character in the picture to be detected.
A text detection apparatus, the apparatus comprising:
the acquisition module is used for acquiring a picture to be detected;
the characteristic extraction module is used for extracting the picture characteristics of the picture to be detected based on different scales;
the character information determining module is used for determining character related information of each character in the picture to be detected corresponding to each scale according to the picture characteristics of each scale;
and the result integration module is used for integrating the relevant information of each character corresponding to each scale to obtain the character detection result of each character in the picture to be detected.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a picture to be detected;
extracting picture characteristics of the picture to be detected based on different scales;
determining character related information of each character in the picture to be detected corresponding to each scale according to the picture characteristics of each scale;
and integrating the relevant information of each character corresponding to each scale to obtain the character detection result of each character in the picture to be detected.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a picture to be detected;
extracting picture characteristics of the picture to be detected based on different scales;
determining character related information of each character in the picture to be detected corresponding to each scale according to the picture characteristics of each scale;
and integrating the relevant information of each character corresponding to each scale to obtain the character detection result of each character in the picture to be detected.
According to the text detection method and apparatus, computer device and storage medium above, after the picture to be detected is obtained, its picture features are extracted at different scales, the character-related information in the picture at each scale is determined from the picture features of that scale, and finally the character-related information from all scales is integrated to obtain the character detection result for each character in the picture to be detected. By extracting picture features at multiple scales, the method detects characters of different sizes in natural scenes and avoids the inaccurate text detection that inconsistent picture sizes and similar factors would otherwise cause.
Drawings
FIG. 1 is a diagram of an application environment of a text detection method in one embodiment;
FIG. 2 is a flow diagram that illustrates a method for text detection in one embodiment;
FIG. 3 is a schematic view illustrating a multi-scale feature extraction process performed on a picture to be detected in one embodiment;
fig. 4 is a schematic flow chart illustrating a process of integrating information related to each character corresponding to each scale to obtain a character detection result of each character in a picture to be detected in one embodiment;
FIG. 5 is a flowchart illustrating a text detection method according to another embodiment;
FIG. 6 is a diagram illustrating an association box between characters in one embodiment;
FIG. 7 is an exemplary graph of a Gaussian thermodynamic diagram in one embodiment;
FIG. 8 is a diagram illustrating the generation of a correlation matrix for the positions of two adjacent characters in one embodiment;
fig. 9 is a schematic flow chart illustrating a process of determining a field detection result in a to-be-detected picture based on inter-character association information corresponding to each scale in one embodiment;
FIG. 10 is a diagram illustrating a structure of a text detection network in accordance with an exemplary embodiment;
FIG. 11 is a flowchart illustrating a text detection method according to an exemplary embodiment;
fig. 12(1) is a schematic diagram of a detection result of performing text detection and output on a picture in a natural scene in an embodiment;
fig. 12(2) is a schematic diagram illustrating a detection result of text detection and output for a picture in a natural scene in an embodiment;
fig. 12(3) is a schematic diagram illustrating a detection result of text detection and output for a picture in a natural scene in an embodiment;
FIG. 13 is a block diagram showing the structure of a text detection apparatus according to an embodiment;
FIG. 14 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The text detection method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. After the server 104 acquires the picture to be detected, the picture features of the picture to be detected are extracted from different scales, the character related information in the picture to be detected under the corresponding scale is determined according to the picture features of the scales, and finally the character detection result of each character in the picture to be detected is obtained by integrating the character related information corresponding to each scale. In some embodiments, the terminal 102 may capture a picture, and the server 104 obtains the picture to be detected from the terminal 102. The terminal 102 may be, but is not limited to, various devices with a photographing function, such as a camera, and the like. The server 104 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
Cloud technology is a hosting technology that unifies hardware, software, network and other resources in a wide area network or local area network to realize computation, storage, processing and sharing of data.
Cloud technology is also a general term for the network, information, integration, management-platform and application technologies applied in the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing will become an important support: background services of technical network systems, such as video websites, picture websites and web portals, require large amounts of computing and storage resources. As the internet industry develops, each article may carry its own identification mark that must be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industry data need strong system-background support, which can only be realized through cloud computing.
An artificial intelligence cloud service is commonly referred to as AIaaS (AI as a Service). It is a service model for artificial intelligence platforms: the AIaaS platform splits common AI services into several types and provides independent or packaged services in the cloud, similar to an AI-themed app store. All developers can access one or more of the platform's artificial intelligence services through an API, and some qualified developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own dedicated cloud artificial intelligence services.
In an embodiment, as shown in fig. 2, a text detection method is provided, which is exemplified by applying the method to the server in fig. 1, and in this embodiment, the method includes the following steps:
step S210, acquiring a picture to be detected.
In this embodiment, the picture to be subjected to text detection is recorded as the picture to be detected. In one embodiment, a picture to be detected input by a user can be acquired; a picture can be acquired directly from a connected picture acquisition device as the picture to be detected; or the corresponding picture can be read from a database as the picture to be detected.
Step S220, extracting the picture characteristics of the picture to be detected based on different scales.
Multi-scale processing samples a signal at different granularities; different characteristics can be observed at different scales, serving different tasks. In general, finer-grained (denser) sampling reveals more detail, while coarser-grained (sparser) sampling reveals the overall trend. In this embodiment, for example, the picture to be detected may undergo multi-scale feature extraction at different sizes and resolutions: a small-size feature map characterizes large objects strongly, while a large-size feature map characterizes small objects strongly. The features of the picture to be detected are extracted separately at each scale to obtain picture features at different scales. Picture features are the most basic attributes characterizing a picture; they may be natural features recognizable by human vision or artificially defined features.
In one embodiment, feature extraction at different scales is performed on the picture to be recognized by a neural network model determined through training, yielding picture features at different scales. In a specific embodiment, fig. 3 shows a schematic flow chart of multi-scale feature extraction for a picture to be detected. Extracting feature information at several scales simultaneously can greatly improve network performance. As the number of neural network layers increases, the feature map size keeps shrinking; a small-size feature map characterizes large objects strongly, and a large-size feature map characterizes small objects strongly, so combining high-resolution and low-resolution feature maps better extracts the characters in the picture.
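As an illustration of this multi-scale extraction, the following is a minimal PyTorch-style sketch; the layer shapes, channel counts and names are illustrative assumptions, not the patent's actual network definition.

```python
import torch
import torch.nn as nn

class MultiScaleBackbone(nn.Module):
    """Produces feature maps at several scales from one input picture."""
    def __init__(self, channels=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU())
        # Each stage halves the spatial size, so deeper feature maps are
        # smaller (stronger for large objects) and shallower feature maps
        # are larger (stronger for small objects).
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, stride=2, padding=1),
                nn.ReLU())
            for _ in range(3)])

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # feature strides 4, 8 and 16 w.r.t. the input
        return feats

# Example: a 512x512 picture yields feature maps of 128x128, 64x64, 32x32.
# feats = MultiScaleBackbone()(torch.randn(1, 3, 512, 512))
```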
And step S230, determining character related information of each character in the to-be-detected picture corresponding to each scale according to the picture characteristics of each scale.
For the extracted picture features of each scale, information related to characters in the picture can be obtained according to the picture features, and the information is recorded as character related information in this embodiment. In one embodiment, the character-related information includes: position information of a single character; the character position information related in this step is character position information corresponding to each scale. In another embodiment, the character-related information includes: whether each pixel point in the picture characteristics of each scale is a foreground or not and the distance of each pixel point relative to the character frame where the pixel point is located. In other embodiments, the character-related information may also refer to other information related to the character.
In one embodiment, determining the character related information of each character in the to-be-detected picture corresponding to each scale according to the picture features of each scale respectively includes: respectively determining the foreground classification result of each pixel point and the distance between each pixel point and the character frame corresponding to the pixel point aiming at the picture characteristics corresponding to any scale; wherein the character related information includes: the foreground classification result of each pixel point and the distance between each pixel point and the character frame where the pixel point is located.
In one embodiment, for each scale, determining the corresponding character-related information from the picture features may be implemented by a neural network comprising two branches: a pixel class classification branch, trained to determine the probability that each pixel point belongs to the foreground from the picture features, and a position regression branch, trained to determine the distance between each pixel point and its character box from the picture features. The distance between a pixel point and its character box comprises the distances from the pixel point to the box's 4 borders; in one embodiment, the corner points of the character's box can be used to determine these 4 distances.
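A minimal sketch of these two heads, continuing the assumed PyTorch-style backbone above; one 1×1 convolution per branch is an illustrative assumption, not the patent's stated layer layout.

```python
import torch
import torch.nn as nn

class PixelHeads(nn.Module):
    """Per-pixel class classification and position regression branches."""
    def __init__(self, channels=64):
        super().__init__()
        self.cls = nn.Conv2d(channels, 1, 1)  # foreground probability
        self.loc = nn.Conv2d(channels, 4, 1)  # distances to the 4 borders

    def forward(self, feat):
        fg_prob = torch.sigmoid(self.cls(feat))  # in [0, 1] per pixel
        distances = torch.relu(self.loc(feat))   # non-negative distances
        return fg_prob, distances
```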
Further, in one embodiment, the pixel class classification branch is a pixel-level binary classification algorithm, and its training process adopts a standard cross entropy loss, as in formula (1). If a pixel point falls inside a labeled character box, it is classified as foreground. Because this provides many foreground samples, the problem of an unbalanced number of foreground and background samples is largely avoided.
$$L_{cls} = -\frac{1}{N}\sum_{x,y,c}\left[Y_{xyc}\log\hat{Y}_{xyc} + \left(1-Y_{xyc}\right)\log\left(1-\hat{Y}_{xyc}\right)\right] \tag{1}$$
where N is the total number of training samples and the sum over x, y, c traverses all pixel points; Y_{xyc} is the real label of each pixel point, with value 0 or 1, and Ŷ_{xyc} is the predicted probability that the pixel point is 1, with value range [0, 1].
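Formula (1) is the standard binary cross entropy; a direct transcription, assuming PyTorch tensors, might read:

```python
import torch.nn.functional as F

def pixel_cls_loss(pred_prob, target):
    # pred_prob: predicted foreground probability per pixel, in [0, 1]
    # target:    real label per pixel, 0 or 1 (1 inside a labeled character box)
    # Averages the cross entropy over all N pixel points, as in formula (1).
    return F.binary_cross_entropy(pred_prob, target)
```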
The pixel position regression branch learns the distance of each pixel relative to the 4 borders of its character box; using the information of the box's 4 corner points yields a more accurate edge detection result. The training process of the pixel position regression branch adopts the IoU loss, defined as formula (2): the smaller the intersection ratio of the predicted and actual values, the larger the loss function.
$$L_{loc} = -\frac{1}{N_{pos}}\sum \log\left(\mathrm{IoU\_value}\right) \tag{2}$$
where IoU_value denotes the intersection ratio of the predicted character box and the actual character box, defined as the intersection area of the two text borders divided by the union area of the two text borders, and N_pos, the number of real character boxes, normalizes the loss value.
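A sketch of formula (2), assuming the common per-pixel formulation in which each pixel regresses its distances to the four borders; the tensor layout and the exact −log form are assumptions consistent with the definitions above.

```python
import torch

def iou_loss(pred_dist, gt_dist, pos_mask):
    # pred_dist, gt_dist: (N, 4) distances (left, top, right, bottom) from
    # each pixel to the borders of its character box; pos_mask marks the
    # N_pos pixels that lie inside a real character box.
    l_p, t_p, r_p, b_p = pred_dist.unbind(-1)
    l_g, t_g, r_g, b_g = gt_dist.unbind(-1)
    area_p = (l_p + r_p) * (t_p + b_p)
    area_g = (l_g + r_g) * (t_g + b_g)
    inter = (torch.min(l_p, l_g) + torch.min(r_p, r_g)).clamp(min=0) \
          * (torch.min(t_p, t_g) + torch.min(b_p, b_g)).clamp(min=0)
    iou = inter / (area_p + area_g - inter).clamp(min=1e-6)
    # The smaller the intersection ratio, the larger the loss, as stated above;
    # normalized by N_pos as in formula (2).
    return -torch.log(iou.clamp(min=1e-6))[pos_mask].sum() / pos_mask.sum().clamp(min=1)
```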
In one embodiment, the picture feature information of all scales passes through the same pixel class classification branch and position regression branch. For the position regression branch, a scale value is independently learned for each scale's picture features during training, and each scale's features undergo this scale transformation before being input into the position regression branch.
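A sketch of this shared regression branch with one independently learned scale value per feature level; the multiplicative form of the scale transformation is an assumption, since the patent does not state how the scale value is applied.

```python
import torch
import torch.nn as nn

class SharedScaledRegression(nn.Module):
    def __init__(self, num_scales, channels=64):
        super().__init__()
        self.scales = nn.Parameter(torch.ones(num_scales))  # one per level
        self.loc = nn.Conv2d(channels, 4, 1)  # shared across all levels

    def forward(self, feats):
        # Each level's features are scale-transformed by that level's
        # learned value, then pass through the same regression branch.
        return [torch.relu(self.loc(f * self.scales[i]))
                for i, f in enumerate(feats)]
```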
Step S240, integrating the relevant information of each character corresponding to each scale to obtain the character detection result of each character in the picture to be detected.
After the character-related information is obtained, since it is information at different scales, scale conversion must be performed on it to map each scale's result back to the original picture to be detected. In addition, the character-related information is only some information related to the characters, so the obtained information must be integrated to derive information such as the border position of each character.
In an embodiment, the character related information includes a foreground classification result of each pixel point and a distance between each pixel point and a character frame where the pixel point is located, and in this embodiment, as shown in fig. 4, the character related information corresponding to each scale is integrated to obtain a character detection result of each character in the picture to be detected, including steps S141 to S145.
In step S141, any unselected scale is selected as the current scale.
And step S142, determining a target pixel point set belonging to the foreground according to the foreground classification result of each pixel point under the current scale.
In this embodiment, a set composed of pixels belonging to the foreground is a target pixel set, that is, a set of pixels belonging to a character; in one embodiment, the foreground classification result of each pixel under the current scale includes a probability value that each pixel belongs to the foreground, and determining the target pixel set belonging to the foreground according to the foreground classification result of each pixel includes: and determining the pixel points with the probability value larger than the preset foreground probability threshold value as the pixel points belonging to the foreground to obtain a target pixel point set. Wherein, predetermine the prospect probability threshold value and can set up according to actual conditions.
And S143, determining potential frame pixel points from the target pixel point set according to the distance between each pixel point and the character frame where the pixel point is located, and obtaining the potential character frame position corresponding to the potential frame pixel points under the current scale.
In step S142, all the pixel points belonging to the foreground are screened, and in combination with the distance between each pixel point in the target pixel point set and the character frame where the pixel point is located, the pixel points possibly belonging to the character frame can be determined, and in this embodiment, the pixel points are marked as potential frame pixel points, and the position information corresponding to the potential frame pixel points is obtained, so that the position of the potential character frame can be obtained.
And S144, screening out the accurate character frame position under the current scale from the potential character frame positions, and returning to the step of selecting any unselected scale as the current scale until all scales are selected.
In one embodiment, selecting character boxes and removing repeated character-box information from the potential character boxes can be realized with a non-maximum suppression method: boxes with relatively small probability values whose overlap with a box of larger foreground probability exceeds a preset overlap threshold are removed, and one accurate character box is output for each character.
Non-Maximum Suppression (NMS), as the name implies, suppresses elements that are not maxima and can be understood as a local maximum search. "Local" here denotes a neighborhood, which has two variable parameters: its dimension and its size. In one embodiment, removing the character boxes with relatively small probability values and above-threshold overlap includes: for any character, selecting from the potential character boxes a first character box, namely the box of the pixel point with the largest foreground probability; then, for each other potential box of the same character, calculating its overlap with the first box and deleting it if the overlap exceeds the preset overlap threshold, leaving the accurate character box of that character. The overlap of two character boxes can be determined by computing the overlap of the areas they enclose; the preset overlap threshold can be set according to the actual situation.
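A plain-Python sketch of this non-maximum suppression step; the box format and threshold value are illustrative assumptions.

```python
def box_iou(a, b):
    # a, b: (x1, y1, x2, y2). Overlap of the areas enclosed by the two boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, fg_probs, overlap_thresh):
    # Keep the box with the highest foreground probability, drop every
    # remaining box whose overlap with it exceeds the preset threshold,
    # and repeat, so each character keeps one accurate box.
    order = sorted(range(len(boxes)), key=lambda i: fg_probs[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [j for j in order
                 if box_iou(boxes[best], boxes[j]) <= overlap_thresh]
    return keep
```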
And S145, respectively mapping the accurate character frame positions corresponding to the scales to the picture to be detected to obtain the character detection result of each character in the picture to be detected.
Since the accurate character box determined for each character in the above steps lives at some scale, the accurate character-box information obtained at each scale must undergo a scale transformation and be mapped to the corresponding scale of the original picture to be detected, yielding the character detection result of each character in the picture. In one embodiment, mapping the accurate character-box position of each scale to the picture to be detected includes: multiplying the character-box position at each scale by the feature step length of that scale to obtain the character detection result of the picture to be detected.
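A one-line sketch of this mapping, assuming axis-aligned box coordinates:

```python
def map_box_to_image(box, feature_stride):
    # A box predicted on a feature map of stride s corresponds to
    # coordinates s times larger in the original picture to be detected.
    return tuple(coord * feature_stride for coord in box)
```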
After the character related information is obtained, analyzing and integrating the character related information to obtain frame information of characters in the picture to be detected, mapping the character frame information obtained under each scale to the original picture scale to obtain position information of the character frame of each character in the picture to be detected, namely character detection results of each character in the picture to be detected; the method in the embodiment is combined with the deep learning technology, so that the text detection method has higher accuracy and robustness and has stronger adaptability to low-quality images.
According to the text detection method, after the picture to be detected is obtained, its picture features are extracted at different scales, the character-related information in the picture at each scale is determined from the picture features of that scale, and finally the character-related information from all scales is integrated to obtain the character detection result for each character. By extracting picture features at multiple scales, the method detects characters of different sizes in natural scenes and avoids the inaccurate text detection that inconsistent picture sizes and similar factors would otherwise cause.
Further, in an embodiment, as shown in fig. 5, after extracting the picture features of the picture to be detected based on different scales, step S510 and step S520 are further included.
Step S510, determining the associated information between the characters corresponding to the respective scales according to the picture features of the respective scales.
The inter-character association information indicates an association relationship between characters in the picture, and includes information such as whether two characters are adjacent, a distance between the two characters, and a probability that the two characters belong to the same field.
In one embodiment, determining the associated information between the characters corresponding to each scale according to the picture features of each scale respectively includes: determining an incidence matrix between two adjacent characters corresponding to each scale according to the picture characteristics of each scale; and generating corresponding character incidence relation response graphs respectively based on incidence matrixes between two adjacent characters under different scales, wherein the incidence information between the characters comprises the character incidence relation response graphs.
Wherein, the incidence matrix between two characters comprises an incidence frame between two characters. After obtaining the incidence matrix between the two characters, generating a character incidence relation response graph between the two characters according to the incidence matrix; in one embodiment, the character association relationship response graph comprises a gaussian thermodynamic graph.
In one embodiment, a trained character relevance branch determines the corresponding inter-character association matrix, i.e. the association box between two adjacent characters, from the picture features at different scales; fig. 6 is a schematic diagram of an association box between characters, where the solid-line box represents the association box between two characters. An ellipse is generated with the center of the association box as the origin and half the character-box width and half the character-box height as the two axis radii, and a character association relationship response map is generated from the origin and the distance of each point in the ellipse from the origin. In a specific embodiment, the character association relationship response map is a Gaussian thermodynamic map generated by a python function; a simple schematic of a Gaussian thermodynamic map is shown in fig. 7. Each association box corresponds to one Gaussian map, and a darker color in the Gaussian thermodynamic map indicates a higher likelihood that the characters belong to the same field.
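A sketch of generating such a Gaussian response for one association box; the exact Gaussian formula is an assumption, since the text only states that the response is built from the origin and each ellipse point's distance to it.

```python
import numpy as np

def gaussian_response(height, width, cx, cy, rx, ry):
    # (cx, cy): center of the association box (origin of the ellipse);
    # rx, ry:   the two axis radii, i.e. character-box width/2 and height/2.
    ys, xs = np.mgrid[0:height, 0:width]
    # Response peaks at the origin and decays with distance inside the
    # ellipse; darker (higher) values mean the two characters more likely
    # belong to the same field.
    return np.exp(-0.5 * (((xs - cx) / rx) ** 2 + ((ys - cy) / ry) ** 2))
```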
Further, in one embodiment, during training of the character relevance branch: an association matrix between adjacent characters is generated from the coordinate position of each single character, a character association relationship response map is generated with the center point of the association matrix as the origin, and a preset neural network is trained with this map as the learning target to obtain the character relevance branch. The training process of the character relevance branch adopts a minimum variance (squared-error) loss function, defined in formula (3). Meanwhile, to avoid an unbalanced number of foreground and background samples, the loss of all foreground pixels is considered, while only the background pixels with the largest loss values are screened in, keeping the total number of screened background pixels consistent with the number of foreground pixels. This both solves the imbalance between positive and negative examples and realizes online hard example mining, improving network performance.
$$L_{aff} = \frac{1}{N}\sum_{x,y}\left(G_{x,y} - P_{x,y}\right)^{2} \tag{3}$$
where N is the number of pixel points participating in loss gradient feedback, G_{x,y} is the true label of each pixel point in the sample, and P_{x,y} is the predicted label of each pixel point in the sample.
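A sketch of formula (3) together with the hard-background screening described above; PyTorch tensors are assumed, and the selection rule is transcribed from the text.

```python
import torch

def affinity_loss(pred, target, fg_mask):
    # Per-pixel squared error, as in formula (3). All foreground pixels
    # contribute; only the background pixels with the largest losses are
    # kept, matching the foreground count (online hard example mining).
    err = (pred - target) ** 2
    fg_err = err[fg_mask]
    bg_err = err[~fg_mask]
    k = min(fg_err.numel(), bg_err.numel())
    hard_bg, _ = torch.topk(bg_err, k)  # largest background losses
    total = fg_err.numel() + k          # N pixels joining gradient feedback
    return (fg_err.sum() + hard_bg.sum()) / max(total, 1)
```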
In one embodiment, a sample picture obtained during training of the character relevance branch carries labeling information about which characters belong to the same field, and an association matrix is then generated between adjacent characters within a field. In this embodiment, generating the association matrix between adjacent characters from the coordinate position of a single character includes: for each character, taking the two vertices on the same diagonal of its labeled character box as target vertices, together with the box's center point; the association matrix between two adjacent characters is then determined from their target vertices and center points.
In one embodiment, the sample carries labeling information that certain characters belong to the same field, and association information is then generated for adjacent characters in the field. Further, generating the association matrix between adjacent characters from the coordinate positions of single characters proceeds as follows. Assume the 2-point coordinates (upper-left and lower-right) of character 1 are [(x1, y1), (x2, y2)] and those of character 2 are [(x3, y3), (x4, y4)]; then, as shown in fig. 8, the 4-point coordinates (upper left, upper right, lower right, lower left) of the association border of the two characters are:
[(x1+x2+c1)/3, (y1+y1+d1)/3],
[(x3+x4+c2)/3, (y3+y3+d2)/3],
[(x3+x4+c2)/3, (y4+y4+d2)/3],
[(x1+x2+c1)/3, (y2+y2+d1)/3]
where (c1, d1) and (c2, d2) are the center-point coordinates of character 1 and character 2, respectively, shown as the points at the cross positions in fig. 8.
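A direct transcription of these four corner formulas; the coordinate convention follows the description above.

```python
def association_box(char1, char2):
    # char1, char2: ((x_tl, y_tl), (x_br, y_br)) upper-left and lower-right
    # corners of two adjacent character boxes.
    (x1, y1), (x2, y2) = char1
    (x3, y3), (x4, y4) = char2
    c1, d1 = (x1 + x2) / 2, (y1 + y2) / 2  # center of character 1
    c2, d2 = (x3 + x4) / 2, (y3 + y4) / 2  # center of character 2
    return [
        ((x1 + x2 + c1) / 3, (y1 + y1 + d1) / 3),  # upper left
        ((x3 + x4 + c2) / 3, (y3 + y3 + d2) / 3),  # upper right
        ((x3 + x4 + c2) / 3, (y4 + y4 + d2) / 3),  # lower right
        ((x1 + x2 + c1) / 3, (y2 + y2 + d1) / 3),  # lower left
    ]
```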
Step S520, determining a field detection result in the picture to be detected based on the associated information between the characters corresponding to all scales.
In one embodiment, the inter-character association information includes probabilities that two characters belong to the same field, and determining the field detection result based on the inter-character association information corresponding to each scale includes: determining fields in the picture to be detected according to the incidence relation of every two adjacent characters in the incidence information between the characters, and obtaining the frame position information of all the fields in the picture. In one embodiment, the character detection and the field detection for the picture to be detected may be in parallel.
Further, in an embodiment, the determining a field detection result in the picture to be detected based on the inter-character association information corresponding to each scale, as shown in fig. 9, includes steps S311 to S314:
step S311, converting the character association relationship response map of each scale into a black-and-white association relationship response map according to a preset response threshold.
In one embodiment, converting the character association relation response graph of each scale into a black and white association relation response graph according to a preset response threshold, including: and setting the pixel point with the response value smaller than the preset response threshold value in the character association relation response graph as 0, otherwise, setting the pixel point as 1. And converting all the pixel points to obtain a corresponding black-white incidence relation response image. The preset response threshold value can be set according to actual conditions.
And S312, performing connected domain analysis based on the black-white incidence relation response graph to obtain connected domain information corresponding to each scale.
Connected domains generally refer to connected regions. In a specific embodiment, based on the black-and-white association relationship response map of each scale, the connectedComponentsWithStats function of the OpenCV open-source library is called to obtain the connected-domain information corresponding to each scale.
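A sketch of this step using OpenCV's connected-component analysis; the array types are assumptions, and the 0.2 threshold echoes the value given later in the description.

```python
import cv2
import numpy as np

def field_boxes(response_map, response_thresh=0.2):
    # Binarize the association relationship response map at the preset
    # threshold, then treat each connected component as one field and
    # return its bounding box (the component's edge information).
    binary = (response_map >= response_thresh).astype(np.uint8)
    n, _labels, stats, _centroids = cv2.connectedComponentsWithStats(binary)
    boxes = []
    for i in range(1, n):  # component 0 is the background
        x, y, w, h, _area = stats[i]
        boxes.append((x, y, x + w, y + h))
    return boxes
```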
And step 313, determining the position information of the field frame of the scale according to the connected domain information corresponding to each scale.
In this embodiment, the connected domain obtained after the connected domain analysis is performed based on the black-and-white association response map actually corresponds to a field in the picture to be detected; therefore, connected domain edge information in the connected domain information is obtained, and the field frame position information of the scale where the connected domain edge information is located is determined.
And step S314, mapping the field frame position information of each scale to the picture to be detected to obtain a field detection result in the picture to be detected.
Similarly to character detection, after the field frame position information corresponding to each scale is obtained, the field frame position information of each scale is mapped to the scale of the original picture to be detected, so as to obtain the position information of the field in the picture to be detected, namely the field detection result.
In the above embodiment, not only the character detection is performed on the characters in the picture to be detected, but also the field detection result is obtained by detecting the field in the picture to be detected, and the output at the field level can also greatly reduce the pressure of the subsequent information structuring process.
In another embodiment, after obtaining the relevant information of the characters in each scale, the method further comprises performing character recognition according to the picture features in each scale to obtain a character recognition result; further, the character recognition result is output together with the subsequent character detection result. In one embodiment, image information and text information are processed simultaneously through a unified network in conjunction with multi-modal learning to obtain more accurate field output results.
In another embodiment, extracting picture features from different scales of a picture to be detected is completed through a multi-resolution network, and completing character detection and field detection based on the picture features of each scale is respectively realized through a plurality of different branches in a trained neural network model, for example, the branch comprises a pixel class classification branch, a pixel position regression branch and a character correlation branch; fig. 10 is a schematic structural diagram of a text detection network in an embodiment. In this embodiment, the detection of the character and the detection of the field are respectively realized by a trained neural network model, so that different branches can be mutually promoted, and the accuracy of the trained model is higher. In this embodiment, the text detection network includes a multi-scale feature extraction network, a pixel class classification branch, a pixel position regression branch, and a character association branch. Wherein the multi-scale feature extraction network comprises a convolutional layer (conv) and a plurality of residual blocks (ResBlock); the pixel classification branch and the pixel position regression branch are convolution layers (conv) respectively; the character relevance branch includes a merge layer (merge) and a convolution layer (conv).
Further, in one embodiment, the method further comprises: and matching the detection result of each character in the picture to be detected with the detection result of each field to obtain the corresponding relation between the character and the field of the picture to be detected.
In the above embodiment, the character frame position (character detection result) of each character in the picture to be detected and the field frame position (field detection result) of each field have been determined, and the matching between the characters and the fields is realized by calculating the overlapping degree between each character frame and the field frame, so that which characters in the picture to be detected belong to the same field can be determined, and the text detection result of the corresponding relationship between the characters and the fields is output.
In one embodiment, let C be the set of single-character information obtained from the character detection results and W the set of field information obtained from the field detection results; each character is then matched against each field. Each field in the set W is traversed, the overlap of each character with that field is calculated, and if the overlap exceeds a preset overlap threshold (set to 0.8 in one embodiment), the character matches that field successfully. It will be appreciated that in other embodiments, the preset overlap threshold may be set to other values depending on the actual circumstances.
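A sketch of this matching step; measuring overlap as intersection area over the character's own area is an assumption, since the text only says "degree of overlap".

```python
def match_characters_to_fields(char_boxes, field_boxes, overlap_thresh=0.8):
    # Returns, for each field index, the indices of the characters whose
    # overlap with that field exceeds the preset overlap threshold.
    matches = {i: [] for i in range(len(field_boxes))}
    for i, f in enumerate(field_boxes):
        for j, c in enumerate(char_boxes):
            ix1, iy1 = max(f[0], c[0]), max(f[1], c[1])
            ix2, iy2 = min(f[2], c[2]), min(f[3], c[3])
            inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
            char_area = (c[2] - c[0]) * (c[3] - c[1])
            if char_area > 0 and inter / char_area > overlap_thresh:
                matches[i].append(j)
    return matches
```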
In the embodiment, the obtained character detection result and the field detection result in the picture to be detected are matched based on the overlapping degree, so that the incidence relation between the characters and the fields in the picture to be detected can be output, the detection results are enriched, and the pressure of subsequent information structuring can be reduced.
The application also provides an application scene, and the application scene applies the text detection method. Specifically, the application of the text detection method in the application scenario is as follows:
Fig. 11 is a schematic flow chart of the text detection method in this embodiment, which includes the steps of: acquiring the picture to be detected and extracting its multi-scale features; performing single-character class classification, single-character position regression and inter-character relevance calculation from the multi-scale picture features; and outputting character- and field-level text detection results by combining the character class classification, character position regression and inter-character relevance results of all scales.
Firstly, an image needing character detection, namely a picture to be detected, is obtained as input data of a network model.
Secondly, the picture to be detected is input into the multi-resolution network, the picture characteristic information on a plurality of scales can be extracted simultaneously, and the size of the characteristic graph is continuously reduced along with the increase of the number of layers of the neural network.
Then, based on the multi-scale picture characteristic information, the position information and the inter-character relevance information of a single character are respectively output through 2 independent branches by adopting a multi-task learning criterion. Specifically, for each pixel of the feature map, the single character branch network can output the probability that each pixel is a character (foreground) and the distance between each pixel and the 4 frames of the character through continuous training and learning. For the output profile of the inter-character association information branch, the value on each pixel represents the probability that the pixel is an inter-character association.
For the single-character detection branch, the feature information of all scales passes through the same pixel class classification branch and pixel position regression branch, which reduces the model's parameter count and computation while improving the detection effect. It is worth noting that the feature information of each scale is distributed differently, and naively sharing one output branch would likely destabilize training; therefore, in this embodiment, a scale value is learned independently for each scale's features, and each scale's features undergo this scale transformation before entering the final position regression branch. The single-character classification branch is a pixel-level binary classification algorithm that outputs the probability that each pixel is character foreground. The pixel position regression branch outputs the distance from each pixel to the 4 borders of its character, and the information of the box's 4 corner points yields a more accurate edge detection result.
For the character relevance branch, outputting a relevance matrix between adjacent characters in the image features of each scale, and further generating an inter-character relevance relation response graph by taking the center point of the relevance matrix as an origin, wherein the inter-character relevance relation response graph is a Gaussian thermodynamic diagram in a specific embodiment.
Finally, the specific position information of each character in the picture to be detected and which characters belong to the same field can be obtained by carrying out post-processing on the output results of the single character branch and the character relevance branch. Obtaining pixel points with foreground probability values larger than a specified threshold (usually set to 0.5) based on the pixel classification chart, wherein the pixel points (represented by a set P) are regarded as position points with high possibility of character existence;
for each point in the set P, obtaining the distance information of the points from the character frame based on the pixel position regression graph, thereby determining the frame position of the character to which the point belongs, and representing all the character frame information by using a set B;
and (4) possibly, repeated character borders exist in the set B, and a non-maximum suppression technology is adopted to remove the borders with relatively small foreground probability values and more overlapping with the character borders with large probability values. Finally, only one most accurate frame information is output for each character.
After the frame of the character detected in each scale is obtained, the frame information can be mapped back to the original image according to the information such as the characteristic step length of the scale. For example, a feature step size of 2 for a certain scale, then the character coordinates at that scale would need to be multiplied by a factor of 2 accordingly. Based on the character association relation response graph, setting the pixel point with the response value smaller than the specified threshold (set to be 0.2) to be 0, and otherwise, setting the pixel point to be 1; and analyzing the connected domains of the response graph output by the step to obtain n connected domains. For each connected domain, its edge information, i.e., the bounding box coordinates of each field, is obtained.
After the position information of the single character and the position information of the field are obtained, assuming that the set of the single character information is C and the set of the field information is W, the field to which the single character belongs is also determined for each character. Traversing each field in the W set, calculating the degree of overlap of each character with the field, and if the degree of overlap is greater than a specified threshold (set to 0.8), indicating that the character matches the field successfully.
In a specific embodiment, the text detection method is applied to a container character analysis system and a layout analysis system of a document image, and the recall rate and the accuracy rate of character and field positioning results exceed 97%.
The text detection method in the application scene combines the traditional image processing technology and the deep learning technology to perform character detection on the shot image, has higher accuracy and robustness, and has stronger adaptability to low-quality images. Meanwhile, an effective multi-resolution network structure is designed, multi-scale features can be extracted from an input image to adapt to characters with different scales, and the method is suitable for texts with various shapes in natural scenes, such as horizontal texts and vertical texts. In addition, the method simultaneously outputs the coordinate information of a single character and the information of the field level, so that double branches are mutually promoted, and the output of the field level can greatly reduce the pressure of the subsequent information structuring process.
Further, in an embodiment, figs. 12(1), 12(2) and 12(3) show schematic diagrams of detection results of text detection performed and output on pictures in natural scenes, where the dotted boxes represent field detection results and the solid boxes represent character detection results.
It should be understood that although the various steps in the flow charts of figs. 2-8 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not bound to a strict order and may be performed in other orders. Moreover, at least some of the steps in figs. 2-8 may include multiple sub-steps or stages, which need not be performed at the same moment but may be performed at different moments, and need not proceed sequentially but may take turns or alternate with other steps or with sub-steps or stages of other steps.
In one embodiment, as shown in fig. 13, there is provided a text detection apparatus, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: an obtaining module 1310, a feature extraction module 1320, a character information determination module 1330, and a result integration module 1340, wherein:
an obtaining module 1310, configured to obtain a picture to be detected.
The feature extraction module 1320 is configured to extract image features of the image to be detected based on different scales.
The character information determining module 1330 is configured to determine, according to the picture features of the respective scales, character-related information of each character in the to-be-detected picture corresponding to the respective scales.
The result integrating module 1340 is configured to integrate the relevant information of each character corresponding to each scale to obtain a character detection result of each character in the picture to be detected.
According to the text detection apparatus, after the picture to be detected is obtained, its picture features are extracted at different scales, the character-related information in the picture at each scale is determined from the picture features of that scale, and finally the character-related information from all scales is integrated to obtain the character detection result for each character. By extracting picture features at multiple scales, the apparatus detects characters of different sizes in natural scenes and avoids the inaccurate text detection that inconsistent picture sizes and similar factors would otherwise cause.
In an embodiment, the character information determining module 1330 of the apparatus is specifically configured to: respectively determining the foreground classification result of each pixel point and the distance between each pixel point and the character frame corresponding to the pixel point aiming at the picture characteristics corresponding to any scale; wherein the character related information includes: the foreground classification result of each pixel point and the distance between each pixel point and the character frame where the pixel point is located.
In one embodiment, the above apparatus further comprises: the inter-character association information determining module is used for determining inter-character association information corresponding to each scale according to the picture characteristics of each scale; and the field detection result output module is used for determining the field detection result in the picture to be detected based on the associated information between the characters corresponding to all scales.
In one embodiment, the module for determining the association information between characters of the apparatus includes: the incidence matrix generating module is used for determining the incidence matrix between two adjacent characters corresponding to each scale according to the picture characteristics of each scale; the character incidence relation response graph generating module is used for generating corresponding character incidence relation response graphs respectively based on incidence matrixes between two adjacent characters under different scales; the inter-character association information includes a character association relationship response map.
In one embodiment, the result integration module 1340 includes: a scale selection unit, configured to select any unselected scale as the current scale; a foreground screening unit, configured to determine a target pixel point set belonging to the foreground according to the foreground classification results of the pixel points at the current scale; a potential frame screening unit, configured to determine potential frame pixel points from the target pixel point set according to the distance between each pixel point and the character frame in which it is located, and to obtain the potential character frame positions corresponding to the potential frame pixel points at the current scale; an accurate frame screening unit, configured to screen out the accurate character frame positions at the current scale from the potential character frame positions, and to return to the step of selecting any unselected scale as the current scale until all scales have been selected; and a mapping unit, configured to map the accurate character frame positions corresponding to each scale to the picture to be detected, to obtain the character detection result of each character in the picture to be detected.
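The per-scale integration loop described above could look like the following sketch. Using non-maximum suppression for the accurate-frame screening step is an assumption; this passage does not fix a specific screening criterion.

# Illustrative integration across scales: for each scale, collect foreground
# pixels, decode their distances into potential character frames, keep the
# accurate frames via non-maximum suppression (an assumption), and accumulate
# the results in original picture coordinates.
import numpy as np

def integrate_scales(per_scale_outputs, fg_threshold=0.5, nms_threshold=0.3):
    # per_scale_outputs: list of dicts with 'scale' (float),
    # 'foreground' (H, W) and 'distances' (H, W, 4) numpy arrays.
    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    results = []
    for out in per_scale_outputs:
        fg, dist, s = out["foreground"], out["distances"], out["scale"]
        rows, cols = np.nonzero(fg > fg_threshold)        # target pixel set
        candidates = []
        for r, c in zip(rows, cols):                      # potential frames
            left, top, right, bottom = dist[r, c]
            candidates.append((((c - left) / s, (r - top) / s,
                                (c + right) / s, (r + bottom) / s),
                               float(fg[r, c])))
        candidates.sort(key=lambda bc: bc[1], reverse=True)
        kept = []                                         # accurate frames
        for box, _ in candidates:
            if all(iou(box, k) < nms_threshold for k in kept):
                kept.append(box)
        results.extend(kept)
    return results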
In an embodiment, the field detection result output module includes: a conversion unit, configured to convert the character association response map of each scale into a black-and-white association response map according to a preset response threshold; a connected domain analysis unit, configured to perform connected domain analysis based on the black-and-white association response map to obtain the connected domain information corresponding to each scale; a frame position determining unit, configured to determine the field frame position information of each scale according to the connected domain information corresponding to that scale; and a field detection result output unit, configured to map the field frame position information of each scale to the picture to be detected, to obtain the field detection result in the picture to be detected.
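A minimal sketch of this thresholding and connected-domain step follows, using OpenCV's connected-component analysis; the particular response threshold value is an assumption for this example.

# Illustrative sketch: binarize the association response map with a preset
# threshold, then take each connected domain as one field region and map its
# bounding box back to the coordinates of the picture to be detected.
import numpy as np
import cv2

def fields_from_response_map(response, scale, response_threshold=0.5):
    binary = (response > response_threshold).astype(np.uint8) * 255
    count, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    field_boxes = []
    for i in range(1, count):  # label 0 is the background
        x, y = stats[i, cv2.CC_STAT_LEFT], stats[i, cv2.CC_STAT_TOP]
        w, h = stats[i, cv2.CC_STAT_WIDTH], stats[i, cv2.CC_STAT_HEIGHT]
        field_boxes.append((x / scale, y / scale,
                            (x + w) / scale, (y + h) / scale))
    return field_boxes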
In one embodiment, the above apparatus further includes: a correspondence output module, configured to match the detection result of each character in the picture to be detected with the detection result of each field, to obtain the correspondence between the characters and the fields of the picture to be detected.
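One simple matching criterion, sketched below, is to assign each character frame to the field frame that contains its center; center containment is an assumption for this example, and other overlap measures would serve equally well.

# Illustrative character-to-field matching: a character belongs to the field
# frame containing its center point (an assumed criterion).
def match_characters_to_fields(char_boxes, field_boxes):
    mapping = {}  # field index -> list of character indices
    for ci, (x0, y0, x1, y1) in enumerate(char_boxes):
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for fi, (fx0, fy0, fx1, fy1) in enumerate(field_boxes):
            if fx0 <= cx <= fx1 and fy0 <= cy <= fy1:
                mapping.setdefault(fi, []).append(ci)
                break
    return mapping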
For the specific limitations of the text detection apparatus, reference may be made to the limitations of the text detection method above, which are not repeated here. Each module in the text detection apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory in the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in fig. 14. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data such as character detection results and field detection results. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a text detection method.
Those skilled in the art will appreciate that the architecture shown in fig. 14 is merely a block diagram of part of the structure related to the present solution and does not limit the computer devices to which the present solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, and the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features is not contradictory, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that a person skilled in the art may make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A text detection method, the method comprising:
acquiring a picture to be detected;
extracting picture characteristics of the picture to be detected based on different scales;
determining, according to the picture features of each scale, the character-related information of each character in the picture to be detected corresponding to that scale; and
integrating the character-related information corresponding to each scale to obtain a character detection result of each character in the picture to be detected.
2. The text detection method according to claim 1, wherein the determining, according to the picture features of each scale, the character-related information of each character in the picture to be detected corresponding to that scale comprises:
for the picture features corresponding to any scale, respectively determining the foreground classification result of each pixel point and the distance between each pixel point and the character frame in which the pixel point is located; wherein the character-related information comprises: the foreground classification result of each pixel point and the distance between each pixel point and the character frame in which the pixel point is located.
3. The text detection method according to claim 2, wherein the integrating the character-related information corresponding to each scale to obtain a character detection result of each character in the picture to be detected comprises:
selecting any unselected scale as the current scale;
determining a target pixel point set belonging to the foreground according to the foreground classification results of the pixel points at the current scale;
determining potential frame pixel points from the target pixel point set according to the distance between each pixel point and the character frame in which the pixel point is located, and obtaining the potential character frame positions corresponding to the potential frame pixel points at the current scale;
screening out the accurate character frame positions at the current scale from the potential character frame positions, and returning to the step of selecting any unselected scale as the current scale until all scales have been selected; and
mapping the accurate character frame positions corresponding to each scale to the picture to be detected respectively, to obtain the character detection result of each character in the picture to be detected.
4. The text detection method according to claim 1, further comprising, after the extracting the picture features of the picture to be detected based on different scales:
determining the inter-character association information corresponding to each scale according to the picture features of that scale; and
determining a field detection result in the picture to be detected based on the inter-character association information corresponding to all scales.
5. The text detection method according to claim 4, wherein the determining the inter-character association information corresponding to each scale according to the picture features of that scale comprises:
determining, according to the picture features of each scale, the association matrix between two adjacent characters corresponding to that scale; and
generating the corresponding character association response maps based on the association matrices between two adjacent characters at different scales; wherein the inter-character association information comprises the character association response maps.
6. The text detection method according to claim 5, wherein the determining the field detection result in the picture to be detected based on the inter-character association information corresponding to each scale comprises:
converting the character association response map of each scale into a black-and-white association response map according to a preset response threshold;
performing connected domain analysis based on the black-and-white association response map to obtain the connected domain information corresponding to each scale;
determining the field frame position information of each scale according to the connected domain information corresponding to that scale; and
mapping the field frame position information of each scale to the picture to be detected, to obtain the field detection result in the picture to be detected.
7. The text detection method of claim 4, further comprising:
matching the detection result of each character in the picture to be detected with the detection result of each field, to obtain the correspondence between the characters and the fields of the picture to be detected.
8. A text detection apparatus, characterized in that the apparatus comprises:
an obtaining module, configured to obtain a picture to be detected;
a feature extraction module, configured to extract picture features of the picture to be detected based on different scales;
a character information determination module, configured to determine, according to the picture features of each scale, the character-related information of each character in the picture to be detected corresponding to that scale; and
a result integration module, configured to integrate the character-related information corresponding to each scale to obtain a character detection result of each character in the picture to be detected.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202011020108.9A 2020-09-25 2020-09-25 Text detection method, device, computer equipment and storage medium Active CN112101386B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011020108.9A CN112101386B (en) 2020-09-25 2020-09-25 Text detection method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011020108.9A CN112101386B (en) 2020-09-25 2020-09-25 Text detection method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112101386A true CN112101386A (en) 2020-12-18
CN112101386B CN112101386B (en) 2024-04-23

Family

ID=73756243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011020108.9A Active CN112101386B (en) 2020-09-25 2020-09-25 Text detection method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112101386B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140003723A1 (en) * 2012-06-27 2014-01-02 Agency For Science, Technology And Research Text Detection Devices and Text Detection Methods
JP2014085841A (en) * 2012-10-24 2014-05-12 Glory Ltd Character segmentation device, character segmentation method, and character recognition device
US9245205B1 (en) * 2013-10-16 2016-01-26 Xerox Corporation Supervised mid-level features for word image representation
US20150347860A1 (en) * 2014-05-30 2015-12-03 Apple Inc. Systems And Methods For Character Sequence Recognition With No Explicit Segmentation
CN108304835A (en) * 2018-01-30 2018-07-20 百度在线网络技术(北京)有限公司 character detecting method and device
US20190286896A1 (en) * 2018-03-15 2019-09-19 Sureprep, Llc System and method for automatic detection and verification of optical character recognition data
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN111476067A (en) * 2019-01-23 2020-07-31 腾讯科技(深圳)有限公司 Character recognition method and device for image, electronic equipment and readable storage medium
CN110895695A (en) * 2019-07-31 2020-03-20 上海海事大学 Deep learning network for character segmentation of text picture and segmentation method
KR102089298B1 (en) * 2019-10-21 2020-03-16 가천대학교 산학협력단 System and method for recognizing multinational license plate through generalized character sequence detection
CN110929727A (en) * 2020-02-12 2020-03-27 成都数联铭品科技有限公司 Image labeling method and device, character detection method and system and electronic equipment
CN111461114A (en) * 2020-03-03 2020-07-28 华南理工大学 Multi-scale feature pyramid text detection method based on segmentation
CN111340028A (en) * 2020-05-18 2020-06-26 创新奇智(北京)科技有限公司 Text positioning method and device, electronic equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712080A (en) * 2021-01-08 2021-04-27 北京匠数科技有限公司 Character recognition processing method for acquiring image by moving character screen
CN112381183A (en) * 2021-01-12 2021-02-19 北京易真学思教育科技有限公司 Target detection method and device, electronic equipment and storage medium
CN112418199A (en) * 2021-01-25 2021-02-26 北京明略昭辉科技有限公司 Multi-modal information extraction method and device, electronic equipment and storage medium
CN112418199B (en) * 2021-01-25 2022-03-01 北京明略昭辉科技有限公司 Multi-modal information extraction method and device, electronic equipment and storage medium
CN113269102A (en) * 2021-05-28 2021-08-17 中邮信息科技(北京)有限公司 Seal information identification method and device, computer equipment and storage medium
CN113989814A (en) * 2021-11-23 2022-01-28 腾讯科技(深圳)有限公司 Image generation method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112101386B (en) 2024-04-23

Similar Documents

Publication Publication Date Title
CN111080628B (en) Image tampering detection method, apparatus, computer device and storage medium
CN112101386B (en) Text detection method, device, computer equipment and storage medium
WO2019218824A1 (en) Method for acquiring motion track and device thereof, storage medium, and terminal
KR101896357B1 (en) Method, device and program for detecting an object
US9036905B2 (en) Training classifiers for deblurring images
CN108986152B (en) Foreign matter detection method and device based on difference image
EP4085369A1 (en) Forgery detection of face image
CN109816694B (en) Target tracking method and device and electronic equipment
CN111680690A (en) Character recognition method and device
CN112561879B (en) Ambiguity evaluation model training method, image ambiguity evaluation method and image ambiguity evaluation device
CN113469092B (en) Character recognition model generation method, device, computer equipment and storage medium
US20210390282A1 (en) Training data increment method, electronic apparatus and computer-readable medium
CN114444565B (en) Image tampering detection method, terminal equipment and storage medium
CN114511041A (en) Model training method, image processing method, device, equipment and storage medium
CN110942456A (en) Tampered image detection method, device, equipment and storage medium
CN110210480A (en) Character recognition method, device, electronic equipment and computer readable storage medium
CN108875500B (en) Pedestrian re-identification method, device and system and storage medium
CN110516731B (en) Visual odometer feature point detection method and system based on deep learning
CN111832561A (en) Character sequence recognition method, device, equipment and medium based on computer vision
Sureshkumar et al. Deep learning framework for component identification
CN108875501B (en) Human body attribute identification method, device, system and storage medium
CN113496223B (en) Method and device for establishing text region detection model
Belhedi et al. Adaptive scene‐text binarisation on images captured by smartphones
CN112883827A (en) Method and device for identifying designated target in image, electronic equipment and storage medium
CN116798041A (en) Image recognition method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant