CN111738069A - Face detection method and device, electronic equipment and storage medium
- Publication number: CN111738069A
- Application number: CN202010404206.6A
- Authority: CN (China)
- Prior art keywords: feature map, face, face detection, original, image
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/161 Human faces: Detection; Localisation; Normalisation
- G06F18/253 Pattern recognition: Fusion techniques of extracted features
- G06N3/045 Neural networks: Combinations of networks
- G06N3/08 Neural networks: Learning methods
- G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
- G06V40/168 Human faces: Feature extraction; Face representation
- G06V40/172 Human faces: Classification, e.g. identification
Abstract
The application discloses a face detection method and device, an electronic device and a storage medium. The method comprises: generating feature maps of a face detection image, the feature maps comprising original feature maps at a plurality of scales and fused feature maps corresponding to the original feature maps; determining a face position offset according to the original feature maps; performing variable convolution on each fused feature map according to the face position offset to obtain a variable convolution feature map of each fused feature map; and determining a face detection result based on the variable convolution feature maps. By applying variable convolution driven by the face position offset obtained from the multi-scale original feature maps, the expressive power of face features in the fused feature maps is improved, the face detection accuracy and the performance of the face detection model are significantly improved, the calculation method is simple, the amount of computation is small, the computational efficiency is high, the applicable scenarios of face detection are greatly expanded, and the method is particularly suitable for multi-target face detection.
Description
Technical Field
The present application relates to the field of image recognition, and in particular, to a method and an apparatus for face detection, an electronic device, and a storage medium.
Background
Face detection refers to locating the position and size of faces in a picture, and is the basis of intelligent face analysis such as face recognition and face attribute estimation (age, gender and the like). In recent years deep learning has driven progress in this field; existing methods can be roughly classified into three categories: cascade methods, two-stage methods and single-stage methods.
Cascade methods generally use a plurality of models to gradually filter out non-face samples and refine the face positions, but their speed is unstable and becomes slow for pictures containing many faces.
Two-stage methods generate candidate regions in the first stage and then classify and regress the candidate regions in the second stage; they achieve high detection accuracy, but suffer from the serious drawback of slow speed.
Single-stage methods locate faces directly through classification and regression. For example, the Selective Refinement Network method proposed by the Chinese Academy of Sciences improves the balance between the numbers of positive and negative samples by filtering out easily classified negative samples on small-scale faces, and iteratively refines the face positions on large-scale faces, improving the localization accuracy of large-scale faces.
Disclosure of Invention
In view of the above, the present application is proposed to provide a face detection method, apparatus, electronic device and storage medium that overcome or at least partially solve the above-mentioned problems.
According to an aspect of the present application, there is provided a face detection method, including:
generating a feature map of the face detection image, wherein the feature map comprises original feature maps with a plurality of scales and fusion feature maps corresponding to the original feature maps;
determining face position offset according to the original feature map;
performing variable convolution on each fused feature image according to the face position offset to obtain a variable convolution feature image of each fused feature image;
and determining a face detection result based on the variable convolution characteristic graph.
Optionally, in the above method, generating a feature map of the face detection image includes:
generating an original feature map of the face detection image in a bottom-up and down-sampling mode;
a fused feature map corresponding to each original feature map is generated in a top-down and down-sampling manner.
Optionally, in the above method, generating the fused feature map corresponding to each original feature map in a top-down and up-sampling manner includes:
and determining the weight of the target fusion feature map according to the channel number of the original feature map corresponding to the target fusion feature map and the channel number of the upper fusion feature map of the target fusion feature map.
Optionally, in the above method, determining the face position offset according to the original feature map includes:
performing anchor point frame regression on each original feature map respectively;
and determining the face position offset according to the anchor point frame regression result of each original feature map and the anchor point frame corresponding to each anchor point frame regression result.
Optionally, in the above method, determining a face detection result based on the variable convolution feature map includes:
respectively carrying out anchor point frame classification and anchor point frame regression on each original feature map;
according to the anchor point frame classification result of the first type of original feature graph, performing anchor point frame classification on the first type of fusion feature graph corresponding to the first type of original feature graph;
performing anchor point frame regression on the second type of fusion feature map corresponding to the second type of original feature map according to the anchor point frame regression result of the second type of original feature map;
determining a face detection result according to an anchor point frame regression result of the first type of original feature map, an anchor point frame classification result of the first type of fusion feature map, an anchor point frame classification result of the second type of original feature map and an anchor point frame regression result of the second type of fusion feature map;
wherein, the first kind of original feature map is a lower layer feature map of the second kind of original feature map.
Optionally, the method further includes:
and respectively carrying out receptive field enhancement treatment on the original characteristic diagram and the fused characteristic diagram.
Optionally, the method is implemented based on a face detection model, and the face detection model is obtained by training in the following manner:
inputting the training image into a face detection model to obtain a face detection result;
calculating multiple types of loss function values according to the labeling information of the training image and the face detection result, wherein the loss functions comprise at least one of the following: a face classification loss function, a face position loss function, a key point loss function and a face segmentation loss function;
and updating the parameters of the face detection model according to the loss function values.
According to still another aspect of the present application, there is provided a face detection apparatus, including:
the characteristic image generating unit is used for generating a characteristic image of the face detection image, wherein the characteristic image comprises original characteristic images with a plurality of scales and fused characteristic images corresponding to the original characteristic images;
the characteristic image processing unit is used for determining the position offset of the face according to the original characteristic image; performing variable convolution on each fused feature image according to the face position offset to obtain a variable convolution feature image of each fused feature image;
and the detection unit is used for determining a face detection result based on the variable convolution characteristic diagram.
Optionally, in the above apparatus, the feature map generating unit is configured to generate an original feature map of the face detection image in a bottom-up and down-sampling manner; a fused feature map corresponding to each original feature map is generated in a top-down and down-sampling manner.
Optionally, in the apparatus, the feature map generating unit is configured to determine the weight of the target fusion feature map according to the number of channels of the original feature map corresponding to the target fusion feature map and the number of channels of the upper-layer fusion feature map of the target fusion feature map.
Optionally, in the apparatus, the feature map processing unit is configured to perform anchor point frame regression on each original feature map; and the face position deviation determining module is used for determining the face position deviation according to the anchor point frame regression result of each original feature map and the anchor point frame corresponding to each anchor point frame regression result.
Optionally, in the apparatus, the detection unit is configured to perform anchor point frame classification and anchor point frame regression on each original feature map respectively; the anchor point frame classification device is used for classifying the anchor point frames of the first type of fusion characteristic graphs corresponding to the first type of original characteristic graphs according to the anchor point frame classification result of the first type of original characteristic graphs; the anchor point frame regression module is used for performing anchor point frame regression on the second type of fusion feature graph corresponding to the second type of original feature graph according to the anchor point frame regression result of the second type of original feature graph; the face detection device is used for determining a face detection result according to an anchor point frame regression result of the first type of original feature map, an anchor point frame classification result of the first type of fusion feature map, an anchor point frame classification result of the second type of original feature map and an anchor point frame regression result of the second type of fusion feature map; wherein, the first kind of original feature map is a lower layer feature map of the second kind of original feature map.
Optionally, in the above apparatus: and the characteristic map processing unit is also used for respectively carrying out receptive field enhancement processing on the original characteristic map and the fused characteristic map.
Optionally, the apparatus is implemented based on a face detection model, and the face detection model is obtained by training in the following manner: inputting the training image into a face detection model to obtain a face detection result; calculating multiple types of loss function values according to the labeling information of the training image and the face detection result, wherein the loss functions comprise at least one of the following: a face classification loss function, a face position loss function, a key point loss function and a face segmentation loss function; and updating the parameters of the face detection model according to the loss function values.
In accordance with yet another aspect of the present application, there is provided an electronic device, wherein the electronic device includes: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform a method as any one of above.
According to yet another aspect of the application, a computer readable storage medium is provided, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method as any of the above.
According to the technical scheme of the present application, feature maps of the face detection image are generated, comprising original feature maps at multiple scales and fused feature maps corresponding to the original feature maps; the face position offset is determined according to the original feature maps; variable convolution is performed on each fused feature map according to the face position offset to obtain the variable convolution feature map of each fused feature map; and the face detection result is determined based on the variable convolution feature maps. By applying variable convolution driven by the face position offset obtained from the multi-scale original feature maps, the expressive power of face features in the fused feature maps is improved, the face detection accuracy and the performance of the face detection model are significantly improved, the calculation method is simple, the amount of computation is small, the computational efficiency is high, the applicable scenarios of face detection are greatly expanded, and the method is particularly suitable for multi-target face detection.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a schematic flow diagram of a face detection method according to an embodiment of the present application;
FIG. 2 illustrates a flow diagram for determining face detection results based on a variable convolution feature map according to an embodiment of the present application;
FIG. 3 is a flow chart diagram illustrating a face detection method according to another embodiment of the present application;
FIG. 4 is a schematic structural diagram of a face detection apparatus according to an embodiment of the present application;
FIG. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 6 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The idea of the present application is as follows. The single-stage two-step method does not fully exploit the localization information output by its first step to improve the features used in the second step, and it also ignores the per-channel weights when fusing the bottom-up and top-down features. To address this, the preliminary face position information obtained from the original feature maps is used, based on variable convolution, to correct the fused feature maps corresponding to the original feature maps, and different weights are given to different features, so as to obtain a more accurate face detection result, thereby effectively solving the problems raised in the background art. This is illustrated by the examples below.
Fig. 1 shows a schematic flow chart of a face detection method according to an embodiment of the present application, where the method includes:
step S110: and generating a feature map of the face detection image, wherein the feature map comprises original feature maps with a plurality of scales and fused feature maps corresponding to the original feature maps.
Face detection techniques are used in more and more fields, such as face beautification and photo album classification. The face detection method provided by the present application has a high detection accuracy, a small amount of computation, and a high computation speed.
Firstly, generating a feature map of a face detection image, wherein the feature map comprises original feature maps with a plurality of scales and fused feature maps corresponding to the original feature maps.
In the present application, image features are described at multiple scales. Multi-scale representation describes the target structure within a certain range of scales; its basic idea is to embed the original signal into a family of signals parameterized by scale, so that the structure of the signal at coarser scales is a simplified version of the signal at finer scales. Multi-scale representations include, but are not limited to, image pyramid representation and scale-space representation.
In this application, the image features referred to in the original feature map include, but are not limited to, color features, shape features, edge features, texture features, spatial relationship features, and the like.
When the feature extraction is carried out, the image can be automatically segmented to divide an object or color area contained in the image, and then the image feature is extracted according to the areas; it is also possible to simply divide the image evenly into regular sub-blocks and then extract features for each image sub-block.
The algorithm for feature extraction may be any one of the prior art, such as: a Haar (Haar) feature extraction method, an LBP (Local Binary Pattern) feature extraction method, a SIFT (Scale-invariant feature transform) feature extraction method, and the like, which can be specifically implemented by using a machine learning model, and the structure of the machine learning model can be a Convolutional Neural Network (CNN) and the like. Here, a common convolutional neural network structure ResNet is taken as an example, and a brief explanation is given.
The main idea of ResNet is to add direct shortcut connections to the network, i.e. the idea of the highway network. Before ResNet appeared, a network layer's output was a non-linear transformation of its input, whereas the highway network allows a certain proportion of the output of the previous network layer to be preserved. As a result, a layer of the neural network only has to learn the residual with respect to the output of the previous network rather than the whole output. Traditional convolutional or fully connected networks more or less suffer from information loss during information transmission, and deep networks may fail to train because of gradient vanishing or gradient explosion. ResNet alleviates these problems to a certain extent: the input information is directly bypassed to the output, which protects the integrity of the information, and the whole network only needs to learn the difference between input and output, simplifying the learning goal and difficulty.
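As a purely illustrative sketch of the residual idea described above, and not the specific backbone of this application, a minimal residual block could look as follows in PyTorch-style Python; the channel count and layer layout are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: with the identity shortcut, the layer only has
    to learn the residual between its input and the desired output."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                       # direct shortcut preserves the input information
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + identity)   # only the residual has to be learned

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```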
And after the original feature maps of a plurality of scales are obtained, fusing the features of the original feature maps to obtain fused feature maps corresponding to the original feature maps. Feature extraction is the extraction process of semantic features from a low level to a high level. For example, for the extraction of the face features, the features extracted at the lower layer of the network are only some contour features, the features extracted deeply along with the network may be higher semantic features such as eyes and noses, and the features of the whole face contour may be extracted until the network reaches the deepest layer.
However, as the network grows deeper, each layer loses some information, and by the last layer more information has been lost. To address this, the features of the original feature maps at multiple scales can be fused to obtain a fused feature map corresponding to each original feature map. The basic idea is to add the original feature map of the upper layer to the original feature map of the current layer before the convolution operation of this layer, so that some information from the upper layer is retained and the information loss of this layer is reduced.
The algorithm for feature fusion may employ any existing technique, including but not limited to: algorithms based on Bayesian decision theory, algorithms based on sparse representation theory, and algorithms based on deep learning theory. For example, a feature fusion algorithm based on sparse representation theory extracts multiple features from a sample and then builds a joint sparse feature matrix, which is the result of multi-feature fusion; the method fuses the dictionaries of the different types of features. As another example, a feature fusion algorithm based on deep learning theory fuses the features obtained by several neural networks to obtain a fused feature.
Step S120: And determining the face position offset according to the original feature map.
The present application uses variable convolution to correct the fused feature map corresponding to each original feature map. Compared with standard convolution, variable convolution adds an offset variable, so that the sampling positions of the convolution have a certain amount of adjustment room around the regular grid points.
The face position offset may also be called the face position shift. Taking anchor point frame sampling as an example, the face position offset can be understood as the offset between the result of face position regression and the absolute position of the original anchor point frame in the original image. Concretely, suppose the regressed face box (label) has upper-left corner (x_lu, y_lu) and lower-right corner (x_rd, y_rd), and the absolute coordinates of the original anchor point frame in the original image are (x1, y1) and (x2, y2); the horizontal face position offset of the upper-left corner is then defined as (x1 - x_lu)/(x2 - x1), which is normalized data, and the offsets of the other upper-left and lower-right coordinates are obtained in the same way. In simpler terms, the offset expresses by what fraction of the original anchor point frame's width and height the x and y coordinates of the upper-left and lower-right corners are shifted.
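The normalized offset described above can be illustrated with a short sketch; the box format, names and exact sign convention follow the style of the definition in the preceding paragraph and are otherwise assumptions.

```python
def face_position_offset(anchor, regressed):
    """Normalized face position offset between the original anchor point frame and
    the regressed face box, both given as (x1, y1, x2, y2) corner coordinates,
    following the (x1 - x_lu)/(x2 - x1) style definition in the text above."""
    ax1, ay1, ax2, ay2 = anchor
    rx1, ry1, rx2, ry2 = regressed
    w, h = ax2 - ax1, ay2 - ay1
    return (
        (ax1 - rx1) / w, (ay1 - ry1) / h,   # upper-left corner offsets
        (ax2 - rx2) / w, (ay2 - ry2) / h,   # lower-right corner offsets
    )

# example: a 40x40 anchor whose regressed box is shifted 4 px right and down
print(face_position_offset((10, 10, 50, 50), (14, 14, 54, 54)))  # (-0.1, -0.1, -0.1, -0.1)
```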
When the network regresses the face position offset from an original feature map, preliminarily selected images that contain a face (positive samples) or part of a face (partial-face samples) are generally used, because performing the regression on background images that contain no face (negative samples) would make the regression result inaccurate; for images that partially contain a face, the network can regress from local features such as the nose, eyes and ears.
Step S130: and performing variable convolution on each fused feature image according to the face position offset to obtain the variable convolution feature image of each fused feature image.
Variable convolution is also called deformable convolution. A standard convolutional network uses modules with a fixed geometric structure, so its ability to model geometric transformations is inherently limited; the reason for this limited ability to adapt to geometric deformation is the regular grid sampling in standard convolution. To weaken this limitation, an offset variable is added to the position of each sampling point in the convolution kernel; with these variables, the kernel can sample freely around the current position instead of being confined to the regular grid. The convolution operation extended in this way is called deformable convolution, and this is the biggest difference between deformable convolution and a standard convolutional network.
In the application, the obtained face position offset is used as the additional offset variable that the variable convolution introduces relative to standard convolution, so that the size and position of the deformable convolution kernel can be dynamically adjusted according to the image content currently being recognized. The visual effect is that the sampling point positions of convolution kernels at different locations change adaptively with the image content, adapting to geometric deformations such as the shapes and sizes of different objects.
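For illustration, assuming torchvision's deform_conv2d operator, a fused feature map could be processed with an offset field roughly as in the following sketch; the tensor shapes and the way the offsets are produced are assumptions, not the exact implementation of the application.

```python
import torch
from torchvision.ops import deform_conv2d   # deformable convolution operator

# a fused feature map Pn and an offset field derived from the face position
# offsets; all shapes here are illustrative assumptions
fused = torch.randn(1, 256, 40, 40)          # N, C, H, W
offsets = torch.randn(1, 2 * 3 * 3, 40, 40)  # two offset values per tap of a 3x3 kernel
weight = torch.randn(256, 256, 3, 3)         # 3x3 deformable convolution kernel

variable_conv_feature = deform_conv2d(fused, offsets, weight, padding=1)
print(variable_conv_feature.shape)           # torch.Size([1, 256, 40, 40])
```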
Step S140: and determining a face detection result based on the variable convolution characteristic graph.
After the variable convolution is adopted to process each fusion feature map according to the face position offset, a variable convolution feature map is obtained, in the map, the information of the face is fully described, a face detection result can be determined according to the map, and the face detection result can include but is not limited to the information of the size, the position and the like of the face.
As can be seen from the method shown in fig. 1, the expressive power of face features in the fused feature maps is improved by variable convolution driven by the face position offset obtained from the multi-scale original feature maps; the face detection accuracy and the performance of the face detection model are significantly improved; the calculation method is simple, the amount of computation is small, and the computation is efficient; the applicable scenarios of face detection are greatly expanded; and the method is particularly suitable for multi-target face detection, showing good results in parent-child image detection experiments.
In an embodiment of the present application, in the above method, generating the feature map of the face detection image includes: generating an original feature map of the face detection image in a bottom-up and down-sampling mode; a fused feature map corresponding to each original feature map is generated in a top-down and down-sampling manner.
In this embodiment, pyramid image representation is recommended for the multi-scale representation of image features: an image pyramid is generated from the original face detection image. The image pyramid is a structure that represents an image at multiple resolutions, generally built by powers of two, 2^n (n = 0, 1, 2, ...): the bottom level corresponds to the original image, the next level is obtained by averaging blocks of 2×2 neighbouring pixels, and so on, forming a multi-level pyramid in which the total number of pixels at each level 2^i is a quarter of that at the previous level 2^(i-1). The number of pyramid levels can be chosen according to the image resolution, the possible noise in the image, the image size and other relevant factors. In the process of generating the original feature maps of the face detection image, original feature maps of larger scale can be generated in a bottom-up mode. Referring to FIG. 2, a schematic flow chart of determining the face detection result based on the variable convolution feature maps according to this embodiment, the nth-layer (n is a natural number) original feature map is denoted Cn; C3, C4 and C5 are sampled in a bottom-up mode, so the number of pixels of each layer is continuously reduced and the amount of computation can be greatly reduced. When generating the original feature maps of smaller scale (C6, C7), a down-sampling mode can be adopted so as not to lose precision; since feature maps of smaller scale do not require a large amount of computing resources, both the computation speed and the detection accuracy are guaranteed.
Similarly, when generating the fused feature map corresponding to each original feature map, the fused feature map corresponding to each original feature map may be generated in a top-down manner when the scale is large (P3, P4, P5), and in a down-sampling manner when the scale is small (P6, P7).
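The bottom-up/top-down construction of C3-C7 and P3-P7 described above could be sketched as follows in PyTorch-style Python; the per-channel weighting of formula (1) below is omitted here for brevity, and the uniform channel count, lateral convolutions and stride-2 convolutions for the small-scale maps are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Illustrative sketch: C3-C5 come bottom-up from a backbone, C6/C7 are
    obtained by further down-sampling, P3-P5 are built top-down with
    up-sampling, and P6/P7 by down-sampling again."""
    def __init__(self, ch: int = 256):
        super().__init__()
        self.down_c6 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # C5 -> C6
        self.down_c7 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # C6 -> C7
        self.lateral = nn.ModuleList([nn.Conv2d(ch, ch, 1) for _ in range(3)])
        self.smooth = nn.ModuleList([nn.Conv2d(ch, ch, 3, padding=1) for _ in range(3)])
        self.down_p6 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # P5 -> P6
        self.down_p7 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # P6 -> P7

    def forward(self, c3, c4, c5):
        c6 = self.down_c6(c5)
        c7 = self.down_c7(c6)
        # top-down path with up-sampling for the large-scale fused maps
        p5 = self.smooth[2](self.lateral[2](c5))
        p4 = self.smooth[1](self.lateral[1](c4) + F.interpolate(p5, scale_factor=2))
        p3 = self.smooth[0](self.lateral[0](c3) + F.interpolate(p4, scale_factor=2))
        # down-sampling for the small-scale fused maps
        p6 = self.down_p6(p5)
        p7 = self.down_p7(p6)
        return (c3, c4, c5, c6, c7), (p3, p4, p5, p6, p7)
```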
In one embodiment of the present application, in the above method, generating the fused feature map corresponding to each original feature map in a top-down and up-sampling manner includes: and determining the weight of the target fusion feature map according to the channel number of the original feature map corresponding to the target fusion feature map and the channel number of the upper fusion feature map of the target fusion feature map.
In the prior art, the weight of the channel of each feature is not considered in feature fusion, and different weights are given to the channels of each feature in the embodiment. The method for giving weight recommended by the embodiment is as follows: and determining the weight of the target fusion feature map according to the channel number of the original feature map corresponding to the target fusion feature map and the channel number of the upper fusion feature map of the target fusion feature map.
Let the original feature map of the nth layer (n is a natural number) be Cn, and the corresponding fused feature map be Pn, taking P4 as an example, the calculation method is shown in formula 1:
P4 = Conv(W_c4 * Conv(C4) + W_p4 * Upsample(P5))    (1)
where W_c4 is a vector whose number of elements equals the number of channels of the Conv(C4) feature, and the number of elements of W_p4 equals the number of channels of Upsample(P5). When the technical scheme of the application is implemented with a face detection model, W_c4 and W_p4 can be learned in the training stage of the model; all of their element values are greater than 0, and the corresponding elements of W_c4 and W_p4 sum to 1.
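A minimal sketch of formula (1) with learnable per-channel weights is given below; enforcing the positivity and sum-to-one constraint with a softmax over the two branches is an assumption about one possible way to realize the constraint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Sketch of formula (1): P4 = Conv(W_c4 * Conv(C4) + W_p4 * Upsample(P5)),
    with one learnable weight per channel for each branch."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels, 1)           # Conv(C4)
        self.out = nn.Conv2d(channels, channels, 3, padding=1)   # outer Conv
        self.logits = nn.Parameter(torch.zeros(2, channels))     # raw weights for W_c4, W_p4

    def forward(self, c4, p5):
        w = torch.softmax(self.logits, dim=0)        # W_c4, W_p4 > 0 and W_c4 + W_p4 = 1
        w_c4 = w[0].view(1, -1, 1, 1)
        w_p4 = w[1].view(1, -1, 1, 1)
        up = F.interpolate(p5, size=c4.shape[-2:])   # Upsample(P5) to the size of C4
        return self.out(w_c4 * self.reduce(c4) + w_p4 * up)
```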
In an embodiment of the present application, in the method, determining the face position offset according to the original feature map includes: performing anchor point frame regression on each original feature map respectively; and determining the face position offset according to the anchor point frame regression result of each original feature map and the anchor point frame corresponding to each anchor point frame regression result.
The detection mode of laying the anchor frame (anchor) can quickly and accurately detect the target in the image, and in the embodiment, the anchor frame is adopted to confirm the primary position and the offset of the human face. Specifically, anchor point frames are laid on the original feature maps, for example, in a sliding window manner, then, anchor point frame regression is performed on the original feature maps respectively, anchor point frame regression results according to the original feature maps are obtained, the results include, but are not limited to, coordinates of two corners or four corners of the anchor point frames, and the face position deviation can be obtained according to the coordinates of the anchor point frames after regression and the coordinates of the anchor point frames before regression corresponding to the coordinates of the anchor point frames.
In an embodiment of the present application, in the above method, determining a face detection result based on the variable convolution feature map includes: respectively carrying out anchor point frame classification and anchor point frame regression on each original feature map; according to the anchor point frame classification result of the first type of original feature graph, performing anchor point frame classification on the first type of fusion feature graph corresponding to the first type of original feature graph; performing anchor point frame regression on the second type of fusion feature map corresponding to the second type of original feature map according to the anchor point frame regression result of the second type of original feature map; determining a face detection result according to an anchor point frame regression result of the first type of original feature map, an anchor point frame classification result of the first type of fusion feature map, an anchor point frame classification result of the second type of original feature map and an anchor point frame regression result of the second type of fusion feature map; wherein, the first kind of original feature map is a lower layer feature map of the second kind of original feature map.
As shown in fig. 2, which is a schematic flow chart of determining a face detection result based on a variable convolution feature map according to this embodiment, the determination of the face detection result is divided into 2 main steps, where reference numeral 1 is a classification step and reference numeral 2 is a regression step. The original feature maps of all scales are divided into first-class original feature maps (C3, C4 and C5) and second-class original feature maps (C6 and C7), the first-class original feature maps correspond to a classification step, and the second-class original feature maps correspond to a regression step; the fused feature maps corresponding to the first type of original feature maps are first type fused feature maps (P3, P4 and P5), and the fused feature maps corresponding to the second type of original feature maps are second type fused feature maps (P6 and P7). The first type of original feature map is a lower-layer feature map of the second type of original feature map, namely C3, C4 and C5 are lower-layer images of C6 and C7, and P3, P4 and P5 are lower-layer images of P6 and P7.
Firstly, the original feature maps are respectively subjected to anchor block classification and anchor block regression, namely, each layer C3-C7 is independently subjected to anchor block classification and regression.
Then, according to the anchor point frame classification results of the first type of original feature maps, anchor point frame classification is performed on the first type of fused feature maps: since C3, C4 and C5 are lower-layer maps and contain more negative samples (background regions without face information), the first type of fused feature maps P3, P4 and P5 are classified a second time. Because the main purpose of the lower-layer feature maps is to detect small-scale faces in the face detection image, and performing regression on P3, P4 and P5 at this point brings little benefit, P3, P4 and P5 are only classified; this screens out a large number of negative samples, greatly saves computing resources, improves computational efficiency, and still guarantees detection accuracy.
Anchor point frame regression is performed on the second type of fused feature maps according to the anchor point frame regression results of the second type of original feature maps: since C6 and C7 are upper-layer feature maps, regressing again through P6 and P7 makes the detection result more accurate. Because the main purpose of the upper-layer feature maps is to detect large-scale faces in the face detection image, and classification on C6 and C7 already gives sufficiently accurate results, no further classification through P6 and P7 is needed.
Finally, determining a face detection result according to the anchor point frame regression results (regression results of C3, C4 and C5) of the first type of original feature maps, the anchor point frame classification results (classification results of P3, P4 and P5) of the first type of fused feature maps, the anchor point frame classification results (classification results of C6 and C7) of the second type of original feature maps and the anchor point frame regression results (regression results of P6 and P7) of the second type of fused feature maps.
Therefore, the anchor point frame classification is selectively carried out on the lower layer feature diagram, the anchor point frame regression is selectively carried out on the upper layer feature diagram, the computing resource can be saved, and the computing efficiency can be improved. Of course, it is also possible to perform anchor point frame regression again on P3-P5, perform anchor point frame classification again on P6-P7, and finally determine the face detection result directly from the anchor point frame classification result and the anchor point frame regression result obtained from P3-P7.
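The selective two-step scheme described above can be summarized in pseudocode-like Python; the `heads` container and per-level indexing are hypothetical names introduced only for illustration.

```python
def two_step_detect(original_maps, fused_maps, heads):
    """Sketch of the selective two-step scheme: every original map Cn gets a
    first-step classification and regression; the lower fused maps P3-P5 are
    only re-classified, the upper fused maps P6-P7 are only re-regressed."""
    results = []
    for i, (c, p) in enumerate(zip(original_maps, fused_maps)):   # i = 0..4 for levels 3..7
        cls1, reg1 = heads.cls1[i](c), heads.reg1[i](c)
        if i < 3:                        # lower levels: C3-C5 / P3-P5
            cls2 = heads.cls2[i](p)      # second-step classification filters negatives
            results.append((cls2, reg1))
        else:                            # upper levels: C6, C7 / P6, P7
            reg2 = heads.reg2[i](p)      # second-step regression refines large faces
            results.append((cls1, reg2))
    return results
```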
In one embodiment of the present application, the method described above further comprises: and respectively carrying out receptive field enhancement treatment on the original characteristic diagram and the fused characteristic diagram.
The receptive field is defined as the size of the area mapped by the pixel points on the feature map output by each layer of the convolutional neural network on the original input image, that is, the receptive field represents the range area of a specific neural network feature in the input space, including the position of the feature (the central position of the receptive field) and the size of the area (the size of the receptive field), so that the receptive field of a feature can be described by using the central position of the area and the size of the feature.
The large receptive field is helpful for learning long-distance spatial position relationship (long-range spatial relationship), establishing an implicit spatial model (implicit spatial model), and the like, so that the performance of the embodiment is improved by respectively performing receptive field enhancement processing on the original feature map and the fused feature map.
The method for enlarging the receptive field is not limited in the present application; any one or a combination of several existing techniques may be adopted, including but not limited to adding pooling layers, increasing the kernel size of the convolution kernels, increasing the number of convolution layers, and so on.
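One common way to enlarge the receptive field, given here only as an illustrative sketch and not as the specific module of this application, is to run parallel branches with different kernel sizes and dilations and fuse them:

```python
import torch
import torch.nn as nn

class ReceptiveFieldEnhance(nn.Module):
    """Illustrative receptive-field enhancement: parallel branches with different
    kernel sizes and dilations are concatenated and reduced back to the original
    channel count."""
    def __init__(self, channels: int = 256):
        super().__init__()
        branch = channels // 4
        self.b1 = nn.Conv2d(channels, branch, 1)
        self.b2 = nn.Conv2d(channels, branch, 3, padding=1)
        self.b3 = nn.Conv2d(channels, branch, 3, padding=2, dilation=2)   # larger receptive field
        self.b4 = nn.Conv2d(channels, branch, 3, padding=3, dilation=3)   # even larger
        self.fuse = nn.Conv2d(4 * branch, channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1))
```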
In an embodiment of the present application, the above method is implemented based on a face detection model, and the face detection model is obtained by training in the following manner: inputting the training image into the face detection model to obtain a face detection result; calculating multiple types of loss function values according to the labeling information of the training image and the face detection result, wherein the loss functions comprise at least one of the following: a face classification loss function, a face position loss function, a key point loss function and a face segmentation loss function; and updating the parameters of the face detection model according to the loss function values.
Model training of the face detection model mainly comprises the following steps: the training image is input into the face detection model to obtain the face detection result for the image; the training image also carries a label.
And calculating a loss value according to the label of the training image and the face detection result, and finally updating the parameters of the face detection model according to the loss value.
The loss functions involved in this embodiment are one or a combination of the face classification loss function (Focal Loss), the face position loss function (Complete IoU Loss), the key point loss function (Landmark Loss) and the face segmentation loss function (Segmentation Loss).
Specifically, the face classification loss value is calculated from the face classification result through the face classification loss function; the face segmentation loss value is calculated from the foreground/background classification result through the face segmentation loss function; the face position loss value is calculated from the face position prediction result through the face position loss function; and the key point loss value is calculated from the key point position regression result through the key point loss function. The specific algorithms are not repeated here.
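The combination of the four loss terms can be sketched as below; the dictionary keys, the individual loss implementations and the equal weighting are assumptions, not the patented formulation.

```python
def total_loss(pred, target, focal_loss, box_loss, landmark_loss, segmentation_loss,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Sketch of combining the four loss terms; the individual loss functions are
    passed in as callables and the weighting is a free choice."""
    w_cls, w_box, w_lmk, w_seg = weights
    return (w_cls * focal_loss(pred["cls"], target["cls"])                    # face classification loss
            + w_box * box_loss(pred["box"], target["box"])                    # face position loss
            + w_lmk * landmark_loss(pred["landmarks"], target["landmarks"])   # key point loss
            + w_seg * segmentation_loss(pred["mask"], target["mask"]))        # face segmentation loss
```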
Fig. 3 shows a schematic flow chart of a face detection method according to another embodiment of the present application.
Firstly, generating a feature map of a face detection image, wherein the feature map comprises original feature maps (C3-C7) with a plurality of scales and fused feature maps (P3-P7) corresponding to the original feature maps. Wherein, the meanings of C3-C7 and P3-P7 are the same as the above, and are not repeated.
And determining the weight of the target fusion feature map according to the channel number of the original feature map corresponding to the target fusion feature map and the channel number of the upper fusion feature map of the target fusion feature map.
Receptive field enhancement processing is performed on each original feature map (C3-C7), and first-step classification and first-step regression are performed on the original feature maps to obtain the first-step classification results and first-step regression results of C3-C7.
The face position offset is determined according to the first-step regression results of C3-C7 and the positions of the anchor point frames; receptive field enhancement processing is performed on each fused feature map (P3-P7), and variable convolution is performed on each fused feature map after receptive field enhancement according to the obtained face position offset, to obtain the variable convolution feature map of each fused feature map.
Second-step classification is performed on the variable-convolved P3-P5 in combination with the first-step classification results of C3-C5, yielding the second-step classification results of P3-P5; second-step regression is performed on the variable-convolved P6 and P7 in combination with the first-step regression results of C6 and C7, yielding the second-step regression results of P6 and P7.
The face detection result is then determined according to the first-step regression results of C3-C5 and the second-step classification results of P3-P5, together with the first-step classification results of C6 and C7 and the second-step regression results of P6 and P7.
FIG. 4 is a schematic structural diagram of a face detection apparatus according to an embodiment of the present application; the face detection apparatus 400 includes:
and the feature map generating unit 410 is configured to generate a feature map of the face detection image, where the feature map includes original feature maps of multiple scales and fused feature maps corresponding to the original feature maps.
Face detection techniques are used in more and more fields, such as face beautification and photo album classification. The face detection method provided by the present application has a high detection accuracy, a small amount of computation, and a high computation speed.
Firstly, generating a feature map of a face detection image, wherein the feature map comprises original feature maps with a plurality of scales and fused feature maps corresponding to the original feature maps.
In the present application, image features are described at multiple scales. Multi-scale representation describes the target structure within a certain range of scales; its basic idea is to embed the original signal into a family of signals parameterized by scale, so that the structure of the signal at coarser scales is a simplified version of the signal at finer scales. Multi-scale representations include, but are not limited to, image pyramid representation and scale-space representation.
In this application, the image features referred to in the original feature map include, but are not limited to, color features, shape features, edge features, texture features, spatial relationship features, and the like.
When the feature extraction is carried out, the image can be automatically segmented to divide an object or color area contained in the image, and then the image feature is extracted according to the areas; it is also possible to simply divide the image evenly into regular sub-blocks and then extract features for each image sub-block.
The algorithm for feature extraction may be any one of the prior art, such as: a Haar (Haar) feature extraction method, an LBP (Local Binary Pattern) feature extraction method, a SIFT (Scale-invariant feature transform) feature extraction method, and the like, which can be specifically implemented by using a machine learning model, and the structure of the machine learning model can be a Convolutional Neural Network (CNN) and the like. Here, a common convolutional neural network structure ResNet is taken as an example, and a brief explanation is given.
The main idea of ResNet is to add direct shortcut connections to the network, i.e. the idea of the highway network. Before ResNet appeared, a network layer's output was a non-linear transformation of its input, whereas the highway network allows a certain proportion of the output of the previous network layer to be preserved. As a result, a layer of the neural network only has to learn the residual with respect to the output of the previous network rather than the whole output. Traditional convolutional or fully connected networks more or less suffer from information loss during information transmission, and deep networks may fail to train because of gradient vanishing or gradient explosion. ResNet alleviates these problems to a certain extent: the input information is directly bypassed to the output, which protects the integrity of the information, and the whole network only needs to learn the difference between input and output, simplifying the learning goal and difficulty.
And after the original feature maps of a plurality of scales are obtained, fusing the features of the original feature maps to obtain fused feature maps corresponding to the original feature maps. Feature extraction is the extraction process of semantic features from a low level to a high level. For example, for the extraction of the face features, the features extracted at the lower layer of the network are only some contour features, the features extracted deeply along with the network may be higher semantic features such as eyes and noses, and the features of the whole face contour may be extracted until the network reaches the deepest layer.
However, as the network grows deeper, each layer loses some information, and by the last layer more information has been lost. To address this, the features of the original feature maps at multiple scales can be fused to obtain a fused feature map corresponding to each original feature map. The basic idea is to add the original feature map of the upper layer to the original feature map of the current layer before the convolution operation of this layer, so that some information from the upper layer is retained and the information loss of this layer is reduced.
The algorithm for feature fusion may employ any existing technique, including but not limited to: algorithms based on Bayesian decision theory, algorithms based on sparse representation theory, and algorithms based on deep learning theory. For example, a feature fusion algorithm based on sparse representation theory extracts multiple features from a sample and then builds a joint sparse feature matrix, which is the result of multi-feature fusion; the method fuses the dictionaries of the different types of features. As another example, a feature fusion algorithm based on deep learning theory fuses the features obtained by several neural networks to obtain a fused feature.
A feature map processing unit 420, configured to determine a face position offset according to the original feature map; and performing variable convolution on each fused feature map according to the face position offset to obtain the variable convolution feature map of each fused feature map.
The method and the device utilize the variable convolution to correct the fused feature map corresponding to the original feature map, wherein the variable convolution is added with an offset variable relative to the standard convolution, so that the sampling range of the convolution can have a certain degree of adjustment space near the regular lattice point.
The face position offset may also be called the face position shift. Taking anchor point frame sampling as an example, the face position offset can be understood as the offset between the result of face position regression and the absolute position of the original anchor point frame in the original image. Concretely, suppose the regressed face box (label) has upper-left corner (x_lu, y_lu) and lower-right corner (x_rd, y_rd), and the absolute coordinates of the original anchor point frame in the original image are (x1, y1) and (x2, y2); the horizontal face position offset of the upper-left corner is then defined as (x1 - x_lu)/(x2 - x1), which is normalized data, and the offsets of the other upper-left and lower-right coordinates are obtained in the same way. In simpler terms, the offset expresses by what fraction of the original anchor point frame's width and height the x and y coordinates of the upper-left and lower-right corners are shifted.
When the network regresses the face position offset from an original feature map, preliminarily selected images that contain a face (positive samples) or part of a face (partial-face samples) are generally used, because performing the regression on background images that contain no face (negative samples) would make the regression result inaccurate; for images that partially contain a face, the network can regress from local features such as the nose, eyes and ears.
Variable convolution is also called deformable convolution. The modules used in a standard convolutional network have fixed geometric structures, so their ability to model geometric transformations is inherently limited; the root cause of this inability to adapt to geometric deformation is the regular grid sampling of standard convolution. To weaken this limitation, an offset variable is added to the position of each sampling point in the convolution kernel. Through these variables, the convolution kernel can sample freely around the current position instead of being restricted to the regular grid; this extended convolution operation is called deformable convolution, and these offsets are the biggest difference between deformable convolution and a standard convolutional network.
In the present application, the obtained face position offset is used as the additional offset variable that variable convolution adds on top of standard convolution, so that the size and position of the deformable convolution kernel can be dynamically adjusted according to the image content currently being recognized. The visual effect is that the sampling point positions of the convolution kernels at different locations change adaptively with the image content, so that the kernels adapt to geometric deformations such as the shapes and sizes of different objects.
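As an illustrative sketch of this wiring (using torchvision's deformable convolution operator; the 1x1 convolution that turns the four per-location face position offsets into per-sampling-point offsets is an assumption, since the mapping is not spelled out here):

```python
import torch
from torch import nn
from torchvision.ops import deform_conv2d

fused = torch.randn(1, 64, 40, 40)        # a fused feature map
face_offset = torch.randn(1, 4, 40, 40)   # per-location face position offsets (illustrative)

k = 3                                      # deformable kernel size
offset_head = nn.Conv2d(4, 2 * k * k, kernel_size=1)   # assumed mapping to sampling offsets
weight = torch.randn(64, 64, k, k)         # deformable convolution kernel

offset = offset_head(face_offset)          # (1, 2*k*k, 40, 40): x/y shift per sampling point
out = deform_conv2d(fused, offset, weight, padding=1)
print(out.shape)                           # torch.Size([1, 64, 40, 40])
```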
A detection unit 430, configured to determine a face detection result based on the variable convolution feature map.
After each fused feature map is processed with variable convolution according to the face position offset, a variable convolution feature map is obtained. In this map the face information is fully described, so the face detection result can be determined from it; the face detection result may include, but is not limited to, information such as the size and position of the face.
In an embodiment of the present application, in the above apparatus, the feature map generating unit 410 is configured to generate the original feature maps of the face detection image in a bottom-up, down-sampling manner, and to generate the fused feature map corresponding to each original feature map in a top-down, up-sampling manner.
In an embodiment of the present application, in the above apparatus, the feature map generating unit 410 is configured to determine the weight of the target fused feature map according to the number of channels of the original feature map corresponding to the target fused feature map and the number of channels of the upper-layer fused feature map of the target fused feature map.
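The exact weighting formula is not fixed here, so the proportional scheme below (weights proportional to the two channel counts, with the maps assumed to be already projected to a common channel width) is only an assumed illustration:

```python
import torch
import torch.nn.functional as F

def weighted_merge(original, upper_fused, c_orig, c_upper):
    """Assumed channel-count-based weighting of a fused feature map."""
    w_orig = c_orig / (c_orig + c_upper)
    w_upper = c_upper / (c_orig + c_upper)
    upper_fused = F.interpolate(upper_fused, size=original.shape[-2:], mode="nearest")
    return w_orig * original + w_upper * upper_fused

orig = torch.randn(1, 64, 40, 40)    # original map, already projected to 64 channels
upper = torch.randn(1, 64, 20, 20)   # upper-layer fused map, also 64 channels
merged = weighted_merge(orig, upper, c_orig=128, c_upper=256)  # weights 1/3 and 2/3
print(merged.shape)                  # torch.Size([1, 64, 40, 40])
```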
In an embodiment of the present application, in the above apparatus, the feature map processing unit 420 is configured to perform anchor box regression on each original feature map respectively, and to determine the face position offset according to the anchor box regression result of each original feature map and the anchor box corresponding to each anchor box regression result.
In an embodiment of the present application, in the above apparatus, the detection unit 430 is configured to: perform anchor box classification and anchor box regression on each original feature map respectively; perform anchor box classification on the first type of fused feature map corresponding to the first type of original feature map according to the anchor box classification result of the first type of original feature map; perform anchor box regression on the second type of fused feature map corresponding to the second type of original feature map according to the anchor box regression result of the second type of original feature map; and determine the face detection result according to the anchor box regression result of the first type of original feature map, the anchor box classification result of the first type of fused feature map, the anchor box classification result of the second type of original feature map and the anchor box regression result of the second type of fused feature map; wherein the first type of original feature map is a lower-layer feature map of the second type of original feature map.
In one embodiment of the present application, in the above apparatus: the feature map processing unit 420 is further configured to perform a receptive field enhancement process on the original feature map and the fused feature map, respectively.
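The internals of the receptive field enhancement are not specified here; a common choice is a set of parallel convolutions with different dilation rates, concatenated and added back to the input, which the following sketch assumes:

```python
import torch
import torch.nn as nn

class ReceptiveFieldEnhance(nn.Module):
    """Assumed receptive-field enhancement: parallel dilated 3x3 branches,
    concatenated, projected back to the input width and added residually."""

    def __init__(self, channels=64):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels // 4, 3, padding=d, dilation=d)
            for d in (1, 2, 3, 5)
        )
        self.project = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        enhanced = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.project(enhanced) + x

print(ReceptiveFieldEnhance()(torch.randn(1, 64, 40, 40)).shape)  # (1, 64, 40, 40)
```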
In an embodiment of the present application, the above apparatus is implemented based on a face detection model, and the face detection model is trained as follows: a training image is input into the face detection model to obtain a face detection result; multiple types of loss function values are calculated according to the labeling information of the training image and the face detection result, wherein the loss functions include at least one of the following: a face classification loss function, a face position loss function, a key point loss function and a face segmentation loss function; and the parameters of the face detection model are updated according to the loss function values.
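A sketch of such a multi-part loss (the exact loss forms, weights and the tensor layouts below are assumptions; only the four loss types are named above) might look like:

```python
import torch
import torch.nn.functional as F

def face_detection_loss(pred, target, weights=(1.0, 1.0, 0.5, 0.5)):
    """Assumed weighted combination of the four loss types named above."""
    cls_loss = F.cross_entropy(pred["cls"], target["cls"])                     # face classification
    box_loss = F.smooth_l1_loss(pred["box"], target["box"])                    # face position
    kpt_loss = F.smooth_l1_loss(pred["kpt"], target["kpt"])                    # key points
    seg_loss = F.binary_cross_entropy_with_logits(pred["seg"], target["seg"])  # face segmentation
    w = weights
    return w[0] * cls_loss + w[1] * box_loss + w[2] * kpt_loss + w[3] * seg_loss

# toy batch of 8 anchors with hypothetical prediction/label layouts
pred = {"cls": torch.randn(8, 2), "box": torch.randn(8, 4),
        "kpt": torch.randn(8, 10), "seg": torch.randn(8, 1, 40, 40)}
target = {"cls": torch.randint(0, 2, (8,)), "box": torch.randn(8, 4),
          "kpt": torch.randn(8, 10), "seg": torch.rand(8, 1, 40, 40)}
print(face_detection_loss(pred, target).item())
```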
It should be noted that the face detection apparatuses in the foregoing embodiments can respectively be used to execute the face detection methods in the foregoing embodiments, and therefore they are not described again in detail here.
According to the present application, a feature map of a face detection image is generated, the feature map including original feature maps of multiple scales and fused feature maps corresponding to the original feature maps; a face position offset is determined according to the original feature maps; variable convolution is performed on each fused feature map according to the face position offset to obtain a variable convolution feature map of each fused feature map; and a face detection result is determined based on the variable convolution feature maps. By applying variable convolution with the face position offset obtained from the multi-scale original feature maps, the expression capability of face features in the fused feature maps is significantly improved, and the face detection accuracy is improved accordingly. The calculation method is simple, the amount of calculation is small and the calculation efficiency is high, which greatly expands the application scenarios of face detection; the scheme is particularly suitable for multi-target face detection.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in a face detection apparatus according to embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
For example, fig. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 500 comprises a processor 510 and a memory 520 arranged to store computer-executable instructions (computer-readable program code). The memory 520 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk or a ROM. The memory 520 has a storage space 530 storing computer-readable program code 531 for performing any of the method steps described above. For example, the storage space 530 may include respective pieces of computer-readable program code 531 for implementing the various steps of the above method. The computer-readable program code 531 may be read from or written to one or more computer program products, which comprise a program code carrier such as a hard disk, a compact disc (CD), a memory card or a floppy disk. Such a computer program product is typically a computer-readable storage medium as described with reference to fig. 6. Fig. 6 shows a schematic diagram of a computer-readable storage medium according to an embodiment of the present application. The computer-readable storage medium 600 stores computer-readable program code 531 for performing the steps of the method according to the application, which can be read by the processor 510 of the electronic device 500; when executed by the electronic device 500, the computer-readable program code 531 causes the electronic device 500 to perform the steps of the method described above. In particular, the computer-readable program code 531 stored on the computer-readable storage medium may perform the method shown in any of the embodiments described above. The computer-readable program code 531 may be compressed in a suitable form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second, third, etcetera does not indicate any ordering; these words may be interpreted as names.
Claims (10)
1. A face detection method, comprising:
generating a feature map of a face detection image, wherein the feature map comprises original feature maps with a plurality of scales and fused feature maps corresponding to the original feature maps;
determining face position offset according to the original feature map;
performing variable convolution on each fused feature map according to the face position offset to obtain a variable convolution feature map of each fused feature map;
and determining a face detection result based on the variable convolution feature map.
2. The method of claim 1, wherein generating the feature map of the face detection image comprises:
generating an original feature map of the face detection image in a bottom-up and down-sampling mode;
generating a fused feature map corresponding to each original feature map in a top-down and up-sampling manner.
3. The method of claim 2, wherein the generating fused feature maps corresponding to the respective original feature maps in a top-down and up-sampling manner comprises:
and determining the weight of the target fused feature map according to the number of channels of the original feature map corresponding to the target fused feature map and the number of channels of the upper-layer fused feature map of the target fused feature map.
4. The method of claim 1, wherein the determining a face position offset according to the original feature map comprises:
performing anchor box regression on each original feature map respectively;
and determining the face position offset according to the anchor box regression result of each original feature map and the anchor box corresponding to each anchor box regression result.
5. The method of claim 1, wherein determining a face detection result based on the variable convolution feature map comprises:
respectively performing anchor box classification and anchor box regression on each original feature map;
performing anchor box classification on the first type of fused feature map corresponding to the first type of original feature map according to the anchor box classification result of the first type of original feature map;
performing anchor box regression on the second type of fused feature map corresponding to the second type of original feature map according to the anchor box regression result of the second type of original feature map;
determining a face detection result according to the anchor box regression result of the first type of original feature map, the anchor box classification result of the first type of fused feature map, the anchor box classification result of the second type of original feature map and the anchor box regression result of the second type of fused feature map;
wherein the first type of original feature map is a lower-layer feature map of the second type of original feature map.
6. The method of claim 1, further comprising:
and respectively performing receptive field enhancement processing on the original feature map and the fused feature map.
7. The method according to any one of claims 1-6, wherein the method is implemented based on a face detection model, the face detection model being trained by:
inputting the training image into a face detection model to obtain a face detection result;
calculating multiple types of loss function values according to the labeling information of the training image and the face detection result, wherein the loss functions comprise at least one of the following: a face classification loss function, a face position loss function, a key point loss function and a face segmentation loss function;
and updating parameters of the face detection model according to the loss function value.
8. A face detection apparatus, comprising:
a feature map generating unit, configured to generate a feature map of a face detection image, wherein the feature map comprises original feature maps with a plurality of scales and fused feature maps corresponding to the original feature maps;
a feature map processing unit, configured to determine a face position offset according to the original feature map, and to perform variable convolution on each fused feature map according to the face position offset to obtain a variable convolution feature map of each fused feature map;
and a detection unit, configured to determine a face detection result based on the variable convolution feature map.
9. An electronic device, wherein the electronic device comprises: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the method of any one of claims 1-7.
10. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010404206.6A CN111738069A (en) | 2020-05-13 | 2020-05-13 | Face detection method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010404206.6A CN111738069A (en) | 2020-05-13 | 2020-05-13 | Face detection method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111738069A true CN111738069A (en) | 2020-10-02 |
Family
ID=72647191
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010404206.6A Pending CN111738069A (en) | 2020-05-13 | 2020-05-13 | Face detection method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111738069A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107609519A (en) * | 2017-09-15 | 2018-01-19 | 维沃移动通信有限公司 | The localization method and device of a kind of human face characteristic point |
US20190205643A1 (en) * | 2017-12-29 | 2019-07-04 | RetailNext, Inc. | Simultaneous Object Localization And Attribute Classification Using Multitask Deep Neural Networks |
CN108764164A (en) * | 2018-05-30 | 2018-11-06 | 华中科技大学 | A kind of method for detecting human face and system based on deformable convolutional network |
CN108830205A (en) * | 2018-06-04 | 2018-11-16 | 江南大学 | Based on the multiple dimensioned perception pedestrian detection method for improving full convolutional network |
WO2020024585A1 (en) * | 2018-08-03 | 2020-02-06 | 华为技术有限公司 | Method and apparatus for training object detection model, and device |
CN109801270A (en) * | 2018-12-29 | 2019-05-24 | 北京市商汤科技开发有限公司 | Anchor point determines method and device, electronic equipment and storage medium |
CN110569782A (en) * | 2019-09-05 | 2019-12-13 | 辽宁科技大学 | Target detection method based on deep learning |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112232361A (en) * | 2020-10-13 | 2021-01-15 | 国网电子商务有限公司 | Image processing method and device, electronic equipment and computer readable storage medium |
CN112183488A (en) * | 2020-11-03 | 2021-01-05 | 平安科技(深圳)有限公司 | Face detection model training method and device and face detection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20201002 |