CN109993086A - Face detection method, device, system and terminal device - Google Patents
Face detection method, device, system and terminal device
- Publication number
- CN109993086A
- Authority
- CN
- China
- Prior art keywords
- depth
- normalized
- information
- candidate frame
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
- G06V40/166—Detection; Localisation; Normalisation using acquisition arrangements
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
This specification provides a face detection method, apparatus, system and terminal device. The method comprises: obtaining a depth image and a color image registered with the depth image; jointly normalizing the depth information of the depth image and the RGB information of the color image; obtaining a set size of the face candidate frame based on the depth information of the depth image and the internal parameters of the camera that captured the depth image; determining, through a pre-trained neural network model, the face candidate frames whose scores are higher than a set threshold; and determining a target candidate frame as the target face region based on the face candidate frames whose scores are higher than the set threshold. By using the depth information of a depth image registered with the color image as input data of the neural network model, the embodiments of the present application improve the detection precision and robustness of the model; meanwhile, setting the size of the candidate frame through the depth information and the camera intrinsics speeds up detection and further improves detection precision.
Description
Technical Field
The present disclosure relates to the field of face detection technologies, and in particular, to a face detection method, apparatus, system, and terminal device.
Background
With the development of face detection technology, the application value of the face detection technology in the fields of security access control, visual detection, content-based image retrieval and the like is increasing.
Most current face detection algorithms are based on color images and are carried out with a Multi-Task Cascaded Convolutional Neural Network (MTCNN).
Color images are strongly affected by factors such as illumination conditions, resolution and color, and place high requirements on the training data, so the robustness of the algorithm model is poor; moreover, MTCNN comprises multiple sub-models, its algorithm logic is complex, and its response speed is slow.
Disclosure of Invention
In order to overcome the problems in the related art, the present specification provides a face detection method, apparatus, system and terminal device.
Specifically, the method is realized through the following technical scheme:
according to a first aspect of embodiments of the present specification, there is provided a face detection method, including:
obtaining a depth image and a color image registered with the depth image;
normalizing the depth information of the depth image and the RGB information of the color image together, and outputting normalized four-channel information, wherein the normalized four-channel information comprises the normalized RGB information and the normalized depth information;
obtaining a set size of a face candidate frame based on depth information of a depth image and internal parameters of a camera for shooting the depth image;
determining a face candidate frame with a score higher than a set threshold value through a pre-trained neural network model based on the normalized four-channel information and the set size of the face candidate frame;
and determining a target candidate frame as a target face area based on the face candidate frame with the score higher than a set threshold value.
According to a second aspect of embodiments of the present specification, there is provided a face detection apparatus including:
an image acquisition unit for obtaining a depth image and a color image registered with the depth image;
the normalization unit is used for normalizing the depth information of the depth image and the RGB information of the color image together and outputting normalized four-channel information, wherein the normalized four-channel information comprises the normalized RGB information and the normalized depth information;
a size acquisition unit for obtaining a set size of a face candidate frame based on depth information of a depth image and internal parameters of a camera that captures the depth image;
the first determining unit is used for determining the face candidate frame with the score higher than a set threshold value through a pre-trained neural network model based on the normalized four-channel information and the set size of the face candidate frame;
and a second determination unit configured to determine a target candidate frame as the target face region based on the face candidate frame whose score is higher than a set threshold.
According to a third aspect of embodiments herein, there is provided a terminal device including: an internal bus, and a memory, a processor and an external interface connected through the internal bus; wherein,
the external interface is used for obtaining a depth image and a color image which is registered with the depth image;
the memory is used for storing machine readable instructions corresponding to the face detection;
the processor is configured to read the machine-readable instructions on the memory and execute the instructions to implement the following operations:
normalizing the depth information of the depth image and the RGB information of the color image together, and outputting normalized four-channel information, wherein the normalized four-channel information comprises the normalized RGB information and the normalized depth information;
obtaining a set size of a face candidate frame based on depth information of a depth image and internal parameters of a camera for shooting the depth image;
determining a face candidate frame with a score higher than a set threshold value through a pre-trained neural network model based on the normalized four-channel information and the set size of the face candidate frame;
and determining a target candidate frame as a target face area based on the face candidate frame with the score higher than a set threshold value.
According to a fourth aspect of embodiments herein, there is provided a face detection system, comprising: depth cameras, color cameras, and terminal devices, wherein,
the depth camera is used for shooting a depth image;
the color camera is used for capturing a color image, the depth camera being registered with the color camera;
the terminal equipment is used for obtaining a depth image and a color image which is registered with the depth image; normalizing the depth information of the depth image and the RGB information of the color image together, and outputting normalized four-channel information, wherein the normalized four-channel information comprises the normalized RGB information and the normalized depth information; obtaining a set size of a face candidate frame based on depth information of a depth image and internal parameters of a camera for shooting the depth image; determining a face candidate frame with a score higher than a set threshold value through a pre-trained neural network model based on the normalized four-channel information and the set size of the face candidate frame; and determining a target candidate frame as a target face area based on the face candidate frame with the score higher than a set threshold value.
According to a fifth aspect of embodiments herein, there is provided a face detection system, comprising: camera with depth information and terminal device, wherein,
the camera with the depth information is used for shooting a depth image and a color image which is registered with the depth image;
the terminal equipment is used for obtaining a depth image and a color image which is registered with the depth image; normalizing the depth information of the depth image and the RGB information of the color image together, and outputting normalized four-channel information, wherein the normalized four-channel information comprises the normalized RGB information and the normalized depth information; obtaining a set size of a face candidate frame based on depth information of a depth image and internal parameters of a camera for shooting the depth image; determining a face candidate frame with a score higher than a set threshold value through a pre-trained neural network model based on the normalized four-channel information and the set size of the face candidate frame; and determining a target candidate frame as a target face area based on the face candidate frame with the score higher than a set threshold value.
By applying the face detection embodiment provided by the application, the depth information of the depth image registered with the color image is used as the input data of the neural network model, so that the model detection precision and robustness are improved; meanwhile, the size of the candidate frame is set through the depth information and the camera internal parameters, so that the detection speed is increased, and the detection precision is further improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
Fig. 1 is a flowchart illustrating a face detection method according to an exemplary embodiment of the present application.
FIG. 2 is a flow chart illustrating a method of training a neural network model according to an exemplary embodiment of the present application.
Fig. 3 is a schematic structural diagram of a face detection apparatus according to an exemplary embodiment of the present application.
Fig. 4 is a block diagram of a terminal device shown in the present application according to an exemplary embodiment.
Fig. 5 is a schematic structural diagram of a face detection system according to an exemplary embodiment of the present application.
Fig. 6 is a schematic structural diagram of another face detection system according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
Referring to fig. 1, a flowchart of a face detection method in an example of the present application is shown. The method may comprise the steps of:
in step 101, a depth image and a color image registered to the depth image are obtained.
The depth image can be captured by a depth camera and the color image by a color camera; alternatively, both the depth image and the color image can be captured by a single camera with depth information.
In this embodiment, the depth image and the color image should be registered, to ensure that each pixel in the depth image can find its corresponding pixel in the color image, the two corresponding pixels being measurements of the same position in space.
In one example, the depth image and the color image may be registered by:
A depth image and a color image are registered if they are taken at the same angle and position by a camera with depth information, or if the two images are obtained from the same shot.
If the depth image and the color image are captured by a depth camera and a color camera, respectively, then the depth image and the color image captured with the cameras at the same angle and position are registered by calibrating the depth camera and the color camera with the same method in the same scene.
For example, the depth camera and the color camera are each calibrated by Zhang's calibration method, and the scene against which the two cameras are calibrated must be identical. Calibrating the depth camera and the color camera yields the internal parameters of both cameras.
The internal parameters of a camera, including the focal length, the position of the principal point (where the optical axis meets the image plane), and the scale ratio between pixels and the real environment, are inherent properties of the camera, used for conversion between the camera coordinate system and the image plane coordinate system.
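As a purely illustrative sketch (not part of the patent), the following shows how, once both cameras are calibrated, a pixel of the depth image can be mapped to its corresponding pixel in the color image through the two intrinsic matrices and the depth-to-color extrinsics; all numeric values below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical intrinsics for the depth and color cameras (obtained, e.g.,
# from calibration), plus assumed depth->color extrinsics (R, t).
K_depth = np.array([[580.0, 0.0, 320.0],
                    [0.0, 580.0, 240.0],
                    [0.0,   0.0,   1.0]])
K_color = np.array([[525.0, 0.0, 319.5],
                    [0.0, 525.0, 239.5],
                    [0.0,   0.0,   1.0]])
R = np.eye(3)                     # rotation from depth frame to color frame
t = np.array([0.025, 0.0, 0.0])   # translation in meters

def depth_pixel_to_color_pixel(u, v, z):
    """Back-project depth pixel (u, v) with depth z (meters) to 3D,
    transform into the color camera frame, and reproject."""
    p_depth = z * (np.linalg.inv(K_depth) @ np.array([u, v, 1.0]))
    p_color = R @ p_depth + t
    uvw = K_color @ p_color
    return uvw[0] / uvw[2], uvw[1] / uvw[2]
```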
In step 102, the depth information of the depth image and the RGB information of the color image are normalized together, and normalized four-channel information is output.
Wherein the normalized four-channel information includes normalized RGB information and normalized depth information.
In one example, the depth information of the depth image and the RGB information of the color image may be collectively normalized by:
normalizing the depth information into the interval [0, 255] to obtain four-channel information in the same range;
and normalizing the four-channel information in the same range into the range [0, 1] or [-1, 1].
The color image is in RGB format, and its pixel values carry information of the three RGB channels, i.e. the pixel value of each pixel includes R, G and B components, all in the range [0, 255]. In order to normalize the depth information and the RGB information together, the depth information is first normalized into the interval [0, 255], so that four-channel information in the same range is obtained: the R, G, B information and the depth information. Then the R, G, B information and the depth information are normalized again, into the range [0, 1] or [-1, 1], to obtain the normalized four-channel information.
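A minimal sketch of this two-stage joint normalization (an illustration only; rescaling depth by its min-max range is one assumed way of bringing it into [0, 255]):

```python
import numpy as np

def normalize_four_channel(rgb, depth):
    """rgb: HxWx3 uint8 array; depth: HxW array (e.g., millimeters).
    Returns an HxWx4 float array normalized into [0, 1]."""
    # Stage 1: rescale depth into [0, 255] so all four channels share a range.
    d = depth.astype(np.float32)
    d = 255.0 * (d - d.min()) / max(float(d.max() - d.min()), 1e-6)
    four = np.dstack([rgb.astype(np.float32), d])  # HxWx4
    # Stage 2: normalize all four channels into [0, 1]
    # (use four / 127.5 - 1.0 instead for [-1, 1]).
    return four / 255.0
```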
In step 103, the set size of the face candidate frame is obtained based on the depth information of the depth image and the internal parameters of the camera that captured the depth image.
The depth value of the depth image directly reflects the distance between the object and the camera, and the internal parameters of the camera include the scale ratio between pixels and the real environment; therefore, for the depth image captured by the camera and the registered color image, the set size of the face candidate frame can be derived from the range of actual human head sizes.
It should be understood by those skilled in the art that the set size of the face candidate frame can be adjusted according to actual situations and needs.
In the related art, when the MTCNN model is used for face detection, candidate frames must be extracted over the full image at multiple scales, which generates a large number of face candidate frames to examine, incurs a large amount of unnecessary computation, and reduces the response speed of the algorithm.
In the embodiment, a single-scale face candidate box can be set based on the depth information of the depth image and the internal parameters of the camera, so that unnecessary operation is avoided, and the response speed of the algorithm is improved.
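The relation used here follows the pinhole camera model: the pixel size of a face at depth z is roughly focal_length x real_size / z. A minimal sketch, with an assumed focal length and an assumed average head width (both hypothetical values, not the patent's):

```python
def face_box_size_px(depth_m, focal_px=525.0, head_width_m=0.20):
    """Expected face width in pixels for a face at depth_m meters,
    via the pinhole model: pixels = focal * real_size / depth."""
    return focal_px * head_width_m / depth_m

# Example: at 1.5 m with an assumed focal length of 525 px, a 0.2 m head
# spans about 70 px, so a single-scale candidate frame of roughly 70x70 px
# suffices at that depth.
print(face_box_size_px(1.5))  # -> 70.0
```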
In step 104, based on the normalized four-channel information and the set size of the face candidate frame, the face candidate frame with the score higher than the set threshold is determined through a pre-trained neural network model.
In one example, the neural network model may be trained by the following method. As shown in fig. 2, the method comprises the steps of:
in step 201, a sample depth image and a sample color image registered with the sample depth image are obtained.
In this step, the registration method of the sample depth image and the sample color image may be the same as the registration method of the depth image and the color image in step 101.
In step 202, the regions of the face in the sample depth image and the sample color image are marked.
In marking, only the sample depth image or only the sample color image may be marked, with corresponding marks then generated on the corresponding pixels of the registered sample image; alternatively, the sample depth image and the sample color image may be marked at the same time to generate the marking data. The marking data comprises marking values at different pixel coordinates, where the marking value can be 1 or 0, with 1 representing a face pixel and 0 a non-face pixel (or, alternatively, 0 representing a face pixel and 1 a non-face pixel). That is, in the marking data, each pixel carries a label indicating whether or not it is a face pixel.
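A minimal sketch of producing such per-pixel marking data; annotating faces as rectangles (x, y, w, h) is an assumption made here for illustration:

```python
import numpy as np

def make_label_mask(height, width, face_rects):
    """Build an HxW marking mask: 1 = face pixel, 0 = non-face pixel."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x, y, w, h in face_rects:
        mask[y:y + h, x:x + w] = 1
    return mask
```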
In step 203, the depth information of the sample depth image and the RGB information of the sample color image are collectively normalized, and normalized sample four-channel information is output.
Wherein the normalized sample four-channel information includes normalized sample depth information and normalized sample RGB information.
In this step, normalization can be performed using the same method as in step 102.
In step 204, the normalized sample four-channel information and the labeled data are input into the neural network model for training until the iteration number is satisfied or the loss converges.
In one example, the neural network model may be a convolutional neural network model.
After training, the neural network can mark the pixels belonging to a face in the data of the depth image and of the registered color image. That is, when the normalized four-channel information is input into the pre-trained neural network model, the model outputs pixel data in which each pixel carries a flag indicating whether or not it is a face pixel.
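As an illustrative sketch only, a small convolutional model with a four-channel input and a per-pixel face/non-face output, trained as in steps 201-204, might look as follows in PyTorch; the layer sizes and hyperparameters are assumptions, not the patent's:

```python
import torch
import torch.nn as nn

class FacePixelNet(nn.Module):
    """Per-pixel face classifier over normalized four-channel input."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),  # one face/non-face logit per pixel
        )

    def forward(self, x):  # x: Nx4xHxW, values in [0, 1]
        return self.net(x)

model = FacePixelNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(four_channel, label_mask):
    """One iteration; label_mask: Nx1xHxW with 1 = face pixel, 0 = non-face."""
    optimizer.zero_grad()
    loss = loss_fn(model(four_channel), label_mask.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```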
Since detecting face pixels using all the pixel information of the depth image and the color image at once is inefficient, in this embodiment the detection of face pixels is performed in conjunction with a candidate frame of a set size:
Each time, the normalized four-channel information within the corresponding range is selected through the candidate frame of the set size and input into the pre-trained neural network model, and the model judges which of the input pixel data belong to a face. That is, the model outputs pixel data in which each pixel is flagged as face or non-face.
The face candidate frame is slid by a set step length, changing its selection range so that normalized four-channel information in a different range is selected each time; after all the normalized four-channel information has been traversed, the data corresponding to each selection is output. The set step length can be chosen according to the precision required by the face detection.
From the data output for each selection, the face candidate frames whose scores are higher than the set threshold are determined. For each selection, the score of the selected face candidate frame may be obtained from the number (or proportion) of face pixels contained in the frame: the more face pixels the selected frame contains (or the higher their proportion), the higher the score, and vice versa. When the score of a selected face candidate frame is higher than the set threshold, the frame is considered to contain a face; otherwise it is not. The set threshold can be chosen according to the accuracy required by the face detection.
In this embodiment, as described above, the number (or proportion) of face pixels in the data output for each selection may be used to score that selection of the face candidate frame, and whether the frame contains a face may be judged from the score. Those skilled in the art will appreciate that other factors may also be used to evaluate the score of a face candidate frame.
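A minimal sketch of this sliding-candidate-frame scoring, assuming the model's per-pixel flags have been collected into an HxW 0/1 mask and that a frame is scored by the proportion of face pixels it contains:

```python
import numpy as np

def score_candidate_boxes(face_mask, box_size, step, threshold):
    """Slide a box_size x box_size frame over face_mask with the given step;
    keep placements whose face-pixel proportion exceeds threshold."""
    h, w = face_mask.shape
    kept = []
    for y in range(0, h - box_size + 1, step):
        for x in range(0, w - box_size + 1, step):
            score = float(face_mask[y:y + box_size, x:x + box_size].mean())
            if score > threshold:
                kept.append((x, y, box_size, box_size, score))
    return kept
```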
In step 105, a target candidate frame is determined as a target face region based on the face candidate frame whose score is higher than a set threshold.
There may be one or more face candidate frames with scores higher than the set threshold. When there are multiple such frames, since the position of the face within each frame may differ, the frame that best selects the face among them may be determined as the target candidate frame.
In one example, a non-maximum suppression (NMS) algorithm is used to determine the target candidate frame from the face candidate frames with scores above the set threshold. Those skilled in the art should understand that the method for determining the target candidate frame is not limited to the above; other methods may be adopted, for example, selecting the frame in which the face is most centered as the target candidate frame.
The determined target candidate frame is taken as the target face region, completing the face detection. The target face region can be displayed in the color image or in the depth image.
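For illustration, a minimal sketch of the non-maximum suppression mentioned above, over candidate frames given as (x, y, w, h, score) tuples; the IoU threshold is an assumed value:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h, score) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def nms(boxes, iou_thresh=0.3):
    """Keep the highest-scoring boxes, discarding overlaps above iou_thresh."""
    kept = []
    for b in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept
```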
Corresponding to the embodiments of the method, the present specification also provides embodiments of an apparatus, a system and a terminal device.
Referring to fig. 3, a block diagram of an embodiment of a face detection apparatus according to the present application is shown. The device includes: an image acquisition unit 310, a normalization unit 320, a size acquisition unit 330, a first determination unit 340, and a second determination unit 350.
The image acquisition unit 310 is configured to obtain a depth image and a color image registered with the depth image;
a normalization unit 320, configured to normalize the depth information of the depth image and the RGB information of the color image together, and output normalized four-channel information, where the normalized four-channel information includes the normalized RGB information and the normalized depth information;
a size obtaining unit 330 for obtaining a set size of a face candidate frame based on depth information of a depth image and an internal parameter of a camera that captures the depth image;
a first determining unit 340, configured to determine, based on the normalized four-channel information and the set size of the face candidate frame, a face candidate frame with a score higher than a set threshold through a pre-trained neural network model;
a second determining unit 350, configured to determine a target candidate frame as the target face region based on the face candidate frame with the score higher than the set threshold.
Referring to fig. 4, a block diagram of an embodiment of a terminal device according to the present application is shown. The terminal device includes:
an internal bus 410, and a memory 420, a processor 430, and an external interface 440 connected by the internal bus.
Wherein the external interface 440 is configured to obtain a depth image and a color image registered with the depth image;
a memory 420 for storing machine readable instructions corresponding to face detection;
a processor 430 to read the machine-readable instructions on the memory and execute the instructions to perform the following operations:
normalizing the depth information of the depth image and the RGB information of the color image together, and outputting normalized four-channel information, wherein the normalized four-channel information comprises the normalized RGB information and the normalized depth information;
obtaining a set size of a face candidate frame based on depth information of a depth image and internal parameters of a camera for shooting the depth image;
determining a face candidate frame with a score higher than a set threshold value through a pre-trained neural network model based on the normalized four-channel information and the set size of the face candidate frame;
and determining a target candidate frame as a target face area based on the face candidate frame with the score higher than a set threshold value.
Referring to fig. 5, a block diagram of an embodiment of a face detection system according to the present application is shown. The system may include: a depth camera 510, a color camera 520, and a terminal device 530.
Wherein, the depth camera 510 is used for shooting a depth image;
a color camera 520 for capturing a color image, the depth camera being in registration with the color camera;
a terminal device 530 for obtaining a depth image and a color image registered with the depth image; normalizing the depth information of the depth image and the RGB information of the color image together, and outputting normalized four-channel information, wherein the normalized four-channel information comprises the normalized RGB information and the normalized depth information; obtaining a set size of a face candidate frame based on depth information of a depth image and internal parameters of a camera for shooting the depth image; determining a face candidate frame with a score higher than a set threshold value through a pre-trained neural network model based on the normalized four-channel information and the set size of the face candidate frame; and determining a target candidate frame as a target face area based on the face candidate frame with the score higher than a set threshold value.
Referring to fig. 6, a block diagram of another embodiment of the face detection system of the present application is shown. The difference between this embodiment and the system shown in fig. 5 is that the depth image and the color image registered with the depth image are taken by a camera 610 with depth information.
In the embodiments of the present application, the computer readable storage medium may take various forms, for example: a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, a DVD, etc.), or a similar storage medium, or a combination thereof. In particular, the computer readable medium may even be paper or another suitable medium upon which the program is printed, from which the program can be electronically captured (e.g., optically scanned), compiled, interpreted, and processed in a suitable manner, and then stored in a computer medium.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.
Claims (10)
1. A face detection method, comprising:
obtaining a depth image and a color image registered with the depth image;
normalizing the depth information of the depth image and the RGB information of the color image together, and outputting normalized four-channel information, wherein the normalized four-channel information comprises the normalized RGB information and the normalized depth information;
obtaining a set size of a face candidate frame based on depth information of a depth image and internal parameters of a camera for shooting the depth image;
determining a face candidate frame with a score higher than a set threshold value through a pre-trained neural network model based on the normalized four-channel information and the set size of the face candidate frame;
and determining a target candidate frame as a target face area based on the face candidate frame with the score higher than a set threshold value.
2. The method of claim 1, wherein the depth image and the color image are registered by:
shooting both the depth image and the color image with a camera having depth information; or
Under the same scene, a depth camera and a color camera are calibrated by the same method, wherein the depth camera is used for shooting a depth image, and the color camera is used for shooting a color image.
3. The method of claim 1, wherein the co-normalizing the depth information of the depth image and the RGB information of the color image comprises:
normalizing the depth information in an interval of 0-255 to obtain four-channel information in the same range, wherein the four-channel information comprises RGB information and depth information;
the four-channel information is normalized in the range of [0,1] or [ -1,1 ].
4. The method of claim 1, wherein the neural network model is trained by:
obtaining a sample depth image and a sample color image registered with the sample depth image;
marking the areas where the human faces are located in the sample depth image and the sample color image to generate marking data;
carrying out common normalization on the depth information of the sample depth image and the RGB information of the sample color image, and outputting normalized sample four-channel information, wherein the normalized sample four-channel information comprises normalized sample depth information and normalized sample RGB information;
and inputting the four-channel information of the normalized sample and the marked data into a neural network model for training until the iteration times are met or loss convergence is achieved.
5. The method of claim 1, wherein determining, by a pre-trained neural network model, a face candidate box with a score above a set threshold based on the normalized four-channel information and a set size of the face candidate box comprises:
selecting normalized four-channel information in a corresponding range through a human face candidate frame with a set size, and inputting the selected normalized four-channel information into a pre-trained neural network model, wherein the neural network model outputs pixel data with whether a human face pixel is marked or not;
sliding the face candidate frame by a set step length, traversing all normalized four-channel information, and outputting data corresponding to each selection;
and determining the face candidate frame with the score higher than a set threshold value based on the data output by each selection, wherein the score is obtained according to the number of face pixels contained in the face candidate frame.
6. The method according to claim 1, wherein the determining a target candidate box as the target face region based on the face candidate box with the score higher than a set threshold comprises:
and determining a target candidate box from the face candidate boxes with the scores higher than a set threshold value by using a non-maximum suppression (NMS) algorithm.
7. A face detection apparatus, comprising:
an image acquisition unit for obtaining a depth image and a color image registered with the depth image;
the normalization unit is used for normalizing the depth information of the depth image and the RGB information of the color image together and outputting normalized four-channel information, wherein the normalized four-channel information comprises the normalized RGB information and the normalized depth information;
a size acquisition unit for obtaining a set size of a face candidate frame based on depth information of a depth image and internal parameters of a camera that captures the depth image;
the first determining unit is used for determining the face candidate frame with the score higher than a set threshold value through a pre-trained neural network model based on the normalized four-channel information and the set size of the face candidate frame;
and a second determination unit configured to determine a target candidate frame as the target face region based on the face candidate frame whose score is higher than a set threshold.
8. A terminal device, comprising: an internal bus, and a memory, a processor and an external interface connected through the internal bus; wherein,
the external interface is used for obtaining a depth image and a color image which is registered with the depth image;
the memory is used for storing machine readable instructions corresponding to the face detection;
the processor is configured to read the machine-readable instructions on the memory and execute the instructions to implement the following operations:
normalizing the depth information of the depth image and the RGB information of the color image together, and outputting normalized four-channel information, wherein the normalized four-channel information comprises the normalized RGB information and the normalized depth information;
obtaining a set size of a face candidate frame based on depth information of a depth image and internal parameters of a camera for shooting the depth image;
determining a face candidate frame with a score higher than a set threshold value through a pre-trained neural network model based on the normalized four-channel information and the set size of the face candidate frame;
and determining a target candidate frame as a target face area based on the face candidate frame with the score higher than a set threshold value.
9. A face detection system, comprising: depth cameras, color cameras, and terminal devices, wherein,
the depth camera is used for shooting a depth image;
the color camera is used for capturing a color image, the depth camera being registered with the color camera;
the terminal equipment is used for obtaining a depth image and a color image which is registered with the depth image; normalizing the depth information of the depth image and the RGB information of the color image together, and outputting normalized four-channel information, wherein the normalized four-channel information comprises the normalized RGB information and the normalized depth information; obtaining a set size of a face candidate frame based on depth information of a depth image and internal parameters of a camera for shooting the depth image; determining a face candidate frame with a score higher than a set threshold value through a pre-trained neural network model based on the normalized four-channel information and the set size of the face candidate frame; and determining a target candidate frame as a target face area based on the face candidate frame with the score higher than a set threshold value.
10. A face detection system is characterized by comprising a camera with depth information and a terminal device, wherein,
the camera with the depth information is used for shooting a depth image and a color image which is registered with the depth image;
the terminal equipment is used for obtaining a depth image and a color image which is registered with the depth image; normalizing the depth information of the depth image and the RGB information of the color image together, and outputting normalized four-channel information, wherein the normalized four-channel information comprises the normalized RGB information and the normalized depth information; obtaining a set size of a face candidate frame based on depth information of a depth image and internal parameters of a camera for shooting the depth image; determining a face candidate frame with a score higher than a set threshold value through a pre-trained neural network model based on the normalized four-channel information and the set size of the face candidate frame; and determining a target candidate frame as a target face area based on the face candidate frame with the score higher than a set threshold value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910215573.9A CN109993086B (en) | 2019-03-21 | 2019-03-21 | Face detection method, device and system and terminal equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910215573.9A CN109993086B (en) | 2019-03-21 | 2019-03-21 | Face detection method, device and system and terminal equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109993086A true CN109993086A (en) | 2019-07-09 |
CN109993086B CN109993086B (en) | 2021-07-27 |
Family
ID=67129527
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910215573.9A Active CN109993086B (en) | 2019-03-21 | 2019-03-21 | Face detection method, device and system and terminal equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109993086B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110991412A (en) * | 2019-12-20 | 2020-04-10 | 北京百分点信息科技有限公司 | Face recognition method and device, storage medium and electronic equipment |
CN111160291A (en) * | 2019-12-31 | 2020-05-15 | 上海易维视科技有限公司 | Human eye detection method based on depth information and CNN |
CN111222468A (en) * | 2020-01-08 | 2020-06-02 | 浙江光珀智能科技有限公司 | People stream detection method and system based on deep learning |
CN111680574A (en) * | 2020-05-18 | 2020-09-18 | 北京的卢深视科技有限公司 | Face detection method and device, electronic equipment and storage medium |
CN111738995A (en) * | 2020-06-10 | 2020-10-02 | 苏宁云计算有限公司 | RGBD image-based target detection method and device and computer equipment |
CN111753658A (en) * | 2020-05-20 | 2020-10-09 | 高新兴科技集团股份有限公司 | Post sleep warning method and device and computer equipment |
CN111768511A (en) * | 2020-07-07 | 2020-10-13 | 湖北省电力装备有限公司 | Staff information recording method and device based on cloud temperature measurement equipment |
CN112115913A (en) * | 2020-09-28 | 2020-12-22 | 杭州海康威视数字技术股份有限公司 | Image processing method, device and equipment and storage medium |
WO2021135321A1 (en) * | 2019-12-30 | 2021-07-08 | 苏宁云计算有限公司 | Object positioning method and apparatus, and computer system |
CN113221812A (en) * | 2021-05-26 | 2021-08-06 | 广州织点智能科技有限公司 | Training method of face key point detection model and face key point detection method |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104134235A (en) * | 2014-07-25 | 2014-11-05 | 深圳超多维光电子有限公司 | Real space and virtual space fusion method and real space and virtual space fusion system |
CN104335588A (en) * | 2012-07-04 | 2015-02-04 | 英特尔公司 | A region of interest based framework for 3D video coding |
CN106874830A (en) * | 2016-12-12 | 2017-06-20 | 杭州视氪科技有限公司 | A kind of visually impaired people's householder method based on RGB D cameras and recognition of face |
CN107045631A (en) * | 2017-05-25 | 2017-08-15 | 北京华捷艾米科技有限公司 | Facial feature points detection method, device and equipment |
CN107145833A (en) * | 2017-04-11 | 2017-09-08 | 腾讯科技(上海)有限公司 | The determination method and apparatus of human face region |
CN107368810A (en) * | 2017-07-20 | 2017-11-21 | 北京小米移动软件有限公司 | Method for detecting human face and device |
US20170345146A1 (en) * | 2016-05-30 | 2017-11-30 | Beijing Kuangshi Technology Co., Ltd. | Liveness detection method and liveness detection system |
CN107688786A (en) * | 2017-08-30 | 2018-02-13 | 南京理工大学 | A kind of method for detecting human face based on concatenated convolutional neutral net |
CN107851192A (en) * | 2015-05-13 | 2018-03-27 | 北京市商汤科技开发有限公司 | For detecting the apparatus and method of face part and face |
CN107977650A (en) * | 2017-12-21 | 2018-05-01 | 北京华捷艾米科技有限公司 | Method for detecting human face and device |
CN108304820A (en) * | 2018-02-12 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of method for detecting human face, device and terminal device |
CN108596947A (en) * | 2018-03-27 | 2018-09-28 | 南京邮电大学 | A kind of fast-moving target tracking method suitable for RGB-D cameras |
- 2019-03-21 CN CN201910215573.9A patent/CN109993086B/en active Active
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104335588A (en) * | 2012-07-04 | 2015-02-04 | 英特尔公司 | A region of interest based framework for 3D video coding |
CN104134235A (en) * | 2014-07-25 | 2014-11-05 | 深圳超多维光电子有限公司 | Real space and virtual space fusion method and real space and virtual space fusion system |
CN107851192A (en) * | 2015-05-13 | 2018-03-27 | 北京市商汤科技开发有限公司 | For detecting the apparatus and method of face part and face |
US20170345146A1 (en) * | 2016-05-30 | 2017-11-30 | Beijing Kuangshi Technology Co., Ltd. | Liveness detection method and liveness detection system |
CN106874830A (en) * | 2016-12-12 | 2017-06-20 | 杭州视氪科技有限公司 | A kind of visually impaired people's householder method based on RGB D cameras and recognition of face |
CN107145833A (en) * | 2017-04-11 | 2017-09-08 | 腾讯科技(上海)有限公司 | The determination method and apparatus of human face region |
CN107045631A (en) * | 2017-05-25 | 2017-08-15 | 北京华捷艾米科技有限公司 | Facial feature points detection method, device and equipment |
CN107368810A (en) * | 2017-07-20 | 2017-11-21 | 北京小米移动软件有限公司 | Method for detecting human face and device |
CN107688786A (en) * | 2017-08-30 | 2018-02-13 | 南京理工大学 | A kind of method for detecting human face based on concatenated convolutional neutral net |
CN107977650A (en) * | 2017-12-21 | 2018-05-01 | 北京华捷艾米科技有限公司 | Method for detecting human face and device |
CN108304820A (en) * | 2018-02-12 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of method for detecting human face, device and terminal device |
CN108596947A (en) * | 2018-03-27 | 2018-09-28 | 南京邮电大学 | A kind of fast-moving target tracking method suitable for RGB-D cameras |
Non-Patent Citations (1)
Title |
---|
Zhao Yalong: "Research on 3D Face Recognition Based on Convolutional Neural Networks", China Masters' Theses Full-text Database, Information Science and Technology Series *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110991412A (en) * | 2019-12-20 | 2020-04-10 | 北京百分点信息科技有限公司 | Face recognition method and device, storage medium and electronic equipment |
WO2021135321A1 (en) * | 2019-12-30 | 2021-07-08 | 苏宁云计算有限公司 | Object positioning method and apparatus, and computer system |
CN111160291A (en) * | 2019-12-31 | 2020-05-15 | 上海易维视科技有限公司 | Human eye detection method based on depth information and CNN |
CN111160291B (en) * | 2019-12-31 | 2023-10-31 | 上海易维视科技有限公司 | Human eye detection method based on depth information and CNN |
CN111222468A (en) * | 2020-01-08 | 2020-06-02 | 浙江光珀智能科技有限公司 | People stream detection method and system based on deep learning |
CN111680574B (en) * | 2020-05-18 | 2023-08-04 | 合肥的卢深视科技有限公司 | Face detection method and device, electronic equipment and storage medium |
CN111680574A (en) * | 2020-05-18 | 2020-09-18 | 北京的卢深视科技有限公司 | Face detection method and device, electronic equipment and storage medium |
CN111753658A (en) * | 2020-05-20 | 2020-10-09 | 高新兴科技集团股份有限公司 | Post sleep warning method and device and computer equipment |
CN111738995A (en) * | 2020-06-10 | 2020-10-02 | 苏宁云计算有限公司 | RGBD image-based target detection method and device and computer equipment |
CN111768511A (en) * | 2020-07-07 | 2020-10-13 | 湖北省电力装备有限公司 | Staff information recording method and device based on cloud temperature measurement equipment |
CN112115913B (en) * | 2020-09-28 | 2023-08-25 | 杭州海康威视数字技术股份有限公司 | Image processing method, device and equipment and storage medium |
CN112115913A (en) * | 2020-09-28 | 2020-12-22 | 杭州海康威视数字技术股份有限公司 | Image processing method, device and equipment and storage medium |
CN113221812A (en) * | 2021-05-26 | 2021-08-06 | 广州织点智能科技有限公司 | Training method of face key point detection model and face key point detection method |
Also Published As
Publication number | Publication date |
---|---|
CN109993086B (en) | 2021-07-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109993086B (en) | Face detection method, device and system and terminal equipment | |
US10999519B2 (en) | Target tracking method and device, movable platform, and storage medium | |
CN110689037A (en) | Method and system for automatic object annotation using deep networks | |
CN111325769B (en) | Target object detection method and device | |
CN109685078B (en) | Infrared image identification method based on automatic annotation | |
WO2020252974A1 (en) | Method and device for tracking multiple target objects in motion state | |
CN103530599A (en) | Method and system for distinguishing real face and picture face | |
JP2010045613A (en) | Image identifying method and imaging device | |
CN108197604A (en) | Fast face positioning and tracing method based on embedded device | |
CN109034017A (en) | Head pose estimation method and machine readable storage medium | |
CN103902953B (en) | A kind of screen detecting system and method | |
US20190347530A1 (en) | Method and System for Identifying Targets in Scenes Shot by a Camera | |
CN111626941A (en) | Document correction method based on deep learning semantic segmentation | |
CN109934873B (en) | Method, device and equipment for acquiring marked image | |
CN112102141B (en) | Watermark detection method, watermark detection device, storage medium and electronic equipment | |
CN111368698A (en) | Subject recognition method, subject recognition device, electronic device, and medium | |
CN111583341B (en) | Cloud deck camera shift detection method | |
KR101741758B1 (en) | A Real-time Face Tracking Method Robust to Occlusion Based on Improved CamShift with Depth Information | |
JP2019178892A (en) | Image processor, method for processing image, and program | |
CN110751163B (en) | Target positioning method and device, computer readable storage medium and electronic equipment | |
JP7187830B2 (en) | Image processing program, image processing apparatus, and image processing method | |
US20210281742A1 (en) | Document detections from video images | |
US8538142B2 (en) | Face-detection processing methods, image processing devices, and articles of manufacture | |
CN116664817A (en) | Power device state change detection method based on image difference | |
CN111935480B (en) | Detection method for image acquisition device and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||