
CN112541576B - Biological living body identification neural network construction method of RGB monocular image - Google Patents

Biological living body identification neural network construction method of RGB monocular image

Info

Publication number
CN112541576B
Authority
CN
China
Prior art keywords
layer
attention
convolution
module
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011475744.0A
Other languages
Chinese (zh)
Other versions
CN112541576A (en)
Inventor
卢丽
韩强
闫超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yifei Technology Co ltd
Original Assignee
Sichuan Yifei Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yifei Technology Co ltd filed Critical Sichuan Yifei Technology Co ltd
Priority to CN202011475744.0A
Publication of CN112541576A
Application granted
Publication of CN112541576B

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a biological living body identification neural network for RGB monocular images, which comprises a root module, a plurality of repeatable modules, a feature extraction module and an output module connected sequentially from front to back. The root module comprises a convolution layer, a batch normalization layer and an activation layer connected and packaged sequentially from front to back. The repeatable module comprises a convolution layer, a batch normalization layer, an activation layer, a depth separable convolution layer, a batch normalization layer, an activation layer, a convolution layer and a batch normalization layer connected and packaged sequentially from front to back; if the repeatable module performs downsampling, a spatial attention layer is further provided at its tail end. The feature extraction module comprises a global average pooling layer, a full-connection layer, an activation layer and a regularization layer connected and packaged sequentially from front to back. The output module is a full-connection layer whose weight values are regularized. Through this scheme, the invention has the advantages of simple logic, low technical workload and high calculation accuracy.

Description

Biological living body identification neural network construction method of RGB monocular image
Technical Field
The invention relates to the technical field of biological living body identification in computer face recognition, and in particular to a biological living body identification neural network for RGB monocular images and a construction method thereof.
Background
In the technical field of living body identification within computer face recognition, face recognition technology uses a camera or sensor to collect face images and related information and performs functions such as identity comparison, identity confirmation and attribute identification. At present, computer face recognition is widely applied in security, attendance, finance, transportation, intelligent terminals and many other fields. In practical applications, it must be accurately determined whether the face collected by the camera or sensor comes from a living organism, i.e., a real natural person, rather than from a non-living attack such as counterfeiting with a photograph or a mobile-phone video. Biological living body identification is therefore essential to the safety and reliability of the whole face recognition process.
Currently, a variety of biological living body identification techniques exist: for example, using a 3D structured-light lens to reconstruct the three-dimensional structure of the shooting target, or using an infrared sensor to collect the target's infrared characteristic information. For example, the Chinese patent with application number 202011114943.9, titled "face vein combined face recognition method and device", uses an infrared camera to collect face vein images and an RGB camera to collect living and non-living face images, fuses the face vein images with the face photos to form preprocessed living and non-living face images, and combines the three-channel images to form a new face image. Because facial vein distributions are relatively complex, the merging operation is computationally expensive and increases hardware cost.
Likewise, the Chinese invention patent with application number 201710478894.9, titled "photo spoofing convolutional neural network training method and human face living body detection method", constructs a training set, acquires its images, detects the face in each image, crops and normalizes the face, and feeds it into a convolutional neural network comprising an input layer, several convolutional layers, a ReLU layer, a max-pooling layer, a full-connection layer, a Dropout layer and a SoftmaxWithLoss layer, which is then trained. The disadvantage is that such networks and loss functions demand high consistency between the distribution of the training samples and that of the samples seen in practice, and are prone to over-fitting. If the samples collected in practice differ markedly from the training samples because of illumination, hardware, environment or other factors, the classification accuracy of the neural network is low.
Further, the Chinese patent with application number 201911358984.X, titled "a multi-feature multi-model living body face recognition method", proceeds as follows: acquire the RGB face image to be identified; decompose the whole image area into several local areas and obtain the RGB image associated with each; transform each local RGB image into a corresponding HSV image, combine the RGB and HSV images into input image information, and output it; feed the input image information into each neural network model of the corresponding classification network model to obtain the model features of each local area; and feed the model features output by all neural network models into a feature output layer to form a feature output matrix, which is input into a fusion feature network model to produce the living body face recognition result for the RGB face image. Although segmentation yields multiple features and corresponding training models, the method depends directly on the resolution of the images and photos; if the resolution and size of a fake photo are close enough to those of a real living face, recognition errors still occur.
The above methods can therefore identify non-living attacks to a certain extent, but they all require specific hardware to collect the relevant information, at additional cost. For some applications, such as mobile internet applications where authentication must be performed with the phone's front camera, these techniques cannot be used.
Monocular RGB images provide more limited information than modalities such as 3D structured light, and in practical applications the environment, the illumination, the attack modes of non-living bodies during image acquisition, and the specifications and models of the acquisition cameras all vary widely. As a result, neural networks trained on monocular RGB images tend to have lower accuracy and are prone to over-fitting, i.e., they perform well in some scenarios while network performance degrades significantly in others.
Therefore, a biological living body recognition neural network that works with a common camera, together with a construction method that offers simple logic, low computational workload and high precision, is urgently needed.
Disclosure of Invention
Aiming at the problems, the invention aims to provide a biological living body identification neural network of RGB monocular images and a construction method thereof, and adopts the following technical scheme:
the biological living body identification neural network of the RGB monocular image comprises a root module, a plurality of repeatable modules, a feature extraction module and an output module which are sequentially connected from front to back;
the root module comprises a convolution layer, a batch normalization layer and an activation layer which are sequentially connected and packaged from front to back;
the repeatable module comprises a convolution layer, a batch normalization layer, an activation layer, a depth separable convolution layer, a batch normalization layer, an activation layer, a convolution layer and a batch normalization layer which are sequentially connected and packaged from front to back; if the repeatable module performs downsampling, a spatial attention layer is further provided at the tail end of the repeatable module;
the feature extraction module comprises a global average pooling layer, a full-connection layer, an activation layer and a regularization layer which are sequentially connected and packaged from front to back;
the output module is a full-connection layer with weight values subjected to regularization treatment.
A construction method of a biological living body identification neural network of RGB monocular images comprises the following steps:
the root module is obtained by connecting and packaging the convolution layer, the batch normalization layer and the activation layer from front to back in sequence;
sequentially connecting and packaging the convolution layer, the batch normalization layer, the activation layer, the depth separable convolution layer, the batch normalization layer, the activation layer, the convolution layer and the batch normalization layer from front to back to obtain a repeatable module; if the repeatable module performs downsampling, a spatial attention layer is further provided at the tail end of the repeatable module;
the feature extraction module is obtained by sequentially connecting and packaging a global average pooling layer, a full connecting layer, an activating layer and a regularization layer from front to back;
the full-connection layer with the weight value subjected to regularization treatment is used as an output module;
and sequentially connecting the root module, the repeatable modules, the feature extraction module and the output module to obtain the convolutional neural network.
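For illustration only, these construction steps can be sketched in code. The snippet below is a minimal tf.keras sketch of the root module from the first step; the framework choice and the ReLU activation are assumptions not fixed by the patent, while the 3×3 kernel and 32 output channels follow the embodiment described later. The remaining modules are sketched in the corresponding sections below.

```python
# Minimal sketch of the root module: convolution -> batch normalization ->
# activation, connected front to back. tf.keras and ReLU are assumptions;
# the 3x3 kernel and 32 output channels follow the embodiment below.
import tensorflow as tf
from tensorflow.keras import layers

def root_module(x):
    x = layers.Conv2D(32, kernel_size=3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)
```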
Further, the step size of the depth separable convolution layer is set to 2.
Still further, the input and output of the spatial attention layer are computed by the following steps:
applying the attention mapping function to the input feature tensor of the spatial attention layer to obtain an attention map with the same spatial size as the input feature tensor and a single channel;
multiplying the attention map element-wise with the input feature tensor, the attention map being broadcast along the channel direction, to obtain the weighted attention feature tensor, with the expression:
X′ = F_at(X) ⊙ X
where X is the input feature tensor of the spatial attention layer, X′ is the weighted attention feature tensor, ⊙ denotes element-wise multiplication, and F_at is the attention mapping function;
splicing the weighted attention feature tensor and the input feature tensor along the channel direction to obtain the output tensor of the spatial attention layer, with the expression:
X_out = concat([X, X′], axis="channel")
where X_out is the output tensor of the spatial attention layer, concat is the channel concatenation function, and the axis parameter specifies that the tensors are concatenated along the channel dimension.
Still further, the attention map of the spatial attention layer is generated by a first convolution layer, a second convolution layer and an activation layer using a sigmoid function, arranged and packaged sequentially from front to back; the attention mapping function is expressed as:
F_at(X) = sigmoid(Conv_2(Conv_1(X)))
where sigmoid is the sigmoid activation function, and Conv_1 and Conv_2 are the convolution operation functions of the first convolution layer and the second convolution layer, respectively.
Preferably, the convolution kernel size of the first convolution layer is 1x1, and the number of convolution output channels is 1; the convolution kernel size of the second convolution layer is one of 3x3, 5x5 and 7x7, and the number of convolution output channels is 1.
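Under the same assumptions (tf.keras, channels-last tensors), the spatial attention layer described above can be sketched as follows, taking the 1×1 kernel for the first convolution layer and the 3×3 option for the second; the single-channel attention map broadcasts over the channel dimension in the element-wise multiplication, and the input and weighted tensors are concatenated along the channel axis.

```python
# Sketch of the spatial attention layer (tf.keras assumed, channels-last).
# F_at(X) = sigmoid(Conv_2(Conv_1(X))) yields a single-channel attention map.
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(x):
    a = layers.Conv2D(1, kernel_size=1, padding="same")(x)  # Conv_1: 1x1, 1 channel
    a = layers.Conv2D(1, kernel_size=3, padding="same")(a)  # Conv_2: 3x3, 1 channel
    a = layers.Activation("sigmoid")(a)                     # attention map F_at(X)
    x_weighted = x * a   # X' = F_at(X) (.) X; the single channel broadcasts
    return layers.Concatenate(axis=-1)([x, x_weighted])     # X_out = concat([X, X'])
```

Note that the concatenation doubles the channel count of the module's output; the 1×1 convolution at the head of the next repeatable module restores the desired width.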
Preferably, the regularization layer employs L2 regularization.
Further, the full-connection layer with regularized weight values uses L2 regularization, and its number of output neurons is 1, with the expressions:
w = Normalize(w_0)
cosθ = Dense_w(Y)
where w_0 is the original weight of the full-connection layer, Normalize is the L2 regularization function, Dense_w is the full-connection operation function with weight w, Y is the input tensor of the full-connection layer, and cosθ is its output tensor, whose mathematical meaning is the cosine of the included angle between the vectors Y and w.
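A minimal sketch of this output module, assuming tf.keras, is a custom single-neuron layer whose weight vector is L2-normalized on every call and which uses no bias; the class name CosineDense is an illustrative choice, not the patent's. With the input feature Y already L2-normalized by the regularization layer of the feature extraction module, the output equals cosθ exactly.

```python
# Sketch of the weight-regularized full-connection output layer (tf.keras
# assumed). One output neuron, normalized weight vector, no bias: with an
# L2-normalized input feature Y, the single output is cos(theta).
import tensorflow as tf

class CosineDense(tf.keras.layers.Layer):
    def build(self, input_shape):
        self.w0 = self.add_weight(
            name="w0", shape=(int(input_shape[-1]), 1),
            initializer="glorot_uniform", trainable=True)

    def call(self, y):
        w = tf.math.l2_normalize(self.w0, axis=0)  # w = Normalize(w_0)
        return tf.matmul(y, w)                     # cos(theta) = Dense_w(Y), no bias
```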
Further, the output of the convolutional neural network is the cosine of the included angle between the feature vector output by the feature extraction module and the weight vector of the full-connection layer with regularized weight values.
When the training sample of the convolutional neural network is a living sample, the loss function of the neural network is:
L_live = max(0, s·(m − cosθ))
When the training sample of the convolutional neural network is a non-living sample, the loss function of the neural network is:
L_non-living = max(0, s·(cosθ − (m − 1)))
where cosθ is the output of the full-connection layer with regularized weight values, and s and m are hyperparameters: s may take any value between 10 and 90, and m any value between 0.2 and 0.7.
Preferably, the hyperparameter s takes the value 30 and the hyperparameter m takes the value 0.5.
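The following sketch implements the hinge reading of this asymmetric loss given above. Since the closed-form expressions appear only as images in the original publication, the exact formulas are an assumption; the code simply matches the described behavior: living samples are penalized until cosθ reaches m, while non-living samples are penalized only while their angle to the weight vector is smaller than arccos(m − 1).

```python
# Sketch of the asymmetric loss (tf.keras assumed; labels y_true in {1, 0},
# 1 = living). The hinge forms are a reconstruction consistent with the
# described behavior, not the patent's verbatim formulas.
import tensorflow as tf

def asymmetric_liveness_loss(y_true, cos_theta, s=30.0, m=0.5):
    y_true = tf.cast(y_true, cos_theta.dtype)
    live_loss = tf.nn.relu(s * (m - cos_theta))            # pull angle below arccos(m)
    spoof_loss = tf.nn.relu(s * (cos_theta - (m - 1.0)))   # push angle past arccos(m-1)
    per_sample = y_true * live_loss + (1.0 - y_true) * spoof_loss
    return tf.reduce_mean(per_sample)
```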
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention skillfully adopts a spatial attention layer; the weighted attention feature tensor within it carries spatial attention information, so spatial structure information that may appear in attack images, such as a screen frame, can be attended to more accurately, effectively improving network accuracy;
(2) The invention concatenates the weighted attention feature tensor and the input tensor of the attention layer along the channel dimension; the weighted attention feature tensor attends accurately to spatial structure information, while the input tensor of the attention layer better preserves texture detail, such as the moiré patterns that may appear in attack images when a screen is re-photographed; this channel concatenation exploits both kinds of information and thereby improves network accuracy;
(3) The spatial attention layer is used only when a repeatable module performs downsampling; if the repeatable module does not downsample, no attention layer is added, so network accuracy is improved while the parameters and the operation count of the network grow only slightly;
(4) In the feature extraction and output layers, the invention regularizes both the extracted features and the weight values of the output full-connection layer. During training, the loss function takes an asymmetric form. For a living sample, the included angle between its feature vector and the weight vector of the weight-regularized full-connection layer is compressed to be as small as possible. For a non-living sample, the loss penalizes samples whose angle to that weight vector is smaller than arccos(m − 1) and imposes no penalty on samples whose angle is larger than arccos(m − 1).
(5) The traditional cross-entropy loss function penalizes all non-living samples and therefore easily over-fits the training samples. The loss function of the invention reduces this over-fitting and improves the generalization performance of the network, so the trained network performs well across different application scenarios.
In conclusion, the invention has the advantages of simple logic, low technical workload and high calculation accuracy, and has high practical and popularization value in the technical field of biological living body identification in computer face recognition.
Drawings
For a clearer description of the technical solutions of the embodiments of the present invention, the drawings to be used in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and should not be considered as limiting the scope of protection, and other related drawings may be obtained according to these drawings without the need of inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of a root module structure according to the present invention.
Fig. 2 is a schematic diagram of a repeatable module configuration without downsampling according to the present invention.
Fig. 3 is a schematic diagram of a repeatable module configuration for downsampling according to the present invention.
Fig. 4 is a schematic view of a spatial attention layer structure of the present invention.
Fig. 5 is a schematic structural diagram of a feature extraction module according to the present invention.
Fig. 6 is a schematic structural diagram of an output module according to the present invention.
Fig. 7 is a schematic diagram of the overall structure of the present invention.
Detailed Description
To make the objectives, technical solutions and advantages of the present application clearer, the invention is further described below with reference to the accompanying drawings and examples; embodiments of the invention include, but are not limited to, the following examples. All other embodiments obtained by a person of ordinary skill in the art without inventive effort on the basis of the embodiments herein fall within the scope of protection of the present application.
Examples
As shown in fig. 1 to 7, this embodiment provides a biological living body identification neural network for RGB monocular images and a construction method thereof. Compared with common image-based liveness classification methods, the invention uses a spatial attention mechanism, regularized feature extraction and an asymmetric loss function, which improve the accuracy of the neural network while giving it good generalization performance across different application scenarios.
In the first step, the root module is obtained by sequentially connecting and packaging the convolution layer, the batch normalization layer and the activation layer from front to back. In this embodiment, the convolution kernel size of the convolution layer is set to 3×3 and the number of output channels is 32.
In the second step, a convolution layer, a batch normalization layer, an activation layer, a depth separable convolution layer, a batch normalization layer, an activation layer, a convolution layer, a batch normalization layer and an optional spatial attention layer are sequentially connected and packaged from front to back to obtain a repeatable module.
The following should be noted: if downsampling is needed in the repeatable module, it is performed in the depth separable convolution layer, i.e., the step size of that layer is set to 2, and a spatial attention layer is added at the end of the module. If no downsampling is performed, the step size of the depth separable convolution layer is set to 1 and no spatial attention layer is added. A sketch of this module is given below.
In this embodiment, the input and output of the spatial attention layer are computed by the following steps:
(1) Applying the attention mapping function to the input feature tensor of the spatial attention layer to obtain an attention map with the same spatial size as the input feature tensor and a single channel;
(2) Element multiplication: the attention map is multiplied element-wise with the input feature tensor, the attention map being broadcast along the channel direction, to obtain the weighted attention feature tensor, with the expression:
X′ = F_at(X) ⊙ X
where X is the input feature tensor of the spatial attention layer, X′ is the weighted attention feature tensor, ⊙ denotes element-wise multiplication, and F_at is the attention mapping function, whose output is a single-channel attention map of the same spatial size as the input tensor.
(3) Splicing the weighted attention feature tensor and the input feature tensor along the channel direction to obtain the output tensor of the spatial attention layer, with the expression:
X_out = concat([X, X′], axis="channel")
where X_out is the output tensor of the spatial attention layer, concat is the channel concatenation function, and the axis parameter specifies that the tensors are concatenated along the channel dimension.
In this embodiment, the attention map of the spatial attention layer is generated by a first convolution layer, a second convolution layer and an activation layer using a sigmoid function, arranged and packaged sequentially from front to back; the attention mapping function is expressed as:
F_at(X) = sigmoid(Conv_2(Conv_1(X)))
where sigmoid is the sigmoid activation function, and Conv_1 and Conv_2 are the convolution operation functions of the first convolution layer and the second convolution layer, respectively.
In this embodiment, 11 repeatable modules are arranged in total and connected sequentially. Downsampling, with a spatial attention layer, is performed at the 2nd, 6th, 10th and 11th modules. The convolution kernel of Conv_1 is 1×1 with one output channel; the convolution kernel of Conv_2 is 3×3, also with one output channel.
In the third step, the global average pooling layer, the full-connection layer, the activation layer and the regularization layer are sequentially connected and packaged from front to back to obtain the feature extraction module; the number of output units of the full-connection layer in this step is 256.
In the fourth step, the full-connection layer with regularized weight values (i.e., the special full-connection layer) is used as the output module. The main differences between the special full-connection layer and an ordinary one are that the weight values of the special full-connection layer are regularized and that it uses no bias.
In this embodiment, when the training sample of the convolutional neural network is a living sample, the loss function of the neural network is:
L_live = max(0, s·(m − cosθ))
and when the training sample is a non-living sample, the loss function of the neural network is:
L_non-living = max(0, s·(cosθ − (m − 1)))
where cosθ is the output of the full-connection layer with regularized weight values, the hyperparameter s takes the value 30, and the hyperparameter m takes the value 0.5.
In the fifth step, the root module, the repeatable modules, the feature extraction module and the output module are connected in sequence to obtain the living body identification neural network, as sketched below.
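An end-to-end assembly sketch under the same assumptions ties the five steps together; the input size and channel width are illustrative, while the module count, the downsampling positions and the 256-unit feature layer follow this embodiment, and the helper functions come from the earlier sketches.

```python
# End-to-end assembly sketch: root module, 11 repeatable modules with
# downsampling + spatial attention at positions 2, 6, 10 and 11, feature
# extraction (GAP -> Dense(256) -> ReLU -> L2 normalization), cosine output.
# Input size and channel width are assumptions; helpers are defined above.
import tensorflow as tf
from tensorflow.keras import layers

def build_liveness_network(input_shape=(112, 112, 3), channels=64):
    inputs = tf.keras.Input(shape=input_shape)
    x = root_module(inputs)
    for i in range(1, 12):  # 11 repeatable modules, numbered 1..11
        x = repeatable_module(x, channels, downsample=i in (2, 6, 10, 11))
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(256)(x)
    x = layers.ReLU()(x)
    x = tf.math.l2_normalize(x, axis=-1)  # regularization layer of the extractor
    cos_theta = CosineDense()(x)          # weight-regularized output layer
    return tf.keras.Model(inputs, cos_theta)

model = build_liveness_network()
model.compile(optimizer="adam", loss=asymmetric_liveness_loss)
```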
to verify the feasibility and good performance of the method, the present example was tested with a proprietary live dataset. The data set contains 5 thousands of real person images collected under good illumination conditions, and 2 thousands of attack samples are respectively attacked by using a mobile phone screen and a photo. The test set contains test sets with good illumination (the same scene) and darker illumination (other scenes). The control group used the same training set and test set, but in the neural network used, the spatial attention layer was removed, the special fully-connected layer was replaced with the normal fully-connected layer, and the loss function of the neural network was replaced with the Softmax function commonly used for the classification network. The experimental results are as follows:
as can be seen from the results, the recognition accuracy of the embodiment is high, and the recognition method has good generalization performance for different application scenes, particularly scenes not included in the training set.
In summary, for the field of biological living body identification using RGB monocular images, the invention improves the recognition accuracy of the network by adding a spatial attention layer, a special full-connection layer and an asymmetric loss function, and generalizes well to different application scenarios. Compared with similar technologies, the invention has outstanding substantive features and represents notable progress, with high practical and popularization value.
The above embodiments are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention, but all changes made by adopting the design principle of the present invention and performing non-creative work on the basis thereof shall fall within the scope of the present invention.

Claims (6)

1. A construction method of a biological living body recognition neural network for RGB monocular images, characterized in that the neural network comprises a root module, a plurality of repeatable modules, a feature extraction module and an output module which are sequentially connected from front to back;
the root module comprises a convolution layer, a batch normalization layer and an activation layer which are sequentially connected and packaged from front to back;
the repeatable module comprises a convolution layer, a batch normalization layer, an activation layer, a depth separable convolution layer, a batch normalization layer, an activation layer, a convolution layer and a batch normalization layer which are sequentially connected and packaged from front to back; if the repeatable module performs downsampling, a spatial attention layer is further provided at the tail end of the repeatable module;
the feature extraction module comprises a global average pooling layer, a full-connection layer, an activation layer and a regularization layer which are sequentially connected and packaged from front to back;
the output module is a full-connection layer with weight values subjected to regularization treatment;
the input and output of the spatial attention layer adopts the following steps:
using the input characteristic tensor of the space attention module and combining an attention mapping function to obtain an attention map which has the same size as the input characteristic tensor and has a channel of 1;
element multiplication is carried out on attention force diagram and input characteristic tensor, broadcast expansion is carried out on attention force diagram in channel direction, and weighted attention characteristic tensor is obtained, wherein the expression is as follows:
where X is the input feature tensor of the spatial attention module, X' is the weighted attention feature tensor,for element multiplication operations, F at Is an attention mapping function;
splicing the weighted attention feature tensor and the input feature tensor along the channel direction to obtain the output tensor of the spatial attention layer, with the expression:
X_out = concat([X, X′], axis="channel")
where X_out is the output tensor of the spatial attention layer, concat is the channel concatenation function, and the axis parameter specifies that the tensors are concatenated along the channel dimension;
the attention of the spatial attention module is generated by adopting a first convolution layer, a second convolution layer and an activation layer adopting a sigmoid function, wherein the first convolution layer, the second convolution layer and the activation layer are sequentially arranged and packaged from front to back; and the expression of the attention mapping function is:
F at (X)=sigmoid(Conv 2 (Conv 1 (X)))
wherein sigmoid is a sigmoid activation function, conv 1 And Conv 2 Convolution operation functions of the first convolution layer and the second convolution layer respectively;
the output of the neural network is the cosine quantity of an included angle between the feature vector output by the feature extraction module and the weight value vector of the full-connection layer with the weight value subjected to regularization treatment:
when the training sample of the neural network is a living sample, the loss function of the neural network is:
L_live = max(0, s·(m − cosθ))
when the training sample of the neural network is a non-living sample, the loss function of the neural network is:
L_non-living = max(0, s·(cosθ − (m − 1)))
where cosθ is the output of the full-connection layer with regularized weight values, and s and m are hyperparameters: s may take any value between 10 and 90, and m any value between 0.2 and 0.7.
2. The method for constructing a biological living body recognition neural network for RGB monocular images according to claim 1, wherein the step size of the depth separable convolutional layer is set to 2.
3. The construction method for a biological living body recognition neural network for RGB monocular images according to claim 1, wherein the convolution kernel size of the first convolution layer is 1x1, and the number of convolution output channels is 1; the convolution kernel size of the second convolution layer is one of 3x3, 5x5 and 7x7, and the number of convolution output channels is 1.
4. The method for constructing a biological living body recognition neural network for RGB monocular images according to claim 1, wherein the regularization layer adopts L2 regularization.
5. The construction method of a biological living body recognition neural network for RGB monocular images according to claim 1, wherein the full-connection layer with regularized weight values uses L2 regularization and its number of output neurons is 1, with the expressions:
w = Normalize(w_0)
cosθ = Dense_w(Y)
where w_0 is the original weight of the full-connection layer, Normalize is the L2 regularization function, Dense_w is the full-connection operation function with weight w, Y is the input tensor of the full-connection layer, and cosθ is its output tensor, whose mathematical meaning is the cosine of the included angle between the vectors Y and w.
6. The construction method of a biological living body recognition neural network for RGB monocular images according to claim 1, wherein the hyperparameter s takes the value 30 and the hyperparameter m takes the value 0.5.
CN202011475744.0A 2020-12-14 2020-12-14 Biological living body identification neural network construction method of RGB monocular image Active CN112541576B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011475744.0A CN112541576B (en) 2020-12-14 2020-12-14 Biological living body identification neural network construction method of RGB monocular image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011475744.0A CN112541576B (en) 2020-12-14 2020-12-14 Biological living body identification neural network construction method of RGB monocular image

Publications (2)

Publication Number Publication Date
CN112541576A CN112541576A (en) 2021-03-23
CN112541576B (en) 2024-02-20

Family

ID=75020167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011475744.0A Active CN112541576B (en) 2020-12-14 2020-12-14 Biological living body identification neural network construction method of RGB monocular image

Country Status (1)

Country Link
CN (1) CN112541576B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115272130A (en) * 2022-08-22 2022-11-01 苏州大学 Image moire removing system and method based on multispectral cascade normalization

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110246148A (en) * 2019-05-27 2019-09-17 浙江科技学院 The conspicuousness detection method of multi-modal depth information fusion and attention study
CN110335212A (en) * 2019-06-28 2019-10-15 西安理工大学 Defect ancient books Chinese character restorative procedure based on condition confrontation network
CN110543878A (en) * 2019-08-07 2019-12-06 华南理工大学 pointer instrument reading identification method based on neural network
WO2020107847A1 (en) * 2018-11-28 2020-06-04 平安科技(深圳)有限公司 Bone point-based fall detection method and fall detection device therefor
CN111832216A (en) * 2020-04-14 2020-10-27 新疆大学 Rolling bearing residual service life prediction method based on EEMD-MCNN-GRU
CN111882002A (en) * 2020-08-06 2020-11-03 桂林电子科技大学 MSF-AM-based low-illumination target detection method
WO2020191390A3 (en) * 2019-03-21 2020-11-12 Illumina, Inc. Artificial intelligence-based quality scoring
CN112036454A (en) * 2020-08-17 2020-12-04 上海电力大学 Image classification method based on multi-core dense connection network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3901833A1 (en) * 2018-01-15 2021-10-27 Illumina, Inc. Deep learning-based variant classifier

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020107847A1 (en) * 2018-11-28 2020-06-04 平安科技(深圳)有限公司 Bone point-based fall detection method and fall detection device therefor
WO2020191390A3 (en) * 2019-03-21 2020-11-12 Illumina, Inc. Artificial intelligence-based quality scoring
CN110246148A (en) * 2019-05-27 2019-09-17 浙江科技学院 The conspicuousness detection method of multi-modal depth information fusion and attention study
CN110335212A (en) * 2019-06-28 2019-10-15 西安理工大学 Defect ancient books Chinese character restorative procedure based on condition confrontation network
CN110543878A (en) * 2019-08-07 2019-12-06 华南理工大学 pointer instrument reading identification method based on neural network
CN111832216A (en) * 2020-04-14 2020-10-27 新疆大学 Rolling bearing residual service life prediction method based on EEMD-MCNN-GRU
CN111882002A (en) * 2020-08-06 2020-11-03 桂林电子科技大学 MSF-AM-based low-illumination target detection method
CN112036454A (en) * 2020-08-17 2020-12-04 上海电力大学 Image classification method based on multi-core dense connection network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Graph-Guided Architecture Search for Real-Time Semantic Segmentation; P. Lin et al.; 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); full text *
Research on an intelligent signal recognition method based on deep attention capsules; Wu Yacong; China Master's Theses Electronic Journals (No. 2); full text *

Also Published As

Publication number Publication date
CN112541576A (en) 2021-03-23

Similar Documents

Publication Publication Date Title
CN112949565B (en) Single-sample partially-shielded face recognition method and system based on attention mechanism
Reddy et al. Spontaneous facial micro-expression recognition using 3D spatiotemporal convolutional neural networks
Li et al. In ictu oculi: Exposing ai generated fake face videos by detecting eye blinking
CN112215180B (en) Living body detection method and device
EP3287943B1 (en) Liveness test method and liveness test computing apparatus
Menotti et al. Deep representations for iris, face, and fingerprint spoofing detection
CN108345818B (en) Face living body detection method and device
WO2021248733A1 (en) Live face detection system applying two-branch three-dimensional convolutional model, terminal and storage medium
Wang et al. Eye recognition with mixed convolutional and residual network (MiCoRe-Net)
CN107808115A (en) A kind of biopsy method, device and storage medium
WO2021218238A1 (en) Image processing method and image processing apparatus
CN112084917A (en) Living body detection method and device
CN112801015A (en) Multi-mode face recognition method based on attention mechanism
Cun et al. Image splicing localization via semi-global network and fully connected conditional random fields
CN110222718A (en) The method and device of image procossing
CN111209873A (en) High-precision face key point positioning method and system based on deep learning
CN111814682A (en) Face living body detection method and device
Allaert et al. Optical flow techniques for facial expression analysis: Performance evaluation and improvements
WO2024077781A1 (en) Convolutional neural network model-based image recognition method and apparatus, and terminal device
Silwal et al. A novel deep learning system for facial feature extraction by fusing CNN and MB-LBP and using enhanced loss function
CN112541576B (en) Biological living body identification neural network construction method of RGB monocular image
Duffner et al. A neural scheme for robust detection of transparent logos in TV programs
El Madmoune et al. Robust face recognition using convolutional neural networks combined with Krawtchouk moments.
CN112308035A (en) Image detection method, image detection device, computer equipment and storage medium
Wang et al. IPNet: Polarization-based Camouflaged Object Detection via dual-flow network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant