Disclosure of Invention
Based on this, it is necessary to provide a triplet loss-based face authentication method, apparatus, computer device and storage medium to address the low reliability of conventional face authentication methods.
A triplet loss-based face authentication method comprises the following steps:
acquiring a certificate photo and a scene photo of a person based on the face authentication request;
respectively carrying out face detection, key point positioning and image preprocessing on the scene photo and the certificate photo to obtain a scene face image corresponding to the scene photo and a certificate face image corresponding to the certificate photo;
inputting the scene face image and the certificate face image into a pre-trained convolutional neural network model for face authentication, and acquiring a first feature vector corresponding to the scene face image and a second feature vector corresponding to the certificate face image, which are output by the convolutional neural network model; wherein the convolutional neural network model is obtained based on supervised training of a triplet loss function;
calculating the cosine distance between the first feature vector and the second feature vector;
and comparing the cosine distance with a preset threshold value, and determining a face authentication result according to the comparison result.
In one embodiment, the method further comprises:
acquiring labeled training samples, wherein the training samples comprise one certificate face image and at least one scene face image labeled as belonging to each labeled object;
training a convolutional neural network model according to the training samples, and generating triplet elements corresponding to the training samples through online hard example mining (OHEM); the triplet elements include a reference sample, a positive sample, and a negative sample;
training the convolutional neural network model under the supervision of a triplet loss function according to the triplet elements of each training sample; the triplet loss function uses the cosine distance as its metric and optimizes the model parameters through a stochastic gradient descent algorithm;
and inputting the verification set data into the convolutional neural network model, and obtaining the trained convolutional neural network model for face authentication when the training end condition is reached.
In another embodiment, the step of training a convolutional neural network model according to the training samples and generating the triplet elements corresponding to each training sample through OHEM includes:
randomly selecting an image as the reference sample, and selecting an image which belongs to the same labeled object but to a different category from the reference sample as the positive sample;
according to the OHEM strategy, extracting cosine distances among features by using the currently trained convolutional neural network model, and, for each reference sample, selecting from the images that do not belong to the same labeled object the image which is closest to the reference sample in the feature space and belongs to a different category from the reference sample as the negative sample of that reference sample.
In another embodiment, the triplet loss function includes a definition of cosine distances for homogeneous samples and a definition of cosine distances for heterogeneous samples.
In another embodiment, the triplet loss function is:

$$L = \sum_{i=1}^{N} \left\{ \left[ \cos\left(f(x_i^a), f(x_i^n)\right) - \cos\left(f(x_i^a), f(x_i^p)\right) + \alpha_1 \right]_+ + \left[ \alpha_2 - \cos\left(f(x_i^a), f(x_i^p)\right) \right]_+ \right\}$$

wherein $\cos(\cdot,\cdot)$ represents the cosine distance, calculated as $\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}$; $N$ is the number of triplets; $f(x_i^a)$ represents the feature vector of the reference sample; $f(x_i^p)$ represents the feature vector of the homogeneous positive sample; $f(x_i^n)$ represents the feature vector of the heterogeneous negative sample; $[\cdot]_+$ denotes $\max(\cdot, 0)$; $\alpha_1$ is an inter-class interval parameter; and $\alpha_2$ is an intra-class interval parameter.
In another embodiment, the method further comprises: initializing with base model parameters pre-trained on massive open-source face data, and adding a normalization layer and a triplet loss function layer after the feature output layer to obtain the convolutional neural network model to be trained.
A triplet loss-based face authentication device comprises an image acquisition module, an image preprocessing module, a feature acquisition module, a calculation module and an authentication module;
the image acquisition module is used for acquiring a certificate photo and a scene photo of a person based on the face authentication request;
the image preprocessing module is used for respectively carrying out face detection, key point positioning and image preprocessing on the scene photo and the certificate photo to obtain a scene face image corresponding to the scene photo and a certificate face image corresponding to the certificate photo;
the feature acquisition module is used for inputting the scene face image and the certificate face image into a pre-trained convolutional neural network model for face authentication, and acquiring a first feature vector corresponding to the scene face image and a second feature vector corresponding to the certificate face image, which are output by the convolutional neural network model; wherein the convolutional neural network model is obtained based on supervised training of a triplet loss function;
the calculation module is used for calculating the cosine distance between the first feature vector and the second feature vector;
and the authentication module is used for comparing the cosine distance with a preset threshold value and determining a face authentication result according to the comparison result.
In another embodiment, the apparatus further comprises a sample acquisition module, a triplet acquisition module, a training module and a verification module;
the sample acquisition module is used for acquiring labeled training samples, wherein the training samples comprise one certificate face image and at least one scene face image labeled as belonging to each labeled object;
the triplet acquisition module is used for training a convolutional neural network model according to the training samples and generating triplet elements corresponding to the training samples through OHEM; the triplet elements include a reference sample, a positive sample, and a negative sample;
the training module is used for training the convolutional neural network model under the supervision of a triplet loss function according to the triplet elements of each training sample; the triplet loss function uses the cosine distance as its metric and optimizes the model parameters through a stochastic gradient descent algorithm;
and the verification module is used for inputting verification set data into the convolutional neural network model to obtain the trained convolutional neural network model for face authentication when a training ending condition is reached.
A computer device comprises a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor implements the steps of the triplet loss-based face authentication method when executing the computer program.
A storage medium has a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the triplet loss-based face authentication method described above.
The triplet loss-based face authentication method, apparatus, computer device and storage medium perform face authentication with a pre-trained convolutional neural network. Because the convolutional neural network model is obtained through supervised training with a triplet loss function, the similarity between the scene face image and the certificate face image is obtained by calculating the cosine distance between the first feature vector corresponding to the scene face image and the second feature vector corresponding to the certificate face image. The cosine distance measures the angle between spatial vectors and thus reflects differences in direction, which better matches the distribution of the face feature space and improves the reliability of face authentication.
Detailed Description
Fig. 1 is a schematic structural diagram of a triplet loss-based face authentication system according to an embodiment. As shown in fig. 1, the face authentication system includes a server 101 and an image capture device 102. The server 101 is connected to the image capture device 102 via a network. The image capture device 102 captures a real-time scene photo and a certificate photo of the user to be authenticated, and transmits both to the server 101. The server 101 determines whether the person in the scene photo and the person in the certificate photo are the same person, thereby authenticating the identity of the user. Depending on the specific application scenario, the image capture device 102 may be a camera or a user terminal with a camera function. At an account-opening site, for example, the image capture device 102 may be a camera; for account opening via the internet, the image capture device 102 may be a mobile terminal with a camera function.
In other embodiments, the face authentication system may further include a card reader for reading a certificate photo in a certificate (e.g., identification card, etc.) chip.
Fig. 2 is a flowchart of a triplet loss-based face authentication method in an embodiment. As shown in fig. 2, the method includes:
s202, acquiring a certificate photo and a scene photo of a person based on the face authentication request.
The certificate photo refers to a photo associated with a certificate capable of proving the identity of a person, such as the photo printed on an identity card or the photo stored in the certificate chip. The certificate photo can be obtained by photographing the certificate, or by reading the photo stored in the certificate chip with a card reader. The certificate in this embodiment may be an identity card, a driving license, a social security card, or the like.
The scene photo of a person refers to a photo of the user to be authenticated collected in the field environment at the time of authentication. The field environment refers to the environment where the user is located when the photo is taken, and is not limited. The scene photo may be acquired by a mobile terminal with a camera function, which captures the photo and sends it to the server.
Face authentication means comparing the scene photo of a person collected on site with the certificate photo in the identity information and judging whether they show the same person. The face authentication request is triggered by an actual application operation, for example an account-opening request from a user. The application program prompts the user to collect a photo on the display interface of the user terminal and, once collection is finished, sends the collected photo to the server for face authentication.
And S204, respectively carrying out face detection, key point positioning and image preprocessing on the scene photo and the certificate photo to obtain a scene face image corresponding to the scene photo and a certificate face image corresponding to the certificate photo.
Face detection refers to recognizing a photo and acquiring a face region in the photo.
Key point positioning means obtaining the positions of the face key points in each picture within the detected face region. The face key points include the eyes, nose tip, mouth corners, eyebrows and the contour points of each part of the face.
In this embodiment, the cascaded convolutional neural network MTCNN method based on multi-task joint learning may be used to complete face detection and face key point detection simultaneously; alternatively, a face detection method based on LBP features together with a face key point detection method based on shape regression may be used.
Image preprocessing refers to performing face alignment and cropping according to the positions of the detected face key points in each picture, thereby obtaining a size-normalized scene face image and certificate face image. The scene face image is the face image obtained by performing face detection, key point positioning and image preprocessing on the scene photo, and the certificate face image is the face image obtained by performing the same processing on the certificate photo.
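To make this preprocessing stage concrete, the following is a minimal sketch of landmark-based alignment and cropping in Python. The five-point template coordinates and the detect_face_landmarks helper are illustrative assumptions (the patent names MTCNN as one possible detector but does not specify an interface), not values or APIs taken from this patent.

```python
import cv2
import numpy as np

# Reference positions of the five face key points (left eye, right eye,
# nose tip, left mouth corner, right mouth corner) in a 112x112 crop.
# These template values are a common convention, assumed for illustration.
TEMPLATE_5PTS = np.float32([
    [38.3, 51.7], [73.5, 51.5], [56.0, 71.7], [41.5, 92.4], [70.7, 92.2],
])

def align_and_crop(photo: np.ndarray, landmarks: np.ndarray,
                   size: int = 112) -> np.ndarray:
    """Warp the detected face so its key points match the template,
    yielding a size-normalized face image."""
    # Estimate a similarity transform (rotation + scale + translation)
    # from the detected landmarks to the template positions.
    matrix, _ = cv2.estimateAffinePartial2D(landmarks.astype(np.float32),
                                            TEMPLATE_5PTS)
    return cv2.warpAffine(photo, matrix, (size, size))

# Usage sketch, with detect_face_landmarks as a hypothetical detector
# (e.g., an MTCNN wrapper) returning a (5, 2) array of key points:
# scene_face = align_and_crop(scene_photo, detect_face_landmarks(scene_photo))
# id_face = align_and_crop(id_photo, detect_face_landmarks(id_photo))
```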
And S206, inputting the scene face image and the certificate face image into a pre-trained convolutional neural network model for face authentication, and acquiring a first feature vector corresponding to the scene face image output by the convolutional neural network model and a second feature vector corresponding to the certificate face image.
The convolutional neural network model is trained in advance on training samples under the supervision of a triplet loss function. The convolutional neural network comprises convolutional layers, pooling layers, activation function layers and a fully connected layer, and the parameters of each layer are determined through training. Using the trained convolutional neural network, the first feature vector of the scene face image and the second feature vector of the certificate face image are obtained from the fully connected layer through network forward propagation.
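As an illustration of this step, a forward pass that extracts the two feature vectors might look like the following PyTorch sketch; model stands for the trained network described above and is an assumption, since the patent does not name a specific architecture.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_feature(model: torch.nn.Module, face: torch.Tensor) -> torch.Tensor:
    """Run network forward propagation and return an L2-normalized
    feature vector for one preprocessed face image of shape (C, H, W)."""
    model.eval()
    embedding = model(face.unsqueeze(0))     # add batch dimension -> (1, D)
    return F.normalize(embedding, dim=1).squeeze(0)

# first_vec = extract_feature(model, scene_face_tensor)   # scene face image
# second_vec = extract_feature(model, id_face_tensor)     # certificate face image
```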
A triplet is formed by randomly selecting a sample from the training data set, called the reference sample, then randomly selecting a sample belonging to the same person as the positive sample and a sample not belonging to the same person as the negative sample, giving the triplet (reference sample, positive sample, negative sample). Because person-certificate comparison is mainly a comparison between a certificate photo and a scene photo, rather than certificate-to-certificate or scene-to-scene comparison, the triplets mainly take two combinations: when a certificate photo image is taken as the reference sample, the positive sample and the negative sample are both scene photos; when a scene photo image is taken as the reference sample, the positive sample and the negative sample are both certificate photos.
A parameter-shared network is trained on each sample in the triplet to obtain the feature expressions of the three elements. The objective of triplet loss training is to learn feature expressions such that the distance between the reference sample and the positive sample is as small as possible, the distance between the reference sample and the negative sample is as large as possible, and the two distances are separated by at least a minimum margin.
S208, calculating the cosine distance between the first feature vector and the second feature vector.
The cosine distance, also called cosine similarity, measures the difference between two individuals using the cosine of the angle between their vectors in a vector space. The larger the cosine distance between the first feature vector and the second feature vector, the larger the similarity between the scene face image and the certificate face image; the smaller the cosine distance, the smaller the similarity. When the cosine distance between the scene face image and the certificate face image is closer to 1, the probability that the two images belong to the same person is higher; when the cosine distance is smaller, that probability is lower.
The conventional triplet loss method uses the Euclidean distance to measure the similarity between samples. The Euclidean distance measures the absolute distance between points in space and is directly tied to their position coordinates, which does not match the distribution of the face feature space. In this embodiment, considering the distribution of the face feature space and the actual application scenario, the cosine distance is used to measure the similarity between samples. The cosine distance measures the angle between spatial vectors and reflects differences in direction rather than position, which better matches the distribution of the face feature space.
Specifically, the cosine distance is calculated as:

$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$

where x represents the first feature vector and y represents the second feature vector.
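Implemented directly from this formula, the computation is only a few lines; the sketch below uses NumPy, and the small epsilon added to the denominator is an implementation detail assumed for numerical safety, not part of the patent's formula.

```python
import numpy as np

def cosine_distance(x: np.ndarray, y: np.ndarray, eps: float = 1e-12) -> float:
    """Cosine of the angle between x and y: (x . y) / (|x| * |y|).
    Values near 1 indicate highly similar feature vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps))
```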
And S210, comparing the cosine distance with a preset threshold value, and determining a face authentication result according to the comparison result.
The authentication result comprises authentication passing, namely the certificate photo and the scene photo belong to the same person. The authentication result also includes authentication failure, i.e. the certificate photo and the scene photo do not belong to the same person.
Specifically, the cosine distance is compared with a preset threshold. When the cosine distance is greater than the preset threshold, the similarity between the certificate photo and the scene photo exceeds the threshold and the authentication passes; otherwise, the similarity is insufficient and the authentication fails.
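The decision step then reduces to a single comparison, reusing the cosine_distance helper sketched above; the threshold value below is a placeholder, since the patent only states that a preset threshold is used.

```python
def authenticate(first_vec, second_vec, threshold: float = 0.5) -> bool:
    """Return True (authentication passes) when the certificate face and
    the scene face are judged to belong to the same person."""
    similarity = cosine_distance(first_vec, second_vec)
    return similarity > threshold  # greater than the preset threshold: pass
```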
The triplet loss-based face authentication method performs face authentication with a pre-trained convolutional neural network. The convolutional neural network model is obtained through supervised training with a triplet loss function, and the similarity between the scene face image and the certificate face image is obtained by calculating the cosine distance between the first feature vector corresponding to the scene face image and the second feature vector corresponding to the certificate face image. Because the cosine distance measures the angle between spatial vectors and reflects differences in direction, it better matches the distribution of the face feature space, and the reliability of face authentication is improved.
In another embodiment, the face authentication method further comprises the step of training a convolutional neural network model for face authentication. Fig. 3 is a flow diagram of the steps in training a convolutional neural network model for face authentication in one embodiment. As shown in fig. 3, this step includes:
S302, obtaining labeled training samples, wherein the training samples comprise one certificate face image and at least one scene face image labeled as belonging to each labeled object.
In this embodiment, the labeled object is a person, and the training samples are labeled on a per-person basis with the scene face image and the certificate face image belonging to the same person. Specifically, the scene face image and the certificate face image can be obtained by performing face detection, key point positioning and image preprocessing on the labeled scene photo and certificate photo.
Face detection refers to recognizing a photo and acquiring a face region in the photo.
Key point positioning means obtaining the positions of the face key points in each picture within the detected face region. The face key points include the eyes, nose tip, mouth corners, eyebrows and the contour points of each part of the face.
In this embodiment, the cascaded convolutional neural network MTCNN method based on multi-task joint learning may be used to complete face detection and face key point detection simultaneously; alternatively, a face detection method based on LBP features together with a face key point detection method based on shape regression may be used.
Image preprocessing refers to performing face alignment and cropping according to the positions of the detected face key points in each picture, thereby obtaining a size-normalized scene face image and certificate face image. The scene face image is the face image obtained by performing face detection, key point positioning and image preprocessing on the scene photo, and the certificate face image is the face image obtained by performing the same processing on the certificate photo.
S304, training a convolutional neural network model according to the training samples, and generating triplet elements corresponding to the training samples through OHEM; the triplet elements include a reference sample, a positive sample, and a negative sample.
There are two combinations of triplets: when a certificate face image is taken as the reference sample, the positive sample and the negative sample are both scene photo images; when a scene photo image is taken as the reference sample, the positive sample and the negative sample are both certificate photo images.
Specifically, taking a certificate photo as the reference image, a certificate photo sample of one person is randomly selected from the training data set and called the reference sample; a scene photo sample belonging to the same person is then randomly selected as the positive sample, and a scene photo sample not belonging to the same person is selected as the negative sample, forming the (reference sample, positive sample, negative sample) triplet.
That is, the positive sample is a homogeneous sample of the reference sample, i.e., an image belonging to the same person, and the negative sample is a heterogeneous sample of the reference sample, i.e., an image not belonging to the same person. During training of the convolutional neural network, negative samples are chosen by constructing triplets online with an OHEM (Online Hard Example Mining) strategy: in each iterative optimization of the network, the current network performs a forward calculation on the candidate triplets, and the image that does not belong to the same person as the reference sample but is closest to it in cosine distance is selected as the negative sample, yielding the triplet elements corresponding to each training sample.
In one embodiment, the step of training the convolutional neural network according to the training samples and generating the triplet elements corresponding to each training sample includes the following steps S1 and S2:
S1: randomly selecting an image as the reference sample, and selecting an image which belongs to the same labeled object but to a different category from the reference sample as the positive sample.
The category refers to the image type to which a training sample belongs; in this embodiment, the categories are the scene face image and the certificate face image. Because face authentication mainly compares a certificate photo with a scene photo, the reference sample and the positive sample belong to different categories: if the reference sample is a scene face image, the positive sample is a certificate face image; if the reference sample is a certificate face image, the positive sample is a scene face image.
S2: according to the OHEM strategy, extracting cosine distances among features by using the currently trained convolutional neural network model, and, for each reference sample, selecting from the images that do not belong to the same labeled object the image which is closest to the reference sample and belongs to a different category as the negative sample of that reference sample.
The negative sample is selected from the face images whose labels indicate a different person from the reference sample. Specifically, the OHEM strategy constructs triplets online during training of the convolutional neural network: in each iterative optimization of the network, the current network performs a forward calculation on the candidate triplets, and the image that does not belong to the same person as the reference sample, is closest to it in cosine distance, and belongs to a different category is selected as the negative sample. That is, the negative sample is of a different category from the reference sample: if a certificate photo is taken as the reference sample of the triplet, both the positive sample and the negative sample are scene photos; conversely, if a scene photo is taken as the reference sample, both are certificate photos.
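A batch-level sketch of this OHEM step is given below. It assumes the embeddings are already L2-normalized (so a dot product equals the cosine similarity), and the names labels and is_id_photo, marking the person identity and the certificate-versus-scene category, are introduced here for illustration only.

```python
import torch

def mine_hard_negatives(embeddings: torch.Tensor, labels: torch.Tensor,
                        is_id_photo: torch.Tensor) -> torch.Tensor:
    """For each anchor, pick the most similar sample that belongs to a
    different person and to the opposite category (certificate vs. scene).
    embeddings: (B, D) L2-normalized; labels: (B,); is_id_photo: (B,) bool.
    Returns the index of the selected hard negative for each anchor."""
    sim = embeddings @ embeddings.t()                  # pairwise cosine similarity
    other_person = labels.unsqueeze(0) != labels.unsqueeze(1)
    other_category = is_id_photo.unsqueeze(0) != is_id_photo.unsqueeze(1)
    valid = other_person & other_category
    # Rule out forbidden pairs, then take the hardest (most similar) negative.
    sim = sim.masked_fill(~valid, float('-inf'))
    return sim.argmax(dim=1)
```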
S306, training the convolutional neural network model under the supervision of a triplet loss function according to the triplet elements of each training sample, wherein the triplet loss function uses the cosine distance as its metric and optimizes the model parameters through a stochastic gradient descent algorithm.
The person-certificate verification terminal verifies a user's identity by comparing whether the photo in the user's certificate chip is consistent with the scene photo. The data collected in the background for any single person is only two pictures, namely the certificate photo and the scene photo captured at the moment of comparison, yet the number of distinct individuals can run to thousands. If such data, with many classes but few samples per class, were trained with a classification-based method, the parameters of the classification layer would become so large that the network would be very difficult to learn; a metric learning approach is therefore adopted instead. A typical metric learning technique is the triplet loss method, which learns an effective feature mapping by constructing image triplets such that, under the mapping, the feature distance between homogeneous samples is smaller than that between heterogeneous samples, achieving correct comparison.
The objective of the triplet loss is to learn feature expressions such that the distance between the reference sample and the positive sample is as small as possible, the distance between the reference sample and the negative sample is as large as possible, and the two distances are separated by at least a minimum margin.
In another embodiment, the triplet loss function includes a definition of cosine distances for homogeneous samples and a definition of cosine distances for heterogeneous samples.
Here, a homogeneous pair refers to the reference sample and the positive sample, and a heterogeneous pair refers to the reference sample and the negative sample. The cosine distance of homogeneous samples is the cosine distance between the reference sample and the positive sample, and the cosine distance of heterogeneous samples is the cosine distance between the reference sample and the negative sample.
On one hand, the original triplet loss method only considers inter-class differences and not intra-class differences; if the intra-class distribution does not converge sufficiently, the generalization ability of the network is weakened and its adaptability to scenes is reduced. On the other hand, the original triplet loss method uses the Euclidean distance to measure the similarity between samples, whereas in practice the cosine distance is what is used in the feature comparison stage after the face model is deployed. The Euclidean distance measures the absolute distance between points in space and is directly tied to their position coordinates; the cosine distance measures the angle between spatial vectors and reflects differences in direction rather than position, which better matches the distribution of the face feature space.
The basic idea of the original triplet loss is to use metric learning to make the distance between the reference sample and the positive sample smaller than the distance between the reference sample and the negative sample, with the difference between the two distances larger than a minimum margin parameter $\alpha$. The original triplet loss function is:

$$L = \sum_{i=1}^{N} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right]_+$$

where $N$ is the number of triplets, $f(x_i^a)$ represents the feature vector of the reference sample (anchor), $f(x_i^p)$ represents the feature vector of the homogeneous positive sample (positive), $f(x_i^n)$ represents the feature vector of the heterogeneous negative sample (negative), $\|\cdot\|_2$ denotes the L2 norm, i.e. the Euclidean distance, and $[\cdot]_+$ denotes $\max(\cdot, 0)$.
it can be seen from the above equation that the original triplet loss function only defines the distance between the similar sample (anchor) and the heterogeneous sample (anchor), i.e. the inter-class distance is increased as much as possible by the interval parameter α, and the intra-class distance is not defined, i.e. the distance between the similar samples is not constrained.
In order to solve the problems, the invention provides an improved triplet loss method, which retains the definition of the inter-class distance in the original method and increases the constraint term of the intra-class distance, so that the intra-class distance is converged as much as possible. The loss function expression is as follows:
$$L = \sum_{i=1}^{N} \left\{ \left[ \cos\left(f(x_i^a), f(x_i^n)\right) - \cos\left(f(x_i^a), f(x_i^p)\right) + \alpha_1 \right]_+ + \left[ \alpha_2 - \cos\left(f(x_i^a), f(x_i^p)\right) \right]_+ \right\}$$

wherein $\cos(\cdot,\cdot)$ represents the cosine distance, calculated as $\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}$; $N$ is the number of triplets; $f(x_i^a)$ represents the feature vector of the reference sample; $f(x_i^p)$ represents the feature vector of the homogeneous positive sample; $f(x_i^n)$ represents the feature vector of the heterogeneous negative sample; $[\cdot]_+$ denotes $\max(\cdot, 0)$; $\alpha_1$ is an inter-class interval parameter; and $\alpha_2$ is an intra-class interval parameter.
Compared to the original triplet loss function, the improved triplet loss function changes the metric from the Euclidean distance to the cosine distance, keeping the metric consistent between the training stage and the deployment stage and improving the continuity of feature learning. The first term of the new loss function is consistent with the original triplet loss function and serves to increase the inter-class difference; the second term adds a distance constraint on homogeneous sample pairs (positive tuples) and serves to reduce the intra-class difference. $\alpha_1$ is the inter-class interval parameter, with a value range of 0 to 0.2; $\alpha_2$ is the intra-class interval parameter, with a value range of 0.8 to 1.0. It should be noted that, because the metric is cosine-based, the measured value corresponds to the similarity between two samples; in the above expression, only triplets whose negative-pair cosine similarity falls within the margin $\alpha_1$ below (or exceeds) the positive-pair cosine similarity actually contribute to the training.
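Reading the formula above directly into code gives the following PyTorch sketch. Since the equation itself was reconstructed from the surrounding description, treat this as one plausible implementation rather than the patent's verbatim loss; the default margin values are taken from the stated ranges.

```python
import torch
import torch.nn.functional as F

def improved_triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                          negative: torch.Tensor,
                          alpha1: float = 0.2, alpha2: float = 0.8) -> torch.Tensor:
    """Cosine-metric triplet loss with an added intra-class constraint.
    anchor, positive, negative: (N, D) feature batches of the triplets.
    The first term widens the gap between negative-pair and positive-pair
    similarity by at least alpha1 (inter-class); the second term pushes
    positive-pair similarity up to at least alpha2 (intra-class)."""
    cos_ap = F.cosine_similarity(anchor, positive)   # homogeneous pairs
    cos_an = F.cosine_similarity(anchor, negative)   # heterogeneous pairs
    inter_class = torch.clamp(cos_an - cos_ap + alpha1, min=0.0)  # [.]_+
    intra_class = torch.clamp(alpha2 - cos_ap, min=0.0)           # [.]_+
    return (inter_class + intra_class).sum()
```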
The model is trained on the improved triplet loss function: through the joint constraint of the inter-class loss and the intra-class loss, back-propagation optimizes the model so that homogeneous samples lie as close as possible in the feature space and heterogeneous samples lie as far apart as possible, which improves the recognition ability of the model and the reliability of face authentication.
And S308, inputting the verification set data into the convolutional neural network, and obtaining the trained convolutional neural network for face authentication when the training end condition is reached.
Specifically, 90% of the data in the person-certificate image data pool is taken as the training set, and the remaining 10% as the validation set. The improved triplet loss value is calculated according to the above formula and fed back to the convolutional neural network for iterative optimization. Meanwhile, the performance of the model on the validation set is observed; when the validation performance no longer increases, the model has reached a convergence state and the training stage terminates.
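An outline of this training stage, under the same assumptions as the loss sketch above, might look as follows; evaluate_on is a hypothetical helper that scores verification accuracy on the validation set, and the epoch count and learning rate are placeholders.

```python
import torch

def train(model, train_loader, val_loader, epochs: int = 50, lr: float = 0.01):
    """Sketch of triplet-supervised training with stochastic gradient descent,
    stopping once validation performance no longer increases."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    best_score = 0.0
    for epoch in range(epochs):
        model.train()
        for anchor, positive, negative in train_loader:  # OHEM-built triplets
            loss = improved_triplet_loss(model(anchor), model(positive),
                                         model(negative))
            optimizer.zero_grad()
            loss.backward()    # feed the loss back for iterative optimization
            optimizer.step()
        score = evaluate_on(model, val_loader)   # hypothetical validation metric
        if score <= best_score:  # validation performance no longer increases
            break                # convergence reached; terminate training
        best_score = score
    return model
```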
According to the face authentication method, on one hand, the constraint on the intra-class sample distance is added in the loss function of the original triplet loss, so that the intra-class difference is reduced while the inter-class difference is increased, and the generalization capability of the model is improved; on the other hand, the original metric mode of the triplet loss is changed from the Euclidean distance to the cosine distance, the consistency of the metrics of training and deployment is kept, and the continuity of feature learning is improved.
In another embodiment, the step of training the convolutional neural network further comprises: initializing with base model parameters pre-trained on massive open-source face data, and adding a normalization layer and an improved triplet loss function layer after the feature output layer to obtain the convolutional neural network to be trained.
Specifically, when deep learning is applied to person-certificate verification, the performance of a conventional deep face recognition model trained on massive internet face data degrades greatly in certificate-to-scene comparison under a specific scenario. Person-certificate data sources in a specific application scenario are limited, and direct training on insufficient samples yields unsatisfactory results. There is therefore a strong need for a method that effectively extends training to the scene data of a small data set, improving the accuracy of the face recognition model in the specific application scenario and meeting market demands.
Deep learning algorithms usually depend on training with massive data. Person-certificate verification poses a heterogeneous comparison problem between certificate photos and scene photos, and the performance of a conventional deep face recognition model trained on massive internet face data degrades greatly in this comparison. Moreover, the sources of person-certificate data are limited (each entry must include both the identity card image of a person and the corresponding scene image), so the amount of data available for training is small and direct training on insufficient samples is unsatisfactory. Therefore, when training a person-certificate verification model with deep learning, the idea of transfer learning is often used: a base model with reliable performance on open-source test sets is first trained on massive internet face data, and secondary extension training is then performed on the limited person-certificate data, so that the model automatically learns the feature representation of the specific modality and its performance improves. This process is illustrated in fig. 6.
In the secondary training process, the whole network is initialized with the pre-trained base model parameters, and an L2 normalization layer and an improved triplet loss layer are then added after the feature output layer of the network; the structure of the convolutional neural network to be trained is shown in fig. 7.
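A sketch of this secondary-training setup is shown below: the pre-trained base model's parameters are loaded and an L2 normalization step is appended after the feature output layer. The triplet loss itself has no stored parameters, so it is applied here as the training criterion rather than as a network layer; the checkpoint path and the build_backbone helper are placeholders.

```python
import torch
import torch.nn.functional as F

class FineTuneNet(torch.nn.Module):
    """Pre-trained backbone with an L2 normalization layer appended
    after the feature output layer, ready for triplet loss training."""
    def __init__(self, backbone: torch.nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.backbone(x)          # feature output layer
        return F.normalize(features, dim=1)  # L2 normalization layer

# backbone = build_backbone()                            # hypothetical builder
# backbone.load_state_dict(torch.load("base_model.pt"))  # placeholder checkpoint
# model = FineTuneNet(backbone)  # then fine-tune with improved_triplet_loss
```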
In an embodiment, a flow diagram of a face authentication method is shown in fig. 8, and includes three stages, namely a data acquisition and preprocessing stage, a training stage, and a deployment stage.
In the data acquisition and preprocessing stage, the card reader module of the person-certificate verification terminal device reads the certificate chip photo and the front camera captures the scene photo; after passing through the face detector, the key point detector and the face alignment and cropping module, a size-normalized certificate face image and scene face image are obtained.
In the training stage, 90% of the data in the person-certificate image data pool is taken as the training set and the remaining 10% as the validation set. Because person-certificate comparison is mainly a comparison between a certificate photo and a scene photo, if the certificate photo is taken as the reference image (anchor) of a triplet, the other two images are both scene photos; conversely, if the scene photo is taken as the reference image, the other two images are both certificate photos. The OHEM online triplet construction strategy is adopted: in each iterative optimization of the network, the current network performs a forward calculation on the candidate triplets and screens out the valid triplets that satisfy the conditions; the improved triplet loss value is calculated according to the above formula and fed back to the network for iterative optimization. Meanwhile, the performance of the model on the validation set is observed; when the validation performance no longer increases, the model has reached a convergence state and the training stage terminates.
In the deployment stage, when the trained model is deployed on a person-certificate verification terminal, the images collected by the device go through the same preprocessing as in the training stage; the feature vector of each face image is then obtained through a network forward calculation, the similarity of the two images is obtained by calculating the cosine distance, and a judgment is made against a preset threshold: if the similarity is greater than the preset threshold, the two images belong to the same person; otherwise, they belong to different persons.
In this face authentication method, whereas the original triplet loss function only constrains the inter-class distance, the improved function adds a constraint term on the intra-class distance, so that during training the network reduces the intra-class difference while increasing the inter-class difference, which improves the generalization ability of the network and the scene adaptability of the model. In addition, replacing the Euclidean distance of the original triplet loss with the cosine distance better matches the distribution of the face feature space, keeps the metric consistent between the training stage and the deployment stage, and makes the comparison result more reliable.
In one embodiment, there is provided a face authentication apparatus, as shown in fig. 9, including: an image acquisition module 902, an image pre-processing module 904, a feature acquisition module 906, a calculation module 908, and an authentication module 910.
And an image obtaining module 902, configured to obtain a certificate photo and a scene photo of a person based on the face authentication request.
And the image preprocessing module 904 is configured to perform face detection, key point positioning, and image preprocessing on the scene photo and the certificate photo respectively to obtain a scene face image corresponding to the scene photo and a certificate face image corresponding to the certificate photo.
The feature acquisition module 906 is configured to input the scene face image and the certificate face image into a pre-trained convolutional neural network model for face authentication, and acquire a first feature vector corresponding to the scene face image output by the convolutional neural network model and a second feature vector corresponding to the certificate face image; the convolutional neural network model is obtained based on supervised training of a triplet loss function.
The calculation module 908 is configured to calculate the cosine distance between the first feature vector and the second feature vector.
And the authentication module 910 is configured to compare the cosine distance with a preset threshold, and determine a face authentication result according to the comparison result.
The face authentication device performs face authentication with a pre-trained convolutional neural network. The convolutional neural network model is obtained through supervised training with the improved triplet loss function, and the similarity between the scene face image and the certificate face image is obtained by calculating the cosine distance between the first feature vector corresponding to the scene face image and the second feature vector corresponding to the certificate face image. Because the cosine distance measures the angle between spatial vectors and reflects differences in direction rather than position, it better matches the distribution of the face feature space, and the reliability of face authentication is improved.
As shown in fig. 9, in another embodiment, the face authentication apparatus further includes: a sample acquisition module 912, a triplet acquisition module 914, a training module 916, and a validation module 918.
The sample acquisition module 912 is configured to acquire labeled training samples, where the training samples include one certificate face image and at least one scene face image labeled as belonging to each labeled object.
The triplet acquisition module 914 is configured to train the convolutional neural network model according to the training samples and to generate triplet elements corresponding to the training samples through OHEM; the triplet elements include a reference sample, a positive sample, and a negative sample.
Specifically, the triplet acquisition module 914 is configured to randomly select an image as the reference sample, select an image that belongs to the same labeled object but to a different category as the positive sample, and then, according to the OHEM strategy, extract cosine distances among features with the currently trained convolutional neural network model and, for each reference sample, select from the face images not belonging to the same labeled object the image that is closest to the reference sample and of a different category as the negative sample of that reference sample.
Specifically, when the certificate photo is taken as a reference sample, the positive sample and the negative sample are both scene photos; when the scene photo is taken as a reference sample, the positive sample and the negative sample are both certificate photos.
The training module 916 is configured to train the convolutional neural network model under the supervision of the triplet loss function according to the triplet elements of each training sample, wherein the triplet loss function uses the cosine distance as its metric and optimizes the model parameters through a stochastic gradient descent algorithm.
Specifically, the modified triplet loss function includes a definition of cosine distances for homogeneous samples and a definition of cosine distances for heterogeneous samples.
The modified triplet loss function is:

$$L = \sum_{i=1}^{N} \left\{ \left[ \cos\left(f(x_i^a), f(x_i^n)\right) - \cos\left(f(x_i^a), f(x_i^p)\right) + \alpha_1 \right]_+ + \left[ \alpha_2 - \cos\left(f(x_i^a), f(x_i^p)\right) \right]_+ \right\}$$

wherein $\cos(\cdot,\cdot)$ represents the cosine distance, calculated as $\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}$; $N$ is the number of triplets; $f(x_i^a)$ represents the feature vector of the reference sample; $f(x_i^p)$ represents the feature vector of the homogeneous positive sample; $f(x_i^n)$ represents the feature vector of the heterogeneous negative sample; $[\cdot]_+$ denotes $\max(\cdot, 0)$; $\alpha_1$ is an inter-class interval parameter; and $\alpha_2$ is an intra-class interval parameter.
And the verification module 918 is used for inputting the verification set data into the convolutional neural network model to obtain the trained convolutional neural network model for face authentication when the training end condition is reached.
In another embodiment, the face authentication apparatus further includes a model initialization module 920, configured to initialize with base model parameters pre-trained on massive open-source face data and to add a normalization layer and a triplet loss function layer after the feature output layer, obtaining the convolutional neural network to be trained. On one hand, the face authentication device adds a constraint on the intra-class sample distance to the original triplet loss function, so that the intra-class difference is reduced while the inter-class difference is increased, which improves the generalization ability of the model; on the other hand, the metric of the original triplet loss is changed from the Euclidean distance to the cosine distance, keeping the metric consistent between training and deployment and improving the continuity of feature learning.
A computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the steps of the face authentication method of the above embodiments are implemented.
A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the face authentication method of each of the above embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.