1. Introduction
Masked face recognition (MFR) remains a challenging open problem. With the emergence of COVID-19, many people began wearing face masks to avoid inhaling the airborne virus, and this practice has saved many lives by reducing the transmission rate of COVID-19. However, widespread mask wearing causes user authentication and verification systems, such as face recognition systems, to fail. A conventional face recognition system cannot verify or recognize a person’s face that is half-covered by a face mask. This is a critical security problem for access control, where a person must be authorized before entering restricted areas, labs, or rooms, and it becomes worse in unconstrained public environments. In public areas, criminals can move “undetectably” by wearing a mask, since CCTV fails to capture a criminal’s face when it is covered. This further increases the failure rate of face recognition systems when detecting or recognizing masked facial images. To prevent such exploitation, an MFR system must be designed for both face detection and recognition. Overall, there are two consequences when face recognition fails to recognize masked facial images:
The crime rate will increase, since criminals can evade camera-based face recognition systems by putting on a mask.
Face recognition-based biometric systems will have a low true acceptance rate (TAR) for access control or identification purposes.
According to Déniz’s research on face recognition using histograms of oriented gradients (HOG) [1], there are three essential steps in using the HOG to build a face recognition system. First, the authors implemented the HOG descriptor for grid extraction. Second, they fused HOG descriptors at different scales. Third, they performed dimension reduction to remove noise and prevent overfitting. This work inspired us to build a face recognition solution that can recognize masked facial images with a deep learning approach combined with HOG descriptors.
This paper discusses the proposed method, the Histogram-based Recurrent Neural Network (HRNN) for MFR, which can recognize masked facial images and increase a face recognition system’s TAR. TAR, also called the true match rate (TMR), is an evaluation metric for biometric authentication that measures the probability of correctly accepting an authorized person. In this experiment, TAR is reported as the percentage of a label’s masked facial images that are matched correctly with the unmasked images.
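In formula form, the TAR reported in this paper can be written as follows (a standard formulation in our own notation, consistent with the description above):

$$\mathrm{TAR} = \frac{\text{number of masked images correctly matched to their unmasked identity}}{\text{total number of genuine match attempts}} \times 100\%$$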
HRNN combines feature extraction using the histogram of oriented gradients (HOG) with the recurrent neural network (RNN) deep learning approach. The combination of these two approaches works well and achieves a high TAR on two benchmark datasets. In this context, the HOG feature descriptor extracts the gradient magnitude and orientation from the normalized dataset during the feature extraction process. This minimizes the feature size by extracting only the essential elements from the data for the subsequent training process, which also reduces the computational burden on the machine. Next, the RNN is trained on the features extracted by the HOG. The RNN deep learning approach was chosen because each of its outputs depends on the previous output: the backpropagation mechanism in the RNN feeds the output error back through the neurons of the network to update the weights toward the desired result.
After the training process, the model is evaluated with different categories on each benchmark dataset. The detailed experiments are presented in Section 5 of the paper. Hyperparameter tuning helps to reduce problems encountered during the deep learning process, including overfitting and underfitting. The experimental results are evaluated with TAR, recorded, and compared across categories.
Section 2 of the paper discusses related research and a literature review of face recognition and MFR. Section 3 presents the motivation and contribution of this paper and the inspiration for the proposed method. Section 4 explains the proposed solution and the implementation of its algorithm. Section 5 evaluates the proposed method’s performance and provides an experimental comparison with state-of-the-art methods. The last section concludes the work.
3. Motivation and Contribution
This section discusses the motivation and contribution of the proposed method for MFR. In recent years, wearing a face mask has become habitual as people try to avoid spreading the COVID-19 virus. However, this trend causes face recognition systems to fail to recognize masked facial images, giving such systems a low TAR.
The main reason such systems fail to recognize masked facial images is that existing facial recognition algorithms cannot accurately detect a human’s facial region. Once the system fails to detect the facial region, it cannot properly extract features from the images. Moreover, information is lost because half of the facial region is covered by “unknown things” from the face recognition perspective. To overcome these weaknesses, an MFR system is necessary to enhance the existing facial recognition system so that it can effectively recognize and authorize a person in the mask-on condition.
In this paper, we propose an MFR system named HRNN that integrates feature extraction with a deep learning approach. The main idea of the proposed method is to use the HOG to perform feature extraction from masked facial images and the RNN for the deep learning process. The proposed method improves training and testing speed compared to a default CNN. In addition, the HOG performs better than PCA during the training phase.
The main challenge of this experiment is to build a reliable MFR system that can recognize masked facial images and achieve a high recognition rate in the testing phase. First, two open-source datasets, RMFD and LFW-SMFD, are selected to test the model. The datasets are fed into the data pre-processing phase, in which the normalization and feature extraction processes are performed. Next, the extracted feature vectors are fed to the RNN for the deep learning process.
Overall, the main contributions of the proposed method are:
Addressing the weaknesses of the conventional face recognition system. The proposed method can effectively recognize unconstrained masked face images using a deep learning approach, as demonstrated on two benchmark datasets;
Solving overfitting and underfitting problems through hyperparameter tuning and HOG feature extraction. The proposed method fits large-scale datasets without overfitting or underfitting, which will be substantiated in the experimental section;
Solving the slow training and testing speed on big datasets. The proposed method improves training and testing speed on large-scale datasets compared with default CNN settings.
The work was inspired by the benchmarked MFR framework using MTCNN [20], designed with FaceNet feature extraction and an SVM classifier to predict the output of the experiment. In contrast, our proposed HRNN method uses a non-pre-trained feature extraction method, the HOG feature descriptor, which allows the MFR system to speed up the feature extraction phase. Next, the RNN is implemented to train on and predict from the features. HRNN adopts a lighter architecture than the benchmarked framework in the feature extraction and classification phases, which improves computational time and training and testing speed.
Furthermore, whereas the benchmarked framework uses cropping to ensure the system extracts features focused on the facial region, our proposed method only uses greyscaling during the feature extraction process. This is because the data have already been cropped, so there is no need to repeat the process. Lastly, the RNN can avoid overfitting and underfitting through hyperparameter tuning, which helps the performance accuracy.
4. Proposed Solution
This section discusses the overall algorithm and experimental process. From the paper [36], we understand that the benchmarked framework uses FaceNet for feature extraction and SVM for classification. This motivated us to propose a more time-efficient method that uses the HOG rather than FaceNet for feature extraction and the RNN for the deep learning phase. Our proposed method has the same ability to recognize various types of masks and different categories of labels. It can solve the low acceptance rates of conventional facial recognition systems and speed up training time on large-scale datasets. It combines feature extraction and a deep learning mechanism to build a reliable MFR system.
Figure 1 shows an overall diagram of the proposed HRNN MFR using an RNN with HOG feature extraction. All images, including masked and unmasked facial images, are imported from the dataset. The size of each input image is resized from 160 × 160 pixels to 28 × 28 pixels by Equation (1) to fit the neural network:

$$I'(x, y) = \mathrm{resize}\big(I(x, y),\ 28 \times 28\big), \tag{1}$$

where $x$ represents the width pixel of the image, $y$ represents the height pixel of the image, and $I'$ is the resized image. Before the feature extraction process, all images are normalized with the greyscale conversion in Equation (2):

$$\mathrm{Grey} = c_R R + c_G G + c_B B, \tag{2}$$

where RGB is the additive color model in computer vision, referring to the red, green, and blue pixels of a display, and $c_R$, $c_G$, and $c_B$ are constant weights (commonly 0.299, 0.587, and 0.114) that multiply $R$, $G$, and $B$ to build a greyscaled image.
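As an illustration, a minimal pre-processing sketch consistent with Equations (1) and (2) is given below. It uses OpenCV; the 0.299/0.587/0.114 weights are the common ITU-R BT.601 constants and are an assumption, as the paper does not state its exact values.

```python
import cv2
import numpy as np

def preprocess(image_path: str) -> np.ndarray:
    """Resize a facial image to 28x28 and convert it to greyscale,
    following Equations (1) and (2)."""
    img = cv2.imread(image_path)              # BGR image, e.g., 160x160
    img = cv2.resize(img, (28, 28))           # Equation (1): fit the network input
    b, g, r = cv2.split(img.astype(np.float32))
    grey = 0.299 * r + 0.587 * g + 0.114 * b  # Equation (2): weighted greyscale (weights assumed)
    return grey
```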
Next is the feature extraction process: every greyscaled image is fed to the HOG feature extraction in Equation (3) to extract a feature vector for each image. The HOG descriptor in Equation (3) is built from the gradient magnitude $g$ in Equation (4) and the orientation $\theta$ in Equation (5):

$$g = \sqrt{g_x^2 + g_y^2}, \tag{4}$$

$$\theta = \arctan\left(\frac{g_y}{g_x}\right), \tag{5}$$

where $g_x$ represents the gradient along the width ($x$) axis of the image, $g_y$ represents the gradient along the height ($y$) axis, $g$ is the total gradient magnitude, and $\theta$ represents the orientation/direction.
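A minimal HOG extraction sketch consistent with Equations (3)–(5), using scikit-image; the cell and block sizes below are illustrative assumptions, not values reported in the paper:

```python
import numpy as np
from skimage.feature import hog

def extract_hog(grey_img: np.ndarray) -> np.ndarray:
    """Extract a HOG feature vector from a 28x28 greyscale image.
    Internally, hog() computes the gradient magnitude (Equation (4))
    and orientation (Equation (5)) before binning into histograms."""
    return hog(
        grey_img,
        orientations=9,           # number of orientation bins (assumed)
        pixels_per_cell=(7, 7),   # cell size (assumed)
        cells_per_block=(2, 2),   # block size for normalization (assumed)
        feature_vector=True,
    )
```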
Furthermore, a min–max normalization process is applied to reduce the training data’s total feature size. Equation (6) gives the min–max normalization:

$$h' = \frac{h - \min(h)}{\max(h) - \min(h)}, \tag{6}$$

where $h$ is a value of the HOG feature vector, $\min(h)$ and $\max(h)$ are the minimum and maximum values of the feature vector, and $h'$ is the normalized value. The main purpose of this process is to rescale all features into the range between 0 and 1, which speeds up training.
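A one-line NumPy sketch of the min–max normalization in Equation (6):

```python
import numpy as np

def min_max_normalize(features: np.ndarray) -> np.ndarray:
    """Rescale a HOG feature vector into [0, 1] (Equation (6))."""
    f_min, f_max = features.min(), features.max()
    return (features - f_min) / (f_max - f_min + 1e-12)  # epsilon guards division by zero
```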
Next, the feature vectors normalized by Equation (6) are trained using the RNN. Equation (7) gives the basic RNN recurrence:

$$h_t = f_W(h_{t-1}, x_t), \tag{7}$$

where $h_t$ represents the current hidden state of the neural network, computed as a function of the past hidden state $h_{t-1}$; $x_t$ refers to the current input data; and $W$ refers to the parameter input of the function $f$. The data are trained with the sequential model settings in the RNN using the LSTM architecture (Equation (8)), combined with Dropout (Equation (11)) and Flatten (Equation (12)) layers as parameters. Finally, the neural network is optimized with the Adam optimizer (Equation (15)), sparse categorical cross-entropy as the loss function (Equation (16)), and Softmax (Equation (17)) as the activation function.
In Equation (8), $f_t$ represents the forget gate, $i_t$ represents the input gate, $o_t$ represents the output gate, $c_t$ refers to the cell state, $h_t$ refers to the current hidden state, and $h_{t-1}$ refers to the previous hidden state. $\sigma$ and $\tanh$ represent the sigmoid (Equation (9)) and tanh (Equation (10)) activation functions, respectively. $W_f$, $W_i$, $W_o$, $W_c$, $U_f$, $U_i$, $U_o$, and $U_c$ represent the weight matrices, and $b_f$, $b_i$, $b_o$, and $b_c$ represent the biases.
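For reference, the standard LSTM gate equations consistent with these definitions (a common form of Equation (8); the paper’s exact layout may differ) are:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \qquad i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \qquad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad h_t = o_t \odot \tanh(c_t),$$

where $\odot$ denotes element-wise multiplication.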
In Equation (11), $y_i^{(l)}$ represents the dropout output of unit $i$ in layer $l$. In Equation (12), $a$ represents the semimajor axis and $b$ represents the semiminor axis. Equation (15) shows the process of the Adam optimizer, which is made up of two essential components: momentum (Equation (13)) and root mean square propagation (RMSP) (Equation (14)).
In Equation (13), $m_t$ refers to the aggregate of gradients at time $t$, and $m_{t-1}$ refers to the aggregate of gradients at the previous time step. $w_t$ represents the weights at the current time $t$, and $w_{t+1}$ represents the updated weights at the next time step. $\alpha$ refers to the learning rate, $\partial L$ refers to the derivative of the loss function, $\partial w_t$ represents the derivative of the weights at time $t$, and $\beta$ represents the moving average parameter. In Equation (14), $v_t$ represents the sum of the squares of past gradients, and $\epsilon$ refers to a small positive constant. Equation (15) combines the momentum and RMSP terms into the Adam weight update. Equation (16) shows the sparse categorical cross-entropy loss function used in the neural network, where $N$ is the number of classes, $y_i$ is the truth label, and $p_i$ represents the Softmax probability for the $i$-th class.
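In standard form, consistent with the definitions above (a common presentation of Equations (13)–(15); the paper’s exact notation may differ):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\frac{\partial L}{\partial w_t}, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\left(\frac{\partial L}{\partial w_t}\right)^2,$$
$$w_{t+1} = w_t - \alpha\,\frac{m_t}{\sqrt{v_t} + \epsilon},$$

where $\beta_1$ and $\beta_2$ are the moving average parameters of the momentum and RMSP terms, respectively (bias-correction terms are omitted for brevity).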
Equation (17) shows the process of the Softmax activation function:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \tag{17}$$

where $\sigma$ refers to the Softmax, $z$ represents the input vector, $e^{z_i}$ represents the standard exponential for the input vector, $K$ refers to the number of classes in the multiclassifier, and $e^{z_j}$ represents the standard exponential terms for the output vector.
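A small NumPy sketch of Equations (16) and (17), illustrating how a Softmax probability vector feeds the sparse categorical cross-entropy loss; the example logits and label are hypothetical:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Equation (17): numerically stable Softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sparse_categorical_cross_entropy(logits: np.ndarray, true_label: int) -> float:
    """Equation (16): negative log-probability of the integer truth label."""
    p = softmax(logits)
    return float(-np.log(p[true_label]))

# Hypothetical 3-class example
loss = sparse_categorical_cross_entropy(np.array([2.0, 1.0, 0.1]), true_label=0)
print(f"loss = {loss:.4f}")  # ~0.4170
```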
Algorithm 1 shows the overall process of the proposed HRNN with its expected inputs and outputs; a condensed code sketch follows the listing.

Algorithm 1. Histogram-based Recurrent Neural Network
Input: $D$ = masked and unmasked facial images imported from the datasets; $n$ = the maximum number of labels in the dataset.
Output: Prediction model.
1. Let $i = 1$, where $i$ is a counter for the number of loops.
2. $n$ = the maximum number of labels in the dataset.
3. Load $D$ into the experiment.
4. repeat
5. Compute Equation (1) (resize) for the images of label $i$.
6. Compute Equation (2) (greyscale) for the images of label $i$.
7. Compute Equation (3) (HOG) for the images of label $i$.
8. $i = i + 1$
9. until $i = n$
10. Compute Equation (6) (min–max normalization) on the extracted features.
11. Compute Equation (7) (RNN training) on the normalized features.
12. return Prediction model.
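Putting the pieces together, a condensed sketch of Algorithm 1 using the hypothetical helper functions from the earlier sketches (preprocess, extract_hog, min_max_normalize, build_hrnn); the input shape is an assumption, and the 50 epochs follow the experimental setup in Section 5:

```python
import numpy as np

def train_hrnn(image_paths: list, labels: list, num_labels: int):
    """Algorithm 1: HOG feature extraction followed by RNN training."""
    features = []
    for path in image_paths:                   # steps 4-9: loop over the dataset
        grey = preprocess(path)                # Equations (1)-(2)
        features.append(extract_hog(grey))     # Equation (3)
    x = min_max_normalize(np.array(features))  # step 10: Equation (6)
    x = x[:, np.newaxis, :]                    # one timestep per image (assumed shape)
    model = build_hrnn(num_labels, timesteps=1, feat_dim=x.shape[-1])
    model.fit(x, np.array(labels), epochs=50)  # step 11: Equation (7); 50 epochs per Section 5
    return model                               # step 12: the prediction model
```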
5. Experimental Results
This section evaluates the performance and results of the HRNN. Two benchmark datasets, the Labeled Faces in the Wild Simulated Masked Face Dataset (LFW-SMFD) [35] and the Real-world Masked Face Dataset (RMFD) [33], are tested with the proposed method. The RMFD dataset has 243 labels and 1996 facial images in total, while LFW-SMFD has 2271 labels and 5442 facial image samples. The Testdir (TD) is a self-built dataset used to quickly check whether the methods, models, and algorithms under consideration are likely to work on the actual benchmark datasets. The TD dataset is a mini dataset drawn as a subset of both benchmark datasets, RMFD and LFW-SMFD; it consists of 36 labels and 251 facial images. Its primary purpose is to estimate the performance of the different approaches used in the experiment in a shorter time: a single experimental cycle can be tested on the TD dataset within 10 min, rather than taking much longer on a whole benchmark dataset.
Table 1, Table 2 and Table 3 present the experimental results from the three datasets.
Table 1 shows the experimental results on the TD dataset. The main objective of this test is to find which methods suit the benchmark datasets while shortening the experimental processing time. All of the epochs are fixed at 50 across experiments to prevent inequitable comparisons. First, experiments No. 101 and 102 test the KNN approach: No. 101 has no feature extraction method, and No. 102 uses ResNet50 for feature extraction. The results show that No. 101 scores about 3% higher than No. 102, at 33.28% versus 30.80%. Next, experiments No. 109, 110, 111, and 112 test the SVM approach, each with a different feature extraction type: ResNet50, VGG16, Inceptionv3, and EfficientNetB7, respectively. The results of all four categories are unsatisfactory, with low TAR; experiment No. 109 has the lowest TAR at 7.22%, and No. 111 achieves a 42.45% TAR. The remaining experiments test the RNN method, which fits the TD dataset well. Experiment No. 108 tests a combination of the RNN and SVM approaches; it shows a poor TAR of 2.4%, so we conclude that this method is unsuitable for the dataset. Experiment No. 103 tests the RNN with no feature extraction method; the result is a high TAR of 99.19%, suggesting it might work for the benchmark datasets. To verify this, HOG feature extraction is added to the RNN in experiment No. 113. The result shows that it fits the dataset perfectly, achieving a 100% TAR. Before this category of experiments ends, the HOG feature extraction is replaced with the transfer learning modules used as feature extraction in the SVM tests; however, these experiments again show low TAR. Experiment No. 114 adds a PCA dimension reduction process to the RNN and HOG approach and achieves a 30.40% TAR. From the results in Table 1, we can conclude that the RNN with HOG feature extraction outperforms the other approaches and gives reliable results.
Table 2 shows the comparison results for the RMFD benchmark dataset. All of the epochs in the experiment are set at 50. Experiments No. 201 and 202 test the KNN approach: No. 201 has no feature extraction, and No. 202 uses the ResNet50 transfer learning approach for feature extraction. No. 202 has a higher TAR than No. 201, at 34.88% versus 29.59%, respectively. Experiments No. 203, 204, 205, and 206 use the RNN approach. First, No. 203 tests the performance of the RNN on the RMFD dataset with no feature extraction implemented; the result is good, achieving a TAR of 91.63%. Experiment No. 204 tests the transfer learning feature extraction approach. Only one transfer learning module is tested in the table because, according to Table 1, the other transfer learning modules perform poorly with low TAR. Nevertheless, Inceptionv3 always had the highest TAR among the transfer learning modules, and the Inceptionv3 experiment, No. 204, achieves a 48.42% TAR. Experiment No. 205 uses the approach with the highest TAR in Table 1, which achieves a 99.60% TAR on the RMFD benchmark dataset. Finally, an extra experiment, No. 206, is run to find out what happens when training continues beyond 50 epochs: with 100 epochs, it achieves 98.67%, about 1% lower than No. 205. Therefore, we can conclude that the combination of the RNN and HOG works well on the RMFD dataset and that a larger number of epochs does not necessarily increase the TAR.
Table 3 shows the experimental results on the LFW-SMFD benchmark dataset. All of the epochs in the experiment are fixed at 50. The first experiment on the LFW-SMFD dataset, No. 301, tests the KNN approach with no feature extraction; it achieves only an 18.53% TAR. Next, the ResNet50 transfer learning approach is tested as feature extraction with the KNN in No. 302, which achieves a 43.70% TAR. Experiment No. 303 tests VGG16 as the feature extraction approach with the KNN and achieves a 42.91% TAR. The remaining experiments use the RNN with different feature extractions to observe the differences in performance. Experiment No. 304, the RNN with no feature extraction method, achieves a TAR of 47.35%. Next, No. 305 uses Inceptionv3 as the feature extraction method and achieves an 11.93% TAR. Finally, experiment No. 306, the RNN with HOG feature extraction and the best approach in Table 1 and Table 2, performs well on the LFW-SMFD dataset and achieves the highest TAR of 99.56%. We can conclude that the RNN with HOG approach is the best method on both benchmark datasets.
Figure 2 shows the RMFD evaluation graphs for training and validation loss (left) and training and validation accuracy (right) with the proposed method. The loss graph indicates how well or poorly the model performs after each training epoch. The loss in Figure 2 starts above 3.5 at the 0th epoch and drops continuously to about 0.3 by the 50th epoch, indicating good training of the model. On the right side, the accuracy graph evaluates the model’s performance in an interpretable way: the accuracy in Figure 2 rises from 0 at the 0th epoch to 0.99 at the 50th epoch, showing that the model is well trained.
Figure 3 shows the LFW-SMFD evaluation graphs for training and validation loss (left) and training and validation accuracy (right) with the proposed method. The loss in Figure 3 starts above 6.8 at the 0th epoch and drops continuously to 0.1 by the 50th epoch. On the right side, the accuracy in Figure 3 rises from 0 at the 0th epoch to 0.99 at the 50th epoch.
Table 4 shows the computational performance comparison on both benchmark datasets. The experiments were conducted on an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60 GHz, 16.0 GB RAM, a 64-bit operating system with an x64-based processor, and an NVIDIA GeForce GTX 1060.
According to Table 4, the HRNN method needs only 20 min of training time on the RMFD dataset, compared to 46 min for the RNN. For the LFW-SMFD dataset, the HRNN is also faster than the RNN, at 76 min versus 182 min in training time and 0.0463 img/s versus 0.1839 img/s in testing time, respectively. The results show that HRNN improves both training and testing computational time on both benchmark datasets.
Table 5 exhibits the performance comparison of the proposed method and state-of-the-art methods. First, the MFR with ResNet50 achieved a 47.19% accuracy; according to the experiments in Table 1, ResNet50 performs poorly when combined with the RNN, and the proposed method overcomes this by changing the feature extraction method to HOG. Secondly, the MFR using FaceMaskNet-21 reaches 88.92% accuracy. It uses FaceMaskNet-21 to produce a 128-dimensional encoding to support the recognition system; FaceMaskNet-21 is a deep neural network with convolutional layers, ReLU, cross-channel normalization, max pooling, and Softmax, and it is an exemplary architecture for MFR. To further improve on this accuracy, the proposed method adds an explicit feature extraction process to extract more from the data; as a result, it successfully increases accuracy by integrating the HOG, compared with a CNN-only model. Next, masked face recognition using MFCosface achieves 98.54% accuracy. That method uses a large-margin cosine loss to build the MFR system; in our experiments, the proposed RNN and HOG combination has a better computational time than the MFCosface method. MFR using the MTCNN, SVM, and FaceNet approach results in 98.10% accuracy, using MTCNN for facial detection, FaceNet for feature extraction, and a support vector machine (SVM) for classification. The proposed method does not implement a facial detection algorithm such as MTCNN, since the datasets used for the proposed method are already cropped; in addition, we observed that FaceNet and SVM do not perform as well as the HOG in our experiments. Finally, another state-of-the-art method, the LCDL approach, a variant of LCD, achieves a slightly lower accuracy of 98.00% than the proposed method. In summary, the proposed method, which uses the HOG feature descriptor as the feature extractor and the RNN for deep learning, outperforms the other state-of-the-art methods with the highest accuracy of 99.56%.