Task-aware image quality estimators for face detection

Abstract

Understanding the quality of face images plays a critical role in enhancing the efficiency of end-to-end face analytics systems, which perform tasks such as face detection, alignment, and recognition in a sequential manner. Recently, the development of face quality estimators (QEs) specifically for face recognition has received significant attention. However, in end-to-end face analytics systems, the performance of the face detector directly affects the face recognizer. Thus, assessing the suitability of an image for face detection before passing it on to face recognition can improve the resource utilization in such systems. In this research, we first introduce the detectability (DET) score of an image, a novel quality metric that links image quality to face detection performance. We use this DET score to design two novel QEs for face detection: supervised face detection quality estimator (SFDQE) and unsupervised face detection quality estimator (UFDQE). We also propose the mAP vs. reject protocol (mvR), a systematic evaluation protocol for assessing QEs in the context of face detection. In our experiments, we illustrate the effectiveness of SFDQE and UFDQE in determining the suitability of an image for face detection. Furthermore, we show the ability of our QEs to generalize; each is a powerful tool for image quality estimation in general object detection scenarios.

1 Introduction

Quality estimation is a crucial aspect of video analytics systems [1]. Quality estimators (QEs) provide a general understanding of how suitable a specific image or video is for a given analytics system, without the need to directly assess the performance of the system on the input. Traditional image QEs [2,3,4] are intended to evaluate perceptual quality only. They address the question, "Do people perceive this image as high quality or not?"

Numerous factors other than perceptual quality can influence how suitable an image is for a particular video analytics task, such as the capture and post-processing pipeline, the surroundings, the background, and the illumination. Substandard input images are likely to lead to poor performance or an inconclusive outcome. While conventional QEs perform well at estimating perceptual quality, their scores do not correlate with the performance of a practical video analytics system on a given input image.

In practical video analytics systems, complex deep learning models are used to perform various computer vision tasks, such as object detection, segmentation, and matching, often in resource-limited environments. For instance, in an end-to-end face analytics system, multiple tasks such as face detection, alignment, and recognition are carried out sequentially as shown in Fig. 1. These tasks are interdependent, meaning the performance of one module often relies on the success of the previous ones. In [5], we demonstrated that in an end-to-end practical face analytics system, the performance of the recognition module will always depend on how well faces are detected by the detection module. If the detector performs poorly on a specific image, it would be inefficient to pass this image onto the recognition module and waste valuable resources. Hence, there is a need for quality estimators (QEs) that can link image quality with the performance of various modules in such an end-to-end system to ensure efficient use of system resources. Ideally, these QEs should be lightweight in nature, allowing the user to quickly decide if an input is appropriate or not for a specific task.

Fig. 1 End-to-end face analytics system

In the current literature on task-specific QEs for face analytics, there are several QEs specifically designed for face recognition tasks [6,7,8,9,10,11,12]. These face recognition QEs assess the suitability of an image for face recognition, that is, how well face recognition can be performed on a given face image [13, 14]. Face detection is a critical pre-processing step before face recognition in end-to-end face analytics systems, since it is responsible for the localization and alignment of faces in an image. However, QEs that assess whether an image is suitable for face detection have not yet been designed. Furthermore, it is not enough to design QEs for face detection; one must also be able to interpret the quality scores.

There is also a clear distinction between how perceptual QEs and task-specific QEs are evaluated. Perceptual QEs are typically evaluated by computing the correlation between the predicted quality score and subjective ground truth from human viewers. Evaluation methods to prove the effectiveness of task-specific QEs, however, are still being defined. In [15], we see a thorough analysis of evaluation protocols that exist for face recognition QEs [8, 15, 16]. However, suitable evaluation protocols need to be developed to evaluate QEs for tasks like face detection.

In this paper, our objective is to design QEs that can determine the quality of an image in terms of the face detection performance of well-known deep learning-based face detectors like RetinaFace [17]. To achieve this, we first introduce a novel face detection specific quality metric called the Detectability (DET) score that quantifies the suitability of an image for face detection. Using this DET score as a reference quality metric, we design two QEs for face detection: supervised face detection quality estimator (SFDQE) and unsupervised face detection quality estimator (UFDQE), as depicted in Figs. 2 and 3. In our supervised approach, SFDQE, we follow the conventional approach of training a deep learning network on the DET scores of images which are used as ground truth labels. However, this approach heavily relies on labelled data, which might not always be accessible. To overcome this, we also propose a novel unsupervised QE for face detection called UFDQE. UFDQE estimates the suitability of an image for face detection by calculating the distortions in its latent space feature maps that occur when the image is repeatedly passed through a face detection model with active regularization layers, such as Dropouts [18], Dropblocks [19], or Disouts [20]. Here, we refer to these layers as “feature distortion” layers. UFDQE does not require any ground truth labels to estimate a quality score of an image. We also propose the mAP vs. reject (mvR) protocol that evaluates the effectiveness of SFDQE and UFDQE in the context of face detection.

Fig. 2 Block diagram of the supervised face detection quality estimator (SFDQE) that is trained on ground truth DET scores of images

Fig. 3 Block diagram of the unsupervised face detection quality estimator (UFDQE) that utilizes feature distortion layers to estimate the quality score of an image

In Sect. 2, we provide essential background information about the face detection task and its importance in practical end-to-end face analytics systems [5]. This section also provides details on perceptual QEs, QEs designed for face recognition tasks, and the existing evaluation protocols for assessing QEs in relation to task performance.

Our novel quality metric, the detectability (DET) score, is introduced in Sect. 3, where we delve into how the DET score quantifies the suitability of an image for face detection. In Sect. 4, we propose our novel unsupervised and supervised face detection QEs, UFDQE and SFDQE, respectively. The experiments carried out to evaluate the performance of our face detection QEs are outlined in Sect. 5. Finally, in Sect. 6, we provide a summary of our findings and suggest potential directions for future research. The main contributions of this paper can therefore be summarized as follows:

  1. The detectability (DET) score of an image, a novel quality metric that links image quality to the face detection performance of popular face detectors such as RetinaFace [17] and serves as a reference metric for designing QEs for face detection, is proposed in Sect. 3.

  2. SFDQE, a novel supervised QE for face detection in which a deep learning network is trained using the DET scores of images as ground truth labels, is proposed in Sect. 4.1.

  3. UFDQE, a novel unsupervised QE for face detection that estimates the quality score of an image by computing the variations in its latent space feature maps caused by feature distortion layers, is proposed in Sect. 4.2. UFDQE removes the requirement to compute the ground truth DET scores of images.

  4. The mAP vs. reject (mvR) protocol for evaluating the performance of QEs in the context of face detection is proposed in Sect. 3.3.1.

2 Background and related work

In this section, we provide essential background information regarding the task of face detection, focusing on popular data sets and specific deep learning networks used to perform the task. We also provide details regarding traditional perceptual QEs, task-specific QEs specifically in the domain of face recognition, and the existing evaluation methods used to assess QEs in terms of task performance.

2.1 Face detection

Face detection involves the automatic classification and localization of human faces in images and videos. It is an essential task that supports numerous applications such as face recognition and verification in an end-to-end face analytics system. Face detection models identify the location of faces in an image along with the landmarks on each face, which include the eyes, nose, and mouth. These landmarks are required to align all the faces in an image to a specific template suitable for face recognition tasks. While face recognition tasks such as identification and verification are inherently complex, face detection is also challenging due to the numerous variations in facial appearances. These variations can encompass different pose orientations (frontal or non-frontal), obstructions, image orientations, lighting conditions, and facial expressions.

2.2 Face detection data sets

The design and evaluation of face detection algorithms, and of QEs for face detection, are heavily reliant on the choice of data sets used. Face detection data sets provide a diverse array of face images for training and testing. These images are annotated with face bounding boxes and landmarks. Over time, numerous unique and challenging face detection data sets have been developed, varying in image quantity, diversity (including differing lighting conditions, poses, expressions, and occlusions), and the number of faces per image.

In this paper, we use images from well-known face detection data sets, WiderFace [21] and the face detection data set and benchmark (FDDB) [22], to assess QEs for face detection. The WiderFace data set is one of the largest of its kind, with over 32,000 images and 393,000+ marked faces, sourced from the internet, TV shows, and films. It includes a broad spectrum of environments, ethnic groups, poses, and expressions, and is commonly used as a benchmark for face detection algorithms. The FDDB data set, while smaller, contains 5171 faces across 2845 images from everyday scenarios. Both data sets feature faces in various orientations and poses, and present challenges due to occlusions and difficult lighting conditions.

The WiderFace and FDDB data sets are divided into pre-defined train and test sets which are separate, meaning there is no overlap between the images in these sets. Deep learning face detectors are optimized on the train set of these data sets. Subsequently, their performance is assessed on the test set of the data sets to evaluate their ability to accurately locate faces in the unseen images.

2.3 Deep learning-based face detectors

Deep learning models such as multi-task cascaded convolutional networks (MTCNN) [23], you only look once (YOLO) [24], the single shot MultiBox detector (SSD) [25], and RetinaFace [17] have significantly improved face detection performance, surpassing traditional techniques such as the Haar cascades classifier [26] and the histogram of oriented gradients [27]. The MTCNN face detector [23] is a three-step cascading network of convolutional neural networks (CNNs) that can identify faces and their landmarks in an image. The process involves generating candidate windows in the first stage, refining these windows in the second stage, and further refining the result and producing facial landmarks in the third stage. YOLO [24], a system for real-time object detection, treats object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. It applies a single neural network to the entire image, dividing it into regions and predicting bounding boxes and probabilities for each region. This unified model is notably faster than traditional methods, making it ideal for real-time applications like face detection. SSD [25] is another widely used face detection algorithm that discretizes the bounding box output space into a set of default boxes with varying aspect ratios and scales per feature map location. Unlike methods that predict bounding boxes from a single feature map, SSD combines predictions from multiple feature maps at different resolutions, with the default boxes matched to the scale of each feature map. This design enhances the efficiency and accuracy of face detection.

More recently, RetinaFace [17] has shown the best face detection performance amongst popular deep learning models. The RetinaFace [17] architecture is a single-stage design that densely samples face locations and scales on feature pyramids while leveraging multi-task losses for face detection, alignment, and 3D reconstruction tasks. It has demonstrated promising results and faster speeds compared to previous methods, such as MTCNN and SSD. The architecture of RetinaFace consists of three parts: the Backbone, the Feature Pyramid Network (FPN) with Context Modelling layers, and the classification and localization heads. The Backbone of RetinaFace is typically based on either ResNet [28] or MobileNet [29] modules, depending on the application. In applications where accuracy is a priority, the more complex ResNet50 backbone is used. Conversely, for applications where efficiency and speed are crucial, the lighter MobileNet backbone is employed.

The backbone outputs several feature maps of different resolutions that are then input to the FPN. The FPN is tasked with extracting face-specific features from the output of the Backbone and creating feature maps at five different scales. The Context Module, appended to the FPN, is designed to gather additional contextual information from these features and comprises five unique filters. Finally, the output of the context module is input to the classification and localization heads, which output a face bounding box and landmarks. The entire architecture is optimized using a multi-level loss function for various tasks that include face classification, face detection, landmark regression, and dense 3D face regression. RetinaFace has consistently demonstrated superior performance on numerous face detection data sets, such as WiderFace [21] and FDDB [22]. Although newer models such as YoloV8-Face [30], Poly-NL [31], and ASFD [32] have recently shown slightly better performance, RetinaFace continues to be one of the most powerful and accessible face detectors. For this reason, we use RetinaFace as the underlying framework of our novel face detection QEs. We also use it as the face detection network in the mAP vs. reject evaluation protocol.

2.4 Perceptual quality estimators

Conventional no-reference image quality estimators are designed to assess the perceptual quality of images. They aim to assess how an image is viewed by a human, that is, whether a human deems the image to be of poor or excellent quality. These QEs are not designed to assess image quality for specific tasks like face detection. We have included traditional perceptual QEs such as BRISQUE [2] and PIQE [4] as well as more recent deep learning-based approaches such as UNIQUE [33] and MUSIQ [34] in our experiments. This provides a useful contrast regarding the improvements that can be obtained by designing a QE specifically for face detection.

The blind/referenceless image spatial quality evaluator (BRISQUE) [2] is a technique for evaluating image quality without the need for a flawless or clean reference image. It leverages locally normalized luminance coefficients to derive spatial natural scene statistics (NSS) features from a mix of natural and distorted images, which are subsequently used to measure potential degradation in natural image statistics. Owing to its efficiency and effectiveness, BRISQUE is frequently chosen for use cases where there are no reference images, such as in real-time video or image quality evaluation. The perceptual image quality evaluator (PIQE) [4] is another popular no-reference image quality estimator that uses perceptual features derived from the human visual system to evaluate the quality of an image. The strength of PIQE lies in the fact that it does not require training. For BRISQUE and PIQE, a lower quality score indicates a higher quality image and vice versa.

The unified no-reference image quality and uncertainty evaluator (UNIQUE) [33] is a deep learning-based no-reference image quality assessment model that leverages multiple convolutional and fully connected layers to extract both low-level and high-level features from images to determine a quality score. Similarly, the multi-scale image quality transformer (MUSIQ) [34] employs a transformer-based architecture to capture both global and local image quality features, providing a more comprehensive evaluation of image quality compared to traditional CNNs. Both UNIQUE and MUSIQ are trained on large data sets of images with diverse quality levels, enabling these models to learn from a wide range of image quality levels and distortions. In the case of deep learning-based perceptual QEs, a higher quality score indicates that the image is more suitable for human perception and vice versa. Experimental results in [33, 34] have demonstrated that both UNIQUE and MUSIQ outperform traditional perceptual quality estimators, such as BRISQUE and PIQE.

2.5 Task-specific quality estimators

Task-specific QEs are designed to link image quality to the performance of an analytics task, such as image classification, object detection, and object recognition. For example, an approach using deep learning-based feature extractors is used to assess the quality of an image for the classification task in [35]. In the domain of end-to-end face analytics specifically, most recent task-specific QE approaches are developed to associate face recognition performance to face image quality. The most popular learning-based approaches for face image quality estimation are SER-FIQ [6], SDD-FIQA [7], and FaceQNet [8].

FaceQNet [8] is a supervised method designed to correlate the quality of a face image with its face recognition accuracy. It uses the BioLab-ICAO framework to generate ground-truth quality labels such that the quality score corresponds to the ICAO compliance level. Advanced deep learning frameworks are then used to predict image quality scores.

On the other hand, SER-FIQ [6] is an unsupervised method that relies on the robustness of the embedding feature vector generated by a face recognition model to assign a quality score to face images. In this approach, a face image is repeatedly passed through a face recognition network like ArcFace with dropout [36] activated. Dropout introduces variability in the feature vectors produced for a specific image. The SER-FIQ quality score for a face image is derived from the pairwise Euclidean distances between these feature vectors: the smaller and more consistent the distances, the higher the quality score. SER-FIQ relies on an underlying face recognition network; hence, the training details of that network are important when interpreting the quality scores.
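
A minimal sketch of this idea is shown below, assuming a hypothetical helper embed_with_dropout(image) that returns one stochastic embedding per forward pass; the published SER-FIQ implementation maps the mean pairwise distance through a sigmoid, and the exact constants may differ.

```python
import numpy as np

def serfiq_style_score(image, embed_with_dropout, passes=10):
    # Collect stochastic embeddings of the same image (dropout active).
    embeddings = np.stack([embed_with_dropout(image) for _ in range(passes)])
    # Mean pairwise Euclidean distance between the stochastic embeddings.
    dists = [np.linalg.norm(embeddings[j] - embeddings[k])
             for j in range(passes) for k in range(j + 1, passes)]
    mean_dist = float(np.mean(dists))
    # Stable embeddings (small distances) map to quality scores near 1.
    return 2.0 / (1.0 + np.exp(mean_dist))
```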

SDD-FIQA [7] is another unsupervised approach for face image quality estimation. SDD-FIQA depends on an underlying face recognition network as well as a fixed database of face images to form intra-class and inter-class distributions for each face in the database. Next, SDD-FIQA calculates a similarity distribution distance for each image (using the Wasserstein Distance) between its intra-class and inter-class distributions. A high-quality image should lie closer to its intra-class samples and differ significantly from its inter-class samples. This distance is then used as a ground truth to train a regression network that assists in generating quality scores for face images.
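
As a rough illustration of this ground-truth signal, the sketch below applies SciPy's Wasserstein distance to hypothetical arrays of similarity scores; the actual SDD-FIQA pipeline derives these distributions from a trained face recognition network and a fixed face database.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sdd_fiqa_label(intra_similarities, inter_similarities):
    # Larger separation between the two distributions implies a higher-quality face.
    return wasserstein_distance(intra_similarities, inter_similarities)

# Example with synthetic similarity scores (illustrative values only).
rng = np.random.default_rng(0)
intra = rng.normal(0.8, 0.05, 100)   # similarities to same-identity faces
inter = rng.normal(0.2, 0.05, 100)   # similarities to other identities
print(sdd_fiqa_label(intra, inter))  # large value -> high pseudo quality label
```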

Although the quality score prediction of both SER-FIQ and SDD-FIQA is unsupervised, these two approaches rely on an underlying face recognition network that needs to be trained on appropriate face data. Furthermore, SDD-FIQA also requires a fixed data set to determine inter-class and intra-class distributions. As a result, these are important considerations that should be taken into account while interpreting the recognition-specific image quality scores estimated by SER-FIQ and SDD-FIQA.

For all the face recognition QEs, a higher quality score indicates that the image is more suitable for face recognition. However, while these QEs are effective in evaluating image quality for face recognition, they are not suitable for face detection tasks. This is because these QEs operate on pre-processed and aligned face images, which are derived from face detection outputs such as face bounding boxes and landmarks. In our previous work [13], we also showed that these face recognition QEs are highly sensitive to the face alignment procedure.

Recently, more face recognition QEs have been introduced that utilize newer deep learning frameworks, such as diffusion models [9, 10], transformers [11], and vision language models [12]. However, these QEs continue to operate on perfectly aligned and pre-processed recognition-specific images that contain a single face and hence, these would not be suitable to assess the quality of images for face detection.

2.6 Evaluation protocols for task-specific quality estimators

In the current literature, there exist few evaluation methods to prove the effectiveness of task-specific QEs. In [15], we presented a thorough analysis of evaluation protocols that exist for face recognition QEs, such as the error vs. reject (EvR) protocol [16], the Best, Medium, and Worst (BMW) protocol [8] and the Gallery-Query (GQ) protocol [15]. However, suitable evaluation protocols need to be developed to evaluate QEs for face detection.

We review the error vs. reject (EvR) protocol here because it forms a template for the mvR protocol we define below. EvR was introduced by [16] and has been extensively used to evaluate biometric and face quality measures [6, 16, 37,38,39]. This protocol characterizes whether a quality measure can effectively rank images by their usefulness and their potential reliability to a system. In this protocol, face images from a face recognition data set are rank-ordered based on quality scores predicted by various QEs. Subsequently, a fraction of face images are excluded from the data set according to these quality scores. Finally, a face recognizer is evaluated on the remaining images in the data set. The EvR protocol helps demonstrate how well the QE orders low-quality images (by reading from the left of the plot), as well as how effectively it orders high-quality images (by reading from the right of the plot) of a face recognition data set. These align with potential system goals where a user might want to prioritize high-quality (more reliable) images, or to request human review before acting on a potential recognition result using a low-quality input.

Another protocol that is commonly used to evaluate face-based QE performance is the "Best-Middle-Worst" (BMW) performance protocol. In this protocol, the data set is split into three different groups based on the quality score, that is, the best, middle, and worst groups, where each group makes up 33% of all images in the data set. Then, we demonstrate that the subsets (ideally) produce ordered performance. This protocol has been used in FaceQNet [8] as well as the NIST fingerprint quality project [40]. Relative to the EvR protocol, this protocol requires the comparison of fewer pairs, because it only considers pairs within each partition. However, splitting the data set into three sets without any constraints lacks clear real-world use-case implications.

Recently, the Gallery-Query (GQ) protocol was proposed in [15] which focuses on the underlying task of face identification. This evaluation protocol explores the effectiveness of a face QE when used to construct the gallery set in a face identification setting; this has direct implication for a real-world use case. In addition to understanding the effectiveness of QEs at the alternative task of face identification, another benefit of this evaluation protocol is that it is significantly faster when compared to the existing evaluation protocols.

It is also possible to construct targeted stress tests [13] to examine if a QE is robust to explainable changes to the input image. One approach is to ask the question—can a QE consistently predict when performance degradation happens across multiple perturbations and across multiple subjects? This approach was shown to expose a weakness of FaceQNet for images subjected to compression or noise.

3 Detectability (DET) score: metric to assess image quality in terms of face detection performance

Estimating the quality of an image for tasks like face detection addresses the question, "Is this image suitable for face detection?" To answer this, we need a quality metric that accurately links the performance of deep learning-based face detection models like RetinaFace [17] to image quality. This quality metric can then be used as a reference to design QEs for face detection.

In this section, we first describe how face detectors are typically evaluated. Then, we propose the Detectability (DET) score of an image which links image quality to face detection performance.

3.1 Variation of mean-average precision (mAP) with intersection-over-union (IoU) threshold

The effectiveness of object detection models on a specific data set is measured using the mean-Average-Precision (mAP) metric [41], calculated at a specific Intersection-over-Union (IoU) threshold. This IoU threshold is chosen based on the application or use case, determining the level of confidence with which the object detector must localize objects in an image. To calculate the mAP, a single IoU threshold is fixed; this is typically either 0.25, 0.5, or 0.75 depending on the application intended for the object detector. A higher IoU threshold requires the face detector to be more confident in localizing faces. Conversely, a lower IoU threshold means the detector can be less confident in face localization.

The Intersection-over-Union (IoU) threshold measures the overlap between the actual position of an object in an image (ground truth bounding box) and its predicted location by the object detector. The degree of overlap is crucial for classifying a detection as a true positive or a false positive. At lower IoU thresholds, a small overlap between the predicted and actual object locations is enough to classify the prediction as a true positive. However, as the IoU threshold increases, the predictions of the detector need to be more closely aligned with the ground-truth bounding boxes to be classified as true positives. The precision and recall values calculated at a particular IoU threshold, when plotted, create the Precision-Recall (PR) Curve, which is then transformed into an interpolated PR curve. The area under this interpolated PR curve gives us the Average Precision (AP) of the object detector for a specific class. The mAP of the detector is the average AP across all classes. In our case, since we are only detecting faces, the mAP of the face detector is the same as its AP. A higher mAP value suggests better face detection performance, while a lower value indicates the opposite.
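
The following minimal sketch, assuming axis-aligned [x1, y1, x2, y2] boxes and a simplified matching rule, illustrates how the same prediction can flip from a true positive to a false positive as the IoU threshold increases.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def classify_detection(pred_box, gt_boxes, iou_threshold=0.5):
    """A prediction counts as a true positive if it overlaps some ground-truth
    box by at least the IoU threshold (matching logic is simplified here)."""
    best = max((iou(pred_box, gt) for gt in gt_boxes), default=0.0)
    return "TP" if best >= iou_threshold else "FP"

# Example: the same prediction flips from TP to FP as the threshold rises.
gt = [[10, 10, 110, 110]]
pred = [20, 20, 120, 120]
print(classify_detection(pred, gt, 0.5))   # TP (IoU ~ 0.68)
print(classify_detection(pred, gt, 0.75))  # FP
```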

It is crucial to understand that the mAP of a face detector can change significantly depending on the IoU threshold used. A higher IoU threshold makes it harder for the detector to accurately localize faces in an image resulting in a lower mAP. Similarly, a lower IoU threshold makes it easier for the detector to localize faces in an image resulting in a higher mAP. The variation in mAP based on IoU thresholds is image specific as well, and can thus serve as a measure to determine the suitability of an image for face detection.

3.2 The DET score of an image

The DET score of an image links image quality to face detection performance by averaging the mAP of the face detector on a given image at various IoU thresholds rather than at one specific IoU threshold. This allows the DET score to capture how a face detector performs on a given image across all use cases, resulting in a robust measure of the quality of an image for face detection. This is important because the IoU threshold for evaluating a face detector largely depends on the use case scenario.

Figure 4 depicts the procedure to calculate the DET score of an image. First, we use a face detection network like RetinaFace-ResNet50 [17] on the image to obtain face location predictions. Then, we compare these predictions to the ground-truth face locations at "n" different IoU thresholds. Here, we set n = 11; that is, we use IoU thresholds of 0.05, 0.1, 0.2, 0.3, …, 0.9, 0.95. The mAP of the face detector is then calculated at each IoU threshold, and the mAP values are averaged to obtain the DET score of the image. This is summarized in the following equation:

$$ \text{DET Score} = \frac{1}{n}\sum_{i \in \{0.05,\,0.1,\,\ldots,\,0.95\}} \underset{\text{ground truth}\,\times\,\text{pred}}{\text{mAP}_{\text{FaceDetector}}@\,\text{IoU}=i} $$
(1)

where n is the total number of different IoU thresholds at which the mAP of the face detector is calculated.
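
A minimal sketch of Eq. 1 is shown below; map_at_iou is a hypothetical helper that returns the mAP of the face detector on a single image at one IoU threshold, given its predictions and the ground-truth face locations.

```python
import numpy as np

IOU_THRESHOLDS = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]  # n = 11

def det_score(predictions, ground_truth, map_at_iou):
    """Average the per-image mAP over the full range of IoU thresholds."""
    maps = [map_at_iou(predictions, ground_truth, iou) for iou in IOU_THRESHOLDS]
    return float(np.mean(maps))
```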

Fig. 4 Block diagram depicting the procedure to calculate the detectability (DET) score of an image

The DET score ranges from 0 to 1, with 1 being the highest quality. Images more suitable for face detection will therefore have higher DET scores, as the mAP of such images will remain high even at higher IoU thresholds. Thus, the DET score of an image effectively links image quality to face detection performance. In our upcoming sections, we leverage the DET scores to develop novel face detection QEs.

In Fig. 5, we illustrate how the DET score correlates image quality to face detection performance by using the RetinaFace-ResNet50 face detector and a few images from the test set of the WiderFace [21] data set. In this figure, we plot the mAP values of RetinaFace-ResNet50 obtained on an image along the y-axis and the IoU thresholds at which the mAP is determined are plotted along the x-axis. For images 1 and 2, the mAP stays consistent up to high IoU thresholds. Thus, the face detector is very confident in localizing faces in these images, and the DET score of these images is high. For images 3 and 4, the mAP is initially high but decreases as the IoU threshold increases, which shows that the detector is less confident in predicting the face locations in such images. Therefore, these images have lower DET scores than images 1 and 2. Images 5 and 6 show low mAP at all IoU thresholds, suggesting that the detector finds it very difficult to localize faces in these images. As a result, these images have the lowest DET scores.

Fig. 5 DET score for six images from the WiderFace [21] data set computed by averaging the mAP at various IoU thresholds

Figure 5 also demonstrates the need for averaging mAPs across various IoU thresholds for the DET score calculation. Depending on the application or use case, the mAP of a face detector is evaluated at a fixed IoU threshold of either 0.25, 0.5, or 0.75 during testing [42]. However, calculating the DET score at a fixed IoU threshold can lead to poor image ranking. For instance, Images 1 and 2 have the same mAP at the typically used IoUs of 0.25, 0.5, or 0.75. However, Image 2 eventually experiences a more rapid mAP decline than Image 1, making it less suitable for face detection. Therefore, when calculating the DET score of an image, it is crucial to average the mAP across the entire IoU threshold range. This captures a more comprehensive and accurate link between the quality of an image and its face detection performance, resulting in a more robust face detection-specific image quality score. This is explored in more detail in Sect. 3.3.2.

3.3 Effectiveness of DET score to link image quality and face detection performance

In this section, we validate the effectiveness of the DET score of an image. The DET score is designed to be meaningful and interpretable, directly reflecting the performance of popular face detectors on a specific image. To demonstrate this, we first introduce the mAP vs. reject (mvR) protocol, an evaluation protocol that effectively shows the variation in the performance of a face detector based on image quality scores. Following this, we use the mvR protocol to demonstrate how the DET score can be employed to assess the suitability of an image for face detection.

3.3.1 mAP vs. reject (mvR) protocol: evaluating the effectiveness of image quality scores for face detection

The mAP vs. reject (mvR) protocol is an adaptation of the error vs. reject (EvR) protocol [16] which has been widely applied in the assessment of biometric and face quality measures [6, 16, 37,38,39]. The mvR protocol is an evaluation protocol used to assess the performance of QEs in the context of face detection specifically. It essentially characterizes whether a quality measure can effectively rank images by their usefulness and potential reliability for face detection.

The mvR protocol depends on three key elements to evaluate the efficiency of a QE for face detection: a data set, quality scores estimated by a QE, and a face detector. In this protocol, we rank-order the face images according to their quality scores predicted by task-specific QEs. We then reject a fraction of the images based on the ranking specified by each QE; that is, we drop a percentage of the images from the data set that have low scores assigned by a QE. Next, we evaluate the mAP of the face detector on all the remaining images of the data set after excluding the images with lower quality scores, but only at one specific IoU threshold. In our case, we fix the IoU threshold to 0.5, which is commonly used for testing object detectors.
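
The sketch below outlines these steps; map_on_subset is a hypothetical helper that evaluates the detector's mAP on a list of images at a fixed IoU threshold.

```python
import numpy as np

def mvr_curve(images, quality_scores, detector, map_on_subset,
              drop_fractions=np.arange(0.0, 1.0, 0.05), iou=0.5):
    """Rank images by quality, drop the lowest-scoring fraction, and
    re-evaluate the detector's mAP on the remaining images."""
    order = np.argsort(quality_scores)          # ascending: worst-scored first
    curve = []
    for frac in drop_fractions:
        n_drop = int(frac * len(images))
        kept = [images[i] for i in order[n_drop:]]
        curve.append((frac, map_on_subset(detector, kept, iou)))
    return curve
```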

It is important to reiterate the difference between the mvR evaluation protocol and the DET score calculation. The mvR protocol calculates the mAP of the face detector for the whole data set or a subset of the data set, but only at a single IoU threshold. On the contrary, as specified in Sect. 3.2, the DET score is calculated for a single image by averaging the mAP of the face detector on that image across multiple IoU thresholds. The ideal behavior of the mvR curve for a quality metric like the DET score, which is designed to link image quality and face detection performance, is that the mAP of the detector should increase as we remove images with low DET scores from the data set.

Figure 6 illustrates the implementation of the mvR protocol, where we rank-order images present in the WiderFace [21] test set according to their DET scores. The mAP of the RetinaFace-ResNet50 face detector is then computed at a fixed IoU threshold of 0.5 on the remaining images of the data set after excluding a certain percentage of low quality images. The resulting mvR curve for the DET score, as depicted in Fig. 6, is created by plotting the mAP of the face detector at a specific IoU threshold of 0.5 on the y-axis against the percentage of images dropped along the x-axis.

Fig. 6 mAP vs. reject protocol on the WiderFace data set using the DET score. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

The mvR curve in Fig. 6 demonstrates that as we remove poor-quality images based on the DET score, the mAP of the face detector starts improving. The most significant improvement is observed at lower drop percentages when the lowest scoring images are excluded, that is, when images that are not suitable for face detection are dropped from the data set. The mAP of the face detector reaches a maximum value when about 60% of the images are dropped from the data set. This implies that the WiderFace data set has a good balance between low and high quality images. Furthermore, this curve enables us to evaluate the effectiveness of the DET Score in ranking both low-quality images (by reading from the left of the plot), as well as high-quality images (by reading from the right of the plot). This is particularly relevant in systems where prioritizing high-quality images or requiring human review for detections based on low-quality inputs may be desirable. Thus, we can conclude that the DET score effectively ranks images based on their suitability for face detection; that is, as we drop images based on their DET scores, the mAP of the face detector increases.

3.3.2 Can the performance of the DET score be used as a reference to design QEs for face detection?

As illustrated in Fig. 6, the DET score effectively ranks images in the WiderFace data set, thereby successfully assessing the suitability of an image for face detection. However, prior to using the DET score as a reference metric to design QEs for face detection, we must ascertain whether it is necessary to average the mAP of the face detector across different IoU thresholds. It may be worth considering the mAP of the face detector at a specific IoU threshold as a quality score instead.

To investigate this, we first compute the DET score for every image in the test set of the WiderFace data set using Eq. 1. Following that, we calculate the mAP values of RetinaFace-ResNet50 on the images at a specific IoU threshold, such as 0.25, 0.5, or 0.75. We directly use this mAP value at a fixed IoU threshold as an image quality score and refer to it as the "Fixed-IoU" quality score. We then plot mvR curves for the DET score and the Fixed-IoU scores. During the evaluation using the mvR protocol, we fix the testing IoU threshold of the face detector to be one of the fixed IoU thresholds used to calculate the Fixed-IoU quality score.

Figures 7, 8, and 9 display the mvR curves of the DET score and Fixed-IoU scores calculated on the images from the test set of the WiderFace data set. In these figures, we use the RetinaFace-Resnet50 detector at different testing IoU thresholds of 0.25, 0.5, and 0.75, respectively, to create the mvR plots. In all these figures, we plot the mAP of the detector at a fixed IoU threshold along the y-axis against the percentage of images dropped along the x-axis.

Fig. 7 mAP vs. reject protocol on the WiderFace data set using the DET score and Fixed-IoU scores. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.25

Fig. 8 mAP vs. reject protocol on the WiderFace data set using the DET score and Fixed-IoU scores. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

Fig. 9 mAP vs. reject protocol on the WiderFace data set using the DET score and Fixed-IoU scores. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.75

The first thing we notice in these figures is the different starting points for the mvR curves. In Fig. 7, all the mvR curves start at an mAP value of 0.845 which is the mAP of the face detector at an IoU threshold of 0.25 on the entire data set without dropping any images. However, when we increase the IoU threshold to 0.5 in Fig. 8, we see that the mvR curves start at an mAP of 0.756. Similarly, in Fig. 9, the mvR curves start at an mAP of 0.405. This shows that as the IoU threshold increases, it becomes harder for the detector to confidently localize faces in an image.

The next thing we see is that all the mvR curves trend upwards as we drop images from the data set based on the rank-ordering of the quality scores. This indicates that the DET score and the Fixed-IoU scores are effective in determining the suitability of an image for face detection. However, the mvR curve of the DET score, which averages the mAP of the detector across all IoU thresholds, is the most effective in all cases, unlike the Fixed-IoU score curves, which are only effective in specific scenarios.

Now consider the mvR curves in Fig. 7 specifically, which shows the mvR curves at a fixed testing IoU threshold of 0.25. From this figure, it is clear that the mvR curve of the DET score (purple curve) is much higher than the Fixed-IoU curves at all specific IoU thresholds. More importantly, we also see that even when the Fixed-IoU score is calculated at an IoU threshold of 0.25 (gray curve), which is the IoU used to estimate the mAP of the face detector, the mvR curve lies below the DET score curve. This means that the rank-ordering of images is not as good as that of the DET score when a Fixed-IoU score is used as a quality measure. Furthermore, when a large percentage of images are dropped from the data set based on this Fixed-IoU score, the mAP of the face detector does not reach its maximum possible value; this maximum is achieved when the DET score is used.

Similar outcomes are observed when the mvR curves are plotted with the mAP of the face detector calculated at a fixed IoU of 0.5 and 0.75, as shown in Figs. 8 and 9, respectively. The results in Fig. 8 are very similar to those in Fig. 7; that is, when the Fixed-IoU score is calculated at an IoU threshold of 0.5, which is the IoU used to estimate the mAP of the face detector, its mvR curve (pink curve) lies below the DET score curve. From the right side of the plot, we observe that discarding low-quality images based on Fixed-IoU scores calculated at IoU thresholds of 0.25 (orange curve) and 0.5 (curve) reduces the mAP of the face detector on the remaining images of the data set, which should technically only contain high-quality images. This contradicts the expectation that as we remove low-quality images, the mAP of the face detector should increase.

As mentioned earlier in Sect. 3.2, the main issue with using the Fixed-IoU scores is that many images tend to have a similar mAP at a specific IoU threshold. This fails to capture the mAP variation of a face detector across the entire range of IoU thresholds, as depicted in Fig. 5. Consequently, many images receive the same quality score, resulting in a less precise image ranking, so that some high-quality images might be discarded over poor ones.

The DET score, on the other hand, is very effective in determining the suitability of an image for face detection irrespective of the IoU at which the mAP of the face detector is calculated for the mvR protocol. This is because the DET score captures the overall performance of the face detector on an image across a large range of IoU thresholds rather than using a single fixed IoU threshold. Hence, in the upcoming sections, we design QEs that can accurately predict the DET score of an image such that the mvR curves of the QEs lie as close to the DET score curve as possible.

4 Quality estimators for face detection

Having established that the DET score is effective in rank-ordering images based on their suitability for face detection, our next goal is to develop Quality Estimators (QEs) that predict image quality scores which rank-order images in accordance with their respective DET scores. This is necessary because calculating the DET score of an image requires ground-truth face locations, which are unlikely to be available in many situations. In this section, we introduce two new QEs for face detection that estimate the quality score of an image in a more efficient and streamlined manner, bypassing the DET score procedure detailed in Sect. 3.2: the Supervised Face Detection Quality Estimator (SFDQE) and the Unsupervised Face Detection Quality Estimator (UFDQE).

4.1 SFDQE: supervised face detection quality estimator

In this section, we introduce our novel Supervised Face Detection Quality Estimator (SFDQE), as shown in Fig. 10. SFDQE follows the conventional supervised paradigm for training deep learning networks; that is, it involves training a deep learning-based QE network by using the DET scores of images as ground truth labels. The QE network is optimized to directly predict a quality score of an image that is as close to its DET score as possible.

Fig. 10 Detailed block diagram of the supervised face detection quality estimator (SFDQE)

Our supervised approach, SFDQE, employs a deep learning QE network that is very similar to the face detection network. The QE network includes a Backbone network based on the MobileNet architecture [29], a Feature Pyramid Network (FPN), and a Quality Score Regression Head. The QE network is considerably simpler than a face detection model like RetinaFace-ResNet50. This is because, unlike the face detector, the QE network uses a lighter MobileNet backbone and does not need Context Modeling layers or Classification and Localization heads, which are necessary to identify and precisely localize faces in an image. The low-level features captured by the Backbone architecture, along with the FPN, are sufficient to encapsulate the face-related semantic information in an image, which aids in measuring its task-specific quality. The output feature maps of the FPN are input to the Quality Score Regression Head, which consists of fully connected layers only. The Quality Score Regression Head predicts a quality score for an image that closely matches its DET score.

To train the QE network of SFDQE, we first compute the DET scores for all images in the training set of our data set for which ground-truth face locations are available. The QE network takes an image as input and directly predicts a DET score of the image. We use the mean squared error loss function to measure the difference between the predicted and actual DET scores. The QE network is optimized to minimize this error, aiming to make its DET score predictions as accurate as possible. When evaluating SFDQE, we use unseen images from the test set of our data set to gauge the accuracy of its estimates compared to the actual DET scores. Although we do have ground-truth face locations for these images, SFDQE has never seen these images during its training phase.
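
The following sketch illustrates this training step in PyTorch, assuming the DET scores have been precomputed; the backbone and regression head shown here are simplified stand-ins for the MobileNet-plus-FPN architecture described above, not the exact SFDQE network.

```python
import torch
import torch.nn as nn

class SFDQE(nn.Module):
    def __init__(self):
        super().__init__()
        # Lightweight convolutional backbone (stand-in for MobileNet + FPN).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Quality Score Regression Head: fully connected layers only.
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(64, 32), nn.ReLU(),
                                  nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, x):
        return self.head(self.backbone(x)).squeeze(-1)

def train_step(model, optimizer, images, det_scores):
    """One optimization step on a batch: regress the ground-truth DET scores."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(images), det_scores)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, an optimizer such as torch.optim.Adam over the model parameters would drive train_step across batches of (image, DET score) pairs from the training set.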

SFDQE depends on labeled data, specifically, the ground truth DET scores of images, calculated using known face locations. However, this data might not always be readily available and could require substantial time and effort to produce. If there is not enough training data, it can be challenging to guarantee high-quality performance for SFDQE. In addition, training SFDQE can be time-consuming, particularly with larger data sets.

In Sect. 5, we perform an extensive evaluation of the SFDQE using the mvR protocol, assessing how closely its performance aligns with the DET score curve depicted in Fig. 6. We also discuss the reasoning behind specific design decisions related to the architecture of the QE network in SFDQE and how these choices affect its quality estimation performance.

4.2 UFDQE: unsupervised face detection quality estimator

Unsupervised methods are preferred over supervised ones because they can be trained on unlabeled data and they do not require pre-training. This makes them particularly useful in real-world scenarios where data labeling is expensive or impractical. In addition, developing unsupervised approaches by leveraging existing deep learning techniques would enhance their applicability in practical settings. With this in mind, we also propose a new Unsupervised Face Detection Quality Estimator (UFDQE), as depicted in Fig. 11.

Fig. 11 Detailed block diagram of the unsupervised face detection image quality estimator (UFDQE)

UFDQE exploits the distortions created in the feature maps of an image as it traverses an underlying face detection network that includes regularization layers, such as Dropouts [18], Dropblocks [19], and Disouts [20]. UFDQE quantifies these feature map distortions to compute a quality score that associates image quality to face detection performance without the need for any ground-truth labels.

4.2.1 Feature distortion layers

In deep learning networks, regularization layers such as Dropouts [18], Dropblocks [19], and Disouts [20] are commonly used to avoid overfitting. These layers work by restricting a network from using certain parts of its architecture as an input passes through it. Thus, when the same input is passed through the network, it is forced to use a different sub-network to learn the specifics of an input. Due to this, different latent space features are produced each time the same input passes through a network with these regularization layers. In other words, these layers alter the features of a given input as it passes through a deep learning network. Therefore, in this paper, we refer to these layers as “feature distortion” layers.

As an image passes through a face detection network, the feature distortion layers present in the network distort the feature maps produced by the convolutional layers of the network. Each convolutional layer generates multiple feature maps for a given input image. Each feature map of the network learns about a specific part of the input image. However, it is possible for multiple feature maps to learn about the same part of the image. Dropouts, when used with convolutional layers, operate by removing an entire feature map from the output of the layer. This method is less effective as it does not account for the fact that multiple feature maps can learn the same part of the input image. Therefore, even with Dropouts present, all the vital information about the image continues to flow through the network and the network is never forced to improve its learning capacity. To address this, Dropblocks and Disouts were introduced. Instead of dropping an entire feature map, these methods drop a continuous region in the feature maps. Dropblocks and Disouts achieve more effective regularization than Dropouts as they encourage the network to activate different parts of each feature map to learn the specifics of an input image.

Usually, these feature distortion layers are deactivated once a deep learning network is trained, meaning that no feature distortions occur during testing. However, in the case of UFDQE, we choose to keep these layers activated, so that we can generate feature distortions. This allows UFDQE to utilize existing face detector architectures and deep learning techniques to link image quality and face detection performance.
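
In PyTorch, for example, keeping such layers active at test time amounts to switching only the dropout-style modules back to train mode after calling model.eval(); the minimal sketch below uses only standard Dropout/Dropout2d layers as stand-ins for Dropblock and Disout.

```python
import torch
import torch.nn as nn

def enable_feature_distortion(model):
    model.eval()                      # freeze BatchNorm statistics, etc.
    for module in model.modules():
        if isinstance(module, (nn.Dropout, nn.Dropout2d)):
            module.train()            # re-enable stochastic dropping
    return model

# Repeated passes of the same image now yield different feature maps / outputs.
demo = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.Dropout2d(p=0.3))
enable_feature_distortion(demo)
x = torch.randn(1, 3, 64, 64)
with torch.no_grad():
    out_a, out_b = demo(x), demo(x)
print(torch.allclose(out_a, out_b))   # False (with high probability)
```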

4.2.2 Design of UFDQE

Figure 11 provides a detailed block diagram of UFDQE. In UFDQE, we create a QE network that is identical to the face detection network from Fig. 10, but with feature distortion layers added after the Backbone of the network. This QE network, like the face detection network, predicts face locations in an image, but in the presence of the feature distortion layers. As an image is passed through this QE network, its feature maps are altered by the distortions introduced by the feature distortion layers after the Backbone. This leads to variations in the feature maps in subsequent network modules, such as the feature pyramid, context modeling layers, and the classification and localization heads. Because of these feature map variations, the QE network produces different face location predictions for the same image each time it is processed. In UFDQE, a single image is passed multiple times through the QE network with feature distortion layers. Then, the variation in the face location predictions of the QE network for the image is measured to determine its quality score.

Although one can draw some parallels between the designs of UFDQE and face recognition QEs like SER-FIQ [6], UFDQE has distinct features that make it suitable for assessing the quality of images for face detection. The performance of SER-FIQ relies heavily on the underlying face recognition network, while UFDQE relies on a face detection network to determine the quality score of an image for face detection. Unlike SER-FIQ, which captures feature distortions only at the last layer of a face recognition network using Dropouts, UFDQE utilizes convolution-specific feature distortion layers such as Dropblocks and Disouts in addition to Dropouts after the backbone of a face detector. This allows UFDQE to capture and accumulate feature map variations that occur across several parts of the face detector, such as the FPN and context modelling layers, and the classification and localization heads. This design of UFDQE can result in more reliable quality score predictions that strongly correlate image quality to face detection performance, as further discussed in Sect. 5.5.3.

4.2.3 Predicting the quality score of an image using UFDQE

UFDQE leverages an underlying face detection network with feature distortion layers to estimate a quality score for an image without needing ground-truth face locations. This quality score essentially assesses the consistency of the face location predictions made by the QE network of UFDQE, indicating the suitability of an image for face detection.

The quality score of an image estimated by UFDQE is computed using Eq. 2, which is easier to understand when read from right to left:

$$ \text{UFDQE Quality Score} = \frac{1}{p(p-1)}\sum_{\substack{j,k=1 \\ j \ne k}}^{p} \frac{1}{n}\sum_{i \in \{0.05,\,0.1,\,\ldots,\,0.95\}} \underset{\text{pred}_{j}\,\times\,\text{pred}_{k}}{\text{mAP}_{\text{QE}}@\,\text{IoU}=i} $$
(2)

where n specifies the number of different IoU thresholds used and p specifies the number of predictions made by the QE network on an image.

The second summation in Eq. 2 is quite similar to the equation used to calculate the DET score (Eq. 1). While the DET score compares the ground truth face locations of an image to the face location predictions of a face detection network at various IoU thresholds, UFDQE calculates a quality score by comparing multiple predictions of the QE network on the same image at different IoU thresholds. Each prediction of the QE network is compared to all of its other predictions, resulting in multiple mAP values. The first summation in Eq. 2 then averages all these mAP values to derive the final quality score of an image. Here, we compare p = 10 different predictions of the QE network on the same image. Each prediction is compared to every other prediction at the n = 11 IoU thresholds 0.05, 0.1, 0.2, 0.3, …, 0.9, 0.95; averaging over these thresholds yields p(p − 1) = 90 mAP values for the same image. These values are then averaged to obtain the final quality score of the image.
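
A minimal sketch of Eq. 2 is shown below; map_between is a hypothetical helper that treats one set of predicted boxes as ground truth and computes the mAP of the other set against it at a given IoU threshold.

```python
import itertools
import numpy as np

IOU_THRESHOLDS = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]  # n = 11

def ufdqe_score(predictions, map_between):
    """`predictions` holds the p stochastic prediction sets for one image."""
    pair_scores = []
    for pred_j, pred_k in itertools.permutations(predictions, 2):  # p(p-1) ordered pairs
        maps = [map_between(pred_j, pred_k, iou) for iou in IOU_THRESHOLDS]
        pair_scores.append(np.mean(maps))      # inner average over the n thresholds
    return float(np.mean(pair_scores))         # outer average over the p(p-1) pairs
```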

4.2.4 Effectiveness of UFDQE in linking image quality and face detection performance

Understanding how the quality score is estimated using UFDQE is crucial, but interpreting it in a meaningful way is equally important. In the case of UFDQE, a higher quality score means that the face location predictions of the QE network on the image stay consistent across multiple runs. This essentially indicates that the feature maps of the image undergo minimal variation as it passes through the feature distortion layers of the QE network, and hence that the image is more suitable for face detection. Similarly, an image with a lower quality score would be less suitable for face detection, as its feature maps would vary considerably as it passes through the feature distortion layers of the QE network.

In Fig. 12, we provide the quality scores estimated using UFDQE for the same images from the WiderFace data set as shown in Fig. 5. From this figure, we see that the scores estimated using UFDQE rank the images in the same order as the DET score. Therefore, UFDQE estimates effective quality scores that determine the suitability of an image for face detection. In Sect. 5, we delve deeper into the performance of UFDQE and the effectiveness of the quality scores it predicts in linking image quality to face detection performance.

Fig. 12

Quality scores for six images from the WiderFace [21] data set obtained using UFDQE

4.2.5 Advantages of UFDQE over SFDQE

While both SFDQE and UFDQE are novel QEs that connect image quality and face detection performance, UFDQE does possess a few advantages over SFDQE.

The main advantage is its simplicity of implementation. While SFDQE is built on existing face detection networks, it still requires training on ground truth DET score labels, as shown in Fig. 10. In contrast, UFDQE can be directly built on top of deep learning face detectors that include feature distortion layers. This allows UFDQE to predict the quality score of an image without any additional training of the network.

Another advantage of UFDQE is that while SFDQE necessitates the design of a Quality Score Regression Head for its QE network (Fig. 10), the QE network in UFDQE uses the same architecture as the face detection network. If the face detection network used for UFDQE lacks feature distortion layers, they can easily be added after the Backbone network, as shown in Fig. 11, without any other changes to the modules of the network. This approach does require retraining or fine-tuning on the ground truth face locations from a face detection data set, but it removes the need to compute the ground truth DET score labels required to train SFDQE.

5 Experiments and results

In this section, we discuss experiments that answer important research questions pertaining to the effectiveness of QEs in linking image quality to face detection performance. These research questions are as follows:

  1. Are perceptual QEs effective in linking image quality and face detection performance? (Sect. 5.2)

  2. Are face recognition QEs effective in linking image quality and face detection performance? (Sect. 5.3)

  3. Are the task-specific QEs, SFDQE and UFDQE, effective in linking image quality and face detection performance? (Sect. 5.4)

  4. How do certain design choices impact the performance of SFDQE and UFDQE? (Sect. 5.5)

  5. Can SFDQE and UFDQE perform well in a cross-data set evaluation scenario? (Sect. 5.6)

  6. Can the design principles of SFDQE and UFDQE be used to create QEs for other object detection tasks? (Sect. 5.7)

We first provide essential information about our experimental framework, including the face detection data sets used, the structure of the deep learning networks of SFDQE and UFDQE, and the face detector utilized for the mvR protocol. Then, we delve deeper into each of the research questions listed above. In our first experiment, we verify that perceptual QEs are not effective for evaluating the quality of images in a task-specific scenario like face detection. Then, we demonstrate why face recognition QEs are not suitable for face detection image quality assessment. We then demonstrate the effectiveness of our QEs, SFDQE and UFDQE, for face detection and discuss the decisions made while designing these QEs. Next, we show that our face detection QEs perform well on unseen face data sets. Finally, we show that the design of our QEs is generalizable and not specific to face detection; similar design principles can be followed to create QEs for other object detection tasks like pedestrian detection.

5.1 Experimental framework

In our experiments, our objective is to utilize images from well-known face detection data sets to test our novel QEs and perceptual QEs for the task of face detection using the mvR protocol. To achieve this objective, we use the experimental framework shown in Fig. 13.

Fig. 13

Block diagram of the experimental framework used to evaluate QEs

Here, we rely on images from popular face detection data sets, WiderFace [21] and the Face Detection Data Set and Benchmark (FDDB) [22] to evaluate QEs for face detection. We use the train set of these data sets for training our QE networks and face detection networks. The test set images are utilized only for evaluation.

In terms of task-specific QEs, we evaluate the effectiveness of our novel QEs, SFDQE and UFDQE. Both SFDQE and UFDQE are built on top of an underlying face detection network, RetinaFace [17]. We use the lightweight MobileNet [29] backbone in our QE networks instead of the ResNet50 [28] backbone. We also evaluate popular face recognition QEs, such as SER-FIQ [6] and SDD-FIQA [7], in the context of face detection in some of our experiments. Furthermore, we evaluate perceptual QEs, such as BRISQUE [2], PIQE [4], UNIQUE [33], and MUSIQ [34].

We use the mvR protocol for evaluation of the QEs, as described in Sect. 3.3.1. During the evaluation, we use the mvR curve of the DET score from Fig. 6 as the target performance curve; that is, we want the curves of the QEs to follow an upward trend and lie as close as possible to the DET score curve.

In most of our experiments, we rely on a specific default configuration of SFDQE, UFDQE, and the mvR protocol, which is as follows:

  • SFDQE consists of the MobileNet backbone, an FPN, and a Quality Score Regression Head composed of fully connected layers. We first create ground truth DET scores of the training images using Eq. 1. The QE network of SFDQE is then trained from scratch on these ground truth DET scores.

  • UFDQE consists of the MobileNet backbone with the other modules of RetinaFace unchanged. As the RetinaFace architecture does not have any feature distortion layers, we add Disouts [20] after the Backbone. The Disout layers randomly remove contiguous 6 × 6 regions with a probability of 50% as the feature map of an image passes through them. We choose 6 × 6 as this is the optimal block size specified in [20]. All the modules of UFDQE are trained from scratch using ground truth face locations to ensure suitable image quality scores are estimated.

  • For evaluation of QEs with the mvR protocol, we use the RetinaFace-ResNet50 face detector. The mAP of RetinaFace-ResNet50 is evaluated at a fixed IoU threshold of 0.5 on the data set after dropping images from it at varying percentages (5%, 10%, 20%,…, 90%, and 95%). The RetinaFace-ResNet50 face detector is trained and evaluated on the same set of images that are used for the QEs.

We will use this default configuration for most of our experiments unless otherwise specified.
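As a rough illustration of how an mvR curve is traced under this default configuration, the sketch below ranks a test set by QE score, discards the lowest-scored images at each drop percentage, and re-evaluates the face detector on the retained subset. The `detector_map` argument is a hypothetical helper that runs RetinaFace-ResNet50 on a set of images and returns its mAP at IoU = 0.5.

```python
import numpy as np

DROP_PERCENTAGES = [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95]

def mvr_curve(image_ids, qe_scores, detector_map):
    """Trace the mAP vs. reject (mvR) curve for one quality estimator.

    image_ids    : list of test-image identifiers
    qe_scores    : quality score assigned by the QE to each image
    detector_map : hypothetical helper (subset of image_ids) -> mAP@IoU=0.5
                   of the face detector on that subset
    """
    order = np.argsort(qe_scores)            # ascending: lowest-quality images first
    ranked = [image_ids[i] for i in order]
    curve = []
    for drop in DROP_PERCENTAGES:
        n_drop = int(round(len(ranked) * drop / 100.0))
        retained = ranked[n_drop:]            # discard the lowest-scored images
        curve.append((drop, detector_map(retained)))
    return curve
```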

5.2 Are perceptual QEs effective in linking image quality and face detection performance?

The most crucial question that arises in the context of task-specific image quality assessment is “Is there a need to design task-specific image quality estimators when perceptual image quality estimators exist?” In this section, we demonstrate how existing perceptual QEs such as BRISQUE [2], PIQE [4], UNIQUE [33], and MUSIQ [34] are not suitable for estimating image quality for tasks like face detection; hence, the design of task-specific QEs is necessary.

Figures 14 and 15 depict the mvR curves of the perceptual QEs on the WiderFace and FDDB data sets, respectively. In these figures, we plot the mAP of the face detector on the y-axis vs. the percentage of dropped images on the x-axis. We use the default configuration of the mvR protocol for this experiment. From these figures, it is evident that the perceptual QEs are poor at ranking images based on their suitability for face detection: their mvR curves do not lie anywhere close to the DET score curve, nor do they show a clear upward trend. This indicates that they are ineffective in linking image quality and face detection performance. Consequently, we do not evaluate perceptual QEs in any of our subsequent experiments.

Fig. 14

mAP vs. reject protocol on the WiderFace data set using the perceptual QEs. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

Fig. 15

mAP vs. reject protocol on the FDDB data set using the perceptual QEs. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

The mvR curves of the perceptual QEs on the WiderFace data set shown in Fig. 14 show an upward trend at higher image drop percentages, indicating they are moderately effective at ranking higher-quality images relative to lower-quality images. The performance of all the perceptual QEs is better on the FDDB data set, as shown in Fig. 15. The mvR curves trend upwards at lower image drop percentages, indicating that they are more effective at finding images suitable for face detection in the FDDB data set than in the WiderFace data set. For both data sets, the more recent deep learning-based QEs outperform the traditional QEs. However, the mvR curves of the perceptual QEs remain far from the optimal DET score curve, which indicates their ineffectiveness in linking image quality to face detection performance.

5.3 Are face recognition QEs effective in linking image quality and face detection performance?

In the current literature on task-specific image quality assessment for face analytics, the primary focus is on the face recognition task. Face recognition QEs have been extensively used to link the quality of pre-processed and aligned face images to the performance of face recognizers. In this section, we demonstrate that popular face recognition QEs, such as SER-FIQ [6] and SDD-FIQA [7], are inadequate for estimating the quality of images for face detection. Therefore, the development of face detection-specific QEs is necessary.

Figures 16 and 17 display the mvR curves of face recognition quality estimators (QEs) on the WiderFace and FDDB data sets, respectively. In these figures, the mAP of the face detector is plotted on the y-axis, while the percentage of dropped images is on the x-axis. The experiment uses the default mvR protocol configuration. These figures clearly show that face recognition QEs, like perceptual QEs, are ineffective at ranking images for face detection. This ineffectiveness is evident as their mvR curves do not align closely with the DET score curve and lack an upward trend. This poor performance could potentially be attributed to the underlying deep learning models of these QEs, which are trained on pre-processed and aligned images specifically for face recognition. Consequently, we will not evaluate face recognition QEs in any future experiments.

Fig. 16

mAP vs. reject protocol on the WiderFace data set using SER-FIQ and SDD-FIQA. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

Fig. 17

mAP vs. reject protocol on the FDDB data set using SER-FIQ and SDD-FIQA. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

5.4 Are SFDQE and UFDQE effective in linking image quality and face detection performance?

Here, we illustrate the effectiveness of our novel face detection-based image quality estimators, SFDQE and UFDQE in linking image quality and face detection performance.

Figures 18 and 19 depict the mvR curves of SFDQE and UFDQE on the WiderFace and FDDB data sets, respectively. In these figures, we plot the mAP of the face detector on the y-axis vs. the percentage of dropped images on the x-axis. We use the default configuration of SFDQE, UFDQE, and the mvR protocol for this experiment. The mvR curves of SFDQE and UFDQE are notably closer to the optimal DET score curve and have an upward trend; that is, as we drop low-quality images based on the scores assigned by these QEs, the mAP of the face detector improves. This indicates that these QEs are effective in determining whether an image is suitable for face detection. Comparing these mvR curves to the mvR curves of the perceptual QEs in Figs. 14 and 15 demonstrates the value of these novel face detection QEs. However, although SFDQE and UFDQE show encouraging results, there is still potential for improvement when comparing their mvR performance to the DET score curve.

Fig. 18

mAP vs. reject protocol on the WiderFace data set using SFDQE and UFDQE. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

Fig. 19

mAP vs. reject protocol on the FDDB data set using SFDQE and UFDQE. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

In Fig. 18, we see that both SFDQE and UFDQE demonstrate promising results in image quality assessment for face detection on the WiderFace data set. SFDQE exhibits slightly better performance than UFDQE on the WiderFace data set. This observation is significant as even though UFDQE does not utilize ground-truth labels for the estimation of the quality score of an image, its performance is very close to that of the supervised SFDQE. Figure 19 illustrates the effectiveness of our novel QEs on the FDDB data set. As with our previous findings, SFDQE yields the best results but UFDQE is not far behind. In addition, the performance of our face detection QEs is much closer to the optimal DET score for the FDDB data set in comparison to the WiderFace data set.

In Figs. 18 and 19, we evaluated our face detection QEs using the mvR protocol, where the y-axis represents the mAP of the RetinaFace-ResNet50 face detector. The underlying architecture of our QEs is also based on RetinaFace. This raises the question of how well our QEs perform when a different face detector is evaluated in the mvR protocol. To answer this question, we evaluate the YoloV8-Face [30] face detector with a large backbone, called YoloV8L-Face, in the mvR protocol. The results of this experiment are shown in Figs. 20 and 21. From these figures, we observe that the starting point of the DET score curve for the YoloV8L-Face model is higher than that of the RetinaFace-ResNet50 model, indicating that it is a more powerful face detector. The mvR results are consistent with our previous findings: the SFDQE and UFDQE mvR curves follow an upward trend and lie close to the DET score curve. This indicates that our QEs perform well across different face detectors. However, in this case, the performance gap between SFDQE and UFDQE is slightly larger compared to when we evaluated the RetinaFace-ResNet50 face detector.

Fig. 20

mAP vs. reject protocol on the WiderFace data set using SFDQE and UFDQE. At each drop percentage, YoloV8L-face is evaluated by computing its mAP at a fixed IoU of 0.5

Fig. 21

mAP vs. reject protocol on the FDDB data set using SFDQE and UFDQE. At each drop percentage, YoloV8L-face is evaluated by computing its mAP at a fixed IoU of 0.5

5.5 How do certain design choices impact the performance of SFDQE and UFDQE?

Here, we delve deeper into the design choices that were made while creating SFDQE and UFDQE. We discuss how these design choices impact the performance of our QEs. This is an important discussion because it allows users to pick specific settings of our QEs depending on the application.

First, we investigate how the complexity of the Backbone network in SFDQE and UFDQE affects image quality estimation performance for face detection. Next, we discuss how different changes to the architectures of SFDQE and UFDQE can affect their ability to determine the suitability of an image for face detection. Finally, we discuss the effectiveness of our QEs when they are fine-tuned on a specific data set rather than being trained from scratch.

5.5.1 Backbone network considerations for SFDQE and UFDQE

Ideally, QEs for face detection should be lightweight, allowing for a quick assessment of whether an input is suitable for face detection, instead of needing to run a complex face detector on the input itself. For this reason, we pick the MobileNet backbone in the default configuration of SFDQE and UFDQE instead of using the complex ResNet50 backbone that is typically used for face detection. However, there might be scenarios in which we would want to use the ResNet50 backbone in our QEs to improve their quality estimation performance.

Here, we assess the performance of SFDQE and UFDQE using the ResNet50 backbone instead of the default MobileNet backbone. Our goal is to demonstrate the possible performance improvements that can be achieved when using the ResNet50 backbone, despite the increase in network size and complexity. We conduct this experiment solely on the WiderFace data set, as similar conclusions are drawn from the FDDB data set as well.

Figure 22 presents a comparison between the mvR curves of face detection QEs using ResNet50 and MobileNet backbones. In this figure, we plot the mAP of the face detector on the y-axis vs. the percentage of dropped images on the x-axis. All the QEs irrespective of the Backbone network are trained from scratch. We use the default configuration of the mvR protocol for this experiment. The mvR curves of QEs using the ResNet50 backbone (represented by dashed and dotted lines) are closer to the optimal DET score curve when compared to the default configuration of the QEs using the MobileNet backbone (represented by solid lines). This indicates that both SFDQE and UFDQE perform better with the more complex ResNet50 backbone. However, this improvement comes at the cost of increased model complexity and processing time, with the number of trainable parameters increasing from approximately 2 million (MobileNet) to around 27.2 million (ResNet50).

Fig. 22

mAP vs. reject protocol on the WiderFace data set using SFDQE and UFDQE with the MobileNet and ResNet50 backbone networks. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

The key takeaway from this experiment is that a balance between model efficiency and performance can be achieved during the design of our QEs. This gives the user flexibility to pick different Backbones during the design of SFDQE and UFDQE depending on the use case scenario.
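For reference, this backbone trade-off is easy to quantify on any instantiated QE network; the snippet below is a minimal sketch, assuming a PyTorch module, that counts trainable parameters (the `sfdqe_mobilenet` and `sfdqe_resnet50` objects named in the comments are hypothetical).

```python
def trainable_parameters(model):
    """Count the trainable parameters of a PyTorch module (e.g., a QE network)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hypothetical usage:
#   trainable_parameters(sfdqe_mobilenet)  # roughly 2 million
#   trainable_parameters(sfdqe_resnet50)   # roughly 27.2 million
```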

5.5.2 Additional design considerations for SFDQE

In our experiments so far, we have utilized the default configuration of SFDQE, which incorporates the MobileNet backbone, an FPN, and a Quality Score Regression Head composed of several fully connected layers. This design prompts questions such as, “What would be the impact on performance if the architecture of SFDQE included Context Modelling Layers?” or “What would occur if SFDQE was designed using only a Backbone network and a Quality Score Regression Head?” These questions are crucial for understanding the design choices and their implications on the performance of SFDQE.

We address the aforementioned questions by evaluating the performance of SFDQE architectures with different components, as depicted in Fig. 23. Figure 23b represents the default configuration of SFDQE. In Fig. 23a, we remove the FPN from SFDQE, leaving only the Backbone network and a Quality Score Regression Head. In Fig. 23c, we design SFDQE with an architecture similar to the default configuration, but with additional Context Modelling Layers following the FPN. This architecture bears a close resemblance to the overall architecture of the RetinaFace face detector. We conduct this experiment solely on the WiderFace data set, as similar conclusions are drawn from the FDDB data set.

Fig. 23

Different architectures for SFDQE; a only Backbone network, b backbone network and a FPN as seen earlier in Fig. 10 (default configuration), and c backbone network, an FPN, and context modelling layers

Figure 24 presents the mvR curves of SFDQE with the different architectures shown in Fig. 23. We use the MobileNet backbone for all the architectures of SFDQE, and they are all trained from scratch. In addition, we use the default configuration of the mvR protocol. From this figure, we see that the mvR curves of the SFDQE architectures in Fig. 23b, c lie above and closer to the optimal DET score curve than the mvR curve of the architecture in Fig. 23a. The Backbone network alone is not capable of capturing all the face-related semantic information necessary for face detection-specific image quality assessment. However, when we add an FPN designed to focus on face-specific information (Fig. 23b), the performance of SFDQE improves significantly. Furthermore, adding Context Modelling layers to SFDQE (Fig. 23c) yields no significant additional improvement in quality estimation performance.

Fig. 24

mAP vs. reject protocol on the WiderFace data set using the different architecture of SFDQE from Fig. 23. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

Thus, when designing SFDQE, it is crucial to include a component like the FPN that can capture face-related semantic information. As demonstrated in [35], the Backbone network is effective at capturing only low-level object-independent features of an image. However, when designing QEs specifically for face detection, we also need components that learn to concentrate on the faces within an image.
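As a schematic of how these components fit together, the following PyTorch-style sketch wires a backbone, an optional FPN, optional context modelling layers, and a fully connected Quality Score Regression Head into the three variants of Fig. 23. The module objects passed in (`backbone`, `fpn`, `context`) are hypothetical stand-ins for the corresponding RetinaFace components, and the head dimensions are illustrative only.

```python
import torch.nn as nn

class SFDQE(nn.Module):
    """Sketch of the SFDQE QE network; only the overall wiring is shown."""

    def __init__(self, backbone, fpn=None, context=None, feat_dim=256):
        super().__init__()
        self.backbone = backbone         # Fig. 23a: backbone only
        self.fpn = fpn                   # Fig. 23b: backbone + FPN (default configuration)
        self.context = context           # Fig. 23c: backbone + FPN + context modelling layers
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.regressor = nn.Sequential(  # Quality Score Regression Head (illustrative sizes)
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),  # predicted DET score in [0, 1]
        )

    def forward(self, x):
        feats = self.backbone(x)
        if self.fpn is not None:
            feats = self.fpn(feats)
        if self.context is not None:
            feats = self.context(feats)
        # Pool one feature map and regress a single image-level quality score.
        f = feats[0] if isinstance(feats, (list, tuple)) else feats
        return self.regressor(self.pool(f).flatten(1))
```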

5.5.3 Additional design considerations for UFDQE

Until this point, we have presented the performance of UFDQE using Disouts [20] as the feature distortion layers added at the backbone. However, it is of interest to observe how other popular feature distortion layers, such as Dropouts [18] and Dropblocks [19], could impact the performance of UFDQE. In addition, it is important to understand why the feature distortion layers are added after the backbone of a face detector in UFDQE, unlike SER-FIQ [6], which adds Dropouts at the end of a face recognition network. In this section, we discuss the importance of adding feature distortion layers at the backbone of a face detector in UFDQE, as well as identify whether there is a specific feature distortion layer that is optimal for the design of UFDQE.

Here, we compare the performance of UFDQE when it uses either Dropouts, Dropblocks, or Disouts. The feature distortion layers are added after the backbone network of the QE network in UFDQE. For Dropouts, we randomly drop entire feature maps with a probability of 50%. For Dropblocks and Disouts, we randomly remove contiguous 6 × 6 regions with a probability of 50% as the feature map of an image passes through. We also evaluate the performance of UFDQE when it is designed exactly like SER-FIQ, that is, when Dropouts are added to the final face localization head of the face detection network. This experiment is conducted on both the WiderFace and FDDB data sets, as important conclusions can be drawn from each case.
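The sketch below illustrates the difference between these distortion types in simplified form: channel-wise Dropout removes entire feature maps, whereas the block-based layers remove contiguous 6 × 6 regions. The `BlockDrop` module is only a rough stand-in for Dropblock/Disout (Disout additionally perturbs rather than simply zeroes the dropped regions); in UFDQE, any of these layers would be inserted after the backbone and kept active at inference time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockDrop(nn.Module):
    """Simplified stand-in for Dropblock/Disout: zero out contiguous
    block_size x block_size regions of a feature map. Deliberately left active
    in eval mode, since UFDQE needs the distortion to stay stochastic at inference."""

    def __init__(self, drop_prob=0.5, block_size=6):
        super().__init__()
        self.drop_prob = drop_prob
        self.block_size = block_size

    def forward(self, x):
        # Sample block centres, then expand each centre into a block_size x block_size region.
        gamma = self.drop_prob / (self.block_size ** 2)
        centres = (torch.rand_like(x) < gamma).float()
        blocks = F.max_pool2d(centres, self.block_size, stride=1,
                              padding=self.block_size // 2)
        blocks = blocks[..., : x.size(-2), : x.size(-1)]  # crop for even block sizes
        return x * (1.0 - (blocks > 0).float())

# Channel-wise alternative: Dropout2d removes entire feature maps with p = 0.5
# (the layer must be kept in train mode at inference for UFDQE-style scoring).
channel_dropout = nn.Dropout2d(p=0.5)
```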

Figures 25 and 26 depict the mvR curves on the WiderFace and FDDB data sets, respectively, when different feature distortion layers are employed in UFDQE. From these figures, it is evident that each type of feature distortion layer is appropriate for image quality estimation for face detection, as the mvR curves in all cases follow an upward trend similar to the DET score curve. However, depending on the scenario, it may be beneficial to use a specific type of feature distortion layer in UFDQE. These figures also show that the performance of UFDQE suffers if we follow the exact design principles of SER-FIQ, that is, by adding Dropouts to the end of the face detection network in UFDQE.

Fig. 25

mAP vs. reject protocol on the WiderFace Data set using the different design choices for UFDQE. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

Fig. 26

mAP vs. reject protocol on the FDDB Data set using the different feature distortion layers in UFDQE. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

Focusing on the mvR curve for the WiderFace data set in Fig. 25, we observe that the curve of UFDQE designed with Disouts is closest to the optimal DET score curve, followed by the one with Dropblocks. The mvR curve of UFDQE with dropouts shows the worst performance. This suggests that for the WiderFace data set, designing UFDQE with Disouts will yield the best image quality estimation performance. Interestingly, although UFDQE with Dropouts performs the worst initially, it catches up to the performance of UFDQE with Disouts in retaining the best images of a face detection data set (right side of the plot). Comparing the mvR curve of UFDQE designed like SER-FIQ (pink) to the other curves, we see that the performance takes a significant hit. This indicates that different tasks may require feature distortion layers to be used at different stages of an underlying deep learning network to effectively link image quality to task performance. Hence, following the exact design principles of SER-FIQ would not work in the context of face detection-specific image quality estimation.

When we examine the mvR curves on the FDDB data set in Fig. 26, we notice that all the curves now lie very close to each other, unlike the WiderFace case. In addition, in this case, the Dropouts curve shows the best performance, but it is only marginally better than the performance of Dropblocks and Disouts. In this case as well, the performance of UFDQE suffers if we directly follow the design principles of SER-FIQ. Therefore, all the feature distortion layers are viable in the design of UFDQE.

However, unlike SER-FIQ, which uses Dropouts at the end of a face recognition network, UFDQE requires feature distortion layers to be placed after the backbone of a face detection network to obtain suitable quality estimation performance. When it comes to the selection of feature distortion layers, a safe choice is to use Disouts, as they show the best performance in several scenarios. However, to achieve the best performance, it is necessary to evaluate all the feature distortion layers, as this decision depends on the data being evaluated.

5.5.4 Exploring different training strategies for SFDQE and UFDQE

In all our previous experiments, the underlying QE networks of SFDQE (Fig. 10) and UFDQE (Fig. 11) are trained from scratch. However, the QE networks of SFDQE and UFDQE should ideally be able to leverage certain components of pre-trained face detection models like Backbone networks and merely fine-tune the remaining components for quality estimation. This approach is more practical as training from scratch requires significant time and resources.

In the case of our QEs, fine-tuning essentially means that we do not optimize the Backbone network, and we use it as it is from pre-trained face detection models. We only train the remaining components of the QEs for quality estimation. In the case of fine-tuning SFDQE, we freeze the Backbone network and train only the FPN and Quality Score Regression Head.

For UFDQE, we only train the FPN, the Context Modelling Layers, and the Classification and Localization Heads of the underlying face detection network, while keeping the Backbone network frozen. The primary reason to fine-tune these layers in UFDQE is to accommodate the feature distortion layers added after the Backbone. If the underlying network of UFDQE includes feature distortion layers and all its components are pre-trained on a suitable face detection data set, we could potentially create QEs for face detection-based image quality estimation even without any fine-tuning. However, this has not been investigated here, as the RetinaFace architecture does not include feature distortion layers.
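In PyTorch terms, the fine-tuning strategy described above reduces to freezing the backbone and optimizing only the remaining modules; the sketch below assumes the QE network exposes its backbone as a `backbone` submodule and uses Adam purely for illustration.

```python
import torch

def prepare_for_finetuning(qe_network, lr=1e-3):
    """Freeze the pre-trained backbone and optimize only the remaining QE components."""
    for p in qe_network.backbone.parameters():
        p.requires_grad = False                   # reuse weights from a pre-trained face detector
    trainable = [p for p in qe_network.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)     # e.g. FPN, context layers, heads, regression layers
```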

Figure 27 illustrates the mvR curves of SFDQE and UFDQE, comparing the results when they are either fine-tuned or trained from scratch on the WiderFace data set. Regardless of whether we are training from scratch or fine-tuning, we utilize the MobileNet backbone for our QEs. We conduct this experiment solely on the WiderFace data set, as similar conclusions can be drawn from the FDDB data set.

Fig. 27

mAP vs. reject protocol on the WiderFace data set using SFDQE and UFDQE when the QEs are either fine-tuned or trained from scratch. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

From Fig. 27, it is clear that the mvR curves of the fine-tuned Quality Estimators (QEs) lie below those of the QEs trained from scratch. This implies that while fine-tuning saves time and resources in setting up the QEs, it also results in a decrease in quality estimation performance. This is because the Backbone network from pre-trained face detectors is not optimized for quality estimation. When we train the QEs from scratch, all the components of the QEs are specifically optimized for quality estimation, resulting in improved performance. The figure also indicates that SFDQE undergoes the most significant performance drop when not trained from scratch. Hence, for SFDQE, it is beneficial to train the entire network to achieve better quality estimation performance. On the contrary, the performance of UFDQE remains relatively stable, even when the network is only fine-tuned. This suggests that simply fine-tuning the layers of a face detection network in UFDQE is sufficient to develop a robust QE for face detection-based image quality estimation.

Both our QEs demonstrate improved quality estimation performance when trained from scratch. This is particularly evident with SFDQE, which experiences a significant performance drop when fine-tuned. Conversely, UFDQE exhibits a minimal performance drop when fine-tuned. This highlights the potential advantages of using an unsupervised approach like UFDQE over a supervised one like SFDQE.

5.6 Can SFDQE and UFDQE perform well in a cross-data set evaluation scenario?

In our experiments so far, we have always trained SFDQE and UFDQE on a specific data set and then evaluated the QEs on the same data set. Our QEs have shown suitable performance in assessing the quality of an image for face detection in this scenario. However, it is important to validate if our QEs can perform well on data that they are not trained on. Here, we evaluate how our QEs perform when they are trained on one specific face data set and then tested on a completely different data set.

Figure 28 shows the mvR performance of SFDQE and UFDQE on the FDDB data set when trained on the WiderFace data set. For comparison, we also include the mvR curves for the QEs trained on the FDDB data set (Fig. 19). Our face detection QEs and the mvR protocol use the default configuration.

Fig. 28

mAP vs. reject protocol on the FDDB data set using SFDQE and UFDQE to cross-validate the performance of the QEs. At each drop percentage, RetinaFace-ResNet50 is evaluated by computing its mAP at a fixed IoU of 0.5

From Fig. 28, it is evident that the mvR curves of SFDQE and UFDQE trained on the FDDB data set are higher and closer to the optimal DET score curve than the mvR curves of the QEs trained on the WiderFace data set. We see a bigger gap in the mvR curves on the left side of the plot, indicating that the QEs trained on the WiderFace data set are less effective at eliminating unsuitable images for face detection from the FDDB data set than the QEs trained on the same data set. However, the QEs perform reasonably well in retaining the best images for face detection on the FDDB data set, as seen by the smaller gap between the mvR curves on the right side of the plot.

Therefore, our QEs effectively determine the suitability of images for face detection, even on unseen data. However, the best quality estimation performance is achieved when the QEs are trained and tested on images from the same data set. This is a result that is typically expected in deep learning.

5.7 Can the design principles of SFDQE and UFDQE be used to create QEs for other object detection tasks?

Our novel QEs, SFDQE and UFDQE, are very effective in determining the suitability of images for face detection. These QEs were specifically designed with the face detection task in mind. However, it is intriguing to see whether the same design principles can be used to design QEs for other object detection tasks like pedestrian detection, which would make them a powerful tool for estimating image quality in object detection scenarios more generally.

In this section, we design QEs for pedestrian detection, named the Supervised Pedestrian Detection Quality Estimator (SPDQE) and the Unsupervised Pedestrian Detection Quality Estimator (UPDQE), using the same principles as our face detection QEs, SFDQE and UFDQE. These QEs determine whether a particular image is suitable for pedestrian detection instead of face detection. Here, we use a popular deep learning network, YOLOv8 [30], as our baseline pedestrian detector; that is, YOLOv8 is the underlying network for SPDQE and UPDQE, analogous to RetinaFace in the case of SFDQE and UFDQE.

SPDQE and UPDQE are designed using YoloV8-Small, a variant of YoloV8 with a lightweight backbone. For evaluation with the mvR protocol, we apply the YoloV8-Large pedestrian detector, which uses a much larger backbone network than YoloV8-Small. All the networks used in this experiment are trained from scratch on a subset of the KITTI [43] data set that contains 500 images. These images depict pedestrians in various settings, such as cities, residential areas, highways, and university campuses. It is important to note that this is a preliminary experiment, as we used a relatively small data set for our training and evaluation.
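Because the consistency-based scoring is agnostic to the detector, carrying the unsupervised design over to pedestrian detection mainly amounts to swapping the underlying network. The sketch below reuses the scoring helper from Sect. 4.2.3 around a hypothetical `stochastic_yolo` callable, i.e., a YoloV8-Small pedestrian detector with feature distortion layers inserted after its backbone that returns a set of pedestrian boxes per forward pass.

```python
def updqe_score(image, stochastic_yolo, ap_fn, passes=10):
    """UPDQE sketch: score one image by the consistency of stochastic pedestrian detections.

    stochastic_yolo : hypothetical callable wrapping YoloV8-Small with active
                      feature distortion layers; returns boxes for one forward pass
    ap_fn           : same pairwise AP helper used by ufdqe_score in Sect. 4.2.3
    """
    predictions = [stochastic_yolo(image) for _ in range(passes)]
    return ufdqe_score(predictions, ap_fn)  # identical pairwise-consistency rule as Eq. 2
```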

Figure 29 shows the mvR curves for SPDQE and UPDQE on the KITTI data set. In this figure, the mAP of the pedestrian detector is plotted on the y-axis against the percentage of discarded images on the x-axis. The results closely mirror those from face detection. The optimal mvR curve of the DET score trends upward, with the most notable improvement seen at lower drop percentages when images unsuitable for pedestrian detection are removed. The mAP of the pedestrian detector peaks when approximately 70% of the images are discarded, indicating that a large proportion of images in the KITTI data set present difficulties for pedestrian detectors. The upward trend of the mvR curves for SPDQE and UPDQE aligns closely with the optimal DET score curve, indicating that as lower quality images are removed based on the quality scores assigned by the QEs, the mAP of the pedestrian detector improves.

Fig. 29

mAP vs. reject protocol on the KITTI data set using SPDQE and UPDQE. At each drop percentage, YoloV8-L is evaluated by computing its mAP at a fixed IoU of 0.5

Therefore, by applying the design principles of SFDQE and UFDQE, we have shown that we can create QEs for tasks like pedestrian detection. This demonstrates the potential of these design principles for designing QEs for various object detection tasks.

6 Conclusions and future work

In this paper, we introduced a novel image quality metric specific to face detection, termed as the Detectability (DET) score which effectively determines the suitability of an image for face detection. Using the DET score as a reference metric, we developed two novel Quality Estimators (QEs) for face detection: Supervised Face Detection Quality Estimator (SFDQE) and Unsupervised Face Detection Quality Estimator (UFDQE) that link image quality and face detection performance. In addition, we proposed the mAP vs. reject (mvR) protocol to evaluate the performance of QEs in the context of face detection.

In our experiments, we showed the shortcomings of conventional perceptual QEs in assessing image quality for face detection. We demonstrated the efficacy of SFDQE and UFDQE in evaluating the suitability of an image for face detection. We also discussed the design choices behind SFDQE and UFDQE and how these choices impact the performance of the QEs. Furthermore, we provided preliminary evidence that the design principles of SFDQE and UFDQE can be applied to other object detection tasks like pedestrian detection.

While SFDQE and UFDQE show encouraging results, there is still potential for improvement in the performance of QEs for face detection as can be seen from the mvR curves of the DET score. Given the effectiveness of the design principles of SFDQE and UFDQE, it would be interesting to apply them to tasks other than object detection, such as object segmentation. In addition, it can be interesting to explore whether images identified as low quality by our QEs can be made suitable for face detection through enhancement methods, such as image super-resolution or denoising.

Finally, it is important to assess the effects of SFDQE and UFDQE on the performance of face recognition within an end-to-end face analytics system. Previously, it has been impossible to conduct such evaluations because popular face data sets lacked annotations suitable for end-to-end evaluation. In our recent work [44], we showcase the benefits of task-specific QEs in the context of an end-to-end system by utilizing the carefully curated EFAD data set [5] and interpretable evaluation protocols, such as EvR and mvR. Our experiments revealed that task-specific QEs significantly improve the end-stage face recognition performance of the system by filtering out unsuitable images. For instance, by discarding the lowest 5% of images before each analytics task using task-specific QEs, the TPR at FPR = 10⁻⁴ of end-stage face recognition jumps from 0.73 to 0.85.

Future research should assess the resource utilization of end-to-end systems utilizing task-specific QEs to thoroughly evaluate their effect on system performance. This will provide a more comprehensive view of the operational efficiencies and potential improvements offered by task-specific QEs. Furthermore, it would be valuable to investigate the reasons behind the low quality scores assigned to certain images by our QEs and explore whether these images can be improved through enhancement techniques, such as image super-resolution or denoising.

Availability of data and materials

The data sets used for training and evaluation of the methods in the current study are all publicly available in the following repositories: WiderFace (http://shuoyang1213.me/WIDERFACE/); FDDB (https://vis-www.cs.umass.edu/fddb/); KITTI (https://www.cvlibs.net/datasets/kitti/).

Abbreviations

QE:

Quality estimator

DET:

Detectability

SFDQE:

Supervised face detection quality estimator

UFDQE:

Unsupervised face detection quality estimator

mvR:

Mean-average-precision vs. reject

FPN:

Feature Pyramid Network

IoU:

Intersection-over-union

References

  1. S.S. Hemami, A.R. Reibman, No-reference image and video quality estimation: applications and human-motivated design. Signal Process.: Image Commun. 25(7), 469–481 (2010)

  2. A. Mittal, A.K. Moorthy, A.C. Bovik, No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 21(12), 4695–4708 (2012)

  3. A. Mittal, R. Soundararajan, A. Bovik, Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 20, 209–212 (2013). https://doi.org/10.1109/LSP.2012.2227726

  4. N. Venkatanath, D. Praneeth, M.C. Bh, S.S. Channappayya, S.S. Medasani, Blind image quality evaluation using perception based features. Twenty First Natl. Conf. Commun. (2015). https://doi.org/10.1109/NCC.2015.7084843

  5. P. Singh, E.J. Delp, A.R. Reibman, End-to-end evaluation of practical video analytics systems for face detection and recognition. Electron. Imaging 35(16), 111–11111 (2023). https://doi.org/10.2352/EI.2023.35.16.AVM-111

  6. P. Terhorst, J. Kolf, N. Damer, F. Kirchbuchner, A. Kuijper, SER-FIQ: unsupervised estimation of face image quality based on stochastic embedding robustness, in IEEE conference on computer vision and pattern recognition (2020)

  7. F.-Z. Ou, X. Chen, R. Zhang, Y. Huang, S. Li, J. Li, Y. Li, L. Cao, Y.-G. Wang, SDD-FIQA: unsupervised face image quality assessment with similarity distribution distance, in IEEE conference on computer vision and pattern recognition (2021)

  8. J. Hernandez-Ortega, J. Galbally, J. Fierrez, R. Haraksim, L. Beslay, FaceQNet: quality assessment for face recognition based on deep learning. Proc. Int. Conf. Biom. (2019). https://doi.org/10.1109/ICB45273.2019.8987255

  9. Z. Babnik, P. Peer, V. Struc, DifFIQA: face image quality assessment using denoising diffusion probabilistic models, in 2023 IEEE international joint conference on biometrics (IJCB) (2023), pp. 1–10. https://doi.org/10.1109/IJCB57857.2023.10449044

  10. Z. Babnik, P. Peer, V. Struc, eDifFIQA: towards efficient face image quality assessment based on denoising diffusion probabilistic models. IEEE Trans. Biom. Behav. Identity Sci. (2024). https://doi.org/10.1109/TBIOM.2024.3376236

  11. T. Liu, S. Li, M. Xu, L. Yang, X. Wang, Assessing face image quality: a large-scale database and a transformer method. IEEE Trans. Pattern Anal. Mach. Intell. 46(5), 3981–4000 (2024). https://doi.org/10.1109/TPAMI.2024.3350049

  12. F.-Z. Ou, C. Li, S. Wang, S. Kwong, CLIB-FIQA: face image quality assessment with confidence calibration, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (2024), pp. 1694–1704

  13. P. Singh, H. Chen, E.J. Delp, A.R. Reibman, Evaluating image quality estimators for face matching, in 2022 IEEE 5th international conference on multimedia information processing and retrieval (MIPR) (2022), pp. 204–209. https://doi.org/10.1109/MIPR54900.2022.00043

  14. T. Schlett, C. Rathgeb, O. Henniger, J. Galbally, J. Fierrez, C. Busch, Face image quality assessment: a literature survey. ACM Comput. Surv. 54(10s), 1–49 (2022). https://doi.org/10.1145/3507901

  15. H. Chen, P. Singh, E.J. Delp, A.R. Reibman, Gallery-query protocol for evaluating face image quality metrics, in 2023 IEEE 25th international workshop on multimedia signal processing (MMSP) (2023), pp. 1–6. https://doi.org/10.1109/MMSP59012.2023.10337666

  16. P. Grother, E. Tabassi, Performance of biometric quality measures. IEEE Trans. Pattern Anal. Mach. Intell. 29(4), 531–543 (2007)

  17. J. Deng, J. Guo, E. Ververas, I. Kotsia, S. Zafeiriou, RetinaFace: single-shot multi-level face localisation in the wild, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2020), pp. 5203–5212

  18. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)

  19. G. Ghiasi, T.-Y. Lin, Q.V. Le, Dropblock: a regularization method for convolutional networks, in Advances in neural information processing systems, vol. 31 (2018)

  20. Y. Tang, Y. Wang, Y. Xu, B. Shi, C. Xu, C. Xu, C. Xu, Beyond dropout: feature map distortion to regularize deep neural networks, in Proceedings of the AAAI conference on artificial intelligence, vol. 34 (2020), pp. 5964–5971

  21. S. Yang, P. Luo, C.-C. Loy, X. Tang, WIDER FACE: a face detection benchmark, in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (2016)

  22. V. Jain, E. Learned-Miller, FDDB: a benchmark for face detection in unconstrained settings. Technical report UM-CS-2010-009 (University of Massachusetts, Amherst, 2010)

  23. J. Xiang, G. Zhu, Joint face detection and facial expression recognition with MTCNN, in 2017 4th international conference on information science and control engineering (ICISCE) (2017), pp. 424–427. https://doi.org/10.1109/ICISCE.2017.95

  24. J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (2016)

  25. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, SSD: single shot multibox detector, in Computer vision–ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11–14, 2016, proceedings, Part I 14 (Springer, 2016), pp. 21–37

  26. P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR 2001, vol. 1 (2001). https://doi.org/10.1109/CVPR.2001.990517

  27. N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 1 (2005), pp. 886–8931. https://doi.org/10.1109/CVPR.2005.177

  28. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in IEEE conference on computer vision and pattern recognition (2016), pp. 770–778

  29. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, MobileNetV2: inverted residuals and linear bottlenecks, in Proceedings of the IEEE conference on computer vision and pattern recognition (2018), pp. 4510–4520

  30. G. Jocher, A. Chaurasia, J. Qiu, Ultralytics YOLOv8. https://github.com/ultralytics/ultralytics

  31. F. Babiloni, I. Marras, F. Kokkinos, J. Deng, G. Chrysos, S. Zafeiriou, PolyNL: linear complexity non-local layers with 3rd order polynomials, in 2021 IEEE/CVF international conference on computer vision (ICCV) (2021), pp. 10498–10508. https://doi.org/10.1109/ICCV48922.2021.01035

  32. J. Li, B. Zhang, Y. Wang, Y. Tai, Z. Zhang, C. Wang, J. Li, X. Huang, Y. Xia, ASFD: automatic and scalable face detector, in Proceedings of the 29th ACM international conference on multimedia. MM ’21 (Association for Computing Machinery, New York, 2021), pp. 2139–2147. https://doi.org/10.1145/3474085.3475372

  33. W. Zhang, K. Ma, G. Zhai, X. Yang, Uncertainty-aware blind image quality assessment in the laboratory and wild. IEEE Trans. Image Process. 30, 3474–3486 (2021). https://doi.org/10.1109/TIP.2021.3061932

  34. J. Ke, Q. Wang, Y. Wang, P. Milanfar, F. Yang, MUSIQ: multi-scale image quality transformer, in Proceedings of the IEEE/CVF international conference on computer vision (2021), pp. 5148–5157

  35. S. Paul, U. Drolia, Y.C. Hu, S. Chakradhar, AQuA: a new image quality metric for optimizing video analytics systems. ACM Trans. Embed. Comput. Syst. 22(4), 1–29 (2023). https://doi.org/10.1145/3568423

  36. J. Deng, J. Guo, N. Xue, S. Zafeiriou, ArcFace: additive angular margin loss for deep face recognition, in 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (2019), pp. 4685–4694. https://doi.org/10.1109/CVPR.2019.00482

  37. H. Kim, S.H. Lee, M.R. Yong, Face image assessment learned with objective and relative face image qualities for improved face recognition, in IEEE international conference on image processing (2015), pp. 4027–4031

  38. A. Dutta, R. Veldhuis, L. Spreeuwers, A Bayesian model for predicting face recognition performance using image quality, in IEEE international joint conference on biometrics (2014)

  39. L. Best-Rowden, A.K. Jain, Learning face image quality from human assessments. IEEE Trans. Inf. Forensics Secur. 13(12), 3064–3077 (2018). https://doi.org/10.1109/TIFS.2018.2799585

  40. E. Tabassi, M. Olsen, O. Bausinger, C. Busch, A. Figlarz, G. Fiumara, O. Henniger, J. Merkle, T. Ruhland, C. Schiel, M. Schwaiger, NIST fingerprint image quality 2. NIST interagency/internal report (NISTIR) (National Institute of Standards and Technology, Gaithersburg, 2021). https://doi.org/10.6028/NIST.IR.8382

  41. G. Salton, M.J. McGill, Introduction to Modern Information Retrieval (McGrawHill Inc, New York, 1986)

  42. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C.L. Zitnick, Microsoft COCO: common objects in context, in Computer vision–ECCV 2014: 13th European conference, Zurich, Switzerland, September 6–12, 2014, proceedings, Part V 13 (Springer, 2014), pp. 740–755

  43. A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32(11), 1231–1237 (2013)

  44. P. Singh, A.R. Reibman, Image quality assessment in end-to-end face analytics systems, in IEEE 26th international workshop on multimedia signal processing (MMSP) (2024)

Acknowledgements

No additional acknowledgements.

Funding

Not applicable.

Author information

Contributions

All authors participated in the design of the quality estimators, evaluation protocols, experiments, and writing of the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Praneet Singh or Amy R. Reibman.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Singh, P., Reibman, A.R. Task-aware image quality estimators for face detection. J Image Video Proc. 2024, 44 (2024). https://doi.org/10.1186/s13640-024-00660-1

Keywords