1 Introduction
Face super-resolution (FSR), a domain-specific image super-resolution problem, refers to the technique of recovering high-resolution (HR) face images from low-resolution (LR) face images. It increases the resolution of a low-quality LR face image and recovers the lost details. In many real-world scenarios, limited by the physical imaging system and imaging conditions, face images are often of low quality. Thus, with a wide range of applications and notable advantages, FSR has been a hot topic in image processing and computer vision since its birth.
The concept of FSR was first proposed in 2000 by Baker and Kanade [8], the pioneers of the FSR technique, who develop a multi-level learning and prediction model based on the Gaussian image pyramid to improve the resolution of an LR face image. Liu et al. [9] propose to integrate a global parametric principal component analysis (PCA) model with a local nonparametric Markov random field model for FSR. Since then, a number of innovative methods have been proposed, and FSR has become the subject of active research efforts. Researchers super-resolve LR face images by means of global face statistical models [10, 11, 12, 13, 14, 15, 16], local patch-based representation methods [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], or hybrid ones [28, 29]. These methods achieve good performance; however, they struggle to meet the requirements of practical applications. With the rapid development of deep learning techniques, deep learning-based methods have obtained attractive advantages over previous attempts and have been applied to image and video super-resolution. Many comprehensive surveys have reviewed recent achievements in these fields, i.e., general image super-resolution surveys [30, 31, 32] and a video super-resolution survey [33]. For FSR, a domain-specific image super-resolution problem, a few surveys are listed in Table 1. In the early stage of research, References [1, 2, 3, 4, 5, 6] provide comprehensive reviews of traditional FSR methods (mainly including patch-based super-resolution, PCA-based methods, etc.), while Liu et al. [7] offer a generative adversarial network (GAN)-based FSR survey. However, so far no comprehensive literature review is available on deep learning-based super-resolution specifically for human faces. In this article, we present a comparative study of different deep learning-based FSR methods.
The main contributions of this survey are as follows:
•
The survey provides a comprehensive review of recent techniques for FSR, including the problem definition, commonly used evaluation metrics and loss functions, the characteristics of FSR, benchmark datasets, deep learning-based FSR methods, a performance comparison of state-of-the-art methods, methods that jointly perform FSR and other tasks, and FSR-related applications.
•
The survey summarizes how existing deep learning-based FSR methods explore the potential of network architectures and take advantage of the characteristics of face images, and compares the similarities and differences among these methods.
•
The survey discusses the challenges and envisions the prospects of future research in the FSR field.
In the following, we cover the existing deep learning-based FSR methods; Figure 1 shows the taxonomy of FSR. Section 2 introduces the problem definition of FSR and commonly used assessment metrics and loss functions. Section 3 presents the facial characteristics (i.e., prior information, attribute information, and identity information) and reviews some mainstream face datasets. Section 4 discusses FSR methods. To avoid exhaustive enumeration and to take facial characteristics into consideration, FSR methods are categorized according to the facial characteristics used, yielding five major categories: general FSR methods, prior-guided FSR methods, attribute-constrained FSR methods, identity-preserving FSR methods, and reference FSR methods. Depending on the network architecture or the utilization of facial characteristics, every category is further divided into several subcategories. Section 4 also compares the performance of some state-of-the-art methods and reviews methods dealing with joint tasks as well as FSR-related applications. Section 5 concludes the survey and further discusses the limitations as well as the prospects of further technological advancement.
4 FSR Methods
At present, various deep learning-based FSR methods have been proposed. On the one hand, some methods tap the potential of efficient network design for FSR regardless of facial characteristics, i.e., developing a basic convolutional neural network (CNN) or GAN for face reconstruction. On the other hand, some approaches focus on the utilization of facial characteristics, e.g., using structure prior information to facilitate face restoration. Furthermore, some recently proposed models introduce additional high-quality reference face images to assist the restoration. Here, according to the type of face-specific information used, we divide FSR methods into five categories: general FSR, prior-guided FSR, attribute-constrained FSR, identity-preserving FSR, and reference FSR. In this section, we introduce each category in detail.
4.1 General FSR
General FSR methods focus on designing an efficient network and exploit the potential of the network structure for FSR without using any facial characteristics. In the early days, most of these methods are based on CNNs and incorporate various advanced architectures (including back projection, residual networks, spatial or channel attention, and so on) to improve the representation ability of the network. Since then, many FSR methods using more advanced networks have been proposed. We divide general FSR methods into four categories: basic CNN-based methods, GAN-based methods, reinforcement learning-based methods, and ensemble learning-based methods. Aiming to present a clear and concise overview, we summarize the general FSR methods in Figure 3.
4.1.1 Basic CNN-based Methods.
Inspired by the pioneering deep learning-based general image super-resolution method [70], some researchers also propose to apply CNNs to the FSR task. Depending on whether they consider global information and local differences, we can further divide the basic CNN-based methods into three categories: global methods that feed the entire face into the network and recover face images globally, local methods that divide face images into different components and then recover them, and mixed methods that recover face images both locally and globally.
Global Methods: In the early years, researchers treat a face image as a whole and recover it globally. Inspired by the strong representation ability of CNNs, bi-channel convolutional neural networks [71, 72] directly learn a mapping from LR face images to HR ones. Then, benefiting from the performance gain of iterative back projection (IBP) in general image super-resolution, Huang et al. [73] introduce IBP to FSR as an extra post-processing step, developing the super-resolution using deep convolutional networks (SRCNN)-IBP method. After that, the idea of back projection is widely used in FSR [74, 75]. Later, channel and spatial attention mechanisms greatly improve general image super-resolution methods, which inspires researchers to explore their utilization in FSR. Thus, a number of innovative methods integrating the attention mechanism are proposed [76, 77, 78]. Among these works, two representative methods are E-ComSupResNet [77], which introduces a channel attention mechanism, and SPARNet [78], which has a well-designed spatial attention mechanism for FSR. Besides that, many researchers design cascaded models and exploit multi-scale information to improve restoration performance [79, 80, 81].
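The back-projection idea mentioned above can be sketched in a few lines. The toy below works on 1-D signals, assumes 2x average-downsampling and nearest-neighbor upsampling, and repeatedly corrects the SR estimate so that its downsampled version matches the LR input; all operators and function names are illustrative assumptions, not taken from any cited method.

```python
# Toy sketch of iterative back projection (IBP) on 1-D signals.
# Assumed operators: downsampling = 2x averaging, upsampling = duplication.

def downsample(x):
    """Average each pair of samples (2x downsampling)."""
    return [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]

def upsample(x):
    """Duplicate each sample (2x nearest-neighbor upsampling)."""
    out = []
    for v in x:
        out.extend([v, v])
    return out

def back_project(sr, lr, n_iters=20):
    """Refine an SR estimate so its downsampled version matches the LR input."""
    for _ in range(n_iters):
        residual = [l - d for l, d in zip(lr, downsample(sr))]
        sr = [s + r for s, r in zip(sr, upsample(residual))]
    return sr

lr = [1.0, 3.0]             # observed LR signal
sr0 = [0.0, 0.0, 0.0, 0.0]  # initial SR guess
sr = back_project(sr0, lr)  # downsample(sr) now matches lr
```

The same consistency constraint, applied with learned upsampling operators on 2-D images, is what the FSR back-projection variants above build into their architectures.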
It is observed that super-resolution in the image domain tends to produce smooth results without high-frequency detail. Considering that the wavelet transform can represent the textural and contextual information of images, WaSRNet [82, 83] transforms face images into wavelet coefficients and super-resolves the face images in the wavelet coefficient domain to avoid over-smooth results.
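To make the wavelet coefficient domain concrete, here is a minimal one-level Haar transform on a 1-D signal (a real method would apply a 2-D transform to images): the signal splits into a low-frequency approximation and a high-frequency detail band, and reconstructs exactly. This is a generic illustration of the domain, not WaSRNet's architecture.

```python
# Minimal one-level Haar wavelet transform, illustrating the coefficient
# domain in which wavelet-based FSR methods predict high-frequency detail.

def haar_forward(x):
    """Split a signal into approximation (low-freq) and detail (high-freq)."""
    approx = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Perfectly reconstruct the signal from its Haar coefficients."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out

row = [4.0, 6.0, 10.0, 12.0]
a, d = haar_forward(row)          # a = low-freq structure, d = edges/texture
assert haar_inverse(a, d) == row  # lossless round trip
```

Predicting the detail bands `d` explicitly, rather than pixels, is what lets such methods avoid over-smooth outputs.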
Local Methods: Global methods can capture global information but cannot recover face details well. Thus, local methods are developed to recover different parts of a face image differently.
The super-resolution technique based on definition-scalable inference (SRDSI) [84] decomposes the face into a low-frequency basic face and a high-frequency compensation face through PCA. Then, SRDSI recovers the basic face with a very deep convolutional network (VDSR) [85] and the compensation face with sparse representation, and finally fuses the two recovered faces. After that, many patch-based methods have been proposed [86, 87, 88], all of which divide face images into several patches and train models to recover the corresponding patches.
Mixed Methods: Considering that global methods capture global structure but ignore local details while local methods focus on local details but lose global structure, a line of research naturally combines the two to capture global structure and recover local details simultaneously. At first, global-local networks [89, 90] develop a global upsampling network to model global constraints and a local enhancement network to learn face-specific details. To simultaneously capture global clues and recover local details, the dual-path deep fusion network [91] constructs two individual branches for learning global facial contours and local facial component details, and then fuses the results of the two branches to generate the final SR result.
4.1.2 GAN-based Methods.
CNN-based methods that utilize pixelwise loss tend to generate smooth face images. In contrast, GAN, first proposed by Goodfellow et al. [48], can be applied to generate realistic-looking face images with more details, which inspires researchers to design GAN-based methods. At first, researchers focus on designing various GANs to learn from paired or unpaired data. In recent years, how to utilize a pretrained generative model to boost FSR has attracted increasing attention. Therefore, GAN-based methods can be divided into general GAN-based methods and generative prior-based methods.
General GAN-based Methods: In the early stage, Yu et al. [34] develop ultra-resolving face images by discriminative generative networks (URDGN), which consists of two subnetworks: a discriminative model that distinguishes real HR face images from artificially super-resolved outputs, and a generative model that generates SR face images to fool the discriminative model and match the distribution of HR face images. MLGE [92] not only designs discriminators to distinguish face images but also applies edge maps of the face images to reconstruct HR face images. Recently, HiFaceGAN [93] and the works of [94, 95, 96, 97] also super-resolve face images with generative models. Instead of directly feeding the whole face images into the discriminator, PCA-SRGAN [98] decomposes face images into components by PCA and progressively feeds increasing components of the face images into the discriminator to reduce its learning difficulty. The commonality of these types of GAN is that the discriminator outputs a single probability value to characterize whether the result is a real face image. However, Zhang et al. [99] argue that a single probability value is too fragile to represent a whole image; thus, they design a supervised pixelwise GAN (SPGAN) whose discriminator outputs a discriminative matrix with the same resolution as the input images, together with a supervised pixelwise adversarial loss, thus recovering more photo-realistic face images.
The above methods rely on artificial LR and HR pairs generated by a known degradation. However, the quality of real-world LR images is affected by a wide range of factors, such as the imaging conditions and the imaging system, leading to complicated unknown degradations of real LR images. The gap between real LR images and artificial ones is large and inevitably decreases performance when methods trained on artificial pairs are applied to real LR images [100]. To address this problem, real-world super-resolution [101] first estimates the degradation parameters from real LR faces, such as the blur kernel, noise, and compression, and then generates LR and HR face image pairs with the estimated parameters for training the model.
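The pair-synthesis step can be sketched as a simple degradation pipeline. In the toy below, a moving-average blur width and a Gaussian noise level stand in for parameters estimated from real LR faces; the pipeline, all names, and all values are illustrative assumptions, not the estimation procedure of the cited work.

```python
import random

# Toy synthesis of an LR counterpart from an HR signal using "estimated"
# degradation parameters: blur -> downsample -> noise. 1-D lists stand in
# for face images; blur_width and noise_std are hypothetical estimates.

def degrade(hr, blur_width=3, noise_std=0.05, seed=0):
    rng = random.Random(seed)
    # 1. Blur: moving average with the estimated kernel width.
    half = blur_width // 2
    blurred = [
        sum(hr[max(0, i - half):i + half + 1])
        / len(hr[max(0, i - half):i + half + 1])
        for i in range(len(hr))
    ]
    # 2. Downsample by a factor of 2.
    low = blurred[::2]
    # 3. Add sensor-like Gaussian noise with the estimated level.
    return [v + rng.gauss(0.0, noise_std) for v in low]

hr = [float(i % 8) for i in range(16)]  # stand-in for an HR face row
lr = degrade(hr)                        # synthetic LR training counterpart
```

Training on pairs produced this way narrows the gap between artificial and real LR inputs, which is the motivation of the real-world methods above.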
LRGAN [102] proposes to learn the degradation before super-resolution from unpaired data. It designs a high-to-low GAN to learn the real degradation process from unpaired LR and HR face images and to create paired LR and HR face images for training a low-to-high GAN. Specifically, with HR face images as input, the high-to-low GAN generates LR face images (GLRs) that should belong to the real LR distribution and be close to the corresponding downsampled HR face images. Then, for the low-to-high GAN, GLRs are fed into the generator to recover SR results that have to be close to the HR face images and match the real HR distribution. Goswami et al. [103] further develop a robust FSR method, and Zheng et al. [104] utilize semi-dual optimal transport to guide model learning and develop a semi-dual optimal transport CycleGAN. Considering that discrepancies between GLRs in the training phase and real LR face images in the testing phase still exist, researchers introduce the concept of characteristic regularization (CR) [105]. Different from LRGAN, CR transforms real LR face images into artificial LR ones and then conducts super-resolution reconstruction in the artificial LR space. Based on CycleGAN, CR learns the mapping between real LR face images and artificial ones. Then, it uses the artificial LR face images generated from real LR ones to fine-tune the super-resolution model, which is pretrained on artificial pairs.
Generative prior-based Methods: Recently, many face generation models, such as the popular StyleGAN [58], StyleGAN v2 [106], ProGAN [107], StarGAN [108], and so on, have been proposed, and they are capable of generating faithful faces with a high degree of variability. Thus, more and more researchers explore the generative prior of pretrained GANs.
The first generative prior-based FSR method is PULSE [109]. It formulates FSR as a generation problem: generate a high-quality SR face image such that the downsampled SR result is close to the LR face image. Mathematically, the problem can be expressed as
\[
\hat{z} = \arg\min_{z} \left\| \left( G(z) \right)\!\downarrow_{s} - I_{LR} \right\|,
\]
where z is a randomly sampled latent vector and the input of the pretrained StyleGAN [58], \(\downarrow_{s}\) is the downsampling operation, s is the downsampling factor, G denotes the function of the generator, and \(I_{LR}\) is the input LR face image. PULSE solves FSR from a new perspective, and this inspires many other works.
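The latent-optimization formulation above can be demonstrated end-to-end with a toy stand-in for the generator. Here a fixed linear map plays the role of the pretrained StyleGAN, and naive finite-difference gradient descent plays the role of the latent-space optimizer; everything in this sketch is an illustrative assumption, not PULSE's actual implementation.

```python
# Sketch of PULSE-style latent optimization: search for a latent z whose
# generated output, once downsampled, matches the LR observation.

def generator(z):
    """Toy 'pretrained generator': maps a 2-D latent to a 4-sample image."""
    return [z[0], z[0] + z[1], z[1], z[0] - z[1]]

def downsample(x):
    """2x average-downsampling (the operator in the objective above)."""
    return [(x[0] + x[1]) / 2, (x[2] + x[3]) / 2]

def loss(z, lr):
    ds = downsample(generator(z))
    return sum((a - b) ** 2 for a, b in zip(ds, lr))

def optimize_latent(lr, steps=2000, step_size=0.1, eps=1e-4):
    z = [0.0, 0.0]
    for _ in range(steps):
        grad = []
        for i in range(len(z)):  # finite-difference gradient estimate
            zp = list(z)
            zp[i] += eps
            grad.append((loss(zp, lr) - loss(z, lr)) / eps)
        z = [zi - step_size * g for zi, g in zip(z, grad)]
    return z

lr_obs = [2.0, 1.0]
z_hat = optimize_latent(lr_obs)
sr = generator(z_hat)  # downsample(sr) is close to lr_obs
```

Note that only the latent is optimized; the generator stays frozen, which is exactly why the recovered face inherits the realism of the pretrained model.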
However, the latent code z in PULSE is randomly sampled and low-dimensional, making the generated images lose important spatial information. To overcome this problem, GLEAN [110], CFP-GAN [111], and GPEN [112] are developed. Rather than directly employing the pretrained StyleGAN [58], they develop their own networks and embed the pretrained generation network of StyleGAN [58] into them to incorporate the generative prior. To maintain faithful information, they not only obtain the latent code by encoding the LR face images instead of randomly sampling, but also extract multi-scale features from the LR face images and fuse these features into the generation network. In this way, the generative prior provided by the pretrained StyleGAN can be fully utilized and the important spatial information can be well maintained.
4.1.3 Reinforcement Learning-based Methods.
Deep learning-based FSR methods learn the mapping from LR face images to HR ones but ignore the contextual dependencies among facial parts. Cao et al. [113] propose attention-aware face hallucination via deep reinforcement learning (Attention-FH), which recurrently discovers facial parts and enhances them by fully exploiting the global inter-dependency of the image. Specifically, Attention-FH has two subnetworks: a policy network that locates the region to be enhanced in the current step, and a local enhancement network that enhances the selected region.
4.1.4 Ensemble Learning-based Methods.
CNN-based methods utilize pixelwise loss to recover face images with higher PSNR and smoother details, while GAN-based methods can generate face images with lower PSNR but more high-frequency details. To combine the advantages of different types of methods, ensemble learning is used in the adaptive threshold-based multi-model fusion network (ATFMN) [114]. Specifically, ATFMN uses three models (CNN-based, GAN-based, and RNN-based) to generate candidate SR faces, and then fuses all candidates to reconstruct the final SR result. In contrast to previous approaches, ATFMN exploits the potential of ensemble learning for FSR instead of focusing on a single model.
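As a flavor of candidate fusion, the sketch below drops, per pixel, any candidate that deviates too far from the median of the ensemble and averages the rest. This threshold rule is an illustrative stand-in chosen for this sketch, not the adaptive scheme of the cited paper.

```python
from statistics import median

# Hedged sketch of fusing candidate SR outputs from multiple models.
# Rule (assumed for illustration): per pixel, discard candidates farther
# than `threshold` from the median, then average the survivors.

def fuse(candidates, threshold=0.5):
    fused = []
    for pixels in zip(*candidates):  # same pixel position across candidates
        m = median(pixels)
        kept = [p for p in pixels if abs(p - m) <= threshold] or [m]
        fused.append(sum(kept) / len(kept))
    return fused

cnn_sr = [0.50, 0.60, 0.70]
gan_sr = [0.52, 0.58, 0.90]  # hallucinates on the third pixel
rnn_sr = [0.48, 0.62, 0.72]
result = fuse([cnn_sr, gan_sr, rnn_sr], threshold=0.1)
# The GAN outlier on pixel 3 is suppressed by the other two candidates.
```

The point of any such rule is the same as in ATFMN: no single model's failure mode dominates the final face.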
4.1.5 Discussion.
Here we discuss the pros and cons of these subcategories of general FSR methods. From a global perspective, the difference between CNN-based and GAN-based methods lies in adversarial learning. CNN-based methods tend to utilize pixelwise loss, leading to higher PSNR and smoother results, while GAN-based methods may recover visually pleasing face images with more details but lower PSNR. Each has its own merits. Compared with them, ensemble learning-based methods can combine their advantages and compensate for their deficiencies by integrating multiple models. However, ensemble learning inevitably increases memory, computation, and parameter costs. Reinforcement learning-based methods recover attended local regions by sequential search and consider the contextual dependency of patches from a global perspective, which improves performance but requires much more training time and computational cost.
4.2 Prior-guided FSR
General FSR methods aim to design efficient networks. Nevertheless, as a highly structured object, the human face has specific characteristics, such as prior information (including facial landmarks, facial parsing maps, and facial heatmaps), which are ignored by general FSR methods. Therefore, to recover facial images with a much clearer facial structure, researchers have begun to develop prior-guided FSR methods.
Prior-guided FSR methods extract facial prior information and utilize it to facilitate face reconstruction. Considering the order of prior information extraction and FSR, we further divide prior-guided FSR methods into four parts: (i) pre-prior methods that extract prior information before FSR, (ii) parallel-prior methods that perform prior extraction and FSR simultaneously, (iii) in-prior methods that extract prior information from intermediate results or features at the middle stage, and (iv) post-prior methods that extract prior information from FSR results. We illustrate the main frameworks of the four categories in Figure 4, outline the development of prior-guided FSR methods in Figure 5, and compare them on several key features in Table 3.
4.2.1 Pre-prior Methods.
These methods first extract face structure prior information and then feed it to the beginning of the FSR model. That is, they always extract the prior information from LR face images by an extraction network, which can be a pretrained network or a subnetwork associated with the FSR model, and then take advantage of the prior information to facilitate FSR. To extract an accurate face structure prior, a prior-based loss is usually used in these methods to train the prior extraction network, which is defined as
\[
\mathcal{L}_{prior} = \left\| P - \bar{P} \right\|_{F}, \qquad (15)
\]
where \(\bar{P}\) is the ground truth prior, P is the extracted prior, the norm F can be 1 or 2, and the prior can be heatmaps, landmarks, or parsing maps in different methods.
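Computing the prior-based loss described above is straightforward; the sketch below evaluates it for F = 1 or 2 on flat lists of heatmap values (a real implementation would operate on tensors of shape [channels, H, W]).

```python
# Prior-based loss between an extracted prior and its ground truth,
# with the norm exponent f = 1 (L1) or f = 2 (squared L2).

def prior_loss(p, p_gt, f=2):
    """Sum of |P - P_gt|^f over all heatmap entries."""
    return sum(abs(a - b) ** f for a, b in zip(p, p_gt))

extracted = [0.1, 0.8, 0.3]     # prior predicted by the extraction network
ground_truth = [0.0, 1.0, 0.5]  # e.g., ground truth landmark heatmap values
l1 = prior_loss(extracted, ground_truth, f=1)  # 0.1 + 0.2 + 0.2
l2 = prior_loss(extracted, ground_truth, f=2)  # 0.01 + 0.04 + 0.04
```

In practice this term is added to the pixel reconstruction loss with a balancing weight.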
In the early years, both LCGE [115] and MNCEFH [116] extract landmarks from LR face images to crop the faces into different components, and then predict high-frequency details for the different components. However, accurate landmarks are unavailable, especially when LR face images are tiny (i.e., 16 x 16). Thus, researchers turn to facial parsing maps [45, 117, 118, 119]. PSFR-GAN [117], SeRNet [118], and CAGFace [45] all pretrain a face structure prior extraction network to extract facial parsing maps. All of them except SeRNet directly concatenate the prior and the LR face images as the input of the super-resolution model, while SeRNet designs an improved residual block (IRB) to fuse the prior with features from LR face images. In addition, PSFR-GAN designs a semantic-aware style loss that calculates the gram matrix loss for each semantic region separately. Later, super-resolution guided by three-dimensional (3D) facial priors (FSRG3DFP) [120] estimates 3D priors instead of 2D priors to learn 3D facial details and captures facial component information via the spatial feature transform (SFT) block.
4.2.2 Parallel-prior Methods.
The above methods ignore the correlation between face structure prior estimation and the FSR task: face prior estimation benefits from the enhancement of FSR and vice versa. Thus, parallel-prior methods that perform prior estimation and super-resolution in parallel are proposed, including the cascaded bi-network (CBN) [121], KPEFH [122], JASRNet [123], SAAN [128], HaPFSR [129], OBC-FSR [130], and ATSENet [124]. They train the prior estimation and super-resolution networks jointly and require the ground truth prior to calculate a prior-based loss like Equation (15).
One of the most representative parallel-prior methods is JASRNet. Specifically, JASRNet utilizes a shared encoder to extract features for super-resolution and prior estimation simultaneously. Through this design, the shared encoder can extract the most expressive information for both tasks. In contrast to JASRNet, ATSENet not only extracts shared features for the two tasks, but also feeds features from the prior estimation branch into the feature fusion unit (FFU) in the super-resolution branch.
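The shared-encoder data flow common to these parallel-prior methods can be sketched structurally: one feature extractor feeds both a super-resolution head and a prior-estimation head, and the two losses are summed for joint training. The "networks" below are toy functions and the weighting is an assumed hyperparameter; only the wiring reflects the designs described above.

```python
# Structural sketch of a shared encoder with two jointly trained task heads,
# in the spirit of JASRNet-style parallel-prior methods.

def shared_encoder(lr):
    return [v * 2.0 for v in lr]             # stand-in for conv features

def sr_head(feats):
    return [v + 0.5 for v in feats]          # stand-in SR reconstruction

def prior_head(feats):
    return [1.0 if v > 1.0 else 0.0 for v in feats]  # stand-in heatmap

def joint_loss(lr, hr, prior_gt, weight=0.1):
    feats = shared_encoder(lr)               # computed ONCE, shared by both heads
    l_sr = sum((a - b) ** 2 for a, b in zip(sr_head(feats), hr))
    l_prior = sum((a - b) ** 2 for a, b in zip(prior_head(feats), prior_gt))
    return l_sr + weight * l_prior           # joint objective for both tasks

total = joint_loss([0.2, 0.8], [0.9, 2.1], [0.0, 1.0])
```

Because the encoder is optimized against both losses, it is pushed to extract features expressive enough for reconstruction and prior estimation at once.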
4.2.3 In-prior Methods.
Pre- and parallel-prior methods directly extract structure prior information from LR face images. Due to the low quality of LR face images, extracting accurate prior information is challenging. To reduce the difficulty and improve the accuracy of prior estimation, researchers first coarsely recover LR face images and then extract prior information from the enhanced results, including FSRNet [35], FSR guided by facial component heatmaps (FSRGFCH) [125], HCFR [131], and deep-iterative-collaboration (DIC) [36], as well as [132, 133, 134, 135, 136]. Similarly to parallel-prior methods, in-prior methods always jointly optimize the networks for the two tasks.
Specifically, FSRNet [35], FSRGFCH [125], and HCFR [131] first upsample the LR face images to obtain intermediate results, then extract the face structure prior from the intermediate results, and finally make use of the prior and intermediate results to recover the final results. FSRNet and FSRGFCH concatenate the intermediate results and the prior and feed the concatenation into the following network to recover the final SR results, while HCFR utilizes the prior to segment the intermediate results and recovers the final SR results with random forests. Considering that FSR and prior extraction should facilitate each other, DIC [36] proposes to iteratively perform the super-resolution and prior extraction tasks. In the first iteration, DIC recovers a face SR_1 with the super-resolution model and extracts the prior (heatmaps) P_1 from SR_1. In the i-th iteration, both the LR face image and P_{i-1} are fed into the super-resolution model to obtain SR_i, from which P_i can be extracted. In this way, the two tasks promote each other. Moreover, DIC builds an attention fusion module (AFM) to fuse the facial prior and the LR face image efficiently.
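The iterative collaboration loop can be sketched as a simple alternation in which the SR step consumes the previous prior estimate and the prior step consumes the new SR result. The two update rules below are illustrative stand-ins for the two networks; only the control flow mirrors DIC.

```python
# Toy sketch of DIC-style iterative collaboration between super-resolution
# and prior extraction. Each round, a better prior yields a better SR result,
# and a better SR result yields a better prior.

def sr_model(lr, prior):
    """Stand-in SR network: the prior nudges the estimate upward."""
    return [l + 0.5 * p for l, p in zip(lr, prior)]

def prior_extractor(sr):
    """Stand-in prior network: clamp to heatmap-like values in [0, 1]."""
    return [min(1.0, v) for v in sr]

lr = [0.2, 0.4]
prior = [0.0, 0.0]              # first iteration: no prior available yet
for i in range(5):
    sr = sr_model(lr, prior)    # step 1: super-resolve with the current prior
    prior = prior_extractor(sr) # step 2: re-estimate the prior from the SR face
# sr improves monotonically across iterations in this toy setting
```

The trade-off noted later in the discussion also shows up here: every extra round repeats both models, so the collaboration buys accuracy at the price of computation.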
4.2.4 Post-prior Methods.
In contrast to the above methods, post-prior methods extract the face structure prior from the SR result rather than the LR face image or an intermediate result, and utilize the prior to design loss functions; such methods include Super-FAN [126], the progressive FSR network (PFSRNet) [127], and [137]. Super-FAN [126] and PFSRNet [127] first super-resolve LR face images to obtain SR results, then develop a prior estimation network to extract the heatmaps of the SR face images and the HR ones, and constrain the two sets of heatmaps to be close. PFSRNet further generates multi-scale super-resolved results and applies the prior-based loss at every scale. In addition, PFSRNet utilizes heatmaps to generate a mask and calculates a facial attention loss based on the masked SR and HR face images. Compared with the above methods, post-prior methods do not require prior extraction during inference.
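A heatmap-derived mask turns the reconstruction error into an attention loss: errors inside facial regions count, background errors do not. The sketch below uses flat lists as stand-ins for images and an L1 penalty; the exact weighting in PFSRNet may differ.

```python
# Sketch of a heatmap-masked facial attention loss: only pixels where the
# mask (derived from landmark heatmaps) is non-zero contribute to the loss.

def attention_loss(sr, hr, mask):
    return sum(m * abs(a - b) for a, b, m in zip(sr, hr, mask))

sr   = [0.5, 0.9, 0.1, 0.7]
hr   = [0.6, 0.9, 0.4, 0.2]
mask = [1.0, 1.0, 0.0, 0.0]  # face region = first two pixels (from heatmaps)
l_att = attention_loss(sr, hr, mask)  # only the first two pixels contribute
```

Since the mask is needed only to compute the training loss, this fits the post-prior property noted above: nothing prior-related runs at inference time.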
4.2.5 Discussion.
All prior-guided FSR methods need the ground truth of the face structure prior to calculate loss in the training phase. During the testing phase, all prior-guided FSR methods except post-prior methods need to estimate the prior. Due to the loss of information caused by image degradation, LR face images increase the difficulty and limit the accuracy of prior extraction in pre-prior methods, further limiting the super-resolution performance. Although parallel-prior methods can facilitate prior extraction and super-resolution simultaneously by sharing feature extraction, the improvement is still limited. In-prior methods extract the prior from intermediate results, which improves performance but increases memory and computation costs caused by the iterative super-resolution procedure, especially in the iterative method DIC [36]. In post-prior methods, the prior only plays the role of a supervisor during training and does not participate in inference, so they cannot make full use of the specific prior of the input LR face image. Thus, a method that can fully exploit the prior without additional memory or computation cost is in demand.
4.3 Attribute-constrained FSR
Facial attributes are also often exploited in FSR, and such methods are called attribute-constrained FSR. As a kind of semantic information, facial attributes provide semantic knowledge, e.g., whether people wear glasses, which is useful for FSR. In the following, we introduce some attribute-constrained FSR methods.
Different from face structure prior information, whose acquisition relies on the image itself, attribute information can be available without LR face images, such as in criminal cases where attribute information may not be discernible in LR face images but is accurately known by witnesses. Thus, some researchers construct networks on the condition that attribute information is given, while others relax this by estimating attributes. Accordingly, attribute-constrained FSR methods can be divided into two frameworks: given attribute methods and estimated attribute methods. An overview is provided in Figure 6 and Table 4.
4.3.1 Given Attribute Methods.
Given the attribute information, how to integrate it into the super-resolution model is the key problem. To this end, attribute-guided conditional CycleGAN (AGCycleGAN) [138], FSR with supplementary attributes (FSRSA) [139], expansive FSR with supplementary attributes (EFSRSA), the attribute transfer network (ATNet) [140], and ATSENet [124] all directly concatenate the attribute information and the LR face image (or features extracted from the LR face image). AGCycleGAN and FSRSA also feed the attribute into their discriminators to force the super-resolution model to notice the attribute information and develop an attribute-based loss to achieve attribute matching, which is defined as
\[
\mathcal{L}_{attr} = \log D\big(I^{SR}, A\big) + \log\big(1 - D\big(I^{SR}, \bar{A}\big)\big), \qquad (16)
\]
where D is the discriminator, A is the attribute matched with \(I^{SR}\), and \(\bar{A}\) is a mismatched one. ATSENet feeds the super-resolved result into an attribute analysis network to calculate an attribute prediction loss,
\[
\mathcal{L}_{pred} = \big\| \hat{A} - A \big\|_{2}^{2}, \qquad (17)
\]
where \(\hat{A}\) is the attribute predicted by the network and A is the ground truth attribute. However, Lee et al. [141] hold that LR face images and attributes belong to different domains, so direct concatenation is unsuitable and may decrease performance. In view of this, Lee et al. construct an attribute augmented convolutional neural network (AACNN) [141], which extracts features from the attributes to boost face super-resolution.
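The concatenation strategy shared by the methods above can be shown in miniature: a binary attribute vector is appended to the features extracted from the LR face before further processing. The feature extractor, the attribute list, and all names are illustrative stand-ins.

```python
# Sketch of attribute conditioning by concatenation: the attribute vector
# (e.g., [glasses, male, smiling]) is appended to LR-face features so that
# subsequent layers can exploit the semantic knowledge.

def extract_features(lr):
    return [v * 2.0 for v in lr]              # stand-in for conv features

def condition_on_attributes(lr, attributes):
    return extract_features(lr) + attributes  # concatenation along channels

lr_face = [0.1, 0.3, 0.5]
attrs = [1.0, 0.0, 1.0]                       # glasses=yes, male=no, smiling=yes
conditioned = condition_on_attributes(lr_face, attrs)
# The conditioned vector carries both image features and semantic attributes.
```

Lee et al.'s objection quoted above targets exactly this step: the two halves of `conditioned` live in different domains, which motivates learning attribute features instead of appending raw labels.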
4.3.2 Estimated Attribute Methods.
The above-mentioned given attribute methods work on the condition that all attributes are given, making them limited in real-world scenes where some attributes are missing. Although the missed attributes can be set as unknown, such as 0 or random values, the performance may drop sharply. To this end, researchers build modules to estimate attribute information for FSR. In estimated attribute methods, attribute-based loss forces the network to predict attribute information correctly, which is similar to Equation (
17). Estimated attribute methods include
residual attribute attention network (RAAN) [
143] and
facial attribute capsule network (FACN) [
144]. RAAN is based on cascaded
residual attribute attention blocks (RAABs). An RAAB builds three branches to generate shape, texture, and attribute information, respectively, and introduces two attribute-guided channel attention modules applied to the shape and texture information. In contrast, FACN [
144] integrates attributes into capsules. Specifically, FACN encodes the LR face image into features, which are fed into a capsule generation block that produces semantic capsules, probabilistic capsules, and facial attributes. Then, the attributes are viewed as a kind of mask to refine the other features by multiplication or summation. Taking the combination of the three kinds of information as input, the decoder of FACN can well recover the final SR result.
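As a concrete illustration of the mask-style refinement described above, the following numpy sketch applies an attribute vector to a feature map by multiplication or summation. The function and variable names are our own illustrative assumptions, not FACN's actual implementation:

```python
import numpy as np

def refine_with_attributes(features, attributes, mode="mul"):
    """Refine a (C, H, W) feature map with a (C,) attribute vector,
    broadcast over the spatial dimensions, by multiplication or summation."""
    mask = attributes[:, None, None]          # (C,) -> (C, 1, 1)
    if mode == "mul":
        return features * mask                # multiplicative refinement
    return features + mask                    # additive refinement

feat = np.ones((4, 8, 8))                     # toy feature map
attr = np.array([0.0, 0.5, 1.0, 2.0])         # toy attribute vector
out = refine_with_attributes(feat, attr, mode="mul")
```

Broadcasting the attribute vector over the spatial dimensions is what makes it act as a channel-wise mask.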
4.3.3 Discussion.
Given attribute methods require attribute information, making them applicable only in restricted scenes. Although the attributes can be set as unknown in these methods, the performance may drop sharply. As for estimated attribute methods, they must first estimate the attributes and then utilize them. Compared with given attribute methods, they have a wider range of applications, but the accuracy of attribute estimation is difficult to guarantee in practice.
4.4 Identity-preserving FSR
Compared with face structure priors and attribute information, identity information containing identity-aware details is essential, and identity-preserving FSR methods have received an increasing amount of attention in recent years. They aim to maintain identity consistency between the SR face image and the LR one and to improve the performance of downstream face recognition. We show the overview and comparison of some representative methods in Figure
7 and Table
5.
4.4.1 Face Recognition-based Methods.
To maintain identity consistency between the SR face image and the corresponding HR face image, a commonly used design in the training phase is utilizing a face recognition network to define an identity loss, e.g.,
super-identity convolutional neural network (SICNN) [
145],
face hallucination generative adversarial network (FH-GAN) [
146], WaSRGAN [
147], [
148],
identity preserving face hallucination (IPFH) [
149],
cascaded super-resolution and identity priors (C-SRIP) [
150,
151,
152,
153,
154] and ATSENet [
124]. The framework of these methods consists of two main components: a super-resolution model and a pretrained
face recognition network (FRN), and possibly an additional discriminator. The super-resolution model super-resolves the input LR face image, generating the SR face image I_SR, which is fed into the FRN to obtain its identity features. Simultaneously, the HR face image I_HR is also fed into the FRN to obtain its identity features. The identity loss is calculated by L_id = ‖FR(I_SR) − FR(I_HR)‖_F, where FR is the function of the FRN.
F is 1 in WaSRGAN [
147] and 2 in FH-GAN [
146,
151]. Some methods calculate the loss on normalized features [
145,
155], and some use A-softmax loss [
149,
156]. Rather than directly extracting identity features from the SR and HR face images, C-SRIP [150] feeds the residual maps between the SR (or HR) face image and the bicubic-upsampled LR face image into the FRN, and applies a cross-entropy loss on them. Moreover, C-SRIP generates multi-scale face images that are fed into face recognition networks of the corresponding scales.
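A minimal numpy sketch of the feature-distance identity loss described above. The `fr` embedding is a toy stand-in for a pretrained FRN, and all names here are illustrative assumptions:

```python
import numpy as np

def identity_loss(feat_sr, feat_hr, p=2, normalize=False):
    """Sum of p-th-power absolute differences between identity features;
    p = 1 corresponds to WaSRGAN and p = 2 to FH-GAN. normalize=True
    computes the loss on L2-normalized features, as some methods do."""
    if normalize:
        feat_sr = feat_sr / np.linalg.norm(feat_sr)
        feat_hr = feat_hr / np.linalg.norm(feat_hr)
    return float(np.sum(np.abs(feat_sr - feat_hr) ** p))

fr = lambda img: img.mean(axis=(1, 2))        # toy stand-in for an FRN embedding
sr, hr = np.random.rand(3, 16, 16), np.random.rand(3, 16, 16)
loss = identity_loss(fr(sr), fr(hr))
```

In practice the FRN is frozen during training, so gradients flow only through the super-resolution model.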
To fully explore the identity prior, SPGAN [
99] feeds identity information extracted by the pretrained FRN into the discriminator at different scales, and designs an attention-based identity loss. First, SPGAN generates two complementary attention maps from the difference map E between the identity features of the SR and HR face images: a 0-1 mask M, whose entry at the ith row and jth column is 0 when the corresponding entry of E is negative and 1 otherwise, and its complement obtained by subtracting M from an all-ones matrix. The two attention maps are then applied, via element-wise multiplication, to weight the identity loss, yielding the attention-based identity loss of SPGAN.
4.4.2 Pairwise Data-based Methods.
The training of an FRN requires well-labeled datasets, but building a large well-labeled dataset is very costly. One solution is to rely only on weakly-labeled datasets. In consideration of this,
siamese generative adversarial network (SiGAN) [
157] takes advantage of the weak pairwise label (in which different LR face images correspond to different identities) to achieve identity preservation. Specifically, SiGAN has twin GANs that share the same architecture but super-resolve two different LR face images at the same time. As the identities of the two LR face images differ, the identities of their SR results should also differ. Based on this observation, SiGAN designs an identity-preserving contrastive loss that minimizes the feature difference between same-identity pairs and maximizes it between different-identity pairs,
where features are extracted from the intermediate layers of the two generators, the distance between the features of the two SR results is measured, y is 1 when the two LR face images belong to the same identity, and y is 0 when they belong to different identities.
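The behavior described matches the standard contrastive loss; here is a sketch under that assumption. The margin m and all names are illustrative, not necessarily SiGAN's exact formulation:

```python
def contrastive_loss(d, y, m=1.0):
    """d: distance between features of the two super-resolved faces;
    y: 1 for a same-identity pair, 0 for a different-identity pair;
    m: margin up to which different-identity pairs are pushed apart."""
    return y * d ** 2 + (1 - y) * max(0.0, m - d) ** 2

same_pair = contrastive_loss(d=0.2, y=1)      # small distance -> small loss
diff_pair = contrastive_loss(d=0.2, y=0)      # small distance -> penalized
```

Different-identity pairs farther apart than the margin contribute zero loss, so only violating pairs are pushed apart.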
Instead of feeding the pair data into twin generators,
identity-aware deep face hallucination (IADFH) [
158] feeds pair data into the discriminator. Its discriminator is a three-way classifier with outputs fake, genuine, and imposter: (i) pairs of HR and SR face images, with the same or different identities, correspond to fake, which forces the discriminator to distinguish SR images from HR images; (ii) pairs of two different HR face images of the same identity correspond to genuine; and (iii) pairs of two HR face images with different identities correspond to imposter. The last two pair types force the discriminator to capture identity features. In this pattern, the generator can incorporate the identity information. The loss is called
adversarial face verification loss (AFVL),
where the loss functions of the discriminator and the generator are defined over the discriminator's outputs for fake, genuine, and imposter pairs.
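The three-way labeling scheme behind AFVL can be sketched as follows. The class indices and the cross-entropy form are illustrative assumptions, not IADFH's exact implementation:

```python
import numpy as np

CLASSES = {"fake": 0, "genuine": 1, "imposter": 2}

def label_pair(contains_sr, same_identity):
    """Any pair containing an SR image is labeled fake; HR-only pairs are
    genuine when identities match and imposter otherwise."""
    if contains_sr:
        return CLASSES["fake"]
    return CLASSES["genuine"] if same_identity else CLASSES["imposter"]

def cross_entropy(probs, label):
    """Toy cross-entropy over the discriminator's three-way softmax output."""
    return float(-np.log(probs[label]))
```

Training the discriminator to separate genuine from imposter HR pairs is what forces it to learn identity features, which the generator must then satisfy.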
4.4.3 Discussion.
Face recognition-based methods design an identity loss based on a face recognition network, which is usually pretrained. The training of a face recognition network requires well-labeled datasets, which are costly. Instead, pairwise data-based methods exploit the contrast between different identities and the similarity within the same identity to maintain identity consistency without well-labeled datasets, which gives them a wider range of applications.
4.5 Reference FSR
The FSR networks discussed so far exploit only the input LR face image itself. In some conditions, we may obtain a high-quality face image of the same identity as the LR face image; for example, the person in the LR face image may have other high-quality face images available. These high-quality face images can provide identity-aware face details for FSR. Thus, reference FSR methods utilize high-quality face image(s) as a
reference (R) to boost face restoration. The reference can be a single image or multiple images. According to the number of references, guided frameworks can be partitioned into single-face guided, multi-face guided, and dictionary-guided methods. An overview of reference FSR methods is shown in Figure
8 and the comparison of them is shown in Table
6.
4.5.1 Single-face Guided Methods.
At first, a high-quality face image that shares the same identity with the LR face image serves as R, such as
guided face restoration network (GFRNet) [
39], GWAInet [
159]. Since the reference face image and the LR face image may have different poses and expressions, which may hinder recovery, single-face guided methods tend to perform alignment between the two. After alignment, both the LR face image and the aligned reference face image are fed into a reconstruction network to recover the SR result. The differences between GFRNet and GWAInet lie in two aspects: (i) GFRNet employs landmarks while GWAInet employs a flow field to carry out the alignment, and (ii) in the reconstruction network, GFRNet directly concatenates the LR face image and the aligned reference as input, whereas GWAInet builds a GFENet to extract features from the aligned reference and transfers its useful features to the reconstruction network to recover the SR result.
4.5.2 Multi-face Guided Methods.
Single-face guided methods assume that an LR face image has only one high-quality reference face image, but in some applications many high-quality face images are available, which can provide further complementary information for FSR.
Adaptive spatial feature fusion network (ASFFNet) [
37] is the first to explore multi-face guided FSR. Given multiple reference images, ASFFNet first uses a guidance selection module to select the best reference image, i.e., the one whose pose and expression are most similar to those of the LR face image. However, misalignment and illumination differences still exist between the reference face image and the LR face image. Thus, ASFFNet applies weighted least-squares alignment [
162] and AdaIN [
163] to cope with these two problems. Finally, they design an
adaptive feature fusion block (AFFB) to generate an attention mask that is used to fuse the complementary information from the LR face image and R.
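AdaIN [163] itself can be sketched as follows: the per-channel feature statistics of one input are replaced by those of another. We assume here that the reference features are renormalized to the LR features' statistics to reduce the illumination gap; shapes and names are illustrative:

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization on (C, H, W) feature maps: normalize
    `content` per channel, then apply the per-channel mean/std of `style`."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    return s_std * (content - c_mean) / (c_std + eps) + s_mean

rng = np.random.default_rng(0)
ref_feat = rng.random((2, 4, 4))              # reference features (content)
lr_feat = rng.random((2, 4, 4)) * 3.0         # LR features supplying statistics
out = adain(ref_feat, lr_feat)
```

After the operation, the reference features carry the LR features' channel-wise mean and standard deviation.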
Multiple exemplar FSR (MEFSR) [
160] directly feeds all reference faces into a weighted pixel average (PWAve) module to extract information for face restoration.
4.5.3 Dictionary-guided Methods.
It is observed that different people may have similar facial components. According to this observation, dictionary-guided methods are proposed, including
joint super-resolution and face composite (JSRFC) [
161] and
deep face dictionary network (DFDNet) [
38]. Dictionary-guided methods do not require identity consistency between the reference face image and the LR face image, but instead build a component dictionary to boost face restoration. For example, JSRFC selects reference images that have components similar to those of the LR face image (every reference face image is labeled with a vector indicating which components are similar). Then, it aligns the LR face image with the reference face image and extracts the corresponding components as a component dictionary. Finally, the dictionary components are used for the subsequent face restoration. Different from JSRFC, Li et al. [
38] build multi-scale component dictionaries based on features of the entire dataset. They use pretrained VGGFace [
67] to extract multi-scale features from high-quality faces, then crop and resample four facial components with landmarks, and finally cluster the features of every component into K classes by k-means. Given the component dictionaries, they first select the most similar atom for every component by the inner product, and then transfer the features from the dictionary to the LR face image by dictionary feature transfer (DFT).
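The inner-product atom-selection step can be sketched as follows, with a toy dictionary; names and shapes are illustrative, not DFDNet's actual interface:

```python
import numpy as np

def select_atom(component_feat, dictionary):
    """Pick the dictionary atom most similar to a (D,) component feature
    from a (K, D) dictionary, using inner-product similarity."""
    scores = dictionary @ component_feat      # one score per atom
    idx = int(np.argmax(scores))
    return idx, dictionary[idx]

atoms = np.eye(4)                             # toy dictionary with 4 atoms
idx, atom = select_atom(np.array([0.1, 0.9, 0.0, 0.2]), atoms)
```

The selected atom, rather than the degraded input, then supplies the high-quality component features for the transfer step.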
4.5.4 Discussion.
Single-face and multi-face guided FSR methods require one or multiple additional high-quality face image(s) with the same identity as the LR face image, which facilitates face restoration but limits their application, since such reference images may not exist. In addition, the alignment between the low-quality LR face image and the high-quality reference face image is also challenging in reference FSR. Dictionary-guided methods break the restriction of identity consistency, broadening the application but increasing the difficulty of face reconstruction.
4.6 Experiments and Analysis
To have a clear view of deep learning-based FSR methods, we compare the PSNR, SSIM, and LPIPS performance of the state-of-the-art algorithms on commonly used benchmark datasets (including CelebA [
55], VGGFace2 [
67], and CASIA-WebFace [
69]) with upscale factors ×4, ×8, and ×16. Considering that reference FSR methods differ from other FSR methods, we compare the reference FSR methods separately.
4.6.1 Comparison Results of FSR Methods.
We first introduce the experimental settings and analyze the results of FSR methods.
Experimental Setting: For CelebA [
55] dataset, 168,854 images are used for training and 1,000 images for testing following DIC [
36]. All the images are cropped and resized into 128 × 128 as the HR images. We apply the degradation model in Equation (4) to generate the LR images. Facial landmarks are detected by the detectors in References [
164,
165,
166], and heatmaps are generated according to the landmarks. For the facial parsing map, we adopt a pretrained BiSeNet [167] to extract the parsing map from the HR face image. For quality evaluation, PSNR and SSIM are adopted, and both are computed on the Y channel of YCbCr space, also following DIC [36]. In addition, we further introduce LPIPS to evaluate the performance of all comparison approaches. For the optimizer and learning rate used when retraining different methods, we follow the settings in the original papers.
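The Y-channel PSNR protocol above can be sketched as follows, assuming 8-bit RGB inputs in [0, 255] and ITU-R BT.601 conversion weights:

```python
import numpy as np

def rgb_to_y(img):
    """Y channel of YCbCr (BT.601 studio range) from an RGB array in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.257 * r + 0.504 * g + 0.098 * b + 16.0

def psnr_y(sr, hr, peak=255.0):
    """PSNR computed on the Y channel only, following the protocol above."""
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return float("inf") if mse == 0 else float(10 * np.log10(peak ** 2 / mse))

black = np.zeros((4, 4, 3))
white = np.full((4, 4, 3), 255.0)
```

Computing the metric on Y only ignores chroma errors, which is why values reported under this protocol differ from RGB-space PSNR.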
Experimental Results: We list and compare the results of some representative FSR methods in Table
7, including four general image super-resolution methods: SRCNN [
70], VDSR [
85],
residual channel attention network (RCAN) [
168],
non-local sparse network (NLSN) [
169], three general FSR methods: URDGN [
34], WaSRNet [
82], SPARNet [
78], three prior-guided FSR methods: FSRNet [
35], Super-FAN [
126], DIC [
36], two attribute-constrained FSR methods: FSRSA [
142], AACNN [
141], and three identity-preserving FSR methods: SICNN [
145], SiGAN [
157], and WaSRGAN [
147]. Besides, we also report the parameters and FLOPs of these methods in the last two columns of Table 7. Note that the parameters and FLOPs are associated with the models with upscale factor ×8. In addition, we also present visual comparisons between a few state-of-the-art algorithms in Figure
9, Figure
10, and Figure
11.
From these objective metrics and visual comparison results, we have the following observations:
(i) The retrained state-of-the-art general image super-resolution methods, such as RCAN and NLSN, are very competitive and even outperform the best FSR methods in terms of PSNR and SSIM. Meanwhile, as a general FSR method, SPARNet obtains the best performance among all the FSR methods. RCAN, NLSN, and SPARNet do not explicitly incorporate the prior knowledge of face images, yet they obtain outstanding results. This shows that the design and optimization of the network are very important, and a well-designed network has stronger fitting capability (smaller reconstruction errors). This observation suggests that, when designing an FSR deep network, we should build it on a strong backbone network.
(ii) The terms RCAN* and NLSN* in Table 7 represent the models pretrained on general training images, which we directly downloaded from the authors' pages. Note that the pretrained results under certain magnification factors are not given (indicated as "—" in the table), because these methods were not trained under those magnification factors. RCAN and NLSN achieve better performance than RCAN* and NLSN*. This demonstrates that models trained on general images are not suitable for FSR, while general image super-resolution methods trained on face images may perform well (sometimes even better than FSR methods on face images). Therefore, to assess the performance of a newly proposed general image super-resolution method on the task of FSR, we cannot directly use the pretrained model released by the authors but should retrain the model on a face image dataset. It should also be noted that the objective results of the GAN-based FSR methods (e.g., URDGN, FSRSA, SiGAN, and WaSRGAN) are worse than those of NLSN*. This is mainly because they often cannot achieve a low MSE due to the introduction of adversarial losses, which tend to let the models obtain perceptually better SR results but larger reconstruction errors.
(iii) Compared with general image super-resolution methods and general FSR methods, the methods that incorporate facial characteristics do not perform well in terms of PSNR and SSIM. Nevertheless, we cannot conclude that it is meaningless to develop FSR methods that use facial characteristics. This is mainly because PSNR and SSIM may not be good assessment metrics for the task of image super-resolution [41], let alone for FSR, in which human perception is more important. To further examine the reconstruction capacity, we also introduce another assessment metric, LPIPS, which is more in line with human judgment. From the LPIPS results, we learn that methods with low PSNR and SSIM may produce very good performance in terms of LPIPS; see, e.g., Super-FAN and SiGAN. This indicates that the methods that introduce facial characteristics can well represent the face image and recover the face contours and discriminative details.
(iv) When we compare FSR methods that use different facial characteristics, such as face structure priors, attributes, and identity, it is difficult to say which type of characteristic is more effective for FSR, because these methods often use different backbone networks, and it is hard to determine whether their performance differences are caused by the backbone network itself or by the introduction of different facial characteristics. In practice, we can first develop a strong backbone and then incorporate facial characteristics to boost FSR.
4.6.2 Comparison Results of Reference FSR Methods.
The above FSR methods require only LR face images as input, while reference FSR methods require LR face images and reference images. It would be unfair to compare them directly with methods that do not use auxiliary high-quality face images. Therefore, we compare the performance of the reference FSR methods separately.
Experimental Setting: Following ASFFNet [
37], VGGFace2 [
67] is reorganized into 106,000 groups, and every group has 3–10 high-quality face images of the same identity; 10,000 groups are used as the training set, 4,000 groups as the validation set, and the remaining groups as the testing set. In addition, two testing sets based on CelebA [
55] and CASIA-WebFace [
69] are also used, and each set contains 2,000 groups with 3–10 high-quality face images. We utilize facial landmarks to crop and resize all images into 256 × 256 as high-quality face images. To generate the LR images, the degradation model in Equation (5), where J and ↓ are embodied as JPEG compression with quality q and bicubic interpolation, respectively, is applied to the high-quality images. We consider two types of blur kernels, i.e., Gaussian blur and motion blur kernels, and randomly sample the scale
s from {1:0.1:8}, the noise level from {0:1:15}, and the compression quality factor
q from {10:1:60} [
37]. PSNR, SSIM, and LPIPS [
41] are used as metrics.
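A numpy-only sketch of this degradation pipeline (blur, downsample by scale s, add noise). The box kernel, strided downsampling, and omitted JPEG step are simplifying stand-ins for the Gaussian/motion kernels, bicubic interpolation, and JPEG compression used in the actual protocol:

```python
import numpy as np

def degrade(hr, kernel, s, noise_level, seed=0):
    """Blur a (H, W) image with a (k, k) kernel, downsample by factor s,
    and add Gaussian noise. JPEG compression with quality q would follow
    here but needs an image codec, so it is omitted in this sketch."""
    k = kernel.shape[0]
    pad = np.pad(hr, k // 2, mode="edge")
    blurred = np.zeros_like(hr)
    for i in range(hr.shape[0]):              # direct 2-D convolution
        for j in range(hr.shape[1]):
            blurred[i, j] = np.sum(pad[i:i + k, j:j + k] * kernel)
    lr = blurred[::s, ::s]                    # stand-in for bicubic downsampling
    rng = np.random.default_rng(seed)
    return lr + rng.normal(0.0, noise_level, lr.shape)

box = np.ones((3, 3)) / 9.0                   # stand-in blur kernel
hr = np.full((16, 16), 100.0)
lr = degrade(hr, box, s=4, noise_level=0.0)
```

Sampling the kernel, scale, noise level, and quality factor per image, as described above, yields the varied degradations the networks are trained on.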
Experimental Results: The experimental results are shown in Table
8. To be specific, we list the results of GFRNet [
39], GWAInet [
159] and the latest proposed ASFFNet [
37] on CelebA [
55], VGGFace2 [
67] and CASIA-WebFace [
69] with upscale ×8. Note that all the results are copied from the article [
37], since we have difficulty in reproducing these methods. Note that GFRNet and GWAInet are single-face guided methods while ASFFNet is multi-face guided method. To be fair, the reference image of GFRNet and GWAInet is the same as the selected image in ASFFNet. From Table
8, it is obvious that the multi-face guided method ASFFNet performs better than the single-face guided methods (GWAInet and GFRNet). ASFFNet considers the illumination difference between the reference face image and the LR face image, which is ignored by GFRNet and GWAInet, and builds a well-designed AFFB, instead of simple concatenation, to adaptively fuse the features of the reference face image and the LR face image. These two points contribute to the excellent performance of ASFFNet. Thus, both the elimination of differences (i.e., misalignment, illumination difference, and so on) and the effective fusion of information between the reference face image and the LR face image are important in reference FSR methods.
4.7 Joint FSR and Other Tasks
Although the above FSR methods have achieved breakthroughs, FSR is still challenging, since input face images are often affected by many factors, including shadow, occlusion, blur, abnormal illumination, and so on. To recover such face images effectively, some works consider the degradation caused by low resolution together with these other factors. Moreover, researchers also perform FSR jointly with other tasks. In the following, we review these joint methods.
4.7.1 Joint Face Completion and Super-resolution.
Low resolution and occlusion or shadowing often coexist in real-world face images. Thus, restoring faces degraded by both factors is important. The simplest way is to first complete the occluded parts and then super-resolve the completed LR face images [
170]. However, the results always contain large artifacts due to the accumulation of errors. Cai et al. [
171] propose the FCSR-GAN method, which pretrains a
face completion model (FCM), combines the FCM with a
super-resolution model (SRM), trains the SRM with the fixed FCM, and finally finetunes the whole network. Then, Liu et al. [
172] propose graph convolution pyramid blocks, which need only one training step rather than the multiple steps of FCSR-GAN. In contrast, Pro-UIGAN [
173] utilizes facial landmarks to capture facial geometric priors and recovers occluded LR face images progressively.
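The naive complete-then-super-resolve pipeline criticized above can be sketched with toy stand-ins for the two models (mean filling for completion and nearest-neighbor upsampling for SR; all function names and data here are illustrative, not from any cited method):

```python
import numpy as np

def complete(face, mask):
    """Naive completion: fill occluded pixels with the mean of visible ones
    (a toy stand-in for a learned face completion model)."""
    out = face.copy()
    out[mask] = face[~mask].mean()
    return out

def upsample_nearest(face, scale):
    """Nearest-neighbor upsampling (a toy stand-in for a learned SR model)."""
    return np.repeat(np.repeat(face, scale, axis=0), scale, axis=1)

lr = np.random.rand(16, 16)            # occluded LR face (toy data)
mask = np.zeros((16, 16), dtype=bool)
mask[4:8, 4:8] = True                  # occluded region

# Step 1: complete; step 2: super-resolve. Errors made in step 1
# are inherited and magnified by step 2, hence the artifacts.
sr = upsample_nearest(complete(lr, mask), 8)
print(sr.shape)                        # (128, 128)
```

Because the two steps are decoupled, any completion error propagates into the upsampled result, which motivates the jointly trained designs above.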
4.7.2 Joint Face Deblurring and Super-resolution.
Blurry LR face images often arise in real surveillance and sports videos and cannot be recovered effectively by a single-task model, e.g., a super-resolution or deblurring model. In the literature, Yu et al. [
174] develop SCGAN to deblur and super-resolve the input jointly. Then, Song et al. [
175] find that previous methods ignore facial prior information and that the recovered face images lack high-frequency details. Thus, they first utilize a parsing map and the LR face image to recover a basic result, and then feed the basic result into a detail enhancement module to compensate for high-frequency details from a high-quality exemplar. Later, DGFAN [
176] develops two task-specific feature extraction modules and imports the extracted features into well-designed gated fusion modules to generate deblurred high-quality results. Xu et al. [
177] incorporate a face recognition network into face restoration to improve the identifiability of the recovered face images.
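The blurry LR formation process these joint methods target is commonly written as the following degradation model (notation ours, stated for orientation rather than taken from any one cited paper):

```latex
\mathbf{y} = (\mathbf{x} \otimes \mathbf{k})\downarrow_{s} + \mathbf{n},
```

where \(\mathbf{x}\) is the latent HR face, \(\mathbf{k}\) a blur kernel, \(\otimes\) convolution, \(\downarrow_{s}\) downsampling by scale factor \(s\), and \(\mathbf{n}\) additive noise; joint deblurring and super-resolution must invert the blur and the downsampling together.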
4.7.3 Joint Illumination Compensation and FSR.
Abnormal-illumination FSR has also attracted the attention of many scholars. SeLENet [
178] decomposes a face image into a normal face, an albedo map, and a lighting coefficient, replaces the lighting coefficient with the standard ambient white-light coefficient, and then reconstructs the corresponding neutral-light face image. Ding et al. [
179] build a pipeline that first detects faces and then recovers the detected faces with landmark guidance. Zhang et al. [
180] utilize an external HR face image with normal illumination to guide abnormal-illumination LR face images for illumination compensation. They develop a
copy-and-paste GAN (CPGAN), including an internal copy-and-paste network that utilizes internal face information for reconstruction and an external copy-and-paste network that compensates illumination. Based on CPGAN, they further improve the external copy-and-paste network by introducing recursive learning and incorporating landmark estimation, and develop the recursive CPGAN [
181]. In contrast, Yasarla et al. [
182] introduce network architecture search into face enhancement to design an efficient network, and extract identity information from HR guidance to restore face images.
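The decompose-swap-rerender idea behind SeLENet can be illustrated with a toy Lambertian shading model (a simplification of the spherical-harmonics lighting the paper uses; all names and data here are illustrative):

```python
import numpy as np

def render_lambertian(albedo, normals, light):
    """Shade a face under a directional light: I = albedo * max(0, n . l).
    A simplified stand-in for spherical-harmonics lighting."""
    shading = np.clip(normals @ light, 0.0, None)   # (H, W) shading map
    return albedo * shading

H = W = 8
albedo = np.full((H, W), 0.6)                        # toy uniform albedo
normals = np.zeros((H, W, 3))
normals[..., 2] = 1.0                                # flat face toward camera

side_light = np.array([0.8, 0.0, 0.6])               # abnormal oblique light
front_light = np.array([0.0, 0.0, 1.0])              # standard frontal white light

dark = render_lambertian(albedo, normals, side_light)      # abnormal rendering
neutral = render_lambertian(albedo, normals, front_light)  # compensated rendering
```

Swapping the estimated lighting for a standard one while keeping albedo and normals fixed is exactly the compensation step; the learned parts of SeLENet estimate the decomposition itself.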
4.7.4 Joint Face Alignment and Super-resolution.
The above FSR methods require all the HR training samples to be aligned. Thus, misalignment between the input LR face image and the training face images often leads to a sharp performance decrease and artifacts. Therefore, a set of joint face alignment and super-resolution methods has been developed. Yu et al. [
49] insert multiple
spatial transformer networks (STN) [
183] into the generator to achieve face alignment, and develop TDN and MTDN [
184]. As LR face images can be noisy and unaligned, Yu et al. build the TDAE method [
185]. TDAE first upsamples and coarsely aligns the LR face images to produce an intermediate HR estimate, then downsamples this estimate to obtain a denoised, aligned LR image, and finally upsamples it for the final reconstruction.
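The up-down-up pipeline of TDAE can be sketched with toy nearest-neighbor resampling operators standing in for the learned autoencoders (all functions and data here are illustrative):

```python
import numpy as np

def upsample(x, s):
    """Nearest-neighbor upsampling (stand-in for a learned decoder)."""
    return np.repeat(np.repeat(x, s, axis=0), s, axis=1)

def downsample(x, s):
    """Strided subsampling (stand-in for a learned encoder)."""
    return x[::s, ::s]

lr = np.random.rand(16, 16)            # noisy, unaligned LR input (toy)
coarse_hr = upsample(lr, 8)            # 1) upsample + coarse alignment -> 128x128
clean_lr = downsample(coarse_hr, 8)    # 2) downsample to suppress noise -> 16x16
final_hr = upsample(clean_lr, 8)       # 3) upsample again for the final result
print(final_hr.shape)                  # (128, 128)
```

In TDAE the three stages are learned transformative autoencoders rather than fixed resamplers; the sketch only shows how the resolutions chain together.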
4.7.5 Joint Face Frontalization and Super-resolution.
Faces in the real world have various poses, and some of them are not frontal. When existing FSR methods are applied to non-frontal faces, the reconstruction performance drops sharply and the visual quality is poor. Artifacts appear even when FSR and face frontalization are performed sequentially in either order. To alleviate this problem, the method in Reference [
186] first takes advantage of an STN and a CNN to coarsely frontalize and hallucinate the faces, and then designs a fine upsampling network to refine face details. Yu et al. [
187] propose a transformative adversarial neural network for joint face frontalization and hallucination. The method builds a transformer network to encode non-frontal and frontal LR face images into the latent space, requires the non-frontal representation to be close to the frontal one, and then imports the encoded latent representations into the upsampling network to recover the final results. Tu et al. [
188] first train a face restoration network and a face frontalization network separately, and then propose a task-integrated training strategy to merge the two networks into a unified network for face frontalization and super-resolution. Note that face alignment aims to generate SR face images with the same pose as the HR ones, while face frontalization recovers frontal SR faces from non-frontal LR faces.
4.8 Related Applications
Besides the above-mentioned FSR methods and joint methods, a large number of new methods related to FSR have emerged in recent years, including face video super-resolution, old photo restoration, audio-guided FSR, 3D FSR, and so on, which are introduced in the following.
4.8.1 Face Video Super-resolution.
Faces usually appear in LR video sequences, such as surveillance footage. The correlation between frames can provide complementary details, which benefits face reconstruction. One direct solution is to fuse multi-frame information and exploit inter-frame dependency [
189]. The approach in Reference [
190] employs a generator to produce the SR result for every frame, and a fusion module to estimate the central frame. Considering that the aforementioned methods cannot model the complex temporal dependency, Xin et al. [
191] propose a motion-adaptive feedback cell that captures inter-frame motion information and updates the current frame adaptively. The method in Reference [
192] assumes that previously super-resolved frames are crucial for the reconstruction of the subsequent frame, and thus designs a recurrence strategy to make better use of inter-frame information. Inspired by the powerful transformer, the work of Reference [
193] develops the first pure transformer-based face video hallucination model. MDVDNet [
194] incorporates multiple priors from the video, including speech, semantic elements, and facial landmarks, to enhance the capability of the deep learning-based method.
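The basic align-then-fuse idea behind these multi-frame methods can be sketched with toy integer motion vectors standing in for learned motion compensation (all names and data are illustrative):

```python
import numpy as np

def fuse_frames(frames, shifts):
    """Warp each frame back to the central frame using its (toy, integer)
    motion vector, then average to estimate the central frame.
    A crude stand-in for learned motion compensation and fusion."""
    aligned = [np.roll(f, (-dy, -dx), axis=(0, 1))
               for f, (dy, dx) in zip(frames, shifts)]
    return np.mean(aligned, axis=0)

center = np.random.rand(32, 32)                       # ground-truth central frame
frames = [np.roll(center, (1, 0), axis=(0, 1)),       # neighbor shifted down
          center,                                     # central frame itself
          np.roll(center, (0, 2), axis=(0, 1))]       # neighbor shifted right
shifts = [(1, 0), (0, 0), (0, 2)]                     # known toy motion
fused = fuse_frames(frames, shifts)                   # recovers the center exactly
```

With perfect motion vectors the fusion recovers the central frame; the learned methods above exist precisely because real motion is sub-pixel, non-rigid, and unknown.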
4.8.2 Old Photo Restoration.
Restoration of old photos is vital and difficult in the real world, since their degradation is too complex to be simulated. Naturally, one solution is to learn the mapping from real LR face images (regarding real old photos as real LR face images) to artificial LR face images, and then apply existing FSR methods to the generated artificial LR face images. BOPBL [
195] proposes to translate images in latent space rather than image space. Specifically, BOPBL first encodes real and artificial LR face images into a shared latent space, encodes HR face images into another latent space, and then maps the former latent space into the latter with a mapping network.
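The latent-space translation idea can be sketched with toy linear encoders, decoder, and mapping network standing in for the learned networks in BOPBL (all matrices and dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_lat = 64, 16    # toy image and latent dimensions

# Toy linear stand-ins for the learned encoders/decoder/mapping network.
E_x = rng.standard_normal((d_lat, d_img)) * 0.1  # degraded photos -> shared latent
E_y = rng.standard_normal((d_lat, d_img)) * 0.1  # clean HR photos -> clean latent
D_y = rng.standard_normal((d_img, d_lat)) * 0.1  # clean latent -> image space
M = rng.standard_normal((d_lat, d_lat)) * 0.1    # mapping: degraded -> clean latent

def restore(old_photo):
    """Translate in latent space: encode, map, decode."""
    return D_y @ (M @ (E_x @ old_photo))

restored = restore(rng.standard_normal(d_img))
print(restored.shape)    # (64,)
```

Because real and artificial degraded photos share one encoder, the mapping network trained on synthetic pairs transfers to real old photos, which is the crux of the method.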
4.8.3 Audio-guided FSR.
Considering that audio carries face-related information [
196], Meishvili et al. [
197] develop the first audio-guided FSR method. Due to the modality gap, they build two encoders to encode the image and audio information separately. Then the encoded representations of the image and the audio are fused, and the fused result is fed into a generator to recover the final SR result. The introduction of audio into FSR is novel and inspires researchers to exploit cross-modal information, but it remains challenging due to the differences between modalities.
4.8.4 3D FSR.
The human face is one of the most studied objects in the field of computer vision. Alongside the development of 2D techniques, a large number of 3D methods have been proposed, because 3D data can provide more useful features for face reconstruction and recognition. In the FSR community, an early 3D FSR approach was proposed by Pan et al. [
198]. Berretti et al. [
199] propose a superface model built from a sequence of low-resolution 3D scans. The approach in Reference [
200] takes only a rough, noisy, low-resolution depth image as input and predicts the corresponding high-quality 3D face mesh. By establishing the correspondence between the input LR face and 3D textures, Qu et al. present patch-based 3D FSR on the mesh [
201]. Benefiting from the development of deep learning, most recently, a 3D face point cloud super-resolution network has been developed to infer high-resolution data from low-resolution 3D face point cloud data [
202].
5 Conclusion and Future Directions
In this review, we have presented a taxonomy of deep learning-based FSR methods. According to facial characteristics, the field can be divided into five categories: general FSR methods, prior-guided FSR methods, attribute-constrained FSR methods, identity-preserving FSR methods, and reference FSR methods. Then, every category is further divided into subcategories depending on the design of the network architecture or the specific utilization of facial characteristics. In particular, general FSR methods are further divided into basic CNN-based methods, GAN-based methods, reinforcement learning-based methods, and ensemble learning-based methods. Besides, the other methods combining facial characteristics are categorized according to the specific utilization pattern of those characteristics. We also compare the performance of state-of-the-art methods and provide in-depth analysis. Of course, the FSR technique is not limited to the methods we presented, and a panoramic view of this fast-expanding field is rather challenging, so omissions are possible. Therefore, this review serves as a pedagogical tool, providing researchers with insights into typical methods of FSR. In practice, researchers could use these general guidelines to develop the most suitable technique for their specific studies.
Despite great breakthroughs, FSR still presents many challenges and is expected to continue its rapid growth. In the following, we briefly provide an outlook on the problems to be solved and the trends to expect in the future.
Design of Network. From the comparison with state-of-the-art general image super-resolution methods, we learn that the backbone network has a crucial impact on performance, especially in terms of PSNR and SSIM. Therefore, we can learn from the general image super-resolution task, in which many well-designed network structures have been continuously proposed (e.g., IPT [
203] and SwinIR [
204]), and design an effective deep network more suitable for the FSR task. In addition to effectiveness, efficiency is also needed in practice, since large models (with massive parameters and high computation costs) are difficult to deploy in real-world applications. Hence, developing models with lighter structures and lower computational cost remains a major challenge.
Exploitation of Facial Prior. As a domain-specific super-resolution technique, FSR recovers the facial details that are lost in the observed LR face images. The key to the success of FSR is to effectively exploit the prior knowledge of human faces, from 1D vectors (identity and attributes), to 2D images (facial landmarks, facial heatmaps, and parsing maps), to 3D models. Therefore, discovering new prior knowledge of the human face, modeling or representing this knowledge, and integrating it organically into an end-to-end training framework are worthy of further study. In addition to such explicit priors, modeling and utilizing the implicit prior learned from data (such as the GAN prior [
58,
106]) may be another direction.
Metrics and Loss Functions. As we know, the pixelwise L1 loss or L2 loss tends to produce super-resolution results with high PSNR and SSIM values, while the perceptual loss and adversarial loss encourage the model to produce visually pleasant results, i.e., good performance in terms of LPIPS and FID. Therefore, the assessment metric plays an important role in guiding model optimization and shaping the final results. If we want a trustworthy result (e.g., in criminal investigation applications), then PSNR and SSIM may be the better metrics. In contrast, if we just want visually pleasant results, then LPIPS and FID may be a good choice. As a result, there is no universal assessment metric that can make the best of both worlds. Therefore, assessment metrics for FSR need more exploration in the future.
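For concreteness, a minimal NumPy sketch of the pixelwise losses and the PSNR metric they favor (the peak value and toy images are illustrative):

```python
import numpy as np

def l1_loss(sr, hr):
    """Mean absolute error between SR result and HR ground truth."""
    return np.abs(sr - hr).mean()

def l2_loss(sr, hr):
    """Mean squared error between SR result and HR ground truth."""
    return ((sr - hr) ** 2).mean()

def psnr(sr, hr, peak=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, peak]."""
    return 10 * np.log10(peak ** 2 / l2_loss(sr, hr))

hr = np.full((8, 8), 0.5)     # toy HR image
sr = hr + 0.1                 # uniform 0.1 error -> MSE = 0.01
print(psnr(sr, hr))           # 20.0 dB
```

Minimizing the L2 loss directly maximizes PSNR, which is why L1/L2-trained models score well on PSNR/SSIM yet can look over-smoothed; LPIPS and FID instead compare deep-feature statistics.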
Discriminative FSR. In most situations, our goal is not only to reconstruct a visually pleasing HR face image; we also hope that the super-resolved results can improve face recognition by humans or computers. Therefore, it would be beneficial to recover a discriminative HR face image (for humans) or discriminative features (for computers) from an LR face image. To enhance the discriminability of super-resolved face images, we can use the weakly-supervised information (paired positive or negative samples) of the training samples to force the model to reconstruct discriminative face images.
Real-world FSR. The degradation process in the real world is too complex to be simulated, which results in a large gap between synthesized LR-HR pairs and real-world data. When models trained on synthesized pairs are applied to real-world LR face images, their performance drops dramatically. Given HR training face images and unpaired real-world LR face images, some methods [
102,
205,
206] have been proposed to learn the real image degradation and create sample pairs of synthesized LR and HR face images. These methods achieve better performance than previous approaches trained with data produced by bicubic degradation. However, they implicitly assume that all real-world LR face images share the same degradation, i.e., are captured by the same camera. In practice, real-world LR face images are very diverse, and their degradation processes differ. Therefore, designing a more robust real-world FSR method is one of the problems that must be settled urgently.
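A minimal sketch of such a synthetic degradation pipeline (box blur, strided downsampling, and Gaussian noise; the kernel size, noise level, and function names are illustrative, and real pipelines randomize these parameters per image to cover diverse degradations):

```python
import numpy as np

def degrade(hr, scale=4, blur=3, noise_std=0.02, rng=None):
    """Synthesize a pseudo-real LR image: box blur -> downsample -> noise."""
    rng = rng or np.random.default_rng(0)
    pad = blur // 2
    padded = np.pad(hr, pad, mode="edge")
    # Box blur via local averaging over a blur x blur window.
    blurred = np.zeros_like(hr)
    for dy in range(blur):
        for dx in range(blur):
            blurred += padded[dy:dy + hr.shape[0], dx:dx + hr.shape[1]]
    blurred /= blur * blur
    lr = blurred[::scale, ::scale]                        # strided downsampling
    return np.clip(lr + rng.normal(0, noise_std, lr.shape), 0, 1)

hr = np.random.default_rng(1).random((64, 64))
lr = degrade(hr)
print(lr.shape)   # (16, 16)
```

Fixing these parameters globally reproduces the single-degradation assumption criticized above; a more robust pipeline would sample them per image or estimate them from the target camera.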
Multi-modal FSR. Due to the rapid development of sensing technology, multiple sensors in the same system, such as autonomous driving systems and robots, are becoming more and more common. The utilization of multi-modal information (including audio, depth, and near infrared) will be increasingly promoted. Evidently, different modalities provide different clues. In this field, researchers have mainly explored image-related information, such as attributes and identity. Nevertheless, the emergence of audio-guided FSR [
197] and hyperspectral FSR [
207] inspires us to take advantage of information from different modalities. This trend will undoubtedly continue and diffuse into every category in this field. The introduction of multi-modal information will also spur the development of FSR.