Article

Large Mask Image Completion with Conditional GAN

Changcheng Shao, Xiaolin Li, Fang Li and Yifan Zhou
College of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430000, China
*
Author to whom correspondence should be addressed.
Symmetry 2022, 14(10), 2148; https://doi.org/10.3390/sym14102148
Submission received: 16 August 2022 / Revised: 24 September 2022 / Accepted: 8 October 2022 / Published: 14 October 2022

Abstract

Recently, learning-based image completion methods have made encouraging progress on square or irregular masks, and generative adversarial networks (GANs) have been able to produce visually realistic and semantically correct results. However, much texture and structure information is lost in the completion process, and if the missing region is too large to provide useful information, the results suffer from ambiguity, residual shadows, and object confusion. In order to complete large mask images, we present a novel model using a conditional GAN, called the coarse-to-fine conditional GAN (CF CGAN). We use a coarse-to-fine generator that is symmetric in structure and a new perceptual loss based on VGG-16. For large mask image completion, our method produces visually realistic and semantically correct results, and the generalization ability of our model is also excellent. We evaluate our model on the CelebA dataset using FID, LPIPS, and SSIM as the metrics. Experiments demonstrate superior performance in terms of both quality and realism in free-form image completion.

1. Introduction

Image completion (also known as inpainting) has been a hot topic in the field of computer vision; it aims to fill missing regions with plausible and meaningful content. It is widely used in image editing [1], image re-targeting [2], and object removal [3]. Image completion originated from the restoration of artwork, returning damaged pieces to their original appearance. Using the original pixel information in the image as a reference, the missing or damaged parts are filled in so that the restored image shows no visible signs of damage to an observer. Image completion not only restores damaged artwork, but also plays an important role in casework by public security authorities, for example, reconstructing the appearance of suspects from face regions that are only partially visible in surveillance footage because of masks, sunglasses, or other occlusions.
The subject was already studied before the deep learning era [4,5,6], and progress has accelerated in recent years through the use of deep and wide neural networks [7,8] and adversarial learning [9,10,11,12]. The traditional strategy for training a completion system is to train on a large, automatically generated dataset created by randomly masking real images. It is common to use a single stage, as in [9,13]. In this work, we achieve state-of-the-art results with a two-stage network.
In this paper, we discuss a new method for image completion based on conditional generative adversarial networks. This method has a wide range of applications; for example, we can use it to remove objects from an image or to change the appearance of existing objects. To complete images with large masks, one can use the LaMa [13] method, a completion network architecture that leverages generative adversarial networks (GANs) [14] in a conditional setting. Recently, Zhao et al. [9] noted that a serious challenge remains: existing algorithms tend to fail when handling large-scale missing regions. They instead use co-modulation to improve the generative capability, but such methods often struggle to generate plausible image structures.
Here, we address two main issues of the above methods: the difficulty of generating plausible image structures with GANs and the lack of details and realistic textures in completion results. We show that with a new, robust coarse-to-fine generator and discriminator architecture, we can complete images with large masks and produce results that are more visually appealing than those of previous methods. We first obtain results by adversarial training alone, without relying on a perceptual loss based on VGG-16. We then show that adding the perceptual loss based on VGG-16 [15] slightly improves the results. Through evaluation, we find that CF CGAN generalizes well and produces plausible image structures. CF CGAN can capture and generate details and realistic textures and is robust to large masks. Our contributions are summarized below:
  • We propose a large mask completion method based on conditional GAN (CF CGAN);
  • We use the traditional generator as a global generator and introduce a local enhanced generator;
  • We introduce a new perceptual loss based on VGG-16. Using pre-trained VGG-16 as a feature extractor can reduce the model parameters and extract rich detailed information on the premise of ensuring the receptive field.
The remaining sections of this paper are organized as follows: Section 2 introduces related work; Section 3 systematically summarizes and introduces our proposed method in detail; Section 4 presents a systematic experimental evaluation; Section 5 summarizes the paper, emphasizing the limitations of this work and discussing future work.

2. Related Work

Early patch-based methods search and copy–paste patches from known regions to gradually fill target holes, while diffusion-based methods describe and solve the color propagation within the holes through partial differential equations. These methods can generate high-quality static textures and complete simple shapes, but they lack mechanisms for modeling high-level semantics and therefore cannot synthesize new semantic structures within holes. In recent years, with the development of deep learning [16,17], researchers worldwide have pursued image completion [18,19] with deep learning. Deep-learning-based completion methods significantly improve the results of image completion by learning semantic information from images in large-scale datasets. These methods usually train a convolutional neural network as an end-to-end mapping from the original image to the completed image.
The pioneering method for semantic completion is the context encoder (CE) proposed by Pathak et al. [7], widely regarded as the first deep learning method for image completion; it adopts a generative adversarial network with an encoder–decoder architecture as the main body of the generator. This completion algorithm is fast and can reasonably complete the structural information of the image. However, images completed by this method often lack fine texture detail, resulting in visual artifacts or blurring. Yang et al. [20] proposed using a convolutional neural network for damaged image completion, a new completion approach that jointly optimizes the content information of the image and its texture information. The algorithm can generate realistic and reasonable images in terms of content and texture, especially for high-resolution images; however, its hardware requirements are relatively high.
CGAN-based image completion methods have made a series of notable advances. Isola et al. [10] designed and proposed a general framework for the application of CGAN methods, namely PIX2PIX. Dolhansky et al. [21] generated high-quality restored images using exemplar information hidden in reference images based on a GAN. Liao et al. [22] proposed a collaborative generative adversarial network based on CGAN, which consists of three branch networks: a face feature point generation network, an image inpainting network, and an image segmentation network. The framework inductively improves the main generation task by embedding additional information into the other tasks. Conditional generative adversarial network frameworks have thus been proven feasible for image completion.
Many losses have been proposed for training completion networks. Pixelwise and adversarial losses are typically used. Some methods apply a spatial discount weighting strategy to the pixel loss [7,23]. Simple convolutional discriminators or PatchGAN discriminators have been used to implement the adversarial loss [18]. Other popular choices are the Wasserstein adversarial loss with gradient-penalized discriminators and spectral normalization discriminators. Following previous work [13], we used the $R_1$ gradient penalty [24] with the PatchGAN discriminator in our method. Completion frameworks usually also contain a feature matching loss and a perceptual loss; we used a perceptual loss based on VGG-16.
Recently, researchers have turned their attention to more challenging settings, where the most representative problem is large hole filling. Suvorov et al. [13] suggested that the main reason for the difficulty is the lack of an effective receptive field in both the completion network and the loss function, and proposed a method with fast Fourier convolutions [25]. However, the method often fails to generate plausible image structures, and its completion results lack details and realistic textures. In this paper, we propose a novel method with conditional GAN called CF CGAN to simultaneously achieve high-quality pluralistic generation and large hole filling. CF CGAN can capture and generate details and realistic textures and is robust to large masks.

3. Method

Image-conditional GANs solve the problem of transforming a conditional input x, in the form of an image, into an output image $\hat{x}$. We assume that pairwise correspondences between input conditions and output images are available in the training data. The generator takes an image x and a latent vector z as input and produces an output $\hat{x}$, and the discriminator takes a pair (x, $\hat{x}$) as input and tries to distinguish fake generated pairs from the real distribution. Image completion can be viewed as a constrained image-conditional generation problem, where known pixels are constrained to be invariant.

3.1. Overall Architecture

Our proposed image completion framework consists of a generator and a discriminator. We take a binary mask of unknown pixels m and an input image x, combined as $x' = \operatorname{stack}(x \odot m, m)$. We use a feed-forward completion network G, which we also refer to as the generator. The completion network processes the input $x'$ and returns a completed three-channel color image $\hat{x}$. The discriminator then processes the input image x and the output image $\hat{x}$ and classifies each patch of $\hat{x}$ as real or fake. The architecture is given in Figure 1.
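As an illustration, a minimal PyTorch sketch of this input construction is given below. The helper names and the convention that the mask equals 1 on unknown pixels (so the hole is zeroed out before stacking) are our assumptions rather than the paper's released code.

```python
import torch

def make_generator_input(x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """Stack the masked image and the mask along the channel axis.

    x: (B, 3, H, W) images in [0, 1]
    m: (B, 1, H, W) binary mask, assumed to be 1 on unknown pixels
    Returns the 4-channel tensor fed to the completion network G.
    """
    masked = x * (1.0 - m)                 # zero out the unknown region
    return torch.cat([masked, m], dim=1)

def composite(x: torch.Tensor, x_hat: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """Keep the known pixels unchanged and use generated content only in the hole."""
    return x * (1.0 - m) + x_hat * m

# usage sketch: x_hat = G(make_generator_input(x, m)); result = composite(x, x_hat, m)
```

Compositing the output with the known pixels enforces the constraint above that known pixels remain invariant.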

3.2. Generator

We decompose the generator into two sub-networks, $G_1$ and $G_2$. We use $G_1$ as the global generator network and $G_2$ as the local enhancer network. The global generator network runs at a resolution of 128 × 128, and the local enhancer network runs at a resolution of 256 × 256. The generator is denoted as ($G_1$, $G_2$), as visualized in Figure 2.

3.2.1. Global Generator

Our global generator $G_1$ is built on the architecture proposed by Suvorov et al. [13], which has been proven successful for image completion with periodic structures. It uses a ResNet-like architecture with 3 downsampling blocks, 9 residual blocks, and 3 upsampling blocks. The residual blocks use FFC [25]. The global generator runs on low-resolution images, which greatly reduces the computation while capturing the overall structure and low-frequency semantic information of the image.
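The skeleton of such a ResNet-like generator is sketched below. For brevity, the FFC residual blocks [25] are replaced by ordinary convolutional residual blocks, and the channel widths are illustrative assumptions.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block; the paper uses FFC-based residual blocks [25],
    replaced here by ordinary convolutions to keep the sketch short."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)

def build_global_generator(in_ch=4, base=64, n_down=3, n_res=9):
    """ResNet-like global generator G1: 3 downsampling blocks, 9 residual blocks,
    3 upsampling blocks; runs on the 128 x 128 input."""
    layers = [nn.Conv2d(in_ch, base, 7, padding=3), nn.ReLU(inplace=True)]
    ch = base
    for _ in range(n_down):                                  # downsampling blocks
        layers += [nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        ch *= 2
    layers += [ResBlock(ch) for _ in range(n_res)]           # residual blocks
    for _ in range(n_down):                                  # upsampling blocks
        layers += [nn.ConvTranspose2d(ch, ch // 2, 3, stride=2, padding=1, output_padding=1),
                   nn.ReLU(inplace=True)]
        ch //= 2
    layers += [nn.Conv2d(ch, 3, 7, padding=3), nn.Sigmoid()] # 3-channel output
    return nn.Sequential(*layers)
```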

3.2.2. Local Enhancer Generator

The local enhancer generator consists of three components: a convolutional front-end $G_2^{f}$, three residual blocks $G_2^{r}$, and a transposed convolutional back-end $G_2^{b}$. Different from the global generator network, we use the elementwise sum of two feature maps as the input to the residual blocks $G_2^{r}$: the output feature map of $G_2^{f}$ and the last feature map of the back-end of the global generator network, $G_1^{b}$. In current image completion methods, images are often down-sampled to reduce computation, which loses much high-frequency information and strongly affects completion quality. Our local enhancer generator, which operates at a resolution of 256 × 256, captures the high-level semantic information of the image well while containing only three residual blocks.
During training, we train both the global generator and the local enhancer generator, and then jointly fine-tune all the networks together. We use this generator design to efficiently aggregate global and local information for large mask image completion tasks. We note that such a joint multi-resolution design is a well-established practice in computer vision [26].
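A compact sketch of the local enhancer is given below, reusing the ResBlock from the previous sketch. The channel widths and the exact point at which the global generator's feature map is taken are assumptions chosen so that the shapes match, not the paper's exact configuration.

```python
import torch.nn as nn

class LocalEnhancer(nn.Module):
    """Local enhancer G2: convolutional front-end, elementwise sum with the last
    feature map of the global generator's back-end, three residual blocks, and a
    transposed-convolution back-end. Reuses ResBlock from the previous sketch."""
    def __init__(self, in_ch=4, base=32):
        super().__init__()
        self.front = nn.Sequential(                          # G2^f, runs at 256 x 256
            nn.Conv2d(in_ch, base, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.res = nn.Sequential(*[ResBlock(base * 2) for _ in range(3)])  # G2^r
        self.back = nn.Sequential(                           # G2^b
            nn.ConvTranspose2d(base * 2, base, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(base, 3, 7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x_fullres, global_feat):
        # global_feat: assumed to be the last 64-channel, 128 x 128 feature map of
        # G1's back-end (taken before G1's output convolution), matching front's output
        h = self.front(x_fullres) + global_feat
        return self.back(self.res(h))
```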

3.3. Discriminator

As shown in Figure 3, the PatchGAN discriminator [10] outputs a matrix, and each element of the matrix represents a local area of the input image: if the local area is judged real, the element is 1, and otherwise 0.
PatchGAN was first used in the field of image style transfer. Compared with the discriminator in a typical GAN, the output of PatchGAN is not a scalar but a two-dimensional matrix. A standard GAN discriminator judges whether an image is synthetic or real by considering the image globally, so it may ignore local details. Because each value in the PatchGAN output matrix judges a local area of the image, local texture details can be attended to and enhanced.
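A sketch of a PatchGAN discriminator in the style of pix2pix [10] follows; the number of layers and the channel widths are assumptions.

```python
import torch.nn as nn

def build_patchgan_discriminator(in_ch=3, base=64, n_layers=3):
    """PatchGAN discriminator: outputs a 2D map of per-patch real/fake logits
    instead of a single scalar. For a conditional setup, the condition image can
    be concatenated to the input, increasing in_ch accordingly."""
    layers = [nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True)]
    ch = base
    for _ in range(n_layers - 1):
        layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                   nn.BatchNorm2d(ch * 2), nn.LeakyReLU(0.2, inplace=True)]
        ch *= 2
    layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]     # per-patch logits
    return nn.Sequential(*layers)
```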

3.4. Loss Function

The completion problem is inherently ambiguous: there can be many plausible fillings for the same missing region, especially when the missing region becomes wider. We discuss the components of the proposed loss, which together allow dealing with the complexity of the problem. We denote our generator G by $f_\theta(\cdot)$ and our discriminator D by $D_\xi(\cdot)$. To improve the quality and realism of the generation, we adopted the non-saturating adversarial loss rather than a pixelwise MAE or MSE loss, which usually leads to averaged, blurry results. We also used the $R_1 = \mathbb{E}_x\left[ \left\| \nabla D_\xi(x) \right\|^2 \right]$ gradient penalty [24]. In addition, we adopted a perceptual loss.

3.4.1. Adversarial Loss

To ensure that the generator $f_\theta(\cdot)$ generates natural-looking local details, we used an adversarial loss. Our discriminator $D_\xi(\cdot)$ works on a local patch level, discriminating between “real” and “fake” patches. Only patches that intersect with the masked area receive the “fake” label. Finally, we used the non-saturating adversarial loss:
$$L_G = -\mathbb{E}_{x,m}\left[ \log D_\xi(\hat{x}) \right]$$
$$L_D = -\mathbb{E}_{x}\left[ \log D_\xi(x) \right] - \mathbb{E}_{x,m}\left[ \log D_\xi(\hat{x}) \right] \odot m - \mathbb{E}_{x,m}\left[ \log\left( 1 - D_\xi(\hat{x}) \right) \right] \odot (1 - m)$$
$$L_{Adv} = \mathrm{sg}_\theta(L_D) + \mathrm{sg}_\xi(L_G) \rightarrow \min_{\theta, \xi}$$
where x is a sample from the dataset and m is the binary mask of unknown pixels. $\hat{x} = f_\theta(x')$ is the completion result for $x' = \operatorname{stack}(x \odot m, m)$. $\mathbb{E}_*$ denotes the expected value over the corresponding distribution. $D_\xi(\hat{x})$ is the discriminator output for the completion result $\hat{x}$, and $D_\xi(x)$ is the discriminator output for the original sample x. $\mathrm{sg}_{var}$ stops gradients with respect to $var$, so that $L_D$ and $L_G$ can be driven toward a Nash equilibrium, and $L_{Adv}$ is the joint loss to optimize.
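The sketch below implements these losses in the numerically stable softplus form, using $-\log \sigma(z) = \operatorname{softplus}(-z)$ for raw logits; mapping the pixel mask onto the patch grid by nearest-neighbor interpolation is our assumption.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(d_real_logits, d_fake_logits, mask):
    """Non-saturating adversarial losses for a PatchGAN discriminator.

    d_real_logits, d_fake_logits: raw discriminator outputs of shape (B, 1, N, N)
    mask: (B, 1, H, W) binary mask, assumed 1 on unknown pixels
    """
    # map the pixel mask onto the patch grid (nearest-neighbor is our assumption)
    m = F.interpolate(mask, size=d_fake_logits.shape[-2:], mode="nearest")

    # generator: -log D(x_hat)  ->  softplus(-logits)
    loss_g = F.softplus(-d_fake_logits).mean()

    # discriminator: real images are "real"; generated images are labeled "fake"
    # only where patches overlap the hole, and "real" elsewhere
    loss_d = F.softplus(-d_real_logits).mean() \
        + (m * F.softplus(d_fake_logits)
           + (1.0 - m) * F.softplus(-d_fake_logits)).mean()
    return loss_g, loss_d
```

In practice, the fake logits are computed from a detached $\hat{x}$ when updating the discriminator and kept attached when updating the generator, which plays the role of the stop-gradient operators $\mathrm{sg}_\theta$ and $\mathrm{sg}_\xi$ above.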

3.4.2. Perceptual Loss

Image completion tasks can be regarded as image transformation tasks, where one input image is converted into another image output. The image transformation tasks solved by the existing methods often train a forward propagation network in the way of supervised training, which utilizes the error between image pixel levels. This method is very effective at test time because only one forward pass is required. However, the pixel-level error does not capture the perceptual difference between the output and the ground-truth image.
Recent studies [27,28] have found that the features extracted by a pre-trained VGG-16 network can be used as a loss metric in image synthesis tasks to measure the perceptual similarity of images by comparing the differences between features. This retains more structural information of the scene and improves the visual quality of image completion.
In this paper, a perceptual loss function is introduced to make the completed image closer to the ground-truth image in terms of visual effect. Compared with other networks, VGG-16 has deeper layers and smaller convolution kernels. Using the pre-trained VGG-16 $\psi(\cdot)$ as a feature extractor can reduce the model parameters and extract rich detailed information while ensuring a large receptive field. In the pre-trained VGG-16 structure, we replaced max pooling with average pooling and removed the three fully connected layers. The resulting network has thirteen layers with parameters, all of which are convolutional layers, and each convolutional layer is followed by a ReLU activation layer. Since the perceptual loss does not require a network trained on paired images, it reduces the number of parameters during network training. We introduce the rich detailed information perceptual loss (RDPL), which uses the VGG-16 model $\psi(\cdot)$:
$$L_{RDPL}(x, \hat{x}) = \mathcal{M}\left( \left[ \psi(x) - \psi(\hat{x}) \right]^2 \right)$$
where $[\cdot - \cdot]^2$ is an elementwise operation and $\mathcal{M}$ is the interlayer mean of intralayer means.
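A sketch of this loss in PyTorch follows: a frozen VGG-16 with average pooling in place of max pooling and the classifier head unused; the particular ReLU layers chosen for feature comparison are our assumption.

```python
import torch
import torch.nn as nn
import torchvision

class RDPL(nn.Module):
    """Rich detailed information perceptual loss: squared feature differences from a
    frozen VGG-16, averaged within each selected layer and then across layers."""
    def __init__(self, layer_ids=(3, 8, 15, 22, 29)):   # ReLU layer indices; our assumption
        super().__init__()
        features = torchvision.models.vgg16(pretrained=True).features.eval()
        # replace max pooling with average pooling, as described above;
        # the fully connected classifier head of VGG-16 is not used at all
        self.vgg = nn.Sequential(*[
            nn.AvgPool2d(kernel_size=2, stride=2) if isinstance(m, nn.MaxPool2d) else m
            for m in features
        ])
        self.layer_ids = set(layer_ids)
        for p in self.parameters():
            p.requires_grad_(False)

    def _feats(self, x):
        out, h = [], x
        for i, layer in enumerate(self.vgg):
            h = layer(h)
            if i in self.layer_ids:
                out.append(h)
        return out

    def forward(self, x, x_hat):
        per_layer = [((fx - fy) ** 2).mean()                 # intralayer mean
                     for fx, fy in zip(self._feats(x), self._feats(x_hat))]
        return torch.stack(per_layer).mean()                 # interlayer mean
```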

3.4.3. Overall Loss

Our overall loss combines the GAN loss, the perceptual loss, and the gradient penalty $R_1 = \mathbb{E}_x\left[ \left\| \nabla D_\xi(x) \right\|^2 \right]$:
$$L_{overall} = \alpha L_{RDPL} + \beta L_{Adv} + \lambda R_1$$
where $\alpha$, $\beta$, and $\lambda$ control the weight of each term.
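A minimal sketch of the $R_1$ penalty and the weighted sum (the helper names are ours; the default weights are those reported in Section 4.3):

```python
import torch

def r1_penalty(d_real_logits: torch.Tensor, x_real: torch.Tensor) -> torch.Tensor:
    """R1 regularizer [24]: expected squared gradient norm of the discriminator on
    real images; x_real must have requires_grad=True before the forward pass."""
    grad, = torch.autograd.grad(d_real_logits.sum(), x_real, create_graph=True)
    return grad.pow(2).flatten(start_dim=1).sum(dim=1).mean()

def overall_loss(loss_rdpl, loss_adv, r1, alpha=10.0, beta=30.0, lam=0.001):
    """Weighted sum of the perceptual, adversarial, and R1 terms."""
    return alpha * loss_rdpl + beta * loss_adv + lam * r1
```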

4. Experiments

In this section, we demonstrate that our method outperforms strong baselines at a standard low resolution, and the difference is even more pronounced when completing wider holes.

4.1. Dataset

For ease of evaluation, we used the CelebA-HQ dataset [29], a large-scale facial attribute dataset derived from CelebA; it contains 30,000 celebrity face images at 1024 × 1024 resolution. We used CelebA-HQ at a resolution of 256 × 256 to reduce the amount of computation. The images in this dataset cover large pose variations and background clutter. The feature distribution of the dataset is approximately symmetric, which makes it well suited as the training dataset for evaluating our method.

4.2. Evaluation Indicators

We used the learned perceptual image patch similarity (LPIPS), Fréchet inception distance (FID) [27], and structural similarity (SSIM) [30] metrics. Compared to the pixel-level $L_1$ and $L_2$ distances, LPIPS, FID, and SSIM are more suitable for measuring the performance of large mask completion when multiple natural completions are plausible.
The learned perceptual image patch similarity (LPIPS), also known as “perceptual loss”, measures the difference between two images. It compares deep network features of the two images rather than raw pixels and therefore prioritizes perceptual similarity, which is more in line with human perception than traditional pixel-level metrics. The lower the LPIPS value, the more similar the two images are; conversely, a larger value indicates a greater difference. LPIPS is calculated as follows:
$$d(x, \hat{x}) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{y}^{l}_{hw} - \hat{y}^{l}_{0hw} \right) \right\|_2^2$$
where x is the real image, $\hat{x}$ is the generated image, and d is the distance between them. To compute LPIPS, x and $\hat{x}$ are passed through a VGG network for feature extraction, and the activations of each layer l are normalized, denoted $\hat{y}^{l}, \hat{y}^{l}_{0} \in \mathbb{R}^{H_l \times W_l \times C_l}$. The features are scaled channel-wise by the weights $w_l$, the $L_2$ distance is computed, and the result is averaged spatially and accumulated over layers to obtain the distance.
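In practice, LPIPS is usually computed with the authors' lpips package rather than re-implemented; a usage sketch with a VGG backbone, assuming images stored in [0, 1]:

```python
import lpips    # assumed dependency: the "lpips" package (pip install lpips)
import torch

lpips_fn = lpips.LPIPS(net='vgg')   # VGG backbone, matching the description above

def lpips_distance(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    # the package expects inputs in [-1, 1]; images stored in [0, 1] are rescaled
    with torch.no_grad():
        return lpips_fn(x * 2 - 1, x_hat * 2 - 1).mean().item()
```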
The FID is a measure of the similarity between two sets of images. It has been shown to correlate well with human judgments of visual quality and is most commonly used to evaluate the quality of generative adversarial network samples. Lower scores correspond to higher-quality images: smaller values indicate that the features of the two sets of images are more similar. The FID is computed as the Fréchet distance between two Gaussians fit to the feature representations of the Inception network:
$$\mathrm{FID}(x, \hat{x}) = \left\| \mu_x - \mu_{\hat{x}} \right\|_2^2 + \operatorname{Tr}\left( A_x + A_{\hat{x}} - 2\left( A_x A_{\hat{x}} \right)^{1/2} \right)$$
where $\operatorname{Tr}$ is the trace of a matrix (the sum of the elements of its main diagonal), $\mu$ is the feature mean, A is the feature covariance matrix, x denotes the real images, and $\hat{x}$ denotes the generated images.
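Given the feature means and covariances, the Fréchet distance itself can be computed directly; a sketch using SciPy's matrix square root (extraction of the Inception features is omitted):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu_x, sigma_x, mu_y, sigma_y):
    """Fréchet distance between two Gaussians fit to Inception features.

    mu_*: (D,) feature means; sigma_*: (D, D) feature covariance matrices.
    """
    diff = mu_x - mu_y
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_y, disp=False)   # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real          # drop tiny imaginary parts from numerical noise
    return float(diff @ diff + np.trace(sigma_x + sigma_y - 2.0 * covmean))
```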
SSIM is an index used to measure the similarity of two digital images. SSIM treats image degradation as a perceptual change in structural information, while also incorporating important perceptual phenomena such as luminance masking and contrast masking. SSIM is suitable as a verification metric for image completion. The SSIM calculation formula is as follows:
$$\mathrm{SSIM}(x, \hat{x}) = \frac{\left( 2\mu_x \mu_{\hat{x}} + c_1 \right)\left( 2\sigma_{x\hat{x}} + c_2 \right)}{\left( \mu_x^2 + \mu_{\hat{x}}^2 + c_1 \right)\left( \sigma_x^2 + \sigma_{\hat{x}}^2 + c_2 \right)}$$
where $\mu_x$ and $\mu_{\hat{x}}$ are the means of images x and $\hat{x}$, respectively, $\sigma_x^2$ and $\sigma_{\hat{x}}^2$ are their variances, $\sigma_{x\hat{x}}$ is the covariance between x and $\hat{x}$, and $c_1$ and $c_2$ are small constants that stabilize the division.
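The formula can be evaluated globally over a pair of grayscale images as sketched below; practical SSIM implementations (e.g., in scikit-image) compute it over local windows and average the result.

```python
import numpy as np

def ssim_global(x: np.ndarray, x_hat: np.ndarray,
                c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2) -> float:
    """SSIM evaluated globally over two grayscale images with 8-bit dynamic range."""
    x, x_hat = x.astype(np.float64), x_hat.astype(np.float64)
    mu_x, mu_y = x.mean(), x_hat.mean()
    var_x, var_y = x.var(), x_hat.var()
    cov_xy = ((x - mu_x) * (x_hat - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```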

4.3. Implementation Details

We conducted experiments on the CelebA-HQ dataset at a 256 × 256 resolution. For CelebA-HQ, the training and validation splits contained 26,000 and 2000 images, respectively. Experiments were run on Ubuntu 18.0 with PyTorch 1.7, CUDA 11.3, and cuDNN 8.0 on a GeForce RTX 3090 GPU. We used the Adam optimizer with fixed learning rates of 0.001 and 0.0001 for the generator and discriminator networks, respectively, and a batch size of 16. We set the weight values $\alpha = 10$, $\beta = 30$, and $\lambda = 0.001$. The overall loss included the GAN loss, the perceptual loss, and $R_1$.
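For concreteness, the optimizer setup corresponding to these settings might look as follows, with build_global_generator and build_patchgan_discriminator referring to the sketches in Section 3 (the full two-stage generator is omitted for brevity):

```python
import torch

# Networks from the Section 3 sketches (the full two-stage generator is omitted here).
generator = build_global_generator()
discriminator = build_patchgan_discriminator()

# Adam with fixed learning rates of 0.001 (generator) and 0.0001 (discriminator).
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

ALPHA, BETA, LAMBDA = 10.0, 30.0, 0.001   # loss weights alpha, beta, lambda
BATCH_SIZE, EPOCHS = 16, 40
```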
We trained our model for 40 epochs. During training, we recorded the training loss every 250 steps. After each epoch, validation was performed on the validation set, and the FID, LPIPS, and SSIM were calculated. We visualize the data in Figure 4 and Figure 5. In the early stage of training, the loss curve decreased rapidly, and the FID, LPIPS, and SSIM indicators also improved rapidly. As training entered the middle and late stages, the loss function began to converge and the loss curve flattened; the FID and LPIPS values oscillated around low points, while SSIM oscillated around high points. When plotting the trade-off between FID and LPIPS, we scaled LPIPS up by a factor of 10, since the LPIPS indicator lies in the 0.09–0.30 range, as shown in Figure 5d.

4.4. Comparisons to the Baselines

We selected classic and representative deep-learning-based image completion methods as baselines: CoModGAN [9] and LaMa-Fourier [13]. The comparison with these two strong baselines is presented in Table 1. CF CGAN (ours) consistently outperformed both baselines. The qualitative results correlate well with the quantitative evaluation and demonstrate that the completions produced by our method are preferable to those of the other methods.
The results produced by our method are presented in Figure 6. We standardized the way the images were masked by using square or rectangular masks. We then used FaceNet [31] to calculate the cosine distance between the original faces and the reconstructions. FaceNet is a method that directly embeds face images into a Euclidean space, where distance corresponds directly to a measure of face similarity. A distance of 0.0 means the faces are identical, while 4.0 corresponds to two opposite, different identities; a threshold of 1.1 correctly classifies each pair. Table 2 and Figure 7 show the distances between the original and reconstructed faces. The distances are all below the 1.1 threshold under different mask sizes, which shows that our model's completions work well.
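A sketch of this evaluation step, assuming the facenet-pytorch package provides the pre-trained FaceNet embedder; the preprocessing details are assumptions. The sketch measures the squared Euclidean distance between L2-normalized embeddings, which lies in [0, 4] and is monotonically related to the cosine distance.

```python
import torch
import torch.nn.functional as F
from facenet_pytorch import InceptionResnetV1  # assumed dependency: facenet-pytorch

embedder = InceptionResnetV1(pretrained='vggface2').eval()

def face_distance(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    """Distance between embeddings of original and completed face crops.

    x, x_hat: (B, 3, 160, 160) aligned face crops preprocessed as the embedding
    network expects (an assumption). Embeddings are L2-normalized, so the squared
    Euclidean distance lies in [0, 4]; the paper uses a 1.1 threshold for identity.
    """
    with torch.no_grad():
        e1 = F.normalize(embedder(x), dim=1)
        e2 = F.normalize(embedder(x_hat), dim=1)
    return (e1 - e2).pow(2).sum(dim=1).mean().item()
```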

4.5. Ablation Study

We performed a set of ablation experiments to show the importance of each component of our model. All ablation models were trained and evaluated on the CelebA dataset. We started from a baseline with a simple conditional generative adversarial network. The ablation results are shown in Table 3.
Local enhancer generator: We started with a simple conditional generative adversarial network baseline and compared it against the baseline with the local enhancer generator. From the results, adding the local enhancer generator improved the numerical metrics.
Perceptual loss based on VGG-16: We verified that the high receptive field of the perceptual loss based on VGG-16 indeed improved the quality of the completion results.
PatchGAN discriminator: We used the PatchGAN discriminator and the global discriminator, respectively, in the baseline to verify the effectiveness of the PatchGAN discriminator. The global discriminator, also known as the general discriminator, judges the authenticity of the entire image and returns a single scalar (0 or 1). From the results, the PatchGAN discriminator is more effective than the global discriminator because it focuses on local areas of the image.

5. Conclusions

In this paper, an image completion method, CF CGAN, based on conditional generative adversarial networks was proposed, including a coarse-to-fine generator with structural symmetry and a new perceptual loss based on VGG-16. The generator network, trained with the generative adversarial loss, ensures the consistency of the overall semantics of the completed image. The perceptual loss based on VGG-16 extracts rich detailed information while ensuring a large receptive field. The coarse-to-fine completion method expands the effective receptive field of the completion network, and the completion quality remains strong even in the case of large holes.
However, our current work also has some limitations. At present, we can only process images with a resolution of 256 × 256; when processing high-resolution images, the results are often unsatisfactory. When our model encounters face images outside the dataset, completion often fails.
Although this paper used conditional generative adversarial networks to achieve good results in face image completion, there are still shortcomings that need further research. Generative adversarial network training is a dynamic game process: when either the generator or the discriminator becomes too strong, vanishing gradients can easily occur. In addition, the method in this paper was only applied to face image completion, and its performance on other types of datasets is unknown and needs further research.

Author Contributions

Conceptualization, C.S. and F.L.; methodology, C.S.; software, C.S.; validation, Y.Z. and F.L.; formal analysis, F.L.; investigation, X.L.; resources, X.L.; data curation, Y.Z.; writing—original draft preparation, C.S.; writing—review and editing, X.L.; visualization, F.L.; supervision, F.L.; project administration, X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62171327, and in part by the first batch of application basic technology and science research foundation in Hubei Nuclear Power Operation Engineering Technology Research Center under Grant B210610.

Data Availability Statement

The dataset is available at: https://drive.google.com/file/d/1p-9I5cFGYG5N3x9TZtbov98joQgJhpsk/view?usp=sharing, accessed on 7 October 2022. The code is available at: https://github.com/ccshaoshao/CF_CGAN, accessed on 7 October 2022.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liang, L.; Jin, L.; Xu, Y. Adaptive GNN for image analysis and editing. Adv. Neural Inf. Process. Syst. 2019, 32, 3638–3649. [Google Scholar]
  2. Absetan, A.; Fathi, A. Integration of Deep Learned and Handcrafted Features for Image Retargeting Quality Assessment. Cybern. Syst. 2022, 1–24. [Google Scholar] [CrossRef]
  3. Jiang, Q.; Peng, Z.; Shao, F.; Gu, K.; Zhang, Y.; Zhang, W.; Lin, W. Stereoars: Quality evaluation for stereoscopic image retargeting with binocular inconsistency detection. IEEE Trans. Broadcast. 2021, 68, 43–57. [Google Scholar] [CrossRef]
  4. Hays, J.; Efros, A.A. Scene completion using millions of photographs. ACM Trans. Graph. (ToG) 2007, 26, 4-es. [Google Scholar] [CrossRef]
  5. Criminisi, A.; Pérez, P.; Toyama, K. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 2004, 13, 1200–1212. [Google Scholar] [CrossRef] [PubMed]
  6. Liao, L.; Hu, R.; Xiao, J.; Wang, Z. Edge-aware context encoder for image inpainting. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 3156–3160. [Google Scholar]
  7. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544. [Google Scholar]
  8. Bertalmio, M.; Sapiro, G.; Caselles, V.; Ballester, C. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 23–28 July 2000; pp. 417–424. [Google Scholar]
  9. Zhao, S.; Cui, J.; Sheng, Y.; Dong, Y.; Liang, X.; Chang, E.I.; Xu, Y. Large scale image completion via co-modulated generative adversarial networks. arXiv 2021, arXiv:2103.10428. [Google Scholar]
  10. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  11. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5505–5514. [Google Scholar]
  12. Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8798–8807. [Google Scholar]
  13. Suvorov, R.; Logacheva, E.; Mashikhin, A.; Remizova, A.; Ashukha, A.; Silvestrov, A.; Kong, N.; Goka, H.; Park, K.; Lempitsky, V. Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 2149–2159. [Google Scholar]
  14. Sun, J.; Bhattarai, B.; Chen, Z.; Kim, T.K. SeCGAN: Parallel Conditional Generative Adversarial Networks for Face Editing via Semantic Consistency. arXiv 2021, arXiv:2111.09298. [Google Scholar]
  15. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  16. Walia, S.; Kumar, K.; Agarwal, S.; Kim, H. Using XAI for Deep Learning-Based Image Manipulation Detection with Shapley Additive Explanation. Symmetry 2022, 14, 1611. [Google Scholar] [CrossRef]
  17. Umair, M.; Hashmani, M.A.; Hussain Rizvi, S.S.; Taib, H.; Abdullah, M.N.; Memon, M.M. A Novel Deep Learning Model for Sea State Classification Using Visual-Range Sea Images. Symmetry 2022, 14, 1487. [Google Scholar] [CrossRef]
  18. Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph. (ToG) 2017, 36, 1–14. [Google Scholar] [CrossRef]
  19. Zheng, C.; Cham, T.J.; Cai, J. Pluralistic free-form image completion. Int. J. Comput. Vis. 2021, 129, 2786–2805. [Google Scholar] [CrossRef]
  20. Yang, C.; Lu, X.; Lin, Z.; Shechtman, E.; Wang, O.; Li, H. High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6721–6729. [Google Scholar]
  21. Dolhansky, B.; Ferrer, C.C. Eye in-painting with exemplar generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7902–7911. [Google Scholar]
  22. Liao, H.; Funka-Lea, G.; Zheng, Y.; Luo, J.; Kevin Zhou, S. Face completion with semantic knowledge and collaborative adversarial learning. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 382–397. [Google Scholar]
  23. Yeh, R.A.; Chen, C.; Yian Lim, T.; Schwing, A.G.; Hasegawa-Johnson, M.; Do, M.N. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5485–5493. [Google Scholar]
  24. Mescheder, L.; Geiger, A.; Nowozin, S. Which training methods for GANs do actually converge? In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 3481–3490. [Google Scholar]
  25. Chi, L.; Jiang, B.; Mu, Y. Fast fourier convolution. Adv. Neural Inf. Process. Syst. 2020, 33, 4479–4488. [Google Scholar]
  26. Burt, P.; Adelson, E.H. The Laplacian pyramid as a compact image code. IEEE Trans. Commun. 1983, 31, 532–540. [Google Scholar] [CrossRef]
  27. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar]
  28. Gatys, L.A.; Ecker, A.S.; Bethge, M. A neural algorithm of artistic style. arXiv 2015, arXiv:1508.06576. [Google Scholar] [CrossRef]
  29. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. arXiv 2017, arXiv:1710.10196. [Google Scholar]
  30. Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar]
  31. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
Figure 1. The overall architecture of our framework. The framework consists of a generator and a discriminator.
Figure 2. The scheme of our coarse-to-fine generator. The coarse-to-fine generator is based on two ResNet-like completion networks. The generator is symmetric in structure. The generator uses a multi-component loss that combines adversarial loss and a high receptive field perceptual loss based on VGG-16.
Figure 3. The scheme of the PatchGAN discriminator.
Figure 4. Training loss value plot.
Figure 5. (a) FID value plot, (b) LPIPS plot, (c) SSIM plot, and (d) trade-off curve between FID and LPIPS.
Figure 6. The results of our method. The first line in the figure is the mask, and the white rectangle or square represents the invisible part of the face. The bottom six lines in the figure are the completion results of our method for different faces.
Figure 7. Face similarity based on FaceNet with different sizes of masks.
Table 1. Quantitative evaluation of completion on the CelebA-HQ dataset (256 × 256). The first three metric columns are computed on the 40–50% masked samples; the last three are computed on all samples.

| Method | FID ↓ | LPIPS ↓ | SSIM ↑ | FID ↓ | LPIPS ↓ | SSIM ↑ |
| CoModGAN [9] | 35.9 ▲88% | 0.139 ▼21% | 0.802 ▲3% | 16.8 ▲250% | 0.079 ▼24% | 0.853 ▲5% |
| LaMa-Fourier [13] | 22.8 ▲19% | 0.181 ▲2% | 0.843 ▼2% | 7.5 ▲56% | 0.108 ▲2% | 0.865 ▲4% |
| CF CGAN (ours) | 19.1 | 0.177 | 0.826 | 4.8 | 0.105 | 0.897 |

▲ denotes deterioration and ▼ denotes improvement of the score relative to our CF CGAN model (last row). Note that the “40–50% masked” columns contain metrics on the most difficult samples from the test set: samples with more than 40% of the image covered by the mask.
Table 2. Face similarity based on FaceNet.

| Mask size | 5% Masked | 10% Masked | 15% Masked | 20% Masked | 30% Masked | 40% Masked |
| Distance | 0.417 | 0.765 | 0.710 | 0.895 | 1.028 | 1.093 |
Table 3. Ablation study of our model design including the local enhancer generator (LEG), perceptual loss based on VGG-16 (PL), and PatchGAN discriminator (PD). We report the FID and LPIPS scores.

| Ablations | Methods | FID ↓ | LPIPS ↓ |
| Local enhancer generator (LEG) | baseline | 22.3 | 0.182 |
| | baseline + LEG | 10.5 | 0.124 |
| Perceptual loss | baseline + LEG | 10.5 | 0.124 |
| | baseline + LEG + PL | 7.2 | 0.106 |
| PatchGAN discriminator (PD) | baseline + LEG + PL | 7.2 | 0.106 |
| | baseline + LEG + PL + PD | 4.8 | 0.105 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
