research-article

Open access

CLIP-guided StyleGAN Inversion for Text-driven Real Image Editing

Authors:

Ahmet Canberk Baykal,

Deniz YuretAuthors Info & Claims

ACM Transactions on Graphics, Volume 42, Issue 5

Article No.: 172, Pages 1 - 18

https://doi.org/10.1145/3610287

Published: 29 August 2023 Publication History

PDF eReader

Abstract

Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing approaches for editing images using language either resort to instance-level latent code optimization or map predefined text prompts to some editing directions in the latent space. However, these approaches have inherent limitations. The former is not very efficient, while the latter often struggles to effectively handle multi-attribute changes. To address these weaknesses, we present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes. The core of our method is the use of novel, lightweight text-conditioned adapter layers integrated into pretrained GAN-inversion networks. We demonstrate that by conditioning the initial inversion step on the Contrastive Language-Image Pre-training (CLIP) embedding of the target description, we are able to obtain more successful edit directions. Additionally, we use a CLIP-guided refinement step to make corrections in the resulting residual latent codes, which further improves the alignment with the text prompt. Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds, as shown by our qualitative and quantitative results.

1 Introduction

The quality of images synthesized by Generative Adversarial Networks [Goodfellow et al. 2014] have reached a remarkable level in less than a decade. StyleGAN and its variants [Karras et al. 2019, 2020, 2021] are now capable of generating highly realistic images, while allowing control over the generation process by means of style mixing. Recent works [Härkönen et al. 2020; Shen et al. 2020] have demonstrated that StyleGAN learns disentangled attributes, making it possible to find directions in its latent space to generate images that possess such desired attributes. Consequently, there has been a growing interest in utilizing semantic editing directions in the latent space mostly for preset directions such as gender, face orientation, and hair color.

Concurrent to the advances in generative modeling, we are also witnessing exciting breakthroughs in multimodal learning. For example, the recently proposed Contrastive Language-Image Pre-training (CLIP) model [Radford et al. 2021] provides an effective common embedding for images and text captions. Such an embedding, when combined with powerful GANs paves the road toward text-guided image editing, one of the most natural and intuitive ways of manipulating images. Hence, it comes with no surprise that several recent works [Li et al. 2020; Xia et al. 2021a; Patashnik et al. 2021; Kocasari et al. 2021; Wei et al. 2022] have focused on mapping target textual descriptions to editing directions in the latent space of StyleGAN. While some methods perform optimization in the latent space guided by CLIP [Xia et al. 2021a; Patashnik et al. 2021], others train a separate mapper network for each type of textual edit [Patashnik et al. 2021] or a general mapper conditioned on reference images and textual descriptions [Wei et al. 2022]. Instance-based optimization methods require long inference times. Training mappers for a single text prompt reduces the inference time to a single forward pass but comes with the price of training time as separate mappers need to be trained for each text prompt. Moreover, these mappers that operate in the latent space do not directly consider the features of the original image as they take inverted latent codes as inputs from pretrained GAN inversion networks.

In this study, we present a new approach, which we call CLIPInverter, to automatically edit an input image based on a target textual description containing multiple attributes by adjoining lightweight adapter modules to pretrained unconditional inversion methods (see Figure ). CLIPInverter includes a novel CLIP-conditioned adapter module (CLIPAdapter) that is attached to the pretrained encoder model to map both the input image and the target textual description to a residual latent code by utilizing the common CLIP embedding space. The residual latent code is then combined with the latent code of the input image obtained by the unconditional branch of the encoder and is fed to a CLIP-guided correction module (CLIPRemapper) that applies a final correction by blending the latent codes with latent codes predicted from the CLIP embedding of the target textual description based on learnable blending coefficients. The final latent code is decoded by a pretrained and frozen StyleGAN2 generator to synthesize the manipulated image that reflects the desired changes while preserving the identity of the original subject as much as possible. Our encoder-adapters are lightweight networks that directly modulate image feature maps using text embeddings and they could be appended to many pretrained encoders. Our CLIP-guided correction module utilizes the CLIP text embeddings to enhance the manipulations of the generated images while preserving the photorealism. Our method does not require any additional optimization on the latents and it successfully applies manipulations using various text prompts in a single forward pass. Since we directly modulate feature maps extracted during the inversion phase, our method is capable of editing images much better than the competing approaches, especially in cases when there are multiple attributes present in the target textual description, as proven by our experiments. See Figure 1 for an overview of our framework.

Fig. 1.

Fig. 2.

Our method aims to strike a balance between distortion and editability [Tov et al. 2021]. Namely, our text-guided CLIPAdapter is utilized to find an editing direction that is aligned with the given target description, specific to the input image. By leveraging the inversion in the \(\mathcal {W+}\) space, we aim to preserve the identity of the input image in the manipulated output, which helps in achieving relatively low distortion. However, it is important to note that complete elimination of distortion is not feasible in this process. While we are able to preserve the identity to a certain degree, we observe that not all attributes described in the target caption may be fully captured in the manipulated image. To address this, we introduce the text-guided refinement module, CLIPRemapper, which applies a final correction to the latent code, further aligning it with the desired target description. Essentially, CLIPRemapper finds a more editable region in the vicinity of the latent code we obtain from the previous stage. This process boosts the manipulation performance of our model massively, while keeping the distortion at a comparable level, as shown in our ablation study.

We demonstrate editing results for challenging cases where there are many attributes present in the target description. Our method is not restricted to a particular domain like commonly studied human faces, and we also evaluate our approach on birds and cats images. Exploring the multimodal nature of CLIP, instead of target textual descriptions, we can additionally use images or target textual descriptions containing vocabulary never seen during training as the guiding signal. Finally, we show that linearly interpolating between the original latent code and the updated latent code results in smooth image manipulations, providing a means for user to have control over the manipulation process.

We evaluate our method on a diverse set of datasets and provide detailed qualitative results and comparisons against the state-of-the-art models. Quantitative comparisons in language-guided editing still remains a challenge, as one needs to evaluate the manipulations from different aspects, such as accuracy, preservation of text-irrelevant details, photorealism, and so on. Current metrics are not suitable for evaluation as they do not consider some of these aspects at all. We propose two new metrics, Attribute Manipulation Accuracy (AMA) and CLIP Manipulative Precision (CMP), to measure how accurately the manipulations are applied and how well the text-irrelevant details are preserved. We perform quantitative comparisons against state-of-the-art models using these metrics along with Fréchet Inception Distance (FID). These comparisons as well as a user study that we conducted to evaluate perceptual realism and manipulation accuracy demonstrate the superiority of our approach over the prior work.

Our code and models are publicly available at the project website.¹

2 Related Work

2.1 GAN Inversion

In response to the growing demand for interpretability and controllability in GANs, the need for GAN inversion has emerged as a pivotal technique. By mapping a given image back into the latent space of a pretrained GAN model, as introduced by Zhu et al. [2016], GAN inversion facilitates a deeper understanding of the underlying features and structures in the latent space, enabling researchers to manipulate and interpret generated images with greater precision and insight. Below we discuss some representative works to highlight three main approaches to accomplish GAN Inversion; please refer to the recent survey [Xia et al. 2021b] for an in-depth discussion of various other inversion methods.

The optimization-based methods directly optimize a latent code that reconstructs the target image as close as possible using gradient descent [Abdal et al. 2019, 2020; Creswell and Bharath 2016; Tewari et al. 2020a]. This line of works is instance specific and does not require any trainable modules. The learning-based methods invert an image by a learned encoder. This approach is similar to an autoencoder pipeline, where the pretrained generator acts as the decoder. Unconditional encoders [Tewari et al. 2020b; Zhu et al. 2020; Alaluf et al. 2021a; Bau et al. 2019a; Richardson et al. 2021; Tov et al. 2021; Bai et al. 2022] aim to solely invert the image, without any modifications while conditional encoders [Alaluf et al. 2021b] are designed for obtaining a latent code conditioned on attributes such as pose, age, or facial expressions. The so-called hybrid methods [Zhu et al. 2016; Bau et al. 2019b] combine optimization-based methods with learning-based methods. The images are first inverted to a latent code by a learned encoder. This latent code then becomes the initialization for the latent optimization and is optimized to reconstruct the target image.

More recent approaches build different architectures, fine-tune StyleGAN weights, or modulate feature maps for inversion. Style Transformer [Hu et al. 2022] uses a combination of convolutional neural networks and transformers to invert images into the latent space. Pivotal Tuning Inversion [Roich et al. 2021] fine-tunes the generator around a pivotal latent code to find a balance for the distortion-editability trade-off. Some methods [Alaluf et al. 2021c; Dinh et al. 2022] train hypernetworks to modulate the weights of a pre-trained StyleGAN network for accurate as well as editable inversions. Spatially-Adaptive Multilayer (SAM) GAN Inversion [Parmar et al. 2022] predicts invertibility maps and High-Fidelity GAN Inversion [Wang et al. 2022] predicts latent maps to modulate StyleGAN features.

While both optimization-based and hybrid approaches may reconstruct images faithfully, they require solving an optimization problem for each image, resulting in longer processing times. However, our approach adapts learned adapters appended to encoders, which provides a much faster alternative to current methods. Furthermore, we condition the inversion process directly on the target captions, which ensures that a more effective editing space direction can be found in the latent space.

2.2 Latent Space Manipulation

Recent work has shown that GANs learn a semantically coherent latent space, enabling to map manipulations in the latent space to semantic image editing. Specifically, StyleGAN [Karras et al. 2019] learns an intermediate latent space by employing a mapping network to transform the sampled latent code. These intermediate latent codes determine the parameters of the AdaIN [Huang and Belongie 2017] layers introduced in the generator to control the style of the generated image, allowing control over the synthesis at different levels. A common approach when manipulating images is to first invert the input image back into the latent space of a pretrained generator using GAN inversion and then traverse the latent space to find a meaningful direction. Such a direction can be found by using explicit supervision of image attribute annotations [Shen et al. 2020; Abdal et al. 2021; Wu et al. 2020] or in an unsupervised manner [Voynov and Babenko 2020; Härkönen et al. 2020; Shen and Zhou 2021]. Recently proposed methods consider various modalities for conditional image manipulation. StyleMapGAN [Kim et al. 2021] proposes an intermediate latent space with spatial dimensions with spatial modulation that enables local editing based on reference images. Similarly, the study by Collins et al. [2020] uses a transformation matrix to control the interpolation between an input image and a reference image in the latent space to locally edit the input image. The recent work of Alaluf et al. [2021b] manipulates an input image based on a target age by training an encoder conditioned on the target age to find residual latent codes to add to the inverted latent code of the original image. In a similar vein, we train adapter layers appended to an encoder conditioned on textual descriptions to output these residual latent codes. We also use the CLIP model to define supervisory signals to explore the similarity of an input image and a textual description.

Moreover, there are several latent spaces to consider in a StyleGAN2 generator. The latent mapper transforms the latent codes in the space \(\mathcal {Z}\) drawn from a Normal distribution to an intermediate latent space \(\mathcal {W}\). The latent codes in the \(\mathcal {W}\) space are used at different stages in the StyleGAN2 generator, after being mapped to the \(\mathcal {S}\) space by an affine transformation. \(\mathcal {W}+\) space is an extended version of the \(\mathcal {W}\) space where a different \(\mathbf {w}\) is used for each style input of the generator. While some works find editing directions in the \(\mathcal {S}\) space such as StyleCLIP-GD [Patashnik et al. 2021] and StyleMC [Kocasari et al. 2021], many others like StyleCLIP-LO, StyleCLIP-LM [Patashnik et al. 2021], and SAM [Alaluf et al. 2021b] utilize the extended intermediate space \(\mathcal {W}+\). Our text-guided image encoder operates on \(\mathcal {W}+\) to find effective editing directions.

2.3 Text-guided Image Manipulation

Given an image and a target description in natural language, the aim of text-guided image manipulation is to generate images that reflect the desired semantic changes while also preserving the details or attributes not mentioned in the text. ManiGAN [Li et al. 2020] learns a text-image affine combination that selects image regions that are relevant to the language description and a detail correction module that modifies these regions. TediGAN [Xia et al. 2021a] enforces the text and image matching by mapping the images and the text to the same latent space and performs further optimization to preserve the identity of the subjects in the original image.

More recent works use semantics learned by a multi-modal method such as CLIP [Radford et al. 2021]. StyleCLIP [Patashnik et al. 2021] uses the CLIP space to optimize for the latent code (StyleCLIP-LO) that minimizes the distance of the image and text pair. They also present a latent mapper (StyleCLIP-LM) that predicts residual latent codes corresponding to specific attributes. Finally, they also experiment with mapping a text prompt to a global direction (StyleCLIP-GD) in the latent space that is independent of the input image. The most recent StyleMC [Kocasari et al. 2021] model presents an efficient method to learn global directions in the \(\mathcal {S}\) space of StyleGAN2 for a given text prompt, by finding directions at lower resolutions and applying manipulations at higher resolutions. It also utilizes CLIP to minimize the distance between the generated image and the text prompt. Most recently and most similar to our approach, HairCLIP [Wei et al. 2022] modulates the inverted latent codes based on hairstyle and hair color inputs as image or text. Their approach is similar to StyleCLIP-LM. However, they also modulate the latent codes with the CLIP embeddings rather than solely optimizing the similarity in the CLIP space.

Our work share some similarities with the aforementioned methods. Like the original TediGAN model, we employ an encoder to predict the latent code conditioned on the provided target description. That said, we estimate a residual latent code reflecting only the desired changes mentioned in the description, which is to be added to the inverted latent code of the input image. StyleCLIP-LM and StyleMC models predict residual latent codes similar to ours, but they require training their mapper functions from scratch for each text prompt via a loss function based on CLIP similarity. Most similar to our approach, HairCLIP applies modulations in the latent space after obtaining inversions with a pretrained network. However, we let CLIP embeddings modulate the feature maps via an adapter module for predicting the residual latent code. With this modulation, our inversion step is text guided, whereas HairCLIP applies text-conditioning on the latent space. We also train a correction module that applies latent code blending with learnable blending coefficients for improved accuracy, quality and fidelity in the output images. In Figure 1, we illustrate the aforementioned fundamental differences between our approach and the most similar StyleCLIP-LM and HairCLIP methods.

Our approach allows us to manipulate fine-scale details by modulating the feature maps, resulting in more accurate manipulations than HairCLIP. Thanks to this process, we also eliminate the need for separate training, unlike StyleCLIP-LM. That is, once our model is trained, it can be directly used to manipulate images by considering a large variety of text prompts containing multiple attributes.

We provide extensive comparisons against the aforementioned recent StyleGAN-based methods in Section 4 and show the superiority or competitiveness of our proposed approach.

Recently, diffusion models trained with variational inference achieved state-of-the-art performance in image generation [Dhariwal and Nichol 2021; Ho et al. 2020; Rombach et al. 2022]. With this success of diffusion-based models, several text-guided image manipulation methods have been proposed. DiffusionCLIP [Kim et al. 2022] first converts the images to latent noises by forward diffusion and then guides the reverse diffusion process by CLIP to control the attributes in the synthesized images. UniTune [Valevski et al. 2022] introduces a simple method to fine-tune large-scale text-to-image diffusion models on single images. Similarly, Imagic [Kawar et al. 2022] optimizes a text embedding and fine-tunes pretrained generative diffusion models to perform edits on a single image. Prompt-to-Prompt [Hertz et al. 2022] and its later extension Plug-and-Play [Tumanyan et al. 2023] achieve semantic edits by blending activations extracted from both the original and target prompts. These diffusion-based editing methods differ from ours as each one requires a large pre-trained text-to-image network. Hence, we do not directly evaluate our approach against these methods, but provide some comparisons in the supplementary.

2.4 Adapter Layers

Adapter layers [Houlsby et al. 2019], originally proposed for Natural Language Processing (NLP) tasks, are compact modules that allow parameter sharing in an efficient manner. The key idea is to add adapter modules, consisting of a few layers, between the layers of a pretrained network. The parameters of the adapter module are updated during the fine-tuning phase on a downstream task, while the original parameters of the pretrained network remain the same. This way, most of the parameters of the pretrained network are shared between different downstream tasks, resulting in a model that is able to perform diverse tasks efficiently. Since the parameters of the pretrained network are frozen, the original capabilities of the model are preserved. The module proposed for NLP [Houlsby et al. 2019] is appended after the feed-forward layers and before adding the skip connection back, in a transformer model. This module consists of a down-projection and an up-projection layer. Compared to the original pretrained model, the number of parameters of the adapter module is considerably smaller, allowing the learning of new tasks efficiently.

Adapter layers have also been proposed to use in computer vision tasks. Rebuffi et al. [2017] introduced residual adapter layers for multiple-domain learning in image recognition. Their residual adapter layers are slightly modified versions of the residual blocks in ResNet [He et al. 2016], where batch normalization and 1 \(\times\) 1 convolutions with residual connections are added to these residual blocks. Rebuffi et al. [2018] proposed several improvements over this module. They modified the series implementation of the residual adapter to obtain a parallel adapter, where the input to the convolutional blocks of the residual block is processed in parallel with the adapter convolutions and fed back to the original branch. They also investigated where to place the adapter layers in the ResNet to achieve the best performance. Finally, in VL-Adapter [Yi-Lin Sung 2022], the authors experimented with adapter layers in vision and language joint tasks. They added adapter modules consisting of downsampling and upsampling layers to the transformer architecture for parameter efficient fine-tuning.

Our approach consists of adapter modules that we attach to inversion models. Our encoder adapter module is similar to the mapping networks in StyleCLIP-LM. However, in these adapter modules, we modulate intermediate image feature maps that are extracted from the inversion model. After the modulation, the feature maps are fed back to the inversion model to be processed further. With this essential idea, we are able to add a text conditional branch to the existing GAN inversion models while preserving its unconditional inversion capabilities.

3 The Approach

3.1 Overview of CLIPInverter

Our text-guided image editing framework includes two separate modules, namely CLIPAdapter and CLIP Remapper, each playing a different role in obtaining the desired edit. CLIPAdapter involves CLIP-conditioned adapter layers for the GAN inversion process, which are used for finding semantic editing directions in the latent space along which the given input image is manipulated. CLIPRemapper then performs a final refinement over the predicted latent code of the output image considering the CLIP embedding of the input text prompt to further improve the manipulation accuracy as well as the perceptual quality.

Given an input image \(\mathbf {x}_{\mathbf {in}}\) and a desired target description \(\mathbf {t}_{\mathbf {target}}\), the goal of our CLIPInverter approach is to manipulate the input image and synthesize an output image \(\mathbf {x}_{\mathbf {out}}\) such that the end result reflects the attributes described in the text (e.g., hair color, age, and gender), while preserving the identity of the subject present in the original image or any other features not relevant to the description. Assuming that we have access to a StyleGAN2 [Karras et al. 2020] generator \(G\) that can synthesize images from a particular domain, we cast this text-guided manipulation task as finding a mapping of the input image \(\mathbf {x}_{\mathbf {in}}\) and the target text prompt \(\mathbf {t}_{\mathbf {target}}\) to a latent code \(\mathbf {w^*}\in \mathcal {W}+\) in the latent space of \(G\) so that when decoded it generates the manipulation result as \(\mathbf {x}_{\mathbf {out}} = G(\mathbf {w^*})\).

We perform the latent space mapping in two steps, using the unconditional and the conditional branches of the text-guided encoder, which we obtain by attaching CLIPAdapter to a pretrained image inversion network, namely encoder4editing (\(\text{e4e}\)) [Tov et al. 2021]. We first map the input image \(\mathbf {x}_{\mathbf {in}}\) to its latent code \(\mathbf {w}\) through the pretrained encoder \(\text{e4e}\). We then compute a residual latent vector \(\Delta \mathbf {w}\) through the conditional branch, which processes both the input image and the CLIP model [Radford et al. 2021] embedding of the textual description. The final image \(\mathbf {x}_{\mathbf {out}}\) is synthesized by passing the aggregated latent code first through the refinement module, \(\mathbf {w^*} = \text{CLIPRemapper}(\mathbf {w} + \Delta \mathbf {w})\) and then through the generator network, which is a pretrained StyleGAN2 [Karras et al. 2020] generator. CLIPInverter applies one final correction to the latent code by predicting latents based on the CLIP embedding of the target caption \(\mathbf {t}_{\mathbf {target}}\). Then, the predicted latent is blended with the previously inverted latent code depending on a learned interpolation coefficient \(\alpha\).

In the following, we describe the details of the key modules of CLIPInverter and the loss functions we utilize during training.

3.2 CLIPAdapter: CLIP-guided Adapters for Latent Space Manipulation

Figure 2(a) shows the architecture of our proposed text-guided encoder, which follows the architecture of \(\text{e4e}\) with attached lightweight adapters that enable us to incorporate the textual descriptions. The original \(\text{e4e}\) architecture maps the input image to feature maps at three levels—coarse, medium, and fine. We introduce Adaptive Group Normalization (AdaGN) layers in CLIPAdapter, replacing the Instance Normalization in the Adaptive Instance Normalization (AdaIN) [Huang and Belongie 2017] layers to modulate these features using features obtained from the CLIP [Radford et al. 2021] embedding of the target description.

Fig. 3.

CLIPAdapter also employs shallow mapping networks, one for each level, to better align the multi-modal semantic space of the CLIP model with the \(\mathcal {W}+\) space of StyleGAN2. Specifically, we feed the text embedding obtained from the CLIP model to a multi-layer perceptron (MLP) that predicts the scale and shift parameters of the subsequent AdaGN blocks. Given the image features from the coarse, medium, and fine layers of the encoder, the AdaGN blocks perform feature modulation such that the outputs control the prediction of the residual latent codes.

The design philosophy behind our encoder architecture is to have adapter layers in a pretrained network that can identify visual features relevant and irrelevant to the manipulation task in both image and text-specific manner in computing the residual latent code to identify the manipulation direction in the \(\mathcal {W+}\) space. Specifically, we factorize the layers of the \(\text{e4e}\) network into two groups: \(\text{e4e}_{\text{body}}\) and \(\text{e4e}_{\text{m2s}}\). While \(\text{e4e}_{\text{body}}\) includes the convolutional backbone layers and it extracts a feature pyramid consisting of feature maps from coarse, medium, and fine levels, \(\text{e4e}_{\text{body}}\) consists of small convolutional mapping networks that transforms these feature maps to the latent styles in the \(\mathcal {W+}\) space. We insert CLIPAdapter between \(\text{e4e}_{\text{body}}\) and \(\text{e4e}_{\text{m2s}}\).

More formally, to manipulate a given image \(\mathbf {x}_{\mathbf {in}}\) based on a text prompt \(\mathbf {t}_{\mathbf {target}}\), we start with obtaining the latent code \(\mathbf {w}\) of the original image in the \(\mathcal {W}+\) latent space of StyleGAN2 [Karras et al. 2020] via \(\text{e4e}\),

\begin{equation} \mathbf {w} = \text{e4e}(\mathbf {x}_{\mathbf {in}}) \in \mathbb {R}^{18 \times 512}. \end{equation}

(1)

To perform semantic edits on \(\mathbf {x}_{\mathbf {in}}\) to reflect the desired target look, we utilize the text-conditioned branch of our encoder network, which takes both the input image and the target textual description as input and outputs the residual latent code. During this process, we first extract intermediate feature maps \(\mathbf {c_i}\) from the body layers of the encoder network, \(\text{e4e}_{\text{body}}\),

\begin{equation} \mathbf {c_i} = \text{e4e}_{\text{body}}(\mathbf {x}_{\mathbf {in}}). \end{equation}

(2)

Next, we utilize the CLIP text embedding of the target text prompt \(\mathbf {t}_{\mathbf {target}}\) to modulate \(\mathbf {c_i}\), obtaining the modulated feature maps \(\mathbf {c_o}\) through our encoder-adapter layers CLIPAdapter:

\begin{equation} \mathbf {c_o} = \text{CLIPAdapter}(\mathbf {c_i}, \mathbf {t}_{\mathbf {target}}). \end{equation}

(3)

As the final step to predict the manipulation directions as residual latents \(\Delta \mathbf {w}\), we pass the modulated feature maps \(\mathbf {c_o}\) through the map2style layers of \(\text{e4e}\), \(\text{e4e}_{\text{m2s}}\),

\begin{equation} \Delta \mathbf {w} = \text{e4e}_{\text{m2s}}(\mathbf {c_o}) \in \mathbb {R}^{18 \times 512}. \end{equation}

(4)

Note that the body and map2style layers of \(\text{e4e}\) are combined to complete the pretrained encoder \(\text{e4e} = [\text{e4e}_\text{body}, \text{e4e}_\text{m2s}]\). The language conditioning happens in the adapter layers CLIPAdapter and these layers are the only layers with trained parameters in the inversion framework, the rest of the parameters are pretrained.

3.3 CLIPRemapper: CLIP-guided Latent Vector Refinement

To further enhance the quality of the manipulated image, we introduce a final refinement step over the predicted latent code. As shown in Figure 2(b), our CLIPRemapper carries out this refinement process by mapping CLIP text embedding of the given text prompt to the \(\mathcal {W+}\) space and then using the projected text embedding to steer the residual latent code predicted by CLIPInverter toward a direction more compatible with the target text. Specifically, CLIPRemapper involves shallow mapping networks for each level to better align image with the text. The text embedding obtained from CLIP is fed to MLPs at each stage to predict a component for latent code correction corresponding to the caption, as follows:

\begin{equation} \mathbf {\Delta \widehat{w}_i} = \text{MLP}_{i}(\mathbf {t}_\mathbf {target}). \end{equation}

(5)

Taking into account \(\Delta \mathbf {\widehat{w}_i}\), we apply a further correction to the residual latent code predicted through CLIPAdapter as

\begin{equation} \mathbf {\Delta {w_i}^{\prime }} = \frac{\left(\alpha _{i} * \Delta \mathbf {{w_i}} + (1 - \alpha _{i}) * \Delta \mathbf {\widehat{w}_i}\right)*\Vert \Delta \mathbf {{w_i}}\Vert }{\Vert {\alpha _{i} * \Delta \mathbf {{w_i}} + (1 - \alpha _{i}) * \Delta \mathbf {\widehat{w}_i}}\Vert }, \end{equation}

(6)

where \(\alpha _i\) is a weighting factor that is defined as a learnable parameter and \(\Delta \mathbf {{w_i}^{\prime }}\) represents the final corrected residual latent code.

In particular, the corrected residual latent code \(\Delta \mathbf {{w_i}^{\prime }}\) is obtained by considering linear combination of two separate codes, the residual latent code from CLIPAdapter \(\Delta \mathbf {{w_i}}\) and the vector \(\Delta \mathbf {\widehat{w}_i}\), followed by a normalization. We do not want the refinement procedure to make substantial changes in the predicted latent code. Hence, along with the loss functions introduced in the next section, the normalization further enforces the final latent code \(\Delta \mathbf {{w_i}^{\prime }}\) to be in the vicinity of the residual latent code predicted in the previous step. We only make the necessary changes in the semantic directions suggested by the CLIP embedding of the target text \(\mathbf {t}_\mathbf {target}\) through a simple image composition process in the latent StyleGAN space.

CLIPRemapper effectively integrates the local inductive bias of the target description and the desired visual characteristics for the source image as suggested by the target description. In structured domains such as human faces, residual latent code \(\Delta \widehat{w}\) obtained in an image blind manner using target description produces interpretable results. This process, as demonstrated in Figure 3, combines the manipulated image generated by CLIPAdapter with a generic image that predominantly exhibits the characteristics mentioned in the target description, leading to further improvements on both the manipulation accuracy and the perceptual quality. In the case of less structured domains, e.g., birds, while \(\Delta \widehat{w}\) may not be interpretable, it still provides some improvements to the manipulations. Additional visualizations for cat and bird images can be found in the supplementary material.

Fig. 4.

3.4 Training Losses

We train our proposed \(\text{CLIPInverter}\) model on a training set of images paired with their corresponding textual descriptions \(\lbrace (\mathbf {x}_\mathbf {in},\mathbf {t}_\mathbf {real})\rbrace\). Specifically, we employ a cyclic adversarial training strategy [Zhu et al. 2017] during training, which involves two separate manipulation steps. In the first one, we feed in the original input image \(\mathbf {x}_\mathbf {in}\) along with a target textual description \(\mathbf {t}_{target}\) (which does not match with the input image) to our model. This process generates a manipulated image \(\mathbf {x}_\mathbf {out} = \text{CLIPInverter}(\mathbf {x}_\mathbf {in},\mathbf {t}_\mathbf {target})\). In the cyclic pass, we take this manipulated image \(\mathbf {x}_\mathbf {out}\) and the original text description \(\mathbf {t}_\mathbf {real}\) (which describes the original input image \(\mathbf {x}_\mathbf {in}\)) as inputs to obtain \(\widehat{\mathbf {x}}_\mathbf {in} = \text{CLIPInverter}(\mathbf {x}_\mathbf {out},\mathbf {t}_\mathbf {real})\). We expect \(\widehat{\mathbf {x}}_\mathbf {in}\) to closely resemble the original image \(\mathbf {x}_\mathbf {in}\) by enforcing cycle consistency. We obtain the target text description by rolling the minibatch, meaning that each image will be paired with the textual description that describes the next image in the minibatch. We train our model with a set of loss functions. Each of these objectives are used both in the first manipulation pass and the following cycle pass. In the following, we only describe the losses for the first manipulation pass for the sake of presentation simplicity.

We use \(\mathcal {L}_{2}\) and \(\mathcal {L}_{\mathrm{LPIPS}}\) [Zhang et al. 2018] losses to respectively enforce pixelwise and perceptual similarities between the input and the manipulated image, such that

\begin{equation} \mathcal {L}_{\mathrm{2}} = \Vert \mathbf {x}_{in}-\mathbf {x}_{out}\Vert _{2}, \end{equation}

(7)

\begin{equation} \mathcal {L}_{\mathrm{LPIPS}} = \Vert F(\mathbf {x}_{in})-F(\mathbf {x}_{out})\Vert _{2}, \end{equation}

(8)

where \(F(\cdot)\) denotes deep features extracted from a pretrained AlexNet [Krizhevsky et al. 2012] model.

Ideally, we want any manipulation to preserve the identity of the subject in the original image. To preserve the identity, we employ an identity loss that maximizes the cosine similarity between the input image and the output image feature embeddings:

\begin{equation} \mathcal {L}_{\mathrm{ID}}=1-\langle R(\mathbf {x}_{in})), R(\mathbf {x}_{out})\rangle , \end{equation}

(9)

where \(\langle \cdot , \cdot \rangle\) represents the cosine similarity between the feature vectors and \(R\) denotes a pretrained deep network. Specifically, we use the pretrained ArcFace [Deng et al. 2019] network for human faces, and a ResNet50 [He et al. 2015] network trained with MOCOv2 [Chen et al. 2020] for birds and cats.

We also employ the following regularization loss, which enforces the predicted latent codes to be close to the average latent code of the generator and was shown to improve overall image quality in previous work [Richardson et al. 2021], such that

\begin{equation} \mathcal {L}_{\text{reg}}=\Vert \mathbf {w^*}-\overline{\mathbf {w}}\Vert _{2}, \end{equation}

(10)

where \(\mathbf {w^*}\) and \(\overline{\mathbf {w}}\) are the aggregated and the average latent codes, respectively.

Last, to enforce the similarity between the output image and the target description, we employ a directional CLIP loss [Gal et al. 2021]. Rather than directly minimizing the distance between the generated image \(\mathbf {x}_{out}\) and the text prompt \(\mathbf {t}_{target}\) in the CLIP space, directional CLIP loss aligns the direction from the input image \(\mathbf {x}_{in}\) to the manipulated image \(\mathbf {x}_{out}\) with the direction from the original text description \(\mathbf {t}_{real}\) to the target text description \(\mathbf {t}_{target}\):

\[\begin{eqnarray} \Delta T=E_\mathrm{CLIP,T}\left(\mathbf {t}_{target}\right)-E_\mathrm{CLIP,T}\left(\mathbf {t}_{real}\right), \nonumber \nonumber\\ \Delta I=E_\mathrm{CLIP,I}\left(\mathbf {x}_{out}\right)-E_\mathrm{CLIP,I}\left(\mathbf {x}_{in}\right), \nonumber \nonumber\\ \mathcal {L}_{\text{direction }}=1-\frac{\Delta I \cdot \Delta T}{|\Delta I||\Delta T|}, \end{eqnarray}\]

(11)

where \(E_{\mathrm{CLIP,T}}\) and \(E_{\mathrm{CLIP,I}}\) are the text and image encoders of CLIP, respectively.

Our final loss function for the first manipulation pass is a weighted sum of the objectives:

\begin{equation} \mathcal {L}_{\mathrm{manipulation}} = \lambda _{1}\mathcal {L}_{2} + \lambda _{2}\mathcal {L}_{\mathrm{LPIPS}} + \lambda _{3}\mathcal {L}_{\mathrm{ID}} + \lambda _{4}\mathcal {L}_{\mathrm{reg}} + \lambda _{5}\mathcal {L}_{\mathrm{direction}} , \end{equation}

(12)

where each \(\lambda _{i}\) determines the weight of the corresponding objective. The total loss including the first manipulation and the follow-up cycle passes is the following:

\begin{equation} \mathcal {L}_{\mathrm{total}} = \mathcal {L}_{\mathrm{manipulation}} + \lambda _6 \mathcal {L}_{\mathrm{cyclic}}, \end{equation}

(13)

where \(\mathcal {L}_{\mathrm{cyclic}}\) is the cyclic consistency loss, which contains the same loss terms as \(\mathcal {L}_{\mathrm{manipulation}}\) in which \(\mathbf {x}_{out}\) is replaced with \(\widehat{\mathbf {x}}_{in}\), and \(\lambda _{\mathrm{cycle}}\) is the weight for this cyclic loss.

During training, we follow a multi-stage regime. We first train the CLIPAdapter (without using CLIPRemapper). Once these are fully trained, we freeze the weights of CLIPAdapter weights and train the CLIPRemapper while optimizing for the CLIP loss along with the L2, Learned Perceptual Image Patch Similarity (LPIPS), and Identity Similarity (ID) losses. For the LPIPS and L2 losses, we also include the loss between images generated with and without CLIPRemapper that ensures that the CLIPRemapper does not change the images by a large amount. In addition, we also include a L2 regularization loss on the interpolation coefficients (lambdas) such that the amount of interpolation between two latent codes does not change the original code by a large amount. This is also observed to remove artifacts in the generated images.

4 Experimental Evaluation

4.1 Datasets

We conduct extensive evaluation on a variety of domains to illustrate the generalizability of our approach. We use the Multi-Modal CelebA-HQ [Lee et al. 2020; Xia et al. 2021a] dataset to train our model on the domain of human faces. This dataset consists of 30,000 images along with 10 textual descriptions for each image. We follow the default train/test split, using 6,000 images for testing and the remaining for training. For the birds domain, we use the CUB Birds dataset [Wah et al. 2011], which contains 11,788 images in total, including 2,933 images for testing, along with 10 captions for each image. Finally, for the domain of cat faces, we use the AFHQ-Cats dataset [Choi et al. 2020], which contains a total of 5,653 images, including 500 for testing. The captions for this dataset are generated using the approach mentioned in Nie et al. [2021] leveraging the CLIP [Radford et al. 2021] model.

4.2 Training Details

We use two pre-trained models trained on our datasets: StyleGAN2 generator and \(e4e\) encoder. Keeping the weights of these models frozen, we train CLIPInverter using the cyclic adversarial training scheme described in the previous section. The mismatching captions are sampled in such a way that matching caption for an image is sampled 25% of the time during training. In our experiments, for the CLIPAdapter, we empirically set \(\lambda _1 = 1.0\), \(\lambda _2 = 0.6\), \(\lambda _3 = 0.1\), \(\lambda _4 = 0.005\), \(\lambda _5 = 1.0\), and \(\lambda _6 = 1.0\) and the learning rate to 0.0005. For CLIPRemapper, we increase the weight of the identity loss to \(\lambda _3~=~0.5\) and totally exclude the regularization loss during training. We initialize the linear coefficient \(\alpha _i\)’s with 0.05 and train them together with the parameters of CLIPRemapper. We train CLIPAdapter for 200k iterations on a single Tesla v100 GPU, which takes about 6 days and CLIPRemapper for 20k iterations that takes about a day.

4.3 Evaluation Metrics

Quantitative analysis of the language-guided image manipulation task is a challenging matter. The quality and the photorealism of the generated images can be evaluated with FID [Heusel et al. 2017]. However, there is no established way to evaluate the manipulation accuracy of a model. It is crucial that an effective model should only alter the attributes specified in the target text prompt, while preserving the original attributes for the rest of the input image. Hence, we also use the ID similarity [Deng et al. 2019] to assess the identity preservation.

To evaluate the model accuracy in terms of these aspects, we propose two metrics: AMA and CMP. Attribute Manipulation Accuracy measures how accurately a model can apply single attribute manipulations. For face images, we train an attribute classifier using the images and their attribute annotations from the CelebA [Liu et al. 2015] dataset, following Nie et al. [2021]. In terms of the validation accuracy of the classifier on different attributes, we select 15 of the best performing attributes, such as blond hair, chubby, mustache (see the appendix for the full list of attributes), of 40 that are included in CelebA. Here, we have two versions of the AMA score. AMA-Single measures the accuracy of single attribute manipulations using our model. To evaluate this, we generate 50 image manipulations for each of the 15 selected attributes, resulting in a total of 750 images. For each manipulation, we employ pre-defined text prompts that specifically mention the attribute of interest, such as “This person has blond hair.” The accuracy is then determined by assessing how well the generated images align with the intended attribute manipulation. We evaluate the accuracy of these manipulations using the attribute classifier and take the mean of the accuracy across all the attributes to obtain the final AMA score for that model. AMA-Multiple evaluates the accuracy of multiple attribute manipulations achieved by our model. We generate target descriptions that involve combinations of two or three attributes and perform 50 image manipulations for each combination, resulting in a total of 350 images. We consider the manipulation successful only when the resulting changes can be accurately classified by the corresponding attribute classifiers. In this context, a classification is deemed correct if the attribute score surpasses a threshold of 0.90.

For cat and bird images, we use CLIP as a zero-shot classifier to calculate the AMA. We employ 30 attributes present in the AFHQ-Cats [Choi et al. 2020] and sample 40 attributes of the 273 attributes present in the CUB [Wah et al. 2011] dataset. For each selected attribute, we generate template-based captions covering all the classes in the category that the attribute belongs to. Then, we prompt CLIP with the output image and the generated captions to obtain similarity scores for each caption. The manipulation then is successful if the caption with the correct label has the highest probability after the softmax operation on the similarity scores.

CLIP Manipulative Precision is a modified version of the Manipulative Precision metric proposed by ManiGAN [Li et al. 2020] that uses the pre-trained CLIP [Radford et al. 2021] image and text encoders. CMP measures how aligned the synthesized image is with the target text prompt \(\mathbf {t_{target}}\) and how well the original contents of the input image are preserved. It is defined as

\begin{equation} \text{CMP} = (1 - \text{diff}) * \text{sim}, \end{equation}

(14)

where diff is the \(\mathcal {L}_{\mathrm{1}}\) pixel difference between the input image \(\mathbf {x_{in}}\) and the output image \(\mathbf {x_{out}}\), and sim is the CLIP similarity between the output image \(\mathbf {x_{out}}\) and the target textual description \(\mathbf {t_{target}}\). We calculate the CMP for each of the images generated for the AMA score and take their average to obtain the final CMP score for the corresponding model.

4.4 Qualitative Results

In Figure 4, we show that our method can manipulate images from very different domains such as human faces, cats, and birds. Given an input image, we manipulate it by just providing a natural textual description highlighting the desired edits. As can be seen in the figure, the target descriptions can specify more than one attribute. For instance, one can simultaneously apply lipstick while changing the hair style of a woman or can alter the attitude and appearance of a cat at the same time.

Fig. 5.

Our method can give plausible results independent of the complexity of the provided target description. For instance, in Figure 5, we present the outcomes of our approach obtained by taking into account compositions of different visual attributes. They demonstrate that our method can deal with the provided compositions and make the necessary changes in the original input images mentioned in the descriptions to its full extent.

Fig. 6.

In Figure 6, we demonstrate that predicting residual latent code for a given target description has the advantage that one can continuously interpolate between the original image and the final result, which allows users to have control over the degree of changes made during the manipulation process. For example, the appearance of the subjects smoothly changes to reflect the increase in the intensity of the lipstick, and the color of the cats and the bird slightly changes.

Fig. 7.

To some extent, our approach can also perform edits in a zero-shot setting by using descriptions never seen during training. The key to this ability lies in the use of the CLIP-based text guided adapters that enable us to align the visual and the textual domains and map out of domain textual descriptions to a semantic editing direction in the latent space. Hence, even if the terms in the target descriptions have not been observed for the first time, our method can make the necessary changes in the input images if semantically similar terms have been seen during training. For instance, in Figure 7, we include a number of cases where the color or the structure of the hair is manipulated using novel descriptions that do not exist in the training set such as curly hair, silver hair, and facial hair.

Fig. 8.

In our proposed CLIPAdapter, we employ CLIP embeddings of the text prompt to modulate the convolutional feature maps to predict the residual latent code, representing the changes on the input image required to meet the desired target description. In fact, CLIP model learns the alignment between images and text via a contrastive learning objective and discovers a common semantic space. Hence, our framework also allows for using exemplar images as the conditioning element without any changes or training. In Figure 8, we provide some qualitative results for such image-based manipulations performed by our proposed approach. We observe that although no further training is done by considering reference images instead of target description, our model achieves a good performance on transferring the appearance of the provided reference images to the input images.

Fig. 9.

We refer readers to the supplementary material for more manipulation results.

4.5 Qualitative Comparisons to Other Text-guided Manipulation Methods

We compare our approach with various existing methods, including TediGAN [Xia et al. 2021a], StyleCLIP [Patashnik et al. 2021], StyleMC [Kocasari et al. 2021], and HairCLIP [Wei et al. 2022]. For StyleCLIP, we use the latent optimization-based model StyleCLIP-LO, and for TediGAN, we use the CLIP-based optimization approach (TediGAN-B). In all of our experiments, we use the public implementations provided by the authors. For HairCLIP, we slightly modify its neural architecture and train it accordingly. In the original paper, they do consider different conditioning vectors for the mapper modules encoding hairstyle and hair color as they refer to details from different scales. Since, we focus on a generic text-guided manipulation process where it is hard to separate the textual terms into fine-, mid-, and high-level attributes, we let the embedding of the whole target description suggested by CLIP text encoder to condition the mappers equally. All of these approaches use StyleGAN2 as a frozen generator and utilize the CLIP embedding to measure the image and text similarity.

In Figure 9, we provide some qualitative comparisons between our method and the baselines on a number of human face images. As can be seen from the figure, our approach gives more accurate edits as compared to the existing methods, especially for captions that describe multiple attribute manipulations. For instance, for the first image, our model is able to make meaningful changes to the original input image to reflect the look depicted in the target description, and apply the gender change as well as changes in the eyebrows, hair, eyes, lips, and the outfit. For the second input image, our model is able to generate the smile and the lipstick while most of the other methods fail to apply both changes at the same time. In the last two examples, our manipulation results again reflect the given target descriptions—much better than those of the competing approaches. Our method manipulates the gender, hair color, eyebrows, and age of the man and applies makeup. Similarly, it generates a smile for the woman and makes her wear a jacket, which is inline with the necktie mentioned in the description. Similarly, in Figure 10, we compare our results with those of the TediGAN-B, StyleCLIP-LO, and HairCLIP methods on bird and cat images. Like the human faces, our model is able to generate visually more pleasing and relevant results than the competing approaches. For instance, our model is able to capture the yellow-greenish color mentioned in the description for the bird in the third row and the fearful look for the cat in the first row while other methods result in poor manipulations. For birds and cats, we could not provide any comparison against StyleCLIP-GD and StyleMC as their codebase use a different implementation of the StyleGAN and they do not provide pre-trained models for these datasets. In the supplementary material, we provide additional visual comparisons.

Fig. 10.

Fig. 11.

4.6 Quantitative Comparisons to Other Text-guided Manipulation Methods

We quantitatively compare our approach to the same approaches that are compared in the qualitative comparisons, namely TediGAN [Xia et al. 2021a], StyleCLIP-LO and StyleCLIP-GD [Patashnik et al. 2021], StyleMC [Kocasari et al. 2021], and HairCLIP [Wei et al. 2022]. We use the four metrics mentioned in Section 4.3 (FID, AMA, and CMP) and ID for these quantitative comparisons. The official PyTorch implementation [Seitzer 2020] is utilized to calculate the FID scores. The AMA and the CMP scores are calculated using the procedure described in Section 4.3.

Table 1 shows the quantitative comparisons for our model against various state-of-the-art approaches. TediGAN-B achieves fairly good FID and CMP scores. However, from the qualitative results, we observed that TediGAN-B exploits adversarial ways to optimize the CLIP similarity without changing the input pixels much while failing to apply the manipulations and producing distorted images.

Table 1.

	FID \(\downarrow\)	CMP \(\uparrow\)	AMA (Single) \(\uparrow\)	AMA (Multiple) \(\uparrow\)	ID \(\uparrow\)
TediGAN-B	55.424	0.285	11.286	1.142	37.97
StyleCLIP-LO	80.833	0.210	15.857	3.429	29.69
StyleCLIP-GD	82.393	0.191	33.143	11.429	57.37
StyleMC	84.088	0.187	12.143	2.857	30.05
HairCLIP	93.523	0.218	41.571	15.143	57.50
Ours	97.210	0.221	61.429	41.714	52.14

Table 1. Quantitative Comparisons on the CelebA Dataset

Our approach exhibits superior manipulation accuracy compared to other methods, particularly for manipulations involving multiple attributes, while maintaining a comparable level of perceptual quality. The best and second-best performing models are highlighted in bold and underlined, respectively.

While performing well in terms of either one or two metrics, the competing approaches usually fail to be competitive across all four metrics. StyleCLIP-LO is able to achieve a fairly comparable CMP, since it optimizes the CLIP similarity for each instance, and a good FID score but fails to apply the given attribute manipulations accurately. StyleMC also achieves a good FID score, since it finds directions in the \(\mathcal {S}\) space. However, it also fails to output accurate manipulations. Even though StyleCLIP-GD performs better than these two models, its performance still falls behind the performance of our approach. Finally, HairCLIP achieves the best scores out of the competing approaches. The results demonstrate the superiority of our model against HairCLIP, as our method achieve much higher manipulation accuracies while remaining competitive in terms of the FID and ID scores. Our approach finds a good balance for the distortion and editability problem by applying manipulations successfully while being comparable in terms of photorealism. Hence, they are able to achieve good scores across all four metrics.

Table 2 presents the quantitative comparisons on the AFHQ-Cats and the CUB datasets. Since CLIP is used as a similarity metric in CMP and as a zero-shot classifier in AMA estimations, TediGAN-B again achieves really good scores in these two metrics. However, as seen from the FID scores and the results shown in Figure 10, it gives highly blurred and non-realistic outputs that are not actually in line with the target descriptions. Another optimization-based method, StyleCLIP-LO, achieves worse AMA and CMP scores than TediGAN-B but better FID. Their loss functions allow the model to output realistic outputs, but they fail to apply the manipulations successfully, which can be seen in Figure 10. HairCLIP generates images that are better in line with the descriptions than the aforementioned methods. However, our approach outperforms HairCLIP by a large margin in terms of CMP and AMA while having a fairly close or even better FID values. We underlined the second best performing models for each metric to demonstrate the superiority of our approach against the others, since the best-performing models usually exploit adversarial ways to optimize the CLIP similarity that yield high CMP and AMA values or fail to apply the manipulations that yield better FID values.

Table 2.

	AFHQ-Cats			CUB
	FID \(\downarrow\)	CMP \(\uparrow\)	AMA \(\uparrow\)	FID \(\downarrow\)	CMP \(\uparrow\)	AMA \(\uparrow\)
TediGAN-B	39.414	0.255	82.467	42.007	0.233	59.500
StyleCLIP-LO	18.771	0.226	48.133	19.209	0.211	27.000
HairCLIP	21.087	0.227	44.667	26.447	0.218	57.050
Ours	24.172	0.245	76.467	25.837	0.221	66.000

Table 2. Quantitative Comparisons on the AFHQ-Cats and CUB Datasets

Our approach demonstrates superior manipulation accuracy compared to other methods, while also preserving a comparable perceptual quality. The best and second-best performing models are highlighted in bold and underlined, respectively.

For quantitative analysis, we conduct a user study via Qualtrics to evaluate the performance of our approach and all the other competing methods. Specifically, in this user study, we focus on two important aspects: (1) the accuracy of the edits with respect to the given target descriptions, and (2) the photorealism of the manipulated images. In our human evaluation, we randomly generate 48 questions and divide them into 3 groups, with 16 questions each. We make sure that at least 14 different subjects answer each of these group of questions. To measure the accuracy, we show the users an input image, a target description, and the manipulation results of all of the competing methods and ask them to rank the results against each other with respect to how consistent the edits are to the provided description. The participants perform this by dragging the images into their preferred order, where the left-most position refers to the worst result having rank order 1 and the right-most one represents the best outcome at rank order 6. To avoid any bias in the evaluation, the outputs of the methods are displayed in random order at each time. For the questions regarding photorealism, we design a similar ranking task, but this time, we show all the results in random order, and ask the participants to order these results with respect to how realistic they look. Please refer to the supplementary for a screenshot of our user study given to the participants.

Table 3 summarizes the results of our study where the average ranking scores are reported. We find that in terms of the accuracy, the human subjects prefer our proposed method against all the competing approaches. That is, our method makes only the necessary edits in the input images with respect to the given target descriptions in a precise manner. HairCLIP and StyleCLIP-GD give the next most accurate results following our model. In terms of photorealism, our results are also superior than these two approaches, indicating that our results are both accurate and photo-realistic. That said, the human subjects find the photorealism of the results of the concurrent StyleMC and StyleCLIP-LO models significantly better. However, the accuracy questions indicate that both StyleMC and StyleCLIP-LO have difficulty in manipulating the given input images in regard to the target descriptions, in contrast to our proposed model. StyleMC and StyleCLIP-LO, in general, make minimal, mostly insufficient changes in the input images (as also can be seen from Figure 9), and thus do not degrade the photorealism much.

Table 3.

Task	TediGAN-B	StyleCLIP-LO	StyleMC	StyleCLIP-GD	HairCLIP	Ours
Acc.	1.848	3.401	3.526	3.611	4.015	4.598
Real.	1.218	4.604	4.282	3.609	3.544	3.743

Table 3. User Study Results

The table represent the average rankings of the methods with respect to accuracy and realism, where the higher the value is the better the method is. The participants favor the results of our proposed model over the current state of the art when the accuracy of the manipulations is considered.

4.7 Ablation Study

During training our model, we leverage different loss terms. To analyze the contributions of these loss terms, we have performed an ablation study where we either remove or modify some of these loss terms during training. We provide visual comparisons between these models separately trained on different loss terms in Figure 11.

Fig. 12.

First, we employ the directional CLIP loss following [Gal et al. 2021] to better enforce the image and description similarity. Compared to the global CLIP loss, which directly minimizes the distance between the manipulated image \(\mathbf {x}_{out}\) and the text prompt \(\mathbf {t}_{target}\) in the CLIP space, the directional CLIP loss aligns the directions between the real and target descriptions and input and output images. As can be seen in the second column of Figure 11, the global CLIP loss suffers from artificial-looking manipulations and results in poorly constructed facial attributes as compared to the directional CLIP loss.

Second, to preserve the features and the details of the input image in the areas that we do not wish to modify, we employ the perceptual \(\mathcal {L}_{\mathrm{2}}\) and the \(\mathcal {L}_{\mathrm{LPIPS}}\) losses between the input and the output images. In theory, these perceptual loss terms contradict the directional CLIP loss, since the CLIP loss is trying to enforce the image and text similarity by manipulating the pixel values. To analyze the contribution of these perceptual terms, we have reduced the weights of these loss terms in the overall objective. The third column in Figure 11 shows a manipulation example from this experiment. As can be seen, the smile in the first row is also modified, and the model manipulates the hair style to curly hair in the second row even though this manipulations were not mentioned in the target description. This experiment demonstrates the necessity of these perceptual loss terms to prevent unwanted manipulations.

Third, we employ a cyclic-adversarial training strategy, where we first manipulate the image with a mismatching caption and then recover it by manipulating the output of the first pass with the matching target description. The fourth column in Figure 11 shows an example manipulation from the experiment where we remove this cyclic training regime. Even though the output is visually similar to the output from our full model, we observe that the cyclic consistency loss helps with the preservation of the identity as well as the manipulation accuracy.

Finally, we utilize a CLIP-guided correction module CLIPRemapper to apply the manipulations more accurately and increase the image fidelity. We see from the last two columns of the figure that without CLIPRemapper, the model is not able to apply all of the specified manipulations accurately, like the hair color in the first row or the earrings in the second row.

Table 4 shows the quantitative analysis of the experiments described above. The metrics verify that the global CLIP loss performs much worse in terms of attribute manipulations. This model is able to achieve a high CMP, since it directly optimizes image and text similarity in the CLIP space rather than aligning semantic directions. When we reduce the weight of the perceptual losses, the model is able to apply the manipulations with very high accuracy. However, this comes with the price of perceptual quality as FID suggests and unwanted manipulations as CMP suggests. Adding the cycle pass gives us a better supervision signal to train the model, as the improvements in the accuracy and the CMP suggest. Without CLIPAdapter, our model is not able to achieve great accuracy scores, suggesting that the adapter layers yield the residual latent codes corresponding to semantic directions successfully. Finally, adding CLIPRemapper to our model highly boosts the manipulation performance, with slightly decreasing photorealism in terms of FID. Overall, these quantitative results demonstrate that the combination of the loss functions and the light-weight modules we use allows our model to perform well across all metrics and apply manipulations accurately while preserving the photorealism and preventing unwanted changes.

Table 4.

	FID \(\downarrow\)	CMP \(\uparrow\)	AMA \(\uparrow\)
Ours w/ Global CLIP Loss	83.404	0.221	25.429
Ours w/o Perceptual Losses	105.432	0.194	65.571
Ours w/o Cycle Pass	85.851	0.215	40.857
Ours w/o CLIPAdapter	89.244	0.202	41.571
Ours w/o CLIPRemapper	88.395	0.216	53.28
Ours	97.210	0.221	61.429

Table 4. Quantitative Analysis of the Ablation Study

We have performed a quantitative analysis of the ablation study where we calculated the metrics for each of the described experiments for our model. The results demonstrate that our model finds a good balance for applying manipulations without decreasing the perceptual quality of the generations. The best and second-best performing models are highlighted in bold and underlined, respectively.

4.8 Limitations

In our approach, the CLIPAdapter module can be integrated with any inversion network model. Consequently, the limitations of the underlying inversion network are inherited by our approach. For instance, when employing the \(\text{e4e}\), the network may struggle to find accurate latent codes for inputs with unusual poses or challenging lighting conditions. Hence, the reconstructions occasionally result in alterations to identity and the loss of certain details. Despite these limitations, our approach remains capable of generating outputs that align with the provided textual descriptions. It is important to note that the output images are consistent with the reconstructed images, rather than the input images themselves. Figure 12 demonstrates manipulation results using our approach with various challenging inputs. As observed, our method successfully applies the manipulations with respect to the desired reconstructions, however, some details present in the input images, such as shadows or specific lighting conditions, may not be fully preserved during the unconditional inversion phase. It is worth mentioning that this is a common limitation of current GAN-based editing methods, as many of these approaches rely on a pre-trained encoder like \(\text{e4e}\) to obtain an initial inversion of the inputs. For a comprehensive comparison of competing approaches under these challenging cases, we refer readers to the supplementary material.

Fig. 13.

Additionally, the effectiveness of our method mainly lies in the proposed text-guided image encoder CLIPInverter, which estimates the residual latent code to capture the desired changes. Since CLIPInverter is trained by using a set of training images paired with corresponding textual descriptions, we observe that results of our approach might be affected by the biases that exist in the training data. For instance, Multi-Modal CelebA-HQ dataset containing human face images consists of 10 descriptions for each image, but we observe that the descriptions are not diverse, often using similar adjectives referring to certain attributes. Moreover, there is an imbalance between the number of female and male images, causing a bias toward a specific gender in certain attributes. When only attributes are used in the textual descriptions without any pronouns, unexpected gender manipulations might occur due to these biases. As observed in Figure 13, when we only use the description “wavy hair,” a gender manipulation also occurs. We can alleviate this problem by using more comprehensive textual descriptions, including additional details such as “She has wavy hair,” which yields a much more accurate manipulation. It is an interesting future direction to tackle the bias problem in a more systematic manner.

Fig. 14.

5 Conclusion

In this work, we have introduced CLIPInverter, a novel text-driven image editing approach. It can be used to manipulate an input image through the lens of StyleGAN latent space solely by providing a target textual description, which is much more intuitive than the commonly used user inputs such as sketches, strokes, or segmentation masks. The key component of our approach is the proposed text-guided adapter module called CLIPAdapter, which modulates image feature maps during the inversion to extract semantic edit directions with respect to the provided target description. Moreover, we suggest a text-guided refinement module that we refer to as CLIPRemapper, which performs an additional correction step on the predicted latent code from CLIPAdapter to further boost the accuracy of the performed edits in the input image. Our model does not require an instance-level latent code optimization or a separate training for specific text prompts as done in the prior work, and thus provides a faster alternative to the approaches that exist in the literature.

Our approach is not limited to a specific domain in that it only needs a pretrained StyleGAN model. As our experimental analysis on several different datasets illustrate, our model can handle the semantic edits through textual descriptions for very different domains. Moreover, thanks to the shared semantic space provided by the CLIP [Radford et al. 2021] model between images and text, our model can also be used to perform manipulations conditioned on another image or a novel textual description that has not been seen during training. Our experiments demonstrate significant improvements over the previous approaches in that our model can manipulate images with high accuracy and quality for any description.

Furthermore, it is important to highlight that our proposed framework is not limited to StyleGAN and can be seamlessly integrated into other deep generative models that operate on a latent space representation. Although our current implementation focuses on StyleGAN, the key contributions of our framework, namely CLIPAdapter and CLIPRemapper, are not specific to StyleGAN and can be easily adapted to other GAN architectures. This flexibility opens up opportunities for leveraging our framework in conjunction with recent advancements in latent space extension, such as dual-space GANs, which exhibit enhanced disentanglement of style and content information [Kwon and Ye 2021; Xu et al. 2022]. By incorporating our framework into these models, we can further enhance manipulation accuracy and broaden the range of images that can be generated based on textual descriptions.

Footnote

https://cyberiada.github.io/CLIPInverter

Supplementary Material

3610287-supp (3610287-supp.pdf)

Supplementary material

Download
49.70 MB

References

[1]

Rameen Abdal, Yipeng Qin, and Peter Wonka. 2019. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’19). 4431–4440. DOI:

Abstract

1 Introduction

2 Related Work

2.1 GAN Inversion

2.2 Latent Space Manipulation

2.3 Text-guided Image Manipulation

2.4 Adapter Layers

3 The Approach

3.1 Overview of CLIPInverter

3.2 CLIPAdapter: CLIP-guided Adapters for Latent Space Manipulation

3.3 CLIPRemapper: CLIP-guided Latent Vector Refinement

3.4 Training Losses

4 Experimental Evaluation

4.1 Datasets

4.2 Training Details

4.3 Evaluation Metrics

4.4 Qualitative Results

4.5 Qualitative Comparisons to Other Text-guided Manipulation Methods

4.6 Quantitative Comparisons to Other Text-guided Manipulation Methods

4.7 Ablation Study

4.8 Limitations

5 Conclusion

Footnote

Supplementary Material

References

Cited By

Index Terms

Recommendations

Designing an encoder for StyleGAN image manipulation

UniTune: Text-Driven Image Editing by Fine Tuning a Diffusion Model on a Single Image

Plenoptic Image Editing

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Login options

Full Access

Share

Share this Publication link

Share on social media

Affiliations