research-article

Open access

Customizing Text-to-Image Models with a Single Image Pair

Authors: Maxwell Jones, Sheng-Yu Wang, Nupur Kumari, David Bau, Jun-Yan ZhuAuthors Info & Claims

SA '24: SIGGRAPH Asia 2024 Conference Papers

Article No.: 6, Pages 1 - 13

https://doi.org/10.1145/3680528.3687642

Published: 03 December 2024 Publication History

All formats PDF

Abstract

Art reinterpretation is the practice of creating a variation of a reference work, making a paired artwork that exhibits a distinct artistic style. We ask if such an image pair can be used to customize a generative model to capture the demonstrated stylistic difference. We propose PairCustomization, a new customization method that learns stylistic difference from a single image pair and then applies the acquired style to the generation process. Unlike existing methods that learn to mimic a single concept from a collection of images, our method captures the stylistic difference between paired images. This allows us to apply a stylistic change without overfitting to the specific image content in the examples. To address this new task, we employ a joint optimization method that explicitly separates the style and content into distinct LoRA weight spaces. We optimize these style and content weights to reproduce the style and content images while encouraging their orthogonality. During inference, we modify the diffusion process via a new style guidance based on our learned weights. Both qualitative and quantitative experiments show that our method can effectively learn style while avoiding overfitting to image content, highlighting the potential of modeling such stylistic differences from a single image pair.

1 Introduction

Artistic works are often inspired by a reference image, a recurring scene, or even a previous piece of art [Markus 2019]. Such creations involve re-interpreting an original composition in the artist’s unique style. A notable example is Van Gogh’s Repetitions [Phillips Collection 2013], in which the artist created multiple versions of the same scenes with his distinctive expressiveness, including adaptations of other artists’ work. Such sets of variations allow close comparison of stylized art to a reference image, providing unique insights into an artist’s detailed techniques and choices.

Fig. 1:

In our work, we explore how such content-style image pairs can be used to customize a generative model to capture the demonstrated stylistic difference. Our goal is to customize a pre-trained generative model to synthesize stylized images, distilling the essence of the style from as few as a single pair without fixating on specific content. We wish to create a model capable of re-interpreting a variety of different content in the style demonstrated by the paired variation.

Prior works on model customization/personalization [Gal et al. 2022a; Kumari et al. 2023; Ruiz et al. 2023a] take one or a few images of a single concept to customize large-scale text-to-image models [Ramesh et al. 2022; Rombach et al. 2022]. While they aim to learn styles without using pairs, the generated samples from these customized models often resemble the training images’ content, such as specific objects, persons, and scene layouts. In Figure 1, we observe that standard single-image customization (3^rd row) alters the subject, color tone, and pose of the original image (1^st row). These issues arise because the artistic intent is difficult to discern from a single image: unlike image pairs that can demonstrate a style through contrasts, a singleton example will always intertwine choices of both style and content. Due to this ambiguity, the model fails to capture the artistic style accurately and, in some cases, overfits and generates the subject-specific details rather than the style, as shown in Figure 5.

On the other hand, our PairCustomization method exploits the contrast between image pairs to generate pairwise consistent images while better disentangling style and content. In Figure 1 (2^nd row), our method accurately follows the given style, turning the background into a single color matching the original background and preserving the identity and pose for each dog. Our method achieves this by disentangling the intended style from the image pair.

Our new customization task is challenging since text-to-image models were not initially designed to generate pairwise content. Even when given specific text prompts like “a portrait” and “a portrait with Picasso style”, a text-to-image diffusion model often struggles to generate images with consistent structure from the same noise seed. Therefore, it remains unclear how a customized model can generate stylized images while maintaining the original structure.

To address the challenges, we first propose a joint optimization method with separate sets of Low-Rank Adaptation [Hu et al. 2021] (LoRA) weights for style and content. The optimization encourages the content LoRA to reconstruct the content image and the style LoRA to apply the style to the content. We find that the resulting style LoRA can apply the same style to other unseen content. Furthermore, we enforce row-space orthogonality [Po et al. 2023] between style and content LoRA parameters to improve style and content disentanglement. Next, we extend the standard classifier-free guidance method [Ho and Salimans 2022] and propose style guidance. Style guidance integrates style LoRA predictions into the original denoising path, which aids in better content preservation and facilitates smoother control over the stylization strength. This method is more effective than the previous technique, where a customized model’s strength is controlled by the magnitude of LoRA weights [Ryu 2023b].

Our method is built upon Stable Diffusion XL [Podell et al. 2023]. We experiment with various image pairs, including different categories of content (e.g., portraits, animals, landscapes) and style (e.g., paintings, digital illustrations, filters). We evaluate our method on the above single image pairs and demonstrate the advantage of our method in preserving diverse structures while applying the stylization faithfully, compared to existing customization methods. Our code, models, and data are available on our webpage.

2 Related Works

Text-to-image generative models. Deep generative models aim to model the data distribution of a given training set [Dinh et al. 2017; Goodfellow et al. 2020; Ho et al. 2020; Kingma and Welling 2014; Song et al. 2021b; Van den Oord et al. 2016]. Recently, large-scale text-to-image models [Balaji et al. 2022; Chang et al. 2023; Gokaslan et al. 2023; Kang et al. 2023; Luo et al. 2023; Peebles and Xie 2023; Podell et al. 2023; Ramesh et al. 2022; Rombach et al. 2022; Saharia et al. 2022b; Sauer et al. 2023; Yu et al. 2022] trained on internet-scale training data [Byeon et al. 2022; Schuhmann et al. 2021] have shown exceptional generalization. Notably, diffusion models [Ho et al. 2020; Song et al. 2021a] stand out as the most widely adopted model class. While existing models can generate a broad spectrum of objects and concepts, they often struggle with rare or unseen concepts. Our work focuses on teaching these models to understand and depict a new style concept. Conditional generative models [Brooks et al. 2023; Isola et al. 2017; Li et al. 2023; Mou et al. 2024; Park et al. 2019; Saharia et al. 2022a; Zhang and Agrawala 2023] learn to transform images across different domains, but the training often requires thousands to millions of image pairs. We focus on a more challenging case, where only a single image pair is available.

Customizing generative models. Model customization, or personalization, aims to adapt an existing generative model with additional data, with the goal of generating outputs tailored to specific user preferences. Earlier efforts mainly focus on customizing pre-trained GANs [Goodfellow et al. 2020; Karras et al. 2019; 2020b] for smaller datasets [Karras et al. 2020a; Nitzan et al. 2022; Zhao et al. 2020], incorporating user edits [Bau et al. 2020; Wang et al. 2021; 2022], or aligning with text prompts [Gal et al. 2022b; Nitzan et al. 2023]. Recently, the focus has pivoted towards adapting large-scale text-to-image models to generate user-provided concepts, typically presented as one or a few images. Simply fine-tuning on the concept leads to overfitting. To mitigate this and enable variations via free text, several works explored different regularizations, including prior preservation [Kumari et al. 2023; Ruiz et al. 2023a], human alignment [Sohn et al. 2023], patch-based learning [Zhang et al. 2023], as well as parameter update restriction, where we only update text tokens [Alaluf et al. 2023; Daras and Dimakis 2022; Gal et al. 2022a; Voynov et al. 2023], attention layers [Gal et al. 2023a; Han et al. 2023; Kumari et al. 2023], low-rank weights [Hu et al. 2021; Ryu 2023a; Tewel et al. 2023], or clusters of neurons [Liu et al. 2023]. More recent methods focus on encoder-based approaches for faster personalization [Arar et al. 2023; Chen et al. 2023a; 2023b; Gal et al. 2023b; Li et al. 2024; Ma et al. 2023; Ruiz et al. 2023b; Shi et al. 2023; Valevski et al. 2023; Wei et al. 2023; Ye et al. 2023]. Instead of learning a single concept, several works further focus on learning multiple concepts [Avrahami et al. 2023; Gu et al. 2024; Kumari et al. 2023; Po et al. 2023; Shah et al. 2023]. Other methods [Materzynska et al. 2023; Ren et al. 2024] propose customizing text-to-video models to learn motion, while Guo et al. [2023] propose animating customized text-to-image models models by incorporating motion Low-Rank Adapter [Hu et al. 2021] modules. Our method takes inspiration from these techniques; however, we aim to address an inherently different task. Instead of learning concepts from an image collection, we customize the model to learn stylistic differences from an image pair.

Style and content separation. Various past works have explored learning a style while separating it from content [Chen and Schmidt 2016; Gatys et al. 2015; Huang and Belongie 2017; Li et al. 2017; Tenenbaum and Freeman 1996]. Our work is inspired by the seminal work Image Analogy [Hertzmann et al. 2001], a computational paradigm that takes an image pair and applies the same translation to unseen images. Common image analogy methods include patch-wise similarity matching [Hertzmann et al. 2001; Irony et al. 2005; Liao et al. 2017] and data-driven approaches [Bar et al. 2022; Park et al. 2020; Reed et al. 2015; Upchurch et al. 2016; Wang et al. 2023; Zhu et al. 2017]. Different from these, we aim to exploit the text-guided generation capabilities of large-scale models so that we can directly use the style concept with unseen context. Recently, StyleDrop [Sohn et al. 2023] has proposed to learn a custom style for masked generative transformer models. Unlike StyleDrop, we do not rely on human feedback in the process. Concurrent with our work, Hertz et al. [Hertz et al. 2023] introduced a method for generating images with style consistency, offering the option of using a style reference image. In contrast, we exploit an image pair to better discern the stylistic difference.

3 Method

Our method seeks to learn a new style from a single image pair. This task is challenging, as models tend to overfit when trained on a single image, especially when generating images in the same category as the training image (e.g., a model trained and tested on dog photos). To reduce this overfitting, we introduce a new algorithm aimed at disentangling the structure of the subject from the style of the artwork. Specifically, we leverage the image pair to learn separate model weights for style and content. At inference time, we modify the standard classifier-free guidance formulation to help preserve the original image structure when applying the learned style. In this section, we give a brief overview of diffusion models, outline our design choices, and explain the final method in detail.

3.1 Preliminary: Model Customization

Diffusion models. Diffusion models[Ho et al. 2020; Sohl-Dickstein et al. 2015; Song et al. 2021b], map Gaussian noise to the image distribution through iterative denoising. Denoising is learned by reversing the forward diffusion process x₀, …, x_T, where image x₀ is slowly diffused to random noise x_T over T timesteps, defined by \({\mathbf {x}}_t = \sqrt {\bar \alpha _t}{\mathbf {x}}_0 + \sqrt {1 - \bar \alpha _t}\epsilon\) for timestep t ∈ [0, T]. Noise \(\epsilon \sim \mathcal {N}(0, I)\) is randomly sampled, and \(\bar \alpha _t\) controls the noise strength. The training objective of diffusion models is to denoise any intermediate noisy image x_t via noise prediction:

\begin{equation} \begin{aligned} \mathbb {E}_{\epsilon ,{\mathbf {x}},{\mathbf {c}},t}\left[w_t \Vert \epsilon - \epsilon _{\theta }({\mathbf {x}}_t, {\mathbf {c}}, t)\Vert ^2\right], \end{aligned} \end{equation}

(1)

where w_t is a time-dependent weight, ϵ_θ(·) is the denoiser that learns to predict noise, and c denotes extra conditioning input, such as text. At inference, the denoiser ϵ_θ will gradually denoise random Gaussian noise into images. The resulting distribution of generated images approximates the training data distribution [Ho et al. 2020].

In our work, we use Stable Diffusion XL [Podell et al. 2023], a large-scale text-to-image model built on Latent Diffusion Models [Rombach et al. 2022]. The model consists of a U-Net [Ronneberger et al. 2015] trained on the latent space of an auto-encoder, with text conditioning from two text encoders, CLIP [Radford et al. 2021] and OpenCLIP [Ilharco et al. 2021].

Fig. 2:

Model customization with Low-Rank Adapters. Low-Rank Adapters (LoRA)[Hu et al. 2021] is a parameter-efficient fine-tuning method [Houlsby et al. 2019] that applies low-rank weight changes Δθ_LoRA to pre-trained model weights θ₀. For each layer with an initial weight \(W_0 \in \mathbb {R}^{m \times n}\), the weight update is defined by ΔW_LoRA = BA, a product of learnable matrices \(B \in \mathbb {R}^{m \times r}\) and \(A \in \mathbb {R}^{r \times n}\), where r ≪ min (m, n) to enforce the low-rank constraint. The weight matrix of a particular layer with LoRA is:

\begin{equation} \begin{aligned} W_{\text{LoRA}} = W_0 + \Delta W_{\text{LoRA}} = W_0 + BA. \end{aligned} \end{equation}

(2)

At inference time, the LoRA strength is usually controlled by a scaling factor α ∈ [0, 1] applied to the weight update ΔW_LoRA [Ryu 2023b]:

\begin{equation} \begin{aligned} W_{\text{LoRA}} = W_0 + \alpha \Delta W_{\text{LoRA}}. \end{aligned} \end{equation}

(3)

LoRA has been applied for customizing text-to-image diffusion models to learn new concepts with as few as three to five images [Ryu 2023b].

3.2 Style Extraction from an image pair

We aim to customize a pre-trained model with an artistic style in order to stylize the original model outputs while preserving their content, as shown in Figure 2 (right). To achieve this, we introduce style LoRA weight \(\theta _\text{style} = \theta _0 + \Delta \theta _\text{style}\). While a pre-trained model generates content from a noise seed and text c, style LoRA’s goal is to generate a stylized counterpart of original content from the same noise seed and a style-specific text prompt \({\mathbf {c}}_\text{style}\), where \({\mathbf {c}}_\text{style}\) is original text c appended by suffix “in <desc> style”. Here, <desc> is a placeholder for some worded description of the style (e.g., “digital art”), and style LoRA \(\theta _\text{style}\) associates <desc> to the desired style.

Unfortunately, learning style LoRA \(\theta _\text{style}\) from a single style image often leads to copying content (Figure 5). Hence, we explicitly learn disentanglement from a style and content image, denoted by \({\mathbf {x}}_\text{style}\) and \({\mathbf {x}}_\text{content}\), respectively.

Disentangling style and content. We leverage the fact that the style image shares the same layout and structure as the content image. Our key idea is to learn a separate content LoRA \(\theta _\text{content} = \theta _0 + \Delta \theta _\text{content}\) to reconstruct the content image. By explicitly modeling the content, we can train the style LoRA to “extract” the stylistic differences between the two images. We apply both style and content LoRA to reconstruct the style image, i.e., \(\theta _\text{combined} = \theta _0 + \Delta \theta _\text{content} + \Delta \theta _\text{style}\). This approach prevents leaking the content image to style LoRA, resulting in a better stylization model.

During training, we feed the content LoRA \(\theta _\text{content}\) with a content-specific text \({\mathbf {c}}_\text{content}\), which contains a random rare token V*, and feed the combined model \(\theta _\text{combined}\) with \({\mathbf {c}}_\text{style}\), where \({\mathbf {c}}_\text{style}\) is “\(\lbrace {\mathbf {c}}_\text{content}\rbrace\) in <desc> style”. Figure 2 (Left) summarizes our training process.

Jointly learning style and content. We employ two different objectives during every training step. To learn the content of the image, we first employ the standard training objective for diffusion models as described in Section 3.1 with the content image:

\begin{equation} \begin{aligned} \mathcal {L}_{\text{content}} = \mathbb {E}_{\epsilon ,{\mathbf {x}}_\text{content},t}\left[w_t\Vert \epsilon - \epsilon _{\theta _\text{content}}({\mathbf {x}}_{t, \text{content}}, {\mathbf {c}}_\text{content}, t)\Vert ^2\right], \end{aligned} \end{equation}

(4)

where \(\epsilon _{\theta _\text{content}}\) is the denoiser with content LoRA applied, x_{t, content} is a noisy content image at timestep t, and \({\mathbf {c}}_\text{content}\) is text representing the content image, including some rare token V*. Next, we optimize the combined style and content weights to reconstruct the style image. In particular, we only train the style LoRA weights during this step, while stopping the gradient flow to the content LoRA weights via stopgrad sg[ · ]:

\begin{equation} \begin{aligned} \theta _{\text{combined}} = \theta _0 + \text{sg}[\Delta \theta _{\text{content}}] + \Delta \theta _{\text{style}}. \end{aligned} \end{equation}

(5)

We then apply diffusion objective to train \(\theta _\text{combined}\) to denoise \({\mathbf {x}}_\text{t,style}\), a noisy style image at timestep t:

\begin{equation} \begin{aligned} \mathcal {L}_{\text{combined}} = \mathbb {E}_{\epsilon ,{\mathbf {x}}_\text{style},t}\left[w_t\Vert \epsilon - \epsilon _{\theta _\text{combined}}(x_{t, \text{style}}, {\mathbf {c}}_\text{style}, t)\Vert ^2\right], \end{aligned} \end{equation}

(6)

where \(\epsilon _{\theta _\text{combined}}\) is the denoiser with both LoRAs applied as in Equation 5, \({\mathbf {c}}_\text{style}\) is “\(\lbrace {\mathbf {c}}_\text{content}\rbrace\) in <desc> style”, and <desc> is a worded description of the style (e.g., “digital art”). Finally, we jointly optimize the LoRAs with the two losses:

\begin{equation} \begin{aligned} \min _{\Delta \theta _{\text{content}}, \Delta \theta _{\text{style}}} \mathcal {L}_{\text{content}} + \mathcal {L}_{\text{combined}} \end{aligned} \end{equation}

(7)

Figure 2 provides an overview of our method. Next, we discuss the regularization that promotes disentanglement of style from content.

Fig. 3:

Fig. 4:

Orthogonality between style and content LoRA. To further encourage style and content LoRAs to represent separate concepts, we enforce orthogonality upon the LoRA weights. We denote by W₀ the original weight matrix and \(W_\text{content}\), \(W_\text{style}\) the LoRA modifications (layer index omitted for simplicity). Reiterating Equation 2, we decompose \(W_\text{content}\), \(W_\text{style}\) into low-rank matrices:

\begin{equation} \begin{aligned} W_\text{content} = W_0 + B_\text{content}A_\text{content};\;W_\text{style} = W_0 + B_\text{style}A_\text{style}. \end{aligned} \end{equation}

(8)

We initialize \(B_\text{content}, B_\text{style}\) with the zero matrix and choose the rows of \(A_\text{content}\), \(A_\text{style}\) from an orthonormal basis. We then fix \(A_\text{content}\), \(A_\text{style}\) and only update \(B_\text{content}\), \(B_\text{style}\) in training. This forces the style and content LoRA updates to respond to orthogonal inputs, and empirically reduces visual artifacts, as shown in Figure 4. This technique is inspired by Po et al. [Po et al. 2023]. While their work focuses on merging multiple customized objects after each is trained separately, we apply the method for style-content separation during joint training.

3.3 Style Guidance

A common technique to improve text-to-image model’s sample quality is via classifier-free guidance [Ho and Salimans 2022]:

\begin{equation} \begin{aligned} \hat{\epsilon }_\theta ({\mathbf {x}}_t, \mathbf {c}) = \epsilon _\theta ({\mathbf {x}}_t, \varnothing) + \lambda _{\text{cfg}} (\epsilon _\theta ({\mathbf {x}}_t,\mathbf {c}) - \epsilon _\theta ({\mathbf {x}}_t, \varnothing)), \end{aligned} \end{equation}

(9)

where \(\hat{\epsilon }_\theta ({\mathbf {x}}_t, \mathbf {c}, t)\) is the new noise prediction, \(\varnothing\) denotes no conditioning, and λ_cfg controls the amplification of text guidance. For notation simplicity, we omit the timestep t in this equation and subsequent ones.

To improve pairwise consistency between original and stylized content, we propose an inference algorithm that preserves the original denoising path while adding controllable style guidance:

\begin{equation} \begin{aligned} \hat{\epsilon }_{\theta _0,\theta _\text{style}}({\mathbf {x}}_t, {\mathbf {c}}, {\mathbf {c}}_\text{style}) = \; & \epsilon _{\theta _0}({\mathbf {x}}_t, \varnothing) \\ +\ & \lambda _{\text{cfg}} (\epsilon _{\theta _0}({\mathbf {x}}_t, {\mathbf {c}}) - \epsilon _{\theta _0}({\mathbf {x}}_t, \varnothing)) \\ +\ & \lambda _{\text{style}}(\epsilon _{\theta _{\text{style}}}({\mathbf {x}}_t, {\mathbf {c}}_{ \text{style}}) - \epsilon _{\theta _0}({\mathbf {x}}_t, {\mathbf {c}})), \end{aligned} \end{equation}

(10)

where style guidance is the difference in noise prediction between style LoRA and the pre-trained model. Style guidance strength is controlled by λ_style and setting λ_style = 0 is equivalent to generating original content. In Figure 3, we compare our style guidance against scaling LoRA weights (Equation 3), and we find that our method better preserves the layout. More details and a derivation of our style guidance are in the supplement.

Previous works have also used multiple guidance terms with diffusion models, including guidance from multiple text prompts using the same model [Liu et al. 2022a] and additional image conditions [Brooks et al. 2023]. Unlike these, we obtain additional guidance from a customized model and apply it to the original model. StyleDrop [Sohn et al. 2023] considers a similar formulation with two guidance terms but for masked generative transformers. SINE [Zhang et al. 2023] uses a customized content model to apply text-based image editing to a single image, like adding snow. However, we use a customized style model to generate any image with the desired style.

Blending multiple learned styles. With a collection of models customized by our method, we can blend the learned styles as follows. Given some set of styles \(\mathcal {S}\) and strengths \(\lambda _{\text{style}_0}, \dots , \lambda _{\text{style}_n}\), we blend the style guidance from each model, and our new inference path is represented by

\begin{equation} \begin{aligned} \hat{\epsilon }_{\theta _0,\theta _\text{style}}({\mathbf {x}}_t, {\mathbf {c}}, {\mathbf {c}}_\text{style}) = \; & \epsilon _{\theta _0}({\mathbf {x}}_t, \varnothing) \\ +\ & \lambda _{\text{cfg}} (\epsilon _{\theta _0}({\mathbf {x}}_t, {\mathbf {c}}) - \epsilon _{\theta _0}({\mathbf {x}}_t, \varnothing)) \\ +\ & \sum _{\text{style}_i \in \mathcal {S}}\lambda _{\text{style}_i}(\epsilon _{\theta _{\text{style}_i}}({\mathbf {x}}_t, {\mathbf {c}}_{ \text{style}_i}) - \epsilon _{\theta _0}({\mathbf {x}}_t, {\mathbf {c}})), \end{aligned} \end{equation}

(11)

We can vary the strengths of any parameter \(\lambda _{\text{style}_i}\) to seamlessly increase or decrease style application while preserving content. Figure 10 gives a qualitative example of blending two different styles while preserving image content.

Implementation details. We train all models using an AdamW optimizer [Loshchilov and Hutter 2018] and learning rate 1 × 10^{− 5}. For baselines, we train for 500 steps. For our method, we first train our content weights on the content image for 250 steps and then train jointly for 500 additional steps. All image generation is performed using 50 steps of a PNDMScheduler [Liu et al. 2022b]. For all methods using inference with LoRA adapters, we use SDEdit [Meng et al. 2022] to further preserve the structure. Specifically, normal classifier-free guidance on the original prompt without style is used for the first 10 steps. We then apply style guidance/LoRA scale for the rest of the timesteps.

Fig. 5:

Fig. 6:

4 Experiments

4.1 Dataset

In this section, we show our method’s results on various image pairs and compare them with several baselines. We explain our dataset, baselines, and metrics in detail, then we present quantitative and qualitative results.

Fig. 7:

Datasets. To enable large-scale quantitative evaluation, we construct a diverse set of paired style and content images as follows. First, we generate 40 content images for each class: headshots, animals, and landscapes. When generating images in the headshot class, we generate 20 images with the prompt “A professional headshot of a man” and 20 images with the prompt “A professional headshot of a woman”. Similarly, we split the animal class into photos of dogs and cats. To curate synthetic pairs, we then apply image editing or image-to-image translation methods to all the content images to obtain the stylized version. For each unique prompt, we choose a single paired instance as training data and hold out the other pairs with the same prompt as a test set (Same Category). For each prompt, we also choose 5 pairs from each of the other prompts as a secondary test set (Different Category). We show all our synthetic training image pairs in the supplement. By leveraging synthetic pairs for evaluation, we can train on a single synthetic pair and test our results against held out synthetic style images. Secondly, we qualitatively compare against single artist pairs in Figure 5. Next, we describe the specific methods to create the paired dataset. First, we consider the diffusion-based image editing technique LEDITS++[Brack et al. 2023] to translate images into paintings. Next, we consider Cartoonization [Wang and Yu 2020], a GAN-based translation technique that aplies a cartoon-like effect. We also consider Stylized Neural Painting [Zou et al. 2021], which turns photos into painting strokes using a rendering based approach. Finally, we consider the image filtering technique posterization. We provide a more detailed description of each method for creating synthetic pairs in the supplement.

4.2 Baselines and Evaluation Metrics

Baselines. We compare our method against – (1) DreamBooth LoRA [Hu et al. 2021; Ryu 2023b] (DB LoRA), (2) Concept Sliders [Gandikota et al. 2023] (3) IP-adapters [Ye et al. 2023], (4) IP-adapters w/ T2I, and (5) StyleDrop [Sohn et al. 2023]. DB LoRA uses only the style image and fine-tunes low-rank adapters in all the linear layers in the attention blocks of the diffusion model. We evaluate different amounts of style applications for DB LoRA using the standard LoRA scale [Ryu 2023b]. Concept sliders presents a paired image model customization method that trains a single low-rank adapter jointly on both images, with different reconstruction losses for the style and content images. We also evaluate using the standard LoRA scale. IP-adapters is an encoder-based method that does not require training for every style and takes a style image as an extra condition separate from the text prompt. Increasing or decreasing the guidance from the input style image is possible by scaling the weight of the image conditioning. We consider the SDXL [Podell et al. 2023] implementation of this method. For the IP-Adapter, we compare against the stronger baseline of providing extra conditioning of an edge map of the content image through T2I Adapters [Mou et al. 2024] to preserve the content image structure. The recently proposed Styledrop [Sohn et al. 2023] technique for learning new styles is based on MUSE [Chang et al. 2023], and uses human feedback in its method. Since MUSE is not publicly available, we follow Style-Aligned Image Generation’s [Hertz et al. 2023] setup, and implement a version of StyleDrop on SDXL. Specifically, we train low-rank linear layers following each Feed-Forward layer in the attention blocks of SDXL. For a fair comparison, we train Styledrop without human feedback.

Evaluation metrics. When evaluating the performance of each method, we consider two quantitative metrics: perceptual distance to ground truth style images and structure preservation from the original image. A better customization method will have a low perceptual distance to the ground truth style images while still preserving content of the original image before adding style. We measure these using – (1) Distance to GT Styled: given holdout ground truth style images, we measure the perceptual distance between our styled outputs and the ground truth style images using DreamSim [Fu et al. 2023], a recent method for measuring the perceptual distance between images. DreamSim image embeddings are comprised of an ensemble of image embedding models, including CLIP [Radford et al. 2021] and DINO [Caron et al. 2021], which are then fine-tuned so the final embeddings respect human perception. We measure DreamSim distance as (1 - cosine similarity between DreamSim embeddings), where a lower value implies that the images are perceptually more similar. (2) Distance to Content Image: to measure content preservation after style application, we measure the perceptual distance of our generated style image to the original content image with no style guidance. We again use DreamSim, this time comparing styled and content images. Note here that a perceptual distance of zero to the content image is undesirable, as this would require no style to be applied. However, a better-performing method should obtain a better tradeoff between the two distances. (3) We also perform a human preference study of our method against baselines.

Fig. 8:

4.3 Results

Quantitative evaluation. We show quantitative results in Figure 7. Increased marker size (circles) indicates the higher application of style, and line color determines the method. When evaluating style similarity vs. structure preservation in Figure 7, we see that our training method’s Pareto dominates all baselines, yielding lower perceptual distance to style images while still being perceptually similar to the original content image. We find that DB LoRA and StyleDrop perform similarly, and report StyleDrop results in the supplement. Finally, we consider our method with LoRA scale during inference and other baselines with our style guidance during inference for ablation, and provide results in the supplement.

Qualitative evaluation. We compare our method with the highest performing baselines in Figure 5. The finetuning-based methods DB LoRA [Hu et al. 2021; Ryu 2023b] and Concept Sliders [Gandikota et al. 2023] outperform the encoder-based method [Ye et al. 2023] for our task. Hence, we compare against that in Figure 5. For both baselines, we modulate style application with LoRA scale (Equation 3). We observe that DB LoRA often fails to generate the style-transformed version of the original image and overfits to the training pair image when generating similar concepts. There are two main reasons why this may occur. First, we are in a challenging case where there is only 1 training image instead of the usual 3 − 5 images that customization methods use. Second, we are prompting the model on the same or very similar text prompts to the training prompt, and the baseline method overfits to the training image for these prompts. Our method preserves the structure of the original image while applying the learned style. Moreover, applying our style guidance instead of the LoRA scale benefits the baseline method as well (Figure 5, last 2 columns), as it can better preserve the structure of the original image, though it still tends to overfit to the content of the training image. We observe a similar issue for other baselines as well.

We compare our method to non finetuning-based methods in Figure 6. We observe that these methods perform worse than finetuning-based methods, especially when generating images in a category different from the training style image. We also compare with baselines using our style guidance for style application at inference time in the supplement.

User preference study. We perform a user preference study using Amazon Mechanical Turk. We test our method against all baselines, as well as a version of our method trained without the orthogonality constraint. Specifically, we test on all datasets in Section 4.1. When evaluating against DB LoRA and Concept Sliders, we consider inference with both LoRA scale as in Equation 3 and style guidance as in Equation 10. For each method, we pick a single style strength that performs most optimally according to quantitative metrics as in Figure 7. Full details are available in the supplement. We collect 400 responses per paired test of ours vs the other method. The user is shown an image generated via our method and an image generated via the other method and asked to select the image that best applies the given style to the new content image. We provide a detailed setup of the user study as well as a user study on baseline methods using our style guidance in the supplement. As shown in Figure 8, our method is favored by users in comparison to baselines, whether evaluating images generated within the same category as the training image pair or across different categories. Secondly, users prefer our full method to ours without the orthogonality constraint, specifically when evaluating on the same category as training.

Fig. 9:

Real Image Editing. Our method can also stylize real images. We use DDIM inversion [Garibi et al. 2024; Song et al. 2021a] to invert images into their noisy latent codes at some intermediate step using a reference prompt c. From here, we use our style guidance (Equation 10) with reference prompt c and new prompt c_style to denoise the noisy latent code to a stylized image. In Figure 9, we show real image editing results. We provide more details in the supplement.

Fig. 10:

Blending learned styles. We show that we can blend the learned styles by applying a new inference path, defined in Equation 11. In Figure 10, we show the results of blending two models. We can seamlessly blend the two styles at varying strengths while still preserving the content.

Fig. 11:

5 Discussion and Limitations

In this work, we have introduced a new task: customizing a text-to-image model with a single image pair. To address this task, we have developed a customization method that explicitly disentangles style and content through both training objectives and a separated parameter space. Our method enables us to grasp the style concept without memorizing the content of input examples. While our approach outperforms existing customization methods, it still exhibits several limitations, as discussed below.

Limitations. First, our method may occasionally fail to completely maintain input structure, as demonstrated in Figure 11. This could occur as background/pose change (Top), or as additional features being added (Bottom).

Second, our current method relies on test-time optimization, which takes around 15 minutes on a single A5000 GPU. This can be computationally demanding if we need to process many image styles. Leveraging encoder-based approaches [Arar et al. 2023; Ruiz et al. 2023b] for predicting style and content weights in a feed-forward manner could potentially speed up the customization process.

Lastly, our current method relies on roughly aligned image pairs. Learning styles from unaligned images that depict similar content is an interesting problem for future work.

Acknowledgments

We would like to thank Ali Jahanian, Gaurav Parmar, Ruihan Gao, Sean Liu, and Or Patashnik for their insightful feedback and input that contributed to the finished work. We also thank Jack Parkhouse and Aaron Hertzmann for providing style images. Maxwell Jones is supported by the Siebel Scholar fellowship and the Rales fellowship. This project is partly supported by the Amazon Faculty Research Award, NSF IIS-2239076, Open Philanthropy, and the Packard Fellowship.

Supplemental Material

PDF File

supplemental pdf

Download
8.77 MB

References

[1]

Yuval Alaluf, Elad Richardson, Gal Metzer, and Daniel Cohen-Or. 2023. A Neural Space-Time Representation for Text-to-Image Personalization. In SIGGRAPH Asia.

Abstract

1 Introduction

2 Related Works

3 Method

3.1 Preliminary: Model Customization

3.2 Style Extraction from an image pair

3.3 Style Guidance

4 Experiments

4.1 Dataset

4.2 Baselines and Evaluation Metrics

4.3 Results

5 Discussion and Limitations

Acknowledgments

Supplemental Material

References

Index Terms

Recommendations

StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models

Image palette: painting style transfer via brushstroke control synthesis

ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors

Comments

Information

Published In

Sponsors

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Conference

Acceptance Rates

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

View options

PDF

eReader

HTML Format

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations