Our method seeks to learn a new style from a single image pair. This task is challenging, as models tend to overfit when trained on a single image, especially when generating images in the same category as the training image (e.g., a model trained and tested on dog photos). To reduce this overfitting, we introduce a new algorithm aimed at disentangling the structure of the subject from the style of the artwork. Specifically, we leverage the image pair to learn separate model weights for style and content. At inference time, we modify the standard classifier-free guidance formulation to help preserve the original image structure when applying the learned style. In this section, we give a brief overview of diffusion models, outline our design choices, and explain the final method in detail.
3.1 Preliminary: Model Customization
Diffusion models. Diffusion models [Ho et al. 2020; Sohl-Dickstein et al. 2015; Song et al. 2021b] map Gaussian noise to the image distribution through iterative denoising. Denoising is learned by reversing the forward diffusion process \({\mathbf {x}}_0, \dots, {\mathbf {x}}_T\), in which image \({\mathbf {x}}_0\) is slowly diffused into random noise \({\mathbf {x}}_T\) over \(T\) timesteps, defined by \({\mathbf {x}}_t = \sqrt {\bar \alpha _t}{\mathbf {x}}_0 + \sqrt {1 - \bar \alpha _t}\epsilon\) for timestep \(t \in [0, T]\). Noise \(\epsilon \sim \mathcal {N}(0, I)\) is randomly sampled, and \(\bar \alpha _t\) controls the noise strength. The training objective of diffusion models is to denoise any intermediate noisy image \({\mathbf {x}}_t\) via noise prediction:
\[\mathcal {L}(\theta) = \mathbb {E}_{{\mathbf {x}}_0, \epsilon, t, {\mathbf {c}}}\big [ w_t \Vert \epsilon - \epsilon _\theta ({\mathbf {x}}_t, {\mathbf {c}}, t) \Vert ^2 \big ], \tag{1}\]
where \(w_t\) is a time-dependent weight, \(\epsilon _\theta (\cdot)\) is the denoiser that learns to predict noise, and \({\mathbf {c}}\) denotes extra conditioning input, such as text. At inference, the denoiser \(\epsilon _\theta\) gradually denoises random Gaussian noise into images. The resulting distribution of generated images approximates the training data distribution [Ho et al. 2020].
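To make the preliminaries concrete, below is a minimal sketch of the forward noising step and the noise-prediction loss in PyTorch; the `denoiser` module, its call signature, and the `alphas_cumprod` schedule are illustrative assumptions rather than the exact Stable Diffusion XL implementation, and the weight \(w_t\) is set to 1.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(denoiser, x0, cond, alphas_cumprod):
    """Standard noise-prediction objective (Eq. 1), with w_t = 1."""
    B = x0.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)        # random timesteps
    eps = torch.randn_like(x0)                              # epsilon ~ N(0, I)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)              # \bar{alpha}_t
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps      # forward diffusion
    eps_pred = denoiser(x_t, t, cond)                       # predict the noise
    return F.mse_loss(eps_pred, eps)
```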
In our work, we use Stable Diffusion XL [Podell et al. 2023], a large-scale text-to-image model built on Latent Diffusion Models [Rombach et al. 2022]. The model consists of a U-Net [Ronneberger et al. 2015] trained on the latent space of an auto-encoder, with text conditioning from two text encoders, CLIP [Radford et al. 2021] and OpenCLIP [Ilharco et al. 2021].
Model customization with Low-Rank Adapters. Low-Rank Adapters (LoRA) [Hu et al. 2021] is a parameter-efficient fine-tuning method [Houlsby et al. 2019] that applies low-rank weight changes \(\Delta \theta _\text{LoRA}\) to pre-trained model weights \(\theta _0\). For each layer with an initial weight \(W_0 \in \mathbb {R}^{m \times n}\), the weight update is defined by \(\Delta W_\text{LoRA} = BA\), a product of learnable matrices \(B \in \mathbb {R}^{m \times r}\) and \(A \in \mathbb {R}^{r \times n}\), where \(r \ll \min (m, n)\) to enforce the low-rank constraint. The weight matrix of a particular layer with LoRA is:
\[W = W_0 + \Delta W_\text{LoRA} = W_0 + BA. \tag{2}\]
At inference time, the LoRA strength is usually controlled by a scaling factor \(\alpha \in [0, 1]\) applied to the weight update \(\Delta W_\text{LoRA}\) [Ryu 2023b]:
\[W = W_0 + \alpha \Delta W_\text{LoRA}. \tag{3}\]
LoRA has been applied for customizing text-to-image diffusion models to learn new concepts with as few as three to five images [Ryu 2023b].
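As a reference point for the notation above, the following is a minimal sketch of a LoRA-augmented linear layer with an inference-time scale \(\alpha\); the class name and constructor arguments are illustrative choices, not the interface of any particular LoRA library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a low-rank update W = W0 + alpha * B @ A (Eqs. 2-3)."""
    def __init__(self, base: nn.Linear, rank: int, alpha: float = 1.0):
        super().__init__()
        self.base = base                       # frozen pre-trained weight W0
        self.base.weight.requires_grad_(False)
        m, n = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(rank, n) / rank**0.5)  # A in R^{r x n}
        self.B = nn.Parameter(torch.zeros(m, rank))              # B in R^{m x r}, zero-init
        self.alpha = alpha                     # inference-time LoRA strength

    def forward(self, x):
        delta_w = self.B @ self.A              # low-rank update BA
        return self.base(x) + self.alpha * nn.functional.linear(x, delta_w)
```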
3.2 Style Extraction from an Image Pair
We aim to customize a pre-trained model with an artistic style in order to stylize the original model outputs while preserving their content, as shown in Figure 2 (right). To achieve this, we introduce style LoRA weights \(\theta _\text{style} = \theta _0 + \Delta \theta _\text{style}\). While the pre-trained model generates content from a noise seed and text \({\mathbf {c}}\), the style LoRA’s goal is to generate a stylized counterpart of the original content from the same noise seed and a style-specific text prompt \({\mathbf {c}}_\text{style}\), where \({\mathbf {c}}_\text{style}\) is the original text \({\mathbf {c}}\) appended with the suffix “in <desc> style”. Here, <desc> is a placeholder for a worded description of the style (e.g., “digital art”), and the style LoRA \(\theta _\text{style}\) associates <desc> with the desired style.
Unfortunately, learning the style LoRA \(\theta _\text{style}\) from a single style image often leads to copying content (Figure 5). Hence, we explicitly learn to disentangle the two from a style image and a content image, denoted by \({\mathbf {x}}_\text{style}\) and \({\mathbf {x}}_\text{content}\), respectively.
Disentangling style and content. We leverage the fact that the style image shares the same layout and structure as the content image. Our key idea is to learn a separate content LoRA \(\theta _\text{content} = \theta _0 + \Delta \theta _\text{content}\) to reconstruct the content image. By explicitly modeling the content, we can train the style LoRA to “extract” the stylistic differences between the two images. We apply both the style and content LoRA to reconstruct the style image, i.e., \(\theta _\text{combined} = \theta _0 + \Delta \theta _\text{content} + \Delta \theta _\text{style}\). This approach prevents content-image details from leaking into the style LoRA, resulting in a better stylization model.
During training, we feed the content LoRA \(\theta _\text{content}\) a content-specific text \({\mathbf {c}}_\text{content}\), which contains a random rare token V*, and feed the combined model \(\theta _\text{combined}\) the prompt \({\mathbf {c}}_\text{style}\), where \({\mathbf {c}}_\text{style}\) is “\(\lbrace {\mathbf {c}}_\text{content}\rbrace\) in <desc> style”. Figure 2 (left) summarizes our training process.
Jointly learning style and content. We employ two different objectives during every training step. To learn the content of the image, we first employ the standard training objective for diffusion models, as described in Section 3.1, with the content image:
\[\mathcal {L}_\text{content} = \mathbb {E}_{{\mathbf {x}}_\text{0,content}, \epsilon, t}\big [ w_t \Vert \epsilon - \epsilon _{\theta _\text{content}}({\mathbf {x}}_\text{t,content}, {\mathbf {c}}_\text{content}, t) \Vert ^2 \big ], \tag{4}\]
where \(\epsilon _{\theta _\text{content}}\) is the denoiser with the content LoRA applied, \({\mathbf {x}}_\text{t,content}\) is a noisy content image at timestep \(t\), and \({\mathbf {c}}_\text{content}\) is text representing the content image, including some rare token V*. Next, we optimize the combined style and content weights to reconstruct the style image. In particular, we only train the style LoRA weights during this step, while stopping the gradient flow to the content LoRA weights via the stop-gradient operation sg[ · ]:
\[\theta _\text{combined} = \theta _0 + \text{sg}\big [ \Delta \theta _\text{content} \big ] + \Delta \theta _\text{style}. \tag{5}\]
We then apply the diffusion objective to train \(\theta _\text{combined}\) to denoise \({\mathbf {x}}_\text{t,style}\), a noisy style image at timestep \(t\):
\[\mathcal {L}_\text{style} = \mathbb {E}_{{\mathbf {x}}_\text{0,style}, \epsilon, t}\big [ w_t \Vert \epsilon - \epsilon _{\theta _\text{combined}}({\mathbf {x}}_\text{t,style}, {\mathbf {c}}_\text{style}, t) \Vert ^2 \big ], \tag{6}\]
where \(\epsilon _{\theta _\text{combined}}\) is the denoiser with both LoRAs applied as in Equation 5, \({\mathbf {c}}_\text{style}\) is “\(\lbrace {\mathbf {c}}_\text{content}\rbrace\) in <desc> style”, and <desc> is a worded description of the style (e.g., “digital art”). Finally, we jointly optimize the LoRAs with the two losses:
\[\mathcal {L}_\text{total} = \mathcal {L}_\text{content} + \mathcal {L}_\text{style}. \tag{7}\]
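A minimal sketch of one joint training step is given below. It assumes the `diffusion_loss` helper from the earlier sketch and a hypothetical `apply_lora` helper that builds a denoiser from the base weights plus the given LoRA deltas, and it realizes the stop-gradient of Equation 5 by detaching the content LoRA parameters when forming the combined model.

```python
def joint_training_step(base_model, content_lora, style_lora,
                        x_content, c_content, x_style, c_style,
                        alphas_cumprod, apply_lora):
    """One optimization step of L_total = L_content + L_style (Eqs. 4-7)."""
    # Content objective (Eq. 4): only the content LoRA is applied.
    denoiser_content = apply_lora(base_model, [content_lora])
    loss_content = diffusion_loss(denoiser_content, x_content, c_content, alphas_cumprod)

    # Style objective (Eq. 6): both LoRAs are applied, but gradients are
    # blocked from flowing into the content LoRA (sg[.] in Eq. 5).
    frozen_content = [p.detach() for p in content_lora]
    denoiser_combined = apply_lora(base_model, [frozen_content, style_lora])
    loss_style = diffusion_loss(denoiser_combined, x_style, c_style, alphas_cumprod)

    return loss_content + loss_style       # Eq. 7, backpropagated by the caller
```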
Figure 2 provides an overview of our method. Next, we discuss the regularization that promotes disentanglement of style from content.
Orthogonality between style and content LoRA. To further encourage the style and content LoRAs to represent separate concepts, we enforce orthogonality on the LoRA weights. We denote by \(W_0\) the original weight matrix and by \(W_\text{content}\), \(W_\text{style}\) the LoRA modifications (layer index omitted for simplicity). Reiterating Equation 2, we decompose \(W_\text{content}\), \(W_\text{style}\) into low-rank matrices:
\[W_\text{content} = B_\text{content} A_\text{content}, \quad W_\text{style} = B_\text{style} A_\text{style}. \tag{8}\]
We initialize \(B_\text{content}\), \(B_\text{style}\) with the zero matrix and choose the rows of \(A_\text{content}\), \(A_\text{style}\) from an orthonormal basis. We then fix \(A_\text{content}\), \(A_\text{style}\) and only update \(B_\text{content}\), \(B_\text{style}\) during training. This forces the style and content LoRA updates to respond to orthogonal inputs and empirically reduces visual artifacts, as shown in Figure 4. This technique is inspired by Po et al. [2023]. While their work focuses on merging multiple customized objects after each is trained separately, we apply the method for style-content separation during joint training.
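The sketch below illustrates one way such an initialization could look: the rows of \(A_\text{content}\) and \(A_\text{style}\) are taken from disjoint parts of a random orthonormal basis and frozen, while only the zero-initialized \(B\) matrices remain trainable. The function name and the use of `torch.linalg.qr` are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def init_orthogonal_loras(n: int, m: int, rank: int):
    """Build (A, B) pairs for content and style LoRAs of a layer W0 in R^{m x n}.

    Rows of A_content and A_style come from disjoint parts of an orthonormal
    basis of R^n and are frozen; the B matrices start at zero and are trained.
    """
    # Random orthonormal basis of R^n (columns of Q are orthonormal).
    Q, _ = torch.linalg.qr(torch.randn(n, 2 * rank))
    A_content = nn.Parameter(Q[:, :rank].T.contiguous(), requires_grad=False)
    A_style = nn.Parameter(Q[:, rank:2 * rank].T.contiguous(), requires_grad=False)
    B_content = nn.Parameter(torch.zeros(m, rank))
    B_style = nn.Parameter(torch.zeros(m, rank))
    return (A_content, B_content), (A_style, B_style)
```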
3.3 Style Guidance
A common technique to improve a text-to-image model’s sample quality is classifier-free guidance [Ho and Salimans 2022]:
\[\hat{\epsilon }_\theta ({\mathbf {x}}_t, {\mathbf {c}}) = \epsilon _\theta ({\mathbf {x}}_t, \varnothing) + \lambda _\text{cfg}\big (\epsilon _\theta ({\mathbf {x}}_t, {\mathbf {c}}) - \epsilon _\theta ({\mathbf {x}}_t, \varnothing)\big ), \tag{9}\]
where \(\hat{\epsilon }_\theta ({\mathbf {x}}_t, {\mathbf {c}})\) is the new noise prediction, \(\varnothing\) denotes no conditioning, and \(\lambda _\text{cfg}\) controls the amplification of text guidance. For notational simplicity, we omit the timestep \(t\) in this equation and subsequent ones.
To improve pairwise consistency between the original and stylized content, we propose an inference algorithm that preserves the original denoising path while adding controllable style guidance:
\[\hat{\epsilon } = \epsilon _{\theta _0}({\mathbf {x}}_t, \varnothing) + \lambda _\text{cfg}\big (\epsilon _{\theta _0}({\mathbf {x}}_t, {\mathbf {c}}) - \epsilon _{\theta _0}({\mathbf {x}}_t, \varnothing)\big ) + \lambda _\text{style}\big (\epsilon _{\theta _\text{style}}({\mathbf {x}}_t, {\mathbf {c}}_\text{style}) - \epsilon _{\theta _0}({\mathbf {x}}_t, {\mathbf {c}})\big ), \tag{10}\]
where the style guidance term is the difference in noise prediction between the style LoRA model and the pre-trained model. The style guidance strength is controlled by \(\lambda _\text{style}\), and setting \(\lambda _\text{style} = 0\) is equivalent to generating the original content. In Figure 3, we compare our style guidance against scaling the LoRA weights (Equation 3) and find that our method better preserves the layout. More details and a derivation of our style guidance are in the supplement.
Previous works have also used multiple guidance terms with diffusion models, including guidance from multiple text prompts using the same model [Liu et al. 2022a] and additional image conditions [Brooks et al. 2023]. Unlike these, we obtain additional guidance from a customized model and apply it to the original model. StyleDrop [Sohn et al. 2023] considers a similar formulation with two guidance terms, but for masked generative transformers. SINE [Zhang et al. 2023] uses a customized content model to apply text-based image editing to a single image, such as adding snow. In contrast, we use a customized style model to generate any image with the desired style.
Blending multiple learned styles. With a collection of models customized by our method, we can blend the learned styles as follows. Given a set of styles \(\mathcal {S}\) and strengths \(\lambda _{\text{style}_0}, \dots , \lambda _{\text{style}_n}\), we blend the style guidance from each model, and our new inference path is represented by
\[\hat{\epsilon } = \epsilon _{\theta _0}({\mathbf {x}}_t, \varnothing) + \lambda _\text{cfg}\big (\epsilon _{\theta _0}({\mathbf {x}}_t, {\mathbf {c}}) - \epsilon _{\theta _0}({\mathbf {x}}_t, \varnothing)\big ) + \sum _{i=0}^{n} \lambda _{\text{style}_i}\big (\epsilon _{\theta _{\text{style}_i}}({\mathbf {x}}_t, {\mathbf {c}}_{\text{style}_i}) - \epsilon _{\theta _0}({\mathbf {x}}_t, {\mathbf {c}})\big ). \tag{11}\]
We can vary the strength of any parameter \(\lambda _{\text{style}_i}\) to seamlessly increase or decrease the application of each style while preserving content. Figure 10 gives a qualitative example of blending two different styles while preserving image content.
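Under the same assumptions as the single-style sketch above, blending simply adds one style-guidance term per customized model, as in Equation 11:

```python
import torch

@torch.no_grad()
def blended_style_eps(base_unet, style_unets, x_t, t,
                      c, c_styles, c_null, lambda_cfg, lambda_styles):
    """Noise prediction blending several style LoRA models (Eq. 11)."""
    eps_uncond = base_unet(x_t, t, c_null)
    eps_text = base_unet(x_t, t, c)
    eps = eps_uncond + lambda_cfg * (eps_text - eps_uncond)
    for unet_i, c_i, lam_i in zip(style_unets, c_styles, lambda_styles):
        eps = eps + lam_i * (unet_i(x_t, t, c_i) - eps_text)   # one term per style
    return eps
```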
Implementation details. We train all models using the AdamW optimizer [Loshchilov and Hutter 2018] with a learning rate of \(1 \times 10^{-5}\). For baselines, we train for 500 steps. For our method, we first train the content weights on the content image for 250 steps and then train jointly for 500 additional steps. All image generation is performed using 50 steps of a PNDMScheduler [Liu et al. 2022b]. For all methods that use LoRA adapters at inference, we use SDEdit [Meng et al. 2022] to further preserve structure: normal classifier-free guidance on the original prompt without style is used for the first 10 steps, and we then apply style guidance (or the LoRA scale) for the remaining timesteps.
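As a rough sketch of this inference schedule, assuming the `style_guided_eps` helper from Section 3.3 and a scheduler with a diffusers-style `timesteps`/`step` interface (an assumption, not the exact pipeline code):

```python
def stylize(scheduler, base_unet, style_unet, x_T,
            c, c_style, c_null, lambda_cfg, lambda_style, warmup_steps=10):
    """Denoise with plain CFG for the first steps, then add style guidance."""
    x_t = x_T
    for i, t in enumerate(scheduler.timesteps):
        if i < warmup_steps:
            # Warm-up: original prompt only, no style guidance (lambda_style = 0).
            eps = style_guided_eps(base_unet, style_unet, x_t, t,
                                   c, c_style, c_null, lambda_cfg, lambda_style=0.0)
        else:
            eps = style_guided_eps(base_unet, style_unet, x_t, t,
                                   c, c_style, c_null, lambda_cfg, lambda_style)
        x_t = scheduler.step(eps, t, x_t).prev_sample   # one denoising update
    return x_t
```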