Ctrl-X

Kuan Heng Lin¹* Sicheng Mo¹* Ben Klingher¹ Fangzhou Mu² Bolei Zhou¹

¹UCLA ²NVIDIA

*Equal contribution

NeurIPS 2024

[Paper] [Code]

Overview

We present Ctrl-X, a simple training-free and guidance-free framework for text-to-image (T2I) generation with structure and appearance control. Given user-provided structure and appearance images, Ctrl-X designs feedforward structure control to enable structure alignment with the structure image and semantic-aware appearance transfer to facilitate the appearance transfer from the appearance image. Ctrl-X supports novel structure control with arbitrary condition images of any modality, is significantly faster than prior training-free appearance transfer methods, and provides instant plug-and-play to any T2I and text-to-video (T2V) diffusion model.

How does it work? Given clean structure and appearance latents, we first obtain noised structure and appearance latents via the diffusion forward process, then extracting their U-Net features from a pretrained T2I diffusion model. When denoising the output latent, we inject convolution and self-attention features from the structure latent and leverage self-attention correspondence to transfer spatially-aware appearance statistics from the appearance latent to achieve structure and appearance control. We name our method "Ctrl-X" because we reformulate the controllable generation problem by 'cutting' (and 'pasting') structure preservation and semantic-aware stylization together.

Results: Structure and appearance control

Results of training-free and guidance-free T2I diffusion with structure and appearance control, where Ctrl-X supports a diverse variety of structure images, including natural images, ControlNet-supported conditions (e.g., canny maps, normal maps), and in-the-wild conditions (e.g., wireframes, 3D meshes). The base model here is Stable Diffusion XL v1.0.

Results: Multi-subject structure and appearance control

Ctrl-X is capable of multi-subject generation with semantic correspondence between appearance and structure images across both subjects and backgrounds. In comparison, ControlNet + IP-Adapter often fails at transferring all subject and background appearances.

Results: Prompt-driven conditional generation

Ctrl-X also supports prompt-driven conditional generation, where it generates an output image complying with the given text prompt while aligning with the structure of the structure image. Ctrl-X continues to support any structure image/condition type here as well. The base model here is Stable Diffusion XL v1.0.

Results: Extension to video generation

We can directly apply Ctrl-X to text-to-video (T2V) models. We show results of AnimateDiff v1.5.3 (with base model Realistic Vision v5.1) here.

BibTeX

@inproceedings{lin2024ctrlx,
    author = {Lin, {Kuan Heng} and Mo, Sicheng and Klingher, Ben and Mu, Fangzhou and Zhou, Bolei},
    booktitle = {Advances in Neural Information Processing Systems},
    title = {Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance},
    year = {2024}
}

Related Work

Sicheng Mo, Fangzhou Mu, Kuan Heng Lin, Yanli Liu, Bochen Guan, Yin Li, Bolei Zhou. FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition. CVPR 2024.
Comment: Training-free conditional generation by guidance in diffusion U-Net subspaces for structure control and appearance regularization.

Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or. Cross-Image Attention for Zero-Shot Appearance Transfer. SIGGRAPH 2024.
Comment: Guidance-free appearance transfer to natural images with self-attention key + value swaps via cross-image correspondence.