CLIP-Guided Image Synthesis
By Simon Schölzel in Project
June 18, 2022
Abstract
A write-up that summarizes my personal learnings from experimenting with CLIP-guided image synthesis. It covers VQGAN, CLIP, Inference-by-Optimization, as well as various text-to-image and image-to-image experiments.
VQGAN
- A raw image \(x\in \mathbb{R}^{H\times W\times 3}\) serves as input to the model.
- The image runs through a CNN-based image encoder \(E\left( \cdot \right)\) to produce the latent variable \(\hat{z}=E\left( x \right)\in \mathbb{R}^{h\times w\times n_z}\) which summarizes local image patches. Hence, the encoder is designed to exploit strong local, position-invariant correlations within images.
- The CNN representation \(\hat{z}\) is quantized using an element-wise compression module \(\textbf{q}\left( \cdot \right)\) per \(\hat{z}_{ij}\) to obtain a quantized image encoding (a PyTorch sketch of this lookup and the associated loss terms follows below the list):
$$ z_\textbf{q} =\textbf{q}\left( \hat{z} \right):=\left( \arg\min_{z_k\in \mathcal{Z}} \left\| \hat{z}_{ij}-z_k \right\| \right)\in \mathbb{R}^{h\times w\times n_z} $$
Thereby, each \(\hat{z}_{ij}\in\hat{z}\) is mapped to its closest codebook entry \(z_k\) (i.e., nearest neighbour) which stems from a learned discrete codebook \(\mathcal{Z}=\left\{ z_k \right\}_{k=1}^{K}\subset \mathbb{R}^{n_z}\), with the hyperparameters \(n_z\) denoting the dimensionality of the codebook entries and \(h\times w\) the number of codebook tokens. Each codebook entry represents a cluster centroid in latent space that is optimized during training. Generally, the smaller the codebook size, the more abstract the learned visual parts. This leads to a potentially more creative but less controllable generator through a higher degree of interpolation. Equivalently, \(z_\textbf{q}\) can hence be represented as a sequence of codebook indices which act as input tokens:
$$ z_\textbf{q} = \left( z_{s_{ij}} \right) $$
with \(s_{ij}\) denoting the respective codebook entry indices which form the sequence \(s\in\left\{ 0,\dots, \left| \mathcal{Z} \right|-1 \right\}^{h\times w}\).
- The image is reconstructed from the codebook vectors (i.e., the quantized image) via a generator \(G\left( \cdot \right)\) (which is trained in an adversarial fashion by trying to outwit a patch-based discriminator \(D\left( \cdot \right)\)):
$$ \hat{x}=G\left( z_\textbf{q} \right)=G\left( \textbf{q}\left( E\left( x \right) \right) \right) $$
- Finally, three losses are back-propagated through the network. First, the compression module is trained to learn a rich codebook:
$$ \mathcal{L}_{VQ}\left( E,G,\mathcal{Z} \right) = \left\| x-\hat{x} \right\|^{2} + \left\| sg\left[ E\left( x \right) \right] -z_\textbf{q} \right\|^2_2 + \left\| sg\left[ z_\textbf{q} \right] -E\left( x \right) \right\|^2_2 \tag{1} $$
where the first term on the RHS describes the reconstruction loss, the second term facilitates the learning of meaningful embeddings (codebook loss), and the last term is the commitment loss; \(sg\left[ \cdot \right]\) denotes the stop-gradient operator.
Second, the GAN is learned under the classic adversarial training framework:
$$ \mathcal{L}_{GAN}\left( \left\{ E,G,\mathcal{Z} \right\}, D \right)= \left[ \log D\left( x \right) + \log\left( 1-D\left( \hat{x} \right) \right) \right] \tag{2} $$
Both are combined into the overall VQGAN loss using an adaptive weight \(\lambda\):
$$ \arg\min_{E,G,\mathcal{Z}} \max_{D} \mathbb{E}_{x\sim p\left( x \right)}\left[ \mathcal{L}_{VQ}\left( E,G,\mathcal{Z} \right)+\lambda\, \mathcal{L}_{GAN}\left( \left\{ E,G,\mathcal{Z} \right\}, D \right) \right] \tag{1+2} $$
Third, an autoregressive Transformer model is concurrently trained to perform next-index prediction, i.e., to model the next codebook entry in a sequence of entry indices, under the following loss function:
$$ \mathcal{L}_{Transformer}= \mathbb{E}_{x\sim p\left( x \right)}\left[ -\log p\left( s \right) \right] \tag{3} $$
with \(p\left( s \right)=\prod_i p\left( s_i | s_{\lt i} \right)\) (a minimal sketch of this next-index prediction objective also follows below). This way, the Transformer learns long-range dependencies between visual parts and, thus, the global composition of the image. Alternatively, the synthesis process can be conditioned on additional information (e.g., class information) by concatenating the index sequence with a conditioning embedding \(c\), i.e., \(p\left( s|c \right)=\prod_i p\left( s_i | s_{\lt i},c \right)\). The Transformer employs sliding-window attention to enable the generation of high-resolution images at the caveat of only paying attention to a local (but still sufficiently large) set of codebook vectors.
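To make the quantization step and the stop-gradient terms of Eq. (1) concrete, here is a minimal PyTorch sketch of the nearest-neighbour codebook lookup \(\textbf{q}\left( \cdot \right)\). It is a simplification for illustration, not the official taming-transformers implementation; the module name, the commitment weight \(\beta\) (borrowed from the VQ-VAE formulation, with \(\beta=1\) recovering Eq. (1)), and the straight-through gradient trick are assumptions on my part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Minimal sketch of the element-wise quantization q(.) and the
    codebook/commitment terms of Eq. (1). Not the official taming-transformers code."""

    def __init__(self, num_codes: int = 1024, code_dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)  # Z = {z_k}, k = 1..K
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment weight (assumption; beta = 1 matches Eq. (1))

    def forward(self, z_e: torch.Tensor):
        # z_e = E(x) with shape (B, n_z, h, w) -> flatten to (B*h*w, n_z)
        B, C, H, W = z_e.shape
        flat = z_e.permute(0, 2, 3, 1).reshape(-1, C)

        # Squared Euclidean distance of every z_hat_ij to every codebook entry z_k.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        indices = dist.argmin(dim=1)                     # s_ij: nearest-neighbour index
        z_q = self.codebook(indices).view(B, H, W, C).permute(0, 3, 1, 2)

        # Codebook and commitment losses; sg[.] is realised via .detach().
        codebook_loss = F.mse_loss(z_q, z_e.detach())    # ||sg[E(x)] - z_q||^2
        commit_loss = F.mse_loss(z_q.detach(), z_e)      # ||sg[z_q] - E(x)||^2
        vq_loss = codebook_loss + self.beta * commit_loss

        # Straight-through estimator: copy gradients from z_q back to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(B, H, W), vq_loss
```

The reconstruction term \(\left\| x-\hat{x} \right\|^{2}\) of Eq. (1) is computed on the generator output \(G\left( z_\textbf{q} \right)\) and added separately, together with the adversarial loss of Eq. (2).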
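Likewise, the next-index prediction objective of Eq. (3) boils down to a standard cross-entropy over codebook indices. A short sketch, assuming a hypothetical causal Transformer `model` that maps an index sequence to per-position logits over the \(\left| \mathcal{Z} \right|\) codebook entries:

```python
import torch
import torch.nn.functional as F


def next_index_loss(model, s: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of Eq. (3): -log p(s) = -sum_i log p(s_i | s_<i).

    s: (B, h*w) tensor of codebook indices produced by the quantizer.
    model: causal Transformer returning logits of shape (B, L, |Z|).
    A dedicated start/conditioning token is omitted for brevity.
    """
    inputs, targets = s[:, :-1], s[:, 1:]   # teacher forcing: predict s_i from s_<i
    logits = model(inputs)                  # (B, L-1, |Z|)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```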
CLIP
- A (desirably large) batch of image-text pairs serves as input to CLIP. Pairs are gleaned and filtered from publicly available sources on the Internet, resulting in a dataset comprising a total of 400 million image-text pairs. In contrast to most computer vision models, the image captions provide a natural language signal richer than any single class label and are, thus, less restrictive in terms of supervision (natural language supervision).
- Text and images run through a text and image encoder, respectively. The former is a 12-layer Transformer while the latter is either a ResNet or Vision Transformer (ViT). The most prominent VQGAN-CLIP framework employs a ViT-B/32 which models 2D images in terms of embedded 32x32 image patches that serve as a 1D input sequence to a vanilla Transformer encoder (in conjunction with positional embeddings); see the patch-embedding sketch below the list. Image captions are capped at 76 tokens, with the EOS token of the final hidden layer representing the dense text embedding.
- The text and image embeddings are projected into a multi-modal embedding space in which pairwise cosine similarity scores quantify the match between each text and image embedding included in each batch.
- The model is optimized to produce a high-quality, multi-modal embedding space in which embeddings of actual (randomly matched) image-text pairs are in close proximity to (far apart from) each other. The authors frame this pre-training task by asking “which caption goes with which image?” A sketch of the resulting contrastive objective follows below.
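As a concrete picture of the ViT input pipeline, the patch embedding can be realised as a strided convolution: a kernel with size and stride equal to the patch size is equivalent to linearly projecting each flattened 32x32 patch. The following is a simplified sketch of my own (module and parameter names are placeholders, not CLIP's actual implementation):

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Embed non-overlapping 32x32 patches as a 1D token sequence for a ViT encoder."""

    def __init__(self, img_size: int = 224, patch_size: int = 32, dim: int = 768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2       # 7 * 7 = 49 for 224px input
        # A conv with kernel = stride = patch size is equivalent to a linear
        # projection of each flattened patch.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed
```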
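The contrastive pre-training objective itself can be sketched as a symmetric cross-entropy over the batch-wise cosine-similarity matrix, loosely following the pseudocode in the CLIP paper. The function below is my own condensed version and omits details such as mixed precision and temperature clamping:

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of N matched image-text pairs.

    img_emb, txt_emb: (N, d) projections into the multi-modal embedding space.
    logit_scale: learned temperature parameter (exponentiated as in the CLIP paper).
    """
    img_emb = F.normalize(img_emb, dim=-1)          # unit length -> dot product = cosine
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale.exp() * img_emb @ txt_emb.t()   # (N, N) similarity matrix

    targets = torch.arange(img_emb.size(0), device=img_emb.device)  # diagonal = true pairs
    loss_i = F.cross_entropy(logits, targets)       # which caption goes with each image?
    loss_t = F.cross_entropy(logits.t(), targets)   # which image goes with each caption?
    return (loss_i + loss_t) / 2
```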
Inference-by-Optimization
- Encode the text prompt using CLIP's pre-trained text encoder.
- Create a noise vector \(\hat{z}\) which serves as input to the pre-trained VQGAN model. The model's decoder initiates the synthesis process by generating an image candidate that is to be evaluated by CLIP. Alternatively, \(\hat{z}\) can be initialized by providing a preexisting image as a prior which is encoded by the VQGAN model. This way, the image synthesis does not start from random noise (cold start) but rather from a given image (warm start) which may already exhibit desired properties of the output image.
- Encode the synthesized image using CLIP's pre-trained image encoder. Importantly, prior to encoding, the image is segmented into a batch of random cutouts as CLIP operates on lower-resolution images relative to VQGAN. In addition, the synthesized image undergoes various augmentations, such as flipping, color jitter, or noising.
- Compute the similarity between the embedded text prompt and the synthesized image cutouts within CLIP's multi-modal embedding space.
- “Optimize” \(\hat{z}\) using back-propagation with the text-image similarity score acting as a loss function that is to be maximized (gradient ascent). The loss is computed by averaging over all crops and augmented images to stabilize the update steps. Finally, the loss is extended by a regularization term to penalize the appearance of visual artefacts (a condensed sketch of the full optimization loop follows below):
$$ \mathcal{L}_{VQGAN-CLIP}=\mathcal{L}_{CLIP}+\alpha \cdot \frac{1}{N}\sum_{i=1}^{N} Z_i^2 $$
Typically, optimizing for around 500 update steps yields stable and visually appealing results.
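Putting the pieces together, the procedure can be condensed into a short optimization loop. The sketch below assumes pre-loaded `vqgan` and `clip_model` objects exposing `decode` and `encode_image` methods as well as pre-computed `text_features`; these names, the latent shape, and the hyperparameter values are placeholders of mine rather than the exact API of the Crowson notebook:

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T


def synthesize(vqgan, clip_model, text_features, steps=500, n_cuts=32,
               lr=0.05, alpha=0.1, device="cuda"):
    """Gradient-ascent on CLIP similarity with respect to the VQGAN latent z_hat."""
    # Cold start: a random latent. For a warm start, z would instead be
    # obtained from vqgan.encode(prior_image). Latent shape is a placeholder.
    z = torch.randn(1, 256, 16, 16, device=device, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    augment = T.Compose([T.RandomHorizontalFlip(), T.ColorJitter(0.1, 0.1)])

    for step in range(steps):
        opt.zero_grad()
        image = vqgan.decode(z)  # candidate image, assumed to be in [0, 1] and > 64 px

        # Random cutouts resized to CLIP's input resolution (224 px for ViT-B/32).
        cutouts = []
        for _ in range(n_cuts):
            size = torch.randint(64, min(image.shape[-2:]), ()).item()
            crop = T.RandomCrop(size)(image)
            cutouts.append(F.interpolate(crop, size=224, mode="bilinear"))
        batch = augment(torch.cat(cutouts, dim=0))

        # CLIP's own input normalization is omitted here for brevity.
        image_features = F.normalize(clip_model.encode_image(batch), dim=-1)
        sim = (image_features @ F.normalize(text_features, dim=-1).t()).mean()

        # Maximize similarity (minimize its negative) plus the latent-magnitude penalty.
        loss = -sim + alpha * z.pow(2).mean()
        loss.backward()
        opt.step()
    return vqgan.decode(z).detach()
```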
Experiments
All experiments were conducted using Google Colab. The code is based on the seminal Jupyter Notebook disseminated by Katherine Crowson. Click to download the adapted Notebook.
Delicious Chocolate Bar Made of < material > | ArtStation HD
Wooden Banana Temple in an Underwater Kingdom | < art style >
Marble Temple in the Volcanic Highlands | < art style >
Logo Generation
Resources
Initial idea and codebase:
- Notebook by Katherine Crowson, translated into English by somewheresy.
- Crowson, K./Biderman, S./Kornis, D./Stander, D./Hallahan, E./Castricato, L./Raff, E. (2022): VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance, arXiv working paper 2022-04-18. Link
Intuition:
- Steinbrück, A. (2022): Explaining the code of the popular text-to-image algorithm (VQGAN+CLIP in PyTorch), blog article 2022-04-11. Link
- Miranda, L. (2021): The Illustrated VQGAN, blog article 2021-08-21. Link
History of AI art:
- Snell, C. (2021): Alien Dreams: An Emerging Art Scene, blog article 2021-06-30. Link
- Morris, J. (2022): The Weird and Wonderful World of AI Art, blog article 2022-01-28. Link
- Baschez, N. (2022): DALL·E 2 and The Origin of Vibe Shifts, blog article 2022-04-22. Link
Style Inspirations:
- @kingdomakrillic (2021): CLIP + VQGAN keyword comparison, blog article 2021-07-23. Link