CLIP-Guided Image Synthesis
By Simon Schölzel in Project
June 18, 2022
Abstract
A write-up that summarizes my personal learnings from experimenting with CLIP-guided image synthesis. It covers VQGAN, CLIP, Inference-by-Optimization, as well as various text-to-image and image-to-image experiments.
VQGAN
- A raw image \(x\in \mathbb{R}^{H\times W\times 3}\) serves as input to the model.
- The image runs through a CNN-based image encoder \(E\left( \cdot \right)\) to produce the latent variable \(\hat{z}=E\left( x \right)\in \mathbb{R}^{h\times w\times n_z}\) which summarizes local image patches. Hence, the encoder is designed to exploit strong local, position-invariant correlations within images.
- The CNN representation \(\hat{z}\) is quantized using an element-wise compression module \(\textbf{q}\left( \cdot \right)\) per \(\hat{z}_{ij}\) to obtain a quantized image encoding (a PyTorch sketch of this lookup and the associated loss terms follows below the list):
$$ z_\textbf{q} =\textbf{q}\left( \hat{z} \right):=\left( \arg\min_{z_k\in \mathcal{Z}} \left\| \hat{z}_{ij}-z_k \right\| \right)\in \mathbb{R}^{h\times w\times n_z} $$
Thereby, each \(\hat{z}_{ij}\in\hat{z}\) is mapped to its closest codebook entry \(z_k\) (i.e., nearest neighbour) which stems from a learned discrete codebook \(\mathcal{Z}=\left\{ z_k \right\}_{k=1}^{K}\subset \mathbb{R}^{n_z}\), with the hyperparameters \(n_z\) denoting the dimensionality of the codebook entries and \(h\times w\) the number of codebook tokens. Each codebook entry represents a cluster centroid in latent space that is optimized during training. Generally, the smaller the codebook size, the more abstract the learned visual parts. This leads to a potentially more creative but less controllable generator through a higher degree of interpolation. Equivalently, \(z_\textbf{q}\) can hence be represented as a sequence of codebook indices which act as input tokens:
$$ z_\textbf{q} = \left( z_{s_{ij}} \right) $$
with \(s_{ij}\) denoting the respective codebook entry indices which form the sequence \(s\in\left\{ 0,\dots, \left| \mathcal{Z} \right|-1 \right\}^{h\times w}\).
- The image is reconstructed from the codebook vectors (i.e., the quantized image) via a generator \(G\left( \cdot \right)\) (which is trained in an adversarial fashion by trying to outwit a patch-based discriminator \(D\left( \cdot \right)\)):
$$ \hat{x}=G\left( z_\textbf{q} \right)=G\left( \textbf{q}\left( E\left( x \right) \right) \right) $$
- Finally, three losses are back-propagated through the network. First, the compression module is trained to learn a rich codebook:
$$ \mathcal{L}_{VQ}\left( E,G,\mathcal{Z} \right) = \left\| x-\hat{x} \right\|^{2} + \left\| sg\left[ E\left( x \right) \right] -z_\textbf{q} \right\|^2_2 + \left\| sg\left[ z_\textbf{q} \right] -E\left( x \right) \right\|^2_2 \tag{1} $$
where the first term on the RHS describes the reconstruction loss, the second term facilitates the learning of meaningful embeddings (codebook loss), and the last term is the commitment loss; \(sg\left[ \cdot \right]\) denotes the stop-gradient operator.
Second, the GAN is learned under the classic adversarial training framework:
$$ \mathcal{L}_{GAN}\left( \left\{ E,G,\mathcal{Z} \right\}, D \right)= \left[ \log D\left( x \right) + \log\left( 1-D\left( \hat{x} \right) \right) \right] \tag{2} $$
Both are combined into the overall VQGAN loss using an adaptive weight \(\lambda\):
$$ \arg\min_{E,G,\mathcal{Z}} \max_{D} \mathbb{E}_{x\sim p\left( x \right)}\left[ \mathcal{L}_{VQ}\left( E,G,\mathcal{Z} \right)+\lambda\, \mathcal{L}_{GAN}\left( \left\{ E,G,\mathcal{Z} \right\}, D \right) \right] \tag{1+2} $$
Third, an autoregressive Transformer model is concurrently trained to perform next-index prediction, i.e., to model the next codebook entry in a sequence of entry indices, under the following loss function:
$$ \mathcal{L}_{Transformer}= \mathbb{E}_{x\sim p\left( x \right)}\left[ -\log p\left( s \right) \right] \tag{3} $$
with \(p\left( s \right)=\prod_i p\left( s_i | s_{\lt i} \right)\) (a minimal sketch of this next-index prediction objective also follows below). This way, the Transformer learns long-range dependencies between visual parts and, thus, the global composition of the image. Alternatively, the synthesis process can be conditioned on additional information (e.g., class information) by concatenating the index sequence with a conditioning embedding \(c\), i.e., \(p\left( s|c \right)=\prod_i p\left( s_i | s_{\lt i},c \right)\). The Transformer employs sliding-window attention to enable the generation of high-resolution images at the caveat of only paying attention to a local (but still sufficiently large) set of codebook vectors.
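To make the quantization step and the stop-gradient terms of Eq. (1) concrete, here is a minimal PyTorch sketch of the nearest-neighbour codebook lookup \(\textbf{q}\left( \cdot \right)\). It is a simplification for illustration, not the official taming-transformers implementation; the module name, the commitment weight \(\beta\) (borrowed from the VQ-VAE formulation, with \(\beta=1\) recovering Eq. (1)), and the straight-through gradient trick are assumptions on my part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Minimal sketch of the element-wise quantization q(.) and the
    codebook/commitment terms of Eq. (1). Not the official taming-transformers code."""

    def __init__(self, num_codes: int = 1024, code_dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)  # Z = {z_k}, k = 1..K
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment weight (assumption; beta = 1 matches Eq. (1))

    def forward(self, z_e: torch.Tensor):
        # z_e = E(x) with shape (B, n_z, h, w) -> flatten to (B*h*w, n_z)
        B, C, H, W = z_e.shape
        flat = z_e.permute(0, 2, 3, 1).reshape(-1, C)

        # Squared Euclidean distance of every z_hat_ij to every codebook entry z_k.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        indices = dist.argmin(dim=1)                     # s_ij: nearest-neighbour index
        z_q = self.codebook(indices).view(B, H, W, C).permute(0, 3, 1, 2)

        # Codebook and commitment losses; sg[.] is realised via .detach().
        codebook_loss = F.mse_loss(z_q, z_e.detach())    # ||sg[E(x)] - z_q||^2
        commit_loss = F.mse_loss(z_q.detach(), z_e)      # ||sg[z_q] - E(x)||^2
        vq_loss = codebook_loss + self.beta * commit_loss

        # Straight-through estimator: copy gradients from z_q back to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(B, H, W), vq_loss
```

The reconstruction term \(\left\| x-\hat{x} \right\|^{2}\) of Eq. (1) is computed on the generator output \(G\left( z_\textbf{q} \right)\) and added separately, together with the adversarial loss of Eq. (2).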
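Likewise, the next-index prediction objective of Eq. (3) boils down to a standard cross-entropy over codebook indices. A short sketch, assuming a hypothetical causal Transformer `model` that maps an index sequence to per-position logits over the \(\left| \mathcal{Z} \right|\) codebook entries:

```python
import torch
import torch.nn.functional as F


def next_index_loss(model, s: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of Eq. (3): -log p(s) = -sum_i log p(s_i | s_<i).

    s: (B, h*w) tensor of codebook indices produced by the quantizer.
    model: causal Transformer returning logits of shape (B, L, |Z|).
    A dedicated start/conditioning token is omitted for brevity.
    """
    inputs, targets = s[:, :-1], s[:, 1:]   # teacher forcing: predict s_i from s_<i
    logits = model(inputs)                  # (B, L-1, |Z|)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```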
CLIP
- A (desirably large) batch of image-text pairs serves as input to CLIP. Pairs are gleaned and filtered from publicly available sources on the Internet, resulting in a dataset comprising a total of 400 million image-text pairs. In contrast to most computer vision models, the image captions provide a natural language signal richer than any single class label and are, thus, less restrictive in terms of supervision (natural language supervision).
- Text and images run through a text and image encoder, respectively. The former is a 12-layer Transformer while the latter is either a ResNet or Vision Transformer (ViT). The most prominent VQGAN-CLIP framework employs a ViT-B/32 which models 2D images in terms of embedded 32x32 image patches that serve as a 1D input sequence to a vanilla Transformer encoder (in conjunction with positional embeddings); see the patch-embedding sketch below the list. Image captions are capped at 76 tokens, with the EOS token of the final hidden layer representing the dense text embedding.
- The text and image embeddings are projected into a multi-modal embedding space in which pairwise cosine similarity scores quantify the match between each text and image embedding included in each batch.
- The model is optimized to produce a high-quality, multi-modal embedding space in which embeddings of actual (randomly matched) image-text pairs are in close proximity to (far apart from) each other. The authors frame this pre-training task by asking “which caption goes with which image?” A sketch of the resulting contrastive objective follows below.
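As a concrete picture of the ViT input pipeline, the patch embedding can be realised as a strided convolution: a kernel with size and stride equal to the patch size is equivalent to linearly projecting each flattened 32x32 patch. The following is a simplified sketch of my own (module and parameter names are placeholders, not CLIP's actual implementation):

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Embed non-overlapping 32x32 patches as a 1D token sequence for a ViT encoder."""

    def __init__(self, img_size: int = 224, patch_size: int = 32, dim: int = 768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2       # 7 * 7 = 49 for 224px input
        # A conv with kernel = stride = patch size is equivalent to a linear
        # projection of each flattened patch.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed
```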
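The contrastive pre-training objective itself can be sketched as a symmetric cross-entropy over the batch-wise cosine-similarity matrix, loosely following the pseudocode in the CLIP paper. The function below is my own condensed version and omits details such as mixed precision and temperature clamping:

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of N matched image-text pairs.

    img_emb, txt_emb: (N, d) projections into the multi-modal embedding space.
    logit_scale: learned temperature parameter (exponentiated as in the CLIP paper).
    """
    img_emb = F.normalize(img_emb, dim=-1)          # unit length -> dot product = cosine
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale.exp() * img_emb @ txt_emb.t()   # (N, N) similarity matrix

    targets = torch.arange(img_emb.size(0), device=img_emb.device)  # diagonal = true pairs
    loss_i = F.cross_entropy(logits, targets)       # which caption goes with each image?
    loss_t = F.cross_entropy(logits.t(), targets)   # which image goes with each caption?
    return (loss_i + loss_t) / 2
```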
Inference-by-Optimization
- Encode the text prompt using CLIP's pre-trained text encoder.
- Create a noise vector \(\hat{z}\) which serves as input to the pre-trained VQGAN model. The model's decoder initiates the synthesis process by generating an image candidate that is to be evaluated by CLIP. Alternatively, \(\hat{z}\) can be initialized by providing a preexisting image as a prior which is encoded by the VQGAN model. This way, the image synthesis does not start from random noise (cold start) but rather from a given image (warm start) which may already exhibit desired properties of the output image.
- Encode the synthesized image using CLIP's pre-trained image encoder. Importantly, prior to encoding, the image is segmented into a batch of random cutouts as CLIP operates on lower-resolution images relative to VQGAN. In addition, the synthesized image undergoes various augmentations, such as flipping, color jitter, or noising.
- Compute the similarity between the embedded text prompt and the synthesized image cutouts within CLIP's multi-modal embedding space.
- “Optimize” \(\hat{z}\) using back-propagation with the text-image similarity score acting as a loss function that is to be maximized (gradient ascent). The loss is computed by averaging over all crops and augmented images to stabilize the update steps. Finally, the loss is extended by a regularization term to penalize the appearance of visual artefacts (a condensed sketch of the full optimization loop follows below):
$$ \mathcal{L}_{VQGAN-CLIP}=\mathcal{L}_{CLIP}+\alpha \cdot \frac{1}{N}\sum_{i=1}^{N} Z_i^2 $$
Typically, optimizing for around 500 update steps yields stable and visually appealing results.
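Putting the pieces together, the procedure can be condensed into a short optimization loop. The sketch below assumes pre-loaded `vqgan` and `clip_model` objects exposing `decode` and `encode_image` methods as well as pre-computed `text_features`; these names, the latent shape, and the hyperparameter values are placeholders of mine rather than the exact API of the Crowson notebook:

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T


def synthesize(vqgan, clip_model, text_features, steps=500, n_cuts=32,
               lr=0.05, alpha=0.1, device="cuda"):
    """Gradient-ascent on CLIP similarity with respect to the VQGAN latent z_hat."""
    # Cold start: a random latent. For a warm start, z would instead be
    # obtained from vqgan.encode(prior_image). Latent shape is a placeholder.
    z = torch.randn(1, 256, 16, 16, device=device, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    augment = T.Compose([T.RandomHorizontalFlip(), T.ColorJitter(0.1, 0.1)])

    for step in range(steps):
        opt.zero_grad()
        image = vqgan.decode(z)  # candidate image, assumed to be in [0, 1] and > 64 px

        # Random cutouts resized to CLIP's input resolution (224 px for ViT-B/32).
        cutouts = []
        for _ in range(n_cuts):
            size = torch.randint(64, min(image.shape[-2:]), ()).item()
            crop = T.RandomCrop(size)(image)
            cutouts.append(F.interpolate(crop, size=224, mode="bilinear"))
        batch = augment(torch.cat(cutouts, dim=0))

        # CLIP's own input normalization is omitted here for brevity.
        image_features = F.normalize(clip_model.encode_image(batch), dim=-1)
        sim = (image_features @ F.normalize(text_features, dim=-1).t()).mean()

        # Maximize similarity (minimize its negative) plus the latent-magnitude penalty.
        loss = -sim + alpha * z.pow(2).mean()
        loss.backward()
        opt.step()
    return vqgan.decode(z).detach()
```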
Experiments
All experiments were conducted using Google Colab. The code is based on the seminal Jupyter Notebook disseminated by Katherine Crowson. Click to download the adapted Notebook.
Delicious Chocolate Bar Made of < material > | ArtStation HD
Wooden Banana Temple in an Underwater Kingdom | < art style >
Marble Temple in the Volcanic Highlands | < art style >
Logo Generation
Resources
Initial idea and codebase:
- Notebook by Katherine Crowson, translated into English by somewheresy.
- Crowson, K./Biderman, S./Kornis, D./Stander, D./Hallahan, E./Castricato, L./Raff, E. (2022): VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance, arXiv working paper 2022-04-18. Link
Intuition:
- Steinbrück, A. (2022): Explaining the code of the popular text-to-image algorithm (VQGAN+CLIP in PyTorch), blog article 2022-04-11. Link
- Miranda, L. (2021): The Illustrated VQGAN, blog article 2021-08-21. Link
History of AI art:
- Snell, C. (2021): Alien Dreams: An Emerging Art Scene, blog article 2021-06-30. Link
- Morris, J. (2022): The Weird and Wonderful World of AI Art, blog article 2022-01-28. Link
- Baschez, N. (2022): DALL·E 2 and The Origin of Vibe Shifts, blog article 2022-04-22. Link
Style Inspirations:
- @kingdomakrillic (2021): CLIP + VQGAN keyword comparison, blog article 2021-07-23. Link