CLIP-Guided Image Synthesis

By Simon Schölzel in Project

June 18, 2022

Abstract

A write-up that summarizes my personal learnings and experimentation with CLIP-guided image synthesis. It covers VQGAN, CLIP, Inference-by-Optimization, as well as various text-to-image and image-to-image experiments.


VQGAN

Illustrated VQGAN

  1. A raw image \(x\in \mathbb{R}^{H\times W\times 3}\) serves as input to the model.

  2. The image runs through a CNN-based image encoder \(E\left( \cdot \right)\) to produce the latent variable \(\hat{z}=E\left( x \right)\in \mathbb{R}^{h\times w\times n_z}\), which summarizes local image patches. The encoder is thus designed to exploit the strong local, translation-invariant correlations within images.

  3. The CNN representation \(\hat{z}\) is quantized using an element-wise compression module \(\textbf{q}\left( \cdot \right)\), applied to each \(\hat{z}_{ij}\), to obtain a quantized image encoding:

    $$ z_\textbf{q} =\textbf{q}\left( \hat{z} \right):=\left( \arg \min_{z_k\in \mathcal{Z}} \left\| \hat{z}_{ij}-z_k \right\| \right)\in \mathbb{R}^{h\times w\times n_z} $$

    Thereby, each \(\hat{z}_{ij}\in\hat{z}\) is mapped to its closest codebook entry \(z_k\) (i.e., its nearest neighbour), which stems from a learned discrete codebook \(\mathcal{Z}=\left\{ z_k \right\}_{k=1}^{K}\subset \mathbb{R}^{n_z}\), with the hyperparameter \(n_z\) denoting the dimensionality of the codebook entries and \(h\times w\) the number of codebook tokens. Each codebook entry represents a cluster centroid in latent space that is optimized during training. Generally, the smaller the codebook, the more abstract the learned visual parts, which leads to a potentially more creative but less controllable generator due to a higher degree of interpolation. A minimal code sketch of this quantization step follows the list below.

    Illustrated VQGAN

    Equivalently, \(z_\textbf{q}\) can hence be represented as a sequence of codebook indices which act as input tokens:

    $$ z_\textbf{q} = \left( z_{s_{ij}} \right) $$

    with \(s_{ij}\) denoting the respective codebook entry indices which form the sequence \(s\in\left\{ 0,..., \left| \mathcal{Z} \right|-1 \right\}^{h\times w}\).

  4. The image is reconstructed from the codebook vectors (i.e., quantized image) via a generator \(G\left( \cdot \right)\) (which is trained in an adversarial fashion by trying to outwit a patch-based discriminator \(D\left( \cdot \right)\)):

    $$ \hat{x}=G\left( z_\textbf{q} \right)=G\left( \textbf{q}\left( E\left( x \right) \right) \right) $$

  5. Finally, three losses are back-propagated through the network. First, the compression module is trained to learn a rich codebook:

    $$ \mathcal{L}_{VQ}\left( E,G,\mathcal{Z} \right) = \left\| x-\hat{x} \right\|^{2} + \left\| sg\left[ E\left( x \right) \right] -z_\textbf{q} \right\|^2_2 + \left\| sg\left[ z_\textbf{q} \right] -E\left( x \right) \right\|^2_2 \tag{1} $$

    where the first term on the RHS is the reconstruction loss, the second term facilitates the learning of meaningful codebook embeddings, and the third term is the commitment loss (with \(sg\left[ \cdot \right]\) denoting the stop-gradient operator).

    Second, the GAN is learned under the classic adversarial training framework:

    $$ \mathcal{L}_{GAN}\left( \left\{ E,G,\mathcal{Z} \right\}, D \right)= \left[ \log D\left( x \right) + \log \left( 1- D\left( \hat{x} \right) \right) \right] \tag{2} $$

    Both are combined into the overall VQGAN loss using an adaptive weight \(\lambda\):

    $$ \arg \min_{\substack{E,G,\mathcal{Z}}} \max_{\substack{D}} \mathbb{E}_{x\sim p\left( x \right)}\left[ \mathcal{L}_{VQ}\left( E,G,\mathcal{Z} \right)+\lambda \mathcal{L}_{GAN}\left( \left\{ E,G,\mathcal{Z} \right\}, D \right) \right] \tag{1+2} $$

    Third, in a second training stage, an autoregressive Transformer model is trained to perform next-index prediction, i.e., to model the next codebook entry in a sequence of entry indices under the following loss function:

    $$ \mathcal{L}_{Transformer}= \mathbb{E}_{x\sim p\left( x \right)}\left[ -\log p\left( s \right) \right] \tag{3} $$

    with \(p\left( s \right)=\prod_i p\left( s_i | s_{\lt i} \right)\). This way, the Transformer learns long-range dependencies between visual parts and, thus, the global composition of the image.

    Alternatively, the synthesis process can be conditioned on additional information (e.g., class information) by concatenating a conditioning embedding \(c\) to the sequence, i.e., \(p\left( s|c \right)=\prod_i p\left( s_i | s_{\lt i},c \right)\). The Transformer employs sliding-window attention to enable the generation of high-resolution images, with the caveat that each prediction only attends to a local (but still sufficiently large) set of codebook vectors.

    Illustrated Sliding Attention
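
To make the quantization step in (3.) more concrete, here is a minimal PyTorch sketch of the nearest-neighbour codebook lookup together with the codebook and commitment terms of Eq. (1). Tensor shapes, variable names (`z_hat`, `codebook`), and the straight-through gradient trick are illustrative assumptions rather than the original VQGAN implementation.

```python
import torch
import torch.nn.functional as F

def quantize(z_hat: torch.Tensor, codebook: torch.Tensor):
    """Map each encoder output z_hat_ij to its nearest codebook entry.

    z_hat:    (B, h, w, n_z) continuous encoder output E(x)
    codebook: (K, n_z) learnable codebook entries z_k
    Returns the quantized encoding z_q, the index map s, and the VQ loss terms.
    """
    B, h, w, n_z = z_hat.shape
    flat = z_hat.reshape(-1, n_z)                         # (B*h*w, n_z)

    # Pairwise L2 distances between encoder outputs and codebook entries
    dist = torch.cdist(flat, codebook)                    # (B*h*w, K)
    s = dist.argmin(dim=-1)                               # codebook indices s_ij
    z_q = codebook[s].reshape(B, h, w, n_z)               # quantized encoding

    # Codebook and commitment terms of Eq. (1); detach() plays the role of sg[.]
    codebook_loss = F.mse_loss(z_hat.detach(), z_q)
    commitment_loss = F.mse_loss(z_hat, z_q.detach())

    # Straight-through estimator: gradients flow from z_q back into the encoder
    z_q = z_hat + (z_q - z_hat).detach()
    return z_q, s.reshape(B, h, w), codebook_loss + commitment_loss

# Illustrative usage with random tensors (K=1024, n_z=256, h=w=16)
codebook = torch.randn(1024, 256, requires_grad=True)
z_hat = torch.randn(2, 16, 16, 256, requires_grad=True)
z_q, s, vq_loss = quantize(z_hat, codebook)
```

The index map `s` is exactly the token sequence that the Transformer later models autoregressively.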

CLIP

CLIP Contrastive Pre-training

  1. A (desirably large) batch of image-text pairs serves as input to CLIP. The pairs are gleaned and filtered from publicly available sources on the Internet, resulting in a dataset comprising a total of 400 million image-text pairs. In contrast to most computer vision models, the image captions provide a natural language signal that is richer than any single class label and are, thus, less restrictive in terms of supervision (natural language supervision).

  2. Text and images run through a text and an image encoder, respectively. The former is a 12-layer Transformer (with masked self-attention) while the latter is either a ResNet or a Vision Transformer (ViT). The most prominent VQGAN-CLIP framework employs a ViT-B/32, which models 2D images as a sequence of embedded 32x32 image patches that serve as 1D input to a vanilla Transformer encoder (in conjunction with positional embeddings).

    Image captions are capped at 76 tokens, with the EOS token of the final hidden layer representing the dense text embedding.

  3. The text and image embeddings are projected into a multi-modal embedding space in which pairwise cosine similarity scores quantify the match between every text and image embedding in the batch.

  4. The model is optimized to produce a high-quality, multi-modal embedding space in which embeddings of actual image-text pairs lie in close proximity to each other while embeddings of randomly mismatched pairs are pushed far apart. The authors frame this pre-training task by asking which caption goes with which image. A minimal sketch of this contrastive objective follows the list below.
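
As a rough sketch of steps 3 and 4, the snippet below computes the pairwise cosine-similarity matrix for a batch of (hypothetical, pre-computed) image and text embeddings and applies the symmetric cross-entropy objective used in contrastive pre-training. The fixed temperature and the embedding dimensionality are simplifying assumptions; in CLIP the temperature is a learned parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N matched image-text pairs.

    image_emb, text_emb: (N, d) projections into the joint embedding space.
    The N diagonal entries of the similarity matrix are the true pairs.
    """
    # L2-normalize so that dot products equal cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) matrix of pairwise cosine similarities, scaled by the temperature
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))               # correct pair = diagonal

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Illustrative usage with random embeddings (N=8 pairs, d=512)
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```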

Inference-by-Optimization

Illustrated VQGAN-CLIP

  1. Encode text prompt using CLIP’s pre-trained text encoder.

  2. Create a noise vector \(\hat{z}\) which serves as input to the pre-trained VQGAN model. The model’s decoder initiates the synthesis process by generating an image candidate that is to be evaluated by CLIP. Alternatively, \(\hat{z}\) can be initialized by encoding a preexisting image with the VQGAN model and using it as a prior. This way, the image synthesis does not start from random noise (cold start) but from a given image (warm start), which may already exhibit desired properties of the output image.

  3. Encode the synthesized image using CLIP’s pre-trained image encoder. Importantly, prior to encoding, the image is segmented into a batch of random cutouts since CLIP operates on lower-resolution images than VQGAN. In addition, the cutouts undergo various augmentations, such as flipping, color jitter, or noising.

  4. Compute similarity between embedded text prompt and synthesized image cutouts within CLIP’s multi-modal embedding space.

  5. “Optimize” \(\hat{z}\) using back-propagation, with the text-image similarity score acting as a loss function that is to be maximized (gradient ascent). The loss is averaged over all cutouts and augmented images to stabilize the update steps. Finally, the loss is extended by a regularization term that penalizes the appearance of visual artefacts:

    $$ \mathcal{L}_{VQGAN-CLIP}=\mathcal{L}_{CLIP}+\alpha\times \frac{1}{N}\sum_{i=0}^{N} Z_i^2 $$

    Typically, optimization over roughly 500 update steps yields stable and visually appealing results. A schematic sketch of this optimization loop follows the list below.
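
The listing below condenses the inference-by-optimization loop into a schematic PyTorch sketch. `vqgan_decode`, `clip_encode_image`, `make_cutouts`, and the pre-computed `text_emb` are stand-in callables and tensors for the corresponding components of the Crowson notebook, so read this as a sketch under those assumptions rather than a drop-in implementation.

```python
import torch
import torch.nn.functional as F

def optimize_latent(z_hat, text_emb, vqgan_decode, clip_encode_image,
                    make_cutouts, steps=500, lr=0.1, alpha=0.05):
    """Gradient-based 'inference by optimization' over the VQGAN latent z_hat.

    z_hat:             (1, n_z, h, w) latent, either random noise or an encoded image
    text_emb:          (1, d) CLIP text embedding of the prompt (pre-computed)
    vqgan_decode:      callable, latent -> RGB image at VQGAN resolution
    clip_encode_image: callable, batch of cutouts -> (N, d) CLIP image embeddings
    make_cutouts:      callable, image -> batch of random, augmented crops at CLIP resolution
    """
    z_hat = z_hat.clone().requires_grad_(True)
    opt = torch.optim.Adam([z_hat], lr=lr)

    for _ in range(steps):
        image = vqgan_decode(z_hat)                      # candidate image
        cutouts = make_cutouts(image)                    # random crops + augmentations
        image_emb = F.normalize(clip_encode_image(cutouts), dim=-1)

        # Negative cosine similarity, averaged over all cutouts (to be minimized)
        sim = image_emb @ F.normalize(text_emb, dim=-1).t()
        clip_loss = -sim.mean()

        # Regularization on the latent to suppress visual artefacts
        reg_loss = alpha * z_hat.pow(2).mean()

        loss = clip_loss + reg_loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    return z_hat.detach()
```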

Experiments

Code Implementation

All experiments were conducted using Google Colab. The code is based on the seminal Jupyter Notebook disseminated by Katherine Crowson. Click to download the adapted Notebook.

Delicious Chocolate Bar Made of < material > | ArtStation HD

Materials: algae, concept art, crystalline matter, gold, jellies and peanuts, spectral matter, wood, moon sand

Wooden Banana Temple in an Underwater Kingdom | < art style >

Art styles: concept art, painting by Hokusai, pixelart, sketch by Leonardo Da Vinci, trending on artstation, unreal engine, Behance HD, ink drawing, Studio Ghibli style, watercolour, chalk art, cel shading, low poly, matte painting, pencil sketch, disney style, steampunk, storybook illustration, tilt shift, 8K 3D

Marble Temple in the Volcanic Highlands | < art style >

Image Prompt: trending on artstation, steampunk
Model Output: trending on artstation, steampunk

Logo Generation

Model Output: black-and-white | nature-lover, colorful | style-of-1970s-sci-fi-book-cover, colorful | nature-lover, colorful | trending-on-artstation

Resources

Initial idea and codebase:

  • Notebook by Katherine Crowson, translated into English by somewheresy.
  • Crowson, K./Biderman, S./Kornis, D./Stander, D./Hallahan, E./Castricato, L./Raff, E. (2022): VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance, arXiv working paper 2022-04-18. Link

Intuition:

  • Steinbrück, A. (2022): Explaining the code of the popular text-to-image algorithm (VQGAN+CLIP in PyTorch), blog article 2022-04-11. Link
  • Miranda, L. (2021): The Illustrated VQGAN, blog article 2021-08-21. Link

History of AI art:

  • Snell, C. (2021): Alien Dreams: An Emerging Art Scene, blog article 2021-06-30. Link
  • Morris, J. (2022): The Weird and Wonderful World of AI Art, blog article 2022-01-28. Link
  • Baschez, N. (2022): DALL·E 2 and The Origin of Vibe Shifts, blog article 2022-04-22. Link

Style Inspirations:

  • @kingdomakrillic (2021): CLIP + VQGAN keyword comparison, blog article 2021-07-23. Link