The DreamBooth Technique

By Simon Schölzel in Project

December 30, 2022



DreamBooth

DreamBooth is a few-shot personalization technique for fine-tuning large, pretrained text-to-image models (e.g., DALL-E 2, Imagen, Stable Diffusion). Based on a small reference set of training images of a given subject or object (henceforth concept), the DreamBooth technique learns a custom identifier for the given concept and implants the concept embedding into the model's output domain. This enables the model to synthesize images of the underlying concept in different contexts and settings with very high quality.

Key Components

Fine-tuning dataset: A limited set of fine-tuning images depicting the concept of interest. The concept images should vary in pose, background, angle, etc. to promote learning of the concept representation. The text prompt paired with these images should take the form "A photo of a [concept identifier] [concept class]". The reference to the concept class ensures that the model can leverage its prior knowledge about the general concept class.
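As a concrete illustration, here is a minimal sketch of how the two prompts could be constructed, borrowing the identifier and class from the Pokéball model described further below; the class-only prompt is what the prior-preservation loss described below operates on:

```python
concept_identifier = "pkblz"   # rare-token identifier (see next paragraph)
concept_class = "pokeball"     # general class the concept belongs to

# Prompt paired with every fine-tuning image of the concept
instance_prompt = f"A photo of a {concept_identifier} {concept_class}"

# Class-only prompt used to generate images for the prior-preservation loss (see below)
class_prompt = f"A photo of a {concept_class}"
```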

Concept identifier: Encoding of the concept via a rare-token identifier in the text encoder's vocabulary (e.g., ðŁĴŁ for a CLIP text encoder). This token representation is overridden during fine-tuning and acts as a unique identifier for the concept.
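To gauge how rare a candidate identifier actually is, one can inspect how the text encoder's tokenizer splits it. A small sketch, assuming the CLIP tokenizer used by the Stable Diffusion v1.x text encoder via the transformers library:

```python
from transformers import CLIPTokenizer

# Tokenizer of the Stable Diffusion v1.x text encoder
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Identifiers that decompose into frequent subword tokens are more likely
# to already carry a visual prior of their own.
for candidate in ["pkblz", "pokeeee"]:
    print(candidate, "->", tokenizer.tokenize(candidate))
```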

Loss function: Combination of a reconstruction loss to learn the new concept and a prior-preservation loss to prevent overfitting and language drift (i.e., losing the ability to generate images of other concepts of the same class).
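The following is a minimal PyTorch-style sketch of this combined objective, using a hypothetical helper `dreambooth_loss` and assuming that the denoiser's predictions for the concept batch and for a batch of class images (generated with the frozen, pretrained model) are stacked along the batch dimension, as is common in DreamBooth implementations:

```python
import torch
import torch.nn.functional as F

def dreambooth_loss(model_pred, target, prior_loss_weight=1.0):
    # Split the joint batch into the instance (concept) half and the class (prior) half
    pred_instance, pred_prior = torch.chunk(model_pred, 2, dim=0)
    target_instance, target_prior = torch.chunk(target, 2, dim=0)

    # Reconstruction loss: learn the new concept from the reference images
    instance_loss = F.mse_loss(pred_instance, target_instance)

    # Prior-preservation loss: keep the model's prior over the general class intact
    prior_loss = F.mse_loss(pred_prior, target_prior)

    return instance_loss + prior_loss_weight * prior_loss
```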

DreamBooth Hackathon by Hugging Face

I fine-tuned two personalized DreamBooth models for the Hugging Face DreamBooth Hackathon.

The Pokéball Machine

Model checkpoint: CompVis/stable-diffusion-v1-4
Training images: Images of the original, red-and-white Pokéball
Rare-token identifier: pkblz
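At inference time, the learned identifier is simply used inside the text prompt. A minimal sketch with the diffusers library; the repository name is a hypothetical placeholder for wherever the fine-tuned weights are stored:

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical repository name for the fine-tuned Pokéball model
pipe = StableDiffusionPipeline.from_pretrained(
    "user/pkblz-pokeball-dreambooth", torch_dtype=torch.float16
).to("cuda")

# The rare-token identifier is used directly in the prompt
prompt = "a photo of a pkblz pokeball on a sandy beach"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("pkblz_beach.png")
```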

Iridescent Jellyfish

Model checkpoint: runwayml/stable-diffusion-v1-5
Training images: Images of fluorescent jellyfish
Rare-token identifier: ðŁĴŁ

Experiment Log

  • Input images with the same resolution as the model checkpoint (e.g., 512x512) enhance the image augmentations and improve performance.
  • Use a rare-token identifier that is rare enough not to encode its own visual concept yet (e.g., pkblz versus pokeeee, where the latter seems to carry a visual prior related to Japanese/Asian culture).
  • Choose a relatively low learning rate (2e-06) combined with a relatively large number of training steps (800).
  • At inference time, a lower guidance scale (e.g., 7) allows for more creativity, while a higher guidance scale (e.g., 11) makes the model adhere more strongly to the text prompt (see the sketch after this list).
  • TODO: more heterogeneity in text prompts
  • TODO: human images
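A small, self-contained sketch of the guidance-scale comparison mentioned above (the checkpoint name is again a hypothetical placeholder); fixing the seed makes the two generations directly comparable:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "user/pkblz-pokeball-dreambooth", torch_dtype=torch.float16  # hypothetical checkpoint
).to("cuda")

prompt = "a photo of a pkblz pokeball floating in space"

# Same seed for both runs so that only the guidance scale differs
creative = pipe(prompt, guidance_scale=7,
                generator=torch.Generator("cuda").manual_seed(42)).images[0]
faithful = pipe(prompt, guidance_scale=11,
                generator=torch.Generator("cuda").manual_seed(42)).images[0]
```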

References

Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. (2022). DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242.
