An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal1,2, Yuval Alaluf1, Yuval Atzmon2, Or Patashnik1, Amit H. Bermano1, Gal Chechik2, Daniel Cohen-Or1
1Tel Aviv University, 2NVIDIA
Abstract:
Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks.
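The core idea in the abstract, learning one new "word" embedding while the model stays frozen, can be illustrated with a toy sketch. This is not the repository's training code: the real method optimizes the embedding against a latent diffusion model's loss, whereas this sketch swaps in a small fixed linear map and a mean squared error. The names `frozen_vocab`, `W`, `forward`, `loss_and_grad`, the placeholder token `*`, and all sizes are assumptions made for illustration.

```python
import random

random.seed(0)
DIM = 4  # toy embedding size (assumption; real text encoders use far larger vectors)


def rand_vec():
    return [random.uniform(-1.0, 1.0) for _ in range(DIM)]


# Frozen pieces: an existing vocabulary and a fixed linear "model".
# Both are hypothetical stand-ins and are never updated below.
frozen_vocab = {word: rand_vec() for word in ("a", "photo", "of")}
W = [rand_vec() for _ in range(DIM)]

PROMPT = "a photo of *".split()  # "*" is the placeholder for the new concept


def forward(concept):
    """Average the prompt's token embeddings, then apply the frozen map W."""
    vecs = [concept if tok == "*" else frozen_vocab[tok] for tok in PROMPT]
    mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]
    return [sum(W[r][c] * mean[c] for c in range(DIM)) for r in range(DIM)]


def loss_and_grad(concept, target):
    """Mean squared error and its gradient with respect to the concept vector only."""
    err = [o - t for o, t in zip(forward(concept), target)]
    loss = sum(e * e for e in err) / DIM
    # Chain rule: the concept enters the mean with weight count("*") / len(PROMPT).
    scale = (2.0 / DIM) * PROMPT.count("*") / len(PROMPT)
    grad = [scale * sum(err[r] * W[r][c] for r in range(DIM)) for c in range(DIM)]
    return loss, grad


# Toy target: features produced by a hidden "true" concept vector.
target = forward(rand_vec())

concept = [0.0] * DIM  # the only trainable parameters
initial_loss, _ = loss_and_grad(concept, target)
for _ in range(2000):
    _, grad = loss_and_grad(concept, target)
    concept = [c - 1.0 * g for c, g in zip(concept, grad)]  # plain gradient descent
final_loss, _ = loss_and_grad(concept, target)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

Only `concept` changes during the loop. The vocabulary and `W` are read but never written, which mirrors the frozen-model setting described above.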
This repo contains the official code, data and sample inversions for our Textual Inversion paper.
29/08/2022 Merge embeddings now supports SD embeddings. Added SD pivotal tuning code (WIP), fixed training duration, checkpoint save iterations.
21/08/2022 Code released!
- Release code!
- Optimize gradient storing / checkpointing. Memory requirements and training times reduced by ~55%
- Release data sets
- Release pre-trained embeddings
- Add Stable Diffusion support
Our code builds on, and shares requirements with, Latent Diffusion Models (LDM). To set up their environment, please run:
conda env create -f environment.yaml
conda activate ldm