source link: https://gist.github.com/madebyollin/ff6aeadf27b2edbc51d05d5f97a595d9
Notes / Links about Stable Diffusion VAE
Stable Diffusion's VAE is a neural network that encodes images into a compressed "latent" format and decodes latents back into images. The encoder performs 48x lossy compression, and the decoder generates new detail to fill in the gaps.
(Calling this model a "VAE" is sort of a misnomer - it's an encoder with some very slight KL regularization, and a conditional GAN decoder)
This document is a big pile of various links with more info.
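As a quick sanity check of the 48x figure above: the encoder downsamples 8x in each spatial dimension and maps 3 RGB channels to 4 latent channels (these are the standard kl-f8 config values), which works out to 48x:

```python
# Compression ratio of the SD VAE latent space.
# kl-f8 config: 8x spatial downsampling per dimension, 3 RGB channels -> 4 latent channels.
def latent_shape(h, w, spatial_factor=8, latent_channels=4):
    return (h // spatial_factor, w // spatial_factor, latent_channels)

h, w, c = 512, 512, 3
lh, lw, lc = latent_shape(h, w)
ratio = (h * w * c) / (lh * lw * lc)
print(lh, lw, lc, ratio)  # 64 64 4 48.0
```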
VAE Versions & Lineage
- CompVis
- [2021] The original decoder training code (using a vq bottleneck instead of a kl bottleneck) is from the taming transformers paper https://github.com/CompVis/taming-transformers
- [2022] The original SD VAE is the kl-f8 model from the latent diffusion paper https://github.com/CompVis/latent-diffusion#pretrained-autoencoding-models
- Stability
- [2022-10] SD-VAE-FT: These widely-used VAEs are finetunes of the compvis ones https://huggingface.co/stabilityai/sd-vae-ft-ema https://huggingface.co/stabilityai/sd-vae-ft-mse (only the decoder is changed) https://twitter.com/StabilityAI/status/1586183361361428480
- [2023-06] SDXL-VAE is a retrained-from-scratch model with the same code / architecture as the original https://huggingface.co/stabilityai/sdxl-vae https://arxiv.org/abs/2307.01952 https://github.com/Stability-AI/generative-models/tree/main. This comes in two versions (0.9 and 1.0) but the 0.9 one is generally considered to look better. Not clear if the training code in the SGM repo works yet.
- [2023-11] The SVD-VAE is a finetuned version of SD-VAE-FT with added temporal (3d) convolutions in the decoder, intended to decode smooth (non-flickery) videos from batches of SD latents.
- OpenAI
- [2023-11] OpenAI trained a consistency-model decoder for the original SD VAE latent space https://github.com/openai/consistencydecoder https://cdn.openai.com/papers/dall-e-3.pdf. This is like 10x the size of the standard VAE, but quality is supposed to be higher.
Other SD-VAE-related Codebases
- madebyollin (me)
- https://github.com/madebyollin/taesd - tiny distilled version of both the SD and SDXL autoencoder (also, removes some annoying scaling stuff & removes the stochasticity)
- https://huggingface.co/madebyollin/sdxl-vae-fp16-fix - finetuned version of the SDXL (0.9) VAE that works in fp16 precision without NaNs
- https://gist.github.com/madebyollin/865fa6a18d9099351ddbdfbe7299ccbf - modified version of mrsteyk's consistency decoder code
- birchlabs
- https://birchlabs.co.uk/machine-learning#vae-distillation - tiny MLP decoder & training code
- city96
- https://github.com/city96/SD-Latent-Interposer - converter between SD and SDXL latent spaces (with some artifacts)
- cccntu
- https://github.com/cccntu/fine-tune-models/ - vae finetuning code
- mosaicml
- mosaicml/diffusion#79 - vae training code
- various people working to get VAE training support into diffusers
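For reference on the "annoying scaling stuff" that TAESD removes: diffusion models operate on VAE latents multiplied by a scaling factor (so they're roughly unit-variance), and decoding divides it back out. A minimal sketch, using the SD kl-f8 value of 0.18215 (exposed as `vae.config.scaling_factor` in diffusers; SDXL's config uses a different constant):

```python
# Diffusion models train on scaled latents with roughly unit variance;
# raw VAE-encoder latents are multiplied by this factor before diffusion
# and divided by it before decoding. 0.18215 is the SD kl-f8 value.
SCALING_FACTOR = 0.18215

def to_diffusion_space(raw_latents):
    """Raw VAE-encoder output -> latents as the diffusion model sees them."""
    return raw_latents * SCALING_FACTOR

def to_vae_space(scaled_latents):
    """Diffusion-model latents -> what the VAE decoder expects."""
    return scaled_latents / SCALING_FACTOR

z = 1.234
assert abs(to_vae_space(to_diffusion_space(z)) - z) < 1e-12
```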
Other Info
- Meta's emu model uses 10x compression instead of 50x and changes the adversarial loss a bit https://huggingface.co/papers/2309.15807
- GAN vs. Consistency VAE comparisons https://twitter.com/anotherjesse/status/1721754763149099246
- You can remove the TAESD upsampling layers to get lower-res RGB images https://twitter.com/madebyollin/status/1720847470245343631
- Seems likely that Bing uses the original VAE decoder but ChatGPT uses the consistency model https://twitter.com/madebyollin/status/1715182160142111082
- The Retro Diffusion team has a special decoder for pixel art https://twitter.com/RealAstropulse/status/1674431288894459909
- SD VAE KL noise has very little effect
- Variances are really small https://twitter.com/Ethan_smith_20/status/1719768055902027840
- Taking the encoder mean instead of sampling causes no difference in most cases (even though sampling is the technically correct choice) https://twitter.com/Birchlabs/status/1721714156275933608
- There's an attention layer in the VAE that also doesn't do very much and can be disabled for some speedup
- SDXL and SD VAE latents are totally incompatible https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/discussions/6
- The SD VAE encoder has an annoying bright spot that gets worse when encoding higher-resolution images (animation)
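To illustrate the "sampling vs. mean" point above: the encoder outputs a per-element (mean, logvar), sampling computes mean + exp(0.5 * logvar) * noise, and because the learned log-variances are strongly negative, the sample is almost identical to the mean. A toy numpy sketch (the logvar value here is made up, but real SD VAE logvars are similarly very negative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend encoder output: per-element mean and log-variance for a
# 4-channel 64x64 latent. logvar = -17 is a hypothetical stand-in for
# the tiny variances the real encoder predicts.
mean = rng.normal(size=(4, 64, 64))
logvar = np.full_like(mean, -17.0)

# Reparameterization trick: sample = mean + std * eps
std = np.exp(0.5 * logvar)
sample = mean + std * rng.normal(size=mean.shape)

# The sample is essentially the mean (difference ~1e-3 or smaller,
# versus latent values on the order of 1).
print(np.abs(sample - mean).max())
```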