source link: https://gist.github.com/madebyollin/ff6aeadf27b2edbc51d05d5f97a595d9
Notes / Links about Stable Diffusion VAE
Stable Diffusion's VAE is a neural network that encodes images into a compressed "latent" format and decodes latents back into images. The encoder performs 48x lossy compression, and the decoder generates new detail to fill in the gaps.
(Calling this model a "VAE" is sort of a misnomer - it's an encoder with some very slight KL regularization, and a conditional GAN decoder)
This document is a big pile of various links with more info.
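As a quick sanity check of the 48x figure above: the encoder downsamples 8x in each spatial dimension and maps 3 RGB channels to 4 latent channels (these are the standard kl-f8 config values), which works out to 48x:

```python
# Compression ratio of the SD VAE latent space.
# kl-f8 config: 8x spatial downsampling per dimension, 3 RGB channels -> 4 latent channels.
def latent_shape(h, w, spatial_factor=8, latent_channels=4):
    return (h // spatial_factor, w // spatial_factor, latent_channels)

h, w, c = 512, 512, 3
lh, lw, lc = latent_shape(h, w)
ratio = (h * w * c) / (lh * lw * lc)
print(lh, lw, lc, ratio)  # 64 64 4 48.0
```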
VAE Versions & Lineage
- CompVis
- [2021] The original decoder training code (using a vq bottleneck instead of a kl bottleneck) is from the taming transformers paper https://github.com/CompVis/taming-transformers
- [2022] The original SD VAE is the kl-f8 model from the latent diffusion paper https://github.com/CompVis/latent-diffusion#pretrained-autoencoding-models
- Stability
- [2022-10] SD-VAE-FT: These widely-used VAEs are finetunes of the compvis ones https://huggingface.co/stabilityai/sd-vae-ft-ema https://huggingface.co/stabilityai/sd-vae-ft-mse (only the decoder is changed) https://twitter.com/StabilityAI/status/1586183361361428480
- [2023-06] SDXL-VAE is a retrained-from-scratch model with the same code / architecture as the original https://huggingface.co/stabilityai/sdxl-vae https://arxiv.org/abs/2307.01952 https://github.com/Stability-AI/generative-models/tree/main. This comes in two versions (0.9 and 1.0) but the 0.9 one is generally considered to look better. Not clear if the training code in the SGM repo works yet.
- [2023-11] The SVD-VAE is a finetuned version of SD-VAE-FT with added temporal (3d) convolutions in the decoder, intended to decode smooth (non-flickery) videos from batches of SD latents.
- OpenAI
- [2023-11] OpenAI trained a consistency-model decoder for the original SD VAE latent space https://github.com/openai/consistencydecoder https://cdn.openai.com/papers/dall-e-3.pdf. This is like 10x the size of the standard VAE, but quality is supposed to be higher.
Other SD-VAE-related Codebases
- madebyollin (me)
- https://github.com/madebyollin/taesd - tiny distilled version of both the SD and SDXL autoencoder (also, removes some annoying scaling stuff & removes the stochasticity)
- https://huggingface.co/madebyollin/sdxl-vae-fp16-fix - finetuned version of the SDXL (0.9) VAE that works in fp16 precision without NaNs
- https://gist.github.com/madebyollin/865fa6a18d9099351ddbdfbe7299ccbf - modified version of mrsteyk's consistency decoder code
- birchlabs
- https://birchlabs.co.uk/machine-learning#vae-distillation - tiny MLP decoder & training code
- city96
- https://github.com/city96/SD-Latent-Interposer - converter between SD and SDXL latent spaces (with some artifacts)
- cccntu
- https://github.com/cccntu/fine-tune-models/ - vae finetuning code
- mosaicml
- mosaicml/diffusion#79 - vae training code
- various people working to get VAE training support into diffusers
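For reference on the "annoying scaling stuff" that TAESD removes: diffusion models operate on VAE latents multiplied by a scaling factor (so they're roughly unit-variance), and decoding divides it back out. A minimal sketch, using the SD kl-f8 value of 0.18215 (exposed as `vae.config.scaling_factor` in diffusers; SDXL's config uses a different constant):

```python
# Diffusion models train on scaled latents with roughly unit variance;
# raw VAE-encoder latents are multiplied by this factor before diffusion
# and divided by it before decoding. 0.18215 is the SD kl-f8 value.
SCALING_FACTOR = 0.18215

def to_diffusion_space(raw_latents):
    """Raw VAE-encoder output -> latents as the diffusion model sees them."""
    return raw_latents * SCALING_FACTOR

def to_vae_space(scaled_latents):
    """Diffusion-model latents -> what the VAE decoder expects."""
    return scaled_latents / SCALING_FACTOR

z = 1.234
assert abs(to_vae_space(to_diffusion_space(z)) - z) < 1e-12
```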
Other Info
- Meta's emu model uses 10x compression instead of 50x and changes the adversarial loss a bit https://huggingface.co/papers/2309.15807
- GAN vs. Consistency VAE comparisons https://twitter.com/anotherjesse/status/1721754763149099246
- You can remove the TAESD upsampling layers to get lower-res RGB images https://twitter.com/madebyollin/status/1720847470245343631
- Seems likely that Bing uses the original VAE decoder but ChatGPT uses the consistency model https://twitter.com/madebyollin/status/1715182160142111082
- The Retro Diffusion team has a special decoder for pixel art https://twitter.com/RealAstropulse/status/1674431288894459909
- SD VAE KL noise has very little effect
- Variances are really small https://twitter.com/Ethan_smith_20/status/1719768055902027840
- Taking the encoder mean instead of sampling causes no difference in most cases (even though sampling is the technically correct choice) https://twitter.com/Birchlabs/status/1721714156275933608
- There's an attention layer in the VAE that also doesn't do very much and can be disabled for some speedup
- SDXL and SD VAE latents are totally incompatible https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/discussions/6
- The SD VAE encoder has an annoying bright spot that gets worse when encoding higher-resolution images (animation)
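To illustrate the "sampling vs. mean" point above: the encoder outputs a per-element (mean, logvar), sampling computes mean + exp(0.5 * logvar) * noise, and because the learned log-variances are strongly negative, the sample is almost identical to the mean. A toy numpy sketch (the logvar value here is made up, but real SD VAE logvars are similarly very negative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend encoder output: per-element mean and log-variance for a
# 4-channel 64x64 latent. logvar = -17 is a hypothetical stand-in for
# the tiny variances the real encoder predicts.
mean = rng.normal(size=(4, 64, 64))
logvar = np.full_like(mean, -17.0)

# Reparameterization trick: sample = mean + std * eps
std = np.exp(0.5 * logvar)
sample = mean + std * rng.normal(size=mean.shape)

# The sample is essentially the mean (difference ~1e-3 or smaller,
# versus latent values on the order of 1).
print(np.abs(sample - mean).max())
```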