source link: https://github.com/lucidrains/DALLE2-pytorch

DALL-E 2 - Pytorch (wip)

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch. Yannic Kilcher summary

The main novelty seems to be an extra layer of indirection with the prior network (whether it is an autoregressive transformer or a diffusion network), which predicts an image embedding from the text embedding produced by CLIP. This repository will build out only the diffusion prior network, as it is the best performing variant (which, incidentally, uses a causal transformer as the denoising network).
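The two-stage pipeline described above can be sketched with toy stand-in modules. These are hypothetical classes for illustration only, not the dalle2-pytorch API; the embedding dimension of 512 and the module internals are assumptions.

```python
import torch
from torch import nn

class ToyDiffusionPrior(nn.Module):
    """Stand-in for the prior: maps a CLIP text embedding to a
    predicted CLIP image embedding (hypothetical, not the repo's API)."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, text_embed):
        return self.net(text_embed)

class ToyDecoder(nn.Module):
    """Stand-in for the decoder: generates an image conditioned on a
    (predicted) image embedding (hypothetical, not the repo's API)."""
    def __init__(self, dim=512, image_size=64):
        super().__init__()
        self.image_size = image_size
        self.net = nn.Linear(dim, 3 * image_size * image_size)

    def forward(self, image_embed):
        out = self.net(image_embed)
        return out.view(-1, 3, self.image_size, self.image_size)

text_embed = torch.randn(1, 512)               # as if from CLIP's text encoder
image_embed = ToyDiffusionPrior()(text_embed)  # prior: text embed -> image embed
image = ToyDecoder()(image_embed)              # decoder: image embed -> image
print(image.shape)  # torch.Size([1, 3, 64, 64])
```

The point of the sketch is the indirection: the decoder never sees the text embedding directly, only the image embedding the prior predicts from it.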

As of this writing, this model is the state of the art for text-to-image synthesis.

The repository may also explore an extension using latent diffusion in the decoder, following Rombach et al.

Please join if you are interested in helping out with the replication

Do let me know if anyone is interested in a Jax version #8

Install

$ pip install dalle2-pytorch

Usage (work in progress)

template

$ dream 'sharing a sunset at the summit of mount everest with my dog'

Once built, images will be saved to the same directory from which the command is invoked.

Training (work in progress, will be offered both in code and via the command line)

template

  • finish off gaussian diffusion class for latent embedding - allow for both prediction of epsilon as well as directly predicting embedding
  • make sure it works end to end
  • augment unet so that it can also be conditioned on text encodings (although in the paper they hinted this didn't make much of a difference)
  • look into Jonathan Ho's cascading DDPM for the decoder, as that seems to be what they are using. get caught up on DDPM literature
  • figure out all the current bag of tricks needed to make DDPMs great (starting with the blur trick mentioned in paper)
  • train on a toy task, offer in colab
  • add attention to unet - apply some personal tricks with efficient attention
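The first item above, supporting both prediction of epsilon and direct prediction of the embedding, can be sketched as follows. This is an illustrative standalone example, not the repository's code: the noise level, dimensions, and zero-valued placeholder network outputs are all assumptions.

```python
import torch

def noise_sample(x0, alpha_bar_t):
    """Forward diffusion step: x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps."""
    eps = torch.randn_like(x0)
    x_t = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps
    return x_t, eps

x0 = torch.randn(4, 512)          # clean latent embeddings (illustrative shape)
alpha_bar_t = torch.tensor(0.5)   # illustrative noise level, not a real schedule
x_t, eps = noise_sample(x0, alpha_bar_t)

# Objective 1: train the denoising network to predict the noise (epsilon).
pred_eps = torch.zeros_like(eps)  # placeholder for the network's output
loss_eps = torch.nn.functional.mse_loss(pred_eps, eps)

# Objective 2: train it to predict the clean embedding (x0) directly.
pred_x0 = torch.zeros_like(x0)    # placeholder for the network's output
loss_x0 = torch.nn.functional.mse_loss(pred_x0, x0)

# The two targets are interchangeable: x0 is recoverable from a predicted eps.
x0_from_eps = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
assert torch.allclose(x0_from_eps, x0, atol=1e-4)
```

Since either target determines the other given `x_t`, supporting both is mostly a matter of which quantity the loss is computed against.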

Citations

@misc{ramesh2022,
    title   = {Hierarchical Text-Conditional Image Generation with CLIP Latents},
    author  = {Aditya Ramesh et al},
    year    = {2022},
    eprint  = {2204.06125},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@misc{crowson2022,
    author  = {Katherine Crowson},
    url     = {https://twitter.com/rivershavewings}
}
@misc{rombach2021highresolution,
    title   = {High-Resolution Image Synthesis with Latent Diffusion Models}, 
    author  = {Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
    year    = {2021},
    eprint  = {2112.10752},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@inproceedings{Liu2022ACF,
    title   = {A ConvNet for the 2020s},
    author  = {Zhuang Liu and Hanzi Mao and Chaozheng Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
    year    = {2022}
}
@misc{zhang2019root,
    title   = {Root Mean Square Layer Normalization},
    author  = {Biao Zhang and Rico Sennrich},
    year    = {2019},
    eprint  = {1910.07467},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
