Video Diffusion Models

Source: https://video-diffusion.github.io/

  • Jonathan Ho*
  • Tim Salimans*
  • Alexey Gritsenko
  • William Chan
  • Mohammad Norouzi
  • David Fleet

We present results on video generation using diffusion models. We propose an architecture for video diffusion models which is a natural extension of the standard image architecture. We show that this architecture is effective for jointly training from image and video data. To generate long and higher-resolution videos, we introduce a new conditioning technique that performs better than previously proposed methods. We present results on text-conditioned video generation and state-of-the-art results on an unconditional video generation benchmark.

Example results

[Video] Samples from a text-conditioned video diffusion model, conditioned on the string "fireworks".

[Videos] More samples from a text-conditioned video diffusion model. The conditioning string is displayed above each sample.

Summary

Diffusion models have recently been producing high quality results in domains such as image generation and audio generation, and there is significant interest in validating diffusion models in new data modalities. In this work, we present first results on video generation using diffusion models, for both unconditional and conditional settings. Prior work on video generation has usually employed other types of generative models, like GANs, VAEs, flow-based models, and autoregressive models.

We show that high-quality videos can be generated by essentially the standard formulation of the Gaussian diffusion model, with little modification other than straightforward architectural changes to accommodate video data within the memory constraints of deep learning accelerators. We train models that generate a fixed-length block of video frames; to generate videos longer than that, we show how to repurpose a trained model as a block-autoregressive model over frames. We test our methods on an unconditional video generation benchmark, where we achieve state-of-the-art sample quality scores, and we also show promising results on text-conditioned video generation.
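
As a rough illustration of this block-autoregressive extension (a minimal sketch, not the actual sampler; `sample_block` is a hypothetical stand-in for one run of the reverse diffusion process):

```python
import jax
import jax.numpy as jnp

def sample_block(rng, cond_frames=None, block_len=16, frame_shape=(64, 64, 3)):
    # Hypothetical stand-in for one run of the reverse diffusion process.
    # A real sampler would denoise Gaussian noise into `block_len` frames,
    # conditioning on `cond_frames` when they are provided.
    del cond_frames  # this stub ignores the conditioning frames
    return jax.random.normal(rng, (block_len, *frame_shape))

def sample_long_video(rng, num_blocks=4, overlap=4, block_len=16):
    # Block-autoregressive extension: the first block is sampled
    # unconditionally; each later block is conditioned on the last
    # `overlap` frames of the video generated so far.
    rng, sub = jax.random.split(rng)
    video = sample_block(sub, cond_frames=None, block_len=block_len)
    for _ in range(num_blocks - 1):
        rng, sub = jax.random.split(rng)
        block = sample_block(sub, cond_frames=video[-overlap:], block_len=block_len)
        video = jnp.concatenate([video, block], axis=0)
    return video  # shape: (num_blocks * block_len, H, W, C)

video = sample_long_video(jax.random.PRNGKey(0))
```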

Gradient conditioning method

One of our main innovations is a new conditional generation method for unconditional diffusion models. Our new conditioning method, which we refer to as the gradient method, modifies the sampling procedure of the model to improve a conditioning loss on denoised data using gradient-based optimization. We find that the gradient method is better than existing methods at ensuring consistency of the generated samples with the conditioning information.

We use the gradient method to autoregressively extend our models to longer videos and to higher resolutions.
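
The following is a minimal sketch of one guided denoising step under a gradient method of this kind, assuming a hypothetical JAX-traceable `denoise(z, t)` function that returns the model's estimate of the clean frames from the noisy latent `z` at timestep `t`; the timestep-dependent scaling used in the paper is absorbed into a single `guidance_weight` here.

```python
import jax
import jax.numpy as jnp

def guided_denoise(denoise, z, t, known_frames, known_idx, guidance_weight=1.0):
    # `denoise(z, t)` is a hypothetical model call returning the denoised video
    # with the same shape as `z`. `known_frames` are the conditioning frames
    # (e.g. the previous block), located at positions `known_idx` in the video.
    def recon_loss(z_noisy):
        # Conditioning loss on denoised data: squared error between the
        # model's reconstruction of the known frames and their true values.
        x_hat = denoise(z_noisy, t)
        return jnp.sum(jnp.square(x_hat[known_idx] - known_frames))

    x_hat = denoise(z, t)
    # Gradient-based adjustment: move the denoised prediction in the direction
    # that makes the reconstructed known frames match the conditioning frames.
    grad = jax.grad(recon_loss)(z)
    return x_hat - 0.5 * guidance_weight * grad
```

The adjusted prediction would then be used in place of the ordinary denoised estimate inside the sampling loop.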

[Videos] Frames from our gradient method (left) and a baseline "replacement" method (right) for autoregressive extension. Videos sampled using the gradient method attain superior temporal coherence compared to the baseline method.

Additional techniques

The basic techniques we employ are as follows (details can be found in our full paper):

  • Architecture: for video data we use a factorized space-time UNet, which is a straightforward extension of the standard 2D UNet used in image diffusion models (a rough sketch of this factorization is shown after this list).
  • Joint image-video training: our factorized UNets can be run on variable sequence lengths and therefore can be jointly trained on both video and image modeling objectives. We find that this joint training, which has the effect of a bias-variance tradeoff on the training objective, is important for video sample quality.
  • Classifier-free guidance: improves sample quality for text-conditioned generation, similar to existing work on image modeling.
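
As a rough sketch of the factorized space-time idea (single-head attention without projections and with simplified shapes; not the authors' exact architecture), each frame first attends over its own spatial positions, and then each spatial position attends over time:

```python
import jax
import jax.numpy as jnp

def attention(q, k, v):
    # Single-head scaled dot-product attention over the second-to-last axis.
    scores = q @ jnp.swapaxes(k, -1, -2) / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v

def factorized_space_time_attention(x):
    # x: (B, T, H, W, C) block of video features inside the UNet.
    B, T, H, W, C = x.shape
    # Spatial attention: each frame attends over its own H*W positions.
    xs = x.reshape(B * T, H * W, C)
    xs = attention(xs, xs, xs).reshape(B, T, H, W, C)
    # Temporal attention: each spatial position attends over the T frames.
    xt = jnp.transpose(xs, (0, 2, 3, 1, 4)).reshape(B * H * W, T, C)
    xt = attention(xt, xt, xt).reshape(B, H, W, T, C)
    return jnp.transpose(xt, (0, 3, 1, 2, 4))

x = jnp.zeros((1, 16, 8, 8, 32))
y = factorized_space_time_attention(x)  # same shape as x
```

In a layout like this, one way to accommodate joint image-video training is to mask the temporal attention so that individual images are treated as independent single-frame videos.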
