
Variational Autoencoders are not autoencoders

source link: https://www.tuicool.com/articles/hit/MF3mai2

When VAEs are trained with powerful decoders, the model can learn to ‘ignore the latent variable’. This isn’t something an autoencoder should do. In this post we’ll take a look at why this happens and why this represents a shortcoming of the name Variational Autoencoder rather than anything else.

Variational Autoencoders (VAEs) are popular for many reasons, one of which is that they provide a way to featurise data. One of their ‘failure modes’ is that if a powerful decoder is used, training can result in good scores for the objective function we optimise, yet the learned representation is completely useless: all data points are encoded as the prior distribution, so the latent representation contains no information about $x$.

The name Variational Autoencoder throws a lot of people off when trying to understand why this happens — an autoencoder compresses observed high-dimensional data into a low-dimensional representation, so surely VAEs should always result in a good compression? In fact, this behaviour is not a failure mode of VAEs per se, but rather represents a failure mode of the name VAE!

In this post, we’ll look at what VAEs are actually trained to do — not what they sound like they ought to do — and see that this ‘pathological’ behaviour entirely makes sense. We’ll see that VAEs are a particular way to train Latent Variable Models, and that fundamentally their encoders are introduced as a mathematical trick to allow approximation of an intractable quantity. The nature of this trick is such that when powerful decoders are used, ignoring the latent variable is encouraged.

VAEs and autoencoders

An autoencoder is a model that compresses data by mapping it to a low-dimensional space and back. Writing $e$ for the encoder (data to low-dimensional code) and $d$ for the decoder (code back to data space), autoencoder objectives take one of the following equivalent forms:

$$\min_{e,\,d}\; \mathbb{E}_{p^*(x)}\Big[\, \big\| x - d(e(x)) \big\|^2 \,\Big] \qquad\Longleftrightarrow\qquad \max_{e,\,d}\; \mathbb{E}_{p^*(x)}\Big[ -\big\| x - d(e(x)) \big\|^2 \Big].$$
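To make the first of these forms concrete, here is a minimal Python (numpy) sketch of the squared-error autoencoder objective; the linear encoder and decoder, their dimensions and the toy data are all made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 points in a 10-dimensional space.
X = rng.normal(size=(100, 10))

# Linear encoder e(x) = W_e x and decoder d(h) = W_d h (the weights here are arbitrary).
W_e = rng.normal(size=(3, 10)) * 0.1   # 10-d data -> 3-d code
W_d = rng.normal(size=(10, 3)) * 0.1   # 3-d code -> 10-d reconstruction

def encode(x):
    return x @ W_e.T

def decode(h):
    return h @ W_d.T

def autoencoder_loss(X):
    """Mean squared reconstruction error: E[ ||x - d(e(x))||^2 ]."""
    X_hat = decode(encode(X))
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

print(autoencoder_loss(X))
```

Training an autoencoder means adjusting the parameters of `encode` and `decode` to drive this loss down.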

The objective of a VAE (the variational lower bound, also known as the Evidence Lower BOund or ELBO, introduced in a previous post) looks somewhat like the second of these, hence giving rise to the name Variational Autoencoder. Averaging the following over $p^*(x)$ gives the full objective to be maximised:

$$\mathcal{L}(\theta, \phi; x) \;=\; \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big]}_{\text{(i)}} \;-\; \underbrace{\mathrm{KL}\big(q_\phi(z\mid x)\,\big\|\,p(z)\big)}_{\text{(ii)}},$$

where $q_\phi(z\mid x)$ is the encoder, $p_\theta(x\mid z)$ the decoder and $p(z)$ the prior; each of these is discussed in the sections below.

Many papers and tutorials introducing VAEs will explicitly describe (i) as the ‘reconstruction’ loss and (ii) as the ‘regulariser’. However, despite appearances, VAEs are not in their heart-of-hearts autoencoders: we’ll describe this in detail in the next section, but it’s of critical importance to stress that, rather than maximising a regularised reconstruction quality, the fundamental goal of a VAE is to maximise the log-likelihood $\mathbb{E}_{p^*(x)}\big[\log p_\theta(x)\big]$.

This is not possible to do directly, but by introducing the approximate posterior $q_\phi(z\mid x)$ we can get a tractable lower bound on the desired objective, giving us the VAE objective. The variational lower bound is precisely what its name suggests – a lower bound on the log-likelihood, not a ‘regularised reconstruction cost’. A failure to recognise this distinction has caused confusion to many.
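To make terms (i) and (ii) concrete, here is a minimal numpy sketch of the ELBO for a single data point, assuming a standard-normal prior, a diagonal-Gaussian encoder and a unit-variance Gaussian decoder; the fixed encoder parameters and the toy decoder mean function stand in for neural networks and are purely illustrative. The reconstruction term (i) is estimated by Monte Carlo, while the KL term (ii) is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    """Log-density of a diagonal Gaussian, summed over dimensions."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var, axis=-1)

def elbo(x, enc_mean, enc_var, decoder_mean_fn, n_samples=1000):
    """Monte Carlo estimate of (i) E_q[log p(x|z)] minus (ii) KL(q(z|x) || p(z))."""
    # Sample z ~ q(z|x) = N(enc_mean, enc_var).
    z = enc_mean + np.sqrt(enc_var) * rng.normal(size=(n_samples, enc_mean.shape[-1]))
    # (i) Reconstruction term with a unit-variance Gaussian decoder p(x|z) = N(mu(z), I).
    recon = np.mean(log_normal(x, decoder_mean_fn(z), np.ones_like(x)))
    # (ii) KL between the diagonal Gaussian q(z|x) and the standard normal prior p(z).
    kl = 0.5 * np.sum(enc_var + enc_mean ** 2 - 1.0 - np.log(enc_var))
    return recon - kl

# Hypothetical one-dimensional latent and two-dimensional data point.
x = np.array([0.5, -1.0])
decoder_mean = lambda z: np.concatenate([z, -z], axis=-1)  # made-up decoder mean mu(z)
print(elbo(x, enc_mean=np.array([0.2]), enc_var=np.array([0.5]), decoder_mean_fn=decoder_mean))
```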

Latent Variable Models

A Latent Variable Model (LVM) is a way to specify complex distributions over high-dimensional spaces by composing simple distributions, and VAEs provide one way to train such models. An LVM is specified by fixing a prior $p(z)$ and a parameterised family of conditional distributions $p_\theta(x\mid z)$, the latter of which is called the decoder or the generator interchangeably in the literature.

For a fixed $\theta$, we get a distribution $p_\theta(x) = \int p_\theta(x\mid z)\,p(z)\,dz$ over the data space. Training an LVM requires (a) picking a divergence between $p_\theta$ and the true data distribution $p^*$; and (b) choosing $\theta$ to minimise this divergence.
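An LVM is easy to sample from by ancestral sampling: draw $z$ from the prior, then $x$ from the decoder. The sketch below uses a made-up one-dimensional Gaussian prior and Gaussian decoder purely to illustrate the structure; evaluating the marginal $p_\theta(x)$, by contrast, would require integrating over $z$, which is exactly the hard part. Note that nothing here involves an encoder: the LVM is just the prior and the decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior(n):
    """z ~ p(z) = N(0, 1)."""
    return rng.normal(size=n)

def sample_decoder(z):
    """x ~ p_theta(x|z); here a hypothetical Gaussian whose mean depends on z."""
    mean = np.tanh(z)   # stand-in for a neural-network mean function
    std = 0.5           # stand-in for a learned standard deviation
    return rng.normal(loc=mean, scale=std)

# Ancestral sampling from the marginal p_theta(x) = integral of p_theta(x|z) p(z) dz.
z = sample_prior(5)
x = sample_decoder(z)
print(x)
```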

Hang on a second – in VAEs, we maximise a lower bound on the log-likelihood, not minimise a divergence, right? In fact, it turns out that if we choose the following KL as our divergence,

$$\mathrm{KL}\big(p^*(x)\,\big\|\,p_\theta(x)\big) \;=\; \mathbb{E}_{p^*(x)}\big[\log p^*(x)\big] \;-\; \mathbb{E}_{p^*(x)}\big[\log p_\theta(x)\big],$$

then since the left expectation doesn’t depend on $\theta$, minimising the divergence is equivalent to maximising the right expectation, which happens to be the log-likelihood.

Since $\mathrm{KL}$ is a divergence, we have that $\mathrm{KL}\big(p^*\,\big\|\,p_\theta\big) \ge 0$, with equality if and only if $p_\theta = p^*$. This means that the maximum possible value of $\mathbb{E}_{p^*(x)}\big[\log p_\theta(x)\big]$ occurs when $p_\theta = p^*$, at which point $\mathbb{E}_{p^*(x)}\big[\log p_\theta(x)\big] = \mathbb{E}_{p^*(x)}\big[\log p^*(x)\big]$. So this is the global optimum of the VAE objective.
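The point that minimising this KL is the same as maximising the log-likelihood can be checked numerically: the left expectation is a constant in $\theta$, so only the right one moves. Below is a small Monte Carlo sketch with one-dimensional Gaussians standing in for $p^*$ and a few candidate $p_\theta$’s, all chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

# True data distribution p*(x) = N(0, 1), sampled once.
x = rng.normal(size=200_000)
const_term = np.mean(log_normal(x, 0.0, 1.0))   # E_{p*}[log p*(x)], independent of theta

for mean, std in [(2.0, 1.0), (0.5, 2.0), (0.0, 1.0)]:   # candidate p_theta's
    loglik = np.mean(log_normal(x, mean, std))            # E_{p*}[log p_theta(x)]
    kl = const_term - loglik                               # KL(p* || p_theta), always >= 0
    print(f"theta=({mean}, {std}): log-likelihood={loglik:.3f}, KL={kl:.3f}")
# The KL is smallest (about 0) exactly where the log-likelihood is largest, at p_theta = p*.
```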

Although $p(z)$ and $p_\theta(x\mid z)$ are usually chosen to be simple and easy to evaluate, $p_\theta(x)$ is generally difficult to evaluate since it involves computing an integral. $\log p_\theta(x)$ can’t easily be evaluated, but the variational lower bound of this quantity, $\mathcal{L}(\theta,\phi;x)$, can be. This involves introducing a new family of conditional distributions $q_\phi(z\mid x)$, which we call the approximate posterior. Provided we have made sensible choices about the family of distributions $q_\phi(z\mid x)$, $\mathcal{L}(\theta,\phi;x)$ will be simple to evaluate (and differentiate through), but the price we pay is the gap between the true posterior $p_\theta(z\mid x)$ and the approximate posterior $q_\phi(z\mid x)$. This is derived in more detail in the previous post mentioned above.

$$\log p_\theta(x) \;=\; \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z\mid x)\,\big\|\,p(z)\big)}_{\mathcal{L}(\theta,\phi;x)} \;+\; \mathrm{KL}\big(q_\phi(z\mid x)\,\big\|\,p_\theta(z\mid x)\big)$$

While it is indeed tempting to look at the definition of $\mathcal{L}(\theta,\phi;x)$ in the equation above and think ‘reconstruction + regulariser’ as many people do, it’s important to remember that the encoder $q_\phi(z\mid x)$ was only introduced as a trick: we’re actually trying to train an LVM, and the thing we want to maximise is $\log p_\theta(x)$. The encoder $q_\phi(z\mid x)$ doesn’t actually have anything to do with this term beyond a bit of mathematical gymnastics that gives us an easily computable approximation to $\log p_\theta(x)$.
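The decomposition above can be verified exactly in a model small enough that everything is available in closed form. The sketch below uses a hypothetical linear-Gaussian model, $p(z) = \mathcal{N}(0,1)$ and $p_\theta(x\mid z) = \mathcal{N}(z,1)$, for which the marginal is $p_\theta(x) = \mathcal{N}(0,2)$ and the true posterior is $p_\theta(z\mid x) = \mathcal{N}(x/2, 1/2)$, and checks that $\log p_\theta(x) = \mathcal{L}(\theta,\phi;x) + \mathrm{KL}\big(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\big)$ for an arbitrary Gaussian $q$.

```python
import numpy as np

def log_normal(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for one-dimensional Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

# Model: p(z) = N(0, 1), p_theta(x|z) = N(z, 1)  =>  p_theta(x) = N(0, 2), p_theta(z|x) = N(x/2, 1/2).
x = 1.3                      # an arbitrary data point
q_mean, q_var = 0.4, 0.7     # an arbitrary Gaussian approximate posterior q(z|x)

# ELBO = E_q[log p_theta(x|z)] - KL(q(z|x) || p(z)); the expectation is analytic here,
# since E_q[(x - z)^2] = (x - q_mean)^2 + q_var.
recon = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - q_mean) ** 2 + q_var)
elbo = recon - kl_gauss(q_mean, q_var, 0.0, 1.0)

log_px = log_normal(x, 0.0, 2.0)
gap = kl_gauss(q_mean, q_var, x / 2.0, 0.5)   # KL(q(z|x) || p_theta(z|x))

print(elbo + gap, log_px)   # the two numbers agree: log p_theta(x) = ELBO + posterior gap
```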

Powerful decoders

For our purposes, we will define a decoder (i.e. a family of conditional distributions $p_\theta(x\mid z)$) to be powerful with respect to $p^*$ if there exists a $\theta^*$ such that $p_{\theta^*}(x\mid z) = p^*(x)$ for all $x$ and $z$. This is a property of both the family of decoders and the data itself. In words, a decoder is powerful if it is possible to perfectly describe the data distribution without using the latent variable.

When people talk about powerful decoders and ‘ignoring the latent variables’, they are often referring to a case in which $p^*$ is a complex distribution over images and the decoder is a very expressive auto-regressive architecture (e.g. PixelCNN).

However, this also happens in much simpler cases: suppose that $p^*$ is Gaussian, $p^*(x) = \mathcal{N}(x;\, \mu^*, \sigma^{*2})$, and that we use a Gaussian decoder $p_\theta(x\mid z) = \mathcal{N}\big(x;\, \mu_\theta(z), \sigma_\theta^2(z)\big)$, where $\mu_\theta$ and $\sigma_\theta$ are parameterised by neural networks. In this case, the decoder is powerful with respect to $p^*$, provided that the neural networks are capable of modelling the constant functions $\mu_\theta(z) = \mu^*$ and $\sigma_\theta(z) = \sigma^*$.

As a brief aside, suppose we use a Gaussian decoder, but with a non-Gaussian $p^*$. The decoder can be made more expressive by adding more layers to the networks, but it will not be possible to make the decoder powerful with respect to $p^*$ by adding more and more layers alone – doing so would require using more expressive conditional distributions than Gaussians.

It’s quite easy to prove using the decomposition above that ‘ignoring the latent variable’ in VAEs with decoders that are powerful with respect to the data is actually optimal behaviour.

Claim: Suppose that (i) there exists $\theta^*$ such that $p_{\theta^*}(x\mid z) = p^*(x)$ for all $x$ and $z$, and (ii) there exists $\phi^*$ such that $q_{\phi^*}(z\mid x) = p(z)$ for all $x$ and $z$. Then $(\theta^*, \phi^*)$ is a globally optimal solution to the VAE objective.

Proof: If $p_{\theta^*}(x\mid z) = p^*(x)$ for all $x$ and $z$, then $p_{\theta^*}(x) = \int p_{\theta^*}(x\mid z)\,p(z)\,dz = p^*(x)$, and thus the true posterior is $p_{\theta^*}(z\mid x) = p(z) = q_{\phi^*}(z\mid x)$, and so the variational lower bound above is tight. That is,

$$\mathcal{L}(\theta^*, \phi^*; x) \;=\; \log p_{\theta^*}(x) \;=\; \log p^*(x).$$

Since the lower bound can never exceed the log-likelihood, and the log-likelihood itself is maximised when $p_\theta = p^*$, the objective of the VAE is at its global optimum. $\square$
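The claim can be sanity-checked numerically in the Gaussian example from earlier: take a ‘collapsed’ decoder whose mean and variance are constants that ignore $z$, and an encoder equal to the prior, so that term (i) becomes $\mathbb{E}_{p^*}[\log p^*(x)]$ and term (ii) vanishes. The particular values of $\mu^*$ and $\sigma^{*2}$ below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

mu_star, var_star = 1.5, 0.8    # p*(x) = N(mu*, sigma*^2); the numbers are arbitrary
x = rng.normal(mu_star, np.sqrt(var_star), size=200_000)

def collapsed_elbo(x):
    """ELBO at the collapsed solution: decoder ignores z, encoder equals the prior."""
    z = rng.normal(size=x.shape)                 # z ~ q(z|x) = p(z) = N(0, 1)
    mu_theta = np.full_like(z, mu_star)          # constant mean function mu(z) = mu*
    var_theta = np.full_like(z, var_star)        # constant variance function sigma^2(z) = sigma*^2
    recon = log_normal(x, mu_theta, var_theta)   # (i): the sampled z plays no role at all
    kl = 0.0                                     # (ii): KL(q || p) = 0 since q is the prior
    return np.mean(recon - kl)

print(collapsed_elbo(x))                          # ELBO at the collapsed solution
print(np.mean(log_normal(x, mu_star, var_star)))  # E_{p*}[log p*(x)], the best achievable value
```

The two printed numbers are identical: the latent sample $z$ is genuinely ignored, and the bound sits at the maximum value identified above.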

What if $p_\theta = p^*$ but $p_\theta(x\mid z)$ isn’t independent of $z$?

If we have powerful decoders, it may well be that there is a setting of the parameters $\theta'$ such that $p_{\theta'} = p^*$ but for which $p_{\theta'}(x\mid z)$ does actually depend on $z$. In this case, for any $\phi$ we have

$$\mathcal{L}(\theta', \phi; x) \;=\; \log p^*(x) \;-\; \mathrm{KL}\big(q_\phi(z\mid x)\,\big\|\,p_{\theta'}(z\mid x)\big),$$

and so $\mathcal{L}(\theta', \phi; x)$ will be strictly worse than the global optimum for any $\phi$ for which $\mathrm{KL}\big(q_\phi(z\mid x)\,\big\|\,p_{\theta'}(z\mid x)\big) > 0$. If $p_{\theta'}(x\mid z)$ depends on $z$, the posterior distribution $p_{\theta'}(z\mid x)$ is likely to be complex. Since $q_\phi(z\mid x)$ must by design be a reasonably simple family of distributions, it is unlikely that there exists a $\phi$ such that $q_\phi(z\mid x) = p_{\theta'}(z\mid x)$ for all $x$, and hence it is likely that for any $\phi$,

$$\mathcal{L}(\theta', \phi; x) \;<\; \mathcal{L}(\theta^*, \phi^*; x),$$

which is to say that the solution $(\theta^*, \phi^*)$ will be preferred by the VAE over such a $\theta'$.
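This can also be seen in a toy model: take $p^*(x) = \mathcal{N}(0,1)$, prior $p(z) = \mathcal{N}(0,1)$ and decoder $p_{\theta'}(x\mid z) = \mathcal{N}(az, 1-a^2)$, so that the marginal is $p_{\theta'}(x) = \mathcal{N}(0,1) = p^*$ for any $a$, yet for $a \neq 0$ the decoder uses the latent. Fixing the encoder to the prior (standing in for a $q$ that cannot match the true posterior $\mathcal{N}(ax, 1-a^2)$), the bound is strictly below $\log p^*(x)$, while the collapsed solution $a = 0$ attains it. The model and the restriction on $q$ are made up for illustration; in this toy the true posterior happens to be Gaussian, so a richer $q$ could in principle close the gap, and the sketch only illustrates what happens when it cannot.

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) )."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def log_normal(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

x = 1.3                                   # an arbitrary data point
for a in [0.0, 0.5, 0.9]:                 # how strongly the decoder uses the latent variable
    # p(z) = N(0,1), p_{theta'}(x|z) = N(a z, 1 - a^2)  =>  p_{theta'}(x) = N(0, 1) = p*(x).
    post_mean, post_var = a * x, 1.0 - a ** 2         # true posterior p_{theta'}(z|x)
    # Encoder fixed to the prior q(z|x) = N(0, 1), standing in for a q that cannot match the posterior.
    gap = kl_gauss(0.0, 1.0, post_mean, post_var)     # KL(q(z|x) || p_{theta'}(z|x))
    elbo = log_normal(x, 0.0, 1.0) - gap              # ELBO = log p*(x) - posterior gap
    print(f"a={a}: ELBO={elbo:.3f}, log p*(x)={log_normal(x, 0.0, 1.0):.3f}")
# Only a = 0 (the decoder ignoring z) makes the bound tight; any a != 0 is strictly worse.
```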

Put differently, and subject to some caveats about the richness of the family of distributions $q_\phi(z\mid x)$: if there is an optimal solution which ignores the latent code, it is probably the unique optimal solution.

Summary

If you are still in the mindset that VAEs are autoencoders with objectives of the form ‘reconstruction + regulariser’, the above proof that ignoring the latent variable is optimal when using powerful decoders might be unsatisfying. But remember, VAEs are not autoencoders! They are first and foremost ways to train LVMs. The objective of the VAE is a lower bound on $\log p_\theta(x)$. The encoder $q_\phi(z\mid x)$ is introduced only as a mathematical trick to get a lower bound on $\log p_\theta(x)$ that is computationally tractable. This bound is exact when the latent variables are ignored, so if it is possible to capture the data distribution – i.e. $p_\theta = p^*$ – without using the latent variables, this will be preferred by the VAE.

I’m grateful to Jamie Townsend and Diego Fioravanti for helpful discussions leading to the writing of this post, and to Sebastian Weichwald, Alessandro Ialongo, Niki Kilbertus and Mateo Rojas-Carulla for proofreading it.

