
Autoencoding Generative Adversarial Networks

source link: https://towardsdatascience.com/autoencoding-generative-adversarial-networks-16082512b583?gi=e5f2e9a8c171


How the AEGAN architecture stabilizes GAN training and prevents mode collapse


AEGAN, a two-way street (Image source: Pixabay)

GANs are hard to train. When they work, they work wonders, but anyone who's tried to train one themselves knows they're damn finicky bastards. Two of the most common problems in GAN training are mode collapse and lack of convergence. In mode collapse, the generator learns to generate only a handful of samples; in generating "handwritten" digits, a GAN undergoing mode collapse might only learn to draw sevens, albeit highly realistic sevens. With lack of convergence, the healthy competition between the generator and the discriminator sours, usually with the discriminator becoming much better than the generator; when the discriminator can easily and completely discern real samples from generated ones, the generator gets no useful feedback and can't improve.

In a recent paper, I proposed a technique which appears to stabilize GAN training and address both of the above issues. A side effect of this technique is that it allows for efficient, direct interpolation between real samples. In this article, I aim to step through the key ideas of the paper and illustrate why I think the AEGAN technique has the potential to be a very useful tool in the GAN trainer's toolbox.


I’ve been told to stop burying the lede, so here’s the most interesting result of the paper. What’s going on here? Keep reading to find out.

Enter the AEGAN

Bijective Mapping

GANs learn a mapping from some latent space Z (the random noise) to some sample space X (the dataset, usually images). These mappings are functions — each point z in Z corresponds to exactly one sample x in X. However, they're rarely surjective — many samples in X have no corresponding point in Z — and nothing guarantees they're injective either. Indeed, mode collapse occurs when many points zi , zj , and zk map to a single sample xi , leaving the GAN unable to generate samples xj or xk . With this in mind, a more ideal GAN would have the following qualities:

  1. Each latent point z in Z should correspond to a unique sample x in X.
  2. Each sample x in X should correspond to a unique latent point z in Z.
  3. The probability of drawing z from Z, p(Z=z), should equal the probability of drawing x from X, p(X=x).
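To make the mode-collapse failure concrete, here's a toy numpy sketch (all names and the two "generators" are made up for illustration): a healthy generator is an invertible map, so distinct latent points yield distinct samples, while a collapsed generator sends every latent point to the same output.

```python
import numpy as np

rng = np.random.default_rng(0)

# A healthy toy generator: an invertible linear map, so distinct latent
# points always produce distinct samples.
A = np.array([[2.0, 1.0], [1.0, 2.0]])  # invertible 2x2 matrix

def g_healthy(z):
    return A @ z

# A collapsed toy generator: it ignores the latent vector entirely and
# produces a single "mode" (one highly realistic seven, so to speak).
def g_collapsed(z):
    return np.array([1.0, 1.0])

z_i, z_j, z_k = rng.normal(size=(3, 2))

# Healthy: three latent points -> three distinct samples.
healthy = [g_healthy(z) for z in (z_i, z_j, z_k)]
# Collapsed: three latent points -> one sample; other samples are unreachable.
collapsed = [g_collapsed(z) for z in (z_i, z_j, z_k)]

assert not np.allclose(healthy[0], healthy[1])
assert np.allclose(collapsed[0], collapsed[1])
assert np.allclose(collapsed[1], collapsed[2])
```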

These three qualities suggest that we should aim for a one-to-one relationship (i.e. a bijective mapping) between the latent space and the sample space. To do this, we train a function G : Z ⟶ X, which is our generator, and another function E : X ⟶ Z, which we will call the encoder. The intents of these functions are:

  • G(z) should produce realistic samples in the same proportions as they are distributed in X . (This is what regular GANs aim to do)
  • E(x) should produce likely latent points in the same proportions as they are distributed in Z.
  • The composition E(G(z)) should faithfully reproduce the original latent point z .
  • The composition G(E(x)) should faithfully reproduce the original image x .
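The two composition properties can be sketched with a toy example in which G and E are an invertible linear map and its inverse (a deliberately idealized stand-in for the trained networks, not the paper's architecture):

```python
import numpy as np

# Toy "generator" and "encoder" as a linear map and its exact inverse,
# so the two compositions recover their inputs perfectly.
A = np.array([[2.0, 0.0], [1.0, 1.0]])
A_inv = np.linalg.inv(A)

def G(z):  # G : Z -> X
    return A @ z

def E(x):  # E : X -> Z
    return A_inv @ x

z = np.array([0.5, -1.0])
x = np.array([3.0, 2.0])

assert np.allclose(E(G(z)), z)  # E(G(z)) faithfully reproduces z
assert np.allclose(G(E(x)), x)  # G(E(x)) faithfully reproduces x
```

In practice G and E are deep networks and these identities only hold approximately; the reconstruction losses below push them toward this ideal.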

Architecture


Figure 1: The high-level AEGAN architecture. Networks are shown as boxes, values as circles, and losses as diamonds. Colours represent combined networks, where red is a regular image-generating GAN, yellow is a GAN for producing latent vectors, blue is an image autoencoder, and green is a latent vector autoencoder.

AEGAN is a four-network model comprising two GANs and two autoencoders, illustrated in Figure 1. It is a generalization of the CycleGAN technique for unpaired image-to-image translation, where one of the image domains is replaced with random noise. In short, we train two networks to translate between sample space X and latent space Z , and we train another two networks to discriminate between real and fake samples and latent vectors. Figure 1 is a complicated diagram, so let me break it down:

Networks (Boxes):

  • G is the generator network. It takes a latent vector z as input and returns an image x as output.
  • E is the encoder network. It takes an image x as input and returns a latent vector z as output.
  • Dx is the image discriminator network. It takes an image x as input and returns the probability that x was drawn from the original dataset as output.
  • Dz is the latent discriminator network. It takes a latent vector z as input and returns the probability that z was drawn from the latent distribution as output.
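The input/output contracts of the four networks can be sketched with single linear layers standing in for the real (much deeper) networks; the dimensions and weights here are arbitrary placeholders, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT_DIM, IMG_DIM = 8, 64  # arbitrary stand-ins for the real sizes

# Random linear layers stand in for the four networks.
Wg = rng.normal(size=(IMG_DIM, LATENT_DIM))
We = rng.normal(size=(LATENT_DIM, IMG_DIM))
wdx = rng.normal(size=IMG_DIM)
wdz = rng.normal(size=LATENT_DIM)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def G(z):   # latent vector -> image
    return np.tanh(Wg @ z)

def E(x):   # image -> latent vector
    return We @ x

def Dx(x):  # image -> probability it came from the dataset
    return sigmoid(wdx @ x)

def Dz(z):  # latent vector -> probability it came from the prior
    return sigmoid(wdz @ z)

z = rng.normal(size=LATENT_DIM)
x_hat = G(z)
assert x_hat.shape == (IMG_DIM,)
assert E(x_hat).shape == (LATENT_DIM,)
assert 0.0 <= Dx(x_hat) <= 1.0 and 0.0 <= Dz(E(x_hat)) <= 1.0
```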

Values (Circles):

  • x : genuine samples from the original dataset. This is a bit ambiguous, because in some places I use x to mean any value in the domain X. Sorry about that.
  • z : genuine samples from the latent-generating distribution (random noise).
  • x_hat : samples produced by G given a real random vector, i.e. x_hat=G(z).
  • z_hat : vectors produced by E given a real sample, i.e. z_hat=E(x).
  • x_tilde : samples reproduced by G from encodings produced by E , i.e. x_tilde=G(z_hat)=G(E(x)).
  • z_tilde : vectors reproduced by E from images generated by G , i.e. z_tilde=E(x_hat)=E(G(z)).

Losses (Diamonds):

  • L1 (blue): The image reconstruction loss ||G(E(x))-x||_1 , i.e. the Manhattan distance between the pixels of the original image and the autoencoded reconstruction.
  • L2 (green): The latent vector reconstruction loss ||E(G(z))-z||_2 , i.e. the Euclidean distance between the original latent vector and the autoencoded reconstruction.
  • GAN (red): The adversarial loss for images. Dx is trained to discriminate between real images ( x ) and fake images ( x_hat and x_tilde , not shown).
  • GAN (yellow): The adversarial loss for latent vectors. Dz is trained to discriminate between real random noise ( z ) and encodings ( z_hat and z_tilde , not shown).
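The two reconstruction losses are just the L1 and L2 distances defined above; here is a minimal numpy sketch (the function names are mine, and the vectors are toy values, not real images or latents):

```python
import numpy as np

# L1: Manhattan distance between an image and its reconstruction.
def image_recon_loss(x, x_tilde):
    return np.sum(np.abs(x_tilde - x))          # ||G(E(x)) - x||_1

# L2: Euclidean distance between a latent vector and its reconstruction.
def latent_recon_loss(z, z_tilde):
    return np.sqrt(np.sum((z_tilde - z) ** 2))  # ||E(G(z)) - z||_2

x = np.array([0.0, 1.0, 1.0])
x_tilde = np.array([0.5, 1.0, 0.0])
z = np.array([3.0, 4.0])
z_tilde = np.array([0.0, 0.0])

assert image_recon_loss(x, x_tilde) == 1.5   # |0.5| + |0| + |-1| = 1.5
assert latent_recon_loss(z, z_tilde) == 5.0  # sqrt(9 + 16) = 5
```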

Training

The AEGAN is trained the same way as a GAN, alternately updating the generators ( G and E ) and the discriminators ( Dx and Dz ). The AEGAN loss function is slightly more complex than the typical GAN loss, however. It consists of four adversarial components:


The adversarial components of the AEGAN loss.

and two reconstruction components (shown here summed together):

The reconstruction components of the AEGAN loss. The λs are hyperparameters that control the relative weights of the reconstruction components.

which, all summed, form the AEGAN loss. E and G try to minimize this loss while Dx and Dz try to maximize it. If you don’t care for the math, the intuition is simple:

  1. G tries to trick Dx into believing the generated samples x_hat and the autoencoded samples x_tilde are real, while Dx tries to distinguish those from the real samples x .
  2. E tries to trick Dz into believing the generated samples z_hat and the autoencoded samples z_tilde are real, while Dz tries to distinguish those from the real samples z .
  3. G and E have to work together so that the autoencoded samples G(E(x))=x_tilde are similar to the original x , and that the autoencoded samples E(G(z))=z_tilde are similar to the original z .
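The generator-side objective above can be sketched numerically. This is a hedged illustration, not the paper's exact formulation: I assume the non-saturating binary-cross-entropy form of the adversarial terms, and the discriminator outputs, reconstruction errors, and λ values below are made-up stand-ins.

```python
import numpy as np

def bce(p, target):
    # Binary cross-entropy for a single predicted probability.
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p))

# Hypothetical discriminator outputs on the four fake values; in the
# real model these come from Dx and Dz.
p_x_hat, p_x_tilde = 0.3, 0.4  # Dx(G(z)), Dx(G(E(x)))
p_z_hat, p_z_tilde = 0.6, 0.5  # Dz(E(x)), Dz(E(G(z)))

# Made-up reconstruction errors and weights.
l1_recon, l2_recon = 2.0, 0.7
lam_x, lam_z = 10.0, 5.0

# G and E try to make the discriminators output 1 on fakes,
# plus the weighted reconstruction terms.
gen_loss = (
    bce(p_x_hat, 1.0) + bce(p_x_tilde, 1.0)
    + bce(p_z_hat, 1.0) + bce(p_z_tilde, 1.0)
    + lam_x * l1_recon + lam_z * l2_recon
)
assert gen_loss > lam_x * l1_recon  # adversarial terms add on top
```

The discriminators' update is the mirror image: Dx and Dz are trained with target 1 on real values and target 0 on the four fakes.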

Results

To start with, a disclaimer. Due to personal reasons, I’ve only had the time and energy to test this on a single dataset. I’m publishing my work as-is so that others can test out the technique themselves and validate my results or show that this is a dead-end. That said, here’s a sample of the results after 300k training steps:


Figure 2: Random images generated by the AEGAN after 300k training steps on a dataset of 21552 unique anime faces.

By itself, figure 2 isn't all that exciting. If you're reading a Medium article about GANs, then you've probably seen StyleGAN trained on anime faces, which produces far better results. What is exciting is comparing the above results to figure 3:


Figure 3: Random images generated by a GAN after 300k training steps on a dataset of 21552 unique anime faces.

The GAN used to generate the images in figure 3 and the AEGAN used to generate the images in figure 2 have the exact same architectures for G and for Dx ; the only difference is that the AEGAN was made to learn the reverse function as well. This stabilized the training process. And before you ask, no, this wasn’t a one-off fluke; I repeated the training for both the GAN and the AEGAN five times, and in each case, the AEGAN produced good results and the GAN produced garbage.

An exciting side-effect of the AEGAN technique is that it allows for direct interpolation between real samples. GANs are known for their ability to interpolate between samples: draw two random vectors z1 and z2, interpolate between the vectors, then feed the interpolations to the generator and boom! With AEGAN, we can interpolate between real samples:


Figure 4: Interpolations between real samples. The left and right-most columns are real samples, while the middle images are interpolations between those samples. The bottom row features an interpolation between a sample and that same sample mirrored horizontally, giving the illusion that the character is turning their head.
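The recipe is simple: encode both real samples, interpolate linearly in latent space, and decode each step. Here's a toy numpy sketch using an idealized invertible G/E pair standing in for the trained networks (the maps and samples are fabricated for illustration):

```python
import numpy as np

# Toy invertible generator/encoder pair.
A = np.array([[1.0, 2.0], [0.0, 1.0]])
G = lambda z: A @ z                  # latent -> sample
E = lambda x: np.linalg.inv(A) @ x   # sample -> latent

x1 = np.array([1.0, 0.0])  # two "real samples"
x2 = np.array([3.0, 2.0])

# Encode, interpolate in latent space, decode each step.
z1, z2 = E(x1), E(x2)
steps = [G((1 - t) * z1 + t * z2) for t in np.linspace(0.0, 1.0, 5)]

# The endpoints of the interpolation recover the real samples.
assert np.allclose(steps[0], x1)
assert np.allclose(steps[-1], x2)
```

With a real trained AEGAN the endpoints are only approximate reconstructions, but the intermediate decodes stay on the image manifold rather than blending pixels.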

Because the encoder E is able to map a sample x to its corresponding point z in the latent space, the AEGAN allows us to find points z1 and z2 for any samples x1 and x2 and interpolate between them as one would for a typical GAN. Figure 5 illustrates the reconstructions of 50 random samples from the dataset:


Figure 5: Pairs of real samples (left in each pair) and their reconstructions (right in each pair) arranged in five columns. In this sense, AEGAN functions like an autoencoder, but without the hallmark blurriness. Note that the eye colours of the reconstructions are all various shades of green, a form of minor mode collapse.

Discussion

First, I’d like to address the shortcomings of this experiment. As I said, this was only tested on a single dataset. The structures of the individual networks G , E , Dx , and Dz also weren’t extensively explored, and no meaningful hyperparameter tuning was performed (on the number or shape of layers, the λs, etc.). The networks themselves were fairly simplistic; a more thorough and fair experiment would be to apply the AEGAN technique as a wrapper to a more powerful GAN on a more complex dataset such as CelebA.

That said, the AEGAN has a number of desirable theoretical properties which make it ripe for further exploration.

  • Forcing the AEGAN to preserve information about the latent vector in the generated image prevents mode collapse by definition. This also allows us to avoid batch-independence-breaking techniques like batch normalization and minibatch discrimination. Incidentally, I was forced to avoid batch normalization in this experiment due to an issue with its implementation in tf.keras 2.0, but that’s a story for another day…
  • Learning a bijective function allows for direct interpolation between real samples, without relying on auxiliary networks or invertible layers. It also may allow for better exploration and manipulation of the latent space, possibly by experimenting with different distributions as was done in Adversarial Autoencoders .
  • Exposing the generator to real samples directly allows it to spend less time wading about in the abyss of pixel-space.

To that last point, the generators of regular GANs are never directly exposed to the training data; they only learn what the data looks like indirectly, through the discriminator’s feedback (hence the nickname “blind forger”). By including a reconstruction loss, the generator can beeline towards the low-dimensional data manifold in high-dimensional pixel space. Consider figure 6, which shows the AEGAN’s output after only 200 training steps:


Figure 6: AEGAN output after 200 training steps (NOT 200 epochs). Faces are clearly visible, and they have distinct hair and eye colours, expressions, and poses.

Compare this to figure 7, which shows a regular GAN with the same architecture as the AEGAN at the same point in its training:


Figure 7: GAN output after 200 training steps (NOT 200 epochs). Since you already know these are faces, you can kind of see hair, skin, and eyes, but there’s substantial mode collapse and the images are very low quality.

As you can see, the AEGAN is particularly effective at finding the low-dimensional manifold, although measuring its ability to fit that manifold will require further experimentation.

Further Work

  • Apply AEGAN to state-of-the-art techniques like StyleGAN to see if it improves quality and/or rate of convergence.
  • Explore the λ hyperparameters to find optimal values; explore curriculum methods, such as gradually decreasing the λs over time.
  • Apply conditionality to the training; explore Bernoulli and Multinoulli latent components, as was done in Adversarial Autoencoders .
  • Apply AEGAN to a designed image dataset with a known underlying manifold, to measure how effectively the technique can reproduce it.
  • Find a way to match the dimensionality of the latent space to the dimensionality of the data-generating function’s manifold (easier said than done!)

Errata

I’d be remiss if I didn’t mention Variational Autoencoder/GANs somewhere, an interesting related technique, so here it is. The data used to train these models is available on Kaggle. You can check out the original paper here. My tf.keras implementation of this network is available at the following GitHub repo:

