
Generative Adversarial Networks (GANs)

source link: https://towardsdatascience.com/generative-adversarial-networks-gans-2231c5943b11?gi=aa214809a602


A Gentle Introduction to GANs, Why Training Them Is Challenging, and the Main Attempts to Remedy It


These Ultra-Realistic Images of Human Faces Are Not Real; They Were Produced by NVIDIA’s Style-Based GAN, Which Allows Control Over Different Aspects of the Image

Generative Adversarial Networks (a.k.a. GANs) represent one of the most exciting recent innovations in deep learning. GANs were introduced in 2014 by Ian Goodfellow and his colleagues (including Yoshua Bengio) at the University of Montreal, and Yann LeCun described them as ‘the most interesting idea in the last 10 years in ML’.

A GAN is a generative model in which two neural networks compete in a typical game-theory scenario. The first neural network is the generator, responsible for generating new synthetic data instances that resemble your training data, while its adversary, the discriminator, tries to distinguish between real samples (from the training set) and fake ones (produced by the generator). The mission of the generator is to fool the discriminator, and the discriminator tries to avoid being fooled. That’s why the system as a whole is described as adversarial.

As shown in the figure below, the generator’s input is simply random noise, while only the discriminator has access to the training data for classification purposes. The generator improves its output based exclusively on the feedback of the discriminator network (positive when its samples are mistaken for training data, negative when they are not).


Generative Adversarial Network Architecture
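
To make this setup concrete, here is a minimal sketch of the two competing networks in PyTorch. The layer sizes, the 64-dimensional noise vector, and the flattened-image data dimension are illustrative assumptions, not values prescribed by the article:

```python
import torch
import torch.nn as nn

NOISE_DIM = 64   # size of the random noise input to the generator (assumption)
DATA_DIM = 784   # size of one data sample, e.g. a flattened 28x28 image (assumption)

# The generator maps random noise to a synthetic data sample.
generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, DATA_DIM),
    nn.Tanh(),  # outputs in [-1, 1], matching data normalized to that range
)

# The discriminator maps a sample (real or fake) to a probability of being real.
discriminator = nn.Sequential(
    nn.Linear(DATA_DIM, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),  # probability that the input came from the training data
)
```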

Formally, GANs are based on a zero-sum (minimax) non-cooperative game, in which one player’s gain is the other’s loss: the first player tries to maximize the value function while the second tries to minimize it. From a game-theory perspective, the GAN model converges when the discriminator and the generator reach the well-known Nash equilibrium, which is the optimal point of the minimax game described above. Since the two players try to mislead each other, the Nash equilibrium is reached when neither player would change its action regardless of what the opponent may do.
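
Concretely, this minimax game corresponds to the value function V(D, G) from the original GAN paper, which the discriminator D maximizes and the generator G minimizes, with x drawn from the data distribution and z from the noise prior:

```latex
\min_G \max_D V(D, G) =
    \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```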

Consequently, GANs are notoriously difficult to train in practice, with significant problems of non-convergence; it takes a while before the generator starts producing fake data that is acceptably close to the real data.
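
In practice, the two networks are trained with alternating gradient updates, which is exactly where the instability originates. Below is a minimal sketch of one training step using the standard binary cross-entropy formulation; it reuses the generator, discriminator, and NOISE_DIM from the earlier sketch, and the Adam settings are illustrative assumptions:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def train_step(real_batch):
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # 1) Discriminator update: push D(real) toward 1 and D(fake) toward 0.
    noise = torch.randn(batch_size, NOISE_DIM)
    fake_batch = generator(noise).detach()  # freeze the generator for this step
    d_loss = (bce(discriminator(real_batch), real_labels)
              + bce(discriminator(fake_batch), fake_labels))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Generator update: its only feedback is the discriminator's verdict.
    noise = torch.randn(batch_size, NOISE_DIM)
    g_loss = bce(discriminator(generator(noise)), real_labels)  # try to fool D
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```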

Commonly Faced Challenges

GANs have a number of common failure modes that can arise when training them. Among these failures, three big challenges are the focus of several research groups all over the world. While none of these problems has been completely solved, we’ll mention some things that people have tried.

1. Mode Collapse

When training GANs, the aim is to produce a wide variety of fake data that mimics the real data (i.e., fits the same distribution). From a random input, we want to create a completely different and new output (for example, a new realistic human face). However, when the generator finds one sample, or a limited diversity of samples, that seems most plausible to the discriminator regardless of the input, it may lazily learn to produce only that output. Even if this may appear, at first, to be a good indication of training progress, it is one of the most challenging failures when training GANs, called Mode Collapse or the Helvetica scenario. Such a situation may happen once the discriminator is stuck in a local minimum and cannot distinguish between a real input and the generator’s output. At this point, the generator will easily notice this black hole and keep generating the same output, or at most slightly different ones.

Attempts to Remedy

  • Use a different loss function, such as the Wasserstein loss, which lets you train the discriminator to optimality without worrying about vanishing gradients. If the discriminator doesn’t get stuck in a local minimum, it learns to reject the outputs the generator stabilizes on, so the generator has to try something new (see the sketch after this list).
  • Unrolled GANs use a generator loss function that incorporates not only the current discriminator’s classifications, but also the outputs of future discriminator versions. So the generator can’t over-optimize for a single discriminator.
  • Train GANs with diverse samples of data.
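
Following up on the Wasserstein-loss bullet: the discriminator becomes a ‘critic’ that outputs an unbounded realness score rather than a probability, and the original WGAN recipe enforces the Lipschitz constraint the theory needs by clipping the critic’s weights. This minimal sketch reuses DATA_DIM from the earlier one; the 0.01 clipping constant and the RMSprop learning rate follow the WGAN paper’s defaults, while the network shape is an illustrative assumption:

```python
import torch
import torch.nn as nn

DATA_DIM = 784  # flattened sample size, as in the earlier sketch (assumption)

# A critic is a discriminator without the final sigmoid: it outputs a raw score.
critic = nn.Sequential(
    nn.Linear(DATA_DIM, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)
critic_opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

def critic_step(real_batch, fake_batch, clip=0.01):
    # fake_batch should be detached from the generator's graph.
    # The critic maximizes score(real) - score(fake); we minimize the negative.
    loss = critic(fake_batch).mean() - critic(real_batch).mean()
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    # Weight clipping keeps the critic (roughly) Lipschitz, as WGAN requires.
    for p in critic.parameters():
        p.data.clamp_(-clip, clip)
    return loss.item()

def generator_wloss(fake_batch):
    # The generator tries to raise the critic's score on its samples.
    return -critic(fake_batch).mean()
```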

2. Failure to Converge

GANs frequently fail to converge. Adversarial training scenarios can easily become unstable: pitting two neural networks against each other in the hope that both will eventually reach equilibrium is a risky assumption, because there is no guarantee that competing gradient updates will result in convergence rather than random oscillations.

Attempts to Remedy

  • An easy trick consists of adding noise to the discriminator’s inputs (both the real and synthetic data) to discourage it from being overconfident about its classification, or from relying on a limited set of features to distinguish between training data and the generator’s output (see the sketch after this list).
  • In the same direction as the previous trick, we can use the Two Time-Scale Update Rule (TTUR) proposed at NIPS 2017, where the authors provided a mathematical proof of convergence to a local Nash equilibrium. The idea is to pick a higher learning rate for the discriminator than for the generator, so that the discriminator adapts quickly while the generator evolves more slowly and carefully (also illustrated in the sketch after this list). This is intuitive, since training a classifier is far easier than training a generative model.
  • Penalizing discriminator weights: See, for example, Stabilizing Training of Generative Adversarial Networks through Regularization .
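
A minimal sketch of the first two remedies combined, reusing the networks from the earlier sketch: annealed Gaussian instance noise added identically to the real and fake discriminator inputs, plus TTUR-style optimizers with a higher learning rate for the discriminator. The 4e-4 / 1e-4 pair and the linear noise schedule are illustrative assumptions; the TTUR paper tunes its rates per task:

```python
import torch

# TTUR: the discriminator updates on a faster time scale than the generator.
d_opt = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.5, 0.999))
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))

def add_instance_noise(batch, step, total_steps, start_std=0.1):
    # Gaussian noise annealed linearly to zero over training. Applying the
    # same perturbation to real and fake inputs blurs the two distributions
    # together, so the discriminator cannot become overconfident early on.
    std = start_std * max(0.0, 1.0 - step / total_steps)
    return batch + std * torch.randn_like(batch)

# Inside a training step, perturb BOTH batches before the discriminator sees them:
#   real_in = add_instance_noise(real_batch, step, total_steps)
#   fake_in = add_instance_noise(fake_batch, step, total_steps)
```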

3. Vanishing Gradients

Research has suggested that if your discriminator is too good, generator training can fail due to vanishing gradients: an optimal discriminator doesn’t provide enough information for the generator to make progress. When we apply backpropagation, we use the chain rule of differentiation, which has a multiplying effect: the gradient flows backward from the final layer to the first layer, getting increasingly smaller as it goes. Sometimes the gradient becomes so small that the initial layers learn very slowly or stop learning completely; the gradient no longer changes their weight values at all, so training of the network’s initial layers is effectively stopped. This is known as the vanishing gradients problem.

Attempts to Remedy

  • Use activation functions such as ReLU or LeakyReLU instead of sigmoid or tanh; the latter squash input values into the ranges [0, 1] and [-1, 1] respectively, which causes an exponential decrease of the gradient as it flows backward through the layers.
  • The Wasserstein loss is designed to prevent vanishing gradients even when you train the discriminator to optimality.
  • The original GAN paper proposed a modified minimax loss to deal with vanishing gradients (see the sketch after this list).
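
As a sketch of that last remedy: the generator’s original minimax loss, log(1 − D(G(z))), saturates when the discriminator confidently rejects fakes, so the original paper suggests instead maximizing log D(G(z)), whose gradient stays large exactly when the generator is losing. A minimal comparison (the epsilon guard against log(0) is an implementation detail added here, not something from the paper):

```python
import torch

EPS = 1e-8  # numerical guard against log(0); an added implementation detail

def saturating_g_loss(d_fake):
    # Original minimax form: near-zero gradient when D(G(z)) is close to 0,
    # i.e. exactly when a strong discriminator rejects every fake sample.
    return torch.log(1.0 - d_fake + EPS).mean()

def non_saturating_g_loss(d_fake):
    # Modified form: maximize log D(G(z)) by minimizing -log D(G(z)).
    # Its gradient is largest when the discriminator is winning.
    return -torch.log(d_fake + EPS).mean()

# d_fake is the discriminator's probability on generated samples, e.g.
#   d_fake = discriminator(generator(noise))
```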

Know your author

Rabeh Ayari is a Senior Data Scientist working on applied AI problems for credit risk modelling and fraud analytics and conducting original research in machine learning. His areas of expertise include data analysis using deep neural networks, machine learning, data visualization, feature engineering, linear/non-linear regression, classification, discrete optimization, operations research, evolutionary algorithms, and linear programming. Feel free to drop him a message here!

