The missing piece of GAN

source link: https://medium.com/@sunnerli/the-missing-piece-of-gan-d091604a615a


Figure 1. The missing piece in the GAN framework.

It has been about four years since GAN (the generative adversarial network) was developed. However, a fact proven this summer [1] shows that there is a missing piece in GAN theory. The improved version is called Relativistic GAN. Generally speaking, the paper extends the relativistic idea to the traditional GAN, and with this idea the GAN framework can train stably.

In this article, I will discuss the relativistic idea behind GAN. The article is split into two parts. In the first part, I describe the fundamentals of GAN and explain the relativistic concept. In the second part, I show my experiments. We focus only on the unconditional GAN in this article and ignore the conditional case, but you can extend the idea to conditional models as usual.

The fundamental of GAN

Figure 2. The diagram of GAN.

In 2014, Ian Goodfellow proposed the now-famous generative model: GAN. In this pioneering framework, data is generated from a prior distribution; to keep the assumption simple, a Gaussian distribution is used as the prior. The generator produces data from samples drawn from the prior distribution. In the second stage, the discriminator determines whether the input image is real or fake. Figure 2 shows the diagram of GAN. In the diagram, I use the value 0 to denote a fake prediction and the value 1 to denote a real prediction.

In the GAN framework, the goal of the generator is to generate images that can fool the discriminator. On the other hand, the goal of the discriminator is to learn to distinguish whether the input image is real or not.

Figure 3. The loss function of standard GAN.

Figure 3 shows the loss function of GAN. The two neural networks are actually playing a min-max zero-sum game, and the loss value is calculated with a cross-entropy mechanism. In the rest of the article, I will use the term “standard GAN” for this specific kind of model whose loss is computed with cross entropy.
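For reference, since Figure 3 is an image, the standard GAN objective from the original GAN paper can be written out as:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

Here D(x) is the discriminator's probability that x is real, and G(z) is the sample generated from the prior noise z.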

When updating the parameters of the discriminator, we want to maximize the loss function so that the discriminator gains the ability to measure the divergence between the true data distribution and the generated data distribution. On the other hand, when updating the parameters of the generator, we want to minimize the loss function and reduce the divergence between the two distributions.

In the ideal situation, the two players reach an equilibrium. However, this formulation is unstable during training, so other game formulations were proposed. We don't discuss those extensions in this article.

Figure 4. The characteristic of generator and discriminator.

Generally speaking, GAN can be generalized through the theory of the Fenchel conjugate and f-divergences. Nevertheless, we ignore the mathematics behind these complicated ideas in this article; you can refer to the f-GAN paper [2] if you are interested in the theory. Simply put, you can imagine the characters of the two networks as shown in Figure 4: the discriminator acts like a ruler and tries to measure the distance, while the generator acts like a rope and tries to pull the two distributions as close as possible!

Figure 5. One updating step in a PyTorch implementation of GAN.

Nowadays, there are plenty of re-implementations of the standard GAN. I just pick one of them [3] and show it in Figure 5. When updating the generator, the prediction on the fake data is expected to match the valid label. The real data is not considered in the generator update of a standard GAN. On the other hand, when updating the discriminator, we label the fake data as fake and, at the same time, label the real data as valid.
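The update logic in Figure 5 can be sketched with plain scalars. This is a minimal sketch, not the actual code from [3] (which operates on batched PyTorch tensors with `nn.BCELoss`); the logit values here are made up for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, label):
    # Binary cross-entropy between a predicted probability p and a 0/1 label.
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Made-up discriminator probabilities for one real and one fake sample.
d_real = sigmoid(2.0)   # discriminator is fairly sure the real sample is real
d_fake = sigmoid(-1.0)  # discriminator leans toward "fake" for the generated one

# Generator update: only the fake sample appears, labeled as valid (1).
g_loss = bce(d_fake, 1)

# Discriminator update: real labeled as valid (1), fake labeled as fake (0).
d_loss = 0.5 * (bce(d_real, 1) + bce(d_fake, 0))
```

Note that `g_loss` touches only the fake prediction: the real data never enters the generator's objective, which is exactly the oddity discussed next.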

Another distance metric: IPM

However, it is a very weird point that we use only one line to update the generator. In other words, we only increase the probability of the fake data. According to the central theory of GAN, we should minimize the whole loss function. Why does everyone do only half of the work?

Figure 6. The forward path of the true data.

The reason is that there is no connection between the generator and the real data. During the forward pass, the true data is sent directly into the discriminator, along the path shown in Figure 6. As a result, the gradient computation never reaches the generator during back-propagation.

The WGAN family of methods was proposed by Arjovsky et al. last year; it includes WGAN and WGAN-GP. (If you are not familiar with these methods, you can refer to my other article. :-) Arjovsky used strong mathematics to prove that the Earth-Mover distance is robust enough to provide meaningful gradient information, which makes training more stable.

Figure 7. The formula of IPM.

In the rest of the article, I will use “IPM-based” for models that use an IPM as the metric between different distributions. In the paper [1], Alexia treats WGAN as an IPM-based method. Figure 7 shows the definition of the IPM, whose full name is Integral Probability Metric. Just like the description in the original paper:
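In symbols, the IPM between distributions P and Q over a function class F is defined as follows (this is the standard definition; WGAN corresponds to taking F as the set of 1-Lipschitz functions):

```latex
\mathrm{IPM}_{\mathcal{F}}(P, Q) =
  \sup_{C \in \mathcal{F}}
  \Big( \mathbb{E}_{x \sim P}\big[C(x)\big] - \mathbb{E}_{x \sim Q}\big[C(x)\big] \Big)
```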

IPMs are statistical divergences

Simply put, an IPM is just another kind of metric to measure the distance between different distributions. In the traditional theory of GAN, we know that minimizing the loss function shown in Figure 3 amounts to minimizing the Jensen-Shannon divergence (JS divergence). So the JS divergence is one metric measuring the distance between the real data distribution and the generated data distribution. The JS divergence is computed from log-probability terms, and its formula is shown in Figure 8.

Figure 8. The formula of JS divergence.

However, the distance between two data distributions does not have to be measured by the JS divergence. We can use a different kind of metric as long as we can justify it with formal mathematics. The IPM is another such distance metric: it calculates the supremum of the difference between the two distributions over a class of functions. Alexia also shows that the standard GAN updating scheme is very similar to an IPM-based method when the model satisfies certain conditions. (I don't detail those conditions here, since the article would become too long.)
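As a concrete check of these definitions, the JS divergence between two discrete distributions can be computed directly. This is a small illustrative sketch (using natural logarithms, so the maximum value for disjoint distributions is log 2):

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence KL(p || q) for discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    # JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the midpoint.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

same = js([0.5, 0.5], [0.5, 0.5])   # identical distributions -> 0
apart = js([1.0, 0.0], [0.0, 1.0])  # disjoint distributions -> log(2)
```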

Persuading you to use the relativistic approach

In the paper [1], the author uses some mathematics to prove that the IPM-based approach is more reasonable. However, I am more convinced by the descriptive side of the debate. The following explains the reasoning from two angles: prior knowledge and divergence minimization. (I didn't fully understand the gradient argument, so I don't cover it here.)

Prior knowledge

Figure 9. The probability changing during training.

The current updating scheme is unreasonable! The ideal discriminator cannot distinguish the input data and assigns a probability of 0.5 in expectation. The Nash equilibrium is eventually reached, as shown on the left side of Figure 9. However, in actual training we only increase the probability of the fake data, just as the middle sub-image shows. The trend of the two lines is completely different from the left side! Just like the criticism in the paper:

This behavior is illogical considering the a priori knowledge that half of the samples in the mini-batch are fake!

Convergence can be obtained only if we also decrease the probability of the true data. After this revision, the converged probability will settle at 0.5, and the equilibrium can be reached again. The right side of Figure 9 shows that the final point matches the left side at the 0.5 level.

Simply put, the IPM-based model does consider the probability of the true data during the generator update, because the concept of an IPM is the difference measured over both distributions. The standard GAN updating approach, however, never decreases the probability of the true data.
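To make the contrast concrete, here is a scalar sketch of the relativistic standard GAN (RSGAN) losses from [1]. The critic outputs are made-up numbers; a real implementation averages these terms over mini-batches of PyTorch tensors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up raw critic outputs C(x) (logits) for one real and one fake sample.
c_real, c_fake = 2.0, -1.0

# Discriminator loss: the real sample should look MORE realistic than the fake.
d_loss = -math.log(sigmoid(c_real - c_fake))

# Generator loss: the fake sample should look MORE realistic than the real one.
g_loss = -math.log(sigmoid(c_fake - c_real))
```

Because both losses depend on the difference C(x_real) - C(x_fake), improving the fake sample simultaneously lowers the relative probability assigned to the real one, which is exactly the decrease that the standard update is missing.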

Divergence minimization

Figure 10. The formula of JS divergence in GAN.

Figure 10 shows that the principle of standard GAN training can be treated as minimizing the JS divergence. My previous article also shows the detailed derivation of this formula. We discuss three cases (we don't consider a discriminator so poor that the probability of the true data is 0):

  1. The probability of the true data is 1 and the probability of the fake data is 0: in this case, the optimal discriminator is obtained, and the JS divergence is log(4)
  2. The probability of the true data is 0.5 and the probability of the fake data is 0.5: in this case, the discriminator cannot distinguish correctly, and the JS divergence is 0
  3. The probability of the true data is 1 and the probability of the fake data is 1: we can encounter this case when updating the generator with the standard GAN strategy. However, the JS divergence is 1 + log(4), which is not the minimum

In this formula, the minimum can be obtained only if the probabilities of the true data and the fake data are both exactly 0.5.

On the contrary, in the generator update we only increase the probability of the fake data without decreasing the probability of the true data. As a consequence, the minimum of the JS divergence cannot be reached if the probability of the true data stays very high during standard GAN training.

The probability of the true data should be 0.5 at the optimum. However, an optimally trained discriminator will assign a higher probability to the true data, and that probability never decreases during the generator update.

To summarize the debate above: more stable training can be obtained only if we not only increase the probability of the fake data, but also decrease the probability of the true data. Therefore, it is critical to involve the true data in the generator's training procedure.

Can we just extend the generator?

Figure 11. The diagram of my naive idea.

From the previous discussion, we know that it is crucial to make the true data participate in the generator's update. Let's make a naive guess: can we just modify the structure of the generator without adopting the IPM idea?

Following this thought, I revised the structure of the generator. The generator is composed of two sub-networks. The first sub-network is the original one. In addition, we train another hourglass network, force it to learn an identity mapping between its input and output, and treat the combination of the two networks as the “actual” generator. The structure of this naive idea is shown in Figure 11.

Under this revised structure, the true data can participate in the generator's computation through the alternative path. Consequently, the gradient can flow into the generator during the update. We still play the min-max game with the original loss function. Can this naive revision work?

Unfortunately, the answer is no. There is no mathematical theory proving that this structural revision can completely solve the unstable convergence problem. In my experiment, mode collapse still occurred, and the training crashed just like the standard GAN. I don't provide sample code for this revision in this article, but you can try it on your own if you are interested (the revised code is not hard).

Results of the relativistic idea

Finally, I re-implemented the relativistic GAN loss and ran a simple experiment on the MNIST dataset. The link shows my project source code. Furthermore, the project not only shows the training result, but also provides a revised version of the GAN loss that is defined in the official CycleGAN project implementation. You can simply swap in the file and feel free to use it! The sample code demonstrates how to use the revised version.

Figure 12. Unconditional training on MNIST using the standard GAN training strategy.

Figure 12 shows the training result using the original standard GAN training strategy. (The figure is a GIF; you can reload the page if you spent a long time on the earlier sections!) As you can see, the training is unstable and crashes partway through; in the end, the generator can only produce noise.

Figure 13. Unconditional training on MNIST using the relativistic GAN training strategy.

On the other hand, Figure 13 shows the result after adopting the relativistic concept. As you can see, a much more stable result can be obtained, even though we still use the cross-entropy loss as the measure between the two distributions.

Figure 14. The change in the loss measure.

Figure 14 shows the change in the loss calculation. In the standard GAN, we simply take the sigmoid of the discriminator output and treat it as the probability that the given input data is realistic. After adopting the relativistic idea, we can instead regard the discriminator output as answering: compared to the opposite data, how much more realistic is the given data?
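Written out, with C denoting the raw discriminator output before the sigmoid (this matches the definitions in [1]):

```latex
\text{standard: } D(x) = \operatorname{sigmoid}\big(C(x)\big)
\qquad
\text{relativistic: } \tilde{D}(x_r, x_f) = \operatorname{sigmoid}\big(C(x_r) - C(x_f)\big)
```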

Conclusion

In this article, we discussed the principle of the standard GAN training strategy and pointed out that there is a missing piece in the theory. Next, I gave some arguments from different angles, hoping to convince the reader of the relativistic idea. Furthermore, we briefly discussed a naive idea to make the true data participate in the generator's computation. Finally, I showed my experiment with a successful result, and provided alternative code that you can swap into your own project.

By adopting the relativistic idea, the theory of GAN becomes complete! The training is much more stable, and you can easily adopt the idea too. I hope this change can make GAN more powerful across different kinds of image domains in various vision tasks!

Claps

Give me some claps if you think this article helped you understand the principle of the original paper more easily!

Reference

[1] Alexia Jolicoeur-Martineau, “The relativistic discriminator: a key element missing from standard GAN,” arXiv:1807.00734 [cs.LG], July 2018.

[2] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka, “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization,” in Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, December 2016, pp. 1–9.

[3] https://github.com/eriklindernoren/PyTorch-GAN/blob/master/implementations/gan/gan.py

