
Generating Synthetic Images from textual description using GANs


Automatic synthesis of realistic images is an extremely difficult task, and even state-of-the-art AI/ML algorithms struggle to live up to this expectation. In this story I will talk about how to generate realistic images from a textual description of the image. If you are a fan of Generative Adversarial Networks [GANs], then you are at the right place.

GANs were introduced back in 2014 by Ian Goodfellow, and since then the topic has become very popular in the research community, with plenty of papers published within a few years. Deep Convolutional GAN [DC-GAN] was one of them, and today's topic builds on the DC-GAN architecture.

[Image: a photorealistic portrait of a baby]

Can you guess whether this image shows a real cute baby or one imagined by a GAN?

This image is actually generated by a GAN. The work was done by Karras et al. from NVIDIA, in a paper called StyleGAN [6]. The objective of that paper is to generate very high-resolution images of human faces. You can visit the website https://thispersondoesnotexist.com, and each visit will show a randomly generated human face. Since its inception, GAN research has evolved very quickly, and today we are at a stage where generated images are indistinguishable from real ones. GAN models are really good at generating random images, but it is extremely difficult to control a GAN to generate an image of our interest. In this story I will discuss how to generate an image from a textual description that describes the image in detail.

Topic of Discussion: I will talk about a GAN formulation that takes a textual description as input and generates an RGB image matching that description. As an example, given

“this flower has a lot of small round pink petals”

as input, it will generate an image of a flower with small, round pink petals.

How does a Vanilla GAN work: Before moving forward, let us take a quick look at how a vanilla GAN works.

If you are already aware of Vanilla GAN, you can skip this section.

A GAN comprises two independent networks: one is called the Generator and the other the Discriminator. The Generator generates synthetic samples given random noise [sampled from a latent space], and the Discriminator is a binary classifier that decides whether an input sample is real [output a scalar value of 1] or fake [output a scalar value of 0]. Samples produced by the Generator are termed fake samples.
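Below is a minimal sketch of these two networks in PyTorch. The architecture is purely illustrative (the papers discussed here use deeper, DC-GAN-style convolutional networks); the 64x64 image size and 100-dimensional noise vector are assumptions made for the example.

```python
# Minimal, illustrative GAN components (not the exact architecture from any paper).
import torch
import torch.nn as nn

NOISE_DIM = 100  # dimensionality of the latent noise vector z (assumed)

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # Maps a noise vector to a flattened 64x64 RGB image in [-1, 1].
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM, 256), nn.ReLU(),
            nn.Linear(256, 64 * 64 * 3), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z).view(-1, 3, 64, 64)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # Maps an image to a single probability of being real.
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 64 * 3, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
```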


The beauty of this formulation is the adversarial relationship between the Generator and the Discriminator. The Discriminator wants to do its job in the best possible way: when a fake sample [generated by the Generator] is given to the Discriminator, it wants to call it out as fake. The Generator, however, wants to generate samples in such a way that the Discriminator makes a mistake and calls them real. In some sense, the Generator is trying to fool the Discriminator.


Let us have a quick look at the objective function and how the optimization is done. It is a min-max formulation in which the Generator wants to minimize the objective function while the Discriminator wants to maximize the same objective function.

min_G max_D V(D, G) = E_{x ∼ p_data(x)} [log D(x)] + E_{z ∼ p_z(z)} [log(1 − D(G(z)))]

The Discriminator wants to drive the likelihood D(G(z)) to 0, hence it wants to maximize (1 − D(G(z))). The Generator wants to force D(G(z)) towards 1 so that the Discriminator makes a mistake and calls the generated sample real; hence the Generator wants to minimize (1 − D(G(z))).


This min-max formulation of the objective function has a global optimum when the data distribution and the model distribution are the same, which means that if the optimization converges to the global optimum, the model has learnt the underlying data distribution of the training dataset. Refer to the GAN paper [2] for a better understanding.
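In practice the min-max game is implemented as two alternating gradient steps. The following is a minimal sketch of one training step using the binary cross-entropy form of the objective; it assumes the `Generator` and `Discriminator` classes sketched above and uses the commonly used non-saturating Generator loss rather than literally minimizing log(1 − D(G(z))).

```python
# One illustrative GAN training step (assumes Generator/Discriminator from the sketch above).
import torch
import torch.nn as nn

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCELoss()

def train_step(real_images):
    batch = real_images.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: push D(x) -> 1 for real samples and D(G(z)) -> 0 for fakes.
    z = torch.randn(batch, NOISE_DIM)
    fake_images = G(z).detach()          # do not backprop into G on this step
    d_loss = bce(D(real_images), ones) + bce(D(fake_images), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool D, i.e. push D(G(z)) towards 1 (non-saturating loss).
    z = torch.randn(batch, NOISE_DIM)
    g_loss = bce(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```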

Text-to-Image formulation: In our formulation, instead of only noise as input to the Generator, the textual description is first transformed into a text embedding, concatenated with the noise vector, and then given as input to the Generator. As an example, the textual description is transformed into a 256-dimensional embedding and concatenated with a 100-dimensional noise vector [sampled from a Normal distribution]. This formulation helps the Generator generate images aligned with the input description instead of generating random images.
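A minimal sketch of this conditioning step, assuming the 256-dimensional caption embedding and 100-dimensional noise vector mentioned above (the network body is a placeholder, not the architecture from the paper):

```python
# Illustrative text-conditioned Generator: caption embedding concatenated with noise.
import torch
import torch.nn as nn

TEXT_DIM, NOISE_DIM = 256, 100   # dimensions taken from the text above

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(TEXT_DIM + NOISE_DIM, 512), nn.ReLU(),
            nn.Linear(512, 64 * 64 * 3), nn.Tanh(),
        )

    def forward(self, text_embedding, z):
        # Concatenate the caption embedding with the sampled noise vector.
        x = torch.cat([text_embedding, z], dim=1)
        return self.net(x).view(-1, 3, 64, 64)

text_emb = torch.randn(8, TEXT_DIM)   # stand-in for the output of a real text encoder
z = torch.randn(8, NOISE_DIM)         # z sampled from a Normal distribution
fake_images = ConditionalGenerator()(text_emb, z)
```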


For the Discriminator, instead of having only an image as input, a pair of image and text embedding is given as input, and the output signal is either 0 or 1. Earlier, the Discriminator's only responsibility was to predict whether a given image is real or fake. Now the Discriminator has one additional responsibility: along with identifying whether the given image is real or fake, it also predicts the likelihood that the given image and text are aligned with each other. This formulation forces the Generator not only to generate images that look real but also to generate images that are aligned with the input textual description.
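A minimal sketch of such a conditional Discriminator, again with a placeholder architecture: the image features and the caption embedding are fused before producing a single score that reflects both realism and image-text alignment.

```python
# Illustrative text-conditioned Discriminator scoring an (image, caption embedding) pair.
import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    def __init__(self, text_dim=256):
        super().__init__()
        # Extract image features, then fuse them with the caption embedding.
        self.image_net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 64 * 3, 256), nn.LeakyReLU(0.2),
        )
        self.joint_net = nn.Sequential(
            nn.Linear(256 + text_dim, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, image, text_embedding):
        image_features = self.image_net(image)
        joint = torch.cat([image_features, text_embedding], dim=1)
        return self.joint_net(joint)   # probability that the pair is real and matching
```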


To fulfil this two-fold responsibility of the Discriminator, a series of different (image, text) pairs are given as input during training, as follows:

  1. Pair of (Real Image, Real Caption) as input and target variable is set to 1
  2. Pair of (Wrong Image, Real Caption) as input and target variable is set to 0
  3. Pair of (Fake Image, Real Caption) as input and target variable is set to 0

A very interesting thing to notice here: while the target variable for the (Fake Image, Real Caption) pair is 0 from the Discriminator's perspective, it is set to 1 in the Generator's loss, because the Generator wants the Discriminator to call the fake image real.
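A sketch of how these three pairs could be combined into the Discriminator and Generator losses, assuming the conditional networks sketched above (names and equal weighting of the terms are illustrative):

```python
# Illustrative matching-aware losses built from the three (image, text) pairs above.
import torch
import torch.nn as nn

bce = nn.BCELoss()

def discriminator_loss(D, real_images, wrong_images, fake_images, text_emb):
    batch = real_images.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)
    loss_real = bce(D(real_images, text_emb), ones)     # (Real Image, Real Caption) -> 1
    loss_wrong = bce(D(wrong_images, text_emb), zeros)  # (Wrong Image, Real Caption) -> 0
    loss_fake = bce(D(fake_images, text_emb), zeros)    # (Fake Image, Real Caption) -> 0
    return loss_real + loss_wrong + loss_fake

def generator_loss(D, fake_images, text_emb):
    # The same fake pair is labelled 1 here: G wants D to call its output real.
    ones = torch.ones(fake_images.size(0), 1)
    return bce(D(fake_images, text_emb), ones)
```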

Dataset: A very popular open-source dataset has been used for this solution: the Oxford Flowers-102 dataset, which has approximately 8,000 images across 102 different categories, with 10 different captions describing each image. People can write captions in many different ways, and this dataset is a nice example of that, covering a very wide variety of captions for each image. This rich dataset helps us learn better text embeddings and addresses the problem of the same intent being expressed in different ways.


How the text embeddings are learnt: There are several unsupervised ways of learning text embeddings. A very successful one is Skip-Thought Vectors [5]; such pre-trained vectors can be leveraged for multiple downstream applications. Another popular way to learn text embeddings is with a Triplet Loss. In the Triplet Loss formulation, two captions of the same image are selected: one is considered the anchor and the other the positive, while a random caption from a different category is considered the negative. You can refer to my previous story to know more about Triplet Loss.
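A minimal sketch of this triplet setup, using PyTorch's built-in `TripletMarginLoss`; the caption encoder, its input feature dimension, and the margin here are hypothetical stand-ins, not the encoder used in the article:

```python
# Illustrative triplet-loss setup for learning caption embeddings.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(300, 256))   # hypothetical caption encoder
triplet = nn.TripletMarginLoss(margin=0.2)

anchor_feat = torch.randn(8, 300)     # caption 1 of an image (pre-featurised stand-in)
positive_feat = torch.randn(8, 300)   # caption 2 of the same image
negative_feat = torch.randn(8, 300)   # caption from a different category

loss = triplet(encoder(anchor_feat),
               encoder(positive_feat),
               encoder(negative_feat))
loss.backward()
```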


You can refer to the Git repositories listed in the references as a starting point for a Triplet Loss implementation.

Results: As you can see below, even when captions are written in very different styles, the model is able to interpret the intent and generate images accordingly. Since the model was trained only on images of flowers, giving an input like "I want a black cat having white stripes" will just generate a random image.

[Figure: flowers generated by the model from different textual descriptions]

Conclusion: This work provides a very good direction for the future of Generative Adversarial Networks. The research community has made good progress in controlling what kind of image a GAN generates, and as research progresses in this domain, you can expect much better results in the near future.

References:

1. https://thispersondoesnotexist.com/

2. Generative Adversarial Nets [ https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf ]

3. Generative Adversarial Text to Image Synthesis [ https://arxiv.org/pdf/1605.05396.pdf ]

4. Unsupervised Representation Learning With Deep Convolutional Generative Adversarial Networks [ https://arxiv.org/pdf/1511.06434.pdf ]

5. Skip-Thought Vectors [ https://arxiv.org/pdf/1506.06726.pdf ]

6. A Style-Based Generator Architecture for Generative Adversarial Networks [ https://arxiv.org/pdf/1812.04948.pdf ]

7. https://github.com/reedscot/icml2016

8. https://github.com/paarthneekhara/text-to-image

9. https://github.com/ryankiros/skip-thoughts

10. https://github.com/carpedm20/DCGAN-tensorflow

