
Let’s make some Anime using Deep Learning

Comparing text generation methods: LSTM vs GPT2

Jul 10 · 12 min read


Photo by Bruce Tang on Unsplash

The motivation for this project was to see how far technology has come in just a few years in the NLP domain, especially when it comes to generating creative content. I have explored two text generation techniques by generating Anime synopses, first with LSTM units, which is a relatively old technique, and then with a fine-tuned GPT2 transformer.

In this post you will see how AI went from creating this piece of nonsense…

A young woman capable : a neuroi laborer of the human , where one are sent back home ? after defeating everything being their resolve the school who knows if all them make about their abilities . however of those called her past student tar barges together when their mysterious high artist are taken up as planned while to eat to fight !

to this piece of art.

A young woman named Haruka is a high school student who has a crush on a mysterious girl named Miki. She is the only one who can remember the name of the girl, and she is determined to find out who she really is.

To get the most out of this post you must have knowledge of :

  • Python programming
  • Pytorch
  • Working of RNNs
  • Transformers

Alright then, let's see some code!

Data Description

The data used here has been scraped from myanimelist. It initially contained over 16,000 data points and was a really messy dataset. I took the following steps to clean it:

  • Removed all the weird genres of Anime (if you’re an Anime fan you will know what I'm talking about).
  • Every synopsis contained its source at the end of the description (e.g. Source: myanimelist, Source: crunchyroll, etc.), so I removed that as well.
  • Anime that are based on video games, spin-offs or adaptations had very short summaries, so I removed all synopses with fewer than 30 words, and I also removed all synopses containing the words “spin-off”, “based on”, “music video” or “adaptation”. The logic behind this was that these types of Anime won’t really make our model creative.
  • I also removed Anime whose synopses had more than 300 words. This is just to make training easier (check the GPT2 section for more details).
  • Removed symbols.
  • Some descriptions also contained Japanese characters, so those were removed as well.

The following functions take care of all this.
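The original code gist is not reproduced here, so below is a minimal sketch of how these cleaning steps could look. The function names and the exact regular expressions are my assumptions, not the author's code.

```python
import re

def clean_synopsis(text):
    """Apply the cleaning steps described above to a single synopsis (a sketch)."""
    # Drop the trailing source credit, e.g. "(Source: MyAnimeList)"
    text = re.sub(r'\(?source:.*$', '', text, flags=re.IGNORECASE)
    # Remove Japanese characters (hiragana, katakana, CJK ideographs)
    text = re.sub(r'[\u3040-\u30ff\u4e00-\u9fff]+', '', text)
    # Remove symbols, keeping only letters, digits and basic punctuation
    text = re.sub(r"[^a-zA-Z0-9.,!?'\s]", ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

def keep_synopsis(text):
    """Filter rule: drop spin-offs/adaptations and too-short or too-long synopses."""
    banned = ('spin-off', 'based on', 'music video', 'adaptation')
    n_words = len(text.split())
    return 30 <= n_words <= 300 and not any(b in text.lower() for b in banned)
```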

The LSTM way

The traditional approach for text generation uses recurrent LSTM units. LSTMs (long short-term memory units) are specifically designed to capture long-term dependencies in sequential data, which normal RNNs can’t, and they do so by using multiple gates that govern the information passing from one time step to another.

Intuitively, at a time step the information that reaches an LSTM unit goes through these gates, which decide whether the information needs to be updated; if it does, the old information is forgotten and the new, updated values are sent to the next time step. For a more detailed understanding of LSTMs I would suggest you go through this blog.

Creating the Dataset

So before we build our model architecture, we must tokenize the synopses and process them in such a way that the model accepts them.

In text generation the input and the output are the same, except that the output tokens are shifted one step to the right. This basically means that the model takes the past words as input and predicts the next word. The input and output tokens are passed into the model in batches, and each batch has a fixed sequence length. I followed these steps to create the dataset:

  • Create a config class.
  • Join all the synopsis together.
  • Tokenize the synopses.
  • Define the number of batches.
  • Create vocabulary, word to index & index to word dictionaries.
  • Create output tokens by shifting input tokens to the right.
  • Create a generator function which outputs the input and output sequences batch wise.
Creating the dataset
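Here is a minimal sketch of the steps above, assuming whitespace tokenization and the batch layout described; the names (create_dataset, word2idx, idx2word) are illustrative rather than the exact original code.

```python
import numpy as np

def create_dataset(synopses, batch_size=32, seq_len=30):
    """Tokenize all synopses, build the vocabulary, and yield (input, target) batches."""
    # 1. Join all synopses together and tokenize on whitespace
    tokens = ' '.join(synopses).split()
    # 2. Vocabulary and word-to-index / index-to-word dictionaries
    vocab = sorted(set(tokens))
    word2idx = {w: i for i, w in enumerate(vocab)}
    idx2word = {i: w for w, i in word2idx.items()}
    ids = np.array([word2idx[w] for w in tokens])
    # 3. Trim so the token ids split evenly into batches
    n_batches = len(ids) // (batch_size * seq_len)
    ids = ids[: n_batches * batch_size * seq_len]
    # 4. Targets are the inputs shifted one step to the right
    targets = np.roll(ids, -1)
    ids = ids.reshape(batch_size, -1)
    targets = targets.reshape(batch_size, -1)
    # 5. Generator yielding fixed-length input/output sequences batch-wise
    def batches():
        for i in range(0, ids.shape[1], seq_len):
            yield ids[:, i:i + seq_len], targets[:, i:i + seq_len]
    return word2idx, idx2word, batches
```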

Model Architecture

Our model consists of an embedding layer, a stack of LSTM layers (I have used 3 layers here), a dropout layer and finally a linear layer which outputs the scores for each vocabulary token. We haven't used a softmax layer just yet; you will understand why shortly.

Since the LSTM units also output hidden states, the model returns these hidden states so that they can be passed to the model at the next time step (the next batch of word sequences). Also, after every epoch we need to reset the hidden states to 0, as we won’t need the information from the last time step of the previous epoch at the first time step of the current epoch, so we have a “zero_state” function as well.

model
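A minimal PyTorch sketch matching this description (embedding, three stacked LSTM layers, dropout, linear output layer, plus a zero_state helper); the class name and default hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class SynopsisLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=512, n_layers=3, dropout=0.3):
        super().__init__()
        self.n_layers, self.hidden_dim = n_layers, hidden_dim
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, n_layers,
                            dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, vocab_size)   # raw scores, no softmax

    def forward(self, x, hidden):
        x = self.embedding(x)
        out, hidden = self.lstm(x, hidden)            # hidden is returned for the next batch
        out = self.dropout(out)
        return self.fc(out), hidden

    def zero_state(self, batch_size, device='cpu'):
        # Reset the hidden and cell states to zero at the start of every epoch
        shape = (self.n_layers, batch_size, self.hidden_dim)
        return (torch.zeros(shape, device=device),
                torch.zeros(shape, device=device))
```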

Training

We then just define our training function, store the losses from every epoch and save the model with the best loss. We also call the zero state function before every epoch to reset the hidden states.

The loss function we are using is cross-entropy loss; this is the reason we are not passing the output through an explicit softmax layer, as this loss function calculates the softmax internally.

All the training is done on a GPU. The following parameters are used (as provided in the config class), and a sketch of the training loop follows the list:

  • batch size = 32
  • maximum sequence length = 30
  • embedding dimension = 100
  • hidden dimension = 512
  • epochs = 15
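A minimal sketch of such a training loop, assuming the SynopsisLSTM model and batch generator sketched earlier; the Adam optimizer and the checkpoint path are assumptions.

```python
import torch
import torch.nn as nn

def train(model, batches, epochs=15, batch_size=32, lr=0.001, device='cuda'):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # CrossEntropyLoss applies log-softmax internally, hence no softmax layer in the model
    criterion = nn.CrossEntropyLoss()
    best_loss = float('inf')
    for epoch in range(epochs):
        hidden = model.zero_state(batch_size, device=device)   # reset hidden states every epoch
        losses = []
        for x, y in batches():
            x = torch.tensor(x, dtype=torch.long, device=device)
            y = torch.tensor(y, dtype=torch.long, device=device)
            optimizer.zero_grad()
            scores, hidden = model(x, hidden)
            hidden = tuple(h.detach() for h in hidden)          # don't backprop across batches
            loss = criterion(scores.reshape(-1, scores.size(-1)), y.reshape(-1))
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
        epoch_loss = sum(losses) / len(losses)
        if epoch_loss < best_loss:                              # keep the best checkpoint
            best_loss = epoch_loss
            torch.save(model.state_dict(), 'best_lstm.pt')
```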

Generating Anime

During the text generation step, we feed the model some input text, for example ‘A young woman’; our function will first tokenize this and then pass it into the model. The function also takes the length of the synopsis that we want to output.

The model will output the scores of each vocabulary token. We will then apply softmax to these scores to convert them into a probability distribution.

Then we use top-k sampling, i.e. we select the k tokens with the highest probability out of the n vocabulary tokens and then randomly sample one token, which we return as the output.

This output is then concatenated onto our initial input string, and the output token becomes the input for the next time step. Say the output was “capable”; then our concatenated text is “A young woman capable”. We keep doing this until we output the final token and then print the result. A sketch of this loop follows.
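Here is a sketch of that sampling loop, assuming the model, word2idx and idx2word objects from the earlier sketches, and assuming the prompt words are present in the vocabulary.

```python
import torch
import torch.nn.functional as F

def generate(model, word2idx, idx2word, prompt='A young woman',
             max_len=100, top_k=5, device='cuda'):
    model.eval()
    words = prompt.split()                       # assumes prompt words are in the vocabulary
    hidden = model.zero_state(1, device=device)
    with torch.no_grad():
        # Warm up the hidden state on the prompt tokens
        for w in words:
            ids = torch.tensor([[word2idx[w]]], device=device)
            scores, hidden = model(ids, hidden)
        for _ in range(max_len):
            probs = F.softmax(scores[0, -1], dim=-1)       # scores -> probability distribution
            top_probs, top_ids = probs.topk(top_k)         # keep the k most likely tokens
            choice = torch.multinomial(top_probs, 1)       # randomly sample one of them
            next_id = top_ids[choice].item()
            words.append(idx2word[next_id])
            ids = torch.tensor([[next_id]], device=device)  # generated token becomes the next input
            scores, hidden = model(ids, hidden)
    return ' '.join(words)
```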

Here is a nice diagram to understand what the model is doing


inference step. Source: machinetalk

In the above example, I have given the max length as 100 and the input text as “In the”, and this is the output that we get:

In the days attempt it 's . although it has , however ! what they believe that humans of these problems . it seems and if will really make anything . as she must never overcome allowances with jousuke s , in order her home at him without it all in the world : in the hospital she makes him from himself by demons and carnage . a member and an idol team the power for to any means but the two come into its world for what if this remains was to wait in and is n't going ! on an

This seems grammatically correct but it makes no sense at all. Though LSTMs are better at capturing long-term dependencies than basic RNNs, they can only see a few steps (words) back, or a few steps forward as well if we use bidirectional RNNs, to capture the context of the text. Hence, when generating very long passages, we see that they make no sense.

The GPT2 way

A little bit of theory

Transformers do a much better job of capturing the context of the provided text. They use only attention layers (no RNNs), which allows them to understand the context much better, as they can look as many time steps back (and forward, depending on the attention) as they wish. There are different types of attention, but the attention used by GPT2, one of the best language models out there, is called masked self-attention. If you’re not familiar with transformers, please go through this blog before proceeding.

Instead of using both the transformer encoder and decoder stacks, GPT2 uses a tall stack of just transformer decoders. There are 4 variants of the GPT2 transformer depending on the number of decoders stacked.


Variants. Source: Jalammar

Each decoder unit consists of mainly 2 layers:

  • Masked Self Attention
  • Feed Forward Neural Network

There is a layer normalization step and a residual connection after each step as well. This is what I mean…


Single Decoder unit

If you went through the blog post earlier, you know how self-attention is calculated. Intuitively, the self-attention scores give us the importance, or the attention, that the word at the current time step should give to the other words (past or future time steps, depending on the attention).

In masked self-attention, however, we are not concerned with the next or future words. So the transformer decoder is only allowed to attend to the present and past words, and the future words are masked.
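As a tiny standalone illustration of this idea (not GPT2’s actual implementation), the future positions of an attention score matrix can be filled with negative infinity before the softmax, so they receive zero weight:

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)                  # raw attention scores (query x key)
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(mask, float('-inf'))        # hide the future positions
weights = F.softmax(scores, dim=-1)                     # each row attends only to itself and the past
print(weights)                                          # the upper triangle is exactly 0
```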

Here is a beautiful representation of this idea…


Masked Self Attention. Source: jalammar

In the above example the current word is “it”, and as you can see the words “a” and “robot” have high attention scores. This is because “it” refers to “a robot”, so both of those words receive high attention.

You must have noticed the <s> token at the beginning of the above input text. <s> is just being used to mark the start of the input string. Traditionally, the <|endoftext|> token is used instead of <s>.

Another thing you must have noticed is that this is similar to traditional language modeling, where the next token is predicted from the present and past tokens. The predicted token is then added to the inputs, and the token after it is predicted in turn.

I have given a very intuitive and high-level understanding of GPT2. Even though this is enough to dive into the code, it would be a good idea to read up more on this concept to get a deeper understanding. I suggest Jay Alammar's blog.

The Code

I have used GPT2 with a language model head from the Hugging Face library for text generation. Out of the 4 variants, I have used GPT2 small, which has 117M parameters.

I trained the model on Google Colab. The main issue in training was figuring out the batch size and the maximum sequence length so that I wouldn’t run out of GPU memory; a batch size of 10 and a maximum sequence length of 300 finally worked for me.

For this reason I also removed the synopses with more than 300 words, so that when we generate a synopsis up to a length of 300 it is actually complete.

Creating the Dataset

For fine-tuning, the first task is to get the data into the required format; the DataLoader in PyTorch allows us to do that very easily.

Steps:

  • Clean the data using clean_function defined above.
  • Append the <|endoftext|> token after every synopsis.
  • Tokenize each synopsis using GPT2Tokenizer from HuggingFace.
  • Create a mask for the tokenized words (Note: this mask is not the same as the masked self-attention we talked about; it is for masking the pad tokens, which we will see next).
  • Pad the sequences shorter than the max length (here 300) using the <|pad|> token.
  • Convert the token ids and the mask into tensors and return them.
dataset
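Here is a sketch of a PyTorch Dataset following these steps, assuming a list of already-cleaned synopsis strings; the class name is illustrative.

```python
import torch
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer

class SynopsisDataset(Dataset):
    def __init__(self, synopses, max_len=300):
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.tokenizer.add_special_tokens({'pad_token': '<|pad|>'})  # GPT2 has no pad token by default
        self.max_len = max_len
        self.synopses = synopses

    def __len__(self):
        return len(self.synopses)

    def __getitem__(self, idx):
        text = self.synopses[idx] + ' <|endoftext|>'      # mark the end of each synopsis
        enc = self.tokenizer(text, truncation=True, max_length=self.max_len,
                             padding='max_length')        # pad with <|pad|> up to max_len
        ids = torch.tensor(enc['input_ids'])
        mask = torch.tensor(enc['attention_mask'])        # 0 wherever a pad token sits
        return ids, mask
```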

Model Architecture

Here, we don’t explicitly need to create a model architecture, as the Hugging Face library takes care of that for us. We simply import the pre-trained GPT2 model with a language model head.

This LM head is actually nothing but a linear layer which outputs the scores for each vocabulary token (before softmax).

The cool thing about the GPT2 model with LM head provided by Hugging Face is that we can directly pass the labels (our input tokens) to it; they are shifted to the right by one step internally, and along with the prediction scores the model returns the loss as well. It actually also returns the hidden states of every layer and the attention scores, but we are not interested in those.

We can import the model and the tokenizer and define all the hyperparameters in a config class like so…
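A sketch of such a config together with the model and tokenizer imports; the hyperparameter values repeat the ones mentioned in this post, while the class layout itself is an assumption.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class Config:
    model_name = 'gpt2'      # the 117M-parameter "small" variant
    max_len = 300
    batch_size = 10
    epochs = 5
    lr = 1e-4
    weight_decay = 0.003
    warmup_steps = 10
    device = 'cuda'

tokenizer = GPT2Tokenizer.from_pretrained(Config.model_name)
tokenizer.add_special_tokens({'pad_token': '<|pad|>'})

model = GPT2LMHeadModel.from_pretrained(Config.model_name)
model.resize_token_embeddings(len(tokenizer))   # account for the added <|pad|> token
model.to(Config.device)
```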

Training Function

Steps:

  • The training function takes the ids and the masks from the data loader.
  • Passes the ids and masks through the model.

  • The model outputs a tuple: (loss, prediction scores, list of key & value pairs of every masked-attention layer, list of hidden states of every layer, attention scores). We are only interested in the first 2 items of this tuple, so we access those.

  • Perform backward propagation and update the parameters.
  • Return the mean loss for the epoch.
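A minimal sketch of the training function described by these steps, assuming the model and data loader defined above.

```python
def train_epoch(model, dataloader, optimizer, scheduler, device='cuda'):
    model.train()
    losses = []
    for ids, mask in dataloader:
        ids, mask = ids.to(device), mask.to(device)
        optimizer.zero_grad()
        # Passing labels=ids makes the model shift them internally and return the loss
        # (for simplicity the pad positions are left in the labels here)
        outputs = model(input_ids=ids, attention_mask=mask, labels=ids)
        loss = outputs[0]                 # first two items of the tuple: loss, prediction scores
        loss.backward()
        optimizer.step()
        scheduler.step()
        losses.append(loss.item())
    return sum(losses) / len(losses)      # mean loss for the epoch
```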

Running the Train Function

Steps:

  • Read the data.
  • Create data loader object.
  • Define the optimizer; I am using AdamW (Adam with weight decay), with a learning rate of 0.0001 and a weight decay of 0.003.
  • Define the scheduler. I am using a linear schedule with warm-up from Hugging Face. The warm-up steps are 10 (this basically means that for the first 10 training steps the learning rate increases linearly, and after that it decreases linearly).
  • Run the training function. I have trained for 5 epochs.
  • Save the model with the lowest loss.
  • Empty the GPU cache after every epoch to prevent OOM errors.
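A sketch of this driver code, assuming the Config class, SynopsisDataset and train_epoch sketches above and a list of cleaned synopsis strings called synopses.

```python
import torch
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup

dataset = SynopsisDataset(synopses, max_len=Config.max_len)
loader = DataLoader(dataset, batch_size=Config.batch_size, shuffle=True)

# AdamW (Adam with weight decay); torch's implementation is used here for the sketch
optimizer = torch.optim.AdamW(model.parameters(), lr=Config.lr,
                              weight_decay=Config.weight_decay)
total_steps = len(loader) * Config.epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=Config.warmup_steps,
                                            num_training_steps=total_steps)

best_loss = float('inf')
for epoch in range(Config.epochs):
    loss = train_epoch(model, loader, optimizer, scheduler, device=Config.device)
    if loss < best_loss:                       # save the model with the lowest loss
        best_loss = loss
        model.save_pretrained('best_gpt2_model')
    torch.cuda.empty_cache()                   # avoid OOM between epochs
```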

Generating Anime

In the generation step I have used top-k sampling (as in the LSTM way) as well as top-p sampling. In top-p sampling we provide a cumulative probability, say p; the top vocabulary tokens selected are then the smallest set whose total probability reaches p.

We can combine the top-k and top-p methods: first the k tokens with the highest probability scores are selected, then a normalized score is calculated for these k tokens so that their scores sum to 1; we can also say that the probability mass is redistributed among the k tokens.

Next, top-p sampling is done on these k scores, and finally we sample from the selected tokens using their probabilities to get the output token.

And again, we do not have to code all of this; Hugging Face takes care of it all with its generate method.

Steps:

  • Tokenize the input prompt and convert it to a tensor of token ids.
  • Pass the ids to the generate method along with the sampling parameters (top-k, top-p, maximum length) and the <|pad|> token id.
  • Decode the returned token ids back into text.
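A sketch of the generation call, assuming the fine-tuned model and tokenizer from above; the sampling parameter values here are illustrative, not the exact ones used in the post.

```python
def generate_synopsis(model, tokenizer, prompt='In the year', max_len=300, device='cuda'):
    model.eval()
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
    output_ids = model.generate(
        input_ids,
        max_length=max_len,
        do_sample=True,                       # sample instead of greedy decoding
        top_k=30,                             # keep the 30 most likely tokens...
        top_p=0.7,                            # ...then apply nucleus (top-p) sampling
        pad_token_id=tokenizer.pad_token_id,  # the <|pad|> token added earlier
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```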

You must have noticed that the generate method has a lot of parameters. They can be tuned to get the most optimal output. Check out this blog by Hugging Face that explains these parameters in detail.

For the input text, “In the year” this is the output that we get….

In the year 2060, mankind has colonized the solar system, and is now on the verge of colonizing other planets. In order to defend themselves against this new threat, the Earth Federation has established a special unit known as the Planetary Defense Force, or PDF. The unit is composed of the elite Earth Defense Forces, who are tasked with protecting the planet from any alien lifeforms that might threaten the safety of Earth. However, when a mysterious alien ship crashes in the middle of their patrol, they are forced to use their special mobile suits to fend off the alien threat.

You know what? I would actually watch this.

The difference between the synopses generated by the LSTM and GPT2 is just huge! Not only is the model able to capture long-term dependencies well, but the context is also maintained throughout.

Here is another one…

A shinigami (death god) who is a descendant of the legendary warrior Shigamis father, is sent to Earth to fight against the evil organization known as the Dark Clan. However, his mission is to steal the sacred sword, the Sword of Light, which is said to grant immortality to those who wield it.

Check out my GitHub repository, where you can go through the code and see some more cool AI-generated Anime.

I hope you enjoyed this post and were easily able to follow it. Please provide valuable feedback in the comments. I would love to know how you would approach this task and if you have a better way of doing this and improving the model.

References

Understanding LSTM by Christopher Olah

Text generation with LSTM by Trung Tran

Fine tuning GPT2 Example by Martin Frolovs

Transformers Blog by Jay Alammar

GPT2 Blog by Jay Alammar

Generate Method Hugging Face

