
Moving from Keras to Pytorch


Why? How? It's not that difficult.

Photo by David Clode on Unsplash

Pytorch is great. But it doesn’t make things easy for a beginner.

A while back, I was working on a text classification competition on Kaggle, and as part of it I had to move to Pytorch to get deterministic results: results that don't change with every run of the network, so that I could try out different models and compare them fairly.

Now, I had always worked with Keras in the past, and it had given me pretty good results, but I learned that the CuDNNGRU/CuDNNLSTM layers in Keras are not deterministic, even after setting the seeds.

So Pytorch came to the rescue. And I am glad that I considered moving.
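
To give you a taste, here is a minimal sketch of the kind of seeding utility involved. This is the shape of the seed_everything helper I mention later in the post; the exact set of flags is my choice here, and what you need may vary with your Pytorch version.

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed=42):
    """Seed every source of randomness we touch so runs are reproducible."""
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # Force cuDNN to pick deterministic kernels (at some cost in speed).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)
```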

As a side note: if you want to know more about NLP, I would like to recommend this awesome course on Natural Language Processing in the Advanced Machine Learning Specialization. It covers a wide range of tasks in Natural Language Processing, from basic to advanced: sentiment analysis, summarization, dialogue state tracking, to name a few.

While Keras is great for getting started with deep learning, with time you are going to resent some of its limitations.

I also thought about moving to Tensorflow. It seemed like a natural transition, since TF is the backend of Keras.

But was it hard?

With all the session.run commands and Tensorflow sessions, I was sort of confused. It was not Pythonic at all.

Pytorch helps here, since it is much more Pythonic. You have everything under your control, and you are not losing anything on the performance front. You may actually be gaining some.

In the words of Andrej Karpathy:

I’ve been using PyTorch a few months now and I’ve never felt better. I have more energy. My skin is clearer. My eye sight has improved.

— Andrej Karpathy (@karpathy) May 26, 2017

So without further ado, let me translate Keras to Pytorch for you.

The Classy way to write your network?

OOPs: Object-Oriented Programming

Let us first create an example network in Keras that we will then try to port into Pytorch.

Here I would like to give a piece of advice too. When you try to move from Keras to Pytorch, take any network you have and port it to Pytorch. It will make you understand Pytorch much better.

Here I am writing one of the networks that gave me pretty good results in the Quora Insincere Questions Classification challenge.

This model has all the bells and whistles a text classification deep learning network could contain: GRU, LSTM and embedding layers, as well as a meta input layer. It thus serves as a good example.

Also, if you want to read more about how the BiLSTM/GRU and Attention models work, do visit my post here.

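In sketch form, the Keras network looks like this. It is a simplified version of my competition model: the layer sizes and the number of meta features are illustrative, and I have left out the attention layer for brevity.

```python
from keras.layers import (Bidirectional, CuDNNGRU, CuDNNLSTM, Dense,
                          Embedding, GlobalAveragePooling1D,
                          GlobalMaxPooling1D, Input, concatenate)
from keras.models import Model

def make_model(embedding_matrix, maxlen=70, meta_features=10):
    # Text input: a padded sequence of word indices.
    inp = Input(shape=(maxlen,))
    x = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                  weights=[embedding_matrix], trainable=False)(inp)
    # The CuDNN recurrent layers: fast, but not deterministic.
    x = Bidirectional(CuDNNLSTM(64, return_sequences=True))(x)
    x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
    # Summarize the sequence with average and max pooling.
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    # Hand-crafted meta features enter through a second input.
    meta = Input(shape=(meta_features,))
    conc = concatenate([avg_pool, max_pool, meta])
    conc = Dense(64, activation='relu')(conc)
    out = Dense(1, activation='sigmoid')(conc)
    model = Model(inputs=[inp, meta], outputs=out)
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return model
```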

So a model in Pytorch is defined as a class (therefore a little more classy) that inherits from nn.Module. Every model class contains an __init__ method, where you declare the layers, and a forward method, which defines the forward pass.

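In skeleton form (a minimal sketch with a single placeholder layer):

```python
import torch.nn as nn

class MyNetwork(nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        # Declare the layers (and their sizes) the network will use.
        self.linear = nn.Linear(128, 1)

    def forward(self, x):
        # Define how an input tensor flows through those layers.
        return self.linear(x)
```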

Why is this Classy?

I found it beneficial for these reasons:

1) It gives you a lot of control over how your network is built.

2) You understand a lot about the network when you are building it, since you have to specify the input and output dimensions. So there are fewer chances of error. (Although this one really depends on your skill level.)

3) Networks are easy to debug. Any time you find a problem with the network, just put something like print("avg_pool", avg_pool.size()) in the forward pass to check the sizes of the layers, and you will debug the network easily.

4) You can return multiple outputs from the forward method. This is pretty helpful in the Encoder-Decoder architecture, where you can return both the encoder and decoder outputs. Or in the case of an autoencoder, where you can return both the model output and the hidden-layer embedding for the data.

5) Pytorch tensors work in a very similar manner to numpy arrays. For example, I could have used the Pytorch MaxPool function to write the maxpool layer, but max_pool, _ = torch.max(h_gru, 1) works just as well.

6) You can set up different layers with different initialization schemes, something that is much harder to do in Keras. For example, in the network below I have changed the initialization scheme of my LSTM layer: it uses different initializations for the biases, the input-layer weights, and the hidden-layer weights.

7) And wait until you see the training loop in Pytorch! You will be amazed at the sort of control it provides.

Now the same model in Pytorch will look something like this. Do go through the code comments to understand more about how to port it.

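Something along these lines. It is a simplified sketch rather than my exact competition model: the dimensions are illustrative, the attention layer is again left out, and the comments point back to the numbered reasons above.

```python
import torch
import torch.nn as nn

class BiLSTM(nn.Module):
    def __init__(self, embedding_matrix, hidden_size=64, meta_features=10):
        super(BiLSTM, self).__init__()
        vocab_size, embed_size = embedding_matrix.shape
        # Frozen pretrained embeddings, as in the Keras version.
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.embedding.weight = nn.Parameter(
            torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False

        self.lstm = nn.LSTM(embed_size, hidden_size,
                            bidirectional=True, batch_first=True)
        self.gru = nn.GRU(hidden_size * 2, hidden_size,
                          bidirectional=True, batch_first=True)

        # Point 6: per-parameter initialization for the LSTM, with
        # different schemes for input weights, hidden weights and biases.
        for name, param in self.lstm.named_parameters():
            if 'weight_ih' in name:
                nn.init.xavier_uniform_(param)
            elif 'weight_hh' in name:
                nn.init.orthogonal_(param)
            elif 'bias' in name:
                nn.init.zeros_(param)

        # 2*hidden from avg_pool + 2*hidden from max_pool + meta features.
        self.linear = nn.Linear(hidden_size * 4 + meta_features, 64)
        self.relu = nn.ReLU()
        self.out = nn.Linear(64, 1)  # returns logits; sigmoid lives in the loss

    def forward(self, x, meta):
        h_embedding = self.embedding(x)
        h_lstm, _ = self.lstm(h_embedding)
        h_gru, _ = self.gru(h_lstm)

        # Point 5: tensors behave like numpy arrays, so plain torch ops
        # can stand in for pooling layers.
        avg_pool = torch.mean(h_gru, 1)
        max_pool, _ = torch.max(h_gru, 1)

        # Point 3: sprinkle prints like this to debug layer sizes.
        # print("avg_pool", avg_pool.size())

        conc = torch.cat((avg_pool, max_pool, meta), 1)
        conc = self.relu(self.linear(conc))
        return self.out(conc)
```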

Hope you are still with me. One thing I would like to emphasize here is that you need to code something up in Pytorch yourself to really understand how it works.

And know that once you do, you will be glad you put in the effort. On to the next section.

A Highly Customizable Training Loop


In the above section, I wrote that you would be amazed once you saw the training loop. That was an exaggeration.

On the first try, you will be a little baffled/confused.

But as soon as you read through the loop more than once, it will make a lot of intuitive sense. Once again, read the comments and the code to gain a better understanding.

This training loop does k-fold cross-validation on your training data and outputs out-of-fold train_preds for the train set and test_preds for the test data, averaged over the fold runs.

I apologize if the flow looks like something straight out of a Kaggle competition, but if you understand it you will be able to create a training loop for your own workflow. And that is the beauty of Pytorch.

So a brief summary of this loop is as follows:

  • Create stratified splits using the train data
  • Loop through the splits.
  • Convert your train and CV data to tensors and load them to the GPU using a command like X_train_fold = torch.tensor(x_train[train_idx.astype(int)], dtype=torch.long).cuda()
  • Load the model onto the GPU using model.cuda()
  • Define the loss function, scheduler, and optimizer
  • Create train_loader and valid_loader to iterate through batches.
  • Start running epochs. In each epoch:
      • Set the model mode to train using model.train().
      • Go through the batches in train_loader and run the forward pass
      • Run a scheduler step to change the learning rate
      • Compute the loss
      • Set the existing gradients in the optimizer to zero
      • Backpropagate the losses through the network
      • Clip the gradients
      • Take an optimizer step to change the weights in the whole network
      • Set the model mode to eval using model.eval().
      • Get predictions for the validation data from valid_loader and store them in valid_preds_fold
      • Calculate the loss and print it
  • After all the epochs are done, predict on the test data and store the predictions. These predictions are averaged at the end of the split loop to get the final test_preds
  • Get out-of-fold (OOF) predictions for the train set using train_preds[valid_idx] = valid_preds_fold
  • These OOF predictions can then be used to calculate the local CV score for your model.
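
Here is the loop in condensed sketch form, wired up to the BiLSTM model sketched above. The hyperparameters and the StepLR scheduler (stepped once per epoch) are placeholder choices; my kernel actually used CyclicLR, which steps every batch. I also assume x_train, y_train, meta_train, x_test, meta_test and embedding_matrix already exist as numpy arrays.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import DataLoader, TensorDataset

n_splits, n_epochs, batch_size = 5, 4, 512
train_preds = np.zeros(len(x_train))   # out-of-fold predictions
test_preds = np.zeros(len(x_test))     # averaged over folds

x_test_t = torch.tensor(x_test, dtype=torch.long).cuda()
meta_test_t = torch.tensor(meta_test, dtype=torch.float32).cuda()
test_loader = DataLoader(TensorDataset(x_test_t, meta_test_t),
                         batch_size=batch_size, shuffle=False)

skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(x_train, y_train)):
    # Convert this fold's data to tensors and move them to the GPU.
    X_tr = torch.tensor(x_train[train_idx], dtype=torch.long).cuda()
    m_tr = torch.tensor(meta_train[train_idx], dtype=torch.float32).cuda()
    y_tr = torch.tensor(y_train[train_idx, None], dtype=torch.float32).cuda()
    X_va = torch.tensor(x_train[valid_idx], dtype=torch.long).cuda()
    m_va = torch.tensor(meta_train[valid_idx], dtype=torch.float32).cuda()
    y_va = torch.tensor(y_train[valid_idx, None], dtype=torch.float32).cuda()

    model = BiLSTM(embedding_matrix).cuda()   # the model sketched above
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

    train_loader = DataLoader(TensorDataset(X_tr, m_tr, y_tr),
                              batch_size=batch_size, shuffle=True)
    valid_loader = DataLoader(TensorDataset(X_va, m_va, y_va),
                              batch_size=batch_size, shuffle=False)

    for epoch in range(n_epochs):
        model.train()
        avg_loss = 0.0
        for x_batch, m_batch, y_batch in train_loader:
            y_pred = model(x_batch, m_batch)          # forward pass
            loss = loss_fn(y_pred, y_batch)           # compute loss
            optimizer.zero_grad()                     # clear stale gradients
            loss.backward()                           # backpropagate
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradients
            optimizer.step()                          # update the weights
            avg_loss += loss.item() / len(train_loader)
        scheduler.step()                              # change the learning rate

        model.eval()
        valid_preds_fold = np.zeros(len(valid_idx))
        avg_val_loss = 0.0
        with torch.no_grad():
            for i, (x_batch, m_batch, y_batch) in enumerate(valid_loader):
                y_pred = model(x_batch, m_batch)
                avg_val_loss += loss_fn(y_pred, y_batch).item() / len(valid_loader)
                valid_preds_fold[i * batch_size:(i + 1) * batch_size] = \
                    torch.sigmoid(y_pred).cpu().numpy()[:, 0]
        print(f'Fold {fold} epoch {epoch}: loss={avg_loss:.4f} '
              f'val_loss={avg_val_loss:.4f}')

    train_preds[valid_idx] = valid_preds_fold  # out-of-fold predictions
    with torch.no_grad():                      # test preds, averaged over folds
        for i, (x_batch, m_batch) in enumerate(test_loader):
            y_pred = torch.sigmoid(model(x_batch, m_batch))
            test_preds[i * batch_size:(i + 1) * batch_size] += \
                y_pred.cpu().numpy()[:, 0] / n_splits
```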

But Why? Why so much code?

Okay, I get it. That was probably a handful. What you could do with a simple .fit in Keras takes a lot of code to accomplish in Pytorch.

But understand that you get a lot of power too. Some use cases for you to understand:

  • While Keras gives you prespecified schedulers like ReduceLROnPlateau (and it is a task to write custom ones), in Pytorch you can experiment like crazy, as the sketch after this list shows. If you know how to write Python, you are going to get along just fine.
  • Want to change the structure of your model between epochs? Yes, you can do that, for example changing the input size for convolutional networks on the fly.
  • And much more. Only your imagination will stop you.
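
For instance, a warmup-then-cosine-decay schedule is just a few lines of plain Python handed to LambdaLR. The schedule itself is made up for illustration, and the model is a stand-in:

```python
import math
import torch

def my_schedule(epoch, warmup=2, total=10):
    # Made-up schedule: linear warmup, then cosine decay.
    if epoch < warmup:
        return (epoch + 1) / warmup
    progress = (epoch - warmup) / max(1, total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

model = torch.nn.Linear(10, 1)  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=my_schedule)

for epoch in range(10):
    # ... train one epoch here ...
    scheduler.step()  # the returned value multiplies the base lr
```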

Wanna Run it Yourself?

So another small confession here.

The code above will not run as is, as there are some code artifacts that I have not shown here. I did this in favor of making the post more readable. For example, the seed_everything (sketched earlier), MyDataset and CyclicLR (from Jeremy Howard's course) functions and classes used in the code are not included with Pytorch. But fret not, my friend.
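
That said, a MyDataset wrapper can be as simple as the sketch below. It is a guess at the shape of such a helper rather than the exact class from my kernel: it wraps another dataset so each item also carries its row index, which makes it easy to write predictions back to the right rows.

```python
from torch.utils.data import Dataset

class MyDataset(Dataset):
    # Sketch: wrap a dataset so each item also returns its row index.
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        return index, self.dataset[index]
```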

I have tried to write a Kaggle Kernel with the whole running code. You can see the code here and include it in your projects.

Take a look at the How to Win a Data Science Competition: Learn from Top Kagglers course in the Advanced Machine Learning Specialization by Kazanova. This course covers a lot of intuitive ways to improve your models. Definitely recommended.

I am going to write more such posts in the future. Let me know what you think about the series. Follow me on Medium or subscribe to my blog to be informed about them. As always, I welcome feedback and constructive criticism and can be reached on Twitter @mlwhiz.

