
TensorFlow 2.1: A How-To

source link: https://mc.ai/tensorflow-2-1-a-how-to/
Let's go deep! Amazing picture by Dids

The Keras mode, the eager mode and the graph mode

If you, just like me, are anything like a normal person, you have probably experienced how sometimes you get so caught up in the development of your application that it is hard to find a moment to stop and think about whether you are doing things in the most efficient way: are we using the right tools? Which framework suits my use case best? Is this approach extensible? Are we keeping scalability in mind?

This is especially true in the AI field. We all know AI is a rapidly moving field. New research is published every day. There is a huge fight between the major AI frameworks, which are being developed at a high pace. New hardware architectures, chips and optimizations are released to support the growing adoption of AI… However, despite all the bells and whistles, sometimes you need to stop and reconsider.

When is a good moment to stop and reconsider? Only you will know. For me, that moment came very recently. I have been using Keras and TensorFlow 1.x (TF1) both at work and for my personal projects since I started in this field. I am completely in love with the high-level approach of the Keras library and the lower-level approach of TensorFlow that lets you change things under the hood when you need more customization.

Although I have been a huge fan of this Keras-TensorFlow marriage, there has always been one very specific downside that kept this couple far from idyllic: the debugging features. As you already know, in TensorFlow there is this paradigm of defining the computational graph first, compiling it afterwards (or moving it to the GPU) and then running it very efficiently. This paradigm is very nice and makes sense technically speaking, but once you have the model on the GPU it is almost impossible to debug.

This is why, after a while, and coinciding with the fact that it has been roughly a year since TensorFlow 2.0 was published in its alpha version, I decided to take a shot at TensorFlow 2.1 (I could have started with TF2.0, but we all know we love new software) and share with you how it went.

TensorFlow 2.1

The sad truth is that I had a hard time figuring out how I was supposed to use this new TensorFlow version, the famous 2.1 stable release. I know, there are plenty of tutorials, notebooks and code gists… However, I found that it was not the programming that posed the difficulty, since in the end it is just Python, but the paradigm shift. To put it simply: TensorFlow 2 programming differs from TensorFlow 1 in the same way object-oriented programming differs from functional programming.

After doing some experiments I found that in TensorFlow 2.1 there are 3 approaches for building models:

tf.keras (the Keras mode)
eager execution (the eager mode)
tf.function (the graph mode)

So enough of the boring stuff! Show me the code!

The Keras Mode

This is the standard usage we are all used to: plain Keras with a custom loss function, in this case a squared-error loss. The network is three Dense layers deep.

# The network: three Dense layers
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

x = Input(shape=[20])
h = Dense(units=20, activation='relu')(x)
h = Dense(units=10, activation='relu')(h)
y = Dense(units=1)(h)
model = Model(inputs=x, outputs=y)

The objective here is to teach the network to sum a vector of 20 elements. So we feed it a dataset of shape [10000 x 20]: 10,000 samples with 20 features each (the elements to sum). Here it is:

# Training and test samples: random vectors and their sums
import tensorflow as tf

train_samples = tf.random.normal(shape=(10000, 20))
train_targets = tf.reduce_sum(train_samples, axis=-1)
test_samples = tf.random.normal(shape=(100, 20))
test_targets = tf.reduce_sum(test_samples, axis=-1)
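
For completeness, here is a minimal sketch of how the Keras-mode training could be wired up. The name loss_compute matches the snippet used later in the post; its squared-error body and the choice of the Adam optimizer are my assumptions, since the original gist did not survive the scrape.

# Custom squared-error loss (body is an assumption; name from a later snippet)
def loss_compute(y_true, y_pred):
    y_true = tf.reshape(y_true, tf.shape(y_pred))  # guard against rank mismatch
    return tf.reduce_mean(tf.square(y_true - y_pred))

# Compile and fit, plain Keras style
model.compile(optimizer=tf.keras.optimizers.Adam(), loss=loss_compute)
model.fit(train_samples, train_targets,
          epochs=10,
          validation_data=(test_samples, test_targets))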

If we run this example, we get the usual nice-looking Keras output:

Epoch 1/10
10000/10000 [==============================] - 10s 1ms/sample - loss: 1.6754 - val_loss: 0.0481
Epoch 2/10
10000/10000 [==============================] - 10s 981us/sample - loss: 0.0227 - val_loss: 0.0116
Epoch 3/10
10000/10000 [==============================] - 10s 971us/sample - loss: 0.0101 - val_loss: 0.0070

So what's happening here? you might ask. Well, nothing, just a Keras toy example training at 10s per epoch (on an NVIDIA GTX 1080 Ti). What about the programming paradigm? The same as before, just like in TF1.x: you define the graph, then you run it by calling keras.models.Model.fit. And the debugging features? The same as before… none. You cannot even set a simple breakpoint in the loss function.

After running this, you might be wondering a very obvious question: where on earth are all the nice features the TensorFlow 2 release promised? And you would be right. If the integration of the Keras package just means not having to install an additional package… what is the advantage?

On top of that, there is an even more important question: where are the famous, long-awaited debugging features? Fortunately, this is where the eager mode comes to the rescue.

The Eager Mode

What if I told you there is a way to build your models interactively, with access to all the operations at runtime? If you are shaking with excitement, it means you have suffered the deep pain of a runtime error in a random batch after 10 epochs… Yes, I know, I have been there too; we can start calling ourselves brothers in arms after those battles.

Well, yes, this is the operation mode you were looking for. In this mode all tensor operations are interactive: you can set a breakpoint and access any of the intermediate tensor variables. However, this flexibility comes at a cost: more explicit code. Let's take a look:
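
The code gist embedded in the original post did not survive the scrape, so here is a minimal sketch of what such an eager training loop looks like, reusing the model and the hypothetical loss_compute from above:

# Eager training loop: every operation runs interactively
optimizer = tf.keras.optimizers.Adam()
metric = tf.keras.metrics.Mean()

dataset = tf.data.Dataset.from_tensor_slices(
    (train_samples, train_targets)).batch(32)

for epoch in range(10):
    metric.reset_states()
    for x, y_true in dataset:
        # Forward pass, recorded so gradients can be computed later
        with tf.GradientTape() as tape:
            y_pred = model(x)
            loss = loss_compute(y_true, y_pred)
        # Backward pass: compute the gradients and update the weights
        gradients = tape.gradient(loss, model.trainable_weights)
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
        metric(loss)
    print('Epoch %d - Training Loss: %.3f' % (epoch + 1, metric.result().numpy()))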

The first thing that might come to mind after reading the code could be: that is a lot of code just to do a model.compile and a model.fit. Yes, true. But on the other hand, you now control everything that used to happen under the hood. And what was happening under the hood? The training loop.

So now things change. In this approach you can design how things are going to work from the ground up. Here are the things you can specify now:

  • Metrics: ever wanted to measure results per sample, per batch, or by any other custom statistic? No problem, we have you covered. Now you can use the good old moving average or any other custom metric based on whatever you want.
  • Loss function: ever wanted to build a crazy loss function depending on multiple parameters? Well, this is solved too: you can get as tricky as you want in the loss function definition without Keras complaining about it through its _standardize_user_data (link)
  • Gradients: you can access the gradients and define the specifics of the forward and the backward pass. Yes, finally, so please join me in a big: Hooray!

Metrics are specified with the new tf.keras.metrics API. You just take the metric you want, instantiate it and use it like this:

# Instantiate the metric
metric = tf.keras.metrics.Mean()

# Run your model to get the loss and update the metric
loss = [...]
metric(loss)

# Print the metric
print('Training Loss: %.3f' % metric.result().numpy())

The loss and the gradients are computed in the forward and the backward pass respectively. In this approach, the forward pass must be recorded by tf.GradientTape. The tf.GradientTape will track (or "tape") all the tensor operations done in the forward pass, so it can compute the gradients in the backward pass. Putting it another way: in order to run backward, you must remember the path you took forward.

# Forward pass: needs to be recorded by the gradient tape
with tf.GradientTape() as tape:
    y_pred = model(x)
    loss = loss_compute(y_true, y_pred)

# Backward pass: compute the gradients and update the weights
gradients = tape.gradient(loss, model.trainable_weights)
optimizer.apply_gradients(zip(gradients, model.trainable_weights))

This is pretty straightforward: in the forward pass you run your prediction and see how well you did by computing a loss. In the backward pass you check how your weights affected that loss by computing the gradients and then try to minimize the loss by updating the weights (with the help of an optimizer).

You may also notice in the code that at the end of each epoch the validation loss is computed (by running just the forward pass, without updating the weights).
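
As a sketch, and under the same assumptions as before, that end-of-epoch validation pass could look like this:

# Validation: forward pass only, no tape and no weight updates
val_dataset = tf.data.Dataset.from_tensor_slices(
    (test_samples, test_targets)).batch(32)

val_metric = tf.keras.metrics.Mean()
for x, y_true in val_dataset:
    y_pred = model(x)
    val_metric(loss_compute(y_true, y_pred))
print('Validation Loss: %.3f' % val_metric.result().numpy())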

Well, let's see how this compares to the previous approach (I have trimmed the output a bit so it fits here):

Epoch 1:
Loss: 1.310: 100%|███████████| 10000/10000 [00:41<00:00, 239.70it/s]
Epoch 2:
Loss: 0.018: 100%|███████████| 10000/10000 [00:41<00:00, 240.21it/s]
Epoch 3:
Loss: 0.010: 100%|███████████| 10000/10000 [00:41<00:00, 239.28it/s]

What happened? Have you noticed? It took 41s per epoch on the same machine; that is a 4x increase… And this is just a dummy model. Can you imagine how this scales up for a real model such as RetinaNet, YOLO or Mask R-CNN?

Luckily, the nice TensorFlow guys were aware of this, and implemented the graph mode.

The Graph Mode

The graph mode (from AutoGraph and tf.function) is sort of a mixed mode between the two previous ones. You can get a sense of what it is from the official TensorFlow guides, but I found those to be a bit confusing, so I will explain it in my own words.

If the Keras mode was about defining the graph and running it on the GPU later, and the eager mode was about executing each step interactively, the graph mode lets you code as if you were in eager mode but run the training almost as fast as if you were in Keras mode (so yes, on the GPU).

The only change with respect to the eager mode is that in the graph mode you break the code up into small functions and annotate those functions with @tf.function. Let's take a look to see how things change:
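
That gist is also missing from the scraped page; a minimal sketch of the refactor, under the same assumptions as the eager-mode example above, could be:

# Training step: forward and backward pass compiled into a single graph
@tf.function
def train_step(x, y_true):
    with tf.GradientTape() as tape:
        y_pred = model(x)
        loss = loss_compute(y_true, y_pred)
    gradients = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))
    return loss

# Validation step: forward pass only, also compiled
@tf.function
def val_step(x, y_true):
    return loss_compute(y_true, model(x))

# The outer loop still runs eagerly
for epoch in range(10):
    metric.reset_states()
    for x, y_true in dataset:
        metric(train_step(x, y_true))
    print('Epoch %d - Training Loss: %.3f' % (epoch + 1, metric.result().numpy()))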

Here you can see how the forward and backward pass computations have been refactored into two functions annotated with the @tf.function decorator.

So what is really happening here? Easy. Whenever you annotate a function with the @tf.function decorator, you are "compiling" those operations into a TensorFlow graph, much the same way Keras does. So by annotating your functions you tell TensorFlow to run those operations as an optimized graph on the GPU.

Under the hood, the function is parsed by AutoGraph, tf.autograph. AutoGraph takes the function's inputs and outputs and generates a TensorFlow graph from them; that is, it parses the operations that produce the outputs from the inputs into a TensorFlow graph. That generated graph runs very efficiently on the GPU.
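
If you are curious, you can even inspect the Python source AutoGraph generates. This uses the documented tf.autograph.to_code call, shown here on the hypothetical train_step from the sketch above:

# Print the code AutoGraph generated for the decorated function
print(tf.autograph.to_code(train_step.python_function))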

This is why it is sort of a mixed mode: all the operations run interactively except those annotated with the @tf.function decorator.

This also means that you will have access to all the variables and tensors except those inside functions decorated with @tf.function, for which you only have access to the inputs and outputs. This approach establishes a very clear way of debugging, in which you can start developing interactively in eager mode and then, when your model is ready, push it to production performance with @tf.function. Sounds good, right? Let's see how it goes:

Epoch 1:
Loss: 1.438: 100%|████████████| 10000/10000 [00:16<00:00, 612.3it/s]
Epoch 2:
Loss: 0.015: 100%|████████████| 10000/10000 [00:16<00:00, 615.0it/s]
Epoch 3:
Loss: 0.009: 72%|████████████| 7219/10000 [00:11<00:04, 635.1it/s]

Well, an amazing 16s per epoch. You might think that it is not as fast as the Keras mode, but on the other hand you get all the debugging features and very close performance.

Conclusions

If you have followed the whole article, it won't come as a surprise that in the end this all comes down to the very old software problem: flexibility or efficiency? Eager mode or Keras mode? Well, why settle? Use the graph mode!

In my view, the TensorFlow guys have done excellent work providing more flexibility for us developers without compromising too much on efficiency. So from where I stand, I can only say: bravo.

