
Reinforcement Learning with TensorFlow Agents — Tutorial

source link: https://towardsdatascience.com/reinforcement-learning-with-tensorflow-agents-tutorial-4ac7fa858728?gi=f7aeed3f8e81

Try TF-Agents for RL with this simple tutorial, published as a Google Colab notebook so you can run it directly from your browser.


Some weeks ago, I wrote an article naming different frameworks you can use to implement Reinforcement Learning (RL) in your projects, showing the ups and downs of each of them and wondering if any of them would rule them all at some point. Since then, I’ve come to know TF-Agents, a library for RL based on TensorFlow with the full support of its community (note that TF-Agents is not an official Google product, but it is published as a repository under the official TensorFlow account on GitHub).

I am currently using TF-Agents on a project, and it has been easy to get started with thanks to its good documentation, including tutorials. It is updated regularly and has lots of contributors, which makes me think we may well see TF-Agents become the standard framework for implementing RL in the near future. Because of this, I’ve decided to write this article to give you a quick introduction, so you can also benefit from this library. I have published all the code used here as a Google Colab notebook, so you can easily run it online.

You can find the GitHub repository with all the code and documentation for TF-Agents here. You won’t need to clone it, but it’s always useful to have the official GitHub for reference. The following example partially follows one of their tutorials (1_dqn_tutorial), but I have simplified it further and also used it for playing Atari games in this article. Let’s get hands-on.

Installing TF Agents and Dependencies

As mentioned, TF-Agents runs on TensorFlow, more specifically TensorFlow 2.2.0. In addition, you will need to install the following packages if you don’t have them already:

pip install tensorflow==2.2.0
pip install tf-agents

Implementing a DQN Agent for CartPole

We will implement a DQN agent (Mnih et al. 2015) and use it for CartPole, a classic control problem. If you would like to solve something more exciting, like an Atari game, you just need to swap in the environment name of your choice from all the available OpenAI Gym environments.
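For instance, swapping CartPole for another Gym environment is just a change of name in the loader we will use below. Here is a minimal, self-contained sketch; 'MountainCar-v0' is an arbitrary example and is not used in the rest of this tutorial:

from tf_agents.environments import suite_gym

# Load a different classic-control environment simply by changing its Gym id.
other_env = suite_gym.load('MountainCar-v0')
print(other_env.observation_spec())  # shape/dtype of the observations
print(other_env.action_spec())       # shape/dtype of the actions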

We start by doing all of the necessary imports. As you can see below, we import quite a few objects from TF-Agents. These are all things we can customize and switch in our implementation.

from __future__ import absolute_import, division, print_function

import base64
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import q_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.utils import common

Environment

CartPole environment from OpenAI Gym. [GIF from jaekookang/RL-cartpole.]

Now we go on to create our environment. In CartPole, we have a cart with a pole on top of it; the agent’s mission is to learn to keep the pole upright by moving the cart left and right. Note that we will use an environment from suite_gym, already included in TF-Agents, which is a slightly customized (and improved for use with TF-Agents) version of the OpenAI Gym environments (if you’re interested, you can check the differences with OpenAI’s implementation here). We will also use a wrapper for our environment called TFPyEnvironment, which converts the numpy arrays used for state observations, actions and rewards into TensorFlow tensors. When dealing with TensorFlow models (i.e., neural networks) we work with tensors, so this wrapper saves us the effort of converting these data ourselves.

env = suite_gym.load('CartPole-v1')
env = tf_py_environment.TFPyEnvironment(env)
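If you are curious about what the wrapped environment exposes, a quick optional check is to print its specs (the exact textual output depends on your TF-Agents version):

# Optional: inspect the (now tensor-based) specs of the wrapped environment.
print(env.time_step_spec())  # step_type, reward, discount and observation specs
print(env.action_spec())     # a single discrete action: push the cart left or right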

Agent

There are different agents available in TF-Agents: DQN, REINFORCE, DDPG, TD3, PPO and SAC. We will use DQN, as said above. One of the main parameters of the agent is its Q (neural) network, which will be used to estimate the Q-values for the actions in each step. A q_network has two compulsory parameters: input_tensor_spec and action_spec, defining the observation shape and the action shape. We can get both from our environment, so we will define our q_network as follows:

q_net = q_network.QNetwork(env.observation_spec(), 
                           env.action_spec())
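
As a purely illustrative aside, QNetwork also accepts optional arguments such as fc_layer_params to shape the hidden layers; the layer sizes below are arbitrary, and this variant is not used in the rest of the tutorial:

# Hypothetical variant: a Q network with two fully connected hidden layers.
q_net_custom = q_network.QNetwork(env.observation_spec(),
                                  env.action_spec(),
                                  fc_layer_params=(100, 50))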

There are many more parameters we can customize for our q_network, as you can see here, but for now we will go with the default ones. The agent also requires an optimizer to find the values of the q_network parameters. Let’s keep it classic and use Adam.

optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=0.001)

Finally, we define and initialize our agent with the following parameters:

  • The time_step_spec, which we get from our environment and which defines how our time steps are structured.
  • The action_spec, same as for the q_network.
  • The Q network we created before.
  • The optimizer we also created before.
  • The TD error loss function, which plays the same role as the loss function in a regular neural network.
  • The train step counter, just a rank 0 tensor (a.k.a. scalar) that counts the number of training steps we take.
train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(env.time_step_spec(),
                           env.action_spec(),
                           q_network=q_net,
                           optimizer=optimizer,
                           td_errors_loss_fn=common.element_wise_squared_loss,
                           train_step_counter=train_step_counter)

agent.initialize()

Helper Methods: Average Cumulative Return and Collecting Data

We will also need some helper methods. The first one iterates over the environment for a number of episodes, applying the policy to choose the actions to take, and returns the average cumulative reward over these episodes. This will come in handy to evaluate the policy learned by our agent. Below, we also try the method in our environment for 5 episodes.

def compute_avg_return(environment, policy, num_episodes=10):
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return
    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(env, agent.policy, 5)
returns = [avg_return]

We will also implement a method to collect data while training our agent. One of the breakthroughs of DQN was experience replay, in which we store the experiences of the agent (state, action, reward) and use them to train the Q network in batches at each step. This makes learning faster and more stable. For this, TF-Agents includes the TFUniformReplayBuffer object, which stores these experiences so they can be re-used later, so we first create this object, which we will need later on.

In the collection method, we take an environment, a policy and a buffer; we take the current time_step, formed by the state observation and reward at that point, the action the policy chooses, and then the next time_step. We then store this in the replay buffer. Note that the replay buffer stores an object called Trajectory, so we create this object from the elements named before and save it to the buffer using the method add_batch.

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
                                data_spec=agent.collect_data_spec,
                                batch_size=env.batch_size,
                                max_length=100000)

def collect_step(environment, policy, buffer):
    time_step = environment.current_time_step()
    action_step = policy.action(time_step)
    next_time_step = environment.step(action_step.action)
    traj = trajectory.from_transition(time_step,
                                      action_step,
                                      next_time_step)
    # Add trajectory to the replay buffer
    buffer.add_batch(traj)

Train Agent

We can finally train our agent. We define the number of collection steps to take in every iteration; after these steps, we train our agent in each iteration, modifying its policy. For now, let’s just use 1 step per iteration. We also define the batch size with which our Q network will be trained, and an iterator so we can iterate over the experience of the agent.

Then we gather some initial experience for our buffer and start the usual RL loop: get experience by acting on the environment, train the policy, and repeat. We additionally print the loss every 200 steps and evaluate the agent’s performance every 1000 steps.

collect_steps_per_iteration = 1
batch_size = 64
dataset = replay_buffer.as_dataset(num_parallel_calls=3,
                                   sample_batch_size=batch_size,
                                   num_steps=2).prefetch(3)
iterator = iter(dataset)
num_iterations = 20000

env.reset()

for _ in range(batch_size):
    collect_step(env, agent.policy, replay_buffer)

for _ in range(num_iterations):
    # Collect a few steps using collect_policy and save to the replay buffer.
    for _ in range(collect_steps_per_iteration):
        collect_step(env, agent.collect_policy, replay_buffer)

    # Sample a batch of data from the buffer and update the agent's network.
    experience, unused_info = next(iterator)
    train_loss = agent.train(experience).loss

    step = agent.train_step_counter.numpy()

    # Print loss every 200 steps.
    if step % 200 == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))

    # Evaluate agent's performance every 1000 steps.
    if step % 1000 == 0:
        avg_return = compute_avg_return(env, agent.policy, 5)
        print('step = {0}: Average Return = {1}'.format(step, avg_return))
        returns.append(avg_return)

Plot

We can now plot how the cumulative average reward varies as we train the agent. For this, we will use matplotlib to make a very simple plot.

iterations = range(0, num_iterations + 1, 1000)
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')

Average Return over 5 episodes of our DQN agent. You can see how the performance increases over time as the agent becomes more experienced.

Complete Code

I have shared all the code in this article as a Google Colab notebook. You can run it directly as it is; if you would like to change it, you first have to save a copy to your own Google Drive account, and then you can do whatever you like. You can also download it to run locally on your computer, if you wish.

Where to go from here

  • You can also check other environments in which to try TF-Agents (or any RL algorithm of your choice) in this other article I wrote some time ago.

As usual, thank you for reading! Let me know in the responses what you think about TF-Agents, and also if you have any questions or found any bugs in the code.

