
Deep Reinforcement Learning and Hyperparameter Tuning


Using Ray’s Tune to Optimize your Models


One of the most difficult and time-consuming parts of deep reinforcement learning is the optimization of hyperparameters. These values, such as the discount factor γ or the learning rate, can make all the difference in the performance of your agent.

Agents need to be trained to see how the hyperparameters affect performance; there's no a priori way to know whether a higher or lower value for a given parameter will improve total rewards. This translates into multiple, costly training runs to get a good agent, on top of tracking the experiments, data, and everything else associated with training the models.

Ray deals with all of this through its Tune library, which automatically handles your various models, saves the data, adjusts your hyperparameters, and summarizes the results for quick and easy reference.

TL;DR

We walk through a brief example of using Tune’s grid search features to optimize our hyperparameters.

Installing Tune

Tune is a part of the Ray project but requires a separate install, so if you haven’t installed it yet, you’ll need to run the following to get Tune to work.

pip install ray[tune]

From here, we can import our packages to train our model.

import ray
from ray import tune

Tuning your First Model

Starting with the basics, let's use Tune to train an agent to solve CartPole-v0. Tune takes a few dictionaries with various settings and criteria for training; the two most important are the config and stop arguments.

The config dictionary provides Tune with the environment it needs to run, as well as any environment-specific configurations you may want to specify. This is also where most of your hyperparameters are going to reside, but we'll get to that in a moment.

The stop dictionary tells Tune when to finish a training run or when to stop training altogether. It can be customized based on reward criteria, elapsed time, number of steps taken, and so forth. When I first started with Tune, I overlooked setting any stopping criteria and wound up letting an algorithm train for hours before realizing it. So, you can run it without this, but you may rack up a decent AWS bill if you're not careful!
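
As a small sketch, a stop dictionary can also combine several criteria, and the run ends as soon as any one of them is met (the keys below are standard result fields reported by RLlib):

# Stop a trial once it averages 195 reward per episode,
# or after 100,000 environment steps, whichever comes first
stop = {
    'episode_reward_mean': 195,
    'timesteps_total': 100000
}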

Try the code below to run the PPO algorithm on CartPole-v0 for 10,000 time steps.

ray.init(ignore_reinit_error=True)  # Start Ray (or reuse an existing session)
config = {
    'env': 'CartPole-v0'  # Environment for the agent to solve
}
stop = {
    'timesteps_total': 10000  # End training after 10,000 environment steps
}
results = tune.run(
    'PPO',  # Specify the algorithm to train
    config=config,
    stop=stop
)

With these settings, you should see a print-out of the status of your workers and memory usage, as well as the logdir where all of the data is stored for later analysis.


The console will print these values with each iteration unless the verbose argument in tune.run() is set to 0 (silent).
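
For instance, to quiet the per-iteration output you can pass verbose straight to tune.run() (a minimal sketch reusing the config and stop dictionaries from above):

results = tune.run(
    'PPO',
    config=config,
    stop=stop,
    verbose=0  # 0 = silent; higher values print more status information
)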

When training is complete, you'll get an output showing the status as terminated, the elapsed time, and the mean reward over the past 100 episodes, among other data.

Using Grid Search to Tune Hyperparameters

The real power of Tune comes when we leverage it to adjust our hyperparameters. For this, we'll turn to the grid_search function, which allows the user to specify a set of hyperparameter values for the model to test.

To do this, we just need to wrap a list of values in the tune.grid_search() function and place that in our configuration dictionary. Let's go back to our CartPole example above. We might want to see whether the learning rate makes any difference and whether a two-headed network (sharing layers between the policy and value function via vf_share_layers) provides any benefit. We can use grid_search() to run the different combinations of these, as shown below:

config = {
    'env': 'CartPole-v0',
    'num_workers': 2,
    # Compare shared vs. separate policy and value network layers
    'vf_share_layers': tune.grid_search([True, False]),
    # Compare three learning rates
    'lr': tune.grid_search([1e-4, 1e-5, 1e-6]),
}
results = tune.run(
    'PPO',
    stop={
        'timesteps_total': 100000
    },
    config=config)

Now we see an expanded status printout which contains the various trials we want to run:


As Ray kicks off each one of these, it will show the combination of hyperparameters we want to explore as well as the rewards, iterations, and elapsed time for each. When it completes, we should see TERMINATED as the status for each to show that it worked properly (otherwise it would read ERROR).


Analyzing Tune Results

The output of our tune.run() function is an analysis object that we've labeled results. We can use this to access further details about our experiments. The relevant data can be accessed via results.dataframe(), which returns a Pandas data frame containing average rewards, iterations, KL divergence, configuration settings, and more. The data frame also contains the directory where each experiment was saved (logdir), so you can get into the details of your particular run.
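
As a quick sketch, a few columns of interest can be pulled straight out of that data frame (the column names here assume an RLlib run and may vary slightly by Ray version):

df = results.dataframe()

# One row per trial: final mean reward, steps trained, and where the logs live
print(df[['episode_reward_mean', 'timesteps_total', 'logdir']])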

If you look into the logdir directory, you'll find a number of files containing the saved data from your training runs. The primary file for our purposes is progress.csv, which contains the training data from each iteration and lets you dive into the details.
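
For a quick look at what was logged, you can load a single trial's progress.csv and inspect its columns (a small sketch; the exact metrics depend on the algorithm and Ray version):

import pandas as pd

# Grab the log directory of the first trial and inspect its logged metrics
trial_dir = results.dataframe()['logdir'].iloc[0]
progress = pd.read_csv(trial_dir + '/progress.csv')
print(progress.columns.tolist())
print(progress[['timesteps_total', 'episode_reward_mean']].tail())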

For example, if we want to view the training and loss curves for our different settings, we can loop over the logdir column in our data frame, load each of the progress.csv files and plot the results.

# Plot training results
import matplotlib.pyplot as plt
import pandas as pd

colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
df = results.dataframe()

labels = []
fig, ax = plt.subplots(2, 2, figsize=(15, 15), sharex=True)
for i, path in df['logdir'].items():
    data = pd.read_csv(path + '/progress.csv')
    # Find the loss columns by name (the exact prefixes vary by Ray/RLlib version)
    tl_col = [c for c in data.columns if 'total_loss' in c][0]
    pl_col = [c for c in data.columns if 'policy_loss' in c][0]
    vl_col = [c for c in data.columns if 'vf_loss' in c][0]

    # Build legend labels from the experiment tag, e.g. 'lr=0.0001,vf_share_layers=True'
    lr = data['experiment_tag'][0].split('=')[1].split(',')[0]
    layers = data['experiment_tag'][0].split('=')[-1]
    labels.append('LR={}; Shared Layers={}'.format(lr, layers))

    ax[0, 0].plot(data['timesteps_total'],
                  data['episode_reward_mean'], c=colors[i],
                  label=labels[-1])
    ax[0, 1].plot(data['timesteps_total'],
                  data[tl_col], c=colors[i],
                  label=labels[-1])
    ax[1, 0].plot(data['timesteps_total'],
                  data[pl_col], c=colors[i],
                  label=labels[-1])
    ax[1, 1].plot(data['timesteps_total'],
                  data[vl_col], c=colors[i],
                  label=labels[-1])

ax[0, 0].set_ylabel('Mean Rewards')
ax[0, 0].set_title('Training Rewards by Time Step')
ax[0, 0].legend(labels=labels, loc='upper center',
                ncol=3, bbox_to_anchor=[0.75, 1.2])
ax[0, 1].set_title('Total Loss by Time Step')
ax[0, 1].set_ylabel('Total Loss')
ax[0, 1].set_xlabel('Time Step')
ax[1, 0].set_title('Policy Loss by Time Step')
ax[1, 0].set_ylabel('Policy Loss')
ax[1, 0].set_xlabel('Time Step')
ax[1, 1].set_title('Value Loss by Time Step')
ax[1, 1].set_ylabel('Value Loss')
ax[1, 1].set_xlabel('Time Step')
plt.show()


Beyond Grid Search

There are far more tuning options available in Tune. If you want to see what you can tweak, take a look at the documentation for your particular algorithm. Moreover, Tune enables different approaches to hyperparameter optimization. Grid search can be slow, so by changing just a few options you can use Bayesian optimization, HyperOpt, and others. Finally, Tune makes population-based training (PBT) easy, allowing multiple agents to scale across various machines. All of this will be covered in future posts!
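
As a taste of what lies beyond grid search, here is a minimal sketch of random search over the learning rate using Tune's sampling utilities; it assumes tune.loguniform and the num_samples argument are available in your Ray version (search-algorithm integrations such as HyperOpt are configured separately):

config = {
    'env': 'CartPole-v0',
    'num_workers': 2,
    'lr': tune.loguniform(1e-6, 1e-4),  # Sample learning rates log-uniformly
}
results = tune.run(
    'PPO',
    config=config,
    stop={'timesteps_total': 100000},
    num_samples=10  # Number of randomly sampled configurations to try
)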

