

Ray and RLlib for Fast and Parallel Reinforcement Learning
source link: https://towardsdatascience.com/ray-and-rllib-for-fast-and-parallel-reinforcement-learning-6d31ee21c96c?gi=90f2b1ae232b

An intro tutorial to RL training with Ray
Apr 8 · 5 min read
Ray is more than just a library for multi-processing; Ray’s real power comes from the RLlib and Tune libraries that leverage this capability for reinforcement learning. It enables you to scale training to large, distributed servers, or just take advantage of the parallelization properties to train more efficiently on your own laptop. The choice is yours.
TL;DR
We show how to train a custom reinforcement learning environment that has been built on top of OpenAI Gym using Ray and RLlib.
A Gentle RLlib Tutorial
Once you’ve installed Ray and RLlib with pip install ray[rllib], you can train your first RL agent with a single command in the command line:
rllib train --run=A2C --env=CartPole-v0
This will tell your computer to train using the Advantage Actor Critic (A2C) algorithm on the CartPole environment. A2C and a host of other algorithms are already built into the library, meaning you don’t have to worry about the details of implementing them yourself.
This is really great, particularly if you’re looking to train using a standard environment and algorithm. If you want to do more, however, you’re going to have to dig a bit deeper.
RLlib Agents
The various algorithms you can access are available through ray.rllib.agents. Here, you can find a long list of different implementations in both PyTorch and TensorFlow to begin playing with.
These are all accessed through the algorithm’s Trainer object. For example, if you want to use A2C as shown above, you can run:
import ray
from ray.rllib import agents

ray.init()
trainer = agents.a3c.A2CTrainer(env='CartPole-v0')
If you want to try a DQN instead, you can call:
trainer = agents.dqn.DQNTrainer(env='CartPole-v0') # Deep Q Network
All the algorithms follow the same basic construction: the lowercase algorithm abbreviation names the module, and the uppercase abbreviation followed by “Trainer” names the class.
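For instance, the same pattern holds across the other built-in agents; here is a quick sketch (the environments are just illustrative picks, and DDPG needs a continuous action space, hence Pendulum):

# Each line stands alone; you would normally build just one trainer
trainer = agents.ppo.PPOTrainer(env='CartPole-v0')     # Proximal Policy Optimization
trainer = agents.a3c.A3CTrainer(env='CartPole-v0')     # Asynchronous Advantage Actor Critic
trainer = agents.ddpg.DDPGTrainer(env='Pendulum-v0')   # Deep Deterministic Policy Gradient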
Changing hyperparameters is as easy as passing a dictionary of configurations to the config argument. A quick way to see what’s available to you is to call trainer.config to print out the options for your chosen algorithm. A few examples include:
- fcnet_hiddens controls the number of hidden units and hidden layers (passed as a list inside a dictionary called model, which is nested in config; I’ll show an example below).
- vf_share_layers determines whether you have one neural network with multiple output heads or separate value and policy networks.
- num_workers sets the number of processors for parallelization.
- num_gpus sets the number of GPUs you will use.
There are lots of others to set and customize, from the network (typically located in the model dictionary) to various callbacks and multi-agent settings.
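As a quick illustration, here’s one way to inspect those defaults, assuming the A2C trainer created earlier is still in scope:

from pprint import pprint

pprint(trainer.config)           # full dictionary of options for the chosen algorithm
pprint(trainer.config['model'])  # just the network settings (fcnet_hiddens, etc.)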
Example: Training PPO for CartPole
I want to turn to a quick example to get you started and show how this works with a standard OpenAI Gym environment.
Open your IDE or text editor of choice and try the following:
import ray
from ray.rllib import agents

ray.init()  # Skip or set to ignore if already called
config = {'gamma': 0.9,
          'lr': 1e-2,
          'num_workers': 4,
          'train_batch_size': 1000,
          'model': {
              'fcnet_hiddens': [128, 128]
          }}
trainer = agents.ppo.PPOTrainer(env='CartPole-v0', config=config)
results = trainer.train()
The config dictionary changed the defaults for the values above. You can see how we can influence the number of layers and nodes in the network by nesting a dictionary called model in the config dictionary. Once we've specified our configuration, calling the train() method on our trainer object will send the environment to the workers and begin collecting data. Once enough data is collected (1,000 samples according to our settings above), the model will update and send the output to a new dictionary called results.
If you want to run multiple updates, you can set up a training loop to continuously call the train() method for a given number of iterations or until some other threshold has been reached.
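Here’s a minimal sketch of such a loop, continuing with the PPO trainer from above (the iteration count and reward cutoff are arbitrary choices for illustration):

for i in range(20):
    results = trainer.train()
    print('Iteration {}: mean reward = {:.1f}'.format(
        i, results['episode_reward_mean']))
    if results['episode_reward_mean'] >= 195:  # rough "solved" score for CartPole-v0
        break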
Customizing your RL Environment
OpenAI Gym and all of its extensions are great, but if you’re looking for novel applications of RL or to use it in your company, you’re going to need to work with a custom environment.
Unfortunately, the current version of Ray (0.9) explicitly states that it is not compatible with the gym registry. Thankfully, it isn’t too difficult to put together a helper function to get custom gym environments to work with Ray.
Let’s assume you have some environment called MyEnv-v0 that is properly registered so that you can invoke it with gym.make('MyEnv-v0') like you would with any other gym environment (if you haven't already, you can check out my step-by-step process on setting up environments here).
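For reference, registration along those lines typically looks something like this; the module path and class name below are just placeholders matching the helper function that follows:

from gym.envs.registration import register

# Make gym.make('MyEnv-v0') resolve to your custom class
register(
    id='MyEnv-v0',
    entry_point='custom_gym.envs.custom_env:CustomEnv0',
)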
To call that custom environment from Ray, you need to wrap it in a function that will return the environment class, not an instantiated object. The best way I’ve found to do this is with an env_creator() helper function:
def env_creator(env_name):
    if env_name == 'MyEnv-v0':
        from custom_gym.envs.custom_env import CustomEnv0 as env
    elif env_name == 'MyEnv-v1':
        from custom_gym.envs.custom_env import CustomEnv1 as env
    else:
        raise NotImplementedError
    return env
From here, you can set up your agent and train it on this new environment with only a slight modification to the trainer.
env_name = 'MyEnv-v0'
config = {
    # Whatever config settings you'd like...
}
trainer = agents.ppo.PPOTrainer(
    env=env_creator(env_name),
    config=config)
max_training_episodes = 10000
while True:
    results = trainer.train()
    # Enter whatever stopping criterion you like
    if results['episodes_total'] >= max_training_episodes:
        break
    print('Mean Rewards:\t{:.1f}'.format(results['episode_reward_mean']))
Note that above, we pass the environment through the env_creator function; everything else remains the same.
Tips for Working with Custom Environments
If you’re used to building your own models from the environment to the networks and algorithms, then there are some features you need to be cognizant of when working with Ray.
First, Ray adheres to the OpenAI Gym API, meaning that your environments need to have step() and reset() methods as well as carefully specified observation_space and action_space attributes. I had always been a bit lazy with respect to these last two, because I could simply define my network input and output dimensions and not have to regard the range of input values, for example, that the gym.spaces methods require. Ray checks all the inputs to ensure that they fall within the specified range (I spent too much time debugging runs before realizing that the low value on my gym.spaces.Box was set to 0 while the environment was returning values on the order of -1e-17, causing it to crash).
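To make that concrete, here’s a minimal sketch of an environment skeleton with carefully specified spaces; the dynamics and reward below are purely illustrative placeholders:

import gym
import numpy as np
from gym import spaces


class MyCustomEnv(gym.Env):
    def __init__(self, config=None):
        # Make sure low/high actually bound everything step() can return
        self.observation_space = spaces.Box(
            low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        self.steps += 1
        obs = self.observation_space.sample()  # placeholder dynamics
        reward = float(action)                 # placeholder reward
        done = self.steps >= 10
        return obs, reward, done, {}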
When setting up your action and observation spaces, stick to Box, Discrete, and Tuple. The MultiDiscrete and MultiBinary spaces don't work (currently) and will cause the run to crash. Instead, wrap Box or Discrete spaces in the Tuple function.
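For example, where you might otherwise reach for MultiDiscrete, something like this works instead:

import numpy as np
from gym import spaces

# Instead of spaces.MultiDiscrete([3, 3]), combine Discrete spaces in a Tuple
action_space = spaces.Tuple((spaces.Discrete(3), spaces.Discrete(3)))

# The same trick stacks Box and Discrete spaces together
observation_space = spaces.Tuple((
    spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32),
    spaces.Discrete(2),
))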
Take advantage of custom pre-processing when you can. Ray makes assumptions about your state inputs, which usually work just fine, but it also enables you to customize the pre-processing steps which may help your training.
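If you do go that route, the RLlib docs describe registering a custom preprocessor through the model catalog; a rough sketch follows (the class, the clipping logic, and the registration name are hypothetical, and the exact API can shift between Ray versions):

import numpy as np
from ray.rllib.models import ModelCatalog
from ray.rllib.models.preprocessors import Preprocessor


class ClipPreprocessor(Preprocessor):
    # Hypothetical preprocessor that clips observations to [-1, 1]
    def _init_shape(self, obs_space, options):
        return obs_space.shape  # output shape after preprocessing

    def transform(self, observation):
        return np.clip(observation, -1.0, 1.0)


ModelCatalog.register_custom_preprocessor('clip_prep', ClipPreprocessor)
config = {'model': {'custom_preprocessor': 'clip_prep'}}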
Going Beyond RLlib
Ray can greatly speed up training and make it far easier to get started with deep reinforcement learning. RLlib isn’t the end (we’ve only scratched the surface of its capabilities here anyway); it has a powerful cousin called Tune, which enables you to adjust the hyperparameters of your model and manages all of the important data collection and back-end work for you. Make sure you check back for updates on how to bring this library into your work process.