
Deep Reinforcement Learning with RLlib and TensorFlow for Price Optimization


Deep Learning has made serious inroads into Reinforcement Learning. Deep Reinforcement Learning (DRL) has been used successfully for playing Atari games. Beyond games, Reinforcement Learning (RL) is applicable to any decision making problem under uncertain conditions, e.g. autonomous vehicles and business decision making problems. Classic Reinforcement Learning solutions become intractable when faced with a large dimensional state space and action space. Deep Learning shines on problems with large input and output dimensions, so it was natural to apply it to Reinforcement Learning for higher dimensional problems.

In this post, we will go through a DRL based solution for price optimization, which is a business decision making problem. The solution is based on the excellent DRL library called RLlib, which uses TensorFlow and PyTorch under the hood; you can choose between the two. The Python implementation for price optimization is available in my open source GitHub repository avenir.

Deep Reinforcement Learning

In Reinforcement Learning, there is an environment characterized by a state. In a given state, an agent takes some action based on some policy. After the action is taken, with some delay a reward is obtained from the environment and the state changes. For example, in a board game, the state is the board layout and the action is the move to make.

This process can go on indefinitely or until some goal is met. The goal of the learner is to maximize the long term cumulative discounted reward. For more technical details, please follow the excellent links provided in this post.

There are various approaches for solving RL problems. One family of algorithms involves learning a function called the quality function or Q function, Q(s, a). Given a state s and action a, the function returns the expected long term cumulative discounted reward of taking action a in state s. Here is a list of the main DRL approaches:

  1. Value based (DQN belongs to this category)
  2. Policy gradient based
  3. Model based

In classic RL, the Q function is obtained through dynamic programming, the final output being a table indexed by state and action, where the table entries are the learned Q values, i.e. the expected cumulative rewards.
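For context, the standard tabular Q-learning update that fills in such a table (a textbook formula, not spelled out in the original post) is:

Q(s, a) ← Q(s, a) + α * ( r + γ * max over a' of Q(s', a') − Q(s, a) )

where α is the learning rate, γ is the discount factor and s' is the next state reached after taking action a in state s.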

With a Deep Q Network, or DQN, Deep Learning is used to learn this Q function. The network takes a state as input and outputs a Q value for each action; the action with the highest Q value is the greedy choice. This result is combined with some exploration based algorithm before the final action is returned.
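As a rough illustration (my own sketch, not code from the post or from RLlib), a Q network for a flat state vector can be a small fully connected network that maps the state to one Q value per action; the layer sizes below are arbitrary, and the dimensions happen to match the pricing use case described later:

# Illustrative Q network sketch: a small fully connected network mapping
# a state vector to one Q value per action. Layer sizes are arbitrary.
import numpy as np
import tensorflow as tf

state_dim = 41    # e.g. 2 * T + 1 for the pricing use case with T = 20
num_actions = 20  # one discrete price level per action

q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(state_dim,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(num_actions),          # one Q value per action
])

state = np.random.rand(1, state_dim).astype(np.float32)
q_values = q_network(state)                      # shape (1, num_actions)
greedy_action = int(tf.argmax(q_values[0]))      # action with the highest Q value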

The network architecture could be a regular multi layer network, a convolutional neural network (CNN), a recurrent neural network (RNN) or a long short-term memory network (LSTM). For problems where an image is the input, as in autonomous vehicles, a CNN may be most appropriate. Various improvements on the core DQN algorithm have been suggested. This article has a list of various DRL algorithms and the corresponding neural network architectures used for each algorithm.

RLlib is an excellent Python library for DRL built on top of the TensorFlow and PyTorch deep learning libraries. It uses TensorFlow by default, but it's easy to switch to PyTorch by changing the RLlib configuration.
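For example, the backend can be selected through the trainer configuration; treat the exact key as version dependent (newer releases use "framework", older ones a "use_pytorch" flag):

# Sketch: selecting the deep learning backend in the RLlib trainer config.
# The exact key depends on the RLlib version.
config = {
    "framework": "torch",   # or "tf" for TensorFlow (the default)
    # ... other DQN settings such as learning rate, exploration config, etc.
}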

Price Optimization

Consider a business that was using Excel and domain knowledge for pricing its products and has decided to adopt DRL driven pricing policies. The business wants the DRL model to recommend a price once a week, when the new price is enforced. The business does its part by returning the next state and the reward from the action taken to the DRL system for continuous training of the model.

This use case is adapted from this excellent blog on DRL and then substantially enhanced. The blog also has another interesting use case for supply chain optimization using DQN, as well as a from-scratch implementation of DQN using PyTorch. Here are the improvements made to the use case:

  1. The part of the demand that depends on the price is non-linear
  2. A seasonal and a random component have been added to the demand
  3. A new element has been added to the state, which is the offset into the seasonal cycle
  4. Ability to checkpoint the trained model
  5. Ability to restore a checkpointed model and perform additional training
  6. Ability to restore a checkpointed model and get an action from it for a given state

The state has 2 * T + 1 elements, where T is user defined. The first T elements contain the past actions, i.e. prices. The next T elements are a one hot encoded vector for the current time step. The last element is the offset into the seasonal cycle.

The dimension of the action space is defined by choosing a minimum price, a maximum price and a price increase step. I have used a minimum price of 400, a maximum price of 495 and a price step of 5, resulting in an action space dimension of 20. Each action corresponds to a discrete price, e.g. 400, 405 and so on.
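To make the encoding concrete, here is a small sketch of the price grid and the 2 * T + 1 element state vector; the function and variable names are mine, not the author's:

import numpy as np

# Price grid: 400, 405, ..., 495 -> 20 discrete actions
min_price, max_price, price_step = 400, 495, 5
price_grid = np.arange(min_price, max_price + price_step, price_step)
assert len(price_grid) == 20

def action_to_price(action):
    # an action is simply an index into the price grid, e.g. 4 -> 420
    return price_grid[action]

T = 20  # episode length / price history window

def make_state(price_history, time_step, seasonal_offset):
    """State layout: T past prices, one-hot time step, seasonal cycle offset."""
    state = np.zeros(2 * T + 1)
    state[:len(price_history)] = price_history   # past prices (zero padded)
    state[T + time_step] = 1                     # one-hot current time step
    state[-1] = seasonal_offset                  # offset into the seasonal cycle
    return state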

Our agent is essentially software simulated. Later in this post, we cover how a real life agent interacts with a DRL system to get price recommendations.

DQN Powered Price Optimization in Action

To use RLlib for DRL you have to define a Python class for the environment, extending the OpenAI Gym Env class. Training consists of many episodes, each of length T as defined earlier. Here is the full Python implementation. The training loop of DQN is as follows:

  1. Each iteration consists of multiple episodes, the number being set in the RLlib configuration
  2. At the beginning of each episode, the reset() method of the environment object is called
  3. For each step through an episode, the step(action) method is called with the action passed as an argument. The method returns the next state, the reward and a boolean indicating whether the episode has completed
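Before running the commands, it may help to see the shape of such a class. Below is a heavily simplified, hypothetical sketch of a pricing environment and the outer training loop using RLlib's DQNTrainer; the class and variable names are mine, and the profit logic is a crude placeholder for the non-linear, seasonal and random demand model in the actual avenir implementation:

# Heavily simplified sketch, not the actual avenir implementation.
# The reward logic is a placeholder for the real demand/profit model.
import numpy as np
import gym
from gym.spaces import Box, Discrete
import ray
from ray.rllib.agents.dqn import DQNTrainer

T = 20
PRICES = np.arange(400, 500, 5, dtype=np.float32)   # 20 discrete price levels

class PriceEnv(gym.Env):
    def __init__(self, config=None):
        self.action_space = Discrete(len(PRICES))
        self.observation_space = Box(0.0, 1000.0, shape=(2 * T + 1,), dtype=np.float32)

    def reset(self):
        self.t = 0
        self.state = np.zeros(2 * T + 1, dtype=np.float32)
        self.state[-1] = np.random.randint(0, 90)    # offset into the seasonal cycle
        return self.state

    def step(self, action):
        price = PRICES[action]
        reward = self._profit(price)                 # placeholder profit logic
        self.state[self.t] = price                   # record the price history
        self.state[T:2 * T] = 0.0
        self.state[T + self.t] = 1.0                 # one hot current time step
        self.t += 1
        done = self.t >= T
        return self.state, reward, done, {}

    def _profit(self, price):
        demand = max(0.0, 20000.0 - 35.0 * price + np.random.normal(0.0, 500.0))
        return demand * (price - 300.0)              # assumed unit cost of 300

ray.init()
trainer = DQNTrainer(env=PriceEnv, config={"lr": 0.002})
for i in range(60):                                  # 60 iterations, as in the train command
    result = trainer.train()                         # each iteration runs many episodes
    print(result["episode_reward_mean"])
checkpoint_path = trainer.save("./model/price")      # checkpoint the trained model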

The first thing we will do is train a model and checkpoint it. The number of iterations and the checkpoint directory path are provided as command line arguments. Each iteration consists of multiple episodes and each episode is of length T.

./price_rl.py train 60 ./model/price > po.out

Here is a sample output for one iteration  out of the 60.

**** next iteration 38
custom_metrics: {}
date: 2020-06-07_05-05-01
done: false
episode_len_mean: 19.0
episode_reward_max: 31336807.170057666
episode_reward_mean: 28265365.29418587
episode_reward_min: 24130726.138449784
episodes_this_iter: 263
episodes_total: 10271
experiment_id: a7d2d802d7124c4d9f8a4f5debbc1bd8
hostname: akash
info:
  last_target_update_ts: 195040
  learner:
    default_policy:
      cur_lr: 0.0020000000949949026
      max_q: 8500715.0
      mean_q: 6068760.5
      mean_td_error: -79556.6640625
      min_q: 1460607.375
      model: {}
  num_steps_sampled: 195156
  num_steps_trained: 12426240
  num_target_updates: 386
iterations_since_restore: 39
node_ip: 192.168.43.54
num_healthy_workers: 0
off_policy_estimator: {}
perf:
  cpu_util_percent: 47.725
  ram_util_percent: 73.46547619047618
pid: 9255
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf:
  mean_env_wait_ms: 0.14818509330169555
  mean_inference_ms: 1.3257083458296628
  mean_processing_ms: 0.9839668044348469
time_since_restore: 3104.312764406204
time_this_iter_s: 58.990262031555176
time_total_s: 3104.312764406204
timers:
  learn_throughput: 24655.607
  learn_time_ms: 10.383
timestamp: 1591531501
timesteps_since_restore: 0
timesteps_total: 195156
training_iteration: 39

env reset count 263

Next, we are going to restore a checkpointed model and get an action from it for a given state. The second command line argument is the path of the checkpoint file, which is where the model was saved in the previous step. Here I am letting a random but valid state be created; you could also provide a state as a command line argument.

./price_rl.py loact  ./model/price/checkpoint_60/checkpoint-60  >> po.out
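Under the hood, this mode roughly amounts to restoring the trainer from the checkpoint and asking it for an action. Here is a minimal sketch using RLlib's compute_action API, reusing the PriceEnv and PRICES names from the earlier sketch (details simplified):

# Sketch of the "loact" mode: restore the checkpointed trainer and query it.
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init()
trainer = DQNTrainer(env=PriceEnv, config={"lr": 0.002})
trainer.restore("./model/price/checkpoint_60/checkpoint-60")

state = PriceEnv().reset()                        # random but valid state
# full_fetch=True also returns the Q values / action distribution inputs
action, _, info = trainer.compute_action(state, full_fetch=True)
print("recommended price:", PRICES[action])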

Here is the output of the loact command. At the end you will find that the action returned by the DQN is 4, which is an index into the price array; the index 4 corresponds to the price of 420. In the array of action distribution inputs (the Q values) you will find that the 5th element has the highest value.

******** loading checkpointed model and getting action ********
creating random but valid state
2020-06-07 06:07:34,246	INFO resource_spec.py:212 -- Starting Ray with 2.69 GiB memory available for workers and up to 1.36 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=).
2020-06-07 06:07:34,687	INFO services.py:1170 -- View the Ray dashboard at localhost:8265
2020-06-07 06:07:36,567	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-06-07 06:07:36,643	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-06-07 06:07:36,650	WARNING deprecation.py:30 -- DeprecationWarning: `exploration_final_eps` has been deprecated. Use `exploration_config.final_epsilon` instead. This will raise an error in the future!
observation space  (41,)
action space  20
2020-06-07 06:07:40,179	WARNING trainer_template.py:124 -- The experimental distributed execution API is enabled for this algorithm. Disable this by setting 'use_exec_api': False.
2020-06-07 06:07:40,180	INFO trainable.py:217 -- Getting current IP.
2020-06-07 06:07:40,180	WARNING util.py:37 -- Install gputil for GPU system monitoring.
2020-06-07 06:07:40,284	INFO trainable.py:217 -- Getting current IP.
2020-06-07 06:07:40,284	INFO trainable.py:423 -- Restored on 192.168.43.54 from checkpoint: ./model/price/checkpoint_60/checkpoint-60
2020-06-07 06:07:40,284	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 60, '_timesteps_total': None, '_time_total': 4407.278539657593, '_episodes_total': 15802}
state:
[410 405 415 405 490 410 475 460 490 455 470   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0
   0   0   0   0  58]
action:
(4, [], {'action_prob': 1.0, 'action_logp': 0.0, 'action_dist_inputs': array([6148030.5, 6178264.5, 6182630. , 6198873. , 6223526.5, 6208431. ,
       6216324. , 6170734. , 6223479. , 5894620.5, 5763397. , 5770401.5,
       5641359.5, 5512044. , 5620705.5, 5454862. , 5396949.5, 5567250.5,
       5232936. , 5146781. ], dtype=float32), 'q_values': array([6148030.5, 6178264.5, 6182630. , 6198873. , 6223526.5, 6208431. ,
       6216324. , 6170734. , 6223479. , 5894620.5, 5763397. , 5770401.5,
       5641359.5, 5512044. , 5620705.5, 5454862. , 5396949.5, 5567250.5,
       5232936. , 5146781. ], dtype=float32)})

As expected, the 5th element in the distribution has the highest value. The distribution is not very skewed, though; there could be several reasons for this:

  1. The difference between adjacent actions, i.e. a price difference of 5, may not be significant enough
  2. More training may be required

Restoring a checkpointed model, performing additional training and then checkpointing the model again can be done as follows.

./price_rl.py inctr  ./model/price/checkpoint_60/checkpoint-60 40 ./model/price  >> po.out

The second argument is the checkpoint file path, the third is the number of additional iterations and the fourth is the checkpoint directory path. When done, it will create a new checkpoint directory and save the new checkpoint file there.
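Conceptually, the inctr mode combines the previous two sketches: restore the trainer, run more training iterations and save a new checkpoint (again a sketch, reusing the trainer object from earlier):

# Sketch of the "inctr" mode: restore, train further, checkpoint again.
trainer.restore("./model/price/checkpoint_60/checkpoint-60")
for i in range(40):                               # additional iterations
    trainer.train()
new_checkpoint = trainer.save("./model/price")    # saved under a new checkpoint directory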

Exploration vs Exploitation

If you look at the action returned by the DQN and compare it with the action probability distribution, you will find that sometimes it's not the action with the highest probability. This is because DQN explores the action space and sometimes returns a non-optimal action, like any RL system.

DQN uses the epsilon greedy algorithm to trade off exploration against exploitation. With some probability, it will return a random action. This probability is high in the beginning and gradually decreases as training progresses. In other words, as training begins, DQN is exploration heavy and then gradually transitions to being more exploitative, returning the best action as determined by the network.
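A bare-bones illustration of epsilon greedy selection with a decaying epsilon (my own sketch, not RLlib's actual exploration implementation, which is configured through its exploration settings such as the final epsilon value):

# Epsilon greedy selection with a decaying epsilon (illustration only).
import numpy as np

def epsilon_greedy_action(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore
    return int(np.argmax(q_values))               # exploit

# epsilon starts high and is annealed towards a small final value as training progresses
epsilon, final_epsilon, decay = 1.0, 0.02, 0.999
for step in range(10000):
    epsilon = max(final_epsilon, epsilon * decay)
    # action = epsilon_greedy_action(q_values, epsilon) for the current state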

DQN Service and Client

In the solution so far, the agent was software simulated. Rewards were generated artificially, using some logic based on the current price and the previous price along with a seasonal and a random component. In real life, however, rewards (profit in our use case) come from a real environment.

For handling real life scenarios and production environments, RLlib provides a REST service for the DQN model and a corresponding REST client. They can be used for off policy training, online training and getting action recommendations from the DRL system. We will delve into them in the next two sections.

Off Policy Training

In many real life use cases there may be historical data on environment states, actions taken and rewards obtained, and we would like to use it to train the DQN model. This is called off policy training, because you are not asking the DRL system for the action to take; it comes from a different source.

For example, past pricing and profit data may be available in Excel for our use case, where each record has the following fields (in parentheses, the corresponding quantities in our use case context):

  1. Current state (price history, seasonal cycle offset)
  2. Action (price)
  3. Reward (profit)
  4. Next state (price history, seasonal cycle offset)

To use this data for off policy training, the following steps need to be taken.

  1. Define an Env class to abstract the train data
  2. Define an episode size and split the data into multiple episodes
  3. Call DQN  service to start a new episode
  4. In the reset() call of the environment object, select an episode randomly and return the state at the beginning of the episode
  5. Call the DQN service to log the action and state for the particular record in the episode
  6. Call step(action) on the Env object. It will return the reward, the next state and the done flag
  7. Call DQN  service to log the reward
  8. When episode ends call DQN  service to end current episode and start a new one and go back to 4
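RLlib's REST client supports this pattern. The sketch below uses its PolicyClient API; the import path and server address depend on your RLlib version and deployment, and the split_history_into_episodes helper and record fields are hypothetical placeholders for your own data access code:

# Sketch of off policy training through RLlib's policy client.
from ray.rllib.env.policy_client import PolicyClient

client = PolicyClient("http://localhost:9900")

for episode in split_history_into_episodes(historical_records):
    episode_id = client.start_episode(training_enabled=True)
    for record in episode:
        # the action came from historical data, not from the DRL system,
        # so it is logged rather than requested
        client.log_action(episode_id, record.state, record.action)
        client.log_returns(episode_id, record.reward)
    client.end_episode(episode_id, episode[-1].next_state)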

Online Training and Usage

In online training and usage, you ask the DRL system for an action, perform the action in the external environment and return the reward to the DRL system. The steps are as follows:

  1. Define an Env class 
  2. Define an episode size
  3. Call DQN  service to start a new episode
  4. In the reset() call of the environment object initialize state
  5. Get action from DRL passing the current state
  6. Call step(action) on the Env object. It will return the reward, the next state and the done flag. This needs to be an asynchronous call with a delayed response.
  7. Call DQN  service to log the reward
  8. When episode ends call DQN  service to end current episode and start a new one and go back to 4
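The online loop is almost the same as the off policy one, except that the action is requested from the service with get_action instead of being logged. Again a sketch with placeholder names, reusing the PriceEnv sketch as a stand-in for the real environment; in a real deployment the step would be asynchronous, with the reward arriving a week later:

# Sketch of online usage: the DRL service recommends the price and the
# environment returns the observed profit as the reward.
from ray.rllib.env.policy_client import PolicyClient

client = PolicyClient("http://localhost:9900")
env = PriceEnv()                                   # stand-in for the real environment

episode_id = client.start_episode(training_enabled=True)
state = env.reset()
done = False
while not done:
    action = client.get_action(episode_id, state)  # price recommendation from the DRL system
    state, reward, done, _ = env.step(action)      # in real life: enforce the price, observe profit
    client.log_returns(episode_id, reward)
client.end_episode(episode_id, state)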

Normally, you would perform off policy training, checkpoint the model and then deploy the checkpointed model online. However, sometimes historical data is not available for training, or you may be deploying a greenfield solution, and you are forced to skip the off policy training.

In such cases, the DRL system will do more exploration of the action space, because there is nothing learnt yet to fall back on for an optimum action choice when deployed online. It's very likely that actions from the DRL system in the early phase of the deployment will be sub-optimal.

Wrapping Up

We have taken a tour through Deep Reinforcement Learning as it applies to solving a business decision making problem: setting the price of a product. We have used a fantastic DRL library called RLlib, which completely encapsulates TensorFlow and PyTorch. There is a tutorial document for the use case in this post.

