Elucidating Policy Iteration in Reinforcement Learning — Jack’s Car Rental Problem

[Figure: Generalized Policy Iteration, from [1]]

In this blog post, I’ll try to elucidate the policy iteration algorithm in Reinforcement Learning by using it to solve Jack’s Car Rental Problem. This problem and its variant are given in Example 4.2 and Exercise 4.5, respectively, in the book by Sutton and Barto (Reinforcement Learning: An Introduction, Second Edition).

Problem Statement

Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of $2 per car moved.

We assume that the number of cars requested and returned at each location are Poisson random variables. Recall that if X is a Poisson random variable, then

$$P(X = n) = \frac{\lambda^n e^{-\lambda}}{n!}, \quad n = 0, 1, 2, \ldots$$

Suppose λ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns.

To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. We take the discount rate γ to be 0.9 and formulate this as a continuing finite Markov Decision Process, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight.

But what does it mean to solve this problem?

Solving this problem means answering two questions. First, how many cars should Jack move overnight between the two locations to maximize his total expected reward; in other words, what should his strategy (policy) be in a given situation (state)? Second, once Jack has such a strategy, how can he compare which situations are better than others (value)?

Initial Setup

Let’s first write down in code the things we know from the problem statement. Jack can receive two types of rewards: a $10 reward each time he rents out a car, and a -$2 reward for each car he moves from one location to the other (note that the latter reward is negative).
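
A minimal sketch of these quantities as constants (the names below are illustrative and may differ from the original code):

```python
# Problem constants (names are illustrative).
RENTAL_REWARD = 10   # credit received for each car rented out
MOVING_COST = 2      # cost per car moved between locations overnight
MAX_CARS = 20        # maximum number of cars allowed at each location
MAX_MOVE = 5         # maximum number of cars that can be moved in one night
GAMMA = 0.9          # discount rate
```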

Now we define a poisson_ class, which takes a parameter λ and calculates the corresponding probability mass function. It has two data members α and β, which denote the interval [α, β ) of values of n for which the pmf value is above ε (here 0.01). We set the pmf values below ε to zero and then normalize the resulting distribution.

For instance, for λ = 3, we can inspect the resulting values of the data members α, β and vals.
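
A minimal sketch of such a class, assuming SciPy is available (the member names follow the description above; the original implementation may differ):

```python
from scipy.stats import poisson

class poisson_:
    """Truncated, renormalized Poisson pmf for a given lambda (sketch)."""
    def __init__(self, lam, eps=0.01):
        self.lam = lam
        # Find the interval [alpha, beta) of n for which pmf(n) > eps.
        self.alpha = 0
        while poisson.pmf(self.alpha, lam) <= eps:
            self.alpha += 1
        self.beta = self.alpha
        while poisson.pmf(self.beta, lam) > eps:
            self.beta += 1
        # Keep only the surviving pmf values and renormalize them to sum to 1.
        self.vals = {n: poisson.pmf(n, lam) for n in range(self.alpha, self.beta)}
        total = sum(self.vals.values())
        self.vals = {n: p / total for n, p in self.vals.items()}

p3 = poisson_(3)
print(p3.alpha, p3.beta)   # 0 8 for lambda = 3 with eps = 0.01
print(p3.vals)             # renormalized pmf over the interval [alpha, beta)
```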

The policy iteration algorithm

Before we go any further, to make sure that all readers are on the same page, let’s quickly review what we mean by the terms value and policy here.

In our car rental example, the state of the system at any time is a pair of numbers: the number of cars at the first location and at the second. Given a state, Jack has to choose an action, which is the number of cars he moves from the first location to the second or vice versa. According to the problem, the action can vary between -5 and +5, where +n means that Jack moves n cars from the first location to the second.

A policy is a mapping from states to actions, i.e., given a state, how many cars Jack should move overnight. Now suppose Jack follows some policy π. Given this π, the value of a state s is the expected return (cumulative discounted reward) that Jack would get by starting from s and following π thereafter.

[Figure: The Policy Iteration Algorithm, from [1]]

The policy iteration algorithm, as shown in the figure above, consists of three components. Let’s discuss each of these components in the context of solving the rental problem.

The first component is initialization. As the algorithm states, we initialize the value and policy matrices arbitrarily; initializing them to zero works fine. Given a policy, we define a value for each state, and since a state is a pair of numbers, each between 0 and 20, we represent the value function by a matrix of shape 21 x 21. The policy takes a state and outputs an action, so it can be represented by a matrix of the same shape.
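
For instance, using NumPy and the constants sketched earlier:

```python
import numpy as np

# One entry per state (cars at location 1, cars at location 2).
value = np.zeros((MAX_CARS + 1, MAX_CARS + 1))               # state values
policy = np.zeros((MAX_CARS + 1, MAX_CARS + 1), dtype=int)   # action for each state
```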

The second component is policy evaluation: following the current policy, what is the value of each state? As mentioned above, given a policy π, the value of a state s is the expected return that Jack would get by starting from s and following π thereafter.

$$v_\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\right] = \sum_{s',\, r} p(s', r \mid s, \pi(s)) \left[r + \gamma\, v_\pi(s')\right]$$

Writing the value of a state as this expectation and expanding it one step gives the Bellman equation on the right.

This Bellman equation forms the basis of the value update shown in the policy evaluation component.

Value Update

After many such updates, V(s) converges to a number that satisfies the Bellman equation to within a small tolerance θ and hence represents the value of state s.
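
A minimal sketch of policy evaluation under these definitions is shown below, reusing the poisson_ class and constants from earlier. To keep it short, it approximates the number of returned cars by their expected values (3 at the first location, 2 at the second) instead of summing over the return distributions; expected_return is an illustrative helper, not the original code.

```python
# Truncated request distributions (using the poisson_ class sketched earlier).
request1, request2 = poisson_(3), poisson_(4)

def expected_return(state, action, value):
    """Right-hand side of the Bellman equation for a (state, action) pair.
    `action` is the number of cars moved from location 1 to 2 (may be negative)
    and is assumed to be feasible for `state`."""
    returns = -MOVING_COST * abs(action)          # cost of moving cars overnight
    cars1 = min(state[0] - action, MAX_CARS)      # cars available next morning
    cars2 = min(state[1] + action, MAX_CARS)
    for req1, p1 in request1.vals.items():        # rental demand at location 1
        for req2, p2 in request2.vals.items():    # rental demand at location 2
            prob = p1 * p2
            rented1, rented2 = min(cars1, req1), min(cars2, req2)
            reward = RENTAL_REWARD * (rented1 + rented2)
            # Approximate returned cars by their expected values (3 and 2).
            next1 = min(cars1 - rented1 + 3, MAX_CARS)
            next2 = min(cars2 - rented2 + 2, MAX_CARS)
            returns += prob * (reward + GAMMA * value[next1, next2])
    return returns

def policy_evaluation(value, policy, theta=1e-4):
    """Sweep all states, replacing V(s) by the Bellman update, until the
    largest change in any state falls below theta."""
    while True:
        delta = 0.0
        for i in range(MAX_CARS + 1):
            for j in range(MAX_CARS + 1):
                old = value[i, j]
                value[i, j] = expected_return((i, j), policy[i, j], value)
                delta = max(delta, abs(old - value[i, j]))
        if delta < theta:
            return value
```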

The third component is policy improvement. Given a state s, we set π(s) to the action that maximizes the expected return. The policy is said to be stable when this maximization step does not change the action in any state.
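
A corresponding sketch of policy improvement, reusing the illustrative expected_return helper from the evaluation sketch above:

```python
def policy_improvement(value, policy):
    """Make the policy greedy with respect to the current value function.
    Returns the updated policy and whether it was already stable."""
    stable = True
    for i in range(MAX_CARS + 1):
        for j in range(MAX_CARS + 1):
            old_action = policy[i, j]
            # Feasible actions: at most MAX_MOVE cars, and no more cars
            # than are actually present at the source location.
            actions = range(-min(MAX_MOVE, j), min(MAX_MOVE, i) + 1)
            policy[i, j] = max(actions, key=lambda a: expected_return((i, j), a, value))
            if policy[i, j] != old_action:
                stable = False
    return policy, stable
```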

We run the policy evaluation and policy improvement components in a loop until the policy becomes stable.
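
Putting the pieces together, the outer loop of policy iteration might look like this (again a sketch built on the helpers above):

```python
stable = False
while not stable:
    value = policy_evaluation(value, policy)
    policy, stable = policy_improvement(value, policy)
```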

Results

[Figure: policy heatmaps after each pass of evaluation and improvement]

The figure above shows the policies after successive passes of evaluation and improvement, displayed as heatmaps. Recall that a policy is represented by a matrix of action values in the range [-5, 5].

Adding non-linearities to the original rental problem

Let’s see what happens if we add some non-linearities like the following to the problem above:

  1. One of Jack’s employees at the first location rides a bus home each night and lives near the second location. She is happy to shuttle one car to the second location for free. Each additional car still costs $2, as do all cars moved in the other direction.
  2. In addition, Jack has limited parking space at each location. If more than 10 cars are kept overnight at a location (after any moving of cars), then an additional cost of $4 must be incurred to use a second parking lot (independent of how many cars are kept there).

These conditions can easily be added to the original code. We now have one more reward: -$4 whenever a second parking lot is needed at a location (note that this reward is negative).
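
One way to fold these conditions into the code is to replace the simple moving cost inside the expected-return computation with a function like the following (a sketch; the constant names are illustrative, while the free shuttle, the threshold of 10 cars, and the $4 cost follow the problem statement above):

```python
SECOND_LOT_COST = 4   # flat cost for needing a second parking lot overnight

def overnight_cost(state, action):
    """Cost of moving cars plus parking, for the modified problem."""
    if action > 0:
        # One car moved from location 1 to location 2 is shuttled for free.
        cost = MOVING_COST * (action - 1)
    else:
        cost = MOVING_COST * abs(action)
    cars1 = min(state[0] - action, MAX_CARS)   # cars kept overnight after moving
    cars2 = min(state[1] + action, MAX_CARS)
    if cars1 > 10:
        cost += SECOND_LOT_COST
    if cars2 > 10:
        cost += SECOND_LOT_COST
    return cost
```

In the expected_return sketch, the term -MOVING_COST * abs(action) would then be replaced by -overnight_cost(state, action).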

The resulting policies, after passes of evaluation and improvement, are shown in the figure below.

[Figure: policy heatmaps for the modified problem, after evaluation and improvement]

Conclusion

We solved the above problem using Dynamic Programming. We stored the intermediate value and policy matrices and used them in our policy evaluation and improvement functions. In order to use Bellman updates though, we need to know the dynamics of the environment, just like we knew the probabilities for rewards and next states in the rental example. If we can only sample from the underlying distribution and don’t know the distribution itself, then Monte Carlo methods can be used to solve the corresponding learning problem. We will discuss these methods in a later post.

References

[1] Sutton, R. S., & Barto, A. G. (2017). Reinforcement Learning: An Introduction (2nd ed.). Cambridge, MA: MIT Press.

[2] Silver, D. Lecture on Planning by Dynamic Programming (https://www.youtube.com/watch?v=Nd1-UUMVfz4).

