Value Iteration for V-function

V-function in Practice for Frozen-Lake Environment

Jun 15 ·9min read

vMJV7vz.png!web

In theprevious post, we presented the Value Iteration method to calculate the V-values and Q-values required by Value-based Agents. In this post, we will present how to implement the Value Iteration method for computing the state value by solving the Frozen-Lake Environment.

Value Iteration with V-function in Practice

The entire code of this post can be found on GitHub and can be run as a Colab google notebook using this link .

Next, we will present in detail the code that makes up the Value Iteration method that we presented in the previous post. So let’s come to the code. In the beginning, we import the used libraries and define the main constants:

import gym
import collections
from torch.utils.tensorboard import SummaryWriterENV_NAME="FrozenLake-v0"GAMMA = 0.9
TEST_EPISODES = 20
N =100
REWARD_GOAL = 0.8

Agent’s data structures

The main data structures which will keep Agent’s information are:

rewards : A dictionary with the composite key “source state” + “action” +“target state”. The value is obtained from the immediate reward.
transits : A table as a dictionary keeping counters of the experienced transitions. The key is the composite “state” + “action”, and the value is another dictionary that maps the target state into a count of times that we have seen it.
values : A dictionary that maps a state into the calculated value of this state (V-value).
state : Current state of the Agent.

The main data structures are created in the class constructor of the Agent.

Training Algorithm

The overall logic of our training algorithm is simple. Until the desired Reward goal is not reached we will execute the following steps:

play N random steps from the environment to populating the reward and transits tables.
After those N steps, it performs a value iteration step over all states, updating the value table.
Then we play several full episodes to check the improvements using the updated value table.
If the average reward for those test episodes is above the desired boundary, then we stop training.

Before we go into more detail about this code, it will help to review the agent’s code first.

Agent Class

In the Agent class constructor, we create an Environment that we will be using for data samples, obtain our first observation, and define tables for rewards, transitions, and values:

class Agent: 
      def __init__(self):
          self.env = gym.make(ENV_NAME) 
          self.state = self.env.reset()
          self.rewards = collections.defaultdict(float)
          self.transits = collections.defaultdict(
                        collections.Counter)
          self.values = collections.defaultdict(float)

Play random steps

Remember that in the previous post we advanced that the estimation of Transitions and Rewards will be obtained through the history of the Agent’s interaction with the Environment.

This is done by the method play_n_random_steps that plays N random steps from the Environment, populating the reward and transits tables with random experiences.

def play_n_random_steps(self, count):
    for _ in range(count):
        action = self.env.action_space.sample()
        new_state, reward, is_done, _ = self.env.step(action)
        self.rewards[(self.state, action, new_state)] = reward
        self.transits[(self.state, action)][new_state] += 1
        if is_done:
           self.state = self.env.reset()
        else:
           self.state = new_state

Note that we don’t need to wait for the end of the episode to start learning; we just perform N steps and remember their outcomes. This is one of the differences between the Value Iteration and the Cross-Entropy method shown in a previous post, which requires full episodes to learn.

Value of the action

The next method calculates the Q-function, the value of an action from a state using the transits , reward , and value tables of the Agent. We will use it for two purposes: to select the best action to perform from the state and to calculate the new value of the state on the Value Iteration algorithm.

def calc_action_value(self, state, action):
    target_counts = self.transits[(state, action)]
    total = sum(target_counts.values())
    
    action_value = 0.0
    for tgt_state, count in target_counts.items():
        reward = self.rewards[(state, action, tgt_state)]
        val = reward + GAMMA * self.values[tgt_state]
        action_value += (count / total) * val
    return action_value

First, from the transits table, we extract transition counters for the given state and action that the method receives as arguments. We sum all counters to obtain the total count of times we have executed the action from the state. Then we iterate every target state that our action has landed on and calculate its contribution to the total action value using the formula presented in the Bellman equation post:

This value is equal to immediate reward plus discounted value for the target state and multiplied this sum to the probability of this transition (individual counter divided by the total value computed before). We add the result for each iteration to a variable action_value , the variable that will be returned.

Select best Action

In order to select the best action from a given state, we have the method select_action that uses the previous calc_action_value method to make the decision:

def select_action(self, state):
    best_action, best_value = None, None
    for action in range(self.env.action_space.n):
        action_value = self.calc_action_value(state, action)
        if best_value is None or best_value < action_value:
           best_value = action_value
           best_action = action
    return best_action

The code of this method iterates over all possible actions in the Environment and calculates the value for every action and returns the action with the largest Q-value.

Value Iteration function

And here we have the main function, that as we described in the previous post, what Value Iteration method do is just loop over all states in the Environment:

def value_iteration(self):
    for state in range(self.env.observation_space.n):
        state_values = [
              self.calc_action_value(state, action)
              for action in range(self.env.action_space.n)
        ]
   self.values[state] = max(state_values)

For every state, we update its value with the maximum value of the action available from the state.

Training Loop and the monitoring of the code

After presenting the Agent’s class and its methods, we come back to describe the main loop. First of all, we create the Environment that we will be using for testing, create an instance of the Agent class, the summary writer for TensorBoard, and some variables that we will use:

test_env = gym.make(ENV)
agent = Agent()
writer = SummaryWriter()iter_no = 0
best_reward = 0.0

Remember that Value Iteration method formally requires an infinite number of iterations to converge exactly to obtain the optimal value function. In practice, in the previous post, we showed that we can stop once the value function changes by only a small amount in an iteration of the training loop.

n this example, in order to keep the code simple and with few iterations, we decide to stop when we reach some reward threshold. But all the rest of the code is identical.

The overall logic of our code is simply a loop that will iterate until the Agent achieve the expected performance (If the average reward for those test episodes is above the REWARD_GOAL boundary, then we stop training):

while best_reward < REWARD_GOAL:
        agent.play_n_random_steps(N)
        agent.value_iteration()
 
        ...

The body of the loop is composed of the steps introduced earlier:

STEP 1: play N random steps calling the method plays_n_random_steps.

STEP 2: It performs a value iteration sweep over all states calling the method value_iteration.

STEP 3: Then we play several full episodes to check the improvements using the updated value table. For that the code uses agent.elect_action() to find the best action to take in the new test_env Environment (we don’t want to mess with the current state of the main Environment used to gather random data), to check improvements of the Agent:

iter_no += 1
reward_avg = 0.0
for _ in range(TEST_EPISODES):
    total_reward = 0.0
    state = test_env.reset()
    while True:
        action = Select_action(state)
        new_state, new_reward, is_done, _ = test_env.step(action)
        total_reward += new_reward
        if is_done: break
        state = new_state
    reward_test += total_reward
reward_test /= TEST_EPISODES

Additionally, the code writes data into TensorBoard in order to tracks the best average reward:

writer.add_scalar("reward", reward_test, iter_no)
if reward_test > best_reward:
    print("Best reward updated %.2f at iteration %d " % 
         (reward_test ,iter_no) )
    best_reward = reward_test

And that’s all!

Run the program

Okay, let’s run our program:

Best reward updated 0.40 in iteration 13 
Best reward updated 0.65 in iteration 20 
Best reward updated 0.80 in iteration 23 
Best reward updated 0.85 in iteration 28 
Best reward updated 0.90 in iteration 76

Test the client

Now, if we try with the same client test code as in the case of Cross-Entropy, we can see that the Agent we have built, can learn from a slippery Environment:

new_test_env = gym.make(‘FrozenLake-v0’, is_slippery=True)
state = new_test_env.reset()
new_test_env.render()
is_done = False
iter_no = 0
while not is_done:
    action = Select_action(state)
    new_state, reward, is_done, _ = new_test_env.step(action)
    test_env.render()
    state = new_state
    iter_no +=1
print(“reward = “, reward)
print(“iterations =”, iter_no)

36NNFre.png!web

. . .

ja6VzmU.png!web

Conclusions

Note that the algorithm presented is stochastic and for different executions, it employs a different number of iterations to reach a solution. However, it could learn from a slippery Environment instead the Cross-Entropy presented earlier. We can plot all of them using Tensorboard that will show graphs like the following one:

JNZ3ae3.png!web

We can notice that in all cases, it needs a few seconds as a maximum to find a good policy that solves the Environment in 80% of runs. If you remember the Cross-Entropy method, for a slippery Environment, it took many hours to achieve a success rate of only 60%.

Moreover, we can apply this algorithm to a larger version of FrozenLake, which has the name FrozenLake8x8-v0. The larger version of FrozenLake can take many more iterations to solve, and, according to TensorBoard charts, most of the time it waits for the first successful episode (it need to have at least one successful episode to start learning from useful value table), then it very quickly reaches convergence. Below is a chart that compares reward dynamics during training on FrozenLake-4x4 and 8x8 version:

a2IBZrv.png!web

In addition to changing the Environment in ENV_NAME (“FrozenLake-v0” vs “FrozenLake8x8-v0”), the reader can perform tests with different values of the hyperparameters as GAMMA, TEST_EPISODE, REWARD_GOAL or N . Why don’t you try it?

What is next?

In the next post, we will present the Value Iteration method code that learns the values of the actions instead of learns the values of the states as we have done here. See you in the next post!

The entire code of this post can be found on GitHub and can be run as a Colab google notebook using this link .

Acknowledgments: The code presented in this post has been inspired from the code of Maxim Lapan who has written an excellent practical book on the subject .

V-function in Practice for Frozen-Lake Environment

Value Iteration with V-function in Practice

Agent’s data structures

Training Algorithm

Agent Class

Play random steps

Value of the action

Select best Action

Value Iteration function

Training Loop and the monitoring of the code

Run the program

Test the client

. . .

Conclusions

What is next?

Recommend

From directory deletion to SYSTEM shell

Silq is a new high-level programming language for quantum computers

LaTeX Typesetting – Part 1 (Lists)

使用Apache Spark和Apache Hudi构建分析数据湖

微服务技术栈：常见注册中心组件，对比分析

Pkg.go.dev is open source!

Deep learning on graphs: successes, challenges, and next steps

Introducing Windows Insider Channels

Guide to Write Recursive Functions using Python

Grassmann.jl A\b 3x faster than Julia's StaticArrays.jl

About Joyk