
N-step TD Method


In previous posts, we have explored some general reinforcement learning methods, including SARSA, an on-policy method in which the Q value is updated based on the trajectory the agent actually takes, and the Monte Carlo method, which is generally used to evaluate a policy from complete episodes. In this post, we will

  1. recall SARSA and the Monte Carlo method
  2. explain why these two methods can be unified (in fact, they are the same method with different parameters)
  3. implement a random walk example and compare the effectiveness of different n-step methods
[Figure: n-step TD, from Reinforcement Learning: An Introduction]

SARSA & Monte Carlo Simulation

Remember that in 1-step SARSA, we updated the current Q value estimation with:

Q(S, A) += lr * (R(S, A) + gamma * Q(S', A') - Q(S, A))

which indicates that the estimate of the current state is affected only by the estimate of the next state; this is why it is also called a 1-step update method. From 1-step SARSA we can build 2-step, 3-step SARSA and so on, where the current update also depends on states that are 2, 3, … steps away. And when the number of steps approaches infinity, the method is essentially a Monte Carlo simulation.
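To make this concrete, here is a minimal Python sketch of the one-step update as a function; the dictionary-based Q table and the names lr and gamma are my own choices, not taken from the earlier posts.

# One-step SARSA update: move Q(S, A) toward the bootstrapped target
# R + gamma * Q(S', A'). Q is assumed to be a dict keyed by (state, action).
def sarsa_update(Q, s, a, r, s_next, a_next, lr=0.1, gamma=1.0):
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += lr * (target - Q[(s, a)])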

Monte Carlo methods perform an update for each state based on the entire sequence of observed rewards from that state until the end of the episode. The update of one-step TD methods, on the other hand, is based on just the one next reward, bootstrapping from the value of the state one step later as a proxy for the remaining rewards.

For example, in a game-playing setting like blackjack, we applied Monte Carlo simulation and only obtained the result (win, draw or lose) at the end of the game, so the final outcome is what updates each earlier state. That is why Monte Carlo can also be viewed as an n-step TD method with n approaching infinity.

The n-step TD Method

Let’s directly take a look at the algorithm and implement it with a concrete example.

[Figure: the n-step SARSA algorithm]

It might be a little tricky to understand the algorithm, so let me explain with actual numbers. The lowercase t is the timestamp the agent is currently at, so it runs 0, 1, 2, … until the end of the episode. τ is the timestamp whose Q value is being updated: if n=3, which is the 3-step TD method, and the current t=5, then τ = t - n + 1 = 5 - 3 + 1 = 3, meaning that when the agent reaches timestamp 5, the Q value of timestamp 3 is updated. The update simply accumulates the rewards collected along the way (each multiplied by the decaying factor γ) and uses the temporal difference G - Q(S_τ, A_τ) to update the current estimate (this part is exactly the same as what we discussed for SARSA). The condition τ + n < T handles the end of the episode: if the terminal time is T=10, meaning the agent takes 10 steps to reach the end of the game, then at timestamp t=10 the state at τ = 10 - 3 + 1 = 8 is updated, but what about the state at τ = 9? The algorithm keeps looping until τ = T - 1, and these remaining states are updated using only the rewards collected along the way, without the bootstrapping term.
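To pin down the bookkeeping, here is a hedged sketch of how the n-step return G for timestamp τ could be computed from a stored trajectory; the list names and the dictionary-based Q table are assumptions, not the book's notation or the original code.

# n-step return for the state-action pair at time tau.
# rewards[i] is assumed to hold R_i, the reward received on entering S_i,
# so the return sums rewards[tau+1 .. min(tau+n, T)] and bootstraps from Q
# only if the episode has not ended within those n steps.
def n_step_return(rewards, states, actions, Q, tau, n, T, gamma=1.0):
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * Q[(states[tau + n], actions[tau + n])]
    return G

With n=3, τ=3 and T > 6 as in the example above, the sum runs over rewards[4], rewards[5] and rewards[6], and the bootstrapping term γ^3 * Q(S_6, A_6) is added.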

You might wonder why the updates proceed in forward order, while the examples we discussed before always update the Q values at the end of the game and backpropagate the reward. In fact, the forward, ordered update is the more general form: it also covers situations that are not game playing and have no end of episode, say, stock-market investment, where we care about the long-term total reward and there is no obvious terminal state. Both the ordered update and the reversed update converge when the agent takes enough trials, but the reversed update is generally more effective when the reward collected at the end of the episode is mostly non-zero, so that the updates along the way are not all zero.

Random Walk Implementation

Let’s implement an example and see how different numbers of steps of the TD method affect the agent’s learning (full code).

[Figure: the random walk environment]

In this game setting, all episodes start in the center state, C, then proceed either left or right by one state on each step, with equal probability. Episodes terminate either on the extreme left or the extreme right. When an episode terminates on the right, the agent is rewarded +1; when it terminates on the left, the agent is rewarded -1; all other rewards are 0.

Initialisation


We initialise a game with 19 states (the two terminal states at the ends are not counted!). In the __init__ function we, as usual, initialise the actions and Q values, and the Q values of the two terminal states are explicitly set to 1 and -1 in order to help the agent converge faster. self.n is the number of steps.
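Since the original code is only shown as an image, the snippet below is a rough reconstruction of what such an __init__ might look like based on the description above; the class name RandomWalk and all attribute names are assumptions.

class RandomWalk:
    def __init__(self, n_states=19, n=3, lr=0.1, gamma=1.0):
        self.n_states = n_states              # non-terminal states are 1..19
        self.start = n_states // 2 + 1        # the centre state C
        self.actions = [-1, 1]                # left, right
        self.n = n                            # number of steps of the n-step method
        self.lr = lr
        self.gamma = gamma
        # Q[state][action]; states 0 and n_states + 1 are the terminal states
        self.Q = {s: {a: 0.0 for a in self.actions} for s in range(n_states + 2)}
        for a in self.actions:                # terminal Q values fixed to the final rewards
            self.Q[0][a] = -1.0
            self.Q[n_states + 1][a] = 1.0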

Taking Actions


The agent chooses the left and right actions with equal probability, and taking an action either adds 1 to or subtracts 1 from the state index.
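Again as a reconstruction rather than the author's actual code, the action step could look like the following; these functions are meant to live as methods on the RandomWalk class sketched above.

import random

def choose_action(self):
    # left (-1) and right (+1) with equal probability
    return random.choice(self.actions)

def take_action(self, state, action):
    next_state = state + action               # move one state left or right
    if next_state == self.n_states + 1:       # right terminal state
        reward = 1
    elif next_state == 0:                     # left terminal state
        reward = -1
    else:
        reward = 0
    return next_state, reward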

TD updates


The n-step SARSA update is exactly the same as in the algorithm shown above. Before each episode, we initialise actions, states and rewards lists to keep track of the agent’s behaviour along the way, which we need for the n-step updates. The updates only stop when tau == T-1, as the terminal state at time T is not updated.
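Here is a hedged sketch of one episode following the n-step SARSA algorithm, again intended as a method of the hypothetical RandomWalk class; buffering the whole trajectory in plain Python lists is a simplification, since the algorithm only ever needs the last n + 1 entries.

def play_episode(self):
    state = self.start
    action = self.choose_action()
    states, actions, rewards = [state], [action], [0]   # rewards[0] is a dummy entry
    T, t = float('inf'), 0
    while True:
        if t < T:
            state, reward = self.take_action(states[t], actions[t])
            states.append(state)
            rewards.append(reward)
            if state == 0 or state == self.n_states + 1:
                T = t + 1                                # the episode ends at time T
            else:
                actions.append(self.choose_action())
        tau = t - self.n + 1                             # time whose estimate is updated
        if tau >= 0:
            G = sum(self.gamma ** (i - tau - 1) * rewards[i]
                    for i in range(tau + 1, min(tau + self.n, T) + 1))
            if tau + self.n < T:                         # bootstrap from the Q value n steps later
                G += self.gamma ** self.n * self.Q[states[tau + self.n]][actions[tau + self.n]]
            s_tau, a_tau = states[tau], actions[tau]
            self.Q[s_tau][a_tau] += self.lr * (G - self.Q[s_tau][a_tau])
        if tau == T - 1:                                 # all remaining states have been updated
            break
        t += 1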

Comparison of different N

By running the algorithm with different learning rates and numbers of steps (n), we are able to compare the performance of our agent.


The actual_state_value for an n-state random walk increases in equal steps from -1 to 1, and because the agent chooses actions with equal probability, the estimate_state_value of a state equals the expectation (the average) of the Q values in that state.
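Below is a sketch of how such a comparison could be set up, assuming the methods above have been attached to the hypothetical RandomWalk class; the RMS-error measure, the episode count and the parameter grid are illustrative choices, not the original experiment.

import numpy as np

def rms_error(agent, episodes=10):
    # true state values of the 19 non-terminal states: -0.9, -0.8, ..., 0.9
    actual_state_value = np.linspace(-1, 1, agent.n_states + 2)[1:-1]
    errors = []
    for _ in range(episodes):
        agent.play_episode()
        # random policy: a state's value is the average of its two Q values
        estimate_state_value = np.array([sum(agent.Q[s].values()) / 2
                                         for s in range(1, agent.n_states + 1)])
        errors.append(np.sqrt(np.mean((estimate_state_value - actual_state_value) ** 2)))
    return np.mean(errors)

for n in (1, 2, 4, 8):                       # illustrative grid of n and learning rates
    for lr in (0.2, 0.4, 0.8):
        print(n, lr, rms_error(RandomWalk(n=n, lr=lr)))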

[Figure: performance of the agent for different n and learning rates]

From the result, we can see that the best number of steps n generally lies somewhere between 1 and infinity. This illustrates how the generalisation of TD and Monte Carlo methods to n-step methods can potentially perform better than either of the two extreme methods.

And lastly, please check out the full code here. You are welcome to contribute, and if you have any questions or suggestions, please leave a comment below!

Reference:

Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction (2nd ed.). MIT Press.