
Value Function Approximation — Control Methods


Generalized Policy Iteration should be no new concept to us at this point. The change here is that we use approximate policy evaluation. We start with some parameter vector w, which defines a value function. We then act greedily with respect to that value function, with a bit of epsilon exploration, giving us a new policy. To evaluate this new policy we update the parameters of our value function, and we keep repeating the process until we (hopefully) converge to an optimal value function. The familiar GPI diagram illustrates the process.
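A minimal sketch of this loop in code, assuming a hypothetical environment interface (env.reset, env.step, env.actions) and an update rule update_w of the kind derived below; the names here are illustrative, not from the original post:

    import numpy as np

    def epsilon_greedy(q_hat, state, actions, w, eps=0.1):
        # Explore with probability eps, otherwise act greedily w.r.t. q_hat.
        if np.random.rand() < eps:
            return np.random.choice(actions)
        return max(actions, key=lambda a: q_hat(state, a, w))

    def approximate_gpi(env, q_hat, update_w, w, episodes=1000):
        # Interleave epsilon-greedy acting with incremental updates to w:
        # approximate evaluation and greedy improvement drive each other.
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                action = epsilon_greedy(q_hat, state, env.actions, w)
                next_state, reward, done = env.step(action)
                w = update_w(w, state, action, reward, next_state, done)
                state = next_state
        return w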

What you’ll notice is that, as before, we don’t need to go all the way up to the top line of that diagram; in other words, we don’t waste time and samples of experience trying to fit our function approximator exactly to the current policy’s value function. After the slight adjustment of our policy, we act immediately, with the freshest data available to us.

The problem with the algorithm above is that it will likely never find the true optimal value function: in reality, we only get closer and closer to the best approximation of it that our function approximator can represent.

Our first step, now, will be to approximate the action-value function rather than the state-value function.

For every state and action, we build a function with parameters w that predicts the return we expect to receive starting from that state and action.
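In standard notation, with q_π the true action-value function of the current policy π:

$$ \hat{q}(S, A, \mathbf{w}) \approx q_\pi(S, A) $$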

We minimize the mean-squared error between the approximate action-value function and the true action-value function.
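The objective is therefore

$$ J(\mathbf{w}) = \mathbb{E}_\pi\!\left[ \left( q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}) \right)^2 \right] $$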

Using the Chain Rule and stochastic gradient descent, we find a local minimum:
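Differentiating the objective gives

$$ -\tfrac{1}{2}\,\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbb{E}_\pi\!\left[ \left( q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}) \right) \nabla_{\mathbf{w}}\, \hat{q}(S, A, \mathbf{w}) \right] $$

and stochastic gradient descent samples this gradient on each step:

$$ \Delta \mathbf{w} = \alpha \left( q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}) \right) \nabla_{\mathbf{w}}\, \hat{q}(S, A, \mathbf{w}) $$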

Let’s consider the simplest case, using linear action-value function approximation. We build a feature vector to represent states and actions:
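$$ \mathbf{x}(S, A) = \big( x_1(S, A),\; \ldots,\; x_n(S, A) \big)^\top $$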

These features describe the entire state-action space. We then represent the action-value function as a linear combination of these features, although we could also use a more sophisticated approximator, such as a neural network.
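$$ \hat{q}(S, A, \mathbf{w}) = \mathbf{x}(S, A)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(S, A)\, w_j $$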

The gradient descent update then collapses to:
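Because the gradient of a linear function is just its feature vector, ∇_w q̂(S, A, w) = x(S, A), so the update becomes

$$ \Delta \mathbf{w} = \alpha \left( q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}) \right) \mathbf{x}(S, A) $$

that is, step size times prediction error times feature value. A minimal sketch of this update in code, assuming x is the feature vector x(S, A) and target stands in for q_π(S, A) (in practice a sampled return or TD target); function names are illustrative:

    import numpy as np

    def linear_q(x, w):
        # Linear action-value estimate: q_hat(S, A, w) = x(S, A) . w
        return x @ w

    def sgd_update(w, x, target, alpha=0.01):
        # One stochastic gradient step: alpha * (target - q_hat) * x,
        # since the gradient of a linear q_hat with respect to w is x.
        error = target - linear_q(x, w)
        return w + alpha * error * x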

