
Value Function Approximation — Control Methods


Generalized Policy Iteration should be no new concept to us at this point. The change here is that we use approximate policy evaluation. We start with some parameter vector w, which defines a value function. We then act greedily with respect to that value function, with a bit of epsilon exploration, giving us a new policy. To evaluate this new policy we update the parameters of our value function, and we keep repeating the process until we (hopefully) converge to an optimal value function. The familiar GPI diagram illustrates the process.
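A minimal sketch of this loop in code, assuming a hypothetical environment interface (env.reset, env.step, env.actions) and an update rule update_w of the kind derived below; the names here are illustrative, not from the original post:

    import numpy as np

    def epsilon_greedy(q_hat, state, actions, w, eps=0.1):
        # Explore with probability eps, otherwise act greedily w.r.t. q_hat.
        if np.random.rand() < eps:
            return np.random.choice(actions)
        return max(actions, key=lambda a: q_hat(state, a, w))

    def approximate_gpi(env, q_hat, update_w, w, episodes=1000):
        # Interleave epsilon-greedy acting with incremental updates to w:
        # approximate evaluation and greedy improvement drive each other.
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                action = epsilon_greedy(q_hat, state, env.actions, w)
                next_state, reward, done = env.step(action)
                w = update_w(w, state, action, reward, next_state, done)
                state = next_state
        return w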

What you’ll notice is that, as before, we don’t need to go all the way up to the top line of that diagram; in other words, we don’t waste time and samples of experience trying to fit our function approximator exactly to the current policy’s value function. After the slight adjustment of our policy, we act immediately, with the freshest data available to us.

The problem with the algorithm above is that it will likely never find the true optimal value function: in reality, we only get closer and closer to the best approximation of it that our function approximator can represent.

Our first step, now, will be to approximate the action-value function rather than the state-value function.

For every state and action, we build a function with parameters w that predicts the return we expect to receive starting from that state and action.
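In standard notation, with q_π the true action-value function of the current policy π:

$$ \hat{q}(S, A, \mathbf{w}) \approx q_\pi(S, A) $$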

We minimize the mean-squared error between the approximate action-value function and the true action-value function.
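The objective is therefore

$$ J(\mathbf{w}) = \mathbb{E}_\pi\!\left[ \left( q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}) \right)^2 \right] $$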

Using the Chain Rule and stochastic gradient descent, we find a local minimum:
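Differentiating the objective gives

$$ -\tfrac{1}{2}\,\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbb{E}_\pi\!\left[ \left( q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}) \right) \nabla_{\mathbf{w}}\, \hat{q}(S, A, \mathbf{w}) \right] $$

and stochastic gradient descent samples this gradient on each step:

$$ \Delta \mathbf{w} = \alpha \left( q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}) \right) \nabla_{\mathbf{w}}\, \hat{q}(S, A, \mathbf{w}) $$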

Let’s consider the simplest case, using linear action-value function approximation. We build a feature vector to represent states and actions:
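$$ \mathbf{x}(S, A) = \big( x_1(S, A),\; \ldots,\; x_n(S, A) \big)^\top $$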

These features describe the entire state-action space. We then represent the action-value function as a linear combination of these features, although we could also use a more sophisticated approximator, such as a neural network.
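$$ \hat{q}(S, A, \mathbf{w}) = \mathbf{x}(S, A)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(S, A)\, w_j $$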

The gradient descent update then collapses to:
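Because the gradient of a linear function is just its feature vector, ∇_w q̂(S, A, w) = x(S, A), so the update becomes

$$ \Delta \mathbf{w} = \alpha \left( q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}) \right) \mathbf{x}(S, A) $$

that is, step size times prediction error times feature value. A minimal sketch of this update in code, assuming x is the feature vector x(S, A) and target stands in for q_π(S, A) (in practice a sampled return or TD target); function names are illustrative:

    import numpy as np

    def linear_q(x, w):
        # Linear action-value estimate: q_hat(S, A, w) = x(S, A) . w
        return x @ w

    def sgd_update(w, x, target, alpha=0.01):
        # One stochastic gradient step: alpha * (target - q_hat) * x,
        # since the gradient of a linear q_hat with respect to w is x.
        error = target - linear_q(x, w)
        return w + alpha * error * x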

