Week 9: Policy evaluation
What you see
The example shows the policy evaluation algorithm on a simple (deterministic) gridworld with a living reward of
Every time you move Pacman, the game executes a single update of the policy-evaluation algorithm (applied to the random policy where each action is taken with probability
The algorithm will converge after about 20 steps and thereby compute both
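The repeated single updates described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the 3x3 layout, the goal position, the reward values, the discount factor, and the names `step` and `policy_evaluation_sweep` are all assumptions made for the example, as is the choice of four actions with equal probability.

```python
import numpy as np

# Hypothetical 3x3 gridworld. Layout, rewards and discount are assumptions
# chosen for illustration; they are not taken from the course example.
ROWS, COLS = 3, 3
GOAL = (0, 2)            # assumed terminal state
LIVING_REWARD = -0.1     # assumed living reward
GOAL_REWARD = 1.0        # assumed reward for entering the goal
GAMMA = 0.9              # assumed discount factor
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Deterministic transition: move if possible, otherwise stay put."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < ROWS and 0 <= c < COLS):
        r, c = state  # bumping into a wall leaves the state unchanged
    next_state = (r, c)
    reward = GOAL_REWARD if next_state == GOAL else LIVING_REWARD
    return next_state, reward


def policy_evaluation_sweep(v):
    """One synchronous update of v under the uniform random policy."""
    v_new = np.zeros_like(v)
    for r in range(ROWS):
        for c in range(COLS):
            if (r, c) == GOAL:
                continue  # the terminal state keeps value 0
            total = 0.0
            for a in ACTIONS:
                (nr, nc), reward = step((r, c), a)
                # Each action is weighted equally (assumed uniform policy).
                total += (reward + GAMMA * v[nr, nc]) / len(ACTIONS)
            v_new[r, c] = total
    return v_new


# Repeat the single update until the values stop changing.
v = np.zeros((ROWS, COLS))
for k in range(1000):
    v_next = policy_evaluation_sweep(v)
    delta = np.max(np.abs(v_next - v))
    v = v_next
    if delta < 1e-8:
        break

print(f"stopped after {k + 1} sweeps")
print(np.round(v, 3))
```

Each call to `policy_evaluation_sweep` plays the role of one move in the demo. The number of sweeps needed depends on the discount factor and the tolerance, so it will not necessarily match the figure quoted above.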
How it works
When computing e.g. the value function
Where the expectation is with respect to the next state. Let’s consider a concrete example. In the starting state
You can verify for yourself that this update is always correct.
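For reference, a policy-evaluation update of this kind is commonly written as below. The notation is an assumption and may differ from the symbols used in the course: `V_k` is the value estimate after `k` updates, `\pi(a \mid s)` the policy, `p(s' \mid s, a)` the transition probabilities, `r(s, a, s')` the reward, and `\gamma` the discount factor.

```latex
\begin{align*}
V_{k+1}(s)
  &= \mathbb{E}_{\pi}\left[ r + \gamma V_k(s') \mid s \right] \\
  &= \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)
     \left[ r(s, a, s') + \gamma V_k(s') \right]
\end{align*}
```

In a deterministic environment, where taking action `a` in state `s` always leads to a single next state `f(s, a)` (a hypothetical name for the transition function), and with a policy that picks each of the `|\mathcal{A}|` actions with equal probability, the inner sum collapses to one term per action:

```latex
\begin{equation*}
V_{k+1}(s) = \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}}
  \left[ r\bigl(s, a, f(s, a)\bigr) + \gamma V_k\bigl(f(s, a)\bigr) \right]
\end{equation*}
```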