Week 9: Policy evaluation

What you see

The example shows the policy-evaluation algorithm on a simple (deterministic) gridworld with a living reward of 0.05 per step and no discounting ($\gamma = 1$). The goal is to reach the upper-right corner.

Every time you move pacman, the game executes a single update of the policy-evaluation algorithm, applied to the random policy in which each action is taken with probability $\pi(a \mid s) = \frac{1}{4}$. You can switch between the value function and the action-value function by pressing m.

The algorithm converges after about 20 steps and thereby computes both $v_\pi(s)$ and $q_\pi(s, a)$ (depending on the view mode). These represent the expected accumulated reward when following the (random) policy $\pi$.
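
For reference, $v_\pi$ and $q_\pi$ are the standard state-value and action-value functions: the expected accumulated (discounted) reward when starting in $s$ (and, for $q_\pi$, first taking action $a$) and thereafter following $\pi$:

$$v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right], \qquad q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$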

How it works

When computing, e.g., the value function $V(s)$, the algorithm performs the following update in each step, for all states:

$$V(s) \leftarrow \mathbb{E}_\pi\left[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s\right]$$
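
To make the sweep concrete, here is a minimal, self-contained Python sketch of this procedure on a small illustrative gridworld; the grid size, helper names and the 20-sweep cutoff are assumptions for illustration, not taken from the actual Pacman environment.

```python
# Minimal sketch of iterative policy evaluation on a small deterministic
# gridworld under the uniform random policy pi(a|s) = 1/4.
# The grid size, helper names and sweep count are illustrative assumptions.
GAMMA = 1.0            # no discounting
LIVING_REWARD = 0.05   # reward received on every step
WIDTH, HEIGHT = 3, 3
GOAL = (WIDTH - 1, HEIGHT - 1)                 # upper-right corner (terminal)
ACTIONS = [(0, 1), (0, -1), (-1, 0), (1, 0)]   # up, down, left, right

def step(state, action):
    """Deterministic transition; moving into a wall leaves the state unchanged."""
    x, y = state[0] + action[0], state[1] + action[1]
    next_state = (x, y) if 0 <= x < WIDTH and 0 <= y < HEIGHT else state
    return next_state, LIVING_REWARD

def policy_evaluation(sweeps=20):
    """Synchronous sweeps of V(s) <- E_pi[R_{t+1} + gamma V(S_{t+1}) | S_t = s]."""
    states = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)]
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V_new = {}
        for s in states:
            if s == GOAL:          # the terminal state keeps the value 0
                V_new[s] = 0.0
                continue
            V_new[s] = sum(0.25 * (r + GAMMA * V[s_next])
                           for s_next, r in (step(s, a) for a in ACTIONS))
        V = V_new
    return V

if __name__ == "__main__":
    V = policy_evaluation()
    print(V[(0, 0)])   # value estimate of the bottom-left starting state
```

Computing $q_\pi(s, a)$ is analogous: fix the first action $a$ and only average over the policy's action choices from the next state onward.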

The expectation is with respect to the action selected by $\pi$ (and hence the next state). Let's consider a concrete example. In the starting state $s_0$ (the bottom-left corner), the random policy will with probability $\frac{1}{2}$ move pacman into a wall (so that it stays in $s_0$), and with probability $\frac{1}{4}$ each move pacman up or right, reaching the two neighbouring states $s'$ and $s''$. Since the living reward is 0.05 we can insert and get:

$$V(s_0) \leftarrow \frac{1}{2}\left(0.05 + V(s_0)\right) + \frac{1}{4}\left(0.05 + V(s')\right) + \frac{1}{4}\left(0.05 + V(s'')\right)$$

You can verify for yourself that this is exactly the expectation in the general update above, and that the same calculation can be carried out in every other state.
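
As a quick numeric sanity check, the single update for $s_0$ can be written out directly; the current estimates used below are made-up placeholders, not the numbers shown in the demo:

```python
# Hypothetical current estimates for V(s0), V(s') and V(s''); placeholders only.
V_s0, V_sp, V_spp = 0.0, 0.10, 0.10
r = 0.05  # living reward

# One application of the update for the start state s0: half the actions bump
# into a wall (stay in s0), the other two reach s' and s''.
V_s0_new = 0.5 * (r + V_s0) + 0.25 * (r + V_sp) + 0.25 * (r + V_spp)
print(V_s0_new)  # approximately 0.1 with these placeholder values
```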