# Week 9: Policy iteration

## What you see
The example shows the policy-iteration algorithm on a simple, deterministic gridworld with a living reward of \(-0.05\) per step and no discounting (\(\gamma = 1\)). The goal is to reach the upper-right corner.
The algorithm first evaluates the current policy for exactly \(20\) sweeps and then performs a single policy-improvement step. In other words, for the first 20 steps it behaves exactly like policy evaluation.
The algorithm begins with a uniformly random policy (each action is taken with probability \(\pi(a \mid s) = \frac{1}{4}\)). You can switch between the value-function and action-value-function views by pressing m.
Note that the algorithm will show the current policy as arrows.
After a few policy-iteration steps the algorithm converges and thereby computes both \(v^*(s)\) and \(q^*(s, a)\) (depending on the view mode). These are the expected returns obtained when following the optimal policy \(\pi^*\).
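For reference, the two quantities are related through the standard Bellman optimality equations (written here for a general transition model \(p(s' \mid s, a)\); in this deterministic gridworld each sum contains a single term):

\[
v^*(s) = \max_a q^*(s, a),
\qquad
q^*(s, a) = \sum_{s'} p(s' \mid s, a)\bigl[r(s, a, s') + \gamma\, v^*(s')\bigr].
\]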
## How it works
As stated, you should consult the policy-evaluation example (Week 9: Policy evaluation) to understand what happens during the first 20 steps. Once the (initially random) policy has been evaluated, the algorithm updates the current policy to be greedy with respect to the value function / action-value function (the step is the same in either case; see the sketch below). The policy is always indicated by the arrows.
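As a concrete (if simplified) illustration, the greedy improvement step could look like the following Python sketch. It assumes a gym-style transition model `P[s][a]` given as a list of `(prob, s_next, reward, done)` tuples and a value-function dictionary `V`; these names are illustrative and not necessarily those used in the course code.

```python
def policy_improvement(P, V, gamma=1.0):
    """Return a deterministic policy that is greedy with respect to V.

    P[s][a] is assumed to be a list of (prob, s_next, reward, done) tuples
    (an illustrative, gym-style representation).
    """
    policy = {}
    for s in P:
        # Compute q_pi(s, a) for every action using the current V ...
        q = {a: sum(p * (r + gamma * V[s_next] * (not done))
                    for p, s_next, r, done in P[s][a])
             for a in P[s]}
        # ... and pick the greedy action.
        policy[s] = max(q, key=q.get)
    return policy
```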
After the improvement step the process starts over with 20 more evaluation sweeps of the (updated) policy, and this repeats until the policy stops changing.
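Putting the pieces together, the full loop might be sketched as follows (same illustrative `P[s][a]` representation as above; for simplicity the sketch starts from an arbitrary deterministic policy rather than the uniformly random one used in the demo, and `sweeps=20` matches the evaluation schedule described earlier).

```python
def policy_evaluation(P, policy, gamma=1.0, sweeps=20):
    """Evaluate a deterministic policy with a fixed number of sweeps."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        # One sweep: back up every state under the current policy.
        V = {s: sum(p * (r + gamma * V[s_next] * (not done))
                    for p, s_next, r, done in P[s][policy[s]])
             for s in P}
    return V

def policy_iteration(P, gamma=1.0, sweeps=20):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    # Arbitrary deterministic initial policy (the demo starts from a uniformly random one).
    policy = {s: next(iter(P[s])) for s in P}
    while True:
        V = policy_evaluation(P, policy, gamma, sweeps)
        new_policy = policy_improvement(P, V, gamma)
        if new_policy == policy:   # no action changed: the policy is optimal
            return new_policy, V
        policy = new_policy
```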