# Week 10: TD-learning
## What you see
The example shows the TD(0) algorithm applied to a deterministic gridworld environment with a living reward of \(-0.05\) per step and no discount (\(\gamma = 1\)). The goal is to reach the upper-right corner.
Every time you move Pacman, the game executes a single update of the TD(0) algorithm.
It will take quite a few steps for the algorithm to converge, but once it has converged it will show the value function \(v_\pi(s)\) for the current policy, which by default is the random policy. This means the algorithm computes the same result as the policy evaluation method seen in Week 9: Policy evaluation.
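To make "the random policy" concrete, it can be represented as a function that ignores the state and samples an action uniformly. This is a minimal sketch; the action names here are placeholders and should match whatever action set the gridworld actually uses:

```python
import random

def random_policy(state, actions=("north", "south", "east", "west")):
    # Uniformly random policy: ignores the state and picks any action.
    return random.choice(actions)
```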
## How it works
When you transition from a state \(s\) to \(s'\) and receive a reward \(r\), the algorithm iteratively updates \(V(s)\) according to the rule:

\[V(s) \leftarrow V(s) + \alpha \left( r + \gamma V(s') - V(s) \right)\]

where \(\alpha = 0.5\) is the learning rate.
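The update can be wrapped in a small evaluation loop. The sketch below is an assumption about the interface, not the course's actual code: it uses a Gymnasium-style environment (`reset()` returning `(state, info)` and `step(action)` returning `(next_state, reward, terminated, truncated, info)`), but the update rule itself is exactly the one above:

```python
from collections import defaultdict

def td0_evaluation(env, policy, episodes=1000, alpha=0.5, gamma=1.0):
    """Estimate v_pi(s) with tabular TD(0).

    Assumes a Gymnasium-style env and a policy(state) -> action function,
    e.g. the random_policy sketched earlier.
    """
    V = defaultdict(float)  # V(s) initialised to 0 for all states
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD(0) update: move V(s) toward the TD target r + gamma * V(s').
            # A terminal state has value 0, so we do not bootstrap from it.
            target = r + (0.0 if terminated else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```

Note that the TD target bootstraps on the current estimate \(V(s')\), so unlike Monte Carlo evaluation the update can be applied after every single step rather than only at the end of an episode.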