Week 10: TD-learning#

What you see#

The example shows the TD(0) algorithm applied to a deterministic gridworld environment with a living reward of \(-0.05\) per step and no discounting (\(\gamma = 1\)). The goal is to reach the upper-right corner.

Every time you move Pacman, the game executes a single update of the TD(0) algorithm.

It will take quite a few steps for the algorithm to converge, but once it has converged it will show the value function \(v_\pi(s)\) for the current policy, which by default is the random policy. This means the algorithm computes the same result as the policy evaluation method seen in Week 9: Policy evaluation.
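To make the connection to Week 9 concrete, one way to check convergence is to compare the TD(0) estimates against the values obtained by policy evaluation. The snippet below is only a sketch: `td0_estimate`, `policy_evaluation`, `env`, and `random_policy` are hypothetical names, not part of the actual demo code.

```python
# Hypothetical comparison of TD(0) estimates against exact policy evaluation.
# Both helpers are assumed to return a dict mapping states to values.
V_td = td0_estimate(env, episodes=5000, alpha=0.5, gamma=1.0)
V_pe = policy_evaluation(env, policy=random_policy, gamma=1.0)

max_diff = max(abs(V_td[s] - V_pe[s]) for s in V_pe)
print(f"Largest deviation from policy evaluation: {max_diff:.3f}")
```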

How it works#

When you transition from a state \(s\) to \(s'\) and receive the reward \(R_{t+1}\), the algorithm iteratively updates \(V(s)\) according to the rule:

\[V(s) \leftarrow V(s) + \alpha (R_{t+1} + \gamma V(s') - V(s) )\]

where \(\alpha = 0.5\) is the learning rate.
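A minimal sketch of this update, and of the episode loop that applies it under the default random policy, might look as follows. The environment interface (`env.states`, `env.actions(s)`, `env.reset()`, `env.step(a)`) is an assumption made for illustration and does not match the actual gridworld code; only \(\alpha = 0.5\) and \(\gamma = 1\) are taken from the text above.

```python
import random

def td0_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    """Single TD(0) update, performed each time we move from s to s_next."""
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])

def td0_estimate(env, episodes=5000, alpha=0.5, gamma=1.0):
    """Estimate v_pi for the random policy by repeatedly sampling episodes.

    Hypothetical environment interface: env.states, env.actions(s),
    env.reset() -> s, env.step(a) -> (s_next, r, done).
    """
    V = {s: 0.0 for s in env.states}            # initial value estimates
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.choice(env.actions(s))   # the default (random) policy
            s_next, r, done = env.step(a)       # e.g. r = -0.05 living reward
            td0_update(V, s, r, s_next, alpha, gamma)
            s = s_next
    return V
```

With the values in this example, a single step from \(s\) to \(s'\) with \(V(s) = 0\), \(V(s') = 0.4\), and reward \(-0.05\) would update \(V(s)\) to \(0 + 0.5 \cdot (-0.05 + 0.4 - 0) = 0.175\).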