
Week 11: N-step Sarsa#

What you see#

The example shows the n-step Sarsa algorithm applied to a gridworld. The gridworld has a single, terminal reward, and the n-step Sarsa method will eventually learn the optimal policy.

How it works#

The n-step method is a bit annoying to implement, and I would refer to the pseudocode in (Sutton and Barto [SB18], Section 7.2) for a complete picture. However, the basic idea is that we define the n-step return:

\[G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n} )\]

What this means is that at a given time \(t\), we need to know what happens \(n\) steps into the future, and knowing this, we can compute the above number. So concretely, when Pacman is walking around and has taken at least \(n\) steps, then in each subsequent step we can compute the corresponding \(n\)-step return (lying \(n\) steps in the past, so to speak). This allows Pacman to update \(Q(S_t, A_t)\), but remember that \(t\) lies \(n\) steps in the past, so \(S_t\), \(A_t\) is the state/action pair Pacman visited \(n\) steps previously.
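To make the definition concrete, here is a minimal sketch of how the \(n\)-step return could be computed from recorded rewards, states, and actions. It assumes a tabular Q indexed as `Q[state][action]` and a list `rewards` where `rewards[k]` holds \(R_{k+1}\); the function name and data layout are illustrative, not taken from the course code.

```python
def n_step_return(rewards, states, actions, Q, t, n, gamma):
    """Compute G_{t:t+n} (illustrative sketch, assumes rewards[k] = R_{k+1})."""
    # Discounted sum of the n rewards R_{t+1}, ..., R_{t+n}
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    # Bootstrap with gamma^n * Q(S_{t+n}, A_{t+n})
    G += gamma ** n * Q[states[t + n]][actions[t + n]]
    return G
```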

The update is then:

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (G_{t:t+n} - Q(S_t, A_t) )\]

This corresponds to regular Sarsa when \(n=1\). There is one small wrinkle (and quite frankly, this is why the algorithm is annoying to implement): after the episode has terminated, we are ‘missing’ \(n\) updates since we were updating the values \(n\) steps in the past. These have to be applied, which is why you see \(n\) Q-values change in the last step. This is concretely implemented in the last if-statement in the pseudocode in (Sutton and Barto [SB18], Section 7.2).
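For completeness, below is a rough sketch of a single episode of n-step Sarsa along the lines of the pseudocode in (Sutton and Barto [SB18], Section 7.2), including the ‘missing’ tail updates after termination. It assumes an old gym-style environment (`reset()` returning a state, `step()` returning a 4-tuple), a `policy(Q, s)` function returning an (epsilon-greedy) action, and a tabular Q indexed as `Q[state][action]`; these names are assumptions rather than the actual course interface.

```python
def n_step_sarsa_episode(env, Q, policy, n=4, gamma=0.95, alpha=0.1):
    """One episode of n-step Sarsa (sketch; interface names are assumptions)."""
    S, A, R = [], [], [0.0]          # dummy R[0] so that R[t+1] is the reward after step t
    s = env.reset()
    S.append(s)
    A.append(policy(Q, s))
    T = float('inf')                 # episode length, unknown until termination
    t = 0
    while True:
        if t < T:
            s_next, r, done, _ = env.step(A[t])
            S.append(s_next)
            R.append(r)
            if done:
                T = t + 1
            else:
                A.append(policy(Q, s_next))
        tau = t - n + 1              # the time whose Q-value is updated now
        if tau >= 0:
            # n-step return G_{tau:tau+n}, truncated at the end of the episode
            G = sum(gamma ** (i - tau - 1) * R[i]
                    for i in range(tau + 1, min(tau + n, T) + 1))
            if tau + n < T:
                G += gamma ** n * Q[S[tau + n]][A[tau + n]]
            Q[S[tau]][A[tau]] += alpha * (G - Q[S[tau]][A[tau]])
        if tau == T - 1:             # the last of the n tail updates has been applied
            break
        t += 1
    return Q
```

Note how, once `done` is reached, the loop keeps running without taking further environment steps: this is what applies the remaining \(n\) updates and is why several Q-values change in the final step of the visualization.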