Week 11: Sarsa
What you see
The example shows the Sarsa algorithm on a gridworld. The living reward is 0, and the agent obtains a reward of +1 or -1 on the two exit squares. The four values in each grid square \(s\) show the four Q-values \(Q(s,a)\), one for each action.
How it works
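When the agent takes action \(a\) in state \(s\), receives an immediate reward \(r\), moves to a new state \(s'\), and there takes action \(a'\), the Q-values are updated according to the rule
\[ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s',a') - Q(s,a) \right] \tag{1} \]
where \(\alpha\) is the learning rate and \(\gamma\) is the discount factor.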
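As a minimal sketch, a single application of the Sarsa update from [SB18] could look as follows in Python. The dictionary-based Q-table, the function name `sarsa_update`, the state and action labels, and the chosen values of the learning rate `alpha` and discount factor `gamma` are illustrative assumptions, not the course implementation.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """Apply one Sarsa update to Q[(s, a)] in place and return the new value.

    Implements Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a)).
    """
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical two-square example: all Q-values start at 0, except that the
# agent has already learned Q(B, east) = 1.0.
Q = defaultdict(float)
Q[("B", "east")] = 1.0

# The agent moves east from A to B with reward 0 and then picks east again.
new_value = sarsa_update(Q, "A", "east", 0.0, "B", "east")
print(new_value)
```

Here the new value of Q(A, east) is 0 + 0.5 · (0 + 0.9 · 1.0 − 0) = 0.45: the value moves a fraction `alpha` of the way towards the target.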
This update rule will eventually learn the action-value function \(q_{\pi}(s,a)\) associated with the policy \(\pi\) used to generate actions. The trick in Sarsa is that, while this learning rule is being applied, actions are selected epsilon-greedily with respect to the current Q-values \(Q(s,a)\) shown in the simulation. As a result, the algorithm both adapts the Q-values to the current policy and improves the current policy according to the Q-values. It is a theoretical result that the combination of these two processes eventually converges to the best epsilon-greedy policy.
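The interplay between epsilon-greedy action selection and the update can be sketched in a few lines of Python. The environment below is a hypothetical one-dimensional corridor (not the gridworld in the simulation), and the names `step`, `epsilon_greedy`, `run_sarsa` as well as all parameter values are assumptions made for illustration.

```python
import random
from collections import defaultdict

ACTIONS = ["left", "right"]


def step(s, a):
    """Hypothetical corridor with squares 0..4; squares 0 and 4 are exits."""
    s_next = s + (1 if a == "right" else -1)
    if s_next == 4:
        return s_next, 1.0, True    # right exit: reward +1
    if s_next == 0:
        return s_next, -1.0, True   # left exit: reward -1
    return s_next, 0.0, False       # living reward 0


def epsilon_greedy(Q, s, epsilon, rng):
    """Random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    # Break ties between equally good actions at random.
    return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])


def run_sarsa(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = 2                                   # start in the middle square
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            s_next, r, done = step(s, a)
            # The next action comes from the same epsilon-greedy policy.
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # Q-values of exit squares are never updated, so they stay 0.
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q


Q = run_sarsa()
for s in (1, 2, 3):
    print(s, {a: round(Q[(s, a)], 2) for a in ACTIONS})
```

Because the same Q-table drives both the action choice and the update, the policy changes as soon as the Q-values change, which is the coupling described above.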
Warning
A small note: Think about the update rule and what it means when we apply it to the first state \(s_0\):
- First we need to select an action \(a_0\) 
- Then we go to a square \(s_1\) 
- Finally, we need the action \(a_1\) taken in that square (see (1))
In other words, we need the action \(a_1\) before we can apply the update to \(Q(s_0,a_0)\), and this is why the updates look a bit sluggish when you play by keyboard.
There is another small issue: the algorithmic code in [SB18] assumes we can compute \(a_1\) from the policy when we update \(Q(s_0, a_0)\). This is fine when actions are determined by a policy we can evaluate at any time we like. However, it won’t work when the actions are decided by keyboard input, since the computer obviously cannot predict which key you will press next.
Therefore, the implementation shown on this page waits to apply each Q-update until the next action has been pressed (hence the delay), whereas the version you are asked to implement in the exercises follows the pseudo-code in [SB18].
However, both methods compute the same Q-values (one just does so one step later, which makes it suitable for keyboard input), so please don’t be confused by this point. As long as you understand the update rule, you should be all set.
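To make the equivalence concrete, here is a hedged Python sketch that replays the same fixed sequence of transitions through both variants and compares the resulting Q-tables. The class `DelayedSarsa`, the helper `sarsa_step`, the trajectory, and the convention that `a_next=None` marks an exit square (whose value is taken to be 0) are all assumptions for illustration; the actual page implementation may be organised differently.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9  # assumed learning rate and discount factor


def sarsa_step(Q, s, a, r, s_next, a_next):
    """One Sarsa update; a_next is None when s_next is an exit square."""
    bootstrap = 0.0 if a_next is None else Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (r + GAMMA * bootstrap - Q[(s, a)])


class DelayedSarsa:
    """Holds back each update until the following action is known."""

    def __init__(self):
        self.Q = defaultdict(float)
        self.pending = None  # (s, a, r, s_next) still waiting for a_next

    def on_transition(self, s, a, r, s_next):
        # The action just taken in s is the a' of the pending transition.
        if self.pending is not None:
            sarsa_step(self.Q, *self.pending, a)
        self.pending = (s, a, r, s_next)

    def end_episode(self):
        # The last transition led to an exit, so there is no next action.
        if self.pending is not None:
            sarsa_step(self.Q, *self.pending, None)
            self.pending = None


# Hypothetical episode, e.g. recorded from keyboard input: (s, a, r, s_next).
trajectory = [
    ("s0", "east", 0.0, "s1"),
    ("s1", "east", 0.0, "s2"),
    ("s2", "north", 1.0, "exit"),
]

Q_book = defaultdict(float)  # textbook-style: a_next is available immediately
agent = DelayedSarsa()       # delayed: a_next arrives one step later

for _ in range(3):           # replay the same episode three times
    for i, (s, a, r, s_next) in enumerate(trajectory):
        a_next = trajectory[i + 1][1] if i + 1 < len(trajectory) else None
        sarsa_step(Q_book, s, a, r, s_next, a_next)
        agent.on_transition(s, a, r, s_next)
    agent.end_episode()

print(dict(Q_book) == dict(agent.Q))
```

Both variants apply the same updates in the same order; the delayed one simply performs each update one key press later.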
