Week 11: Q-learning

What you see

The example shows the Q-learning algorithm on a gridworld. The living reward is 0, and the agent obtains a reward of +1 or -1 on the two exit squares. The four values shown in each grid square \(s\) are the four Q-values \(Q(s,a)\), one for each action.

How it works

When the agent takes action \(a\) in state \(s\), receives an immediate reward \(r\), and moves to a new state \(s'\), the Q-value is updated according to the rule

\[Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]\]
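As a minimal sketch (not the course toolbox implementation), the update can be written in Python with the Q-values stored in a dictionary. Here `alpha` and `gamma` stand for the learning rate \(\alpha\) and discount factor \(\gamma\), and `actions` is an assumed helper returning the actions available in a state:

```python
from collections import defaultdict

# Q-table mapping (state, action) -> estimated value, defaulting to 0.
Q = defaultdict(float)

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.95, done=False):
    """One Q-learning update after observing the transition (s, a, r, s')."""
    # Value of the next state: max over its actions, or 0 if the episode ended.
    max_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions(s_next))
    # Move Q(s, a) a step of size alpha towards the target r + gamma * max_a' Q(s', a').
    Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])
```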

This update rule will eventually learn the optimal action-value function \(q_{*}(s,a)\), provided all state-action pairs are tried infinitely often (and the learning rate \(\alpha\) is decreased appropriately). Concretely, the agent follows an epsilon-greedy policy with respect to the current Q-values \(Q(s,a)\) shown in the simulation: with probability \(\epsilon\) it picks a random action, and otherwise an action with the highest current Q-value. This ensures that the agent frequently takes actions it thinks are good while still exploring, so the Q-values eventually converge to the optimal ones.
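A sketch of epsilon-greedy action selection under the same assumptions (the `Q` dictionary from the snippet above and a list of available actions); ties between equally valued actions are broken at random:

```python
import random

def epsilon_greedy(s, actions, epsilon=0.1):
    """Return a random action with probability epsilon, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    # Greedy choice: pick uniformly among the actions with the highest Q-value.
    best = max(Q[(s, a)] for a in actions)
    return random.choice([a for a in actions if Q[(s, a)] == best])
```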