Week 10: MC Control

What you see

The example shows first-visit Monte Carlo (MC) estimation of the action-value function \(q_{\pi}(s,a)\), applied to the bookgrid environment with a living reward of \(-0.05\) per step and no discounting (\(\gamma = 1\)). The goal is to reach the upper-right corner. In other words, this is not actually MC control but policy evaluation; control is shown in the next example.

The game executes the first-visit Monte Carlo agent and records, for each state-action pair \((s,a)\) (four per state), the number of visits \(N(s,a)\) and the accumulated return \(S(s,a)\). These quantities are used to estimate the action-value function. Note that they are only updated when the environment terminates, since this is the only time the return can be computed.

By default, the example evaluates the random policy.
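As a rough illustration (not the course code), the sketch below shows how a single episode could be generated under the random policy. The function name `generate_episode` and the Gymnasium-style `reset()`/`step()` interface are assumptions; the environment used in the course may expose a different API.

```python
import random

def generate_episode(env):
    """Roll out one episode under the random policy and return (state, action, reward) triples."""
    episode = []
    state, _ = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        action = random.randrange(env.action_space.n)            # uniformly random action
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, action, reward))
        state = next_state
    return episode
```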

How it works

The value function is estimated as

\[Q(s, a) = \frac{S(s, a)}{N(s, a)}\]

See (Sutton and Barto [SB18]) for further details on how these are updated. Note the similarities to the value-estimation algorithm.
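A minimal sketch of the first-visit update, assuming the episode has been recorded as a list of \((s, a, r)\) triples as above. The dictionaries `N` and `S`, the helper `first_visit_update`, and the function `Q` are illustrative names, not the course implementation; \(\gamma = 1\) matches the example.

```python
from collections import defaultdict

N = defaultdict(int)    # N(s, a): number of first visits to (s, a)
S = defaultdict(float)  # S(s, a): accumulated return following first visits to (s, a)

def first_visit_update(episode, gamma=1.0):
    """Update N and S from one terminated episode of (state, action, reward) triples."""
    # Compute the return G_t following every time step, working backwards.
    G = 0.0
    returns = []
    for (s, a, r) in reversed(episode):
        G = r + gamma * G
        returns.append((s, a, G))
    returns.reverse()
    # Only the first visit to each (s, a) within the episode contributes.
    seen = set()
    for (s, a, G) in returns:
        if (s, a) not in seen:
            seen.add((s, a))
            N[(s, a)] += 1
            S[(s, a)] += G

def Q(s, a):
    """Monte Carlo estimate Q(s, a) = S(s, a) / N(s, a)."""
    return S[(s, a)] / N[(s, a)] if N[(s, a)] > 0 else 0.0
```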

Week 10: Every-visit Monte Carlo learning

This example shows the MC control algorithm. It uses the same mechanism as above to estimate the action-value function \(Q(s,a)\); however, instead of executing the random policy it behaves \(\varepsilon\)-greedily with respect to the current estimate. Similar to policy iteration, this leads to better and better policies, and if we let the exploration rate \(\varepsilon \rightarrow 0\) (see today's lecture) the method would estimate the optimal policy and the optimal action-values \(q^*(s,a)\). However, since \(\varepsilon\) is fixed in the simulation, it only finds a near-optimal policy, as you can see if you press p. You can compare this to the optimal result found using Week 9: Value iteration.
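To make the evaluation-to-control step concrete, here is a minimal sketch of \(\varepsilon\)-greedy action selection with respect to the current estimate; it reuses the hypothetical `Q` function from the sketch above, and `actions` and `epsilon` are illustrative names.

```python
import random

def epsilon_greedy(state, actions, Q, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    # Break ties randomly among the greedy actions.
    best = max(Q(state, a) for a in actions)
    greedy = [a for a in actions if Q(state, a) == best]
    return random.choice(greedy)
```

Acting with this policy while continuing to update \(N\) and \(S\) after each terminated episode is what turns the evaluation scheme above into a control method; keeping \(\varepsilon\) fixed is why the simulation only reaches a near-optimal policy.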