Week 10: MC value estimation#
What you see#
The example shows first-visit Monte Carlo (MC) applied to the bookgrid environment with a living reward of \(-0.05\) per step and no discount (\(\gamma = 1\)). The goal is to reach the upper-right corner.
The game executes the first-visit Monte Carlo agent and shows the values of \(N\) (number of visits) and \(S\) (accumulated return) for each state, which are used to estimate the value function. Note that these values are only updated when the episode terminates, since this is the only time the return can be computed.
The MC algorithm estimates the value function \(v_{\pi}\) of a given policy, in this case the random policy. This takes time, but the estimate will eventually agree with the result of Week 9: Policy evaluation.
How it works#
The value function is estimated as the average return observed from each state,

\[
V(s) = \frac{S(s)}{N(s)},
\]

where \(N(s)\) counts the (first) visits to state \(s\) and \(S(s)\) is the sum of the returns obtained after those visits.
See (Sutton and Barto [SB18]) for further details on how these are updated.
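As a concrete illustration, the sketch below implements first-visit MC evaluation for a generic episodic environment. It is a minimal sketch, not the course's actual code: it assumes a Gymnasium-style `reset`/`step` interface and a `policy` function mapping states to actions, and the function and variable names are placeholders.

```python
from collections import defaultdict

def first_visit_mc_evaluation(env, policy, gamma=1.0, num_episodes=1000):
    """Estimate v_pi by first-visit Monte Carlo.

    Assumes `env` follows the Gymnasium reset/step API and `policy`
    maps a state to an action; both are placeholders for this sketch.
    """
    N = defaultdict(int)    # N(s): number of first visits to s
    S = defaultdict(float)  # S(s): sum of returns following first visits to s
    for _ in range(num_episodes):
        # Generate one episode by following the policy.
        episode = []        # list of (state, reward) pairs
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            episode.append((s, r))
            done = terminated or truncated
            s = s_next
        # Index of the first occurrence of each state in the episode.
        first_occurrence = {}
        for t, (s_t, _) in enumerate(episode):
            first_occurrence.setdefault(s_t, t)
        # Walk backwards, accumulating the return G_t = r_t + gamma * G_{t+1}.
        G = 0.0
        for t in reversed(range(len(episode))):
            s_t, r_t = episode[t]
            G = gamma * G + r_t
            if first_occurrence[s_t] == t:   # only the first visit counts
                N[s_t] += 1
                S[s_t] += G
    # V(s) = S(s) / N(s): the average return observed from s.
    return {s: S[s] / N[s] for s in N}
```

Note that, as in the interactive example, \(N(s)\) and \(S(s)\) are only changed once an episode has finished, because the return is only known at that point.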
Week 10: Every-visit Monte Carlo learning#
This example applies the every-visit Monte Carlo algorithm to the same problem as above. Note that this means the same state can be updated multiple times in each episode if it is revisited; try moving back and forth and see how this changes how \(N(s)\) and \(S(s)\) are estimated. Although this produces a biased estimate, if you run it for long enough it will eventually converge to the true value function as computed by first-visit MC or policy evaluation (Week 9: Policy evaluation).
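For comparison, here is a minimal sketch of the every-visit variant under the same assumptions as the first-visit sketch above (Gymnasium-style `env` and a `policy` callable, both placeholders). The only change is that every occurrence of a state in an episode updates \(N(s)\) and \(S(s)\), not just the first one.

```python
from collections import defaultdict

def every_visit_mc_evaluation(env, policy, gamma=1.0, num_episodes=1000):
    """Every-visit Monte Carlo evaluation.

    Mirrors the first-visit sketch above (same assumed env/policy interface),
    but every occurrence of a state in an episode updates N(s) and S(s).
    """
    N = defaultdict(int)
    S = defaultdict(float)
    for _ in range(num_episodes):
        # Generate one episode by following the policy.
        episode = []
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            episode.append((s, r))
            done = terminated or truncated
            s = s_next
        # Walk backwards through the episode, accumulating the return.
        G = 0.0
        for t in reversed(range(len(episode))):
            s_t, r_t = episode[t]
            G = gamma * G + r_t
            N[s_t] += 1      # no first-visit check: every visit contributes
            S[s_t] += G
    return {s: S[s] / N[s] for s in N}
```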