Exercise 9: Monte Carlo methods#
Note
This page contains background information which may be useful in future exercises or projects. You can download this week's exercise instructions from here:
You are encouraged to prepare the homework problems (indicated by a hand in the PDF file) at home and present your solution during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date
Tabular methods (Q-learning, Sarsa, etc.)#
As the name suggests, tabular methods require us to maintain a table of \(Q\)-values or state-values \(V\). The \(Q\)-values in particular can be a bit tricky to keep track of, and I have therefore made a helper class irlc.ex09.rl_agent.TabularAgent which will
hopefully simplify the process.
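Conceptually, a tabular Q-function is just a lookup table from (state, action)-pairs to numbers, defaulting to zero for pairs that have never been updated. The following is a minimal sketch of that idea; it is for illustration only and is not how the irlc class is implemented:
from collections import defaultdict

# Illustration only: a bare-bones tabular Q-function as a dictionary
# mapping (state, action)-pairs to values, defaulting to 0 for unseen pairs.
Q = defaultdict(float)
Q[(0, 0), 1] = 2.0      # set Q(s=(0, 0), a=1) = 2
print(Q[(0, 0), 1])     # prints 2.0
print(Q[(0, 0), 0])     # prints 0.0 (never set)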
Note
The main complication we need to deal with when representing the Q-values is when different states have different action spaces, i.e. when \(\mathcal{A}(s) \neq \mathcal{A}(s')\). Gymnasium's way of
dealing with this situation is to use the info-dictionary, e.g. so that s, info = env.reset() will specify an info['mask'] variable, which is a numpy ndarray so that a given action a is available if info['mask'][a] == 1.
You can read more about this choice at The gymnasium discrete space documentation.
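As a small illustration of this convention, the sketch below extracts the available actions from the mask. It assumes the environment actually provides info['mask']; if it does not, all actions are treated as available:
import numpy as np
from irlc.gridworld.gridworld_environments import BookGridEnvironment

# Sketch: read the action mask (if any) out of the info-dictionary.
env = BookGridEnvironment()
s, info = env.reset()
if 'mask' in info:
    available_actions = np.flatnonzero(info['mask'])   # actions a with info['mask'][a] == 1
else:
    available_actions = np.arange(env.action_space.n)  # no mask supplied: all actions available
print(available_actions)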
The \(Q\)-values behave like a 2d numpy ndarray:
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.3) # Use epsilon-greedy exploration.
>>> state, _ = env.reset()
>>> state
(0, 0)
>>> agent.Q[state, 1] = 2 # Update a Q-value
>>> agent.Q[state, 1] # Get a Q-value
2
>>> agent.Q[state, 0] # Q-values are by default zero
0
To implement masking, the agent.Q-table has two special functions which require the info-dictionary. As long as you stick to these two functions and pass the correct info-dictionary, you will not get into trouble.
To get the optimal action use agent.Q.get_optimal_action(s, info_s):
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()                # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5                  # Update a Q-value; action a=1 is now optimal.
>>> agent.Q.get_optimal_action(state, info)  # Note we pass along the info-dictionary corresponding to this state
1
To get all Q-values corresponding to a state use agent.Q.get_Qs(s, info_s):
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()                  # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5                    # Update a Q-value; action a=1 is now optimal.
>>> actions, Qs = agent.Q.get_Qs(state, info)  # Note we pass along the info-dictionary corresponding to this state
>>> actions                                    # All actions that are available in this state (after masking)
(0, 1, 2, 3)
>>> Qs                                         # All Q-values available in this state (after masking)
(0, 2.5, 0, 0)
You can combine these functions to obtain, for example, the maximal Q-value in a state as agent.Q[s, agent.Q.get_optimal_action(s, info)].
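For concreteness, the following sketch combines the calls from the examples above and checks that the two ways of obtaining the maximal Q-value agree (the value 2.5 simply comes from the assignment made in the example):
from irlc.ex08.rl_agent import TabularAgent
from irlc.gridworld.gridworld_environments import BookGridEnvironment

# Sketch: two equivalent ways of computing the maximal Q-value in a state.
env = BookGridEnvironment()
agent = TabularAgent(env)
state, info = env.reset()
agent.Q[state, 1] = 2.5                                          # make action a=1 optimal in this state
q_max = agent.Q[state, agent.Q.get_optimal_action(state, info)]  # Q-value of an optimal action
actions, Qs = agent.Q.get_Qs(state, info)
assert q_max == max(Qs) == 2.5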
Note
The Q-table will remember the masking information for a given state and warn you if you are trying to access an action that has been previously masked.
We often want to perform \(\varepsilon\)-greedy exploration. To simplify this, the agent has the function agent.pi_eps. Since this function
uses the Q-values, it also requires an info-dictionary:
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.1) # epsilon-greedy exploration
>>> state, info = env.reset() # to get a state and info-dictionary
>>> a = agent.pi_eps(state, info) # Epsilon-greedy action selection
>>> a
0
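Conceptually, \(\varepsilon\)-greedy selection picks a random available action with probability \(\varepsilon\) and an optimal action otherwise. The sketch below shows the idea using the Q-table interface from this page; the actual pi_eps implementation may differ in its details:
import numpy as np

def epsilon_greedy(Q, s, info, epsilon):
    # Sketch of epsilon-greedy selection (not the actual pi_eps source).
    actions, Qs = Q.get_Qs(s, info)        # actions available in s (after masking)
    if np.random.rand() < epsilon:
        return np.random.choice(actions)   # explore: random available action
    return Q.get_optimal_action(s, info)   # exploit: an optimal action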
Warning
In the train(s, a, r, sp, done, info_s, info_sp)-method, remember to use the info-dictionary that corresponds to the state you are looking up:
use self.Q.get_Qs(s, info_s) and self.Q.get_Qs(sp, info_sp); never use self.Q.get_Qs(s, info_sp). A sketch of a train-method that follows this rule is shown below.
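To make the warning concrete, here is a minimal sketch of a Q-learning-style train-method that pairs each state with its own info-dictionary. The attribute names self.gamma (discount factor) and self.alpha (learning rate) are assumptions made for the sake of the example and are not defined on this page:
from irlc.ex08.rl_agent import TabularAgent

class MyQAgent(TabularAgent):
    def train(self, s, a, r, sp, done, info_s, info_sp):
        # Sketch only. self.gamma and self.alpha are assumed attributes.
        # The bootstrap target uses sp, so it must use info_sp (never info_s).
        _, Qs_sp = self.Q.get_Qs(sp, info_sp)
        target = r if done else r + self.gamma * max(Qs_sp)
        # Update the Q-value of the visited (s, a)-pair towards the target.
        self.Q[s, a] = self.Q[s, a] + self.alpha * (target - self.Q[s, a])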
Classes and functions#
None so far.
Solutions to selected exercises#
Problem 10.1-10.2: MC Value estimation
Problem 10.3-10.4: MC control
Problem 10.5: TD Learning