Exercise 9: Monte Carlo methods#
Note
This page contains background information which may be useful in future exercises or projects. You can download this week's exercise instructions from here:
You are encouraged to prepare the homework problems (indicated by a hand in the PDF file) at home and present your solution during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date
Tabular methods (Q-learning, Sarsa, etc.)#
As the name suggests, tabular methods require us to maintain a table of \(Q\)-values or state-values \(V\). The \(Q\)-values in particular can be a bit tricky to keep track of, and I have therefore made a helper class irlc.ex09.rl_agent.TabularAgent which will
hopefully simplify the process.
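Conceptually, a tabular Q-function is just a lookup table from (state, action)-pairs to numbers, defaulting to zero for pairs that have never been updated. The following is a minimal sketch of that idea; it is for illustration only and is not how the irlc class is implemented:
from collections import defaultdict

# Illustration only: a bare-bones tabular Q-function as a dictionary
# mapping (state, action)-pairs to values, defaulting to 0 for unseen pairs.
Q = defaultdict(float)
Q[(0, 0), 1] = 2.0      # set Q(s=(0, 0), a=1) = 2
print(Q[(0, 0), 1])     # prints 2.0
print(Q[(0, 0), 0])     # prints 0.0 (never set)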
Note
The main complication we need to deal with when representing the Q-values is when different states have different action spaces, i.e. when \(\mathcal{A}(s) \neq \mathcal{A}(s')\). Gymnasium's way of
dealing with this situation is to use the info-dictionary, e.g. so that s, info = env.reset() will specify an info['mask'] variable, which is a numpy ndarray so that a given action a is available if info['mask'][a] == 1.
You can read more about this choice at The gymnasium discrete space documentation.
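As a small illustration of this convention, the sketch below extracts the available actions from the mask. It assumes the environment actually provides info['mask']; if it does not, all actions are treated as available:
import numpy as np
from irlc.gridworld.gridworld_environments import BookGridEnvironment

# Sketch: read the action mask (if any) out of the info-dictionary.
env = BookGridEnvironment()
s, info = env.reset()
if 'mask' in info:
    available_actions = np.flatnonzero(info['mask'])   # actions a with info['mask'][a] == 1
else:
    available_actions = np.arange(env.action_space.n)  # no mask supplied: all actions available
print(available_actions)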
The \(Q\)-values behave like a 2d numpy ndarray:
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.3) # Use epsilon-greedy exploration.
>>> state, _ = env.reset()
>>> state
(0, 0)
>>> agent.Q[state, 1] = 2 # Update a Q-value
>>> agent.Q[state, 1] # Get a Q-value
2
>>> agent.Q[state, 0] # Q-values are by default zero
0
To implement masking, the agent.Q-table has two special functions which require the info-dictionary. As long as you stick to these two functions and pass the correct info-dictionary, you will not get into trouble.
To get the optimal action use agent.Q.get_optimal_action(s, info_s):
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()                # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5                  # Update a Q-value; action a=1 is now optimal.
>>> agent.Q.get_optimal_action(state, info)  # Note we pass along the info-dictionary corresponding to this state
1
To get all Q-values corresponding to a state use agent.Q.get_Qs(s, info_s):
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()                  # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5                    # Update a Q-value; action a=1 is now optimal.
>>> actions, Qs = agent.Q.get_Qs(state, info)  # Note we pass along the info-dictionary corresponding to this state
>>> actions                                    # All actions that are available in this state (after masking)
(0, 1, 2, 3)
>>> Qs                                         # All Q-values available in this state (after masking)
(0, 2.5, 0, 0)
You can combine these functions to obtain, for example, the maximal Q-value in a state as agent.Q[s, agent.Q.get_optimal_action(s, info)].
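For concreteness, the following sketch combines the calls from the examples above and checks that the two ways of obtaining the maximal Q-value agree (the value 2.5 simply comes from the assignment made in the example):
from irlc.ex08.rl_agent import TabularAgent
from irlc.gridworld.gridworld_environments import BookGridEnvironment

# Sketch: two equivalent ways of computing the maximal Q-value in a state.
env = BookGridEnvironment()
agent = TabularAgent(env)
state, info = env.reset()
agent.Q[state, 1] = 2.5                                          # make action a=1 optimal in this state
q_max = agent.Q[state, agent.Q.get_optimal_action(state, info)]  # Q-value of an optimal action
actions, Qs = agent.Q.get_Qs(state, info)
assert q_max == max(Qs) == 2.5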
Note
The Q-table will remember the masking information for a given state and warn you if you are trying to access an action that has been previously masked.
We often want to perform \(\varepsilon\)-greedy exploration. To simplify this, the agent has the function agent.pi_eps. Since this function
uses the Q-values, it also requires an info-dictionary:
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.1) # epsilon-greedy exploration
>>> state, info = env.reset() # to get a state and info-dictionary
>>> a = agent.pi_eps(state, info) # Epsilon-greedy action selection
>>> a
0
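Conceptually, \(\varepsilon\)-greedy selection picks a random available action with probability \(\varepsilon\) and an optimal action otherwise. The sketch below shows the idea using the Q-table interface from this page; the actual pi_eps implementation may differ in its details:
import numpy as np

def epsilon_greedy(Q, s, info, epsilon):
    # Sketch of epsilon-greedy selection (not the actual pi_eps source).
    actions, Qs = Q.get_Qs(s, info)        # actions available in s (after masking)
    if np.random.rand() < epsilon:
        return np.random.choice(actions)   # explore: random available action
    return Q.get_optimal_action(s, info)   # exploit: an optimal action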
Warning
In the train(s, a, r, sp, done, info_s, info_sp)-method, remember to use the info-dictionary that corresponds to the state you are looking up:
use self.Q.get_Qs(s, info_s) and self.Q.get_Qs(sp, info_sp); never use self.Q.get_Qs(s, info_sp). A sketch of a train-method that follows this rule is shown below.
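To make the warning concrete, here is a minimal sketch of a Q-learning-style train-method that pairs each state with its own info-dictionary. The attribute names self.gamma (discount factor) and self.alpha (learning rate) are assumptions made for the sake of the example and are not defined on this page:
from irlc.ex08.rl_agent import TabularAgent

class MyQAgent(TabularAgent):
    def train(self, s, a, r, sp, done, info_s, info_sp):
        # Sketch only. self.gamma and self.alpha are assumed attributes.
        # The bootstrap target uses sp, so it must use info_sp (never info_s).
        _, Qs_sp = self.Q.get_Qs(sp, info_sp)
        target = r if done else r + self.gamma * max(Qs_sp)
        # Update the Q-value of the visited (s, a)-pair towards the target.
        self.Q[s, a] = self.Q[s, a] + self.alpha * (target - self.Q[s, a])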
Classes and functions#
None so far.
Solutions to selected exercises#
Problem 10.1-10.2: MC Value estimation
Problem 10.3-10.4: MC control
Problem 10.5: TD Learning