Exercise 10: Monte Carlo methods and TD learning#
Note
The exercise material is divided into general information (found on this page) and the actual exercise instructions. You can download this week's exercise instructions here:
You are encouraged to prepare the homework problems (indicated by a hand symbol in the PDF file) at home and present your solutions during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date
Tabular methods (Q-learning, Sarsa, etc.)#
As the name suggests, tabular methods require us to maintain a table of \(Q\)-values or state-values \(V\). The \(Q\)-values in particular can be a bit tricky to keep track of, and I have therefore made a helper class irlc.ex09.rl_agent.TabularAgent which will hopefully simplify the process.
Note
The main complication we need to deal with when representing the Q-values is when different states have different action spaces, i.e. when \(\mathcal{A}(s) \neq \mathcal{A}(s')\). Gymnasium's way of dealing with this situation is to use the info-dictionary, e.g. so that s, info = env.reset() will specify an info['mask'] variable, which is a numpy ndarray such that a given action a is available if info['mask'][a] == 1.
You can read more about this choice at The gymnasium discrete space documentation.
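As a concrete illustration, here is a minimal sketch (assuming the gridworld environment populates info['mask'] as described above) of how the mask can be turned into a list of available actions:

import numpy as np
from irlc.gridworld.gridworld_environments import BookGridEnvironment

env = BookGridEnvironment()
s, info = env.reset()
mask = info['mask']               # numpy array; mask[a] == 1 means action a is available in s
available = np.flatnonzero(mask)  # indices of the unmasked actions
print(f"Available actions in state {s}: {list(available)}")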
The \(Q\)-values behave like a 2d numpy ndarray:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.3) # Use epsilon-greedy exploration.
>>> state, _ = env.reset()
>>> state
(0, 0)
>>> agent.Q[state, 1] = 2 # Update a Q-value
>>> agent.Q[state, 1] # Get a Q-value
2
>>> agent.Q[state, 0] # Q-values are by default zero
0
To implement masking, the agent.Q-table has two special functions which require the info-dictionary. As long as you stick to these two functions and pass the correct info-dictionary, you will not get into trouble.
To get the optimal action use
agent.Q.get_optimal_action(s, info_s)
>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()                # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5                  # Update a Q-value; action a=1 is now optimal.
>>> agent.Q.get_optimal_action(state, info)  # Note we pass along the info-dictionary corresponding to this state
1
To get all Q-values corresponding to a state use
agent.Q.get_Qs(s, info_s)
>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()                  # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5                    # Update a Q-value; action a=1 is now optimal.
>>> actions, Qs = agent.Q.get_Qs(state, info)  # Note we pass along the info-dictionary corresponding to this state
>>> actions                                    # All actions that are available in this state (after masking)
(0, 1, 2, 3)
>>> Qs                                         # All Q-values available in this state (after masking)
(0, 2.5, 0, 0)
You can combine this functionality to get e.g. the maximal Q-value using agent.Q[s, agent.Q.get_optimal_action(s, info)].
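As a minimal sketch (reusing the BookGridEnvironment setup from the examples above), the maximal Q-value in a state can be computed like this:

from irlc.ex09.rl_agent import TabularAgent
from irlc.gridworld.gridworld_environments import BookGridEnvironment

env = BookGridEnvironment()
agent = TabularAgent(env)
s, info = env.reset()
agent.Q[s, 1] = 2.5                           # make action a=1 the best action in s
a_star = agent.Q.get_optimal_action(s, info)  # greedy action in s
q_max = agent.Q[s, a_star]                    # maximal Q-value in s (here 2.5)
print(a_star, q_max)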
Note
The Q-table will remember the masking information for a given state and warn you if you are trying to access an action that has been previously masked.
We often want to perform \(\varepsilon\)-greedy exploration. To simplify this, the agent has the function agent.pi_eps. Since this function uses the Q-values, it also requires an info-dictionary:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.1) # epsilon-greedy exploration
>>> state, info = env.reset() # to get a state and info-dictionary
>>> a = agent.pi_eps(state, info) # Epsilon-greedy action selection
>>> a
0
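In practice you will typically call pi_eps from your own agent's pi-method. A minimal sketch (the class name is hypothetical; the pi-signature follows the MyAgent examples further down this page):

from irlc.ex09.rl_agent import TabularAgent

class MyEpsilonGreedyAgent(TabularAgent):
    def pi(self, s, k, info=None):
        # Act epsilon-greedily with respect to the current Q-values.
        # Pass along the info-dictionary belonging to the state s.
        return self.pi_eps(s, info)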
Warning
In the train(s, a, r, sp, done, info_s, info_sp)-method, remember to use the info-dictionary corresponding to the state:
use self.Q.get_Qs(s, info_s) and self.Q.get_Qs(sp, info_sp)
never use self.Q.get_Qs(s, info_sp)
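To make the warning concrete, here is a minimal sketch of a Q-learning-style train-method that uses the info-dictionaries consistently. The class name, the learning-rate parameter alpha and the default argument values are illustrative assumptions; the method names and the order of the train-arguments follow the documentation on this page:

from irlc.ex09.rl_agent import TabularAgent

class SketchQAgent(TabularAgent):
    def __init__(self, env, gamma=0.99, epsilon=0.1, alpha=0.5):
        super().__init__(env, gamma=gamma, epsilon=epsilon)
        self.alpha = alpha  # learning rate (illustrative; not stored by TabularAgent)

    def pi(self, s, k, info=None):
        return self.pi_eps(s, info)  # epsilon-greedy behaviour policy

    def train(self, s, a, r, sp, done=False, info_s=None, info_sp=None):
        # Q-learning target: r + gamma * max_a' Q(sp, a'); the future term is dropped when the episode ends.
        # Note: info_sp is paired with sp and info_s with s; never mix them.
        if done:
            target = r
        else:
            a_star = self.Q.get_optimal_action(sp, info_sp)
            target = r + self.gamma * self.Q[sp, a_star]
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])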
Classes and functions#
- class irlc.ex09.rl_agent.TabularAgent(env, gamma=0.99, epsilon=0)[source]#
Bases:
Agent
This helper class will simplify the implementation of the most basic reinforcement learning methods. Specifically, it provides:
A \(Q(s,a)\)-table data structure
An epsilon-greedy exploration method
The code for the class is very simple, and I think it is a good idea to at least skim it.
The Q-data structure can be used as follows:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()                # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5                  # Update a Q-value; action a=1 is now optimal.
>>> agent.Q[state, 1]                        # Check it has indeed been updated.
2.5
>>> agent.Q[state, 0]                        # Q-values are 0 by default.
0
>>> agent.Q.get_optimal_action(state, info)  # Note we pass along the info-dictionary corresponding to this state
1
Note
The get_optimal_action-function requires an info-dictionary. This is required since the info dictionary contains information about which actions are available. To read more about the Q-values, see TabularQ.
- __init__(env, gamma=0.99, epsilon=0)[source]#
Initialize a tabular agent. For convenience, it stores the discount factor \(\gamma\) and exploration parameter \(\varepsilon\) for epsilon-greedy exploration. Access them as e.g. self.gamma.
When you implement an agent and overwrite the __init__-method, you should include a call such as super().__init__(gamma, epsilon); see the sketch after the parameter list below.
- Parameters:
env – The gym environment
gamma – The discount factor \(\gamma\)
epsilon – Exploration parameter \(\varepsilon\) for epsilon-greedy exploration
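A minimal sketch of this pattern (the subclass name and parameter values are hypothetical; the arguments follow the __init__(env, gamma, epsilon) signature documented here):

from irlc.ex09.rl_agent import TabularAgent

class MyTabularAgent(TabularAgent):
    def __init__(self, env, gamma=0.95, epsilon=0.1):
        # Forward the arguments so that self.gamma, self.epsilon and self.Q are set up by the base class.
        super().__init__(env, gamma=gamma, epsilon=epsilon)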
- pi_eps(s, info)[source]#
Performs \(\varepsilon\)-greedy exploration with \(\varepsilon =\) self.epsilon and returns the action. Recall this means that with probability \(\varepsilon\) it returns a random action, and otherwise it returns an action associated with a maximal Q-value (\(\arg\max_a Q(s,a)\)). An example:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()
>>> agent.pi_eps(state, info)  # Note we pass along the info-dictionary corresponding to this state
3
Note
The info-dictionary is used to mask (exclude) actions that are not possible in the state. It is similar to the info dictionary in agent.pi(s, info).
- Parameters:
s – A state \(s_t\)
info – The corresponding
info
-dictionary returned by the gym environment
- Returns:
An action computed using \(\varepsilon\)-greedy action selection based on the Q-values stored in the self.Q class.
- class irlc.ex09.rl_agent.TabularQ(env)[source]#
Bases:
object
This is a helper class for storing Q-values. It is used by the TabularAgent to store Q-values, which can be accessed as self.Q[s,a].
- __init__(env)[source]#
Initialize the table. It requires a gym environment to know how many actions there are for each state.
- Parameters:
env – A gym environment.
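Although you will normally access the table through agent.Q, it can also be instantiated directly. A small sketch (assuming the gridworld environment used elsewhere on this page):

from irlc.ex09.rl_agent import TabularQ
from irlc.gridworld.gridworld_environments import BookGridEnvironment

env = BookGridEnvironment()
Q = TabularQ(env)   # the environment tells the table which actions exist
s, _ = env.reset()
Q[s, 0] = 1.0       # values can be read and written like a 2D array
print(Q[s, 0])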
- get_Qs(state, info_s=None)[source]#
Get a list of all known Q-values for this particular state. That is, in a given state, it will return the two lists:
\[\begin{split}\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_k \end{bmatrix}, \quad \begin{bmatrix} Q(s,a_1) \\ Q(s,a_2) \\ \vdots \\ Q(s,a_k) \end{bmatrix}\end{split}\]
The info_s parameter will ensure actions are correctly masked. An example of how to use this function from a policy:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyAgent(TabularAgent):
...     def pi(self, s, k, info=None):
...         actions, q_values = self.Q.get_Qs(s, info)
...
- Parameters:
state – The state to query
info_s – The info-dictionary returned by the environment for this state. Used for action-masking.
- Returns:
actions – A tuple containing all actions available in this state: (a_1, a_2, ..., a_k)
Qs – A tuple containing all Q-values available in this state: (Q[s,a1], Q[s,a2], ..., Q[s,ak])
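The returned tuples are convenient for computing the quantities that appear in TD-targets. A small sketch (the expected-value computation assumes an \(\varepsilon\)-greedy behaviour policy and that the agent exposes self.epsilon as described above; it is an illustration, not part of the class):

import numpy as np
from irlc.ex09.rl_agent import TabularAgent
from irlc.gridworld.gridworld_environments import BookGridEnvironment

env = BookGridEnvironment()
agent = TabularAgent(env, epsilon=0.1)
s, info = env.reset()
agent.Q[s, 1] = 2.5
actions, Qs = agent.Q.get_Qs(s, info)
q_max = max(Qs)                           # appears in the Q-learning target
# Expected Q-value under the epsilon-greedy policy (appears in e.g. Expected Sarsa):
eps, n = agent.epsilon, len(actions)
probs = np.full(n, eps / n)               # probability eps/n for every action
probs[int(np.argmax(Qs))] += 1 - eps      # extra mass on the greedy action
q_expected = float(np.dot(probs, Qs))
print(q_max, q_expected)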
- get_optimal_action(state, info_s)[source]#
For a given state state, this function returns the optimal action for that state:
\[a^* = \arg\max_a Q(s,a)\]
An example:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyAgent(TabularAgent):
...     def pi(self, s, k, info=None):
...         a_star = self.Q.get_optimal_action(s, info)
- Parameters:
state – The state \(s\) in which to find the optimal action
info_s – The info-dictionary corresponding to this state
- Returns:
The optimal action \(a^*\) according to the Q-table
Solutions to selected exercises#
Problem 10.1-10.2: MC Value estimation
Problem 10.3-10.4: MC control
Problem 10.5: TD Learning