Exercise 10: Monte Carlo methods and TD learning#
Note
The exercise material is divided into general information (found on this page) and the actual exercise instructions. You can download this week's exercise instructions here:
You are encouraged to prepare the homework problems (indicated by a hand symbol in the PDF file) at home and present your solutions during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date
Tabular methods (Q-learning, Sarsa, etc.)#
As the name suggests, tabular methods require us to maintain a table of \(Q\)-values or state-values \(V\). The \(Q\)-values in particular can be a bit tricky to keep track of, and I have therefore made a helper class irlc.ex09.rl_agent.TabularAgent which will hopefully simplify the process.
Note
The main complication we need to deal with when representing the Q-values is when different states have different action spaces, i.e. when \(\mathcal{A}(s) \neq \mathcal{A}(s')\). Gymnasium's way of dealing with this situation is to use the info-dictionary, e.g. so that s, info = env.reset() will specify an info['mask'] variable, which is a numpy ndarray such that a given action a is available if info['mask'][a] == 1.
You can read more about this choice at The gymnasium discrete space documentation.
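As a concrete illustration, here is a minimal sketch (assuming the gridworld environment populates info['mask'] as described above) of how the mask can be turned into a list of available actions:

import numpy as np
from irlc.gridworld.gridworld_environments import BookGridEnvironment

env = BookGridEnvironment()
s, info = env.reset()
mask = info['mask']               # numpy array; mask[a] == 1 means action a is available in s
available = np.flatnonzero(mask)  # indices of the unmasked actions
print(f"Available actions in state {s}: {list(available)}")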
The \(Q\)-values behave like a 2d numpy ndarray:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.3) # Use epsilon-greedy exploration.
>>> state, _ = env.reset()
>>> state
(0, 0)
>>> agent.Q[state, 1] = 2 # Update a Q-value
>>> agent.Q[state, 1] # Get a Q-value
2
>>> agent.Q[state, 0] # Q-values are by default zero
0
To implement masking, the agent.Q-table has two special functions which require the info-dictionary. As long as you stick to these two functions and pass the correct info-dictionary, you will not get into trouble.
To get the optimal action use
agent.Q.get_optimal_action(s, info_s)
>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()                # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5                  # Update a Q-value; action a=1 is now optimal.
>>> agent.Q.get_optimal_action(state, info)  # Note we pass along the info-dictionary corresponding to this state
1
To get all Q-values corresponding to a state use
agent.Q.get_Qs(s, info_s)
>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()                  # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5                    # Update a Q-value; action a=1 is now optimal.
>>> actions, Qs = agent.Q.get_Qs(state, info)  # Note we pass along the info-dictionary corresponding to this state
>>> actions                                    # All actions that are available in this state (after masking)
(0, 1, 2, 3)
>>> Qs                                         # All Q-values available in this state (after masking)
(0, 2.5, 0, 0)
You can combine this functionality to get e.g. the maximal Q-value using agent.Q[s, agent.Q.get_optimal_action(s, info)].
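As a minimal sketch (reusing the BookGridEnvironment setup from the examples above), the maximal Q-value in a state can be computed like this:

from irlc.ex09.rl_agent import TabularAgent
from irlc.gridworld.gridworld_environments import BookGridEnvironment

env = BookGridEnvironment()
agent = TabularAgent(env)
s, info = env.reset()
agent.Q[s, 1] = 2.5                           # make action a=1 the best action in s
a_star = agent.Q.get_optimal_action(s, info)  # greedy action in s
q_max = agent.Q[s, a_star]                    # maximal Q-value in s (here 2.5)
print(a_star, q_max)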
Note
The Q-table will remember the masking information for a given state and warn you if you are trying to access an action that has been previously masked.
We often want to perform \(\varepsilon\)-greedy exploration. To simplify this, the agent has the function agent.pi_eps. Since this function uses the Q-values, it also requires an info-dictionary:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.1) # epsilon-greedy exploration
>>> state, info = env.reset() # to get a state and info-dictionary
>>> a = agent.pi_eps(state, info) # Epsilon-greedy action selection
>>> a
0
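In practice you will typically call pi_eps from your own agent's pi-method. A minimal sketch (the class name is hypothetical; the pi-signature follows the MyAgent examples further down this page):

from irlc.ex09.rl_agent import TabularAgent

class MyEpsilonGreedyAgent(TabularAgent):
    def pi(self, s, k, info=None):
        # Act epsilon-greedily with respect to the current Q-values.
        # Pass along the info-dictionary belonging to the state s.
        return self.pi_eps(s, info)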
Warning
In the train(s, a, r, sp, done, info_s, info_sp)-method, remember to use the info-dictionary corresponding to the state:
use self.Q.get_Qs(s, info_s) and self.Q.get_Qs(sp, info_sp)
never use self.Q.get_Qs(s, info_sp)
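To make the warning concrete, here is a minimal sketch of a Q-learning-style train-method that uses the info-dictionaries consistently. The class name, the learning-rate parameter alpha and the default argument values are illustrative assumptions; the method names and the order of the train-arguments follow the documentation on this page:

from irlc.ex09.rl_agent import TabularAgent

class SketchQAgent(TabularAgent):
    def __init__(self, env, gamma=0.99, epsilon=0.1, alpha=0.5):
        super().__init__(env, gamma=gamma, epsilon=epsilon)
        self.alpha = alpha  # learning rate (illustrative; not stored by TabularAgent)

    def pi(self, s, k, info=None):
        return self.pi_eps(s, info)  # epsilon-greedy behaviour policy

    def train(self, s, a, r, sp, done=False, info_s=None, info_sp=None):
        # Q-learning target: r + gamma * max_a' Q(sp, a'); the future term is dropped when the episode ends.
        # Note: info_sp is paired with sp and info_s with s; never mix them.
        if done:
            target = r
        else:
            a_star = self.Q.get_optimal_action(sp, info_sp)
            target = r + self.gamma * self.Q[sp, a_star]
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])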
Classes and functions#
- class irlc.ex09.rl_agent.TabularAgent(env, gamma=0.99, epsilon=0)[source]#
Bases:
Agent
This helper class will simplify the implementation of the most basic reinforcement learning methods. Specifically, it provides:
A \(Q(s,a)\)-table data structure
An epsilon-greedy exploration method
The code for the class is very simple, and I think it is a good idea to at least skim it.
The Q-data structure can be used as follows:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()                # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5                  # Update a Q-value; action a=1 is now optimal.
>>> agent.Q[state, 1]                        # Check it has indeed been updated.
2.5
>>> agent.Q[state, 0]                        # Q-values are 0 by default.
0
>>> agent.Q.get_optimal_action(state, info)  # Note we pass along the info-dictionary corresponding to this state
1
Note
The get_optimal_action-function requires an info-dictionary. This is required since the info dictionary contains information about which actions are available. To read more about the Q-values, see TabularQ.
- __init__(env, gamma=0.99, epsilon=0)[source]#
Initialize a tabular agent. For convenience, it stores the discount factor \(\gamma\) and exploration parameter \(\varepsilon\) for epsilon-greedy exploration. Access them as e.g. self.gamma.
When you implement an agent and overwrite the __init__-method, you should include a call such as super().__init__(gamma, epsilon); see the sketch after the parameter list below.
- Parameters:
env – The gym environment
gamma – The discount factor \(\gamma\)
epsilon – Exploration parameter \(\varepsilon\) for epsilon-greedy exploration
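A minimal sketch of this pattern (the subclass name and parameter values are hypothetical; the arguments follow the __init__(env, gamma, epsilon) signature documented here):

from irlc.ex09.rl_agent import TabularAgent

class MyTabularAgent(TabularAgent):
    def __init__(self, env, gamma=0.95, epsilon=0.1):
        # Forward the arguments so that self.gamma, self.epsilon and self.Q are set up by the base class.
        super().__init__(env, gamma=gamma, epsilon=epsilon)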
- pi_eps(s, info)[source]#
Performs \(\varepsilon\)-greedy exploration with \(\varepsilon =\) self.epsilon and returns the action. Recall this means that with probability \(\varepsilon\) it returns a random action, and otherwise it returns an action associated with a maximal Q-value (\(\arg\max_a Q(s,a)\)). An example:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()
>>> agent.pi_eps(state, info)  # Note we pass along the info-dictionary corresponding to this state
3
Note
The info-dictionary is used to mask (exclude) actions that are not possible in the state. It is similar to the info dictionary in agent.pi(s, info).
- Parameters:
s – A state \(s_t\)
info – The corresponding
info
-dictionary returned by the gym environment
- Returns:
An action computed using \(\varepsilon\)-greedy action selection based on the Q-values stored in the self.Q class.
- class irlc.ex09.rl_agent.TabularQ(env)[source]#
Bases:
object
This is a helper class for storing Q-values. It is used by the TabularAgent to store Q-values, which can be accessed as self.Q[s,a].
- __init__(env)[source]#
Initialize the table. It requires a gym environment to know how many actions there are for each state.
- Parameters:
env – A gym environment.
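Although you will normally access the table through agent.Q, it can also be instantiated directly. A small sketch (assuming the gridworld environment used elsewhere on this page):

from irlc.ex09.rl_agent import TabularQ
from irlc.gridworld.gridworld_environments import BookGridEnvironment

env = BookGridEnvironment()
Q = TabularQ(env)   # the environment tells the table which actions exist
s, _ = env.reset()
Q[s, 0] = 1.0       # values can be read and written like a 2D array
print(Q[s, 0])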
- get_Qs(state, info_s=None)[source]#
Get a list of all known Q-values for this particular state. That is, in a given state, it will return the two lists:
\[\begin{split}\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_k \end{bmatrix}, \quad \begin{bmatrix} Q(s,a_1) \\ Q(s,a_2) \\ \vdots \\ Q(s,a_k) \end{bmatrix}\end{split}\]
The info_s parameter will ensure actions are correctly masked. An example of how to use this function from a policy:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyAgent(TabularAgent):
...     def pi(self, s, k, info=None):
...         actions, q_values = self.Q.get_Qs(s, info)
...
- Parameters:
state – The state to query
info_s – The info-dictionary returned by the environment for this state. Used for action-masking.
- Returns:
actions – A tuple containing all actions available in this state: (a_1, a_2, ..., a_k)
Qs – A tuple containing all Q-values available in this state: (Q[s,a1], Q[s,a2], ..., Q[s,ak])
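The returned tuples are convenient for computing the quantities that appear in TD-targets. A small sketch (the expected-value computation assumes an \(\varepsilon\)-greedy behaviour policy and that the agent exposes self.epsilon as described above; it is an illustration, not part of the class):

import numpy as np
from irlc.ex09.rl_agent import TabularAgent
from irlc.gridworld.gridworld_environments import BookGridEnvironment

env = BookGridEnvironment()
agent = TabularAgent(env, epsilon=0.1)
s, info = env.reset()
agent.Q[s, 1] = 2.5
actions, Qs = agent.Q.get_Qs(s, info)
q_max = max(Qs)                           # appears in the Q-learning target
# Expected Q-value under the epsilon-greedy policy (appears in e.g. Expected Sarsa):
eps, n = agent.epsilon, len(actions)
probs = np.full(n, eps / n)               # probability eps/n for every action
probs[int(np.argmax(Qs))] += 1 - eps      # extra mass on the greedy action
q_expected = float(np.dot(probs, Qs))
print(q_max, q_expected)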
- get_optimal_action(state, info_s)[source]#
For a given state state, this function returns the optimal action for that state:
\[a^* = \arg\max_a Q(s,a)\]
An example:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyAgent(TabularAgent):
...     def pi(self, s, k, info=None):
...         a_star = self.Q.get_optimal_action(s, info)
- Parameters:
state – The state \(s\) in which to find the optimal action
info_s – The info-dictionary corresponding to this state
- Returns:
The optimal action \(a^*\) according to the Q-table
Solutions to selected exercises#
Problem 10.1-10.2: MC Value estimation
Problem 10.3-10.4: MC control
Problem 10.5: TD Learning