Exercise 11: Model-Free Control with tabular and linear methods#
Note
This page contains background information which may be useful in future exercises or projects. You can download this week's exercise instructions from here:
Slides: [1x] ([6x]). Reading: Chapter 6.4-6.5; 7-7.2; 9-9.3; 10.1, [SB18].
You are encouraged to prepare homework problem 1 (indicated by a hand in the PDF file) at home and present your solution during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date
Linear function approximators#
The idea behind linear function approximation of \(Q\)-values is that:
- We initialize (and eventually learn) a \(d\)-dimensional weight vector \(w \in \mathbb{R}^d\).
- We assume there exists a function to compute a \(d\)-dimensional feature vector \(x(s,a) \in \mathbb{R}^d\).
- The \(Q\)-values are then represented as
\[Q(s,a) = x(s,a)^\top w\]
Learning is therefore entirely about updating \(w\).
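To make this concrete before we turn to the tile-coding class below, here is a minimal numpy sketch of the idea; the feature function my_features is a hypothetical, hand-crafted illustration and not the tile coder used in the exercises:

import numpy as np

def my_features(s, a, d=6):
    # Hypothetical hand-crafted features of (s, a); only for illustration.
    x = np.zeros(d)
    x[0] = 1.0        # bias feature
    x[1] = s[0]       # first state coordinate (e.g. position)
    x[2] = s[1]       # second state coordinate (e.g. velocity)
    x[3 + a] = 1.0    # one-hot action feature (assumes a in {0, 1, 2})
    return x

w = np.zeros(6)                        # the weight vector we would learn
s, a = np.array([-0.5, 0.0]), 1        # an example state/action pair
Q_sa = my_features(s, a) @ w           # Q(s, a) = x(s, a)^T w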
We are going to use a class, LinearQEncoder, to implement the tile-coding procedure for defining \(x(s,a)\), as described in (Sutton and Barto [SB18]).
The following example shows how you initialize the linear \(Q\)-values and compute them in a given state:
>>> import gymnasium as gym
>>> env = gym.make('MountainCar-v0')
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> Q = LinearQEncoder(env, tilings=8) # as in (Sutton and Barto [SB18])
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q(s,a) # Compute a Q-value.
np.float64(0.0)
>>> Q.d # Get the number of dimensions
2048
>>> Q.x(s,a)[:4] # Get the first four coordinates of the x-vector
array([1., 1., 1., 1.])
>>> Q.w[:4] # Get the first four coordinates of the w-vector
array([0., 0., 0., 0.])
For learning, you can simply update \(w\) like any other variable, and there is a convenience method to get the optimal action. The following example illustrates basic usage:
>>> import gymnasium as gym
>>> env = gym.make('MountainCar-v0')
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> Q = LinearQEncoder(env, tilings=8)
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q.w = Q.w + 2 * Q.w # w <-- 3*w
>>> Q.get_optimal_action(s) # Get the optimal action in state s
1
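To give an idea of how this API fits together in a learning loop, here is a rough sketch (not the solution to the problems below) of a single semi-gradient \(Q\)-learning step with \(\varepsilon\)-greedy exploration; it assumes a discrete action space so that the actions can be enumerated via env.action_space.n:

import numpy as np
import gymnasium as gym
from irlc.ex11.feature_encoder import LinearQEncoder

env = gym.make('MountainCar-v0')
Q = LinearQEncoder(env, tilings=8)
alpha, gamma, epsilon = 0.1, 0.99, 0.1      # learning rate, discount, exploration rate

s, _ = env.reset()
# Epsilon-greedy action selection using the convenience method shown above.
a = env.action_space.sample() if np.random.rand() < epsilon else Q.get_optimal_action(s)
sp, r, terminated, truncated, _ = env.step(a)
# Since Q(s,a) = x(s,a)^T w, the gradient of Q(s,a) with respect to w is x(s,a),
# so the semi-gradient update moves w along x(s,a) scaled by the TD error.
target = r if terminated else r + gamma * max(Q(sp, ap) for ap in range(env.action_space.n))
Q.w = Q.w + alpha * (target - Q(s, a)) * Q.x(s, a)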
Note
Depending on how \(x(s,a)\) is defined, the linear encoder can behave very differently. I have therefore included a few different classes in irlc.ex09.feature_encoder which differ only in how \(x(s,a)\) is computed. I have chosen to focus this guide on the linear tile-encoder, which is used in the MountainCar environment and is the main example in (Sutton and Barto [SB18]). The API for the other classes is entirely similar.
Classes and functions#
- class irlc.ex11.feature_encoder.FeatureEncoder(env)[source]#
Bases: object
The idea behind linear function approximation of \(Q\)-values is that:
- We initialize (and eventually learn) a \(d\)-dimensional weight vector \(w \in \mathbb{R}^d\).
- We assume there exists a function to compute a \(d\)-dimensional feature vector \(x(s,a) \in \mathbb{R}^d\).
- The \(Q\)-values are then represented as
\[Q(s,a) = x(s,a)^\top w\]
Learning is therefore entirely about updating \(w\).
The following example shows how you initialize the linear \(Q\)-values and compute them in a given state:
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8)
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q(s,a) # Compute a Q-value.
np.float64(0.0)
>>> Q.d # Get the number of dimensions
2048
>>> Q.x(s,a)[:4] # Get the first four coordinates of the x-vector
array([1., 1., 1., 1.])
>>> Q.w[:4] # Get the first four coordinates of the w-vector
array([0., 0., 0., 0.])
- __init__(env)[source]#
Initialize the feature encoder. It requires an environment in order to know the number of actions and the dimension of the state space.
- Parameters:
env – An openai Gym Env.
- property d#
Get the number of dimensions of \(w\)
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # Same encoding as Sutton & Barto
>>> Q.d
2048
- x(s, a)[source]#
Computes the \(d\)-dimensional feature vector \(x(s,a)\)
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # Same encoding as Sutton & Barto
>>> s, info = env.reset()
>>> x = Q.x(s, env.action_space.sample())
- Parameters:
s – A state \(s\)
a – An action \(a\)
- Returns:
Feature vector \(x(s,a)\)
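Since the encoder represents \(Q(s,a) = x(s,a)^\top w\), computing the dot product by hand should agree with calling the encoder directly; a small sanity check along these lines (assuming the definition above is followed exactly) is:

import numpy as np
import gymnasium as gym
from irlc.ex11.feature_encoder import LinearQEncoder

env = gym.make('MountainCar-v0')
Q = LinearQEncoder(env, tilings=8)
s, info = env.reset()
a = env.action_space.sample()
# Both numbers should agree, since Q(s,a) is defined as the inner product x(s,a)^T w.
assert np.isclose(float(Q.x(s, a) @ Q.w), float(Q(s, a)))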
- get_Qs(state, info_s=None)[source]#
This is a helper function; it is only for internal use.
- Parameters:
state
info_s
- Returns:
- get_optimal_action(state, info=None)[source]#
For a given state state, this function returns the optimal action for that state:
\[a^* = \arg\max_a Q(s,a)\]
An example:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyAgent(TabularAgent):
...     def pi(self, s, k, info=None):
...         a_star = self.Q.get_optimal_action(s, info)
...
- Parameters:
state – State to find the optimal action in \(s\)
info – The info-dictionary corresponding to this state
- Returns:
The optimal action according to the Q-values \(a^*\)
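A complete \(\varepsilon\)-greedy policy built on get_optimal_action could look roughly like the sketch below; the fixed exploration rate and the self.env attribute are assumptions made for this illustration and may be exposed differently by the agent classes in the exercises:

import numpy as np
from irlc.ex09.rl_agent import TabularAgent

class MyEpsilonGreedyAgent(TabularAgent):
    def pi(self, s, k, info=None):
        epsilon = 0.1  # assumed fixed exploration rate, for illustration only
        if np.random.rand() < epsilon:
            # Explore: pick a random action (assumes self.env is available on the agent).
            return self.env.action_space.sample()
        # Exploit: pick the greedy action according to the current Q-values.
        return self.Q.get_optimal_action(s, info)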
- class irlc.ex11.feature_encoder.LinearQEncoder(env, tilings=8, max_size=2048)[source]#
Bases: FeatureEncoder
- __init__(env, tilings=8, max_size=2048)[source]#
Implements the tile-encoder described in (Sutton and Barto [SB18]).
- Parameters:
env – The openai Gym environment we wish to solve.
tilings – Number of tilings (translations). Typically 8.
max_size – Maximum number of dimensions.
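In standard tile coding (Sutton and Barto [SB18]), each tiling contributes one active feature, so with tilings=8 one would expect eight non-zero entries in \(x(s,a)\), and the total number of dimensions is bounded by max_size. A small check along these lines (the exact counts could differ if the implementation hashes two tiles to the same index) is:

import gymnasium as gym
from irlc.ex11.feature_encoder import LinearQEncoder

env = gym.make('MountainCar-v0')
Q = LinearQEncoder(env, tilings=8, max_size=2048)
s, _ = env.reset()
a = env.action_space.sample()
# One active feature per tiling is the standard tile-coding behaviour;
# treat this as an expectation, not a guarantee of this particular implementation.
print(int((Q.x(s, a) != 0).sum()))   # expected: 8
print(Q.d)                           # 2048, as in the example above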
- x(s, a)[source]#
Computes the \(d\)-dimensional feature vector \(x(s,a)\)
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # Same encoding as Sutton & Barto
>>> s, info = env.reset()
>>> x = Q.x(s, env.action_space.sample())
- Parameters:
s – A state \(s\)
a – An action \(a\)
- Returns:
Feature vector \(x(s,a)\)
- property d#
Get the number of dimensions of \(w\)
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # Same encoding as Sutton & Barto
>>> Q.d
2048
Solutions to selected exercises#
Solution to the conceptual exam problem
Part a: Recall the Sarsa update
\[Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\right].\]
Since the immediate reward is zero, the new \(Q\)-value will be determined by the \(Q\)-value associated with the state north of the agent and the action the agent generates in that state. If the exploration rate is non-zero, all actions \(a'\) may occur in that state, giving rise to two different new values that the \(Q\)-value can be updated to.
Part b: It is evident that we need to propagate the \(Q\)-value from the northern square to the \(Q\)-value we wish to update. To do this, we first go \(\texttt{north}\), but then to change the \(Q\)-value we must select \(\texttt{east}\) in that state. Therefore, we backtrack (\(\texttt{west}\), \(\texttt{south}\)). Pacman is now back in the original state, and a single step \(\texttt{east}\) means Pacman updates the green \(Q\)-value, which can be updated to a non-zero value since the next action generated by Sarsa can select the \(Q\)-value associated with the red square. The answer is therefore 5, and the actions are:
Part c: After convergence, Sarsa will have learned the \(Q\)-values associated with the \(\varepsilon\)-soft policy \(\pi\), i.e. \(q_\pi\). It will clearly attempt to move the agent towards the goal square with a \(+1\) reward, and since \(\gamma<1\) it will attempt to do so quickly. The fastest way to do that is either north or south of the central pillar. The southern route, associated with \(Q_e\), takes the agent next to the dangerous square with a \(-1\) reward. There is a chance of at least \(\frac{\varepsilon}{4}\) of randomly falling into that square under the \(\varepsilon\)-soft policy. We can therefore conclude this path is far more dangerous, and this must be reflected in the \(Q\)-values. Hence, \(Q_e < Q_n\). Note that this will not be true for \(Q\)-learning, where the two paths are equivalent. This argument is similar to the cliff-walking example we saw in the exercises and in (Sutton and Barto [SB18]).
Problem 11.1: Q-learning agent
Problem 11.2: Sarsa-learning agent
Problem 11.3: Semi-gradient Q-agent