Exercise 11: Model-Free Control with tabular and linear methods#
Note
This page contains background information which may be useful in future exercises or projects. You can download this week's exercise instructions from here:
Slides: [1x] ([6x]). Reading: Chapter 6.4-6.5; 7-7.2; 9-9.3; 10.1, [SB18].
You are encouraged to prepare homework problem 1 (indicated by a hand in the PDF file) at home and present your solution during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date
Linear function approximators#
The idea behind linear function approximation of \(Q\)-values is that:
- We initialize (and eventually learn) a \(d\)-dimensional weight vector \(w \in \mathbb{R}^d\).
- We assume there exists a function to compute a \(d\)-dimensional feature vector \(x(s,a) \in \mathbb{R}^d\).
- The \(Q\)-values are then represented as
\[Q(s,a) = x(s,a)^\top w\]
Learning is therefore entirely about updating \(w\).
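To make this concrete before we turn to the tile-coding class below, here is a minimal numpy sketch of the idea; the feature function my_features is a hypothetical, hand-crafted illustration and not the tile coder used in the exercises:

import numpy as np

def my_features(s, a, d=6):
    # Hypothetical hand-crafted features of (s, a); only for illustration.
    x = np.zeros(d)
    x[0] = 1.0        # bias feature
    x[1] = s[0]       # first state coordinate (e.g. position)
    x[2] = s[1]       # second state coordinate (e.g. velocity)
    x[3 + a] = 1.0    # one-hot action feature (assumes a in {0, 1, 2})
    return x

w = np.zeros(6)                        # the weight vector we would learn
s, a = np.array([-0.5, 0.0]), 1        # an example state/action pair
Q_sa = my_features(s, a) @ w           # Q(s, a) = x(s, a)^T w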
We are going to use a class, LinearQEncoder, to implement the tile-coding procedure for defining \(x(s,a)\), as described in (Sutton and Barto [SB18]).
The following example shows how you initialize the linear \(Q\)-values and compute them in a given state:
>>> import gymnasium as gym
>>> env = gym.make('MountainCar-v0')
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> Q = LinearQEncoder(env, tilings=8) # as in (Sutton and Barto [SB18])
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q(s,a) # Compute a Q-value.
np.float64(0.0)
>>> Q.d # Get the number of dimensions
2048
>>> Q.x(s,a)[:4] # Get the first four coordinates of the x-vector
array([1., 1., 1., 1.])
>>> Q.w[:4] # Get the first four coordinates of the w-vector
array([0., 0., 0., 0.])
For learning, you can simply update \(w\) like any other variable, and there is a convenience method to get the optimal action. The following example illustrates basic usage:
>>> import gymnasium as gym
>>> env = gym.make('MountainCar-v0')
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> Q = LinearQEncoder(env, tilings=8)
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q.w = Q.w + 2 * Q.w # w <-- 3*w
>>> Q.get_optimal_action(s) # Get the optimal action in state s
1
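To give an idea of how this API fits together in a learning loop, here is a rough sketch (not the solution to the problems below) of a single semi-gradient \(Q\)-learning step with \(\varepsilon\)-greedy exploration; it assumes a discrete action space so that the actions can be enumerated via env.action_space.n:

import numpy as np
import gymnasium as gym
from irlc.ex11.feature_encoder import LinearQEncoder

env = gym.make('MountainCar-v0')
Q = LinearQEncoder(env, tilings=8)
alpha, gamma, epsilon = 0.1, 0.99, 0.1      # learning rate, discount, exploration rate

s, _ = env.reset()
# Epsilon-greedy action selection using the convenience method shown above.
a = env.action_space.sample() if np.random.rand() < epsilon else Q.get_optimal_action(s)
sp, r, terminated, truncated, _ = env.step(a)
# Since Q(s,a) = x(s,a)^T w, the gradient of Q(s,a) with respect to w is x(s,a),
# so the semi-gradient update moves w along x(s,a) scaled by the TD error.
target = r if terminated else r + gamma * max(Q(sp, ap) for ap in range(env.action_space.n))
Q.w = Q.w + alpha * (target - Q(s, a)) * Q.x(s, a)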
Note
Depending on how \(x(s,a)\) is defined, the linear encoder can behave very differently. I have therefore included a few different classes in irlc.ex09.feature_encoder which differ only in how \(x(s,a)\) is computed. I have chosen to focus this guide on the linear tile-encoder, which is used in the MountainCar environment and is the main example in (Sutton and Barto [SB18]). The API for the other classes is entirely similar.
Classes and functions#
- class irlc.ex11.feature_encoder.FeatureEncoder(env)[source]#
Bases: object
The idea behind linear function approximation of \(Q\)-values is that:
- We initialize (and eventually learn) a \(d\)-dimensional weight vector \(w \in \mathbb{R}^d\).
- We assume there exists a function to compute a \(d\)-dimensional feature vector \(x(s,a) \in \mathbb{R}^d\).
- The \(Q\)-values are then represented as
\[Q(s,a) = x(s,a)^\top w\]
Learning is therefore entirely about updating \(w\).
The following example shows how you initialize the linear \(Q\)-values and compute them in a given state:
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8)
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q(s,a) # Compute a Q-value.
np.float64(0.0)
>>> Q.d # Get the number of dimensions
2048
>>> Q.x(s,a)[:4] # Get the first four coordinates of the x-vector
array([1., 1., 1., 1.])
>>> Q.w[:4] # Get the first four coordinates of the w-vector
array([0., 0., 0., 0.])
- __init__(env)[source]#
Initialize the feature encoder. It requires an environment in order to know the number of actions and the dimension of the state space.
- Parameters:
env – An openai Gym Env.
- property d#
Get the number of dimensions of \(w\)
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # Same encoding as Sutton & Barto
>>> Q.d
2048
- x(s, a)[source]#
Computes the \(d\)-dimensional feature vector \(x(s,a)\)
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # Same encoding as Sutton & Barto
>>> s, info = env.reset()
>>> x = Q.x(s, env.action_space.sample())
- Parameters:
s – A state \(s\)
a – An action \(a\)
- Returns:
Feature vector \(x(s,a)\)
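Since the encoder represents \(Q(s,a) = x(s,a)^\top w\), computing the dot product by hand should agree with calling the encoder directly; a small sanity check along these lines (assuming the definition above is followed exactly) is:

import numpy as np
import gymnasium as gym
from irlc.ex11.feature_encoder import LinearQEncoder

env = gym.make('MountainCar-v0')
Q = LinearQEncoder(env, tilings=8)
s, info = env.reset()
a = env.action_space.sample()
# Both numbers should agree, since Q(s,a) is defined as the inner product x(s,a)^T w.
assert np.isclose(float(Q.x(s, a) @ Q.w), float(Q(s, a)))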
- get_Qs(state, info_s=None)[source]#
This is a helper function; it is only for internal use.
- Parameters:
state
info_s
- Returns:
- get_optimal_action(state, info=None)[source]#
For a given state state, this function returns the optimal action for that state:
\[a^* = \arg\max_a Q(s,a)\]
An example:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyAgent(TabularAgent):
...     def pi(self, s, k, info=None):
...         a_star = self.Q.get_optimal_action(s, info)
...
- Parameters:
state – State to find the optimal action in \(s\)
info – The info-dictionary corresponding to this state
- Returns:
The optimal action according to the Q-values \(a^*\)
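A complete \(\varepsilon\)-greedy policy built on get_optimal_action could look roughly like the sketch below; the fixed exploration rate and the self.env attribute are assumptions made for this illustration and may be exposed differently by the agent classes in the exercises:

import numpy as np
from irlc.ex09.rl_agent import TabularAgent

class MyEpsilonGreedyAgent(TabularAgent):
    def pi(self, s, k, info=None):
        epsilon = 0.1  # assumed fixed exploration rate, for illustration only
        if np.random.rand() < epsilon:
            # Explore: pick a random action (assumes self.env is available on the agent).
            return self.env.action_space.sample()
        # Exploit: pick the greedy action according to the current Q-values.
        return self.Q.get_optimal_action(s, info)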
- class irlc.ex11.feature_encoder.LinearQEncoder(env, tilings=8, max_size=2048)[source]#
Bases: FeatureEncoder
- __init__(env, tilings=8, max_size=2048)[source]#
Implements the tile-encoder described in (Sutton and Barto [SB18]).
- Parameters:
env – The openai Gym environment we wish to solve.
tilings – Number of tilings (translations). Typically 8.
max_size – Maximum number of dimensions.
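In standard tile coding (Sutton and Barto [SB18]), each tiling contributes one active feature, so with tilings=8 one would expect eight non-zero entries in \(x(s,a)\), and the total number of dimensions is bounded by max_size. A small check along these lines (the exact counts could differ if the implementation hashes two tiles to the same index) is:

import gymnasium as gym
from irlc.ex11.feature_encoder import LinearQEncoder

env = gym.make('MountainCar-v0')
Q = LinearQEncoder(env, tilings=8, max_size=2048)
s, _ = env.reset()
a = env.action_space.sample()
# One active feature per tiling is the standard tile-coding behaviour;
# treat this as an expectation, not a guarantee of this particular implementation.
print(int((Q.x(s, a) != 0).sum()))   # expected: 8
print(Q.d)                           # 2048, as in the example above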
- x(s, a)[source]#
Computes the \(d\)-dimensional feature vector \(x(s,a)\)
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # Same encoding as Sutton & Barto
>>> s, info = env.reset()
>>> x = Q.x(s, env.action_space.sample())
- Parameters:
s – A state \(s\)
a – An action \(a\)
- Returns:
Feature vector \(x(s,a)\)
- property d#
Get the number of dimensions of \(w\)
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # Same encoding as Sutton & Barto
>>> Q.d
2048
Solutions to selected exercises#
Solution to the conceptual exam problem
Part a: Recall the Sarsa update
\[Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\right].\]
Since the immediate reward is zero, the new \(Q\)-value will be determined by the \(Q\)-value associated with the state north of the agent and the action the agent generates in that state. If the exploration rate is non-zero, all actions \(a'\) may occur in that state, giving rise to two different new values that the \(Q\)-value can be updated to.
Part b: It is evident that we need to propagate the \(Q\)-value from the northern square to the \(Q\)-value we wish to update. To do this, we first go \(\texttt{north}\), but then to change the \(Q\)-value we must select \(\texttt{east}\) in that state. Therefore, we backtrack (\(\texttt{west}\), \(\texttt{south}\)). Pacman is now back in the original state, and a single step \(\texttt{east}\) means Pacman updates the green \(Q\)-value, which can be updated to a non-zero value since the next action generated by Sarsa can select the \(Q\)-value associated with the red square. The answer is therefore 5, and the actions are:
Part c: After convergence, Sarsa will have learned the \(Q\)-values associated with the \(\varepsilon\)-soft policy \(\pi\), i.e. \(q_\pi\). It will clearly attempt to move the agent towards the goal square with a \(+1\) reward, and since \(\gamma<1\) it will attempt to do so quickly. The fastest way to do that is either north or south of the central pillar. The southern route, associated with \(Q_e\), takes the agent next to the dangerous square with a \(-1\) reward. There is a chance of at least \(\frac{\varepsilon}{4}\) of randomly falling into that square under the \(\varepsilon\)-soft policy. We can therefore conclude this path is far more dangerous, and this must be reflected in the \(Q\)-values. Hence, \(Q_e < Q_n\). Note that this will not be true for \(Q\)-learning, where the two paths are equivalent. This argument is similar to the cliff-walking example we saw in the exercises and in (Sutton and Barto [SB18]).
Problem 11.1: Q-learning agent
Problem 11.2: Sarsa-learning agent
Problem 11.3: Semi-gradient Q-agent