Exercise 11: Model-Free Control with tabular and linear methods#

Note

  • The exercise material is divided into general information (found on this page) and the actual exercise instructions. You can download this week's exercise instructions from here:

  • You are encouraged to prepare homework problem 1 (indicated by a hand symbol in the PDF file) at home and present your solution during the exercise session.

  • To get the newest version of the course material, please see Making sure your files are up to date

Linear function approximators#

The idea behind linear function approximation of \(Q\)-values is that

  • We initialize (and eventually learn) a \(d\)-dimensional weight vector \(w \in \mathbb{R}^d\)

  • We assume there exists a function to compute a \(d\)-dimensional feature vector \(x(s,a) \in \mathbb{R}^d\)

  • The \(Q\)-values are then represented as

    \[Q(s,a) = x(s,a)^\top w\]

Learning is therefore entirely about updating \(w\). We are going to use a class, LinearQEncoder, to implement the tile-coding procedure for defining \(x(s,a)\) as described in (Sutton and Barto [SB18]).
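Because \(Q(s,a)\) is linear in \(w\), its gradient with respect to the weights is simply \(\nabla_w Q(s,a) = x(s,a)\), so semi-gradient updates take a particularly simple form. For instance, the semi-gradient Sarsa update from (Sutton and Barto [SB18]) is

\[w \leftarrow w + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right] x(s, a)\]

where \(\alpha\) is the learning rate; a code sketch of this update is given after the examples below.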

The following example shows how you initialize the linear \(Q\)-values and compute them in a given state:

>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8)
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q(s,a) # Compute a Q-value.
np.float64(0.0)
>>> Q.d             # Get the number of dimensions
2048
>>> Q.x(s,a)[:4]    # Get the first four coordinates of the x-vector
array([1., 1., 1., 1.])
>>> Q.w[:4]         # Get the first four coordinates of the w-vector
array([0., 0., 0., 0.])

For learning, you can simply update \(w\) like any other variable, and there is a convenience method to get the optimal action. The following example illustrates basic usage:

>>> import gymnasium as gym
>>> env = gym.make('MountainCar-v0')
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> Q = LinearQEncoder(env, tilings=8)
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q.w = Q.w + 2 * Q.w     # w <-- 3*w
>>> Q.get_optimal_action(s) # Get the optimal action in state s
1
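
To make this concrete, the following sketch performs a single semi-gradient Sarsa-style update using only the methods shown above, continuing the session from the previous example. The step size alpha and discount gamma are illustrative placeholders (not part of the encoder API), and the greedy next action is used in place of a proper exploration policy:

>>> alpha, gamma = 0.1, 1.0                        # illustrative step size and discount
>>> sp, r, terminated, truncated, _ = env.step(a)  # take the previously sampled action
>>> ap = Q.get_optimal_action(sp)                  # greedy next action (no exploration in this sketch)
>>> delta = r + (0 if terminated else gamma * Q(sp, ap)) - Q(s, a)
>>> Q.w = Q.w + alpha * delta * Q.x(s, a)          # w <- w + alpha * delta * x(s,a)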

Note

Depending on how \(x(s,a)\) is defined, the linear encoder can behave very differently. I have therefore included a few different classes in irlc.ex11.feature_encoder which differ only in how \(x(s,a)\) is computed. I have chosen to focus this guide on the linear tile-encoder, which is used in the MountainCar environment and is the main example in (Sutton and Barto [SB18]). The API for the other classes is entirely similar.

Classes and functions#

class irlc.ex11.feature_encoder.FeatureEncoder(env)[source]#

Bases: object

The idea behind linear function approximation of \(Q\)-values is that

  • We initialize (and eventually learn) a \(d\)-dimensional weight vector \(w \in \mathbb{R}^d\)

  • We assume there exists a function to compute a \(d\)-dimensional feature vector \(x(s,a) \in \mathbb{R}^d\)

  • The \(Q\)-values are then represented as

    \[Q(s,a) = x(s,a)^\top w\]

Learning is therefore entirely about updating \(w\).

The following example shows how you initialize the linear \(Q\)-values and compute them in a given state:

>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8)
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q(s,a) # Compute a Q-value.
np.float64(0.0)
>>> Q.d             # Get the number of dimensions
2048
>>> Q.x(s,a)[:4]    # Get the first four coordinates of the x-vector
array([1., 1., 1., 1.])
>>> Q.w[:4]         # Get the first four coordinates of the w-vector
array([0., 0., 0., 0.])
__init__(env)[source]#

Initialize the feature encoder. It requires an environment to know the number of actions and dimension of the state space.

Parameters:

env – An OpenAI Gym (gymnasium) environment.

property d#

Get the number of dimensions of \(w\)

>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # as in (SB18)
>>> Q.d
2048
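
Since \(d\) is the dimension of the weight vector, it coincides with the length of Q.w; continuing the session above:

>>> len(Q.w) == Q.d   # d is the length of the weight vector w
True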
x(s, a)[source]#

Computes the \(d\)-dimensional feature vector \(x(s,a)\)

>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # as in (SB18)
>>> s, info = env.reset()
>>> x = Q.x(s, env.action_space.sample())
Parameters:
  • s – A state \(s\)

  • a – An action \(a\)

Returns:

Feature vector \(x(s,a)\)

get_Qs(state, info_s=None)[source]#

This is a helper function; it is only intended for internal use.

Parameters:
  • state

  • info_s

Returns:

get_optimal_action(state, info=None)[source]#

For a given state, this function returns the optimal action in that state:

\[a^* = \arg\max_a Q(s,a)\]

An example:

>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyAgent(TabularAgent):
...     def pi(self, s, k, info=None):
...         a_star = self.Q.get_optimal_action(s, info)  # greedy action according to the current Q-values
...         return a_star
... 
Parameters:
  • state – The state \(s\) in which to find the optimal action

  • info – The info-dictionary corresponding to this state

Returns:

The optimal action \(a^*\) according to the Q-values
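
As a sanity check, and assuming a discrete action space such as MountainCar's, the greedy action can also be computed by maximizing the Q-values manually. A minimal sketch, reusing env, Q and s from the earlier examples (note that get_optimal_action may break ties between equal Q-values differently):

>>> import numpy as np
>>> qs = [Q(s, a) for a in range(env.action_space.n)]  # Q-value of each discrete action
>>> a_greedy = int(np.argmax(qs))                      # manual argmax over the actions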

class irlc.ex11.feature_encoder.LinearQEncoder(env, tilings=8, max_size=2048)[source]#

Bases: FeatureEncoder

__init__(env, tilings=8, max_size=2048)[source]#

Implements the tile-encoder described in (Sutton and Barto [SB18]).

Parameters:
  • env – The OpenAI Gym (gymnasium) environment we wish to solve.

  • tilings – Number of tilings (translations). Typically 8.

  • max_size – Maximum number of dimensions.
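
For illustration, an encoder with fewer tilings and dimensions could be constructed as follows. The values are arbitrary, and the comment about d reflects an assumption based on the default example above rather than a documented guarantee:

>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q_small = LinearQEncoder(env, tilings=4, max_size=1024)  # illustrative, non-default settings
>>> d = Q_small.d  # assumed to equal max_size, as in the default example where d = 2048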

x(s, a)[source]#

Computes the \(d\)-dimensional feature vector \(x(s,a)\)

>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # as in (SB18)
>>> s, info = env.reset()
>>> x = Q.x(s, env.action_space.sample())
Parameters:
  • s – A state \(s\)

  • a – An action \(a\)

Returns:

Feature vector \(x(s,a)\)
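
Since tile coding activates one tile per tiling, the resulting feature vector is sparse and binary, with at most tilings non-zero entries. Assuming the encoder follows the scheme from (Sutton and Barto [SB18]), the active tiles can be inspected as follows, reusing Q and s from the example above:

>>> import numpy as np
>>> xsa = Q.x(s, env.action_space.sample())  # binary feature vector; roughly `tilings` entries are 1
>>> active = np.flatnonzero(xsa)             # indices of the active tiles (exact values depend on hashing)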

property d#

Get the number of dimensions of \(w\)

>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # as in (SB18)
>>> Q.d
2048

Solutions to selected exercises#