Exercise 2: Dynamical Programming#
Note
The exercises material is divided into general information (found on this page) and the actual exercise instructions. You can download this weeks exercise instructions from here:
You are encouraged to prepare the homework problems 1, 2 (indicated by a hand in the PDF file) at home and present your solution during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date
This week will be concerned with dynamical programming. Dynamical programming is a technique that uses a model of the environment to create an optimal plan. This means there are now 3 classes to content with:
- The environment
Represented by an OpenAI gymnasium
Env
class. This class represents the problem we are trying to solve and has e.g. astep()
-method- The agent
Represented as a
Agent
class. This class represents the agent which which acts in the environment. In our case, the agent should plan optimally using dynamical programming and the policy methodpi()
, should simply execute each step of that plan- The model
This class (
DPModel
) represents the dynamical programming model the agent use to plan. It must implement python functions equivalent to \(f_k(x_k, u_k, w_k)\), \(g_k(x_k, u_k, w_k)\) and so on. The agent should use the model clas to plan, and not the environment.
The dynamical programming model#
Using a class to represent a model can be confusing, but since we will use the same trick in control it is worth considering why and how we choose to represent the model this way.
The model is basically just a collection of functions. For instance, consider these functions corresponding to \(f_k\) and \(g_k\):
>>> def f(x, u, w, k):
... return x+u-w #Compute and return x_{k+1} here
...
>>> def g(x, u, w, k):
... return x**2 + u**2 #Compute and return g_k(x, u, w) here
...
A model is then just a class that collects the function as so:
>>> class MyModel:
... def f(self, x, u, w, k):
... return x+u-w #Compute and return x_{k+1} here
... def g(self, x, u, w, k):
... return x**2 + u**2 #Compute and return g_k(x, u, w) here
...
>>> model = MyModel()
>>> print(model.g(2, 1, 0, 0))
5
Note
signature means the name and order of variables
What the class DPModel
(see below) does is just to specify the signature of the functions f
, g
, etc. shown above.
This helps us to specify the model in the same way every time, and is a very common technique in programming.
In other words, the class DPModel
itself is very simple, and looks something like this:
>>> class SimplifiedDPModel:
... def f(self, x, u, w, k):
... raise NotImplementedError("Implement f_k here")
... def g(self, x, u, w, k):
... raise NotImplementedError("Implement f_k here")
...
And when we actually want to implement a specific model, for instance of the inventory environment, we can do that as follows (see (Herlau [Her24], Subsection 5.1.2)):
# inventory.py
class InventoryDPModel(DPModel):
def __init__(self, N=3):
super().__init__(N=N)
def A(self, x, k): # Action space A_k(x)
return {0, 1, 2}
def S(self, k): # State space S_k
return {0, 1, 2}
def g(self, x, u, w, k): # Cost function g_k(x,u,w)
return u + (x + u - w) ** 2
def f(self, x, u, w, k): # Dynamics f_k(x,u,w)
return max(0, min(2, x + u - w ))
def Pw(self, x, u, k): # Distribution over random disturbances
return {0:.1, 1:.7, 2:0.2}
def gN(self, x):
return 0
And the model can be used as:
/builds/02465material/02465public/02465students_complete/irlc/ex02/inventory.py:2: SyntaxWarning: invalid escape sequence '\('
"""
This means that when you implement the DP algorithm, you only need to refer to functions such as model.S
, model.f
,
and so on without thinking about any specific environment – if you use the functions correctly, your implementation will
work with any specific model that matches the signature in the DPModel
class!.
The following example shows the start of the DP algorithm implementation. It represents the cost functions \(J_k`\) as
a list of dictionaries J
,
so that J[N][x]
corresponds to the terminal cost \(J_N(x)\).
>>> def very_simplified_dp(model):
... N = model.N # Planning horizon
... print("The planning horizon is", N)
... print("The state space is", model.S(0))
... J = [dict() for k in range(N+1)] # Initialize J_k = [J_0, J_1, ..., J_N]
... J[N] = {x: model.gN(x) for x in model.S(N) } # Initialize J_N by looping over terminal states model.S(N)
... return J # Insert the rest of the DP algorithm here.
...
>>> from irlc.ex02.inventory import InventoryDPModel
>>> model = InventoryDPModel(N=4)
>>> J = very_simplified_dp(model) # Call the DP algorithm and get J back.
The planning horizon is 4
The state space is {0, 1, 2}
>>> print(J)
[{}, {}, {}, {}, {0: 0, 1: 0, 2: 0}]
A primer on Pacman#
In the first project, you will work with the Pacman game. I have included a more complete description of the Pacman game at Pacman.
Classes and functions#
This section provides an overview of the most important classes and functions:
- class irlc.ex02.dp_model.DPModel(N)[source]#
The Dynamical Programming model class
The purpose of this class is to translate a dynamical programming problem, defined by the equations,
\[\begin{split}x_{k+1} & = f_k(x_k, u_k, w_k) \\ \text{cost} & = g_k(x_k, u_k, w_k) \\ \text{terminal cost} & = g_N(x_N) \\ \text{Noise disturbances:} \quad w_k & \sim P_W(w_k | x_k, u_k) \\ \text{State/action spaces:} \quad & \mathcal A_k(x_k), \mathcal S_k\end{split}\]into a single python object which we can then use for planning.
Note
This is the first time many of you encounter a class. If so, you might wonder why you can’t just implement the functions as usual, i.e.
def f(x, k, ...):
,def g(x, k, ...):
, as regular python function and just let that be it?The reason is that we want to pass all these function (which taken together represents a planning problem) to planning methods such as the DP-algorithm (see the function
DP_stochastic()
) all at once. It is not very convenient to pass the functions one at a time – instead we collect them into a class and simply call the function as>>> from irlc.ex02.inventory import InventoryDPModel >>> from irlc.ex02.dp import DP_stochastic >>> model = InventoryDPModel() # Intialize the model >>> J, pi = DP_stochastic(model) # All functions are passed to DP_stochastic
To actually use the model, you need to extend it and implement the methods. The basic recipe for this is something like:
class MyDPModel(DPModel): def f(self, x, u, w, k): # Note the `self`-variable. You can use it to access class variables such as`self.N`. return x + u - w # Just an example def S(self, k): return [0, 1, 2] # State space S_k = {0, 1, 2} # Implement the other functions A, g, gN and Pw here.
You should take a look at
InventoryDPModel()
for a concrete example. Once the functions have been implemented, you can call them as:>>> from irlc.ex02.inventory import InventoryDPModel >>> model = InventoryDPModel(N=5) # Plan on a horizon of 5 >>> print("State space S_2", model.S(2)) State space S_2 {0, 1, 2} >>> model.f(x=1, u=2, w=1, k=0) # Just an example. You don't have to use named arguments, although it helps on readability. 2 >>> model.A(1, k=2) # Action space A_1(2), i.e. the actions available at time step k=1 in state 2. {0, 1, 2}
- __init__(N)[source]#
Called when the DP Model is initialized. By default, it simply stores the planning horizon
N
- Parameters:
N – The planning horizon in the DP problem \(N\)
- f(x, u, w, k)[source]#
Implements the transition function \(x_{k+1} = f_k(x, u, w)\) and returns the next state \(x_{k+1}\)
- Parameters:
x – The state \(x_k\)
u – The action taken \(u_k\)
w – The random noise disturbance \(w_k\)
k (
int
) – The current time step \(k\)
- Returns:
The state the environment (deterministically) transitions to, i.e. \(x_{k+1}\)
- gN(x)[source]#
Implements the terminal cost function \(c = g_N(x)\) and returns the terminal cost \(c\).
- Parameters:
x – A state seen at the last time step \(x_N\)
- Return type:
- Returns:
The terminal cost (as a
float
) incurred by the environment, i.e. \(g_N(x)\)
- S(k)[source]#
Computes the state space \(\mathcal S_k\) at time step \(k\). In other words, this function returns a set of all states the system can possibly be in at time step \(k\).
Note
I think the cleanest implementation is one where this function returns a python
set
. However, it won’t matter if the function returns alist
ortuple
instead.- Parameters:
k (
int
) – The current time step \(k\)- Returns:
The state space (as a
list
orset
) available at time stepk
, i.e. \(\mathcal S_k\)
- A(x, k)[source]#
Computes the action space \(\mathcal A_k(x)\) at time step \(k\) in state x.
In other words, this function returns a
set
of all actions the agent can take in time step \(k\).Note
An example where the actions depend on the state is chess (in this case, the state is board position, and the actions are the legal moves)
- Parameters:
k (
int
) – The current time step \(k\)x – The state we want to compute the actions in \(x_k\)
- Returns:
The action space (as a
list
orset
) available at time stepk
, i.e. \(\mathcal A_k(x_k)\)
- Pw(x, u, k)[source]#
Returns the random noise disturbances and their probability. In other words, this function implements the distribution:
\[P_k(w_k | x_k, u_k)\]To implement this distribution, we must keep track of both the possible values of the noise disturbances \(w_k\) as well as the (numerical) value of their probability \(p(w_k| ...)\).
To do this, the function returns a dictionary of the form
P = {w1: p_w1, w2: p_w2, ...}
whereThe keys
w
represents random noise disturbancesthe values
P[w]
represents their probability (i.e. afloat
)
This can hopefully be made more clear with the Inventory environment:
>>> from irlc.ex02.inventory import InventoryDPModel >>> model = InventoryDPModel(N=5) # Plan on a horizon of 5 >>> print("Random noise disturbances in state x=1 using action u=0 is:", model.Pw(x=1, u=0, k=0)) Random noise disturbances in state x=1 using action u=0 is: {0: 0.1, 1: 0.7, 2: 0.2} >>> for w, pw in model.Pw(x=1, u=0, k=0).items(): # Iterate and print: ... print(f"p_k({w}|x, u) =", pw) ... p_k(0|x, u) = 0.1 p_k(1|x, u) = 0.7 p_k(2|x, u) = 0.2
- Parameters:
x – The state \(x_k\)
u – The action taken \(u_k\)
k (
int
) – The current time step \(k\)
- Returns:
A dictionary representing the distribution of random noise disturbances \(P_k(w |x_k, u_k)\) of the form
{..., w_i: pw_i, ...}
such thatpw_i = P_k(w_i | x, u)
- w_rnd(x, u, k)[source]#
This helper function computes generates a random noise disturbance using the function
irlc.ex02.dp_model.DPModel.Pw()
, i.e. it returns a sample:\[w \sim P_k(x_k, u_k)\]This will be useful for simulating the model.
Note
You don’t have to implement or change this function.
- Parameters:
x – The state \(x_k\)
u – The action taken \(u_k\)
k – The current time step \(k\)
- Returns:
A random noise disturbance \(w\) distributed as \(P_k(x_k, u_k)\)
- irlc.ex02.dp.DP_stochastic(model)[source]#
Implement the stochastic DP algorithm. The implementation follows (Her24, Algorithm 1). Once you are done, you should be able to call the function as:
>>> from irlc.ex02.graph_traversal import SmallGraphDP >>> from irlc.ex02.dp import DP_stochastic >>> model = SmallGraphDP(t=5) # Instantiate the small graph with target node 5 >>> J, pi = DP_stochastic(model) >>> print(pi[0][2]) # Action taken in state ``x=2`` at time step ``k=0``. 3
- Parameters:
model (
DPModel
) – An instance ofirlc.ex02.dp_model.DPModel
class. This represents the problem we wish to solve.- Returns:
J
- A list of of cost function so thatJ[k][x]
represents \(J_k(x)\)pi
- A list of dictionaries so thatpi[k][x]
represents \(\mu_k(x)\)
Solutions to selected exercises#
Problem 2.2: Deterministic DP
Problem 2.3: Stochastic DP
Problem 2.4: Dynamical Programming Agent
Problem 2.7: Flower-store