Exercise 9: Policy and value iteration#

Note

  • The exercise material is divided into general information (found on this page) and the actual exercise instructions. You can download this week's exercise instructions from here:

  • You are encouraged to prepare homework problems 1 and 2 (indicated by a hand symbol in the PDF file) at home and present your solutions during the exercise session.

  • To get the newest version of the course material, please see Making sure your files are up to date

Markov decision processes#

The Markov Decision Process (MDP) is a mathematical description of the problem solved by most reinforcement learning methods. It is represented by the class MDP, which stands in the same relationship to methods such as value iteration as the DPModel class does to the dynamic programming method.

Let’s look at the MDP for the frozen-lake gridworld environment. It looks as follows:

# BookGridEnvironment
from irlc.gridworld.gridworld_environments import FrozenLake
from irlc import train, Agent, interactive
env = FrozenLake(render_mode="human")  # Pass render_mode='human' for visualization.
env, agent = interactive(env, Agent(env))  # Make the environment interactive.
env.reset()     # You always need to call reset
env.plot()      # Plot the environment.
env.close()

(Figure: the FrozenLake gridworld environment as rendered by the code above.)

We can instantiate the MDP class directly, or extract it from the gridworld environment as env.mdp. A couple of examples:

This code prints the initial state and the actions available in the initial state:

>>> from irlc.gridworld.gridworld_environments import FrozenLake
>>> mdp = FrozenLake().mdp  # Get the MDP.
>>> mdp.initial_state # States are tuples (x, y)
(0, 3)
>>> mdp.A(mdp.initial_state) # Get all actions in the initial state
(0, 1, 2, 3)

Warning

The actions available in different states are not the same in the gridworld environment. Specifically, the winning (goal) state has only a single action, which takes the environment to the terminal state. Let's take a look:

>>> from irlc.gridworld.gridworld_environments import FrozenLake
>>> mdp = FrozenLake().mdp
>>> mdp.A( (0, 3) )         # Actions in the starting state
(0, 1, 2, 3)
>>> mdp.A( (0, 0) )         # Actions in the bottom-left state
(0,)

Transition probabilities#

The transition probabilities:

\[P(s', r | s, a)\]

are represented as a function Psr() that returns a dictionary of the form {(s1, r1): p1, (s2, r2): p2, ...}; compare with irlc.ex02.dp_model.DPModel.Pw(). An example:

>>> from irlc.gridworld.gridworld_environments import FrozenLake
>>> mdp = FrozenLake().mdp
>>> transitions = mdp.Psr(mdp.initial_state, 0)
>>> for (sp, r), p in transitions.items():
...     print(f"P(s'={sp}, r={r} | s={mdp.initial_state}, a=0) = {p}")
... 
P(s'=(0, 3), r=0 | s=(0, 3), a=0) = 0.6666666666666667
P(s'=(1, 3), r=0 | s=(0, 3), a=0) = 0.3333333333333333
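
Since Psr() returns the full conditional distribution over (next state, reward) pairs, its values should sum to one. As a quick sanity check (a small sketch using only the calls shown above):

import math
from irlc.gridworld.gridworld_environments import FrozenLake

mdp = FrozenLake().mdp
transitions = mdp.Psr(mdp.initial_state, 0)   # P(s', r | s0, a=0) as a dictionary.
total = sum(transitions.values())             # Sum the probabilities over all (s', r) pairs.
assert math.isclose(total, 1.0)               # They form a probability distribution.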

Note

It is easy to miss, but (Sutton and Barto [SB18]) distinguish between

  • The set of all states, \(\mathcal{S}^+\) corresponding to states()

  • The set of all non-terminal states \(\mathcal{S}\) corresponding to nonterminal_states()

The value function of a terminal state is, by convention, zero. The policy, the value function, the action-value function, and the transition probabilities \(P(s',r|s,a)\) are only defined in non-terminal states, and it is a common mistake in reinforcement learning to update or evaluate them in terminal states. Let's take a look:

>>> from irlc.gridworld.gridworld_environments import FrozenLake
>>> mdp = FrozenLake().mdp  # Get the MDP.
>>> mdp.nonterminal_states
[(0, 1), (1, 2), (2, 1), (0, 0), (3, 1), (1, 1), (0, 3), (2, 0), (3, 0), (2, 3), (0, 2), (3, 3), (2, 2), (1, 0), (3, 2), (1, 3)]
>>> mdp.states
[(0, 1), (1, 2), 'Terminal state', (2, 1), (0, 0), (3, 1), (1, 1), (0, 3), (2, 0), (3, 0), (2, 3), (0, 2), (3, 3), (2, 2), (1, 0), (3, 2), (1, 3)]
>>> terminal_states = [s for s in mdp.states if s not in mdp.nonterminal_states]
>>> terminal_states
['Terminal state']
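
The practical consequence is that dynamic-programming-style updates should only loop over mdp.nonterminal_states, while terminal states keep a value of zero. As an illustration, here is a minimal sketch of a single (in-place) value-iteration sweep using only the MDP methods documented on this page; the discount factor and the use of a defaultdict are choices made for this sketch, not part of the exercise code:

from collections import defaultdict
from irlc.gridworld.gridworld_environments import FrozenLake

mdp = FrozenLake().mdp
gamma = 0.9                 # Discount factor (picked arbitrarily for this sketch).
V = defaultdict(float)      # V(s) = 0 initially; terminal states keep this value.

# One in-place sweep of value iteration: only non-terminal states are updated.
for s in mdp.nonterminal_states:
    V[s] = max(
        sum(p * (r + gamma * V[sp]) for (sp, r), p in mdp.Psr(s, a).items())
        for a in mdp.A(s)
    )

print(V[mdp.initial_state])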

Building an MDP#

Building your own MDP is easy. Let's try to implement an MDP corresponding to a small gridworld example (see (Sutton and Barto [SB18]), Example 4.1):

# small_gridworld.py
from irlc.ex09.mdp import MDP

UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3
class SmallGridworldMDP(MDP):
    def __init__(self, rows=4, cols=4):
        self.rows, self.cols = rows, cols # Number of rows, columns.
        super().__init__(initial_state=(rows//2, cols//2) ) # Initial state is in the middle of the board.

    def A(self, state):
        return [UP, DOWN, RIGHT, LEFT] # All four directions available.

    def Psr(self, state, action):
        row, col = state # state is in the format  state = (row, col)
        if action == UP:    row -= 1
        if action == DOWN:  row += 1
        if action == LEFT:  col -= 1
        if action == RIGHT: col += 1

        col = min(self.cols-1, max(col, 0)) # Check boundary conditions.
        row = min(self.rows-1, max(row, 0))
        reward = -1  # Always get a reward of -1
        next_state = (row, col)
        # Note that P(next_state, reward | state, action) = 1 because environment is deterministic
        return {(next_state, reward): 1}

    def is_terminal(self, state):
        row, col = state
        return (row == 0 and col == 0) or (row == self.rows-1 and col == self.cols-1)  

Note that:

  • The states are tuples of row, col coordinates (i,j)

  • Each state has 4 actions, so the action-space function A(state) simply returns the list of all four actions [UP, DOWN, RIGHT, LEFT]

  • The is_terminal function determines if we are in a corner (\((0,0)\) or \((3,3)\))

  • The transition probabilities are deterministic, so the dictionary returned by Psr has a single key (next_state, reward) with value 1 (see the usage sketch below)
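
To check that the class behaves as intended, here is a small usage sketch. It assumes the SmallGridworldMDP definition above is in scope (e.g. saved in small_gridworld.py); the printed values follow directly from the code above:

mdp = SmallGridworldMDP()
print(mdp.initial_state)          # (2, 2): the middle of the 4x4 board.
print(mdp.A(mdp.initial_state))   # [0, 2, 1, 3]: all four actions (UP, DOWN, RIGHT, LEFT).
print(mdp.Psr((2, 2), UP))        # {((1, 2), -1): 1}: a deterministic transition with reward -1.
print(mdp.is_terminal((0, 0)))    # True: the corner state (0, 0) is terminal.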

Classes and functions#

class irlc.ex09.mdp.MDP(initial_state=None, verbose=False)[source]#

Bases: object

This class represents a Markov Decision Process. It defines the following main components:

  • The actions available in a given state \(A(s)\)

  • The transition probabilities \(p(s', r | s, a)\)

  • A terminal check to determine if a state \(s\) is terminal

  • A way to specify the initial state:

    • As a single state the MDP always begins in (most common)

    • As a general distribution \(p(s_0)\).

In addition to this, it allows you to access:
  • The set of all states (including terminal states) as mdp.states

  • The set of all non-terminal states as mdp.nonterminal_states

Note

The states and nonterminal_states properties are computed lazily: if you don't access them, they won't use memory. This allows you to specify MDPs with an infinite number of states without running out of memory.

__init__(initial_state=None, verbose=False)[source]#

Initialize the MDP. In the case where initial_state is set to a value \(s_0\), the initial state distribution will be

\[p(s_0) = 1\]
Parameters:
  • initial_state – An optional initial state.

  • verbose – If True, the class will print out debug information (useful for very large MDPs)

is_terminal(state)[source]#

Determines if a state is terminal (i.e., whether the episode has ended). In (SB18), the terminal state is written as \(s_T\).

>>> from irlc.gridworld.gridworld_environments import FrozenLake
>>> mdp = FrozenLake().mdp
>>> mdp.is_terminal(mdp.initial_state) # False, obviously.
False
Parameters:

state – The state \(s\) to check

Return type:

bool

Returns:

True if the state is terminal and otherwise False.

Psr(state, action)[source]#

Represents the transition probabilities:

\[P(s', r | s, a)\]

When called with state state and action action, the function returns a dictionary of the form {(s1, r1): p1, (s2, r2): p2, ...}, so that p2 is the probability of transitioning to s2 (and obtaining reward r2) given we are in state state and take action action:

\[P(s_2, r_2 | s,a) = p_2\]

An example:

>>> from irlc.gridworld.gridworld_environments import FrozenLake
>>> mdp = FrozenLake().mdp
>>> transitions = mdp.Psr(mdp.initial_state, 0) # P( ... | s0, a=0)
>>> for (sp, r), p in transitions.items():
...     print(f"P(s'={sp}, r={r} | s={mdp.initial_state}, a=0) = {p}")
... 
P(s'=(0, 3), r=0 | s=(0, 3), a=0) = 0.6666666666666667
P(s'=(1, 3), r=0 | s=(0, 3), a=0) = 0.3333333333333333
Parameters:
  • state – The state to compute the transition probabilities in

  • action – The action to compute the transition probabilities in

Return type:

dict

Returns:

A dictionary where the keys are the (state, reward) pairs that can be transitioned to, and the values are their probabilities \(p(s', r | s, a)\).

A(state)[source]#

Returns a list of actions available in the given state:

\[A(s)\]

An example to get the actions in the initial state:

>>> from irlc.gridworld.gridworld_environments import FrozenLake
>>> mdp = FrozenLake().mdp
>>> mdp.A(mdp.initial_state)
(0, 1, 2, 3)
Parameters:

state – State to compute the actions in \(s\)

Return type:

list

Returns:

The list of available actions \(\mathcal A(s) = \{0, 1, ..., n-1\}\)

initial_state_distribution()[source]#

(Optional) specify the initial state distribution. Should return a dictionary of the form: {s0: p0, s1: p1, ..., sn: pn}, in which case \(p(S_0 = s_k) = p_k\).

You will typically not overwrite this function but just set the initial state. In that case the initial state distribution is deterministic:

>>> from irlc.gridworld.gridworld_environments import FrozenLake
>>> mdp = FrozenLake().mdp
>>> mdp.initial_state_distribution()
{(0, 3): 1}
Returns:

An initial state distribution as a dictionary, where the keys are states and the values are their probabilities.
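
If you do want a stochastic initial state, you can override initial_state_distribution() in your own subclass. The MDP below is a hypothetical two-state example made up purely to illustrate the pattern:

from irlc.ex09.mdp import MDP

class TwoStartMDP(MDP):
    """A hypothetical MDP that starts in state 'a' or 'b' with equal probability."""
    def initial_state_distribution(self):
        return {'a': 0.5, 'b': 0.5}    # p(S0='a') = p(S0='b') = 0.5

    def A(self, state):
        return [0]                     # A single dummy action in every state.

    def Psr(self, state, action):
        return {('terminal', 0): 1}    # Deterministically move to a terminal state with reward 0.

    def is_terminal(self, state):
        return state == 'terminal'

mdp = TwoStartMDP()
print(mdp.initial_state_distribution())   # {'a': 0.5, 'b': 0.5}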

property nonterminal_states#

The list of non-terminal states, i.e. \(\mathcal{S}\) in (SB18)

>>> from irlc.gridworld.gridworld_environments import FrozenLake
>>> mdp = FrozenLake().mdp
>>> mdp.nonterminal_states
[(0, 1), (1, 2), (2, 1), (0, 0), (3, 1), (1, 1), (0, 3), (2, 0), (3, 0), (2, 3), (0, 2), (3, 3), (2, 2), (1, 0), (3, 2), (1, 3)]
Returns:

The list of non-terminal states \(\mathcal{S}\)

property states#

The list of all states including terminal ones, i.e. \(\mathcal{S}^+\) in (SB18). The terminal states are those where is_terminal(state) is true.

>>> from irlc.gridworld.gridworld_environments import FrozenLake
>>> mdp = FrozenLake().mdp
>>> mdp.states
[(0, 1), (1, 2), (2, 1), (0, 0), (3, 1), (1, 1), (0, 3), (2, 0), (3, 0), (2, 3), 'Terminal state', (0, 2), (3, 3), (2, 2), (1, 0), (3, 2), (1, 3)]
Returns:

The list of all states \(\mathcal{S}^+\)

Solutions to selected exercises#