Exercise 9: Monte Carlo methods#

Note

  • This page contains background information which may be useful in future exercises or projects. You can download this week's exercise instructions from here:

  • Slides: [1x] ([6x]). Reading: Chapter 5-5.4+5.10, [SB18].

  • You are encouraged to prepare homework problem 1 (indicated by a hand icon in the PDF file) at home and present your solution during the exercise session.

  • To get the newest version of the course material, please see Making sure your files are up to date.

Tabular methods (Q-learning, Sarsa, etc.)#

As the name suggests, tabular methods require us to maintain a table of \(Q\)-values or state-values \(V\). The \(Q\)-values in particular can be a bit tricky to keep track of, and I have therefore made a helper class irlc.ex08.rl_agent.TabularAgent which will hopefully simplify the process.

Note

The main complication we need to deal with when representing the Q-values is that different states can have different action spaces, i.e. when \(\mathcal{A}(s) \neq \mathcal{A}(s')\). Gymnasium's way of dealing with this situation is to use the info-dictionary, so that s, info = env.reset() provides an info['mask'] entry: a numpy ndarray such that a given action a is available if info['mask'][a] == 1. You can read more about this choice at The gymnasium discrete space documentation.
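For illustration, this is how masked sampling works for Gymnasium's built-in Discrete space. This is a minimal sketch using plain Gymnasium, independent of the irlc helper classes:

>>> import numpy as np
>>> from gymnasium.spaces import Discrete
>>> space = Discrete(4)                           # Four actions: 0, 1, 2, 3
>>> mask = np.array([1, 0, 1, 0], dtype=np.int8)  # Only actions 0 and 2 are available
>>> space.sample(mask=mask) in (0, 2)             # Sampling respects the mask
True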

The \(Q\)-values behave like a 2d numpy ndarray:

>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.3) # Use epsilon-greedy exploration.
>>> state, _ = env.reset()
>>> state
(0, 0)
>>> agent.Q[state, 1] = 2 # Update a Q-value
>>> agent.Q[state, 1]     # Get a Q-value
2
>>> agent.Q[state, 0]     # Q-values are by default zero
0

To implement masking, the agent.Q-table has two special functions which require the info-dictionary. As long as you stick to these two functions and pass the correct info-dictionary, you will not run into trouble.

  • To get the optimal action, use agent.Q.get_optimal_action(s, info_s)

    >>> from irlc.ex08.rl_agent import TabularAgent
    >>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
    >>> env = BookGridEnvironment()
    >>> agent = TabularAgent(env)
    >>> state, info = env.reset()               # Get the info-dictionary corresponding to s
    >>> agent.Q[state, 1] = 2.5                 # Update a Q-value; action a=1 is now optimal.
    >>> agent.Q.get_optimal_action(state, info) # Note we pass along the info-dictionary corresponding to this state
    1
    
  • To get all Q-values corresponding to a state, use agent.Q.get_Qs(s, info_s)

    >>> from irlc.ex08.rl_agent import TabularAgent
    >>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
    >>> env = BookGridEnvironment()
    >>> agent = TabularAgent(env)
    >>> state, info = env.reset()                  # Get the info-dictionary corresponding to s
    >>> agent.Q[state, 1] = 2.5                    # Update a Q-value; action a=1 is now optimal.
    >>> actions, Qs = agent.Q.get_Qs(state, info)  # Note we pass along the info-dictionary corresponding to this state
    >>> actions                                    # All actions that are available in this state (after masking)
    (0, 1, 2, 3)
    >>> Qs                                         # All Q-values available in this state (after masking)
    (0, 2.5, 0, 0)
    

You can combine this functionality to obtain, for example, the maximal Q-value using agent.Q[s, agent.Q.get_optimal_action(s, info)], as shown below.
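Continuing the examples above, both of the following expressions recover the maximal Q-value (a small sketch; the expected outputs follow from the behaviour documented above):

>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()
>>> agent.Q[state, 1] = 2.5                                  # Action a=1 is now optimal
>>> agent.Q[state, agent.Q.get_optimal_action(state, info)]  # Maximal Q-value in this state
2.5
>>> max(agent.Q.get_Qs(state, info)[1])                      # Equivalent, via the tuple of Q-values
2.5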

Note

The Q-table will remember the masking information for a given state and warn you if you try to access an action that has previously been masked.

We often want to perform \(\varepsilon\)-greedy exploration. To simplify this, the agent has the function agent.pi_eps. Since this function uses the Q-values, it also requires an info-dictionary:

>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.1)  # epsilon-greedy exploration
>>> state, info = env.reset()               # to get a state and info-dictionary
>>> a = agent.pi_eps(state, info)           # Epsilon-greedy action selection
>>> a
0
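Conceptually, agent.pi_eps implements the standard \(\varepsilon\)-greedy rule: with probability \(\varepsilon\) it picks a uniformly random available action, and otherwise the greedy one. The following is a minimal sketch of that idea, not the actual implementation; the function name and structure are illustrative:

import numpy as np

def pi_eps_sketch(Q, state, info, epsilon):
    actions, _ = Q.get_Qs(state, info)           # Available (masked) actions in this state
    if np.random.rand() < epsilon:
        return np.random.choice(actions)         # Explore: uniformly random available action
    return Q.get_optimal_action(state, info)     # Exploit: greedy action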

Warning

In the train(s, a, r, sp, done, info_s, info_sp)-method, remember to use the info-dictionary corresponding to each state (see the sketch after this warning):

  • use self.Q.get_Qs(s, info_s) and self.Q.get_Qs(sp, info_sp)

  • never use self.Q.get_Qs(s, info_sp)
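For concreteness, here is a minimal sketch of how a Q-learning train-method could use these functions while respecting the rule above. The hyperparameters self.alpha and self.gamma are assumptions for the sketch, not part of the interface described on this page:

def train(self, s, a, r, sp, done, info_s, info_sp):
    # The target uses the maximal Q-value in sp, so the maximum must be taken
    # over the actions available in sp -- hence info_sp, never info_s.
    if done:
        target = r
    else:
        target = r + self.gamma * self.Q[sp, self.Q.get_optimal_action(sp, info_sp)]
    # Move Q(s, a) towards the target with learning rate self.alpha (assumed attribute).
    self.Q[s, a] += self.alpha * (target - self.Q[s, a])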

Classes and functions#

None so far.

Solutions to selected exercises#