Exercise 8: Exploration and Bandits#

Note

  • The exercise material is divided into general information (found on this page) and the actual exercise instructions. You can download this week's exercise instructions from here:

  • You are encouraged to prepare homework problem 1 (indicated by a hand symbol in the PDF file) at home and present your solution during the exercise session.

  • To get the newest version of the course material, please see Making sure your files are up to date

Bandits#

Bandits are an example of a particular, simplified reinforcement learning setup where the emphasis is on exploration. We are therefore going to model bandits as reinforcement learning problems using the familiar Env and Agent classes.

Let’s first explore the bandit problem corresponding to the simple 10-armed bandit in (Sutton and Barto [SB18]). Recall that for this bandit:

  • There are \(k=10\) arms.

  • An action \(a_t \in \{0, 1, \dots, 9\}\) selects an arm, and we then obtain a reward \(r_t\)

  • An episode corresponds to e.g. \(T=1000\) such interactions. The purpose is to maximize the accumulated reward

  • We want to be able to reset the environment and re-do the experiment many times

A basic interaction with the environment is as follows:


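The following is only a minimal sketch of such an interaction (the actual example script in irlc/ex08/bandits.py may differ):

from irlc.ex08.bandits import StationaryBandit

env = StationaryBandit(k=10)                       # The 10-armed testbed.
env.reset()                                        # Sample the arm means env.q_star.
_, r, _, _, info = env.step(3)                     # Pull arm a=3 in the first step.
print(f"The immediate reward in the first step was r_0 = {r}")
total_reward = r
for _ in range(999):                               # An episode of T=1000 interactions in total.
    _, r, _, _, _ = env.step(3)                    # (A real agent would of course vary its actions.)
    total_reward += r
print(f"Accumulated reward over the episode: {total_reward}")
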
The script prints out the immediate reward \(r_0\) obtained in the first step.

Internally, the environment will store the (expected) reward for each arm as env.q_star. These values are reset (sampled from a normal distribution) when the irlc.ex08.bandits.StationaryBandit.reset() method is called, and can be used to compute the regret. An example:

>>> from irlc.ex08.bandits import StationaryBandit
>>> env = StationaryBandit(k=4)     # 4-armed testbed.
>>> env.reset()                     # Reset all parameters.
(0, {})
>>> env.q_star
array([-0.44458644,  2.14137093,  1.12145884,  1.4188155 ])
>>> env.reset()                     # Reset again
(0, {})
>>> env.q_star
array([-0.51644724, -0.86580653, -0.32726181, -1.01541516])
>>> env.optimal_action              # This variable stores the optimal action
np.int64(2)
>>> _, r, _, _, info = env.step(2)  # Take action a=2
>>> print(f"Reward from a=2 was {r=}, the regret was {info['average_regret']=}")
Reward from a=2 was r=np.float64(0.31475494322022307), the regret was info['average_regret']=np.float64(0.0)

This code also computes the per-step regret, info['average_regret'], which for an action \(a\) is defined as

\[\max_{a'} q^*_{a'} - q^*_a\]

Can you see how it can be computed from env.optimal_action and env.q_star? One way is sketched below.
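
For instance, a small sketch (assuming the chosen action is stored in a, as in the example above):

a = 2                                                      # The action we just took.
regret = env.q_star[env.optimal_action] - env.q_star[a]   # max_a' q^*_a' - q^*_a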

Implementing a bandit environment#

To implement a new bandit environment, all you need to do is implement the reward function and compute the regret. To simplify things, I have made a helper class BanditEnvironment in which you only need to implement the reset() function and a function bandit_step(), which accepts an action and returns the reward \(r_t\) and the regret.

# bandits.py
class StationaryBandit(BanditEnvironment):
    def reset(self):
        """ Set q^*_a = N(0,1) + q_star_mean. The offset q_star_mean is 0 in most examples, i.e. this implements the 10-armed testbed environment. """
        self.q_star = np.random.randn(self.k) + self.q_star_mean
        self.optimal_action = np.argmax(self.q_star)  # The optimal action is the one with the largest q^*-value.
    def bandit_step(self, a):
        # Actual logic goes here. Use self.q_star[a] to get the mean reward and np.random.randn() to generate random numbers.
        return reward, regret

Training and visualizing bandit agents#

Agents for bandit problems are implemented as usual by overriding the pi() and train() functions in the Agent class. You can train a bandit agent using train() as usual, but to simplify things I have made a convenience method to train and visualize several bandit agents as follows:

Tip

If you set use_cache=True then old simulations will be stored and plotted until there are max_episodes total simulations. Saving old simulations for later plotting is essential in reinforcement learning.

>>> from irlc.ex08.bandits import StationaryBandit, eval_and_plot
>>> from irlc.ex08.simple_agents import BasicAgent
>>> env = StationaryBandit(k=10)                                                                # 10-armed testbed
>>> agents = [BasicAgent(env, epsilon=.1), BasicAgent(env, epsilon=.01)]                        # Create two agents
>>> k = eval_and_plot(env, agents, num_episodes=10, steps=100, max_episodes=10, use_cache=True)    # Fight!

(Figure: the plot produced by eval_and_plot; see ../_images/ex08-1.png)

Classes and functions#

class irlc.ex08.bandits.BanditEnvironment(k)[source]#

Bases: Env

A helper class for defining bandit problems similar to e.g. the 10-armed testbed discussed in (SB18). We are going to implement the bandit problems as greatly simplified gym environments, as this will allow us to implement the bandit agents as the familiar Agent. I hope this way of doing it will make it clearer that bandits are in fact a sort of reinforcement learning method.

The following code shows an example of how to use a bandit environment:

>>> from irlc.ex08.bandits import StationaryBandit
>>> env = StationaryBandit(k=10)                    # 10-armed testbed.
>>> env.reset()                                     # Reset env.q_star
(0, {})
>>> s, r, _, _, info = env.step(3)
>>> print(f"The reward we got from taking arm a=3 was {r=}")
The reward we got from taking arm a=3 was r=np.float64(2.5708076875410706)
__init__(k)[source]#

Initialize a bandit problem. The observation space is given a dummy value since bandit problems of the sort discussed in (SB18) don’t have observations.

Parameters:

k (int) – The number of arms.

reset()[source]#

Use this function to reset all internal parameters of the environment and get ready for a new episode. In the (SB18) 10-armed bandit testbed, this involves resetting the expected rewards

\[q^*_a\]

The function must return a dummy state and an info dictionary to agree with the gym Env class, but their values are irrelevant.

Returns:

  • s - a state, for instance 0

  • info - the info dictionary, for instance {}

bandit_step(a)[source]#

This helper function simplifies the definition of the environment’s step-function.

Given an action \(a\), this function computes the reward \(r_t\) obtained by taking that action and the average regret. The regret is the expected reward we miss out on by taking the potentially suboptimal action \(a\), and it is defined as:

\[\max_{a'} q^*_{a'} - q^*_a\]

Once implemented, the reward and regret enter the step function as follows:

>>> from irlc.ex08.bandits import StationaryBandit
>>> env = StationaryBandit(k=4)     # 4-armed testbed.
>>> env.reset()                     # Reset all parameters.
(0, {})
>>> _, r, _, _, info = env.step(2)  # Take action a=2
>>> print(f"Reward from a=2 was {r=}, the regret was {info['average_regret']=}")
Reward from a=2 was r=np.float64(0.044011457277292054), the regret was info['average_regret']=np.float64(1.1332973711188685)
Parameters:

a – The current action we take

Returns:

  • r - The reward we thereby incur

  • regret - The average regret incurred by taking this action (0 for an optimal action)

step(action)[source]#

This function is automatically defined and you do not have to edit it. In a bandit environment, the step function is greatly simplified since there are no states to keep track of. It simply returns the reward incurred by the action a and (for convenience) also returns the average regret in the info dictionary; a rough sketch of this behaviour is given after the list of return values below.

Parameters:

action – The current action we take \(a_t\)

Returns:

  • next_state - This is always None

  • reward - The reward obtained by taking the given action. In (SB18) this is defined as \(r_t\)

  • terminated - Always False. Bandit problems don’t terminate.

  • truncated - Always False

  • info - For convenience, this includes the average regret (used by the plotting methods)
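
For reference, the behaviour described above corresponds roughly to the following sketch (not the actual source code of BanditEnvironment.step):

def step(self, action):
    # Delegate to the user-implemented bandit_step to obtain the reward and the average regret.
    reward, average_regret = self.bandit_step(action)
    info = {'average_regret': average_regret}
    # There are no states, and bandit problems never terminate.
    return None, reward, False, False, info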

class irlc.ex08.bandits.StationaryBandit(k, q_star_mean=0)[source]#

Bases: BanditEnvironment

Implement the ‘stationary bandit environment’ which is described in (SB18, Section 2.3) and used as a running example throughout the chapter.

We will implement a version with a constant mean offset (q_star_mean), so that

q* = x + q_star_mean, x ~ Normal(0,1)

q_star_mean can just be considered to be zero at first.
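
For instance (a small sketch; the sampled values are random and will differ between runs):

from irlc.ex08.bandits import StationaryBandit

env = StationaryBandit(k=4, q_star_mean=10)   # Each q^*_a is N(0,1) shifted by 10.
env.reset()                                   # Samples env.q_star.
print(env.q_star)                             # The entries will be centred around 10.
print(env.optimal_action)                     # Index of the largest entry.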

__init__(k, q_star_mean=0)[source]#

Initialize a bandit problem. The observation space is given a dummy value since bandit problems of the sort discussed in (SB18) don’t have observations.

Parameters:

k – The number of arms.

reset()[source]#

Set q^*_a = N(0,1) + q_star_mean. The offset q_star_mean is 0 in most examples, i.e. this implements the 10-armed testbed environment.

bandit_step(a)[source]#

Return the reward/regret for action a for the simple bandit. Use self.q_star (see the reset() function above). To implement it, compute the reward (see the description of the 10-armed testbed for more information; how is it computed from q^*_a?) and also compute the regret.

As a small hint, since we are computing the average regret, it will in fact be the difference between the q^* value for the optimal arm and the q^* value for the current arm. Remember it is 0 if the optimal action is selected. A possible implementation is sketched below.
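
Putting the hints together, one possible implementation could look like the following sketch (a starting point, not necessarily the reference solution; it assumes numpy is imported as np, as in bandits.py):

def bandit_step(self, a):
    # Reward: the mean q^*_a of the chosen arm plus unit-variance Gaussian noise.
    reward = self.q_star[a] + np.random.randn()
    # Average regret: the optimal arm's q^* value minus the chosen arm's q^* value (0 for the optimal arm).
    regret = self.q_star[self.optimal_action] - self.q_star[a]
    return reward, regret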

Solutions to selected exercises#