Introduction to reinforcement learning
This post is based on this colab.
You can also watch a video where I go through the basics here.
You can also watch a video (in Spanish) where I present the material here.
Introduction
Reinforcement learning methods are used for sequential decision making in uncertain environments. The problem is typically framed as an agent (the learner) interacting with an environment which provides the agent with reinforcement (positive or negative), based on the agent's decisions. The agent leverages this reinforcement to update its behaviour, with the aim of getting closer to acting optimally. In interacting with the uncertain environment, the agent is also learning about the dynamics of the underlying system.
The following image depicts the typical flow of a reinforcement learning problem.
- An agent, at some state in the environment, sends an action to the environment
- The environment responds by providing the agent with:
- A reward/cost that depends on the state the agent was in and the action chosen
- A new state that the agent transitions to after performing the selected action (which can sometimes be the same state the agent was already in).
- The agent then uses this information to update its behaviour so that it increases the likelihood of choosing actions that give a positive reward, and reduces the likelihood of choosing actions that result in a penalty.

This notebook provides a brief introduction to reinforcement learning, eventually ending with an exercise to train a deep reinforcement learning agent with the dopamine framework.
The notebook is roughly organized as follows:
- A simple motivating example is presented to illustrate some of the main points and challenges.
- Markov decision processes (MDPs), the mathematical formalism used to express these problems, are introduced.
- Exact tabular methods are presented, which can be used when the environment is fully known. These methods form the foundation for the learning algorithms in uncertain environments.
- Value-based learning algorithms are introduced, and related algorithms are also presented.
- The methods introduced are put to the test on a simple MDP.
- Deep reinforcement learning is introduced, along with some implementation details.
- We use deep reinforcement learning to learn a policy on a larger MDP.
- Resources for further study are provided.
A motivating example
Consider the following simple grid where the agent (the blue circle) must navigate to the gold star by choosing from four actions at each grid cell (up, down, left, right):

Being able to see the whole grid, it is trivial for us to find a shortest path to the gold star (the optimal policy for the agent):

In reinforcement learning problems, however, we typically do not have immediate access to the environment. In our simple example, the agent may simply know the number of cells in the grid, but may not have any information on what is contained in each:

The agent must then explore the environment by taking actions and keeping track of what is encountered at the different cells:

In the previous image, the grey arrows represent the actions that the agent has already taken in each of those cells. It benefits the agent to keep track of these so as to remember both the actions that lead to rewards and those that lead to penalties.
To anticipate some of the concepts that will be formalized below, the grid represents the environment. This environment consists of:
- A set of states (the grid cells)
- A set of possible actions (up, down, left, right)
- Transition dynamics, defining the effect of the actions (e.g. taking the right action will take the agent to the cell to the right of its current state)
- A reward signal that indicates the benefit or cost of the action just taken (e.g. the gold star is a positive reward received upon entering the bottom-right state).
The agent is typically aware of the states and actions, but not of the transition dynamics nor the reward signal. Thus, the agent must make use of a method that:
- Will explore the environment to learn about the transition and reward dynamics.
- Will keep track of previously visited states and the actions performed therein, which can be used to inform future action choices.
- Can balance exploring new experiences and exploiting the knowledge it has already acquired.
This last point is the “exploration-exploitation dilemma”, which is a central (and open!) problem in reinforcement learning. How to best balance this tradeoff in a computationally efficient manner is the subject of active research.
To illustrate its importance, assume the agent encountered a small reward in the grid. If at that point the agent were to decide to purely exploit its current knowledge, it would settle on the policy below, without ever exploring the unknown area where a larger reward awaits:

Having finished this motivating example, we are now ready to dive into the mathematical details.
Markov decision processes
More formally, the problem is typically expressed as a Markov decision process (MDP), defined as a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$ where:
- $\mathcal{S}$ is a finite set of states that the agent “inhabits”. You can also consider continuous state spaces, but for this notebook we will only be considering the finite kind.
- $\mathcal{A}$ is a finite set of actions that the agent can perform from each state. In the vast majority of cases, it is assumed that all actions are available in all states. Action spaces can also be continuous, but we will be limiting ourselves to finite action spaces in this notebook.
- $\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is a bounded reward (or cost) function which provides the agent with (positive or negative) reinforcement.
- $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ encodes the transition dynamics, where $\mathcal{P}(s' | s, a)$ is the probability of ending in state $s'$ after taking action $a$ from state $s$.
- $\gamma \in [0, 1)$ is a discount factor.
The notation $\mathcal{P}(s' | s, a)$ is shorthand for the conditional probability $\Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$.
Since the discount factor $\gamma$ is less than 1, rewards received far in the future are worth less than immediate rewards, and infinite sums of discounted rewards remain bounded.
The agent interacts with the environment in discrete time steps. If at time step $t$ the agent is in state $s_t$ and selects action $a_t$, it receives the reward $r_t = \mathcal{R}(s_t, a_t)$ and transitions to a new state $s_{t+1} \sim \mathcal{P}(\cdot | s_t, a_t)$.
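To make this concrete, here is a minimal sketch of how such an MDP might be encoded with numpy arrays (the specific numbers are made up purely for illustration), following the same P/R array conventions used in the code cells further down:
import numpy as onp

# A toy 3-state, 2-action MDP; all numbers are arbitrary, for illustration only.
ns, na = 3, 2
gamma = 0.9

# P[s, a, s'] is the probability of landing in s' after taking action a in s.
P = onp.zeros((ns, na, ns))
P[0, 0] = [0.9, 0.1, 0.0]  # Action 0 in state 0 mostly stays put.
P[0, 1] = [0.0, 1.0, 0.0]  # Action 1 in state 0 moves to state 1.
P[1, 0] = [1.0, 0.0, 0.0]
P[1, 1] = [0.0, 0.0, 1.0]
P[2, 0] = [0.0, 0.0, 1.0]  # State 2 is absorbing.
P[2, 1] = [0.0, 0.0, 1.0]

# R[s, a] is the reward received for taking action a in state s.
R = onp.zeros((ns, na))
R[1, 1] = 1.0  # Entering the absorbing state pays off.

# Each (s, a) row of P must be a valid probability distribution.
assert onp.allclose(P.sum(axis=-1), 1.0)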
Policies and value functions
A policy $\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})$ is a mapping from states to distributions over actions. The value of following policy $\pi$ from state $s$ is defined as:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t) \,\Big|\, s_0 = s \right]$$
This is quantifying the expected sum of discounted rewards when starting from state $s$ and acting according to $\pi$ thereafter. The value function can also be written recursively (the Bellman equation):
$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot | s)}\left[ \mathcal{R}(s, a) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}(s' | s, a) V^{\pi}(s') \right]$$
or equivalently, replacing the expectation over actions with a summation:
$$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a | s) \left[ \mathcal{R}(s, a) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}(s' | s, a) V^{\pi}(s') \right]$$
We may sometimes also be interested in the value of performing action $a$ from state $s$ and only then following $\pi$; this is the state-action value function:
$$Q^{\pi}(s, a) = \mathcal{R}(s, a) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}(s' | s, a) V^{\pi}(s')$$
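As an aside, the Bellman equation above suggests a simple way of computing $V^{\pi}$ for a fixed policy: repeatedly apply it until the estimates stop changing. The sketch below is one possible implementation, assuming the same P, R, gamma, and [num_states, num_actions] policy conventions used throughout this notebook (the helper name policy_evaluation is mine, not part of the colab):
import numpy as onp

def policy_evaluation(P, R, gamma, pi, tolerance=1e-6):
  """Iteratively computes V^pi for a fixed policy pi of shape [ns, na]."""
  V = onp.zeros(P.shape[0])
  while True:
    # Q^pi(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) V^pi(s').
    Q = R + gamma * P.dot(V)
    # V^pi(s) = sum_a pi(a | s) Q^pi(s, a).
    new_V = (pi * Q).sum(axis=1)
    if onp.max(onp.abs(new_V - V)) < tolerance:
      return new_V
    V = new_V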
Optimal policies and value functions
The goal of the agent is to find a policy $\pi^*$ that maximizes the expected sum of discounted rewards from every state. Its value function, the optimal value function $V^*$, satisfies the Bellman optimality equation:
$$V^{*}(s) = \max_{a \in \mathcal{A}} \left[ \mathcal{R}(s, a) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}(s' | s, a) V^{*}(s') \right]$$
The state-action value function can be defined in a similar way as above:
$$Q^{*}(s, a) = \mathcal{R}(s, a) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}(s' | s, a) V^{*}(s')$$
Clearly, we can compute $V^*$ from $Q^*$ via $V^{*}(s) = \max_{a} Q^{*}(s, a)$, and an optimal policy is obtained by acting greedily with respect to $Q^*$: $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$.
The important question, then, is how do we find $V^*$ (or $Q^*$)?
To begin with, run the first two cells below to install and import all necessary packages.
Computing the optimal value exactly
If we have access to the transition dynamics $\mathcal{P}$ and the reward function $\mathcal{R}$, we can compute the optimal value function exactly with dynamic programming. Two of the most common methods for doing so are:
- Value iteration
- Policy iteration
When the transition and reward dynamics are known, these problems are typically referred to as planning problems, which constitute a field of active research in their own right. In this scenario, the “agent” does not need to interact with an environment to gather information for improving its policy. There is a close relationship between planning and reinforcement learning, and many lines of research sit at their intersection.
For our purposes, it will be useful to review some solutions to the planning problem, as they will serve as foundations for the learning algorithms we present further down.
Value iteration
In value iteration we continuously update an estimate $V$ of the optimal value function by sweeping over the states and applying the following update:
$$V(s) \leftarrow \max_{a \in \mathcal{A}} \left[ \mathcal{R}(s, a) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}(s' | s, a) V(s') \right]$$
This is typically referred to as the Bellman backup. It can be shown that, starting from an arbitrary initial estimate $V_0$, repeatedly applying this backup converges to the optimal value function $V^*$.
This gives us the value iteration algorithm:
- Initialize $V(s) = 0$ for all $s \in \mathcal{S}$.
- Loop until convergence:
  - For every $s \in \mathcal{S}$:
    - For every $a \in \mathcal{A}$: $Q(s, a) = \mathcal{R}(s, a) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}(s' | s, a) V(s')$
    - $V(s) = \max_{a} Q(s, a)$
- Return $V$ and $Q$.
It will be useful to think of the functions above as vectors, which then allows us to do the Bellman backup with matrix operations. If we assume our matrices have the following shapes:
P.shape = [num_states, num_actions, num_states]
R.shape = [num_states, num_actions]
V.shape = [num_states]
Q.shape = [num_states, num_actions]
Then we can compute the Bellman backup with the following numpy code (throughout this post numpy is imported as onp):
import numpy as onp
Q = R + gamma * onp.matmul(P, V)
V = onp.max(Q, axis=1)
Alternatively, we can use the einsum function, which can help clarify the dimensions along which the multiplication is happening:
Q = R + gamma * onp.einsum('sat,t->sa', P, V)
V = onp.max(Q, axis=1)
As mentioned above, once we have $Q^*$ we can recover $V^*$ by maximizing over actions, and obtain an optimal policy by acting greedily with respect to $Q^*$.
The code cell below is an implementation of value iteration.
def value_iteration(P, R, gamma, tolerance=1e-3):
"""Find V* using value iteration.
Args:
P: numpy array defining transition dynamics. Shape: |S| x |A| x |S|.
R: numpy array defining rewards. Shape: |S| x |A|.
gamma: float, discount factor.
tolerance: float, tolerance level for computation.
Returns:
V*: numpy array of shape ns.
Q*: numpy array of shape ns x na.
"""
assert P.shape[0] == P.shape[2]
assert P.shape[0] == R.shape[0]
assert P.shape[1] == R.shape[1]
ns = P.shape[0]
na = P.shape[1]
V = onp.zeros(ns)
Q = onp.zeros((ns, na))
error = tolerance * 2
while error > tolerance:
# This is the Bellman backup (onp.einsum FTW!).
Q = R + gamma * onp.einsum('sat,t->sa', P, V)
new_V = onp.max(Q, axis=1)
error = onp.max(onp.abs(V - new_V))
V = onp.copy(new_V)
return V, Q
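As a quick usage sketch (not part of the original colab), the snippet below runs value_iteration on a randomly generated MDP and extracts the greedy policy from the returned Q values:
import numpy as onp

ns, na, gamma = 5, 3, 0.9
# A random MDP: normalize the transition probabilities over the next-state axis.
P = onp.random.rand(ns, na, ns)
P /= P.sum(axis=-1, keepdims=True)
R = onp.random.rand(ns, na)

V_star, Q_star = value_iteration(P, R, gamma)
greedy_policy = onp.argmax(Q_star, axis=1)  # pi*(s) = argmax_a Q*(s, a).
print(V_star, greedy_policy)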
Policy iteration
Instead of iterating over value estimates, in policy iteration we iterate over policies directly, alternating between evaluating the current policy and improving it greedily:
- Initialize $\pi$ arbitrarily. For simplicity we will assume it is a matrix of shape [num_states, num_actions] with only one 1.0 for each row (i.e. a deterministic policy).
- While $\pi$ is changing:
  - Policy evaluation: compute $V$ and $Q$ for the current policy $\pi$.
  - Policy improvement: set $\pi(s) = \arg\max_{a} Q(s, a)$ for every $s$.
- Return $V$ and $Q$.
Notice that, in contrast to value iteration, here we compute value estimates with respect to the current policy $\pi$ (rather than maximizing over actions at every step), and only then improve the policy by acting greedily on those estimates.
def policy_iteration(P, R, gamma):
"""Find V* using policy iteration.
Args:
P: numpy array defining transition dynamics. Shape: |S| x |A| x |S|.
R: numpy array defining rewards. Shape: |S| x |A|.
gamma: float, discount factor.
Returns:
V*: numpy array of shape ns.
Q*: numpy array of shape ns x na.
"""
assert P.shape[0] == P.shape[2]
assert P.shape[0] == R.shape[0]
assert P.shape[1] == R.shape[1]
ns = P.shape[0]
na = P.shape[1]
V = onp.zeros(ns)
Q = onp.zeros((ns, na))
pi = onp.zeros((ns, na))
for s in range(ns):
pi[s, onp.random.choice(na)] = 1.
policy_stable = False
while not policy_stable:
old_pi = onp.copy(pi)
# Extract V from Q using pi.
V = [Q[s, onp.argmax(pi[s])] for s in range(ns)]
Q = R + gamma * onp.einsum('sat,t->sa', P, V)
pi = onp.zeros((ns, na))
for s in range(ns):
pi[s, onp.argmax(Q[s])] = 1.
policy_stable = onp.array_equal(pi, old_pi)
V = [Q[s, onp.argmax(pi[s])] for s in range(ns)]
Q = R + gamma * onp.einsum('sat,t->sa', P, V)
V = [Q[s, onp.argmax(pi[s])] for s in range(ns)]
return V, Q
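As a sanity check (again, a hypothetical snippet rather than part of the colab), one can run both planning methods on the same random MDP and compare the greedy policies they produce:
import numpy as onp

ns, na, gamma = 5, 3, 0.9
P = onp.random.rand(ns, na, ns)
P /= P.sum(axis=-1, keepdims=True)
R = onp.random.rand(ns, na)

_, Q_vi = value_iteration(P, R, gamma)
_, Q_pi = policy_iteration(P, R, gamma)
# The greedy policies should typically coincide (ties aside).
print(onp.argmax(Q_vi, axis=1))
print(onp.argmax(Q_pi, axis=1))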
Learning the optimal value
What if one does not have access to $\mathcal{P}$ and $\mathcal{R}$? In that case the agent must interact with the environment, and use the rewards and transitions it observes to improve its value estimates and its policy.
The diagram below depicts this pictorially:
- The agent, in state $s_t$, picks an action $a_t$ from its policy and sends this action to the environment.
- The environment returns a new state $s_{t+1}$ and reward $r_t$ to the agent.
- The agent can then use this new information to update its policy $\pi$.
Model-based versus model-free approaches
Two common approaches for handling this are:
- Model-based methods: Learn approximate models $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$ from the experience received from the environment, and solve for $V^*$, $Q^*$, and $\pi^*$ using value/policy iteration.
- Model-free methods: Learn approximations of $V^*$, $Q^*$, and/or $\pi^*$ directly from the experience received from the environment.
There are pros and cons for each of these approaches, and there is extensive (and continuing) research for both.
For example, model-based methods tend to be more sample-efficient, since one can always sample from the learned models $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$ to generate additional experience without further interaction with the environment.
On the other hand, model-free methods are typically easier to learn and update in an online fashion (as new experience arrives), which is usually desirable.
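To make the model-based idea concrete, here is a minimal sketch of how one might estimate $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$ from a batch of observed transitions by counting and averaging; the helper name estimate_model and its details are illustrative assumptions, not part of the colab. The resulting arrays could then be passed to value_iteration or policy_iteration above.
import numpy as onp

def estimate_model(transitions, ns, na):
  """Estimates P_hat and R_hat from a list of (s, a, r, s2) tuples by counting."""
  counts = onp.zeros((ns, na, ns))
  reward_sums = onp.zeros((ns, na))
  for s, a, r, s2 in transitions:
    counts[s, a, s2] += 1
    reward_sums[s, a] += r
  visits = counts.sum(axis=-1)  # Number of times each (s, a) pair was tried.
  # Unvisited (s, a) pairs default to a uniform next-state distribution.
  P_hat = onp.where(visits[..., None] > 0,
                    counts / onp.maximum(visits[..., None], 1),
                    onp.ones((ns, na, ns)) / ns)
  R_hat = reward_sums / onp.maximum(visits, 1)
  return P_hat, R_hat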
Value-based versus policy-based approaches
Within model-free methods, there are two main approaches used: value-based versus policy-based.
- Value-based methods maintain and update an estimate of the optimal state-action value function $Q^*$, and derive their policy from it (for example, by acting greedily with respect to it).
- Policy-based methods maintain and update an estimate of the optimal policy $\pi^*$ directly.
Once again, there are pros and cons and extensive literature for both of these approaches.
In this notebook, we will focus on value-based methods.
You can learn more about policy-based methods in the Sutton & Barto book and in the Spinning Up in Deep RL post by OpenAI.
Exploration
One of the central issues in reinforcement learning is the exploration-exploitation dilemma. This was briefly introduced in the motivating example at the top, but is critical to the eventual performance of the agent.
Consider an agent that started with an initial policy $\pi_0$ and always acts greedily with respect to its current estimates, updating them from the reinforcement it receives. It may well settle on a policy that collects some reward, but not the best one available.
What if, had the agent picked a different action at some state, it would have discovered a path to much larger rewards? By acting purely greedily, it may never find out.
At iteration $t$, it can therefore pay off for the agent to sometimes select an action other than the one its current policy $\pi_t$ considers best.
Selecting a sub-optimal action is what is referred to as exploration, since the reasoning behind choosing sub-optimally is precisely to explore the environment and potentially discover better policies.
A purely exploratory policy would not be desirable either, as the agent would then always select actions randomly, which would make it unlikely to maximize the expected rewards received.
As mentioned previously, balancing this exploration-exploitation tradeoff is a very active area of research.
$\epsilon$-greedy exploration
Perhaps the simplest and best-known exploration method is $\epsilon$-greedy, where $\epsilon \in [0, 1]$ controls the amount of exploration. At each step:
- With probability $\epsilon$, select an action uniformly at random.
- With probability $1 - \epsilon$, select the action prescribed by the current policy, $\pi(s)$.
The code snippet below implements this exploratory policy.
def epsilon_greedy(s, pi, epsilon):
"""A simple implementation of epsilon-greedy exploration.
Args:
s: int, the agent's current state.
pi: numpy array of shape [num_states, num_actions] encoding the agent's
policy.
epsilon: float, the epsilon value for epsilon-greedy exploration.
Returns:
An integer representing the action choice.
"""
na = pi.shape[1]
  p = onp.random.rand()
  if p < epsilon:
return onp.random.choice(na)
return onp.random.choice(na, p=pi[s])
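As a quick empirical check of this snippet, a greedy action should be selected with probability roughly $1 - \epsilon + \epsilon / |\mathcal{A}|$ (the random branch can also land on the greedy action). The hypothetical test below illustrates this:
import numpy as onp

na, epsilon = 4, 0.1
pi = onp.zeros((1, na))
pi[0, 2] = 1.  # A deterministic policy that always picks action 2 in state 0.

samples = onp.array([epsilon_greedy(0, pi, epsilon) for _ in range(10000)])
# Expected frequency of the greedy action: 1 - epsilon + epsilon / na = 0.925.
print(onp.mean(samples == 2))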
Monte Carlo methods
Perhaps the simplest way of estimating the sum of future returns is using Monte Carlo methods. These methods execute full trajectories in the environment and estimate the value of each state-action pair by averaging the returns observed after visiting it.
The algorithm can be described as follows:
- Initialize $Q(s, a)$ for all $s$ and $a$, and pick a start state $s_0$.
- Initialize a table $Returns(s, a)$ of elements which accumulates the observed returns for each state-action pair.
- Initialize $\pi$ randomly.
- While learning:
  - Generate a trajectory $s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T$ using $\pi$ (with some exploration).
  - Set $G = 0$.
  - For $t = T - 1$ down to $0$:
    - $G \leftarrow \gamma G + r_t$
    - If the pair $(s_t, a_t)$ does not appear in $s_0, a_0, \ldots, s_{t-1}, a_{t-1}$:
      - Append $G$ to $Returns(s_t, a_t)$.
      - Set $Q(s_t, a_t)$ to the average of $Returns(s_t, a_t)$, and make $\pi(s_t)$ greedy with respect to $Q(s_t, \cdot)$.
The next code cell implements this approach.
def monte_carlo(ns, na, step_fn, gamma, start_state, reset_state, total_episodes,
max_steps_per_iteration, epsilon, V):
"""A simple implementation of Q-learning.
Args:
ns: int, the number of states.
na: int, the number of actions.
step_fn: a function that receives a state and action, and returns a float
(reward) and next state. This represents the interaction with the
environment.
gamma: float, the discount factor.
start_state: int, index of starting state.
reset_state: int, index of state where environment resets back to start
state, or None if there is no reset state.
total_episodes: int, total number of episodes.
max_steps_per_iteration: int, maximum number of steps per iteration.
epsilon: float, exploration rate for epsilon-greedy exploration.
V: numpy array, true V* used for computing errors. Shape: [num_states].
Returns:
V_hat: numpy array, learned value function. Shape: [num_states].
Q_hat: numpy array, learned Q function. Shape: [num_states, num_actions].
max_errors: list of floats, contains the error max_s |V*(s) - \hat{V}*(s)|.
avg_errors: list of floats, contains the error avg_s |V*(s) - \hat{V}*(s)|.
"""
# Initialize policy randomly.
pi_hat = onp.zeros((ns, na))
for s in range(ns):
pi_hat[s, onp.random.choice(na)] = 1.
# Initialize Q randomly.
Q_hat = onp.zeros((ns, na))
# Initialize the accumulated returns and number of updates.
returns = onp.zeros((ns, na))
counts = onp.zeros((ns, na))
# Lists to keep track of training statistics.
iteration_returns = []
max_errors = []
avg_errors = []
for episode in range(total_episodes):
# Each episode starts in the same start state.
s = start_state
step = 0
# Lists collected for each trajectory.
states = []
actions = []
rewards = []
# Generate a trajectory for a limited number of steps.
while step < max_steps_per_iteration:
step += 1
states.append(s)
a = epsilon_greedy(s, pi_hat, epsilon) # Pick action.
actions.append(a)
r, s2 = step_fn(s, a) # Take a step in the environment.
rewards.append(r)
if s2 == reset_state:
# If we've reached a reset state, the trajectory is over.
break
s = s2
# Update the Q-values based on the rewards received by traversing the
# trajectory in reverse order.
G = 0 # Accumulated returns.
step -= 1
while step >= 0:
G = gamma * G + rewards[-1]
rewards = rewards[:-1]
s = states[-1]
states = states[:-1]
a = actions[-1]
actions = actions[:-1]
      # We only update Q(s, a) for the first occurrence of the pair in the
# trajectory.
update_q = True
for i in range(len(states)):
if s == states[i] and a == actions[i]:
update_q = False
break
if update_q:
returns[s, a] += G
counts[s, a] += 1
Q_hat[s, a] = returns[s, a] / counts[s, a]
pi_hat[s] = onp.zeros(na)
pi_hat[s, onp.argmax(Q_hat[s])] = 1.
step -= 1
iteration_returns.append(G)
V_hat = onp.max(Q_hat, axis=1)
max_errors.append(onp.max(onp.abs(V - V_hat)))
avg_errors.append(onp.mean(onp.abs(V - V_hat)))
return V_hat, Q_hat, iteration_returns, max_errors, avg_errors
Q-learning
Although Monte Carlo methods can update value estimates based on interactions with the environment, a more common approach in reinforcement learning is to use Temporal-Difference (TD) methods. This approach combines ideas from Monte Carlo estimation and dynamic programming.
Like Monte Carlo methods, Q-learning updates its estimates from sampled experiences; but like dynamic programming methods, it does so with single-step transitions. In its simplest form, after taking action $a$ from state $s$ and observing reward $r$ and next state $s'$, the Q-learning update is:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Here, $\alpha \in (0, 1]$ is a learning rate that controls how far the estimate is moved toward the new target, and the bracketed term is the temporal-difference (TD) error.
In this simplified setting we are assuming a constant learning rate $\alpha$ and a tabular representation of $Q$. This gives us the following algorithm:
- Initialize $Q$ and $\pi$, and pick a start state $s$.
- While learning:
  - Pick action $a$ according to $\pi$ (and any exploratory policy, such as $\epsilon$-greedy).
  - Send $a$ to the environment and receive $r$ and $s'$ in return.
  - Compute the TD-error as: $\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$
  - Update the estimate for $Q(s, a)$ as follows: $Q(s, a) \leftarrow Q(s, a) + \alpha \delta$
  - Update $s \leftarrow s'$.
The next cell provides an implementation of Q-learning.
def q_learning(ns, na, step_fn, gamma, start_state, reset_state, total_episodes,
max_steps_per_iteration, epsilon, alpha, V):
"""A simple implementation of Q-learning.
Args:
ns: int, the number of states.
na: int, the number of actions.
step_fn: a function that receives a state and action, and returns a float
(reward) and next state. This represents the interaction with the
environment.
gamma: float, the discount factor.
start_state: int, index of starting state.
reset_state: int, index of state where environment resets back to start
state, or None if there is no reset state.
total_episodes: int, total number of episodes.
max_steps_per_iteration: int, maximum number of steps per iteration.
epsilon: float, exploration rate.
alpha: float, learning rate.
V: numpy array, true V* used for computing errors. Shape: [num_states].
Returns:
V_hat: numpy array, learned value function. Shape: [num_states].
Q_hat: numpy array, learned Q function. Shape: [num_states, num_actions].
max_errors: list of floats, contains the error max_s |V*(s) - \hat{V}*(s)|.
avg_errors: list of floats, contains the error avg_s |V*(s) - \hat{V}*(s)|.
"""
# Initialize policy randomly.
pi_hat = onp.zeros((ns, na))
for s in range(ns):
pi_hat[s, onp.random.choice(na)] = 1.
# Initialize Q to zeros.
Q_hat = onp.zeros((ns, na))
# Lists collected for each trajectory.
iteration_returns = []
max_errors = []
avg_errors = []
for episode in range(total_episodes):
# Each episode begins in the same start state.
s = start_state
step = 0
num_episodes = 0
steps_in_episode = 0
cumulative_return = 0.
average_episode_returns = 0.
# Interact with the environment for a maximum number of steps
while step < max_steps_per_iteration:
a = epsilon_greedy(s, pi_hat, epsilon) # Pick action.
r, s2 = step_fn(s, a) # Take a step in the environment.
delta = r + gamma * max(Q_hat[s2]) - Q_hat[s, a] # TD-error.
Q_hat[s, a] += alpha * delta # Q-learning update.
cumulative_return += gamma**(steps_in_episode) * r
pi_hat[s] = onp.zeros(na)
pi_hat[s, onp.argmax(Q_hat[s])] = 1.
s = s2
steps_in_episode += 1
if s2 == reset_state:
        s = start_state
num_episodes += 1
steps_in_episode = 0
average_episode_returns += cumulative_return
cumulative_return = 0.
step += 1
average_episode_returns /= max(1, num_episodes)
iteration_returns.append(average_episode_returns)
V_hat = onp.max(Q_hat, axis=1)
max_errors.append(onp.max(onp.abs(V - V_hat)))
avg_errors.append(onp.mean(onp.abs(V - V_hat)))
return V_hat, Q_hat, iteration_returns, max_errors, avg_errors
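To tie the pieces together, here is a hypothetical end-to-end example (not from the colab): a small "chain" MDP in which the agent must walk right to reach a reward, a step_fn that simulates it, and a call to q_learning that uses the values from value_iteration to track errors.
import numpy as onp

# A 5-state chain (illustrative): action 1 moves right, action 0 moves left.
# Entering the last state yields a reward of 1; that state is absorbing in the
# planning MDP and acts as the reset state for the learning agent.
ns, na, gamma = 5, 2, 0.9
P = onp.zeros((ns, na, ns))
R = onp.zeros((ns, na))
for s in range(ns - 1):
  P[s, 0, max(s - 1, 0)] = 1.  # Move left (or stay at the left edge).
  P[s, 1, s + 1] = 1.          # Move right.
P[ns - 1, :, ns - 1] = 1.      # Absorbing final state with zero reward.
R[ns - 2, 1] = 1.              # Reward for stepping into the final state.

def step_fn(s, a):
  """Simulates one environment step by sampling from the true dynamics."""
  s2 = onp.random.choice(ns, p=P[s, a])
  return R[s, a], s2

V_true, _ = value_iteration(P, R, gamma)  # True values, used for error tracking.

V_hat, Q_hat, returns, max_errs, avg_errs = q_learning(
    ns, na, step_fn, gamma, start_state=0, reset_state=ns - 1,
    total_episodes=200, max_steps_per_iteration=100, epsilon=0.1, alpha=0.1,
    V=V_true)
print(max_errs[-1])  # The error should shrink as training progresses.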
Colab!
The rest of this intro is hands-on, so you should follow the instructions in the colab.