Scalable methods for computing state similarity in deterministic MDPs
This post describes my paper Scalable methods for computing state similarity in deterministic MDPs, published at AAAI 2020. The code is available here.
Motivation
We consider distance metrics between states in an MDP. Take the following MDP, where the goal is to reach the green cells:

Physical distance between states?
Physical distance often fails to capture the similarity properties we’d like:

State abstractions
Now imagine we add an exact copy of these states to the MDP (think of it as an additional “floor”):

And imagine after each action the agent randomly transitions between these two floors:

Notice that since the optimal policy is identical on both floors, we just doubled the state space for no reason! We should really be grouping together similar states.
MDPs
Consider the standard MDP definition $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $\mathcal{P}$ the transition function, $\mathcal{R}$ the reward function, and $\gamma \in [0, 1)$ the discount factor.
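To make the later sketches concrete, here is one possible way to encode a small deterministic MDP as NumPy arrays (a purely illustrative encoding; none of these names come from the paper or its released code):

```python
import numpy as np

# Illustrative encoding of a tiny deterministic MDP with 4 states and 2 actions.
# next_state[s, a]: the unique successor of state s under action a.
# reward[s, a]:     the immediate reward for taking action a in state s.
next_state = np.array([[1, 2],
                       [3, 0],
                       [3, 1],
                       [3, 3]])
reward = np.array([[0.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
gamma = 0.9  # discount factor
```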
Bisimulation
What we need is a notion of state distance that captures behavioural indistinguishability.
Equivalence relations
First, let me explain bisimulation relations, which were introduced by Givan, Dean, and Greig.
Two states $s$ and $t$ are bisimilar if, for every action $a$:
- They have equal immediate rewards: $\mathcal{R}(s, a) = \mathcal{R}(t, a)$
- They transition with equal probability to each bisimulation equivalence class $C$: $\mathcal{P}(C \mid s, a) = \mathcal{P}(C \mid t, a)$
This is often easiest to understand via an illustration. Consider the following system:

A bisimulation equivalence relation would collapse the 8 states above into an equivalent 4-state MDP:

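Roughly, such a collapse can be computed by partition refinement. Here is a minimal sketch for deterministic MDPs, in the array encoding from earlier (illustrative code, not the algorithm from the paper):

```python
import numpy as np

def bisimulation_classes(next_state, reward):
    """Naive partition refinement for a deterministic MDP.

    Two states end up in the same class iff they have equal per-action
    rewards and their per-action successors fall in the same classes.
    """
    num_states = next_state.shape[0]
    labels = np.zeros(num_states, dtype=int)  # start with a single class
    while True:
        # Signature of a state: its rewards plus the classes of its successors.
        signatures = [tuple(reward[s]) + tuple(labels[next_state[s]])
                      for s in range(num_states)]
        class_of = {sig: i for i, sig in enumerate(dict.fromkeys(signatures))}
        new_labels = np.array([class_of[sig] for sig in signatures])
        if len(class_of) == len(set(labels.tolist())):
            return new_labels  # partition stopped refining: fixed point reached
        labels = new_labels
```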
But, unfortunately, bisimulation equivalence is brittle, as it’s a 0/1 relationship: an arbitrarily small change in rewards or transition probabilities can break the equivalence entirely!
Metrics
Bisimulation metrics generalize bisimulation relations and quantify the behavioural distance between two states in an MDP. They are defined as any metric $d^{\sim}$ satisfying

$$d^{\sim}(s, t) = \max_a \Big[ \big|\mathcal{R}(s, a) - \mathcal{R}(t, a)\big| + \gamma\, \mathcal{W}(d^{\sim})\big(\mathcal{P}(\cdot \mid s, a), \mathcal{P}(\cdot \mid t, a)\big) \Big]$$

where $\mathcal{W}(d)$ is the Kantorovich (or Wasserstein) distance between probability distributions, computed with $d$ as the underlying ground metric.

These metrics have nice theoretical properties, such as: the bisimulation distance between two states is an upper bound on their optimal value difference:

$$\big|V^*(s) - V^*(t)\big| \le d^{\sim}(s, t)$$
Shortcomings and solutions
Bisimulation metrics have three shortcomings I address in my paper.
Pessimism
Problem
They’re inherently pessimistic (the max over actions considers the worst possible case!):

When the agent’s policy never takes the action responsible for that worst case, two states that behave identically under the policy being followed can still end up far apart under the bisimulation metric.
Solution
I introduce on-policy bisimulation metrics, which measure behavioural similarity with respect to the policy $\pi$ the agent is actually following, rather than the worst case over all actions:

$$d^{\pi}(s, t) = \big|\mathcal{R}^{\pi}(s) - \mathcal{R}^{\pi}(t)\big| + \gamma\, \mathcal{W}(d^{\pi})\big(\mathcal{P}^{\pi}(\cdot \mid s), \mathcal{P}^{\pi}(\cdot \mid t)\big)$$

where $\mathcal{R}^{\pi}$ and $\mathcal{P}^{\pi}$ are the reward and transition functions induced by following $\pi$.
Computational Expense
Problem
Bisimulation metrics have traditionally been computed via dynamic programming. But this can be very expensive, as it requires updating the metric estimate for every pair of states and every action at each iteration! Note that this means computing the Kantorovich distance (an expensive linear program) for each of those pairs, at every iteration.
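For intuition, here is a minimal sketch of one such dynamic-programming computation, specialized to deterministic MDPs so that the Kantorovich term reduces to a table lookup; in the general stochastic case that inner term is itself a linear program for every state pair and action. Names are illustrative:

```python
import numpy as np

def bisim_metric_dp(next_state, reward, gamma, num_iterations=100):
    """Exhaustive DP for the bisimulation metric of a deterministic MDP.

    Every iteration touches every pair of states and every action, so each
    sweep costs O(|S|^2 |A|); with stochastic transitions, each inner term
    would additionally require solving a Kantorovich linear program.
    """
    num_states, num_actions = reward.shape
    d = np.zeros((num_states, num_states))
    for _ in range(num_iterations):
        d_new = np.zeros_like(d)
        for s in range(num_states):
            for t in range(num_states):
                d_new[s, t] = max(
                    abs(reward[s, a] - reward[t, a])
                    + gamma * d[next_state[s, a], next_state[t, a]]
                    for a in range(num_actions))
        if np.max(np.abs(d_new - d)) < 1e-8:  # converged to the fixed point
            return d_new
        d = d_new
    return d
```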
Solution
Sampling! If we assume a deterministic MDP, the Kantorovich term collapses to the metric evaluated at the unique next states, and I provide an update rule that only uses sampled pairs of transitions sharing the same action, rather than exhaustive sweeps over the state space.
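A minimal sketch of the flavour of such an update (not a verbatim transcription of the paper’s algorithm): given two transitions that share an action, push the current estimate for the pair of source states towards the reward difference plus the discounted estimate at the pair of next states.

```python
def sampled_bisim_update(d, s, t, r_s, r_t, s_next, t_next, gamma):
    """One sampled update for transitions (s, a, r_s, s_next) and
    (t, a, r_t, t_next) taken with the same action in a deterministic MDP.

    The estimate only ever grows towards the fixed point, mimicking the
    max over actions one sample at a time.
    """
    target = abs(r_s - r_t) + gamma * d[s_next, t_next]
    d[s, t] = d[t, s] = max(d[s, t], target)  # keep the estimate symmetric
    return d
```

Because each update only touches a single pair of states, there is no sweep over the full state space and no linear program to solve.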
Full state enumerability
Problem
Existing methods for computing/approximating bisimulation metrics require full state enumerability.
Solution
Use neural networks! Consider a trained Rainbow agent, and take the penultimate layer as the representation $\phi(s)$ of a state $s$:

We can concatenate the representations of two states:

And feed them through a neural network, where we denote the output of the network as $\psi(s, t)$, our estimate of the distance between $s$ and $t$:

We define a target for regular bisimulation (taking the max over actions, with $\bar{\psi}$ denoting the network’s output when no gradient is passed through it, and $s'_a$, $t'_a$ the states reached from $s$ and $t$ under action $a$):

$$T_{\sim}(s, t) = \max_a \Big[ \big|\mathcal{R}(s, a) - \mathcal{R}(t, a)\big| + \gamma\, \bar{\psi}(s'_a, t'_a) \Big]$$

and for on-policy bisimulation, using a pair of sampled transitions $(s, r_s, s')$ and $(t, r_t, t')$ generated by the policy:

$$T_{\pi}(s, t) = |r_s - r_t| + \gamma\, \bar{\psi}(s', t')$$

The target is then incorporated into the loss (where $\theta$ are the parameters of the network computing $\psi$, and the target side is held fixed):

$$\mathcal{L}(\theta) = \mathbb{E}\Big[ \big( T(s, t) - \psi_{\theta}(s, t) \big)^2 \Big]$$
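Putting the pieces together, here is a minimal PyTorch-flavoured sketch of the on-policy variant of this setup (the released code may be structured quite differently; `BisimMetricNet`, `on_policy_bisim_loss`, and the batch format are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BisimMetricNet(nn.Module):
    """Maps a concatenated pair of state representations to a distance."""
    def __init__(self, rep_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * rep_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def forward(self, phi_s, phi_t):
        return self.net(torch.cat([phi_s, phi_t], dim=-1)).squeeze(-1)

def on_policy_bisim_loss(psi, phi_s, phi_t, r_s, r_t,
                         phi_s_next, phi_t_next, gamma):
    """Squared error between psi(s, t) and the on-policy bisimulation target,
    with no gradient flowing through the target side."""
    with torch.no_grad():
        target = torch.abs(r_s - r_t) + gamma * psi(phi_s_next, phi_t_next)
    return F.mse_loss(psi(phi_s, phi_t), target)
```

Here `phi_s`, `phi_t`, `phi_s_next`, `phi_t_next` would be the frozen penultimate-layer Rainbow features for pairs of states sampled from the agent’s experience; the regular-bisimulation target would additionally need the rewards and next states for every action.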
Evaluation on Atari 2600
Take a trained Rainbow agent and compute the distance from the first state:

to every other state:

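Concretely, that evaluation could look something like this sketch, where `phi` stands for the (frozen) Rainbow feature extractor and `psi` for the learned metric network from the previous sketch (both placeholders):

```python
import torch

def distances_from_first_frame(frames, phi, psi):
    """Distance, under the learned metric, from the first state of an episode
    to every other state in that episode."""
    with torch.no_grad():
        reps = phi(frames)                 # (T, rep_dim) state representations
        first = reps[:1].expand_as(reps)   # repeat the first state T times
        return psi(first, reps)            # (T,) learned distances
```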
Let’s see the results! Notice how the metric jumps when an alien ship is destroyed:

Citing
The original tweet is here.
Please use the following BibTeX entry if you’d like to cite this work:
@inproceedings{castro20bisimulation,
author = {Pablo Samuel Castro},
title = {Scalable methods for computing state similarity in deterministic {M}arkov {D}ecision {P}rocesses},
year = {2020},
booktitle = {Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)},
}