This paper was accepted as a spotlight at ICLR'21.

We propose a new metric and a contrastive loss, both backed by theoretical and empirical results.

## Policy Similarity Metric

We introduce the policy similarity metric (PSM), denoted $d^*$, which is based on bisimulation metrics. Whereas bisimulation metrics are built on differences in rewards, the PSM is built on differences in optimal policies.
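The PSM is defined recursively, with $\mathrm{DIST}$ a distance between the optimal policies' action distributions and $W_1$ the 1-Wasserstein distance: $d^*(x, y) = \mathrm{DIST}(\pi^*(x), \pi^*(y)) + \gamma\, W_1(d^*)\big(P^{\pi^*}(\cdot \mid x), P^{\pi^*}(\cdot \mid y)\big)$. The sketch below computes this fixed point on a toy MDP by repeated application of the recursion. It assumes, for simplicity, a single MDP with deterministic dynamics and a deterministic optimal policy, so $W_1$ between two point masses reduces to $d^*(x', y')$. It also takes $\mathrm{DIST}$ to be 0 when the optimal actions agree and 1 otherwise. The function name and the toy MDP are hypothetical.

```python
import numpy as np


def policy_similarity_metric(actions, next_state, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Fixed-point iteration for a PSM on a toy deterministic MDP (illustrative only).

    actions[s]:    assumed deterministic optimal action pi*(s).
    next_state[s]: state reached from s by taking pi*(s).
    DIST is assumed to be 0 if the optimal actions match and 1 otherwise.
    With point-mass transitions, W_1(d)(delta_x', delta_y') = d(x', y').
    """
    actions = np.asarray(actions)
    next_state = np.asarray(next_state)
    # Pairwise policy differences DIST(pi*(x), pi*(y)).
    policy_diff = (actions[:, None] != actions[None, :]).astype(float)
    d = np.zeros_like(policy_diff)
    for _ in range(max_iters):
        # d(x, y) <- DIST(pi*(x), pi*(y)) + gamma * d(x', y')
        d_new = policy_diff + gamma * d[np.ix_(next_state, next_state)]
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d


# Toy chain 0 -> 1 -> 2 -> 2 (absorbing); states 0 and 1 share an optimal action.
d = policy_similarity_metric(actions=[0, 0, 1], next_state=[1, 2, 2], gamma=0.9)
print(np.round(d, 3))
```

In this example, states 0 and 1 take the same optimal action but still end up at distance 0.9, because their *future* optimal behaviour differs one step later. This is the sense in which the PSM compares long-term policy behaviour rather than immediate actions alone.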

If we use this metric for policy transfer (as Doina Precup and I explored previously), we can upper-bound the difference between the optimal policy and the transferred policy.
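As a concrete picture of what "transferring a policy with a metric" can mean, here is a minimal sketch of one simple scheme: each target state copies the optimal action of its closest source state under a PSM-style distance. The nearest-neighbour rule, the `transfer_policy` name and the `cross_distances` matrix (distances between source states $x$ and target states $y$) are assumptions made for illustration, not a description of the exact transfer procedure behind the bound.

```python
import numpy as np


def transfer_policy(cross_distances, source_actions):
    """Hypothetical nearest-neighbour policy transfer under a cross-MDP metric.

    cross_distances[x, y]: assumed PSM-style distance between source state x
                           and target state y, shape (n_source, n_target).
    source_actions[x]:     optimal action pi*(x) in the source MDP.
    Returns, for each target state y, the optimal action of its closest source state.
    """
    cross_distances = np.asarray(cross_distances, dtype=float)
    source_actions = np.asarray(source_actions)
    nearest_source = np.argmin(cross_distances, axis=0)  # closest x for every y
    return source_actions[nearest_source]


# Two source states (actions 3 and 7) and three target states.
cross = [[0.1, 2.0, 0.4],
         [1.5, 0.3, 0.2]]
print(transfer_policy(cross, source_actions=[3, 7]))
```

The smaller the distance between each target state and its matched source state, the less the copied action can disagree with the target's own optimal policy. This is the intuition a transfer bound of this kind formalizes.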

## Policy Similarity Embeddings

We use this metric to define a loss for learning contrastive metric embeddings (CMEs): structured representations in which states that are close under $d^*$ are embedded close together.
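A minimal NumPy sketch of a contrastive loss of this kind follows. It rests on several assumptions. PSM distances are turned into soft similarities $\Gamma(x, y) = \exp(-d^*(x, y)/\beta)$. Embeddings are compared by cosine similarity scaled by an inverse temperature $\lambda$. For each state $y$, the "positive" is its most PSM-similar state $x$, while every other $x$ acts as a negative weighted by $1 - \Gamma(x, y)$. The names `cme_loss`, `beta` and `lam` are illustrative, and this is a simplified sketch rather than the implementation used in the paper.

```python
import numpy as np


def cme_loss(z_x, z_y, psm, beta=1.0, lam=1.0):
    """Sketch of a contrastive metric embedding loss (illustrative only).

    z_x: (n, k) embeddings of states x from one environment.
    z_y: (m, k) embeddings of states y from another environment.
    psm: (n, m) assumed PSM distances d*(x, y) between those states.
    """
    # Soft similarity targets from the metric: close states -> Gamma near 1.
    gamma_sim = np.exp(-np.asarray(psm, dtype=float) / beta)
    # Cosine similarity between every (x, y) embedding pair, scaled by lam.
    zx = z_x / np.linalg.norm(z_x, axis=1, keepdims=True)
    zy = z_y / np.linalg.norm(z_y, axis=1, keepdims=True)
    logits = lam * zx @ zy.T  # shape (n, m)

    losses = []
    for j in range(gamma_sim.shape[1]):
        pos = np.argmax(gamma_sim[:, j])  # most PSM-similar x for this y
        pos_term = gamma_sim[pos, j] * np.exp(logits[pos, j])
        neg_mask = np.arange(gamma_sim.shape[0]) != pos
        neg_terms = (1.0 - gamma_sim[neg_mask, j]) * np.exp(logits[neg_mask, j])
        losses.append(-np.log(pos_term / (pos_term + neg_terms.sum())))
    return float(np.mean(losses))


psm = np.array([[0.0, 2.0], [2.0, 0.0]])
z_x = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned = cme_loss(z_x, np.array([[1.0, 0.0], [0.0, 1.0]]), psm)
swapped = cme_loss(z_x, np.array([[0.0, 1.0], [1.0, 0.0]]), psm)
print(aligned, swapped)
```

Embeddings that place PSM-similar states together (`aligned`) get a lower loss than embeddings that pair each state with its PSM-distant counterpart (`swapped`). Minimizing the loss therefore pushes the representation toward respecting $d^*$.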

The resulting algorithm is as follows: