
Mapping between two Gaussians using optimal transport and the KL-divergence

Suppose you have two multivariate Gaussian distributions \( S\) and \(T\), parameterized as \( N(\mu_S, \Sigma_S)\) and \( N(\mu_T, \Sigma_T)\). How do you linearly transform \( x \sim S\) so that the transformed vectors have distribution \( T\)? Is there an optimal way to do this? The field of optimal transport (OT) provides an answer. If we measure the cost of transport by the squared 2-Wasserstein distance \( W_2^2\) between probability measures, then the optimal map is the following affine function:

\[ x \rightarrow \mu_T + A(x-\mu_S) = Ax + (\mu_T - A\mu_S), \]
where
\[ A = \Sigma_S^{-1/2} (\Sigma_S^{1/2} \Sigma_T \Sigma_S^{1/2} )^{1/2} \Sigma_S^{-1/2} = A^{\top}.\]

For more details, see Remark 2.31 in “Computational Optimal Transport” by Peyré & Cuturi (available on arXiv).
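
In code, this map is straightforward to compute. Below is a minimal NumPy/SciPy sketch; the function name gaussian_ot_map and the use of scipy.linalg.sqrtm for the matrix square roots are my own choices, not anything prescribed by the reference above.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_ot_map(mu_S, Sigma_S, mu_T, Sigma_T):
    """Return (A, b) such that x -> A @ x + b pushes N(mu_S, Sigma_S) onto N(mu_T, Sigma_T)."""
    S_half = np.real(sqrtm(Sigma_S))                    # Sigma_S^{1/2}
    S_neg_half = np.linalg.inv(S_half)                  # Sigma_S^{-1/2}
    middle = np.real(sqrtm(S_half @ Sigma_T @ S_half))  # (Sigma_S^{1/2} Sigma_T Sigma_S^{1/2})^{1/2}
    A = S_neg_half @ middle @ S_neg_half                # symmetric by construction
    b = mu_T - A @ mu_S                                 # shift term mu_T - A mu_S
    return A, b
```

Applying `x @ A.T + b` to samples drawn from \( S \) then gives samples distributed (up to numerical error) as \( T \).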

But we might instead want to find the transformation which minimizes the Kullback-Leibler divergence between \( T\) and the transformed \( S\). We will use the fact that the transformed vector will also come from a Gaussian distribution, with mean and covariance given by

\[ \begin{aligned}\mathbb{E}[Mx+b] &= M \mathbb{E}[x]+b = M\mu_S + b, \\
Cov[Mx+b] &= M\, Cov[x]\, M^\top = M\Sigma_S M^{\top}. \end{aligned} \]
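
As a quick sanity check of these identities, here is a small sketch with arbitrary, made-up \( M \), \( b \), and source parameters, comparing sample statistics to the formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_S = np.array([1.0, 2.0])
Sigma_S = np.array([[2.0, 0.5], [0.5, 1.0]])
M = np.array([[1.0, -0.5], [0.3, 2.0]])   # arbitrary example transform
b = np.array([0.7, -1.2])

x = rng.multivariate_normal(mu_S, Sigma_S, size=500_000)
y = x @ M.T + b
print(y.mean(axis=0), M @ mu_S + b)                 # sample mean vs. M mu_S + b
print(np.cov(y, rowvar=False), M @ Sigma_S @ M.T)   # sample covariance vs. M Sigma_S M^T
```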

We then set up an optimization problem:

\[ \min_{M, b} D_{KL}(N(\mu_T, \Sigma_T) || N(M\mu_S + b, M\Sigma_S M^{\top})). \]

Expanding the KL-divergence between two Gaussians (and dropping the overall factor of \( 1/2 \) and the additive constant, which do not affect the minimizer), this leads to the following nasty-looking objective:

\[ \min_{M, b} \log(|M \Sigma_S M^\top|) - \log(|\Sigma_T|) + \textrm{tr}([M \Sigma_S M^\top]^{-1} \Sigma_T) + (M\mu_S + b - \mu_T)^\top [M \Sigma_S M^\top]^{-1} (M\mu_S + b - \mu_T) . \]
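
If you do want to evaluate this objective numerically, the closed-form Gaussian KL-divergence is easy to code up. Here is a sketch (the helper names are mine, and it keeps the usual \( 1/2 \) factor and \( -k \) constant, which don't change the minimizer):

```python
import numpy as np

def kl_gaussian(mu_p, Sigma_p, mu_q, Sigma_q):
    """D_KL( N(mu_p, Sigma_p) || N(mu_q, Sigma_q) ) in nats."""
    k = mu_p.shape[0]
    Sigma_q_inv = np.linalg.inv(Sigma_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(Sigma_p)
    _, logdet_q = np.linalg.slogdet(Sigma_q)
    return 0.5 * (logdet_q - logdet_p
                  + np.trace(Sigma_q_inv @ Sigma_p)
                  + diff @ Sigma_q_inv @ diff
                  - k)

# The objective above, as a function of M and b (with mu_S, Sigma_S, mu_T, Sigma_T fixed):
def objective(M, b, mu_S, Sigma_S, mu_T, Sigma_T):
    return kl_gaussian(mu_T, Sigma_T, M @ mu_S + b, M @ Sigma_S @ M.T)
```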

But we don't actually need to work through all this algebra, because the optimal transport solution also minimizes the KL-divergence. The KL-divergence \( D_{KL}(P || Q) \) attains its minimum of 0 if and only if \( P \) and \( Q \) are equal, so we only need to verify that the optimal transport map above produces samples with distribution \( T \).

First checking the mean, we verify that \( \mathbb{E}[\mu_T + A(x-\mu_S)] = \mu_T + A(\mu_S-\mu_S) = \mu_T\). Next, checking the covariance, we have

\[ \begin{aligned} Cov[\mu_T + A(x-\mu_S)] &= A\, Cov[x]\, A^\top = A\, Cov[x]\, A = A \Sigma_S A \\
&= \Big[ \Sigma_S^{-1/2} (\Sigma_S^{1/2} \Sigma_T \Sigma_S^{1/2} )^{1/2}\Sigma_S^{-1/2}\Big] \Sigma_S \Big[ \Sigma_S^{-1/2} (\Sigma_S^{1/2} \Sigma_T \Sigma_S^{1/2} )^{1/2}\Sigma_S^{-1/2} \Big] \\
&= \Sigma_S^{-1/2} (\Sigma_S^{1/2} \Sigma_T \Sigma_S^{1/2} )^{1/2} (\Sigma_S^{1/2} \Sigma_T \Sigma_S^{1/2} )^{1/2} \Sigma_S^{-1/2} \\
&= \Sigma_S^{-1/2} \, \Sigma_S^{1/2} \Sigma_T \Sigma_S^{1/2} \, \Sigma_S^{-1/2} \\
&= \Sigma_T.\end{aligned} \]

We've verified that \( \mu_T + A(x-\mu_S) \sim N(\mu_T, \Sigma_T) \), which means that our optimal transport solution also gives us the KL-divergence minimizer.
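
Putting the pieces together, here is a quick numerical check; this is a sketch reusing gaussian_ot_map and kl_gaussian from the snippets above, with made-up example parameters.

```python
import numpy as np

mu_S, mu_T = np.zeros(2), np.array([1.0, -2.0])
Sigma_S = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_T = np.array([[1.0, -0.3], [-0.3, 0.5]])

A, b = gaussian_ot_map(mu_S, Sigma_S, mu_T, Sigma_T)

# Exact parameters of the pushforward Gaussian N(A mu_S + b, A Sigma_S A^T).
mu_push = A @ mu_S + b
Sigma_push = A @ Sigma_S @ A.T
print(np.allclose(mu_push, mu_T), np.allclose(Sigma_push, Sigma_T))  # expect: True True

# The KL-divergence objective is (numerically) zero at this solution.
print(kl_gaussian(mu_T, Sigma_T, mu_push, Sigma_push))  # ~0 up to floating-point error

# Empirical check on transformed samples.
rng = np.random.default_rng(0)
x = rng.multivariate_normal(mu_S, Sigma_S, size=200_000)
y = x @ A.T + b
print(y.mean(axis=0))            # close to mu_T
print(np.cov(y, rowvar=False))   # close to Sigma_T
```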

I'm using this fact in my ongoing research on domain adaptation under confounding. See the arXiv preprint here.


This was originally published here: https://calvinmccarter.wordpress.com/2022/03/29/mapping-between-two-gaussians-using-optimal-transport-and-the-kl-divergence/

#ML