Error propagation in the Logit-Normal family

I recently came across the problem of propagating uncertainty through basic sum-product probability rules. Suppose you have four probabilities \(p,q,r,k\), each known only up to some uncertainty, framed as a \(\text{LogitNormal}(\mu,\sigma)\) distribution.

As each of those variables is the output of a different regression, it seems only fair to give each its own parameters \((\mu_s, \sigma_s)\) with \( s\in\{p,q,r,k\}\).

Another twist is that I am given an odds ratio as a way to estimate the uncertainty of \(p\) and \(q\) jointly, through the ratio \(p/q\).

In my case I found it simpler to express all of them as Gaussians \(Z_s\) squashed through the inverse logit \(\text{logit}^{-1}(z) = (1+\exp(-z))^{-1}\):

$$ p/q = \dfrac{\text{logit}^{-1}(Z_p)}{\text{logit}^{-1}(Z_q)} = \dfrac{1+\exp(-Z_q)}{1+\exp(-Z_p)} = \exp\left(Z_\text{OR}\right) $$
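As a quick numeric sanity check of this identity (a minimal sketch; the values of \(Z_p, Z_q\) are arbitrary illustrative picks, not from the post):

    import numpy as np

    def inv_logit(z):
        # logistic sigmoid, the inverse of the logit function
        return 1.0 / (1.0 + np.exp(-z))

    z_p, z_q = 1.3, -0.4                     # arbitrary latent Gaussian draws
    lhs = inv_logit(z_p) / inv_logit(z_q)    # p / q
    rhs = (1 + np.exp(-z_q)) / (1 + np.exp(-z_p))
    assert np.isclose(lhs, rhs)
    print(np.log(lhs))                       # the realized value of Z_OR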

I also know that \(k = rq + (1-r)p \) because \(p,q\) are probabilities conditional on an event of probability \(r\) (this is just the law of total probability).

This seems very abstract, but I do have concrete values for some of these quantities: the odds ratio's point estimate is \(5.5\), with a \(95\%\) confidence interval \(\text{OR}\in (4.1, 7.28)\). Let's estimate the mean \(\mu_\star\) and standard deviation \(\sigma_\star\) of the variable \(Z_\text{OR}\).

From Normal CIs to Log-Normal CIs

As it is laid out in this Stack Exchange post, if your variable \(X\sim\text{Normal}(\mu, \sigma)\) and \(Y = \exp X\), a \(95\%\) CI \((a, b)\) for \(Y\) is obtained by solving

$$ \mathbb{P}(a\leq Y \leq b) \triangleq \Phi\left(\dfrac{\log(b)-\mu}{\sigma}\right) - \Phi\left(\dfrac{\log(a)-\mu}{\sigma}\right) = 0.95 $$

With the additional knowledge that the \(95\%\) level corresponds to \(z = 1.96\), since \(2\Phi(1.96) - 1 \approx 0.95\), I am left solving the equations

$$ 4.1 = \exp( \mu\,-\,1.96\sigma)\qquad\text{and}\qquad 7.28 = \exp( \mu + 1.96\sigma)$$

Hence $$\mu_\star = \dfrac{\log(a) + \log(b)}{2}\qquad\text{and}\qquad\sigma_\star = \dfrac{\log(b) - \log(a)}{2\cdot 1.96}$$

Which yields \(\mu_\star \approx 1.6981 \) and \(\sigma_\star \approx 0.1465\).
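In code, the whole CI-to-parameters step is two lines (a minimal sketch of the arithmetic above):

    import math

    a, b = 4.1, 7.28   # 95% CI for the odds ratio
    z = 1.96           # two-sided 95% normal quantile

    mu_star = (math.log(a) + math.log(b)) / 2           # ≈ 1.6981
    sigma_star = (math.log(b) - math.log(a)) / (2 * z)  # ≈ 0.1465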

Getting the distribution of every variable

Now we can get at the distributions of \(p,q\): cross-multiplying the identity above gives

$$ 1+\exp(-Z_q) = \exp\left(Z_\text{OR}\right) + \exp\left(Z_\text{OR}\right)\exp(-Z_p) $$

So I'm back to estimating the distributions of sums and products of a bunch of \(\text{LogNormal}\) RVs, since each \(\exp(\cdot)\) term above is log-normal. According to the Wikipedia page:

Sum of Log-Normals

If \(X,Y\) are independent log-normals, a reasonable approximation of \(S \triangleq X+Y\) is given by a log-normal whose first two moments match those of the sum (the Fenton-Wilkinson approximation):

$$ \sigma_S^2 = \log\left[1 + \dfrac{\exp\left(2\mu_X + \sigma_X^2\right)\left(\exp(\sigma_X^2)-1\right) + \exp\left(2\mu_Y + \sigma_Y^2\right)\left(\exp(\sigma_Y^2)-1\right)}{\left[\exp\left(\mu_X + \sigma_X^2/2\right) + \exp\left(\mu_Y + \sigma_Y^2/2\right)\right]^2}\right] $$

and \(\mu_S = \log\left[\exp\left(\mu_X + \sigma_X^2/2\right) + \exp\left(\mu_Y + \sigma_Y^2/2\right)\right] - \sigma_S^2/2\)
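Here is what a helper implementing this could look like (a sketch of the `sum_lognormals` function used in the code below, whose exact implementation I'm assuming; the `symbolic` flag switches between sympy expressions and plain floats, mirroring how it is called later):

    import math
    import sympy as sp

    def sum_lognormals(params_x, params_y, symbolic=False):
        # Fenton-Wilkinson moment matching: approximate the sum of two
        # independent log-normals by a single log-normal with the same
        # mean and variance.
        exp, log, sqrt = (
            (sp.exp, sp.log, sp.sqrt) if symbolic
            else (math.exp, math.log, math.sqrt)
        )
        mu_x, s_x = params_x
        mu_y, s_y = params_y
        mean = exp(mu_x + s_x**2 / 2) + exp(mu_y + s_y**2 / 2)  # E[X] + E[Y]
        var = (exp(2 * mu_x + s_x**2) * (exp(s_x**2) - 1)       # Var[X]
               + exp(2 * mu_y + s_y**2) * (exp(s_y**2) - 1))    # Var[Y]
        sigma2 = log(1 + var / mean**2)
        return log(mean) - sigma2 / 2, sqrt(sigma2)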

Product of Log-Normals

When you have to compute \( P = XY\) it's much easier: for independent \(X,Y\) the product is exactly log-normal, with \(\mu_P = \mu_X + \mu_Y\) and \(\sigma_P^2 = \sigma_X^2 + \sigma_Y^2\), since \(\log P = \log X + \log Y\) is a sum of independent Gaussians.
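The corresponding product helper is a near one-liner (again a sketch of my own, reusing the imports from the previous snippet; it is variadic because one call below multiplies three variables):

    def prod_lognormals(*params, symbolic=False):
        # Exact for independent log-normals: the log of the product
        # is a sum of independent Gaussians.
        sqrt = sp.sqrt if symbolic else math.sqrt
        mus = [mu for mu, _ in params]
        sigmas = [s for _, s in params]
        return sum(mus), sqrt(sum(s**2 for s in sigmas))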

Wrapping everything up

It is conceptually easy to solve for our parameters, but doing the computations requires layers upon layers of tedious arithmetic.

Thankfully I have implemented this ugly beast in Python. All it required was a symbolic expression simplifier (sympy) and a well-executed implementation of a nonlinear optimization algorithm called BFGS.

Here's a piece of the code:

    import sympy as sp

    mu_q, mu_outcome, mu_condition, mu_or = sp.symbols(
        'mu_q mu_outcome mu_condition mu_or'
    )
    sigma_q, sigma_outcome, sigma_condition, sigma_or = sp.symbols(
        'sigma_q sigma_outcome sigma_condition sigma_or', positive=True
    )

    params_q = (mu_q, sigma_q)
    params_outcome = (mu_outcome, sigma_outcome)
    params_condition = (mu_condition, sigma_condition)
    params_or = (mu_or, sigma_or)

    # Log-normal parameters of both sides of the constraint relating q
    # to the known quantities; the optimizer drives their difference to zero.
    mu_lhs, sigma_lhs = sum_lognormals(
        params_outcome,
        prod_lognormals(params_q, params_condition, symbolic=True),
        symbolic=True,
    )
    mu_rhs, sigma_rhs = sum_lognormals(
        params_q,
        prod_lognormals(params_q, params_or, params_condition, symbolic=True),
        symbolic=True,
    )

    mu_residual = sp.simplify(mu_lhs - mu_rhs)
    sigma_residual = sp.simplify(sigma_lhs - sigma_rhs)
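To actually get numbers out, one can substitute the known parameters and hand the squared residuals to scipy's BFGS. This is a sketch of how the solve might be wired up, not my exact code: the mapping of `outcome` to \(k\) and `condition` to \(r\), and the exact objective, are assumptions, and the input values are the ones derived in this post.

    from scipy.optimize import minimize

    # Known inputs: Z_OR from the CI section, r and k from the summary
    # table below (assuming outcome = k and condition = r).
    known = {
        mu_or: 1.6981, sigma_or: 0.1465,
        mu_outcome: -4.2585, sigma_outcome: 0.21238,
        mu_condition: -4.9321, sigma_condition: 0.2769,
    }
    objective = sp.lambdify(
        (mu_q, sigma_q),
        (mu_residual**2 + sigma_residual**2).subs(known),
    )

    # BFGS is unconstrained; in practice one might optimize log(sigma_q)
    # instead, to keep the standard deviation positive.
    result = minimize(lambda v: objective(*v), x0=[0.0, 0.5], method='BFGS')
    print(result.x)  # estimated (mu_q, sigma_q)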

The results!

If we truncate to 4 decimal places, our trusty algorithm finds the parameter values collected in the summary table at the end of this post.

One last twist: let's not forget we never actually had the expression \(\exp(Z_p)/\exp(Z_q)\), but rather the expression \(\dfrac{1+\exp(-Z_q)}{1+\exp(-Z_p)}\).

This is easily remedied via the so-called three-parameter log-normal distribution: \(1+\exp(-Z_p)\) is just the log-normal \(\exp(-Z_p)\) shifted by a location parameter of \(1\), so the sum and product machinery above still applies once the shift is accounted for.
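One way to fold the shift of \(1\) back into the two-parameter machinery (a sketch; the post does not spell out its exact method, so this assumes simple moment matching of the shifted variable):

    import math

    def shift_lognormal(params, shift=1.0):
        # Moment-match shift + X, with X ~ LogNormal(mu, sigma), by a
        # single two-parameter log-normal: the mean moves by `shift`,
        # the variance is unchanged.
        mu, sigma = params
        mean = shift + math.exp(mu + sigma**2 / 2)
        var = math.exp(2 * mu + sigma**2) * (math.exp(sigma**2) - 1)
        sigma2 = math.log(1 + var / mean**2)
        return math.log(mean) - sigma2 / 2, math.sqrt(sigma2)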

Let's summarize

| Variable name | Log-Normal mean \(\mu\) | Log-Normal std \(\sigma\) |
| --- | --- | --- |
| \(r\in (0.004, 0.013) \) | \(\mu_r = -4.9321\) | \(\sigma_r = 0.2769\) |
| \(k \in (0.01, 0.02) \) | \(\mu_k = -4.2585\) | \(\sigma_k = 0.21238\) |
| \(q = ?\) | \(\mu_q = 3.3583\) | \(\sigma_q = 0.4212\) |
| \(p = \text{OR}\cdot q\) | \(\mu_p = 2.61561\) | \(\sigma_p = 0.42558\) |