Error propagation in the Logit-Normal family
I recently came across the problem of computing posteriors and priors according to basic sum-product rules while having uncertainty on the probabilities involved. Suppose you have four probabilities \(p,q,r,k\), each carrying some level of uncertainty framed as a \(\text{LogitNormal}(\mu,\sigma)\).
As each of these variables is the output of a different regression, it seems only fair to give each its own parameters \((\mu_s, \sigma_s)\) with \( s\in\{p,q,r,k\}\).
Another twist is that I am given an Odds-Ratio as a way to estimate the uncertainty of \(p,q\) through \(p/q\).
In my case I found it simpler to express all of them as Gaussians \(Z_s\) squashed through the logistic function (the inverse logit), so that \(s = 1/(1+\exp(-Z_s))\) and
$$
p/q = \dfrac{1+\exp(-Z_q)}{1+\exp(-Z_p)} = \exp\left(Z_\text{OR}\right)
$$
I also know that \(k = rq + (1-r)p \) by the law of total probability, since \(p,q\) are conditional probabilities and \(r\) is the probability of the conditioning event.
This all seems very abstract, but I do have specific values for some of these quantities: \(\mu_\text{OR} = 5.5\), together with a \(95\%\) confidence interval \(\text{OR}\in (4.1, 7.28)\).
Let's estimate the mean \(\mu_\star\) and standard deviation \(\sigma_\star\) of the variable \(Z_\text{OR}\).
From Normal CIs to LogNormal CIs
As laid out in this Stack Exchange post,
if your variable \(X\sim\text{Normal}(\mu, \sigma)\) and \(Y = \exp X\), a \(95\%\) CI for \(Y\) is obtained by solving
$$
\mathbb{P}(a\leq Y \leq b) \triangleq \Phi\left(\dfrac{\log(b)-\mu}{\sigma}\right) - \Phi\left(\dfrac{\log(a)-\mu}{\sigma}\right) = 0.95
$$
Since \(2\Phi(1.96) - 1 = 0.95\), i.e. the symmetric \(95\%\) normal interval spans \(\mu \pm 1.96\,\sigma\), I am left solving the equations
$$ 4.1 = \exp( \mu\,-\,1.96\sigma)\qquad\text{and}\qquad 7.28 = \exp( \mu + 1.96\sigma)$$
Hence
$$\mu_\star = \dfrac{\log(a) + \log(b)}{2}\qquad\text{and}\qquad\sigma_\star = \dfrac{\log(b) - \log(a)}{2\cdot 1.96}$$
This yields \(\mu_\star = 0.74269 \) and \(\sigma_\star = 0.0609\).
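In code, these two closed-form expressions are a one-liner each. A minimal sketch (the function name is mine, not from the original script):

```python
import math

def lognormal_params_from_ci(a: float, b: float, z: float = 1.96) -> tuple[float, float]:
    """Recover (mu, sigma) of Z from a CI (a, b) on Y = exp(Z), assuming the
    interval is symmetric on the log scale:
    a = exp(mu - z * sigma) and b = exp(mu + z * sigma)."""
    mu = (math.log(a) + math.log(b)) / 2
    sigma = (math.log(b) - math.log(a)) / (2 * z)
    return mu, sigma

# The odds-ratio interval from above; z = 1.96 corresponds to a 95% CI.
mu_star, sigma_star = lognormal_params_from_ci(4.1, 7.28)
```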
Getting the distribution of every variable
Now we can recover the distributions of \(p\) and \(q\): rearranging the odds-ratio identity gives
$$
1+\exp(-Z_q) = \exp\left(Z_\text{OR}\right) + \exp\left(Z_\text{OR}\right)\exp(-Z_p)
$$
So I'm back to estimating the distribution of sums and products of a bunch of \(\text{LogNormal}\) RVs. According to the Wikipedia page:
Sum of Log-Normals
If \(X,Y\) are independent log-normals, a reasonable approximation of \(S \triangleq X+Y\) is a log-normal matching the mean and variance of the sum (the Fenton-Wilkinson approximation), with parameters
$$
\sigma_S^2 = \log\left[1 + \dfrac{\exp\left(2\mu_X + \sigma_X^2\right)\left(\exp(\sigma_X^2)-1\right) + \exp\left(2\mu_Y + \sigma_Y^2\right)\left(\exp(\sigma_Y^2)-1\right)}{\left[\exp\left(\mu_X + \sigma_X^2/2\right) + \exp\left(\mu_Y + \sigma_Y^2/2\right)\right]^2}\right]
$$
and \(\mu_S = \log\left[\exp\left(\mu_X + \sigma_X^2/2\right) + \exp\left(\mu_Y + \sigma_Y^2/2\right)\right] - \sigma_S^2/2\)
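Numerically, the moment matching is easy to implement. Here is a minimal sketch of what a helper like the `sum_lognormals` used below might look like (my reconstruction, including the `symbolic` switch its call sites suggest):

```python
import math
import sympy as sp

def sum_lognormals(params_x, params_y, symbolic=False):
    """Fenton-Wilkinson: approximate the sum of two independent log-normals
    by a log-normal matching the mean and variance of the sum."""
    exp, log, sqrt = (sp.exp, sp.log, sp.sqrt) if symbolic else (math.exp, math.log, math.sqrt)
    mu_x, sigma_x = params_x
    mu_y, sigma_y = params_y
    m_x = exp(mu_x + sigma_x**2 / 2)      # E[X]
    m_y = exp(mu_y + sigma_y**2 / 2)      # E[Y]
    v_x = (exp(sigma_x**2) - 1) * m_x**2  # Var(X)
    v_y = (exp(sigma_y**2) - 1) * m_y**2  # Var(Y)
    sigma_s2 = log(1 + (v_x + v_y) / (m_x + m_y) ** 2)
    mu_s = log(m_x + m_y) - sigma_s2 / 2
    return mu_s, sqrt(sigma_s2)
```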
Product of Log-Normals
When you have to compute \( P = XY\) for independent \(X,Y\) it's much easier: the product is exactly log-normal, with \(\mu_P = \mu_X + \mu_Y\) and \(\sigma_P^2 = \sigma_X^2 + \sigma_Y^2\).
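The matching sketch for the product rule (again my reconstruction; variadic, since the script below multiplies three factors at once):

```python
import math
import sympy as sp

def prod_lognormals(*params, symbolic=False):
    """Product of independent log-normals is log-normal:
    the mu's add and the sigma^2's add."""
    sqrt = sp.sqrt if symbolic else math.sqrt
    mu_p = sum(mu for mu, _ in params)
    sigma_p2 = sum(sigma**2 for _, sigma in params)
    return mu_p, sqrt(sigma_p2)
```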
Wrapping everything up
It is conceptually easy to solve for our parameters, but doing the computations requires layers upon layers of tedious arithmetic.
Thankfully I have implemented this ugly beast in Python. All it required was a symbolic expression simplifier and a well-executed implementation of the BFGS nonlinear optimization algorithm.
Here's a piece of the code (wrapped in a function here so the final `return` has something to belong to):

```python
import sympy as sp

def residual_exprs():
    """Build the symbolic residuals to be driven to zero by the optimizer.

    The total-probability constraint is rearranged so that each side is a
    sum of products of log-normals (a difference of log-normals is not
    log-normal), and the (mu, sigma) of both sides are matched."""
    # Location parameters (means on the log scale).
    mu_q, mu_outcome, mu_condition, mu_or = sp.symbols(
        'mu_q mu_outcome mu_condition mu_or'
    )
    # Scale parameters (standard deviations on the log scale).
    sigma_q, sigma_outcome, sigma_condition, sigma_or = sp.symbols(
        'sigma_q sigma_outcome sigma_condition sigma_or', positive=True
    )
    params_q = (mu_q, sigma_q)
    params_outcome = (mu_outcome, sigma_outcome)
    params_condition = (mu_condition, sigma_condition)
    params_or = (mu_or, sigma_or)
    # Left-hand side: outcome + q * condition.
    mu_lhs, sigma_lhs = sum_lognormals(
        params_outcome,
        prod_lognormals(params_q, params_condition, symbolic=True),
        symbolic=True,
    )
    # Right-hand side: q + q * OR * condition.
    mu_rhs, sigma_rhs = sum_lognormals(
        params_q,
        prod_lognormals(params_q, params_or, params_condition, symbolic=True),
        symbolic=True,
    )
    return sp.simplify(mu_lhs - mu_rhs), sp.simplify(sigma_lhs - sigma_rhs)
```
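To actually extract numbers, the residuals can be lambdified and their squared sum handed to SciPy's BFGS. A minimal sketch under my assumptions above (the `known` substitution dictionary, and the reading of `outcome` as \(k\) and `condition` as \(r\), are mine):

```python
import numpy as np
import scipy.optimize
import sympy as sp

mu_res, sigma_res = residual_exprs()

# Substitute the parameters estimated earlier; mapping `outcome` to k and
# `condition` to r is my interpretation of the symbol names.
known = {
    sp.Symbol('mu_or'): 0.74269,
    sp.Symbol('sigma_or', positive=True): 0.0609,
    sp.Symbol('mu_outcome'): -4.2585,
    sp.Symbol('sigma_outcome', positive=True): 0.21238,
    sp.Symbol('mu_condition'): -4.9321,
    sp.Symbol('sigma_condition', positive=True): 0.2769,
}
mu_q, sigma_q = sp.Symbol('mu_q'), sp.Symbol('sigma_q', positive=True)
objective = sp.lambdify(
    (mu_q, sigma_q), (mu_res**2 + sigma_res**2).subs(known), modules='numpy'
)

res = scipy.optimize.minimize(
    lambda x: float(objective(*x)),
    x0=np.array([-4.0, 0.4]),  # rough initial guess
    method='BFGS',
)
mu_q_hat, sigma_q_hat = res.x
```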
The results!
If we round to a few decimal places, our trusty algorithm finds that if
- The condition \( k = p(1-r) + rq \) holds,
- The condition \( p = \text{OR}\cdot q\) holds for some lognormal odds-ratio \(\text{OR}\)
- Each of \(k,p,q,r\) is log-normal,
Then the values
- \( \text{OR}\in (4.1, 7.28)\)
- \(k\in (0.01, 0.02) \)
- \(r\in (0.004, 0.013)\)
imply that
- \(\mu_k = -4.2585,\quad\sigma_k = 0.21238 \)
- \(\mu_r = -4.9321,\quad \sigma_r = 0.2769\)
- \(\mu_q = -4.3583,\quad \sigma_q = 0.4212\)
- and finally \(\mu_p = \mu_q + \mu_\star,\quad \sigma_p = \sqrt{\sigma_q^2 + \sigma_\star^2}\) (numerically, \(\sigma_p = \sqrt{0.4212^2 + 0.0609^2} = 0.42558\))
One last twist: let's not forget we didn't really have the expression \(\exp(Z_p)/\exp(Z_q)\), but rather the expression \(\dfrac{1+\exp(-Z_q)}{1+\exp(-Z_p)}\).
This is easily remedied via the so-called three-parameter log-normal reparametrization:
- \( \mu_p \gets (-\mu_p\;-\;1)\;,\quad \sigma_p \gets \sigma_p\)
- \( \mu_q \gets (-\mu_q\;-\;1)\;,\quad \sigma_q \gets \sigma_q\)
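Applied to the fitted values, this flip-and-shift is a one-liner; a minimal sketch (the helper name is mine):

```python
def shift_flip(mu: float, sigma: float) -> tuple[float, float]:
    """The reparametrization above: mu <- -mu - 1, sigma unchanged."""
    return -mu - 1.0, sigma

mu_q_new, sigma_q_new = shift_flip(-4.3583, 0.4212)     # -> (3.3583, 0.4212)
# mu_p = mu_q + mu_star = -4.3583 + 0.74269 = -3.61561
mu_p_new, sigma_p_new = shift_flip(-3.61561, 0.42558)   # -> (2.61561, 0.42558)
```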
Let's summarize
Variable name | Log-Normal \(\mu\) (mean of \(\log\)) | Log-Normal \(\sigma\) (std of \(\log\)) |
---|---|---|
\(r\in (0.004, 0.013) \) | \(\mu_r = -4.9321\) | \(\sigma_r = 0.2769\) |
\(k \in (0.01, 0.02) \) | \(\mu_k = -4.2585\) | \(\sigma_k = 0.21238\) |
\(q = ?\) | \(\mu_q = 3.3583\) | \(\sigma_q = 0.4212\) |
\(p = \text{OR}\cdot q\) | \(\mu_p = 2.61561\) | \(\sigma_p = 0.42558\) |