## Approximating the differential entropy

This post summarizes the main ideas of a method to approximate the entropy of a continuous random variable \(X\). Recall the definition of \(h(X)\):

$$

h(X) \triangleq \;-\int_\mathcal{X} p(x)\log p(x)\,\mathrm{d}x

$$
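As a quick numerical sanity check (my own sketch, not part of the method), \(h(X)\) can be approximated by discretizing the integral on a grid; for the standard normal the closed form is \(\frac12\log(2\pi e) \approx 1.4189\):

```python
import numpy as np

def differential_entropy(p, xs):
    """Approximate h(X) = -integral of p(x) log p(x) dx by a Riemann sum on the grid xs."""
    dx = xs[1] - xs[0]
    px = p(xs)
    integrand = np.where(px > 0, px * np.log(px), 0.0)
    return -np.sum(integrand) * dx

# standard normal density; closed form gives h = 0.5 * log(2*pi*e) ≈ 1.4189
phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
xs = np.linspace(-10.0, 10.0, 20001)
print(differential_entropy(phi, xs))  # ≈ 1.4189
```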

## Maximum entropy

If we have information about the distribution \(p\) in the form of equality constraints

$$

\mathrm{E}_p[G_i(X)] = \int_\mathcal{X} p(x)G_i(x)\,\mathrm{d}x = c_i\,,\qquad i\in \{1, \dots, n \}

$$

then the distribution \(p_0\) which maximizes the entropy \(h\) while respecting the above constraints must be of the form:

$$ p_0(x) = A\,\exp\left(\sum_{i=1}^n a_iG_i(x)\right) $$

where \(A\) and the \(a_i\) are constants determined by the \(c_i\); computing them requires solving a nonlinear system of equations, which is hard in general.
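A concrete instance (my own illustration): with \(G_1(x) = x\), \(c_1 = 0\) and \(G_2(x) = x^2\), \(c_2 = 1\), the maximum-entropy form is realized by the standard normal, i.e. \(A = 1/\sqrt{2\pi}\), \(a_1 = 0\), \(a_2 = -\frac12\). A quick grid-based check that these constants satisfy the constraints:

```python
import numpy as np

xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]

# exponential-family form p_0(x) = A * exp(a1*x + a2*x^2) with the Gaussian constants
A, a1, a2 = 1 / np.sqrt(2 * np.pi), 0.0, -0.5
p0 = A * np.exp(a1 * xs + a2 * xs**2)

# normalization and the constraints E[x] = 0, E[x^2] = 1 hold up to grid error
print(np.sum(p0) * dx)          # ≈ 1
print(np.sum(xs * p0) * dx)     # ≈ 0
print(np.sum(xs**2 * p0) * dx)  # ≈ 1
```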

## Simplifying the candidate distribution

Let \(\phi\) be the density of the standard normal distribution. The authors of the paper assume the candidate distribution \(p\) is not far from \(\phi\).

Thus, they normalize the data and put extra constraints \(G_{n+1}(x) = x\) with \(c_{n+1} = 0\) and \(G_{n+2}(x) = x^2\) with \(c_{n+2} = 1\).
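Normalizing the data simply enforces these two moment constraints empirically; a minimal sketch (assuming a sample `x`):

```python
import numpy as np

def standardize(x):
    # center and scale so the sample satisfies E[x] = 0 and E[x^2] = 1,
    # i.e. the constraints c_{n+1} = 0 and c_{n+2} = 1 above
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

z = standardize(np.random.default_rng(0).exponential(size=1000))
print(z.mean(), z.var())  # ≈ 0, ≈ 1
```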

They further assume the constraint functions \(G_j\) are orthonormal with respect to the inner product \(\langle f,g \rangle \triangleq \mathrm{E}_{\phi}[fg]\); any set of functions can be orthonormalized in this sense via the Gram-Schmidt algorithm.
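To illustrate this orthonormalization (my own sketch, approximating \(\mathrm{E}_{\phi}\) on a grid), running Gram-Schmidt on the monomials \(1, x, x^2, x^3, x^4\) under \(\langle f,g \rangle = \mathrm{E}_{\phi}[fg]\) recovers the normalized probabilists' Hermite polynomials:

```python
import numpy as np

xs = np.linspace(-12.0, 12.0, 24001)
dx = xs[1] - xs[0]
phi = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)

def inner(f, g):
    # <f, g> = E_phi[f g], approximated by a Riemann sum on the grid
    return np.sum(f * g * phi) * dx

# Gram-Schmidt on the monomials x^0, ..., x^4 under this inner product
basis = []
for k in range(5):
    v = xs**k * 1.0
    for b in basis:
        v = v - inner(v, b) * b
    basis.append(v / np.sqrt(inner(v, v)))

# the resulting functions are orthonormal: the Gram matrix is the identity
gram = np.array([[inner(bi, bj) for bj in basis] for bi in basis])
print(np.round(gram, 6))
```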

Near-Gaussianity implies that the coefficients \(a_i\) in the above expression for \(p_0\) are near zero for \(i \leq n+1\) compared to \(a_{n+2} \approx -\frac12\) since

\begin{align}

p_0(x) &= A\,\exp\left(\sum_{i=1}^{n+2} a_iG_i(x)\right) \\

&= A\,\exp\left(a_{n+2}x^2 + a_{n+1}x + \sum_{i=1}^n a_iG_i(x)\right) \\

& \approx \frac{1}{\sqrt{2\pi}}\exp\left( -\frac12x^2 \right)

\end{align}

therefore \(\delta \triangleq \sum_{i=1}^n a_iG_i(x) \approx 0\), and the first-order expansion \(e^{\delta} \approx 1 + \delta\) (related to Edgeworth expansions, more on this later) gives

$$

p_0(x) \approx \phi(x)\left(1 + \sum_{i=1}^nc_iG_i(x)\right)

$$

where matching the constraints \(\mathrm{E}[G_i(X)] = c_i\) under the orthonormality of the \(G_i\) identifies the first-order coefficients as the \(c_i\).
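To gauge the quality of this first-order expansion numerically (my own check, using the orthonormal fourth Hermite polynomial \(G(x) = (x^4 - 6x^2 + 3)/\sqrt{24}\) and a small coefficient \(c\)), compare the renormalized exponential-family density with its linearization \(\phi(x)(1 + cG(x))\):

```python
import numpy as np

xs = np.linspace(-8.0, 8.0, 16001)
dx = xs[1] - xs[0]
phi = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)

# orthonormal probabilists' Hermite polynomial of degree 4 under E_phi[f g]
G = (xs**4 - 6 * xs**2 + 3) / np.sqrt(24)
c = -0.05  # small coefficient; negative so the exact density stays integrable

# exact exponential-family density, renormalized on the grid
p_exact = phi * np.exp(c * G)
p_exact /= np.sum(p_exact) * dx

# first-order expansion phi * (1 + c * G)
p_lin = phi * (1 + c * G)

print(np.max(np.abs(p_exact - p_lin)))  # small compared to |c|
# orthonormality recovers the expansion coefficient as E[G(X)]:
print(np.sum(G * p_lin) * dx)  # ≈ c
```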

**To be added.**