# Shae's Ramblings

Stuff I find slightly meaningful and close to achievable

## Proving the multivariate chain rule

This post reviews a proof of the chain rule from Spivak's “Calculus on Manifolds”.
We first recall the definition of a differentiable function $f\colon \mathbb{R}^n\to\mathbb{R}^m$.
The function $f$ is differentiable at $a\in \mathbb{R}^n$ if there is a linear map $\lambda\colon \mathbb{R}^n\to\mathbb{R}^m$ such that

$$\lim_{h\to 0} \dfrac{\lvert f(a+h) - f(a) - \lambda(h) \rvert}{\lvert h \rvert} = 0$$

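where $\lvert \cdot \rvert$ stands for the Euclidean norm $\lVert \cdot \rVert_2$. Such a $\lambda$ is unique when it exists; it is called the derivative of $f$ at $a$ and is denoted $Df(a)$.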
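
To make the definition concrete, here is a small numerical sketch. The function $f(x, y) = (xy,\, x+y)$, the point $a = (1, 2)$ and the candidate map `lam` are our own illustrative choices, not taken from Spivak. For this choice the numerator works out to $\lvert h_1 h_2 \rvert$, so the quotient should shrink linearly with $\lvert h \rvert$.

```python
import math

def f(x, y):
    # Illustrative function R^2 -> R^2 (an assumption made for this example).
    return (x * y, x + y)

def lam(h1, h2):
    # Candidate linear map at a = (1, 2): the matrix [[2, 1], [1, 1]] applied to h.
    return (2.0 * h1 + 1.0 * h2, h1 + h2)

def quotient(h1, h2, a=(1.0, 2.0)):
    # |f(a + h) - f(a) - lam(h)| / |h|, the ratio from the definition.
    fa = f(a[0], a[1])
    fah = f(a[0] + h1, a[1] + h2)
    l = lam(h1, h2)
    err = math.hypot(fah[0] - fa[0] - l[0], fah[1] - fa[1] - l[1])
    return err / math.hypot(h1, h2)

# Evaluate the ratio along h = (t, t) for shrinking t.
quotients = [quotient(t, t) for t in (1e-1, 1e-2, 1e-3)]
print(quotients)
```

A shrinking ratio along one direction is only evidence, not a proof: the definition requires the limit to be zero along every way of letting $h\to 0$.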

## The theorem

If $f\colon \mathbb{R}^n\to\mathbb{R}^m$ is differentiable at $a\in\mathbb{R}^n$ and $g\colon \mathbb{R}^m \to \mathbb{R}^p$ is differentiable at $f(a)\in \mathbb{R}^m$, then $g\circ f$ is differentiable at $a$ and:

$$D(g\circ f) ( a ) = Dg( f( a ) ) \circ Df( a )$$

As a consequence, the Jacobian matrices are related by $J_{g\circ f}(a) = J_g(f(a))\,J_f(a)$, since the matrix of a composition of linear maps is the product of their matrices.
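
As a sanity check of the matrix form of the theorem, the sketch below approximates the three Jacobians by central finite differences and compares $J_{g\circ f}$ with the product $J_g J_f$. The functions `f` and `g`, the point `a`, the step size and the helpers `jacobian` and `matmul` are all hypothetical choices for illustration.

```python
import math

def f(x):
    # Illustrative map R^2 -> R^3 (an assumption for this example).
    return [x[0] * x[1], x[0] + x[1], math.sin(x[0])]

def g(y):
    # Illustrative map R^3 -> R^2 (an assumption for this example).
    return [y[0] * y[1] + y[2], y[0] - y[2] ** 2]

def jacobian(func, point, step=1e-6):
    # Central-difference approximation of the Jacobian of func at point.
    rows, cols = len(func(point)), len(point)
    jac = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        plus, minus = list(point), list(point)
        plus[j] += step
        minus[j] -= step
        fp, fm = func(plus), func(minus)
        for i in range(rows):
            jac[i][j] = (fp[i] - fm[i]) / (2 * step)
    return jac

def matmul(A, B):
    # Plain matrix product of two nested lists.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

a = [1.0, 2.0]
J_f = jacobian(f, a)                      # 3 x 2
J_g = jacobian(g, f(a))                   # 2 x 3, evaluated at f(a)
J_gf = jacobian(lambda x: g(f(x)), a)     # 2 x 2
product = matmul(J_g, J_f)

# Largest entrywise difference between J_{g o f} and J_g J_f.
max_gap = max(abs(J_gf[i][j] - product[i][j])
              for i in range(2) for j in range(2))
print(max_gap)
```

Note that $J_g$ has to be evaluated at $f(a)$, not at $a$; evaluating it at the wrong point is the most common way to get this check wrong.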

## The proof

The idea is that the local linear map of the composition $g\circ f$ should behave similarly to the functions themselves: the linear map $\eta$ of $g\circ f$ at $a$ should be the composition of the linear maps $\lambda, \mu$ of $f$ and $g$.
Thus, define $Df(a) = \lambda$, $Dg( f ( a ) ) = \mu$ and $\mu \circ \lambda = \eta$. Define $b = f(a)$ to simplify notation.
Further, define the functions

$$\begin{cases} &\varphi(x) = f(x) - f(a) - \lambda( x-a ) \\ & \\ &\psi(y) = g(y) - g(b) - \mu( y-b ) \\ & \\ & \rho (x) = g\circ f (x) - g\circ f ( a ) - \mu\circ\lambda ( x-a ) \end{cases}$$

The functions $\varphi, \psi$ are such that

$$\lim_{x\to a}\dfrac{\lvert \varphi(x) \rvert}{\lvert x-a\rvert} = 0 \qquad\text{and}\qquad \lim_{y\to b}\dfrac{\lvert \psi(y) \rvert}{\lvert y-b\rvert } = 0$$

The trick is to express $\rho$ as a composition of functions which we already know: $\varphi, \psi, \lambda, \mu$.
First, notice that $\lambda( x-a ) = f(x) - f(a) - \varphi ( x )$. Substituting this into $\rho$ and using the linearity of $\mu$ gives:

\begin{align}
\rho (x) &= g\circ f (x) - g\circ f ( a ) - \mu(f(x) - f(a)) + \mu(\varphi (x)) \\
&= \bigl[ g(f(x)) - g(b) - \mu(f(x) - b) \bigr] + \mu(\varphi (x)) \\
&= \psi(f(x)) + \mu(\varphi (x))
\end{align}
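
The decomposition can be checked numerically in the simplest case $n = m = p = 1$. The choices $f(x) = x^2$, $g(y) = \sin y$ and $a = 1.5$ below are assumptions made purely for illustration; `lam` and `mu` are the corresponding derivatives written as linear maps.

```python
import math

a = 1.5
b = a ** 2  # b = f(a)

def f(x):
    return x ** 2

def g(y):
    return math.sin(y)

def lam(h):
    # Df(a) as a linear map: h -> f'(a) * h
    return 2 * a * h

def mu(k):
    # Dg(b) as a linear map: k -> g'(b) * k
    return math.cos(b) * k

def phi(x):
    return f(x) - f(a) - lam(x - a)

def psi(y):
    return g(y) - g(b) - mu(y - b)

def rho(x):
    return g(f(x)) - g(f(a)) - mu(lam(x - a))

xs = [a + t for t in (1e-1, 1e-2, 1e-3)]

# rho(x) should equal psi(f(x)) + mu(phi(x)) up to rounding error.
identity_gaps = [abs(rho(x) - (psi(f(x)) + mu(phi(x)))) for x in xs]

# |rho(x)| / |x - a| should shrink as x approaches a.
ratios = [abs(rho(x)) / abs(x - a) for x in xs]
print(identity_gaps, ratios)
```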

By the triangle inequality, $\lvert \rho(x) \rvert \leq \lvert \psi(f(x)) \rvert + \lvert \mu(\varphi(x)) \rvert$, so it is enough to show that

$$\dfrac{\lvert \psi(f(x)) \rvert}{\lvert x-a \rvert} \to 0\qquad\text{and}\qquad \dfrac{\lvert \mu(\varphi(x)) \rvert}{\lvert x-a \rvert} \to 0$$
as $x\to a$. Because $\mu$ is a linear map between finite-dimensional spaces, [we have that](https://write.as/arnov/matrix-vector-product-inequalities) $\lvert\mu(y)\rvert \leq M_\mu\lvert y \rvert$ for some constant $M_\mu$. This means that the second quotient is bounded above by $M_\mu\frac{\lvert\varphi(x)\rvert}{\lvert x-a \rvert}$ and therefore tends to $0$ as $x\to a$.
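
For completeness, here is one way to get an explicit constant (not necessarily the argument used in the linked note). Write $\mu(y) = Cy$ for a $p\times m$ matrix $C = (c_{ij})$; the name $C$ is our own notation. Applying the Cauchy-Schwarz inequality to each row gives
\begin{align}
\lvert \mu(y) \rvert^2 = \sum_{i=1}^{p} \Bigl( \sum_{j=1}^{m} c_{ij}\, y_j \Bigr)^2
&\leq \sum_{i=1}^{p} \Bigl( \sum_{j=1}^{m} c_{ij}^2 \Bigr) \Bigl( \sum_{j=1}^{m} y_j^2 \Bigr) \\
&= \Bigl( \sum_{i=1}^{p} \sum_{j=1}^{m} c_{ij}^2 \Bigr) \lvert y \rvert^2
\end{align}
so $M_\mu = \bigl( \sum_{i,j} c_{ij}^2 \bigr)^{1/2}$ is one admissible choice. It is usually not the smallest possible constant, but any finite constant is enough for the proof.
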
To prove that the first quotient tends to zero, we have to prove that

for any $\varepsilon > 0$ there is a $\delta > 0$
such that $\lvert x - a \rvert < \delta$ implies $\lvert \psi(f(x)) \rvert \leq \varepsilon \lvert x - a \rvert$

First, we bound $\lvert f(x)-b \rvert$ in terms of $\lvert x-a \rvert$. Pick any $\varepsilon_1 > 0$ (say $\varepsilon_1 = 1$). Because $\lim_{x\to a}\frac{\lvert \varphi(x) \rvert}{\lvert x-a\rvert} = 0$, there is a $\delta_1 > 0$ such that $\lvert x - a \rvert < \delta_1$ implies $\lvert \varphi(x) \rvert \leq \varepsilon_1 \lvert x - a \rvert$. Since $\lambda$ is linear, $\lvert \lambda(h) \rvert \leq M_\lambda \lvert h \rvert$ for some constant $M_\lambda$, so for $\lvert x - a \rvert < \delta_1$
\begin{align}
\lvert f(x)-b \rvert = \lvert f(x) - f(a) \rvert &= \lvert \varphi(x) + \lambda ( x-a ) \rvert \\
& \leq \lvert \varphi(x)\rvert + \lvert \lambda ( x-a )\rvert \\
& \leq \varepsilon_1 \lvert x - a \rvert + M_\lambda \lvert x - a \rvert
\end{align}

Next, set $\varepsilon_0 = \varepsilon / (\varepsilon_1 + M_\lambda)$. Because $\lim_{y\to b}\frac{\lvert \psi(y) \rvert}{\lvert y-b\rvert } = 0$ and $\psi(b) = 0$, there is an $r > 0$ such that $\lvert y - b \rvert < r$ implies $\lvert \psi(y) \rvert \leq \varepsilon_0 \lvert y-b \rvert$. Choose $\delta_0 = r / (\varepsilon_1 + M_\lambda)$. By the bound above, $\lvert x - a \rvert < \min(\delta_0, \delta_1)$ implies $\lvert f(x) - b \rvert < r$, and therefore
$$\lvert \psi(f(x)) \rvert \leq \varepsilon_0 \lvert f(x)-b \rvert \leq \varepsilon_0(\varepsilon_1 + M_\lambda)\lvert x - a \rvert$$

Thus, for $\lvert x - a \rvert < \delta = \min(\delta_0, \delta_1)$ we have that $\lvert \psi(f(x)) \rvert \leq \varepsilon \lvert x - a \rvert$, as required.

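Both quotients tend to $0$, so $\frac{\lvert \rho(x) \rvert}{\lvert x-a \rvert} \to 0$ as $x\to a$. This is exactly the statement that $g\circ f$ is differentiable at $a$ with $D(g\circ f)(a) = \mu\circ \lambda = Dg( f( a ) ) \circ Df( a )$.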