## Proving the multivariate chain rule

This post reviews a proof of the multivariate chain rule from Spivak's “Calculus on Manifolds”.

We first recall the definition of a differentiable function \(f\colon \mathbb{R}^n\to\mathbb{R}^m\).

The function \(f\) is differentiable at \(a\in \mathbb{R}^n\) if there is a linear map \(\lambda\colon \mathbb{R}^n\to\mathbb{R}^m\) such that

$$

\lim_{h\to 0} \dfrac{\lvert f(a+h) - f(a) - \lambda(h) \rvert}{\lvert h \rvert} = 0

$$

where \(\lvert \cdot \rvert\) stands for the Euclidean norm \(\lVert \cdot \rVert_2\). When such a \(\lambda\) exists it is unique, and it is denoted \(Df(a)\).
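To make the definition concrete, here is a minimal numerical sketch. The map `f` and the point `a` are hypothetical examples chosen for illustration (they are not from Spivak), and `lam` is the candidate linear map built from the partial derivatives of `f`. For this particular `f` the remainder works out to \((h_1h_2, h_2^2)\), so the quotient should shrink in proportion to \(\lvert h\rvert\).

```python
import math

def f(x, y):
    # Hypothetical example map f: R^2 -> R^2 (an assumption for illustration).
    return (x * y, x + y * y)

def lam(a, h):
    # Candidate linear map at a: the Jacobian of f at a, applied to h.
    a1, a2 = a
    h1, h2 = h
    return (a2 * h1 + a1 * h2, h1 + 2 * a2 * h2)

def norm(v):
    # Euclidean norm.
    return math.sqrt(sum(c * c for c in v))

def quotient(a, h):
    # |f(a+h) - f(a) - lam(h)| / |h|, the quantity in the definition.
    fa = f(*a)
    fah = f(a[0] + h[0], a[1] + h[1])
    lh = lam(a, h)
    remainder = tuple(fah[i] - fa[i] - lh[i] for i in range(2))
    return norm(remainder) / norm(h)

a = (1.0, 2.0)
steps = (1e-1, 1e-2, 1e-3)
ratios = [quotient(a, (t, -t)) for t in steps]
print(ratios)  # each ratio is roughly equal to the step size t
```

A numerical check along one direction does not prove differentiability, of course; it only illustrates what the limit in the definition is measuring.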

## The theorem

If \(f\colon \mathbb{R}^n\to\mathbb{R}^m\) is differentiable at \(a\in\mathbb{R}^n\) and \(g\colon \mathbb{R}^m \to \mathbb{R}^p\) is differentiable at \(f(a)\in \mathbb{R}^m\), then \(g\circ f\) is differentiable at \(a\) and:

$$ D(g\circ f) ( a ) = Dg( f( a ) ) \circ Df( a ) $$

As a consequence, the *Jacobian matrices* \(J_{g\circ f}, J_g, J_f\) are related by \( J_{g\circ f}(a) = J_g(f(a))\,J_f(a)\), because the matrix of a composition of linear maps is the product of their matrices.
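The matrix form of the theorem can be sanity-checked numerically. The sketch below uses two hypothetical polynomial maps `f` and `g` (assumptions chosen so the Jacobians are easy to write by hand) and compares the product of the hand-computed Jacobians with a central-difference approximation of the Jacobian of `g ∘ f`. The helpers `matmul` and `numeric_jacobian` are illustrative names, not library functions.

```python
def f(x):
    # Hypothetical example map f: R^2 -> R^2.
    return [x[0] * x[1], x[0] + x[1] ** 2]

def g(y):
    # Hypothetical example map g: R^2 -> R^2.
    return [y[0] ** 2 + y[1], y[0] * y[1]]

def jac_f(x):
    # Jacobian of f, computed by hand.
    return [[x[1], x[0]], [1.0, 2 * x[1]]]

def jac_g(y):
    # Jacobian of g, computed by hand.
    return [[2 * y[0], 1.0], [y[1], y[0]]]

def matmul(A, B):
    # Plain matrix product of nested lists.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def numeric_jacobian(F, x, eps=1e-6):
    # Central-difference approximation; entry [i][j] approximates dF_i/dx_j.
    cols = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        cols.append([(p - m) / (2 * eps) for p, m in zip(F(xp), F(xm))])
    return [[cols[j][i] for j in range(len(x))] for i in range(len(cols[0]))]

a = [1.0, 2.0]
product = matmul(jac_g(f(a)), jac_f(a))          # J_g(f(a)) J_f(a)
numeric = numeric_jacobian(lambda x: g(f(x)), a)  # approximates J_{g∘f}(a)
max_err = max(abs(product[i][j] - numeric[i][j])
              for i in range(2) for j in range(2))
print(product, max_err)
```

Note that \(J_g\) is evaluated at \(f(a)\), not at \(a\); mixing up the evaluation points is the most common mistake when applying the formula.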

## The proof

The idea is that the local linear map of the composition \(g\circ f\) should behave similarly to the functions themselves: the linear map \(\eta\) of \(g\circ f\) at \(a\) should be the *composition* of the linear maps \(\lambda, \mu\) of \(f\) and \(g\).

Thus, define \(Df(a) = \lambda\), \(Dg( f ( a ) ) = \mu\) and \(\mu \circ \lambda = \eta\). Define \(b = f(a)\) to simplify notation.

Further, define the functions

$$

\begin{cases}

&\varphi(x) = f(x) - f(a) - \lambda( x-a ) \\

& \\

&\psi(y) = g(y) - g(b) - \mu( y-b ) \\

& \\

& \rho (x) = g\circ f (x) - g\circ f ( a ) - \mu\circ\lambda ( x-a )

\end{cases}

$$

The functions \(\varphi, \psi\) are such that

$$ \lim_{x\to a}\dfrac{\lvert \varphi(x) \rvert}{\lvert x-a\rvert} = 0 \qquad\text{and}\qquad \lim_{y\to b}\dfrac{\lvert \psi(y) \rvert}{\lvert y-b\rvert } = 0$$

The trick is to express \(\rho\) in terms of functions we already control: \(\varphi, \psi, \lambda, \mu\).

First, notice that \(\lambda( x-a ) = f(x) - f(a) - \varphi ( x )\). Substituting this into \(\rho\) and using the linearity of \(\mu\), together with \(f(a) = b\), gives:

\begin{align}

\rho (x) &= g( f (x)) - g(b) - \mu(f(x) - b) + \mu(\varphi (x)) \\

&= \psi(f(x)) + \mu(\varphi (x))

\end{align}

By the triangle inequality, \(\lvert \rho(x) \rvert \leq \lvert \psi(f(x)) \rvert + \lvert \mu(\varphi(x)) \rvert\), so it is enough to show that

$$\dfrac{\lvert \psi(f(x)) \rvert}{\lvert x-a \rvert} \to 0\qquad\text{and}\qquad \dfrac{\lvert \mu(\varphi(x)) \rvert}{\lvert x-a \rvert} \to 0$$

as \(x\to a\). Because \(\mu\) is a linear map between finite-dimensional spaces, [we have that](https://write.as/arnov/matrix-vector-product-inequalities) \(\lvert\mu(y)\rvert \leq M_\mu\lvert y \rvert\) for some constant \(M_\mu\) and all \(y\). This means that the second quotient is bounded above by \(M_\mu\frac{\lvert\varphi(x)\rvert}{\lvert x-a \rvert}\) and therefore tends to \(0\) as \(x\to a\).
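The decomposition \(\rho = \psi\circ f + \mu\circ\varphi\) and the bound on \(\mu\) can both be checked on a small example. In the sketch below, `f`, `g`, the point `a` and the test point `x` are hypothetical choices, `LAM` and `MU` are the hand-computed Jacobians of `f` at `a` and of `g` at `b = f(a)`, and the Frobenius norm of `MU` is used as one valid choice of \(M_\mu\) (it works by the Cauchy-Schwarz inequality; it is not necessarily the constant used in the linked post).

```python
import math

def norm(v):
    # Euclidean norm.
    return math.sqrt(sum(c * c for c in v))

def apply(M, v):
    # Matrix-vector product for nested lists.
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def sub(u, v):
    return [p - q for p, q in zip(u, v)]

def f(x):
    # Hypothetical example map f: R^2 -> R^2.
    return [x[0] * x[1], x[0] + x[1] ** 2]

def g(y):
    # Hypothetical example map g: R^2 -> R^2.
    return [y[0] ** 2 + y[1], y[0] * y[1]]

a = [1.0, 2.0]
b = f(a)                         # b = f(a) = [2, 5]
LAM = [[2.0, 1.0], [1.0, 4.0]]   # lambda = Df(a), computed by hand
MU = [[4.0, 1.0], [5.0, 2.0]]    # mu = Dg(b), computed by hand

def phi(x):
    return sub(sub(f(x), f(a)), apply(LAM, sub(x, a)))

def psi(y):
    return sub(sub(g(y), g(b)), apply(MU, sub(y, b)))

def rho(x):
    eta = apply(MU, apply(LAM, sub(x, a)))  # (mu ∘ lambda)(x - a)
    return sub(sub(g(f(x)), g(f(a))), eta)

# One admissible constant M_mu: the Frobenius norm of the matrix of mu.
M_mu = math.sqrt(sum(m * m for row in MU for m in row))

x = [1.3, 1.6]
lhs = rho(x)
rhs = [p + q for p, q in zip(psi(f(x)), apply(MU, phi(x)))]
gap = norm(sub(lhs, rhs))  # zero up to floating-point rounding
bound_holds = norm(apply(MU, phi(x))) <= M_mu * norm(phi(x))
print(gap, bound_holds)
```

The identity holds exactly for every `x`, not only near `a`; closeness to `a` only matters for the limits that follow.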

To prove that the first quotient tends to zero, we have to prove that

for any \(\varepsilon > 0\) there is a \(\delta > 0\)

such that \(0 < \lvert x - a \rvert < \delta\) implies \(\lvert \psi(f(x)) \rvert \leq \varepsilon \lvert x - a \rvert\).

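First, because \(\lim_{y\to b}\frac{\lvert \psi(y) \rvert}{\lvert y-b\rvert } = 0\) and \(\psi(b) = 0\), for any \(\varepsilon_0 > 0\) there is a \(\delta' > 0\) such that \(\lvert y - b \rvert < \delta'\) implies \(\lvert \psi(y) \rvert \leq \varepsilon_0 \lvert y-b \rvert\). Since \(f\) is differentiable at \(a\), it is also continuous there, so we can pick \(\delta_0 > 0\) such that \(\lvert x - a \rvert < \delta_0\) implies \(\lvert f(x) - b \rvert < \delta'\), and therefore

$$\lvert \psi(f(x)) \rvert \leq \varepsilon_0 \lvert f(x)-b \rvert$$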

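The right-hand side can in turn be bounded by a multiple of \(\lvert x - a \rvert\). Since \(f(x) - f(a) = \varphi(x) + \lambda(x-a)\),

\begin{align}

\lvert f(x)-b \rvert = \lvert f(x) - f(a) \rvert &= \lvert \varphi(x) + \lambda ( x-a ) \rvert \\

& \leq \lvert \varphi(x)\rvert + \lvert \lambda ( x-a )\rvert \\

& \leq \varepsilon_1 \lvert x - a \rvert + M_\lambda \lvert x - a \rvert

\end{align}

where \(M_\lambda\) is a constant with \(\lvert\lambda(h)\rvert \leq M_\lambda\lvert h\rvert\) for all \(h\), and the last inequality holds whenever \(\lvert x - a \rvert < \delta_1\), for some \(\delta_1 > 0\) that depends on \(\varepsilon_1\). Hence, for \(\lvert x - a \rvert < \min(\delta_0, \delta_1)\),

$$\lvert \psi(f(x)) \rvert \leq \varepsilon_0(\varepsilon_1 + M_\lambda)\lvert x - a \rvert$$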

The order of the choices matters. We first fix any \(\varepsilon_1 > 0\) (for instance \(\varepsilon_1 = 1\)) and find the corresponding \(\delta_1\) that makes the bound on \(\lvert\varphi(x)\rvert\) hold. Then we set \(\varepsilon_0 = \varepsilon / (\varepsilon_1 + M_\lambda)\) and find the corresponding \(\delta_0\).

Thus, for \(\lvert x - a \rvert < \delta = \min(\delta_0, \delta_1)\) we have that \(\lvert \psi(f(x)) \rvert \leq \varepsilon \lvert x - a \rvert\), as required.

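Both quotients tend to \(0\), so \(\lvert\rho(x)\rvert / \lvert x-a\rvert \to 0\) as \(x\to a\). By the definition of differentiability, this proves that \(D(g\circ f)(a) = \mu\circ \lambda = Dg( f( a ) ) \circ Df( a )\).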
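As a closing illustration, the conclusion can be observed numerically: the quotient \(\lvert\rho(x)\rvert / \lvert x-a\rvert\) should shrink as \(x\to a\). The sketch below assumes hypothetical example maps `f` and `g`, and `eta` is the hand-computed composite linear map \(\mu\circ\lambda\) at the chosen point `a`; none of these come from the original proof.

```python
import math

def norm(v):
    # Euclidean norm.
    return math.sqrt(sum(c * c for c in v))

def f(x):
    # Hypothetical example map f: R^2 -> R^2.
    return [x[0] * x[1], x[0] + x[1] ** 2]

def g(y):
    # Hypothetical example map g: R^2 -> R^2.
    return [y[0] ** 2 + y[1], y[0] * y[1]]

def eta(h):
    # mu ∘ lambda at a = (1, 2): the matrix product [[4, 1], [5, 2]] [[2, 1], [1, 4]].
    return [9.0 * h[0] + 8.0 * h[1], 12.0 * h[0] + 13.0 * h[1]]

def rho_ratio(a, h):
    # |g(f(a+h)) - g(f(a)) - eta(h)| / |h|
    x = [a[0] + h[0], a[1] + h[1]]
    gx, ga, eh = g(f(x)), g(f(a)), eta(h)
    remainder = [gx[i] - ga[i] - eh[i] for i in range(2)]
    return norm(remainder) / norm(h)

a = [1.0, 2.0]
ratios = [rho_ratio(a, [t, 2 * t]) for t in (1e-1, 1e-2, 1e-3, 1e-4)]
print(ratios)  # the values decrease roughly in proportion to t
```

Replacing `eta` with any other linear map would leave a quotient that does not tend to zero, which reflects the uniqueness of the derivative.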