Matrix Calculus Demystified

Despite having been a machine learning practitioner for quite a while, I had never truly dug into the mathematical details of deep neural networks until this morning, when I went through a paper [1] on matrix calculus and discovered that the maths of deep learning isn’t so challenging after all! (Well, as long as you understand the basics of calculus and linear algebra.)

Jacobians

Consider the derivative of a function \(y = f(x)\):

\[ \frac{dy}{dx} = \frac{df(x)}{dx} = f'(x) \]

If we make \(f(x)\) output a vector instead (\(\mathbf{y} = \mathbf{f}(x)\)), then:

\[\begin{split} \frac{\partial\mathbf{y}}{\partial x} = \begin{bmatrix} f'_1(x) \\ f'_2(x) \\ \vdots \\ f'_m(x) \\ \end{bmatrix} \end{split}\]

If \(x\) is instead a vector (\(y = f(\mathbf{x})\)), then:

\[ \newcommand{\pop}[2]{\dfrac{\partial #1}{\partial #2}} \newcommand{\V}[1]{\mathbf{#1}} \pop{y}{\mathbf{x}} = \nabla y = \begin{bmatrix} \pop{}{x_1}f(\V{x}) & \pop{}{x_2}f(\V{x}) & \cdots & \pop{}{x_n}f(\V{x}) \end{bmatrix} \]
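In PyTorch terms this is just the familiar gradient. Here is a tiny sketch (the function below is a made-up example) using torch.autograd.grad:

```python
import torch

# y = f(x) = x1**2 + 3*x2, so the gradient should be [2*x1, 3]
x = torch.tensor([2.0, 5.0], requires_grad=True)
y = x[0] ** 2 + 3 * x[1]

(grad,) = torch.autograd.grad(y, x)
print(grad)  # tensor([4., 3.])
```

(PyTorch returns the gradient with the same shape as \(\mathbf{x}\); the post writes it as a row vector.)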

Combining the two cases, we get:

\[\begin{split} \newcommand{\pop}[2]{\dfrac{\partial #1}{\partial #2}} \newcommand{\V}[1]{\mathbf{#1}} \pop{\V{y}}{\V{x}} = \begin{bmatrix} \pop{}{x_1}f_1(\V{x}) & \cdots & \pop{}{x_n}f_1(\V{x}) \\ \vdots & \ddots & \vdots \\ \pop{}{x_1}f_m(\V{x}) & \cdots & \pop{}{x_n}f_m(\V{x}) \\ \end{bmatrix} \end{split}\]

which is the Jacobian.
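PyTorch can compute this matrix for us via torch.autograd.functional.jacobian; a small sketch with a made-up function taking a 3-vector to a 2-vector:

```python
import torch
from torch.autograd.functional import jacobian

# f maps a 3-vector to a 2-vector, so the Jacobian is a 2 x 3 matrix
def f(x):
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])
print(jacobian(f, x))
# tensor([[2., 1., 0.],
#         [0., 1., 6.]])
```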

The chain rule

One of the most delightful aspects of matrix calculus is that the chain rule works just as it does in Calculus I:

\[ \newcommand{\pop}[2]{\dfrac{\partial #1}{\partial #2}} \newcommand{\V}[1]{\mathbf{#1}} \pop{}{\V{x}}\V{f}(\V{g}(\V{x})) = \pop{\V{f}}{\V{g}}\pop{\V{g}}{\V{x}} \]
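We can sanity-check this numerically: the Jacobian of the composition should equal the product of the two Jacobians. A small sketch with made-up functions f and g:

```python
import torch
from torch.autograd.functional import jacobian

def g(x):  # g: 2-vector -> 3-vector
    return torch.stack([x[0] * x[1], torch.sin(x[0]), x[1] ** 2])

def f(u):  # f: 3-vector -> 2-vector
    return torch.stack([u[0] + u[1], u[1] * u[2]])

x = torch.tensor([0.5, 2.0])

J_fg = jacobian(lambda v: f(g(v)), x)   # Jacobian of the composition, 2 x 2
J_f = jacobian(f, g(x))                 # 2 x 3, evaluated at g(x)
J_g = jacobian(g, x)                    # 3 x 2

print(torch.allclose(J_fg, J_f @ J_g))  # True
```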

An example

Let’s consider a simple neural network:

(Figure: a simple fully connected network with a 3-unit input layer, a 5-unit hidden layer, and a 3-unit output layer.)

Assuming the input data are stored in the matrix \(\mathbf{X}\) (of shape \(n \times 3\), where \(n\) is the number of samples), the forward pass is:

\[\begin{split} \newcommand{\pop}[2]{\dfrac{\partial #1}{\partial #2}} \newcommand{\V}[1]{\mathbf{#1}} \begin{align*} \V{Z}_{h} &= \V{X}\V{W}_h^T + \V{b}_h \\ (n \times 5 &= (n \times 3)@(5 \times 3)^T + [n\times]5) \\ \V{A}_h &= \text{activation}(\V{Z}_h) \\ (n \times 5 &= n \times 5) \\ \V{Z}_{out} &= \V{A}_h\V{W}_{out}^T + \V{b}_{out} \\ (n \times 3 &= (n \times 5)@(3 \times 5)^T + [n \times] 3) \\ \V{A}_{out} &= \text{activation}(\V{Z}_{out}) \\ (n \times 3 &= n \times 3) \\ \end{align*} \end{split}\]
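For concreteness, here is a minimal PyTorch sketch of this forward pass (random data, the same 3-5-3 shapes, and \(\tanh\) as the activation, which is the one used below; the variable names simply mirror the equations):

```python
import torch

n = 4                          # number of samples (arbitrary for this sketch)
X = torch.randn(n, 3)          # inputs, one sample per row

W_h = torch.randn(5, 3)        # hidden-layer weights
b_h = torch.randn(5)           # hidden-layer bias, broadcast over the n rows
W_out = torch.randn(3, 5)      # output-layer weights
b_out = torch.randn(3)         # output-layer bias

Z_h = X @ W_h.T + b_h          # n x 5
A_h = torch.tanh(Z_h)          # n x 5
Z_out = A_h @ W_out.T + b_out  # n x 3
A_out = torch.tanh(Z_out)      # n x 3
```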

Assuming the loss function is MSE:

\[ L = \frac{1}{3n}\sum_{i=1}^{n}\sum_{j=1}^{3}(\mathbf{A}_{out_{ij}} - \mathbf{Y}_{ij})^2 \]

and the activation function is \(\tanh x\), we can work out the derivatives of \(L\) with respect to the weights and biases using the chain rule:

\[\begin{split} \newcommand{\pop}[2]{\dfrac{\partial #1}{\partial #2}} \newcommand{\V}[1]{\mathbf{#1}} \begin{align*} \pop{L}{\V{W}_{out}} &= \pop{L}{\V{A}_{out}}\pop{\V{A}_{out}}{\V{Z}_{out}}\pop{\V{Z}_{out}}{\V{W}_{out}} \\ \pop{L}{\V{b}_{out}} &= \pop{L}{\V{A}_{out}}\pop{\V{A}_{out}}{\V{Z}_{out}}\pop{\V{Z}_{out}}{\V{b}_{out}} \\ \pop{L}{\V{W}_{h}} &= \pop{L}{\V{A}_{out}}\pop{\V{A}_{out}}{\V{Z}_{out}}\pop{\V{Z}_{out}}{\V{A}_h}\pop{\V{A}_h}{\V{Z}_h}\pop{\V{Z}_h}{\V{W}_h} \\ \pop{L}{\V{b}_{h}} &= \pop{L}{\V{A}_{out}}\pop{\V{A}_{out}}{\V{Z}_{out}}\pop{\V{Z}_{out}}{\V{A}_h}\pop{\V{A}_h}{\V{Z}_h}\pop{\V{Z}_h}{\V{b}_h} \\ \end{align*} \end{split}\]

(which is a ton of partial derivatives)
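This chain is exactly what autograd walks for us when we call backward(). A small self-contained sketch with the same shapes, \(\tanh\) activations and the MSE loss above (random data again):

```python
import torch

n = 4
X, Y = torch.randn(n, 3), torch.randn(n, 3)

W_h = torch.randn(5, 3, requires_grad=True)
b_h = torch.randn(5, requires_grad=True)
W_out = torch.randn(3, 5, requires_grad=True)
b_out = torch.randn(3, requires_grad=True)

A_h = torch.tanh(X @ W_h.T + b_h)
A_out = torch.tanh(A_h @ W_out.T + b_out)
L = ((A_out - Y) ** 2).mean()   # .mean() over the n*3 entries gives the 1/(3n) factor

L.backward()                    # chains all the partial derivatives for us
print(W_out.grad.shape, b_out.grad.shape, W_h.grad.shape, b_h.grad.shape)
# torch.Size([3, 5]) torch.Size([3]) torch.Size([5, 3]) torch.Size([5])
```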

However, we run into some trouble: because we fed in a minibatch, many of these partial derivatives have matrices in both their numerators and their denominators, so they can no longer be written as ordinary Jacobian matrices.

The solution: extend the Jacobian matrices to higher-order objects, namely tensors.

Tensor Calculus

Attention

Ahem. There do seem to be some subtleties of tensor calculus that I have overlooked. After experimenting with PyTorch tensors for a while, I found that the jacobian function wasn’t giving the results I expected, so perhaps I will stick to vector and matrix calculus for now. 😅

However, if we switch back to vectors (a single sample instead of a minibatch), most of the partial derivatives above turn into plain matrices whose entries are ordinary scalar derivatives.

Consider the following example:

\[\begin{split} \newcommand{\pop}[2]{\dfrac{\partial #1}{\partial #2}} \newcommand{\V}[1]{\mathbf{#1}} \begin{align*} \pop{L}{\V{b}_{h}} &= \pop{L}{\V{a}_{out}}\pop{\V{a}_{out}}{\V{z}_{out}}\pop{\V{z}_{out}}{\V{a}_h}\pop{\V{a}_h}{\V{z}_h}\pop{\V{z}_h}{\V{b}_h} \\ (1 \times 5) &= (1 \times 3)@(3 \times 3)@(3 \times 5)@(5 \times 5)@(5 \times 5) \end{align*} \end{split}\]

The six partial derivatives involved can be expressed as follows:

\[\begin{split} \newcommand{\pop}[2]{\dfrac{\partial #1}{\partial #2}} \newcommand{\V}[1]{\mathbf{#1}} \newcommand{\sechsq}[0]{\text{sech}^2\,} \begin{align*} \pop{L}{\V{b}_h} &= \begin{bmatrix} \pop{L}{b_{h_1}} & \pop{L}{b_{h_2}} & \pop{L}{b_{h_3}} & \pop{L}{b_{h_4}} & \pop{L}{b_{h_5}} \\ \end{bmatrix} \\ \pop{L}{\V{a}_{out}} &= \begin{bmatrix} \dfrac{2}{3}(a_{out_1}-y_1) & \dfrac{2}{3}(a_{out_2}-y_2) & \dfrac{2}{3}(a_{out_3}-y_3) \\ \end{bmatrix} \\ \pop{\V{a}_{out}}{\V{z}_{out}} &= \text{diag}(\sechsq z_{out_1},\;\sechsq z_{out_2},\;\sechsq z_{out_3}) \\ \pop{\V{z}_{out}}{\V{a}_h} &= \V{W}_{out} \\ \pop{\V{a}_h}{\V{z}_h} &= \text{diag}(\sechsq z_{h_1},\;\sechsq z_{h_2},\;\sechsq z_{h_3},\;\sechsq z_{h_4},\;\sechsq z_{h_5}) \\ \pop{\V{z}_h}{\V{b}_h} &= \V{I} \end{align*} \end{split}\]

And it all works out.
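As a quick numerical check (a minimal single-sample sketch with the same 3-5-3 shapes and made-up data), the product of these five Jacobians matches what autograd reports for \(\partial L / \partial \mathbf{b}_h\):

```python
import torch

torch.manual_seed(0)

# one sample: x and y are 1 x 3; the hidden layer has 5 units
x, y = torch.randn(1, 3), torch.randn(1, 3)
W_h, b_h = torch.randn(5, 3), torch.randn(5, requires_grad=True)
W_out, b_out = torch.randn(3, 5), torch.randn(3)

z_h = x @ W_h.T + b_h            # 1 x 5
a_h = torch.tanh(z_h)            # 1 x 5
z_out = a_h @ W_out.T + b_out    # 1 x 3
a_out = torch.tanh(z_out)        # 1 x 3
L = ((a_out - y) ** 2).mean()    # MSE, including the 1/3 factor

# the five Jacobians from the chain above (sech^2 z = 1 / cosh^2 z)
dL_da_out = (2 / 3) * (a_out - y)                                    # 1 x 3
da_out_dz_out = torch.diag((1 / torch.cosh(z_out) ** 2).flatten())   # 3 x 3
dz_out_da_h = W_out                                                  # 3 x 5
da_h_dz_h = torch.diag((1 / torch.cosh(z_h) ** 2).flatten())         # 5 x 5
dz_h_db_h = torch.eye(5)                                             # 5 x 5

manual = dL_da_out @ da_out_dz_out @ dz_out_da_h @ da_h_dz_h @ dz_h_db_h  # 1 x 5

(auto,) = torch.autograd.grad(L, b_h)
print(torch.allclose(manual.flatten(), auto))  # True
```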