Neural Networks from Scratch

Published May 17, 2024


Bootstrapping a mathematical and programmatic description of Neural Networks


Introduction

There are many great online resources for learning about neural networks. However, the recent fashionability of neural nets has seen a proliferation of introductory materials which mostly repackage the same few ideas. Search engines love the broad appeal of these blog posts and articles, but technically inclined readers are often left without the necessary intuitions that are required to build and operate these networks.

Fortunately, for such a powerful tool, Feedforward Neural Networks are relatively simple objects, the mechanics of which are digestible by those with a working knowledge of vector calculus and linear algebra.

The complexity of the study of networks as universal approximators comes when trying to prove statements about their expressivity and trainability, but these topics are not explored here.

In this essay, my goal is to build the core concepts of FNNs from the ground up while filling in some oft-missed details from other popular educational materials. To that end, this essay assumes the following of the reader.

  • Comfort with linear maps between real vector spaces and their matrix representations

  • Familiarity with multivariate calculus

  • Familiarity with the basic structure of a neural network

I would not publish this essay unless I thought I was bringing something new to the table; herein you will find the following:

  • An intuitive bridge between the index and vector interpretation of backpropagation

  • Precision and rigor preferred over notational simplicity

  • Special attention given to pitfalls on the way to building intuition

  • Intuition checks which I have not found elsewhere online

Please email any questions or feedback to brendan.schlaman@gmail.com. Enjoy!

Heteroassociative memory and Hebb’s Rule

Linear models such as linear associators are a critical building block in understanding neural networks, not only for their role in stitching together layers in a network, but also for their ability to store relationships.

Perhaps the simplest type of linear model is associative memory (content-addressable memory). Associators map arbitrary input representations to arbitrary output representations. We will focus on heteroassociative memory, which is used to retrieve patterns which differ (e.g. in encoding, dimensionality) from the input pattern, in contrast to autoassociative memory, used for pattern correction.


A theory of synaptic plasticity was introduced by Donald Hebb in 1949, which showed how simple interactions between neurons can eventually encode learned information. The mathematical formulation of these ideas is called Hebbian learning, and it is a natural starting point for constructing a heteroassociative memory network.

A word of caution on an oft-repeated falsehood: while it is indeed true that the neurons of the human brain provided inspiration for elementary linear models, there is little in common between how modern neural nets learn and how humans learn!

The simple linear associator

The canonical application of Hebbian learning is the simple linear associator. It is a linear map from independent inputs $\mathbf{x}_i \in \mathbb{R}^n$ to targets $\mathbf{y}_i \in \mathbb{R}^m$.

$$\hat{\mathbf{Y}} = \mathbf{W} \mathbf{X}$$

where $\mathbf{X}$ and $\mathbf{Y}$ are matrices comprising the input and target patterns $\mathbf{x}_i$ and $\mathbf{y}_i$ respectively, arranged as columns. $\hat{\mathbf{Y}}$ denotes the model's output (which may or may not match the desired targets $\mathbf{Y}$). The system learns by identifying a weight matrix $\mathbf{W}$ which stores information about the associations between inputs and targets such that recall looks like a linear transformation from input to target space.

This type of "learning" is something of a toy, but it should convince you that information can be stored in the weights between neurons, and indeed, that doing so can require less space than rote memorization.

As we will see, perfect recall can be achieved under certain special conditions.

One-shot Hebbian learning (correlation memory)

Simple linear associators can be trained in one shot using Hebb's Rule by constructing a correlation matrix of the inputs and targets. Each entry in $\mathbf{W}$ represents the correlation between an input and an output neuron across the datapoints (indexed by $k$).

$$\mathbf{W} = \sum_{k=1}^m \mathbf{y}^{(k)} (\mathbf{x}^{(k)})^\intercal = \mathbf{Y} \mathbf{X}^\intercal$$

Retrieval from a linear correlation memory reveals its limitations:

$$\hat{\mathbf{y}}^{(h)} = \left( \sum_k \mathbf{y}^{(k)} (\mathbf{x}^{(k)})^\intercal \right) \mathbf{x}^{(h)} = \left\|\mathbf{x}^{(h)}\right\|^2 \mathbf{y}^{(h)} + \sum_{k \ne h} \mathbf{y}^{(k)} (\mathbf{x}^{(k)})^\intercal\mathbf{x}^{(h)}$$

Perfect retrieval imposes two strict requirements:

  1. $\mathbf{x}^{(h)}$ must be normalized such that the $\left\|\mathbf{x}^{(h)}\right\|^2$ term is 1 (or retrieval can be altered to divide by the magnitude of the input).

  2. The $\mathbf{x}^{(k)}$ must be pairwise orthogonal such that the "cross-talk" terms $\mathbf{y}^{(k)} (\mathbf{x}^{(k)})^\intercal\mathbf{x}^{(h)}$ are 0.

If these conditions are met, perfect retrieval after a single round of training is easy to prove. Suppose we are given a set of $n$ inputs $\mathbf{X} \in \mathbb{R}^{n \times n}$ and $n$ target values $\mathbf{T} \in \mathbb{R}^{m \times n}$.

$$\mathbf{W} \mathbf{X} = \mathbf{T} \mathbf{X}^\intercal\mathbf{X} = \mathbf{T} \mathbf{X}^{-1} \mathbf{X} = \mathbf{T}$$

This follows from the fact that orthonormal matrices have an inverse equal to their transpose.
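
To make this concrete, here is a minimal numpy sketch (my own illustration; the dimensions and random patterns are arbitrary choices): it builds $\mathbf{W} = \mathbf{Y}\mathbf{X}^\intercal$ from orthonormal inputs and checks that recall is exact.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 4, 3                      # input / output dimensions
# Build n orthonormal input patterns (columns of X) via QR decomposition.
X, _ = np.linalg.qr(rng.normal(size=(n, n)))
Y = rng.normal(size=(m, n))      # arbitrary target patterns, one per column

# One-shot Hebbian learning: W = Y X^T (a sum of outer products y^(k) x^(k)^T).
W = Y @ X.T

# Recall: because the x^(k) are orthonormal, the cross-talk terms vanish.
Y_hat = W @ X
print(np.allclose(Y_hat, Y))     # True: perfect retrieval
```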

Storage analysis

The orthogonality restriction greatly reduces the capacity of this memory. However, there are indeed theoretical savings to be had here. If tasked with storing an arbitrary linear map $L : \mathbb{R}^{n \times n} \to \mathbb{R}^{m \times n}$, one might naively record the inputs and outputs, requiring $O(n^2 + mn)$ storage space for $\mathbf{X}$ and $\mathbf{Y}$ [1]. Encoding the map as a Simple Linear Associator requires only $O(mn)$, the size of $\mathbf{W}$.

Generalization: on-line Hebbian learning

The above one-shot Hebbian learning rule can be generalized to accept inputs at different times, and indeed, in continuous time.

$$\Delta_p w_{ij} = \eta a_{jp} y_{pi}; \quad \Delta_p W = \eta \mathbf{y}_p \mathbf{a}^\intercal; \quad \Delta \mathbf{W} = \eta \mathbf{Y} \mathbf{a}^\intercal$$

where $\mathbf{y}_p \mathbf{a}^\intercal$ denotes an outer product.

Hebbian on-line supervised learning:

$$\Delta \mathbf{W} = \eta \mathbf{y} \mathbf{x}^\intercal = \eta (\mathbf{W} \mathbf{x}) \mathbf{x}^\intercal$$

The storage capacity for this associator is the same; only $n$ input-output pairs can be memorized at any given time, but here, pairs can be unlearned or reassigned at will.
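
As a quick check (a minimal sketch of my own, assuming orthonormal inputs and a learning rate of 1), accumulating the on-line outer-product updates one pair at a time reproduces the one-shot correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, eta = 4, 3, 1.0

X, _ = np.linalg.qr(rng.normal(size=(n, n)))   # orthonormal inputs as columns
Y = rng.normal(size=(m, n))

# Present the pairs one at a time and accumulate Hebbian outer products.
W = np.zeros((m, n))
for p in range(n):
    x_p, y_p = X[:, p], Y[:, p]
    W += eta * np.outer(y_p, x_p)              # Delta_p W = eta * y_p x_p^T

print(np.allclose(W, Y @ X.T))                 # True: same as the one-shot rule
print(np.allclose(W @ X, Y))                   # True: perfect recall
```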

Hebbian learning: a closer look

Consider a single iteration of online Hebbian learning with orthogonal inputs. That is, imagine $\Delta_p \mathbf{W}$ after being presented with the $p$th input / output pair, $\left(\mathbf{x}^{(p)}, \mathbf{y}^{(p)}\right)$. $\Delta_p \mathbf{W} \in \mathbb{R}^{m \times n}$ is then an outer product:

$$\Delta_p \mathbf{W} = \mathbf{y}^{(p)} \left(\mathbf{x}^{(p)}\right)^\intercal = \begin{bmatrix} y_1^{(p)} x_1^{(p)} & \dots & y_1^{(p)} x_n^{(p)} \\ \vdots & \ddots & \vdots \\ y_m^{(p)} x_1^{(p)} & \dots & y_m^{(p)} x_n^{(p)} \end{bmatrix}$$

In index notation, $\Delta_p w_{ij} = y_i^{(p)} x_j^{(p)}$. The $i$th row of $\Delta_p \mathbf{W}$ can be interpreted as the input $\mathbf{x}^{(p)}$ scaled by the output $y_i^{(p)}$. Thus, during recall (let $\mathbf{W} = \Delta_p \mathbf{W}$), the $\hat{y}_i^{(p)} = \left[\mathbf{W}\right]_i \cdot \mathbf{x}^{(p)}$ are obtained by projecting $\mathbf{x}^{(p)}$ onto each of these scaled input vectors, yielding the original $y_i^{(p)}$ (recall $\|\mathbf{x}\| = 1$).

This should make it clear why perfect recall requires that the $\mathbf{x}$ be orthonormal. Orthonormal $\mathbf{X}$ means that the updates $\Delta_p \mathbf{W}$ which sum to $\mathbf{W} = \sum_p \Delta_p \mathbf{W}$ lie along mutually orthogonal input directions, such that during recall, there will be no "crosstalk" or "interference" from other inputs.

Key takeaway:
The memory mechanism of Hebbian learning can be viewed as storing scaled versions of the input as the rows of $\mathbf{W}$, and layering these orthogonal vectors using vector addition.

Beyond Hebb: the Delta Rule (Widrow-Hoff)

The orthonormality requirement for the inputs $\mathbf{X}$ above can be relaxed to a less restrictive linear independence requirement by introducing a new strategy for learning: the delta rule.

$$\label{eq:DeltaRule} \Delta w_{ij} = \eta (t_i(t) - a_i(t)) o_j(t)$$

Widrow-Hoff is multiple linear regression

Analogy:

  • Each element of the target vector for i/o pair $(p, \mathbf{t}_p)$ is analogous to a to-be-predicted observation $y_j$

  • prediction variables $\mathbf{x}_j$ are analogous to input vectors $i_p$

  • regression coefficients $\theta$ correspond to a row in the weight matrix $W$

  • the intercept $\theta_0$ corresponds to the bias term often assumed for the units

  • the delta rule converges to a form analogous to the linear regression estimator:

    $$\mathbb{E}[\mathbf{W}_\infty] = \mathbb{E}[\mathbf{i}^\intercal\mathbf{i}]^+ \mathbb{E}[\mathbf{i}^\intercal\mathbf{t}]$$

Again, we wish to attain perfect recall of inputs $\mathbf{X}$. One might assume that, much like in the case of Hebbian learning with orthogonal inputs, the system can be solved with matrix inversion:

$$\begin{gathered} \mathbf{Y} = \mathbf{W} \mathbf{X} \\ \mathbf{Y} \mathbf{X}^\intercal= \mathbf{W} \mathbf{X} \mathbf{X}^\intercal\\ \mathbf{W} = \mathbf{Y} \mathbf{X}^\intercal\left(\mathbf{X} \mathbf{X}^\intercal\right)^{-1} \end{gathered}$$

Interestingly, this appears to be the exact form of the unique solution of a multiple linear regression estimator. But there is a subtle catch! In a linear regression setting, this only works if the feature vectors are linearly independent; in other words, the system must be overdetermined. This allows the Gram matrix $\mathbf{X}^\intercal\mathbf{X}$ to be inverted (recall that, by convention, the columns of $\mathbf{X}$ are the features in linear regression).

By contrast, in a linear association context, an underdetermined system is required to enable perfect recall. Thus, unfortunately, $\mathbf{X} \mathbf{X}^\intercal$ is not invertible, and so the above procedure will not work.

One way to get around this limitation is to purposefully remove information from the inputs by truncating $\mathbf{X}$ such that its number of rows equals the number of pairs which must be perfectly recalled. This is equivalent to padding $\mathbf{W}$ with 0s, representing a neural network that shuts off the signal from these redundant input neurons.

$$\begin{gathered} \mathbf{W}^* = \mathbf{Y} \left(\mathbf{X}^*\right)^\intercal \left(\mathbf{X}^* \left(\mathbf{X}^*\right)^\intercal\right)^{-1} \\ \mathbf{W} = \begin{bmatrix} \mathbf{W}^* & \mathbf{0} \end{bmatrix} \\ \mathbf{Y} = \mathbf{W} \mathbf{X} \end{gathered}$$

For the following, we can simplify the analysis by noticing that each of the output nodes $y_i$ (equivalently, the rows of $\mathbf{W}$) can be viewed as completely separate problems, since they are independent. Focusing on a single $\left[\mathbf{W}\right]_i, y_i$, we want

$$\hat{\mathbf{Y}}_i = \left[\mathbf{W}\right]_i \mathbf{X} \equiv \mathbf{w}_i \mathbf{X}$$

If we attempted to proceed as above with a multiplicative update rule like Hebbian learning, the problem becomes clear.

$$\begin{aligned} \hat{y}_i^{(p)} &= \mathbf{w}_i \cdot \mathbf{x}^{(p)} \\ &= \left( \sum_q \Delta_q \mathbf{w}_i \right) \cdot \mathbf{x}^{(p)} \\ &= \Delta_p \mathbf{w}_i \cdot \mathbf{x}^{(p)} + \left( \sum_{q \ne p} \Delta_q \mathbf{w}_i \right) \cdot \mathbf{x}^{(p)} \end{aligned}$$

An additional error term emerges, which in the Hebbian learning context, would represent the correlation between input examples.

Think about it this way: Unless the set of all previous $\Delta_p \mathbf{w}_i$ are known, it is impossible to know which direction to extend $\mathbf{w}_i$ so that it can recall $y_i^{(p)}$. How can $\mathbf{w}_i$ be updated in a way that interferes as little as possible with the already-stored pairs?

An update $\Delta_p \mathbf{w}_i$ must incorporate information about the associations which have already been seen. One cannot update $\mathbf{w}_i$ along an input $\mathbf{x}^{(p)}$ without interfering with the stored memory.

The goal is to recall $P$ datapoints:

$$\begin{aligned} \hat{y}_i^{(1)} &= \mathbf{w}_i \cdot \mathbf{x}^{(1)} \\ \hat{y}_i^{(2)} &= \mathbf{w}_i \cdot \mathbf{x}^{(2)} \\ &\vdots \\ \hat{y}_i^{(P)} &= \mathbf{w}_i \cdot \mathbf{x}^{(P)} \end{aligned}$$

Going forward, I will drop the $i$ index, as we will focus on a single output. The independence of output nodes in a simple linear model means that the analysis generalizes trivially to all outputs.

When first approaching this topic, I wondered if it were possible to exploit the degrees of freedom in the weights to recover on-line Hebbian learning by performing a quasi-Gram-Schmidt routine on the $\Delta_p \mathbf{w}$, thus iteratively building up an orthogonal basis.

Start with $\mathbf{w} \gets \frac{y^{0}}{\mathbf{x} \cdot \mathbf{x}} \mathbf{x}$.

Then, each successive update $\Delta_p \mathbf{w}$ must satisfy the following constraints:

  1. It is orthogonal to all previous $\Delta_j \mathbf{w}, j < p$.

  2. $\mathbf{w} \gets \mathbf{w} + \Delta_p \mathbf{w}$ recalls $y^{(p)}$ when presented with $\mathbf{x}^{(p)}$.

In practice this might look like

$$\begin{aligned} &\text{Step 1:} \quad &&\Delta_1 \mathbf{w} = a_1 \mathbf{x}^{(1)} \\ &\text{Step 2:} \quad &&\Delta_2 \mathbf{w} = a_2 \left( \mathbf{x}^{(2)} - \mathop{\mathrm{proj}}_{\Delta_1 \mathbf{w}}\left(\mathbf{x}^{(2)}\right) \right) \\ &\text{Step 3:} \quad &&\Delta_3 \mathbf{w} = a_3 \left( \mathbf{x}^{(3)} - \mathop{\mathrm{proj}}_{\Delta_1 \mathbf{w}}\left(\mathbf{x}^{(3)}\right) - \mathop{\mathrm{proj}}_{\Delta_2 \mathbf{w}}\left(\mathbf{x}^{(3)}\right) \right) \\ & && \vdots \\ &\text{Step $P$:} \quad &&\Delta_P \mathbf{w} = a_P \left( \mathbf{x}^{(P)} - \sum_{j=1}^{P-1} \mathop{\mathrm{proj}}_{\Delta_j \mathbf{w}}\left(\mathbf{x}^{(P)}\right) \right) \end{aligned}$$

where $a_1, \dots, a_P$ are scalars which can be found algebraically at each step, and $\mathop{\mathrm{proj}}_a(b)$ is the projection of $b$ onto $a$.

This method works by constructing an orthogonal basis of $\Delta \mathbf{w}$ which transforms the non-orthonormal $\mathbf{X}$ into a pattern basis before once again transforming it to the output space. This is essentially equivalent to $QR$-decomposition.
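
Here is a minimal numpy sketch of this projection-based routine (my own illustration; the variable names, dimensions, and random data are assumptions). Each update is orthogonalized against the previous updates and then scaled so that the new pair is recalled without disturbing the old ones:

```python
import numpy as np

def proj(a, b):
    """Projection of b onto a."""
    return (np.dot(a, b) / np.dot(a, a)) * a

rng = np.random.default_rng(2)
n, P = 5, 4                              # input dimension, number of pairs
X = rng.normal(size=(P, n))              # linearly independent inputs (rows)
y = rng.normal(size=P)                   # scalar targets for a single output node

w = np.zeros(n)
deltas = []                              # all previous updates must be retained
for p in range(P):
    # Orthogonalize x^(p) against every previous Delta_j w.
    d = X[p] - sum((proj(dw, X[p]) for dw in deltas), np.zeros(n))
    # Choose a_p so that (w + a_p d) . x^(p) = y^(p).
    a_p = (y[p] - np.dot(w, X[p])) / np.dot(d, X[p])
    dw = a_p * d
    deltas.append(dw)
    w = w + dw

print(np.allclose(X @ w, y))             # True: every pair recalled exactly
```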

However, this method is disadvantageous, as it requires retaining information about the components of w\mathbf{w} as the algorithm progresses.

A key idea unlocking a new approach to this problem was provided in 1960 by Widrow and Hoff (1988).

The Widrow-Hoff algorithm or delta rule is an incredibly powerful technique and a critical stop on our journey towards neural networks.

The quintessential optimization method for neural networks, gradient descent, simplifies to Widrow-Hoff for single-layer networks like linear associators.

$$\label{eq:WidrowHoff} \mathbf{w}_{t+1} = \mathop{\mathrm{argmin}}_\mathbf{w} \left[ \eta \left( y^{(p)} - \mathbf{w} \cdot \mathbf{x}^{(p)} \right)^2 + \left\| \mathbf{w} - \mathbf{w}_t \right\|^2 \right]$$

It turns the search for $\mathbf{w}$ into an optimization problem by introducing a notion of measurable error (which I will suggestively call $J$) for a single association pair:

$$\begin{gathered} J_p = \left( y^{(p)} - \mathbf{w} \cdot \mathbf{x}^{(p)} \right)^2 \\ \frac{\partial J_p}{\partial \mathbf{w}} = -2 \left( y^{(p)} - \mathbf{w} \cdot \mathbf{x}^{(p)} \right) \mathbf{x}^{(p)} \end{gathered}$$

Thus, the update to $\mathbf{w}$ is made in the direction that yields the best reduction in error for that pair:

$$\Delta_p \mathbf{w} = - \eta \frac{\partial J_p}{\partial \mathbf{w}} = \eta \left( y^{(p)} - \mathbf{w} \cdot \mathbf{x}^{(p)} \right) \mathbf{x}^{(p)}$$

Unsurprisingly, the update is made in the direction of $\mathbf{x}^{(p)}$. While it would indeed be possible to update $\mathbf{w}$ in one shot such that it recalls $y^{(p)}$ perfectly, notice that updating $\mathbf{w}$ may increase the error for other input pairs, proportional to their correlation with $\mathbf{x}^{(p)}$. Instead, input / output pairs are repeatedly presented to the network until perfect convergence is achieved.

Another advantage of this method is that it is useful even in contexts when perfect recall is not possible; i.e. when the system is overdetermined. In that context, Widrow-Hoff is equivalent to online linear regression (i.e. gradient descent on linear regression)!
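
A minimal numpy sketch of the delta rule in action (my own illustration; the learning rate, dimensions, and number of presentations are arbitrary choices): linearly independent but non-orthogonal inputs are presented repeatedly until the weight vector recalls every pair.

```python
import numpy as np

rng = np.random.default_rng(3)
n, P, eta = 5, 4, 0.1                 # input dim, number of pairs, learning rate

X = rng.normal(size=(P, n))           # linearly independent, non-orthogonal inputs (rows)
y = rng.normal(size=P)                # scalar targets for a single output node

w = np.zeros(n)
for _ in range(10000):                # repeatedly present every pair
    for p in range(P):
        error = y[p] - np.dot(w, X[p])
        w += eta * error * X[p]       # Delta_p w = eta (y_p - w . x_p) x_p

print(np.abs(X @ w - y).max())        # ~0: every pair is recalled
```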

Linear models: conclusion

Now that the "learning" mechanism of the linear neural network has been demonstrated, what is actually being learned here?

The following analysis comes from the landmark book Parallel Distributed Processing (David E. Rumelhart, McClelland, and Group (1986)) and is my favorite result regarding linear associators.

Consider the recall procedure.

$$\hat{\mathbf{Y}} = \mathbf{W} \mathbf{X} = \mathbf{T}$$

Before performing delta rule learning, we can reconceptualize the problem from neuron space into pattern space. Instead of a single transformation, we can think of recall as a three-step process.

  1. Convert the input neuron space into input pattern space

    $$\mathbf{X}^* = \mathbf{P}_I \mathbf{X}$$

  2. Map input patterns to output patterns

    $$\mathbf{Y}^* = \mathbf{W}^* \mathbf{X}^*$$

  3. Convert the output pattern space into output neuron space

    $$\mathbf{Y} = \mathbf{P}_T^{-1} \mathbf{Y}^*$$

Summarizing,

$$\mathbf{Y} = \mathbf{P}_T^{-1} \mathbf{W}^* \mathbf{P}_I \mathbf{X}; \quad \mathbf{W} = \mathbf{P}_T^{-1} \mathbf{W}^* \mathbf{P}_I; \quad \mathbf{W}^* = \mathbf{P}_T \mathbf{W} \mathbf{P}_I^{-1}$$

Let’s apply this to the delta rule.

$$\begin{aligned} \mathbf{W}_{t+1} &= \mathbf{W}_t + \eta (\mathbf{Y}_t - \mathbf{W}_t \mathbf{X}_t) \mathbf{X}_t^\intercal\\ \mathbf{P}_T \mathbf{W}_{t+1} \mathbf{P}_I^{-1} &= \mathbf{P}_T \mathbf{W}_t \mathbf{P}_I^{-1} + \eta \mathbf{P}_T (\mathbf{Y}_t - \mathbf{W}_t \mathbf{X}_t) \mathbf{X}_t^\intercal \mathbf{P}_I^{-1} \\ \mathbf{W}_{t+1}^* &= \mathbf{W}_t^* + \eta \mathbf{P}_T (\mathbf{Y}_t - \mathbf{P}_T^{-1} \mathbf{W}_t^* \mathbf{P}_I \mathbf{X}_t) \mathbf{X}_t^\intercal\mathbf{P}_I^{-1} \\ &= \mathbf{W}_t^* + \eta (\mathbf{Y}_t^* - \mathbf{W}_t^* \mathbf{X}_t^*) \mathbf{X}_t^\intercal\mathbf{P}_I^{-1} \\ &= \mathbf{W}_t^* + \eta (\mathbf{Y}_t^* - \mathbf{W}_t^* \mathbf{X}_t^*) \left(\mathbf{P}_I^{-1} \mathbf{X}_t^*\right)^\intercal\mathbf{P}_I^{-1} \\ &= \mathbf{W}_t^* + \eta (\mathbf{Y}_t^* - \mathbf{W}_t^* \mathbf{X}_t^*) \left(\mathbf{X}_t^*\right)^\intercal \left(\mathbf{P}_I^{-1}\right)^\intercal \mathbf{P}_I^{-1} \end{aligned}$$

The matrix $\left(\mathbf{P}_I^{-1}\right)^\intercal\mathbf{P}_I^{-1}$ is simply the correlation structure among the input examples; i.e. $\left(\mathbf{P}_I^{-1}\right)^\intercal\mathbf{P}_I^{-1} = \mathbf{X}^\intercal\mathbf{X}$.

The learning processes for $\mathbf{W}$ and $\mathbf{W}^*$ are equivalent at each step even though translating between them is not possible without full knowledge of the input patterns. In other words, $\left(\mathbf{P}_I^{-1}\right)^\intercal\mathbf{P}_I^{-1}$ can be considered a "hidden" parameter which influences how the training converges (i.e. the respective rates for different inputs). Even the outputs don't affect learning; they are completely independent! In other words, the difference in output activations for two given inputs is irrelevant; it is only the correlation of the inputs that affects the system's ability to learn the associations!

Notice, however, that $\mathbf{W}^*$, $\mathbf{X}^*$, and $\mathbf{Y}^*$ are common across all association problems of $P$ examples. The pattern basis is always the same, with $\mathbf{W}^*$ converging on $\mathbf{I}$ in the case where inputs are linearly independent. This means that the only information distinguishing one delta rule learning system from another is the correlations between inputs! This correlation alone scales the updates to $\mathbf{W}^*$ and determines how the system converges.

The error found in the output pattern space for a given input can be interpreted as the degree to which the outputs of correlated features have not yet been corrected for by $\mathbf{W}^*$. The scaling of this error by $\left(\mathbf{P}_I^{-1}\right)^\intercal\mathbf{P}_I^{-1}$ highlights how highly correlated pairs will produce greater errors by weight at each step.

Another insight to notice here is that any modification to the inputs which preserves their correlations will not be detected by the system.

Key insight:
This means that, in linear networks, the correlation among inputs is the only thing that matters, even if the inputs and outputs are drawn from a distribution. There is no hidden representation of knowledge which associates particular inputs with particular outputs.

Important insight:
By introducing a measure of "distance" between the target and the output, we can actually measure how good a prediction is and take steps toward making it better.

This is an additional layer beyond random pattern mapping, and it means we can still find decent solutions even when the inputs are linearly dependent (which is guaranteed to be the case if we have more pairs than we have input nodes). I would wager that the min error is better when similar inputs map to similar outputs, but I haven’t proven this yet.

But this is interesting, from the book:

In a linear system, it is only the "structure" of the inputs and outputs that matter, not details of the internal representation of the system. It is only the pattern of correlations among the patterns that matter, not the contents of the specific patterns themselves.

For an overview of how associators behave in statistical environments, i.e. when the input output pairs are sampled from a distribution, I recommend reading Chapter 11 of Parallel Distributed Processing, Volume 1 written by G. O. Stone (David E. Rumelhart, McClelland, and Group (1986)[Chapter 11]).

Nonlinear models

It is a well known result that a network with multiple linear layers is equivalent to a network with a single layer (this is easy to prove and a worthwhile exercise). Thus, if we are to have any hope of broadening the domain of network models, we must introduce nonlinearity into the system. In the simplest cases, this is done by introducing nonlinear "activation" functions at the output nodes of a linear model.

The canonical instance of this type of architecture is the single-layer Perceptron (with $m$ outputs), often introduced as a nonlinear successor to linear models. However, I consider it somewhat tangential to the study of feedforward neural networks. The Perceptron convergence theorem is worth reviewing, but the algorithm itself applies only to a narrow class of problems. It is better suited as a precursor to Support Vector Machines.

A slightly more useful example of a single-layer, nonlinear network is the Bidirectional Associative Memory introduced by Bart Kosko in 1988 (Kosko (1988)). This is a natural extension of the associative memory discussed in the previous section, so I will explore it briefly here.

Bidirectional associative memories

The memory of linear associators can be improved by introducing nonlinearity and relaxing the perfect recall requirement.

Let $\mathbf{x}$ and $\mathbf{y}$ be bipolar vectors so as to increase the chances of inputs / outputs being orthogonal, and introduce the nonlinear operator $\operatorname{sign}(\cdot)$ to retrieval.

$$\hat{\mathbf{y}}^{(h)} = \operatorname{sign} \left( \mathbf{y}^{(h)} + \sum_{k \ne h} \mathbf{y}^{(k)} (\mathbf{x}^{(k)})^\intercal\mathbf{x}^{(h)} \right)$$

This association is bidirectional:

$$\mathbf{Y}^{(v+1)} = \operatorname{sign} \left( \mathbf{W} \mathbf{X}^{(v)} \right); \quad \mathbf{X}^{(v+1)} = \operatorname{sign} \left( \mathbf{W}^\intercal\mathbf{Y}^{(v+1)} \right)$$

Here, the cross-talk term need only be small for perfect recall. The performance of this associator is inversely related to the pairwise correlation of the $\mathbf{x}$; that is, the model has a hard time distinguishing between highly correlated inputs.
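
A minimal numpy sketch of a bidirectional associative memory (my own illustration; the pattern sizes and number of stored pairs are arbitrary choices): the correlation matrix is built Hebbian-style from bipolar pairs, and recall thresholds with sign in both directions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, P = 16, 12, 3                       # input size, output size, stored pairs

# Random bipolar (+1/-1) pattern pairs, stored as columns.
X = rng.choice([-1, 1], size=(n, P))
Y = rng.choice([-1, 1], size=(m, P))

W = Y @ X.T                               # Hebbian correlation matrix

# Forward recall: y_hat = sign(W x); backward recall: x_hat = sign(W^T y).
Y_hat = np.sign(W @ X)
X_hat = np.sign(W.T @ Y)

print(np.mean(Y_hat == Y), np.mean(X_hat == X))   # recall accuracy in each direction
```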

Feedforward Neural Networks (Multilayer Perceptrons)

The key difference between the class of models discussed up to this point and the class of models going forward is the introduction of hidden layers. In fact, calling linear associators "models" gives them too much credit - associators learn via rote memorization. They memorize geometric differences between inputs which are later recited. These associators only have enough degrees of freedom to interpret inputs as directions in $n$-dimensional space, and thus perform terribly at simple tasks which require a more nuanced view of the inputs (encoding, bitwise operations, arithmetic operations, the XOR problem, etc).

Hidden layers allow for learning representations, which affords these models a massive step up in power. I confess that I haven't yet found a particularly instructive bridge between associative networks and hidden layer networks. Perhaps this shouldn't be so surprising, as the evolution between the two paradigms took place over the better part of two decades. One possible angle to consider is that the delta rule is essentially a special case of gradient descent (on a mean squared error loss function). But this conceptual discontinuity does not prevent us from adapting some of the language and mathematical formulations from our linear beginnings to this new setting. I may revisit this topic in the future. For now, we will explore the power of hidden layer architectures with fresh eyes.

One of the simplest kinds of hidden layer networks is the Feedforward Neural Network (FNN), sometimes called the multilayer Perceptron. Don’t let the naming fool you - the multilayer Perceptron has little in common with the single-layer Perceptron because hidden layers change the whole game.

Ultimately, feedforward networks approximate some function $f : \mathbb{R}^n \to \mathbb{R}^m$ by computing the function

$$\label{eq:FNN} \mathbf{F}(\mathbf{x}) = \left( F_1(\mathbf{x}), \dots, F_m(\mathbf{x}) \right)^\intercal$$

where $\mathbf{x} \in \mathbb{R}^n$.

Notation

Getting the notation right is important for FNNs, and in this section, I will sacrifice brevity for precision at every turn. This section borrows some notation from Guilhoto (2018) and adds extra details which will prove useful.

Despite representing examples as column vectors above, I will be using row-major ordering for this section simply because this is what most modern machine learning frameworks use (this has to do with memory efficiency and batching). Of course, the analysis works the exact same way regardless of how parameters and data are indexed (in fact, column-major ordering is common in academic literature).

I will use the following notation:

Expression : Definition

$\mathbf{W}^{(l)}$ : The weights matrix comprising the weights that act on the activated $(l - 1)$ neurons. Recall $\mathbf{z}^{(l)} = \mathbf{a}^{(l-1)} \mathbf{W}^{(l)}$. $\mathbf{W}^{(l)}$ is an $n \times m$ matrix, where $m$ is the width of layer $l$, and $n$ is the width of layer $l - 1$.

$\mathbf{w}_{j,*}^{(l)}$ : $j$th row vector of $\mathbf{W}^{(l)}$.

$\mathbf{w}_{*,k}^{(l)}$ : $k$th column vector of $\mathbf{W}^{(l)}$.

$w_{j,k}^{(l)}$ : Element of $\mathbf{W}^{(l)}$: the weight of the connection of the $j$th neuron of the $(l - 1)$th layer to the $k$th neuron of the $l$th layer.

$i, j, k, l$ : In general, $i$ will index the training examples, and $j$ and $k$ will index the rows and columns of various matrices respectively. $l$ will index the layer of the network, with $L$ being the final (output) layer. Layer $l$ can be interpreted as "owning" the weights which feed into it. Therefore, it is natural that $k$ indexes objects corresponding to neurons at this layer (e.g. activations $a_k$, preactivations $z_k$).

$\sigma(\cdot)$ : The activation function. For simplicity, we will assume that a single activation function is shared by the entire network.

$\mathbf{F} = \hat{\mathbf{y}} = \mathbf{a}^{(L)}$ : The output of the network at the final layer $L$. This is the estimation.

The astute reader will notice that I've almost completely omitted analysis of the bias term $\mathbf{b}$ from this essay. This is quite intentional, as I don't believe it contributes much to the intuitions I'm looking to build here. There are plenty of other resources which provide the $\mathbf{b}$ derivatives.

I don't want to list out the parts of the engine. I want to show how it makes its power.

Review: FNN basics

The weighted sum of inputs to a layer $l$ can be interpreted as an affine transformation between real vector spaces.

$$T^{(l)} : \mathbb{R}^n \to \mathbb{R}^m$$

Unlike in the linear context, however, the interpretation of this map is less straightforward. It may be useful to view the purpose of this transformation as enhancing the expressivity of the neural network. Dense layers vastly increase the searchable space of models and increase the generality of the architecture.

Despite the great attention given to the weights in the linear models earlier in this essay, going forward, it may be prudent to avoid thinking too much about the weights in isolation. Rather, they are an efficient way to construct complicated functions with lots of parameters that possess important symmetries which can be exploited. The killer feature of neural networks is that we can repurpose both the mathematical and computational tools of linear algebra to manipulate state and train the network, providing an abundance of paths in parameter space along which the network can encode complex internal representations and dodge local minima as it searches through function space.

The layers in a NN act as a synchronization mechanism, meaning that layer $l$ nodes are activated at the same time. This gives the ensemble approximation function a naturally compositional structure, which is easy to deal with mathematically.

In fact, none of the results below require that layers be dense; connections may be omitted entirely or connect nonconsecutive layers.

That being said, it is true that the weights matrices (and the weights matrices alone) contain all information which a neural network has ever learned and (in the simplest case) deterministically determine its future performance. If AI bots become sentient, it will mean that tables of 64-bit floating point numbers are a suitable substrate in which an internal representation of consciousness itself can be embedded.

Forward pass notation

The activation of a single neuron (the $k$th neuron of the $l$th layer) is given by:

$$\label{eq:SingleNeuronActivation} a_k^{(l)} = \sigma\left( \sum_j a_j^{(l-1)} w_{j,k}^{(l)} + b_k^{(l)} \right)$$

The contribution of the $(l - 1)$th layer to the $k$th neuron of the $l$th layer is:

$$z_k^{(l)} - b_k^{(l)} = \mathbf{a}^{(l-1)} \mathbf{w}_{*,k}^{(l)} = (\mathbf{a}^{(l-1)} \mathbf{W}^{(l)})_{*,k}$$

The contribution of a single neuron (the $j$th neuron of the $(l - 1)$th layer) to the $l$th layer is:

$$\mathbf{z}_j^{(l)} = a_j^{(l-1)} \mathbf{w}_{j,*}^{(l)}$$

where

$$\sum_j \mathbf{z}_j^{(l)} = \mathbf{z}^{(l)}$$

I won’t make much use of these equations, but it is worth understanding them as a description of the movement of data through the network.
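
To ground the notation, here is a minimal row-major forward pass in numpy (my own sketch; the layer sizes, the choice of tanh as $\sigma$, and the random initialization are assumptions for illustration). A batch of inputs is stacked as rows, and $\mathbf{z}^{(l)} = \mathbf{a}^{(l-1)} \mathbf{W}^{(l)} + \mathbf{b}^{(l)}$ at each layer.

```python
import numpy as np

rng = np.random.default_rng(5)

def init_layer(n_in, n_out):
    """One dense layer: weights W (n_in x n_out) and bias b (n_out,)."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def forward(x, layers, sigma=np.tanh):
    """Row-major forward pass; x has shape (batch, n_inputs)."""
    a = x
    zs, activations = [], [a]          # cache for a later backward pass
    for W, b in layers:
        z = a @ W + b                  # z^(l) = a^(l-1) W^(l) + b^(l)
        a = sigma(z)                   # a^(l) = sigma(z^(l))
        zs.append(z)
        activations.append(a)
    return a, zs, activations

# Example: a 3 -> 4 -> 2 network evaluated on a batch of 5 examples.
layers = [init_layer(3, 4), init_layer(4, 2)]
x = rng.normal(size=(5, 3))
y_hat, _, _ = forward(x, layers)
print(y_hat.shape)                     # (5, 2)
```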

Gradient descent and loss functions

Generally speaking, loss functions come in two broad flavors, the simplest cases of which will be described in this essay.

MSE incorporates both the variance of the estimator and its bias.

$$C := \tfrac{1}{n} \sum_{i=1}^n \| f(x_i) - F(x_i) \|^2 = \sum_{i=1}^n C_i$$

Gradient descent is a general numerical optimization method for finding function minima; it requires that updates to the function's parameters be made in the direction of the negative gradient:

$$\Delta_p \mathbf{W}^{(l)} \propto - \frac{\partial C_p}{\partial \mathbf{W}^{(l)}}$$

There are many online resources which teach this method, so I won't rehash the topic here. I will, however, mention one of the most important features of this method for our purposes: the gradient is a linear operator, and thus satisfies the additivity property:

$$\nabla C = \nabla \sum_{i=1}^n C_i = \sum_{i=1}^n \nabla C_i$$

This property allows us to aggregate gradients which were computed for different training examples independently.
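
As a quick numerical check of this additivity (my own sketch; a single linear layer with a summed squared-error cost is assumed purely for illustration), the gradient computed on a whole batch equals the sum of the per-example gradients:

```python
import numpy as np

rng = np.random.default_rng(6)
n_in, n_out, batch = 3, 2, 4

W = rng.normal(size=(n_in, n_out))
X = rng.normal(size=(batch, n_in))
Y = rng.normal(size=(batch, n_out))

def grad(W, x, y):
    """Gradient of C_i = ||x W - y||^2 with respect to W for one example."""
    return 2 * np.outer(x, x @ W - y)

per_example_sum = sum(grad(W, X[i], Y[i]) for i in range(batch))

# The same gradient computed on the whole batch at once.
full_grad = 2 * X.T @ (X @ W - Y)

print(np.allclose(per_example_sum, full_grad))   # True
```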

Backpropagation

The real power of neural networks stems from their

  1. symmetry (backwards looks a lot like forwards, layers synchronize per-neuron operations),

  2. modularity (composable functions as independently-differentiable clusters), and

  3. sparsity (linear map followed by element-wise activation)

embedded in their structure.

This section explains why densely connected networks are such a natural vehicle for performing gradient descent in high-dimensional parameter space. There are many other ways to construct complex models which, in theory, could be just as expressive and intelligent as neural networks (our own brains, for example!), but training them would be messy and slow.

Don't forget that all this flexibility comes at a cost. Training even the simplest multilayer network is NP-complete (Blum and Rivest (1992)).

On vectorization and VJP

So far in this essay, I have not shied away from keeping notation and computations vectorized, and that will continue for this section. There are several reasons for this. For one, it is prudent to keep analysis of any system as general as possible for as long as possible such that tools built up in the process can be ported to other domains more easily, and so a greater catalog of optimizations remain applicable. In the case of the backpropagation algorithm, operation of networks via the language of vector calculus enables them to benefit from parallel computing and graphics libraries. Additionally, by putting in some work up front, we get input batching for free. Do not take this fact for granted! It is a beautiful consequence of linear algebra that we can pass stacks of independent data around in our networks without updating any of the mathematics or code.

That being said, one might wonder what makes FNNs such a hot commodity nowadays. The tabloid headline is that neural net architectures are the most flexible and efficient ways to optimize functions for a given level of expressivity [2]. No neurologist worth their salt would make the claim that the synapse chains in our brains behave anything like GPT-4. If the original neuronal approximators were inspired by biology, they certainly aren't anymore. The pictures of circles connected by lines which we draw for ourselves are, of course, merely representational [3]. CPUs move data between registers and perform simple arithmetic on the values they store - there is no grey matter hiding in your laptop.

That being said, there will be a trap waiting for us in the development of backprop if we simply close our eyes and let vector calculus take the wheel.

The tricky bit in the analysis of feedforward networks is that matrices are being used in two distinct ways:

  1. as a linear subcomponent of the entire function; a transfer function between layers - a way to move data backwards and forwards through the network

  2. as a means to enumerate paths of influence of the parameters (weights) during gradient computation - a way to store and chain derivatives along a compounding number of paths in the network

Great care should be taken to understand which case is in play in each moment, as the next example shows. In many introductory materials on neural networks, this inequivalence goes unaddressed.

The universality of the chain rule and the linear tools of vector calculus means that they are ignorant of the latent efficiencies available to exploit in the network. Following the letter of the law with these mathematical tools in the hopes of computing the perfect loss gradient leads to inefficient structures (order $n > 2$ tensors, large Jacobians) which can be cleverly avoided.

Most introductory materials on this subject optimize prematurely in my view. I want to emphasize that naively throwing Jacobians at the problem is perfectly valid and has wonderful sanity checking mechanisms baked in. I encourage the reader to try this approach at least once and at least verify that the matrix dimensions really do match up.

With the preamble out of the way, what are these mysterious optimizations? They go by the name Vector Jacobian Product (VJP). The punchline is that in practice, no actual Jacobians need to be computed during gradient descent! It turns out that, due to the compute graph enforced by the network architecture, the product of any Jacobian with its neighboring vector in the derivative chain can be computed directly; i.e. without first computing the components of the product.

In some sense, this is the essence of Neural Networks, and I think it goes underappreciated.

Before jumping into VJP, let's trudge through the brute-force Jacobian approach. The chain rule tracks and merges the contributions of independent variables to the final outputs of a function along all possible paths of influence. Vector calculus accounts for these paths (to a linear approximation) by recording Cartesian products of intermediate variable dependencies in the form of first-order partial derivatives. By arranging these partial derivatives into a matrix (the Jacobian), the job of path enumeration and combinatorics can be offloaded to linear algebra. Any complicated variable dependency graph just looks like composed matrix multiplication! Consider the following linear transition from layer $l - 1$ to layer $l$:

$$\mathbf{z}^{(l)} = \mathbf{a}^{(l-1)} \mathbf{W}^{(l)}$$

Despite the transition being written in vector form, the chain rule as carried out via vector calculus sees this operation as

$$\mathbf{z}^{(l)} = \vec{f}(w_{1,1}^{(l)}, \dots, w_{n,m}^{(l)}; \mathbf{a}^{(l-1)}) \implies \mathbf{J}_{\vec{f}} \equiv \frac{\partial \vec{f}}{\partial \mathbf{w}^{(l)}} \in \mathbb{R}^{m \times (n m)}$$

where $\mathbf{a}^{(l-1)}$ is treated as a parameter of $\vec{f}$ during derivative computation.

Watch out - $\mathbf{W}^{(l)}$ is not the same as $\mathbf{w}^{(l)}$! $\mathbf{W}^{(l)}$ describes the parameters of the linear function $\vec{f}$, whereas $\mathbf{w}^{(l)}$ is merely an alias for $w_{1,1}^{(l)}, \dots, w_{n,m}^{(l)}$!

We might then write an $L = 3$ network like so:

$$\mathbf{a} = \mathbf{h}( \mathbf{g}( \mathbf{f}( \mathbf{x}, \mathbf{w}^{(1)} ), \mathbf{w}^{(2)} ), \mathbf{w}^{(3)} )$$

where $\mathbf{f}$, $\mathbf{g}$, $\mathbf{h}$ represent arbitrary vector-valued functions which operate on inputs, somehow incorporating the weights. Then we would write the Jacobian of $\mathbf{a}$ with respect to, for example, $\mathbf{w}^{(1)}$ as

$$\frac{\partial \mathbf{a}}{\partial \mathbf{w}^{(1)}} = \frac{\partial \mathbf{a}}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{g}} \frac{\partial \mathbf{g}}{\partial \mathbf{f}} \frac{\partial \mathbf{f}}{\partial \mathbf{w}^{(1)}}$$

There is nothing inconsistent about this view of the system, and in fact, it is already quite good (if suboptimal); the intermediate derivatives like $\frac{\partial \mathbf{h}}{\partial \mathbf{g}}$ can be computed once and then reused for calculating all gradients which appear to its left in the network.

But we know something that the calculus doesn't! This is where differentiating between the two distinct applications of linear algebra is critical. To see where we might save CPU cycles, notice that only $n \times m$ of the possible $m \times n \times m$ derivatives will be nonzero. This is because for a vanilla FNN, these intermediate transition functions between layers are constrained to be linear maps followed by nonlinear scalar functions applied element-wise.

VJP to the rescue:

  • For linear maps, VJP allows us to replace the Jacobian with the transpose of the weights matrix

  • For element-wise functions, VJP allows us to use $O(n)$ Hadamard products (see the sketch below)
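
Here is a small numpy sketch of both shortcuts (my own illustration; the dimensions and the choice of tanh are assumptions). The dense Jacobians are built explicitly only to verify that the VJP shortcuts produce the same result.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 4, 3

W = rng.normal(size=(n, m))            # layer weights: z = a W
a = rng.normal(size=n)                 # activations from the previous layer
z = a @ W
upstream = rng.normal(size=m)          # dC/dz, arriving from later layers

# 1) Linear map z = a W: the full Jacobian dz/da has entries dz_k/da_j = w_{j,k},
#    i.e. it equals W^T, so the VJP collapses to a product with the transposed weights.
jac_linear = np.array([[W[j, k] for j in range(n)] for k in range(m)])  # (m, n)
print(np.allclose(upstream @ jac_linear, upstream @ W.T))               # True

# 2) Element-wise sigma: the Jacobian da/dz is diagonal, so the VJP is a Hadamard product.
sigma_prime = 1.0 - np.tanh(z) ** 2    # derivative of tanh, evaluated element-wise
jac_elementwise = np.diag(sigma_prime) # explicit (m, m) diagonal Jacobian
print(np.allclose(upstream @ jac_elementwise, upstream * sigma_prime))  # True
```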

The Backprop Toolkit

The following results provide us with tools to aid in interpreting and simplifying notation later on while tackling the backpropagation algorithm.

Neuronal gradient symmetry

Recall the following from the definition of the forward-pass:

$$\mathbf{z}^{(l)} = \mathbf{a}^{(l-1)} \mathbf{W}^{(l)} \qquad\qquad \frac{\partial z_k^{(l)}}{\partial a_j^{(l-1)}} = w_{j,k}^{(l)}$$

Dense, linear connections between layers lead to a nice result which provides a mechanism for propagating errors backwards through the network.

$$\begin{gathered} \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{a}^{(l-1)}} = \begin{bmatrix} \frac{\partial z_1^{(l)}}{\partial a_1^{(l-1)}} & \cdots & \frac{\partial z_1^{(l)}}{\partial a_n^{(l-1)}} \\ \vdots & \ddots & \vdots \\ \frac{\partial z_m^{(l)}}{\partial a_1^{(l-1)}} & \cdots &\frac{\partial z_m^{(l)}}{\partial a_n^{(l-1)}} \end{bmatrix} = \left(\mathbf{W}^{(l)}\right)^\intercal\\ \frac{\partial C}{\partial \mathbf{a}^{(l-1)}} = \frac{\partial C}{\partial \mathbf{z}^{(l)}} \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{a}^{(l-1)}} = \frac{\partial C}{\partial \mathbf{z}^{(l)}} \left(\mathbf{W}^{(l)}\right)^\intercal \end{gathered}$$

The consequence is that the exact same network can be used in reverse to compute gradients with no more difficulty than the forward pass. Ponder this for a moment. The ubiquity of neural networks owes a great debt to this special property.

Delta

As the error derivatives get passed back through the network, the intermediate quantities $\boldsymbol{\delta}^{(l)}$ are somewhat analogous to the activated neurons at each layer.

$\boldsymbol{\delta}^{(l)}$ is the celebrity VJP in backpropagation, but I reiterate that it is not strictly required to describe or implement backpropagation; it is nothing more than a location along the derivative chain where we've made a strategic substitution.

We will use it here, though, as it helps conceptualize the data flow during backpropagation and is emphasized in the literature (e.g. David E. Rumelhart, Hinton, and Williams (1986)).

$$\boldsymbol{\delta}^{(l)} := \frac{\partial C}{\partial \mathbf{z}^{(l)}} = \frac{\partial C}{\partial \mathbf{a}^{(l)}} \frac{\partial \mathbf{a}^{(l)}}{\partial \mathbf{z}^{(l)}} \qquad\qquad \delta_j^{(l)} = \frac{\partial C}{\partial z_j^{(l)}}$$

It is also common to see this kind of notation (e.g. (Nielsen, n.d.)):

$$\delta^{(l)} = \nabla_a C \odot \sigma'(z^{(l)})$$

I tend to avoid this notation because it hides some subtle details while also importing operations which aren't actually needed. By avoiding the gradient operator $\nabla$ (in favor of total derivatives and Jacobians), we can also avoid using the Hadamard product. This is because the Jacobian matrix $\frac{\partial \mathbf{a}^{(l)}}{\partial \mathbf{z}^{(l)}}$ is diagonal, which has the desired effect of element-wise multiplication.

Ultimately, I prefer the partial derivative notation because

  1. the matrix shapes provide for easy sanity checks, and

  2. the technique generalizes better to other architectures (there is no magic - it’s just the chain rule!)

During backpropagation, just as in forward propagation, the activations are computed based on previous activations, so it's useful to keep track of this relationship between the $\boldsymbol{\delta}$ of subsequent layers.

$$\begin{aligned} \boldsymbol{\delta}^{(l-1)} &= \frac{\partial C}{\partial \mathbf{a}^{(l-1)}} \frac{\partial \mathbf{a}^{(l-1)}}{\partial \mathbf{z}^{(l-1)}} \\ &= \frac{\partial C}{\partial \mathbf{z}^{(l)}} \left(\mathbf{W}^{(l)}\right)^\intercal \frac{\partial \mathbf{a}^{(l-1)}}{\partial \mathbf{z}^{(l-1)}} \\ &= \boldsymbol{\delta}^{(l)} \left(\mathbf{W}^{(l)}\right)^\intercal \frac{\partial \mathbf{a}^{(l-1)}}{\partial \mathbf{z}^{(l-1)}} \\ &= \boldsymbol{\delta}^{(l)} \left(\mathbf{W}^{(l)}\right)^\intercal \odot \sigma'\left(\mathbf{z}^{(l-1)}\right) \end{aligned}$$

Generalized weights gradient

The formula for the weight derivatives at a layer $l$ shows the recursive unfurling of the complex structure in the network via the chain rule.

$$\frac{\partial C}{\partial \mathbf{W}^{(l)}} = \left(\mathbf{a}^{(l-1)}\right)^\intercal \frac{\partial C}{\partial \mathbf{a}^{(L)}} \frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{z}^{(L)}} \prod_{q=0}^{L-l-1} \left( \mathbf{W}^{(L-q)} \right)^\intercal \frac{\partial \mathbf{a}^{(L-q-1)}}{\partial \mathbf{z}^{(L-q-1)}}$$

Of course, matrix multiplication is not commutative, but it is associative, so we can observe the pairs of transformations in any order we like (as long as the order of the terms is preserved) to get a better sense of this deep operation.

Ignore for a moment the term on the left side of the outer product, $\left(\mathbf{a}^{(l-1)}\right)^\intercal$, and consider what happens to the below quantity:

$$\frac{\partial C}{\partial \mathbf{a}^{(L)}} \frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{z}^{(L)}} = \boldsymbol{\delta}^{(L)}$$

Starting at the end of the network and working backwards, for each layer, that numerical quantity is first sent backward to the previous layer by transforming it with the transpose of the weights matrix. It is then "activated" by scaling the components of the output by the derivative of the activation function.

Note that we are not undoing the forward pass, we are simply moving data in the reverse direction.

An apt summary of backprop from (Peter Bloem, n.d.):

We work out the derivatives of the parts symbolically, and then chain these together numerically.

These "activated" and backpropagated intermediate sub-network derivatives can be accumulated such that only a single backward pass is required.

$$\frac{\partial C}{\partial \mathbf{W}^{(l)}} = \left(\mathbf{a}^{(l-1)}\right)^\intercal \boldsymbol{\delta}^{(l)}$$

This should look familiar - it is almost the same as the delta update rule from the linear model section!

Backpropagation algorithm

Now we can express the backpropagation algorithm in full. The below algorithm assumes that the preactivations and activations $\{\mathbf{z}^{(l)}, \mathbf{a}^{(l)} : l \in 1, \dots, L\}$ and the training example $\mathbf{x}^{(p)}$ have been stored.
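
A minimal numpy sketch of one such backward pass (my own reconstruction under assumptions: a squared-error cost, a tanh activation, no biases, and a plain learning-rate update): it computes $\boldsymbol{\delta}^{(L)}$ at the output, propagates it backwards with $(\mathbf{W}^{(l)})^\intercal$ and $\sigma'$, forms $\partial C / \partial \mathbf{W}^{(l)} = (\mathbf{a}^{(l-1)})^\intercal \boldsymbol{\delta}^{(l)}$, and updates each layer as it goes.

```python
import numpy as np

def sigma(z):
    return np.tanh(z)

def sigma_prime(z):
    return 1.0 - np.tanh(z) ** 2

def backprop_step(x, y, weights, eta=0.1):
    """One forward pass, then one backward pass with per-layer weight updates.

    x: input row vector, shape (1, n_0); y: target row vector, shape (1, n_L).
    weights: list of matrices W^(1) .. W^(L); biases are omitted as in the essay.
    """
    # Forward pass: store preactivations z^(l) and activations a^(l).
    activations, zs = [x], []
    a = x
    for W in weights:
        z = a @ W
        a = sigma(z)
        zs.append(z)
        activations.append(a)

    # Output delta: delta^(L) = dC/da^(L) * sigma'(z^(L)), with C = ||a^(L) - y||^2.
    delta = 2 * (activations[-1] - y) * sigma_prime(zs[-1])

    # Backward pass: update each layer, then propagate delta to the previous one.
    for l in reversed(range(len(weights))):
        grad = activations[l].T @ delta            # dC/dW^(l) = (a^(l-1))^T delta^(l)
        W_old = weights[l].copy()                  # propagate through the pre-update weights
        weights[l] -= eta * grad                   # update this layer's weights
        if l > 0:
            delta = (delta @ W_old.T) * sigma_prime(zs[l - 1])   # delta^(l-1)

    return weights

# Example usage with an assumed 3 -> 4 -> 2 architecture.
rng = np.random.default_rng(8)
weights = [rng.normal(scale=0.5, size=(3, 4)), rng.normal(scale=0.5, size=(4, 2))]
x, y = rng.normal(size=(1, 3)), rng.normal(size=(1, 2))
backprop_step(x, y, weights)
```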

Note that in practice, the weights are often updated in a single shot at the end of the backward pass by some type of optimizer that modulates training parameters like learning rates and momentum. In the algorithm above, the weights are updated at each layer to emphasize the modularity of the updates.

References

Blum, Avrim L., and Ronald L. Rivest. 1992. “Training a 3-Node Neural Network Is NP-Complete.” Neural Networks 5 (1): 117–27. https://doi.org/https://doi.org/10.1016/S0893-6080(05)80010-3.

Guilhoto, Leonardo Ferreira. 2018. “An Overview of Artificial Neural Networks for Mathematicians.” In. https://api.semanticscholar.org/CorpusID:85504929.

Kosko, B. 1988. “Bidirectional Associative Memories.” IEEE Transactions on Systems, Man, and Cybernetics 18 (1): 49–60.

Nielsen, Michael. n.d. “Chapter 2: How the Backpropagation Algorithm Works.” http://neuralnetworksanddeeplearning.com/chap2.html.

Peter Bloem, Vrije Universiteit Amsterdam. n.d. “Backpropagation.” https://dlvu.github.io/backpropagation/.

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323: 533–36.

Rumelhart, David E., James L. McClelland, and PDP Research Group. 1986. Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations. The MIT Press. https://doi.org/10.7551/mitpress/5236.001.0001.

Widrow, Bernard, and Marcian E. Hoff. 1988. “Adaptive Switching Circuits (1960).” In Neurocomputing, Volume 1: Foundations of Research. The MIT Press. Originally published in 1960 IRE WESCON Convention Record, New York: IRE, 96–104. https://doi.org/10.7551/mitpress/4943.003.0012.


  1. Yes, yes, the orthonormality constraint on the inputs would allow one to compress this data quite a bit, but the Ordnung is still the same.
  2. It’s quite difficult to formalize this statement; I may revisit this topic in the future.
  3. How meta!

© 2018-2024 Brendan Schlaman. All rights reserved.