Lazy backpropagation
In this post we’ll go through the neural-network script, and see how to fit a simple neural network in Goal. I will then show how I’ve implemented an implicit version of backprop by combining Goal types with lazy evaluation, and thereby avoid explicitly storing and passing gradients. Explaining this implementation of backprop is my primary interest in this tutorial, as it is general, efficient, and (I think) kind of cool.
A toy example
To begin, we’re going to consider the arbitrary function
f :: Double -> Double
f x = exp . sin $ 2 * x
and prepare to generate some noisy samples from it
fp :: Source # Normal
fp = fromTuple (0,0.1)
mnx,mxx :: Double
mnx = -3
mxx = 3
xs :: [Double]
xs = concat . replicate 5 $ range mnx mxx 8
The xs here are the inputs in the training data, and we'll generate the outputs ys once we get to defining main.
The next step is to define and initialize our neural network. We’ll initialize the parameters of our network with the normal distribution
cp :: Source # Normal
cp = fromTuple (0,0.1)
As for the network itself, we may specify its structure with the following type synonym:
type NeuralNetwork' =
    NeuralNetwork '[ '(Tensor, R 50 Bernoulli)] Tensor NormalMean NormalMean
This defines a model with 1-dimensional inputs and outputs, a linear output layer, and a 50-neuron hidden layer with sigmoid activation functions. The types for neural networks in Goal are somewhat idiosyncratic, but there is a method to the madness. Feel free to skip the next paragraph if you don't care how to read this type; regardless, please note that the underlying implementation is quite efficient, and interested parties are welcome to contact me if a more standard interface for specifying neural networks is desired.
In Goal, a NeuralNetwork is a type of Map, where Maps are parametric functions. Starting from the right, the NormalMean type indicates that the input of the NeuralNetwork is a 1-dimensional real number. The next NormalMean and the Tensor indicate that the NeuralNetwork has a linear output layer. The type-list with one element indicates that the network has one hidden layer, and the Tensor type again indicates that it is fully-connected. Finally, R 50 Bernoulli indicates that the layer has 50 neurons, and that the transfer function is given by the mapping from Natural to Mean coordinates for the Bernoulli distribution, which happens to be the sigmoid.
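As a concrete illustration of how to read this, here is how one might write a deeper variant. This is only a hypothetical sketch; in particular, the ordering of layers within the type-list is my assumption:

-- A hypothetical two-hidden-layer variant: the same 1-dimensional
-- (NormalMean) input and output and linear (Tensor) output layer, but
-- with fully-connected hidden layers of 50 and 20 sigmoid (Bernoulli)
-- neurons. The layer order within the list is an assumption.
type DeepNetwork' =
    NeuralNetwork '[ '(Tensor, R 50 Bernoulli), '(Tensor, R 20 Bernoulli)]
        Tensor NormalMean NormalMean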
Getting back to things, we next have some training parameters
nepchs :: Int
nepchs = 1000

eps :: Double
eps = 0.05

mxmu :: Double
mxmu = 0.999
which we’ll explain momentarily in context.
Now, the first things we want to do in main are to generate some noisy responses, and bundle them up with the inputs to define our training data xys
main :: IO ()
main = do
    ys <- realize $ mapM (noisyFunction fp f) xs
    let xys = zip ys xs
We then define our cost function as the conditional log-likelihood given these training data
    let cost :: Natural # NeuralNetwork' -> Double
        cost = conditionalLogLikelihood xys
and we use its differential
    let backprop :: Natural # NeuralNetwork' -> Natural #* NeuralNetwork'
        backprop = conditionalLogLikelihoodDifferential xys
to drive our gradient pursuit algorithms. In the following paragraph I describe some of the details of differentials in Goal, but this can also be freely skipped.
So, note that where Natural # NeuralNetwork' is a point on a neural network manifold in natural coordinates, the #* in the differential Natural #* NeuralNetwork' indicates that the differential lives in the Dual space of Natural # NeuralNetwork'. For exponential families, the dual space of the Natural coordinates is the space of Mean coordinates. That being said, NeuralNetworks aren't really exponential families; rather, we only use exponential family structures as a shortcut for defining them.
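To make the duality concrete: for an exponential family with log-partition function \(\psi\), the Mean coordinates are the image of the Natural coordinates under its gradient, \[\mu = \nabla_\theta\, \psi(\theta),\] and for the Bernoulli distribution this gradient map is exactly the sigmoid, \(\mu = 1/(1 + e^{-\theta})\), which is why the transfer function in our network type above turned out to be a sigmoid.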
Anyway, we then need to initialize our neural network
    mlp0 <- realize $ initialize cp
and do some gradient pursuit!
    let sgdmlps = take nepchs $ vanillaGradientSequence backprop eps Classic mlp0
        mtmmlps = take nepchs $ vanillaGradientSequence backprop eps (defaultMomentumPursuit mxmu) mlp0
        admmlps = take nepchs $ vanillaGradientSequence backprop eps defaultAdamPursuit mlp0
Just like in my introduction to gradient pursuit, we're going to compare the performance of the classic, momentum, and Adam gradient descent algorithms. In this case, however, the problem itself is not trivial. Indeed, in contrast with the previous tutorial, here Adam has a clear advantage: it achieves a higher overall log-likelihood, and more exactly captures the true function from which the data was noisily sampled.
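If you'd like to check this comparison numerically, one simple option is to evaluate the cost function we defined above at the end of each sequence; a minimal sketch, to be placed at the end of main:

    -- Print the final conditional log-likelihood reached by each optimizer.
    putStrLn $ "SGD:      " ++ show (cost $ last sgdmlps)
    putStrLn $ "Momentum: " ++ show (cost $ last mtmmlps)
    putStrLn $ "Adam:     " ++ show (cost $ last admmlps)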
Lazy backprop implementation
Now, having demonstrated that my implementation of neural networks does indeed work, I would like to dig into the implementation. First, let's review the math of backpropagation. We may think of a neural network as a function \(f(x;~\theta)\) where \(x\) is an input and \(\theta\) are the parameters of the network. Suppose we have a cost function \(c(f(x_i\ ; \theta),y_i)\) which tells us how close the neural network value \(f(x_i)\) is to the target \(y_i\). Then given some input-output data \((x_1,y_1), \ldots, (x_n,y_n)\), we optimize the neural network by maximizing \(\sum c(f(x_i\ ; \theta),y_i)\) with respect to \(\theta\), which we do in turn by computing differentials (or derivatives) of \(c\) with respect to \(\theta\) and using them to iteratively improve the network.
Now, let us focus on a single data point \((x,y)\), and imagine we have a network with \(d\) layers, and that the parameters of the network \(\theta = (W_1, \ldots, W_d)\) correspond to the weights of each layer of the network. We may then express the differential of \(c\) with respect to the weights \(W_i\) in layer \(i\) as \[\delta_{W_i} \ c(f(x; \theta),y) = r_i \otimes z_{i-1},\] where \(\otimes\) is the outer product, \(z_{i-1}\) are the feedforward outputs from the previous layer, and \(r_i\) are the backpropagated error differentials from layer \(i+1\).
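For completeness, the backpropagated errors themselves satisfy the standard recursion (written here in conventional notation, ignoring biases, with \(\sigma\) the transfer function and \(a_i = W_i z_{i-1}\) the pre-activations): \[r_i = \left(W_{i+1}^\top\, r_{i+1}\right) \odot \sigma'(a_i),\] where \(\odot\) is the elementwise product. This is essentially the computation that the transposed application and the flat function perform in the Goal code below.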
The tricky part here is two-fold. Firstly, the error differentials \(r_i\) depend on the outputs \(z_{i-1}\); we would typically get around such dependencies by passing each \(z_{i-1}\) forward and recursively computing \(r_i\). However, our aim is not simply to compute the local errors \(r_i\), but rather the derivatives of all the parameters of the network. As such, in a strongly-typed language like Haskell, our second problem is that any recursive function for computing the differentials of a whole network's parameters has to know the type of the whole network in advance. This subtlety means that most implementations of backprop in Haskell (as far as I know) rely on explicitly implementing and storing the results of the forward and backward passes, and then collecting the results into the network differential. Theoretically, though, this shouldn't be necessary, because the differential for the weights \(W_i\) is fundamentally local and recursive.
We can implement this without explicitly storing values of \(z\) by leveraging Goal types and laziness. In Goal, a Manifold is a Map if it can be applied to some inputs on a Manifold to return some outputs on another Manifold (the functions for doing that are >.> and the batch version >$>). A Map is then an instance of Propagate if it has a propagate function
class Map c f y x => Propagate c f y x where
    propagate :: [c #* y] -- ^ The error differentials
        -> [c #* x] -- ^ A vector of inputs
        -> c # f y x -- ^ The function (e.g. NN) to differentiate
        -> (c #* f y x, [c # y]) -- ^ The derivative, and function output
which takes differentials [c #* y] on the output manifold y (these are the \(r_i\)), the inputs that triggered those errors [c #* x] (these are the \(z_i\)), and the Map to differentiate c # f y x (this is the whole network). Finally, it returns the differential c #* f y x of the Map, as well as the outputs [c # y] of the Map.
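For intuition, on a purely linear layer the differential returned by propagate is essentially the averaged outer product \(r_i \otimes z_{i-1}\) from the math above. Here is a minimal, plain-Haskell sketch of that computation, with lists of Doubles standing in for Goal's points; none of these names are Goal's own:

-- Plain-Haskell sketch: the parameter differential of a linear layer w is
-- the average of the outer products of errors dys and inputs xs, and the
-- outputs are the ordinary matrix-vector products.
propagateLinear :: [[Double]] -> [[Double]] -> [[Double]] -> ([[Double]], [[Double]])
propagateLinear dys xs w = (avgOuter, map apply xs)
  where
    outer dy x = [ [ d * z | z <- x ] | d <- dy ]        -- dy ⊗ x
    avg mats = map (map (/ fromIntegral (length mats)))
                   (foldr1 (zipWith (zipWith (+))) mats) -- elementwise mean
    avgOuter = avg (zipWith outer dys xs)
    apply x = [ sum (zipWith (*) row x) | row <- w ]     -- w · x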
Now, given an instance of Propagate, we may use the backpropagation function
backpropagation :: Propagate c f y x
    => (a -> c # y -> c #* y)
    -> [(a, c #* x)]
    -> c # f y x
    -> c #* f y x
{-# INLINE backpropagation #-}
backpropagation grd ysxs f =
    let (yss,xs) = unzip ysxs
        (df,yhts) = propagate dys xs f
        dys = zipWith grd yss yhts
     in df
which takes a function for computing differentials (the differential of \(c\) at \(f(x)\)), some output/input pairs, and the Map c # f y x, and then returns the differential c #* f y x. Notice the laziness in the definition of the function: the propagate function uses the error differentials dys and returns the outputs yhts, while at the same time, the error differentials depend on the outputs yhts.
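To convince yourself that this kind of circular definition terminates, here is a minimal standalone sketch of the same knot-tying pattern in plain Haskell; the names are hypothetical and independent of Goal:

-- The first component of the result (the "forward pass") never forces
-- the error argument, so the circular bindings below unwind: y forces
-- only sin 0.5, dy then forces y, and dx finally forces dy.
forwardBackward :: Double -> Double -> (Double, Double)
forwardBackward err x = (sin x, err * cos x)

demo :: (Double, Double)
demo =
    let (y, dx) = forwardBackward dy 0.5 -- dy is defined circularly below
        dy = 2 * (y - 1)                 -- ...and depends on y
     in (y, dx)

The backpropagation function above is exactly this pattern, with propagate playing the role of forwardBackward.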
Finally, let's look at how a NeuralNetwork implements propagate (apologies, but note that the variable notation here is slightly different than what we saw earlier)
instance
    ( Propagate c f z y, Propagate c (NeuralNetwork gys g) y x, Map c f y z
    , Transition c (Dual c) y, Legendre y, Riemannian c y, Bilinear f z y )
    => Propagate c (NeuralNetwork ('(g,y) : gys) f) z x where
    {-# INLINE propagate #-}
    propagate dzs xs fg =
        let (f,g) = split fg
            fmtx = snd $ split f
            mys = transition <$> ys
            (df,zhts) = propagate dzs mys f
            (dg,ys) = propagate dys xs g
            dys0 = dzs <$< fmtx
            dys = zipWith flat ys dys0
         in (join df dg, zhts)
Again, there are some details here that I will elide, but I hope the core story still shines through. So to begin, line by line: dzs are the error differentials on the output, xs are the inputs, and fg is the neural network.
The first thing we do is split fg into the top layer f, and the rest of the network g. f is a linear function and a shift, and we'll need to split off the linear part fmtx. The next line produces mys (which are inputs to f) by applying the transfer, or transition, function (e.g. a sigmoid) to the linear output ys of g. We next apply propagate to the error differentials dzs, the mys we just computed, and the top layer f, to compute the differential df and the estimated outputs zhts. We also apply propagate to the error differentials dys (which are differentials at the output of g and the input of f), the inputs xs, and the neural network g. To compute the error differentials dys, we apply the transpose of fmtx to the error differentials dzs, and combine the results with ys by using the flat function. flat is a bit of verbiage from Riemannian geometry, but suffice it to say, when the transfer function is a sigmoid it computes the derivative of the sigmoid function at the given ys. Finally, we may gather our differentials by joining df and dg, and returning the outputs zhts.
So, how does this work? After the splitting of the neural network fg, we may rewrite this instance as two calls of propagate: propagate dzs (transition <$> ys) f and propagate (zipWith flat ys $ dzs <$< snd (split f)) xs g (spelled out below). These calls depend on each other: we need the output of g to feed forward through f, and we need to backpropagate dzs through f to compute the differentials dys. This mutual dependence allows us to solve the two-fold problem with backprop that we discussed earlier: the first call resolves the dependence of \(r_i\) on \(z_i\), and the second solves the problem of maintaining the whole network structure, by returning the differential of the lower layers g and allowing the propagate instance to combine it with the differential of f. As far as I understand (and I'm not at all an expert here), this is a lazy implementation of reverse-mode automatic differentiation.