Lazy backpropagation

In this post we’ll go through the neural-network script, and see how to fit a simple neural network in Goal. I will then show how I’ve implemented an implicit version of backprop by combining Goal types with lazy evaluation, and thereby avoid explicitly storing and passing gradients. Explaining this implementation of backprop is my primary interest in this tutorial, as it is general, efficient, and (I think) kind of cool.

A toy example

To begin, we’re going to consider the arbitrary function

f :: Double -> Double
f x = exp . sin $ 2 * x

and prepare to generate some noisy samples from it

fp :: Source # Normal
fp = fromTuple (0,0.1)

mnx,mxx :: Double
mnx = -3
mxx = 3

xs :: [Double]
xs = concat . replicate 5 $ range mnx mxx 8

The xs here are the inputs in the training data, and we’ll generate the outputs ys once we get to defining main.

The next step is to define and initialize our neural network. We’ll initialize the parameters of our network with the normal distribution

cp :: Source # Normal
cp = fromTuple (0,0.1)

As for the network itself, we may specify its structure with the following type synonym:

type NeuralNetwork' =
    NeuralNetwork '[ '(Tensor, R 50 Bernoulli)]
    Tensor NormalMean NormalMean

This defines a model with 1-dimensional inputs and outputs, a linear output layer, and a 50 neuron hidden layer with sigmoid activation functions. The types for neural networks in Goal are somewhat idiosyncratic, but there is a method to the madness. Feel free to skip the next paragraph if you don’t care how to read this type; regardless, please note that the underlying implementation is quite efficient, and interested parties are welcome to contact me if a more standard interface for specifying neural networks is desired.

In Goal, a NeuralNetwork is a type of Map, where Maps are parametric functions. Starting from the right, the NormalMean type indicates that the input of the NeuralNetwork is a 1-dimensional real number. The next NormalMean and the Tensor indicate that the NeuralNetwork has a linear output layer. Finally, the type-list with one element indicates that the network has one layer, and the Tensor type again indicates that it is fully-connected. Finally, R 50 Bernoulli indicates that the layer has 50 neurons, and the transfer function is given by the mapping from Natural to Mean coordinates for the Bernoulli distribution, which happens to be the sigmoid.

Getting back to things, we next have some training parameters

nepchs :: Int
nepchs = 1000

eps :: Double
eps = 0.05

mxmu :: Double
mxmu = 0.999

which we’ll explain momentarily in context.

Now, the first things we want to do in main are generate some noisy responses, and bundle them up with the inputs to define our training data xys

main :: IO ()
main = do

    ys <- realize $ mapM (noisyFunction fp f) xs

    let xys = zip ys xs

We then define our cost function as the conditional log-likelihood given these training data

    let cost :: Natural # NeuralNetwork' -> Double
        cost = conditionalLogLikelihood xys

and we use its differential

    let backprop :: Natural # NeuralNetwork' -> Natural #* NeuralNetwork'
        backprop = conditionalLogLikelihoodDifferential xys

to drive our gradient pursuit algorithms. In the following paragraph I describe some of the details of differentials in goal, but this can also be freely skipped.

So, note that where Natural # NeuralNetwork' is a point in a neural network manifold in natural coordinates, the #* in the differential Natural #* NeuralNetwork' indicates that the differential is in the Dual space of Natural # NeuralNetwork. For exponential families, the dual space of the Natural coordinates are the Mean coordinates. That being said, NeuralNetworks aren’t really exponential families, and we rather only use exponential family structures as a shortcut for defining them.

Anyway, we then need to initialize our neural network

    mlp0 <- realize $ initialize cp

and do some gradient pursuit!

    let sgdmlps = take nepchs $ vanillaGradientSequence
            backprop eps Classic mlp0
        mtmmlps = take nepchs $ vanillaGradientSequence
            backprop eps (defaultMomentumPursuit mxmu) mlp0
        admmlps = take nepchs $ vanillaGradientSequence
            backprop eps defaultAdamPursuit mlp0
Log-likelihood ascent Learned models

Just like in my introduction to gradient pursuit, we’re going to compare the performance of classic, momentum, and Adam gradient descent algorithms. In this case however, the problem itself is not trivial. And indeed, in contrast with the previous tutorial, here Adam has a clear advantage, and achieves a higher overall log-likelihood, and more exactly captures the true function from which the data was noisily sampled.

Lazy backprop implementation

Now, having demonstrated that my implementation of neural networks does indeed work, I would like to dig into the implementation. First, let’s review the math of backpropagation. We may think of a neural network as a function \(f(x;~\theta)\) where \(x\) is an input and \(\theta\) are the parameters of the network. Suppose we have a cost function \(c(f(x_i\ ; \theta),y_i)\) which tells us how close the neural network value \(f(x_i)\) is to the target \(y_i\). Then given some input output data \((x_1,y_1), \ldots, (x_n,y_n)\), we optimize the neural network by maximizing \(\sum c(f(x_i\ ; \theta),y_i)\) with respect to \(\theta\), which we do in turn by computing differentials (or derivatives) of \(c\) with respect to \(\theta\) and using them to iteratively improving the network.

Now, let us focus on a single data point \((x,y)\), and imagine we have a network with \(d\) layers, and that the parameters of the network \(\theta = (W_1, \ldots, W_d)\) correspond to the weights of each layer of the network. We may then express the differential of \(c\) with respect to the weights \(W_i\) in layer \(i\) as \[\delta_{W_i} \ c(f(x; \theta),y) = r_i \otimes z_{i-1},\] where \(\otimes\) is the outer product, \(z_{i-1}\) are the feedforward outputs from the previous layer, and \(r_i\) are the backpropagated error differentials from layer \(i+1\).

The tricky part here is two-fold: Firstly, the error differentials \(r_i\) depend on the outputs \(z_{i-1}\). Now, we typically would get around such dependencies by passing each \(z_{i-1}\) forward and recursively computing \(r_i\). However, our aim is not simply to compute the local errors \(r_i\) but rather the derivatives of all the parameters of the network. As such, in a strongly typed-language like Haskell, our second problem is that any recursive function for computing the differentials of a whole network’s parameters has to know the type of the whole network in advance. This subtlety means that most implementations of backprop in Haskell (as far as I know), rely on explicitly implementing and storing the results of the forward and backward passes, and then collecting the results into the network differential. Theoretically though, this shouldn’t be necessary, because the differential for the weights \(W_i\) is fundamentally local and recursive.

We can implement this without explicitly storing values of \(z\) by leveraging Goal types and laziness. In Goal, a Manifold is a Map if it can be applied to some inputs on a Manifold to return some outputs on another Manifold (the functions for doing that are >.> and the batch version >$>). A Map is then an instance of Propagate if it has a propagate function

class Map c f y x => Propagate c f y x where
    propagate :: [c #* y] -- ^ The error differential
              -> [c #* x] -- ^ A vector of inputs
              -> c # f y x -- ^ The function (e.g. NN) to differentiate
              -> (c #* f y x, [c # y]) -- ^ The derivative, and function output

which takes differentials [c #* y] on the output manifold y (these are the \(r_i\)), the inputs that trigged those errors [c #* x] (these are the \(z_i\)), and the Map to differentiate c # f y x (this is the whole network). Finally, it returns the differential c #* f y x of the Map, as well as the outputs [c # y] of the Map.

Now, given an instance of Propagate, we may use the backpropagation function

    :: Propagate c f y x
    => (a -> c # y -> c #* y)
    -> [(a, c #* x)]
    -> c # f y x
    -> c #* f y x
{-# INLINE backpropagation #-}
backpropagation grd ysxs f =
    let (yss,xs) = unzip ysxs
        (df,yhts) = propagate dys xs f
        dys = zipWith grd yss yhts
     in df

which takes a function for computing differentials (the differential of \(c\) at \(f(x)\)), some output/input pairs and the Map c # f y x, and then it returns the differential c #* f y x. Notice the laziness the definition of the function: the propagate function uses the error differentials dys and returns the outputs yhts, while at the same time, the error differentials depend on the outputs yhts.

Finally, let’s look at how a NeuralNetwork implements propagate (apologies, but note that the variable notation here is slightly different than what we saw earlier)

    ( Propagate c f z y, Propagate c (NeuralNetwork gys g) y x, Map c f y z
    , Transition c (Dual c) y, Legendre y, Riemannian c y, Bilinear f z y)
  => Propagate c (NeuralNetwork ('(g,y) : gys) f) z x where
      {-# INLINE propagate #-}
      propagate dzs xs fg =
          let (f,g) = split fg
              fmtx = snd $ split f
              mys = transition <$> ys
              (df,zhts) = propagate dzs mys f
              (dg,ys) = propagate dys xs g
              dys0 = dzs <$< fmtx
              dys = zipWith flat ys dys0
           in (join df dg, zhts)

Again, there are some details here that I will elide, but I hope the core story still shines through. So to begin, line by line, dzs are the error differentials on the output, xs are the inputs, and fg is the neural network.

The first thing we do is split fg into the top layer f, and the rest of the network g. f is a linear function and a shift, and we’ll need to split off the linear part fmtx. The next line produces mys (which are inputs to f) by applying the transfer, or, transition function (e.g. a sigmoid) to the linear output ys of g. We next apply propagate to the error differentials dzs, the mys we just computed, and the top layer f, to compute the the differential df, and the estimated outputs zhts. We also apply propagate to the error differentials dys — which are differentials at the output of g/input of f — the inputs xs, and the neural network g. Finally, to compute the error differentials dys, we apply the transpose of fmtx to the error differentials dzs, and combine the results with ys by using the flat function. flat is a bit of verbiage from Riemannian geometry, but sufficed to say, when the transfer function is a sigmoid it computes the derivative of the sigmoid function at the given ys. Finally, we may gather our differentials by joining df and dg, and returning the outputs zhts.

So, how does this work? After the splitting of the neural network fg, we may rewrite this instance as two calls of propagate: propagate dzs (transition <$> ys) f and propagate (zipWith flat ys $ dzs <$< snd (split f)) xs g. These calls depend on each other: we need the output of g to feed-forward through f, and we need to backpropagate dzs through f to compute the differentials dys. This mutual dependence allows us to solve the two-fold problem with backprop we discussed earlier: the first backpropagate solves the dependence of \(r_i\) on \(z_i\), and the second dependence solves the problem of maintaining the whole network structure, by returning the differential of the lower layers g, and allowing the propagate instance to combine them with the differential of f. As far as I understand (and I’m not at all an expert here), this is a lazy implementation of reverse-mode automatic differentiation.