Fitting a mixture of von Mises distributions

In this post we’ll go through the von-mises-mixture script, and see how to fit a mixture of von Mises distributions with Goal. I have developed the Goal libraries primarily to do type-safe statistical modelling, and this post will touch on many of the features of Goal that make it worth using.

The imports

import Goal.Core
import Goal.Geometry
import Goal.Probability
import Goal.Graphical

import qualified Goal.Core.Vector.Storable as S

are standard for a Goal script. The four unqualified imports pull in all the Goal libraries, and the qualified import of Goal.Core.Vector.Storable provides the tools for working with Vectors that have not been wrapped up as Points.
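
For example (a small sketch, not part of the script, and assuming the sized Vector type that Goal.Core.Vector.Storable exports), a raw fixed-length vector that has not been wrapped as a Point can be built with S.fromTuple:

rawvec :: S.Vector 3 Double
rawvec = S.fromTuple (0.1, 0.2, 0.3) -- a plain length-3 storable vector, not a Point on any manifold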

Goal treats the task of fitting a model to data as that of finding a Point on a statistical manifold, that is, a manifold whose points are probability distributions. In this case we consider manifolds of von Mises distributions, which behave approximately like normal distributions on the unit circle. The von Mises manifold in Goal is indicated by the VonMises type, and is typically parameterized by two coordinates: the mean and the concentration (approximately the inverse of the variance).

In Goal we use the so-called Source coordinates to describe a model in its “standard” coordinate system, e.g. for normal distributions this is the mean and variance. For von Mises distributions this is the mean and concentration, and we can create a von Mises distribution in Source coordinates by writing

vm :: Source # VonMises
vm = fromTuple (0,1)

so that vm is a von Mises distribution with mean 0 and concentration 1.
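
Under the hood a Point is just a wrapper around a coordinate vector. As a quick check (a sketch; the name vmcoords is introduced only for illustration), listCoordinates, which we will also use further below, reads those coordinates back out:

vmcoords :: [Double]
vmcoords = listCoordinates vm -- [0.0, 1.0]: the mean followed by the concentration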

In this tutorial we’re not simply mixing von Mises distributions, but rather products of von Mises distributions, so that we can model distributions on a bounded plane. Such a product manifold is defined by the type (VonMises,VonMises), and we create three such product von Mises distributions with the lines

vm1,vm2,vm3 :: Source # (VonMises,VonMises)
vm1 = fromTuple (1, 1.5, 4.5, 2)
vm2 = fromTuple (3, 2, 3.5, 3)
vm3 = fromTuple (4.5, 4, 1.5, 2)

To create a mixture distribution we also need to define weights for the component distributions, which in Goal are represented by a Categorical distribution. We create a Categorical distribution over three categories with the following lines

wghts :: Source # Categorical 2
wghts = fromTuple (0.33,0.33)

where the probability of the first category is given by 1 - sum (listCoordinates wghts).
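
Spelled out (a small sketch; the name wght0 is introduced only for illustration), the weight of the first component is

wght0 :: Double
wght0 = 1 - sum (listCoordinates wghts) -- 1 - (0.33 + 0.33) = 0.34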

Finally, we may create a mixture distribution out of these previously defined distributions with the lines

strumxmdl :: Source # Mixture (VonMises,VonMises) 2
strumxmdl = joinSourceMixture (S.fromTuple (vm1,vm2,vm3)) wghts

In Goal, I attempt to define the Source coordinates to match the conventional definitions found in textbooks and on Wikipedia. Typically, when creating example or initial distributions, it is most intuitive to do so in Source coordinates. Nevertheless, most algorithms and implementations in Goal are based on exponential family formulations of statistical models, and so the workhorse coordinate systems of Goal are the Mean and Natural coordinates.
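
As a quick reminder of the standard exponential family bookkeeping (this is textbook material rather than anything Goal-specific): an exponential family density has the form p(x) ∝ exp(θ · s(x)), where θ are the Natural parameters and s(x) is the sufficient statistic, and the Mean parameters are the expected sufficient statistics E[s(x)]. For a von Mises distribution with mean μ and concentration κ, the sufficient statistic is s(x) = (cos x, sin x), so the Natural parameters are (κ cos μ, κ sin μ).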

To convert our mixture model from Source coordinates to Natural coordinates, we simply use the transition function

trumxmdl :: Natural # Mixture (VonMises,VonMises) 2
trumxmdl = transition strumxmdl

Having created a mixture distribution, our next task will be to generate some data from this “ground-truth” distribution, and see if we can recover it from the observations with the power of statistics. To do so we’re going to define a few variables to initialize and train our model.

Firstly, we want to randomly generate an initial point in our statistical model, which we will then optimize. We will choose this point by drawing each of its coordinates from the normal distribution

mxmdlint :: Source # Normal
mxmdlint = fromTuple (0,0.1)

Secondly, there are a handful of variables pertaining to training that we need to define, namely

nsmps :: Int
nsmps = 100

eps :: Double
eps = 0.05

bnd :: Double
bnd = 1e-5

admmlt :: Int
admmlt = 10

-- EM
nepchs :: Int
nepchs = 200

but I’ll explain them as we review the training procedures that we’ll be using.

Now, the standard framework for fitting a mixture model is expectation-maximization (EM), and Goal provides the expectationMaximization function, which works for a wide class of models. Nevertheless, it requires that the model in question satisfy certain analytic properties, which von Mises mixtures do not. For such cases Goal provides the expectationMaximizationAscent function, which approximates the maximization step of EM with gradient ascent, and has looser restrictions on the model. We use this function to construct the EM iterator

vonMisesEM
    :: Sample (VonMises,VonMises) -- ^ Observations
    -> Natural # Mixture (VonMises,VonMises) 2
    -> Natural # Mixture (VonMises,VonMises) 2
vonMisesEM zs nmxmdl = cauchyLimit euclideanDistance bnd
    $ expectationMaximizationAscent eps defaultAdamPursuit zs nmxmdl

where bnd is the threshold that cauchyLimit uses to stop the gradient ascent (it runs the ascent until successive iterates are within bnd of each other under euclideanDistance), and eps is the learning rate.

An alternative procedure for fitting latent variable models is to simply perform gradient ascent on the marginal log-likelihood of the mixture given data. This procedure fails when applied to Gaussian mixture models, but we will test if it works for von Mises mixtures.

To continue, let us generate nsmps samples from our ground-truth mixture distribution trumxmdl, initialize our model mxmdl0, and collect our observations xys:

main :: IO ()
main = do

    -- draw (observation, latent component) pairs from the ground-truth mixture
    cxys <- realize $ sample nsmps trumxmdl
    -- randomly initialize the model to be trained, drawing each coordinate from mxmdlint
    mxmdl0 <- realize $ initialize mxmdlint

    -- keep only the observations, discarding the latent component labels
    let xys = fst <$> cxys

We will then run our two procedures for nepchs iterations

    let emmxmdls = take nepchs $ iterate (vonMisesEM xys) mxmdl0

    let admmxmdls = take nepchs . takeEvery admmlt
            $ vanillaGradientSequence (logLikelihoodDifferential xys) eps defaultAdamPursuit mxmdl0

In the case of gradient ascent on the log-likelihood, we multiply the number of gradient steps taken by admmlt, since each iteration of the EM algorithm itself requires a number of sub-steps, and we want a fair comparison between the two algorithms. Concretely, the EM list contains nepchs = 200 models, one per EM iteration, while the gradient sequence runs for nepchs × admmlt = 2000 Adam steps, of which every tenth model is recorded.

    let emnlls = logLikelihood xys <$> emmxmdls
        admnlls = logLikelihood xys <$> admmxmdls
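
The full script goes on to export these series for plotting; purely as a sketch (this is not part of the original script), one could also print the final log-likelihood reached by each method:

    -- print the last recorded log-likelihood for EM and for gradient pursuit
    print ("EM", last emnlls)
    print ("GP", last admnlls)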

Figure: “Log-likelihood ascent” (left) and “Learned models” (right).

We plot the results of our simulations in the above figure. On the left we have the model performance (log-likelihood) as a function of the number of training iterations, where an “iteration” is one step of EM, or ten steps of gradient pursuit. In contrast with mixtures of normal distributions, mixtures of von Mises distributions can be trained effectively with gradient pursuit.

On the right are representations of the ground-truth mixture distribution, the 100 data points sampled from it, and the models learned by EM and gradient pursuit (GP). Indeed, we see that both procedures learn more-or-less the same model. I’ll note, though, that the results aren’t always so similar; try running the von-mises-mixture script yourself to see.