Modelling the neural code in large populations of correlated neurons
In this post I’m going to show how to use the neural-mixtures
command-line interface (CLI) to generate the key results and corresponding figures from my paper Modelling the neural code in large populations of correlated neurons (Sokoloski, Aschner, and Coen-Cagli 2021). Along the way I will also explain how interested users can format new datasets for use with the neural-mixtures
CLI.
Introduction
The goal of my work is to provide a model of neural spike-count responses to stimuli that
- is analytically tractable,
- captures first, second, and even higher-order neural response statistics,
- captures response features used by the neural code,
- and efficiently scales to large populations of neurons.
In the following paragraphs I will demonstrate how I have achieved these aims, and how new users can apply my models to new data. To ease navigation, I’ve broken this post down into the following sections:
- Formatting datasets for use with the neural-mixtures CLI
- Installation
- CLI introduction and creating a toy experiment
- Modelling stimulus-response recordings with conditional mixtures
- Additional notes on working with the neural-mixtures library
Formatting datasets for use with the neural-mixtures CLI
Before we begin the tutorial proper, I’ll explain how to format datasets for use with the neural-mixtures
CLI. After installing neural-mixtures
, and as we try out the CLI on the included datasets, you may substitute your own datasets into the commands to begin analyzing your data with neural-mixtures
.
Data in neural-mixtures
is always organized in a simple directory structure relative to the root of the neural-mixtures
repository. Suppose we have a {dataset} from an {experiment}, where the {dataset} is a collection of trials, and where each trial corresponds to a single stimulus-response recording. To analyze the {dataset} with neural-mixtures
, create the following plain text files relative to the root of the neural-mixtures
repository:
experiments/{experiment}/{dataset}/stimuli.csv
where:
- The index of the line of the file corresponds to the index of the trial.
- Each line is a single stimulus value. Note that the stimulus value is treated by neural-mixtures as a text string, and can be arbitrary.
experiments/{experiment}/{dataset}/responses.csv
where:
- The index of the line of the file corresponds to the index of the trial.
- Each line is a comma-separated list of natural numbers, where each number is the response of a single neuron.
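For example, a hypothetical recording of three neurons over four trials might have a stimuli.csv containing
0
90
45
90
and a responses.csv containing
3,0,7
1,2,4
0,5,9
2,1,6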
Any programming language should have tools for outputting matrices of data as comma-separated text files. In this case, the data in stimuli.csv corresponds to a matrix with one column, and the data in responses.csv corresponds to a matrix with \(d_N\) columns, where \(d_N\) is the number of recorded neurons.
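As a minimal sketch of how such files might be written from Python with numpy (the experiment and dataset names here are hypothetical placeholders):

import os
import numpy as np

# One stimulus label per trial; neural-mixtures treats the labels as arbitrary strings.
stimuli = np.array(["0", "90", "45", "90"])
# One row per trial, one column per neuron; entries are spike counts.
responses = np.array([[3, 0, 7],
                      [1, 2, 4],
                      [0, 5, 9],
                      [2, 1, 6]])

dataset_dir = "experiments/my-experiment/my-dataset"  # hypothetical names
os.makedirs(dataset_dir, exist_ok=True)
np.savetxt(os.path.join(dataset_dir, "stimuli.csv"), stimuli, fmt="%s")
np.savetxt(os.path.join(dataset_dir, "responses.csv"), responses, fmt="%d", delimiter=",")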
The train
, analyze
, and cross-validate
programs can then be applied to these files by running e.g.
stack exec -- train {experiment} {dataset}
and setting the program arguments as desired.
Installation
I developed neural-mixtures
in the Haskell programming language, but installing and applying the neural-mixtures
CLI to data does not require any knowledge of Haskell. Managing Haskell packages is automated by the “Haskell Tool Stack” (stack
for short). stack
can be installed on Linux and macOS by running
curl -sSL https://get.haskellstack.org/ | sh
from a terminal. Windows users may run the installer provided here.
Installation on Linux/macOS
stack
reduces compiling most Haskell code to a single command. However, neural-mixtures
also depends on non-Haskell code for numerical computation and plotting. These tools ensure that neural-mixtures
is fast, but they cannot be installed automatically by stack
. What we need are
- some form of BLAS and LAPACK for linear algebra,
- the GNU Scientific Libraries (GSL) for numerics,
- and
gnuplot
for plotting.
These are all very standard, and if your target computer is used to perform scientific computing, they are likely already installed.
I use Arch Linux, which provides the pacman
package manager, and I use the OpenBLAS implementation of BLAS. As such, on my system I install the relevant packages with
pacman -S gsl lapack openblas gnuplot
For Debian-based systems this should be something like
apt-get install libgsl-dev liblapack-dev libopenblas-dev gnuplot
and for macOS,
brew install gsl lapack openblas gnuplot
should do the trick.
Now, once you have installed stack
and the aforementioned numerics libraries, neural-mixtures
may be installed by running
git clone https://gitlab.com/sacha-sokoloski/neural-mixtures
cd neural-mixtures
stack build
Installation will take some time, but should work if the numerics libraries were properly installed. Please let me know if any errors arise on your machine.
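As a quick sanity check that the build succeeded, you can ask one of the compiled programs (introduced below) for its help text, e.g.
stack exec -- train --help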
Installation on Windows
For Windows things are more complicated, and require using the MSYS2 tools, which provide a Unix-style environment for Windows. Helpfully, stack
comes with an installation of MSYS2 that simplifies things. For reference, the following is based on my experience setting up neural-mixtures
on a Thinkpad X1 Carbon (8th Gen) running Windows 10.
After running the stack
installer, the first step is to locate its MSYS2 install. The directory where stack
stores its programs can be located by executing
stack path --programs
from the Windows command prompt. This will trigger stack
to install MSYS2, after which it will display the directory, which should be something like:
C:\Users\{username}\AppData\Local\Programs\stack\x86_64-windows
Within this directory there should be an install of MSYS2, with a name like msys2-20210604
, where the number corresponds to the date of the particular MSYS2 build provided by the current version of stack
. As such, you want to open a directory with the form
C:\Users\{username}\AppData\Local\Programs\stack\x86_64-windows\msys2-{builddate}
Within this directory is the msys2_shell
command, which is the program we will use to run the neural-mixtures
CLI. If you plan on making frequent use of neural-mixtures
, I recommend adding a link to msys2_shell
to your desktop or some other convenient place. Note that from this point on we no longer require the Windows command prompt.
Now that we have installed stack
and a Unix-like environment, we need to install the aforementioned libraries, as well as some development tools. MSYS2 actually uses the pacman
package manager that was developed for Arch Linux, and from here on, the installation process will start to look like that of Linux. Run
pacman -S base-devel p7zip git mingw-w64-x86_64-openblas mingw-w64-x86_64-gsl mingw-w64-x86_64-gnuplot
to install basic development tools and the neural-mixtures
dependencies.
Next, to run stack
from within the MSYS2 environment, we run the installer for Unix-like systems
curl -sSL https://get.haskellstack.org/ | sh
Although it’s also possible to run stack
from the Windows command prompt, and use it to install neural-mixtures
, I designed the neural-mixtures
CLI to be run in a Unix-like environment, and I’d recommend sticking with MSYS2 to avoid errors.
Anyway, we’re now ready to build and install neural-mixtures
. In the MSYS2 prompt, run the commands
git clone https://gitlab.com/sacha-sokoloski/neural-mixtures
cd neural-mixtures
stack build --flag hmatrix:openblas --flag hmatrix-gsl:onlygsl
These extra flags help stack build
find the requisite libraries in our Frankenstein development environment. Having gone through these steps, the rest of this tutorial is platform-independent. One last note: if you’re looking for the neural-mixtures
repository from the standard Windows environment (so that you can find e.g. the plots generated by the CLI), you will find it under
msys2-{builddate}/home/{username}/neural-mixtures
relative to stack
’s MSYS2 install.
CLI introduction and creating a toy experiment
Installing neural-mixtures
compiles four command-line programs: synthesize
, train
, analyze
, and cross-validate
. These can be run in the root of the neural-mixtures
repository by using the stack exec --
command, followed by the program name and its arguments (the double dash --
separates the arguments for the given program from the arguments for stack
itself). One argument accepted by all programs is --help
, which will output a long (and hopefully helpful) description of the various arguments that the given program can accept.
Synthesizing an experiment
To introduce the CLI, we will focus on the synthesize
program, which allows us to create ground-truth models and toy datasets for use with the rest of the CLI. We generate the help output for the synthesize
command by running
stack exec -- synthesize --help
I won’t reproduce the entire output of the help command here, but here is the list of arguments that synthesize
accepts:
Usage: synthesize [-e|--experiment-name ARG] [-n|--n-neurons ARG]
                  [-k|--k-components ARG] [-s|--n-oris ARG] [-S|--n-samples ARG]
                  [-N|--sensory-noise ARG] [-g|--gain-mu ARG] [-G|--gain-vr ARG]
                  [-p|--precision-mu ARG] [-P|--log-precision-vr ARG]
                  [-h|--min-shape ARG] [-H|--max-shape ARG] [...]
In summary, we can specify a name for this synthetic experiment, properties of the generated model such as average tuning properties and number of mixture components, and properties of the synthesized data such as number of stimulus conditions and sample size.
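For instance, assuming hypothetical values for the flags listed above, a small toy experiment could be synthesized with something like
stack exec -- synthesize -e toy -n 50 -k 5
which should create an experiments/toy directory analogous to the default one described below.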
In the neural-mixtures
CLI, an experiment is a directory that lives in the experiments
directory, at the root of the neural-mixtures
repository. An experiment directory contains a set of dataset directories, and each dataset directory contains a stimuli.csv
file and a responses.csv
file that list presented stimuli and recorded neural responses, respectively. Data and plots generated by the neural-mixtures
CLI are also stored in the dataset directories.
Let’s go ahead and execute the command
stack exec -- synthesize
which runs synthesize
with all of its default arguments. The default experiment name for synthesize
is synthetic
, and so if this command executed successfully, there should be a new directory at experiments/synthetic
. Within the synthetic
directory there should be the following files
experiments/synthetic/poisson/stimuli.csv
experiments/synthetic/poisson/responses.csv
experiments/synthetic/poisson/true/parameters.dat
experiments/synthetic/com-based/stimuli.csv
experiments/synthetic/com-based/responses.csv
experiments/synthetic/com-based/true/parameters.dat
The synthesize
program always synthesizes two datasets, which are stored in the poisson
and com-based
directories. These datasets are generated from conditional mixtures of Poisson distributions and Conway-Maxwell Poisson distributions, respectively (see Sokoloski, Aschner, and Coen-Cagli (2021)). The synthesize
program also saves the ground-truth parameters of each randomized conditional mixture in the files poisson/true/parameters.dat
and com-based/true/parameters.dat
.
Analyzing synthetic data
The analyze
program generates collections of plots given a dataset and a model, and we may use it to compare the ground-truth models to their synthesized data in our synthetic
experiment. The first arguments to the analyze
program are the sequence of directory names, relative to the experiments directory, in which an appropriate parameters.dat
file is located. For example, we may analyze the models we generated with the synthesize
command by executing
stack exec -- analyze synthetic poisson true
stack exec -- analyze synthetic com-based true
analyze
generates the same set of plots for both commands, so for the sake of brevity we’ll review some of the results of only the com-based
analysis. All of these results can be found in the com-based/true
directory.
In the paper, a conditional mixture (CM) is a model of the form \(p(\mathbf n, k \mid x)\), where \(\mathbf n\) is a vector of spike-counts, \(k\) is the index of a mixture-component, and \(x\) is the stimulus; we refer to the CoM-based conditional mixture, in particular, as a CB-CM. The analyze
command renders the “tuning curves” of the given CM (i.e. the stimulus-dependent firing rate of each neuron) in the file tuning-curves.png
, and renders the stimulus-dependent probabilities of the mixture components \(p(k \mid x)\) in the file weight-dependence.png
.
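Since a CM specifies the joint distribution \(p(\mathbf n, k \mid x)\), the observable response distribution follows by marginalizing over the component index:
\[ p(\mathbf n \mid x) = \sum_k p(k \mid x)\, p(\mathbf n \mid k, x), \]
so the curves in weight-dependence.png are the mixing probabilities \(p(k \mid x)\), and each tuning curve in tuning-curves.png is the mean spike-count of one neuron under \(p(\mathbf n \mid x)\), plotted as a function of \(x\).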



The “Fisher information” (FI) is a mathematical tool for understanding how much information a given neural population has about a stimulus. The linear Fisher information (LFI) is a linear approximation of the FI which can be estimated effectively from data, and is useful when the FI of a model cannot be evaluated in closed-form. Nevertheless, for CMs the FI can be computed in closed-form. In the fisher-information.png
file, the analyze
command visualizes both the FI and LFI of the model, as well as a bias-corrected empirical estimate of the LFI (Kanitscheider et al. 2015) based on the synthetic data generated by the synthesize
command.
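For reference, the standard definitions (in my notation) are
\[ \mathrm{FI}(x) = \mathbb E\left[ \left( \partial_x \log p(\mathbf n \mid x) \right)^2 \mid x \right], \qquad \mathrm{LFI}(x) = \mathbf f'(x)^\top \Sigma(x)^{-1} \mathbf f'(x), \]
where \(\mathbf f(x)\) is the vector of tuning curves and \(\Sigma(x)\) is the stimulus-conditioned covariance matrix of the responses; the LFI depends only on the first two moments of the response distribution, which is what makes it estimable from data.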
The analyze
command also generates the fano-factor-scatter.gif
and noise-correlation-scatter.gif
files. These files are animations, and render scatter plots of the Fano factors and noise correlations of the model, as compared to the empirical Fano factors and noise correlations, where each step of the animation corresponds to a different stimulus.
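For reference, the Fano factor of neuron \(i\) and the noise correlation between neurons \(i\) and \(j\) at stimulus \(x\) are the standard quantities
\[ \mathrm{FF}_i(x) = \frac{\mathrm{Var}[n_i \mid x]}{\mathbb E[n_i \mid x]}, \qquad \rho_{ij}(x) = \frac{\mathrm{Cov}[n_i, n_j \mid x]}{\sqrt{\mathrm{Var}[n_i \mid x]\,\mathrm{Var}[n_j \mid x]}}, \]
which is why each frame of the animations compares model and empirical values at a single stimulus.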


It is interesting to note that the \(r^2\) of these scatter plots is quite modest, indicating that even with a sample size of 1600, the empirical estimates of the Fano factors and noise correlations are far from those of the ground-truth model.
Modelling stimulus-response recordings with conditional mixtures
In the rest of the tutorial we’ll analyze response recordings from macaque primary visual cortex (V1), but feel free to substitute your own datasets into the neural-mixtures
CLI to see what happens!
Training conditional mixtures on V1 response data
Now that we know how to analyze a CM with neural-mixtures
, let’s fit one to data. In (Sokoloski, Aschner, and Coen-Cagli 2021) we considered two recordings from macaque V1 responding to oriented gratings — one where the monkey is under anaesthesia, and one where it is awake. The stimuli.csv
and responses.csv
files for these two experiments are available under experiments/amir-anaesthetized/129r001p173_preprocess
and experiments/amir-awake-filtered/cadetv1p438_tuning
, respectively.
Let’s fit a CB-CM to the cadetv1p438_tuning
data, by running the command
stack exec -- train amir-awake-filtered cadetv1p438_tuning discrete com-based -k 20 -e 1000
To break this down, the first two arguments amir-awake-filtered
cadetv1p438_tuning
indicate the dataset we will be analyzing. The next argument discrete
indicates that we’ll be using discrete tuning curves (which make no assumption about the continuity of the stimulus like von Mises tuning curves), and com-based
indicates we’ll be using com-based
, rather than poisson
tuning curves. The arguments -k 20
and -e 1000
tell train
to train a model with 20 mixture components for 1000 epochs. All the parameters to train
are described with the --help
command.

The train program renders only a single plot: log-likelihood-ascent.png
. In this particular demonstration we're fitting to the complete dataset rather than holding out data, and the plot shows the log-likelihood of the model given the training data after each training epoch. Note that at 800 epochs there's a sudden jump in performance. This is because when training a CB-CM, we first train an IP-CM (a CM with independent Poisson components) for a certain number of epochs (in this case 800) before extending it to a CB-CM, which significantly reduces execution time.
To see a more complete analysis of the resulting model, we return to the analyze
command. The results of our train
command (i.e. parameters.dat
) are saved in experiments/{experiment}/{dataset}/discrete/com-based
, and so in this case we run
stack exec -- analyze amir-awake-filtered cadetv1p438_tuning discrete com-based -c
where the -c
argument tells analyze
that even though we’re analyzing a discrete model, the stimulus is continuous.
The outputs of the analyze
command remain the same as for the synthetic experiment we considered previously, and so for the sake of brevity we’ll just look at the Fano factor and noise correlation scatter plots.


Based on these scatter plots, the discrete CB-CM appears to do a good job of capturing the empirical statistics of the recorded V1 responses. To evaluate whether the CB-CM captures the true statistics of the response distribution, we deploy the cross-validate
program.
Cross-validating conditional mixtures
The last program in the neural-mixtures
CLI is cross-validate
, which essentially wraps up the train
program inside a cross-validation loop, and evaluates the desired model for a number of different mixture components \(d_K\). To cross-validate the performance of a CB-CM on our awake V1 data, we run
stack exec -- cross-validate amir-awake-filtered cadetv1p438_tuning discrete com-based
There are a number of parameters that we could set here (consult stack exec -- cross-validate --help
), but the defaults are fine in this case. Note that where our train
command from earlier took about 10 minutes on a 16-core machine, we are now running over 100 instances of this train command. As such, you may want to pass alternative (more conservative) parameters to cross-validate
to get a sense of how it runs, and in any case please consult the following section on performance optimization on how to activate multi-threading in neural-mixtures
. On a 16-core machine with multi-threading enabled, the above command took about 5 hours to complete.
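For reference, the statistic behind the log-likelihood comparisons below is, roughly, the average held-out log-likelihood across folds; a sketch of the standard \(F\)-fold definition (my notation, not necessarily exactly what the paper reports) is
\[ \ell_{\mathrm{CV}} = \frac{1}{F} \sum_{f=1}^{F} \frac{1}{|\mathcal D_f|} \sum_{(x, \mathbf n) \in \mathcal D_f} \log p_{\hat\theta_{-f}}(\mathbf n \mid x), \]
where \(\hat\theta_{-f}\) are the parameters fit to all trials outside fold \(\mathcal D_f\).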
cross-validate
generates a variety of plots, and below we look at two of them.


For exact details on what these plots mean (especially the one on the right), please consult the paper, but in short they describe (Left) how well the model captures the statistics of how the recorded neurons respond to stimuli, and (Right) how well the model supports decoding the information that the neural responses contain about the stimuli. As we see on the left, log-likelihood performance peaks around \(d_K = 25\) mixture components, with most of the performance gain achieved by about \(d_K = 15\) components. On the right, we see that the log-posterior performance is less clearly peaked, but demonstrates a clear gain in performance up to around \(d_K = 15\) components.
Additional notes on working with the neural-mixtures library
Multi-threading and performance optimization
Every neural-mixtures
program can be given +RTS -N{x}
as the last argument, which activates multi-threading. It is not part of neural-mixtures
per se, but rather part of the Haskell runtime system (hence +RTS
, and why it needs to be the last part of the command). Multi-threading confers no benefit to the synthesize
and analyze
programs. Multi-threading does not speed up the train
program when fitting poisson
models, but can significantly speed up the fitting of com-based
models. The major use-case for multi-threading is the cross-validation
program, because cross-validation is an “embarrassingly parallel” algorithm. When run in multi-threaded mode, cross-validate
will create a thread for every fold and every mixture component, and so can easily populate a many-cored system. On my 16-core desktop, the cross-validate
command we ran earlier took 5 hours to complete, as opposed to taking more than a day as it would in single-threaded mode.
The argument can be given simply as +RTS -N
, which tells the program to run with as many threads as are available on the machine. This is often suboptimal, as it's usually best to just create one thread per CPU core. On my computer with an AMD Ryzen 9 5950x (16 cores and 32 threads), running the train command from earlier takes about 6 minutes when run with +RTS -N16, about 9 minutes with +RTS -N
, and about 21 minutes when single-threaded.
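For example, to re-run the train command from earlier with 16 threads, the call would look something like
stack exec -- train amir-awake-filtered cadetv1p438_tuning discrete com-based -k 20 -e 1000 +RTS -N16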
Finally, when using openblas
, you may want to adjust the OMP_NUM_THREADS
environment variable. This tells openblas
how many threads it’s allowed to use, and although it can improve performance when running single-threaded programs, in my experience it’s sometimes more efficient to e.g. set OMP_NUM_THREADS=1
and allow neural-mixtures
to manage multi-threading exclusively.
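For example, to pin openblas to a single thread for one multi-threaded run, you can set the variable inline in the shell, along the lines of
OMP_NUM_THREADS=1 stack exec -- cross-validate amir-awake-filtered cadetv1p438_tuning discrete com-based +RTS -N16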
Performance scaling and computational complexity
Theoretically, conditional mixture models scale quite well, and I have successfully applied the neural-mixtures
CLI to synthetic data generated from thousands of neurons. The largest computation in the optimization is an outer product, which is \(\mathcal O(d_K d_N)\), where \(d_K\) is the number of mixture components and \(d_N\) is the number of neurons. As such, computational complexity can grow more-or-less linearly with population size if \(d_K\) is held fixed. That being said, the sample complexity of the model also grows with \(d_N\), and thus larger populations necessitate more training and more data for a good fit.
Modifying the plots and code
The data for all plots in neural-mixtures
are first saved as csv
files, and then rendered with gnuplot
. The gnuplot
scripts are available in the plot-files
directory, and can be edited as desired. Also note that whenever neural-mixtures
runs gnuplot
, it prints the command it runs to the terminal, and with some copy-pasting this command can be re-run to regenerate plots without regenerating the csv
s. Alternatively, the csv
s can be fed into alternative plotting programs, or used as the basis for further analyses.
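For instance, here is a minimal sketch of loading one of those csvs into Python and replotting it with matplotlib; the filename is hypothetical, so check the dataset directories for the actual csv names (and whether a header row needs to be skipped):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical path -- substitute one of the csv files written by the analyze program.
csv_path = "experiments/synthetic/com-based/true/fisher-information.csv"
data = np.loadtxt(csv_path, delimiter=",")  # add skiprows=1 if the file has a header row

plt.plot(data[:, 0], data[:, 1:])  # first column as the x-axis, remaining columns as curves
plt.xlabel("stimulus")
plt.show()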
If you want to take things further, the four command line programs are defined in the executables
folder, and are quite high-level; the lower-level code is in the libraries
folder, and the whole project is built on my so-called Geometric Optimization Libraries (Goal), which are available here. Simple modifications of the executables should be possible without much knowledge of Haskell. If you are familiar with Haskell, then I encourage you to hack away, and I’m happy to support outside contributions to the codebase. Finally, if you aren’t familiar with Haskell, but wish you could access some of the neural-mixtures
libraries more directly, feel free to contact me and I’d be willing to put bindings together for e.g. Python.
Bibliography
Kanitscheider, Ingmar, Ruben Coen-Cagli, Adam Kohn, and Alexandre Pouget. 2015. “Measuring Fisher Information Accurately in Correlated Neural Populations.” PLoS Computational Biology 11 (6): e1004218.
Sokoloski, Sacha, Amir Aschner, and Ruben Coen-Cagli. 2021. “Modelling the Neural Code in Large Populations of Correlated Neurons.” Edited by Jonathan W Pillow, Joshua I Gold, and Kenneth D Harris. eLife 10 (October): e64615. https://doi.org/10.7554/eLife.64615.