## Modelling the neural code in large populations of correlated neurons

In this post I’m going to show how to use the neural-mixtures command-line interface (CLI) to generate the key results and corresponding figures from my paper Modelling the neural code in large populations of correlated neurons (Sokoloski, Aschner, and Coen-Cagli 2021). Along the way I will also explain how interested users can format new datasets for use with the neural-mixtures CLI.

### Introduction

The goal of my work is to provide a model of neural spike-count responses to stimuli that

• is analytically tractable,
• captures first, second, and even higher-order neural response statistics,
• captures response features used by the neural code,
• and efficiently scales to large populations of neurons.

In the following paragraphs I will demonstrate how I have achieved these aims, and how new users can apply my models to new data. To ease navigation, I’ve broken this post down into the sections below.

#### Formatting datasets for use with the neural-mixtures CLI

Before we begin the tutorial proper, I’ll explain how to format datasets for use with the neural-mixtures CLI. Once you have installed neural-mixtures and are trying out the CLI on the included datasets, you may substitute your own datasets into the commands to begin analyzing your data with neural-mixtures.

Data in neural-mixtures is always organized in a simple directory structure relative to the root of the neural-mixtures repository. Suppose we have a {dataset} from an {experiment}, where the {dataset} is a collection of trials, and where each trial corresponds to a single stimulus-response recording. To analyze the {dataset} with neural-mixtures, create the following plain text files relative to the root of the neural-mixtures repository:

• experiments/{experiment}/{dataset}/stimuli.csv where:
• The index of the line of the file corresponds to the index of the trial.
• Each line is a single stimulus value. Note that the stimulus-value is treated by neural-mixtures as a text string, and can be arbitrary.
• experiments/{experiment}/{dataset}/responses.csv where:
• The index of the line of the file corresponds to the index of the trial.
• Each line is a comma (,) separated list of natural numbers, where each number is the response of a single neuron.

Any programming language should have tools for outputting matrices of data as comma-separated text files. In this case, the data in stimuli.csv corresponds to a matrix with one column, and the data in responses.csv corresponds to a matrix with $$d_N$$ columns, where $$d_N$$ is the number of recorded neurons.
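
As a concrete sketch (not part of the neural-mixtures codebase), the following Python helper writes a dataset in this layout; the function name write_dataset is my own, but the file layout follows the description above:

```python
import os

def write_dataset(root, experiment, dataset, stimuli, responses):
    """Write stimuli.csv and responses.csv in the layout neural-mixtures expects.

    `stimuli` is a list of stimulus labels (one per trial); `responses` is a
    list of equal length, where each entry is a list of spike counts (one per
    neuron). Line i of each file corresponds to trial i.
    """
    assert len(stimuli) == len(responses), "one stimulus per trial"
    out_dir = os.path.join(root, "experiments", experiment, dataset)
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "stimuli.csv"), "w") as f:
        for s in stimuli:
            f.write(f"{s}\n")  # stimulus values are arbitrary text strings
    with open(os.path.join(out_dir, "responses.csv"), "w") as f:
        for counts in responses:
            f.write(",".join(str(int(c)) for c in counts) + "\n")
```

For example, `write_dataset(".", "myexp", "mydata", ["0", "45"], [[1, 0, 3], [2, 2, 0]])` produces a two-trial, three-neuron dataset under experiments/myexp/mydata.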

The train, analyze, and cross-validate programs can then be applied to these files by running e.g.

stack exec -- train {experiment} {dataset}

and setting the program arguments as desired.

### Installation

I developed neural-mixtures in the Haskell programming language, but installing and applying the neural-mixtures CLI to data does not require any knowledge of Haskell. Managing Haskell packages is automated by the “Haskell Tool Stack” (stack for short). stack can be installed on Linux and macOS by running

curl -sSL https://get.haskellstack.org/ | sh

from a terminal. Windows users may run the installer provided here.

#### Installation on Linux/macOS

stack reduces compiling most Haskell code to a single command. However, neural-mixtures also depends on non-Haskell code for numerical computation and plotting. These tools ensure that neural-mixtures is fast, but they cannot be installed automatically by stack. What we need are

• some form of BLAS and LAPACK for linear algebra,
• the GNU Scientific Libraries (GSL) for numerics,
• and gnuplot for plotting.

These are all very standard, and if your target computer is used to perform scientific computing, they are likely already installed.

I use Arch Linux, which provides the pacman package manager, and I use the OpenBLAS implementation of BLAS. As such, on my system I install the relevant packages with

pacman -S gsl lapack openblas gnuplot

For Debian-based systems this should be something like

apt-get install libgsl-dev liblapack-dev libopenblas-dev gnuplot

and for macOS,

brew install gsl lapack openblas gnuplot

should do the trick.

Now, once you have installed stack and the aforementioned numerics libraries, neural-mixtures may be installed by running

git clone https://gitlab.com/sacha-sokoloski/neural-mixtures
cd neural-mixtures
stack build

Installation will take some time, but should work if the numerics libraries were properly installed. Please let me know if any errors arise on your machine.

#### Installation on Windows

For Windows things are more complicated, and require using the MSYS2 tools, which provide a Unix-style environment for Windows. Helpfully, stack comes with an installation of MSYS2 that simplifies things. For reference, the following is based on my experience setting up neural-mixtures on a ThinkPad X1 Carbon (8th Gen) running Windows 10.

After running the stack installer, the first step is to locate its MSYS2 install. The directory where stack stores its programs can be located by executing

stack path --programs

from the Windows command prompt. This will trigger stack to install MSYS2, after which it will display the directory, which should be something like:

C:\Users\{username}\AppData\Local\Programs\stack\x86_64-windows

Within this directory there should be an install of MSYS2, with a name like msys2-20210604, where the number corresponds to the date of the particular MSYS2 build provided by the current version of stack. As such, you want to open a directory with the form

C:\Users\{username}\AppData\Local\Programs\stack\x86_64-windows\msys2-{builddate}

Within this directory is the msys2_shell command, which is the program we will use to run the neural-mixtures CLI. If you plan on making frequent use of neural-mixtures, I recommend adding a link to msys2_shell to your desktop or some other convenient place. Note that from this point on we no longer require the Windows command prompt.

Now that we have installed stack and a Unix-like environment, we need to install the aforementioned libraries, as well as some development tools. MSYS2 actually uses the pacman package manager that was developed for Arch Linux, and from here on the installation process will start to look like that of Linux. Run

pacman -S base-devel p7zip git mingw-w64-x86_64-openblas mingw-w64-x86_64-gsl mingw-w64-x86_64-gnuplot

to install basic development tools and the neural-mixtures dependencies.

Next, to run stack from within the MSYS2 environment, we run the installer for Unix-like systems

curl -sSL https://get.haskellstack.org/ | sh

Although it’s also possible to run stack from the Windows command prompt, and use it to install neural-mixtures, I designed the neural-mixtures CLI to be run in a Unix-like environment, and I’d recommend sticking with MSYS2 to avoid errors.

Anyway, we’re now ready to build and install neural-mixtures. In the MSYS2 prompt, run the commands

git clone https://gitlab.com/sacha-sokoloski/neural-mixtures
cd neural-mixtures
stack build --flag hmatrix:openblas --flag hmatrix-gsl:onlygsl

These extra flags help stack build find the requisite libraries in our Frankenstein development environment. Having gone through these steps, the rest of this tutorial is platform-independent. One last note: if you’re looking for the neural-mixtures repository from the standard Windows environment (so that you can find e.g. the plots generated by the CLI), you will find it under

msys2-{builddate}/home/{username}/neural-mixtures

relative to stack’s MSYS2 install.

### CLI introduction and creating a toy experiment

Installing neural-mixtures compiles four command-line programs: synthesize, train, analyze, and cross-validate. These can be run in the root of the neural-mixtures repository by using the stack exec -- command, followed by the program name and its arguments (the double dash -- separates the arguments for the given program from the arguments for stack itself). One argument accepted by all programs is --help, which will output a long (and hopefully helpful) description of the various arguments that the given program can accept.

#### Synthesizing an experiment

To introduce the CLI, we will focus on the synthesize program, which allows us to create ground-truth models and toy datasets for use with the rest of the CLI. We generate the help output for the synthesize command by running

stack exec -- synthesize --help

I won’t reproduce the entire output of the help command here, but here is the list of arguments that synthesize accepts:

Usage: synthesize [-e|--experiment-name ARG] [-n|--n-neurons ARG]
[-k|--k-components ARG] [-s|--n-oris ARG] [-S|--n-samples ARG]
[-N|--sensory-noise ARG] [-g|--gain-mu ARG] [-G|--gain-vr ARG]
[-p|--precision-mu ARG] [-P|--log-precision-vr ARG]
[-h|--min-shape ARG] [-H|--max-shape ARG]

In summary, we can specify a name for this synthetic experiment, properties of the generated model such as average tuning properties and number of mixture components, and properties of the synthesized data such as number of stimulus conditions and sample size.

In the neural-mixtures CLI, an experiment is a directory that lives in the experiments directory, at the root of the neural-mixtures repository. An experiment directory contains a set of dataset directories, and each dataset directory contains a stimuli.csv file and a responses.csv file that list presented stimuli and recorded neural responses, respectively. Data and plots generated by the neural-mixtures CLI are also stored in the dataset directories.

Let’s go ahead and execute the command

stack exec -- synthesize

which runs synthesize with all of its default arguments. The default experiment name for synthesize is synthetic, and so if this command executed successfully, there should be a new directory at experiments/synthetic. Within the synthetic directory there should be the following files

experiments/synthetic/poisson/stimuli.csv
experiments/synthetic/poisson/responses.csv
experiments/synthetic/poisson/true/parameters.dat
experiments/synthetic/com-based/stimuli.csv
experiments/synthetic/com-based/responses.csv
experiments/synthetic/com-based/true/parameters.dat

The synthesize program always synthesizes two datasets, which are stored in the poisson and com-based directories. These datasets are generated from conditional mixtures of Poisson distributions and Conway-Maxwell Poisson distributions, respectively (see Sokoloski, Aschner, and Coen-Cagli (2021)). The synthesize program also saves the ground-truth parameters of each randomized conditional mixture in the files poisson/true/parameters.dat and com-based/true/parameters.dat.
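
Before moving on, it can be worth sanity-checking a stimuli.csv/responses.csv pair against the format described earlier. The following Python helper is my own illustration, not part of the CLI:

```python
import os

def check_dataset(dataset_dir):
    """Sanity-check a stimuli.csv/responses.csv pair: both files must have one
    line per trial, and every response line the same number of neurons.

    Returns (n_trials, n_neurons, n_stimulus_conditions)."""
    with open(os.path.join(dataset_dir, "stimuli.csv")) as f:
        stimuli = f.read().splitlines()
    with open(os.path.join(dataset_dir, "responses.csv")) as f:
        responses = [line.split(",") for line in f.read().splitlines()]
    assert len(stimuli) == len(responses), "trial counts differ"
    n_neurons = len(responses[0])
    assert all(len(r) == n_neurons for r in responses), "ragged response rows"
    assert all(c.strip().isdigit() for r in responses for c in r), \
        "responses must be natural numbers"
    return len(stimuli), n_neurons, len(set(stimuli))
```

For example, after running synthesize you might call `check_dataset("experiments/synthetic/com-based")` to confirm the trial and neuron counts of the synthesized data.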

#### Analyzing synthetic data

The analyze program generates collections of plots given a dataset and a model, and we may use it to compare the ground-truth models to their synthesized data in our synthetic experiment. The first arguments to the analyze program specify the sequence of directories, under experiments, in which an appropriate parameters.dat file is located. For example, we may analyze the models we generated with the synthesize command by executing

stack exec -- analyze synthetic poisson true
stack exec -- analyze synthetic com-based true

analyze generates the same set of plots for both commands, so for the sake of brevity we’ll review some of the results of only the com-based analysis. All of these results can be found in the com-based/true directory.

In the paper, a conditional mixture (CM) is a model of the form $$p(\mathbf n, k \mid x)$$, where $$\mathbf n$$ is a vector of spike-counts, $$k$$ is the index of a mixture-component, and $$x$$ is the stimulus; we refer to the CoM-based conditional mixture, in particular, as a CB-CM. The analyze command renders the “tuning curves” of the given CM (i.e. the stimulus-dependent firing rate of each neuron) in the file tuning-curves.png, and renders the stimulus-dependent probabilities of the mixture components $$p(k \mid x)$$ in the file weight-dependence.png.

The “Fisher information” (FI) is a mathematical tool for understanding how much information a given neural population has about a stimulus. The linear Fisher information (LFI) is a linear approximation of the FI which can be estimated effectively from data, and is useful when the FI of a model cannot be evaluated in closed-form. Nevertheless, for CMs the FI can be computed in closed-form. In the fisher-information.png file, the analyze command visualizes both the FI and LFI of the model, as well as a bias-corrected empirical estimate of the LFI (Kanitscheider et al. 2015) based on the synthetic data generated by the synthesize command.
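
To make the LFI concrete, here is a minimal Python sketch of the naive plug-in estimator from responses recorded at two nearby stimuli; the name lfi_plugin and the finite-difference scheme are my own illustration. The CLI itself plots the bias-corrected estimate of Kanitscheider et al. (2015), which adds correction terms not shown here:

```python
import numpy as np

def lfi_plugin(resp_a, resp_b, ds):
    """Naive plug-in estimate of the linear Fisher information
    f'(s)^T Sigma^{-1} f'(s), given two response arrays of shape
    (trials, neurons) recorded at stimuli separated by ds."""
    resp_a = np.asarray(resp_a, dtype=float)
    resp_b = np.asarray(resp_b, dtype=float)
    # Finite-difference estimate of the tuning-curve derivative f'(s)
    df = (resp_b.mean(axis=0) - resp_a.mean(axis=0)) / ds
    # Noise covariance, pooled across the two stimulus conditions
    sigma = (np.cov(resp_a, rowvar=False) + np.cov(resp_b, rowvar=False)) / 2
    return float(df @ np.linalg.solve(sigma, df))
```

This plug-in estimate is biased at finite sample sizes, which is precisely what the bias-corrected estimate shown in fisher-information.png accounts for.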

The analyze command also generates the fano-factor-scatter.gif and noise-correlation-scatter.gif files. These files are animations, and render scatter plots of the Fano factors and noise correlations of the model, as compared to the empirical Fano factors and noise correlations, where each step of the animation corresponds to a different stimulus.

It is interesting to note that the $$r^2$$ of these scatter plots is quite modest, indicating that even with a sample size of 1600, the empirical estimates of the Fano factors and noise correlations are far from those of the ground-truth model.
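
For readers who want to compute these empirical statistics on their own data, here is a short Python sketch; it is an illustration rather than the CLI's implementation, and the helper name fano_and_noise_corr is my own:

```python
import numpy as np

def fano_and_noise_corr(responses, stimuli):
    """Empirical Fano factors and noise correlations per stimulus condition.

    `responses` is a (trials, neurons) array of spike counts, and `stimuli` a
    length-trials sequence of stimulus labels. Returns a dictionary mapping
    each stimulus to (fano_factors, correlation_matrix)."""
    responses = np.asarray(responses, dtype=float)
    stimuli = np.asarray(stimuli)
    out = {}
    for s in np.unique(stimuli):
        r = responses[stimuli == s]
        mu = r.mean(axis=0)
        var = r.var(axis=0, ddof=1)
        fano = var / mu  # variance-to-mean ratio, per neuron
        corr = np.corrcoef(r, rowvar=False)  # off-diagonals = noise correlations
        out[s] = (fano, corr)
    return out
```

Plotting these values against the corresponding model statistics, one stimulus at a time, reproduces the structure of the scatter-plot animations.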

### Modelling stimulus-response recordings with conditional mixtures

In the rest of the tutorial we’ll analyze response recordings from macaque primary visual cortex (V1), but feel free to substitute your own datasets into the neural-mixtures CLI to see what happens!

#### Training conditional mixtures on V1 response data

Now that we know how to analyze a CM with neural-mixtures, let’s fit one to data. In (Sokoloski, Aschner, and Coen-Cagli 2021) we considered two recordings from macaque V1 responding to oriented gratings — one where the monkey is under anaesthesia, and one where it is awake. The stimuli.csv and responses.csv files for these two experiments are available under experiments/amir-anaesthetized/129r001p173_preprocess and experiments/amir-awake-filtered/cadetv1p438_tuning, respectively.

Let’s fit a CB-CM to the cadetv1p438_tuning data, by running the command

stack exec -- train amir-awake-filtered cadetv1p438_tuning discrete com-based -k 20 -e 1000

To break this down, the first two arguments amir-awake-filtered cadetv1p438_tuning indicate the dataset we will be analyzing. The next argument discrete indicates that we’ll be using discrete tuning curves (which, unlike von Mises tuning curves, make no assumption about the continuity of the stimulus), and com-based indicates we’ll be using com-based, rather than poisson, tuning curves. The arguments -k 20 and -e 1000 tell train to train a model with 20 mixture components for 1000 epochs. All the parameters of train are described in its --help output.

The train program renders only a single plot: log-likelihood-ascent.png. In this particular demonstration we’re fitting to the complete dataset rather than holding out data, and the plot shows the log-likelihood of the model given the training data after each training epoch. Note that at 800 epochs there’s a sudden jump in performance. This is because when training a CB-CM, we first train an IP-CM for a certain number of iterations (in this case 800) before extending it to a CB-CM, which significantly reduces execution time.

To see a more complete analysis of the resulting model, we return to the analyze command. The results of our train command (i.e. parameters.dat) are saved in experiments/{experiment}/{dataset}/discrete/com-based, and so in this case we run

stack exec -- analyze amir-awake-filtered cadetv1p438_tuning discrete com-based -c

where the -c argument tells analyze that even though we’re analyzing a discrete model, the stimulus is continuous.

The outputs of the analyze command remain the same as for the synthetic experiment we considered previously, and so for the sake of brevity we’ll just look at the Fano factor and noise correlation scatter plots.

Based on these scatter plots the discrete CB-CM appears to do a good job of capturing the empirical statistics of the recorded V1 responses. To estimate whether the CB-CM captures the true statistics of the response distribution, we deploy the cross-validate command.

#### Cross-validating conditional mixtures

The last program in the neural-mixtures CLI is cross-validate, which essentially wraps the train program inside a cross-validation loop, and evaluates the desired model for a number of different mixture-component counts $$d_K$$. To cross-validate the performance of a CB-CM on our awake V1 data, we run

stack exec -- cross-validate amir-awake-filtered cadetv1p438_tuning discrete com-based

There are a number of parameters that we could set here (consult stack exec -- cross-validate --help), but the defaults are fine in this case. Note that where our train command from earlier took about 10 minutes on a 16-core machine, we are now running over 100 instances of that train command. As such, you may want to pass more conservative parameters to cross-validate to get a sense of how it runs, and in any case please consult the notes below on how to activate multi-threading in neural-mixtures. On a 16-core machine with multi-threading enabled, the above command took about 5 hours to complete.

cross-validate generates a variety of plots, and below we look at two of them.

For exact details on what these mean (especially the one on the right) please consult the paper, but in short they describe how well (Left) the model captures the statistics of how the recorded neurons respond to stimuli, and (Right) how well the model supports decoding the neural responses for the information they contain about the stimuli. As we see on the left, log-likelihood performance peaks around $$d_K = 25$$ mixture components, with most of the performance gain happening by about $$d_K = 15$$ components. On the right, we see that the log-posterior performance is less clearly peaked, but demonstrates a clear gain in performance up to around $$d_K = 15$$ components.

### Additional notes on working with the neural-mixtures library

Every neural-mixtures program can be given +RTS -N{x} as its last argument, which activates multi-threading with x threads. This flag is not part of neural-mixtures per se, but rather part of the Haskell runtime system (hence +RTS, and why it must come last in the command). Multi-threading confers no benefit to the synthesize and analyze programs, and does not speed up the train program when fitting poisson models, but it can significantly speed up the fitting of com-based models. The major use-case for multi-threading is the cross-validate program, because cross-validation is an “embarrassingly parallel” algorithm. When run in multi-threaded mode, cross-validate will create a thread for every fold and every mixture component, and so can easily saturate a many-core system. On my 16-core desktop, the cross-validate command we ran earlier took 5 hours to complete, as opposed to the more than a day it would take in single-threaded mode.

The argument can also be given simply as +RTS -N, which tells the program to run with as many threads as are available on the machine. This is often suboptimal, as it’s usually best to create just one thread per CPU core. On my computer with an AMD Ryzen 9 5950X (16 cores and 32 threads), running the train command from earlier takes about 6 minutes with +RTS -N16, about 9 minutes with +RTS -N, and about 21 minutes when single-threaded.

Finally, when using openblas, you may want to adjust the OMP_NUM_THREADS environment variable. This tells openblas how many threads it’s allowed to use, and although increasing it can improve the performance of single-threaded programs, in my experience it’s sometimes more efficient to e.g. set OMP_NUM_THREADS=1 and allow neural-mixtures to manage multi-threading exclusively.

#### Performance scaling and computational complexity

Theoretically, conditional mixture models scale quite well, and I have successfully applied the neural-mixtures CLI to synthetic data generated from thousands of neurons. The largest computation in the optimization is an outer-product computation that is $$\mathcal O(d_K d_N)$$, where $$d_K$$ is the number of mixture components and $$d_N$$ is the number of neurons. As such, computational complexity grows more-or-less linearly with population size if $$d_K$$ is held fixed. That being said, the sample complexity of the model also increases with population size, and thus larger $$d_N$$ necessitates more training, and more data, for a good fit.

#### Modifying the plots and code

The data for all plots in neural-mixtures are first saved as csv files, and then rendered with gnuplot. The gnuplot scripts are available in the plot-files directory, and can be edited as desired. Also note that whenever neural-mixtures runs gnuplot, it outputs the command it runs in the terminal, and with some copy-pasting this command can be re-run to regenerate plots without regenerating the csvs. Alternatively, the csvs can be fed into alternative plotting programs, or used as the basis for further analyses.

If you want to take things further, the four command line programs are defined in the executables folder, and are quite high-level; the lower-level code is in the libraries folder, and the whole project is built on my so-called Geometric Optimization Libraries (Goal), which are available here. Simple modifications of the executables should be possible without much knowledge of Haskell. If you are familiar with Haskell, then I encourage you to hack away, and I’m happy to support outside contributions to the codebase. Finally, if you aren’t familiar with Haskell, but wish you could access some of the neural-mixtures libraries more directly, feel free to contact me and I’d be willing to put bindings together for e.g. Python.

### Bibliography

Kanitscheider, Ingmar, Ruben Coen-Cagli, Adam Kohn, and Alexandre Pouget. 2015. “Measuring Fisher Information Accurately in Correlated Neural Populations.” PLoS Computational Biology 11 (6): e1004218.

Sokoloski, Sacha, Amir Aschner, and Ruben Coen-Cagli. 2021. “Modelling the Neural Code in Large Populations of Correlated Neurons.” Edited by Jonathan W Pillow, Joshua I Gold, and Kenneth D Harris. eLife 10 (October): e64615. https://doi.org/10.7554/eLife.64615.