Using CEBRA#

This page covers standard CEBRA usage. We recommend checking out the Demo Notebooks for CEBRA usage examples as well. Here we present a quick overview of how to use CEBRA on various datasets. Note that we provide two ways to interact with the code:

  • For regular usage, we recommend leveraging the high-level interface, adhering to scikit-learn formatting.

  • For more specific needs, advanced users might consider diving into the low-level interface, which adheres to PyTorch formatting.

Firstly, why use CEBRA?#

CEBRA is primarily designed for producing robust, consistent extractions of latent factors from time-series data. It supports three modes, and is a self-supervised representation learning algorithm that uses our modified contrastive learning approach designed for multi-modal time-series data. In short, it is a type of non-linear dimensionality reduction, like tSNE and UMAP. We show in our original paper that it outperforms tSNE and UMAP at producing closer-to-ground-truth latents and is more consistent.

That being said, CEBRA can be used on non-time-series data and it does not strictly require multi-modal data. In general, we recommend considering using CEBRA for measuring changes in consistency across conditions (brain areas, cells, animals), for hypothesis-guided decoding, and for topological exploration of the resulting embedding spaces. It can also be used for visualization and considering dynamics within the embedding space. For examples of how CEBRA can be used to map space, decode natural movies, and make hypotheses for neural coding of sensorimotor systems, see Schneider, Lee, Mathis. Nature 2023.

The CEBRA workflow#

CEBRA supports three modes: fully unsupervised (CEBRA-Time), supervised (via joint modeling of auxiliary variables; CEBRA-Behavior), and a hybrid variant (CEBRA-Hybrid). We recommend starting by running CEBRA-Time (unsupervised), looking at the loss value (goodness-of-fit), and visualizing the embedding. Then use labels via CEBRA-Behavior to test which labels give you this goodness-of-fit (tip: see our Figure 2: Hypothesis-driven and discovery-driven analysis with CEBRA hippocampus data example). Notably, if you use CEBRA-Behavior with labels that are not encoded in your data, the embedding will collapse (not converge). This is a feature, not a bug: it allows you to rule out observable behaviors (labels) that are not truly encoded in your data. To get a sense of this workflow, you can also look at Figure 2: Hypothesis-driven and discovery-driven analysis with CEBRA and Extended Data Figure 5: Hypothesis testing with CEBRA. πŸ‘‰ Here is a quick-start workflow, with many more details below:

  1. Use CEBRA-Time for unsupervised data exploration.

  2. Consider running a hyperparameter sweep on the inputs to the model, such as cebra.CEBRA.model_architecture, cebra.CEBRA.time_offsets, cebra.CEBRA.output_dimension, and set cebra.CEBRA.batch_size as high as your GPU allows. You want to see clear structure in the 3D plot (the first 3 latents are shown by default). A minimal sweep sketch is shown after this list.

  3. Use CEBRA-Behavior with many different labels and combinations, then look at the InfoNCE loss - the lower the loss value, the better the fit (see Extended Data Figure 5: Hypothesis testing with CEBRA) - and visualize the embeddings. The goal is to understand which labels contribute to the structure you see in CEBRA-Time, and to improve this structure. Again, you should consider a hyperparameter sweep, and avoid overfitting by performing a proper train/validation split (see Step 3 in our quick start guide below).

  4. Interpretability: now you can use these latents in downstream tasks, such as measuring consistency, decoding, and determining the dimensionality of your data with topological data analysis.
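As referenced in step 2, here is a minimal sketch of such a hyperparameter sweep. The random data, the parameter values, and the iteration count are placeholders to adapt to your own dataset and GPU memory:

import itertools
import numpy as np
import cebra

neural_data = np.random.normal(0, 1, (1000, 30))  # placeholder data

models, labels = [], []
for model_arch, time_offset in itertools.product(["offset5-model", "offset10-model"],
                                                 [5, 10]):
    model = cebra.CEBRA(model_architecture=model_arch,
                        time_offsets=time_offset,
                        batch_size=512,
                        output_dimension=8,
                        max_iterations=1000)  # increase for real runs
    model.fit(neural_data)  # CEBRA-Time: no labels
    models.append(model)
    labels.append(f"{model_arch}, time_offsets={time_offset}")

# Compare the training losses of all runs (see "Comparing models" below)
cebra.compare_models(models, labels)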

All the steps to do this are described below. Enjoy using CEBRA! πŸ”₯πŸ¦“

Step-by-step CEBRA#

For a quick start into applying CEBRA to your own datasets, we provide a scikit-learn compatible API, similar to methods such as tSNE, UMAP, etc. We assume you have CEBRA installed in the environment you are working in; if not, go to the Installation Guide. Next, launch your conda env (e.g., conda activate cebra).

Create a CEBRA workspace#

Assuming you have your data recorded, you want to start using CEBRA on it. For instance, you can create a new Jupyter notebook.

For the sake of this usage guide, we create some example data:

# Create a .npz file
import numpy as np

X = np.random.normal(0,1,(100,3))
X_new = np.random.normal(0,1,(100,4))
np.savez("neural_data", neural = X, new_neural = X_new)

# Create a .h5 file, containing a pd.DataFrame
import pandas as pd

X_continuous = np.random.normal(0,1,(100,3))
X_discrete = np.random.randint(0,10,(100, ))
df = pd.DataFrame(np.array(X_continuous), columns=["continuous1", "continuous2", "continuous3"])
df["discrete"] = X_discrete
df.to_hdf("auxiliary_behavior_data.h5", key="auxiliary_variables")

You can start by importing the CEBRA package, as well as the CEBRA model as a classical scikit-learn estimator.

import cebra
from cebra import CEBRA

Data loading#

Get the data ready#

We acknowledge that your data can come in all formats. That is why we developed a loading helper function to help you get your data ready to be used by CEBRA.

The function cebra.load_data() supports various file formats and converts the data of interest to a numpy.array(). It handles the following categories of data. Note that it will only read the data of interest and output the corresponding numpy.array(). It does not perform pre-processing, so your data should be ready to be used with CEBRA.

  • Your data is a 2D array. In that case, we handle Numpy, HDF5, PyTorch, csv, Excel, Joblib, Pickle and MAT-files. If your file only contains your data, then you can use the default cebra.load_data(). If your file contains more than one dataset, you will have to provide a key, which corresponds to the data of interest in the file.

  • Your data is a pandas.DataFrame. In that case, we handle HDF5 files only. Similarly, you can use the default cebra.load_data() if your file only contains a single dataset and you want the whole pandas.DataFrame as your dataset. Else, if your file contains more than one dataset, you will have to provide the corresponding key. Moreover, if your pandas.DataFrame uses a single index, you can specify the columns to fetch from the pandas.DataFrame for your data of interest.

In the following example, neural_data.npz contains multiple numpy.array() datasets and auxiliary_behavior_data.h5 contains a pandas.DataFrame.

import cebra

# Load the .npz
neural_data = cebra.load_data(file="neural_data.npz", key="neural")

# ... and similarly load the .h5 file, providing the columns to keep
continuous_label = cebra.load_data(file="auxiliary_behavior_data.h5", key="auxiliary_variables", columns=["continuous1", "continuous2", "continuous3"])
discrete_label = cebra.load_data(file="auxiliary_behavior_data.h5", key="auxiliary_variables", columns=["discrete"]).flatten()

You can then use neural_data, continuous_label or discrete_label directly as the input or index data of your CEBRA model. Note that we flattened discrete_label in order to get a 1D numpy.array() as required for discrete index inputs.

Note

cebra.load_data() only handles one set of data at a time, either the data or the labels, for one session only. To use multiple sessions and/or multiple labels, the function can be called for each dataset. For files containing multiple matrices, the corresponding key, referencing the dataset in the file, must be provided.
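For example, loading the data and labels for two sessions stored in separate files requires one call per array. The file names and keys below are hypothetical placeholders:

import cebra

# One call per array of interest (hypothetical files and keys)
neural_session1 = cebra.load_data(file="session1.npz", key="neural")
neural_session2 = cebra.load_data(file="session2.npz", key="neural")

label_session1 = cebra.load_data(file="session1.npz", key="position")
label_session2 = cebra.load_data(file="session2.npz", key="position")

# Lists of sessions can then be passed to multi-session training (see below)
datasets = [neural_session1, neural_session2]
labels = [label_session1, label_session2]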

Model definition#

CEBRA training is modular, and model fitting can serve different downstream applications and research questions. Here, we describe how you can adjust the parameters depending on your data type and the hypotheses you might have.

Model architecture model_architecture

We provide a set of pre-defined models. You can access (and search) a list of available pre-defined models by running:

import cebra.models
print(cebra.models.get_options('offset*', limit = 4))
['offset10-model', 'offset10-model-mse', 'offset5-model', 'offset1-model-mse']

Then, you can choose the one that fits best with your needs and provide it to the CEBRA model as the model_architecture parameter.

As an indication, the table below presents the model architectures we used to train CEBRA on the datasets presented in our paper (Schneider, Lee, Mathis. Nature 2023).

| Dataset            | Data type            | Brain area                | Model architecture            |
|--------------------|----------------------|---------------------------|-------------------------------|
| Artificial spiking | Synthetic            | -                         | 'offset1-model-mse'           |
| Rat hippocampus    | Electrophysiology    | CA1 hippocampus           | 'offset10-model'              |
| Macaque            | Electrophysiology    | Somatosensory cortex (S1) | 'offset10-model'              |
| Allen Mouse        | Calcium imaging (2P) | Visual cortex             | 'offset10-model'              |
| Allen Mouse        | Neuropixels          | Visual cortex             | 'offset40-model-4x-subsample' |

πŸš€ Optional: design your own model architectures

It is possible to construct a personalized model and use the @cebra.models.register decorator on it. For example:

from torch import nn
import cebra.models
import cebra.data
from cebra.models.model import _OffsetModel, ConvolutionalModelMixin

@cebra.models.register("my-model") # --> add that line to register the model!
class MyModel(_OffsetModel, ConvolutionalModelMixin):

    def __init__(self, num_neurons, num_units, num_output, normalize=True):
        super().__init__(
            nn.Conv1d(num_neurons, num_units, 2),
            nn.GELU(),
            nn.Conv1d(num_units, num_units, 40),
            nn.GELU(),
            nn.Conv1d(num_units, num_output, 5),
            num_input=num_neurons,
            num_output=num_output,
            normalize=normalize,
        )

    # ... and you can also redefine the forward method,
    # as you would for a typical pytorch model

    def get_offset(self):
        return cebra.data.Offset(22, 23)

# Access the model
print(cebra.models.get_options('my-model'))

Once your personalized model is defined, you can use it by setting model_architecture='my-model'. πŸ‘‰ See the Models and Criteria API for more details.

Criterion and distance criterion and distance

For standard usage we recommend the default values (i.e., InfoNCE and cosine respectively) which are specifically designed for our contrastive learning algorithms.

Conditional distribution conditional

πŸ‘‰ See the previous section on how to choose the auxiliary variables and a conditional distribution.
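As a minimal sketch (parameter values are placeholders), the conditional distribution can be set explicitly when initializing the model; time_delta is the standard choice when continuous auxiliary variables are passed to fit(), while time corresponds to purely time-contrastive learning:

import cebra

# "time_delta" samples positive pairs based on the auxiliary variables passed
# to fit(); "time" corresponds to purely time-contrastive learning.
cebra_behavior_model = cebra.CEBRA(
    model_architecture="offset10-model",
    conditional="time_delta",
    time_offsets=10,
    batch_size=512,
    max_iterations=10000,
)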

Note

If the auxiliary variable types do not match the chosen conditional, the model training will fall back to time-contrastive learning.

Temperature temperature

temperature has the largest effect on the visualization of the embedding (see Extended Data Figure 2: Hyperparameter changes on visualization and consistency.). Hence, it is important that it is fitted to your specific data. Lower temperatures (e.g., around 0.1) will result in a more dispersed embedding, while higher temperatures (larger than 1) will concentrate the embedding.

πŸš€ For advanced usage, you might need to find the optimal temperature. For that, we recommend performing a grid search, as sketched below.

πŸ‘‰ More examples on how to handle temperature can be found in Technical: Learning the temperature parameter.
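A minimal sketch of such a grid search over temperature (placeholder data and values; inspect the resulting embeddings and loss curves to choose a value):

import numpy as np
import cebra

neural_data = np.random.normal(0, 1, (1000, 30))  # placeholder data

models_by_temperature = {}
for temperature in (0.1, 0.5, 1.0, 2.0):
    model = cebra.CEBRA(model_architecture="offset10-model",
                        temperature_mode="constant",
                        temperature=temperature,
                        time_offsets=10,
                        batch_size=512,
                        max_iterations=1000)  # increase for real runs
    model.fit(neural_data)
    models_by_temperature[temperature] = model

# Visualize each embedding (see "Displaying the embedding" below) and keep the
# temperature that reveals the clearest structure.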

Time offsets \(\Delta\) time_offsets

This corresponds to the distance (in time) between positive pairs and informs the algorithm about the time-scale of interest.

The interpretation of this parameter depends on the chosen conditional distribution. A higher time offset typically will increase the difficulty of the learning task, and (within a range) improve the quality of the representation. For time-contrastive learning, we generally recommend that the time offset should be larger than the specified receptive field of the model.

Number of iterations max_iterations

We recommend using at least 10,000 iterations to train the model. For prototyping, it can be useful to start with a smaller number (a few thousand iterations). However, if you notice that the loss does not converge or the embedding looks uniformly distributed (cloud-like), we recommend increasing the number of iterations.

Note

You should always assess the convergence of your model at the end of training by observing the training loss (see Visualize the training loss).

Number of adaptation iterations max_adapt_iterations

One feature of CEBRA is that you can apply (adapt) your model to new data. If you are planning to adapt your trained model to a new set of data, we recommend using around 500 steps to re-tune the first layer of the model.

In the paper, we show that fine-tuning the input embedding (first layer) on the novel data while using a pretrained model can be done with 500 steps in only 3.5 s, and has better performance overall.

Batch size batch_size

CEBRA should be trained with the biggest batch size possible. Ideally, and depending on the size of your dataset, you should set batch_size to None (the default value), which will train the model by drawing samples from the full dataset at each iteration. As an indication, all the models used in the paper were trained with batch_size=512. You should avoid setting your batch size to a smaller value.

Warning

Using the full dataset (batch_size=None) is only implemented for single-session training with continuous auxiliary variables.

Here is an example of a CEBRA model initialization:

cebra_model = CEBRA(
    model_architecture = "offset10-model",
    batch_size = 1024,
    learning_rate = 0.001,
    max_iterations = 10,
    time_offsets = 10,
    output_dimension = 8,
    device = "cuda_if_available",
    verbose = False
)

print(cebra_model)
CEBRA(batch_size=1024, learning_rate=0.001, max_iterations=10,
      model_architecture='offset10-model', time_offsets=10)

Model training#

Single-session versus multi-session training#

We provide both single-session and multi-session training. The latter makes the resulting embeddings invariant to the auxiliary variables across all sessions.

Note

For flexibility reasons, the multi-session training fits one model for each session and thus sessions don’t necessarily have the same number of features (e.g., number of neurons).

Check out the following list to verify if the multi-session implementation is the right tool for your needs.

I have multiple sessions/animals that I want to consider as a pseudo-subject and use them jointly for training CEBRA. That is the case because of limited access to simultaneously recorded neurons or looking for animal-invariant features in the neural data.

I want to get more consistent embeddings from one session/animal to the other.

I want to be able to use CEBRA for a new session that is fully unseen during training.

Warning

Using multi-session training limits the influence of individual variations per session on the embedding. Make sure that this session/animal-specific information won’t be needed in your downstream analysis.

πŸ‘‰ Have a look at Technical: Training models across animals for more in-depth usage examples of the multi-session training.

Training#

Single-session training

CEBRA is trained using cebra.CEBRA.fit(), similarly to the examples below for single-session training, using cebra_model as defined above. You can pass the input data as well as the behavioral labels you selected.

timesteps = 5000
neurons = 50
out_dim = 8

neural_data = np.random.normal(0,1,(timesteps, neurons))
continuous_label = np.random.normal(0,1,(timesteps, 3))
discrete_label = np.random.randint(0,10,(timesteps,))

single_cebra_model = cebra.CEBRA(batch_size=512,
                                 output_dimension=out_dim,
                                 max_iterations=10,
                                 max_adapt_iterations=10)

Note that the discrete_label array needs to be one dimensional, and needs to be of type int.

We can now fit the model in different modes.

  • For CEBRA-Time (time-contrastive training) with the chosen time_offsets, run:

single_cebra_model.fit(neural_data)
  • For CEBRA-Behavior (supervised contrastive learning) using discrete labels, run:

single_cebra_model.fit(neural_data, discrete_label)
  • For CEBRA-Behavior (supervised contrastive learning) using continuous labels, run:

single_cebra_model.fit(neural_data, continuous_label)
  • For CEBRA-Behavior (supervised contrastive learning) using a mix of discrete and continuous labels, run:

single_cebra_model.fit(neural_data, continuous_label, discrete_label)

Multi-session training

For multi-session training, lists of datasets are provided instead of a single dataset, together with the corresponding auxiliary variables, if any.

Warning

For now, multi-session training can only handle a unique set of continuous labels or a unique discrete label. All other combinations will raise an error. For the continuous case we provide the following example:

timesteps1 = 5000
timesteps2 = 3000
neurons1 = 50
neurons2 = 30
out_dim = 8

neural_session1 = np.random.normal(0,1,(timesteps1, neurons1))
neural_session2 = np.random.normal(0,1,(timesteps2, neurons2))
continuous_label1 = np.random.uniform(0,1,(timesteps1, 3))
continuous_label2 = np.random.uniform(0,1,(timesteps2, 3))

multi_cebra_model = cebra.CEBRA(batch_size=512,
                                output_dimension=out_dim,
                                max_iterations=10,
                                max_adapt_iterations=10)

Once you defined your CEBRA model, you can run:

multi_cebra_model.fit([neural_session1, neural_session2], [continuous_label1, continuous_label2])

Similarly, for the discrete case a discrete label can be provided and the CEBRA model will use the discrete multisession mode:

timesteps1 = 5000
timesteps2 = 3000
neurons1 = 50
neurons2 = 30
out_dim = 8

neural_session1 = np.random.normal(0,1,(timesteps1, neurons1))
neural_session2 = np.random.normal(0,1,(timesteps2, neurons2))
discrete_label1 = np.random.randint(0,10,(timesteps1, ))
discrete_label2 = np.random.randint(0,10,(timesteps2, ))

multi_cebra_model_discrete = cebra.CEBRA(batch_size=512,
                                output_dimension=out_dim,
                                max_iterations=10,
                                max_adapt_iterations=10)


multi_cebra_model_discrete.fit([neural_session1, neural_session2], [discrete_label1, discrete_label2])

Partial training

Consistent with the scikit-learn API, cebra.CEBRA.partial_fit() can be used to perform incremental learning of your model on multiple data batches. That means you can fit your model on a set of data a first time, and each subsequent call to cebra.CEBRA.partial_fit() will continue training from the resulting parameters, either on a new batch of data with the same number of features or on the same dataset. It can be used for both single-session and multi-session training, similarly to cebra.CEBRA.fit().

cebra_model = cebra.CEBRA(max_iterations=10)

# The model is fitted a first time ...
cebra_model.partial_fit(neural_data)

# ... later on the model can be fitted again
cebra_model.partial_fit(neural_data)

Tip

Partial learning is useful if your dataset is too big to fit in memory. You can separate it into multiple batches and call cebra.CEBRA.partial_fit() for each data batch.
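For example, a minimal sketch of such chunked training (here the chunks come from an in-memory placeholder array; in practice each chunk would be loaded from disk):

import numpy as np
import cebra

cebra_model = cebra.CEBRA(max_iterations=10)

# Placeholder data, split into chunks along the time axis
neural_data = np.random.normal(0, 1, (5000, 50))

for chunk in np.array_split(neural_data, 5):
    # Each call continues training from the current model parameters
    cebra_model.partial_fit(chunk)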

Saving/Loading a model#

You can save a (trained or untrained) CEBRA model to disk using cebra.CEBRA.save(), and load it using cebra.CEBRA.load(). If the model is trained, you will be able to load it again to transform or adapt your dataset in a different session.

The model will be saved as a .pt file.

import tempfile
from pathlib import Path

# create temporary file to save the model
tmp_file = Path(tempfile.gettempdir(), 'cebra.pt')

cebra_model = cebra.CEBRA(max_iterations=10)
cebra_model.fit(neural_data)

# Save the model
cebra_model.save(tmp_file)

# New session: load and use the model
loaded_cebra_model = cebra.CEBRA.load(tmp_file)
embedding = loaded_cebra_model.transform(neural_data)

Model evaluation#

Computing the embedding#

Once the model is trained, embeddings can be computed using cebra.CEBRA.transform().

Single-session training

For a model trained on a single session, you just have to provide the input data on which to compute the embedding.

embedding = single_cebra_model.transform(neural_data)
assert(embedding.shape == (timesteps, out_dim))

Multi-session training

For a model trained on multiple sessions, you will need to provide the session_id (between 0 and num_sessions-1) to select the model corresponding to the correct number of features.

embedding = multi_cebra_model.transform(neural_session1, session_id=0)
assert(embedding.shape == (timesteps1, out_dim))

In both cases, the embedding will be of size time x output_dimension.

Results visualization#

Here, we want to emphasize that while CEBRA provides a low-dimensional representation of your data, i.e., the embedding, there are also plenty of other elements that should be checked to assess the results. We provide a post-hoc package to easily visualize the crucial information.

The visualization functions all have the same structure: they are wrappers around matplotlib.pyplot.plot() and matplotlib.pyplot.scatter(). Consequently, you can pass parameters to be used by those matplotlib.pyplot functions.

Note that all examples were computed on the rat hippocampus dataset (Grosmark & BuzsΓ‘ki, 2016) with default parameters, max_iterations=15000 , batch_size=512 , model_architecture=offset10-model , output_dimension=3 except if stated otherwise.

Displaying the embedding#

To get a 3D visualization of an embedding embedding, obtained using cebra.CEBRA.transform() (see above), you can use plot_embedding().

It takes a 2D matrix representing an embedding and returns a 3D scatter plot, using the first 3 latents by default.

Note

If your embedding only has 2 dimensions, then the plot will automatically switch to a 2D mode. You can then use the function similarly.

cebra.plot_embedding(embedding)
Default embedding

Note

Be aware that the latents are not ordered by importance. Consequently, if your embedding has more than 3 dimensions, a 3D visualization of the first 3 latents might not be a good representation of the most relevant features. Note that you can set the parameter idx_order to select which latents to display (see API).

πŸš€ Go further: personalize your embedding visualization

The function is a wrapper around matplotlib.pyplot.scatter() and consequently accepts all the parameters of that function (e.g., vmin, vmax, alpha, markersize, title, etc.) as parameters.

Regarding the color of the embedding, the default value is set to grey but can be customized using the parameter embedding_labels. There are 3 ways of doing it.

  • By setting embedding_labels to a valid RGB(A) color (i.e., recognized by matplotlib, see Specifying colors for more details). The following list of named colors is a good starting set of options.

Matplotlib list of named colors
cebra.plot_embedding(embedding, embedding_labels="darkorchid")
darkorchid embedding
  • By setting embedding_labels to time. It will use the color map cmap to display the embedding based on temporality. By default, cmap=cool. You can customize it by setting it to a valid matplotlib.colors.Colormap (see Choosing Colormaps in Matplotlib for more information). You can also use our CEBRA-custom colormap by setting cmap="cebra".


CEBRA-custom colormap. You can use it by calling cmap="cebra".#

In the following example, you can also see how to change the size (markersize) or the transparency (alpha) of the markers.

cebra.plot_embedding(embedding, embedding_labels="time", cmap="magma", markersize=5, alpha=0.5)
Time embedding
  • By setting embedding_labels as a vector of same size as the embedding to be mapped to colors, using cmap (see previous point for customization). The vector can consist of a discrete label or one of the auxiliary variables for example.

cebra.plot_embedding(embedding, embedding_labels=continuous_label[:, 0])
Position embedding

Note

embedding_labels must be uni-dimensional. Be sure to provide only one dimension of your auxiliary variables if you are using multi-dimensional continuous data for instance (e.g., only the x-coordinate of the position).

You can specify the latents to display by setting idx_order=(latent_num_1, latent_num_2, latent_num_3), with latent_num_* the latent indices of your choice. In the following example, we trained a model with output_dimension=10 and we show embeddings displaying latents (1, 2, 3) on the left and (4, 5, 6) on the right. The code snippet also offers an example of how to combine multiple graphs and how to set a customized title (title). Note the parameter projection="3d" when adding a subplot to the figure.

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(10,5))
ax1 = fig.add_subplot(121, projection="3d")
ax2 = fig.add_subplot(122, projection="3d")

ax1 = cebra.plot_embedding(embedding, embedding_labels=continuous_label[:,0], idx_order=(1,2,3), title="Latents: (1,2,3)", ax=ax1)
ax2 = cebra.plot_embedding(embedding, embedding_labels=continuous_label[:,0], idx_order=(4,5,6), title="Latents: (4,5,6)", ax=ax2)
Reordered embedding

If your embedding only has 2 dimensions, or if you only want to display 2 of its dimensions, you can use the same function: the plot will automatically switch to 2D.

The plot will be 2D if:

  • Your embedding only has 2 dimensions and you don't specify idx_order (the default will then be idx_order=(0,1)).

  • Your embedding has more than 2 dimensions but you specify idx_order with only 2 dimensions.

cebra.plot_embedding(embedding, idx_order=(0,1), title="2D Embedding")
2D embedding

πŸš€ Look at the plot_embedding() API for more details on customization.

Displaying the training loss#

Observing the training loss is of great importance. It allows you, for instance, to assess whether your model converged, or to compare model performance and fine-tune the parameters.

To visualize the loss evolution through training, you can use plot_loss().

It takes a CEBRA model and returns a 2D plot of the loss against the number of iterations. It can be used with default values as simply as this:

cebra.plot_loss(cebra_model)
Default loss

πŸš€ The function is a wrapper around matplotlib.pyplot.plot() and consequently accepts all the parameters of that function (e.g., alpha, linewidth, title, color, etc.) as parameters.

Displaying the temperature#

temperature has the largest effect on the visualization of the embedding. Hence, it might be interesting to check its evolution when temperature_mode=auto. We recommend only using auto if you have first explored the constant setting. If you use the auto mode, always check the evolution of the temperature alongside the loss curve.

To that end, you can use the function plot_temperature().

It takes a CEBRA model and returns a 2D plot of the value of temperature against the number of iterations. It can be used with default values as simply as this:

cebra.plot_temperature(cebra_model)
Default temperature

πŸš€ The function is a wrapper around matplotlib.pyplot.plot() and consequently accepts all the parameters of that function (e.g., alpha, linewidth, title, color, etc.) as parameters.

Comparing models#

In order to select the most performant model, you might need to plot the training loss for a set of models on the same figure.

First, we create a list of fitted models to compare. Here we suppose we have a dataset with neural data available, as well as the position and the direction of the animal. We will show differences in performance when training with any combination of these variables.

cebra_posdir_model = CEBRA(model_architecture='offset10-model',
                batch_size=512,
                output_dimension=32,
                max_iterations=5,
                time_offsets=10)
cebra_posdir_model.fit(neural_data, continuous_label, discrete_label)

cebra_pos_model = CEBRA(model_architecture='offset10-model',
            batch_size=512,
            output_dimension=32,
            max_iterations=5,
            time_offsets=10)
cebra_pos_model.fit(neural_data, continuous_label)

cebra_dir_model = CEBRA(model_architecture='offset10-model',
            batch_size=512,
            output_dimension=32,
            max_iterations=5,
            time_offsets=10)
cebra_dir_model.fit(neural_data, discrete_label)

Then, you can compare their losses. To do that you can use compare_models(). It takes a list of CEBRA models and returns a 2D plot displaying their training losses. It can be used with default values as simply as this:

import cebra

# Labels to be used for the legend of the plot (optional)
labels = ["position+direction", "position", "direction"]

cebra.compare_models([cebra_posdir_model, cebra_pos_model, cebra_dir_model], labels)
Default comparison

πŸš€ The function is a wrapper around matplotlib.pyplot.plot() and consequently accepts all the parameters of that function (e.g., alpha, linewidth, title, color, etc.). Note, however, that if you want to differentiate the traces with a set of colors, you need to provide a colormap to the cmap parameter. If you want a unique color for all traces, you can provide a valid color to the color parameter, which will override the cmap parameter. By default, color=None and cmap="cebra", our very special CEBRA-custom colormap.

What else to do with your CEBRA model#

As mentioned at the start of the guide, CEBRA is much more than a visualization tool. Here we present a (non-exhaustive) list of post-hoc analyses and investigations that we support with CEBRA. Happy hacking! πŸ‘©β€πŸ’»

Consistency across features#

One of the major strengths of CEBRA is measuring consistency across embeddings. We demonstrate in Schneider, Lee, Mathis 2023, that consistent latents can be derived across animals (i.e., across CA1 recordings in rats), and even across recording modalities (i.e., from calcium imaging to electrophysiology recordings).

Thus, we provide the consistency_score() metric to compute consistency across model runs or across models computed on different datasets (i.e., subjects, sessions).

To use it, you have to set the between parameter to either datasets or runs. The main difference between the two modes is that for between-datasets comparisons you provide labels to align the embeddings on, whereas the between-runs comparison assumes that the embeddings are already aligned. The simplest example is a model run multiple times on the same dataset, but it can also apply to datasets that were recorded at the same time, e.g., neural activity from different brain regions recorded during the same session.

Note

As consistency between CEBRA runs on the same dataset is demonstrated in Schneider, Lee, Mathis 2023 (consistent up to linear transformations), assessing consistency between different runs on the same dataset is a good way to reassure yourself that you set up your CEBRA model properly.

We first create the embeddings to compare: we use two different datasets, fit a CEBRA model three times on the first one (for the between-runs comparison), and fit it once on the second one (for the between-datasets comparison).

n_runs = 3
dataset_ids = ["session1", "session2"]

cebra_model = CEBRA(model_architecture='offset10-model',
                batch_size=512,
                output_dimension=32,
                max_iterations=5,
                time_offsets=10)

embeddings_runs = []
embeddings_datasets, ids, labels = [], [], []
for i in range(n_runs):
    embeddings_runs.append(cebra_model.fit_transform(neural_session1, continuous_label1))

labels.append(continuous_label1[:, 0])
embeddings_datasets.append(embeddings_runs[-1])

embeddings_datasets.append(cebra_model.fit_transform(neural_session2, continuous_label2))
labels.append(continuous_label2[:, 0])

n_datasets = len(dataset_ids)

To get the consistency_score() on the set of embeddings that we just generated:

# Between-runs
scores_runs, pairs_runs, ids_runs = cebra.sklearn.metrics.consistency_score(embeddings=embeddings_runs,
                                                                            between="runs")
assert scores_runs.shape == (n_runs**2 - n_runs, )
assert pairs_runs.shape == (n_runs**2 - n_runs, 2)
assert ids_runs.shape == (n_runs, )

# Between-datasets, by aligning on the labels
(scores_datasets,
    pairs_datasets,
    ids_datasets) = cebra.sklearn.metrics.consistency_score(embeddings=embeddings_datasets,
                                                                labels=labels,
                                                                dataset_ids=dataset_ids,
                                                                between="datasets")
assert scores_datasets.shape == (n_datasets**2 - n_datasets, )
assert pairs_datasets.shape == (n_datasets**2 - n_datasets, 2)
assert ids_datasets.shape == (n_datasets, )

You can then display the resulting scores using plot_consistency().

fig = plt.figure(figsize=(10,4))

ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

ax1 = cebra.plot_consistency(scores_runs, pairs_runs, ids_runs, vmin=0, vmax=100, ax=ax1, title="Between-runs consistencies")
ax2 = cebra.plot_consistency(scores_datasets, pairs_datasets, ids_datasets, vmin=0, vmax=100, ax=ax2, title="Between-subjects consistencies")
Consistency scores

πŸš€ This function is a wrapper around matplotlib.pyplot.imshow() and, similarly to the other plot functions we provide, it accepts all the parameters of that function (e.g., cmap, vmax, vmin, etc.) as parameters. Check the full API for more details.

Embeddings comparison via the InfoNCE loss#

Usage case πŸ‘©β€πŸ”¬

You can also assess how well a new dataset is explained by previously trained models. This can be useful when you have several groups of data and you want to see how a new session maps onto the prior models. For that, you compute infonce_loss() of the new samples under the other models.

How to use it

The performance of a given model on a dataset can be evaluated using the infonce_loss() function. That metric corresponds to the loss over the data, obtained using the criterion on which the model was trained (by default, InfoNCE). Hence, the smaller the metric, the better the model performs on the samples, i.e., the better the fit to the positive samples.

Note

As an indication, you can consider that a good trained CEBRA model should get a value for the InfoNCE loss smaller than ~6.1. If that is not the case, you might want to refer to the dedicated section Improve your model.

Here are examples on how you can use infonce_loss() on your data for both single-session and multi-session trained models.

# single-session
single_score = cebra.sklearn.metrics.infonce_loss(single_cebra_model,
                                                  neural_data,
                                                  continuous_label,
                                                  discrete_label,
                                                  num_batches=5)

# multi-session
multi_score = cebra.sklearn.metrics.infonce_loss(multi_cebra_model,
                                                 neural_session1,
                                                 continuous_label1,
                                                 session_id=0,
                                                 num_batches=5)

Adapt the model to new data#

In some cases, it can be useful to adapt your CEBRA model to a novel dataset with a different number of features. For that, you can set adapt=True as a parameter of cebra.CEBRA.fit(). It will reset the first layer of the model so that the input dimension corresponds to the new feature dimension, and retrain it for cebra.CEBRA.max_adapt_iterations iterations. You can set that parameter cebra.CEBRA.max_adapt_iterations when initializing your cebra.CEBRA model.

Note

Adapting your CEBRA model to novel data is only implemented for single session training. Make sure that your model was trained on a single dataset.

# Fit your model once ...
single_cebra_model.fit(neural_session1)

# ... do something with it (embedding, visualization, saving) ...

# ... and adapt the model
single_cebra_model.fit(neural_session2, adapt=True)

Note

We recommend that you save your model, using cebra.CEBRA.save(), before adapting it to a different dataset. The adapted model will replace the previous model in cebra_model.state_dict_ so saving it beforehand allows you to keep the trained parameters for later. You can then load the model again, using cebra.CEBRA.load() whenever you need it.
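A minimal sketch of that workflow, using the single_cebra_model and neural_session2 defined above (the file name is a placeholder):

import tempfile
from pathlib import Path

# Save the trained model before adapting it
pretrained_file = Path(tempfile.gettempdir(), 'cebra_pretrained.pt')
single_cebra_model.save(pretrained_file)

# Adapt the first layer to a dataset with a different number of features
single_cebra_model.fit(neural_session2, adapt=True)

# Later on, the original trained parameters can be recovered
pretrained_model = cebra.CEBRA.load(pretrained_file)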

Decoding#

The CEBRA latent embedding can be used for decoding analysis, i.e., to investigate whether a specific task variable can be decoded from the latent embeddings. Decoding from the embedding can easily be performed by means of the decoders we implemented as part of CEBRA, which follow the scikit-learn API. We provide two decoders: KNNDecoder and L1LinearRegressor. Here is a simple usage of the KNNDecoder after using CEBRA-Time.

from sklearn.model_selection import train_test_split

# 1. Train a CEBRA-Time model on the whole dataset
cebra_model = cebra.CEBRA(max_iterations=10)
cebra_model.fit(neural_data)
embedding = cebra_model.transform(neural_data)

# 2. Split the embedding and label to decode into train/validation sets
(
     train_embedding,
     valid_embedding,
     train_discrete_label,
     valid_discrete_label,
) = train_test_split(embedding,
                     discrete_label,
                     test_size=0.3)

# 3. Train the decoder on the training set
decoder = cebra.KNNDecoder()
decoder.fit(train_embedding, train_discrete_label)

# 4. Get the score on the validation set
score = decoder.score(valid_embedding, valid_discrete_label)

# 5. Get the discrete labels predictions
prediction = decoder.predict(valid_embedding)

prediction contains the predictions of the decoder on the discrete labels.

Warning

Be careful to avoid double dipping when using the decoder. The previous example uses time-contrastive learning. If you are using CEBRA-Behavior or CEBRA-Hybrid and consequently use labels, you will have to split your original data from the start, as you don't want to decode labels from an embedding that was itself trained on those labels.

πŸ‘‰ Decoder example with CEBRA-Behavior
from sklearn.model_selection import train_test_split

# 1. Split your neural data and auxiliary variable
(
    train_data,
    valid_data,
    train_discrete_label,
    valid_discrete_label,
) = train_test_split(neural_data,
                     discrete_label,
                     test_size=0.2)

# 2. Train a CEBRA-Behavior model on training data only
cebra_model = cebra.CEBRA(max_iterations=10, batch_size=512)
cebra_model.fit(train_data, train_discrete_label)

# 3. Get embedding for training and validation data
train_embedding = cebra_model.transform(train_data)
valid_embedding = cebra_model.transform(valid_data)

# 4. Train the decoder on training embedding and labels
decoder = cebra.KNNDecoder()
decoder.fit(train_embedding, train_discrete_label)

# 5. Compute the score on validation embedding and labels
score = decoder.score(valid_embedding, valid_discrete_label)

Improve model performance#

🧐 Below is a (non-exhaustive) list of actions you can try if your embedding looks different from what you were expecting.

  1. Assess that your model converged. For that, observe whether the training loss stabilizes toward the end of training or still seems to be decreasing. Refer to Visualize the training loss for more details on how to display the training loss.

  2. Increase the number of iterations. It typically should be at least 10,000. On small datasets, it can make sense to stop training earlier to avoid overfitting effects.

  3. Make sure the batch size is big enough. It should be at least 512.

  4. Fine-tune the model’s hyperparameters, namely learning_rate, output_dimension, num_hidden_units and eventually temperature (by setting temperature_mode back to constant). Refer to Grid search for more details on performing hyperparameters tuning.

  5. To note, you should still be mindful of performing train/validation splits and shuffle controls to avoid overfitting; a sketch of a label-shuffle control is shown after this list.
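Here is a sketch of such a shuffle control (placeholder data and values; not a prescribed CEBRA workflow): fit the same model on the true labels and on time-shuffled labels, and check that the shuffled control reaches a clearly higher (worse) training loss.

import numpy as np
import cebra

# Placeholder data and labels
neural_data = np.random.normal(0, 1, (1000, 30))
continuous_label = np.random.normal(0, 1, (1000, 3))

def make_model():
    return cebra.CEBRA(model_architecture="offset10-model",
                       batch_size=512,
                       time_offsets=10,
                       max_iterations=1000)  # increase for real runs

# Model trained on the true labels
model_true = make_model()
model_true.fit(neural_data, continuous_label)

# Control: the same model trained on time-shuffled labels
shuffled_label = continuous_label[np.random.permutation(len(continuous_label))]
model_shuffled = make_model()
model_shuffled.fit(neural_data, shuffled_label)

# The shuffled control should show a higher training loss
cebra.compare_models([model_true, model_shuffled], ["true labels", "shuffled labels"])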

Quick Start: Scikit-learn API example#

Putting the previous snippets together, we obtain the following pipeline.

import cebra
from numpy.random import uniform, randint
from sklearn.model_selection import train_test_split
import os
import tempfile
from pathlib import Path

# 1. Define a CEBRA model
cebra_model = cebra.CEBRA(
   model_architecture = "offset10-model",
   batch_size = 512,
   learning_rate = 1e-4,
   temperature_mode='constant',
   temperature = 0.1,
   max_iterations = 10, # TODO(user): to change to ~500-10000 depending on dataset size
   #max_adapt_iterations = 10, # TODO(user): use and to change to ~100-500 if adapting
   time_offsets = 10,
   output_dimension = 8,
   verbose = False
)

# 2. Load example data
neural_data = cebra.load_data(file="neural_data.npz", key="neural")
new_neural_data = cebra.load_data(file="neural_data.npz", key="new_neural")
continuous_label = cebra.load_data(file="auxiliary_behavior_data.h5", key="auxiliary_variables", columns=["continuous1", "continuous2", "continuous3"])
discrete_label = cebra.load_data(file="auxiliary_behavior_data.h5", key="auxiliary_variables", columns=["discrete"]).flatten()


assert neural_data.shape == (100, 3)
assert new_neural_data.shape == (100, 4)
assert discrete_label.shape == (100, )
assert continuous_label.shape == (100, 3)

# 3. Split data and labels into train/validation
split_idx = int(0.8 * len(neural_data))
# suggestion: hold out 5%-20% depending on your dataset size; note that this
# splits the data into an early and late part, which might not be ideal for
# your data/experiment! As a more involved alternative, consider e.g. a nested
# time-series split.

train_data = neural_data[:split_idx]
valid_data = neural_data[split_idx:]

train_continuous_label = continuous_label[:split_idx]
valid_continuous_label = continuous_label[split_idx:]

train_discrete_label = discrete_label[:split_idx]
valid_discrete_label = discrete_label[split_idx:]

# 4. Fit the model
# time contrastive learning
cebra_model.fit(train_data)
# discrete behavior contrastive learning
cebra_model.fit(train_data, train_discrete_label)
# continuous behavior contrastive learning
cebra_model.fit(train_data, train_continuous_label)
# mixed behavior contrastive learning
cebra_model.fit(train_data, train_discrete_label, train_continuous_label)


# 5. Save the model
tmp_file = Path(tempfile.gettempdir(), 'cebra.pt')
cebra_model.save(tmp_file)

# 6. Load the model and compute an embedding
cebra_model = cebra.CEBRA.load(tmp_file)
train_embedding = cebra_model.transform(train_data)
valid_embedding = cebra_model.transform(valid_data)

assert train_embedding.shape == (80, 8) # TODO(user): change to split ratio & output dim
assert valid_embedding.shape == (20, 8) # TODO(user): change to split ratio & output dim

# 7. Evaluate the model performance (you can also check the train_data)
goodness_of_fit = cebra.sklearn.metrics.goodness_of_fit_score(cebra_model,
                                                     valid_data,
                                                     valid_discrete_label,
                                                     valid_continuous_label)

# 8. Adapt the model to a new session
cebra_model.fit(new_neural_data, adapt = True)

# 9. Decode discrete labels behavior from the embedding
decoder = cebra.KNNDecoder()
decoder.fit(train_embedding, train_discrete_label)
prediction = decoder.predict(valid_embedding)
assert prediction.shape == (20,)

πŸ‘‰ For further guidance on different/customized applications of CEBRA on your own data, refer to the examples/ folder or to the full documentation folder docs/.

Quick Start: Torch API example#

πŸš€ You have special custom data analysis needs or want more features? We invite you to use the torch-API interface.

Refer to the Demos tab for a demo notebook using the torch-API: https://cebra.ai/docs/demo_notebooks/Demo_Allen.html.

Single- and multi-session training can be launched using the following bash command.

$ PYTHONPATH=. python examples/train.py [customized arguments]

Below is the documentation on the available arguments.

$ PYTHONPATH=. python examples/train.py --help
usage: train.py [-h] [--data <dataclasses._MISSING_TYPE object at 0x7f2eeb13f070>] [--variant single-session]
                [--logdir /logs/single-rat-hippocampus-behavior/] [--loss-distance cosine] [--temperature 1]
                [--time-offset 10] [--conditional time_delta] [--num-steps 1000] [--learning-rate 0.0003]
                [--model offset10-model] [--batch-size 512] [--num-hidden-units 32] [--num-output 8] [--device cpu]
                [--tqdm False] [--save-frequency SAVE_FREQUENCY] [--valid-frequency 100] [--train-ratio 0.8]
                [--valid-ratio 0.1] [--share-model]

CEBRA Demo

options:
-h, --help            show this help message and exit
--data <dataclasses._MISSING_TYPE object at 0x7f2eeb13f070>
                        The dataset to run CEBRA on. Standard datasets are available in cebra.datasets. Your own datasets can
                        be created by subclassing cebra.data.Dataset and registering the dataset using the
                        ``@cebra.datasets.register`` decorator.
--variant single-session
                        The CEBRA variant to run.
--logdir /logs/single-rat-hippocampus-behavior/
                        Model log directory. This should be either a new empty directory, or a pre-existing directory
                        containing a trained CEBRA model.
--loss-distance cosine
                        Distance type to use in calculating loss
--temperature 1       Temperature for InfoNCE loss
--time-offset 10      Distance (in time) between positive pairs. The interpretation of this parameter depends on the chosen
                        conditional distribution, but generally a higher time offset increases the difficulty of the learning
                        task, and (in a certain range) improves the quality of the representation. The time offset would
                        typically be larger than the specified receptive field of the model.
--conditional time_delta
                        Type of conditional distribution. Valid standard methods are "time_delta" and "time", and more
                        methods can be added to the ``cebra.data`` registry.
--num-steps 1000      Number of total training steps. Number of total training steps. Note that training duration of CEBRA
                        is independent of the dataset size. The total training examples seen will amount to ``num-steps x
                        batch-size``, irrespective of dataset size.
--learning-rate 0.0003
                        Learning rate for Adam optimizer.
--model offset10-model
                        Model architecture. Available options are 'offset10-model', 'offset5-model' and 'offset1-model'.
--batch-size 512      Total batch size for each training step.
--num-hidden-units 32
                        Number of hidden units.
--num-output 8        Dimension of output embedding
--device cpu          Device for training. Options: cpu/cuda
--tqdm False          Activate tqdm for logging during the training
--save-frequency SAVE_FREQUENCY
                        Interval of saving intermediate model
--valid-frequency 100
                        Interval of validation in training
--train-ratio 0.8     Ratio of train dataset. The remaining will be used for valid and test split.
--valid-ratio 0.1     Ratio of validation set after the train data split. The remaining will be test split
--share-model

Model training using the Torch API#

The scikit-learn API provides parametrization for many common use cases. The Torch API, however, allows for more flexibility and customization, e.g., of sampling, criteria, and data loaders.

In this minimal example, we show how to initialize a CEBRA model using the Torch API. Here, the cebra.data.single_session.DiscreteDataLoader is initialized, which also allows the prior to be directly parametrized.

πŸ‘‰ For an example notebook using the Torch API check out the Decoding movie features from (V1) visual cortex.

import numpy as np
import cebra.datasets
import torch

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

neural_data = cebra.load_data(file="neural_data.npz", key="neural")

discrete_label = cebra.load_data(
    file="auxiliary_behavior_data.h5", key="auxiliary_variables", columns=["discrete"],
)

# 1. Define a CEBRA-ready dataset
input_data = cebra.data.TensorDataset(
    torch.from_numpy(neural_data).type(torch.FloatTensor),
    discrete=torch.from_numpy(np.array(discrete_label[:, 0])).type(torch.LongTensor),
).to(device)

# 2. Define a CEBRA model
neural_model = cebra.models.init(
    name="offset10-model",
    num_neurons=input_data.input_dimension,
    num_units=32,
    num_output=2,
).to(device)

input_data.configure_for(neural_model)

# 3. Define the Loss Function Criterion and Optimizer
crit = cebra.models.criterions.LearnableCosineInfoNCE(
    temperature=1,
).to(device)

opt = torch.optim.Adam(
    list(neural_model.parameters()) + list(crit.parameters()),
    lr=0.001,
    weight_decay=0,
)

# 4. Initialize the CEBRA model
solver = cebra.solver.init(
    name="single-session",
    model=neural_model,
    criterion=crit,
    optimizer=opt,
    tqdm_on=True,
).to(device)

# 5. Define Data Loader
loader = cebra.data.single_session.DiscreteDataLoader(
    dataset=input_data, num_steps=10, batch_size=200, prior="uniform"
)

# 6. Fit Model
solver.fit(loader=loader)

# 7. Transform Embedding
train_batches = np.lib.stride_tricks.sliding_window_view(
    neural_data, neural_model.get_offset().__len__(), axis=0
)

x_train_emb = solver.transform(
    torch.from_numpy(train_batches[:]).type(torch.FloatTensor).to(device)
).to(device)

# 8. Plot Embedding
cebra.plot_embedding(
    x_train_emb.cpu(),
    discrete_label[neural_model.get_offset().__len__() - 1 :, 0],
    markersize=10,
)