Helper functions and utilities#
This section contains information on modules that are used within the packages, but are not part of the key algorithm/functionality.
File IO#
Helper classes and functions for I/O functionality.
- cebra.io.device()#
The preferred compute device.
- Return type:
str
- Returns:
cuda, if available, otherwise cpu.
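A minimal usage sketch (the variable name is illustrative; per the description above, the returned value names the available compute device):
>>> import cebra.io
>>> device = cebra.io.device()  # "cuda" if a GPU is available, otherwise "cpu"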
- class cebra.io.HasDevice(device=None)#
Bases:
object
Base class for classes that use CPU/CUDA processing via PyTorch.
If implementing this class, any derived instance will track any attribute of type torch.Tensor (this includes torch.nn.Parameter) or torch.nn.Module, as well as any class subclassing HasDevice itself. When calling to(), all of these attributes will themselves be moved to the specified device. Any instance of this class will need to be initialized. This can happen explicitly through the constructor (by specifying a device during initialization) or by assigning the first element, which will assign the device of the first element to the whole instance.
Every following assignment will result in tensors, parameters, modules and other instances being moved to the device of the instance.
- Parameters:
device (Optional[str]) – The device name, typically cpu or cuda, optionally combined with a device index, e.g. cuda:0.
Note
Do not subclass this class when a dependency on a compute device is already implemented, e.g. via a PyTorch torch.nn.Module.
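A minimal sketch of a subclass; the class name MyEncoder and its attribute are hypothetical, but the constructor argument, attribute tracking and to() follow the behavior described above:
>>> import torch
>>> import cebra.io
>>> class MyEncoder(cebra.io.HasDevice):
...     def __init__(self):
...         super().__init__("cpu")         # initialize the instance with a device
...         self.weights = torch.zeros(10)  # torch.Tensor attributes are tracked
>>> encoder = MyEncoder()
>>> _ = encoder.to("cpu")                   # moves all tracked attributes to the given device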
- cebra.io.reduce(data, *, ratio=None, num_components=None)#
Map the specified data to its principal components.
Specify either an explained variance ratio between 0 and 1, or a number of principal components to use.
- Parameters:
ratio – The ratio of explained variance (between 0 and 1) required from the returned components. Note that the dimension of the output will vary based on the provided input data.
num_components – The number of principal components to return.
- Returns:
An (N, d) array, where the dimension d is either limited by the specified number of components, or is chosen to explain the specified variance in the data.
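A hedged usage sketch; it assumes the input is accepted as a NumPy array, and the shapes are purely illustrative:
>>> import numpy as np
>>> import cebra.io
>>> data = np.random.uniform(0, 1, (100, 30))
>>> # keep the number of components needed to explain 90% of the variance
>>> reduced = cebra.io.reduce(data, ratio=0.9)
>>> # or request a fixed number of principal components, yielding a (100, 5) array
>>> reduced = cebra.io.reduce(data, num_components=5)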
- class cebra.io.FileKeyValueDataset(path)#
Bases:
object
Load datasets from HDF, torch, numpy or joblib files.
The data is directly accessible through attributes of instances of this class.
- Parameters:
path (str) – The filepath for loading the data from. Should point to a file in a valid file format (hdf, torch, numpy, joblib). Valid extensions are jl, joblib, h5, hdf, hdf5, pth, pt and npz.
Example
>>> import cebra.io
>>> import joblib
>>> import tempfile
>>> from pathlib import Path
>>> tmp_file = Path(tempfile.gettempdir(), 'test.jl')
>>> _ = joblib.dump({'foo' : 42}, tmp_file)
>>> data = cebra.io.FileKeyValueDataset(tmp_file)
>>> data.foo
42
Registry#
A simple registry for python modules.
This module only exposes a single public function, add_helper_functions, which takes a python module or module name (or package) as its argument and defines the decorator functions register and parametrize as well as the functions init and get_options within that module. It also (implicitly and lazily) initializes a singleton registry object which holds all registered classes. Typically, the helper functions should be added in the first lines of a package's __init__.py module.
Note that all classes carrying the respective decorators need to be discovered by the import system, otherwise they will not be available when calling get_options or init.
- cebra.registry.add_helper_functions(module)#
Add registry functionality to the given module.
Call this function within a python module to add the three functions register, init and get_options to the module.
register is a decorator for classes within the module. Each class will be added under a (unique) name and can be initialized with the init function. init takes a name as its argument and returns an instance of the specified class, with optional arguments. get_options returns a list of all registered names within the module.
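A sketch of the intended usage inside a hypothetical package's __init__.py; the package, the MySolver class and the registry key "my-solver" are made up for illustration:
# my_package/__init__.py (hypothetical)
import cebra.registry

cebra.registry.add_helper_functions(__name__)

@register("my-solver")  # `register` is injected into this module by add_helper_functions
class MySolver:
    def __init__(self, learning_rate=1e-3):
        self.learning_rate = learning_rate

# After importing the package, consumers can call:
#   my_package.get_options()      # lists registered names, including "my-solver"
#   my_package.init("my-solver")  # returns a MySolver instance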
- cebra.registry.add_docstring(module)#
Apply additional information about configuration options to registry modules.
- cebra.registry.is_registry(module, check_docs=False)#
Check if the given module implements all registry functions.
- Parameters:
module – The module to check.
check_docs (bool) – If True, additionally check that the module documentation matches.
- Return type:
bool
- Returns:
True if the module is a registry and implements the register, init and get_options functions. If check_docs is set to True, then the documentation needs to match in addition. False if at least one function is missing.
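A short usage sketch, assuming cebra.models is one of the CEBRA packages that registers its classes through this mechanism:
>>> import cebra.registry
>>> import cebra.models
>>> cebra.registry.is_registry(cebra.models)
True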
Grid-Search#
Utilities for performing a grid search across CEBRA models.
- class cebra.grid_search.GridSearch#
Bases:
object
Define and run a grid search on the CEBRA hyperparameters.
Note
We recommend using this grid search implementation for rather small and simple grid searches.
Depending on the usage, one needs to optimize some of the parameters used in the CEBRA model, e.g., the temperature, the batch_size or the learning_rate. A grid search on that set of parameters consists of finding the best combination of values for those parameters. For that, models with different combinations of parameters are trained, and the parameters used to obtain the best-performing model are considered to be the optimal parameters. One can also define fixed parameters, which will stay constant from one model to the other, e.g., max_iterations or verbose.
The class also allows iterating over multiple datasets and combinations of auxiliary variables.
- generate_models(params)#
Generate the models to compare, based on fixed and variable CEBRA parameters.
- Parameters:
params (dict) – Dict of parameter values provided by the user, either as a single value, for fixed hyperparameter values, or as a list of values for hyperparameters to optimize. If the value is a list with a single element, the hyperparameter is considered fixed.
- Return type:
tuple
- Returns:
A dict of (unfitted) models (first return value) and a list of dicts of the parameters for each model (second return value).
Example
>>> import cebra.grid_search
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Define the grid search and generate the models
>>> grid_search = cebra.grid_search.GridSearch()
>>> models, parameter_grid = grid_search.generate_models(params=params_grid)
- fit_models(datasets, params, models_dir='saved_models')#
Fit the models to compare in the grid search on the provided datasets.
- Parameters:
datasets (dict) – A dict of datasets to iterate over. The values in the dict can either be a tuple or an iterable structure (Iterable, i.e., list, numpy.array(), torch.Tensor). If the value provided for a given dataset is a tuple that contains multiple elements, and those elements are all iterable structures (as defined before), then the first element is considered to be the data to fit the CEBRA models on (the X to be used in cebra.CEBRA.fit()) and the other values are the auxiliary variables to use in the training process (the ys to be used in cebra.CEBRA.fit()). The models are then trained using behavioral contrastive learning (either CEBRA-Behavior or CEBRA-Hybrid). If the value provided for a given dataset is a tuple containing a single value that is an iterable structure (as defined before), or is such an iterable structure directly (not a tuple), then the value is considered to be the data to fit the CEBRA models on. The models are then trained using temporal contrastive learning (CEBRA-Time). An example of a valid datasets value could be: datasets={"dataset1": neural_data, "dataset2": (neural_data, continuous_data, discrete_data), "dataset3": (neural_data2, continuous_data2)}.
params (dict) – Dict of parameter values provided by the user, either as a single value, for fixed hyperparameter values, or as a list of values for hyperparameters to optimize. If the value is a list with a single element, the hyperparameter is considered fixed.
models_dir (str) – The path to the folder in which to save the (fitted) models.
- Return type:
GridSearch
- Returns:
self, for chaining operations.
Example
>>> import cebra.grid_search
>>> import numpy as np
>>> neural_data = np.random.uniform(0, 1, (300, 30))
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Fit the models generated from the list of parameters
>>> grid_search = cebra.grid_search.GridSearch()
>>> grid_search = grid_search.fit_models(datasets={"neural_data": neural_data},
...                                      params=params_grid,
...                                      models_dir="grid_search_models")
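Building on the example above, a hedged sketch of fitting with an auxiliary (behavioral) variable, following the tuple format described for datasets (the variable names are illustrative):
>>> continuous_label = np.random.uniform(0, 1, (300, 2))
>>> grid_search = cebra.grid_search.GridSearch().fit_models(
...     datasets={"neural_with_behavior": (neural_data, continuous_label)},
...     params=params_grid,
...     models_dir="grid_search_models")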
- classmethod load(dir)#
Load the fitted models and parameter grid present in dir.
Note
It is recommended to generate the models to iterate over by using fit_models(), but you can also run this function on a folder containing valid fitted models and the corresponding parameter grid.
- Parameters:
dir (str) – The directory in which the fitted models are saved.
- Return type:
tuple
- Returns:
A dict containing the fitted models (first return value) and a list of dicts containing the parameters used for each model present in dir (second return value).
Example
>>> import cebra.grid_search
>>> models, parameter_grid = cebra.grid_search.GridSearch().load(dir="grid_search_models")
- get_best_model(scoring='infonce_loss', dataset_name=None, models_dir=None)#
Get the model with the best performance across all sets of parameters and datasets.
- Parameters:
scoring (Literal['infonce_loss']) – Metric to use to evaluate the models' performance.
dataset_name (Optional[str]) – Name of the dataset to find the best model for. By default, dataset_name is set to None and the best model is selected from the list of all models for all sets of parameters and across all datasets. A dataset_name is valid if models were fitted on that set of data and those models are present in models_dir. Then, the returned model will correspond to the model with the highest performance on that dataset only.
models_dir (Optional[str]) – The path to the folder in which the (fitted) models are saved.
- Return type:
tuple
- Returns:
The cebra.CEBRA model with the highest performance for a given dataset_name (first return value) and its name (second return value).
Example
>>> import cebra.grid_search
>>> import numpy as np
>>> neural_data = np.random.uniform(0, 1, (300, 30))
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Fit the models generated from the list of parameters
>>> grid_search = cebra.grid_search.GridSearch()
>>> grid_search = grid_search.fit_models(datasets={"neural_data": neural_data},
...                                      params=params_grid,
...                                      models_dir="grid_search_models")
>>> # 3. Get model with the best performances and use it as usual
>>> best_model, best_model_name = grid_search.get_best_model()
>>> embedding = best_model.transform(neural_data)
- get_df_results(models_dir=None)#
Create a pandas.DataFrame containing the parameters and results for each model.
- Parameters:
models_dir (Optional[str]) – The path to the folder in which the (fitted) models are saved.
- Return type:
pandas.DataFrame
- Returns:
A pandas.DataFrame in which each row corresponds to a model from the grid search and contains the parameters used for the model, the dataset name that was used for fitting and the performance obtained with that model.
Example
>>> import cebra.grid_search
>>> import numpy as np
>>> neural_data = np.random.uniform(0, 1, (300, 30))
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Fit the models generated from the list of parameters
>>> grid_search = cebra.grid_search.GridSearch()
>>> grid_search = grid_search.fit_models(datasets={"neural_data": neural_data},
...                                      params=params_grid,
...                                      models_dir="grid_search_models")
>>> # 3. Get results for all models
>>> df_results = grid_search.get_df_results()
- plot_loss_comparison(models_dir=None, **kwargs)#
Display the losses for all fitted models present in models_dir.
Note
The method is a wrapper around compare_models(), meaning you can pass any parameter accepted by compare_models() to this method.
- Parameters:
models_dir (Optional[str]) – The path to the folder in which the (fitted) models are saved.
- Return type:
matplotlib.axes.Axes
- Returns:
A matplotlib.axes.Axes on which the plot is generated.
Example
>>> import cebra.grid_search
>>> import numpy as np
>>> neural_data = np.random.uniform(0, 1, (300, 30))
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Fit the models generated from the list of parameters
>>> grid_search = cebra.grid_search.GridSearch()
>>> grid_search = grid_search.fit_models(datasets={"neural_data": neural_data},
...                                      params=params_grid,
...                                      models_dir="grid_search_models")
>>> # 3. Plot losses for all models
>>> ax = grid_search.plot_loss_comparison()
- plot_transform()#
TODO.
Data helpers#
- cebra.data.helper.get_loader_options(dataset)#
Return all possible dataloaders for the given dataset.
- class cebra.data.helper.OrthogonalProcrustesAlignment(top_k=5, subsample=None)#
Bases:
object
Aligns two datasets by solving the orthogonal Procrustes problem.
Tip
In linear algebra, the orthogonal Procrustes problem is a matrix approximation problem. Considering two matrices A and B, it consists of finding the orthogonal matrix R that most closely maps A to B, i.e., the matrix minimizing the Frobenius norm of (A @ R) - B subject to R.T @ R = I. See scipy.linalg.orthogonal_procrustes() for more information.
For each dataset, the data and the labels on which to align the data are provided.
The top_k indexes of the labels to align (label) that are the closest to the labels of the reference dataset (ref_label) are selected and used to sample from the dataset to align (data).
data and ref_data (the reference dataset) are subsampled to the same number of samples, subsample.
The orthogonal mapping is computed, using scipy.linalg.orthogonal_procrustes(), on those subsampled datasets.
The resulting orthogonal matrix _transform can be used to map the original data to the ref_data.
Note
data and ref_data can be of different sample size (axis 0) but must have the same number of features (axis 1) to be aligned.
- top_k#
Number of indexes in the labels of the matrix to align (label) to consider for the alignment. The selected indexes are the top_k indexes that are the closest to the reference labels (ref_label).
- Type:
int
- subsample#
Number of samples to subsample the data and ref_data from, to solve the orthogonal Procrustes problem on.
- Type:
int
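For intuition, the underlying problem can be solved directly with SciPy; this sketch is not part of the CEBRA API and only illustrates the orthogonality constraint on random matrices:
>>> import numpy as np
>>> from scipy.linalg import orthogonal_procrustes
>>> A = np.random.uniform(0, 1, (100, 3))
>>> B = np.random.uniform(0, 1, (100, 3))
>>> R, scale = orthogonal_procrustes(A, B)  # R minimizes the Frobenius norm of (A @ R) - B
>>> np.allclose(R.T @ R, np.eye(3))
True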
- fit(ref_data, data, ref_label=None, label=None)#
Compute the matrix solution of the orthogonal Procrustes problem.
The obtained matrix is used to align a dataset to a reference dataset.
- Parameters:
ref_data (Union[numpy.ndarray, torch.Tensor]) – Reference data matrix on which to align the data.
data (Union[numpy.ndarray, torch.Tensor]) – Data matrix to align on the reference dataset.
ref_label (Optional[numpy.ndarray]) – Set of indices associated to ref_data.
label (Optional[numpy.ndarray]) – Set of indices associated to data.
- Return type:
OrthogonalProcrustesAlignment
- Returns:
self, for chaining operations.
Example
>>> import cebra.data.helper
>>> import numpy as np
>>> ref_embedding = np.random.uniform(0, 1, (1000, 30))
>>> aux_embedding = np.random.uniform(0, 1, (800, 30))
>>> ref_label = np.random.uniform(0, 1, (1000, 1))
>>> aux_label = np.random.uniform(0, 1, (800, 1))
>>> orthogonal_procrustes = cebra.data.helper.OrthogonalProcrustesAlignment()
>>> orthogonal_procrustes = orthogonal_procrustes.fit(ref_data=ref_embedding,
...                                                   data=aux_embedding,
...                                                   ref_label=ref_label,
...                                                   label=aux_label)
- transform(data)#
Transform the data using the matrix solution computed in fit().
- Parameters:
data (Union[numpy.ndarray, torch.Tensor]) – The 2D data matrix to align.
- Return type:
numpy.ndarray
- Returns:
The aligned input matrix.
Example
>>> import cebra.data.helper
>>> import numpy as np
>>> ref_embedding = np.random.uniform(0, 1, (1000, 30))
>>> aux_embedding = np.random.uniform(0, 1, (800, 30))
>>> ref_label = np.random.uniform(0, 1, (1000, 1))
>>> aux_label = np.random.uniform(0, 1, (800, 1))
>>> orthogonal_procrustes = cebra.data.helper.OrthogonalProcrustesAlignment()
>>> orthogonal_procrustes = orthogonal_procrustes.fit(ref_data=ref_embedding,
...                                                   data=aux_embedding,
...                                                   ref_label=ref_label,
...                                                   label=aux_label)
>>> aligned_aux_embedding = orthogonal_procrustes.transform(data=aux_embedding)
>>> assert aligned_aux_embedding.shape == aux_embedding.shape
- fit_transform(ref_data, data, ref_label=None, label=None)#
Compute the matrix solution to align a data array to a reference matrix.
Note
Uses a combination of fit() and transform().
- Parameters:
ref_data (numpy.ndarray) – Reference data matrix on which to align the data.
data (numpy.ndarray) – Data matrix to align on the reference dataset.
ref_label (Optional[numpy.ndarray]) – Set of indices associated to ref_data.
label (Optional[numpy.ndarray]) – Set of indices associated to data.
- Return type:
numpy.ndarray
- Returns:
The data matrix aligned onto the reference data matrix.
Example
>>> import cebra.data.helper
>>> import numpy as np
>>> ref_embedding = np.random.uniform(0, 1, (1000, 30))
>>> aux_embedding = np.random.uniform(0, 1, (800, 30))
>>> ref_label = np.random.uniform(0, 1, (1000, 1))
>>> aux_label = np.random.uniform(0, 1, (800, 1))
>>> orthogonal_procrustes = cebra.data.helper.OrthogonalProcrustesAlignment(top_k=10,
...                                                                         subsample=700)
>>> aligned_aux_embedding = orthogonal_procrustes.fit_transform(ref_data=ref_embedding,
...                                                             data=aux_embedding,
...                                                             ref_label=ref_label,
...                                                             label=aux_label)
>>> assert aligned_aux_embedding.shape == aux_embedding.shape
- cebra.data.helper.ensemble_embeddings(embeddings, labels=None, post_norm=False, n_jobs=0)#
Ensemble aligned embeddings together.
The embeddings contained in embeddings are aligned onto the same embedding, using OrthogonalProcrustesAlignment. Then, they are averaged and the resulting averaged embedding is the ensemble embedding.
Tip
By ensembling embeddings coming from the same dataset but obtained from different models, the resulting joint embedding usually shows improved performance compared to the individual embeddings.
Note
The embeddings in embeddings must be of the same shape, i.e., the same number of samples (axis 0) and the same number of features (axis 1).
- Parameters:
embeddings (List[Union[numpy.ndarray, torch.Tensor]]) – List of embeddings to align and ensemble.
labels (Optional[List[Union[numpy.ndarray, torch.Tensor]]]) – Optional list of indexes associated to the embeddings in embeddings to align the embeddings on. To be ensembled, the embeddings should already be aligned on time, and consequently do not require extra labels for alignment.
post_norm (bool) – If True, the resulting joint embedding is normalized (divided by its norm across the features, axis 1).
n_jobs (int) – The maximum number of concurrently running jobs to compute embedding alignment in a parallel manner using joblib.Parallel. Specify 0 to iterate naively over the embeddings for ensembling without using joblib.Parallel. Specify -1 to use all cores. Using more than a single core can considerably speed up the computation of ensembled embeddings for large datasets, but will also require more memory.
- Return type:
numpy.ndarray
- Returns:
A numpy.array() corresponding to the joint embedding.
Example
>>> import cebra.data.helper
>>> import numpy as np
>>> embedding1 = np.random.uniform(0, 1, (100, 4))
>>> embedding2 = np.random.uniform(0, 1, (100, 4))
>>> embedding3 = np.random.uniform(0, 1, (100, 4))
>>> joint_embedding = cebra.data.helper.ensemble_embeddings(embeddings=[embedding1, embedding2, embedding3])
>>> assert joint_embedding.shape == embedding1.shape