Helper functions and utilities#
This section documents modules that are used within the package but are not part of the core algorithm or functionality.
File IO#
Helper classes and functions for I/O functionality.
- cebra.io.device()#
The preferred compute device.
- Return type:
str
- Returns:
cuda, if available, otherwise cpu.
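Example (illustrative; the returned string depends on whether CUDA is available on the machine):
>>> import cebra.io
>>> cebra.io.device()   # doctest: +SKIP
'cuda'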
- class cebra.io.HasDevice(device=None)#
Bases: object
Base class for classes that use CPU/CUDA processing via PyTorch.
When subclassing this class, any derived instance will track attributes of type
torch.Tensor (this includes torch.nn.Parameter), torch.nn.Module, as well as any class subclassing HasDevice itself. When calling
to(), all of these attributes will themselves be moved to the specified device. Any instance of this class needs to be initialized. This can happen explicitly through the constructor (by specifying a device during initialization) or by assigning the first element, which will assign the device of that first element to the whole instance.
Every following assignment will result in tensors, parameters, modules and other instances being moved to the device of the instance.
- Parameters:
device (Optional[str]) – The device name, typically cpu or cuda, possibly combined with a device id, e.g. cuda:0.
Note
Do not subclass this class when a dependency on a compute device is already implemented, e.g. via a PyTorch
torch.nn.Module.
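Example (a minimal sketch based on the behavior described above; the class and attribute names are illustrative):
>>> import torch
>>> import cebra.io
>>> class MyProcessor(cebra.io.HasDevice):
...     def __init__(self):
...         super().__init__(device="cpu")
...         self.weights = torch.zeros(10)   # tracked automatically
>>> processor = MyProcessor()
>>> _ = processor.to("cpu")   # moves all tracked attributes to the given device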
- cebra.io.reduce(data, *, ratio=None, num_components=None)#
Map the specified data to its principal components.
Specify either an explained variance ratio between 0 and 1, or a number of principal components to use.
- Parameters:
ratio – The ratio (needs to be between 0 and 1) of explained variance required by the returned number of components. Note that the dimension of the output will vary based on the provided input data.
num_components – The number of principal components to return
- Returns:
An (N, d) array, where the dimension d is either limited by the specified number of components, or is chosen to explain the specified variance in the data.
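Example (a minimal sketch; the exact return type and values depend on the input data):
>>> import numpy as np
>>> import cebra.io
>>> data = np.random.uniform(0, 1, (100, 30))
>>> reduced = cebra.io.reduce(data, num_components=5)   # doctest: +SKIP
>>> reduced.shape   # doctest: +SKIP
(100, 5)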
- class cebra.io.FileKeyValueDataset(path)#
Bases: object
Load datasets from HDF, torch, numpy or joblib files.
The data is directly accessible through attributes of instances of this class.
- Parameters:
path (str) – The filepath for loading the data from. Should point to a file in a valid file format (hdf, torch, numpy, joblib). Valid extensions are jl, joblib, h5, hdf, hdf5, pth, pt and npz.
Example
>>> import cebra.io
>>> import joblib
>>> import tempfile
>>> from pathlib import Path
>>> tmp_file = Path(tempfile.gettempdir(), 'test.jl')
>>> _ = joblib.dump({'foo' : 42}, tmp_file)
>>> data = cebra.io.FileKeyValueDataset(tmp_file)
>>> data.foo
42
Registry#
A simple registry for Python modules.
This module only exposes a single public function, add_helper_functions, which takes a Python module or module name (or package) as its argument and defines the decorator functions register and parametrize and the functions init and get_options within this module. It also (implicitly and lazily) initializes a singleton registry object which holds all registered classes. Typically, the helper functions should be added in the first lines of a package's __init__.py module.
Note that all functions carrying the respective decorators need to be discovered by the import system, otherwise they will not be available when calling get_options or init.
- cebra.registry.add_helper_functions(module)#
Add registry functionality to the given module.
Call this function within a Python module to add the three functions register, init and get_options to the module.
register is a decorator for classes within the module. Each class will be added by a (unique) name and can be initialized with the init function.
init takes a name as its argument and returns an instance of the specified class, with optional arguments.
get_options returns a list of all registered names within the module.
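Example (a hypothetical sketch of a package __init__.py; the module and class names are placeholders, not part of CEBRA):
# contents of mypackage/__init__.py
import cebra.registry

cebra.registry.add_helper_functions(__name__)

@register("my-class")          # register is defined by the call above
class MyClass:
    pass

# after importing the package:
#   mypackage.get_options()      -> ["my-class"]
#   mypackage.init("my-class")   -> an instance of MyClass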
- cebra.registry.add_docstring(module)#
Apply additional information about configuration options to registry modules.
- cebra.registry.is_registry(module, check_docs=False)#
Check if the given module implements all registry functions.
- Parameters:
- Return type:
bool
- Returns:
True if the module is a registry and implements the register, init and get_options functions. If check_docs is set to True, then the documentation needs to match in addition. False if at least one function is missing.
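Example (illustrative; assumes that cebra.models sets up its registry as usual):
>>> import cebra.registry
>>> import cebra.models
>>> cebra.registry.is_registry(cebra.models)   # doctest: +SKIP
True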
Grid-Search#
Utilities for performing a grid search across CEBRA models.
- class cebra.grid_search.GridSearch#
Bases: object
Define and run a grid search on the CEBRA hyperparameters.
Note
We recommend using this grid search implementation for rather small and simple searches.
Depending on the usage, one needs to optimize some of the parameters used in the CEBRA model, e.g., the temperature, the batch_size, or the learning_rate. A grid search over that set of parameters consists of finding the best combination of values for those parameters. To do that, models are trained with different combinations of parameter values, and the parameters used to obtain the best-performing model are considered to be the optimal ones. One can also define fixed parameters, which stay constant from one model to the next, e.g., max_iterations or verbose.
The class also allows iterating over multiple datasets and combinations of auxiliary variables.
- generate_models(params)#
Generate the models to compare, based on fixed and variable CEBRA parameters.
- Parameters:
params (dict) – Dict of parameter values provided by the user, either as a single value for fixed hyperparameters, or as a list of values for hyperparameters to optimize. If the value is a list with a single element, the hyperparameter is considered fixed.
- Return type:
- Returns:
A dict of (unfitted) models (first return value) and a list of dicts with the parameters for each model (second return value).
Example
>>> import cebra.grid_search
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Define the grid search and generate the models
>>> grid_search = cebra.grid_search.GridSearch()
>>> models, parameter_grid = grid_search.generate_models(params=params_grid)
- fit_models(datasets, params, models_dir='saved_models')#
Fit the models to compare in the grid search on the provided datasets.
- Parameters:
datasets (dict) – A dict of datasets to iterate over. The values in the dict can either be a tuple or an iterable structure (Iterable, i.e., list, numpy.array(), torch.Tensor). If the value provided for a given dataset is a tuple that contains multiple elements, and those elements are all iterable structures (as defined before), then the first element is considered to be the data to fit the CEBRA models on (X to be used in cebra.CEBRA.fit()) and the other elements are the auxiliary variables to use in the training process (ys to be used in cebra.CEBRA.fit()). The models are then trained using behavioral contrastive learning (either CEBRA-Behavior or CEBRA-Hybrid). If the value provided for a given dataset is a tuple containing a single element that is an iterable structure (as defined before), or is such an iterable structure directly (not a tuple), then the value is considered to be the data to fit the CEBRA models on. The models are then trained using temporal contrastive learning (CEBRA-Time). An example of a valid datasets value could be: datasets={"dataset1": neural_data, "dataset2": (neural_data, continuous_data, discrete_data), "dataset3": (neural_data2, continuous_data2)}.
params (dict) – Dict of parameter values provided by the user, either as a single value for fixed hyperparameters, or as a list of values for hyperparameters to optimize. If the value is a list with a single element, the hyperparameter is considered fixed.
models_dir (str) – The path to the folder in which to save the (fitted) models.
- Return type:
- Returns:
self, for chaining operations.
Example
>>> import cebra.grid_search
>>> import numpy as np
>>> neural_data = np.random.uniform(0, 1, (300, 30))
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Fit the models generated from the list of parameters
>>> grid_search = cebra.grid_search.GridSearch()
>>> grid_search = grid_search.fit_models(datasets={"neural_data": neural_data},
...                                      params=params_grid,
...                                      models_dir="grid_search_models")
- classmethod load(dir)#
Load the fitted models and parameter grid present in dir.
Note
It is recommended to generate the models to iterate over by using fit_models(), but you can also run this function on a folder containing valid fitted models and the corresponding parameter grid.
- Parameters:
dir (str) – The directory in which the fitted models are saved.
- Return type:
- Returns:
A dict containing the fitted models (first return value) and a list of dicts containing the parameters used for each model present in dir (second return value).
Example
>>> import cebra.grid_search
>>> models, parameter_grid = cebra.grid_search.GridSearch().load(dir="grid_search_models")
- get_best_model(scoring='infonce_loss', dataset_name=None, models_dir=None)#
Get the model with the best performance across all sets of parameters and datasets.
- Parameters:
scoring (Literal['infonce_loss']) – Metric to use to evaluate the models' performance.
dataset_name (Optional[str]) – Name of the dataset to find the best model for. By default, dataset_name is set to None and the best model is searched for among all models for all sets of parameters and across all datasets. A dataset_name is valid if models were fitted on that set of data and those models are present in models_dir. Then, the returned model corresponds to the model with the highest performance on that dataset only.
models_dir (Optional[str]) – The path to the folder in which the (fitted) models are saved.
- Return type:
- Returns:
The cebra.CEBRA model with the highest performance for a given dataset_name (first return value) and its name (second return value).
Example
>>> import cebra.grid_search
>>> import numpy as np
>>> neural_data = np.random.uniform(0, 1, (300, 30))
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Fit the models generated from the list of parameters
>>> grid_search = cebra.grid_search.GridSearch()
>>> grid_search = grid_search.fit_models(datasets={"neural_data": neural_data},
...                                      params=params_grid,
...                                      models_dir="grid_search_models")
>>> # 3. Get model with the best performances and use it as usual
>>> best_model, best_model_name = grid_search.get_best_model()
>>> embedding = best_model.transform(neural_data)
- get_df_results(models_dir=None)#
Create a pandas.DataFrame containing the parameters and results for each model.
- Parameters:
models_dir (Optional[str]) – The path to the folder in which the (fitted) models are saved.
- Return type:
- Returns:
A pandas.DataFrame in which each row corresponds to a model from the grid search and contains the parameters used for the model, the dataset name that was used for fitting, and the performance obtained with that model.
Example
>>> import cebra.grid_search
>>> import numpy as np
>>> neural_data = np.random.uniform(0, 1, (300, 30))
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Fit the models generated from the list of parameters
>>> grid_search = cebra.grid_search.GridSearch()
>>> grid_search = grid_search.fit_models(datasets={"neural_data": neural_data},
...                                      params=params_grid,
...                                      models_dir="grid_search_models")
>>> # 3. Get results for all models
>>> df_results = grid_search.get_df_results()
- plot_loss_comparison(models_dir=None, **kwargs)#
Display the losses for all fitted models present in models_dir.
Note
The method is a wrapper around compare_models(), meaning you can pass to this method any parameter accepted by compare_models().
- Parameters:
models_dir (Optional[str]) – The path to the folder in which the (fitted) models are saved.
- Return type:
- Returns:
A matplotlib.axes.Axes on which the plot is generated.
Example
>>> import cebra.grid_search
>>> import numpy as np
>>> neural_data = np.random.uniform(0, 1, (300, 30))
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Fit the models generated from the list of parameters
>>> grid_search = cebra.grid_search.GridSearch()
>>> grid_search = grid_search.fit_models(datasets={"neural_data": neural_data},
...                                      params=params_grid,
...                                      models_dir="grid_search_models")
>>> # 3. Plot losses for all models
>>> ax = grid_search.plot_loss_comparison()
- plot_transform()#
TODO.
Data helpers#
- cebra.data.helper.get_loader_options(dataset)#
Return all possible dataloaders for the given dataset.
- class cebra.data.helper.OrthogonalProcrustesAlignment(top_k=5, subsample=None)#
Bases: object
Aligns two datasets by solving the orthogonal Procrustes problem.
Tip
In linear algebra, the orthogonal Procrustes problem is a matrix approximation problem. Given two matrices A and B, it consists of finding the orthogonal matrix R which most closely maps A to B, i.e., which minimizes the Frobenius norm of (A @ R) - B subject to R.T @ R = I. See scipy.linalg.orthogonal_procrustes() for more information.
For each dataset, the data and the labels on which to align the data are provided.
The top_k indexes of the labels to align (label) that are the closest to the labels of the reference dataset (ref_label) are selected and used to sample from the dataset to align (data). data and ref_data (the reference dataset) are subsampled to the same number of samples subsample.
The orthogonal mapping is computed, using scipy.linalg.orthogonal_procrustes(), on those subsampled datasets.
The resulting orthogonal matrix _transform can be used to map the original data to the ref_data.
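The underlying operation can be illustrated with scipy alone (a small sketch, independent of this class):
>>> import numpy as np
>>> from scipy.linalg import orthogonal_procrustes
>>> A = np.random.uniform(0, 1, (100, 3))
>>> rotation = np.linalg.qr(np.random.randn(3, 3))[0]   # a random orthogonal matrix
>>> B = A @ rotation
>>> R, _ = orthogonal_procrustes(A, B)   # R minimizes ||A @ R - B||_F subject to R.T @ R = I
>>> np.allclose(A @ R, B)
True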
Note
data and ref_data can be of different sample size (axis 0) but must have the same number of features (axis 1) to be aligned.
- top_k#
Number of indexes in the labels of the matrix to align (label) to consider for alignment. The selected indexes are the top_k indexes that are closest to the reference labels (ref_label).
- Type:
- subsample#
Number of samples to subsample the data and ref_data from, to solve the orthogonal Procrustes problem on.
- Type:
- fit(ref_data, data, ref_label=None, label=None)#
Compute the matrix solution of the orthogonal Procrustes problem.
The obtained matrix is used to align a dataset to a reference dataset.
- Parameters:
ref_data (Union[numpy.ndarray, torch.Tensor]) – Reference data matrix on which to align the data.
data (Union[numpy.ndarray, torch.Tensor]) – Data matrix to align on the reference dataset.
ref_label (Optional[numpy.ndarray]) – Set of indices associated with ref_data.
label (Optional[numpy.ndarray]) – Set of indices associated with data.
- Return type:
- Returns:
self, for chaining operations.
Example
>>> import cebra.data.helper
>>> import numpy as np
>>> ref_embedding = np.random.uniform(0, 1, (1000, 30))
>>> aux_embedding = np.random.uniform(0, 1, (800, 30))
>>> ref_label = np.random.uniform(0, 1, (1000, 1))
>>> aux_label = np.random.uniform(0, 1, (800, 1))
>>> orthogonal_procrustes = cebra.data.helper.OrthogonalProcrustesAlignment()
>>> orthogonal_procrustes = orthogonal_procrustes.fit(ref_data=ref_embedding,
...                                                   data=aux_embedding,
...                                                   ref_label=ref_label,
...                                                   label=aux_label)
- transform(data)#
Transform the data using the matrix solution computed in fit().
- Parameters:
data (Union[numpy.ndarray, torch.Tensor]) – The 2D data matrix to align.
- Return type:
numpy.ndarray
- Returns:
The aligned input matrix.
Example
>>> import cebra.data.helper
>>> import numpy as np
>>> ref_embedding = np.random.uniform(0, 1, (1000, 30))
>>> aux_embedding = np.random.uniform(0, 1, (800, 30))
>>> ref_label = np.random.uniform(0, 1, (1000, 1))
>>> aux_label = np.random.uniform(0, 1, (800, 1))
>>> orthogonal_procrustes = cebra.data.helper.OrthogonalProcrustesAlignment()
>>> orthogonal_procrustes = orthogonal_procrustes.fit(ref_data=ref_embedding,
...                                                   data=aux_embedding,
...                                                   ref_label=ref_label,
...                                                   label=aux_label)
>>> aligned_aux_embedding = orthogonal_procrustes.transform(data=aux_embedding)
>>> assert aligned_aux_embedding.shape == aux_embedding.shape
- fit_transform(ref_data, data, ref_label=None, label=None)#
Compute the matrix solution to align a data array to a reference matrix.
Note
Uses a combination of fit() and transform().
- Parameters:
ref_data (numpy.ndarray) – Reference data matrix on which to align the data.
data (numpy.ndarray) – Data matrix to align on the reference dataset.
ref_label (Optional[numpy.ndarray]) – Set of indices associated with ref_data.
label (Optional[numpy.ndarray]) – Set of indices associated with data.
- Return type:
numpy.ndarray
- Returns:
The data matrix aligned onto the reference data matrix.
Example
>>> import cebra.data.helper
>>> import numpy as np
>>> ref_embedding = np.random.uniform(0, 1, (1000, 30))
>>> aux_embedding = np.random.uniform(0, 1, (800, 30))
>>> ref_label = np.random.uniform(0, 1, (1000, 1))
>>> aux_label = np.random.uniform(0, 1, (800, 1))
>>> orthogonal_procrustes = cebra.data.helper.OrthogonalProcrustesAlignment(top_k=10,
...                                                                         subsample=700)
>>> aligned_aux_embedding = orthogonal_procrustes.fit_transform(ref_data=ref_embedding,
...                                                             data=aux_embedding,
...                                                             ref_label=ref_label,
...                                                             label=aux_label)
>>> assert aligned_aux_embedding.shape == aux_embedding.shape
- cebra.data.helper.ensemble_embeddings(embeddings, labels=None, post_norm=False, n_jobs=0)#
Ensemble aligned embeddings together.
The embeddings contained in embeddings are aligned onto the same embedding, using OrthogonalProcrustesAlignment. Then, they are averaged and the resulting averaged embedding is the ensemble embedding.
Tip
By ensembling embeddings coming from the same dataset but obtained with different models, the resulting joint embedding usually shows an increase in performance compared to the individual embeddings.
Note
The embeddings in embeddings must have the same shape, i.e., the same number of samples (axis 0) and the same number of features (axis 1).
- Parameters:
embeddings (List[Union[numpy.ndarray, torch.Tensor]]) – List of embeddings to align and ensemble.
labels (Optional[List[Union[numpy.ndarray, torch.Tensor]]]) – Optional list of indexes associated with the embeddings in embeddings, on which to align the embeddings. To be ensembled, the embeddings should already be aligned on time, and consequently do not require extra labels for alignment.
post_norm (bool) – If True, the resulting joint embedding is normalized (divided by its norm across the features, axis 1).
n_jobs (int) – The maximum number of concurrently running jobs to compute embedding alignment in a parallel manner using joblib.Parallel. Specify 0 to iterate naively over the embeddings for ensembling without using joblib.Parallel. Specify -1 to use all cores. Using more than a single core can considerably speed up the computation of ensembled embeddings for large datasets, but will also require more memory.
- Return type:
numpy.ndarray
- Returns:
A numpy.array() corresponding to the joint embedding.
Example
>>> import cebra.data.helper
>>> import numpy as np
>>> embedding1 = np.random.uniform(0, 1, (100, 4))
>>> embedding2 = np.random.uniform(0, 1, (100, 4))
>>> embedding3 = np.random.uniform(0, 1, (100, 4))
>>> joint_embedding = cebra.data.helper.ensemble_embeddings(embeddings=[embedding1, embedding2, embedding3])
>>> assert joint_embedding.shape == embedding1.shape
Masking helpers#
- class cebra.data.masking.MaskedMixin#
Bases: object
A mixin class for applying masking to data.
Note
This class is designed to be used as a mixin for other classes. It provides functionality to apply masking to data. The set_masks method should be called to set the masking types and their corresponding probabilities.
- set_masks(masking=None)#
Set the mask type and probability for the dataset.
- Parameters:
masking (Dict[str, float]) – A dictionary of masking types and their corresponding required masking values. The keys are the names of the Mask instances.
Note
By default, no masks are applied.
- Return type:
- apply_mask(data, chunk_size=1000)#
Apply masking to the input data.
Note
By default, no masking is applied. Otherwise, masking is applied to the input data.
- Only one masking type can be applied at a time, but multiple masking types can be set so that the method alternates between them across iterations.
- Masking is applied to the data in chunks to avoid memory issues.
- Parameters:
data (torch.Tensor) – Batch of size (batch_size, num_neurons, offset).
chunk_size (int) – Number of rows to process at a time.
- Returns:
The masked data.
- Return type:
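Example (a hypothetical sketch; the mask name passed to set_masks is a placeholder, see the Mask implementations available in cebra.data.masking for valid keys):
>>> import torch
>>> import cebra.data.masking
>>> class MaskedData(cebra.data.masking.MaskedMixin):
...     pass
>>> dataset = MaskedData()
>>> dataset.set_masks({"hypothetical_mask": 0.5})   # placeholder mask name  # doctest: +SKIP
>>> batch = torch.rand(32, 30, 10)   # (batch_size, num_neurons, offset)
>>> masked = dataset.apply_mask(batch)   # doctest: +SKIP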