Helper functions and utilities#
This section contains information on modules that are used within the packages, but are not part of the key algorithm/functionality.
File IO#
Helper classes and functions for I/O functionality.
- cebra.io.device()#
The preferred compute device.
- Return type:
str
- Returns:
cuda, if available, otherwise cpu.
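A minimal usage sketch (the variable name is illustrative; per the description above, the returned value names the available compute device):
>>> import cebra.io
>>> device = cebra.io.device()  # "cuda" if a GPU is available, otherwise "cpu"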
- class cebra.io.HasDevice(device=None)#
Bases:
object
Base class for classes that use CPU/CUDA processing via PyTorch.
If implementing this class, any derived instance will track any attribute of type torch.Tensor (this includes torch.nn.Parameter) or torch.nn.Module, as well as any class subclassing HasDevice itself. When calling to(), all of these attributes will themselves be moved to the specified device. Any instance of this class will need to be initialized. This can happen explicitly through the constructor (by specifying a device during initialization) or by assigning the first element, which will assign the device of the first element to the whole instance.
Every following assignment will result in tensors, parameters, modules and other instances being moved to the device of the instance.
- Parameters:
device (Optional[str]) – The device name, typically cpu or cuda, optionally combined with a device index, e.g. cuda:0.
Note
Do not subclass this class when a dependency on a compute device is already implemented, e.g. via a PyTorch torch.nn.Module.
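A minimal sketch of a subclass; the class name MyEncoder and its attribute are hypothetical, but the constructor argument, attribute tracking and to() follow the behavior described above:
>>> import torch
>>> import cebra.io
>>> class MyEncoder(cebra.io.HasDevice):
...     def __init__(self):
...         super().__init__("cpu")         # initialize the instance with a device
...         self.weights = torch.zeros(10)  # torch.Tensor attributes are tracked
>>> encoder = MyEncoder()
>>> _ = encoder.to("cpu")                   # moves all tracked attributes to the given device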
- cebra.io.reduce(data, *, ratio=None, num_components=None)#
Map the specified data to its principal components.
Specify either an explained variance ratio between 0 and 1, or a number of principal components to use.
- Parameters:
ratio – The ratio of explained variance (between 0 and 1) required from the returned components. Note that the dimension of the output will vary based on the provided input data.
num_components – The number of principal components to return.
- Returns:
An (N, d) array, where the dimension d is either limited by the specified number of components, or is chosen to explain the specified variance in the data.
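A hedged usage sketch; it assumes the input is accepted as a NumPy array, and the shapes are purely illustrative:
>>> import numpy as np
>>> import cebra.io
>>> data = np.random.uniform(0, 1, (100, 30))
>>> # keep the number of components needed to explain 90% of the variance
>>> reduced = cebra.io.reduce(data, ratio=0.9)
>>> # or request a fixed number of principal components, yielding a (100, 5) array
>>> reduced = cebra.io.reduce(data, num_components=5)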
- class cebra.io.FileKeyValueDataset(path)#
Bases:
object
Load datasets from HDF, torch, numpy or joblib files.
The data is directly accessible through attributes of instances of this class.
- Parameters:
path (str) – The filepath for loading the data from. Should point to a file in a valid file format (hdf, torch, numpy, joblib). Valid extensions are jl, joblib, h5, hdf, hdf5, pth, pt and npz.
Example
>>> import cebra.io
>>> import joblib
>>> import tempfile
>>> from pathlib import Path
>>> tmp_file = Path(tempfile.gettempdir(), 'test.jl')
>>> _ = joblib.dump({'foo' : 42}, tmp_file)
>>> data = cebra.io.FileKeyValueDataset(tmp_file)
>>> data.foo
42
Registry#
A simple registry for python modules.
This module only exposes a single public function, add_helper_functions, which takes a python module or module name (or package) as its argument and defines the decorator functions register and parametrize as well as the functions init and get_options within that module. It also (implicitly and lazily) initializes a singleton registry object which holds all registered classes. Typically, the helper functions should be added in the first lines of a package's __init__.py module.
Note that all classes carrying the respective decorators need to be discovered by the import system, otherwise they will not be available when calling get_options or init.
- cebra.registry.add_helper_functions(module)#
Add registry functionality to the given module.
Call this function within a python module to add the three functions register, init and get_options to the module.
register is a decorator for classes within the module. Each class will be added under a (unique) name and can be initialized with the init function. init takes a name as its argument and returns an instance of the specified class, with optional arguments. get_options returns a list of all registered names within the module.
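A sketch of the intended usage inside a hypothetical package's __init__.py; the package, the MySolver class and the registry key "my-solver" are made up for illustration:
# my_package/__init__.py (hypothetical)
import cebra.registry

cebra.registry.add_helper_functions(__name__)

@register("my-solver")  # `register` is injected into this module by add_helper_functions
class MySolver:
    def __init__(self, learning_rate=1e-3):
        self.learning_rate = learning_rate

# After importing the package, consumers can call:
#   my_package.get_options()      # lists registered names, including "my-solver"
#   my_package.init("my-solver")  # returns a MySolver instance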
- cebra.registry.add_docstring(module)#
Apply additional information about configuration options to registry modules.
- cebra.registry.is_registry(module, check_docs=False)#
Check if the given module implements all registry functions.
- Parameters:
module – The module to check.
check_docs (bool) – If True, additionally check that the module documentation matches.
- Return type:
bool
- Returns:
True if the module is a registry and implements the register, init and get_options functions. If check_docs is set to True, then the documentation needs to match in addition. False if at least one function is missing.
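A short usage sketch, assuming cebra.models is one of the CEBRA packages that registers its classes through this mechanism:
>>> import cebra.registry
>>> import cebra.models
>>> cebra.registry.is_registry(cebra.models)
True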
Grid-Search#
Utilities for performing a grid search across CEBRA models.
- class cebra.grid_search.GridSearch#
Bases:
object
Define and run a grid search on the CEBRA hyperparameters.
Note
We recommend using this grid search implementation for rather small and simple grid searches.
Depending on the usage, one needs to optimize some of the parameters used in the CEBRA model, e.g., the temperature, the batch_size or the learning_rate. A grid search on that set of parameters consists of finding the best combination of values for those parameters. For that, models with different combinations of parameters are trained, and the parameters used to obtain the best-performing model are considered to be the optimal parameters. One can also define fixed parameters, which will stay constant from one model to the other, e.g., max_iterations or verbose.
The class also allows iterating over multiple datasets and combinations of auxiliary variables.
- generate_models(params)#
Generate the models to compare, based on fixed and variable CEBRA parameters.
- Parameters:
params (dict) – Dict of parameter values provided by the user, either as a single value, for fixed hyperparameter values, or as a list of values for hyperparameters to optimize. If the value is a list with a single element, the hyperparameter is considered fixed.
- Return type:
tuple
- Returns:
A dict of (unfitted) models (first return value) and a list of dicts of the parameters for each model (second return value).
Example
>>> import cebra.grid_search
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Define the grid search and generate the models
>>> grid_search = cebra.grid_search.GridSearch()
>>> models, parameter_grid = grid_search.generate_models(params=params_grid)
- fit_models(datasets, params, models_dir='saved_models')#
Fit the models to compare in the grid search on the provided datasets.
- Parameters:
datasets (dict) – A dict of datasets to iterate over. The values in the dict can either be a tuple or an iterable structure (Iterable, i.e., list, numpy.array(), torch.Tensor). If the value provided for a given dataset is a tuple that contains multiple elements, and those elements are all iterable structures (as defined before), then the first element is considered to be the data to fit the CEBRA models on (the X to be used in cebra.CEBRA.fit()) and the other values are the auxiliary variables to use in the training process (the ys to be used in cebra.CEBRA.fit()). The models are then trained using behavioral contrastive learning (either CEBRA-Behavior or CEBRA-Hybrid). If the value provided for a given dataset is a tuple containing a single value that is an iterable structure (as defined before), or is such an iterable structure directly (not a tuple), then the value is considered to be the data to fit the CEBRA models on. The models are then trained using temporal contrastive learning (CEBRA-Time). An example of a valid datasets value could be: datasets={"dataset1": neural_data, "dataset2": (neural_data, continuous_data, discrete_data), "dataset3": (neural_data2, continuous_data2)}.
params (dict) – Dict of parameter values provided by the user, either as a single value, for fixed hyperparameter values, or as a list of values for hyperparameters to optimize. If the value is a list with a single element, the hyperparameter is considered fixed.
models_dir (str) – The path to the folder in which to save the (fitted) models.
- Return type:
GridSearch
- Returns:
self, for chaining operations.
Example
>>> import cebra.grid_search
>>> import numpy as np
>>> neural_data = np.random.uniform(0, 1, (300, 30))
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Fit the models generated from the list of parameters
>>> grid_search = cebra.grid_search.GridSearch()
>>> grid_search = grid_search.fit_models(datasets={"neural_data": neural_data},
...                                      params=params_grid,
...                                      models_dir="grid_search_models")
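Building on the example above, a hedged sketch of fitting with an auxiliary (behavioral) variable, following the tuple format described for datasets (the variable names are illustrative):
>>> continuous_label = np.random.uniform(0, 1, (300, 2))
>>> grid_search = cebra.grid_search.GridSearch().fit_models(
...     datasets={"neural_with_behavior": (neural_data, continuous_label)},
...     params=params_grid,
...     models_dir="grid_search_models")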
- classmethod load(dir)#
Load the fitted models and parameter grid present in dir.
Note
It is recommended to generate the models to iterate over by using fit_models(), but you can also run this function on a folder containing valid fitted models and the corresponding parameter grid.
- Parameters:
dir (str) – The directory in which the fitted models are saved.
- Return type:
tuple
- Returns:
A dict containing the fitted models (first return value) and a list of dicts containing the parameters used for each model present in dir (second return value).
Example
>>> import cebra.grid_search
>>> models, parameter_grid = cebra.grid_search.GridSearch().load(dir="grid_search_models")
- get_best_model(scoring='infonce_loss', dataset_name=None, models_dir=None)#
Get the model with the best performance across all sets of parameters and datasets.
- Parameters:
scoring (Literal['infonce_loss']) – Metric to use to evaluate the models' performance.
dataset_name (Optional[str]) – Name of the dataset to find the best model for. By default, dataset_name is set to None and the best model is selected from the list of all models for all sets of parameters and across all datasets. A dataset_name is valid if models were fitted on that set of data and those models are present in models_dir. Then, the returned model will correspond to the model with the highest performance on that dataset only.
models_dir (Optional[str]) – The path to the folder in which the (fitted) models are saved.
- Return type:
tuple
- Returns:
The cebra.CEBRA model with the highest performance for a given dataset_name (first return value) and its name (second return value).
Example
>>> import cebra.grid_search
>>> import numpy as np
>>> neural_data = np.random.uniform(0, 1, (300, 30))
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Fit the models generated from the list of parameters
>>> grid_search = cebra.grid_search.GridSearch()
>>> grid_search = grid_search.fit_models(datasets={"neural_data": neural_data},
...                                      params=params_grid,
...                                      models_dir="grid_search_models")
>>> # 3. Get model with the best performances and use it as usual
>>> best_model, best_model_name = grid_search.get_best_model()
>>> embedding = best_model.transform(neural_data)
- get_df_results(models_dir=None)#
Create a pandas.DataFrame containing the parameters and results for each model.
- Parameters:
models_dir (Optional[str]) – The path to the folder in which the (fitted) models are saved.
- Return type:
pandas.DataFrame
- Returns:
A pandas.DataFrame in which each row corresponds to a model from the grid search and contains the parameters used for the model, the dataset name that was used for fitting and the performance obtained with that model.
Example
>>> import cebra.grid_search
>>> import numpy as np
>>> neural_data = np.random.uniform(0, 1, (300, 30))
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Fit the models generated from the list of parameters
>>> grid_search = cebra.grid_search.GridSearch()
>>> grid_search = grid_search.fit_models(datasets={"neural_data": neural_data},
...                                      params=params_grid,
...                                      models_dir="grid_search_models")
>>> # 3. Get results for all models
>>> df_results = grid_search.get_df_results()
- plot_loss_comparison(models_dir=None, **kwargs)#
Display the losses for all fitted models present in models_dir.
Note
The method is a wrapper around compare_models(), meaning you can pass any parameter accepted by compare_models() to this method.
- Parameters:
models_dir (Optional[str]) – The path to the folder in which the (fitted) models are saved.
- Return type:
matplotlib.axes.Axes
- Returns:
A matplotlib.axes.Axes on which the plot is generated.
Example
>>> import cebra.grid_search
>>> import numpy as np
>>> neural_data = np.random.uniform(0, 1, (300, 30))
>>> # 1. Define the parameters for the models
>>> params_grid = dict(
...     output_dimension = [3, 16],
...     learning_rate = [0.001],
...     time_offsets = 5,
...     max_iterations = 10,
...     verbose = False)
>>> # 2. Fit the models generated from the list of parameters
>>> grid_search = cebra.grid_search.GridSearch()
>>> grid_search = grid_search.fit_models(datasets={"neural_data": neural_data},
...                                      params=params_grid,
...                                      models_dir="grid_search_models")
>>> # 3. Plot losses for all models
>>> ax = grid_search.plot_loss_comparison()
- plot_transform()#
TODO.
Data helpers#
- cebra.data.helper.get_loader_options(dataset)#
Return all possible dataloaders for the given dataset.
- class cebra.data.helper.OrthogonalProcrustesAlignment(top_k=5, subsample=None)#
Bases:
object
Aligns two datasets by solving the orthogonal Procrustes problem.
Tip
In linear algebra, the orthogonal Procrustes problem is a matrix approximation problem. Considering two matrices A and B, it consists of finding the orthogonal matrix R that most closely maps A to B, i.e., the matrix minimizing the Frobenius norm of (A @ R) - B subject to R.T @ R = I. See scipy.linalg.orthogonal_procrustes() for more information.
For each dataset, the data and the labels on which to align the data are provided.
The top_k indexes of the labels to align (label) that are the closest to the labels of the reference dataset (ref_label) are selected and used to sample from the dataset to align (data).
data and ref_data (the reference dataset) are subsampled to the same number of samples, subsample.
The orthogonal mapping is computed, using scipy.linalg.orthogonal_procrustes(), on those subsampled datasets.
The resulting orthogonal matrix _transform can be used to map the original data to the ref_data.
Note
data and ref_data can be of different sample size (axis 0) but must have the same number of features (axis 1) to be aligned.
- top_k#
Number of indexes in the labels of the matrix to align (label) to consider for the alignment. The selected indexes are the top_k indexes that are the closest to the reference labels (ref_label).
- Type:
int
- subsample#
Number of samples to subsample the data and ref_data from, to solve the orthogonal Procrustes problem on.
- Type:
int
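For intuition, the underlying problem can be solved directly with SciPy; this sketch is not part of the CEBRA API and only illustrates the orthogonality constraint on random matrices:
>>> import numpy as np
>>> from scipy.linalg import orthogonal_procrustes
>>> A = np.random.uniform(0, 1, (100, 3))
>>> B = np.random.uniform(0, 1, (100, 3))
>>> R, scale = orthogonal_procrustes(A, B)  # R minimizes the Frobenius norm of (A @ R) - B
>>> np.allclose(R.T @ R, np.eye(3))
True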
- fit(ref_data, data, ref_label=None, label=None)#
Compute the matrix solution of the orthogonal Procrustes problem.
The obtained matrix is used to align a dataset to a reference dataset.
- Parameters:
ref_data (Union[numpy.ndarray, torch.Tensor]) – Reference data matrix on which to align the data.
data (Union[numpy.ndarray, torch.Tensor]) – Data matrix to align on the reference dataset.
ref_label (Optional[numpy.ndarray]) – Set of indices associated to ref_data.
label (Optional[numpy.ndarray]) – Set of indices associated to data.
- Return type:
OrthogonalProcrustesAlignment
- Returns:
self, for chaining operations.
Example
>>> import cebra.data.helper
>>> import numpy as np
>>> ref_embedding = np.random.uniform(0, 1, (1000, 30))
>>> aux_embedding = np.random.uniform(0, 1, (800, 30))
>>> ref_label = np.random.uniform(0, 1, (1000, 1))
>>> aux_label = np.random.uniform(0, 1, (800, 1))
>>> orthogonal_procrustes = cebra.data.helper.OrthogonalProcrustesAlignment()
>>> orthogonal_procrustes = orthogonal_procrustes.fit(ref_data=ref_embedding,
...                                                   data=aux_embedding,
...                                                   ref_label=ref_label,
...                                                   label=aux_label)
- transform(data)#
Transform the data using the matrix solution computed in fit().
- Parameters:
data (Union[numpy.ndarray, torch.Tensor]) – The 2D data matrix to align.
- Return type:
numpy.ndarray
- Returns:
The aligned input matrix.
Example
>>> import cebra.data.helper
>>> import numpy as np
>>> ref_embedding = np.random.uniform(0, 1, (1000, 30))
>>> aux_embedding = np.random.uniform(0, 1, (800, 30))
>>> ref_label = np.random.uniform(0, 1, (1000, 1))
>>> aux_label = np.random.uniform(0, 1, (800, 1))
>>> orthogonal_procrustes = cebra.data.helper.OrthogonalProcrustesAlignment()
>>> orthogonal_procrustes = orthogonal_procrustes.fit(ref_data=ref_embedding,
...                                                   data=aux_embedding,
...                                                   ref_label=ref_label,
...                                                   label=aux_label)
>>> aligned_aux_embedding = orthogonal_procrustes.transform(data=aux_embedding)
>>> assert aligned_aux_embedding.shape == aux_embedding.shape
- fit_transform(ref_data, data, ref_label=None, label=None)#
Compute the matrix solution to align a data array to a reference matrix.
Note
Uses a combination of fit() and transform().
- Parameters:
ref_data (numpy.ndarray) – Reference data matrix on which to align the data.
data (numpy.ndarray) – Data matrix to align on the reference dataset.
ref_label (Optional[numpy.ndarray]) – Set of indices associated to ref_data.
label (Optional[numpy.ndarray]) – Set of indices associated to data.
- Return type:
numpy.ndarray
- Returns:
The data matrix aligned onto the reference data matrix.
Example
>>> import cebra.data.helper
>>> import numpy as np
>>> ref_embedding = np.random.uniform(0, 1, (1000, 30))
>>> aux_embedding = np.random.uniform(0, 1, (800, 30))
>>> ref_label = np.random.uniform(0, 1, (1000, 1))
>>> aux_label = np.random.uniform(0, 1, (800, 1))
>>> orthogonal_procrustes = cebra.data.helper.OrthogonalProcrustesAlignment(top_k=10,
...                                                                         subsample=700)
>>> aligned_aux_embedding = orthogonal_procrustes.fit_transform(ref_data=ref_embedding,
...                                                             data=aux_embedding,
...                                                             ref_label=ref_label,
...                                                             label=aux_label)
>>> assert aligned_aux_embedding.shape == aux_embedding.shape
- cebra.data.helper.ensemble_embeddings(embeddings, labels=None, post_norm=False, n_jobs=0)#
Ensemble aligned embeddings together.
The embeddings contained in embeddings are aligned onto the same embedding, using OrthogonalProcrustesAlignment. Then, they are averaged and the resulting averaged embedding is the ensemble embedding.
Tip
By ensembling embeddings coming from the same dataset but obtained from different models, the resulting joint embedding usually shows improved performance compared to the individual embeddings.
Note
The embeddings in embeddings must be of the same shape, i.e., the same number of samples (axis 0) and the same number of features (axis 1).
- Parameters:
embeddings (List[Union[numpy.ndarray, torch.Tensor]]) – List of embeddings to align and ensemble.
labels (Optional[List[Union[numpy.ndarray, torch.Tensor]]]) – Optional list of indexes associated to the embeddings in embeddings to align the embeddings on. To be ensembled, the embeddings should already be aligned on time, and consequently do not require extra labels for alignment.
post_norm (bool) – If True, the resulting joint embedding is normalized (divided by its norm across the features, axis 1).
n_jobs (int) – The maximum number of concurrently running jobs to compute embedding alignment in a parallel manner using joblib.Parallel. Specify 0 to iterate naively over the embeddings for ensembling without using joblib.Parallel. Specify -1 to use all cores. Using more than a single core can considerably speed up the computation of ensembled embeddings for large datasets, but will also require more memory.
- Return type:
numpy.ndarray
- Returns:
A numpy.array() corresponding to the joint embedding.
Example
>>> import cebra.data.helper
>>> import numpy as np
>>> embedding1 = np.random.uniform(0, 1, (100, 4))
>>> embedding2 = np.random.uniform(0, 1, (100, 4))
>>> embedding3 = np.random.uniform(0, 1, (100, 4))
>>> joint_embedding = cebra.data.helper.ensemble_embeddings(embeddings=[embedding1, embedding2, embedding3])
>>> assert joint_embedding.shape == embedding1.shape