Data Loading#

Data loaders use distributions and indices to make samples available for training.

This package contains all helper functions and classes for defining and loading datasets in the various usage modes of CEBRA, e.g. single- and multi-session datasets. It is non-specific to a particular dataset (see cebra.datasets for actual dataset implementations). However, the base classes for all datasets are defined here, as well as helper functions to interact with datasets.

CEBRA supports different dataset types out of the box:

  • cebra.data.single_session.SingleSessionDataset is the abstract base class for a single session dataset. Single session datasets have the same feature dimension across the samples (e.g., neural data) and all context variables (e.g. behavior, stimuli, etc.).

  • cebra.data.multi_session.MultiSessionDataset is the abstract base class for a multi session dataset. Multi session datasets consist of multiple single session datasets. Crucially, the dimensionality of the auxiliary variables needs to match across the sessions, which allows alignment of multiple sessions. The dimensionality of the signal variable can vary arbitrarily between sessions.

Note that the actual implementation of datasets (e.g. for benchmarking) is done in the cebra.datasets package.
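
The shape constraints described above can be checked with a few lines of NumPy. This is a minimal sketch, not part of CEBRA: the helper name common_auxiliary_dimension and the representation of a session as a (signal, auxiliary) pair are assumptions made for illustration.

```python
import numpy as np


def common_auxiliary_dimension(sessions):
    """Return the auxiliary dimension shared by all sessions.

    ``sessions`` is a list of (signal, auxiliary) array pairs. This helper
    is hypothetical and only illustrates the documented constraints.
    """
    aux_dims = set()
    for signal, auxiliary in sessions:
        # Within a session, signal and auxiliary data cover the same time steps.
        if signal.shape[0] != auxiliary.shape[0]:
            raise ValueError("signal and auxiliary data differ in length")
        aux_dims.add(auxiliary.shape[1])
    # Across sessions, only the auxiliary dimension has to match.
    if len(aux_dims) != 1:
        raise ValueError(f"auxiliary dimensions differ: {sorted(aux_dims)}")
    return aux_dims.pop()


sessions = [
    (np.zeros((100, 30)), np.zeros((100, 4))),  # 30 signal features
    (np.zeros((80, 50)), np.zeros((80, 4))),    # 50 signal features
]
aux_dim = common_auxiliary_dimension(sessions)
```

The signal dimensions (30 and 50) differ freely, while a mismatch in the auxiliary dimension would raise an error.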

Base classes#

Base classes for datasets and loaders.

class cebra.data.base.Dataset(device='cpu', download=False, data_url=None, data_checksum=None, location=None, file_name=None)#

Bases: ABC, HasDevice

Abstract base class for implementing a dataset.

The class attributes provide information about the shape of the data when indexing this dataset.

input_dimension#

The input dimension of the signal in this dataset. Models applied to this dataset should match this dimensionality.

offset#

The offset determines the shape of the data obtained with the __getitem__ and expand_index() methods.

property continuous_index: torch.Tensor#

The continuous index, if available.

The continuous index along with a similarity metric is used for drawing positive and/or negative samples.

Return type:

Tensor

Returns:

Tensor of shape (N,d), representing the index for all N samples in the dataset.

property discrete_index: torch.Tensor#

The discrete index, if available.

The discrete index can be used for making an embedding invariant to a variable or to restrict positive samples to share the same index variable. To implement more complicated indexing operations (such as modeling similarities between indices), it is better to transform the discrete index into a continuous one.

Return type:

Tensor

Returns:

Tensor of shape (N,), representing the index for all N samples in the dataset.

expand_index(index)#
Parameters:

index (Tensor) – A one-dimensional tensor of type long containing indices to select from the dataset.

Return type:

Tensor

Returns:

An expanded index of shape (len(index), len(self.offset)) where the elements will be expanded_index[i,j] = index[i] + j - self.offset.left for all j in range(0, len(self.offset)).

Note

Requires the offset to be set.
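
The index arithmetic in the Returns description can be sketched with NumPy broadcasting. This is an illustration rather than the library's implementation: the stand-alone function and its left and right arguments are hypothetical, and it assumes that len(self.offset) equals left + right.

```python
import numpy as np


def expand_index(index, left, right):
    """Hypothetical stand-alone version of the documented formula.

    Assumes the offset length is ``left + right``.
    """
    index = np.asarray(index, dtype=np.int64)
    length = left + right
    # window[j] = j - left, for j in range(length)
    window = np.arange(length) - left
    # expanded[i, j] = index[i] + j - left
    return index[:, None] + window[None, :]


expanded = expand_index([5, 10], left=2, right=3)
# Row 0 covers samples 3..7, row 1 covers samples 8..12.
```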

expand_index_in_trial(index, trial_ids, trial_borders)#

When the neural or behavioral data is organized in discrete trials (e.g., the Monkey Reaching Dataset), the slice should be defined within a trial. trial_ids has the same length as self.index and indicates the trial that each index belongs to. trial_borders has the same length as self.index and indicates the border of each trial.

abstract load_batch(index)#

Return the data at the specified index location.

TODO: adapt signature to support Batches and List[Batch]

Return type:

Batch

configure_for(model)#

Configure the dataset offset for the provided model.

Call this function before indexing the dataset. This sets the offset attribute of the dataset.

Parameters:

model (Model) – The model to configure the dataset for.

class cebra.data.base.Loader(dataset=None, num_steps=None, batch_size=None)#

Bases: ABC, HasDevice

Base dataloader class.

Parameters:
  • dataset (Optional[Dataset]) – A dataset instance specifying a __getitem__ function.

  • num_steps (Optional[int]) – The total number of batches when iterating over the dataloader.

  • batch_size (Optional[int]) – The total batch size.

Yields:

Batches of the specified size from the given dataset object.

Note

The __iter__ method is non-deterministic, unless explicit seeding is implemented in derived classes. It is recommended to avoid global seeding in numpy and torch, and instead locally instantiate a Generator object for drawing samples.
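
The recommendation above can be illustrated with a locally instantiated NumPy Generator. This is a minimal sketch: the sample_indices helper is hypothetical and is not a method of Loader; torch.Generator could be used analogously for torch-based sampling.

```python
import numpy as np


def sample_indices(num_samples, dataset_size, rng):
    """Draw indices with a local generator instead of the global NumPy state."""
    return rng.integers(0, dataset_size, size=num_samples)


# Two generators created with the same seed yield identical draws,
# without touching np.random.seed or torch.manual_seed.
first = sample_indices(8, 100, np.random.default_rng(42))
second = sample_indices(8, 100, np.random.default_rng(42))
```

Because the generator is passed explicitly, other code that relies on the global random state is unaffected.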

abstract get_indices(num_samples)#

Sample and return the specified number of indices.

The elements of the returned BatchIndex will be used to index the dataset of this data loader.

Parameters:

num_samples (int) – The size of each of the reference, positive and negative samples.

Returns:

Batch indices for the reference, positive and negative samples.

General File Loading#

cebra.load_data(file, key=None, columns=None)#

Load a dataset from the given file.

The following file types are supported:
  • Numpy files: npy, npz;

  • HDF5 files: h5, hdf, hdf5, including h5 generated through DLC;

  • PyTorch files: pt, p;

  • csv files;

  • Excel files: xls, xlsx, xlsm;

  • Joblib files: jl;

  • Pickle files: p, pkl;

  • MAT-files: mat.

The assumptions on your data are as follows:
  • it contains at least one data structure (e.g. a numpy array, a torch.Tensor, etc.);

  • it can be directly in the form of a collection (e.g. a dictionary);

  • if the file contains a collection, the user can provide a key to refer to the data value they want to access;

  • if no key is provided, the first data structure found upon iteration of the collection will be loaded;

  • if a key is provided, it needs to correspond to an existing item of the collection;

  • if a key is provided, the data value accessed needs to be a data structure;

  • the function loads data for only one data structure, even if the file contains more. The function can be called again with the corresponding key to get the other ones.
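
The key rules listed above can be sketched for an in-memory dictionary. This is an illustration only, not the code behind cebra.load_data: the select_data helper is hypothetical, it treats only NumPy arrays as data structures, and the exception types are assumptions.

```python
import numpy as np


def select_data(collection, key=None):
    """Pick one array from a dict-like collection following the rules above."""
    if key is not None:
        # A provided key has to exist and has to point at a data structure.
        if key not in collection:
            raise KeyError(f"key {key!r} not found in the collection")
        value = collection[key]
        if not isinstance(value, np.ndarray):
            raise TypeError(f"value stored under {key!r} is not an array")
        return value
    # Without a key, the first data structure found upon iteration is used.
    for value in collection.values():
        if isinstance(value, np.ndarray):
            return value
    raise ValueError("the collection contains no array")


content = {
    "meta": "session-1",
    "neural": np.zeros((100, 3)),
    "trial": np.ones((100, 4)),
}
first = select_data(content)               # skips "meta", returns "neural"
trial = select_data(content, key="trial")  # explicit selection
```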

Parameters:
  • file (Union[str, Path]) – The path to the given file to load, in a supported format.

  • key (Union[str, int, None]) – The key referencing the data of interest in the file, if the file has a dictionary-like structure.

  • columns (Optional[list]) – The part of the data to keep in the output 2D-array. For now, it corresponds to the columns of a DataFrame to keep if the data selected is a DataFrame.

Return type:

numpy.ndarray

Returns:

The loaded data.

Example

>>> import cebra
>>> import cebra.helper as cebra_helper
>>> import numpy as np
>>> # Create the files to load the data from
>>> # Create a .npz file
>>> X = np.random.normal(0,1,(100,3))
>>> y = np.random.normal(0,1,(100,4))
>>> np.savez("data", neural = X, trial = y)
>>> # Create a .h5 file
>>> url = "https://github.com/DeepLabCut/DeepLabCut/blob/main/examples/Reaching-Mackenzie-2018-08-30/labeled-data/reachingvideo1/CollectedData_Mackenzie.h5?raw=true"
>>> dlc_file = cebra_helper.download_file_from_url(url) # an .h5 example file
>>> # Load data
>>> X = cebra.load_data(file="data.npz", key="neural")
>>> y_trial_id = cebra.load_data(file="data.npz", key="trial")
>>> y_behavior = cebra.load_data(file=dlc_file, columns=["Hand", "Tongue"])

DeepLabCut File Loading#

cebra.load_deeplabcut(filepath, keypoints=None, pcutoff=0.6)#

Load DLC data from h5 files.

Parameters:
  • filepath (Union[Path, str]) – Path to the .h5 file containing DLC output data.

  • keypoints (Optional[list]) – List of keypoints to keep in the output numpy.array.

  • pcutoff (float) – Drop-out threshold. If the likelihood value of the estimated positions of a sample is smaller than that threshold, the sample is set to nan. The nan values are then interpolated.

Return type:

numpy.ndarray

Returns:

A 2D array (n_samples x n_features) containing the data (x and y) generated by DLC for each keypoint of interest. Note that the likelihood is dropped.
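
The effect of pcutoff can be sketched for a single coordinate trace. This is a minimal illustration, not the implementation of load_deeplabcut: the helper name is hypothetical, and linear interpolation is an assumption, since the text above does not state which interpolation scheme is used.

```python
import numpy as np


def mask_and_interpolate(values, likelihood, pcutoff=0.6):
    """Set low-likelihood samples to nan, then fill them by interpolation."""
    values = np.asarray(values, dtype=float).copy()
    likelihood = np.asarray(likelihood, dtype=float)
    # Samples below the threshold are dropped.
    values[likelihood < pcutoff] = np.nan
    missing = np.isnan(values)
    positions = np.arange(len(values))
    # Assumption: linear interpolation between the remaining samples.
    values[missing] = np.interp(
        positions[missing], positions[~missing], values[~missing]
    )
    return values


x = mask_and_interpolate(
    values=[0.0, 1.0, 50.0, 3.0],
    likelihood=[0.9, 0.9, 0.1, 0.9],
)
# The outlier at position 2 is replaced by a value between its neighbours.
```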

Example

>>> import cebra
>>> url = ANNOTATED_DLC_URL = "https://github.com/DeepLabCut/DeepLabCut/blob/main/examples/Reaching-Mackenzie-2018-08-30/labeled-data/reachingvideo1/CollectedData_Mackenzie.h5?raw=true"
>>> file = cebra.helper.download_file_from_url(url) # an .h5 example file
>>> dlc_data = cebra.load_deeplabcut(file, keypoints=["Hand", "Joystick1"], pcutoff=0.6)

Pre-defined Datasets#

Pre-defined datasets.

class cebra.data.datasets.TensorDataset(neural, continuous=None, discrete=None, offset=Offset(left=0, right=1, length=1), device='cpu')#

Bases: SingleSessionDataset

Discrete and/or continuously indexed dataset based on torch/numpy arrays.

If the dataset is small enough to fit in a numpy.array() or torch.Tensor, this dataset class is sufficient; the auxiliary variable used for sampling should be specified with a dataloader. Based on whether continuous and/or discrete auxiliary variables are provided, this class can be used with the discrete, continuous and/or mixed data loader classes.

Example

>>> import cebra.data
>>> import torch
>>> data = torch.randn((100, 30))
>>> index1 = torch.randn((100, 2))
>>> index2 = torch.randint(0,5,(100, ))
>>> dataset = cebra.data.datasets.TensorDataset(data, continuous=index1, discrete=index2)

property continuous_index#

The continuous index, if available.

The continuous index along with a similarity metric is used for drawing positive and/or negative samples.

Returns:

Tensor of shape (N,d), representing the index for all N samples in the dataset.

property discrete_index#

The discrete index, if available.

The discrete index can be used for making an embedding invariant to a variable or to restrict positive samples to share the same index variable. To implement more complicated indexing operations (such as modeling similarities between indices), it is better to transform the discrete index into a continuous one.

Returns:

Tensor of shape (N,), representing the index for all N samples in the dataset.

class cebra.data.datasets.DatasetCollection(*datasets)#

Bases: MultiSessionDataset

Multi session dataset made up of a list of datasets.

Parameters:

*datasets – Collection of datasets to add to the collection. The order will be maintained for indexing.

Example

>>> import cebra.data
>>> import torch
>>> session1 = torch.randn((100, 30))
>>> session2 = torch.randn((100, 50))
>>> index1 = torch.randn((100, 4))
>>> index2 = torch.randn((100, 4)) # same index dim as index1
>>> dataset = cebra.data.DatasetCollection(
...               cebra.data.TensorDataset(session1, continuous=index1),
...               cebra.data.TensorDataset(session2, continuous=index2))

property num_sessions: int#

The number of sessions in the dataset.

Return type:

int

get_input_dimension(session_id)#

Get the feature dimension of the required session.

Parameters:

session_id (int) – The session ID, an integer from 0 to num_sessions - 1.

Return type:

int

Returns:

A single session input dimension for the requested session id.

get_session(session_id)#

Get the dataset for the specified session.

Parameters:

session_id (int) – The session ID, an integer from 0 to num_sessions - 1.

Return type:

SingleSessionDataset

Returns:

A single session dataset for the requested session id.

property continuous_index: torch.Tensor#

The continuous index, if available.

The continuous index along with a similarity metric is used for drawing positive and/or negative samples.

Return type:

Tensor

Returns:

Tensor of shape (N,d), representing the index for all N samples in the dataset.

property discrete_index: torch.Tensor#

The discrete index, if available.

The discrete index can be used for making an embedding invariant to a variable or to restrict positive samples to share the same index variable. To implement more complicated indexing operations (such as modeling similarities between indices), it is better to transform the discrete index into a continuous one.

Return type:

Tensor

Returns:

Tensor of shape (N,), representing the index for all N samples in the dataset.

class cebra.data.datasets.DatasetxCEBRA(neural, device='cpu', **labels)#

Bases: HasDevice

Dataset class for xCEBRA models.

This class handles neural data and associated labels for xCEBRA models, providing functionality for data loading and batch preparation.

neural#

Neural data as a torch.Tensor or numpy array

labels#

Labels associated with the data

offset#

Offset for the dataset

Parameters:
  • neural (Union[Tensor, ndarray]) – Neural data as a torch.Tensor or numpy array

  • device – Device to store the data on (default: “cpu”)

  • **labels – Additional keyword arguments for labels associated with the data

property input_dimension: int#

Get the input dimension of the neural data.

Return type:

int

Returns:

The number of features in the neural data

configure_for(model)#

Configure the dataset offset for the provided model.

Call this function before indexing the dataset. This sets the offset attribute of the dataset.

Parameters:

model (Model) – The model to configure the dataset for.

expand_index(index)#

Expand indices based on the configured offset.

Parameters:

index (Tensor) – A one-dimensional tensor of type long containing indices to select from the dataset.

Return type:

Tensor

Returns:

An expanded index of shape (len(index), len(self.offset)) where the elements will be expanded_index[i,j] = index[i] + j - self.offset.left for all j in range(0, len(self.offset)).

Note

Requires the offset to be set.

load_batch_supervised(index, labels_supervised)#

Load a batch for supervised learning.

Parameters:
  • index (Batch) – Batch indices for reference data

  • labels_supervised – Labels to load for supervised learning

Return type:

Tensor

Returns:

Batch containing reference data and corresponding labels

load_batch_contrastive(index)#

Load a batch for contrastive learning.

Parameters:

index (BatchIndex) – BatchIndex containing reference, positive and negative indices

Return type:

Batch

Returns:

Batch containing reference, positive and negative samples

Single Session Dataloaders#

Datasets and loaders for single session training.

All dataloaders should be implemented using dataclasses for handling arguments and configuration values and subclass base.Loader.

class cebra.data.single_session.SingleSessionDataset(device='cpu', download=False, data_url=None, data_checksum=None, location=None, file_name=None)#

Bases: Dataset

A dataset with data from a single experimental session.

A single experimental session contains a single data matrix with shape num_timesteps x dimension, potentially paired with auxiliary information of shape num_timesteps x aux_dimension.

Loaders for single session datasets can be found in cebra.data.single_session.

load_batch(index)#

Return the data at the specified index location.

Return type:

Batch

class cebra.data.single_session.DiscreteDataLoader(dataset=None, num_steps=None, batch_size=None, prior='empirical')#

Bases: Loader

Supervised contrastive learning on fully discrete dataset.

Reference and negative samples will be drawn from a uniform prior distribution. Depending on the prior attribute, the prior will be uniform over time steps (setting empirical), or be adjusted such that each discrete value in the dataset is uniformly distributed (setting uniform).

The positive samples will have a matching discrete auxiliary variable as the reference samples.

Sampling is implemented in the cebra.distributions.discrete.DiscreteUniform and cebra.distributions.discrete.DiscreteEmpirical distributions.

Parameters:
  • dataset (Optional[Dataset]) – A dataset instance specifying a __getitem__ function.

  • num_steps (Optional[int]) – The total number of batches when iterating over the dataloader.

  • batch_size (Optional[int]) – The total batch size.

  • prior (str) –

    Re-sampling mode for the discrete index.

    The option empirical uses label frequencies as they appear in the dataset. The option uniform re-samples the dataset and adjusts the frequencies of less common class labels. For balanced datasets, it is typically more accurate to stick to the empirical option.
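
The difference between the two prior settings can be sketched as follows. This is an illustration under assumptions, not the code of the DiscreteUniform or DiscreteEmpirical distributions: the sample_prior helper is hypothetical, and the uniform mode is modeled as first choosing a label uniformly and then a time step carrying that label.

```python
import numpy as np


def sample_prior(discrete_index, num_samples, prior, rng):
    """Draw time-step indices under the 'empirical' or 'uniform' prior."""
    discrete_index = np.asarray(discrete_index)
    if prior == "empirical":
        # Uniform over time steps: labels appear with their dataset frequency.
        return rng.integers(0, len(discrete_index), size=num_samples)
    if prior == "uniform":
        # Uniform over label values: rare labels are sampled more often.
        labels = rng.choice(np.unique(discrete_index), size=num_samples)
        return np.array([
            rng.choice(np.flatnonzero(discrete_index == label))
            for label in labels
        ])
    raise ValueError(f"unknown prior: {prior!r}")


rng = np.random.default_rng(0)
labels = np.array([0] * 90 + [1] * 10)  # imbalanced toy dataset
empirical = sample_prior(labels, 2000, "empirical", rng)
uniform = sample_prior(labels, 2000, "uniform", rng)
rare_fraction_empirical = (labels[empirical] == 1).mean()
rare_fraction_uniform = (labels[uniform] == 1).mean()
```

With the toy labels above, the rare label makes up roughly a tenth of the empirical draws and roughly half of the uniform draws.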

property index#

The (discrete) dataset index.

get_indices(num_samples)#

Samples indices for reference, positive and negative examples.

The reference samples will be sampled from the empirical or uniform prior distribution (if uniform, the discrete index values will be used to perform histogram normalization).

The positive samples will be sampled such that their discrete index value corresponds to the respective value of the reference samples.

The negative samples will be sampled from the same distribution as the reference examples.

Parameters:

num_samples (int) – The number of samples (batch size) of the returned cebra.data.datatypes.BatchIndex.

Return type:

BatchIndex

Returns:

Indices for reference, positive and negative samples.
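
The three sampling rules of this method can be sketched for the empirical prior. This is a minimal illustration, not the loader's implementation: the sample_discrete_batch helper is hypothetical and returns a plain tuple instead of a BatchIndex.

```python
import numpy as np


def sample_discrete_batch(discrete_index, num_samples, rng):
    """Sample reference, positive and negative indices for a discrete index."""
    discrete_index = np.asarray(discrete_index)
    num_timesteps = len(discrete_index)
    # Reference and negative samples come from the same prior distribution.
    reference = rng.integers(0, num_timesteps, size=num_samples)
    negative = rng.integers(0, num_timesteps, size=num_samples)
    # Each positive sample shares the discrete value of its reference sample.
    positive = np.array([
        rng.choice(np.flatnonzero(discrete_index == discrete_index[i]))
        for i in reference
    ])
    return reference, positive, negative


rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 2, 2, 2, 0])
reference, positive, negative = sample_discrete_batch(labels, 16, rng)
```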

class cebra.data.single_session.ContinuousDataLoader(dataset=None, num_steps=None, batch_size=None, conditional='time_delta', time_offset=10, delta=0.1)#

Bases: Loader

Contrastive learning conditioned on a continuous behavior variable.

Reference and negative samples will be drawn from a uniform prior distribution across all time-steps. The positive sample will be distributed around the reference example according to the mode selected with the conditional argument (time_delta or time, described below).

Parameters:
  • dataset (Optional[Dataset]) – A dataset instance specifying a __getitem__ function.

  • num_steps (Optional[int]) – The total number of batches when iterating over the dataloader.

  • batch_size (Optional[int]) – The total batch size.

  • conditional (str) – Information on how the positive samples should be acquired. Setting to time_delta computes the differences between adjacent samples in the dataset, and uses reference + diff as the query for collecting the positive pair. Setting to time will use adjacent pairs of samples only and become equivalent to time contrastive learning.

  • time_offset (int) – None

  • delta (float) – None
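
The time_delta mode described for the conditional parameter can be sketched in NumPy. This is an illustration under several assumptions, not the loader's implementation: the helper name is hypothetical, differences are taken between samples that lie time_offset steps apart, and the positive sample is chosen as the nearest neighbour of the query.

```python
import numpy as np


def sample_positive_time_delta(continuous_index, reference_idx, time_offset, rng):
    """Pick positives whose index value is closest to reference + diff."""
    index = np.asarray(continuous_index, dtype=float)
    reference_idx = np.asarray(reference_idx)
    # Assumption: differences between samples time_offset steps apart.
    deltas = index[time_offset:] - index[:-time_offset]
    diff = deltas[rng.integers(0, len(deltas), size=len(reference_idx))]
    # The query is the reference value shifted by a sampled difference.
    query = index[reference_idx] + diff
    # Assumption: brute-force nearest neighbour search over all samples.
    distances = ((query[:, None, :] - index[None, :, :]) ** 2).sum(axis=-1)
    return distances.argmin(axis=1)


rng = np.random.default_rng(0)
position = np.arange(50, dtype=float)[:, None]  # toy index of shape (N, 1)
positive = sample_positive_time_delta(position, [0, 5, 20], time_offset=10, rng=rng)
```

For this linearly increasing toy index every sampled difference equals 10, so each positive sample lies exactly 10 steps after its reference sample.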

get_indices(num_samples)#

Samples indices for reference, positive and negative examples.

The reference and negative samples will be sampled uniformly from all available time steps.

The positive samples will be sampled conditional on the reference samples according to the specified conditional distribution.

Parameters:

num_samples (int) – The number of samples (batch size) of the returned cebra.data.datatypes.BatchIndex.

Return type:

BatchIndex

Returns:

Indices for reference, positive and negative samples.

class cebra.data.single_session.MixedDataLoader(dataset=None, num_steps=None, batch_size=None, conditional='time_delta', time_offset=10)#

Bases: Loader

Mixed discrete-continuous data loader.

This data loader combines the functionality of DiscreteDataLoader and ContinuousDataLoader for datasets that provide both continuous and discrete variables.

Sampling can be configured in different modes:

  1. Positive pairs always share their discrete variable.

  2. Positive pairs are drawn only based on their conditional, not discrete variable.

get_indices(num_samples)#

Samples indices for reference, positive and negative examples.

The reference and negative samples will be sampled uniformly from all available time steps.

The positive samples will either share the discrete value of the reference samples and then be sampled as in the ContinuousDataLoader, or be sampled based on the conditional variable only.

Parameters:

num_samples (int) – The number of samples (batch size) of the returned cebra.data.datatypes.BatchIndex.

Return type:

BatchIndex

Returns:

Indices for reference, positive and negative samples.

class cebra.data.single_session.HybridDataLoader(dataset=None, num_steps=None, batch_size=None, conditional='time_delta', time_distribution='time', time_offset=10, delta=0.1)#

Bases: Loader

Contrastive learning using both time and behavior information.

The dataloader combines the two training modes implemented in ContinuousDataLoader, merging time and behavior information into a joint embedding.

Parameters:
  • dataset (Optional[Dataset]) – A dataset instance specifying a __getitem__ function.

  • num_steps (Optional[int]) – The total number of batches when iterating over the dataloader.

  • batch_size (Optional[int]) – The total batch size.

  • conditional (str) – None

  • time_distribution (str) – None

  • time_offset (int) – None

  • delta (float) – None

property index#

The (continuous) dataset index.

get_indices(num_samples)#

Samples indices for reference, positive and negative examples.

The reference and negative samples will be sampled uniformly from all available time steps, and a total of 2*num_samples will be returned for both.

For the positive samples, num_samples are sampled according to the behavior conditional distribution, and another num_samples are sampled according to the time contrastive distribution. The indices for the positive samples are concatenated across the first dimension.

Parameters:

num_samples (int) – The number of samples (batch size) of the returned cebra.data.datatypes.BatchIndex.

Return type:

BatchIndex

Returns:

Indices for reference, positive and negative samples.
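
The bookkeeping of batch sizes in this method can be sketched as follows. This is a shape illustration only: the hybrid_batch_index helper and the two sampler callables are hypothetical placeholders, and pairing the first half of the reference samples with the behavior-based positives is an assumption.

```python
import numpy as np


def hybrid_batch_index(num_timesteps, num_samples,
                       sample_behavior_positive, sample_time_positive, rng):
    """Return reference, positive and negative indices of length 2 * num_samples."""
    reference = rng.integers(0, num_timesteps, size=2 * num_samples)
    negative = rng.integers(0, num_timesteps, size=2 * num_samples)
    # One half of the positives follows the behavior conditional, the
    # other half the time contrastive distribution.
    behavior_positive = sample_behavior_positive(reference[:num_samples])
    time_positive = sample_time_positive(reference[num_samples:])
    # Both halves are concatenated along the first dimension.
    positive = np.concatenate([behavior_positive, time_positive], axis=0)
    return reference, positive, negative


rng = np.random.default_rng(0)
reference, positive, negative = hybrid_batch_index(
    num_timesteps=100,
    num_samples=4,
    sample_behavior_positive=lambda ref: ref,                  # placeholder
    sample_time_positive=lambda ref: np.minimum(ref + 1, 99),  # placeholder
    rng=rng,
)
```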

class cebra.data.single_session.FullDataLoader(dataset=None, num_steps=None, batch_size=None, conditional='time_delta', time_offset=10, delta=0.1)#

Bases: ContinuousDataLoader

Data loader for batch gradient descent, loading the whole dataset at once.

get_indices(num_samples=None)#

Samples indices for reference, positive and negative examples.

The reference indices are all available (valid, according to the model’s offset) indices in the dataset, in order.

The negative indices are a permutation of the reference indices.

The positive indices are sampled as before from the conditional distribution, given the reference samples.

Return type:

BatchIndex

Returns:

Indices for reference, positive and negative samples. The batch size will be equal to the dataset size minus the length of the model index.
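
The relation between reference and negative indices can be sketched in NumPy. This is a minimal illustration, not the loader's code: the helper is hypothetical, and the set of valid indices is passed in rather than derived from a model offset.

```python
import numpy as np


def full_batch_reference_and_negative(valid_indices, rng):
    """Use all valid indices as references and a permutation as negatives."""
    # Reference indices: every valid index, in order.
    reference = np.asarray(valid_indices)
    # Negative indices: the same set of indices in random order.
    negative = rng.permutation(reference)
    return reference, negative


rng = np.random.default_rng(0)
valid_indices = np.arange(2, 8)  # hypothetical valid range for a toy dataset
reference, negative = full_batch_reference_and_negative(valid_indices, rng)
```

The positive indices are not shown here; they would be drawn from the conditional distribution given the reference samples, as for ContinuousDataLoader.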

Multi Session Dataloaders#

Datasets and loaders for multi-session training.

class cebra.data.multi_session.MultiSessionDataset(device='cpu', download=False, data_url=None, data_checksum=None, location=None, file_name=None)#

Bases: Dataset

A dataset spanning multiple recording sessions.

Multi session datasets share the same dimensionality across the index, but can have differing feature dimensions (e.g. number of neurons) between different sessions.

Multi-session datasets where the number of neurons is constant across sessions should utilize the normal Dataset class with a MultisessionLoader for better efficiency when sampling.

offset#

The offset determines the shape of the data obtained with the __getitem__ and base.Dataset.expand_index() methods.

abstract property num_sessions#

The number of sessions in the dataset.

abstract get_input_dimension(session_index)#

The feature dimension of a given session.

get_session(session_id)#

Returns a dataset instance representing a given session.

Return type:

SingleSessionDataset

load_batch(index)#

Return the data at the specified index location.

Return type:

List[Batch]

configure_for(model)#

Configure the dataset offset for the provided model.

Call this function before indexing the dataset. This sets the offset attribute of the dataset.

Parameters:

model – The model to configure the dataset for.

class cebra.data.multi_session.MultiSessionLoader(dataset=None, num_steps=None, batch_size=None, time_offset=10)#

Bases: Loader

Dataloader for multi-session datasets.

The loader will enforce a uniform distribution across the sessions. Note that if samples within different sessions share the same feature dimension, it is better to use a cebra.data.single_session.MixedDataLoader.

get_indices(num_samples)#

Sample and return the specified number of indices.

The elements of the returned BatchIndex will be used to index the dataset of this data loader.

Parameters:

num_samples (int) – The size of each of the reference, positive and negative samples.

Return type:

List[BatchIndex]

Returns:

Batch indices for the reference, positive and negative samples.

class cebra.data.multi_session.ContinuousMultiSessionDataLoader(dataset=None, num_steps=None, batch_size=None, time_offset=10, conditional='time_delta')#

Bases: MultiSessionLoader

Contrastive learning conditioned on a continuous behavior variable.

class cebra.data.multi_session.DiscreteMultiSessionDataLoader(dataset=None, num_steps=None, batch_size=None, time_offset=10)#

Bases: MultiSessionLoader

Contrastive learning conditioned on a discrete behavior variable.

class cebra.data.multi_session.MixedMultiSessionDataLoader(dataset=None, num_steps=None, batch_size=None, time_offset=10)#

Bases: MultiSessionLoader

Datatypes#

class cebra.data.datatypes.Batch(reference, positive, negative, index=None, index_reversed=None)#

Bases: object

A batch of reference, positive, negative samples and an optional index.

reference#

The reference samples, typically sampled from the prior distribution

positive#

The positive samples, typically sampled from the positive conditional distribution depending on the reference samples

negative#

The negative samples, typically sampled from the negative conditional distribution depending on (but often independent of) the reference samples

index#

TODO(stes), see docs for multisession training distributions

index_reversed#

TODO(stes), see docs for multisession training distributions

to(device)#

Move all batch elements to the specified device (e.g., the GPU).

class cebra.data.datatypes.BatchIndex(reference, positive, negative, index, index_reversed)#

Bases: tuple

index#

Alias for field number 3

index_reversed#

Alias for field number 4

negative#

Alias for field number 2

positive#

Alias for field number 1

reference#

Alias for field number 0
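
The field numbering above is the standard behavior of a named tuple. The following sketch uses a local stand-in (BatchIndexSketch, a hypothetical name) rather than importing the CEBRA class, and shows that fields can be accessed by name, by position, or by unpacking.

```python
from collections import namedtuple

# Local stand-in with the same field order as documented above.
BatchIndexSketch = namedtuple(
    "BatchIndexSketch",
    ["reference", "positive", "negative", "index", "index_reversed"],
)

batch_index = BatchIndexSketch(
    reference=[0, 1],
    positive=[2, 3],
    negative=[4, 5],
    index=None,
    index_reversed=None,
)

# Access by name and by position refer to the same objects.
same_reference = batch_index.reference is batch_index[0]
same_negative = batch_index.negative is batch_index[2]

# Tuple unpacking follows the field order.
reference, positive, negative, index, index_reversed = batch_index
```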

class cebra.data.datatypes.Offset(*offset)#

Bases: object

Number of samples left and right from an index.

When indexing datasets, some operations require input of multiple neighbouring samples across the time dimension. Offset represents a simple pair of left and right offsets with respect to an index. It provides the range of samples to consider around the current index for sampling across the time dimension.

The provided offsets are positive integers. The left offset corresponds to the number of samples to consider before the index, while the right offset is strictly positive and counts the index itself plus the number of samples to consider after the index.

Note

By convention, the right bound should always be strictly positive because it includes the current index itself. Hence, for instance, to only consider the current element, you will have to provide (0,1) at Offset initialization.

property left_slice#

Slice from array start to left border.

property right_slice#

Slice from right border to array end.

property valid_slice#

Slice between the two borders.
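
One plausible reading of the three slice properties is sketched below. OffsetSketch is a hypothetical stand-in written for illustration, not the CEBRA class, and the exact slice bounds are assumptions derived from the one-line descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetSketch:
    """Hypothetical stand-in for a (left, right) offset pair."""

    left: int
    right: int

    def __post_init__(self):
        # The right offset includes the current index, so it is at least 1.
        if self.left < 0 or self.right < 1:
            raise ValueError("left must be >= 0 and right must be >= 1")

    def __len__(self):
        return self.left + self.right

    @property
    def left_slice(self):
        # From the array start to the left border.
        return slice(0, self.left)

    @property
    def right_slice(self):
        # From the right border to the array end.
        return slice(-self.right, None)

    @property
    def valid_slice(self):
        # Between the two borders.
        return slice(self.left, -self.right)


samples = list(range(10))
offset = OffsetSketch(left=2, right=3)
left_part = samples[offset.left_slice]
right_part = samples[offset.right_slice]
valid_part = samples[offset.valid_slice]
```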