Data Loading#
Data loaders use distributions and indices to make samples available for training.
This package contains all helper functions and classes for defining and loading datasets
in the various usage modes of CEBRA, e.g. single- and multi-session datasets.
It is non-specific to a particular dataset (see cebra.datasets
for actual dataset
implementations). However, the base classes for all datasets are defined here, as well as helper
functions to interact with datasets.
CEBRA supports different dataset types out-of-the box:
cebra.data.single_session.SingleSessionDataset is the abstract base class for a single session dataset. Single session datasets have the same feature dimension across the samples (e.g., neural data) and all context variables (e.g. behavior, stimuli, etc.).
cebra.data.multi_session.MultiSessionDataset is the abstract base class for a multi session dataset. Multi session datasets consist of multiple single session datasets. Crucially, the dimensionality of the auxiliary variables needs to match across the sessions, which allows alignment of multiple sessions. The dimensionality of the signal variable can vary arbitrarily between sessions.
Note that the actual implementation of datasets (e.g. for benchmarking) is done in the cebra.datasets
package.
Base classes#
Base classes for datasets and loaders.
- class cebra.data.base.Dataset(device='cpu', download=False, data_url=None, data_checksum=None, location=None, file_name=None)#
Abstract base class for implementing a dataset.
The class attributes provide information about the shape of the data when indexing this dataset.
- input_dimension#
The input dimension of the signal in this dataset. Models applied on this dataset should match this dimensionality.
- offset#
The offset determines the shape of the data obtained with the __getitem__ and expand_index() methods.
- property continuous_index: torch.Tensor#
The continuous index, if available.
The continuous index along with a similarity metric is used for drawing positive and/or negative samples.
- Return type:
- Returns:
Tensor of shape (N, d), representing the index for all N samples in the dataset.
- property discrete_index: torch.Tensor#
The discrete index, if available.
The discrete index can be used for making an embedding invariant to a variable, or to restrict positive samples to share the same index variable. To implement more complicated indexing operations (such as modeling similarities between indices), it is better to transform a discrete index into a continuous index.
- Return type:
- Returns:
Tensor of shape (N,), representing the index for all N samples in the dataset.
- expand_index(index)#
- Parameters:
index (Tensor) – A one-dimensional tensor of type long containing indices to select from the dataset.
- Return type:
- Returns:
An expanded index of shape (len(index), len(self.offset)) where the elements will be expanded_index[i,j] = index[i] + j - self.offset.left for all j in range(0, len(self.offset)).
Note
Requires the
offset
to be set.
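For illustration, a minimal sketch using the TensorDataset and Offset classes documented further below (the offset value (2, 3) and the array sizes are arbitrary choices); following the formula above, the expanded index has shape (len(index), len(self.offset)):
>>> import torch
>>> import cebra.data
>>> from cebra.data.datatypes import Offset
>>> dataset = cebra.data.TensorDataset(torch.randn(100, 30),
...                                    continuous=torch.randn(100, 2),
...                                    offset=Offset(2, 3))
>>> index = torch.tensor([5, 10], dtype=torch.long)
>>> dataset.expand_index(index).shape  # (len(index), len(offset))
torch.Size([2, 5])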
- expand_index_in_trial(index, trial_ids, trial_borders)#
When the neural/behavioral data is organized in discrete trials (e.g., the Monkey Reaching Dataset), the slice should be defined within the trial. trial_ids has the same length as self.index and indicates the trial each index belongs to. trial_borders has the same length as self.index and indicates the borders of each trial.
- abstract load_batch(index)#
Return the data at the specified index location.
TODO: adapt signature to support Batches and List[Batch]
- Return type:
- class cebra.data.base.Loader(dataset=None, num_steps=None, batch_size=None)#
Base dataloader class.
- Parameters:
- Yields:
Batches of the specified size from the given dataset object.
Note
The __iter__ method is non-deterministic, unless explicit seeding is implemented in derived classes. It is recommended to avoid global seeding in numpy and torch, and instead locally instantiate a Generator object for drawing samples.
- abstract get_indices(num_samples)#
Sample and return the specified number of indices.
The elements of the returned BatchIndex will be used to index the dataset of this data loader.
- Parameters:
num_samples (int) – The size of each of the reference, positive and negative samples.
- Returns:
batch indices for the reference, positive and negative samples.
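As a sketch of the seeding recommendation above, a derived loader can draw its indices from a locally instantiated torch.Generator instead of relying on global seeds. The loader below is hypothetical and only illustrates the pattern; real implementations are provided by the single- and multi-session loaders documented further down:
>>> import torch
>>> import cebra.data
>>> class UniformLoader(cebra.data.base.Loader):
...     """Hypothetical loader drawing uniform indices with a local generator."""
...     def get_indices(self, num_samples):
...         if not hasattr(self, "_generator"):
...             # local generator instead of global seeding
...             self._generator = torch.Generator().manual_seed(42)
...         def draw():
...             return torch.randint(0, len(self.dataset), (num_samples,),
...                                  generator=self._generator)
...         return cebra.data.datatypes.BatchIndex(
...             reference=draw(), positive=draw(), negative=draw(),
...             index=None, index_reversed=None)
>>> # loader = UniformLoader(dataset=dataset, num_steps=10, batch_size=32)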
General File Loading#
- cebra.load_data(file, key=None, columns=None)#
Load a dataset from the given file.
- The following file types are supported:
Numpy files: npy, npz;
HDF5 files: h5, hdf, hdf5, including h5 generated through DLC;
PyTorch files: pt, p;
csv files;
Excel files: xls, xlsx, xlsm;
Joblib files: jl;
Pickle files: p, pkl;
MAT-files: mat.
- The assumptions on your data are as follows:
it contains at least one data structure (e.g. a numpy array, a torch.Tensor, etc.);
it can be directly in the form of a collection (e.g. a dictionary);
if the file contains a collection, the user can provide a key to refer to the data value they want to access;
if no key is provided, the first data structure found upon iteration of the collection will be loaded;
if a key is provided, it needs to correspond to an existing item of the collection;
if a key is provided, the data value accessed needs to be a data structure;
the function loads data for only one data structure, even if the file contains more. The function can be called again with the corresponding key to get the other ones.
- Parameters:
file (Union[str, Path]) – The path to the given file to load, in a supported format.
key (Union[str, int, None]) – The key referencing the data of interest in the file, if the file has a dictionary-like structure.
columns (Optional[list]) – The part of the data to keep in the output 2D-array. For now, it corresponds to the columns of a DataFrame to keep if the data selected is a DataFrame.
- Return type:
ndarray[tuple[int, ...], dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]
- Returns:
The loaded data.
Example
>>> import cebra
>>> import cebra.helper as cebra_helper
>>> import numpy as np
>>> # Create the files to load the data from
>>> # Create a .npz file
>>> X = np.random.normal(0,1,(100,3))
>>> y = np.random.normal(0,1,(100,4))
>>> np.savez("data", neural = X, trial = y)
>>> # Create a .h5 file
>>> url = "https://github.com/DeepLabCut/DeepLabCut/blob/main/examples/Reaching-Mackenzie-2018-08-30/labeled-data/reachingvideo1/CollectedData_Mackenzie.h5?raw=true"
>>> dlc_file = cebra_helper.download_file_from_url(url) # an .h5 example file
>>> # Load data
>>> X = cebra.load_data(file="data.npz", key="neural")
>>> y_trial_id = cebra.load_data(file="data.npz", key="trial")
>>> y_behavior = cebra.load_data(file=dlc_file, columns=["Hand", "Tongue"])
DeepLabCut File Loading#
- cebra.load_deeplabcut(filepath, keypoints=None, pcutoff=0.6)#
Load DLC data from h5 files.
- Parameters:
filepath (Union[Path, str]) – Path to the .h5 file containing DLC output data.
keypoints (Optional[list]) – List of keypoints to keep in the output numpy.array.
pcutoff (float) – Drop-out threshold. If the likelihood value on the estimated positions of a sample is smaller than that threshold, then the sample is set to nan. Then, the nan values are interpolated.
- Return type:
ndarray[tuple[int, ...], dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]
- Returns:
A 2D array (n_samples x n_features) containing the data (x and y) generated by DLC for each keypoint of interest. Note that the likelihood is dropped.
Example
>>> import cebra
>>> url = ANNOTATED_DLC_URL = "https://github.com/DeepLabCut/DeepLabCut/blob/main/examples/Reaching-Mackenzie-2018-08-30/labeled-data/reachingvideo1/CollectedData_Mackenzie.h5?raw=true"
>>> file = cebra.helper.download_file_from_url(url) # an .h5 example file
>>> dlc_data = cebra.load_deeplabcut(file, keypoints=["Hand", "Joystick1"], pcutoff=0.6)
Pre-defined Datasets#
Pre-defined datasets.
- class cebra.data.datasets.TensorDataset(neural, continuous=None, discrete=None, offset=Offset(left=0, right=1, length=1), device='cpu')#
Bases:
SingleSessionDataset
Discrete and/or continuously indexed dataset based on torch/numpy arrays.
If dealing with datasets sufficiently small to fit into a numpy.array() or torch.Tensor, this dataset is sufficient; the sampling auxiliary variable should be specified with a dataloader. Based on whether continuous and/or discrete auxiliary variables are provided, this class can be used with the discrete, continuous and/or mixed data loader classes.
- Parameters:
neural (Union[Tensor, ndarray[tuple[int, ...], dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]]) – Array of dtype float or float Tensor of shape (N, D), containing neural activity over time.
continuous (Union[Tensor, ndarray[tuple[int, ...], dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]], None]) – Array of dtype float or float Tensor of shape (N, d), containing the continuous behavior variables over the same time dimension.
discrete (Union[Tensor, ndarray[tuple[int, ...], dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]], None]) – Array of dtype int64 or integer Tensor of shape (N, d), containing the discrete behavior variables over the same time dimension.
Example
>>> import cebra.data
>>> import torch
>>> data = torch.randn((100, 30))
>>> index1 = torch.randn((100, 2))
>>> index2 = torch.randint(0,5,(100, ))
>>> dataset = cebra.data.datasets.TensorDataset(data, continuous=index1, discrete=index2)
- property continuous_index#
The continuous index, if available.
The continuous index along with a similarity metric is used for drawing positive and/or negative samples.
- Returns:
Tensor of shape (N, d), representing the index for all N samples in the dataset.
- property discrete_index#
The discrete index, if available.
The discrete index can be used for making an embedding invariant to a variable, or to restrict positive samples to share the same index variable. To implement more complicated indexing operations (such as modeling similarities between indices), it is better to transform a discrete index into a continuous index.
- Returns:
Tensor of shape (N,), representing the index for all N samples in the dataset.
- class cebra.data.datasets.DatasetCollection(*datasets)#
Bases:
MultiSessionDataset
Multi session dataset made up of a list of datasets.
- Parameters:
*datasets – Collection of datasets to add to the collection. The order will be maintained for indexing.
Example
>>> import cebra.data
>>> import torch
>>> session1 = torch.randn((100, 30))
>>> session2 = torch.randn((100, 50))
>>> index1 = torch.randn((100, 4))
>>> index2 = torch.randn((100, 4)) # same index dim as index1
>>> dataset = cebra.data.DatasetCollection(
...     cebra.data.TensorDataset(session1, continuous=index1),
...     cebra.data.TensorDataset(session2, continuous=index2))
- get_input_dimension(session_id)#
Get the feature dimension of the required session.
- Parameters:
session_id (int) – The session ID, an integer between 0 and num_sessions.
- Return type:
- Returns:
A single session input dimension for the requested session id.
- get_session(session_id)#
Get the dataset for the specified session.
- Parameters:
session_id (int) – The session ID, an integer between 0 and num_sessions.
- Return type:
- Returns:
A single session dataset for the requested session id.
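Continuing the example above, a short sketch of how these two accessors behave for the collection built from session1 (30 features) and session2 (50 features):
>>> dataset.get_input_dimension(1)   # feature dimension of the second session
50
>>> dataset.get_session(0).continuous_index.shape
torch.Size([100, 4])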
- property continuous_index: torch.Tensor#
The continuous index, if available.
The continuous index along with a similarity metric is used for drawing positive and/or negative samples.
- Return type:
- Returns:
Tensor of shape (N, d), representing the index for all N samples in the dataset.
- property discrete_index: torch.Tensor#
The discrete index, if available.
The discrete index can be used for making an embedding invariant to a variable, or to restrict positive samples to share the same index variable. To implement more complicated indexing operations (such as modeling similarities between indices), it is better to transform a discrete index into a continuous index.
- Return type:
- Returns:
Tensor of shape (N,), representing the index for all N samples in the dataset.
- class cebra.data.datasets.DatasetxCEBRA(neural, device='cpu', **labels)#
Bases:
HasDevice
Dataset class for xCEBRA models.
This class handles neural data and associated labels for xCEBRA models, providing functionality for data loading and batch preparation.
- neural#
Neural data as a torch.Tensor or numpy array
- labels#
Labels associated with the data
- offset#
Offset for the dataset
- Parameters:
- property input_dimension: int#
Get the input dimension of the neural data.
- Return type:
- Returns:
The number of features in the neural data
- configure_for(model)#
Configure the dataset offset for the provided model.
Call this function before indexing the dataset. This sets the offset attribute of the dataset.
- Parameters:
model (Model) – The model to configure the dataset for.
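A minimal sketch of this call, assuming a dataset created from a (100, 30) neural array and a model instantiated through cebra.models.init (the model name and dimensions below are placeholders, not requirements of this class):
>>> import torch
>>> import cebra.data.datasets
>>> import cebra.models
>>> dataset = cebra.data.datasets.DatasetxCEBRA(torch.randn(100, 30))
>>> model = cebra.models.init("offset10-model", num_neurons=30, num_units=16, num_output=8)
>>> dataset.configure_for(model)  # sets dataset.offset to the model's offset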
- expand_index(index)#
Expand indices based on the configured offset.
- Parameters:
index (Tensor) – A one-dimensional tensor of type long containing indices to select from the dataset.
- Return type:
- Returns:
An expanded index of shape (len(index), len(self.offset)) where the elements will be expanded_index[i,j] = index[i] + j - self.offset.left for all j in range(0, len(self.offset)).
Note
Requires the
offset
to be set.
- load_batch_supervised(index, labels_supervised)#
Load a batch for supervised learning.
- load_batch_contrastive(index)#
Load a batch for contrastive learning.
- Parameters:
index (BatchIndex) – BatchIndex containing reference, positive and negative indices.
- Return type:
- Returns:
Batch containing reference, positive and negative samples
Single Session Dataloaders#
Datasets and loaders for single session training.
All dataloaders should be implemented using dataclasses for handling arguments and configuration values, and subclass base.Loader.
- class cebra.data.single_session.SingleSessionDataset(device='cpu', download=False, data_url=None, data_checksum=None, location=None, file_name=None)#
Bases:
Dataset
A dataset with data from a single experimental session.
A single experimental session contains a single data matrix with shape num_timesteps x dimension, potentially paired with auxiliary information of shape num_timesteps x aux_dimension.
Loaders for single session datasets can be found in cebra.data.single_session.
- class cebra.data.single_session.DiscreteDataLoader(dataset=None, num_steps=None, batch_size=None, prior='empirical')#
Bases:
Loader
Supervised contrastive learning on fully discrete dataset.
Reference and negative samples will be drawn from a uniform prior distribution. Depending on the prior attribute, the prior will be uniform over time-steps (setting empirical), or be adjusted such that each discrete value in the dataset is uniformly distributed (setting uniform).
The positive samples will have a matching discrete auxiliary variable as the reference samples.
Sampling is implemented in the cebra.distributions.discrete.DiscreteUniform and cebra.distributions.discrete.DiscreteEmpirical distributions.
- Parameters:
dataset (Optional[Dataset]) – A dataset instance specifying a __getitem__ function.
num_steps (Optional[int]) – The total number of batches when iterating over the dataloader.
prior (str) – Re-sampling mode for the discrete index.
The option empirical uses label frequencies as they appear in the dataset. The option uniform re-samples the dataset and adjusts the frequencies of less common class labels. For balanced datasets, it is typically more accurate to stick to the empirical option.
- property index#
The (discrete) dataset index.
- get_indices(num_samples)#
Samples indices for reference, positive and negative examples.
The reference samples will be sampled from the empirical or uniform prior distribution (if uniform, the discrete index values will be used to perform histogram normalization).
The positive samples will be sampled such that their discrete index value corresponds to the respective value of the reference samples.
The negative samples will be sampled from the same distribution as the reference examples.
- Parameters:
num_samples (int) – The number of samples (batch size) of the returned cebra.data.datatypes.BatchIndex.
- Return type:
- Returns:
Indices for reference, positive and negative samples.
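For instance, a minimal sketch pairing this loader with a purely discretely indexed TensorDataset (the array sizes and the uniform prior are arbitrary choices):
>>> import torch
>>> import cebra.data
>>> dataset = cebra.data.TensorDataset(torch.randn(1000, 30),
...                                    discrete=torch.randint(0, 5, (1000,)))
>>> loader = cebra.data.DiscreteDataLoader(dataset=dataset, num_steps=10,
...                                        batch_size=64, prior="uniform")
>>> batch_index = loader.get_indices(num_samples=64)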
- class cebra.data.single_session.ContinuousDataLoader(dataset=None, num_steps=None, batch_size=None, conditional='time_delta', time_offset=10, delta=0.1)#
Bases:
Loader
Contrastive learning conditioned on a continuous behavior variable.
Reference and negative samples will be drawn from a uniform prior distribution across all time-steps. The positive sample will be distributed around the reference example using either
time information (time): In this case, a cebra.distributions.continuous.TimeContrastive distribution is used for sampling. Positive pairs will have a fixed time_offset from the reference samples' time steps.
auxiliary variables, using the empirical distribution of how behavior varies across time_offset timesteps (time_delta). Sampling for this setting is implemented in cebra.distributions.continuous.TimedeltaDistribution.
alternatively, the distribution can be selected to be a Gaussian distribution parametrized by a fixed delta around the reference sample, using the implementation in cebra.distributions.continuous.DeltaNormalDistribution.
- Parameters:
dataset (Optional[Dataset]) – A dataset instance specifying a __getitem__ function.
num_steps (Optional[int]) – The total number of batches when iterating over the dataloader.
conditional (str) – Information on how the positive samples should be acquired. Setting to time_delta computes the differences between adjacent samples in the dataset, and uses reference + diff as the query for collecting the positive pair. Setting to time will use adjacent pairs of samples only and become equivalent to time contrastive learning.
time_offset (int) – None
delta (float) – None
- get_indices(num_samples)#
Samples indices for reference, positive and negative examples.
The reference and negative samples will be sampled uniformly from all available time steps.
The positive samples will be sampled conditional on the reference samples according to the specified conditional distribution.
- Parameters:
num_samples (int) – The number of samples (batch size) of the returned cebra.data.datatypes.BatchIndex.
- Return type:
- Returns:
Indices for reference, positive and negative samples.
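Analogously, a minimal sketch pairing this loader with a continuously indexed TensorDataset (the array sizes and time_offset are arbitrary choices):
>>> import torch
>>> import cebra.data
>>> dataset = cebra.data.TensorDataset(torch.randn(1000, 30),
...                                    continuous=torch.randn(1000, 3))
>>> loader = cebra.data.ContinuousDataLoader(dataset=dataset, num_steps=10,
...                                          batch_size=64, conditional="time_delta",
...                                          time_offset=10)
>>> batch_index = loader.get_indices(num_samples=64)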
- class cebra.data.single_session.MixedDataLoader(dataset=None, num_steps=None, batch_size=None, conditional='time_delta', time_offset=10)#
Bases:
Loader
Mixed discrete-continuous data loader.
This data loader combines the functionality of DiscreteDataLoader and ContinuousDataLoader for datasets that provide both continuous and discrete variables.
Sampling can be configured in different modes:
Positive pairs always share their discrete variable.
Positive pairs are drawn only based on their conditional, not discrete variable.
- get_indices(num_samples)#
Samples indices for reference, positive and negative examples.
The reference and negative samples will be sampled uniformly from all available time steps.
The positive distribution will either share the discrete value of the reference samples, and then be sampled as in the ContinuousDataLoader, or just be sampled based on the conditional variable.
- Parameters:
num_samples (int) – The number of samples (batch size) of the returned cebra.data.datatypes.BatchIndex.
- Return type:
- Returns:
Indices for reference, positive and negative samples.
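A minimal sketch with a dataset that carries both index types (sizes are arbitrary choices):
>>> import torch
>>> import cebra.data
>>> dataset = cebra.data.TensorDataset(torch.randn(1000, 30),
...                                    continuous=torch.randn(1000, 3),
...                                    discrete=torch.randint(0, 5, (1000,)))
>>> loader = cebra.data.MixedDataLoader(dataset=dataset, num_steps=10, batch_size=64)
>>> batch_index = loader.get_indices(num_samples=64)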
- class cebra.data.single_session.HybridDataLoader(dataset=None, num_steps=None, batch_size=None, conditional='time_delta', time_distribution='time', time_offset=10, delta=0.1)#
Bases:
Loader
Contrastive learning using both time and behavior information.
The dataloader combines two training modes implemented in ContinuousDataLoader and combines time and behavior information into a joint embedding.
- Parameters:
- property index#
The (continuous) dataset index.
- get_indices(num_samples)#
Samples indices for reference, positive and negative examples.
The reference and negative samples will be sampled uniformly from all available time steps, and a total of 2*num_samples will be returned for both.
For the positive samples, num_samples are sampled according to the behavior conditional distribution, and another num_samples are sampled according to the time contrastive distribution. The indices for the positive samples are concatenated across the first dimension.
- Parameters:
num_samples (int) – The number of samples (batch size) of the returned cebra.data.datatypes.BatchIndex.
- Return type:
- Returns:
Indices for reference, positive and negative samples.
- class cebra.data.single_session.FullDataLoader(dataset=None, num_steps=None, batch_size=None, conditional='time_delta', time_offset=10, delta=0.1)#
Bases:
ContinuousDataLoader
Data loader for batch gradient descent, loading the whole dataset at once.
- get_indices(num_samples=None)#
Samples indices for reference, positive and negative examples.
The reference indices are all available (valid, according to the model’s offset) indices in the dataset, in order.
The negative indices are a permutation of the reference indices.
The positive indices are sampled as before from the conditional distribution, given the reference samples.
- Return type:
- Returns:
Indices for reference, positive and negative samples. The batch size will be equal to the dataset size minus the length of the model's offset.
Multi Session Dataloaders#
Datasets and loaders for multi-session training.
- class cebra.data.multi_session.MultiSessionDataset(device='cpu', download=False, data_url=None, data_checksum=None, location=None, file_name=None)#
Bases:
Dataset
A dataset spanning multiple recording sessions.
Multi session datasets share the same dimensionality across the index, but can have differing feature dimensions (e.g. number of neurons) between different sessions.
Multi-session datasets where the number of neurons is constant across sessions should utilize the normal Dataset class with a MultisessionLoader for better efficiency when sampling.
- offset#
The offset determines the shape of the data obtained with the __getitem__ and base.Dataset.expand_index() methods.
- abstract property num_sessions#
The number of sessions in the dataset.
- abstract get_input_dimension(session_index)#
The feature dimension of a given session.
- get_session(session_id)#
Returns a dataset instance representing a given session.
- Return type:
- class cebra.data.multi_session.MultiSessionLoader(dataset=None, num_steps=None, batch_size=None, time_offset=10)#
Bases:
Loader
Dataloader for multi-session datasets.
The loader will enforce a uniform distribution across the sessions. Note that if samples within different sessions share the same feature dimension, it is better to use a cebra.data.single_session.MixedDataLoader.
- get_indices(num_samples)#
Sample and return the specified number of indices.
The elements of the returned BatchIndex will be used to index the dataset of this data loader.
- Parameters:
num_samples (int) – The size of each of the reference, positive and negative samples.
- Return type:
- Returns:
batch indices for the reference, positive and negative samples.
- class cebra.data.multi_session.ContinuousMultiSessionDataLoader(dataset=None, num_steps=None, batch_size=None, time_offset=10, conditional='time_delta')#
Bases:
MultiSessionLoader
Contrastive learning conditioned on a continuous behavior variable.
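A minimal sketch combining this loader with a DatasetCollection as documented above (two sessions with differing feature dimensions but a shared 4-dimensional continuous index; all sizes are arbitrary choices):
>>> import torch
>>> import cebra.data
>>> multi_dataset = cebra.data.DatasetCollection(
...     cebra.data.TensorDataset(torch.randn(100, 30), continuous=torch.randn(100, 4)),
...     cebra.data.TensorDataset(torch.randn(100, 50), continuous=torch.randn(100, 4)))
>>> loader = cebra.data.ContinuousMultiSessionDataLoader(dataset=multi_dataset,
...                                                      num_steps=10, batch_size=64,
...                                                      conditional="time_delta")
>>> batch_index = loader.get_indices(num_samples=64)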
- class cebra.data.multi_session.DiscreteMultiSessionDataLoader(dataset=None, num_steps=None, batch_size=None, time_offset=10)#
Bases:
MultiSessionLoader
Contrastive learning conditioned on a discrete behavior variable.
- class cebra.data.multi_session.MixedMultiSessionDataLoader(dataset=None, num_steps=None, batch_size=None, time_offset=10)#
Bases:
MultiSessionLoader
Datatypes#
- class cebra.data.datatypes.Batch(reference, positive, negative, index=None, index_reversed=None)#
Bases:
object
A batch of reference, positive, negative samples and an optional index.
- reference#
The reference samples, typically sampled from the prior distribution
- positive#
The positive samples, typically sampled from the positive conditional distribution depending on the reference samples
- negative#
The negative samples, typically sampled from the negative conditional distribution, which may depend on (but is often independent of) the reference samples
- index#
TODO(stes), see docs for multisession training distributions
- index_reversed#
TODO(stes), see docs for multisession training distributions
- to(device)#
Move all batch elements to the specified device.
- class cebra.data.datatypes.BatchIndex(reference, positive, negative, index, index_reversed)#
Bases:
tuple
- index#
Alias for field number 3
- index_reversed#
Alias for field number 4
- negative#
Alias for field number 2
- positive#
Alias for field number 1
- reference#
Alias for field number 0
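Since BatchIndex is a named tuple, it can be constructed directly from index tensors; a minimal sketch (the last two fields are only used for multisession training and can be left as None here):
>>> import torch
>>> from cebra.data.datatypes import BatchIndex
>>> batch_index = BatchIndex(reference=torch.tensor([0, 1]),
...                          positive=torch.tensor([5, 6]),
...                          negative=torch.tensor([42, 7]),
...                          index=None, index_reversed=None)
>>> batch_index.reference
tensor([0, 1])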
- class cebra.data.datatypes.Offset(*offset)#
Bases:
object
Number of samples left and right from an index.
When indexing datasets, some operations require input of multiple neighbouring samples across the time dimension.
Offset represents a simple pair of left and right offsets with respect to an index. It provides the range of samples to consider around the current index for sampling across the time dimension.
The provided offsets are positive int values, so that the left offset corresponds to the number of samples to consider previous to the index, while the right offset is strictly positive and corresponds to the index itself and the number of samples to consider following the index.
Note
By convention, the right bound should always be strictly positive as it includes the current index itself. Hence, for instance, to only consider the current element, you will have to provide (0, 1) at Offset initialization.
- property left_slice#
Slice from array start to left border.
- property right_slice#
Slice from right border to array end.
- property valid_slice#
Slice between the two borders.
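A minimal sketch of the conventions described above (the slice properties are then derived from the left and right bounds):
>>> from cebra.data.datatypes import Offset
>>> offset = Offset(2, 3)  # 2 samples before the index, 3 starting at the index itself
>>> offset.left, offset.right
(2, 3)
>>> len(offset)  # total window length
5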