Pre-defined datasets#

Pre-defined demo and benchmark datasets.

This package contains actual implementations of datasets. If you want to add a commonly used, public dataset to CEBRA, this is the right package to do it. Datasets here can be loaded, e.g., for testing, reproducing reference results and benchmarking. When contributing to this package, you should ensure that the data is publicly available under a suitable license.

This module is a registry and currently contains the options [‘demo-discrete’, ‘demo-continuous’, ‘demo-mixed’, ‘demo-discrete-multisession’, ‘demo-continuous-multisession’, ‘demo-continuous-unified’, ‘allen-movie-one-ca-VISp-10-train-10-111’, ‘allen-movie-one-ca-VISp-10-train-10-222’, ‘allen-movie-one-ca-VISp-10-train-10-333’, ‘allen-movie-one-ca-VISp-10-train-10-444’].

To retrieve a list of options, call:

>>> print(cebra.datasets.get_options())
['demo-discrete', 'demo-continuous', 'demo-mixed', ...]

To obtain an initialized instance, call cebra.datasets.init, defined in cebra.registry.add_helper_functions(). The first parameter is the name of the dataset to use, which must be one of the available options listed above. It is followed by any required positional arguments specific to the chosen option.

You can register additional options by defining classes and registering them under a name. To do so, add a decorator on top of the class definition: @cebra.datasets.register("my-cebra-datasets").

Later, initialize your class in the same way as the pre-defined options, using cebra.datasets.init with the dataset name set to my-cebra-datasets.

Note that these customized options will not be automatically added to this docstring.
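The registration workflow described above can be illustrated with a minimal, self-contained sketch of a name-based registry. This is not CEBRA's implementation: the `_REGISTRY` dict and the `MyDataset` class are hypothetical, and the three functions only mirror the names `register`, `init` and `get_options` used in this section.

```python
# Toy registry illustrating the register / init / get_options pattern.
# Assumption: a simplified stand-in, not the code in cebra.registry.
_REGISTRY = {}


def register(name):
    """Return a class decorator that stores the class under ``name``."""

    def decorator(cls):
        if name in _REGISTRY:
            raise ValueError(f"Option {name!r} is already registered.")
        _REGISTRY[name] = cls
        return cls

    return decorator


def get_options():
    """List all registered names in registration order."""
    return list(_REGISTRY)


def init(name, *args, **kwargs):
    """Instantiate the class registered under ``name``."""
    if name not in _REGISTRY:
        raise KeyError(f"Unknown option {name!r}; choose from {get_options()}.")
    return _REGISTRY[name](*args, **kwargs)


@register("my-cebra-datasets")
class MyDataset:  # hypothetical dataset class
    def __init__(self, num_samples=10):
        self.num_samples = num_samples


# The name comes first, followed by the arguments of the registered class.
dataset = init("my-cebra-datasets", num_samples=5)
```

Keeping the lookup behind a string name is what lets options be selected from configuration files or the command line without importing the concrete class.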

Synthetic datasets#

class cebra.datasets.gaussian_mixture.ContinuousGaussianMixtureDataset(noise='poisson')#

Bases: SingleSessionDataset

A dataset of synthetically generated continuous labels and the corresponding 2D latents and 100D noisy observations.

Parameters: noise (str) – The noise distribution to apply.

property continuous_index#

The continuous index, if available.

The continuous index along with a similarity metric is used for drawing positive and/or negative samples.

Returns: Tensor of shape (N,d), representing the index for all N samples in the dataset.
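As a rough illustration of how a continuous index and a similarity metric can be combined to pick a positive sample, the sketch below selects the sample whose index is closest to that of a reference sample. The function `closest_positive`, the Euclidean metric and the toy index are assumptions made for this example; the samplers used in practice may differ.

```python
import math


def closest_positive(index, ref):
    """Return the position of the sample whose index is nearest to index[ref].

    Assumption: similarity is measured by Euclidean distance between rows.
    """
    best, best_dist = None, math.inf
    for i, row in enumerate(index):
        if i == ref:
            continue  # a sample is never its own positive
        dist = math.dist(row, index[ref])
        if dist < best_dist:
            best, best_dist = i, dist
    return best


# Toy continuous index with N=4 samples and d=2 dimensions.
continuous_index = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
positive = closest_positive(continuous_index, ref=2)
```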

Rat Hippocampus dataset#

Rat hippocampus dataset

References

Grosmark, A.D., and Buzsáki, G. (2016). Diversity in neural firing dynamics supports both rigid and learned hippocampal sequences. Science 351, 1440–1443.
Chen, Z., Grosmark, A.D., Penagos, H., and Wilson, M.A. (2016). Uncovering representations of sleep-associated hippocampal ensemble spike activity. Sci. Rep. 6, 32193.
Grosmark, A.D., Long, J., and Buzsáki, G. (2016). Recordings from hippocampal area CA1, PRE, during and POST novel spatial learning. CRCNS.org. http://dx.doi.org/10.6080/K0862DC5

class cebra.datasets.hippocampus.SingleRatDataset(name='achilles', root='data', download=True)#

Bases: SingleSessionDataset

A single rat hippocampus tetrode recording while the rat navigates on a linear track.

Neural data is spike counts binned into 25 ms time windows, and the continuous behavior label is the position and running direction (left, right) of the rat. The behavior label is structured as a 3D array consisting of position, right, and left.

Parameters: name – The name of the rat to use. Choose among ‘achilles’, ‘buddy’, ‘cicero’ and ‘gatsby’.
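The data layout described above can be sketched as follows: spike times of one neuron are counted in 25 ms windows, and each time step gets a label row of position, right and left. Both helper functions are hypothetical, and encoding the running direction as a pair of 0/1 indicators is an assumption for illustration.

```python
def bin_spike_times(spike_times_ms, duration_ms, bin_ms=25):
    """Count the spikes of one neuron in consecutive windows of ``bin_ms``."""
    num_bins = duration_ms // bin_ms
    counts = [0] * num_bins
    for t in spike_times_ms:
        b = int(t // bin_ms)
        if 0 <= b < num_bins:
            counts[b] += 1
    return counts


def behavior_label(position, running_right):
    """Build one label row ordered as (position, right, left).

    Assumption: the direction is stored as two 0/1 indicator values.
    """
    right = 1.0 if running_right else 0.0
    return [position, right, 1.0 - right]


# Six spikes in a 100 ms recording give four 25 ms bins.
counts = bin_spike_times([3, 10, 24, 25, 60, 99], duration_ms=100)
label = behavior_label(position=0.8, running_right=True)
```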

property continuous_index#

The continuous index, if available.

The continuous index along with a similarity metric is used for drawing positive and/or negative samples.

Returns: Tensor of shape (N,d), representing the index for all N samples in the dataset.

decode(x_train, y_train, x_test, y_test)#

kNN decoding function.

Perform kNN decoding for n_neighbors = 1, 4, 9, 16, 25 with the given train set and test set.

Parameters:

x_train – The train set data
y_train – The train set label
x_test – The test set data
y_test – The test set label
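A minimal sketch of kNN decoding over several neighborhood sizes is given below, using only the standard library. It assumes a scalar label (such as position), Euclidean distance, and the median absolute error as the score; the names `knn_predict` and `knn_decode` are hypothetical, and the metric returned by the actual `decode` method is not specified here.

```python
import statistics


def knn_predict(x_train, y_train, x, k):
    """Average the labels of the k training points closest to x."""

    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(x_train[i], x))

    nearest = sorted(range(len(x_train)), key=sq_dist)[:k]
    return statistics.fmean([y_train[i] for i in nearest])


def knn_decode(x_train, y_train, x_test, y_test, neighbors=(1, 4, 9, 16, 25)):
    """Return the median absolute test error for each neighborhood size."""
    errors = {}
    for k in neighbors:
        predictions = [knn_predict(x_train, y_train, x, k) for x in x_test]
        errors[k] = statistics.median(
            [abs(p - y) for p, y in zip(predictions, y_test)])
    return errors


# Toy data: a 1D embedding whose label equals its coordinate.
x_train = [[i / 30] for i in range(30)]
y_train = [i / 30 for i in range(30)]
x_test = [[0.5], [0.25]]
y_test = [0.5, 0.25]
errors = knn_decode(x_train, y_train, x_test, y_test)
```

Comparing the scores across neighborhood sizes is one way to choose k on a validation set before reporting the test error.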

class cebra.datasets.hippocampus.SingleRatTrialSplitDataset(name='achilles', split_no=0, split=None, root='data')#

Bases: SingleRatDataset

A single rat hippocampus tetrode recording while the rat navigates on a linear track with 3-fold splits.

Neural data is spike counts binned into 25 ms time windows, and the behavior is the position and running direction (left, right) of the rat. The behavior label is structured as a 3D array consisting of position, right, and left. The neural and behavior recordings are parsed into trials (a round trip from one end of the track) and the trials are split into train, valid and test sets with k=3 nested cross-validation.

Parameters:

name – The name of a rat to use. Choose among ‘achilles’, ‘buddy’, ‘cicero’ and ‘gatsby’.
split_no – The index of the fold to use in the k-fold split. Choose among 0, 1, 2.
split – The split to use. Choose among ‘train’, ‘valid’, ‘test’, ‘all’, and ‘wo_test’ (all trials except test split).
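To make the roles of `split_no` and `split` concrete, here is a hedged sketch of a 3-fold trial split. The interleaved assignment of trials to folds and the rotation of the train, valid and test roles are assumptions chosen for illustration; the function `split_trials` is hypothetical and does not reproduce the assignment used by the dataset class.

```python
def split_trials(trial_ids, split_no, split):
    """Select trial ids for one fold of a 3-fold split (hypothetical scheme)."""
    if split_no not in (0, 1, 2):
        raise ValueError("split_no must be 0, 1 or 2.")
    # Interleave the trials into three folds and rotate their roles.
    folds = [trial_ids[i::3] for i in range(3)]
    test = folds[split_no]
    valid = folds[(split_no + 1) % 3]
    train = folds[(split_no + 2) % 3]
    selection = {
        "train": train,
        "valid": valid,
        "test": test,
        "all": list(trial_ids),
        "wo_test": sorted(train + valid),  # every trial except the test fold
    }
    if split not in selection:
        raise ValueError(f"Unknown split {split!r}.")
    return selection[split]


trials = list(range(9))
test_trials = split_trials(trials, split_no=0, split="test")
without_test = split_trials(trials, split_no=0, split="wo_test")
```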

class cebra.datasets.hippocampus.SingleRatCorruptDataset(name, seed, root='data')#

Bases: SingleRatDataset

A single rat hippocampus tetrode recording while the rat navigates on a linear track with a shuffled behavior label.

Neural data is spike counts binned into 25 ms time windows, and the behavior is the position and running direction (left, right) of the rat. The behavior label is structured as a 3D array consisting of position, right, and left, and it is shuffled in random order.

Parameters:

name – The name of the rat to use. Choose among ‘achilles’, ‘buddy’, ‘cicero’ and ‘gatsby’.
seed – The random seed to set the shuffling.
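A shuffled-label control of this kind can be sketched with a seeded random generator, so that the same seed always yields the same permutation. The function `shuffle_labels` is hypothetical, and shuffling whole label rows across time steps is an assumption about how the corruption is applied.

```python
import random


def shuffle_labels(labels, seed):
    """Return a copy of ``labels`` with rows permuted reproducibly."""
    rng = random.Random(seed)  # local generator, global state is untouched
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


# Each row is (position, right, left).
labels = [[0.1, 1.0, 0.0], [0.2, 1.0, 0.0], [0.3, 0.0, 1.0], [0.4, 0.0, 1.0]]
corrupted = shuffle_labels(labels, seed=42)
```

Because only the row order changes, the marginal distribution of the labels is preserved while their alignment with the neural data is destroyed.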

class cebra.datasets.hippocampus.MultipleRatsTrialSplitDataset(split_no=0, split=None)#

Bases: DatasetCollection

Hippocampus tetrode recordings from 4 rats navigating on a linear track, with 3-fold splits.

Neural and behavior recordings of 4 rats. For each rat, neural data is spike counts binned into 25 ms time windows, and the behavior is the position and running direction (left, right) of the rat. The behavior label is structured as a 3D array consisting of position, right, and left. The neural and behavior recordings of each rat are parsed into trials (a round trip from one end of the track) and the trials are split into train, valid and test sets with k=3 nested cross-validation.

Parameters:

split_no – The index of the fold to use in the k-fold split. Choose among 0, 1, and 2.
split – The split to use. Choose among ‘train’, ‘valid’, ‘test’, ‘all’, and ‘wo_test’ (all trials except test split).

Monkey S1 Dataset#

Ephys neural and behavior data used for the monkey reaching experiment.

References

Chowdhury, Raeed H., Joshua I. Glaser, and Lee E. Miller. “Area 2 of primary somatosensory cortex encodes kinematics of the whole arm.” Elife 9 (2020).
Chowdhury, Raeed; Miller, Lee (2022) Area2 Bump: macaque somatosensory area 2 spiking activity during reaching with perturbations (Version 0.220113.0359) [Data set]. DANDI archive
Pei, Felix, et al. “Neural Latents Benchmark’21: Evaluating latent variable models of neural population activity.” arXiv preprint arXiv:2109.04463 (2021).

class cebra.datasets.monkey_reaching.Area2BumpDataset(path='data/monkey_reaching_preload_smth_40/', session='active', download=True)#

Bases: SingleSessionDataset

Base dataclass to generate monkey reaching datasets.

Ephys and behavior recordings from -100 ms to 500 ms relative to movement onset, in 1 ms bins. The neural recording is smoothed with a Gaussian kernel with a 40 ms std. The behavior labels can include trial types, target directions and the x,y hand positions. After initialization of the dataset, the split method can split the data into ‘train’, ‘valid’ and ‘test’ splits.

Parameters:

path (str) – The path to the directory where the preloaded data is.
session (str) – The trial type. Choose among ‘active’, ‘passive’, ‘all’, ‘active-passive’.
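The smoothing step described for this dataset can be sketched as a convolution of 1 ms spike bins with a normalized Gaussian kernel of 40 ms standard deviation. Truncating the kernel at three standard deviations and zero-padding at the window edges are assumptions of this example, and the helper names are hypothetical.

```python
import math


def gaussian_kernel(std_bins, truncate=3.0):
    """Build a normalized Gaussian kernel spanning +/- truncate * std."""
    radius = int(truncate * std_bins)
    weights = [math.exp(-0.5 * (i / std_bins) ** 2)
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(spikes, std_ms=40, bin_ms=1):
    """Convolve a binned spike train with the kernel (zero-padded edges)."""
    kernel = gaussian_kernel(std_ms / bin_ms)
    radius = len(kernel) // 2
    smoothed = []
    for t in range(len(spikes)):
        acc = 0.0
        for j, w in enumerate(kernel):
            s = t + j - radius
            if 0 <= s < len(spikes):
                acc += w * spikes[s]
        smoothed.append(acc)
    return smoothed


# 600 bins of 1 ms cover the window from -100 ms to 500 ms.
spikes = [0] * 600
spikes[300] = 1  # a single spike in the middle of the window
rates = smooth(spikes)
```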

split(split)#

Split the dataset.

The train trials are the same as those defined in the Neural Latents Benchmark (NLB) dataset. Half of the valid trials defined in NLBDataset are used as the valid set, and the other half as the test set.

Parameters: split – The split to use. It can be one of ‘all’, ‘train’, ‘valid’, ‘test’.

property discrete_index#

The discrete index, if available.

The discrete index can be used for making an embedding invariant to a variable or for restricting positive samples to share the same index variable. To implement more complicated indexing operations (such as modeling similarities between indices), it is better to transform a discrete index into a continuous index.

Returns: Tensor of shape (N,), representing the index for all N samples in the dataset.
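Restricting positive samples to share the same discrete index can be sketched as follows: for a reference sample, candidates are limited to other samples with an identical index value, and one of them is drawn at random. The function `sample_positive` and the toy index are hypothetical and only illustrate the idea.

```python
import random


def sample_positive(discrete_index, ref, rng):
    """Draw a positive sample that shares the discrete index of ``ref``."""
    candidates = [i for i, value in enumerate(discrete_index)
                  if value == discrete_index[ref] and i != ref]
    if not candidates:
        raise ValueError("No other sample shares this index value.")
    return rng.choice(candidates)


# Toy discrete index of shape (N,), e.g. a trial type per sample.
discrete_index = [0, 1, 0, 1, 1]
rng = random.Random(0)
positive = sample_positive(discrete_index, ref=0, rng=rng)
```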

property continuous_index#

The continuous index, if available.

The continuous index along with a similarity metric is used for drawing positive and/or negative samples.

Returns: Tensor of shape (N,d), representing the index for all N samples in the dataset.

class cebra.datasets.monkey_reaching.Area2BumpShuffledDataset(path='data/monkey_reaching_preload_smth_40/', session='active', download=True)#

Bases: Area2BumpDataset

Base dataclass to generate shuffled monkey reaching datasets.

Ephys and behavior recordings from -100 ms to 500 ms relative to movement onset, in 1 ms bins. The neural recording is smoothed with a Gaussian kernel with a 40 ms std. The shuffled behavior labels can include trial types, target directions and the x,y hand positions.

After initialization of the dataset, the split method can split the data into ‘train’, ‘valid’ and ‘test’ splits.

Parameters:

path (str) – The path to the directory where the preloaded data is.
session (str) – The trial type. Choose among ‘active’, ‘passive’, ‘all’, ‘active-passive’.

Allen Neuropixel and 2P datasets#

Datasets from the Allen Database

TODO(stes): Add additional context and information about the datasets.