Data Loading#

A simple API for loading various data formats used with CEBRA.

Availability of different data formats depends on the installed dependencies. If a dependency is not installed, attempting to load a file of that format will raise an error with further installation instructions.

Currently available formats:

  • HDF5 via h5py

  • Pickle files via pickle

  • Joblib files via joblib

  • Various dataframe formats via pandas

  • Matlab files via scipy.io.loadmat

  • DeepLabCut (single animal) files via deeplabcut

cebra.data.load.load(file, key=None, columns=None)#

Load a dataset from the given file.

The following file types are supported:
  • Numpy files: npy, npz;

  • HDF5 files: h5, hdf, hdf5, including h5 generated through DLC;

  • PyTorch files: pt, p;

  • csv files;

  • Excel files: xls, xlsx, xlsm;

  • Joblib files: jl;

  • Pickle files: p, pkl;

  • MAT-files: mat.

The assumptions made about your data are as follows:
  • it contains at least one data structure (e.g. a numpy array, a torch.Tensor, etc.);

  • the data structure can be stored directly in the file or inside a collection (e.g. a dictionary);

  • if the file contains a collection, the user can provide a key to refer to the data value they want to access;

  • if no key is provided, the first data structure found upon iteration of the collection will be loaded;

  • if a key is provided, it needs to correspond to an existing item of the collection;

  • if a key is provided, the data value accessed needs to be a data structure;

  • the function loads data for only one data structure, even if the file contains more. The function can be called again with the corresponding key to get the other ones.
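The key-selection rules above can be sketched with a plain NumPy archive. This is an illustration of the documented behavior using NumPy directly, not CEBRA's internal code; the file name is made up for the example:

```python
import numpy as np

# Save two arrays into one .npz file: a dictionary-like collection
# holding more than one data structure.
np.savez("example_data.npz", neural=np.zeros((100, 3)), trial=np.ones((100, 4)))

with np.load("example_data.npz") as archive:
    # With a key: the key must exist in the collection and point to a
    # data structure (here, a numpy array).
    neural = archive["neural"]  # shape (100, 3)

    # Without a key, the first data structure found while iterating
    # over the collection is used instead:
    first_key = next(iter(archive.keys()))
    first_value = archive[first_key]
```

Only one data structure is returned per call; a second call with `key="trial"` would retrieve the other array.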

Parameters:
  • file (Union[str, Path]) – The path to the given file to load, in a supported format.

  • key (Union[str, int, None]) – The key referencing the data of interest in the file, if the file has a dictionary-like structure.

  • columns (Optional[list]) – The part of the data to keep in the output 2D-array. For now, it corresponds to the columns of a DataFrame to keep if the data selected is a DataFrame.
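For DataFrame-backed files, the `columns` argument behaves like plain pandas column selection followed by conversion to a 2D array. A simplified sketch with hypothetical column names (DeepLabCut files use multi-index columns, so the real selection is more involved):

```python
import numpy as np
import pandas as pd

# A small DataFrame standing in for the contents of a loaded file.
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["x", "y", "z"])

# Keeping only some columns yields the 2D output array, analogous to
# passing columns=["x", "z"] to the loader for a DataFrame-backed file.
subset = df[["x", "z"]].to_numpy()  # shape (4, 2)
```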

Return type:

numpy.ndarray

Returns:

The loaded data.

Example

>>> import cebra
>>> import cebra.helper as cebra_helper
>>> import numpy as np
>>> # Create the files to load the data from
>>> # Create a .npz file
>>> X = np.random.normal(0,1,(100,3))
>>> y = np.random.normal(0,1,(100,4))
>>> np.savez("data", neural = X, trial = y)
>>> # Create a .h5 file
>>> url = "https://github.com/DeepLabCut/DeepLabCut/blob/main/examples/Reaching-Mackenzie-2018-08-30/labeled-data/reachingvideo1/CollectedData_Mackenzie.h5?raw=true"
>>> dlc_file = cebra_helper.download_file_from_url(url) # an .h5 example file
>>> # Load data
>>> X = cebra.load_data(file="data.npz", key="neural")
>>> y_trial_id = cebra.load_data(file="data.npz", key="trial")
>>> y_behavior = cebra.load_data(file=dlc_file, columns=["Hand", "Tongue"])
cebra.data.load.read_hdf(filename, key=None)#

Read HDF5 file using pandas, with fallback to h5py if pandas fails.

Parameters:
  • filename – Path to HDF5 file

  • key – Optional key to read from HDF5 file. If None, tries “df_with_missing” then falls back to first available key.

Returns:

The loaded data

Return type:

pandas.DataFrame

Raises:

RuntimeError – If both pandas and h5py fail to load the file
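The pandas-then-h5py behavior described above is a try/except fallback chain. A generic sketch of that pattern follows; the helper and the stand-in loader functions are hypothetical and do not reflect CEBRA's actual implementation:

```python
def load_with_fallback(filename, loaders):
    """Try each (name, loader) pair in turn; raise RuntimeError if all fail."""
    errors = []
    for name, loader in loaders:
        try:
            return loader(filename)
        except Exception as exc:
            # Record the failure and move on to the next loader.
            errors.append(f"{name}: {exc}")
    raise RuntimeError(
        f"All loaders failed for {filename!r}:\n" + "\n".join(errors)
    )

# Plain callables standing in for pandas.read_hdf and an h5py-based reader:
def failing_loader(filename):
    raise ValueError("pandas could not read this file")

def working_loader(filename):
    return {"df_with_missing": [1, 2, 3]}

result = load_with_fallback(
    "file.h5", [("pandas", failing_loader), ("h5py", working_loader)]
)
```

Collecting each loader's error before raising keeps the final `RuntimeError` informative about why every backend failed.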