Data Loading#
A simple API for loading various data formats used with CEBRA.
Availability of the different data formats depends on the installed dependencies. If a dependency is not installed, attempting to load a file in that format raises an error with further installation instructions.
Currently available formats:
- HDF5 via h5py
- Pickle files via pickle
- Joblib files via joblib
- Various dataframe formats via pandas
- Matlab files via scipy.io.loadmat
- DeepLabCut (single animal) files via deeplabcut
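Since each format depends on an optional dependency, availability can be checked before attempting a load. A minimal sketch using only the standard library (`format_available` is a hypothetical helper, not part of cebra):

```python
import importlib.util

def format_available(module_name):
    # find_spec returns None when the module is not installed,
    # without actually importing it.
    return importlib.util.find_spec(module_name) is not None

# Module names match the dependency list above, e.g.:
h5_ok = format_available("h5py")
mat_ok = format_available("scipy")
```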
- cebra.data.load.load(file, key=None, columns=None)#
Load a dataset from the given file.
- The following file types are supported:
Numpy files: npy, npz;
HDF5 files: h5, hdf, hdf5, including h5 generated through DLC;
PyTorch files: pt, p;
csv files;
Excel files: xls, xlsx, xlsm;
Joblib files: jl;
Pickle files: p, pkl;
MAT-files: mat.
- The assumptions on your data are the following:
it contains at least one data structure (e.g. a numpy array, a torch.Tensor, etc.);
the data structure can be stored directly, or inside a collection (e.g. a dictionary);
if the file contains a collection, the user can provide a key to refer to the data value they want to access;
if no key is provided, the first data structure found upon iteration of the collection will be loaded;
if a key is provided, it needs to correspond to an existing item of the collection;
if a key is provided, the data value accessed needs to be a data structure;
the function loads a single data structure, even if the file contains more; it can be called again with the corresponding key to retrieve the others.
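The key-selection rules above can be sketched with a plain dictionary standing in for the file's collection (`select_from_collection` is a hypothetical helper illustrating the behavior, not part of cebra):

```python
import numpy as np

def select_from_collection(collection, key=None):
    # No key: return the first data structure found while iterating.
    if key is None:
        for value in collection.values():
            if isinstance(value, np.ndarray):
                return value
        raise AttributeError("No data structure found in the collection.")
    # A provided key must exist in the collection...
    if key not in collection:
        raise KeyError(f"Key {key!r} not found in the collection.")
    value = collection[key]
    # ...and must map to a data structure.
    if not isinstance(value, np.ndarray):
        raise AttributeError(f"Value under {key!r} is not a data structure.")
    return value

data = {"neural": np.zeros((100, 3)), "trial": np.zeros((100, 4))}
first = select_from_collection(data)           # no key: first array found
trial = select_from_collection(data, "trial")  # explicit key
```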
- Parameters:
file (Union[str, Path]) – The path to the given file to load, in a supported format.
key (Union[str, int, None]) – The key referencing the data of interest in the file, if the file has a dictionary-like structure.
columns (Optional[list]) – The part of the data to keep in the output 2D array. For now, it corresponds to the columns of a DataFrame to keep if the selected data is a DataFrame.
- Return type:
ndarray[tuple[int, ...], dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]
- Returns:
The loaded data.
Example
>>> import cebra
>>> import cebra.helper as cebra_helper
>>> import numpy as np
>>> # Create the files to load the data from
>>> # Create a .npz file
>>> X = np.random.normal(0,1,(100,3))
>>> y = np.random.normal(0,1,(100,4))
>>> np.savez("data", neural = X, trial = y)
>>> # Create a .h5 file
>>> url = "https://github.com/DeepLabCut/DeepLabCut/blob/main/examples/Reaching-Mackenzie-2018-08-30/labeled-data/reachingvideo1/CollectedData_Mackenzie.h5?raw=true"
>>> dlc_file = cebra_helper.download_file_from_url(url) # an .h5 example file
>>> # Load data
>>> X = cebra.load_data(file="data.npz", key="neural")
>>> y_trial_id = cebra.load_data(file="data.npz", key="trial")
>>> y_behavior = cebra.load_data(file=dlc_file, columns=["Hand", "Tongue"])
- cebra.data.load.read_hdf(filename, key=None)#
Read HDF5 file using pandas, with fallback to h5py if pandas fails.
- Parameters:
filename – Path to HDF5 file
key – Optional key to read from HDF5 file. If None, tries “df_with_missing” then falls back to first available key.
- Returns:
The loaded data
- Raises:
RuntimeError – If both pandas and h5py fail to load the file