dance.data
- class dance.data.base.BaseData(data, train_size=None, val_size=0, test_size=-1, split_index_range_dict=None, full_split_name=None)[source]
Base data object.
The dance data object is a wrapper of the AnnData object, with several utility methods to help retrieve data in specific splits and in specific formats (see get_split_idx() and get_feature()). The AnnData object is saved in the attribute data and can be accessed directly.
Warning
Since the underlying data object is a reference to the input AnnData object, be extra cautious *NOT* to initialize two different dance data objects using the same AnnData object! If you are unsure, we recommend always initializing the dance data object using a copy of the input AnnData object, e.g.,
>>> adata = anndata.AnnData(...)
>>> ddata = dance.data.Data(adata.copy())
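The aliasing hazard described in the warning can be illustrated without AnnData at all. The sketch below uses two hypothetical stand-in classes, ToyAnnData and ToyWrapper (neither is part of dance or anndata), and assumes only what the text states: the wrapper keeps a reference to the input object rather than a copy.

```python
import copy


class ToyAnnData:
    """Hypothetical stand-in for AnnData; it only carries an ``uns`` dict."""

    def __init__(self):
        self.uns = {}


class ToyWrapper:
    """Hypothetical stand-in for a dance data object."""

    def __init__(self, data):
        # Keep a reference to the input object, not a copy.
        self.data = data


adata = ToyAnnData()

# Two wrappers built from the same object share all state.
shared_a = ToyWrapper(adata)
shared_b = ToyWrapper(adata)
shared_a.data.uns["note"] = "set through shared_a"
leaked = "note" in shared_b.data.uns  # shared_b sees the change

# Wrapping a copy keeps the new wrapper isolated from the original.
isolated = ToyWrapper(copy.deepcopy(adata))
isolated.data.uns["note"] = "changed in the copy only"
original_untouched = adata.uns["note"] == "set through shared_a"
```

With a real AnnData object, adata.copy() plays the role that copy.deepcopy plays here.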
Note
You can directly access some main properties of AnnData (or MuData, depending on which type of data you passed in), such as X, obs, var, etc.
- Parameters:
data (Union[AnnData, MuData]) – Cell data.
train_size (Optional[int]) – Number of cells to be used for training. If not specified, no splits will be generated.
val_size (int) – Number of cells to be used for validation. If set to -1, use what’s left from training and testing.
test_size (int) – Number of cells to be used for testing. If set to -1, use what’s left from training and validation.
split_index_range_dict (Dict[str, Tuple[int, int]] | None) –
full_split_name (str | None) –
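To make the meaning of -1 in val_size and test_size concrete, here is a minimal sketch of how the three sizes could be resolved into index lists. The function resolve_split_sizes is hypothetical and not part of dance. It assumes that splits are contiguous blocks laid out in the order train, val, test, and that at most one size may be -1; the library's actual logic may differ.

```python
from typing import Dict, List, Optional


def resolve_split_sizes(
    n_cells: int,
    train_size: Optional[int] = None,
    val_size: int = 0,
    test_size: int = -1,
) -> Dict[str, List[int]]:
    """Turn split sizes into contiguous index lists (illustrative only)."""
    if train_size is None:
        return {}  # no splits are generated

    sizes = {"train": train_size, "val": val_size, "test": test_size}
    if any(size < -1 for size in sizes.values()):
        raise ValueError("Sizes must be non-negative, or -1 for 'the rest'.")

    # Assumption: only one split may ask for "whatever is left".
    open_ended = [name for name, size in sizes.items() if size == -1]
    if len(open_ended) > 1:
        raise ValueError("At most one size can be -1.")

    fixed_total = sum(size for size in sizes.values() if size != -1)
    if fixed_total > n_cells:
        raise ValueError("Requested splits exceed the number of cells.")
    if open_ended:
        sizes[open_ended[0]] = n_cells - fixed_total

    # Lay the splits out back to back; empty splits are skipped.
    splits: Dict[str, List[int]] = {}
    start = 0
    for name, size in sizes.items():
        if size > 0:
            splits[name] = list(range(start, start + size))
        start += size
    return splits
```

For example, resolve_split_sizes(10, train_size=6) keeps the defaults val_size=0 and test_size=-1, so the first six cells go to 'train' and the remaining four to 'test'.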
- append(data, *, mode='merge', rename_dict=None, new_split_name=None, label_batch=False, **concat_kwargs)[source]
Append another dance data object to the current data object.
- Parameters:
data – New dance data object to be added.
mode (Optional[Literal['merge', 'rename', 'new_split']]) – How to combine the splits from the new data and the current data. (1) "merge": merge the splits from both data objects, e.g., the training indexes from both are used as the training indexes in the combined data. (2) "rename": rename the splits of the new data and add them to the current split index dictionary, e.g., renaming ‘train’ to ‘ref’. Requires passing rename_dict. Raises an error if a newly renamed key is already used in the current split index dictionary. (3) "new_split": assign the whole new data to a new split. Requires passing a new_split_name that is not already used as a split name in the current data. (4) None: do not assign split indexes to the newly added data.
rename_dict (Optional[Dict[str, str]]) – Optional argument that is only used when mode="rename". A dictionary that maps the split names in the new data to other names.
new_split_name (Optional[str]) – Optional argument that is only used when mode="new_split". Name of the split to assign to the new data.
label_batch (bool) – Add a “batch” column to .obs when set to True.
**concat_kwargs – See anndata.concat().
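The four mode options can be sketched as operations on plain split index dictionaries. The helper combine_split_indices below is hypothetical and not part of dance. It assumes the new cells are concatenated after the current ones, so the new indexes are shifted by the current number of cells, and it leaves out the actual data concatenation.

```python
from typing import Dict, List, Optional


def combine_split_indices(
    current: Dict[str, List[int]],
    new: Dict[str, List[int]],
    n_current: int,
    n_new: int,
    mode: Optional[str] = "merge",
    rename_dict: Optional[Dict[str, str]] = None,
    new_split_name: Optional[str] = None,
) -> Dict[str, List[int]]:
    """Combine two split index dicts following the documented modes."""
    # Assumption: new cells are placed after the current cells.
    shifted = {name: [i + n_current for i in idx] for name, idx in new.items()}
    combined = {name: list(idx) for name, idx in current.items()}

    if mode == "merge":
        for name, idx in shifted.items():
            combined.setdefault(name, []).extend(idx)
    elif mode == "rename":
        if rename_dict is None:
            raise ValueError("mode='rename' requires rename_dict.")
        for name, idx in shifted.items():
            target = rename_dict.get(name, name)
            if target in combined:
                raise KeyError(f"Split name {target!r} is already in use.")
            combined[target] = idx
    elif mode == "new_split":
        if new_split_name is None or new_split_name in combined:
            raise ValueError("mode='new_split' requires an unused new_split_name.")
        combined[new_split_name] = list(range(n_current, n_current + n_new))
    elif mode is not None:
        raise ValueError(f"Unknown mode {mode!r}.")
    # mode=None: the new cells get no split assignment.
    return combined


current = {"train": [0, 1], "test": [2]}
new = {"train": [0], "test": [1]}
merged = combine_split_indices(current, new, 3, 2, mode="merge")
renamed = combine_split_indices(
    current, new, 3, 2, mode="rename", rename_dict={"train": "ref", "test": "query"}
)
extra = combine_split_indices(current, new, 3, 2, mode="new_split", new_split_name="extra")
```

In this toy setup, "merge" extends 'train' and 'test' with the shifted indexes, "rename" adds 'ref' and 'query' alongside the existing splits, and "new_split" places both new cells in 'extra'.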
- property config: Dict[str, Any]
Return the dance data object configuration dict.
Notes
The configuration dictionary is saved in the data attribute, which is an AnnData object. In particular, the config is saved in the .uns attribute under the key "dance_config".
- filter_by_mask(mask, update_splits=True)[source]
Filter cells based on a boolean mask and optionally update splits.
Filters the cells in self.data using a provided boolean mask. If update_splits is True, this method also updates the internal split indices (train_idx, val_idx, etc.) to reflect the cells remaining after filtering.
- Parameters:
mask (Union[Sequence[bool], pd.Series, np.ndarray]) – A boolean mask (list, Series, or array) with the same length as the current number of cells (self.data.shape[0]). Cells where the mask is True will be kept.
update_splits (bool, optional) – Whether to update the internal split indices to align with the filtered data. Defaults to True. If set to False, the split indices will become invalid if any cells are removed.
- Returns:
Returns the instance to allow method chaining.
- Return type:
self
- Raises:
ValueError – If the mask is not boolean or has an incorrect length.
NotImplementedError – If the underlying self.data is not an anndata.AnnData object (as filtering MuData requires more careful handling).
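Why the split indices need updating after filtering can be seen in a small sketch: once cells are dropped, every surviving cell moves to a new position. The helper remap_splits is hypothetical, works on plain Python lists, and is not the library's implementation.

```python
from typing import Dict, List, Sequence


def remap_splits(
    mask: Sequence[bool],
    n_cells: int,
    split_idx: Dict[str, List[int]],
) -> Dict[str, List[int]]:
    """Map old split indices to positions in the filtered data."""
    if len(mask) != n_cells:
        raise ValueError("Mask length must match the number of cells.")
    if not all(isinstance(keep, bool) for keep in mask):
        raise ValueError("Mask must be boolean.")

    # Old position -> new position, for kept cells only.
    new_position = {}
    for old_index, keep in enumerate(mask):
        if keep:
            new_position[old_index] = len(new_position)

    # Drop removed cells from each split and renumber the rest.
    return {
        name: [new_position[i] for i in indices if i in new_position]
        for name, indices in split_idx.items()
    }


splits = {"train": [0, 1, 2], "test": [3, 4]}
mask = [True, False, True, True, False]
updated = remap_splits(mask, 5, splits)
```

Here cells 1 and 4 are removed, so the old 'train' indices [0, 1, 2] become [0, 1] and the old 'test' indices [3, 4] become [2]. Leaving the original indices in place (the update_splits=False case) would point 'test' at positions that no longer exist.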
- filter_cells(**kwargs)[source]
Apply cell filtering using scanpy.pp.filter_cells and update splits.
Filters the cells in self.data based on the provided criteria, similar to scanpy.pp.filter_cells. Crucially, this method also updates the internal split indices (train_idx, val_idx, etc.) to reflect the cells remaining after filtering.
- Parameters:
**kwargs – Arguments passed directly to scanpy.pp.filter_cells. Common arguments include min_counts, max_counts, min_genes, and max_genes. Note: inplace is forced to False internally to obtain the filter mask, which is then applied to self.data in place.
- Returns:
Returns the instance to allow method chaining.
- Return type:
self
- Raises:
NotImplementedError – If the underlying self.data is not an anndata.AnnData object. Filtering MuData requires more careful consideration of modalities.
- get_feature(*, split_name=None, return_type='numpy', channel=None, channel_type='obsm', mod=None)[source]
Retrieve features from data.
- Parameters:
split_name (Optional[str]) – Name of the split to retrieve. If not set, return all.
return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) – How the features should be returned. sparse: return as a sparse matrix; numpy: return as a numpy array; torch: return as a torch tensor; anndata: return as an anndata object.
channel (Optional[str]) – Return a particular channel as features. If channel_type is X or raw_X, then return the .X or the .raw.X attribute from the AnnData directly. If channel_type is obs, return the column named by channel, and similarly for var. Finally, if channel_type is obsm, obsp, varm, varp, layers, or uns, then return the value corresponding to channel in the dictionary.
channel_type (Optional[str]) – Channel type to use, defaults to obsm (will be changed to X in the near future).
mod (Optional[str]) – Modality to use, defaults to None. Options other than None are only available when the underlying data object is MuData.
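The channel and channel_type rules above amount to a small dispatch. The sketch below shows that dispatch on a toy object built from types.SimpleNamespace and plain dicts. The function select_channel and the toy object are hypothetical, not dance or anndata APIs. What happens when channel is omitted for a non-X channel type is not specified in the text above, so this sketch simply raises.

```python
from types import SimpleNamespace

# Channel types that are looked up by key, per the description above.
KEYED_TYPES = ("obs", "var", "obsm", "obsp", "varm", "varp", "layers", "uns")


def select_channel(data, channel=None, channel_type="obsm"):
    """Pick a feature source following the documented channel rules."""
    if channel_type == "X":
        return data.X
    if channel_type == "raw_X":
        return data.raw.X
    if channel_type not in KEYED_TYPES:
        raise ValueError(f"Unknown channel_type {channel_type!r}.")
    if channel is None:
        # Assumption: a keyed channel type needs an explicit channel name.
        raise ValueError("channel is required for this channel_type.")
    return getattr(data, channel_type)[channel]


# Toy stand-in for an AnnData object (dicts replace DataFrames and mappings).
toy = SimpleNamespace(
    X=[[1, 0], [0, 2]],
    raw=SimpleNamespace(X=[[5, 0], [0, 7]]),
    obs={"cell_type": ["B", "T"]},
    obsm={"X_pca": [[0.1], [0.9]]},
)

pca = select_channel(toy, channel="X_pca")  # default channel_type is obsm
labels = select_channel(toy, channel="cell_type", channel_type="obs")
raw_counts = select_channel(toy, channel_type="raw_X")
```

Split selection, the return_type conversion, and the mod argument are left out to keep the sketch focused on the channel lookup.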
- get_split_data(split_name)[source]
Obtain the underlying data of a particular split.
- Parameters:
split_name (str) – Name of the split to retrieve.
- Return type:
Union[AnnData, MuData]
- get_split_idx(split_name, error_on_miss=False)[source]
Obtain cell indices for a particular split.
- Parameters:
split_name (str) – Name of the split to retrieve.
error_on_miss (bool) – If set to True, raise KeyError if the queried split does not exist; otherwise return None.
- get_split_mask(split_name, return_type='numpy')[source]
Obtain mask representation of a particular split.
- Parameters:
split_name (str) – Name of the split to retrieve.
return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) – Return a numpy array if set to ‘numpy’, or a torch Tensor if set to ‘torch’.
- Return type:
Union[ndarray,Tensor]
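The relation between a split's index list and its mask representation can be sketched with plain lists. Both helpers below are hypothetical and only mirror the documented behavior: lookup_split follows the error_on_miss rule of get_split_idx(), and split_mask marks the cells of a split as True. Raising on an unknown split inside split_mask is an assumption.

```python
from typing import Dict, List, Optional


def lookup_split(
    split_idx: Dict[str, List[int]],
    split_name: str,
    error_on_miss: bool = False,
) -> Optional[List[int]]:
    """Return the indices of a split, or None / KeyError when it is missing."""
    if split_name in split_idx:
        return split_idx[split_name]
    if error_on_miss:
        raise KeyError(f"Unknown split {split_name!r}.")
    return None


def split_mask(
    split_idx: Dict[str, List[int]],
    split_name: str,
    n_cells: int,
) -> List[bool]:
    """Boolean mask over all cells; True marks members of the split."""
    mask = [False] * n_cells
    # Assumption: an unknown split is treated as an error here.
    for i in lookup_split(split_idx, split_name, error_on_miss=True):
        mask[i] = True
    return mask


splits = {"train": [0, 1, 2], "test": [4]}
train_mask = split_mask(splits, "train", n_cells=5)
missing = lookup_split(splits, "val")
```

In this example cell 3 belongs to no split, so it is False in every mask, and querying the absent 'val' split with the default error_on_miss=False yields None.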
- set_config(*, overwrite=False, **kwargs)[source]
Set dance data object configuration.
See set_config_from_dict().
- Parameters:
overwrite (bool) –
- set_config_from_dict(config_dict, *, overwrite=False)[source]
Set dance data object configuration from a config dict.
- Parameters:
config_dict (Dict[str, Any]) – Configuration dictionary.
overwrite (bool) – Determines how config conflicts are resolved. A conflict occurs when the config dict passed in contains a key whose value differs from an existing setting. If overwrite is set to False, a KeyError is raised. Otherwise, the configuration is overwritten with the new values.
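The conflict rule for overwrite can be written out as a short sketch on plain dictionaries. The function merge_config is hypothetical and not part of dance; it returns a new dictionary rather than updating the stored configuration, which is a simplification.

```python
from typing import Any, Dict


def merge_config(
    current: Dict[str, Any],
    new: Dict[str, Any],
    overwrite: bool = False,
) -> Dict[str, Any]:
    """Merge ``new`` into ``current`` following the documented conflict rule."""
    merged = dict(current)
    for key, value in new.items():
        # A conflict is an existing key whose value differs from the new one.
        conflict = key in merged and merged[key] != value
        if conflict and not overwrite:
            raise KeyError(f"Conflicting value for config key {key!r}.")
        merged[key] = value
    return merged


existing = {"feature_key": "pca"}  # hypothetical config content
same = merge_config(existing, {"feature_key": "pca", "label_key": "cell_type"})
replaced = merge_config(existing, {"feature_key": "umap"}, overwrite=True)
```

Passing a key with an identical value is not a conflict, so the first call succeeds with the default overwrite=False, while changing 'feature_key' requires overwrite=True.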
- class dance.data.Data(data, train_size=None, val_size=0, test_size=-1, split_index_range_dict=None, full_split_name=None)[source]
- Parameters:
data (AnnData | MuData) –
train_size (int | None) –
val_size (int) –
test_size (int) –
split_index_range_dict (Dict[str, Tuple[int, int]] | None) –
full_split_name (str | None) –
- get_data(split_name=None, return_type='numpy', x_kwargs={}, y_kwargs={})[source]
Retrieve cell features and labels from a particular split.
- Parameters:
split_name (Optional[str]) – Name of the split to retrieve. If not set, return all.
return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) – How the features should be returned. numpy: return as a numpy array; torch: return as a torch tensor; anndata: return as an anndata object.
x_kwargs (Dict[str, Any]) –
y_kwargs (Dict[str, Any]) –
- Return type:
Tuple[Any,Any]
- get_test_data(return_type='numpy', x_kwargs={}, y_kwargs={})[source]
Retrieve cell features and labels from the ‘test’ split.
- Return type:
Tuple[Any, Any]
- Parameters:
return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) –
x_kwargs (Dict[str, Any]) –
y_kwargs (Dict[str, Any]) –
- get_train_data(return_type='numpy', x_kwargs={}, y_kwargs={})[source]
Retrieve cell features and labels from the ‘train’ split.
- Return type:
Tuple[Any, Any]
- Parameters:
return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) –
x_kwargs (Dict[str, Any]) –
y_kwargs (Dict[str, Any]) –
- get_val_data(return_type='numpy', x_kwargs={}, y_kwargs={})[source]
Retrieve cell features and labels from the ‘val’ split.
- Return type:
Tuple[Any, Any]
- Parameters:
return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) –
x_kwargs (Dict[str, Any]) –
y_kwargs (Dict[str, Any]) –