dance.data

class dance.data.base.BaseData(data, train_size=None, val_size=0, test_size=-1, split_index_range_dict=None, full_split_name=None)[source]

Base data object.

The dance data object is a wrapper of the AnnData object, with several utility methods that help retrieve data for specific splits and in specific formats (see get_split_idx() and get_feature()). The AnnData object is saved in the attribute data and can be accessed directly.

Warning

Since the underlying data object is a reference to the input AnnData object, be extra careful *NOT* to initialize two different dance data objects from the same AnnData object! If you are unsure, we recommend always initializing the dance data object with a copy of the input AnnData object, e.g.,

>>> import anndata
>>> import dance.data
>>> adata = anndata.AnnData(...)
>>> ddata = dance.data.Data(adata.copy())
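The reason for the warning can be illustrated without dance or anndata at all. The following is a minimal sketch in which `Wrapper` is a hypothetical stand-in for any object that stores a reference to its input, and a plain dictionary stands in for the AnnData object; it is not how dance itself is implemented.

```python
import copy


class Wrapper:
    """Hypothetical stand-in for a wrapper that keeps a reference, not a copy."""

    def __init__(self, data):
        self.data = data  # no copy is made; the caller's object is shared


shared = {"uns": {}}  # stand-in for an AnnData object

a = Wrapper(shared)
b = Wrapper(shared)  # second wrapper around the SAME underlying object

# A change made through one wrapper is visible through the other.
a.data["uns"]["dance_config"] = {"owner": "a"}
seen_by_b = b.data["uns"]["dance_config"]

# Wrapping a copy keeps the two wrappers independent.
c = Wrapper(copy.deepcopy(shared))
c.data["uns"]["dance_config"] = {"owner": "c"}
```

Here `seen_by_b` holds the configuration written through `a`, while the change made through `c` leaves `a` untouched.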

Note

You can directly access the main properties of AnnData (or MuData, depending on the type of data you passed in), such as X, obs, and var.

Parameters:

data (Union[AnnData, MuData]) – Cell data.
train_size (Optional[int]) – Number of cells to be used for training. If not specified, no splits will be generated.
val_size (int) – Number of cells to be used for validation. If set to -1, use what’s left from training and testing.
test_size (int) – Number of cells to be used for testing. If set to -1, use what’s left from training and validation.
split_index_range_dict (Dict[str, Tuple[int, int]] | None) –
full_split_name (str | None) –
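The size arithmetic described for train_size, val_size, and test_size can be sketched as follows. `resolve_split_sizes` is a hypothetical helper, not part of dance, and it assumes that splits are contiguous index ranges laid out in train, val, test order and that at most one size is -1; the library's actual layout and validation may differ.

```python
def resolve_split_sizes(n_cells, train_size=None, val_size=0, test_size=-1):
    """Hypothetical helper: turn split sizes into (start, end) index ranges."""
    if train_size is None:
        return None  # no splits are generated

    sizes = {"train": train_size, "val": val_size, "test": test_size}
    unknown = [name for name, size in sizes.items() if size == -1]
    if len(unknown) > 1:
        raise ValueError("at most one size may be -1")

    known_total = sum(size for size in sizes.values() if size != -1)
    if known_total > n_cells:
        raise ValueError("split sizes exceed the number of cells")
    if unknown:
        # -1 means "whatever is left after the other two splits"
        sizes[unknown[0]] = n_cells - known_total

    ranges, start = {}, 0
    for name in ("train", "val", "test"):  # assumed ordering
        end = start + sizes[name]
        ranges[name] = (start, end)
        start = end
    return ranges


default_test = resolve_split_sizes(100, train_size=60, val_size=10)
leftover_val = resolve_split_sizes(100, train_size=60, val_size=-1, test_size=20)
```

With 100 cells, the first call leaves 30 cells for testing and the second leaves 20 cells for validation.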

append(data, *, mode='merge', rename_dict=None, new_split_name=None, label_batch=False, **concat_kwargs)[source]

Append another dance data object to the current data object.

Parameters:

data – New dance data object to be added.
mode (Optional[Literal['merge', 'rename', 'new_split']]) – How to combine the splits from the new data and the current data. (1) "merge": merge the splits from the two data objects, e.g., the training indexes from both are used as the training indexes in the new combined data. (2) "rename": rename the splits of the new data and add them to the current split index dictionary, e.g., renaming ‘train’ to ‘ref’. Requires passing rename_dict. Raises an error if a newly renamed key is already used in the current split index dictionary. (3) "new_split": assign the whole new data to a new split. Requires passing a new_split_name that is not already used as a split name in the current data. (4) None: do not assign any split index to the newly added data.
rename_dict (Optional[Dict[str, str]]) – Optional argument that is only used when mode="rename". A dictionary to map the split names in the new data to other names.
new_split_name (Optional[str]) – Optional argument that is only used when mode="new_split". Name of the split to assign to the new data.
label_batch (bool) – Add “batch” column to .obs when set to True.
**concat_kwargs – See anndata.concat().
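The effect of each mode on the split index dictionary can be sketched in plain Python. `combine_splits` is a hypothetical helper, not the dance implementation; it assumes the new cells are placed after the current ones (so their indices are shifted by the current cell count) and that, in "rename" mode, splits missing from rename_dict keep their names.

```python
def combine_splits(current, new, n_current, n_new, mode="merge",
                   rename_dict=None, new_split_name=None):
    """Hypothetical helper: combine two split-index dictionaries."""
    combined = {name: list(idx) for name, idx in current.items()}
    # Assumption: appended cells come after the existing ones.
    shifted = {name: [i + n_current for i in idx] for name, idx in new.items()}

    if mode == "merge":
        for name, idx in shifted.items():
            combined.setdefault(name, []).extend(idx)
    elif mode == "rename":
        if rename_dict is None:
            raise ValueError("rename_dict is required when mode='rename'")
        for name, idx in shifted.items():
            target = rename_dict.get(name, name)
            if target in combined:
                raise KeyError(f"split name {target!r} is already in use")
            combined[target] = idx
    elif mode == "new_split":
        if new_split_name is None or new_split_name in combined:
            raise ValueError("new_split_name must be an unused split name")
        combined[new_split_name] = list(range(n_current, n_current + n_new))
    elif mode is not None:
        raise ValueError(f"unknown mode: {mode!r}")
    return combined  # mode=None: new cells belong to no split


current = {"train": [0, 1], "test": [2]}
new = {"train": [0], "test": [1]}

merged = combine_splits(current, new, 3, 2, mode="merge")
renamed = combine_splits(current, new, 3, 2, mode="rename",
                         rename_dict={"train": "ref", "test": "query"})
as_new = combine_splits(current, new, 3, 2, mode="new_split",
                        new_split_name="extra")
unassigned = combine_splits(current, new, 3, 2, mode=None)
```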

property config: Dict[str, Any]

Return the dance data object configuration dict.

Notes

The configuration dictionary is saved in the data attribute, which is an AnnData object. In particular, the config is saved in the .uns attribute under the key "dance_config".

filter_by_mask(mask, update_splits=True)[source]

Filter cells based on a boolean mask and optionally update splits.

Filters the cells in self.data using a provided boolean mask. If update_splits is True, this method also updates the internal split indices (train_idx, val_idx, etc.) to reflect the cells remaining after filtering.

Parameters:

mask (Union[Sequence[bool], pd.Series, np.ndarray]) – A boolean mask (list, Series, or array) with the same length as the current number of cells (self.data.shape[0]). Cells where the mask is True will be kept.
update_splits (bool, optional) – Whether to update the internal split indices to align with the filtered data. Defaults to True. If set to False, the split indices will become invalid if any cells are removed.

Returns:

Returns the instance to allow method chaining.

Return type:

self

Raises:

ValueError – If the mask is not boolean or has an incorrect length.
NotImplementedError – If the underlying self.data is not an anndata.AnnData object (as filtering MuData requires more careful handling).
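Why the split indices need updating can be seen in a small, library-free sketch. `filter_split_indices` is a hypothetical helper that assumes splits are stored as lists of integer positions; it only illustrates the index remapping and is not the dance implementation.

```python
def filter_split_indices(mask, splits, n_cells):
    """Hypothetical helper: remap split indices after dropping masked-out cells."""
    if len(mask) != n_cells:
        raise ValueError("mask length must equal the number of cells")
    if not all(isinstance(keep, bool) for keep in mask):
        raise ValueError("mask must be boolean")

    # Map each surviving old position to its position in the filtered data.
    new_pos = {}
    for old, keep in enumerate(mask):
        if keep:
            new_pos[old] = len(new_pos)

    return {
        name: [new_pos[i] for i in idx if i in new_pos]
        for name, idx in splits.items()
    }


mask = [True, False, True, True, False]
splits = {"train": [0, 1, 2], "test": [3, 4]}
updated = filter_split_indices(mask, splits, n_cells=5)
```

Cells 1 and 4 are removed, so the old positions 2 and 3 become 1 and 2. Leaving the original indices in place would point "test" at positions that no longer exist.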

filter_cells(**kwargs)[source]

Apply cell filtering using scanpy.pp.filter_cells and update splits.

Filters the cells in self.data based on the provided criteria, similar to scanpy.pp.filter_cells. Crucially, this method also updates the internal split indices (train_idx, val_idx, etc.) to reflect the cells remaining after filtering.

Parameters:

**kwargs – Arguments passed directly to scanpy.pp.filter_cells. Common arguments include min_counts, max_counts, min_genes, and max_genes. Note: inplace is forced to False internally to obtain the filter mask, which is then applied to the data in place.

Returns:

Returns the instance to allow method chaining.

Return type:

self

Raises:

NotImplementedError – If the underlying self.data is not an anndata.AnnData object. Filtering MuData requires more careful consideration of modalities.

get_feature(*, split_name=None, return_type='numpy', channel=None, channel_type='obsm', mod=None)[source]

Retrieve features from data.

Parameters:

split_name (Optional[str]) – Name of the split to retrieve. If not set, return all.
return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) – How the features should be returned. sparse: return as a sparse matrix; numpy: return as a numpy array; torch: return as a torch tensor; anndata: return as an anndata object.
channel (Optional[str]) – Return a particular channel as features. If channel_type is X or raw_X, then return the .X or the .raw.X attribute from the AnnData directly. If channel_type is obs, return the column named by channel, and similarly for var. Finally, if channel_type is obsm, obsp, varm, varp, layers, or uns, then return the value corresponding to the channel in the dictionary.
channel_type (Optional[str]) – Channel type to use. Defaults to obsm (will be changed to X in the near future).
mod (Optional[str]) – Modality to use. Defaults to None. Options other than None are only available when the underlying data object is MuData.

get_split_data(split_name)[source]

Obtain the underlying data of a particular split.

Parameters:

split_name (str) – Name of the split to retrieve.

Return type:

Union[AnnData, MuData]

get_split_idx(split_name, error_on_miss=False)[source]

Obtain cell indices for a particular split.

Parameters:

split_name (str) – Name of the split to retrieve.
error_on_miss (bool) – If set to True, raise KeyError if the queried split does not exist; otherwise return None.
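The two behaviors selected by error_on_miss can be sketched as follows. `lookup_split` is a hypothetical function operating on a plain dictionary of split indices; it mirrors the documented behavior but is not the dance implementation.

```python
def lookup_split(split_idx_dict, split_name, error_on_miss=False):
    """Hypothetical helper: return split indices, or None / KeyError on a miss."""
    if split_name in split_idx_dict:
        return split_idx_dict[split_name]
    if error_on_miss:
        raise KeyError(f"unknown split: {split_name!r}")
    return None  # silent miss; callers must handle None


splits = {"train": [0, 1, 2], "test": [3, 4]}
train_idx = lookup_split(splits, "train")
missing = lookup_split(splits, "val")  # no error raised by default
```

Passing error_on_miss=True is the safer choice when a missing split indicates a bug, since a returned None can otherwise propagate unnoticed.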