pertpy.tools.DBSCANSpace#

class DBSCANSpace[source]#

Cluster the given data using DBSCAN.

Methods table#

add(adata, *, perturbations[, ...])

Add perturbations linearly.

compute(adata[, layer_key, embedding_key, ...])

Computes a clustering using Density-Based Spatial Clustering of Applications with Noise (DBSCAN).

compute_control_diff(adata, *[, target_col, ...])

Subtract mean of the control from the perturbation.

evaluate_clustering(adata, true_label_col, ...)

Evaluation of previously computed clustering against ground truth labels.

label_transfer(adata, *[, target_column, ...])

Impute missing values in the specified column using KNN imputation in the space defined by use_rep.

subtract(adata, *, perturbations[, ...])

Subtract perturbations linearly.

Methods#

DBSCANSpace.add(adata, *, perturbations, reference_key='control', ensure_consistency=False, target_col='perturbation')#

Add perturbations linearly. Assumes input of size n_perts x dimensionality.

Parameters:
  • adata (AnnData) – Anndata object of size n_perts x dim.

  • perturbations (Iterable[str]) – Perturbations to add.

  • reference_key (str, default: 'control') – Perturbation source from which the perturbation summation starts.

  • ensure_consistency (bool, default: False) – Whether to run differential expression on all data matrices to ensure consistency of linear space.

  • target_col (str, default: 'perturbation') – .obs column name that stores the label of the perturbation applied to each cell.

Return type:

tuple[AnnData, AnnData] | AnnData

Returns:

Anndata object of size (n_perts+1) x dim, where the last row is the addition of the specified perturbations. If ensure_consistency is True, returns a tuple of (new_perturbation, adata) where adata is the AnnData object provided as input but updated using compute_control_diff.

Examples

Example usage with PseudobulkSpace:

>>> import pertpy as pt
>>> mdata = pt.dt.papalexi_2021()
>>> ps = pt.tl.PseudobulkSpace()
>>> ps_adata = ps.compute(mdata["rna"], target_col="gene_target", groups_col="gene_target")
>>> new_perturbation = ps.add(ps_adata, perturbations=["ATF2", "CD86"], reference_key="NT")
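To make the linear arithmetic concrete, here is a minimal NumPy sketch of the idea. It assumes that "adding" means starting from the reference row and summing the selected perturbation rows onto it, then appending the result as a new last row. The helper `add_rows` and the toy matrix are hypothetical illustrations, not pertpy's implementation.

```python
import numpy as np


def add_rows(profiles, names, perturbations, reference="control"):
    """Start from the reference row and add each selected perturbation row."""
    index = {name: i for i, name in enumerate(names)}
    combined = profiles[index[reference]].copy()
    for pert in perturbations:
        combined = combined + profiles[index[pert]]
    # Append the combined profile as a new last row: (n_perts + 1) x dim.
    return np.vstack([profiles, combined])


# Toy n_perts x dim matrix; the control row is zero, as it would be
# after subtracting the control mean.
names = ["control", "A", "B"]
profiles = np.array([[0.0, 0.0], [1.0, -2.0], [0.5, 3.0]])

result = add_rows(profiles, names, ["A", "B"])
print(result.shape)
print(result[-1])
```

The last row holds the combined profile of A and B, and the original rows are left unchanged.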
DBSCANSpace.compute(adata, layer_key=None, embedding_key=None, cluster_key='dbscan', copy=True, return_object=False, **kwargs)[source]#

Computes a clustering using Density-Based Spatial Clustering of Applications with Noise (DBSCAN).

Parameters:
  • adata (AnnData) – Anndata object of size cells x genes

  • layer_key (str, default: None) – If specified and exists in the adata, the clustering is done by using it. Otherwise, clustering is done with .X.

  • embedding_key (str, default: None) – If specified and exists in the adata, the clustering is done with that embedding. Otherwise, clustering is done with .X.

  • cluster_key (str, default: 'dbscan') – Name of the .obs column to store the cluster labels.

  • copy (bool, default: True) – If True, returns a new AnnData of the same size with the new column; otherwise, updates the input adata.

  • return_object (bool, default: False) – If True, also returns the clustering object.

  • **kwargs – Keyword arguments passed to sklearn’s DBSCAN.

Return type:

tuple[AnnData, object] | AnnData

Returns:

If return_object is True, the adata and the clustering object are returned. Otherwise, only the adata is returned. The adata is updated with a new .obs column, named by cluster_key, that stores the cluster labels.

Examples

>>> import pertpy as pt
>>> mdata = pt.dt.papalexi_2021()
>>> dbscan = pt.tl.DBSCANSpace()
>>> dbscan_adata = dbscan.compute(mdata["rna"])
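Since extra keyword arguments are passed to sklearn's DBSCAN, it may help to see what the two most common ones do. The sketch below calls `sklearn.cluster.DBSCAN` directly on a toy array (the data and parameter values are made up for illustration). Presumably the same `eps` and `min_samples` keywords could be supplied to `compute` through `**kwargs`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight toy groups plus one far-away point.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [20.0, 20.0],
])

# eps: neighborhood radius; min_samples: number of points (including the
# point itself) required within eps for a point to count as a core point.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

# scikit-learn marks points that belong to no cluster with the label -1.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(labels, n_clusters, n_noise)
```

Unlike k-means, the number of clusters is not chosen up front; it follows from the density parameters, and isolated points are reported as noise rather than being forced into a cluster.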
DBSCANSpace.compute_control_diff(adata, *, target_col='perturbation', group_col=None, reference_key='control', layer_key=None, new_layer_key='control_diff', embedding_key=None, new_embedding_key='control_diff', all_data=False, copy=False)#

Subtract mean of the control from the perturbation.

Parameters:
  • adata (AnnData) – Anndata object of size cells x genes.

  • target_col (str, default: 'perturbation') – .obs column name that stores the label of the perturbation applied to each cell.

  • group_col (str, default: None) – .obs column name that stores the label of the group of each cell. If None, ignore groups.

  • reference_key (str, default: 'control') – The key of the control values.

  • layer_key (str, default: None) – Key of the AnnData layer to use for computation.

  • new_layer_key (str, default: 'control_diff') – Results are stored in a new layer with this key.

  • embedding_key (str, default: None) – obsm key of the AnnData embedding to use for computation.

  • new_embedding_key (str, default: 'control_diff') – Results are stored in a new embedding in obsm with this key.

  • all_data (bool, default: False) – If True, do the computation in all data representations (X, all layers and all embeddings).

  • copy (bool, default: False) – If True returns a new Anndata of same size with the new column; otherwise it updates the initial AnnData object.

Return type:

AnnData

Returns:

Updated AnnData object.

Examples

Example usage with PseudobulkSpace:

>>> import pertpy as pt
>>> mdata = pt.dt.papalexi_2021()
>>> ps = pt.tl.PseudobulkSpace()
>>> diff_adata = ps.compute_control_diff(mdata["rna"], target_col="gene_target", reference_key="NT")
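The underlying operation can be sketched in a few lines of NumPy. This is an illustration under simple assumptions (a dense toy matrix, no grouping via group_col, and the control mean subtracted from every cell); the variable names are hypothetical and this is not pertpy's actual code.

```python
import numpy as np

# Toy cells x genes matrix with one perturbation label per cell.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0], [7.0, 5.0]])
labels = np.array(["control", "control", "A", "B"])

# Per-gene mean over the control cells only.
control_mean = X[labels == "control"].mean(axis=0)

# Subtract that mean from every cell; control cells end up centered on zero.
control_diff = X - control_mean
print(control_diff)
```

After this step each row describes a cell's deviation from the average control cell, which is the representation that the linear add and subtract operations assume.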
DBSCANSpace.evaluate_clustering(adata, true_label_col, cluster_col, metrics=None, **kwargs)#

Evaluation of previously computed clustering against ground truth labels.

Parameters:
  • adata (AnnData) – AnnData object that contains the clustered data and the cluster labels.

  • true_label_col (str) – Name of the column that stores the ground truth labels.

  • cluster_col (str) – Name of the column that stores the computed cluster labels.

  • metrics (Iterable[str], default: None) – Metrics to compute. If None, it defaults to ["nmi", "ari", "asw"].

  • **kwargs – Additional arguments to pass to the metrics. For nmi, average_method can be passed. For asw, metric, distances, sample_size, and random_state can be passed.

Examples

Example usage with KMeansSpace:

>>> import pertpy as pt
>>> mdata = pt.dt.papalexi_2021()
>>> kmeans = pt.tl.KMeansSpace()
>>> kmeans_adata = kmeans.compute(mdata["rna"], n_clusters=26)
>>> results = kmeans.evaluate_clustering(
...     kmeans_adata, true_label_col="gene_target", cluster_col="k-means", metrics=["nmi"]
... )
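For intuition about two of the metrics, the sketch below computes NMI and ARI on toy labels using scikit-learn's `normalized_mutual_info_score` and `adjusted_rand_score`. That pertpy's "nmi" and "ari" correspond to these functions is an assumption here, and the label lists are made up.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = ["ctrl", "ctrl", "KO1", "KO1", "KO2", "KO2"]

# A clustering that recovers the groups exactly (cluster ids are arbitrary).
perfect = [0, 0, 1, 1, 2, 2]
# A clustering that merges KO1 and KO2 into one cluster.
merged = [0, 0, 1, 1, 1, 1]

nmi_perfect = normalized_mutual_info_score(true_labels, perfect, average_method="arithmetic")
ari_perfect = adjusted_rand_score(true_labels, perfect)
nmi_merged = normalized_mutual_info_score(true_labels, merged, average_method="arithmetic")
ari_merged = adjusted_rand_score(true_labels, merged)

print(nmi_perfect, ari_perfect)
print(nmi_merged, ari_merged)
```

Both scores reach 1 for the exact match and drop when two true groups are merged. Neither depends on how the cluster ids are numbered.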
DBSCANSpace.label_transfer(adata, *, target_column='perturbation', column_uncertainty_score_key='perturbation_transfer_uncertainty', target_val='unknown', neighbors_key='neighbors', **kwargs)#

Impute missing values in the specified column using KNN imputation in the space defined by use_rep.

Uncertainty is calculated as the entropy of the label distribution in the neighborhood of the target cell. In other words, a cell whose neighbors all share the same label has an uncertainty of 0, whereas a cell whose neighbors carry many different labels has a high uncertainty.

Parameters:
  • adata (AnnData) – The AnnData object containing single-cell data.

  • target_column (str, default: 'perturbation') – The column name in adata.obs to perform imputation on.

  • column_uncertainty_score_key (str, default: 'perturbation_transfer_uncertainty') – The column name in adata.obs to store the uncertainty score of the label transfer.

  • target_val (str, default: 'unknown') – The target value to impute.

  • neighbors_key (str, default: 'neighbors') – The key in adata.uns where the neighbors are stored.

Return type:

None

Examples

>>> import pertpy as pt
>>> import scanpy as sc
>>> import numpy as np
>>> adata = sc.datasets.pbmc68k_reduced()
>>> # randomly drop 10% of the data annotations
>>> adata.obs["perturbation"] = adata.obs["louvain"].astype(str).copy()
>>> random_cells = np.random.choice(adata.obs.index, int(adata.obs.shape[0] * 0.1), replace=False)
>>> adata.obs.loc[random_cells, "perturbation"] = "unknown"
>>> sc.pp.neighbors(adata)
>>> sc.tl.umap(adata)
>>> ps = pt.tl.PseudobulkSpace()
>>> ps.label_transfer(adata)
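The entropy-based uncertainty described above can be illustrated with the standard library alone. This is a sketch under assumptions: it uses the natural logarithm and no normalization, which may differ from pertpy's choices, and `label_entropy` is a hypothetical helper.

```python
import math
from collections import Counter


def label_entropy(neighbor_labels):
    """Shannon entropy (natural log) of the labels among a cell's neighbors."""
    counts = Counter(neighbor_labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


# All neighbors agree: no uncertainty.
uniform_neighbors = label_entropy(["A", "A", "A", "A"])
# Every neighbor carries a different label: maximal uncertainty for four labels.
mixed_neighbors = label_entropy(["A", "B", "C", "D"])
print(uniform_neighbors, mixed_neighbors)
```

A threshold on this score could be used to keep only confidently transferred labels; where to set it depends on the number of neighbors and distinct labels.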
DBSCANSpace.subtract(adata, *, perturbations, reference_key='control', ensure_consistency=False, target_col='perturbation')#

Subtract perturbations linearly. Assumes input of size n_perts x dimensionality.

Parameters:
  • adata (AnnData) – Anndata object of size n_perts x dim.

  • perturbations (Iterable[str]) – Perturbations to subtract.

  • reference_key (str, default: 'control') – Perturbation source from which the perturbation subtraction starts.

  • ensure_consistency (bool, default: False) – Whether to run differential expression on all data matrices to ensure consistency of linear space.

  • target_col (str, default: 'perturbation') – .obs column name that stores the label of the perturbation applied to each cell.

Return type:

tuple[AnnData, AnnData] | AnnData

Returns:

Anndata object of size (n_perts+1) x dim, where the last row is the subtraction of the specified perturbations. If ensure_consistency is True, returns a tuple of (new_perturbation, adata) where adata is the AnnData object provided as input but updated using compute_control_diff.

Examples

Example usage with PseudobulkSpace:

>>> import pertpy as pt
>>> mdata = pt.dt.papalexi_2021()
>>> ps = pt.tl.PseudobulkSpace()
>>> ps_adata = ps.compute(mdata["rna"], target_col="gene_target", groups_col="gene_target")
>>> new_perturbation = ps.subtract(ps_adata, reference_key="ATF2", perturbations=["BRD4", "CUL3"])