Tools#

Differential gene expression#

Differential gene expression involves the quantitative comparison of gene expression levels between two or more groups, such as different cell types, tissues, or conditions to discern genes that are significantly up- or downregulated in response to specific biological contexts or stimuli. Pertpy enables differential gene expression tests through a common interface that supports complex designs.

tools.PyDESeq2

Differential expression test using a PyDESeq2.

tools.EdgeR

Differential expression test using EdgeR.

tools.WilcoxonTest

Perform a unpaired or paired Wilcoxon test.

tools.TTest

Perform a unpaired or paired T-test.

tools.PermutationTest

Perform a permutation test.

tools.Statsmodels

Differential expression test using a statsmodels linear regression.

Pooled CRISPR screens#

Perturbation assignment - Mixscape#

CRISPR based screens can suffer from off-target effects but also limited efficacy of the guide RNAs. When analyzing CRISPR screen data, it is vital to know which perturbations were successful and which ones were not to accurately determine the effect of perturbations.

Mixscape is a pipeline that aims to determine and remove unsuccessfully perturbed cells [PMB+21]. First tries to remove confounding sources of variation such as cell cycle or replicate effect by calculating a perturbation signature Next, it determines which targeted cells were affected by the genetic perturbation (=KO) and which targeted cells were not (=NP) with the use of mixture models. Finally, it visualizes similarities and differences across different perturbations.

See Characterizing the molecular regulation of inhibitory immune checkpoints with multimodal single-cell screens for more details on the pipeline.

tools.Mixscape()

identify perturbation effects in CRISPR screens by separating cells into perturbation groups.

Example implementation:

import pertpy as pt

mdata = pt.dt.papalexi_2021()
ms = pt.tl.Mixscape()
ms.perturbation_signature(mdata["rna"], "perturbation", "NT", "replicate")
ms.mixscape(adata=mdata["rna"], control="NT", labels="gene_target", layer="X_pert")
ms.lda(adata=mdata["rna"], labels="gene_target", layer="X_pert", control="NT")
ms.plot_lda(adata=mdata["rna"], control="NT")

See mixscape tutorial.

Compositional analysis#

Compositional data analysis focuses on identifying and quantifying variations in cell type composition across different conditions or samples to uncover biological differences driven by changes in cellular makeup.

Generally, there’s two ways of approaching this question:

  1. Without labeled groups using graph based approaches

  2. With labeled groups using pure statistical approaches

For a more in-depth explanation we refer to the corresponding sc-best-practices compositional chapter.

Without labeled groups - Milo#

Milo enables the exploration of differential abundance of cell types across different biological conditions or spatial locations [DHT+22]. It employs a neighborhood-testing approach to statistically assess variations in cell type compositions, providing insights into the microenvironmental and functional heterogeneity within and across samples.

See Differential abundance testing on single-cell data using k-nearest neighbor graphs for details on the statistical framework.

tools.Milo()

Python implementation of Milo.

Example implementation:

import pertpy as pt
import scanpy as sc

adata = pt.dt.stephenson_2021_subsampled()
adata.obs["COVID_severity"] = adata.obs["Status_on_day_collection_summary"].copy()
adata.obs[["patient_id", "COVID_severity"]].drop_duplicates()
adata = adata[adata.obs["Status"] != "LPS"].copy()

milo = pt.tl.Milo()
mdata = milo.load(adata)
sc.pp.neighbors(mdata["rna"], use_rep="X_scVI", n_neighbors=150, n_pcs=10)
milo.make_nhoods(mdata["rna"], prop=0.1)
mdata = milo.count_nhoods(mdata, sample_col="patient_id")
mdata["rna"].obs["Status"] = (
    mdata["rna"].obs["Status"].cat.reorder_categories(["Healthy", "Covid"])
)
milo.da_nhoods(mdata, design="~Status")

See milo tutorial.

With labeled groups - scCODA and tascCODA#

scCODA is designed to identify differences in cell type compositions from single-cell sequencing data across conditions for labeled groups [ButtnerOMuller+21]. It employs a Bayesian hierarchical model and Dirichlet-multinomial distribution, using Markov chain Monte Carlo (MCMC) for inference, to detect significant shifts in cell type composition across conditions.

tascCODA extends scCODA to analyze compositional count data from single-cell sequencing studies, incorporating hierarchical tree information and experimental covariates [OCM21]. By integrating spike-and-slab Lasso penalization with latent tree-based parameters, tascCODA identifies differential abundance across hierarchical levels, offering parsimonious and predictive insights into compositional changes in cell populations.

See scCODA is a Bayesian model for compositional single-cell data analysis and tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data for more details.

tools.Sccoda(*args, **kwargs)

Statistical model for single-cell differential composition analysis with specification of a reference cell type.

tools.Tasccoda(*args, **kwargs)

Statistical model for tree-aggregated differential composition analysis (tascCODA, Ostner et al., 2021).

Example implementation:

import pertpy as pt

haber_cells = pt.dt.haber_2017_regions()
sccoda = pt.tl.Sccoda()
sccoda_data = sccoda.load(
    haber_cells,
    type="cell_level",
    generate_sample_level=True,
    cell_type_identifier="cell_label",
    sample_identifier="batch",
    covariate_obs=["condition"],
)
sccoda_data.mod["coda_salm"] = sccoda_data["coda"][
    sccoda_data["coda"].obs["condition"].isin(["Control", "Salmonella"])
].copy()

sccoda_data = sccoda.prepare(
    sccoda_data,
    modality_key="coda_salm",
    formula="condition",
    reference_cell_type="Goblet",
)
sccoda.run_nuts(sccoda_data, modality_key="coda_salm")
sccoda.summary(sccoda_data, modality_key="coda_salm")
sccoda.plot_effects_barplot(
    sccoda_data, modality_key="coda_salm", parameter="Final Parameter"
)

See sccoda tutorial, extended sccoda tutorial and tasccoda tutorial.

Multicellular and gene programs#

Multicellular programs are organized interactions and coordinated activities among different cell types within a tissue, forming complex functional units that drive tissue-specific functions, responses to environmental changes, and pathological states. These programs enable a higher level of biological organization by integrating signaling pathways, gene expression, and cellular behaviors across the cellular community to maintain homeostasis and execute collective responses.

Multicellular programs - DIALOGUE#

DIALOGUE identifies latent multicellular programs by mapping the data into a feature space where the cell type specific representations are correlated across different samples and environments [JAR22]. Next, DIALOGUE employs multi-level hierarchical modeling to identify genes that comprise the latent features.

This is a work in progress (!) Python implementation of DIALOGUE for the discovery of multicellular programs.

See DIALOGUE maps multicellular programs in tissue from single-cell or spatial transcriptomics data for more details on the methodology.

tools.Dialogue(sample_id, celltype_key, ...)

Python implementation of DIALOGUE.

Example implementation:

import pertpy as pt
import scanpy as sc

adata = pt.dt.dialogue_example()
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)


dl = pt.tl.Dialogue(
    sample_id="clinical.status",
    celltype_key="cell.subtypes",
    n_counts_key="nCount_RNA",
    n_mpcs=3,
)
adata, mcps, ws, ct_subs = dl.calculate_multifactor_PMD(adata, normalize=True)
all_results, new_mcps = dl.multilevel_modeling(
    ct_subs=ct_subs,
    mcp_scores=mcps,
    ws_dict=ws,
    confounder="gender",
)

See DIALOGUE tutorial.

Enrichment#

Enrichment tests for single-cell data assess whether specific biological pathways or gene sets are overrepresented in the expression profiles of individual cells, aiding in the identification of functional characteristics and cellular states. While pathway enrichment is a well-studied and commonly applied approach in single-cell RNA-seq, other data sources such as genes targeted by drugs can also be enriched. Drug2cell performs such enrichment tests and is available in pertpy [KCM+23].

Example implementation:

import pertpy as pt
import scanpy as sc

adata = sc.datasets.pbmc3k_processed()

pt_enricher = pt.tl.Enrichment()
pt_enricher.score(adata)

See enrichment tutorial.

Distances and permutation tests#

In settings where many perturbations are applied, it is often times unclear which perturbations had a strong effect and should be investigated further. Differential gene expression poses one option to get candidate genes and p-values. Determining statistical distances between the perturbations and applying a permutation test is another option [PGS+24].

For more details, we refer to scPerturb: harmonized single-cell perturbation data.

tools.Distance([metric, agg_fct, layer_key, ...])

Distance class, used to compute distances between groups of cells.

tools.DistanceTest(metric[, n_perms, ...])

Run permutation tests using a distance of choice between groups of cells.

Example implementation:

import pertpy as pt

adata = pt.dt.distance_example()

# Pairwise distances
distance = pt.tl.Distance(metric="edistance", obsm_key="X_pca")
pairwise_edistance = distance.pairwise(adata, groupby="perturbation")

# E-test (Permutation test using E-distance)
etest = pt.tl.PermutationTest(
    metric="edistance", obsm_key="X_pca", correction="holm-sidak"
)
tab = etest(adata, groupby="perturbation", contrast="control")

See distance tutorial and distance tests tutorial.

Response prediction#

Response prediction describes computational models that predict how individual cells or cell populations will respond to specific treatments, conditions, or stimuli based on their gene expression profiles, enabling insights into cellular behaviors and potential therapeutic strategies. Such approaches can also order perturbations by their effect on groups of cells.

Rank perturbations - Augur#

Augur aims to rank or prioritize cell types according to their response to experimental perturbations [SSK+21]. Cells that respond strongly to perturbations are more easily distinguishable as treated or control in molecular space. Augur quantifies this by training a classifier to predict experimental labels within each cell type across cross-validation runs. Cell types are ranked by model accuracy—using AUC for categorical labels and concordance correlation for continuous ones—as a proxy for perturbation response.

For more details we refer to Cell type prioritization in single-cell data.

Example implementation:

import pertpy as pt

adata = pt.dt.sc_sim_augur()
ag = pt.tl.Augur(estimator="random_forest_classifier")
adata = ag.load(adata)
adata, results = ag.predict(adata)

# metrics for each cell type
results["summary_metrics"]

See augur tutorial.

tools.Augur

Python implementation of Augur.

Gene expression prediction with scGen#

scGen is a deep generative model that leverages autoencoders and adversarial training to integrate single-cell RNA sequencing data from different conditions or tissues, enabling the generation of synthetic single-cell data for cross-condition analysis and predicting cell-type-specific responses to perturbations [LWT19]. See scGen predicts single-cell perturbation responses for more details.

tools.Scgen(adata[, n_hidden, n_latent, ...])

Jax Implementation of scGen model for batch removal and perturbation prediction.

Example implementation:

import pertpy as pt

train = pt.dt.kang_2018()

train_new = train[
    ~((train.obs["cell_type"] == "CD4T") & (train.obs["condition"] == "stimulated"))
]
train_new = train_new.copy()

pt.tl.Scgen.setup_anndata(train_new, batch_key="condition", labels_key="cell_type")
scgen = pt.tl.Scgen(train_new)
scgen.train(max_epochs=100, batch_size=32)

pred, delta = scgen.predict(
    ctrl_key="control", stim_key="stimulated", celltype_to_predict="CD4T"
)
pred.obs["condition"] = "pred"

See scgen tutorial.

Causal perturbation analysis with CINEMA-OT#

CINEMA-OT is a causal framework for perturbation effect analysis to identify individual treatment effects and synergy at the single cell level [DWW+23]. CINEMA-OT separates confounding sources of variation from perturbation effects to obtain an optimal transport matching that reflects counterfactual cell pairs. These cell pairs represent causal perturbation responses permitting a number of novel analyses, such as individual treatment effect analysis, response clustering, attribution analysis, and synergy analysis.

See Causal identification of single-cell experimental perturbation effects with CINEMA-OT for more details.

tools.Cinemaot()

CINEMA-OT is a causal framework for perturbation effect analysis to identify individual treatment effects and synergy.

Example implementation:

import pertpy as pt

adata = pt.dt.cinemaot_example()

model = pt.tl.Cinemaot()
de = model.causaleffect(
    adata,
    pert_key="perturbation",
    control="No stimulation",
    return_matching=True,
    thres=0.5,
    smoothness=1e-5,
    eps=1e-3,
    solver="Sinkhorn",
    preweight_label="cell_type0528",
)

See CINEMA-OT tutorial.

Perturbation space#

Perturbation spaces depart from the individualistic perspective of cells and instead organizes cells into cohesive ensembles. This specialized space enables comprehending the collective impact of perturbations on cells. Pertpy offers various modules for calculating and evaluating perturbation spaces that are either based on summary statistics or clusters.

tools.MLPClassifierSpace()

Fits an ANN classifier to the data and takes the feature space (weights in the last layer) as embedding.

tools.LRClassifierSpace()

Fits a logistic regression model to the data and takes the feature space as embedding.

tools.CentroidSpace()

Computes the centroids per perturbation of a pre-computed embedding.

tools.DBSCANSpace()

Cluster the given data using DBSCAN.

tools.KMeansSpace()

Computes K-Means clustering of the expression values.

tools.PseudobulkSpace()

Calculates pseudobulks.

Example implementation:

import pertpy as pt

mdata = pt.dt.papalexi_2021()
ps = pt.tl.PseudobulkSpace()
ps_adata = ps.compute(
    mdata["rna"],
    target_col="gene_target",
    groups_col="gene_target",
    mode="mean",
)

See perturbation space tutorial.