Usage

Import the pertpy API as follows:

import pertpy as pt

You can then access the respective modules like:

pt.tl.cool_fancy_tool()

Datasets

pertpy provides access to curated single-cell datasets spanning several types of perturbations. Many of the datasets originate from scPerturb and were further curated to have harmonized names and to be loadable as MuData objects.

data.adamson_2016_pilot()

6000 chronic myeloid leukemia (K562) cells carrying 8 distinct guide barcodes (GBCs).

data.adamson_2016_upr_epistasis()

15000 K562 cells with UPR sensor genes knocked out and treated with thapsigargin.

data.adamson_2016_upr_perturb_seq()

Transcriptomics measurements of 65000 cells that were subject to 91 sgRNAs targeting 82 genes.

data.aissa_2021()

Transcriptomics of 848 PC9 cells subject to consecutive erlotinib treatment and 756 control cells.

data.bhattacherjee()

Processed single-cell data of the PFC of adult mice under cocaine self-administration.

data.burczynski_crohn()

Bulk data with conditions ulcerative colitis (UC) and Crohn's disease (CD).

data.chang_2021()

Transcriptomics of 5 different cell lines that were induced with a unique TraCe-seq barcode.

data.combosciplex()

scRNA-seq subset of the combinatorial experiment of sciplex3.

data.cinemaot_example()

Subsampled CINEMA-OT example dataset.

data.datlinger_2017()

Transcriptomics measurements of 5905 Jurkat cells induced with anti-CD3 and anti-CD28 antibodies.

data.datlinger_2021()

Transcriptomics measurements of 151788 nuclei of four cell lines.

data.dialogue_example()

Example dataset used in DIALOGUE vignettes.

data.dixit_2016()

Perturb-seq: scRNA-seq with pooled CRISPR-KO perturbations.

data.dixit_2016_raw()

Perturb-seq: scRNA-seq with pooled CRISPR-KO perturbations.

data.frangieh_2021()

Processed perturb-CITE-seq data with multi-modal RNA and protein single-cell profiling.

data.frangieh_2021_protein()

CITE-seq data of 218000 cells under 750 perturbations (only the surface protein data).

data.frangieh_2021_raw()

Raw Perturb-CITE-seq data with multi-modal RNA and protein single-cell profiling readout.

data.frangieh_2021_rna()

CITE-seq data of 218000 cells under 750 perturbations (only the transcriptomics data).

data.gasperini_2019_atscale()

Transcriptomics of 254974 chronic myeloid leukemia (K562) cells with CRISPRi perturbations.

data.gasperini_2019_highmoi()

K562 perturbed cells with 1119 candidate enhancers (only the high MOI part).

data.gasperini_2019_lowmoi()

K562 perturbed cells with 1119 candidate enhancers (only the low MOI part).

data.gehring_2019()

96-plex perturbation experiment on live mouse neural stem cells.

data.haber_2017_regions()

scRNA-seq data of mouse small intestinal epithelial cells under control and infection conditions (e.g. Salmonella).

data.kang_2018()

Processed multiplexing droplet-based single cell RNA-sequencing using genetic barcodes.

data.mcfarland_2020()

Response of various cell lines to a range of different drugs and CRISPRi perturbations.

data.norman_2019()

Processed single-cell, pooled CRISPR screening.

data.norman_2019_raw()

Raw single-cell, pooled CRISPR screening.

data.papalexi_2021()

ECCITE-seq dataset of 11 gRNAs generated from stimulated THP-1 cell line.

data.replogle_2022_k562_essential()

K562 cells transduced with CRISPRi (day 7 after transduction).

data.replogle_2022_k562_gwps()

K562 cells transduced with CRISPRi (day 8 after transduction).

data.replogle_2022_rpe1()

RPE1 cells transduced with CRISPRi (day 7 after transduction).

data.sc_sim_augur()

Simulated test dataset used in Augur example vignettes.

data.schiebinger_2019_16day()

Transcriptomes of 65781 iPSC cells collected over 10 time points in 2i or serum conditions (16-day time course).

data.schiebinger_2019_18day()

Transcriptomes of 259155 iPSC cells collected over 39 time points in 2i or serum conditions (18-day time course).

data.schraivogel_2020_tap_screen_chr8()

TAP-seq applied to K562 cells (only chromosome 8).

data.schraivogel_2020_tap_screen_chr11()

TAP-seq applied to K562 cells (only chromosome 11).

data.sciplex3_raw()

Raw sciplex3 perturbation dataset curated for perturbation modeling.

data.shifrut_2018()

CD8 T-cells from two donors for two conditions (SLICE and CROP-seq).

data.smillie_2019()

scRNA-seq data of the small intestine of mice under Ulcerative Colitis.

data.srivatsan_2020_sciplex2()

A549 cells exposed to four compounds.

data.srivatsan_2020_sciplex3()

Transcriptomes of 650000 A549, K562, and MCF7 cells exposed to 188 compounds.

data.srivatsan_2020_sciplex4()

A549 and MCF7 cells treated with pracinostat.

data.stephenson_2021_subsampled()

Processed 10X 5' scRNA-seq data from PBMC of COVID-19 patients and healthy donors.

data.tian_2019_day7neuron()

Transcriptomes of 20000 day 7 neurons targeted by 58 gRNAs.

data.tian_2019_ipsc()

Transcriptomics of 20000 iPSCs targeted by 58 sgRNAs.

data.tian_2021_crispra()

CROP-seq of 50000 neurons treated with 374 gRNAs (CRISPRa only).

data.tian_2021_crispri()

CROP-seq of 98000 neurons treated with 374 gRNAs (CRISPRi only).

data.weinreb_2020()

Mouse embryonic stem cells under different cytokines across time.

data.xie_2017()

Single-cell transcriptomics of 51448 cells generated with Mosaic-seq.

data.zhao_2021()

Multiplexed drug perturbation from freshly resected tumors.

Preprocessing

Guide Assignment

Guide assignment is essential for quality control in single-cell Perturb-seq data, ensuring accurate mapping of guide RNAs to cells for reliable interpretation of gene perturbation effects. pertpy provides a simple function to assign guides based on thresholds. Each cell is assigned to the most expressed gRNA if it has at least the specified number of counts.
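The rule stated above (take the most expressed gRNA, but only if it reaches the count threshold) can be sketched in plain Python. This is a minimal illustration and not pertpy's implementation: the function name `assign_max_guide`, the guide names, and the toy count table are all hypothetical.

```python
def assign_max_guide(counts, guide_names, threshold):
    """Assign each cell to its most expressed guide, or None if below threshold.

    counts: one list of raw guide counts per cell (hypothetical toy input).
    """
    assignments = []
    for cell_counts in counts:
        # Index of the guide with the highest raw count in this cell.
        best = max(range(len(guide_names)), key=lambda i: cell_counts[i])
        if cell_counts[best] >= threshold:
            assignments.append(guide_names[best])
        else:
            assignments.append(None)  # no guide reaches the threshold
    return assignments


guides = ["sgA", "sgB", "sgC"]  # hypothetical guide names
raw_counts = [
    [12, 0, 1],  # sgA dominates and passes the threshold
    [2, 3, 1],   # no guide reaches the threshold
    [0, 4, 9],   # sgC is the most expressed guide
]
assigned = assign_max_guide(raw_counts, guides, threshold=5)
```

Cells in which no guide reaches the threshold stay unassigned (`None` here), which is what lets the threshold act as a quality filter.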

preprocessing.GuideAssignment

Offers simple guide assignment based on count thresholds.

Example implementation:

import pertpy as pt
import scanpy as sc

mdata = pt.dt.papalexi_2021()
gdo = mdata.mod["gdo"]
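# Keep the raw counts in a separate layer before log-transforming X
gdo.layers["counts"] = gdo.X.copy()
sc.pp.log1p(gdo)

ga = pt.pp.GuideAssignment()
# Apply the threshold to the raw counts and store the result in a new layer
ga.assign_by_threshold(gdo, 5, layer="counts", output_layer="assigned_guides")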

ga.plot_heatmap(gdo, layer="assigned_guides")

See guide assignment tutorial for a more elaborate tutorial.

Tools

Differential gene expression

Differential gene expression involves the quantitative comparison of gene expression levels between two or more groups, such as different cell types, tissues, or conditions, to discern genes that are significantly up- or downregulated in response to specific biological contexts or stimuli. pertpy provides utilities to conduct differential gene expression tests through a common interface that supports complex designs and methods.

tools.DifferentialGeneExpression

Support for differential gene expression for scverse.

Pooled CRISPR screens

Perturbation assignment - Mixscape

CRISPR based screens can suffer from off-target effects but also limited efficacy of the guide RNAs. When analyzing CRISPR screen data, it is vital to know which perturbations were successful and which ones were not to accurately determine the effect of perturbations.

Mixscape first tries to remove confounding sources of variation, such as cell cycle or replicate effects, by calculating a perturbation signature. Next, it uses mixture models to determine which targeted cells were affected by the genetic perturbation (=KO) and which targeted cells were not (=NP). Finally, it visualizes similarities and differences across different perturbations.

See Characterizing the molecular regulation of inhibitory immune checkpoints with multimodal single-cell screens for more details on the pipeline.
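As a rough illustration of the first step, the sketch below computes a perturbation signature by subtracting from each cell the average profile of its nearest control cells, which is the idea described in the Mixscape publication. It is a plain-Python sketch under simplifying assumptions (Euclidean distance on a tiny toy matrix, hypothetical function and variable names) and not the code behind `ms.perturbation_signature`.

```python
import math


def perturbation_signature(cells, is_control, n_neighbors=2):
    """Subtract from each cell the mean profile of its nearest control cells."""
    controls = [cell for cell, ctrl in zip(cells, is_control) if ctrl]
    signatures = []
    for cell in cells:
        # Rank control cells by Euclidean distance to the current cell.
        # Note: a control cell counts itself among its own neighbors here.
        nearest = sorted(controls, key=lambda ctrl: math.dist(cell, ctrl))
        nearest = nearest[:n_neighbors]
        # Gene-wise mean over the selected control neighbors.
        mean_ctrl = [sum(values) / len(nearest) for values in zip(*nearest)]
        signatures.append([x - m for x, m in zip(cell, mean_ctrl)])
    return signatures


# Toy expression matrix: 3 cells x 2 genes; the last cell is perturbed.
cells = [[1.0, 1.0], [1.0, 3.0], [5.0, 2.0]]
is_control = [True, True, False]
signatures = perturbation_signature(cells, is_control, n_neighbors=2)
```

Variation shared with similar control cells cancels out in the difference, so what remains for the perturbed cell is mostly the shift attributable to the perturbation.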

tools.Mixscape()

Python implementation of Mixscape.

Example implementation:

import pertpy as pt

mdata = pt.dt.papalexi_2021()
ms = pt.tl.Mixscape()
ms.perturbation_signature(mdata["rna"], "perturbation", "NT", "replicate")
ms.mixscape(adata=mdata["rna"], control="NT", labels="gene_target", layer="X_pert")
ms.lda(adata=mdata["rna"], labels="gene_target", layer="X_pert", control="NT")
ms.plot_lda(adata=mdata["rna"], control="NT")

See mixscape tutorial for a more elaborate tutorial.

Compositional analysis

Compositional data analysis focuses on identifying and quantifying variations in cell type composition across different conditions or samples to uncover biological differences driven by changes in cellular makeup.

Generally, there are two ways of approaching this question:

  1. Without labeled groups, using graph-based approaches

  2. With labeled groups, using purely statistical approaches

For a more in-depth explanation we refer to the corresponding sc-best-practices compositional chapter.
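Compositional methods work with how many cells of each type a sample contributes. The following plain-Python sketch, with hypothetical sample and cell type labels, turns per-cell labels into per-sample proportions. It only illustrates the kind of input such methods start from and is not part of pertpy.

```python
from collections import Counter, defaultdict


def composition_per_sample(sample_ids, cell_types):
    """Return {sample: {cell_type: proportion}} from per-cell labels."""
    counts = defaultdict(Counter)
    for sample, cell_type in zip(sample_ids, cell_types):
        counts[sample][cell_type] += 1
    return {
        sample: {ct: n / sum(ct_counts.values()) for ct, n in ct_counts.items()}
        for sample, ct_counts in counts.items()
    }


# Hypothetical labels for eight cells from two samples.
samples = ["ctrl_1"] * 4 + ["treated_1"] * 4
types = [
    "Goblet", "Goblet", "Enterocyte", "Enterocyte",          # ctrl_1
    "Goblet", "Enterocyte", "Enterocyte", "Enterocyte",      # treated_1
]
proportions = composition_per_sample(samples, types)
```

Because the proportions within a sample sum to one, an increase in one cell type necessarily lowers the share of the others. This dependence is the reason dedicated compositional models are used instead of testing each cell type independently.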

Without labeled groups - Milo

Milo enables the exploration of differential abundance of cell types across different biological conditions or spatial locations. It employs a neighborhood-testing approach to statistically assess variations in cell type compositions, providing insights into the microenvironmental and functional heterogeneity within and across samples.

See Differential abundance testing on single-cell data using k-nearest neighbor graphs for details on the statistical framework.
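The neighborhood-testing idea can be sketched as follows: cells are grouped into (possibly overlapping) neighborhoods, and each neighborhood gets a vector of cell counts per sample that is then tested for differential abundance. The plain-Python sketch below builds only this count table. The neighborhoods are written by hand rather than derived from a k-nearest neighbor graph, and all names are hypothetical.

```python
from collections import Counter


def count_cells_per_neighborhood(neighborhoods, sample_of_cell, samples):
    """Build a neighborhoods x samples table of cell counts."""
    table = []
    for members in neighborhoods:
        per_sample = Counter(sample_of_cell[i] for i in members)
        table.append([per_sample.get(sample, 0) for sample in samples])
    return table


# Five cells from two hypothetical samples.
sample_of_cell = ["s1", "s1", "s2", "s2", "s2"]
# Two overlapping neighborhoods given as lists of cell indices.
neighborhoods = [[0, 1, 2], [2, 3, 4]]
nhood_counts = count_cells_per_neighborhood(
    neighborhoods, sample_of_cell, samples=["s1", "s2"]
)
```

A neighborhood whose counts differ systematically between the samples of two conditions is a candidate for differential abundance, without requiring cluster or cell type labels.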

tools.Milo()

Python implementation of Milo.

Example implementation:

import pertpy as pt
import scanpy as sc

adata = pt.dt.stephenson_2021_subsampled()
adata.obs["COVID_severity"] = adata.obs["Status_on_day_collection_summary"].copy()
# Inspect the patient-to-severity mapping (the result is displayed, not stored)
adata.obs[["patient_id", "COVID_severity"]].drop_duplicates()
# Exclude cells whose Status is "LPS"
adata = adata[adata.obs["Status"] != "LPS"].copy()

milo = pt.tl.Milo()
mdata = milo.load(adata)
sc.pp.neighbors(mdata["rna"], use_rep="X_scVI", n_neighbors=150, n_pcs=10)
milo.make_nhoods(mdata["rna"], prop=0.1)
mdata = milo.count_nhoods(mdata, sample_col="patient_id")
mdata["rna"].obs["Status"] = (
    mdata["rna"].obs["Status"].cat.reorder_categories(["Healthy", "Covid"])
)
milo.da_nhoods(mdata, design="~Status")

See milo tutorial for a more elaborate tutorial.

With labeled groups - scCODA and tascCODA

scCODA is designed to identify differences in cell type compositions from single-cell sequencing data across conditions for labeled groups. It employs a Bayesian hierarchical model and Dirichlet-multinomial distribution, using Markov chain Monte Carlo (MCMC) for inference, to detect significant shifts in cell type composition across conditions.

tascCODA extends scCODA to analyze compositional count data from single-cell sequencing studies, incorporating hierarchical tree information and experimental covariates. By integrating spike-and-slab Lasso penalization with latent tree-based parameters, tascCODA identifies differential abundance across hierarchical levels, offering parsimonious and predictive insights into compositional changes in cell populations.

See scCODA is a Bayesian model for compositional single-cell data analysis and tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data for more details.

tools.Sccoda(*args, **kwargs)

Statistical model for single-cell differential composition analysis with specification of a reference cell type.

tools.Tasccoda(*args, **kwargs)

Statistical model for tree-aggregated differential composition analysis (tascCODA, Ostner et al., 2021).

Example implementation:

import pertpy as pt

haber_cells = pt.dt.haber_2017_regions()
sccoda = pt.tl.Sccoda()
sccoda_data = sccoda.load(
    haber_cells,
    type="cell_level",
    generate_sample_level=True,
    cell_type_identifier="cell_label",
    sample_identifier="batch",
    covariate_obs=["condition"],
)
sccoda_data.mod["coda_salm"] = sccoda_data["coda"][
    sccoda_data["coda"].obs["condition"].isin(["Control", "Salmonella"])
].copy()

sccoda_data = sccoda.prepare(
    sccoda_data,
    modality_key="coda_salm",
    formula="condition",
    reference_cell_type="Goblet",
)
sccoda.run_nuts(sccoda_data, modality_key="coda_salm")
sccoda.summary(sccoda_data, modality_key="coda_salm")
sccoda.plot_effects_barplot(
    sccoda_data, modality_key="coda_salm", parameter="Final Parameter"
)

See sccoda tutorial, extended sccoda tutorial and tasccoda tutorial for more elaborate tutorials.

Multicellular and gene programs

Multicellular programs are organized interactions and coordinated activities among different cell types within a tissue, forming complex functional units that drive tissue-specific functions, responses to environmental changes, and pathological states. These programs enable a higher level of biological organization by integrating signaling pathways, gene expression, and cellular behaviors across the cellular community to maintain homeostasis and execute collective responses.

Multicellular programs - DIALOGUE

DIALOGUE identifies latent multicellular programs by mapping the data into a feature space where the cell type specific representations are correlated across different samples and environments. Next, DIALOGUE employs multi-level hierarchical modeling to identify genes that comprise the latent features.

This is a work in progress (!) Python implementation of DIALOGUE for the discovery of multicellular programs.

See DIALOGUE maps multicellular programs in tissue from single-cell or spatial transcriptomics data for more details on the methodology.

tools.Dialogue(sample_id, celltype_key, ...)

Python implementation of DIALOGUE.

Example implementation:

import pertpy as pt
import scanpy as sc

adata = pt.dt.dialogue_example()
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)


dl = pt.tl.Dialogue(
    sample_id="clinical.status",
    celltype_key="cell.subtypes",
    n_counts_key="nCount_RNA",
    n_mpcs=3,
)
adata, mcps, ws, ct_subs = dl.calculate_multifactor_PMD(adata, normalize=True)
all_results, new_mcps = dl.multilevel_modeling(
    ct_subs=ct_subs,
    mcp_scores=mcps,
    ws_dict=ws,
    confounder="gender",
)

See DIALOGUE tutorial for a more elaborate tutorial.

Enrichment

Enrichment tests for single-cell data assess whether specific biological pathways or gene sets are overrepresented in the expression profiles of individual cells, aiding in the identification of functional characteristics and cellular states. While pathway enrichment is a well-studied and commonly applied approach in single-cell RNA-seq, other data sources such as genes targeted by drugs can also be enriched.

This implementation of enrichment is designed to interoperate with MetaData and uses a simple hypergeometric test.
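For readers unfamiliar with the hypergeometric test, the sketch below computes the one-sided p-value for the overlap between a gene set and a list of selected genes using only the standard library. It is a generic illustration with hypothetical names and toy numbers; how pertpy's `Enrichment` class applies the test internally is not shown here.

```python
from math import comb


def hypergeom_pvalue(n_universe, n_set, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement.

    n_universe: number of genes considered in total
    n_set:      number of genes in the gene set of interest
    n_selected: number of genes in the selected list (e.g. marker genes)
    n_overlap:  number of selected genes that belong to the gene set
    """
    total = comb(n_universe, n_selected)
    upper = min(n_set, n_selected)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes overall, 4 in the set, 3 selected, 2 of them in the set.
p_value = hypergeom_pvalue(n_universe=10, n_set=4, n_selected=3, n_overlap=2)
```

A small p-value indicates that the selected genes overlap with the gene set more than expected by chance. With many gene sets, the resulting p-values would additionally need multiple testing correction.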

Example implementation:

import pertpy as pt
import scanpy as sc

adata = sc.datasets.pbmc3k_processed()

pt_enricher = pt.tl.Enrichment()
pt_enricher.score(adata)

See enrichment tutorial for a more elaborate tutorial.

Distances and Permutation Tests

In settings where many perturbations are applied, it is often unclear which perturbations had a strong effect and should be investigated further. Differential gene expression is one option to obtain candidate genes and p-values. Determining statistical distances between the perturbations and applying a permutation test is another option.

For more details on the concept and the e-distance in particular we refer to scPerturb: harmonized single-cell perturbation data.
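To make the two ingredients concrete, the sketch below computes an energy distance between two groups of cells and a permutation p-value for it in plain Python. It assumes plain Euclidean distances and includes self-pairs in the within-group terms; pertpy's `Distance` and `DistanceTest` may differ in these details, and the function names and toy coordinates are hypothetical.

```python
import math
import random
from itertools import product


def mean_pairwise(xs, ys):
    """Mean Euclidean distance over all pairs (x, y) with x in xs, y in ys."""
    return sum(math.dist(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))


def e_distance(xs, ys):
    """Energy distance: twice the between-group term minus both within-group terms."""
    return 2 * mean_pairwise(xs, ys) - mean_pairwise(xs, xs) - mean_pairwise(ys, ys)


def permutation_pvalue(xs, ys, n_perms=200, seed=0):
    """Share of label shufflings with an E-distance at least as large as observed."""
    rng = random.Random(seed)
    observed = e_distance(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perms):
        rng.shuffle(pooled)  # reassign cells to the two groups at random
        if e_distance(pooled[: len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perms + 1)  # add-one correction avoids p = 0


# Toy 2D embeddings (e.g. two principal components) for two groups of cells.
control = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
perturbed = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]
p_value = permutation_pvalue(control, perturbed)
```

The distance is zero when both groups are identical and grows as they separate. The permutation test asks how often a random split of the pooled cells yields a distance as large as the observed one.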

tools.Distance([metric, layer_key, ...])

Distance class, used to compute distances between groups of cells.

tools.DistanceTest(metric[, n_perms, ...])

Run permutation tests using a distance of choice between groups of cells.

Example implementation:

import pertpy as pt

adata = pt.dt.distance_example()

# Pairwise distances
distance = pt.tl.Distance(metric="edistance", obsm_key="X_pca")
pairwise_edistance = distance.pairwise(adata, groupby="perturbation")

# E-test (Permutation test using E-distance)
etest = pt.tl.DistanceTest(
    metric="edistance", obsm_key="X_pca", correction="holm-sidak"
)
tab = etest(adata, groupby="perturbation", contrast="control")

See distance tutorial and distance tests tutorial for more elaborate tutorials.

Response prediction

Response prediction describes computational models that predict how individual cells or cell populations will respond to specific treatments, conditions, or stimuli based on their gene expression profiles, enabling insights into cellular behaviors and potential therapeutic strategies. Such approaches can also order perturbations by their effect on groups of cells.

Rank perturbations - Augur

Augur aims to rank or prioritize cell types according to their response to experimental perturbations given high-dimensional single-cell sequencing data. The basic idea is that, in the space of molecular measurements, cells of a strongly responding cell type are more easily separated into perturbed and unperturbed than cells of a type with little or no response. This separability is quantified by measuring how well experimental labels (e.g. treatment and control) can be predicted within each cell type. Augur trains a machine learning model that predicts experimental labels for each cell type in multiple cross-validation runs and then prioritizes cell types according to metric scores measuring the accuracy of the model. For categorical data, the area under the curve is the default metric; for numerical data, the concordance correlation coefficient is used. Both serve as a proxy for model accuracy, which in turn approximates perturbation response.

For more details we refer to Cell type prioritization in single-cell data.
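The ranking step can be illustrated without training any model. The sketch below assumes that a classifier has already produced held-out scores for the cells of each cell type (the scores, cell type names, and function name are hypothetical), computes the area under the ROC curve per cell type, and sorts the cell types by it. It is a conceptual sketch and not Augur's implementation.

```python
def auc(labels, scores):
    """Probability that a random perturbed cell scores above a random control cell."""
    perturbed = [s for label, s in zip(labels, scores) if label == 1]
    control = [s for label, s in zip(labels, scores) if label == 0]
    # Count pairwise wins; ties contribute one half.
    wins = sum((p > c) + 0.5 * (p == c) for p in perturbed for c in control)
    return wins / (len(perturbed) * len(control))


# Hypothetical held-out classifier scores per cell type (1 = treated, 0 = control).
held_out = {
    "responsive": ([0, 0, 0, 1, 1, 1], [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]),
    "unresponsive": ([0, 0, 0, 1, 1, 1], [0.4, 0.6, 0.5, 0.5, 0.4, 0.6]),
}
auc_by_type = {ct: auc(labels, scores) for ct, (labels, scores) in held_out.items()}
# Cell types with the highest separability come first.
ranking = sorted(auc_by_type, key=auc_by_type.get, reverse=True)
```

An AUC close to 1 means treated and control cells of that type are easy to tell apart, whereas an AUC around 0.5 means the classifier does no better than chance, suggesting little or no response.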

Example implementation:

import pertpy as pt

adata = pt.dt.sc_sim_augur()
ag = pt.tl.Augur(estimator="random_forest_classifier")
adata = ag.load(adata)
adata, results = ag.predict(adata)

# metrics for each cell type
results["summary_metrics"]

See augur tutorial for a more elaborate tutorial.

tools.Augur

Python implementation of Augur.

Gene expression prediction with scGen

scGen is a deep generative model that leverages autoencoders and adversarial training to integrate single-cell RNA sequencing data from different conditions or tissues, enabling the generation of synthetic single-cell data for cross-condition analysis and predicting cell-type-specific responses to perturbations. See scGen predicts single-cell perturbation responses for more details.

tools.SCGEN(adata[, n_hidden, n_latent, ...])

JAX implementation of the scGen model for batch removal and perturbation prediction.

Example implementation:

import pertpy as pt

train = pt.dt.kang_2018()

train_new = train[
    ~((train.obs["cell_type"] == "CD4T") & (train.obs["condition"] == "stimulated"))
]
train_new = train_new.copy()

pt.tl.SCGEN.setup_anndata(train_new, batch_key="condition", labels_key="cell_type")
scgen = pt.tl.SCGEN(train_new)
scgen.train(max_epochs=100, batch_size=32)

pred, delta = scgen.predict(
    ctrl_key="control", stim_key="stimulated", celltype_to_predict="CD4T"
)
pred.obs["condition"] = "pred"

See scgen tutorial for a more elaborate tutorial.

Causal perturbation analysis with CINEMA-OT

CINEMA-OT is a causal framework for perturbation effect analysis to identify individual treatment effects and synergy at the single cell level. CINEMA-OT separates confounding sources of variation from perturbation effects to obtain an optimal transport matching that reflects counterfactual cell pairs. These cell pairs represent causal perturbation responses permitting a number of novel analyses, such as individual treatment effect analysis, response clustering, attribution analysis, and synergy analysis. See Causal identification of single-cell experimental perturbation effects with CINEMA-OT for more details.

tools.Cinemaot()

CINEMA-OT is a causal framework for perturbation effect analysis to identify individual treatment effects and synergy.

Example implementation:

import pertpy as pt

adata = pt.dt.cinemaot_example()

model = pt.tl.Cinemaot()
de = model.causaleffect(
    adata,
    pert_key="perturbation",
    control="No stimulation",
    return_matching=True,
    thres=0.5,
    smoothness=1e-5,
    eps=1e-3,
    solver="Sinkhorn",
    preweight_label="cell_type0528",
)

See CINEMA-OT tutorial for a more elaborate tutorial.

Perturbation space

Perturbation spaces depart from the individualistic perspective of cells and instead organize cells into cohesive ensembles. This specialized space makes it possible to understand the collective impact of perturbations on cells. pertpy offers various modules for calculating and evaluating perturbation spaces that are based on either summary statistics or clusters.
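As a minimal example of a summary-statistics-based space, the sketch below collapses a per-cell embedding into one centroid per perturbation, so that each perturbation becomes a single point. It is written in plain Python with hypothetical names and toy values and only illustrates the idea, not pertpy's classes.

```python
from collections import defaultdict


def centroids_per_perturbation(embedding, perturbations):
    """Average the embedding vectors of all cells sharing a perturbation label."""
    groups = defaultdict(list)
    for vector, label in zip(embedding, perturbations):
        groups[label].append(vector)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in groups.items()
    }


# Toy 2D embedding of four cells with hypothetical perturbation labels.
embedding = [[0.0, 0.0], [2.0, 2.0], [4.0, 0.0], [6.0, 2.0]]
labels = ["NT", "NT", "KO", "KO"]
centroids = centroids_per_perturbation(embedding, labels)
```

Distances or clusterings computed on such per-perturbation points then describe relationships between perturbations rather than between individual cells.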

tools.MLPClassifierSpace()

Fits an ANN classifier to the data and takes the feature space (weights in the last layer) as embedding.

tools.LRClassifierSpace()

Fits a logistic regression model to the data and takes the feature space as embedding.

tools.CentroidSpace()

Computes the centroids per perturbation of a pre-computed embedding.

tools.DBSCANSpace()

Clusters the given data using DBSCAN.

tools.KMeansSpace()

Computes K-Means clustering of the expression values.

tools.PseudobulkSpace()

Determines pseudobulks using decoupler.

Example implementation:

import pertpy as pt

mdata = pt.dt.papalexi_2021()
ps = pt.tl.PseudobulkSpace()
ps_adata = ps.compute(
    mdata["rna"],
    target_col="gene_target",
    groups_col="gene_target",
    mode="mean",
    min_cells=0,
    min_counts=0,
)

See perturbation space tutorial for a more elaborate tutorial.

MetaData

MetaData provides tooling to annotate perturbations by querying databases. Such metadata can aid with the development of biologically informed models and can be used for enrichment tests.

Cell line

This module allows for the retrieval of various types of information related to cell lines, including cell line annotation, bulk RNA and protein expression data.

Available databases for cell line metadata:

Compound

The Compound module enables the retrieval of various types of information related to compounds of interest, including the most common synonym, PubChem ID, and canonical SMILES.

Available databases for compound metadata:

Mechanism of Action

This module aims to retrieve metadata of mechanism of action studies related to perturbagens of interest, depending on the molecular targets.

Available databases for mechanism of action metadata:

Drug

This module allows for the retrieval of drug target information.

Available databases for drug metadata:

metadata.CellLine()

Utilities to fetch cell line metadata.

metadata.Compound()

Utilities to fetch metadata for compounds.

metadata.Moa()

Utilities to fetch metadata for mechanism of action studies.

metadata.Drug()

Utilities to fetch metadata for drug studies.

Plots

Every tool has a set of plotting functions that start with plot_. We are considering offering more general plots at a later point.