Differential gene expression#

Differential gene expression (DGE) analysis identifies genes that show statistically significant differences in expression levels across distinct cell populations or conditions. This analysis helps in identifying which cell types are most affected by a condition of interest such as a disease, and characterizing their functional signatures.

Pertpy provides a unified API to several families of DGE models so you can pick the one that fits your design:

  • Simple statistical tests (TTest, WilcoxonTest) compare two groups directly on the expression matrix. They are fast, assumption-light, and a reasonable choice when you only have a single binary condition and no covariates to adjust for.

  • Pseudobulk + count-based GLMs (EdgeR, PyDESeq2) aggregate cells into per-sample pseudobulks and fit a negative-binomial GLM. This is the recommended approach for multi-sample studies with covariates and the current best practice for scRNA-seq DE because it controls false positives caused by treating individual cells as independent replicates. EdgeR calls into the R package via rpy2; PyDESeq2 is a pure-Python reimplementation of DESeq2.

  • Generic linear models (Statsmodels) wrap statsmodels and expose OLS, robust linear models, and GLMs. Use this when your response variable does not look like a count (for example, log-normalised expression or a continuous score).

In the following tutorial we will demonstrate how the edgeR and PyDESeq2 interfaces can be used to model complex interactions using the triple-negative breast cancer (TNBC) Zhang dataset.

Environment setup#

# pip install 'pertpy[de]' decoupler
import warnings

import decoupler as dc
import pertpy as pt
import scanpy as sc

warnings.filterwarnings("ignore")

Dataset#

The Zhang dataset comprises scRNA-seq from 22 patients with advanced TNBC, treated with paclitaxel alone or in combination with the anti-PD-L1 therapy atezolizumab. 78 tumor biopsies and blood samples were collected before treatment (‘pre-treatment’), after treatment (‘post-treatment), and during disease progression (‘progression’). We focused on the transcriptomic data which encompasses approximately 489,490 high-quality immune cells with 27085 measured genes across 99 high resolution cell types.

adata = pt.dt.zhang_2021()

To keep runtimes reasonable and simplify the interpretation, we will only be working with the tissue samples.

adata = adata[adata.obs["Origin"] == "t", :].copy()
adata.obs.head(5)
Sample Patient Origin Tissue Efficacy Group Treatment Number of counts Number of genes Major celltype Cluster
Cell barcode
AAACCTGAGAGTCTGG.Pre_P007_t Pre_P007_t P007 t lymph_node PR Pre-treatment Anti-PD-L1+Chemo 2,346 748 T cell t_Tn-LEF1
AAACCTGAGCACACAG.Pre_P007_t Pre_P007_t P007 t lymph_node PR Pre-treatment Anti-PD-L1+Chemo 1,242 679 ILC cell t_ILC1-IL32
AAACCTGCAAGTCATC.Pre_P007_t Pre_P007_t P007 t lymph_node PR Pre-treatment Anti-PD-L1+Chemo 1,593 597 T cell t_Tn-LEF1
AAACCTGGTTGGGACA.Pre_P007_t Pre_P007_t P007 t lymph_node PR Pre-treatment Anti-PD-L1+Chemo 2,404 1,203 T cell t_CD8_Tem-GZMK
AAACCTGTCAGCTCGG.Pre_P007_t Pre_P007_t P007 t lymph_node PR Pre-treatment Anti-PD-L1+Chemo 2,034 682 T cell t_Tn-LEF1

When conducting differential gene expression analysis, it is important to understand the dataset well. We will therefore explore the various co-variates first.

adata.obs["Patient"].value_counts()
Patient
P019    37150
P013    22526
P012    14570
P025    14505
P002    10909
P022    10614
P023    10123
P005     8175
P018     7936
P020     7240
P007     7237
P003     4349
P016     4009
P017     4006
P004     3887
P010        8
Name: count, dtype: int64

We remove Patient 10 who has had very few sequenced cells.

adata = adata[~adata.obs["Patient"].isin(["P010"])]
adata.obs["Sample"].value_counts()
Sample
Pre_P019_t     27656
Prog_P013_t    10153
Post_P013_t     9975
Post_P019_t     9494
Post_P025_t     8072
Pre_P022_t      7689
Pre_P012_t      7682
Post_P012_t     6888
Pre_P020_t      6863
Post_P023_t     6813
Pre_P025_t      6433
Post_P018_t     5847
Pre_P002_t      5483
Post_P002_t     5426
Pre_P007_t      4999
Pre_P005_t      4432
Post_P003_t     4349
Pre_P004_t      3887
Pre_P016_t      3862
Post_P005_t     3743
Pre_P023_t      3310
Post_P022_t     2925
Pre_P017_t      2428
Pre_P013_t      2398
Prog_P007_t     2238
Pre_P018_t      2089
Post_P017_t     1578
Post_P020_t      377
Post_P016_t      147
Name: count, dtype: int64

Notably, two samples have less than 1000 sequenced cells. We will not remove them but keep it in mind.

Generally cells can behave quite differently from tissue to tissue. Therefore, it can be useful to conduct DGE tests per tissue and not all tissues jointly. To keep things simple, we will not conduct the tests per tissue separately here.

adata.obs["Tissue"].value_counts()
Tissue
lymph_node    60725
breast        50418
liver         30701
chest_wall    23154
brain          2238
Name: count, dtype: int64

Each patient was treated with a different type of therapy and classified based on efficacy of the treatment as responders (PR: partial response) or non-responders (SD: stable disease, PD: progressive disease).

adata.obs["Treatment"].value_counts()
Treatment
Anti-PD-L1+Chemo    89943
Chemo               77293
Name: count, dtype: int64
adata.obs["Efficacy"].value_counts()
Efficacy
PR    99337
SD    63550
PD     4349
Name: count, dtype: int64

Samples were collected both before and after treatment

adata.obs["Group"].value_counts()
Group
Pre-treatment     89211
Post-treatment    65634
Progression       12391
Name: count, dtype: int64

The dataset was annotated with two levels of cell type annotation:

adata.obs["Major celltype"].value_counts()
Major celltype
T cell          100155
B cell           35588
Myeloid cell     19878
ILC cell         11615
Name: count, dtype: int64
adata.obs["Cluster"].value_counts()
Cluster
t_Tn-LEF1           19046
t_CD8_Tem-GZMK      15371
t_Bmem-CD27         12639
t_CD4_Tcm-LMNA      10886
t_CD8-CXCL13        10689
t_Bn-TCL1A           9695
t_pB-IGHG1           8505
t_CD4_Tact-XIST      8497
t_CD4_Treg-FOXP3     7893
t_CD8_Trm-ZNF683     6456
t_CD4-CXCL13         6282
t_CD8_MAIT-KLRB1     4791
Mix                  4146
t_CD8_Teff-GNLY      3517
t_Bfoc-MKI67         2717
t_Tact-IFI6          2487
t_macro-IL1B         2148
t_macro-IL1RN        1973
t_Bfoc-NEIL1         1865
t_ILC3-AREG          1772
t_mono-FCN1          1684
t_ILC1-VCAM1         1613
t_macro-IGFBP7       1493
t_macro-MGP          1338
t_ILC1-FGFBP2        1309
t_ILC1-SELL          1297
t_ILC1-CD160         1097
t_macro-IFI27        1051
t_macro-TUBA1B       1050
t_ILC1-IL32           972
t_macro-CX3CR1        947
t_ILC2-SPON2          942
t_cDC2-CLEC10A        926
t_macro-MKI67         873
t_ILC1-IFNG           751
t_macro-SPP1          742
t_macro-CCL2          733
t_macro-MMP9          664
t_ILC1-CX3CR1         636
t_mono-S100A89        604
t_macro-CFD           596
t_ILC1-ZNF683         583
t_macro-SLC40A1       577
t_pDC-LILRA4          576
t_macro-FOLR2         491
t_ILC1-GZMK           424
t_cDC1-CLEC9A         273
t_cDC2-FCGR2B         261
t_mast-TPSAB1         250
t_cDC2-CD207          208
t_Bmem-MKI67          167
t_macro-CD24          148
t_mono-SMIM25         139
t_mDC-LAMP3           133
t_ILC3-IL7R           128
t_Tprf-MKI67           94
t_ILC1-CNOT2           91
Name: count, dtype: int64

We remove the Mix cell type analogously to the original publication.

adata = adata[~adata.obs["Cluster"].isin(["Mix"])]

For the remainder of the analysis we will work with the Cluster level because the Major celltype level is not fine-grained enough.

We store the raw counts in a counts layer because most DGE models expect raw counts.

adata.layers["counts"] = adata.X.copy()

Pseudobulks#

To perform DE analysis on our single-cell dataset, we first create pseudobulk samples by aggregating expression counts of cells within each group and sample of interest. This approach allows us to leverage statistical methods originally developed for bulk RNA-seq data, which is the recommended best practice to minimizing false positives in scRNA-seq DE analysis.

For each patient we create 1 pseudobulk sample per cell type by summing up the gene counts within each subpopulation.

pdata = dc.pp.pseudobulk(adata, sample_col="Patient", groups_col="Cluster", layer="counts", mode="sum")
dc.pp.filter_samples(pdata, inplace=True)
pdata
AnnData object with n_obs × n_vars = 457 × 27085
    obs: 'Patient', 'Cluster', 'Origin', 'Efficacy', 'Treatment', 'Major celltype', 'psbulk_cells', 'psbulk_counts'
    layers: 'psbulk_props'
dc.pl.filter_samples(pdata, groupby=["Patient", "Major celltype"], figsize=(12, 4))
../../_images/e240d62645afed81ef937474464b9379210f712badd10220086c3d69b0b7800b.png

Axes of variation#

The validity of differential gene expression results highly depends on the capture of the major axis of variations in the statistical model. Intermediate data exploration steps such as principal component analysis (PCA) or multidimensional scaling (MDS) on pseudobulk samples allow for the identification of the sources of variation and thus can guide the construction of corresponding design and contrast matrices that model the data.

Now that we have generated the pseudobulk profiles for each patient and each cell type, let’s explore the variability between them. For that, we will first do some simple preprocessing and calculate a PCA.

pdata.layers["counts"] = pdata.X.copy()

sc.pp.normalize_total(pdata, target_sum=1e4)
sc.pp.log1p(pdata)
sc.pp.scale(pdata, max_value=10)
sc.pp.pca(pdata)
# Return raw counts to X
dc.pp.swap_layer(pdata, "counts", inplace=True)
sc.pl.pca(pdata, color=["Patient", "Treatment", "Efficacy", "Cluster"], ncols=1, size=300)
sc.pl.pca_variance_ratio(pdata)
../../_images/8e4038212a2f3c132c21a83c70da4950e1cfe66ee78de5a5991392d7286984c0.png ../../_images/2eddd648194d818771c6f07db8c313af0998b0eb45fd3e09b71f9ae2504e1b6f.png

It seems like the four first components explain most of the variance and they easily separate most cell types from one another. In contrast, the principle components do not seem to be associated with Patient, Efficacy or Treatment.

To have a better overview of the association of PCs with sample metadata, let’s perform ANOVA on each PC and see whether they are significantly associated with any technical or biological annotations of our samples

pdata_subset = pdata.copy()
pdata_subset.obs = pdata.obs[["Patient", "Treatment", "Efficacy", "Cluster"]]

dc.tl.rankby_obsm(
    pdata_subset,
    key="X_pca",
    uns_key="pca_anova",
)
dc.pl.obsm(
    pdata_subset,
    key="pca_anova",
    names=["Patient", "Treatment", "Efficacy"],
    titles=["Principle component scores", "Adjusted p-values from ANOVA"],
    cmap_obs={},
)
../../_images/9bffd8808c75c0a9145a7e6aa4eeea43582aae1ae41f4cd99fa1350e7082b1c3.png

Differential expression testing with edgeR#

The edger interface further requires edger to be installed in R (BiocManager::install("edgeR")).

A linear model needs a design matrix that encodes, for each sample, the values of the covariates we want to control for or test against. Pertpy lets you describe the design with a Wilkinson formula — the same syntax used by R and patsy. A few examples:

  • ~ Treatment — intercept plus a coefficient per treatment level (the baseline level is absorbed into the intercept).

  • ~ Treatment + Efficacy — both covariates as additive, independent effects.

  • ~ Treatment + Efficacy + Treatment:Efficacy (equivalently ~ Treatment * Efficacy) — additive effects plus an interaction term that asks whether the effect of one covariate depends on the level of the other.

Here we start with the additive model: the type of treatment (Treatment) and response to the treatment (Efficacy) contribute independently to gene expression. By doing so, we can evaluate:

  1. How much of the variation in gene expression can be attributed to differences in drug efficacy, independent of the type of treatment.

  2. How much of the variation is due to the type of treatment, independent of the efficacy of the drug.

This setup helps in understanding not just whether a treatment works, but how its effectiveness might vary or be influenced by the inherent efficacy of the drug.

edgr = pt.tl.EdgeR(pdata, design="~Efficacy+Treatment")
edgr.fit()
• Calculating NormFactors
• Estimating Dispersions
• Fitting linear model

Fitting the model gave us one coefficient per term in the design. To turn those coefficients into a biological comparison we evaluate a contrast: a vector with one entry per coefficient that says “take this combination of fitted effects and test whether it differs from zero”. For a simple two-group comparison along a single column, contrast() builds the right vector automatically. We will assemble more complex contrasts by hand further down.

res_df = edgr.test_contrasts(edgr.contrast(column="Treatment", baseline="Chemo", group_to_compare="Anti-PD-L1+Chemo"))
res_df.head(10)
variable log_fc logCPM F p_value adj_p_value contrast
0 HBB 1.165896 9.259467 494.497680 4.767029e-76 1.291150e-71 None
1 RPS26 1.365747 10.339090 356.827260 1.354675e-59 1.834569e-55 None
2 FDCSP 4.726368 5.455406 279.700431 1.130304e-52 1.020476e-48 None
3 IGKV1D-16 -2.125786 5.300726 245.219864 1.787394e-46 1.210289e-42 None
4 AZGP1 -2.878053 3.409497 168.444076 1.330101e-34 7.205157e-31 None
5 AC135068.2 -3.971291 3.342327 159.553825 1.426533e-32 6.439607e-29 None
6 CCL19 3.816752 4.235494 139.161147 2.651780e-29 1.014408e-25 None
7 IGLV5-48 -4.362628 5.474016 140.204022 2.996222e-29 1.014408e-25 None
8 HTN1 6.556713 3.908961 162.661491 1.031716e-28 3.104891e-25 None
9 IGHV3-64 -4.136611 4.904814 135.687694 1.856920e-28 5.029468e-25 None

The result table has one row per gene. The key columns to look at:

  • variable — gene symbol.

  • log_fc — log2 fold change between the two contrasted groups. Positive values mean the gene is up in group_to_compare relative to baseline; negative values mean it is down.

  • p_value — raw p-value from the test that log_fc differs from zero. Useful for diagnostics, but do not threshold on this directly.

  • adj_p_value — p-value after Benjamini–Hochberg adjustment for multiple testing. This is the value to threshold on (commonly < 0.05).

  • contrast — identifier of the contrast when multiple are tested at once; None here because we only ran one contrast.

A standard summary is “significantly differentially expressed” = adj_p_value < 0.05 and |log_fc| above some effect-size threshold (the volcano plot below uses log2fc_thresh for the latter).

The set of differentially expressed genes can now be used for downstream tasks or as a first step, plotted in a volcano plot.

edgr.plot_volcano(res_df, log2fc_threshold=0)
../../_images/275f9c5be396e25808b8d41b1384c522d0a8f410437fde7727231f6d8bd8d50f.png

Once you have a ranked gene table, a typical next step is gene-set / pathway enrichment to turn per-gene statistics into per-pathway statistics. decoupler integrates well with the result tables produced here. For example, you can pass the log_fc column from res_df (indexed by variable) into decoupler.mt.ulm together with a prior-knowledge network from OmniPath (collectri(), progeny(), hallmark(), …) to score transcription-factor or pathway activities. We do not run an enrichment analysis here to keep this tutorial focused on the DE step; decoupler’s bulk functional analysis tutorial is a good starting point.

We identified genes that are differentially expressed between the two treatments. Next, we might be interested in looking at the expression levels of these genes across different subgroups. Those subgroups could be different patients or, as in this case, different efficacy groups. For this, we can use the plot_paired function, which enables comparing expression levels between paired samples. Note that the pairing must have been considered in the model design.

edgr.plot_paired(pdata, results_df=res_df, n_top_vars=8, groupby="Treatment", pairedby="Efficacy")
• Performing pseudobulk for paired samples
! 1 unpaired samples removed
../../_images/22bb636db932fe674bbdca4dd3108f5c621229598eec5218e65327febee91337.png

Furthermore, we can also compare the gene expression levels between the different efficacy groups. For this, we use the compare_groups function. Specifically, we will use the “Stable Disease” (SD) group as the baseline and compare it to the “Partial Response” (PR) and “Progressive Disease” (PD) groups.

res_df = edgr.compare_groups(pdata, column="Efficacy", baseline="SD", groups_to_compare=["PR", "PD"])
edgr.plot_multicomparison_fc(res_df, figsize=(12, 1.5))
• Calculating NormFactors
• Estimating Dispersions
• Fitting linear model
../../_images/7d70681667cd5e15158473bc7d973a4841a125ec5e7b8b4382311978ecddf60b.png

Differential expression testing with PyDESeq2#

The interface of PyDESeq2 in pertpy is very similar.

pds2 = pt.tl.PyDESeq2(adata=pdata, design="~Efficacy+Treatment")
pds2.fit()
Using None as control genes, passed at DeseqDataSet initialization
Fitting size factors...
... done in 0.15 seconds.
Fitting dispersions...
... done in 5.02 seconds.

Fitting dispersion trend curve...
... done in 0.31 seconds.

Fitting MAP dispersions...
... done in 4.52 seconds.

Fitting LFCs...
... done in 5.51 seconds.

Calculating cook's distance...
... done in 0.46 seconds.

Replacing 729 outlier genes.

Fitting dispersions...
... done in 0.17 seconds.

Fitting MAP dispersions...
... done in 0.17 seconds.

Fitting LFCs...
... done in 0.27 seconds.
res_df = pds2.test_contrasts(pds2.contrast(column="Treatment", baseline="Chemo", group_to_compare="Anti-PD-L1+Chemo"))
Log2 fold change & Wald test p-value, contrast vector: [ 0.  0.  0. -1.]
             baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
AL627309.1   0.027626        0.627593  0.428452  1.464793  0.142977       NaN
AL669831.5   0.448998        0.069263  0.199917  0.346458  0.728999  0.853689
FAM87B       0.018980        0.526400  0.594716  0.885128  0.376087       NaN
LINC00115    1.599386       -0.190844  0.119332 -1.599273  0.109760  0.278756
NOC2L       12.462982       -0.086245  0.059518 -1.449061  0.147321  0.338361
...               ...             ...       ...       ...       ...       ...
AP001189.4   0.000000             NaN       NaN       NaN       NaN       NaN
AC011595.1   0.000000             NaN       NaN       NaN       NaN       NaN
TRDJ3        0.000000             NaN       NaN       NaN       NaN       NaN
GOLGA6L1     0.000000             NaN       NaN       NaN       NaN       NaN
AC008569.1   0.000000             NaN       NaN       NaN       NaN       NaN

[27085 rows x 6 columns]
Running Wald tests...
... done in 0.88 seconds.
res_df.head(10)
variable baseMean log_fc lfcSE stat p_value adj_p_value contrast
0 RPS26 485.923196 1.326838 0.061919 21.428711 7.215009e-102 1.013492e-97 None
1 FDCSP 15.211450 4.273903 0.335305 12.746295 3.269187e-37 2.296114e-33 None
2 AZGP1 3.718152 -2.709331 0.246142 -11.007204 3.527812e-28 1.651839e-24 None
3 IGLV1-51 83.891787 4.380368 0.405676 10.797710 3.528953e-27 1.239280e-23 None
4 POLR2J3 4.443091 -0.851760 0.081243 -10.484075 1.022415e-25 2.872371e-22 None
5 FAM118A 15.566632 1.327481 0.130264 10.190700 2.181878e-24 5.108140e-21 None
6 MT-ATP6 664.679405 -0.544051 0.056291 -9.665012 4.245693e-22 8.519892e-19 None
7 NDUFB1 44.043809 0.768386 0.080136 9.588463 8.940693e-22 1.391858e-18 None
8 RPS29 743.770550 0.770006 0.080310 9.587931 8.986940e-22 1.391858e-18 None
9 HLA-B 1814.364254 -0.573257 0.059852 -9.577850 9.908580e-22 1.391858e-18 None
pds2.plot_volcano(res_df, log2fc_threshold=0)
NaNs encountered, dropping rows with NaNs
../../_images/4122d78cb51858fff1c2f49b83cb8d2ad056490ebeb5a1eeca45b707097ee742.png

We can also plot the fold changes of the top differentially expressed genes:

pds2.plot_fold_change(res_df, n_top_vars=15)
../../_images/cbca9830e1601d9b075512324ee46a92fbb1f3218ee4898da54dcb8e3969c555.png

As already done for edgeR, we can also use PyDESeq2 to compare the gene expression levels between the different efficacy groups.

res_df = pds2.compare_groups(pdata, column="Efficacy", baseline="SD", groups_to_compare=["PR", "PD"])
edgr.plot_multicomparison_fc(res_df, figsize=(12, 1.5))
Using None as control genes, passed at DeseqDataSet initialization
Log2 fold change & Wald test p-value, contrast vector: [ 0.  1. -1.]
             baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
AL627309.1   0.027626       -0.465182  0.648121 -0.717740  0.472918       NaN
AL669831.5   0.448998       -0.385964  0.207275 -1.862088  0.062591  0.290948
FAM87B       0.018980       -0.281550  0.850548 -0.331022  0.740628       NaN
LINC00115    1.599386       -0.253359  0.122537 -2.067623  0.038675  0.226483
NOC2L       12.462982       -0.067935  0.058906 -1.153274  0.248798  0.576505
...               ...             ...       ...       ...       ...       ...
AP001189.4   0.000000             NaN       NaN       NaN       NaN       NaN
AC011595.1   0.000000             NaN       NaN       NaN       NaN       NaN
TRDJ3        0.000000             NaN       NaN       NaN       NaN       NaN
GOLGA6L1     0.000000             NaN       NaN       NaN       NaN       NaN
AC008569.1   0.000000             NaN       NaN       NaN       NaN       NaN

[27085 rows x 6 columns]
Log2 fold change & Wald test p-value, contrast vector: [ 0.  0. -1.]
             baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
AL627309.1   0.027626        0.336716  1.425541  0.236203  0.813276       NaN
AL669831.5   0.448998       -0.205091  0.553761 -0.370360  0.711114       NaN
FAM87B       0.018980        0.544483  1.847643  0.294691  0.768230       NaN
LINC00115    1.599386       -0.224591  0.342215 -0.656288  0.511639  0.673679
NOC2L       12.462982       -0.241951  0.151775 -1.594146  0.110903  0.254624
...               ...             ...       ...       ...       ...       ...
AP001189.4   0.000000             NaN       NaN       NaN       NaN       NaN
AC011595.1   0.000000             NaN       NaN       NaN       NaN       NaN
TRDJ3        0.000000             NaN       NaN       NaN       NaN       NaN
GOLGA6L1     0.000000             NaN       NaN       NaN       NaN       NaN
AC008569.1   0.000000             NaN       NaN       NaN       NaN       NaN

[27085 rows x 6 columns]
Fitting size factors...
... done in 0.17 seconds.
Fitting dispersions...
... done in 3.66 seconds.

Fitting dispersion trend curve...
... done in 0.34 seconds.

Fitting MAP dispersions...
... done in 4.54 seconds.

Fitting LFCs...
... done in 5.96 seconds.

Calculating cook's distance...
... done in 0.52 seconds.

Replacing 763 outlier genes.

Fitting dispersions...
... done in 0.22 seconds.

Fitting MAP dispersions...
... done in 0.22 seconds.

Fitting LFCs...
... done in 0.30 seconds.

Running Wald tests...
... done in 1.33 seconds.
Running Wald tests...
... done in 1.13 seconds.
../../_images/f78638b1359e75b85aef699e87201e9f2f75719858e282e4c3aa848bd61e9c98.png

Model interaction between covariates#

We might be interested in more complex comparisons between patient groups. For example in this study instead of assuming that the type of treatment and its efficacy affect gene expression independently, we can test whether the gene expression changes associated to efficacy might depend on the type of treatment. To do this with a linear model, we model the interaction between the treatment and efficacy, following the syntax from Wilkinson formulas.

# Exclude patient with progressive disease, observed only for one treatment condition
pdata2 = pdata[pdata.obs["Efficacy"] != "PD"].copy()

# Fit DE model with interaction term
pds2 = pt.tl.PyDESeq2(adata=pdata2, design="~ Efficacy + Treatment + Efficacy*Treatment")
pds2.fit()
Using None as control genes, passed at DeseqDataSet initialization
Fitting size factors...
... done in 0.18 seconds.
Fitting dispersions...
... done in 4.47 seconds.

Fitting dispersion trend curve...
... done in 0.37 seconds.

Fitting MAP dispersions...
... done in 5.02 seconds.

Fitting LFCs...
... done in 6.62 seconds.

Calculating cook's distance...
... done in 0.47 seconds.

Replacing 841 outlier genes.

Fitting dispersions...
... done in 0.20 seconds.

Fitting MAP dispersions...
... done in 0.21 seconds.

Fitting LFCs...
... done in 0.27 seconds.

We aim to identify genes that distinguish responders (Efficacy="PR") from non-responders (Efficacy="SD") specifically in patients treated with Anti-PD-L1, but not in those receiving standard chemotherapy. In other words, we want to find genes that are uniquely associated with treatment response in the Anti-PD-L1 group. To do so, we specify a contrast vector using the cond() helper function. This allows you to specify the conditions you want to compare without working out which combination of coefficients in the linear model correspond to your comparison of interest (see here to learn more about contrast vectors and how to build them).

# Specify groups of interest for comparison
interaction_contrast = (
    pds2.cond(Treatment="Anti-PD-L1+Chemo", Efficacy="PR") - pds2.cond(Treatment="Anti-PD-L1+Chemo", Efficacy="SD")
) - (pds2.cond(Treatment="Chemo", Efficacy="PR") - pds2.cond(Treatment="Chemo", Efficacy="SD"))
interaction_contrast
Intercept                            0.0
Efficacy[T.SD]                       0.0
Treatment[T.Chemo]                   0.0
Efficacy[T.SD]:Treatment[T.Chemo]    1.0
Name: 0, dtype: float64

Now we are ready to test for differential expression

interaction_res_df = pds2.test_contrasts(interaction_contrast)
Log2 fold change & Wald test p-value, contrast vector: [0. 0. 0. 1.]
             baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
AL627309.1   0.030299        0.441925  1.259993  0.350736  0.725786       NaN
AL669831.5   0.481510        0.000147  0.421870  0.000349  0.999721  0.999721
FAM87B       0.019790        0.464098  1.709312  0.271511  0.785998       NaN
LINC00115    1.663735        0.444530  0.249670  1.780473  0.074999  0.254101
NOC2L       13.022724        0.441258  0.117724  3.748232  0.000178  0.004734
...               ...             ...       ...       ...       ...       ...
AP001189.4   0.000000             NaN       NaN       NaN       NaN       NaN
AC011595.1   0.000000             NaN       NaN       NaN       NaN       NaN
TRDJ3        0.000000             NaN       NaN       NaN       NaN       NaN
GOLGA6L1     0.000000             NaN       NaN       NaN       NaN       NaN
AC008569.1   0.000000             NaN       NaN       NaN       NaN       NaN

[27085 rows x 6 columns]
Running Wald tests...
... done in 1.01 seconds.
pds2.plot_volcano(interaction_res_df, log2fc_threshold=0)
NaNs encountered, dropping rows with NaNs
../../_images/5ce3b1e105e500599e319fc4f5ae3c8018dec9f5caab7cf96ee9a0662618edfd.png

We can also pass several contrasts to test_contrasts, for example to compare DE genes in responders specific to Anti-PD-L1 treatment to responses observed across treatments

general_response_contrast = pds2.cond(Efficacy="PR") - pds2.cond(Efficacy="SD")

interaction_res_df = pds2.test_contrasts(
    {"Anti-PD-L1_specific": interaction_contrast, "General": general_response_contrast}
)

edgr.plot_multicomparison_fc(interaction_res_df, figsize=(12, 1.5))
Log2 fold change & Wald test p-value, contrast vector: [0. 0. 0. 1.]
             baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
AL627309.1   0.030299        0.441925  1.259993  0.350736  0.725786       NaN
AL669831.5   0.481510        0.000147  0.421870  0.000349  0.999721  0.999721
FAM87B       0.019790        0.464098  1.709312  0.271511  0.785998       NaN
LINC00115    1.663735        0.444530  0.249670  1.780473  0.074999  0.254101
NOC2L       13.022724        0.441258  0.117724  3.748232  0.000178  0.004734
...               ...             ...       ...       ...       ...       ...
AP001189.4   0.000000             NaN       NaN       NaN       NaN       NaN
AC011595.1   0.000000             NaN       NaN       NaN       NaN       NaN
TRDJ3        0.000000             NaN       NaN       NaN       NaN       NaN
GOLGA6L1     0.000000             NaN       NaN       NaN       NaN       NaN
AC008569.1   0.000000             NaN       NaN       NaN       NaN       NaN

[27085 rows x 6 columns]
Log2 fold change & Wald test p-value, contrast vector: [ 0. -1.  0.  0.]
             baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
AL627309.1   0.030299       -0.185247  0.869810 -0.212974  0.831347       NaN
AL669831.5   0.481510       -0.375254  0.301541 -1.244454  0.213332  0.458272
FAM87B       0.019790        0.006961  1.181996  0.005889  0.995301       NaN
LINC00115    1.663735       -0.036693  0.181667 -0.201980  0.839932  0.932967
NOC2L       13.022724        0.149331  0.084369  1.769982  0.076730  0.251246
...               ...             ...       ...       ...       ...       ...
AP001189.4   0.000000             NaN       NaN       NaN       NaN       NaN
AC011595.1   0.000000             NaN       NaN       NaN       NaN       NaN
TRDJ3        0.000000             NaN       NaN       NaN       NaN       NaN
GOLGA6L1     0.000000             NaN       NaN       NaN       NaN       NaN
AC008569.1   0.000000             NaN       NaN       NaN       NaN       NaN

[27085 rows x 6 columns]
Running Wald tests...
... done in 1.10 seconds.
Running Wald tests...
... done in 0.98 seconds.
../../_images/e084f3e632fe4ef75cee8c98b6421aa8ceeabf3baf4fe4a13a24a44036d1c03c.png