Differential gene expression#
Differential gene expression (DGE) analysis identifies genes that show statistically significant differences in expression levels across distinct cell populations or conditions. This analysis helps in identifying which cell types are most affected by a condition of interest such as a disease, and characterizing their functional signatures.
Pertpy provides an API to access several types of models for differential expression analysis. The first group of models comprises the T-test and Wilcoxon test as simple statistical tests for comparing expression values between two groups without covariates. The second group includes models of the linear model family that allow modeling complex designs and contrasts. Currently included are PyDESeq2, edgeR as well as a wrapper around statsmodels Statsmodels. which provides access to a wide range of regression models, including ordinary least squares regression, robust linear models and generalized linear models.
In the following tutorial we will demonstrate how the edgeR interface can be used to model complex interactions using the triple-negative breast cancer (TNBC) Zhang dataset.
Environment setup#
# pip install 'pertpy[de]' decoupler
import warnings
import decoupler as dc
import pertpy as pt
import scanpy as sc
warnings.filterwarnings("ignore")
Dataset#
The Zhang dataset comprises scRNA-seq from 22 patients with advanced TNBC, treated with paclitaxel alone or in combination with the anti-PD-L1 therapy atezolizumab. 78 tumor biopsies and blood samples were collected before treatment (‘pre-treatment’), after treatment (‘post-treatment), and during disease progression (‘progression’). We focused on the transcriptomic data which encompasses approximately 489,490 high-quality immune cells with 27085 measured genes across 99 high resolution cell types.
adata = pt.dt.zhang_2021()
To keep runtimes reasonable and simplify the interpretation, we will only be working with the tissue samples.
adata = adata[adata.obs["Origin"] == "t", :].copy()
adata.obs.head(5)
| Sample | Patient | Origin | Tissue | Efficacy | Group | Treatment | Number of counts | Number of genes | Major celltype | Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cell barcode | |||||||||||
| AAACCTGAGAGTCTGG.Pre_P007_t | Pre_P007_t | P007 | t | lymph_node | PR | Pre-treatment | Anti-PD-L1+Chemo | 2,346 | 748 | T cell | t_Tn-LEF1 |
| AAACCTGAGCACACAG.Pre_P007_t | Pre_P007_t | P007 | t | lymph_node | PR | Pre-treatment | Anti-PD-L1+Chemo | 1,242 | 679 | ILC cell | t_ILC1-IL32 |
| AAACCTGCAAGTCATC.Pre_P007_t | Pre_P007_t | P007 | t | lymph_node | PR | Pre-treatment | Anti-PD-L1+Chemo | 1,593 | 597 | T cell | t_Tn-LEF1 |
| AAACCTGGTTGGGACA.Pre_P007_t | Pre_P007_t | P007 | t | lymph_node | PR | Pre-treatment | Anti-PD-L1+Chemo | 2,404 | 1,203 | T cell | t_CD8_Tem-GZMK |
| AAACCTGTCAGCTCGG.Pre_P007_t | Pre_P007_t | P007 | t | lymph_node | PR | Pre-treatment | Anti-PD-L1+Chemo | 2,034 | 682 | T cell | t_Tn-LEF1 |
When conducting differential gene expression analysis, it is important to understand the dataset well. We will therefore explore the various co-variates first.
adata.obs["Patient"].value_counts()
Patient
P019 37150
P013 22526
P012 14570
P025 14505
P002 10909
P022 10614
P023 10123
P005 8175
P018 7936
P020 7240
P007 7237
P003 4349
P016 4009
P017 4006
P004 3887
P010 8
Name: count, dtype: int64
We remove Patient 10 who has had very few sequenced cells.
adata = adata[~adata.obs["Patient"].isin(["P010"])]
adata.obs["Sample"].value_counts()
Sample
Pre_P019_t 27656
Prog_P013_t 10153
Post_P013_t 9975
Post_P019_t 9494
Post_P025_t 8072
Pre_P022_t 7689
Pre_P012_t 7682
Post_P012_t 6888
Pre_P020_t 6863
Post_P023_t 6813
Pre_P025_t 6433
Post_P018_t 5847
Pre_P002_t 5483
Post_P002_t 5426
Pre_P007_t 4999
Pre_P005_t 4432
Post_P003_t 4349
Pre_P004_t 3887
Pre_P016_t 3862
Post_P005_t 3743
Pre_P023_t 3310
Post_P022_t 2925
Pre_P017_t 2428
Pre_P013_t 2398
Prog_P007_t 2238
Pre_P018_t 2089
Post_P017_t 1578
Post_P020_t 377
Post_P016_t 147
Name: count, dtype: int64
Notably, two samples have less than 1000 sequenced cells. We will not remove them but keep it in mind.
Generally cells can behave quite differently from tissue to tissue. Therefore, it can be useful to conduct DGE tests per tissue and not all tissues jointly. To keep things simple, we will not conduct the tests per tissue separately here.
adata.obs["Tissue"].value_counts()
Tissue
lymph_node 60725
breast 50418
liver 30701
chest_wall 23154
brain 2238
Name: count, dtype: int64
Each patient was treated with a different type of therapy and classified based on efficacy of the treatment as responders (PR: partial response) or non-responders (SD: stable disease, PD: progressive disease).
adata.obs["Treatment"].value_counts()
Treatment
Anti-PD-L1+Chemo 89943
Chemo 77293
Name: count, dtype: int64
adata.obs["Efficacy"].value_counts()
Efficacy
PR 99337
SD 63550
PD 4349
Name: count, dtype: int64
Samples were collected both before and after treatment
adata.obs["Group"].value_counts()
Group
Pre-treatment 89211
Post-treatment 65634
Progression 12391
Name: count, dtype: int64
The dataset was annotated with two levels of cell type annotation:
adata.obs["Major celltype"].value_counts()
Major celltype
T cell 100155
B cell 35588
Myeloid cell 19878
ILC cell 11615
Name: count, dtype: int64
adata.obs["Cluster"].value_counts()
Cluster
t_Tn-LEF1 19046
t_CD8_Tem-GZMK 15371
t_Bmem-CD27 12639
t_CD4_Tcm-LMNA 10886
t_CD8-CXCL13 10689
t_Bn-TCL1A 9695
t_pB-IGHG1 8505
t_CD4_Tact-XIST 8497
t_CD4_Treg-FOXP3 7893
t_CD8_Trm-ZNF683 6456
t_CD4-CXCL13 6282
t_CD8_MAIT-KLRB1 4791
Mix 4146
t_CD8_Teff-GNLY 3517
t_Bfoc-MKI67 2717
t_Tact-IFI6 2487
t_macro-IL1B 2148
t_macro-IL1RN 1973
t_Bfoc-NEIL1 1865
t_ILC3-AREG 1772
t_mono-FCN1 1684
t_ILC1-VCAM1 1613
t_macro-IGFBP7 1493
t_macro-MGP 1338
t_ILC1-FGFBP2 1309
t_ILC1-SELL 1297
t_ILC1-CD160 1097
t_macro-IFI27 1051
t_macro-TUBA1B 1050
t_ILC1-IL32 972
t_macro-CX3CR1 947
t_ILC2-SPON2 942
t_cDC2-CLEC10A 926
t_macro-MKI67 873
t_ILC1-IFNG 751
t_macro-SPP1 742
t_macro-CCL2 733
t_macro-MMP9 664
t_ILC1-CX3CR1 636
t_mono-S100A89 604
t_macro-CFD 596
t_ILC1-ZNF683 583
t_macro-SLC40A1 577
t_pDC-LILRA4 576
t_macro-FOLR2 491
t_ILC1-GZMK 424
t_cDC1-CLEC9A 273
t_cDC2-FCGR2B 261
t_mast-TPSAB1 250
t_cDC2-CD207 208
t_Bmem-MKI67 167
t_macro-CD24 148
t_mono-SMIM25 139
t_mDC-LAMP3 133
t_ILC3-IL7R 128
t_Tprf-MKI67 94
t_ILC1-CNOT2 91
Name: count, dtype: int64
We remove the Mix cell type analogously to the original publication.
adata = adata[~adata.obs["Cluster"].isin(["Mix"])]
For the remainder of the analysis we will work with the Cluster level because the Major celltype level is not fine-grained enough.
We store the raw counts in a counts layer because most DGE models expect raw counts.
adata.layers["counts"] = adata.X.copy()
Pseudobulks#
To perform DE analysis on our single-cell dataset, we first create pseudobulk samples by aggregating expression counts of cells within each group and sample of interest. This approach allows us to leverage statistical methods originally developed for bulk RNA-seq data, which is the recommended best practice to minimizing false positives in scRNA-seq DE analysis.
For each patient we create 1 pseudobulk sample per cell type by summing up the gene counts within each subpopulation.
pdata = dc.pp.pseudobulk(adata, sample_col="Patient", groups_col="Cluster", layer="counts", mode="sum")
dc.pp.filter_samples(pdata, inplace=True)
pdata
AnnData object with n_obs × n_vars = 457 × 27085
obs: 'Patient', 'Cluster', 'Origin', 'Efficacy', 'Treatment', 'Major celltype', 'psbulk_cells', 'psbulk_counts'
layers: 'psbulk_props'
dc.pl.filter_samples(pdata, groupby=["Patient", "Major celltype"], figsize=(12, 4))
Axes of variation#
The validity of differential gene expression results highly depends on the capture of the major axis of variations in the statistical model. Intermediate data exploration steps such as principal component analysis (PCA) or multidimensional scaling (MDS) on pseudobulk samples allow for the identification of the sources of variation and thus can guide the construction of corresponding design and contrast matrices that model the data.
Now that we have generated the pseudobulk profiles for each patient and each cell type, let’s explore the variability between them. For that, we will first do some simple preprocessing and calculate a PCA.
pdata.layers["counts"] = pdata.X.copy()
sc.pp.normalize_total(pdata, target_sum=1e4)
sc.pp.log1p(pdata)
sc.pp.scale(pdata, max_value=10)
sc.pp.pca(pdata)
# Return raw counts to X
dc.pp.swap_layer(pdata, "counts", inplace=True)
sc.pl.pca(pdata, color=["Patient", "Treatment", "Efficacy", "Cluster"], ncols=1, size=300)
sc.pl.pca_variance_ratio(pdata)
It seems like the four first components explain most of the variance and they easily separate most cell types from one another. In contrast, the principle components do not seem to be associated with Patient, Efficacy or Treatment.
To have a better overview of the association of PCs with sample metadata, let’s perform ANOVA on each PC and see whether they are significantly associated with any technical or biological annotations of our samples
pdata_subset = pdata.copy()
pdata_subset.obs = pdata.obs[["Patient", "Treatment", "Efficacy", "Cluster"]]
dc.tl.rankby_obsm(
pdata_subset,
key="X_pca",
uns_key="pca_anova",
)
dc.pl.obsm(
pdata_subset,
key="pca_anova",
names=["Patient", "Treatment", "Efficacy"],
titles=["Principle component scores", "Adjusted p-values from ANOVA"],
cmap_obs={},
)
Differential expression testing with edgeR#
The edger interface further requires edger to be installed in R (BiocManager::install("edgeR")).
Here, we are fit a model to capture that the type of treatment (Treatment) and response to the treatment (Efficacy) contribute independently to the gene expression levels.
By doing so, we can evaluate:
How much of the variation in gene expression can be attributed to differences in drug efficacy, independent of the type of treatment.
How much of the variation is due to the type of treatment, independent of the efficacy of the drug.
This setup helps in understanding not just whether a treatment works, but how its effectiveness might vary or be influenced by the inherent efficacy of the drug.
edgr = pt.tl.EdgeR(pdata, design="~Efficacy+Treatment")
edgr.fit()
• Calculating NormFactors
• Estimating Dispersions
• Fitting linear model
2025-07-05 11:09:20 | [INFO] cffi mode is CFFI_MODE.ANY
2025-07-05 11:09:20 | [INFO] R home found: /home/lukas/miniforge3/envs/pertpy/lib/R
2025-07-05 11:09:20 | [INFO] R library path:
2025-07-05 11:09:20 | [INFO] LD_LIBRARY_PATH:
2025-07-05 11:09:20 | [INFO] Default options to initialize R: rpy2, --quiet, --no-save
2025-07-05 11:09:20 | [INFO] R is already initialized. No need to initialize.
To now determine the differentially expressed genes between the treatments, we can specify a contrast as follows:
res_df = edgr.test_contrasts(edgr.contrast(column="Treatment", baseline="Chemo", group_to_compare="Anti-PD-L1+Chemo"))
res_df.head(10)
| variable | log_fc | logCPM | F | p_value | adj_p_value | contrast | |
|---|---|---|---|---|---|---|---|
| 0 | IGLV4-3 | -1.153857 | 2.672132 | 604.075825 | 8.581326e-98 | 2.324252e-93 | None |
| 1 | IGHV1-69 | -2.432450 | 5.603879 | 614.570202 | 4.456887e-94 | 6.035740e-90 | None |
| 2 | IGKV1D-16 | -2.125786 | 5.300726 | 522.146190 | 2.934691e-83 | 2.649537e-79 | None |
| 3 | HBB | 1.165896 | 9.259467 | 484.306407 | 6.054148e-75 | 4.099415e-71 | None |
| 4 | IGKV2D-26 | 0.243653 | 1.667114 | 387.465169 | 7.560028e-69 | 4.095267e-65 | None |
| 5 | RPS26 | 1.365747 | 10.339090 | 356.827260 | 1.354675e-59 | 6.115230e-56 | None |
| 6 | FDCSP | 4.726368 | 5.455406 | 279.700431 | 1.130304e-52 | 4.373469e-49 | None |
| 7 | AZGP1 | -2.878053 | 3.409497 | 168.444076 | 1.330101e-34 | 4.503223e-31 | None |
| 8 | AC135068.2 | -3.971291 | 3.342327 | 159.553825 | 1.426533e-32 | 4.293072e-29 | None |
| 9 | CCL19 | 3.816752 | 4.235494 | 139.161147 | 2.651780e-29 | 7.182347e-26 | None |
The set of differentially expressed genes can now be used for downstream tasks or as a first step, plotted in a volcano plot.
edgr.plot_volcano(res_df, log2fc_thresh=0)
We identified genes that are differentially expressed between the two treatments. Next, we might be interested in looking at the expression levels of these genes across different subgroups. Those subgroups could be different patients or, as in this case, different efficacy groups. For this, we can use the plot_paired function, which enables comparing expression levels between paired samples. Note that the pairing must have been considered in the model design.
edgr.plot_paired(pdata, results_df=res_df, n_top_vars=8, groupby="Treatment", pairedby="Efficacy")
• Performing pseudobulk for paired samples
! 1 unpaired samples removed
Furthermore, we can also compare the gene expression levels between the different efficacy groups. For this, we use the compare_groups function.
Specifically, we will use the “Stable Disease” (SD) group as the baseline and compare it to the “Partial Response” (PR) and “Progressive Disease” (PD) groups.
res_df = edgr.compare_groups(pdata, column="Efficacy", baseline="SD", groups_to_compare=["PR", "PD"])
edgr.plot_multicomparison_fc(res_df, figsize=(12, 1.5))
• Calculating NormFactors
• Estimating Dispersions
• Fitting linear model
Differential expression testing with PyDESeq2#
The interface of PyDESeq2 in pertpy is very similar.
pds2 = pt.tl.PyDESeq2(adata=pdata, design="~Efficacy+Treatment")
pds2.fit()
Using None as control genes, passed at DeseqDataSet initialization
Fitting size factors...
... done in 0.16 seconds.
Fitting dispersions...
... done in 2.44 seconds.
Fitting dispersion trend curve...
... done in 0.30 seconds.
Fitting MAP dispersions...
... done in 2.91 seconds.
Fitting LFCs...
... done in 3.44 seconds.
Calculating cook's distance...
... done in 0.33 seconds.
Replacing 729 outlier genes.
Fitting dispersions...
... done in 0.13 seconds.
Fitting MAP dispersions...
... done in 0.12 seconds.
Fitting LFCs...
... done in 0.17 seconds.
res_df = pds2.test_contrasts(pds2.contrast(column="Treatment", baseline="Chemo", group_to_compare="Anti-PD-L1+Chemo"))
Log2 fold change & Wald test p-value, contrast vector: [ 0. 0. 0. -1.]
baseMean log2FoldChange lfcSE stat pvalue padj
CTAGE15 0.005586 0.542828 1.100956 0.493051 0.621976 NaN
SSBP3-AS1 0.016446 0.588914 2.099829 0.280458 0.779126 NaN
RNF148 0.000000 NaN NaN NaN NaN NaN
TBX20 0.109343 0.144535 0.334674 0.431866 0.665838 NaN
UBE2A 46.219635 -0.003201 0.042763 -0.074867 0.940321 0.972752
... ... ... ... ... ... ...
TIMM10B 7.016125 -0.096656 0.061285 -1.577163 0.114758 0.287285
YTHDF3-AS1 1.036879 -0.301978 0.154461 -1.955049 0.050577 0.165338
ESCO1 11.054052 -0.089527 0.077811 -1.150575 0.249907 0.472764
CASC10 0.008167 0.547730 1.513299 0.361944 0.717394 NaN
AC005696.4 0.000359 0.576040 3.024556 0.190454 0.848953 NaN
[27085 rows x 6 columns]
Running Wald tests...
... done in 0.89 seconds.
res_df.head(10)
| variable | baseMean | log_fc | lfcSE | stat | p_value | adj_p_value | contrast | |
|---|---|---|---|---|---|---|---|---|
| 0 | RPS26 | 485.923196 | 1.326838 | 0.061918 | 21.428791 | 7.202693e-102 | 1.011762e-97 | None |
| 1 | FDCSP | 15.211450 | 4.273902 | 0.335308 | 12.746193 | 3.273428e-37 | 2.299092e-33 | None |
| 2 | AZGP1 | 3.718152 | -2.709333 | 0.246141 | -11.007261 | 3.525612e-28 | 1.650809e-24 | None |
| 3 | IGLV1-51 | 83.891787 | 4.380368 | 0.405676 | 10.797710 | 3.528952e-27 | 1.239280e-23 | None |
| 4 | POLR2J3 | 4.443091 | -0.851760 | 0.081241 | -10.484358 | 1.019352e-25 | 2.863768e-22 | None |
| 5 | FAM118A | 15.566632 | 1.327481 | 0.130264 | 10.190719 | 2.181439e-24 | 5.107111e-21 | None |
| 6 | MT-ATP6 | 664.679405 | -0.544051 | 0.056290 | -9.665054 | 4.243962e-22 | 8.516419e-19 | None |
| 7 | NDUFB1 | 44.043809 | 0.768386 | 0.080136 | 9.588499 | 8.937581e-22 | 1.391374e-18 | None |
| 8 | RPS29 | 743.770550 | 0.770006 | 0.080310 | 9.587948 | 8.985450e-22 | 1.391374e-18 | None |
| 9 | HLA-B | 1814.364254 | -0.573257 | 0.059852 | -9.577886 | 9.905134e-22 | 1.391374e-18 | None |
pds2.plot_volcano(res_df, log2fc_thresh=0)
NaNs encountered, dropping rows with NaNs
We can also plot the fold changes of the top differentially expressed genes:
pds2.plot_fold_change(res_df, n_top_vars=15)
As already done for edgeR, we can also use PyDESeq2 to compare the gene expression levels between the different efficacy groups.
res_df = pds2.compare_groups(pdata, column="Efficacy", baseline="SD", groups_to_compare=["PR", "PD"])
edgr.plot_multicomparison_fc(res_df, figsize=(12, 1.5))
Using None as control genes, passed at DeseqDataSet initialization
Log2 fold change & Wald test p-value, contrast vector: [ 0. 1. -1.]
baseMean log2FoldChange lfcSE stat pvalue padj
CTAGE15 0.005586 -0.331009 1.458895 -0.226890 0.820509 NaN
SSBP3-AS1 0.016446 -0.360049 2.473883 -0.145540 0.884285 NaN
RNF148 0.000000 NaN NaN NaN NaN NaN
TBX20 0.109343 -0.191595 0.438625 -0.436809 0.662250 NaN
UBE2A 46.219635 -0.028705 0.042461 -0.676049 0.499010 0.772048
... ... ... ... ... ... ...
TIMM10B 7.016125 -0.020675 0.060917 -0.339396 0.734311 0.899746
YTHDF3-AS1 1.036879 -0.007892 0.155310 -0.050813 0.959475 0.985568
ESCO1 11.054052 0.154259 0.077208 1.997953 0.045722 0.248803
CASC10 0.008167 -0.336234 1.887693 -0.178119 0.858629 NaN
AC005696.4 0.000359 -0.332442 3.004621 -0.110644 0.911899 NaN
[27085 rows x 6 columns]
Log2 fold change & Wald test p-value, contrast vector: [ 0. 0. -1.]
baseMean log2FoldChange lfcSE stat pvalue padj
CTAGE15 0.005586 0.498110 3.170411 0.157112 0.875156 NaN
SSBP3-AS1 0.016446 0.548043 5.386868 0.101737 0.918966 NaN
RNF148 0.000000 NaN NaN NaN NaN NaN
TBX20 0.109343 0.297736 1.001656 0.297244 0.766280 NaN
UBE2A 46.219635 0.038339 0.099578 0.385013 0.700228 0.819585
... ... ... ... ... ... ...
TIMM10B 7.016125 -0.331021 0.171131 -1.934313 0.053075 0.159409
YTHDF3-AS1 1.036879 0.311081 0.392494 0.792575 0.428025 NaN
ESCO1 11.054052 -0.239092 0.192771 -1.240286 0.214870 0.387146
CASC10 0.008167 0.510294 4.106634 0.124261 0.901109 NaN
AC005696.4 0.000359 0.538072 6.548421 0.082168 0.934513 NaN
[27085 rows x 6 columns]
Fitting size factors...
... done in 0.18 seconds.
Fitting dispersions...
... done in 2.68 seconds.
Fitting dispersion trend curve...
... done in 0.34 seconds.
Fitting MAP dispersions...
... done in 2.97 seconds.
Fitting LFCs...
... done in 3.08 seconds.
Calculating cook's distance...
... done in 0.31 seconds.
Replacing 763 outlier genes.
Fitting dispersions...
... done in 0.13 seconds.
Fitting MAP dispersions...
... done in 0.13 seconds.
Fitting LFCs...
... done in 0.18 seconds.
Running Wald tests...
... done in 0.89 seconds.
Running Wald tests...
... done in 1.20 seconds.
Model interaction between covariates#
We might be interested in more complex comparisons between patient groups. For example in this study instead of assuming that the type of treatment and its efficacy affect gene expression independently, we can test whether the gene expression changes associated to efficacy might depend on the type of treatment. To do this with a linear model, we model the interaction between the treatment and efficacy, following the syntax from Wilkinson formulas.
# Exclude patient with progressive disease, observed only for one treatment condition
pdata2 = pdata[pdata.obs["Efficacy"] != "PD"].copy()
# Fit DE model with interaction term
pds2 = pt.tl.PyDESeq2(adata=pdata2, design="~ Efficacy + Treatment + Efficacy*Treatment")
pds2.fit()
Using None as control genes, passed at DeseqDataSet initialization
Fitting size factors...
... done in 0.16 seconds.
Fitting dispersions...
... done in 2.66 seconds.
Fitting dispersion trend curve...
... done in 0.32 seconds.
Fitting MAP dispersions...
... done in 3.11 seconds.
Fitting LFCs...
... done in 3.38 seconds.
Calculating cook's distance...
... done in 0.30 seconds.
Replacing 841 outlier genes.
Fitting dispersions...
... done in 0.13 seconds.
Fitting MAP dispersions...
... done in 0.14 seconds.
Fitting LFCs...
... done in 0.16 seconds.
We aim to identify genes that distinguish responders (Efficacy="PR") from non-responders (Efficacy="SD") specifically in patients treated with Anti-PD-L1, but not in those receiving standard chemotherapy. In other words, we want to find genes that are uniquely associated with treatment response in the Anti-PD-L1 group. To do so, we specify a contrast vector using the cond() helper function. This allows you to specify the conditions you want to compare without working out which combination of coefficients in the linear model correspond to your comparison of interest (see here to learn more about contrast vectors and how to build them).
# Specify groups of interest for comparison
interaction_contrast = (
pds2.cond(Treatment="Anti-PD-L1+Chemo", Efficacy="PR") - pds2.cond(Treatment="Anti-PD-L1+Chemo", Efficacy="SD")
) - (pds2.cond(Treatment="Chemo", Efficacy="PR") - pds2.cond(Treatment="Chemo", Efficacy="SD"))
interaction_contrast
Intercept 0.0
Efficacy[T.SD] 0.0
Treatment[T.Chemo] 0.0
Efficacy[T.SD]:Treatment[T.Chemo] 1.0
Name: 0, dtype: float64
Now we are ready to test for differential expression
interaction_res_df = pds2.test_contrasts(interaction_contrast)
Log2 fold change & Wald test p-value, contrast vector: [0. 0. 0. 1.]
baseMean log2FoldChange lfcSE stat pvalue padj
CTAGE15 0.006137 0.468466 2.902207 0.161417 0.871765 NaN
SSBP3-AS1 0.017327 0.438511 5.626887 0.077931 0.937883 NaN
RNF148 0.000000 NaN NaN NaN NaN NaN
TBX20 0.114327 0.664889 0.865857 0.767897 0.442548 NaN
UBE2A 47.864979 0.080577 0.086099 0.935862 0.349344 0.598941
... ... ... ... ... ... ...
TIMM10B 7.376670 0.297356 0.122610 2.425213 0.015299 0.097176
YTHDF3-AS1 1.073013 0.025700 0.322572 0.079671 0.936499 0.975045
ESCO1 11.570555 0.175444 0.156042 1.124335 0.260871 0.508901
CASC10 0.009065 0.535802 3.794047 0.141222 0.887695 NaN
AC005696.4 0.000398 0.578567 5.891193 0.098209 0.921767 NaN
[27085 rows x 6 columns]
Running Wald tests...
... done in 0.92 seconds.
pds2.plot_volcano(interaction_res_df, log2fc_thresh=0)
NaNs encountered, dropping rows with NaNs
We can also pass several contrasts to test_contrasts, for example to compare DE genes in responders specific to Anti-PD-L1 treatment to responses observed across treatments
general_response_contrast = pds2.cond(Efficacy="PR") - pds2.cond(Efficacy="SD")
interaction_res_df = pds2.test_contrasts(
{"Anti-PD-L1_specific": interaction_contrast, "General": general_response_contrast}
)
edgr.plot_multicomparison_fc(interaction_res_df, figsize=(12, 1.5))
Log2 fold change & Wald test p-value, contrast vector: [0. 0. 0. 1.]
baseMean log2FoldChange lfcSE stat pvalue padj
CTAGE15 0.006137 0.468466 2.902207 0.161417 0.871765 NaN
SSBP3-AS1 0.017327 0.438511 5.626887 0.077931 0.937883 NaN
RNF148 0.000000 NaN NaN NaN NaN NaN
TBX20 0.114327 0.664889 0.865857 0.767897 0.442548 NaN
UBE2A 47.864979 0.080577 0.086099 0.935862 0.349344 0.598941
... ... ... ... ... ... ...
TIMM10B 7.376670 0.297356 0.122610 2.425213 0.015299 0.097176
YTHDF3-AS1 1.073013 0.025700 0.322572 0.079671 0.936499 0.975045
ESCO1 11.570555 0.175444 0.156042 1.124335 0.260871 0.508901
CASC10 0.009065 0.535802 3.794047 0.141222 0.887695 NaN
AC005696.4 0.000398 0.578567 5.891193 0.098209 0.921767 NaN
[27085 rows x 6 columns]
Log2 fold change & Wald test p-value, contrast vector: [ 0. -1. 0. 0.]
baseMean log2FoldChange lfcSE stat pvalue padj
CTAGE15 0.006137 -0.043019 2.005797 -0.021447 0.982889 NaN
SSBP3-AS1 0.017327 -0.085559 3.891562 -0.021986 0.982459 NaN
RNF148 0.000000 NaN NaN NaN NaN NaN
TBX20 0.114327 0.156086 0.608956 0.256317 0.797706 NaN
UBE2A 47.864979 0.010494 0.060903 0.172300 0.863202 0.941821
... ... ... ... ... ... ...
TIMM10B 7.376670 0.123306 0.088435 1.394311 0.163224 0.393120
YTHDF3-AS1 1.073013 -0.034124 0.235592 -0.144844 0.884834 0.952060
ESCO1 11.570555 0.233000 0.111016 2.098792 0.035835 0.156166
CASC10 0.009065 -0.023606 2.622866 -0.009000 0.992819 NaN
AC005696.4 0.000398 0.002971 4.074636 0.000729 0.999418 NaN
[27085 rows x 6 columns]
Running Wald tests...
... done in 0.94 seconds.
Running Wald tests...
... done in 1.11 seconds.