Differential gene expression

Differential gene expression#

Differential gene expression (DGE) analysis identifies genes that show statistically significant differences in expression levels across distinct cell populations or conditions. This analysis helps in identifying which cell types are most affected by a condition of interest such as a disease, and characterizing their functional signatures.

Pertpy provides an API to access several types of models for differential expression analysis. The first group of models comprises the T-test and Wilcoxon test as simple statistical tests for comparing expression values between two groups without covariates. The second group includes models of the linear model family that allow modeling complex designs and contrasts. Currently included are PyDESeq2, edgeR as well as a wrapper around statsmodels Statsmodels. which provides access to a wide range of regression models, including ordinary least squares regression, robust linear models and generalized linear models.

In the following tutorial we will demonstrate how the edgeR interface can be used to model complex interactions using the triple-negative breast cancer (TNBC) Zhang dataset.

Environment setup#

# pip install 'pertpy[de]' decoupler

import warnings

import decoupler as dc
import pertpy as pt
import scanpy as sc

warnings.filterwarnings("ignore")

Dataset#

The Zhang dataset comprises scRNA-seq from 22 patients with advanced TNBC, treated with paclitaxel alone or in combination with the anti-PD-L1 therapy atezolizumab. 78 tumor biopsies and blood samples were collected before treatment (‘pre-treatment’), after treatment (‘post-treatment), and during disease progression (‘progression’). We focused on the transcriptomic data which encompasses approximately 489,490 high-quality immune cells with 27085 measured genes across 99 high resolution cell types.

adata = pt.dt.zhang_2021()

To keep runtimes reasonable and simplify the interpretation, we will only be working with the tissue samples.

adata = adata[adata.obs["Origin"] == "t", :].copy()

adata.obs.head(5)

	Sample	Patient	Origin	Tissue	Efficacy	Group	Treatment	Number of counts	Number of genes	Major celltype	Cluster
Cell barcode
AAACCTGAGAGTCTGG.Pre_P007_t	Pre_P007_t	P007	t	lymph_node	PR	Pre-treatment	Anti-PD-L1+Chemo	2,346	748	T cell	t_Tn-LEF1
AAACCTGAGCACACAG.Pre_P007_t	Pre_P007_t	P007	t	lymph_node	PR	Pre-treatment	Anti-PD-L1+Chemo	1,242	679	ILC cell	t_ILC1-IL32
AAACCTGCAAGTCATC.Pre_P007_t	Pre_P007_t	P007	t	lymph_node	PR	Pre-treatment	Anti-PD-L1+Chemo	1,593	597	T cell	t_Tn-LEF1
AAACCTGGTTGGGACA.Pre_P007_t	Pre_P007_t	P007	t	lymph_node	PR	Pre-treatment	Anti-PD-L1+Chemo	2,404	1,203	T cell	t_CD8_Tem-GZMK
AAACCTGTCAGCTCGG.Pre_P007_t	Pre_P007_t	P007	t	lymph_node	PR	Pre-treatment	Anti-PD-L1+Chemo	2,034	682	T cell	t_Tn-LEF1

When conducting differential gene expression analysis, it is important to understand the dataset well. We will therefore explore the various co-variates first.

adata.obs["Patient"].value_counts()

Patient
P019    37150
P013    22526
P012    14570
P025    14505
P002    10909
P022    10614
P023    10123
P005     8175
P018     7936
P020     7240
P007     7237
P003     4349
P016     4009
P017     4006
P004     3887
P010        8
Name: count, dtype: int64

We remove Patient 10 who has had very few sequenced cells.

adata = adata[~adata.obs["Patient"].isin(["P010"])]

adata.obs["Sample"].value_counts()

Sample
Pre_P019_t     27656
Prog_P013_t    10153
Post_P013_t     9975
Post_P019_t     9494
Post_P025_t     8072
Pre_P022_t      7689
Pre_P012_t      7682
Post_P012_t     6888
Pre_P020_t      6863
Post_P023_t     6813
Pre_P025_t      6433
Post_P018_t     5847
Pre_P002_t      5483
Post_P002_t     5426
Pre_P007_t      4999
Pre_P005_t      4432
Post_P003_t     4349
Pre_P004_t      3887
Pre_P016_t      3862
Post_P005_t     3743
Pre_P023_t      3310
Post_P022_t     2925
Pre_P017_t      2428
Pre_P013_t      2398
Prog_P007_t     2238
Pre_P018_t      2089
Post_P017_t     1578
Post_P020_t      377
Post_P016_t      147
Name: count, dtype: int64

Notably, two samples have less than 1000 sequenced cells. We will not remove them but keep it in mind.

Generally cells can behave quite differently from tissue to tissue. Therefore, it can be useful to conduct DGE tests per tissue and not all tissues jointly. To keep things simple, we will not conduct the tests per tissue separately here.

adata.obs["Tissue"].value_counts()

Tissue
lymph_node    60725
breast        50418
liver         30701
chest_wall    23154
brain          2238
Name: count, dtype: int64

Each patient was treated with a different type of therapy and classified based on efficacy of the treatment as responders (PR: partial response) or non-responders (SD: stable disease, PD: progressive disease).

adata.obs["Treatment"].value_counts()

Treatment
Anti-PD-L1+Chemo    89943
Chemo               77293
Name: count, dtype: int64

adata.obs["Efficacy"].value_counts()

Efficacy
PR    99337
SD    63550
PD     4349
Name: count, dtype: int64

Samples were collected both before and after treatment

adata.obs["Group"].value_counts()

Group
Pre-treatment     89211
Post-treatment    65634
Progression       12391
Name: count, dtype: int64

The dataset was annotated with two levels of cell type annotation:

adata.obs["Major celltype"].value_counts()

Major celltype
T cell          100155
B cell           35588
Myeloid cell     19878
ILC cell         11615
Name: count, dtype: int64

adata.obs["Cluster"].value_counts()

Cluster
t_Tn-LEF1           19046
t_CD8_Tem-GZMK      15371
t_Bmem-CD27         12639
t_CD4_Tcm-LMNA      10886
t_CD8-CXCL13        10689
t_Bn-TCL1A           9695
t_pB-IGHG1           8505
t_CD4_Tact-XIST      8497
t_CD4_Treg-FOXP3     7893
t_CD8_Trm-ZNF683     6456
t_CD4-CXCL13         6282
t_CD8_MAIT-KLRB1     4791
Mix                  4146
t_CD8_Teff-GNLY      3517
t_Bfoc-MKI67         2717
t_Tact-IFI6          2487
t_macro-IL1B         2148
t_macro-IL1RN        1973
t_Bfoc-NEIL1         1865
t_ILC3-AREG          1772
t_mono-FCN1          1684
t_ILC1-VCAM1         1613
t_macro-IGFBP7       1493
t_macro-MGP          1338
t_ILC1-FGFBP2        1309
t_ILC1-SELL          1297
t_ILC1-CD160         1097
t_macro-IFI27        1051
t_macro-TUBA1B       1050
t_ILC1-IL32           972
t_macro-CX3CR1        947
t_ILC2-SPON2          942
t_cDC2-CLEC10A        926
t_macro-MKI67         873
t_ILC1-IFNG           751
t_macro-SPP1          742
t_macro-CCL2          733
t_macro-MMP9          664
t_ILC1-CX3CR1         636
t_mono-S100A89        604
t_macro-CFD           596
t_ILC1-ZNF683         583
t_macro-SLC40A1       577
t_pDC-LILRA4          576
t_macro-FOLR2         491
t_ILC1-GZMK           424
t_cDC1-CLEC9A         273
t_cDC2-FCGR2B         261
t_mast-TPSAB1         250
t_cDC2-CD207          208
t_Bmem-MKI67          167
t_macro-CD24          148
t_mono-SMIM25         139
t_mDC-LAMP3           133
t_ILC3-IL7R           128
t_Tprf-MKI67           94
t_ILC1-CNOT2           91
Name: count, dtype: int64

We remove the Mix cell type analogously to the original publication.

adata = adata[~adata.obs["Cluster"].isin(["Mix"])]

For the remainder of the analysis we will work with the Cluster level because the Major celltype level is not fine-grained enough.

We store the raw counts in a counts layer because most DGE models expect raw counts.

adata.layers["counts"] = adata.X.copy()

Pseudobulks#

To perform DE analysis on our single-cell dataset, we first create pseudobulk samples by aggregating expression counts of cells within each group and sample of interest. This approach allows us to leverage statistical methods originally developed for bulk RNA-seq data, which is the recommended best practice to minimizing false positives in scRNA-seq DE analysis.

For each patient we create 1 pseudobulk sample per cell type by summing up the gene counts within each subpopulation.

pdata = dc.pp.pseudobulk(adata, sample_col="Patient", groups_col="Cluster", layer="counts", mode="sum")
dc.pp.filter_samples(pdata, inplace=True)
pdata

AnnData object with n_obs × n_vars = 457 × 27085
    obs: 'Patient', 'Cluster', 'Origin', 'Efficacy', 'Treatment', 'Major celltype', 'psbulk_cells', 'psbulk_counts'
    layers: 'psbulk_props'

dc.pl.filter_samples(pdata, groupby=["Patient", "Major celltype"], figsize=(12, 4))

../../_images/b2459cc7351810f555e9ea23d883fe6adea0f33c5e6cbb094510ae446ae468a8.png

Axes of variation#

The validity of differential gene expression results highly depends on the capture of the major axis of variations in the statistical model. Intermediate data exploration steps such as principal component analysis (PCA) or multidimensional scaling (MDS) on pseudobulk samples allow for the identification of the sources of variation and thus can guide the construction of corresponding design and contrast matrices that model the data.

Now that we have generated the pseudobulk profiles for each patient and each cell type, let’s explore the variability between them. For that, we will first do some simple preprocessing and calculate a PCA.

pdata.layers["counts"] = pdata.X.copy()

sc.pp.normalize_total(pdata, target_sum=1e4)
sc.pp.log1p(pdata)
sc.pp.scale(pdata, max_value=10)
sc.pp.pca(pdata)

# Return raw counts to X
dc.pp.swap_layer(pdata, "counts", inplace=True)

sc.pl.pca(pdata, color=["Patient", "Treatment", "Efficacy", "Cluster"], ncols=1, size=300)
sc.pl.pca_variance_ratio(pdata)

../../_images/1a786876715f0a5fc72466fde21933e8051a25cec3d1e93d83389dace038c9ae.png

../../_images/5ace8d26653424520391e433333ee44762e41d7e71e8009ada8e4515ec09feee.png

It seems like the four first components explain most of the variance and they easily separate most cell types from one another. In contrast, the principle components do not seem to be associated with Patient, Efficacy or Treatment.

To have a better overview of the association of PCs with sample metadata, let’s perform ANOVA on each PC and see whether they are significantly associated with any technical or biological annotations of our samples

pdata_subset = pdata.copy()
pdata_subset.obs = pdata.obs[["Patient", "Treatment", "Efficacy", "Cluster"]]

dc.tl.rankby_obsm(
    pdata_subset,
    key="X_pca",
    uns_key="pca_anova",
)

dc.pl.obsm(
    pdata_subset,
    key="pca_anova",
    names=["Patient", "Treatment", "Efficacy"],
    titles=["Principle component scores", "Adjusted p-values from ANOVA"],
    cmap_obs={},
)

../../_images/d3f65a4b238683e9c267b747c099538e3d1abe08583aebd59979751233c636fd.png

Differential expression testing with edgeR#

The edger interface further requires edger to be installed in R (BiocManager::install("edgeR")).

Here, we are fit a model to capture that the type of treatment (Treatment) and response to the treatment (Efficacy) contribute independently to the gene expression levels. By doing so, we can evaluate:

How much of the variation in gene expression can be attributed to differences in drug efficacy, independent of the type of treatment.
How much of the variation is due to the type of treatment, independent of the efficacy of the drug.

This setup helps in understanding not just whether a treatment works, but how its effectiveness might vary or be influenced by the inherent efficacy of the drug.

edgr = pt.tl.EdgeR(pdata, design="~Efficacy+Treatment")

edgr.fit()

• Calculating NormFactors
• Estimating Dispersions
• Fitting linear model

2025-07-05 11:09:20 | [INFO] cffi mode is CFFI_MODE.ANY
2025-07-05 11:09:20 | [INFO] R home found: /home/lukas/miniforge3/envs/pertpy/lib/R
2025-07-05 11:09:20 | [INFO] R library path: 
2025-07-05 11:09:20 | [INFO] LD_LIBRARY_PATH: 
2025-07-05 11:09:20 | [INFO] Default options to initialize R: rpy2, --quiet, --no-save
2025-07-05 11:09:20 | [INFO] R is already initialized. No need to initialize.

To now determine the differentially expressed genes between the treatments, we can specify a contrast as follows:

res_df = edgr.test_contrasts(edgr.contrast(column="Treatment", baseline="Chemo", group_to_compare="Anti-PD-L1+Chemo"))

res_df.head(10)

	variable	log_fc	logCPM	F	p_value	adj_p_value	contrast
0	IGLV4-3	-1.153857	2.672132	604.075825	8.581326e-98	2.324252e-93	None
1	IGHV1-69	-2.432450	5.603879	614.570202	4.456887e-94	6.035740e-90	None
2	IGKV1D-16	-2.125786	5.300726	522.146190	2.934691e-83	2.649537e-79	None
3	HBB	1.165896	9.259467	484.306407	6.054148e-75	4.099415e-71	None
4	IGKV2D-26	0.243653	1.667114	387.465169	7.560028e-69	4.095267e-65	None
5	RPS26	1.365747	10.339090	356.827260	1.354675e-59	6.115230e-56	None
6	FDCSP	4.726368	5.455406	279.700431	1.130304e-52	4.373469e-49	None
7	AZGP1	-2.878053	3.409497	168.444076	1.330101e-34	4.503223e-31	None
8	AC135068.2	-3.971291	3.342327	159.553825	1.426533e-32	4.293072e-29	None
9	CCL19	3.816752	4.235494	139.161147	2.651780e-29	7.182347e-26	None

The set of differentially expressed genes can now be used for downstream tasks or as a first step, plotted in a volcano plot.

edgr.plot_volcano(res_df, log2fc_thresh=0)

../../_images/1ea6f40210b1c64b1aa9c27218582af05b0e181ca685c4da1ac8338578874048.png

We identified genes that are differentially expressed between the two treatments. Next, we might be interested in looking at the expression levels of these genes across different subgroups. Those subgroups could be different patients or, as in this case, different efficacy groups. For this, we can use the plot_paired function, which enables comparing expression levels between paired samples. Note that the pairing must have been considered in the model design.

edgr.plot_paired(pdata, results_df=res_df, n_top_vars=8, groupby="Treatment", pairedby="Efficacy")

• Performing pseudobulk for paired samples
! 1 unpaired samples removed

../../_images/0fbfda3e8be15ab1e3ed0864f57408c4969b78c1da2ffc2cc5259845350af93f.png

Furthermore, we can also compare the gene expression levels between the different efficacy groups. For this, we use the compare_groups function. Specifically, we will use the “Stable Disease” (SD) group as the baseline and compare it to the “Partial Response” (PR) and “Progressive Disease” (PD) groups.

res_df = edgr.compare_groups(pdata, column="Efficacy", baseline="SD", groups_to_compare=["PR", "PD"])
edgr.plot_multicomparison_fc(res_df, figsize=(12, 1.5))

• Calculating NormFactors
• Estimating Dispersions
• Fitting linear model

../../_images/f4d141ec8fb39fcd8f85dcaa3e6cff2ad9d4c12f8b89f0f2e2648343c2ec3d81.png

Differential expression testing with PyDESeq2#

The interface of PyDESeq2 in pertpy is very similar.

pds2 = pt.tl.PyDESeq2(adata=pdata, design="~Efficacy+Treatment")

pds2.fit()

Using None as control genes, passed at DeseqDataSet initialization

Fitting size factors...
... done in 0.16 seconds.
Fitting dispersions...
... done in 2.44 seconds.

Fitting dispersion trend curve...
... done in 0.30 seconds.

Fitting MAP dispersions...
... done in 2.91 seconds.

Fitting LFCs...
... done in 3.44 seconds.

Calculating cook's distance...
... done in 0.33 seconds.

Replacing 729 outlier genes.

Fitting dispersions...
... done in 0.13 seconds.

Fitting MAP dispersions...
... done in 0.12 seconds.

Fitting LFCs...
... done in 0.17 seconds.

res_df = pds2.test_contrasts(pds2.contrast(column="Treatment", baseline="Chemo", group_to_compare="Anti-PD-L1+Chemo"))

Log2 fold change & Wald test p-value, contrast vector: [ 0.  0.  0. -1.]
             baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
CTAGE15      0.005586        0.542828  1.100956  0.493051  0.621976       NaN
SSBP3-AS1    0.016446        0.588914  2.099829  0.280458  0.779126       NaN
RNF148       0.000000             NaN       NaN       NaN       NaN       NaN
TBX20        0.109343        0.144535  0.334674  0.431866  0.665838       NaN
UBE2A       46.219635       -0.003201  0.042763 -0.074867  0.940321  0.972752
...               ...             ...       ...       ...       ...       ...
TIMM10B      7.016125       -0.096656  0.061285 -1.577163  0.114758  0.287285
YTHDF3-AS1   1.036879       -0.301978  0.154461 -1.955049  0.050577  0.165338
ESCO1       11.054052       -0.089527  0.077811 -1.150575  0.249907  0.472764
CASC10       0.008167        0.547730  1.513299  0.361944  0.717394       NaN
AC005696.4   0.000359        0.576040  3.024556  0.190454  0.848953       NaN

[27085 rows x 6 columns]

Running Wald tests...
... done in 0.89 seconds.

res_df.head(10)

	variable	baseMean	log_fc	lfcSE	stat	p_value	adj_p_value	contrast
0	RPS26	485.923196	1.326838	0.061918	21.428791	7.202693e-102	1.011762e-97	None
1	FDCSP	15.211450	4.273902	0.335308	12.746193	3.273428e-37	2.299092e-33	None
2	AZGP1	3.718152	-2.709333	0.246141	-11.007261	3.525612e-28	1.650809e-24	None
3	IGLV1-51	83.891787	4.380368	0.405676	10.797710	3.528952e-27	1.239280e-23	None
4	POLR2J3	4.443091	-0.851760	0.081241	-10.484358	1.019352e-25	2.863768e-22	None
5	FAM118A	15.566632	1.327481	0.130264	10.190719	2.181439e-24	5.107111e-21	None
6	MT-ATP6	664.679405	-0.544051	0.056290	-9.665054	4.243962e-22	8.516419e-19	None
7	NDUFB1	44.043809	0.768386	0.080136	9.588499	8.937581e-22	1.391374e-18	None
8	RPS29	743.770550	0.770006	0.080310	9.587948	8.985450e-22	1.391374e-18	None
9	HLA-B	1814.364254	-0.573257	0.059852	-9.577886	9.905134e-22	1.391374e-18	None

pds2.plot_volcano(res_df, log2fc_thresh=0)

NaNs encountered, dropping rows with NaNs

../../_images/51e113ee535ddd592b190a5bd4444c670d25b999328fa5bbe3f1504108e0fe19.png

We can also plot the fold changes of the top differentially expressed genes:

pds2.plot_fold_change(res_df, n_top_vars=15)

../../_images/fffc4c21f76202512f851f0e7c320303eee9fce289bcdf4cc346d2f0e18d529b.png

As already done for edgeR, we can also use PyDESeq2 to compare the gene expression levels between the different efficacy groups.

res_df = pds2.compare_groups(pdata, column="Efficacy", baseline="SD", groups_to_compare=["PR", "PD"])
edgr.plot_multicomparison_fc(res_df, figsize=(12, 1.5))

Using None as control genes, passed at DeseqDataSet initialization
Log2 fold change & Wald test p-value, contrast vector: [ 0.  1. -1.]
             baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
CTAGE15      0.005586       -0.331009  1.458895 -0.226890  0.820509       NaN
SSBP3-AS1    0.016446       -0.360049  2.473883 -0.145540  0.884285       NaN
RNF148       0.000000             NaN       NaN       NaN       NaN       NaN
TBX20        0.109343       -0.191595  0.438625 -0.436809  0.662250       NaN
UBE2A       46.219635       -0.028705  0.042461 -0.676049  0.499010  0.772048
...               ...             ...       ...       ...       ...       ...
TIMM10B      7.016125       -0.020675  0.060917 -0.339396  0.734311  0.899746
YTHDF3-AS1   1.036879       -0.007892  0.155310 -0.050813  0.959475  0.985568
ESCO1       11.054052        0.154259  0.077208  1.997953  0.045722  0.248803
CASC10       0.008167       -0.336234  1.887693 -0.178119  0.858629       NaN
AC005696.4   0.000359       -0.332442  3.004621 -0.110644  0.911899       NaN

[27085 rows x 6 columns]
Log2 fold change & Wald test p-value, contrast vector: [ 0.  0. -1.]
             baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
CTAGE15      0.005586        0.498110  3.170411  0.157112  0.875156       NaN
SSBP3-AS1    0.016446        0.548043  5.386868  0.101737  0.918966       NaN
RNF148       0.000000             NaN       NaN       NaN       NaN       NaN
TBX20        0.109343        0.297736  1.001656  0.297244  0.766280       NaN
UBE2A       46.219635        0.038339  0.099578  0.385013  0.700228  0.819585
...               ...             ...       ...       ...       ...       ...
TIMM10B      7.016125       -0.331021  0.171131 -1.934313  0.053075  0.159409
YTHDF3-AS1   1.036879        0.311081  0.392494  0.792575  0.428025       NaN
ESCO1       11.054052       -0.239092  0.192771 -1.240286  0.214870  0.387146
CASC10       0.008167        0.510294  4.106634  0.124261  0.901109       NaN
AC005696.4   0.000359        0.538072  6.548421  0.082168  0.934513       NaN

[27085 rows x 6 columns]

Fitting size factors...
... done in 0.18 seconds.
Fitting dispersions...
... done in 2.68 seconds.

Fitting dispersion trend curve...
... done in 0.34 seconds.

Fitting MAP dispersions...
... done in 2.97 seconds.

Fitting LFCs...
... done in 3.08 seconds.

Calculating cook's distance...
... done in 0.31 seconds.

Replacing 763 outlier genes.

Fitting dispersions...
... done in 0.13 seconds.

Fitting MAP dispersions...
... done in 0.13 seconds.

Fitting LFCs...
... done in 0.18 seconds.

Running Wald tests...
... done in 0.89 seconds.

Running Wald tests...
... done in 1.20 seconds.

../../_images/8c3d3ce59c9bfe158227dcba9d51bfbbe510725f0b1545ef5de6b8c9b18ba544.png

Model interaction between covariates#

We might be interested in more complex comparisons between patient groups. For example in this study instead of assuming that the type of treatment and its efficacy affect gene expression independently, we can test whether the gene expression changes associated to efficacy might depend on the type of treatment. To do this with a linear model, we model the interaction between the treatment and efficacy, following the syntax from Wilkinson formulas.

# Exclude patient with progressive disease, observed only for one treatment condition
pdata2 = pdata[pdata.obs["Efficacy"] != "PD"].copy()

# Fit DE model with interaction term
pds2 = pt.tl.PyDESeq2(adata=pdata2, design="~ Efficacy + Treatment + Efficacy*Treatment")
pds2.fit()

Using None as control genes, passed at DeseqDataSet initialization

Fitting size factors...
... done in 0.16 seconds.
Fitting dispersions...
... done in 2.66 seconds.

Fitting dispersion trend curve...
... done in 0.32 seconds.

Fitting MAP dispersions...
... done in 3.11 seconds.

Fitting LFCs...
... done in 3.38 seconds.

Calculating cook's distance...
... done in 0.30 seconds.

Replacing 841 outlier genes.

Fitting dispersions...
... done in 0.13 seconds.

Fitting MAP dispersions...
... done in 0.14 seconds.

Fitting LFCs...
... done in 0.16 seconds.

We aim to identify genes that distinguish responders (Efficacy="PR") from non-responders (Efficacy="SD") specifically in patients treated with Anti-PD-L1, but not in those receiving standard chemotherapy. In other words, we want to find genes that are uniquely associated with treatment response in the Anti-PD-L1 group. To do so, we specify a contrast vector using the cond() helper function. This allows you to specify the conditions you want to compare without working out which combination of coefficients in the linear model correspond to your comparison of interest (see here to learn more about contrast vectors and how to build them).

# Specify groups of interest for comparison
interaction_contrast = (
    pds2.cond(Treatment="Anti-PD-L1+Chemo", Efficacy="PR") - pds2.cond(Treatment="Anti-PD-L1+Chemo", Efficacy="SD")
) - (pds2.cond(Treatment="Chemo", Efficacy="PR") - pds2.cond(Treatment="Chemo", Efficacy="SD"))

interaction_contrast

Intercept                            0.0
Efficacy[T.SD]                       0.0
Treatment[T.Chemo]                   0.0
Efficacy[T.SD]:Treatment[T.Chemo]    1.0
Name: 0, dtype: float64

Now we are ready to test for differential expression

interaction_res_df = pds2.test_contrasts(interaction_contrast)

Log2 fold change & Wald test p-value, contrast vector: [0. 0. 0. 1.]
             baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
CTAGE15      0.006137        0.468466  2.902207  0.161417  0.871765       NaN
SSBP3-AS1    0.017327        0.438511  5.626887  0.077931  0.937883       NaN
RNF148       0.000000             NaN       NaN       NaN       NaN       NaN
TBX20        0.114327        0.664889  0.865857  0.767897  0.442548       NaN
UBE2A       47.864979        0.080577  0.086099  0.935862  0.349344  0.598941
...               ...             ...       ...       ...       ...       ...
TIMM10B      7.376670        0.297356  0.122610  2.425213  0.015299  0.097176
YTHDF3-AS1   1.073013        0.025700  0.322572  0.079671  0.936499  0.975045
ESCO1       11.570555        0.175444  0.156042  1.124335  0.260871  0.508901
CASC10       0.009065        0.535802  3.794047  0.141222  0.887695       NaN
AC005696.4   0.000398        0.578567  5.891193  0.098209  0.921767       NaN

[27085 rows x 6 columns]

Running Wald tests...
... done in 0.92 seconds.

pds2.plot_volcano(interaction_res_df, log2fc_thresh=0)

NaNs encountered, dropping rows with NaNs

../../_images/211ec87254258bc19fb1de57b674ab7237a82f87aa710f8a43e0d21bb95e7f23.png

We can also pass several contrasts to test_contrasts, for example to compare DE genes in responders specific to Anti-PD-L1 treatment to responses observed across treatments

general_response_contrast = pds2.cond(Efficacy="PR") - pds2.cond(Efficacy="SD")

interaction_res_df = pds2.test_contrasts(
    {"Anti-PD-L1_specific": interaction_contrast, "General": general_response_contrast}
)

edgr.plot_multicomparison_fc(interaction_res_df, figsize=(12, 1.5))

Log2 fold change & Wald test p-value, contrast vector: [0. 0. 0. 1.]
             baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
CTAGE15      0.006137        0.468466  2.902207  0.161417  0.871765       NaN
SSBP3-AS1    0.017327        0.438511  5.626887  0.077931  0.937883       NaN
RNF148       0.000000             NaN       NaN       NaN       NaN       NaN
TBX20        0.114327        0.664889  0.865857  0.767897  0.442548       NaN
UBE2A       47.864979        0.080577  0.086099  0.935862  0.349344  0.598941
...               ...             ...       ...       ...       ...       ...
TIMM10B      7.376670        0.297356  0.122610  2.425213  0.015299  0.097176
YTHDF3-AS1   1.073013        0.025700  0.322572  0.079671  0.936499  0.975045
ESCO1       11.570555        0.175444  0.156042  1.124335  0.260871  0.508901
CASC10       0.009065        0.535802  3.794047  0.141222  0.887695       NaN
AC005696.4   0.000398        0.578567  5.891193  0.098209  0.921767       NaN

[27085 rows x 6 columns]
Log2 fold change & Wald test p-value, contrast vector: [ 0. -1.  0.  0.]
             baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
CTAGE15      0.006137       -0.043019  2.005797 -0.021447  0.982889       NaN
SSBP3-AS1    0.017327       -0.085559  3.891562 -0.021986  0.982459       NaN
RNF148       0.000000             NaN       NaN       NaN       NaN       NaN
TBX20        0.114327        0.156086  0.608956  0.256317  0.797706       NaN
UBE2A       47.864979        0.010494  0.060903  0.172300  0.863202  0.941821
...               ...             ...       ...       ...       ...       ...
TIMM10B      7.376670        0.123306  0.088435  1.394311  0.163224  0.393120
YTHDF3-AS1   1.073013       -0.034124  0.235592 -0.144844  0.884834  0.952060
ESCO1       11.570555        0.233000  0.111016  2.098792  0.035835  0.156166
CASC10       0.009065       -0.023606  2.622866 -0.009000  0.992819       NaN
AC005696.4   0.000398        0.002971  4.074636  0.000729  0.999418       NaN

[27085 rows x 6 columns]

Running Wald tests...
... done in 0.94 seconds.

Running Wald tests...
... done in 1.11 seconds.

../../_images/4349fc0ef55165842dd79503f5b440db3d019f73c18971a38b0d9ed6502334b3.png