Use-case: Exploring perturbation effects#

This tutorial shows how to perform the analysis we present in Figure 4 of the pertpy preprint. We will use the data from Zhang et al. 2021 to explore the perturbation effects of different treatments on triple-negative breast cancer. We will use the mean squared error (MSE) distance to compare the expression profiles of cells between the different treatment-response groups. Additionally, we will use scCODA to identify changes in cell type composition between the different treatment groups.

import warnings

warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import pandas as pd
import pertpy as pt
import scanpy as sc
import seaborn as sns

Load and preprocess the data#

adata = pt.dt.zhang_2021()
adata
AnnData object with n_obs × n_vars = 489490 × 27085
    obs: 'Sample', 'Patient', 'Origin', 'Tissue', 'Efficacy', 'Group', 'Treatment', 'Number of counts', 'Number of genes', 'Major celltype', 'Cluster'

First, we will filter the data to only include tumor samples, no progression samples, and only samples with a partial response or stable disease. We will also filter out the samples with a mix of cell types, following the pre-processing steps in the original publication.

# Filter to tumor samples only
adata = adata[adata.obs["Origin"] == "t", :].copy()

# Filter out progression samples
adata = adata[adata.obs["Group"] != "Progression", :].copy()

# Subset to partial response and stable disease and rename PR and SD
adata = adata[adata.obs["Efficacy"].isin(["PR", "SD"]), :].copy()
adata.obs["Efficacy"] = adata.obs["Efficacy"].replace({"PR": "Partial response", "SD": "Stable disease"})

# Filter out Mix samples
adata = adata[adata.obs["Cluster"] != "Mix", :].copy()
adata
AnnData object with n_obs × n_vars = 146358 × 27085
    obs: 'Sample', 'Patient', 'Origin', 'Tissue', 'Efficacy', 'Group', 'Treatment', 'Number of counts', 'Number of genes', 'Major celltype', 'Cluster'

Now, we will refine the grouping to include both the timepoint and the treatment response.

adata.obs["Timepoint"] = adata.obs["Group"].copy()
adata.obs["Group"] = [
    f"{timepoint.split('-')[0]}-treat., {response}"
    for timepoint, response in zip(adata.obs["Timepoint"], adata.obs["Efficacy"])
]
adata.obs["Group"].value_counts()
Group
Pre-treat., Partial response     57295
Post-treat., Stable disease      31626
Post-treat., Partial response    29659
Pre-treat., Stable disease       27778
Name: count, dtype: int64

After filtering, we do some standard pre-processing steps:

sc.pp.filter_genes(adata, min_cells=10)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=4000, flavor="seurat_v3")
adata.raw = adata
adata = adata[:, adata.var.highly_variable]
adata
View of AnnData object with n_obs × n_vars = 146358 × 4000
    obs: 'Sample', 'Patient', 'Origin', 'Tissue', 'Efficacy', 'Group', 'Treatment', 'Number of counts', 'Number of genes', 'Major celltype', 'Cluster', 'Timepoint'
    var: 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'log1p', 'hvg'
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color="Treatment")
../../_images/5adef999b8620c1a571aaf0078984870181f1c0c127276da56b938935a7fca1e.png
sc.pl.umap(adata, color="Group")
../../_images/eff2c978a89686291b865575e63693f95c1875a19054ed7f7c6841a23e2c58ba.png

The dataset comes with a very detailed cell type annotation. We will use this annotation to define a coarser cell type annotation:

cell_type_converter = {
    "t_Bfoc": "B cell",
    "t_Bmem": "B cell",
    "t_Bn": "B cell",
    "t_CD4": "CD4 T cell",
    "t_CD8": "CD8 T cell",
    "t_ILC": "ILC cell",
    "t_Tact": "Activated T cell",
    "t_Tn": "Naive T cell",
    "t_Tprf": "Proliferating T cell",
    "t_cDC": "Dendritic cell",
    "t_mDC": "Dendritic cell",
    "t_macro": "Macrophage",
    "t_mast": "Mast cell",
    "t_mono": "Monocyte",
    "t_pB": "B cell",
    "t_pDC": "Dendritic cell",
}

cell_types = []
for ct in adata.obs["Cluster"]:
    for key, value in cell_type_converter.items():
        if ct.startswith(key):
            cell_types.append(value)
            break
    else:
        cell_types.append("n.a.")
adata.obs["Celltype"] = cell_types

sc.pl.umap(adata, color="Celltype")
../../_images/a8c6d58fa3751143fc4ed461f674ff9a8bc9bd5e40f382c9884b1df16de9678c.png

We will further define a function to filter the data to only include cell types that are present in all groups. This function is used to create adatas for the different treatment groups. We will use those for the distance metric analysis, as well as for the compositional analysis.

def subset_to_common_cell_types(adata_temp):
    isecs = pd.crosstab(adata_temp.obs["Cluster"], adata_temp.obs["Group"])
    celltypes = isecs[(isecs > 0).all(axis=1)].index.values.tolist()
    adata_temp = adata_temp[adata_temp.obs["Cluster"].isin(celltypes)]
    return adata_temp
adata_chemo = adata[adata.obs["Treatment"] == "Chemo"]
adata_chemo = subset_to_common_cell_types(adata_chemo)
adata_chemo.obs["Group"].value_counts()
Group
Post-treat., Stable disease      20659
Pre-treat., Partial response     16827
Post-treat., Partial response    12137
Pre-treat., Stable disease       11807
Name: count, dtype: int64
adata_chemo_pdl1 = adata[adata.obs["Treatment"] == "Anti-PD-L1+Chemo"]
adata_chemo_pdl1 = subset_to_common_cell_types(adata_chemo_pdl1)
adata_chemo_pdl1.obs["Group"].value_counts()
Group
Pre-treat., Partial response     40251
Post-treat., Partial response    16379
Pre-treat., Stable disease       13435
Post-treat., Stable disease      10717
Name: count, dtype: int64

Using a distance metric to rank perturbation effects#

Pertpy offers various distance metrics to compare the expression profile of cells between different groups, usually different perturbations. Here, we will use the mean squared error (MSE) distance to compare the expression profiles of cells between the different treatment-response groups. Importantly, we will do this for the full dataset, as well as for the chemo and chemo + anti-PD-L1 groups separately. This allows us to identify potential differences in the perturbation effects between the two treatment groups.

distance = pt.tl.Distance("mse", obsm_key="X_pca")
df_all = distance.pairwise(adata, groupby="Group", show_progressbar=False)

df_chemo = distance.pairwise(adata_chemo, groupby="Group", show_progressbar=False)

df_chemo_pdl1 = distance.pairwise(adata_chemo_pdl1, groupby="Group", show_progressbar=False)
# We need the global max and min to ensure that the color scale is the same for all heatmaps
global_max = max(df_all.max(axis=None), df_chemo.max(axis=None), df_chemo_pdl1.max(axis=None))
global_min = min(df_all.min(axis=None), df_chemo.min(axis=None), df_chemo_pdl1.min(axis=None))
order = df_all.index.values
_, ax = plt.subplots(figsize=(4, 3.5))
sns.heatmap(df_all, annot=True, fmt=".2f", vmin=global_min, vmax=global_max, cmap="Reds", ax=ax)
plt.show()
../../_images/a156258d8b08c6c1231eb5c4cc60b23c7573469efd1e7ceaf5b8ac62b2ad1e2f.png
df_chemo = df_chemo.loc[order, order]
_, ax = plt.subplots(figsize=(4, 3.5))
sns.heatmap(df_chemo, annot=True, fmt=".2f", vmin=global_min, vmax=global_max, cmap="Reds", ax=ax)
plt.show()
../../_images/f43dcd9a971d8b8b56c72d55bea3fc8854fb09e972889d716c5a971c538192d5.png
df_chemo_pdl1 = df_chemo_pdl1.loc[order, order]
_, ax = plt.subplots(figsize=(4, 3.5))
sns.heatmap(df_chemo_pdl1, annot=True, fmt=".2f", vmin=global_min, vmax=global_max, cmap="Reds", ax=ax)
plt.show()
../../_images/fdf9e5498acc2f062b8934cde80c25e6abded8279ebffcd63b73c8b8dc1d6ce0.png

The heatmaps above show that patients who responded to chemotherapy showed a larger difference between their pre- and post-treatment expression profiles compared to those who responded to the combination of anti-PDL-1 and chemotherapy. This indicates that the combination therapy may have led to a less intense response or was applied in cases with poorer prognoses.

Identifying changes in cell type composition using scCODA#

scCODA is a Bayesian model that allows to identify changes in cell type composition. Here, we will use scCODA to identify changes in cell type composition between the different treatment groups. We will run, analogous to the analysis set-up for the distance metric calculation, scCODA for each treatment (Chemo and Anti-PD-L1+Chemo) separately. However, we want to use the same reference cell type in order to be able to compare the results. Hence, we will first prepare the scCODA model on all data to identify a reference cell type:

# Get reference cell type
sccoda_model = pt.tl.Sccoda()

sccoda_data = sccoda_model.load(
    adata,
    type="cell_level",
    generate_sample_level=True,
    cell_type_identifier="Cluster",
    sample_identifier="Sample",
    covariate_obs=["Group"],
)

sccoda_data = sccoda_model.prepare(
    sccoda_data,
    modality_key="coda",
    formula="Group",
    reference_cell_type="automatic",
    automatic_reference_absence_threshold=0.1,
)
💡 Automatic reference selection! Reference cell type set to t_mono-FCN1
💡 Zero counts encountered in data! Added a pseudocount of 0.5.
fig = sccoda_model.plot_boxplots(
    sccoda_data,
    modality_key="coda",
    feature_name="Group",
    return_fig=True,
    show=False,
)
fig.set_size_inches(25, 5)
fig.set_dpi(200)
fig.show()
../../_images/cdd551e16ee8614860c277d6e769f026d8388bf54ee540cc64e595966c3744d9.png

The monocyte cell type is identified as the reference cell type. We will now run scCODA for the chemo and chemo + anti-PD-L1 groups separately, using the monocyte cell type as the reference cell type.

sccoda_model = pt.tl.Sccoda()


def run_sccoda(subset, reference):
    sccoda_data = sccoda_model.load(
        subset,
        type="cell_level",
        generate_sample_level=True,
        cell_type_identifier="Cluster",
        sample_identifier="Sample",
        covariate_obs=["Group"],
    )

    sccoda_data = sccoda_model.prepare(
        sccoda_data,
        modality_key="coda",
        formula=f"C(Group, Treatment('{reference}'))",
        reference_cell_type="t_mono-FCN1",  # "automatic",
        automatic_reference_absence_threshold=0.1,
    )

    sccoda_model.run_nuts(sccoda_data, modality_key="coda")
    sccoda_model.set_fdr(sccoda_data, modality_key="coda", est_fdr=0.1)

    comparison_groups = [g for g in subset.obs["Group"].unique() if g != reference]
    effect_df = pd.DataFrame(
        {"log2-fold change": [], "Cell Type": [], "Reference": [], "Comp. Group": [], "Final Parameter": []}
    )
    for comp_group in comparison_groups:
        group_effects = sccoda_data["coda"].varm[f"effect_df_C(Group, Treatment('{reference}'))[T.{comp_group}]"][
            ["log2-fold change", "Final Parameter"]
        ]
        group_effects = group_effects[group_effects["Final Parameter"] != 0]
        group_effects["Cell Type"] = group_effects.index
        group_effects["Reference"] = reference
        group_effects["Comp. Group"] = comp_group
        effect_df = pd.concat([effect_df, group_effects])

    if not effect_df.empty:
        fig = sccoda_model.plot_effects_barplot(sccoda_data, return_fig=True, show=False)
        fig.set_size_inches(12, 4)
        fig.show()
    else:
        print(f"No significant effects for reference {reference}")

    return effect_df
credible_effects_chemo = pd.DataFrame(
    {"log2-fold change": [], "Cell Type": [], "Reference": [], "Comp. Group": [], "Final Parameter": []}
)

for reference in adata_chemo.obs["Group"].unique():
    print(reference)
    effect_df_chemo = run_sccoda(adata_chemo, reference=reference)
    credible_effects_chemo = pd.concat([credible_effects_chemo, effect_df_chemo])
Pre-treat., Stable disease
💡 Zero counts encountered in data! Added a pseudocount of 0.5.
Pre-treat., Partial response
💡 Zero counts encountered in data! Added a pseudocount of 0.5.
No significant effects for reference Pre-treat., Partial response
Post-treat., Stable disease
💡 Zero counts encountered in data! Added a pseudocount of 0.5.
Post-treat., Partial response
💡 Zero counts encountered in data! Added a pseudocount of 0.5.
sample: 100%|██████████| 11000/11000 [02:14<00:00, 81.75it/s, 127 steps of size 2.63e-02. acc. prob=0.89]
sample: 100%|██████████| 11000/11000 [02:17<00:00, 80.06it/s, 127 steps of size 3.40e-02. acc. prob=0.85]
sample: 100%|██████████| 11000/11000 [02:19<00:00, 79.04it/s, 127 steps of size 2.78e-02. acc. prob=0.87]
sample: 100%|██████████| 11000/11000 [02:09<00:00, 84.70it/s, 127 steps of size 3.30e-02. acc. prob=0.79]
../../_images/2218bdb31d1a1fcc70cd6a66cc540ce75313855803edbd846620a9e2e9f33dc4.png ../../_images/cc5f20378ad50110083724bae4e8cd670a96c3b49a0942c55731330bbb618819.png ../../_images/0dd7e1de497449142c07687b0ae1e8b822d1c371f5bfc85e2480407e69e6862e.png
credible_effects_chemo
log2-fold change Cell Type Reference Comp. Group Final Parameter
t_CD4_Tcm-LMNA -1.051869 t_CD4_Tcm-LMNA Pre-treat., Stable disease Pre-treat., Partial response -0.963443
t_CD8_Tem-GZMK -1.173683 t_CD8_Tem-GZMK Pre-treat., Stable disease Pre-treat., Partial response -1.047878
t_CD8_Trm-ZNF683 -0.737895 t_CD8_Trm-ZNF683 Pre-treat., Stable disease Pre-treat., Partial response -0.745813
t_Tn-LEF1 -0.577472 t_Tn-LEF1 Pre-treat., Stable disease Pre-treat., Partial response -0.634616
t_CD4_Tcm-LMNA -1.099619 t_CD4_Tcm-LMNA Post-treat., Stable disease Pre-treat., Partial response -1.012419
t_CD8_Tem-GZMK -1.243750 t_CD8_Tem-GZMK Post-treat., Stable disease Pre-treat., Partial response -1.112323
t_CD8_Trm-ZNF683 -0.768398 t_CD8_Trm-ZNF683 Post-treat., Stable disease Pre-treat., Partial response -0.782834
t_Tn-LEF1 -0.599843 t_Tn-LEF1 Post-treat., Stable disease Pre-treat., Partial response -0.666001
t_CD4_Tcm-LMNA -1.084423 t_CD4_Tcm-LMNA Post-treat., Partial response Pre-treat., Partial response -0.994157
t_CD8_Tem-GZMK -1.210772 t_CD8_Tem-GZMK Post-treat., Partial response Pre-treat., Partial response -1.081735
t_CD8_Trm-ZNF683 -0.748751 t_CD8_Trm-ZNF683 Post-treat., Partial response Pre-treat., Partial response -0.761486
t_Tn-LEF1 -0.566308 t_Tn-LEF1 Post-treat., Partial response Pre-treat., Partial response -0.635026

Multiple credible effects were found for the chemo treatment group. We will now run scCODA for the Anti-PD-L1 + Chemo group.

credible_effects_chemo_pdl1 = pd.DataFrame(
    {"log2-fold change": [], "Cell Type": [], "Reference": [], "Comp. Group": [], "Final Parameter": []}
)

for reference in adata_chemo_pdl1.obs["Group"].unique():
    print(reference)
    effect_df_chemo_pdl1 = run_sccoda(adata_chemo_pdl1, reference=reference)
    credible_effects_chemo_pdl1 = pd.concat([credible_effects_chemo_pdl1, effect_df_chemo_pdl1])
Pre-treat., Partial response
💡 Zero counts encountered in data! Added a pseudocount of 0.5.
No significant effects for reference Pre-treat., Partial response
Pre-treat., Stable disease
💡 Zero counts encountered in data! Added a pseudocount of 0.5.
No significant effects for reference Pre-treat., Stable disease
Post-treat., Partial response
💡 Zero counts encountered in data! Added a pseudocount of 0.5.
No significant effects for reference Post-treat., Partial response
Post-treat., Stable disease
💡 Zero counts encountered in data! Added a pseudocount of 0.5.
No significant effects for reference Post-treat., Stable disease
sample: 100%|██████████| 11000/11000 [02:08<00:00, 85.74it/s, 127 steps of size 4.46e-02. acc. prob=0.71]
sample: 100%|██████████| 11000/11000 [02:12<00:00, 82.78it/s, 127 steps of size 3.73e-02. acc. prob=0.77]
sample: 100%|██████████| 11000/11000 [02:09<00:00, 84.65it/s, 127 steps of size 3.12e-02. acc. prob=0.92]
sample: 100%|██████████| 11000/11000 [02:07<00:00, 86.24it/s, 127 steps of size 3.74e-02. acc. prob=0.75]
credible_effects_chemo_pdl1
log2-fold change Cell Type Reference Comp. Group Final Parameter

For the Anti-PD-L1 + Chemo treatment group, no credible effects, i.e. changes in cell type composition, were found.

This fits to our earlier findings from the distance metric analysis, where we observed that the Anti-PD-L1 + Chemo group showed a smaller difference between their pre- and post-treatment expression profiles compared to the Chemo group. Overall, these findings indicate that the Anti-PD-L1 + Chemo combination therapy may have led to a less intense response or was applied in cases with poorer prognoses.