Use-case: Deconvoluting drug responses in cancer cell lines

Use-case: Deconvoluting drug responses in cancer cell lines#

This tutorial shows how to perform the analysis we present in Figure 3 of the pertpy preprint. We will use the McFarland et al. 2020 dataset, which contains single-cell RNA-seq data from 172 cancer cell lines treated with 13 drugs. We will first preprocess the data, annotate it with metadata, and compare it to bulk RNA-seq data. We will then deconvolute the drug response of each cell line into viability-dependent and -independent components. Overall, this tutorial demonstrates how to use pertpy to derive insights into perturbation responses in a complex dataset comprising various cell lines and drugs, with the goal of better understanding the molecular mechanisms underlying drug responses and how they vary across cell lines.

import warnings

warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pertpy as pt
import scanpy as sc
import seaborn as sns
import statsmodels.api as sm
from tqdm import tqdm
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

Load and preprocess data#

The data can be obtained via pertpy’s dataloader. Let’s retrieve the data and perform some basic preprocessing.

adata = pt.dt.mcfarland_2020()
adata
AnnData object with n_obs × n_vars = 182875 × 32738
    obs: 'DepMap_ID', 'cancer', 'cell_det_rate', 'cell_line', 'cell_quality', 'channel', 'disease', 'dose_unit', 'dose_value', 'doublet_CL1', 'doublet_CL2', 'doublet_GMM_prob', 'doublet_dev_imp', 'doublet_z_margin', 'hash_assignment', 'hash_tag', 'num_SNPs', 'organism', 'percent.mito', 'perturbation', 'perturbation_type', 'sex', 'singlet_ID', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'time', 'tissue_type', 'tot_reads', 'nperts', 'ngenes', 'ncounts', 'percent_mito', 'percent_ribo', 'chembl-ID'
    var: 'ensembl_id', 'ncounts', 'ncells'
adata.obs.head(10)
DepMap_ID cancer cell_det_rate cell_line cell_quality channel disease dose_unit dose_value doublet_CL1 ... singlet_z_margin time tissue_type tot_reads nperts ngenes ncounts percent_mito percent_ribo chembl-ID
AAACCTGAGACAAGCC ACH-000174 True 0.114940 CAL62 normal nan thyroid cancer µM 5.0 CAL62_THYROID ... 14.820933 24 cell_line 1094 1 3758 17155.0 4.447683 28.079277 CHEMBL443684
AAACCTGAGAGATGAG ACH-000601 True 0.122660 MIAPACA2 normal nan pancreatic cancer µM 5.0 MIAPACA2_PANCREAS ... 14.746431 24 cell_line 1009 1 4009 19764.0 7.468124 25.460433 CHEMBL443684
AAACCTGAGCGTTGCC ACH-000601 True 0.176608 MIAPACA2 normal nan pancreatic cancer µM 5.0 MIAPACA2_PANCREAS ... 16.011316 24 cell_line 3022 1 5772 47352.0 2.755955 27.965028 CHEMBL443684
AAACCTGAGTAGGCCA ACH-000950 True 0.177802 LOVO normal nan colon/colorectal cancer µM 5.0 LOVO_LARGE_INTESTINE ... 11.195382 24 cell_line 3732 1 5813 74069.0 6.356235 49.395834 CHEMBL443684
AAACCTGAGTTCGATC ACH-000704 True 0.158993 OAW42 normal nan ovarian cancer µM 5.0 OAW42_OVARY ... 10.590814 24 cell_line 2396 1 5200 38785.0 8.521336 35.255898 CHEMBL443684
AAACCTGAGTTGAGAT ACH-000713 True 0.184940 CAOV3 normal nan ovarian cancer µM 5.0 CAOV3_OVARY ... 15.769723 24 cell_line 3177 1 6046 49613.0 2.866184 25.840002 CHEMBL443684
AAACCTGCAATGAATG ACH-000390 True 0.152774 LUDLU1 empty_droplet nan lung cancer µM 5.0 LUDLU1_LUNG ... 0.073927 24 cell_line 2026 1 4996 27298.0 9.293721 16.557990 CHEMBL443684
AAACCTGCACATGTGT ACH-000966 True 0.129431 IGROV1 empty_droplet nan ovarian cancer µM 5.0 IGROV1_OVARY ... 0.799263 24 cell_line 1490 1 4233 27735.0 4.575446 38.629890 CHEMBL443684
AAACCTGCACGGCGTT ACH-000368 True 0.163496 SNU1105 normal nan brain cancer µM 5.0 SNU1105_CENTRAL_NERVOUS_SYSTEM ... 8.935577 24 cell_line 2416 1 5345 40236.0 2.192067 29.033701 CHEMBL443684
AAACCTGCACTTAACG ACH-000749 True 0.119689 DMS273 normal nan lung cancer µM 5.0 DMS273_LUNG ... 18.328158 24 cell_line 1158 1 3914 21471.0 6.571655 37.380653 CHEMBL443684

10 rows × 36 columns

adata.obs["perturbation_type"].value_counts()
perturbation_type
drug      154710
CRISPR     28165
Name: count, dtype: int64
adata.obs["perturbation"].value_counts()
perturbation
Trametinib     41300
control        29143
BRD3379        21206
Dabrafenib     12814
Navitoclax      9623
sgLACZ          7567
sgOR2J2         7250
AZD5591         7098
sgGPX4-2        6831
sgGPX4-1        6517
Idasanutlin     5990
JQ1             5591
Bortezomib      5150
Prexasertib     3972
Everolimus      3825
Afatinib        3807
Taselisib       3098
Gemcitabine     2093
Name: count, dtype: int64

The dataset is annotated with various metrics for cell quality control, and information of cell line, including the cancer type, as well as details on the perturbation applied. The perturbations comprise CRISPR knockouts and drug treatments. We only want to look at the drug perturbations in this analysis, so we will filter out cells that were treated with other perturbations.

adata = adata[adata.obs["perturbation_type"] == "drug"]
adata
View of AnnData object with n_obs × n_vars = 154710 × 32738
    obs: 'DepMap_ID', 'cancer', 'cell_det_rate', 'cell_line', 'cell_quality', 'channel', 'disease', 'dose_unit', 'dose_value', 'doublet_CL1', 'doublet_CL2', 'doublet_GMM_prob', 'doublet_dev_imp', 'doublet_z_margin', 'hash_assignment', 'hash_tag', 'num_SNPs', 'organism', 'percent.mito', 'perturbation', 'perturbation_type', 'sex', 'singlet_ID', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'time', 'tissue_type', 'tot_reads', 'nperts', 'ngenes', 'ncounts', 'percent_mito', 'percent_ribo', 'chembl-ID'
    var: 'ensembl_id', 'ncounts', 'ncells'

We will now perform some standard preprocessing steps on the data, such as filtering out low-quality cells and normalizing the data, and identifying highly variable genes. As we will use EdgeR for differential expression analysis, which requires raw counts, we will also save the raw counts in the raw_counts layer.

sc.pp.filter_genes(adata, min_cells=30)

adata.layers["raw_counts"] = adata.X.copy()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.neighbors(adata)
sc.pp.pca(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color="perturbation")
WARNING: You’re trying to run this on 24424 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
../../_images/d593a6eef9c485eb6cb3354479b3d8ff7dec7f8f8fbd12ee65606b9dd892a661.png

Metadata annotation#

Datasets often come with metadata that can enable more detailed analyses, as those presented in this tutorial. However, the extent of metadata can vary greatly between datasets, and each dataset usually has its own metadata format. To address this, databases that allow for standardized metadata annotation exist. Pertpy offers multiple metadata classes to query these databases and annotate your data. Here, we will use the CellLine and Moa classes to annotate the cell lines and mechanisms of action (MOA) of the drugs, respectively.

cl_metadata = pt.md.CellLine()
cl_metadata.annotate(
    adata,
    query_id="DepMap_ID",
    reference_id="ModelID",
    fetch=["CellLineName", "Age", "OncotreePrimaryDisease", "SangerModelID", "OncotreeLineage"],
)
AnnData object with n_obs × n_vars = 154710 × 24424
    obs: 'DepMap_ID', 'cancer', 'cell_det_rate', 'cell_line', 'cell_quality', 'channel', 'disease', 'dose_unit', 'dose_value', 'doublet_CL1', 'doublet_CL2', 'doublet_GMM_prob', 'doublet_dev_imp', 'doublet_z_margin', 'hash_assignment', 'hash_tag', 'num_SNPs', 'organism', 'percent.mito', 'perturbation', 'perturbation_type', 'sex', 'singlet_ID', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'time', 'tissue_type', 'tot_reads', 'nperts', 'ngenes', 'ncounts', 'percent_mito', 'percent_ribo', 'chembl-ID', 'CellLineName', 'Age', 'OncotreePrimaryDisease', 'SangerModelID', 'OncotreeLineage'
    var: 'ensembl_id', 'ncounts', 'ncells', 'n_cells'
    uns: 'log1p', 'neighbors', 'pca', 'umap', 'perturbation_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'raw_counts'
    obsp: 'distances', 'connectivities'
sc.pl.umap(adata, color=["OncotreeLineage"])
../../_images/6704d73b806ce3a91c1c48e16659e7890a12597d0a28778636257039e79e4c7f.png
moa_metadata = pt.md.Moa()
moa_metadata.annotate(
    adata,
    query_id="perturbation",
)

# Add control annotations
adata.obs["moa"] = [
    "Control" if pert == "control" else moa for moa, pert in zip(adata.obs["moa"], adata.obs["perturbation"])
]
adata.obs["target"] = [
    "Control" if pert == "control" else target for target, pert in zip(adata.obs["target"], adata.obs["perturbation"])
]
💡 There are 14 identifiers in `adata.obs`.However, 5 identifiers can't be found in the moa annotation,leading to the presence of NA values for their respective metadata.
Please check again: *unmatched_identifiers[:verbosity]...
sc.pl.umap(adata, color=["moa"])
../../_images/758a8522a27ab4714590519e862edf371661cdb2247fb4e30c1f3588ceebf7ff.png

Next, we will annotate the dataset with GDSC IC50 values, which indicate the drug response of each cell line to each drug. We will use the GDSC class for this purpose. Note that we will use the previously queried SangerModelID as cell line name for this step. There are two versions of the dataset: GDSC1 and GDSC2. We will annotate the dataset with both versions.

cl_metadata.annotate_from_gdsc(
    adata,
    query_id="SangerModelID",
    reference_id="sanger_model_id",
    query_perturbation="perturbation",
    gdsc_dataset="gdsc_1",
)
adata.obs["ln_ic50_GDSC1"] = adata.obs["ln_ic50"].copy()

cl_metadata.annotate_from_gdsc(
    adata,
    query_id="SangerModelID",
    reference_id="sanger_model_id",
    query_perturbation="perturbation",
    gdsc_dataset="gdsc_2",
)
adata.obs["ln_ic50_GDSC2"] = adata.obs["ln_ic50"].copy()

del adata.obs["ln_ic50"]
adata
💡 There are 140 identifiers in `adata.obs`.However, 25 identifiers can't be found in the drug response annotation,leading to the presence of NA values for their respective metadata.
Please check again: *unmatched_identifiers[:verbosity]...
💡 There are 140 identifiers in `adata.obs`.However, 25 identifiers can't be found in the drug response annotation,leading to the presence of NA values for their respective metadata.
Please check again: *unmatched_identifiers[:verbosity]...
AnnData object with n_obs × n_vars = 154710 × 24424
    obs: 'SangerModelID', 'perturbation', 'DepMap_ID', 'cancer', 'cell_det_rate', 'cell_line', 'cell_quality', 'channel', 'disease', 'dose_unit', 'dose_value', 'doublet_CL1', 'doublet_CL2', 'doublet_GMM_prob', 'doublet_dev_imp', 'doublet_z_margin', 'hash_assignment', 'hash_tag', 'num_SNPs', 'organism', 'percent.mito', 'perturbation_type', 'sex', 'singlet_ID', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'time', 'tissue_type', 'tot_reads', 'nperts', 'ngenes', 'ncounts', 'percent_mito', 'percent_ribo', 'chembl-ID', 'CellLineName', 'Age', 'OncotreePrimaryDisease', 'OncotreeLineage', 'moa', 'target', 'ln_ic50_GDSC1', 'ln_ic50_GDSC2'
    var: 'ensembl_id', 'ncounts', 'ncells', 'n_cells'
    uns: 'log1p', 'neighbors', 'pca', 'umap', 'perturbation_colors', 'OncotreeLineage_colors', 'moa_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'raw_counts'
    obsp: 'distances', 'connectivities'
sc.pl.umap(adata, color=["ln_ic50_GDSC2"])
../../_images/11587d9fe099b74d60fbff70f40e3e0117bc735b580ca0e256516a0171e7354d.png

Comparison to bulk RNA-seq data#

One question we might have after sequencing an in vitro experiment is “How similar are our expression profiles compared to the public database?” To answer this, we generate “pseudobulks” by aggregating counts to the cell-type level and then compare them with bulk RNA-seq data.

ps = pt.tl.PseudobulkSpace()
pdata = ps.compute(adata, target_col="CellLineName", groups_col="perturbation")
base_line = pdata[pdata.obs.perturbation == "control"]
base_line.obs.index = base_line.obs.index.str.replace("_control", "")
cl_metadata.annotate_bulk_rna(base_line, cell_line_source="broad", query_id="DepMap_ID")
❗ To annotate bulk RNA data from Broad Institue, `DepMap_ID` is used as default reference and query identifier if no `reference_id` is given.
Ensure that `DepMap_ID` is available in 'adata.obs'.
Alternatively, use `annotate()` to annotate the cell line first 
💡 There are 170 identifiers in `adata.obs`.However, 1 identifiers can't be found in the bulk RNA annotation,leading to the presence of NA values for their respective metadata.
Please check again: *unmatched_identifiers[:verbosity]...
AnnData object with n_obs × n_vars = 170 × 24424
    obs: 'SangerModelID', 'perturbation', 'DepMap_ID', 'cancer', 'cell_line', 'disease', 'dose_unit', 'dose_value', 'organism', 'perturbation_type', 'sex', 'singlet_ID', 'tissue_type', 'nperts', 'chembl-ID', 'CellLineName', 'Age', 'OncotreePrimaryDisease', 'OncotreeLineage', 'moa', 'target', 'ln_ic50_GDSC1', 'ln_ic50_GDSC2', 'psbulk_n_cells', 'psbulk_counts'
    var: 'ensembl_id', 'ncounts', 'ncells', 'n_cells'
    obsm: 'bulk_rna_broad'
    layers: 'psbulk_props'
base_line.obsm["bulk_rna_broad"].head(5)
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000938 ENSG00000000971 ENSG00000001036 ENSG00000001084 ENSG00000001167 ... ENSG00000288714 ENSG00000288717 ENSG00000288718 ENSG00000288719 ENSG00000288720 ENSG00000288721 ENSG00000288722 ENSG00000288723 ENSG00000288724 ENSG00000288725
22Rv1 2.179511 0.0 6.316146 3.407353 4.642702 0.014355 0.124328 5.816088 7.045814 5.057017 ... 0.000000 0.000000 0.000000 0.028569 0.250962 0.432959 3.875780 0.137504 0.0 0.000000
253J-BV 3.942045 0.0 5.967169 1.883621 3.581351 0.000000 0.084064 5.087463 4.444932 3.794936 ... 0.000000 0.201634 0.124328 0.000000 0.150560 0.526069 4.526069 0.214125 0.0 0.000000
42-MG-BA 3.880686 0.0 6.733083 1.922198 3.390943 0.028569 0.575312 5.816856 3.313246 3.903038 ... 0.028569 0.000000 0.042644 0.014355 0.070389 0.555816 2.601697 0.000000 0.0 0.084064
5637 5.128871 0.0 6.691534 2.010780 4.976364 0.163499 1.636915 6.193575 3.505891 3.709291 ... 0.000000 0.000000 0.150560 0.028569 0.014355 0.298658 2.978196 0.000000 0.0 0.000000
639-V 4.328406 0.0 7.058749 1.891419 3.529821 0.000000 3.878725 6.432792 4.698774 4.912650 ... 0.028569 0.201634 0.028569 0.056584 0.189034 0.505891 3.820690 0.000000 0.0 0.000000

5 rows × 53970 columns

# We can only correlate the expression of genes that are present in both datasets, hence we filter the data accordingly
overlapping_genes = set(base_line.var.ensembl_id) & set(base_line.obsm["bulk_rna_broad"].columns)
base_line = base_line[:, base_line.var["ensembl_id"].isin(overlapping_genes)]
base_line.obsm["bulk_rna_broad"] = base_line.obsm["bulk_rna_broad"][base_line.var.ensembl_id]

Next, we log-transform the bulk RNA-seq data and correlate the pseudobulks with the bulk RNA-seq data to see how similar the expression profiles are.

sc.pp.log1p(base_line)
corr, pvals, unmatched_cl_corr, unmatched_cl_pvals = cl_metadata.correlate(
    base_line, identifier="DepMap_ID", metadata_key="bulk_rna_broad"
)
❗ Column name of metadata is not the same as the index of adata.var. Ensure that the genes are in the same order.
# Visualize the correlation of cell lines by scatter plot
cl_metadata.plot_correlation(
    base_line,
    corr=corr,
    pval=pvals,
    identifier="DepMap_ID",
    metadata_key="bulk_rna_broad",
    subset_identifier=None,
)
../../_images/0bcfe4418269762f1dce674ef86fd209a02716aa808980199f891c5d3de17dc3.png
plot_df = pd.DataFrame(
    {
        "Bulk RNA Broad": base_line.obsm["bulk_rna_broad"].to_numpy().flatten(),
        "Baseline": base_line.X.flatten(),
    }
)

sns.regplot(
    x="Bulk RNA Broad",
    y="Baseline",
    data=plot_df,
    ci=99,
    scatter_kws={"s": 2, "alpha": 0.05},
    line_kws={"color": "red", "label": "Linear regression"},
)

plt.legend()
<matplotlib.legend.Legend at 0x5c9645510>
../../_images/f62bd3b87f5306ae5e3e7ef9eb2a20dc7634299d6afe6a9bc451f99d9244ee5b.png

The correlation between the pseudobulks and the bulk RNA-seq data is quite high, indicating that the expression profiles of the cell lines in the McFarland dataset are similar to those in the bulk RNA-seq data.

Deconvoluting drug responses#

Above, we have annotated our dataset with GDSC IC50 values, indicating the drug response of each cell line to each drug. We can now use this information to deconvolute the drug response of each cell line into viability-dependent and -independent components.

Let’s first check the correlation between the IC50 values of the two GDSC datasets:

adata.obs[["ln_ic50_GDSC1", "ln_ic50_GDSC2"]].corr()
ln_ic50_GDSC1 ln_ic50_GDSC2
ln_ic50_GDSC1 1.000000 0.839123
ln_ic50_GDSC2 0.839123 1.000000
# Check for how many cell lines we have GDSC data for the different drugs
adata.obs[["perturbation", "ln_ic50_GDSC2"]].drop_duplicates()["perturbation"].value_counts()
perturbation
Trametinib     115
Navitoclax      68
Dabrafenib      67
Afatinib        65
Gemcitabine     65
Taselisib       64
JQ1             63
Bortezomib      16
AZD5591          1
BRD3379          1
Everolimus       1
Idasanutlin      1
Prexasertib      1
control          1
Name: count, dtype: int64

We will look into the Dabrafenib drug response and deconvolute it into viability-dependent and -independent response, as shown in the McFarland et al. 2020 paper. We will use the EdgeR package for this analysis.

Precisely, we will perform the following steps:

  1. Obtain log-fold changes (logFC) for each gene in each cell line treated with Dabrafenib compared to the control.

  2. Perform linear regression between the logFC values and the GDSC IC50 values to obtain the slope and intercept for each gene.

  3. Visualize the results in a volcano plot.

Let’s start with the first step:

adata_dabrafenib = adata[adata.obs["perturbation"].isin(["control", "Dabrafenib"])]

logfc_df = pd.DataFrame(columns=adata_dabrafenib.var_names)

for cell_line in tqdm(adata_dabrafenib.obs["SangerModelID"].unique()):
    subset = adata_dabrafenib[adata_dabrafenib.obs["SangerModelID"] == cell_line]
    if subset.n_obs < 20:  # Threshold from the McFarland paper
        continue
    if "Dabrafenib" not in subset.obs["perturbation"].unique():
        continue

    edgr = pt.tl.EdgeR(subset, design="~perturbation", layer="raw_counts")
    edgr.fit()

    res_df = edgr.test_contrasts(edgr.contrast("perturbation", "control", "Dabrafenib"))
    res_df = res_df[["variable", "log_fc"]]
    res_df = res_df.set_index("variable")
    res_df = res_df.reindex(adata.var_names)

    logfc_df.loc[cell_line] = res_df["log_fc"]

logfc_df.to_csv("output/logfc_df_dabrafenib.csv")
logfc_df = pd.read_csv("output/logfc_df_dabrafenib.csv", index_col=0)
logfc_df = logfc_df.loc[:, (logfc_df != 0).any(axis=0)]

Now that we have the log-fold changes for each gene in each cell line treated with Dabrafenib, we can perform linear regression between the log-fold changes and the GDSC IC50 values to obtain the slope and intercept for each gene. The slope represents the viability-dependent response, while the intercept represents the viability-independent response.

We will use the GDSC2 dataset for this analysis, but you can also use the GDSC1 dataset by changing the gdsc_dataset parameter.

gdsc_dataset = 2  # We will use GDSC2 dataset for this analysis

lr_params = pd.DataFrame(columns=["gene", "slope", "intercept", "slope_pval", "intercept_pval"])

cell_lines = logfc_df.index
sens_cell_lines = (
    adata_dabrafenib.obs[[f"ln_ic50_GDSC{gdsc_dataset}", "SangerModelID"]]
    .drop_duplicates()
    .dropna()
    .set_index("SangerModelID")
)
cell_lines = [cell_line for cell_line in cell_lines if cell_line in sens_cell_lines.index]

X = sens_cell_lines.loc[cell_lines][f"ln_ic50_GDSC{gdsc_dataset}"].values
na_mask = np.isnan(X)
X = X[~na_mask]

for gene in tqdm(logfc_df.columns):
    y = logfc_df.loc[cell_lines][gene].values
    y = y[~na_mask]
    X = sm.add_constant(X)
    model = sm.OLS(y, X)
    results = model.fit()

    lr_params.loc[gene] = [gene, results.params[1], results.params[0], results.pvalues[1], results.pvalues[0]]

# Filter out with both zero slope AND intercept
lr_params = lr_params[(lr_params["slope"] != 0) | (lr_params["intercept"] != 0)]

# Multiple testing correction
lr_params["slope_pval_corrected"] = sm.stats.multipletests(lr_params["slope_pval"], method="fdr_bh")[1]
lr_params["intercept_pval_corrected"] = sm.stats.multipletests(lr_params["intercept_pval"], method="fdr_bh")[1]

lr_params["-log10(slope_pval_corrected)"] = -np.log10(lr_params["slope_pval_corrected"])
lr_params["-log10(intercept_pval_corrected)"] = -np.log10(lr_params["intercept_pval_corrected"])

lr_params.to_csv("output/linear_regression_results_dabrafenib.csv")
100%|██████████| 24424/24424 [01:15<00:00, 325.36it/s]
lr_params.head(5)
gene slope intercept slope_pval intercept_pval slope_pval_corrected intercept_pval_corrected -log10(slope_pval_corrected) -log10(intercept_pval_corrected)
RP11-34P13.7 RP11-34P13.7 0.000017 0.005576 0.997914 0.863109 0.999077 0.954066 0.000401 0.020422
AL627309.1 AL627309.1 0.002542 0.008615 0.779733 0.849677 0.940812 0.948825 0.026497 0.022814
AP006222.2 AP006222.2 0.027225 -0.045638 0.100015 0.577922 0.365464 0.822735 0.437156 0.084740
RP4-669L17.10 RP4-669L17.10 -0.015041 0.039887 0.088418 0.362479 0.339679 0.661822 0.468932 0.179258
RP4-669L17.2 RP4-669L17.2 -0.000496 0.005325 0.763366 0.518953 0.933945 0.785116 0.029679 0.105066

Let’s plot the genes that show a particularly high viability-independent response to Dabrafenib, using pertpy’s volcano plot implementation:

# We will not actually use the EdgeR method here, we just create the object so that we can subsequently use the plot_volcano method
edgr = pt.tl.EdgeR(adata, design="~perturbation", layer="raw_counts")
edgr.plot_volcano(
    lr_params,
    log2fc_col="intercept",
    pvalue_col="intercept_pval_corrected",
    symbol_col="gene",
    pval_thresh=0.05,
    log2fc_thresh=0.5,
)
../../_images/876f52bf23600d183001c0c372c7bfcb05ca4d670f1aef0606f75cc807f6331e.png

The volcano plot above shows the viability-independent response of each gene to Dabrafenib, represented by the intercept. We can also look at the linear regression for individual genes. Let’s pick the gene EGR1 with a highly negative intercept and visualize the linear regression between the log-fold changes and the GDSC IC50 values:

cell_lines = logfc_df.index
sens_cell_lines = (
    adata_dabrafenib.obs[[f"ln_ic50_GDSC{gdsc_dataset}", "SangerModelID"]]
    .drop_duplicates()
    .dropna()
    .set_index("SangerModelID")
)
cell_lines = [cell_line for cell_line in cell_lines if cell_line in sens_cell_lines.index]

X = sens_cell_lines.loc[cell_lines][f"ln_ic50_GDSC{gdsc_dataset}"].values
na_mask = np.isnan(X)
X = X[~na_mask]

y = logfc_df.loc[cell_lines]["UBE2V2"].values
y = y[~na_mask]
X = sm.add_constant(X)

fig, ax = plt.subplots(figsize=(4, 4))
sns.scatterplot(
    x=X[:, 1],
    y=y,
    ax=ax,
)

sns.regplot(
    x=X[:, 1],
    y=y,
    scatter=False,
    color="red",
    ax=ax,
)

ax.set_xlabel(f"ln_ic50_GDSC{gdsc_dataset}")
ax.set_ylabel("logfc_EGR1")
plt.show()
../../_images/c317bd7195677977d701d7cb36988739710087355a020f8ef57a97864f6c3d7c.png

Similarly, we can examine the viability-dependent response of each gene to Dabrafenib, represented by the slope:

edgr.plot_volcano(
    lr_params,
    log2fc_col="slope",
    pvalue_col="slope_pval_corrected",
    symbol_col="gene",
    pval_thresh=0.1,
    log2fc_thresh=0.1,
)
../../_images/cf0437c163b42f97f8873a35d493f3ab26927eb12d68b3fdb7359bcc6f2554d2.png

Conclusion#

Overall, we have deconvoluted the drug response of cell lines to Dabrafenib treatment into viability-dependent and -independent components. This analysis can provide insights into the molecular mechanisms underlying drug responses and pertpy can help you perform this analysis in a straightforward manner. For further details on the analysis and to see the results when performing this analysis with another drug, please refer to the pertpy preprint.