Metadata annotation#

Publically available metadata can help contextualize in-house experiments. Pertpy allows metadata to be easily fetched and perturbations to be annotated, augmenting existing datasets to increase sample sizes for training of machine learning models and add prior knowledge.

A couple of databases can be queried, which we categorize by the type of metadata:

  • Cell line metadata:

    • Cancer Dependency Map (DepMap) at Broad

    • Genomics of Drug Sensitivity in Cancer (GDSC)

  • Genomic datasets:

    • DepMap at Broad: Gene expression data

    • DepMap at Sanger: Gene expression and protein intensity values

  • Cell line x perturbation interactions:

    • Genomics of Drug Sensitivity in Cancer (GDSC): Drug sensitivity

This notebook demonstrates how you can use pertpy metadata with two examples:

  1. Sanity check of private data with publically available cell line expression profiles

  2. Analysis of cell-line specific IC50-related genes using the GDSC database

Setup#

import anndata as ad
import matplotlib.pyplot as plt
import pandas as pd
import pertpy as pt
import scanpy as sc
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

Dataset#

Let’s use the dataset from the original MIX-Seq paper (McFarland et al., 2020), a scRNA-seq dataset of 99 cell lines and 13 different drugs. We subset it to 50000 cells to speed up the run time.

adata = pt.dt.mcfarland_2020()
adata
AnnData object with n_obs × n_vars = 182875 × 32738
    obs: 'DepMap_ID', 'cancer', 'cell_det_rate', 'cell_line', 'cell_quality', 'channel', 'disease', 'dose_unit', 'dose_value', 'doublet_CL1', 'doublet_CL2', 'doublet_GMM_prob', 'doublet_dev_imp', 'doublet_z_margin', 'hash_assignment', 'hash_tag', 'num_SNPs', 'organism', 'percent.mito', 'perturbation', 'perturbation_type', 'sex', 'singlet_ID', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'time', 'tissue_type', 'tot_reads', 'nperts', 'ngenes', 'ncounts', 'percent_mito', 'percent_ribo', 'chembl-ID'
    var: 'ensembl_id', 'ncounts', 'ncells'
sc.pp.filter_genes(adata, min_cells=30)
sc.pp.normalize_total(adata)
sc.pp.subsample(adata, n_obs=50000, random_state=42)

Cell line metadata#

annotate uses a column in .obs called DepMap_ID as identifiers and adds metadata columns to .obs. We only annotate using 3 metadata columns here, but you can see the full set of available metadata using lookup.cell_line_meta.

Data is downloaded to the directory for cache files used by scanpy (Defaults to cache).

cl_metadata = pt.md.CellLine()
cl_metadata.annotate(
    adata,
    query_id="DepMap_ID",
    reference_id="ModelID",
    fetch=["CellLineName", "Age", "OncotreePrimaryDisease", "SangerModelID"],
)
AnnData object with n_obs × n_vars = 50000 × 25031
    obs: 'DepMap_ID', 'cancer', 'cell_det_rate', 'cell_line', 'cell_quality', 'channel', 'disease', 'dose_unit', 'dose_value', 'doublet_CL1', 'doublet_CL2', 'doublet_GMM_prob', 'doublet_dev_imp', 'doublet_z_margin', 'hash_assignment', 'hash_tag', 'num_SNPs', 'organism', 'percent.mito', 'perturbation', 'perturbation_type', 'sex', 'singlet_ID', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'time', 'tissue_type', 'tot_reads', 'nperts', 'ngenes', 'ncounts', 'percent_mito', 'percent_ribo', 'chembl-ID', 'CellLineName', 'Age', 'OncotreePrimaryDisease', 'SangerModelID'
    var: 'ensembl_id', 'ncounts', 'ncells', 'n_cells'

Annotated metadata is saved as additional columns in adata.obs.

adata.obs.iloc[:, 36:]
CellLineName Age OncotreePrimaryDisease SangerModelID
TTATGCTAGTGATCGG NCI-H1581 44.0 Non-Small Cell Lung Cancer SIDM00748
AGCGTCGTCTCGGACG SNU-1105 61.0 Diffuse Glioma NaN
TTGGGTACATTGCTTT SK-MEL-3 42.0 Melanoma SIDM01105
TACTTGTTCAGCGACC 786-O 58.0 Renal Cell Carcinoma SIDM00125
CACAGTATCGGCTACG COLO-680N 57.0 Esophageal Squamous Cell Carcinoma SIDM00956
... ... ... ... ...
TGCCCTACATGAGCGA KYSE-270 79.0 Esophageal Squamous Cell Carcinoma SIDM01029
CTGATCCGTATGTCAC BT-474 60.0 Invasive Breast Carcinoma SIDM00963
TATTACCAGCAGACTG NCI-H460 NaN Non-Small Cell Lung Cancer SIDM00144
TGTGGTATCTCCAGGG SK-MEL-3 42.0 Melanoma SIDM01105
AGGCCGTAGAGCTGGT RCC10RGB NaN Renal Cell Carcinoma SIDM00235

50000 rows × 4 columns

We can check the number of overlapping cell lines using lookup.

lookup = cl_metadata.lookup()
lookup.available_cell_lines(query_id_list=adata.obs.cell_line.unique(), reference_id="CellLineName")
173 cell lines are not found in the metadata.
36 cell lines are found! 

Hmmm, it looks like there isn’t much overlap with existing databases using the cell line names. However, there is complete overlap when mapping using DepMap_ID! This is due a slight inconsistency in using dashes in cell line names, so try to use a unique ID as much as possible.

lookup.available_cell_lines(query_id_list=adata.obs.DepMap_ID.unique(), reference_id="ModelID")
0 cell lines are not found in the metadata.
209 cell lines are found! 

One question we might have after sequencing an in vitro experiment is “How similar are our expression profiles compared to the public database?” To answer this, we generate “pseudobulks” by aggregating counts to the cell-type level and then compare them with bulk RNA-seq data.

ps = pt.tl.PseudobulkSpace()
# Calculate the pseudobulks
pdata = ps.compute(adata, target_col="CellLineName", groups_col="perturbation")
/opt/homebrew/Caskroom/miniforge/base/envs/pertpy-env/lib/python3.10/site-packages/decoupler/utils_anndata.py:181: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  cols = obs.groupby([sample_col, groups_col]).apply(lambda x: x.apply(lambda y: len(y.unique()) == 1)).all(0)
/opt/homebrew/Caskroom/miniforge/base/envs/pertpy-env/lib/python3.10/site-packages/anndata/_core/anndata.py:522: FutureWarning: The dtype argument is deprecated and will be removed in late 2024.
  warnings.warn(
# Extract the base line expression
base_line = pdata[pdata.obs.perturbation == "control"]
base_line.obs.index = base_line.obs.index.str.replace("_control", "")
lookup.available_bulk_rna(query_id_list=adata.obs.DepMap_ID.unique(), cell_line_source="broad")
1 cell lines are not found in the metadata.
208 cell lines are found! 

We annotate bulk RNA expression data from the Broad Institute because it contains more overlapping cell lines with the McFarland dataset.

cl_metadata.annotate_bulk_rna(base_line, cell_line_source="broad", query_id="DepMap_ID")
To annotate bulk RNA data from Broad Institue,  `DepMap_ID` is used as default reference and query identifier if no
`reference_id` is given.  Ensure that `DepMap_ID` is available in 'adata.obs'.  Alternatively, use `annotate()` to 
annotate the cell line first 
[bold blue]There are 155 identifiers in `adata.obs`.However, 1 identifiers can't be found in the bulk RNA annotation,leading to the presence of NA values for their respective metadata.

- Please check again: 
- ACH-000047
- ...
/opt/homebrew/Caskroom/miniforge/base/envs/pertpy-env/lib/python3.10/site-packages/pertpy/metadata/_cell_line.py:471: ImplicitModificationWarning: Setting element `.obsm['bulk_rna_broad']` of view, initializing view as actual.
  adata.obsm["bulk_rna_broad"] = ccle_expression
AnnData object with n_obs × n_vars = 155 × 25031
    obs: 'DepMap_ID', 'cancer', 'cell_line', 'disease', 'dose_unit', 'dose_value', 'organism', 'perturbation', 'perturbation_type', 'sex', 'singlet_ID', 'tissue_type', 'nperts', 'chembl-ID', 'CellLineName', 'Age', 'OncotreePrimaryDisease', 'SangerModelID', 'psbulk_n_cells', 'psbulk_counts'
    var: 'ensembl_id', 'ncounts', 'ncells', 'n_cells'
    obsm: 'bulk_rna_broad'
    layers: 'psbulk_props'

The bulk RNA expression data is stored in adata.obsm[bulk_rna_broad]

base_line.obsm["bulk_rna_broad"]
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000938 ENSG00000000971 ENSG00000001036 ENSG00000001084 ENSG00000001167 ... ENSG00000288714 ENSG00000288717 ENSG00000288718 ENSG00000288719 ENSG00000288720 ENSG00000288721 ENSG00000288722 ENSG00000288723 ENSG00000288724 ENSG00000288725
22Rv1 2.179511 0.0 6.316146 3.407353 4.642702 0.014355 0.124328 5.816088 7.045814 5.057017 ... 0.000000 0.000000 0.000000 0.028569 0.250962 0.432959 3.875780 0.137504 0.0 0.000000
253J-BV 3.942045 0.0 5.967169 1.883621 3.581351 0.000000 0.084064 5.087463 4.444932 3.794936 ... 0.000000 0.201634 0.124328 0.000000 0.150560 0.526069 4.526069 0.214125 0.0 0.000000
42-MG-BA 3.880686 0.0 6.733083 1.922198 3.390943 0.028569 0.575312 5.816856 3.313246 3.903038 ... 0.028569 0.000000 0.042644 0.014355 0.070389 0.555816 2.601697 0.000000 0.0 0.084064
5637 5.128871 0.0 6.691534 2.010780 4.976364 0.163499 1.636915 6.193575 3.505891 3.709291 ... 0.000000 0.000000 0.150560 0.028569 0.014355 0.298658 2.978196 0.000000 0.0 0.000000
639-V 4.328406 0.0 7.058749 1.891419 3.529821 0.000000 3.878725 6.432792 4.698774 4.912650 ... 0.028569 0.201634 0.028569 0.056584 0.189034 0.505891 3.820690 0.000000 0.0 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
TEN 4.418190 0.0 7.247453 2.134221 3.841973 0.000000 0.070389 6.435795 3.740928 4.249445 ... 0.000000 0.000000 0.084064 0.014355 0.226509 0.389567 4.275752 0.070389 0.0 0.000000
TUHR4TKB 4.309613 0.0 7.265849 2.042644 3.553361 0.097611 5.257765 6.048105 4.058316 3.539779 ... 0.000000 0.000000 0.000000 0.000000 0.014355 0.411426 4.647315 0.000000 0.0 0.000000
UACC-257 4.134221 0.0 6.318498 2.042644 4.228049 0.042644 0.176323 5.866908 4.586164 4.361066 ... 0.000000 0.000000 0.286881 0.028569 0.111031 0.632268 5.216843 0.000000 0.0 0.000000
UM-UC-1 5.683416 0.0 7.125155 1.963474 2.523562 0.028569 2.969012 5.472488 4.869378 3.543496 ... 0.000000 0.000000 0.000000 0.189034 0.000000 0.731183 2.587365 0.000000 0.0 0.000000
WM1799 3.357552 0.0 6.772282 2.731183 4.817623 0.028569 1.794936 5.285402 4.085765 4.013462 ... 0.000000 0.000000 0.263034 0.000000 0.111031 0.622930 2.003602 0.000000 0.0 0.000000

155 rows × 53970 columns

# Subset and sort overlapping genes
overlapping_genes = set(base_line.var.ensembl_id) & set(base_line.obsm["bulk_rna_broad"].columns)
base_line = base_line[:, base_line.var["ensembl_id"].isin(overlapping_genes)]
base_line.obsm["bulk_rna_broad"] = base_line.obsm["bulk_rna_broad"][base_line.var.ensembl_id]
/var/folders/zz/rylc5nzn5bzd41ydv9q1q7tm0000gn/T/ipykernel_82725/2526010814.py:6: ImplicitModificationWarning: Setting element `.obsm['bulk_rna_broad']` of view, initializing view as actual.
  base_line.obsm["bulk_rna_broad"] = base_line.obsm["bulk_rna_broad"][
# Log normalize the counts
sc.pp.log1p(base_line)
# Correlate the pseudobulks with the bulk RNA-seq data
# corr and pvals: correlation and p-value df for the overlapping cell lines
# unmatched_cl_orr and unmatched_cl_pvals: correlation and p-value df for cell lines that are only present in McFarland dataset
corr, pvals, unmatched_cl_corr, unmatched_cl_pvals = cl_metadata.correlate(
    base_line, identifier="DepMap_ID", metadata_key="bulk_rna_broad"
)
Column name of metadata is not the same as the index of adata.var. Ensure that the genes are in the same order.
# Visualize the correlation of cell lines by scatter plot
cl_metadata.plot_correlation(
    base_line,
    corr=corr,
    pval=pvals,
    identifier="DepMap_ID",
    metadata_key="bulk_rna_broad",
    subset_identifier=0,  # gain insights of individual cell lines instead of the global statistics
    # 0 means the first id is chosen
)
../../_images/2486512925ee093d90cf5930c34ce51a68e70d5aea59a9d7d955db22be1587ac.png

We observe a high correlation between the baseline gene expression in the McFarland dataset and the bulk RNA-seq expression data from the Broad Institute, suggesting good concordance between these two sources. We can also visualize the correlation matrix via heatmap.

# Generally a cell line is most correlated with itself
sns.set(font_scale=1.2)
sns.heatmap(
    corr,
    cmap="coolwarm",
    annot=False,
    linewidths=0.7,
    xticklabels=False,
    yticklabels=False,
)
plt.title("Pearson Correlation")
plt.xlabel("Broad")
plt.ylabel("Baseline")
plt.show()
../../_images/4c4114fc258d4f26373ae9a7cf3fb6db5eddc9484c661bc06bdfc2d71b0e5224.png
# Subset to first 20 cell lines
sns.heatmap(corr.iloc[:20, :20], cmap="coolwarm", annot=False, linewidths=0.7, yticklabels=True)
plt.title("Pearson Correlation")
plt.xlabel("Broad")
plt.ylabel("Baseline")
plt.show()
../../_images/7d278964b0baebf081e32a7635e984f568b814bfb4a96d7cf48e4c5eef22ca55.png

The cell line ACH-000047 is the only cell line here that is not listed in the Broad database. Among all evaluated cell lines, it demonstrates the highest baseline gene expression correlation with the cell line ACH-000842 from the database.

unmatched_cl_max_corr = pd.concat(
    # find the highest correlation and the corresponding cell line
    [unmatched_cl_corr.idxmax(axis=1), unmatched_cl_corr.max(axis=1)],
    # create a dataframe with one column for the cell line identifier, and one column for the correlation value
    axis=1,
    keys=["cell_line", "correlation"],
)
unmatched_cl_max_corr
cell_line correlation
DepMap_ID
ACH-000047 ACH-000900 0.876248
# Compare the tissue type of two cell lines.
base_line[base_line.obs["DepMap_ID"].isin(["ACH-000047", "ACH-000842"])].obs[["singlet_ID", "disease"]]
singlet_ID disease
GCIY GCIY_STOMACH gastric cancer
SW 480 SW480_LARGE_INTESTINE colon/colorectal cancer