Note

This page was generated from perturbation_space.ipynb. Some tutorial content may look better in light mode.

Perturbation Space

pertpy delineates between two fundamental domains to embed and analyze data: the “cell space” and the “perturbation space”. In this paradigm, the cell space represents configurations where discrete data points represent individual cells. Conversely, the perturbation space departs from the individualistic perspective of cells and instead organizes cells into cohesive ensembles. This specialized space enables comprehending the collective impact of perturbations on cells.

pertpy offers several distinct ways of determining the perturbation space that will be introduced in this tutorial. We differentiate between perturbation spaces (where we create one data point for all cells of one perturbation) and cluster spaces (where we cluster all cells and then test how well the clustering overlaps with the perturbations).

We will be working with the pre-processed Norman dataset, which encompasses a pooled CRISPR screening experiment comparing the transcriptional effects of overexpressing genes alone or in combination.

Setup

[1]:
import os

os.environ["KMP_WARNINGS"] = "off"
import warnings

warnings.filterwarnings("ignore")
[3]:
import pertpy as pt
import scanpy as sc
[6]:
adata = pt.dt.norman_2019()
adata
[6]:
AnnData object with n_obs × n_vars = 111255 × 19018
    obs: 'guide_identity', 'read_count', 'UMI_count', 'coverage', 'gemgroup', 'good_coverage', 'number_of_cells', 'guide_AHR', 'guide_ARID1A', 'guide_ARRDC3', 'guide_ATL1', 'guide_BAK1', 'guide_BCL2L11', 'guide_BCORL1', 'guide_BPGM', 'guide_C19orf26', 'guide_C3orf72', 'guide_CBFA2T3', 'guide_CBL', 'guide_CDKN1A', 'guide_CDKN1B', 'guide_CDKN1C', 'guide_CEBPA', 'guide_CEBPB', 'guide_CEBPE', 'guide_CELF2', 'guide_CITED1', 'guide_CKS1B', 'guide_CLDN6', 'guide_CNN1', 'guide_CNNM4', 'guide_COL1A1', 'guide_COL2A1', 'guide_CSRNP1', 'guide_DLX2', 'guide_DUSP9', 'guide_EGR1', 'guide_ELMSAN1', 'guide_ETS2', 'guide_FEV', 'guide_FOSB', 'guide_FOXA1', 'guide_FOXA3', 'guide_FOXF1', 'guide_FOXL2', 'guide_FOXO4', 'guide_GLB1L2', 'guide_HES7', 'guide_HK2', 'guide_HNF4A', 'guide_HOXA13', 'guide_HOXB9', 'guide_HOXC13', 'guide_IER5L', 'guide_IGDCC3', 'guide_IKZF3', 'guide_IRF1', 'guide_ISL2', 'guide_JUN', 'guide_KIAA1804', 'guide_KIF18B', 'guide_KIF2C', 'guide_KLF1', 'guide_KMT2A', 'guide_LHX1', 'guide_LYL1', 'guide_MAML2', 'guide_MAP2K3', 'guide_MAP2K6', 'guide_MAP4K3', 'guide_MAP4K5', 'guide_MAP7D1', 'guide_MAPK1', 'guide_MEIS1', 'guide_MIDN', 'guide_NCL', 'guide_NIT1', 'guide_OSR2', 'guide_PLK4', 'guide_POU3F2', 'guide_PRDM1', 'guide_PRTG', 'guide_PTPN1', 'guide_PTPN12', 'guide_PTPN13', 'guide_PTPN9', 'guide_RHOXF2', 'guide_RREB1', 'guide_RUNX1T1', 'guide_S1PR2', 'guide_SAMD1', 'guide_SET', 'guide_SGK1', 'guide_SLC38A2', 'guide_SLC4A1', 'guide_SLC6A9', 'guide_SNAI1', 'guide_SPI1', 'guide_STIL', 'guide_TBX2', 'guide_TBX3', 'guide_TGFBR2', 'guide_TMSB4X', 'guide_TP73', 'guide_TSC22D1', 'guide_UBASH3A', 'guide_UBASH3B', 'guide_ZBTB1', 'guide_ZBTB10', 'guide_ZBTB25', 'guide_ZC3HAV1', 'guide_ZNF318', 'guide_ids', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'perturbation_name', 'perturbation_type', 'perturbation_value', 'perturbation_unit'
    var: 'index', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'doi', 'hvg', 'leiden', 'neighbors', 'pca', 'preprocessing_nb_link', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'
[7]:
adata.obs.head()
[7]:
guide_identity read_count UMI_count coverage gemgroup good_coverage number_of_cells guide_AHR guide_ARID1A guide_ARRDC3 ... n_genes n_genes_by_counts total_counts total_counts_mt pct_counts_mt leiden perturbation_name perturbation_type perturbation_value perturbation_unit
index
AAACCTGAGAAGAAGC-1 NegCtrl0_NegCtrl0__NegCtrl0_NegCtrl0 1252 67 18.686567 1 True 2 0 0 0 ... 4108 4108 19413.0 1327.0 6.835625 10 control genetic NaN NaN
AAACCTGAGGCATGTG-1 TSC22D1_NegCtrl0__TSC22D1_NegCtrl0 2151 104 20.682692 1 True 1 0 0 0 ... 3142 3142 13474.0 962.0 7.139676 3 TSC22D1 genetic NaN NaN
AAACCTGAGGCCCTTG-1 KLF1_MAP2K6__KLF1_MAP2K6 1037 59 17.576271 1 True 1 0 0 0 ... 4229 4229 23228.0 1548.0 6.664371 7 KLF1+MAP2K6 genetic NaN NaN
AAACCTGCACGAAGCA-1 NegCtrl10_NegCtrl0__NegCtrl10_NegCtrl0 958 39 24.564103 1 True 1 0 0 0 ... 2114 2114 6842.0 523.0 7.643963 2 control genetic NaN NaN
AAACCTGCAGACGTAG-1 CEBPE_RUNX1T1__CEBPE_RUNX1T1 244 14 17.428571 1 True 1 0 0 0 ... 2753 2753 9130.0 893.0 9.780942 10 CEBPE+RUNX1T1 genetic NaN NaN

5 rows × 123 columns

The Norman dataset has annotations for gene programmes for many perturbations, which we will use later to evaluate the perturbation spaces. These gene programmes can be obtained from the Norman paper:

[8]:
G1_CYCLE = [
    "CDKN1A",
    {"CDKN1B", "CDKN1A"},
    "CDKN1B",
    {"CDKN1C", "CDKN1A"},
    {"CDKN1C", "CDKN1B"},
    "CDKN1C",
]

ERYTHROID = [
    {"CBL", "CNN1"},
    {"CBL", "PTPN12"},
    {"CBL", "PTPN9"},
    {"CBL", "UBASH3B"},
    {"SAMD1", "PTPN12"},
    {"SAMD1", "UBASH3B"},
    {"UBASH3B", "CNN1"},
    {"UBASH3B", "PTPN12"},
    {"UBASH3B", "PTPN9"},
    {"UBASH3B", "UBASH3A"},
    {"UBASH3B", "ZBTB25"},
    {"BPGM", "SAMD1"},
    "PTPN1",
    {"PTPN12", "PTPN9"},
    {"PTPN12", "UBASH3A"},
    {"PTPN12", "ZBTB25"},
    {"UBASH3A", "CNN1"},
]

PIONEER_FACTORS = [
    {"FOXA1", "FOXF1"},
    {"FOXA1", "FOXL2"},
    {"FOXA1", "HOXB9"},
    {"FOXA3", "FOXA1"},
    {"FOXA3", "FOXF1"},
    {"FOXA3", "FOXL2"},
    {"FOXA3", "HOXB9"},
    "FOXA3",
    {"FOXF1", "FOXL2"},
    {"FOXF1", "HOXB9"},
    {"FOXL2", "MEIS1"},
    "HOXA13",
    "HOXC13",
    {"POU3F2", "FOXL2"},
    "TP73",
    "MIDN",
    {"LYL1", "IER5L"},
    "HOXC13",
    {"DUSP9", "SNAI1"},
    {"ZBTB10", "SNAI1"},
]

GRANULOCYTE_APOPTOSIS = [
    "SPI1",
    "CEBPA",
    {"CEBPB", "CEBPA"},
    "CEBPB",
    {"CEBPE", "CEBPA"},
    {"CEBPE", "CEBPB"},
    {"CEBPE", "RUNX1T1"},
    {"CEBPE", "SPI1"},
    "CEBPE",
    {"ETS2", "CEBPE"},
    {"KLF1", "CEBPA"},
    {"FOSB", "CEBPB"},
    {"FOSB", "CEBPE"},
    {"ZC3HAV1", "CEBPA"},
    {"JUN", "CEBPA"},
]

PRO_GROWTH = [
    {"CEBPE", "KLF1"},
    "KLF1",
    {"KLF1", "BAK1"},
    {"KLF1", "MAP2K6"},
    {"KLF1", "TGFBR2"},
    "ELMSAN1",
    {"MAP2K3", "SLC38A2"},
    {"MAP2K3", "ELMSAN1"},
    "MAP2K3",
    {"MAP2K3", "MAP2K6"},
    {"MAP2K6", "ELMSAN1"},
    "MAP2K6",
    {"MAP2K6", "KLF1"},
]

MEGAKARYOCYTE = [
    {"MAPK1", "TGFBR2"},
    "MAPK1",
    {"ETS2", "MAPK1"},
    "ETS2",
    {"CEBPB", "MAPK1"},
]

programmes = {
    "G1 cell cycle": G1_CYCLE,
    "Erythroid": ERYTHROID,
    "Pioneer factors": PIONEER_FACTORS,
    "Granulocyte apoptosis": GRANULOCYTE_APOPTOSIS,
    "Pro-growth": PRO_GROWTH,
    "Megakaryocyte": MEGAKARYOCYTE,
}

Now we will save the gene programmes in our AnnData object as an additional ‘.obs’ columns:

[9]:
gene_programme = []

for target_pert in adata.obs["perturbation_name"]:
    if target_pert == "control":
        gene_programme.append("Control")
        continue

    found_programme = False
    for programme, pert_list in programmes.items():
        for pert in pert_list:
            if (isinstance(pert, set) and pert == set(target_pert.split("+"))) or (target_pert == pert):
                gene_programme.append(programme)
                found_programme = True
                break

    if not found_programme:
        gene_programme.append("Unknown")

adata.obs["gene_programme"] = gene_programme

We will work only with the perturbations that have a defined gene programme, since we want to evaluate the perturbation spaces afterward.

[10]:
adata = adata[adata.obs["gene_programme"] != "Unknown"]

Embed data in Perturbation Spaces

When embedding data in the perturbation space, we aim to generate on datapoint from all cells of one perturbation. In this tutorial, we introduce three different ways to do so:

  • Pseudobulk Space

  • Discriminator Classifier

  • Centroid Space

Pseudobulk Space

The Pseudobulk space returns an Anndata object in which each observation corresponds to the pseudobulk expression of all cells of the respective perturbation.

[8]:
ps = pt.tl.PseudobulkSpace()
psadata = ps.compute(
    adata,
    target_col="perturbation_name",
    mode="mean",
    min_cells=0,
    min_counts=0,
)
[9]:
psadata
[9]:
AnnData object with n_obs × n_vars = 75 × 19017
    obs: 'guide_AHR', 'guide_ARID1A', 'guide_ARRDC3', 'guide_ATL1', 'guide_BAK1', 'guide_BCL2L11', 'guide_BCORL1', 'guide_BPGM', 'guide_C19orf26', 'guide_C3orf72', 'guide_CBFA2T3', 'guide_CBL', 'guide_CDKN1A', 'guide_CDKN1B', 'guide_CDKN1C', 'guide_CEBPA', 'guide_CEBPB', 'guide_CEBPE', 'guide_CELF2', 'guide_CITED1', 'guide_CKS1B', 'guide_CLDN6', 'guide_CNN1', 'guide_CNNM4', 'guide_COL1A1', 'guide_COL2A1', 'guide_CSRNP1', 'guide_DLX2', 'guide_DUSP9', 'guide_EGR1', 'guide_ELMSAN1', 'guide_ETS2', 'guide_FEV', 'guide_FOSB', 'guide_FOXA1', 'guide_FOXA3', 'guide_FOXF1', 'guide_FOXL2', 'guide_FOXO4', 'guide_GLB1L2', 'guide_HES7', 'guide_HK2', 'guide_HNF4A', 'guide_HOXA13', 'guide_HOXB9', 'guide_HOXC13', 'guide_IER5L', 'guide_IGDCC3', 'guide_IKZF3', 'guide_IRF1', 'guide_ISL2', 'guide_JUN', 'guide_KIAA1804', 'guide_KIF18B', 'guide_KIF2C', 'guide_KLF1', 'guide_KMT2A', 'guide_LHX1', 'guide_LYL1', 'guide_MAML2', 'guide_MAP2K3', 'guide_MAP2K6', 'guide_MAP4K3', 'guide_MAP4K5', 'guide_MAP7D1', 'guide_MAPK1', 'guide_MEIS1', 'guide_MIDN', 'guide_NCL', 'guide_NIT1', 'guide_OSR2', 'guide_PLK4', 'guide_POU3F2', 'guide_PRDM1', 'guide_PRTG', 'guide_PTPN1', 'guide_PTPN12', 'guide_PTPN13', 'guide_PTPN9', 'guide_RHOXF2', 'guide_RREB1', 'guide_RUNX1T1', 'guide_S1PR2', 'guide_SAMD1', 'guide_SET', 'guide_SGK1', 'guide_SLC38A2', 'guide_SLC4A1', 'guide_SLC6A9', 'guide_SNAI1', 'guide_SPI1', 'guide_STIL', 'guide_TBX2', 'guide_TBX3', 'guide_TGFBR2', 'guide_TMSB4X', 'guide_TP73', 'guide_TSC22D1', 'guide_UBASH3A', 'guide_UBASH3B', 'guide_ZBTB1', 'guide_ZBTB10', 'guide_ZBTB25', 'guide_ZC3HAV1', 'guide_ZNF318', 'guide_ids', 'perturbation_name', 'perturbation_type', 'perturbation_value', 'perturbation_unit', 'gene_programme', 'psbulk_n_cells', 'psbulk_counts'
    var: 'index', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    layers: 'psbulk_props'
[10]:
psadata.obs.head()
[10]:
guide_AHR guide_ARID1A guide_ARRDC3 guide_ATL1 guide_BAK1 guide_BCL2L11 guide_BCORL1 guide_BPGM guide_C19orf26 guide_C3orf72 ... guide_ZC3HAV1 guide_ZNF318 guide_ids perturbation_name perturbation_type perturbation_value perturbation_unit gene_programme psbulk_n_cells psbulk_counts
BAK1+KLF1 0 0 0 0 1 0 0 0 0 0 ... 0 0 BAK1,KLF1 BAK1+KLF1 genetic NaN NaN Pro-growth 392.0 1.186428e+06
BPGM+SAMD1 0 0 0 0 0 0 0 1 0 0 ... 0 0 BPGM,SAMD1 BPGM+SAMD1 genetic NaN NaN Erythroid 300.0 8.491724e+05
CBL+CNN1 0 0 0 0 0 0 0 0 0 0 ... 0 0 CBL,CNN1 CBL+CNN1 genetic NaN NaN Erythroid 348.0 9.622422e+05
CBL+PTPN12 0 0 0 0 0 0 0 0 0 0 ... 0 0 CBL,PTPN12 CBL+PTPN12 genetic NaN NaN Erythroid 333.0 9.529553e+05
CBL+PTPN9 0 0 0 0 0 0 0 0 0 0 ... 0 0 CBL,PTPN9 CBL+PTPN9 genetic NaN NaN Erythroid 305.0 8.501961e+05

5 rows × 113 columns

In the generated AnnData object, each observation represents a perturbation, and its expression is the mode of the PseudobulkSpace function.

Now, the perturbation space can be visualized, and various operations can be applied to analyze it.

[11]:
sc.pp.neighbors(psadata)
sc.tl.umap(psadata)
sc.pl.umap(psadata, color="perturbation_name")
WARNING: You’re trying to run this on 19017 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
../../_images/tutorials_notebooks_perturbation_space_19_1.png

Based on the UMAP visualization of the perturbation space above, we see that the individual perturbations are grouped into clusters. Next, we want to check if the clusters correspond to the gene programmes that we have defined for the individual perturbations above.

[12]:
sc.pl.umap(psadata, color="gene_programme")
../../_images/tutorials_notebooks_perturbation_space_21_0.png

When coloring the individual perturbations according to their gene programs, we observe that the clusters align with the gene programs. This indicates that the perturbation space nicely captures the effects of the perturbations on the cells. Furthermore, when looking at the control perturbation (in the UMAP plot above depicted in blue), we see that it is situated on the periphery of the ‘pro-growth’ cluster, suggesting that the perturbations in this cluster might have only a small effect on the cells.

Another option is to calculate the difference between the perturbations and the control, using the compute_control_diff method. Afterwards, we can again calculate and visualize the perturbation space:

[15]:
sc.pp.neighbors(adata)
sc.tl.pca(adata)
[16]:
diff_adata = ps.compute_control_diff(
    adata,
    target_col="perturbation_name",
    reference_key="control",
    embedding_key="X_pca",
)
[17]:
diff_psadata = ps.compute(
    diff_adata,
    target_col="perturbation_name",
    mode="mean",
    min_cells=0,
    min_counts=0,
    embedding_key="X_pca",
    skip_checks=True,
)
[18]:
sc.pp.neighbors(diff_psadata)
sc.tl.umap(diff_psadata)
sc.pl.umap(diff_psadata, color="gene_programme")
../../_images/tutorials_notebooks_perturbation_space_27_0.png

As the individual perturbations already clustered nicely before the control difference was calculated, we do not see a big difference in the UMAP embedding in terms of clustering. We can observe that the perturbations with the same gene programme cluster a bit more densely than before. We also note one “Granulocyte apoptosis” perturbation that is situated in the “Pioneer factors” cluster, which could be worth investigating further.

MLP Classifier Space

The multilayer perceptron (MLP) classifier embedding method trains a neural network based classifier to predict which perturbation has been applied to each cell. Once the training has finished, we obtain the representations of the last layer as embedding of the perturbations.

[9]:
ps = pt.tl.MLPClassifierSpace()

Next, we will create and train the model using the compute method. The method accepts different hyperparameters related with the architecture of the model such as hidden_dim, dropout, batch_norm, etc. Training hyperparameters such as batch_size, test_split_size, validation_split_size can also be changed. The compute method trains the model, using GPU if available, and returns the embeddings of the data. The embeddings are the representation of each cell in the last layer of the neural network, which has a size of 256, as we define in the parameter hidden_dim. Hence, the embeddings represent a lower dimensional representation of the data.

[11]:
cell_embeddings = ps.compute(
    adata,
    target_col="perturbation_name",
    hidden_dim=[512, 256],
    dropout=0.05,
    batch_size=64,
    batch_norm=True,
    max_epochs=20,
)
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name | Type | Params
------------------------------
0 | net  | MLP  | 9.9 M
------------------------------
9.9 M     Trainable params
0         Non-trainable params
9.9 M     Total params
39.553    Total estimated model params size (MB)
`Trainer.fit` stopped: `max_epochs=20` reached.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric               DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         test_loss              2.326035976409912     │
└───────────────────────────┴───────────────────────────┘
[14]:
cell_embeddings.shape
[14]:
(40763, 256)

The cell_embeddings object is of size (n_cells, 256). We can further analyze and visualize this space, by applying pseudobulk or any other transformation to obtain one datapoint per perturbation (an AnnData object with shape n_perturbations x 256).

[15]:
ps = pt.tl.PseudobulkSpace()
psadata = ps.compute(
    cell_embeddings,
    target_col="perturbations",
    mode="mean",
    min_cells=0,
    min_counts=0,
)
[16]:
psadata
[16]:
AnnData object with n_obs × n_vars = 75 × 256
    obs: 'perturbations', 'guide_AHR', 'guide_ARID1A', 'guide_ARRDC3', 'guide_ATL1', 'guide_BAK1', 'guide_BCL2L11', 'guide_BCORL1', 'guide_BPGM', 'guide_C19orf26', 'guide_C3orf72', 'guide_CBFA2T3', 'guide_CBL', 'guide_CDKN1A', 'guide_CDKN1B', 'guide_CDKN1C', 'guide_CEBPA', 'guide_CEBPB', 'guide_CEBPE', 'guide_CELF2', 'guide_CITED1', 'guide_CKS1B', 'guide_CLDN6', 'guide_CNN1', 'guide_CNNM4', 'guide_COL1A1', 'guide_COL2A1', 'guide_CSRNP1', 'guide_DLX2', 'guide_DUSP9', 'guide_EGR1', 'guide_ELMSAN1', 'guide_ETS2', 'guide_FEV', 'guide_FOSB', 'guide_FOXA1', 'guide_FOXA3', 'guide_FOXF1', 'guide_FOXL2', 'guide_FOXO4', 'guide_GLB1L2', 'guide_HES7', 'guide_HK2', 'guide_HNF4A', 'guide_HOXA13', 'guide_HOXB9', 'guide_HOXC13', 'guide_IER5L', 'guide_IGDCC3', 'guide_IKZF3', 'guide_IRF1', 'guide_ISL2', 'guide_JUN', 'guide_KIAA1804', 'guide_KIF18B', 'guide_KIF2C', 'guide_KLF1', 'guide_KMT2A', 'guide_LHX1', 'guide_LYL1', 'guide_MAML2', 'guide_MAP2K3', 'guide_MAP2K6', 'guide_MAP4K3', 'guide_MAP4K5', 'guide_MAP7D1', 'guide_MAPK1', 'guide_MEIS1', 'guide_MIDN', 'guide_NCL', 'guide_NIT1', 'guide_OSR2', 'guide_PLK4', 'guide_POU3F2', 'guide_PRDM1', 'guide_PRTG', 'guide_PTPN1', 'guide_PTPN12', 'guide_PTPN13', 'guide_PTPN9', 'guide_RHOXF2', 'guide_RREB1', 'guide_RUNX1T1', 'guide_S1PR2', 'guide_SAMD1', 'guide_SET', 'guide_SGK1', 'guide_SLC38A2', 'guide_SLC4A1', 'guide_SLC6A9', 'guide_SNAI1', 'guide_SPI1', 'guide_STIL', 'guide_TBX2', 'guide_TBX3', 'guide_TGFBR2', 'guide_TMSB4X', 'guide_TP73', 'guide_TSC22D1', 'guide_UBASH3A', 'guide_UBASH3B', 'guide_ZBTB1', 'guide_ZBTB10', 'guide_ZBTB25', 'guide_ZC3HAV1', 'guide_ZNF318', 'guide_ids', 'perturbation_name', 'perturbation_type', 'perturbation_value', 'perturbation_unit', 'gene_programme', 'psbulk_n_cells', 'psbulk_counts'
    layers: 'psbulk_props'
[18]:
sc.pp.neighbors(psadata, use_rep="X")
sc.tl.umap(psadata)
sc.pl.umap(psadata, color="perturbations")
../../_images/tutorials_notebooks_perturbation_space_38_0.png
[19]:
sc.pl.umap(psadata, color="gene_programme")
../../_images/tutorials_notebooks_perturbation_space_39_0.png

As we see based on the UMAP plot above, the perturbations are grouped into clusters that correspond to the gene programmes. Importantly, the classifier was not trained to separate the gene programmes, but the individual perturbations. Hence, the fact that the gene programmes are nicely separated in the embedding space indicates that the MLP Classifier works well, as the embeddings capture the effects of the perturbations on the cells.

Logistic Regression Classifier Space

As an alternative to the MLP classifier, we can use a logistic regression classifier to embed the data in the perturbation space. The logistic regression classifier fits one classifier per perturbation and uses the coefficients of the classifier as the embedding for the respective perturbation. Consequently, we obtain one embedding per perturbation, so that we don’t need to calculate a pseudobulk space afterwards, as we did for the MLP classifier. We fit the classifier on PCA data, which was already calculated above.

[11]:
ps = pt.tl.LRClassifierSpace()
psadata = ps.compute(adata, embedding_key="X_pca", target_col="perturbation_name")
psadata
[11]:
AnnData object with n_obs × n_vars = 75 × 50
    obs: 'perturbations', 'classifier_score', 'guide_AHR', 'guide_ARID1A', 'guide_ARRDC3', 'guide_ATL1', 'guide_BAK1', 'guide_BCL2L11', 'guide_BCORL1', 'guide_BPGM', 'guide_C19orf26', 'guide_C3orf72', 'guide_CBFA2T3', 'guide_CBL', 'guide_CDKN1A', 'guide_CDKN1B', 'guide_CDKN1C', 'guide_CEBPA', 'guide_CEBPB', 'guide_CEBPE', 'guide_CELF2', 'guide_CITED1', 'guide_CKS1B', 'guide_CLDN6', 'guide_CNN1', 'guide_CNNM4', 'guide_COL1A1', 'guide_COL2A1', 'guide_CSRNP1', 'guide_DLX2', 'guide_DUSP9', 'guide_EGR1', 'guide_ELMSAN1', 'guide_ETS2', 'guide_FEV', 'guide_FOSB', 'guide_FOXA1', 'guide_FOXA3', 'guide_FOXF1', 'guide_FOXL2', 'guide_FOXO4', 'guide_GLB1L2', 'guide_HES7', 'guide_HK2', 'guide_HNF4A', 'guide_HOXA13', 'guide_HOXB9', 'guide_HOXC13', 'guide_IER5L', 'guide_IGDCC3', 'guide_IKZF3', 'guide_IRF1', 'guide_ISL2', 'guide_JUN', 'guide_KIAA1804', 'guide_KIF18B', 'guide_KIF2C', 'guide_KLF1', 'guide_KMT2A', 'guide_LHX1', 'guide_LYL1', 'guide_MAML2', 'guide_MAP2K3', 'guide_MAP2K6', 'guide_MAP4K3', 'guide_MAP4K5', 'guide_MAP7D1', 'guide_MAPK1', 'guide_MEIS1', 'guide_MIDN', 'guide_NCL', 'guide_NIT1', 'guide_OSR2', 'guide_PLK4', 'guide_POU3F2', 'guide_PRDM1', 'guide_PRTG', 'guide_PTPN1', 'guide_PTPN12', 'guide_PTPN13', 'guide_PTPN9', 'guide_RHOXF2', 'guide_RREB1', 'guide_RUNX1T1', 'guide_S1PR2', 'guide_SAMD1', 'guide_SET', 'guide_SGK1', 'guide_SLC38A2', 'guide_SLC4A1', 'guide_SLC6A9', 'guide_SNAI1', 'guide_SPI1', 'guide_STIL', 'guide_TBX2', 'guide_TBX3', 'guide_TGFBR2', 'guide_TMSB4X', 'guide_TP73', 'guide_TSC22D1', 'guide_UBASH3A', 'guide_UBASH3B', 'guide_ZBTB1', 'guide_ZBTB10', 'guide_ZBTB25', 'guide_ZC3HAV1', 'guide_ZNF318', 'guide_ids', 'perturbation_type', 'gene_programme'
[13]:
sc.pp.neighbors(psadata, use_rep='X')
sc.tl.umap(psadata)
sc.pl.umap(psadata, color=['gene_programme'])
../../_images/tutorials_notebooks_perturbation_space_44_0.png

Similar to the MLP classifier, the logistic regression classifier also captures the effects of the perturbations on the cells, as the perturbations cluster according to the gene programmes.

Centroid Space

The Centroid Space computes the centroids per perturbation of a pre-computed embedding. Hence, we first need to compute an embedding for the original dataset; here we use UMAP. Note that in this particular dataset, the UMAP is already computed. Nevertheless, we will compute it again to show how to use the Centroid Space in a general setting.

[15]:
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
[16]:
sc.pl.umap(adata, color="perturbation_name")
../../_images/tutorials_notebooks_perturbation_space_49_0.png
[17]:
ps = pt.tl.CentroidSpace()
psadata = ps.compute(adata, target_col="perturbation_name", embedding_key="X_umap")
[18]:
psadata
[18]:
AnnData object with n_obs × n_vars = 75 × 2
    obs: 'perturbation_name', 'guide_AHR', 'guide_ARID1A', 'guide_ARRDC3', 'guide_ATL1', 'guide_BAK1', 'guide_BCL2L11', 'guide_BCORL1', 'guide_BPGM', 'guide_C19orf26', 'guide_C3orf72', 'guide_CBFA2T3', 'guide_CBL', 'guide_CDKN1A', 'guide_CDKN1B', 'guide_CDKN1C', 'guide_CEBPA', 'guide_CEBPB', 'guide_CEBPE', 'guide_CELF2', 'guide_CITED1', 'guide_CKS1B', 'guide_CLDN6', 'guide_CNN1', 'guide_CNNM4', 'guide_COL1A1', 'guide_COL2A1', 'guide_CSRNP1', 'guide_DLX2', 'guide_DUSP9', 'guide_EGR1', 'guide_ELMSAN1', 'guide_ETS2', 'guide_FEV', 'guide_FOSB', 'guide_FOXA1', 'guide_FOXA3', 'guide_FOXF1', 'guide_FOXL2', 'guide_FOXO4', 'guide_GLB1L2', 'guide_HES7', 'guide_HK2', 'guide_HNF4A', 'guide_HOXA13', 'guide_HOXB9', 'guide_HOXC13', 'guide_IER5L', 'guide_IGDCC3', 'guide_IKZF3', 'guide_IRF1', 'guide_ISL2', 'guide_JUN', 'guide_KIAA1804', 'guide_KIF18B', 'guide_KIF2C', 'guide_KLF1', 'guide_KMT2A', 'guide_LHX1', 'guide_LYL1', 'guide_MAML2', 'guide_MAP2K3', 'guide_MAP2K6', 'guide_MAP4K3', 'guide_MAP4K5', 'guide_MAP7D1', 'guide_MAPK1', 'guide_MEIS1', 'guide_MIDN', 'guide_NCL', 'guide_NIT1', 'guide_OSR2', 'guide_PLK4', 'guide_POU3F2', 'guide_PRDM1', 'guide_PRTG', 'guide_PTPN1', 'guide_PTPN12', 'guide_PTPN13', 'guide_PTPN9', 'guide_RHOXF2', 'guide_RREB1', 'guide_RUNX1T1', 'guide_S1PR2', 'guide_SAMD1', 'guide_SET', 'guide_SGK1', 'guide_SLC38A2', 'guide_SLC4A1', 'guide_SLC6A9', 'guide_SNAI1', 'guide_SPI1', 'guide_STIL', 'guide_TBX2', 'guide_TBX3', 'guide_TGFBR2', 'guide_TMSB4X', 'guide_TP73', 'guide_TSC22D1', 'guide_UBASH3A', 'guide_UBASH3B', 'guide_ZBTB1', 'guide_ZBTB10', 'guide_ZBTB25', 'guide_ZC3HAV1', 'guide_ZNF318', 'guide_ids', 'perturbation_type', 'gene_programme'
    obsm: 'X_umap'
[19]:
sc.pl.umap(psadata, color=["perturbation_name", "gene_programme"], legend_loc="on data")
../../_images/tutorials_notebooks_perturbation_space_52_0.png

Based on the UMAP plots above, we see that also the Centroid Space is capable of capturing the effects of the perturbations on the cells. The clusters align with the gene programmes, and the control perturbation is situated on the periphery of the ‘Pro-growth’ cluster.

Cluster Spaces

DBScan Space

DBSCAN clusters the given data using a density-based algorithm.

You can cluster the data based on the full expression profile stored in .X or on a data representation with reduced dimensionality as specified by the embedding_key parameter. Be aware that computing the embedding on the .X matrix can be very time-consuming for large datasets. In this tutorial, we will use the UMAP embedding, which was already calculated above but will be calculated again in the next cell for the purpose of completeness.

[25]:
sc.tl.pca(adata, n_comps=15)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
[26]:
ps = pt.tl.DBSCANSpace()
[27]:
dbscan_psadata = ps.compute(adata, min_samples=50, eps=0.25, copy=True, embedding_key="X_umap", n_jobs=2)
[28]:
sc.pl.umap(dbscan_psadata, color="dbscan")
../../_images/tutorials_notebooks_perturbation_space_59_0.png

The UMAP plot above shows that several clusters have been identified based on density-based clustering. Of note, one could achieve a finer clustering using the whole data (adata.X) or a different embedding, which we omit here for the sake of simplicity.

Next, we want to evaluate if the identified clusters correspond to the gene programmes that we have defined for the individual perturbations above. Note that the true_label_col specifying the ground truth annotation in the evaluate_clustering method could also be something different from gene programs, or even the perturbation itself. As evaluation metrics, we will calculate the NMI and ARI between the clusters and the gene programmes. Additionally, ASW can be calculated as well but, depending on the size of the dataset, it can take longer to compute.

[29]:
dbscan_results = ps.evaluate_clustering(
    dbscan_psadata,
    true_label_col="gene_programme",
    cluster_col="dbscan",
    metric="l1",
    metrics=["nmi", "ari"],
)
[30]:
dbscan_results
[30]:
{'nmi': 0.17569832707103, 'ari': 0.0918081441614236}

The NMI is a value between 0 and 1, where 1 would indicate that the computed clusters perfectly align with the ground truth clusters. Here, we have a value of approx. 0.18, which is comparably low and indicates that the DBSCAN clusters computed based on UMAP do not correspond well to the gene programmes. The ARI takes a value between -0.5 and 1, where 0.0 indicates that the clusters are randomly assigned. Here, we see an ARI of 0.09, again highlighting that the density-based clustering based on UMAP does not align well with the gene programmes. To improve the clustering, one could consider running DBSCAN on higher-dimensional data, e.g. by using a different embedding or no embedding at all.

K-Means Space

Analogous to the steps used for DBSCAN clustering, we will now use K-means algorithm to cluster the data. This time, we will use the PCA embedding, which we calculated above.

[33]:
ps = pt.tl.KMeansSpace()
kmeans_psadata = ps.compute(adata, n_clusters=7, copy=True, embedding_key="X_pca")
[34]:
sc.pl.umap(kmeans_psadata, color=["k-means", "gene_programme"])
../../_images/tutorials_notebooks_perturbation_space_67_0.png

Again, we will evaluate the calculated clusters using the NMI and ARI metrics:

[14]:
kmeans_results = ps.evaluate_clustering(
    kmeans_psadata,
    true_label_col="gene_programme",
    cluster_col="k-means",
    metric="l2",
    metrics=["nmi", "ari"],
)
[36]:
kmeans_results
[36]:
{'nmi': 0.22118522241390073, 'ari': 0.15420778229072798}

Compared to the NMI and ARI computed for the DBSCAN clusters, both metrics are higher when using K-means on a PCA embedding. This could be due to the fact that the PCs capture more information than the 2D UMAP embedding, the ability to specify the number of target clusters when running K-means, or simply because K-Means is better suited for this data. Again, one could test using another embedding or no embedding at all, as well as additional parameters to improve the clustering.

Conclusion

In this tutorial, we have introduced several methods to embed and work with data in the perturbation space. Whether first clustering the data or immediately embedding it in a perturbation space, the latter can be used to analyze the effects of perturbations on cells. One important application is the identification of perturbations with similar effects, similar to the gene programme annotation in this tutorial. Furthermore, we showed how to arithmetically combine perturbations to create new perturbations, which can be used to validate experimental data or predict the effects of novel perturbations.

References

  1. Norman, Thomas M et al. “Exploring genetic interaction manifolds constructed from rich single-cell phenotypes.” Science (New York, N.Y.) vol. 365,6455 (2019): 786-793. doi:10.1126/science.aax4438