Note

This page was generated from perturbation_space.ipynb. Some tutorial content may look better in light mode.

Perturbation Space¶

pertpy delineates between two fundamental domains to embed and analyze data: the “cell space” and the “perturbation space”. In this paradigm, the cell space represents configurations where discrete data points represent individual cells. Conversely, the perturbation space departs from the individualistic perspective of cells and instead organizes cells into cohesive ensembles. This specialized space enables comprehending the collective impact of perturbations on cells.

pertpy offers several distinct ways of determining the perturbation space that will be introduced in this tutorial. We differentiate between perturbation spaces (where we create one data point for all cells of one perturbation) and cluster spaces (where we cluster all cells and then test how well the clustering overlaps with the perturbations).

We will be working with the pre-processed Norman dataset, which encompasses a pooled CRISPR screening experiment comparing the transcriptional effects of overexpressing genes alone or in combination.

Setup¶

[1]:

import os

os.environ["KMP_WARNINGS"] = "off"
import warnings

warnings.filterwarnings("ignore")

[3]:

import pertpy as pt
import scanpy as sc

[6]:

adata = pt.dt.norman_2019()
adata

[6]:

AnnData object with n_obs × n_vars = 111255 × 19018
    obs: 'guide_identity', 'read_count', 'UMI_count', 'coverage', 'gemgroup', 'good_coverage', 'number_of_cells', 'guide_AHR', 'guide_ARID1A', 'guide_ARRDC3', 'guide_ATL1', 'guide_BAK1', 'guide_BCL2L11', 'guide_BCORL1', 'guide_BPGM', 'guide_C19orf26', 'guide_C3orf72', 'guide_CBFA2T3', 'guide_CBL', 'guide_CDKN1A', 'guide_CDKN1B', 'guide_CDKN1C', 'guide_CEBPA', 'guide_CEBPB', 'guide_CEBPE', 'guide_CELF2', 'guide_CITED1', 'guide_CKS1B', 'guide_CLDN6', 'guide_CNN1', 'guide_CNNM4', 'guide_COL1A1', 'guide_COL2A1', 'guide_CSRNP1', 'guide_DLX2', 'guide_DUSP9', 'guide_EGR1', 'guide_ELMSAN1', 'guide_ETS2', 'guide_FEV', 'guide_FOSB', 'guide_FOXA1', 'guide_FOXA3', 'guide_FOXF1', 'guide_FOXL2', 'guide_FOXO4', 'guide_GLB1L2', 'guide_HES7', 'guide_HK2', 'guide_HNF4A', 'guide_HOXA13', 'guide_HOXB9', 'guide_HOXC13', 'guide_IER5L', 'guide_IGDCC3', 'guide_IKZF3', 'guide_IRF1', 'guide_ISL2', 'guide_JUN', 'guide_KIAA1804', 'guide_KIF18B', 'guide_KIF2C', 'guide_KLF1', 'guide_KMT2A', 'guide_LHX1', 'guide_LYL1', 'guide_MAML2', 'guide_MAP2K3', 'guide_MAP2K6', 'guide_MAP4K3', 'guide_MAP4K5', 'guide_MAP7D1', 'guide_MAPK1', 'guide_MEIS1', 'guide_MIDN', 'guide_NCL', 'guide_NIT1', 'guide_OSR2', 'guide_PLK4', 'guide_POU3F2', 'guide_PRDM1', 'guide_PRTG', 'guide_PTPN1', 'guide_PTPN12', 'guide_PTPN13', 'guide_PTPN9', 'guide_RHOXF2', 'guide_RREB1', 'guide_RUNX1T1', 'guide_S1PR2', 'guide_SAMD1', 'guide_SET', 'guide_SGK1', 'guide_SLC38A2', 'guide_SLC4A1', 'guide_SLC6A9', 'guide_SNAI1', 'guide_SPI1', 'guide_STIL', 'guide_TBX2', 'guide_TBX3', 'guide_TGFBR2', 'guide_TMSB4X', 'guide_TP73', 'guide_TSC22D1', 'guide_UBASH3A', 'guide_UBASH3B', 'guide_ZBTB1', 'guide_ZBTB10', 'guide_ZBTB25', 'guide_ZC3HAV1', 'guide_ZNF318', 'guide_ids', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'perturbation_name', 'perturbation_type', 'perturbation_value', 'perturbation_unit'
    var: 'index', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'doi', 'hvg', 'leiden', 'neighbors', 'pca', 'preprocessing_nb_link', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'

[7]:

adata.obs.head()

[7]:

	guide_identity	read_count	UMI_count	coverage	gemgroup	good_coverage	number_of_cells	guide_AHR	guide_ARID1A	guide_ARRDC3	...	n_genes	n_genes_by_counts	total_counts	total_counts_mt	pct_counts_mt	leiden	perturbation_name	perturbation_type	perturbation_value	perturbation_unit
index
AAACCTGAGAAGAAGC-1	NegCtrl0_NegCtrl0__NegCtrl0_NegCtrl0	1252	67	18.686567	1	True	2	0	0	0	...	4108	4108	19413.0	1327.0	6.835625	10	control	genetic	NaN	NaN
AAACCTGAGGCATGTG-1	TSC22D1_NegCtrl0__TSC22D1_NegCtrl0	2151	104	20.682692	1	True	1	0	0	0	...	3142	3142	13474.0	962.0	7.139676	3	TSC22D1	genetic	NaN	NaN
AAACCTGAGGCCCTTG-1	KLF1_MAP2K6__KLF1_MAP2K6	1037	59	17.576271	1	True	1	0	0	0	...	4229	4229	23228.0	1548.0	6.664371	7	KLF1+MAP2K6	genetic	NaN	NaN
AAACCTGCACGAAGCA-1	NegCtrl10_NegCtrl0__NegCtrl10_NegCtrl0	958	39	24.564103	1	True	1	0	0	0	...	2114	2114	6842.0	523.0	7.643963	2	control	genetic	NaN	NaN
AAACCTGCAGACGTAG-1	CEBPE_RUNX1T1__CEBPE_RUNX1T1	244	14	17.428571	1	True	1	0	0	0	...	2753	2753	9130.0	893.0	9.780942	10	CEBPE+RUNX1T1	genetic	NaN	NaN

5 rows × 123 columns

The Norman dataset has annotations for gene programmes for many perturbations, which we will use later to evaluate the perturbation spaces. These gene programmes can be obtained from the Norman paper:

[8]:

G1_CYCLE = [
    "CDKN1A",
    {"CDKN1B", "CDKN1A"},
    "CDKN1B",
    {"CDKN1C", "CDKN1A"},
    {"CDKN1C", "CDKN1B"},
    "CDKN1C",
]

ERYTHROID = [
    {"CBL", "CNN1"},
    {"CBL", "PTPN12"},
    {"CBL", "PTPN9"},
    {"CBL", "UBASH3B"},
    {"SAMD1", "PTPN12"},
    {"SAMD1", "UBASH3B"},
    {"UBASH3B", "CNN1"},
    {"UBASH3B", "PTPN12"},
    {"UBASH3B", "PTPN9"},
    {"UBASH3B", "UBASH3A"},
    {"UBASH3B", "ZBTB25"},
    {"BPGM", "SAMD1"},
    "PTPN1",
    {"PTPN12", "PTPN9"},
    {"PTPN12", "UBASH3A"},
    {"PTPN12", "ZBTB25"},
    {"UBASH3A", "CNN1"},
]

PIONEER_FACTORS = [
    {"FOXA1", "FOXF1"},
    {"FOXA1", "FOXL2"},
    {"FOXA1", "HOXB9"},
    {"FOXA3", "FOXA1"},
    {"FOXA3", "FOXF1"},
    {"FOXA3", "FOXL2"},
    {"FOXA3", "HOXB9"},
    "FOXA3",
    {"FOXF1", "FOXL2"},
    {"FOXF1", "HOXB9"},
    {"FOXL2", "MEIS1"},
    "HOXA13",
    "HOXC13",
    {"POU3F2", "FOXL2"},
    "TP73",
    "MIDN",
    {"LYL1", "IER5L"},
    "HOXC13",
    {"DUSP9", "SNAI1"},
    {"ZBTB10", "SNAI1"},
]

GRANULOCYTE_APOPTOSIS = [
    "SPI1",
    "CEBPA",
    {"CEBPB", "CEBPA"},
    "CEBPB",
    {"CEBPE", "CEBPA"},
    {"CEBPE", "CEBPB"},
    {"CEBPE", "RUNX1T1"},
    {"CEBPE", "SPI1"},
    "CEBPE",
    {"ETS2", "CEBPE"},
    {"KLF1", "CEBPA"},
    {"FOSB", "CEBPB"},
    {"FOSB", "CEBPE"},
    {"ZC3HAV1", "CEBPA"},
    {"JUN", "CEBPA"},
]

PRO_GROWTH = [
    {"CEBPE", "KLF1"},
    "KLF1",
    {"KLF1", "BAK1"},
    {"KLF1", "MAP2K6"},
    {"KLF1", "TGFBR2"},
    "ELMSAN1",
    {"MAP2K3", "SLC38A2"},
    {"MAP2K3", "ELMSAN1"},
    "MAP2K3",
    {"MAP2K3", "MAP2K6"},
    {"MAP2K6", "ELMSAN1"},
    "MAP2K6",
    {"MAP2K6", "KLF1"},
]

MEGAKARYOCYTE = [
    {"MAPK1", "TGFBR2"},
    "MAPK1",
    {"ETS2", "MAPK1"},
    "ETS2",
    {"CEBPB", "MAPK1"},
]

programmes = {
    "G1 cell cycle": G1_CYCLE,
    "Erythroid": ERYTHROID,
    "Pioneer factors": PIONEER_FACTORS,
    "Granulocyte apoptosis": GRANULOCYTE_APOPTOSIS,
    "Pro-growth": PRO_GROWTH,
    "Megakaryocyte": MEGAKARYOCYTE,
}

Now we will save the gene programmes in our AnnData object as an additional ‘.obs’ columns:

[9]:

gene_programme = []

for target_pert in adata.obs["perturbation_name"]:
    if target_pert == "control":
        gene_programme.append("Control")
        continue

    found_programme = False
    for programme, pert_list in programmes.items():
        for pert in pert_list:
            if (isinstance(pert, set) and pert == set(target_pert.split("+"))) or (target_pert == pert):
                gene_programme.append(programme)
                found_programme = True
                break

    if not found_programme:
        gene_programme.append("Unknown")

adata.obs["gene_programme"] = gene_programme

We will work only with the perturbations that have a defined gene programme, since we want to evaluate the perturbation spaces afterward.

[10]:

adata = adata[adata.obs["gene_programme"] != "Unknown"]

Embed data in Perturbation Spaces¶

When embedding data in the perturbation space, we aim to generate on datapoint from all cells of one perturbation. In this tutorial, we introduce three different ways to do so:

Pseudobulk Space
Discriminator Classifier
Centroid Space

Pseudobulk Space¶

The Pseudobulk space returns an Anndata object in which each observation corresponds to the pseudobulk expression of all cells of the respective perturbation.

[8]:

ps = pt.tl.PseudobulkSpace()
psadata = ps.compute(
    adata,
    target_col="perturbation_name",
    mode="mean",
    min_cells=0,
    min_counts=0,
)

[9]:

psadata

[9]:

AnnData object with n_obs × n_vars = 75 × 19017
    obs: 'guide_AHR', 'guide_ARID1A', 'guide_ARRDC3', 'guide_ATL1', 'guide_BAK1', 'guide_BCL2L11', 'guide_BCORL1', 'guide_BPGM', 'guide_C19orf26', 'guide_C3orf72', 'guide_CBFA2T3', 'guide_CBL', 'guide_CDKN1A', 'guide_CDKN1B', 'guide_CDKN1C', 'guide_CEBPA', 'guide_CEBPB', 'guide_CEBPE', 'guide_CELF2', 'guide_CITED1', 'guide_CKS1B', 'guide_CLDN6', 'guide_CNN1', 'guide_CNNM4', 'guide_COL1A1', 'guide_COL2A1', 'guide_CSRNP1', 'guide_DLX2', 'guide_DUSP9', 'guide_EGR1', 'guide_ELMSAN1', 'guide_ETS2', 'guide_FEV', 'guide_FOSB', 'guide_FOXA1', 'guide_FOXA3', 'guide_FOXF1', 'guide_FOXL2', 'guide_FOXO4', 'guide_GLB1L2', 'guide_HES7', 'guide_HK2', 'guide_HNF4A', 'guide_HOXA13', 'guide_HOXB9', 'guide_HOXC13', 'guide_IER5L', 'guide_IGDCC3', 'guide_IKZF3', 'guide_IRF1', 'guide_ISL2', 'guide_JUN', 'guide_KIAA1804', 'guide_KIF18B', 'guide_KIF2C', 'guide_KLF1', 'guide_KMT2A', 'guide_LHX1', 'guide_LYL1', 'guide_MAML2', 'guide_MAP2K3', 'guide_MAP2K6', 'guide_MAP4K3', 'guide_MAP4K5', 'guide_MAP7D1', 'guide_MAPK1', 'guide_MEIS1', 'guide_MIDN', 'guide_NCL', 'guide_NIT1', 'guide_OSR2', 'guide_PLK4', 'guide_POU3F2', 'guide_PRDM1', 'guide_PRTG', 'guide_PTPN1', 'guide_PTPN12', 'guide_PTPN13', 'guide_PTPN9', 'guide_RHOXF2', 'guide_RREB1', 'guide_RUNX1T1', 'guide_S1PR2', 'guide_SAMD1', 'guide_SET', 'guide_SGK1', 'guide_SLC38A2', 'guide_SLC4A1', 'guide_SLC6A9', 'guide_SNAI1', 'guide_SPI1', 'guide_STIL', 'guide_TBX2', 'guide_TBX3', 'guide_TGFBR2', 'guide_TMSB4X', 'guide_TP73', 'guide_TSC22D1', 'guide_UBASH3A', 'guide_UBASH3B', 'guide_ZBTB1', 'guide_ZBTB10', 'guide_ZBTB25', 'guide_ZC3HAV1', 'guide_ZNF318', 'guide_ids', 'perturbation_name', 'perturbation_type', 'perturbation_value', 'perturbation_unit', 'gene_programme', 'psbulk_n_cells', 'psbulk_counts'
    var: 'index', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    layers: 'psbulk_props'

[10]:

psadata.obs.head()

[10]:

	guide_BAK1	guide_BPGM	...	guide_ids	perturbation_name	perturbation_type	perturbation_value	perturbation_unit	gene_programme	psbulk_n_cells	psbulk_counts
BAK1+KLF1	1	0	...	BAK1,KLF1	BAK1+KLF1	genetic	NaN	NaN	Pro-growth	392.0	1.186428e+06
BPGM+SAMD1	0	1	...	BPGM,SAMD1	BPGM+SAMD1	genetic	NaN	NaN	Erythroid	300.0	8.491724e+05
CBL+CNN1	0	0	...	CBL,CNN1	CBL+CNN1	genetic	NaN	NaN	Erythroid	348.0	9.622422e+05
CBL+PTPN12	0	0	...	CBL,PTPN12	CBL+PTPN12	genetic	NaN	NaN	Erythroid	333.0	9.529553e+05
CBL+PTPN9	0	0	...	CBL,PTPN9	CBL+PTPN9	genetic	NaN	NaN	Erythroid	305.0	8.501961e+05

5 rows × 113 columns

In the generated AnnData object, each observation represents a perturbation, and its expression is the mode of the PseudobulkSpace function.

Now, the perturbation space can be visualized, and various operations can be applied to analyze it.

[11]:

sc.pp.neighbors(psadata)
sc.tl.umap(psadata)
sc.pl.umap(psadata, color="perturbation_name")

WARNING: You’re trying to run this on 19017 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.

../../_images/tutorials_notebooks_perturbation_space_19_1.png

Based on the UMAP visualization of the perturbation space above, we see that the individual perturbations are grouped into clusters. Next, we want to check if the clusters correspond to the gene programmes that we have defined for the individual perturbations above.

[12]:

sc.pl.umap(psadata, color="gene_programme")

../../_images/tutorials_notebooks_perturbation_space_21_0.png

When coloring the individual perturbations according to their gene programs, we observe that the clusters align with the gene programs. This indicates that the perturbation space nicely captures the effects of the perturbations on the cells. Furthermore, when looking at the control perturbation (in the UMAP plot above depicted in blue), we see that it is situated on the periphery of the ‘pro-growth’ cluster, suggesting that the perturbations in this cluster might have only a small effect on the cells.

Another option is to calculate the difference between the perturbations and the control, using the compute_control_diff method. Afterwards, we can again calculate and visualize the perturbation space:

[15]:

sc.pp.neighbors(adata)
sc.tl.pca(adata)

[16]:

diff_adata = ps.compute_control_diff(
    adata,
    target_col="perturbation_name",
    reference_key="control",
    embedding_key="X_pca",
)

[17]:

diff_psadata = ps.compute(
    diff_adata,
    target_col="perturbation_name",
    mode="mean",
    min_cells=0,
    min_counts=0,
    embedding_key="X_pca",
    skip_checks=True,
)

[18]:

sc.pp.neighbors(diff_psadata)
sc.tl.umap(diff_psadata)
sc.pl.umap(diff_psadata, color="gene_programme")

../../_images/tutorials_notebooks_perturbation_space_27_0.png

As the individual perturbations already clustered nicely before the control difference was calculated, we do not see a big difference in the UMAP embedding in terms of clustering. We can observe that the perturbations with the same gene programme cluster a bit more densely than before. We also note one “Granulocyte apoptosis” perturbation that is situated in the “Pioneer factors” cluster, which could be worth investigating further.

MLP Classifier Space¶

The multilayer perceptron (MLP) classifier embedding method trains a neural network based classifier to predict which perturbation has been applied to each cell. Once the training has finished, we obtain the representations of the last layer as embedding of the perturbations.

[9]:

ps = pt.tl.MLPClassifierSpace()

Next, we will create and train the model using the compute method. The method accepts different hyperparameters related with the architecture of the model such as hidden_dim, dropout, batch_norm, etc. Training hyperparameters such as batch_size, test_split_size, validation_split_size can also be changed. The compute method trains the model, using GPU if available, and returns the embeddings of the data. The embeddings are the representation of each cell in the last layer of the neural network, which has a size of 256, as we define in the parameter hidden_dim. Hence, the embeddings represent a lower dimensional representation of the data.

[11]:

cell_embeddings = ps.compute(
    adata,
    target_col="perturbation_name",
    hidden_dim=[512, 256],
    dropout=0.05,
    batch_size=64,
    batch_norm=True,
    max_epochs=20,
)

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name | Type | Params
------------------------------
0 | net  | MLP  | 9.9 M
------------------------------
9.9 M     Trainable params
0         Non-trainable params
9.9 M     Total params
39.553    Total estimated model params size (MB)

`Trainer.fit` stopped: `max_epochs=20` reached.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         test_loss         │     2.326035976409912     │
└───────────────────────────┴───────────────────────────┘

[14]:

cell_embeddings.shape

[14]:

(40763, 256)

The cell_embeddings object is of size (n_cells, 256). We can further analyze and visualize this space, by applying pseudobulk or any other transformation to obtain one datapoint per perturbation (an AnnData object with shape n_perturbations x 256).

[15]:

ps = pt.tl.PseudobulkSpace()
psadata = ps.compute(
    cell_embeddings,
    target_col="perturbations",
    mode="mean",
    min_cells=0,
    min_counts=0,
)

[16]:

psadata

[16]:

AnnData object with n_obs × n_vars = 75 × 256
    obs: 'perturbations', 'guide_AHR', 'guide_ARID1A', 'guide_ARRDC3', 'guide_ATL1', 'guide_BAK1', 'guide_BCL2L11', 'guide_BCORL1', 'guide_BPGM', 'guide_C19orf26', 'guide_C3orf72', 'guide_CBFA2T3', 'guide_CBL', 'guide_CDKN1A', 'guide_CDKN1B', 'guide_CDKN1C', 'guide_CEBPA', 'guide_CEBPB', 'guide_CEBPE', 'guide_CELF2', 'guide_CITED1', 'guide_CKS1B', 'guide_CLDN6', 'guide_CNN1', 'guide_CNNM4', 'guide_COL1A1', 'guide_COL2A1', 'guide_CSRNP1', 'guide_DLX2', 'guide_DUSP9', 'guide_EGR1', 'guide_ELMSAN1', 'guide_ETS2', 'guide_FEV', 'guide_FOSB', 'guide_FOXA1', 'guide_FOXA3', 'guide_FOXF1', 'guide_FOXL2', 'guide_FOXO4', 'guide_GLB1L2', 'guide_HES7', 'guide_HK2', 'guide_HNF4A', 'guide_HOXA13', 'guide_HOXB9', 'guide_HOXC13', 'guide_IER5L', 'guide_IGDCC3', 'guide_IKZF3', 'guide_IRF1', 'guide_ISL2', 'guide_JUN', 'guide_KIAA1804', 'guide_KIF18B', 'guide_KIF2C', 'guide_KLF1', 'guide_KMT2A', 'guide_LHX1', 'guide_LYL1', 'guide_MAML2', 'guide_MAP2K3', 'guide_MAP2K6', 'guide_MAP4K3', 'guide_MAP4K5', 'guide_MAP7D1', 'guide_MAPK1', 'guide_MEIS1', 'guide_MIDN', 'guide_NCL', 'guide_NIT1', 'guide_OSR2', 'guide_PLK4', 'guide_POU3F2', 'guide_PRDM1', 'guide_PRTG', 'guide_PTPN1', 'guide_PTPN12', 'guide_PTPN13', 'guide_PTPN9', 'guide_RHOXF2', 'guide_RREB1', 'guide_RUNX1T1', 'guide_S1PR2', 'guide_SAMD1', 'guide_SET', 'guide_SGK1', 'guide_SLC38A2', 'guide_SLC4A1', 'guide_SLC6A9', 'guide_SNAI1', 'guide_SPI1', 'guide_STIL', 'guide_TBX2', 'guide_TBX3', 'guide_TGFBR2', 'guide_TMSB4X', 'guide_TP73', 'guide_TSC22D1', 'guide_UBASH3A', 'guide_UBASH3B', 'guide_ZBTB1', 'guide_ZBTB10', 'guide_ZBTB25', 'guide_ZC3HAV1', 'guide_ZNF318', 'guide_ids', 'perturbation_name', 'perturbation_type', 'perturbation_value', 'perturbation_unit', 'gene_programme', 'psbulk_n_cells', 'psbulk_counts'
    layers: 'psbulk_props'

[18]:

sc.pp.neighbors(psadata, use_rep="X")
sc.tl.umap(psadata)
sc.pl.umap(psadata, color="perturbations")

../../_images/tutorials_notebooks_perturbation_space_38_0.png

[19]:

sc.pl.umap(psadata, color="gene_programme")

../../_images/tutorials_notebooks_perturbation_space_39_0.png

As we see based on the UMAP plot above, the perturbations are grouped into clusters that correspond to the gene programmes. Importantly, the classifier was not trained to separate the gene programmes, but the individual perturbations. Hence, the fact that the gene programmes are nicely separated in the embedding space indicates that the MLP Classifier works well, as the embeddings capture the effects of the perturbations on the cells.

Logistic Regression Classifier Space¶

As an alternative to the MLP classifier, we can use a logistic regression classifier to embed the data in the perturbation space. The logistic regression classifier fits one classifier per perturbation and uses the coefficients of the classifier as the embedding for the respective perturbation. Consequently, we obtain one embedding per perturbation, so that we don’t need to calculate a pseudobulk space afterwards, as we did for the MLP classifier. We fit the classifier on PCA data, which was already calculated above.

[11]:

ps = pt.tl.LRClassifierSpace()
psadata = ps.compute(adata, embedding_key="X_pca", target_col="perturbation_name")
psadata

[11]:

AnnData object with n_obs × n_vars = 75 × 50
    obs: 'perturbations', 'classifier_score', 'guide_AHR', 'guide_ARID1A', 'guide_ARRDC3', 'guide_ATL1', 'guide_BAK1', 'guide_BCL2L11', 'guide_BCORL1', 'guide_BPGM', 'guide_C19orf26', 'guide_C3orf72', 'guide_CBFA2T3', 'guide_CBL', 'guide_CDKN1A', 'guide_CDKN1B', 'guide_CDKN1C', 'guide_CEBPA', 'guide_CEBPB', 'guide_CEBPE', 'guide_CELF2', 'guide_CITED1', 'guide_CKS1B', 'guide_CLDN6', 'guide_CNN1', 'guide_CNNM4', 'guide_COL1A1', 'guide_COL2A1', 'guide_CSRNP1', 'guide_DLX2', 'guide_DUSP9', 'guide_EGR1', 'guide_ELMSAN1', 'guide_ETS2', 'guide_FEV', 'guide_FOSB', 'guide_FOXA1', 'guide_FOXA3', 'guide_FOXF1', 'guide_FOXL2', 'guide_FOXO4', 'guide_GLB1L2', 'guide_HES7', 'guide_HK2', 'guide_HNF4A', 'guide_HOXA13', 'guide_HOXB9', 'guide_HOXC13', 'guide_IER5L', 'guide_IGDCC3', 'guide_IKZF3', 'guide_IRF1', 'guide_ISL2', 'guide_JUN', 'guide_KIAA1804', 'guide_KIF18B', 'guide_KIF2C', 'guide_KLF1', 'guide_KMT2A', 'guide_LHX1', 'guide_LYL1', 'guide_MAML2', 'guide_MAP2K3', 'guide_MAP2K6', 'guide_MAP4K3', 'guide_MAP4K5', 'guide_MAP7D1', 'guide_MAPK1', 'guide_MEIS1', 'guide_MIDN', 'guide_NCL', 'guide_NIT1', 'guide_OSR2', 'guide_PLK4', 'guide_POU3F2', 'guide_PRDM1', 'guide_PRTG', 'guide_PTPN1', 'guide_PTPN12', 'guide_PTPN13', 'guide_PTPN9', 'guide_RHOXF2', 'guide_RREB1', 'guide_RUNX1T1', 'guide_S1PR2', 'guide_SAMD1', 'guide_SET', 'guide_SGK1', 'guide_SLC38A2', 'guide_SLC4A1', 'guide_SLC6A9', 'guide_SNAI1', 'guide_SPI1', 'guide_STIL', 'guide_TBX2', 'guide_TBX3', 'guide_TGFBR2', 'guide_TMSB4X', 'guide_TP73', 'guide_TSC22D1', 'guide_UBASH3A', 'guide_UBASH3B', 'guide_ZBTB1', 'guide_ZBTB10', 'guide_ZBTB25', 'guide_ZC3HAV1', 'guide_ZNF318', 'guide_ids', 'perturbation_type', 'gene_programme'

[13]:

sc.pp.neighbors(psadata, use_rep='X')
sc.tl.umap(psadata)
sc.pl.umap(psadata, color=['gene_programme'])

../../_images/tutorials_notebooks_perturbation_space_44_0.png

Similar to the MLP classifier, the logistic regression classifier also captures the effects of the perturbations on the cells, as the perturbations cluster according to the gene programmes.

Centroid Space¶

The Centroid Space computes the centroids per perturbation of a pre-computed embedding. Hence, we first need to compute an embedding for the original dataset; here we use UMAP. Note that in this particular dataset, the UMAP is already computed. Nevertheless, we will compute it again to show how to use the Centroid Space in a general setting.

[15]:

sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

[16]:

sc.pl.umap(adata, color="perturbation_name")

../../_images/tutorials_notebooks_perturbation_space_49_0.png

[17]:

ps = pt.tl.CentroidSpace()
psadata = ps.compute(adata, target_col="perturbation_name", embedding_key="X_umap")

[18]:

psadata

[18]:

AnnData object with n_obs × n_vars = 75 × 2
    obs: 'perturbation_name', 'guide_AHR', 'guide_ARID1A', 'guide_ARRDC3', 'guide_ATL1', 'guide_BAK1', 'guide_BCL2L11', 'guide_BCORL1', 'guide_BPGM', 'guide_C19orf26', 'guide_C3orf72', 'guide_CBFA2T3', 'guide_CBL', 'guide_CDKN1A', 'guide_CDKN1B', 'guide_CDKN1C', 'guide_CEBPA', 'guide_CEBPB', 'guide_CEBPE', 'guide_CELF2', 'guide_CITED1', 'guide_CKS1B', 'guide_CLDN6', 'guide_CNN1', 'guide_CNNM4', 'guide_COL1A1', 'guide_COL2A1', 'guide_CSRNP1', 'guide_DLX2', 'guide_DUSP9', 'guide_EGR1', 'guide_ELMSAN1', 'guide_ETS2', 'guide_FEV', 'guide_FOSB', 'guide_FOXA1', 'guide_FOXA3', 'guide_FOXF1', 'guide_FOXL2', 'guide_FOXO4', 'guide_GLB1L2', 'guide_HES7', 'guide_HK2', 'guide_HNF4A', 'guide_HOXA13', 'guide_HOXB9', 'guide_HOXC13', 'guide_IER5L', 'guide_IGDCC3', 'guide_IKZF3', 'guide_IRF1', 'guide_ISL2', 'guide_JUN', 'guide_KIAA1804', 'guide_KIF18B', 'guide_KIF2C', 'guide_KLF1', 'guide_KMT2A', 'guide_LHX1', 'guide_LYL1', 'guide_MAML2', 'guide_MAP2K3', 'guide_MAP2K6', 'guide_MAP4K3', 'guide_MAP4K5', 'guide_MAP7D1', 'guide_MAPK1', 'guide_MEIS1', 'guide_MIDN', 'guide_NCL', 'guide_NIT1', 'guide_OSR2', 'guide_PLK4', 'guide_POU3F2', 'guide_PRDM1', 'guide_PRTG', 'guide_PTPN1', 'guide_PTPN12', 'guide_PTPN13', 'guide_PTPN9', 'guide_RHOXF2', 'guide_RREB1', 'guide_RUNX1T1', 'guide_S1PR2', 'guide_SAMD1', 'guide_SET', 'guide_SGK1', 'guide_SLC38A2', 'guide_SLC4A1', 'guide_SLC6A9', 'guide_SNAI1', 'guide_SPI1', 'guide_STIL', 'guide_TBX2', 'guide_TBX3', 'guide_TGFBR2', 'guide_TMSB4X', 'guide_TP73', 'guide_TSC22D1', 'guide_UBASH3A', 'guide_UBASH3B', 'guide_ZBTB1', 'guide_ZBTB10', 'guide_ZBTB25', 'guide_ZC3HAV1', 'guide_ZNF318', 'guide_ids', 'perturbation_type', 'gene_programme'
    obsm: 'X_umap'

[19]:

sc.pl.umap(psadata, color=["perturbation_name", "gene_programme"], legend_loc="on data")

../../_images/tutorials_notebooks_perturbation_space_52_0.png

Based on the UMAP plots above, we see that also the Centroid Space is capable of capturing the effects of the perturbations on the cells. The clusters align with the gene programmes, and the control perturbation is situated on the periphery of the ‘Pro-growth’ cluster.

Cluster Spaces¶

DBScan Space¶

DBSCAN clusters the given data using a density-based algorithm.

You can cluster the data based on the full expression profile stored in .X or on a data representation with reduced dimensionality as specified by the embedding_key parameter. Be aware that computing the embedding on the .X matrix can be very time-consuming for large datasets. In this tutorial, we will use the UMAP embedding, which was already calculated above but will be calculated again in the next cell for the purpose of completeness.

[25]:

sc.tl.pca(adata, n_comps=15)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

[26]:

ps = pt.tl.DBSCANSpace()

[27]:

dbscan_psadata = ps.compute(adata, min_samples=50, eps=0.25, copy=True, embedding_key="X_umap", n_jobs=2)

[28]:

sc.pl.umap(dbscan_psadata, color="dbscan")

../../_images/tutorials_notebooks_perturbation_space_59_0.png

The UMAP plot above shows that several clusters have been identified based on density-based clustering. Of note, one could achieve a finer clustering using the whole data (adata.X) or a different embedding, which we omit here for the sake of simplicity.

Next, we want to evaluate if the identified clusters correspond to the gene programmes that we have defined for the individual perturbations above. Note that the true_label_col specifying the ground truth annotation in the evaluate_clustering method could also be something different from gene programs, or even the perturbation itself. As evaluation metrics, we will calculate the NMI and ARI between the clusters and the gene programmes. Additionally, ASW can be calculated as well but, depending on the size of the dataset, it can take longer to compute.

[29]:

dbscan_results = ps.evaluate_clustering(
    dbscan_psadata,
    true_label_col="gene_programme",
    cluster_col="dbscan",
    metric="l1",
    metrics=["nmi", "ari"],
)

[30]:

dbscan_results

[30]:

{'nmi': 0.17569832707103, 'ari': 0.0918081441614236}

The NMI is a value between 0 and 1, where 1 would indicate that the computed clusters perfectly align with the ground truth clusters. Here, we have a value of approx. 0.18, which is comparably low and indicates that the DBSCAN clusters computed based on UMAP do not correspond well to the gene programmes. The ARI takes a value between -0.5 and 1, where 0.0 indicates that the clusters are randomly assigned. Here, we see an ARI of 0.09, again highlighting that the density-based clustering based on UMAP does not align well with the gene programmes. To improve the clustering, one could consider running DBSCAN on higher-dimensional data, e.g. by using a different embedding or no embedding at all.

K-Means Space¶

Analogous to the steps used for DBSCAN clustering, we will now use K-means algorithm to cluster the data. This time, we will use the PCA embedding, which we calculated above.

[33]:

ps = pt.tl.KMeansSpace()
kmeans_psadata = ps.compute(adata, n_clusters=7, copy=True, embedding_key="X_pca")

[34]:

sc.pl.umap(kmeans_psadata, color=["k-means", "gene_programme"])

../../_images/tutorials_notebooks_perturbation_space_67_0.png

Again, we will evaluate the calculated clusters using the NMI and ARI metrics:

[14]:

kmeans_results = ps.evaluate_clustering(
    kmeans_psadata,
    true_label_col="gene_programme",
    cluster_col="k-means",
    metric="l2",
    metrics=["nmi", "ari"],
)

[36]:

kmeans_results

[36]:

{'nmi': 0.22118522241390073, 'ari': 0.15420778229072798}

Compared to the NMI and ARI computed for the DBSCAN clusters, both metrics are higher when using K-means on a PCA embedding. This could be due to the fact that the PCs capture more information than the 2D UMAP embedding, the ability to specify the number of target clusters when running K-means, or simply because K-Means is better suited for this data. Again, one could test using another embedding or no embedding at all, as well as additional parameters to improve the clustering.

Conclusion¶

In this tutorial, we have introduced several methods to embed and work with data in the perturbation space. Whether first clustering the data or immediately embedding it in a perturbation space, the latter can be used to analyze the effects of perturbations on cells. One important application is the identification of perturbations with similar effects, similar to the gene programme annotation in this tutorial. Furthermore, we showed how to arithmetically combine perturbations to create new perturbations, which can be used to validate experimental data or predict the effects of novel perturbations.

References¶

Norman, Thomas M et al. “Exploring genetic interaction manifolds constructed from rich single-cell phenotypes.” Science (New York, N.Y.) vol. 365,6455 (2019): 786-793. doi:10.1126/science.aax4438