Augur - cell type prioritization prediction

Augur - cell type prioritization prediction#

This is a short tutorial demonstrating Augur, a Python implementation of the Augur R package using a simulated dataset [SSK+21].

Augur aims to rank or prioritize cell types according to their response to experimental perturbations. The fundamental idea is that in the space of molecular measurements cells reacting heavily to induced perturbations are more easily separated into perturbed and unperturbed than cell types with little or no response. This separability is quantified by measuring how well experimental labels (e.g. treatment and control) can be predicted within each cell type. Augur trains a machine learning model predicting experimental labels for each cell type in multiple cross validation runs and then prioritizes cell type response according to metric scores measuring the accuracy of the model. For categorical data Augur uses the area under the curve and for numerical data it uses concordance correlation coefficient.

Setup#

import warnings

warnings.filterwarnings("ignore")

import pertpy as pt
import scanpy as sc

Dataset#

Create an Augur object specifying an estimator of random_forest_classifier or logistic_regression_classifier for categorical data and random_forest_regressor for numerical data.

The dataset consists of 600 cells, distributed evenly between three populations (cell types A, B, and C). Each of these cell types has approximately half of its cells in one of two conditions, treatment and control. The cell types also have different numbers of differentially expressed genes (DEG) in response to treatment. Cell type A has approximately 5% of DEG in response to the treatment, while cell type B has 25% DEGs and cell type C has 50%.

adata = pt.dt.sc_sim_augur()

ag_rfc = pt.tl.Augur("random_forest_classifier")
loaded_data = ag_rfc.load(adata)

loaded_data

AnnData object with n_obs × n_vars = 600 × 15697
    obs: 'label', 'cell_type', 'y_'
    var: 'name'

Augur prediction#

Now we run Augur with the function predict and look at the results:

The default way to select features is based on select_variance, which is an implementation that is very close to the original R Augur implementation. In addition, it is possible to select features with scanpy.pp.highly_variable_genes, by setting select_variance_features=False

v_adata, v_results = ag_rfc.predict(loaded_data, subsample_size=20, select_variance_features=True, n_threads=4)

# to visualize the results
lollipop = ag_rfc.plot_lollipop(v_results)

Set smaller span value in the case of a `segmentation fault` error.

Set larger span in case of svddc or other near singularities error.

../../_images/1ac671a9ed634ab76c714a2d42c9b00884620fcc256442b9077617f435d47295.png

h_adata, h_results = ag_rfc.predict(loaded_data, subsample_size=20, select_variance_features=False, n_threads=4)

print(h_results["summary_metrics"])

                  CellTypeA  CellTypeB  CellTypeC
mean_augur_score   0.598299   0.867948   0.918435
mean_auc           0.598299   0.867948   0.918435
mean_accuracy      0.542674   0.735971   0.783040
mean_precision     0.539211   0.782898   0.790340
mean_f1            0.419785   0.697024   0.795674
mean_recall        0.418889   0.704921   0.855873

Here we visualize the cell ranking and the corresponding augur scores using select variance feature selection in a lollipop plot. In this case the ranking it the same but the values themselves differ between the two methods.

To compare the two feature selection methods they can be plotted together in a scatterplot. Here you can see that CellTypeB has slightly higher values when using the method select_highly_variable compared to select_variance, where as CellTypeC has the same augur score.

scatter = ag_rfc.plot_scatterplot(v_results, h_results)

../../_images/cc33cb09d48318c523e7c230c011beffaab9f6cb10a840a9436ad0aa6e40c6a1.png

The corresponding mean_augur_score is also saved in result_adata.obs.

sc.pp.neighbors(v_adata, use_rep="X")
sc.tl.umap(v_adata)
sc.pl.umap(v_adata, color="augur_score")

../../_images/012c4570dd6e3db795b6f0862b785fb15e67bddad008c2873887a9fdf649c6bf.png

Feature Importances#

The results object also returns feature importances. In the case of a random forest estimator the feature importances built into sci-kit learn were used to calculate feature importances. For the logistic regression the agresti method is used where the mean gets subtracted from the coefficient values and then divided by the standard deviation.

important_features = ag_rfc.plot_important_features(v_results)

../../_images/aaf81070667a165e685f5b3dbfed48201ffee459b73461e827abc11e2f3de422.png

Differential Prioritization#

Augur is also able to perform differential prioritization and executes a permutation test to identify cell types with statistically significant differences in AUC between two different rounds of cell type prioritization (e.g. response to drugs A and B, compared to untreated control).

In the following part, single-cell prefrontal cortex data from adult mice under cocaine self-administration from Bhattacherjee 2019 will be used. Adult mice were subject to cocaine self-administration, samples were collected after three time points: Maintenance, 48h after cocaine withdrawal and 15 days after cocaine withdrawal.

We compare withdraw_15d_Cocaine and withdraw_48h_Cocaine with respect to the difference to Maintenance_Cocaine. Each variation is run once in default mode and once in permute mode.

bhattacherjee_adata = pt.dt.bhattacherjee()
ag_rfc = pt.tl.Augur("random_forest_classifier")

We first run Augur on Maintenance_Cocaine and withdraw_15d_Cocaine in augur_mode=default and augur_mode=permute mode.

# default
bhattacherjee_15 = ag_rfc.load(
    bhattacherjee_adata,
    condition_label="Maintenance_Cocaine",
    treatment_label="withdraw_15d_Cocaine",
)

bhattacherjee_adata_15, bhattacherjee_results_15 = ag_rfc.predict(bhattacherjee_15, random_state=None, n_threads=4)

print(bhattacherjee_results_15["summary_metrics"].loc["mean_augur_score"].sort_values(ascending=False))

Filtering samples with Maintenance_Cocaine and withdraw_15d_Cocaine labels.

Set smaller span value in the case of a `segmentation fault` error.

Set larger span in case of svddc or other near singularities error.

Oligo         0.801247
Astro         0.780522
Microglia     0.757075
OPC           0.729909
Inhibitory    0.649989
NF Oligo      0.629626
Excitatory    0.602041
Endo          0.576973
Name: mean_augur_score, dtype: float64

# permute
bhattacherjee_adata_15_permute, bhattacherjee_results_15_permute = ag_rfc.predict(
    bhattacherjee_15,
    augur_mode="permute",
    n_subsamples=100,
    random_state=None,
    n_threads=4,
)

Set smaller span value in the case of a `segmentation fault` error.

Set larger span in case of svddc or other near singularities error.

Now lets do the same looking at Maintenance_Cocaine and withdraw_48h_Cocaine.

# default
bhattacherjee_48 = ag_rfc.load(
    bhattacherjee_adata,
    condition_label="Maintenance_Cocaine",
    treatment_label="withdraw_48h_Cocaine",
)

bhattacherjee_adata_48, bhattacherjee_results_48 = ag_rfc.predict(bhattacherjee_48, random_state=None, n_threads=4)

print(bhattacherjee_results_48["summary_metrics"].loc["mean_augur_score"].sort_values(ascending=False))

Filtering samples with Maintenance_Cocaine and withdraw_48h_Cocaine labels.

Set smaller span value in the case of a `segmentation fault` error.

Set larger span in case of svddc or other near singularities error.

Astro         0.626780
OPC           0.624014
Oligo         0.613776
NF Oligo      0.611406
Microglia     0.598095
Inhibitory    0.571735
Excitatory    0.539138
Endo          0.529558
Name: mean_augur_score, dtype: float64

# permute
bhattacherjee_adata_48_permute, bhattacherjee_results_48_permute = ag_rfc.predict(
    bhattacherjee_48,
    augur_mode="permute",
    n_subsamples=100,
    random_state=None,
    n_threads=4,
)

Set smaller span value in the case of a `segmentation fault` error.

Set larger span in case of svddc or other near singularities error.

Skipping NF Oligo cell type - 79 samples is less than min_cells 100.

Let’s also take a look at the Augur scores of the two versions in a scatter plot. The diagonal line is the identity function. If the values were the same they would be on the line.

scatter = ag_rfc.plot_scatterplot(bhattacherjee_results_15, bhattacherjee_results_48)

../../_images/cefe036e06260ce08943c976ef1edacd067ca9b4072f26deb1e9cf7165670bb1.png

To figure out which cell type was most affected in comparing withdraw_48h_Cocaine and withdraw_15d_Cocaine we can run differential prioritization:

pvals = ag_rfc.predict_differential_prioritization(
    augur_results1=bhattacherjee_results_15,
    augur_results2=bhattacherjee_results_48,
    permuted_results1=bhattacherjee_results_15_permute,
    permuted_results2=bhattacherjee_results_48_permute,
)

pvals

	cell_type	mean_augur_score1	mean_augur_score2	delta_augur	b	m	z	pval	padj
0	Oligo	0.801247	0.613776	-0.187472	1000	1000	-13.490667	0.001998	0.002797
1	Inhibitory	0.649989	0.571735	-0.078254	989	1000	-4.071143	0.023976	0.023976
2	OPC	0.729909	0.624014	-0.105896	998	1000	-5.484539	0.005994	0.006993
3	Microglia	0.757075	0.598095	-0.158980	1000	1000	-9.528952	0.001998	0.002797
4	Endo	0.576973	0.529558	-0.047415	1000	1000	-3.286200	0.001998	0.002797
5	Excitatory	0.602041	0.539138	-0.062902	1000	1000	-4.038718	0.001998	0.002797
6	Astro	0.780522	0.626780	-0.153741	1000	1000	-7.940456	0.001998	0.002797

The p-value, following the R Augur implementation is calculated using b, the number of times permuted values are larger than original values and m, the number of permutations run. Since b is the same for all cells but OPC and inhibitory, the p-value is the same for these as well.

diff = ag_rfc.plot_dp_scatter(pvals)

../../_images/01c3c2c2cf30a59f774d50e8eed0b23ede530faada20b95b1bcad578b465f767.png

In this case the cell type Endo has the lowest z-score. When comparing the impact of withdraw_48h_Cocaine and withdraw_15d_Cocaine, this cell type was most perturbed.