Augur - cell type prioritization prediction#
This is a short tutorial demonstrating Augur, a Python implementation of the Augur R package using a simulated dataset [SSK+21].
Augur aims to rank or prioritize cell types according to their response to experimental perturbations. The fundamental idea is that in the space of molecular measurements cells reacting heavily to induced perturbations are more easily separated into perturbed and unperturbed than cell types with little or no response. This separability is quantified by measuring how well experimental labels (e.g. treatment and control) can be predicted within each cell type. Augur trains a machine learning model predicting experimental labels for each cell type in multiple cross validation runs and then prioritizes cell type response according to metric scores measuring the accuracy of the model. For categorical data Augur uses the area under the curve and for numerical data it uses concordance correlation coefficient.
Setup#
import warnings
warnings.filterwarnings("ignore")
import pertpy as pt
import scanpy as sc
Dataset#
Create an Augur object specifying an estimator of random_forest_classifier or logistic_regression_classifier for categorical data and random_forest_regressor for numerical data.
The dataset consists of 600 cells, distributed evenly between three populations (cell types A, B, and C). Each of these cell types has approximately half of its cells in one of two conditions, treatment and control. The cell types also have different numbers of differentially expressed genes (DEG) in response to treatment. Cell type A has approximately 5% of DEG in response to the treatment, while cell type B has 25% DEGs and cell type C has 50%.
adata = pt.dt.sc_sim_augur()
ag_rfc = pt.tl.Augur("random_forest_classifier")
loaded_data = ag_rfc.load(adata)
loaded_data
AnnData object with n_obs × n_vars = 600 × 15697
obs: 'label', 'cell_type', 'y_'
var: 'name'
Augur prediction#
Now we run Augur with the function predict and look at the results:
The default way to select features is based on select_variance, which is an implementation that is very close to the original R Augur implementation. In addition, it is possible to select features with scanpy.pp.highly_variable_genes, by setting select_variance_features=False
v_adata, v_results = ag_rfc.predict(loaded_data, subsample_size=20, select_variance_features=True, n_threads=4)
# to visualize the results
lollipop = ag_rfc.plot_lollipop(v_results)
Set smaller span value in the case of a `segmentation fault` error.
Set larger span in case of svddc or other near singularities error.
h_adata, h_results = ag_rfc.predict(loaded_data, subsample_size=20, select_variance_features=False, n_threads=4)
print(h_results["summary_metrics"])
CellTypeA CellTypeB CellTypeC
mean_augur_score 0.598299 0.867948 0.918435
mean_auc 0.598299 0.867948 0.918435
mean_accuracy 0.542674 0.735971 0.783040
mean_precision 0.539211 0.782898 0.790340
mean_f1 0.419785 0.697024 0.795674
mean_recall 0.418889 0.704921 0.855873
Here we visualize the cell ranking and the corresponding augur scores using select variance feature selection in a lollipop plot.
In this case the ranking it the same but the values themselves differ between the two methods.
To compare the two feature selection methods they can be plotted together in a scatterplot. Here you can see that CellTypeB has slightly higher values when using the method select_highly_variable compared to select_variance, where as CellTypeC has the same augur score.
scatter = ag_rfc.plot_scatterplot(v_results, h_results)
The corresponding mean_augur_score is also saved in result_adata.obs.
sc.pp.neighbors(v_adata, use_rep="X")
sc.tl.umap(v_adata)
sc.pl.umap(v_adata, color="augur_score")
Feature Importances#
The results object also returns feature importances. In the case of a random forest estimator the feature importances built into sci-kit learn were used to calculate feature importances. For the logistic regression the agresti method is used where the mean gets subtracted from the coefficient values and then divided by the standard deviation.
important_features = ag_rfc.plot_important_features(v_results)
Differential Prioritization#
Augur is also able to perform differential prioritization and executes a permutation test to identify cell types with statistically significant differences in AUC between two different rounds of cell type prioritization (e.g. response to drugs A and B, compared to untreated control).
In the following part, single-cell prefrontal cortex data from adult mice under cocaine self-administration from Bhattacherjee 2019 will be used. Adult mice were subject to cocaine self-administration, samples were collected after three time points: Maintenance, 48h after cocaine withdrawal and 15 days after cocaine withdrawal.
We compare withdraw_15d_Cocaine and withdraw_48h_Cocaine with respect to the difference to Maintenance_Cocaine.
Each variation is run once in default mode and once in permute mode.
bhattacherjee_adata = pt.dt.bhattacherjee()
ag_rfc = pt.tl.Augur("random_forest_classifier")
We first run Augur on Maintenance_Cocaine and withdraw_15d_Cocaine in augur_mode=default and augur_mode=permute mode.
# default
bhattacherjee_15 = ag_rfc.load(
bhattacherjee_adata,
condition_label="Maintenance_Cocaine",
treatment_label="withdraw_15d_Cocaine",
)
bhattacherjee_adata_15, bhattacherjee_results_15 = ag_rfc.predict(bhattacherjee_15, random_state=None, n_threads=4)
print(bhattacherjee_results_15["summary_metrics"].loc["mean_augur_score"].sort_values(ascending=False))
Filtering samples with Maintenance_Cocaine and withdraw_15d_Cocaine labels.
Set smaller span value in the case of a `segmentation fault` error.
Set larger span in case of svddc or other near singularities error.
Oligo 0.801247
Astro 0.780522
Microglia 0.757075
OPC 0.729909
Inhibitory 0.649989
NF Oligo 0.629626
Excitatory 0.602041
Endo 0.576973
Name: mean_augur_score, dtype: float64
# permute
bhattacherjee_adata_15_permute, bhattacherjee_results_15_permute = ag_rfc.predict(
bhattacherjee_15,
augur_mode="permute",
n_subsamples=100,
random_state=None,
n_threads=4,
)
Set smaller span value in the case of a `segmentation fault` error.
Set larger span in case of svddc or other near singularities error.
Now lets do the same looking at Maintenance_Cocaine and withdraw_48h_Cocaine.
# default
bhattacherjee_48 = ag_rfc.load(
bhattacherjee_adata,
condition_label="Maintenance_Cocaine",
treatment_label="withdraw_48h_Cocaine",
)
bhattacherjee_adata_48, bhattacherjee_results_48 = ag_rfc.predict(bhattacherjee_48, random_state=None, n_threads=4)
print(bhattacherjee_results_48["summary_metrics"].loc["mean_augur_score"].sort_values(ascending=False))
Filtering samples with Maintenance_Cocaine and withdraw_48h_Cocaine labels.
Set smaller span value in the case of a `segmentation fault` error.
Set larger span in case of svddc or other near singularities error.
Astro 0.626780
OPC 0.624014
Oligo 0.613776
NF Oligo 0.611406
Microglia 0.598095
Inhibitory 0.571735
Excitatory 0.539138
Endo 0.529558
Name: mean_augur_score, dtype: float64
# permute
bhattacherjee_adata_48_permute, bhattacherjee_results_48_permute = ag_rfc.predict(
bhattacherjee_48,
augur_mode="permute",
n_subsamples=100,
random_state=None,
n_threads=4,
)
Set smaller span value in the case of a `segmentation fault` error.
Set larger span in case of svddc or other near singularities error.
Skipping NF Oligo cell type - 79 samples is less than min_cells 100.
Let’s also take a look at the Augur scores of the two versions in a scatter plot. The diagonal line is the identity function. If the values were the same they would be on the line.
scatter = ag_rfc.plot_scatterplot(bhattacherjee_results_15, bhattacherjee_results_48)
To figure out which cell type was most affected in comparing withdraw_48h_Cocaine and withdraw_15d_Cocaine we can run differential prioritization:
pvals = ag_rfc.predict_differential_prioritization(
augur_results1=bhattacherjee_results_15,
augur_results2=bhattacherjee_results_48,
permuted_results1=bhattacherjee_results_15_permute,
permuted_results2=bhattacherjee_results_48_permute,
)
pvals
| cell_type | mean_augur_score1 | mean_augur_score2 | delta_augur | b | m | z | pval | padj | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Oligo | 0.801247 | 0.613776 | -0.187472 | 1000 | 1000 | -13.490667 | 0.001998 | 0.002797 |
| 1 | Inhibitory | 0.649989 | 0.571735 | -0.078254 | 989 | 1000 | -4.071143 | 0.023976 | 0.023976 |
| 2 | OPC | 0.729909 | 0.624014 | -0.105896 | 998 | 1000 | -5.484539 | 0.005994 | 0.006993 |
| 3 | Microglia | 0.757075 | 0.598095 | -0.158980 | 1000 | 1000 | -9.528952 | 0.001998 | 0.002797 |
| 4 | Endo | 0.576973 | 0.529558 | -0.047415 | 1000 | 1000 | -3.286200 | 0.001998 | 0.002797 |
| 5 | Excitatory | 0.602041 | 0.539138 | -0.062902 | 1000 | 1000 | -4.038718 | 0.001998 | 0.002797 |
| 6 | Astro | 0.780522 | 0.626780 | -0.153741 | 1000 | 1000 | -7.940456 | 0.001998 | 0.002797 |
The p-value, following the R Augur implementation is calculated using b, the number of times permuted values are larger than original values and m, the number of permutations run. Since b is the same for all cells but OPC and inhibitory, the p-value is the same for these as well.
diff = ag_rfc.plot_dp_scatter(pvals)
In this case the cell type Endo has the lowest z-score. When comparing the impact of withdraw_48h_Cocaine and withdraw_15d_Cocaine, this cell type was most perturbed.