Note

This page was generated from augur.ipynb. Some tutorial content may look better in light mode.

Augur - cell type prioritization prediction

This is a short tutorial demonstrating Augur, a Python implementation of the Augur R package.

Augur aims to rank or prioritize cell types according to their response to experimental perturbations. The fundamental idea is that in the space of molecular measurements cells reacting heavily to induced perturbations are more easily separated into perturbed and unperturbed than cell types with little or no response. This separability is quantified by measuring how well experimental labels (e.g. treatment and control) can be predicted within each cell type. Augur trains a machine learning model predicting experimental labels for each cell type in multiple cross validation runs and then prioritizes cell type response according to metric scores measuring the accuracy of the model. For categorical data Augur uses the area under the curve and for numerical data it uses concordance correlation coefficient.

The following tutorial runs through a simple analysis with Augur using a simulated dataset.

Setup

[1]:
import warnings

warnings.filterwarnings("ignore")
[2]:
import pertpy as pt
import scanpy as sc

As a first step, we load the dataset. This can either be an AnnData object, a DataFrame containing cell type labels as well as conditions for each cell or data contained in a DataFrame that contains the corresponding metadata.

Afterwards, we create an Augur object specifying our estimator of interest to measure how predictable the perturbation labels for each cell type in the dataset are. Choose random_forest_classifier or logistic_regression_classifier for categorical data and random_forest_regressor for numerical data. Additional parameters can be added through the Params class.

Next, load this data into an AnnData object adding dummy variables of labels and renaming cell type and label observation columns to cell_type and label.

The dataset that we’ll use consists of 600 cells, distributed evenly between three populations (cell types A, B, and C). Each of these cell types has approximately half of its cells in one of two conditions, treatment and control. The cell types also have different numbers of differentially expressed genes (DEG) in response to treatment. Cell type A has approximately 5% of DEG in response to the treatment, while cell type B has 25% DEGs and cell type C has 50%.

[3]:
adata = pt.dt.sc_sim_augur()

ag_rfc = pt.tl.Augur("random_forest_classifier")
loaded_data = ag_rfc.load(adata)
[4]:
loaded_data
[4]:
AnnData object with n_obs × n_vars = 600 × 15697
    obs: 'label', 'cell_type', 'y_'
    var: 'name'

Augur prediction

Now we run Augur with the function predict and look at the results:

The default way to select features is based on select_variance, which is an implementation that is very close to the original R Augur implementation. In addition, it is possible to select features with scanpy.pp.highly_variable_genes, by setting select_variance_features=False

[5]:
v_adata, v_results = ag_rfc.predict(loaded_data, subsample_size=20, select_variance_features=True, n_threads=4)

# to visualize the results
lollipop = ag_rfc.plot_lollipop(v_results)
Set smaller span value in the case of a `segmentation fault` error.
Set larger span in case of svddc or other near singularities error.
../../_images/tutorials_notebooks_augur_10_5.png
[6]:
h_adata, h_results = ag_rfc.predict(loaded_data, subsample_size=20, select_variance_features=False, n_threads=4)

print(h_results["summary_metrics"])
                  CellTypeA  CellTypeB  CellTypeC
mean_augur_score   0.598299   0.867948   0.918435
mean_auc           0.598299   0.867948   0.918435
mean_accuracy      0.542674   0.735971   0.783040
mean_precision     0.539211   0.782898   0.790340
mean_f1            0.419785   0.697024   0.795674
mean_recall        0.418889   0.704921   0.855873

Here we visualize the cell ranking and the corresponding augur scores using select variance feature selection in a lollipop plot. In this case the ranking it the same but the values themselves differ between the two methods.

To compare the two feature selection methods they can be plotted together in a scatterplot. Here you can see that CellTypeB has slightly higher values when using the method select_highly_variable compared to select_variance, where as CellTypeC has the same augur score.

[7]:
scatter = ag_rfc.plot_scatterplot(v_results, h_results)
../../_images/tutorials_notebooks_augur_14_0.png

The corresponding mean_augur_score is also saved in result_adata.obs.

[8]:
sc.pp.neighbors(v_adata, use_rep="X")
sc.tl.umap(v_adata)
sc.pl.umap(v_adata, color="augur_score")
../../_images/tutorials_notebooks_augur_16_0.png

Feature Importances

The results object also returns feature importances. In the case of a random forest estimator the feature importances built into sci-kit learn were used to calculate feature importances. For the logistic regression the agresti method is used where the mean gets subtracted from the coefficient values and then divided by the standard deviation.

[9]:
important_features = ag_rfc.plot_important_features(v_results)
../../_images/tutorials_notebooks_augur_19_0.png

Differential Prioritization

Augur is also able to perform differential prioritization and executes a permutation test to identify cell types with statistically significant differences in AUC between two different rounds of cell type prioritization (e.g. response to drugs A and B, compared to untreated control).

In the following part, single-cell prefrontal cortex data from adult mice under cocaine self-administration from Bhattacherjee 2019 will be used. Adult mice were subject to cocaine self-administration, samples were collected after three time points: Maintenance, 48h after cocaine withdrawal and 15 days after cocaine withdrawal.

We compare withdraw_15d_Cocaine and withdraw_48h_Cocaine with respect to the difference to Maintenance_Cocaine. Each variation is run once in default mode and once in permute mode.

[10]:
bhattacherjee_adata = pt.dt.bhattacherjee()
ag_rfc = pt.tl.Augur("random_forest_classifier")

We first run Augur on Maintenance_Cocaine and withdraw_15d_Cocaine in augur_mode=default and augur_mode=permute mode.

[11]:
# default
bhattacherjee_15 = ag_rfc.load(
    bhattacherjee_adata,
    condition_label="Maintenance_Cocaine",
    treatment_label="withdraw_15d_Cocaine",
)

bhattacherjee_adata_15, bhattacherjee_results_15 = ag_rfc.predict(bhattacherjee_15, random_state=None, n_threads=4)

print(bhattacherjee_results_15["summary_metrics"].loc["mean_augur_score"].sort_values(ascending=False))
Filtering samples with Maintenance_Cocaine and withdraw_15d_Cocaine labels.
Set smaller span value in the case of a `segmentation fault` error.
Set larger span in case of svddc or other near singularities error.
Oligo         0.801247
Astro         0.780522
Microglia     0.757075
OPC           0.729909
Inhibitory    0.649989
NF Oligo      0.629626
Excitatory    0.602041
Endo          0.576973
Name: mean_augur_score, dtype: float64
[12]:
# permute
bhattacherjee_adata_15_permute, bhattacherjee_results_15_permute = ag_rfc.predict(
    bhattacherjee_15,
    augur_mode="permute",
    n_subsamples=100,
    random_state=None,
    n_threads=4,
)
Set smaller span value in the case of a `segmentation fault` error.
Set larger span in case of svddc or other near singularities error.

Now lets do the same looking at Maintenance_Cocaine and withdraw_48h_Cocaine.

[13]:
# default
bhattacherjee_48 = ag_rfc.load(
    bhattacherjee_adata,
    condition_label="Maintenance_Cocaine",
    treatment_label="withdraw_48h_Cocaine",
)

bhattacherjee_adata_48, bhattacherjee_results_48 = ag_rfc.predict(bhattacherjee_48, random_state=None, n_threads=4)

print(bhattacherjee_results_48["summary_metrics"].loc["mean_augur_score"].sort_values(ascending=False))
Filtering samples with Maintenance_Cocaine and withdraw_48h_Cocaine labels.
Set smaller span value in the case of a `segmentation fault` error.
Set larger span in case of svddc or other near singularities error.
Astro         0.626780
OPC           0.624014
Oligo         0.613776
NF Oligo      0.611406
Microglia     0.598095
Inhibitory    0.571735
Excitatory    0.539138
Endo          0.529558
Name: mean_augur_score, dtype: float64
[14]:
# permute
bhattacherjee_adata_48_permute, bhattacherjee_results_48_permute = ag_rfc.predict(
    bhattacherjee_48,
    augur_mode="permute",
    n_subsamples=100,
    random_state=None,
    n_threads=4,
)
Set smaller span value in the case of a `segmentation fault` error.
Set larger span in case of svddc or other near singularities error.
Skipping NF Oligo cell type - 79 samples is less than min_cells 100.

Let’s also take a look at the Augur scores of the two versions in a scatter plot. The diagonal line is the identity function. If the values were the same they would be on the line.

[15]:
scatter = ag_rfc.plot_scatterplot(bhattacherjee_results_15, bhattacherjee_results_48)
../../_images/tutorials_notebooks_augur_29_0.png

To figure out which cell type was most affected in comparing withdraw_48h_Cocaine and withdraw_15d_Cocaine we can run differential prioritization:

[16]:
pvals = ag_rfc.predict_differential_prioritization(
    augur_results1=bhattacherjee_results_15,
    augur_results2=bhattacherjee_results_48,
    permuted_results1=bhattacherjee_results_15_permute,
    permuted_results2=bhattacherjee_results_48_permute,
)
[17]:
pvals
[17]:
cell_type mean_augur_score1 mean_augur_score2 delta_augur b m z pval padj
0 Oligo 0.801247 0.613776 -0.187472 1000 1000 -13.490667 0.001998 0.002797
1 Inhibitory 0.649989 0.571735 -0.078254 989 1000 -4.071143 0.023976 0.023976
2 OPC 0.729909 0.624014 -0.105896 998 1000 -5.484539 0.005994 0.006993
3 Microglia 0.757075 0.598095 -0.158980 1000 1000 -9.528952 0.001998 0.002797
4 Endo 0.576973 0.529558 -0.047415 1000 1000 -3.286200 0.001998 0.002797
5 Excitatory 0.602041 0.539138 -0.062902 1000 1000 -4.038718 0.001998 0.002797
6 Astro 0.780522 0.626780 -0.153741 1000 1000 -7.940456 0.001998 0.002797

The p-value, following the R Augur implementation is calculated using b, the number of times permuted values are larger than original values and m, the number of permutations run. Since b is the same for all cells but OPC and inhibitory, the p-value is the same for these as well.

[18]:
diff = ag_rfc.plot_dp_scatter(pvals)
../../_images/tutorials_notebooks_augur_34_0.png

In this case the cell type Endo has the lowest z-score. When comparing the impact of withdraw_48h_Cocaine and withdraw_15d_Cocaine, this cell type was most perturbed.

Conclusion

Augur is a simple way of ranking perturbation effects based on the idea that stronger perturbations should be easier separable from control than weaker perturbations.

References

  1. Skinnider, M.A., Squair, J.W., Kathe, C. et al. Cell type prioritization in single-cell data. Nat Biotechnol 39, 30–34 (2021). https://doi.org/10.1038/s41587-020-0605-1

  2. Squair, J.W., Skinnider, M.A., Gautier, M. et al. Prioritization of cell types responsive to biological perturbations in single-cell data with Augur. Nat Protoc 16, 3836–3873 (2021). https://doi.org/10.1038/s41596-021-00561-x

  3. Bhattacherjee, A., Djekidel, M.N., Chen, R. et al. Cell type-specific transcriptional programs in mouse prefrontal cortex during adolescence and addiction. Nat Commun 10, 4169 (2019). https://doi.org/10.1038/s41467-019-12054-3