pertpy.tools.Augur

class pertpy.tools.Augur(estimator, params=None)

Python implementation of Augur.

Methods table

average_metrics(cell_cv_results)

Calculate the average of each metric across the cross-validation runs of one cell type.

ccc_score(y_true, y_pred)

Implementation of Lin's Concordance correlation coefficient, based on https://gitlab.com/-/snippets/1730605.

cox_compare(loess1, loess2)

Cox compare test on two models.

create_estimator(classifier[, params])

Creates a model object of the provided type and populates it with desired parameters.

cross_validate_subsample(adata, augur_mode, ...)

Cross validate subsample anndata object.

draw_subsample(adata, augur_mode, ...)

Subsample and select random features of anndata object.

load(input[, meta, label_col, ...])

Loads the input data.

predict(adata[, n_subsamples, ...])

Calculates the Area under the Curve using the given classifier.

predict_differential_prioritization(...[, ...])

Predicts the differential prioritization by performing permutation tests on samples.

run_cross_validation(subsample, ...)

Perform cross validation on given subsample.

sample(adata, categorical, subsample_size, ...)

Sample AnnData observations.

select_highly_variable(adata)

Feature selection by variance using scanpy highly variable genes function.

select_variance(adata, var_quantile, ...[, span])

Feature selection based on Augur implementation.

set_scorer(multiclass, zero_division)

Set scoring functions for cross-validation based on estimator.

plot_dp_scatter(results[, top_n, ...])

Plot scatterplot of differential prioritization.

plot_important_features(data[, key, top_n, ...])

Plot a lollipop plot of the n features with largest feature importances.

plot_lollipop(data[, key, return_fig, ax, ...])

Plot a lollipop plot of the mean augur values.

plot_scatterplot(results1, results2[, ...])

Create scatterplot with two augur results.

Methods

average_metrics

Augur.average_metrics(cell_cv_results)

Calculate the average of each metric across the cross-validation runs of one cell type.

Parameters:

cell_cv_results (list[Any]) – list of all cross validation runs of one cell type

Return type:

dict[Any, Any]

Returns:

Dict containing the average result for each metric of one cell type
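
As a rough illustration of the averaging step, the sketch below assumes each cross-validation run is a flat dict mapping metric names to numbers. That layout and the helper name average_metric_dicts are assumptions made for this example, not pertpy's internal data structure.

```python
from __future__ import annotations

from statistics import mean


def average_metric_dicts(cell_cv_results: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric over all runs (hypothetical flat layout)."""
    if not cell_cv_results:
        return {}
    # Only average metrics that are reported by every run.
    shared_keys = set(cell_cv_results[0]).intersection(*cell_cv_results[1:])
    return {key: mean(run[key] for run in cell_cv_results) for key in sorted(shared_keys)}


# Two made-up cross-validation runs for one cell type.
runs = [
    {"auc": 0.5, "accuracy": 0.75},
    {"auc": 1.0, "accuracy": 0.25},
]
averaged = average_metric_dicts(runs)
```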

ccc_score

Augur.ccc_score(y_true, y_pred)

Implementation of Lin’s Concordance correlation coefficient, based on https://gitlab.com/-/snippets/1730605.

Parameters:
  • y_true – array-like of shape (n_samples), ground truth (correct) target values

  • y_pred – array-like of shape (n_samples), estimated target values

Return type:

float

Returns:

Concordance correlation coefficient.
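
For reference, Lin's coefficient is 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The standalone sketch below evaluates that formula with population (1/n) moments; the function name lin_ccc is hypothetical, and the code is not taken from pertpy or the linked snippet.

```python
from statistics import fmean


def lin_ccc(y_true: list, y_pred: list) -> float:
    """Lin's concordance correlation coefficient with population (1/n) moments."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true, mean_pred = fmean(y_true), fmean(y_pred)
    var_true = fmean([(t - mean_true) ** 2 for t in y_true])
    var_pred = fmean([(p - mean_pred) ** 2 for p in y_pred])
    covariance = fmean(
        [(t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred)]
    )
    # Undefined (division by zero) when both inputs are the same constant.
    return 2 * covariance / (var_true + var_pred + (mean_true - mean_pred) ** 2)


perfect = lin_ccc([1, 2, 3], [1, 2, 3])   # identical values
shifted = lin_ccc([1, 2, 3], [2, 3, 4])   # same trend, constant offset
reversed_ = lin_ccc([1, 2, 3], [3, 2, 1])  # opposite trend
```

Unlike Pearson correlation, the coefficient penalizes the constant offset in the second call, so it falls below 1 even though the two series are perfectly correlated.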

cox_compare

Augur.cox_compare(loess1, loess2)

Cox compare test on two models.

Based on: https://www.statsmodels.org/dev/generated/statsmodels.stats.diagnostic.compare_cox.html

Info: Tests of non-nested hypotheses might not provide unambiguous answers. The test should be performed in both directions, and it is possible that both or neither test rejects.

Parameters:
  • loess1 – fitted loess regression object

  • loess2 – fitted loess regression object

Returns:

The t-statistic for the test that including the fitted values of the first model in the second model has no effect, and the two-sided p-value for that t-statistic.

create_estimator

Augur.create_estimator(classifier, params=None)

Creates a model object of the provided type and populates it with desired parameters.

Parameters:
  • classifier (Union[Literal['random_forest_classifier'], Literal['random_forest_regressor'], Literal['logistic_regression_classifier']]) – classifier to use in calculating the area under the curve. Either random forest classifier or logistic regression for categorical data, or random forest regressor for continuous data

  • params (Params | None) – parameters used to populate the model object. Default values are n_estimators = 100, max_depth = None, max_features = 2, penalty = l2, random_state = None.

Return type:

RandomForestClassifier | RandomForestRegressor | LogisticRegression

Returns:

Estimator object.

Examples

>>> import pertpy as pt
>>> augur = pt.tl.Augur("random_forest_classifier")
>>> estimator = augur.create_estimator("logistic_regression_classifier")
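
To make the default handling concrete, the sketch below overlays user-supplied values on the defaults listed in the params description. It uses a plain dict as a stand-in for pertpy's Params object, and the helper name resolve_params is hypothetical; how pertpy merges parameters internally is not shown on this page.

```python
from __future__ import annotations

from typing import Any

# Defaults as listed in the parameter description above.
DEFAULT_PARAMS: dict[str, Any] = {
    "n_estimators": 100,
    "max_depth": None,
    "max_features": 2,
    "penalty": "l2",
    "random_state": None,
}


def resolve_params(params: dict[str, Any] | None = None) -> dict[str, Any]:
    """Overlay user-supplied values on the defaults; reject unknown keys."""
    params = params or {}
    unknown = set(params) - set(DEFAULT_PARAMS)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    # Later entries win, so user values replace the defaults.
    return {**DEFAULT_PARAMS, **params}


resolved = resolve_params({"n_estimators": 200, "random_state": 42})
```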

cross_validate_subsample

Augur.cross_validate_subsample(adata, augur_mode, subsample_size, folds, feature_perc, subsample_idx, random_state, zero_division)

Cross validate subsample anndata object.

Parameters:
  • adata (AnnData) – AnnData with obs columns label and cell_type holding the condition label and cell type, plus the dummy variable y_ columns used as target

  • augur_mode (str) – one of default, velocity or permute. Setting augur_mode = “velocity” disables feature selection, assuming feature selection has been performed by the RNA velocity procedure to produce the input matrix, while setting augur_mode = “permute” will generate a null distribution of AUCs for each cell type by permuting the labels

  • subsample_size (int) – number of cells to subsample randomly per type from each experimental condition

  • folds (int) – number of folds to run cross validation on

  • feature_perc (float) – proportion of genes that are randomly selected as features for input to the classifier in each subsample using the random gene filter

  • subsample_idx (int) – index of the subsample

  • random_state (int | None) – set numpy random seed, sampling seed and fold seed

  • zero_division (int | str) – 0 or 1 or warn; Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised. Precision metric parameter.

Return type:

dict

Returns:

Results for each cross validation fold.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> ag_rfc.select_highly_variable(loaded_data)
>>> results = ag_rfc.cross_validate_subsample(
...     loaded_data, augur_mode="default", subsample_size=20, folds=3,
...     feature_perc=0.5, subsample_idx=0, random_state=42, zero_division=0
... )

draw_subsample

Augur.draw_subsample(adata, augur_mode, subsample_size, feature_perc, categorical, random_state)

Subsample and select random features of anndata object.

Parameters:
  • adata (AnnData) – AnnData with obs columns label and cell_type holding the condition label and cell type, plus the dummy variable y_ columns used as target

  • augur_mode (str) – one of default, velocity or permute. Setting augur_mode = “velocity” disables feature selection, assuming feature selection has been performed by the RNA velocity procedure to produce the input matrix, while setting augur_mode = “permute” will generate a null distribution of AUCs for each cell type by permuting the labels

  • subsample_size (int) – number of cells to subsample randomly per type from each experimental condition

  • feature_perc (float) – proportion of genes that are randomly selected as features for input to the classifier in each subsample using the random gene filter

  • categorical (bool) – True if target values are categorical

  • random_state (int) – set numpy random seed and sampling seed

Return type:

AnnData

Returns:

Subsample of anndata object of size subsample_size

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> ag_rfc.select_highly_variable(loaded_data)
>>> subsample = ag_rfc.draw_subsample(
...     loaded_data, augur_mode="default", subsample_size=20, feature_perc=0.5,
...     categorical=True, random_state=42
... )
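
The two steps described above (drawing subsample_size cells per condition and keeping a random feature_perc share of the genes) can be sketched without AnnData. Everything in this example is an assumption for illustration: the function name, the plain dict and list inputs, and the choice to cap the draw when a condition has fewer cells than requested.

```python
from __future__ import annotations

import random


def draw_subsample_sketch(
    labels: dict[str, str],
    genes: list[str],
    subsample_size: int,
    feature_perc: float,
    random_state: int | None = None,
) -> tuple[list[str], list[str]]:
    """Pick `subsample_size` cells per condition label and a random gene subset."""
    rng = random.Random(random_state)
    # Group cell identifiers by condition label.
    by_label: dict[str, list[str]] = {}
    for cell, label in labels.items():
        by_label.setdefault(label, []).append(cell)
    cells: list[str] = []
    for label in sorted(by_label):
        group = sorted(by_label[label])
        # Sample without replacement; cap at the group size (an assumption).
        cells.extend(rng.sample(group, min(subsample_size, len(group))))
    # Keep a random share of the genes as features, at least one.
    n_features = max(1, int(len(genes) * feature_perc))
    features = rng.sample(sorted(genes), n_features)
    return cells, features


# 60 made-up cells of one cell type, split evenly across two conditions.
toy_labels = {f"cell{i}": ("ctrl" if i % 2 == 0 else "treated") for i in range(60)}
toy_genes = [f"gene{i}" for i in range(10)]
cells, features = draw_subsample_sketch(toy_labels, toy_genes, 20, 0.5, random_state=42)
```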

load

Augur.load(input, meta=None, label_col='label_col', cell_type_col='cell_type_col', condition_label=None, treatment_label=None)

Loads the input data.

Parameters:
  • input (AnnData | DataFrame) – Anndata or matrix containing gene expression values (genes in rows, cells in columns) and optionally meta data about each cell.

  • meta (DataFrame | None) – Optional Pandas DataFrame containing meta data about each cell.

  • label_col (str) – column of the meta DataFrame or the Anndata or matrix containing the condition labels for each cell in the cell-by-gene expression matrix

  • cell_type_col (str) – column of the meta DataFrame or the Anndata or matrix containing the cell type labels for each cell in the cell-by-gene expression matrix

  • condition_label (str | None) – in the case of more than two labels, this label is used in the analysis

  • treatment_label (str | None) – in the case of more than two labels, this label is used in the analysis

Return type:

AnnData

Returns:

Anndata object containing gene expression values (cells in rows, genes in columns) and cell type, label and y dummy variables as obs

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)

plot_dp_scatter

Augur.plot_dp_scatter(results, top_n=None, return_fig=None, ax=None, show=None, save=None)

Plot scatterplot of differential prioritization.

Parameters:
  • results (DataFrame) – Results after running differential prioritization.

  • top_n (int) – optionally, the number of top prioritized cell types to label in the plot

  • ax (Axes) – optionally, axes used to draw plot

Return type:

Axes | Figure | None

Returns:

Axes of the plot.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.bhattacherjee()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> data_15 = ag_rfc.load(adata, condition_label="Maintenance_Cocaine", treatment_label="withdraw_15d_Cocaine")
>>> adata_15, results_15 = ag_rfc.predict(data_15, random_state=None, n_threads=4)
>>> adata_15_permute, results_15_permute = ag_rfc.predict(data_15, augur_mode="permute", n_subsamples=100, random_state=None, n_threads=4)
>>> data_48 = ag_rfc.load(adata, condition_label="Maintenance_Cocaine", treatment_label="withdraw_48h_Cocaine")
>>> adata_48, results_48 = ag_rfc.predict(data_48, random_state=None, n_threads=4)
>>> adata_48_permute, results_48_permute = ag_rfc.predict(data_48, augur_mode="permute", n_subsamples=100, random_state=None, n_threads=4)
>>> pvals = ag_rfc.predict_differential_prioritization(
...     augur_results1=results_15, augur_results2=results_48,
...     permuted_results1=results_15_permute, permuted_results2=results_48_permute
... )
>>> ag_rfc.plot_dp_scatter(pvals)
Preview:
../../_images/augur_dp_scatter.png

plot_important_features

Augur.plot_important_features(data, key='augurpy_results', top_n=10, return_fig=None, ax=None, show=None, save=None)

Plot a lollipop plot of the n features with largest feature importances.

Parameters:
  • data – results after running predict(), either as a dictionary or as the AnnData object.

  • key (str) – Key in the AnnData object of the results

  • top_n (int) – number of feature importance values to plot. Default is 10.

  • ax (Axes) – optionally, axes used to draw plot

  • return_fig – if True, returns the figure of the plot, default is False

Return type:

Axes | None

Returns:

Axes of the plot.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> v_adata, v_results = ag_rfc.predict(
...     loaded_data, subsample_size=20, select_variance_features=True, n_threads=4
... )
>>> ag_rfc.plot_important_features(v_results)
Preview:
../../_images/augur_important_features.png

plot_lollipop

Augur.plot_lollipop(data, key='augurpy_results', return_fig=None, ax=None, show=None, save=None)

Plot a lollipop plot of the mean augur values.

Parameters:
  • data – results after running predict(), either as a dictionary or as the AnnData object.

  • key (str) – Key in the AnnData object of the results

  • ax (Axes) – optionally, axes used to draw plot

  • return_fig – if True, returns the figure of the plot

Return type:

Axes | Figure | None

Returns:

Axes of the plot.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> v_adata, v_results = ag_rfc.predict(
...     loaded_data, subsample_size=20, select_variance_features=True, n_threads=4
... )
>>> ag_rfc.plot_lollipop(v_results)
Preview:
../../_images/augur_lollipop.png

plot_scatterplot

Augur.plot_scatterplot(results1, results2, top_n=None, return_fig=None, show=None, save=None)

Create scatterplot with two augur results.

Parameters:
  • results1 (dict[str, Any]) – results after running predict()

  • results2 (dict[str, Any]) – results after running predict()

  • top_n (int) – optionally, the number of top prioritized cell types to label in the plot

  • return_fig – if True, returns the figure of the plot

Return type:

Axes | Figure | None

Returns:

Axes of the plot.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> h_adata, h_results = ag_rfc.predict(loaded_data, subsample_size=20, n_threads=4)
>>> v_adata, v_results = ag_rfc.predict(
...     loaded_data, subsample_size=20, select_variance_features=True, n_threads=4
... )
>>> ag_rfc.plot_scatterplot(v_results, h_results)
Preview:
../../_images/augur_scatterplot.png

predict

Augur.predict(adata, n_subsamples=50, subsample_size=20, folds=3, min_cells=None, feature_perc=0.5, var_quantile=0.5, span=0.75, filter_negative_residuals=False, n_threads=4, augur_mode='default', select_variance_features=True, key_added='augurpy_results', random_state=None, zero_division=0)

Calculates the Area under the Curve using the given classifier.

Parameters:
  • adata (AnnData) – AnnData with obs columns label and cell_type holding the condition label and cell type, plus the dummy variable y_ columns used as target

  • n_subsamples (int) – number of random subsamples to draw from complete dataset for each cell type

  • subsample_size (int) – number of cells to subsample randomly per type from each experimental condition

  • folds (int) – number of folds to run cross validation on. Be careful changing this parameter without also changing subsample_size.

  • min_cells (int) – minimum number of cells for a particular cell type in each condition in order to retain that type for analysis (deprecated)

  • feature_perc (float) – proportion of genes that are randomly selected as features for input to the classifier in each subsample using the random gene filter

  • var_quantile (float) – The quantile below which features will be filtered, based on their residuals in a loess model. Defaults to 0.5.

  • span (float) – Smoothing factor, as a fraction of the number of points to take into account. Should be in the range (0, 1]. Defaults to 0.75.

  • filter_negative_residuals (bool) – if True, filter residuals at a fixed threshold of zero, instead of var_quantile

  • n_threads (int) – number of threads to use for parallelization

  • select_variance_features (bool) – Whether to select genes based on the original Augur implementation (True) or using scanpy’s highly_variable_genes (False). Defaults to True.

  • key_added (str) – Key to add results to in .uns

  • augur_mode (Union[Literal['permute'], Literal['default'], Literal['velocity']]) – One of ‘default’, ‘velocity’ or ‘permute’. Setting augur_mode = “velocity” disables feature selection, assuming feature selection has been performed by the RNA velocity procedure to produce the input matrix, while setting augur_mode = “permute” will generate a null distribution of AUCs for each cell type by permuting the labels. Note that when setting augur_mode = “permute”, n_subsamples values less than 100 will be set to 500.

  • random_state (int | None) – set numpy random seed, sampling seed and fold seed

  • zero_division (int | str) – 0 or 1 or warn; Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised. Precision metric parameter.

Return type:

tuple[AnnData, dict[str, Any]]

Returns:

A tuple of the updated AnnData object, with mean_augur_score metrics in obs, and a dictionary containing the following keys:

  • summary_metrics: Pandas DataFrame containing mean metrics for each cell type

  • feature_importances: Pandas DataFrame containing feature importances of genes across all cross validation runs

  • full_results: Dict containing merged results of individual cross validation runs for each cell type

  • [cell_types]: Cross validation runs of the cell type called

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> h_adata, h_results = ag_rfc.predict(loaded_data, subsample_size=20, n_threads=4)
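
For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below evaluates that textbook definition directly; it is independent of how pertpy computes the score internally, and the function name roc_auc is hypothetical.

```python
def roc_auc(y_true: list, scores: list) -> float:
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
```

An AUC near 0.5 means the classifier cannot separate the two conditions for that cell type, while values near 1 indicate a strong, separable response.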

predict_differential_prioritization

Augur.predict_differential_prioritization(augur_results1, augur_results2, permuted_results1, permuted_results2, n_subsamples=50, n_permutations=1000)

Predicts the differential prioritization by performing permutation tests on samples.

Performs permutation tests that identify cell types with statistically significant differences in augur_score between two conditions, each compared to the control.

Parameters:
  • augur_results1 – Augurpy results from condition 1, obtained from predict()[1]

  • augur_results2 – Augurpy results from condition 2, obtained from predict()[1]

  • permuted_results1 – permuted Augurpy results from condition 1, obtained from predict() with argument augur_mode=permute

  • permuted_results2 – permuted Augurpy results from condition 2, obtained from predict() with argument augur_mode=permute

  • n_subsamples (int) – number of subsamples to pool when calculating the mean augur score for each permutation; Defaults to 50.

  • n_permutations (int) – the total number of mean augur scores to calculate from a background distribution

Return type:

DataFrame

Returns:

Results object containing mean augur scores.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.bhattacherjee()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> data_15 = ag_rfc.load(adata, condition_label="Maintenance_Cocaine", treatment_label="withdraw_15d_Cocaine")
>>> adata_15, results_15 = ag_rfc.predict(data_15, random_state=None, n_threads=4)
>>> adata_15_permute, results_15_permute = ag_rfc.predict(data_15, augur_mode="permute", n_subsamples=100, random_state=None, n_threads=4)
>>> data_48 = ag_rfc.load(adata, condition_label="Maintenance_Cocaine", treatment_label="withdraw_48h_Cocaine")
>>> adata_48, results_48 = ag_rfc.predict(data_48, random_state=None, n_threads=4)
>>> adata_48_permute, results_48_permute = ag_rfc.predict(data_48, augur_mode="permute", n_subsamples=100, random_state=None, n_threads=4)
>>> pvals = ag_rfc.predict_differential_prioritization(
...     augur_results1=results_15, augur_results2=results_48,
...     permuted_results1=results_15_permute, permuted_results2=results_48_permute
... )
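
The role of n_subsamples and n_permutations can be illustrated with a generic permutation test for a single cell type. This is a sketch under stated assumptions rather than pertpy's procedure: the function name, drawing with replacement, the two-sided comparison and the add-one correction are all choices made for this example.

```python
from __future__ import annotations

import random
from statistics import fmean


def permutation_pvalue(
    observed_delta: float,
    permuted1: list[float],
    permuted2: list[float],
    n_subsamples: int = 50,
    n_permutations: int = 1000,
    random_state: int | None = None,
) -> float:
    """Two-sided p-value of an observed AUC difference against a permuted background."""
    rng = random.Random(random_state)
    null_deltas = []
    for _ in range(n_permutations):
        # Pool n_subsamples permuted scores per condition into one mean score.
        mean1 = fmean(rng.choices(permuted1, k=n_subsamples))
        mean2 = fmean(rng.choices(permuted2, k=n_subsamples))
        null_deltas.append(mean2 - mean1)
    extreme = sum(abs(delta) >= abs(observed_delta) for delta in null_deltas)
    # The add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Degenerate background where every permuted score is 0.5, so all null deltas are 0.
background = [0.5] * 100
p_large = permutation_pvalue(0.2, background, background, random_state=0)
p_zero = permutation_pvalue(0.0, background, background, random_state=0)
```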

run_cross_validation

Augur.run_cross_validation(subsample, subsample_idx, folds, random_state, zero_division)

Perform cross validation on given subsample.

Parameters:
  • subsample (AnnData) – subsample of gene expression matrix of size subsample_size

  • estimator – classifier object to use in calculating the area under the curve

  • subsample_idx (int) – index of subsample

  • folds (int) – number of folds

  • random_state (int | None) – set random fold seed

  • zero_division (int | str) – 0 or 1 or warn; Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised. Precision metric parameter.

Return type:

dict

Returns:

Dictionary containing prediction metrics and estimator for each fold.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> ag_rfc.select_highly_variable(loaded_data)
>>> subsample = ag_rfc.draw_subsample(
...     loaded_data, augur_mode="default", subsample_size=20, feature_perc=0.5,
...     categorical=True, random_state=42
... )
>>> results = ag_rfc.run_cross_validation(subsample=subsample, folds=3, subsample_idx=0, random_state=42, zero_division=0)
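
To show how folds and random_state interact, here is a minimal seeded k-fold split over sample indices. It is a plain, non-stratified sketch with a hypothetical function name; whether pertpy stratifies its folds or delegates the split to scikit-learn is not stated on this page.

```python
from __future__ import annotations

import random


def kfold_indices(
    n_samples: int, folds: int, random_state: int | None = None
) -> list[tuple[list[int], list[int]]]:
    """Return (train, test) index lists for a shuffled, non-stratified k-fold split."""
    if not 2 <= folds <= n_samples:
        raise ValueError("folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    splits = []
    for fold in range(folds):
        # Every `folds`-th shuffled index goes to this fold's test set.
        test = sorted(indices[fold::folds])
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        splits.append((train, test))
    return splits


# 20 cells from each of two conditions gives 40 samples per subsample.
splits = kfold_indices(40, 3, random_state=42)
fold_sizes = [len(test) for _, test in splits]
```

With 40 cells and 3 folds each test fold holds only 13 or 14 cells, which is one reason to adjust subsample_size together with folds.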

sample

Augur.sample(adata, categorical, subsample_size, random_state, features)

Sample AnnData observations.

Parameters:
  • adata (AnnData) – AnnData with obs columns label and cell_type holding the condition label and cell type, plus the dummy variable y_ columns used as target

  • categorical (bool) – True if target values are categorical

  • subsample_size (int) – number of cells to subsample randomly per type from each experimental condition

  • random_state (int) – set numpy random seed and sampling seed

  • features (list) – features to keep in the returned AnnData object

Returns:

Subsample of AnnData object of size subsample_size with given features

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> ag_rfc.select_highly_variable(loaded_data)
>>> features = loaded_data.var_names
>>> subsample = ag_rfc.sample(
...     loaded_data, categorical=True, subsample_size=20, random_state=42, features=features
... )

select_highly_variable

Augur.select_highly_variable(adata)

Feature selection by variance using scanpy highly variable genes function.

Parameters:

adata (AnnData) – Anndata object containing gene expression values (cells in rows, genes in columns)

Return type:

AnnData

Returns:

Anndata object with highly variable genes added as layer

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> ag_rfc.select_highly_variable(loaded_data)

select_variance

Augur.select_variance(adata, var_quantile, filter_negative_residuals, span=0.75)

Feature selection based on Augur implementation.

Parameters:
  • adata (AnnData) – Anndata object

  • var_quantile (float) – The quantile below which features will be filtered, based on their residuals in a loess model.

  • filter_negative_residuals (bool) – if True, filter residuals at a fixed threshold of zero, instead of var_quantile

  • span (float) – Smoothing factor, as a fraction of the number of points to take into account. Should be in the range (0, 1]. Defaults to 0.75

Returns:

AnnData object with additional select_variance column in var.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> ag_rfc.select_variance(loaded_data, var_quantile=0.5, filter_negative_residuals=False, span=0.75)
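
The effect of var_quantile and filter_negative_residuals can be sketched on precomputed residuals. The loess fit itself is not reproduced here; the residual values, the function names and the choice to keep features exactly at the threshold are assumptions for illustration.

```python
from __future__ import annotations


def quantile(values: list[float], q: float) -> float:
    """Quantile with linear interpolation between neighboring order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def select_by_residual(
    residuals: dict[str, float],
    var_quantile: float,
    filter_negative_residuals: bool = False,
) -> list[str]:
    """Keep genes whose residual is at or above the chosen threshold."""
    if filter_negative_residuals:
        threshold = 0.0  # fixed threshold replaces the quantile
    else:
        threshold = quantile(list(residuals.values()), var_quantile)
    return [gene for gene, value in residuals.items() if value >= threshold]


# Made-up residuals of five genes around a fitted trend.
toy_residuals = {"g1": -0.4, "g2": -0.1, "g3": 0.2, "g4": 0.3, "g5": 0.9}
by_quantile = select_by_residual(toy_residuals, var_quantile=0.75)
by_sign = select_by_residual(toy_residuals, var_quantile=0.75, filter_negative_residuals=True)
```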

set_scorer

Augur.set_scorer(multiclass, zero_division)

Set scoring functions for cross-validation based on estimator.

Parameters:
  • multiclass (bool) – True if there are more than two target classes

  • zero_division (int | str) – 0 or 1 or warn; Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised. Precision metric parameter.

Return type:

dict[str, Any]

Returns:

Dict linking name to scorer object and string name

Examples

>>> import pertpy as pt
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> scorer = ag_rfc.set_scorer(True, 0)
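
The zero_division behavior described above can be illustrated with a hand-written binary precision. This is a sketch of the semantics only, with a hypothetical function name; it is not the scorer object that set_scorer returns.

```python
from __future__ import annotations

import warnings


def precision(y_true: list[int], y_pred: list[int], zero_division: int | str = 0) -> float:
    """Binary precision with an explicit fallback when nothing is predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(1 for p in y_pred if p == 1)
    if predicted_pos == 0:
        if zero_division == "warn":
            # "warn" behaves like 0 but also raises a warning.
            warnings.warn("no positive predictions; precision set to 0", stacklevel=2)
            return 0.0
        return float(zero_division)
    return true_pos / predicted_pos


regular = precision([1, 0, 1], [1, 1, 0])                     # one of two positive calls is right
fallback = precision([1, 0, 1], [0, 0, 0], zero_division=1)  # no positive calls at all
```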