pertpy.tools.Augur

pertpy.tools.Augur#

class Augur(estimator, n_estimators=100, max_depth=None, max_features=2, penalty='l2', random_state=None)[source]#: Python implementation of Augur.

Methods table#

`average_metrics`(cell_cv_results)	Calculate the average of each metric across the cross validation runs of one cell type.
`ccc_score`(y_true, y_pred)	Implementation of Lin's Concordance correlation coefficient, based on -/snippets/1730605.
`cox_compare`(loess1, loess2)	Cox compare test on two models.
`create_estimator`(classifier, *[, ...])	Creates a model object of the provided type and populates it with desired parameters.
`cross_validate_subsample`(adata, *, ...)	Cross validate subsample anndata object.
`draw_subsample`(adata, *, augur_mode, ...)	Subsample and select random features of anndata object.
`load`(input, *[, layer, meta, label_col, ...])	Loads the input data.
`plot_dp_scatter`(results, *[, top_n, ax, ...])	Plot scatterplot of differential prioritization.
`plot_important_features`(data, *[, key, ...])	Plot a lollipop plot of the n features with largest feature importances.
`plot_lollipop`(data, *[, key, ax, return_fig])	Plot a lollipop plot of the mean augur values.
`plot_scatterplot`(results1, results2, *[, ...])	Create scatterplot with two augur results.
`predict`(adata, *[, n_subsamples, ...])	Calculates the Area under the Curve using the given classifier.
`predict_differential_prioritization`(...[, ...])	Predicts the differential prioritization by performing permutation tests on samples.
`run_cross_validation`(subsample, *, ...)	Perform cross validation on given subsample.
`sample`(adata, categorical, subsample_size, ...)	Sample AnnData observations.
`select_highly_variable`(adata)	Feature selection by variance using scanpy highly variable genes function.
`select_variance`(adata, *, var_quantile, ...)	Feature selection based on Augur implementation.
`set_scorer`(multiclass, zero_division)	Set scoring functions for cross-validation based on estimator.

Methods#

Augur.average_metrics(cell_cv_results)[source]#

Calculate the average of each metric across the cross validation runs of one cell type.

Parameters:: cell_cv_results (list[Any]) – list of all cross validation runs of one cell type
Return type:: dict[Any, Any]
Returns:: Dict containing the average result for each metric of one cell type
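
The averaging step can be illustrated with a minimal sketch. The input layout (a list of dicts mapping metric names to values) and the helper name `average_metric_dicts` are assumptions made for illustration. They are not pertpy's internal result format.

```python
from statistics import mean


def average_metric_dicts(cv_runs):
    """Average each metric over a list of {metric_name: value} dicts (hypothetical layout)."""
    metric_names = cv_runs[0].keys()
    return {name: mean(run[name] for run in cv_runs) for name in metric_names}


# Two made-up cross validation runs for one cell type.
runs = [
    {"auc": 0.75, "accuracy": 0.5},
    {"auc": 0.25, "accuracy": 1.0},
]
averaged = average_metric_dicts(runs)
```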

Augur.ccc_score(y_true, y_pred)[source]#

Implementation of Lin’s Concordance correlation coefficient, based on -/snippets/1730605.

Parameters:

y_true (ndarray) – array-like of shape (n_samples), ground truth (correct) target values
y_pred (ndarray) – array-like of shape (n_samples), estimated target values

Return type:

float

Returns:

Concordance correlation coefficient.
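
For orientation, Lin's coefficient is commonly written as 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The sketch below computes it in plain Python with population (1/n) moments. Whether `ccc_score` uses population or sample moments is not stated here, so treat that choice, and the helper name `lin_ccc`, as assumptions.

```python
from statistics import fmean


def lin_ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient using population (1/n) moments."""
    n = len(y_true)
    mean_true, mean_pred = fmean(y_true), fmean(y_pred)
    var_true = sum((t - mean_true) ** 2 for t in y_true) / n
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred) / n
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (var_true + var_pred + (mean_true - mean_pred) ** 2)


# Identical vectors agree perfectly; a constant shift lowers the coefficient
# even though the Pearson correlation would stay at 1.
perfect = lin_ccc([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
shifted = lin_ccc([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
```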

Augur.cox_compare(loess1, loess2)[source]#

Cox compare test on two models.

Based on: https://www.statsmodels.org/dev/generated/statsmodels.stats.diagnostic.compare_cox.html

Info: Tests of non-nested hypotheses might not provide unambiguous answers. The test should be performed in both directions, and it is possible that both or neither test rejects.

Parameters:

loess1 – fitted loess regression object
loess2 – fitted loess regression object

Returns:

t-statistic for the test that including the fitted values of the first model in the second model has no effect and two-sided pvalue for the t-statistic

Augur.create_estimator(classifier, *, n_estimators=100, max_depth=None, max_features=2, penalty='l2', random_state=None)[source]#

Creates a model object of the provided type and populates it with desired parameters.

Parameters:

classifier (Literal['random_forest_classifier', 'random_forest_regressor', 'logistic_regression_classifier']) – classifier to use in calculating the area under the curve. Either random forest classifier or logistic regression for categorical data, or random forest regressor for continuous data
n_estimators (int, default: 100) – Number of trees in the forest.
max_depth (int | None, default: None) – Maximal depth of each tree.
max_features (Union[Literal['auto', 'log2', 'sqrt'], int, float], default: 2) –
Maximal number of features considered when looking at best split.
- if int then consider max_features for each split
- if float consider round(max_features*n_features)
- if auto then max_features=n_features
- if log2 then max_features=log2(n_features)
- if sqrt then max_features=sqrt(n_features)
penalty (Literal['l1', 'l2', 'elasticnet', 'none'], default: 'l2') –
Norm of the penalty used in logistic regression
- if l1 then L1 penalty is added
- if l2 then L2 penalty is added (default)
- if elasticnet both L1 and L2 penalties are added
- if none no penalty is added
random_state (int | None, default: None) – Random model seed.

Return type:

RandomForestClassifier | RandomForestRegressor | LogisticRegression

Examples

>>> import pertpy as pt
>>> augur = pt.tl.Augur("random_forest_classifier")
>>> estimator = augur.create_estimator("logistic_regression_classifier")
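
The `max_features` rules listed above can be expressed as a small helper that turns a setting into a feature count. This is a sketch of the documented rules only. The name `resolve_max_features`, the truncation of log2 and sqrt to integers, and the floor of one feature are assumptions rather than confirmed pertpy or scikit-learn behaviour.

```python
import math


def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into the number of features tried per split."""
    if max_features == "auto":
        return n_features
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))  # truncation is an assumption
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))  # truncation is an assumption
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        return max(1, round(max_features * n_features))
    raise ValueError(f"Unsupported max_features: {max_features!r}")


default_count = resolve_max_features(2, 100)      # int: used as given
half_count = resolve_max_features(0.5, 100)       # float: fraction of features
log2_count = resolve_max_features("log2", 64)
sqrt_count = resolve_max_features("sqrt", 100)
```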

Augur.cross_validate_subsample(adata, *, augur_mode, subsample_size, folds, feature_perc, subsample_idx, random_state, zero_division)[source]#

Cross validate subsample anndata object.

Parameters:

adata (AnnData) – Anndata with obs label and cell_type for label and cell type and dummy variable y_ columns used as target
augur_mode (str) – one of default, velocity or permute. Setting augur_mode = “velocity” disables feature selection, assuming feature selection has been performed by the RNA velocity procedure to produce the input matrix, while setting augur_mode = “permute” will generate a null distribution of AUCs for each cell type by permuting the labels
subsample_size (int) – number of cells to subsample randomly per type from each experimental condition
folds (int) – number of folds to run cross validation on
feature_perc (float) – proportion of genes that are randomly selected as features for input to the classifier in each subsample using the random gene filter
subsample_idx (int) – index of the subsample
random_state (int | None) – set numpy random seed, sampling seed and fold seed
zero_division (int | str) – 0 or 1 or warn; Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised. Precision metric parameter.

Return type:

dict

Returns:

Results for each cross validation fold.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> ag_rfc.select_highly_variable(loaded_data)
>>> results = ag_rfc.cross_validate_subsample(
...     loaded_data, augur_mode="default", subsample_size=20, folds=3,
...     feature_perc=0.5, subsample_idx=0, random_state=42, zero_division=0
... )

Augur.draw_subsample(adata, *, augur_mode, subsample_size, feature_perc, categorical, random_state)[source]#

Subsample and select random features of anndata object.

Parameters:

adata (AnnData) – Anndata with obs label and cell_type for label and cell type and dummy variable y_ columns used as target
augur_mode (str) – one of default, velocity or permute. Setting augur_mode = “velocity” disables feature selection, assuming feature selection has been performed by the RNA velocity procedure to produce the input matrix, while setting augur_mode = “permute” will generate a null distribution of AUCs for each cell type by permuting the labels
subsample_size (int) – number of cells to subsample randomly per type from each experimental condition
feature_perc (float) – proportion of genes that are randomly selected as features for input to the classifier in each subsample using the random gene filter
categorical (bool) – True if target values are categorical
random_state (int) – set numpy random seed and sampling seed

Return type:

AnnData

Returns:

Subsample of anndata object of size subsample_size

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> ag_rfc.select_highly_variable(loaded_data)
>>> subsample = ag_rfc.draw_subsample(
...     adata, augur_mode="default", subsample_size=20, feature_perc=0.5, categorical=True, random_state=42
... )
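
The roles of `subsample_size` and `feature_perc` can be pictured with a plain-Python sketch: draw a fixed number of cells from each condition, then keep a random share of the genes. The data layout, the helper name `draw_subsample_sketch`, and sampling without replacement are all assumptions for illustration. The sketch does not reproduce how pertpy handles AnnData objects.

```python
import random


def draw_subsample_sketch(cells_by_condition, genes, subsample_size, feature_perc, random_state):
    """Pick subsample_size cells per condition and a random feature_perc share of genes."""
    rng = random.Random(random_state)
    sampled_cells = {
        condition: rng.sample(cells, subsample_size)  # without replacement (assumption)
        for condition, cells in cells_by_condition.items()
    }
    n_genes = max(1, int(len(genes) * feature_perc))
    sampled_genes = rng.sample(genes, n_genes)
    return sampled_cells, sampled_genes


# Made-up cell and gene identifiers.
cells_by_condition = {
    "control": [f"ctrl_{i}" for i in range(50)],
    "treatment": [f"trt_{i}" for i in range(50)],
}
genes = [f"gene_{i}" for i in range(100)]
cells, features = draw_subsample_sketch(
    cells_by_condition, genes, subsample_size=20, feature_perc=0.5, random_state=42
)
```

Seeding a local `random.Random` instance keeps the draw reproducible without touching global random state.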

Augur.load(input, *, layer=None, meta=None, label_col='label_col', cell_type_col='cell_type_col', condition_label=None, treatment_label=None)[source]#

Loads the input data.

Parameters:

input (AnnData | DataFrame) – Anndata or matrix containing gene expression values (genes in rows, cells in columns) and optionally meta data about each cell.
layer (str | None, default: None) – Layer in AnnData to use for expression data. If None, uses .X
meta (DataFrame | None, default: None) – Optional Pandas DataFrame containing meta data about each cell.
label_col (str, default: 'label_col') – column of the meta DataFrame or the Anndata or matrix containing the condition labels for each cell in the cell-by-gene expression matrix
cell_type_col (str, default: 'cell_type_col') – column of the meta DataFrame or the Anndata or matrix containing the cell type labels for each cell in the cell-by-gene expression matrix
condition_label (str | None, default: None) – if label_col contains more than two labels, the label to use as the condition in the analysis
treatment_label (str | None, default: None) – if label_col contains more than two labels, the label to use as the treatment in the analysis

Return type:

AnnData

Returns:

Anndata object containing gene expression values (cells in rows, genes in columns) and cell type, label and y dummy variables as obs

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> augur_adata = ag_rfc.load(adata)

Augur.plot_dp_scatter(results, *, top_n=None, ax=None, return_fig=False)[source]#

Plot scatterplot of differential prioritization.

Parameters:

results (DataFrame) – Results after running differential prioritization.
top_n (int, default: None) – optionally, the number of top prioritized cell types to label in the plot
ax (Axes, default: None) – optionally, axes used to draw plot
return_fig (bool, default: False) – if True, returns figure of the plot, that can be used for saving.

Return type:

Figure | None

Returns:

If return_fig is True, returns the figure, otherwise None.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.bhattacherjee()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")

>>> data_15 = ag_rfc.load(adata, condition_label="Maintenance_Cocaine", treatment_label="withdraw_15d_Cocaine")
>>> adata_15, results_15 = ag_rfc.predict(data_15, random_state=None, n_threads=4)
>>> adata_15_permute, results_15_permute = ag_rfc.predict(data_15, augur_mode="permute", n_subsamples=100, random_state=None, n_threads=4)

>>> data_48 = ag_rfc.load(adata, condition_label="Maintenance_Cocaine", treatment_label="withdraw_48h_Cocaine")
>>> adata_48, results_48 = ag_rfc.predict(data_48, random_state=None, n_threads=4)
>>> adata_48_permute, results_48_permute = ag_rfc.predict(data_48, augur_mode="permute", n_subsamples=100, random_state=None, n_threads=4)

>>> pvals = ag_rfc.predict_differential_prioritization(
...     augur_results1=results_15, augur_results2=results_48,
...     permuted_results1=results_15_permute, permuted_results2=results_48_permute
... )
>>> ag_rfc.plot_dp_scatter(pvals)

Augur.plot_important_features(data, *, key='augurpy_results', top_n=10, ax=None, return_fig=False)[source]#

Plot a lollipop plot of the n features with largest feature importances.

Parameters:

data (dict[str, Any]) – results after running predict() as dictionary or the AnnData object.
key (str, default: 'augurpy_results') – Key in the AnnData object of the results
top_n (int, default: 10) – number of features with the largest importance values to plot.
ax (Axes, default: None) – optionally, axes used to draw plot
return_fig (bool, default: False) – if True, returns figure of the plot, that can be used for saving.

Return type:

Figure | None

Returns:

If return_fig is True, returns the figure, otherwise None.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> v_adata, v_results = ag_rfc.predict(
...     loaded_data, subsample_size=20, select_variance_features=True, n_threads=4
... )
>>> ag_rfc.plot_important_features(v_results)

Augur.plot_lollipop(data, *, key='augurpy_results', ax=None, return_fig=False)[source]#

Plot a lollipop plot of the mean augur values.

Parameters:

data (dict[str, Any] | AnnData) – results after running predict() as dictionary or the AnnData object.
key (str, default: 'augurpy_results') – .uns key in the results AnnData object.
ax (Axes, default: None) – optionally, axes used to draw plot.
return_fig (bool, default: False) – if True, returns figure of the plot, that can be used for saving.

Return type:

Figure | None

Returns:

If return_fig is True, returns the figure, otherwise None.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> v_adata, v_results = ag_rfc.predict(
...     loaded_data, subsample_size=20, select_variance_features=True, n_threads=4
... )
>>> ag_rfc.plot_lollipop(v_results)

Augur.plot_scatterplot(results1, results2, *, top_n=None, return_fig=False)[source]#

Create scatterplot with two augur results.

Parameters:

results1 (dict[str, Any]) – results after running predict()
results2 (dict[str, Any]) – results after running predict()
top_n (int, default: None) – optionally, the number of top prioritized cell types to label in the plot
return_fig (bool, default: False) – if True, returns figure of the plot, that can be used for saving.

Return type:

Figure | None

Returns:

If return_fig is True, returns the figure, otherwise None.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> h_adata, h_results = ag_rfc.predict(loaded_data, subsample_size=20, n_threads=4)
>>> v_adata, v_results = ag_rfc.predict(
...     loaded_data, subsample_size=20, select_variance_features=True, n_threads=4
... )
>>> ag_rfc.plot_scatterplot(v_results, h_results)

Augur.predict(adata, *, n_subsamples=50, subsample_size=20, folds=3, min_cells=None, feature_perc=0.5, var_quantile=0.5, span=0.75, filter_negative_residuals=False, n_threads=4, augur_mode='default', select_variance_features=True, key_added='augurpy_results', random_state=None, zero_division=0)[source]#

Calculates the Area under the Curve using the given classifier.

Parameters:

adata (AnnData) – Anndata with obs label and cell_type for label and cell type and dummy variable y_ columns used as target
n_subsamples (int, default: 50) – number of random subsamples to draw from complete dataset for each cell type
subsample_size (int, default: 20) – number of cells to subsample randomly per type from each experimental condition
folds (int, default: 3) – number of folds to run cross validation on. Be careful changing this parameter without also changing subsample_size.
min_cells (int, default: None) – minimum number of cells for a particular cell type in each condition in order to retain that type for analysis (deprecated)
feature_perc (float, default: 0.5) – proportion of genes that are randomly selected as features for input to the classifier in each subsample using the random gene filter
var_quantile (float, default: 0.5) – The quantile below which features will be filtered, based on their residuals in a loess model.
span (float, default: 0.75) – Smoothing factor, as a fraction of the number of points to take into account. Should be in the range (0, 1].
filter_negative_residuals (bool, default: False) – if True, filter residuals at a fixed threshold of zero, instead of var_quantile
n_threads (int, default: 4) – number of threads to use for parallelization
select_variance_features (bool, default: True) – Whether to select genes based on the original Augur implementation (True) or using scanpy’s highly_variable_genes (False).
key_added (str, default: 'augurpy_results') – Key to add results to in .uns
augur_mode (Literal['default', 'permute', 'velocity'], default: 'default') – One of ‘default’, ‘velocity’ or ‘permute’. Setting augur_mode = “velocity” disables feature selection, assuming feature selection has been performed by the RNA velocity procedure to produce the input matrix, while setting augur_mode = “permute” will generate a null distribution of AUCs for each cell type by permuting the labels. Note that with augur_mode = “permute”, n_subsamples values less than 100 will be set to 500.
random_state (int | None, default: None) – set numpy random seed, sampling seed and fold seed
zero_division (int | str, default: 0) – 0 or 1 or warn; Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised. Precision metric parameter.

Return type:

tuple[AnnData, dict[str, Any]]

Returns:

A tuple of the updated AnnData object, with mean_augur_score metrics in obs, and a dictionary containing the following keys:

summary_metrics: Pandas DataFrame containing mean metrics for each cell type

feature_importances: Pandas DataFrame containing feature importances of genes across all cross validation runs

full_results: Dict containing merged results of individual cross validation runs for each cell type

[cell_types]: one key per cell type, holding the cross validation runs of that cell type

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> h_adata, h_results = ag_rfc.predict(loaded_data, subsample_size=20, n_threads=4)
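
The area under the ROC curve that `predict` reports can be read as the probability that a randomly chosen positive cell receives a higher score than a randomly chosen negative cell. The sketch below computes that quantity directly from pairs. It is a plain-Python illustration with a hypothetical helper name, not the scorer pertpy uses internally.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC near 0.5 means the classifier cannot separate the two conditions for that cell type, while values near 1 indicate a strong response.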

Augur.predict_differential_prioritization(augur_results1, augur_results2, permuted_results1, permuted_results2, n_subsamples=50, n_permutations=1000)[source]#

Predicts the differential prioritization by performing permutation tests on samples.

Performs permutation tests that identify cell types with statistically significant differences in augur_score between two conditions, each compared to the control.

Parameters:

augur_results1 (dict[str, Any]) – Augurpy results from condition 1, obtained from predict()[1]
augur_results2 (dict[str, Any]) – Augurpy results from condition 2, obtained from predict()[1]
permuted_results1 (dict[str, Any]) – permuted Augurpy results from condition 1, obtained from predict() with argument augur_mode=permute
permuted_results2 (dict[str, Any]) – permuted Augurpy results from condition 2, obtained from predict() with argument augur_mode=permute
n_subsamples (int, default: 50) – number of subsamples to pool when calculating the mean augur score for each permutation.
n_permutations (int, default: 1000) – the total number of mean augur scores to calculate from a background distribution

Return type:

DataFrame

Returns:

Results object containing mean augur scores.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.bhattacherjee()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")

>>> data_15 = ag_rfc.load(adata, condition_label="Maintenance_Cocaine", treatment_label="withdraw_15d_Cocaine")
>>> adata_15, results_15 = ag_rfc.predict(data_15, random_state=None, n_threads=4)
>>> adata_15_permute, results_15_permute = ag_rfc.predict(data_15, augur_mode="permute", n_subsamples=100, random_state=None, n_threads=4)

>>> data_48 = ag_rfc.load(adata, condition_label="Maintenance_Cocaine", treatment_label="withdraw_48h_Cocaine")
>>> adata_48, results_48 = ag_rfc.predict(data_48, random_state=None, n_threads=4)
>>> adata_48_permute, results_48_permute = ag_rfc.predict(data_48, augur_mode="permute", n_subsamples=100, random_state=None, n_threads=4)

>>> pvals = ag_rfc.predict_differential_prioritization(
...     augur_results1=results_15, augur_results2=results_48,
...     permuted_results1=results_15_permute, permuted_results2=results_48_permute
... )
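
The roles of `n_subsamples` and `n_permutations` can be sketched as follows: each permutation pools `n_subsamples` permuted AUCs per condition into a mean, the difference of those means forms one point of the background distribution, and the observed difference is compared against `n_permutations` such points. The resampling with replacement, the two-sided comparison, the +1 correction, and the helper name `permutation_pvalue` are assumptions. pertpy's exact procedure may differ.

```python
import random
from statistics import fmean


def permutation_pvalue(observed_delta, null_aucs1, null_aucs2,
                       n_subsamples=50, n_permutations=1000, seed=0):
    """Two-sided p-value of observed_delta against a resampled null of mean differences."""
    rng = random.Random(seed)
    null_deltas = []
    for _ in range(n_permutations):
        mean1 = fmean(rng.choices(null_aucs1, k=n_subsamples))
        mean2 = fmean(rng.choices(null_aucs2, k=n_subsamples))
        null_deltas.append(mean2 - mean1)
    n_extreme = sum(abs(delta) >= abs(observed_delta) for delta in null_deltas)
    # The +1 terms keep the p-value strictly positive (a common convention).
    return (n_extreme + 1) / (n_permutations + 1)


# Made-up permuted AUCs scattered around chance level.
data_rng = random.Random(1)
null1 = [data_rng.uniform(0.45, 0.55) for _ in range(200)]
null2 = [data_rng.uniform(0.45, 0.55) for _ in range(200)]
p_large_delta = permutation_pvalue(0.2, null1, null2)
p_zero_delta = permutation_pvalue(0.0, null1, null2)
```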

Augur.run_cross_validation(subsample, *, subsample_idx, folds, random_state, zero_division)[source]#

Perform cross validation on given subsample.

Parameters:

subsample (AnnData) – subsample of gene expression matrix of size subsample_size
estimator – classifier object to use in calculating the area under the curve
subsample_idx (int) – index of subsample
folds (int) – number of folds
random_state (int | None) – set random fold seed
zero_division (int | str) – 0 or 1 or warn; Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised. Precision metric parameter.

Return type:

dict

Returns:

Dictionary containing prediction metrics and estimator for each fold.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> ag_rfc.select_highly_variable(loaded_data)
>>> subsample = ag_rfc.draw_subsample(
...     adata, augur_mode="default", subsample_size=20, feature_perc=0.5, categorical=True, random_state=42
... )
>>> results = ag_rfc.run_cross_validation(subsample=subsample, folds=3, subsample_idx=0, random_state=42, zero_division=0)
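
The interaction between `folds` and the subsample size is easier to see with a sketch of fold assignment. The version below splits indices into stratified folds so that each fold contains cells from every label. Stratification and the helper name `stratified_folds` are assumptions for illustration. This page does not state which splitter pertpy uses.

```python
import random
from collections import defaultdict


def stratified_folds(labels, folds, random_state=None):
    """Assign sample indices to folds so each label is spread evenly across folds."""
    rng = random.Random(random_state)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    fold_members = [[] for _ in range(folds)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin into the folds.
        for position, idx in enumerate(indices):
            fold_members[position % folds].append(idx)
    return [sorted(members) for members in fold_members]


# 20 cells per condition, as with subsample_size=20 and two conditions.
labels = [0] * 20 + [1] * 20
test_folds = stratified_folds(labels, folds=3, random_state=42)
```

With 20 cells per condition and 3 folds, each test fold holds only 6 or 7 cells per condition, which is why raising `folds` without raising `subsample_size` can leave folds too small to score.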

Augur.sample(adata, categorical, subsample_size, random_state, features)[source]#

Sample AnnData observations.

Parameters:

adata (AnnData) – Anndata with obs label and cell_type for label and cell type and dummy variable y_ columns used as target
categorical (bool) – True if target values are categorical
subsample_size (int) – number of cells to subsample randomly per type from each experimental condition
random_state (int) – set numpy random seed and sampling seed
features (list) – features to keep in the returned AnnData object

Returns:

Subsample of AnnData object of size subsample_size with given features

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> ag_rfc.select_highly_variable(loaded_data)
>>> features = loaded_data.var_names
>>> subsample = ag_rfc.sample(
...     loaded_data, categorical=True, subsample_size=20, random_state=42, features=features
... )

Augur.select_highly_variable(adata)[source]#

Feature selection by variance using scanpy highly variable genes function.

Parameters:: adata (AnnData) – Anndata object containing gene expression values (cells in rows, genes in columns)
Return type:: AnnData

Returns:: Anndata object with highly variable genes added as layer

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> ag_rfc.select_highly_variable(loaded_data)

Augur.select_variance(adata, *, var_quantile, filter_negative_residuals, span=0.75)[source]#

Feature selection based on Augur implementation.

Parameters:

adata (AnnData) – Anndata object
var_quantile (float) – The quantile below which features will be filtered, based on their residuals in a loess model.
filter_negative_residuals (bool) – if True, filter residuals at a fixed threshold of zero, instead of var_quantile
span (float, default: 0.75) – Smoothing factor, as a fraction of the number of points to take into account. Should be in the range (0, 1].

Returns:

AnnData object with additional select_variance column in var.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.sc_sim_augur()
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> loaded_data = ag_rfc.load(adata)
>>> ag_rfc.select_variance(loaded_data, var_quantile=0.5, filter_negative_residuals=False, span=0.75)
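
The two filtering modes described by `var_quantile` and `filter_negative_residuals` can be sketched on a dict of per-gene residuals. The loess fit itself is left out: the residual values are made up, and the helper name, the linear-interpolated quantile, and the inclusive `>=` comparison are assumptions.

```python
def select_by_residual(residuals, var_quantile, filter_negative_residuals=False):
    """Keep features whose residual is at or above the chosen threshold."""
    if filter_negative_residuals:
        threshold = 0.0  # fixed threshold of zero instead of a quantile
    else:
        ordered = sorted(residuals.values())
        # Linear-interpolated quantile of the residuals.
        position = var_quantile * (len(ordered) - 1)
        lower = int(position)
        upper = min(lower + 1, len(ordered) - 1)
        threshold = ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])
    return [name for name, value in residuals.items() if value >= threshold]


# Made-up residuals from a hypothetical loess fit.
residuals = {"g1": -0.4, "g2": 0.1, "g3": 0.2, "g4": 0.3, "g5": 0.9}
kept_by_quantile = select_by_residual(residuals, var_quantile=0.5)
kept_non_negative = select_by_residual(residuals, var_quantile=0.5, filter_negative_residuals=True)
```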

Augur.set_scorer(multiclass, zero_division)[source]#

Set scoring functions for cross-validation based on estimator.

Parameters:

multiclass (bool) – Whether there are more than two target classes
zero_division (int | str) – 0 or 1 or warn; Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised. Precision metric parameter.

Return type:

dict[str, Any]

Returns:

Dict linking name to scorer object and string name

Examples

>>> import pertpy as pt
>>> ag_rfc = pt.tl.Augur("random_forest_classifier")
>>> scorer = ag_rfc.set_scorer(True, 0)
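
The effect of `zero_division` on the precision metric can be shown with a small stand-alone function. It follows the semantics described above (return the given value when nothing was predicted positive, with "warn" acting as 0 plus a warning). The function is an illustrative sketch, not the scorer object that `set_scorer` returns.

```python
import warnings


def precision(y_true, y_pred, zero_division=0):
    """Precision = true positives / predicted positives, with a configurable fallback."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(1 for p in y_pred if p == 1)
    if pred_pos == 0:
        if zero_division == "warn":
            warnings.warn("Precision is ill-defined: no positive predictions.")
            return 0.0
        return float(zero_division)
    return true_pos / pred_pos


# No positive predictions: the fallback value is returned.
no_positives = precision([1, 0, 1], [0, 0, 0], zero_division=1)
# One of the two positive predictions is correct.
regular = precision([1, 0, 1], [1, 1, 0], zero_division=0)
```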
