pertpy.tools.DifferentialGeneExpression

class pertpy.tools.DifferentialGeneExpression

Support for differential gene expression for scverse.

Methods table

calculate_cohens_d(de_res_1, de_res_2)

Calculate Cohen's D for the logfoldchanges.

calculate_correlation(de_res_1, de_res_2[, ...])

Calculate the correlation coefficient (Spearman by default) for the 'pvals_adj' and 'logfoldchanges' columns.

calculate_jaccard_index(de_res_1, de_res_2)

Calculate the Jaccard index for sets of significantly expressed genes/features based on a p-value threshold.

de_analysis(adata, groupby, method, ...[, ...])

Perform differential expression analysis.

de_res_to_anndata(adata, de_res, *, groupby)

Add tabular differential expression result to AnnData as if it was produced by scanpy.tl.rank_genes_groups.

filter_by_expr(adata[, obs, group, ...])

Filter AnnData by which genes have sufficiently large counts to be retained in a statistical analysis.

filter_by_prop(adata[, min_prop, min_samples])

Determine which genes are expressed in a sufficient proportion of cells across samples.

get_pseudobulk(adata, sample_col, groups_col)

Summarizes expression profiles across cells per sample and group.

Methods

calculate_cohens_d

DifferentialGeneExpression.calculate_cohens_d(de_res_1, de_res_2)

Calculate Cohen’s D for the logfoldchanges.

Parameters:
  • de_res_1 (DataFrame) – A DataFrame with DE result columns, including ‘logfoldchanges’.

  • de_res_2 (DataFrame) – Another DataFrame with the same DE result columns.

Return type:

Series

Returns:

A pandas Series containing Cohen’s D for each gene/feature.
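As a rough illustration of the statistic, the sketch below computes Cohen's d between two lists of log fold changes in plain Python. It assumes the equal-weight pooled standard deviation, sqrt((s1^2 + s2^2) / 2). The helper name cohens_d and the example values are hypothetical, and pertpy's own implementation may differ in detail.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(lfc_1, lfc_2):
    """Cohen's d between two samples of log fold changes.

    Assumption: equal-weight pooled standard deviation.
    """
    pooled_sd = sqrt((stdev(lfc_1) ** 2 + stdev(lfc_2) ** 2) / 2)
    return (mean(lfc_1) - mean(lfc_2)) / pooled_sd


# Hypothetical log fold changes from two DE runs.
d = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A negative value means the first result set has the smaller mean log fold change, measured in units of the pooled standard deviation.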

calculate_correlation

DifferentialGeneExpression.calculate_correlation(de_res_1, de_res_2, method='spearman')

Calculate the correlation coefficient (Spearman by default) for the ‘pvals_adj’ and ‘logfoldchanges’ columns.

Parameters:
  • de_res_1 (DataFrame) – A DataFrame with DE result columns.

  • de_res_2 (DataFrame) – Another DataFrame with the same DE result columns.

  • method (Literal['spearman', 'pearson', 'kendall-tau']) – The correlation method to apply. One of spearman, pearson, kendall-tau. Defaults to spearman.

Return type:

DataFrame

Returns:

A DataFrame with the correlation coefficients for ‘pvals_adj’ and ‘logfoldchanges’, computed with the selected method.
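To make the default method concrete, here is a minimal plain-Python sketch of the Spearman coefficient, computed as the Pearson correlation of the ranks. The function names and input values are hypothetical; this shows the statistic only and is not pertpy's code.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical log fold changes for the same four genes in two DE results.
lfc_1 = [0.5, 1.2, 2.0, 3.1]
lfc_2 = [0.4, 1.0, 2.9, 2.5]
rho = spearman(lfc_1, lfc_2)
```

Because only the ranks enter the calculation, the coefficient is insensitive to monotone rescaling of either column, which suits heavily skewed quantities such as adjusted p-values.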

calculate_jaccard_index

DifferentialGeneExpression.calculate_jaccard_index(de_res_1, de_res_2, threshold=0.05)

Calculate the Jaccard index for sets of significantly expressed genes/features based on a p-value threshold.

Parameters:
  • de_res_1 (DataFrame) – A DataFrame with DE result columns, including ‘pvals’.

  • de_res_2 (DataFrame) – Another DataFrame with the same DE result columns.

  • threshold (float) – A threshold for determining significant expression (default is 0.05).

Return type:

float

Returns:

The Jaccard index.
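The Jaccard index is the size of the intersection of the two significant gene sets divided by the size of their union. The sketch below works on hypothetical gene-to-p-value dictionaries and is not the library code. Two choices are assumptions: the comparison is strict (p < threshold), and the index is 0.0 when neither set contains a gene.

```python
def jaccard_index(pvals_1, pvals_2, threshold=0.05):
    """Jaccard index of the genes that are significant in each result."""
    sig_1 = {gene for gene, p in pvals_1.items() if p < threshold}
    sig_2 = {gene for gene, p in pvals_2.items() if p < threshold}
    union = sig_1 | sig_2
    if not union:
        return 0.0  # assumption: no significant genes in either result
    return len(sig_1 & sig_2) / len(union)


# Hypothetical p-values from two DE runs over the same genes.
run_1 = {"g1": 0.01, "g2": 0.02, "g3": 0.04, "g4": 0.30}
run_2 = {"g1": 0.20, "g2": 0.01, "g3": 0.03, "g4": 0.02}
overlap = jaccard_index(run_1, run_2)
```

Here g2 and g3 are significant in both runs, and g1 through g4 are significant in at least one, so the index is 2 / 4.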

de_analysis

DifferentialGeneExpression.de_analysis(adata, groupby, method, *formula, contrast, inplace=True, key_added)

Perform differential expression analysis.

Parameters:
  • adata (AnnData) – Single-cell or pseudobulk AnnData object.

  • groupby (str) – Column in adata.obs that contains the factor to test, e.g. treatment. For simple statistical tests (t-test, wilcoxon), specifying groupby is sufficient. Linear models additionally require a formula; in that case, the groupby column is used to compute the contrast.

  • method (Literal['t-test', 'wilcoxon', 'pydeseq2', 'deseq2', 'edger']) – Which method to use to perform the DE test.

  • formula (str | None) – Model specification for linear models, e.g. ~ treatment + sex + age. MUST contain the factor specified in groupby.

  • contrast (str | None) – Contrast specification for linear models. See e.g. https://www.statsmodels.org/devel/contrasts.html for more information.

  • inplace (bool) – If True, save the result in adata.varm[key_added].

  • key_added (str | None) – Key under which the result is saved in adata.varm if inplace is True. If set to None, this defaults to de_{method}_{groupby}.

Return type:

DataFrame

Returns:

Depending on the method, a pandas DataFrame containing at least:

  • gene_id

  • log2 fold change

  • mean expression

  • unadjusted p-value

  • adjusted p-value
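Two of the parameter rules above are easy to get wrong: the default storage key and the requirement that the formula contain the groupby factor. The sketch below spells them out with two hypothetical helpers, resolve_key and formula_mentions. They are not part of pertpy, and the formula check is deliberately naive: it assumes terms are joined only by '+', '*' or ':'.

```python
def resolve_key(method, groupby, key_added=None):
    """Key used in adata.varm when inplace is True (per the parameter docs)."""
    return key_added if key_added is not None else f"de_{method}_{groupby}"


def formula_mentions(formula, factor):
    """Naive check that `factor` appears as a term of `formula`.

    Assumption: terms are joined only by '+', '*' or ':'.
    """
    rhs = formula.split("~", 1)[-1]
    for sep in ("*", ":"):
        rhs = rhs.replace(sep, "+")
    terms = [term.strip() for term in rhs.split("+")]
    return factor in terms


key = resolve_key("pydeseq2", "treatment")
ok = formula_mentions("~ treatment + sex + age", "treatment")
missing = formula_mentions("~ sex + age", "treatment")
```

Running a check like this before fitting a linear model catches a formula that omits the tested factor early, when the mistake is cheapest to fix.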

de_res_to_anndata

DifferentialGeneExpression.de_res_to_anndata(adata, de_res, *, groupby, gene_id_col='gene_symbols', score_col='scores', pval_col='pvals', pval_adj_col='pvals_adj', lfc_col='logfoldchanges', key_added='rank_genes_groups')

Add tabular differential expression result to AnnData as if it was produced by scanpy.tl.rank_genes_groups.

Parameters:
  • adata (AnnData) – Annotated data matrix

  • de_res (DataFrame) – Tabular DE result

  • groupby (str) – Column in de_res that indicates the group. This column must also exist in adata.obs.

  • gene_id_col (str) – Column in de_res that holds the gene identifiers

  • score_col (str) – Column in de_res that holds the score (results will be ordered by score).

  • pval_col (str) – Column in de_res that holds the unadjusted p-value

  • pval_adj_col (str | None) – Column in de_res that holds the adjusted p-value. If not specified, the unadjusted p-values will be FDR-adjusted.

  • lfc_col (str) – Column in de_res that holds the log fold change

  • key_added (str) – Key under which the results will be stored in adata.uns

Return type:

None
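When pval_adj_col is not given, the unadjusted p-values are FDR-adjusted. The documentation does not say which procedure is used. The sketch below assumes Benjamini-Hochberg, the most common choice, and shows the arithmetic in plain Python. The function name bh_adjust is hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order.

    Assumption: BH is the FDR procedure; the source does not name one.
    """
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical unadjusted p-values for four genes.
adj = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

Each p-value is scaled by n / rank, and the running minimum ensures that a gene with a smaller raw p-value never receives a larger adjusted value than a gene ranked after it.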

filter_by_expr

DifferentialGeneExpression.filter_by_expr(adata, obs=None, group=None, lib_size=None, min_count=10, min_total_count=15, large_n=10, min_prop=0.7)

Filter AnnData by which genes have sufficiently large counts to be retained in a statistical analysis.

Wraps decoupler’s filter_by_expr function. See https://decoupler-py.readthedocs.io/en/latest/generated/decoupler.filter_by_expr.html#decoupler.filter_by_expr for more details.

Parameters:
  • adata (AnnData) – AnnData obtained after running get_pseudobulk.

  • obs (DataFrame) – Metadata dataframe, only needed if adata is not an AnnData.

  • group (str | None) – Name of the .obs column to group by. If None, assumes all samples belong to one group.

  • lib_size (int | float | None) – Library size. Defaults to the sum of reads per sample if None.

  • min_count (int) – Minimum count required per gene for at least some samples.

  • min_total_count (int) – Minimum total count required per gene across all samples.

  • large_n (int) – Number of samples per group considered to be “large”.

  • min_prop (float) – Minimum proportion of samples in the smallest group that express the gene.

Return type:

AnnData

Returns:

AnnData with only the genes that are to be kept.
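The parameter names match edgeR's filterByExpr, so the sketch below assumes that rule: a gene is kept if its counts-per-million reach a cutoff derived from min_count in enough samples and its total count reaches min_total_count. This is a plain-Python illustration under that assumption, with hypothetical names and data. It is not the decoupler or pertpy implementation.

```python
from collections import Counter
from statistics import median


def filter_by_expr_sketch(counts, groups, min_count=10, min_total_count=15,
                          large_n=10, min_prop=0.7):
    """Return the genes to keep under an edgeR-style expression filter.

    counts: dict mapping gene -> list of counts, one entry per sample.
    groups: list with the group label of each sample.
    """
    n_samples = len(groups)
    # Library size defaults to the sum of counts per sample.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    # Required number of expressing samples, based on the smallest group.
    min_sample_size = min(Counter(groups).values())
    if min_sample_size > large_n:
        min_sample_size = large_n + (min_sample_size - large_n) * min_prop
    # Translate min_count into a counts-per-million cutoff.
    cpm_cutoff = min_count / median(lib_sizes) * 1e6
    kept = []
    for gene, row in counts.items():
        cpm = [c / lib * 1e6 for c, lib in zip(row, lib_sizes)]
        n_pass = sum(value >= cpm_cutoff for value in cpm)
        if n_pass >= min_sample_size and sum(row) >= min_total_count:
            kept.append(gene)
    return kept


# Hypothetical pseudobulk counts: four samples in two groups.
counts = {
    "high": [500, 500, 500, 500],
    "low": [0, 1, 0, 2],
    "mid": [20, 30, 0, 0],
}
kept_genes = filter_by_expr_sketch(counts, ["a", "a", "b", "b"])
```

The gene "mid" survives because it clears the cutoff in two samples, which equals the size of the smallest group, while "low" never clears it.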

filter_by_prop

DifferentialGeneExpression.filter_by_prop(adata, min_prop=0.2, min_samples=2)

Determine which genes are expressed in a sufficient proportion of cells across samples.

This function selects genes that are expressed in at least a proportion min_prop of the cells of a sample, in at least min_samples samples.

Parameters:
  • adata (AnnData) – AnnData obtained after running get_pseudobulk. It requires .layers[‘psbulk_props’].

  • min_prop (float) – Minimum proportion of cells that express a gene in a sample.

  • min_samples (int) – Minimum number of samples in which the proportion of cells expressing the gene is at least min_prop.

Return type:

AnnData

Returns:

AnnData with only the genes that are to be kept.
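The selection rule can be sketched in a few lines of plain Python. The input stands in for .layers['psbulk_props'] as a hypothetical dictionary of per-sample proportions, and the function name is illustrative rather than part of the library.

```python
def filter_by_prop_sketch(props, min_prop=0.2, min_samples=2):
    """Return genes expressed in enough cells in enough samples.

    props: dict mapping gene -> list of per-sample proportions of
    expressing cells (a stand-in for the 'psbulk_props' layer).
    """
    return [
        gene
        for gene, row in props.items()
        if sum(p >= min_prop for p in row) >= min_samples
    ]


# Hypothetical proportions for three genes across three samples.
props = {
    "g1": [0.50, 0.40, 0.10],  # passes in two samples
    "g2": [0.25, 0.05, 0.10],  # passes in one sample only
    "g3": [0.20, 0.20, 0.20],  # exactly at the threshold in all samples
}
kept_genes = filter_by_prop_sketch(props)
```

The comparison is inclusive, so a gene sitting exactly at min_prop counts as expressed in that sample.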

get_pseudobulk

DifferentialGeneExpression.get_pseudobulk(adata, sample_col, groups_col, obs=None, layer=None, use_raw=False, mode='sum', min_cells=10, min_counts=1000, dtype=numpy.float32, skip_checks=False)

Summarizes expression profiles across cells per sample and group.

Generates summarized expression profiles across cells per sample (e.g. sample id) and group (e.g. cell type) based on the metadata found in .obs. As a minimal quality control, this function removes sample-group profiles with too few cells (min_cells) or too few counts (min_counts). Genes can be filtered afterwards with filter_by_expr or filter_by_prop.

By default this function expects raw integer counts as input and sums them per sample and group (mode=’sum’), but other modes are available.

This function produces some quality control metrics to help assess whether some samples need to be filtered. The number of cells that belong to each sample is stored in .obs[‘psbulk_n_cells’], the total sum of counts per sample in .obs[‘psbulk_counts’], and the proportion of cells that express a given gene in .layers[‘psbulk_props’].

Wraps decoupler’s get_pseudobulk function. See: https://decoupler-py.readthedocs.io/en/latest/generated/decoupler.get_pseudobulk.html#decoupler.get_pseudobulk for more details.

Parameters:
  • adata (AnnData) – Input AnnData object.

  • sample_col (str) – Column of obs from which to extract the sample names.

  • groups_col (str) – Column of obs from which to extract the group names.

  • obs (DataFrame) – If provided, metadata DataFrame.

  • layer (str) – If provided, which layer to use.

  • use_raw (bool) – Use raw attribute of the AnnData object if present.

  • mode (str) – How to perform the pseudobulk. Available options are ‘sum’, ‘mean’ or ‘median’. Also accepts callback functions to perform custom aggregations. Additionally, it is also possible to provide a dictionary of different callback functions, each one stored in a different resulting .layer. In this case, the result of the first callback function of the dictionary is stored in .X by default.

  • min_cells (int) – Filter to remove samples by a minimum number of cells in a sample-group pair.

  • min_counts (int) – Filter to remove samples by a minimum number of summed counts in a sample-group pair.

  • dtype (DTypeLike) – Type of float used.

  • skip_checks (bool) – Whether to skip input checks. Set to True when working with positive and negative data, or when counts are not integers.

Return type:

AnnData

Returns:

New AnnData object with unnormalized pseudobulk profiles per sample and group.
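For intuition, the default mode='sum' aggregation and the two sample filters can be sketched without AnnData. The function below operates on hypothetical (sample, group, counts) tuples, one per cell. It illustrates the idea only and is not the decoupler or pertpy implementation.

```python
from collections import defaultdict


def pseudobulk_sum(cells, min_cells=10, min_counts=1000):
    """Sum per-cell counts for each (sample, group) pair.

    cells: iterable of (sample, group, counts), where counts is a list
    of raw integer counts per gene for one cell.
    Pairs with fewer than min_cells cells or fewer than min_counts
    summed counts are dropped.
    """
    sums = {}
    n_cells = defaultdict(int)
    for sample, group, counts in cells:
        key = (sample, group)
        if key not in sums:
            sums[key] = [0] * len(counts)
        sums[key] = [a + b for a, b in zip(sums[key], counts)]
        n_cells[key] += 1
    return {
        key: profile
        for key, profile in sums.items()
        if n_cells[key] >= min_cells and sum(profile) >= min_counts
    }


# Hypothetical cells: (sample id, cell type, counts for three genes).
cells = [
    ("s1", "T", [5, 0, 3]),
    ("s1", "T", [4, 1, 2]),
    ("s1", "B", [1, 0, 0]),
    ("s2", "T", [10, 2, 0]),
    ("s2", "T", [0, 0, 1]),
]
# Thresholds lowered to suit the toy data.
profiles = pseudobulk_sum(cells, min_cells=2, min_counts=10)
```

The ("s1", "B") pair is dropped because it contains a single cell, which is how min_cells protects downstream tests from profiles built on too little data.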