pertpy.tools.MLPClassifierSpace

class pertpy.tools.MLPClassifierSpace[source]

Fits an ANN classifier to the data and takes the feature space (the output of the last hidden layer) as embedding.

We train the ANN to classify the different perturbations. After training, the penultimate layer is used as the feature space, resulting in one embedding per cell. Consider employing PseudobulkSpace or another PerturbationSpace to obtain one embedding per perturbation.

See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7289078/ (Dose-response analysis) and Sup. 17-19.
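
For example, a typical workflow computes one embedding per cell with this class and then aggregates the cell embeddings per perturbation with PseudobulkSpace. The following sketch is assembled from the examples below and is illustrative only; dataset and column names depend on your data:

>>> import pertpy as pt
>>> adata = pt.dt.norman_2019()
>>> mlp_space = pt.tl.MLPClassifierSpace()
>>> cell_embeddings = mlp_space.compute(adata, target_col="perturbation_name")
>>> ps = pt.tl.PseudobulkSpace()
>>> pert_embeddings = ps.compute(cell_embeddings, target_col="perturbations")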

Methods table

add(adata, perturbations[, reference_key, ...])

Add perturbations linearly.

compute(adata[, target_col, layer_key, ...])

Creates cell embeddings by training an MLP classifier model to distinguish between perturbations.

compute_control_diff(adata[, target_col, ...])

Subtract mean of the control from the perturbation.

get_embeddings(**kwargs)

This method is deprecated and will be removed in the future.

label_transfer(adata[, column, target_val, ...])

Impute missing values in the specified column using KNN imputation in the space defined by use_rep.

load(adata, **kwargs)

This method is deprecated and will be removed in the future.

subtract(adata, perturbations[, ...])

Subtract perturbations linearly.

train(**kwargs)

This method is deprecated and will be removed in the future.

Methods

add

MLPClassifierSpace.add(adata, perturbations, reference_key='control', ensure_consistency=False, target_col='perturbation')

Add perturbations linearly. Assumes an input of size n_perts x dimensionality.

Parameters:
  • adata (AnnData) – Anndata object of size n_perts x dim.

  • perturbations (Iterable[str]) – Perturbations to add.

  • reference_key (str) – Perturbation source from which the perturbation summation starts. Defaults to ‘control’.

  • ensure_consistency (bool) – If True, runs differential expression on all data matrices to ensure consistency of linear space.

  • target_col (str) – .obs column name that stores the label of the perturbation applied to each cell. Defaults to ‘perturbation’.

Return type:

tuple[AnnData, AnnData] | AnnData

Returns:

Anndata object of size (n_perts+1) x dim, where the last row is the addition of the specified perturbations. If ensure_consistency is True, returns a tuple of (new_perturbation, adata) where adata is the AnnData object provided as input but updated using compute_control_diff.

Examples

Example usage with PseudobulkSpace:

>>> import pertpy as pt
>>> mdata = pt.dt.papalexi_2021()
>>> ps = pt.tl.PseudobulkSpace()
>>> ps_adata = ps.compute(mdata["rna"], target_col="gene_target", groups_col="gene_target")
>>> new_perturbation = ps.add(ps_adata, perturbations=["ATF2", "CD86"], reference_key="NT")

compute

MLPClassifierSpace.compute(adata, target_col='perturbations', layer_key=None, hidden_dim=None, dropout=0.0, batch_norm=True, batch_size=256, test_split_size=0.2, validation_split_size=0.25, max_epochs=20, val_epochs_check=2, patience=2)[source]

Creates cell embeddings by training an MLP classifier model to distinguish between perturbations.

A model is created using the specified parameters (hidden_dim, dropout, batch_norm). Further parameters, such as the number of classes to predict (the number of perturbations), are obtained directly from the provided AnnData object. Dataloaders that take class imbalances into account are created. Next, the model is trained and tested, using the GPU if available. The embeddings are obtained by passing the data through the trained model and extracting the values of the penultimate layer (the last hidden layer) of the MLP. You will get one embedding per cell, so be aware that you might need to apply another perturbation space to aggregate the embeddings per perturbation.
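
The network built from these parameters is a plain feed-forward MLP. As a rough illustration (an assumption about the architecture, not pertpy's exact implementation; n_genes and n_perturbations are placeholders), hidden_dim=[512, 256], dropout=0.1, and batch_norm=True correspond approximately to:

>>> import torch.nn as nn
>>> n_genes, n_perturbations = 2000, 10  # placeholders; inferred from the AnnData object in practice
>>> mlp = nn.Sequential(
...     nn.Linear(n_genes, 512),
...     nn.BatchNorm1d(512),
...     nn.ReLU(),
...     nn.Dropout(0.1),
...     nn.Linear(512, 256),
...     nn.BatchNorm1d(256),
...     nn.ReLU(),
...     nn.Dropout(0.1),
...     nn.Linear(256, n_perturbations),  # classification head; the 256-dim activations before it serve as the embedding
... )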

Parameters:
  • adata (AnnData) – AnnData object of size cells x genes

  • target_col (str) – .obs column that stores the perturbations. Defaults to “perturbations”.

  • layer_key (str) – Layer in adata to use. Defaults to None.

  • hidden_dim (list[int]) – List of the number of neurons in each hidden layer of the neural network. For instance, [512, 256] will create a neural network with two hidden layers, the first with 512 neurons and the second with 256 neurons. Defaults to [512].

  • dropout (float) – Amount of dropout applied, constant for all layers. Defaults to 0.

  • batch_norm (bool) – Whether to apply batch normalization. Defaults to True.

  • batch_size (int) – The batch size, i.e. the number of datapoints to use in one forward/backward pass. Defaults to 256.

  • test_split_size (float) – Fraction of data to put in the test set. Defaults to 0.2.

  • validation_split_size (float) – Fraction of the resulting train set to put in the validation set. E.g. a test_split_size of 0.2 and a validation_split_size of 0.25 means that 25% of 80% of the data will be used for validation (a worked example follows this parameter list). Defaults to 0.25.

  • max_epochs (int) – Maximum number of epochs for training. Defaults to 20.

  • val_epochs_check (int) – Evaluate performance on the validation dataset after every val_epochs_check training epochs. Note that this affects early stopping, as the model will be stopped if the validation performance does not improve for patience validation checks. Defaults to 2.

  • patience (int) – Number of validation performance checks without improvement, after which the early stopping flag is activated and training is therefore stopped. Defaults to 2.
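
The two split parameters combine multiplicatively. A small worked example with the default values (illustrative arithmetic only):

>>> test_split_size, validation_split_size = 0.2, 0.25
>>> test = test_split_size  # 20% of all cells
>>> val = (1 - test_split_size) * validation_split_size  # 25% of the remaining 80%
>>> train = (1 - test_split_size) * (1 - validation_split_size)
>>> round(test, 2), round(val, 2), round(train, 2)
(0.2, 0.2, 0.6)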

Return type:

AnnData

Returns:

AnnData whose X attribute is the perturbation embedding and whose .obs[‘perturbations’] are the names of the perturbations. The AnnData will have shape (n_cells, n_features) where n_features is the number of features in the last layer of the MLP.

Examples

>>> import pertpy as pt
>>> adata = pt.dt.norman_2019()
>>> dcs = pt.tl.MLPClassifierSpace()
>>> cell_embeddings = dcs.compute(adata, target_col="perturbation_name")

compute_control_diff

MLPClassifierSpace.compute_control_diff(adata, target_col='perturbation', group_col=None, reference_key='control', layer_key=None, new_layer_key='control_diff', embedding_key=None, new_embedding_key='control_diff', all_data=False, copy=False)

Subtract mean of the control from the perturbation.

Parameters:
  • adata (AnnData) – Anndata object of size cells x genes.

  • target_col (str) – .obs column name that stores the label of the perturbation applied to each cell. Defaults to ‘perturbation’.

  • group_col (str) – .obs column name that stores the label of the group of each cell. If None, groups are ignored. Defaults to None.

  • reference_key (str) – The key of the control values. Defaults to ‘control’.

  • layer_key (str) – Key of the AnnData layer to use for computation. If None, the X matrix is used. Defaults to None.

  • new_layer_key (str) – The results are stored in a layer with this key. Defaults to ‘control_diff’.

  • embedding_key (str) – .obsm key of the AnnData embedding to use for computation. If None, the X matrix is used. Defaults to None.

  • new_embedding_key (str) – Results are stored in a new embedding in obsm with this key. Defaults to ‘control_diff’.

  • all_data (bool) – If True, the computation is done on all data representations (X, all layers, and all embeddings). Defaults to False.

  • copy (bool) – If True, returns a new AnnData object of the same size with the new column; otherwise updates the initial AnnData object. Defaults to False.

Return type:

AnnData

Returns:

Updated AnnData object.

Examples

Example usage with PseudobulkSpace:

>>> import pertpy as pt
>>> mdata = pt.dt.papalexi_2021()
>>> ps = pt.tl.PseudobulkSpace()
>>> diff_adata = ps.compute_control_diff(mdata["rna"], target_col="gene_target", reference_key="NT")

get_embeddings

MLPClassifierSpace.get_embeddings(**kwargs)[source]

This method is deprecated and will be removed in the future. Please use the compute method instead.

label_transfer

MLPClassifierSpace.label_transfer(adata, column='perturbation', target_val='unknown', n_neighbors=5, use_rep='X_umap')

Impute missing values in the specified column using KNN imputation in the space defined by use_rep.

Parameters:
  • adata (AnnData) – The AnnData object containing single-cell data.

  • column (str) – The column name in AnnData object to perform imputation on. Defaults to “perturbation”.

  • target_val (str) – The value in the specified column that marks missing entries to impute. Defaults to “unknown”.

  • n_neighbors (int) – Number of neighbors to use for imputation. Defaults to 5.

  • use_rep (str) – The key in adata.obsm where the embedding (UMAP, PCA, etc.) is stored. Defaults to ‘X_umap’.

Return type:

None

Examples

>>> import pertpy as pt
>>> import scanpy as sc
>>> import numpy as np
>>> adata = sc.datasets.pbmc68k_reduced()
>>> rng = np.random.default_rng()
>>> adata.obs["perturbation"] = rng.choice(
...     ["A", "B", "C", "unknown"], size=adata.n_obs, p=[0.33, 0.33, 0.33, 0.01]
... )
>>> sc.pp.neighbors(adata)
>>> sc.tl.umap(adata)
>>> ps = pt.tl.PseudobulkSpace()
>>> ps.label_transfer(adata, n_neighbors=5, use_rep="X_umap")

load

MLPClassifierSpace.load(adata, **kwargs)[source]

This method is deprecated and will be removed in the future. Please use the compute method instead.

subtract

MLPClassifierSpace.subtract(adata, perturbations, reference_key='control', ensure_consistency=False, target_col='perturbation')

Subtract perturbations linearly. Assumes an input of size n_perts x dimensionality.

Parameters:
  • adata (AnnData) – Anndata object of size n_perts x dim.

  • perturbations (Iterable[str]) – Perturbations to subtract.

  • reference_key (str) – Perturbation source from which the perturbation subtraction starts. Defaults to ‘control’.

  • ensure_consistency (bool) – If True, runs differential expression on all data matrices to ensure consistency of linear space.

  • target_col (str) – .obs column name that stores the label of the perturbation applied to each cell. Defaults to ‘perturbation’.

Return type:

tuple[AnnData, AnnData] | AnnData

Returns:

Anndata object of size (n_perts+1) x dim, where the last row is the subtraction of the specified perturbations. If ensure_consistency is True, returns a tuple of (new_perturbation, adata) where adata is the AnnData object provided as input but updated using compute_control_diff.

Examples

Example usage with PseudobulkSpace:

>>> import pertpy as pt
>>> mdata = pt.dt.papalexi_2021()
>>> ps = pt.tl.PseudobulkSpace()
>>> ps_adata = ps.compute(mdata["rna"], target_col="gene_target", groups_col="gene_target")
>>> new_perturbation = ps.subtract(ps_adata, reference_key="ATF2", perturbations=["BRD4", "CUL3"])

train

MLPClassifierSpace.train(**kwargs)[source]

This method is deprecated and will be removed in the future. Please use the compute method instead.