pertpy.tools.MLPClassifierSpace
- class MLPClassifierSpace
Fits an ANN classifier to the data and takes the feature space (the activations of the penultimate layer) as the embedding.
We train the ANN to classify the different perturbations. After training, the penultimate layer is used as the feature space, resulting in one embedding per cell. Consider employing PseudobulkSpace or another PerturbationSpace to obtain one embedding per perturbation.
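To make "penultimate layer as feature space" concrete, here is a minimal sketch in plain Python. The tiny network, its hand-picked weights, and the helper names (`relu`, `linear`, `forward`) are all illustrative assumptions, not pertpy's implementation.

```python
# Illustrative sketch only: a toy MLP with hand-picked weights, not pertpy's model.

def relu(values):
    return [max(0.0, v) for v in values]

def linear(x, weights, bias):
    # `weights` holds one row per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Return (embedding, logits) for one cell."""
    embedding = relu(linear(x, hidden_w, hidden_b))  # penultimate-layer activations
    logits = linear(embedding, out_w, out_b)         # one score per perturbation class
    return embedding, logits

# Toy network: 3 genes -> 2 hidden units -> 3 perturbation classes.
hidden_w = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
hidden_b = [0.0, 0.0]
out_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_b = [0.0, 0.0, 0.0]

cell = [2.0, 1.0, 1.0]
embedding, logits = forward(cell, hidden_w, hidden_b, out_w, out_b)
```

The classifier is trained on `logits`, but only `embedding` is kept per cell, so its length equals the width of the last hidden layer rather than the number of perturbations.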
See here https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7289078/ (Dose-response analysis) and Sup 17-19.
Methods table
| Method | Description |
| add | Add perturbations linearly. |
| compute | Creates cell embeddings by training a MLP classifier model to distinguish between perturbations. |
| compute_control_diff | Subtract mean of the control from the perturbation. |
| label_transfer | Impute missing values in the specified column using KNN imputation in the space defined by use_rep. |
| subtract | Subtract perturbations linearly. |
Methods
- MLPClassifierSpace.add(adata, *, perturbations, reference_key='control', ensure_consistency=False, target_col='perturbation')
Add perturbations linearly. Assumes input of size n_perts x dimensionality.
- Parameters:
adata (AnnData) – AnnData object of size n_perts x dim.
reference_key (str, default: 'control') – Perturbation source from which the perturbation summation starts.
ensure_consistency (bool, default: False) – Whether to run differential expression on all data matrices to ensure consistency of linear space.
target_col (str, default: 'perturbation') – .obs column name that stores the label of the perturbation applied to each cell.
- Return type:
- Returns:
AnnData object of size (n_perts+1) x dim, where the last row is the addition of the specified perturbations. If ensure_consistency is True, returns a tuple of (new_perturbation, adata), where adata is the AnnData object provided as input, updated using compute_control_diff.
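The linear addition can be sketched on plain vectors. This is a minimal illustration under stated assumptions: the helper `add_perturbations`, the dictionary layout, and the `"+"`-joined name of the new row are hypothetical and are not taken from pertpy's source.

```python
# Illustrative sketch only: start from the reference row and add each
# perturbation row elementwise. Names and layout are assumptions.

def add_perturbations(rows, perturbations, reference_key="control"):
    """`rows` maps a perturbation name to its vector; returns (name, vector)."""
    missing = [p for p in perturbations if p not in rows]
    if missing:
        raise KeyError(f"Perturbations not found: {missing}")
    result = list(rows[reference_key])
    for name in perturbations:
        result = [a + b for a, b in zip(result, rows[name])]
    return "+".join(perturbations), result

rows = {
    "control": [0.0, 0.0],
    "ATF2": [1.0, -2.0],
    "CD86": [0.5, 3.0],
}
name, vector = add_perturbations(rows, ["ATF2", "CD86"])
```

Appending `vector` to the existing rows yields the (n_perts+1) x dim shape described above.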
Examples
Example usage with PseudobulkSpace:
>>> import pertpy as pt
>>> mdata = pt.dt.papalexi_2021()
>>> ps = pt.tl.PseudobulkSpace()
>>> ps_adata = ps.compute(mdata["rna"], target_col="gene_target", groups_col="gene_target")
>>> new_perturbation = ps.add(ps_adata, perturbations=["ATF2", "CD86"], reference_key="NT")
- MLPClassifierSpace.compute(adata, target_col='perturbations', layer_key=None, hidden_dim=None, dropout=0.0, batch_norm=True, batch_size=128, test_split_size=0.2, validation_split_size=0.25, max_epochs=20, val_epochs_check=2, patience=2, lr=0.0001, seed=42)
Creates cell embeddings by training a MLP classifier model to distinguish between perturbations.
A model is created using the specified parameters (hidden_dim, dropout, batch_norm). Further parameters such as the number of classes to predict (number of perturbations) are obtained from the provided AnnData object directly. Dataloaders that take into account class imbalances are created. Next, the model is trained and tested, using the GPU if available. The embeddings are obtained by passing the data through the model and extracting the values in the last layer of the MLP. You will get one embedding per cell, so be aware that you might need to apply another perturbation space to aggregate the embeddings per perturbation.
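The text does not say how the dataloaders account for class imbalance. One common approach, shown here purely as an assumption, is to sample each cell with a weight inversely proportional to the size of its perturbation class. The helper `inverse_frequency_weights` and the toy labels are hypothetical.

```python
# Illustrative sketch only: inverse-frequency sampling weights.
# Whether pertpy uses exactly this scheme is an assumption.
from collections import Counter

def inverse_frequency_weights(labels):
    """Give every cell the weight 1 / (number of cells sharing its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

# Toy, heavily imbalanced labels.
labels = ["control"] * 6 + ["KLF1"] * 3 + ["CEBPA"] * 1
weights = inverse_frequency_weights(labels)

# Total weight per class is (approximately) equal, so each class is
# drawn about equally often when sampling with these weights.
total_per_class = {
    label: sum(w for w, l in zip(weights, labels) if l == label)
    for label in set(labels)
}
```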
- Parameters:
adata (AnnData) – AnnData object of size cells x genes.
target_col (str, default: 'perturbations') – .obs column that stores the perturbations.
layer_key (str, default: None) – Layer in adata to use.
hidden_dim (list[int], default: None) – Number of neurons in each hidden layer of the neural network. For instance, [512, 256] creates a network with two hidden layers, the first with 512 neurons and the second with 256.
dropout (float, default: 0.0) – Amount of dropout applied, constant for all layers.
batch_norm (bool, default: True) – Whether to apply batch normalization.
batch_size (int, default: 128) – The batch size, i.e. the number of datapoints used in one forward/backward pass.
test_split_size (float, default: 0.2) – Fraction of the data to put in the test set.
validation_split_size (float, default: 0.25) – Fraction of the remaining (non-test) data to put in the validation set. E.g. a test_split_size of 0.2 and a validation_split_size of 0.25 mean that 25% of the remaining 80% of the data, i.e. 20% of all cells, is used for validation.
max_epochs (int, default: 20) – Maximum number of training epochs.
val_epochs_check (int, default: 2) – Evaluate performance on the validation set after every val_epochs_check training epochs. This affects early stopping, because training stops once validation performance has not improved for patience consecutive checks.
patience (int, default: 2) – Number of validation checks without improvement after which early stopping is triggered and training stops.
lr (float, default: 0.0001) – Learning rate for training.
seed (int, default: 42) – Random seed for reproducibility.
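The arithmetic behind the two split fractions can be checked with a short sketch. The helper `split_sizes` is hypothetical, and the rounding behavior is an assumption; the library may round or stratify differently.

```python
# Illustrative sketch only: how test_split_size and validation_split_size
# compose. Rounding details are an assumption.

def split_sizes(n_cells, test_split_size=0.2, validation_split_size=0.25):
    """Return (n_train, n_val, n_test) for a dataset of n_cells cells."""
    n_test = round(n_cells * test_split_size)
    n_train_val = n_cells - n_test                     # what remains after the test split
    n_val = round(n_train_val * validation_split_size)  # fraction of the remainder
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1000)
```

With the defaults, 1000 cells end up as 600 for training, 200 for validation, and 200 for testing.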
- Return type: AnnData
- Returns:
AnnData whose X attribute is the perturbation embedding and whose .obs['perturbations'] are the names of the perturbations. The AnnData will have shape (n_cells, n_features) where n_features is the number of features in the last layer of the MLP.
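The interplay of val_epochs_check and patience in the parameters above can be sketched as a small simulation. It assumes the monitored quantity is a validation loss where lower is better; the helper `stopping_epoch` and the loss values are hypothetical.

```python
# Illustrative sketch only: when does early stopping trigger?
# Assumes a validation loss (lower is better) is monitored.

def stopping_epoch(val_losses, max_epochs=20, val_epochs_check=2, patience=2):
    """`val_losses[i]` is the loss at the i-th validation check.

    Returns the epoch after which training stops.
    """
    best = float("inf")
    checks_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        if epoch % val_epochs_check != 0:
            continue  # validation only runs every val_epochs_check epochs
        loss = val_losses[epoch // val_epochs_check - 1]
        if loss < best:
            best = loss
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return epoch
    return max_epochs

# Checks happen at epochs 2, 4, 6, 8, ...; the loss stops improving after epoch 4.
losses = [0.9, 0.7, 0.8, 0.75, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
stopped_at = stopping_epoch(losses)
```

With the defaults, two checks without improvement span four training epochs, so training here stops after epoch 8 even though later checks would have improved.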
Examples
>>> import pertpy as pt
>>> adata = pt.dt.norman_2019()
>>> dcs = pt.tl.MLPClassifierSpace()
>>> cell_embeddings = dcs.compute(adata, target_col="perturbation_name")
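Because compute returns one embedding per cell, a further aggregation step is needed to get one vector per perturbation. In pertpy this would typically be done with another perturbation space such as PseudobulkSpace; the sketch below only illustrates the idea with a plain per-label mean. The helper `mean_embedding_per_perturbation` and the toy data are hypothetical.

```python
# Illustrative sketch only: average per-cell embeddings within each perturbation.
from collections import defaultdict

def mean_embedding_per_perturbation(embeddings, labels):
    """Map each perturbation label to the mean of its cells' embeddings."""
    grouped = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }

# Toy data: three cells with 2-dimensional embeddings.
embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
labels = ["KLF1", "KLF1", "control"]
per_perturbation = mean_embedding_per_perturbation(embeddings, labels)
```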
- MLPClassifierSpace.compute_control_diff(adata, *, target_col='perturbation', group_col=None, reference_key='control', layer_key=None, new_layer_key='control_diff', embedding_key=None, new_embedding_key='control_diff', all_data=False, copy=False)
Subtract mean of the control from the perturbation.
- Parameters:
adata (AnnData) – AnnData object of size cells x genes.
target_col (str, default: 'perturbation') – .obs column name that stores the label of the perturbation applied to each cell.
group_col (str, default: None) – .obs column name that stores the label of the group of each cell. If None, groups are ignored.
reference_key (str, default: 'control') – The key of the control values.
layer_key (str, default: None) – Key of the AnnData layer to use for computation.
new_layer_key (str, default: 'control_diff') – Results are stored in a new layer with this key.
embedding_key (str, default: None) – obsm key of the AnnData embedding to use for computation.
new_embedding_key (str, default: 'control_diff') – Results are stored in a new embedding in obsm with this key.
all_data (bool, default: False) – If True, perform the computation in all data representations (X, all layers and all embeddings).
copy (bool, default: False) – If True, returns a new AnnData object of the same size containing the results; otherwise, updates the input AnnData object in place.
- Return type: AnnData
- Returns:
Updated AnnData object.
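The core operation, subtracting the control mean, can be sketched on a plain matrix. This is a minimal illustration: the helper `control_diff` is hypothetical, group handling (group_col) is omitted, and applying the subtraction to every row, including the control rows, is an assumption.

```python
# Illustrative sketch only: subtract the mean control profile from every row.
# Group handling and layer/embedding selection are omitted.

def control_diff(matrix, labels, reference_key="control"):
    """Return a new matrix with the mean of the control rows subtracted."""
    control_rows = [row for row, label in zip(matrix, labels) if label == reference_key]
    if not control_rows:
        raise ValueError(f"No rows labelled {reference_key!r}")
    control_mean = [sum(column) / len(control_rows) for column in zip(*control_rows)]
    return [[value - mean for value, mean in zip(row, control_mean)] for row in matrix]

# Toy data: two control cells and one perturbed cell, two features each.
matrix = [[1.0, 2.0], [3.0, 4.0], [5.0, 1.0]]
labels = ["control", "control", "KLF1"]
diff = control_diff(matrix, labels)
```

After the subtraction the control rows average to zero, so the remaining rows express each cell as a shift away from the control.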
Examples
Example usage with PseudobulkSpace:
>>> import pertpy as pt
>>> mdata = pt.dt.papalexi_2021()
>>> ps = pt.tl.PseudobulkSpace()
>>> diff_adata = ps.compute_control_diff(mdata["rna"], target_col="gene_target", reference_key="NT")
- MLPClassifierSpace.label_transfer(adata, *, target_column='perturbation', column_uncertainty_score_key='perturbation_transfer_uncertainty', target_val='unknown', neighbors_key='neighbors', **kwargs)
Impute missing values in the specified column using KNN imputation in the space defined by use_rep.
Uncertainty is calculated as the entropy of the label distribution in the neighborhood of the target cell. In other words, a cell whose neighbors all share the same label has an uncertainty of 0, whereas a cell whose neighbors carry many different labels has a high uncertainty.
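The entropy-based uncertainty can be sketched as follows. The helpers `label_entropy` and `transfer_label` are hypothetical, the natural logarithm is an assumption, and the unweighted majority vote is only one plausible way to choose the imputed label.

```python
# Illustrative sketch only: entropy of neighbor labels as an uncertainty score.
# The log base and the unweighted vote are assumptions.
import math
from collections import Counter

def label_entropy(neighbor_labels):
    """Shannon entropy (natural log) of the label distribution among neighbors."""
    counts = Counter(neighbor_labels)
    total = len(neighbor_labels)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())

def transfer_label(neighbor_labels):
    """Pick the most frequent label among the neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]

confident = ["KLF1"] * 5                         # all neighbors agree
mixed = ["KLF1", "KLF1", "KLF1", "CEBPA", "MAPK1"]  # neighbors disagree

confident_score = label_entropy(confident)
mixed_score = label_entropy(mixed)
imputed = transfer_label(mixed)
```

The entropy is 0 when all neighbors agree and reaches its maximum, the logarithm of the number of distinct labels, when the labels are evenly spread.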
- Parameters:
adata (AnnData) – The AnnData object containing single-cell data.
target_column (str, default: 'perturbation') – The column name in adata.obs to perform imputation on.
column_uncertainty_score_key (str, default: 'perturbation_transfer_uncertainty') – The column name in adata.obs to store the uncertainty score of the label transfer.
target_val (str, default: 'unknown') – The target value to impute.
neighbors_key (str, default: 'neighbors') – The key in adata.uns where the neighbors are stored.
- Return type:
Examples
>>> import pertpy as pt
>>> import scanpy as sc
>>> import numpy as np
>>> adata = sc.datasets.pbmc68k_reduced()
>>> # randomly dropout 10% of the data annotations
>>> adata.obs["perturbation"] = adata.obs["louvain"].astype(str).copy()
>>> random_cells = np.random.choice(adata.obs.index, int(adata.obs.shape[0] * 0.1), replace=False)
>>> adata.obs.loc[random_cells, "perturbation"] = "unknown"
>>> sc.pp.neighbors(adata)
>>> sc.tl.umap(adata)
>>> ps = pt.tl.PseudobulkSpace()
>>> ps.label_transfer(adata)
- MLPClassifierSpace.subtract(adata, *, perturbations, reference_key='control', ensure_consistency=False, target_col='perturbation')
Subtract perturbations linearly. Assumes input of size n_perts x dimensionality.
- Parameters:
adata (AnnData) – AnnData object of size n_perts x dim.
reference_key (str, default: 'control') – Perturbation source from which the perturbation subtraction starts.
ensure_consistency (bool, default: False) – Whether to run differential expression on all data matrices to ensure consistency of linear space.
target_col (str, default: 'perturbation') – .obs column name that stores the label of the perturbation applied to each cell.
- Return type:
- Returns:
AnnData object of size (n_perts+1) x dim, where the last row is the subtraction of the specified perturbations. If ensure_consistency is True, returns a tuple of (new_perturbation, adata), where adata is the AnnData object provided as input, updated using compute_control_diff.
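Linear subtraction is the inverse of linear addition, which a short round trip on plain vectors makes visible. The helper `combine`, the dictionary layout, and the row name `"ATF2-BRD4-CUL3"` are hypothetical and not taken from pertpy's source.

```python
# Illustrative sketch only: subtract perturbation rows from a reference row,
# then add them back. Names and layout are assumptions.

def combine(rows, perturbations, reference_key, sign):
    """Start at the reference row and add (sign=+1) or subtract (sign=-1) rows."""
    result = list(rows[reference_key])
    for name in perturbations:
        result = [a + sign * b for a, b in zip(result, rows[name])]
    return result

rows = {
    "ATF2": [4.0, 1.0],
    "BRD4": [1.0, 0.5],
    "CUL3": [2.0, -1.0],
}

# Subtract BRD4 and CUL3 from ATF2 ...
subtracted = combine(rows, ["BRD4", "CUL3"], reference_key="ATF2", sign=-1)
rows["ATF2-BRD4-CUL3"] = subtracted

# ... and add them back to recover the original ATF2 row.
restored = combine(rows, ["BRD4", "CUL3"], reference_key="ATF2-BRD4-CUL3", sign=+1)
```

The round trip only holds exactly because the space is treated as linear, which is the property ensure_consistency is meant to check.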
Examples
Example usage with PseudobulkSpace:
>>> import pertpy as pt
>>> mdata = pt.dt.papalexi_2021()
>>> ps = pt.tl.PseudobulkSpace()
>>> ps_adata = ps.compute(mdata["rna"], target_col="gene_target", groups_col="gene_target")
>>> new_perturbation = ps.subtract(ps_adata, reference_key="ATF2", perturbations=["BRD4", "CUL3"])