Note

This page was generated from milo.ipynb. Some tutorial content may look better in light mode.

Milo - KNN based differential abundance analysis¶

This notebook demonstrates how to run differential abundance analysis on single-cell datasets using the Milo framework.

Many biological conditions (disease, development, genetic KOs) can induce shifts in cell composition, where cells of a given state become enriched or depleted in response to a perturbation. With differential abundance analysis, we quantify consistent changes in cell composition across replicate samples. While differential abundance analysis can be performed on cell type clusters, it’s not always possible or practical to use precisely labeled clusters, especially when we are interested in studying transitional states, such as during developmental processes, or when we expect only a subpopulation of a cell type to be affected by the condition of interest. Milo is a method to detect compositional changes occurring in smaller subpopulations of cells, defined as neighbourhoods over the k-nearest neighbor (KNN) graph of cell-cell similarities.

[ ]:

%load_ext autoreload
%autoreload 2

Setup¶

[ ]:

import matplotlib
import matplotlib.pyplot as plt
import pertpy as pt
import scanpy as sc

plt.rcParams["figure.figsize"] = (7, 7)
import numpy as np

sc.settings.verbosity = 3

Dataset¶

For this vignette we will use a dataset of blood cells from COVID-19 patients - the Stephenson et al. 2021 dataset. The original dataset is available via cellxgene. Here cells were subsampled to have 500 cells per donor, and predicted doublets and low quality cells were filtered out.

[ ]:

adata = pt.data.stephenson_2021_subsampled()

[ ]:

adata

AnnData object with n_obs × n_vars = 62509 × 16299
    obs: 'Collection_Day', 'Swab_result', 'Status', 'Smoker', 'Status_on_day_collection', 'Status_on_day_collection_summary', 'Days_from_onset', 'Site', 'time_after_LPS', 'Worst_Clinical_Status', 'Outcome', 'patient_id', 'author_cell_type', 'organism', 'sex', 'tissue', 'ethnicity', 'disease', 'assay', 'cell_type', 'dataset_group'
    var: 'gene_id', 'gene_name'
    uns: 'log1p', 'umap'
    obsm: 'X_scVI', 'X_umap'

This object collects data from PBMCs of 119 individuals (adata.obs['patient_id']), with varying severity of COVID-19 |(adata.obs['Status_on_day_collection_summary']). Samples were collected from three different hospitals (adata.obs['Site']). The dataset also includes 6 samples from PBMCs of healthy patients stimulated with LPS to simulate anti-viral response, which we will exclude in subsequent analysis.

[ ]:

adata.obs["COVID_severity"] = adata.obs["Status_on_day_collection_summary"].copy()  # short name
adata.obs[["patient_id", "COVID_severity"]].drop_duplicates().value_counts("COVID_severity")

COVID_severity
Moderate        30
Healthy         23
Mild            23
Critical        15
Severe          13
Asymptomatic     9
LPS_10hours      3
LPS_90mins       3
dtype: int64

Here the data from different patients and sites has been integrated in a common latent space, minimizing the differences between donors, by mapping with scArches using an scVI model trained on healthy PBMCs (see here for references on integration). We can visualize the batch-corrected space using UMAP.

[ ]:

sc.pl.umap(adata, color=["author_cell_type"], legend_loc="on data")
sc.pl.umap(adata, color=["Site", "Status"], wspace=0.3, size=8)

../../_images/tutorials_notebooks_milo_10_0.png

../../_images/tutorials_notebooks_milo_10_1.png

In this dataset we are interested in identifying cell populations that increase in abundance in response to COVID-19, and with varying degrees of severity COVID-19 severity.

### Prepare for Milo analysis

[ ]:

## Exclude LPS samples
adata = adata[adata.obs["Status"] != "LPS"].copy()

## Initialize object for Milo analysis
milo = pt.tl.Milo()
mdata = milo.load(adata)

When initializing the Milo object, we create a MuData object which will store both the gene expression matrices (rna view) and cell count matrices used for differential abundance analysis (milo view).

[ ]:

mdata

MuData object with n_obs × n_vars = 59873 × 16299
  2 modalities
    rna:    59873 x 16299
      obs:  'Collection_Day', 'Swab_result', 'Status', 'Smoker', 'Status_on_day_collection', 'Status_on_day_collection_summary', 'Days_from_onset', 'Site', 'time_after_LPS', 'Worst_Clinical_Status', 'Outcome', 'patient_id', 'author_cell_type', 'organism', 'sex', 'tissue', 'ethnicity', 'disease', 'assay', 'cell_type', 'dataset_group', 'COVID_severity'
      var:  'gene_id', 'gene_name'
      uns:  'log1p', 'umap', 'author_cell_type_colors', 'Site_colors', 'Status_colors', 'neighbors'
      obsm: 'X_scVI', 'X_umap'
      obsp: 'distances', 'connectivities'
    milo_compositional:     0 x 0

Build KNN graph¶

We can use scanpy functions to build a KNN graph. We set the dimensionality and value for k to use in subsequent steps.

Here the value of k indicates the smallest possible size of neighbourhood in which we will quantify differential abundance (i.e. with k=50 the smallest neighbourhood will have 50 cells). Depending on the number of samples, you might want to use a high value of k for neighbourhood analysis, to have sufficient power to estimate abundance fold-changes. Since here we have data from > 100 patients, we set k=150 to have on average more than one cell per donor in each neighbourhood.

[ ]:

sc.pp.neighbors(mdata["rna"], use_rep="X_scVI", n_neighbors=150)

computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:02:14)

Construct neighbourhoods¶

This step assigns cells to a set of representative neighbourhoods on the KNN graph.

[ ]:

milo.make_nhoods(mdata["rna"], prop=0.1)

The assignment of cells to neighbourhoods is stored as a sparse binary matrix in mdata['rna'].obsm. Here we see that cells have been assigned to 4307 neighbourhoods.

[ ]:

mdata["rna"].obsm["nhoods"]

<59873x4307 sparse matrix of type '<class 'numpy.float32'>'
        with 1291378 stored elements in Compressed Sparse Row format>

The information on which cells are sampled as index cells of representative neighbourhoods is stored in mdata['rna'].obs, along with the distance of the index to the kth nearest neighbor, which is used later for the SpatialFDR correction.

[ ]:

mdata["rna"][mdata["rna"].obs["nhood_ixs_refined"] != 0].obs[["nhood_ixs_refined", "nhood_kth_distance"]]

	nhood_ixs_refined	nhood_kth_distance
10_1038_s41591_021_01329_2-S12_CTGTGCTAGCTCCTCT-1	1	0.971554
10_1038_s41591_021_01329_2-S11_TAAGCGTAGATCGGGT-1	1	1.470572
10_1038_s41591_021_01329_2-S12_CTTCTCTAGGCAGGTT-1	1	0.824604
10_1038_s41591_021_01329_2-S11_ACGGAGACACCGAAAG-1	1	1.441965
10_1038_s41591_021_01329_2-S12_AACCATGGTCCATCCT-1	1	0.751696
...	...	...
10_1038_s41591_021_01329_2-GGGCATCGTGTCAATC-newcastle74	1	0.876956
10_1038_s41591_021_01329_2-TGCACCTGTCAAAGCG-newcastle74	1	1.203800
10_1038_s41591_021_01329_2-ATTTCTGCATGAAGTA-newcastle74	1	0.904092
10_1038_s41591_021_01329_2-GTCGTAATCTCGAGTA-newcastle74	1	1.094548
10_1038_s41591_021_01329_2-TTCGGTCAGGGCTTCC-newcastle74	1	0.839677

4307 rows × 2 columns

We can visualize the distribution of neighbourhood sizes, to check that the minimal value of k makes sense, and that the distribution of sizes is not too wide.

[ ]:

nhood_size = np.array(mdata["rna"].obsm["nhoods"].sum(0)).ravel()
plt.hist(nhood_size, bins=100)
plt.xlabel("# cells in nhood")
plt.ylabel("# nhoods");

../../_images/tutorials_notebooks_milo_26_0.png

Count cells in neighbourhoods¶

Milo leverages the variation in cell numbers between replicates for the same experimental condition to test for differential abundance. Therefore we have to count how many cells from each sample (in this case the patient) are in each neighbourhood. We need to use the cell metadata saved in adata.obs and specify which column contains the sample information.

[ ]:

mdata = milo.count_nhoods(mdata, sample_col="patient_id")

This function populates the modality milo to mdata.

mdata[‘milo’] is an anndata object where obs correspond to samples and vars correspond to neighbourhoods, and where .X stores the number of cells from each sample counted in a neighbourhood. This count matrix will be used for DA testing.

[ ]:

mdata

MuData object with n_obs × n_vars = 59873 × 16299
  2 modalities
    rna:    59873 x 16299
      obs:  'Collection_Day', 'Swab_result', 'Status', 'Smoker', 'Status_on_day_collection', 'Status_on_day_collection_summary', 'Days_from_onset', 'Site', 'time_after_LPS', 'Worst_Clinical_Status', 'Outcome', 'patient_id', 'author_cell_type', 'organism', 'sex', 'tissue', 'ethnicity', 'disease', 'assay', 'cell_type', 'dataset_group', 'COVID_severity', 'nhood_ixs_random', 'nhood_ixs_refined', 'nhood_kth_distance'
      var:  'gene_id', 'gene_name'
      uns:  'log1p', 'umap', 'author_cell_type_colors', 'Site_colors', 'Status_colors', 'neighbors', 'nhood_neighbors_key'
      obsm: 'X_scVI', 'X_umap', 'nhoods'
      obsp: 'distances', 'connectivities'
    milo_compositional:     113 x 4307
      var:  'index_cell', 'kth_distance'
      uns:  'sample_col'

[ ]:

mdata["milo"]

AnnData object with n_obs × n_vars = 113 × 4307
    var: 'index_cell', 'kth_distance'
    uns: 'sample_col'

Differential abundance testing with GLM¶

We are now ready to test for differential abundance in time. The experimental design needs to be specified with R-style formulas.

Here we run a simple comparison, testing for changes in cell abundance associated with COVID-19.

[ ]:

# Reorder categories
# (by default, the last category is taken as the condition of interest)
mdata["rna"].obs["Status"] = mdata["rna"].obs["Status"].cat.reorder_categories(["Healthy", "Covid"])
milo.da_nhoods(mdata, design="~Status")

We can explicitly specify the comparison we want to make using the parameter model_contrasts. This is especially important when we have more than two levels for a category and we want to compare against a specific control level.

[ ]:

milo.da_nhoods(mdata, design="~Status", model_contrasts="StatusCovid-StatusHealthy")

In addition, we might want to take into account potential confounders that could affect cell abundances, for example batch effects. We can include confounders in the model using the syntax ~ confounders + condition, where the covariate specified in the last term is always the covariate of interest. In this case we can estimate the effect of COVID-19 on cell abundance regressing out changes in cell abundance driven by the site of collection.

See below for more examples on how to specify different types of comparisons.

[ ]:

milo.da_nhoods(mdata, design="~Site+Status", model_contrasts="StatusCovid")

Information about the sample design is stored in mdata['milo'].obs:

[ ]:

mdata["milo"].obs

	Status	Site	patient_id
AP1	Covid	Sanger	AP1
AP2	Covid	Sanger	AP2
AP3	Covid	Sanger	AP3
AP4	Covid	Sanger	AP4
AP5	Covid	Sanger	AP5
...	...	...	...
newcastle21v2	Covid	Ncl	newcastle21v2
newcastle49	Covid	Ncl	newcastle49
newcastle59	Covid	Ncl	newcastle59
newcastle65	Healthy	Ncl	newcastle65
newcastle74	Healthy	Ncl	newcastle74

113 rows × 3 columns

The differential abundance test results are stored in milo_mdata['milo'].var. In particular:

logFC: stores the log-Fold Change in abundance (i.e. the slope of the linear model)
PValue stores the p-value for the test
SpatialFDR stores the p-values adjusted for multiple testing (accounting for overlap between neighbourhoods)

[ ]:

mdata["milo"].var

	index_cell	kth_distance	SpatialFDR	Nhood_size	nhood_annotation	nhood_annotation_frac	logFC	logCPM	F	PValue	FDR
0	10_1038_s41591_021_01329_2-S12_CTGTGCTAGCTCCTCT-1	0.971554	0.754070	261.0	Mixed	0.367816	-0.232552	8.555860	0.410906	0.522339	0.737128
1	10_1038_s41591_021_01329_2-S11_TAAGCGTAGATCGGGT-1	1.470572	0.463495	255.0	Mixed	0.200000	-0.548279	8.563358	1.619824	0.204785	0.439248
2	10_1038_s41591_021_01329_2-S12_CTTCTCTAGGCAGGTT-1	0.824604	0.971190	287.0	CD4.CM	0.634146	-0.037593	8.600876	0.010044	0.920285	0.967389
3	10_1038_s41591_021_01329_2-S11_ACGGAGACACCGAAAG-1	1.441965	0.424277	225.0	Mixed	0.351111	0.572683	8.451508	1.829268	0.177939	0.400619
4	10_1038_s41591_021_01329_2-S12_AACCATGGTCCATCCT-1	0.751696	0.996721	216.0	Mixed	0.472222	-0.009325	8.396741	0.000068	0.993406	0.996414
...	...	...	...	...	...	...	...	...	...	...	...
4302	10_1038_s41591_021_01329_2-GGGCATCGTGTCAATC-ne...	0.876956	0.045042	338.0	CD83_CD14_mono	0.961538	-1.158965	8.726121	7.728468	0.006024	0.038664
4303	10_1038_s41591_021_01329_2-TGCACCTGTCAAAGCG-ne...	1.203800	0.086444	237.0	CD8.EM	0.687764	-0.944625	8.514239	5.963366	0.015589	0.075946
4304	10_1038_s41591_021_01329_2-ATTTCTGCATGAAGTA-ne...	0.904092	0.433902	250.0	CD8.TE	0.780000	-0.582428	8.531476	1.777272	0.184198	0.410207
4305	10_1038_s41591_021_01329_2-GTCGTAATCTCGAGTA-ne...	1.094548	0.392866	276.0	Mixed	0.532609	0.586302	8.618117	2.023270	0.156665	0.370135
4306	10_1038_s41591_021_01329_2-TTCGGTCAGGGCTTCC-ne...	0.839677	0.233195	273.0	CD8.TE	0.860806	-0.792698	8.581984	3.409528	0.066491	0.214353

4307 rows × 11 columns

We can start inspecting the results of our DA analysis from a couple of standard diagnostic plots.

[ ]:

old_figsize = plt.rcParams["figure.figsize"]
plt.rcParams["figure.figsize"] = [10, 5]
plt.subplot(1, 2, 1)
plt.hist(mdata["milo"].var.PValue, bins=50)
plt.xlabel("P-Vals")
plt.subplot(1, 2, 2)
plt.plot(mdata["milo"].var.logFC, -np.log10(mdata["milo"].var.SpatialFDR), ".")
plt.xlabel("log-Fold Change")
plt.ylabel("- log10(Spatial FDR)")
plt.tight_layout()
plt.rcParams["figure.figsize"] = old_figsize

../../_images/tutorials_notebooks_milo_43_0.png

Visualize results on embedding¶

To visualize DA results relating them to the embedding of single cells, we can build an abstracted graph of neighbourhoods that we can superimpose on the single-cell embedding. Here each node represents a neighbourhood, and the layout of nodes is determined by the position of the index cell in the UMAP embedding of all single-cells. The neighbourhoods displaying singificant DA are colored by their log-Fold Change.

[ ]:

milo.build_nhood_graph(mdata)

[ ]:

plt.rcParams["figure.figsize"] = [7, 7]
milo.plot_nhood_graph(
    mdata,
    alpha=0.1,  ## SpatialFDR level (1%)
    min_size=1,  ## Size of smallest dot
)

../../_images/tutorials_notebooks_milo_47_0.png

Visualize result by celltype¶

We might want to visualize whether DA is particularly evident in certain cell types. To do this, we assign a cell type label to each neighbourhood by finding the most abundant cell type within cells in each neighbourhood (after all, neighbourhoods are in most cases small subpopulations within the same cell type). We can label neighbourhoods in the results data.frame using the function milo.annotate_nhoods. This also saves the fraction of cells harbouring the label.

[ ]:

milo.annotate_nhoods(mdata, anno_col="author_cell_type")

We can see that for the majority of neighbourhoods, almost all cells have the same cell type label. We can rename neighbourhoods where less than 60% of the cells have the top label as “Mixed”

[ ]:

plt.hist(mdata["milo"].var["nhood_annotation_frac"], bins=30)
plt.xlabel("celltype fraction")

Text(0.5, 0, 'celltype fraction')

../../_images/tutorials_notebooks_milo_51_1.png

[ ]:

mdata["milo"].var.loc[mdata["milo"].var["nhood_annotation_frac"] < 0.6, "nhood_annotation"] = "Mixed"

Now we can visualize the fold changes by cell type annotation.

[ ]:

milo.plot_da_beeswarm(mdata, alpha=0.1)

../../_images/tutorials_notebooks_milo_54_0.png

This shows that neighbourhoods of Plasmablast cells, malignant B cells and monocytes are especially enriched in cells from COVID-19 samples.

We can check the effect size by visualizing the cell counts directly

[ ]:

## Get IDs of plasmablast neighbourhood
pl_nhoods = mdata["milo"].var_names[
    (mdata["milo"].var["SpatialFDR"] < 0.1) & (mdata["milo"].var["nhood_annotation"] == "Plasmablast")
]

## Visualize cell counts by condition (x-axis) and individuals on all neighbourhoods
milo.plot_nhood_counts_by_cond(mdata, test_var="Status", subset_nhoods=pl_nhoods, log_counts=False)

../../_images/tutorials_notebooks_milo_56_0.png

[ ]:

## or one neighbourhood at a time
with matplotlib.rc_context({"figure.figsize": [12, 6]}):
    for i, nh in enumerate(pl_nhoods):
        plt.subplot(2, 5, i + 1)
        milo.plot_nhood_counts_by_cond(mdata, test_var="Status", subset_nhoods=nh, log_counts=False)
        plt.title(f"Nhood {nh}")
    plt.tight_layout()

../../_images/tutorials_notebooks_milo_57_0.png

## Specifying DA comparisons - examples using model contrasts

We can compute fold-changes and p-values for multiple comparisons on the same set of neighbourhoods, by specifying the test of interest in the design and model_contrasts parameter.

For example, we can find neighbourhoods enriched in cells from Asymptomatic patients, compared to healthy individuals:

[ ]:

severity_order = ["Healthy", "Asymptomatic", "Mild", "Moderate", "Severe", "Critical"]
mdata["rna"].obs["COVID_severity"] = mdata["rna"].obs["COVID_severity"].cat.reorder_categories(severity_order)

[ ]:

milo.da_nhoods(mdata, design="~Site+COVID_severity", model_contrasts="COVID_severityAsymptomatic")
# no need to set "-COVID_severityHealthy", since this is the reference level

[ ]:

## Get IDs of enriched neighbourhoods
pl_nhoods = mdata["milo"].var_names[(mdata["milo"].var["SpatialFDR"] < 0.01) & (mdata["milo"].var["logFC"] > 2)]

## Visualize cell counts by condition (x-axis) and individuals on all signif neighbourhoods
milo.plot_nhood_counts_by_cond(mdata, test_var="COVID_severity", subset_nhoods=pl_nhoods, log_counts=False)

../../_images/tutorials_notebooks_milo_61_0.png

or comparing Asymptomatic patients againsts all other COVID-19 patients:

[ ]:

milo.da_nhoods(
    mdata,
    design="~Site+COVID_severity",
    model_contrasts="COVID_severityAsymptomatic-(COVID_severityMild + COVID_severityModerate + COVID_severitySevere + COVID_severityCritical)",
)

[ ]:

## Get IDs of enriched neighbourhoods
pl_nhoods = mdata["milo"].var_names[(mdata["milo"].var["SpatialFDR"] < 0.1) & (mdata["milo"].var["logFC"] > 2)]

## Visualize cell counts by condition (x-axis) and individuals on all signif neighbourhoods
milo.plot_nhood_counts_by_cond(mdata, test_var="COVID_severity", subset_nhoods=pl_nhoods, log_counts=False)

../../_images/tutorials_notebooks_milo_64_0.png

In this case, we could also be interested in finding where cell abundance increases or decreases linearly with COVID-19 severity. To do this, we need to encode COVID severity as a continuous variable

[ ]:

severity_order = ["Healthy", "Asymptomatic", "Mild", "Moderate", "Severe", "Critical"]
mdata["rna"].obs["COVID_severity_continuous"] = mdata["rna"].obs["COVID_severity"].cat.codes

[ ]:

milo.da_nhoods(mdata, design="~Site+COVID_severity_continuous")

[ ]:

## Get IDs of enriched neighbourhoods
pl_nhoods = mdata["milo"].var_names[
    (mdata["milo"].var["SpatialFDR"] < 0.1) & (mdata["milo"].var["logFC"] > 0)
]  ## notice how here logFCs are much smaller

## Add COVID severity labels to mdata['milo'].obs
milo.add_covariate_to_nhoods_var(mdata, ["COVID_severity"])
mdata["milo"].obs["COVID_severity"] = (
    mdata["milo"].obs["COVID_severity"].astype("category").cat.reorder_categories(severity_order)
)

## Visualize cell counts by condition (x-axis) and individuals on all signif neighbourhoods
milo.plot_nhood_counts_by_cond(mdata, test_var="COVID_severity", subset_nhoods=pl_nhoods, log_counts=False)

../../_images/tutorials_notebooks_milo_68_0.png

Conclusion¶

Milo is a computational framework designed for differential abundance analysis in single-cell datasets, enabling the detection of subtle compositional changes in cell populations by analyzing neighborhoods in a k-nearest neighbor graph of cell-cell similarities, particularly useful in scenarios where cell type clusters are imprecise or transitional states are of interest.

References¶

Dann, E., Henderson, N.C., Teichmann, S.A. et al. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat Biotechnol 40, 245–253 (2022). https://doi.org/10.1038/s41587-021-01033-z
Stephenson, E., Reynolds, G., Botting, R.A. et al. Single-cell multi-omics analysis of the immune response in COVID-19. Nat Med 27, 904–916 (2021). https://doi.org/10.1038/s41591-021-01329-2