Note

This page was generated from sccoda.ipynb. Some tutorial content may look better in light mode.

scCODA - Compositional analysis of labeled single-cell data¶

This notebook serves as a tutorial for using the scCODA model (Büttner, Ostner et al., 2021) to analyze changes in cell composition data. Inspired by methods for compositional analysis of microbiome data, scCODA proposes a Bayesian approach to address the low replicate issue as commonly encountered in single-cell analysis. It models cell-type counts using a hierarchical Dirichlet-Multinomial model, which accounts for uncertainty in cell-type proportions and the negative correlative bias via joint modeling of all measured cell-type proportions. To ensure a uniquely identifiable solution and easy interpretability, the reference in scCODA is chosen to be a specific cell type. Hence, any detected compositional changes by scCODA always have to be viewed in relation to the selected reference.

Setup¶

[1]:

import warnings

warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import mudata as mu
import pertpy as pt

Installed version 0.4.0 of pertpy is newer than the latest release 0.3.0! You are running a nightly version and 
features may break!

Data preparation¶

The data we use in the following example comes from Haber et al., 2017. It contains samples from the small intestinal epithelium of mice with different conditions. We first load the raw cell-level data. The dataset contains gene expressions of 9842 cells. They are annotated with their sample identifier (batch), the condition of the subjects and the type of each cell (cell_label).

[2]:

haber_cells = pt.dt.haber_2017_regions()
haber_cells.obs

[2]:

	batch	barcode	condition	cell_label
index
B1_AAACATACCACAAC_Control_Enterocyte.Progenitor	B1	AAACATACCACAAC	Control	Enterocyte.Progenitor
B1_AAACGCACGAGGAC_Control_Stem	B1	AAACGCACGAGGAC	Control	Stem
B1_AAACGCACTAGCCA_Control_Stem	B1	AAACGCACTAGCCA	Control	Stem
B1_AAACGCACTGTCCC_Control_Stem	B1	AAACGCACTGTCCC	Control	Stem
B1_AAACTTGACCACCT_Control_Enterocyte.Progenitor	B1	AAACTTGACCACCT	Control	Enterocyte.Progenitor
...	...	...	...	...
B10_TTTCACGACAAGCT_Salmonella_TA	B10	TTTCACGACAAGCT	Salmonella	TA
B10_TTTCAGTGAGGCGA_Salmonella_Enterocyte	B10	TTTCAGTGAGGCGA	Salmonella	Enterocyte
B10_TTTCAGTGCGACAT_Salmonella_Stem	B10	TTTCAGTGCGACAT	Salmonella	Stem
B10_TTTCAGTGTGACCA_Salmonella_Endocrine	B10	TTTCAGTGTGACCA	Salmonella	Endocrine
B10_TTTCAGTGTTCTCA_Salmonella_Enterocyte.Progenitor	B10	TTTCAGTGTTCTCA	Salmonella	Enterocyte.Progenitor

9842 rows × 4 columns

We now create an instance of a scCODA model and use it to aggregate the (cells x genes) data to a (samples x cell types) format by counting the cells for each category. The resulting MuData object sccoda_data includes both data aggregation levels as modalities rna and coda.

Looking at the data, we see that we have 4 control samples, and 3 conditions with 2 samples each. The resulting object separates our data components: Cell counts are stored in data.X, covariates in data.obs.

Further, it must contain one column or a set of columns (e.g. subject id, treatment, disease status) that uniquely identify each (statistical) sample. Further covariates (e.g. subject age) can either be specified via addidional column names in adata.obs, a key in adata.uns, or as a separate DataFrame.

[3]:

sccoda_model = pt.tl.Sccoda()
sccoda_data = sccoda_model.load(
    haber_cells,
    type="cell_level",
    generate_sample_level=True,
    cell_type_identifier="cell_label",
    sample_identifier="batch",
    covariate_obs=["condition"],
)
print(sccoda_data)
print(sccoda_data["coda"].X)
print(sccoda_data["coda"].obs)

Installed version 0.4.0 of pertpy is newer than the latest release 0.3.0! You are running a nightly version and 
features may break!

MuData object with n_obs × n_vars = 9852 × 15223
  2 modalities
    rna:        9842 x 15215
      obs:      'batch', 'barcode', 'condition', 'cell_label', 'scCODA_sample_id'
    coda:       10 x 8
      obs:      'condition', 'batch'
      var:      'n_cells'
[[ 36.  59. 136.  36. 239. 125. 191.  18.]
 [ 32. 373. 116.  67. 117.  65. 168.  12.]
 [  5.  46.  23.  20.  50.  11.  40.   5.]
 [ 45.  98. 188. 124. 250. 155. 365.  33.]
 [ 26. 221. 198.  36. 131. 130. 196.   4.]
 [ 52.  75. 347.  66. 323. 263. 313.  51.]
 [ 65. 126. 115.  33.  65.  39. 129.  59.]
 [ 42.  71. 203. 147. 271. 109. 180. 146.]
 [ 40.  57. 383. 170. 321. 244. 256.  71.]
 [ 37. 332. 113.  59.  90.  47. 132.  10.]]
                    condition batch
scCODA_sample_id
B1                    Control    B1
B10                Salmonella   B10
B2                    Control    B2
B3                    Control    B3
B4                    Control    B4
B5                 Hpoly.Day3    B5
B6                 Hpoly.Day3    B6
B7                Hpoly.Day10    B7
B8                Hpoly.Day10    B8
B9                 Salmonella    B9

Note:

The term “samples” in scCODA refers to its statistical meaning. Each sample should contain all cells that were processed under the same conditions (e.g. same subject, treatment, disease status, …), which will be combined into one row of the aggregated data set. When using load, the key(s) to uniquely identify the samples must be specified in sample_identifier. Additional covariates that do not further distinguish the samples but are needed to create the model formula can be passed via column names in adata.obs (covariate_obs), a key in adata.uns (covariate_uns), or as a separate DataFrame (covariate_df). When creating the aggregated dataset, scCODA will check the uniqueness of each additional covariate. Only if all cells in each sample have the same covariate value, it is possible to assign a value to the whole sample. If this is not the case, scCODA will skip the covariate with the message Covariate <xyz> has non-unique values! Skipping....

In our example, the column batch uniquely identifies each of the 10 samples, and we can pass condition as an additional covariate. To showcase the need for a list of sample_identifiers, consider the example where we obtained cells from three different subjects, which were split and then treated in three different ways:

cell	subject	treatment	age
cell1	S1	T1	35
cell2	S1	T3	35
cell3	S3	T2	42
cell4	S2	T1	50
cell5	S1	T1	35

In the aggregated data, we want nine samples (all combinations of subject and treatment). To achieve this, we need to pass the list ["subject", "treatment"] as the sample_identifier. The age column is unique for each subject (and therefore for each subject-treatment combination) and will thus be an additional covariate. The resulting aggregated obs DataFrame will be:

subject_treatment	subject	treatment	age
S1_T1	S1	T1	35
S1_T2	S1	T2	35
S1_T3	S1	T3	35
S2_T1	S2	T1	42
S2_T2	S2	T2	42
S2_T3	S2	T3	42
S3_T1	S3	T1	50
S3_T2	S3	T2	50
S3_T3	S3	T3	50

For our first example, we want to look at how the Salmonella infection influences the cell composition. Therefore, we create a subset of our compositional data that only contains the healthy and Salmonella-infected samples as a new data modality.

[4]:

# Select control and salmonella data
sccoda_data.mod["coda_salm"] = sccoda_data["coda"][
    sccoda_data["coda"].obs["condition"].isin(["Control", "Salmonella"])
].copy()
print(sccoda_data["coda_salm"])

AnnData object with n_obs × n_vars = 6 × 8
    obs: 'condition', 'batch'
    var: 'n_cells'

Plotting the data, we can see that there is a large increase of Enterocytes in the infected sampes, while most other cell types slightly decrease. Since scRNA-seq experiments are limited in the number of cells per sample, the count data is compositional, which leads to negative correlations between the cell types. Thus, the slight decreases in many cell types might be fully caused by the increase in Enterocytes.

[5]:

sccoda_model.plot_boxplots(sccoda_data, modality_key="coda_salm", feature_name="condition", add_dots=True)
plt.show()

../../_images/tutorials_notebooks_sccoda_12_0.png

Model setup and inference¶

We can now run parameter inference for the scCODA model to determine which cell types are credibly changing under Salmonella infection. To do this, we first need to specify all the necessary information through sccoda_model.prepare. We need to set:

The formula parameter. It specifies how the covariates are used in the model. It can process R-style formulas via the patsy package, e.g. formula="Cov1 + Cov2 + Cov3". Here, we simply use the “Condition” covariate of our dataset
The reference_cell_type parameter is used to specify a cell type that is believed to be unchanged by the covariates in formula. This is necessary, because compositional analysis must always be performed relative to a reference (See Büttner, Ostner et al., 2021 for a more thorough explanation). If no knowledge about such a cell type exists prior to the analysis, taking a cell type that has a nearly constant relative abundance over all samples is often a good choice. It is also possible to let scCODA find a suited reference cell type by using reference_cell_type="automatic". Here, we take Goblet cells as the reference.

[6]:

sccoda_data = sccoda_model.prepare(
    sccoda_data,
    modality_key="coda_salm",
    formula="condition",
    reference_cell_type="Goblet",
)
sccoda_data["coda_salm"]

[6]:

AnnData object with n_obs × n_vars = 6 × 8
    obs: 'condition', 'batch'
    var: 'n_cells'
    uns: 'scCODA_params'
    obsm: 'covariate_matrix', 'sample_counts'

No-U-turn HMC sampling is then initiated by calling sccoda_model.run_nuts().

[7]:

# Run MCMC
sccoda_model.run_nuts(sccoda_data, modality_key="coda_salm")
sccoda_data["coda_salm"]

sample: 100%|██████████| 11000/11000 [00:36<00:00, 303.04it/s, 127 steps of size 2.53e-02. acc. prob=0.83]

[7]:

AnnData object with n_obs × n_vars = 6 × 8
    obs: 'condition', 'batch'
    var: 'n_cells'
    uns: 'scCODA_params'
    obsm: 'covariate_matrix', 'sample_counts'
    varm: 'intercept_df', 'effect_df_condition[T.Salmonella]'

Result interpretation¶

Calling sccoda_model.summary(), we can see the most relevant information for further analysis:

[8]:

sccoda_model.summary(sccoda_data, modality_key="coda_salm")

                                          Compositional Analysis summary                                           
┌──────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────┐
│ Name                                         │ Value                                                            │
├──────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ Data                                         │ Data: 6 samples, 8 cell types                                    │
│ Reference cell type                          │ Goblet                                                           │
│ Formula                                      │ condition                                                        │
└──────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Intercepts                                                                                                      │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│                        Final Parameter  Expected Sample                                                         │
│ Cell Type                                                                                                       │
│ Endocrine                  1.110            34.745                                                              │
│ Enterocyte                 2.319           116.401                                                              │
│ Enterocyte.Progenitor      2.509           140.758                                                              │
│ Goblet                     1.741            65.303                                                              │
│ Stem                       2.693           169.193                                                              │
│ TA                         2.100            93.507                                                              │
│ TA.Early                   2.848           197.560                                                              │
│ Tuft                       0.426            17.532                                                              │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Effects                                                                                                         │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│                                              Final Parameter  Expected Sample  log2-fold change                 │
│ Covariate             Cell Type                                                                                 │
│ conditionT.Salmonella Endocrine                  0.000            24.831            -0.485                      │
│                       Enterocyte                 1.352           321.438             1.465                      │
│                       Enterocyte.Progenitor      0.000           100.596            -0.485                      │
│                       Goblet                     0.000            46.670            -0.485                      │
│                       Stem                       0.000           120.918            -0.485                      │
│                       TA                         0.000            66.827            -0.485                      │
│                       TA.Early                   0.000           141.190            -0.485                      │
│                       Tuft                       0.000            12.530            -0.485                      │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Model properties

First, the summary shows an overview over the model properties:

Number of samples/cell types
The reference cell type.
The formula used

The model has two types of parameters that are relevant for analysis - intercepts and effects. These can be interpreted like in a standard regression model: Intercepts show how the cell types are distributed without any active covariates, effects show how the covariates influence the cell types.

Intercepts

The first column of the intercept summary shows the parameters determined by the MCMC inference.

The “Expected sample” column gives some context to the numerical values. If we had a new sample (with no active covariates) with a total number of cells equal to the mean sampling depth of the dataset, then this distribution over the cell types would be most likely.

Effects

For the effect summary, the first column again shows the inferred parameters for all combinations of covariates and cell types. Most important is the distinctions between zero and non-zero entries A value of zero means that no statistically credible effect was detected. For a value other than zero, a credible change was detected. A positive sign indicates an increase, a negative sign a decrease in abundance.

Since the numerical values of the “Final Parameter” column are not straightforward to interpret, the “Expected sample” and “log2-fold change” columns give us an idea on the magnitude of the change. The expected sample is calculated for each covariate separately (covariate value = 1, all other covariates = 0), with the same method as for the intercepts. The log-fold change is then calculated between this expected sample and the expected sample with no active covariates from the intercept section. Since the data is compositional, cell types for which no credible change was detected, can still change in abundance as well, as soon as a credible effect is detected on another cell type due to the sum-to-one constraint. If there are no credible effects for a covariate, its expected sample will be identical to the intercept sample, therefore the log2-fold change is 0.

Interpretation

In the salmonella case, we see only a credible increase of Enterocytes, while all other cell types are unaffected by the disease. The log-fold change of Enterocytes between control and infected samples with the same total cell count lies at about 1.46.

We can also easily filter out all credible effects:

[9]:

sccoda_model.credible_effects(sccoda_data, modality_key="coda_salm")

[9]:

Covariate                Cell Type
condition[T.Salmonella]  Endocrine                False
                         Enterocyte                True
                         Enterocyte.Progenitor    False
                         Goblet                   False
                         Stem                     False
                         TA                       False
                         TA.Early                 False
                         Tuft                     False
Name: Final Parameter, dtype: bool

We can also visualize the results as a barplot:

[10]:

sccoda_model.plot_effects_barplot(sccoda_data, modality_key="coda_salm", parameter="Final Parameter")

[10]:

<seaborn.axisgrid.FacetGrid at 0x7f96d999b580>

../../_images/tutorials_notebooks_sccoda_22_1.png

Adjusting the False discovery rate¶

scCODA selects credible effects based on their inclusion probability. The cutoff between credible and non-credible effects depends on the desired false discovery rate (FDR). A smaller FDR value will produce more conservative results, but might miss some effects, while a larger FDR value selects more effects at the cost of a larger number of false discoveries.

The desired FDR level can be easily set after inference via sim_results.set_fdr(). Per default, the value is 0.05, but we recommend to increase it up to 0.2 if no effects are found at a more conservative level.

In our example, setting a desired FDR of 0.4 reveals small effects on Endocrine and Tuft cells. Keep in mind that we chose this value only for instructive purposes, since there are no credible effects beside Enterocytes at lower FDR levels. In practice, expecting 40% of all credible effects to be false-positives is usually not recommended.

[11]:

sccoda_model.set_fdr(sccoda_data, modality_key="coda_salm", est_fdr=0.4)
sccoda_model.summary(sccoda_data, modality_key="coda_salm")

                                          Compositional Analysis summary                                           
┌──────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────┐
│ Name                                         │ Value                                                            │
├──────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ Data                                         │ Data: 6 samples, 8 cell types                                    │
│ Reference cell type                          │ Goblet                                                           │
│ Formula                                      │ condition                                                        │
└──────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Intercepts                                                                                                      │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│                        Final Parameter  Expected Sample                                                         │
│ Cell Type                                                                                                       │
│ Endocrine                  1.110            34.745                                                              │
│ Enterocyte                 2.319           116.401                                                              │
│ Enterocyte.Progenitor      2.509           140.758                                                              │
│ Goblet                     1.741            65.303                                                              │
│ Stem                       2.693           169.193                                                              │
│ TA                         2.100            93.507                                                              │
│ TA.Early                   2.848           197.560                                                              │
│ Tuft                       0.426            17.532                                                              │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Effects                                                                                                         │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│                                              Final Parameter  Expected Sample  log2-fold change                 │
│ Covariate             Cell Type                                                                                 │
│ conditionT.Salmonella Endocrine                   0.282           32.626            -0.091                      │
│                       Enterocyte                  1.352          318.408             1.452                      │
│                       Enterocyte.Progenitor       0.000           99.648            -0.498                      │
│                       Goblet                      0.000           46.231            -0.498                      │
│                       Stem                        0.000          119.778            -0.498                      │
│                       TA                          0.000           66.197            -0.498                      │
│                       TA.Early                    0.000          139.860            -0.498                      │
│                       Tuft                       -0.013           12.252            -0.517                      │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Saving results¶

The MuData object containing all relevant information can be saved e.g. as a h5mu file. To access the saved data object, simply load it from disk and call the respective functions in an instance of a scCODA model

[12]:

# saving
path = "test"
sccoda_data.write_h5mu(path)

# loading
sccoda_data_2 = mu.read_h5mu(path)

sccoda_model.summary(sccoda_data_2, modality_key="coda_salm")

                                          Compositional Analysis summary                                           
┌──────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────┐
│ Name                                         │ Value                                                            │
├──────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ Data                                         │ Data: 6 samples, 8 cell types                                    │
│ Reference cell type                          │ Goblet                                                           │
│ Formula                                      │ condition                                                        │
└──────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Intercepts                                                                                                      │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│                        Final Parameter  Expected Sample                                                         │
│ Cell Type                                                                                                       │
│ Endocrine                  1.110            34.745                                                              │
│ Enterocyte                 2.319           116.401                                                              │
│ Enterocyte.Progenitor      2.509           140.758                                                              │
│ Goblet                     1.741            65.303                                                              │
│ Stem                       2.693           169.193                                                              │
│ TA                         2.100            93.507                                                              │
│ TA.Early                   2.848           197.560                                                              │
│ Tuft                       0.426            17.532                                                              │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Effects                                                                                                         │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│                                              Final Parameter  Expected Sample  log2-fold change                 │
│ Covariate             Cell Type                                                                                 │
│ conditionT.Salmonella Endocrine                   0.282           32.626            -0.091                      │
│                       Enterocyte                  1.352          318.408             1.452                      │
│                       Enterocyte.Progenitor       0.000           99.648            -0.498                      │
│                       Goblet                      0.000           46.231            -0.498                      │
│                       Stem                        0.000          119.778            -0.498                      │
│                       TA                          0.000           66.197            -0.498                      │
│                       TA.Early                    0.000          139.860            -0.498                      │
│                       Tuft                       -0.013           12.252            -0.517                      │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

[13]:

test_model = pt.tl.Sccoda()

[14]:

test_model.get_intercept_df(sccoda_data_2, modality_key="coda_salm")

[14]:

	Final Parameter	HDI 3%	HDI 97%	SD	Expected Sample
Cell Type
Endocrine	1.110	0.448	1.808	0.366	34.745194
Enterocyte	2.319	1.734	2.913	0.315	116.401017
Enterocyte.Progenitor	2.509	1.898	3.093	0.316	140.757882
Goblet	1.741	1.126	2.330	0.320	65.303215
Stem	2.693	2.130	3.282	0.309	169.193202
TA	2.100	1.454	2.700	0.337	93.507465
TA.Early	2.848	2.269	3.427	0.308	197.559789
Tuft	0.426	-0.274	1.162	0.391	17.532236

[15]:

test_model.get_effect_df(sccoda_data, modality_key="coda_salm")

[15]:

		Final Parameter	HDI 3%	HDI 97%	SD	Inclusion probability	Expected Sample	log2-fold change
Covariate	Cell Type
conditionT.Salmonella	Endocrine	0.282468	-0.476	1.097	0.319	0.4426	32.625962	-0.090793
	Enterocyte	1.351694	0.842	1.859	0.266	0.9996	318.408292	1.451774
	Enterocyte.Progenitor	0.000000	-0.358	0.630	0.150	0.3100	99.647678	-0.498308
	Goblet	0.000000	0.000	0.000	0.000	0.0000	46.230546	-0.498308
	Stem	0.000000	-0.805	0.193	0.217	0.3947	119.778085	-0.498308
	TA	0.000000	-0.843	0.382	0.238	0.3782	66.197370	-0.498308
	TA.Early	0.000000	-0.465	0.451	0.126	0.2806	139.859835	-0.498308
	Tuft	-0.012933	-0.892	0.987	0.312	0.3994	12.252233	-0.516965

[16]:

sccoda_data["coda_salm"].varm["intercept_df"]

[16]:

	Final Parameter	HDI 3%	HDI 97%	SD	Expected Sample
Cell Type
Endocrine	1.110	0.448	1.808	0.366	34.745194
Enterocyte	2.319	1.734	2.913	0.315	116.401017
Enterocyte.Progenitor	2.509	1.898	3.093	0.316	140.757882
Goblet	1.741	1.126	2.330	0.320	65.303215
Stem	2.693	2.130	3.282	0.309	169.193202
TA	2.100	1.454	2.700	0.337	93.507465
TA.Early	2.848	2.269	3.427	0.308	197.559789
Tuft	0.426	-0.274	1.162	0.391	17.532236

References¶

Büttner, M., Ostner, J., Müller, C.L. et al. scCODA is a Bayesian model for compositional single-cell data analysis. Nat Commun 12, 6876 (2021). https://doi.org/10.1038/s41467-021-27150-6
Haber, A., Biton, M., Rogel, N. et al. A single-cell survey of the small intestinal epithelium. Nature 551, 333–339 (2017). https://doi.org/10.1038/nature24489