Note
This page was generated from tasccoda.ipynb. Some tutorial content may look better in light mode.
tascCODA - Tree-aggregated compositional analysis¶
This notebook is a tutorial on how to use tascCODA Ostner et al., 2021 for tree-aggregated compositional analysis of high-throughput sequencing (HTS) data.
For this example, we use single-cell RNA sequencing data. However, there are no limitations to use tascCODA with other HTS data, such as 16S rRNA sequencing.
The particular dataset for this analysis was generated by Smillie et al., 2019. It contains samples from two different regions in the small intestine of mice - Epithelium and Lamina Propria - and three different inflammation conditions - healthy, non-inflamed and inflamed. In total, we have 365.492 cells from 51 cell types in 133 samples.
[1]:
import matplotlib.pyplot as plt
import pertpy as pt
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
ryp2 is not installed. Install with pip install rpy2 to run tools with R support.
Data setup¶
First, we read in the per-cell data. In this dataset, samples are specified as “Subject”, and cell types are denoted as “Cluster”. Interesting covariates are the samples’ location and health status. The columns “Major_l1” - “Major_l4” and “Cluster” describe a lineage tree over the cell types.
[3]:
smillie_counts = pt.dt.smillie_2019()
smillie_counts.obs
[3]:
Origin | Subject | Sample | Location | Replicate | Health | Cluster | nGene | nUMI | Major_l1 | Major_l2 | Major_l3 | Major_l4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
N7.EpiA.AAGGCTACCCTTTA | Imm | N7 | EpiA | Epi | A | Non-inflamed | Plasma | 624.0 | 7433.0 | Immune | Lymphoid | B cells | Plasma4 |
N7.EpiA.AAGGTGCTACGGAG | Imm | N7 | EpiA | Epi | A | Non-inflamed | CD8+ IELs | 558.0 | 1904.0 | Immune | Lymphoid | T cells | CD8+ T |
N7.EpiA.AAGTAACTTGCTTT | Imm | N7 | EpiA | Epi | A | Non-inflamed | CD8+ IELs | 437.0 | 1366.0 | Immune | Lymphoid | T cells | CD8+ T |
N7.EpiA.ACAATAACCCTCAC | Imm | N7 | EpiA | Epi | A | Non-inflamed | Plasma | 484.0 | 5161.0 | Immune | Lymphoid | B cells | Plasma4 |
N7.EpiA.ACAGTTCTTCTACT | Imm | N7 | EpiA | Epi | A | Non-inflamed | CD8+ IELs | 470.0 | 1408.0 | Immune | Lymphoid | T cells | CD8+ T |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
N110.LPB.TTTGGTTGTGTGGCTC | Epi | N110 | LPB | LP | B | Inflamed | Immature Enterocytes 2 | 2553.0 | 11705.0 | Epithelial | Epithelial | Absorptive | Immature cells |
N110.LPB.TTTGGTTTCCTTAATC | Epi | N110 | LPB | LP | B | Inflamed | TA 2 | 3234.0 | 16164.0 | Epithelial | Epithelial | Absorptive | TA cells |
N110.LPB.TTTGGTTTCTTACCTA | Epi | N110 | LPB | LP | B | Inflamed | Enterocyte Progenitors | 258.0 | 384.0 | Epithelial | Epithelial | Absorptive | Immature cells |
N110.LPB.TTTGTCAAGGATGGAA | Epi | N110 | LPB | LP | B | Inflamed | TA 1 | 487.0 | 772.0 | Epithelial | Epithelial | Absorptive | TA cells |
N110.LPB.TTTGTCAGTTGTTTGG | Epi | N110 | LPB | LP | B | Inflamed | TA 1 | 363.0 | 747.0 | Epithelial | Epithelial | Absorptive | TA cells |
365492 rows × 13 columns
[4]:
smillie_counts
[4]:
AnnData object with n_obs × n_vars = 365492 × 18172
obs: 'Origin', 'Subject', 'Sample', 'Location', 'Replicate', 'Health', 'Cluster', 'nGene', 'nUMI', 'Major_l1', 'Major_l2', 'Major_l3', 'Major_l4'
Next, we convert the data to a (samples x cell types) object. To identify the statistical samples, we need to combine the “Subject” and “Sample” columns. We also need to extract the tree-structured cell lineage information (Lineages Major_l1
- Major_l4
) from the data, generate a Newick string through tasccoda.tree_utils.df2newick()
and convert it into a ete3 tree object. This is all done within the load
function by specifying.
[5]:
tasccoda_model = pt.tl.Tasccoda()
smillie_data = tasccoda_model.load(
smillie_counts,
type="cell_level",
cell_type_identifier="Cluster",
sample_identifier=["Subject", "Sample"],
covariate_obs=["Location", "Health"],
levels_orig=["Major_l1", "Major_l2", "Major_l3", "Major_l4", "Cluster"],
add_level_name=True,
)
smillie_data
[5]:
MuData object with n_obs × n_vars = 365625 × 18223 2 modalities rna: 365492 x 18172 obs: 'Origin', 'Subject', 'Sample', 'Location', 'Replicate', 'Health', 'Cluster', 'nGene', 'nUMI', 'Major_l1', 'Major_l2', 'Major_l3', 'Major_l4', 'scCODA_sample_id' coda: 133 x 51 obs: 'Location', 'Health', 'Subject', 'Sample' var: 'n_cells' uns: 'tree'
We can also visualize the tree:
[6]:
tasccoda_model.plot_draw_tree(smillie_data["coda"])
[6]:
For this tutorial, we focus on finding changes between healthy and non-inflamed tissue in the Lamina Propria. Therefore, we subset the data accordingly, and end up with 48 samples:
[7]:
smillie_data.mod["coda_LP"] = smillie_data["coda"][
(smillie_data["coda"].obs["Health"].isin(["Healthy", "Non-inflamed"]))
& (smillie_data["coda"].obs["Location"] == "LP")
]
smillie_data
[7]:
MuData object with n_obs × n_vars = 365625 × 18223 3 modalities rna: 365492 x 18172 obs: 'Origin', 'Subject', 'Sample', 'Location', 'Replicate', 'Health', 'Cluster', 'nGene', 'nUMI', 'Major_l1', 'Major_l2', 'Major_l3', 'Major_l4', 'scCODA_sample_id' coda: 133 x 51 obs: 'Location', 'Health', 'Subject', 'Sample' var: 'n_cells' uns: 'tree' coda_LP: 48 x 51 obs: 'Location', 'Health', 'Subject', 'Sample' var: 'n_cells' uns: 'tree'
A quick plot of the data with the sccoda.util.data_visualization
module shows us that there are some changes in relative abundance in the data:
[8]:
tasccoda_model.plot_boxplots(smillie_data, modality_key="coda_LP", feature_name="Health", figsize=(20, 8))
plt.show()
Run tascCODA¶
Inferring credible effects with tascCODA on our data is now simply a matter of running the sampling process. The model creation and inference works analogous to scCODA (see the scCODA quickstart tutorial)
As hyperparameters, we have to specify in the prepare
function: - The reference cell type (which is assumed to be unchanged between the conditions): We use the automatic
setting here, which chooses a reference that induces minimal compositional effects (here we use the automatic suggestion - NK cells) - The model formula (R-style formula string, just as in scCODA) - The aggregation bias \(\phi\), defined in the tascCODA
paper. We go with an unbiased aggregation (pen_args={"phi": 0}
) here. - The key in .uns
of our MuData modality where the tree is saved.
Then, we run No-U-turn sampling via run_nuts
with the default settings of 11.000 samples, of which 1.000 are discarded as burn-in
[9]:
smillie_data = tasccoda_model.prepare(
smillie_data,
modality_key="coda_LP",
tree_key="tree",
reference_cell_type="automatic",
formula="Health",
pen_args={"phi": 0},
)
smillie_data
Automatic reference selection! Reference cell type set to NKs
Zero counts encountered in data! Added a pseudocount of 0.5.
[9]:
MuData object with n_obs × n_vars = 365625 × 18223 3 modalities rna: 365492 x 18172 obs: 'Origin', 'Subject', 'Sample', 'Location', 'Replicate', 'Health', 'Cluster', 'nGene', 'nUMI', 'Major_l1', 'Major_l2', 'Major_l3', 'Major_l4', 'scCODA_sample_id' coda: 133 x 51 obs: 'Location', 'Health', 'Subject', 'Sample' var: 'n_cells' uns: 'tree' coda_LP: 48 x 51 obs: 'Location', 'Health', 'Subject', 'Sample' var: 'n_cells' uns: 'tree', 'scCODA_params' obsm: 'covariate_matrix', 'sample_counts'
[10]:
tasccoda_model.run_nuts(smillie_data, modality_key="coda_LP")
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1701341923.141712 1 tfrt_cpu_pjrt_client.cc:349] TfrtCpuClient created.
sample: 100%|██████████| 11000/11000 [03:12<00:00, 57.14it/s, 63 steps of size 7.98e-02. acc. prob=0.91]
Result analysis¶
Calling summary
, we can see the most relevant information for further analysis:
[11]:
tasccoda_model.summary(smillie_data, modality_key="coda_LP")
Compositional Analysis summary ┌────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────┐ │ Name │ Value │ ├────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────┤ │ Data │ Data: 48 samples, 51 cell types │ │ Reference cell type │ NKs │ │ Formula │ Health │ └────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Intercepts │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Final Parameter Expected Sample │ │ Cell Type │ │ Best4+ Enterocytes -1.182 10.352 │ │ CD4+ Activated Fos-hi 1.561 160.805 │ │ CD4+ Activated Fos-lo 1.609 168.712 │ │ CD4+ Memory 1.319 126.241 │ │ CD4+ PD1+ -0.933 13.279 │ │ CD8+ IELs -0.112 30.180 │ │ CD8+ IL17+ -0.803 15.123 │ │ CD8+ LP 1.602 167.535 │ │ CD69+ Mast 0.592 61.019 │ │ CD69- Mast -1.354 8.716 │ │ Cycling B -0.477 20.951 │ │ Cycling Monocytes -0.568 19.129 │ │ Cycling T -0.700 16.763 │ │ Cycling TA 0.090 36.936 │ │ DC1 -0.604 18.452 │ │ DC2 0.352 48.000 │ │ Endothelial 0.118 37.985 │ │ Enterocyte Progenitors -0.904 13.670 │ │ Enterocytes -1.218 9.986 │ │ Enteroendocrine -1.266 9.518 │ │ Follicular 0.828 77.261 │ │ GC -0.913 13.547 │ │ Glia -0.104 30.423 │ │ Goblet -1.220 9.966 │ │ ILCs -0.845 14.501 │ │ Immature Enterocytes 1 -1.067 11.614 │ │ Immature Enterocytes 2 -0.826 14.779 │ │ Immature Goblet -0.327 24.342 │ │ Inflammatory Fibroblasts -0.970 12.797 │ │ Inflammatory Monocytes -0.537 19.731 │ │ M cells -1.398 8.341 │ │ MT-hi -0.716 16.497 │ │ Macrophages 1.572 162.583 │ │ Microvascular -0.298 25.058 │ │ Myofibroblasts -0.014 33.288 │ │ NKs -0.391 22.833 │ │ Pericytes -0.682 17.068 │ │ Plasma 3.580 1210.987 │ │ Post-capillary Venules 0.009 34.062 │ │ RSPO3+ -0.617 18.214 │ │ Secretory TA -0.912 13.561 │ │ Stem -0.798 15.198 │ │ TA 1 -0.018 33.155 │ │ TA 2 -0.174 28.366 │ │ Tregs 0.682 66.766 │ │ Tuft -1.442 7.982 │ │ WNT2B+ Fos-hi 0.738 70.611 │ │ WNT2B+ Fos-lo 1 1.117 103.151 │ │ WNT2B+ Fos-lo 2 0.555 58.803 │ │ WNT5B+ 1 0.384 49.560 │ │ WNT5B+ 2 0.785 74.009 │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Effects │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Effect Expected Sample log2-fold change │ │ Covariate Cell Type │ │ HealthT.Non-inflamed Best4+ Enterocytes 0.000 19.452 0.910 │ │ CD4+ Activated Fos-hi -0.676 153.688 -0.065 │ │ CD4+ Activated Fos-lo -1.031 113.061 -0.577 │ │ CD4+ Memory -0.676 120.654 -0.065 │ │ CD4+ PD1+ -0.676 12.691 -0.065 │ │ CD8+ IELs -0.676 28.845 -0.065 │ │ CD8+ IL17+ -0.676 14.453 -0.065 │ │ CD8+ LP -0.676 160.120 -0.065 │ │ CD69+ Mast -0.262 88.228 0.532 │ │ CD69- Mast -0.262 12.603 0.532 │ │ Cycling B -0.710 19.355 -0.114 │ │ Cycling Monocytes -0.603 19.667 0.040 │ │ Cycling T -0.676 16.021 -0.065 │ │ Cycling TA 0.000 69.403 0.910 │ │ DC1 -0.603 18.971 0.040 │ │ DC2 -0.603 49.349 0.040 │ │ Endothelial -0.369 49.349 0.378 │ │ Enterocyte Progenitors 0.000 25.685 0.910 │ │ Enterocytes 0.000 18.764 0.910 │ │ Enteroendocrine 0.000 17.884 0.910 │ │ Follicular -0.710 71.373 -0.114 │ │ GC -0.710 12.515 -0.114 │ │ Glia -0.369 39.525 0.378 │ │ Goblet 0.000 18.726 0.910 │ │ ILCs 0.000 27.247 0.910 │ │ Immature Enterocytes 1 0.000 21.822 0.910 │ │ Immature Enterocytes 2 0.000 27.769 0.910 │ │ Immature Goblet 0.000 45.738 0.910 │ │ Inflammatory Fibroblasts -0.369 16.625 0.378 │ │ Inflammatory Monocytes -0.603 20.286 0.040 │ │ M cells 0.000 15.673 0.910 │ │ MT-hi -0.676 15.767 -0.065 │ │ Macrophages -0.603 167.156 0.040 │ │ Microvascular -0.369 32.555 0.378 │ │ Myofibroblasts -0.369 43.247 0.378 │ │ NKs 0.000 42.902 0.910 │ │ Pericytes -0.369 22.174 0.378 │ │ Plasma -1.048 797.854 -0.602 │ │ Post-capillary Venules -0.369 44.253 0.378 │ │ RSPO3+ -0.369 23.663 0.378 │ │ Secretory TA 0.000 25.481 0.910 │ │ Stem 0.000 28.558 0.910 │ │ TA 1 0.336 87.176 1.395 │ │ TA 2 0.336 74.584 1.395 │ │ Tregs -0.676 63.811 -0.065 │ │ Tuft 0.000 14.998 0.910 │ │ WNT2B+ Fos-hi -0.369 91.737 0.378 │ │ WNT2B+ Fos-lo 1 -0.369 134.012 0.378 │ │ WNT2B+ Fos-lo 2 -0.369 76.396 0.378 │ │ WNT5B+ 1 -0.369 64.388 0.378 │ │ WNT5B+ 2 -0.369 96.152 0.378 │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Nodes │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Covariate=Health[T.Non-inflamed]_node │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Final Parameter Is credible │ │ Node │ │ Major_l1_Immune 0.00 False │ │ Major_l2_Stromal -0.37 True │ │ Major_l2_Epithelial 0.00 False │ │ Major_l2_Lymphoid 0.00 False │ │ Major_l2_Myeloid -0.26 True │ │ Major_l3_Fibroblasts 0.00 False │ │ Major_l4_EndothelialCells 0.00 False │ │ Glia 0.00 False │ │ Major_l3_Absorptive 0.00 False │ │ Major_l3_Secretory 0.00 False │ │ Cycling TA 0.00 False │ │ M cells 0.00 False │ │ Stem 0.00 False │ │ Major_l3_B cells -0.71 True │ │ Major_l3_T cells -0.68 True │ │ ILCs 0.00 False │ │ NKs 0.00 False │ │ Major_l3_Monocytes -0.34 True │ │ Major_l3_Mast 0.00 False │ │ Major_l4_WNT2B+ 0.00 False │ │ Major_l4_WNT5B+ 0.00 False │ │ Myofibroblasts 0.00 False │ │ Inflammatory Fibroblasts 0.00 False │ │ Pericytes 0.00 False │ │ Microvascular 0.00 False │ │ Endothelial 0.00 False │ │ Post-capillary Venules 0.00 False │ │ Major_l4_TA cells 0.34 True │ │ Major_l4_Immature cells 0.00 False │ │ Major_l4_Absorptive Mature cells 0.00 False │ │ Major_l4_Progenitor cells 0.00 False │ │ Major_l4_Secretory Mature cells 0.00 False │ │ Plasma -0.34 True │ │ Cycling B 0.00 False │ │ Follicular 0.00 False │ │ GC 0.00 False │ │ Major_l4_CD8+ T 0.00 False │ │ Major_l4_CD4+ T 0.00 False │ │ Major_l4_DCs 0.00 False │ │ Macrophages 0.00 False │ │ Cycling Monocytes 0.00 False │ │ Inflammatory Monocytes 0.00 False │ │ CD69+ Mast 0.00 False │ │ CD69- Mast 0.00 False │ │ WNT2B+ Fos-lo 1 0.00 False │ │ WNT2B+ Fos-hi 0.00 False │ │ WNT2B+ Fos-lo 2 0.00 False │ │ RSPO3+ 0.00 False │ │ WNT5B+ 2 0.00 False │ │ WNT5B+ 1 0.00 False │ │ TA 1 0.00 False │ │ TA 2 0.00 False │ │ Enterocyte Progenitors 0.00 False │ │ Immature Enterocytes 2 0.00 False │ │ Immature Enterocytes 1 0.00 False │ │ Best4+ Enterocytes 0.00 False │ │ Enterocytes 0.00 False │ │ Immature Goblet 0.00 False │ │ Secretory TA 0.00 False │ │ Goblet 0.00 False │ │ Enteroendocrine 0.00 False │ │ Tuft 0.00 False │ │ CD8+ IELs 0.00 False │ │ Cycling T 0.00 False │ │ CD8+ IL17+ 0.00 False │ │ CD8+ LP 0.00 False │ │ Tregs 0.00 False │ │ CD4+ Activated Fos-hi 0.00 False │ │ CD4+ Activated Fos-lo -0.35 True │ │ CD4+ PD1+ 0.00 False │ │ CD4+ Memory 0.00 False │ │ MT-hi 0.00 False │ │ DC2 0.00 False │ │ DC1 0.00 False │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Model properties
First, the summary shows an overview over the model properties: * Number of samples/cell types * The reference cell type. * The formula used
The model has three types of parameters that are relevant for analysis - intercepts, feature-level effects and node-wise effects. These can be interpreted like in a standard regression model: Intercepts show how the cell types are distributed without any active covariates, effects show how the covariates influence the cell types.
Intercepts
The first column of the intercept summary shows the parameters determined by the MCMC inference.
The “Expected sample” column gives some context to the numerical values. If we had a new sample (with no active covariates) with a total number of cells equal to the mean sampling depth of the dataset, then this distribution over the cell types would be most likely.
Feature-level Effects
For the feature-level effect summary, the first column again shows the inferred parameters for all combinations of covariates and cell types, as sums over node-level effects on all parent nodes. Most important is the distinction between zero and non-zero entries A value of zero means that no statistically credible effect was detected. For a value other than zero, a credible change was detected. A positive sign indicates an increase, a negative sign a decrease in abundance.
Since the numerical values of the “Effect” column are not straightforward to interpret, the “Expected sample” and “log2-fold change” columns give us an idea on the magnitude of the change. The expected sample is calculated for each covariate separately (covariate value = 1, all other covariates = 0), with the same method as for the intercepts. The log-fold change is then calculated between this expected sample and the expected sample with no active covariates from the intercept section. Since the data is compositional, cell types for which no credible change was detected, will still change in abundance as well, as soon as a credible effect is detected on another cell type due to the sum-to-one constraint. If there are no credible effects for a covariate, its expected sample will be identical to the intercept sample, therefore the log2-fold change is 0.
Node-level effects
These parameters are the most important ones. They describe, at which points in the tree a credible change in abundance was detected. The data frame just has two columns: The “Final Parameter” column shows the effect values, the “Is credible” column simply depicts whether the inferred effect is different from 0, i.e. credible. In a normal tascCODA analysis, we are interested in which subtrees (i.e. nodes) have a nonzero effect, which we can easily extract:
[12]:
tasccoda_model.credible_effects(smillie_data, modality_key="coda_LP")
[12]:
Covariate Node
Health[T.Non-inflamed]_node Major_l1_Immune False
Major_l2_Stromal True
Major_l2_Epithelial False
Major_l2_Lymphoid False
Major_l2_Myeloid True
...
CD4+ PD1+ False
CD4+ Memory False
MT-hi False
DC2 False
DC1 False
Name: Final Parameter, Length: 74, dtype: bool
Interpretation
We see that most credible effects are on intermediate nodes. The only slight increase in abundance is on TA cells, while there are decreases on Stromal and some Immune cell types. The most important decreases (i.e. with the largest effect sizes) can be found on B- and T-cells. Plasma cells have an additional decrease, meaning that they change even stronger in abundance than the rest of the B-cell lineage.
We can also easily plot the credible effects as nodes on the tree for better visualization.
[13]:
tasccoda_model.plot_draw_effects(
smillie_data,
modality_key="coda_LP",
tree=smillie_data["coda_LP"].uns["tree"],
covariate="Health[T.Non-inflamed]",
)
[13]:
To better see the size of aggregated effects on the individual cell types, we can plot them at the side of the tree plot:
[10]:
tasccoda_model.plot_draw_effects(
smillie_data,
modality_key="coda_LP",
tree=smillie_data["coda_LP"].uns["tree"],
covariate="Health[T.Non-inflamed]",
show_legend=False,
show_leaf_effects=True,
)
[10]:
References¶
Ostner, J., Carcy, S., Müller, C.L. tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data. Front. Genet. 12, (2021). https://doi.org/10.3389/fgene.2021.766405
Smillie, C.S., Biton, M., Ordovas-Montanes, J. et al. Intra- and Inter-cellular Rewiring of the Human Colon during Ulcerative Colitis. Cell 178, 714–730.e22 (2019). https://doi.org/10.1016/j.cell.2019.06.029