Note
This page was generated from ontology_mapping.ipynb. Some tutorial content may look better in light mode.
Ontology mapping¶
Ontologies are structured and standardized representations of knowledge in a specific domain, defining the concepts, relationships, and properties within that domain. They are essential for perturbation analysis as they provide a common vocabulary and framework for organizing and integrating perturbation data.
pertpy is compatible with Bionty, which provides access to public ontologies and functionality to map values against them.
Setup¶
If you don’t yet have Bionty installed, install it with pip install bionty.
[1]:
import anndata as ad
import numpy as np
import pandas as pd
Create an AnnData object with gene names in Ensembl notation and cell line annotations in the obs slot.
[2]:
adata = ad.AnnData(
X=np.random.random((3, 3)),
var=pd.DataFrame(
index=[
"ENSG00000148584",
"ENSG00000121410",
"ENSGcorrupted",
]
),
obs=pd.DataFrame(
columns=["cell lines"],
data=[
"HEK293",
"JURKAT",
"THP-1 cell",
],
),
)
adata
/home/zeth/miniconda3/envs/pertpy/lib/python3.11/site-packages/anndata/_core/anndata.py:183: ImplicitModificationWarning: Transforming to str index.
warnings.warn("Transforming to str index.", ImplicitModificationWarning)
[2]:
AnnData object with n_obs × n_vars = 3 × 3
obs: 'cell lines'
[3]:
adata.obs
[3]:
| | cell lines |
---|---|
0 | HEK293 |
1 | JURKAT |
2 | THP-1 cell |
Introduction to Bionty¶
First we import Bionty.
[4]:
import bionty as bt
❗ You are running 3.11.4
Only python versions 3.8~3.10 are currently tested, use at your own risk.
✅ wrote new records from public sources.yaml to /home/zeth/.lamin/bionty/versions/sources_local.yaml!
if you see this message repeatedly, run: bt.reset_sources()
Let’s look at all available ontologies.
[5]:
bt.display_available_sources()
[5]:
entity | source | organism | version | url | md5 | source_name | source_website |
---|---|---|---|---|---|---|---|
Organism | ensembl | vertebrates | release-110 | https://ftp.ensembl.org/pub/release-110/specie... | f3faf95648d3a2b50fd3625456739706 | Ensembl | https://www.ensembl.org |
Organism | ensembl | vertebrates | release-109 | https://ftp.ensembl.org/pub/release-109/specie... | 7595bb989f5fec07eaca5e2138f67bd4 | Ensembl | https://www.ensembl.org |
Organism | ensembl | vertebrates | release-108 | https://ftp.ensembl.org/pub/release-108/specie... | d97c1ee302e4072f5f5c7850eff0b642 | Ensembl | https://www.ensembl.org |
Organism | ensembl | bacteria | release-57 | https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacte... | ee28510ed5586ea7ab4495717c96efc8 | Ensembl | https://www.ensembl.org |
Organism | ensembl | fungi | release-57 | http://ftp.ensemblgenomes.org/pub/fungi/releas... | dbcde58f4396ab8b2480f7fe9f83df8a | Ensembl | https://www.ensembl.org |
Organism | ensembl | metazoa | release-57 | http://ftp.ensemblgenomes.org/pub/metazoa/rele... | 424636a574fec078a61cbdddb05f9132 | Ensembl | https://www.ensembl.org |
Organism | ensembl | plants | release-57 | https://ftp.ensemblgenomes.ebi.ac.uk/pub/plant... | eadaa1f3e527e4c3940c90c7fa5c8bf4 | Ensembl | https://www.ensembl.org |
Organism | ncbitaxon | all | 2023-06-20 | s3://bionty-assets/df_all__ncbitaxon__2023-06-... | 00d97ba65627f1cd65636d2df22ea76c | NCBItaxon Ontology | https://github.com/obophenotype/ncbitaxon |
Gene | ensembl | human | release-110 | s3://bionty-assets/df_human__ensembl__release-... | 832f3947e83664588d419608a469b528 | Ensembl | https://www.ensembl.org |
Gene | ensembl | human | release-109 | s3://bionty-assets/human_ensembl_release-109_G... | 72da9968c74e96d136a489a6102a4546 | Ensembl | https://www.ensembl.org |
Gene | ensembl | mouse | release-110 | s3://bionty-assets/df_mouse__ensembl__release-... | fa4ce130f2929aefd7ac3bc8eaf0c4de | Ensembl | https://www.ensembl.org |
Gene | ensembl | mouse | release-109 | s3://bionty-assets/mouse_ensembl_release-109_G... | 08a1165061151b270b985317322bd2ed | Ensembl | https://www.ensembl.org |
Gene | ensembl | saccharomyces cerevisiae | release-110 | s3://bionty-assets/df_saccharomyces cerevisiae... | 2e59495a3e87ea6575e408697dd73459 | Ensembl | https://www.ensembl.org |
Protein | uniprot | human | 2023-03 | s3://bionty-assets/df_human__uniprot__2023-03_... | 1c46e85c6faf5eff3de5b4e1e4edc4d3 | Uniprot | https://www.uniprot.org |
Protein | uniprot | human | 2023-02 | s3://bionty-assets/human_uniprot_2023-02_Prote... | 0cb7264eb43f91bd04dac792dd879241 | Uniprot | https://www.uniprot.org |
Protein | uniprot | mouse | 2023-03 | s3://bionty-assets/df_mouse__uniprot__2023-03_... | 9d5e9a8225011d3218e10f9bbb96a46c | Uniprot | https://www.uniprot.org |
Protein | uniprot | mouse | 2023-02 | s3://bionty-assets/mouse_uniprot_2023-02_Prote... | dcae4f62f5df145a5c15163fce7e9135 | Uniprot | https://www.uniprot.org |
CellMarker | cellmarker | human | 2.0 | s3://bionty-assets/human_cellmarker_2.0_CellMa... | d565d4a542a5c7e7a06255975358e4f4 | CellMarker | http://bio-bigdata.hrbmu.edu.cn/CellMarker |
CellMarker | cellmarker | mouse | 2.0 | s3://bionty-assets/mouse_cellmarker_2.0_CellMa... | 189586732c63be949e40dfa6a3636105 | CellMarker | http://bio-bigdata.hrbmu.edu.cn/CellMarker |
CellLine | clo | all | 2022-03-21 | https://data.bioontology.org/ontologies/CLO/su... | ea58a1010b7e745702a8397a526b3a33 | Cell Line Ontology | https://bioportal.bioontology.org/ontologies/CLO |
CellType | cl | all | 2023-04-20 | http://purl.obolibrary.org/obo/cl/releases/202... | 58cdc1545f0d35e6fce76a65331b00fb | Cell Ontology | https://obophenotype.github.io/cell-ontology |
CellType | cl | all | 2023-02-15 | http://purl.obolibrary.org/obo/cl/releases/202... | 9331a6a029cb1863bd0584ab41508df7 | Cell Ontology | https://obophenotype.github.io/cell-ontology |
CellType | cl | all | 2022-08-16 | http://purl.obolibrary.org/obo/cl/releases/202... | d0655766574e63f3fe5ed56d3c030880 | Cell Ontology | https://obophenotype.github.io/cell-ontology |
CellType | cl | all | 2023-08-24 | http://purl.obolibrary.org/obo/cl/releases/202... | 46e7dd89421f1255cf0191eca1548f73 | Cell Ontology | https://obophenotype.github.io/cell-ontology |
Tissue | uberon | all | 2023-04-19 | http://purl.obolibrary.org/obo/uberon/releases... | 5611dd1375d5a95ac7d7de8e25e6016f | Uberon multi-species anatomy ontology | http://obophenotype.github.io/uberon |
Tissue | uberon | all | 2023-02-14 | http://purl.obolibrary.org/obo/uberon/releases... | 3f94e22fae4cdde88a555c5cd59c47da | Uberon multi-species anatomy ontology | http://obophenotype.github.io/uberon |
Tissue | uberon | all | 2022-08-19 | http://purl.obolibrary.org/obo/uberon/releases... | c7c958a1ee48fdce146f2c1763eed27e | Uberon multi-species anatomy ontology | http://obophenotype.github.io/uberon |
Tissue | uberon | all | 2023-09-05 | http://purl.obolibrary.org/obo/uberon/releases... | abcee3ede566d1311d758b853ccdf5aa | Uberon multi-species anatomy ontology | http://obophenotype.github.io/uberon |
Disease | mondo | all | 2023-04-04 | http://purl.obolibrary.org/obo/mondo/releases/... | 700c43dd9ba51aecc7a8edfc3bc2dab1 | Mondo Disease Ontology | https://mondo.monarchinitiative.org |
Disease | mondo | all | 2023-02-06 | http://purl.obolibrary.org/obo/mondo/releases/... | 2b7d479d4bd02a94eab47d1c9e64c5db | Mondo Disease Ontology | https://mondo.monarchinitiative.org |
Disease | mondo | all | 2022-10-11 | http://purl.obolibrary.org/obo/mondo/releases/... | 04b808d05c2c2e81430b20a0e87552bb | Mondo Disease Ontology | https://mondo.monarchinitiative.org |
Disease | mondo | all | 2023-08-02 | http://purl.obolibrary.org/obo/mondo/releases/... | 7f33767422042eec29f08b501fc851db | Mondo Disease Ontology | https://mondo.monarchinitiative.org |
Disease | doid | human | 2023-03-31 | http://purl.obolibrary.org/obo/doid/releases/2... | 64f083a1e47867c307c8eae308afc3bb | Human Disease Ontology | https://disease-ontology.org |
Disease | doid | human | 2023-01-30 | http://purl.obolibrary.org/obo/doid/releases/2... | 9f0c92ad2896dda82195e9226a06dc36 | Human Disease Ontology | https://disease-ontology.org |
ExperimentalFactor | efo | all | 3.48.0 | http://www.ebi.ac.uk/efo/releases/v3.48.0/efo.owl | 3367e9a9ae3dee9113024e5108c49091 | The Experimental Factor Ontology | https://bioportal.bioontology.org/ontologies/EFO |
ExperimentalFactor | efo | all | 3.57.0 | http://www.ebi.ac.uk/efo/releases/v3.57.0/efo.owl | 2ecafc69b3aba7bdb31ad99438505c05 | The Experimental Factor Ontology | https://bioportal.bioontology.org/ontologies/EFO |
Phenotype | hp | human | 2023-06-17 | https://github.com/obophenotype/human-phenotyp... | 65e8d96bc81deb893163927063b10c06 | Human Phenotype Ontology | https://hpo.jax.org |
Phenotype | hp | human | 2023-04-05 | https://github.com/obophenotype/human-phenotyp... | bdf866e11d37cf6fd2aef25c325b2c8a | Human Phenotype Ontology | https://hpo.jax.org |
Phenotype | hp | human | 2023-01-27 | https://github.com/obophenotype/human-phenotyp... | ceeb3ada771908deef620d74cd8e6b0f | Human Phenotype Ontology | https://hpo.jax.org |
Phenotype | mp | mammalian | 2023-05-31 | https://github.com/mgijax/mammalian-phenotype-... | be89052cf6d9c0b6197038fe347ef293 | Mammalian Phenotype Ontology | https://github.com/mgijax/mammalian-phenotype-... |
Phenotype | zp | zebrafish | 2022-12-17 | https://github.com/obophenotype/zebrafish-phen... | 03430b567bf153216c0fa4c3440b3b24 | Zebrafish Phenotype Ontology | https://github.com/obophenotype/zebrafish-phen... |
Phenotype | phe | human | 1.2 | s3://bionty-assets/df_human__phe__1.2__Phenoty... | 741033ee1b13df7c41b4849e8bd02f13 | Phecodes ICD10 map | https://phewascatalog.org/phecodes_icd10 |
Phenotype | pato | all | 2023-05-18 | http://purl.obolibrary.org/obo/pato/releases/2... | bd472f4971492109493d4ad8a779a8dd | Phenotype And Trait Ontology | https://github.com/pato-ontology/pato |
Pathway | go | all | 2023-05-10 | https://data.bioontology.org/ontologies/GO/sub... | e9845499eadaef2418f464cd7e9ac92e | Gene Ontology | http://geneontology.org |
Pathway | pw | all | 7.79 | https://data.bioontology.org/ontologies/PW/sub... | 02e2337bb1ab7cc4332ef6acc4cbdfa6 | Pathway Ontology | https://www.ebi.ac.uk/ols/ontologies/pw |
BFXPipeline | lamin | all | 1.0.0 | s3://bionty-assets/bfxpipelines.json | a7eff57a256994692fba46e0199ffc94 | Bioinformatics Pipeline | https://lamin.ai |
Drug | dron | all | 2023-03-10 | https://data.bioontology.org/ontologies/DRON/s... | 75e86011158fae76bb46d96662a33ba3 | Drug Ontology | https://bioportal.bioontology.org/ontologies/DRON |
DevelopmentalStage | hsapdv | human | 2020-03-10 | http://purl.obolibrary.org/obo/hsapdv.owl | 0423f338c50161880df4d5d1523d24ed | Human Developmental Stages | https://github.com/obophenotype/developmental-... |
DevelopmentalStage | mmusdv | mouse | 2020-03-10 | http://purl.obolibrary.org/obo/mmusdv.owl | 6342b59cf3082b10c54f90a8c3336b72 | Mouse Developmental Stages | https://github.com/obophenotype/developmental-... |
Ethnicity | hancestro | human | 2023-07-313.0 | http://purl.obolibrary.org/obo/hancestro.owl | af731447e95b4ca341a91b018edd4885 | Human Ancestry Ontology | https://github.com/EBISPOT/hancestro |
Ethnicity | hancestro | human | 3.0 | https://github.com/EBISPOT/hancestro/raw/3.0/h... | 76dd9efda9c2abd4bc32fc57c0b755dd | Human Ancestry Ontology | https://github.com/EBISPOT/hancestro |
BioSample | ncbi | all | 2023-09 | s3://bionty-assets/df_all__ncbi__2023-09__BioS... | 918db9bd1734b97c596c67d9654a4126 | NCBI BioSample attributes | https://www.ncbi.nlm.nih.gov/biosample/docs/at... |
Bionty provides three key functionalities:
inspect: Check whether any of our values (here cell lines and gene IDs) are mappable against a specified ontology.
map_synonyms: Map values against synonyms, such as JURKAT for JURKAT cell.
curate: Curate ontology values against the ontology to ensure compliance.
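To make the first two ideas concrete, here is a minimal pure-Python sketch on a toy ontology. It is not Bionty's implementation: the names `TOY_ONTOLOGY`, `toy_inspect`, and `toy_map_synonyms` are hypothetical and exist only for illustration, and the toy table holds just the three cell lines used in this tutorial.

```python
# Toy stand-in for an ontology table: official name -> known synonyms.
# Hypothetical structure, far simpler than a real ontology.
TOY_ONTOLOGY = {
    "HEK293": [],
    "JURKAT cell": ["JURKAT"],
    "THP-1 cell": [],
}


def toy_inspect(values, ontology):
    """Report, per value, whether it exactly matches an official name."""
    return {value: value in ontology for value in values}


def toy_map_synonyms(values, ontology):
    """Replace values that match a known synonym with the official name."""
    synonym_to_name = {
        synonym: name
        for name, synonyms in ontology.items()
        for synonym in synonyms
    }
    # Values without a known synonym are passed through unchanged.
    return [synonym_to_name.get(value, value) for value in values]


cell_lines = ["HEK293", "JURKAT", "THP-1 cell"]
report = toy_inspect(cell_lines, TOY_ONTOLOGY)
mapped = toy_map_synonyms(cell_lines, TOY_ONTOLOGY)
print(report)
print(mapped)
```

Inspection only reports mismatches, while synonym mapping rewrites them, so the two steps are usually run in that order.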
Mapping against the Cell Line Ontology with Bionty
We will now showcase how to access the cell line ontology with Bionty. The Cell Line Ontology (CLO) aims to harmonize cell line definitions across the world.
Bionty is centered around entity objects that provide the functionality introduced above. We create a Bionty CellLine object with the Cell Line Ontology as our source and pin a specific version for reproducibility.
Cell lines¶
[6]:
cell_line_bt = bt.CellLine(source="clo", version="2022-03-21")
cell_line_bt
[6]:
PublicOntology
Entity: CellLine
Organism: all
Source: clo, 2022-03-21
#terms: 39037
📖 .df(): ontology reference table
🔎 .lookup(): autocompletion of terms
🎯 .search(): free text search of terms
✅ .validate(): strictly validate values
🧐 .inspect(): full inspection of values
👽 .standardize(): convert to standardized names
🪜 .diff(): difference between two versions
🔗 .to_pronto(): Pronto.Ontology object
We can access the DataFrame that contains all ontology terms:
[7]:
cell_line_bt.df()
[7]:
ontology_id | name | definition | synonyms | parents |
---|---|---|---|---|
CLO:0000000 | cell line cell culturing | a maintaining cell culture process that keeps ... | None | [] |
CLO:0000001 | cell line cell | A cultured cell that is part of a cell line - ... | None | [] |
CLO:0000002 | suspension cell line culturing | suspension cell line culturing is a cell line ... | None | [CLO:0000000] |
CLO:0000003 | adherent cell line culturing | adherent cell line culturing is a cell line cu... | None | [CLO:0000000] |
CLO:0000004 | cell line cell modification | a material processing that modifies an existin... | None | [] |
... | ... | ... | ... | ... |
CLO:0051617 | RCB0187 cell | A immortal medaka cell line cell that has the ... | RCB0187|OLHE-131 | [CLO:0009822] |
CLO:0051618 | RCB2945 cell | A immortal medaka cell line cell that has the ... | RCB2945|DIT29 | [CLO:0009822] |
CLO:0051619 | RCB0184 cell | A immortal medaka cell line cell that has the ... | OLF-136|RCB0184 | [CLO:0009822] |
CLO:0051620 | RCB0188 cell | A immortal medaka cell line cell that has the ... | RCB0188|OLME-104 | [CLO:0009822] |
CLO:0051621 | RCB2319 cell | A immortal cell line cell that has the charact... | LACF-NaNaI|RCB2319 | [CLO:0000019] |
39037 rows × 4 columns
Let’s inspect all of our cell lines to learn whether they can be mapped against the ontology using the name field:
[8]:
cell_line_bt.inspect(adata.obs["cell lines"], field=cell_line_bt.name, return_df=True)
✅ 2 terms (66.70%) are validated for name
❗ 1 term (33.30%) is not validated for name: JURKAT
detected 1 terms with synonym: JURKAT
→ standardize terms via .standardize()
[8]:
| | __validated__ |
---|---|
HEK293 | True |
JURKAT | False |
THP-1 cell | True |
We observe that JURKAT cannot be mapped against the Cell Line Ontology by name, although it is detected as a synonym. Hence, we create a lookup object and try to find JURKAT cells in the ontology with auto-complete.
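A lookup object of this kind can be pictured as a namespace whose attribute names are normalized term names. The sketch below is an assumption about the general idea, not Bionty's actual code: `to_attribute_name` and `build_lookup` are hypothetical helpers, and the two records reuse IDs and names that appear in this tutorial.

```python
import re
from types import SimpleNamespace


def to_attribute_name(term_name):
    """Turn a term name into a lowercase, attribute-friendly key."""
    key = re.sub(r"\W+", "_", term_name.strip().lower()).strip("_")
    # Python identifiers cannot start with a digit, so prefix those keys.
    return f"_{key}" if key[:1].isdigit() else key


def build_lookup(records):
    """Expose each record under its normalized name for tab completion."""
    return SimpleNamespace(
        **{to_attribute_name(record["name"]): record for record in records}
    )


# Toy records (hypothetical subset of an ontology table).
records = [
    {"ontology_id": "CLO:0007043", "name": "JURKAT cell"},
    {"ontology_id": "CLO:0007044", "name": "Jurkat J6 cell"},
]
lookup = build_lookup(records)
print(lookup.jurkat_cell["name"])
```

Because the keys are plain attributes, an interactive shell can offer them through tab completion, which is what makes this pattern convenient for exploring large ontologies.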
[9]:
cell_line_bt_lookup = cell_line_bt.lookup()
[10]:
cell_line_bt_lookup.jurkat_cell
[10]:
CellLine(ontology_id='CLO:0007043', name='JURKAT cell', definition='an immortalized human T lymphocyte cell that was derived in the late 1970s from the peripheral blood of a 14-year-old boy with T cell leukemia|disease: leukemia, T cell', synonyms='JURKAT', parents=array(['CLO:0000523'], dtype=object), _5='jurkat cell')
[11]:
cell_line_bt_lookup.jurkat_cell.name
[11]:
'JURKAT cell'
[12]:
cell_line_bt_lookup.jurkat_cell.definition
[12]:
'an immortalized human T lymphocyte cell that was derived in the late 1970s from the peripheral blood of a 14-year-old boy with T cell leukemia|disease: leukemia, T cell'
Indeed, we find that the actual name of the cells is JURKAT cell. Let’s rename it.
[13]:
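# Assign the result back: inplace=True on a selected column may not update adata.obs in newer pandas versions.
adata.obs["cell lines"] = adata.obs["cell lines"].replace({"JURKAT": "JURKAT cell"})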
adata.obs["cell lines"]
[13]:
0 HEK293
1 JURKAT cell
2 THP-1 cell
Name: cell lines, dtype: object
[14]:
cell_line_bt.inspect(adata.obs["cell lines"], field=cell_line_bt.name, return_df=True)
✅ 3 terms (100.00%) are validated for name
[14]:
| | __validated__ |
---|---|
HEK293 | True |
JURKAT cell | True |
THP-1 cell | True |
All terms are now validated against the ontology.
We could have also used the search functionality to find the match for JURKAT cells:
[15]:
cell_line_bt.search("JURKAT").head()
[15]:
name | ontology_id | definition | synonyms | parents | __agg__ | __ratio__ |
---|---|---|---|---|---|---|
JURKAT cell | CLO:0007043 | an immortalized human T lymphocyte cell that w... | JURKAT | [CLO:0000523] | jurkat cell | 100.0 |
RCB0806 cell | CLO:0050978 | A immortal human blood cell line cell that has... | RCB0806|Jurkat | [CLO:0000617] | rcb0806 cell | 100.0 |
+/+ (A) cell | CLO:0001020 | None | +/+ (A) | [CLO:0000019] | +/+ (a) cell | 90.0 |
Jurkat J6 cell | CLO:0007044 | None | Jurkat J6 | [CLO:0000019] | jurkat j6 cell | 90.0 |
U cell | CLO:0009449 | None | U | [CLO:0000466] | u cell | 90.0 |
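The ranking idea behind such a free text search can be sketched with the standard library. This is only an illustration under stated assumptions: it uses `difflib.SequenceMatcher` as a stand-in scorer, which is not necessarily what Bionty uses, so the scores will not match the `__ratio__` column above. `TERM_NAMES` and `fuzzy_search` are hypothetical names.

```python
from difflib import SequenceMatcher

# Toy subset of term names that appear in this tutorial.
TERM_NAMES = ["JURKAT cell", "Jurkat J6 cell", "HEK293", "THP-1 cell"]


def fuzzy_search(query, names, limit=3):
    """Rank names by case-insensitive similarity to the query (0 to 100)."""
    scored = [
        (
            name,
            round(100 * SequenceMatcher(None, query.lower(), name.lower()).ratio(), 1),
        )
        for name in names
    ]
    # Highest similarity first; keep only the top `limit` hits.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


results = fuzzy_search("JURKAT", TERM_NAMES)
for name, score in results:
    print(f"{name}: {score}")
```

Lowercasing both sides before scoring is what lets a query such as JURKAT match Jurkat J6 cell despite the different capitalization.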
The same workflow can be applied to genes.
Genes¶
[16]:
gene_bt = bt.Gene()
gene_bt
[16]:
PublicOntology
Entity: Gene
Organism: human
Source: ensembl, release-110
#terms: 75719
📖 .df(): ontology reference table
🔎 .lookup(): autocompletion of terms
🎯 .search(): free text search of terms
✅ .validate(): strictly validate values
🧐 .inspect(): full inspection of values
👽 .standardize(): convert to standardized names
🪜 .diff(): difference between two versions
🔗 .to_pronto(): Pronto.Ontology object
[17]:
gene_bt.inspect(adata.var_names, gene_bt.ensembl_gene_id)
✅ 2 terms (66.70%) are validated for ensembl_gene_id
❗ 1 term (33.30%) is not validated for ensembl_gene_id: ENSGcorrupted
[17]:
<lamin_utils._inspect.InspectResult at 0x762c61ff7950>
ENSGcorrupted is not a valid Ensembl gene ID and should therefore also be corrected.
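Obviously malformed IDs can also be caught before any ontology lookup with a simple format check. The sketch below assumes the usual shape of human Ensembl gene IDs (ENSG followed by 11 digits, optionally with a version suffix); `has_ensembl_gene_format` is a hypothetical helper, and passing this check does not guarantee that an ID exists in a given Ensembl release.

```python
import re

# Human Ensembl gene IDs: "ENSG" plus 11 digits, optionally ".<version>".
ENSEMBL_HUMAN_GENE_ID = re.compile(r"ENSG\d{11}(\.\d+)?")


def has_ensembl_gene_format(gene_id):
    """Return True if the string is shaped like a human Ensembl gene ID."""
    return ENSEMBL_HUMAN_GENE_ID.fullmatch(gene_id) is not None


gene_ids = ["ENSG00000148584", "ENSG00000121410", "ENSGcorrupted"]
malformed = [gene_id for gene_id in gene_ids if not has_ensembl_gene_format(gene_id)]
print(malformed)
```

A format check like this complements inspection against the ontology: it is cheap and works offline, but only the ontology can confirm that an ID is actually known.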
Conclusion¶
pertpy provides support for ontology management, inspection, and mapping through Bionty. Bionty provides access to gene, cell type, cell line, disease, and phenotype ontologies, among many others.
To access these ontologies, we create Bionty objects whose methods map synonyms and inspect data for adherence to an ontology. Mismatches can be remedied by finding the correct ontology name using lookup objects or fuzzy matching.