Note

This page was generated from ontology_mapping.ipynb. Some tutorial content may look better in light mode.

Ontology mapping

Ontologies are structured and standardized representations of knowledge in a specific domain, defining the concepts, relationships, and properties within that domain. They are essential for perturbation analysis because they provide a common vocabulary and framework for organizing and integrating perturbation data.

pertpy is compatible with Bionty, which provides access to public ontologies and functionality to map values against them.

Setup

If you don’t yet have Bionty installed, install it with pip install bionty.
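
Before running the cells below, it can help to confirm that the package is importable in the active environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of pertpy or Bionty.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("bionty"):
    print("Bionty is missing; install it with: pip install bionty")
```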

[1]:
import anndata as ad
import numpy as np
import pandas as pd

Create an AnnData object with gene names in Ensembl notation and cell line annotations in the obs slot.

[2]:
adata = ad.AnnData(
    X=np.random.random((3, 3)),
    var=pd.DataFrame(
        index=[
            "ENSG00000148584",
            "ENSG00000121410",
            "ENSGcorrupted",
        ]
    ),
    obs=pd.DataFrame(
        columns=["cell lines"],
        data=[
            "HEK293",
            "JURKAT",
            "THP-1 cell",
        ],
    ),
)
adata
/home/zeth/miniconda3/envs/pertpy/lib/python3.11/site-packages/anndata/_core/anndata.py:183: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
[2]:
AnnData object with n_obs × n_vars = 3 × 3
    obs: 'cell lines'
[3]:
adata.obs
[3]:
   cell lines
0      HEK293
1      JURKAT
2  THP-1 cell

Introduction to Bionty

First we import Bionty.

[4]:
import bionty as bt
❗ You are running 3.11.4
Only python versions 3.8~3.10 are currently tested, use at your own risk.
✅ wrote new records from public sources.yaml to /home/zeth/.lamin/bionty/versions/sources_local.yaml!

if you see this message repeatedly, run: bt.reset_sources()

Let’s look at all available ontologies.

[5]:
bt.display_available_sources()
[5]:
source organism version url md5 source_name source_website
entity
Organism ensembl vertebrates release-110 https://ftp.ensembl.org/pub/release-110/specie... f3faf95648d3a2b50fd3625456739706 Ensembl https://www.ensembl.org
Organism ensembl vertebrates release-109 https://ftp.ensembl.org/pub/release-109/specie... 7595bb989f5fec07eaca5e2138f67bd4 Ensembl https://www.ensembl.org
Organism ensembl vertebrates release-108 https://ftp.ensembl.org/pub/release-108/specie... d97c1ee302e4072f5f5c7850eff0b642 Ensembl https://www.ensembl.org
Organism ensembl bacteria release-57 https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacte... ee28510ed5586ea7ab4495717c96efc8 Ensembl https://www.ensembl.org
Organism ensembl fungi release-57 http://ftp.ensemblgenomes.org/pub/fungi/releas... dbcde58f4396ab8b2480f7fe9f83df8a Ensembl https://www.ensembl.org
Organism ensembl metazoa release-57 http://ftp.ensemblgenomes.org/pub/metazoa/rele... 424636a574fec078a61cbdddb05f9132 Ensembl https://www.ensembl.org
Organism ensembl plants release-57 https://ftp.ensemblgenomes.ebi.ac.uk/pub/plant... eadaa1f3e527e4c3940c90c7fa5c8bf4 Ensembl https://www.ensembl.org
Organism ncbitaxon all 2023-06-20 s3://bionty-assets/df_all__ncbitaxon__2023-06-... 00d97ba65627f1cd65636d2df22ea76c NCBItaxon Ontology https://github.com/obophenotype/ncbitaxon
Gene ensembl human release-110 s3://bionty-assets/df_human__ensembl__release-... 832f3947e83664588d419608a469b528 Ensembl https://www.ensembl.org
Gene ensembl human release-109 s3://bionty-assets/human_ensembl_release-109_G... 72da9968c74e96d136a489a6102a4546 Ensembl https://www.ensembl.org
Gene ensembl mouse release-110 s3://bionty-assets/df_mouse__ensembl__release-... fa4ce130f2929aefd7ac3bc8eaf0c4de Ensembl https://www.ensembl.org
Gene ensembl mouse release-109 s3://bionty-assets/mouse_ensembl_release-109_G... 08a1165061151b270b985317322bd2ed Ensembl https://www.ensembl.org
Gene ensembl saccharomyces cerevisiae release-110 s3://bionty-assets/df_saccharomyces cerevisiae... 2e59495a3e87ea6575e408697dd73459 Ensembl https://www.ensembl.org
Protein uniprot human 2023-03 s3://bionty-assets/df_human__uniprot__2023-03_... 1c46e85c6faf5eff3de5b4e1e4edc4d3 Uniprot https://www.uniprot.org
Protein uniprot human 2023-02 s3://bionty-assets/human_uniprot_2023-02_Prote... 0cb7264eb43f91bd04dac792dd879241 Uniprot https://www.uniprot.org
Protein uniprot mouse 2023-03 s3://bionty-assets/df_mouse__uniprot__2023-03_... 9d5e9a8225011d3218e10f9bbb96a46c Uniprot https://www.uniprot.org
Protein uniprot mouse 2023-02 s3://bionty-assets/mouse_uniprot_2023-02_Prote... dcae4f62f5df145a5c15163fce7e9135 Uniprot https://www.uniprot.org
CellMarker cellmarker human 2.0 s3://bionty-assets/human_cellmarker_2.0_CellMa... d565d4a542a5c7e7a06255975358e4f4 CellMarker http://bio-bigdata.hrbmu.edu.cn/CellMarker
CellMarker cellmarker mouse 2.0 s3://bionty-assets/mouse_cellmarker_2.0_CellMa... 189586732c63be949e40dfa6a3636105 CellMarker http://bio-bigdata.hrbmu.edu.cn/CellMarker
CellLine clo all 2022-03-21 https://data.bioontology.org/ontologies/CLO/su... ea58a1010b7e745702a8397a526b3a33 Cell Line Ontology https://bioportal.bioontology.org/ontologies/CLO
CellType cl all 2023-04-20 http://purl.obolibrary.org/obo/cl/releases/202... 58cdc1545f0d35e6fce76a65331b00fb Cell Ontology https://obophenotype.github.io/cell-ontology
CellType cl all 2023-02-15 http://purl.obolibrary.org/obo/cl/releases/202... 9331a6a029cb1863bd0584ab41508df7 Cell Ontology https://obophenotype.github.io/cell-ontology
CellType cl all 2022-08-16 http://purl.obolibrary.org/obo/cl/releases/202... d0655766574e63f3fe5ed56d3c030880 Cell Ontology https://obophenotype.github.io/cell-ontology
CellType cl all 2023-08-24 http://purl.obolibrary.org/obo/cl/releases/202... 46e7dd89421f1255cf0191eca1548f73 Cell Ontology https://obophenotype.github.io/cell-ontology
Tissue uberon all 2023-04-19 http://purl.obolibrary.org/obo/uberon/releases... 5611dd1375d5a95ac7d7de8e25e6016f Uberon multi-species anatomy ontology http://obophenotype.github.io/uberon
Tissue uberon all 2023-02-14 http://purl.obolibrary.org/obo/uberon/releases... 3f94e22fae4cdde88a555c5cd59c47da Uberon multi-species anatomy ontology http://obophenotype.github.io/uberon
Tissue uberon all 2022-08-19 http://purl.obolibrary.org/obo/uberon/releases... c7c958a1ee48fdce146f2c1763eed27e Uberon multi-species anatomy ontology http://obophenotype.github.io/uberon
Tissue uberon all 2023-09-05 http://purl.obolibrary.org/obo/uberon/releases... abcee3ede566d1311d758b853ccdf5aa Uberon multi-species anatomy ontology http://obophenotype.github.io/uberon
Disease mondo all 2023-04-04 http://purl.obolibrary.org/obo/mondo/releases/... 700c43dd9ba51aecc7a8edfc3bc2dab1 Mondo Disease Ontology https://mondo.monarchinitiative.org
Disease mondo all 2023-02-06 http://purl.obolibrary.org/obo/mondo/releases/... 2b7d479d4bd02a94eab47d1c9e64c5db Mondo Disease Ontology https://mondo.monarchinitiative.org
Disease mondo all 2022-10-11 http://purl.obolibrary.org/obo/mondo/releases/... 04b808d05c2c2e81430b20a0e87552bb Mondo Disease Ontology https://mondo.monarchinitiative.org
Disease mondo all 2023-08-02 http://purl.obolibrary.org/obo/mondo/releases/... 7f33767422042eec29f08b501fc851db Mondo Disease Ontology https://mondo.monarchinitiative.org
Disease doid human 2023-03-31 http://purl.obolibrary.org/obo/doid/releases/2... 64f083a1e47867c307c8eae308afc3bb Human Disease Ontology https://disease-ontology.org
Disease doid human 2023-01-30 http://purl.obolibrary.org/obo/doid/releases/2... 9f0c92ad2896dda82195e9226a06dc36 Human Disease Ontology https://disease-ontology.org
ExperimentalFactor efo all 3.48.0 http://www.ebi.ac.uk/efo/releases/v3.48.0/efo.owl 3367e9a9ae3dee9113024e5108c49091 The Experimental Factor Ontology https://bioportal.bioontology.org/ontologies/EFO
ExperimentalFactor efo all 3.57.0 http://www.ebi.ac.uk/efo/releases/v3.57.0/efo.owl 2ecafc69b3aba7bdb31ad99438505c05 The Experimental Factor Ontology https://bioportal.bioontology.org/ontologies/EFO
Phenotype hp human 2023-06-17 https://github.com/obophenotype/human-phenotyp... 65e8d96bc81deb893163927063b10c06 Human Phenotype Ontology https://hpo.jax.org
Phenotype hp human 2023-04-05 https://github.com/obophenotype/human-phenotyp... bdf866e11d37cf6fd2aef25c325b2c8a Human Phenotype Ontology https://hpo.jax.org
Phenotype hp human 2023-01-27 https://github.com/obophenotype/human-phenotyp... ceeb3ada771908deef620d74cd8e6b0f Human Phenotype Ontology https://hpo.jax.org
Phenotype mp mammalian 2023-05-31 https://github.com/mgijax/mammalian-phenotype-... be89052cf6d9c0b6197038fe347ef293 Mammalian Phenotype Ontology https://github.com/mgijax/mammalian-phenotype-...
Phenotype zp zebrafish 2022-12-17 https://github.com/obophenotype/zebrafish-phen... 03430b567bf153216c0fa4c3440b3b24 Zebrafish Phenotype Ontology https://github.com/obophenotype/zebrafish-phen...
Phenotype phe human 1.2 s3://bionty-assets/df_human__phe__1.2__Phenoty... 741033ee1b13df7c41b4849e8bd02f13 Phecodes ICD10 map https://phewascatalog.org/phecodes_icd10
Phenotype pato all 2023-05-18 http://purl.obolibrary.org/obo/pato/releases/2... bd472f4971492109493d4ad8a779a8dd Phenotype And Trait Ontology https://github.com/pato-ontology/pato
Pathway go all 2023-05-10 https://data.bioontology.org/ontologies/GO/sub... e9845499eadaef2418f464cd7e9ac92e Gene Ontology http://geneontology.org
Pathway pw all 7.79 https://data.bioontology.org/ontologies/PW/sub... 02e2337bb1ab7cc4332ef6acc4cbdfa6 Pathway Ontology https://www.ebi.ac.uk/ols/ontologies/pw
BFXPipeline lamin all 1.0.0 s3://bionty-assets/bfxpipelines.json a7eff57a256994692fba46e0199ffc94 Bioinformatics Pipeline https://lamin.ai
Drug dron all 2023-03-10 https://data.bioontology.org/ontologies/DRON/s... 75e86011158fae76bb46d96662a33ba3 Drug Ontology https://bioportal.bioontology.org/ontologies/DRON
DevelopmentalStage hsapdv human 2020-03-10 http://purl.obolibrary.org/obo/hsapdv.owl 0423f338c50161880df4d5d1523d24ed Human Developmental Stages https://github.com/obophenotype/developmental-...
DevelopmentalStage mmusdv mouse 2020-03-10 http://purl.obolibrary.org/obo/mmusdv.owl 6342b59cf3082b10c54f90a8c3336b72 Mouse Developmental Stages https://github.com/obophenotype/developmental-...
Ethnicity hancestro human 2023-07-313.0 http://purl.obolibrary.org/obo/hancestro.owl af731447e95b4ca341a91b018edd4885 Human Ancestry Ontology https://github.com/EBISPOT/hancestro
Ethnicity hancestro human 3.0 https://github.com/EBISPOT/hancestro/raw/3.0/h... 76dd9efda9c2abd4bc32fc57c0b755dd Human Ancestry Ontology https://github.com/EBISPOT/hancestro
BioSample ncbi all 2023-09 s3://bionty-assets/df_all__ncbi__2023-09__BioS... 918db9bd1734b97c596c67d9654a4126 NCBI BioSample attributes https://www.ncbi.nlm.nih.gov/biosample/docs/at...

Bionty provides three key functionalities:

  1. inspect: Check whether our values (here cell lines and gene IDs) can be mapped against a specified ontology.

  2. map_synonyms: Map synonyms to the standardized names used by the ontology. In the Bionty version used in this tutorial, the object summary lists this functionality as .standardize().

  3. curate: Curate values against the ontology to ensure compliance.
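
To make the idea of inspection concrete, here is a minimal standard-library sketch of exact-match validation against a reference vocabulary. It is not Bionty's implementation; the function `inspect_values` and the toy vocabulary are assumptions made purely for illustration.

```python
def inspect_values(values, reference_names):
    """Report which values match a reference vocabulary exactly."""
    reference = set(reference_names)
    validated = {value: value in reference for value in values}
    n_valid = sum(validated.values())
    share = 100 * n_valid / len(validated) if validated else 0.0
    print(f"{n_valid} of {len(validated)} terms ({share:.2f}%) are validated")
    return validated


# Toy vocabulary standing in for an ontology's `name` column (hypothetical).
toy_names = ["HEK293", "JURKAT cell", "THP-1 cell"]
result = inspect_values(["HEK293", "JURKAT", "THP-1 cell"], toy_names)
```

Exact matching is deliberately strict: a value such as JURKAT fails even though a closely related name exists, which is why synonym mapping and search are useful companions to inspection.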

Mapping against the Cell Line Ontology with Bionty

We will now showcase how to access the cell line ontology with Bionty. The Cell Line Ontology (CLO) aims to harmonize cell line definitions across the world.

Bionty is centered around entity objects that provide the functionality introduced above. We create a Bionty CellLine object with the Cell Line Ontology as our source and pin a specific version for reproducibility.

Cell lines

[6]:
cell_line_bt = bt.CellLine(source="clo", version="2022-03-21")
cell_line_bt
[6]:
PublicOntology
Entity: CellLine
Organism: all
Source: clo, 2022-03-21
#terms: 39037

📖 .df(): ontology reference table
🔎 .lookup(): autocompletion of terms
🎯 .search(): free text search of terms
✅ .validate(): strictly validate values
🧐 .inspect(): full inspection of values
👽 .standardize(): convert to standardized names
🪜 .diff(): difference between two versions
🔗 .to_pronto(): Pronto.Ontology object

We can access the DataFrame that contains all ontology terms:

[7]:
cell_line_bt.df()
[7]:
name definition synonyms parents
ontology_id
CLO:0000000 cell line cell culturing a maintaining cell culture process that keeps ... None []
CLO:0000001 cell line cell A cultured cell that is part of a cell line - ... None []
CLO:0000002 suspension cell line culturing suspension cell line culturing is a cell line ... None [CLO:0000000]
CLO:0000003 adherent cell line culturing adherent cell line culturing is a cell line cu... None [CLO:0000000]
CLO:0000004 cell line cell modification a material processing that modifies an existin... None []
... ... ... ... ...
CLO:0051617 RCB0187 cell A immortal medaka cell line cell that has the ... RCB0187|OLHE-131 [CLO:0009822]
CLO:0051618 RCB2945 cell A immortal medaka cell line cell that has the ... RCB2945|DIT29 [CLO:0009822]
CLO:0051619 RCB0184 cell A immortal medaka cell line cell that has the ... OLF-136|RCB0184 [CLO:0009822]
CLO:0051620 RCB0188 cell A immortal medaka cell line cell that has the ... RCB0188|OLME-104 [CLO:0009822]
CLO:0051621 RCB2319 cell A immortal cell line cell that has the charact... LACF-NaNaI|RCB2319 [CLO:0000019]

39037 rows × 4 columns

Let’s inspect all of our cell lines to learn whether they can be mapped against the ontology using the name field:

[8]:
cell_line_bt.inspect(adata.obs["cell lines"], field=cell_line_bt.name, return_df=True)
2 terms (66.70%) are validated for name
❗ 1 term (33.30%) is not validated for name: JURKAT
   detected 1 terms with synonym: JURKAT
→  standardize terms via .standardize()
[8]:
            __validated__
HEK293               True
JURKAT              False
THP-1 cell           True

We observe that JURKAT cannot be mapped against the Cell Line Ontology by name, although it is detected as a synonym. Hence, we create a lookup object and try to find JURKAT cells in the ontology with auto-complete.

[9]:
cell_line_bt_lookup = cell_line_bt.lookup()
[10]:
cell_line_bt_lookup.jurkat_cell
[10]:
CellLine(ontology_id='CLO:0007043', name='JURKAT cell', definition='an immortalized human T lymphocyte cell that was derived in the late 1970s from the peripheral blood of a 14-year-old boy with T cell leukemia|disease: leukemia, T cell', synonyms='JURKAT', parents=array(['CLO:0000523'], dtype=object), _5='jurkat cell')
[11]:
cell_line_bt_lookup.jurkat_cell.name
[11]:
'JURKAT cell'
[12]:
cell_line_bt_lookup.jurkat_cell.definition
[12]:
'an immortalized human T lymphocyte cell that was derived in the late 1970s from the peripheral blood of a 14-year-old boy with T cell leukemia|disease: leukemia, T cell'

Indeed, we find that the name used by the ontology is JURKAT cell. Let’s rename our value accordingly.

[13]:
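# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which may act on a temporary copy in recent pandas versions.
adata.obs["cell lines"] = adata.obs["cell lines"].replace({"JURKAT": "JURKAT cell"})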
adata.obs["cell lines"]
[13]:
0         HEK293
1    JURKAT cell
2     THP-1 cell
Name: cell lines, dtype: object
[14]:
cell_line_bt.inspect(adata.obs["cell lines"], field=cell_line_bt.name, return_df=True)
3 terms (100.00%) are validated for name
[14]:
             __validated__
HEK293                True
JURKAT cell           True
THP-1 cell            True

Now all terms can be mapped.
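
Renaming values by hand does not scale to many cell lines. As a rough sketch of how synonym mapping can work in principle, the code below builds a synonym-to-name dictionary from pipe-separated synonym strings like those in the ontology table shown earlier. The helpers `build_synonym_map` and `standardize_values` are hypothetical and are not Bionty functions.

```python
def build_synonym_map(records):
    """Map each synonym to its standardized name.

    `records` is an iterable of (name, synonyms) pairs, where `synonyms` is a
    pipe-separated string or None.
    """
    mapping = {}
    for name, synonyms in records:
        if not synonyms:
            continue
        for synonym in synonyms.split("|"):
            # Keep the first name seen if two terms share a synonym.
            mapping.setdefault(synonym.strip(), name)
    return mapping


def standardize_values(values, mapping):
    """Replace known synonyms with standardized names; keep other values."""
    return [mapping.get(value, value) for value in values]


# Toy records copied from the outputs shown in this tutorial.
toy_records = [
    ("JURKAT cell", "JURKAT"),
    ("RCB0187 cell", "RCB0187|OLHE-131"),
    ("cell line cell", None),
]
synonym_map = build_synonym_map(toy_records)
standardized = standardize_values(["HEK293", "JURKAT", "THP-1 cell"], synonym_map)
```

Values without a known synonym pass through unchanged, so the result should still be inspected afterwards to catch terms that remain unmapped.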

We could have also used the search functionality to find the match for JURKAT cells:

[15]:
cell_line_bt.search("JURKAT").head()
[15]:
ontology_id definition synonyms parents __agg__ __ratio__
name
JURKAT cell CLO:0007043 an immortalized human T lymphocyte cell that w... JURKAT [CLO:0000523] jurkat cell 100.0
RCB0806 cell CLO:0050978 A immortal human blood cell line cell that has... RCB0806|Jurkat [CLO:0000617] rcb0806 cell 100.0
+/+ (A) cell CLO:0001020 None +/+ (A) [CLO:0000019] +/+ (a) cell 90.0
Jurkat J6 cell CLO:0007044 None Jurkat J6 [CLO:0000019] jurkat j6 cell 90.0
U cell CLO:0009449 None U [CLO:0000466] u cell 90.0
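
The ranking idea behind free-text search can be illustrated with Python's standard-library difflib. This is a swapped-in technique for illustration only: Bionty's own scoring is different, so the scores below will not match the __ratio__ column above. The function `search_terms` and the toy name list are assumptions.

```python
import difflib


def search_terms(query, names, limit=5):
    """Rank names by similarity to the query, case-insensitively."""
    query_lower = query.lower()
    scored = [
        (difflib.SequenceMatcher(None, query_lower, name.lower()).ratio(), name)
        for name in names
    ]
    # Highest similarity first; ties are broken alphabetically for stability.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, round(score * 100, 1)) for score, name in scored[:limit]]


# Toy vocabulary (hypothetical subset of ontology names).
toy_names = ["HEK293", "JURKAT cell", "Jurkat J6 cell", "THP-1 cell"]
hits = search_terms("JURKAT", toy_names)
```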

The same workflow can be applied to genes.

Genes

[16]:
gene_bt = bt.Gene()
gene_bt
[16]:
PublicOntology
Entity: Gene
Organism: human
Source: ensembl, release-110
#terms: 75719

📖 .df(): ontology reference table
🔎 .lookup(): autocompletion of terms
🎯 .search(): free text search of terms
✅ .validate(): strictly validate values
🧐 .inspect(): full inspection of values
👽 .standardize(): convert to standardized names
🪜 .diff(): difference between two versions
🔗 .to_pronto(): Pronto.Ontology object
[17]:
gene_bt.inspect(adata.var_names, gene_bt.ensembl_gene_id)
2 terms (66.70%) are validated for ensembl_gene_id
❗ 1 term (33.30%) is not validated for ensembl_gene_id: ENSGcorrupted
[17]:
<lamin_utils._inspect.InspectResult at 0x762c61ff7950>

ENSGcorrupted is not a valid Ensembl gene ID and should therefore also be corrected.
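
Obviously malformed identifiers can be caught with a cheap format check before consulting the ontology. The sketch below relies on the fact that human Ensembl gene IDs consist of the prefix ENSG followed by 11 digits; the helper `find_malformed_ids` is hypothetical. A format check complements inspection but does not replace it, since a well-formed ID can still be absent from a given Ensembl release.

```python
import re

# Human Ensembl gene IDs are "ENSG" followed by 11 digits; an optional
# ".<version>" suffix is tolerated here.
ENSEMBL_HUMAN_GENE = re.compile(r"ENSG\d{11}(\.\d+)?")


def find_malformed_ids(gene_ids):
    """Return the IDs that do not look like human Ensembl gene IDs."""
    return [g for g in gene_ids if ENSEMBL_HUMAN_GENE.fullmatch(g) is None]


malformed = find_malformed_ids(
    ["ENSG00000148584", "ENSG00000121410", "ENSGcorrupted"]
)
```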

Conclusion

pertpy provides support for ontology management, inspection, and mapping through Bionty. Bionty provides access to gene, cell type, cell line, disease, and phenotype ontologies, among many others.

To access these ontologies, we create Bionty objects whose methods map synonyms and inspect data for adherence to ontologies. Mismatches can be remedied by finding the correct ontology name using lookup objects or fuzzy matching.