Robotato/ontologize

Setup

python -m pip install ontologize

Usage

Vignette: Genes

Suppose we have a list of genes, perhaps ones that are upregulated in a certain environment, and we wish to understand the functional changes in the cell.

We can first build an Ontology object from a list of the genes' BioCyc IDs:

from ontologize.ontology import build_ontology

# cadA EG10131
# lacA EG10524 
# xylA EG11074 
ont = build_ontology(objects=["EG10131", "EG10524", "EG11074"], schema_type="Gene")

Ontology objects store an annotated ontology graph as a NetworkX DiGraph:

import networkx as nx
assert isinstance(ont.graph, nx.DiGraph)
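Because the graph is a standard NetworkX DiGraph, the usual NetworkX queries apply to it. The sketch below uses a small hand-built stand-in graph rather than a real Ontology. The node names and the parent-to-child edge direction are assumptions made for illustration, not a description of how ontologize labels or orients its graph.

```python
import networkx as nx

# Hypothetical stand-in for ont.graph: edges point from a class to its members.
# Node names and edge direction are illustrative assumptions.
toy = nx.DiGraph()
toy.add_edges_from([
    ("root", "class_a"),
    ("root", "class_b"),
    ("class_a", "gene_1"),
    ("class_a", "gene_2"),
    ("class_b", "gene_3"),
])

# Leaves are nodes with no outgoing edges (here, the ontologized objects).
leaves = sorted(n for n in toy.nodes if toy.out_degree(n) == 0)

# All classes above a given leaf, found by walking edges backwards.
ancestors_of_gene_1 = sorted(nx.ancestors(toy, "gene_1"))

# Everything grouped under a given class.
members_of_class_a = sorted(nx.descendants(toy, "class_a"))

print(leaves)
print(ancestors_of_gene_1)
print(members_of_class_a)
```

If the real graph orients its edges from child to parent, swap `nx.ancestors` and `nx.descendants` and test `in_degree` instead of `out_degree`.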

Rich printing options are supported, including truncating the graph at a given depth, including or excluding leaf nodes, and coloring by depth.

print(ont.to_string(max_depth=None, include_leaves=False, colors=True))

[Image: output of the print call above]

In this example, we see that lacA and xylA are both involved in carbon utilization, while cadA is related to pH adaptation.
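To make the meaning of the depth and leaf options concrete, here is a minimal pure-Python sketch of depth-limited tree printing on a toy hierarchy. It is not the implementation behind `to_string`. The `render` function, the toy node names, and the convention that the root sits at depth 0 are all assumptions for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy: each key maps to its children.
# Nodes without children are treated as leaves.
TOY_TREE: Dict[str, List[str]] = {
    "root": ["class_a", "class_b"],
    "class_a": ["gene_1", "gene_2"],
    "class_b": ["gene_3"],
}


def render(
    tree: Dict[str, List[str]],
    node: str,
    max_depth: Optional[int] = None,
    include_leaves: bool = False,
    depth: int = 0,
) -> List[str]:
    """Return indented lines for `node` and its descendants."""
    children = tree.get(node, [])
    if not children and not include_leaves:
        return []  # leaf nodes are hidden unless requested
    lines = ["  " * depth + node]
    # Stop descending once the depth limit is reached (root is depth 0 here).
    if max_depth is not None and depth >= max_depth:
        return lines
    for child in children:
        lines.extend(render(tree, child, max_depth, include_leaves, depth + 1))
    return lines


print("\n".join(render(TOY_TREE, "root", max_depth=None, include_leaves=False)))
print("\n".join(render(TOY_TREE, "root", max_depth=None, include_leaves=True)))
```

The first call prints only the classes. The second also prints the three leaf entries beneath them.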

Command-Line Interface

Once installed, ontologize exposes a runnable script and can also be called as a module:

ontologize <file> <schema_type> [flags]
python -m ontologize <file> <schema_type> [flags]

The required arguments are given as follows:

  • file: Path to a .csv, .tsv, or .xlsx file containing the BioCyc object IDs to ontologize. By default, for a .csv or .tsv file the IDs are read from the first column, which is assumed to have no header. For a .xlsx file, the IDs are read by default from the first column of the first sheet, and the first entry is treated as a header.

  • schema_type: Type of the objects (or properties) to be ontologized in the BioCyc schema, for example Gene, Pathway, or Compound.

Note that schema_type uses the singular form of the class name!

Example:

# TODO
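As a sketch of the default input layout described above, the following Python snippet writes the three gene IDs from the vignette to a header-less, single-column .csv file and prints the matching command line. The file name `genes.csv` and the temporary directory are assumptions for illustration, and the command is only printed, not executed.

```python
import csv
import shlex
import tempfile
from pathlib import Path

# BioCyc gene IDs from the vignette: cadA, lacA, xylA.
gene_ids = ["EG10131", "EG10524", "EG11074"]

# Write a header-less, single-column CSV (the default layout for .csv input).
csv_path = Path(tempfile.mkdtemp()) / "genes.csv"  # hypothetical file name
with csv_path.open("w", newline="") as handle:
    writer = csv.writer(handle)
    for gene_id in gene_ids:
        writer.writerow([gene_id])

# Read the file back to confirm the first column holds exactly the IDs.
with csv_path.open(newline="") as handle:
    first_column = [row[0] for row in csv.reader(handle)]

# The corresponding invocation, printed rather than run.
command = ["ontologize", str(csv_path), "Gene"]
print(shlex.join(command))
```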

Flags

Ontology-building options:

  • -s <sheet_name>, --sheet <sheet_name>: For a .xlsx file, the name of the sheet containing BioCyc IDs. Ignored if file is not a .xlsx file.
  • -o <objects>, --objects <objects>: For a multi-column file, the name of the column containing BioCyc IDs for the objects to ontologize. Requires a header row containing column names.
  • -p <objects>, --property <objects>: For a multi-column file, the name of the column containing BioCyc IDs for the property to ontologize. Requires a header row containing column names. When using this option, the objects must also be specified using the -o option.
  • WARNING: -p, --property NOT YET IMPLEMENTED

  • --database <orgid>: BioCyc organism ID, specifying the organism-specific database to search. ECOLI by default.
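The ontology-building flags above can be combined for a multi-column file. The sketch below writes a small .csv file with a header row and assembles a command that selects the ID column with --objects. The column names `gene_name` and `gene_id` and the file name are hypothetical, and the command is only printed, not executed.

```python
import csv
import shlex
import tempfile
from pathlib import Path

# Hypothetical column names; "gene_id" holds the BioCyc IDs.
rows = [
    {"gene_name": "cadA", "gene_id": "EG10131"},
    {"gene_name": "lacA", "gene_id": "EG10524"},
    {"gene_name": "xylA", "gene_id": "EG11074"},
]

table_path = Path(tempfile.mkdtemp()) / "genes_table.csv"  # hypothetical name
with table_path.open("w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["gene_name", "gene_id"])
    writer.writeheader()  # --objects requires a header row with column names
    writer.writerows(rows)

# Read the header back to confirm the column that --objects will refer to.
with table_path.open(newline="") as handle:
    header = next(csv.reader(handle))

command = [
    "python", "-m", "ontologize",
    str(table_path), "Gene",
    "--objects", "gene_id",
    "--database", "ECOLI",  # the documented default, spelled out explicitly
]
print(shlex.join(command))
```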

Printing options:

  • --depth <depth>: Maximum depth of the ontology to print. No limit by default.
  • --leaves: Show leaf nodes, i.e., the ontologized objects themselves. Not shown by default.
  • --coloroff: Turns off colorful printing.
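The printing flags plausibly correspond to the keyword arguments of `to_string` shown in the vignette. The argparse sketch below illustrates one such mapping. It is not the package's actual parser: the flag names are taken from the list above, while the parser, the `to_string_kwargs` helper, and the flag-to-keyword mapping are assumptions.

```python
import argparse

# Illustrative parser only. Flag names come from the documented list, but the
# defaults and the mapping to to_string keywords are assumptions.
parser = argparse.ArgumentParser(prog="ontologize-printing-sketch")
parser.add_argument("--depth", type=int, default=None, help="maximum depth to print")
parser.add_argument("--leaves", action="store_true", help="show leaf nodes")
parser.add_argument("--coloroff", action="store_true", help="disable colors")


def to_string_kwargs(argv):
    """Translate printing flags into keyword arguments for a to_string call."""
    args = parser.parse_args(argv)
    return {
        "max_depth": args.depth,
        "include_leaves": args.leaves,
        "colors": not args.coloroff,
    }


print(to_string_kwargs([]))
print(to_string_kwargs(["--depth", "2", "--leaves", "--coloroff"]))
```

With no flags, the resulting keywords match the documented defaults: no depth limit, leaves hidden, colors on.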

TODO: graph options (not implemented), pkl options, --interactive (allows maintaining session)

References

[BioCyc19] Karp, P.D., et al. The BioCyc collection of microbial genomes and metabolic pathways. Briefings in Bioinformatics (2019).
