Verify if all the components of a pathway are present in a genomes
Entering edit mode
9.4 years ago
dago ★ 2.8k

I would like to verify if all the protein in specific pathways (e.g. secretion systems, two component regulatory system) are present in a genome.

Is there a way to do this automatically iterating the search for many genomes?

genome annotation • 8.4k views
Entering edit mode
9.4 years ago

Cleaned up the post and moved to a Gist by Ram on 22-Apr-2022.

Entering edit mode

Wow this looks like a really detailed work flow. I will give it a try but I am a bit dumb with the KEGG API. Any tutorial I could use to learn it?

Entering edit mode

Hi dago,

I figured it out by hit and trial method. Bioconductor's KEGGREST could be a good starting point to explore KEGG REST service further. Please also have a look at the HUManN workflow on how they incorporated MinPath. Here are a few more one-liners that you can use on a GENBANK file to understand the KEGG service better:

The API uses KEGG ENZYME database, which is an implementation of the Enzyme Nomenclature (EC Numbers) on the ExplorEnz database, and is maintained in the KEGG LIGAND relational database with additional annotation of reaction hierarchy, organism information, and sequence data links.

To use these one-liners on your GENBANK files, replace test.gbk with the name of the file you are using. First step is to extract a tab-separated list of only those contigs (once you have annotated them through PROKKA) which have enzymes in them. All other contigs are ignored

$ awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","ec:\\1","g")}/^LOCUS/{print ";"$2}'  test.gbk | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}'
NODE_2925_length_923_cov_1.698808   ec: 
NODE_4520_length_55408_cov_7.328635 ec: ec: ec: ec: 
NODE_6043_length_823_cov_1.500607   ec: 
NODE_918_length_270_cov_10.744445   ec:  
NODE_312_length_2459_cov_13.763318  ec: 
NODE_282_length_4935_cov_13.040932  ec:  ec:  ec:  ec:  
NODE_3153_length_49002_cov_8.110995 ec: ec: ec:  ec: ec: ec: ec:  ec: 
NODE_241_length_4648_cov_13.109509  ec: 
NODE_533_length_2973_cov_13.546249  ec:

The following one-liner extracts the list of all enzymes found in the GENBANK file and uses rest-style KEGG API to generate names from EC numbers:

$ for I in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s$i) | awk '{$1=$1"\t"}1'; done
ec:     carnitine 3-dehydrogenase
ec:     peroxiredoxin; thioredoxin peroxidase; tryparedoxin peroxidase; alkyl hydroperoxide reductase C22; AhpC; TrxPx; TXNPx; Prx; PRDX
ec:     Transferred to
ec:  catalase; equilase; caperase; optidase; catalase-peroxidase; CAT
ec:     2-dehydropantoate 2-reductase; 2-oxopantoate reductase; 2-ketopantoate reductase; 2-ketopantoic acid reductase; ketopantoate reductase; ketopantoic acid reductase
ec:  peroxidase; lactoperoxidase; guaiacol peroxidase; plant peroxidase; Japanese radish peroxidase; horseradish peroxidase (HRP); soybean peroxidase (SBP); extensin peroxidase; heme peroxidase; oxyperoxidase; protoheme peroxidase; pyrocatechol peroxidase; scopoletin peroxidase; Coprinus cinereus peroxidase; Arthromyces ramosus peroxidase

For the extracted enzymes, we can list all the KEGG Ortholog (KO) groups each enzyme is part of, along with their detailed description. The KO system is the basis for representation for all proteins and functional RNAs that correspond to KEGG pathway nodes, BRITE hierarchy nodes, and KEGG module nodes:

$ for I in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s$i | grep -Po '(?<=ko:).*' | sed -e "s/K/ko:K/g") | xargs -n 1 | xargs -I {} curl -s{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec: ko:K00411   UQCRFS1, RIP1, petA; ubiquinol-cytochrome c reductase iron-sulfur subunit [EC:]
ec:  ko:K00001   E1.1.1.1, adh; alcohol dehydrogenase [EC:]
ec:  ko:K00121   frmA, ADH5, adhC; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:]
ec:  ko:K04072   adhE; acetaldehyde dehydrogenase / alcohol dehydrogenase [EC:]
ec:  ko:K11440   gbsB; choline dehydrogenase [EC:]
ec:  ko:K13951   ADH1_7; alcohol dehydrogenase 1/7 [EC:]
ec:  ko:K13952   ADH6; alcohol dehydrogenase 6 [EC:]
ec:  ko:K13953   adhP; alcohol dehydrogenase, propanol-preferring [EC:]
ec:  ko:K13954   yiaY; alcohol dehydrogenase [EC:]
ec:  ko:K13980   ADH4; alcohol dehydrogenase 4 [EC:]
ec:    ko:K00059   fabG; 3-oxoacyl-[acyl-carrier protein] reductase [EC:]
ec:    ko:K11610   mabA; beta-ketoacyl ACP reductase [EC:]
ec:    ko:K17735   lcdH, cdhA; carnitine 3-dehydrogenase [EC:]

Similarly, for the extracted enzymes, we can also list all the known reactions these enzymes are a part of. Each reaction is identified by the R number and is linked to ortholog groups of enzymes enabling integrated analysis of genomic and chemical information.

$ for I in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s$i | grep -Po '(?<=rn:).*') | xargs -n 1 | xargs -I {} curl -s{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec: rn:R02161   Ubiquinol:ferricytochrome-c oxidoreductase; Ubiquinol + 2 Ferricytochrome c <=> Ubiquinone + 2 Ferrocytochrome c + 2 H+
ec:  rn:R00228   acetaldehyde:NAD+ oxidoreductase (CoA-acetylating); Acetaldehyde + CoA + NAD+ <=> Acetyl-CoA + NADH + H+
ec:  rn:R00623   primary_alcohol:NAD+ oxidoreductase; Primary alcohol + NAD+ <=> Aldehyde + NADH + H+
ec:  rn:R00624   Secondary_alcohol:NAD+ oxidoreductase; Secondary alcohol + NAD+ <=> Ketone + NADH + H+
ec:  rn:R00754   ethanol:NAD+ oxidoreductase; Ethanol + NAD+ <=> Acetaldehyde + NADH + H+
ec:  rn:R01172   butanal:NAD+ oxidoreductase (CoA-acylating); Butanal + CoA + NAD+ <=> Butanoyl-CoA + NADH + H+
ec:  rn:R02124   retinol:NAD+ oxidoreductase; Retinol + NAD+ <=> Retinal + NADH + H+
ec:  rn:R02878   1-Octanol:NAD+ oxidoreductase; 1-Octanol + NAD+ <=> 1-Octanal + NADH + H+

We may also be interested in knowing which other organisms contain the same enzymes:

$ for I in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s$i | grep -Po '(?<=genome:).*') | xargs -n 1 | xargs -I {} curl -s{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec: genome:T01003   rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec: genome:T01008   bta, BOVIN, 9913; Bos taurus (cow)
ec:  genome:T00030   dme, DROME, 7227; Drosophila melanogaster (fruit fly)
ec:  genome:T01001   hsa, HUMAN, 9606; Homo sapiens (human)
ec:  genome:T01003   rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:  genome:T01058   ecb, HORSE, 9796; Equus caballus (horse)
ec:    genome:T00007   eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:    genome:T00035   pae, PSEAE, 208964; Pseudomonas aeruginosa PAO1
ec:    genome:T01001   hsa, HUMAN, 9606; Homo sapiens (human)
ec:    genome:T01003   rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec: genome:T00005   sce, YEAST, 559292; Saccharomyces cerevisiae S288c
ec:    genome:T00564   ckl, CLOK5, 431943; Clostridium kluyveri DSM 555
ec: genome:T00115   lpl, LACPL, 220668; Lactobacillus plantarum WCFS1

Best Wishes,

Entering edit mode
9.4 years ago
5heikki 11k

A combination of KEGG's API, blastp and scripting should do it. If you're doing more small scale stuff, it might be feasible to just use and then the mapping tools provided at the site..

Entering edit mode

Thanks for the answer. I tried that as well. But I was just wondering how sensitive this is. In blast KOALA the search is done using as reference the information for a Genus, and the information available for the genus I am working with are quite limited. Therefore, will I miss some protein known in other taxon to be involved in a specific process but not described in my genus?

Entering edit mode

I don't think so:

The database files are generated from KEGG GENES as a collection of representative genomes by removing similar organisms at the species, genus or family level. When multiple members are present in each species/genus/family group, the first genome is taken as a representative genome. When the other members in the group contain different K numbers that are not present in the representative genome, those genes are added as if they are present in additional chromosomes or plasmids.

Stated right there in the blast koala site..

Entering edit mode

Ok, got it, but for my Genus there are only two genomes available, will it change?

Entering edit mode
9.4 years ago
Ram 44k

I'd take in a list of proteins as query and BLASTP against organism-filtered NR, verify identity and similarity in top 3 hits per query to ensure they fall under a threshold.

Entering edit mode
9.4 years ago
Asaf 10k

You might want to check the methods of this paper: Long-term phenotypic evolution of bacteria : Nature : Nature Publishing Group

They used Flux Balance analysis to determine the substrates a bacterial strain can grow on.

Entering edit mode

This is really cool, thanks for sharing. I am afraid this is to complex for my capacity..:P


Login before adding your answer.

Traffic: 2481 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6