Question: Verify if all the components of a pathway are present in a genomes
gravatar for dago
5.6 years ago by
dago2.6k wrote:

I would like to verify if all the protein in specific pathways (e.g. secretion systems, two component regulatory system) are present in a genome.

Is there a way to do this automatically iterating the search for many genomes?

annotation genome • 6.2k views
ADD COMMENTlink modified 5.6 years ago by umer.zeeshan.ijaz1.8k • written 5.6 years ago by dago2.6k
gravatar for umer.zeeshan.ijaz
5.6 years ago by
Glasgow, UK
umer.zeeshan.ijaz1.8k wrote:

You can use MinPath to see if all the enzymes are sufficient for a given Pathway to exist (the other way around). This is what I will do:

If I start with contigs or a genome sequence (can be a file with multiple genomes in it), I will use a tool such as PROKKA  to generate a genbank file of my annotation as well as all the extracted CDS regions as a FASTA file with amino acid sequences.

Say I start with the following file:

$ head Sample_1.fasta

Next, we use RAPsearch to search our protein sequences against the KEGG database and get a tab-delimited blast output file (identified by *.out.m8 ending). To speed up RAPsearch, you can specify the number of cores to run on using -z switch.

$ rapsearch -d /PATH_TO_KEGG_DB/kegg_db -z 2 -q Sample_1.fasta -o Sample_1.out
$ head *.out.m8
# RAPSearch
# Job submitted: Thu Jul  3 17:57:51 2014
# Query : Sample_1.fasta
# Subject : /PATH_TO_KEGG_DB/kegg_db
# Fields: Query	Subject	identity	aln-len	mismatch	gap-openings	q.start	q.end	s.start	s.end	log(e-value)	bit-score
PROT1	psa:PST_1876	89.2308	325	35	0	1	325	84	408	-156.9	557.37
PROT1	pmy:Pmen_2502	84.7095	327	48	1	1	325	84	410	-149.71	533.49
PROT1	psb:Psyr_2010	78.7692	325	67	1	1	325	89	411	-137.65	493.43
PROT1	hch:HCH_04744	73.2143	336	74	4	2	325	80	411	-130.46	469.54
PROT1	sde:Sde_2105	71.6049	324	89	2	5	325	80	403	-124.55	449.9

We can then identify enzymes in our dataset. For this purpose, we use a REST-style KEGG API to link KEGG entries (2nd column in the output file) to EC numbers:

$ awk -F"\t" '!/#/{print $1","$2}' Sample_1.out.m8 | while IFS=$"," read -r -a mA; do echo -e "${mA[0]}\t$(curl -s${mA[1]})"; done | awk '{print $1"\t"$3}' > ec_assignments.txt
$ head ec_assignments.txt
PROT1	ec:
PROT1	ec:
PROT1	ec:
PROT1	ec:
PROT1	ec:
PROT1	ec:
PROT1	ec:
PROT1	ec:
PROT1	ec:
PROT1	ec:

RAPsearch reports multiple matches for a single protein sequence. We assign the most abundant EC number to each protein sequence:

$ sort ec_assignments.txt | uniq -c | grep -i "ec:" | awk '{gsub("^ec:","",$3);if(ec[$2]){if($1>=c[$2]){ec[$2]=$3;c[$2]=$1}} else {ec[$2]=$3;c[$2]=$1}} END{for (i in ec) print i"\t"ec[i]}' >
$ tail

We next construct parsimonious pathways using MinPath as it does not attempt to reconstruct entire pathways from a given set of protein sequences identified in a protemics/genomics experiment, but determines the minimal set of biological pathways that must exist in the biological system to explain the input protein sequences. It takes a set of reference pathways and a set of proteins that can be mapped to one or more pathways, and finds the minimum number of pathways that can explain all the protein functions. MinPath comes with a mapping file ec2path that lists the reference pathways from MetaCyc and all the EC numbers associated with them. MetaCyc serves as an encyclopedia of metabolism containing more than 2151 patways from more than 2500 different organisms.

$  head -50 ec2path
#Metacyc pathway and ec mapping file
#Pathway	EC

To construct pathways from the generated EC file using MetaCyc pathways (using default mapping file in MinPath), run it as follows:

$ -any -map ec2path -report -details
$ head
path 2 any n/a  naive 1  minpath 0  fam0  39  fam-found  2  name  PWY-6146
path 4 any n/a  naive 1  minpath 1  fam0  8  fam-found  1  name  BRANCHED-CHAIN-AA-SYN-PWY
path 11 any n/a  naive 1  minpath 1  fam0  9  fam-found  1  name  PWY-821
path 44 any n/a  naive 1  minpath 1  fam0  1  fam-found  1  name  GLUTSYNIII-PWY
path 46 any n/a  naive 1  minpath 0  fam0  6  fam-found  1  name  PWY-5505
path 49 any n/a  naive 1  minpath 1  fam0  8  fam-found  1  name  PWY-6549
path 53 any n/a  naive 1  minpath 0  fam0  13  fam-found  1  name  THREOCAT-PWY
path 55 any n/a  naive 1  minpath 0  fam0  5  fam-found  1  name  ILEUSYN-PWY
path 56 any n/a  naive 1  minpath 0  fam0  11  fam-found  1  name  PWY-3001
path 57 any n/a  naive 1  minpath 0  fam0  6  fam-found  1  name  PWY-5101

The last column reports the pathways, which you can search on MetaCyc website to get the detailed information. For example, here is the record for BRANCHED-CHAIN-AA-SYN-PWY on the website:

However, we are interested in KEGG pathways so that we can draw metabolic networks using iPath. We, therefore, generate a new mapping file using the REST-style KEGG API:

$ for i in $(curl -S | awk -F"\t" '{print $1}'); do curl -S$i; done > ec2kegg
$ awk -F"\t" '!/^$/{gsub("^path:","",$1);gsub("^ec:","",$2);print $1"\t"$2}' ec2kegg > ec2kegg_final
$ chmod +x ec2kegg_final
$ head -50 ec2kegg_final

Now run with the new mapping file using -map ec2kegg_final in the command:

$ -any -map ec2kegg_final -report -details
$ head
path 1 any n/a  naive 1  minpath 0  fam0  45  fam-found  3  name  map00010
path 2 any n/a  naive 1  minpath 0  fam0  24  fam-found  1  name  map00020
path 3 any n/a  naive 1  minpath 0  fam0  50  fam-found  1  name  map00030
path 4 any n/a  naive 1  minpath 0  fam0  63  fam-found  1  name  map00040
path 5 any n/a  naive 1  minpath 0  fam0  70  fam-found  2  name  map00051
path 6 any n/a  naive 1  minpath 0  fam0  46  fam-found  1  name  map00052
path 9 any n/a  naive 1  minpath 0  fam0  13  fam-found  1  name  map00062
path 10 any n/a  naive 1  minpath 0  fam0  30  fam-found  2  name  map00071
path 18 any n/a  naive 1  minpath 0  fam0  11  fam-found  2  name  map00190
path 19 any n/a  naive 1  minpath 0  fam0  3  fam-found  1  name  map00195

To get the description of the reported KEGG pathways, use the following one-liner:

$ for i in $(awk '{print $14}'; do curl -S$i; done
path:map00010	Glycolysis / Gluconeogenesis
path:map00020	Citrate cycle (TCA cycle)
path:map00030	Pentose phosphate pathway
path:map00040	Pentose and glucuronate interconversions
path:map00051	Fructose and mannose metabolism
path:map00052	Galactose metabolism
path:map00062	Fatty acid elongation
path:map00071	Fatty acid degradation
path:map00190	Oxidative phosphorylation
path:map00195	Photosynthesis
path:map00230	Purine metabolism
path:map00240	Pyrimidine metabolism
path:map00250	Alanine, aspartate and glutamate metabolism
path:map00280	Valine, leucine and isoleucine degradation
path:map00281	Geraniol degradation
path:map00290	Valine, leucine and isoleucine biosynthesis
path:map00310	Lysine degradation
path:map00330	Arginine and proline metabolism
path:map00360	Phenylalanine metabolism
path:map00362	Benzoate degradation
path:map00380	Tryptophan metabolism
path:map00410	beta-Alanine metabolism
path:map00511	Other glycan degradation
path:map00531	Glycosaminoglycan degradation
path:map00592	alpha-Linolenic acid metabolism
path:map00600	Sphingolipid metabolism
path:map00604	Glycosphingolipid biosynthesis - ganglio series
path:map00620	Pyruvate metabolism
path:map01120	Microbial metabolism in diverse environments

We want to use interactive Pathways Explorer (iPath), which is a web-based tool for the visualization, analysis and customization of the various pathways maps. The metabolic pathways are constructed using 146 KEGG pathways, and give an overview of the complete metabolism in biological systems. We first extract the pathways in the format to be used with iPath and give each of them red color:

$ awk '{gsub("^map","",$14);print $14" #ff0000"}' >

Now go to and click on iPath v2 interface to load the applet:

Next click on the "Customize" button to load the "Customization and data mapping" tab, and upload by clicking on "Select file" button. This will populate the "Element selection" text area. You can manually assign different colors to your pathways here. Finally, click on "Submit data and customize maps" button to generate your metabolic network.

Once generated, you can save the metabolic network as a PDF document to your local computer. The connections in red are the ones present in your sample:

Best Wishes,


ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by umer.zeeshan.ijaz1.8k

Wow this looks like a really detailed work flow. I will give it a try but I am a bit dumb with the KEGG API. Any tutorial I could use to learn it?

ADD REPLYlink written 5.6 years ago by dago2.6k

Hi dago,

I figured it out by hit and trial method. Bioconductor's KEGGREST could be a good starting point to explore KEGG REST service further. Please also have a look at the HUManN workflow on how they incorporated MinPath. Here are a few more one-liners that you can use on a GENBANK file to understand the KEGG service better:

The API uses KEGG ENZYME database, which is an implementation of the Enzyme Nomenclature (EC Numbers) on the ExplorEnz database, and is maintained in the KEGG LIGAND relational database with additional annotation of reaction hierarchy, organism information, and sequence data links.
To use these one-liners on your GENBANK files, replace test.gbk with the name of the file you are using. First step is to extract a tab-separated list of only those contigs (once you have annotated them through PROKKA) which have enzymes in them. All other contigs are ignored

$ awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","ec:\\1","g")}/^LOCUS/{print ";"$2}'  test.gbk | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}'
NODE_2925_length_923_cov_1.698808	ec:	
NODE_4520_length_55408_cov_7.328635	ec:	ec:	ec:	ec:	
NODE_6043_length_823_cov_1.500607	ec:	
NODE_918_length_270_cov_10.744445	ec:	
NODE_312_length_2459_cov_13.763318	ec:	
NODE_282_length_4935_cov_13.040932	ec:	ec:	ec:	ec:	
NODE_3153_length_49002_cov_8.110995	ec:	ec:	ec:	ec:	ec:	ec:	ec:	ec:	
NODE_241_length_4648_cov_13.109509	ec:	
NODE_533_length_2973_cov_13.546249	ec:

The following one-liner extracts the list of all enzymes found in the GENBANK file and uses rest-style KEGG API to generate names from EC numbers:

$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s$i) | awk '{$1=$1"\t"}1'; done
ec:	 carnitine 3-dehydrogenase
ec:	 peroxiredoxin; thioredoxin peroxidase; tryparedoxin peroxidase; alkyl hydroperoxide reductase C22; AhpC; TrxPx; TXNPx; Prx; PRDX
ec:	 Transferred to
ec:	 catalase; equilase; caperase; optidase; catalase-peroxidase; CAT
ec:	 2-dehydropantoate 2-reductase; 2-oxopantoate reductase; 2-ketopantoate reductase; 2-ketopantoic acid reductase; ketopantoate reductase; ketopantoic acid reductase
ec:	 peroxidase; lactoperoxidase; guaiacol peroxidase; plant peroxidase; Japanese radish peroxidase; horseradish peroxidase (HRP); soybean peroxidase (SBP); extensin peroxidase; heme peroxidase; oxyperoxidase; protoheme peroxidase; pyrocatechol peroxidase; scopoletin peroxidase; Coprinus cinereus peroxidase; Arthromyces ramosus peroxidase

For the extracted enzymes, we can list all the KEGG Ortholog (KO) groups each enzyme is part of, along with their detailed description. The KO system is the basis for representation for all proteins and functional RNAs that correspond to KEGG pathway nodes, BRITE hierarchy nodes, and KEGG module nodes:

$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s$i | grep -Po '(?<=ko:).*' | sed -e "s/K/ko:K/g") | xargs -n 1 | xargs -I {} curl -s{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:	ko:K00411	UQCRFS1, RIP1, petA; ubiquinol-cytochrome c reductase iron-sulfur subunit [EC:]
ec:	ko:K00001	E1.1.1.1, adh; alcohol dehydrogenase [EC:]
ec:	ko:K00121	frmA, ADH5, adhC; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:]
ec:	ko:K04072	adhE; acetaldehyde dehydrogenase / alcohol dehydrogenase [EC:]
ec:	ko:K11440	gbsB; choline dehydrogenase [EC:]
ec:	ko:K13951	ADH1_7; alcohol dehydrogenase 1/7 [EC:]
ec:	ko:K13952	ADH6; alcohol dehydrogenase 6 [EC:]
ec:	ko:K13953	adhP; alcohol dehydrogenase, propanol-preferring [EC:]
ec:	ko:K13954	yiaY; alcohol dehydrogenase [EC:]
ec:	ko:K13980	ADH4; alcohol dehydrogenase 4 [EC:]
ec:	ko:K00059	fabG; 3-oxoacyl-[acyl-carrier protein] reductase [EC:]
ec:	ko:K11610	mabA; beta-ketoacyl ACP reductase [EC:]
ec:	ko:K17735	lcdH, cdhA; carnitine 3-dehydrogenase [EC:]

Similarly, for the extracted enzymes, we can also list all the known reactions these enzymes are a part of. Each reaction is identified by the R number and is linked to ortholog groups of enzymes enabling integrated analysis of genomic and chemical information.

$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s$i | grep -Po '(?<=rn:).*') | xargs -n 1 | xargs -I {} curl -s{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:	rn:R02161	Ubiquinol:ferricytochrome-c oxidoreductase; Ubiquinol + 2 Ferricytochrome c <=> Ubiquinone + 2 Ferrocytochrome c + 2 H+
ec:	rn:R00228	acetaldehyde:NAD+ oxidoreductase (CoA-acetylating); Acetaldehyde + CoA + NAD+ <=> Acetyl-CoA + NADH + H+
ec:	rn:R00623	primary_alcohol:NAD+ oxidoreductase; Primary alcohol + NAD+ <=> Aldehyde + NADH + H+
ec:	rn:R00624	Secondary_alcohol:NAD+ oxidoreductase; Secondary alcohol + NAD+ <=> Ketone + NADH + H+
ec:	rn:R00754	ethanol:NAD+ oxidoreductase; Ethanol + NAD+ <=> Acetaldehyde + NADH + H+
ec:	rn:R01172	butanal:NAD+ oxidoreductase (CoA-acylating); Butanal + CoA + NAD+ <=> Butanoyl-CoA + NADH + H+
ec:	rn:R02124	retinol:NAD+ oxidoreductase; Retinol + NAD+ <=> Retinal + NADH + H+
ec:	rn:R02878	1-Octanol:NAD+ oxidoreductase; 1-Octanol + NAD+ <=> 1-Octanal + NADH + H+

We may also be interested in knowing which other organisms contain the same enzymes:

$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s$i | grep -Po '(?<=genome:).*') | xargs -n 1 | xargs -I {} curl -s{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:	genome:T01003	rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:	genome:T01008	bta, BOVIN, 9913; Bos taurus (cow)
ec:	genome:T00030	dme, DROME, 7227; Drosophila melanogaster (fruit fly)
ec:	genome:T01001	hsa, HUMAN, 9606; Homo sapiens (human)
ec:	genome:T01003	rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:	genome:T01058	ecb, HORSE, 9796; Equus caballus (horse)
ec:	genome:T00007	eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:	genome:T00035	pae, PSEAE, 208964; Pseudomonas aeruginosa PAO1
ec:	genome:T01001	hsa, HUMAN, 9606; Homo sapiens (human)
ec:	genome:T01003	rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:	genome:T00005	sce, YEAST, 559292; Saccharomyces cerevisiae S288c
ec:	genome:T00564	ckl, CLOK5, 431943; Clostridium kluyveri DSM 555
ec:	genome:T00115	lpl, LACPL, 220668; Lactobacillus plantarum WCFS1

Best Wishes,




ADD REPLYlink written 5.6 years ago by umer.zeeshan.ijaz1.8k
gravatar for 5heikki
5.6 years ago by
5heikki9.0k wrote:

A combination of KEGG's API, blastp and scripting should do it. If you're doing more small scale stuff, it might be feasible to just use and then the mapping tools provided at the site..

ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by 5heikki9.0k

Thanks for the answer. I tried that as well. But I was just wondering how sensitive this is. In blast KOALA the search is done using as reference the information for a Genus, and the information available for the genus I am working with are quite limited. Therefore, will I miss some protein known in other taxon to be involved in a specific process but not described in my genus?

ADD REPLYlink written 5.6 years ago by dago2.6k

I don't think so:

The database files are generated from KEGG GENES as a collection of representative genomes by removing similar organisms at the species, genus or family level. When multiple members are present in each species/genus/family group, the first genome is taken as a representative genome. When the other members in the group contain different K numbers that are not present in the representative genome, those genes are added as if they are present in additional chromosomes or plasmids.

Stated right there in the blast koala site..

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by 5heikki9.0k

Ok, got it, but for my Genus there are only two genomes available, will it change?

ADD REPLYlink written 5.6 years ago by dago2.6k
gravatar for RamRS
5.6 years ago by
Baylor College of Medicine, Houston, TX
RamRS30k wrote:

I'd take in a list of proteins as query and BLASTP against organism-filtered NR, verify identity and similarity in top 3 hits per query to ensure they fall under a threshold.

ADD COMMENTlink written 5.6 years ago by RamRS30k
gravatar for Asaf
5.6 years ago by
Asaf8.4k wrote:

You might want to check the methods of this paper:

Long-term phenotypic evolution of bacteria : Nature : Nature Publishing Group

They used Flux Balance analysis to determine the substrates a bacterial strain can grow on. 

ADD COMMENTlink written 5.6 years ago by Asaf8.4k

This is really cool, thanks for sharing. I am afraid this is to complex for my capacity..:P

ADD REPLYlink written 5.6 years ago by dago2.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2085 users visited in the last hour