Question: pathway mapping using KEGG
I have assigned KEGG IDs to my newly sequenced protein sequences using KEGG/KAAS, so now I have a list of IDs. How do I assign them to pathway maps? I need to know which of the genes (proteins) is in which family.

Could I ask you to provide an example of an input file and an example of your desired output? It might help us to better understand your question.

Perhaps you might find this tool useful?

The input is a file of more than 5000 protein sequences.

e.g.
>mgg4500002 qor, 1144-2184 (Clockwise) Quinone oxidoreductase
>mgg4500003 BASYS00003, 2160-2531 (Clockwise) Hypothetical Protein BASYS00003
>mgg4500004 insK, 3371-2562 (CounterClockwise) Putative transposase InsK for insertion sequence element IS150

output should look like this:

Amino acid metabolism

MAP00250 : Alanine, aspartate and glutamate metabolism

MAP00260 : Glycine, serine and threonine metabolism

MAP00270 : Cysteine and methionine metabolism

MAP00280 : Valine, leucine and isoleucine degradation

MAP00290 : Valine, leucine and isoleucine biosynthesis

MAP00300 : Lysine biosynthesis

MAP00310 : Lysine degradation

MAP00330 : Arginine and proline metabolism

MAP00340 : Histidine metabolism

MAP00350 : Tyrosine metabolism

MAP00360 : Phenylalanine metabolism

MAP00380 : Tryptophan metabolism

MAP00400 : Phenylalanine, tyrosine and tryptophan biosynthesis


Biosynthesis of other secondary metabolites

MAP00232 : Caffeine metabolism

MAP00311 : Penicillin and cephalosporin biosynthesis

MAP00401 : Novobiocin biosynthesis

MAP00402 : Benzoxazinoid biosynthesis

MAP00521 : Streptomycin biosynthesis

MAP00524 : Butirosin and neomycin biosynthesis

MAP00940 : Phenylpropanoid biosynthesis

MAP00950 : Isoquinoline alkaloid biosynthesis

MAP00960 : Tropane, piperidine and pyridine alkaloid biosynthesis

MAP00966 : Glucosinolate biosynthesis


All proteins mapped

I solved this by using GhostKOALA.

You just need to provide your query amino acid sequences in FASTA format and specify which KEGG GENES database file should be searched. You will get an email when your results are ready. In the results, if you go to "reconstruct pathway", it will tell you how many proteins match each family and also which genes are in each family. Hope it helps!
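If you would rather do the grouping yourself once you have the KO assignments, here is a minimal Python sketch. It assumes two inputs that are not shown in this thread: a tab-separated gene-to-KO table (one gene per line, second column empty when no KO was assigned) and a tab-separated KO-to-pathway table with `ko:`/`path:` prefixes, which is the layout I would expect from the KEGG REST `link/pathway/ko` listing. The function names and the sample rows are made up for illustration and were not checked against KEGG.

```python
from collections import defaultdict

def parse_gene_to_ko(text):
    """Parse 'gene<TAB>KO' lines; genes without a KO assignment are skipped."""
    gene_to_ko = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1].strip():
            gene_to_ko[fields[0].strip()] = fields[1].strip()
    return gene_to_ko

def parse_ko_to_pathways(text):
    """Parse 'ko:Kxxxxx<TAB>path:mapNNNNN' lines into {KO: set of pathway IDs}."""
    ko_to_paths = defaultdict(set)
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 2:
            continue
        ko = fields[0].split(":")[-1].strip()
        pathway = fields[1].split(":")[-1].strip()
        # Keep only 'map'-prefixed IDs so that a pathway listed under a
        # second prefix (assumed here to be 'ko') is not counted twice.
        if pathway.startswith("map"):
            ko_to_paths[ko].add(pathway)
    return ko_to_paths

def genes_by_pathway(gene_to_ko, ko_to_paths):
    """Return {pathway ID: sorted list of genes whose KO maps to it}."""
    grouped = defaultdict(list)
    for gene, ko in gene_to_ko.items():
        for pathway in ko_to_paths.get(ko, ()):
            grouped[pathway].append(gene)
    return {pathway: sorted(genes) for pathway, genes in sorted(grouped.items())}

# Illustrative, made-up input tables (hypothetical gene names).
GENE_KO_TABLE = "geneA\tK00001\ngeneB\ngeneC\tK00002\ngeneD\tK00001\n"
KO_PATHWAY_TABLE = (
    "ko:K00001\tpath:map00010\n"
    "ko:K00001\tpath:ko00010\n"
    "ko:K00002\tpath:map00010\n"
    "ko:K00001\tpath:map00350\n"
)

grouped = genes_by_pathway(
    parse_gene_to_ko(GENE_KO_TABLE),
    parse_ko_to_pathways(KO_PATHWAY_TABLE),
)
for pathway, genes in grouped.items():
    print(pathway, len(genes), ",".join(genes))
```

This keeps one KO per gene for simplicity; if your table can list several KOs for the same gene, store a list per gene instead.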

How long does it usually take for GhostKOALA to run a ~5 MB AA FASTA file? Cheers

I would say probably less than 1 hour. I tried with a 15 MB AA FASTA file and it took about 3 hours.

Thanks, I uploaded a 1.3 MB AA FASTA file and it took 22 hours. I guess the server is busy at the moment?



Do the FASTA-formatted amino acid sequences have to be divided into proteins, like this:

>PROKKA_00002 hypothetical protein
>PROKKA_00003 ATP-dependent RNA helicase RhlE
>PROKKA_00004 Long-chain-fatty-acid--CoA ligase FadD13

I mean, they have to be, right? Otherwise, how would the program tell where one protein ends and the next one begins?

Welcome @willnotburn: As far as I am aware, partial protein sequences can be used as input too. This means that you can input sequences that either do not start with M (5prime_partial) or do not end with * (3prime_partial). I did not have any problem with internal protein sequences either.

Thanks, Santiago! Partial protein sequence support definitely helps. But just so I get it clearly: each (full or partial) sequence has to have its own FASTA header >, followed by the sequence on the next line. Is that right?

Yep, it's just a regular FASTA formatted file :-)
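If it helps to check a file before uploading, here is a small Python sketch of the rule above: every record starts with its own `>` header, and everything up to the next header is that protein's sequence. It also labels records as complete or partial using the M / * convention mentioned earlier in this thread. The function names are hypothetical and the sequences are made-up placeholders, not real Prokka output.

```python
def read_fasta(text):
    """Split FASTA text into (header, sequence) pairs, one per '>' line."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequence lines may wrap; they are joined into one string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def completeness(sequence):
    """Label a protein by whether it starts with M and ends with '*'."""
    has_start = sequence.startswith("M")
    has_stop = sequence.endswith("*")
    if has_start and has_stop:
        return "complete"
    if has_stop:
        return "5prime_partial"
    if has_start:
        return "3prime_partial"
    return "internal"

# Made-up placeholder sequences; only the header style follows the thread.
SAMPLE = """>PROKKA_00002 hypothetical protein
MKTAYIAKQR
QISFVK*
>PROKKA_00003 ATP-dependent RNA helicase RhlE
KTAYIAKQRQ
"""

for header, sequence in read_fasta(SAMPLE):
    print(header.split()[0], len(sequence), completeness(sequence))
```

A file with sequence lines but no header raises a `ValueError` here, which is the situation the question was worried about.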

