Question: pathway mapping using KEGG
3.7 years ago by
United States
mwanerhi erfgtr30 wrote:

I have assigned KEGG IDs to my newly sequenced protein sequences using KEGG/KAAS, so now I have a list of IDs. How do I assign them to pathway maps? I need to know which of the genes (proteins) is in which family.

sequencing • 2.1k views
modified 2.3 years ago by Santiago Montero-Mendieta120 • written 3.7 years ago by mwanerhi erfgtr30
3.7 years ago by
Kamil1.9k
Boston
Kamil1.9k wrote:

Could I ask you to provide an example of an input file and an example of your desired output? It might help us to better understand your question.

Perhaps you might find this tool useful? https://github.com/endrebak/kg

written 3.7 years ago by Kamil1.9k

The input is a file of more than 5,000 protein sequences, e.g.:

>mgg4500002 qor, 1144-2184 (Clockwise) Quinone oxidoreductase
MAASQCKRSCSPMKAITLQTYGGPEVALLRHDAPIPQATPGHVLVKVACAGINFMDVHTR
QGKYAQSVTYPVRLPCTLGMEGAGVVVDVGAGVSHLHVGDRVAWCIAWGAYAEYAAVPAD
KIAQIPSAITFDQAAAAMFQGCTAHYLIDDVARLHVGSTCLVHAASGSIGQLLVQMARRL
GATVFATGSSAEKCAIALQRGAHQAWTYDEGRFAERVREATAGQGVDVVFDSLGKTTLRD
SFRACRTRGLIVNYGNVSGSLTDLDPIELGEAGSLFLTRPRLADHMADGATVQRRANAVF
AAMLEGSLTVEIEGHYSLETVKQVHARIEARQQIGKAVVWVDRDLV
>mgg4500003 BASYS00003, 2160-2531 (Clockwise) Hypothetical Protein BASYS00003
MGGPRLGLMQTKKKPADQAGLGYPANSAGSGVVAVQAISAAFGQATFLQTISTAFSDTVA
IQAISTTFDQATFLQTVSTAFSDTVVIQAIRTTFDQATFLQTVSTAFSDTVAIQAIRTTF
DQA
>mgg4500004 insK, 3371-2562 (CounterClockwise) Putative transposase InsK for insertion sequence element IS150
MRDLLKLVSLARSTYYYQLKAMGVADRLSSIKASIQTIQNEHKGRFGYRRMTLELRKERS
LINGKTVRRLMGELGLKCTVRPKKYRSYKGPMGEVSPNTLARQFEAEQPNQKWVTDVTEF
KVAGKKLYLSPVLDLYNGEIVAYQTAIRPQYALVGEMLEKAIEGLPEGGKPMLHSDQGWH
YRYPKYRERLEKAGLEQSMSRKGNCHDNATMESFFGTLKSEFYYRESFESVEQLQAGLDE
YIHYYNHKRIKVKLGGLSPVAYRTRSAVA

The output should look like this:

Amino acid metabolism

MAP00250 : Alanine, aspartate and glutamate metabolism

MAP00260 : Glycine, serine and threonine metabolism

MAP00270 : Cysteine and methionine metabolism

MAP00280 : Valine, leucine and isoleucine degradation

MAP00290 : Valine, leucine and isoleucine biosynthesis

MAP00300 : Lysine biosynthesis

MAP00310 : Lysine degradation

MAP00330 : Arginine and proline metabolism

MAP00340 : Histidine metabolism

MAP00350 : Tyrosine metabolism

MAP00360 : Phenylalanine metabolism

MAP00380 : Tryptophan metabolism

MAP00400 : Phenylalanine, tyrosine and tryptophan biosynthesis

 

Biosynthesis of other secondary metabolites

MAP00232 : Caffeine metabolism

MAP00311 : Penicillin and cephalosporin biosynthesis

MAP00401 : Novobiocin biosynthesis

MAP00402 : Benzoxazinoid biosynthesis

MAP00521 : Streptomycin biosynthesis

MAP00524 : Butirosin and neomycin biosynthesis

MAP00940 : Phenylpropanoid biosynthesis

MAP00950 : Isoquinoline alkaloid biosynthesis

MAP00960 : Tropane, piperidine and pyridine alkaloid biosynthesis

MAP00966 : Glucosinolate biosynthesis

 

All proteins mapped

written 3.7 years ago by mwanerhi erfgtr30
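
For readers who already have a list of KO identifiers (for example from KAAS), the KO-to-pathway lookup asked about above could be sketched against the KEGG REST service. This is only a minimal sketch, not a method used by anyone in this thread. It assumes that the `link/pathway` operation returns tab-separated `ko:K…` / `path:map…` pairs and that `list/pathway` returns tab-separated pathway IDs and names; all function names are hypothetical. `fetch` needs network access and is not called by the parsing functions.

```python
from collections import defaultdict
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"  # public KEGG REST endpoint


def fetch(path):
    """Download one KEGG REST resource as text (requires network access)."""
    with urlopen(f"{KEGG_REST}/{path}") as response:
        return response.read().decode("utf-8")


def parse_ko_to_pathways(link_text):
    """Turn 'ko:K00001<TAB>path:map00010' lines into {pathway_id: {KO, ...}}."""
    pathway_to_kos = defaultdict(set)
    for line in link_text.splitlines():
        if not line.strip():
            continue
        ko, pathway = line.split("\t")[:2]
        ko = ko.split(":", 1)[-1]            # drop the 'ko:' prefix if present
        pathway = pathway.split(":", 1)[-1]  # drop the 'path:' prefix if present
        # Assumption: each pathway is listed as both map##### and ko#####;
        # keep only the map##### form to avoid duplicates.
        if pathway.startswith("map"):
            pathway_to_kos[pathway].add(ko)
    return dict(pathway_to_kos)


def parse_pathway_names(list_text):
    """Turn 'map00010<TAB>Glycolysis / Gluconeogenesis' lines into {id: name}."""
    names = {}
    for line in list_text.splitlines():
        if "\t" not in line:
            continue
        pathway, name = line.split("\t", 1)
        names[pathway.split(":", 1)[-1]] = name.strip()
    return names


def format_report(pathway_to_kos, names):
    """Build 'MAP00010 : name (KO, KO)' lines, sorted by pathway ID."""
    lines = []
    for pathway in sorted(pathway_to_kos):
        name = names.get(pathway, "unknown pathway")
        kos = ", ".join(sorted(pathway_to_kos[pathway]))
        lines.append(f"{pathway.upper()} : {name} ({kos})")
    return lines
```

With network access, a call such as `format_report(parse_ko_to_pathways(fetch("link/pathway/ko:K00001+ko:K00002")), parse_pathway_names(fetch("list/pathway")))` would print one line per pathway; the two KO numbers are placeholders, and long ID lists would probably need to be sent in batches. Grouping pathways under headings such as "Amino acid metabolism" is not covered by this sketch.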
2.3 years ago by
Sweden

I solved this by using GhostKOALA.

You just need to provide your query amino acid sequences in FASTA format and specify which KEGG GENES database file should be searched. You will get an email when your results are ready. On the results page, if you go to "reconstruct pathway", it will tell you how many proteins match each family and also which of the genes is in each family. Hope it helps!

written 2.3 years ago by Santiago Montero-Mendieta120
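
If the annotation is downloaded rather than browsed online, the "which genes are in which family" question can be approached by inverting the gene-to-KO table. The sketch below assumes a tab-separated file with the sequence ID in the first column and the KO number, if one was assigned, in the second; that layout, the function name, and the example IDs are assumptions for illustration, not actual results for the sequences in this thread.

```python
from collections import defaultdict


def read_gene_ko_table(lines):
    """Return ({KO: [gene, ...]}, [genes without a KO]) from 'gene<TAB>KO' lines."""
    ko_to_genes = defaultdict(list)
    unassigned = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        gene = fields[0].strip()
        if not gene:
            continue  # skip blank lines
        if len(fields) > 1 and fields[1].strip():
            ko_to_genes[fields[1].strip()].append(gene)
        else:
            unassigned.append(gene)  # no KO was assigned to this sequence
    return dict(ko_to_genes), unassigned


# Hypothetical example rows (made-up gene names and KO assignments).
example = ["geneA\tK00001\n", "geneB\n", "geneC\tK00001\n", "geneD\tK00002\n"]
ko_to_genes, unassigned = read_gene_ko_table(example)
for ko in sorted(ko_to_genes):
    print(ko, ", ".join(ko_to_genes[ko]))
print("no KO:", ", ".join(unassigned))
```

The function accepts any iterable of lines, so an open file handle can be passed directly in place of the example list.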

How long does GhostKOALA usually take to run a ~5 MB AA FASTA file? Cheers

written 2.2 years ago by h.l.wong40

I would say probably less than 1 hour. I tried with a 15 MB AA FASTA file and it took about 3 hours.

written 2.2 years ago by Santiago Montero-Mendieta120

Thanks, I uploaded a 1.3 MB AA FASTA file and it took 22 hours. I guess the server is busy at the moment?

cheers

Alan

written 2.2 years ago by h.l.wong40

Do the FASTA-formatted amino acid sequences have to be divided into proteins, like this:

>PROKKA_00002 hypothetical protein
MSINSSLQQLAGGIAAAIGGMIVVQKDNFSPIEHYDTLALVVAIFVGICVYVLSLVSKIV
RDKNKA*
>PROKKA_00003 ATP-dependent RNA helicase RhlE
LEALNRFKAGKTRVLVTTDLLARGIDIQFLPFVINYELPRSPKDYIHRIGRTVRAEASGE
AISFVSPEDQHHFKVIQKKMKKWVTMVEGDGLV*
>PROKKA_00004 Long-chain-fatty-acid--CoA ligase FadD13
MIIRGGENIYSSEVENILYEHPAVTDAALVGIPHQTLGEEPAAVVHLAPGMTATEEELRH
YVSERLAKFKVPVKIIFTQDTLPRNANGKILKRDLKALF*

I mean, they have to be, right? Otherwise, how would the program tell where one protein ends and the next one begins?

written 2.2 years ago by willnotburn40

Welcome @willnotburn: As far as I am aware, partial protein sequences can be used as input too. This means that you can input sequences that either do not start with M (5prime_partial) or do not end with * (3prime_partial). I did not have any problems with internal protein sequences either.

written 2.2 years ago by Santiago Montero-Mendieta120

Thanks, Santiago! Partial protein sequence support definitely helps. But just so I get it clearly: each (full or partial) sequence has to have its own FASTA header >, followed by the sequence on the next line. Is that right?

written 2.2 years ago by willnotburn40

Yep, it's just a regular FASTA formatted file :-)

written 2.2 years ago by Santiago Montero-Mendieta120
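
To check a file before uploading, the "one `>` header per sequence" layout discussed above can be verified with a few lines of Python. This is a rough sketch with hypothetical function names. The labels follow the convention described in this thread (a missing leading M means 5prime_partial, a missing trailing * means 3prime_partial), which assumes the gene predictor marks stop codons with `*`.

```python
def read_fasta(lines):
    """Split FASTA text into (header, sequence) pairs, joining wrapped lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def classify(sequence):
    """Label a protein as complete, 5prime_partial, 3prime_partial or internal."""
    has_start = sequence.startswith("M")
    has_stop = sequence.endswith("*")
    if has_start and has_stop:
        return "complete"
    if has_stop:
        return "5prime_partial"  # no leading M
    if has_start:
        return "3prime_partial"  # no trailing *
    return "internal"
```

Passing an open file handle to `read_fasta` and calling `classify` on each sequence gives a quick count of complete and partial proteins; a `ValueError` signals sequence lines that have no header above them.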