map proteins to EC numbers (KEGG) in non-model fungal organisms
5.9 years ago
bioPiraten ▴ 10

Hi All,

I have multiple annotated fungal genomes and I would like to map their proteomes to KEGG and infer EC number information for all of them.

My idea is to download the KEGG database (or at least the last free version, from 2012) and build a profile HMM for each EC number with HMMER, which I can then search my proteins against.

For this I need to download KEGG, so I wonder if anybody knows how or where to get it? Or does anybody have a better way to map proteomes to EC numbers?
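The per-EC HMM idea above starts with one sequence set per EC number. A minimal sketch of that grouping step, assuming the annotated reference proteins are already available as (protein ID, EC number, sequence) tuples; the function name and the toy records are hypothetical. Each resulting FASTA would still need to be aligned before it can go to HMMER's hmmbuild, and the proteomes would then be searched with hmmsearch or hmmscan.

```python
from collections import defaultdict


def group_by_ec(records):
    """Group (protein_id, ec_number, sequence) tuples into one FASTA string per EC number."""
    groups = defaultdict(list)
    for protein_id, ec_number, sequence in records:
        groups[ec_number].append(f">{protein_id}\n{sequence}")
    # One FASTA-formatted string per EC number, ready to be written to a file.
    return {ec: "\n".join(entries) + "\n" for ec, entries in groups.items()}


# Hypothetical toy records, not real KEGG entries.
records = [
    ("protA", "1.1.1.1", "MSIPETQK"),
    ("protB", "1.1.1.1", "MSLPETQR"),
    ("protC", "6.2.1.1", "MSPSAVQS"),
]

fasta_by_ec = group_by_ec(records)
for ec, fasta in sorted(fasta_by_ec.items()):
    # Print each EC number with the number of sequences collected for it.
    print(ec, fasta.count(">"))
```

An EC number with only a handful of sequences will give a weak profile, so in practice a minimum group size is probably worth enforcing before building anything.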

KEGG hmmer • 1.9k views
5.9 years ago

In R, for S. cerevisiae you can use the org.Sc.sgd.db package:

> BiocManager::install('org.Sc.sgd.db')  # biocLite('org.Sc.sgd.db') on Bioconductor releases before 3.8
> library('org.Sc.sgd.db')
> head(toTable(org.Sc.sgdENZYME))  # ORF to EC number
  systematic_name ec_number
1         YAL062W   1.4.1.4
2         YAL061W 1.1.1.303
3         YAL061W   1.1.1.4
4         YAL060W 1.1.1.303
5         YAL060W   1.1.1.4
6         YAL054C   6.2.1.1
> head(toTable(org.Sc.sgdPATH))  # ORF to KEGG pathway
  systematic_name path_id
1         YAL062W   00250
2         YAL062W   00330
3         YAL062W   00910
4         YAL062W   01100
5         YAL061W   00650
6         YAL060W   00650


Here ec_number is the EC number derived from KEGG, and path_id is the KEGG pathway ID.

For other fungal species you can use AnnotationHub: https://bioconductor.org/packages/release/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub-HOWTO.html


These are non-yeast species that have not been sequenced before, so I guess this method does not work.


Did you check the AnnotationHub link I posted? It explains how to access OrgDb packages for non-model organisms. P.S. Are your organisms in KEGG?

5.9 years ago
5heikki 10.0k

How about blastp against uniprot_sprot or trembl? See here.
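For the blastp route, the first step after the search is usually to reduce the hits to one best match per query. A minimal sketch, assuming blastp was run with tabular output (`-outfmt 6`, whose default columns put the E-value in column 11 and the bit score in column 12); the function name, the E-value cutoff, and the example rows are hypothetical. Looking up the EC number of each best hit in the UniProt annotation is a separate step not shown here.

```python
import csv
import io


def best_hits(blast_tsv, max_evalue=1e-5):
    """Return {query_id: (subject_id, bitscore)}, keeping the top-scoring hit per query."""
    best = {}
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit is too weak to transfer an annotation from
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Hypothetical rows in the default 12-column tabular layout.
rows = [
    ["prot1", "subjA", "62.1", "340", "120", "3", "1", "338", "5", "344", "1e-150", "430"],
    ["prot1", "subjB", "48.0", "330", "160", "6", "4", "330", "9", "338", "1e-90", "280"],
    ["prot2", "subjC", "25.0", "80", "55", "2", "10", "88", "20", "99", "0.5", "30"],
]
example = "\n".join("\t".join(fields) for fields in rows)

hits = best_hits(example)
print(hits)
```

Here prot2 is dropped because its only hit fails the E-value cutoff, which leaves it unannotated rather than assigned a doubtful EC number.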


I guess this is one option, but it is not as clean as going directly to KEGG. If it is not possible to download KEGG directly, then I might have to fall back on a solution like this.


If neither you nor your institution has a KEGG license, then there is no sensible way to download the entire database. I have no clue where the last free version is stored (certainly not on their FTP). Why do you think going through KEGG is more "direct" or "clean"? You could also have a look at the free KEGG alternative, MetaCyc.

5.9 years ago

When working with the KEGG database, you should focus on KEGG Orthology (KO) groups first. The proteins within one KO group have quite similar sequences (as you would expect from a group of pairwise orthologs), but in my experience they do not always make a good MSA for training a profile HMM. Instead, you should use a BLAST search and try to classify your proteins into the existing KO groups. KEGG then provides EC numbers for its KO groups where appropriate, i.e., when the proteins of the KO group have a catalytic function.

This way, you can get a good EC/KO annotation for the proteomes of your fungi.
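The last step, going from KO assignments to EC numbers, can be sketched as follows. This assumes the KO-to-EC links are available as two-column, tab-separated text with `ko:` and `ec:` prefixes (the layout I would expect from a KEGG link table, but treat it as an assumption) and that each protein has already been assigned a KO from its BLAST hit. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def parse_ko_to_ec(link_text):
    """Parse lines of the form 'ko:K00001<TAB>ec:1.1.1.1' into {KO: set of EC numbers}."""
    ko_to_ec = defaultdict(set)
    for line in link_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        ko, ec = line.split("\t")
        # Drop the 'ko:' and 'ec:' database prefixes.
        ko_to_ec[ko.split(":", 1)[1]].add(ec.split(":", 1)[1])
    return ko_to_ec


def annotate(protein_to_ko, ko_to_ec):
    """Transfer EC numbers to each protein via its KO group; non-enzymatic KOs give an empty list."""
    return {protein: sorted(ko_to_ec.get(ko, ())) for protein, ko in protein_to_ko.items()}


# Toy link table and KO assignments, for illustration only.
link_text = "ko:K00001\tec:1.1.1.1\nko:K01895\tec:6.2.1.1\n"
protein_to_ko = {"prot1": "K00001", "prot2": "K01895", "prot3": "K99999"}

ec_annotation = annotate(protein_to_ko, parse_ko_to_ec(link_text))
print(ec_annotation)
```

A KO group can carry more than one EC number, which is why the mapping stores a set per KO rather than a single value.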

But if you want to work with KEGG, you should think about licensing an up-to-date version, even though it is costly.