Hi guys, I am working with a bacterial genome (454) at the moment and would like to assign COG functional classifications to all of the 5000 or so genes. I have used the RAST web server to annotate this genome. I have written to NCBI about the cognitor program, but they tell me that it is no longer supported and that there is no way to do COG searches in batch mode. It would be fantastic if any of you could share your experiences on this. Thanks!
So far as I know, there is no easy web-based tool for COG assignment. As Michael suggested, you could fetch PSSMs for the COG database from the FTP site and use rpsblast, or you could download the fasta format file myva from the COG FTP site and format it for search yourself.
My impression is that NCBI lacks either the resources or the inclination to support COGs: it barely features in their A-Z resources list and is not regularly updated. You may want to look at KEGG instead. Annotation of protein domains using e.g. HMMER or InterPro seems to be a more popular approach than functional assignment to the entire protein sequence, these days.
As far as I know, there are two possible ways to solve this:
Use an entirely automatic gene annotation pipeline. I know Augustus+ for eukaryotes; I'm sure someone can point you in the right direction for bacteria.
Do gene prediction and classification separately. If I understand you correctly, you already have the predicted genes and just want to classify them automatically.
One possibility to do this would be rpsblast (which I'm also currently working with; if there are alternatives, please let me know).
[...] that there is no way to do COG searches in batch mode
This is definitely not correct. For example, use rpsblast with the COG database:
- download and install NCBI BLAST+
- download the COG database as .smp files from NCBI (cdd.tar.gz here, see README for details)
- create a COG-only rpsblast database (cf. this tutorial, ignore the BioPython part)
- BLAST your predicted genes against your newly created database with the rpstblastn executable and interpret the PSSM matches (easiest way: highest COG match with e-value < e_max is a specific hit; note that frame)
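For what it's worth, here is a minimal Python sketch of the last two steps plus the "highest match below e_max" rule. Treat it as an illustration under assumptions, not a tested pipeline: the file names (`Cog.pn`, `Cog`, `genes.fna`, `hits.tsv`) are hypothetical, the cutoff `E_MAX = 1e-5` is my own placeholder because the post does not give a value for e_max, and I take "highest" to mean highest bit score. It assumes the default 12-column tabular output (`-outfmt 6`). The subject IDs in that output are CDD identifiers, so you would still need to map them to COG numbers (I believe the `cddid.tbl` file on the CDD FTP site can be used for that).

```python
import csv
import io

E_MAX = 1e-5  # assumed cutoff; the thread leaves e_max unspecified


def makeprofiledb_command(pn_file, db_prefix):
    """Command that builds an rpsblast database from a .pn file,
    i.e. a text file listing the COG .smp files, one per line."""
    return ["makeprofiledb", "-in", pn_file, "-out", db_prefix,
            "-dbtype", "rps", "-title", "COG"]


def rpstblastn_command(query_fasta, db_prefix, out_path, e_max=E_MAX):
    """Command that searches nucleotide gene sequences against the
    profile database and writes tabular output (-outfmt 6)."""
    return ["rpstblastn", "-query", query_fasta, "-db", db_prefix,
            "-evalue", str(e_max), "-outfmt", "6", "-out", out_path]


def best_hits(tabular_text, e_max=E_MAX):
    """Return {query: (subject, evalue, bitscore)}, keeping for each
    query the hit with the highest bit score among hits with
    e-value < e_max. Expects the default 12 columns of -outfmt 6."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= e_max:
            continue  # not significant under the assumed cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

To actually run the searches you would pass the command lists to `subprocess.run(cmd, check=True)` with BLAST+ on your PATH, then read `hits.tsv` and hand its contents to `best_hits`. Genes that do not appear in the result had no hit below the cutoff.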
Sorry, can I ask some questions? I still don't know how to start my search. I have downloaded BLAST+ and the CDD database and read the user manual, but I just can't figure out where I should type those commands. After installing BLAST+, I just see a group of BLAST programs, which makes it difficult for me to follow or understand. I don't know any programming language, but I have to figure out how to use BLAST to classify my identified results, because I don't have the time or patience to search COGs on the website one by one. Please, can somebody help me and tell me how to start my search? I tried to read the guide on NCBI, but for a student from a non-English-speaking country there is too much text to read, and it makes me impatient. Sorry, I have to say: compared with other databases, NCBI is very, very hard to understand... T^T
There is a program that does automated COG assignment: look into MEGAN. It's metagenomics software, but basically you could just run a BLAST on all your sequences and feed the output to it. It automatically extracts the COGs and gives you a chart of them for your reads (or genes, if you want to assemble them first). Good luck! (P.S. You might have to clean a few up depending on how well annotated the hits you get back from BLAST are, but it's definitely a lot quicker to do it this way.)