Question: Functional annotation of metagenomic contigs
9 months ago by
United States
ARich80 wrote:

Dear Biostar users,

It's a very naive question, but I certainly lack some clarity here. I have contigs from MEGAHIT and would like to perform functional annotation. For this purpose I first used Prodigal for gene prediction and then Prokka for annotation.

My question is: if I would like to do functional annotation on my own instead of using Prokka, what do I have to do? I mean, which tools and references?

Any suggestion will be of great help!

Cheers! AR

assembly • 378 views
ADD COMMENTlink modified 9 months ago • written 9 months ago by ARich80

Prokka is an amazing piece of software. From convenience of use to results, it has never failed me. It is a collection of different tools, and the best place to start (if you don't want to use Prokka) would be Prokka itself, i.e. looking at which tools Prokka uses to produce its annotations. I hope I am making sense here.

ADD REPLYlink written 9 months ago by microfuge1.5k

Thank you for the reply. Two questions: 1. Can Prokka take all kingdoms at once, like this: --kingdom 'Archaea|Bacteria|Mitochondria|Viruses'? 2. Can I also use Prokka on bins from MaxBin2 output? If yes, should it be called on each bin, similar to how it is called on contigs?

ADD REPLYlink written 9 months ago by ARich80
9 months ago by
Mensur Dlakic4.1k wrote:

It seems you have already binned your sequences. If so, the next step is to assess the bins for completeness and assign them to general taxonomic categories (if possible). A tool for that is CheckM. Its output looks something like this:

  Bin Id              Marker lineage        # genomes   # markers   # marker sets    0     1     2    3   4   5+   Completeness   Contamination   Strain heterogeneity
  group_000000     k__Archaea (UID146)          59         174           136         8    163    3    0   0   0       94.85            2.21               0.00
  group_000001      k__Archaea (UID2)          207         149           107         70    79    0    0   0   0       51.82            0.00               0.00
  group_000002    k__Bacteria (UID3060)        138         338           246        101   237    0    0   0   0       66.66            0.00               0.00
  group_000003   p__Euryarchaeota (UID3)       148         187           124         9    177    1    0   0   0       92.74            0.81               0.00
  group_000004      k__Archaea (UID2)          207         149           107         32   117    0    0   0   0       83.64            0.00               0.00
  group_000005   c__Thermoprotei (UID147)       54         217           168         4    212    1    0   0   0       98.21            0.60               0.00
  group_000006      k__Archaea (UID2)          207         149           107         5    105    38   1   0   0       95.79           19.52               0.00
  group_000007      k__Archaea (UID2)          207         149           107         1     8    140   0   0   0       99.07           92.21               0.00

You should probably copy and paste the lines above into a wider screen so you can read them properly. Anyway, it shows that the first two bins in this sample are Archaea and the next one is Bacteria, so you can use that to specify the kingdom using Prokka. I don't think you can tell Prokka to look at all kingdoms.
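Reading the kingdom off the CheckM table can be sketched in a few lines of Python. This is only an illustration, not part of CheckM or Prokka: the function names are made up, and the lookup for lineages below kingdom level is a hand-written, incomplete table that covers only the two examples shown above.

```python
# Hypothetical helper: map a CheckM marker lineage to a Prokka --kingdom value.
# LOWER_RANK_TO_KINGDOM is a hand-made, incomplete lookup for illustration only.
LOWER_RANK_TO_KINGDOM = {
    "p__Euryarchaeota": "Archaea",
    "c__Thermoprotei": "Archaea",
}


def kingdom_from_lineage(lineage):
    """Return 'Archaea', 'Bacteria', or None if the lineage is not recognised."""
    if lineage.startswith("k__"):
        name = lineage[3:]
        return name if name in ("Archaea", "Bacteria") else None
    return LOWER_RANK_TO_KINGDOM.get(lineage)


def parse_checkm_row(row):
    """Split one whitespace-aligned CheckM row into (bin_id, lineage)."""
    fields = row.split()
    return fields[0], fields[1]


# Three rows copied from the table above.
rows = [
    "group_000000     k__Archaea (UID146)          59         174           136         8    163    3    0   0   0       94.85            2.21               0.00",
    "group_000002    k__Bacteria (UID3060)        138         338           246        101   237    0    0   0   0       66.66            0.00               0.00",
    "group_000005   c__Thermoprotei (UID147)       54         217           168         4    212    1    0   0   0       98.21            0.60               0.00",
]

kingdoms = {}
for row in rows:
    bin_id, lineage = parse_checkm_row(row)
    kingdoms[bin_id] = kingdom_from_lineage(lineage)

print(kingdoms)
```

Bins whose lineage is not in the lookup come back as None, so they can be reviewed by hand instead of being assigned a kingdom silently.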

My recommendation is to annotate using Prokka rather than manually. We are talking about 10 minutes vs. many hours or even days, and I am still not sure that manual annotation would be more successful. If you truly feel that your sequence annotation ability is much better than that of Prokka, you can always continue from the Prokka annotation and tackle uncharacterized proteins.

ADD COMMENTlink written 9 months ago by Mensur Dlakic4.1k

Thank you for this detailed reply!

So you recommend Prokka as well! Questions: would you recommend concatenating all CheckM bins into one single file to use as input for Prokka, or do you run Prokka on individual bins? And what is recommended for classifying bins taxonomically?

Cheers! AR

ADD REPLYlink written 9 months ago by ARich80

Prokka should run on individual bins. If you concatenate them, it would be the same as running it on the whole metagenomic assembly.

After binning, create .fasta files for each bin and put them in the same directory. In the example above they were named group_00000X.fasta. After you run CheckM according to their instructions, the second column of the output (see above) will be the taxonomic classification. That may be only at the level of kingdom, or go all the way down to genus. Either way, it will provide enough information so you can assign kingdom in Prokka.
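A per-bin Prokka run could be scripted roughly as follows. This is a minimal sketch under assumptions: the `bins/` directory, the `prokka_out` output root, and the bin-to-kingdom dictionary are placeholders, and the commands are only printed, not executed. The `--kingdom`, `--outdir`, and `--prefix` options are standard Prokka options.

```python
from pathlib import Path


def prokka_command(bin_fasta, kingdom, out_root="prokka_out"):
    """Build the Prokka command line for one bin (returned as a list of arguments)."""
    bin_id = Path(bin_fasta).stem
    return [
        "prokka",
        "--kingdom", kingdom,
        "--outdir", str(Path(out_root) / bin_id),
        "--prefix", bin_id,
        str(bin_fasta),
    ]


# Placeholder mapping; in practice this comes from the second column of the CheckM output.
bin_kingdoms = {"group_000000": "Archaea", "group_000002": "Bacteria"}

commands = [
    prokka_command(f"bins/{bin_id}.fasta", kingdom)
    for bin_id, kingdom in bin_kingdoms.items()
]

for cmd in commands:
    # Dry run: print the command. To execute it, one could call
    # subprocess.run(cmd, check=True) instead.
    print(" ".join(cmd))
```

Each bin gets its own output directory, so the annotations of different bins never overwrite each other.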

For some bins there will be no annotation because CheckM only annotates prokaryotes. Those bins could be viruses, eukaryotes, or short contigs that can't be annotated conclusively.

ADD REPLYlink written 9 months ago by Mensur Dlakic4.1k

This is really of great help. I am taking the opportunity to ask more :) CheckM actually provides a very low-resolution classification, and in order to get better resolution I was running blastn against the nt database on the combined bins (as shown below). Do you think it's a good idea to do so? Or can you recommend something better?

I was running BLAST like this:

  blastn \
  -task megablast \
  -num_threads 16 \
  -db nt \
  -outfmt '6 qseqid qstart qend qlen sseqid staxids sstart send bitscore evalue nident length' \
  -query metabat_all_contigs.fa > \

But as you mentioned, combining the bins would be similar to running on the contigs. So should I run the above blastn on each bin separately?

I apologize for asking basic stuff, but I am quite new to binning and its workflows.

Cheers! AR

ADD REPLYlink modified 9 months ago • written 9 months ago by ARich80

The degree of annotation granularity by CheckM will depend on your sample. For example, c__Thermoprotei (UID147) is a very clear-cut annotation, and k__Archaea (UID146) obviously is less so. Getting a better resolution depends on how much time you want to spend and for what purpose. For example, if you just want to know for internal use what the most likely annotation is, you could blast the 5-10 largest contigs of a bin against the NT database and see if there is some kind of consensus regarding the best match. If the top hits for all of them are the same and the identity is fairly high, that will probably do the trick.
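The quick check described above (take the largest contigs, blast them, look for a consensus among the top hits) could be sketched like this. It is an illustration only: the function names are invented, the FASTA and BLAST rows are tiny made-up examples, and the column positions assume the tabular format from the blastn command earlier in the thread (qseqid qstart qend qlen sseqid staxids sstart send bitscore evalue nident length).

```python
from collections import Counter


def largest_contigs(fasta_text, n=10):
    """Return the n longest (header, sequence) records from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)[:n]


def top_hit_consensus(blast_tsv):
    """Count the staxids value of the best hit (highest bitscore) of each query."""
    best = {}
    for line in blast_tsv.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        query, taxid, bitscore = cols[0], cols[5], float(cols[8])
        if query not in best or bitscore > best[query][1]:
            best[query] = (taxid, bitscore)
    return Counter(taxid for taxid, _ in best.values())


# Made-up demo data, not real results.
demo_fasta = ">c1\nACGT\nAC\n>c2\nACGTACGTAC\n>c3\nAC\n"
demo_blast = (
    "c1\t1\t6\t6\tsubjA\t2157\t1\t6\t50.0\t1e-5\t6\t6\n"
    "c1\t1\t6\t6\tsubjB\t2\t1\t6\t40.0\t1e-3\t5\t6\n"
    "c2\t1\t10\t10\tsubjC\t2157\t1\t10\t80.0\t1e-9\t10\t10\n"
)

top_two = largest_contigs(demo_fasta, n=2)
consensus = top_hit_consensus(demo_blast)
print([header for header, _ in top_two])
print(consensus.most_common(1))
```

If one taxid accounts for all or nearly all of the best hits, that is the kind of consensus meant above; percent identity (nident divided by length) would still need to be checked separately.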

To publish your finding, you will have to be more rigorous. There are many ways to do it, but you can start with 16S rRNA (if available), build a tree with a representative set of species and see where yours is slotted. The same can be done with a concatenated set of proteins. Pick a paper from a reputable journal describing an annotation of a new species from metagenomic data and most of these steps will be described in greater detail.

ADD REPLYlink written 9 months ago by Mensur Dlakic4.1k

Thank you for the detailed explanation. It was really helpful. As you mentioned, publishing needs a more rigorous analysis. Can you suggest a good paper showing this in detail?

ADD REPLYlink written 9 months ago by ARich80