Question

What BLAST+ cutoffs are used in Prokka annotation?


23 months ago

braun_tube ▴ 30

I am using Prokka to annotate assemblies. Prokka primarily annotates from the UniProt database with a BLAST+ search. My question is: what thresholds or cutoffs are applied when BLAST+ hits are included in the final annotation? I checked the documentation and it only mentions the E-value being used as a cutoff. Are there also cutoffs for percent identity and coverage? If the E-value is the only cutoff, then a gene with, say, 50% identity to a gene in the UniProt database could be incorrectly annotated as that gene in the output.

BLAST Prokka Annotation • 747 views

updated 23 months ago by Mensur Dlakic ★ 27k • written 23 months ago by braun_tube ▴ 30

score 1 · Answer 1 · 2022-05-16

You may have to look directly into the script to find out how this is done, as I don't recall ever reading details of this procedure. I would suspect that if Prokka uses percent identity or coverage at all, the thresholds are loose.

While your concern is legitimate, I'd like to think that Torsten (Seemann, Prokka's author) knows what he's doing when it comes to annotations. Besides, as with any automated annotation procedure, I don't think the goal is to identify only narrow family members or exact orthologs. The subsequent annotation by HMMs would never allow it, because neither Pfam nor TIGRFAMs HMMs are sufficiently tuned to identify only family members. Many of their HMMs identify superfamily members despite the professed goals and the Pfam name (Protein FAMilies). Some time ago I contributed an HMM to Pfam that explicitly identifies superfamily members, and they are still using it as such. The annotation you'd get from a match to that HMM is "Endonuclease/Exonuclease/phosphatase family", even though those are 3 distinct functionalities and can't possibly be members of the same protein family. General functionality aside, there are at least 4 different groups of substrates for those metalloenzymes, and probably a dozen subgroups if you break it down by the exact substrate.

If percent identity or coverage were used strictly for annotations (and not just with Prokka), many incomplete proteins from metagenomes would fail the coverage criterion, and many proteins from viruses, thermophiles or other organisms with exotic protein composition would fail the percent identity criterion. That would leave many proteins unannotated, even though they are realistically of known function.
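To make that trade-off concrete, here is a minimal Python sketch that filters tabular BLAST+ hits first by E-value alone and then with added identity and coverage cutoffs. This is not Prokka's actual code. The column layout assumes a custom `-outfmt "6 qseqid sseqid pident evalue qstart qend qlen sstart send slen"`, the threshold values (1e-6, 50%, 80%) are illustrative only, and the three hit lines and their names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    subject: str
    pident: float  # percent identity over the aligned region
    evalue: float
    qstart: int
    qend: int
    qlen: int
    sstart: int
    send: int
    slen: int

    @property
    def subject_cov(self) -> float:
        """Percent of the database (subject) protein covered by the alignment."""
        return 100.0 * (abs(self.send - self.sstart) + 1) / self.slen


def parse_hit(line: str) -> Hit:
    """Parse one tab-separated line in the assumed 10-column layout."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[0], f[1], float(f[2]), float(f[3]), *(int(x) for x in f[4:10]))


def passes(hit: Hit, max_evalue: float = 1e-6,
           min_pident: float = 0.0, min_subject_cov: float = 0.0) -> bool:
    """Return True if the hit satisfies every cutoff (illustrative defaults)."""
    return (hit.evalue <= max_evalue
            and hit.pident >= min_pident
            and hit.subject_cov >= min_subject_cov)


# Made-up hits: a full-length match, a truncated (metagenome-like) protein,
# and a divergent homolog. "ref_prot" is a hypothetical database entry.
EXAMPLE = [
    "full_len\tref_prot\t92.0\t1e-150\t1\t300\t300\t1\t300\t300",
    "fragment\tref_prot\t95.0\t1e-60\t1\t120\t120\t101\t220\t300",
    "divergent\tref_prot\t38.0\t1e-20\t1\t295\t300\t3\t298\t300",
]

hits = [parse_hit(line) for line in EXAMPLE]

# E-value only: all three queries would receive the annotation.
evalue_only = [h.query for h in hits if passes(h)]

# Strict identity and coverage: the fragment and the divergent homolog drop out.
strict = [h.query for h in hits
          if passes(h, min_pident=50.0, min_subject_cov=80.0)]

print("E-value only:", evalue_only)
print("Strict:", strict)
```

With these toy numbers, the truncated protein covers only 40% of the reference and the divergent homolog is 38% identical, so the strict filter leaves both unannotated even though their E-values are far below the cutoff.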

It may help to think about it this way: if a protein is a lipid methyltransferase but is instead annotated as a nucleotide methyltransferase, is that really a very wrong annotation? Do we prefer annotations that are generally correct even if unreliable in fine details, or would it be better for that protein to remain unannotated? Ideally we'd like the annotations to be correct in general and in details, but that is not realistic at this point as so many protein (super)families have not been experimentally characterized.