Question: Annotation with Prokka - small ORFs and genus-specific DB?
2.1 years ago, predeus wrote:

Hello everybody,

I've got a couple of questions using Prokka.

1) Has anybody come across the problem of annotating small ORFs? Lots of operon leader peptides etc. remain unannotated. I understand that's to reduce false positives, but I would still like to annotate these genes.

2) How does one compile a good genus-specific database? E.g. if I need a reference protein set for Salmonella, what is a good strategy?

Thank you in advance.

Tags: prokka, bacteria, annotation
2.1 years ago, Joe (United Kingdom) wrote:

If leader peptides etc. aren't commonly seen as specific, separate ORFs, I doubt they'd be annotated separately from their 'parent' ORF, though I see it supports a --sig_peptide option these days.

As for the databases, prokka supports a custom protein database, and for that you can follow the instructions here (

Give the --sig_peptide flag a try, curate a selection of your own sequences of interest (from genomes you trust), and follow:

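 # Extract the protein sequences from the trusted GenBank files into one FASTA file
 prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk > Coccus.faa
 # Cluster at 90% identity to remove redundancy (-T 0: all CPUs, -M 0: no memory limit)
 cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
 # Remove the intermediate files, keeping the clustered set in "Coccus"
 rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
 # Build a protein BLAST database and move it into Prokka's genus database directory
 makeblastdb -dbtype prot -in Coccus
 mv Coccus.p* /path/to/prokka/db/genus/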
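
Once the database is installed, it still has to be selected at annotation time. A minimal sketch, assuming Prokka's `--usegenus` and `--genus` options and a genus name that matches the database file name (`Coccus`); the function name, `contigs.fa` and the output directory are hypothetical placeholders:

```shell
# Annotate a set of contigs using a genus-specific BLAST database.
# Arguments: genus name, contigs FASTA, output directory (all illustrative).
annotate_with_genus() {
    genus="$1"
    contigs="$2"
    outdir="$3"
    # --usegenus tells Prokka to search the genus database named by --genus
    prokka --usegenus --genus "$genus" --outdir "$outdir" --prefix "$genus" "$contigs"
}

# Example call (hypothetical file names):
# annotate_with_genus Coccus contigs.fa coccus_annotation
```

The genus passed here should match the name used for the files moved into the genus directory above, otherwise Prokka will not find the custom database.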

Thank you. I see now that some of the small peptides are annotated with the --rfam option, which generates candidate ncRNAs and is also useful. What is the option to include the sig_peptide? There's nothing in the manual, and they are not generated by default.

About the genus-specific reference: how would you pick the gbk files you want to use? And is there any way to generate the gene name (I mean common name, like trpA) in any reliable fashion?

Reply by predeus, 2.1 years ago

Yeah, Prokka invokes a number of optional third-party applications, and SignalP is one of them. I can't see the specific flag in the docs, but the GitHub page mentions it. You'll no doubt need SignalP installed and on your PATH either way.

I would just use whichever genomes you trust as a reference and download the GBK files from NCBI; I can't really tell you which reference to use. The option mainly exists so that people with their own custom-annotated genomes can include features they have added by hand relative to the NCBI reference. You don't need to do this at all if you don't have such a genome. Prokka calls CDSs with Prodigal and then searches them all against its existing databases, so you should already get the common Salmonella annotations. If a gene has a common gene name in NCBI, Prokka will pick it up, assuming the variant in your sequence is similar enough to it. Anything Prokka can't identify is called a hypothetical protein.

Reply by Joe, 2.1 years ago
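
To see how many CDSs ended up with a common gene name versus a "hypothetical protein" product, the GFF output can be tallied. A rough sketch, assuming the usual Prokka GFF3 layout (feature type in column 3, `gene=` and `product=` keys in column 9); the function name is made up for illustration:

```shell
# Summarise CDS annotations in a Prokka GFF file.
# Prints the total CDS count, how many are hypothetical, and how many carry a gene name.
summarise_cds() {
    awk -F'\t' '
        $3 == "CDS" {
            total++
            if ($9 ~ /product=hypothetical protein/) hypothetical++
            if ($9 ~ /(^|;)gene=/) named++
        }
        END {
            printf "CDS=%d hypothetical=%d with_gene_name=%d\n", total, hypothetical, named
        }
    ' "$1"
}

# Example call (hypothetical path):
# summarise_cds annotation_out/sample.gff
```

Comparing these numbers before and after adding a custom database gives a quick indication of whether the extra reference proteins made any difference.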

OK, I had to grep through the source code to understand it. SignalP is activated when you use the --gram option. I don't think it's documented anywhere. Anyhow, it seems to be working nicely.

Thank you for all the tips again.

Reply by predeus, 2.1 years ago
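
Putting the two findings from this thread together, a run with `--gram` (to trigger SignalP) and `--rfam` (for ncRNA candidates) might look like the sketch below. It assumes a Gram-negative organism such as Salmonella, so `neg` is passed; the function names, the `sample` prefix and the file names are illustrative, and the warning text reflects the expectation that Prokka skips SignalP when it is not installed:

```shell
# Run Prokka with SignalP (via --gram) and Rfam searching enabled.
# Arguments: contigs FASTA, output directory (both illustrative).
annotate_with_signalp() {
    contigs="$1"
    outdir="$2"
    # SignalP is an optional third-party tool and must be on PATH to be used
    if ! command -v signalp >/dev/null 2>&1; then
        echo "warning: signalp not found on PATH; expect no sig_peptide features" >&2
    fi
    prokka --gram neg --rfam --outdir "$outdir" --prefix sample "$contigs"
}

# Count sig_peptide features (GFF column 3) in the resulting annotation.
count_sig_peptides() {
    awk -F'\t' '$3 == "sig_peptide" { n++ } END { print n + 0 }' "$1"
}

# Example calls (hypothetical paths):
# annotate_with_signalp contigs.fa annotation_out
# count_sig_peptides annotation_out/sample.gff
```

A count of zero after a run with `--gram` is a hint that SignalP was not found or not supported, which is worth checking in the Prokka log.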

Ah good you found the same, I was just about to post the same point!

P.S. be sure to accept one or more answers if you got the answer you needed.

Reply by Joe, 2.1 years ago
2.1 years ago, Asaf wrote:

Some thoughts:

  1. You can try predicting all ORFs, for instance with EMBOSS transeq, and then look for domains using InterProScan; you might find some putative short ORFs that way.
  2. I guess you could download assemblies of many Salmonella genomes and extract the genes, but since most of them were predicted using Prokka or similar tools, that won't help you. Instead, you can download some well-studied Salmonella genomes from NCBI or the UCSC Genome Browser. E. coli is very similar to Salmonella and its genome is well annotated, so it might also be useful.
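
The first suggestion could be sketched as follows. Note that this uses EMBOSS getorf rather than transeq, since getorf reports individual ORFs and can filter them by size, whereas transeq only translates whole reading frames. The size limits, the function name and the file names are assumptions chosen for illustration:

```shell
# Extract short candidate ORFs from a genome with EMBOSS getorf.
# Arguments: genome FASTA, output protein FASTA (both illustrative).
predict_small_orfs() {
    genome="$1"
    out_faa="$2"
    # -table 11 : bacterial/archaeal genetic code
    # -find 1   : translate regions between START and STOP codons
    # -minsize / -maxsize are in nucleotides (30 to 300 nt is roughly 10 to 100 aa)
    getorf -sequence "$genome" -outseq "$out_faa" \
        -table 11 -find 1 -minsize 30 -maxsize 300
}

# Example call (hypothetical file names):
# predict_small_orfs genome.fa small_orfs.faa
#
# Possible follow-up domain search, assuming InterProScan is installed:
# interproscan.sh -i small_orfs.faa -f tsv -o small_orfs.interpro.tsv
```

Expect a very large number of candidates from such a permissive search; the domain hits are what narrow them down to ORFs worth a closer look.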

Thank you. Even well-annotated references turn out to have quite a lot of mistakes, which makes this strategy harder to use.

Reply by predeus, 2.1 years ago