Question

Is Prokka giving me an inaccurate number of genes?

5.4 years ago

zuhaibzulfiqarahmed • 0

Hello,

I'm using Prokka to annotate the Borrelia burgdorferi reference genome (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/685/GCF_000008685.2_ASM868v2).

After running Prokka on the .fna file, I used the following command to get the list of genes:

# Gets the gene names from the Prokka feature table, keeps the part before the first underscore,
# and writes the sorted list to $1_gene_list.txt, where $1 is a command-line argument
awk '$1 == "gene" {print $2}' PROKKA*.tbl | awk -F'_' '{print $1}' | sort > "${1}_gene_list.txt"

I also created a file of unique genes (i.e. got rid of gene duplicates)

uniq "$1""_gene_list.txt" > "$1""_unique_gene_list.txt"

This was easy. The problem appears when I count the number of genes in these files. Using the

wc -l

command on these files, I get 471 genes in the gene list and 421 genes in the unique gene list. But according to this paper (https://www.ncbi.nlm.nih.gov/pubmed/9403685), there should be about 850 genes on the chromosome and 430 genes on the plasmids, so roughly 1,300 genes in total. Prokka is giving me only 471.

I'm not sure if I ran Prokka incorrectly, or if I messed up my counting somehow. Could someone please explain to me why Prokka is giving me so few genes?

prokka annotation Assembly software error gene • 2.0k views

Written 5.4 years ago by zuhaibzulfiqarahmed; updated 5.4 years ago by h.mon

Are you trying to learn how to use Prokka? This is a known, annotated genome, and a GFF file is available for it. Compare the list you got with what is in the GFF file to see where the differences are.

Reply by GenoMax, 5.4 years ago
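
One possible way to do that comparison, as a sketch only. It assumes a decompressed NCBI GFF3 file in which "gene" features carry a gene= attribute in column 9; genes that only have a locus tag are skipped. The function names gff_gene_names and compare_gene_lists are made up for this example.

```shell
# gff_gene_names FILE: print the sorted, unique gene= values of "gene" features
gff_gene_names() {
    awk -F'\t' '$3 == "gene" {
        n = split($9, attr, ";")
        for (i = 1; i <= n; i++)
            if (attr[i] ~ /^gene=/) print substr(attr[i], 6)
    }' "$1" | LC_ALL=C sort -u
}

# compare_gene_lists PROKKA_LIST REFERENCE_LIST: count shared and list-specific names
compare_gene_lists() {
    local a b
    a=$(mktemp)
    b=$(mktemp)
    # comm needs both inputs sorted the same way, so re-sort them here
    LC_ALL=C sort -u "$1" > "$a"
    LC_ALL=C sort -u "$2" > "$b"
    printf 'shared\t%s\n' "$(LC_ALL=C comm -12 "$a" "$b" | wc -l | tr -d ' ')"
    printf 'prokka_only\t%s\n' "$(LC_ALL=C comm -23 "$a" "$b" | wc -l | tr -d ' ')"
    printf 'reference_only\t%s\n' "$(LC_ALL=C comm -13 "$a" "$b" | wc -l | tr -d ' ')"
    rm -f "$a" "$b"
}
```

Usage would be along the lines of gff_gene_names genomic.gff > reference_gene_list.txt followed by compare_gene_lists "${1}_unique_gene_list.txt" reference_gene_list.txt, where genomic.gff is a placeholder for whatever you named the downloaded GFF file.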

score 0 · Answer 1 · 2018-11-24

You are counting only the genes that were assigned a gene name, not all predicted genes. To see how many features Prokka predicted, run:

grep "Found" PROKKA*.log

For a recent annotation I did:

[04:11:31] Found 102 tRNAs
[04:12:02] Found 9 rRNAs
[04:43:26] Found 171 ncRNAs.
[04:43:28] Found 0 CRISPRs
[04:43:36] Found 5266 CDS
[04:46:30] Found 541 signal peptides
[04:50:25] Found 1573 unique /gene codes.

What you are counting is what Prokka calls "unique /gene codes", that is, genes that were assigned a gene name (1573 in my case), not all predicted protein-coding genes (5266 CDS in my case).
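
To check this on your own output without the log, here is a rough sketch that tallies the feature types in the Prokka .tbl file and, separately, counts the CDS features that carry a gene qualifier. The function name count_tbl_features is made up for this example, and the parsing assumes the usual tab-delimited feature table layout (feature lines as start, end, type; qualifier lines indented by three tabs).

```shell
# count_tbl_features FILE...: tally feature types and gene-named CDS in .tbl files
count_tbl_features() {
    awk -F'\t' '
        # Feature lines look like: start<TAB>end<TAB>type
        $1 != "" && $1 !~ /^>/ && NF >= 3 { type = $3; n[type]++ }
        # Qualifier lines look like: <TAB><TAB><TAB>key<TAB>value
        $1 == "" && $4 == "gene" && type == "CDS" { named++ }
        END {
            for (t in n) print t "\t" n[t]
            print "CDS_with_gene_name\t" (named + 0)
        }
    ' "$@" | LC_ALL=C sort
}
```

Running count_tbl_features PROKKA*.tbl should print one line per feature type plus a CDS_with_gene_name line. If the CDS count is close to what the paper reports while CDS_with_gene_name is much smaller, the gap comes from unnamed (for example hypothetical) proteins rather than from missed gene predictions.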

Edit: if you pass Prokka a "trusted" FASTA file of proteins, it will be used as the first source of annotation:

--proteins [X]    Fasta file of trusted proteins to first annotate from (default '')
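
For example, a call might look like the sketch below. --proteins, --outdir and --prefix are existing Prokka options; the wrapper function annotate_with_trusted, the PROKKA_BIN variable and the file names in the usage note are placeholders invented for this example.

```shell
# annotate_with_trusted GENOME_FASTA TRUSTED_PROTEINS_FASTA OUTDIR
# PROKKA_BIN (hypothetical) lets you point at a specific Prokka install.
annotate_with_trusted() {
    local genome="$1" trusted="$2" outdir="$3"
    "${PROKKA_BIN:-prokka}" --proteins "$trusted" --outdir "$outdir" --prefix PROKKA "$genome"
}
```

You would call it as annotate_with_trusted genome.fna trusted_proteins.faa prokka_out, where trusted_proteins.faa could be, for instance, the protein FASTA of a curated annotation of the same species.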