Question: Is Prokka giving me an inaccurate number of genes?
zuhaibzulfiqarahmed wrote, 12 months ago:

Hello,

I'm using Prokka to annotate the Borrelia burgdorferi reference genome (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/685/GCF_000008685.2_ASM868v2).

After running Prokka on the .fna file, I used the following command to get the list of genes:

# Gets the list of gene names (dropping any _1, _2 suffix) and puts them in a file $1_gene_list.txt, where $1 is a command-line argument
awk '$1 == "gene" {print $2}' PROKKA*.tbl | awk -F'_' '{print $1}' | sort > "${1}_gene_list.txt"

I also created a file of unique genes (i.e. got rid of gene duplicates)

uniq "${1}_gene_list.txt" > "${1}_unique_gene_list.txt"

This was easy. The problem shows up when I count the number of genes in these files. Using

wc -l

on each file, I get 471 genes in the gene list and 421 genes in the unique gene list. But according to this paper (https://www.ncbi.nlm.nih.gov/pubmed/9403685), there should be about 850 genes on the chromosome and 430 genes on the plasmids, so around 1300 genes in total. Prokka is only giving me about 470.

I'm not sure if I ran Prokka incorrectly, or if I messed up my counting somehow. Could someone please explain to me why Prokka is giving me so few genes?


Are you trying to learn how to use prokka? This is obviously a known/annotated genome and there is a GFF file available for this genome. Compare the list you got with what is in the GFF file to see where the differences are.
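
One possible way to run that comparison, as a rough sketch rather than a tested recipe. It assumes the reference GFF3 stores gene names in a `gene=` attribute on rows whose type column is "gene", and that the Prokka-derived list is already sorted and de-duplicated; the function names and file names are placeholders.

```shell
# Print the sorted, unique gene names found in a GFF3 file.
# Assumption: names live in a "gene=" attribute (column 9) on "gene" rows (column 3).
gff_gene_names() {
    awk -F'\t' '$3 == "gene" {
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++)
            if (attrs[i] ~ /^gene=/) print substr(attrs[i], 6)
    }' "$1" | sort -u
}

# Print gene names that occur in the reference GFF3 ($1)
# but not in the sorted, unique gene list ($2).
missing_from_prokka() {
    gff_gene_names "$1" | comm -23 - "$2"
}
```

For example, `missing_from_prokka reference.gff mygenome_unique_gene_list.txt` would list the reference gene names that never show up in the Prokka list. Genes that only carry a locus tag in the reference are ignored by this sketch.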

(comment by genomax, 12 months ago)
h.mon (Brazil) wrote, 12 months ago:

You are selecting annotated genes, not predicted genes. To see how many genes Prokka found, run:

grep "Found" PROKKA*.log

For a recent annotation I did:

[04:11:31] Found 102 tRNAs
[04:12:02] Found 9 rRNAs
[04:43:26] Found 171 ncRNAs.
[04:43:28] Found 0 CRISPRs
[04:43:36] Found 5266 CDS
[04:46:30] Found 541 signal peptides
[04:50:25] Found 1573 unique /gene codes.
  

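What you are looking at is what Prokka calls "unique /gene codes", that is, genes that were assigned a gene name (in my case 1573), not predicted protein-coding genes (5266 CDS predicted). A predicted CDS with no named match still appears in the output, but without a gene entry, so filtering the .tbl file on "gene" skips it.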
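
To see that gap directly in the .tbl file, something along these lines may help. It is only a sketch: it assumes the .tbl follows the usual tab-separated feature table layout (feature rows as start, stop, key; qualifier rows indented by three tabs), and the function names are made up for illustration.

```shell
# Count features by key (CDS, tRNA, rRNA, ...) in a feature table.
# Assumption: feature rows look like "start<TAB>stop<TAB>key".
count_features() {
    awk -F'\t' '$1 != "" && $3 != "" {n[$3]++}
                END {for (k in n) print k, n[k]}' "$1" | sort
}

# Count qualifier rows that carry a gene name.
# Assumption: they look like "<TAB><TAB><TAB>gene<TAB>name".
count_gene_qualifiers() {
    awk -F'\t' '$4 == "gene" {c++} END {print c + 0}' "$1"
}
```

Running `count_features` on the Prokka .tbl file should give a CDS count close to the "Found ... CDS" line in the log, while `count_gene_qualifiers` corresponds to what the original awk command was counting.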

edit: if you pass Prokka a "trusted" FASTA file of proteins, it will be used as the first source of annotation:

--proteins [X]    Fasta file of trusted proteins to first annotate from (default '')
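
A minimal sketch of such a run, wrapped in a function so the pieces are easy to swap. The function name, the "annotated" prefix and all three arguments are placeholders; `--proteins`, `--outdir` and `--prefix` are Prokka options.

```shell
# Run Prokka with a trusted protein FASTA as the first annotation source.
# $1: genome FASTA, $2: trusted protein FASTA, $3: output directory (all placeholders).
annotate_with_trusted() {
    prokka --proteins "$2" --outdir "$3" --prefix annotated "$1"
}
```

For example: `annotate_with_trusted genome.fna trusted_proteins.faa prokka_out`.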