Hello,
I'm using Prokka to annotate the Borrelia burgdorferi reference genome (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/685/GCF_000008685.2_ASM868v2).
After running Prokka on the .fna file, I used the following command to get the list of genes:
# Gets the list of genes and writes them to ${1}_gene_list.txt, where $1 is a command-line argument
awk '$1 == "gene" {print $2}' PROKKA*.tbl | awk -F'_' '{print $1}' | sort > "${1}_gene_list.txt"
I also created a file of unique genes (i.e. removed duplicate gene names):
uniq "${1}_gene_list.txt" > "${1}_unique_gene_list.txt"
This was easy. The problem appears when I count the genes in these files. Using wc -l on them, I get 471 genes in the gene list and 421 genes in the unique gene list. But according to this paper (https://www.ncbi.nlm.nih.gov/pubmed/9403685), there should be about 850 genes on the chromosome and 430 genes on the plasmids, so around 1,300 genes in total. Prokka is only giving me about 470.
I'm not sure if I ran Prokka incorrectly, or if I messed up my counting somehow. Could someone please explain to me why Prokka is giving me so few genes?
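One way to sanity-check the counting step is to tally the feature types in the .tbl file directly. This is a minimal sketch and not a definitive diagnosis. As far as I understand the feature-table layout, each feature line carries its type (CDS, tRNA, and so on) in the third tab-separated column, while gene appears as a qualifier line underneath a feature, and only when a gene name was assigned. If that holds, filtering on $1 == "gene" counts named features only. The function name count_feature_types is made up for illustration.

```shell
# Tally feature types in a Prokka-style .tbl file (assumed layout):
#   feature line:   start<TAB>end<TAB>type
#   qualifier line: <TAB><TAB><TAB>key<TAB>value
# Header lines (">Feature ...") and qualifier lines have an empty
# first or third tab-separated field, so they are skipped.
count_feature_types() {
    awk -F'\t' '$1 != "" && $3 != "" { n[$3]++ }
                END { for (t in n) print t "\t" n[t] }' "$@" | sort
}

# Example usage, with the same file pattern as the command above:
# count_feature_types PROKKA*.tbl
```

If the CDS count printed here is much larger than 471, the gap would come from features that have no gene qualifier rather than from Prokka missing genes.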
Are you trying to learn how to use prokka? This is obviously a known, annotated genome, and there is a GFF file available for it. Compare the list you got with what is in the GFF file to see where the differences are.
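The comparison suggested above could be sketched roughly as follows. It assumes the reference annotation is a standard 9-column GFF3 file in which gene rows have "gene" in column 3 and, where a name exists, a gene= attribute in column 9. The function names and the file name reference.gff are placeholders, not anything taken from the original post.

```shell
# Print the sorted, de-duplicated gene= names found on "gene" rows of a GFF3 file.
gff_gene_names() {
    awk -F'\t' '$3 == "gene" {
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++)
            if (attrs[i] ~ /^gene=/) print substr(attrs[i], 6)
    }' "$@" | sort -u
}

# Count every "gene" row, whether or not it carries a gene= name.
gff_gene_count() {
    awk -F'\t' '$3 == "gene" { c++ } END { print c + 0 }' "$@"
}

# Example usage (reference.gff is a placeholder for the downloaded GFF file):
# gff_gene_count reference.gff
# gff_gene_names reference.gff > ref_gene_names.txt
# comm -3 ref_gene_names.txt "${1}_unique_gene_list.txt"   # names present in only one list
```

Note that comm expects both inputs to be sorted the same way, and that gene rows without a gene= attribute show up in the total count but not in the name list.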