I am analyzing Gram-positive pathogenic bacterial genomes and have selected approximately 300 genomes for my study. I annotated them with Prokka, which reported 6,000 to 7,000 genes per genome. I then used the GFF files generated by Prokka as input for a pangenome analysis with Roary.
Given 6,000 to 7,000 genes per genome, the total number of annotated genes across the dataset would be at least:
Total genes = 6,000 (minimum per genome) × 300 genomes = 1,800,000 genes
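For reference, here is the same arithmetic as a small Python sketch, covering both ends of the range. The names `N_GENOMES`, `GENES_PER_GENOME`, and `total_annotated_genes` are just illustrative. Note that this is a plain sum of annotated genes with no clustering, so it is an upper bound on the number of gene clusters a tool like Roary could report, not a prediction of it.

```python
# Back-of-the-envelope totals for the dataset described above.
N_GENOMES = 300
GENES_PER_GENOME = (6_000, 7_000)  # Prokka range reported in this post


def total_annotated_genes(n_genomes, genes_per_genome):
    """Sum of annotated genes across all genomes (no clustering applied)."""
    return n_genomes * genes_per_genome


low = total_annotated_genes(N_GENOMES, GENES_PER_GENOME[0])
high = total_annotated_genes(N_GENOMES, GENES_PER_GENOME[1])
print(f"{low:,} to {high:,} annotated genes in total")
# prints: 1,800,000 to 2,100,000 annotated genes in total
```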
The literature suggests an open pangenome for this bacterial group, with core genes typically making up only 0.3% to 0.5% of the total gene pool.
Findings So Far:
I ran Roary with several BLASTP identity thresholds (-i 70, 80, 85, and 90). The total gene count reported by Roary ranged from 250,000 to 470,000 genes across these runs, which means core genes account for only 0.1% to 0.2% of the entire gene pool.
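In case it helps to see how such a percentage can be recomputed, below is a minimal Python sketch that counts core clusters from Roary's presence/absence table. It assumes the table is `gene_presence_absence.csv` with a `No. isolates` column, and that "core" means present in at least 99% of isolates (Roary's default core definition, as far as I know). The function name `core_fraction` is my own, not part of Roary.

```python
import csv
import io


def core_fraction(csv_text, n_genomes, core_percent=99):
    """Return (core_clusters, total_clusters, percent_core).

    A cluster counts as core when it is present in at least
    `core_percent` percent of the `n_genomes` isolates.
    Assumes a "No. isolates" column, as in Roary's presence/absence table.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0
    core = 0
    for row in reader:
        total += 1
        # Integer comparison avoids floating-point rounding at the threshold.
        if int(row["No. isolates"]) * 100 >= core_percent * n_genomes:
            core += 1
    percent = 100.0 * core / total if total else 0.0
    return core, total, percent


# Tiny made-up table, only to show the call; these are not real results.
example = '''"Gene","No. isolates"
"dnaA","300"
"gyrB","298"
"hypA","12"
"hypB","1"
'''
core, total, pct = core_fraction(example, n_genomes=300)
print(core, total, pct)  # prints: 2 4 50.0
```

On real output, the text of the CSV file would be read from disk and passed in place of `example`, with `n_genomes=300`.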
I am unsure what is causing this discrepancy. I checked the Prokka output and could not find any obvious problems.
- Should I use a different program for the pangenome analysis to see whether it resolves these unexpected results?
- Roary is widely recommended in the literature for bacterial pangenome analysis. Are there specific settings in Roary that I could adjust to bring the results closer to what I expect?
- Could the large number of genomes in my dataset be contributing to these unexpected results?
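To make the second question concrete, here is a rough Python sketch of how I assemble the Roary command line for each run. As I understand the options, `-i` is the minimum BLASTP percentage identity, `-cd` the percentage of isolates a gene must be in to be core, `-s` disables paralog splitting, `-p` sets the thread count, and `-f` the output directory. The helper `roary_command`, the default values, and the file names are placeholders for illustration.

```python
import shlex


def roary_command(gff_files, identity=95, core_percent=99, threads=8,
                  out_dir="roary_out", split_paralogs=True):
    """Build a Roary argument list; it does not run anything."""
    cmd = ["roary",
           "-f", out_dir,            # output directory
           "-p", str(threads),       # number of threads
           "-i", str(identity),      # minimum BLASTP percentage identity
           "-cd", str(core_percent)]  # % of isolates required for core
    if not split_paralogs:
        cmd.append("-s")             # do not split paralogs
    cmd.extend(gff_files)
    return cmd


# Placeholder file names; a real run would list all ~300 GFF files.
cmd = roary_command(["a.gff", "b.gff"], identity=90)
print(shlex.join(cmd))
# prints: roary -f roary_out -p 8 -i 90 -cd 99 a.gff b.gff
```

The returned list could be passed to `subprocess.run(cmd, check=True)` to launch the analysis, one call per identity threshold.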
I would appreciate any insights or guidance on these questions. Thank you in advance for your help.