Question

Bacterial Pangenome Analysis

0

Entering edit mode

22 months ago

kirankumareripogu ▴ 10

Hello,

Current Scenario:

I am currently engrossed in the analysis of gram-positive pathogenic bacterial genomes, having meticulously selected approximately 300 genomes for my research. To annotate these genomes, I employed Prokka, which yielded intriguing results, indicating a gene count ranging from 6,000 to 7,000 genes per genome. Subsequently, I harnessed the .GFF files generated by Prokka as inputs for the Pangenome analysis using Roary.

Anticipated Outcomes:

Given the range of 6,000 to 7,000 genes per genome, the total gene count would theoretically amount to:

Total Genes = 6,000 (assuming the minimum value) * 300 (genomes) = 1,800,000 genes

It's noteworthy that existing literature suggests an open pan-genome structure within the bacterial group under study, wherein core genes typically comprise a modest 0.3% to 0.5% of the total gene pool.

Findings So Far:

In an endeavor to unveil the Pangenome's secrets, I conducted Roary analyses employing various BlastP identity thresholds (i=70/80/85/90). However, the total gene count derived from Roary analysis exhibited a perplexing range, fluctuating from 250,000 to 470,000 genes. This range signifies that core genes only account for a mere 0.1% to 0.2% of the entire gene pool.

Naturally, this disparity has left me pondering the root cause of the discrepancy. Upon scrutinizing the results obtained from Prokka, I couldn't discern any apparent issues.

Queries Arise:

Should I consider employing an alternative program for Pangenome analysis to potentially rectify these unexpected outcomes?
Given that Roary is widely recommended in the literature for bacterial Pangenome analysis, are there specific adjustments or configurations within the Roary program that I could explore to align with my expectations?
Could it be possible that my extensive dataset of genomes is contributing to these unexpected results?

I am immensely appreciative of any insights or guidance that anyone can offer to help me address these questions and navigate this intriguing challenge. Thank you in advance for your assistance.

Prokka Pangenome bacteria Roary • 1.6k views

ADD COMMENT • link 21 months ago by kirankumareripogu ▴ 10

score 0 · Answer 1 · 2023-09-26

0

Entering edit mode

22 months ago

liorglic ★ 1.5k

I am not sure I understand why you think there is a problem with your analysis. It makes perfect sense that different blastp cutoffs produce (very) different results. It generally sounds like your results are in line with previous literature with an open pan-genome. You could also look at the gene occupancy and genome stepwise addition plots produced by Roary (I think) and see if they make sense.
Having said that, you can still check your data. For example, maybe some of your genomes contain contaminations? Maybe some of the analyzed genomes do not belong to the group of interest? I would try to detect outlier genomes, e.g. those containing particularly high numbers of genes or a high proportion of rare genes. You could also create a PCA plot based on gene presence/absence to see how your genomes cluster.

ADD COMMENT • link 22 months ago by liorglic ★ 1.5k

0

Entering edit mode

Hi @liorglic,

Thank you for your prompt response.

I totally agree with you that different blastp cuttoffs provide different results and I do observe an open pan-genome structure, however I am concerned about the total genes which ranges from 250,000 to 470,000 genes. Based on the annotation provided by prokka, the total genes in Roary should be close to 1,800,000 genes. And the total genes I derived from Roary is not even to close to that. Isn't that concerning while I try to publish my data in the journal? And also the number of core genes are just 0.1% to 0.2% which is far away from the published literarture (i.e. 0.3% to 0.5%).

ADD REPLY • link 22 months ago by kirankumareripogu ▴ 10

0

Entering edit mode

Oh, so you are concerned that you have too few genes... In this case, I think you are misunderstanding the concept of genes in a pan genome. Let's say you choose a blastp cutoff of 90%. You align gene1 from genome A to gene2 from genome B, and you find 95% identity. In such cases, gene1 and gene2 will be defined as the same (pan-)gene in the pan-genome, so this gene will be counted only once. So it is always expected that the pan-genome will contain a number of (pan-)genes lower than the total number of genes in all genomes. It would be very weird if you end up with 6000*300=1.8M genes.
Regarding the % of core genes - this depends on your definition for core genes. Are only genes present in 100% of the genomes considered core? If so, than it is not surprising that you get lower percentage compared to studies that used less genomes. You'll get a higher percent if you define core genes e.g. as those present in 95% of the genomes.
I hope this clarifies things a bit.

ADD REPLY • link 22 months ago by liorglic ★ 1.5k

0

Entering edit mode

Hi @liorglic,

Thank you so much for your response. This provides a clarification to most of my doubts and questions regarding pangenome analysis and preferably via Roary. With regards to the core genes, will set it 95%.

So just to confirm one last time, When I use the BlastP identity thresholds (i=70/80/85/90) and get the total gene count derived from Roary ranging from 250,000 to 470,000 genes, that means that I am at the right track and I do not need to be concerned about those 1.8M genes?

ADD REPLY • link 22 months ago by kirankumareripogu ▴ 10

0

Entering edit mode

Yes, you are correct. If anything, I would be worried that 250k genes is too high. If you have around 6k genes per sample, and you end up with 250k, it means that the overlap between genomes is very low. This could be a biological phenomenon, but you need to validate it. You'll need to take a deeper dive intro the results and see if things make sense.

ADD REPLY • link 21 months ago by liorglic ★ 1.5k

0

Entering edit mode

And here I was wondering that 250K genes is too low. Bacterial Pangenomics is indeed complicated topic to deal with. Sure, will take a look into the results and will do the analysis accordingly.

Thank you so much for all your suggestions and feedback.

ADD REPLY • link 21 months ago by kirankumareripogu ▴ 10