How to handle paralogs in pan-genomes?
15 months ago
liorglic ▴ 340

Hello,
I am studying the topic of creating pan genomes. Specifically, I am working on eukaryotes, but I think my question also applies to prokaryotes.
After assembling and annotating all genomes, one usually needs to cluster all predicted gene/protein sequences in order to compare genomes in terms of gene content. In other words, we need to match orthologs from different genomes to each other. As far as I understand, this is usually done with some clustering method, e.g. OrthoMCL, CD-HIT or GET_HOMOLOGUES-EST.
When a cluster only contains 1 (or 0) genes per strain/sample, things are pretty straightforward. However, I couldn't find an explanation of resolving situations where multiple genes from the same strain occur in the same cluster. This happens when paralogs and orthologs are clustered together, and is rather common at least in my data.
My question is how should such clusters be treated? Do we just ignore the fact that they contain paralogs and count them as one gene, and calculate the occupancy as the corresponding number of strains in the cluster as usual? Or maybe some processing of raw clusters should be performed first to avoid paralogs in clusters? If so, can you refer me to some common method? This choice will affect the number of genes in the resulting pan-genome, so it should be made carefully. However, I haven't seen any paper that refers to this issue, so I might be missing something.
Would appreciate a clarification of this matter. Thank you!
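
To make the two counting options concrete, here is a minimal Python sketch. The cluster contents, strain names and gene IDs are all made up for illustration; real clustering tools each have their own output format that would need parsing first.

```python
from collections import Counter

# Hypothetical cluster: each member is (strain, gene_id).
cluster = [
    ("strainA", "A_0001"),
    ("strainA", "A_0457"),  # second copy from the same strain (putative paralog)
    ("strainB", "B_0033"),
    ("strainC", "C_0912"),
]

def occupancy(members):
    """Number of distinct strains represented in the cluster."""
    return len({strain for strain, _ in members})

def copies_per_strain(members):
    """Number of gene copies each strain contributes to the cluster."""
    return Counter(strain for strain, _ in members)

n_strains = occupancy(cluster)
copies = copies_per_strain(cluster)
has_paralogs = any(n > 1 for n in copies.values())
print(n_strains, dict(copies), has_paralogs)
```

Counting by occupancy treats the cluster as one gene present in three strains, while the per-strain copy counts show that four genes went in; the difference between the two is exactly the information that is at stake.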

orthologs paralogs pan-genome • 499 views
15 months ago
Joe 19k

There's not a single answer to this question, as it depends on what you're trying to say about the data. Many orthology clustering tools provide an option to split paralogs out of clusters, but whether this is important to you or not depends on the question.

If you want to determine a phylogeny from the data, similar to whole genome MLST in prokaryotes, you likely do not want to consider paralogues, so you would perhaps throw out all but the highest scoring example of each in any given cluster. If you don't have much duplication, you could potentially throw out the entire cluster, since you may not be able to say with any certainty which is the 'truly ancestral gene'.
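
As a rough sketch of those two options in Python (the tuple layout and the `score` field are assumptions for illustration, e.g. a bit score against the cluster representative; this is not the output of any particular tool):

```python
# Hypothetical cluster members: (strain, gene_id, score).
cluster = [
    ("strainA", "A_0001", 512.0),
    ("strainA", "A_0457", 388.5),  # lower-scoring copy from the same strain
    ("strainB", "B_0033", 505.0),
    ("strainC", "C_0912", 498.2),
]

def keep_best_per_strain(members):
    """Option 1: keep only the highest-scoring gene from each strain."""
    best = {}
    for strain, gene, score in members:
        if strain not in best or score > best[strain][1]:
            best[strain] = (gene, score)
    return {strain: gene for strain, (gene, _) in best.items()}

def is_single_copy(members):
    """Option 2 helper: True if no strain contributes more than one gene."""
    strains = [strain for strain, _, _ in members]
    return len(strains) == len(set(strains))

representatives = keep_best_per_strain(cluster)
usable_as_is = is_single_copy(cluster)
print(representatives, usable_as_is)
```

With option 2 you would simply skip every cluster for which `is_single_copy` returns `False` before building the alignment.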

If, however, you care about the total genetic content of a strain, or all of its differences from its neighbour, you would want to keep the paralogues. I did something recently where I was looking for all of the genetic differences, including duplications etc., between strains of a species, so paralogues were still of interest in that comparison.

You might, for instance, want to know what the most duplicated gene in your data is, and that might be an interesting question for your specific hypotheses etc, but it might not.
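
If you do want that, one simple way to rank clusters by duplication is sketched below in Python. The cluster names and memberships are invented, and "extra copies beyond one per strain" is just one possible definition of "most duplicated".

```python
from collections import Counter

# Hypothetical clusters: cluster name -> strain of each member gene.
clusters = {
    "cluster_1": ["strainA", "strainB", "strainC"],
    "cluster_2": ["strainA", "strainA", "strainA", "strainB"],
    "cluster_3": ["strainB", "strainB", "strainC"],
}

def extra_copies(strains):
    """Total number of copies beyond the first one in each strain."""
    counts = Counter(strains)
    return sum(n - 1 for n in counts.values())

# Rank clusters from most to least duplicated.
ranked = sorted(clusters, key=lambda name: extra_copies(clusters[name]), reverse=True)
print(ranked)
```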

I wouldn't focus too much on the specific 'number of genes' in the pangenome, since this is entirely dependent on the number of strains, the stringency of occurrence (i.e. does it need to be in 100% of 100 genomes, or 95% of 1000 genomes?), and the identity threshold you used to calculate the clusters in the first place.
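
To illustrate how much the occurrence threshold alone moves the numbers, here is a small Python sketch with a made-up presence table (four hypothetical strains, three hypothetical clusters):

```python
# Hypothetical presence table: cluster name -> set of strains it occurs in.
presence = {
    "cluster_1": {"strainA", "strainB", "strainC", "strainD"},
    "cluster_2": {"strainA", "strainB", "strainC"},
    "cluster_3": {"strainD"},
}
N_STRAINS = 4

def core_clusters(presence, n_strains, min_fraction):
    """Clusters present in at least min_fraction of all strains."""
    return sorted(
        name
        for name, strains in presence.items()
        if len(strains) / n_strains >= min_fraction
    )

strict_core = core_clusters(presence, N_STRAINS, 1.0)    # must be in every strain
relaxed_core = core_clusters(presence, N_STRAINS, 0.75)  # "soft core"
print(len(strict_core), len(relaxed_core))
```

The same data gives a different "core genome size" depending purely on `min_fraction`, before paralogs even enter the picture.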

Thanks for the interesting answer. Can you suggest a tool that splits paralogs out of clusters? I agree that the number of genes in a pan-genome is not particularly meaningful, but I would still argue that leaving paralogs within clusters will give somewhat "incorrect" results, since paralogs will be treated as the same gene, leading to loss of information in the final pan-genome.

You will lose some information by doing that, yes, but whether that matters depends on what the downstream analysis is going to be. Core/accessory genome phylogenetics is about the only downstream analysis I can think of where you would probably want to ensure there are no paralogues.

It's been a long time since I used it, but OrthoMCL might have that option IIRC. I'm pretty sure my current go-to tool, Roary, has an option for it, but it's prokaryote-specific.

Thanks again. Indeed, it looks like Roary directly tackles this issue and tries to solve it using Conserved Gene Neighborhoods (CGN, known in eukaryotes as gene synteny). As for OrthoMCL, it does treat paralogs and orthologs differently, but as far as I could tell, there is no obvious option to force OrthoMCL to keep only one gene per species (strain) in a cluster.
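
For anyone curious what splitting by gene neighbourhood looks like in principle, below is a toy Python sketch. It is only an illustration of the idea and not Roary's actual implementation: the gene orders and cluster IDs are invented, and it requires the immediate left and right neighbours to match exactly, which is far stricter than a real tool would be.

```python
from collections import defaultdict

# Hypothetical gene order per strain: (gene_id, cluster_id) along one contig.
gene_order = {
    "strainA": [("A1", "c10"), ("A2", "c1"), ("A3", "c11"),
                ("A4", "c20"), ("A5", "c1"), ("A6", "c21")],
    "strainB": [("B1", "c10"), ("B2", "c1"), ("B3", "c11")],
    "strainC": [("C1", "c20"), ("C2", "c1"), ("C3", "c21")],
}

def split_by_neighbourhood(gene_order, target):
    """Split cluster `target` into subgroups whose members share flanking clusters."""
    groups = defaultdict(list)
    for strain, genes in gene_order.items():
        for i, (gene, cluster) in enumerate(genes):
            if cluster != target:
                continue
            left = genes[i - 1][1] if i > 0 else None
            right = genes[i + 1][1] if i + 1 < len(genes) else None
            # frozenset ignores orientation, so a flipped region still matches
            groups[frozenset((left, right))].append((strain, gene))
    return list(groups.values())

subclusters = split_by_neighbourhood(gene_order, "c1")
print(subclusters)
```

Here the two `c1` copies in `strainA` sit in different neighbourhoods, so they end up in separate subclusters, each with one gene per strain.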