I am studying the topic of creating pan-genomes. Specifically, I am working on eukaryotes, but I think my question also applies to prokaryotes.
After assembling and annotating all genomes, one usually needs to cluster all predicted gene/protein sequences in order to compare genomes in terms of gene content. In other words, we need to match orthologs from different genomes to each other. As far as I understand, this is usually done with some clustering method, e.g. OrthoMCL, CD-HIT or GET_HOMOLOGUES-EST.
When a cluster only contains 1 (or 0) genes per strain/sample, things are pretty straightforward. However, I couldn't find an explanation of resolving situations where multiple genes from the same strain occur in the same cluster. This happens when paralogs and orthologs are clustered together, and is rather common at least in my data.
My question is how should such clusters be treated? Do we just ignore the fact that they contain paralogs and count them as one gene, and calculate the occupancy as the corresponding number of strains in the cluster as usual? Or maybe some processing of raw clusters should be performed first to avoid paralogs in clusters? If so, can you refer me to some common method? This choice will affect the number of genes in the resulting pan-genome, so it should be made carefully. However, I haven't seen any paper that refers to this issue, so I might be missing something.
Would appreciate a clarification of this matter. Thank you!
There's not a single answer to this question, as it depends on what you're trying to say about the data. Many orthology clustering tools provide an option to split paralogs out of clusters, but whether this matters to you depends on the question you are asking.
If you want to determine a phylogeny from the data, similar to whole genome MLST in prokaryotes, you likely do not want to consider paralogues, so you would perhaps throw out all but the highest scoring example of each in any given cluster. If you don't have much duplication, you could potentially throw out the entire cluster, since you may not be able to say with any certainty which is the 'truly ancestral gene'.
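The two options above can be sketched roughly as follows. This is only an illustration, not the output format of any particular clustering tool: the `clusters` dictionary, the gene IDs and the per-gene `score` (for example, identity to a cluster representative) are all made-up assumptions, and I am reading "highest scoring" as "best-scoring gene per strain within a cluster".

```python
# Hypothetical input: cluster id -> list of (strain, gene_id, score) tuples.
# The score could be, e.g., percent identity to the cluster representative.
clusters = {
    "c1": [("A", "A_001", 98.0), ("B", "B_004", 97.5), ("C", "C_010", 99.1)],
    "c2": [("A", "A_002", 95.0), ("A", "A_007", 88.0), ("B", "B_005", 96.0)],
}


def has_paralogs(members):
    """True if any strain contributes more than one gene to the cluster."""
    strains = [strain for strain, _, _ in members]
    return len(strains) != len(set(strains))


def keep_best_per_strain(members):
    """Option 1: keep only the highest-scoring gene of each strain."""
    best = {}
    for strain, gene, score in members:
        if strain not in best or score > best[strain][1]:
            best[strain] = (gene, score)
    return {strain: gene for strain, (gene, _) in best.items()}


def drop_paralogous_clusters(all_clusters):
    """Option 2: discard every cluster that contains paralogs."""
    return {cid: m for cid, m in all_clusters.items() if not has_paralogs(m)}


best_only = {cid: keep_best_per_strain(m) for cid, m in clusters.items()}
single_copy = drop_paralogous_clusters(clusters)
```

With the toy data, `best_only["c2"]` keeps `A_002` for strain A and drops `A_007`, while `single_copy` contains only `c1`.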
If, however, you care about the total genetic content of a strain, or all of its differences from its neighbour, then you will want to keep the paralogues. I did something recently where I was looking for all of the genetic differences, including duplications, between strains of a species, so paralogues were still of interest in that comparison.
You might, for instance, want to know what the most duplicated gene in your data is, and that might be an interesting question for your specific hypotheses etc, but it might not.
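If you do keep paralogues, one simple way to handle them is to record the copy number per strain alongside the occupancy, so a cluster still counts as one gene family for presence/absence purposes while the duplication information is retained. A minimal sketch, again with made-up cluster contents and hypothetical function names:

```python
from collections import Counter

# Hypothetical input: cluster id -> list of (strain, gene_id) pairs.
clusters = {
    "c1": [("A", "A_001"), ("B", "B_004"), ("C", "C_010")],
    "c2": [("A", "A_002"), ("A", "A_007"), ("A", "A_009"), ("B", "B_005")],
}


def copy_numbers(members):
    """Number of genes each strain contributes to the cluster."""
    return Counter(strain for strain, _ in members)


def occupancy(members):
    """Number of distinct strains in the cluster (paralogs counted once)."""
    return len({strain for strain, _ in members})


def most_duplicated(all_clusters):
    """Cluster id with the highest copy number in any single strain."""
    def max_copies(cid):
        return max(copy_numbers(all_clusters[cid]).values())
    return max(all_clusters, key=max_copies)
```

Here `occupancy(clusters["c2"])` is 2 even though the cluster holds four genes, and `most_duplicated(clusters)` points at `c2` because strain A has three copies.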
I wouldn't focus too much on the specific 'number of genes' in the pangenome, since this is entirely dependent on the number of strains, the stringency of occurrence (i.e. does a gene need to be in 100% of 100 genomes, or 95% of 1000 genomes?), and the identity threshold you used to calculate the clusters in the first place.
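To see how much the stringency of occurrence alone moves the numbers, here is a small sketch that counts "core" clusters at different occupancy cutoffs. The occupancy values are invented toy data for 10 strains and `core_size` is a hypothetical helper, not a function from any pan-genome package.

```python
import math


def core_size(occupancies, n_strains, min_fraction):
    """Count clusters present in at least min_fraction of the strains."""
    # The small epsilon guards against floating-point noise before rounding up.
    needed = math.ceil(min_fraction * n_strains - 1e-9)
    return sum(1 for occ in occupancies if occ >= needed)


# Toy data: number of strains (out of 10) in which each cluster occurs.
occupancies = [10, 10, 9, 9, 5, 1]

strict_core = core_size(occupancies, 10, 1.0)   # present in every strain
relaxed_core = core_size(occupancies, 10, 0.9)  # present in at least 90%
```

On this toy data the strict definition gives 2 core clusters and the 90% definition gives 4, from exactly the same clustering.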