I am studying the construction of pan-genomes. Specifically, I am working on eukaryotes, but I think my question also applies to prokaryotes.
After assembling and annotating all genomes, one usually needs to cluster all predicted gene/protein sequences in order to compare genomes in terms of gene content. In other words, we need to match orthologs from different genomes to each other. As far as I understand, this is usually done with some clustering method, e.g. OrthoMCL, CD-HIT or GET_HOMOLOGUES-EST.
When a cluster contains only 1 (or 0) genes per strain/sample, things are pretty straightforward. However, I couldn't find an explanation of how to resolve situations where multiple genes from the same strain occur in the same cluster. This happens when paralogs and orthologs are clustered together, and it is rather common, at least in my data.
My question is how should such clusters be treated? Do we just ignore the fact that they contain paralogs and count them as one gene, and calculate the occupancy as the corresponding number of strains in the cluster as usual? Or maybe some processing of raw clusters should be performed first to avoid paralogs in clusters? If so, can you refer me to some common method? This choice will affect the number of genes in the resulting pan-genome, so it should be made carefully. However, I haven't seen any paper that refers to this issue, so I might be missing something.
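To make the first option concrete, here is a minimal sketch of what I mean by counting occupancy per strain rather than per gene, while flagging the clusters that contain same-strain paralogs. The cluster contents, strain names and the `summarize` helper are all made up for illustration; real clustering tools have their own output formats that would need parsing first.

```python
from collections import Counter

# Hypothetical toy clusters: cluster ID -> list of (strain, gene ID) members.
clusters = {
    "c1": [("A", "A_g1"), ("B", "B_g1"), ("C", "C_g1")],  # one gene per strain
    "c2": [("A", "A_g2"), ("A", "A_g3"), ("B", "B_g2")],  # strain A appears twice
    "c3": [("C", "C_g2")],                                # strain-specific gene
}


def summarize(members):
    """Return (occupancy, total genes, strains with more than one gene)."""
    per_strain = Counter(strain for strain, _ in members)
    occupancy = len(per_strain)  # number of distinct strains, paralogs collapsed
    multi_copy = sorted(s for s, n in per_strain.items() if n > 1)
    return occupancy, len(members), multi_copy


summary = {cid: summarize(members) for cid, members in clusters.items()}

for cid, (occupancy, n_genes, multi_copy) in summary.items():
    print(cid, "occupancy:", occupancy, "genes:", n_genes,
          "strains with paralogs:", multi_copy)
```

Under this counting, cluster `c2` has an occupancy of 2 even though it holds 3 genes, so the pan-genome size depends on whether such a cluster is counted once or split further.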
Would appreciate a clarification of this matter. Thank you!