I have a collection of protein orthologous groups output by orthoMCL. I would like to get some idea of the relationship between these OGs. I think one way to approach this is to build a consensus sequence for each OG, then build a phylogenetic tree from an MSA of these consensuses (consensii?).
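To make the consensus step concrete, here is a minimal sketch of a majority-rule consensus, assuming each OG has already been aligned with an MSA tool so that its sequences are equal-length gapped strings. The function name `majority_consensus`, the `min_fraction` threshold, and the toy alignment are all made up for illustration; none of this comes from orthoMCL.

```python
from collections import Counter


def majority_consensus(aligned_seqs, gap="-", min_fraction=0.5, ambiguous="X"):
    """Majority-rule consensus of equal-length, already-aligned protein sequences.

    Hypothetical helper for illustration only; the threshold is an assumption.
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_seqs):
        residues = [r for r in column if r != gap]
        if not residues:
            continue  # column contains only gaps: contributes nothing
        residue, count = Counter(residues).most_common(1)[0]
        # Keep the most common residue only if enough of the column supports it,
        # otherwise mark the position as ambiguous.
        supported = count / len(column) >= min_fraction
        consensus.append(residue if supported else ambiguous)
    return "".join(consensus)


# Toy alignment standing in for one OG (made-up sequences).
og_alignment = ["MKT-LV", "MKTALV", "MRTALI"]
print(majority_consensus(og_alignment))
```

The resulting per-OG consensus strings would then be the input to the second alignment, the one the tree is built from.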
However, I'm not sure this will actually mean anything. As I understand it, classical trees are constructed based on orthology between the sequences in the tree. Orthologs are proteins separated by speciation rather than gene duplication, so differences in sequence between two orthologs can be assumed to represent the impact of speciation.
If the proteins/protein consensuses are not orthologous (which would presumably be the case if they are derived from separate OGs), would a tree such as I describe (or its distance matrix) illustrate the relative number of non-orthologous gene duplications between the OG consensuses? That is, would it give an idea of when the two proteins separated from some ancestral sequence and then started generating orthologs as their host organisms speciated? Or would any potential stretches of alignment just be random noise?
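By "distance matrix" I mean something like the sketch below: a plain p-distance (the fraction of differing residues over the columns where neither sequence has a gap) between every pair of aligned consensus sequences. The names `p_distance` and `distance_matrix` and the toy input are assumptions for illustration; real tree-building programs generally use substitution-model-corrected distances rather than a raw p-distance.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of mismatching residues over columns where neither sequence is gapped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        return None  # no comparable columns, so the distance is undefined
    return mismatches / compared


def distance_matrix(aligned):
    """All-vs-all p-distances for a dict mapping OG name -> aligned consensus."""
    names = list(aligned)
    return {(x, y): p_distance(aligned[x], aligned[y]) for x in names for y in names}


# Made-up aligned consensuses for three hypothetical OGs.
consensuses = {"OG1": "MKTALV", "OG2": "MKTALI", "OG3": "MR-ALI"}
matrix = distance_matrix(consensuses)
print(matrix[("OG1", "OG2")], matrix[("OG1", "OG3")])
```

My worry is exactly what these numbers would mean: for distantly related or unrelated proteins, a raw distance like this saturates and may say little about when the sequences diverged.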
I am working on the assumption that all proteins will share some evolutionary relationship, even if it is very faint and stretches back to the first protein in the primordial soup - but that could be wrong! I am also very new to phylogenetics, so may have butchered some of the theory :P
Looks like your question is duplicated here
Hi, thanks for your reply! From what I understand, the main difference is that the other question is looking to build a tree of strains based on ortholog group composition. I assume that this involves accounting for multiple OGs in one strain, and working out its position based on that (I might be misunderstanding the BPGA paper though, will read it through a couple more times!).
Rather than having strains as tree tips, I'd like the OGs themselves as nodes, if that makes sense.