Hi all,
I have a large data set of multiple protein sequences for 100+ strains, and I have already built ML trees from it. The data set has a column indicating how frequently each sequence occurs, and my supervisor wants new trees built that incorporate the sequence frequency, to investigate the effect this would have on the branch lengths.
E.g., for Strain A:
SeqNo       1  2   3  4   5
Occurrence  1  1  17  1  31
Is this even possible? I can't find anything that seems even remotely related, so I don't know whether I've wildly misunderstood what I'm supposed to be doing.
Thanks in advance
Edited for example clarity
If your sequences are paralogous, your tree will collapse. There is no way to know which of those duplicated sequences is the 'ancestor' sequence.
Probably the best you could do is draw something like a tanglegram or a split decomposition tree. That would at least highlight that some branches of the tree have more sequences than others.
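To make the "some branches have more sequences than others" idea concrete, here is a minimal sketch, assuming the counts are available per strain as in the Strain A example. It only turns the occurrence counts into relative frequencies that could then be attached to tips as an annotation (for instance as symbol size). The function name `relative_frequencies` and the dictionary layout are my own assumptions, and no tree-building or plotting library is involved.

```python
# Occurrence counts for Strain A, copied from the example in the question.
# Keys are SeqNo, values are how often that sequence occurs.
occurrences = {1: 1, 2: 1, 3: 17, 4: 1, 5: 31}


def relative_frequencies(counts):
    """Return each sequence's share of the strain's total occurrences.

    Hypothetical helper for illustration; not part of any phylogenetics package.
    """
    total = sum(counts.values())
    return {seq_no: n / total for seq_no, n in counts.items()}


freqs = relative_frequencies(occurrences)

# Print one line per sequence, e.g. for use as a tip annotation table.
for seq_no, share in sorted(freqs.items()):
    print(f"SeqNo {seq_no}: {occurrences[seq_no]} occurrences ({share:.1%})")
```

This leaves the tree itself untouched and only describes how common each tip is, so it does not change the branch lengths.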