I am performing a spatio-temporal analysis of 652 viral samples, and I have read that, as part of the process, I should remove duplicate sequences prior to ML tree construction. However, I have not found any information that supports this procedure. Moreover, it would be a problem if it means removing identical samples that come from different years and/or locations. As far as I understand, the main issue is computational cost, while other reasons relate to technical problems some programs have in dealing with duplicate sequences. It would be really great to understand whether this step is really necessary. Thanks!
To make a tree you will need an alignment,
but many or all alignment programs will not work with sequences containing duplicates
(sequences which do not differ from each other), especially if they also have the same headers.
The corresponding branches of the tree would have to be at the same place simultaneously,
and tree-building programs don't like that.
Am I wrong in thinking that identical (duplicate) sequences should in fact be very easy to align? Multiple sequence aligners will have no problem with them, as far as I am aware. In that case the problem lies entirely with the phylogenetic inference software, but I have been unable to find any discussion of why this is the case. Why do tree-building programs not like identical sequences?
The following question from the FAQ of IQ-TREE suggests that the answer has something to do with the ability to calculate bootstrap support:
How does IQ-TREE treat identical sequences?
But I feel that this is still not a proper explanation.
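On the original worry about losing samples from different years or locations: if duplicates do have to be collapsed before tree building, one option is to keep a record of which headers shared each sequence, so the metadata can be mapped back onto the tree afterwards. Below is a minimal sketch in plain Python, not a recommendation of any specific tool's workflow. The function names `parse_fasta` and `collapse_identical` and the `sample|year|location` header layout are assumptions made for illustration, and the comparison is exact string identity only (gaps and ambiguity codes are not treated specially).

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def collapse_identical(records):
    """Group headers by exact (case-insensitive) sequence identity.

    Returns a dict mapping each unique sequence to the list of headers
    that carried it, in order of first appearance.
    """
    groups = {}
    for header, seq in records:
        groups.setdefault(seq.upper(), []).append(header)
    return groups


# Hypothetical input: headers follow an assumed sample|year|location layout.
example = """>s1|2015|Lima
ACGT
AC
>s2|2018|Cusco
ACGTAC
>s3|2018|Lima
ACGTTC
"""

groups = collapse_identical(parse_fasta(example))
for seq, headers in groups.items():
    representative, duplicates = headers[0], headers[1:]
    # Only the representative would go into the tree-building input;
    # the duplicates list is kept so year/location are not lost.
    print(representative, "also represents:", duplicates)
```

The first header in each group can be written to the reduced alignment, and the remaining headers saved to a side table so that identical samples from different years or places can still be annotated on the resulting tree.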