I would like to reduce the redundancy of my alignment based on both taxonomy and sequence identity.
I know that CD-HIT can do this based on identity, but do you know a way to also take taxonomy into account? I want to preserve very similar sequences that could have been transferred between different taxa. If you know a script that can do this, I would appreciate it.
Here is my idea, though I doubt it is very clever. There should be a simpler way...
First, split the sequences into separate files based on taxonomy, let's say by genus.
For each file, run CD-HIT with an identity threshold.
Then concatenate the output files from CD-HIT.
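
The three steps above could be sketched in Python roughly as follows. This is only a sketch under assumptions, not a tested pipeline: the header layout `>seq_id|Genus_species` and the helper names `genus_of`, `split_by_genus` and `reduce_by_genus` are hypothetical, the `cd-hit` binary is assumed to be on the PATH, and the sequences are assumed to be proteins (nucleotide data would need `cd-hit-est`). Gap characters are stripped because CD-HIT works on unaligned sequences, so the result would have to be realigned afterwards.

```python
import subprocess
from collections import defaultdict
from pathlib import Path


def genus_of(header):
    """Extract the genus, ASSUMING headers look like 'seq_id|Genus_species ...'."""
    return header.split("|")[1].split("_")[0]


def split_by_genus(fasta_text):
    """Group FASTA records by genus: {genus: [(header, sequence), ...]}."""
    groups = defaultdict(list)
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                groups[genus_of(header)].append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Drop alignment gaps so CD-HIT receives plain sequences.
            chunks.append(line.replace("-", ""))
    if header is not None:
        groups[genus_of(header)].append((header, "".join(chunks)))
    return dict(groups)


def reduce_by_genus(fasta_path, out_path, identity=0.9, workdir="by_genus"):
    """Run CD-HIT separately on each genus and concatenate the results."""
    work = Path(workdir)
    work.mkdir(exist_ok=True)
    groups = split_by_genus(Path(fasta_path).read_text())
    with open(out_path, "w") as out:
        for genus, records in groups.items():
            genus_in = work / f"{genus}.fasta"
            genus_out = work / f"{genus}.cdhit.fasta"
            genus_in.write_text("".join(f">{h}\n{s}\n" for h, s in records))
            # -i: input, -o: output, -c: sequence identity threshold.
            subprocess.run(
                ["cd-hit", "-i", str(genus_in), "-o", str(genus_out),
                 "-c", str(identity)],
                check=True,
            )
            # Append this genus's representative sequences to the final file.
            out.write(genus_out.read_text())
```

Something like `reduce_by_genus("alignment.fasta", "reduced.fasta", identity=0.9)` would then write one combined file (both file names are placeholders). Because clustering never crosses genus boundaries, near-identical sequences from different genera are all kept, which is the point of splitting first.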