I have a FASTA file with about 500 protein sequences compiled from the results of BLAST searches. Many of these sequences are quite similar to each other.
I would like to trim this FASTA file so that only one copy from each group of highly similar sequences is kept.
My approach so far has been to use the pairwise alignment tool in Biopython, but this quickly becomes intractable, since comparing every sequence against every other means on the order of 500 × 500 = 250,000 alignments.
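To make the approach concrete, here is a minimal sketch of the kind of greedy all-against-all filter I mean. It is not my exact script: the function names, the 0.9 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen so the example runs with the standard library alone. In practice the `similarity` function would be replaced by a score from the Biopython pairwise aligner.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def similarity(a, b):
    # ASSUMPTION: stand-in measure returning a value in [0, 1].
    # This is NOT an alignment score; swap in a normalised pairwise
    # alignment score here to reproduce the approach described above.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_near_duplicates(records, threshold=0.9):
    """Greedy filter: keep a record only if it is below `threshold`
    similarity to every record kept so far (hypothetical default 0.9)."""
    kept = []
    for header, seq in records:
        if all(similarity(seq, kept_seq) < threshold for _, kept_seq in kept):
            kept.append((header, seq))
    return kept


def count_unique_pairs(n):
    """Number of distinct unordered pairs among n sequences."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    example = """>seq1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>seq2
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA
>seq3
GGGGGGGGGGPPPPPPPPPP
"""
    for header, seq in remove_near_duplicates(parse_fasta(example)):
        print(f">{header}\n{seq}")
```

Each new sequence is only compared against the sequences already kept, and skipping symmetric comparisons leaves at most `count_unique_pairs(500)` = 124,750 distinct pairs rather than 250,000, but the cost still grows quadratically with the number of sequences.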
Are there any alternative or better methods for doing this?