I want to remove sequences that share >90% sequence identity, keeping the longer sequence of each pair. Is there a tool for that?
Say A and B have 90% identity and B is longer; B and C have 90% identity and C is longer. Do you want to remove both A and B?
Exactly, I want to remove all the subsets of sequences with >90% identity.
I was not clear: in the example above, A and C do not have 90% identity. Because B has been thrown away, you may think A should be kept, as it is not within 90% identity of any other chosen sequence. Do you still want to remove A? If you do, that is single-linkage clustering or, equivalently, finding connected components in a graph. You can find the algorithm on wiki and in many other places. It is pretty simple and should be achievable in <50 lines of Perl.
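To make the connected-components idea concrete, here is a rough sketch in Python rather than Perl. It assumes you have already computed which pairs exceed 90% identity (for example from an all-vs-all alignment); the function name `longest_per_cluster` and its two inputs are made up for illustration and are not part of any tool mentioned in this thread. It uses union-find to merge linked sequences and keeps the longest member of each component.

```python
def longest_per_cluster(lengths, similar_pairs):
    """Single-linkage clustering via union-find.

    lengths:       dict mapping sequence name -> sequence length (assumed input)
    similar_pairs: iterable of (name_a, name_b) pairs already known to
                   exceed the identity cutoff (assumed precomputed)
    Returns the sorted names of the longest sequence in each cluster.
    """
    # Every sequence starts as the root of its own cluster.
    parent = {name: name for name in lengths}

    def find(x):
        # Walk up to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every linked pair.
    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Keep the longest member of each cluster (first seen wins on ties).
    best = {}
    for name in lengths:
        root = find(name)
        if root not in best or lengths[name] > lengths[best[root]]:
            best[root] = name
    return sorted(best.values())
```

With lengths A=100, B=120, C=150, D=80 and links A-B and B-C, this keeps only C and D: A is dropped even though it is not directly similar to C, which is the behaviour asked about above. A greedy approach that compares each sequence only against the already chosen ones would keep A as well.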
usearch is freely available if you have an academic email address, and I second Ido Tamir's recommendation. I have used the uclust operation for this.
T-coffee has a command just for that:
t_coffee -other_pg seq_reformat -in sproteases_large.fasta -action +trim _seq_%%90_
t_coffee's option removes one of a pair, not both.