I got several thousand sequences from a blastp search, so I removed sequences with >90% identity using cd-hit before building the MSA, and did the same again after MSA construction. My assumption was that sequences with >90% identity will end up in closely related branches. Does this cutoff make sense?
I'll preface my answer with "it depends." If you were looking at strains of bacteria, for instance, a 90% cut-off might be too low for the question you are trying to answer, because strains that differ by only a few percent would be collapsed into a single representative. But for most applications of phylogenetics, collapsing at 90% sequence identity is generally considered fairly routine. If you need to prune your number of taxa further, the suggestion by @a.zielezinski is worth looking into. Generally, you want to prune taxa when you need to make the dataset more manageable in size for alignment and phylogeny estimation, while retaining as much real sequence diversity as possible.
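To make the idea of collapsing at 90% identity concrete, here is a minimal Python sketch of a greedy redundancy filter applied to sequences that are already aligned (the post-MSA step). It is only an illustration, not cd-hit's actual algorithm: cd-hit works on unaligned sequences with a short-word filter and has its own identity definition. The function names (`pairwise_identity`, `greedy_reduce`) and the toy sequences are made up for this example, and identity is assumed here to mean matching residues over columns where neither sequence has a gap.

```python
def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap.

    Assumption: both strings are rows of the same alignment (equal length).
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def greedy_reduce(aligned: dict[str, str], threshold: float = 0.9) -> list[str]:
    """Return names of representatives; any sequence with identity >= threshold
    to an already chosen representative is dropped."""
    # Visit the longest (ungapped) sequences first so they become representatives.
    order = sorted(
        aligned,
        key=lambda name: len(aligned[name].replace("-", "")),
        reverse=True,
    )
    representatives: list[str] = []
    for name in order:
        if all(
            pairwise_identity(aligned[name], aligned[rep]) < threshold
            for rep in representatives
        ):
            representatives.append(name)
    return representatives


# Hypothetical toy alignment: seqB is 95% identical to seqA, seqC is 65% identical.
toy = {
    "seqA": "MKTAYIAKQRQISFVKSHFS",
    "seqB": "MKTAYIAKQRQISFVKSHFT",
    "seqC": "MSTPYLAKHRQLSFIKSHYS",
}
kept = greedy_reduce(toy, threshold=0.9)
print(kept)  # ['seqA', 'seqC']
```

Raising `threshold` (say to 0.99 for closely related strains) keeps more sequences, and lowering it prunes harder, which is the trade-off between dataset size and retained diversity described above. For thousands of sequences the all-against-representatives comparison gets slow, which is one reason to use a dedicated tool such as cd-hit for the real run.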