Reducing Number Of Sequences For Phylogenetic Tree Construction
10.2 years ago
Pappu ★ 2.1k

I got several thousand sequences from a blastp search. To reduce the set, I removed sequences with >90% identity using cd-hit before building the MSA, and did the same again after constructing the MSA. My assumption was that sequences with >90% identity would end up on closely related branches anyway. I am wondering whether this cutoff makes sense.
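For readers who want to see what the post-MSA filtering step amounts to, here is a minimal Python sketch of a greedy redundancy filter on already aligned sequences. It is not cd-hit itself: cd-hit works on unaligned sequences, uses its own identity definition and is far more optimized. The function names `identity` and `greedy_filter`, the toy sequences, and the choice to ignore gapped columns when computing identity are all assumptions made for illustration.

```python
def identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def greedy_filter(seqs, cutoff=0.9):
    """Keep one representative per group of near-identical sequences.

    seqs maps a name to its aligned sequence. Sequences are visited from
    longest to shortest (ungapped length); a sequence is dropped when it is
    more than `cutoff` identical to a representative that was already kept.
    """
    order = sorted(seqs, key=lambda n: len(seqs[n].replace("-", "")), reverse=True)
    representatives = []
    for name in order:
        if all(identity(seqs[name], seqs[r]) <= cutoff for r in representatives):
            representatives.append(name)
    return representatives


# Toy aligned sequences (hypothetical data, 20 columns each).
toy = {
    "s1": "MKTAYIAKQRQISFVKSHFS",
    "s2": "MKTAYIAKQRQISFVKSHFT",  # 1 mismatch to s1 (95% identity)
    "s3": "MKTGYLAKQRNISFVRSHFS",  # 4 mismatches to s1 (80% identity)
}
kept = greedy_filter(toy, cutoff=0.9)
print(kept)
```

With the 0.9 cutoff, `s2` is dropped as redundant with `s1`, while `s3` is retained. The result depends on the visiting order and on how identity is defined, which is one reason the same cutoff can behave differently before and after alignment.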


A better-justified way to reduce the number of sequences would probably be to first build a distance-based tree (NJ, UPGMA) for the whole set of sequences. You could then use Dendroscope3 or iTOL to automatically collapse clades containing very closely related sequences. During this auto-collapsing, the average branch length to all leaves is calculated for every internal node, and clades where this value is below your threshold are collapsed. You can also specify your own support value or a certain node length.
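The collapsing criterion described above can be sketched in a few lines of Python. This is only an illustration of the idea (mean branch length from an internal node to its leaves, compared with a threshold), not the actual implementation used by Dendroscope3 or iTOL. The `Node` class, the function names, the toy tree and the top-down traversal order are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: Optional[str] = None      # leaf label; None for internal nodes
    length: float = 0.0             # branch length to the parent
    children: List["Node"] = field(default_factory=list)


def leaf_distances(node):
    """Path lengths from `node` down to every leaf below it."""
    if not node.children:
        return [0.0]
    dists = []
    for child in node.children:
        dists.extend(child.length + d for d in leaf_distances(child))
    return dists


def leaf_names(node):
    """Labels of all leaves below `node`, left to right."""
    if not node.children:
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaf_names(child))
    return names


def collapse(node, threshold):
    """Top-down: replace a clade by a single leaf when its mean
    node-to-leaf distance is below `threshold`."""
    if not node.children:
        return node
    dists = leaf_distances(node)
    if sum(dists) / len(dists) < threshold:
        names = leaf_names(node)
        label = f"{names[0]} (+{len(names) - 1} collapsed)"
        return Node(name=label, length=node.length)
    return Node(name=node.name, length=node.length,
                children=[collapse(c, threshold) for c in node.children])


# Toy tree, equivalent to ((A:0.01,B:0.02):0.3,(C:0.2,D:0.25):0.1);
tree = Node(children=[
    Node(length=0.3, children=[Node("A", 0.01), Node("B", 0.02)]),
    Node(length=0.1, children=[Node("C", 0.2), Node("D", 0.25)]),
])
pruned = collapse(tree, threshold=0.05)
print(leaf_names(pruned))
```

Here the (A, B) clade has a mean distance to its leaves of 0.015 and is collapsed into one representative, while the (C, D) clade (mean 0.225) is left intact. Keeping the first leaf as the representative is an arbitrary choice; in practice you might prefer the longest or best-annotated sequence.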

10.2 years ago
DG 7.3k

I'll preface my answer with "it depends." If you were looking at strains of bacteria, for instance, the 90% cut-off might be too low for the question you are trying to answer. But for most applications of phylogenetics, collapsing at 90% sequence identity is generally considered fairly routine. If you need to prune your number of taxa further, the suggestion by @a.zielezinski is worth looking into. Generally, you want to prune taxa when you need to make the dataset more manageable in size for alignment and for estimating the phylogeny, while retaining as much real sequence diversity as possible.
