Strategies to pare down alignment data
8.9 years ago
fhsantanna ▴ 610

I have a large alignment file containing more than two thousand sequences, each longer than 10 kb (the sequences are viral genomes with very high identity, above 99.9%).

Running a phylogenetic analysis (even NJ with 1000 bootstrap replicates) on a desktop computer does not seem feasible.

I need to reduce the alignment to make the analysis tractable with my computational resources (an i7 with 8 threads and 32 GB of RAM).

Since there is some redundancy in the alignment, I could remove sequences that are too similar to each other.

Could you suggest a strategy to pare down the alignment while maintaining the diversity of the sequences?

PS: My initial attempt was to cluster the sequences with CD-HIT at different identity levels. I then plotted the cut-off values against the number of clusters formed. At a certain cut-off the number of clusters reaches a plateau, and I used that cut-off to keep one representative sequence per cluster. However, this approach did not remove enough sequences; I still had too much data for phylogenetic analysis.
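The cut-off scan described in the PS can be sketched roughly as below. This is only an illustration and not CD-HIT's algorithm: it assumes the sequences are already aligned to equal length, measures identity over columns where neither sequence has a gap, and uses a simple greedy pass. The function names `identity`, `greedy_representatives` and `cluster_counts` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching columns, skipping columns where either sequence has a gap."""
    matches = compared = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def greedy_representatives(aligned, threshold):
    """Keep a sequence only if its identity to every kept representative is below `threshold`."""
    reps = []
    for name, seq in aligned.items():
        if all(identity(seq, aligned[r]) < threshold for r in reps):
            reps.append(name)
    return reps


def cluster_counts(aligned, thresholds):
    """Number of representatives retained at each identity cut-off."""
    return {t: len(greedy_representatives(aligned, t)) for t in thresholds}


if __name__ == "__main__":
    # Toy aligned sequences (made up for illustration only).
    toy = {
        "s1": "ACGTACGTAC",
        "s2": "ACGTACGTAC",
        "s3": "ACGTACGTAT",
        "s4": "TTGTACGAAT",
    }
    for cutoff, n in cluster_counts(toy, [0.5, 0.85, 0.95, 1.0]).items():
        print(cutoff, n)
```

The pairwise loop is quadratic in the number of sequences, so for thousands of 10 kb genomes pure Python would be slow; the sketch is meant to show the logic of the scan, not to replace a dedicated clustering tool.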

redundancy alignment phylogeny • 1.5k views
8.9 years ago
kloetzl ★ 1.1k

Running a phylogenetic analysis (even a NJ with 1000 bs replicates) in a desktop computer does not seem feasible.

With an alignment-free distance estimator this is entirely feasible. For example, andi can compute a tree of 2000 bacterial genomes within a few hours.
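To give a feel for what "alignment-free" means, here is a minimal sketch that builds a pairwise distance matrix from shared k-mers (Jaccard distance). This is a deliberately simple stand-in and not the method andi uses; the function names and the default `k` are assumptions made for illustration, and for genomes that are more than 99.9% identical a k-mer set distance like this may not resolve the differences well.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence, after removing gap characters."""
    seq = seq.upper().replace("-", "")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=4):
    """1 minus the Jaccard index of the two k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def distance_matrix(seqs, k=4):
    """Symmetric matrix of pairwise k-mer distances, with names in input order."""
    names = list(seqs)
    n = len(names)
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(seqs[names[i]], seqs[names[j]], k)
            mat[i][j] = mat[j][i] = d
    return names, mat


if __name__ == "__main__":
    # Toy sequences (made up for illustration only).
    toy = {"a": "ACGTACGTAC", "b": "ACGTACGTAT", "c": "TTTTGGGGCC"}
    names, mat = distance_matrix(toy, k=3)
    for name, row in zip(names, mat):
        print(name, [round(d, 3) for d in row])
```

A matrix like this can then be handed to any neighbor-joining implementation, which avoids computing a multiple alignment altogether.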
