Question

Strategies to pare down alignment data



8.9 years ago

fhsantanna ▴ 610

I have a large alignment file containing more than two thousand sequences, each longer than 10 kb (the sequences are viral genomes with very high identity, above 99.9%).

Running a phylogenetic analysis (even NJ with 1000 bootstrap replicates) on a desktop computer does not seem feasible.

I must reduce the alignment to make this analysis feasible with my computational resources (an i7 with 8 threads and 32 GB of RAM).

Since there is some redundancy in the alignment, I could remove sequences that are too similar to one another.

Could you suggest a strategy to pare down the alignment while maintaining the diversity of the sequences?

PS: My initial attempt was to cluster the sequences with CD-HIT at different identity levels. I then plotted the number of clusters against the cut-off value. At a certain cut-off level the number of clusters reaches a plateau, and I used that cut-off to keep one representative sequence per cluster. However, this approach did not remove enough sequences; I still had too much data for phylogenetic analysis.
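For readers who want to reproduce the cut-off sweep described above, here is a minimal Python sketch. It is not CD-HIT's algorithm, only a simple greedy analogue that works directly on already-aligned sequences. The function names (identity, greedy_cluster, sweep), the rule of skipping gap columns, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned sequences.

    Columns where either sequence has a gap are skipped (an assumption;
    other conventions exist).
    """
    matches = compared = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Assign each sequence to the first representative it matches.

    Returns a mapping: representative name -> names of cluster members.
    """
    clusters: Dict[str, List[str]] = {}
    for name, seq in seqs.items():
        for rep in clusters:
            if identity(seq, seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


def sweep(seqs: Dict[str, str], thresholds: List[float]) -> Dict[float, int]:
    """Number of clusters obtained at each identity threshold."""
    return {t: len(greedy_cluster(seqs, t)) for t in thresholds}


# Hypothetical toy alignment (10 columns), not real data.
toy = {
    "a": "ACGTACGTAC",
    "b": "ACGTACGTAC",  # identical to a
    "c": "ACGTACGTAT",  # one mismatch against a
    "d": "TTGTACGTAT",  # three mismatches against a
}
counts = sweep(toy, [1.0, 0.85, 0.65])
print(counts)
```

Plotting the resulting counts against the thresholds gives the kind of curve described above, and the keys of greedy_cluster's result are the representatives to keep. Pure Python will be slow on thousands of 10 kb sequences, so treat this as a way to check the logic rather than a production tool.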

redundancy alignment phylogeny • 1.5k views

updated 15 months ago by Ram 43k • written 8.9 years ago by fhsantanna ▴ 610

Ram · Answer 1 · 2015-06-02



8.9 years ago

kloetzl ★ 1.1k

Running a phylogenetic analysis (even a NJ with 1000 bs replicates) in a desktop computer does not seem feasible.

With an alignment-free distance estimator, this is entirely possible. For example, andi can compute a tree of 2000 bacterial genomes within a few hours.

updated 15 months ago by Ram 43k • written 8.9 years ago by kloetzl ★ 1.1k