Question

Multiple Sequence Alignment for very large data sets

0

6.2 years ago

dthorbur ★ 1.9k

What would your recommendations be for the best MSA programs for comparing large nucleotide sequences of the same species?

My data set consists of 66 individuals from 11 populations. I will be using VariScan to scan for selection next, but first I need to realign each of the 21 chromosomes separately at each of the following levels: population (6 individuals x 11 replicates), population pair (12 individuals x 5 replicates, excluding the outgroup), ecotype (30 individuals x 2 ecotypes), and all individuals together (66 individuals). To give you an idea of the size, chromosome 1 is ~26Mb, in a single consensus fasta sequence per individual.
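To make the scale concrete, here is a minimal Python sketch that tallies the alignment jobs implied by the groupings above. The names `LEVELS`, `N_CHROMOSOMES`, and `count_alignment_jobs` are illustrative, and the sketch assumes one alignment per group per chromosome.

```python
# Grouping levels taken from the post: (label, number of groups, individuals per group).
LEVELS = [
    ("population", 11, 6),
    ("population pair", 5, 12),
    ("ecotype", 2, 30),
    ("all individuals", 1, 66),
]
N_CHROMOSOMES = 21


def count_alignment_jobs(levels=LEVELS, n_chromosomes=N_CHROMOSOMES):
    """Return the number of separate alignments: one per group per chromosome."""
    groups = sum(n_groups for _, n_groups, _ in levels)
    return groups * n_chromosomes


def largest_group(levels=LEVELS):
    """Return the number of sequences in the biggest single alignment."""
    return max(size for _, _, size in levels)


if __name__ == "__main__":
    print("alignment jobs:", count_alignment_jobs())
    print("sequences in largest job:", largest_group())
```

Under that assumption there are 19 groups per chromosome, so 399 alignments in total, the largest of which contains 66 sequences.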

I was using GUIDANCE with the PRANK algorithm (commonplace in phylogenomic studies), but even on a high-performance computing cluster, a single chromosome with only 6 individuals took too long for this to be a feasible approach. I am currently running a test with MAFFT, but I have been warned it may have a sequence size limitation - it's currently running so we'll see.
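For reference, a MAFFT test of this kind could be launched from Python roughly as sketched below. This is not the exact command used here: the file names and thread count are hypothetical, and the sketch assumes `mafft` is on the `PATH`. It relies only on MAFFT's `--auto` and `--thread` options and on the fact that MAFFT writes the alignment to standard output.

```python
import subprocess


def build_mafft_command(fasta_path, threads=1):
    """Build a MAFFT command line; --auto lets MAFFT pick a strategy itself."""
    return ["mafft", "--auto", "--thread", str(threads), fasta_path]


def run_mafft(fasta_path, out_path, threads=1):
    """Run MAFFT and capture the alignment, which it prints to stdout."""
    with open(out_path, "w") as out:
        subprocess.run(build_mafft_command(fasta_path, threads),
                       stdout=out, check=True)


if __name__ == "__main__":
    # Hypothetical input: one chromosome, one population, 6 sequences.
    print(" ".join(build_mafft_command("chr1_pop1.fasta", threads=8)))
```

Whether this finishes on 26 Mb sequences is exactly the open question, so it is worth timing on a small subset first.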

Thanks in advance, and if I missed explaining anything, please let me know!

MSA Alignment • 3.1k views

ADD COMMENT • link updated 5.0 years ago by sallyzaki70 ▴ 10 • written 6.2 years ago by dthorbur ★ 1.9k

3

What is the scientific logic of doing a chromosome level MSA? How was this consensus sequence generated? What is the ultimate question you are trying to solve?

ADD REPLY • link 6.2 years ago by GenoMax 142k

0

Ultimately this is a genome-wide scan for balancing selection. We would do it at the whole-genome level, but by handling each chromosome separately we cut down both the computational power and the time needed for each step. Make sense?

ADD REPLY • link 6.2 years ago by dthorbur ★ 1.9k

0

I assume you are reasonably sure that your chromosomes are directly alignable. Extending your own logic, if chromosome-level alignments are not feasible, you may have to start dividing the problem into smaller pieces. A casual search showed that people seem to study balancing selection at the level of a few genes. Are there known studies doing this at the whole-genome level?
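One way to picture "smaller pieces" is a set of fixed-size windows along each chromosome. The sketch below is only an illustration: the `windows` helper and the 1 Mb window size are assumptions, and cutting every individual at the same coordinates is only meaningful if all sequences already share a coordinate system (for example, consensus sequences built against one reference).

```python
def windows(length, size, overlap=0):
    """Yield 0-based, half-open (start, end) windows covering [0, length)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    start = 0
    while start < length:
        end = min(start + size, length)
        yield start, end
        if end == length:
            break
        start += step


if __name__ == "__main__":
    # Chromosome 1 is ~26 Mb in the question; 1 Mb windows are an arbitrary choice.
    chunks = list(windows(26_000_000, 1_000_000))
    print(len(chunks), "windows; first:", chunks[0], "last:", chunks[-1])
```

Each window can then be aligned as an independent job, which trades one very long alignment for many short ones.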

ADD REPLY • link 6.2 years ago by GenoMax 142k

0

There aren't many, but off the top of my head I can think of 2 good examples. We are trying to get away from the candidate gene approach here, and our data set is outstanding.

I am reasonably sure each chromosome should align without any problems.

ADD REPLY • link 6.2 years ago by dthorbur ★ 1.9k

0

Assuming the logic/experiment is all sound - try a different tool.

Ones that tend to be good for large scale:

progressiveMauve
LASTZ
Kalign

ADD REPLY • link 6.2 years ago by Joe 21k

0

Thanks, I'll have a look into these now.

ADD REPLY • link 6.2 years ago by dthorbur ★ 1.9k

0

Please, can you suggest a dataset I could use to compare three algorithms (Clustal, MUSCLE, T-Coffee)?

ADD REPLY • link 5.0 years ago by sallyzaki70 ▴ 10

0

Take a look at BAliBASE.

They have alignment benchmarking datasets.

ADD REPLY • link 5.0 years ago by Joe 21k

0

Hi miles.thorburn, how did you solve it in the end? Any recommendations? I did pairwise LAST alignments and constructed the scaffold-level MSA with Mugsy. The goal of my approach was different (finding sequence variation), and as you can imagine, scalability was an issue.

What was your experience with multiple whole-genome alignments? What worked best?

ADD REPLY • link 5.0 years ago by Carambakaracho ★ 3.2k

0

Hi, the analysis has changed quite significantly since then. I ended up only needing to align coding sequences from my individuals and an outgroup. I did that with GUIDANCE and the MAFFT algorithm for codon alignments.

Since the majority of my sequences came from the same individual, once I had the coordinates, I didn't need to align them again as they are all mapped to the same reference genome.
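The coordinate-based shortcut can be sketched as follows. This is an illustration rather than the pipeline actually used: `extract_region` and the toy sequences are hypothetical, and it assumes every individual's sequence has the same length and coordinate system as the reference (no indels in the consensus).

```python
def extract_region(sequences, start, end):
    """Slice the same 0-based, half-open interval out of every sequence.

    `sequences` maps an individual's name to its sequence string. Because all
    sequences are assumed to share reference coordinates, the slices are
    already column-aligned and need no further alignment step.
    """
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) > 1:
        raise ValueError("sequences do not share one coordinate system")
    return {name: seq[start:end] for name, seq in sequences.items()}


if __name__ == "__main__":
    # Toy data standing in for per-individual consensus sequences.
    toy = {"ind1": "ACGTACGTAC", "ind2": "ACGTTCGTAC", "ind3": "ACGAACGTAC"}
    for name, cds in extract_region(toy, 2, 8).items():
        print(name, cds)
```

Only sequences that are not on the shared coordinates, such as an outgroup, would then need a real alignment step.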

ADD REPLY • link 5.0 years ago by dthorbur ★ 1.9k