Question: Multiple Sequence Alignment for very large data sets
17 months ago by
miles.thorburn80 wrote:

What would your recommendations be for the best MSA programs for comparing large nucleotide sequences of the same species?

My data set consists of 66 individuals from 11 populations. I will be using VariScan to scan for selection next, but I need to realign each of the 21 chromosomes separately for each of the following groupings: population (6 individuals x 11 replicates), population pair (12 individuals x 5 replicates, excluding the outgroup), ecotype (30 individuals x 2 ecotypes), and all individuals together (66 individuals). To give you an idea of the size, chromosome 1 is ~26 Mb, in a single consensus FASTA sequence.
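As a rough tally of what that implies, here is a minimal Python sketch that counts the alignment runs from the group sizes quoted above. It assumes one alignment per group per chromosome; the names `groupings` and `alignment_jobs` are only illustrative.

```python
CHROMOSOMES = 21

# (number of groups, individuals per group), taken from the description above
groupings = {
    "population": (11, 6),
    "population pair": (5, 12),
    "ecotype": (2, 30),
    "all individuals": (1, 66),
}


def alignment_jobs(groupings, chromosomes):
    """Return the number of alignment runs per grouping (groups x chromosomes)."""
    return {name: n_groups * chromosomes for name, (n_groups, _) in groupings.items()}


jobs = alignment_jobs(groupings, CHROMOSOMES)
total_jobs = sum(jobs.values())

# Sequences fed into alignments for a single chromosome, summed over all groupings
sequences_per_chromosome = sum(n_groups * size for n_groups, size in groupings.values())

for name, n in jobs.items():
    print(f"{name}: {n} alignments")
print(f"total: {total_jobs} alignments")
```

Under those assumptions that is 19 alignments per chromosome, so the per-alignment run time matters far more than any single slow job.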

I was using GUIDANCE with the PRANK algorithm (commonplace in phylogenomic studies), but even on a high performance computing cluster it took too long, for one chromosome with only 6 individuals, to be a feasible approach. I am currently running a test with MAFFT, but I have been warned it may have a sequence size limitation - it's still running, so we'll see.

Thanks in advance, and if I missed explaining anything, please let me know!

msa alignment • 918 views
modified 9 weeks ago by sallyzaki7010 • written 17 months ago by miles.thorburn80

What is the scientific logic of doing a chromosome level MSA? How was this consensus sequence generated? What is the ultimate question you are trying to solve?

modified 17 months ago • written 17 months ago by genomax69k

Ultimately this is a genome-wide scan for balancing selection. We would do it at the whole-genome level, but by aligning each chromosome separately we cut down both the computational power and the time needed for each step. Make sense?

written 17 months ago by miles.thorburn80

I assume you are reasonably sure that your chromosomes are directly alignable. By extending your method, if chromosome-level alignments are not feasible, you may have to start dividing the problem into smaller pieces. A casual search showed that people seem to be studying balancing selection at the level of a few genes. Are there known studies doing this at the whole-genome level?
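To illustrate what dividing the problem into smaller pieces could look like, here is a minimal Python sketch that cuts every individual's sequence at the same window coordinates. It assumes the per-individual consensus sequences share one coordinate system (equal lengths, for example consensus called against the same reference), which may not hold for every data set. The function names, window size, and overlap are hypothetical choices, not part of any tool mentioned in this thread.

```python
from typing import Dict, Iterator, Tuple


def iter_windows(length: int, window: int, overlap: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield 0-based, half-open (start, end) windows that tile a sequence."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    start = 0
    while start < length:
        end = min(start + window, length)
        yield start, end
        if end == length:
            break  # the last window reached the end of the sequence
        start += step


def split_by_window(seqs: Dict[str, str], window: int, overlap: int = 0):
    """Yield ((start, end), {name: subsequence}) for each window.

    ASSUMPTION: all sequences have the same length and coordinate system.
    """
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must all have the same length")
    for start, end in iter_windows(lengths.pop(), window, overlap):
        yield (start, end), {name: s[start:end] for name, s in seqs.items()}


# Toy example: two 10 bp "chromosomes" cut into 4 bp windows
toy = {"ind1": "ACGTACGTAC", "ind2": "TTTTTTTTTT"}
blocks = list(split_by_window(toy, window=4))
```

Each block could then be written to its own FASTA file and aligned independently, which also makes the work easy to spread across cluster nodes.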

modified 17 months ago • written 17 months ago by genomax69k

There aren't many, but off the top of my head I can think of 2 good examples. We are trying to get away from the candidate gene approach here, and our data set is outstanding.

I am reasonably sure each chromosome should be aligned without any problems.

written 17 months ago by miles.thorburn80

Assuming the logic/experiment is all sound - try a different tool.

Ones that tend to be good for large scale:

  • progressiveMauve
  • Kalign

written 17 months ago by jrj.healey13k

Thanks, I'll have a look into these now.

written 17 months ago by miles.thorburn80

Please, can you suggest a dataset to use for comparing 3 algorithms (Clustal, MUSCLE, T-Coffee)?

written 9 weeks ago by sallyzaki7010

Take a look at BAliBASE.

They have alignment benchmarking datasets.

written 9 weeks ago by jrj.healey13k

Hi miles.thorburn, how did you solve it in the end? Any recommendations? I did pairwise LAST alignments and constructed the scaffold-level MSA with Mugsy. The goal of my approach was different (finding sequence variation), and as you can imagine, scalability was an issue.

What was your experience with multiple whole-genome alignments? What worked best?

written 9 weeks ago by Carambakaracho1.4k

Hi, the analysis has changed quite significantly since then. I ended up only needing to align coding sequences from my individuals and an outgroup. I did that with GUIDANCE and the MAFFT algorithm for codon alignments.

Since the majority of my sequences came from the same individual, once I had the coordinates, I didn't need to align them again as they are all mapped to the same reference genome.
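For anyone wondering what reusing coordinates looks like in practice, here is a minimal Python sketch, not the exact pipeline I ran. It assumes each individual's sequence is already in reference coordinates (same length as the reference, no indels), and that exon intervals are 0-based and half-open. The function names and toy data are hypothetical.

```python
from typing import Dict, List, Tuple

# Complement table for DNA, including N for masked or uncalled bases
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_cds(seq: str, exons: List[Tuple[int, int]], strand: str = "+") -> str:
    """Join exon slices (reference coordinates) into one coding sequence."""
    cds = "".join(seq[start:end] for start, end in sorted(exons))
    return reverse_complement(cds) if strand == "-" else cds


def cds_per_individual(
    genomes: Dict[str, str], exons: List[Tuple[int, int]], strand: str = "+"
) -> Dict[str, str]:
    """Extract the same CDS from every individual.

    ASSUMPTION: every sequence shares the reference coordinate system, so the
    extracted sequences are already position-matched and need no realignment.
    """
    return {name: extract_cds(seq, exons, strand) for name, seq in genomes.items()}


# Toy example: one two-exon gene in two individuals
genomes = {"ind1": "AAATGCCCTTTTAGGG", "ind2": "AAATGCCCTTCTAGGG"}
exons = [(2, 5), (10, 13)]
cds = cds_per_individual(genomes, exons)
```

Only the outgroup, which does not share those coordinates, then needs a real alignment step.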

written 9 weeks ago by miles.thorburn80