What would your recommendations be for the best MSA programs for comparing large nucleotide sequences of the same species?
My data set consists of 66 individuals from 11 populations. I will be using VariScan to scan for selection next, but I need to realign each of the 21 chromosomes separately for each of the following: population (6 individuals x 11 replicates), population pair (12 individuals x 5 replicates, excluding the outgroup), ecotype (30 individuals x 2 ecotypes), and all individuals together (66 individuals). To give you an idea of the size, chromosome 1 is ~26 Mb, in a single consensus FASTA sequence.
I was using GUIDANCE with the PRANK algorithm (commonplace in phylogenomic studies), but even on a high-performance computing cluster it took too long on one chromosome with only 6 individuals to be a feasible approach. I am currently running a test with MAFFT, but I have been warned it may have a sequence size limitation. It's still running, so we'll see.
Thanks in advance, and if I missed explaining anything, please let me know!
What is the scientific logic of doing a chromosome-level MSA? How was this consensus sequence generated? What is the ultimate question you are trying to answer?
Ultimately this is a genome-wide scan for balancing selection. We would do it at the whole-genome level, but by aligning each chromosome separately we cut down both the computational power and the time needed for each step. Make sense?
I assume you are reasonably sure that your chromosomes are directly alignable. Extending your own logic, if chromosome-level alignments are not feasible, you may have to start dividing the problem into smaller pieces. A casual search showed that people seem to be studying balancing selection at the level of a few genes. Are there known studies doing this at the whole-genome level?
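To make the "smaller pieces" idea concrete, here is a minimal Python sketch that cuts a chromosome into fixed-size windows which could then be aligned one at a time. It assumes every individual's sequence is in the same reference coordinates; the function name `windows`, the 1 Mb window size and the optional overlap are illustrative choices, not anything from the original poster's pipeline.

```python
def windows(chrom_length, window_size, overlap=0):
    """Yield 0-based, half-open (start, end) windows covering a chromosome.

    Hypothetical helper: window_size and overlap are illustrative
    parameters, to be tuned to what the aligner can handle.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be >= 0 and smaller than window_size")
    step = window_size - overlap
    start = 0
    while start < chrom_length:
        end = min(start + window_size, chrom_length)
        yield start, end
        if end == chrom_length:
            break  # last window reached the chromosome end
        start += step


# Example: a ~26 Mb chromosome split into 1 Mb windows.
chunks = list(windows(26_000_000, 1_000_000))
print(len(chunks), chunks[0], chunks[-1])
```

Each window can be submitted as its own cluster job, which is the same divide-and-conquer logic as going chromosome by chromosome, just one level finer.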
There aren't many, but off the top of my head I can think of 2 good examples. We are trying to get away from the candidate gene approach here, and our data set is outstanding.
I am reasonably sure each chromosome should be aligned without any problems.
Assuming the logic/experiment is all sound - try a different tool.
Ones that tend to be good for large scale:
Thanks, I'll have a look into these now.
Please, can you suggest a dataset I could use to compare three algorithms (Clustal, MUSCLE, T-Coffee)?
Take a look at BAliBASE.
They have alignment benchmarking datasets.
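As a rough illustration of how a benchmark reference alignment can be used to compare aligners, the sketch below computes a sum-of-pairs style score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The function names and the toy sequences are assumptions made for this example; for real comparisons, use the scoring tools distributed with the benchmark.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    alignment: dict mapping sequence name -> gapped string (equal lengths).
    A residue is identified by (name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    lengths = {len(alignment[n]) for n in names}
    if len(lengths) > 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop() if lengths else 0
    counters = {n: 0 for n in names}  # residues seen so far per sequence
    pairs = set()
    for col in range(n_cols):
        present = []
        for n in names:
            if alignment[n][col] != gap:
                present.append((n, counters[n]))
                counters[n] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment contains no aligned pairs")
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


# Toy data, invented for illustration only.
reference = {"a": "AC-GT", "b": "ACGGT"}
test = {"a": "ACG-T", "b": "ACGGT"}
print(sum_of_pairs_score(test, reference))
```

Running each of the three programs on the same benchmark input and scoring each output against the reference this way gives one number per program to compare.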
Hi miles.thorburn, how did you solve it in the end? Any recommendations? I did pairwise LAST alignments and constructed the scaffold-level MSA with Mugsy. The goal of my approach was different (finding sequence variation), and as you can imagine, scalability was an issue.
What was your experience with multiple whole-genome alignments? What worked best?
Hi, the analysis has changed quite significantly since then. I ended up only needing to align coding sequences from my individuals and an outgroup. I did that with GUIDANCE and the MAFFT algorithm for codon alignments.
Since the majority of my sequences came from the same individual, once I had the coordinates, I didn't need to align them again as they are all mapped to the same reference genome.
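A minimal Python sketch of that last point: when every consensus sequence is in the same reference coordinates, a region can be pulled out of all individuals by slicing the same interval, with no realignment step. This assumes the consensus sequences contain substitutions only (no indels), so they all have the reference length; the function name, sample names and coordinates are hypothetical.

```python
def extract_region(sequences, start, end):
    """Slice the same 0-based, half-open interval from every sequence.

    sequences: dict mapping individual name -> consensus sequence.
    Assumes all sequences share reference coordinates (equal length).
    """
    if not sequences:
        return {}
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; coordinates are not shared")
    length = lengths.pop()
    if not 0 <= start <= end <= length:
        raise ValueError("interval falls outside the sequence")
    return {name: seq[start:end] for name, seq in sequences.items()}


# Toy consensus sequences, invented for illustration.
consensus = {
    "ind1": "ATGCCGTAAGGT",
    "ind2": "ATGCTGTAAGGT",
    "ind3": "ATGCCGTGAGGT",
}
region = extract_region(consensus, 3, 9)
for name, seq in region.items():
    print(name, seq)
```

The extracted slices are already column-matched, so only the outgroup (which is not in the same coordinate system) still needs a real alignment step.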