How To Solve "Input Sequence" Problem In Multiple Sequence Alignment?
3
4
12.0 years ago
User1029725 ▴ 100

Randomizing the input order of the sequences gives completely different alignments. Is there a way to address this problem?

multiple random • 6.8k views
0

How many sequences are you aligning with MUSCLE?

0

usually in the range of 15K-30K (funny, but true)

0

Do you have similar issues with other alignment programs? Have you tried MAFFT? http://mafft.cbrc.jp/alignment/software/

1
12.0 years ago

Bob Edgar (author of MUSCLE) wrote this blog entry on big alignments:

Consider using Uclust.

3

This answer is not related to the question and uclust has nothing to do with multiple alignments.

1

The point is, as stated by Bob Edgar, that huge alignments are nonsense. The meaningful thing to do is clustering to bin alignable sequences.

0

Thanks for the info! Is there any reference to the effect of input order on alignments (my original query) in Bob's blog, or did I miss something?

0

The point is, as stated by Bob Edgar, that huge alignments are nonsense. The meaningful thing to do is clustering to bin alignable sequences. Of course, this is not the answer to the misguided question.

0

If that were true, then all Pfam full alignments would be nonsense. Also of interest here is a new paper from Chris Sander's group (http://www.ncbi.nlm.nih.gov/pubmed/22163331), where the authors used "big" protein alignments to accurately predict folds using statistical methods, which is not possible with "small" alignments.

0
12.0 years ago

I found the following code that shuffles the order of sequences in fasta format. The "perl script randomly shuffles the order of sequences in a fasta file. Upon execution, specify your input file (without .fasta extension) and total no. of sequences." Feed that output into MUSCLE.

1

That is exactly what the code linked in my response does. It shuffles not the sequences themselves but their order, so sequences 1,2,3,4,5 become, for example, 4,2,3,5,1. With this randomized ordering of the input sequences, you can test for the "input sequence" bias.
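
For anyone without the Perl script at hand, here is a rough Python sketch of the same idea: it shuffles the order of the records in a FASTA file and leaves the sequences themselves untouched. The function names (`read_fasta`, `write_fasta`, `shuffle_records`) and the `seed` parameter are made up for this illustration and are not taken from the linked script.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_fasta(records, width=60):
    """Format (header, sequence) tuples as FASTA text, wrapped at `width`."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


def shuffle_records(records, seed=None):
    """Return a new list with the records in random order.

    Only the order changes; each sequence stays intact.
    Passing a seed makes the shuffle repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled
```

Running the aligner on several outputs produced with different seeds gives a set of alignments that differ only in input order, which is what you need to test for the bias.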

0

Sorry if my query was not clear. I am looking for ways to remove the input-order bias. In other words, if I change the input order, I obtain a completely different alignment. I want to know if there is any way to always obtain a similar (if not identical) alignment, no matter what the input order is.

0
12.0 years ago
Andreas ★ 2.5k

I think this "problem" arises because at some stage an asymmetric pairwise distance measure is computed, i.e. one whose result depends on the ordering. However, I'm not sure where exactly this happens in MUSCLE. The first distance used there (the k-mer distance) should be symmetric. Does -maxiter 1 always give the same result? A manual way to get rid of this would be to sort the sequences first according to some criterion (e.g. length), but there is of course no guarantee that this would give better alignments.

Andreas

0

Yes, I was also thinking along the lines of sorting.

Until now, I was using -maxiter 5. Let me see what -maxiter 1 gives.

0

Yes, I was also thinking along the lines of sorting. -maxiter 1 should give the same result, but let me run a test dataset to confirm!

0

I am surprised: even -maxiter 1 doesn't give the same alignment! Any idea why?

0

Even -maxiter 1 doesn't give the same alignment for randomized input order. I think this might happen if more than two sequences have the same pairwise k-mer score. In that case, either of them may be aligned before the other, resulting in a different alignment every time.
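
One way to check the tie hypothesis on a small test set is to compute a pairwise k-mer distance and list the pairs that share exactly the same score. The sketch below uses a simple shared-k-mer fraction that I made up for illustration; it is not MUSCLE's actual k-mer distance, and the names `kmer_distance` and `tied_pairs` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmer_counts(seq, k=3):
    """Count the overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """1 - (shared k-mers / k-mers in the shorter sequence).

    Symmetric by construction: swapping a and b gives the same value.
    Illustrative only, not the measure MUSCLE uses.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())  # multiset intersection
    denom = min(sum(ca.values()), sum(cb.values()))
    return 1.0 - shared / denom if denom else 1.0


def tied_pairs(seqs, k=3):
    """Map each distance shared by more than one pair to those pairs.

    `seqs` is a dict of name -> sequence. Names are sorted first, so the
    report itself does not depend on input order.
    """
    groups = defaultdict(list)
    for (n1, s1), (n2, s2) in combinations(sorted(seqs.items()), 2):
        groups[round(kmer_distance(s1, s2, k), 6)].append((n1, n2))
    return {d: pairs for d, pairs in groups.items() if len(pairs) > 1}
```

If `tied_pairs` returns many groups for your data, a tree-building step that breaks ties by input position could plausibly give a different guide tree for each ordering, though whether that is what happens inside MUSCLE is an assumption.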

0

Then I'd go for sorting. You might also want to try MAFFT or, in the case of protein sequences, Clustal Omega (which has an internal switch to sort sequences first).
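
The sorting idea can be sketched in a few lines of Python. The sort key here (longest first, then the sequence itself, then the header) is only one possible choice, and the name `canonical_order` is made up for this example. Records are assumed to be (header, sequence) tuples already parsed from the FASTA file.

```python
import random


def canonical_order(records):
    """Return records in an order that does not depend on input order.

    Key: longest sequence first, then the sequence text, then the header.
    The last two keys break ties between equal-length sequences, so any
    permutation of the same records sorts to the same list.
    """
    return sorted(records, key=lambda r: (-len(r[1]), r[1], r[0]))


# Example: two different input orders end up identical after sorting.
records = [("seq3", "ACGT"), ("seq1", "ACGTAC"), ("seq2", "ACGA"), ("seq4", "ACGA")]
shuffled = list(records)
random.Random(0).shuffle(shuffled)
assert canonical_order(shuffled) == canonical_order(records)
```

This only makes the input to the aligner reproducible. As noted above, it says nothing about whether the resulting alignment is better, and it assumes the aligner itself is deterministic for a fixed input order.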