Question

Programming Challange: Pairwise Alignments To Multiple Alignment

7

Entering edit mode

13.5 years ago

Rm 8.3k

I have a set of 10-12 very closely related chromosome sequences (from different strains) aligned to a "single" reference chromosome. Now I need to generate multiple sequence alignment of these without afftecting individual alignments to the reference. All that I need is to add relative inserts at respective sequence positions, so that I get a global alignment with respect to reference.

"Note. Here I am NOT looking for sequence similarities".

I hope some scripts are already available to do this? I dont want to reinvent the wheel. OR any suggestions to script (preferably in perl) to address this problem.

Adding more information:

What I need is to do is to "add dashes" at relative "insert" positions in other sequences, so that I get a global alignment with respect to reference.

For example with 10 sequnces: If from position 30 to 55 in seq1 has an insert. but not in other 9 sequences. In the final expanded alignment I will insert 26 dashes (-) (from 30-55) in the sequences 2 to 10 and in the reference.

And in another situation like above if seq1 has insert at 30-55 and if seq3 has insert from 40-45 and seq6 has insert from 42-49. Then I need to insert dashes like above, except in these positions (in seq3: 40-45 and in seq6: 112-119)

Sample Input: Original reference: CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC

Pairwise alignments:

Ref1: CGACAAT--GCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
Seq1: CGACAATAAGCACGACAGAGGAAGCAGAACAGATA-----ATTGCCTCTCATTTTC-CTCCC

Ref1: CGACAATGCACGACAGAGGAAGC--AGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
Seq2: CGACAAT-CACGACAGAGGAAGCTTAGAACAGATATTTAG---GCCTCTCATTTTCTCTCCC

Ref1: CGACAATGCACGACAGAGGAAG----CAGAACAGATATTTAGATTGCCTCTCA----TTTTCTCTCCC
Seq3: CGACAATGCACGACAGAGGAAGTTTTCAGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTCTCCC

sample Output: Final Multiple sequence alignment:

Ref1: CGACAAT--GCACGACAGAGGAAG----C--AGAACAGATATTTAGATTGCCTCTCA----TTTTCTCTCCC
Seq1: CGACAATAAGCACGACAGAGGAAG----C--AGAACAGATA-----ATTGCCTCTCA----TTTTC-CTCCC
Seq2: CGACAAT---CACGACAGAGGAAG----CTTAGAACAGATATTTAG---GCCTCTCA----TTTTCTCTCCC
Seq3: CGACAAT--GCACGACAGAGGAAGTTTTC--AGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTCTCCC

alignment scripting perl programming indel • 6.9k views

ADD COMMENT • link updated 13.5 years ago by Haibao Tang 3.0k • written 13.5 years ago by Rm 8.3k

Ram · Answer 1 · 2010-10-22

The algorithm you described was used in TBA / MULTIZ for multiple genomic alignments. To download, go to Miller Lab website.

The underlying principle for multiple sequence alignments is that the gap insertion is determined by the order that you align the sequences. So in many cases, doing seq1-seq2-seq3 order is different from seq2-seq1-seq3, this is known as "once a gap, always a gap".

The good news is that TBA does what you need, the bad news is you'll have to use their MAF format. That means you'll need to do the format conversion. Please stay with me. Following are the files you need, and need to be named exactly like this.

First, you'll need ref1, seq1, seq2, seq3 are the raw sequences in FASTA format. For example, ref1 looks like this:

>ref1
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC

Next, you'll have three pairwise comparisons in MAF format. For example, ref1.seq1.sing.maf looks like this:

##maf version=1
a
s ref1 0 60 + 60 CGACAAT--GCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
s seq1 0 56 + 56 CGACAATAAGCACGACAGAGGAAGCAGAACAGATA-----ATTGCCTCTCATTTTC-CTCCC

Do this similarly for the other two comparisons, and call ref1.seq2.sing.maf, ref1.seq3.sing.maf.

With all the sequences and pairwise MAFs in place, do a single command:

tba "(((ref1 seq1) seq2) seq3)" *.*.maf tba.maf

This will construct your multiple alignment in the order seq1-seq2-seq3 (by following the guide tree), but I think you already know what to do for a different order.

Finally, you'll have what you need in tba.maf. Good luck!

a score=20563.0
s ref1 0 60 + 60 CGACAAT--GCACGACAGAGGAAGC----AGAACAGATATTTAGATTGCCTCTC----ATTTTCTCTCCC
s seq1 0 56 + 56 CGACAATAAGCACGACAGAGGAAGC----AGAACAGATA-----ATTGCCTCTC----ATTTTC-CTCCC
s seq2 0 58 + 58 CGACAAT---CACGACAGAGGAAGCTT--AGAACAGATATTTAG---GCCTCTC----ATTTTCTCTCCC
s seq3 0 68 + 68 CGACAAT--GCACGACAGAGGAAGTTTTCAGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTCTCCC

score 9 · Answer 2 · 2010-10-12

9

Entering edit mode

13.5 years ago

Lee Katz ★ 3.1k

MUSCLE has profile option hidden in its user manual. Also you might want to try out MAUVE in case it has some profile option too.

Alternatively you might use an assembly finishing tool like Hawkeye or Consed to align things manually.

ADD COMMENT • link 13.5 years ago by Lee Katz ★ 3.1k

0

Entering edit mode

thanks, I will look into those to see whether it fits my requirement, else need a script to do