Programming Challange: Pairwise Alignments To Multiple Alignment
2
7
Entering edit mode
12.1 years ago
Rm 8.2k

I have a set of 10-12 very closely related chromosome sequences (from different strains) aligned to a "single" reference chromosome. Now I need to generate multiple sequence alignment of these without afftecting individual alignments to the reference. All that I need is to add relative inserts at respective sequence positions, so that I get a global alignment with respect to reference.

"Note. Here I am NOT looking for sequence similarities".

I hope some scripts are already available to do this? I dont want to reinvent the wheel. OR any suggestions to script (preferably in perl) to address this problem.

What I need is to do is to "add dashes" at relative "insert" positions in other sequences, so that I get a global alignment with respect to reference.

For example with 10 sequnces: If from position 30 to 55 in seq1 has an insert. but not in other 9 sequences. In the final expanded alignment I will insert 26 dashes (-) (from 30-55) in the sequences 2 to 10 and in the reference.

And in another situation like above if seq1 has insert at 30-55 and if seq3 has insert from 40-45 and seq6 has insert from 42-49. Then I need to insert dashes like above, except in these positions (in seq3: 40-45 and in seq6: 112-119)

Sample Input: Original reference: CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC

Pairwise alignments:

Ref1: CGACAAT--GCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
Seq1: CGACAATAAGCACGACAGAGGAAGCAGAACAGATA-----ATTGCCTCTCATTTTC-CTCCC

Ref1: CGACAATGCACGACAGAGGAAGC--AGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
Seq2: CGACAAT-CACGACAGAGGAAGCTTAGAACAGATATTTAG---GCCTCTCATTTTCTCTCCC

Ref1: CGACAATGCACGACAGAGGAAG----CAGAACAGATATTTAGATTGCCTCTCA----TTTTCTCTCCC
Seq3: CGACAATGCACGACAGAGGAAGTTTTCAGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTCTCCC


sample Output: Final Multiple sequence alignment:

Ref1: CGACAAT--GCACGACAGAGGAAG----C--AGAACAGATATTTAGATTGCCTCTCA----TTTTCTCTCCC
Seq1: CGACAATAAGCACGACAGAGGAAG----C--AGAACAGATA-----ATTGCCTCTCA----TTTTC-CTCCC
Seq2: CGACAAT---CACGACAGAGGAAG----CTTAGAACAGATATTTAG---GCCTCTCA----TTTTCTCTCCC
Seq3: CGACAAT--GCACGACAGAGGAAGTTTTC--AGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTCTCCC

alignment scripting perl programming indel • 6.1k views
13
Entering edit mode
12.1 years ago

The algorithm you described was used in TBA / MULTIZ for multiple genomic alignments. To download, go to Miller Lab website.

The underlying principle for multiple sequence alignments is that the gap insertion is determined by the order that you align the sequences. So in many cases, doing seq1-seq2-seq3 order is different from seq2-seq1-seq3, this is known as "once a gap, always a gap".

The good news is that TBA does what you need, the bad news is you'll have to use their MAF format. That means you'll need to do the format conversion. Please stay with me. Following are the files you need, and need to be named exactly like this.

First, you'll need ref1, seq1, seq2, seq3 are the raw sequences in FASTA format. For example, ref1 looks like this:

>ref1
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC


Next, you'll have three pairwise comparisons in MAF format. For example, ref1.seq1.sing.maf looks like this:

##maf version=1
a
s ref1 0 60 + 60 CGACAAT--GCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
s seq1 0 56 + 56 CGACAATAAGCACGACAGAGGAAGCAGAACAGATA-----ATTGCCTCTCATTTTC-CTCCC


Do this similarly for the other two comparisons, and call ref1.seq2.sing.maf, ref1.seq3.sing.maf.

With all the sequences and pairwise MAFs in place, do a single command:

tba "(((ref1 seq1) seq2) seq3)" *.*.maf tba.maf


This will construct your multiple alignment in the order seq1-seq2-seq3 (by following the guide tree), but I think you already know what to do for a different order.

Finally, you'll have what you need in tba.maf. Good luck!

a score=20563.0
s ref1 0 60 + 60 CGACAAT--GCACGACAGAGGAAGC----AGAACAGATATTTAGATTGCCTCTC----ATTTTCTCTCCC
s seq1 0 56 + 56 CGACAATAAGCACGACAGAGGAAGC----AGAACAGATA-----ATTGCCTCTC----ATTTTC-CTCCC
s seq2 0 58 + 58 CGACAAT---CACGACAGAGGAAGCTT--AGAACAGATATTTAG---GCCTCTC----ATTTTCTCTCCC
s seq3 0 68 + 68 CGACAAT--GCACGACAGAGGAAGTTTTCAGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTCTCCC

1
Entering edit mode

Is there any tool available to convert fasta or clustal to MAF format?

0
Entering edit mode

thanks @Haibao Tang, Lemme give it a try. But was wondering how strict it preserves "once a gap, always a gap". If so then this what I wanted.

0
Entering edit mode

I am not aware of the tools, but I would write a script to do the conversion, based on the specs of the MAF format.

0
Entering edit mode

Thanks @Haibao Tang I tried it. Even though it says "once Gap always a Gap" when I tried to align a chromosome from multiple strains, I see slight shift in the indels in the MSA, if I change the input order.

9
Entering edit mode
12.1 years ago
Lee Katz ★ 3.1k

MUSCLE has profile option hidden in its user manual. Also you might want to try out MAUVE in case it has some profile option too.

Alternatively you might use an assembly finishing tool like Hawkeye or Consed to align things manually.

0
Entering edit mode

thanks, I will look into those to see whether it fits my requirement, else need a script to do

0
Entering edit mode

I tried them, of no use for my specific problem.