Question: Programming Challange: Pairwise Alignments To Multiple Alignment
7
gravatar for Rm
10.0 years ago by
Rm8.0k
Danville, PA
Rm8.0k wrote:

I have a set of 10-12 very closely related chromosome sequences (from different strains) aligned to a "single" reference chromosome. Now I need to generate multiple sequence alignment of these without afftecting individual alignments to the reference. All that I need is to add relative inserts at respective sequence positions, so that I get a global alignment with respect to reference.

"Note. Here I am NOT looking for sequence similarities".

I hope some scripts are already available to do this? I dont want to reinvent the wheel. OR any suggestions to script (preferably in perl) to address this problem.

Adding more information:

What I need is to do is to "add dashes" at relative "insert" positions in other sequences, so that I get a global alignment with respect to reference.

For example with 10 sequnces: If from position 30 to 55 in seq1 has an insert. but not in other 9 sequences. In the final expanded alignment I will insert 26 dashes (-) (from 30-55) in the sequences 2 to 10 and in the reference.

And in another situation like above if seq1 has insert at 30-55 and if seq3 has insert from 40-45 and seq6 has insert from 42-49. Then I need to insert dashes like above, except in these positions (in seq3: 40-45 and in seq6: 112-119)

Sample Input: Original reference: CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC

Pairwise alignments:

Ref1: CGACAAT--GCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
Seq1: CGACAATAAGCACGACAGAGGAAGCAGAACAGATA-----ATTGCCTCTCATTTTC-CTCCC

Ref1: CGACAATGCACGACAGAGGAAGC--AGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
Seq2: CGACAAT-CACGACAGAGGAAGCTTAGAACAGATATTTAG---GCCTCTCATTTTCTCTCCC

Ref1: CGACAATGCACGACAGAGGAAG----CAGAACAGATATTTAGATTGCCTCTCA----TTTTCTCTCCC
Seq3: CGACAATGCACGACAGAGGAAGTTTTCAGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTCTCCC

sample Output: Final Multiple sequence alignment:

Ref1: CGACAAT--GCACGACAGAGGAAG----C--AGAACAGATATTTAGATTGCCTCTCA----TTTTCTCTCCC
Seq1: CGACAATAAGCACGACAGAGGAAG----C--AGAACAGATA-----ATTGCCTCTCA----TTTTC-CTCCC
Seq2: CGACAAT---CACGACAGAGGAAG----CTTAGAACAGATATTTAG---GCCTCTCA----TTTTCTCTCCC
Seq3: CGACAAT--GCACGACAGAGGAAGTTTTC--AGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTCTCCC
ADD COMMENTlink modified 10.0 years ago by Haibao Tang3.0k • written 10.0 years ago by Rm8.0k
13
gravatar for Haibao Tang
10.0 years ago by
Haibao Tang3.0k
Mountain View, CA
Haibao Tang3.0k wrote:

The algorithm you described was used in TBA / MULTIZ for multiple genomic alignments. To download, go to Miller Lab website.

The underlying principle for multiple sequence alignments is that the gap insertion is determined by the order that you align the sequences. So in many cases, doing seq1-seq2-seq3 order is different from seq2-seq1-seq3, this is known as "once a gap, always a gap".

The good news is that TBA does what you need, the bad news is you'll have to use their MAF format. That means you'll need to do the format conversion. Please stay with me. Following are the files you need, and need to be named exactly like this.

First, you'll need ref1, seq1, seq2, seq3 are the raw sequences in FASTA format. For example, ref1 looks like this:

>ref1
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC

Next, you'll have three pairwise comparisons in MAF format. For example, ref1.seq1.sing.maf looks like this:

##maf version=1
a
s ref1 0 60 + 60 CGACAAT--GCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
s seq1 0 56 + 56 CGACAATAAGCACGACAGAGGAAGCAGAACAGATA-----ATTGCCTCTCATTTTC-CTCCC

Do this similarly for the other two comparisons, and call ref1.seq2.sing.maf, ref1.seq3.sing.maf.

With all the sequences and pairwise MAFs in place, do a single command:

tba "(((ref1 seq1) seq2) seq3)" *.*.maf tba.maf

This will construct your multiple alignment in the order seq1-seq2-seq3 (by following the guide tree), but I think you already know what to do for a different order.

Finally, you'll have what you need in tba.maf. Good luck!

a score=20563.0
s ref1 0 60 + 60 CGACAAT--GCACGACAGAGGAAGC----AGAACAGATATTTAGATTGCCTCTC----ATTTTCTCTCCC
s seq1 0 56 + 56 CGACAATAAGCACGACAGAGGAAGC----AGAACAGATA-----ATTGCCTCTC----ATTTTC-CTCCC
s seq2 0 58 + 58 CGACAAT---CACGACAGAGGAAGCTT--AGAACAGATATTTAG---GCCTCTC----ATTTTCTCTCCC
s seq3 0 68 + 68 CGACAAT--GCACGACAGAGGAAGTTTTCAGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTCTCCC
ADD COMMENTlink modified 14 months ago by RamRS30k • written 10.0 years ago by Haibao Tang3.0k
1

Is there any tool available to convert fasta or clustal to MAF format?

ADD REPLYlink written 10.0 years ago by Rm8.0k

thanks @Haibao Tang, Lemme give it a try. But was wondering how strict it preserves "once a gap, always a gap". If so then this what I wanted.

ADD REPLYlink written 10.0 years ago by Rm8.0k

I am not aware of the tools, but I would write a script to do the conversion, based on the specs of the MAF format.

ADD REPLYlink written 10.0 years ago by Haibao Tang3.0k

Thanks @Haibao Tang I tried it. Even though it says "once Gap always a Gap" when I tried to align a chromosome from multiple strains, I see slight shift in the indels in the MSA, if I change the input order.

ADD REPLYlink written 9.9 years ago by Rm8.0k
9
gravatar for Lee Katz
10.0 years ago by
Lee Katz3.0k
Atlanta, GA
Lee Katz3.0k wrote:

MUSCLE has profile option hidden in its user manual. Also you might want to try out MAUVE in case it has some profile option too.

Alternatively you might use an assembly finishing tool like Hawkeye or Consed to align things manually.

ADD COMMENTlink written 10.0 years ago by Lee Katz3.0k

thanks, I will look into those to see whether it fits my requirement, else need a script to do

ADD REPLYlink written 10.0 years ago by Rm8.0k

I tried them, of no use for my specific problem.

ADD REPLYlink written 10.0 years ago by Rm8.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 924 users visited in the last hour