Question: How to construct multiple sequence alignment from group of pairwise alignments
0
gravatar for kanshenglong
3.8 years ago by
Beijing
kanshenglong0 wrote:

I have a bunch of 10 or more very closely related DNA sequences (from different strains) aligned to a reference chromosome. How can I generate multiple sequence alignment along with "reference" of all these sequences without affecting individual alignments to the reference. Don't cut the blank sequence.

for example:

Input:

Original reference:
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC

Pairwise alignments:

Ref1: CGACAAT--GCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
Seq1: CGACAATAAGCACGACAGAGGAAGCAGAACAGATA-----ATTGCCTCTCATTTTC-CTCCC

Ref1: CGACAATGCACGACAGAGGAAGC--AGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
Seq2: CGACAAT-CACGACAGAGGAAGCTTAGAACAGATATTTAG---GCCTCTCATTTTCTCTCCC

Ref1: CGACAATGCACGACAGAGGAAG----CAGAACAGATATTTAGATTGCCTCTCA----TTTTCTCTCCC
Seq3: CGACAATGCACGACAGAGGAAGTTTTCAGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTCTCCC

Output: Final Multiple sequence alignment:

Ref1: CGACAAT--GCACGACAGAGGAAG----C--AGAACAGATATTTAGATTGCCTCTCA----TTTTCTCTCCC
Seq1: CGACAATAAGCACGACAGAGGAAG----C--AGAACAGATA-----ATTGCCTCTCA----TTTTC-CTCCC
Seq2: CGACAAT---CACGACAGAGGAAG----CTTAGAACAGATATTTAG---GCCTCTCA----TTTTCTCTCCC
Seq3: CGACAAT--GCACGACAGAGGAAGTTTTC--AGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTCTCCC
alignment • 1.2k views
ADD COMMENTlink modified 3.8 years ago by genomax91k • written 3.8 years ago by kanshenglong0

Someone has asked this question, But I can't run this script. http://www.perlmonks.org/bare/?node_id=866127

Thank you !!

ADD REPLYlink written 3.8 years ago by kanshenglong0

I'm confused. you want to take the aligned query sequence from several pairwise alignments, and compare them all against one reference?

ADD REPLYlink written 3.8 years ago by Joe18k

jrj.healey:

 Thanks for your reply!
 I'm sorry I did not make it clear. As my sequences are much longer than the reference sequence, I can't align these sequences by Multiple Sequence Alignment. I want this result.

enter image description here result

ADD REPLYlink written 3.8 years ago by kanshenglong0

You may not get an optimal answer by default run of an MSA program in this type of a case. You may have to do some manual editing of the alignments to achieve the exact result you want (editing within reason).

ADD REPLYlink modified 3.7 years ago • written 3.8 years ago by genomax91k

HAving to do some editing (within reason) to a multiple sequence alignment is pretty normal. I wouldn't put that in the bucket of MSA not giving you optimal results. I spent my entire PhD doing highly divergent MSA and phylogenetic analysis, this is definitely a case that MSA is entirely appropriate for.

ADD REPLYlink written 3.7 years ago by DG7.1k

I meant to say that the answer from a default run of a MSA program may not be usable as is. I have amended the post above.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by genomax91k
0
gravatar for Joe
3.8 years ago by
Joe18k
United Kingdom
Joe18k wrote:

I'm not sure if this answers your question, I'm still not totally clear on the objective but here goes.

Given a reference sequence ("REF") and 2 dummy subject sequences (SEQ1 & SEQ2), perform pairwise alignments like you have, and get the output in fasta format, thus:

>REF
ATATACAGATATCC
>SEQ1
ATAT-CTGATA-CC

and

>REF
ATATACAGATATCC
>SEQ2
--ATACAG-TATCC

Then extract all the aligned fasta sequences exluding the reference:

$ python ignore1st.py alignment.fasta > alignment1.fasta #repeat for however many alignments you have, giving each output filename a different name/number.

# ignore1st.py

from Bio import SeqIO
import sys

handle = list(SeqIO.parse(sys.argv[1], "fasta"))


for each in handle[1:]:
    print(each.format("fasta"))

Yeilds:

$ python ignore1stfasta.py aln1.fasta
>SEQ1
ATAT-CTGATA-CC

$ python ignore1stfasta.py aln2.fasta
>SEQ2
--ATACAG-TATCC

Then I would simply concatenate all your aligned fasta sequences after the reference (I'm going to assume for now you either have your reference sequence as a separate file already, or are capable of extracting it from the alignment files):

$ cat aligned_reference.fasta alignment*.fasta > resultfile.fasta

>REF
ATATACAGATATCC

>SEQ1
ATAT-CTGATA-CC

>SEQ2
--ATACAG-TATCC

This doesn't account in anyway for the quality of your PSA though.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Joe18k
0
gravatar for DG
3.8 years ago by
DG7.1k
DG7.1k wrote:

I'm not sure why you think you can't do a standard multiple sequence alignment just because your sequences of interest are longer than the reference? Since they are closely related and longer, you gain information by doing a standard multiple sequence alignment, as the "new" sequences will also be aligned against each other. Unless these sequences are absurdly long I don't see why it isn't amenable to a standard analysis.

ADD COMMENTlink written 3.8 years ago by DG7.1k

This was my thought too hence my confusion. Global alignment should handle it I would think?

ADD REPLYlink written 3.8 years ago by Joe18k

If the posted example in OP is all there is then yes. But since actual sequences (being compared to the reference) are longer (as depicted in the post above) then global alignments may not be appropriate.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by genomax91k

MSA != global alignment, although as I posted above to your other comment, MSA, including doing global alignments, is entirely appropriate. What is a "reference" sequence that you are comparing to is actually largely irrelevant in this case for the purposes of constructing a MSA. In fact you actually gain information from using your other sequences in the alignment process.

ADD REPLYlink written 3.7 years ago by DG7.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2093 users visited in the last hour