Question

Question about the sped of the global pairwise alignment of Bio.pairwise.align.globalms and Align.PairwiseAligner()

1

Entering edit mode

3.1 years ago

master_zhen ▴ 10

Hi!

I have two long DNA sequence, and I want to do the global pairwise alignment with needleman-wunsch algorithm. But sometimes Bio.pairwise.align.globalms could run a long time.

In this situation, I choose set one_alignment_only=1. And it could speed up to some extent. Thus, I see some posts said Align.PairwiseAligner() may be faster than that of the previous one. Is that true? If not, can anyone give some function like Bio under python? Thanks

Also, some related questions :

(1) If the program run a long time for one sample, how could I know whether this sample's data is complete. I mean not the the error of the bam file waste the time.

(2) Can Align.PairwiseAligner() quickly get the query sequence? Like alignments[0].seqB in the Bio.pairwise.align.globalms.

(3) Can Align.PairwiseAligner() just recover one result? Or to say set "one_alignment_only=1" in Align.PairwiseAligner()

Thanks!

Python pairwise global aignment • 1.4k views

ADD COMMENT • link 3.1 years ago by master_zhen ▴ 10

score 2 · Accepted Answer · 2021-03-16

2

Entering edit mode

3.1 years ago

master_zhen ▴ 10

The answer is yes. Align.PairwiseAligner() is much much faster than that of Bio.pairwise.align.globalms.

ADD COMMENT • link 3.1 years ago by master_zhen ▴ 10

score 2 · Accepted Answer · 2021-03-20

Also, here I want share my code to get the string of aligned result by Align.PairwiseAligner(). In the document, we could only print the result, and it is hard to reuse the result.

The code is that:

from Bio import Align

Seq1='agcatttggttggtgcacaagtaactgcagtttttgccattactttctacgacaaaacccgcaaatacttttgcaccaacctaata'
Seq2='cttgttggtgcaccgtagccacagcttacgacagccccgcaaatacttttgcactaccc'

aligner = Align.PairwiseAligner()
aligner.mode = 'global'
aligner.match = 5
aligner.mismatch = -4
aligner.open_gap_score = -10
aligner.extend_gap_score = -0.5
aligner.target_end_gap_score = 0.0
aligner.query_end_gap_score = 0.0
alignments = aligner.align(Seq1, Seq2)
tuple_con = alignments[0].aligned[0]
tuple_query = alignments[0].aligned[1]
tupe_num = len(alignments[0].aligned[0])



if tupe_num>0:
    collect_Seq1 = []
    collect_Seq2 = []
    for  xun1 in range(tupe_num):
        collect_Seq1.append(Seq1[tuple_con[xun1][0]:tuple_con[xun1][1]])
        collect_Seq2.append(Seq2[tuple_query[xun1][0]:tuple_query[xun1][1]])
    Over_all_con = collect_Seq1[0]
    Over_all_query = collect_Seq2[0]
    if tupe_num > 1:
        for xun2 in range(tupe_num-1):
            if (tuple_query[xun2+1][0]-tuple_query[xun2][1])*(tuple_con[xun2+1][0]-tuple_con[xun2][1])<1:
                Over_all_con = Over_all_con + '-'*(tuple_query[xun2+1][0]-tuple_query[xun2][1]) + Seq1[tuple_con[xun2][1]:tuple_con[xun2+1][0]] + collect_Seq1[xun2+1]
                Over_all_query = Over_all_query + '-'*(tuple_con[xun2+1][0]-tuple_con[xun2][1]) + Seq2[tuple_query[xun2][1]:tuple_query[xun2+1][0]] + collect_Seq2[xun2+1]
            else:
                Over_all_con = Over_all_con + '-'*(tuple_query[xun2+1][0]-tuple_query[xun2][1]) + Seq1[tuple_con[xun2][1]:tuple_con[xun2+1][0]] + collect_Seq1[xun2+1]
                Over_all_query = Over_all_query + Seq2[tuple_query[xun2][1]:tuple_query[xun2+1][0]]+'-'*(tuple_con[xun2+1][0]-tuple_con[xun2][1])  + collect_Seq2[xun2+1]
    if tuple_con[tupe_num-1][1]<len(Seq1):
        Over_all_query = Over_all_query + '-'*(len(Seq1)-tuple_con[tupe_num-1][1])
        Over_all_con = Over_all_con +  Seq1[tuple_con[tupe_num-1][1]:len(Seq1)]
    if tuple_query[tupe_num-1][1]<len(Seq2):
        Over_all_con = Over_all_con + '-'*(len(Seq2)-tuple_query[tupe_num-1][1])
        Over_all_query = Over_all_query +  Seq2[tuple_query[tupe_num-1][1]:len(Seq2)]
    if tuple_con[0][0]>0:
        Over_all_query = '-'*tuple_con[0][0]+Over_all_query
        Over_all_con = Seq1[0:tuple_con[0][0]]+ Over_all_con
    if tuple_query[0][0]>0:  
        Over_all_con = '-'*tuple_query[0][0]+Over_all_con
        Over_all_query = Seq2[0:tuple_query[0][0]] + Over_all_query
else:
    Over_all_con = Seq1+'-'*len(Seq2)
    Over_all_query = '-'*len(Seq1) + Seq2

So here Over_all_query is the aligned result of Seq2 and Over_all_con is the aligned result of Seq1.

However, if your data is not large, I would recommend you to choose the following code:

from Bio import pairwise2 as pw2


alignments2 = pw2.align.globalms(Seq1, Seq2, 5, -4, -10, -0.5,penalize_end_gaps=False)
alignments2[0].seqA
alignments2[0].seqB