Question: Different alignment results between Emboss Needle and Biopython pairwise2
1
gravatar for markd
11 months ago by
markd20
markd20 wrote:

I'm trying to figure out why Emboss Needle and Biopython pairwise2.align.global give different results.

I start with two sequences that are difficult to align (generated randomly)

seq 1: CATTTTTTGAACAATCAGATCAGAAGCACCTTCGGGATAATCCTCACGGTTCTTGGGCATACGCCTCATGGTAAC
seq 2: ATGGAGCAGTGGGTTGTTACAGATCTCATAATCAGGGAACAGGTCTGGATAATGCCCCATCTCCACGATATCCCG

Emboss needle gives the following alignment, with parameters gapopen=10, gapextend=0.5, datafile=EDNAFULL

------------------AATAGTGTACCTGGCC----GGA-TCTTTATTCCTTCCAGGACAACT-----------CA---CATCTGCCCCACTACCCTAAACAAAATTTAA
CATTTTTGGAGTCACGTCAACAG---ACC--GCCCGGGGGATTCGTGACTAATCGCAGGA----TGGAGCGTGGAGCAGGCCAT----------------------------

However, I cannot get pairwise2.align.global to recreate this alignment. I have tried using globalds with the same parameters, and a dictionary version of EDNAFULL. I've also played with the penalize_extend_when_opening and penalize_end_gaps with no success. Here is a sample alignment from the pairwise2 output and both of the flags previously mentioned set to false.

------CA-----TTTTTTGAACA-ATCAGATCAGAAGCACCTTCGGGATAATCCTCACGGTTCTTGGGCATAC-GCCTCAT------GGTAAC---
ATGGAGCAGTGGGTTGTT---ACAGATC---TCATAATCA-----GGGA--------ACAGGTCT---GGATAATGCCCCATCTCCACGATATCCCG

Both of these programs are just meant to implement the Needleman-Wunsch algorithm, so why are they different and which one should I trust?

ADD COMMENTlink modified 15 days ago by Markus220 • written 11 months ago by markd20
3

I'm also getting different results.

Biopython:

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

seq1 = 'GGCATGTACCTGTCGTGTGCGTTGTATCCCCCACGCTTGATAGACGTAGAGCTGCCTTGTAACGTGGGAGTGAAA'
seq2 = 'ACGATCTATGGGGCTAACTTCTCAGTTAGGAGGCTAAAACCTACAAATAACCCCACAACGACACCCACCCTTAGC'

alignments = pairwise2.align.globalms(seq1, seq2, 5, -4, -10, -0.5)
print(format_alignment(*alignments[0]))

Alignment:

-----------GGCATGTACCTGTCGTGTGCGTT-------------------GTATCC-----------CCCACGCTTGATAGACGTAGAGCTGCCTTGTAACGTGGGAGTGAAA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ACGATCTATGGGGC---TAACT----TCTCAGTTAGGAGGCTAAAACCTACAAATAACCCCACAACGACACCCACCCTT-----------AGC-----------------------
  Score=-0.5

Emboss needle:

needle -asequence seq1.fa -bsequence seq2.fa -gapopen 10 -gapextend 0.5 -outfile aln.needle

Alignment:

########################################
# Program: needle
# Rundate: Tue  3 Apr 2018 14:12:49
# Commandline: needle
#    -auto
#    -stdout
#    -asequence emboss_needle-I20180403-141248-0665-8944238-pg.asequence
#    -bsequence emboss_needle-I20180403-141248-0665-8944238-pg.bsequence
#    -datafile EDNAFULL
#    -gapopen 10.0
#    -gapextend 0.5
#    -endopen 10.0
#    -endextend 0.5
#    -aformat3 pair
#    -snucleotide1
#    -snucleotide2
# Align_format: pair
# Report_file: stdout
########################################

#=======================================
#
# Aligned_sequences: 2
# 1: EMBOSS_001
# 2: EMBOSS_001
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 118
# Identity:      22/118 (18.6%)
# Similarity:    22/118 (18.6%)
# Gaps:          86/118 (72.9%)
# Score: 47.5
# 
#
#=======================================

EMBOSS_001         1 GGCATGTACCTGTCGTGTGCGTTGTATCCCCCACGCTTGATAGACG----     46
                                                                |||    
EMBOSS_001         1 -------------------------------------------ACGATCT      7

EMBOSS_001        47 -TAGAGCTGCCTTGTAACGTGGGAG--TGAAA------------------     75
                      |.|.|||..|||.|.|..|.||||  |.|||                  
EMBOSS_001         8 ATGGGGCTAACTTCTCAGTTAGGAGGCTAAAACCTACAAATAACCCCACA     57

EMBOSS_001        76 ------------------     75

EMBOSS_001        58 ACGACACCCACCCTTAGC     75


#---------------------------------------
#---------------------------------------
ADD REPLYlink modified 11 months ago • written 11 months ago by a.zielezinski8.6k

Thanks for looking into this!

Unfortunately, while those parameters do happen to work for those sequences, its trivial to find sequences that don't match given those parameters.

For example

seq_1 = 'GGCATGTACCTGTCGTGTGCGTTGTATCCCCCACGCTTGATAGACGTAGAGCTGCCTTGTAACGTGGGAGTGAAA'
seq_2 = 'ACGATCTATGGGGCTAACTTCTCAGTTAGGAGGCTAAAACCTACAAATAACCCCACAACGACACCCACCCTTAGC'

Produces different alignments again.

Btw, I'm just generating the strings with

seq_1 = ''.join([random.choice('ATCG') for _ in range(75)])
seq_2 = ''.join([random.choice('ATCG') for _ in range(75)])
ADD REPLYlink modified 11 months ago • written 11 months ago by markd20

Wooo, you're right. I'm moving my answer as comment. I think the difference in alignments and scores may come from the fact that needle has two additional parameters (not present in biopython): endopen and endextend.

ADD REPLYlink modified 11 months ago • written 11 months ago by a.zielezinski8.6k

I explored that a little too, but by default the -endweight parameter is set to False, which means endopen/endextend shouldn't make any difference. Also as I said in the post, I tried playing with the penalize_extend_when_opening and penalize_end_gaps parameters in biopython but still couldn't get the alignments the same.

ADD REPLYlink written 11 months ago by markd20

Hello, as I am a green hand in programming, I wonder how to get the result like :

Length: 118

Identity: 22/118 (18.6%)

Similarity: 22/118 (18.6%)

Gaps: 86/118 (72.9%)

Score: 47.5

Is it get from the python on the window system or it is the result from linux system(with shell or something like that?) Thank you !

ADD REPLYlink written 7 weeks ago by 8241632580

Please just ask a new question instead of adding a tangent to an existing one. Please also note, if you do open a new thread we require much more information, and you should show some effort you’ve made to solve the problem.

ADD REPLYlink written 7 weeks ago by jrj.healey11k
2

Tagging Dr. Peter Cock: Peter

ADD REPLYlink written 11 months ago by genomax63k

What version of Biopython are you on?

The pairwise module was re-written in (I think) 1.68, but it may be later. You'll get different results with the rewritten version to the old version. Peter would know for sure.

ADD REPLYlink written 11 months ago by jrj.healey11k

I am on 1.71 master.

ADD REPLYlink written 11 months ago by markd20

Have you also tried the variations on global? e.g. globalxx, globalms, globalmx etc etc.

I've never really delved in to/understood the differences, but I know there are multiple options for local and global

ADD REPLYlink written 11 months ago by jrj.healey11k

Some of those alternatives (I think the ones ending in -x) allow you to define custom scoring functions.

ADD REPLYlink written 11 months ago by cschu1811.5k

I'm using globalds with the EDNAFULL matrix converted into a dictionary for the match_dict parameter. This should be the most accurate way to reproduce Emboss Needle I think.

Below is my full code to translate the EDNAFULL matrix

EDNAFullLetters = ['A', 'T', 'G', 'C', 'S', 'W', 'R', 'Y', 'K', 'M', 'B', 'V', 'H', 'D', 'N', 'U']
EDNAFullScores = [
    [5,-4,-4,-4,-4, 1, 1,-4,-4, 1,-4,-1,-1,-1,-2,-4],
    [-4, 5,-4,-4,-4, 1,-4, 1, 1,-4,-1,-4,-1,-1,-2, 5],
    [-4,-4, 5,-4, 1,-4, 1,-4, 1,-4,-1,-1,-4,-1,-2,-4],
    [-4,-4,-4, 5, 1,-4,-4, 1,-4, 1,-1,-1,-1,-4,-2,-4],
    [-4,-4, 1, 1,-1,-4,-2,-2,-2,-2,-1,-1,-3,-3,-1,-4],
     [1, 1,-4,-4,-4,-1,-2,-2,-2,-2,-3,-3,-1,-1,-1, 1],
     [1,-4, 1,-4,-2,-2,-1,-4,-2,-2,-3,-1,-3,-1,-1,-4],
    [-4, 1,-4, 1,-2,-2,-4,-1,-2,-2,-1,-3,-1,-3,-1, 1],
    [-4, 1, 1,-4,-2,-2,-2,-2,-1,-4,-1,-3,-3,-1,-1, 1],
     [1,-4,-4, 1,-2,-2,-2,-2,-4,-1,-3,-1,-1,-3,-1,-4],
    [-4,-1,-1,-1,-1,-3,-3,-1,-1,-3,-1,-2,-2,-2,-1,-1],
    [-1,-4,-1,-1,-1,-3,-1,-3,-3,-1,-2,-1,-2,-2,-1,-4],
    [-1,-1,-4,-1,-3,-1,-3,-1,-3,-1,-2,-2,-1,-2,-1,-1],
    [-1,-1,-1,-4,-3,-1,-1,-3,-1,-3,-2,-2,-2,-1,-1,-1],
    [-2,-2,-2,-2,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-2],
    [-4, 5,-4,-4,-4, 1,-4, 1, 1,-4,-1,-4,-1,-1,-2, 5]]

EDNAFull = dict()
for i, a in enumerate(EDNAFullLetters):
    for j, b in enumerate(EDNAFullLetters):
        EDNAFull[(a, b)] = EDNAFullScores[i][j]

for aln in pairwise2.align.globalds(seq_1, seq_2, EDNAFull, -10, -0.5):
    print pairwise2.format_alignment(*aln)
ADD REPLYlink modified 11 months ago • written 11 months ago by markd20

I'm using BioPython 1.70.

ADD REPLYlink written 11 months ago by a.zielezinski8.6k
1
gravatar for Markus
15 days ago by
Markus220
Markus220 wrote:

Sorry for the late answer, I didn't see this before. I would recommend to send such things to Biopython's Github site.

The standard in EMBOSS' Needle is not to penalize end gaps (END GAP PENALTY: FALSE). This means: Gaps at the beginning and the end of the alignment are 'for free'. You must do the same in Biopython's pairwise2 with the keyword parameter penalize_end_gaps=False.

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

seq1 = 'GGCATGTACCTGTCGTGTGCGTTGTATCCCCCACGCTTGATAGACGTAGAGCTGCCTTGTAACGTGGGAGTGAAA'
seq2 = 'ACGATCTATGGGGCTAACTTCTCAGTTAGGAGGCTAAAACCTACAAATAACCCCACAACGACACCCACCCTTAGC'

alignments = pairwise2.align.globalms(seq1, seq2, 5, -4, -10, -0.5,
                                      penalize_end_gaps=False)
print(format_alignment(*alignments[0]))

Then you get the same result as in EMBOSS' Needle:

GGCATGTACCTGTCGTGTGCGTTGTATCCCCCACGCTTGATAGACG-----TAGAGCTGCCTTGTAACGTGGGAG--TGAAA------------------------------------
                                           |||     |.|.|||..|||.|.|..|.||||  |.|||                                    
-------------------------------------------ACGATCTATGGGGCTAACTTCTCAGTTAGGAGGCTAAAACCTACAAATAACCCCACAACGACACCCACCCTTAGC
  Score=47.5

You must not change penalize_extend_when_opening, this parameter is important for how a gap opening is penalized: Only gap-open penalty (default, also in EMBOSS), or gap-open penalty + gap-extension penalty (unusual).

ADD COMMENTlink modified 15 days ago • written 15 days ago by Markus220
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1601 users visited in the last hour