Question

How to align lists of words using Biopython pairwise2

1

7.0 years ago

euacar ▴ 10

When I run the script below, the output gets split into single characters. Any idea why? It looks like the second argument is being split into single characters. I am trying to align sequences of words, and I will have too many distinct words to map each one to a single letter.

Thx Ercan

from Bio.Seq import Seq
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

fruits = ["orange", "pear", "apple", "pear", "orange"]
fruits1 = ["pear", "apple"]

alignments = pairwise2.align.localms(fruits, fruits1, 2, -1, -0.5, -0.1, gap_char=["-"])

for a in alignments:
    print(format_alignment(*a))

Output:

 ['orange', 'r', 'a', 'e', 'p', 'e', 'l', 'p', 'p', 'a', 'pear', 'orange']
 |||||||||
['-', 'r', 'a', 'e', 'p', 'e', 'l', 'p', 'p', 'a', '-', '-']
  Score=4


1

Please reformat your code using the 101010 button or by putting 4 spaces before each line of code. This is particularly important for Python, where indentation is part of the syntax, so we need it preserved to check that the code is written correctly.

7.0 years ago by Joe 21k

0

Thx For the feedback

7.0 years ago by euacar ▴ 10

0

As jrj.healey said, code formatting is very important. I have now added code markup to your post for readability. You can do this yourself by selecting the text and clicking the 101010 button, which is in the toolbar when you compose or edit a post.

7.0 years ago by WouterDeCoster 47k

0

Thx For the feedback

7.0 years ago by euacar ▴ 10

1

6.7 years ago

Markus ▴ 320

You can do this with Biopython's pairwise2 (again). Releases 1.68 and 1.69 had a bug in pairwise2 that prevented lists from being handled properly as input. If you run the same code with Biopython 1.70, you get the following (expected) result:

['orange', 'pear', 'apple', 'pear', 'orange']
 ||
['-', 'pear', 'apple', '-', '-']
  Score=4


0

Thanks for posting this info. Good to know that.

6.7 years ago by Andrzej Zielezinski 11k

0

4.2 years ago

yannis1962 • 0

Comparing two lists of words didn't work for me either, so what I did was convert each word into a Chinese character (there are more than 20,000 of them in Unicode), align the sequences as character strings, and then convert the result back into Latin-alphabet words. Works like a charm:

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

LISTA=["alors","en","fait","depuis","novembre","2017","du","coup",",","j'","ai","fait",",","plusieurs","ce","que","j'","appelle","des","crises",",","en","fait","c'","est",",","pendant",",","une","semaine",",","entre","une","semaine","et","10","jours",",","je","me","sens","un","peu","comme",",","déconnectée","de","la","réalité","un","peu",",","j'","arrive","pas","à",",","j'","arrivais","plus","à","faire","la","différence","entre",",","entre","si","j'","étais","dans","un","ou","si","j'","étais","dans","la","réalité","."]
LISTB=["alors","en","fait","depuis","euh","novembre","deux","mille","dix-sept","du","coup","j'","ai","fait","euh","hum","hum","plusieurs","euh","enfin","ce","que","j'","appelle","des","crises","en","fait","c'","est","euh","pendant","euh","une","semaine","entre","une","semaine","et","dix","jours","euh","je","me","sens","un","peu","comme","euh","déconnectée","de","la","réalité","un","peu","j'","arrive","pas","à","j'","arrivais","plus","à","faire","la","différence","entre","entre","si","j'","étais","dans","un","rêve","ou","si","j'","étais","dans","la","réalité"]

# Map each distinct word to one CJK character, starting at "一" (U+4E00).
BASE = ord("一")
LATtoHAN = {}  # word -> character
HANtoLAT = {}  # character -> word

def encode(words):
    # Return the word list as a string with one character per word.
    chars = []
    for x in words:
        if x not in LATtoHAN:
            han = chr(BASE + len(LATtoHAN))
            LATtoHAN[x] = han
            HANtoLAT[han] = x
        chars.append(LATtoHAN[x])
    return "".join(chars)

LISTA__ = encode(LISTA)
LISTB__ = encode(LISTB)

alignments=pairwise2.align.globalxx(LISTA__,LISTB__)

# Translate the first alignment back into words, keeping "-" as the gap marker.
RESA = [w if w == "-" else HANtoLAT[w] for w in alignments[0][0]]
RESB = [w if w == "-" else HANtoLAT[w] for w in alignments[0][1]]

print(RESA, RESB)
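
If no custom scoring is needed, the character mapping can probably be skipped altogether: the standard library's difflib.SequenceMatcher accepts lists of hashable items, so it can compare word lists directly. This is a different technique from pairwise2 (it finds longest matching blocks rather than optimizing an alignment score), and the helper name `align_with_difflib` is just something made up for this sketch:

```python
from difflib import SequenceMatcher


def align_with_difflib(a, b, gap="-"):
    """Pad two word lists with gaps so that matching words share a column."""
    out_a, out_b = [], []
    # autojunk=False turns off difflib's heuristic that ignores very frequent items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left, right = a[i1:i2], b[j1:j2]
        # For 'equal' blocks both slices have the same length, so no padding is added.
        # For 'replace', 'delete' and 'insert' the shorter side is padded with gaps.
        width = max(len(left), len(right))
        out_a.extend(left + [gap] * (width - len(left)))
        out_b.extend(right + [gap] * (width - len(right)))
    return out_a, out_b


fruits = ["orange", "pear", "apple", "pear", "orange"]
fruits1 = ["pear", "apple"]
res_a, res_b = align_with_difflib(fruits, fruits1)
print(res_a)  # ['orange', 'pear', 'apple', 'pear', 'orange']
print(res_b)  # ['-', 'pear', 'apple', '-', '-']
```

Note that in a 'replace' block, differing words (say "2017" against "deux") end up in the same column, much like a mismatch in a regular alignment.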


score 6 · Accepted Answer · 2017-04-10

Biopython performs alignments on the symbols of a sequence (characters, not words). I can't think of any program that aligns words directly. I suggest searching the web for Python modules or functions that align the characters of strings and then modifying that code to handle words.

For example, I tweaked some code from GitHub to do local (water) and global (needle) alignments of words.

Filename: alignment.py:

Usage:

import alignment
fruits1 = ["orange", "pear", "apple", "pear", "orange"]
fruits2 = ["pear", "apple"]
aln = alignment.needle(fruits1, fruits2)
identity = aln[0]
score = aln[1]
print(identity, score)
print(aln[2])
print(aln[3])

Output:

(0.4, 25)
['-', 'pear', 'apple', '-', '-']
['orange', 'pear', 'apple', 'pear', 'orange']
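
The alignment.py file itself is not reproduced in this post. To give a rough idea of what a word-level global alignment involves, here is a minimal Needleman-Wunsch sketch that treats each list element as one symbol. It is not the alignment.py used above: the function name `needle_words`, the linear gap cost, and the scoring values (match 2, mismatch -1, gap -1) are all assumptions made for illustration, so its score will not match the 25 shown in the output above, and it does not compute an identity value.

```python
def needle_words(a, b, match=2, mismatch=-1, gap=-1, gap_symbol="-"):
    """Globally align two lists of words (Needleman-Wunsch, linear gap cost).

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)

    def pair_score(i, j):
        # Score for placing a[i-1] and b[j-1] in the same column.
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align the two words
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(gap_symbol)
            i -= 1
        else:
            out_a.append(gap_symbol)
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], out_a[::-1], out_b[::-1]


fruits1 = ["orange", "pear", "apple", "pear", "orange"]
fruits2 = ["pear", "apple"]
score, aligned1, aligned2 = needle_words(fruits1, fruits2)
print(score)     # 1  (two matches at +2, three gaps at -1)
print(aligned1)  # ['orange', 'pear', 'apple', 'pear', 'orange']
print(aligned2)  # ['-', 'pear', 'apple', '-', '-']
```

Because words are compared with `==`, any hashable or comparable items work as symbols, and the number of distinct words is not limited the way a one-letter-per-word mapping would be.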