How to align lists of words using Biopython pairwise2
7.0 years ago
euacar ▴ 10

When I run the script below, the output is split into single characters. It looks like the second argument is being split into characters. Any idea why? I am trying to align sequences of words. I will have many distinct words, so I cannot simply map each word to a single letter.

Thanks, Ercan

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

fruits = ["orange", "pear", "apple", "pear", "orange"]
fruits1 = ["pear", "apple"]

alignments = pairwise2.align.localms(fruits, fruits1, 2, -1, -0.5, -0.1, gap_char=["-"])

for a in alignments:
    print(format_alignment(*a))

Output:

 ['orange', 'r', 'a', 'e', 'p', 'e', 'l', 'p', 'p', 'a', 'pear', 'orange']
 |||||||||
['-', 'r', 'a', 'e', 'p', 'e', 'l', 'p', 'p', 'a', '-', '-']
  Score=4
alignment sequence • 7.5k views

Please reformat your code using the 101010 button or by putting 4 spaces before each line of code. This is particularly important for Python, where correct indentation is needed before we can tell whether the code is written correctly.


Thanks for the feedback.


As jrj.healey said, code formatting is very important. I have now added code markup to your post for readability. You can do this yourself by selecting the text and clicking the 101010 button, which is in the toolbar whenever you compose or edit a post.



7.0 years ago

Biopython aligns sequences of symbols (single characters), not words. I can't think of any program that aligns words directly. I suggest searching the web for Python modules or functions that align the characters of two strings, then modifying that code to handle words.

For example, I tweaked some code from GitHub to do local (water) and global (needle) alignments of words.

Filename: alignment.py:

Usage:

import alignment
fruits1 = ["orange", "pear", "apple", "pear", "orange"]
fruits2 = ["pear", "apple"]
aln = alignment.needle(fruits1, fruits2)
identity = aln[0]
score = aln[1]
print(identity, score)
print(aln[2])
print(aln[3])

Output:

(0.4, 25)
['-', 'pear', 'apple', '-', '-']
['orange', 'pear', 'apple', 'pear', 'orange']
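
The alignment.py code itself is not shown in this post. As a rough illustration of the idea only, here is a minimal sketch of a global (Needleman-Wunsch) alignment that compares whole list items instead of characters. The function name `needle_words` and the scoring values (match 2, mismatch -1, linear gap -1) are assumptions made for this sketch, not the scheme used by alignment.py, so the score differs from the 25 reported above.

```python
def needle_words(seq_a, seq_b, match=2, mismatch=-1, gap=-1, gap_token="-"):
    """Global alignment of two lists of tokens (hypothetical helper).

    Scoring values are illustrative assumptions, with a linear gap penalty.
    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(seq_a), len(seq_b)

    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Tokens are compared as whole items, never character by character.
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two tokens
                score[i - 1][j] + gap,      # gap in seq_b
                score[i][j - 1] + gap,      # gap in seq_a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(gap_token)
            i -= 1
        else:
            aligned_a.append(gap_token)
            aligned_b.append(seq_b[j - 1])
            j -= 1

    return score[n][m], aligned_a[::-1], aligned_b[::-1]


fruits1 = ["orange", "pear", "apple", "pear", "orange"]
fruits2 = ["pear", "apple"]
score, aligned1, aligned2 = needle_words(fruits1, fruits2)
print(score)
print(aligned1)
print(aligned2)
```

With these assumed scores, the two matches (2 each) and three gaps (-1 each) give a total of 1, and the second list comes back as ['-', 'pear', 'apple', '-', '-'], the same gap placement as in the output above.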

Thank you for the code.

6.7 years ago
Markus ▴ 320

You can do this with Biopython's pairwise2 after all. A bug in pairwise2 in Biopython releases 1.68 and 1.69 prevented the proper handling of lists as input. If you run the same code in Biopython 1.70, you get the following (expected) result:

['orange', 'pear', 'apple', 'pear', 'orange']
 ||
['-', 'pear', 'apple', '-', '-']
  Score=4

Thanks for posting this info. Good to know that.

4.2 years ago
yannis1962 • 0

Comparing two lists of words didn't work for me either, so I converted each word into a Chinese character (there are more than 20,000 of them in Unicode), aligned the sequences as character strings, and then converted the result back to Latin-alphabet words. Works like a charm:

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

LISTA=["alors","en","fait","depuis","novembre","2017","du","coup",",","j'","ai","fait",",","plusieurs","ce","que","j'","appelle","des","crises",",","en","fait","c'","est",",","pendant",",","une","semaine",",","entre","une","semaine","et","10","jours",",","je","me","sens","un","peu","comme",",","déconnectée","de","la","réalité","un","peu",",","j'","arrive","pas","à",",","j'","arrivais","plus","à","faire","la","différence","entre",",","entre","si","j'","étais","dans","un","ou","si","j'","étais","dans","la","réalité","."]
LISTB=["alors","en","fait","depuis","euh","novembre","deux","mille","dix-sept","du","coup","j'","ai","fait","euh","hum","hum","plusieurs","euh","enfin","ce","que","j'","appelle","des","crises","en","fait","c'","est","euh","pendant","euh","une","semaine","entre","une","semaine","et","dix","jours","euh","je","me","sens","un","peu","comme","euh","déconnectée","de","la","réalité","un","peu","j'","arrive","pas","à","j'","arrivais","plus","à","faire","la","différence","entre","entre","si","j'","étais","dans","un","rêve","ou","si","j'","étais","dans","la","réalité"]

# Assign each distinct word one unique CJK character, starting at "一" (U+4E00).
charcode=ord(u"一")-1
LATtoHAN={}   # word -> character
HANtoLAT={}   # character -> word, used to decode the alignment
LISTA_=[]
LISTB_=[]
for words,encoded in ((LISTA,LISTA_),(LISTB,LISTB_)):
    for x in words:
        if x not in LATtoHAN:
            # First time this word is seen: give it the next free character.
            charcode+=1
            LATtoHAN[x]=chr(charcode)
            HANtoLAT[chr(charcode)]=x
        encoded.append(LATtoHAN[x])

LISTA__="".join(LISTA_)
LISTB__="".join(LISTB_)

alignments=pairwise2.align.globalxx(LISTA__,LISTB__)

RESA=[]
RESB=[]
for w in alignments[0][0]:
    if (w == "-"):
        RESA.append("-")
    else:
        RESA.append(HANtoLAT[w])
for w in alignments[0][1]:
    if (w == "-"):
        RESB.append("-")
    else:
        RESB.append(HANtoLAT[w])

print(RESA,RESB)
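
The encode and decode steps above can also be written as two small reusable helpers. This is only a sketch of the same mapping idea: the names `encode_tokens` and `decode_aligned` are made up for illustration, the starting code point U+4E00 matches the "一" used above, and the alignment step itself (for example `pairwise2.align.globalxx`) is left out so the block runs without Biopython.

```python
def encode_tokens(*token_lists, first_codepoint=0x4E00):
    """Map every distinct token to one unique character (hypothetical helper).

    Returns (encoded_strings, char_to_token). The same token always gets the
    same character across all input lists, so the encoded strings can be
    aligned character by character.
    """
    token_to_char = {}
    encoded = []
    for tokens in token_lists:
        chars = []
        for tok in tokens:
            if tok not in token_to_char:
                # Next unused code point, counting up from first_codepoint.
                token_to_char[tok] = chr(first_codepoint + len(token_to_char))
            chars.append(token_to_char[tok])
        encoded.append("".join(chars))
    char_to_token = {c: t for t, c in token_to_char.items()}
    return encoded, char_to_token


def decode_aligned(aligned, char_to_token, gap_char="-"):
    """Turn an aligned character string back into a list of tokens."""
    return [gap_char if c == gap_char else char_to_token[c] for c in aligned]


list_a = ["orange", "pear", "apple"]
list_b = ["pear", "apple"]
(enc_a, enc_b), table = encode_tokens(list_a, list_b)

# enc_a and enc_b are the strings one would pass to the aligner.
# Here a gapped string is built by hand just to show the decoding step.
print(decode_aligned(enc_a, table))
print(decode_aligned("-" + enc_b, table))
```

One caveat with this approach: the gap character must not collide with an encoded token. Because every word is replaced by a CJK character before alignment, a literal "-" in the aligned string can only be a gap, even when "-" also appears as a word in the input.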