Question

Global Pairwise Alignment For Long Sequence Throws Error In Python!!

1

11.3 years ago

abhishekniroula7 ▴ 50

Hello there,

I am performing pairwise global alignment with the EMBOSS Needleman-Wunsch program (needle) from a Python script. The script runs well on shorter sequences, but it throws an error when I align a pair of very long proteins (the longest protein, Titin). I am trying to align the Ensembl protein ENSP00000343764 with the SwissProt protein Q8WZ42. The two sequences differ in length, so I am interested in seeing the alignment. The code I used is:

from Bio.Emboss.Applications import NeedleCommandline
from Bio import AlignIO

needle_cline = NeedleCommandline(
    asequence="Q8WZ42.fa",
    bsequence="ENSP00000343764.fa",
    outfile="ENSP00000343764.needle",
    gapopen=10,
    gapextend=0.5,
)
stdout, stderr = needle_cline()

This generates an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/Bio/Application/__init__.py", line 437, in __call__
    stdout_str, stderr_str)
Bio.Application.ApplicationError: Command 'needle -outfile=ENSP00000343764.needle -asequence=Q8WZ42.fa -bsequence=ENSP00000343764.fa -gapopen=10 -gapextend=0.5' returned non-zero exit status 1, 'Needleman-Wunsch global alignment of two sequences'

If I use only a small fragment (say 5,000 amino acids) of either sequence, the script works and generates an alignment file. I am not sure whether the error is caused by the length of the proteins. Can anyone explain the likely reason for this error and how to fix it? I could align fragments of the sequences instead, but that's not a good option when my script has to run on a large number of proteins. Do you have any idea how I can do it?

Thanks in advance!

python • 5.4k views

ADD COMMENT • link updated 7.9 years ago by Markus ▴ 320 • written 11.3 years ago by abhishekniroula7 ▴ 50

score 4 · Answer 1 · 2013-01-23

4

11.3 years ago

brentp 24k

When you align an N-length sequence against an M-length sequence, the program is probably allocating at least two N*M arrays, which can take a lot of memory.
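As a rough back-of-the-envelope check, the memory for those tables can be estimated as below. This is only a sketch: the function name is made up for illustration, and the defaults of two matrices and 4 bytes per cell are assumptions, not what needle actually allocates.

```python
def dp_matrix_bytes(n, m, n_matrices=2, bytes_per_cell=4):
    """Estimate memory for full N*M dynamic-programming tables.

    n_matrices and bytes_per_cell are assumed values for illustration;
    a real aligner may use more tables or wider cells.
    """
    return n * m * n_matrices * bytes_per_cell


# Two sequences of roughly 35,000 residues each.
full = dp_matrix_bytes(35000, 35000)
# One sequence cut down to a 5,000-residue fragment.
fragment = dp_matrix_bytes(5000, 35000)

print("full pair: %.1f GiB" % (full / 2.0**30))
print("fragment:  %.1f GiB" % (fragment / 2.0**30))
```

Under these assumptions the full pair needs several times more memory than the fragment case, which would fit with the alignment succeeding only on shorter input.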

Try running that needle command from the command line and watch the memory usage. (Or just watch the usage while the Python script runs.)
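One way to watch the usage from Python is to run the command as a child process and read its peak resident memory afterwards. The sketch below uses only the standard library and assumes a Unix-like system (the resource module is not available on Windows); the helper name is hypothetical.

```python
import resource
import subprocess
import sys


def run_and_report_peak_memory(cmd):
    """Run cmd and return (exit_code, peak_rss).

    peak_rss is ru_maxrss over all child processes waited for so far,
    so it is only meaningful if this is the largest child run by the
    current process. Units are kilobytes on Linux and bytes on macOS.
    """
    completed = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, peak


if __name__ == "__main__":
    # Harmless stand-in command; substitute the needle command from the question.
    code, peak = run_and_report_peak_memory([sys.executable, "-c", "pass"])
    print("exit code:", code, "peak RSS:", peak)
```

If the exit code is non-zero and the peak is close to the machine's physical memory, memory is a likely culprit; the captured stderr (not returned here) may say more.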

If memory is the problem, you may try using http://pypi.python.org/pypi/nwalign/ as it makes some attempt to use as little memory as possible.

ADD COMMENT • link 11.3 years ago by brentp 24k

1

Thanks, @brentp. This module seems to work faster, but it did not solve my problem. Both strings are approximately 35,000 characters long, so I got a MemoryError. I should probably split one sequence into smaller fragments and align each of them against the other sequence.

ADD REPLY • link 11.3 years ago by abhishekniroula7 ▴ 50

1

You can either split them or move to a machine with more memory. Are you sure you want to do global sequence alignment on 35kb regions?

ADD REPLY • link 11.3 years ago by brentp 24k

0

Well, I am doing that for a large number of sequences, and I want to automate the process.

ADD REPLY • link 11.3 years ago by abhishekniroula7 ▴ 50

score 0 · Answer 2 · 2016-05-23

0

7.9 years ago

Markus ▴ 320

You might consider using EMBOSS Stretcher, which uses a modified Needleman-Wunsch algorithm that works in linear space (instead of quadratic). Biopython also provides a command-line wrapper for Stretcher in Bio.Emboss.Applications.
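A minimal sketch of calling Stretcher on the two files from the question is shown here. It invokes the binary directly through subprocess rather than the Biopython wrapper, to keep the example free of third-party imports. It assumes the EMBOSS `stretcher` executable is on the PATH; the helper names and the output file name are hypothetical, and the gap penalties are placeholder values (as far as I know Stretcher expects integer penalties, so the 0.5 extension penalty used with needle may not carry over; check `stretcher -help`).

```python
import shutil
import subprocess


def build_stretcher_command(a_fasta, b_fasta, outfile, gapopen=12, gapextend=2):
    """Build the argument list for an EMBOSS stretcher run.

    The gap penalties are placeholder values, not tuned settings.
    """
    return [
        "stretcher",
        "-asequence", a_fasta,
        "-bsequence", b_fasta,
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
        "-outfile", outfile,
    ]


def run_stretcher(a_fasta, b_fasta, outfile, **penalties):
    """Run stretcher, raising if the binary is missing or exits non-zero."""
    if shutil.which("stretcher") is None:
        raise FileNotFoundError("EMBOSS 'stretcher' was not found on PATH")
    cmd = build_stretcher_command(a_fasta, b_fasta, outfile, **penalties)
    subprocess.run(cmd, check=True)
    return outfile


if __name__ == "__main__":
    # Only print the command here; call run_stretcher() to execute it.
    print(" ".join(build_stretcher_command(
        "Q8WZ42.fa", "ENSP00000343764.fa", "ENSP00000343764.stretcher")))
```

Because the command is built separately from the call that runs it, the same function can be reused in a loop over many protein pairs.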

ADD COMMENT • link 7.9 years ago by Markus ▴ 320