Question: Global Pairwise Alignment For Long Sequence Throws Error In Python!!
6.2 years ago by
abhishekniroula750 wrote:

Hello there,

I am performing pairwise global alignment with the EMBOSS implementation of the Needleman-Wunsch algorithm (needle), called from a Python script. The script runs well with shorter sequences, but it throws an error when I align a pair of very long proteins (the longest protein, Titin). I am trying to align Ensembl protein ENSP00000343764 with SwissProt protein Q8WZ42. The two sequences differ in length, so I am interested in seeing the alignment. The code I used is:

from Bio.Emboss.Applications import NeedleCommandline
from Bio import AlignIO

This generates an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/Bio/Application/", line 437, in __call__
    stdout_str, stderr_str)
Bio.Application.ApplicationError: Command 'needle -outfile=ENSP00000343764.needle -asequence=Q8WZ42.fa -bsequence=ENSP00000343764.fa -gapopen=10 -gapextend=0.5' returned non-zero exit status 1, 'Needleman-Wunsch global alignment of two sequences'

If I use only a small fragment (say, 5000 amino acids) of either sequence, the script works and generates an alignment file. I am not sure whether the error is caused by the length of the proteins. Can anyone explain the likely reason for this error and how to fix it? I could align fragments of the sequences instead, but that is not a good approach when my script runs on a large number of proteins. Do you have any idea how I can do this?

Thanks in advance!

python • 3.4k views
6.2 years ago by
Salt Lake City, UT
brentp22k wrote:

When you align a sequence of length N against a sequence of length M, the program is probably allocating at least two N*M arrays, which can take a lot of memory.
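As a rough back-of-the-envelope check, the sketch below estimates the size of full dynamic-programming matrices for two sequences of about 35,000 residues each (roughly the length of Titin). The 4 bytes per cell and the count of two matrices are assumptions for illustration only; the actual data layout inside needle may differ.

```python
def dp_matrix_bytes(n, m, bytes_per_cell=4, n_matrices=2):
    """Estimate memory for full (n+1) x (m+1) dynamic-programming matrices.

    bytes_per_cell and n_matrices are illustrative assumptions, not
    values taken from the EMBOSS source code.
    """
    return (n + 1) * (m + 1) * bytes_per_cell * n_matrices


n = m = 35_000  # approximate length of each Titin sequence
total = dp_matrix_bytes(n, m)
print(f"{total / 1024**3:.1f} GiB")  # about 9.1 GiB under these assumptions
```

Under the same assumptions, a 5000-residue fragment against the full sequence needs about 1.3 GiB, which may be why the fragment runs succeed while the full alignment does not.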

Try running that needle command from the command line and watch the memory usage. (Or just watch the usage from the Python script.)
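One way to watch the usage from Python is sketched below. It runs a command as a child process and reads the peak resident memory from the standard-library resource module, which is available on Unix only. The helper name run_and_report_peak_memory is hypothetical, and the command in the example is a stand-in; substitute the needle command line from the traceback.

```python
import resource
import subprocess
import sys


def run_and_report_peak_memory(cmd):
    """Run cmd and return (exit_code, peak_rss) for child processes.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    It is the largest peak among all children waited for so far, so
    call this from a fresh interpreter for a clean reading.
    """
    completed = subprocess.run(cmd)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, peak


if __name__ == "__main__":
    # Stand-in command; replace it with the needle argument list.
    code, peak = run_and_report_peak_memory([sys.executable, "-c", "pass"])
    print("exit code:", code, "peak RSS:", peak)
```

A non-zero exit code together with a peak close to the machine's physical memory would point to memory as the cause.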

If memory is the problem, you may try using as it makes some attempt to use as little memory as possible.

written 6.2 years ago by brentp22k

Thanks @brentp. This module seems to work faster, but it did not solve my problem. Both sequences are approximately 35,000 residues long, so I got a MemoryError. I should probably split one sequence into smaller fragments and align each fragment against the other sequence.

written 6.2 years ago by abhishekniroula750

You can either split them or move to a machine with more memory. Are you sure you want to do global sequence alignment on sequences of ~35,000 residues?

written 6.2 years ago by brentp22k

Well, I am doing this for a large number of sequences, and I want to automate the process.

written 6.2 years ago by abhishekniroula750
2.8 years ago by
Markus230 wrote:

You might consider using EMBOSS Stretcher, which uses a modified Needleman-Wunsch algorithm that works in linear space (instead of quadratic). Biopython also provides a command-line interface for Stretcher under Bio.Emboss.Applications.
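A minimal sketch of calling stretcher directly through subprocess, without going through the Biopython wrapper, might look like the following. The helper names and the output file name are hypothetical, the input file names are taken from the command in the question, and the gap penalties are placeholder values. As far as I know stretcher expects integer penalties, so needle's 10/0.5 do not carry over directly; check `stretcher -help` on your installation.

```python
import shutil
import subprocess


def build_stretcher_command(a_fasta, b_fasta, outfile, gapopen=12, gapextend=2):
    """Build the argument list for EMBOSS stretcher.

    The gap penalties are placeholder values (assumption), not tuned
    for any particular data set.
    """
    return [
        "stretcher",
        "-asequence", a_fasta,
        "-bsequence", b_fasta,
        "-outfile", outfile,
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
    ]


def run_stretcher(a_fasta, b_fasta, outfile, **penalties):
    """Run stretcher and return the output file name.

    Raises RuntimeError if stretcher is not on PATH, and
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    if shutil.which("stretcher") is None:
        raise RuntimeError("EMBOSS stretcher not found on PATH")
    cmd = build_stretcher_command(a_fasta, b_fasta, outfile, **penalties)
    subprocess.run(cmd, check=True)
    return outfile


if __name__ == "__main__":
    # Only print the command here; call run_stretcher(...) to execute it.
    print(" ".join(build_stretcher_command(
        "Q8WZ42.fa", "ENSP00000343764.fa", "ENSP00000343764.stretcher")))
```

For a large batch of protein pairs, run_stretcher can be called in a loop, catching subprocess.CalledProcessError so that one failed pair does not stop the whole run.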

written 2.8 years ago by Markus230
