Question: Approach to sequence alignment for long strings
3.4 years ago, dustin.farkas.yo wrote:


I am trying to align two text strings, each 500,000 characters long, using sequence alignment tools. I tried a global alignment with the pairwise2 module from Biopython (with one_alignment_only=True), but it takes too much time and memory. I have read that Needleman-Wunsch is O(n^2) in both running time and memory. This confuses me, as I believe biologists must align DNA sequences much longer than this. If someone could describe how biologists get around this issue, or suggest common packages (especially ones written in Python) for aligning such long sequences efficiently, I would be grateful. Thank you.

sequence alignment • 1.0k views
3.4 years ago, shwethacm (Seattle, WA) wrote:

Hi Dustin,

Smith-Waterman and Needleman-Wunsch are both memory-intensive, which is why heuristic algorithms such as BLAST are preferred for aligning long stretches of nucleotides. BWA is another commonly used aligner for mapping sequences against a reference genome. Smith-Waterman will give you the optimal alignment, but the heuristic tools are faster and, in most cases, give similar results.
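
To illustrate where the memory cost comes from: the textbook dynamic program fills an (n+1) x (m+1) table, but each row depends only on the row before it. The sketch below computes only the global alignment score while keeping two rows in memory. The function name `nw_score` and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration, not anything from Biopython or from the tools named above.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score using two rows of the DP table.

    Scoring values are illustrative assumptions. Memory is O(min(len(a), len(b)));
    running time is still O(len(a) * len(b)).
    """
    # The scoring is symmetric, so put the shorter string along the row.
    if len(b) > len(a):
        a, b = b, a

    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]

    for i, ca in enumerate(a, start=1):
        curr = [i * gap] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr[j] = max(diagonal, up, left)
        prev = curr

    return prev[-1]
```

This removes the memory problem only. The number of cell updates is still n x m, which for two strings of 500,000 characters is about 2.5 x 10^11 and far too slow in pure Python. It also returns only the score; recovering the alignment itself in linear memory requires a divide-and-conquer scheme such as Hirschberg's algorithm.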


Thank you very much! My main difficulty with BLAST is that it only seems to accept a restricted alphabet. I am especially interested in tools that allow me to align text containing any ASCII characters.

Reply written 3.4 years ago by dustin.farkas.yo

If you're interested in computing distances between strings that contain any ASCII characters, you may be better off asking this on StackExchange as a generic programming/computer science question. Few, if any, bioinformatics alignment tools will cope with all possible characters. Most are restricted to [A,C,T,G,N,-] for DNA, and maybe ~25 amino acid characters if they support some of the more unusual ones.
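
As one general-purpose option for arbitrary characters, Python's standard library includes `difflib.SequenceMatcher`, which works on any strings. The sketch below turns its opcodes into a gapped pair of strings. The helper name `gapped_alignment` and the `gap` character are assumptions for illustration. Note that difflib uses a longest-matching-block heuristic rather than an optimal scored alignment, and its worst-case running time is quadratic, so it may still be slow on 500,000-character inputs with little in common.

```python
from difflib import SequenceMatcher


def gapped_alignment(a: str, b: str, gap: str = "-") -> tuple[str, str]:
    """Return a and b padded with `gap` so that matching blocks line up.

    `gap` is an illustrative choice; pick a character that does not occur
    in the input text if the output has to be unambiguous.
    """
    # autojunk=False turns off the heuristic that ignores frequent
    # characters in sequences of 200 items or more.
    matcher = SequenceMatcher(None, a, b, autojunk=False)

    top, bottom = [], []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        seg_a, seg_b = a[i1:i2], b[j1:j2]
        width = max(len(seg_a), len(seg_b))
        # Pad the shorter segment so both rows stay the same length.
        top.append(seg_a.ljust(width, gap))
        bottom.append(seg_b.ljust(width, gap))
    return "".join(top), "".join(bottom)


if __name__ == "__main__":
    row_a, row_b = gapped_alignment("abxcd", "abcd")
    print(row_a)
    print(row_b)
```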

Reply written 3.4 years ago by Joe
3.4 years ago, Joe (United Kingdom) wrote:

For alignment of long sequences, some commonly used tools are MUMmer (pairwise only, I believe) and LASTZ.

Tools that use a suffix-tree approach apparently scale quite well to large sequences.

Give those a try.
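
The general idea behind such tools can be sketched as follows: find short exact matches that are unique in both sequences, keep the largest set of them that appears in the same order in both, and then only the short gaps between those anchors need an exact alignment. The code below is a simplified stand-in that uses a hash of fixed-length k-mers instead of a suffix tree. It is not how MUMmer or LASTZ are implemented, and the function names and the default `k` are assumptions for illustration. It works on any characters.

```python
from bisect import bisect_left
from collections import defaultdict


def unique_kmer_positions(s: str, k: int) -> dict[str, int]:
    """Map each k-mer that occurs exactly once in s to its start position."""
    seen = defaultdict(list)
    for i in range(len(s) - k + 1):
        seen[s[i:i + k]].append(i)
    return {kmer: pos[0] for kmer, pos in seen.items() if len(pos) == 1}


def colinear_anchors(a: str, b: str, k: int = 12) -> list[tuple[int, int]]:
    """Return (i, j) anchors with a[i:i+k] == b[j:j+k], increasing in both i and j."""
    unique_a = unique_kmer_positions(a, k)
    unique_b = unique_kmer_positions(b, k)
    # Anchors sorted by position in a.
    pairs = sorted((unique_a[m], unique_b[m]) for m in unique_a.keys() & unique_b.keys())

    # Longest strictly increasing subsequence on the b positions.
    tail_j: list[int] = []    # smallest final j for a chain of each length
    tail_idx: list[int] = []  # index into pairs of that final anchor
    back: list[int] = []      # predecessor of each anchor in its chain
    for idx, (_, j) in enumerate(pairs):
        r = bisect_left(tail_j, j)
        back.append(tail_idx[r - 1] if r else -1)
        if r == len(tail_j):
            tail_j.append(j)
            tail_idx.append(idx)
        else:
            tail_j[r] = j
            tail_idx[r] = idx

    # Walk the predecessor links back from the end of the longest chain.
    chain = []
    idx = tail_idx[-1] if tail_idx else -1
    while idx != -1:
        chain.append(pairs[idx])
        idx = back[idx]
    return chain[::-1]
```

The segments between consecutive anchors can then be aligned independently with an exact method, which keeps each dynamic-programming table small as long as the two strings share enough unique content.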


Thank you very much! I will explore those now.

Reply written 3.4 years ago by dustin.farkas.yo