Question

Approach to sequence alignment for long strings

6.8 years ago

dustin.farkas.yo • 0

Hello,

I am trying to align two text strings, each 500,000 characters long. I tried a global alignment with the pairwise2 module from Biopython (with one_alignment_only=True), but the process takes too long and uses too much memory. I have read that Needleman-Wunsch is O(n^2) in both running time and memory. This confuses me, as I believe biologists must align DNA sequences much longer than this. If someone could describe how biologists get around this issue and/or suggest common packages (especially ones written in Python) that align such long sequences efficiently, I would be grateful. Thank you.

sequence alignment • 1.8k views

ADD COMMENT • link updated 6.8 years ago by Joe 21k • written 6.8 years ago by dustin.farkas.yo • 0

score 1 · Answer 1 · 2017-06-30

6.8 years ago

shwethacm ▴ 240

Hi Dustin,

Smith-Waterman and Needleman-Wunsch are both memory-intensive, which is why heuristic tools such as BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi) are preferred for aligning long stretches of nucleotides. BWA is another commonly used aligner for aligning sequences against a reference genome. Smith-Waterman and Needleman-Wunsch give you the optimal local and global alignment, respectively, but the heuristic tools are much faster and in most cases give similar results.
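To illustrate where the memory goes, here is a minimal sketch (not Biopython's implementation) of the Needleman-Wunsch score computed with only two rows of the dynamic-programming table. The function name `nw_score` and the scoring values (match=1, mismatch=-1, gap=-1) are assumptions chosen for illustration. Memory drops from O(n*m) to O(min(n, m)), but the running time is still O(n*m), so pure Python like this would remain far too slow for two 500,000-character strings.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score, keeping two DP rows instead of the full matrix.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    if len(b) > len(a):
        a, b = b, a  # scoring is symmetric, so keep the shorter string along the row
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # column 0: prefix of a against the empty string
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr  # the older row is no longer needed
    return prev[-1]


print(nw_score("GATTACA", "GATTACA"))
```

This only returns the score. Recovering the alignment itself in linear space needs a divide-and-conquer scheme such as Hirschberg's algorithm, which is not shown here.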

Thank you very much! My main difficulty with BLAST is that it seems to accept only a restricted alphabet. I am especially interested in tools that allow me to align text containing any ASCII characters.

Reply • 6.8 years ago by dustin.farkas.yo • 0

If you're interested in computing distances between strings that contain any ASCII characters, you may be better off asking this on StackExchange as a generic programming/computer science question. There will be few, if any, bioinformatics alignment tools that cope with all possible characters. Most are restricted to [A,C,T,G,N,-] for DNA, and maybe ~25 amino acid characters if they support some of the more unusual ones.
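As a starting point outside bioinformatics, Python's standard library has `difflib.SequenceMatcher`, which compares sequences of any hashable items, so arbitrary ASCII text works. A rough sketch follows; the helper name `matching_blocks` is made up for illustration. Note that this finds matching blocks rather than a scored alignment with gap penalties, and its worst case is still quadratic, so it may also struggle on very long inputs.

```python
from difflib import SequenceMatcher


def matching_blocks(a, b):
    """Return (start_in_a, start_in_b, length) for each block the two strings share."""
    # autojunk=False disables the heuristic that ignores very frequent
    # characters in long inputs, which would otherwise skew text comparisons.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # get_matching_blocks() always ends with a zero-length sentinel; drop it.
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size]


print(matching_blocks("a#b!c", "a#b?c"))
print(SequenceMatcher(None, "abcd", "bcde", autojunk=False).ratio())
```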

Reply • 6.8 years ago by Joe 21k

score 1 · Answer 2 · 2017-06-30

1

6.8 years ago

Joe 21k

For alignment of long sequences, some commonly used tools are MUMmer (pairwise only, I believe) and LASTZ.

Any tools that use a suffix tree approach apparently scale quite well for large sequences.

Give those a try.
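To give a feel for why these tools scale, here is a toy sketch of the general idea of finding exact shared substrings first and only aligning around them. It is not how MUMmer or LASTZ are implemented: a plain hash-based k-mer index stands in for the suffix tree, and the function name `shared_kmer_seeds` and the default `k=12` are assumptions for illustration. It works on any characters, not just DNA.

```python
from collections import defaultdict


def shared_kmer_seeds(a, b, k=12):
    """Return (pos_in_a, pos_in_b) pairs where a and b share an exact substring of length k."""
    # Index every k-mer of a by its start position(s).
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    # Look up every k-mer of b in the index; each hit is a candidate anchor.
    seeds = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            seeds.append((i, j))
    return seeds


print(shared_kmer_seeds("abcdefgh", "xxcdefyy", k=4))
```

Indexing and lookup are roughly linear in the input lengths (plus the number of hits), so the expensive dynamic programming only has to be run on the short regions between consecutive anchors. Highly repetitive text can produce a very large number of seeds, which real tools filter or chain.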

Thank you very much! I will explore those now.

Reply • 6.8 years ago by dustin.farkas.yo • 0