Question: Approach to sequence alignment for long strings
11 months ago, dustin.farkas.yo wrote:

Hello,

I am trying to use sequence alignment tools to align two text strings of length 500,000. I tried using the pairwise2 package from Biopython to perform a global alignment (with one_alignment_only = True), but the process is too time-consuming and memory-intensive. I have read that Needleman-Wunsch is O(n^2) (both in terms of processing time and memory required). This is confusing to me, as I believe biologists must align DNA sequences much longer than this. If someone could describe how biologists get around this issue and/or suggest any common packages (especially written in python) that are used to align such long sequences efficiently, I would be grateful. Any guidance would be greatly appreciated. Thank you.
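
To put the quadratic cost in numbers, here is a back-of-the-envelope sketch of what a full dynamic-programming table would need for two strings of length 500,000. The helper name `dp_table_cost` and the figure of 4 bytes per cell are assumptions for illustration only; they are not what pairwise2 actually stores per cell.

```python
def dp_table_cost(n, m, bytes_per_cell=4):
    """Cells and bytes needed for a full (n+1) x (m+1) dynamic-programming table."""
    cells = (n + 1) * (m + 1)
    return cells, cells * bytes_per_cell


cells, size = dp_table_cost(500_000, 500_000)
# bytes_per_cell=4 is an assumed value; real implementations may use more.
print(f"{cells:.2e} cells, about {size / 1e12:.1f} TB")
```

Under that assumption the table alone is on the order of a terabyte, which is consistent with the process running out of memory.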

10 months ago, shwethacm (Seattle, WA) wrote:

Hi Dustin,

Smith-Waterman and Needleman-Wunsch are both memory-intensive, which is why heuristic algorithms such as BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi) are preferred for aligning long stretches of nucleotides. BWA is another commonly used aligner for mapping sequences against a reference genome. Smith-Waterman will give you the optimal local alignment (and Needleman-Wunsch the optimal global one), but the heuristic tools are much faster and, in most cases, give similar results.


Thank you very much! My main difficulty with BLAST is that it only seems to accept a restricted alphabet. I am especially interested in tools that allow me to align text containing any ASCII characters.

Reply written 10 months ago by dustin.farkas.yo

If you're interested in computing distances between strings that contain arbitrary ASCII characters, you may be better off asking this on StackExchange as a generic programming/computer science question. Few, if any, bioinformatics alignment tools will cope with all possible characters. Most are restricted to [A,C,T,G,N,-] for DNA, and perhaps ~25 amino acid characters if they support some of the more unusual ones.
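
For arbitrary characters, one generic option is to compute the Needleman-Wunsch score yourself while keeping only two rows of the table, which cuts memory from quadratic to linear in the shorter string. The sketch below is a minimal illustration, not a tuned implementation: the function name `nw_score` and the scoring values (match 1, mismatch -1, gap -1) are assumptions chosen for the example. It returns only the score; recovering the alignment itself in linear memory needs Hirschberg's divide-and-conquer algorithm. Running time is still quadratic, so pure Python would likely be far too slow for two strings of 500,000 characters.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b using two DP rows instead of a full table.

    The scoring values are illustrative defaults, not a recommended scheme.
    """
    if len(b) > len(a):
        a, b = b, a  # scoring is symmetric, so keep the shorter string along the row
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: b prefix aligned to gaps
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # column 0: a prefix aligned to gaps
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Any characters work, since cells are compared with plain equality.
print(nw_score("a b!", "a b?"))
```

Because characters are compared with `==`, the same function works on any alphabet, including full ASCII or Unicode text.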

Reply written 10 months ago by jrj.healey
10 months ago, jrj.healey (United Kingdom) wrote:

For alignment of long sequences, some commonly used tools are MUMmer (pairwise only, I believe) and LASTZ.

Any tools that use a suffix tree approach apparently scale quite well for large sequences.

Give those a try.
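
The general idea behind such tools can be sketched as: find exact matches shared by both sequences, keep a consistent (collinear) chain of them as anchors, and only run expensive dynamic programming on the short gaps between anchors. The code below is a simplified illustration using a dictionary of fixed-length k-mers rather than a suffix tree, so it is not how MUMmer or LASTZ is implemented. The function names and the default `k=12` are assumptions for the example.

```python
from bisect import bisect_left
from collections import Counter


def unique_kmer_anchors(a, b, k=12):
    """Return sorted (pos_in_a, pos_in_b) for k-mers occurring exactly once in each string."""
    def unique_positions(s):
        counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
        return {s[i:i + k]: i for i in range(len(s) - k + 1)
                if counts[s[i:i + k]] == 1}

    pos_a = unique_positions(a)
    pos_b = unique_positions(b)
    return sorted((i, pos_b[kmer]) for kmer, i in pos_a.items() if kmer in pos_b)


def collinear_chain(anchors):
    """Longest subset of anchors increasing in both coordinates.

    Expects anchors sorted by position in a; runs a longest-increasing-
    subsequence search over the positions in b.
    """
    tails = []      # tails[n]: smallest pos_b that ends a chain of length n + 1
    tail_idx = []   # index into anchors for each entry of tails
    back = [-1] * len(anchors)
    for idx, (_, pb) in enumerate(anchors):
        n = bisect_left(tails, pb)
        if n == len(tails):
            tails.append(pb)
            tail_idx.append(idx)
        else:
            tails[n] = pb
            tail_idx[n] = idx
        back[idx] = tail_idx[n - 1] if n > 0 else -1
    chain = []
    idx = tail_idx[-1] if tail_idx else -1
    while idx != -1:  # walk back pointers from the end of the longest chain
        chain.append(anchors[idx])
        idx = back[idx]
    return chain[::-1]


a = "hello world, this is a test"
b = "XXhello world, YY this is a test"
print(collinear_chain(unique_kmer_anchors(a, b, k=5)))
```

The segments between consecutive anchors are typically much shorter than the full inputs, so aligning each of them separately is far cheaper than one full-table alignment.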


Thank you very much! I will explore those now.

Reply written 10 months ago by dustin.farkas.yo