Approach to sequence alignment for long strings
6.8 years ago

Hello,

I am trying to use sequence alignment tools to align two text strings, each 500,000 characters long. I tried the pairwise2 module from Biopython to perform a global alignment (with one_alignment_only=True), but the process takes too much time and memory. I have read that Needleman-Wunsch is O(n^2) in both processing time and memory. This confuses me, as I believe biologists must align DNA sequences much longer than this. If someone could describe how biologists get around this issue and/or suggest any common packages (especially ones written in Python) that align such long sequences efficiently, I would be grateful. Thank you.

sequence alignment • 1.8k views
6.8 years ago
shwethacm ▴ 240

Hi Dustin,

Smith-Waterman and Needleman-Wunsch are both memory-intensive, which is why heuristic algorithms such as BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi) are preferred for aligning long stretches of nucleotides. BWA is another commonly used aligner for mapping sequences against a reference genome. Smith-Waterman and Needleman-Wunsch give you the optimal alignment (local and global, respectively), but the heuristic tools are much faster and in most cases give similar results.
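
To make the memory point concrete: the full dynamic-programming table has one cell per pair of positions, but if only the score is needed, two rows are enough. The following is a minimal sketch, not taken from any of the tools named here; the function name `nw_score` and the match/mismatch/gap values are assumptions chosen for illustration. It works on arbitrary characters, but the running time is still quadratic, so it does not by itself make two 500,000-character strings practical in pure Python.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score using O(len(b)) memory.

    Scoring values are illustrative assumptions, with a linear gap penalty.
    """
    # Row 0: aligning the empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # column 0: prefix of `a` against empty `b`
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in `b`
            left = curr[j - 1] + gap  # gap in `a`
            curr.append(max(diag, up, left))
        prev = curr  # keep only the most recent row
    return prev[-1]


if __name__ == "__main__":
    print(nw_score("GATTACA", "GATTACA"))
```

Recovering the alignment itself in linear memory requires a divide-and-conquer scheme (Hirschberg's algorithm), which this sketch does not implement.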

Thank you very much! My main difficulty with BLAST is it only seems to accept a restricted alphabet. I am especially interested in tools that allow me to align text with any ascii characters.

If you're interested in computing distances between strings that contain any ASCII characters, you may be better off asking this on StackExchange as a generic programming/computer science question. There will be few, if any, bioinformatics alignment tools that cope with all possible characters; most are restricted to [A,C,T,G,N,-] for DNA and perhaps ~25 amino acid characters if they support some of the more unusual ones.
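
As an illustration of the generic route: Python's standard library includes `difflib.SequenceMatcher`, which compares sequences of any hashable items, so arbitrary ASCII text is fine. This is only a sketch of one option that nobody in this thread has proposed; the helper name `diff_ops` is hypothetical. `SequenceMatcher` is a heuristic matcher rather than an optimal aligner, and it can be slow on very long inputs.

```python
from difflib import SequenceMatcher


def diff_ops(a, b):
    """Return the non-matching regions between two strings.

    Each item is (tag, piece_of_a, piece_of_b), where tag is one of
    'replace', 'delete' or 'insert' as reported by SequenceMatcher.
    """
    # autojunk=False disables the heuristic that ignores very frequent
    # characters in long sequences, which would otherwise distort results.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    print(diff_ops("hello, world!", "hello, World!"))
```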

6.8 years ago
Joe 21k

For alignment of long sequences, two commonly used tools are MUMmer (pairwise only, I believe) and LASTZ.

Any tools that use a suffix tree approach apparently scale quite well for large sequences.

Give those a try.
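
The general idea behind such tools is to find exact matches shared by both sequences first and use them as anchors, so that expensive alignment is only needed in the gaps between anchors. The sketch below is a simplification under stated assumptions: it uses a plain hash-based k-mer index rather than a suffix tree, the function name `unique_kmer_anchors` and the default `k` are hypothetical, and it is not how MUMmer or LASTZ are actually implemented. It accepts any characters, not just nucleotides.

```python
from collections import Counter


def unique_kmer_anchors(a, b, k=12):
    """Return sorted (pos_in_a, pos_in_b) pairs for every k-mer that
    occurs exactly once in `a` and exactly once in `b`."""

    def unique_positions(s):
        kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
        counts = Counter(kmers)
        # Keep only k-mers seen once, mapped to their start position.
        return {km: i for i, km in enumerate(kmers) if counts[km] == 1}

    pos_a = unique_positions(a)
    pos_b = unique_positions(b)
    shared = pos_a.keys() & pos_b.keys()
    return sorted((pos_a[km], pos_b[km]) for km in shared)


if __name__ == "__main__":
    anchors = unique_kmer_anchors("the quick brown fox", "a quick brown dog", k=5)
    print(anchors)
```

Anchors that lie on the same diagonal (constant difference between the two positions) belong to one uninterrupted matching stretch; a full aligner would chain them and align only the regions in between.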

Thank you very much! I will explore those now.
