Hi everybody!
I'm working on my own pairwise sequence alignment program in Python. I use the pairwise2.align functions from Biopython. With small sequences it works. My code is below (2 for a match, -2 for a mismatch, -3 for opening a gap and -1 for extending a gap):
    from Bio import pairwise2

    target_seq = "ATGCNTGA"
    query_seq = "ATTGGCCATTN"
    alignments = pairwise2.align.globalms(target_seq, query_seq, 2, -2, -3, -1)
However, when I used two huge sequences (HHV8 consensus sequences from Illumina sequencing), I got this error:
segmentation fault (core dumped)
I used the same code.
The sizes of the sequences are:
    cat ../Results/1G_S15/1G_S15.fasta | grep -v ">" | wc -c
    140280
    cat ../Results/8G_S12/1G_S12.fasta | grep -v ">" | wc -c
    140272
Do you think the huge sequence size could be the cause of this error? If so, do you have a trick to avoid it?
I don't think the pairwise aligner, as implemented in Biopython, could possibly align 140K-long sequences. From your example we can't tell whether a single sequence is 140K or all of them together, but since you are talking about "huge" sequences I assume the former.
It was not designed for sequences of that size. You would need to use a different tool in my opinion.
If you have multiple sequences, then you need to show the code you use, because from your example one cannot tell how you are using it.
Yep, a segfault here most likely means too much data. As Istvan said, this is not really what pairwise2 is for.
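To put a rough number on "too much data": a classic dynamic-programming global alignment fills a matrix with one cell per pair of positions, so memory grows with the product of the two lengths. The sketch below is only a back-of-the-envelope estimate. The function names are hypothetical, the 8 bytes per cell is an assumption rather than a measurement of pairwise2's internals, and the lengths are the wc -c counts from the question (which include newline characters, so they slightly overestimate).

```python
def dp_matrix_cells(len_a: int, len_b: int) -> int:
    """Cells in a full (len_a + 1) x (len_b + 1) alignment matrix."""
    return (len_a + 1) * (len_b + 1)


def estimated_gib(cells: int, bytes_per_cell: int = 8) -> float:
    """Rough memory estimate in GiB; bytes_per_cell is an assumed value."""
    return cells * bytes_per_cell / 2**30


# Lengths taken from the wc -c output in the question (upper bounds,
# because wc -c also counts the newline characters).
cells = dp_matrix_cells(140280, 140272)
print(f"{cells:,} cells, roughly {estimated_gib(cells):.0f} GiB at 8 bytes per cell")
```

Whatever the exact per-cell cost, the matrix has on the order of 2 x 10^10 cells, which is far beyond what a typical machine can hold in memory.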
Moreover, alignment of very long sequences is still a tricky task. It's made a bit easier when it is just a pairwise alignment, and for that I'd suggest
If you need to do multiple alignment, you'll struggle, but LASTZ is at least capable of it in my experience.
Is it a multiline FASTA? Maybe useful
Indeed, my files are multiline FASTA. I tried reading them with the readlines method and then removing the "\n" characters, but I got the same error. I think Istvan Albert is right: my sequences are too large to be used with pairwise2.align.
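For anyone landing here with the same multiline FASTA question, here is a minimal sketch of joining the sequence lines into one string in plain Python. The function name read_fasta_sequence and the demo file are hypothetical, and it assumes a single record per file. It only cleans the input; it does not make pairwise2 feasible for sequences of this size.

```python
import os
import tempfile


def read_fasta_sequence(path):
    """Return the sequence of a single-record FASTA file as one string."""
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and the ">" header line.
            if not line or line.startswith(">"):
                continue
            parts.append(line)
    return "".join(parts)


# Small demo on a temporary two-line FASTA file (made-up content).
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">example\nATGC\nNTGA\n")
    demo_path = tmp.name

demo_seq = read_fasta_sequence(demo_path)
os.remove(demo_path)
print(len(demo_seq), demo_seq)
```

Using len() on the joined string gives the true sequence length, whereas wc -c also counts the newline at the end of every line.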