Question

Segmentation fault Biopython pairwise alignment

0

11 months ago

antoine.fauchois92 ▴ 20

Hi everybody !

I'm working on my own pairwise sequence alignment program in Python, using the pairwise2.align functions from Biopython. With small sequences it works. The code is below (2 for a match, -2 for a mismatch, -3 to open a gap and -1 to extend a gap):

from Bio import pairwise2
target_seq = "ATGCNTGA"
query_seq = "ATTGGCCATTN"
alignments = pairwise2.align.globalms(target_seq, query_seq, 2, -2, -3, -1)

However, when I used two huge sequences (HHV8 consensus sequences from Illumina sequencing), I got this error:

segmentation fault (core dumped)

I used the same code.

The sizes of the sequences are:

cat ../Results/1G_S15/1G_S15.fasta | grep -v ">" | wc -c
140280
cat ../Results/8G_S12/1G_S12.fasta | grep -v ">" | wc -c
140272

Do you think the huge sequence size could be the cause of this error? If so, do you have a trick to avoid it?

Best regards,

Antoine

biopython alignment • 786 views

ADD COMMENT • link updated 10 months ago by Joe 21k • written 11 months ago by antoine.fauchois92 ▴ 20

3

I don't think the pairwise aligner, as implemented in Biopython, could possibly align 140K-long sequences. From your example we can't tell whether a single sequence is 140K or all of them together, but since you are talking about "huge" sequences I assumed the former.

It was not designed for sequences of that size. You would need to use a different tool in my opinion.

If you have multiple sequences, then you need to show the code you use, because from your example one cannot tell how you are using it.
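
To put rough numbers on why sequences of this size are out of reach: a global aligner that fills a full dynamic-programming matrix needs about (len1 + 1) x (len2 + 1) cells. The sketch below is only back-of-the-envelope arithmetic. The helper names and the 8 bytes per cell are assumptions for illustration, not a description of how pairwise2 actually stores its matrices, and the lengths reported by wc -c slightly overcount because they include newline characters.

```python
def dp_matrix_cells(len_a: int, len_b: int) -> int:
    """Number of cells in a full (len_a + 1) x (len_b + 1) alignment matrix."""
    return (len_a + 1) * (len_b + 1)


def estimate_gib(len_a: int, len_b: int, bytes_per_cell: int = 8) -> float:
    """Rough memory estimate in GiB; bytes_per_cell is an assumed value."""
    return dp_matrix_cells(len_a, len_b) * bytes_per_cell / 2**30


# The small example from the question: 8 and 11 bases.
print(dp_matrix_cells(8, 11))

# The two large sequences, using the character counts reported by wc -c.
print(f"{estimate_gib(140280, 140272):.1f} GiB")
```

With these lengths the matrix has roughly 2 x 10^10 cells, which is well over 100 GiB at 8 bytes per cell, so a quadratic-memory aligner cannot realistically hold it in RAM.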

ADD REPLY • link 11 months ago by Istvan Albert 100k

1

A segfault here most likely means too much data, yep. As Istvan said, this is not really what pairwise2 is for.

Moreover, alignment of very long sequences is still a tricky task. It's made a bit easier when it is just a pairwise alignment, and for that I'd suggest MUMmer.

If you need to do multiple alignment, you'll struggle, but LASTZ is at least capable of it in my experience.

ADD REPLY • link 10 months ago by Joe 21k

0

Is it a multiline FASTA? This may be useful:

https://github.com/biopython/biopython/issues/3387

ADD REPLY • link 11 months ago by Shred ★ 1.4k

0

Indeed, my files are multiline FASTA. I tried reading them with the readlines method and then removing the "\n" characters, but I got the same error. I think Istvan Albert is right: my sequences are too large to be used with pairwise2.align.
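
For reference, the reading step described above can be sketched roughly as follows, using only the standard library. The function name first_fasta_sequence is hypothetical, and the sketch assumes the file holds the record of interest first; it only produces a single newline-free string and does nothing about the memory problem itself.

```python
def first_fasta_sequence(lines):
    """Join the sequence lines of the first FASTA record into one string.

    `lines` is any iterable of text lines, for example an open file handle.
    """
    parts = []
    headers_seen = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            headers_seen += 1
            if headers_seen > 1:
                break  # stop at the start of a second record
            continue
        parts.append(line)
    return "".join(parts)


# Small in-memory example of a multiline record.
example = [">seq1 demo\n", "ATGCNT\n", "GA\n", ">seq2\n", "TTTT\n"]
sequence = first_fasta_sequence(example)
print(sequence, len(sequence))
```

With a real file this would be called as first_fasta_sequence(open(path)), ideally inside a with block so the handle is closed afterwards.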

ADD REPLY • link 10 months ago by antoine.fauchois92 ▴ 20