Question

Alignment error using Biopython

0

Entering edit mode

10 months ago

Chris K ▴ 10

Hello, i am trying to write a program using biopython that will align some sequences from a fasta file (for the test that i will present 5 of them) against a fasta file containing a genome. For each, gene-genome alignment i want the score of the alignment. However, when i initially used pairwise2 it game an error "MemoryError: Out of memory" and suggested that i should use Bio.Align.PairwiseAligner. When i tried the Bio.Align.PairwiseAligner the code just ran but gave me no output and the terminal said it was Killed. here is the code:

from Bio import SeqIO
from Bio.Align import PairwiseAligner


genome_file = "CP000033.3[1..1993560].fasta"
genome_sequence = next(SeqIO.parse(genome_file, "fasta")).seq


gene_file = "genes.fasta"
gene_sequences = list(SeqIO.parse(gene_file, "fasta"))


aligner = PairwiseAligner()


for gene_seq in gene_sequences:
    alignment = aligner.align(genome_sequence, gene_seq.seq)


    print("Gene:", gene_seq.id)
    print("Alignment Score:", alignment.score)
    print()

Also, here is a screenshot from the terminal and the VScode: enter image description here

Thank you in advance, for your kind recommendations!

Edit: Could it be that the genome consumes memory or exceed the time limit?

code • 829 views

ADD COMMENT • link updated 10 months ago by GenoMax 142k • written 10 months ago by Chris K ▴ 10

score 2 · Accepted Answer · 2023-07-01

2

Entering edit mode

10 months ago

zorbax ▴ 610

I don't know how much ram you have or how long your genes are, but you definitely need more ram. I made a little example for you:

import random
from io import StringIO
from Bio import SeqIO
from Bio.Align import PairwiseAligner

genome_str = f">genome\n{''.join(random.choices('ATGC', k=1993560))}"
genes_str = ''.join([f">seq{x}\n{''.join(random.choices('ATGC', k=1000))}\n" for x in range(5)])

genome_bio = next(SeqIO.parse(StringIO(genome_str), "fasta")).seq
genes_bio = SeqIO.parse(StringIO(genes_str), "fast")

aligner = PairwiseAligner()

for gene_seq in genes_bio:
    alignment = aligner.align(gene_seq, genome_bio)
    print(f"Gene: {gene_seq.id}\nAlignment score: {alignment.score}")

For five genes of only 1000 nt of length with your reference, at least in size, I required ~8-10 Gb de RAM.

ADD COMMENT • link 10 months ago by zorbax ▴ 610

0

Entering edit mode

Well my laptop has currently 6gb of ram (I know...). The genome fasta is L. Acidophilus's NCFM genome and the fasta containing the genes has for this example 5 seqs approximately 900-1500kb. The final goal is to scan like 100 genes against the genome provided by the user. So you think if this program runs to a server will it produce the desired result? And if so, is there any chance to do it locally without the server?

ADD REPLY • link 10 months ago by Chris K ▴ 10

1

Entering edit mode

if you want this kind of alignment you should use a computer with a bit more resources otherwise, you can use minimap2 to map your genes vs a reference genome, which consumes fewer resources.

ADD REPLY • link 10 months ago by zorbax ▴ 610

0

Entering edit mode

Great i will talk with my lab because they will eventually use it and not me and also i will try and use the minimap2 tool. Thank you very much!

ADD REPLY • link 10 months ago by Chris K ▴ 10