How can I make parsing a big FASTA with Biopython SeqIO go faster?
3.1 years ago

Hi! I'm a master's thesis student trying to learn this :) I've searched this forum but couldn't really find anything that answers my question.

I want to extract specific chromosome sequences from the human genome into a new FASTA file, and I'm using this Python script:

from Bio import SeqIO

sequences = SeqIO.parse("GRCh38.p13.genome.fa", "fasta")
with open("new_fasta.fa", "w") as fa:
    for record in sequences:
        if record.id == chrom:  # chrom holds the chromosome I want, e.g. "chr1"
            fa.write(">" + record.id + "\n")
            fa.write(str(record.seq) + "\n\n")
            break  # I use break here, otherwise it will continue with chr10, chr12, chr13 etc. if I choose chr1

It works, but extracting chr1 takes about 1 second while chr20 takes about 20 seconds. How can I speed this up?

Truly grateful for any help!

fasta biopython seqio extract python • 1.5k views

Well, I'm not a Python guy, but SeqIO seems to load each whole sequence into memory, while for this task you just need to scan the file line by line and check the sequence headers.

Furthermore, for this task you should have a look at samtools faidx.
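
For illustration, here is a minimal sketch of that line-by-line scan in plain Python (the file name and chromosome ID are placeholders):

target = "chr20"  # placeholder chromosome ID
writing = False
with open("GRCh38.p13.genome.fa") as fin, open("chr20.fa", "w") as fout:
    for line in fin:
        if line.startswith(">"):
            if writing:      # we just hit the next record, so we are done
                break
            # headers look like ">chr20 ...", so compare the first word after ">"
            writing = line[1:].split()[0] == target
        if writing:
            fout.write(line)

With samtools faidx the equivalent is a one-liner (something like samtools faidx GRCh38.p13.genome.fa chr20 > chr20.fa); it builds a .fai index the first time and afterwards reads only the requested sequence instead of scanning the whole file.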


There may be contributors on this site who already have a faster solution and will share it. On the surface, however, I question the need to expend the effort to lower the running time from 20 seconds to 1 second. It seems easier to wait the additional 19 seconds than to potentially spend more time than that optimizing the code.

3.1 years ago
Joe 21k

I don't know for certain, but I would imagine indexing will be faster than what you're doing, especially if you want more than one record from different parts of the file since you only need to read it once.

from Bio import SeqIO

record_dict = SeqIO.index("example.fasta", "fasta")
print(record_dict["gi:12345678"])  # use any record ID

You still need to read the whole file at least once, and there will be little you can do to make this any faster whilst staying within (bio)python.

In my testing with the human genome, the parse approach takes 43 seconds to find a random ID near the back end of the file. index takes about 2 minutes to build the index, but then retrieving each record chosen at random thereafter takes a negligible amount of time, so these speed gains will add up quickly if you want multiple potential records.
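
To go from the lookup to the new FASTA file asked about in the question, the retrieved record can be passed straight to SeqIO.write. A short sketch, assuming placeholder file and chromosome names:

from Bio import SeqIO

# Build the index once; it stores file offsets rather than the sequences themselves.
record_dict = SeqIO.index("GRCh38.p13.genome.fa", "fasta")
# Fetch one chromosome by its header ID and write it out without re-parsing the genome.
SeqIO.write(record_dict["chr20"], "chr20.fa", "fasta")
record_dict.close()

The keys are whatever appears as the record ID in each header line, so check that first if the lookup raises a KeyError.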
