Fastest way to parse through a FASTA file in Python
4
2.2 years ago by mr.two • 0

Hello!

I would like to hear about your experiences with the fastest way to iterate over a FASTA file and count the occurrences of a substring in each Seq object.

Currently I am using the Biopython SeqIO.parse method, but I am wondering whether there are other, faster options.
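
Roughly, what I am doing now looks like this (the file name and the query substring are just placeholders for my real inputs):

from Bio import SeqIO

# count non-overlapping occurrences of the substring in every record
# ("reads.fasta" and "ATGC" are placeholders)
counts = {}
for record in SeqIO.parse("reads.fasta", "fasta"):
    counts[record.id] = record.seq.count("ATGC")

print(counts)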

SeqIO Biopython Python • 2.5k views

2
2.2 years ago

The scaling problem is not parsing the FASTA file; it is counting the occurrences of the substring.

Parsing a FASTA file with Biopython simply loads each sequence into a single large string object, and that is plenty fast - it would be hard to improve on in Python, as the code behind the scenes is quite simple and is very likely I/O bound already.

When it comes to counting substrings, the cost depends on whether you are counting exact matches or not, and whether you need to find potentially overlapping matches or not (a sketch of both cases is at the end of this answer).

timing the simplest counting process:

 python -m timeit -s 'x="ATGC"*1_000_000' -n 100 'x.count("ATGC")'

prints:

100 loops, best of 5: 3.44 msec per loop

So counting all one million occurrences of ATGC in a 4-million-character sequence made of ATGC repeats takes about 4 ms. That may be quick or slow depending on your needs.
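
To make the exact-match case concrete, here is a minimal sketch of both counting modes on a toy sequence; str.count gives the non-overlapping count used in the timing above, and a zero-width regex lookahead is one common way to count overlapping matches:

import re

seq = "ATATAT"

# non-overlapping count, as in the timeit example above
print(seq.count("ATA"))                 # 1

# overlapping count via a zero-width lookahead
print(len(re.findall("(?=ATA)", seq)))  # 2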

3
2.2 years ago by Rob 6.5k

The readfq Python implementation here is likely close to as good as you are going to be able to do in pure/native Python. As a bonus, it's small and self-contained, and you can see pretty clearly what it's doing. If you want more efficient iteration over a large FASTA file that you intend to read line by line, then you will likely have to move to another language. You can see here that Python is much slower than other languages when it comes to FASTA/FASTQ parsing, and in general.
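
If I recall the interface correctly, the readfq generator yields (name, seq, qual) tuples, with qual set to None for FASTA records, so counting a substring per record would look roughly like this (assuming readfq.py from the linked repo has been saved next to the script, and "ATGC" is a placeholder pattern):

import sys
from readfq import readfq  # assumes readfq.py from the linked repo is on the path

# count substring occurrences per record, reading FASTA from stdin
for name, seq, qual in readfq(sys.stdin):
    print(name, seq.count("ATGC"))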

1
2.2 years ago by liorglic ★ 1.4k

Well, you could always write your own parser, and if your coding skills are good you might be able to parse a bit faster than SeqIO. However, I doubt this will actually be worth the effort. If your task is to search for substrings within record sequences, then probably 99% of the run time will be spent on string matching rather than on FASTA parsing. This will be the case whether you parse with your own code or use an external library.
Bottom line: if you want to improve run time, focus on making the substring search more efficient; that part has nothing to do with SeqIO.
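
As a rough illustration of why the search dominates, here is a sketch comparing Python's built-in str.count (implemented in C) with an equivalent pure-Python scan over every start position; the exact numbers depend on your machine, but the built-in version is far faster:

import timeit

setup = 'x = "ATGC" * 1_000_000'

# built-in count, implemented in C
print(timeit.timeit('x.count("ATGC")', setup=setup, number=1))

# equivalent pure-Python scan that checks every start position
slow = """n = 0
for i in range(len(x) - 3):
    if x[i:i+4] == "ATGC":
        n += 1"""
print(timeit.timeit(slow, setup=setup, number=1))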

0
2.2 years ago by jkim ▴ 170

def parse_fasta(fa):
    """Yield (name, sequence) tuples from an open FASTA file handle."""
    name, seq = None, []
    for line in fa:
        line = line.rstrip()
        if line.startswith(">"):
            # new header: emit the previous record, if any
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            # sequence lines may be wrapped; collect them and join later
            seq.append(line)
    # emit the final record
    if name:
        yield (name, ''.join(seq))

--How to use--

with open("my.fasta") as fa:  # pass an open file handle (any iterable of lines works)
    for name, seq in parse_fasta(fa):
        print(name, seq)

This is not the fastest option, but it may help. However, Python is quite slow compared to other languages for this kind of task.

