Fastest way to parse through a FASTA file in Python
4
2.2 years ago by mr.two • 0

Hello!

I would like to hear about your experiences with the fastest way to iterate over a FASTA file and count the occurrences of a substring in each Seq object.

Currently I am using the Biopython SeqIO.parse method, but I am wondering whether there are other, faster options.
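
Roughly, what I am doing now looks like this (the file name and the query substring are just placeholders for my real inputs):

from Bio import SeqIO

# count non-overlapping occurrences of the substring in every record
# ("reads.fasta" and "ATGC" are placeholders)
counts = {}
for record in SeqIO.parse("reads.fasta", "fasta"):
    counts[record.id] = record.seq.count("ATGC")

print(counts)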

SeqIO Biopython Python • 2.5k views

2
2.2 years ago

The scaling problem is not parsing the FASTA file; it is counting the occurrences of the substring.

Parsing a FASTA file with Biopython simply loads each sequence into a single large string object, and that is plenty fast - it would be hard to improve on in Python, as the code behind the scenes is quite simple and is very likely I/O bound already.

When it comes to counting substrings, the cost depends on whether you are counting exact matches or not, and whether you need to find potentially overlapping matches or not (a sketch of both cases is at the end of this answer).

timing the simplest counting process:

 python -m timeit -s 'x="ATGC"*1_000_000' -n 100 'x.count("ATGC")'

prints:

100 loops, best of 5: 3.44 msec per loop

So counting all one million occurrences of ATGC in a 4-million-character sequence made of ATGC repeats takes about 4 ms. That may be quick or slow depending on your needs.
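
To make the exact-match case concrete, here is a minimal sketch of both counting modes on a toy sequence; str.count gives the non-overlapping count used in the timing above, and a zero-width regex lookahead is one common way to count overlapping matches:

import re

seq = "ATATAT"

# non-overlapping count, as in the timeit example above
print(seq.count("ATA"))                 # 1

# overlapping count via a zero-width lookahead
print(len(re.findall("(?=ATA)", seq)))  # 2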

3
2.2 years ago by Rob 6.5k

The readfq Python implementation here is likely close to as good as you are going to be able to do in pure/native Python. As a bonus, it's small and self-contained, and you can see pretty clearly what it's doing. If you want more efficient iteration over a large FASTA file that you intend to read line by line, then you will likely have to move to another language. You can see here that Python is much slower than other languages when it comes to FASTA/FASTQ parsing, and in general.
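
If I recall the interface correctly, the readfq generator yields (name, seq, qual) tuples, with qual set to None for FASTA records, so counting a substring per record would look roughly like this (assuming readfq.py from the linked repo has been saved next to the script, and "ATGC" is a placeholder pattern):

import sys
from readfq import readfq  # assumes readfq.py from the linked repo is on the path

# count substring occurrences per record, reading FASTA from stdin
for name, seq, qual in readfq(sys.stdin):
    print(name, seq.count("ATGC"))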

1
2.2 years ago by liorglic ★ 1.4k

Well, you could always write your own parser, and if your coding skills are good you might be able to parse a bit faster than SeqIO. However, I doubt this will actually be worth the effort. If your task is to search for substrings within record sequences, then probably 99% of the run time will be spent on string matching rather than on FASTA parsing. This will be the case whether you parse with your own code or use an external library.
Bottom line: if you want to improve run time, focus on making the substring search more efficient; that part has nothing to do with SeqIO.
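
As a rough illustration of why the search dominates, here is a sketch comparing Python's built-in str.count (implemented in C) with an equivalent pure-Python scan over every start position; the exact numbers depend on your machine, but the built-in version is far faster:

import timeit

setup = 'x = "ATGC" * 1_000_000'

# built-in count, implemented in C
print(timeit.timeit('x.count("ATGC")', setup=setup, number=1))

# equivalent pure-Python scan that checks every start position
slow = """n = 0
for i in range(len(x) - 3):
    if x[i:i+4] == "ATGC":
        n += 1"""
print(timeit.timeit(slow, setup=setup, number=1))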

0
2.2 years ago by jkim ▴ 170

def parse_fasta(fa):
    """Yield (name, sequence) tuples from an open FASTA file handle."""
    name, seq = None, []
    for line in fa:
        line = line.rstrip()
        if line.startswith(">"):
            # new header: emit the previous record, if any
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            # sequence lines may be wrapped; collect them and join later
            seq.append(line)
    # emit the final record
    if name:
        yield (name, ''.join(seq))

--How to use--

with open("my.fasta") as fa:  # pass an open file handle (any iterable of lines works)
    for name, seq in parse_fasta(fa):
        print(name, seq)

This is not the fastest option, but it may help. However, Python is quite slow compared to other languages for this kind of task.

