Hey everyone :)
Here's a 100% python BAM file reader, and it has some very good performance characteristics:
This little proof-of-concept script aims to do just 1 job - if you want to iterate over all reads in a BAM file, it will do that for you as quickly as possible, giving you the data back as either raw BAM or SAM.
>>> import pybam >>> >>> pure_bam_data = pybam.bgunzip('./ENCFF001LCU.bam.gz') >>> >>> parser = pybam.compile_parser(['pos','mapq','qname']) >>> >>> for read in parser(pure_bam_data): .... print read .... break .... (3000742, 0, 'SOLEXA1_0001:4:49:11382:21230#0')
The script has two parts - a class for decompressing the BAM file and parsing out it's header, and a generator function that gets defined at run-time to decode the packed binary BAM data into SAM data that you can use in python.
Originally I was skeptical that Python could do this well, because unpacking binary data is not something python is known for being particularly good at - however, after playing around a bit I see that on modern Python implementations it is actually fairly reasonable - and on PyPy it is as good as any C. So on an uncompressed BAM file, doing it all in python is at least 2-3x faster than using Pysam - which has to make a call out to C land and then bring your value back into python every time you want something. If you want more than 1 thing from a read, this really adds up. On a 23Gb Encode file (ENCFF001IMQ.bam), I ran the two code snippits:
pysam - 20min 53s
import pysam pure_bam = pysam.AlignmentFile('./ENCFF001IMQ.bam') for read in pure_bam: read.pos read.flag read.rname read.mapq read.rnext read.pnext read.tlen
pybam - 12min 53s
import pybam pure_bam = pybam.bgunzip('./ENCFF001IMQ.bam') parser = pybam.compile_parser(['pos','flag','rname','mapq','rnext','pnext','tlen']) for read in parser(pure_bam): read read read read read read read
However, the above comparison is a little unfair, and i'll explain why.
Decoding the BAM data into SAM is actually fairly irrelevant to the overall BAM-reading time. This is because 80-90% of the work that the computer has to do is decompressing the BAM data, not decoding it, which i found surprising. In short, even though i spent most of my time in this project optimizing the decoder, speeding up the decompression has a much bigger effect than any time-savings gained from having a fast decoder.
Unfortunately, decompressing gzip files (which is what a BAM essentially is) is notoriously broken in python. Despite being two modules for it (zlib and gzip), they are both 2-3x slower than the gzip program you'll find already installed on your machine (http://aripollak.com/pythongzipbenchmarks/), or the C decompression code used by pysam/htspython. The take-home message here is that pure-python decompression/decoding is only faster than pysam when the compression-to-data ratio is high -- i.e. there are many reads packed into as little compressed space as possible, so we spend more time on decoding than decompression.
Unless, of course, you use your local copy of unix gzip (or even better, a parallel decompressor like pigz) and pipe the uncompressed data to python. In the comparison above, thats exactly what its doing - detecting i have pigz installed, subprocessing it out (with 3 threads, because i dont see much improvement above 3 on my 4-core machine), and then reading from its stdout. Using gzip (1 thread) its just a bit faster than pysam, by only a few minutes. Also it was run in pypy, which i recommend, but again decoding is not as significant as decompression so you wont see a huge change running it in standard python.
Really, honestly, no one is going to use my code -- im only putting it up here to demonstrate two things:
1) Future next-gen file format creators should aim to make their formats quick to decompress by design. The actual encoding of the data before compression is fairly irrelevant. A typical L2 cache these days is at minimum 256K, with most personal computers having 512K+, and compute servers having much much more - so our compression blocks should probably be increased to double or quadruple what it is for BAM (65KB hard-limit). Formats should also be designed for parallel decompression. As good as pigz is, parallel gzip was an afterthought. There are compression/decompression algorithms designed specifically for multi-core decompression from a stream, and we should take advantage of them :) If you have written a BAM parser, or you read/write BAM data - you can dramatically improve performance by making use of multiple cores.
2) Python doesn't have to be slow any more. "samtools view -c" on that 23Gb file takes 11min18s , while on python with 3 pigz decompression threads it takes 11min58s. The -c flag for samtools uses code optimized just for counting, and skips the unpacking of all the other data. The python code however is not optimized for counting, so i'm sure you could write a counter in python that's just as fast as the C (particularly if there was a better gzip module). This, however, only holds true if you're using pypy, and that's the big gotcha. There is a growing split in all interpreted languages between the vanilla run time environments, and the JIT'd/optimized environments (like pypy for python). The JITs just keep getting faster and faster as new CS ideas about how to do it are developed. I could have, for example, used some of the special pypy string concatination functions to really speed up the decoder, particularly for the seq/qual, but then it wouldn't work on regular python. This is troubling, and is touched on here: https://vimeo.com/61044810 - no one wants there to be two kinds of python. But no one wants to wait around for hours for their python/perl/ruby programs to execute either. I don't know what will win out, community inertia or performance, but one thing is certainly true its up to the developers of new python software. If you write code being mindful of performance optimizations that JITs can do, and try to stay within the standard library or only use big external libraries like numpy - the more likely your users will be able to run your code at C-speeds using a JIT, with none of the issues usually associated with write/compiling C in Bioinformatics.