Question

Tool:pybam - 100% python BAM reader

21

Entering edit mode

8.0 years ago

John 13k

Hey everyone :)

Here's a 100% python BAM file reader, and it has some very good performance characteristics:

https://github.com/JohnLonginotto/pybam

This little proof-of-concept script aims to do just 1 job - if you want to iterate over all reads in a BAM file, it will do that for you as quickly as possible, giving you the data back as either raw BAM or SAM.

>>> import pybam
>>>
>>> pure_bam_data = pybam.bgunzip('./ENCFF001LCU.bam.gz')
>>>
>>> parser = pybam.compile_parser(['pos','mapq','qname'])
>>>
>>> for read in parser(pure_bam_data):
....     print read
....     break
....
(3000742, 0, 'SOLEXA1_0001:4:49:11382:21230#0')

The script has two parts - a class for decompressing the BAM file and parsing out it's header, and a generator function that gets defined at run-time to decode the packed binary BAM data into SAM data that you can use in python.

Originally I was skeptical that Python could do this well, because unpacking binary data is not something python is known for being particularly good at - however, after playing around a bit I see that on modern Python implementations it is actually fairly reasonable - and on PyPy it is as good as any C. So on an uncompressed BAM file, doing it all in python is at least 2-3x faster than using Pysam - which has to make a call out to C land and then bring your value back into python every time you want something. If you want more than 1 thing from a read, this really adds up. On a 23Gb Encode file (ENCFF001IMQ.bam), I ran the two code snippets:

pysam - 20min 53s

import pysam
pure_bam = pysam.AlignmentFile('./ENCFF001IMQ.bam')
for read in pure_bam:
    read.pos
    read.flag
    read.rname
    read.mapq
    read.rnext
    read.pnext
    read.tlen

pybam - 12min 53s

import pybam
pure_bam = pybam.bgunzip('./ENCFF001IMQ.bam')
parser = pybam.compile_parser(['pos','flag','rname','mapq','rnext','pnext','tlen'])
for read in parser(pure_bam):
    read[0]
    read[1]
    read[2]
    read[3]
    read[4]
    read[5]
    read[6]

However, the above comparison is a little unfair, and I'll explain why.

Decoding the BAM data into SAM is actually fairly irrelevant to the overall BAM-reading time. This is because 80-90% of the work that the computer has to do is decompressing the BAM data, not decoding it, which I found surprising. In short, even though I spent most of my time in this project optimizing the decoder, speeding up the decompression has a much bigger effect than any time-savings gained from having a fast decoder.

Unfortunately, decompressing gzip files (which is what a BAM essentially is) is notoriously broken in python. Despite being two modules for it (zlib and gzip), they are both 2-3x slower than the gzip program you'll find already installed on your machine (http://aripollak.com/pythongzipbenchmarks/), or the C decompression code used by pysam/htspython. The take-home message here is that pure-python decompression/decoding is only faster than pysam when the compression-to-data ratio is high -- i.e. there are many reads packed into as little compressed space as possible, so we spend more time on decoding than decompression.

Unless, of course, you use your local copy of unix gzip (or even better, a parallel decompressor like pigz) and pipe the uncompressed data to python. In the comparison above, thats exactly what its doing - detecting I have pigz installed, subprocessing it out (with 3 threads, because I don't see much improvement above 3 on my 4-core machine), and then reading from its stdout. Using gzip (1 thread) its just a bit faster than pysam, by only a few minutes. Also it was run in pypy, which I recommend, but again decoding is not as significant as decompression so you wont see a huge change running it in standard python.

Really, honestly, no one is going to use my code -- I'm only putting it up here to demonstrate two things:

1) Future next-gen file format creators should aim to make their formats quick to decompress by design. The actual encoding of the data before compression is fairly irrelevant. A typical L2 cache these days is at minimum 256K, with most personal computers having 512K+, and compute servers having much much more - so our compression blocks should probably be increased to double or quadruple what it is for BAM (65KB hard-limit). Formats should also be designed for parallel decompression. As good as pigz is, parallel gzip was an afterthought. There are compression/decompression algorithms designed specifically for multi-core decompression from a stream, and we should take advantage of them :) If you have written a BAM parser, or you read/write BAM data - you can dramatically improve performance by making use of multiple cores.

2) Python doesn't have to be slow any more. samtools view -c on that 23Gb file takes 11min18s , while on python with 3 pigz decompression threads it takes 11min58s. The -c flag for samtools uses code optimized just for counting, and skips the unpacking of all the other data. The python code however is not optimized for counting, so I'm sure you could write a counter in python that's just as fast as the C (particularly if there was a better gzip module). This, however, only holds true if you're using pypy, and that's the big gotcha. There is a growing split in all interpreted languages between the vanilla run time environments, and the JIT'd/optimized environments (like pypy for python). The JITs just keep getting faster and faster as new CS ideas about how to do it are developed. I could have, for example, used some of the special pypy string concatenation functions to really speed up the decoder, particularly for the seq/qual, but then it wouldn't work on regular python. This is troubling, and is touched on here: https://vimeo.com/61044810 - no one wants there to be two kinds of python. But no one wants to wait around for hours for their python/perl/ruby programs to execute either. I don't know what will win out, community inertia or performance, but one thing is certainly true its up to the developers of new python software. If you write code being mindful of performance optimizations that JITs can do, and try to stay within the standard library or only use big external libraries like numpy - the more likely your users will be able to run your code at C-speeds using a JIT, with none of the issues usually associated with write/compiling C in Bioinformatics.

python htspython bam pysam • 19k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 8.0 years ago by John 13k

2

Entering edit mode

When you get a chance, could you maybe add a bit more information to your README.md page on github? Ideally some variant of this post would be a start.

ADD REPLY • link 8.0 years ago by Devon Ryan 104k

0

Entering edit mode

Yes definitely - I was thinking of cutting the blurb in the body of the code out and putting it in the readme. And writing some tests. And putting it on pypi. But man, I have lab meeting in 5 days and I cant present any of this, so i'm going to have to put it on hold until a free weekend or something. The README though, I should do that now :)

ADD REPLY • link 8.0 years ago by John 13k

1

Entering edit mode

I'll have Steffen harass you about this in your lab meeting then :)

ADD REPLY • link 8.0 years ago by Devon Ryan 104k

2

Entering edit mode

1) Decoding speed and parallelization have long been considered in BAM and CRAM. BAM prefers zlib over bzlib mainly because it is much faster on decompression. In the early days, I rejected the request to switch to bzlib precisely for the sake of performance. BAM is easily parallelizable as BGZF blocks are independent. Samtools does not take the full advantage of possible parallelization because it is a bit old and doing that would need to rewrite the BGZF module. Sambamba is designed for parallelization from ground up. It is better at multi-threading. CRAM can also be parallelized. The parallelization part is more efficient than samtools.

ADD REPLY • link 8.0 years ago by lh3 33k

1

Entering edit mode

This looks like a cool experiment. I see that you reference Peter Cock's bgzf module. Can I ask if you tried implementing virtual offsets using the BAM index? I've wanted to implement something like this in my simplesam module for a while, but am currently using samtools for bam reading.

ADD REPLY • link 8.0 years ago by Matt Shirley 10k

0

Entering edit mode

Thanks Matt!! I must admit, I saw your simplesam module a few days back on another thread and I thought it was awesome, which is why I thought it might be useful to lift out the part of my code that does the BAM reading from another project and make it it's own thing. Originally I was going to call it simplebam and then post it under your post saying "maybe you can use something here to remove your samtools dependency?", but then I thought that might be a bit pretentious since my code is very very, er, basic. No unit tests. No integration tests. No documentation. Not even sure SAM data is reported perfectly (I think the pos values are off-by-one and I dont know how the qual values are turned to phred scores). So.. yeah its not exactly done, just a proof of concept that hopefully would get someone else interested in taking it on.

As for making use of the indexes... well, uh, its a real pain. New versions of bgzip make sure compression blocks start and end at the beginning/end of a read, and the header gets its own block too. But older versions (and the data im using to test) doesnt do any of that, so you cant jump to a random block using the virtual offsets and start reading the BAM data because you might be in the middle of a read, and much like a reading-frame issue for DNA, you'll unpack garbage. I mean, you could try and find some tell-tale signs like the QNAME being a string of ASCII followed by a \x00, but its not proper. The only proper way is to use the BAM index (if it exists), which i presume tells you which byte of the uncompressed data a indexed read starts, which you use in conjunction with the virtual offset. That would probably double the amount of code needed, introduce some new bugs, and so if you're going that far things stop being simple. I'm not saying its a bad idea, but it does become a proper project then. Like what if the user doesnt have a .bai? What if the .bai is wrong? What if the data is sorted weird? So because of that, I decided if I wanted to keep things simple I should just dump the BAM data to SQL instead, where there's just 1 proper index, you get better compression, on-the-fly header generation, its writable in-place, etc etc etc etc. That's the bam+ format alluded to in a pybam comment somewhere.

So yeah, for about 100 lines-of-code you could make a "BAM bisect" to find quick-ish the read you need from a BAM file - but it would involve a lot of guess work. I doubt it would ever fail though. For about 1000 lines you could do full virtual offset and bai decoding -- and im not sure how much better that would be :P

ADD REPLY • link 8.0 years ago by John 13k

0

Entering edit mode

On size, I guess it depends on the SQL implementation. I am not that certain you can get better compression than BAM, let alone CRAM. Again depending on the SQL implementation, scanning through a table is likely to be slower than reading a BAM. In addition, with SQL, you will lose streamability, one seek per query and direct http/ftp access. Even before BAM, people have been talking about alignment in SQL, but till now I have not seen practical implementations that match the performance and the necessary feature set of BAM/CRAM.

On indexing, are you thinking of a new indexing algorithm? To implement the standard BAM indexing, you don't need to care if an alignment is bridging two BGZF blocks or not.

ADD REPLY • link 8.0 years ago by lh3 33k

1

Entering edit mode

What I believe is happening with these modules is the decompression and interpreter are running in the same process, whereas opening a new shell to run zcat effectively spreads the work out to multiple cores (and more so with pigz). The same is true in Perl, where doing the work with the core module is about 2X slower than opening zcat for reading. There is also a gzip I/O layer for Perl that is as fast (or faster in some tests) as opening zcat, but it is not a core module. I only mention that because I think another reason these modules (Perl/Python/whatever) are slower is all the magic associated with objects and tied filehandles, as opposed to just reading from a stream. So, I'm not sure I'd say it is broken in Python but rather it is a "feature" of this type of runtime/environment or whatever you want to call it. All these new compilers and runtimes are exciting but it is a little overwhelming to see the number of different directions progress is taking in terms of separate projects. Cool post, it gave me a lot to think about!

ADD REPLY • link 8.0 years ago by SES 8.6k

0

Entering edit mode

I was trying your code to read the bam file but found that there is not 'pip install pybam' but it's 'pip install pybamm'. Can you please tell me how did you install the pybam? because, 'pip install pybam' and 'import pybam' are not working.

ADD REPLY • link 24 months ago by Niveditha • 0

score 3 · Answer 1 · 2016-04-19

3

Entering edit mode

8.0 years ago

chen ★ 2.5k

Great work! I tried pysam several months ago and it worked well!

If someone wants to read/write bam in Julia, use HTSLIB.jl (https://github.com/OpenGene/HTSLIB.jl)

BTW, we created lots of bioinformatics projects with Julia, if any guys want to contribute, please take a look at projects at https://github.com/OpenGene

Here are some of OpenGene projects

OpenGene.jl
OpenGene, core libraries for NGS data analysis and bioinformatics in julia

SeqMaker.jl
Next Generation Sequencing simulation with SNP, Variation and Sequencing Error Integrated

FusionDirect.jl
Detect gene fusion directly from raw fastq files

ADD COMMENT • link 8.0 years ago by chen ★ 2.5k

0

Entering edit mode

That's pretty awesome Chen, I see you also develop in Rust and Python too :) Lots of cool projects on your git. I would, however, make a new thread for your software since then it will be easier to find when people search for "Julia bam", etc.

It would also be pretty interesting to speed-test all the analogous bioinformatics modules across all the different languages. Perhaps BAM reading is fast in one, FASTQ-reading is better in another, etc etc. Maybe even a consensus could be developed across languages for how modules should look/work, not to mention we could presumably all learn from each other's implimentations.

ADD REPLY • link 8.0 years ago by John 13k

score 1 · Answer 2 · 2016-04-18

1

Entering edit mode

8.0 years ago

John 13k

Updated the Github with some README info as requested :) https://github.com/JohnLonginotto/pybam

Also there was a bug where system-installed gzip had insufficient midichlorians unless it was run with --force. ~ba-ding-tsk~

.

I need help on two bits of the BAM spec to fully finish this project:

1) How do you go from a 0-255 single-byte int8 to a phred score like the ones you see when you view a SAM file?

2) What is bin?! I can't seem to figure out how its calculated or what its used for. Here's the relevant bit from the docs:

BIN is calculated using the reg2bin() function in Section 5.3. For mapped reads this uses POS-1 (i.e., 0-based left position) and the alignment end point using the alignment length from the CIGAR string. For unmapped reads (e.g., paired end reads where only one part is mapped, see Section 2) treat the alignment as being length one. Note unmapped reads with POS 0 (which becomes −1 in BAM) therefore use reg2bin(-1, 0) which is 4680.

Section 5.3 is the very last page of the spec (https://samtools.github.io/hts-specs/SAMv1.pdf). It would seem to me one could generate this value rather than store it in the BAM right?

To do:

From here on out I will work on improving the user-experience of pybam so that it does what you might expect. At the moment it returns unpacked BAM data, which is not the same as SAM data (for example, positions in BAM are 0-based, while in SAM they're 1-based).

Other changes will include:

Will probably merge bgunzip() with compile_parser() because I can't see anyone using one without the other.

I also have code in another project that will do the bgzip compression (which is needed to make a fully-functioning BAM files that samtools/etc can read) using multiprocessing which is pretty fast, so I could add that in as a new function.

The other project also has code to make your own BAM headers on-the-fly, so with the above I could make a parallelised BAM writer if that was something that was appealing enough to people to warrant work on it.

All the best :)

ADD COMMENT • link 8.0 years ago by John 13k

1

Entering edit mode

+33
See the UCSC paper in Genome Research or the Tabix paper about the rationale of binning. Understanding R-tree will be useful, too. Binning is critical to the standard BAM indexing, but you are right that it is not necessary to keep "bin" in BAM files. It can be computed on the fly. There are also other ways to index BAM files, not necessarily relying on binning.

A few other points. On my machine, pigz is slower than gzip on decoding, in particular on BAM files. Also, I believe pigz uses at most 3 threads on decompression. If you use pypy, perhaps it would be better to directly call libz.so with cffi. There is a significant overhead on piping inside python.

ADD REPLY • link 8.0 years ago by lh3 33k

0

Entering edit mode

Ah, +33! Awesome! That was the shortest answer to a difficult problem ever - you might as well have said the answer is 42 :P

Yes, I must admit, im not familiar with R-trees at all. I should read that paper and then go bother Devon to explain it to me in simple words. Actually, better idea - he has to interview new bioinformaticians applying to our institute soon, so maybe one of the questions could be "How would you explain R-Trees to a child?" and then they're graded on how well I actually understood. "How would you explain X to a Biologist" would actually make a fairly good bioinformatics interview question...

Re: pigz - thats really interesting because, as great as pigz is, we're still limited to the rate at which python can ingest data from the stdin, as you say. I can see this using the linux pv tool, that my pipe throughput maxes out - but if that was compressed data, we'd have a much higher throughput. There are some other clear-improvements I could make to the code if I knew the user had pypy and I didnt have to support CPython too - so perhaps that's a better direction for the project anyway. Lots to think about :)

ADD REPLY • link 8.0 years ago by John 13k

1

Entering edit mode

If you ever really want to know the details of binning then swing by the office and I'll bore you to death with the details of what an R tree is.

ADD REPLY • link 8.0 years ago by Devon Ryan 104k

2

Entering edit mode

Please record that and share on YouTube :)

ADD REPLY • link 8.0 years ago by Ram 43k

score 1 · Answer 3 · 2017-04-19

Pybam Beta is now done: https://github.com/JohnLonginotto/pybam

It's just all-round better than the previous version in terms of how you use the API.

Perhaps the two most interesting things are:

It can do dynamic and static parsing (decode something from BAM to SAM as and when you need it, as well as asking for the same X properties to be decoded every time a new read is read). Syntax is as simple as possible.
One can decompress and recompress BAM files using any compression format they wish, so long as their is a tool installed on the system that prints uncompressed data to it's stdout. So if you don't make use of BAM indexes, you can instead make use of much faster compression algorithms like LZO (about 6x faster than samtools), or high-compression-rate algorithms like LZMA. pybam will treat them no differently than a standard BAM file.

I looked into supporting BAM indexes, but it's just so difficult to understand the spec and how it's implimented i gave up.

score 0 · Answer 4 · 2016-08-06

0

Entering edit mode

7.7 years ago

531800824 • 0

I am a newbie at bioinformatics. I am trying to get the 'read ID ,the Chromosome ID，the position of the read matched to the the reference genome ' from a bam file using pybam. I have tried the 'rname', but it says 'NameError: global name 'rname' is not defined ',or the command：bamfile.chromosomes_from_header ,but it prints all the chromosomes name. command:refID just print '0' I just want to ask :how could i get each read's chromosome name?

Thanks.

ADD COMMENT • link 7.7 years ago by 531800824 • 0

1

Entering edit mode

If anyone in the future wants help using pybam or anything else i've written, it's preferable to send me a message on GitHub directly to keep all the issues in one place (for other users likely to have the same questions). That being said, this is a good question for Biostars :)

The BAM format doesn't store the actual chromosome names (rname) for each read in the file. It stores a number called the refID, which is then used to find the actual chromosome name from the list of chromosomes in the header of the BAM. This is because reference names might get very long, and storing very long names for each read would be a waste of space. So to do that conversion in pybam looks something like this:

>>> import pybam
>>> pure_bam_data = pybam.bgunzip('/Users/John/Desktop/ENCFF001LCU.bam')
Using gzip!
>>> parser = pybam.compile_parser(['qname','refID','pos'])
>>> for qname,refID,pos in parser(pure_bam_data):
...     if refID < 0: rname = 'Unmapped'
...     else:         rname = pure_bam_data.chromosome_names[refID]
...     print qname, rname, pos
...     break
...
SOLEXA1_0001:4:49:11382:21230#0 chr1 3000742

So as you can see i've defined rname myself there, and what the chromosome name should be where that refID comes back as less than 0 (which is 'no rname' in the BAM spec). Future updates of pybam will likely include the rname function by default - however it's worth noting that this conversion from a number to a name takes a bit of time - a significant amount of time if you need to do it billions of times. Sometimes, depending on your program, you can leave everything as refIDs for the majority of your program's work, and only at the last moment do the conversion from refIDs to rnames (for example, if you wanted to count reads per rname - the name doesn't really matter for anything other than printing the final results, so count per refID then convert at the end).

Hope that works and clears it up :)

ADD REPLY • link 7.7 years ago by John 13k