Byte-wise difference between two BAMs - i.e. binary patches
8.1 years ago • John 13k

Hello all :)

So I've almost finished the re-writable BAM file format that was first discussed here on another thread. One thing that became apparent during development is that it actually isn't particularly useful to write over the old BAM as you create the new BAM, because if the middle process cuts out for any reason, you end up with a corrupt file (or at least a half-modified one). For example, if some_process.jar dies partway through this pipeline, my.bam+ is left half-written:

bamplus --out my.bam+  | some_process.jar I=/dev/stdin O=/dev/stdout | bamplus --in my.bam+

This isn't so much of a problem if you run

bamplus --out my.bam+ | some_process.jar I=/dev/stdin O=/my.mark_duplicates.bam
cat my.mark_duplicates.bam | bamplus --in my.bam+

but the utility of the whole program is kind of wasted if you have to write a whole BAM to disk in the first place.

For this reason, it seems sensible to implement a 'diff' of the before-and-after BAMs, store only that, then commit the changes to the original file in place when you are ready. Since my.bam+ is really just an SQLite file, this is as simple as adding a new column and dumping the diff data into it. As a working prototype I used bsdiff, a byte-wise diff for binary files originally written for BSD, which has Python bindings. I also tried xdelta and xdelta3, but they were worse than bsdiff. After running bsdiff on a before-and-after-marked-duplicates BAM, the diff was about 10% of the file size, which is obviously a lot smaller than keeping a second BAM file (but still unacceptably large given what actually changed).
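
For reference, that prototype step looks roughly like this with the bsdiff4 package (one Python binding for bsdiff; the exact binding used isn't named above, so take the module name as an assumption):

import bsdiff4  # one Python binding for bsdiff (an assumption, not named above)

# Byte-wise diff of the before/after BAMs; the patch is all you keep.
bsdiff4.file_diff("before.bam", "after.bam", "markdup.patch")

# Later: reconstruct the modified BAM from the original plus the patch.
bsdiff4.file_patch("before.bam", "markdup.patch", "after.reconstructed.bam")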

I think it should be more than possible to get that down to <3%, given that a read's FLAG is 2 bytes of a roughly 150-ish bytes-per-read record (generally much more), and that's assuming EVERY read is a duplicate.
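
To make that arithmetic concrete: a BAM-aware patch for a flags-only change could be as simple as a list of (record index, new FLAG) pairs. A toy sketch, assuming pysam and that both files hold the same reads in the same order (the 6-byte patch record is an illustrative choice, not part of bam+):

import struct
import pysam  # assumed here purely for illustration

# Record only (record index, new FLAG) pairs for reads whose FLAG
# changed, e.g. after MarkDuplicates. 6 bytes per changed read.
def flag_patch(before_path, after_path, patch_path):
    with pysam.AlignmentFile(before_path, "rb") as before, \
         pysam.AlignmentFile(after_path, "rb") as after, \
         open(patch_path, "wb") as patch:
        for i, (a, b) in enumerate(zip(before, after)):
            if a.flag != b.flag:
                patch.write(struct.pack("<IH", i, b.flag))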

So before I go ahead and sink a couple of days into making a diff for BAM files, I thought I should really check to make sure I'm not reinventing any wheels! Has anyone come across "patches" for BAM files before (or any projects that started down that path)? If so, I'd like to use their patch format.

Thank you :)

BAM • 1.8k views

You should have a look at the GAP5 format first.


Thank you for that, lh3 - there are some really interesting points made in that paper.

1) They do compression on 1024 reads at a time, whereas I currently default to 2000. Perhaps 2000 is too high and you get just as good compression ratios at 1000, but better random access. Will investigate.

2) They use LZMA2, whereas I use just standard LZMA.

3) They compress data columns individually, rather than whole reads, to improve compression performance. I've found this trick useful in the past, when I had a mix of uint32 and int16 (a position column and signal columns) - and I'm sure it would make an even bigger difference here on BAM, with its mix of ints and chars (although I've never thought about it until now) -- yeah, particularly for QNAME, where reads tend to all start with a common prefix, or for pos/refID/seq and qual too (see the sketch below).

However, this would mean splitting BAM data, and that could get... messy. Right now I don't alter the BAM data: I just realign compression blocks so they always start/end on read boundaries and recreate the header from scratch, but other than that I never touch the data. Since the first 9 fields are all fixed length, though, perhaps I could spread those out over individually compressed columns and still keep things fast-ish. Thing is, adding a new column in SQL costs bytes. Will that cost outweigh the pros of better compression? Probably, but I'll have to test to make sure, and it would be a big change to the format. On the other hand, certain operations become trivial. Counting reads in chromosome 1, for example, would be a simple "select sum(length(refID))/32 from data where refID=0". A read count in 0.001ms :) There are also bit-shift operators in SQLite, but I don't know if you can make it iterate over a blob and bit-shift as you go - if you could, counting duplicates would again be a sub-second operation.
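
To make point 3 concrete, here is the kind of quick synthetic experiment that would test it (standard library only; the values are made up, not real BAM data), comparing the same bytes compressed record-interleaved versus column-by-column:

import lzma
import random
import struct

# Synthetic stand-ins for three fixed-width per-read fields.
random.seed(0)
n = 50_000
refids = [0] * n                                                  # one chromosome
positions = sorted(random.randrange(0, 2**27) for _ in range(n))  # coordinate-sorted
flags = [1187 if random.random() < 0.1 else 1123 for _ in range(n)]

# Record-at-a-time layout: (refID, pos, flag) interleaved per read.
interleaved = b"".join(struct.pack("<iiH", r, p, f)
                       for r, p, f in zip(refids, positions, flags))

# Column-at-a-time layout: all refIDs, then all positions, then all flags.
columnar = (struct.pack(f"<{n}i", *refids)
            + struct.pack(f"<{n}i", *positions)
            + struct.pack(f"<{n}H", *flags))

print("interleaved:", len(lzma.compress(interleaved)))
print("columnar:   ", len(lzma.compress(columnar)))

The columnar layout should come out noticeably smaller, since each column is internally self-similar in a way the interleaved stream is not.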

But really, I'm not trying to replace/mess with the BAM file format or enable targeted/efficient writing of specific read data. I'm just extending the bgzip format so that you can write a new chunk in place without having to modify the other 99Gb of data. Of course you can parse a chunk, replace a read, put it all back together, then compress and re-insert it, so it has some writing capabilities - but perhaps the more I mess with the binary, the less true to a simple "bam+" it gets.
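
As a reference point for what "writing a new chunk in place" has to work with: every BGZF block records its own compressed size (BSIZE) in its gzip extra field, per the SAM/BAM spec, so the blocks of a plain BAM can be walked without decompressing anything. A sketch, assuming the BC subfield sits first in the extra field (which is how htslib writes it):

import struct

def bgzf_block_offsets(path):
    # Walk the BGZF blocks of a BAM/bgzip file, returning (offset, size)
    # pairs. BSIZE (total block size minus 1) sits at bytes 16-17 of the
    # header when the BC subfield is first in the extra field.
    offsets = []
    with open(path, "rb") as fh:
        pos = 0
        while True:
            header = fh.read(18)  # gzip header + XLEN + BC subfield
            if len(header) < 18:
                break
            bsize = struct.unpack("<H", header[16:18])[0] + 1
            offsets.append((pos, bsize))
            pos += bsize
            fh.seek(pos)
    return offsets

Rewriting a block of a raw BGZF file in place would require the replacement to compress to exactly the original size (or be padded to it); presumably the SQLite container sidesteps that by storing blocks as rows.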


I have not read the paper, but from what the author told me a long time ago, GAP5 is mutable by design. The author also thought of keeping a delta to a BAM in a separate file/database, but I don't know if he has implemented or described it.


It seems to be high-performance for certain operations, like rearranging a collection of reads (due to an indel), doing that much faster on a whole block than by, say, reallocating each read one at a time. For example, a whole block of reads is tagged as reverse strand rather than the individual reads, so if it turns out that all those reads need to be on the opposite strand, it's a single op to move them. I haven't seen anything about deltas yet, but will keep reading :)


Ok, hm, it seems that instead of storing absolute positions (or any other numerical data that correlates with the next/previous read), they store the deltas, to help the compressor out. This is another nice little trick that I could start using in some other projects, where I've found storing deltas to be roughly comparable, file-size-wise, to storing absolute values - however, I didn't check whether deltas compress better, which I suspect they do. Good to know :)
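
That suspicion is easy to sanity-check with a toy experiment (synthetic coordinate-sorted positions, standard library only):

import lzma
import random
import struct

# Compare compressing absolute sorted positions vs. their deltas.
random.seed(0)
pos = sorted(random.randrange(0, 2**27) for _ in range(100_000))
deltas = [pos[0]] + [b - a for a, b in zip(pos, pos[1:])]

absolute = struct.pack(f"<{len(pos)}i", *pos)
delta = struct.pack(f"<{len(deltas)}i", *deltas)

print("absolute:", len(lzma.compress(absolute)))
print("delta:   ", len(lzma.compress(delta)))

The delta column should compress much better: most deltas are small, so their high bytes are long runs of zeros that LZMA handles very well.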

But yeah, ultimately this isn't what I'm after. I'm after something that takes two BAM files as input (before and after some GATK step) and calculates a very small "patch" file of differences, allowing you to delete one of the two input BAMs. So, for example, you have your 100GB mapped BAM with no duplicates and your 100GB mapped BAM with base quality scores adjusted; instead of keeping both (or deleting one and potentially losing information), you make a 1GB patch, delete either the original or the recalibrated BAM, and have no loss of information at all while only storing 101GB on disk.

Put that into an SQL table, and you end up with a BAM file that has very inexpensive versioning built in :)
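
A hypothetical sketch of what that table could look like (the schema and names are illustrative, not the actual bam+ layout):

import sqlite3

# The bam+ container is an SQLite file, so patches can live in a table
# keyed by version. "markdup.patch" is the bsdiff output from earlier.
con = sqlite3.connect("my.bam+")
con.execute("""
    CREATE TABLE IF NOT EXISTS patches (
        version     INTEGER PRIMARY KEY,
        description TEXT,
        patch       BLOB
    )
""")
with open("markdup.patch", "rb") as fh:
    con.execute("INSERT INTO patches (description, patch) VALUES (?, ?)",
                ("MarkDuplicates", fh.read()))
con.commit()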
