Hello all :)
So I've almost finished the rewritable BAM file format that was first discussed here in another thread. One thing that became apparent during development is that it isn't particularly useful to overwrite the old BAM as you create the new one, because if the middle process cuts out for any reason, you'll end up with a corrupt file (or at least a half-modified one). That is the risk with a pipeline like this:
bamplus --out my.bam+ | some_process.jar I=/dev/stdin O=/dev/stdout | bamplus --in my.bam+
This isn't so much of a problem if you run
bamplus --out my.bam+ | some_process.jar I=/dev/stdin O=my.mark_duplicates.bam
cat my.mark_duplicates.bam | bamplus --in my.bam+
but the utility of the whole program is kind of wasted if you have to write a whole BAM to disk in the first place.
For this reason it seems sensible to implement a 'diff' of the before-and-after BAMs, store only that, then commit the changes to the original file in-place when you are ready. Since my.bam+ is really just an SQLite file, this is as simple as adding a new column and dumping diff data into it. As a working prototype I used bsdiff, a byte-wise diff for binary files that was originally written for BSD and has Python bindings. I also tried xdelta and xdelta3, but they were worse than bsdiff. After running bsdiff on a BAM from before and after marking duplicates, the diff was about 10% of the file size, which is obviously a lot smaller than a second BAM file (but still unacceptably large given what actually changed).
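To make the "new column" idea concrete, here is a rough sketch using Python's built-in sqlite3 module. The table and column names (chunks, data, pending_patch) and the helper functions are made up for illustration, since the real my.bam+ layout isn't described here, and apply_patch stands in for whatever patch function your diff tool provides.

```python
import sqlite3

# Hypothetical schema: the real my.bam+ layout is not shown in this post.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, data BLOB NOT NULL)")
con.execute("INSERT INTO chunks (id, data) VALUES (1, ?)", (b"original BAM bytes",))
con.commit()

# The "new column": holds a pending diff, NULL when nothing is staged.
con.execute("ALTER TABLE chunks ADD COLUMN pending_patch BLOB")


def stage_patch(con, chunk_id, patch):
    """Store a diff next to the original data without touching the data."""
    with con:  # commits on success, rolls back on error
        con.execute(
            "UPDATE chunks SET pending_patch = ? WHERE id = ?",
            (patch, chunk_id),
        )


def commit_patches(con, apply_patch):
    """Apply every staged diff in one transaction, then clear it.

    apply_patch(data, patch) -> new data; assumed to come from the diff
    tool in use (for example a bsdiff binding).
    """
    with con:  # all chunks are updated, or none are
        rows = con.execute(
            "SELECT id, data, pending_patch FROM chunks "
            "WHERE pending_patch IS NOT NULL"
        ).fetchall()
        for chunk_id, data, patch in rows:
            con.execute(
                "UPDATE chunks SET data = ?, pending_patch = NULL WHERE id = ?",
                (apply_patch(data, patch), chunk_id),
            )
```

The point of doing the commit inside a single transaction is that a crash halfway through should leave the original data intact, which is the failure mode the plain pipe can't protect against.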
I think it should be more than possible to get that down to <3%, given that a read flag is 4 bytes out of roughly 150 bytes per read (generally much more), and that's assuming EVERY read is a duplicate.
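As a rough illustration of why a field-aware patch could be much smaller than a byte-wise one, here is a sketch that records only the reads whose flag changed. The entry layout (a 4-byte read index plus a 2-byte flag, which is enough because SAM flag values fit in 16 bits) and the function names are my own assumptions, not an existing patch format, and the flags are plain Python lists rather than parsed BAM records.

```python
import struct

# Hypothetical patch entry: read index (uint32) + new flag (uint16),
# little-endian with no padding, so 6 bytes per changed read.
ENTRY = struct.Struct("<IH")


def make_flag_patch(old_flags, new_flags):
    """Return packed (index, new_flag) entries for every flag that changed."""
    if len(old_flags) != len(new_flags):
        raise ValueError("before and after must contain the same reads")
    return b"".join(
        ENTRY.pack(i, new)
        for i, (old, new) in enumerate(zip(old_flags, new_flags))
        if old != new
    )


def apply_flag_patch(flags, patch):
    """Return a copy of flags with the patched values written in."""
    patched = list(flags)
    for i, new in ENTRY.iter_unpack(patch):
        patched[i] = new
    return patched


# Toy data: the last read pair gets the SAM duplicate bit (0x400) set.
old = [99, 147, 99, 147]
new = [99, 147, 99 | 0x400, 147 | 0x400]
patch = make_flag_patch(old, new)
```

Here patch holds two entries and apply_flag_patch(old, patch) gives back new. This only covers flag changes; anything that reorders reads or rewrites tags would need a richer format, and it says nothing about how the sizes work out on real data.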
So before I go ahead and sink a couple of days into making a diff for BAM files, I thought I should really check to make sure I'm not reinventing any wheels! Has anyone come across "patches" for BAM files before (or any projects that started down that path)? If so, I'd like to use their patch format.
Thank you :)