In-place writable BAM files
1
2
Entering edit mode
6.2 years ago
John 13k

I am trying to figure out the bgzip format, but it seems bgzip from htslib has only a few of the functions I need. I know a .bam is just a binary SAM compressed with bgzip. I'm trying to get the binary blobs in their individual bgzip chunks.

Thank you :)

bgzip bam • 2.2k views
2
Entering edit mode

In case you missed this: Reading BAM files with gzip

0
Entering edit mode

Thanks genomax. That lead me to the Bio.bgzf module, and eventually these docs: http://biopython.org/DIST/docs/api/Bio.bgzf-module.html

4
Entering edit mode
6.2 years ago
John 13k

So I answered this question with the following python code:

from Bio import bgzf
myBAM = 'original.bam'
with open(myBAM,'rb') as r1:
with open(myBAM,'rb') as r2:
for x,values in enumerate(bgzf.BgzfBlocks(r1)):
with open(str(x),'wb') as w:


Why would anyone want to do that? Well, I really dislike two things - read only data, and custom file formats. The bgzip format is both of these things.

Having said that, I really like the bgzip format because the idea behind it is ingenious -- breaking up the binary BAM file and compressing it in chunks, so if you want to read chr10:100-200 you don't have to decompress the whole thing. Random read access.

Unfortunately, the implementation they chose of concatenating chunks into a single read-only file was possibly a mistake. Want to mark duplicates? Write a new file. Want to tag some reads? Write a new file. Want to convert base quality scores? Write a new file.

Instead, lets break up the BAM file just like bgzip does, but stick it in an SQLite database. We get random read and write access, still a single file, and no need for anything special other than gzip to get the data out. Here's the code (takes approx. 57 seconds to convert a regular 1Gb BAM file to an SQLite database):

#!/usr/bin/env python
import os
import sys
import zlib
import sqlite3
import argparse
from Bio import bgzf

try: import pylzma; gotlzma = True
except: gotlzma = False

## Parse user-supplied command line options:
parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter,
description="Turn a BAM file into an SQL database. Optional extra compression.")

args = parser.parse_args()
if not args.input or not args.output: print 'You must supply an input and an output!'; exit()

file_size = float(os.stat(args.input).st_size) # float so division works later on

con = sqlite3.connect(args.output, timeout=120)
con.isolation_level = 'EXCLUSIVE'
con.execute('BEGIN EXCLUSIVE')
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS output')
cur.execute('CREATE TABLE output(vpos INT PRIMARY KEY,data BLOB)')
r1 = open(args.input,'rb')
r2 = open(args.input,'rb')
rows = []

percentage = 0
print '\033[A Progress: |' + ('='*percentage) + '>' + (' '*(100-percentage)) + '|'

print ''
for x,values in enumerate(bgzf.BgzfBlocks(r2)):
if args.gzip: data = zlib.compress(zlib.decompress(data,31),9)
elif args.lzma and gotlzma: data = pylzma.compress(zlib.decompress(data,31),9)
rows.append((values[2],buffer(data)))
if x % 100 == 99:
status(values[0])
cur.executemany('INSERT INTO output VALUES (?,?)', rows)
con.commit()
rows = []
cur.executemany('INSERT INTO output VALUES (?,?)', rows)
con.commit()


\$ bam2sql.py --input original.bam --output original.sql

Result looks something like this:

pos is the VIRTUAL position of the uncompressed BAM file - same as what bgzip has in the header. I haven't returned the binary data in that query, because it would be gzipped garbage, but i've returned the length of the binary. If you wanted to, you could extend the format to include the BAM index information (usually a separate .bai file), so it's now all in one place.

But like everything in life, all these extra features come at a cost. A file size cost...

1958320 original.bam
2048208 original.sql


So it increases file size by 4.5% for a BAM file with 46798 bgzip blocks. Of course, larger block compression means less rows, means less overhead for the SQL, however, for an inplace-writeable bam file, I think it's probably worth it.

However. We're not done there. Since we no longer have to use gzip for it's ability to have multiple archives concatenated together, we could zip up however we like. Lets use LZMA.

bam2sql.py --input original.bam --output original.sql --gzip
bam2sql.py --input original.bam --output original.sql --lzma


So with LZMA, thats 94% the original file size, and with random read and write access, and all standard python modules :)

2
Entering edit mode
1
Entering edit mode

That's very interesting but I'm not sure I'm understanding everything. You don't put one read in one row of sqlite isn't it ? AS far as Inderstand, you put chunks of the BAM into sqlite ?

0
Entering edit mode

Its not reads-per-row, but gzip'd-blocks-per-row. The same compressed blocks as bgzip usually make, except instead of concatenating the chunks end-to-end to get the final .bam file, you put them into SQL. Now you can edit a chunk (decompress, read, change a bam flag or something, recompress, put back into the database) without effecting the whole file. I actually thought of it because the GATK pipeline had me re-writing files over and over, but I know that was unavoidable.

Your other link is a very good point - we already have a lot of other formats that are superior to BAM which we don't use. We have enough formats. However, my suggestion is that this isn't a change to the BAM format itself - the data still needs to be read by htslib to parse the binary uncompressed data into SAM format. This only would replace how the BAM file is compressed. Slightly larger files, but editable in-place.