Reading BAM files with gzip
2
1
Entering edit mode
7.8 years ago
jhrf ▴ 10

The SAM specification states the following:

BGZF is block compression implemented on top of the standard gzip file format....a compliant gunzip utility can decompress a BGZF compressed file

I would like be able to read a BAM file using gzip, as the specification says I should be able to. However, when I open a BAM file using the inbuilt python library gzip I am able to read the header normally but the reads still look compressed in some manner, like so:

AMCBCZ0XDZ100SMiASi??'"?Icd(ZHS2000-943_70:2:2301:14923:121232@("("("("("("("("("("("("("("("("(""%%##%'&##'((&''($'((($'($&%&'%%&&'(("'#%#&'&&$&(&((  !%!ASi?'!" !!!!""BCZ0XDZ100SMiIAd?sHS2000-905_68:3:1102:3166:478590?B ?"?"?"?"?"?"?"?"?"("("("("("("("!"%%##%''''#''!'(&'%'((%$&&(('('"'!'&(&!'&"# %&#&&$"!! "BCZ0XDZ51^1$30^CCCT$18SMiASi?'"I?d'HS2000-943_70:2:1302:17909:144833"("("?"?"?"?"?"?"?"("(!"!"?$!(!!"!"!"%%# #'#%!#& #%$###%%"!'(#'"!H0CH1CH2CBCZ0)'"IYd????????HS2000-943_70:2:1302:17909:144833@($""(!?"?"$?"?"?"??""("("("("("("("!


I can see the making of a real read in there, (i.e the read name: HS2000-943_70:2:2301:14923:121232) but most of it is illegible. Perhaps I should be reading in entire blocks and then decompressing further? I am not sure. I have also tried reading the file using the bgzf biopython module and receive the same results.

I am aware of the BioStars question here. However, I feel given the explicit declaration in the SAM file specification that the files are readable by gzip, the answers (to use pysam) was inadequate.

BAM gzip • 6.2k views
9
Entering edit mode
7.8 years ago

Well the reads are still in binary, rather than in a text format like in a SAM file (i.e., decompressing a BAM file won't give you a SAM file). The format for a BAM file is mentioned in the SAM spec., so you'll need to write an interpreter to parse the binary (it'd be much easier to just use pysam).

0
Entering edit mode

Thank you. This makes sense.

Do you have any idea how I would go about doing this? I would quite like to do this as a learning exercise.

I assume that the samtools must do this somewhere so I suppose I'll start by inspecting the relevant part of samtools.

0
Entering edit mode

You'll definitely want to have a look at the samtools C code (use the version from sourceforge rather than github, since the former is more self contained). You'll end up making a bunch of custom structures that will parallel those used in the C code (and pysam) and then using f.read(some_number) to fill in the various components.

It's good that you're only doing this as an exercise, since it'd be quicker to just rewrite pysam :)

2
Entering edit mode
7.8 years ago

Take a look at Peter Cock's SamBam2014 Biopython branch. He wrote the bgzf interface as well as some BAM/SAM classes that replace most of pysam with the exception of the pileup engine. I've been meaning to test it, as this is a pure Python BAM reader/class/writer, and should have a better object interface than pysam.

1
Entering edit mode

Peter also made sure it's compatible with py2, py3, pypy, and Jython, and wrote tests to compare output to samtools and pysam.

0
Entering edit mode

This looks interesting. Many thanks!