Reading BAM files with gzip
2
1
Entering edit mode
9.8 years ago
jhrf ▴ 10

The SAM specification states the following:

BGZF is block compression implemented on top of the standard gzip file format....a compliant gunzip utility can decompress a BGZF compressed file

I would like be able to read a BAM file using gzip, as the specification says I should be able to. However, when I open a BAM file using the inbuilt python library gzip I am able to read the header normally but the reads still look compressed in some manner, like so:

AMCBCZ0XDZ100SMiASi??'"?Icd(ZHS2000-943_70:2:2301:14923:121232@("("("("("("("("("("("("("("("("(""%%##%'&##'((&''($'((($'($&%&'%%&&'(("'#%#&'&&$&(&((  !%!ASi?'!" !!!!""BCZ0XDZ100SMiIAd?sHS2000-905_68:3:1102:3166:478590?B ?"?"?"?"?"?"?"?"?"("("("("("("("!"%%##%''''#''!'(&'%'((%$&&(('('"'!'&(&!'&"# %&#&&$"!! "BCZ0XDZ51^1$30^CCCT$18SMiASi?'"I?d'HS2000-943_70:2:1302:17909:144833"("("?"?"?"?"?"?"?"("(!"!"?$!(!!"!"!"%%# #'#%!#& #%$###%%"!'(#'"!H0CH1CH2CBCZ0)'"IYd????????HS2000-943_70:2:1302:17909:144833@($""(!?"?"$?"?"?"??""("("("("("("("!

I can see the making of a real read in there, (i.e the read name: HS2000-943_70:2:2301:14923:121232) but most of it is illegible. Perhaps I should be reading in entire blocks and then decompressing further? I am not sure. I have also tried reading the file using the bgzf biopython module and receive the same results.

I am aware of the BioStars question here. However, I feel given the explicit declaration in the SAM file specification that the files are readable by gzip, the answers (to use pysam) was inadequate.

Thanks for reading!

BAM gzip • 7.4k views
ADD COMMENT
9
Entering edit mode
9.8 years ago

Well the reads are still in binary, rather than in a text format like in a SAM file (i.e., decompressing a BAM file won't give you a SAM file). The format for a BAM file is mentioned in the SAM spec., so you'll need to write an interpreter to parse the binary (it'd be much easier to just use pysam).

ADD COMMENT
0
Entering edit mode

Thank you. This makes sense.

Do you have any idea how I would go about doing this? I would quite like to do this as a learning exercise.

I assume that the samtools must do this somewhere so I suppose I'll start by inspecting the relevant part of samtools.

ADD REPLY
0
Entering edit mode

You'll definitely want to have a look at the samtools C code (use the version from sourceforge rather than github, since the former is more self contained). You'll end up making a bunch of custom structures that will parallel those used in the C code (and pysam) and then using f.read(some_number) to fill in the various components.

It's good that you're only doing this as an exercise, since it'd be quicker to just rewrite pysam :)

ADD REPLY
2
Entering edit mode
9.8 years ago

Take a look at Peter Cock's SamBam2014 Biopython branch. He wrote the bgzf interface as well as some BAM/SAM classes that replace most of pysam with the exception of the pileup engine. I've been meaning to test it, as this is a pure Python BAM reader/class/writer, and should have a better object interface than pysam.

ADD COMMENT
1
Entering edit mode

Peter also made sure it's compatible with py2, py3, pypy, and Jython, and wrote tests to compare output to samtools and pysam.

ADD REPLY
0
Entering edit mode

This looks interesting. Many thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1494 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6