Question: Reading BAM files with gzip
1
gravatar for jhrf
5.1 years ago by
jhrf10
United Kingdom
jhrf10 wrote:

The SAM specification states the following:

BGZF is block compression implemented on top of the standard gzip file format....a compliant gunzip utility can
decompress a BGZF compressed file

I would like be able to read a BAM file using gzip, as the specification says I should be able to. However, when I open a BAM file using the inbuilt python library gzip I am able to read the header normally but the reads still look compressed in some manner, like so:

 AMCBCZ0XDZ100SMiASi??'"?Icd(ZHS2000-943_70:2:2301:14923:121232@("("("("("("("("("("("("("("("("(""%%##%'&##'((&''($'((($'($&%&'%%&&'(("'#%#&'&&$&(&((  !%!ASi?'!" !!!!""BCZ0XDZ100SMiIAd?sHS2000-905_68:3:1102:3166:478590?B ?"?"?"?"?"?"?"?"?"("("("("("("("!"%%##%''''#''!'(&'%'((%$&&(('('"'!'&(&!'&"# %&#&&$"!! "BCZ0XDZ51^1$30^CCCT$18SMiASi?'"I?d'HS2000-943_70:2:1302:17909:144833"("("?"?"?"?"?"?"?"("(!"!"?$!(!!"!"!"%%# #'#%!#& #%$###%%"!'(#'"!H0CH1CH2CBCZ0)'"IYd????????HS2000-943_70:2:1302:17909:144833@($""(!?"?"$?"?"?"??""("("("("("("("! 

I can see the making of a real read in there,  (i.e the read name: HS2000-943_70:2:2301:14923:121232) but most of it is illegible. Perhaps I should be reading in entire blocks and then decompressing further? I am not sure. I have also tried reading the file using the bgzf biopython module and receive the same results.

Read A .Bam File With Biopython Module Bgzf Without Samtools. However, I feel given the explicit declaration in the SAM file specification that the files are readable by gzip, the answers (to use pysam) was inadequate.

Thanks for reading!

 

 

gzip bam • 4.1k views
ADD COMMENTlink modified 5.1 years ago by Matt Shirley9.1k • written 5.1 years ago by jhrf10
8
gravatar for Devon Ryan
5.1 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

Well the reads are still in binary, rather than in a text format like in a SAM file (i.e., decompressing a BAM file won't give you a SAM file). The format for a BAM file is mentioned in the SAM spec., so you'll need to write an interpreter to parse the binary (it'd be much easier to just use pysam).

ADD COMMENTlink written 5.1 years ago by Devon Ryan92k

Thank you. This makes sense.

Do you have any idea how I would go about doing this? I would quite like to do this as a learning exercise.

I assume that the samtools must do this somewhere so I suppose I'll start by inspecting the relevant part of samtools.

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by jhrf10

You'll definitely want to have a look at the samtools C code (use the version from sourceforge rather than github, since the former is more self contained). You'll end up making a bunch of custom structures that will parallel those used in the C code (and pysam) and then using f.read(some_number) to fill in the various components.

It's good that you're only doing this as an exercise, since it'd be quicker to just rewrite pysam :)

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by Devon Ryan92k
2
gravatar for Matt Shirley
5.1 years ago by
Matt Shirley9.1k
Cambridge, MA
Matt Shirley9.1k wrote:

Take a look at Peter Cock's SamBam2014 Biopython branch. He wrote the bgzf interface as well as some BAM/SAM classes that replace most of pysam with the exception of the pileup engine. I've been meaning to test it, as this is a pure Python BAM reader/class/writer, and should have a better object interface than pysam.

ADD COMMENTlink written 5.1 years ago by Matt Shirley9.1k
1

Peter also made sure it's compatible with py2, py3, pypy, and Jython, and wrote tests to compare output to samtools and pysam.

ADD REPLYlink written 5.1 years ago by Matt Shirley9.1k

This looks interesting. Many thanks!

ADD REPLYlink written 5.1 years ago by jhrf10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 799 users visited in the last hour