Sam To Bam - Loss Of Data Or Just Great Compression?
7.8 years ago
Clare ▴ 170

I use Picard tools to take a BWA alignment SAM file and convert it into a sorted BAM file. Normally this works well, but for a small number of samples I am getting VERY small BAM files, e.g. SAM file = 42G, BAM file = 831M. Samtools produces the same BAM file size. If I take the BAM and convert it back to SAM, the 42G file is reproduced.

I'm confused as to why the BAM file is so small when, for the majority of other samples, the BAM file is ~1/4 of the size of the SAM file, i.e. it should be about 10G here.

I'm using picard 1.77 and this command:

java -Xmx${JAVMEM} -jar ${pic_dir}/SortSam.jar SO=coordinate INPUT=${out_dir}/"4_"${SAMPLE_ABB}"_BWA_pe12.sam" OUTPUT=${out_dir}/"5_"${SAMPLE_ABB}"_BWA_pe12.bam" VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true MAX_RECORDS_IN_RAM=500000 TMP_DIR=${tmp_dir}

sam bam picard • 11k views
1

I would investigate this file in more detail. A compression ratio of about 50-fold is very surprising, so much so that it makes me suspect there is little useful information in your file; otherwise it wouldn't compress so well.
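For reference, the ratio can be worked out from the sizes quoted in the question. This is only a back-of-the-envelope sketch; it assumes "G" and "M" are binary units (GiB and MiB), which the question does not state.

```python
# Sizes as quoted in the question. Treating G/M as GiB/MiB is an assumption;
# with decimal units the ratio comes out slightly lower but still around 50.
sam_mib = 42 * 1024   # 42G SAM file expressed in MiB
bam_mib = 831         # 831M BAM file

ratio = sam_mib / bam_mib

# Size the BAM would have if this sample compressed ~4x like the other samples.
expected_bam_gib = 42 / 4

print(f"observed compression: {ratio:.1f}x")
print(f"expected BAM size at 4x: {expected_bam_gib:.1f} GiB")
```

So the observed ratio is more than ten times the ~4x seen for the other samples.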

0

Agreed. I was thinking it might be a highly targeted experiment, where they got 10000x depth on a very small number of regions. You'd expect those results to be highly compressible, since many of the sequences would be identical.
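To illustrate why identical sequences compress so well, here is a small sketch using Python's zlib module. BAM is stored as BGZF, which is built on the same DEFLATE compression, but this toy example only compresses bare sequence strings rather than real BAM records, so the exact numbers are not comparable to a real file. The read length and read count are arbitrary choices.

```python
import random
import zlib

random.seed(0)  # fixed seed so repeated runs give the same reads


def random_read(length=100):
    """Return a random DNA string; a stand-in for a real read."""
    return "".join(random.choice("ACGT") for _ in range(length))


def compression_ratio(reads):
    """Uncompressed size divided by DEFLATE-compressed size."""
    raw = "\n".join(reads).encode("ascii")
    return len(raw) / len(zlib.compress(raw, 6))


n_reads = 5000
diverse = [random_read() for _ in range(n_reads)]   # every read different
template = random_read()
redundant = [template] * n_reads                    # extreme case: all identical

diverse_ratio = compression_ratio(diverse)
redundant_ratio = compression_ratio(redundant)

print(f"diverse reads:   {diverse_ratio:.1f}x")
print(f"redundant reads: {redundant_ratio:.1f}x")
```

Real data sits somewhere between these two extremes, and coordinate sorting helps because it places duplicate reads next to each other within the same compressed block.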

0

@Clare Typically, a BAM file is about a quarter of the size of the corresponding SAM file, as you observed for your other samples. The size of the final BAM file depends on the number of reads and the compression algorithm. How many reads do you have in this file?

3
7.8 years ago

If you can recreate the original SAM file from the BAM file, then clearly all the information is there and the answer is compression. The scale of compression depends on the content and the algorithm. A BAM full of identical (or highly similar) reads is more compressible. In fact, I'd wager that the read duplication rate was pretty high in your data.
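One rough way to test that wager is to count how many mapped reads share the same reference name, position and strand in the SAM file. The sketch below is a crude proxy only: the function name is made up for illustration, and it is not how Picard MarkDuplicates defines duplicates (that tool also takes clipping and mate information into account). It also keeps one counter entry per distinct position, so it could need a lot of memory on a very large file.

```python
from collections import Counter


def crude_duplicate_fraction(sam_lines):
    """Fraction of mapped primary reads that share (RNAME, POS, strand)
    with an earlier read. Hypothetical helper, a rough proxy only."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete SAM record
            continue
        flag = int(fields[1])
        if flag & 0x4:                    # read is unmapped
            continue
        if flag & 0x900:                  # secondary or supplementary alignment
            continue
        key = (fields[2], int(fields[3]), bool(flag & 0x10))
        counts[key] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total


# Toy records (made up): three reads at chr1:100 forward, one at chr1:500 reverse.
toy = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t16\tchr1\t500\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
toy_fraction = crude_duplicate_fraction(toy)
print(toy_fraction)
```

An open file handle can be passed in place of the list, since the function only iterates over lines.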

0

+1, ditto. My immediate thought was high read redundancy. If you can convert SAM<->BAM in both directions, it must be OK.
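If matching file sizes are not convincing enough, the round trip can be checked record by record. The sketch below (function names are made up for illustration) compares the alignment records of two SAM files as unordered collections, because the original is unsorted and the round-tripped file is coordinate-sorted. Header lines are skipped since sorting changes them. It assumes the tools write each record back unchanged; if a tool normalises any field, the comparison would report a mismatch even though nothing was lost.

```python
import hashlib
from collections import Counter


def record_digests(sam_lines):
    """Count MD5 digests of alignment records, ignoring headers and blank lines."""
    digests = Counter()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        record = line.rstrip("\n").encode("utf-8")
        digests[hashlib.md5(record).digest()] += 1
    return digests


def same_records(sam_a_lines, sam_b_lines):
    """True if both inputs hold the same records, in any order."""
    return record_digests(sam_a_lines) == record_digests(sam_b_lines)


# Toy example (made-up records): same reads, different order and header.
original = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r2\t0\tchr1\t500\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
round_tripped = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t500\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(same_records(original, round_tripped))
```

Hashing each record keeps memory use lower than storing whole lines, although a 42G file would still produce a large counter.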