BAM compression: .tar.gz = same size as before?
4.0 years ago
Marvin ▴ 190

I tried to compress 5 BAM files using:

tar -czvf original_bams.tar.gz *.bam


The resulting file sizes ("ll --block-size=M") are:

8067M file1.bam
6962M file2.bam
10662M file3.bam
7794M file4.bam
7346M file5.bam
40828M original_bams.tar.gz


There's a difference of only 3 MB between the archive and the sum of the BAM file sizes. Is this expected? I know that there is CRAM (which I will turn to next), but I'm surprised to see that good old .tar.gz has essentially no effect.
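The 3 MB figure can be checked by summing the sizes listed above. This is only a quick arithmetic sketch; the variable names are made up for illustration.

```shell
# Sizes in MB, copied from the listing above.
sum_mb=$((8067 + 6962 + 10662 + 7794 + 7346))
archive_mb=40828

# Report the total of the BAM files and how much smaller the archive is.
echo "sum=${sum_mb}M archive=${archive_mb}M diff=$((sum_mb - archive_mb))M"
```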

bam compression tar gz • 4.4k views
1

CRAM is good for archival purposes. It can take ~24 hours to create a CRAM file from a ~30GB BAM file, and the result will probably be ~60% of the size of the BAM. Check whether your BAM files have binned quality scores, and if not, try to bin them while creating the CRAM; that will have a nontrivial impact on the size.

0

That seems like a really long time. Do you have benchmarks?

0

Not really. I was running trials in which I converted a really small BAM file and a large BAM file to check compression ratios.

0

You'll actually get better compression by converting them to sam.gz (or better yet, sam.bz2), and the process is quite fast using pigz/pbzip2.
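As a rough sketch of that conversion, assuming samtools and pigz are installed: the function name bam_to_sam_gz is hypothetical, and nothing runs until you call it on a real file.

```shell
# Hypothetical helper: rewrite one BAM as gzip-compressed SAM text.
bam_to_sam_gz() {
    bam="$1"
    # -h keeps the header so the SAM can be converted back to BAM later.
    # pigz compresses on multiple cores; swap in pbzip2 to get .sam.bz2 instead.
    samtools view -h "$bam" | pigz > "${bam%.bam}.sam.gz"
}
```

Usage would be something like `bam_to_sam_gz file1.bam`, which writes file1.sam.gz next to the input. Compare sizes on one file before converting everything.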

7
4.0 years ago
Benn 8.1k

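BAM files are already compressed: a BAM is essentially a SAM file stored in BGZF, a gzip-compatible block format. Compressing them again with gzip doesn't make sense, because the data has almost no redundancy left to remove.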
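The effect is easy to reproduce without real BAMs. The sketch below uses random A/C/G/T text as a stand-in for sequence data (an assumption for illustration, not real reads) and gzips it twice; the second pass should leave the size essentially unchanged.

```shell
# Stand-in data: random A/C/G/T characters filtered out of /dev/urandom.
tmpdir=$(mktemp -d)
head -c 16000000 /dev/urandom | LC_ALL=C tr -dc 'ACGT' > "$tmpdir/reads.txt"

# First pass: plain text shrinks a lot.
gzip -c "$tmpdir/reads.txt" > "$tmpdir/reads.txt.gz"

# Second pass: input is already compressed, so there is little left to gain.
gzip -c "$tmpdir/reads.txt.gz" > "$tmpdir/reads.txt.gz.gz"

raw_size=$(wc -c < "$tmpdir/reads.txt" | tr -d ' ')
once_size=$(wc -c < "$tmpdir/reads.txt.gz" | tr -d ' ')
twice_size=$(wc -c < "$tmpdir/reads.txt.gz.gz" | tr -d ' ')
echo "raw=$raw_size once=$once_size twice=$twice_size"
```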

2

This. If you do want to make a single-file archive of them, just use tar rather than tar.gz.
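A minimal sketch of the plain-tar variant, using small placeholder files rather than real BAMs (the file contents and temp directory are made up for the demo). The only change from the original command is dropping -z.

```shell
# Placeholder files standing in for BAMs; random bytes, not valid BAM data.
workdir=$(mktemp -d)
head -c 100000 /dev/urandom > "$workdir/file1.bam"
head -c 100000 /dev/urandom > "$workdir/file2.bam"

# -c creates the archive, -f names it; with no -z there is no gzip pass.
(cd "$workdir" && tar -cf original_bams.tar file1.bam file2.bam)

# List the archive to confirm both files are inside.
member_count=$(tar -tf "$workdir/original_bams.tar" | wc -l | tr -d ' ')
echo "archive contains $member_count files"
```

On the real data this would be `tar -cvf original_bams.tar *.bam`.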