BAM files compression
5
0
17 months ago
User000 ▴ 460

Hello,

I have a lot of BAM files (nearly 500), each 10GB. In total my data occupies 7T. I know BAM files are already compressed. Does it make sense to compress the ones I do not use into a single tgz file? Or into some other format?

bam • 2.7k views
1

Compressing BAMs with gzip will not be worth the effort: the time spent compressing and decompressing them will cost more than the space you end up saving.

Look for archival solutions. If you're dealing with 500 BAMs, you are most probably working for an institution that has an HPC cluster with storage and archival options.

1

Not many tools accept CRAM input, so if you ever need to do anything with these files they will have to be converted back. Take that into account when making the decision.

0

I did the maths on how long it takes for the reduction in AWS standard S3 storage charges to pay back the AWS CPU cost (based on a spot price from some arbitrary time ago) of a BAM to CRAM conversion. At that point it came to around 1 day! It would obviously take longer for cheaper storage tiers.

I didn't do the reverse costs - CRAM to BAM - but it'll be a similar order of magnitude.
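The break-even reasoning above can be sketched as a small shell function. This is only an illustration: the function name `break_even_days` and every number passed to it are hypothetical placeholders, not real AWS prices or measured conversion costs.

```shell
# Break-even time (in days) for a BAM -> CRAM conversion.
# Args: conversion cost ($), GB saved, storage price ($ per GB per month).
# All values used below are made-up placeholders, not real prices.
break_even_days() {
    awk -v cost="$1" -v saved_gb="$2" -v price="$3" 'BEGIN {
        daily_saving = saved_gb * price / 30   # approximate a month as 30 days
        printf "%.1f\n", cost / daily_saving
    }'
}

# Hypothetical example: $0.01 of CPU time, 3 GB saved, $0.02 per GB-month
break_even_days 0.01 3 0.02
```

Plugging in your own spot price and storage tier gives the payback period for your setup; a cheaper storage tier lowers the daily saving and so lengthens it.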

If you absolutely must keep the BAM format, it's always possible to uncompress the files first (zcat in.bam > in.u.bam) and then recompress them using another tool with a far superior compression ratio, such as bsc or mcm. The result will likely still be considerably larger than CRAM, though, and it will take considerably longer. The process can be reversed, ending with bgzip to recompress the BAM.

0

Do I have to use a reference genome as well? Something like samtools view -T ref.fa -C -o file.cram file.bam? Or is it possible to avoid it?

0

No, you cannot avoid it; you have to specify a reference genome.
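For a few hundred files, the samtools command from the question can be wrapped in a loop. The following is a minimal sketch, assuming samtools is on the PATH and that all BAMs were aligned to the same reference; the function names `cram_name` and `convert_dir` and the example paths are hypothetical.

```shell
#!/bin/sh
# Sketch: convert every BAM in a directory to CRAM against one reference FASTA.
# Function names and paths are illustrative only.

# Derive the CRAM name from a BAM name: sample.bam -> sample.cram
cram_name() {
    printf '%s\n' "${1%.bam}.cram"
}

# Convert all *.bam files in directory $2 using reference $1.
# The input BAMs are left in place.
convert_dir() {
    ref=$1
    dir=$2
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue   # skip if the glob matched nothing
        samtools view -T "$ref" -C -o "$(cram_name "$bam")" "$bam" || return 1
    done
}

# Usage (not run here): convert_dir ref.fa /path/to/bams
```

The loop stops at the first failed conversion so that a problem with the reference is noticed early rather than after 500 attempts.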

0

When I use the command line above, will it create the CRAM file and delete the BAM? Thanks to all, your help was very useful!

0

No, it won't. Well-written tools (actually, all non-destructive tools) don't overwrite or delete input files.
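Since the BAM is left in place, removing it is a separate step, and it may be worth checking the CRAM first. Below is a minimal sketch that deletes a BAM only when `samtools view -c` reports the same record count for both files. The function name `remove_bam_if_verified` and its arguments are hypothetical, and matching counts are only a cheap sanity check, not proof that the contents are identical.

```shell
# Sketch: delete a BAM only after its CRAM copy reports the same record count.
# Args: reference FASTA, BAM path, CRAM path (all names are illustrative).
remove_bam_if_verified() {
    ref=$1
    bam=$2
    cram=$3
    n_bam=$(samtools view -c "$bam") || return 1
    n_cram=$(samtools view -c -T "$ref" "$cram") || return 1
    if [ "$n_bam" = "$n_cram" ]; then
        rm -- "$bam"
    else
        echo "count mismatch for $bam: $n_bam vs $n_cram" >&2
        return 1
    fi
}

# Usage (not run here): remove_bam_if_verified ref.fa file.bam file.cram
```

A stricter check would compare the decoded records themselves, at the cost of reading both files in full.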

0

I see that the CRAM lossless compression reduces the BAM size from 1.7 to 1.4, so from 10 T I'll go to 7-6T. I guess this is the maximum compression? How do you guys archive the BAM files in your clusters?
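As a rough check of that estimate, the per-file ratio can be scaled to the whole archive. This sketch assumes the 1.7 to 1.4 reduction seen on one file holds for all of them, which may not be the case; the function name `estimate_total` is hypothetical.

```shell
# Scale a total archive size by an observed per-file compression ratio.
# Args: size before, size after (same unit), total archive size.
estimate_total() {
    awk -v before="$1" -v after="$2" -v total="$3" \
        'BEGIN { printf "%.1f\n", total * after / before }'
}

# 1.7 -> 1.4 on one file, applied to a 10T archive
estimate_total 1.7 1.4 10
```

Under that assumption the 10T archive would come out at roughly 8.2T, so reaching 6-7T would need a better ratio than the one observed.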

3
17 months ago

Or any other format?

2
17 months ago
JC 12k

As you pointed out, the files are already compressed, so tgz is not helpful. However, you can convert them to CRAM or use another reference-based method to reduce file size.

2
17 months ago
jkbonfield ▴ 670

CRAM generation is actually faster than BAM generation in samtools, at least at the default compression levels. CRAM decoding is slower than BAM though unless you're I/O bound, in which case CRAM will be faster due to being smaller.

See https://github.com/samtools/www.htslib.org/pull/23/commits/6a123b6aa7e677c899799cf615b6ca27659193d0 (not merged yet sadly) for some modern benchmarks.

For archival, you have to be certain the reference will be around for as long as the archive too. Either cache a copy of it with your files or use the embedded reference mode of CRAM. You can do this with

samtools view -O cram,embed_ref in.bam -o out.cram

1
17 months ago
Rm 8.1k

CRAM format is one option for you; it offers significantly better lossless compression than BAM.

1
4 days ago
Divon ▴ 10

You might also want to try Genozip, which compresses both BAM and CRAM files (as well as FASTQ, VCF etc): www.genozip.com.

Full disclosure: I am the developer of Genozip

0

Is genozip a commercial tool? I ask because of the .com URL. Is it FOSS?

0

It is not FOSS, but it is free for non-commercial use, and the source code is available on GitHub.