File formats that compress raw/aligned reads and variant tables more tightly have been around for some time, but I think adoption has been slow.
Our current disk space usage has us taking another look at switching to file formats that offer better compression than vanilla FASTQ, BAM and BCF:
- CRAM instead of BAM
- CRAM (unmapped) instead of FASTQ
- uBAM (unmapped BAM) instead of FASTQ
- DRAGEN ORA (from Illumina / Enancio) instead of FASTQ
- spVCF instead of VCF/BCF
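For reference, the BAM to CRAM switch we have in mind would look roughly like the sketch below. It only builds and prints the `samtools view` command line rather than running it; the file names `sample.bam`, `reference.fa` and `sample.cram` are placeholders, and the helper `bam_to_cram_command` is made up for illustration.

```python
import shlex


def bam_to_cram_command(bam_path, reference_fasta, cram_path):
    """Build a samtools command that rewrites a BAM file as CRAM.

    -C selects CRAM output, -T names the reference FASTA used for
    reference-based compression, -o names the output file.
    """
    return [
        "samtools", "view",
        "-C",
        "-T", reference_fasta,
        "-o", cram_path,
        bam_path,
    ]


# Placeholder file names, not real data.
cmd = bam_to_cram_command("sample.bam", "reference.fa", "sample.cram")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. Decoding such a CRAM later normally needs access to the same reference again.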
At least these aspects are important when considering new file formats:
- compression factor / file size reduction to be gained
- lossy or lossless
- still biologically meaningful
- technically compatible with current pipelines and tools (e.g. bwa/gatk/bcftools, IGV)
- open (source) file format / API specification
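To make the first aspect concrete, this is a minimal sketch of how we would compute compression factor and file size reduction from two file sizes. The function name `compression_stats` and the byte counts in the example are hypothetical, not measurements.

```python
def compression_stats(original_bytes, compressed_bytes):
    """Return (compression factor, size reduction in percent)."""
    if original_bytes <= 0 or compressed_bytes <= 0:
        raise ValueError("file sizes must be positive")
    # Factor: how many times smaller the new file is.
    factor = original_bytes / compressed_bytes
    # Reduction: share of the original size that is saved.
    reduction_pct = 100.0 * (1.0 - compressed_bytes / original_bytes)
    return factor, reduction_pct


# Hypothetical sizes: a 100-byte original stored in 25 bytes.
factor, reduction_pct = compression_stats(100, 25)
print(f"factor {factor:.2f}x, reduction {reduction_pct:.1f}%")
```

For real files the sizes could come from `os.path.getsize()` on the original and the converted file.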
We care most about improved compression / reduced file size for the FASTQ and BAM files, and less about improved compression for BCF.
Did you / your organization already make the switch to file formats that offer better compression than vanilla FASTQ/BAM/BCF?
How did the switch turn out, for example with respect to the aspects listed above?
Relevant external blog post and benchmark: