Forum: 2021: state and usage of compressed file standards better than BAM and FASTQ
12 weeks ago
William ★ 4.9k

Extra-compressed formats for raw/aligned reads and variant tables have been around for some time but, I think, have seen slow adoption.

Our current disk space usage is making us have another look at switching to file formats that offer better compression than vanilla FASTQ, BAM and BCF.

For example:

  • CRAM instead of BAM
  • CRAM (unmapped) instead of FASTQ
  • uBAM (unmapped BAM) instead of FASTQ
  • DRAGEN ORA (from Illumina/Enancio) instead of FASTQ
  • spVCF instead of VCF/BCF
  • etc.

At least these aspects are important when considering new file formats:

  • compression factor / file size reduction to be gained
  • lossy or lossless
  • biologically still meaningful
  • technically compatible with current pipelines and tools (e.g. bwa/gatk/bcftools, IGV)
  • open (source) file format / API specification

We care most about improved compression / reduced file size for the FASTQ and BAM files, and less about improved compression for BCF.
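A quick way to get a feel for the first aspect (compression factor) is to measure general-purpose compressors on a FASTQ sample as a baseline before evaluating the domain-specific formats. The sketch below uses only the Python standard library and synthetic reads; real evaluations should of course use actual data and the real tools (samtools for CRAM, Illumina's converter for ORA, etc.).

```python
# Toy benchmark: compression factor of stdlib compressors on synthetic
# FASTQ records. Only illustrates the "compression factor" metric itself;
# random sequence compresses worse than real reads would.
import gzip, bz2, lzma, random

random.seed(0)

def fake_fastq(n_reads=2000, read_len=100):
    recs = []
    for i in range(n_reads):
        seq = "".join(random.choice("ACGT") for _ in range(read_len))
        qual = "".join(random.choice("FF:F,") for _ in range(read_len))
        recs.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(recs).encode()

raw = fake_fastq()
for name, comp in [("gzip", gzip.compress), ("bz2", bz2.compress),
                   ("lzma", lzma.compress)]:
    factor = len(raw) / len(comp(raw))
    print(f"{name}: {factor:.2f}x")
```

Running the same loop on a real FASTQ (and adding the candidate tools' output sizes) gives directly comparable "N×" figures for the table of aspects above.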

Did you / your organization already make the switch to file formats that offer better compression than vanilla FASTQ/BAM/BCF?

How did this switch turn out, looking for example at the aspects listed above?

Relevant external blog post and benchmark:

https://www.ga4gh.org/news/guest-post-seven-myths-about-cram-the-community-standard-for-genomic-data-compression/

http://www.htslib.org/benchmarks/CRAM.html


The Illumina/Enancio format (fastq.ora) offers roughly 5× extra compression over fastq.gz according to their FAQ. But the format is closed, I think, and you have to rely on Illumina keeping their converter available and free to use. https://www.illumina.com/company/about-us/mergers-acquisitions/enancio.html/


Illumina will have a vested interest in keeping this format supported for a long time (should it catch on), since they are in the business of selling sequencers. The problem is whether other technologies adopt it or go their own way; that is when competing formats will start causing additional headaches for end users.


I don't understand why Illumina doesn't just publish the format, since indeed they are in the business of selling sequencing machines/kits, and an open format would improve adoption, i.e. fastq.ora becoming an open industry standard. The format does not seem too difficult to copy or improve upon either: quick-and-dirty mapping against a reference, then encoding the difference (or exact match) of each read vs. the reference.


Illumina just spent (a good bit of money?) acquiring that technology. They are also the dominant player in the market, so perhaps they don't have an immediate need to make the technology (my guess is it is not a simple format) public. Perhaps if a competitor announces an open, comparable technology (and if it looks like it may start getting adopted), they will face some pressure. In any case, they are making decompressors available for free, so end users are not locked out of their data.


Ion Torrent uses uBAM instead of FASTQ. We actually lose information if uBAM is converted to FASTQ.

Edit: I haven't used this tool, but it seems to offer lossless BAM-to-CRAM conversion for Ion Torrent reads.
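The information loss mentioned above comes from the record layouts: FASTQ only has a name, sequence and qualities, while a (u)BAM record also carries auxiliary tags, which is where Ion Torrent stores flow-space data. A minimal sketch (toy record type; real code would use pysam, and the tag name here is just an example):

```python
# Why uBAM -> FASTQ can be lossy: FASTQ has no field for BAM aux tags,
# so anything stored there (e.g. Ion Torrent flow signals) is dropped.
from dataclasses import dataclass, field

@dataclass
class BamRecord:
    name: str
    seq: str
    qual: str
    tags: dict = field(default_factory=dict)  # aux tags (tag -> value)

def to_fastq(rec: BamRecord) -> str:
    # Only name/seq/qual fit into FASTQ; rec.tags are silently lost.
    return f"@{rec.name}\n{rec.seq}\n+\n{rec.qual}\n"

rec = BamRecord("read1", "ACGT", "FFFF", tags={"ZM": [102, 0, 98, 5]})
fq = to_fastq(rec)
assert "ZM" not in fq  # the flow-signal tag did not survive conversion
```

Going back from that FASTQ to uBAM cannot restore the tags, which is why the conversion is one-way lossy for such platforms.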

12 weeks ago
shelkmike ▴ 540

My lab uses ALAPY Compressor to compress FASTQ. It compresses FASTQ approximately 2 times better than gzip does. There was a discussion of ALAPY Compressor on BioStars: Lossless ALAPY Fastq Compressor (now for MacOS X with 10-20% improved speed and compression ratio)


The compression seems similar to unmapped CRAM (2×), but the ALAPY organization (at least the website referenced from their GitHub page) no longer exists. That seems a bit risky for a long-term storage format.


I agree. However, even though their organisation no longer exists, they did publish their compressor on GitHub.

12 weeks ago
GenoMax 104k

Every sequencing center/large lab runs into this question in their life cycle :-)

At some point you have to accept that if you wish to store raw data long term, planning for the additional expense is part of the "cost of doing business". While a new standard may emerge over time (and it will need to be accepted across manufacturers), converting past data to that standard may turn out to be an exercise in futility; such a solution would only be useful going forward from that point.

If you are generating data that can be (or eventually be made) public, then you should consider putting it in public repositories like SRA/ENA/DDBJ and allowing those organizations to maintain a copy. Warning: SRA has even proposed post-processing data to reduce its size. Otherwise, see my comment above.

Eventually we will reach a point of diminishing returns where it is simply more cost-effective to regenerate the data as needed, provided the samples remain available. For non-replaceable samples you will just need to store the data.

