Forum: 2021: state and usage of compressed file standards better than BAM and FASTQ
12 weeks ago
William ★ 4.9k

Extra-compressed formats for raw/aligned reads and variant tables have been around for some time but, I think, have seen slow adoption.

Our current disk space usage is making us have another look at switching to file formats that offer better compression than vanilla FASTQ, BAM and BCF.

For example:

  • CRAM instead of BAM
  • CRAM (unmapped) instead of FASTQ
  • uBAM (unmapped BAM) instead of FASTQ
  • DRAGEN ORA (from Illumina/Enancio) instead of FASTQ
  • spVCF instead of VCF/BCF
  • etc.

At least these aspects are important when considering new file formats:

  • compression factor / file size reduction to be gained
  • lossy or lossless
  • biologically still meaningful
  • technically compatible with current pipelines and tools (e.g. bwa/gatk/bcftools, IGV)
  • open (source) file format / API specification

We care most about improved compression / reduced file size for the FASTQ and BAM files, and less about improved compression for BCF.
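A quick way to get a feel for the first aspect (compression factor) is to measure general-purpose compressors on a FASTQ sample as a baseline before evaluating the domain-specific formats. The sketch below uses only the Python standard library and synthetic reads; real evaluations should of course use actual data and the real tools (samtools for CRAM, Illumina's converter for ORA, etc.).

```python
# Toy benchmark: compression factor of stdlib compressors on synthetic
# FASTQ records. Only illustrates the "compression factor" metric itself;
# random sequence compresses worse than real reads would.
import gzip, bz2, lzma, random

random.seed(0)

def fake_fastq(n_reads=2000, read_len=100):
    recs = []
    for i in range(n_reads):
        seq = "".join(random.choice("ACGT") for _ in range(read_len))
        qual = "".join(random.choice("FF:F,") for _ in range(read_len))
        recs.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(recs).encode()

raw = fake_fastq()
for name, comp in [("gzip", gzip.compress), ("bz2", bz2.compress),
                   ("lzma", lzma.compress)]:
    factor = len(raw) / len(comp(raw))
    print(f"{name}: {factor:.2f}x")
```

Running the same loop on a real FASTQ (and adding the candidate tools' output sizes) gives directly comparable "N×" figures for the table of aspects above.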

Did you / your organization already make the switch to file formats that offer better compression than vanilla FASTQ/BAM/BCF?

How did this switch turn out, looking for example at the aspects listed above?

Relevant external blog post and benchmark:

https://www.ga4gh.org/news/guest-post-seven-myths-about-cram-the-community-standard-for-genomic-data-compression/

http://www.htslib.org/benchmarks/CRAM.html


The Illumina/Enancio format (fastq.ora) offers roughly 5× extra compression over fastq.gz according to their FAQ. But the format is closed, I think, and you have to rely on Illumina keeping their converter available and free to use. https://www.illumina.com/company/about-us/mergers-acquisitions/enancio.html/


Illumina will have a vested interest in keeping this format supported for a long time (should it catch on), since they are in the business of selling sequencers. The problem is whether other technologies adopt it or go their own way; that is when competing formats will start causing additional headaches for end users.


I don't understand why Illumina doesn't just publish the format, since indeed they are in the business of selling sequencing machines/kits, and an open format would improve adoption, i.e. fastq.ora becoming an open industry standard. The format does not seem too difficult to copy or improve upon either: quick-and-dirty mapping against a reference, then encoding the difference (or exact match) of each read vs. the reference.


Illumina just spent (a good bit of money?) acquiring that technology. They are also the dominant player in the market, so perhaps they don't have an immediate need to make the technology (my guess is it is not a simple format) public. Perhaps if a competitor announces an open, comparable technology (and if it looks like it may start getting adopted), they will face some pressure. In any case, they are making decompressors available for free, so end users are not locked out of their data.


Ion Torrent uses uBAM instead of FASTQ. We actually lose information if uBAM is converted to FASTQ.

Edit: I haven't used this tool, but it seems to offer lossless BAM-to-CRAM conversion for Ion Torrent reads.
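The information loss mentioned above comes from the record layouts: FASTQ only has a name, sequence and qualities, while a (u)BAM record also carries auxiliary tags, which is where Ion Torrent stores flow-space data. A minimal sketch (toy record type; real code would use pysam, and the tag name here is just an example):

```python
# Why uBAM -> FASTQ can be lossy: FASTQ has no field for BAM aux tags,
# so anything stored there (e.g. Ion Torrent flow signals) is dropped.
from dataclasses import dataclass, field

@dataclass
class BamRecord:
    name: str
    seq: str
    qual: str
    tags: dict = field(default_factory=dict)  # aux tags (tag -> value)

def to_fastq(rec: BamRecord) -> str:
    # Only name/seq/qual fit into FASTQ; rec.tags are silently lost.
    return f"@{rec.name}\n{rec.seq}\n+\n{rec.qual}\n"

rec = BamRecord("read1", "ACGT", "FFFF", tags={"ZM": [102, 0, 98, 5]})
fq = to_fastq(rec)
assert "ZM" not in fq  # the flow-signal tag did not survive conversion
```

Going back from that FASTQ to uBAM cannot restore the tags, which is why the conversion is one-way lossy for such platforms.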

12 weeks ago
shelkmike ▴ 540

My lab uses ALAPY Compressor to compress FASTQ. It compresses FASTQ approximately 2 times better than gzip does. There was a discussion of ALAPY Compressor on BioStars: Lossless ALAPY Fastq Compressor (now for MacOS X with 10-20% improved speed and compression ratio)


The compression seems similar to unmapped CRAM (2×), but the ALAPY organization (at least the website referenced from their GitHub page) no longer exists. That seems a bit risky for a long-term storage format.


I agree. However, even though their organisation no longer exists, they did publish their compressor on GitHub.

12 weeks ago
GenoMax 104k

Every sequencing center/large lab runs into this question in their life cycle :-)

At some point you have to accept that if you wish to store raw data long term, planning for the additional expense is part of the "cost of doing business". While a new standard may emerge over time (and it will need to be accepted across manufacturers), converting past data to that standard may turn out to be an exercise in futility; such a solution would only be useful going forward from that point.

If you are generating data that can be (or eventually be made) public, then you should consider putting it in public repositories like SRA/ENA/DDBJ and allowing those organizations to maintain a copy. Warning: SRA has even proposed post-processing data to reduce its size. Otherwise, see my comment above.

Eventually we will reach a point of diminishing returns where it is simply more cost-effective to regenerate the data as needed, provided the samples remain available. For non-replaceable samples you will just need to store the data.

