Best Omic file compressor?
2
0
Entering edit mode
4 months ago
matthew.pace ▴ 30

Our team has been having storage space issues; we projected that we will not have enough capacity to store the files generated by our pipelines. Standard file compressors (gzip, bzip2, 7zip) weren't cutting it, so I started experimenting with format-specific compressors. That's when Google turned up Genozip.

I've managed to reproduce its claimed compression ratios on fastq.gz, vcf.gz and BAM files in a timeframe comparable to standard compression tools. I was not able to compress CRAM, though (code in comments). It also ships a utility that streams the doubly compressed files to stdout without writing a decompressed copy to disk.
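For reference, a minimal sketch of the round-trip I mean (assumes genozip is installed and on PATH; `sample.fastq.gz` is a placeholder filename, and exact output names can vary by version, so treat this as illustrative rather than exact):

```shell
# Compress a gzipped FASTQ into a .genozip archive.
genozip sample.fastq.gz

# Stream the decompressed reads to stdout without writing a file
# (this is the utility mentioned above).
genocat sample.fastq.genozip | head

# Restore the original file.
genounzip sample.fastq.genozip
```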

I'm quite impressed with Genozip. It seems to be the best option but I remain a little skeptical as I haven't found any forum posts discussing it.

Has anybody had any experience with Genozip, or can you recommend another file compressor?

cram genozip gzip compression • 587 views
0
Entering edit mode

This is not really answering your question, but it may be useful.

Disk storage is relatively cheap compared to the time and effort needed to test various compression algorithms and then actually compress the files. I don't know whether that would solve your problem, but there are 8-10 TB hard disks available for under \$200. Also, in my experience, almost anything other than gzip (say, 7z or xz in their strongest compression modes) will get you within 2-5% of the tools that claim the best compression. Is it really worth the effort to squeeze out the last couple of percent?
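If you want to benchmark this kind of claim yourself without installing anything, Python's stdlib wraps the same codecs (gzip, bzip2, and xz/7z's LZMA). A minimal sketch on synthetic FASTQ-like data — real reads will compress differently, so this only illustrates the measurement, not the numbers:

```python
# Compare lossless compression ratios of stdlib codecs on the same input.
import gzip, bz2, lzma, random

# Build synthetic FASTQ-like records (fixed-quality, random 100 bp reads).
random.seed(0)
reads = []
for i in range(2000):
    seq = "".join(random.choice("ACGT") for _ in range(100))
    reads.append(f"@read{i}\n{seq}\n+\n{'I' * 100}\n")
data = "".join(reads).encode()

for name, compress in [
    ("gzip -9", lambda b: gzip.compress(b, 9)),
    ("bzip2 -9", lambda b: bz2.compress(b, 9)),
    ("xz -9", lambda b: lzma.compress(b, preset=9)),
]:
    out = compress(data)
    print(f"{name:9s} ratio = {len(data) / len(out):.2f}")
```

Swapping in your own files (read the bytes from disk instead of generating them) gives a quick apples-to-apples comparison before committing to any one tool.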

0
Entering edit mode

We aren't ruling out buying more storage, but it's being left as a last resort. In my testing with paired-end FASTQ files, conventional compressors (gzip, bzip2, 7zip, rar) and the specialized DSRC at their highest compression settings gave compression ratios between 5 and 8. Genozip gave a compression ratio of 21 on the same files.

That's lossless compression to ~5% of the original file size and ~24% of the gzipped file size.
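The arithmetic behind those percentages, as a quick sanity check (ratios taken from the numbers above; the gzip ratio of 5 is the low end of the conventional-tool range I measured):

```python
# Sanity-check the percentages implied by the reported compression ratios.
orig = 100.0          # original size, arbitrary units
genozip_ratio = 21.0  # ratio measured for Genozip
gzip_ratio = 5.0      # low end of the conventional-tool range

genozip_size = orig / genozip_ratio
gzip_size = orig / gzip_ratio

print(f"Genozip output: {genozip_size / orig:.1%} of original")      # 4.8%
print(f"Genozip output: {genozip_size / gzip_size:.1%} of gzipped")  # 23.8%
```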

I've tested this multiple times with different files and it seems legitimate, which is why I wonder why this tool hasn't been getting more attention.

0
Entering edit mode

I see that a ratio of 21 is being claimed, but how much time does that compression add (presumably a similar amount would be needed for decompression)? If you are a smaller lab it may be worth investing that time, but for large projects it may simply not be worth it.

0
Entering edit mode

Compression time was comparable to standard compression tools.

1
Entering edit mode
9 weeks ago
matthew.pace ▴ 30

Update: Genozip has been patched and the issue with CRAM resolved. It's now being used for archiving. Thanks Divon!

1
Entering edit mode
9 weeks ago
colindaven ★ 3.5k

Not an answer to your question, but: brotli has a good reputation, yet I didn't enjoy using it much. The implementations I tried were not as user-friendly as pigz, gzip, etc. Also, many pipelines are happy working with gzipped files, whereas brotli is more exotic and less battle-tested.

The key factor with custom compression algorithms is the length of support. If a tool is just a PhD/postdoc project, can I still trust it in 10-15 years? And yes, we still see some (very limited) demand for 10-year-old files at the moment, so we anticipate this will still happen in the future.

That's why I would stick to well supported formats such as bam, cram, fastq.gz etc.

1
Entering edit mode

Another critical factor is acceptance of a new format by the companies making sequencing hardware. If Illumina throws its weight behind a format, then people will consider adopting it. Illumina acquired a company called Enancio for its lossless data compression technology. That format is already being rolled out, and they are providing the decompression software for free.