Best omics file compressor?
4 weeks ago

Our team has been running into storage problems: we project that we will not have enough disk space to store the files generated by our pipelines. Standard file compressors (gzip, bzip2, 7zip) weren't cutting it, so I started experimenting with format-specific compressors. This is where Google turned up 'Genozip'.

I've managed to reproduce its claimed compression ratios on fastq.gz, vcf.gz and BAM files, in a timeframe comparable to standard compression tools. I was not able to compress CRAM, though (code in comments). It also has a utility that lets the user stream the contents of the doubly compressed files to stdout without decompressing them to disk first.

I'm quite impressed with Genozip. It seems to be the best option but I remain a little skeptical as I haven't found any forum posts discussing it.

Does anybody have experience with Genozip, or can anyone recommend another file compressor?

Documentation: https://genozip.readthedocs.io/

cram genozip gzip compression • 223 views

This is not really answering your question, but it may be useful.

Disk storage is relatively cheap compared to the time and effort needed to test various compression algorithms and then actually compress the files. I don't know if that would solve your problem, but there are 8-10 TB hard disks available for under $200. Also, in my experience, using almost anything other than gzip (say, 7z or xz in their strongest compression modes) will get you within 2%-5% of the tools that claim the best compression. Is it really worth the effort to squeeze out the last couple of percent?


We aren't ruling out buying more storage, but it's being left as a last resort. In my testing with paired-end FASTQ files, conventional compressors (gzip, bzip2, 7zip, rar) and the specialized DSRC, all at their highest compression settings, gave me compression ratios between 5 and 8. Genozip gave me a compression ratio of 21 on the same files.

That's lossless compression to ~5% of the original file size and ~23% of the gzipped file size.
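
For anyone who wants to check the arithmetic, here is a minimal sketch of how a compression ratio translates into those percentages. The function names are made up for illustration, and the gzip ratio of 5 is an assumption taken from the low end of the 5 to 8 range above, not a separately measured value.

```shell
# Convert a compression ratio (original size / compressed size) into the
# compressed size expressed as a percentage of the original size.
ratio_to_percent() {
    awk -v r="$1" 'BEGIN { printf "%.1f\n", 100 / r }'
}

# Size of one compressed file as a percentage of another compressed file,
# given both compression ratios measured against the same original.
# $1: ratio of the better compressor, $2: ratio of the baseline compressor
relative_percent() {
    awk -v better="$1" -v baseline="$2" 'BEGIN { printf "%.1f\n", 100 * baseline / better }'
}

ratio_to_percent 21      # Genozip output as % of the original       -> 4.8
relative_percent 21 5    # Genozip output as % of gzip (ratio 5 assumed) -> 23.8
```

With a gzip ratio closer to 8, the second figure would be nearer 38%, so the ~23% only holds for files where gzip manages about 5.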

I've tested this multiple times with different files and it seems to be legitimate, which is why I wonder why this tool hasn't been getting more attention.

4 weeks ago

Unsuccessful CRAM command:

genozip \
--reference GRCh37_latest_genomic.ref.genozip \
--output sample.cram.genozip \
sample.cram

The original sample.cram is 10 GB, while the output sample.cram.genozip is 15 GB. Genozip printed this message:

"FYI: header of HTS154_3.cram has contig '1' (and maybe others, too), missing in $home/GRCh37_latest_genomic.ref.genozip. No harm."
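
The message suggests that the CRAM header and the reference may be using different contig names (for example '1' in the CRAM versus some other naming scheme in the reference). One way to check is to list the names that appear in the CRAM header but not in the FASTA the reference was built from. This is only a sketch: `missing_contigs` is a hypothetical helper, and it assumes the header has been dumped to text with `samtools view -H` and that the original reference FASTA is still available.

```shell
# Print contig names present in a SAM/CRAM header but absent from a FASTA.
# $1: header as text, e.g. from: samtools view -H sample.cram > header.sam
# $2: reference FASTA (the file names used here are placeholders)
missing_contigs() {
    awk '
        # First input (the FASTA): remember each sequence name, i.e. the
        # text after ">" up to the first whitespace.
        NR == FNR {
            if (substr($0, 1, 1) == ">") seen[substr($1, 2)] = 1
            next
        }
        # Second input (the header): for each @SQ line, report the SN:
        # value if no FASTA sequence has that name.
        $1 == "@SQ" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/ && !(substr($i, 4) in seen))
                    print substr($i, 4)
        }
    ' "$2" "$1"
}

# Example usage (not run here):
#   samtools view -H sample.cram > header.sam
#   missing_contigs header.sam reference.fa
```

If this prints most or all of the contigs, the reference probably does not match the one the CRAM was aligned against, which could also be why the output ended up larger than the input.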