Tool: Genozip: A new compression tool for FASTQ, BAM, VCF and more
12 weeks ago
Divon ▴ 100

Genozip is a new(ish) tool for compressing genomic files. It typically compresses 2x-5x better than standard compression (e.g. .gz), and it works on all common genomic file formats. I am its developer.

It is a lot more than just a compressor, though: it also has some interesting analytical capabilities.

Installation, documentation and source code: http://genozip.com

Publication: A Universal Extensible Genomic Data Compressor

Feedback / feature requests would be more than welcome.

Note: this tool is not open source, but it is free for non-commercial use, and the source code is available.

bam vcf fastq compression • 422 views

This seems like a great tool which has been seriously overlooked.

I've been testing it out and was able to reproduce the compression ratios you claimed in your 2021 paper with fastq.gz, vcf.gz, and .bam files. However, I'm having trouble with CRAM files:

genozip \
--reference GRCh37_latest_genomic.ref.genozip \
--output sample.cram.genozip \
sample.cram


The original sample.cram is 10 GB, while the output sample.cram.genozip is 15 GB. I also got this message:

"FYI: header of HTS154_3.cram has contig '1' (and maybe others, too), missing in /scratch/mpace21/GRCh37_latest_genomic.ref.genozip. No harm."

Any suggestions?
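
For anyone who runs into the same message: one thing worth checking is which contig names the file's header declares, since the message says contig '1' is not in the reference. The sketch below pulls the SN: names out of header text. The helper name sq_names and the inline two-line header are made up for illustration; on a real file the header would come from `samtools view -H sample.cram`. Whether a naming mismatch explains the larger output here is an assumption, not something established in this thread.

```shell
#!/bin/sh
# Print the contig names (SN: fields of @SQ lines) from SAM/BAM/CRAM
# header text read on stdin. sq_names is a hypothetical helper name.
sq_names() {
    awk -F'\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/) print substr($i, 4)
    }'
}

# Illustrative inline header (not taken from the file in this thread).
# With samtools installed, the real header could be piped in instead:
#   samtools view -H sample.cram | sq_names
printf '@HD\tVN:1.6\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:2\tLN:243199373\n' | sq_names
```

The resulting names can then be compared by eye with the sequence names in the FASTA the reference was built from (for example, the lines starting with `>`).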


Hi Matthew, I sent you a response on the other thread as well; I'm repeating it here in case you didn't see it.

First, thank you for your kind words; they are very rewarding to hear.

Could you please send a small sample of the CRAM (e.g. the first 10k lines) to support@genozip.com? I will look into it.


From GitHub: Yes, Genozip can compress already-compressed files (.gz .bz2 .xz .bam .cram).

Generally, compressing already-compressed data does not work well, so this is an impressive result.
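
To illustrate the general point about recompressing compressed data, here is a minimal sketch using plain gzip (not Genozip). It generates 1 MB of random bases as synthetic test data, compresses it once, then compresses the result again and prints the three sizes. The file names and the choice of random ACGT input are assumptions made for the demo; real genomic files will behave differently in the first pass.

```shell
#!/bin/sh
# Demo: running gzip on already-gzipped data gains almost nothing.
tmp=$(mktemp -d)

# 1 MB of random bases (synthetic data, not a real genomic file).
LC_ALL=C tr -dc 'ACGT' < /dev/urandom | head -c 1000000 > "$tmp/raw.txt"

# First pass compresses the text; second pass compresses the .gz output.
gzip -c "$tmp/raw.txt" > "$tmp/once.gz"
gzip -c "$tmp/once.gz" > "$tmp/twice.gz"

# $(( ... )) strips any padding that wc adds on some systems.
raw=$(( $(wc -c < "$tmp/raw.txt") ))
once=$(( $(wc -c < "$tmp/once.gz") ))
twice=$(( $(wc -c < "$tmp/twice.gz") ))

echo "raw:           $raw bytes"
echo "gzip once:     $once bytes"
echo "gzip twice:    $twice bytes"

rm -rf "$tmp"
```

The first pass should shrink the file substantially, while the second pass should leave the size roughly unchanged, because the output of a compressor has little redundancy left for a generic compressor to exploit.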