BAM to CRAM and BAM recover with smaller size
6 weeks ago
geocarvalho ▴ 360

Hi, I was trying to decide which samtools compression profile to use for CRAM and noticed that the BAM files recovered from CRAM are about 10 GiB smaller than the original BAM file. Do you know what information I am losing in this conversion?

$ docker run -v $PWD:$PWD quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1 samtools view -@ 14 -T $PWD/hg38.fa -C --output-fmt-option archive -o $PWD/SAMPLE-P_archive.cram $PWD/SAMPLE-P.bam
$ docker run -v $PWD:$PWD quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1 samtools view -@ 14 -T $PWD/hg38.fa --input-fmt-option archive -o $PWD/SAMPLE-P_unarchive.bam $PWD/SAMPLE-P_archive.cram

-rw-rw-r-- 1 where where  61G Mar 14 21:25 SAMPLE-P.bam
-rw-rw-r-- 1 where where 9.0M Mar 14 21:25 SAMPLE-P.bam.bai
-rw-r--r-- 1 root  root   18G Mar 15 00:16 SAMPLE-P_archive.cram
-rw-r--r-- 1 root  root   51G Mar 15 01:02 SAMPLE-P_unarchive.bam
-rw-r--r-- 1 root  root   18G Mar 15 01:23 SAMPLE-P_small.cram
-rw-r--r-- 1 root  root   51G Mar 15 01:47 SAMPLE-P_unsmall.bam
-rw-r--r-- 1 root  root   19G Mar 15 01:58 SAMPLE-P_normal.cram
-rw-r--r-- 1 root  root   51G Mar 15 02:14 SAMPLE-P_unormal.bam
-rw-r--r-- 1 root  root   21G Mar 15 03:13 SAMPLE-P_fast.cram
-rw-r--r-- 1 root  root   51G Mar 15 03:35 SAMPLE-P_unfast.bam

Don't depend on file sizes for any decisions. Look inside/compare the reads.


Just as gzip -1 to gzip -9 can give different file sizes, so can two identical BAMs be very different in size. That may or may not be the cause. You'd have to uncompress to test. (Note there's no point in --input-fmt-option archive as the input format is self-describing, but I wonder if it somehow enabled archive mode for BAM output, which would indeed be something like gzip -9.)
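
A rough way to test the compression-level explanation without any extra tools is to compare how many bytes each BAM expands to. This is only a sketch: it relies on BAM's BGZF container being readable by plain gzip as a series of concatenated gzip members, and `uncompressed_bytes` is a made-up helper name, not part of samtools.

```shell
# Hypothetical helper: print the number of bytes a gzip/BGZF file expands to.
uncompressed_bytes() {
    # Read via stdin so gzip does not care about the file suffix (.bam, .gz, ...).
    gzip -dc < "$1" | wc -c | tr -d ' '
}

# Example with the files from the post (each call streams the whole file):
# uncompressed_bytes SAMPLE-P.bam
# uncompressed_bytes SAMPLE-P_unarchive.bam
```

If the two numbers are close while the on-disk sizes differ by 10 GiB, the compression level is the likely explanation. A small difference would not be surprising, since samtools may add a @PG line to the header. Matching sizes do not prove the records are identical, so a record-level comparison is still worth doing.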

Best thing though is to convert both to SAM and compare them. Htslib comes with a compare_sam.pl tool in the test directory to aid such things. It's slow as it's not designed for anything other than testing, but it'd maybe help give some confidence.
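
Before reaching for compare_sam.pl, a quick strict check can be sketched in shell. This assumes both files have already been converted to SAM text (for example with `samtools view -h`) and that the records are in the same order; `same_sam_records` is a hypothetical helper name, not an htslib tool.

```shell
# Hypothetical helper: succeed if two SAM text files hold byte-identical
# alignment records in the same order. Header lines (starting with '@')
# are skipped, so differing @PG lines do not cause a mismatch.
same_sam_records() {
    a=$(grep -v '^@' "$1" | cksum)
    b=$(grep -v '^@' "$2" | cksum)
    [ "$a" = "$b" ]
}
```

In bash, process substitution avoids writing huge SAM files to disk: `same_sam_records <(samtools view -h SAMPLE-P.bam) <(samtools view -h SAMPLE-P_unarchive.bam) && echo identical`. This check is stricter than it needs to be. A mismatch does not by itself mean data was lost, only that the text differs somewhere, and that is the point where a more tolerant comparison such as compare_sam.pl becomes useful.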

Also, if you know you'll be sticking to samtools/htslib/noodles derived tools for decoding CRAM, then you could also try -O cram,archive,version=3.1 to get maximum compression. Although frankly "archive" is typically too extreme IMO. It's good compression, but "small" is often a better tradeoff. Try both and see.
