Question: Does SortSam lead to a loss of data?
James Reeve wrote, 2.6 years ago:

As part of my pipeline I'm using the Picard program SortSam to order the reads in my BAM file by position (SORT_ORDER=coordinate). However, when I run this code, my output file is considerably smaller than the input.

java[tmp-directory] -jar picard.jar SortSam \
     I=before-sort.bam \
     O=after-sort.bam \
     SORT_ORDER=coordinate

du before-sort.bam = 44131980 KB

du after-sort.bam = 28874760 KB

Have I lost data, or does SortSam have a filtering step I don't know of?

picard • 779 views
modified 2.6 years ago by ATpoint • written 2.6 years ago by James Reeve
ATpoint wrote, 2.6 years ago:

No, it does not (or at least should not) remove anything. The size difference is almost certainly due to different compression levels. Simply use samtools flagstat to count the reads in both files; the counts should be identical.
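
A minimal sketch of that check, assuming samtools is installed and on the PATH. It uses samtools view -c, which prints a single record count, instead of flagstat, which prints a per-category breakdown; either works for spotting dropped reads. The helper names count_reads and same_count are hypothetical, and the file names are the ones from the question.

```shell
#!/bin/sh
# Count all alignment records in a BAM file (assumes samtools is installed).
count_reads() {
    samtools view -c "$1"
}

# Compare two record counts and report whether they match.
# Returns 0 when the counts are identical, 1 otherwise.
same_count() {
    if [ "$1" -eq "$2" ]; then
        echo "OK: both files contain $1 records"
    else
        echo "MISMATCH: $1 vs $2 records"
        return 1
    fi
}

# Usage (uncomment to run on the files from the question):
# same_count "$(count_reads before-sort.bam)" "$(count_reads after-sort.bam)"
```

If the two counts agree, sorting dropped no records and the size difference comes from how the data is stored.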

written 2.6 years ago by ATpoint

I checked my files. They have the same number of reads, thanks for the help.

Do you know why my file is roughly 35% smaller after sorting? This is remarkable compression from a program that I assumed only rearranges the data.
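
The relative size change can be worked out directly from the two du figures in the question. This is a small sketch using awk for the floating-point arithmetic; the helper name percent_smaller is hypothetical.

```shell
#!/bin/sh
# Percentage by which a file shrank, given its size before and after (same units).
percent_smaller() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Sizes in KB as reported by du in the question.
percent_smaller 44131980 28874760   # prints 34.6
```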

written 2.6 years ago by James Reeve

How did you create before-sort.bam? Maybe it was written with a very low compression level? I think the standard compression level for most tools is 5 (on a scale of 0-9). If I remember correctly, the size difference between an uncompressed BAM and a standard BAM from a normal samtools view -b is typically around 20%, but I have no expert knowledge of compression, so do not take me as a reference.
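
One way to test the compression explanation is to write the sorted output at different compression levels and compare the sizes with du. The sketch below only prints the commands rather than running them. It assumes picard.jar is in the current directory, as in the question, and uses Picard's standard COMPRESSION_LEVEL option; the helper name sortsam_cmd and the output file names are hypothetical.

```shell
#!/bin/sh
# Print (not run) a Picard SortSam command with an explicit compression level.
# $1 = input BAM, $2 = output BAM, $3 = compression level (0-9)
sortsam_cmd() {
    printf 'java -jar picard.jar SortSam I=%s O=%s SORT_ORDER=coordinate COMPRESSION_LEVEL=%s\n' \
        "$1" "$2" "$3"
}

# Show the commands for an uncompressed output and a level-5 output.
sortsam_cmd before-sort.bam sorted-level0.bam 0
sortsam_cmd before-sort.bam sorted-level5.bam 5
```

Running the printed commands and comparing the two outputs with du would show how much of the size difference comes from the compression level alone.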

modified 2.6 years ago • written 2.6 years ago by ATpoint

I compressed from SAM to BAM using samtools view -b. SortSam sets the default compression level to 5 (20%).

It seems a previous post (Sam To Bam - Loss Of Data Or Just Great Compression?) found SortSam to be very efficient when converting SAM to BAM. I guess part of the SortSam program compresses the files.

written 2.6 years ago by James Reeve

OK, I see. I never cared too much about file sizes, as our HPC cluster has almost 350 TB of space, so I often leave intermediate files completely uncompressed to save time.

written 2.6 years ago by ATpoint