Question: Does SortSam lead to a loss of data?
19 months ago, James Reeve wrote:

As part of my pipeline I'm using the Picard program SortSam to order the reads in my BAM file by position (SORT_ORDER=coordinate). However, when I run this code, my output file is considerably smaller than the input.

java[tmp-directory] -jar picard.jar SortSam \
     I=before-sort.bam \
     O=after-sort.bam \
     SORT_ORDER=coordinate

du before-sort.bam = 44131980 KB

du after-sort.bam = 28874760 KB

Have I lost data, or does SortSam have a filtering step I don't know of?

19 months ago, ATpoint wrote:

No, it does not (should not) remove anything. The size difference is almost certainly due to different compression levels. Simply use samtools flagstat to count the reads in both files; the counts should be identical.

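The check suggested above can be sketched as a small shell helper. This is only a sketch: the function name same_record_count is made up for illustration, and it assumes samtools is on the PATH. It uses samtools view -c, which prints the number of alignment records in a file, so the two totals can be compared directly.

```shell
# Hypothetical helper (name is an assumption): succeeds only if both BAM
# files contain the same number of alignment records.
same_record_count() {
    # samtools view -c prints the record count instead of the records themselves
    before=$(samtools view -c "$1") || return 2
    after=$(samtools view -c "$2") || return 2
    echo "$1: $before records"
    echo "$2: $after records"
    [ "$before" -eq "$after" ]
}
```

A call such as same_record_count before-sort.bam after-sort.bam && echo "no records lost" would then report whether sorting dropped anything. Running samtools flagstat on each file gives the same total plus a per-category breakdown.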

I checked my files. They have the same number of reads, thanks for the help.

Do you know why my file is nearly 50% smaller after sorting? This is remarkable compression from a program that I assumed only rearranges the data.

Reply written 19 months ago by James Reeve

How did you create before-sort.bam? Maybe you used a very low compression level for it? I think the standard compression level for most tools is 5 (on a scale of 0-9). If I remember correctly, the size difference between an uncompressed BAM and a standard BAM from a normal samtools view -b is typically around 20%, but I have no expert knowledge of compression, so do not take me as a reference^^

Reply modified 19 months ago, written 19 months ago by ATpoint
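To get a feel for how much the compression level alone can change file size, here is a rough sketch using plain gzip as a stand-in for the gzip-style compression inside BAM. The records are made-up toy lines, not real SAM, and the ratios will not match real BAM files; the point is only that identical content can produce different sizes at different levels.

```shell
# Illustration only: plain gzip stands in for BAM's gzip-style compression.
workdir=$(mktemp -d)

# Toy alignment-like records (made-up fields, not real SAM).
awk 'BEGIN {
    for (i = 0; i < 20000; i++)
        printf "read%05d\tchr%d\t%07d\t%dM\n", i, i % 22 + 1, (i * 7919) % 1000000, 50 + i % 51
}' > "$workdir/records.txt"

# Same content, three sizes: uncompressed, fastest level, strongest level.
size_raw=$(wc -c < "$workdir/records.txt" | tr -d ' ')
size_fast=$(gzip -1 -c "$workdir/records.txt" | wc -c | tr -d ' ')
size_best=$(gzip -9 -c "$workdir/records.txt" | wc -c | tr -d ' ')

echo "uncompressed: $size_raw bytes"
echo "gzip -1:      $size_fast bytes"
echo "gzip -9:      $size_best bytes"

rm -rf "$workdir"
```

Picard exposes the equivalent setting as the COMPRESSION_LEVEL option, so a file written at a low level and then rewritten by SortSam at its default level could shrink without losing any reads.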

I compressed from SAM to BAM using samtools view -b. SortSam sets the default compression level to 5 (20%).

It seems a previous post (Sam To Bam - Loss Of Data Or Just Great Compression?) found SortSam to be very efficient when converting SAM to BAM. I guess part of the SortSam program compresses the files.

Reply written 19 months ago by James Reeve
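Another factor that probably contributes here: coordinate sorting places overlapping reads next to each other, and gzip-style compressors do better when similar data sit close together. The sketch below illustrates this with toy data only. The "genome" is a pseudo-random string from a small generator chosen for this example, the reads are 100-base substrings of it, and plain gzip stands in for BAM compression. All names and numbers are assumptions made for the illustration.

```shell
# Illustration only: same toy reads, compressed in scrambled and in sorted order.
workdir=$(mktemp -d)

awk 'BEGIN {
    split("A C G T", base, " ")
    seed = 1
    genome = ""
    # Toy genome: 20000 pseudo-random bases from a small congruential generator.
    for (i = 0; i < 20000; i++) {
        seed = (seed * 75 + 74) % 65537
        genome = genome base[seed % 4 + 1]
    }
    # 2000 reads of 100 bases, emitted in a scrambled position order.
    for (i = 0; i < 2000; i++) {
        pos = (i * 7919) % 19900
        printf "%06d\t%s\n", pos, substr(genome, pos + 1, 100)
    }
}' > "$workdir/unsorted.txt"

# Positions are zero-padded, so a plain byte-wise sort orders reads by coordinate.
LC_ALL=C sort "$workdir/unsorted.txt" > "$workdir/sorted.txt"

size_unsorted=$(gzip -c "$workdir/unsorted.txt" | wc -c | tr -d ' ')
size_sorted=$(gzip -c "$workdir/sorted.txt" | wc -c | tr -d ' ')

echo "unsorted, gzip: $size_unsorted bytes"
echo "sorted, gzip:   $size_sorted bytes"

rm -rf "$workdir"
```

Both files hold exactly the same lines, so any size difference comes from ordering alone. With real data the effect depends on coverage and read length, so treat this as a qualitative picture rather than a prediction of the 50% seen in the question.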

Ok I see. I never cared too much about file sizes as our HPC cluster has almost 350TB of space, so I often leave intermediate files completely uncompressed to save time.

Reply written 19 months ago by ATpoint