Does SortSam lead to a loss of data?
3.3 years ago
James Reeve ▴ 130

As part of my pipeline I'm using the Picard program SortSam to order the reads in my BAM file by position (SORT_ORDER=coordinate). However, when I run this command, the output file is much smaller than the input.

java -Djava.io.tmpdir=[tmp-directory] -jar picard.jar SortSam \
     I=before-sort.bam \
     O=after-sort.bam \
     SORT_ORDER=coordinate

du before-sort.bam = 44131980 KB

du after-sort.bam = 28874760 KB

Have I lost data, or does SortSam have a filtering step I don't know of?

Picard • 986 views
3.3 years ago
ATpoint 55k

No, it does not (should not) remove anything. The size difference is almost certainly (99.9%) due to different compression levels. Simply use samtools flagstat to count the reads in both files. The counts should be identical.
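
A minimal sketch of that check, assuming samtools is on the PATH. It uses `samtools view -c`, which prints the number of alignment records in a file, instead of the fuller flagstat report. The function names and the SAMTOOLS override variable are hypothetical helpers for illustration, not part of samtools.

```shell
# Hypothetical helper: print the number of alignment records in a BAM file.
# SAMTOOLS may be set to point at a specific samtools binary.
count_reads() {
    "${SAMTOOLS:-samtools}" view -c "$1"
}

# Hypothetical helper: exit status 0 only if both files hold the same
# number of records; status 2 if either count could not be obtained.
same_read_count() {
    before=$(count_reads "$1") || return 2
    after=$(count_reads "$2") || return 2
    echo "$1: $before records"
    echo "$2: $after records"
    [ "$before" -eq "$after" ]
}
```

It could be called as `same_read_count before-sort.bam after-sort.bam && echo "no reads lost"`.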


I checked my files. They have the same number of reads, thanks for the help.

Do you know why my file is about 35% smaller after sorting? This is remarkable compression from a program that I assumed only rearranges the data.
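
For reference, the relative reduction can be worked out from the du figures in the question. This is a small sketch; `size_reduction_pct` is a hypothetical helper name.

```shell
# Hypothetical helper: percentage by which a file shrank, given its size
# before and after in the same units (here KB, as reported by du).
size_reduction_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (1 - after / before) * 100 }'
}

# Sizes of before-sort.bam and after-sort.bam from the question.
size_reduction_pct 44131980 28874760   # prints 34.6
```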


How did you create before-sort.bam? Maybe you used a very low compression level for it. I think the standard compression level for most tools is 5 (on a scale of 0-9). If I remember correctly, the size difference between an uncompressed BAM and a standard BAM from a plain samtools view -b is typically around 20%, but I have no expert knowledge of compression, so do not take me as a reference^^


I compressed from SAM to BAM using samtools view -b. SortSam sets the default compression level to 5 (20%).

It seems a previous post (Sam To Bam - Loss Of Data Or Just Great Compression?) found SortSam to be very efficient when converting SAM to BAM. I guess part of the SortSam program compresses the files.
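
One way to test whether compression alone explains the difference is to write the same data at several compression levels and compare the resulting sizes. The sketch below uses samtools sort rather than Picard, since its `-l` option sets the output compression level (0 to 9). The function name, the output file naming, and the SAMTOOLS override variable are assumptions for illustration.

```shell
# Hypothetical helper: coordinate-sort a BAM at a given compression level.
# Usage: sort_at_level LEVEL INPUT.bam  (writes sorted.levelLEVEL.bam)
sort_at_level() {
    level=$1
    input=$2
    output="sorted.level${level}.bam"
    # samtools sort orders by coordinate by default; -l sets compression.
    "${SAMTOOLS:-samtools}" sort -l "$level" -o "$output" "$input" || return 1
    echo "$output"
}
```

For example, `for l in 1 5 9; do du -k "$(sort_at_level "$l" before-sort.bam)"; done` would list the size of each output in KB.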


Ok I see. I never cared too much about file sizes as our HPC cluster has almost 350TB of space, so I often leave intermediate files completely uncompressed to save time.
