Does SortSam lead to a loss of data?
3.3 years ago
James Reeve ▴ 130

As part of my pipeline I'm using the Picard program SortSam to order the reads in my BAM file by position (SORT_ORDER=coordinate). However, when I run this code, my output file is considerably smaller than the input.

java -Djava.io.tmpdir=[tmp-directory] -jar picard.jar SortSam \
I=before-sort.bam \
O=after-sort.bam \
SORT_ORDER=coordinate


du before-sort.bam = 44131980 KB

du after-sort.bam = 28874760 KB

Have I lost data, or does SortSam have a filtering step I don't know of?

Picard • 987 views
3.3 years ago
ATpoint 55k

No, it does not (should not) remove anything. The size difference is almost certainly (99.9%) due to different compression levels. Simply use samtools flagstat to count the reads in both files. The counts should be identical.
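A minimal sketch of that check, assuming samtools is on the PATH. The function name compare_flagstat is made up for illustration; the file names are the ones from the question.

```shell
# Compare the flagstat summaries of two BAM files.
# flagstat reports read counts per category and does not depend on read order,
# so identical summaries indicate that no reads were dropped by sorting.
compare_flagstat() {
    a=$(samtools flagstat "$1") || return 2
    b=$(samtools flagstat "$2") || return 2
    if [ "$a" = "$b" ]; then
        echo "identical"
    else
        echo "different"
    fi
}

# Example call (needs the real files):
# compare_flagstat before-sort.bam after-sort.bam
```

If this prints "different", diffing the two flagstat outputs should show which category changed.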


I checked my files. They have the same number of reads, thanks for the help.

Do you know why my file is roughly 35% smaller after sorting? That is remarkable compression from a program that I assumed only rearranges the data.


How did you create before-sort.bam? Maybe you used a very low compression level on that one? I think the standard compression level for most tools is 5 (on a scale of 0-9). If I remember correctly, the typical size difference between an uncompressed BAM and a standard BAM from a normal samtools view -b is about 20%, but I really have no expert knowledge on compression, so do not take me as a reference^^


I compressed from SAM to BAM using samtools view -b. SortSam sets the default compression level to 5 (20%).
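To rule the compression level in or out, it can be set explicitly through Picard's standard COMPRESSION_LEVEL option. This is only a sketch: the wrapper name sort_bam_with_level and the PICARD_JAR variable are assumptions for illustration, and the command uses the same KEY=value argument style as the one in the question.

```shell
# Hypothetical wrapper: coordinate-sort a BAM with an explicit compression level.
# Assumes PICARD_JAR points at picard.jar (defaults to ./picard.jar).
sort_bam_with_level() {
    java -jar "${PICARD_JAR:-picard.jar}" SortSam \
        I="$1" \
        O="$2" \
        SORT_ORDER=coordinate \
        COMPRESSION_LEVEL="$3"
}

# Example call (needs Java, Picard and the real input file):
# sort_bam_with_level before-sort.bam after-sort-level5.bam 5
```

Sorting the same input at two different levels and comparing the output sizes would show how much of the difference the level alone explains.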

It seems a previous post (Sam To Bam - Loss Of Data Or Just Great Compression?) found SortSam to be very efficient when converting SAM to BAM. I guess part of the SortSam program compresses the files.
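One likely contributor beyond the compression level is the sort order itself. BAM is block-compressed with DEFLATE (BGZF), and coordinate sorting puts overlapping reads with similar sequences next to each other, which generally compresses better. The toy sketch below uses gzip on made-up data rather than real BAM files; the reference length, read length and read count are arbitrary assumptions.

```shell
# Toy illustration (not real BAM data): the same "reads" compressed with gzip,
# once in random order and once sorted by position.
reads=$(awk 'BEGIN {
    srand(1)
    split("A C G T", base, " ")
    # Random reference sequence of 100000 bases.
    ref = ""
    for (i = 0; i < 100000; i++) ref = ref base[int(rand() * 4) + 1]
    # 10000 reads of 50 bases from random positions: "<position><TAB><read>".
    for (i = 0; i < 10000; i++) {
        pos = int(rand() * 99950) + 1
        printf "%d\t%s\n", pos, substr(ref, pos, 50)
    }
}')

# Compressed size in bytes for each ordering (tr strips padding from wc).
unsorted=$(printf '%s\n' "$reads" | gzip -c | wc -c | tr -d ' ')
sorted=$(printf '%s\n' "$reads" | sort -n | gzip -c | wc -c | tr -d ' ')

echo "unsorted: $unsorted bytes, position-sorted: $sorted bytes"
```

The content is identical in both cases. Only the order differs, so any size difference comes from how well adjacent lines match each other.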


OK, I see. I never cared too much about file sizes, as our HPC cluster has almost 350 TB of space, so I often leave intermediate files completely uncompressed to save time.