Why samtools sort input file is 4.3 Gb but output become 3.5Gb....
2
1
Entering edit mode
7.9 years ago
bxia ▴ 180

Try to sort the bam file, but the input is 4.3 Gb, output is 3.5 Gb

command line:

samtools sort 231.bam 231.sorted.bam

ChIP-Seq • 2.3k views
ADD COMMENT
6
Entering edit mode
7.9 years ago
Rob 6.5k

This is because BAM files are block compressed, and sorted data is more easily compressible than unsorted data (repetitions and similar records tend to be closer together, and the underlying compression algorithm can do a better job). This is a well-documented phenomenon, and there are even a number of compression tools (e.g. SCALCE, MINCE, ORCOM, etc.) that are built on this principle of compression "boosting".

ADD COMMENT
1
Entering edit mode
7.9 years ago
John 13k

Just as Rob has said, sorting helps out the compression algorithm as reads sorted by position will have very similar DNA, eg:

AAAAAGTGTT
AAAAAGTGTT
AAAAGTGTTA
AAAAGTGTTA
AAAGTGTTAA
...

Also the chromosome name will often be the same, as will the first few bits of the position value.

From a more theoretical point, which I think is interesting enough to worth mentioning, sorting is actually all about removing information from the data. This might sound weird, but in terms of entropy, thats exactly what's happening:

 Original         Sorted
    5               1
    3               2
    4               3
    2               4
    1               5
   10               6
    9               7
    8               8
    7               9
    6              10
   14              11
   16              12
   15              13
   17              14
   13              15
   11              16
   12              17

You might say that these two lists contain exactly the same amount of information in them, but you'd be wrong. Although there's a 1:1 mapping of items in the lists, I encoded a message in the original list, where if the number on the next row is higher than the number on the current row, we write down a 1, else a 0. Can you read the message?

Thus, sorting the list reduce the amount of information/entropy in the dataset. This is, fundamentally, why sorting helps compression.

ADD COMMENT

Login before adding your answer.

Traffic: 2370 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6