Question: Why samtools sort input file is 4.3 Gb but output become 3.5Gb....
1
3.2 years ago by
bxia140
bxia140 wrote:

Try to sort the bam file, but the input is 4.3 Gb, output is 3.5 Gb

command line:

samtools sort 231.bam 231.sorted.bam

chip-seq • 1.2k views
modified 3.2 years ago by John12k • written 3.2 years ago by bxia140
6
3.2 years ago by
Rob3.4k
United States
Rob3.4k wrote:

This is because BAM files are block compressed, and sorted data is more easily compressible than unsorted data (repetitions and similar records tend to be closer together, and the underlying compression algorithm can do a better job). This is a well-documented phenomenon, and there are even a number of compression tools (e.g. SCALCE, MINCE, ORCOM, etc.) that are built on this principle of compression "boosting".

1
3.2 years ago by
John12k
Germany
John12k wrote:

Just as Rob has said, sorting helps out the compression algorithm as reads sorted by position will have very similar DNA, eg:

``````AAAAAGTGTT
AAAAAGTGTT
AAAAGTGTTA
AAAAGTGTTA
AAAGTGTTAA
...
``````

Also the chromosome name will often be the same, as will the first few bits of the position value.

From a more theoretical point, which I think is interesting enough to worth mentioning, sorting is actually all about removing information from the data. This might sound weird, but in terms of entropy, thats exactly what's happening:

`````` Original         Sorted
5               1
3               2
4               3
2               4
1               5
10               6
9               7
8               8
7               9
6              10
14              11
16              12
15              13
17              14
13              15
11              16
12              17
``````

You might say that these two lists contain exactly the same amount of information in them, but you'd be wrong. Although there's a 1:1 mapping of items in the lists, I encoded a message in the original list, where if the number on the next row is higher than the number on the current row, we write down a 1, else a 0. Can you read the message?

Thus, sorting the list reduce the amount of information/entropy in the dataset. This is, fundamentally, why sorting helps compression.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.