Samtools sort by name - bam size issue
7 months ago
quentin54520 ▴ 20

Hello all,

I want to sort my BAM by query name, so I used the command:

samtools sort -n -m 1 -@ 10 -o /Path/ouput.bam /path/input.bam


It works fine, but the resulting ouput.bam is much bigger than input.bam (70 GB vs 47 GB). Is this normal? I would prefer to be sure before going on with this BAM. My input BAM is the output of samtools view (used to remove unwanted reads).

Quentin

alignment genome samtools • 250 views

I would suggest always being explicit when setting memory limits, so 1G rather than 1. Maybe recent samtools versions are bullet-proof against this, but (if memory serves) there was a time when -m 1 was interpreted as 1 byte of memory, which resulted in samtools spamming the disk with millions of tiny temporary files, one per chunk. I may be mixing it up with another tool, but it does not hurt to be explicit. And yes, as Pierre says, this behaviour is expected.
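To make the point about explicit units concrete, here is a minimal sketch of the same sort with a suffixed -m value. The paths and variable names are placeholders of mine, not from the original post. Since -m is a per-thread limit, the snippet also prints a rough estimate of the total memory the sort may use, and it only calls samtools if the tool and the input file are actually present.

```shell
# Hypothetical paths; replace with your own files.
input_bam="/path/input.bam"
output_bam="/path/output.bam"
threads=10           # value passed to -@
mem_per_thread="1G"  # explicit unit suffix (K, M or G) instead of a bare number

# -m applies per thread, so estimate the total memory used by the sort.
total_gb=$(( threads * ${mem_per_thread%G} ))
echo "Approximate sort memory: ${total_gb}G"

# Only run the sort when samtools and the input file are available.
if command -v samtools >/dev/null 2>&1 && [ -f "$input_bam" ]; then
    samtools sort -n -m "$mem_per_thread" -@ "$threads" -o "$output_bam" "$input_bam"
else
    echo "Skipping sort: samtools or $input_bam not found"
fi
```

With 10 threads and 1G each, plan for roughly 10G of RAM on top of whatever else is running on the machine.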


Thanks for your reply. Yes, in the real command I used 1G, but I forgot to write it here, sorry 🙂

7 months ago

Yes, it's normal. When the file is sorted by coordinate, similar DNA sequences end up grouped in the same gzip compression block, which improves the compression ratio. When you sort by query name, you break up those groups.
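If you want to see the effect in isolation, here is a toy illustration using plain gzip on made-up sequences. It is not a BAM file and does not use samtools. It writes the same pseudo-random "reads" twice: once with identical copies adjacent (a stand-in for coordinate order) and once with the copies far apart (a stand-in for query-name order), then compares the compressed sizes. The read count, read length and file names are arbitrary choices of mine.

```shell
tmpdir=$(mktemp -d)

# Build 2000 pseudo-random 100-base sequences and write each one 5 times,
# in two different orders, to two separate files.
awk -v grouped="$tmpdir/grouped.txt" -v scattered="$tmpdir/scattered.txt" 'BEGIN {
    srand(42)
    split("A C G T", bases, " ")
    for (i = 1; i <= 2000; i++) {
        seq = ""
        for (j = 1; j <= 100; j++) seq = seq bases[int(rand() * 4) + 1]
        s[i] = seq
    }
    # Grouped: the 5 copies of each sequence sit next to each other.
    for (i = 1; i <= 2000; i++) for (c = 1; c <= 5; c++) print s[i] > grouped
    # Scattered: copies of the same sequence are 2000 lines apart.
    for (c = 1; c <= 5; c++) for (i = 1; i <= 2000; i++) print s[i] > scattered
}'

# Same content, different order: compare the gzip-compressed sizes.
grouped_size=$(gzip -c "$tmpdir/grouped.txt" | wc -c | tr -d ' ')
scattered_size=$(gzip -c "$tmpdir/scattered.txt" | wc -c | tr -d ' ')
echo "grouped: ${grouped_size} bytes, scattered: ${scattered_size} bytes"

rm -rf "$tmpdir"
```

Both files hold exactly the same lines, but the grouped one should come out clearly smaller, because the compressor can only reuse a repeated sequence when the previous copy is still nearby. The same logic explains why the name-sorted BAM is bigger than the coordinate-sorted one.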