Question

Sorting a STAR-output bam file with samtools leads to a significant reduction in size of the sorted bam file

0

Entering edit mode

16 months ago

e.r.zakiev ▴ 230

Normally STAR outputs bam files that are already sorted by coordinate: *.sortedByCoord.out.bam

But just now I tried sorting such output bam files with samtools and the resulting (presumably, twice-sorted) files are actually ~20% smaller, which is a bit concerning. I expected the files to be identical, much less observing a difference in sizes.

The headers, as queried by samtools view -H bamfile > f1 and samtools view -H bamfile_sorted > f2, are identical before and after sorting, as confirmed by the cmp f1 f2.

What is going on?

samtools STAR • 1.9k views

ADD COMMENT • link 16 months ago by e.r.zakiev ▴ 230

2

Entering edit mode

Generally do not use file sizes as a metric for any QC. Extent of compression can change based on what data is near each other.

If the file was already sorted then perhaps it was compressed more when you tried sorting it again. You can control the level of compression using following option for samtools sort.

-l INT     Set compression level, from 0 (uncompressed) to 9 (best)

ADD REPLY • link 16 months ago by GenoMax 144k

1

Entering edit mode

I've noticed what OP sees too - there's definitely a difference in size. I charted it up to difference in compression levels as well as there is no easy way to check if two BAMs are identical in terms of content - we'd also need to examine the significance of each difference, which IMO is a pain.

ADD REPLY • link 16 months ago by Ram 44k

2

Entering edit mode

BAM's header may not contain all the PG lines. To make sure that both STAR's BAM and samtools BAM contain the same number of reads run:

samtools idxstats bam_fn

If these are the same then IMHO there is nothing to worry. The size diff may be caused by different compression level (default for gzip is 6), blocksize or if somehow you got unmapped reads in the input BAM samtools minimizer option.

ADD REPLY • link 16 months ago by Darked89 4.6k

2

Entering edit mode

BAM's header may not contain all the PG lines.

Curious why that can happen? OP is re-sorting a sorted file (not sure why).

ADD REPLY • link 16 months ago by GenoMax 144k

1

Entering edit mode

You are right that in the OP case there is nothing indicating any missing PG lines. I meant a general case scenario where "two BAM headers match 100%" may not reflect the differences in BAM bodies. I.e. bedtools intersect outputs BAMs without the relevant PG line

ADD REPLY • link 16 months ago by Darked89 4.6k

0

Entering edit mode

was re-sorting to double-check (and to double-guess, apparently)

also multiple functions from the rseqc suite were printing an explicit warning saying that the input should be coordinate-sorted by samtools. And I understood this as if samtools introduces some tags or whatnot exclusive to it and only it during its sorting step. And that it might be different from STAR's sorted output files. Basically I was bullied by an analysis suite's warning messages

ADD REPLY • link 16 months ago by e.r.zakiev ▴ 230

1

Entering edit mode

samtools idxstats outputs identical outputs for both files, so I assume the difference in size is due to the compression then

ADD REPLY • link 16 months ago by e.r.zakiev ▴ 230

score 3 · Accepted Answer · 2023-03-14

I am not too familiar with the subject, neither with details of compression algorithms and nor the exact specification of the BAM format, but I think it is just a matter of more or less efficient compression.

Various implementations of zlib exist and the samtools authors have run some benchmarks showing significant differences, at least for lower compression levels.

But even with the exact same implementation and compression level, sizes of compressed files may vary, if their content is reordered. The reason is, that all compression works by storing similar content only once and referencing this piece of information at multiple locations. By design, zlib is limited to a 32 KB window, which was a sensible choice in the early '90s, when it was conceived. So if sorting brings more similar content closer together and into that 32kb window (e.g. by a different handling of mates or secondary alignments), it allows for a more efficient compression. This is nicely exemplified by clumpify from the BBTools suite, which can shrink compressed FastQ files by another 50% or so just by reordering reads by sequence similarity.

Since today's computers can hold much more information in memory, even in mobile and embedded environments, the window for compression algorithms can be increased for a way better performance. Facebook's Zstandard, for example, has no inherent limit and can address terabytes of memory. It mostly operates on something between 1 - 8 MB, though. However, it has a long range mode --long that allows for a maximum window size of 2GB. You would probably be surprised how much that can shrink a Fasta file of some reference genome interspersed with long transposons. Also zstd --train is a really cool feature, because it builds a dictionary of repeating patterns informed by the content that needs to be compressed. Once trained, using a dictionary works really well, if thousands of similar, small files need to be compressed individually.

But I'm getting off topic - ultimately you just wanted to know if everything was fine with your BAM files, and it is ;-)