Question: Is there any benefit in sorting a sam/bam file by coordinates vs. queryname?
17 months ago, O.rka wrote:

I'm running the Drop-seq pipeline and there is a step where the SAM file gets sorted. It looks like there is an option to sort either by coordinates or by queryname. Is there a benefit to either of these?


Sorting by coordinates is more efficient when the file is visualised in, for instance, a genome browser (because it is queried by genomic location rather than by name). I can't think of one immediately, but I'm sure that sorting on names also has its use cases.

Reply written 17 months ago by lieven.sterck
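
To illustrate why a location-based query benefits from coordinate order, here is a minimal sketch using a binary search over toy records. The record layout and the function name `reads_in_region` are made up for illustration, and chromosomes are ordered alphabetically here for simplicity; real BAM files rely on an index file rather than on this exact approach.

```python
from bisect import bisect_left, bisect_right

# Hypothetical toy records: (chromosome, start, read_name), already coordinate-sorted.
records = [
    ("chr1", 100, "r3"),
    ("chr1", 250, "r1"),
    ("chr1", 900, "r4"),
    ("chr2", 50, "r2"),
]


def reads_in_region(sorted_records, chrom, start, end):
    """Return names of reads whose start lies in [start, end] on chrom."""
    # A real index would be built once and reused; it is rebuilt here for brevity.
    keys = [(c, s) for c, s, _ in sorted_records]
    lo = bisect_left(keys, (chrom, start))
    hi = bisect_right(keys, (chrom, end))
    # Because the records are sorted, the matches form one contiguous slice.
    return [name for _, _, name in sorted_records[lo:hi]]


print(reads_in_region(records, "chr1", 200, 1000))
```

With name-sorted records the matching reads would be scattered through the list, so every record would have to be inspected.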

I think paired-end reads would be guaranteed to be adjacent in a name-sorted file, whereas that wouldn't necessarily be so in a coordinate-sorted BAM.

Reply modified 17 months ago • written 17 months ago by manuel.belmadani

True. Quantification of paired-end reads when counting fragments (defined by the two mates) requires name-sorting. Tools like featureCounts will reorder the BAMs by query name if you specify paired-end input.

Reply written 17 months ago by ATpoint
17 months ago, h.mon wrote:

Reading a file sequentially is faster than random access, and keeping in memory just the information necessary for some calculation is more efficient than keeping the whole file. Some tasks are more easily performed depending on how the bam is sorted, because the bam can be read sequentially and just part of the data need to be kept in memory.

For example, marking duplicates (which, for paired reads, is done by looking at 5' mapping positions of both reads) is a lot easier for bams sorted by position, because you guarantee the reads physically closer inside the bam are also close on the genome. If they weren't, one would need to scan the whole file to build a hash of reads per position in order to mark duplicates.
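
A rough sketch of that streaming idea, under simplifying assumptions: reads are single-end, the leftmost position stands in for the 5' mapping position, and the first read seen at a locus is the one kept. The tuple layout and the name `mark_duplicates` are invented for this example and do not reflect how any particular tool is implemented.

```python
def mark_duplicates(sorted_reads):
    """Yield (name, is_duplicate) for reads sorted by (chrom, pos).

    Only the strands seen at the current locus are held in memory,
    which is possible because equal positions are adjacent in the input.
    """
    current_locus = None
    seen_strands = set()
    for name, chrom, pos, strand in sorted_reads:
        locus = (chrom, pos)
        if locus != current_locus:
            # New position: everything remembered about the old one can be dropped.
            current_locus = locus
            seen_strands = set()
        is_duplicate = strand in seen_strands
        seen_strands.add(strand)
        yield name, is_duplicate


# Hypothetical coordinate-sorted reads: (name, chrom, pos, strand).
reads = [
    ("a", "chr1", 100, "+"),
    ("c", "chr1", 100, "-"),
    ("b", "chr1", 100, "+"),
    ("d", "chr1", 180, "+"),
]
result = list(mark_duplicates(reads))
print(result)
```

On unsorted input the same function would silently miss duplicates, which is why the whole file would otherwise have to be hashed by position.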

Conversely, counting reads mapping to features is easier for name-sorted files, as read pairs are next to each other, and secondary / supplementary alignments are next to primary alignments. Again, if they weren't, one would need to scan the whole file to build a hash of read names per feature mapped.
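
The name-sorted case can be sketched the same way. The example below groups consecutive alignments by read name and counts each fragment once; the record layout, the function name `count_fragments`, and the rule of skipping fragments that hit more than one feature are assumptions made for illustration, not the behaviour of any specific counting tool.

```python
from collections import Counter
from itertools import groupby

# Hypothetical name-sorted alignments: (read_name, feature or None).
alignments = [
    ("read1", "geneA"),  # mate 1
    ("read1", "geneA"),  # mate 2
    ("read2", "geneA"),
    ("read2", "geneB"),  # mates disagree, so the fragment is ambiguous
    ("read3", None),     # this mate overlaps no feature
    ("read3", "geneB"),
]


def count_fragments(name_sorted):
    """Count one fragment per read name, streaming over adjacent records."""
    counts = Counter()
    # groupby only merges consecutive items, so it relies on name-sorted input.
    for _, group in groupby(name_sorted, key=lambda rec: rec[0]):
        features = {feature for _, feature in group if feature is not None}
        if len(features) == 1:  # assumed rule: count unambiguous fragments only
            counts[features.pop()] += 1
    return counts


fragment_counts = count_fragments(alignments)
print(fragment_counts)
```

Only the alignments of one read name are in memory at any time, which is the benefit described above.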

Of course, the most immediate concern for an end user is to check which sorting order the tool of choice expects and which settings it needs.


My memory is a bit fuzzy on this, but I recall a discussion that a name- or coordinate-sorted BAM file can compress better (similar characters near each other compress better). If no one else comments on this I will check on it tomorrow.

Reply modified 17 months ago • written 17 months ago by genomax

You are right, coordinate-sorted bam files are smaller than unsorted bam files at the same compression level. In fact, even fastq files can be further compressed by clustering similar sequences, as is done by Clumpify from the BBTools package.

My intuition says name sorting wouldn't help much, if at all, to further compress a bam file.

Reply written 17 months ago by h.mon

Do you recall whether name-sorted BAM files are smaller or larger (not sure by how much) than coordinate-sorted ones made from the same file? One of them should be smaller, since name-sorted BAMs will have similar fastq headers near each other. Based on your comment about Clumpify, my feeling is that the name-sorted BAM may be smaller (not by a lot, but still) than the same file sorted by coordinates. If you don't check on it tonight I will check it tomorrow.

Reply modified 17 months ago • written 17 months ago by genomax

I don't recall, but I just ran some quick tests with small bams I have around (example files from several installed programs). I name- and coordinate-sorted these files and compared sizes:

ls -l -S
total 817060
-rw-r--r-- 1 hmon hmon 474147656 Mar 12 00:10 nsorted_f1.bam
-rw-r--r-- 1 hmon hmon 323868932 Mar 12 00:07 csorted_f1.bam
-rw-r--r-- 1 hmon hmon  16269272 Mar 12 00:17 nsorted_f4.bam
-rw-r--r-- 1 hmon hmon  15985146 Mar 12 00:17 csorted_f4.bam
-rw-r--r-- 1 hmon hmon   2596560 Mar 12 00:20 nsorted_f2.bam
-rw-r--r-- 1 hmon hmon   1762304 Mar 12 00:20 csorted_f2.bam
-rw-r--r-- 1 hmon hmon   1290775 Mar 12 00:17 nsorted_f3.bam
-rw-r--r-- 1 hmon hmon    736082 Mar 12 00:17 csorted_f3.bam

Coordinate-sorted files were always smaller.
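
To put numbers on the listing above, the byte counts can be turned into size ratios. This small sketch only does arithmetic on the sizes shown; the dictionary layout is made up for the example. By this calculation the coordinate-sorted files are smaller by roughly 2% (f4) up to about 43% (f3).

```python
# Byte sizes copied from the listing: (name-sorted, coordinate-sorted).
sizes = {
    "f1": (474147656, 323868932),
    "f2": (2596560, 1762304),
    "f3": (1290775, 736082),
    "f4": (16269272, 15985146),
}

# Ratio below 1.0 means the coordinate-sorted file is the smaller one.
ratios = {name: csorted / nsorted for name, (nsorted, csorted) in sizes.items()}

for name, ratio in sorted(ratios.items()):
    print(f"{name}: coordinate-sorted is {ratio:.1%} of the name-sorted size")
```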

Reply written 17 months ago by h.mon

Thanks for checking that. I guess having chromosome names lined up makes for better compression than the read names.

Reply written 17 months ago by genomax

It is not the chromosome names that compress better, it is the sequences - they are generally longer than chromosome names. When sorting by coordinate, similar or identical reads end up next to each other, improving compression.
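
The effect can be reproduced on synthetic data. The sketch below draws overlapping reads from a random reference and compresses them with `zlib`, once in coordinate order and once shuffled. Everything here is an assumption for illustration: the reference and reads are simulated, only sequences are compressed (no names or qualities), the shuffled order is a stand-in for name order on the assumption that read names are unrelated to position, and plain `zlib` stands in for the deflate-based block compression that BAM uses.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated reference and reads (hypothetical sizes chosen to keep this fast).
reference = "".join(rng.choice("ACGT") for _ in range(100_000))
read_length = 100
starts = [rng.randrange(len(reference) - read_length) for _ in range(5_000)]
reads = [(start, reference[start:start + read_length]) for start in starts]


def compressed_size(ordered_reads):
    """Compress the read sequences in the given order and return the byte count."""
    payload = "\n".join(seq for _, seq in ordered_reads).encode("ascii")
    return len(zlib.compress(payload, 6))


coordinate_order = sorted(reads)  # neighbours overlap on the reference
shuffled_order = reads[:]
rng.shuffle(shuffled_order)       # neighbours are mostly unrelated

coord_size = compressed_size(coordinate_order)
shuffled_size = compressed_size(shuffled_order)
print(coord_size, shuffled_size)
```

In coordinate order each read shares most of its bases with the one before it, so the compressor can encode it largely as a back-reference to recent data.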

I also faintly remember a thread where this issue was discussed in detail, but I can't find it. However, the question is not new, and has been discussed several times, e.g.:

sorting a BAM produces a smaller file than the original

Size of BAM file reduces after sorting with samtools

Reply written 17 months ago by h.mon