Fast way to sort bam file by queryname similar to picard SortSam SORT_ORDER=queryname?
7 weeks ago
kalavattam

When sorting by queryname with Samtools (samtools sort -n), Samtools does a natural sort by colon-delimited subfield. On the other hand, when sorting by queryname with Picard (picard SortSam SORT_ORDER=queryname), Picard does not sort by colon-delimited subfield; instead, it treats the queryname as a single field and sorts in ASCII order (for example, as described in this comment and its sub-comments).
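To make the difference concrete, here is a minimal C sketch of the two orderings. `ascii_cmp` compares whole names byte by byte, which is the ASCII-style ordering attributed to Picard above. `natural_cmp` is a simplified, hypothetical numeric-aware comparison written only for illustration; it is not the Samtools implementation, and the read names in the note below are made up.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Byte-wise comparison: the whole name is one field, compared in ASCII order. */
int ascii_cmp(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Simplified natural comparison (hypothetical, for illustration only):
 * runs of digits are compared by numeric value, everything else by byte.
 * Leading zeros are ignored, so "007" and "7" count as equal here. */
int natural_cmp_str(const char *a, const char *b)
{
    while (*a && *b) {
        if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
            char *end_a, *end_b;
            unsigned long long x = strtoull(a, &end_a, 10);
            unsigned long long y = strtoull(b, &end_b, 10);
            if (x != y) return x < y ? -1 : 1;
            a = end_a;   /* equal numbers: skip both digit runs */
            b = end_b;
        } else {
            if (*a != *b) return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
            a++;
            b++;
        }
    }
    /* Whichever string still has characters left sorts later. */
    return *a ? 1 : (*b ? -1 : 0);
}

/* qsort-compatible wrapper around natural_cmp_str. */
int natural_cmp(const void *a, const void *b)
{
    return natural_cmp_str(*(const char *const *)a, *(const char *const *)b);
}
```

Sorting the names read:10:2, read:9:11 and read:9:2 with `qsort` and `ascii_cmp` gives read:10:2, read:9:11, read:9:2, because the character '1' sorts before '9'. With `natural_cmp` the result is read:9:2, read:9:11, read:10:2, because 9 < 10 and 2 < 11 as numbers.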

I would like to sort my bam files in the picard SortSam SORT_ORDER=queryname manner, but picard SortSam is quite a bit slower than samtools sort -n: samtools sort -n can be parallelized, while picard SortSam SORT_ORDER=queryname can't. Is there a fast alternative to picard SortSam SORT_ORDER=queryname for this task?

7 weeks ago

I don't think there is any software that does this "fast". You could fork samtools and change the function that compares the read names here:

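    if (g_is_by_qname) {
        /* Natural (numeric-aware) comparison of the two read names. */
        int t = strnum_cmp(bam_get_qname(a.bam_record), bam_get_qname(b.bam_record));
        if (t != 0) return t;
        /* Same name: order by the read1/read2 bits (0x40 | 0x80) of the flag. */
        return (int) (a.bam_record->core.flag&0xc0) - (int) (b.bam_record->core.flag&0xc0);
    }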

strnum_cmp is implemented here
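As a rough sketch of what that change could look like, the comparison below uses `strcmp` in place of `strnum_cmp`, so the name is treated as a single field in byte order, and keeps the same flag tie-break as the excerpt above. The `mock_read` struct and the name `qname_ascii_cmp` are hypothetical stand-ins so the example compiles without htslib. Whether the resulting order matches Picard's output in every case (for example, how ties between records with the same name are broken) is an assumption that would need checking against Picard itself.

```c
#include <string.h>

/* Hypothetical stand-in for the two fields the samtools excerpt reads
 * from a BAM record; a real fork would keep using bam1_t. */
typedef struct {
    const char *qname;  /* read name */
    unsigned flag;      /* SAM FLAG field */
} mock_read;

/* Compare names as one field in byte (ASCII) order, then fall back to
 * the read1/read2 bits (0x40 | 0x80 = 0xc0), as in the excerpt above. */
int qname_ascii_cmp(const mock_read *a, const mock_read *b)
{
    int t = strcmp(a->qname, b->qname);
    if (t != 0) return t;
    return (int)(a->flag & 0xc0) - (int)(b->flag & 0xc0);
}
```

In a fork, the equivalent edit would be swapping the `strnum_cmp(...)` call for `strcmp(...)` on the same two `bam_get_qname(...)` arguments, which leaves the multithreaded sorting machinery around it untouched.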


