Hi all,
I'm trying to re-sort my BAM files so that read mates stay together, but the common solutions I've seen online do not produce what I need (example below). This post builds on a similar past inquiry that, to my knowledge, went unresolved: Keeping paired reads together when sorting BAM file by name
FS10002072:15:BSE39216-1017:1:1101:1780:1000    99  x   194 44  151M    =   305 262 TTCGCCCCTCCCGGGGTCCTGCGGCGGGTCGCCTGCCCTGCCCCCGAACCCCGCCTGGGGGCCGCGGTCGGCCCGGCGCTTCTCCGGAGGCACCCACTGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF AS:i:275    XN:i:0  XM:i:4  XO:i:0  XG:i:0  NM:i:4  MD:Z:16A21A19A17G74 YS:i:281    YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:1780:1000    147 x   305 44  147M4S  =   194 -262    GTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGCGAGGTTCTGGCCTTTCAGGCCGCAGGAAGAGGAACGGAGCGAGTCCCCGCGTGCGGCGCGATTCCCTGAGCTGTGGGACGTGCACCCAGGACTCGGCTCACACATGCTACT FFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFF,FFFFF::,::FFFFFFFF:FFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:281    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:47A43C55   YS:i:275    YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:3950:1000    99  x   194 44  151M    =   305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAACCCCGCATGGAGGCCGCGGTCGGCTCGGCGCTTCTCAGGAGGCACCCACTGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFF::FFF:FFFF:FFFFF:FFFFFF,FFFF AS:i:274    XN:i:0  XM:i:4  XO:i:0  XG:i:0  NM:i:4  MD:Z:54C17C3G7C66   YS:i:281    YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:3950:1000    147 x   305 44  147M4S  =   194 -262    GTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGCGAGGTTCAGGCCTTTAAGGCCGCAGGAAGAGGAACGGAGCGAGTCCCCGCGCGTGGCGCGATTCCCTGAGCTGTGGGACGTGCACCCAGGACTCGGCTCACACATGCTGCC ,FFF,FFF,FFFFFFF::F:,F:F:FFFFF,F,FFFFFFF,FFF:FFFFFF:FF:FFF:FF,:FF:FFFFFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFF,FFFFFFFF AS:i:281    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:55C37C53   YS:i:274    YT:Z:CP
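For readers less familiar with the FLAG column, here is a minimal Python sketch that decodes the two values seen in the records above (99 and 147). The bit meanings come from the SAM specification; the helper name decode_flag is only an illustration and is not part of samtools or any library.

```python
# Bit meanings of the SAM FLAG column, as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value (hypothetical helper)."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(99))   # ['paired', 'proper_pair', 'mate_reverse', 'read1']
print(decode_flag(147))  # ['paired', 'proper_pair', 'reverse', 'read2']
```

So each pair of lines above is a read 1 followed directly by its read 2, which is the layout being asked for.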
So far I've tried:
1) samtools sort -n
The output seems to be sorted by ascending read names, which has the effect of separating the mates. Example below:
FS10002072:15:BSE39216-1017:1:1101:1000:3240    99  x   194 44  151M    =   305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAACCCCGCCTGGAGGCCGCGGTCGGCCCGGGGCTTCTCCGGAGGCACCCACTGCCACCGCGAAGTGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF AS:i:295    XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:110A40 YS:i:294    YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:10040:3630   99  x   194 44  48M1D103M   =   305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAACCCGCCTGGAGGCCGCGGTCGGCCCGGGGCTTCTCCGGAGGCACCCAATGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:287    XN:i:0  XM:i:1  XO:i:1  XG:i:1  NM:i:2  MD:Z:48^C47C55  YS:i:294    YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:10040:3870   99  x   194 44  151M    =   305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGATCCCCGCCTGGAGGCCGCGGTCGGCCCGGGGCTTCTCCGGAGGCACCCAATGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:288    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:47A48C54   YS:i:281    YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:10080:3790   99  x   194 44  151M    =   305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAACCCCGCCTGGAGGCCGCGGTCGGCCCGGGGCTTCTCCGGAGGCACCCACTGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF AS:i:302    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:151    YS:i:274    YT:Z:CP
2) picard-tools SortSam sorting by queryname
This produced the same result as example 1.
3) Rsubread's repair utility.
Strangely, this changed the MAPQ and CIGAR values of many reads to "0" and "*", respectively.
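One way to check whether any of the outputs above actually keeps mates together is to scan the records in order and flag every primary record that is not directly next to a record with the same name. The following is a rough Python sketch under that assumption; the function name unpaired_names and the shortened demo records are made up for illustration, and it expects SAM text lines (for example, as printed by samtools view).

```python
def unpaired_names(sam_lines):
    """Return QNAMEs of primary records that are not adjacent to their mate.

    Assumes SAM text lines; header lines (starting with '@') are skipped,
    as are secondary (0x100) and supplementary (0x800) records.
    """
    names = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.split()
        qname, flag = fields[0], int(fields[1])
        if flag & 0x900:  # 0x100 | 0x800: not a primary alignment
            continue
        names.append(qname)

    bad = []
    i = 0
    while i < len(names):
        if i + 1 < len(names) and names[i] == names[i + 1]:
            i += 2  # adjacent pair with the same name: mates are together
        else:
            bad.append(names[i])
            i += 1
    return bad


# Hypothetical, shortened records: r1 is grouped, r2 is split by r3.
demo = [
    "r1\t99\tx\t194",
    "r1\t147\tx\t305",
    "r2\t99\tx\t194",
    "r3\t99\tx\t194",
    "r2\t147\tx\t305",
]
print(unpaired_names(demo))  # ['r2', 'r3', 'r2']
```

An empty list means every primary record sits next to a record with the same name. A read name that appears only once (as in the samtools sort -n excerpt above) would be reported, which may point to mates missing from the file rather than to a sorting problem.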
Your samtools example looks like it is sorted by position, not by name. Are you sure it is the output of a samtools sort -n command?
An alternative to using samtools sort to group by name is samtools collate, though this does not guarantee the sort order between groups.
That doesn't look like sort -n, as it's position sorted. Are you sure?
Also, I'd recommend samtools collate as a far faster way of grouping mates together, unless there is a specific reason why the names need to be in sorted order (rather than simply grouped together) or unless you need to randomise position order (as collate is still position correlated), e.g. when doing analysis of insert size via sampling the first X reads.
What is your samtools version? Does that happen with the latest one?
ATpoint: I'm using samtools version 1.15.1.
I updated to the latest (1.16.1) and the sorting behavior was unchanged from the previous version.
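For anyone wanting to try the collate suggestion from the replies, here is a small Python sketch of how it could be invoked. The file names input.bam and grouped.bam are placeholders, and the two helper functions are hypothetical wrappers; only the samtools collate -o form itself is taken from the samtools command line.

```python
import shutil
import subprocess


def collate_command(in_bam, out_bam):
    """Build the argument list for grouping reads by name with samtools collate."""
    return ["samtools", "collate", "-o", out_bam, in_bam]


def run_collate(in_bam, out_bam):
    """Run samtools collate, assuming samtools is installed and on PATH."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools not found on PATH")
    subprocess.run(collate_command(in_bam, out_bam), check=True)


# Placeholder file names; print the command rather than running it here.
print(" ".join(collate_command("input.bam", "grouped.bam")))
# samtools collate -o grouped.bam input.bam
```

As noted in the replies, collate only groups records with the same name together; it does not guarantee any particular order between groups.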