Keeping paired reads together when sorting BAM file by name
3
4
Entering edit mode
9.8 years ago
Rob 6.9k

So I've noticed that the SNAP aligner sometimes writes alignment records for a read pair non-consecutively in the output BAM file (I think this is actually a bug). So, you'll get something like a bunch of alignments for read X, then alignments for some other reads, then one or more alignments for read X. I want all of the alignments for a read to appear consecutively in the file. The standard solution I've seen for this is to use e.g. samtools sort -n, and sort the reads by name. However, I've noticed that this has the undesirable property of breaking up records for alignments of paired-end reads that belong together. For example, when I do this, the beginning of my BAM file looks like:

0       99      ENST00000562765 656     0       60M16S  =       759     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP
0       355     ENST00000562212 846     0       60M16S  =       949     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP
0       355     ENST00000568423 804     0       60M16S  =       907     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP
0       355     ENST00000425597 854     0       60M16S  =       957     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP
0       355     ENST00000545456 714     0       60M16S  =       817     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP
0       355     ENST00000361900 871     0       60M16S  =       974     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP
0       355     ENST00000567529 564     0       60M16S  =       667     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP
0       147     ENST00000562765 759     0       76M     =       656     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP
0       403     ENST00000562212 949     0       76M     =       846     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP
0       403     ENST00000568423 907     0       76M     =       804     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP
0       403     ENST00000425597 957     0       76M     =       854     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP
0       403     ENST00000545456 817     0       76M     =       714     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP
0       403     ENST00000361900 974     0       76M     =       871     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP
0       403     ENST00000567529 667     0       76M     =       564     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0  RG:Z:FASTQ      PG:Z:SNAP

So here, the reads seem to be sorted by query name, but, within the set of records for a name, the pairs are all spread out (they actually seem to be sorted, secondarily, by the SAM status flags). This means that the paired alignment that corresponds to the ends of a read aligning to a given contig are no longer adjacent in the file (as they originally were, and as I'd like them to be). Is there some tool to accomplish this type of sorting, rather than what samtools -n does? I'd like to place all of the alignments for a given read together, but I'd like to keep paired-end alignments adjacent in the sorted output.

alignment next-gen RNA-Seq • 6.1k views
ADD COMMENT
0
Entering edit mode
9.8 years ago

You can try out this with pysam. It has many attributes and methods for read object and provides some features like next() and seek() to compare next read's attributes to current read and keep track of your pointer. Update this post if you find any solution. :)

ADD COMMENT
0
Entering edit mode
9.8 years ago
matted 7.8k

I think samtools is doing the right thing. The relevant piece of code from bam_sort.c is:

if (g_is_by_qname) {
        int t = strnum_cmp(bam_get_qname(a), bam_get_qname(b));
        return (t < 0 || (t == 0 && (a->core.flag&0xc0) < (b->core.flag&0xc0)));
    }

It compares the read names, and if they're the same, it masks out only the flags for first and second in pair, in order to sort by that.

In your example, there's one read with flags of 99, six with 355, one with 147, and six with 403. They're all the same name, read pair "0". The first seven reads are from one read end, and the last seven are from the other. The twelve repeated reads are secondary alignments (flag 0x100), so you could filter those out with samtools view -bF 256 in.bam > out.bam to get a bam file with only two read records per pair (the primary alignments).

ADD COMMENT
0
Entering edit mode

Hi matted,

I'm not arguing that samtools is doing the "wrong" thing (in fact, one could probably claim that, since sorting by query name is something that is somewhat ambiguously defined, and since samtools is samtools, what it's doing is, by definition, the "right thing"). What I'm saying is that it's not really giving me what I want. Perhaps if I explain the situation more, my use case will make sense. I care quite a bit about secondary alignments, because I'm doing RNA-seq quantification and I want to disambiguate multiple alignments to similar isoforms. What I'd really like is the output in the format provided by e.g. Bowtie, the alignments for a read are grouped together and, furthermore, the alignment records for read-ends mapping in a pair follow one another consecutively. For example, in the original file, the records for read 0 look like this:

0       99      ENST00000562765 656     0       60M16S  =       759     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0
0       147     ENST00000562765 759     0       76M     =       656     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0
0       355     ENST00000562212 846     0       60M16S  =       949     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0
0       403     ENST00000562212 949     0       76M     =       846     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0
0       355     ENST00000568423 804     0       60M16S  =       907     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0
0       403     ENST00000568423 907     0       76M     =       804     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0
0       355     ENST00000425597 854     0       60M16S  =       957     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0
0       403     ENST00000425597 957     0       76M     =       854     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0
0       355     ENST00000545456 714     0       60M16S  =       817     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0
0       403     ENST00000545456 817     0       76M     =       714     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0
0       355     ENST00000361900 871     0       60M16S  =       974     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0
0       403     ENST00000361900 974     0       76M     =       871     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0
0       355     ENST00000567529 564     0       60M16S  =       667     179     CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG    ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0
0       403     ENST00000567529 667     0       76M     =       564     -179    CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT    HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII    RG:Z:FASTQ      PL:Z:Illumina   PU:Z:pu LB:Z:lb SM:Z:sm PG:Z:SNAP       NM:i:0

Notice how the records for alignments of the read to the same contig are consecutive in the file (rather than all of the alignments for read1 followed by all of the alignments for read2). This order (which is what is usually provided by the read mappers) is much easier to parse and process in my case. Essentially, the processing that I need is a sort based solely on query name that it stable with respect to the original order --- that is, the alignment records with a given query name will have the same relative order in the input and output.

ADD REPLY
1
Entering edit mode

At least if SNAP produced alignments with the HI and NH auxiliary tags then the sorter from novoalign would produce what you want. Samtools should actually be pretty simple to modify to produce this sort order (I had previously tweaked samtools to handle HI and NH tags and that proved quite simple). If you do this often enough then let me know and I can help provide a version that will produce this sort order.

ADD REPLY
0
Entering edit mode

The alignments above are actually from SNAP. Unfortunately, it doesn't look like the HI and NH tags are present; thought that would, in fact, make the task much easier. It might be possible to annotate the records with this information "on-the-fly" from samtools, though, since the records are adjacent in the original file.

ADD REPLY
0
Entering edit mode

Samtools doesn't provide a mechanism to add those auxiliary tags, it expects the aligner to do it. Your best bet is to simply modify samtools' sort function.

ADD REPLY
1
Entering edit mode

Ah, I see, I misunderstood. I thought you wanted end1,end1,end1,end2,end2,end2, but instead you want end1,end2,end1,end2,end1,end2, where adjacent records refer to the same mapping (I think).

My only suggestion, which may be wrong since I've never used SNAP, is to try running it in single-threaded mode (-t 1). It's possible that the output ordering is getting slightly mixed up is because they're not thread-safe in output across pairs (just across single records). This is a total guess, though. If that works for you, it will be slower, but you could split your fastq file into smaller chunks beforehand to get the concurrency back.

ADD REPLY
0
Entering edit mode
5.1 years ago
stuart archer ▴ 140

I guess this is an old issue but I also ran into this problem with samtools sort -n ; I would just like it to sort by read name, but then preserve the original order from the aligner which places the read pairs together and makes it easier to parse.

There is already an issue on samtools github repo about this here: https://github.com/samtools/samtools/issues/520

In the meantime here is my work-around: add an additional tag called "XS" into the sam tags and insert an amalgam of the read name and original line-number in the bam file into that tag. Then pipe in to samtools sort -t to sort by that tag.

samtools view -h Aligned.out.bam | awk '{if(substr($1,1,1)=="@"){print $0} else {s = "0000000000"NR; print $0"\tXS:Z:"$1"_"substr(s, 1 + length(s) - 10); }}' | samtools sort -t XS > nameSorted.bam

The if(substr($1,1,1)=="@" conditional is to ensure that lines starting with "@" (sam headers) are not changed.

This is not tested extensively, obviously, but seems to work for me. You should verify that XS is not already in the sam file (if so just use a different tag name.)

ADD COMMENT
0
Entering edit mode

Actually I just realised there is a utility in Rsubread ("repair") that is meant to take a location-sorted bam file as input and place the read-pairs together. See http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf

ADD REPLY

Login before adding your answer.

Traffic: 825 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6