So I've noticed that the SNAP aligner sometimes writes alignment records for a read pair non-consecutively in the output BAM file (I think this is actually a bug). So, you'll get something like a bunch of alignments for read X, then alignments for some other reads, then one or more alignments for read X. I want all of the alignments for a read to appear consecutively in the file. The standard solution I've seen for this is to use e.g. samtools sort -n
, and sort the reads by name. However, I've noticed that this has the undesirable property of breaking up records for alignments of paired-end reads that belong together. For example, when I do this, the beginning of my BAM file looks like:
0 99 ENST00000562765 656 0 60M16S = 759 179 CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################ PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
0 355 ENST00000562212 846 0 60M16S = 949 179 CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################ PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
0 355 ENST00000568423 804 0 60M16S = 907 179 CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################ PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
0 355 ENST00000425597 854 0 60M16S = 957 179 CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################ PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
0 355 ENST00000545456 714 0 60M16S = 817 179 CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################ PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
0 355 ENST00000361900 871 0 60M16S = 974 179 CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################ PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
0 355 ENST00000567529 564 0 60M16S = 667 179 CAGTATTCCGCCACCCCCAATTACACGTACTCCAATGAGATGTGAACCAGCCACGCCTACCAGCGGGAAGGCGAGG ?IIIIIG@EA:DIGIIIIIIIIIIFHIIIIIIIHHBDGE=>DBECHIIHHHHEEDA@=8B################ PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
0 147 ENST00000562765 759 0 76M = 656 -179 CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
0 403 ENST00000562212 949 0 76M = 846 -179 CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
0 403 ENST00000568423 907 0 76M = 804 -179 CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
0 403 ENST00000425597 957 0 76M = 854 -179 CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
0 403 ENST00000545456 817 0 76M = 714 -179 CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
0 403 ENST00000361900 974 0 76M = 871 -179 CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
0 403 ENST00000567529 667 0 76M = 564 -179 CATCGTCATTTGTGGTTACCAAGCAGGGTTCCCCCTTCCCTTTTCTCCTTCCCTACTTTGTACAAAGGACCAGAGT HFB=61=;???A@?==DDBDDBGDFH@BDE>A>;:;<45?>EDGGGDDGIHG@GBGHIHGGGGG:74//;+:BDII PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm NM:i:0 RG:Z:FASTQ PG:Z:SNAP
So here, the reads seem to be sorted by query name, but, within the set of records for a name, the pairs are all spread out (they actually seem to be sorted, secondarily, by the SAM status flags). This means that the paired alignment that corresponds to the ends of a read aligning to a given contig are no longer adjacent in the file (as they originally were, and as I'd like them to be). Is there some tool to accomplish this type of sorting, rather than what samtools -n
does? I'd like to place all of the alignments for a given read together, but I'd like to keep paired-end alignments adjacent in the sorted output.
Hi matted,
I'm not arguing that samtools is doing the "wrong" thing (in fact, one could probably claim that, since sorting by query name is something that is somewhat ambiguously defined, and since samtools is samtools, what it's doing is, by definition, the "right thing"). What I'm saying is that it's not really giving me what I want. Perhaps if I explain the situation more, my use case will make sense. I care quite a bit about secondary alignments, because I'm doing RNA-seq quantification and I want to disambiguate multiple alignments to similar isoforms. What I'd really like is the output in the format provided by e.g. Bowtie, the alignments for a read are grouped together and, furthermore, the alignment records for read-ends mapping in a pair follow one another consecutively. For example, in the original file, the records for read 0 look like this:
Notice how the records for alignments of the read to the same contig are consecutive in the file (rather than all of the alignments for read1 followed by all of the alignments for read2). This order (which is what is usually provided by the read mappers) is much easier to parse and process in my case. Essentially, the processing that I need is a sort based solely on query name that it stable with respect to the original order --- that is, the alignment records with a given query name will have the same relative order in the input and output.
At least if SNAP produced alignments with the HI and NH auxiliary tags then the sorter from novoalign would produce what you want. Samtools should actually be pretty simple to modify to produce this sort order (I had previously tweaked samtools to handle HI and NH tags and that proved quite simple). If you do this often enough then let me know and I can help provide a version that will produce this sort order.
The alignments above are actually from SNAP. Unfortunately, it doesn't look like the HI and NH tags are present; thought that would, in fact, make the task much easier. It might be possible to annotate the records with this information "on-the-fly" from samtools, though, since the records are adjacent in the original file.
Samtools doesn't provide a mechanism to add those auxiliary tags, it expects the aligner to do it. Your best bet is to simply modify samtools' sort function.
Ah, I see, I misunderstood. I thought you wanted
end1,end1,end1,end2,end2,end2
, but instead you wantend1,end2,end1,end2,end1,end2
, where adjacent records refer to the same mapping (I think).My only suggestion, which may be wrong since I've never used SNAP, is to try running it in single-threaded mode (
-t 1
). It's possible that the output ordering is getting slightly mixed up is because they're not thread-safe in output across pairs (just across single records). This is a total guess, though. If that works for you, it will be slower, but you could split yourfastq
file into smaller chunks beforehand to get the concurrency back.