There are some aspects of the SAM/BAM format that I still do not understand, one of which is what secondary alignments should point to.
PNEXT: Position of the primary alignment of the NEXT read in the template. Set as 0 when the information is unavailable. This field equals POS at the primary line of the next read.
This quote changes the way I see the SAM format entirely. Here's a diagram to help:
Alignments 1 and 3 are from the same read, as are alignments 2 and 4. The aligner can put these four alignments into 2 "chains". Chain 1->2, because the distance between the reads is sensible, and they are on the same chromosome, and chain 3->4, again for the same reasons. Chaining 1->4 or 3->2 is not possible due to the distances being too small/large, or perhaps different chromosomes.
In this situation, I would expect an aligner to give all 4 alignments the same QNAME, but 1 and 2 are the primary alignments, and 3 and 4 are secondary alignments. Arrows show where the PNEXT of each alignment points to. This forms the two chains, all of which are properly mapped in a pair (say the aligner doesn't think being secondary prevents being properly mapped).
The way this quote reads however, is that all secondary alignments should point to the primary alignment of the next read in the fragment/template, as show in the second illustration. Alignments 3 and 4 cannot have the properly mapped in a pair flag set. The knowledge that alignments 3 and 4 would be properly mapped in a pair if alignments 1 and 2 didn't exist is not encodable within the SAM spec. and must be deduced by programs interpreting the whole alignment group (all alignments with the same QNAME).
If anyone could clarify on which scenario is correct, that would really help me figure out what data in SAM format is supposed to look like. It seems like this notion I had of chaining together alignments with the RNEXT/PNEXT is totally false..?
All the best,
Yeah he'll definitely know - he's clarified so many things about BAM for me over the years he should be second author on my thesis -_-;
EDIT: The same could be said about you genomax :)
Search the samtools-devel email list. There was so discussion about this topic within the past 6 months or so. The consensus is that this is an ambiguity in the spec (if I remember correctly).
BTW, in practice what you expected to see is what you get, but it ends up being aligner-dependent.
Ah ok thanks Devon, that's good to hear. I think i found the conversation on the mailing list, with James Bonfield?) I was having a sort of crisis of confidence over this and i'm kind of glad that the answer is "whatever the aligners want". I at least feel a #####little##### less stupid :)
Oop. well. that feeling didn't last long... :P
About a minute or so, looking at the time interval :-)
The SAM spec is kind of confusing. It does not really provide information about what you're supposed to do with multiple discordant alignments for pairs...
What aligner are you using? I'd be interested to see what BBMap produces in this scenario. Not sure if it will be correct, though. But of course I'll try to fix it if it's wrong.
I'm not looking at any data in particular - i'm just trying to explain the BAM spec in my thesis, and despite having worked with BAM data for years and recently wrote a parser in python, I just sort of realised that I can't explain the BAM spec. o_o
Yeah, it's, um. Not sure if it can actually be completely described in a closed from. I highly doubt it.
I don't think you need to describe the format in your thesis. A link to the spec will suffice and a sentence or two will suffice.