Question: SAM / BAM alignments
8
gravatar for John
2.3 years ago by
John12k
Germany
John12k wrote:

Hello all,

There are some aspects of the SAM/BAM format that I still do not understand, one of which is what secondary alignments should point to.

PNEXT: Position of the primary alignment of the NEXT read in the template. Set as 0 when the information is unavailable. This field equals POS at the primary line of the next read.

This quote changes the way I see the SAM format entirely. Here's a diagram to help:

image

Alignments 1 and 3 are from the same read, as are alignments 2 and 4. The aligner can put these four alignments into 2 "chains". Chain 1->2, because the distance between the reads is sensible, and they are on the same chromosome, and chain 3->4, again for the same reasons. Chaining 1->4 or 3->2 is not possible due to the distances being too small/large, or perhaps different chromosomes.

In this situation, I would expect an aligner to give all 4 alignments the same QNAME, but 1 and 2 are the primary alignments, and 3 and 4 are secondary alignments. Arrows show where the PNEXT of each alignment points to. This forms the two chains, all of which are properly mapped in a pair (say the aligner doesn't think being secondary prevents being properly mapped).

image2

The way this quote reads however, is that all secondary alignments should point to the primary alignment of the next read in the fragment/template, as show in the second illustration. Alignments 3 and 4 cannot have the properly mapped in a pair flag set. The knowledge that alignments 3 and 4 would be properly mapped in a pair if alignments 1 and 2 didn't exist is not encodable within the SAM spec. and must be deduced by programs interpreting the whole alignment group (all alignments with the same QNAME).

If anyone could clarify on which scenario is correct, that would really help me figure out what data in SAM format is supposed to look like. It seems like this notion I had of chaining together alignments with the RNEXT/PNEXT is totally false..?

All the best,

sam bam • 1.0k views
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by John12k
1

Tagging: lh3

ADD REPLYlink written 2.3 years ago by genomax65k

Yeah he'll definitely know - he's clarified so many things about BAM for me over the years he should be second author on my thesis -_-;

EDIT: The same could be said about you genomax :)

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by John12k
1

Search the samtools-devel email list. There was so discussion about this topic within the past 6 months or so. The consensus is that this is an ambiguity in the spec (if I remember correctly).

BTW, in practice what you expected to see is what you get, but it ends up being aligner-dependent.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Devon Ryan89k

Ah ok thanks Devon, that's good to hear. I think i found the conversation on the mailing list, with James Bonfield?) I was having a sort of crisis of confidence over this and i'm kind of glad that the answer is "whatever the aligners want". I at least feel a #####little##### less stupid :)

ADD REPLYlink written 2.3 years ago by John12k

Oop. well. that feeling didn't last long... :P

ADD REPLYlink written 2.3 years ago by John12k

About a minute or so, looking at the time interval :-)

ADD REPLYlink written 2.3 years ago by genomax65k
1

Hi John,

The SAM spec is kind of confusing. It does not really provide information about what you're supposed to do with multiple discordant alignments for pairs...

What aligner are you using? I'd be interested to see what BBMap produces in this scenario. Not sure if it will be correct, though. But of course I'll try to fix it if it's wrong.

ADD REPLYlink written 2.3 years ago by Brian Bushnell16k
1

I'm not looking at any data in particular - i'm just trying to explain the BAM spec in my thesis, and despite having worked with BAM data for years and recently wrote a parser in python, I just sort of realised that I can't explain the BAM spec. o_o

ADD REPLYlink written 2.3 years ago by John12k

Yeah, it's, um. Not sure if it can actually be completely described in a closed from. I highly doubt it.

ADD REPLYlink written 2.3 years ago by Brian Bushnell16k

I don't think you need to describe the format in your thesis. A link to the spec will suffice and a sentence or two will suffice.

ADD REPLYlink written 2.3 years ago by Devon Ryan89k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1583 users visited in the last hour