Question

Finding exactly where reads map from SAM

0

Entering edit mode

6 months ago

Sisyphus • 0

Hello,

I am trying to find exactly where my reads have mapped on a reference genome using information found in a SAM file. The files I am working with were generated using Bowtie1 and bwa. These tools were run with default parameters, no special flags were utilized, other than Bowtie being run in SAM mode output(-S). Looking at the resultant SAMs, I have things like POS, TLEN, CIGAR, optional MD field, etc. What I would like to do is to know precisely where the reads that aligned mapped to the genome including their beginning and end positions. I cannot find a straightforward way to do this. I cannot use existing tools as I need to write this myself. Is there a straightforward way to do this?

alignment read Bowtie SAM BWA • 528 views

ADD COMMENT • link updated 6 months ago by Istvan Albert 100k • written 6 months ago by Sisyphus • 0

score 2 · Answer 1 · 2023-10-12

2

Entering edit mode

6 months ago

Istvan Albert 100k

alas, there is no straightforward way to find out the end of the alignment - I personally consider this as an oversight of the BAM format ....

what I would recommend is that you use another tool, for example bedtools bamtobed that would compute the end of each alignment

if you need to do it by hand, you would need to parse the cigar and add up the numbers in the cigar string for M and D operators (while ignoring the all other operators) and add those to POS so your alignment starts and ends would be:

start = POS
end = POS + Nmatch + Ndel

it is not all that hard to do actually, write a mini parser for the CIGAR string format that computes each of the above

ADD COMMENT • link 6 months ago by Istvan Albert 100k

0

Entering edit mode

Thank you so much for your response. A couple of things:

Must the bit flags be taken into account?
I have paired end reads and they have different CIGAR strings. Sometimes the TLEN is zero and other times it is longer than the combined SEQ length of the reads. I notice that the POS and PNEXT are inverse of one another for these paired reads lines. Wondering the significance of these facts?
The difference between the PNEXT lines, the TLEN, and the length from the combined M and D operator lengths, when added together for the pair, are all different numbers. Is this expected?

ADD REPLY • link 6 months ago by Sisyphus • 0

0

Entering edit mode

The bitflag is used only to determine which read is aligned (that bit 4 is not set)

If you are dealing with paired-end reads then the start/end for each read alignment would be the same, as long as you want the alignment of the read itself.

But maybe you are asking about the alignment of the fragment that the pairs come from. In that case, when you have two pairs one alignment in the pair is more to the left and the other more to the right. Then to compute the start/end of the fragment, you would do:

start = leftmost POS
end  = rightmost POS + Nmatch + Ndel

where the Nmatch and Ndel are computed on the rightmost of the pair only.

I believe the TLEN is computed as indicated above as end - start. Thus if you already have and want to use TLEN you could do

start = leftmost POS
end = start + TLEN

Which is evidently a lot simpler to do :-)

But it may have corner cases when reads don't align as proper pairs and the aligner will set TLEN zero - the SAM spec will describe these behaviors.

The negative length is just an artifact where the coordinate is computed from the left in pair or the right in pair.

The length of the SEQ cannot be used unless the alignment has only matches and mismatches.

ADD REPLY • link 6 months ago by Istvan Albert 100k