I have some RNA-seq reads in indexed BAM format, which I am converting to BED with
bamToBed -split to understand how skipped regions within the read are handled.
Here is one sample read for some human data, which (according to the CIGAR string
38M88N9M) contains a skip of 88 bases:
RNAME : chr21 POS - 1 : 9825622 POS - 1 + length(CIGAR) : 9825757 QNAME : HWI-ST700693:268:C1LYCACXX:7:1215:16268:22176 FLAG : 83 16 & FLAG : - MAPQ : 255 CIGAR : 38M88N9M RNEXT : = PNEXT : 9825546 TLEN : -212 SEQ : CCGTTGGTGGCCCCGCCTGGNACCGAACCCGGCCCCGCGTGGGGCCC QUAL : @B?@88??6&00&6@==;5,#(69@;::6A766:)81@?D@:DD=1< TAGS : NM:i:2 XS:A:- NH:i:1
When converted to BED with
bamToBed -split -bed12, I get the following end result:
chr21 9825545 9825577 HWI-ST700693:268:C1LYCACXX:7:1215:16268:22176/2 255 + . . . chr21 9825622 9825757 HWI-ST700693:268:C1LYCACXX:7:1215:16268:22176/1 255 - . . .
Given the PNEXT of
9825546, I understand why the second read
HWI-.../2 starts at
What I don't understand is why this second read stops at
9M matches, shouldn't it be
9825546 - 1 + 38 + 9 — or
What I am missing about how this second read's coordinates are generated?
Other sample reads I have been looking at seem to follow the rule that the
M elements of the CIGAR string are summed, but this example and a few others do not.
I looked at the source code for
bamToBed, but I seem to be confused about how the
BamTools::getBamBlocks function operates with
-split, in that the
N operand in the CIGAR string appears to break a single BAM read into two separate BAM blocks, after which the blocks should be printed separately.
Instead, the first read
HWI-.../1 is printed as a contiguous region across the
88N skipped region (shouldn't this be two separate blocks of length 38 and 9?). It is also unclear why the second read has the stop value that it has.
Sorry if these are dumb questions and thanks for your knowledge.