Calculating Coordinates Of Split Rna-Seq Reads Converted From Bam To Bed
2
0
Entering edit mode
8.9 years ago

I have some RNA-seq reads in indexed BAM format, which I am converting to BED with bamToBed -split to understand how skipped regions within the read are handled.

Here is one sample read for some human data, which (according to the CIGAR string 38M88N9M) contains a skip of 88 bases:

RNAME                             :  chr21
POS - 1                           :  9825622
POS - 1 + length(CIGAR)           :  9825757
QNAME                             :  HWI-ST700693:268:C1LYCACXX:7:1215:16268:22176
FLAG                              :  83
16 & FLAG                         :  -
MAPQ                              :  255
CIGAR                             :  38M88N9M
RNEXT                             :  =
PNEXT                             :  9825546
TLEN                              :  -212
SEQ                               :  CCGTTGGTGGCCCCGCCTGGNACCGAACCCGGCCCCGCGTGGGGCCC
QUAL                              :  @B?@88??6&00&6@==;5,#(69@;::6A766:)81@?D@:DD=1<
TAGS                              :  NM:i:2    XS:A:-    NH:i:1


When converted to BED with bamToBed -split -bed12, I get the following end result:

chr21    9825545    9825577    HWI-ST700693:268:C1LYCACXX:7:1215:16268:22176/2     255      +    . . .
chr21    9825622    9825757    HWI-ST700693:268:C1LYCACXX:7:1215:16268:22176/1     255      -    . . .


Given the PNEXT of 9825546, I understand why the second read HWI-.../2 starts at 9825545.

What I don't understand is why this second read stops at 9825577.

Given the 38M and 9M matches, shouldn't it be 9825546 - 1 + 38 + 9 — or 9825592?

What I am missing about how this second read's coordinates are generated?

Other sample reads I have been looking at seem to follow the rule that the M elements of the CIGAR string are summed, but this example and a few others do not.

I looked at the source code for bamToBed, but I seem to be confused about how the BamTools::getBamBlocks function operates with -split, in that the N operand in the CIGAR string appears to break a single BAM read into two separate BAM blocks, after which the blocks should be printed separately.

Instead, the first read HWI-.../1 is printed as a contiguous region across the 88N skipped region (shouldn't this be two separate blocks of length 38 and 9?). It is also unclear why the second read has the stop value that it has.

Sorry if these are dumb questions and thanks for your knowledge.

bed bam • 3.6k views
1
Entering edit mode
8.9 years ago

Hi Alex, I can't tell what is happening in the blocks of example BED12 output you provide seems incomplete, so I thought I'd demonstrate with a toy example.

Consider the following SAM file, which contains two alignments, one for each end of a paired-end sequence. End 2 aligns from 1-based positions 1-40 inclusive, yet it is a 30bp sequence and the middle 10bp are a skip (N operator). End 1 aligns from position 100-130, inclusive.

> cat two_blocks.sam
@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:chr1 LN:249250621
two_blocks_1_1    163    chr1    1    40    15M10N15M    chr1    100    130    GAAGGCCACCGCCGCGGTTATTTTCCTTCA    CCCDDB?=FJIIJIGFJIJHIJJJJJJJJI    MD:Z:50    XA:i:2
two_blocks_1_2    99    chr1    100    40    30M    chr1    1    -130    AGGCGATGCTAACGAAAAATTCGGAAATTT    CCCDDB?=FJIIJIGFJIJHIJJJJJJJJI    MD:Z:50    XA:i:2


Now, without the -split option, bedtools bamtobed will just report single blocks for each:

samtools view -Sb two_blocks.sam > two_blocks.bam
bedtools bamtobed -i two_blocks.bam
chr1    0    40    two_blocks_1_1/2    40    +
chr1    99    129    two_blocks_1_2/1    40    +


Now, if we use the -split option, end 2 is broken into two separate blocks (0-based started):

bedtools bamtobed -i two_blocks.bam -split
chr1    0    15    two_blocks_1_1/2    40    +
chr1    25    40    two_blocks_1_1/2    40    +
chr1    99    129    two_blocks_1_2/1    40    +


Using -bed12, the two blocks for end 2 are represented as BED blocks (the -split option is implied for -bed12)

 bedtools bamtobed -i two_blocks.bam -bed12
chr1    0    40    two_blocks_1_1/2    40    +    0    40    255,0,0    2    15,15    0,25
chr1    99    129    two_blocks_1_2/1    40    +    99    129    255,0,0    1    30    0

bedtools bamtobed -i two_blocks.bam -bed12 -split
chr1    0    40    two_blocks_1_1/2    40    +    0    40    255,0,0    2    15,15    0,25
chr1    99    129    two_blocks_1_2/1    40    +    99    129    255,0,0    1    30    0

0
Entering edit mode

Okay, thanks. I think my confusion is about how blocks are presented in BED12 format, as opposed to regular BED.

0
Entering edit mode
8.9 years ago

It looks like you're misunderstanding the SAM output. The PNEXT of 9825546 means the mate starts at that position (not that position - 1, as the SAM output is 1-based). You haven't shown the CIGAR string for read 2, so we can't answer why it's stopping at 9825592 (keep in mind that SAM is 1-based and BED is 0-based!). Also the 38 and 9 values are for read 1 (the one ending in /1), which goes from 9825622 to that +38+88+9-1, not for read 2.

0
Entering edit mode

But when I use the -split option, shouldn't an element with a CIGAR string of 38M88N9M get split into two elements of length 38 and length 9 or am I misunderstanding how the skip operation is handled in the BamTools::getBamBlocks() function?

In this function, it looks like a new BAM block is created once an M operation happens after an N operation, which means there should be two blocks (and two BED elements) for a read with this CIGAR string.