Question

How does bedtools bamtobed calculate end position

0

Entering edit mode

5.0 years ago

timing ▴ 20

I downloaded toy.sam from https://raw.githubusercontent.com/samtools/samtools/develop/examples/toy.sam to check how bedtools calculate the end position. Here is what I have done:

I converted the toy.sam to bam file with samtools and then to bed file with bedtools (toy.bed). Then I used R (with tidyverse) to calculate a comparison table with the following scripts:

sam <- read_delim("toy.sam", delim = "\t", col_names = F, skip = 2) 
bed <- read_delim("toy.bed", delim = "\t", col_names = F)

sam <- sam %>% mutate(seq = X10, sam_read_len = str_length(seq))
bed <- bed %>% mutate(difference_in_positions = X3 -X2)
compare_tbl <- bind_cols(sam, bed) %>% select(X1, sam_read_len, difference_in_positions)

Here is the compare table I have obtained

X1  sam_read_len    difference_in_positions
r001    19  16
r002    17  10
r003    6   6
r004    12  25
r003    5   5
r001    9   9
x1  20  20
x2  21  21
x3  26  22
x4  25  25
x5  24  24
x6  23  23

Could someone explain why the end position - start position does not always equal to the length of the read sequence from sam file, please? I have always been thinking that bamtobed calculates end position by using sam's start position - 1 + read length (since sam is 1 based and bed is 0 based). This is the case for the last few alignments but not the first few.

bed bam bedtools • 2.1k views

ADD COMMENT • link updated 5.0 years ago by John Marshall 3.0k • written 5.0 years ago by timing ▴ 20

2

Entering edit mode

5.0 years ago

Pierre Lindenbaum 161k

Could someone explain why the end position - start position does not always equal to the length of the read sequence

because the read-length is not enough: , you also have to look at the cigar string (eg: 8M4I4M1D3M ) : a read may be aligned to the reference with some insertions (length++) or deletions (length--).

ADD COMMENT • link 5.0 years ago by Pierre Lindenbaum 161k

score 5 · Accepted Answer · 2019-05-01

Bamtobed is using the CIGAR string to measure each read length in terms of positions matched along the reference sequence.

For example, for the first read, r001:

r001    163     ref     7       30      8M4I4M1D3M      …etc…

The CIGAR string can be used to reconstruct the mapping of the read's bases to the reference bases:

                    11111    1111122222222223333333333444444
ref coord  12345678901234    5678901234567890123456789012345
  ref seq  AGCATGTTAGATAA----GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
     r001        TTAGATAAAGAGGATA-CTG

and from this we can see that the read spans 8+4+1+3 = 16 reference positions from 7–22 (in inclusive “human-style” reckoning), i.e., 6 22 in BED-style coordinates — which is what was produced by bamtobed:

ref   6   22   r001/2   30   +

This is what the CIGAR operator table on page 7 of the SAM spec means by “Consumes reference”.

The last few alignments in that file have uncomplicated CIGAR strings (e.g. 25M), hence the read length and the number of reference positions matched are the same.