How does bedtools bamtobed calculate end position
2
0
Entering edit mode
5.0 years ago
timing ▴ 20

I downloaded toy.sam from https://raw.githubusercontent.com/samtools/samtools/develop/examples/toy.sam to check how bedtools calculate the end position. Here is what I have done:

I converted the toy.sam to bam file with samtools and then to bed file with bedtools (toy.bed). Then I used R (with tidyverse) to calculate a comparison table with the following scripts:

sam <- read_delim("toy.sam", delim = "\t", col_names = F, skip = 2) 
bed <- read_delim("toy.bed", delim = "\t", col_names = F)

sam <- sam %>% mutate(seq = X10, sam_read_len = str_length(seq))
bed <- bed %>% mutate(difference_in_positions = X3 -X2)
compare_tbl <- bind_cols(sam, bed) %>% select(X1, sam_read_len, difference_in_positions)

Here is the compare table I have obtained

X1  sam_read_len    difference_in_positions
r001    19  16
r002    17  10
r003    6   6
r004    12  25
r003    5   5
r001    9   9
x1  20  20
x2  21  21
x3  26  22
x4  25  25
x5  24  24
x6  23  23

Could someone explain why the end position - start position does not always equal to the length of the read sequence from sam file, please? I have always been thinking that bamtobed calculates end position by using sam's start position - 1 + read length (since sam is 1 based and bed is 0 based). This is the case for the last few alignments but not the first few.

bed bam bedtools • 2.1k views
ADD COMMENT
5
Entering edit mode
5.0 years ago

Bamtobed is using the CIGAR string to measure each read length in terms of positions matched along the reference sequence.

For example, for the first read, r001:

r001    163     ref     7       30      8M4I4M1D3M      …etc…

The CIGAR string can be used to reconstruct the mapping of the read's bases to the reference bases:

                    11111    1111122222222223333333333444444
ref coord  12345678901234    5678901234567890123456789012345
  ref seq  AGCATGTTAGATAA----GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
     r001        TTAGATAAAGAGGATA-CTG

and from this we can see that the read spans 8+4+1+3 = 16 reference positions from 7–22 (in inclusive “human-style” reckoning), i.e., 6 22 in BED-style coordinates — which is what was produced by bamtobed:

ref   6   22   r001/2   30   +

This is what the CIGAR operator table on page 7 of the SAM spec means by “Consumes reference”.

The last few alignments in that file have uncomplicated CIGAR strings (e.g. 25M), hence the read length and the number of reference positions matched are the same.

ADD COMMENT
2
Entering edit mode
5.0 years ago

Could someone explain why the end position - start position does not always equal to the length of the read sequence

because the read-length is not enough: , you also have to look at the cigar string (eg: 8M4I4M1D3M ) : a read may be aligned to the reference with some insertions (length++) or deletions (length--).

ADD COMMENT

Login before adding your answer.

Traffic: 1852 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6