I downloaded toy.sam from https://raw.githubusercontent.com/samtools/samtools/develop/examples/toy.sam to check how bedtools calculate the end position. Here is what I have done:
I converted the toy.sam to bam file with samtools and then to bed file with bedtools (toy.bed). Then I used R (with tidyverse) to calculate a comparison table with the following scripts:
sam <- read_delim("toy.sam", delim = "\t", col_names = F, skip = 2) bed <- read_delim("toy.bed", delim = "\t", col_names = F) sam <- sam %>% mutate(seq = X10, sam_read_len = str_length(seq)) bed <- bed %>% mutate(difference_in_positions = X3 -X2) compare_tbl <- bind_cols(sam, bed) %>% select(X1, sam_read_len, difference_in_positions)
Here is the compare table I have obtained
X1 sam_read_len difference_in_positions r001 19 16 r002 17 10 r003 6 6 r004 12 25 r003 5 5 r001 9 9 x1 20 20 x2 21 21 x3 26 22 x4 25 25 x5 24 24 x6 23 23
Could someone explain why the end position - start position does not always equal to the length of the read sequence from sam file, please? I have always been thinking that bamtobed calculates end position by using sam's start position - 1 + read length (since sam is 1 based and bed is 0 based). This is the case for the last few alignments but not the first few.