Question: How does bedtools bamtobed calculate end position
0
gravatar for timing
4 months ago by
timing10
Singapore
timing10 wrote:

I downloaded toy.sam from https://raw.githubusercontent.com/samtools/samtools/develop/examples/toy.sam to check how bedtools calculate the end position. Here is what I have done:

I converted the toy.sam to bam file with samtools and then to bed file with bedtools (toy.bed). Then I used R (with tidyverse) to calculate a comparison table with the following scripts:

sam <- read_delim("toy.sam", delim = "\t", col_names = F, skip = 2) 
bed <- read_delim("toy.bed", delim = "\t", col_names = F)

sam <- sam %>% mutate(seq = X10, sam_read_len = str_length(seq))
bed <- bed %>% mutate(difference_in_positions = X3 -X2)
compare_tbl <- bind_cols(sam, bed) %>% select(X1, sam_read_len, difference_in_positions)

Here is the compare table I have obtained

X1  sam_read_len    difference_in_positions
r001    19  16
r002    17  10
r003    6   6
r004    12  25
r003    5   5
r001    9   9
x1  20  20
x2  21  21
x3  26  22
x4  25  25
x5  24  24
x6  23  23

Could someone explain why the end position - start position does not always equal to the length of the read sequence from sam file, please? I have always been thinking that bamtobed calculates end position by using sam's start position - 1 + read length (since sam is 1 based and bed is 0 based). This is the case for the last few alignments but not the first few.

bam bed bedtools • 235 views
ADD COMMENTlink modified 4 months ago by John Marshall1.7k • written 4 months ago by timing10
3
gravatar for John Marshall
4 months ago by
John Marshall1.7k
Glasgow, Scotland
John Marshall1.7k wrote:

Bamtobed is using the CIGAR string to measure each read length in terms of positions matched along the reference sequence.

For example, for the first read, r001:

r001    163     ref     7       30      8M4I4M1D3M      …etc…

The CIGAR string can be used to reconstruct the mapping of the read's bases to the reference bases:

                    11111    1111122222222223333333333444444
ref coord  12345678901234    5678901234567890123456789012345
  ref seq  AGCATGTTAGATAA----GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
     r001        TTAGATAAAGAGGATA-CTG

and from this we can see that the read spans 8+4+1+3 = 16 reference positions from 7–22 (in inclusive “human-style” reckoning), i.e., 6 22 in BED-style coordinates — which is what was produced by bamtobed:

ref   6   22   r001/2   30   +

This is what the CIGAR operator table on page 7 of the SAM spec means by “Consumes reference”.

The last few alignments in that file have uncomplicated CIGAR strings (e.g. 25M), hence the read length and the number of reference positions matched are the same.

ADD COMMENTlink written 4 months ago by John Marshall1.7k
2
gravatar for Pierre Lindenbaum
4 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

Could someone explain why the end position - start position does not always equal to the length of the read sequence

because the read-length is not enough: , you also have to look at the cigar string (eg: 8M4I4M1D3M ) : a read may be aligned to the reference with some insertions (length++) or deletions (length--).

ADD COMMENTlink modified 4 months ago • written 4 months ago by Pierre Lindenbaum123k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 549 users visited in the last hour