Question: How does bedtools bamtobed calculate end position
0
gravatar for 6timings
25 days ago by
6timings0
6timings0 wrote:

I downloaded toy.sam from https://raw.githubusercontent.com/samtools/samtools/develop/examples/toy.sam to check how bedtools calculate the end position. Here is what I have done:

I converted the toy.sam to bam file with samtools and then to bed file with bedtools (toy.bed). Then I used R (with tidyverse) to calculate a comparison table with the following scripts:

sam <- read_delim("toy.sam", delim = "\t", col_names = F, skip = 2) 
bed <- read_delim("toy.bed", delim = "\t", col_names = F)

sam <- sam %>% mutate(seq = X10, sam_read_len = str_length(seq))
bed <- bed %>% mutate(difference_in_positions = X3 -X2)
compare_tbl <- bind_cols(sam, bed) %>% select(X1, sam_read_len, difference_in_positions)

Here is the compare table I have obtained

X1  sam_read_len    difference_in_positions
r001    19  16
r002    17  10
r003    6   6
r004    12  25
r003    5   5
r001    9   9
x1  20  20
x2  21  21
x3  26  22
x4  25  25
x5  24  24
x6  23  23

Could someone explain why the end position - start position does not always equal to the length of the read sequence from sam file, please? I have always been thinking that bamtobed calculates end position by using sam's start position - 1 + read length (since sam is 1 based and bed is 0 based). This is the case for the last few alignments but not the first few.

bam bed bedtools • 111 views
ADD COMMENTlink modified 25 days ago by John Marshall1.5k • written 25 days ago by 6timings0
3
gravatar for John Marshall
25 days ago by
John Marshall1.5k
Glasgow, Scotland
John Marshall1.5k wrote:

Bamtobed is using the CIGAR string to measure each read length in terms of positions matched along the reference sequence.

For example, for the first read, r001:

r001    163     ref     7       30      8M4I4M1D3M      …etc…

The CIGAR string can be used to reconstruct the mapping of the read's bases to the reference bases:

                    11111    1111122222222223333333333444444
ref coord  12345678901234    5678901234567890123456789012345
  ref seq  AGCATGTTAGATAA----GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
     r001        TTAGATAAAGAGGATA-CTG

and from this we can see that the read spans 8+4+1+3 = 16 reference positions from 7–22 (in inclusive “human-style” reckoning), i.e., 6 22 in BED-style coordinates — which is what was produced by bamtobed:

ref   6   22   r001/2   30   +

This is what the CIGAR operator table on page 7 of the SAM spec means by “Consumes reference”.

The last few alignments in that file have uncomplicated CIGAR strings (e.g. 25M), hence the read length and the number of reference positions matched are the same.

ADD COMMENTlink written 25 days ago by John Marshall1.5k
2
gravatar for Pierre Lindenbaum
25 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

Could someone explain why the end position - start position does not always equal to the length of the read sequence

because the read-length is not enough: , you also have to look at the cigar string (eg: 8M4I4M1D3M ) : a read may be aligned to the reference with some insertions (length++) or deletions (length--).

ADD COMMENTlink modified 25 days ago • written 25 days ago by Pierre Lindenbaum120k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1094 users visited in the last hour