Question

samtools depth for start positions

0

Entering edit mode

7.5 years ago

muraved ▴ 10

Hi,

I am trying to obtain the number of reads starting at each position from a bam file. I've used

samtools depth -aa ${bamfile} | cut -f 3  > ${pileupFile}

to create a depth of coverage per position, and confirmed that all reads have length 100 using

samtools view ${bamfile} | awk '{print length($10)}' |  sort -u

On that file I've used a simple Python implementation to obtain the number of starting reads of length 100 at each position (I'll use some faster language later, but right now I'm working on a toy example):

import numpy as np
...
pileup = np.loadtxt(infilename, dtype=int)
N = len(pileup)
readLen=100

# keep the first entry as is, and then calculate the difference pileup[i]-pileup[i-1]
pileup = np.ediff1d(pileup, to_begin=pileup[0]).astype(int) 

for i in xrange(N-readLen):
    pileup[i+readLen] += pileup[i]
    assert pileup[i]>=0
    assert pileup[i+readLen]>=0

It is easy to show that this yields the required number of starting reads at each position. This number cannot be negative, unless readLen is wrong or the pileup was not created by overlapping 100-mers. Strangely enough, I get tons of negative values. I am fairly confident that my Python script is correct.

My question is: does samtools do something fancy (other than capping at 8000, which is not the case in my case) when calculating read depth? I am using samtools 1.3.1-15-g21fe47e (using htslib 1.3.1-4-gdd40524).

EDIT: I am aware that I can easily achieve my goal using

samtools view ${bamfile} | cut -f 4 | uniq -c | awk '{print $1}' | cut  -f 1 -d ' '

I am more concerned about a possible bug in samtools (unless I made a stupid mistake in my Python code).

samtools depth of coverage • 3.1k views

ADD COMMENT • link updated 7.5 years ago by GouthamAtla 12k • written 7.5 years ago by muraved ▴ 10

0

Entering edit mode

It is easy to show that this yields the required number of starting reads at each position

is it ?

ADD REPLY • link 7.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Yes. Using ediff1d, pileup contains the number of changes in coverage. It is equivalent to the number of reads starting there minus the number of reads ending at i-1, i.e. it undercounts the number of start reads by the ones ending right before. In the loop, pileup[i] contains the number of starting positions at i. Adding that count to pileup[i+readLen] compensates the undercounting. This is the in-place version of an algorithm that stores the number of reads ending at a given position in a separate array.

ADD REPLY • link 7.5 years ago by muraved ▴ 10

score 0 · Answer 1 · 2016-10-15

0

Entering edit mode

7.5 years ago

GouthamAtla 12k

Its not clear, what is your goal here, but if you are looking for coverage of first base of reads ( Like a TSS data ):

 ~~ samtools view ${bamfile} | cut -f 4 | uniq -c | awk '{print $1}' | cut  -f 1 -d ' ' ~~

This is true only if the read mapped to forward strand.

Better way to get the coverage of first base of reads :

samtools view -f 64 -u -b in.bam |
        bamToBed -i stdin | 
        awk -v OFS="\t" '{ if ($6=="+" ) { print $1, $2 } else { print $1, $3} }' | 
        sort -k1,1 -k2,2n | 
        uniq -c | 
        awk  -v OFS="\t" '{ print $2,$3,$1}' > out.txt

In general, when we think of depth at first base, it usually applies to TSS data. Hence I used -f 64

ADD COMMENT • link 7.5 years ago by GouthamAtla 12k

0

Entering edit mode

This is true only if the read mapped to forward strand.

Good point! My goal is to make coverage statistically independent between adjacent positions (which they are not for pileup data, since mapping a read to position x adds counts to its neighbors). In case of the reverse strand, I need what would be the last position I guess. In other words, for any read I need the lowest-numbered position in the reference genome it covers, regardless of strand.

ADD REPLY • link 7.5 years ago by muraved ▴ 10