Question: samtools depth for start positions
0
gravatar for muraved
3.2 years ago by
muraved10
United States
muraved10 wrote:

Hi,

I am trying to obtain the number of reads starting at each position from a bam file. I've used

samtools depth -aa ${bamfile} | cut -f 3  > ${pileupFile}

to create a depth of coverage per position, and confirmed that all reads have length 100 using

samtools view ${bamfile} | awk '{print length($10)}' |  sort -u

On that file I've used a simple Python implementation to obtain the number of starting reads of length 100 at each position (I'll use some faster language later, but right now I'm working on a toy example):

import numpy as np
...
pileup = np.loadtxt(infilename, dtype=int)
N = len(pileup)
readLen=100

# keep the first entry as is, and then calculate the difference pileup[i]-pileup[i-1]
pileup = np.ediff1d(pileup, to_begin=pileup[0]).astype(int) 

for i in xrange(N-readLen):
    pileup[i+readLen] += pileup[i]
    assert pileup[i]>=0
    assert pileup[i+readLen]>=0

It is easy to show that this yields the required number of starting reads at each position. This number cannot be negative, unless readLen is wrong or the pileup was not created by overlapping 100-mers. Strangely enough, I get tons of negative values. I am fairly confident that my Python script is correct.

My question is: does samtools do something fancy (other than capping at 8000, which is not the case in my case) when calculating read depth? I am using samtools 1.3.1-15-g21fe47e (using htslib 1.3.1-4-gdd40524).

EDIT: I am aware that I can easily achieve my goal using

samtools view ${bamfile} | cut -f 4 | uniq -c | awk '{print $1}' | cut  -f 1 -d ' '

I am more concerned about a possible bug in samtools (unless I made a stupid mistake in my Python code).

depth of coverage samtools • 1.8k views
ADD COMMENTlink modified 3.2 years ago by geek_y10k • written 3.2 years ago by muraved10

It is easy to show that this yields the required number of starting reads at each position

is it ?

ADD REPLYlink written 3.2 years ago by Pierre Lindenbaum124k

Yes. Using ediff1d, pileup contains the number of changes in coverage. It is equivalent to the number of reads starting there minus the number of reads ending at i-1, i.e. it undercounts the number of start reads by the ones ending right before. In the loop, pileup[i] contains the number of starting positions at i. Adding that count to pileup[i+readLen] compensates the undercounting. This is the in-place version of an algorithm that stores the number of reads ending at a given position in a separate array.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by muraved10
0
gravatar for geek_y
3.2 years ago by
geek_y10k
Barcelona
geek_y10k wrote:

Its not clear, what is your goal here, but if you are looking for coverage of first base of reads ( Like a TSS data ):

 ~~ samtools view ${bamfile} | cut -f 4 | uniq -c | awk '{print $1}' | cut  -f 1 -d ' ' ~~

This is true only if the read mapped to forward strand.

Better way to get the coverage of first base of reads :

samtools view -f 64 -u -b in.bam |
        bamToBed -i stdin | 
        awk -v OFS="\t" '{ if ($6=="+" ) { print $1, $2 } else { print $1, $3} }' | 
        sort -k1,1 -k2,2n | 
        uniq -c | 
        awk  -v OFS="\t" '{ print $2,$3,$1}' > out.txt

In general, when we think of depth at first base, it usually applies to TSS data. Hence I used -f 64

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by geek_y10k

This is true only if the read mapped to forward strand.

Good point! My goal is to make coverage statistically independent between adjacent positions (which they are not for pileup data, since mapping a read to position x adds counts to its neighbors). In case of the reverse strand, I need what would be the last position I guess. In other words, for any read I need the lowest-numbered position in the reference genome it covers, regardless of strand.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by muraved10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1084 users visited in the last hour