Genomic Coverage - Samtool's undocumented "depth" verses the poorly documented pileup.
3
3
Entering edit mode
7.4 years ago
John 13k

Hey all :]

I use samtools's depth, and occasionally samtool's pileup commands to calculate coverage of my reads to the genome before binning for coverage. I'm pretty sure everyone else does too ;)

One very common problem is that people find "samtools depth" and "samtools mpileup" don't match up - and it's commonly attributed to filtering poor quality reads, duplicates, etc (mpileup doing more filtering).

But there's a ton of other questions I just can't get the answers too from the sametools docs, namely:

Does depth/pileup count the region between a paired reads, or just the reads themselves?
Does samtools depth/pileup count singletons in paired end sequencing? Does it extrapolate based on average read length to fillout the whole fragment?
If a read maps to multiple locations, is it counted multiple times?

I'm working with ChIP-Seq data, so i might have to correct for peak-shift. What do you guys think?

Thanks! :)

samtools pileup mpileup depth coverage • 7.5k views
5
Entering edit mode
7.4 years ago

There comes a time in every bioinformatician's life when he/she needs to just read the source code. This is one of those times.

So, depth actually uses pileup internally, just to confuse things a bit more. Depth does not count regions spliced over (i.e., covered by an N operator) or between reads in a pair, which is good because what's actually going on there would be undefined. I don't think that pileup counts regions between reads, though I guess I'd have to look to be 100% certain.

Neither tool cares whether a read is a singleton or a part of a pair unless you tell them to filter according to that. Samtools is a pretty general program, so it won't do ChIP-seq specific things like extrapolating bounds based on average template length.

In general, you'll need to start going through the code to get answers to questions like these.

0
Entering edit mode

A million thanks Devon for setting the record straight! I'm REALLY surprised that it only counts the reads themselves for the pileup - we have always used this for ALL our ChIP-Seq coverage plots, and I guess it's totally wrong! :O I mean, it depends what you're looking for I suppose - but like you say, it's about time I started writing my own scripts so I really know what's going on. I also work a lot with DNAse-Seq data, and there the 5' read is the most important (as that's where the DNAse actually cuts). I think it's time I made a depth tool for my specific applications. Thank you very much for the advise and clarity :)

1
Entering edit mode

Did you fix this problem? I met this problem too. I find the depth of mpileup is far less than samtools depth? I check it again and again and didn't find the answer. And I find PileOMeth might invoke mpileup and count the methylation and non-methylation reads and then calculated the methylation ratio. I don't use C, neither C++. For these question, what I can do is email the authors. LoL

0
Entering edit mode

Well Devon is the author so you're half way there :P

0
Entering edit mode

As you likely noted in another thread, mpileup is doing some filtering that depth isn't doing. Regarding PileOMeth, it does filtering to, but its defaults are somewhat different from the standard "samtools mpileup", though they work the same way and use the same API. You can play with the -q and -p (these are identical to -q and -Q in samtools mpileup and samtools depth) options to tweak things to your liking, though I should warn that I chose those because I find them to be the most useful.

3
Entering edit mode
7.4 years ago

Does depth/pileup count the region between a paired reads, or just the reads themselves?

samtools depth 'just' runs a multi-mpileup and count the number of bases under each base of the reference. smal l'qlen', deletions and base quality below 'baseQ' are ignored. Here is the core code of 'depth':

if (!(b->core.flag&BAM_FUNMAP)) {
if ((int)b->core.qual < aux->min_mapQ) b->core.flag |= BAM_FUNMAP;
else if (aux->min_len && bam_cigar2qlen(&b->core, bam1_cigar(b)) < aux->min_len) b->core.flag |= BAM_FUNMAP;


(...)

while (bam_mplp_auto(mplp, &tid, &pos, n_plp, plp) > 0) {
if (pos < beg || pos >= end) continue; // out of range; skip
if (bed && bed_overlap(bed, h->target_name[tid], pos, pos + 1) == 0) continue;
fputs(h->target_name[tid], stdout); printf("\t%d", pos+1);
for (i = 0; i < n; ++i) { // base level filters have to go here
int j, m = 0;
for (j = 0; j < n_plp[i]; ++j) {
const bam_pileup1_t *p = plp[i] + j;
if (p->is_del || p->is_refskip) ++m;
else if (bam1_qual(p->b)[p->qpos] < baseQ) ++m; // low
}
printf("\t%d", n_plp[i] - m); // this the depth to output
}
}

0
Entering edit mode

Thank you Pierre for taking the time to find the bit of code responsible.

I must confess, I'm not used to reading C, and I have no idea what a FUNMAP is, hahah, but I can kind of make sense of it.

There should be a flag to switch this behaviour off IMHO: "if (bed && bed_overlap(bed, h->target_name[tid], pos, pos + 1) == 0) continue; " but on the whole it looks very easy to modify this code to do "full fragment" pileups with/without singleton extension. I'll give it a shot :] Thank you so much!

1
Entering edit mode

FUNMAP "FLAG UNMAPPED"

3
Entering edit mode
5.8 years ago
Shicheng Guo ★ 8.8k

OK. Now the question is totally clear. samtools mpileup conducted too much filtering, therefore, the depth from the mpileup usually far less than samtools depth

1, mapping qulity (samtools mpileup -q)

2, base qulity (samtools mpileup -Q)

After remove these filtering, the number of depth from these two method will be exactly same.