Discrepancy In Samtools Mpileup/Depth And Bedtools Genomecoveragebed Counts
5
7
Entering edit mode
8.1 years ago
Surya Saha ▴ 260

I am getting different counts for the number of bases on reference covered by aligned reads using samtools depth/mpileup and BEDTools genomeCoverageBed commands. I am using samtools-0.1.19 and bedtools-2.17.0

samtools mpileup -ABQ0 -d10000000 -f ref.fas qry.bam > qry.mpileup
samtools depth -q0 -Q0 qry.bam > qry.depth

genomeCoverageBed -ibam qry.bam -g ref.genome -dz > qry.dz
wc -l qry.[dm]*
1026779 qry.depth
1027173 qry.dz
1026779 qry.mpileup


Any ideas? Thanks

samtools bedtools • 14k views
0
Entering edit mode

If there are any spliced (CIGAR ~ /N/) alignments in your BAM file, bedtools will count the bases in the spiced region by default. To prevent this, use the -split option.

0
Entering edit mode

Thanks for the prompt response! Unfortunately, using -split did not change the counts.

0
Entering edit mode

I think you should follow Heng's advice and look at discrepant positions with IGV or the like. I did this myself and found positions where bedtools reports coverage that is higher than samtools. When I look in IGV, the alignments support the bedtools counts, but that is an N=1 experiment. There may be cases where the opposite is true. I will try to look more closely at this soon.

0
Entering edit mode

Glad to hear that you have seen something similar. See my reply to Heng's comment below. I have seen this in about 80% of my mapping runs this week using tmap, bwa mem, bowtie2, bwasw and bwa aln. Interestingly, the counts were same for only bwa aln produced SAM files.

0
Entering edit mode

You can extract sites unique to bedtools and check the alignment in a BAM viewer.

0
Entering edit mode

I did check the counts in Tablet/Bamview and it looks like samtools depth is ignoring some sites and under-counting others. Do I need to tweak the parameters? Thanks

4
Entering edit mode
7.5 years ago
plafdesigns ▴ 40

This happened to me too, but after playing around with both I realised that it's because samtools mpileup counted in properly mapped only, while bedtools used all the mappings.

Solution: filter the .bam file first using samtools view -b -f 2 to get only mappings with properly mapped reads, then pass that output to bedtools. You will get exactly the same results for both of the methods.

1
Entering edit mode

The discrepancy in my case seems to be due to secondary alignments. samtools view -b -F 0x100 whatever.bam | bedtools coverage -abam stdin ... agrees with samtools depth. I want to count those secondary alignements though, so samtools view | bedtools coverage seems to be the most proper way to go.

ETA: samtools view -u ... | bedtools coverage ... appears to work fine too, despite what the docs implies

1
Entering edit mode
8.1 years ago
nlomioni ▴ 30

A guess: I know that samtools outputs only positions with depth > 0. Might it be that genomeCoverageBed instead outputs a line for every position in the reference? If your reference is 1,027,173 bases long then this is likely.

0
Entering edit mode
1
Entering edit mode
8.1 years ago

There might be a couple of things going on here. The first is that I think you are using an outdated version of BEDTools - I can't find genomeCoverageBed anywhere in the current git repository, and I think it has been superseded by coverage. In any case, you should take a look at the -split option in the documentation for coverage as I suspect that your reads have insertions and deletions, and that BEDTools is not correctly splitting the CIGAR string of these reads to produce coverage in the way that samtools does. This could result in a different number of bases in your reads "covering" the bases in the alignment, and a different number of lines in your output.

Edit: Looks like lh3 has the answer.

0
Entering edit mode

0
Entering edit mode

That's right, the genomeCoverageBed command is just a shell script that is built when you install bedtools. It merely calls the new tool "bedtools genomecov".

1
Entering edit mode
8.1 years ago

Based on my comments above, I looked more closely into this. I ran the exact commands you used on a test file I have. It is never the case that the samtools depth is reported to be higher than bedtools. When discrepant, bedtools is always higher.

Here is an example. In this case, bedtools reports a coverage of 5, whereas samtools reports coverage of 1. Every other case I check out has the same pattern. I actually doubt there is a bug in the samtools logic for counting depth. I suspect that instead, the parameters you are using aren't behaving exactly the way you expect, and as such, some alignments are just being filtered out before being counted.

1
Entering edit mode

Samtools skips secondary alignments and aberrant read pairs. It is possible to change the default in the code, but not possible at command line. Mpileup has -A, but it turns out to be useless (a bug). If my guess is correct, 4 of reads in your example do not have bit 2 flagged.

2
Entering edit mode

Yep, in that example 4 reads are duplicates. bedtools just counts whatever you have in your BAM file so long as it is mapped. Filtering duplicates, discordants, etc. is expected to be done ahead of time with samtools, Picard, or bamtools. I think this solves the discrepancy.

0
Entering edit mode

I have found exactly the same after exploring a few cases by eye. The coverage count was less for mpileup/depth even when the base count was equal to genomeCoverageBed. I will go with the genomeCoverageBed counts. Thank you for looking into it.

0
Entering edit mode
7.9 years ago
volavii • 0

hi:)

i have a question regarding the depth calculated by samtools, when there is a deletion in the read.

scaffold79    978    N    5    g\$gggg    HGHGH
scaffold79    979    N    4    aaaa    HHGH
scaffold79    980    N    4    g-1ng-1ng-1ng-1n    HH?H
scaffold79    981    N    4    ****    HHEH
scaffold79    982    N    4    tttt    HHEH
scaffold79    983    N    4    tttt    HHGH
scaffold79    984    N    4    tttt    HHGD


at position 980 it is indicated, that here is a deletion in 4 reads at the following position. So these 4 reads do not really cover the next position 981. Should i check for these positions and correct the depth to gain the real coverage, or is this the accepted way to calculate the coverage?

0
Entering edit mode

You should start a new question for future queries like this, but the answer is "it depends". If you're looking for base-by-base depth, then yes, the coverage at that position would be 1. If you're calculating genomic coverage for sequencing metrics, you wouldn't penalize that position for having a deletion that was adequately covered by reads.