Discrepancy In Samtools Mpileup/Depth And Bedtools Genomecoveragebed Counts
5
7
Entering edit mode
8.1 years ago
Surya Saha ▴ 260

I am getting different counts for the number of bases on reference covered by aligned reads using samtools depth/mpileup and BEDTools genomeCoverageBed commands. I am using samtools-0.1.19 and bedtools-2.17.0

samtools mpileup -ABQ0 -d10000000 -f ref.fas qry.bam > qry.mpileup
samtools depth -q0 -Q0 qry.bam > qry.depth

genomeCoverageBed -ibam qry.bam -g ref.genome -dz > qry.dz
wc -l qry.[dm]*
  1026779 qry.depth
  1027173 qry.dz
  1026779 qry.mpileup

Any ideas? Thanks

samtools bedtools • 14k views
ADD COMMENT
0
Entering edit mode

If there are any spliced (CIGAR ~ /N/) alignments in your BAM file, bedtools will count the bases in the spiced region by default. To prevent this, use the -split option.

ADD REPLY
0
Entering edit mode

Thanks for the prompt response! Unfortunately, using -split did not change the counts.

ADD REPLY
0
Entering edit mode

I think you should follow Heng's advice and look at discrepant positions with IGV or the like. I did this myself and found positions where bedtools reports coverage that is higher than samtools. When I look in IGV, the alignments support the bedtools counts, but that is an N=1 experiment. There may be cases where the opposite is true. I will try to look more closely at this soon.

ADD REPLY
0
Entering edit mode

Glad to hear that you have seen something similar. See my reply to Heng's comment below. I have seen this in about 80% of my mapping runs this week using tmap, bwa mem, bowtie2, bwasw and bwa aln. Interestingly, the counts were same for only bwa aln produced SAM files.

ADD REPLY
0
Entering edit mode

You can extract sites unique to bedtools and check the alignment in a BAM viewer.

ADD REPLY
0
Entering edit mode

I did check the counts in Tablet/Bamview and it looks like samtools depth is ignoring some sites and under-counting others. Do I need to tweak the parameters? Thanks

ADD REPLY
4
Entering edit mode
7.5 years ago
plafdesigns ▴ 40

This happened to me too, but after playing around with both I realised that it's because samtools mpileup counted in properly mapped only, while bedtools used all the mappings.

Solution: filter the .bam file first using samtools view -b -f 2 to get only mappings with properly mapped reads, then pass that output to bedtools. You will get exactly the same results for both of the methods.

ADD COMMENT
1
Entering edit mode

The discrepancy in my case seems to be due to secondary alignments. samtools view -b -F 0x100 whatever.bam | bedtools coverage -abam stdin ... agrees with samtools depth. I want to count those secondary alignements though, so samtools view | bedtools coverage seems to be the most proper way to go.

ETA: samtools view -u ... | bedtools coverage ... appears to work fine too, despite what the docs implies

ADD REPLY
1
Entering edit mode
8.1 years ago
nlomioni ▴ 30

A guess: I know that samtools outputs only positions with depth > 0. Might it be that genomeCoverageBed instead outputs a line for every position in the reference? If your reference is 1,027,173 bases long then this is likely.

ADD COMMENT
0
Entering edit mode
ADD REPLY
1
Entering edit mode
8.1 years ago

There might be a couple of things going on here. The first is that I think you are using an outdated version of BEDTools - I can't find genomeCoverageBed anywhere in the current git repository, and I think it has been superseded by coverage. In any case, you should take a look at the -split option in the documentation for coverage as I suspect that your reads have insertions and deletions, and that BEDTools is not correctly splitting the CIGAR string of these reads to produce coverage in the way that samtools does. This could result in a different number of bases in your reads "covering" the bases in the alignment, and a different number of lines in your output.

Edit: Looks like lh3 has the answer.

ADD COMMENT
0
Entering edit mode

Thanks for the reply but genomeCoverageBed is there as an alias for genomecov. And this is the latest version 2.17.0. See http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html

ADD REPLY
0
Entering edit mode

That's right, the genomeCoverageBed command is just a shell script that is built when you install bedtools. It merely calls the new tool "bedtools genomecov".

ADD REPLY
1
Entering edit mode
8.1 years ago

Based on my comments above, I looked more closely into this. I ran the exact commands you used on a test file I have. It is never the case that the samtools depth is reported to be higher than bedtools. When discrepant, bedtools is always higher.

Here is an example. In this case, bedtools reports a coverage of 5, whereas samtools reports coverage of 1. Every other case I check out has the same pattern. I actually doubt there is a bug in the samtools logic for counting depth. I suspect that instead, the parameters you are using aren't behaving exactly the way you expect, and as such, some alignments are just being filtered out before being counted.

ADD COMMENT
1
Entering edit mode

Samtools skips secondary alignments and aberrant read pairs. It is possible to change the default in the code, but not possible at command line. Mpileup has -A, but it turns out to be useless (a bug). If my guess is correct, 4 of reads in your example do not have bit 2 flagged.

ADD REPLY
2
Entering edit mode

Yep, in that example 4 reads are duplicates. bedtools just counts whatever you have in your BAM file so long as it is mapped. Filtering duplicates, discordants, etc. is expected to be done ahead of time with samtools, Picard, or bamtools. I think this solves the discrepancy.

ADD REPLY
0
Entering edit mode

I have found exactly the same after exploring a few cases by eye. The coverage count was less for mpileup/depth even when the base count was equal to genomeCoverageBed. I will go with the genomeCoverageBed counts. Thank you for looking into it.

ADD REPLY
0
Entering edit mode
7.9 years ago
volavii • 0

hi:)

i have a question regarding the depth calculated by samtools, when there is a deletion in the read.

scaffold79    978    N    5    g$gggg    HGHGH
scaffold79    979    N    4    aaaa    HHGH
scaffold79    980    N    4    g-1ng-1ng-1ng-1n    HH?H
scaffold79    981    N    4    ****    HHEH
scaffold79    982    N    4    tttt    HHEH
scaffold79    983    N    4    tttt    HHGH
scaffold79    984    N    4    tttt    HHGD

at position 980 it is indicated, that here is a deletion in 4 reads at the following position. So these 4 reads do not really cover the next position 981. Should i check for these positions and correct the depth to gain the real coverage, or is this the accepted way to calculate the coverage?

Many thanks in advance!:)

ADD COMMENT
0
Entering edit mode

You should start a new question for future queries like this, but the answer is "it depends". If you're looking for base-by-base depth, then yes, the coverage at that position would be 1. If you're calculating genomic coverage for sequencing metrics, you wouldn't penalize that position for having a deletion that was adequately covered by reads.

ADD REPLY

Login before adding your answer.

Traffic: 2368 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6