samtools bedcov vs. bedtools coverage
1
11
Entering edit mode
6.5 years ago
Dataman ▴ 350

Hi, currently, I am trying to calculate the coverage for my genes of interest (i.e. number of reads falling in or overlapping by the gene of interest) from DNA-seq data in a BAM file. Genes of interest are annotated in a BED file.

There are several answers on Biostar suggesting to use
bedtools coverage e.g., bedtools coverage -abam sample.bam -b exons.bed -counts. However, while checking the Samtools' manual, I stumbled upon samtools bedcov with the following description:

read depth per BED region. Reports read depth per genomic region, as specified in the supplied BED file.

which can be used like this: samtools bedcov gene.bed sample.bam.

I have tested these two and I am getting different results! For instance, for OR4F5gene I get the following results (using the same BAM file for both):

chr1    69091   70008   "OR4F5" 61 #from bedtools coverage


and

chr1    69091   70008   "OR4F5" 4714  #from  samtools bedcov


where 61 is very different than 4714!

Another observation is that the result from Samtools is instantaneous for the above test while Bedtools takes a lot of time to produce the result. The speed itself has led me to think of using samtools bedcov. However, I was wondering whether there is catch that I am overlooking! I tried to find more information about samtools bedcov but I was not able to find anything more than that one-liner description.

[EDIT]: This is how it looks like in IGV. It seems that Bedtools results is closer to reality.

DNA-seq coverage • 23k views
4
Entering edit mode

That's peculiar. My gut feeling says the samtools bedcov is reporting the total number of reads within a region. However testing samtools view BAM <region> | wc -l gives me different answers. I'm not familiar with bedtools coverage but the value of 67 seems close to high coverage libraries.

Another option using samtools that I know works. samtools depth will give you per base pair depth of coverage. You can take the median (or mean) coverage within a locus. If you're parsing regions that are large (>10kb) you can obtain the read counts samtools view BAM <region> | wc -l and then determine the coverage by this equation (READ COUNTS / BASES SPANED) * AVERAGE READ LENGTH

Using either samtools view or samtools depth you can also limit the reads to ones with mapping qualities greater than 10 (or whatever value you like). You can also limit the read length in samtools depth

0
Entering edit mode

Thanks for your reply! bedtools coverage does not provide option for filtering based on mapping quality. The value of 61 is achieved when one counts all mapped reads. If one uses your first approach and also filters reads with mapping score below 10, then the result would be 6. I have now noticed that bedtools multicov can also be used for the same purpose with support for filtering based on mapping quality. Also it only counts non duplicate reads.

1
Entering edit mode

How does the data looks in IGV in that region ? Can you post the exact commands ?

0
Entering edit mode

Let me check! I'll get back to you asap!

0
Entering edit mode

Now, I have edited the post with an image from the IGV.

9
Entering edit mode
6.5 years ago

The description that goes with samtools bedcov is wrong (I'll create an issue on github). It returns the sum of per-base coverage in each region. I'm not sure how that would be useful, but that's what it is.

Edit: Here's the issue to get the description fixed.

2
Entering edit mode

Thank you! I had a hunch if I did samtools depth and added up the coverages it would equal the output.

0
Entering edit mode

Yup, your hunch nailed it. I should note that samtools bedcov (like samtools depth) skips alignments marked as duplicates, QC fail, etc. I'm pretty sure bedtools doesn't do that.

1
Entering edit mode

Thank you for your response. It seems that bedtools multicov (see here) can take care of duplicates and QC-failed reads and also filtering based on mapping quality.