Question: What is the fastest method to determine the number of positions in a BAM file with >N coverage?
Malachi Griffith (Washington University School of Medicine, St. Louis, USA) wrote 7.0 years ago:

I have two very large BAM files (high depth, human, whole genome). I have a seemingly simple question. I want to know how many positions in each are covered by at least N reads (say 20). For now I am not concerned about requiring a minimum mapping quality for each alignment or a minimum read quality for the reads involved.

Things I have considered:

  • samtools mpileup (then piped to awk to assess the minimum depth requirement, then piped to wc -l). This seemed slow...
  • samtools depth (storing the output to disk so that I can assess coverage at different cutoffs later). Even if I divide the genome into ~133 evenly sized pieces, this seems very slow...
  • bedtools coverage?
  • bedtools genomecov?
  • bedtools multicov?
  • bamtools coverage?

Any idea which of these might be fastest for this question? Something else I haven't thought of? I can use parallel processes to ensure that the performance bottleneck is disk access but want that access to be as efficient as possible. It seems that some of these tools are doing more than I need for this particular task...
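For reference, the mpileup/depth approach from the first bullet boils down to a filter on the depth column. A minimal sketch, with the `samtools depth` output simulated via `printf` so the filter itself can be seen (the real invocation, shown in the comment, assumes an indexed BAM):

```shell
# Simulated `samtools depth` output: chrom, 1-based position, depth.
# The real run would be:
#   samtools depth -a sample.bam | awk -v N=20 '$3 >= N' | wc -l
printf 'chr1\t1\t25\nchr1\t2\t19\nchr1\t3\t30\n' \
  | awk -v N=20 '$3 >= N' | wc -l
```

Here two of the three positions meet the depth-20 cutoff, so the pipeline prints 2.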

modified 19 months ago by Paul • written 7.0 years ago by Malachi Griffith

Of the bedtools options, genomecov is the one you want (see the answer from Jordan). It is admittedly not the fastest - there are ways to speed it up, but we haven't yet done so. I'd be interested to know the run times you see as a function of the number of reads.

written 7.0 years ago by Aaron Quinlan

All the answers were helpful, but 'bedtools genomecov' was the method I wound up using in the end. 'samtools depth' has convenient output. I expect Pierre's solution is probably the fastest.

written 7.0 years ago by Malachi Griffith
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 7.0 years ago:

Use the basic pileup code from the samtools C API (the calDepth example distributed with samtools). In the pileup callback, the original condition only checks that the position falls inside the target region:

 if ((int)pos >= tmp->beg && (int)pos < tmp->end)

Add the minimum-depth requirement (n is the number of reads covering the position):

 if (n > 20 && (int)pos >= tmp->beg && (int)pos < tmp->end)

Compile; it should be something like:

gcc -O3 -o mytool mytool.c -I /path/to/samtools -L /path/to/samtools -lbam -lz

Edit: and you can run this code using one process per chromosome...
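The per-chromosome fan-out can be sketched with xargs. The chromosome list would normally come from the BAM index (comment below); here it is simulated with `printf`, and `mytool` is the hypothetical binary name from the compile line above:

```shell
# One counting process per chromosome, 3 at a time.
# The real chromosome list would come from the BAM index:
#   samtools idxstats sample.bam | cut -f1 | grep -v '^\*'
# Simulated here; `mytool` is the binary from the compile step above.
printf 'chr1\nchr2\nchr3\n' \
  | xargs -P 3 -I {} echo "./mytool sample.bam {} > {}.count"
```

In a real run you would replace the `echo` with the actual command so the per-chromosome counts land in separate files to be summed afterwards.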

modified 7.0 years ago • written 7.0 years ago by Pierre Lindenbaum
Jordan wrote 7.0 years ago:

You could try bedtools genomeCoverageBed. It takes a BAM file and a genome file as input and outputs a genome-wide coverage histogram.

Here is an example:

genomeCoverageBed -ibam file.bam -g human_g1k_v37.fasta > coverage.txt

The output file has 5 columns:

genome    0    1670110268    3101804739    0.538432
genome    1    696287280    3101804739    0.224478
genome    2    329358858    3101804739    0.106183
genome    3    144809487    3101804739    0.0466856
genome    4    66917607    3101804739    0.0215738
genome    5    32869034    3101804739    0.0105967
genome    6    17980502    3101804739    0.00579679
genome    7    10955385    3101804739    0.00353194
  • Column 1: genome/chromosome name
  • Column 2: depth
  • Column 3: number of bases in the BAM file with depth equal to Column 2
  • Column 4: genome length
  • Column 5: fraction of the human genome with depth equal to Column 2

So, if you need coverage >= 5, in this case it would be 6.26%: add 0.538432 + 0.224478 + 0.106183 + 0.0466856 + 0.0215738 = 0.9373524, and 1 - 0.9373524 gives about 0.0626.
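That subtraction can be done directly with awk over the genomecov histogram. The rows in the here-document below are copied from the example output above (column 5 is the fraction at each depth):

```shell
# Fraction of the genome at depth >= 5: 1 minus the summed depth 0-4 fractions.
awk '$1 == "genome" && $2 < 5 { s += $5 } END { printf "%.4f\n", 1 - s }' <<'EOF'
genome 0 1670110268 3101804739 0.538432
genome 1 696287280 3101804739 0.224478
genome 2 329358858 3101804739 0.106183
genome 3 144809487 3101804739 0.0466856
genome 4 66917607 3101804739 0.0215738
EOF
```

This prints 0.0626, i.e. the 6.26% figure from the text.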

I hope this is what you are looking for.

written 7.0 years ago by Jordan

This just takes 5-10 min depending on the system you have. Never exceeded 10 min for me.

written 7.0 years ago by Jordan
Zev Kronenberg (United States) wrote 7.0 years ago:

I pipe samtools depth to a C++ program I wrote that counts the number of positions at each depth X. From these counts I can quickly calculate the percent of the BAM covered at X or greater.

I also take missing sites into account: positions absent from the samtools depth output have depth 0. For example, if the output is:

chr1 1 10

chr1 34 5

then positions 2-33 have depth 0.
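The gap-accounting idea can be sketched in awk: any jump in the position column of `samtools depth` output (run without `-a`) means the skipped positions had depth 0. The two-line example above is fed in via `printf`:

```shell
# Count the zero-depth positions that samtools depth silently skips:
# a jump in the position column within one chromosome is a gap.
printf 'chr1\t1\t10\nchr1\t34\t5\n' \
  | awk '$1 == chrom && $2 > pos + 1 { zeros += $2 - pos - 1 }
         { chrom = $1; pos = $2 }
         END { print zeros }'
```

For the example input this prints 32, the number of uncovered positions 2 through 33.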

modified 7.0 years ago • written 7.0 years ago by Zev Kronenberg

Yeah, samtools depth is the method I was using. But for these massive BAM files, just running samtools depth seems to take a long time...

written 7.0 years ago by Malachi Griffith

I've switched over to using sambamba depth; it's much faster than samtools depth and is also multithreaded. It also gives a parsed, pileup-like output with coverage broken down by base type, and has additional options to calculate coverage in overlapping windows.

written 3.2 years ago by Vasisht

I don't see an equivalent to depth or bedcov in the sambamba docs

written 3.1 years ago by ariel.balter
drkennetz wrote 19 months ago:

To update this thread: sambamba depth is a wonderful tool that supports multithreading, which drastically increases speed while providing the same functionality as samtools depth.
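A sketch of the sambamba approach for the original question, counting positions at depth >= 20. The subcommand name and column layout below follow sambamba's documentation for `depth base` (verify against your version); the sambamba call itself is in a comment and its output is simulated with `printf`:

```shell
# Real run (multithreaded; requires sambamba):
#   sambamba depth base -t 8 sample.bam | awk -v N=20 'NR > 1 && $3 >= N' | wc -l
# Simulated output: a header line, then REF / POS / COV per covered position.
printf 'REF\tPOS\tCOV\nchr1\t100\t25\nchr1\t101\t12\nchr1\t102\t20\n' \
  | awk -v N=20 'NR > 1 && $3 >= N' | wc -l
```

The `NR > 1` skips the header row; two of the three simulated positions pass the cutoff.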

I would recommend everybody check out this package: Sambamba.

written 19 months ago by drkennetz
Paul (European Union) wrote 19 months ago:

For very large BAM files (like WGS) I tested Mosdepth. It is very fast and has a lot of options. Mosdepth can report the mean depth in 500-base windows genome-wide in under 9 minutes of user time with 3 threads.

Example of syntax:

mosdepth -n --fast-mode --by 500 sample.wgs $sample.wgs.cram
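For the >= N question specifically, mosdepth can also quantize depths into bins. The `--quantize` flag and the `20:inf` bin label below follow the mosdepth README (verify against your version); the mosdepth calls are in comments and the resulting quantized BED is simulated with `printf`:

```shell
# Real run - bins the genome into depth 0-19 vs >= 20 regions:
#   mosdepth -n -t 3 --quantize 0:20: sample sample.bam
#   zcat sample.quantized.bed.gz | awk '$4 == "20:inf" { s += $3 - $2 } END { print s }'
# Simulated quantized BED (chrom, start, end, depth bin):
printf 'chr1\t0\t100\t0:20\nchr1\t100\t250\t20:inf\n' \
  | awk '$4 == "20:inf" { s += $3 - $2 } END { print s }'
```

Summing end minus start over the `20:inf` rows gives the total number of bases at depth >= 20; here the single 150-base interval yields 150.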
written 19 months ago by Paul


Powered by Biostar version 2.3.0