Question: Tools To Calculate Average Coverage For A Bam File?
42
gravatar for Biomed
6.2 years ago by
Biomed4.1k
Bethesda, MD, USA
Biomed4.1k wrote:

I would like to get the average coverage of all the captured bases in a bam file. What would be the best way to do this? What I am looking is a simple one number like 40X. Given that there may be millions of bases sequenced in a next gen study I would like to get the overall average coverage for all these bases as a single number.

Thanks

bam coverage sequencing • 97k views
ADD COMMENTlink modified 7 months ago by jokipokemon00010 • written 6.2 years ago by Biomed4.1k
2

People seem to be equally devided between GATK, Samtools and BedTools.

ADD REPLYlink written 6.2 years ago by Biomed4.1k
1

Why is it wrong to take the number of sequenced bases and divide by the genome size?

ADD REPLYlink written 4.6 years ago by Ketil3.8k
4

because you cannot align all the bases on the genome, some reads are optical duplicates, etc...

ADD REPLYlink written 4.6 years ago by Pierre Lindenbaum92k
1

Okay, contamination is a fair point, but even if some reads fail to match, that could still be a sign of genetic divergence, and should still count towards genome coverage. Do any of the other suggestions deal with duplicates implicitly?

ADD REPLYlink written 4.6 years ago by Ketil3.8k
28
gravatar for Casey Bergman
6.2 years ago by
Casey Bergman17k
Manchester, UK
Casey Bergman17k wrote:

Try the genomeCoverageBed tool in the BEDtools package, which takes a BAM or BED file as input and:

computes a histogram of feature coverage (e.g., aligned sequences) for a given genome. Optionally, by using the –d option, it will report the depth of coverage at each base on each chromosome in the genome file (-g).

The way we run this is to first make a tab-delimited genome.txt file with "contig_name contig_size" information for all contigs in your genome and then run genomeCoverageBed on a sorted BAM file:

$ genomeCoverageBed -ibam sortedBamFile.bam -g genome.txt > coverage.txt

This will give you a summary histogram of coverage across each contig and for the entire genome, from which you can obtain the modal or mean value (the single fold-coverage value you are looking for).

ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by Casey Bergman17k

Direct link: http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html

ADD REPLYlink written 20 months ago by student-t370

With your command, I have different value for the same contig. For instance:

jcf7180001864843    2   4   137928  2.90006e-05
jcf7180001864843    3   28  137928  0.000203004
jcf7180001864843    4   10  137928  7.25016e-05

What does it mean ?. How can I have one value for each contig.

ADD REPLYlink written 9 months ago by Picasa270
1

That's because it reports a histogram. From the documentation: The values in the output are:

  1. chromosome (or entire genome) [eg jcf7180001864843]
  2. depth of coverage from features in input file [eg. 2]
  3. number of bases on chromosome (or genome) with depth equal to column 2. [eg. 4]
  4. size of chromosome (or entire genome) in base pairs [eg. 137928]
  5. fraction of bases on chromosome (or entire genome) with depth equal to column 2. [2.90006e-05]

[all values taken from your first row]

If you want to get the average coverage: add up the product of bases per coverage [2* 4+3* 28+4* 10+...] and divide by the total number of bases [137928].

ADD REPLYlink modified 9 months ago • written 9 months ago by ronald.jaepel10

Is your BAM file sorted (as noted on the help doc page linked by @student-t above)?

ADD REPLYlink written 9 months ago by genomax226k
21
gravatar for Michael Dondrup
6.2 years ago by
Bergen, Norway
Michael Dondrup41k wrote:

And the mandatory R solution. The Rsamtools package can be used to read in the BAM file.

Here is a code example:

library(Rsamtools)
bamcoverage <- function (bamfile) {
  # read in the bam file
  bam <- scanBam(bamfile)[[1]] # the result comes in nested lists
  # filter reads without match position
  ind <- ! is.na(bam$pos)
  ## remove non-matches, they are not relevant to us
  bam <- lapply(bam, function(x) x[ind])
  ranges <- IRanges(start=bam$pos, width=bam$qwidth, names=make.names(bam$qname, unique=TRUE))
  ## names of the bam data frame:
  ## "qname"  "flag"   "rname"  "strand" "pos"    "qwidth"
  ## "mapq"   "cigar"  "mrnm"   "mpos"   "isize"  "seq"    "qual"
  ## construc: genomic ranges object containing all reads
  ranges <- GRanges(seqnames=Rle(bam$rname), ranges=ranges, strand=Rle(bam$strand), flag=bam$flag, readid=bam$rname )
  ## returns a coverage for each reference sequence (aka. chromosome) in the bam file
  return (mean(coverage(ranges)))      
}

And the output:

> bamcoverage("~/test.bam")
gi|225184640|emb|AL009126.3| 
                    15.35720

The good thing about this is you can do much more with the genomic ranges object than computing the coverage, the downside is, that the code is not that efficient.

ADD COMMENTlink written 6.2 years ago by Michael Dondrup41k

@Michael : What maximum size of input bam file that can be handled by this solution ? It slurps the whole bamfile in memory ??

ADD REPLYlink written 6.2 years ago by toni2.1k
1

Sorry for the late reply. Yes it does, and it is not very memory efficient. For a 850MB BAM file with approximately 17 million alignments the R process uses up to 15GB of memory during processing.

ADD REPLYlink written 3.1 years ago by Michael Dondrup41k
2

A reply four years after.....

ADD REPLYlink written 20 months ago by student-t370

@Micheal: Will it work for paired end BAM files?

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by bioinfo850
19
gravatar for russdurrett
4.1 years ago by
russdurrett190
russdurrett190 wrote:

if you have samtools and awk, you can keep it simple with this:

samtools depth  *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'

and with stdev:

samtools depth *bamfile*  |  awk '{sum+=$3; sumsq+=$3*$3} END { print "Average = ",sum/NR; print "Stdev = ",sqrt(sumsq/NR - (sum/NR)**2)}'
ADD COMMENTlink modified 4.0 years ago by Peter5.4k • written 4.1 years ago by russdurrett190
1

Not sure this works. It calculates coverage over regions but does not account for regions that were not covered with a read. Then it divides all the sum with the number of rows that were reported... but the positions without a read hit are not included. Basically - it gives a too high result.

ADD REPLYlink written 3.2 years ago by nbartonicek90
2

samtools depth -a

Using the '-a' flag outputs all positions (including zero depth), which should solve this problem.

ADD REPLYlink written 3 months ago by Ian4.7k
1

[FYI] Only available from v1.3, I couldn't find that parameter in my version

https://github.com/samtools/samtools/issues/374

ADD REPLYlink written 3 months ago by Sukhdeep Singh8.7k
1

That's true. To be fair, the denominator needs to be the size of genome where the BAM file was generated against, rather than the number of bases covered at least once (that's what NR is in this context). To get the genome size from the BAM file:

samtools view -H *bamfile* | grep -P '^@SQ' | cut -f 3 -d ':' | awk '{sum+=$1} END {print sum}'

 

ADD REPLYlink written 2.9 years ago by sung10

Please correct me if I am wrong, but none of these answers seem to work with nanopore data.

ADD REPLYlink written 17 months ago by biology12340

why would it not?

ADD REPLYlink written 17 months ago by Michael Dondrup41k

In the first command, its $3 for me

ADD REPLYlink written 4.1 years ago by Sukhdeep Singh8.7k

I agree, and edited Russ' answer accordingly to avoid confusing anyone else.

ADD REPLYlink written 4.0 years ago by Peter5.4k

samtools depth is the most straightforward way to do this

ADD REPLYlink written 4.0 years ago by Lee Katz2.8k
14
gravatar for Pierre Lindenbaum
6.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum92k wrote:

did you try the GATK "DepthOfCoverage" ? http://www.broadinstitute.org/gsa/wiki/index.php/Depth_of_Coverage_v3.0

or you can run 'samtools pileup' and calculate the mean value for the coverage column.

ADD COMMENTlink written 6.2 years ago by Pierre Lindenbaum92k
2

I have used the samtools method (which excludes bases with 0 coverage) and the BEDTools version (which includes bases with 0 coverage). The former will give a higher mean coverage. This will have an impact on genome where there are large areas that are not sequenced.

ADD REPLYlink written 6.1 years ago by Ian4.7k
1

Sorry I am new to bioinformatics. I have used 'samtools mpileup', but how can I get mean value for the depth column. Thanks.

ADD REPLYlink written 6 months ago by jjaimi92112910
1

I tried GATK DepthOfCoverage last July with no luck - complicated java.lang.RuntimeExceptions. I did find success with BEDtools however (see answer below/above)

ADD REPLYlink written 6.2 years ago by Casey Bergman17k
1

I use this method to calculate % of sites with at least a coverage of 20x in a list of sites from a bed. It provides that number directly as output.

ADD REPLYlink written 5.9 years ago by Wjeck480

That's the gte_20 value from the sample_cumulative_coverage_proportions file, correct?

ADD REPLYlink written 18 months ago by stianlagstad730

I was interested in this issue too, but I wanted to know the coverage for certain genes. after looking to this tool's wiki I've realized that I can feed it with a list of genes intervals (which of course I'll have to retrieve from UCSC or Ensembl) and I'll surely sort my problem out! look what an unexpected great piece of information I've just found! once again thanks a lot, Pierre.

ADD REPLYlink written 6.2 years ago by Jorge Amigo9.5k

Thanks Pierre. I haven't tried Depthof Coverage but I tried Samtools. Samtools would give you a base by base coverage, and I use it successfully like Jorge explained but I 'll look at GATK for sure.

ADD REPLYlink written 6.2 years ago by Biomed4.1k
13
gravatar for Michael.James.Clark
6.2 years ago by
Palo Alto
Michael.James.Clark480 wrote:

Here's a very simple perl script I wrote that does exactly this that you're free to use:

($num,$den)=(0,0);
while ($cov=<STDIN>) {
    $num=$num+$cov;
    $den++;
}
$cov=$num/$den;
print "Mean Coverage = $cov\n";

Just pipe samtools pileup to it like so:

/path/to/samtools pileup in.bam | awk '{print $4}' | perl mean_coverage.pl

It will also work easily for genomic regions. Just index the BAM file and use samtools view like so:

/path/to/samtools view -b in.bam <genomic region> | /path/to/samtools pileup - | awk '{print $4}' | perl mean_coverage.pl

Hope someone finds this useful.

ADD COMMENTlink modified 5.8 years ago • written 6.2 years ago by Michael.James.Clark480
4

One more comment on this pipeline... In my data, where I have relatively low coverage there are quite a few genomic positions not covered at all. These will not show up in the pileup command, so I guess that the mean coverage calculated this way will be the mean coverage of the covered positions. Am I right? Is there a way (in the "samtools pileup" command) to output all positions even if they are not covered by any read?

ADD REPLYlink written 5.9 years ago by Panos1.5k

looks promising. Thanks

ADD REPLYlink written 6.2 years ago by Biomed4.1k

I am certainly going to try this. Thanks for the answer.

ADD REPLYlink written 6.2 years ago by Biomed4.1k

Nice answer! Tried it and it's working! The only thing I'd like to mention is that I think that the BAM file needs to be sorted.

ADD REPLYlink written 5.9 years ago by Panos1.5k

That's correct, panos. This simple solution is only if every base is covered at least once because samtools pileup doesn't report bases with zero coverage. Thanks for pointing that out!

ADD REPLYlink written 5.8 years ago by Michael.James.Clark480

can you give me a detailed description about $cov=[HTML] you used ? Thanks!

ADD REPLYlink written 4.1 years ago by siyu100

Very useful thanks! Used it on my own dataset (sorry to bump an old thread but still very handy)

ADD REPLYlink written 3.2 years ago by Andrew Watson30

I made a BioStars account to upvote your contribution. I've been trying to find something this simple and straightforward for some time now.. Thanks!

ADD REPLYlink written 5 months ago by pvdam30
1

this is a very old answer. you would find then useful a simpler samtools depth based solution:

samtools depth [-b regions.bed] input.bam | awk '{c++;s+=$3}END{print s/c}'
ADD REPLYlink modified 5 months ago • written 5 months ago by Jorge Amigo9.5k
5
gravatar for Heikki
6.2 years ago by
Heikki350
Heikki350 wrote:

@michael.james.clark, thank you. I wrote for myself a script that I can add to when the need arises:

#!/usr/bin/env perl

use strict;
use warnings;

# Usage: samtools pileup file.bam | bam_coverage.pl

my $num;        # per residue coverage
my $len;        # sequence length counter 
my $min = 1000; # minimum coverage
my $max = 0;    # maximum coverage

while (<>) {
    my @a = split /\t/;
    $num += $a[3];
    $min = $a[3] if $min > $a[3];
    $max = $a[3] if $max < $a[3];
    $len++;
}

printf  "Mean coverage  : %.1f\n", $num/$len;
printf  "Coverage range : %d - %d\n", $min, $max;
ADD COMMENTlink written 6.2 years ago by Heikki350
1

Heiki thanks for the codes. It is working for me

ADD REPLYlink written 4.6 years ago by Kwame50

This doesn't work because mpileup does not report the bases with coverage of zero

ADD REPLYlink written 4.0 years ago by Raygozak1.1k
4
gravatar for Rm
5.8 years ago by
Rm7.5k
Danville, PA
Rm7.5k wrote:

use

samtools pileup accepted_hits.bam | awk '{ count++ ; SUM += $4 } END { print "Total: " SUM "\t" "Nucleotides: " count "\t" "Average_coverage: " SUM/count }'

(use $8 instead; if pileup with -cf option)

ADD COMMENTlink modified 4.6 years ago by Istvan Albert ♦♦ 70k • written 5.8 years ago by Rm7.5k

I used the above command to extract the average coverage on my aligned bam files and my statistics are: Total: 8011581243 Nucleotides: 510234050 Average_coverage: 15.7018

I want to know that the average_coverage is stating that on a distribution each of my base is read atleast 15 time right? What does the total signifies and the Nucleotides count: 510234050. Is this the total no of bases that are in the exonic region? then how will I calculate the number of reads from here that actually got mapped on in the exome region? Can we do this from the nucleotides count? I know for each reads consist of 100 bases so if I divide this 510234050 bases with 100 it gives me the reads but that will not give me the reads that actually got mapped in the exome region. How t extract that?

ADD REPLYlink written 3.4 years ago by vchris_ngs2.9k
3
gravatar for William
3.7 years ago by
William3.8k
Europe
William3.8k wrote:

How about just running qualimap in GUI mode or to create a QC PDF? Your can run it on the whole genome or on 1 or multiple regions using a bed file.

global coverage stats

coverage histogram

percentage of genome covered x times plot

http://qualimap.bioinfo.cipf.es/

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by William3.8k
3
gravatar for 5heikki
23 months ago by
5heikki6.3k
Finland
5heikki6.3k wrote:

This is for people who have a sam/bam file consisting of reads mapped to multiple contigs (samtools needs to be in $PATH):

bbmap/pileup.sh -h

Written by Brian Bushnell
Last modified May 19, 2015

Description:  Calculates per-scaffold coverage information from an unsorted sam file.

Usage:  pileup.sh in=<input> out=<output>

Input may be stdin or a SAM file, compressed or uncompressed.
Output may be stdout or a file.

Input Parameters:
in=<file>           The input sam file; this is the only required parameter.
ref=<file>          Scans a reference fasta for per-scaffold GC counts, not otherwise needed.
fastaorf=<file>     An optional fasta file with ORF header information in PRODIGAL's output format.  Must also specify 'outorf'.
unpigz=t            Decompress with pigz for faster decompression.

..
ADD COMMENTlink modified 23 months ago • written 23 months ago by 5heikki6.3k

does it work with a bam file ?

ADD REPLYlink written 5 months ago by Picasa270
1

Yes it does. Use the latest version.

$ pileup.sh
Written by Brian Bushnell
Last modified September 1, 2016
Description:  Calculates per-scaffold coverage information from an unsorted sam or bam file.
ADD REPLYlink modified 5 months ago • written 5 months ago by genomax226k
2
gravatar for Mdperry
6.2 years ago by
Mdperry40
Mdperry40 wrote:

BioPerl has a library named Bio-Samtools which provides a familiar Perl/BioPerl access to samtools. I heard the author, Lincoln Stein, give a talk on on how he wove the two together use XS. As I recall, one of his examples was exactly the question you posed

ADD COMMENTlink written 6.2 years ago by Mdperry40
0
gravatar for anuragkuleuven
3.7 years ago by
anuragkuleuven0 wrote:

Qualimap is very fast and excellent tool. thanks

ADD COMMENTlink written 3.7 years ago by anuragkuleuven0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1068 users visited in the last hour