Question: Tools To Calculate Average Coverage For A Bam File?
52
gravatar for Biomed
6.9 years ago by
Biomed4.3k
Bethesda, MD, USA
Biomed4.3k wrote:

I would like to get the average coverage of all the captured bases in a bam file. What would be the best way to do this? What I am looking is a simple one number like 40X. Given that there may be millions of bases sequenced in a next gen study I would like to get the overall average coverage for all these bases as a single number.

Thanks

bam coverage sequencing • 111k views
ADD COMMENTlink modified 8 weeks ago by bioinfo890 • written 6.9 years ago by Biomed4.3k
2

People seem to be equally devided between GATK, Samtools and BedTools.

ADD REPLYlink written 6.9 years ago by Biomed4.3k
1

Why is it wrong to take the number of sequenced bases and divide by the genome size?

ADD REPLYlink written 5.2 years ago by Ketil3.8k
4

because you cannot align all the bases on the genome, some reads are optical duplicates, etc...

ADD REPLYlink written 5.2 years ago by Pierre Lindenbaum102k
1

Okay, contamination is a fair point, but even if some reads fail to match, that could still be a sign of genetic divergence, and should still count towards genome coverage. Do any of the other suggestions deal with duplicates implicitly?

ADD REPLYlink written 5.2 years ago by Ketil3.8k
1

Most answers seems to be very old and hence would like to have updated suggestions.

I have 6 bam files and I have used samtools depth to calculate chromosome wise depth for all chromosomes and then plotted in R. While looking at the plots at 2-3 places, depth shows upto 200-3500 and hence I would like to calculate average read depth of each chromosome from each bam file. For e.g.: 1) average depth of chr1 from bam1, average depth of chr1 from bam2, ...until bam6. 2) average depth of chr1 from all 6 bam together.

Kindly guide.

Thanks.

ADD REPLYlink modified 5 months ago • written 5 months ago by bioinfo890
1

Try the solutions if they don't work then you need to reframe the question with your commands and write a new query in a new post. This is advised. And yes paired end should work.

ADD REPLYlink written 5 months ago by vchris_ngs4.2k

in addition, if you have already plotted depths it means that you have got a temporal R data.frame (or similar type) in there containing chromosome, position and depth information. splitting such data.frame by chromosome and calculating average values should be straight-forward. I would just suggest to take into account empty values not reported by samtools depth dividing each chromosome depths' sum by the size of the chromosome, and not by the number of depth values.

ADD REPLYlink written 5 months ago by Jorge Amigo10.0k

Hi,

I have multiple paired-end bam files from RNA-Seq data, already aligned and computed depth with samtools depth > bam.depth. Would any one please guide for any straight forward/latest way to compute coverage plots, avg. coverage plots and other metrics from depth files?

Picard CollectRNAseqmetrics, RSeQC and QoRTs have not been helpful. :-(

Thanks.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by bioinfo890

Have you run any of the other tools mentioned in this thread (e.g. Qualimap, pileup from BBMap)?

ADD REPLYlink written 8 weeks ago by genomax39k

@genomax My bam files are very big and Qualimap has not been helpful, so as bbmap which is Java-based (memory issues like Picard and QoRTs) and hence I used samtools depth. My depth files are also in GB.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by bioinfo890

Both programs will happily accept larger allocations of RAM and can open very large files. If you don't have access to a machine with more RAM then that is a different problem.

ADD REPLYlink written 8 weeks ago by genomax39k
31
gravatar for Casey Bergman
6.9 years ago by
Casey Bergman17k
Athens, GA, USA
Casey Bergman17k wrote:

Try the genomeCoverageBed tool in the BEDtools package, which takes a BAM or BED file as input and:

computes a histogram of feature coverage (e.g., aligned sequences) for a given genome. Optionally, by using the –d option, it will report the depth of coverage at each base on each chromosome in the genome file (-g).

The way we run this is to first make a tab-delimited genome.txt file with "contig_name contig_size" information for all contigs in your genome and then run genomeCoverageBed on a sorted BAM file:

$ genomeCoverageBed -ibam sortedBamFile.bam -g genome.txt > coverage.txt

This will give you a summary histogram of coverage across each contig and for the entire genome, from which you can obtain the modal or mean value (the single fold-coverage value you are looking for).

ADD COMMENTlink modified 6.9 years ago • written 6.9 years ago by Casey Bergman17k

Direct link: http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html

ADD REPLYlink written 2.3 years ago by student-t410

With your command, I have different value for the same contig. For instance:

jcf7180001864843    2   4   137928  2.90006e-05
jcf7180001864843    3   28  137928  0.000203004
jcf7180001864843    4   10  137928  7.25016e-05

What does it mean ?. How can I have one value for each contig.

ADD REPLYlink written 17 months ago by Picasa320
1

That's because it reports a histogram. From the documentation: The values in the output are:

  1. chromosome (or entire genome) [eg jcf7180001864843]
  2. depth of coverage from features in input file [eg. 2]
  3. number of bases on chromosome (or genome) with depth equal to column 2. [eg. 4]
  4. size of chromosome (or entire genome) in base pairs [eg. 137928]
  5. fraction of bases on chromosome (or entire genome) with depth equal to column 2. [2.90006e-05]

[all values taken from your first row]

If you want to get the average coverage: add up the product of bases per coverage [2* 4+3* 28+4* 10+...] and divide by the total number of bases [137928].

ADD REPLYlink modified 16 months ago • written 16 months ago by ronald.jaepel10

Is your BAM file sorted (as noted on the help doc page linked by @student-t above)?

ADD REPLYlink written 17 months ago by genomax39k
25
gravatar for russdurrett
4.7 years ago by
russdurrett250
russdurrett250 wrote:

if you have samtools and awk, you can keep it simple with this:

samtools depth  *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'

and with stdev:

samtools depth *bamfile*  |  awk '{sum+=$3; sumsq+=$3*$3} END { print "Average = ",sum/NR; print "Stdev = ",sqrt(sumsq/NR - (sum/NR)**2)}'
ADD COMMENTlink modified 4.7 years ago by Peter5.6k • written 4.7 years ago by russdurrett250
1

Not sure this works. It calculates coverage over regions but does not account for regions that were not covered with a read. Then it divides all the sum with the number of rows that were reported... but the positions without a read hit are not included. Basically - it gives a too high result.

ADD REPLYlink written 3.9 years ago by nbartonicek100
3

samtools depth -a

Using the '-a' flag outputs all positions (including zero depth), which should solve this problem.

ADD REPLYlink written 11 months ago by Ian5.0k
2

[FYI] Only available from v1.3, I couldn't find that parameter in my version

https://github.com/samtools/samtools/issues/374

ADD REPLYlink written 11 months ago by Sukhdeep Singh9.2k
2

That's true. To be fair, the denominator needs to be the size of genome where the BAM file was generated against, rather than the number of bases covered at least once (that's what NR is in this context). To get the genome size from the BAM file:

samtools view -H *bamfile* | grep -P '^@SQ' | cut -f 3 -d ':' | awk '{sum+=$1} END {print sum}'

 

ADD REPLYlink written 3.6 years ago by sung20

Please correct me if I am wrong, but none of these answers seem to work with nanopore data.

ADD REPLYlink written 2.1 years ago by biology12340

why would it not?

ADD REPLYlink written 2.1 years ago by Michael Dondrup43k

In the first command, its $3 for me

ADD REPLYlink written 4.7 years ago by Sukhdeep Singh9.2k

I agree, and edited Russ' answer accordingly to avoid confusing anyone else.

ADD REPLYlink written 4.7 years ago by Peter5.6k

samtools depth is the most straightforward way to do this

ADD REPLYlink written 4.6 years ago by Lee Katz2.8k
21
gravatar for Michael Dondrup
6.9 years ago by
Bergen, Norway
Michael Dondrup43k wrote:

And the mandatory R solution. The Rsamtools package can be used to read in the BAM file.

Here is a code example:

library(Rsamtools)
bamcoverage <- function (bamfile) {
  # read in the bam file
  bam <- scanBam(bamfile)[[1]] # the result comes in nested lists
  # filter reads without match position
  ind <- ! is.na(bam$pos)
  ## remove non-matches, they are not relevant to us
  bam <- lapply(bam, function(x) x[ind])
  ranges <- IRanges(start=bam$pos, width=bam$qwidth, names=make.names(bam$qname, unique=TRUE))
  ## names of the bam data frame:
  ## "qname"  "flag"   "rname"  "strand" "pos"    "qwidth"
  ## "mapq"   "cigar"  "mrnm"   "mpos"   "isize"  "seq"    "qual"
  ## construc: genomic ranges object containing all reads
  ranges <- GRanges(seqnames=Rle(bam$rname), ranges=ranges, strand=Rle(bam$strand), flag=bam$flag, readid=bam$rname )
  ## returns a coverage for each reference sequence (aka. chromosome) in the bam file
  return (mean(coverage(ranges)))      
}

And the output:

> bamcoverage("~/test.bam")
gi|225184640|emb|AL009126.3| 
                    15.35720

The good thing about this is you can do much more with the genomic ranges object than computing the coverage, the downside is, that the code is not that efficient.

ADD COMMENTlink written 6.9 years ago by Michael Dondrup43k

@Michael : What maximum size of input bam file that can be handled by this solution ? It slurps the whole bamfile in memory ??

ADD REPLYlink written 6.8 years ago by toni2.1k
1

Sorry for the late reply. Yes it does, and it is not very memory efficient. For a 850MB BAM file with approximately 17 million alignments the R process uses up to 15GB of memory during processing.

ADD REPLYlink written 3.8 years ago by Michael Dondrup43k
3

A reply four years after.....

ADD REPLYlink written 2.3 years ago by student-t410

@Micheal: Will it work for paired end BAM files?

ADD REPLYlink modified 10 months ago • written 10 months ago by bioinfo890
14
gravatar for Pierre Lindenbaum
6.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum102k wrote:

did you try the GATK "DepthOfCoverage" ? http://www.broadinstitute.org/gsa/wiki/index.php/Depth_of_Coverage_v3.0

or you can run 'samtools pileup' and calculate the mean value for the coverage column.

ADD COMMENTlink written 6.9 years ago by Pierre Lindenbaum102k
2

I have used the samtools method (which excludes bases with 0 coverage) and the BEDTools version (which includes bases with 0 coverage). The former will give a higher mean coverage. This will have an impact on genome where there are large areas that are not sequenced.

ADD REPLYlink written 6.7 years ago by Ian5.0k
1

Sorry I am new to bioinformatics. I have used 'samtools mpileup', but how can I get mean value for the depth column. Thanks.

ADD REPLYlink written 14 months ago by jjaimi92112910
1

I tried GATK DepthOfCoverage last July with no luck - complicated java.lang.RuntimeExceptions. I did find success with BEDtools however (see answer below/above)

ADD REPLYlink written 6.9 years ago by Casey Bergman17k
1

I use this method to calculate % of sites with at least a coverage of 20x in a list of sites from a bed. It provides that number directly as output.

ADD REPLYlink written 6.5 years ago by Wjeck480

That's the gte_20 value from the sample_cumulative_coverage_proportions file, correct?

ADD REPLYlink written 2.1 years ago by stianlagstad890

I was interested in this issue too, but I wanted to know the coverage for certain genes. after looking to this tool's wiki I've realized that I can feed it with a list of genes intervals (which of course I'll have to retrieve from UCSC or Ensembl) and I'll surely sort my problem out! look what an unexpected great piece of information I've just found! once again thanks a lot, Pierre.

ADD REPLYlink written 6.9 years ago by Jorge Amigo10.0k

Thanks Pierre. I haven't tried Depthof Coverage but I tried Samtools. Samtools would give you a base by base coverage, and I use it successfully like Jorge explained but I 'll look at GATK for sure.

ADD REPLYlink written 6.9 years ago by Biomed4.3k
14
gravatar for Michael.James.Clark
6.9 years ago by
Palo Alto
Michael.James.Clark530 wrote:

Here's a very simple perl script I wrote that does exactly this that you're free to use:

($num,$den)=(0,0);
while ($cov=<STDIN>) {
    $num=$num+$cov;
    $den++;
}
$cov=$num/$den;
print "Mean Coverage = $cov\n";

Just pipe samtools pileup to it like so:

/path/to/samtools pileup in.bam | awk '{print $4}' | perl mean_coverage.pl

It will also work easily for genomic regions. Just index the BAM file and use samtools view like so:

/path/to/samtools view -b in.bam <genomic region> | /path/to/samtools pileup - | awk '{print $4}' | perl mean_coverage.pl

Hope someone finds this useful.

ADD COMMENTlink modified 6.5 years ago • written 6.9 years ago by Michael.James.Clark530
4

One more comment on this pipeline... In my data, where I have relatively low coverage there are quite a few genomic positions not covered at all. These will not show up in the pileup command, so I guess that the mean coverage calculated this way will be the mean coverage of the covered positions. Am I right? Is there a way (in the "samtools pileup" command) to output all positions even if they are not covered by any read?

ADD REPLYlink written 6.6 years ago by Panos1.5k

looks promising. Thanks

ADD REPLYlink written 6.9 years ago by Biomed4.3k

I am certainly going to try this. Thanks for the answer.

ADD REPLYlink written 6.9 years ago by Biomed4.3k

Nice answer! Tried it and it's working! The only thing I'd like to mention is that I think that the BAM file needs to be sorted.

ADD REPLYlink written 6.6 years ago by Panos1.5k

That's correct, panos. This simple solution is only if every base is covered at least once because samtools pileup doesn't report bases with zero coverage. Thanks for pointing that out!

ADD REPLYlink written 6.5 years ago by Michael.James.Clark530

can you give me a detailed description about $cov=[HTML] you used ? Thanks!

ADD REPLYlink written 4.7 years ago by siyu110

Very useful thanks! Used it on my own dataset (sorry to bump an old thread but still very handy)

ADD REPLYlink written 3.8 years ago by Andrew Watson30

I made a BioStars account to upvote your contribution. I've been trying to find something this simple and straightforward for some time now.. Thanks!

ADD REPLYlink written 13 months ago by pvdam30
2

this is a very old answer. you would find then useful a simpler samtools depth based solution:

samtools depth [-b regions.bed] input.bam | awk '{c++;s+=$3}END{print s/c}'
ADD REPLYlink modified 13 months ago • written 13 months ago by Jorge Amigo10.0k

@Jorge Amigo

samtools depth bamfile | awk '{sum+=$3} END { print "Average = ",sum/NR}'

by @russdurrett also gives same output for my paired-end bam file from RNA-Seq:-)

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by bioinfo890
1

that's because NR is an awk variable that stores the number of rows, therefore the c variable on my code is not wrong, but it is superfluous.

ADD REPLYlink written 8 weeks ago by Jorge Amigo10.0k
5
gravatar for Heikki
6.9 years ago by
Heikki350
Heikki350 wrote:

@michael.james.clark, thank you. I wrote for myself a script that I can add to when the need arises:

#!/usr/bin/env perl

use strict;
use warnings;

# Usage: samtools pileup file.bam | bam_coverage.pl

my $num;        # per residue coverage
my $len;        # sequence length counter 
my $min = 1000; # minimum coverage
my $max = 0;    # maximum coverage

while (<>) {
    my @a = split /\t/;
    $num += $a[3];
    $min = $a[3] if $min > $a[3];
    $max = $a[3] if $max < $a[3];
    $len++;
}

printf  "Mean coverage  : %.1f\n", $num/$len;
printf  "Coverage range : %d - %d\n", $min, $max;
ADD COMMENTlink written 6.9 years ago by Heikki350
1

Heiki thanks for the codes. It is working for me

ADD REPLYlink written 5.3 years ago by Kwame50

This doesn't work because mpileup does not report the bases with coverage of zero

ADD REPLYlink written 4.7 years ago by Raygozak1.1k
4
gravatar for Rm
6.5 years ago by
Rm7.6k
Danville, PA
Rm7.6k wrote:

use

samtools pileup accepted_hits.bam | awk '{ count++ ; SUM += $4 } END { print "Total: " SUM "\t" "Nucleotides: " count "\t" "Average_coverage: " SUM/count }'

(use $8 instead; if pileup with -cf option)

ADD COMMENTlink modified 5.3 years ago by Istvan Albert ♦♦ 75k • written 6.5 years ago by Rm7.6k

I used the above command to extract the average coverage on my aligned bam files and my statistics are: Total: 8011581243 Nucleotides: 510234050 Average_coverage: 15.7018

I want to know that the average_coverage is stating that on a distribution each of my base is read atleast 15 time right? What does the total signifies and the Nucleotides count: 510234050. Is this the total no of bases that are in the exonic region? then how will I calculate the number of reads from here that actually got mapped on in the exome region? Can we do this from the nucleotides count? I know for each reads consist of 100 bases so if I divide this 510234050 bases with 100 it gives me the reads but that will not give me the reads that actually got mapped in the exome region. How t extract that?

ADD REPLYlink written 4.0 years ago by vchris_ngs4.2k
3
gravatar for William
4.4 years ago by
William4.0k
Europe
William4.0k wrote:

How about just running qualimap in GUI mode or to create a QC PDF? Your can run it on the whole genome or on 1 or multiple regions using a bed file.

global coverage stats

coverage histogram

percentage of genome covered x times plot

http://qualimap.bioinfo.cipf.es/

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by William4.0k
3
gravatar for 5heikki
2.6 years ago by
5heikki6.9k
Finland
5heikki6.9k wrote:

This is for people who have a sam/bam file consisting of reads mapped to multiple contigs (samtools needs to be in $PATH):

bbmap/pileup.sh -h

Written by Brian Bushnell
Last modified May 19, 2015

Description:  Calculates per-scaffold coverage information from an unsorted sam file.

Usage:  pileup.sh in=<input> out=<output>

Input may be stdin or a SAM file, compressed or uncompressed.
Output may be stdout or a file.

Input Parameters:
in=<file>           The input sam file; this is the only required parameter.
ref=<file>          Scans a reference fasta for per-scaffold GC counts, not otherwise needed.
fastaorf=<file>     An optional fasta file with ORF header information in PRODIGAL's output format.  Must also specify 'outorf'.
unpigz=t            Decompress with pigz for faster decompression.

..
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by 5heikki6.9k

does it work with a bam file ?

ADD REPLYlink written 13 months ago by Picasa320
1

Yes it does. Use the latest version.

$ pileup.sh
Written by Brian Bushnell
Last modified September 1, 2016
Description:  Calculates per-scaffold coverage information from an unsorted sam or bam file.
ADD REPLYlink modified 13 months ago • written 13 months ago by genomax39k
2
gravatar for Mdperry
6.9 years ago by
Mdperry40
Mdperry40 wrote:

BioPerl has a library named Bio-Samtools which provides a familiar Perl/BioPerl access to samtools. I heard the author, Lincoln Stein, give a talk on on how he wove the two together use XS. As I recall, one of his examples was exactly the question you posed

ADD COMMENTlink written 6.9 years ago by Mdperry40
0
gravatar for anuragkuleuven
4.4 years ago by
anuragkuleuven0 wrote:

Qualimap is very fast and excellent tool. thanks

ADD COMMENTlink written 4.4 years ago by anuragkuleuven0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 754 users visited in the last hour