Tools To Calculate Average Coverage For A Bam File?
12
90
Entering edit mode
10.6 years ago
Biomed 4.8k

I would like to get the average coverage of all the captured bases in a bam file. What would be the best way to do this? What I am looking is a simple one number like 40X. Given that there may be millions of bases sequenced in a next gen study I would like to get the overall average coverage for all these bases as a single number.

Thanks

coverage bam sequencing • 179k views
ADD COMMENT
5
Entering edit mode

People seem to be equally devided between GATK, Samtools and BedTools.

ADD REPLY
1
Entering edit mode

The Depth of Coverage is depreciated in GATK4. It's only available in GATK3. See here

ADD REPLY
3
Entering edit mode

Most answers seems to be very old and hence would like to have updated suggestions.

I have 6 bam files and I have used samtools depth to calculate chromosome wise depth for all chromosomes and then plotted in R. While looking at the plots at 2-3 places, depth shows upto 200-3500 and hence I would like to calculate average read depth of each chromosome from each bam file. For e.g.: 1) average depth of chr1 from bam1, average depth of chr1 from bam2, ...until bam6. 2) average depth of chr1 from all 6 bam together.

Kindly guide.

Thanks.

ADD REPLY
1
Entering edit mode

Try the solutions if they don't work then you need to reframe the question with your commands and write a new query in a new post. This is advised. And yes paired end should work.

ADD REPLY
0
Entering edit mode

in addition, if you have already plotted depths it means that you have got a temporal R data.frame (or similar type) in there containing chromosome, position and depth information. splitting such data.frame by chromosome and calculating average values should be straight-forward. I would just suggest to take into account empty values not reported by samtools depth dividing each chromosome depths' sum by the size of the chromosome, and not by the number of depth values.

ADD REPLY
1
Entering edit mode

Why is it wrong to take the number of sequenced bases and divide by the genome size?

ADD REPLY
4
Entering edit mode

because you cannot align all the bases on the genome, some reads are optical duplicates, etc...

ADD REPLY
1
Entering edit mode

Okay, contamination is a fair point, but even if some reads fail to match, that could still be a sign of genetic divergence, and should still count towards genome coverage. Do any of the other suggestions deal with duplicates implicitly?

ADD REPLY
0
Entering edit mode

Qualimap is very fast and excellent tool. thanks

ADD REPLY
0
Entering edit mode

Hi,

I have multiple paired-end bam files from RNA-Seq data, already aligned and computed depth with samtools depth > bam.depth. Would any one please guide for any straight forward/latest way to compute coverage plots, avg. coverage plots and other metrics from depth files?

Picard CollectRNAseqmetrics, RSeQC and QoRTs have not been helpful. :-(

Thanks.

ADD REPLY
0
Entering edit mode

Have you run any of the other tools mentioned in this thread (e.g. Qualimap, pileup from BBMap)?

ADD REPLY
0
Entering edit mode

@genomax My bam files are very big and Qualimap has not been helpful, so as bbmap which is Java-based (memory issues like Picard and QoRTs) and hence I used samtools depth. My depth files are also in GB.

ADD REPLY
0
Entering edit mode

Both programs will happily accept larger allocations of RAM and can open very large files. If you don't have access to a machine with more RAM then that is a different problem.

ADD REPLY
49
Entering edit mode
8.5 years ago
russdurrett ▴ 450

if you have samtools and awk, you can keep it simple with this:

samtools depth  *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'  and with stdev: samtools depth *bamfile* | awk '{sum+=$3; sumsq+=$3*$3} END { print "Average = ",sum/NR; print "Stdev = ",sqrt(sumsq/NR - (sum/NR)**2)}'

ADD COMMENT
4
Entering edit mode

Not sure this works. It calculates coverage over regions but does not account for regions that were not covered with a read. Then it divides all the sum with the number of rows that were reported... but the positions without a read hit are not included. Basically - it gives a too high result.

ADD REPLY
11
Entering edit mode

samtools depth -a

Using the '-a' flag outputs all positions (including zero depth), which should solve this problem.

ADD REPLY
3
Entering edit mode

[FYI] Only available from v1.3, I couldn't find that parameter in my version

https://github.com/samtools/samtools/issues/374

ADD REPLY
4
Entering edit mode

That's true. To be fair, the denominator needs to be the size of genome where the BAM file was generated against, rather than the number of bases covered at least once (that's what NR is in this context). To get the genome size from the BAM file:

samtools view -H *bamfile* | grep -P '^@SQ' | cut -f 3 -d ':' | awk '{sum+=$1} END {print sum}'  ADD REPLY 0 Entering edit mode Please correct me if I am wrong, but none of these answers seem to work with nanopore data. ADD REPLY 1 Entering edit mode why would it not? ADD REPLY 0 Entering edit mode In the first command, its $3 for me

ADD REPLY
0
Entering edit mode

I agree, and edited Russ' answer accordingly to avoid confusing anyone else.

ADD REPLY
0
Entering edit mode

samtools depth is the most straightforward way to do this

ADD REPLY
42
Entering edit mode
10.6 years ago

Try the genomeCoverageBed tool in the BEDtools package, which takes a BAM or BED file as input and:

computes a histogram of feature coverage (e.g., aligned sequences) for a given genome. Optionally, by using the –d option, it will report the depth of coverage at each base on each chromosome in the genome ﬁle (-g).

The way we run this is to first make a tab-delimited genome.txt file with "contig_name contig_size" information for all contigs in your genome and then run genomeCoverageBed on a sorted BAM file:

$genomeCoverageBed -ibam sortedBamFile.bam -g genome.txt > coverage.txt This will give you a summary histogram of coverage across each contig and for the entire genome, from which you can obtain the modal or mean value (the single fold-coverage value you are looking for). ADD COMMENT 0 Entering edit mode ADD REPLY 0 Entering edit mode With your command, I have different value for the same contig. For instance: jcf7180001864843 2 4 137928 2.90006e-05 jcf7180001864843 3 28 137928 0.000203004 jcf7180001864843 4 10 137928 7.25016e-05  What does it mean ?. How can I have one value for each contig. ADD REPLY 3 Entering edit mode That's because it reports a histogram. From the documentation: The values in the output are: 1. chromosome (or entire genome) [eg jcf7180001864843] 2. depth of coverage from features in input file [eg. 2] 3. number of bases on chromosome (or genome) with depth equal to column 2. [eg. 4] 4. size of chromosome (or entire genome) in base pairs [eg. 137928] 5. fraction of bases on chromosome (or entire genome) with depth equal to column 2. [2.90006e-05] [all values taken from your first row] If you want to get the average coverage: add up the product of bases per coverage [2* 4+3* 28+4* 10+...] and divide by the total number of bases [137928]. ADD REPLY 0 Entering edit mode Hello Ronald, I have a question about what you are saying. So for the average coverage you are stating, that would be per scafffold correct? I attempted to open a topic recently and was referred to this one :) Cheers ADD REPLY 1 Entering edit mode Is your BAM file sorted (as noted on the help doc page linked by @student-t above)? ADD REPLY 0 Entering edit mode what is content of genome.txt ADD REPLY 24 Entering edit mode 10.6 years ago And the mandatory R solution. The Rsamtools package can be used to read in the BAM file. Here is a code example: library(Rsamtools) bamcoverage <- function (bamfile) { # read in the bam file bam <- scanBam(bamfile)[[1]] # the result comes in nested lists # filter reads without match position ind <- ! is.na(bam$pos)
## remove non-matches, they are not relevant to us
bam <- lapply(bam, function(x) x[ind])
ranges <- IRanges(start=bam$pos, width=bam$qwidth, names=make.names(bam$qname, unique=TRUE)) ## names of the bam data frame: ## "qname" "flag" "rname" "strand" "pos" "qwidth" ## "mapq" "cigar" "mrnm" "mpos" "isize" "seq" "qual" ## construc: genomic ranges object containing all reads ranges <- GRanges(seqnames=Rle(bam$rname), ranges=ranges, strand=Rle(bam$strand), flag=bam$flag, readid=bamrname ) ## returns a coverage for each reference sequence (aka. chromosome) in the bam file return (mean(coverage(ranges))) }  And the output: > bamcoverage("~/test.bam") gi|225184640|emb|AL009126.3| 15.35720  The good thing about this is you can do much more with the genomic ranges object than computing the coverage, the downside is, that the code is not that efficient. ADD COMMENT 0 Entering edit mode @Michael : What maximum size of input bam file that can be handled by this solution ? It slurps the whole bamfile in memory ?? ADD REPLY 2 Entering edit mode Sorry for the late reply. Yes it does, and it is not very memory efficient. For a 850MB BAM file with approximately 17 million alignments the R process uses up to 15GB of memory during processing. ADD REPLY 5 Entering edit mode A reply four years after..... ADD REPLY 0 Entering edit mode @Micheal: Will it work for paired end BAM files? ADD REPLY 0 Entering edit mode Hi! I used this R code with Rsamtools with my bam file and the result was: 6503.159 What could have gone wrong? ADD REPLY 0 Entering edit mode You might have incredibly high coverage :) I would try a different tool, now 8 years later. Honestly, this code was only intended to show it can be done in R too. But I am not even sure it works correctly. ADD REPLY 0 Entering edit mode Thanks!!! I'll look for another tools :) genomecoverage from bedtools looks good too. I'll try that one. Thanks for the advice! ADD REPLY 16 Entering edit mode 10.6 years ago did you try the GATK "DepthOfCoverage" ? http://www.broadinstitute.org/gsa/wiki/index.php/Depth_of_Coverage_v3.0 or you can run 'samtools pileup' and calculate the mean value for the coverage column. ADD COMMENT 2 Entering edit mode I have used the samtools method (which excludes bases with 0 coverage) and the BEDTools version (which includes bases with 0 coverage). The former will give a higher mean coverage. This will have an impact on genome where there are large areas that are not sequenced. ADD REPLY 1 Entering edit mode Sorry I am new to bioinformatics. I have used 'samtools mpileup', but how can I get mean value for the depth column. Thanks. ADD REPLY 1 Entering edit mode I tried GATK DepthOfCoverage last July with no luck - complicated java.lang.RuntimeExceptions. I did find success with BEDtools however (see answer below/above) ADD REPLY 1 Entering edit mode I use this method to calculate % of sites with at least a coverage of 20x in a list of sites from a bed. It provides that number directly as output. ADD REPLY 0 Entering edit mode That's the gte_20 value from the sample_cumulative_coverage_proportions file, correct? ADD REPLY 0 Entering edit mode I was interested in this issue too, but I wanted to know the coverage for certain genes. after looking to this tool's wiki I've realized that I can feed it with a list of genes intervals (which of course I'll have to retrieve from UCSC or Ensembl) and I'll surely sort my problem out! look what an unexpected great piece of information I've just found! once again thanks a lot, Pierre. ADD REPLY 0 Entering edit mode Thanks Pierre. I haven't tried Depthof Coverage but I tried Samtools. Samtools would give you a base by base coverage, and I use it successfully like Jorge explained but I 'll look at GATK for sure. ADD REPLY 14 Entering edit mode 10.6 years ago Here's a very simple perl script I wrote that does exactly this that you're free to use: (num,$den)=(0,0); while ($cov=<STDIN>) {
$num=$num+$cov;$den++;
}
$cov=$num/$den; print "Mean Coverage =$cov\n";


Just pipe samtools pileup to it like so:

/path/to/samtools pileup in.bam | awk '{print $4}' | perl mean_coverage.pl  It will also work easily for genomic regions. Just index the BAM file and use samtools view like so: /path/to/samtools view -b in.bam <genomic region> | /path/to/samtools pileup - | awk '{print$4}' | perl mean_coverage.pl


Hope someone finds this useful.

ADD COMMENT
4
Entering edit mode

One more comment on this pipeline... In my data, where I have relatively low coverage there are quite a few genomic positions not covered at all. These will not show up in the pileup command, so I guess that the mean coverage calculated this way will be the mean coverage of the covered positions. Am I right? Is there a way (in the "samtools pileup" command) to output all positions even if they are not covered by any read?

ADD REPLY
0
Entering edit mode

looks promising. Thanks

ADD REPLY
0
Entering edit mode

I am certainly going to try this. Thanks for the answer.

ADD REPLY
0
Entering edit mode

Nice answer! Tried it and it's working! The only thing I'd like to mention is that I think that the BAM file needs to be sorted.

ADD REPLY
0
Entering edit mode

That's correct, panos. This simple solution is only if every base is covered at least once because samtools pileup doesn't report bases with zero coverage. Thanks for pointing that out!

ADD REPLY
0
Entering edit mode

can you give me a detailed description about $cov=[HTML] you used ? Thanks! ADD REPLY 0 Entering edit mode Very useful thanks! Used it on my own dataset (sorry to bump an old thread but still very handy) ADD REPLY 0 Entering edit mode I made a BioStars account to upvote your contribution. I've been trying to find something this simple and straightforward for some time now.. Thanks! ADD REPLY 3 Entering edit mode this is a very old answer. you would find then useful a simpler samtools depth based solution: samtools depth [-b regions.bed] input.bam | awk '{c++;s+=$3}END{print s/c}'

ADD REPLY
0
Entering edit mode

samtools depth bamfile | awk '{sum+=$3} END { print "Average = ",sum/NR}' by @russdurrett also gives same output for my paired-end bam file from RNA-Seq:-) ADD REPLY 1 Entering edit mode that's because NR is an awk variable that stores the number of rows, therefore the c variable on my code is not wrong, but it is superfluous. ADD REPLY 0 Entering edit mode I ran: samtools depth zm_cp.bam | awk '{sum+=$3} END { print "Average = ",sum/NR}'


and the result was:

Average =  6495.19


:O is this okay?

ADD REPLY
0
Entering edit mode

Does this also works with mpileup? I ran:

samtools mpileup zm_cp.bam | awk '{print $4}' | perl mean_coverage.pl  and this was the result: [mpileup] 1 samples in 1 input files <mpileup> Set max per-file depth to 8000 Mean Coverage = 5107.87408859271  :O is this okay? ADD REPLY 0 Entering edit mode Hello, I know this post is quite old, but I am trying to use this as it seems to work wonderfully for everyone else. I am new to bioinformatics and genomics, so please be patient with me. I have run the command 'samtools mpileup A186944_mtgenome.srt.dup.bam | awk '{print$4}' | perl coverage.pl' where my bam file is a sorted bam with duplicates removed and I get the following error:

Illegal division by zero at coverage.pl line 6.

Can someone please help me figure out the error? I think it's maybe because there are 0's where unmapped reads occur? How do I obtain coverage for just the mapped reads, and eliminate the 0 denominator?

ADD REPLY
9
Entering edit mode
3.1 years ago

A modern, very fast, solution for this would be mosdepth.

ADD COMMENT
0
Entering edit mode

how do i install other than bioconda?

ADD REPLY
1
Entering edit mode

You can install the executable from their github release page: https://github.com/brentp/mosdepth/releases

Latest version at time of commenting (0.2.9):

(Download executable) wget https://github.com/brentp/mosdepth/releases/download/v0.2.9/mosdepth

(Update permissions to run it) chmod a+x mosdepth

(Good to go!) <path_to_mosdepth> --help

ADD REPLY
0
Entering edit mode

Why not use bioconda?

More installation options and instructions can be found on the GitHub page.

ADD REPLY
5
Entering edit mode
10.6 years ago
Heikki ▴ 360

@michael.james.clark, thank you. I wrote for myself a script that I can add to when the need arises:

#!/usr/bin/env perl

use strict;
use warnings;

# Usage: samtools pileup file.bam | bam_coverage.pl

my $num; # per residue coverage my$len;        # sequence length counter
my $min = 1000; # minimum coverage my$max = 0;    # maximum coverage

while (<>) {
my @a = split /\t/;
$num +=$a[3];
$min =$a[3] if $min >$a[3];
$max =$a[3] if $max <$a[3];
$len++; } printf "Mean coverage : %.1f\n",$num/$len; printf "Coverage range : %d - %d\n",$min, $max;  ADD COMMENT 1 Entering edit mode Heiki thanks for the codes. It is working for me ADD REPLY 0 Entering edit mode This doesn't work because mpileup does not report the bases with coverage of zero ADD REPLY 5 Entering edit mode 8.1 years ago William ★ 4.9k How about just running qualimap in GUI mode or to create a QC PDF? Your can run it on the whole genome or on 1 or multiple regions using a bed file. http://qualimap.bioinfo.cipf.es/ ADD COMMENT 0 Entering edit mode I tried using QualiMap2 on HPC. It just doesn't work. ADD REPLY 4 Entering edit mode 10.2 years ago Rm 8.1k use samtools pileup accepted_hits.bam | awk '{ count++ ; SUM +=$4 } END { print "Total: " SUM "\t" "Nucleotides: " count "\t" "Average_coverage: " SUM/count }'


(use 8 instead; if pileup with -cf option) ADD COMMENT 0 Entering edit mode I used the above command to extract the average coverage on my aligned bam files and my statistics are: Total: 8011581243 Nucleotides: 510234050 Average_coverage: 15.7018 I want to know that the average_coverage is stating that on a distribution each of my base is read atleast 15 time right? What does the total signifies and the Nucleotides count: 510234050. Is this the total no of bases that are in the exonic region? then how will I calculate the number of reads from here that actually got mapped on in the exome region? Can we do this from the nucleotides count? I know for each reads consist of 100 bases so if I divide this 510234050 bases with 100 it gives me the reads but that will not give me the reads that actually got mapped in the exome region. How t extract that? ADD REPLY 3 Entering edit mode 6.3 years ago 5heikki 9.9k This is for people who have a sam/bam file consisting of reads mapped to multiple contigs (samtools needs to be in PATH):

bbmap/pileup.sh -h

Written by Brian Bushnell
Last modified May 19, 2015

Description:  Calculates per-scaffold coverage information from an unsorted sam file.

Usage:  pileup.sh in=<input> out=<output>

Input may be stdin or a SAM file, compressed or uncompressed.
Output may be stdout or a file.

Input Parameters:
in=<file>           The input sam file; this is the only required parameter.
ref=<file>          Scans a reference fasta for per-scaffold GC counts, not otherwise needed.
fastaorf=<file>     An optional fasta file with ORF header information in PRODIGAL's output format.  Must also specify 'outorf'.
unpigz=t            Decompress with pigz for faster decompression.

..

ADD COMMENT
0
Entering edit mode

does it work with a bam file ?

ADD REPLY
1
Entering edit mode

Yes it does. Use the latest version.

\$ pileup.sh
Written by Brian Bushnell
Last modified September 1, 2016
Description:  Calculates per-scaffold coverage information from an unsorted sam or bam file.

ADD REPLY
0
Entering edit mode

Could you introduce a bed file, say exome, and get the coverage stats?

ADD REPLY
2
Entering edit mode
10.6 years ago
Mdperry ▴ 40

BioPerl has a library named Bio-Samtools which provides a familiar Perl/BioPerl access to samtools. I heard the author, Lincoln Stein, give a talk on on how he wove the two together use XS. As I recall, one of his examples was exactly the question you posed

ADD COMMENT
0
Entering edit mode
17 days ago
MSRS ▴ 480

Now the new release of Samtools for coverage analysis.

samtools coverage input.bam

ADD COMMENT

Login before adding your answer.

Traffic: 928 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6