Tutorial: Determine % of reference genome covered by aligned SAM/BAM
gravatar for Kevin Blighe
9 months ago by
Kevin Blighe21k
University College London Cancer Institute
Kevin Blighe21k wrote:

Thanks to brs1111 for the idea for this.

If you have aligned your reads to a reference genome, here hg38.fasta, and want to know how much of the reference genome has been covered by aligned reads, the following code will tell you:

#Determine number of bases at 0 read depth
zero=$(bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4==0 {bpCountZero+=($3-$2)} {print bpCountZero}' | tail -1)

#Determine number of bases at >0 read depth, i.e., non-zero bases
nonzero=$(bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4>0 {bpCountNonZero+=($3-$2)} {print bpCountNonZero}' | tail -1)

#Calculate the percent of the reference genome coverd by >0 read depth bases
#Round up to 6 decimal places
percent=$(bc <<< "scale=6; ($nonzero / ($zero + $nonzero))*100")

echo $percent

The genomecov function of BEDTools normally computes coverage in your BAM file for each position in your supplied reference genome. The output looks like this:

    chr1    0   11794364    0
    chr1    11794364    11794488    3
    chr1    11794488    11794490    5
    chr1    11794490    11794514    6
    chr1    11794514    11794639    3
    chr1    11794639    14022492    0
    chr1    14022492    14022535    1
    chr1    14022535    97082336    0
    chr1    97082336    97082442    1
    chr1    97082442    97082486    2
    chr1    97082486    97082593    1
    chr1    97082593    97305303    0
    chr1    97305303    97305377    2

This should take ~1-5 minutes to run with any BAM file on a 'modern' laptop.

ADD COMMENTlink modified 9 months ago by Devon Ryan80k • written 9 months ago by Kevin Blighe21k

It is a good habit to share your experiences and solutions in the community, keep it up. In this case, if you want to use genomecov, there is a faster workaround. You may use the -bg instead of the -bga option for genomecov, as this will only report non-zero regions. Subtracting the sum of the spanned bp from the total chr length will save you one run of genomecov (have also a look at mawk instead of classical awk, it is way faster for these simply count/filter tasks). Also keep in mind that this result is only meaningful if you prefiltered your bam file to exclude low quality alignments (at least remove MAPQ=0, I always require MAPQ to be 20 or higher) as low quality alignments are either placed random (MAPQ=0) or of low reliability. I may also doubt a bit that a 50x WGS bam will be finished in 5min^^

ADD REPLYlink modified 9 months ago • written 9 months ago by ATpoint4.4k

Thanks for the input friend. It does run rapidly for me on my high-spec laptop... but, when I'm on projects in the Tropics, my CPU typically reaches 90 Celsius (194 Fahenheit) and shuts down the computer :/

ADD REPLYlink modified 9 months ago • written 9 months ago by Kevin Blighe21k

MAPQ of 20 is way too low for a clinical application - that's a 1-in-100 chance of misalignment. For clinical tests, I push it up to MAPQ of 60. As bioinformatics methods are now entering the clinical realm, we all need to think more about quality control and proper coding.

ADD REPLYlink written 9 months ago by Kevin Blighe21k

That's silly, it's not like you're relying on only a single alignment for variant calling. The MAPQ scores from most aligners aren't as meaningful as you seem to think (most don't even go up to 60, since their bounds are fairly artificial).

ADD REPLYlink written 9 months ago by Devon Ryan80k

Devon, your reply is somewhat naive. There are regions of the genome that exhibit a high level of homology to others, and all reads over these regions will exhibit low MAPQ. Should a variant call over these regions even be trusted if all mapped reads over them have MAPQ <30? If a patient is waiting on a result at the end of the day, would you make that call? I think not.

ADD REPLYlink written 9 months ago by Kevin Blighe21k

You think wrong. If that variant made the most sense in the clinical context and when considering the various other discovered variants then of course that's what would and should be reported. Again, you're placing vastly too much trust into MAPQ values.

ADD REPLYlink written 9 months ago by Devon Ryan80k

I upvoted your comment, by the way!

ADD REPLYlink modified 9 months ago • written 9 months ago by Kevin Blighe21k
gravatar for Devon Ryan
9 months ago by
Devon Ryan80k
Freiburg, Germany
Devon Ryan80k wrote:

Alternative methods:

  1. plotFingerprint --outQualityMetrics from deepTools. The X-intercept is the fraction of the genome not covered. This has the benefit of (A) being multithreaded and (B) allowing filtering of the alignments.
  2. plotCoverage from deepTools.
ADD COMMENTlink written 9 months ago by Devon Ryan80k

Cheers for providing alternatives!

ADD REPLYlink written 9 months ago by Kevin Blighe21k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1080 users visited in the last hour