Thanks to brs1111 for the idea for this.
If you have aligned your reads to a reference genome, here hg38.fasta, and want to know how much of the reference genome has been covered by aligned reads, the following code will tell you:
# Determine the number of bases at zero read depth
# (with -ibam, chromosome sizes are read from the BAM header, so no -g genome file is needed)
zero=$(bedtools genomecov -ibam BAM -bga | awk '$4==0 {bpCountZero += $3-$2} END {print bpCountZero}')
# Determine the number of bases at >0 read depth, i.e., non-zero bases
nonzero=$(bedtools genomecov -ibam BAM -bga | awk '$4>0 {bpCountNonZero += $3-$2} END {print bpCountNonZero}')
# Calculate the percentage of the reference genome covered by >0 read depth bases,
# truncated to 6 decimal places by bc
percent=$(bc <<< "scale=6; ($nonzero / ($zero + $nonzero))*100")
echo "$percent"
The genomecov function of BEDTools computes the depth of coverage in your BAM file at each position of the reference genome. With -bga, consecutive positions at the same depth are merged into BEDGRAPH intervals (chromosome, 0-based start, end, depth), and zero-depth intervals are reported as well. The output looks like this:
chr1 0 11794364 0
chr1 11794364 11794488 3
chr1 11794488 11794490 5
chr1 11794490 11794514 6
chr1 11794514 11794639 3
chr1 11794639 14022492 0
chr1 14022492 14022535 1
chr1 14022535 97082336 0
chr1 97082336 97082442 1
chr1 97082442 97082486 2
chr1 97082486 97082593 1
chr1 97082593 97305303 0
chr1 97305303 97305377 2
...
This should take ~1-5 minutes to run with any BAM file on a 'modern' laptop.
You can use the -bg option instead of -bga for genomecov, as it reports only the non-zero regions. Subtracting the sum of the spanned bp from the total chromosome length then saves you one run of genomecov; a sketch of this single-run approach follows below (also have a look at mawk instead of classical awk, as it is much faster for these simple count/filter tasks). Also keep in mind that this result is only meaningful if you pre-filtered your BAM file to exclude low-quality alignments (at least remove MAPQ=0; I always require MAPQ of 20 or higher), as low-quality alignments are either placed randomly (MAPQ=0) or are of low reliability. I also doubt a bit that a 50x WGS BAM will finish in 5 min ^^
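A minimal sketch of that single-run approach, assuming samtools is on the PATH; input.bam and filtered.bam are placeholder file names:
# Optional pre-filter keeping only alignments with MAPQ >= 20, as suggested above
samtools view -b -q 20 input.bam > filtered.bam
# Total reference length, summed from the LN: fields of the @SQ header lines
total=$(samtools view -H filtered.bam | awk -F'\t' '/^@SQ/ {for(i=1;i<=NF;i++) if($i ~ /^LN:/) sum += substr($i,4)} END {print sum}')
# Non-zero bases from a single genomecov run; -bg reports covered intervals only
nonzero=$(bedtools genomecov -ibam filtered.bam -bg | awk '{sum += $3-$2} END {print sum}')
# Percent covered; the zero-depth count is simply total minus nonzero
percent=$(bc <<< "scale=6; ($nonzero / $total)*100")
echo "$percent"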
Thanks for the input, friend. It does run rapidly for me on my high-spec laptop... but, when I'm on projects in the Tropics, my CPU typically reaches 90 Celsius (194 Fahrenheit) and shuts down the computer :/
MAPQ of 20 is way too low for a clinical application - that's a 1-in-100 chance of misalignment. For clinical tests, I push it up to MAPQ of 60. As bioinformatics methods are now entering the clinical realm, we all need to think more about quality control and proper coding.
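(For reference, MAPQ is Phred-scaled: the implied misalignment probability is P = 10^(-MAPQ/10), so MAPQ 20 corresponds to P = 0.01 and MAPQ 60 to P = 0.000001.)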
That's silly; it's not like you're relying on only a single alignment for variant calling. The MAPQ scores from most aligners aren't as meaningful as you seem to think (most don't even go up to 60, since their bounds are fairly artificial).
Devon, your reply is somewhat naive. There are regions of the genome that exhibit a high level of homology to other regions, and all reads over these regions will exhibit low MAPQ. Should a variant call over these regions even be trusted if all mapped reads over them have MAPQ <30? If a patient is waiting on a result at the end of the day, would you make that call? I think not.
You think wrong. If that variant made the most sense in the clinical context and when considering the various other discovered variants, then of course that's what would and should be reported. Again, you're placing vastly too much trust in MAPQ values.
I upvoted your comment, by the way!