Question: Calculating Coverage Using CIGAR String
4
QVINTVS_FABIVS_MAXIMVS2.4k wrote:

I'm using the equation

``````Num_Reads * Avg_Read_Length / Genome Size
``````

To calculate coverage. Here are my questions

1. Should I only consider mapped bases in the Query to calculate the read length? Specifically, `M`,`=` tags in the CIGAR string? For example, if a read has `60M10S` for a CIGAR string, would the read length be 60 since 10 were not aligned properly?

2. Given 1. why is it that my calculation of coverage differs from `samtools mpileup`? Wouldn't the average of column 4 in mpileup output equal the average coverage?

Thanks for any help.

bamtools coverage samtools • 1.7k views
modified 3.7 years ago • written 3.7 years ago by QVINTVS_FABIVS_MAXIMVS2.4k
7
Pierre Lindenbaum133k wrote:

1) yes M , = and even X

2) yes , ignore the clip

3) did you consider some options like:

``````  -Q, --min-BQ INT        skip bases with baseQ/BAQ smaller than INT 
[UNMAP,SECONDARY,QCFAIL,DUP]
``````

what's the output of `samtools depth`

2

Edit: Fixed it :) Thanks for the response. I'm using Bamtools API for C++ (and learning C++). Here's my code

``````#include "api/BamMultiReader.h"
#include "api/BamWriter.h"
#include <string>
#include <vector>
#include <iostream>
#include <numeric>
using namespace BamTools;
using namespace std;
int main(int argc, char *argv[])
{
vector<string> inBAM;
string ifh = string(argv);
inBAM.push_back(ifh);
cerr << "ERROR: " << ifh << " could not be opened!" << endl;
return 1;
}
BamAlignment al;
const int start=14990508; // This is the first position in my sample given mpileup (converted to 0 base)
const int end=15118428; // This is the last position in my sample given mpileup
double cov=0;
vector<int> lens;
if(al.MapQuality < 10 || al.IsDuplicate()==true || al.IsFailedQC()==true || al.IsMapped()==false || al.IsPrimaryAlignment()==false)
{ continue; }
int rlen=0;
vector<CigarOp> cigar = al.CigarData;
for(vector<CigarOp>::iterator it= cigar.begin(); it != cigar.end(); ++it)
{
if (it->Type == 'M' || it->Type == '=' || it->Type== 'X') {
rlen+=it->Length;
}
}
lens.push_back(rlen);
}
float avg = accumulate (lens.begin(),lens.end(), 0.0)/ lens.size();
cout << "NREADS: " << nreads << " AVG_READLEN: " << avg << " RANGE: " << end-start << endl;
cout << cov << endl;
return 0;
}
``````

Here's the output

Here's my samtools mpileup command

`````` samtools mpileup -A -B -q 10 -Q 0 test.bam | cut -f 4 >tmp
``````

And when I take the average of `tmp` it's 9.52

Initially before I included some of the flags for mpileup that you recommended I got a value of 4.5X. So this is closer to what mpileup reports

Not sure as to why I'm getting different results.

Edit: `samtools depth`

I forgot about samtools depth. It's a strong tool.

``````samtools depth -a -r 1:14990509-15118429 -q 0 -Q 10 test.bam | cut -f 3 >tmp1
``````

The mean of tmp1 is

8.51

Which is the same as the C++ code. So solved! Thanks for helping me learn C++.

2

``````float avg = accumulate (lens.begin(),lens.end(), 0.0)/ lens.size();
``````

why don't you calculate the real coverage at each position of the interval ?

2

I edited the response above.

To answer your question, my goal is to calculate coverage for each chromosome without interrogating each position. As an exercise, I wanted to calculate coverage of an interval (eventually a chromosome) by iterating each read mapped to that region.

Would calculating the coverage for each position be faster than iterating through mapped reads?

3

Why don't you remove `vector<int> lens` and instead have `int total=0;` and `total+=it->Length()` inside the loop and `cov = total*1.0/(end-start+1)`?

2

That's a great idea. I will try that out!