Calculating Coverage Using CIGAR String
1
4
Entering edit mode
5.6 years ago

I'm using the equation

Num_Reads * Avg_Read_Length / Genome Size


To calculate coverage. Here are my questions

1. Should I only consider mapped bases in the Query to calculate the read length? Specifically, M,= tags in the CIGAR string? For example, if a read has 60M10S for a CIGAR string, would the read length be 60 since 10 were not aligned properly?

2. Given 1. why is it that my calculation of coverage differs from samtools mpileup? Wouldn't the average of column 4 in mpileup output equal the average coverage?

Thanks for any help.

samtools bamtools coverage • 2.6k views
7
Entering edit mode
5.6 years ago

1) yes M , = and even X

2) yes , ignore the clip

3) did you consider some options like:

  -Q, --min-BQ INT        skip bases with baseQ/BAQ smaller than INT [13]
[UNMAP,SECONDARY,QCFAIL,DUP]


what's the output of samtools depth

2
Entering edit mode

Edit: Fixed it :) Thanks for the response. I'm using Bamtools API for C++ (and learning C++). Here's my code

#include "api/BamMultiReader.h"
#include "api/BamWriter.h"
#include <string>
#include <vector>
#include <iostream>
#include <numeric>
using namespace BamTools;
using namespace std;
int main(int argc, char *argv[])
{
vector<string> inBAM;
string ifh = string(argv[1]);
inBAM.push_back(ifh);
cerr << "ERROR: " << ifh << " could not be opened!" << endl;
return 1;
}
BamAlignment al;
const int start=14990508; // This is the first position in my sample given mpileup (converted to 0 base)
const int end=15118428; // This is the last position in my sample given mpileup
double cov=0;
vector<int> lens;
if(al.MapQuality < 10 || al.IsDuplicate()==true || al.IsFailedQC()==true || al.IsMapped()==false || al.IsPrimaryAlignment()==false)
{ continue; }
int rlen=0;
vector<CigarOp> cigar = al.CigarData;
for(vector<CigarOp>::iterator it= cigar.begin(); it != cigar.end(); ++it)
{
if (it->Type == 'M' || it->Type == '=' || it->Type== 'X') {
rlen+=it->Length;
}
}
lens.push_back(rlen);
}
float avg = accumulate (lens.begin(),lens.end(), 0.0)/ lens.size();
cout << "NREADS: " << nreads << " AVG_READLEN: " << avg << " RANGE: " << end-start << endl;
cout << cov << endl;
return 0;
}


Here's the output

Here's my samtools mpileup command

 samtools mpileup -A -B -q 10 -Q 0 test.bam | cut -f 4 >tmp


And when I take the average of tmp it's 9.52

Initially before I included some of the flags for mpileup that you recommended I got a value of 4.5X. So this is closer to what mpileup reports

Not sure as to why I'm getting different results.

Edit: samtools depth

I forgot about samtools depth. It's a strong tool.

samtools depth -a -r 1:14990509-15118429 -q 0 -Q 10 test.bam | cut -f 3 >tmp1


The mean of tmp1 is

8.51

Which is the same as the C++ code. So solved! Thanks for helping me learn C++.

2
Entering edit mode

float avg = accumulate (lens.begin(),lens.end(), 0.0)/ lens.size();


why don't you calculate the real coverage at each position of the interval ?

2
Entering edit mode

I edited the response above.

To answer your question, my goal is to calculate coverage for each chromosome without interrogating each position. As an exercise, I wanted to calculate coverage of an interval (eventually a chromosome) by iterating each read mapped to that region.

Would calculating the coverage for each position be faster than iterating through mapped reads?

3
Entering edit mode

Why don't you remove vector<int> lens and instead have int total=0; and total+=it->Length() inside the loop and cov = total*1.0/(end-start+1)?

2
Entering edit mode

That's a great idea. I will try that out!