Entering edit mode
8.5 years ago
QVINTVS_FABIVS_MAXIMVS
★
2.6k
I'm using the equation
Num_Reads * Avg_Read_Length / Genome Size
To calculate coverage. Here are my questions
Should I only consider mapped bases in the Query to calculate the read length? Specifically,
M,=tags in the CIGAR string? For example, if a read has60M10Sfor a CIGAR string, would the read length be 60 since 10 were not aligned properly?Given 1. why is it that my calculation of coverage differs from
samtools mpileup? Wouldn't the average of column 4 in mpileup output equal the average coverage?Thanks for any help.
Edit: Fixed it :) Thanks for the response. I'm using Bamtools API for C++ (and learning C++). Here's my code
Here's the output
NREADS: 354 AVG_READLEN: 3073.54 RANGE: 127920 8.50557
Here's my samtools mpileup command
And when I take the average of
tmpit's 9.52Initially before I included some of the flags for mpileup that you recommended I got a value of 4.5X. So this is closer to what mpileup reports
Not sure as to why I'm getting different results.
Edit:
samtools depthI forgot about samtools depth. It's a strong tool.
The mean of tmp1 is
8.51
Which is the same as the C++ code. So solved! Thanks for helping me learn C++.
are you sure about this:
why don't you calculate the real coverage at each position of the interval ?
I edited the response above.
To answer your question, my goal is to calculate coverage for each chromosome without interrogating each position. As an exercise, I wanted to calculate coverage of an interval (eventually a chromosome) by iterating each read mapped to that region.
Would calculating the coverage for each position be faster than iterating through mapped reads?
Why don't you remove
vector<int> lensand instead haveint total=0;andtotal+=it->Length()inside the loop andcov = total*1.0/(end-start+1)?That's a great idea. I will try that out!