Question: Calculating Coverage Using CIGAR String
4
gravatar for QVINTVS_FABIVS_MAXIMVS
3 months ago by
USA SoCal
QVINTVS_FABIVS_MAXIMVS1.9k wrote:

I'm using the equation

Num_Reads * Avg_Read_Length / Genome Size

To calculate coverage. Here are my questions

  1. Should I only consider mapped bases in the Query to calculate the read length? Specifically, M,= tags in the CIGAR string? For example, if a read has 60M10S for a CIGAR string, would the read length be 60 since 10 were not aligned properly?

  2. Given 1. why is it that my calculation of coverage differs from samtools mpileup? Wouldn't the average of column 4 in mpileup output equal the average coverage?

    Thanks for any help.

bamtools coverage samtools • 242 views
ADD COMMENTlink modified 3 months ago • written 3 months ago by QVINTVS_FABIVS_MAXIMVS1.9k
7
gravatar for Pierre Lindenbaum
3 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum96k wrote:

1) yes M , = and even X

2) yes , ignore the clip

3) did you consider some options like:

  -Q, --min-BQ INT        skip bases with baseQ/BAQ smaller than INT [13]
  --ff, --excl-flags STR|INT  filter flags: skip reads with mask bits set
                                            [UNMAP,SECONDARY,QCFAIL,DUP]

what's the output of samtools depth

ADD COMMENTlink modified 3 months ago • written 3 months ago by Pierre Lindenbaum96k
2

Edit: Fixed it :) Thanks for the response. I'm using Bamtools API for C++ (and learning C++). Here's my code

#include "api/BamMultiReader.h"
#include "api/BamWriter.h"
#include <string>
#include <vector>
#include <iostream>
#include <numeric>
using namespace BamTools;
using namespace std;
int main(int argc, char *argv[])
{
    vector<string> inBAM;
    BamMultiReader reader;
    string ifh = string(argv[1]);
    inBAM.push_back(ifh);
    if (!reader.Open(inBAM)){
            cerr << "ERROR: " << ifh << " could not be opened!" << endl;
            return 1;
    }
    BamAlignment al;
    const int start=14990508; // This is the first position in my sample given mpileup (converted to 0 base)
    const int end=15118428; // This is the last position in my sample given mpileup
    double cov=0;
    int nreads=0;
    vector<int> lens;
    while (reader.GetNextAlignmentCore(al)){
            if(al.MapQuality < 10 || al.IsDuplicate()==true || al.IsFailedQC()==true || al.IsMapped()==false || al.IsPrimaryAlignment()==false)
            { continue; }
            int rlen=0;
            vector<CigarOp> cigar = al.CigarData;
            for(vector<CigarOp>::iterator it= cigar.begin(); it != cigar.end(); ++it)
            {
                    if (it->Type == 'M' || it->Type == '=' || it->Type== 'X') {
                            rlen+=it->Length;
                    }
            }
            if(rlen!=0){ ++nreads; }
            lens.push_back(rlen);
    }
    float avg = accumulate (lens.begin(),lens.end(), 0.0)/ lens.size();
    cov = (nreads*avg)/(end-start);
    cout << "NREADS: " << nreads << " AVG_READLEN: " << avg << " RANGE: " << end-start << endl;
    cout << cov << endl;
    return 0;
}

Here's the output

NREADS: 354 AVG_READLEN: 3073.54 RANGE: 127920 8.50557

Here's my samtools mpileup command

 samtools mpileup -A -B -q 10 -Q 0 test.bam | cut -f 4 >tmp

And when I take the average of tmp it's 9.52

Initially before I included some of the flags for mpileup that you recommended I got a value of 4.5X. So this is closer to what mpileup reports

Not sure as to why I'm getting different results.

Edit: samtools depth

I forgot about samtools depth. It's a strong tool.

samtools depth -a -r 1:14990509-15118429 -q 0 -Q 10 test.bam | cut -f 3 >tmp1

The mean of tmp1 is

8.51

Which is the same as the C++ code. So solved! Thanks for helping me learn C++.

ADD REPLYlink modified 3 months ago • written 3 months ago by QVINTVS_FABIVS_MAXIMVS1.9k
2

are you sure about this:

float avg = accumulate (lens.begin(),lens.end(), 0.0)/ lens.size();
cov = (nreads*avg)/(end-start);

why don't you calculate the real coverage at each position of the interval ?

ADD REPLYlink written 3 months ago by Pierre Lindenbaum96k
2

I edited the response above.

To answer your question, my goal is to calculate coverage for each chromosome without interrogating each position. As an exercise, I wanted to calculate coverage of an interval (eventually a chromosome) by iterating each read mapped to that region.

Would calculating the coverage for each position be faster than iterating through mapped reads?

ADD REPLYlink modified 3 months ago • written 3 months ago by QVINTVS_FABIVS_MAXIMVS1.9k
3

Why don't you remove vector<int> lens and instead have int total=0; and total+=it->Length() inside the loop and cov = total*1.0/(end-start+1)?

ADD REPLYlink written 3 months ago by rkostadi60
2

That's a great idea. I will try that out!

ADD REPLYlink written 3 months ago by QVINTVS_FABIVS_MAXIMVS1.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 606 users visited in the last hour