Question: Why does shoremap calculate allele frequency this way?
0
2.7 years ago by
emilywynn610
emilywynn610 wrote:

Hello everyone,

I'm learning how to analyze genome sequencing data and questions answered on these forums have been an immense help!

Time to ask my own question:

I've used BWA to align my sequences to a reference genome, then samtools mpileup to call variants, then shoremap to analyze the variants. I noticed that the allele frequencies of variants given by shoremap convert are much lower than the allele frequencies I can calculate by hand just looking at the alignments. Looking at the source code for shoremap convert, I found this:

if(miniInfo[0] == "DP4")
{
// like DP4=0,43,0,38 (first two: ref-coverage, last-two: alt coverage)
std::vector<std::string> DP4Info = split_string(miniInfo[1], ',');
ref_cov  = atol(DP4Info[0].c_str())+atol(DP4Info[1].c_str());
alt_cov  = atol(DP4Info[2].c_str())+atol(DP4Info[3].c_str());
iDP      = alt_cov + ref_cov;
necessaryCheck ++;
givenDP4 = true;
}
}
if(givenDP && givenDP4)
{
/* caution: raw AF with error - 2014-05-05 */
iAF     = (double)alt_cov/(double)atol(aDP.c_str());        // caution: in case no coverage, divided by 0
else
iAF     = (double)alt_cov/(double)iDP;
}

DP4 is information about the number of high quality reads and DP is total read depth. Both of these values are calculated by samtools mpileup and included in the shoremap input. Basically shoremap is calculating the AF by dividing the number of high quality alternative alleles by the total depth coverage at that position. This seems strange to me. Wouldn't it be better to divide the number of high quality alternative alleles by the total number of high quality alleles? By including the low quality reads in the denominator, the allele frequencies are lower than what they should be.

There are places in my alignments where every single read is an alternative base compared to the reference genome (AF should be 1) but because some of the reads are low quality, shoremap tells me the AF is 0.7.

Is there a good reason to calculate allele frequency like this? Or should I just redo the analysis with a better denominator?

modified 2.7 years ago • written 2.7 years ago by emilywynn610

Hi @emilywynn6

I have not the answer to your question but I am also interested because I noticed the same results (AF lower than expected) in my data. Have you somehow solved the issue? Have you tried to contact the author of the software? Thanks