VarScan: How is the DP score in a .vcf calculated?
1
1
Entering edit mode
3.4 years ago
James Reeve ▴ 130

I'm trying to understand the formats of different SNP callers. I have made test bam files with a subset of two stickleback WGS samples. I made a vcf using VarScan's commad mpileup2snp.

Here's an example of a single SNP in the vcf;

chrI    1789    .       T       G       .       PASS    ADP=105;WT=1;HET=1;HOM=0;NC=0   GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR    0/0:66:108:108:87:18:17.14%:1.7211E-6:63:74:64:23:15:3  0/1:68:102:102:73:21:22.34%:1.3568E-7:60:56:48:25:19:2


I can understand most of the scores in the #FORMAT field, however I'm confused how DP is calculated. The reference and alternative allele depths are given (RD & AD), but the sum of the two doesn't equal DP. Using the example above;

Sample 1:

• DP = 108
• RD = 87
• AD + RD = 105

Sample 2 :

• DP = 102
• RD = 73
• AD + RD = 94

Does anyone have any ideas why the reported DP doesn't equal AD + RD?

VarScan VCF • 1.5k views
0
Entering edit mode

I don't think it's so simple, have a look at the code, varscan calls a method getReadCounts



  HashMap<String, String> readCounts = VarScan.getReadCounts(refBase, readBases, readQualities, minAvgQual, mapQualities);
(...)

/**
* Counts the number, quality, and strands of each allele from a pileup
*
* @param   refBase     Reference base at this position
* @param   minAvgQual  Integer of minimum required base quality to count a base.
* @return  results     HashMap<String, String> of results for each allele
*/
{
HashMap<String, Integer> readCounts = new HashMap<String, Integer>();
HashMap<String, Integer> readCountsPlus = new HashMap<String, Integer>();
HashMap<String, Integer> readCountsMinus = new HashMap<String, Integer>();
HashMap<String, Integer> qualitySum = new HashMap<String, Integer>();
HashMap<String, Integer> mapQualitySum = new HashMap<String, Integer>();
HashMap<String, String> strandsSeen = new HashMap<String, String>();


(...) `

1
Entering edit mode
3.4 years ago
igor 12k

Different variant have different rules regarding thresholds for counting reads. Additionally, these criteria may vary for different fields. VarScan offers an explanation on their website:

VarScan requires that bases meet the minimum Phred quality score (default 15 for most commands) to count them for things like read counts (reads1, reads2) and to compute variant allele frequency. However, when VarScan reports the depth (such as in the DP field of VCF output), it reports SAMtools raw depth. To get VarScan read counts to more closely match another tool, set use parameter --min-avg-qual 0. And use caution! Low-quality bases, with the occasional exception of BAQ penalties, should not be trusted.

Also, VarScan reports variants on a biallelic basis. That is, for a given SNP call, the "reads1" column is the number of reference-supporting reads (RD), and the "reads2" column is the number of variant-supporting reads (AD). There may be additional reads at that position showing other bases (SNP or indel variants). If these other variants meet the calling criteria, they will be reported in their own line. If not, it may look like you have "missing" reads.