Entering edit mode
                    11.3 years ago
        Jordan
        
    
        ★
    
    1.3k
    Hi,
I'm having trouble understanding read depth in vcf's called by somatic-sniper.
Here is a sample vcf file:
##fileformat=VCFv4.1
##fileDate=20140428
##phasing=none
##reference=human_hg19.fa
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=IGT,Number=1,Type=String,Description="Genotype when called independently (only filled if called in joint prior mode)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total read depth">
##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=BCOUNT,Number=4,Type=Integer,Description="Occurrence count for each base at this site (A,C,G,T)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">
##FORMAT=<ID=JGQ,Number=1,Type=Integer,Description="Joint genotype quality (only filled if called in join prior mode)">
##FORMAT=<ID=VAQ,Number=1,Type=Integer,Description="Variant allele quality">
##FORMAT=<ID=BQ,Number=.,Type=Integer,Description="Average base quality">
##FORMAT=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality across all reads">
##FORMAT=<ID=AMQ,Number=.,Type=Integer,Description="Average mapping quality for each allele present in the genotype">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Variant status relative to non-adjacent Normal, 0=wildtype,1=germline,2=somatic,3=LOH,4=unknown">
##FORMAT=<ID=SSC,Number=1,Type=Integer,Description="Somatic Score">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Normal    Tumor
chr1    871092    .    G    C    .    .    .    GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC    0/0:0/0:6:5,1,0,0:0,0,6,0:45:.:0:27:38:38:0:.    0/1:0/1:12:8,1,6,0:0,5,9,0:25:.:25:18,23:47:45,46:2:22
chr1    892349    .    G    T    .    .    .    GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC    0/0:0/0:10:9,0,4,0:0,0,9,4:30:.:0:19:36:36:0:.    0/1:0/1:29:24,0,17,0:0,0,24,17:19:.:19:17,10:35:36,33:2:16
The depth of the first variant in tumor is 12. But based on the BCOUNT, it has 15 depth (8+1+6+0). Why this discrepancy? Does anyone know?
And how can I accurately calculate the allele frequency in this case.
For the first variant, the alt allele is C. As per the BCOUNT there is just one 1 C. So, is it 1/15 or 1/12?
It is also a bit concerning that C was called a variant considering that it only has 1 read supporting it, where as A has 6 reads supporting it.
Any help on this is much appreciated. Thanks!
I'm a bit baffled why this might be happening. I do think it likely this is a bug. The tumor depth should match the
DP4counts. I do see what I think is a bug in how theBCOUNTis calculated. My best guess forDP4is that there might be some Ns in your reads? Would it be possible to seesamtools mpileuporbam-readcountoutput for these issues?Yes, my first guess was that the reads would have had
Ns in it as well. But shouldn't theDPbe higher thanBCOUNTin that case? AsBCOUNTgives only ACGTs notNs. Though that does not seem to be the case here.My reads do have
Ns in them btw.My guess would be 3 reads with Ns for
871092? Is that right?One reason the
BCOUNTseems to make no sense is that you're actually looking at theDP4counts. It still won't make sense because there are bugs relating to how the counts are generated. Ns are being double counted in all categories and are thus inflating the non-DPcounts. I'll try to get a fix out before the end of the week.I looked at this position. So, here are the results:
So, I think the depth is working right. And like you said, there are 3
Ns. But theBCOUNTis way off.The
BCOUNTshould be:(0, 2, 6, 1).BCOUNTfor your example is(0,5,9,0). There are two bugs going on here.BCOUNTis being limited to only bases seen in called tumor and normal genotypes.Thus, your
Tisn't being counted (it isn't in the called genotypes) and your 3Ns are being added to both yourGs and yourCs.I see. You said you would fix by the end of the week. Can you let me know when u r done? I will probably re-run the tool then on my samples after the update.
I realize it's been about a year, which is sad. In the event that you are still interested, the latest version (1.0.5.0) should fix the bug(s) you discovered.