I have a tumor sample without matches. This sample is most likely impure without a certain ploidy, copy #. I am wondering for such samples I am only filtering based on allele frequency, mapping quality, read depth. But I was a bit uncertain with some options on BCFtools 1.3 (for the new call function that replaced view) that I thought you guys might know.
This is the option I am using
bcftools call -Amv > out.vcf
-A output all alternate alleles present in the alignments even if they do not appear in any of the genotypes. -v output variant sites only. -m alternative model for multiallelic and rare-variant calling designed to overcome known limitations in -c calling model. Is it recommended to just use -m for purpose of simply calling variants? Between using the old bcftool view -vc and the new -m option, I get a lot less variants.
chr1 150908526 . A G 38.3354 . DP=2;VDB=0.1;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=50 GT:PL 1/1:66,6,0 chr1 150910145 . A G 8.79264 . DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,0,1;MQ=50 GT:PL 0/1:36,3,0 chr1 150916393 . T C 27.3777 . DP=5;VDB=0.74;SGB=-0.453602;RPB=0.5;MQB=1;MQSB=1;BQB=0.5;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=2,0,1,1;MQ=50 GT:PL 0/1:60,0,68 chr1 150923826 . C T 8.79264 . DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=50 GT:PL 0/1:36,3,0 chr1 150924171 . A G 8.79264 . DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,0,1;MQ=50 GT:PL 0/1:36,3,0 chr1 150924181 . A G 8.79264 . DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,0,1;MQ=50 GT:PL 0/1:36,3,0 chr1 150924752 . C T 22.6577 . DP=2;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=50 GT:PL 1/1:50,3,0 chr1 150925702 . taaaaaaaaa tAAaaaaaaaaa,tAaaaaaaaaa 5.83735 . INDEL;IDV=3;IMF=0.230769;DP=5;VDB=0.516498;SGB=-0.511536;MQSB=1;MQ0F=0;AC=2,0;AN=2;DP4=1,0,2,1;MQ=50 GT:PL 1/1:32,0,1,30,0,32
Is there a reason why I am not getting GQ info for each read? What worries me is that MQ for all are 50 (?) and MQ0F = 0.
As for novel mutation rate, is this something I should define to improve my variant calling?
-n, --novel-rate float[,…]
likelihood of novel mutation for constrained -C trio calling. The trio genotype calling maximizes likelihood of a particular combination of genotypes for father, mother and the child P(F=i,M=j,C=k) = P(unconstrained) * Pn + P(constrained) * (1-Pn). By providing three values, the mutation rate Pn is set explictly for SNPs, deletions and insertions, respectively. If two values are given, the first is interpreted as the mutation rate of SNPs and the second is used to calculate the mutation rate of indels according to their length as Pn=float*exp(-a-b*len), where a=22.8689, b=0.2994 for insertions and a=21.9313, b=0.2856 for deletions [pubmed:23975140]. If only one value is given, the same mutation rate Pn is used for SNPs and indels.
Man page for new bcftools.
This is the filtering I am using, provided by GATK and slightly tweaked to include only >2 DP.
--filterExpression "DP > 2.0 || QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \