CY · 6.5 years ago
Most SNP callers, such as HaplotypeCaller, apply Bayesian methods to call SNPs. HaplotypeCaller uses the 1KG / dbSNP data sets to inform its prior probability.
This got me thinking about what prior 1KG itself used. It turns out that 1KG set 0.001 per base as its prior, which seems reasonable given that pre-1KG data showed the average SNP rate to be about 0.001 per base.
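To make the role of that prior concrete, here is a minimal sketch of Bayesian diploid genotype calling at a single site, assuming independent reads, a fixed per-read error rate, and a heterozygosity prior of theta = 0.001. The function name and the simplified prior split are my own illustration, not HaplotypeCaller's actual model.

```python
import math

# Hypothetical illustration (not HaplotypeCaller's real model): a minimal
# diploid Bayesian genotype caller for one site, assuming independent reads,
# a per-read error rate, and a heterozygosity prior theta = 0.001.

def genotype_posteriors(ref_reads, alt_reads, error=0.01, theta=0.001):
    """Return posterior probabilities for genotypes 0/0, 0/1, 1/1."""
    # Prior: P(het) ~ theta, P(hom-alt) ~ theta/2, remainder is hom-ref
    # (a common simplification; real callers derive this from population genetics).
    priors = {"0/0": 1.0 - 1.5 * theta, "0/1": theta, "1/1": theta / 2.0}

    # Per-read probability of observing an ALT base under each genotype.
    p_alt = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}

    log_post = {}
    for gt, prior in priors.items():
        p = p_alt[gt]
        # Binomial likelihood of the observed read counts, in log space.
        log_lik = alt_reads * math.log(p) + ref_reads * math.log(1.0 - p)
        log_post[gt] = math.log(prior) + log_lik

    # Normalize log posteriors into probabilities.
    m = max(log_post.values())
    total = sum(math.exp(v - m) for v in log_post.values())
    return {gt: math.exp(v - m) / total for gt, v in log_post.items()}

# Example: 12 ref reads and 8 alt reads at one site.
print(genotype_posteriors(12, 8))
```

Note how small the prior's influence is once depth is reasonable: the likelihood of 8 alt reads out of 20 overwhelms the 0.001 prior, which is the whole point of setting it from the genome-wide SNP rate.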
The question remains: what did projects before 1KG use as a prior? Or did those early projects not use Bayesian methods at all? If not, what methods did they use?
Prior to HaplotypeCaller, everyone was using UnifiedGenotyper with the GATK, which obviously behaves differently from HaplotypeCaller. This was back when the only large population-based dataset available was the International HapMap 270. I even ordered it on multiple CDs, which are still sitting in an office in the UK.
Back then, I don't recall many other variant callers. SAMtools was certainly around.
Your interest in the prior probability matches my own, but I have not done much work in this particular area. I believe, nevertheless, that the prior probabilities going into each variant call are strongly biased by read depth, and that these probabilities are also responsible for the clear-cut variants that are sometimes missed by the GATK. This is why I believe the GATK team needs to do more work on the influence of downsampling (read depth) and how it affects variant calling. At the moment, as far as I am aware, all variant callers 'randomly' downsample to a read depth of 1000 or 500 without considering how this may affect calling.
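As a rough illustration of why this worries me, here is a sketch of random downsampling of a deep pileup. The cap of 500, the simulated reference bias, and the function names are assumptions for illustration, not any particular caller's actual behavior.

```python
import random

# Sketch of the 'random' downsampling described above. The cap of 500 is an
# assumption for illustration; real callers expose this as a parameter.

def downsample(reads, cap=500, seed=None):
    """Randomly keep at most `cap` reads from a pileup."""
    rng = random.Random(seed)
    if len(reads) <= cap:
        return reads
    return rng.sample(reads, cap)

# Simulate a deep pileup: 2000 reads at a true het site, with alt reads
# slightly under-represented (~42%), e.g. due to reference bias.
rng = random.Random(42)
pileup = ["alt" if rng.random() < 0.42 else "ref" for _ in range(2000)]

full_frac = pileup.count("alt") / len(pileup)
for seed in range(3):
    kept = downsample(pileup, cap=500, seed=seed)
    frac = kept.count("alt") / len(kept)
    print(f"seed={seed}: alt fraction {frac:.3f} (full pileup: {full_frac:.3f})")

# Different random subsets give different allele fractions, so a site near a
# caller's decision boundary can flip between called and missed depending on
# which reads happen to survive downsampling.
```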
I can see a reply in my feed but not here (?). You can read more on what I mean about read depth here: C: Lack of consensus between NGS & Sanger sequencing on indels/mutations