Hi all,
I think GATK is a great toolbox. There are quite a few steps involved, and I was wondering about the impact and importance of joint genotyping, particularly when working with very small sample sizes (around 10-15 samples). I read that it can lead to a loss of unique SNPs, which we would be particularly interested in. Does anyone have experience with this and can tell me about the effects on the outcome? In quantitative terms: do I gain many more overlapping SNPs, do I lose a lot of unique SNPs, or is the impact negligible?
Thanks for your input!
I don't believe that it's important, and I am unsure how it is any better than simply merging multiple VCFs together with BCFtools, provided, of course, that you have filtered your reads well prior to variant calling and set specific rules for the merge. With BCFtools, there are never issues with private ('unique') variants.
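As a rough sketch of what I mean (file names are placeholders, and you would adapt this to your own filtering rules), the merge could look like:

```bash
# Per-sample VCFs must be bgzipped and indexed before merging
bgzip sample1.vcf && bcftools index sample1.vcf.gz
bgzip sample2.vcf && bcftools index sample2.vcf.gz

# Merge into a multi-sample VCF; -0 reports genotypes missing from a
# sample as 0/0 (reference) instead of ./. (missing)
bcftools merge -0 -Oz -o merged.vcf.gz sample1.vcf.gz sample2.vcf.gz
bcftools index merged.vcf.gz
```

Private variants then simply show up as sites where only one sample carries a non-reference genotype.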
If you have 100 samples with a heterozygous SNP at 2x coverage, you are unlikely to call it in any of them individually. It's much more likely to be called if you call them together.
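For reference, this is roughly what the GVCF-based joint-genotyping workflow looks like in GATK4 (paths and sample names are placeholders):

```bash
# 1) Call each sample in GVCF mode, which records evidence at every
#    site rather than only at confidently called variants
gatk HaplotypeCaller -R ref.fasta -I sample1.bam -O sample1.g.vcf.gz -ERC GVCF
gatk HaplotypeCaller -R ref.fasta -I sample2.bam -O sample2.g.vcf.gz -ERC GVCF

# 2) Combine the per-sample GVCFs into a single cohort GVCF
gatk CombineGVCFs -R ref.fasta \
    -V sample1.g.vcf.gz -V sample2.g.vcf.gz \
    -O cohort.g.vcf.gz

# 3) Joint genotyping: evidence is pooled across all samples, which is
#    what can rescue a weakly supported heterozygote in any one sample
gatk GenotypeGVCFs -R ref.fasta -V cohort.g.vcf.gz -O cohort.vcf.gz
```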
Why would you be calling variants at a read depth of 2? You'd be raising your chances of making false-positive calls elsewhere.
That was an extreme example to illustrate a point. On the other end of the spectrum, if you have sufficiently high coverage, all methods will converge towards the truth.
The problem is that you don't always know where you are on that spectrum, or you already have the data and have to work with what you have.
It's a common misconception that higher coverage alone will improve quality. Even at 1000x, we still miss many true variants and have to resort to running the sample multiple times just to find everything. High coverage brings a different set of problems than low coverage, but I'd obviously still prefer to have it.
That's certainly true. I get around this by generating multiple subsets of each aligned BAM and calling variants independently on each subset. At the end, I come up with a consensus list, roughly as sketched below.
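A minimal sketch of that approach (the subset fractions, seeds, and the choice of bcftools for calling are just for illustration):

```bash
# Three ~50% subsets of the same BAM, drawn with different seeds
# (samtools -s takes SEED.FRACTION: integer part = seed, decimals = fraction)
for seed in 1 2 3; do
    samtools view -b -s ${seed}.5 -o sub${seed}.bam aligned.bam
    samtools index sub${seed}.bam
    bcftools mpileup -f ref.fasta sub${seed}.bam \
        | bcftools call -mv -Oz -o sub${seed}.vcf.gz
    bcftools index sub${seed}.vcf.gz
done

# Consensus list: keep only sites called in all three subsets,
# writing the records from the first file (-w1)
bcftools isec -n=3 -w1 -Oz -o consensus.vcf.gz \
    sub1.vcf.gz sub2.vcf.gz sub3.vcf.gz
```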
I've worked on multiple clinical panels, so I've looked at the known variants many times over and never ran into that problem. Of course, everyone's experience will vary based on many factors. Coverage is not everything.
If you'd like a more official opinion, here is a citation based on the MSKCC panel:
Thanks for that reference!
Thanks for your input. May I ask how you would recommend filtering the reads? (I performed adapter and quality trimming before mapping and marked duplicates before variant calling.) May I also ask what specific merging rules you use? I think BCFtools would be a great option for me, since I am really mainly interested in the singletons.
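For the singletons, something like this is what I have in mind (assuming a merged, indexed multi-sample VCF; the allele-count thresholds are just the obvious first pass):

```bash
# Keep sites where the non-reference allele is observed exactly once
# across all samples (-c/-C are short for --min-ac/--max-ac)
bcftools view -c 1 -C 1 -Oz -o singletons.vcf.gz merged.vcf.gz
bcftools index singletons.vcf.gz
```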