Can someone explain to me what the differences are, in theory and in practice, between somatic and germline variant calling? Or point me to some papers that explain the difference?
I am used to calling variants on multiple individuals from a species using GATK. Apparently you can't just run multisample GATK variant calling to call somatic variants across multiple samples (not just tumor/normal pairs). Why not?
To rehash/expand on what Dan said, if you're sequencing normal tissue, you generally expect the allele fraction at single-nucleotide variant sites to fall into one of three bins: 0%, 50%, or 100%, depending on whether they're heterozygous or homozygous.
With tumors, you have to deal with a whole host of other factors:
Normal admixture in the tumor sample: lowers variant allele fraction (VAF)
Tumor admixture in the normal: this occurs when adjacent normal tissue is used, or in hematological cancers, when there is some blood in the skin normal sample
Subclonal variants, which may occur in any fraction of the cells, meaning that your het-site VAF might be anywhere from 50% down to sub-1%, depending on the tumor's clonal architecture and the sensitivity of your method
Copy number variants, cn-neutral loss of heterozygosity, or ploidy changes, all of which again shift the expected distribution of variant fractions
These, and other factors, make calling somatic variants difficult and still an area that is being heavily researched. If someone tells you that somatic variant calling is a solved problem, they probably have never tried to call somatic variants.
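To put rough numbers on the factors listed above, here is a minimal sketch of the usual back-of-the-envelope formula for the expected VAF of a somatic SNV in a bulk sample. The function name and parameters (purity, ccf, mutated_copies, tumor_cn, normal_cn) are my own illustrative choices, not from any particular caller, and it assumes every tumor cell has the same copy number at the locus, which real tumors often violate.

```python
def expected_vaf(purity, ccf=1.0, mutated_copies=1, tumor_cn=2, normal_cn=2):
    """Expected variant allele fraction of a somatic SNV in a bulk sample.

    purity         -- fraction of cells in the sample that are tumor cells
    ccf            -- fraction of tumor cells carrying the variant (1.0 = clonal)
    mutated_copies -- copies of the variant allele in each carrying cell
    tumor_cn       -- total copy number at the locus in tumor cells (assumed uniform)
    normal_cn      -- total copy number at the locus in normal cells
    """
    if not 0.0 <= purity <= 1.0 or not 0.0 <= ccf <= 1.0:
        raise ValueError("purity and ccf must be within [0, 1]")
    # Variant-carrying alleles, averaged over all cells in the sample.
    variant_alleles = purity * ccf * mutated_copies
    # All alleles at the locus, tumor and normal cells combined.
    total_alleles = purity * tumor_cn + (1.0 - purity) * normal_cn
    if total_alleles == 0:
        raise ValueError("no copies of the locus in the sample")
    return variant_alleles / total_alleles


# Clonal het variant in a pure, diploid tumor: the familiar 50%.
print(expected_vaf(purity=1.0))
# Same variant with 40% normal admixture.
print(expected_vaf(purity=0.6))
# Subclonal variant present in 10% of tumor cells, same purity.
print(expected_vaf(purity=0.6, ccf=0.1))
# Copy-neutral LOH: both copies in the tumor carry the variant.
print(expected_vaf(purity=0.6, mutated_copies=2))
# Variant on one copy of a locus amplified to four copies.
print(expected_vaf(purity=0.6, tumor_cn=4))
```

Running it shows a "het" somatic variant landing pretty much anywhere between a few percent and well above 50%, which is why the three-bin expectation from germline calling doesn't carry over.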
A germline variant caller generally has a ploidy-based genotyping step built into its algorithm/pipeline. IIRC, the GATK UnifiedGenotyper, for instance, does both variant calling and genotype calling. So to call a genotype for a variant, it expects a certain fraction of reads to support the alternative allele. When working with somatic variants, all of the assumptions about how many variant-supporting reads you expect at a position, which are what let you distinguish true positives from false positives, are no longer valid. Except for mutations that are fixed throughout the tumor population, only some proportion of cells will carry a given somatic variant. You also typically have some contamination from normal, non-cancerous cells. Add in complications from significant genomic instability, with lots of copy number variation and such, and you need a major change in your model for calling variation while minimizing artifactual calls. So a host of other programs have been developed specifically for looking at somatic variation in tumor samples.
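To illustrate the point about ploidy-based genotyping, here is a toy sketch of diploid genotype likelihoods under a simple binomial read model. This is not GATK's actual model; the function name, the fixed 1% error rate, and the read counts are assumptions made up for the example.

```python
from math import comb, log10


def diploid_genotype_log10_likelihoods(alt_reads, depth, error_rate=0.01):
    """log10 P(read counts | genotype) for 0, 1, or 2 alt copies (toy binomial model)."""
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must be between 0 and depth")
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must be in (0, 0.5)")
    likelihoods = {}
    for alt_copies, name in enumerate(("hom_ref", "het", "hom_alt")):
        allele_fraction = alt_copies / 2  # diploid: only 0, 0.5, or 1 are possible
        # Chance that one read shows the alt allele, allowing for sequencing error.
        p_alt = allele_fraction * (1 - error_rate) + (1 - allele_fraction) * error_rate
        likelihoods[name] = (
            log10(comb(depth, alt_reads))
            + alt_reads * log10(p_alt)
            + (depth - alt_reads) * log10(1 - p_alt)
        )
    return likelihoods


# A germline-looking het site: 48 alt reads out of 100.
germline_like = diploid_genotype_log10_likelihoods(48, 100)
print(max(germline_like, key=germline_like.get), germline_like)

# A subclonal somatic variant: 5 alt reads out of 100.
somatic_like = diploid_genotype_log10_likelihoods(5, 100)
print(max(somatic_like, key=somatic_like.get), somatic_like)
```

With only three allowed allele fractions, the 5/100 site is best explained as homozygous reference plus sequencing error, and the het genotype comes out many orders of magnitude less likely. A model like this will tend to throw away exactly the low-VAF variants you care about in a tumor, which is why somatic callers model the allele fraction differently.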
Sounds like somatic/tumor variant calling is something that will be solved by improvements on the wet lab side (single-cell selection/amplification/sequencing) rather than on the computational side.
Well, single cell has a role to play (and would have more of one if WGA wasn't so lossy), but realistically, you can't sequence billions of cells from a tumor individually. Bulk sequencing still is going to have a role for quite a while.
Hell, germline calling isn't even a solved problem. You still get lots of false positives (and false negatives). It just tends to work so well that it is hard to improve it much except by making it faster, less memory intensive, etc.
"Solved" was the wrong word. I just meant improved. There is only so much you can do on the computational side. The wet lab also has its part to play.
Sorry, my comment wasn't a knock against what you posted at all. Just reiterating that for all of the vast improvements made for germ line calling it is still a difficult problem with lots of improvement to be made, and somatic variant calling is even tougher. Your post was excellent.
What do you mean by 'three bins: 0%, 50%, or 100%'? Thanks
Either you're going to have 0/2 copies, 1/2 copies, or 2/2 copies of that allele.