I have been working on a comparison of a human endocrine cancer with matched normal tissue from the same organ. The tissue was dissected by a pathologist and is considered quite pure. We sequenced to 80x depth in the tumor and 40x in the normal. Somatic variant calling was done with Varscan2 (samtools -q 1 to use only mapped reads, and the high confidence set). Indels and SNPs were filtered and annotated; we are looking at coding, non-synonymous mutations and splice site mutations. In our small cohort of 15 patients, we looked for mutations that occur in more than one tumor sample. The resulting list surprised me: about 1/2 the variants are found in dbSNP 1.3.7 and most in 1.3.1. About 1/4 are in the COSMIC database (although some of those are in dbSNP as well). I pressed on to manual verification in IGV.
The surprises continue: many of the sites called as somatic mutations are in fact germline. There are more reads in the tumor, but in the normal the variant allele frequency is often similar or the same. Many of the sites are in dbSNP. Many are in low-coverage regions.
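The manual checks described above (variant allele frequency in the normal, coverage at the site) could be written down roughly as follows. This is only a minimal sketch in Python: the names `SiteCounts`, `vaf` and `classify_site`, and all of the thresholds, are hypothetical placeholders for illustration, not settings taken from Varscan2 or any other caller.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Reference/variant read counts at one site (hypothetical structure)."""
    tumor_ref: int
    tumor_alt: int
    normal_ref: int
    normal_alt: int


def vaf(ref: int, alt: int) -> float:
    """Variant allele frequency; 0.0 when there is no coverage."""
    depth = ref + alt
    return alt / depth if depth else 0.0


def classify_site(c: SiteCounts,
                  min_depth: int = 10,           # assumed threshold
                  max_normal_vaf: float = 0.02,  # assumed threshold
                  min_tumor_vaf: float = 0.10    # assumed threshold
                  ) -> str:
    """Label a called site using the same criteria checked by eye in IGV."""
    tumor_depth = c.tumor_ref + c.tumor_alt
    normal_depth = c.normal_ref + c.normal_alt

    # Too few reads in either sample to say anything about the site.
    if tumor_depth < min_depth or normal_depth < min_depth:
        return "low_coverage"
    # Variant reads in the normal suggest a germline polymorphism.
    if vaf(c.normal_ref, c.normal_alt) > max_normal_vaf:
        return "likely_germline"
    # Clean normal, but barely any variant support in the tumor.
    if vaf(c.tumor_ref, c.tumor_alt) < min_tumor_vaf:
        return "weak_tumor_support"
    return "candidate_somatic"
```

A site with similar allele frequency in both samples, such as `SiteCounts(40, 40, 20, 18)`, comes out as `likely_germline` under these assumed cutoffs, while a site with no variant reads in a well-covered normal remains a `candidate_somatic`.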
I feel like I did what I could to get quality somatic mutations out of Varscan2. I followed the entire GATK pipeline to indel-realign, recalibrate, etc. I used only mapped reads in samtools when piping into Varscan2, and from Varscan2 we used the high confidence set. Yet everything is such a soft call: a 'somatic' mutation in one patient has some reads in the normal, and other samples have the same variant as a polymorphism. I tried to use the preparation and variant calling pipeline to solve this problem, yet here I am at the end, feeling like I'm back to square one. Is there something obvious that I'm missing? (I have also used Mutect, Strelka, GATK, and SomaticSniper, but the number of false positives seemed to be lowest with Varscan2.)
Thank you for the feedback, AOC