I am doing germline variant calling on TCGA data, and I have started noticing something strange.
As a test I did the following: I downloaded one tumor/normal genome bam file pair. First I ran variant calling using Strelka (starting from the alignment file and using the various TCGA reference files) and noticed that the distribution of allelic fractions did not look right. It looked skewed (see right figure below, WGS example) and did not reflect the bimodal homozygous/heterozygous distribution that I would expect: homozygous variants should give one big peak at 1, and heterozygous variants a distribution centered around 0.5 (see left figure below). I thought I had done something wrong, so I converted the bam to fastq and ran the analysis from scratch, but got the same thing. Below is a figure of the distribution that I would expect, and have observed in other projects, next to what I am seeing on a TCGA tumor/normal pair.
Can you explain? Any advice appreciated.
How did you run variant calling with Strelka (Germline or Somatic mode)? Did you run variant calling for both samples (tumour and normal) together, or each sample separately? If you are performing somatic variant calling, then you would not expect the variant frequencies to conform to your model example.
Hello, thank you very much for your response. The error was actually in how I am calling Strelka: I ran HaplotypeCaller on the same data and got the correct distribution. I am running Strelka through Sarek, so I will check how it is being invoked there.
By the way, this problem is solved. The issue was that the TCGA data is WXS, and I was missing the appropriate parameter (--wes) for Strelka. Once I added it, we saw the expected distribution.
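In case it helps anyone checking the same thing, here is a minimal sketch of how the allelic fraction distribution discussed above can be computed directly from a VCF. It assumes each record carries an AD (allelic depths) entry in the FORMAT column and only looks at biallelic sites; the function names and the tiny example records are made up for illustration and are not from the TCGA data.

```python
def allele_fractions(vcf_lines, sample_index=0):
    """Return alt-allele fractions for biallelic records that have an AD field.

    Assumption: AD is present in FORMAT as "ref_depth,alt_depth".
    Records without usable AD values are skipped.
    """
    fractions = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample columns
        alt = fields[4]
        if alt == "." or "," in alt:
            continue  # skip non-variant and multi-allelic sites
        keys = fields[8].split(":")
        values = fields[9 + sample_index].split(":")
        if "AD" not in keys or keys.index("AD") >= len(values):
            continue
        try:
            ref_depth, alt_depth = (int(x) for x in values[keys.index("AD")].split(","))
        except ValueError:
            continue  # missing ("." ) or malformed depths
        total = ref_depth + alt_depth
        if total > 0:
            fractions.append(alt_depth / total)
    return fractions


def histogram(fractions, bins=10):
    """Count fractions in equal-width bins over [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * bins
    for f in fractions:
        counts[min(int(f * bins), bins - 1)] += 1
    return counts


# Hypothetical records, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "\t".join(["chr1", "100", ".", "A", "G", "50", "PASS", ".", "GT:AD", "0/1:10,10"]),
    "\t".join(["chr1", "200", ".", "C", "T", "50", "PASS", ".", "GT:AD", "1/1:0,20"]),
    "\t".join(["chr1", "300", ".", "G", "A,C", "50", "PASS", ".", "GT:AD", "1/2:0,9,11"]),
    "\t".join(["chr1", "400", ".", "T", "C", "50", "PASS", ".", "GT:AD", "0/1:.,."]),
]

if __name__ == "__main__":
    fractions = allele_fractions(example_vcf)
    print(fractions)
    print(histogram(fractions))
```

On a real germline call set you would expect the histogram counts to pile up around the 0.5 bin and the last bin; a skewed shape like the one in the question suggests that something in the calling setup is off.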