Get both somatic and germline mutations from tumor only sample?
6 months ago
wormball ▴ 10

Hello!

I have some dumb questions again.

I am trying to call mutations from tumor-only human samples (enriched with a gene panel) sequenced on Illumina. I am following this tutorial: https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2 .

Can I distinguish between somatic and germline mutations in this case? I know that I need matched normal samples to be perfectly sure, but Mutect2 says it can tell the difference. How confident can I be in the predictions of Mutect2/FilterMutectCalls? And what exactly should I look at in the output VCF file? I suspect that germline mutations are those with "germline" or "panel_of_normals" (and no other words) in the "FILTER" column.

Also, I do not understand the difference between the panel of normals and the allele frequency (--germline-resource) files (besides the fact that one of them contains allele frequencies), or why we should use both. I use somatic-hg38_1000g_pon.hg38.vcf.gz and somatic-hg38_af-only-gnomad.hg38.vcf.gz provided by GATK. The former has 50839 lines and the latter has 11861598 lines. These 50839 mutations do not always have a large AF, and not all mutations with a large AF are present in the panel of normals. Why can't we use mutations with frequencies above some threshold as the panel of normals?

germline gatk somatic mutect2 gnomad • 497 views
6 months ago
Shred ★ 1.0k

Mutect2 outputs two tags in the INFO column of the VCF (GERMQ, TLOD):

##INFO=<ID=GERMQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not germline variants">
##INFO=<ID=TLOD,Number=A,Type=Float,Description="Log 10 likelihood ratio score of variant existing versus not existing">
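
To make these two tags concrete, here is a minimal Python sketch that pulls GERMQ and TLOD out of an INFO string and converts GERMQ back to a probability. It assumes the usual Phred convention (Q = -10 * log10(P)), so that a "quality that alt alleles are not germline" of Q corresponds to a germline probability of roughly 10^(-Q/10). The helper names and the INFO string are made up for illustration and are not part of GATK.

```python
def parse_info(info_field):
    """Parse a VCF INFO string such as 'DP=100;GERMQ=30;TLOD=12.5' into a dict."""
    parsed = {}
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        # Flag-type INFO entries carry no '=' and no value.
        parsed[key] = value if sep else True
    return parsed


def phred_to_probability(phred):
    """Convert a Phred-scaled quality Q to the probability 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Hypothetical INFO field, invented for this example only.
info = parse_info("DP=120;GERMQ=30;TLOD=15.2")

germq = int(info["GERMQ"])
# TLOD is declared Number=A, so there is one value per alt allele.
tlod = [float(value) for value in info["TLOD"].split(",")]
# Under the Phred assumption, this approximates P(alt allele is germline).
p_germline = phred_to_probability(germq)

print(germq, tlod, p_germline)
```

Read this way, a high GERMQ means the call is unlikely to be germline, and a GERMQ near zero means it probably is.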


According to the manual, FilterMutectCalls adjusts these thresholds dynamically, so the user does not have to set a hard threshold. Below are all the possible FILTER tags you could see for a mutation (from the VCF header):

##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=FAIL,Description="Fail the site if all alleles fail but for different reasons.">
##FILTER=<ID=base_qual,Description="alt median base quality">
##FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">
##FILTER=<ID=contamination,Description="contamination">
##FILTER=<ID=duplicate,Description="evidence for alt allele is overrepresented by apparent duplicates">
##FILTER=<ID=fragment,Description="abs(ref - alt) median fragment length">
##FILTER=<ID=germline,Description="Evidence indicates this site is germline, not somatic">
##FILTER=<ID=haplotype,Description="Variant near filtered variant on same haplotype.">
##FILTER=<ID=low_allele_frac,Description="Allele fraction is below specified threshold">
##FILTER=<ID=map_qual,Description="ref - alt median mapping quality">
##FILTER=<ID=multiallelic,Description="Site filtered because too many alt alleles pass tumor LOD">
##FILTER=<ID=n_ratio,Description="Ratio of N to alt exceeds specified ratio">
##FILTER=<ID=normal_artifact,Description="artifact_in_normal">
##FILTER=<ID=orientation,Description="orientation bias detected by the orientation bias mixture model">
##FILTER=<ID=panel_of_normals,Description="Blacklisted site in panel of normals">
##FILTER=<ID=position,Description="median distance of alt variants from end of reads">
##FILTER=<ID=possible_numt,Description="Allele depth is below expected coverage of NuMT in autosome">
##FILTER=<ID=slippage,Description="Site filtered due to contraction of short tandem repeat region">
##FILTER=<ID=strand_bias,Description="Evidence for alt allele comes from one read direction only">
##FILTER=<ID=strict_strand,Description="Evidence for alt allele is not represented in both directions">
##FILTER=<ID=weak_evidence,Description="Mutation does not meet likelihood threshold">
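
As a rough way to act on these labels, the Python sketch below sorts records by their FILTER column: PASS records are treated as candidate somatic calls, and records whose only labels are germline and/or panel_of_normals are set aside as likely germline or recurrent panel-of-normals sites. This is only a heuristic built on the idea from the question, not something the GATK documentation prescribes, and a site flagged only by panel_of_normals may be a technical artifact instead of a germline variant. The category names and the toy records are invented for illustration.

```python
# FILTER labels that point at germline variation or panel-of-normals sites.
GERMLINE_LIKE = {"germline", "panel_of_normals"}


def classify_filter(filter_field):
    """Map a FILTER value (labels separated by ';') to a coarse category."""
    labels = set(filter_field.split(";"))
    if labels == {"PASS"}:
        return "candidate_somatic"
    if labels == {"."}:
        # '.' means no filtering has been applied to this record yet.
        return "not_filtered_yet"
    if labels <= GERMLINE_LIKE:
        return "germline_or_pon_only"
    return "other_filtered"


def classify_vcf_lines(lines):
    """Return (chrom, pos, ref, alt, category) for each data line of a VCF."""
    results = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        cols = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, filt = cols[0], cols[1], cols[3], cols[4], cols[6]
        results.append((chrom, int(pos), ref, alt, classify_filter(filt)))
    return results


# Toy records, made up for this example; not real calls.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tT\t.\tPASS\tGERMQ=93;TLOD=25.1",
    "chr1\t2000\t.\tG\tC\t.\tgermline\tGERMQ=1;TLOD=310.4",
    "chr2\t3000\t.\tC\tT\t.\tgermline;panel_of_normals\tGERMQ=1;TLOD=150.0",
    "chr2\t4000\t.\tT\tG\t.\tgermline;strand_bias\tGERMQ=5;TLOD=4.2",
]

for record in classify_vcf_lines(toy_vcf):
    print(record)
```

Records that carry an additional artifact label (such as the last toy record) are kept apart on purpose, because there the evidence for the variant itself is in doubt.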


The panel of normals is used to detect possible sources of technical bias, that is, false-positive somatic calls that recur across a panel of healthy samples. A germline resource file, on the other hand, supplies the population allele frequency, reported as POPAF in the INFO column of the VCF, which is used when deciding whether a variant is germline (this is what GERMQ summarizes; TLOD, by contrast, scores the evidence that the variant exists at all). If a mutation has no associated population allele frequency, Mutect2 uses a user-defined one instead (--af-of-alleles-not-in-resource). More can be found in the Mutect2 manual.
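
To illustrate why an allele-frequency cutoff is not a substitute for the panel of normals, here is a small Python sketch that compares the two as sets of sites. It assumes both resources have already been reduced to (chrom, pos, ref, alt) keys; the function name, the threshold, and the toy sites are all hypothetical.

```python
def compare_pon_with_af_threshold(pon_sites, population_af, af_threshold):
    """Split sites into those in both sets, PoN only, and common-AF only."""
    common = {site for site, af in population_af.items() if af >= af_threshold}
    return {
        "in_both": pon_sites & common,
        "pon_only": pon_sites - common,      # recurrent in normals, but rare or absent in the population
        "common_only": common - pon_sites,   # common in the population, but not in the PoN
    }


# Toy data, invented for illustration; not taken from the real resource files.
pon_sites = {
    ("chr1", 1000, "A", "T"),
    ("chr1", 2000, "G", "C"),
}
population_af = {
    ("chr1", 2000, "G", "C"): 0.30,
    ("chr2", 3000, "C", "T"): 0.45,
    ("chr1", 1000, "A", "T"): 0.0001,
}

result = compare_pon_with_af_threshold(pon_sites, population_af, af_threshold=0.05)
for name, sites in result.items():
    print(name, sorted(sites))
```

The "pon_only" group is the interesting one: those sites recur in normal samples without being common variants, which is the signature of a technical artifact and something an AF threshold alone would miss.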