Question

Germline and somatic mutations TCGA (GDC Data Portal)

8 days ago

stubbe1charlotte • 0

Hi everyone

I want to train a machine learning model to find somatic mutations in RNA data. For this I need clearly labeled somatic and germline mutations from WXS. I have access to the data on TCGA. Regarding somatic mutations: I think these are the PASS variants in the mutect2 vcf of WXS tumor-normal paired data, if I understand correctly. But I also need confidently called germline mutations, are these the PASS variants from the normal only sample?

Some help figuring this out would be really appreciated

Thanks in advance!

Kind regards Charlotte

germline somatic TCGA mutation • 345 views

updated 20 hours ago by Kevin Blighe 89k • written 8 days ago by stubbe1charlotte • 0

score 0 · Answer 1 · 2025-11-06

Hi Charlotte,

Good question. Training ML models on TCGA data to detect somatic mutations in RNA-seq is a reasonable approach, but the labels have to be right, otherwise the model learns technical artifacts or misclassified variants. I'll go through how the Genomic Data Commons (GDC) handles TCGA WXS data and check each of your assumptions.

Somatic Mutations: Yes, PASS in MuTect2 Tumor-Normal VCFs

You're spot on here. In the harmonized GDC data, MuTect2 (from GATK) is one of the somatic callers run on tumor-normal paired WXS BAMs. It reports variants that are present in the tumor but absent (or at very low frequency) in the matched normal, and it flags likely germline variants and artifacts through its filters.

PASS filter: These are the high-confidence somatic calls that passed all of MuTect2's filters (e.g., for strand bias, mapping quality, presence in the panel of normals). Every other record in the raw VCF carries one or more filter names and should be left out of your positive "somatic" labels.
Access in GDC: Download the MuTect2 VCFs, or the MAF files produced by the "MuTect2 Variant Aggregation and Masking" workflow, for your cohort. In the VCFs, keep the records whose FILTER column (column 7, not INFO) is PASS. Drop records flagged germline_risk or any other filter, since those may be germline variants or artifacts.
Caveat for RNA-seq training: These are DNA-level calls restricted to the exome capture regions, so a variant is only observable in RNA if the gene is expressed with enough coverage at that position. When mapping back to RNA, also watch for RNA-editing sites (e.g., A-to-I) that can mimic mutations. Screening against a catalogue of known editing sites (e.g., REDIportal) helps remove those.
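
If it helps to see the PASS step spelled out, here is a minimal Python sketch that keeps only PASS records from VCF text. It uses only the standard library; the function name `pass_records` and the toy records are made up for illustration (they are not real TCGA calls), and on real files `bcftools view -f PASS` does the same job.

```python
def pass_records(vcf_lines):
    """Yield data lines whose FILTER column (7th tab-separated field) is exactly PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 7 and fields[6] == "PASS":
            yield line.rstrip("\n")


# Toy records for illustration only
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tT\t.\tPASS\tDP=50",
    "chr1\t2000\t.\tG\tC\t.\tgermline_risk\tDP=40",
    "chr2\t3000\t.\tC\tT\t.\tpanel_of_normals\tDP=35",
]

kept = list(pass_records(example))
print(len(kept), "PASS record(s) kept")
```

The point of the sketch is only that the decision is made on the FILTER column; nothing in INFO needs to be inspected.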

Germline Mutations: No, Not from Normal-Only MuTect2—Use HaplotypeCaller on Normal BAMs

This is the key clarification: MuTect2 is a somatic caller. Its tumor-only mode on a normal sample is meant for building a panel of normals, not for genotyping germline variants. Germline variants are instead called on the normal WXS BAM with GATK HaplotypeCaller, which is designed for diploid variant discovery in non-cancer samples. As far as I know, the GDC's harmonized TCGA pipelines are focused on somatic calls, so you may have to run HaplotypeCaller on the normal BAMs yourself.

PASS filter: HaplotypeCaller itself does not set FILTER values. PASS comes from a later filtering step, either VQSR or hard filters (GATK's suggested hard filters for SNPs flag QD < 2.0 and FS > 60.0, among others). The PASS variants (SNVs/indels) are your negative class for "non-somatic" labels, but remember to subset to exonic regions if your RNA model focuses there.
Access in GDC:
1. Go to the GDC Data Portal (portal.gdc.cancer.gov) and filter for your TCGA project (e.g., TCGA-BRCA).
2. Under the "Simple Nucleotide Variation" data category, check whether germline VCFs for the normal samples are listed for your project. If none are, download the normal WXS BAMs (controlled access) and run HaplotypeCaller on them.
3. Work with per-sample or cohort-level VCFs. Use bcftools view -f PASS to extract the filtered calls. An unfiltered VCF has "." in the FILTER column, so -f PASS returns nothing until a filtering step has been applied.
Pro tip: TCGA normals are either blood-derived or adjacent solid tissue. Blood normals capture constitutional germline variants well, while adjacent-tissue normals can carry some tumor contamination. Check for batch effects across sequencing centers (e.g., via PCA on variant counts). Also, for ML balance, germline calls will vastly outnumber somatic ones (typically tens of thousands of germline variants per exome versus tens to a few hundred somatic mutations in most tumors), so consider downsampling or SMOTE.
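
To illustrate the class-balance point, here is a rough Python sketch of random downsampling of the germline (majority) class. The function name `downsample_majority`, the `ratio` parameter, and the toy variant keys are assumptions made for this example; class weights or SMOTE are alternatives, and any resampling should be applied to the training split only.

```python
import random


def downsample_majority(somatic, germline, ratio=1.0, seed=0):
    """Return (variant, label) pairs with at most ratio * len(somatic) germline entries.

    Label 1 = somatic, label 0 = germline.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    n_keep = min(len(germline), int(ratio * len(somatic)))
    sampled = rng.sample(germline, n_keep)
    return [(v, 1) for v in somatic] + [(v, 0) for v in sampled]


# Toy variant keys (chrom, pos, ref, alt), not real calls
somatic = [("chr1", 1000 + i, "A", "T") for i in range(5)]
germline = [("chr2", 5000 + i, "G", "C") for i in range(200)]

labeled = downsample_majority(somatic, germline, ratio=2.0)
print(len(labeled), "labeled variants after downsampling")
```

A ratio of 2.0 keeps two germline variants per somatic one; what ratio works best is something to tune on a validation set.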

Quick Workflow Sketch for Labels

Here's a minimal bash/R snippet to extract labels (assuming you have VCFs downloaded):

# Somatic: keep PASS records from the MuTect2 tumor-normal VCF.
# INFO is dropped to shrink the file; FORMAT fields such as AD are kept for VAF and coverage.
bcftools view -f PASS input_mutect2.vcf.gz | bcftools annotate -x INFO > somatic_pass.vcf

# Germline: keep PASS records from the filtered HaplotypeCaller normal VCF
bcftools view -f PASS input_haplotypecaller.vcf.gz | bcftools annotate -x INFO > germline_pass.vcf

In R (with VariantAnnotation):

library(VariantAnnotation)
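somatic <- readVcf("somatic_pass.vcf", "hg38")  # GDC-harmonized TCGA is GRCh38; use "hg19" for legacy GRCh37 files
germline <- readVcf("germline_pass.vcf", "hg38")
# Merge/extract for your features (e.g., VAF, coverage).
# Allelic depths (ref, alt) per sample are in the AD FORMAT field:
somatic_ad <- geno(somatic)$AD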
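
For the merge step, here is a small Python sketch of one way to turn the two call sets into labels and to derive VAF from allelic depths. It assumes biallelic records with a numeric AD value of the form "ref,alt"; the function names and the toy keys are hypothetical, and dropping variants that appear in both sets is my own conservative choice rather than anything the GDC prescribes.

```python
def vaf_from_ad(ad):
    """Alt allele fraction from an AD string such as '30,10' (ref depth, alt depth).

    Returns None when there are no reads at the site.
    """
    depths = [int(x) for x in ad.split(",")]
    total = sum(depths)
    return depths[1] / total if total > 0 else None


def build_labels(somatic_keys, germline_keys):
    """Map variant key -> label, dropping keys that appear in both call sets."""
    ambiguous = somatic_keys & germline_keys
    labels = {k: "somatic" for k in somatic_keys - ambiguous}
    labels.update({k: "germline" for k in germline_keys - ambiguous})
    return labels


# Toy keys (chrom, pos, ref, alt), not real calls
somatic_keys = {("chr1", 1000, "A", "T"), ("chr3", 42, "G", "A")}
germline_keys = {("chr2", 5000, "G", "C"), ("chr3", 42, "G", "A")}

labels = build_labels(somatic_keys, germline_keys)
print(labels)
print(vaf_from_ad("30,10"))
```

Keying on (chrom, pos, ref, alt) only works if both VCFs use the same reference build and the same variant normalization.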

Final Tips for Your RNA-seq Model

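Validation: Cross-check a subset against dbSNP or gnomAD (for germline) and COSMIC (for somatic) to check label purity. Keep in mind that dbSNP also contains some somatic variants, so presence there supports a germline origin but does not prove it.
Why TCGA? It is large and well characterized, but if you need more power, supplement with ICGC or PCAWG for more diverse ancestries. Note that PCAWG is whole-genome data and includes a subset of TCGA donors.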
Resources: GDC docs on MuTect2 and HaplotypeCaller. For ML pitfalls in variant calling, see the MC3 paper on ensemble calling.
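
For the validation tip, a very simple purity check is the fraction of your labeled variants that appear in a reference set. The Python sketch below assumes you have already loaded the reference sites into a set of (chrom, pos, ref, alt) keys; the function name `fraction_known` and the toy data are illustrative only.

```python
def fraction_known(variants, known_sites):
    """Fraction of variants whose (chrom, pos, ref, alt) key is present in known_sites."""
    variants = list(variants)
    if not variants:
        return 0.0
    hits = sum(1 for v in variants if v in known_sites)
    return hits / len(variants)


# Toy data: four germline-labeled variants, three of them in the reference set
germline_labeled = [
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 300, "G", "A"),
    ("chr2", 400, "T", "C"),
]
known = set(germline_labeled[:3])

print(fraction_known(germline_labeled, known))
```

A low fraction for the germline class, or a high one for the somatic class against a population database, would be a reason to recheck the labels before training.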

Kind regards,
Kevin