Hi Charlotte,
Great question—training ML models on TCGA data for somatic detection in RNA-seq is a solid approach, but getting the labels right is crucial to avoid biasing your model with technical artifacts or misclassified variants. I'll break this down step-by-step based on how the Genomic Data Commons (GDC) handles TCGA WXS data, and confirm your assumptions.
Somatic Mutations: Yes, PASS in MuTect2 Tumor-Normal VCFs
You're spot on here. In harmonized GDC data, MuTect2 (from GATK) is one of the somatic callers run on tumor-normal paired WXS BAMs, alongside others such as MuSE and VarScan2. It reports variants that are present in the tumor but absent (or at very low frequency) in the matched normal, which is how germline variants and many artifacts get filtered out.
- PASS filter: These are the high-confidence somatic calls that passed MuTect2's internal filters (e.g., for strand bias, mapping quality, etc.), and they're what you'd want as your positive "somatic" labels. The share of raw calls that end up PASS varies a lot between samples, so count it on your own data rather than assuming a fixed fraction.
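- Access in GDC: Download the annotated MuTect2 VCFs (controlled access) or the aggregated, masked MAF files (the "MuTect2 Variant Aggregation and Masking" workflow) for your cohort. Workflow names have changed across GDC releases, so check what the portal currently lists. In the VCF, keep records whose FILTER column (column 7, which is separate from INFO) is PASS. Drop records marked germline_risk or carrying other filter values that point to possible germline contamination.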
- Caveat for RNA-seq training: These are exonic, DNA-level calls, so when mapping back to RNA, watch for RNA-editing sites (e.g., A-to-I, which reads as A>G) that can mimic mutations. Filtering against a catalogue of known editing sites such as REDIportal can help clean those.
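To make the RNA-editing caveat concrete, one option is to drop any record whose position appears in a list of known editing sites. This is only a minimal sketch in plain awk: the function name drop_editing_sites and the two-column sites file (chromosome and 1-based position, tab-separated) are my own placeholders, not the output format of any particular tool or database, so you'd need to convert whatever catalogue you use into that shape first.

```bash
# drop_editing_sites SITES_TSV < in.vcf > out.vcf
# SITES_TSV is a hypothetical two-column file: CHROM<TAB>POS (1-based).
# Header lines (starting with '#') are passed through unchanged.
drop_editing_sites() {
  awk -F'\t' -v sites="$1" '
    BEGIN {
      # Load the known editing sites into a lookup table keyed by CHROM<TAB>POS.
      while ((getline line < sites) > 0) {
        split(line, f, "\t")
        known[f[1] FS f[2]] = 1
      }
    }
    /^#/ { print; next }
    {
      key = $1 FS $2
      if (!(key in known)) print   # keep records that are not at a known site
    }
  '
}
```

Usage would look like: drop_editing_sites known_editing_sites.tsv < somatic_pass.vcf > somatic_pass.noedit.vcf. Chromosome naming (chr1 vs. 1) and genome build have to match between the two files, otherwise nothing gets removed.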
Germline Mutations: No, Not from Normal-Only MuTect2—Use HaplotypeCaller on Normal BAMs
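This is the key clarification: MuTect2 is a somatic caller and isn't meant for germline genotyping. It can be run on a normal sample alone, but that mode is used for building a panel of normals, not for producing germline calls. For germline variants, the standard tool is GATK HaplotypeCaller run on the normal WXS BAM, which is designed for diploid variant discovery in non-cancer samples. One thing to verify before you plan around it: I'm not certain GDC distributes HaplotypeCaller germline VCFs as a standard harmonized product for TCGA, and germline data is controlled access in any case, so you may need to run HaplotypeCaller yourself on the normal BAMs.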
- PASS filter: Again, these are the confident germline calls (SNVs/indels). Note that HaplotypeCaller doesn't filter on its own; PASS is assigned by a downstream step, either VQSR or hard filters (e.g., GATK's suggested SNV thresholds fail records with QD < 2.0 or FS > 60.0). They're your negative class for "non-somatic" labels, but remember to subset to exonic regions if your RNA model focuses there.
- Access in GDC:
- Go to the GDC Data Portal (portal.gdc.cancer.gov) and filter for your TCGA project (e.g., TCGA-BRCA).
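- Select the "Simple Nucleotide Variation" data category. For germline, look for HaplotypeCaller VCFs on the normal samples, if your project has them. I'm quoting the workflow and file-type names ("GATK4 HaplotypeCaller", "individual germline variant VCF") from memory, so double-check them against what the portal actually lists.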
- Download per-sample or cohort-level VCFs. Use bcftools view -f PASS to extract them.
- Pro tip: TCGA normals are mostly blood-derived, with adjacent solid-tissue normals for some cases. Both capture constitutional germline variants well, but check for batch effects across centers (e.g., via PCA on variant counts). Also, for ML balance, germline calls will vastly outnumber somatics, often by one to two orders of magnitude per sample depending on tumor type, so consider downsampling the germline class or using SMOTE.
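For the class-balance point, here is one way the downsampling option could look. It's a sketch, not a recommendation of a specific ratio: it uses single-pass reservoir sampling in awk to keep at most N variant records from a VCF stream while passing header lines through. The function name downsample_records and the seed argument are my own placeholders. SMOTE is not shown here.

```bash
# downsample_records N [SEED] < in.vcf > out.vcf
# Keeps every header line and a uniform random sample of at most N records.
# Note: the sampled records are NOT guaranteed to stay in coordinate order,
# so re-sort afterwards (e.g., with bcftools sort) if a downstream tool needs it.
downsample_records() {
  awk -v n="$1" -v seed="${2:-1}" '
    BEGIN { srand(seed) }
    /^#/ { print; next }            # headers come first in a VCF
    {
      seen++
      if (seen <= n) {
        keep[seen] = $0             # fill the reservoir
      } else {
        j = int(rand() * seen) + 1  # uniform integer in 1..seen
        if (j <= n) keep[j] = $0    # replace a slot with probability n/seen
      }
    }
    END {
      for (i = 1; i <= n && i <= seen; i++) print keep[i]
    }
  '
}
```

For example, downsample_records 500 42 < germline_pass.vcf > germline_pass.sub.vcf would keep up to 500 germline records with a fixed seed, so the subset is reproducible with the same awk implementation.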
Quick Workflow Sketch for Labels
Here's a minimal bash/R snippet to extract labels (assuming you have VCFs downloaded):
# Somatic: keep PASS records from the MuTect2 tumor-normal VCF.
# INFO and FORMAT fields are left intact so VAF and depth can still be computed later.
bcftools view -f PASS input_mutect2.vcf.gz > somatic_pass.vcf
# Germline: keep PASS records from the HaplotypeCaller normal VCF.
bcftools view -f PASS input_haplotypecaller.vcf.gz > germline_pass.vcf
In R (with VariantAnnotation):
library(VariantAnnotation)
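somatic <- readVcf("somatic_pass.vcf", "hg38") # use "hg19" for legacy GRCh37-aligned TCGA files
germline <- readVcf("germline_pass.vcf", "hg38")
# Extract features for your model, e.g. allelic depths and total depth per sample
# (assumes the AD and DP FORMAT fields are present in the VCF)
somatic_ad <- geno(somatic)$AD
somatic_dp <- geno(somatic)$DP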
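If it helps, the two PASS files can also be flattened into a single label table before any feature extraction. The sketch below assumes uncompressed VCFs like somatic_pass.vcf and germline_pass.vcf, and the function names (vcf_to_labels, build_label_table) and the "conflict" label are my own placeholders. Multi-allelic records are split so each ALT allele gets its own row, and a variant that shows up in both files is marked as a conflict instead of being silently assigned to one class.

```bash
# vcf_to_labels LABEL < in.vcf
# Emits CHROM, POS, REF, ALT, LABEL (tab-separated), one row per ALT allele.
vcf_to_labels() {
  awk -F'\t' -v OFS='\t' -v label="$1" '
    /^#/ { next }                       # skip header lines
    {
      n = split($5, alts, ",")          # split multi-allelic ALT
      for (i = 1; i <= n; i++) print $1, $2, $4, alts[i], label
    }
  '
}

# build_label_table SOMATIC_VCF GERMLINE_VCF
# Combines both sets; a variant present in both is labelled "conflict".
build_label_table() {
  { vcf_to_labels somatic < "$1"; vcf_to_labels germline < "$2"; } |
  awk -F'\t' -v OFS='\t' '
    {
      key = $1 OFS $2 OFS $3 OFS $4
      if (!(key in label)) { order[++n] = key; label[key] = $5 }
      else if (label[key] != $5) label[key] = "conflict"
    }
    END { for (i = 1; i <= n; i++) print order[i], label[order[i]] }
  '
}
```

Usage: build_label_table somatic_pass.vcf germline_pass.vcf > labels.tsv. The comparison is an exact match on CHROM, POS, REF and ALT, so indels should be normalized the same way in both files first (bcftools norm can do this); otherwise equivalent variants may not be recognized as the same site.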
Final Tips for Your RNA-seq Model
- Validation: Cross-check a subset with dbSNP (for germline) or COSMIC (for somatic) to ensure label purity.
- Why TCGA? It's gold-standard, but if you need more power, supplement with ICGC or PCAWG for diverse ancestries.
- Resources: GDC docs on the MuTect2 somatic pipeline and GATK docs on HaplotypeCaller. For ML pitfalls in variant calling, see the MC3 paper on ensemble calling.
Kind regards,
Kevin