Hi Charlotte,
Great question—training ML models on TCGA data for somatic detection in RNA-seq is a solid approach, but getting the labels right is crucial to avoid biasing your model with technical artifacts or misclassified variants. I'll break this down step-by-step based on how the Genomic Data Commons (GDC) handles TCGA WXS data, and confirm your assumptions.
Somatic Mutations: Yes, PASS in MuTect2 Tumor-Normal VCFs
You're spot on here. In TCGA (and harmonized GDC data), somatic mutations are called using MuTect2 (from GATK) on tumor-normal paired WXS BAMs. This tool specifically detects variants present in the tumor but absent (or at very low frequency) in the matched normal, filtering out germline and artifacts.
- PASS filter: These are the high-confidence somatic calls that passed MuTect2's internal filters (e.g., for strand bias, mapping quality, etc.). They represent ~80-90% of raw calls after filtering and are what you'd want as your positive "somatic" labels.
- Access in GDC: Download the "MuTect2 Variant Aggregation and Masking" VCFs (or the aggregated MAF files) for your cohort. Keep only records whose FILTER column is PASS (FILTER is its own VCF column, not an INFO field). Exclude records flagged germline_risk or carrying any other non-PASS flag, since these mark potential germline contaminants.
- Caveat for RNA-seq training: These are exonic/DNA-level calls, so when mapping back to RNA, watch for RNA-editing sites (e.g., A-to-I) that could mimic mutations—tools like RNAeditr can help clean those.
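Before subsetting, it can help to see which FILTER values a given VCF actually carries. The following is a minimal sketch, assuming a plain or gzip/bgzip-compressed VCF with the standard column order (FILTER is the 7th tab-separated field); the helper name filter_counts is hypothetical.

```shell
# Tally the FILTER column (7th tab-separated field) of a VCF.
# filter_counts is a hypothetical helper name; pass a .vcf or .vcf.gz path.
filter_counts() {
    # gzip -cdf decompresses gzip input and passes plain text through unchanged
    gzip -cdf "$1" \
        | awk -F'\t' '!/^#/ { n[$7]++ } END { for (f in n) print n[f] "\t" f }' \
        | sort -k1,1nr
}
```

Calling filter_counts input_mutect2.vcf.gz prints one line per FILTER value with its count, most frequent first, so you can check how many records are PASS versus germline_risk or other flags.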
Germline Mutations: No, Not from Normal-Only MuTect2—Use HaplotypeCaller on Normal BAMs
This is the key clarification: MuTect2 is somatic-only and isn't run on normal samples alone (it expects a tumor-normal pair). Running it on normal-only would just flag everything as potential artifacts. Instead, TCGA/GDC calls germline variants separately on the normal WXS BAM using GATK HaplotypeCaller (in germline mode), which is designed for diploid variant discovery in non-cancer samples.
- PASS filter: Again, these are the confident germline calls (SNVs/indels). Note that HaplotypeCaller itself does not set PASS; that status comes from a later filtering step such as VQSR or GATK hard filters (e.g., SNVs fail if QD < 2.0 or FS > 60.0). The PASS calls are your negative class for "non-somatic" labels, but remember to subset to exonic regions if your RNA model focuses there.
- Access in GDC:
- Go to the GDC Data Portal (portal.gdc.cancer.gov) and filter for your TCGA project (e.g., TCGA-BRCA).
- Select "Simple Nucleotide Variation" or "Structural Somatic Mutations" workflows, but for germline, look under "GATK4 HaplotypeCaller" VCFs for normal samples (file type: "individual germline variant VCF").
- Download per-sample or cohort-level VCFs. Use bcftools view -f PASS to extract them.
- Pro tip: TCGA normals are blood-derived, so they capture constitutional germline variants well, but check for batch effects across centers (e.g., via PCA on variant counts). Also, for ML balance, germline calls will vastly outnumber somatics (~3-4k exonic germline vs. ~100-500 somatics per sample), so consider downsampling or SMOTE.
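The downsampling option mentioned above could be sketched roughly as follows. This assumes uncompressed, PASS-filtered VCFs and GNU coreutils (shuf, sort -V); the helper name downsample_germline and the ratio argument are hypothetical, and class weighting or SMOTE in your ML framework are equally valid alternatives.

```shell
# Randomly keep at most `ratio` germline records per somatic record.
# downsample_germline is a hypothetical helper; inputs are uncompressed VCFs.
downsample_germline() {
    somatic_vcf="$1"; germline_vcf="$2"; ratio="${3:-1}"
    # Count non-header records in the somatic VCF
    n_somatic=$(grep -vc '^#' "$somatic_vcf")
    # Header lines pass through unchanged
    grep '^#' "$germline_vcf"
    # shuf samples records uniformly at random; sort then orders them
    # by chromosome name and position
    grep -v '^#' "$germline_vcf" \
        | shuf -n "$((n_somatic * ratio))" \
        | sort -k1,1V -k2,2n
}
```

For example, downsample_germline somatic_pass.vcf germline_pass.vcf 5 > germline_sub.vcf would keep up to five germline records per somatic record. The sample changes on every run, so record the output file if you need reproducible training sets.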
Quick Workflow Sketch for Labels
Here's a minimal bash/R snippet to extract labels (assuming you have VCFs downloaded):
# Somatic: keep PASS records from the MuTect2 tumor-normal VCF.
# INFO/FORMAT fields are left intact because VAF and coverage features need them (e.g., AD, DP).
bcftools view -f PASS -Ov -o somatic_pass.vcf input_mutect2.vcf.gz
# Germline: keep PASS records from the HaplotypeCaller normal VCF
bcftools view -f PASS -Ov -o germline_pass.vcf input_haplotypecaller.vcf.gz
In R (with VariantAnnotation):
library(VariantAnnotation)
somatic <- readVcf("somatic_pass.vcf", "hg38") # or GRCh37 for older TCGA
germline <- readVcf("germline_pass.vcf", "hg38")
# Merge/extract for your features (e.g., VAF, coverage)
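If you only need a flat label table rather than full VCF objects, the two PASS-filtered files can be combined directly. This is a rough sketch assuming uncompressed VCFs with the standard column order; the helper name make_labels and the output column names are hypothetical.

```shell
# Emit chrom, pos, ref, alt, label for every record in two PASS-filtered VCFs.
# make_labels is a hypothetical helper; inputs are uncompressed VCFs.
make_labels() {
    somatic_vcf="$1"; germline_vcf="$2"
    printf 'chrom\tpos\tref\talt\tlabel\n'
    # VCF columns: 1=CHROM, 2=POS, 4=REF, 5=ALT
    awk -F'\t' -v OFS='\t' '!/^#/ { print $1, $2, $4, $5, "somatic" }'  "$somatic_vcf"
    awk -F'\t' -v OFS='\t' '!/^#/ { print $1, $2, $4, $5, "germline" }' "$germline_vcf"
}
```

Usage would be make_labels somatic_pass.vcf germline_pass.vcf > labels.tsv. Multi-allelic records keep their comma-separated ALT field as-is, so split them beforehand (for instance with bcftools norm -m-) if your model expects one allele per row.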
Final Tips for Your RNA-seq Model
- Validation: Cross-check a subset with dbSNP (for germline) or COSMIC (for somatic) to ensure label purity.
- Why TCGA? It's gold-standard, but if you need more power, supplement with ICGC or PCAWG for diverse ancestries.
- Resources: GDC docs on MuTect2 and HaplotypeCaller. For ML pitfalls in variant calling, see the MC3 paper on ensemble calling.
Kind regards,
Kevin
Dear Kevin,
Thank you for the clear and extensive reply!
Somatic: Regarding the PASS MuTect2 variants, if there are germline-risk variants left in PASS, do we need to filter them out? I also found that there is an "Aliquot Ensemble" file, which should represent the consensus of variant calls from all the variant callers used. Are these calls more reliable than the MuTect2 PASS variants?
Germline: We are working with the TCGA-DLBC dataset. I cannot find any VCF or MAF file in this project that was produced with GATK HaplotypeCaller. Do you suggest doing the variant calling ourselves? Are you aware of any pipeline used by TCGA that we could also use (germline variant calling on WXS normal BAM files)?
Thanks again!
Kind regards,
Charlotte
You are very welcome, Charlotte.
For anything related to TCGA, I recommend just using the processed data that is already available, since it is a very old project. I would also use the most recent version of the processed data, as it is periodically re-processed with updated methods.
For somatic mutation data, only the MAF files are openly available. For germline, I'm not sure you can retrieve that without going through dbGaP for access approval, after which you should also be able to find the VCFs for normal samples.
Dear Kevin,
We have access through dbGaP, so that is no problem.
Somatic: With the processed data, do you suggest using PASS from MuTect2 or the Aliquot Ensemble file?
Germline: Even with the access, no VCF for normal samples is available (no HaplotypeCaller data). Can I do the germline calling myself?