For many of the TCGA tumor types, targeted capture validation was performed on a subset of variants/genes/samples detected in WGS/WXS. You can find capture validation BAMs here on NCI's GDC Legacy Archive, and then query/fetch the corresponding WXS/WGS BAM per sample. Use the latter as the discovery set, and the former as a validation set. This gets you a measure of variant caller specificity. To measure sensitivity, you'll need some kind of comprehensive gold set, and I wouldn't recommend using the TCGA MAFs for that, since they have varying degrees of sensitivity/specificity.
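To make the discovery-vs-validation comparison concrete, here is a minimal sketch in Python. It assumes you have already reduced both BAM sets to variant keys of the form (chrom, pos, ref, alt); the function name and the toy variants are made up for illustration and are not from any TCGA pipeline. Strictly speaking, the number it returns is a validation rate (precision), which is what is loosely called specificity above.

```python
def validation_rate(calls, validated, unable=frozenset()):
    """Fraction of assessable discovery calls that were confirmed.

    calls:     variants called from the WXS/WGS (discovery) BAMs
    validated: variants confirmed in the capture validation BAMs
    unable:    variants that could not be assessed (e.g. too few reads)
    """
    assessable = set(calls) - set(unable)
    if not assessable:
        return None  # nothing to score
    return len(assessable & set(validated)) / len(assessable)


# Toy variants keyed as (chrom, pos, ref, alt); made-up values, not TCGA data.
calls = {("1", 100, "A", "T"), ("1", 200, "G", "C"),
         ("2", 300, "C", "CT"), ("X", 50, "T", "G")}
validated = {("1", 100, "A", "T"), ("2", 300, "C", "CT")}
unable = {("X", 50, "T", "G")}

rate = validation_rate(calls, validated, unable)
print(f"{rate:.2f} of assessable calls validated")
```

Calls that could not be assessed are dropped from the denominator, so they count neither for nor against the caller.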
One way to generate a gold set would be to throw the WGS/WXS BAMs at multiple leading somatic variant callers (try Strelka, MuTect, VarScan, Bassovac), and check their calls against the validation BAMs. The union of calls that pass could make a decent gold set. If your caller tries to reposition reads to make better calls, then you could also play with different aligners (bwa-aln, bwa-mem, bowtie) or local re-aligners/assemblers (Pindel, ABRA, or Scalpel).
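As a rough sketch of that union-then-validate idea, assuming each caller's output has already been reduced to a set of variant keys and that you have a set of keys that passed validation (the function names and the toy variant labels are hypothetical):

```python
def build_gold_set(calls_by_caller, passed_validation):
    """Union of all callers' calls, kept only if they passed validation."""
    union = set().union(*calls_by_caller.values())
    return union & set(passed_validation)


def sensitivity(caller_calls, gold):
    """Fraction of the gold set recovered by one caller."""
    if not gold:
        return None  # undefined without a gold set
    return len(set(caller_calls) & gold) / len(gold)


# Toy variant labels standing in for (chrom, pos, ref, alt) keys.
calls_by_caller = {
    "strelka": {"v1", "v2", "v3"},
    "mutect": {"v2", "v3", "v4"},
    "varscan": {"v3", "v5"},
}
passed_validation = {"v1", "v2", "v3", "v4"}  # v5 failed validation

gold = build_gold_set(calls_by_caller, passed_validation)
per_caller = {name: sensitivity(calls, gold)
              for name, calls in calls_by_caller.items()}
print(sorted(gold), per_caller)
```

Keep in mind that a gold set built this way only contains variants that at least one caller found, so the sensitivity numbers are relative to the callers you included, not to the truth.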
This and this are papers worth reading. I'm sure there are more.
In TCGA UCEC, we made new DNA libraries and re-sequenced ~250 genes in their entirety for validation on 222 cases. And it's one of the heavily mutated tumors, so there should be enough to generate good sensitivity/specificity stats. With frequent microsatellite instability, UCEC is also a good challenge for indel callers. If you have access to TCGA's Jamboree, and don't want to deal with BAM files, then here is a file that lists all 57,087 mutations targeted for re-sequencing (capture validation) in MAF format, and their post-validation status broken down as follows in column 60.
Failed validation:
- 6609 - variant_not_found
- 1636 - germline_or_loh
Passed validation:
- 32666 - somatic_high_tum_vaf
- 4101 - somatic_low_tum_vaf
- 9492 - somatic_med_tum_vaf
Unable to validate:
- 1193 - insufficient_reads
  - These variants have fewer than 10 reads in Tumor or Normal
- 549 - skip_larger_indels
  - Read-counts were not generated for larger indels
- 459 - low_nrm_vaf_low_tum_vaf
  - Likely a recurrent artifact or low-level contamination from another sample
- 222 - not_on_autosome
  - These variants are mapped to sequences other than 1..22 and X
- 158 - no_custom_capture
  - Custom capture data was unavailable (need additional material)
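If you'd rather tally these yourself, here is a minimal Python sketch that counts the statuses in column 60 of a tab-delimited MAF and groups them into the three outcomes listed above. It assumes `#` comment lines followed by a single header row; the demo rows and the header label are placeholders I made up, since the real column name isn't given here.

```python
import csv
from collections import Counter

STATUS_COLUMN = 59  # column 60 of the MAF, zero-based

# Grouping taken from the breakdown listed above.
OUTCOME = {
    "variant_not_found": "failed",
    "germline_or_loh": "failed",
    "somatic_high_tum_vaf": "passed",
    "somatic_med_tum_vaf": "passed",
    "somatic_low_tum_vaf": "passed",
    "insufficient_reads": "unable",
    "skip_larger_indels": "unable",
    "low_nrm_vaf_low_tum_vaf": "unable",
    "not_on_autosome": "unable",
    "no_custom_capture": "unable",
}


def tally_validation(lines):
    """Count validation statuses, then roll them up by outcome."""
    data = (line for line in lines if not line.startswith("#"))
    rows = csv.reader(data, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(rows, None)  # skip the header row
    by_status = Counter(row[STATUS_COLUMN] for row in rows
                        if len(row) > STATUS_COLUMN)
    by_outcome = Counter()
    for status, n in by_status.items():
        by_outcome[OUTCOME.get(status, "unknown")] += n
    return by_status, by_outcome


def fake_row(status):
    """Placeholder 60-column row; real MAF rows carry actual annotations."""
    return "\t".join(["."] * STATUS_COLUMN + [status]) + "\n"


demo = ["#version 2.4\n", fake_row("status_header_placeholder")]
demo += [fake_row(s) for s in ("somatic_high_tum_vaf", "somatic_low_tum_vaf",
                               "variant_not_found", "insufficient_reads")]
by_status, by_outcome = tally_validation(demo)
print(dict(by_outcome))
```

On a real file you would pass an open file handle instead of `demo`. Going by the counts listed above, 46,259 variants passed and 8,245 failed, so roughly 85% of the assessable targeted variants validated.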
Thanks for the thoughts Cyriac! I've been building a spreadsheet of BAMs and MAFs from the TCGA papers that claim to have performed validation manually and was hoping someone else had already done this work, because I am lazy. I'm unhappy with forming a gold set from the union of calls from other callers--I'd prefer to see some other form of validation, whether it's targeted resequencing with another platform or very high-depth sequencing with the same platform.
As for the references you provide: were you able to track down the data behind the ICGC papers? They mention that the data is available but don't give an ID to search on.
Also, how do you go from validation BAMs to validated variant calls? I have seen those validation BAMs (> 4k!) but didn't see associated VCFs/MAFs.
Hi Cyriac
Sorry for the basic question, but I'd like to better understand your comment below:
You can find the validation BAMs here on CGHub, and then query/fetch the corresponding WXS/WGS BAMs. Use the latter as the discovery set, and the former as a validation set. This gets you a measure of variant caller specificity.
What exactly is the difference between the validation BAMs and the WGS/WXS BAMs? And how do you use the combination of the two to measure a variant caller's specificity?
Thanks
Severine
I guess you can find the answer using this FAQ.