We're working on a new somatic variant caller and could use some validated somatic variant calls to ensure our software is working well. We're working to compile a list of validated somatic variant calls together with the raw or aligned tumor and normal reads that were used to generate the variant calls. If you know of any data sets that might belong on our list, please note them below!
For many of the TCGA tumor types, targeted capture validation was performed on a subset of the variants/genes/samples detected in WGS/WXS. You can find capture validation BAMs here on NCI's GDC Legacy Archive, and then query/fetch the corresponding WXS/WGS BAM per sample. Use the latter as the discovery set, and the former as a validation set. This gets you a measure of variant caller specificity. To measure sensitivity, you'll need some kind of comprehensive gold set, and I wouldn't recommend using the TCGA MAFs for that, since they have varying degrees of sensitivity/specificity.
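To make the discovery-vs-validation comparison concrete, here is a minimal Python sketch. It assumes you have already reduced your caller's output to (chrom, pos, ref, alt) tuples and have a lookup of which sites validated; the names `validation_rate` and `validation_status` are hypothetical, not part of any TCGA or GDC tool. Strictly speaking the number it returns is the fraction of assessable calls that validated, which is the practical stand-in for specificity here.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical variant key: (chrom, pos, ref, alt)
Variant = Tuple[str, int, str, str]


def validation_rate(
    calls: Iterable[Variant],
    validation_status: Dict[Variant, bool],
) -> Optional[float]:
    """Fraction of assessable calls that were confirmed in the validation data.

    Calls missing from validation_status (for example, sites outside the
    capture target) are skipped rather than counted as false positives.
    """
    assessed = 0
    validated = 0
    for variant in calls:
        status = validation_status.get(variant)
        if status is None:
            continue  # site was not assessed in the validation set
        assessed += 1
        if status:
            validated += 1
    if assessed == 0:
        return None  # nothing could be assessed
    return validated / assessed
```

Skipping unassessed sites is a design choice: a call that was never targeted for validation tells you nothing either way, so counting it as wrong would penalize the caller unfairly.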
One way to generate a gold set would be to run the WGS/WXS BAMs through multiple leading somatic variant callers (try Strelka, MuTect, VarScan, Bassovac) and check their calls against the validation BAMs. The union of calls that pass validation could make a decent gold set. If your caller tries to reposition reads to make better calls, then you could also experiment with different aligners (bwa-aln, bwa-mem, bowtie) or local re-aligners/assemblers (Pindel, ABRA or scalpel).
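The gold-set idea can be sketched with plain Python sets. This assumes each caller's output has already been parsed into a set of (chrom, pos, ref, alt) tuples and that you have a set of sites confirmed in the validation BAMs; `build_gold_set` and `sensitivity` are hypothetical helper names, and the parsing itself is left out.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical variant key: (chrom, pos, ref, alt)
Variant = Tuple[str, int, str, str]


def build_gold_set(
    calls_by_caller: Dict[str, Set[Variant]],
    validated: Set[Variant],
) -> Set[Variant]:
    """Union of all callers' calls, restricted to those that passed validation."""
    union: Set[Variant] = set().union(*calls_by_caller.values())
    return union & validated


def sensitivity(my_calls: Iterable[Variant], gold: Set[Variant]) -> Optional[float]:
    """Fraction of gold-set variants recovered by the caller under test."""
    if not gold:
        return None  # undefined without a gold set
    return len(set(my_calls) & gold) / len(gold)
```

Keep in mind that a gold set built this way can only contain variants that at least one of the existing callers found, so the sensitivity you measure against it is relative to those callers rather than absolute.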
In TCGA UCEC, we made new DNA libraries and re-sequenced ~250 genes in their entirety for validation on 222 cases. It's one of the more heavily mutated tumor types, so there should be enough variants to generate good sensitivity/specificity stats. With frequent microsatellite instability, UCEC is also a good challenge for indel callers. If you have access to TCGA's Jamboree and don't want to deal with BAM files, then here is a file that lists all 57,087 mutations targeted for re-sequencing (capture validation) in MAF format, with their post-validation status in column 60, broken down as follows:
- 6609 -
- 1636 -
- 32666 -
- 4101 -
- 9492 -
Unable to validate:
- 1193 - insufficient_reads - These variants have fewer than 10 reads in Tumor or Normal
- 549 - skip_larger_indels - Read-counts were not generated for larger indels
- 459 - low_nrm_vaf_low_tum_vaf - Likely a recurrent artifact or low-level contamination from another sample
- 222 - not_on_autosome - These variants are mapped to sequences other than 1..22 and X
- 158 - no_custom_capture - Custom capture data was unavailable (need additional material)
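If you want to reproduce this breakdown yourself, a short Python sketch like the one below tallies the values in column 60. It assumes the file is tab-delimited with a single header row and optional comment lines starting with `#`; `tally_column` is a hypothetical helper name, and you should check those assumptions against the actual file.

```python
import csv
from collections import Counter
from typing import Iterable


def tally_column(lines: Iterable[str], column: int = 60) -> Counter:
    """Count the distinct values in a 1-based column of a tab-delimited file.

    Assumes lines starting with '#' are comments and the first remaining
    line is a header row. Rows too short to have the column are skipped.
    """
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # discard the header row
    counts: Counter = Counter()
    for row in reader:
        if len(row) >= column:
            counts[row[column - 1]] += 1
    return counts
```

For example, with a hypothetical local copy named `ucec_validation.maf`, you could run `with open("ucec_validation.maf") as fh: print(tally_column(fh).most_common())` and compare the counts against the list above.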