Question

Compendium of validated somatic variant calls with associated reads and reference?

9.2 years ago

jeff.hammerbacher ▴ 110

We're working on a new somatic variant caller and could use some validated somatic variant calls to check that our software is working well. To that end, we're compiling a list of validated somatic variant calls together with the raw or aligned tumor and normal reads that were used to generate them. If you know of any data sets that might belong on our list, please note them below!

cancer tcga variant-calling • 2.7k views

Updated 2.0 years ago by Ram 43k • written 9.2 years ago by jeff.hammerbacher ▴ 110

Ram · Accepted Answer · 2015-02-18

For many of the TCGA tumor types, targeted capture validation was performed on a subset of the variants/genes/samples detected in WGS/WXS. You can find the capture validation BAMs here on NCI's GDC Legacy Archive, then query/fetch the corresponding WXS/WGS BAM for each sample. Use the latter as the discovery set and the former as the validation set. This gives you a measure of variant caller specificity. To measure sensitivity, you'll need some kind of comprehensive gold set, and I wouldn't recommend using the TCGA MAFs for that, since they have varying degrees of sensitivity/specificity.
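As a rough illustration of the discovery-versus-validation comparison, here is a minimal Python sketch that computes the fraction of a caller's assessable calls that passed validation (strictly speaking this is closer to precision than to specificity). The function name, the variant tuple layout, and the `"passed"`/`"failed"`/`"unable"` labels are all assumptions made for the example, not anything defined by TCGA or the GDC.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical variant key: (chromosome, position, ref allele, alt allele)
Variant = Tuple[str, int, str, str]


def validated_fraction(calls: Iterable[Variant],
                       validation: Dict[Variant, str]) -> float:
    """Fraction of assessable discovery calls that passed validation.

    `validation` maps a variant to an assumed status label:
    "passed", "failed", or "unable" (could not be assessed).
    """
    passed = failed = 0
    for variant in set(calls):
        status = validation.get(variant)
        if status == "passed":
            passed += 1
        elif status == "failed":
            failed += 1
        # Calls outside the capture target, or marked "unable",
        # are left out of the denominator.
    assessable = passed + failed
    if assessable == 0:
        raise ValueError("no calls overlap the assessable validation set")
    return passed / assessable


# Toy data, invented for illustration only.
calls = [("1", 100, "A", "T"), ("1", 200, "G", "C"),
         ("2", 300, "C", "T"), ("3", 400, "T", "G")]
validation = {
    ("1", 100, "A", "T"): "passed",
    ("1", 200, "G", "C"): "failed",
    ("2", 300, "C", "T"): "passed",
    ("3", 400, "T", "G"): "unable",
}
fraction = validated_fraction(calls, validation)
print(f"{fraction:.3f}")
```

Only calls that were actually targeted and assessable count toward the denominator, so a caller is not penalized for variants that validation never looked at.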

One way to generate a gold set would be to run the WGS/WXS BAMs through multiple leading somatic variant callers (try Strelka, MuTect, VarScan, Bassovac) and check their calls against the validation BAMs. The union of calls that pass could make a decent gold set. If your caller tries to reposition reads to make better calls, then you could also experiment with different aligners (bwa-aln, bwa-mem, bowtie) or local re-aligners/assemblers (Pindel, ABRA, or Scalpel).
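The gold-set idea boils down to simple set operations once each caller's output has been reduced to a set of variant keys. The sketch below assumes that reduction has already been done; `build_gold_set`, `sensitivity`, and the toy variants are hypothetical names and data for illustration, not part of any caller's API.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, position, ref allele, alt allele)
Variant = Tuple[str, int, str, str]


def build_gold_set(calls_by_caller: Dict[str, Set[Variant]],
                   passed_validation: Set[Variant]) -> Set[Variant]:
    """Union of all callers' calls, restricted to those that passed validation."""
    union = set().union(*calls_by_caller.values())
    return union & passed_validation


def sensitivity(new_calls: Set[Variant], gold: Set[Variant]) -> float:
    """Fraction of gold-set variants recovered by the caller under test."""
    if not gold:
        raise ValueError("gold set is empty")
    return len(new_calls & gold) / len(gold)


# Toy variants, invented for illustration only.
a, b, c, d, e = [("1", pos, "A", "T") for pos in (100, 200, 300, 400, 500)]
calls_by_caller = {
    "strelka": {a, b},
    "mutect": {b, c},
    "varscan": {c, d},
}
passed_validation = {a, b, c, e}   # e passed validation but no caller found it

gold = build_gold_set(calls_by_caller, passed_validation)
new_caller_calls = {a, c, d}
result = sensitivity(new_caller_calls, gold)
print(sorted(gold), f"{result:.3f}")
```

A gold set built this way can only contain variants that at least one existing caller found, so sensitivity measured against it is relative to those callers rather than absolute.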

This and this are papers worth reading. I'm sure there are more.

In TCGA UCEC, we made new DNA libraries and re-sequenced ~250 genes in their entirety for validation on 222 cases. It's one of the heavily mutated tumor types, so there should be enough variants to generate good sensitivity/specificity stats. With frequent microsatellite instability, UCEC is also a good challenge for indel callers. If you have access to TCGA's Jamboree and don't want to deal with BAM files, then here is a file in MAF format that lists all 57,087 mutations targeted for re-sequencing (capture validation), with their post-validation status in column 60, broken down as follows.

Failed validation:

6609 - variant_not_found
1636 - germline_or_loh

Passed validation:

32666 - somatic_high_tum_vaf
4101 - somatic_low_tum_vaf
9492 - somatic_med_tum_vaf

Unable to validate:

1193 - insufficient_reads - These variants have fewer than 10 reads in Tumor or Normal
549 - skip_larger_indels - Read-counts were not generated for larger indels
459 - low_nrm_vaf_low_tum_vaf - likely a recurrent artifact or low-level contamination from another sample
222 - not_on_autosome - These variants are mapped to sequences other than 1..22 and X
158 - no_custom_capture - Custom capture data was unavailable (need additional material)
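If you want to reproduce a breakdown like the one above from the MAF itself, something along these lines should work. It is a minimal sketch that assumes a tab-separated file with optional `#` comment lines, a single header row, and the status in column 60 (1-based) as described above. The function names and the grouping into passed/failed/unable are my own for this example.

```python
import csv
from collections import Counter

# Status labels taken from the breakdown above.
PASSED = {"somatic_high_tum_vaf", "somatic_med_tum_vaf", "somatic_low_tum_vaf"}
FAILED = {"variant_not_found", "germline_or_loh"}


def tally_validation_status(lines, column=60):
    """Count the values in the given 1-based column of a MAF.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumes '#' comment lines and one header row (skipped).
    """
    rows = csv.reader(
        (line for line in lines if not line.startswith("#")),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    next(rows, None)  # skip the header row
    return Counter(row[column - 1] for row in rows if len(row) >= column)


def summarize(counts):
    """Collapse per-status counts into passed / failed / unable totals."""
    passed = sum(counts[s] for s in PASSED)
    failed = sum(counts[s] for s in FAILED)
    unable = sum(counts.values()) - passed - failed
    return {"passed": passed, "failed": failed, "unable": unable}


# The counts as posted above, used here instead of reading the actual file.
posted = Counter({
    "variant_not_found": 6609, "germline_or_loh": 1636,
    "somatic_high_tum_vaf": 32666, "somatic_low_tum_vaf": 4101,
    "somatic_med_tum_vaf": 9492,
    "insufficient_reads": 1193, "skip_larger_indels": 549,
    "low_nrm_vaf_low_tum_vaf": 459, "not_on_autosome": 222,
    "no_custom_capture": 158,
})
summary = summarize(posted)
print(summary)  # {'passed': 46259, 'failed': 8245, 'unable': 2581}
print(summary["passed"] / (summary["passed"] + summary["failed"]))
```

With a real file you would call `tally_validation_status(open(path))`, where `path` is wherever you saved the MAF. The categories as listed add up to 57,085, two fewer than the 57,087 total quoted above.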