We've developed a method for estimating 95% confidence intervals for MAF (Blomquist 2015) using amplicon NGS libraries, and we'd like to extend it to bait-capture and whole-genome libraries. The approach uses synthetic exomes for regions of interest (e.g., actionable mutations). These consist of competitive template spike-ins (the entire exome plus 150 bp of flanking intronic sequence, differing from the native sequence by a series of 2 adjacent unique bp changes every 50 bp) that are added to the sample prior to library processing. For amplicon libraries, the spike-ins maintain the spike-in : native template ratio during library construction and mimic the type and frequency of base errors introduced by library preparation and sequencing. The spike-ins enable accurate estimates of the number of input templates and of NGS errors, which are used to generate a 95% confidence estimate for every mutation in every sample across the area of interest. Creating a pipeline for amplicon libraries was straightforward, but a pipeline for randomly fragmented libraries is much more complicated.
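To make the spike-in design concrete, here is a minimal Python sketch of deriving a spike-in sequence from a native one by changing 2 adjacent bases every 50 bp. The function name `make_spike_in`, the `SWAP` substitution table, and the placement of each changed pair at the end of each 50 bp window are all assumptions made for illustration; the post does not say which bases are substituted or exactly where each pair sits.

```python
# Hypothetical substitution rule: the actual base changes used in the
# spike-ins are not specified, so a simple fixed swap is assumed here.
SWAP = {"A": "C", "C": "A", "G": "T", "T": "G"}


def make_spike_in(native, period=50, run=2):
    """Return a copy of `native` with `run` adjacent bases changed
    in every `period`-bp window (at the end of each window)."""
    seq = list(native.upper())
    # First changed pair ends at position `period`, then every `period` bp.
    for start in range(period - run, len(seq) - run + 1, period):
        for i in range(start, start + run):
            # Bases outside A/C/G/T (e.g. N) are left unchanged.
            seq[i] = SWAP.get(seq[i], seq[i])
    return "".join(seq)


native = "ACGT" * 30  # 120 bp toy sequence
spike = make_spike_in(native)
changed = [i for i, (a, b) in enumerate(zip(native, spike)) if a != b]
print(changed)
```

With this spacing, any read of at least 50 bp lying entirely within the marked region overlaps at least one changed pair, which is what would let spike-in reads be told apart from native reads.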
My naive guess at a solution: if it's possible to add a virtual reference chromosome consisting of the spike-in sequences, one could drive separate alignment of native template and spike-in reads and, from that, generate counts and MAF estimates.
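A rough Python sketch of that idea, under stated assumptions: spike-in sequences are appended to the reference as extra contigs with a `_spike` suffix (a hypothetical naming convention), reads are aligned against the combined reference with whatever aligner is in use, and each alignment is then tallied by the contig it landed on. The helper names `combined_fasta` and `count_by_origin` and the `min_mapq` cutoff are illustrative, not part of any existing pipeline.

```python
from collections import Counter

SPIKE_SUFFIX = "_spike"  # hypothetical naming convention for spike-in contigs


def combined_fasta(native_contigs, spike_contigs):
    """Build FASTA text holding the native contigs followed by the
    spike-in contigs, the latter renamed with SPIKE_SUFFIX."""
    records = [f">{name}\n{seq}" for name, seq in native_contigs.items()]
    records += [f">{name}{SPIKE_SUFFIX}\n{seq}" for name, seq in spike_contigs.items()]
    return "\n".join(records) + "\n"


def count_by_origin(alignments, min_mapq=1):
    """Tally (contig_name, mapq) pairs as native, spike_in, or ambiguous.

    Reads below `min_mapq` are counted as ambiguous (assumed cutoff):
    a read that overlaps no spike-in marker fits both contigs equally well.
    """
    counts = Counter()
    for contig, mapq in alignments:
        if mapq < min_mapq:
            counts["ambiguous"] += 1
        elif contig.endswith(SPIKE_SUFFIX):
            counts["spike_in"] += 1
        else:
            counts["native"] += 1
    return counts


fasta = combined_fasta({"chr7": "ACGT"}, {"chr7": "ACCA"})
toy = [("chr7", 60), ("chr7_spike", 60), ("chr7_spike", 40), ("chr7", 0)]
counts = count_by_origin(toy)
print(fasta, dict(counts))
```

The per-origin counts would then feed the spike-in : native ratio. How ambiguous reads should be handled for randomly fragmented libraries is exactly the open question here.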
This pipeline is being developed to perform the analysis for the FDA SEQC2 project. I'm also planning on submitting an SBIR for this method, so there could be funding available. I'd love to collaborate with someone on this project!