I would like to model the performance of a structural variant (SV) caller that is being developed in my lab for NGS technologies. One of its strengths is the ability to integrate all available alignment "signals" (e.g., discordant alignments, split-reads, etc.) when calling SVs, and as such, we observe big gains in sensitivity.
To demonstrate the utility of the software in the context of cancer genomics, we would like to show its ability to detect SVs at different cellular frequencies (e.g., found in 0.1%, 1%, 2%, 5%, 20% of tumor cells) and at different overall sequencing depths (e.g., 5X, 10X, 20X). For example, we want to assess our power to detect an SV present at 5% frequency in the tumor with 20X coverage, and compare this sensitivity to that of other tools.
Now, we have the ability to simulate a FASTA file with chromosomes carrying variants at different frequencies. For example, to simulate a variant present in 10% of the cells, one could create a FASTA file with 9 copies of the wild-type chromosome and 1 copy of the mutant chromosome containing the SV.
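For concreteness, here is a minimal sketch of how such a mixed FASTA could be built. The toy sequences, the output name `mixture.fa`, and the header scheme are all illustrative assumptions, not output from any particular tool:

```python
# Sketch: build a FASTA that mixes wild-type and mutant chromosome copies,
# so a variant at 10% cell frequency is represented by 1 of 10 copies.
# Sequences and the file name "mixture.fa" are illustrative only.

def write_mixture_fasta(path, wt_seq, mut_seq, n_wt=9, n_mut=1, width=60):
    """Write n_wt wild-type and n_mut mutant copies, each under a unique header
    so a read simulator treats them as separate sequences."""
    with open(path, "w") as fh:
        records = [("wt", wt_seq)] * n_wt + [("mut", mut_seq)] * n_mut
        for i, (label, seq) in enumerate(records):
            fh.write(f">chr1_copy{i}_{label}\n")
            for j in range(0, len(seq), width):
                fh.write(seq[j:j + width] + "\n")

# Toy example: the "mutant" carries a small deletion relative to wild type.
wt = "ACGT" * 50                # 200 bp wild-type chromosome
mut = wt[:80] + wt[120:]        # 40 bp deletion starting at position 80
write_mixture_fasta("mixture.fa", wt, mut, n_wt=9, n_mut=1)
```

A simulator that samples uniformly from this file will, in expectation, draw 10% of its reads from the mutant copy.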
My question is: what read simulation tools can be used to guarantee that, if we ask for, say, 20X coverage, reads are sampled uniformly at random across the 10 versions of the chromosome above? In essence, we want to ensure that we don't have false negatives owing to a lack of data from the mutant chromosome: we want to measure algorithmic false negatives, not data false negatives.
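As a back-of-the-envelope check on that concern: under uniform sampling, the mutant copies receive roughly frequency × total depth in coverage, and even with perfectly uniform sampling a low-frequency variant can lack supporting reads purely by chance. A sketch of the arithmetic (the function names and the Poisson approximation are my own; the numbers match the 5%/20X scenario above):

```python
import math

def reads_for_depth(target_depth, haploid_len, read_len, paired=True):
    """Number of reads (or read pairs) needed for a target aligned depth."""
    bases_per_unit = 2 * read_len if paired else read_len
    return math.ceil(target_depth * haploid_len / bases_per_unit)

def p_no_read_at_breakpoint(total_depth, variant_freq):
    """Poisson approximation: probability that zero mutant-derived reads
    cover the SV breakpoint, i.e. an unavoidable 'data false negative'."""
    mutant_depth = total_depth * variant_freq
    return math.exp(-mutant_depth)

# Example: 100 bp paired-end reads, 1 Mb haploid region, 20X total depth.
pairs = reads_for_depth(20, 1_000_000, 100)   # 100,000 pairs for 20X
p_miss = p_no_read_at_breakpoint(20, 0.05)    # ~0.37 at 5% frequency, 20X
```

So at 5% frequency and 20X, the mutant allele sits at ~1X, and a substantial fraction of simulated replicates would have no read spanning the breakpoint at all, regardless of the caller. Separating this sampling effect from algorithmic misses may require either many replicates or conditioning on breakpoints that received at least one mutant read.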
Is there a read simulation tool that can handle this?
Also, if you can think of a better way to do this simulation, or of a tool specifically designed for it, I would be grateful to hear about it.