Simulating whole gene deletions for tuning a WES-based CNV calling approach?
3.4 years ago
curious ▴ 730

I have data from thousands of samples that have undergone WES. I am interested in reliably detecting whole gene deletions of one single gene of interest, which according to very good literature should occur 3% of the time. My impression is that CNV calling from WES data has some issues due to uneven coverage, so I want to have some quality control.

If I had a couple of true positives and negatives, I could evaluate quality of a calling approach. I don't have these.

My idea today was to simulate true positives (double all reads over the gene) and true negatives (randomly delete 95% of reads over the gene), then use these for QC.

It seems almost too simple. Is there any reason in the technology or biology that would make this a terrible idea?
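To make the read-removal part of the idea concrete, here is a minimal sketch in plain Python. It works on toy (start, end) tuples rather than a real BAM, and the function name, coordinates and read counts are all made up for illustration. The fraction of reads to drop is a parameter, so either the 95% mentioned above or roughly 50% (what a heterozygous whole-gene deletion would be expected to remove) can be tried.

```python
import random


def simulate_deletion(reads, gene_start, gene_end, drop_fraction=0.5, seed=0):
    """Return a copy of `reads` with a random fraction of gene-overlapping reads removed.

    reads: list of (start, end) tuples on one chromosome, half-open coordinates.
    drop_fraction: probability of discarding each read that overlaps the gene.
    """
    rng = random.Random(seed)
    kept = []
    for start, end in reads:
        overlaps = start < gene_end and end > gene_start
        if overlaps and rng.random() < drop_fraction:
            continue  # this read is "deleted"
        kept.append((start, end))
    return kept


def count_overlapping(reads, gene_start, gene_end):
    """Count reads that overlap the half-open interval [gene_start, gene_end)."""
    return sum(1 for s, e in reads if s < gene_end and e > gene_start)


# Toy data: 5,000 reads of length 100 placed uniformly on a 10 kb region.
toy_rng = random.Random(1)
reads = [(s, s + 100) for s in (toy_rng.randrange(0, 10_000) for _ in range(5_000))]
gene = (4_000, 6_000)  # hypothetical gene coordinates

simulated = simulate_deletion(reads, *gene, drop_fraction=0.5)
before = count_overlapping(reads, *gene)
after = count_overlapping(simulated, *gene)
print(f"reads over gene: {before} -> {after}")
```

Because the kept reads are sampled from the original data, local coverage biases are preserved, but anything the real deletion would change beyond depth (for example breakpoint-spanning reads) is not modelled.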

WES • 876 views

Hi curious,

In my experience, biology and real data are usually muuuuuch more complex than most simulations, and there will always be sources of variation you won't take into account that will impact your results. If possible, I'd suggest trying to find at least a few positive and negative examples in public datasets. If this is not possible, you could try running one or more CNV calling algorithms on your dataset, selecting a few good-looking samples as positives and negatives, checking them in IGV, and validating them with MLPA or similar, then working from there. The problem is that this approach will likely give you only "easy" true positives and "easy" true negatives.
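For the "select a few good-looking samples" step, a rough first pass could look like the sketch below. It is not a CNV caller, just a way to rank samples by normalized depth over the gene before looking at them in IGV. The function names, sample IDs and depth values are all invented for illustration; in practice the depths would come from whatever coverage tool you already use.

```python
from statistics import median


def gene_depth_ratios(gene_depth, background_depth):
    """Normalize per-sample gene depth by background depth, then by the cohort median.

    gene_depth, background_depth: dicts mapping sample ID -> mean depth.
    Returns a dict of sample ID -> ratio, where about 1.0 suggests two copies
    and about 0.5 suggests a heterozygous deletion.
    """
    normalized = {s: gene_depth[s] / background_depth[s] for s in gene_depth}
    cohort_median = median(normalized.values())
    return {s: v / cohort_median for s, v in normalized.items()}


def pick_candidates(ratios, n=3):
    """Return (lowest-ratio samples, samples closest to a ratio of 1.0)."""
    positives = sorted(ratios, key=ratios.get)[:n]
    negatives = sorted(ratios, key=lambda s: abs(ratios[s] - 1.0))[:n]
    return positives, negatives


# Made-up depths: s4 has roughly half the expected coverage over the gene.
gene_depth = {"s1": 98.0, "s2": 105.0, "s3": 100.0, "s4": 52.0, "s5": 101.0}
background_depth = {"s1": 100.0, "s2": 100.0, "s3": 100.0, "s4": 100.0, "s5": 100.0}

ratios = gene_depth_ratios(gene_depth, background_depth)
positives, negatives = pick_candidates(ratios, n=1)
print("candidate deletion:", positives, "candidate negative:", negatives)
```

Samples picked this way are exactly the "easy" cases, so they are better suited to a sanity check than to estimating sensitivity.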

We had a problem with the number of false positives in our datasets and developed an R package, CNVfilteR, that leverages point mutations to identify them.

Hope this helps!



Interestingly enough, it looks like someone has already built a tool, "bamgineer", around a concept similar to the one I was describing:

"read pairs sampled from existing reads and modified to contain SNPs of the haplotype of interest. This approach retains biases of the original data such as local coverage, strand bias, and insert size."

