Question

Simulating whole gene deletions for tuning a WES based CNV calling approach?

3.8 years ago

curious ▴ 750

I have whole-exome sequencing (WES) data from thousands of samples. I want to reliably detect whole-gene deletions of a single gene of interest, which according to very good literature should occur 3% of the time. My impression is that CNV calling from WES data is unreliable because of uneven coverage, so I want some form of quality control.

If I had a few known true positives and true negatives, I could evaluate the quality of a calling approach. I don't have any.

My idea today was to simulate true positives (double all reads over the gene) and true negatives (randomly delete 95% of reads over the gene), then use these for QC.

It seems almost too simple. Is there any reason in the technology or biology that would make this a terrible idea?
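To make the read-removal part of the idea concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: reads are represented as hypothetical `(name, start, end)` tuples rather than real BAM records, the function names are made up for this example, and the keep/drop decision is derived from a hash of the read name so that mates sharing a name are treated the same way. With `keep=0.05` it mirrors "delete 95% of reads"; in a diploid sample a heterozygous deletion would be expected to leave roughly half the reads, so `keep=0.5` may be the more realistic setting for that case.

```python
import hashlib


def name_to_unit_interval(read_name, seed="sim1"):
    """Map a read name to a reproducible number in [0, 1).

    Hashing the name (instead of drawing a random number per read) means
    both mates of a pair, which share a name, get the same value.
    """
    digest = hashlib.md5(f"{seed}:{read_name}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def simulate_deletion(reads, gene_start, gene_end, keep=0.5, seed="sim1"):
    """Thin reads overlapping [gene_start, gene_end) to mimic a deletion.

    reads: iterable of (read_name, start, end), half-open coordinates
           (a hypothetical stand-in for real alignment records).
    keep:  fraction of overlapping reads to retain.
    Reads outside the gene are always retained. If only one mate of a
    pair overlaps the gene, its partner outside the gene is still kept.
    """
    retained = []
    for name, start, end in reads:
        overlaps = start < gene_end and end > gene_start
        if overlaps and name_to_unit_interval(name, seed) >= keep:
            continue  # drop this read to lower coverage over the gene
        retained.append((name, start, end))
    return retained


# Toy example: 2000 reads inside a gene at 10_000-20_000, 100 reads outside.
inside = [(f"in{i}", 10_000 + (i % 9_000), 10_100 + (i % 9_000)) for i in range(2000)]
outside = [(f"out{i}", 50_000 + i, 50_100 + i) for i in range(100)]
thinned = simulate_deletion(inside + outside, 10_000, 20_000, keep=0.5)
```

Because this only rescales existing coverage, it keeps the local coverage profile of the real sample but does not model anything else a real deletion might change (for example breakpoints or allele frequencies of variants in the region).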

WES • 957 views


Hi curious,

In my experience, biology and real data are usually muuuuuch more complex than most simulations, and there will always be sources of variation you won't take into account that will impact your results. If possible, I'd suggest trying to find at least a few positive and negative examples in public datasets. If that is not possible, you could run one or more CNV calling algorithms on your dataset, select a few good-looking samples as positives and negatives, check them in IGV, and validate them with MLPA or a similar method. Then work from there. The problem is that this approach will likely give you only "easy" true positives and "easy" true negatives.

We had a problem with the number of false positives in our datasets and developed an R package, CNVfilteR, that leverages point mutations to identify them.
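As a rough illustration of the general idea of using point mutations to flag a suspicious deletion call (this is not CNVfilteR's API or its scoring model, and it is written in Python rather than R): if one copy of a region is truly deleted, variants inside it should look homozygous, with allele frequencies near 1, so several clearly heterozygous variants argue against the call. The function name and the thresholds `low`, `high`, and `min_het` are assumptions chosen for this sketch.

```python
def deletion_looks_false_positive(variant_afs, low=0.3, high=0.7, min_het=3):
    """Flag a heterozygous-deletion call contradicted by heterozygous SNVs.

    variant_afs: allele frequencies (0-1) of variants called inside the
                 putative deletion in the same sample.
    low, high:   hypothetical bounds for "looks heterozygous".
    min_het:     hypothetical number of heterozygous-looking variants
                 required before the call is flagged.
    """
    het_like = [af for af in variant_afs if low <= af <= high]
    return len(het_like) >= min_het


# Three variants near 0.5 inside a called deletion: suspicious call.
suspicious = deletion_looks_false_positive([0.48, 0.52, 0.45, 0.98])
# Only near-homozygous variants: nothing contradicts the deletion.
consistent = deletion_looks_false_positive([0.99, 1.0, 0.97])
```

A check like this can only flag calls in regions that happen to contain variants, so for a single gene it may often be uninformative.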

Hope this helps!

Bernat

ADD REPLY • link 3.8 years ago by bernatgel ★ 3.4k

Interestingly enough, it looks like someone has already built a tool, "bamgineer", based on a concept similar to the one I was describing:

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006080

"read pairs sampled from existing reads and modified to contain SNPs of the haplotype of interest. This approach retains biases of the original data such as local coverage, strand bias, and insert size."

ADD REPLY • link 3.8 years ago by curious ▴ 750