Correcting odd GC bias in whole-exome CNV calling
7.4 years ago

This image shows the log2 ratio of tumor coverage over control coverage for each exon capture region. As you can see, there is a clear bias with GC content, which makes segmentation useless. Both samples are of good quality and were sequenced using the same exon capture kit (but in different sequencing batches), and contamination is unlikely (judging by the mapping rate, which is near 100% to GRCm38). The control tissue is from liver.

Any suggestions on how I could try to correct the bias? I would expect the log2 fold changes to have equal variance across the GC content range.

I should also add that the CNV algorithms I've tried do correct for GC bias, but their corrections were not sufficient.

GC bias

sequencing • 2.5k views

There are generally more segments with intermediate GC content, so wouldn't you expect an increase in variance in the middle? (Granted, there are a lot of points around 0.6 and the variance is low there, so perhaps I'm overblowing this.) What seems odd to me is that the variance is asymmetric. Presuming you have a BED file with the capture probe coordinates, you could try making a blacklist out of its complement and see what computeGCBias from deepTools outputs. The performance won't be great, since I didn't write the blacklisting code with this in mind, but it will at least give you a better idea of whether you really have an issue.
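As a rough illustration of the "blacklist from the complement" idea, here is a minimal Python sketch that computes the complement of a set of capture intervals. The function name `complement_intervals`, the tuple layout, and the `chrom_sizes` dictionary are assumptions made for this example; in practice a tool such as `bedtools complement` does the same job. Coordinates are taken to be 0-based, half-open, as in BED.

```python
def complement_intervals(regions, chrom_sizes):
    """Return the intervals NOT covered by `regions`.

    regions:     iterable of (chrom, start, end), 0-based half-open (BED style)
    chrom_sizes: dict mapping chromosome name -> length
    (Both argument layouts are assumptions for this sketch.)
    """
    # Collect capture intervals per chromosome.
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))

    complement = []
    for chrom, size in chrom_sizes.items():
        pos = 0  # first position not yet covered by a capture interval
        for start, end in sorted(by_chrom.get(chrom, [])):
            if start > pos:
                complement.append((chrom, pos, start))
            pos = max(pos, end)  # handles overlapping or nested intervals
        if pos < size:
            complement.append((chrom, pos, size))
    return complement


if __name__ == "__main__":
    probes = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 500, 600)]
    sizes = {"chr1": 1000, "chr2": 50}
    # Write the complement as BED lines, e.g. for use as a blacklist file.
    for chrom, start, end in complement_intervals(probes, sizes):
        print(f"{chrom}\t{start}\t{end}")
```

The resulting intervals, written out as a BED file, could then be supplied to computeGCBias as its blacklist so that only the capture regions contribute to the GC profile.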


Thanks for the reply. I am primarily using CNVkit for my analysis, which uses regions outside the capture regions to aid in CNV analysis. I have attached a plot of log2 ratio against GC content for those regions outside the capture regions (i.e. the blacklist regions you suggested). The image in my original post shows only the capture regions.

enter image description here

Also, here is an image from another tumor that works well with the liver control, which is why I don't think the increased variance at different GC levels should be there.

enter image description here

7.4 years ago

I did come up with an ad hoc solution. Of course, the best solution would be to understand why the liver and tumor samples have different GC content profiles, but in case I cannot, I have a couple of workarounds.

In windows of 1% GC, I calculated the variance of the log2 ratios (variance_i, for window i), then computed the median of these per-window variances (variance_median). I then performed the following calculation:

variance_correction_i = variance_median / variance_i, for all i

I then multiplied the log2 ratios in each window i by variance_correction_i.
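The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the exact script used: the function name `gc_variance_correct`, the use of the sample variance, and the decision to leave windows with fewer than two regions (or zero variance) unscaled are all choices made for this example.

```python
from collections import defaultdict
from statistics import median, variance


def gc_variance_correct(log2_ratios, gc_fractions, window=0.01):
    """Scale log2 ratios in each GC window by variance_median / variance_i.

    log2_ratios:  per-region log2(tumor / control) values
    gc_fractions: per-region GC content as a fraction in [0, 1]
    window:       GC window width (0.01 = 1% GC)
    """
    # Group region indices by GC window; window i covers
    # [i * window, (i + 1) * window).
    windows = defaultdict(list)
    for idx, gc in enumerate(gc_fractions):
        windows[int(gc / window)].append(idx)

    # variance_i: sample variance of the log2 ratios in window i.
    # Assumption: windows with < 2 regions or zero variance are skipped.
    var_by_window = {}
    for w, idxs in windows.items():
        if len(idxs) >= 2:
            v = variance([log2_ratios[i] for i in idxs])
            if v > 0:
                var_by_window[w] = v

    variance_median = median(var_by_window.values())

    # Multiply each log2 ratio by variance_correction_i for its window.
    corrected = list(log2_ratios)
    for w, v in var_by_window.items():
        variance_correction = variance_median / v
        for i in windows[w]:
            corrected[i] = log2_ratios[i] * variance_correction
    return corrected


if __name__ == "__main__":
    ratios = [-1.0, 1.0, -2.0, 2.0, -4.0, 4.0]
    gc = [0.305, 0.305, 0.455, 0.455, 0.605, 0.605]
    print(gc_variance_correct(ratios, gc))
```

One thing to keep in mind: multiplying values by a factor scales their variance by the square of that factor, so using the square root of variance_median / variance_i would bring each window's variance exactly to the median. The sketch follows the formula as written above.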

This led to much better segmentation results. See the image below, which is the same data as the first image but corrected.

enter image description here
