Question

Recurrent Segmentation Errors

6.9 years ago

AndyW • 0

I'm currently using cnvkit on around 100 paired exome samples. I ran a GISTIC analysis and found some segment locations amplified or deleted across a significant number of my cohort. These are almost certainly errors and I've found two possible reasons.

The first is that I have vertical lines on the VAF/BAF plot in some regions. I narrowed one down to a region of MUC genes on chr7 q22.1 (MUC3A, MUC12, MUC17 etc.). The corresponding log2 values at this position are also highly variable, and the segmentation algorithm regularly calls an amplification at this location.

The second is that some locations have a high number of target regions in close proximity and generally uneven coverage. One example I have found in my data is FLG2. The cnvkit algorithm seems to exclude many of these targets for having a spread greater than 1; however, a few fall below the threshold and are kept. The coverage distribution here is uneven, so this region is almost always called in error.

I have tried suggestions for filtering false positives, such as segmetrics, but this doesn't filter out these error regions. I could just manually find the locations and exclude them in the access command (as I have done for the HLA region), however I'd prefer a more automatic approach to identifying low confidence positions (in case I miss some which are less clear). Does an access file exist specifically for whole exome data with further low confidence regions (including problematic genes) already identified?

Best, Andy

cnvkit • 1.4k views

updated 6.8 years ago by Eric T. ★ 2.8k • written 6.9 years ago by AndyW • 0

score 0 · Answer 1 · 2017-07-13

Did you use a single pooled reference, or a separate paired reference for each T/N pair? If your exome samples were all prepared with the same kits, you will probably get better results with a pooled reference, rather than paired references. These highly variable regions won't be properly recognized as such in a paired reference, but with pooling, the reference should record that the variance in coverages there is high, and mask out or downweight the dodgy bins accordingly. This affects both of the issues you saw -- it's CNVkit's main automated way of masking out low-confidence regions.
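To make the pooled-versus-paired point concrete, here is a minimal sketch of the idea, not CNVkit's actual code. It estimates a per-bin spread across several normals using a scaled median absolute deviation, which is a stand-in estimator chosen for simplicity; CNVkit's own estimator and weighting are not reproduced here. The function names, the toy log2 values, and the threshold of 1 (borrowed from the question) are all assumptions for illustration.

```python
import statistics


def bin_spread(log2_by_sample):
    """Robust per-bin spread across normal samples.

    log2_by_sample: list of per-sample lists, one log2 value per bin.
    Returns 1.4826 * MAD per bin (an illustrative estimator only).
    """
    n_bins = len(log2_by_sample[0])
    spreads = []
    for i in range(n_bins):
        values = [sample[i] for sample in log2_by_sample]
        med = statistics.median(values)
        mad = statistics.median([abs(v - med) for v in values])
        spreads.append(1.4826 * mad)
    return spreads


def flag_noisy(spreads, threshold=1.0):
    """Indices of bins whose spread exceeds the threshold."""
    return [i for i, s in enumerate(spreads) if s > threshold]


# Toy data: 4 normals x 3 bins; the middle bin is deliberately erratic.
normals = [
    [0.02, -0.9, 0.10],
    [-0.05, 1.4, 0.00],
    [0.01, -1.8, -0.08],
    [0.03, 2.1, 0.05],
]

spreads = bin_spread(normals)
noisy = flag_noisy(spreads)

# With a single (paired) normal there is nothing to measure spread against.
paired_spreads = bin_spread([normals[0]])
```

With the pooled toy data only the erratic middle bin is flagged, while the single-normal case yields a spread of zero everywhere, so the same bin would pass unnoticed.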

If you did use a pooled reference with all 100 normal samples, and segmetrics --ci + call --filter ci didn't do the trick, then it's a good idea to just mask out the known problematic regions manually. There are BED files available in the cnvkit-examples repo that list low-mappability regions (wgEncode*.bed), and I usually exclude these with access -x. But if you know a priori that parts of the genome are problematic and not caught by any other indicators, don't be shy about squashing them at the beginning of the pipeline.
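If you end up combining a low-mappability BED with your own hand-picked regions, one way to keep the exclusion list tidy is to merge everything into a single sorted, non-overlapping BED before handing it to access -x. The sketch below is only an illustration: the helper names are hypothetical, and the coordinates and labels are placeholders, not real positions of any gene.

```python
def merge_bed(lines):
    """Parse BED lines and merge overlapping or touching intervals per chromosome."""
    intervals = []
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and UCSC header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        intervals.append((chrom, int(start), int(end)))
    intervals.sort()
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def to_bed(intervals):
    """Format merged intervals as 3-column BED text."""
    return "".join(f"{c}\t{s}\t{e}\n" for c, s, e in intervals)


# Placeholder inputs: stand-ins for a low-mappability BED and a manual list.
mappability = ["chr1\t100\t500", "chr7\t1000\t2000"]
manual = [
    "chr7\t1500\t3000\tplaceholder_noisy_cluster",
    "chr1\t600\t700\tplaceholder_gene",
]

merged = merge_bed(mappability + manual)
bed_text = to_bed(merged)
```

The resulting text can be written to a file and supplied as the exclusion BED. Keeping the manual list in its own file makes it easy to add regions later without touching the downloaded mappability tracks.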