tumor/normal WGS with CNVkit has many small segments with small copy number changes
4.3 years ago
swheelan ▴ 10

Hi! We've used CNVkit fairly extensively and are quite happy with it, thanks. We now have rat tumor/normal WGS samples, and CNVkit generates many, many small (~20 kb) segments that oscillate between copy number 0 and copy number -0.5. The confidence intervals of the two groups of segments do not overlap each other, and for the segments around -0.5 the confidence intervals do not overlap zero. The coverage is good, the alignments were good, and nothing else seems odd about these samples, but this output is pretty strange. Any suggestions? Thanks!

You can post the figure generated by CNVkit.

I haven't tried posting pictures before, so apologies if this didn't work, but here's a shot of the scatter plot.

Are the tumor and normal samples from the same rat? Or just from the same strain? There can be more heterogeneity than you expect sometimes.
I'm still not sure whether that would explain that result, though...

They're from the same rat.

Interestingly, we just ran Control-FREEC on the same data and got a small handful of discrete called CNVs, everything else normal ploidy.

4.3 years ago
Eric T. ★ 2.7k

How did you run CNVkit? It looks as if it was run with two separate coverage profiles (genic and intergenic?) that were not normalized to each other when combining, so half the bins have log2 values shifted downward by 0.5. This could have happened if you ran batch with WGS data but did not use -m wgs, for example.
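One way to test this idea is to compare the median bin-level log2 value of annotated (genic) bins against unannotated bins in the .cnr file. The sketch below is only illustrative: it assumes the .cnr is tab-separated with `gene` and `log2` columns, and that unannotated bins are labeled `-` or `Antitarget`, which may not match every CNVkit version or annotation file. The function name and the toy rows are made up for the example.

```python
import csv
import io
from statistics import median


def median_log2_by_group(cnr_text, intergenic_labels=("-", "Antitarget")):
    """Return the median log2 of genic and intergenic bins in .cnr-style text.

    Assumption: bins whose `gene` field is in `intergenic_labels` are
    unannotated; every other bin is treated as genic.
    """
    reader = csv.DictReader(io.StringIO(cnr_text), delimiter="\t")
    groups = {"genic": [], "intergenic": []}
    for row in reader:
        key = "intergenic" if row["gene"] in intergenic_labels else "genic"
        groups[key].append(float(row["log2"]))
    # Skip empty groups so median() is never called on an empty list.
    return {name: median(values) for name, values in groups.items() if values}


# Toy rows (invented, not real data) that mimic the pattern in the question.
rows = [
    ["chromosome", "start", "end", "gene", "depth", "log2", "weight"],
    ["chr1", "0", "5000", "GeneA", "30.0", "0.01", "0.9"],
    ["chr1", "5000", "10000", "-", "21.0", "-0.52", "0.9"],
    ["chr1", "10000", "15000", "GeneA", "30.2", "-0.02", "0.9"],
    ["chr1", "15000", "20000", "-", "21.4", "-0.48", "0.9"],
    ["chr1", "20000", "25000", "GeneB", "29.9", "0.00", "0.9"],
    ["chr1", "25000", "30000", "-", "21.2", "-0.50", "0.9"],
]
example = "\n".join("\t".join(row) for row in rows)

result = median_log2_by_group(example)
print(result)
```

If the two medians in a real .cnr differ by roughly 0.5 while each group is tight around its own median, that would point to two coverage profiles that were not normalized to each other rather than to real copy number changes.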

Hmm, that would be interesting indeed. Here's the command (paths and names shortened):

    cnvkit.py batch tumor.sorted_RG_noDup.bam --normal normal.sorted_RG_noDup.bam -m wgs \
        --fasta /path/rat/rn6/rn6.fa --annotate /path/rat/rn6/rn6_UCSC_gene_merged.bed \
        --output-reference tn.reference.cnn --output-dir ./tvsn

Have you used normal.sorted_RG_noDup.bam as a reference elsewhere? If there is something odd about the coverage in that normal sample in particular, that could shift the normalized log2 ratios for tumor.sorted_RG_noDup.bam. Contamination with an enriched exome in either sample would also yield similarly weird results. I recommend using a pooled reference of multiple normal samples to reduce the risk of this happening and to generally reduce noise in the results.
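To illustrate why a pooled reference helps, here is a toy sketch with invented numbers. It uses a plain per-bin median across normals as a stand-in for pooling; this is a simplification for illustration and not CNVkit's actual reference calculation. All function names and values are hypothetical.

```python
from statistics import median


def log2_ratios(tumor_log2, reference_log2):
    """Subtract the reference log2 coverage from the tumor, bin by bin."""
    return [t - r for t, r in zip(tumor_log2, reference_log2)]


def pooled_reference(normals):
    """Per-bin median across several normals (simplified pooling)."""
    return [median(bin_values) for bin_values in zip(*normals)]


# Invented log2 coverage for four bins; the tumor is flat (no real CNVs).
tumor = [0.0, 0.0, 0.0, 0.0]
odd_normal = [0.0, 0.5, 0.0, 0.5]  # every other bin has inflated coverage
normal_b = [0.0, 0.0, 0.0, 0.0]
normal_c = [0.1, -0.1, 0.0, 0.0]

single = log2_ratios(tumor, odd_normal)
pooled = log2_ratios(tumor, pooled_reference([odd_normal, normal_b, normal_c]))

print("single normal:", single)
print("pooled normals:", pooled)
```

With the single odd normal, the tumor ratios alternate between 0 and -0.5 even though the tumor itself is flat; with the pooled reference, the odd sample is outvoted in each bin and the artifact disappears.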

You can also try using flasso as the segmentation method (with the segment command) instead of the default cbs, as fused lasso seems to work better on large datasets like WGS.
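As a rough sketch, re-segmenting an existing .cnr with fused lasso could be scripted like this. The input and output file names are assumptions based on the batch command above (CNVkit's actual output names in ./tvsn may differ), and the helper function is hypothetical. The command is only built and printed here, not executed.

```python
import shlex


def segment_command(cnr_path, out_path, method="flasso"):
    """Build the argument list for `cnvkit.py segment` with a chosen method."""
    return ["cnvkit.py", "segment", cnr_path, "-m", method, "-o", out_path]


# Assumed file names; adjust to whatever batch actually wrote to ./tvsn.
cmd = segment_command(
    "tvsn/tumor.sorted_RG_noDup.cnr",
    "tvsn/tumor.sorted_RG_noDup.flasso.cns",
)
print(shlex.join(cmd))

# To actually run it (requires CNVkit on PATH and the R dependencies
# that the flasso method needs):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Writing the fused lasso result to a separate .cns file keeps the original cbs segmentation around for a side-by-side comparison.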

Thanks, we'll try this and I'll check out that reference sample.