Question: Noisy germline CNV data using CNVkit
4 months ago by
Penn State College of Medicine
omg what am I doing60 wrote:

Context: I'm a new CNVkit user (version 0.9.6). I have 6 exome-seq samples (from saliva DNA) on which I want to call germline copy number variants. Two samples are from healthy individuals and the other four are from individuals with breast cancer; all six belong to the same extended family pedigree.

What I did so far: I ran the standard CNVkit pipeline with a flat reference, passing the access-5kb-mappable.hg19.bed file mentioned in the docs to the --access option to exclude poorly mappable regions. I then ran segmetrics with the 'ci' option and made a scatter plot (scatter plot output seen here). I also ran the pipeline a second time, using my 2 healthy samples to build a pooled reference and comparing the individuals with breast cancer in the same family against it (scatter plot output seen here). The data look very noisy in both cases (see the plots linked above).
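
For anyone trying to reproduce this, the first run corresponds roughly to the commands sketched below. This is a minimal sketch rather than my exact invocation: the helper name flat_reference_commands, the sample name, the BED/FASTA file names, and the output directory are all placeholders I made up for illustration. It only builds and prints the command lines; it does not run CNVkit.

```python
import shlex


def flat_reference_commands(sample, targets_bed, fasta, access_bed, outdir="results"):
    """Build CNVkit command lines for one sample with a flat reference.

    Nothing is executed here; the function only returns argument lists.
    All file names are hypothetical placeholders.
    """
    batch = [
        "cnvkit.py", "batch", f"{sample}.bam",
        "--normal",  # no normal BAMs listed, so CNVkit builds a flat reference
        "--targets", targets_bed,
        "--fasta", fasta,
        "--access", access_bed,  # e.g. access-5kb-mappable.hg19.bed
        "--output-reference", f"{outdir}/flat_reference.cnn",
        "--output-dir", outdir,
    ]
    # Add confidence intervals to each segment.
    segmetrics = [
        "cnvkit.py", "segmetrics", f"{outdir}/{sample}.cnr",
        "-s", f"{outdir}/{sample}.cns",
        "--ci",
        "-o", f"{outdir}/{sample}.segmetrics.cns",
    ]
    # Plot bin-level log2 ratios with the segments overlaid.
    scatter = [
        "cnvkit.py", "scatter", f"{outdir}/{sample}.cnr",
        "-s", f"{outdir}/{sample}.segmetrics.cns",
        "-o", f"{outdir}/{sample}-scatter.png",
    ]
    return [batch, segmetrics, scatter]


if __name__ == "__main__":
    for cmd in flat_reference_commands(
        "Sample1", "targets.bed", "hg19.fa", "access-5kb-mappable.hg19.bed"
    ):
        print(shlex.join(cmd))
```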


  1. Can anybody help me understand why my data are so noisy? I suspect it has to do with the reference (or rather the lack of a good one). I want to rerun this with a reference built from exome-seq samples that are unrelated to breast cancer, but I am not sure how much that would help, or whether the reference is the problem to begin with.

  2. What do the grey and orange bars/lines on the scatter plot represent? In the example plots I have seen, there are only red bars, which are supposed to be the "segmentation line", but I am not entirely sure what that means or why my plots have bars in two colors, neither of them red. I am using v0.9.6 of CNVkit.

Any help is greatly appreciated. Thank you

4 days ago by
brunobsouzaa20 wrote:

I've been using CNVkit for a few months now and, in my limited experience, I don't recommend using a flat reference. A pooled reference built from at least 15 samples works best. I'm now using a pooled reference built from 161 samples, and everything is working fine for me.
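
A rough sketch of what that looks like in practice, assuming the standard batch command is used to build the pooled reference from normal BAMs. The helper name pooled_reference_command and all file names are hypothetical, and the default of 15 normals is just the rule of thumb above, not a CNVkit requirement. The function only builds the command line and does not run it.

```python
def pooled_reference_command(case_bams, normal_bams, targets_bed, fasta,
                             access_bed, outdir="results", min_normals=15):
    """Build a `cnvkit.py batch` command that pools normals into a reference.

    min_normals encodes a rule of thumb, not a CNVkit limit; lower it
    explicitly if you want to proceed with fewer normal samples.
    """
    if len(normal_bams) < min_normals:
        raise ValueError(
            f"only {len(normal_bams)} normal samples; "
            f"rule of thumb suggests at least {min_normals}"
        )
    return [
        "cnvkit.py", "batch", *case_bams,
        "--normal", *normal_bams,  # these BAMs are pooled into the reference
        "--targets", targets_bed,
        "--fasta", fasta,
        "--access", access_bed,
        "--output-reference", f"{outdir}/pooled_reference.cnn",
        "--output-dir", outdir,
    ]


if __name__ == "__main__":
    normals = [f"normal_{i:02d}.bam" for i in range(1, 16)]
    cmd = pooled_reference_command(
        ["case_01.bam"], normals, "targets.bed", "hg19.fa",
        "access-5kb-mappable.hg19.bed",
    )
    print(" ".join(cmd))
```

Once pooled_reference.cnn exists, it can be reused for later samples by passing it to batch with -r instead of listing the normals again.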
