Question: Help with understanding CNVkit output
3.2 years ago, alons wrote:

Hi all,

I need some help with understanding the output of CNVkit, specifically Segmented log2 ratios (.cns) and the exported CNVs in VCF format.

I'm looking to get the copy number of every region found by CNVkit.

For Segmented log2 ratios (.cns) file:

Please correct me if I'm wrong: to get the actual estimated copy number, I should simply take the antilog (2^x) of the log2 column, right?

In that case,

  • What's the correlation/connection, if there is any, between the log2 value in the .cns file and the inferred SVLEN value in the .vcf file?

  • Is there any connection to the "CN" (copy number genotype...) value in the .vcf? Also, why does "CN" only appear in duplication events? I tried calculating the copy number from the log2 value in the .cns file, but the values are different from what I expected.

An example if it helps:

For a certain region in the .cns file, the log2 value is 0.382191.
In the .vcf file, SVLEN is 6812878 and the CN value is 3. What is the copy number then?


Tags: cancer, cnv, ngs, cnvkit
3.2 years ago, Eric T. (San Francisco, CA) wrote:

For segmented log2 ratios -- nearly. See the documentation for CNVkit's call command. But the easiest way to get the integer copy number values is through the "export" command, either VCF or BED.

  • I recommend "export bed" for custom analysis. If you are analyzing tumor samples with some known amount of normal-cell contamination, use "call" or "rescale" first to adjust the log2 ratios for that contamination.
  • The "export vcf" output conforms to the VCF spec, which is a little unintuitive for describing CNVs. The CN field means absolute copy number, but it's only specified for copy number gains, while hemizygous or homozygous deletions are specified differently, as you're seeing.

In your example, a log2 value of 0.38 corresponds to 2^0.38 = 1.3 times the reference ploidy. For a diploid genome, the absolute copy number would be 2 * 2^0.38 = 2.6, which you can probably round up to 3 assuming some reasonable level of normal-cell contamination. (See the documentation on tumor heterogeneity for more guidance on this topic.)
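The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not CNVkit's own code: the function names are made up, and the purity adjustment assumes a simple two-population mixture (observed copies = purity * tumor copies + (1 - purity) * normal copies). The 60% purity value is an arbitrary example, not something from the question.

```python
def copies_from_log2(log2_ratio, ref_copies=2):
    """Absolute (non-integer) copy number implied by a log2 ratio."""
    return ref_copies * 2 ** log2_ratio


def tumor_copies(log2_ratio, purity, ref_copies=2, normal_copies=2):
    """Tumor-cell copy number under an assumed two-population mixture.

    Assumes: observed = purity * tumor + (1 - purity) * normal.
    """
    observed = copies_from_log2(log2_ratio, ref_copies)
    return (observed - (1 - purity) * normal_copies) / purity


cn = copies_from_log2(0.382191)
print(round(cn, 2))  # about 2.61 copies in the bulk sample
print(round(cn))     # nearest integer: 3

# With a hypothetical tumor purity of 60%, the tumor cells alone
# would carry roughly 3 copies.
print(round(tumor_copies(0.382191, purity=0.6), 2))

# The same formula covers losses: log2 = -1 on a diploid reference is 1 copy.
print(copies_from_log2(-1))
```

For real data, CNVkit's "call" command is the supported way to apply this kind of correction; the sketch is just meant to show where the numbers come from.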


Thank you very much for the detailed answer!

I assume that for copy number losses the calculation is the same?

Also, how would I go about finding the "original" (reference) copy number so I could find out exactly what was the copy number before said mutation / aberration?

written 3.1 years ago by alons

Yes, the calculation is the same for copy number losses.

To find the copy number status of the normal sample, just run the same pipeline on it. If the normal sample was included in the CNV reference (cnv_reference.cnn), you can alternatively run the pipeline on the normal sample using a "flat" reference instead.

written 3.1 years ago by Eric T.

Great. If I don't have a normal sample, should I create a "flat" reference from the reference.fa file and then run batch on it to get the normal copy number, so I can calculate the exact copy number loss/gain?

To be more specific, I want to get the original copy number so I could know the number of copies in the ref as opposed to the tumor. For example, I now see that I have 3 times the copy number of the ref, but what was the original copy number and consequently, what's the tumor copy number?

written 3.1 years ago by alons

Yes. The original copy number is the ploidy of your organism, e.g. humans are diploid, 2 copies of each autosome, and the sex chromosomes are XX or XY normally. If you use a flat reference for both tumor and normal, then you can interpret the log2 values as they are. If you used a single normal reference, then you should first check that the normal sample is copy-number-neutral at the location of interest (it probably is) before interpreting the tumor log2 ratio.

Regarding SVLEN -- this is just the length of the altered genomic region, in basepairs. It's not related to the log2 value or copy number.
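To make the distinction concrete, here is a rough sketch that reads a .cns-style table and reports segment length and copy number as two separate quantities. It assumes the usual tab-separated .cns column names (chromosome, start, end, gene, log2, depth, probes, weight) and BED-style coordinates, so that length = end - start. The single row is invented: the coordinates, depth, probes and weight are made up so that the length matches the SVLEN from the question, and the helper name is hypothetical.

```python
import csv
import io

# Invented one-segment example; only the log2 value and the resulting
# length (6812878) come from the question.
CNS_TEXT = (
    "chromosome\tstart\tend\tgene\tlog2\tdepth\tprobes\tweight\n"
    "chr1\t1000000\t7812878\t-\t0.382191\t100.0\t500\t400.0\n"
)


def segment_summaries(handle, ref_copies=2):
    """Yield length and estimated copy number for each segment row."""
    for row in csv.DictReader(handle, delimiter="\t"):
        start, end = int(row["start"]), int(row["end"])
        copies = ref_copies * 2 ** float(row["log2"])
        yield {
            "chromosome": row["chromosome"],
            "length": end - start,          # depends only on coordinates
            "copies": copies,               # depends only on log2
            "rounded_copies": round(copies),
        }


for seg in segment_summaries(io.StringIO(CNS_TEXT)):
    print(seg)
```

The length is computed from the coordinates alone and the copy number from the log2 value alone, which is why the two numbers in the VCF are unrelated.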

written 3.1 years ago by Eric T.

So is it safe to assume that if the log2 < 0 it's a deletion, otherwise a dup?

written 2.7 years ago by brentp

Sort of. If the log2 value is close to 0, it could instead just be noise or imperfect centering. But it is true that the neutral value, i.e. the cutoff between loss and gain, is zero. In array CGH analysis (which CNVkit mimics) it's common to treat log2 values between +/- 0.2 as effectively neutral copy number, and focus on greater deviations from zero.
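As a quick illustration of that rule of thumb, something like the following would label segments. The function name and the labels are made up for this sketch, and the 0.2 band is just the array-CGH convention mentioned above, not a value CNVkit enforces.

```python
def classify(log2_ratio, neutral_band=0.2):
    """Label a segment as gain, loss, or neutral using a symmetric band."""
    if log2_ratio > neutral_band:
        return "gain"
    if log2_ratio < -neutral_band:
        return "loss"
    return "neutral"


for value in (0.382191, 0.05, -0.1, -1.0):
    print(value, classify(value))
```

Noisy samples may need a wider band, so treat the cutoff as a starting point rather than a fixed threshold.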

written 2.7 years ago by Eric T.