Help with understanding CNVkit output
5.8 years ago
alons ▴ 270

Hi all,

I need some help with understanding the output of CNVkit, specifically Segmented log2 ratios (.cns) and the exported CNVs in VCF format.

I'm looking to get the copy number of every region found by CNVkit.

For Segmented log2 ratios (.cns) file:

Please correct me if I'm wrong: to get the actual estimated copy number I should simply anti-log the log2 column, right?

In that case,

• What's the correlation/connection, if there is any, between the log2 value in the .cns file and the inferred SVLEN value in the .vcf file?

• Is there any connection to the "CN" (copy number genotype...) value in the .vcf? Also, why does "CN" only appear in duplication events? I tried calculating the copy number from the log2 value in the .cns file but the values are different from what I expected.

An example if it helps:

For a certain region in the .cns file, the log2 value is 0.382191.
In the .vcf file, SVLEN is 6812878 and the CN value is 3. What is the copy number then?

Thanks,
Alon

cnvkit cnv cancer ngs • 6.9k views
5.8 years ago
Eric T. ★ 2.7k

For segmented log2 ratios -- nearly. See the documentation for CNVkit's call command. But the easiest way to get the integer copy number values is through the "export" command, either VCF or BED.

• I recommend "export bed" for custom analysis. If you are analyzing tumor samples with some known amount of normal-cell contamination, use "call" or "rescale" first to adjust the log2 ratios for that contamination.
• The "export vcf" output conforms to the VCF spec, which is a little unintuitive for describing CNVs. The CN field means absolute copy number, but it's only specified for copy number gains, while hemizygous or homozygous deletions are specified differently, as you're seeing.

In your example, a log2 value of .38 corresponds to 2^(.38) = 1.3 times the reference ploidy. For a diploid genome, the absolute copy number would be 2 * 2^(.38) = 2.6, which you can probably round up to 3 assuming some reasonable level of normal-cell contamination. (See the documentation on tumor heterogeneity for more guidance on this topic.)
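
The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not CNVkit's own code: the function names are made up for this example, and the purity correction uses the common two-population mixing model (tumor cells plus diploid normal cells), which may differ in detail from what "call" does internally.

```python
def absolute_copy_number(log2_ratio, ploidy=2):
    """Unrounded copy number implied by a log2 ratio relative to `ploidy`."""
    return ploidy * 2 ** log2_ratio


def purity_adjusted_copy_number(log2_ratio, purity, ploidy=2):
    """Estimate the tumor-cell copy number given a tumor purity in (0, 1].

    Assumption (not taken from the CNVkit source): the observed signal is
    purity * tumor_cn + (1 - purity) * ploidy, i.e. the contaminating
    normal cells carry `ploidy` copies at this locus.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in the interval (0, 1]")
    observed = absolute_copy_number(log2_ratio, ploidy)
    return (observed - (1 - purity) * ploidy) / purity


if __name__ == "__main__":
    log2 = 0.382191  # value from the example segment in the question
    print(absolute_copy_number(log2))         # about 2.6 for a diploid genome
    print(round(absolute_copy_number(log2)))  # nearest integer copy number
    # The purity value below is purely hypothetical.
    print(purity_adjusted_copy_number(log2, purity=0.6))
```

With purity=1.0 the second function reduces to the first, so the same formula covers both pure and contaminated samples.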

Thank you very much for the detailed answer!

I assume that for copy number losses the calculation is the same?

Also, how would I go about finding the "original" (reference) copy number so I could find out exactly what was the copy number before said mutation / aberration?

Yes, the calculation is the same for copy number losses.

To find the copy number status of the normal sample, just run the same pipeline on it. If the normal sample was included in the CNV reference (cnv_reference.cnn), you can alternatively run the pipeline on the normal sample using a "flat" reference instead.

Great. If I don't have a normal sample should I create a "flat" reference from the reference.fa file and then run batch on it to get the normal copy number so I could calculate the exact copy number loss/gain?

To be more specific, I want to get the original copy number so I could know the number of copies in the ref as opposed to the tumor. For example, I now see that I have 3 times the copy number of the ref, but what was the original copy number and consequently, what's the tumor copy number?

Yes. The original copy number is the ploidy of your organism, e.g. humans are diploid, 2 copies of each autosome, and the sex chromosomes are XX or XY normally. If you use a flat reference for both tumor and normal, then you can interpret the log2 values as they are. If you used a single normal reference, then you should first check that the normal sample is copy-number-neutral at the location of interest (it probably is) before interpreting the tumor log2 ratio.

Regarding SVLEN -- this is just the length of the altered genomic region, in basepairs. It's not related to the log2 value or copy number.

So is it safe to assume that if the log2 < 0 it's a deletion, otherwise a dup?

Sort of. If the log2 value is close to 0 it could instead just be noise or imperfect centering. But it is true that the neutral value, i.e. the cutoff between loss and gain, is zero. In array CGH analysis (which CNVkit mimics) it's common to treat log2 values between +/- 0.2 as effectively neutral copy number, and focus on greater deviations from zero.
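
As a rough sketch of that rule of thumb, a segment can be labelled with a simple symmetric cutoff. The function name and the default of 0.2 are just for illustration here; this is the array-CGH-style convention described above, not necessarily the thresholds CNVkit applies in its own commands.

```python
def classify_segment(log2_ratio, threshold=0.2):
    """Label a segment as 'gain', 'loss' or 'neutral' by its log2 ratio.

    Values within +/- threshold of zero are treated as neutral, since they
    may reflect noise or imperfect centering rather than a real change.
    """
    if log2_ratio > threshold:
        return "gain"
    if log2_ratio < -threshold:
        return "loss"
    return "neutral"


if __name__ == "__main__":
    # 0.382191 is the example segment from the question; the others are made up.
    for value in (0.382191, 0.05, -0.7):
        print(value, classify_segment(value))
```

A stricter or looser cutoff can be passed via `threshold` if your data are noisier or cleaner than usual.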

2.1 years ago
linouhao • 0

@Eric Talevich When I use cnvkit export vcf, a site has no CN value, but export bed reports a copy number of 1 in the last column for the same site. Which one is right?

Another question: does cnvkit.py batch take tumor purity into consideration? Thanks a lot.

Please don't bump a 3 year old thread like this. I suggest you start a new thread to ask your questions.