Question

The exonic length calculated from the CNVkit BED files is several-fold larger than the real exonic length

ruhollah ▴ 10

I ran CNVkit as usual in batch mode for >100 mouse whole-exome samples. Then I generated BED files (one per sample) to get the integer copy number of each aberrant segment in each sample, as follows:

cnvkit.py export bed -x male WholeExomeMouseSample_1.cns -o WholeExomeMouseSample_1.cns.bed

The output BED file for a given sample looks something like this:

2   87071181    90429432    WholeExomeMouseSample_1 3
2   90429932    111291758   WholeExomeMouseSample_1 3
2   111292258   111646005   WholeExomeMouseSample_1 4
3   29357078    91552512    WholeExomeMouseSample_1 3
3   92014572    114061589   WholeExomeMouseSample_1 3
3   114206302   159934364   WholeExomeMouseSample_1 3
5   3344361 14678781    WholeExomeMouseSample_1 3
5   145365571   146184973   WholeExomeMouseSample_1 3
6   15324588    18681705    WholeExomeMouseSample_1 13
7   34218228    34911854    WholeExomeMouseSample_1 3
...

Now, I want to find the total genomic length (in base pairs) of all segments with an aberrant copy number in a given sample (let's call it L_alter_CNA). In other words, I need the total length of the portion of the genome affected by copy number alteration. We can simply calculate this (I think!) by summing end - start over all lines in the above BED file.
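As a minimal sketch of the summation described above, the following Python snippet adds up end - start over the rows of a BED export. The function name `total_segment_length` is hypothetical and not part of CNVkit; the three example rows are the chromosome 3 segments from the output shown earlier.

```python
def total_segment_length(bed_text):
    """Sum end - start over all data rows of BED-formatted text.

    Hypothetical helper for illustration; not a CNVkit function.
    """
    total = 0
    for line in bed_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, header lines, and truncated rows such as "...".
        if len(fields) < 3 or line.startswith("#") or fields[0] in ("track", "browser"):
            continue
        start, end = int(fields[1]), int(fields[2])
        total += end - start
    return total


# The three chromosome 3 segments from the BED output above.
example_bed = """\
3   29357078    91552512    WholeExomeMouseSample_1 3
3   92014572    114061589   WholeExomeMouseSample_1 3
3   114206302   159934364   WholeExomeMouseSample_1 3
"""

print(total_segment_length(example_bed))
```

These three segments alone span roughly 130 Mb, so a sum computed this way measures genomic span rather than captured exonic sequence.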

However, for most samples, L_alter_CNA is several-fold larger than the real exonic length of the sample.

Why is this? What am I missing here? Or am I misunderstanding the BED files generated by CNVkit?

Thank you!

bed CNVkit • 918 views

updated 13 months ago by Ram 43k • written 19 months ago by ruhollah ▴ 10

score 2 · Answer 1 · 2022-09-02

CNVkit's calls are not limited to exons. I'd interpret your BED file above to mean that chromosome 3 has a copy number of 3, i.e. a single-copy gain of the whole chromosome. The two breakpoints might be the centromere (91,552,512 to 92,014,572 bp) and maybe another masked or unresolved genomic region in the middle of the q arm (114,061,589 to 114,206,302 bp).
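To make the breakpoint coordinates above easier to verify, here is a small Python sketch that lists the gaps between consecutive segments on the same chromosome. The function name `gaps_between_segments` is hypothetical, and the input is assumed to be sorted by chromosome and start position, as in the BED output in the question.

```python
def gaps_between_segments(segments):
    """Return (chrom, gap_start, gap_end, gap_length) for each gap between
    consecutive segments on the same chromosome.

    segments: list of (chrom, start, end) tuples, assumed sorted.
    Hypothetical helper for illustration; not a CNVkit function.
    """
    gaps = []
    for (chrom_a, _start_a, end_a), (chrom_b, start_b, _end_b) in zip(segments, segments[1:]):
        if chrom_a == chrom_b and start_b > end_a:
            gaps.append((chrom_a, end_a, start_b, start_b - end_a))
    return gaps


# The three chromosome 3 segments from the BED output in the question.
chr3_segments = [
    ("3", 29357078, 91552512),
    ("3", 92014572, 114061589),
    ("3", 114206302, 159934364),
]

for gap in gaps_between_segments(chr3_segments):
    print(gap)
```

For these rows the two gaps are 462,060 bp and 144,713 bp long, which is small compared with the segments on either side.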

I see that an unusually large portion of this genome is reported with 3 copies, which could be a false positive. You could check for noise, and run cnvkit.py call --center to see if that shifts the segment means so that most of the genome has neutral copy number instead of a low-amplitude gain.

score 1 · Answer 2 · 2022-08-31

Istvan Albert 100k

Investigate your files. Do you see how

3   29357078    91552512    WholeExomeMouseSample_1 3

is already a continuous interval about 62 million bp long? It is most certainly longer than a single exon. Your files therefore describe not the exons themselves but the regions around them, so you can't expect the total to match the length of the exons.
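To illustrate the difference between segment span and exonic length, the sketch below counts only the exonic bases that fall inside the reported segments. It is written under stated assumptions: the function name `exonic_length_in_segments` is hypothetical, the exon coordinates are made up for the example (in practice they would come from the capture kit's target BED file), and neither the exons nor the segments overlap among themselves.

```python
def exonic_length_in_segments(segments, exons):
    """Count exonic bases covered by copy-number segments.

    segments, exons: lists of (chrom, start, end) in BED-style half-open
    coordinates. Assumes exons do not overlap each other and segments do
    not overlap each other. Hypothetical helper; not a CNVkit function.
    """
    total = 0
    for seg_chrom, seg_start, seg_end in segments:
        for ex_chrom, ex_start, ex_end in exons:
            if seg_chrom != ex_chrom:
                continue
            # Length of the intersection; zero or negative means no overlap.
            overlap = min(seg_end, ex_end) - max(seg_start, ex_start)
            if overlap > 0:
                total += overlap
    return total


# One segment from the question's BED output.
segments = [("3", 29357078, 91552512)]

# Made-up exon intervals, for illustration only.
exons = [
    ("3", 30000000, 30000200),  # fully inside the segment
    ("3", 91552400, 91552700),  # partly inside the segment
    ("5", 100, 300),            # different chromosome
]

print(exonic_length_in_segments(segments, exons))
```

The nested loop is quadratic and only meant to show the idea; for real target files an interval tool such as bedtools intersect would be the more practical choice.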

19 months ago by Istvan Albert 100k

You're right! I have no idea how CNVkit came up with this large contig. I followed their procedure exactly. I hope @Eric Talevich (developer of CNVkit) comments on this.

19 months ago by ruhollah ▴ 10