The exonic length calculated from the CNVkit bed files is several-fold larger than the real exonic length
19 months ago
ruhollah ▴ 10

I ran CNVkit as usual in batch mode on >100 whole-exome mouse samples. Then I generated bed files (one per sample) to get the integer copy number of each aberrant segment in each sample, as follows:

cnvkit.py export bed -x male WholeExomeMouseSample_1.cns -o WholeExomeMouseSample_1.cns.bed

The output bed file for a given sample is something like this:

2   87071181    90429432    WholeExomeMouseSample_1 3
2   90429932    111291758   WholeExomeMouseSample_1 3
2   111292258   111646005   WholeExomeMouseSample_1 4
3   29357078    91552512    WholeExomeMouseSample_1 3
3   92014572    114061589   WholeExomeMouseSample_1 3
3   114206302   159934364   WholeExomeMouseSample_1 3
5   3344361    14678781    WholeExomeMouseSample_1 3
5   145365571   146184973   WholeExomeMouseSample_1 3
6   15324588    18681705    WholeExomeMouseSample_1 13
7   34218228    34911854    WholeExomeMouseSample_1 3
...

Now I want to find the total genomic length (in base pairs) of all segments with an aberrant copy number in a given sample (let's call it L_alter_CNA). In other words, I need the total length of the portion of the genome altered by copy number changes. I think we can calculate this simply by summing end - start over all lines in the above bed file.
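
As a minimal sketch of that sum, assuming whitespace-separated columns in the order chromosome, start, end, sample label, copy number (as in the excerpt above), the lengths can be totalled per copy-number value and overall. The function name and the three example rows are only illustrative; the rows are copied from the listing above. Note that this measures the genomic span of the segments, not the length of the exons inside them.

```python
from collections import defaultdict


def altered_length_by_copy_number(bed_lines):
    """Sum (end - start) per copy-number value over BED-like lines."""
    totals = defaultdict(int)
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines such as "..."
        _chrom, start, end, _label, copies = fields[:5]
        totals[int(copies)] += int(end) - int(start)
    return dict(totals)


# Three rows copied from the excerpt above (not a full sample).
example = """\
2   87071181    90429432    WholeExomeMouseSample_1 3
6   15324588    18681705    WholeExomeMouseSample_1 13
7   34218228    34911854    WholeExomeMouseSample_1 3
""".splitlines()

per_copy = altered_length_by_copy_number(example)
l_alter_cna = sum(per_copy.values())
print(per_copy)     # span in bp, keyed by copy number
print(l_alter_cna)  # total span in bp over all listed segments
```

Breaking the total down by copy number makes it easy to see whether a single value, such as 3, accounts for most of the span.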

However, for most samples, L_alter_CNA is several fold larger than the real exonic length of the sample.

Why is this? What am I missing here? Or am I misunderstanding the bed files generated by CNVkit?

Thank you!

bed CNVkit • 918 views
19 months ago
Eric T. ★ 2.8k

CNVkit's calls are not limited to exons. I'd interpret your BED file above to mean that chromosome 3 has a copy number of 3, i.e. a single-copy gain of the whole chromosome. The two breakpoints might be the centromere (91,552,512 to 92,014,572 bp) and maybe another masked or unresolved genomic region in the middle of the q arm (114,061,589 to 114,206,302 bp).
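
To locate such breakpoints in a whole file, a rough sketch like the following lists the gaps between consecutive segments on the same chromosome. It assumes the segments are already sorted by chromosome and start, as in the excerpt; the function name is hypothetical and the coordinates are the three chromosome 3 rows from the question.

```python
def gaps_between_segments(segments):
    """Return (chrom, gap_start, gap_end, gap_length) for each gap between
    consecutive segments on the same chromosome. Input must be sorted."""
    gaps = []
    for (chrom_a, _, end_a), (chrom_b, start_b, _) in zip(segments, segments[1:]):
        if chrom_a == chrom_b and start_b > end_a:
            gaps.append((chrom_a, end_a, start_b, start_b - end_a))
    return gaps


# Chromosome 3 segments copied from the question's BED excerpt.
chr3 = [
    ("3", 29357078, 91552512),
    ("3", 92014572, 114061589),
    ("3", 114206302, 159934364),
]

gaps = gaps_between_segments(chr3)
for chrom, gap_start, gap_end, length in gaps:
    print(f"chr{chrom}: {gap_start}-{gap_end} ({length} bp)")
```

For these rows the two gaps span 462,060 bp and 144,713 bp, small next to the segments they separate.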

I see that an unusually large portion of this genome is reported with 3 copies, which could be a false positive. You could check for noise, and run cnvkit.py call --center to see whether that shifts the segment means so that most of the genome has neutral copy number instead of a low-amplitude gain.

19 months ago

Look at your files. Do you see how

3   29357078    91552512    WholeExomeMouseSample_1 3

is already a continuous interval about 62 million bp long? It is most certainly longer than a single exon, so your files describe not the exons themselves but large regions that span them. You can't expect the total to equal the length of the exons.


You're right! I have no idea how CNVkit came up with such a large contig. I followed their procedure exactly. I hope @Eric Talevich (developer of CNVkit) comments on this.
