CNVkit - diagram problem
21 months ago

I have a small problem with CNVkit.

I normally use CNVkit to call CNVs in a whole-exome panel, and it works without problems and gives good results. Now I'm trying to call them in a smaller panel (50 genes). The commands I have used are the same, but with a different BED file.

cnvkit.py batch sample_sorted_ND.bam --normal *sorted_ND.bam -t my-targets.bed --fasta hg19_ref_genome.fa --access data/access-5kb-mappable.hg19.bed --output-reference my_Mreference.cnn --output-dir example1


The problem is that, when I look at the log2 values in the .cnr file, the only regions out of range are BRCA1_14 and BRCA1_13, which correspond to the deletion that is really present.

chr17 41228504 41228631 427_15619_672(BRCA1)_14 142535 -0.540179 0.629573
chr17 41231350 41231416 427_15618_672(BRCA1)_13 119045 -0.672668 0.552469


But the diagram shows many genes (most of the genes in the panel, but not BRCA1). Why? Shouldn't only these regions appear? Am I doing something wrong? Do I need to add something else to the command line?

Another question: the depth column is not correct, since it does not always recognize decimals.

Can someone help me?

Thank you,

Kira

cnvkit • 443 views
20 months ago
Eric T. ★ 2.6k

Are the parentheses really there in the gene name? Are there spaces, too? Maybe the formatting gets strange when it goes through PDF rendering by reportlab.

Are you using the .cnr, the .cns, or both with the diagram command? Which version of CNVkit are you using?

When pandas writes out the .cnn file, it might display a floating-point number that is equal to a round integer (e.g. 123.0) as an integer with no decimal place (e.g. 123). Is that what you're seeing, or is the decimal issue something else?
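To illustrate the point about round values, here is a minimal sketch in plain Python. It assumes the depth column is written with a `%g`-style format of six significant digits; that format and the helper name `format_depth` are assumptions for illustration, not something confirmed about how CNVkit or pandas writes the file.

```python
# Assumption: floats are written with a "%.6g"-style format (illustrative only).
def format_depth(value: float) -> str:
    """Format a float with up to six significant digits, dropping trailing zeros."""
    return format(value, ".6g")


examples = [123.0, 88.5551, 111.095]
formatted = [format_depth(v) for v in examples]

for value, text in zip(examples, formatted):
    # A round float such as 123.0 is printed without any decimal point.
    print(repr(value), "->", text)
```

Under that assumption, a depth with no decimal point is only expected when the value is a whole number, which is a different situation from a fractional depth losing its decimal point.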

20 months ago

Hello. The gene name does contain parentheses, but no spaces. I tried removing the parentheses from the gene name, but the result is the same.

The command I use to obtain the diagram is: cnvkit.py batch Sample_sorted_ND.bam -r my_reference.cnn -p 0 --scatter --diagram -d example-ok

Regarding the decimals: the program doesn't seem to recognize them, because sometimes the value is written as 234.8769 and sometimes as 2348769, as a whole number:

chr17 41245559 41245822 427_15615_672-BRCA1_10 108631 -0.530366 0.809323

chr17 41246613 41246877 427_15615_672-BRCA1_10 112011 -0.53761 0.815537

chr7 152373125 152373164 477_192314_7516-XRCC2_1 73.5897 -0.565382 0.545698

chr17 41245295 41245559 427_15615_672-BRCA1_10 109129 -0.595931 0.792966

chr17 41246349 41246613 427_15615_672-BRCA1_10 111095 -0.608238 0.795026

chr17 41243978 41244241 427_15615_672-BRCA1_10 88.5551 -0.654744 0.843337

chr17 41228504 41228631 427_15619_672-BRCA1_14 87063 -0.661954 0.642077

In this table you can see the difference in the depth column. In this case nothing appears in the diagram, even though these exons are really deleted.

Thank you


The decimal issue you're showing here is very strange. I'm thinking:

• Why is 477_192314_7516-XRCC2_1 from chr7 listed in the middle of these BRCA1 targets? The other targets here don't appear to be in genomic order, either. If the .cnn or .cnr files are scrambled, then that could lead to other issues.
• It looks like the depth numbers 111095 and 87063 should be 111.095 and 87.063 if these are part of the same contiguous genomic region.
• The pattern there is that there are 3 trailing decimal places, which looks like a Euro-style thousands separator (vs. commas in the US locale), whereas 88.5551 has 4 decimal places and the decimal would not be mistaken for a thousands separator.
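The three-decimal pattern described in the bullets above can be sketched in plain Python. This is only an illustration of the hypothesis, not a reproduction of whatever actually rewrote the files; the function names are hypothetical, and the values 111.095 and 87.063 are the guessed originals, not numbers taken from the files.

```python
import re

# Matches strings such as "111.095": one to three digits, then groups of
# exactly three digits after each dot. A reader expecting "." as the
# thousands separator could take these for whole numbers.
THOUSANDS_LIKE = re.compile(r"^\d{1,3}(\.\d{3})+$")


def is_ambiguous(depth_text: str) -> bool:
    """True if the text could be read as a dot-grouped whole number."""
    return bool(THOUSANDS_LIKE.match(depth_text))


def misread_as_thousands(depth_text: str) -> str:
    """Return the text as it would look if '.' were dropped as a separator."""
    if is_ambiguous(depth_text):
        return depth_text.replace(".", "")
    return depth_text


depths = ["111.095", "87.063", "88.5551", "73.5897"]
results = {d: misread_as_thousands(d) for d in depths}

for original, seen in results.items():
    print(original, "->", seen)
```

Only the values with exactly three decimal places lose their decimal point, which matches the mix of whole-number and fractional depths in the table posted earlier in the thread.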

For the decimal issue, could you look at your system's locale settings and Python and pandas versions to see if your shell environment or pandas installation is mixing up . versus , thousands separators? My guess is that when the intermediate .cnn and .cnr files are being written by CNVkit via pandas, the decimal disappears because it looks like a thousands separator. But since the depth column isn't used for much after constructing the reference (check your reference.cnn file to see if the log2 values are wild there), it might not be the source of your main issue, the undetected BRCA1 deletion.
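One way to collect the locale and version details asked for above is a short Python script run in the same shell environment as CNVkit. This is a sketch: the function name `describe_numeric_locale` is hypothetical, and it only reports what Python sees, which may differ from what another tool (a spreadsheet, for example) uses.

```python
import locale
import sys


def describe_numeric_locale() -> dict:
    """Report the Python version and the numeric separators of the active locale."""
    try:
        # Adopt the locale from the environment (LANG, LC_ALL, LC_NUMERIC).
        locale.setlocale(locale.LC_NUMERIC, "")
    except locale.Error:
        # The configured locale is not installed; keep the default "C" locale.
        pass
    conv = locale.localeconv()
    info = {
        "python": sys.version.split()[0],
        "numeric_locale": locale.getlocale(locale.LC_NUMERIC),
        "decimal_point": conv["decimal_point"],
        "thousands_sep": conv["thousands_sep"],
    }
    try:
        import pandas  # optional: only reported if it is installed
        info["pandas"] = pandas.__version__
    except ImportError:
        info["pandas"] = "not installed"
    return info


info = describe_numeric_locale()
for key, value in info.items():
    print(f"{key}: {value!r}")
```

If `decimal_point` comes back as `,` and `thousands_sep` as `.`, the environment uses the European convention, which would be consistent with the misreading suspected here.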

In your diagram or .cnr, do the log2 ratios look well-centered (mostly near 0), or is there a lot of noise and potentially off-center log2 ratios? If centering is the problem (maybe due to lots of outlier values), you could try re-centering with call -m median, or look further upstream to find the source of the outliers.
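A quick way to check centering is to compute the median of the log2 column yourself. The sketch below uses only the Python standard library and three rows copied from the table posted earlier in this thread. The header line and its column names are assumptions added so the rows can be read by name; check them against the header of your own .cnr file.

```python
import csv
import io
import statistics

# Three rows from the table posted above. The header line is an assumption
# for illustration; a real .cnr file carries its own header.
CNR_TEXT = (
    "chromosome\tstart\tend\tgene\tdepth\tlog2\tweight\n"
    "chr17\t41245559\t41245822\t427_15615_672-BRCA1_10\t108631\t-0.530366\t0.809323\n"
    "chr7\t152373125\t152373164\t477_192314_7516-XRCC2_1\t73.5897\t-0.565382\t0.545698\n"
    "chr17\t41228504\t41228631\t427_15619_672-BRCA1_14\t87063\t-0.661954\t0.642077\n"
)


def log2_values(text: str) -> list:
    """Read the log2 column from tab-separated .cnr-style text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [float(row["log2"]) for row in reader]


values = log2_values(CNR_TEXT)
center = statistics.median(values)
# Subtracting the median shifts the values so that their median is 0.
recentered = [v - center for v in values]

print("median log2:", center)
print("re-centered:", recentered)
```

On a full .cnr file, a median far from 0 would point to a centering problem. The three rows here are only an excerpt, so their median says nothing about the sample as a whole.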

If it ultimately looks like a bug in CNVkit, could you try the latest from GitHub and/or tell me which version you're using?