Hi, I am optimizing a pipeline for CNV analysis of WES data. I was getting quite strange output from CNVkit, so I tried running the same data as both the tumor (sample) and the control (from which the reference is built). To my surprise, the output contained quite variable bins and segments!
From the CNVkit docs, I understood that the Fix step subtracts the reference log2 values from the sample's, so with an identical sample and control I expected log2 ratios of zero everywhere:
[The corresponding “expected” normalized log2 read-depth values from the reference are then subtracted for each set of bins.][2]
Header and first row of sample.targetcoverage.cnn:
chromosome start end gene depth log2
chr1 65509 65629 ensembl_gene_id=ENSG00000186092;gene_symbol=OR4F5 32.175 5.00787
Header and first row of reference.cnn (built from the same BAM as the sample):
chromosome start end gene log2 depth gc rmask spread
chr1 65509 65629 ensembl_gene_id=ENSG00000186092;gene_symbol=OR4F5 -0.144045 32.175 0.333333 0.213561
Header and first row of the resulting sample.cnr:
chromosome start end gene log2 depth weight
chr1 65509 65629 ensembl_gene_id=ENSG00000186092;gene_symbol=OR4F5 -0.243448 32.175 0.953424
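A quick sanity check on the numbers above, as a minimal sketch: the values are copied from the three rows shown, and `is_plain_log2_of_depth` is just a helper name I made up for illustration. It suggests that the log2 in sample.targetcoverage.cnn is simply log2(depth), while the reference log2 for the same bin and depth is not, so the two stored values do not appear to be on the same scale before Fix.

```python
import math

# Values copied from the rows shown above (same bin in all three files).
depth = 32.175
sample_log2 = 5.00787        # sample.targetcoverage.cnn
reference_log2 = -0.144045   # reference.cnn
cnr_log2 = -0.243448         # sample.cnr


def is_plain_log2_of_depth(depth, log2_value, tol=1e-4):
    """Return True if log2_value matches log2(depth) within tol."""
    return abs(math.log2(depth) - log2_value) < tol


# Does each file store the raw log2 of the depth column?
print("coverage log2 is log2(depth): ", is_plain_log2_of_depth(depth, sample_log2))
print("reference log2 is log2(depth):", is_plain_log2_of_depth(depth, reference_log2))

# Naive difference of the two stored values, with no other normalization,
# printed next to the value that actually ended up in the .cnr file.
naive_difference = sample_log2 - reference_log2
print("naive difference:", naive_difference, "| .cnr log2:", cnr_log2)
```

So a plain subtraction of the two stored columns is clearly not what ends up in the .cnr; whatever normalization is applied to the sample before the subtraction is presumably where the non-zero value comes from.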
I noticed that the reference.cnn row is one column short: the header has 9 column names, but the row has only 8 values.
Any insights would be welcome!
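One thing I can check is whether the row is really short, or whether an empty field simply collapsed when I pasted it. A minimal sketch, assuming the .cnn files are tab-separated; `describe_row` is a hypothetical helper, and the example row is made up for illustration (I have not confirmed which column, if any, is empty in my file).

```python
def describe_row(header_line, row_line, sep="\t"):
    """Count names and values, and list columns whose value is empty."""
    names = header_line.rstrip("\n").split(sep)
    values = row_line.rstrip("\n").split(sep)
    empty = [name for name, value in zip(names, values) if value == ""]
    return len(names), len(values), empty


# Made-up example: a row whose "rmask" field is empty.
header = "\t".join(
    ["chromosome", "start", "end", "gene", "log2", "depth", "gc", "rmask", "spread"]
)
row = "\t".join(
    ["chr1", "65509", "65629", "OR4F5", "-0.1", "32.175", "0.33", "", "0.21"]
)

n_names, n_values, empty_columns = describe_row(header, row)
print(n_names, n_values, empty_columns)

# Splitting on any whitespace hides the empty field and makes the row look short.
n_whitespace_values = len(row.split())
print(n_whitespace_values)
```

If the real file behaves like this example, the row is not actually truncated; one field is just blank.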