I am having some trouble running CNVkit 0.9.7, installed via conda in a dedicated environment, on a set of 10 control and 16 experimental (tumor) libraries which were prepared using a targeted amplicon panel. The batch command I am using is:
cnvkit.py batch -m amplicon $tumor_dir*.bam --normal $normal_dir*.bam \ -p 0 \ --targets $bed --fasta $hg19 \ --annotate $refFlat \ --scatter --diagram \ --output-reference reference.cnn --output-dir $out_dir
Because mode is 'amplicon' I am omitting the --access flag, but have also tried with --access using the targets bed file and the result is the same.
The first issue I encounter is that 116 bins failed filters, but looking at the log2 and spread they are within limits. Below are the first few examples - the first clearly exceeds the spread threshold, but the rest are within limits.
Targets: 116 (11.5538%) bins failed filters (log2 < -5.0, log2 > 5.0, spread > 1.0) - chr1:36167979-36168099 log2=-4.026 spread=1.154 rmask=nan " chr1:53307856-53307973 log2=-0.132 spread=0.078 rmask=nan PTGER3 chr1:71477262-71477341 log2=-0.703 spread=0.113 rmask=nan FUBP1 chr1:78432623-78432745 log2=-0.588 spread=0.146 rmask=nan " chr1:78432750-78432869 log2=0.467 spread=0.105 rmask=nan " chr1:78433305-78433419 log2=0.069 spread=0.095 rmask=nan " chr1:78433832-78433916 log2=-0.269 spread=0.159 rmask=nan
The second issue I encounter is that executing the pipeline does not generate the .cns files or scatter and diagram plots. For simplicity, processing a single sample with the reference generated using the command above, the output in the CLI is:
$ ./run_cnvkit_noref.sh reference.cnn ../bam/tumor_bam/test_case test CNVkit 0.9.7 Created directory test Wrote test/reference.target-tmp.bed with 1004 regions Wrote test/reference.antitarget-tmp.bed with 0 regions Running 1 samples in 12 processes (that's 12 processes per bam) Running the CNVkit pipeline on ../bam/tumor_bam/test_case.bam ... Processing reads in test_case.bam Time: 38.627 seconds (76320 reads/sec, 26 bins/sec) Summary: #bins=1004, #reads=2948012, mean=2936.2671, min=26.099290780141843, max=39176.7304964539 Percent reads in regions: 78.873 (of 3737691 mapped) Wrote test/test_case.targetcoverage.cnn with 1004 regions Skip processing test_case.bam with empty regions file test/reference.antitarget-tmp.bed Wrote test/test_case.antitargetcoverage.cnn with 0 regions Processing target: test_case Keeping 888 of 1004 bins Correcting for GC bias... Processing antitarget: test_case /home/erik/anaconda3/envs/cnvkit/lib/python3.8/site-packages/numpy/lib/nanfunctions.py:1111: RuntimeWarning: Mean of empty slice return np.nanmean(a, axis, out=out, keepdims=keepdims) /home/erik/anaconda3/envs/cnvkit/lib/python3.8/site-packages/numpy/lib/nanfunctions.py:1111: RuntimeWarning: Mean of empty slice return np.nanmean(a, axis, out=out, keepdims=keepdims) Wrote test/test_case.cnr with 888 regions Segmenting test/test_case.cnr ... Segmenting with method 'cbs', significance threshold 0.0001, in 12 processes
Of course, because this is an amplicon based targeted panel, there are no antitargets, so the test_case.antitargetcoverage.cnn file consists of only a header. The output directory contains only the following:
test_case.antitargetcoverage.cnn test_case.cnr test_case.targetcoverage.cnn reference.antitarget-tmp.bed reference.target-tmp.bed
No .cns file or PDFs of plots.
I tried running version 0.9.6 in its own environment to see if that resolves either of these issues, but when I run the cnvkit.py command I get an immediate error regarding biopython:
$ cnvkit.py -h /home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/skgenome/intersect.py:15: FutureWarning: pandas.core.index is deprecated and will be removed in a future version. The public classes are available in the top-level namespace. from pandas.core.index import Int64Index Traceback (most recent call last): File "/home/erik/anaconda3/envs/cnvkit_0.96/bin/cnvkit.py", line 8, in <module> from cnvlib import commands File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/__init__.py", line 5, in <module> from .commands import * File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/commands.py", line 32, in <module> from . import (access, antitarget, autobin, batch, call, core, coverage, File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/autobin.py", line 11, in <module> from . import coverage, samutil File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/coverage.py", line 15, in <module> from Bio._py3k import StringIO ModuleNotFoundError: No module named 'Bio._py3k'
Any assistance appreciated!
The filtering of bins appears to be related to the use of the --fasta flag, which encodes the GC content of the bins in the .cnn files. Extracting the failed bins from the reference.cnn file and sorting by GC values there is a gap between 0.378788 and 0.619835 suggesting a cutoff around < 0.4 and > 0.6.