Question: CNVkit 0.9.7: problems with incorrectly filtered bins, batch command outputs
0
gravatar for erikt
3 months ago by
erikt0
erikt0 wrote:

Hello!

I am having some trouble running CNVkit 0.9.7, installed via conda in a dedicated environment, on a set of 10 control and 16 experimental (tumor) libraries which were prepared using a targeted amplicon panel. The batch command I am using is:

cnvkit.py batch -m amplicon $tumor_dir*.bam --normal $normal_dir*.bam \
-p 0 \
--targets $bed --fasta $hg19 \
--annotate $refFlat \
--scatter --diagram \
--output-reference reference.cnn --output-dir $out_dir

Because mode is 'amplicon' I am omitting the --access flag, but have also tried with --access using the targets bed file and the result is the same.

The first issue I encounter is that 116 bins failed filters, but looking at the log2 and spread they are within limits. Below are the first few examples - the first clearly exceeds the spread threshold, but the rest are within limits.

Targets: 116 (11.5538%) bins failed filters (log2 < -5.0, log2 > 5.0, spread > 1.0)
-                   chr1:36167979-36168099    log2=-4.026  spread=1.154  rmask=nan
"                 chr1:53307856-53307973    log2=-0.132  spread=0.078  rmask=nan
PTGER3              chr1:71477262-71477341    log2=-0.703  spread=0.113  rmask=nan
FUBP1               chr1:78432623-78432745    log2=-0.588  spread=0.146  rmask=nan
"                 chr1:78432750-78432869    log2=0.467  spread=0.105  rmask=nan
"                 chr1:78433305-78433419    log2=0.069  spread=0.095  rmask=nan
"                 chr1:78433832-78433916    log2=-0.269  spread=0.159  rmask=nan

The second issue I encounter is that executing the pipeline does not generate the .cns files or scatter and diagram plots. For simplicity, processing a single sample with the reference generated using the command above, the output in the CLI is:

$ ./run_cnvkit_noref.sh reference.cnn ../bam/tumor_bam/test_case test
CNVkit 0.9.7
Created directory test
Wrote test/reference.target-tmp.bed with 1004 regions
Wrote test/reference.antitarget-tmp.bed with 0 regions
Running 1 samples in 12 processes (that's 12 processes per bam)
Running the CNVkit pipeline on ../bam/tumor_bam/test_case.bam ...
Processing reads in test_case.bam
Time: 38.627 seconds (76320 reads/sec, 26 bins/sec)
Summary: #bins=1004, #reads=2948012, mean=2936.2671, min=26.099290780141843, max=39176.7304964539 
Percent reads in regions: 78.873 (of 3737691 mapped)
Wrote test/test_case.targetcoverage.cnn with 1004 regions
Skip processing test_case.bam with empty regions file test/reference.antitarget-tmp.bed
Wrote test/test_case.antitargetcoverage.cnn with 0 regions
Processing target: test_case
Keeping 888 of 1004 bins
Correcting for GC bias...
Processing antitarget: test_case
/home/erik/anaconda3/envs/cnvkit/lib/python3.8/site-packages/numpy/lib/nanfunctions.py:1111: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
/home/erik/anaconda3/envs/cnvkit/lib/python3.8/site-packages/numpy/lib/nanfunctions.py:1111: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
Wrote test/test_case.cnr with 888 regions
Segmenting test/test_case.cnr ...
Segmenting with method 'cbs', significance threshold 0.0001, in 12 processes

Of course, because this is an amplicon based targeted panel, there are no antitargets, so the test_case.antitargetcoverage.cnn file consists of only a header. The output directory contains only the following:

test_case.antitargetcoverage.cnn
test_case.cnr
test_case.targetcoverage.cnn
reference.antitarget-tmp.bed
reference.target-tmp.bed

No .cns file or PDFs of plots.

I tried running version 0.9.6 in its own environment to see if that resolves either of these issues, but when I run the cnvkit.py command I get an immediate error regarding biopython:

$ cnvkit.py -h
/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/skgenome/intersect.py:15: FutureWarning: pandas.core.index is deprecated and will be removed in a future version.  The public classes are available in the top-level namespace.
  from pandas.core.index import Int64Index
Traceback (most recent call last):
  File "/home/erik/anaconda3/envs/cnvkit_0.96/bin/cnvkit.py", line 8, in <module>
    from cnvlib import commands
  File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/__init__.py", line 5, in <module>
    from .commands import *
  File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/commands.py", line 32, in <module>
    from . import (access, antitarget, autobin, batch, call, core, coverage,
  File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/autobin.py", line 11, in <module>
    from . import coverage, samutil
  File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/coverage.py", line 15, in <module>
    from Bio._py3k import StringIO
ModuleNotFoundError: No module named 'Bio._py3k'

Any assistance appreciated!

-Erik

EDIT

The filtering of bins appears to be related to the use of the --fasta flag, which encodes the GC content of the bins in the .cnn files. Extracting the failed bins from the reference.cnn file and sorting by GC values there is a gap between 0.378788 and 0.619835 suggesting a cutoff around < 0.4 and > 0.6.

ADD COMMENTlink modified 3 months ago • written 3 months ago by erikt0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1856 users visited in the last hour
_