Question

Understanding coverage command

0

Entering edit mode

7.2 years ago

deber1980 • 0

Hello, While I'm aware its not the best tool for detecting relatively small CNVs, I'm trying to use CNVKit to detect a single exon deletions. I have a very small panel of genes (~10) with very high coverage (>1000). However when I run

cnvkit.py coverage sample123.bam one_row.bed -o sample123.cnn

where one_row.bed has a single regions spanning few bases (I tried 200 bases and 5 bases). I'm looking at the bam file and each base in this region has >1000 depth, but still, the depth in the cnn output file is very low (~10).

What am I missing? Shouldn't the depth hold the average depth of all the bases in the region?

Thank you for the help.

CNVkit • 2.2k views

ADD COMMENT • link updated 7.2 years ago by Eric T. ★ 2.8k • written 7.2 years ago by deber1980 • 0

0

Entering edit mode

Probably not the actual reason, but a tempting question here: Can the value of 10 you are reporting possibly be in log2?

ADD REPLY • link 7.2 years ago by Noushin N ▴ 600

0

Entering edit mode

Thanks Noushin. No, it looks like this:

chromosome  start   end gene    log2    depth
chr13   32890471    32890475    BRCA2_ex02_01   3.16993 9

ADD REPLY • link 7.2 years ago by deber1980 • 0

score 1 · Answer 1 · 2017-01-27

1

Entering edit mode

7.2 years ago

Eric T. ★ 2.8k

Are most of the reads marked as duplicates? Those wouldn't be counted. You could also try the -c option with the coverage command to get roughly the same answer calculated through a different method.

In CNVkit 0.8.3 you can use the bundled script coverage_bin_size.py to estimate ideal bin sizes based on the coverage depth in a given sample. This most recent version also includes some internal improvements to handle targeted amplicon sequencing better.

For single-exon detection: As a wild guess, I'd try an average target bin size of 40 or 50, and run segmentation with the HaarSeg method (segment -m haar). Also, be sure to run the fix and reference commands with the option --no-edge, and perhaps also --no-gc --no-rmask if your target footprint is very small.

ADD COMMENT • link 7.2 years ago by Eric T. ★ 2.8k

0

Entering edit mode

Thanks! After examining it, I see that indeed most reads are duplicates, which makes sense as this is a high coverage panel for small number of genes. Is there a way to count duplicate reads as well? I'll also take a look at the most recent version.

ADD REPLY • link 7.2 years ago by deber1980 • 0

0

Entering edit mode

If it's targeted amplicon sequencing and not hybrid capture, which I'm assuming is the case, then you shouldn't mark duplicates at all. MarkDuplicates is designed to flag and remove PCR duplicates (originally for WGS datasets); with targeted amplicon sequencing, all of your amplified are in fact PCR duplicates, but intentionally. So just skip that step.