Question: Understanding coverage command
0
gravatar for deber1980
2.6 years ago by
deber19800
deber19800 wrote:

Hello, While I'm aware its not the best tool for detecting relatively small CNVs, I'm trying to use CNVKit to detect a single exon deletions. I have a very small panel of genes (~10) with very high coverage (>1000). However when I run

cnvkit.py coverage sample123.bam one_row.bed -o sample123.cnn

where one_row.bed has a single regions spanning few bases (I tried 200 bases and 5 bases). I'm looking at the bam file and each base in this region has >1000 depth, but still, the depth in the cnn output file is very low (~10).

What am I missing? Shouldn't the depth hold the average depth of all the bases in the region?

Thank you for the help.

cnvkit • 1.2k views
ADD COMMENTlink modified 2.6 years ago by Eric T.2.5k • written 2.6 years ago by deber19800

Probably not the actual reason, but a tempting question here: Can the value of 10 you are reporting possibly be in log2?

ADD REPLYlink written 2.6 years ago by Noushin N550

Thanks Noushin. No, it looks like this:

chromosome  start   end gene    log2    depth
chr13   32890471    32890475    BRCA2_ex02_01   3.16993 9
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by deber19800
1
gravatar for Eric T.
2.6 years ago by
Eric T.2.5k
San Francisco, CA
Eric T.2.5k wrote:

Are most of the reads marked as duplicates? Those wouldn't be counted. You could also try the -c option with the coverage command to get roughly the same answer calculated through a different method.

In CNVkit 0.8.3 you can use the bundled script coverage_bin_size.py to estimate ideal bin sizes based on the coverage depth in a given sample. This most recent version also includes some internal improvements to handle targeted amplicon sequencing better.

For single-exon detection: As a wild guess, I'd try an average target bin size of 40 or 50, and run segmentation with the HaarSeg method (segment -m haar). Also, be sure to run the fix and reference commands with the option --no-edge, and perhaps also --no-gc --no-rmask if your target footprint is very small.

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Eric T.2.5k

Thanks! After examining it, I see that indeed most reads are duplicates, which makes sense as this is a high coverage panel for small number of genes. Is there a way to count duplicate reads as well? I'll also take a look at the most recent version.

ADD REPLYlink written 2.6 years ago by deber19800

If it's targeted amplicon sequencing and not hybrid capture, which I'm assuming is the case, then you shouldn't mark duplicates at all. MarkDuplicates is designed to flag and remove PCR duplicates (originally for WGS datasets); with targeted amplicon sequencing, all of your amplified are in fact PCR duplicates, but intentionally. So just skip that step.

ADD REPLYlink written 2.6 years ago by Eric T.2.5k

Right, make sense. Thanks

ADD REPLYlink written 2.6 years ago by deber19800
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1447 users visited in the last hour