Noisy germline CNV data using CNVKit
4.9 years ago
Getting there ▴ 120

Context: I'm a new CNVKit user (using 0.9.6). I have 6 exome-seq samples (from saliva DNA) that I want to do germline copy number calling on. 2 of the samples are from saliva of healthy individuals and the other 4 are from saliva of individuals with breast cancer, all in the same extended family pedigree.

What I did so far: I ran the normal CNVKit pipeline with a flat reference and used the --access option with the access-5kb-mappable.hg19.bed file mentioned in the docs to exclude poorly mappable regions. I ran segmetrics with the 'ci' option, then made a scatter plot (scatter plot output seen here). I also ran the pipeline again, this time using my 2 healthy samples to build a pooled reference, and compared the individuals with breast cancer in the same family against it (scatter plot output seen here). The data looks very noisy in both cases (see the plots linked above).
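For readers following along, the flat-reference run described above might look roughly like the sketch below. This is not the original poster's actual script: the sample name, the bait BED file, the FASTA path and the output directory are all placeholders, and the `run` helper is a hypothetical convenience that prints each command instead of executing it unless `DRY_RUN=0` is set.

```shell
# Sketch of the workflow described above. All file names are placeholders.

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = sample name (BAM file name without .bam), $2 = output directory
flat_reference_workflow() {
    # 1. Run the pipeline against a flat reference (-n with no normal BAMs).
    run cnvkit.py batch "$1.bam" -n \
        --targets my_baits.bed --fasta hg19.fasta \
        --access access-5kb-mappable.hg19.bed \
        --output-reference flat_reference.cnn --output-dir "$2"

    # 2. Add confidence intervals to each segment.
    run cnvkit.py segmetrics "$2/$1.cnr" -s "$2/$1.cns" --ci \
        -o "$2/$1.segmetrics.cns"

    # 3. Plot bin-level log2 ratios together with the annotated segments.
    run cnvkit.py scatter "$2/$1.cnr" -s "$2/$1.segmetrics.cns" \
        -o "$2/$1-scatter.pdf"
}

flat_reference_workflow SampleA results
```

With the default dry-run setting this only prints the three commands, which makes it easy to check paths before running anything against real BAM files.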

Questions:

  1. Can anybody help me understand why my data is so noisy? I suspect it has to do with the reference (or rather the lack of a good reference...). I want to retry with some exome-seq samples that are unrelated to breast cancer and build a reference from those individuals, but I am not sure how much that would help, or whether the reference is the issue to begin with.

  2. What are the grey and orange bars on the scatter plot? On https://cnvkit.readthedocs.io/en/stable/plots.html I only see red bars, which are described as the "segmentation line", but I am not entirely sure what that means, why my plot has bars in 2 colors, or why neither of them is red. I am using v0.9.6 of CNVKit.

Any help is greatly appreciated. Thank you

cnvkit copy number variation germline cnv gcnv • 2.5k views
4.6 years ago
brunobsouzaa ▴ 830

I've been using cnvkit for a few months now and, in my limited experience, I don't recommend using the flat reference. A pooled reference of at least 15 samples works best! I'm now using a pooled reference built from 161 samples and everything is working fine for me.


If you use a pooled reference, is the paired control sample no longer used in the analysis? Can you share your command? I ran my commands as shown below, but I cannot find any criteria for identifying a noisy sample.

First, I used the batch command to get the target and antitarget .cnn coverage files for all the control samples:

# command 1
cnvkit.py batch Tumor.bam --normal Normal.bam \
    --targets my_baits.bed --annotate refFlat.txt \
    --fasta hg19.fasta --access data/access-5kb-mappable.hg19.bed \
    --output-reference my_reference.cnn --output-dir results/ \
    --diagram --scatter

Second, I gathered the target and antitarget .cnn files of all the control samples into an empty directory and built the reference:

# command 2
cnvkit.py reference *coverage.cnn -f ucsc.hg19.fa -o Reference.cnn

Third, I cannot find a way for cnvkit to take both a matched control sample and a pooled reference, the way GATK Mutect2 does, so I just passed the pooled reference and left out the normal sample:

# command 3
cnvkit.py batch Tumor.bam --normal  Reference.cnn \
    --targets my_baits.bed --annotate refFlat.txt \
    --fasta hg19.fasta --access data/access-5kb-mappable.hg19.bed \
    --output-reference my_reference.cnn --output-dir results/ \
    --diagram --scatter

Thanks a lot, and I look forward to hearing more about your experience with cnvkit.
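On the question of criteria for spotting a noisy sample: CNVkit has a `metrics` command that reports per-sample spread statistics for the bin-level log2 ratios around their segment means. A rough sketch of how its output could be used to rank samples is below. It assumes the table has been saved as a tab-separated file with a header row containing `sample` and `bivar` columns; that layout, the `rank_by_noise` helper and the file names are assumptions for illustration, and the sketch deliberately ranks samples instead of applying a cutoff, since no threshold is given in this thread.

```shell
# Rank samples from noisiest to quietest using the "bivar" column of a
# metrics table. Assumes a tab-separated file whose header row contains
# "sample" and "bivar" columns (an assumed layout, check your own output).
# $1 = path to the saved metrics table
rank_by_noise() {
    awk -F'\t' '
        NR == 1 {
            # Locate the columns by name so their order does not matter.
            for (i = 1; i <= NF; i++) {
                if ($i == "sample") s = i
                if ($i == "bivar")  b = i
            }
            if (!s || !b) {
                print "missing sample/bivar column" > "/dev/stderr"
                exit 1
            }
            next
        }
        { print $b "\t" $s }
    ' "$1" | sort -k1,1nr
}

# Hypothetical usage, with placeholder paths:
#   cnvkit.py metrics results/*.cnr -s results/*.cns > metrics.tsv
#   rank_by_noise metrics.tsv
```

Samples at the top of the ranking are the ones whose bins scatter most widely around their segments, which makes them reasonable candidates to inspect or to leave out of a pooled reference.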


So, I've made a small change to the way I work with cnvkit... For the baseline, use all samples in the same run. This works fine! As for my commands: first, build your reference:

cnvkit.py batch --normal $NORMAL --targets $TARGETS --annotate $REFFLAT --fasta $GEN_REF --access $ACCESS --output-reference $CNV_REF --output-dir $OUT_DIR

Then, run analysis using the created reference:

cnvkit.py batch $TESTES -r $CNV_REF -d $OUT_DIR

Last, call copy numbers:

cnvkit.py call ${i}.Tumor.cnr -y -m clonal -o ${i}.call.cnr
4.2 years ago
sutturka ▴ 190

This link will answer your question regarding the grey and orange bars on the scatter plot. Search through the GitHub issues and you may find more answers.


thank you so much for the help
