I have just generated my CNV files with CNVkit. I am wondering whether the "log2" column in the CNVkit output (after call) is the same as "Seg_mean". If not, how can I get "Seg_mean" from "log2"? Please give me some advice, thanks!
Please read: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002202 and add more relevant details to your question. What have you tried? Have you read the CNVkit paper?
Here are two lines of what I get.
chromosome start end log2 probes
chr1 826717 2410579 -0.00659771 487
chr1 2410780 2787772 -0.372291 70
Yes, I have read the CNVkit paper, here is the link.
I got an answer like this: Segment_Mean is the arithmetic mean of those probes' log2 copy ratio values.
But I am still confused about how to get "Segment_Mean". I need it as input to ABSOLUTE.
I have also generated a CNV file with VarScan, but its "Segment_mean" values seem far too large.
I've moved this to a comment - please do not add an answer unless you're answering the top-level question. Plus, edit your question and add this information in there. Please read posts under /t/how-to for more information.
In the .cns files, yes, log2 is the segment mean in log2 scale. Details here:
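To illustrate the point above, here is a minimal sketch that copies the log2 column of a .cns file straight into a Segment_Mean column of a SEG-style table. The output column names (Sample, Chromosome, Start, End, Num_Probes, Segment_Mean) and the sample ID "sample1" are assumptions for illustration; check the ABSOLUTE documentation for the exact header it expects.

```python
import csv
import io


def cns_to_seg_rows(cns_handle, sample_id):
    """Yield SEG-style rows from a tab-separated CNVkit .cns file handle.

    The log2 value is passed through unchanged, since it is already the
    segment mean on a log2 scale.
    """
    reader = csv.DictReader(cns_handle, delimiter="\t")
    for row in reader:
        yield {
            "Sample": sample_id,  # hypothetical sample name
            "Chromosome": row["chromosome"],
            "Start": int(row["start"]),
            "End": int(row["end"]),
            "Num_Probes": int(row["probes"]),
            "Segment_Mean": float(row["log2"]),
        }


# The two segments quoted in the question, as tab-separated text.
example = (
    "chromosome\tstart\tend\tlog2\tprobes\n"
    "chr1\t826717\t2410579\t-0.00659771\t487\n"
    "chr1\t2410780\t2787772\t-0.372291\t70\n"
)
rows = list(cns_to_seg_rows(io.StringIO(example), "sample1"))
for r in rows:
    print(r)
```

In practice, CNVkit's own `cnvkit.py export seg` subcommand writes a SEG file from .cns files, so a hand-rolled conversion like this is mainly useful for seeing which column maps to which.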
Thanks for your help!
I am now facing another problem using CNVkit; could you please give me some advice? Details are as follows:
I am running CNVkit to call CNVs from my whole-exome sequencing data. I use a command like cnvkit.py batch -m amplicon -t targets.bed *.bam, but I do not have a targets.bed file. I also checked Astra-Zeneca’s reference data repository but could not find one there either.
My questions are:
1) Is it correct to use -m amplicon?
2) Is there a file containing all human exons that I can use with the guess_baits.py script? I am not sure where to get a complete BED file to use for guessing.
I would appreciate any advice!
For exome, -m hybrid is better than -m amplicon. You can verify that there are off-target reads by loading the BAM file in a viewer like IGV.
For guess_baits.py, try UCSC's RefSeq exons (refFlat.txt here), or another BED file of known genes from UCSC Genome Browser. Make sure the reference genome matches.
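For reference, a rough sketch of how exon coordinates in a refFlat.txt file map onto BED lines. It assumes the standard UCSC refFlat layout (geneName, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds) and is not what CNVkit's own conversion script does internally; the gene and transcript names in the example are made up.

```python
def refflat_to_bed_lines(refflat_lines):
    """Yield one BED line (chrom, start, end, gene) per distinct exon.

    Assumes the UCSC refFlat column layout. refFlat starts are 0-based,
    like BED, so coordinates are copied without adjustment.
    """
    seen = set()
    for line in refflat_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip malformed or truncated lines
        gene, chrom = fields[0], fields[2]
        # exonStarts/exonEnds are comma-separated with a trailing comma.
        starts = [int(s) for s in fields[9].split(",") if s]
        ends = [int(e) for e in fields[10].split(",") if e]
        for start, end in zip(starts, ends):
            key = (chrom, start, end)
            if key in seen:
                continue  # same exon shared by another transcript
            seen.add(key)
            yield f"{chrom}\t{start}\t{end}\t{gene}"


# Two hypothetical transcripts of one gene that share their first exon.
example = [
    "GENE1\tNM_000001\tchr1\t+\t100\t500\t150\t450\t2\t100,300,\t200,500,\n",
    "GENE1\tNM_000002\tchr1\t+\t100\t500\t150\t450\t2\t100,350,\t200,500,\n",
]
bed = list(refflat_to_bed_lines(example))
for line in bed:
    print(line)
```

Exact duplicates are dropped here, but exons that only partially overlap (300-500 and 350-500 in the example) are both kept, so the resulting BED can still contain overlapping regions.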
Thanks a lot! I got it, but I also want to make sure I am doing the right thing. Here is what I did.
skg_convert.py refFlat.txt -t bed -o refFlat.bed
guess_baits.py bam1 bam2 -t refFlat.bed -o guess_baits.bed
But I get an error like this:
Loaded 80816 candidate regions from refFlat.bed
Evaluating targets in bam1
Processing reads in bam1
Time: 1281.040 seconds (205575 reads/sec, 61 bins/sec)
Summary: #bins=78477, #reads=263349347, mean=3355.7520, min=0.0, max=197074.45
Percent reads in regions: 279.509 (of 94218509 mapped)
Traceback (most recent call last):
  File "miniconda2/bin/guess_baits.py", line 246, in <module>
    baits = filter_targets(args.targets, args.sample_bams, args.processes)
  File "miniconda2/bin/guess_baits.py", line 54, in filter_targets
    "%d != %d" % (len(sample), len(baits))
AssertionError: 78477 != 80816
What does it mean?
Hmm, not sure, I'll take a look to see if there's a bug in guess_baits.py.
If you're building a pooled reference (multiple control samples), you can also just use the refFlat.bed file as-is and CNVkit will drop most of the uncaptured exons automatically.
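While the error above is being looked into, one thing that may be worth checking (a guess, not a confirmed diagnosis) is whether refFlat.bed contains duplicate or overlapping regions. The log reports 279.509 percent of reads in regions, which suggests some reads were counted more than once, and the 80816 candidate regions ended up as 78477 bins. A minimal sketch for counting duplicates and overlaps, with a hypothetical helper name and toy coordinates:

```python
def summarize_bed_regions(regions):
    """Count total, unique, and overlapping (chrom, start, end) regions.

    A region counts as overlapping if it starts before the furthest end
    seen so far on the same chromosome, after sorting the unique regions.
    """
    regions = list(regions)
    unique = sorted(set(regions))
    n_overlapping = 0
    prev_chrom, prev_end = None, -1
    for chrom, start, end in unique:
        if chrom == prev_chrom and start < prev_end:
            n_overlapping += 1
            prev_end = max(prev_end, end)
        else:
            prev_chrom, prev_end = chrom, end
    return {
        "total": len(regions),
        "unique": len(unique),
        "overlapping": n_overlapping,
    }


def parse_bed_line(line):
    """Return (chrom, start, end) from one tab-separated BED line."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


# Toy input: one exact duplicate and one partial overlap on chr1.
example_lines = [
    "chr1\t100\t200\tA\n",
    "chr1\t100\t200\tA\n",
    "chr1\t150\t250\tA\n",
    "chr1\t300\t400\tB\n",
    "chr2\t100\t200\tC\n",
]
summary = summarize_bed_regions(parse_bed_line(l) for l in example_lines)
print(summary)
```

To run it on a real file, pass `parse_bed_line(l) for l in open("refFlat.bed")` instead of the toy lines, skipping any header or track lines first. If the unique count is below the total, or the overlap count is large, merging the regions before running guess_baits.py might be worth trying.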