Dear Eric, dear all, from the CNVkit documentation:
Typically you would use a properly formatted VCF from joint tumor-normal SNV calling, e.g. the output of MuTect, VarDict, or FreeBayes, having already flagged somatic mutations so they can be skipped in this analysis. If you have no matched normal sample for a given tumor, you can use 1000 Genomes common SNP sites to extract the likely germline SNVs from a tumor-only VCF, and use just those sites with THetA2 (or another tool like PyClone or BubbleTree).
I am currently trying to do the same as described above. I have built a .cnn reference from unrelated but age-matched WES samples obtained with the same hybridization-based method. I have filtered my VCFs for common dbSNP SNPs with an AF of more than 10% (very common).
My .cns file looks like this:
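For reference, the filtering step is roughly equivalent to the sketch below. This is not the exact command I ran: it assumes the population frequency is stored in an INFO key such as ExAC_ALL (as in my annotated VCF), and the helper names `parse_info` and `is_common` are made up for illustration.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def is_common(vcf_line, key="ExAC_ALL", min_af=0.10):
    """Return True if the record's population AF (assumed key) exceeds min_af."""
    cols = vcf_line.rstrip("\n").split("\t")
    value = parse_info(cols[7]).get(key)  # column 8 of a VCF record is INFO
    if not isinstance(value, str):
        return False  # key missing or present only as a flag
    try:
        return float(value) > min_af
    except ValueError:
        return False  # e.g. "." for a missing annotation


# Hypothetical, shortened record for illustration only
record = "\t".join(
    ["chr1", "762273", ".", "G", "A", ".", "REJECT",
     "DP=367;AF=1;ExAC_ALL=0.8060", "GT:DP", ".:367"]
)
print(is_common(record))
```

Header lines (starting with `#`) would need to be passed through unchanged before applying `is_common` to the records.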
chromosome start end gene log2 depth probes weight
chr1 12403 2990008 DDX11L1,WASH7P,FAM138F,MIR4251 -2.49508 14.0935 1681 555.806
chr1 2992142 6337653 PRDM16,MIR4251,ACOT7 -4.62846 7.05581 1631 483.703
My VCF file, without the meta-information header, looks like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Tumorsample1
chr1 762273 . G A . REJECT DP=367;AF=1;DP4=0,0,119,248;SB=0;ANNOVAR_DATE=2018-04-16;ExAC_ALL=0.8060;ExAC_AFR=0.4415;ExAC_AMR=0.8116;ExAC_EAS=0.9174;ExAC_FIN=0.9;ExAC_NFE=0.8384;ExAC_OTH=0.8896;ExAC_SAS=0.8184;Func.refGene=ncRNA_exonic;Gene.refGene=LINC00115;GeneDetail.refGene=.;ExonicFunc.refGene=.;AAChange.refGene=.;cosmic87_coding=.;ALLELE_END;rs_ids=rs3115849 GT:DP:AF:SB:DP4 .:367:1.0:0:0,0,119,248
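To make the allele counts in this record explicit, here is a minimal sketch that reads them from the DP4 field. It assumes DP4 follows the usual samtools/bcftools convention (ref-forward, ref-reverse, alt-forward, alt-reverse); I have not confirmed that my caller uses the same order, and `dp4_counts` is just an illustrative helper name.

```python
def dp4_counts(info):
    """Return (ref_reads, alt_reads) from a DP4 entry in a VCF INFO string.

    Assumes DP4 = ref-forward, ref-reverse, alt-forward, alt-reverse.
    """
    for item in info.split(";"):
        if item.startswith("DP4="):
            ref_fwd, ref_rev, alt_fwd, alt_rev = map(int, item[4:].split(","))
            return ref_fwd + ref_rev, alt_fwd + alt_rev
    raise ValueError("no DP4 field in INFO")


ref, alt = dp4_counts("DP=367;AF=1;DP4=0,0,119,248;SB=0")
baf = alt / (ref + alt)  # alt-allele fraction at this site
print(ref, alt, baf)  # 0 367 1.0
```

Under that assumption, this site has 0 reference reads and 367 alternate reads, which matches AF=1 in the record.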
When I run
export theta tumorsample1.cns -r ref.cnn -v sample1.vcf
Wrote sample1theta
Selected test sample sample1
Loaded 44443 records; skipped: 0 somatic, 1648 depth
Kept 44443 heterozygous of 44443 VCF records
/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:868: FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. See the documentation here: https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_lowerdim(tup)
Wrote sample1.tumor.snp_formatted.txt
Wrote sample1.normal.snp_formatted.txt
Unfortunately, I get an empty normal SNP file (is this expected, or is it an error?):
#Chrm Pos Ref_Allele Mut_Allele
The tumor SNP file has 0 as the Mut_Allele count:
#Chrm Pos Ref_Allele Mut_Allele
chr1 762272 367 0
chr1 808921 153 0
#ID chrm start end tumorCount normalCount
start_1_12403:end_1_2990008 1 12403 2990008 298177 3286017
start_1_2992142:end_1_6337653 1 2992142 6337653 65940 1608940
THetA stops prematurely because
normalMutCount[i] + normalRefCount[i] is less than
The common SNPs were extracted, but how would you get the BAF of the normal sample?
Best regards and thanks in advance