Question

BAF without normal control, to do or not! or how

5.0 years ago

sm.hashemin ▴ 90

Dear Eric, dear all, from the CNVkit documentation:

Typically you would use a properly formatted VCF from joint tumor-normal SNV calling, e.g. the output of MuTect, VarDict, or FreeBayes, having already flagged somatic mutations so they can be skipped in this analysis. If you have no matched normal sample for a given tumor, you can use 1000 Genomes common SNP sites to extract the likely germline SNVs from a tumor-only VCF, and use just those sites with THetA2 (or another tool like PyClone or BubbleTree).

I am currently trying to do the same as described above. I have built a .cnn reference from unrelated but age-matched WES samples obtained with the same hybridization-based method. I have filtered my VCFs for common dbSNP SNPs with an AF of more than 10% (very, very common).

My .cns file looks like this:

chromosome  start   end gene    log2    depth   probes  weight
chr1    12403   2990008 DDX11L1,WASH7P,FAM138F,MIR4251  -2.49508    14.0935 1681    555.806
chr1    2992142 6337653 PRDM16,MIR4251,ACOT7    -4.62846    7.05581 1631    483.703

My VCF file, without the header, looks like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Tumorsample1
chr1    762273  .   G   A   .   REJECT  DP=367;AF=1;DP4=0,0,119,248;SB=0;ANNOVAR_DATE=2018-04-16;ExAC_ALL=0.8060;ExAC_AFR=0.4415;ExAC_AMR=0.8116;ExAC_EAS=0.9174;ExAC_FIN=0.9;ExAC_NFE=0.8384;ExAC_OTH=0.8896;ExAC_SAS=0.8184;Func.refGene=ncRNA_exonic;Gene.refGene=LINC00115;GeneDetail.refGene=.;ExonicFunc.refGene=.;AAChange.refGene=.;cosmic87_coding=.;ALLELE_END;rs_ids=rs3115849   GT:DP:AF:SB:DP4 .:367:1.0:0:0,0,119,248

When I run export theta tumorsample1.cns -r ref.cnn -v sample1.vcf, I get:

Wrote sample1theta
Selected test sample sample1
Loaded 44443 records; skipped: 0 somatic, 1648 depth
Kept 44443 heterozygous of 44443 VCF records
/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:868: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_lowerdim(tup)
Wrote sample1.tumor.snp_formatted.txt
Wrote sample1.normal.snp_formatted.txt

Unfortunately, I get an empty normal SNP file, containing only the header (is this expected, or is it an error?):

#Chrm   Pos Ref_Allele  Mut_Allele

The tumor SNP file has zero as the Mut_Allele count at every site:

#Chrm   Pos Ref_Allele  Mut_Allele
chr1    762272  367 0
chr1    808921  153 0

The interval file:

#ID chrm    start   end tumorCount  normalCount
start_1_12403:end_1_2990008 1   12403   2990008 298177  3286017
start_1_2992142:end_1_6337653   1   2992142 6337653 65940   1608940

THetA stops prematurely because normalMutCount[i] + normalRefCount[i] is less than the minimum.

The common SNPs were extracted, but how would you get the BAF of the normal sample?

Best regards and thanks in advance

Eric T.

cnvkit BAF no controls TheTA2 TheTA • 1.5k views

updated 4.9 years ago by zhouyangyu • written 5.0 years ago by sm.hashemin ▴ 90

score 0 · Answer 1 · 2019-09-16

I'm using 1000 Genomes sites to pile up SNPs, and I used cnvkit.py export theta to generate the three files (tumor SNPs, normal SNPs, interval counts). To simulate the matched normal, I took the tumor SNP file and created the normal SNP file from it by setting every SNP as heterozygous (ref_allele = mut_allele). This allows THetA to run smoothly, and the results look good.
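
A minimal sketch of that workaround in Python, in case it helps. It assumes the tumor SNP file is the tab-separated, four-column format shown in the question (#Chrm, Pos, Ref_Allele, Mut_Allele). The function names are hypothetical, and splitting each site's total tumor depth evenly between the two alleles is one possible reading of "ref_allele = mut_allele", not necessarily the exact counts used above.

```python
import csv


def simulate_het_normal(tumor_lines):
    """Yield SNP rows in which every site has equal ref and mut counts.

    tumor_lines: any iterable of tab-separated text lines, e.g. an open
    tumor SNP file. Header lines starting with '#' pass through unchanged.
    """
    for row in csv.reader(tumor_lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if row[0].startswith("#"):
            yield row  # keep the header as-is
            continue
        chrom, pos = row[0], row[1]
        total_depth = int(row[2]) + int(row[3])
        # Assumption: split the tumor depth evenly so the simulated
        # normal looks like a perfectly balanced heterozygous site.
        half = total_depth // 2
        yield [chrom, pos, str(half), str(half)]


def write_simulated_normal(tumor_path, normal_path):
    """Read a tumor SNP file and write the simulated normal SNP file."""
    with open(tumor_path, newline="") as src, \
            open(normal_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerows(simulate_het_normal(src))
```

With the file names from the log in the question, this would be called as write_simulated_normal("sample1.tumor.snp_formatted.txt", "sample1.normal.snp_formatted.txt"). Only the total depth per site is used, so the simulated normal carries no real germline information; it just gives THetA non-zero normal counts to work with.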