Question: Using THETA2 from CNVkit: what kind of VCF do you need?
2.3 years ago by
gil.hornung70
European Union
gil.hornung70 wrote:

Hi,

I want to use the "export theta" functionality of CNVkit to estimate tumor purity with the THETA2 program.

Part of the input for THETA2 are files with SNP counts for the Tumor and Normal samples formatted like:

#Chrm   Pos     Ref_Allele      Mut_Allele
10      104427  74      1
10      111955  54      0
10      135656  0       94

To the best of my understanding, these should be the germline variants in the Tumor and Normal samples, because they are used to estimate the B-allele frequency (BAF).
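To make the link between these counts and the BAF concrete, here is a minimal Python sketch that parses lines in the format shown above and computes the alternate-allele fraction at each position. The function names are hypothetical and this is not code from THETA2 or CNVkit; it only assumes the four whitespace-separated columns from the example.

```python
def parse_snp_counts(text):
    """Parse SNP count lines into (chrom, pos, ref_count, alt_count) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#Chrm ...' header line.
        if not line or line.startswith("#"):
            continue
        chrom, pos, ref, alt = line.split()[:4]
        records.append((chrom, int(pos), int(ref), int(alt)))
    return records


def b_allele_frequency(ref_count, alt_count):
    """Fraction of reads supporting the alternate allele; None if no coverage."""
    depth = ref_count + alt_count
    return alt_count / depth if depth else None


# The three example rows from the question.
example = """#Chrm   Pos     Ref_Allele      Mut_Allele
10      104427  74      1
10      111955  54      0
10      135656  0       94
"""

snps = parse_snp_counts(example)
bafs = [b_allele_frequency(ref, alt) for _, _, ref, alt in snps]
for (chrom, pos, ref, alt), baf in zip(snps, bafs):
    print(chrom, pos, ref, alt, baf)
```

All three example rows have a BAF at or near 0 or 1, which is what a homozygous site looks like. Sites that are heterozygous in the germline, with a BAF near 0.5 in the normal sample, are the ones that carry allelic-imbalance information in the tumor.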

According to the CNVkit manual, cnvkit export theta accepts a VCF file:

cnvkit.py export theta Sample_T.cns reference.cnn -v Sample_Paired.vcf

However, it is unclear what kind of VCF this should be. If the germline mutations are important, then the VCF output of programs such as MuTect2 is not appropriate, because they are geared towards somatic mutations and discard the germline mutations. Should I use the output of HaplotypeCaller? If so, how is Sample_Paired.vcf organised? Furthermore, should I filter the VCF to include only PASS mutations?

Am I missing out on something?

Thank you,

Gil

theta2 tumor purity cnvkit • 1.4k views
modified 2.3 years ago by Eric T.2.5k • written 2.3 years ago by gil.hornung70

Have you seen the details mentioned here:

http://cnvkit.readthedocs.io/en/latest/fileformats.html

And here,

https://github.com/samtools/hts-specs

Let us know what issues you faced if you have seen these pages and if they did not work.

written 2.3 years ago by sridhar56100
2.3 years ago by
Eric T.2.5k
San Francisco, CA
Eric T.2.5k wrote:

CNVkit's VCF processing works best with GATK HaplotypeCaller or FreeBayes on the tumor-normal pair, with both samples shown and somatic variant records marked with SOMATIC in the INFO column.
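As an illustration of the layout described here (one VCF with a column for each sample, and somatic records flagged with SOMATIC in INFO), the following Python sketch reads a toy VCF and separates germline from somatic records. The sample names "normal" and "tumor", the toy records, and the function name are all made up for the example; this is not CNVkit's own parser.

```python
def read_vcf(text):
    """Return (sample_names, records) from VCF text; each record is a dict."""
    samples, records = [], []
    for line in text.splitlines():
        # Skip meta-information lines and blank lines.
        if line.startswith("##") or not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample columns start after the nine fixed VCF columns.
            samples = fields[9:]
            continue
        # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
        info_keys = {item.split("=", 1)[0] for item in fields[7].split(";")}
        records.append({
            "chrom": fields[0],
            "pos": int(fields[1]),
            "somatic": "SOMATIC" in info_keys,
        })
    return samples, records


# Hypothetical two-sample VCF: one germline het SNP, one somatic SNP.
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "normal", "tumor"]
rows = [
    ["1", "1000", ".", "A", "G", "50", "PASS", "DP=60", "GT:AD",
     "0/1:15,14", "0/1:20,11"],
    ["1", "2000", ".", "C", "T", "50", "PASS", "DP=55;SOMATIC", "GT:AD",
     "0/0:30,0", "0/1:18,7"],
]
toy_vcf = "\n".join(
    ["##fileformat=VCFv4.2", "\t".join(header)] + ["\t".join(r) for r in rows]
)

samples, records = read_vcf(toy_vcf)
germline = [r for r in records if not r["somatic"]]
print(samples)
print(germline)
```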

written 2.3 years ago by Eric T.2.5k

Thank you Eric and sridhar56. If possible, can you provide a small VCF that follows the proper specs as an example? It would make things much clearer.

Gil

written 2.2 years ago by gil.hornung70

Yes, here's an example VCF included in CNVkit's test suite: https://raw.githubusercontent.com/etal/cnvkit/master/test/formats/na12878_na12882_mix.vcf

written 2.2 years ago by Eric T.2.5k

Just as a reference, here is the GATK command I used to extract high-quality heterozygous SNPs from the HaplotypeCaller output:

java -jar GenomeAnalysisTK.jar \
-R reference.fasta \
-T SelectVariants \
-V haplotype_caller.vcf \
-o het.vcf \
--excludeFiltered \
--selectTypeToInclude SNP \
--restrictAllelesTo BIALLELIC \
-select '(vc.getGenotype("normal_name").isHet())&&(vc.getGenotype("normal_name").getAD().1>={params.min_normal_ALT})&&(vc.getGenotype("tumor_name").getDP()>{params.min_tumor_DP})'
written 2.2 years ago by gil.hornung70
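For readers without GATK at hand, the selection criteria in the command above can be sketched in plain Python as follows. This is only an approximation under stated assumptions: the function name is hypothetical, the thresholds 5 and 20 are placeholder defaults standing in for the unspecified {params.min_normal_ALT} and {params.min_tumor_DP}, and the FORMAT column is assumed to carry GT, AD and DP. It is not the GATK or CNVkit implementation.

```python
BASES = {"A", "C", "G", "T"}


def is_usable_het_snp(fields, normal_idx, tumor_idx,
                      min_normal_alt=5, min_tumor_dp=20):
    """Apply the filters to one tab-split VCF data line.

    normal_idx and tumor_idx are 0-based positions among the sample columns.
    The threshold defaults are placeholders, not values from the thread.
    """
    ref, alt, flt = fields[3], fields[4], fields[6]
    # --excludeFiltered: keep only unfiltered sites.
    if flt not in ("PASS", "."):
        return False
    # --selectTypeToInclude SNP and --restrictAllelesTo BIALLELIC.
    if ref not in BASES or alt not in BASES:
        return False
    keys = fields[8].split(":")
    normal = dict(zip(keys, fields[9 + normal_idx].split(":")))
    tumor = dict(zip(keys, fields[9 + tumor_idx].split(":")))
    # Normal sample must be heterozygous (phased or unphased).
    if normal.get("GT", "").replace("|", "/") not in ("0/1", "1/0"):
        return False
    try:
        normal_alt = int(normal["AD"].split(",")[1])
        tumor_dp = int(tumor["DP"])
    except (KeyError, IndexError, ValueError):
        return False
    return normal_alt >= min_normal_alt and tumor_dp > min_tumor_dp


# Hypothetical records: sample columns are normal (index 0), tumor (index 1).
good = "1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP\t0/1:15,14:29\t0/1:20,11:31".split("\t")
hom_ref = "1\t2000\t.\tC\tT\t50\tPASS\t.\tGT:AD:DP\t0/0:30,0:30\t0/1:18,7:25".split("\t")
low_depth = "1\t3000\t.\tG\tA\t50\tPASS\t.\tGT:AD:DP\t0/1:12,10:22\t0/1:6,4:10".split("\t")
indel = "1\t4000\t.\tAT\tA\t50\tPASS\t.\tGT:AD:DP\t0/1:15,14:29\t0/1:20,11:31".split("\t")

for name, rec in [("good", good), ("hom_ref", hom_ref),
                  ("low_depth", low_depth), ("indel", indel)]:
    print(name, is_usable_het_snp(rec, normal_idx=0, tumor_idx=1))
```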

Yes, this is similar to what CNVkit does internally when you give it a VCF.

written 2.2 years ago by Eric T.2.5k

Does this command line:

cnvkit.py export theta Sample_T.cns reference.cnn -v Sample_Paired.vcf

support the VCF output from Mutect2 now? An unfiltered output VCF file from Mutect2 should also contain germline variants (as indicated by "germline_risk" in the FILTER column).

Maybe I'm wrong?

written 18 months ago by ibphuangchen10

You'd think so, but Mutect also tends to filter the germline variants even when you tell it not to, leaving relatively few SNPs that CNVkit can use for the BAF calculation. It's better to use HaplotypeCaller to get comprehensive germline SNP calls.

written 17 months ago by Eric T.2.5k