Question: annotating TCGA VCF file
jan (Malaysia) wrote, 2.6 years ago:

Hi,

I recently obtained TCGA VCF files to search for germline variants. The variants were called by Washington University using several callers (Samtools, Sniper, VarScan, and Strelka), and the calls from each caller were lumped together into one VCF file. On checking the files, most of the variants called by every caller except VarScan are uninformative, so I can only annotate the variants called by VarScan.

This is what the VCF column header looks like:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TCGA-PG-A914-01A-11D-A37N-09 TCGA-PG-A914-01A-11D-A37N-09-[Samtools] TCGA-PG-A914-10A-01D-A37N-09 TCGA-PG-A914-10A-01D-A37N-09-[Sniper] TCGA-PG-A914-01A-11D-A37N-09-[Sniper] TCGA-PG-A914-10A-01D-A37N-09-[VarscanSomatic] TCGA-PG-A914-01A-11D-A37N-09-[VarscanSomatic] TCGA-PG-A914-10A-01D-A37N-09-[Strelka] TCGA-PG-A914-01A-11D-A37N-09-[Strelka]

The problem is that the FORMAT column is not consistent from record to record. These are all the FORMAT strings that occur in the VCF files:

GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:FA:AD:FDP:SDP:SUBDP:AU:CU:GU:TU GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:IGT:DP4:BCOUNT:JGQ:AMQ:SSC:FDP:SDP:SUBDP:AU:CU:GU:TU GT:DP:DP4:BQ:FA:VAQ:SS:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU:GQ:MQ:IGT:BCOUNT:JGQ:AMQ:SSC GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:DP4:FDP:SDP:SUBDP:AU:CU:GU:TU GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU:FA GT:DP:DP4:BQ:FA:VAQ:SS:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU:GQ:MQ GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:IGT:DP4:BCOUNT:JGQ:AMQ:SSC GT:DP:DP4:BQ:FA:VAQ:SS:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:FDP:SDP:SUBDP:AU:CU:GU:TU:DP4 GT:AD:BQ:SS:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:FT:DP4:FA:VAQ:GQ:MQ:IGT:BCOUNT:JGQ:AMQ:SSC GT:DP:DP4:BQ:FA:VAQ:SS:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU:IGT:BCOUNT:GQ:JGQ:MQ:AMQ:SSC GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:AD:FA:FDP:SDP:SUBDP:AU:CU:GU:TU GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:FA GT:AD:BQ:SS:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:FT:DP4:FA:VAQ GT:DP:DP4:BQ:FA:VAQ:SS:FT:GQ:MQ:AD:IGT:BCOUNT:JGQ:AMQ:SSC GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:AD:FA GT:AD:BQ:SS:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:FT:DP4:FA:VAQ:GQ:MQ GT:DP:DP4:BQ:FA:VAQ:SS:FT:IGT:BCOUNT:GQ:JGQ:MQ:AMQ:SSC GT:AD:BQ:SS:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:FT:DP4:FA:VAQ:IGT:BCOUNT:GQ:JGQ:MQ:AMQ:SSC GT:DP:DP4:BQ:FA:VAQ:SS:FT:IGT:BCOUNT:GQ:JGQ:MQ:AMQ:SSC:AD GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:DP4 GT:DP:DP4:BQ:FA:VAQ:SS:FT GT:DP:DP4:BQ:FA:VAQ:SS:FT:GQ:MQ:AD

I'm not sure what strategies there are for annotating this kind of VCF file, and I would really appreciate any help if you have encountered this kind of VCF formatting.

Disclaimers: 1) I have the right authorization to use the data; 2) I have emailed TCGA regarding this issue and no solution was given; 3) I emailed Washington University a few weeks ago and haven't received any reply.


How do you plan to annotate the VCF: with an existing program like snpEff or VEP, or with custom scripts? The former should not be a problem if the VCF is valid; for the latter, try a VCF parsing library like PyVCF, which will keep track of the FORMAT tags for you.

written 2.6 years ago by Eric T.
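
For the custom-script route, what "keeping track of the format tags" amounts to can be sketched in plain Python without any library. This is only a minimal illustration, not PyVCF's API: the function name `parse_record`, the shortened sample names, and the two data lines are made up for the example. The point is that the FORMAT string is read from each record, so a different tag order on each line does not matter once values are looked up by tag name.

```python
def parse_record(line, sample_names):
    """Split one VCF data line into its fixed fields and per-sample dicts.

    The FORMAT string is taken from the record itself, so the tag order
    may differ from line to line without affecting lookups by tag name.
    """
    fields = line.rstrip("\n").split("\t")
    fixed, fmt, sample_cols = fields[:8], fields[8], fields[9:]
    keys = fmt.split(":")
    samples = {}
    for name, col in zip(sample_names, sample_cols):
        values = col.split(":")
        # A sample column may drop trailing tags (or be just "."); pad with ".".
        values += ["."] * (len(keys) - len(values))
        samples[name] = dict(zip(keys, values))
    return fixed, samples


if __name__ == "__main__":
    # Hypothetical, shortened sample names and made-up records for illustration.
    names = ["NORMAL-[VarscanSomatic]", "TUMOR-[VarscanSomatic]"]
    rec_a = "1\t100\t.\tA\tG\t.\tPASS\t.\tGT:DP:SS\t0/1:30:1\t0/1:28:1"
    rec_b = "1\t200\t.\tC\tT\t.\tPASS\t.\tGT:SS:DP\t0/0:2:40\t0/1:2:35"
    for rec in (rec_a, rec_b):
        fixed, per_sample = parse_record(rec, names)
        # POS and the tumor depth, found by tag name in both records.
        print(fixed[1], per_sample["TUMOR-[VarscanSomatic]"]["DP"])
```

A maintained parsing library does the same bookkeeping and also handles header metadata and type conversion, so it is usually the better choice for real files.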
Chris Miller (Washington University in St. Louis, MO) wrote, 2.6 years ago:

I'm not sure who you emailed at WashU, but I can offer a partial response.

1) I'd recommend starting from the MAF files, rather than the VCFs, unless you're looking at WGS data. They are better curated lists of variants.

2) You can convert those to VCF using one of several available tools, such as https://github.com/mskcc/vcf2maf/blob/master/maf2vcf.pl

3) It's probably not an awful idea to reannotate using something like VEP or vcfanno.


Thank you for your reply.

I just submitted a contact form through the McDonnell Genome Institute website.

It's quite difficult to navigate the new TCGA portal, and there is no option to get MAF files for whole exome sequencing data; there are only BAM and VCF files. I found a page, linked from this forum, that explains where to get the new MAF files:

https://wiki.nci.nih.gov/display/TCGA/TCGA+MAF+Files

However, as far as I can see, all the MAF files are for somatic variants, which is not useful to me if they don't contain germline variants. Also, the number of MAF files doesn't match the total number of cases (only about half of the UCEC cases have one).

The VCF files will be annotated by the bioinformatics team at my institute using their own pipeline, which incorporates snpEff and other tools.

written 2.6 years ago by jan

Use vcf2maf in the repo that Chris pointed to. It was tested on a complex VCF similar to what you're dealing with. You can use it with the lumped VCF if you specify --tumor-id and --normal-id as the names of the genotype columns for VarScan.

written 2.5 years ago by Cyriac Kandoth
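
To find the two VarScan genotype column names to pass as --tumor-id and --normal-id, a small helper along these lines may be enough. It is a sketch under assumptions: the function name `caller_columns` is hypothetical, the header is the one posted in the question, and tumor versus normal is decided from the sample-type code in the fourth field of the TCGA barcode (01 to 09 tumor, 10 to 19 normal). It does not run vcf2maf; check the vcf2maf documentation for the remaining options your version needs.

```python
HEADER = (
    "CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
    "TCGA-PG-A914-01A-11D-A37N-09 "
    "TCGA-PG-A914-01A-11D-A37N-09-[Samtools] "
    "TCGA-PG-A914-10A-01D-A37N-09 "
    "TCGA-PG-A914-10A-01D-A37N-09-[Sniper] "
    "TCGA-PG-A914-01A-11D-A37N-09-[Sniper] "
    "TCGA-PG-A914-10A-01D-A37N-09-[VarscanSomatic] "
    "TCGA-PG-A914-01A-11D-A37N-09-[VarscanSomatic] "
    "TCGA-PG-A914-10A-01D-A37N-09-[Strelka] "
    "TCGA-PG-A914-01A-11D-A37N-09-[Strelka]"
)


def caller_columns(header_line, caller_tag="[VarscanSomatic]"):
    """Return (tumor_column, normal_column) for one caller's genotype columns."""
    columns = header_line.lstrip("#").split()
    # Sample columns start after the nine fixed VCF columns.
    tagged = [c for c in columns[9:] if c.endswith("-" + caller_tag)]
    tumor = normal = None
    for name in tagged:
        # Barcode layout: TCGA-<site>-<participant>-<sample type + vial>-...
        sample_type = int(name.split("-")[3][:2])
        if 1 <= sample_type <= 9:
            tumor = name
        elif 10 <= sample_type <= 19:
            normal = name
    return tumor, normal


if __name__ == "__main__":
    tumor_id, normal_id = caller_columns(HEADER)
    print("--tumor-id", tumor_id)
    print("--normal-id", normal_id)
```

Because the column names contain square brackets, quote them when pasting them into a shell command.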