Question: Keep Format and Individual fields when annotating VCF with VEP
2.7 years ago, jsneaththompson (90) wrote:

I'm currently updating my variant calling pipeline by switching the VCF annotation software from Annovar to VEP for a variety of reasons, not least how easy it is to annotate with HGVS notation and keep datasets up to date in VEP.

For the most part everything is running smoothly, except that some of the data in the VCFs is lost during annotation (and conversion to TSV). The VCFs are created with GATK's UnifiedGenotyper and include a 'FORMAT' column, where each value is 'GT:AD:DP:GQ:PL', and a column named after the individual, which contains colon-separated data corresponding to the FORMAT column (i.e. Genotype:Allele Depth:Depth:Genotype Quality:Phred-scaled Likelihoods). When I annotate with VEP, none of this data is carried over to the output file as it would be in Annovar, leaving me with an annotated file that has no information on read depth, genotype or any of the other data in the two lost columns.
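To make the layout concrete: the FORMAT keys and the sample values are both colon-delimited and pair up by position. The following is a minimal Python sketch, assuming a single-sample VCF data line; the example record and the function name `parse_sample` are invented for illustration and are not taken from RM0108.vcf.

```python
def parse_sample(format_col: str, sample_col: str) -> dict:
    """Pair each FORMAT key with the matching value from one sample column."""
    keys = format_col.split(":")
    values = sample_col.split(":")
    # Trailing fields may be omitted in a sample column; pad them as missing (".").
    values += ["."] * (len(keys) - len(values))
    return dict(zip(keys, values))


# Hypothetical tab-separated data line, for illustration only.
line = "1\t12345\t.\tA\tG\t50.0\tPASS\tDP=30\tGT:AD:DP:GQ:PL\t0/1:14,16:30:99:255,0,255"
fields = line.split("\t")

# Column 9 (index 8) is FORMAT, column 10 (index 9) is the first sample.
genotype = parse_sample(fields[8], fields[9])
print(genotype["GT"], genotype["AD"], genotype["DP"])
```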

I've included the command I'm currently using for annotation:

./vep -i RM0108.vcf --cache --force_overwrite --tab --merged --variant_class --sift b --polyphen b --hgvs --symbol --canonical --check_existing --af_1kg --af_gnomad --humdiv --pick -o RM0108.tsv

I can't find information on this in the VEP documentation or elsewhere online. I could write something to take the relevant information from the VCF and add it to the tsv after VEP has finished running, but it seems like there may be an easier solution that I'm missing, so any help would be appreciated.

I've also posted this question on Bioinformatics Stack Exchange.

modified 2.4 years ago by MasMarius (10) • written 2.7 years ago by jsneaththompson (90)

It would definitely be easier to extract and add the required information using bcftools. If you'd like to preserve all VCF information, your output format should be VCF, not tab. VCF holds 3D information (one mXn matrix per sample per variant), whereas TSV is 2D (one data point per sample per variant).
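One way to picture the 3D-versus-2D point: a flat table can only hold the per-sample fields if every sample/field pair gets its own column. The sketch below is a rough Python illustration of that flattening, not a replacement for bcftools; the function name `flatten_record`, the sample names `S1` and `S2`, and the example record are all hypothetical.

```python
def flatten_record(line: str, sample_names: list) -> dict:
    """Turn one VCF data line into a flat row with SAMPLE.FIELD columns."""
    fields = line.rstrip("\n").split("\t")
    row = {"CHROM": fields[0], "POS": fields[1], "REF": fields[3], "ALT": fields[4]}
    keys = fields[8].split(":")  # FORMAT column
    # Sample columns start at index 9, in the same order as in the #CHROM header.
    for name, sample_col in zip(sample_names, fields[9:]):
        for key, value in zip(keys, sample_col.split(":")):
            row[f"{name}.{key}"] = value
    return row


# Hypothetical two-sample header and record, for illustration only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
record = "1\t12345\t.\tA\tG\t50.0\tPASS\tDP=55\tGT:DP:GQ\t0/1:30:99\t0/0:25:60"

sample_names = header.split("\t")[9:]
row = flatten_record(record, sample_names)
print("\t".join(row.keys()))
print("\t".join(row.values()))
```

The number of columns grows with samples times fields, which is why a fixed-column tab format cannot carry them generically.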

modified 2.7 years ago • written 2.7 years ago by _r_am (32k)

You might also want to look at GATK's VariantsToTable tool. You will need to pass the -GF flag once for each genotype field you want in the output.

written 2.7 years ago by steve (2.7k)

What does mXn represent?

written 19 months ago by Shicheng Guo (8.5k)

Read it as "m -by- n", which refers to a 2D matrix with m rows and n columns.

written 19 months ago by _r_am (32k)
2.7 years ago, Emily_Ensembl (21k) wrote:

The VEP TSV format can only keep its own specified columns. If you want to maintain the data from your original input, get your output in VCF. It will add the VEP annotation to the INFO column, and keep everything you already have there.
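If a flat table is still needed afterwards, the VCF output (VEP's `--vcf` option in place of `--tab`) can be unpacked in a second step. In VCF output, VEP writes its annotation under the `CSQ` key in INFO as pipe-delimited fields, with the field order listed in the `##INFO=<ID=CSQ,...>` header line. Below is a minimal Python sketch, assuming a single-sample file; the function names, the shortened CSQ field list and the example record are hypothetical.

```python
import re


def csq_field_names(header_line: str) -> list:
    """Read the pipe-delimited field list from the CSQ header description."""
    match = re.search(r'Format: ([^">]+)', header_line)
    return match.group(1).strip().split("|")


def csq_rows(record: str, names: list) -> list:
    """Return one flat row per CSQ entry, with the sample fields attached."""
    fields = record.rstrip("\n").split("\t")
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    rows = []
    # Multiple consequences for one variant are comma-separated inside CSQ.
    for entry in info.get("CSQ", "").split(","):
        if not entry:
            continue
        row = {"CHROM": fields[0], "POS": fields[1], "REF": fields[3], "ALT": fields[4]}
        row.update(zip(names, entry.split("|")))
        row.update(sample)
        rows.append(row)
    return rows


# Hypothetical, shortened header and record; real VEP output lists many more fields.
csq_header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
              'annotations from Ensembl VEP. Format: Allele|Consequence|SYMBOL|HGVSc">')
record = ("1\t12345\t.\tA\tG\t50.0\tPASS\tDP=30;CSQ=G|missense_variant|GENE1|c.1A>G"
          "\tGT:AD:DP\t0/1:14,16:30")

names = csq_field_names(csq_header)
rows = csq_rows(record, names)
for row in rows:
    print("\t".join(row.values()))
```

bcftools also provides a split-vep plugin that performs this kind of extraction, which may be preferable to hand-written parsing where it is available.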

written 2.7 years ago by Emily_Ensembl (21k)

Is there a way to convert the VEP-annotated VCF to tab?

written 21 months ago by Matthias (20)