Dear Biostars,
I am trying to add annotations to a .vcf file. I have created a .tab file and a .vcf file using the code included in this post. However, the .vcf file stores all of the requested annotations in the INFO column, while the .tab file is more neatly organised. I would rather stick with the .vcf file format because it seems to be the more popular format for annotations. Any help clearing up this misunderstanding would be appreciated.
Code used to create the .tab file, followed by the first line of output:
${vep_path}/vep --cache --dir $dir \
--dir_cache $dir_cache --offline --species homo_sapiens --assembly GRCh38 --fasta ${fasta} \
--input_file ${input.vcf} --output_file output.vcf --warning_file warn.txt --stats_file stat.html \
--hgvs --symbol --force_overwrite --format vcf --tab --no_check_variants_order \
--check_existing --polyphen p --sift p --af_gnomad --total_length --max_af --variant_class \
--keep_csq --plugin CADD --plugin dbNSFP --plugin ExACpL --plugin LoFtool \
--plugin DisGeNET --plugin REVEL --plugin Mastermind \
--fields "Uploaded_variation,Location,Allele,Gene,Feature,SYMBOL,EXON,Existing_variation,VARIANT_CLASS,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,HGVSc,HGVSp,BIOTYPE,IMPACT,CLIN_SIG,PolyPhen,SIFT,CADD_PHRED,CADD_RAW,MutationTaster_pred,REVEL,gnomAD_AF,MAX_AF,ExACpLI,LoFtool,DisGeNET_PMID,DisGeNET_SCORE,DisGeNET_disease,Mastermind_URL" \
--pick --pick_order rank,canonical,tsl --fork 4 --buffer_size 20000
For neatness, I show only the first several columns:
Uploaded_variation|Location|Allele|Gene|Feature|SYMBOL|EXON|Existing_variation
chr1_14653_C/T|chr1:14653|T|ENSG00000227232|ENST00000488147|WASH7P|-|rs62635297
However, when I try to create a .vcf file with almost the same code, the requested information is all crammed into the INFO column (see the CSQ= entry in the record below).
The only two flags changed are:
--vcf instead of --tab
--fields "Allele,Gene,Feature,SYMBOL,EXON,Existing_variation,VARIANT_CLASS,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,HGVSc,HGVSp,BIOTYPE,IMPACT,CLIN_SIG,PolyPhen,SIFT,CADD_PHRED,CADD_RAW,MutationTaster_pred,REVEL,gnomAD_AF,MAX_AF,ExACpLI,LoFtool,DisGeNET_PMID,DisGeNET_SCORE,DisGeNET_disease,Mastermind_URL"
The output file's columns and first record are as follows:
CHROM|POS|ID|REF|ALT|QUAL|FILTER|INFO|FORMAT|subject1
chr1|14653|.|C|T|359.64|MQ40;SOR3;VQSRTrancheSNP99.90to100.00| AC=1;AF=0.500;AN=2;AS_FilterStatus=VQSRTrancheSNP99.90to100.00;AS_VQSLOD=-10.7082;AS_culprit=MQ;BaseQRankSum=-8.870e-01;DP=24;ExcessHet=3.0103;FS=8.016;MLEAC=1;MLEAF=0.500;MQ=23.02;MQRankSum=0.883;NEGATIVE_TRAIN_SITE;QD=18.93;ReadPosRankSum=2.26;SOR=4.863;CSQ=chr1_14653_C/T|chr1:14653|T|ENSG00000227232|ENST00000488147|WASH7P||rs62635297|SNV|intron_variant&non_coding_transcript_variant|||||ENST00000488147.1:n.1254-152G>A||unprocessed_pseudogene|MODIFIER||||0.148|-0.373269|||||||||| |GT:AD:DP:GQ:PL|0/1:3,16:19:25:367,0,25
Is there a way in VEP to extract this information from the INFO column and neatly organise this as a .vcf file?
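For context, the VCF format has no per-annotation columns, so VEP packs every requested field into the CSQ entry of INFO and records the field order in the "Format:" part of the CSQ header line. A minimal Python sketch of unpacking that entry after the fact is given below. It is not a VEP feature, only one possible post-processing step. The function names and the in-memory sample are hypothetical, and the sample is a shortened stand-in for the record above that assumes the header lists only the first six fields.

```python
import io

def parse_csq_format(header_line):
    """Return the CSQ sub-field names listed after 'Format: ' in the CSQ header line."""
    fmt = header_line.split("Format: ", 1)[1]
    return fmt.rstrip().rstrip(">").rstrip('"').split("|")

def iter_csq_rows(handle):
    """Yield one dict per CSQ entry, combining site columns with the CSQ sub-fields."""
    fields = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##INFO=<ID=CSQ"):
            fields = parse_csq_format(line)
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        chrom, pos, ref, alt, info = cols[0], cols[1], cols[3], cols[4], cols[7]
        # INFO entries are ';'-separated; pick out the one that starts with 'CSQ='.
        csq = next((e[4:] for e in info.split(";") if e.startswith("CSQ=")), None)
        if csq is None or fields is None:
            continue
        # Multiple annotations for one site (e.g. several transcripts) are ','-separated.
        for entry in csq.split(","):
            row = {"CHROM": chrom, "POS": pos, "REF": ref, "ALT": alt}
            row.update(zip(fields, entry.split("|")))
            yield row

# Hypothetical, shortened sample: real files carry many more header lines and fields.
SAMPLE_VCF = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Gene|Feature|SYMBOL|EXON|Existing_variation">\n'
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t14653\t.\tC\tT\t359.64\tMQ40\t"
    "AC=1;DP=24;CSQ=T|ENSG00000227232|ENST00000488147|WASH7P||rs62635297\n"
)

rows = list(iter_csq_rows(io.StringIO(SAMPLE_VCF)))
for row in rows:
    print("\t".join(row.values()))
```

Reading the field order from the header rather than hard-coding the --fields list keeps the sketch in step with whatever VEP actually wrote. The result is a tab-separated table rather than a VCF, since a valid VCF has to keep these values inside INFO.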
Cheers, will have a look.