How can I make VEP recognize header lines in my VCF?
4.5 years ago

Here's a massively simplified VCF file with one line:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  M_CJ-R878H_AML1-R878H_AML1
1       3329234 .       G       T       36.8    .       TIER=1  GT     1/1

If I run VEP on it like so, it returns a warning, as it tries to parse the FORMAT header line:

$ perl /path/to/ensembl-tools-release-86/scripts/variant_effect_predictor/ -i test.vcf -o out.vcf --offline --cache_version 67 --species mus_musculus --vcf --symbol --format vcf --dir_cache /path/to/.vep --dir_plugins /path/to/VEP_plugins-release-86
2017-05-04 12:46:31 - Read existing cache info
2017-05-04 12:46:31 - Starting...

WARNING: Invalid input formatting on line 2
2017-05-04 12:46:31 - Read 1 variants into buffer
2017-05-04 12:46:31 - Reading transcript data from cache and/or database
[========================================================================================================================]  [ 100% ]
2017-05-04 12:46:31 - Retrieved 8 transcripts (0 mem, 8 cached, 0 DB, 0 duplicates)
2017-05-04 12:46:31 - Analyzing chromosome 1
2017-05-04 12:46:31 - Analyzing variants
[========================================================================================================================]  [ 100% ]
2017-05-04 12:46:31 - Calculating consequences
2017-05-04 12:46:31 - Processed 1 total variants (1 vars/sec, 1 vars/sec total)
2017-05-04 12:46:31 - Wrote stats summary to out.vcf_summary.html
2017-05-04 12:46:31 - See out.vcf_warnings.txt for details of 1 warnings
2017-05-04 12:46:31 - Finished!

To support my idea that it's not handling the header correctly, if I run this VCF omitting the --format vcf flag, it is unable to detect that it is a VCF.

It does return the annotated VCF lines correctly when told that the input is a VCF, but it doesn't pass through the existing header lines, and it also doesn't add the CSQ header line that contains the key for parsing the information VEP adds.
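
For anyone who wants to check their own output for that CSQ header, here is a rough sketch. It assumes the usual VEP layout, where the `##INFO=<ID=CSQ,...>` line has a description ending in `Format: ` followed by the pipe-separated field names. `csq_fields` is a hypothetical helper name, not something VEP ships.

```shell
# Sketch only: csq_fields is a made-up helper, not part of VEP.
# Assumes a header of the form
#   ##INFO=<ID=CSQ,...,Description="... Format: Allele|Consequence|...">
csq_fields() {
    # Grab the first CSQ header line; complain and fail if there is none.
    line=$(grep -m 1 '^##INFO=<ID=CSQ,' "$1") || {
        echo "no CSQ header line found in $1" >&2
        return 1
    }
    # Keep the text after "Format: ", drop the closing quote and bracket,
    # then print one field name per line.
    printf '%s\n' "$line" | sed -e 's/.*Format: //' -e 's/">$//' | tr '|' '\n'
}
```

Running `csq_fields out.vcf` on a file like mine, which has no header at all, should print the error message and return a non-zero status.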

Has anyone encountered this before? Any suggestions on how to make VEP do the right thing here?

Edit to add output, which is sane, but lacking the expected headers:

1   3329234 .   G   T   36.8    .   GT;CSQ=T|intron_variant|MODIFIER||ENSMUSG00000051951|Transcript|ENSMUST00000070533|protein_coding||2/2||||||||||-1|||   1/1
4.5 years ago

Update - this doesn't seem to happen on my laptop's more recent install of VEP (version 87 vs version 86). I guess it's either a version issue or a somehow screwy install. I'm going to go ahead and mark this as the best answer for now, as an upgrade seems like it will solve the problem.

If anyone has additional information or has encountered this, would still love to hear what might be wrong.

4.5 years ago

Use awk to insert a dummy ##FORMAT header line for each token in the FORMAT column.

Something like:

awk '/^#CHROM/ {printf("##FORMAT=<ID=x1,Number=1,Type=String,Description=\"\">\n##FORMAT=<ID=x2,Number=1,Type=String,Description=\"\">\n");} {print;}' in.vcf
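
If you don't want to hard-code the IDs, a variation on the same idea is to read the keys from the FORMAT column itself. This is only a sketch: it assumes a tab-delimited VCF with FORMAT in column 9, `add_format_headers` is a made-up name, and the `Number=.`/`Type=String` values and the description are placeholders rather than the real definitions of those keys.

```shell
# Sketch only: add_format_headers is a hypothetical helper.
# Reads the file twice: first to collect FORMAT keys, then to print it
# with one placeholder ##FORMAT line per key ahead of the #CHROM line.
add_format_headers() {
    awk -F '\t' '
        # Pass 1: collect distinct keys from column 9 of the data lines.
        NR == FNR {
            if ($0 !~ /^#/ && NF >= 9) {
                n = split($9, keys, ":")
                for (i = 1; i <= n; i++)
                    if (!(keys[i] in seen)) {
                        seen[keys[i]] = 1
                        order[++count] = keys[i]
                    }
            }
            next
        }
        # Pass 2: emit the placeholder headers just before #CHROM.
        /^#CHROM/ {
            for (i = 1; i <= count; i++)
                printf("##FORMAT=<ID=%s,Number=.,Type=String,Description=\"placeholder\">\n", order[i])
        }
        { print }
    ' "$1" "$1"
}
```

Usage would be something like `add_format_headers in.vcf > with_headers.vcf`. It needs a real file rather than a pipe, since it reads the input twice.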

Sadly, that's not the issue. I'm editing the post above to make the VCF even simpler and make the FORMAT lines match up 100% with the fields - the same warning and header recognition issue persists.

