VEP is very slow. Fork doesn't seem to work
16 months ago
nhaus ▴ 200

I am using VEP (v103) to annotate a small VCF file (~1000 variants). Nonetheless, it takes very long (>20 minutes), which doesn't quite match the documentation's claim:

Set up correctly, VEP is capable of processing around 3 million variants in 30 minutes

Furthermore, it seems that --fork does not really work, because only one core is used the whole time.

This is the command that I used:

vep  --cache --dir_cache vep-cache --offline --fasta ref-genome.fa --pick --fork 4 --sift b --variant_class -i somatic.filtered.snp.vcf -o snp_vep_out.txt


I'd be very thankful if someone could point out what I am doing wrong.


I'm not a VEP user, but if you can't figure it out then you can always use another variant annotator like OpenCRAVAT. My experience is that it should only take several seconds to annotate 1000 variants (docs here: https://open-cravat.readthedocs.io/en/latest/ ).


Also, as it looks like you are trying to annotate somatic mutations (likely in cancer), OpenCRAVAT has more options for predicting oncogenic mutations in cancer beyond SIFT. Recent benchmarks suggest that there are many better methods for cancer (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01954-z ).

16 months ago
Emily 23k

I assume you've already seen this documentation page based on the quote at the top. The 3M in 30 min is really the absolute fastest under ideal conditions, which means no additional flags (--pick and --sift in your command will make it slower).

With regard to forking, VEP by default reads 5000 variants into memory in each fork, so there will be no forking if you have fewer than 5000 variants. You can change this with --buffer_size, but I doubt this would increase speed much.
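
To apply the point above, one could compare the number of variant records in the input with the buffer size before expecting `--fork` to help. The snippet below is only a rough sketch: `count_variants` is a hypothetical helper (not part of VEP), `example.vcf` is a made-up file created so the snippet runs on its own, and 5000 is the default buffer size quoted in this answer.

```shell
# Count variant records (non-header lines) in an uncompressed VCF.
# count_variants is a hypothetical helper, not part of VEP.
count_variants() {
    grep -vc '^#' "$1"
}

buffer_size=5000   # default quoted in the answer above

# Build a tiny example VCF (two records) so the snippet is self-contained.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\t.\tA\tG\t.\t.\t.\n1\t200\t.\tC\tT\t.\t.\t.\n' > example.vcf

n=$(count_variants example.vcf)
if [ "$n" -lt "$buffer_size" ]; then
    echo "$n variants: fewer than one buffer, so forking is unlikely to help"
else
    echo "$n variants: at least one full buffer, so forking may help"
fi
```

For the ~1000-variant file in the question, the same check would report fewer variants than one default buffer, which is consistent with only one core being busy.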


Thank you for your answer, and sorry for only getting back to you now. Your explanation regarding forking makes a lot of sense!

I am writing again because I am now using VEP to annotate germline mutations, this time more than 4 million, and it has already taken more than a day with 4 forks.

My input VCF is sorted, but I am not sure if I have tabix-indexed my cache. I downloaded the cache using the installer script (homo_sapiens_vep_104_GRCh37.tar.gz). I saw on this site that there exists an already indexed cache.

curl -O http://ftp.ensembl.org/pub/release-104/variation/indexed_vep_cache/homo_sapiens_vep_104_GRCh38.tar.gz

tar xzf homo_sapiens_vep_104_GRCh37.tar.gz

However, this didn't work, and I got this error:

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
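
An "unexpected end of file" from gzip usually points to a truncated archive, for example an interrupted download. Note also that the curl command above fetches the GRCh38 file, while the tar command unpacks the GRCh37 one. A minimal sketch of testing an archive before unpacking it is shown here; `archive_ok` is a hypothetical helper (not part of VEP), and the small demo archives are created only so the snippet runs on its own.

```shell
# Return success only if the gzip stream is complete and readable.
# archive_ok is a hypothetical helper, not part of VEP.
archive_ok() {
    gzip -t "$1" 2>/dev/null
}

# Build a small valid archive and a deliberately truncated copy.
echo "demo" > demo.txt
tar czf good.tar.gz demo.txt
head -c 20 good.tar.gz > truncated.tar.gz

for f in good.tar.gz truncated.tar.gz; do
    if archive_ok "$f"; then
        echo "$f: OK, safe to run tar xzf"
    else
        echo "$f: incomplete, download it again before extracting"
    fi
done
```

If the check fails on the real cache archive, `curl -C - -O` with the same URL can resume a partial download instead of starting over.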


Cheers!

EDIT:

I just tried out the convert_cache.pl script:

perl convert_cache.pl --dir . --species all --version all


which finished right away and reported "No unprocessed types remaining", so I guess my cache is already indexed. That really makes me wonder what I am doing wrong that makes VEP take so long.


Yes, our caches are already indexed.