Question

Memory usage and run time for VEP whole-genome variant annotation?

0

9.1 years ago

vlad1 • 0

Hi,

What memory usage and run time should I expect for VEP whole-genome variant annotation? I tried to annotate a 5-sample Illumina 30x coverage whole-genome VCF:

perl /ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
    --force_overwrite \
    -i G85829.vcf \
    --cache \
    --assembly GRCh37 \
    --offline \
    --individual all \
    --symbol \
    --numbers \
    --biotype \
    --total_length \
    -o output \
    --vcf \
    --fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,Protein_position,BIOTYPE

This command runs out of memory after ~11 hours. There is about 20 GB of free memory on the Ubuntu server:

[==============================================================================]  [ 100% ]
2015-03-05 22:14:40 - Processed 20675000 total variants (238 vars/sec, 547 vars/sec total)
2015-03-05 22:14:41 - Read 5000 variants into buffer
2015-03-05 22:14:41 - Reading transcript data from cache and/or database
[=====================================>                                        ]   [ 50% ]ERROR: Cannot allocate memory at /ensembl-tools-release-78/scripts/variant_effect_predictor/Bio/EnsEMBL/Variation/Utils/VEP.pm line 4735, <GEN0> line 4136138.

The number of processed variants is close to the total number of variants from GATK calling (each genome has about 3.5M total SNPs and 0.5M total indels; 20,917 total). So I wonder if something happens after all variants have been processed. For comparison, it takes ANNOVAR a few minutes to annotate one genome.
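As a rough sanity check on the ~11 hour figure, the log's own numbers can be turned into a run-time estimate. This is only a sketch: `estimate_runtime` is a hypothetical helper, not part of VEP, and it assumes the "vars/sec total" value in the log is an overall average rate.

```shell
#!/usr/bin/env bash
# Estimate wall-clock time from a variant count and an average rate.
# estimate_runtime is a hypothetical helper, not part of VEP.
estimate_runtime() {
    local variants=$1   # number of variants processed
    local rate=$2       # average variants per second
    local seconds=$(( variants / rate ))   # integer division
    printf '%dh %dm\n' $(( seconds / 3600 )) $(( (seconds % 3600) / 60 ))
}

# Figures from the log above: 20,675,000 variants at 547 vars/sec overall.
estimate_runtime 20675000 547
```

With the figures from the log this prints `10h 29m`, which is consistent with the job failing after roughly 11 hours, near the end of the input.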

Vlad

SNP vep ensembl genome • 5.3k views

Ram · Answer 1 · 2015-03-06

4

9.1 years ago

EnsemblWill ▴ 570

Try using --fork (see http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#forking)

Not only should this eradicate any memory leak issues, but you should see the script run much, much faster.

In addition, if you can wait a week or so, the next release of VEP (79) is faster still than 78 and comes with a handy guide for making sure your VEP analyses are running at optimal speed.

0

Thanks, it worked! It greatly reduced the memory footprint and took about 10 hours to process 42M variants on 8 CPUs. I used --fork 6:

--force_overwrite \
-i G85829.vcf \
--cache \
--assembly GRCh37 \
--offline \
--individual all \
--fork 6 \
--sift b \
--polyphen b \
--symbol \
--numbers \
--biotype \
--total_length \
-o output \
--vcf \
--fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE

This can probably be optimized further via the batch size option (--buffer_size in VEP).
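A minimal sketch of what that tuning might look like, assuming the "batch size" corresponds to VEP's `--buffer_size` option (the log in the question shows 5000 variants read per buffer). The value 10000 and the "leave two cores free" rule are illustrative assumptions, not tested recommendations. The script only prints the command for review and does not run VEP.

```shell
#!/usr/bin/env bash
# Build the VEP command as an array and print it (dry run only).
total_cpus=8                   # CPU count reported in the post
forks=$(( total_cpus - 2 ))    # assumption: leave two cores free, giving 6
buffer_size=10000              # hypothetical value; the log shows 5000 per buffer

vep_cmd=(
    perl /ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl
    --force_overwrite
    -i G85829.vcf
    --cache --assembly GRCh37 --offline
    --individual all
    --fork "$forks"
    --buffer_size "$buffer_size"
    --sift b --polyphen b
    --symbol --numbers --biotype --total_length
    -o output --vcf
    --fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE
)

# Print the assembled command on one line so it can be inspected before running.
printf '%s ' "${vep_cmd[@]}"
echo
```

A larger buffer generally trades memory for speed, so given the original out-of-memory error it may be worth raising it gradually while watching memory use.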

Vlad

written 9.1 years ago by vlad1 • 0