Memory usage and run time for VEP whole-genome variant annotation?
9.1 years ago
vlad1 • 0

Hi,

What memory usage and run time should I expect for VEP whole-genome variant annotation? I tried to annotate a 5-sample Illumina 30x coverage whole-genome VCF:

perl /ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
    --force_overwrite \
    -i G85829.vcf \
    --cache \
    --assembly GRCh37 \
    --offline \
    --individual all \
    --symbol \
    --numbers \
    --biotype \
    --total_length \
    -o output \
    --vcf \
    --fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,Protein_position,BIOTYPE

This command runs out of memory after ~11 hours. There is about 20 GB of free memory on the Ubuntu server:

[==============================================================================]  [ 100% ]
2015-03-05 22:14:40 - Processed 20675000 total variants (238 vars/sec, 547 vars/sec total)
2015-03-05 22:14:41 - Read 5000 variants into buffer
2015-03-05 22:14:41 - Reading transcript data from cache and/or database
[=====================================>                                        ]   [ 50% ]ERROR: Cannot allocate memory at /ensembl-tools-release-78/scripts/variant_effect_predictor/Bio/EnsEMBL/Variation/Utils/VEP.pm line 4735, <GEN0> line 4136138.
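
As a rough cross-check on the ~11 hours mentioned above, the overall rate in the last "Processed" log line can be converted into an elapsed time. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the two numbers are copied from the log.

```shell
# Numbers taken from the log line "Processed 20675000 total variants (... 547 vars/sec total)".
total_variants=20675000
rate=547   # overall variants per second reported by VEP

# Integer arithmetic is precise enough for an estimate.
seconds=$(( total_variants / rate ))
hours=$(( seconds / 3600 ))
minutes=$(( (seconds % 3600) / 60 ))

echo "approx. ${hours}h ${minutes}m"   # prints: approx. 10h 29m
```

Roughly ten and a half hours is consistent with the crash happening near the end of the input rather than part-way through it.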

The number of processed variants is close to the total number of variants from GATK calling (each genome has about 3.5M total SNPs and 0.5M total indels; 20,917 total). So I wonder whether something happens after all variants have been processed. For comparison, ANNOVAR takes a few minutes to annotate one genome.
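
One way to check how close the "Processed" counter is to the end of the input is to count the records in the VCF directly. The sketch below assumes a plain-text (uncompressed) VCF; `count_variants` is a hypothetical helper name, and the tiny demo file only exists so the snippet runs on its own. Whether VEP's counter maps one-to-one onto VCF records when `--individual` is used is not established in this thread.

```shell
# count_variants is a hypothetical helper, not part of VEP or GATK.
count_variants() {
    # VCF header lines start with '#'; every other line is one variant record.
    grep -vc '^#' "$1"
}

# Tiny stand-in VCF with two records; point the function at G85829.vcf in practice.
demo_vcf=$(mktemp)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\t.\tA\tG\t.\t.\t.\n1\t200\t.\tC\tT\t.\t.\t.\n' > "$demo_vcf"

n=$(count_variants "$demo_vcf")
echo "$n variant records"
rm -f "$demo_vcf"
```

Running `count_variants G85829.vcf` and comparing the result with the last "Processed" line would show how many records were still pending when the error occurred.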

Vlad

SNP vep ensembl genome • 5.3k views
9.1 years ago
EnsemblWill ▴ 570

Try using --fork (see http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#forking)

Not only should this eradicate any memory leak issues, but you should see the script run much, much faster.

In addition, if you can wait a week or so, the next release of VEP (79) is faster still than 78 and comes with a handy guide for making sure your VEP analyses run at optimal speed.


Thanks! It worked. It greatly reduced the memory footprint and took about 10 hours to process 42M variants on 8 CPUs. I used --fork 6:

--force_overwrite \
-i G85829.vcf \
--cache \
--assembly GRCh37 \
--offline \
--individual all \
--fork 6 \
--sift b \
--polyphen b \
--symbol \
--numbers \
--biotype \
--total_length \
-o output \
--vcf \
--fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE

It can probably be optimized further via the batch size option (presumably --buffer_size).
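
A minimal sketch of how the two tuning knobs might be derived before launching a run. It assumes the "batch size option" refers to VEP's `--buffer_size` (the number of variants read into memory at once, 5000 in the log above). The rule of leaving two cores free is only an assumption that happens to reproduce `--fork 6` on 8 CPUs, and `CPUS` and `BUFFER_SIZE` are hypothetical environment variables, not VEP settings.

```shell
# Hypothetical inputs; override them from the environment if desired.
cpus=${CPUS:-8}                    # 8 CPUs as in the post; e.g. CPUS=$(nproc) on Linux
buffer_size=${BUFFER_SIZE:-5000}   # 5000 matches "Read 5000 variants into buffer"

# Assumption: leave two cores free for the OS and I/O, but never go below one fork.
fork=$(( cpus - 2 ))
if [ "$fork" -lt 1 ]; then
    fork=1
fi

# Print the extra options instead of running VEP, so the sketch is safe to try.
extra_opts="--fork $fork --buffer_size $buffer_size"
echo "$extra_opts"
```

The printed options would be appended to the command above. A smaller buffer should lower memory use at the cost of run time, and a larger one the reverse, so the value is worth testing on a subset of the VCF first.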

Vlad
