Question

Memory error during imputation of full genome data for multiple samples

4.9 years ago

waqaskhokhar999 ▴ 160

I am trying to impute missing values in full-genome data (3955671 rows) for more than 700 samples. The script works fine for a smaller dataset (10000 rows) but gives a memory error for the full genome.

Trial dataset:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  108 139 159 265 350
1   73  0   C   A   40  PASS    0   GT:DP:GQ    0|0:5:40    0|0:9:40    0|0:6:38    ./.:.:. ./.:.:.
1   83  0   T   C,A 40  PASS    0   GT:DP:GQ    1|1:5:40    1|1:9:40    0|0:8:38    ./.:.:. ./.:.:.
1   92  0   A   C   40  PASS    0   GT:DP:GQ    1|1:8:40    1|1:11:40   0|0:9:40    ./.:.:. ./.:.:.

After imputation:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  108 139 159 265 350
1   73  0   C   A   0   PASS    0   GT  0|0 0|0 0|0 0|0 0|0
1   83  0   T   C,A 0   PASS    0   GT  1|1 1|1 0|0 0|0 0|0
1   92  0   A   C   0   PASS    0   GT  1|1 1|1 0|0 0|0 0|0

Command and error for the full-genome dataset:

java -Xmx50g -jar  beagle.16May19.351.jar gt=genotype_9.vcf.recode.vcf nthreads=96 out=results
beagle.16May19.351.jar (version 5.0)
Copyright (C) 2014-2018 Brian L. Browning
Enter "java -jar beagle.16May19.351.jar" to list command line argument
Start time: 03:02 PM BST on 25 May 2019

Command line: java -Xmx45511m -jar beagle.16May19.351.jar
  gt=genotype_9.vcf.recode.vcf
  nthreads=96
  out=results

No genetic map is specified: using 1 cM = 1 Mb

Reference samples:           0
Study samples:             666

Window 1 (1:73-30427620)
Study markers:         960,417
java.lang.OutOfMemoryError: Java heap space
at phase.PhaseBaum1.<init>(PhaseBaum1.java:107)
at phase.PhaseLS.run(PhaseLS.java:66)
at main.MainHelper.lsPhaseSingles(MainHelper.java:95)
at main.MainHelper.phase(MainHelper.java:72)
at main.Main.phaseData(Main.java:166)
at main.Main.main(Main.java:116)
java.lang.OutOfMemoryError: Java heap space
ERROR
terminating program.

I can use up to 102 cores, and here is the free memory information for my server:

              total        used        free
Mem:         257823         786       53784

How much memory do I need for this task, or should I split my dataset into subsets and impute each subset separately?

SNP imputation beagle • 1.7k views

score 0 · Answer 1 · 2019-05-27

Hello. With 700 individuals it should work: I have run imputation with up to 15000 individuals, but I did it for each chromosome separately. The memory error is probably caused by the large number of markers. Try dividing your data (PED files) by chromosome, converting each part to VCF, and running Beagle one chromosome at a time. You can then combine the results of the 22 chromosomes after imputation.
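
Since the question already starts from a VCF, the per-chromosome split can also be done directly on that file rather than on PED files. The following is only a minimal sketch of that idea, not part of Beagle: the function name `split_vcf_by_chrom` and the output layout are made up for illustration, and it assumes an uncompressed, text-mode VCF. It streams the file line by line, so the full genome is never held in memory.

```python
import os


def split_vcf_by_chrom(vcf_path, out_dir):
    """Stream a plain-text VCF and write one VCF per chromosome.

    Hypothetical helper for illustration; returns a dict mapping
    chromosome name -> path of the per-chromosome file.
    """
    os.makedirs(out_dir, exist_ok=True)
    header = []    # every line starting with '#', copied into each output
    handles = {}   # chromosome name -> open output file
    paths = {}     # chromosome name -> output path
    try:
        with open(vcf_path) as vcf:
            for line in vcf:
                if line.startswith("#"):
                    header.append(line)
                    continue
                if not line.strip():
                    continue  # skip blank lines
                # CHROM is the first whitespace-delimited field
                chrom = line.split(None, 1)[0]
                if chrom not in handles:
                    paths[chrom] = os.path.join(out_dir, f"{chrom}.vcf")
                    handles[chrom] = open(paths[chrom], "w")
                    handles[chrom].writelines(header)
                handles[chrom].write(line)
    finally:
        for fh in handles.values():
            fh.close()
    return paths
```

Calling `split_vcf_by_chrom("genotype_9.vcf.recode.vcf", "by_chrom")` would produce files such as `by_chrom/1.vcf`, each of which can be passed to the same Beagle command through `gt=` with its own `out=` prefix. It may also be worth checking the argument list printed by `java -jar beagle.16May19.351.jar` for a `chrom=` option, which, if present in this version, would restrict the analysis to one chromosome without splitting the file.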