Joint variant calling with Platypus killed for exceeding memory
vncnt.anna, 5.5 years ago

Hi,

I am running into some problems with variant calling for one of my projects. I need to call variants for 601 individuals, for which I have exome capture data for chromosomes X, Y, MT, and some other autosomal genes.

I used BWA-MEM to map the reads to the GRCh38 reference, which worked fine, even though I do not have the BED file corresponding to the target regions.
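
The mapping step looked roughly like this (a sketch; the file names, thread counts, and read-group string are placeholders rather than my actual values):

# Map paired-end exome reads to GRCh38, then sort and index the BAM.
bwa mem -t 8 -R '@RG\tID:sample\tSM:sample' GRCh38.fa \
    sample_R1.fastq.gz sample_R2.fastq.gz \
  | samtools sort -@ 4 -o sample.sorted.bam -
samtools index sample.sorted.bam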

Where I run into problems, though, is with variant calling. I am trying to do it with Platypus, and the first time I ran my job on all 601 files at once, it got killed for exceeding the job's memory limit.

Platypus callVariants --maxReads 800000000 \
                      --nCPU {cores} \
                      --maxVariants 20 \
                      --bamFiles={infiles} \
                      --refFile={ref_fasta} \
                      --output={outvcf} \
                      --filterDuplicates=0 \
                      --source={variants_vcf}.gz \
                      --minPosterior=0 \
                      --getVariantsFromBAMs=0

I increased the memory to 256 GB on 12 cores per node, which is the maximum I can ask for from the cluster I'm using (SLURM backend), but the job still got killed. As an alternative, I split my BAM files by chromosome to run separate jobs and then merge the VCFs with VCFtools afterwards (roughly as sketched after the log below), but I still get the same error after a few rounds of calling:

2018-11-11 11:41:46,549 - INFO - Processing region X:900000-1000000. (Only printing this message every 10 regions of size 100000)
2018-11-11 11:41:46,657 - INFO - Processing region X:1000000-1100000. (Only printing this message every 10 regions of size 100000)
2018-11-11 11:41:46,765 - INFO - Processing region X:1100000-1200000. (Only printing this message every 10 regions of size 100000)
slurmstepd: error: Job 17151372 exceeded memory limit (269778968 > 268435456), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error:  JOB 17151372 ON s04n82 CANCELLED AT 2018-11-11T12:31:55
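
For reference, the memory request, the per-chromosome split, and the later merge look roughly like this (a sketch with placeholder file names; vcf-concat is one of the VCFtools Perl utilities I use for the merge):

#!/bin/bash
#SBATCH --mem=256G           # the 256 GB limit from the log above
#SBATCH --cpus-per-task=12   # 12 cores per node

# Extract one chromosome from an indexed BAM into its own file.
samtools view -b sample.sorted.bam X > sample.X.bam
samtools index sample.X.bam

# After the per-chromosome Platypus jobs finish, concatenate the VCFs.
vcf-concat calls.X.vcf calls.Y.vcf calls.MT.vcf > calls.all.vcf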

In the beginning I did not specify --regions=X with Platypus, so it still ran over all regions even though my input BAM files were per chromosome (I triple-checked). But even after doing so, it can't finish the jobs, and I can't allocate more memory.
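
With the region restriction added, a per-chromosome call looks like this (same placeholder names as above; the contig is named X in my reference, as the log shows):

Platypus callVariants --regions=X \
                      --bamFiles=sample.X.bam \
                      --refFile={ref_fasta} \
                      --output=calls.X.vcf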

So I'm thinking that I might have a memory leak somewhere (I am definitely no expert, and it is my first time working with so many files), because it seems crazy to me that it can't process those files (the full BAMs were about 2-3 GB each; the split BAMs are about 1.5 GB per X-chromosome file, 1.7K per MT, but only 47K for the Y chromosome).

I am also using a temporary directory to store the Platypus temp files before merging them into the final VCF, but it does not seem to help.

My other thought is that maybe I should use different variant-calling software, such as GATK, but it doesn't look very friendly to new users.
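
From what I've read, the GATK route would mean calling a per-sample GVCF and then joint-genotyping the cohort, roughly like this (a sketch of the documented GATK4 workflow, with placeholder file names):

# Per-sample haplotype calling in GVCF mode.
gatk HaplotypeCaller -R GRCh38.fa -I sample.X.bam \
                     -O sample.X.g.vcf.gz -ERC GVCF

# Combine the per-sample GVCFs, then joint-genotype the cohort.
gatk CombineGVCFs -R GRCh38.fa \
                  -V sample1.X.g.vcf.gz -V sample2.X.g.vcf.gz \
                  -O cohort.X.g.vcf.gz
gatk GenotypeGVCFs -R GRCh38.fa -V cohort.X.g.vcf.gz -O cohort.X.vcf.gz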

I also considered splitting my BAM files into two groups of about 300 individuals and merging afterwards, but I think it is much better to call all individuals at once for the same region.

If any of you have run into similar problems or have any suggestions on what to do next, I'm listening.

Thanks a lot!

Anna

software error next-gen SNP