Reducing block size used in Spark versions of GATK tools
3.7 years ago
aalith ▴ 10

Hi all,

I've been attempting to run GATK's MarkDuplicatesSpark on a BAM file that's about 160 GB, but I keep getting errors about running out of space on my device. I've allotted Docker 850 GB of space, which should be enough. The following command takes around two days to reach the error.

Command Line

gatk MarkDuplicatesSpark -I "mydata/sample.bam" -O sample.markeddups.bam --spark-master local[10] --verbosity ERROR --tmp-dir path/josh --conf 'spark.local.dir=./tmp'
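One thing worth checking before block size: in the command above, `spark.local.dir=./tmp` is a relative path, so it resolves against the current working directory, which inside a Docker container may sit on the small overlay filesystem rather than the large mounted volume. A minimal sketch of pointing both temp locations at the big disk, assuming (hypothetically) that the 850 GB volume is mounted at `/data`:

```shell
# Hypothetical mount point for the large Docker volume -- adjust to yours.
tmp_dir=/data/tmp
spark_local=/data/spark-local

# Both GATK's --tmp-dir and Spark's local dir as absolute paths on that volume:
echo "--tmp-dir ${tmp_dir} --conf spark.local.dir=${spark_local}"
```

These two flags would replace `--tmp-dir path/josh --conf 'spark.local.dir=./tmp'` in the command above; the point is simply that both should be absolute paths on the filesystem that actually has the space.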

Is there a way to reduce the size of each little storage block that this Spark tool creates? I can't find a simple way of doing so in Docker or from the MarkDuplicatesSpark command line. Each chunk is currently around 50 MB and there are about 12,000 "tasks." I am new to this work, so I'm not fully comfortable interpreting what that means.
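For context, the "tasks" correspond to input splits: Spark divides the input into chunks and runs one task per chunk, so the task count tracks (data size ÷ split size), and making splits *larger* means fewer tasks. Spark's standard `spark.hadoop.` prefix forwards properties to the underlying Hadoop input format, and `mapreduce.input.fileinputformat.split.minsize` is the standard Hadoop knob for the minimum split size; whether the input format MarkDuplicatesSpark uses honors it is an assumption you'd need to test. A sketch of building that flag:

```shell
# Untested sketch: raise the minimum input split size to 256 MB
# (vs the current ~50 MB chunks), which should reduce the task count.
# Whether GATK's BAM reader honors this Hadoop property is an assumption.
split_bytes=$((256 * 1024 * 1024))
conf="spark.hadoop.mapreduce.input.fileinputformat.split.minsize=${split_bytes}"
echo "$conf"
# Would be passed to the command above as an extra:
#   --conf "$conf"
```

Note that larger splits mean fewer, bigger tasks, which raises per-task memory pressure; that trade-off is worth watching if you change it.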

gatk spark
