Question: GATK MarkDuplicatesSpark Space Issues
aalith wrote, 3 months ago:

I'm using GATK's MarkDuplicatesSpark to mark PCR duplicates in my BAM files before running base quality score recalibration and then MuTect. My BAM file is 166 GB. I keep getting errors about space, but I am running nothing else in Docker concurrently. I have given Docker 14 cores, 850 GB of storage, and 55 GB of memory. Before my most recent attempt, I cleared my cache with "docker container prune".

The error is as follows (with several normal lines above it):

19/09/01 05:32:21 INFO ShuffleBlockFetcherIterator: Getting 15508 non-empty blocks out of 16278 blocks
19/09/01 05:32:21 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/09/01 05:32:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:22 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:22 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:27 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:27 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:27 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:27 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:29 ERROR Utils: Aborting task
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device

My command looks like this:

gatk MarkDuplicatesSpark -I "mydata/files/merged.bam" -O merged.markeddups.bam --spark-master local[10] --tmp-dir path/josh

I have tried running MarkDuplicatesSpark with the optional flag to create a statistics file (-M merged.txt). I have also tried controlling the number of cores used with the --conf flag instead of the --spark-master flag (--conf 'spark.executor.cores=10').

Any suggestions on why I'm running out of memory? I think my machine has more than enough resources to handle this task. This command also takes 3 days to reach this error.

Tags: markduplicates, gatk
written 3 months ago by aalith

You are running out of disk space, not memory. Hard disk/SSD, not RAM. Spark is probably creating a whole lot of temporary files - that is not uncommon with these distributed data processing applications.
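A common workaround for this (a sketch with placeholder paths; the broadinstitute/gatk image and the /data/tmp mount point are assumptions, not from the original post) is to mount a large host directory into the container and point both GATK's --tmp-dir and Spark's spark.local.dir at it, so the shuffle spill files land on the big disk rather than in the container's writable layer:

```shell
# Mount a roomy host directory at /spark-tmp inside the container,
# then send both GATK and Spark temporary files there.
docker run -v /data/tmp:/spark-tmp broadinstitute/gatk \
  gatk MarkDuplicatesSpark \
    -I mydata/files/merged.bam \
    -O merged.markeddups.bam \
    --spark-master 'local[10]' \
    --tmp-dir /spark-tmp \
    --conf 'spark.local.dir=/spark-tmp'
```

Note that the input path mydata/files/merged.bam would also need to be visible inside the container, e.g. via a second -v mount.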

written 3 months ago by RamRS

That's what I thought, but does this make sense? I need more than 850 gigs allocated to Docker?
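It can happen: for a 166 GB input, Spark's temporary shuffle files can add up to several times the input size. One way to sanity-check this while the job runs (a sketch; /tmp stands in for wherever --tmp-dir actually points):

```shell
# Report free space on the filesystem holding the Spark temp directory;
# run this periodically while MarkDuplicatesSpark is working to see
# whether temp files are what is filling the disk.
df -h /tmp
```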

written 3 months ago by aalith

Wild suggestion: maybe Docker is allocating too large a block size on its storage filesystem, so each file chunk gets more room than it needs? The mention of 16278 blocks in the error makes me think this. It's almost as if each chunk is 50 MB where it could be 4 MB.

Also, check out this possibly related post: https://serverfault.com/questions/357367/xfs-no-space-left-on-device-but-i-have-850gb-available
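That post describes running out of inodes rather than bytes, and Spark's many small shuffle files can exhaust inodes even when plenty of space remains. You can tell the two apart with df, which reports inode usage under its -i flag:

```shell
# Compare inode usage with byte usage on the temp filesystem;
# IUse% at 100% means inode exhaustion even though df -h shows free space.
df -i /tmp
df -h /tmp
```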

written 3 months ago by RamRS

Thanks! That post is helpful, but I'm new to Docker... how would I implement this fix in Docker?

I may fall back to the regular MarkDuplicates! I'd just need to sort by queryname first
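If you go that route, the non-Spark pipeline would look roughly like this (a sketch; SortSam and MarkDuplicates are the standard Picard tools bundled with GATK4, and the intermediate file names are placeholders):

```shell
# Sort the merged BAM by queryname, then run the classic
# single-threaded MarkDuplicates on the sorted output.
gatk SortSam \
  -I mydata/files/merged.bam \
  -O merged.qsorted.bam \
  -SO queryname

gatk MarkDuplicates \
  -I merged.qsorted.bam \
  -O merged.markeddups.bam \
  -M merged.metrics.txt
```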

written 3 months ago by aalith

Sorry, I don't know Docker. Maybe someone familiar with it can help you out.

written 3 months ago by RamRS
Powered by Biostar version 2.3.0