GATK MarkDuplicatesSpark Space Issues
4.6 years ago
aalith ▴ 10

I'm using GATK's MarkDuplicatesSpark tool to mark PCR duplicates in my BAM files before running base quality score recalibration and then MuTect. My BAM file is 166G. I keep getting errors about space, even though nothing else is running on Docker concurrently. I have given Docker 14 cores, 850G of storage, and 55G of memory. Before my most recent attempt, I removed stopped containers with "docker container prune".
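For context, this is roughly how I checked and reclaimed Docker's disk usage beforehand (output will vary by setup):

# show how much disk Docker is using for images, containers, and volumes
docker system df
# remove stopped containers (what I actually ran); "docker system prune"
# would additionally remove unused images and networks
docker container prune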

The error is as follows (with several normal lines above it):

19/09/01 05:32:21 INFO ShuffleBlockFetcherIterator: Getting 15508 non-empty blocks out of 16278 blocks
19/09/01 05:32:21 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/09/01 05:32:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:22 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:22 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:27 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:27 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:27 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:27 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:29 ERROR Utils: Aborting task
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device

My command looks like this:

gatk MarkDuplicatesSpark -I "mydata/files/merged.bam" -O merged.markeddups.bam --spark-master local[10] --tmp-dir path/josh

I have tried running MarkDuplicatesSpark with the optional flag to create a statistics file (-M merged.txt). I have also tried controlling the number of cores used with the --conf flag instead of the --spark-master flag (--conf 'spark.executor.cores=10').
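One more variant I'm considering, pointing both GATK's temp directory and Spark's scratch space at the largest mounted volume (the /bigdisk/tmp path below is a placeholder for wherever the 850G actually lives):

# hypothetical path; spark.local.dir controls where Spark writes shuffle/spill files
gatk MarkDuplicatesSpark \
    -I "mydata/files/merged.bam" \
    -O merged.markeddups.bam \
    --spark-master local[10] \
    --tmp-dir /bigdisk/tmp \
    --conf 'spark.local.dir=/bigdisk/tmp'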

Any suggestions on why I'm running out of memory? I think my machine has more than enough resources to handle this task. The command also runs for about 3 days before reaching this error.

GATK MarkDuplicates

You are running out of disk space, not memory: hard disk/SSD, not RAM. Spark is probably creating a whole lot of temporary shuffle files; that is not uncommon with these distributed data processing applications.
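If you want to confirm, watch the scratch space fill up while the job runs; something along these lines, with your --tmp-dir path substituted in:

# free space per filesystem, human-readable
df -h
# size of GATK's temp directory while MarkDuplicatesSpark is running
du -sh path/josh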


That's what I thought, but does this make sense? I need more than 850 gigs allocated to Docker?


Wild suggestion: maybe Docker is assigning too large a block size to each storage block, giving each file chunk more room than it needs? The log line about 16278 blocks makes me think this. It's almost as if each chunk were 50MB when it could be 4MB.

Also, check out this possibly related post: https://serverfault.com/questions/357367/xfs-no-space-left-on-device-but-i-have-850gb-available
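If that post is on the mark, the filesystem may be out of inodes rather than blocks; a quick check from a shell inside the container would be something like:

# "no space left on device" with blocks still free often means inode exhaustion
df -i
# compare against actual block usage
df -h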


Thanks! That post is helpful, but I'm new to Docker... how would I implement this in Docker?

I may fall back to the regular MarkDuplicates! I'd just need to sort by queryname first.
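Roughly what I have in mind if I go that route (output file names are just placeholders):

# sort by queryname first (Picard SortSam via the gatk wrapper)
gatk SortSam \
    -I "mydata/files/merged.bam" \
    -O merged.qsorted.bam \
    -SO queryname

# classic single-node duplicate marking; -M (the metrics file) is required here
gatk MarkDuplicates \
    -I merged.qsorted.bam \
    -O merged.markeddups.bam \
    -M merged.markeddups.metrics.txt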


Sorry, I don't know Docker. Maybe someone familiar with it can help you out.
