Question

How to best utilize Supercomputer resources when running BWA on WGS data?


20 months ago

Rebeka ▴ 10

Dear Fellows,

I would like to align whole-genome sequenced human samples with BWA-MEM2 (version bwa-mem2/2.2.1) on a supercomputer. The data are paired-end short reads. There are 200 samples (400 FASTQ files) that make up 18 TB in total (45 GB per file on average).

On the Supercomputer I have access to:

692 thin nodes (40 cores, 192 GB RAM)
55 fat nodes (40 cores, 1.5 TB RAM)
40 GPU nodes (40 cores, 192 GB RAM, NVIDIA V100)

How can the job best be parallelized to improve performance and lower the runtime? (Accounting is on a node-hour basis.)

For example: A) Running a single big job on more nodes or B) submitting the samples in batches and running more individual jobs (each on a single node) in parallel?

Is BWA-MEM2 suitable for running on GPU or multiple nodes?

The first thing I did was write a for loop that took the two FASTQ files of one sample, aligned them, then moved on to the next sample, and so on. I ran 2 samples (194 GB) on a single GPU node with 40 cores and 180 GB RAM. It took 37 hours.

Then I tried increasing the number of nodes to 4 (including 1 GPU node; 40 cores and 180 GB each), but there was no big difference in runtime. I would have expected the runtime to at least halve. I might be wrong about this. Is the number of nodes supposed to affect (ideally decrease) the runtime of a job?

Now I am thinking of submitting the samples in batches, each to a full node with a maximum of 40 cores at a time.

And yes, I am not that experienced with using resources on Supercomputers, therefore any help and advice are appreciated. :)) Also, if I missed some information, then let me know.

Many thanks and have a nice day! Rebeka

nodes bwa bwa-mem2 wgs hpc • 1.2k views


score 0 · Answer 1 · 2022-08-05

Is BWA-MEM2 suitable for running on GPU or multiple nodes?

No for both. bwa is not GPU compatible. You don't want to spread multi-threaded jobs across physical nodes since that increases latency and may slow the job execution down.

Your "supercomputer" must use a job scheduling system, so make use of that. Submit multi-threaded jobs (within the resources your account is allotted/can use) as independent tasks. You can find threads on Biostars on how to construct command lines that create these independent jobs. For alignment tasks, thin or fat nodes should make no difference, since you are only going to use a certain number of cores (use 8-12 per job if you want to get several going at once).
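To make "independent tasks" concrete, here is a minimal Python sketch that pairs FASTQ files by sample and builds one alignment command per sample without running anything. The `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming scheme, the `ref.fa` reference path and the helper names are assumptions for illustration, not something from the question; adapt them to your files.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) naming scheme: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
PAIR_RE = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def pair_fastqs(filenames):
    """Group FASTQ file names into {sample: (R1, R2)}; fail on incomplete pairs."""
    found = defaultdict(dict)
    for name in filenames:
        match = PAIR_RE.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        found[match["sample"]][match["read"]] = name

    pairs = {}
    for sample, reads in sorted(found.items()):
        if set(reads) != {"1", "2"}:
            raise ValueError(f"sample {sample} is missing a mate file")
        pairs[sample] = (reads["1"], reads["2"])
    return pairs


def align_command(sample, r1, r2, reference="ref.fa", threads=12):
    """Build (not run) one single-node alignment command for one sample."""
    return f"bwa-mem2 mem -t {threads} {reference} {r1} {r2} > {sample}.sam"
```

Each returned command is one job for the scheduler; nothing in one job depends on another, so the scheduler can run as many at once as your allocation allows.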

I ran 2 samples (194 GB) on a single GPU node with 40 cores and 180 GB RAM. It took 37 hours.

Are your samples really sequenced that deeply? 194 GB of data for 2 samples? If so, then 37 h is to be expected even with a lot of cores (assuming you actually used all 40).
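As a rough back-of-the-envelope check, the numbers from the question (2 samples in 37 h on one node, 200 samples in total) can be turned into a node-hour and wall-clock estimate. This sketch assumes one sample per node at a time, every sample taking the same 18.5 h, and no queue waiting; the function name and the 50-node figure are purely illustrative.

```python
import math


def estimate(samples, hours_per_sample, nodes):
    """Return (node_hours, wall_clock_hours) for one sample per node at a time.

    Assumes identical per-sample runtime and no time spent waiting in the queue.
    """
    node_hours = samples * hours_per_sample
    # Samples are processed in "waves" of at most `nodes` samples each.
    waves = math.ceil(samples / nodes)
    return node_hours, waves * hours_per_sample


# 37 h for 2 samples run back to back = 18.5 h per sample.
node_hours, wall_clock = estimate(samples=200, hours_per_sample=37 / 2, nodes=50)
print(node_hours, wall_clock)
```

Under these assumptions the billed node-hours stay the same however many nodes run at once; only the wall-clock time changes. That is why many independent single-node jobs cost no more than one long serial loop.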

Note: You are going to get suggestions to use a "workflow manager" from others. If you are planning to do this regularly then by all means invest the time learning one. Otherwise some bashfu should get this job done.

score 0 · Answer 2 · 2022-08-05


colindaven 6.4k

And here's the first suggestion on workflow managers:

This has all been done before and is actively maintained. This should help with your cluster problems too.

I think this is the more current pipeline for your use case: https://nf-co.re/sarek

It takes some time to get into nf-core pipelines, but on balance it will be 100x quicker than reinventing it all yourself. Once you get used to nf-core, you can use the other high-quality pipelines too for future work.



Thanks for the suggestion!

Reply • 20 months ago by Rebeka ▴ 10

score 0 · Answer 3 · 2022-08-05

First off, there is no GPU support in bwa-mem2, nor support for multiple nodes per sample, so you can drop any thoughts of that. Each sample has to run on one node, but multiple samples per node are possible.

Data at that scale benefits from a workflow manager such as Nextflow, which will also automate submission of jobs to a cluster (e.g. via SLURM or whatever scheduler your HPC uses) and provide caching to resume a run if something goes wrong with some samples. In general, adding more cores for alignment has little additional benefit past some point, as I/O bottlenecks kick in. I would probably run 20-core jobs, which allows two samples at a time per node. Now think about whether you want to align first in a separate process and sort the samples in a second round, or pipe the SAM output from bwa directly into a sort tool such as samtools sort. I usually do the latter, as it saves you from large intermediate files. A configuration could be 16 threads for bwa and 4 for samtools sort, 20 in total, so two jobs per node, with memory for the sorting as available. Check the -m option of the sort command.
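A minimal Python sketch of that 16 + 4 split, building the piped command as a string rather than running it. Note that `-m` in samtools sort is a per-thread limit, so the budget has to be divided by the number of sort threads. The 90 GB per job and the 40 GB set aside for bwa-mem2 are assumed placeholder figures, not measurements; check the real memory use on your own data. File names and helper names are hypothetical.

```python
def sort_mem_per_thread_gb(job_mem_gb, bwa_reserve_gb, sort_threads):
    """Whole GB per samtools sort thread after reserving memory for bwa-mem2."""
    usable = job_mem_gb - bwa_reserve_gb
    if usable <= 0:
        raise ValueError("no memory left for sorting")
    return max(1, usable // sort_threads)


def align_and_sort_command(sample, r1, r2, reference="ref.fa",
                           bwa_threads=16, sort_threads=4, sort_mem_gb=4):
    """Build (not run) a pipe from bwa-mem2 into samtools sort, with no SAM on disk."""
    return (
        f"bwa-mem2 mem -t {bwa_threads} {reference} {r1} {r2} | "
        f"samtools sort -@ {sort_threads} -m {sort_mem_gb}G "
        f"-o {sample}.sorted.bam -"
    )


# Assumed figures: 90 GB per job (two jobs on a node), 40 GB reserved for bwa-mem2.
mem = sort_mem_per_thread_gb(job_mem_gb=90, bwa_reserve_gb=40, sort_threads=4)
print(align_and_sort_command("S1", "S1_R1.fastq.gz", "S1_R2.fastq.gz", sort_mem_gb=mem))
```

The trailing `-` tells samtools sort to read from standard input, which is what makes the pipe work.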

Definitely don't do loops or batches by hand. Let Nextflow (or similar tools like Snakemake), or at least the HPC scheduler, do this. Just submit everything, and let the HPC rules (maximum number of jobs per user and number of idle nodes) take care of the rest.
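If you go the plain-scheduler route, "submit everything" could look like a single job array. The Python sketch below only generates the text of such a script. It assumes the scheduler is SLURM (not confirmed in the question) and a hypothetical tab-separated manifest `samples.tsv` with one `sample<TAB>R1<TAB>R2` line per sample; the resource numbers, `ref.fa` and the `%40` throttle are placeholders to adjust to your site's limits.

```python
def sbatch_array_script(n_samples, manifest="samples.tsv", bwa_threads=16,
                        sort_threads=4, mem_gb=90, max_running=40, hours=48):
    """Return the text of a SLURM job-array script, one array task per sample."""
    cpus = bwa_threads + sort_threads
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=bwa-mem2",
        "#SBATCH --nodes=1",
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={hours}:00:00",
        # '%N' caps how many array tasks run at the same time.
        f"#SBATCH --array=0-{n_samples - 1}%{max_running}",
        "",
        "# Pick this task's line from the manifest (task 0 = line 1).",
        f'LINE=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" {manifest})',
        'read -r SAMPLE R1 R2 <<< "$LINE"',
        f'bwa-mem2 mem -t {bwa_threads} ref.fa "$R1" "$R2" | '
        f'samtools sort -@ {sort_threads} -m 4G -o "$SAMPLE.sorted.bam" -',
        "",
    ]
    return "\n".join(lines)


print(sbatch_array_script(200))
```

Saving the output to a file and submitting it once with `sbatch` queues all 200 samples; the scheduler then starts tasks as nodes become free, up to the throttle.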

Depending on what the analysis goal is you could use existing workflows such as https://nf-co.re/sarek. Google for end-to-end workflows, they exist.