I would like to run alignment on whole-genome sequenced human samples with BWA-MEM2 (version bwa-mem2/2.2.1) on a supercomputer. The data are paired-end, short-read: 200 samples (400 FASTQ files) totalling 18 TB (~45 GB per file on average).
On the Supercomputer I have access to:
- 692 thin nodes (40 cores, 192 GB RAM)
- 55 fat nodes (40 cores, 1.5 TB RAM)
- 40 GPU nodes (40 cores, 192 GB RAM, NVIDIA V100)
- How could the job be parallelized best to achieve higher performance and lower runtime? (Accounting happens on a node-hour basis.)
For example: A) running a single big job across multiple nodes, or B) submitting the samples in batches as many independent jobs (each on a single node) running in parallel?
- Is BWA-MEM2 suitable for running on GPUs or across multiple nodes?
The first thing I did was write a for loop that took the two FASTQ files of one sample at a time, aligned them, and then moved on to the next sample, and so on. I ran 2 samples (194 GB) on a single GPU node with 40 cores and 180 GB RAM; it took 37 hours.
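For reference, the loop looked roughly like this (the reference path, FASTQ naming pattern, and output handling are placeholders, not my exact setup):

```shell
#!/usr/bin/env bash
# Sketch of the serial per-sample loop described above.
# Assumes FASTQs are named fastq/<sample>_R1.fastq.gz / _R2.fastq.gz
# and the reference was indexed beforehand with `bwa-mem2 index`.
set -u
shopt -s nullglob   # if no FASTQs match, the loop body is simply skipped

REF=ref/GRCh38.fa   # placeholder reference path
THREADS=40          # one full node

for R1 in fastq/*_R1.fastq.gz; do
    R2=${R1/_R1/_R2}                          # mate file of the pair
    SAMPLE=$(basename "$R1" _R1.fastq.gz)     # sample name from filename
    bwa-mem2 mem -t "$THREADS" "$REF" "$R1" "$R2" > "sam/${SAMPLE}.sam"
done
```

Each iteration uses all 40 cores of the node, but the samples still run strictly one after the other.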
Then I increased the number of nodes to 4 (including 1 GPU node; 40 cores and 180 GB RAM each), but there was no meaningful difference in runtime. I would have expected at least a halved runtime, though I might be wrong about this. Is the number of nodes supposed to affect (optimally, decrease) the runtime of a job?
Now I am thinking of submitting the samples in batches, each batch going to a full node and using all 40 cores at a time.
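Concretely, my current plan would be something like the following SLURM array job, one sample per node (partition/account flags, time limit, paths, the `samples.txt` listing, and the `%50` throttle are all placeholders I made up, so feel free to correct the approach):

```shell
#!/usr/bin/env bash
#SBATCH --job-name=bwa-mem2-array
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#SBATCH --array=1-200%50      # 200 samples; at most 50 nodes busy at once

# samples.txt holds one sample name per line; FASTQs are assumed to be
# named fastq/<sample>_R1.fastq.gz and fastq/<sample>_R2.fastq.gz.
TASK=${SLURM_ARRAY_TASK_ID:-1}    # task index; defaults to 1 outside SLURM

if [ -s samples.txt ]; then
    SAMPLE=$(sed -n "${TASK}p" samples.txt)   # pick the N-th sample
    bwa-mem2 mem -t 40 ref/GRCh38.fa \
        "fastq/${SAMPLE}_R1.fastq.gz" "fastq/${SAMPLE}_R2.fastq.gz" \
        > "sam/${SAMPLE}.sam"
fi
```

The idea is that the scheduler then fills as many thin nodes as it can with independent single-sample jobs, instead of one job waiting for many nodes.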
And yes, I am not that experienced with using resources on supercomputers, so any help and advice are appreciated. :)) Also, if I have missed some information, let me know.
Many thanks and have a nice day! Rebeka