Optimize bwa mem in parallel on a SLURM cluster
2
Entering edit mode
4.2 years ago
CaffeSospeso ▴ 50

Hello,

I would like to understand the best way of running bwa in parallel on a SLURM cluster. Obviously, this will depend on the computational limits I have as a user.

The bwa software has a "-t" argument specifying the number of threads. If I run bwa mem -t 3 ref.fa sampleA.fq.gz, bwa will split the work across three threads, aligning batches of reads in parallel (I guess).

Now, if I want to run this command on several samples on a SLURM cluster, should I set the number of tasks to match the bwa mem thread count, and also specify the number of CPUs per task (for instance, 2)? That would be:

sbatch -c 2 -n 3 bwa.sh


where bwa.sh contains:

cat data.info | while read indv; do
    bwa mem -t 3 ref.fa sample${indv}.fq.gz
done

Do you have any suggestions? Or can you improve/correct my reasoning?

alignment genome • 5.0k views

ADD COMMENT

9
Entering edit mode
4.2 years ago

One good idea is to use a job array that takes a sample sheet as input, where each line specifies one sample. Then, using the SLURM_ARRAY_TASK_ID variable in the SLURM bash script, you can access the corresponding line of the sample sheet and launch one bwa mem instance with the number of CPUs specified in the SLURM parameters. Something like this, for 5 CPUs, 20G of RAM per CPU, and 10 samples (you may want to tune the amount of RAM requested depending on the reference genome):

#!/bin/bash
#SBATCH --nodes=1 --ntasks-per-node=5 --mem-per-cpu=20G
#SBATCH --array=1-10
#SBATCH -J jobName

samplesheet="sample_sheet.txt"
threads=$SLURM_JOB_CPUS_PER_NODE

r1=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $1}') # fastq.gz file
bwa mem -t $threads ref.fa $r1

A very simple sample_sheet.txt:

sample_1_R1.fastq.gz
sample_2_R1.fastq.gz
sample_3_R1.fastq.gz
sample_4_R1.fastq.gz

You may change it to something more detailed. Here, for example, it is paired-end data, and column 1 contains the name of the sample:

sample_1 sample_1_R1.fastq.gz sample_1_R2.fastq.gz
sample_2 sample_2_R1.fastq.gz sample_2_R2.fastq.gz
sample_3 sample_3_R1.fastq.gz sample_3_R2.fastq.gz
sample_4 sample_4_R1.fastq.gz sample_4_R2.fastq.gz

Then, within the script, you can change the line:

r1=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $1}') # fastq.gz file


to

name=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $1}')
r1=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $2}')
r2=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $3}')

ADD COMMENT

0
Entering edit mode

Nicolas Rosewick : It would be useful to show a small example of what the sample_sheet.txt file should look like (for PE data). Also, for human-sized data 10G RAM is not sufficient, so it would be best to make a note of that.

ADD REPLY

0
Entering edit mode

Indeed, 10GB is maybe too small; it was just for the example ;) I have also added an example of a sample sheet.

ADD REPLY

0
Entering edit mode

Ok, this is a good solution. But imagine that the script also has a command line to index the genome. Could I run the array from inside the script, rather than from outside as you are suggesting? I hope my question is clear.

ADD REPLY

0
Entering edit mode

You need to index the genome only once, so that can be a single one-time job.

ADD REPLY

0
Entering edit mode

Thank you. However, it is still unclear to me where you use the information in "sample_sheet.txt". In addition, are $SLURM_ARRAY_TASK_ID and the file variables all defined externally to the script?

0
Entering edit mode

$SLURM_ARRAY_TASK_ID is a variable that reports the job array index. Check here for information concerning job arrays: https://slurm.schedmd.com/job_array.html .
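To see how the lookup works, you can mimic a single array task locally by exporting the variable yourself (this is just a simulation for testing; on the cluster, SLURM sets SLURM_ARRAY_TASK_ID automatically, and the sample sheet content here is the paired-end example from the answer):

```shell
#!/bin/bash
# Build the example paired-end sample sheet from the answer above.
printf 'sample_1 sample_1_R1.fastq.gz sample_1_R2.fastq.gz\nsample_2 sample_2_R1.fastq.gz sample_2_R2.fastq.gz\n' > sample_sheet.txt

# Pretend we are array task 2 (normally set by SLURM, not by hand).
SLURM_ARRAY_TASK_ID=2

# sed -n "Np" prints only line N; awk then picks the column we want.
name=$(sed -n "$SLURM_ARRAY_TASK_ID"p sample_sheet.txt | awk '{print $1}')
r1=$(sed -n "$SLURM_ARRAY_TASK_ID"p sample_sheet.txt | awk '{print $2}')

echo "$name $r1"  # prints: sample_2 sample_2_R1.fastq.gz
```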

0
Entering edit mode

Ok, this is clear. But in your script, you do not seem to use the "samplesheet" variable. Or maybe I did not understand.

0
Entering edit mode

Indeed, my bad; it was a copy/paste error. I have fixed it now.
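For reference, putting the answer's pieces together, a complete paired-end job-array script might look like the following sketch. The output redirection to a per-sample SAM file is an assumption (the thread never specifies where the alignments go), and it assumes ref.fa has already been indexed once with bwa index:

```shell
#!/bin/bash
#SBATCH --nodes=1 --ntasks-per-node=5 --mem-per-cpu=20G
#SBATCH --array=1-10
#SBATCH -J bwaArray

# Sketch of the paired-end variant discussed above.
# Assumes sample_sheet.txt has three columns: name, R1 fastq, R2 fastq,
# and that "bwa index ref.fa" was already run as a one-time job.
samplesheet="sample_sheet.txt"
threads=$SLURM_JOB_CPUS_PER_NODE

# Each array task reads its own line of the sample sheet.
name=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $1}')
r1=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $2}')
r2=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $3}')

# One bwa mem instance per array task, using all CPUs of that task.
bwa mem -t $threads ref.fa $r1 $r2 > "$name".sam
```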