Optimize bwa mem in parallel on a SLURM cluster
2
2
Entering edit mode
6.0 years ago
CaffeSospeso ▴ 50

Hello,

I would like to understand what is the best way of using bwa in parallel in a SLURM cluster. Obviously, this will depend on the computational limits that I have as user.

bwa software has an argument "-t" specifying the number of threads. Let's imagine that I use bwa mem -t 3 ref.fa sampleA.fq.gz, this will mean that bwa split the job on three tasks/threads. In other words, it will align three reads at a time in parallel (I guess).

Now, if I want to run this command on several samples and in a SLURM cluster, Shall I specify the number of tasks as for bwa mem, and specify the number of CPUs per task(for instance 2)? Which would be:

sbatch -c 2 -n 3 bwa.sh

where bwa.sh containes:

cat data.info | while read indv; do
bwa mem -t 3 ref.fa sample${indv}.fq.gz
done

Do you have any suggestion? Or can you improve/correct my reasoning?

alignment genome • 7.0k views
ADD COMMENT
9
Entering edit mode
6.0 years ago

One good idea is to use a job array that takes a samplesheet in input. Each line specifies one sample. Then using SLURM_ARRAY_TASK_ID variable in the slurm bash script you can acces the respective line in your sample sheet and execute a bwa mem instance with the number of CPU specified in the slurm parameters.

something like for 5 CPU - 20G RAM and 10 samples. You may want to tune the size of the RAM required depending on the reference genome.

#!/bin/bash
#SBATCH --nodes=1 --ntasks-per-node=5 --mem-per-cpu=20G
#SBATCH --array=1-10
#SBATCH -J jobName


samplesheet="sample_sheet.txt"

threads=$SLURM_JOB_CPUS_PER_NODE

r1=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $1}'` # fastq.gz file

bwa mem -t $threads ref.fa $r1

very simple sample_sheet.txt :

sample_1_R1.fastq.gz
sample_2_R1.fastq.gz
sample_3_R1.fastq.gz
sample_4_R1.fastq.gz

You may change it something more detailed. Here for example it's paired-end data. And column 1 contains the name of the sample.

sample_1    sample_1_R1.fastq.gz    sample_1_R2.fastq.gz
sample_2    sample_2_R1.fastq.gz    sample_2_R2.fastq.gz
sample_3    sample_3_R1.fastq.gz    sample_3_R2.fastq.gz
sample_4    sample_4_R1.fastq.gz    sample_4_R2.fastq.gz

Then within the script you can change the section :

r1=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $1}'` # fastq.gz file

to

name=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $1}'` 
r1=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $2}'` 
r2=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $3}'`
ADD COMMENT
0
Entering edit mode

Nicolas Rosewick : It would be useful to show a small example of what sample_sheet.txt file should look like (for PE data). Also for human sized data 10G RAM is not sufficient so it would be best to make a note of that.

ADD REPLY
0
Entering edit mode

indeed 10GB is maybe too small. It was for the example ;) I added an example of sample sheet also.

ADD REPLY
0
Entering edit mode

Ok, this is a good solution. But imagine that the script has also a command line to index the genome. Could I run the array from inside the script? And not from outside as you are suggesting. I hope it is clear my question.

ADD REPLY
0
Entering edit mode

You need to index the genome only once so it can be a single one-time job.

ADD REPLY
0
Entering edit mode

Thank you, however, it is still unclear to me where do you use the information on the "sample_sheet.txt". In addition, $SLURM_ARRAY_TASK_ID, files are all variable that you define externally of the script?

ADD REPLY
0
Entering edit mode

$SLURM_ARRAY_TASK_ID is a variable that report the job array ID. check here for information concerning job arrays : https://slurm.schedmd.com/job_array.html .

ADD REPLY
0
Entering edit mode

Ok, this is clear. But in your script, you do not seem to use the "samplesheet" variable. Or may be I did not understand.

ADD REPLY
0
Entering edit mode

Indeed my bad. copy/paste error. I solved it now

ADD REPLY

Login before adding your answer.

Traffic: 2332 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6