Optimize bwa mem in parallel on a SLURM cluster
2
Entering edit mode
4.2 years ago
CaffeSospeso ▴ 50

Hello,

I would like to understand the best way of running bwa in parallel on a SLURM cluster. Obviously, this will depend on the computational limits I have as a user.

The bwa software has a "-t" argument specifying the number of threads. If I run bwa mem -t 3 ref.fa sampleA.fq.gz, bwa will split the work across three threads, aligning batches of reads in parallel (I guess).

Now, if I want to run this command on several samples on a SLURM cluster, should I set the number of tasks to match the bwa mem thread count, and also specify the number of CPUs per task (for instance, 2)? That would be:

sbatch -c 2 -n 3 bwa.sh


where bwa.sh contains:

cat data.info | while read indv; do
    bwa mem -t 3 ref.fa sample${indv}.fq.gz
done

Do you have any suggestions? Or can you improve/correct my reasoning?

alignment genome • 5.0k views

ADD COMMENT

9
Entering edit mode
4.2 years ago

One good idea is to use a job array that takes a sample sheet as input, where each line specifies one sample. Then, using the SLURM_ARRAY_TASK_ID variable in the SLURM bash script, you can access the corresponding line of the sample sheet and launch one bwa mem instance with the number of CPUs specified in the SLURM parameters. Something like this, for 5 CPUs, 20G of RAM per CPU, and 10 samples (you may want to tune the amount of RAM requested depending on the reference genome):

#!/bin/bash
#SBATCH --nodes=1 --ntasks-per-node=5 --mem-per-cpu=20G
#SBATCH --array=1-10
#SBATCH -J jobName

samplesheet="sample_sheet.txt"
threads=$SLURM_JOB_CPUS_PER_NODE

r1=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $1}') # fastq.gz file
bwa mem -t $threads ref.fa $r1

A very simple sample_sheet.txt:

sample_1_R1.fastq.gz
sample_2_R1.fastq.gz
sample_3_R1.fastq.gz
sample_4_R1.fastq.gz

You may change it to something more detailed. Here, for example, it is paired-end data, and column 1 contains the name of the sample:

sample_1 sample_1_R1.fastq.gz sample_1_R2.fastq.gz
sample_2 sample_2_R1.fastq.gz sample_2_R2.fastq.gz
sample_3 sample_3_R1.fastq.gz sample_3_R2.fastq.gz
sample_4 sample_4_R1.fastq.gz sample_4_R2.fastq.gz

Then, within the script, you can change the line:

r1=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $1}') # fastq.gz file


to

name=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $1}')
r1=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $2}')
r2=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $3}')

ADD COMMENT

0
Entering edit mode

Nicolas Rosewick : It would be useful to show a small example of what the sample_sheet.txt file should look like (for PE data). Also, for human-sized data 10G RAM is not sufficient, so it would be best to make a note of that.

ADD REPLY

0
Entering edit mode

Indeed, 10GB is maybe too small; it was just for the example ;) I have also added an example of a sample sheet.

ADD REPLY

0
Entering edit mode

Ok, this is a good solution. But imagine that the script also has a command line to index the genome. Could I run the array from inside the script, rather than from outside as you are suggesting? I hope my question is clear.

ADD REPLY

0
Entering edit mode

You need to index the genome only once, so that can be a single one-time job.

ADD REPLY

0
Entering edit mode

Thank you. However, it is still unclear to me where you use the information in "sample_sheet.txt". In addition, are $SLURM_ARRAY_TASK_ID and the file variables all defined externally to the script?

0
Entering edit mode

$SLURM_ARRAY_TASK_ID is a variable that reports the job array index. Check here for information concerning job arrays: https://slurm.schedmd.com/job_array.html .
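To see how the lookup works, you can mimic a single array task locally by exporting the variable yourself (this is just a simulation for testing; on the cluster, SLURM sets SLURM_ARRAY_TASK_ID automatically, and the sample sheet content here is the paired-end example from the answer):

```shell
#!/bin/bash
# Build the example paired-end sample sheet from the answer above.
printf 'sample_1 sample_1_R1.fastq.gz sample_1_R2.fastq.gz\nsample_2 sample_2_R1.fastq.gz sample_2_R2.fastq.gz\n' > sample_sheet.txt

# Pretend we are array task 2 (normally set by SLURM, not by hand).
SLURM_ARRAY_TASK_ID=2

# sed -n "Np" prints only line N; awk then picks the column we want.
name=$(sed -n "$SLURM_ARRAY_TASK_ID"p sample_sheet.txt | awk '{print $1}')
r1=$(sed -n "$SLURM_ARRAY_TASK_ID"p sample_sheet.txt | awk '{print $2}')

echo "$name $r1"  # prints: sample_2 sample_2_R1.fastq.gz
```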

0
Entering edit mode

Ok, this is clear. But in your script, you do not seem to use the "samplesheet" variable. Or maybe I did not understand.

0
Entering edit mode

Indeed, my bad; it was a copy/paste error. I have fixed it now.
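For reference, putting the answer's pieces together, a complete paired-end job-array script might look like the following sketch. The output redirection to a per-sample SAM file is an assumption (the thread never specifies where the alignments go), and it assumes ref.fa has already been indexed once with bwa index:

```shell
#!/bin/bash
#SBATCH --nodes=1 --ntasks-per-node=5 --mem-per-cpu=20G
#SBATCH --array=1-10
#SBATCH -J bwaArray

# Sketch of the paired-end variant discussed above.
# Assumes sample_sheet.txt has three columns: name, R1 fastq, R2 fastq,
# and that "bwa index ref.fa" was already run as a one-time job.
samplesheet="sample_sheet.txt"
threads=$SLURM_JOB_CPUS_PER_NODE

# Each array task reads its own line of the sample sheet.
name=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $1}')
r1=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $2}')
r2=$(sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $3}')

# One bwa mem instance per array task, using all CPUs of that task.
bwa mem -t $threads ref.fa $r1 $r2 > "$name".sam
```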