Question: Optimize bwa mem in parallel on a SLURM cluster
CaffeSospeso wrote:

Hello,

I would like to understand the best way of using bwa in parallel on a SLURM cluster. Obviously, this will depend on the computational limits I have as a user.

The bwa software has a "-t" argument specifying the number of threads. Let's imagine I run bwa mem -t 3 ref.fa sampleA.fq.gz; this means bwa splits the work across three tasks/threads. In other words, it will align three reads at a time in parallel (I guess).

Now, if I want to run this command on several samples on a SLURM cluster, shall I specify the number of tasks to match bwa mem, and also specify the number of CPUs per task (for instance 2)? That would be:

sbatch -c 2 -n 3 bwa.sh

where bwa.sh contains:

cat data.info | while read indv; do
bwa mem -t 3 ref.fa sample${indv}.fq.gz
done

Do you have any suggestions? Can you improve or correct my reasoning?


Don't forget to follow up on your threads.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if several were helpful.

— genomax

Yes, thank you for the suggestion. I'm testing Nicolas Rosewick's reply, and then I will vote.

— CaffeSospeso
Nicolas Rosewick wrote:

A good approach is to use a job array that takes a sample sheet as input, where each line specifies one sample. Using the SLURM_ARRAY_TASK_ID variable in the SLURM bash script, you can then access the corresponding line of your sample sheet and launch one bwa mem instance with the number of CPUs specified in the SLURM parameters.

Something like the following, for 5 CPUs, 20 GB of RAM per CPU, and 10 samples. You may want to tune the amount of RAM depending on the reference genome.

#!/bin/bash
#SBATCH --nodes=1 --ntasks-per-node=5 --mem-per-cpu=20G
#SBATCH --array=1-10
#SBATCH -J jobName


samplesheet="sample_sheet.txt"

# number of CPUs allocated to this job on the node
threads=$SLURM_JOB_CPUS_PER_NODE

# pick the line of the sample sheet that matches this array task
r1=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $1}'` # fastq.gz file

bwa mem -t $threads ref.fa $r1

A very simple sample_sheet.txt:

sample_1_R1.fastq.gz
sample_2_R1.fastq.gz
sample_3_R1.fastq.gz
sample_4_R1.fastq.gz

You may change it to something more detailed. Here, for example, it's paired-end data, and column 1 contains the sample name.

sample_1    sample_1_R1.fastq.gz    sample_1_R2.fastq.gz
sample_2    sample_2_R1.fastq.gz    sample_2_R2.fastq.gz
sample_3    sample_3_R1.fastq.gz    sample_3_R2.fastq.gz
sample_4    sample_4_R1.fastq.gz    sample_4_R2.fastq.gz

Then, within the script, you can change this section:

r1=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $1}'` # fastq.gz file

to

name=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $1}'` 
r1=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $2}'` 
r2=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $3}'`
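
With these three variables, the bwa mem call can then use both mates. A minimal sketch of that line, assuming samtools is available and that a read group and a sorted BAM named after the sample are wanted (neither is shown in the script above):

# align both mates, tag the reads with a read group built from the sample name,
# and pipe into samtools for coordinate sorting
bwa mem -t $threads -R "@RG\tID:${name}\tSM:${name}" ref.fa $r1 $r2 | samtools sort -o ${name}_sorted.bam

If the script is saved as, say, bwa_array.sh, submitting it with sbatch bwa_array.sh starts one array task per line of the sample sheet.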
— Nicolas Rosewick

Nicolas Rosewick: It would be useful to show a small example of what the sample_sheet.txt file should look like (for PE data). Also, for human-sized data 10 GB of RAM is not sufficient, so it would be best to make a note of that.

— genomax

Indeed, 10 GB may be too small; it was just for the example ;) I also added an example sample sheet.

— Nicolas Rosewick

Ok, this is a good solution. But imagine that the script also has a command to index the genome. Could I run the array from inside the script, rather than from outside as you suggest? I hope my question is clear.

— CaffeSospeso

You need to index the genome only once, so it can be a single one-time job.
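
For example, a minimal one-time indexing job could look like the sketch below (the script names index.sh and bwa_array.sh and the 16 GB of memory are placeholders; adjust for your genome), and the array job can be held back until indexing succeeds using sbatch's --dependency option:

#!/bin/bash
#SBATCH --job-name=bwa_index
#SBATCH --nodes=1 --ntasks-per-node=1 --mem=16G

# build the BWA index once; bwa index runs single-threaded
bwa index ref.fa

and then:

# submit the indexing job, then start the alignment array only after it finishes successfully
jid=$(sbatch --parsable index.sh)
sbatch --dependency=afterok:$jid bwa_array.sh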

— genomax

Thank you, however it is still unclear to me where you use the information from "sample_sheet.txt". In addition, are $SLURM_ARRAY_TASK_ID and the file names variables that you define outside of the script?

— CaffeSospeso

$SLURM_ARRAY_TASK_ID is a variable that reports the index of the current task within the job array. Check here for information on job arrays: https://slurm.schedmd.com/job_array.html

— Nicolas Rosewick

Ok, this is clear. But in your script, you do not seem to use the "samplesheet" variable. Or maybe I did not understand.

— CaffeSospeso

Indeed, my bad; it was a copy/paste error. I have fixed it now.

— Nicolas Rosewick
ATpoint wrote:

Here is a script for BWA on a SLURM cluster, assuming you use a single node and aim to run parallel jobs on it. Note that, depending on the rules of your cluster, parallel operations may not be allowed, so Nicolas' solution might be preferable. Our cluster rules are pretty lenient, so this one works fine for me.

The bwa mem part itself is wrapped inside the function Fq2Bam, where BWA is set to use 16 threads. This speeds up the alignment of a single fastq file/pair. The function can then be parallelized over multiple input files by passing Fq2Bam to GNU parallel, which runs it on a defined number of files at a time, controlled by parallel's -j option. Make sure that the number of parallel jobs multiplied by the threads per job fits within the total number of cores available on the node.

#!/usr/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=<int>
#SBATCH --partition=<partition>
#SBATCH --time=48:00:00 
#SBATCH --mail-type=ALL
#SBATCH --mail-user=foo@bar.com
#SBATCH --job-name=Align
#########################################################################################

function Fq2Bam {

  # $1 is the shared basename of a fastq pair (<basename>_1.fastq.gz / <basename>_2.fastq.gz)
  BASENAME=$1

  BWA_IDX=/path/to/index

  # align the pair and pipe into samtools for coordinate sorting
  bwa mem -t 16 ${BWA_IDX} ${BASENAME}_1.fastq.gz ${BASENAME}_2.fastq.gz | samtools sort -o ${BASENAME}_sorted.bam

}
#########################################################################################

export -f Fq2Bam

ls *_1.fastq.gz | awk -F "_1.fastq.gz" '{print $1}' | parallel -j <NumberOfParallelJobs> "Fq2Bam {}"
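
If you would rather not hard-code the number of parallel jobs, one option (a sketch, assuming --ntasks-per-node was set to the number of cores you want to use and 16 threads per bwa run as in Fq2Bam above) is to derive -j from the SLURM allocation:

# e.g. --ntasks-per-node=64 with 16 threads per alignment gives -j 4
ls *_1.fastq.gz | awk -F "_1.fastq.gz" '{print $1}' | parallel -j $(( SLURM_NTASKS_PER_NODE / 16 )) "Fq2Bam {}"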
— ATpoint