Question: Optimize bwa mem in parallel on a SLURM cluster
CaffeSospeso wrote, 23 months ago:

Hello,

I would like to understand the best way of running bwa in parallel on a SLURM cluster. Obviously, this will depend on the computational limits I have as a user.

The bwa software has an argument "-t" specifying the number of threads. Let's imagine I run bwa mem -t 3 ref.fa sampleA.fq.gz; this means bwa splits the job across three tasks/threads. In other words, it will align three reads at a time in parallel (I guess).

Now, if I want to run this command on several samples on a SLURM cluster, shall I specify the number of tasks to match bwa mem, and also specify the number of CPUs per task (for instance 2)? That would be:

sbatch -c 2 -n 3 bwa.sh

where bwa.sh contains:

cat data.info | while read indv; do
    bwa mem -t 3 ref.fa sample${indv}.fq.gz
done

Do you have any suggestions? Or can you improve/correct my reasoning?

Nicolas Rosewick (Brussels, Belgium) wrote, 23 months ago:

A good idea is to use a job array that takes a sample sheet as input, where each line specifies one sample. Then, using the SLURM_ARRAY_TASK_ID variable in the SLURM bash script, you can access the respective line of your sample sheet and execute a bwa mem instance with the number of CPUs specified in the SLURM parameters.

Something like the following, for 5 CPUs, 20 GB of RAM per CPU, and 10 samples. You may want to tune the amount of RAM required depending on the reference genome.

#!/bin/bash
#SBATCH --nodes=1 --ntasks-per-node=5 --mem-per-cpu=20G
#SBATCH --array=1-10
#SBATCH -J jobName


samplesheet="sample_sheet.txt"

threads=$SLURM_JOB_CPUS_PER_NODE

r1=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $1}'` # fastq.gz file

bwa mem -t $threads ref.fa $r1

A very simple sample_sheet.txt:

sample_1_R1.fastq.gz
sample_2_R1.fastq.gz
sample_3_R1.fastq.gz
sample_4_R1.fastq.gz

You may change it to something more detailed. Here, for example, it's paired-end data, and column 1 contains the sample name.

sample_1    sample_1_R1.fastq.gz    sample_1_R2.fastq.gz
sample_2    sample_2_R1.fastq.gz    sample_2_R2.fastq.gz
sample_3    sample_3_R1.fastq.gz    sample_3_R2.fastq.gz
sample_4    sample_4_R1.fastq.gz    sample_4_R2.fastq.gz

Then, within the script, you can change this line:

r1=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $1}'` # fastq.gz file

to

name=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $1}'` 
r1=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $2}'` 
r2=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet |  awk '{print $3}'`
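Putting the paired-end pieces together, the alignment step might look like the sketch below. The read-group string (-R), the output naming, and the ref.fa path are illustrative additions, not part of the original answer; the parsing is wrapped in a function only to keep it reusable.

```shell
#!/bin/bash
# Sketch of the paired-end alignment step for the array script above.
# Assumes a three-column sample_sheet.txt: name, R1 fastq, R2 fastq.
align_pair() {
    local samplesheet=$1 row=$2 threads=$3
    local name r1 r2
    name=$(sed -n "${row}p" "$samplesheet" | awk '{print $1}')
    r1=$(sed -n "${row}p" "$samplesheet" | awk '{print $2}')
    r2=$(sed -n "${row}p" "$samplesheet" | awk '{print $3}')
    # -R attaches a read group so the sample stays identifiable downstream
    bwa mem -t "$threads" -R "@RG\tID:${name}\tSM:${name}" \
        ref.fa "$r1" "$r2" > "${name}.sam"
}

# Inside the job script:
# align_pair sample_sheet.txt "$SLURM_ARRAY_TASK_ID" "$SLURM_JOB_CPUS_PER_NODE"
```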

Nicolas Rosewick: It would be useful to show a small example of what the sample_sheet.txt file should look like (for PE data). Also, for human-sized data 10 GB of RAM is not sufficient, so it would be best to make a note of that.

(reply by genomax)

Indeed, 10 GB is maybe too small; it was just for the example ;) I also added an example sample sheet.

(reply by Nicolas Rosewick)

OK, this is a good solution. But imagine that the script also has a command to index the genome. Could I run the array from inside the script, rather than from outside as you are suggesting? I hope my question is clear.

(reply by CaffeSospeso)

You only need to index the genome once, so that can be a single one-time job.

(reply by genomax)
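To make that concrete, one common pattern is to submit the indexing as its own job and let the alignment array wait for it via a job dependency. A sketch, wrapped in a function for convenience; the script names index.sh (which would just run bwa index ref.fa) and bwa_array.sh are hypothetical:

```shell
#!/bin/bash
# Sketch: run indexing once, then start the alignment array only after
# the index job finished successfully (afterok).
submit_pipeline() {
    local jid
    # --parsable makes sbatch print just the job ID
    jid=$(sbatch --parsable index.sh)
    sbatch --dependency=afterok:"$jid" --array=1-10 bwa_array.sh
}
```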

Thank you; however, it is still unclear to me where you use the information in sample_sheet.txt. In addition, is $SLURM_ARRAY_TASK_ID a variable that you define outside of the script?

(reply by CaffeSospeso)

$SLURM_ARRAY_TASK_ID is a variable that reports the job array task ID. Check here for information on job arrays: https://slurm.schedmd.com/job_array.html

(reply by Nicolas Rosewick)
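As a quick local illustration of how the array task ID selects a line from the sample sheet (outside SLURM we set the variable by hand; inside an array job, SLURM sets it to 1, 2, ... automatically):

```shell
#!/bin/bash
# Build a toy sample sheet, then pick one line with the task ID,
# exactly as the sed command in the answer above does.
printf 'sample_1_R1.fastq.gz\nsample_2_R1.fastq.gz\nsample_3_R1.fastq.gz\n' > sample_sheet.txt
SLURM_ARRAY_TASK_ID=2   # set by SLURM inside a real array job
r1=$(sed -n "${SLURM_ARRAY_TASK_ID}p" sample_sheet.txt)
echo "$r1"   # → sample_2_R1.fastq.gz
```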

OK, this is clear. But in your script, you do not seem to use the samplesheet variable. Or maybe I did not understand.

(reply by CaffeSospeso)

Indeed, my bad, a copy/paste error. I have fixed it now.

(reply by Nicolas Rosewick)
ATpoint (Germany) wrote, 23 months ago:

Here is a script for BWA on a SLURM cluster, assuming you use a single node and aim to run parallel jobs on it. Note that depending on the rules of your cluster, parallel operations may not be allowed, in which case Nicolas' solution might be preferable. Our cluster rules are pretty lenient, so this one works fine for me.

The bwa mem part itself is wrapped inside the function Fq2Bam, where BWA is set to use 16 threads. This speeds up the alignment for one fastq file/pair. The function can then be parallelized over multiple input files by passing Fq2Bam to GNU parallel, which runs it on a defined number of files at a time, controlled by parallel's -j parameter. Make sure that the number of parallel jobs fits the total number of cores available on the node.

#!/usr/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=<int>
#SBATCH --partition=<partition>
#SBATCH --time=48:00:00 
#SBATCH --mail-type=ALL
#SBATCH --mail-user=foo@bar.com
#SBATCH --job-name=Align
#########################################################################################

function Fq2Bam {

  BASENAME=$1

  BWA_IDX=/path/to/index

  bwa mem -t 16 ${BWA_IDX} ${BASENAME}_1.fastq.gz ${BASENAME}_2.fastq.gz | samtools sort -o ${BASENAME}_sorted.bam

}
#########################################################################################

export -f Fq2Bam

ls *_1.fastq.gz | awk -F "_1.fastq.gz" '{print $1}' | parallel -j <NumberOfParallelJobs> "Fq2Bam {}"

How would you modify this if you have many reference files to process for many fastq files?

If my reference.fasta files are listed in reflist.txt, and I have another directory with my fastq files to process, how does one instead use parallel to process many reference.fasta files simultaneously against their fastq files (I only have 40 single-end files in that case)?

(reply by DNAngel, 9 months ago)
Powered by Biostar version 2.3.0