How to parallelise a for loop in bash?
22 months ago
Christy ▴ 20

Hi,

I am trying to parallelise a loop in my bash script so that I can index multiple genomes at once rather than sequentially, but I'm really struggling to get it working. I would also like to learn how to do this, as there are many other parts of the pipeline I would like to parallelise.

I am submitting the job to an HPC with a Linux environment that uses the SLURM workload manager. My script is as follows:

#!/bin/bash
#SBATCH --job-name=parallel_indexing
#SBATCH --output=parallel_indexing%j.log

# send email when job begins, ends or aborts
#SBATCH --mail-user=<email> --mail-type=BEGIN,END,ABORT

#request resources
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=500G
#SBATCH --time=12:00:00
#SBATCH --partition=k2-medpri

# set working directory to scratch space project folder
#SBATCH --chdir <dir>

#load modules 
module load <bowtie>

#initiate arrays
GENOMES=(2015_genome.fa 2021_genome.fa liv_genome.fa wash_genome.fa human_genome.fa)
NAMES=(2015_genome 2021_genome liv_genome wash_genome human_genome)

for index in ${!GENOMES[*]}; do
    bowtie-build ${GENOMES[$index]} ${NAMES[$index]} &
done

The job runs for ~4 seconds then exits without error and the log file is completely empty. If anyone has any advice it would be greatly appreciated!

processing parallel slurm loops bash hpc

Use a workflow manager like Snakemake or Nextflow.


An easy option would be to submit several jobs. Create several scripts, one per genome, and submit them to the cluster.


Hi, thank you for the reply - this would work, of course, but this is just one part of a larger pipeline, and for the rest it wouldn't be practical (many more files than just the five here), so I'm really keen to understand how to parallelise if possible!


The simple answer is:

Don't use a loop - use GNU parallel.
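
For illustration (not part of the original comment), with the file names from the question this could look roughly like the following inside the job script; {.} is GNU parallel's placeholder for the input file name with its extension stripped, and the module names are only a guess at what your cluster provides:

module load parallel bowtie   # module names are cluster-specific
# run up to 5 builds at once; match -j to the CPUs/tasks you requested from SLURM
parallel -j 5 'bowtie-build {} {.}' ::: \
    2015_genome.fa 2021_genome.fa liv_genome.fa wash_genome.fa human_genome.fa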

22 months ago
ATpoint 82k

There are multiple ways:

1) Create one submission script per genome, and then submit each. That is the easiest. Either write a skeleton submission script and a bash loop to fill it in for each genome automatically, or just write them by hand. For five genomes that is easy to do. It does not scale, though, so if you had hundreds of values/genomes/elements you would automate it, for example along the lines sketched below.
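
For illustration (not in the original answer), the skeleton approach could look roughly like this. skeleton.sbatch, the placeholders __GENOME__/__NAME__ and the submit_*.sh file names are made-up examples; the skeleton would hold your #SBATCH header plus a line such as bowtie-build __GENOME__ __NAME__:

for FA in 2015_genome.fa 2021_genome.fa liv_genome.fa wash_genome.fa human_genome.fa; do
    NAME=${FA%.fa}    # index prefix = file name without the .fa suffix
    # fill in the skeleton for this genome and write one submission script per genome
    sed -e "s/__GENOME__/$FA/" -e "s/__NAME__/$NAME/" skeleton.sbatch > "submit_${NAME}.sh"
    sbatch "submit_${NAME}.sh"
done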

2) This is probably the preferred way with SLURM alone, without additional software: create a SLURM array. The variable SLURM_ARRAY_TASK_ID takes the values given to --array, and SLURM automatically starts one task per value. It could be something like:

#!/bin/bash
#SBATCH --job-name=arraytest
#SBATCH (fill in all other submission parameters)
#SBATCH --array=0-4

echo "$SLURM_ARRAY_TASK_ID" # this will be 0,1,2,3,4 (bash arrays are zero-indexed)

GENOMES=(2015_genome.fa 2021_genome.fa liv_genome.fa wash_genome.fa human_genome.fa)
NAMES=(2015_genome 2021_genome liv_genome wash_genome human_genome)

bowtie-build "${GENOMES[$SLURM_ARRAY_TASK_ID]}" "${NAMES[$SLURM_ARRAY_TASK_ID]}"

I have not used arrays in years; you will have to check how to set --nodes and --ntasks correctly, I do not remember. Many clusters limit how many array tasks can run in parallel, so if that limit is lower than five you might switch to option 1, provided you are allowed to have five independent jobs running at the same time (which I guess is the case).

3) Wrap this in something like Snakemake or Nextflow, workflow managers that can parallelize these things. This takes time to learn, though, and is rather something (recommended!) for the long term than for a simple job like this.

By the way, you do not strictly need NAMES. You could simply use basename on the GENOMES array values to get exactly the same thing automatically, without the error-prone step of writing the names down manually. That is not necessary for something as simple as this job, but imagine you had hundreds of values in the array; you would not want to write those out by hand.
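
For illustration (not in the original answer), deriving the index name inside the array job could look like this; basename FILE .fa strips both the leading directory and the .fa suffix:

GENOMES=(2015_genome.fa 2021_genome.fa liv_genome.fa wash_genome.fa human_genome.fa)
FA="${GENOMES[$SLURM_ARRAY_TASK_ID]}"
NAME="$(basename "$FA" .fa)"   # e.g. 2015_genome
bowtie-build "$FA" "$NAME"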


Hi - thank you very much for the help, this is exactly what I'm after! And yes, for this job I could submit 5 jobs as you suggest in the first point, but I agree it is better to use scalable methods. Learning how to properly use Nextflow is certainly on my to-do list... I wasn't aware of the basename command - that's really helpful, thank you again!

22 months ago

use a Makefile (note that recipe lines must be indented with a real tab character):

GENOMES=$(addprefix /path/to/dir/containing/,2015_genome.fa 2021_genome.fa liv_genome.fa wash_genome.fa human_genome.fa)

%.1.ebwt: %.fa
    module load bowtie && bowtie-build $< $(basename $<)

all: $(addsuffix .1.ebwt,$(basename $(GENOMES)))

and in your script:

#!/bin/bash
(...)
make -j 4 -f /path/to/Makefile
22 months ago
steve ★ 3.5k

Use Nextflow.

If you are trying to write a script that runs in parallel on the HPC then you are beyond the scope of what you should reasonably be doing in bash alone, and it's time to move up to a real workflow framework.


The advantage of using Nextflow here over a SLURM array is that the array just submits one job per script, so the script itself is still processed serially. Nextflow can parallelize each module definition, so if you have one module for indexing, one for mapping and, say, one for filtering, then these can run in parallel, and that in turn can be parallelized over all samples. It natively supports SLURM and brings other benefits such as Singularity and Docker integration. It is the preferred option, yet has quite a learning curve. More efficient, yes, but the same could be done with a proper bash script in an array-like fashion; it depends how much effort you want or currently can invest for the short and long term. In the end, getting results is more important than maxing out efficiency, even though learning a workflow manager definitely makes sense in the long term.


I totally agree with all of this - I am 100% aiming to get the whole pipeline sorted in Nextflow, but as you say it's a steep learning curve, so I'm just balancing getting what I need now against putting in the time to build it with a workflow manager.


Agreed, it took me weeks to get my head around Nextflow. I feel like producing reliable results comes first and fancy pipelines come second.


if it isn't the King of Nextflow :)

22 months ago

...and here's an example in snakemake:

import re

GENOMES= {'2015_genome': '2015_genome.fa', 
          '2021_genome': '2021_genome.fa'}

wildcard_constraints:
    name = '|'.join([re.escape(x) for x in GENOMES.keys()])

rule all:
    input:
        expand('{name}.1.ebwt', name= GENOMES.keys()),


rule bowtie_build:
    input:
        genome= lambda wildcards: GENOMES[wildcards.name], 
    output:
        '{name}.1.ebwt',
    shell:
        r"""
        bowtie-build {input.genome} {wildcards.name}
        """

save it as Snakefile and run it with something like the following (-n does a dry run; drop it to actually submit):

snakemake --cluster "sbatch --ntasks=5 --cpus-per-task=1 etc..." -p -n --jobs 10

I would do things slightly differently (for example I wouldn't hard-code filenames in the Snakefile) but this should give an idea. It's not trivial but it's worth it.

Another advantage of snakemake (and I guess nextflow) is that it submits jobs to the cluster in the right order and at the right time. Presumably after indexing there will be an alignment step. Snakemake will submit the alignment jobs as soon as the indexing is done, without exceeding the resources you allocate.
