How to parallelize bwa mem for single-end data within a larger bash script?
4.7 years ago
DNAngel ▴ 250

I've seen some posts here about parallelizing bwa mem, but all the examples were for paired-end data and the code was standalone (i.e. they only needed it for bwa mem and maybe converting SAM to BAM). I also have to run my script against 1000s of different reference files, and there are many steps within the script. My large bash script, which I run on SLURM, processes 40 single-end fastq files sequentially, with various conversions and other scripts being called, so I want to use GNU parallel. Would I need to parallelize each command block? Here is my attempt at putting the first few commands into a function and running parallel on it, but I'm getting confused about how to actually write out the parallel call:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=96
#SBATCH --time=05:00:00


module load gcc
module load samtools
module load bwa
module load gnu-parallel # new module I want to use to speed up my jobs

REF=$1  # fasta reference file; I have 1000s for various tests, so the IDs are saved in a reflist.txt file
GENE=$(basename "$REF")  # just the reference name without the path

function BWAMEM {
# First, index the reference file
if [ -f "$REF.bwt" ]; then
    echo "$REF already indexed, skipping"
else
    echo "Creating index for $REF"
    bwa index "$REF"
fi

# BWA MEM for each fastq file in the cleaned directory
for files in ../cleaned/*_cleaned.fastq; do
    out=$(basename "${files%%_cleaned.fastq}")_${GENE}_aln.sam
    if [ -f "$out" ]; then
        echo "BWA MEM already done for $files, skipping"
    else
        echo "Running BWA MEM for $files"
        bwa mem -t 20 "$REF" "$files" > "$out" && \
        mv "$out" "$path"   # $path: destination directory
    fi
    sleep 10
done
}
export -f BWAMEM
parallel -j 20 BWAMEM ### can I leave it like this? Does it know to access the cleaned fastq files?

# next set of tasks
# final set of tasks

So far I have stopped here because I am unsure whether I even need to keep the for loop in my BWAMEM function, since I need to supply `::: something` to parallel. Would I instead get rid of the for loop and write:

parallel -j 20 BWAMEM ::: ../cleaned/*_cleaned.fastq

If I have 40 fastq files, then this parallel command should run 20 of them at a time while bwa mem also uses 20 cores (for a total of 40 cores, which is what the server allows), and then run the rest of my script, which is mostly conversions and cleanup of the BAM files that get generated. I was also wondering whether all those conversion steps (some take a long time too) need to be parallelized? And I am supposed to supply the list of reference files somewhere, because I do want to run this for 1000s of reference fasta files for various tests/tasks.
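One thing worth double-checking in that core budget: with GNU parallel, the `-j` job count and bwa mem's `-t` thread count multiply, so `-j 20` with `-t 20` requests 400 threads, not 40. A quick sanity check (illustrative numbers only, not taken from the script):

```shell
# jobs started by parallel (-j) and threads per bwa mem call (-t) multiply
JOBS=20
THREADS_PER_JOB=20
echo "total threads: $(( JOBS * THREADS_PER_JOB ))"   # 400

# to stay within a 40-core allocation, the product must be <= 40, e.g.:
JOBS=8
THREADS_PER_JOB=5
echo "total threads: $(( JOBS * THREADS_PER_JOB ))"   # 40
```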

So maybe I should keep the for loop that goes through all the cleaned fastq files, and make the final parallel command:

parallel -j 20 BWAMEM ::: $(sed -e 's/\.fa//g' reflist.txt)
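A minimal pure-bash sketch of what `$(sed -e 's/\.fa//g' reflist.txt)` would expand to as parallel arguments (note the command substitution; the reflist.txt contents here are made up):

```shell
# hypothetical reflist.txt; the real one already exists with 1000s of entries
printf 'geneA.fa\ngeneB.fa\ngeneC.fa\n' > reflist.txt

# each stripped name becomes one argument, i.e. one BWAMEM invocation
for ref in $(sed -e 's/\.fa//g' reflist.txt); do
    echo "BWAMEM would run with REF=$ref"
done
```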

Your script needs improvement: the point of parallel is to replace the for loop entirely. List all relevant fastq files first, then pipe that list into a parallel command that runs the bwa operations on them in parallel. Right now you are supplying the files both to the parallel command and via the for loop, which may conflict. See the linked thread for inspiration. By the way, if you are building from scratch, consider learning how to do these kinds of workflows with a manager such as snakemake.

A: Optimize bwa mem in parallel on a SLURM cluster
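The "list files first, then hand the list to parallel" pattern the comment describes can be sketched in pure bash (the `while` loop stands in for the parallel call; the file names are made up):

```shell
mkdir -p cleaned
touch cleaned/s1_cleaned.fastq cleaned/s2_cleaned.fastq   # stand-in inputs

# list first, then fan out; with GNU parallel the while loop would become:
#   find cleaned -name '*_cleaned.fastq' | parallel -j 20 BWAMEM {}
find cleaned -name '*_cleaned.fastq' | sort | while read -r fq; do
    echo "would run: bwa mem REF $fq"
done
```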


I know, but because I have multiple fastq files to process for 1000s of ref genes (the link you provided, and others I've found here, don't have this problem) I am unsure which part of my script to change. Do I use parallel to call the individual REF files, of which I have thousands, and in that case keep the for loop to go through the 40 fastq files... OR do I use parallel to run 20 fastq files at a time, but somehow execute it 1000s of times, once per REF file? To me the former makes more sense, solely because there are 1000s of ref files for testing, but I've only seen parallel bwa mem examples where they feed in the fastq files. Hence my confusion :(
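For the refs-vs-fastqs dilemma: GNU parallel can take both lists at once and form the cross product (e.g. `parallel BWAMEM {1} {2} :::: reflist.txt ::: ../cleaned/*_cleaned.fastq`), so neither loop has to survive. A pure-bash sketch of the job list this would generate, with made-up names:

```shell
refs=(geneA geneB)                                           # stand-ins for 1000s of refs
fastqs=(s1_cleaned.fastq s2_cleaned.fastq s3_cleaned.fastq)  # stand-ins for 40 fastqs

# equivalent of: parallel BWAMEM {1} {2} ::: "${refs[@]}" ::: "${fastqs[@]}"
njobs=0
for r in "${refs[@]}"; do
    for q in "${fastqs[@]}"; do
        echo "BWAMEM $r $q"
        njobs=$(( njobs + 1 ))
    done
done
echo "jobs: $njobs"   # 2 refs x 3 fastqs = 6
```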

4.7 years ago

You should use a workflow manager like nextflow or snakemake.

Nevertheless, here is a solution using make (parallelize with the -j option):

REFS=$(shell find /home/lindenb/src/jvarkit-git/src/test/resources/ -type f -name "*.fa" -o -name "*.fasta" )
FASTQS=$(shell find /home/lindenb/src/jvarkit-git/src/test/resources/ -type f -name "*.fastq.gz" -o -name "*.fq.gz")

define run1

$(1).bwt: $(1)
	bwa index $$<

endef

define run2

$(lastword $(subst /, ,$(dir $(2))))_$(notdir $(1))_$(notdir $(2)).sam: $(1).bwt $(2)
	bwa mem $(1) $(2) > $$@

endef

all: $(foreach R,$(REFS),$(foreach Q,$(FASTQS),$(lastword $(subst /, ,$(dir $Q)))_$(notdir $R)_$(notdir $Q).sam))

$(eval $(foreach R,$(REFS),$(call run1,$R)))
$(eval $(foreach R,$(REFS),$(foreach Q,$(FASTQS),$(call run2,$R,$Q))))
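With those rules in a Makefile, the fan-out happens through make's own scheduler: `make -j 8 all` runs up to 8 index/align jobs at once, respecting the `.bwt` prerequisites. A tiny self-contained demonstration of `-j` with independent targets (the demo file names are made up):

```shell
# two independent targets; make -j can build them concurrently
printf 'all: a.txt b.txt\na.txt:\n\ttouch a.txt\nb.txt:\n\ttouch b.txt\n' > demo.mk
make -f demo.mk -j 2 all
ls a.txt b.txt
```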
