Question

How common is it to split fastq files prior to bwa mem to increase parallelization?

0

Entering edit mode

5 weeks ago

curious ▴ 890

I’m working with 150bp paired end human whole-genome sequencing (WGS) reads, with each sample sequenced across 8 lanes. I get two fastqs for each lane, which are ~4GB a piece

I’m currently aligning the FASTQ files in parallel using bwa mem, one process per lane. Even after increasing the -t (threads) parameter, the alignment step remains slow and typically taking around two days per sample.

I'm considering further parallelizing by splitting each lane-based FASTQ file using a tool like seqkit split2 -p 8, and then aligning the resulting chunks in parallel. This seems like it should provide a near-linear speedup, but I’m a bit cautious since splitting introduces an extra step where things could potentially go wrong.

Is splitting like this a common strategy for accelerating WGS alignments? Are there any caveats or best practices I should be aware of when using this approach?

Thanks

mem bwa alignment • 953 views

ADD COMMENT • link updated 4 weeks ago by Darked89 4.7k • written 5 weeks ago by curious ▴ 890

0

Entering edit mode

What is your hardware and command line? Two days sounds excessive, unless you are limited by RAM and cores, and that then would make it questionable what you would gain by putting more processes -- as you're probably bittlenecked already.

ADD REPLY • link 5 weeks ago by ATpoint 88k

2

Entering edit mode

This run is running on a node with 60 cores but I only requested 4GB of RAM. Seem like it is only using 2 CPUs to near 100% if I open htop. The fastq are gzipped. Did I not ask for enough RAM? I can ask for much more.

 bwa mem \
    -K 100000000 \
    -Y \
    -t 60 \
    -R "@RG\\tID:{params.flowcell_id}.{params.lane}\\tSM:{params.SM}\\tPL:ILLUMINA\\tPU:{params.flowcell_id}.{params.lane}.{params.sample_barcode}" \
    {input.ref} \
    {input.read1} \
    {input.read2} | \
    samtools view -b -o {output.bam}

EDIT: is my problem the sort command creating bottleneck? I see this stack exchange:

bwa mem -t 8 genome.fa reads.fastq | samtools sort -@8 -o output.bam -

if so, ultimatley I want a sorted cram file, would the command be this overall?:

 bwa mem \
    -K 100000000 \
    -Y \
    -t 60 \
    -R "@RG\\tID:{params.flowcell_id}.{params.lane}\\tSM:{params.SM}\\tPL:ILLUMINA\\tPU:{params.flowcell_id}.{params.lane}.{params.sample_barcode}" \
    {input.ref} \
    {input.read1} \
    {input.read2} | \
    samtools sort -@60 -O CRAM -T {genome} -o {output.cram} -

ADD REPLY • link 5 weeks ago by curious ▴ 890

2

Entering edit mode

4GB is in no way enough. The sort alone uses 768MB RAM per thread, so 8*768 already exceeds the memory. bwa with that many threads will use probably use a large two digit amount of RAM as well. Strange job did not get OOM-killed. I would run -t 32 or so, more is probably not doing much due to I/O bottleneck. Consider bwa-mem2 which is faster at identical results, see GitHub. Request 100GB RAM or so.

ADD REPLY • link 5 weeks ago by ATpoint 88k

0

Entering edit mode

The issue was my snakemake pipeline and the threads not being passed correctly :(, your response helped me though!

ADD REPLY • link 5 weeks ago by curious ▴ 890

0

Entering edit mode

Sorry the another question, but is bwa-mem2 the industry standard replacement for bwa-mem? If it produces identical results I would think so, but just wondering if there is any reason to not use it

ADD REPLY • link 5 weeks ago by curious ▴ 890

1

Entering edit mode

I cannot tell about "industry-standard", but since it was developed in coop. with the bwa developer, and claims that it produces identical results I see no reason to not use it. Uses more RAM IIRC, but don't quote me on that.

ADD REPLY • link 5 weeks ago by ATpoint 88k

0

Entering edit mode

You may switch from bwa to bwa-mem2. There should be no difference in the output, but since bwa-mem2 is faster than bwa you will get a speed-up for free. Try to use stable, latest versions of the tools. And since samtools is doing compression, increase the number of threads for it.

ADD REPLY • link 4 weeks ago by Darked89 4.7k

score 3 · Answer 1 · 2025-05-19

3

Entering edit mode

5 weeks ago

dsull ★ 7.6k

Splitting is a good strategy for speeding up processing. I use it all the time in my Snakemake workflows. Processing multiple files in parallel (using a couple of threads per file) is faster than processing one huge file using a large number of threads.

I don’t think there are really any caveats aside from testing to make sure that your “split file” pipeline produces the same results as your “non-split-file” pipeline.

ADD COMMENT • link 5 weeks ago by dsull ★ 7.6k

1

Entering edit mode

I don’t think there are really any caveats aside from testing to make sure that your “split file” pipeline produces the same results as your “non-split-file” pipeline.

Since bwa processes reads in chunks, and the chunk content is used e.g. to estimate insert sizes, the trick is probably to use the default (or any fixed) chunk size and split files in multiples of that chunk size. Then I would assume it produces the same output. Duplicate marking should then be done on the merged alignment I guess.

ADD REPLY • link 5 weeks ago by ATpoint 88k

0

Entering edit mode

Ah yes good point, agreed.

ADD REPLY • link 5 weeks ago by dsull ★ 7.6k

0

Entering edit mode

Btw, for splitting, in my hands, piping unpigz into the unix split command seemed to work faster than seqkit.

ADD REPLY • link 5 weeks ago by dsull ★ 7.6k