parallel bwa mem with compressed fastq.bz2 files
4.1 years ago
Amirosein ▴ 70

Hi all,

I have a set of 2 compressed paired files like:

L1_R1.fastq.bz2
L1_R2.fastq.bz2


and I want to run bwa mem with multiple threads on them. As the files are compressed with bzip2, I am using a shell script that pipes two decompression commands into bwa mem at once. The code is as follows:

bwa mem -t20 /projects/ref.fa \
<(pbzip2 -kdc -m5000 -p12 /projects/L1_R1.fastq.bz2) \
<(pbzip2 -kdc -m5000 -p12 /projects/L1_R2.fastq.bz2) \
| samtools view -u -F4 | samtools sort -@8 -o /projects/linearL1.bam


I am using pbzip2, which does the decompression using multiple cores (-p12). Additionally, I set the -t20 parameter for bwa mem. But when I run the script, I see that only two threads are processing the data, whereas I want to use multiple threads to make it faster.

So the question is: what am I missing in my script to use more threads? Or what is the problem?

Additionally, I have multiple files, and I am wondering if I can input all of them into a single bwa mem run. Example of my file organisation:

L1_R1.fastq.bzip2
L1_R2.fastq.bzip2
L2_R1.fastq.bzip2
L2_R2.fastq.bzip2
L3_R1.fastq.bzip2
L3_R2.fastq.bzip2

bzip2 bwa bwa-mem linux bash • 3.5k views

At some point IO becomes limiting. Anyway, if bwa mem is using all 20 cores then decompressing the files faster won't make any difference. Further, bwa, like pretty much every program in existence, has various steps with various levels of parallelization, so if you're looking at its CPU usage you might just be seeing a step where its worker threads are dumping to disk (that will be single-threaded by nature).


But it is not using all 20 cores! Yes, I am looking at CPU usage, but it is totally different from when you pass an uncompressed file into it...


Then something else is the bottleneck. Programs don't scale linearly forever; there are various limitations throughout both your system and program architectures.
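
One way to narrow down where the bottleneck sits is to time each stage on its own, with its output discarded. The sketch below is only an illustration: time_stage is a hypothetical helper (not part of bwa or pbzip2), and the commented-out lines reuse the paths from the question. If decompression alone takes about as long as the whole pipeline, then decompression is the limiting stage. One possibility worth checking, as far as I know from the pbzip2 documentation, is that pbzip2 only decompresses in parallel those archives that pbzip2 itself created; a file made with plain bzip2 would then be decompressed on a single thread regardless of -p.

```shell
#!/usr/bin/env bash
# time_stage: hypothetical helper that runs a command, discards its stdout,
# and prints the wall-clock time it took in whole seconds.
time_stage() {
    local start end
    start=$(date +%s)
    "$@" > /dev/null
    end=$(date +%s)
    echo $(( end - start ))
}

# Example usage with the paths from the question (uncomment on the real machine):
# time_stage pbzip2 -dc -p12 /projects/L1_R1.fastq.bz2
# time_stage bzip2 -dc /projects/L1_R1.fastq.bz2

# Harmless demonstration that runs anywhere: a one-second sleep.
elapsed=$(time_stage sleep 1)
echo "sleep 1 took about ${elapsed}s"
```

Comparing the pbzip2 and bzip2 timings on the same file should show whether the extra decompression threads are doing anything at all.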


I think you might be better off trying something like this:

pbzip2 -kdc -m5000 -p10 /projects/L1_R1.fastq.bz2 > /projects/L1_R1.fastq &
pbzip2 -kdc -m5000 -p10 /projects/L1_R2.fastq.bz2 > /projects/L1_R2.fastq
wait  # make sure the backgrounded R1 decompression has finished before aligning
bwa mem -t20 /projects/ref.fa /projects/L1_R1.fastq /projects/L1_R2.fastq \
| samtools view -u -F4 | samtools sort -@8 -o /projects/linearL1.bam
rm -f /projects/L1_R1.fastq
rm -f /projects/L1_R2.fastq


Unzip the two files, each with half of your machine's CPU power, and only afterwards send those two unzipped files to bwa with all cores. That way you use the full capacity of your machine and avoid (what I think is) the rate-limiting unzip step in your streaming approach.

Yes, you'll have to (temporarily) give in on storage space efficiency.

On the second part of your question: why would you want to put all those files into a single bwa run? If you split them up, it will process faster.
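
Running one bwa mem job per lane could be sketched as a loop like the one below. The lane names L1 to L3 and the /projects paths are taken from the question (using the .bz2 extension from its first example); lane_command and the DRY_RUN switch are hypothetical names introduced here for illustration. With DRY_RUN=1, the default, the script only prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: one bwa mem run per lane.
# Assumed layout: /projects/<lane>_R1.fastq.bz2 and /projects/<lane>_R2.fastq.bz2
ref=/projects/ref.fa
DRY_RUN=${DRY_RUN:-1}   # hypothetical switch; set DRY_RUN=0 to actually run

# lane_command: hypothetical helper that builds the pipeline for one lane.
lane_command() {
    local lane=$1
    printf 'bwa mem -t20 %s <(pbzip2 -dc /projects/%s_R1.fastq.bz2) <(pbzip2 -dc /projects/%s_R2.fastq.bz2) | samtools view -u -F4 | samtools sort -@8 -o /projects/linear%s.bam' \
        "$ref" "$lane" "$lane" "$lane"
}

for lane in L1 L2 L3; do
    cmd=$(lane_command "$lane")
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"          # only show what would be executed
    else
        bash -c "$cmd"       # bash is needed for the <( ) process substitution
    fi
done
```

Each lane ends up in its own BAM file; whether the lanes are run one after another or side by side depends on how many cores and how much memory are free.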


Thanks, but I am wondering what is wrong with my code! If there is no problem then the code you provided will be slower than mine, as you are doing both the things separately. Theoretically, if you use piping you are not going to be slower at least...! And my problem is that I want to be faster and memory efficient using piping. And thanks for your answer to the 2nd part. You're right.


I don't think anything is wrong with your code, as it does seem to work, right? As Devon Ryan also mentioned, you're likely facing a bottleneck somewhere, but it is hard to say where or what, I'm afraid.

"Theoretically, if you use piping you are not going to be slower at least...!"

This holds true for single-core processes, but I am not sure how it will go in your set-up. I can think of a scenario where, for instance, the data is not being unzipped rapidly enough for bwa to read it at full capacity, hence the whole process will go slower. In your specific case you are also requesting more cores than you have. Your server has 20 cores and you ask for 20+12+12 = 44 threads in total, so parts of your pipeline might compete for resources.
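
The arithmetic above can be checked against the machine before launching anything. The snippet below is a minimal sketch: the variable names are made up for illustration, the thread counts are the ones from the original command, and nproc (GNU coreutils) is assumed to be available. The -@8 given to samtools sort adds further threads on top, which the sum leaves out to match the figure quoted above.

```shell
#!/usr/bin/env bash
# Thread counts requested in the original pipeline (illustrative variable names).
bwa_threads=20
pbzip2_threads_r1=12
pbzip2_threads_r2=12

requested=$(( bwa_threads + pbzip2_threads_r1 + pbzip2_threads_r2 ))
available=$(nproc)   # number of processing units available to this process

echo "requested threads: $requested"
echo "available cores:   $available"
if [ "$requested" -gt "$available" ]; then
    echo "oversubscribed: stages will compete for CPU time"
fi
```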

"And my problem is that I want to be faster and memory efficient using piping"

I understand, but that's not really the way to go then. Piped command lines are always more memory intensive than reading from disk, as the whole command line needs to be handled in memory, so you will give in on memory efficiency. Yes, it might go faster, as you are indeed eliminating the time-consuming reading/writing to disk steps.

"If there is no problem then the code you provided will be slower than mine, as you are doing both the things separately."

It might look so, but I beg to differ. The approach I provide uses the full capacity of your server for the whole duration of the process(es). And I'm not doing all steps separately; only the unzipping is done in two steps. bwa is running at full CPU capacity, so there is no loss of efficiency there (even a gain, as it can read from disk at full capacity).


The buffer size of a pipe is relatively small, at least on Linux, so there shouldn't be too much of a memory hit from piping.


Thanks for your complete reply. Yes, I was answering fast and I missed some of the points. Thanks for pointing that out. I have to add that the machine has more than 44 cores. Thanks again.

2.7 years ago
hsiaoyi0504 ▴ 60

I am not sure whether the original author of this thread solved this problem, but for me, I use bunzip2 -c to do the decompression without creating the decompressed file in the input file directory.
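
For reference, streaming with bunzip2 -c could look like the sketch below. The bwa lines are commented out and simply reuse the paths and downstream commands from the question. The part that actually runs only demonstrates, on a throwaway temporary file, that bunzip2 -c writes the decompressed data to standard output and leaves nothing new next to the input. Note that bunzip2 is single-threaded, so this helps with disk usage rather than with speed.

```shell
#!/usr/bin/env bash
# With the real data (paths from the question), the alignment could be fed like this:
# bwa mem -t20 /projects/ref.fa \
#     <(bunzip2 -c /projects/L1_R1.fastq.bz2) \
#     <(bunzip2 -c /projects/L1_R2.fastq.bz2) \
#     | samtools view -u -F4 | samtools sort -@8 -o /projects/linearL1.bam

# Small self-contained demonstration on a temporary file.
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$tmpdir/demo.fastq"
bzip2 "$tmpdir/demo.fastq"        # creates demo.fastq.bz2 and removes demo.fastq

# Decompress to stdout only; nothing is written back to the directory.
first_line=$(bunzip2 -c "$tmpdir/demo.fastq.bz2" | head -n 1)
files_after=$(ls "$tmpdir")       # still just the compressed file

echo "$first_line"
echo "$files_after"
rm -rf "$tmpdir"
```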
