I am trying to align a bunch of paired sample fastq files using bwa mem.
My original command was:
bwa mem -t 8 hg38.fa sample_read1.fq.gz sample_read2.fq.gz > sample_paired.sam
I am running this on an HPC cluster.
These files have approx. 25 million reads, so I initially anticipated that they might take a little while; however, I noticed that the program seemed to hang: the SAM file I am redirecting the output to never became populated with any entries other than the header lines, even after hours of waiting.
Noticing this, I decided to first test whether the most bare-bones bwa mem functionality was working, so I took just the first 1000 lines of one of the fastq files and tested it with:
bwa mem hg38.fa test_1000lines.fq.gz > test_output.sam
So this works: it creates a proper SAM file almost instantly. However, I kept testing this behavior by increasing the number of lines I took from the fastq file, from 1000 to 2000, 3000, etc., and eventually realized that any fastq file I input with more than approximately 8400-8500 lines makes the program hang.
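For context, I made the truncated test files with something along these lines (the exact command isn't important, but the line count has to stay a multiple of 4 so that no read record gets split):
zcat sample_read1.fq.gz | head -n 1000 | gzip > test_1000lines.fq.gz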
If I run the following with a fastq file containing 8400 lines, the program will run and the sam file will be created in a few seconds:
bwa mem hg38.fa test_8400lines.fq.gz > test_output8400.sam
and this is the log output printed to the terminal:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2100 sequences (313733 bp)...
[M::mem_process_seqs] Processed 2100 reads in 1.051 CPU sec, 1.054 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem supporting_files/hg38.fa test_8-4k.fq.gz
[main] Real time: 4.231 sec; CPU: 4.178 sec
However, if I run the following with just 100 more lines in the fastq file:
bwa mem hg38.fa test_8500lines.fq.gz > test_output8500.sam
The command will hang (at least, that is what it appears to be doing) and all I will see on the terminal is:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2125 sequences (317468 bp)...
followed by a blinking cursor.
I've monitored the CPU and memory usage with the "top" command while conducting these tests, and while the memory usage never goes above roughly 15%, the CPU usage always goes to 100% when this hanging behavior happens. When the .sam files were created properly and the program didn't hang, the CPU usage was (for the very short time it took the program to run) around 60-70%.
Whether the CPU usage going to 100% is causing the hanging, or the hanging is causing the CPU usage to go to 100%, I'm not sure.
I'm completely out of ideas about what the issue could be here. Any suggestions?
Use option -v with a value of 4 or above to help diagnose the issue:
-v INT
Control the verbose level of the output. This option has not been fully supported throughout BWA. Ideally, a value 0 for disabling all the output to stderr; 1 for outputting errors only; 2 for warnings and errors; 3 for all normal messages; 4 or higher for debugging. When this option takes value 4, the output is not SAM.
See: bwa manual
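For example, something like this (reusing your 8500-line test file and sending stderr to a file so the debug output can be inspected afterwards):
bwa mem -v 4 hg38.fa test_8500lines.fq.gz > test_output8500.sam 2> bwa_v4.log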
Here is the output with the "-v 4" flag (not showing all the @SQ header lines):
The cursor just hangs after the last line.
Here is that specific read in question:
As I noted in an update to the Bioinformatics Stack Exchange post, this stalling behavior disappears if I use the hs38DH set of reference indices instead of the hg38 set.
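In case it is relevant: hs38DH is the GRCh38 analysis set with ALT contigs, decoy, and HLA sequences, and (if I remember correctly) it can be generated and indexed with the run-gen-ref script that ships with bwa.kit, roughly like this:
run-gen-ref hs38DH
bwa index hs38DH.fa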
This is a guess, but if you are actually running this on a cluster then it is possible that you need to use a job scheduler to submit this job. It appears that you may be running it directly on the head node, and perhaps that is why the job is being throttled by the sys admins.
Since you are able to get a small dataset to work, we know that the program is properly installed and is working as expected.
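If you do need to go through the scheduler, a minimal submission script might look something like the sketch below (I'm assuming SLURM here; the resource values are placeholders to adjust for your cluster):
#!/bin/bash
#SBATCH --job-name=bwa_mem
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=12:00:00
bwa mem -t 8 hg38.fa sample_read1.fq.gz sample_read2.fq.gz > sample_paired.sam
and submit it with sbatch.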
You may also want to pipe the output directly to samtools and write out a sorted BAM file; there is no need to create an intermediate SAM file. Assuming you are using a recent version of samtools, something like the following should work (just a sketch reusing the file names from your original command):
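bwa mem -t 8 hg38.fa sample_read1.fq.gz sample_read2.fq.gz | samtools sort -@ 8 -o sample_paired.sorted.bam -
Here -@ 8 gives samtools sort eight threads, and the trailing - tells it to read from stdin.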
Actually, I originally did both of the things you suggested: my original command at the top of the post was submitted to our cluster through a job scheduler, and I had also piped the output to samtools as you wrote.
I did this testing on my local login node, however, to turn the problem into a smaller one so I could try to isolate the issue. Given that the failure occurs with an input fastq file of fewer than 10k lines (which should normally be processed essentially instantly, even on a basic home laptop), I think the problem here is something more fundamental than just a matter of resources.
Then you are probably best served by talking with your local sys admins to see where the issue may be.
Cross-posted here: sequence alignment - bwa mem hangs after a few thousand reads - Bioinformatics Stack Exchange