Weird error from BWA and BOWTIE2
29 days ago
SkyL ▴ 10

Hi community,

I recently used BWA and Bowtie2 to align simulated DNA sequencing data in order to test our sequencing simulator, and both aligners failed with errors:

BWA: submit.sh: line 48: 6881 Segmentation fault (core dumped)

BOWTIE2: terminate called after throwing an instance of 'std::bad_alloc'

what(): std::bad_alloc

Aborted (core dumped)

(ERR): bowtie2-align exited with value 134 or (ERR): bowtie2-align died with signal 6 (ABRT)
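
For reference, the two Bowtie2 messages describe the same event. By the usual shell convention, a process killed by signal N is reported with exit status 128 + N, so 134 corresponds to signal 6 (SIGABRT), which is what an uncaught std::bad_alloc ends in when the program aborts. A minimal Python sketch of that arithmetic, assuming a Linux system; the function name signal_from_exit_status is made up for illustration:

```python
import signal


def signal_from_exit_status(status):
    """Return the signal implied by a shell exit status, or None.

    Assumes the common shell convention that a process killed by
    signal N is reported with exit status 128 + N.
    """
    number = status - 128
    if number <= 0:
        return None  # ordinary exit code, no signal involved
    try:
        return signal.Signals(number)
    except ValueError:
        return None  # not a valid signal number on this platform
```

On Linux, signal_from_exit_status(134) is signal.SIGABRT, and signal_from_exit_status(139) is signal.SIGSEGV, the segmentation fault reported for BWA.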

I searched online and found that most posts attribute this kind of error to a memory shortage, so I monitored memory usage during the alignment. BWA consistently used ~9GB and Bowtie2 consistently used ~5GB in total. I also ran a script that checked memory every 30 seconds: neither aligner occupied more than 10% of the memory, and ~100GB was always available. I then tried fewer threads (5, for example), assigning each thread 9GB of memory, but still got the same error. So I think a memory problem is unlikely.

The data that triggers the error has 100x coverage of the human genome, so a single FASTQ file is 300-400GB. I also simulated lower-depth data (e.g. 15x coverage) with the same simulator, and that alignment finished without issue. I am not sure whether the simulated data is simply too deep, but I would expect a larger number of reads to make the aligner run longer rather than throw an error.

Has anyone encountered a similar issue, or does anyone know what the cause might be or how to fix it? Many thanks!

alignment simulation error sequencing • 720 views
29 days ago
Mensur Dlakic ★ 14k

Insufficient memory is most likely the problem, as you surmised. It doesn't mean much that you can't detect a memory spike when monitoring in 30 sec intervals. The spike itself can happen on a time scale that is smaller than 30 sec and you may not be able to catch it. Also, are you certain that all the memory is available to you? I would not normally worry about 100x coverage, but it depends on the genome size and the total number of reads - 100x of the wheat genome is not the same as 100x of E. coli.
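
One way to avoid missing a short spike between samples is to ask the kernel for the peak instead of polling. The sketch below is only an illustration under stated assumptions: it assumes Linux (where ru_maxrss is reported in KiB), the function name run_and_report_peak is hypothetical, and the aligner command is whatever you already run. Note also that a single oversized allocation request can fail with std::bad_alloc before any of that memory becomes resident, so even the peak may look small.

```python
import resource
import subprocess
import sys


def run_and_report_peak(cmd):
    """Run cmd and return (exit_code, peak_rss) for finished child processes.

    peak_rss comes from getrusage(RUSAGE_CHILDREN): it is the largest
    resident set size of any single waited-for child (KiB on Linux),
    not the sum over a process tree.
    A negative exit_code means the child was killed by that signal number.
    """
    completed = subprocess.run(cmd)
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    print(f"exit={completed.returncode} peak_rss={peak_rss}", file=sys.stderr)
    return completed.returncode, peak_rss
```

The command is passed as a list, for example the same bwa or bowtie2 invocation that script.sh contains, split into its arguments.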


Thanks for the reply and suggestion

1. We are simulating human WGS data
2. I believe that all of the memory is available to me, since I have used over 100GB of memory on the same machine before

I will monitor the memory every second to see whether there is a spike that exceeds the total amount of memory. Thanks!


Do you use a job scheduler such as SLURM? If so, please add the submission parameters.


I used SLURM on a larger server. The command is something like:

sbatch -p xxx -w xxx -t 71:00:00 -c 16 --mem 46G script.sh

or

sbatch -p xxx -w xxx -t 71:00:00 -c 5 --mem-per-cpu 9G script.sh

-p xxx and -w xxx select the partition (pool) and compute node defined on the server, and script.sh contains the command that runs bwa or bowtie2.

For the memory test, I used a smaller cluster running a plain Ubuntu system, so I just opened a screen session and ran script.sh.


I monitored the memory usage every second and plotted it. Since usage stayed at 2.7% until the aligner exited, I cropped the plot to the short time frame just before the end point. There is a spike, but it only reaches around 25% of the memory.


I tried 50x coverage and got the same error... now I am confused. Any suggestions for debugging? Could something be wrong with the simulator? Thanks!


I think this is almost certainly a memory issue. You can (dis)prove that: try your command with a single CPU, use more memory than 46G, and take out the part that says --mem-per-cpu 9G. I realize doing it that way will be slow, but if it runs without a problem, it means that your combination of total memory, number of CPUs and memory allocation per CPU is not giving the program enough memory to work with.

Alternatively, your total memory should be at least 10% higher than 5*9G, because scripts take up memory for other reasons than just what BWA or BOWTIE2 need.
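
The rule of thumb above works out as follows. This is just the arithmetic from this reply written in Python; the function name and the 10% default margin are illustrative, not a SLURM requirement.

```python
def recommended_total_g(threads, g_per_thread, margin=0.10):
    """Total memory request: per-thread memory times threads, plus a margin."""
    return threads * g_per_thread * (1 + margin)


# 5 threads at 9G each, plus 10% headroom for everything else in the job.
suggested = recommended_total_g(threads=5, g_per_thread=9)
print(f"suggested total: {suggested:.1f}G")
```

Both submissions quoted in the thread request about 45-46G in total (--mem 46G, or 5 x 9G via --mem-per-cpu), which is below the 49.5G this suggests.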


Thanks for your suggestion. I ran more tests on simulated data at different coverages using different aligners. Bowtie2 finished on 25x and 10x coverage without error but failed on 5x data with a std::bad_alloc error. I also tested minimap2, bwa mem and bowtie2 on the same 15x coverage data on nodes with the same configuration (8 cores, 23G memory): only bwa mem finished without error, bowtie2 exited with a std::bad_alloc error, and minimap2 exited with Segmentation fault (core dumped) and a "SEQ and QUAL are of different length" message. Is this possible? bwa mem also failed on the earlier 100x data with Segmentation fault (core dumped) or a "SEQ and QUAL are of different length" error. Am I missing some key options for these aligners? Currently I only specify the input file, output file, number of threads and reference genome for all of them. Thanks!


The "SEQ and QUAL are of different length" message points to a different problem, one that could lie with the simulator rather than with memory. It means that in one or more of your reads the sequence line is not the same length as the line of quality values. You can remove those reads with reformat.sh, which is part of the BBTools package:

reformat.sh in=reads.fq out=fixed.fq tossbrokenreads
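
To see which reads are affected before discarding them, a check along these lines may help. It is a minimal sketch that assumes strict four-line FASTQ records (no wrapped sequence lines); the function name find_length_mismatches is made up for illustration, and pure Python will be slow on a file of several hundred GB, so it is best tried on a subset first.

```python
from itertools import islice


def find_length_mismatches(handle, limit=10):
    """Return up to `limit` problem records from a four-line FASTQ stream.

    Each entry is (record_number, header, len(sequence), len(quality)).
    A record is reported if it is truncated (fewer than four lines) or
    if its sequence and quality lines differ in length.
    """
    problems = []
    record_number = 0
    while len(problems) < limit:
        block = [line.rstrip("\r\n") for line in islice(handle, 4)]
        if not block:
            break  # end of file
        record_number += 1
        complete = len(block) == 4
        block += [""] * (4 - len(block))  # pad a truncated final record
        header, seq, _plus, qual = block
        if not complete or len(seq) != len(qual):
            problems.append((record_number, header, len(seq), len(qual)))
    return problems
```

For example, calling it on open("reads.fq") returns the first ten offending records, which can then be compared against what the simulator was supposed to write.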


Thanks for the quick response. What confuses me is that, for the same data, bwa mem finished without error while minimap2 reported that error and bowtie2 still exited with the std::bad_alloc error. Is this possible? Or did I miss some configuration for the tools?


I have seen before that some aligners quit when faced with unequal read and quality lengths, while others just power through. If you know there is a problem, I think it is always a good idea to fix it.


Thanks. We ran fastQValidator on the simulated data and found a lot of repeated sequence identifiers:

ERROR on Line 1967301: Repeated Sequence Identifier: ohlsaiv at Lines 1277469 and 1967301
ERROR on Line 1972921: Repeated Sequence Identifier: qjgbvmp at Lines 1467313 and 1972921
ERROR on Line 2035541: Repeated Sequence Identifier: xfvhukh at Lines 498093 and 2035541
ERROR on Line 2039105: Repeated Sequence Identifier: vpxmhxr at Lines 506781 and 2039105
ERROR on Line 2074373: Repeated Sequence Identifier: ejirkhj at Lines 797529 and 2074373
Finished processing refright.fasta with 501692704 lines containing 125423176 sequences.
There were a total of 974030 errors.
Returning: 1 : FASTQ_INVALID


but we did not find any errors related to unequal read and quality lengths. Do you think the repeated sequence identifiers could be what blows up the memory? If so, how? Thanks!
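
One thing the validator output does suggest: the five identifiers shown are all 7 lowercase letters. If, and this is only an assumption based on those examples, the simulator draws each identifier uniformly at random from the 26^7 possible strings, then repeats are expected purely by chance once there are enough reads. The sketch below estimates how many; the function name and its defaults are illustrative.

```python
import math


def expected_repeat_errors(n_reads, id_length=7, alphabet_size=26):
    """Expected number of reads whose identifier was already used earlier.

    Assumes every identifier is drawn independently and uniformly from
    alphabet_size ** id_length possible strings (an assumption about the
    simulator, not something confirmed in the thread).
    """
    id_space = alphabet_size ** id_length
    # Expected number of distinct identifiers after n_reads draws,
    # using the approximation (1 - 1/N)^n ~= exp(-n/N).
    expected_distinct = -id_space * math.expm1(-n_reads / id_space)
    return n_reads - expected_distinct


print(round(expected_repeat_errors(125_423_176)))
```

For the 125423176 sequences reported above this gives roughly 974 thousand expected repeats, very close to the 974030 errors fastQValidator counted. That would be consistent with random identifier collisions, and under the same assumption the number of repeats grows roughly with the square of the read count. It does not by itself show that repeated identifiers are what makes the aligners fail.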