BWA-MEM gives different results with different thread counts
lghust2011 ▴ 110 · 7.0 years ago

I am using BWA (version 0.7.15-r1140) with the BWA-MEM algorithm. Here is my command:

bwa mem -t 2 reference_file fastq1.fq.gz fastq2.fq.gz > result1.sam

This works well. Then I increase the number of threads:

bwa mem -t 10 reference_file fastq1.fq.gz fastq2.fq.gz > result2.sam

This also works well, but when I compare result1.sam with result2.sam, they are different! I also tested with -t 6 and -t 16, and all the results differ. However, when I run with the same thread count twice, the results are identical. So BWA-MEM produces different results with different numbers of threads.

Then I read the source code and found this:

kt_for(opt->n_threads, worker1, &w, (opt->flag&MEM_F_PE)? n>>1 : n); // find mapping positions
for (i = 0; i < opt->n_threads; ++i) smem_aux_destroy(w.aux[i]);
free(w.aux);
if (opt->flag&MEM_F_PE) { // infer insert sizes if not provided
    if (pes0) memcpy(pes, pes0, 4 * sizeof(mem_pestat_t)); // use the insert-size distribution supplied by the caller
    else mem_pestat(opt, bns->l_pac, n, w.regs, pes); // otherwise, infer the insert-size distribution from the data
}
kt_for(opt->n_threads, worker2, &w, (opt->flag&MEM_F_PE)? n>>1 : n); // generate the alignments

That is, BWA-MEM uses n_threads threads (e.g. -t 6 gives n_threads = 6) to find mapping positions, but uses only one thread to execute the function mem_pestat, which calculates avg (the mean insert size) and std (the standard deviation of the insert size); these values are important for finding pair information. According to BWA, each thread is given around 10,000,000 bp per batch, so:

If I use -t 2, BWA calculates avg and std from 2 × 10,000,000 bp

If I use -t 10, BWA calculates avg and std from 10 × 10,000,000 bp

If I use -t 16, BWA calculates avg and std from 16 × 10,000,000 bp

So now I understand why the results differ with different thread counts: the insert-size statistics are estimated from batches of different sizes.
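
If my reading is right, pinning the batch size should remove the thread dependence. I believe newer bwa builds accept a -K option that fixes the number of input bases per batch regardless of -t (I have not verified that 0.7.15 supports it), so something like this should give identical output for different thread counts:

# with a fixed batch size, the insert-size estimate no longer depends on -t
bwa mem -t 2 -K 100000000 reference_file fastq1.fq.gz fastq2.fq.gz > result1.sam
bwa mem -t 10 -K 100000000 reference_file fastq1.fq.gz fastq2.fq.gz > result2.sam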

Is there anything wrong with my reasoning? If it's correct, I want to know how to evaluate the difference: which fields of the SAM record (such as RNAME or POS) will it change? If it's wrong, I want to know the real reason for the difference.

Any reply will be much appreciated!

Could you explain how the files are different? How did you compare them: mapping rate, insert size, order of reads, md5sum? This question has been asked before; see here, here and here.

Thanks for your comment. I only used the diff command to compare them. Because I read the source code directly, I suspect the code above may be the reason for the difference. From what I've found, samtools flagstat looks like a good way to compare; I will use it and report the result later. If there is a better comparison tool, please let me know.

diff is not a good tool for this unless you sort both files: if the reads are in a different order but have exactly the same mappings, diff will still report them as different. Use Devon Ryan's tool, or compare mapping statistics. Keep in mind that even if the files mapped with different thread counts are not identical, the differences should be minor.
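
For a quick first pass, something along these lines (a rough sketch: it keeps only QNAME, FLAG, RNAME, POS, MAPQ and CIGAR, and sorts so that read order no longer matters) will count how many records actually differ:

grep -v '^@' result1.sam | cut -f1-6 | sort > r1.txt
grep -v '^@' result2.sam | cut -f1-6 | sort > r2.txt
diff r1.txt r2.txt | grep -c '^[<>]'   # number of differing records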

I found that when the MAPQ is high, the main difference is in the FLAG field. When the MAPQ is low, the differences get more complicated...

Is there anything wrong with my reasoning?

Well, no, not really. Everyone wants correct and reproducible results...

But in this case, you can't have correct and reproducible results; the problem is that you can't have correct results at all. The insert-size estimate depends on read order, and read order is random. There's no way to accurately claim "my estimate using the first 10000 reads is better than using 5 threads with 2000 reads each", or whatever.

For N reads, there are N! orderings. Even if it took only a nanosecond to evaluate the insert size for a given read ordering, it would still be impossible, even for 100 reads (100! ≈ 10^158 orderings), to determine the optimal order or optimal insert size in a system where the insert size affects mapping. I think bwa's approach of using multiple threads to refine the estimate is pretty good.
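
You can get a feel for how much the estimate depends on which reads it happens to see by computing the mean and standard deviation of the insert size (TLEN, column 9 of a SAM record) over prefixes of different lengths; the numbers drift as the sample grows. A rough sketch, assuming result1.sam is one of your paired-end mappings:

# mean/std of positive TLENs over the first 100k and the first 500k records
for n in 100000 500000; do
    grep -v '^@' result1.sam | head -n $n \
      | awk '$9 > 0 {s += $9; ss += $9 * $9; c++}
             END {m = s / c; printf "records=%d mean=%.1f std=%.1f\n", c, m, sqrt(ss / c - m * m)}'
done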

BBMap uses continuous refinement of insert-size estimation, so for an infinitely long file with a uniform insert-size distribution, it will converge on the correct answer. This approach is more robust to scenarios in which the first million reads happen to be junk, which is a chronic occurrence on some platforms. But that makes it even more nondeterministic than bwa.

I think, if you want the best results from mapping modern sequence output in a usable timeframe, you will have to accept nondeterministic software.

Thanks for your nice answer! I have another question: what read order do you mean? Is it the order of the reads in the fastq file? And according to your answer, it is the insert-size estimate that causes the difference?

That's right, I am talking about the order of the reads in the fastq file. The insert-size estimate will vary depending on the order of the reads, for programs that adjust mapping quality based on expected versus actual insert size.

Thanks very much! Now I know the reason exactly!

It's almost guaranteed that the differences are in alignments with low (0 or so) MAPQs. I have a little C program that uses htslib to compare two name-sorted BAM files (or at the very least they need to be in the same order). It'll print out differences in chromosome/position/CIGAR. If you then looked at a couple of those, my guess would be that the differences only occur at low-confidence alignments, which is wholly unsurprising.
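
If you don't want to compile anything, here is a crude shell approximation of the same idea. It assumes the two SAM files contain the same records in the same order (name-sort them first if they don't); it compares RNAME, POS and CIGAR record by record and prints the MAPQ of every mismatch:

# field layout after cut: $1=QNAME $2=RNAME $3=POS $4=MAPQ $5=CIGAR (file 1),
#                         $6=RNAME $7=POS $8=CIGAR (file 2)
paste <(grep -v '^@' result1.sam | cut -f1,3,4,5,6) \
      <(grep -v '^@' result2.sam | cut -f3,4,6) \
  | awk '$2 != $6 || $3 != $7 || $5 != $8 {print $1, "MAPQ=" $4}' | head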

Yes, I found that the MAPQ of most differing records is very low, although a few of them are high.
