Hi, I used bwa-mem to align paired-end reads. I trimmed the fastq files so that the reads in one file are 80 bp long and the reads in the mate file are 79 bp long. When I used 8 threads, bwa mapped more reads than with the default 1 thread. Now I am confused: changing the number of threads is supposed to only change the speed, so why does it affect the mapping result? Does anyone know the reason for this? Any comments are appreciated. Thank you.
What flags are you using? Does the behaviour persist with the -M flag? Can you identify where the differences are: i.e. reads mapped multiple times or a greater % of raw reads mapped?
I only set -t; for the others I used the default settings. Do I have to use -M all the time? I used samtools flagstat to check the results. All items are different; the biggest difference is in the number of reads properly paired.
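For reference, a minimal sketch of that kind of flagstat comparison, using hypothetical file names aln_t1.bam and aln_t8.bam for the 1-thread and 8-thread runs:
$ samtools flagstat aln_t1.bam > flagstat_t1.txt   # 1-thread run
$ samtools flagstat aln_t8.bam > flagstat_t8.txt   # 8-thread run
$ diff flagstat_t1.txt flagstat_t8.txt             # show which counters differ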
That could be a bug in the software, especially taking into account that race conditions are usually harder to debug and test, so they are more likely to slip through undetected. What version of bwa do you use, and did you have a chance to see what was aligned differently between runs? I rely on bwa-mem in my current project and am really interested in knowing more about this issue.
I've just run a piece of my pipeline and data with different numbers of threads, and the number of mapped reads is greater with more threads. Diff shows that some lines changed and some were added:
$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.5a-r405
Contact: Heng Li <lh3@sanger.ac.uk>
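For the record, a sketch of how such a diff can be made (run_a.bam and run_b.bam are hypothetical names for the two runs): sorting by read name and dropping the @PG header line, which records the command line, leaves only real alignment differences.
$ samtools sort -n -O sam run_a.bam | grep -v '^@PG' > run_a.namesorted.sam
$ samtools sort -n -O sam run_b.bam | grep -v '^@PG' > run_b.namesorted.sam
$ diff run_a.namesorted.sam run_b.namesorted.sam | head   # first differing alignments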
Out of curiosity, do you get identical results if you rerun one of these? I'm wondering if there's a random number generator being used somewhere for seeding that might cause slightly different results with each run (in fact, just grepping through the bwa source code, it makes use of random seeds and such in a number of places).
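One way to check that (reference and read file names below are placeholders; the header is dropped because the @PG line embeds the command): run the identical command twice and compare checksums of the alignment records.
$ bwa mem -t 8 ref.fa reads_1.fq reads_2.fq | grep -v '^@' | sort | md5sum
$ bwa mem -t 8 ref.fa reads_1.fq reads_2.fq | grep -v '^@' | sort | md5sum
# identical checksums => the run itself is repeatable; different checksums => per-run randomness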
For those who may be concerned, I got an answer from the developer of bwa mem. There is a bug in bwa-mem when running with different numbers of threads in version 0.7.5a. The master branch versions may work fine.
Thanks, it's good to know that the problem is acknowledged.
Only the released version on SourceForge has the bug; I think he fixed it on GitHub. This is what he said: "There is a bug in 0.7.5a which affects the randomness. The master branch in git should not have this issue if I am right".
Sounds like it. That's a relief, as I have been using the 0.7.5a version from GitHub in my production runs for a while now. I am testing with the current GitHub version overnight/tomorrow to double check.
I've been banging my head for about a week and I've just come across this thread - glad Pavel Senin has confirmed this. I'm using bwa/0.7.15, so I doubt it's fixed. P
I have to say that after several years this issue still persists, which I think is quite a big reproducibility issue.
Especially when a pipeline consists of many steps, the inconsistencies accumulate on top of each other like a snowball.
I use version 0.7.17-r1188
bwa mem -t 10 SC_masked 4_SC-NA_2N_col1_28279_TCCTGAtr_1P.fastq.gz 4_SC-NA_2N_col1_28279_TCCTGAtr_2P.fastq.gz | samtools sort -@10 -o SC1.bam -
61544262 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
330 + 0 supplementary
0 + 0 duplicates
47333309 + 0 mapped (76.91% : N/A)
61543932 + 0 paired in sequencing
30771966 + 0 read1
30771966 + 0 read2
46942664 + 0 properly paired (76.28% : N/A)
47180978 + 0 with itself and mate mapped
152001 + 0 singletons (0.25% : N/A)
87678 + 0 with mate mapped to a different chr
45985 + 0 with mate mapped to a different chr (mapQ>=5)
bwa mem -t 4 SC_masked 4_SC-NA_2N_col1_28279_TCCTGAtr_1P.fastq.gz 4_SC-NA_2N_col1_28279_TCCTGAtr_2P.fastq.gz | samtools sort -@4 -o SC1.bam -
61544262 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
330 + 0 supplementary
0 + 0 duplicates
47333217 + 0 mapped (76.91% : N/A)
61543932 + 0 paired in sequencing
30771966 + 0 read1
30771966 + 0 read2
46943592 + 0 properly paired (76.28% : N/A)
47180798 + 0 with itself and mate mapped
152089 + 0 singletons (0.25% : N/A)
87692 + 0 with mate mapped to a different chr
46000 + 0 with mate mapped to a different chr (mapQ>=5)
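If the discrepancy comes from bwa mem estimating the insert-size distribution separately for each batch of reads, one possible mitigation (untested here) is to pin that distribution with the -I option; the mean and standard deviation below (300,50) are placeholder values that would need to come from your own libraries, and whether this removes the thread-count dependence entirely is not something verified in this thread.
$ bwa mem -t 10 -I 300,50 SC_masked 4_SC-NA_2N_col1_28279_TCCTGAtr_1P.fastq.gz 4_SC-NA_2N_col1_28279_TCCTGAtr_2P.fastq.gz | samtools sort -@10 -o SC1.bam -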
I think (but am not sure) that Bowtie2 is deterministic, as it doesn't estimate the insert size on the fly. If you consider exact reproducibility very important, then it may be an aligner to consider / test.
If you run the same version of Bowtie 2 on two reads with identical names, nucleotide strings, and quality strings, and if --seed is set the same for both runs, Bowtie 2 will produce the same output; i.e., it will align the read to the same place, even if there are multiple equally good alignments. This is intuitive and desirable in most cases.
It does not appear to do this by default.
Bowtie 2’s search for alignments for a given read is “randomized.” That is, when Bowtie 2 encounters a set of equally-good choices, it uses a pseudo-random number to choose. For example, if Bowtie 2 discovers a set of 3 equally-good alignments and wants to decide which to report, it picks a pseudo-random integer 0, 1 or 2 and reports the corresponding alignment. Arbitrary choices can crop up at various points during alignment.
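As a hedged example (the index name and read files are placeholders), fixing --seed and keeping the inputs identical is what the documentation above describes for repeatable runs:
$ bowtie2 -x genome_index -1 reads_1.fq.gz -2 reads_2.fq.gz --seed 42 -p 8 -S out.sam
# --seed pins the pseudo-random choices described above; -p sets the thread count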
The question is whether a difference in mapping rate of < 0.001% is indeed a quite big reproducibility issue. I've never checked, but one would need to see whether that noticeably affects genome coverage and therefore downstream applications. I would assume it doesn't, and these differences are probably (but this is guessing) concentrated in difficult genomic regions (low complexity), which for most downstream applications would be filtered out anyway.
Well, something is either reproducible or not, and BWA is one of the most used pieces of software in bioinformatics; that's why I say quite big. Imagine how much time in total the community would spend figuring out what is wrong in a large pipeline that produces slightly different results. And the number of cores is intuitively not the first thing to check (in my case at least) when a large pipeline behaves oddly.
Regarding how much this affects the results: in my case, the downstream MACS2 and DiffBind analysis showed around 10 more open chromatin regions (~0.5% of total regions). Even if the number is not huge, I was about to redo all the analysis and plots, which might take a couple of days.
That depends on what these 10 new regions are. If they are problematic regions prone to show false positives, then this is probably not too worrying. Can you upload the coordinates of the regions and the genome version? I would like to know if they are in my personal blacklists, just out of curiosity.
I think this is a common misconception about what reproducibility should really mean. The goal shouldn't be to get the exact same result, since that isn't guaranteed even when running deterministic software in a Docker container on different systems. The goal instead should be to get within some reasonable epsilon of the same results while allowing the same conclusions to be drawn.
I would say that this epsilon should really depend on the area of research; while for experimental biology of course nothing can be 100% reproducible, for the bioinformatics part I guess it should be really close to 0.
It also depends on what the conclusion is; if it is the number of open-chromatin peaks, then it should be exactly the same :)
I guess this is just how NGS aligners work (they make assumptions/programmatic compromises, and you make some additional choices when you run them).
Brian Bushnell (developer of the BBMap suite) said in a past conversation:
BBMap is always nondeterministic when run in paired-end mode with multiple threads, because the insert-size average is calculated on a per-thread basis, which affects mapping; and which reads are assigned to which thread is nondeterministic. The only way to avoid that would be to restrict it to a single thread.
Something similar is likely happening here.
BBMap does provide a way to run deterministically:
deterministic=f Run in deterministic mode. In this case it is good
to set averagepairdist. BBMap is deterministic
without this flag if using single-ended reads,
or run singlethreaded.
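A hedged example built from that help text (the reference and read file names are placeholders, and the pair distance of 300 is a made-up value to be replaced with your library's):
$ bbmap.sh ref=ref.fa in=reads_1.fq.gz in2=reads_2.fq.gz out=mapped.sam deterministic=t averagepairdist=300 threads=8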
How many more reads did it map?
It mapped about 50,000 more reads.
Please provide a reproducible example.
I am using version 0.7.12 now, and it still has the same problem!