Question: BWA MEM mapping issue
gravatar for marina.v.yurieva
3.5 years ago by
Farmington, CT
marina.v.yurieva430 wrote:

I'm mapping two samples independently (PE reads) using bwa-mem:

bwa mem -t 3 -M ref.fasta sample1_R1.fastq sample1_R2.fastq > sample1.sam (2121026 mapped reads)
bwa mem -t 3 -M ref.fasta sample2_R1.fastq sample2_R2.fastq > sample2.sam (2485112 mapped reads)

But when I pool reads from two samples together and map them the number of mapped reads is higher: 5315859 reads vs 4606138. Why is this happening? Don't the reads map independently? 

bwa next-gen alignment • 1.8k views
ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by marina.v.yurieva430

Usually things like this end up being due to the use of a random number generator at some point during the seeding step.

ADD REPLYlink written 3.5 years ago by Devon Ryan81k

"Don't the reads map independently?" I would think so myself... @Devon Isn't the difference 53M vs 46M a bit too large to be due to randomness?

ADD REPLYlink written 3.4 years ago by dariober9.3k

"Don't the reads map independently?" well at least, there are some calculations about the median size of the inserts in a pool of reads.


 -a INT     Maximum insert size for a read pair to be considered being mapped properly. Since 0.4.5, this option is only used when there are not enough good alignment to infer the distribution of insert sizes. [500]
ADD REPLYlink written 3.4 years ago by Pierre Lindenbaum110k

Pierre, but this is sampe option, not mem...

ADD REPLYlink written 3.4 years ago by marina.v.yurieva430

I think there is the same algorithm in mem:

[M::main_mem] read 200 sequences (20000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (42, 0, 0, 37)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (15, 23, 30)
[M::mem_pestat] low and high boundaries for computing mean and (1, 60)
ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Pierre Lindenbaum110k

Yeah, I'm pretty sure it infers the insert size (and library type) from the data. So different data could yield different results. However, it shouldn't be that different unless one of the datasets is just not being handled well and the other is helping bwa infer things (or, inversely, one dataset is just borking the inference step and then throwing everything off). Doesn't picard have an command that will give things like insert size distribution? It'd be interesting to run that on the two SAM files produced individually to see if they have very different insert size distributions.

ADD REPLYlink written 3.4 years ago by Devon Ryan81k

I think there is something wrong with my fastq files. I subsampled them and did the same analysis and everything adds up and the distribution of insert sizes looks good. But for the original files the distribution plot shows very long insert size (~8000) and shows it as "Tandem" instead of "FR". And the distribution plot on two original samples pooled together looks good. Looks like bwa has a problem with my original fastq files...

ADD REPLYlink written 3.4 years ago by marina.v.yurieva430

check if you have some empty reads in your FASTQ.

check the number of reads in each FASTQ R1/R2: numbers should be the same.

ADD REPLYlink written 3.4 years ago by Pierre Lindenbaum110k

Yeah, that does seem a bit over the top to me too. Perhaps marina.v.yurieva can post the file somewhere just so this can be confirmed.

ADD REPLYlink written 3.4 years ago by Devon Ryan81k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1383 users visited in the last hour