BWA mem output inconsistent on same but re-ordered FASTQ input
3
1
Entering edit mode
6.1 years ago
ramy ▴ 10

Hello, I've noticed that bwa mem outputs results to be inconsistent under following setting. I'd have expected this to be not the case, and would appreciate if someone can comment on why this is happening. I have tested this with both bwa mem v0.7.5 and 0.7.15.

Steps to reproduce:

1. Create multiple FASTQ paired-end files from a reference FASTQ set of files. So, if there are 10000 read in our reference FASTQ files (s1.fastq, and s2.fastq), I generate another set of fastq files which also have the same 10000 reads but in a different order. I used the following script to do so.
#!/usr/bin/env python
import Bio
import Bio.SeqIO
import sys
import random

file1 = sys.argv[1]
file2 = sys.argv[2]

record1 = []
record2 = []

for s in Bio.SeqIO.parse(file1,"fastq"):
record1.append(s)

for s in Bio.SeqIO.parse(file2,"fastq"):
record2.append(s)

# Lets make sure both files have same number records
assert(len(record1) == len(record2))

rand_idx = random.sample(xrange(len(record1)), len(record1))

r1 = open("reorder1.fastq", "w")
r2 = open("reorder2.fastq", "w")

for idx in rand_idx:
Bio.SeqIO.write(record1[idx], r1, "fastq")
Bio.SeqIO.write(record2[idx], r2, "fastq")

r1.close()
r2.close()

print("Wrote %d records" % len(record1))


This will say output the following files:

$ls reorder1.fastq reorder2.fastq s1.fastq s2.fastq shuffle_fastq.py  The original files were s1.fastq, s2.fastq. The newly generated shuffled files are reorder1.fastq, reorder2.fastq. I made sure reorder1.fastq has all the entries in s1.fastq (and the same for reorder2.fastq/s2.fastq) using this: #!/usr/bin/env python import Bio import Bio.SeqIO import sys file1 = sys.argv[1] file2 = sys.argv[2] results = [] records = {} for s in Bio.SeqIO.parse(file1,"fastq"): records[s.id] = s for s in Bio.SeqIO.parse(file2,"fastq"): if not records[s.id].seq == s.seq: results.append(s) print "Unmatched records", len(results)  1. Run BWA mem on the output FASTQ file pairs r1.fastq/r2.fastq and s1.fastq / s2.fastq -- $ bwa mem -R "@RG\tID:foo\tLB:bar\tPL:illumina\tPU:illumina\tSM:ERR000589" /share/BWA_Index/hg19.all_chr.fa s1.fastq s2.fastq > s.sam

$bwa mem -R "@RG\tID:foo\tLB:bar\tPL:illumina\tPU:illumina\tSM:ERR000589" /share/BWA_Index/hg19.all_chr.fa reorder1.fastq reorder2.fastq > reorder.sam # Sort sam files so we can compare them with diff$ cat reorder.sam | samtools view -Sb - | samtools sort - reorder_sort && samtools view reorder_sort.bam > reorder_sort.sam
\$ cat s.sam | samtools view -Sb - | samtools sort - s_sort && samtools view s_sort.bam > s_sort.sam

1. Compare the output sam files s.sam and r.sam, and I see different alignments!!

diff reorder_sort.sam s_sort.sam

Is anyone aware of why this could happen? I've tried this experiement several times with the following results:

2500 135
12500 660
125000 6830
bwa mem alignment bwa Assembly tool • 5.8k views
1
Entering edit mode

Take a look at: bwa mem multiple alignments doesn't work Output of NGS aligners may be non-deterministic.

7
Entering edit mode
6.1 years ago
lh3 33k

There are two sources of randomness. The first is caused by the insert size estimate. Different batches of reads have slightly different insert size distribution, which affects paired-end alignment. You can in principle feed the insert size distribution to bwa-mem, but the bwa-mem estimate is usually better than a naive estimate and is at least more convenient.

The second source is the "random" placement of repetitive hits. Bwa-mem is not really using a random number generator. It instead hashes deterministic information, such that the i-th read in the input is always associated with the same hash and has the same mapping. Reordering obviously breaks the deterministic behavior. I later learned that Bowtie2 hashes read names instead of the index of read. That is usually a better strategy except when you change read names.

1
Entering edit mode
6.1 years ago

As far as I remember fast aligners have to be stochastic, to overcome this we use local realignments and store multiple best hits. Please correct me if I am wrong, but even the genome indexing process that is before the alignment has fluctuations between different computers.

Also from bioinformatics standpoint could you please explain why 130 reads out of 10k bother you, unless there is a location bios for their distribution (for example you have highly repetitive region where your reads can match to multiple locations with almost the same score. To avoid this it is a good practice to remove/mask such regions before the alignment. There was GRCh37.light version of hg19 assembly that was quite popular and was designed with such short reads alignment issues in mind. It was a long time ago I used it, but if you will not be able to find it, please let me know and I will look in my archives.

2
Entering edit mode

Please correct me if I am wrong, but even the genome indexing process that is before the alignment has fluctuations between different computers.

I certainly hope that isn't the case! The only things I am aware of in bwa-mem that are nondeterministic/order-specific are the random choice of multimapping read primary alignment (which doesn't really matter) and the insert size estimation. Insert-size estimation, which affects placement of paired reads, apparently is calculated from some of the initial reads in the file. Or, that's what I've been told. So, reordering the reads would give a slightly different average insert size and thus potentially lead to different best site assignments. It should not affect single-ended data.

BBMap is also nondeterministic/order-specific when dealing with paired reads due to insert-size calculation, though it maintains a running average so even if the first, say, 1000 reads have a weird insert size, it will eventually stabilize. This is not an easy problem to solve - the insert size distribution affects the mapping, and the mapping determines the insert-size distribution. As a result, you can make things deterministic by mapping with some fixed initial insert size estimate, then mapping again with a revised estimate, and iterate until the estimate stops changing (if it does stop changing). But who's going to do that? It's not even guaranteed to converge. I postulate that there's no provably optimal, deterministic solution that runs in non-exponential time (considering there are an infinite number of insert size distributions, and even the average is a real number rather than an integer). Therefore, if you want the best possible accuracy, which you get from paired-end aligners using insert-size estimates, you have to put up with nondeterminism.

0
Entering edit mode

I mean even if 130 reads out of 10k reads difference is a big deal for you and there is no bios, you can run this analysis multiple times and see if distribution is random and than you can correct for it considering it as noise.

0
Entering edit mode

And by the way bwa 0.7.5a had a bug in it confirmed by developers that caused different results even on different threads of the same computer. Do not know if it was corrected in 0.7.15

0
Entering edit mode
6.1 years ago
ramy ▴ 10

I had to correct my post above. I am seeing inconsistency rates of over 5% instead of 1% as originally claimed. I counted each line of FASTQ as a record (i.e. 10000), whereas its really only 2500 records. I've seen similar rates with other datasets:

2500 135
12500 660
125000 6830
0
Entering edit mode

Right... again, that's not exactly unexpected, so why are you surprised? 10000 is too small a number of reads to use for statistics, particularly if you did not sample them randomly from the dataset (which you did not state).

Also, please post your responses as a response, rather than an answer, to keep posts logically organized.

0
Entering edit mode

Its not so much the number of reads which is surprising, but the fact that this is happening at all. The above comments on the inner workings of bwa explains the behavior. Thank you everyone!