Question

BWA with multiple references

2

Entering edit mode

8.9 years ago

Aurélien Guy-Duché ▴ 60

Hello everyone,

As I was playing with BWA, I came across something strange.

I made a ref.fasta file with two copies of the same genome sequence (I changed the second copy's name).

I was expecting to see the reads mapped on one of the two references and the alignment on the other one displayed in the XA tag.

What's strange is that, for some reads, the first in pair was mapped on one reference and its mate on the other, breaking down the pair.

Example:

ERR656485.14    81    gi|9629357|ref|NC_001802.2|    31    0    151
S129M1D20M    gi|9629357|ref|NC_001802.1|    26    0    TTGTATGTATT
ACATGATGACGGTGATCGTGTTTGACAGTGTCTTCATCATTGTTTGTTTTTTTTTTTTTTTCAAGCAGAAGACGG
CATACGAGATCCTCTATCGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTAGCCCGGGAGCTCTC
TGGCTATCTAGGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTTAAGTAGTGTGTGCCCGTCT
GTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCAGTTTGGTAGTGTGGAAAATCTCTAGCN    ),.
)))))))))))-)((((((((.-,(((.)))).))((:.)).)444)(.(,,.(,-(-4FFB>FFFFFFFFFFFF
FFFFFFF7FDDFGCC4*FGGGGGGGGGFECGGGGECGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCB8!    NM:i:7    AS:i:116    XS:i:116
ERR656485.14    161    gi|9629357|ref|NC_001802.1|    26    0    134
M1D20M146S    gi|9629357|ref|NC_001802.2|    31    0    NGCCCGGGAGC
TCTCTGGCTATCTAGGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTTAAGTAGTGTGTGCCC
GTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCAGTTTGGTAGTGTGGAAAATCTCTAGCAAGATCGG
AAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAAAACACTAACA
CAGCACCCAGCAGAGAGTAACTACATGGCACAGACTAATACAGTAGCAGTCACACACAACACAT    !8A
CCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGCFGGGFDGGGGGGFGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGDDGGFGGFGGGFFFGFFGFFFFDFFDFFFFFFFF5@ACAEFFFFFFFFB>@(,(,(,
,((),(().4(((((-((((((((())).)-)))),,()(((()))))))))))--.)(-)-((-((4((,)    NM:i:8    AS:i:117    XS:i:117

Pierre Lindenbaum and I did some tests and found that if we shuffled the fastqs, some paired reads were back to what was expected, except for the flags (0x2 unset) and mapq (0).

Example:

ERR656485.14    81    gi|9629357|ref|NC_001802.2|    31    0    151
S129M1D20M    =    26    -155    TTGTATGTATTACATGATGACGGTGATCGTGTTTG
ACAGTGTCTTCATCATTGTTTGTTTTTTTTTTTTTTTCAAGCAGAAGACGGCATACGAGATCCTCTATCGAGATC
GGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTAGCCCGGGAGCTCTCTGGCTATCTAGGGAACCCACTGCT
TAAGCCTCAATAAAGCTTGCCTTGAGTGCTTTAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAG
ATCCCTCAGACCAGTTTGGTAGTGTGGAAAATCTCTAGCN    ),.)))))))))))-)((((((((.-,
(((.)))).))((:.)).)444)(.(,,.(,-(-4FFB>FFFFFFFFFFFFFFFFFFF7FDDFGCC4*FGGGGGG
GGGFECGGGGECGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCB8!    NM:i:7    MD:Z:16A48C
55C0T3A2^C19A0    AS:i:116    XS:i:116    XA:Z:gi|9629357|ref|NC_0018
02.1|,-31,151S129M1D20M,7;
ERR656485.14    161    gi|9629357|ref|NC_001802.2|    26    0    134
M1D20M146S    =    31    155    NGCCCGGGAGCTCTCTGGCTATCTAGGGAACCCAC
TGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTTAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACT
AGAGATCCCTCAGACCAGTTTGGTAGTGTGGAAAATCTCTAGCAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGT
GTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAAAACACTAACACAGCACCCAGCAGAGAGTAACTAC
ATGGCACAGACTAATACAGTAGCAGTCACACACAACACAT    !8ACCGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGFGGGCFGGGFDGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDDGGFG
GFGGGFFFGFFGFFFFDFFDFFFFFFFF5@ACAEFFFFFFFFB>@(,(,(,,((),(().4(((((-((((((((
())).)-)))),,()(((()))))))))))--.)(-)-((-((4((,)    NM:i:8    MD:Z:0A3T16
A48C55C0T3A2^C20    AS:i:117    XS:i:117    XA:Z:gi|9629357|ref
|NC_001802.1|,+26,134M1D20M146S,8;

Does someone have an explanation for that?

Did I do something wrong?

Code of the tests.

Aurélien

bwa • 5.5k views

ADD COMMENT • link updated 8.9 years ago by h.mon 35k • written 8.9 years ago by Aurélien Guy-Duché ▴ 60

0

Entering edit mode

Note: Aurelien is my internship student. He is working on a blast2sam software ( https://github.com/guyduche/Blast2Bam ). When he compared the output of Blast , he was surprised to see some "properly paired-reads" with blast but not with bwa ( read & mate mapped on different contigs) . The reference used was the sequence of HIV duplicated with two different names.

ADD REPLY • link 8.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

How much is some? Is it 50%:50% for paired:broken? Or something like 95%:5%?

ADD REPLY • link 8.9 years ago by h.mon 35k

0

Entering edit mode

I did a samtools flagstat on the tests results.

1 ref:

822480 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
554230 + 0 mapped (67.39%:-nan%)
822480 + 0 paired in sequencing
411240 + 0 read1
411240 + 0 read2
551692 + 0 properly paired (67.08%:-nan%)
553110 + 0 with itself and mate mapped
1120 + 0 singletons (0.14%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

2 refs:

822480 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
523806 + 0 mapped (63.69%:-nan%)
822480 + 0 paired in sequencing
411240 + 0 read1
411240 + 0 read2
0 + 0 properly paired (0.00%:-nan%)
518112 + 0 with itself and mate mapped
5694 + 0 singletons (0.69%:-nan%)
180592 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

2 refs with fastq shuffled:

822480 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
523806 + 0 mapped (63.69%:-nan%)
822480 + 0 paired in sequencing
411240 + 0 read1
411240 + 0 read2
0 + 0 properly paired (0.00%:-nan%)
518112 + 0 with itself and mate mapped
5694 + 0 singletons (0.69%:-nan%)
180282 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

ADD REPLY • link 8.9 years ago by Aurélien Guy-Duché ▴ 60

Ram · Answer 1 · 2015-07-12

0

Entering edit mode

8.9 years ago

h.mon 35k

This is just a guess, as I don't know the inner workings of BWA in detail. But when BWA finds multiple mapping locations it assigns the read randomly to one position. With two references, there are no "properly paired" reads as all reads map to multiple locations, and BWA assigns randomly the primary alignment. The "rescue" you saw was probably random variation.

I am more intrigued by the rise in singletons, as I can't find any explanation for this.

edit: what is the expected ratio of "mapping to same chromosome" and "mapping to different chromosome" under a random assignment scenario? Does it fit your read mappings?

ADD COMMENT • link 8.9 years ago by h.mon 35k

1

Entering edit mode

On that note, I think this paper might be of some help

http://www.biomedcentral.com/1471-2105/15/S16/S15

someone did a detail analysis of these kind of situation where multiple mapping happens

ADD REPLY • link 8.9 years ago by Sam ★ 4.7k

1

Entering edit mode

Indeed. Quoting from the paper:

"By further examining the implementation of BWA we found that in the case of non-uniquely mapped reads, the algorithm is set to report one alignment randomly chosen for each read. This selection is not entirely random, as we found out, since for the same FASTQ file a repeated execution would always produce the same results. The seed for the random number generator is fixed so the random numbers chosen are always the same. This means that in the case where the reads in the input sequence files are at a different position in the file [...] the reported alignment for each non-uniquely mapped read will be different."

ADD REPLY • link 8.9 years ago by h.mon 35k

0

Entering edit mode

@Sam: Thank you for the paper, it will help me test my program.

@h.mon: Ok, I understand that with BWA, when we shuffle the input, we get different results and that if the read is mapped with the same mapq at different locations, the reference displayed in the alignment is chosen randomly between the multiple locations. What I don't understand is why it is happening before the pairing of the two reads, therefore potentially breaking pairs.

You're right, I didn't see that before but the rise in singletons is strange too.

I didn't understand your question. What do you mean by random assignment scenario, fastQ reshuffling?

ADD REPLY • link 8.9 years ago by Aurélien Guy-Duché ▴ 60

0

Entering edit mode

The mapping quality of reads mapping to multiple locations is 0, and BWA therefore decides it can't reliably map it as a pair, too - look at the 0 + 0 properly paired (0.00%:-nan%), even though a lot of your reads mapped to the same reference. My guess it is a design decision (probably a wise one, but in your aberrant case leading to strange results), better would be ask its author.

My hypothesis is: in your case with two references, all reads have two "correct" mapping positions, but are being assigned one of them randomly (and independently of its pair position). Under this scenario, we have an expectation about the proportion of reads mapping to the same reference, and the proportion mapping to a different reference, in relation to the total of reads mapping. If my very basic and poor probability skills are correct, you should have 2/3 mapping to the same reference, and 1/3 mapping to a different reference. My question was, does the mapping you observed conforms to this expectation?

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 8.9 years ago by h.mon 35k