Finding All Sub-Optimal Alignments With Bwa
1
0
Entering edit mode
9.7 years ago
pan.alexiou ▴ 10

I am trying to align some small sequence reads to a genome using bwa. For this specific project I need to have all mapping positions returned were the read maps with less than a given number of mismatches (this could be absolute number or % of read length).

So for example if a read maps perfectly in 10 locations, with 1 mismatch in 100 more and 2 mismatches in 200 more I would want all 310 locations returned. I want to do that for as many mismatches allowed by my computational power (if I find the right switch I can try with different numbers).

Reading the bwa manual and some initial testing didn't help much:

I tried changing the -n flag to different sizes (I have 30nt reads on average) and the results show that the file size is not increasing as expected (foo.$n.sai): 1.5K foo.1.sai 3.2K foo.2.sai 5.3K foo.3.sai 21K foo.5.sai 12K foo.7.sai 11K foo.10.sai 11K foo.15.sai 11K foo.30.sai 11K foo.40.sai  what I find especially confusing is that it peaks at$n=5 and then drops again.

looking inside the file (I convert it to sam) I see that the n=5 file has alignments up to a distance of 5, but then=7 file has alignments (for the same read) up to distance of 2. How does this work?

I am running this:

bwa aln -t 8 -n $i -N -l 100000 genome.fa foo.fa > foo/foo.$i.sai


for different lengths. If I take out the -N flag then it just returns alignments 1 step worse than the perfect one (or so I believe from looking at the results). the -l flag is supposed to override the "seed" functionality.

Any ideas? How could I get alignments with 7 mismatches let's say?

bwa alignment • 4.3k views
0
Entering edit mode
9.7 years ago

the -n flag corresponds to the the maximum edit distance and I believe that it includes gaps as well. Also you should generate and investigate the actual alignments rather than looking at file sizes. It is overly simplistic to extrapolate from that to actual alignments.

I would recommend that you use bowtie (version 1) for your purpose. That aligner will not match indels to begin with so it is better suited to what you need.

0
Entering edit mode

looking inside the file (I convert it to sam) I see that the n=5 file has alignments up to a distance of 5, but then=7 file has alignments (for the same read) up to distance of 2. So my question is how does this work ok up to n=5 and then gives even less alignments for the same read for n=7.

0
Entering edit mode

you need to understand that many of these tools (high throughput mappers) were designed to find the best exact or near exact matches at very high speeds and with great efficiency. Attempting to use them for other purposes is not recommended. There are many internal optimization and decisions applied that will not make sense in your case. For example when you select a very high edit distance I is possible that so many of your reads could map to various locations that the tool will discard most reads as being non-representative because in the normal and intended usage such a choice would be greatly desired.

If you need to align the way you seem to want to you will need to use a different tool, one that that at least has some characteristics that fit the problem. For example like I mentioned bowtie 1 does not deal with gaps therefore you may have better luck (and of course fewer results to deal with). Or you may need to choose a different type of aligner like blast, lastz or something completely different.

0
Entering edit mode

"that so many of your reads could map to various locations that the tool will discard most reads as being non-representative because in the normal and intended usage such a choice would be greatly desired."

I didn't see this talked about anywhere in the bwa manual so I assumed that all alignments were returned. When I convert to .sam I am asking for 10,000 alignments to be returned and none of the sequences reaches this number. What confuses me most is that in any case I would expect n=7 to at least return everything found for n=5, which is not the case. Also, in fact I want gaps and mismatches, so I guess I will check out blast or something --- open to suggestions.

0
Entering edit mode

You can't push a tool outside its operating characteristics - but I do agree with you that usually that information is not readily available. It is because it is very difficult to ascertain/describe at what extremes will make a tool not work properly. As a rule don't add extreme values to parameters because it won't work and even worse sometimes you may not even notice that it does not work.

Here is a simple example create a read that contains a single base say A. Now which aligner can find all the alignments of A against the human genome. We'll most likely none will work because the problem is radically different from what a short read aligner is supposed to do, yet one could just as well argue that it is so easy to align a single base against a genome that any aligner should support that by default. But they don't because the problem is not what people use these tools for.

0
Entering edit mode

Yeah, I guess I didn't consider 7 being an extreme value for bwa. Oh well. I guess I will have to use BLAST to do it or something. Do you have any idea of other programs? I was reading bowtie 2 and it has a flag that supposedly reports all alignments but I guess it falls in the same category as bwa and it will be a similar disappointment.