Question

Finding All Sub-Optimal Alignments With Bwa

0

Entering edit mode

11.1 years ago

pan.alexiou ▴ 10

I am trying to align some small sequence reads to a genome using bwa. For this specific project I need to have all mapping positions returned were the read maps with less than a given number of mismatches (this could be absolute number or % of read length).

So for example if a read maps perfectly in 10 locations, with 1 mismatch in 100 more and 2 mismatches in 200 more I would want all 310 locations returned. I want to do that for as many mismatches allowed by my computational power (if I find the right switch I can try with different numbers).

Reading the bwa manual and some initial testing didn't help much:

I tried changing the -n flag to different sizes (I have 30nt reads on average) and the results show that the file size is not increasing as expected (foo.$n.sai):

1.5K  foo.1.sai
3.2K  foo.2.sai
5.3K  foo.3.sai
21K   foo.5.sai
12K   foo.7.sai
11K   foo.10.sai
11K   foo.15.sai
11K   foo.30.sai
11K   foo.40.sai

what I find especially confusing is that it peaks at $n=5 and then drops again.

looking inside the file (I convert it to sam) I see that the $n=5 file has alignments up to a distance of 5, but the $n=7 file has alignments (for the same read) up to distance of 2. How does this work?

I am running this:

bwa aln -t 8 -n $i -N -l 100000 genome.fa foo.fa > foo/foo.$i.sai

for different lengths. If I take out the -N flag then it just returns alignments 1 step worse than the perfect one (or so I believe from looking at the results). the -l flag is supposed to override the "seed" functionality.

Any ideas? How could I get alignments with 7 mismatches let's say?

bwa alignment • 4.7k views

ADD COMMENT • link updated 9.9 years ago by Biostar 20 • written 11.1 years ago by pan.alexiou ▴ 10

score 0 · Answer 1 · 2013-03-26

0

Entering edit mode

11.1 years ago

Istvan Albert 100k

the -n flag corresponds to the the maximum edit distance and I believe that it includes gaps as well. Also you should generate and investigate the actual alignments rather than looking at file sizes. It is overly simplistic to extrapolate from that to actual alignments.

I would recommend that you use bowtie (version 1) for your purpose. That aligner will not match indels to begin with so it is better suited to what you need.

ADD COMMENT • link 11.1 years ago by Istvan Albert 100k

0

Entering edit mode

looking inside the file (I convert it to sam) I see that the $n=5 file has alignments up to a distance of 5, but the $n=7 file has alignments (for the same read) up to distance of 2. So my question is how does this work ok up to n=5 and then gives even less alignments for the same read for n=7.

ADD REPLY • link 11.1 years ago by pan.alexiou ▴ 10

0

Entering edit mode

you need to understand that many of these tools (high throughput mappers) were designed to find the best exact or near exact matches at very high speeds and with great efficiency. Attempting to use them for other purposes is not recommended. There are many internal optimization and decisions applied that will not make sense in your case. For example when you select a very high edit distance I is possible that so many of your reads could map to various locations that the tool will discard most reads as being non-representative because in the normal and intended usage such a choice would be greatly desired.

If you need to align the way you seem to want to you will need to use a different tool, one that that at least has some characteristics that fit the problem. For example like I mentioned bowtie 1 does not deal with gaps therefore you may have better luck (and of course fewer results to deal with). Or you may need to choose a different type of aligner like blast, lastz or something completely different.

ADD REPLY • link 11.1 years ago by Istvan Albert 100k

0

Entering edit mode

"that so many of your reads could map to various locations that the tool will discard most reads as being non-representative because in the normal and intended usage such a choice would be greatly desired."

I didn't see this talked about anywhere in the bwa manual so I assumed that all alignments were returned. When I convert to .sam I am asking for 10,000 alignments to be returned and none of the sequences reaches this number. What confuses me most is that in any case I would expect n=7 to at least return everything found for n=5, which is not the case. Also, in fact I want gaps and mismatches, so I guess I will check out blast or something --- open to suggestions.

ADD REPLY • link 11.1 years ago by pan.alexiou ▴ 10

0

Entering edit mode

You can't push a tool outside its operating characteristics - but I do agree with you that usually that information is not readily available. It is because it is very difficult to ascertain/describe at what extremes will make a tool not work properly. As a rule don't add extreme values to parameters because it won't work and even worse sometimes you may not even notice that it does not work.

Here is a simple example create a read that contains a single base say A. Now which aligner can find all the alignments of A against the human genome. We'll most likely none will work because the problem is radically different from what a short read aligner is supposed to do, yet one could just as well argue that it is so easy to align a single base against a genome that any aligner should support that by default. But they don't because the problem is not what people use these tools for.

ADD REPLY • link 11.1 years ago by Istvan Albert 100k

0

Entering edit mode

Yeah, I guess I didn't consider 7 being an extreme value for bwa. Oh well. I guess I will have to use BLAST to do it or something. Do you have any idea of other programs? I was reading bowtie 2 and it has a flag that supposedly reports all alignments but I guess it falls in the same category as bwa and it will be a similar disappointment.

ADD REPLY • link 11.1 years ago by pan.alexiou ▴ 10