I am trying to align some small sequence reads to a genome using bwa. For this specific project I need to have all mapping positions returned were the read maps with less than a given number of mismatches (this could be absolute number or % of read length).
So for example if a read maps perfectly in 10 locations, with 1 mismatch in 100 more and 2 mismatches in 200 more I would want all 310 locations returned. I want to do that for as many mismatches allowed by my computational power (if I find the right switch I can try with different numbers).
Reading the bwa manual and some initial testing didn't help much:
I tried changing the -n flag to different sizes (I have 30nt reads on average) and the results show that the file size is not increasing as expected (foo.$n.sai):
1.5K foo.1.sai 3.2K foo.2.sai 5.3K foo.3.sai 21K foo.5.sai 12K foo.7.sai 11K foo.10.sai 11K foo.15.sai 11K foo.30.sai 11K foo.40.sai
what I find especially confusing is that it peaks at $n=5 and then drops again.
looking inside the file (I convert it to sam) I see that the $n=5 file has alignments up to a distance of 5, but the $n=7 file has alignments (for the same read) up to distance of 2. How does this work?
I am running this:
bwa aln -t 8 -n $i -N -l 100000 genome.fa foo.fa > foo/foo.$i.sai
for different lengths. If I take out the -N flag then it just returns alignments 1 step worse than the perfect one (or so I believe from looking at the results). the -l flag is supposed to override the "seed" functionality.
Any ideas? How could I get alignments with 7 mismatches let's say?