I have short reads that I would like to align on a bacterial genome. I need to know the exact identity between the reads and the reference genome, even if there are mismatches or gaps near the 5' or 3' ends.
To be sure to make myself clear: on this 20nt sequence (MM in lowercase) I would expect an identity of 16/20= 80%
I tried with BWA but for a 36bp read I can only have up to max of 5 mismatches, it doesnt seem to be able to give the identity for more MM/gaps (is there anyway to do that ?)
I am thinking of using a global aligner (free license) that should also work correctly on relatively short reads (36bp). If this aligner could output an easy-to-parse output file that would be X-mas for me.
Also, I tried Exonerate but I cannot figured out how to make it work the way I want (the options --exhaustive --model affine:global do not seem to work ...).
Any help is appreciated.
You will need a more sensitive tool than bwa, but this comes for the price of increased run time. How many reads do you have? You can nowadays run exact algorithms on up to a few thousand reads, but if it is more than that a heuristic approach is required.
I have 13 million reads....