I am parallelizing a mutation search program. The program calls BWA for alignment. It then takes the starting position of every aligned read from the SAM file produced by BWA, retrieves the reference subsequence starting at that position, and continues working on that subsequence to search for mutations.
My problem is that the program was written to handle any number of chromosomes in the input reference. However, testing showed that it works correctly only for single-chromosome references; when the reference has several chromosomes, the program crashes.
Upon inspection I figured out the problem. My first easy fix was to concatenate the chromosomes (there will be some loss of precision, but it is negligible). However, if the size of the concatenated chromosome exceeds ~150 MB, BWA fails, or rather I think the BWT indexing fails, since the index files come out empty.
----My first question is: is there a way to allow a larger chromosome size? Is the problem in the BWT indexing or in BWA itself?
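To make the concatenation workaround concrete, here is a minimal sketch of the bookkeeping it needs: a table of cumulative offsets that maps a 1-based position in the concatenated sequence back to a chromosome name and a 1-based position within it. It assumes the chromosomes are joined end to end with no spacer between them. The names `build_offsets` and `to_chrom_coords` are hypothetical and not part of BWA.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def build_offsets(lengths: Sequence[Tuple[str, int]]) -> Tuple[List[str], List[int], int]:
    """Return chromosome names, their 0-based start offsets, and the total length.

    `lengths` lists (name, length) pairs in the order the chromosomes
    were concatenated (assumption: no spacer between them).
    """
    names: List[str] = []
    starts: List[int] = []
    total = 0
    for name, length in lengths:
        names.append(name)
        starts.append(total)
        total += length
    return names, starts, total


def to_chrom_coords(names: List[str], starts: List[int], total: int,
                    concat_pos: int) -> Tuple[str, int]:
    """Map a 1-based position in the concatenated sequence to (chromosome, 1-based position)."""
    if concat_pos < 1 or concat_pos > total:
        raise ValueError("position outside the concatenated reference")
    # Find the last chromosome whose start offset is <= the 0-based position.
    i = bisect_right(starts, concat_pos - 1) - 1
    return names[i], concat_pos - starts[i]


if __name__ == "__main__":
    names, starts, total = build_offsets([("chr1", 10), ("chr2", 5)])
    print(to_chrom_coords(names, starts, total, 11))
```

A read that spans the junction between two chromosomes would still need separate handling; this sketch only converts coordinates.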
My other option was to retrieve the reference subsequences using the positions given in the SAM file. However, if the reference has more than one chromosome, the subsequence retrieved at the position reported by BWA differs from the reference subsequence that BWA actually aligned the read to. Again, we get the correct subsequence only if the reference has a single chromosome.
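For context, the SAM format reports POS (column 4) as a 1-based coordinate relative to the sequence named in RNAME (column 3), not relative to the start of the whole reference file. Below is a minimal sketch of a lookup keyed on both fields. It assumes a plain multi-record FASTA reference that fits in memory, and that the first word of each FASTA header is the name that appears in RNAME. The helper names `read_fasta`, `sam_positions` and `fetch` are hypothetical; they are not part of BWA or of the program described here.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {sequence name: sequence}."""
    seqs: Dict[str, str] = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Assumption: the name is the first whitespace-delimited word of the header.
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def sam_positions(lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Yield (read name, RNAME, POS) for every mapped record in a SAM stream."""
    for line in lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory fields
            continue
        rname, pos = fields[2], int(fields[3])
        if rname == "*" or pos == 0:  # no reference position recorded
            continue
        yield fields[0], rname, pos


def fetch(ref: Dict[str, str], rname: str, pos: int, length: int) -> str:
    """Return `length` bases of chromosome `rname` starting at 1-based `pos`."""
    start = pos - 1  # convert to a 0-based Python index
    return ref[rname][start:start + length]


if __name__ == "__main__":
    ref = read_fasta([">chr1 first", "ACGTACGT", "AC", ">chr2", "TTTTGGGGCC"])
    sam = ["@SQ\tSN:chr1\tLN:10", "r1\t0\tchr2\t5\t60\t4M\t*\t0\t0\tGGGG\tIIII"]
    for read, rname, pos in sam_positions(sam):
        print(read, rname, pos, fetch(ref, rname, pos, 4))
```

Indexing the whole reference file with POS alone, while ignoring RNAME, would give a matching subsequence only when the reference has a single chromosome.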
I read about BWT indexing, but I am not sure I can implement the indexing in a way that matches the actual algorithm. I think there are several decisions I would need to make, such as how to handle N's in the reference and how to index several chromosomes.
----My second question is: is there a way to retrieve the correct subsequence from the reference using the exact position given in the SAM file?
Any suggestions or helpful links would be appreciated.