6.5 years ago
tianjunz
Hi everyone, I have recently been using BWA-MEM to align reads to the human genome. I found that when a hit with a higher score (hit A) falls within the extension of another hit (hit B) that was computed previously, BWA-MEM doesn't perform the extension for hit A. However, in the final output, BWA-MEM reports the alignment for hit A together with the score for hit B. I am confused by this behavior.
Any inputs would be appreciated. Thanks!
I have no idea what you're trying to get at. Could you show the specific reads and scores you're looking at? In particular, what do you mean by a hit B that was "computed previously"?
Thanks for replying. This is the specific read I am looking at:
However, I found that BWA-MEM didn't actually extend the hit at position 13588261. Instead, the output is:
It simply omits the hit at position 13588260, since it is contained in an existing alignment. However, the AS score for the alignment of the hit at position 13588260 is not 58. I disabled the hit-merging function, and here is the new output:
So I am totally confused.
Sorry, I'm still not fully on board, and I may never be. I still recommend that you add that information to your original question, including the commands you're using.
It is a somewhat confusing situation!
I know that BWA breaks the reads up into segments and then tries to align them piece by piece. If one segment aligns, it will then extend this 'alignment seed' to see if the next segments in the read also align sequentially.
In your example above, the initial segment is 'GGAAGGAA', but (I believe) BWA then gets confused because this segment occurs multiple times in the final aligned read. We're dealing with the very intricate details of BWA here, which may be better answered on Heng Li's GitHub page (or by contacting him directly).
In your situation above, I think that BWA is getting confused because you're dealing with a very long and difficult repeat. No aligner will be capable of faithfully aligning this based on short (~150bp) reads. You'd require longer reads up to 1000bp.
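As a rough illustration of why a seed inside a repeat is ambiguous, the sketch below counts exact occurrences of the seed 'GGAAGGAA' in a made-up reference. The flanking sequence, the repeat length, and the helper name `occurrences` are invented for this example and have nothing to do with the actual region in the question.

```python
def occurrences(seq, seed):
    """Return the start positions of every (possibly overlapping) exact match."""
    hits = []
    j = seq.find(seed)
    while j != -1:
        hits.append(j)
        j = seq.find(seed, j + 1)
    return hits


# Hypothetical reference: a unique flank, a 10-copy GGAA tandem repeat,
# and the reversed flank on the other side.
flank = "ACGTTGCATC"
ref = flank + "GGAA" * 10 + flank[::-1]

hits = occurrences(ref, "GGAAGGAA")
print(len(hits), hits)
```

Every one of those positions is an equally good exact match for the seed, so the final placement depends on how far the extension reaches into unique flanking sequence.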
Looking at the region in the UCSC Genome Browser, it's also dense in SINE and LINE elements.