Question

Score doesn't match alignment problem BWA-MEM

0

Entering edit mode

6.5 years ago

tianjunz • 0

Hi everyone, I am recently using BWA-MEM for aligning human genome. I found that when a hit (hit A) with a higher score falls in a the extension of another hit (hit B) which computed previously, BWA-MEM doesn't do the extension for hit A. However, on the final output, BWA-MEM uses the alignment for hit A and score for hit B. I am confused on this problem.

Any inputs would be appreciated. Thanks!

bwa alignment • 1.7k views

ADD COMMENT • link updated 6.5 years ago by Friederike 8.9k • written 6.5 years ago by tianjunz • 0

0

Entering edit mode

I have no idea what you're trying to get at. Could you show the specific reads and scores you're looking at? In particular, what do you mean with "hit B which computed previously"?

ADD REPLY • link 6.5 years ago by Friederike 8.9k

0

Entering edit mode

Thanks for replying. This is the specific read I am looking at:

chr20   0       chr20   13588261        1       8S75M2I15M      *       0       0       AAGGAAGGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGCAGGAAGGAAGAGAAAGAAAAGGAAGGAAGGAAGGAAATGGAAGGAAGGAATGT    2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222    NM:i:4  MD:Z:38A48G2    AS:i:58 XS:i:57

However, I found that they didn't really extend the hit position 13588261. Instead, the output is:

* ---> Processing chain(165) <---
** ---> Extending from seed(0) [38;8,13588252] @ chr20 <---
*** Left ref:   AAAGAAGGAAA
*** Left query: GGAAGGAA
*** Left extension: prev_score=-1; score=38; bandwidth=100; max_off_diagonal_dist=0; global_score=33
*** Right ref:   AAGGAAGGAAGGAAGGAAGAGAAAGAAAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGTAGGTAGGTAGGTCTAGATTGGAAAAGAAGAAGTAAAATTGTCT
*** Right query: CAGGAAGGAAGAGAAAGAAAAGGAAGGAAGGAAGGAAATGGAAGGAAGGAATGT
*** Right extension: prev_score=38; score=60; bandwidth=100; max_off_diagonal_dist=8; global_score=58
*** Added alignment region: [8,100) <=> [13588252,13588350); score=60; {left,right}_bandwidth={100,100}
* ---> Processing chain(166) <---
** Seed(0) [38;8,13588256] is almost contained in an existing alignment [8,100) <=> [13588252,13588350)
* ---> Processing chain(167) <---
** Seed(0) [38;8,13588260] is almost contained in an existing alignment [0,100) <=> [13588258,13588350)
* ---> Processing chain(168) <---
** Seed(0) [38;8,13588264] is almost contained in an existing alignment [0,100) <=> [13588258,13588350)

They simply omitting the hit position 13588260 since it is contained in an existing alignment. However, the AS score for alignment of hit position 13588260 is not 58. I disabled the function of merging hits, and here is the new output:

chr20   0   chr20   13588251    35  8M2D75M2I15M    *   0   0   AAGGAAGGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGCAGGAAGGAAGAGAAAGAAAAGGAAGGAAGGAAGGAAATGGAAGGAAGGAATGT    2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222    NM:i:6MD:Z:8^AA38A48G2  AS:i:72 XS:i:57

So I am totally confused.

ADD REPLY • link updated 6.5 years ago by GenoMax 141k • written 6.5 years ago by tianjunz • 0

0

Entering edit mode

Sorry, I'm still not fully on board, and I may never be. I still recommend you add that information to your original question including the commands you're using

ADD REPLY • link 6.5 years ago by Friederike 8.9k

0

Entering edit mode

It is a somewhat confusing situation!

I know that BWA breaks the reads up into segments and then tries to align them piece by piece. If one segment aligns, it will then extend this 'alignment seed' to see if the next segments in the read also align sequentially.

In your example above, the initial segment is 'GGAAGGAA', but BWA then gets confused because this is then found multiple times in the final aligned read (I believe). We're dealing with the very intricate details of BWA here, something which may be better answered on Heng Li's Github page (or by contacting him directly).

In your situation above, I think that BWA is getting confused because you're dealing with a very long and difficult repeat. No aligner will be capable of faithfully aligning this based on short (~150bp) reads. You'd require longer reads up to 1000bp.

Looking at the region in the UCSC Genome Browser, it's also dense in SINE and LINE elements.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k