Question: Score doesn't match alignment problem BWA-MEM
0
gravatar for tianjunz
2.9 years ago by
tianjunz0
tianjunz0 wrote:

Hi everyone, I am recently using BWA-MEM for aligning human genome. I found that when a hit (hit A) with a higher score falls in a the extension of another hit (hit B) which computed previously, BWA-MEM doesn't do the extension for hit A. However, on the final output, BWA-MEM uses the alignment for hit A and score for hit B. I am confused on this problem.

Any inputs would be appreciated. Thanks!

bwa alignment • 880 views
ADD COMMENTlink modified 2.9 years ago by Friederike6.1k • written 2.9 years ago by tianjunz0

I have no idea what you're trying to get at. Could you show the specific reads and scores you're looking at? In particular, what do you mean with "hit B which computed previously"?

ADD REPLYlink written 2.9 years ago by Friederike6.1k

Thanks for replying. This is the specific read I am looking at:

chr20   0       chr20   13588261        1       8S75M2I15M      *       0       0       AAGGAAGGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGCAGGAAGGAAGAGAAAGAAAAGGAAGGAAGGAAGGAAATGGAAGGAAGGAATGT    2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222    NM:i:4  MD:Z:38A48G2    AS:i:58 XS:i:57

However, I found that they didn't really extend the hit position 13588261. Instead, the output is:

* ---> Processing chain(165) <---
** ---> Extending from seed(0) [38;8,13588252] @ chr20 <---
*** Left ref:   AAAGAAGGAAA
*** Left query: GGAAGGAA
*** Left extension: prev_score=-1; score=38; bandwidth=100; max_off_diagonal_dist=0; global_score=33
*** Right ref:   AAGGAAGGAAGGAAGGAAGAGAAAGAAAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGTAGGTAGGTAGGTCTAGATTGGAAAAGAAGAAGTAAAATTGTCT
*** Right query: CAGGAAGGAAGAGAAAGAAAAGGAAGGAAGGAAGGAAATGGAAGGAAGGAATGT
*** Right extension: prev_score=38; score=60; bandwidth=100; max_off_diagonal_dist=8; global_score=58
*** Added alignment region: [8,100) <=> [13588252,13588350); score=60; {left,right}_bandwidth={100,100}
* ---> Processing chain(166) <---
** Seed(0) [38;8,13588256] is almost contained in an existing alignment [8,100) <=> [13588252,13588350)
* ---> Processing chain(167) <---
** Seed(0) [38;8,13588260] is almost contained in an existing alignment [0,100) <=> [13588258,13588350)
* ---> Processing chain(168) <---
** Seed(0) [38;8,13588264] is almost contained in an existing alignment [0,100) <=> [13588258,13588350)

They simply omitting the hit position 13588260 since it is contained in an existing alignment. However, the AS score for alignment of hit position 13588260 is not 58. I disabled the function of merging hits, and here is the new output:

chr20   0   chr20   13588251    35  8M2D75M2I15M    *   0   0   AAGGAAGGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGCAGGAAGGAAGAGAAAGAAAAGGAAGGAAGGAAGGAAATGGAAGGAAGGAATGT    2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222    NM:i:6MD:Z:8^AA38A48G2  AS:i:72 XS:i:57

So I am totally confused.

ADD REPLYlink modified 2.9 years ago by genomax89k • written 2.9 years ago by tianjunz0

Sorry, I'm still not fully on board, and I may never be. I still recommend you add that information to your original question including the commands you're using

ADD REPLYlink written 2.9 years ago by Friederike6.1k

It is a somewhat confusing situation!

I know that BWA breaks the reads up into segments and then tries to align them piece by piece. If one segment aligns, it will then extend this 'alignment seed' to see if the next segments in the read also align sequentially.

In your example above, the initial segment is 'GGAAGGAA', but BWA then gets confused because this is then found multiple times in the final aligned read (I believe). We're dealing with the very intricate details of BWA here, something which may be better answered on Heng Li's Github page (or by contacting him directly).

In your situation above, I think that BWA is getting confused because you're dealing with a very long and difficult repeat. No aligner will be capable of faithfully aligning this based on short (~150bp) reads. You'd require longer reads up to 1000bp.

Looking at the region in the UCSC Genome Browser, it's also dense in SINE and LINE elements.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by Kevin Blighe65k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1367 users visited in the last hour