Question

CIGAR and MD Strings Do Not Match in Bowtie 0.12.8

11.6 years ago

jeremy ▴ 80

I used RSEM to process RNA-seq data; RSEM uses bowtie by default. I found many alignment records with inconsistent CIGAR and MD strings. For example:

FCC121WACXX:4:1301:11541:99400#TCTTATAT   83   NM_004055   4307   100   87M3I   =   4202   -195   TAGGCTTCCCTCTTCTCAGGATCCACCACAGGGTTAGGGGACAGGAAGCCTGTTCTATTCTCAATAAATCTTACAAAATTCCAAAAAGAC   BBBBBBB_]``^HZVHZcb\VG``Vc_V_UWWQ_c^^^OX_ddcccc_c`c^ed^Id_daSbecec`_d`bdd[QQJQJR^caca^c\^^   XA:i:2   MD:Z:87A1A0   NM:i:2   ZW:f:1

The MD string says 87A1A0, which covers 90 reference bases (87 matches, a mismatch, 1 match, another mismatch) and so should correspond to a CIGAR string of "90M". But bowtie gives 87M3I, which claims a 3 bp insertion relative to the reference at the end of the read. That is wrong. Has anyone encountered this problem? How can I generate a correct CIGAR string? Thanks.
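For anyone who wants to check records like this programmatically, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names `cigar_ref_len` and `md_ref_len` are made up for this post, and the comparison assumes the usual convention that MD describes the reference bases under M, =, X and D operations (insertions and soft clips do not appear in MD).

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# One MD token: a run of matches, a deletion (^ plus bases), or a mismatch base.
MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")


def cigar_ref_len(cigar):
    """Reference bases that MD should describe: M, =, X and D operations."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=XD")


def md_ref_len(md):
    """Reference bases described by an MD string (the part after 'MD:Z:')."""
    total = 0
    for matches, deletion, mismatch in MD_TOKEN.findall(md):
        if matches:
            total += int(matches)        # run of matching bases
        elif deletion:
            total += len(deletion) - 1   # deleted reference bases, minus the '^'
        else:
            total += 1                   # a single mismatched reference base
    return total


def is_consistent(cigar, md):
    """True when CIGAR and MD agree on the number of reference bases."""
    return cigar_ref_len(cigar) == md_ref_len(md)
```

For the record above, `cigar_ref_len("87M3I")` gives 87 while `md_ref_len("87A1A0")` gives 90, so `is_consistent("87M3I", "87A1A0")` is `False`. With "90M" the two agree.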

bowtie cigar • 3.3k views

Updated 6.3 years ago by Rubus Pi • 0 • written 11.6 years ago by jeremy ▴ 80

Obi Griffith · Answer 1 · 2012-09-17

11.6 years ago

Istvan Albert 100k

To be honest, each of your CIGAR strings seems a bit strange.

The version of bowtie that you are using does not have the capability to align with insertions/deletions; only mismatches are supported. So it is somewhat surprising that it lists insertions in the CIGAR string, especially at the end of the read, where listing mismatches would be more appropriate.

I believe that the main CIGAR string contains the initial fast alignment performed to choose this location as a good hit, whereas the CIGAR listed in the MD string is generated via an optimal Smith–Waterman alignment. So that is the reason for the discrepancy.

The SAM spec says that the two CIGAR strings ought to match, but I don't know what that really means. It seems a softer requirement than "must match".

Updated 11.6 years ago by Obi Griffith 20k • written 11.6 years ago by Istvan Albert 100k

Thanks. Is there any tool to generate a correct CIGAR string?

Reply written 11.6 years ago by jeremy ▴ 80

You already have two correct CIGAR strings ;-) so why do you need a third? The first tells you why bowtie picked this position rather than other possible positions. The second tells you what was actually found there once it looked more closely. The only question is which one you want to use.

Also remember that the process is a heuristic: the vast majority of the time it works very well, with occasional misses. Neither of your CIGAR strings is guaranteed to be correct in the sense of being the best possible alignment of the read. The only issue is to decide whether this problem is common in your data or rare. If it is common, you will need to use a different aligner.
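One way to judge whether the problem is common or rare is to count the affected records in the SAM output. The following is a rough Python sketch, not a tested tool: `inconsistent_fraction` and its helpers are hypothetical names, it reads plain-text SAM lines, and it treats a record as inconsistent when the reference length implied by the CIGAR (M, =, X, D) differs from the one implied by the MD tag.

```python
import re


def ref_len_from_cigar(cigar):
    # M, =, X and D are the operations whose reference bases MD describes.
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "M=XD")


def ref_len_from_md(md):
    total = 0
    # Tokens: a match count, a deletion (^ plus bases), or one mismatch base.
    for num, deleted, _mismatch in re.findall(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])", md):
        total += int(num) if num else len(deleted) if deleted else 1
    return total


def inconsistent_fraction(sam_lines):
    """Return (inconsistent, checked) counts over aligned records with an MD tag."""
    checked = bad = 0
    for line in sam_lines:
        if line.startswith("@"):           # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[5] == "*":
            continue                       # malformed or unaligned record
        md = next((f[5:] for f in fields[11:] if f.startswith("MD:Z:")), None)
        if md is None:
            continue                       # no MD tag to compare against
        checked += 1
        if ref_len_from_cigar(fields[5]) != ref_len_from_md(md):
            bad += 1
    return bad, checked
```

Usage would look something like `with open("aln.sam") as fh: bad, checked = inconsistent_fraction(fh)`, where `aln.sam` stands in for your own file. A handful of bad records out of millions is probably tolerable; a large fraction suggests switching aligners.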

Reply written 11.6 years ago by Istvan Albert 100k

score 0 · Answer 2 · 2018-01-04

6.3 years ago

Rubus Pi • 0

"Note that insertions, since they don't represent a loss of information about the reference, are not stored in MD flag. This has some interesting consequences."

https://github.com/vsbuffalo/devnotes/wiki/The-MD-Tag-in-BAM-Files
