I have a pair end sample that mapped to the human genome (hg19) with bwa, and used GATK IndelRealigner.
The point is that I find that the CIGAR expression is not consistent for some reads.
NS500644:35:H5WFMAFXX:1:11106:23458:8358 99 9 33797899 40 3S148M = 33797902 154 GGTCAAGATCATCCGCCACCCCAAATACAACAGCTGGACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCACACCTGCCATCATCAATGCCCATGTGTCCACCATCTCTCTGCCCACCACCCCTCCAGCTGCTGGCACTGAGTGCCTC ??>ABAB@@AA@@@;A@AA@@@AAA@?AAAAABABCBAAAAACA@BBA@CAABABABBB?C?ABBBCBBB?BABB-BABCBABABBABBBACBAABAA@C@BABBA-?BBBBABCBAA-BA@BA@=?B@B?BBCABCBABBBB/B@@@>@9 XA:Z:7,+142470615,3S138M10S,9;7,+142481199,3S148M,12; MC:Z:27M2I2M2D120M BD:Z:KKLMNKKKLMMLMKKKLLKLKKLKEKJLJKLJMMMLLLLLKKLLLLJKKMLLJLMKLLMMLLLMMKKMMKKKKLKJKLLLMLLLMMLMMKKMMLKLLMKIKMKLKLLLMKKKKKLMLKLKLLKLKKLKKLMMMLMMLLKMKLLLKMIMLLK MD:Z:18T11G0G0A0C18A10A6T8G12G0C54 PG:Z:MarkDuplicates RG:Z:1 BI:Z:NNNNONNNOOOOOMNNOOLONNONIONONNONOOPPNNNONNPNMNNNONONNOOMOPOPPOOOONNOPNNNMOLNLOOPOOOOOOOOONONOONOONNNNNMOLOOOONNNNNPOONOLOOLONNONMOOOPPOPPNPPLOPONONOOON NM:i:11 MQ:i:43 AS:i:95 XS:i:93 NS500644:35:H5WFMAFXX:1:11106:23458:8358 147 9 33797902 43 27M2I2M2D120M = 33797899 -154 GATCATCCGCCACCCCAAATACAACAGCTGGACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCACACCTGCCATCATCAATGCCCATGTGTCCACCATCTCTCTGCCCACCACCCCTCCAGCTGCTGGCACTGAGTGCCTCATCTCT ?=>A@@A<B1C@B1@B@BA@?B?@CBBCBAB@CACB@B@CBABB@CAABCBBCBBAAC@BBC.CABA@C@BC@BBCAACAACB?BABBBABAAB.BC@BC@@BAC@BBBBBB?AB?ABABAABAABAABAABB?BAAAAAAAB?A@AA=@> XA:Z:7,-142470618,151M,12;7,-142481202,151M,13; MC:Z:3S148M BD:Z:MLLMLLLKKLILJJLKEKKLKKKKLMMMLKMLKKMLKMKKKLMMKMLLMMMMMMMLLKKMMKKKLLIKILMMMKLMLLMLLKKLMKJLMLJKJLLLILLMLKKKKKMMKJLILLILJJMKLLLMMMMMMLLMILMMKLKMKMKLMMLKKKK MD:Z:15T13^AC18A10A6T8G12G0C59C0 PG:Z:MarkDuplicates RG:Z:1 BI:Z:OOONONNNPNNNMMNPJNNONPNNPPOOOMNONNOOMNNPNOONNNONNOPOOOOOOPOPONNNNNNNNNNOPPNNOONOOPNOPPMNNONLNNNNNNNNONNNNNOPPMNNNNNNMMNNNNPPOOPOOOOONOOONOLPPNNONONNNNN NM:i:12 MQ:i:40 AS:i:97 XS:i:91
I can see that both sequences are overlapping (and are a pair) with the second one starting 6 bases after fist one (3 are soft clipped). I also notice that both have exactly the same sequence.
Both have the following sequence:
The first CIGAR expression is 3S148M, so I have 148 matches after the first 3 soft clipped bases.
The second CIGAR expression is 27M2I2M2D120M, so I have 2 bases inserted and 2 deleted after 27 matches.
The reference in this region (hg19) is
If I map to the reference I get:
REF: CAAGATCAATCCGCCACCCCCAATACAACAG GGACACTCTGGACAATGACATCATGCTGATCAAACTCTCCTC... 1st: GGTCAAGATCATCCGCCACCCCAAATACAACAGCTGG ACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCAC... 2nd: GATCATCCGCCACCCCAAATACAACAGCTGG ACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCAC...
This way the second CIGAR code is correct (CT inserted, GA deleted, assuming that there is no A delecion but ATCAA > GATCA in the 5 first 5 bases), but not the first one...
How could I explain this? Do I get this because GATK IndelRealigner is recalculating sequence and CIGAR in a wrong way? Does it has something to do with the fact that I am in a multimapped region?
In the other hand, they are multimapped sequences and the alternative mapped regions are (XA tag):
7,142481199,3S148M for first read and
7,142481202,151M for the 2nd. In both cases the 2nd one starts 3 bases after the 1st one.
If I have the sequence fully matched if mapping to chr7...
Why bwa assigned them to chr9 as primary?
Thank you very much
EDIT: I mapped again this sample without doing IndelRealigner and I still get the same thing, so the problem is with bwa