I'm aligning 75-mer paired end reads from a Illumina HiSeq RNA-seq run using BWA 0.6.0, mapped against a custom transcriptome. I have a read which spans two exons. The mystery: if I build the reference in two different ways, both of which have a perfect match for the read, I get MAPQ scores of 7 and 29. The low score comes from an alignment against exon junctions of length 150; the high score comes when I align against consensus transcripts that include all exons. In the low score read the read's pair does not align to the same junction fasta (it has a perfect match elsewhere) while in the high score the read pair matches elsewhere in the same consensus sequence.
Why would the same read have two different scores when there's a perfect match both times?
I think there's a hint in the "template-independent mapping quality" flag; the low score alignment has SM:i:0 AM:i:0 while the high-score has SM:i:29 AM:i:29. If that's the explanation, can someone explain what these flags mean?
The low-scoring and high-scoring reads follow, broken into three lines for easier reading:
HS20_6661:1:1308:2103:61534#6 163 my_junction 70 7 75M my_pair_name 53 0 CAGAGATGGGCAAGTCCTGGGCCGACGGTGCTTTGAGGCCCGGATCTGTGCTTGCCCAGGAAGAGACCGGAAGGC BCCFFFFFHGHHGIHIIJJJIJJJJIJIFHGGIIJIJJJJIJJGGBBDCEECCCDEDCBBD?AA55<8;@@>BD? X0:i:2 X1:i:0 XA:Z:my_junction,+70,75M,0;my_pair_name,+70,75M,0; MD:Z:75 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R HS20_6661:1:1308:2103:61534#6 163 my_consensus 1698 29 75M = 1807 195 CAGAGATGGGCAAGTCCTGGGCCGACGGTGCTTTGAGGCCCGGATCTGTGCTTGCCCAGGAAGAGACCGGAAGGC BCCFFFFFHGHHGIHIIJJJIJJJJIJIFHGGIIJIJJJJIJJGGBBDCEECCCDEDCBBD?AA55<8;@@>BD? XT:A:U NM:i:0 SM:i:29 AM:i:29 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:75 XA:Z:my_consensus,+1698,75M,0;
The low-scoring junction sequence:
>my_junction CATGTGTAACAGCAGCTGCGTCGGAGGAATGAACAGACGTCCAATTTTAATCATCGTTACTCTGGAAA CCAGAGATGGGCAAGTCCTGGGCCGACGGTGCTTTGAGGCCCGGATCTGTGCTTGCCCAGGAAGAGACCGGAAGGC AGATGA