Question

Differing Bwa Mapq Scores For A Read Depending On How I Build The Transcriptome

2

Entering edit mode

12.4 years ago

David Quigley 11k

I'm aligning 75-mer paired end reads from a Illumina HiSeq RNA-seq run using BWA 0.6.0, mapped against a custom transcriptome. I have a read which spans two exons. The mystery: if I build the reference in two different ways, both of which have a perfect match for the read, I get MAPQ scores of 7 and 29. The low score comes from an alignment against exon junctions of length 150; the high score comes when I align against consensus transcripts that include all exons. In the low score read the read's pair does not align to the same junction fasta (it has a perfect match elsewhere) while in the high score the read pair matches elsewhere in the same consensus sequence.

Why would the same read have two different scores when there's a perfect match both times?

I think there's a hint in the "template-independent mapping quality" flag; the low score alignment has SM:i:0 AM:i:0 while the high-score has SM:i:29 AM:i:29. If that's the explanation, can someone explain what these flags mean?

The low-scoring and high-scoring reads follow, broken into three lines for easier reading:

HS20_6661:1:1308:2103:61534#6    163    my_junction    70    7    75M    
my_pair_name    53    0    CAGAGATGGGCAAGTCCTGGGCCGACGGTGCTTTGAGGCCCGGATCTGTGCTTGCCCAGGAAGAGACCGGAAGGC    BCCFFFFFHGHHGIHIIJJJIJJJJIJIFHGGIIJIJJJJIJJGGBBDCEECCCDEDCBBD?AA55<8;@@>BD?
X0:i:2    X1:i:0    XA:Z:my_junction,+70,75M,0;my_pair_name,+70,75M,0;    MD:Z:75    XG:i:0    AM:i:0    NM:i:0    SM:i:0    XM:i:0    XO:i:0    XT:A:R

HS20_6661:1:1308:2103:61534#6    163    my_consensus    1698    29    75M    =    1807    195
CAGAGATGGGCAAGTCCTGGGCCGACGGTGCTTTGAGGCCCGGATCTGTGCTTGCCCAGGAAGAGACCGGAAGGC
BCCFFFFFHGHHGIHIIJJJIJJJJIJIFHGGIIJIJJJJIJJGGBBDCEECCCDEDCBBD?AA55<8;@@>BD?    XT:A:U
NM:i:0    SM:i:29    AM:i:29    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:75    XA:Z:my_consensus,+1698,75M,0;

The low-scoring junction sequence:

>my_junction
CATGTGTAACAGCAGCTGCGTCGGAGGAATGAACAGACGTCCAATTTTAATCATCGTTACTCTGGAAA
CCAGAGATGGGCAAGTCCTGGGCCGACGGTGCTTTGAGGCCCGGATCTGTGCTTGCCCAGGAAGAGACCGGAAGGC
AGATGA

bwa next-gen sequencing rna • 5.1k views

ADD COMMENT • link updated 12.4 years ago by Damian Kao 16k • written 12.4 years ago by David Quigley 11k

0

Entering edit mode

I think maybe it has to do with how BWA calculates pair end mapping quality. What would be the insert size difference between mapping to junction vs mapping to whole transcript?

ADD REPLY • link 12.4 years ago by Damian Kao 16k

0

Entering edit mode

The name of the paired match is different for the poorer score - maybe it thinks it's on a different chromosome?

ADD REPLY • link 12.4 years ago by Chris Penkett ▴ 490

0

Entering edit mode

That's the issue to which I refer in my comment to DK's answer; my current formulation of "map reads to all exon junctions" puts each junction in a separate FASTA and then builds an index. BWA doesn't know the distance between these separate FASTAs, so pairs which map to separate junctions or where the mate is not in a junction get poor MAPQ scores.

ADD REPLY • link 12.4 years ago by David Quigley 11k

0

Entering edit mode

What is the insert size distribution bwa was giving? You can find from the bwa error output (stderr)?

ADD REPLY • link 12.4 years ago by lh3 33k

0

Entering edit mode

For the "junction probe" reference: (25, 50, 75) percentile: (712, 1293, 2422) low and high boundaries: 75 and 5842 for estimating avg and std inferred external isize from 14157 pairs: 1479.721 +/- 1111.644 skewness: 1.566; kurtosis: 2.139; ap_prior: 7.24e-04 inferred maximum insert size: 8416 (6.24 sigma)

Compared to the "consensus exons" reference where: (25, 50, 75) percentile: (187, 214, 238) That looks better; clearly the insert sizes are off for the junction reference. I don't blame BWA for this; it would need to know the absolute location of the junctions.

ADD REPLY • link 12.4 years ago by David Quigley 11k

score 0 · Answer 1 · 2011-11-28

0

Entering edit mode

12.4 years ago

Damian Kao 16k

There isn't much information on how pair-end qualities are derived. The BWA paper does say:

BWA assigns a Phred-like quality score to each read (MapQ) based on match uniqueness, sequence identity, end-pairing, and insert size that is intended to indicate confidence of read placement accuracy in the genome

So I guess your pairing distance and quality will affect the mapping quality of the read in a pair.

ADD COMMENT • link 12.4 years ago by Damian Kao 16k

0

Entering edit mode

I think the key is the "end-pairing, and insert size" part of the quality score. Each junction is a separate FASTA in my junction build, so the pair maps to separate FASTAs. BWA may be penalizing me for that, since it has no way to calculate the insert size correctly.

ADD REPLY • link 12.4 years ago by David Quigley 11k