Question: Differing Bwa Mapq Scores For A Read Depending On How I Build The Transcriptome
2
gravatar for David Quigley
7.2 years ago by
David Quigley11k
San Francisco
David Quigley11k wrote:

I'm aligning 75-mer paired end reads from a Illumina HiSeq RNA-seq run using BWA 0.6.0, mapped against a custom transcriptome. I have a read which spans two exons. The mystery: if I build the reference in two different ways, both of which have a perfect match for the read, I get MAPQ scores of 7 and 29. The low score comes from an alignment against exon junctions of length 150; the high score comes when I align against consensus transcripts that include all exons. In the low score read the read's pair does not align to the same junction fasta (it has a perfect match elsewhere) while in the high score the read pair matches elsewhere in the same consensus sequence.

Why would the same read have two different scores when there's a perfect match both times?

I think there's a hint in the "template-independent mapping quality" flag; the low score alignment has SM:i:0 AM:i:0 while the high-score has SM:i:29 AM:i:29. If that's the explanation, can someone explain what these flags mean?

The low-scoring and high-scoring reads follow, broken into three lines for easier reading:

HS20_6661:1:1308:2103:61534#6    163    my_junction    70    7    75M    
my_pair_name    53    0    CAGAGATGGGCAAGTCCTGGGCCGACGGTGCTTTGAGGCCCGGATCTGTGCTTGCCCAGGAAGAGACCGGAAGGC    BCCFFFFFHGHHGIHIIJJJIJJJJIJIFHGGIIJIJJJJIJJGGBBDCEECCCDEDCBBD?AA55<8;@@>BD?
X0:i:2    X1:i:0    XA:Z:my_junction,+70,75M,0;my_pair_name,+70,75M,0;    MD:Z:75    XG:i:0    AM:i:0    NM:i:0    SM:i:0    XM:i:0    XO:i:0    XT:A:R

HS20_6661:1:1308:2103:61534#6    163    my_consensus    1698    29    75M    =    1807    195
CAGAGATGGGCAAGTCCTGGGCCGACGGTGCTTTGAGGCCCGGATCTGTGCTTGCCCAGGAAGAGACCGGAAGGC
BCCFFFFFHGHHGIHIIJJJIJJJJIJIFHGGIIJIJJJJIJJGGBBDCEECCCDEDCBBD?AA55<8;@@>BD?    XT:A:U
NM:i:0    SM:i:29    AM:i:29    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:75    XA:Z:my_consensus,+1698,75M,0;

The low-scoring junction sequence:

>my_junction
CATGTGTAACAGCAGCTGCGTCGGAGGAATGAACAGACGTCCAATTTTAATCATCGTTACTCTGGAAA
CCAGAGATGGGCAAGTCCTGGGCCGACGGTGCTTTGAGGCCCGGATCTGTGCTTGCCCAGGAAGAGACCGGAAGGC
AGATGA
next-gen rna bwa sequencing • 3.7k views
ADD COMMENTlink written 7.2 years ago by David Quigley11k

I think maybe it has to do with how BWA calculates pair end mapping quality. What would be the insert size difference between mapping to junction vs mapping to whole transcript?

ADD REPLYlink written 7.2 years ago by Damian Kao15k

The name of the paired match is different for the poorer score - maybe it thinks it's on a different chromosome?

ADD REPLYlink written 7.2 years ago by Chris Penkett470

That's the issue to which I refer in my comment to DK's answer; my current formulation of "map reads to all exon junctions" puts each junction in a separate FASTA and then builds an index. BWA doesn't know the distance between these separate FASTAs, so pairs which map to separate junctions or where the mate is not in a junction get poor MAPQ scores.

ADD REPLYlink written 7.2 years ago by David Quigley11k

What is the insert size distribution bwa was giving? You can find from the bwa error output (stderr)?

ADD REPLYlink written 7.2 years ago by lh331k

For the "junction probe" reference: (25, 50, 75) percentile: (712, 1293, 2422) low and high boundaries: 75 and 5842 for estimating avg and std inferred external isize from 14157 pairs: 1479.721 +/- 1111.644 skewness: 1.566; kurtosis: 2.139; ap_prior: 7.24e-04 inferred maximum insert size: 8416 (6.24 sigma)

Compared to the "consensus exons" reference where: (25, 50, 75) percentile: (187, 214, 238) That looks better; clearly the insert sizes are off for the junction reference. I don't blame BWA for this; it would need to know the absolute location of the junctions.

ADD REPLYlink written 7.2 years ago by David Quigley11k
0
gravatar for Damian Kao
7.2 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

There isn't much information on how pair-end qualities are derived. The BWA paper does say:

BWA assigns a Phred-like quality score to each read (MapQ) based on match uniqueness, sequence identity, end-pairing, and insert size that is intended to indicate confidence of read placement accuracy in the genome

So I guess your pairing distance and quality will affect the mapping quality of the read in a pair.

ADD COMMENTlink written 7.2 years ago by Damian Kao15k

I think the key is the "end-pairing, and insert size" part of the quality score. Each junction is a separate FASTA in my junction build, so the pair maps to separate FASTAs. BWA may be penalizing me for that, since it has no way to calculate the insert size correctly.

ADD REPLYlink written 7.2 years ago by David Quigley11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1011 users visited in the last hour