Question: Combining quality scores when aligning/assembling
2
gravatar for earonesty
5.3 years ago by
earonesty230
United States
earonesty230 wrote:

Suppose I have 2 reads, and I align them as a part of a joining program or an assembly:

AATGCGTTTGA
   GCGTTAGATGCTGA

Then I output one combined read:

AATGCGTTTGATGCTGA

How do  I combine the associated (phred scaled) quality scores in the overlapped region?

If the two bases match, and say one is q20 and the other q20 is then the likelihood of correctness is probably higher than q20.

It's clear that if the two bases don't match, and say one is q20 and the other q20 is then the likelihood of correctness is 50% (phred quality score of 3).

If the two bases don't match, and one is q40 and the other q3, and I choose the q40 base, then the correctness likelihood is probably still really near around q40.   That q3 base shouldn't pull down the q40 that much.

Is there a "correct" formula for:

    - combining quality scores when bases match in an overlapped region
    - combining quality scores when don't match in an overlapped region

 

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by earonesty230

Note:  When the bases match, it's a much easier problem.   With an assumption of independence, you can simple add the q scores.   But that this assumption is fundamentally flawed.    For Illumina I use max(a,b)+min(4,min(a,b)) ... when they match.  It's the mismatches that are harder.

ADD REPLYlink written 5.3 years ago by earonesty230

For mismatches.... min(a,b,max(abs(a-b),3))?   That seems a close approximation.   At least it seems to make sense to me when they are the same (3), and when they are far apart.

ADD REPLYlink written 5.3 years ago by earonesty230
2
gravatar for earonesty
5.3 years ago by
earonesty230
United States
earonesty230 wrote:

I spoke to a statistician here. I think a correct'ish answer for ambiguous bases is (after taking the better quality base):

E = max(3,((1-e2/2) * e1) / ((1-e1) * e2/2 + (1-e2/2) * e1))

Where e2 is the lower score.

And that this algorithm can be used to approximate it without needing to convert to/from phredspace

min(max(q1,q2),max(abs(q1-q2),3))

 

ADD COMMENTlink written 5.3 years ago by earonesty230

I am not sure about the solution but +1 for taking an effort to ask others and posting the answer.

ADD REPLYlink written 5.3 years ago by Ashutosh Pandey11k
0
gravatar for mikhail.shugay
5.3 years ago by
mikhail.shugay3.4k
Czech Republic, Brno, CEITEC
mikhail.shugay3.4k wrote:

It would be very rare case when your bases don't match and have the same quality. Also note that Phred score is in log scale, so the difference between 20 and 25 is quite high. Therefore I usually do the following (quite extreme) thing:

1. if there are two ambiguous bases, take the one with highest quality

2. when bases match output the highest quality 

I've implemented it in demultiplexing/overlapping util called Checkout here (https://github.com/mikessh/migec)

Anyway I believe the answer to your question could have only some heuristic behind it.

ADD COMMENTlink written 5.3 years ago by mikhail.shugay3.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1818 users visited in the last hour