Question

PHRED scores and sequence errors

1

Entering edit mode

2.8 years ago

Dunois ★ 2.5k

Wikipedia has a nice article about PHRED scores where it explains that a score of 20 implies a 1% chance of that base having been called wrong.

So if I have a sequence of 100 nucleotides, and all the bases have a quality score of 20, does this mean that the sequence overall has a chance of (0.01)^100 % chance of being wrong?

Or is the calculation not so naive?

sequencing-errors probability PHRED • 2.2k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 2.8 years ago by Dunois ★ 2.5k

3

Entering edit mode

No, let's assume that the sequencer has a 1% chance of error. So if 100 nucleotides are sequenced, it is expected that one of them will have an error.

You'd have to do 1 - (0.99^100) to calculate your desired probability, not (0.01)^100

ADD REPLY • link 2.8 years ago by dsull ★ 5.8k

0

Entering edit mode

Going by this, it would mean a 250 bp read would have 1-((0.99)^250) ~ 92% chance of being wrong if all bases in the sequence had a PHRED score of 20? Or did you mean to say this is the probability of the sequence being not wrong?

I suppose I am misunderstanding something here.

ADD REPLY • link 2.8 years ago by Dunois ★ 2.5k

2

Entering edit mode

What do you consider "wrong"? My definition of wrong is if a sequence differs from the ground truth even by one base.

So yes, your 250 bp read would have a 92% chance of being wrong (i.e. differing from the true sequence by at least one base).

A 1% error rate is pretty high (it's 1 in 100 bases basically; and now you're giving me a read with 250 bases, so of course it will most likely contain an error).

ADD REPLY • link 2.8 years ago by dsull ★ 5.8k

0

Entering edit mode

I was actually tripping over precisely what @h.mon explained in their answer: "Note there is only one way of a sequence is correct (all bases must be correct), but there are many ways a sequence can be wrong".

Now your answer makes sense to me!!

ADD REPLY • link 2.8 years ago by Dunois ★ 2.5k

2

Entering edit mode

Phred score is for each individual basecall. I don't think it can be extended to a stretch of sequence (just intuition not a statistician).

ADD REPLY • link 2.8 years ago by GenoMax 141k

0

Entering edit mode

^^^ yes, these are usually quoted at the level of an individual base.

Phred 20 == 1 in 100 chance of being incorrect
Phred 30 == 1 in 1000

Phred scores are also calculated for the alignment of a read to a reference though, i.e., MAPQ score.

ADD REPLY • link 2.8 years ago by Kevin Blighe 87k

score 3 · Accepted Answer · 2021-06-17

Using my very basic probability skills, I will try to explain dsull comments with a little more detail.

You can calculate pretty easily the probability of the sequence being correct, i.e., no base is wrong, as (probability the base is correct) ^ #bases. So, for example, for a 100bp sequence with all bases Q20, the probability of the sequence being correct is (0.99)^(100) = 0.366, or 36.6% chance having no errors.

The probability of a sequence "being wrong" is one minus the probability of the sequence being correct. So, for the same example above of a 100bp sequence with all bases Q20, the probability of the sequence being wrong (i.e., containing one or more errors) is 1-0.366 = 0.634, or 63.4% chance of containing at least one error.

Note there is only one way of a sequence is correct (all bases must be correct), but there are many ways a sequence can be wrong - one base can be wrong, two bases, and so on. The estimation from your question - (0.01)^100 - is actually the probability of all bases of the sequence being wrong.