PHRED scores and sequence errors
1
1
Entering edit mode
3.2 years ago
Dunois ★ 2.6k

Wikipedia has a nice article about PHRED scores where it explains that a score of 20 implies a 1% chance of that base having been called wrong.

So if I have a sequence of 100 nucleotides, and all the bases have a quality score of 20, does this mean that the sequence overall has a chance of (0.01)^100 % chance of being wrong?

Or is the calculation not so naive?

sequencing-errors probability PHRED • 2.5k views
3
Entering edit mode

No, let's assume that the sequencer has a 1% chance of error. So if 100 nucleotides are sequenced, it is expected that one of them will have an error.

You'd have to do 1 - (0.99^100) to calculate your desired probability, not (0.01)^100

0
Entering edit mode

Going by this, it would mean a 250 bp read would have 1-((0.99)^250) ~ 92% chance of being wrong if all bases in the sequence had a PHRED score of 20? Or did you mean to say this is the probability of the sequence being not wrong?

I suppose I am misunderstanding something here.

2
Entering edit mode

What do you consider "wrong"? My definition of wrong is if a sequence differs from the ground truth even by one base.

So yes, your 250 bp read would have a 92% chance of being wrong (i.e. differing from the true sequence by at least one base).

A 1% error rate is pretty high (it's 1 in 100 bases basically; and now you're giving me a read with 250 bases, so of course it will most likely contain an error).

0
Entering edit mode

I was actually tripping over precisely what @h.mon explained in their answer: "Note there is only one way of a sequence is correct (all bases must be correct), but there are many ways a sequence can be wrong".

2
Entering edit mode

Phred score is for each individual basecall. I don't think it can be extended to a stretch of sequence (just intuition not a statistician).

0
Entering edit mode

^^^ yes, these are usually quoted at the level of an individual base.

• Phred 20 == 1 in 100 chance of being incorrect
• Phred 30 == 1 in 1000

Phred scores are also calculated for the alignment of a read to a reference though, i.e., MAPQ score.

3
Entering edit mode
3.2 years ago
h.mon 35k

Using my very basic probability skills, I will try to explain dsull comments with a little more detail.

You can calculate pretty easily the probability of the sequence being correct, i.e., no base is wrong, as (probability the base is correct) ^ #bases. So, for example, for a 100bp sequence with all bases Q20, the probability of the sequence being correct is (0.99)^(100) = 0.366, or 36.6% chance having no errors.

The probability of a sequence "being wrong" is one minus the probability of the sequence being correct. So, for the same example above of a 100bp sequence with all bases Q20, the probability of the sequence being wrong (i.e., containing one or more errors) is 1-0.366 = 0.634, or 63.4% chance of containing at least one error.

Note there is only one way of a sequence is correct (all bases must be correct), but there are many ways a sequence can be wrong - one base can be wrong, two bases, and so on. The estimation from your question - (0.01)^100 - is actually the probability of all bases of the sequence being wrong.

0
Entering edit mode

Thank you for taking the time to explain this.

Now @dsull 's answer makes a lot of sense. I kept thing the probability of the sequence being wrong is very high, without considering the fact that (as you pointed out), there are many more ways here for the sequence to be "wrong" (one base call is incorrect, ..., n base calls are incorrect) than there are for the sequence being "right".

I really need to work on basic probability and statistics and improve my intuition for these subjects.