Question: Why does BLAST use E-values instead of p-values?
1
mangfu100740 wrote:

Hi all.

I think the p-value is one of the best ways of measuring how significant an observation is.

However, BLAST reports an E-value rather than a p-value.

Why does BLAST use the E-value instead of the p-value for interpreting sequence data?

Is there a logical reason for BLAST to use the E-value? If so, could you explain it in detail?

sequencing alignment • 9.4k views
modified 5.4 years ago by Csaba Kerepesi330 • written 5.4 years ago by mangfu100740
11
Csaba Kerepesi330 wrote:

Quote from the BLAST help (http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html#head4):

"The BLAST programs report E-value rather than P-values because it is easier to understand the difference between, for example, E-value of 5 and 10 than P-values of 0.993 and 0.99995. However, when E < 0.01, P-values and E-value are nearly identical."
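The numbers in that quote follow from the relation P = 1 - exp(-E), which holds if the number of chance hits is treated as Poisson-distributed. Here is a minimal sketch of the conversion; the function name `p_from_e` is my own and is not part of any BLAST tool.

```python
import math


def p_from_e(e_value: float) -> float:
    """Convert an E-value to a P-value, assuming the number of chance
    hits is Poisson-distributed: P(at least one hit) = 1 - exp(-E)."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Large E-values are squeezed towards P = 1; small ones give P close to E.
    for e in (10, 5, 1, 0.01, 0.001):
        print(f"E = {e:<6} P = {p_from_e(e):.6g}")
```

For E = 5 and E = 10 this gives roughly 0.993 and 0.99995, matching the quote, while for E below about 0.01 the two values differ by well under one percent.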

It is important to note that a BLAST P-value is not the same thing as the P-value of a t-test.

1

Could you elaborate further on your last sentence?

1

Any p-value is the result of a hypothesis test. Since a BLAST search is not a hypothesis test, a p-value would be an inappropriate result.

3

Yes, BLAST is doing a hypothesis test: is the sequence a homolog of your query, or not? The null hypothesis is that it is not a homolog, and instead is a "random" sequence. The P-value is the probability that you would've gotten a score this high if it's not a homolog. BLAST scores follow a known distribution (an extreme value distribution) under the null hypothesis. Conceptually, it's the same as any other p-value based significance test.

I think most users aren't aware of the hypothesis test as you've stated it. Implicitly, BLAST is testing a query sequence against thousands, or thousands of millions, of candidate sequences. If we interpret the p-value as the false positive rate (that is, the rate of incorrectly rejecting the null hypothesis), then we should apply a multiple-testing correction to the result, and the copious results are decimated. The chance of a spurious alignment is highly dependent upon the genome being searched and the complexity of the query sequence. We can guarantee that a 2-mer is a homolog of a million locations, but that's useless as a result. The E-value accounts for these things and is more directly related to the complexity and uniqueness of a BLAST 'hit'. It's determined by the genome index being queried. We use BLAST to find things, and want to know how certain each hit is. I think most users aren't specifying any hypotheses or accounting for the multiplicity thereof.
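To make the multiple-testing point concrete, here is a rough sketch. It assumes every query-versus-subject comparison is independent and has the same per-comparison p-value, which is a simplification: BLAST itself works from the size of the search space rather than from a count of sequences. The function names and the numbers are made up for illustration.

```python
import math


def expected_chance_hits(p_single: float, n_comparisons: int) -> float:
    """Expected number of false positives (an E-value-like quantity)
    when the same per-comparison p-value applies to every comparison."""
    return p_single * n_comparisons


def prob_at_least_one(p_single: float, n_comparisons: int) -> float:
    """Probability of one or more false positives, 1 - (1 - p)^N,
    assuming the comparisons are independent."""
    return -math.expm1(n_comparisons * math.log1p(-p_single))


if __name__ == "__main__":
    p = 1e-6  # hypothetical per-comparison p-value
    for n in (10, 10_000, 10_000_000):
        print(f"N = {n:>10,}  expected chance hits = "
              f"{expected_chance_hits(p, n):.3g}  "
              f"P(at least one) = {prob_at_least_one(p, n):.3g}")
```

The same per-comparison p-value that looks convincing against ten sequences is expected to turn up by chance about ten times against ten million, which is why a score has to be judged relative to the size of the database.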

2

seanrobertseddy (Sean Eddy? Hello!) is right here. BLAST is doing a standard hypothesis test. It has an explicit null model and the E-value is estimated based on this model. You may argue whether the null model is appropriate, but math is math. As I remember, BLAST precomputes the two key parameters. FASTA/swat learns the parameters from data. They are less affected by the redundancy in the database.

Exactly: BLAST P-value: "The probability of a chance alignment occurring with a particular score or a better score in a database search." Quoted from the BLAST Glossary.

More exactly: if you have a query sequence of length n and a database of length m, and running BLAST gives you a hit with score S, then the P-value is the probability of getting at least one hit with a score greater than or equal to S when you BLAST a random query of length n against a random database of length m.

The last statement is drawn mostly from: http://www.basiclocalalignmentsearchtool.com/

However, BLAST calculates the E-value, not the P-value, and the two are not equal. The BLAST E-value is the expected number of hits with a score greater than or equal to S when you BLAST a random query of length n against a random database of length m.
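For reference, the E-value described here is usually written with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S). Below is a small sketch of that formula. The values of `lam` and `k`, the lengths, and the score are illustrative assumptions only; the real parameters depend on the scoring matrix and gap penalties, and BLAST also applies edge-effect corrections to the lengths that are ignored here.

```python
import math


def e_value(score: float, m: float, n: float, lam: float, k: float) -> float:
    """Expected number of chance hits scoring >= `score` for a random
    query of length n against a random database of length m:
    E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


def p_value(e: float) -> float:
    """Probability of at least one such chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e)


if __name__ == "__main__":
    # All numbers below are hypothetical, chosen only to show the behaviour.
    lam, k = 0.267, 0.041       # assumed statistical parameters
    n = 250                     # query length
    for m in (1e6, 1e9):        # database length in residues
        e = e_value(score=100, m=m, n=n, lam=lam, k=k)
        print(f"m = {m:.0e}  E = {e:.3g}  P = {p_value(e):.3g}")
```

E grows linearly with the database length and falls exponentially with the score, so the same raw score is less significant in a larger database.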