Question

Why does BLAST use an E-value instead of a p-value?

1

9.0 years ago

mangfu100 ▴ 800

Hi all.

I think the p-value is one of the best ways of measuring the significance of observed data.

However, BLAST reports an E-value rather than a p-value.

Why does BLAST use the E-value instead of the p-value for interpreting sequence data?

Is there a logical reason for BLAST to use the E-value? If so, could you explain the reasoning in detail?

sequencing alignment • 16k views

ADD COMMENT • link updated 22 months ago by Ram 43k • written 9.0 years ago by mangfu100 ▴ 800

Ram · Accepted Answer · 2015-04-17

11

9.0 years ago

Csaba Kerepesi ▴ 350

Quote from the BLAST help (http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html#head4 ):

"The BLAST programs report E-value rather than P-values because it is easier to understand the difference between, for example, E-value of 5 and 10 than P-values of 0.993 and 0.99995. However, when E < 0.01, P-values and E-value are nearly identical."

It is important to note that a BLAST P-value is not the same thing as the P-value of a t-test.
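The numbers in the quote can be reproduced with a few lines of Python. This is a minimal sketch, assuming the number of chance hits is Poisson-distributed with mean E, so that P = 1 - exp(-E); the function name `evalue_to_pvalue` is made up for this illustration and is not part of any BLAST tool.

```python
import math


def evalue_to_pvalue(e_value: float) -> float:
    """Probability of at least one chance hit, assuming a Poisson count with mean E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is very small.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Large E-values map to P-values squeezed just below 1;
    # small E-values map to P-values that are almost equal to E.
    for e in (10, 5, 1, 0.01, 1e-5):
        print(f"E = {e:<8g} P = {evalue_to_pvalue(e):.6g}")
```

E = 5 and E = 10 are easy to tell apart, while the matching P-values differ only in the third decimal place. Below E = 0.01 the two numbers agree to about two significant figures.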

ADD COMMENT • link 9.0 years ago by Csaba Kerepesi ▴ 350

1

Could you elaborate further on your last sentence?

ADD REPLY • link 9.0 years ago by lelle ▴ 830

1

Any p-value is the result of a hypothesis test. Since a BLAST search is not a hypothesis test, a p-value would be an inappropriate result.

ADD REPLY • link 9.0 years ago by karl.stamm 4.1k

3

Yes, BLAST is doing a hypothesis test: is the sequence a homolog of your query, or not? The null hypothesis is that it is not a homolog, and instead is a "random" sequence. The P-value is the probability that you would've gotten a score this high if it's not a homolog. BLAST scores follow a known distribution (an extreme value distribution) under the null hypothesis. Conceptually, it's the same as any other p-value based significance test.

ADD REPLY • link 9.0 years ago by seanrobertseddy ▴ 50

0

I think most users aren't aware of the hypothesis test as you've stated it. Implicitly, BLAST is testing a query sequence against anywhere from thousands to thousands of millions of candidate sequences. If we interpret the p-value as the false positive rate (incorrect rejection of the null hypothesis), then we should apply a multiple-testing correction to the results, and the copious results are decimated. The chance of an artificial alignment is highly dependent upon the genome being searched and the complexity of the query sequence. We can guarantee that a 2-mer is a homolog of a million locations, but it's useless as a result. The E-value distribution accounts for these things and is more directly related to the complexity and uniqueness of a BLAST 'hit'. It's determined by the genome index being queried. We use BLAST to find things, and want to know how certain the result is. I think most users aren't specifying any hypotheses or accounting for the multiplicity thereof.
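To make the multiple-testing point concrete, here is a rough sketch, assuming N independent comparisons that each have the same per-comparison p-value. Both function names and the numbers are hypothetical, and real database searches are not strictly independent tests.

```python
import math


def expected_false_hits(p_single: float, n_comparisons: int) -> float:
    """Expected number of chance hits over all comparisons (an E-value-like quantity)."""
    return p_single * n_comparisons


def prob_at_least_one_false_hit(p_single: float, n_comparisons: int) -> float:
    """Search-wide probability of one or more chance hits, assuming independence."""
    # 1 - (1 - p)^N, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_comparisons * math.log1p(-p_single))


if __name__ == "__main__":
    p = 1e-8         # hypothetical p-value of a single pairwise comparison
    n = 1_000_000    # hypothetical number of database sequences
    print("expected chance hits:", expected_false_hits(p, n))
    print("P(at least one)     :", prob_at_least_one_false_hit(p, n))
```

A per-comparison p-value that looks tiny becomes far less impressive once it is scaled by the number of sequences searched, and that scaled quantity is what an expectation value reports directly.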

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.0 years ago by karl.stamm 4.1k

2

seanrobertseddy (Sean Eddy? Hello!) is right here. BLAST is doing a standard hypothesis test. It has an explicit null model and the E-value is estimated based on this model. You may argue whether the null model is appropriate, but math is math. As I remember, BLAST precomputes the two key parameters. FASTA/swat learns the parameters from data. They are less affected by the redundancy in the database.

ADD REPLY • link 9.0 years ago by lh3 33k

0

Exactly. BLAST P-value: "The probability of a chance alignment occurring with a particular score or a better score in a database search." Quoted from the BLAST Glossary.

More exactly: if you BLAST a query sequence of length n against a database of length m and get a hit with score S, then the P-value is the probability of getting at least one hit with a score greater than or equal to S when you BLAST a random query of length n against a random database of length m.

The last statement is derived mostly from: http://www.basiclocalalignmentsearchtool.com/

However, BLAST calculates the E-value, not the P-value, and the two are not equal. The BLAST E-value is the expected number of hits with a score greater than or equal to S when you BLAST a random query of length n against a random database of length m.
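For reference, the expectation described here is usually written with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S). The sketch below just evaluates that formula. The function name `karlin_altschul_evalue` and the values of `LAMBDA`, `K`, and the lengths are illustrative assumptions, not output from an actual BLAST run, and real BLAST additionally corrects the lengths for edge effects.

```python
import math


def karlin_altschul_evalue(score: float, query_len: int, db_len: int,
                           lam: float, k: float) -> float:
    """Expected number of chance hits with score >= `score`: E = K*m*n*exp(-lambda*S)."""
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    # Placeholder statistical parameters, chosen only for illustration.
    LAMBDA, K = 0.267, 0.041
    n = 300            # hypothetical query length
    m = 50_000_000     # hypothetical total database length
    for s in (60, 80, 100, 120):
        e = karlin_altschul_evalue(s, n, m, LAMBDA, K)
        print(f"S = {s:<4d} E = {e:.3g}")
```

E falls exponentially with the score and grows linearly with the database length, so the same alignment gets a larger E-value when it is found in a bigger database.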

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.0 years ago by Csaba Kerepesi ▴ 350

0

I agree. The last statement requires further elaboration; otherwise it might be misleading. Did you mean to say that the underlying distribution is different?

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.0 years ago by mxs ▴ 530