Question

BLAST E-value between two sequences vs BLAST E-value between a sequence and a database

2.1 years ago

nyanovsky • 0

I'm having a little trouble understanding the difference between the E-value given when I use BLAST to align two sequences and the E-values given when I run BLAST for the same query sequence against a database.

For example, when I run BLAST between human myoglobin and human neuroglobin, the E-value given is 3e-4. But when I run BLAST for human myoglobin against a RefSeq protein database, human neuroglobin doesn't appear, since it doesn't pass the E-value cutoff (which is indeed larger than 3e-4). I know the E-value is proportional to the query length (this doesn't change, since I'm running human myoglobin in both cases) and to the database length, and that it decreases exponentially with the score. I'm assuming the score is the same in both cases (I'm using the same parameters), so the only thing left that can explain why human neuroglobin doesn't appear in the database search is that the database length is different in the pairwise alignment, even though I'm selecting the same RefSeq protein database in both cases. This makes me think that when I run BLAST for a pairwise alignment, the database length used to calculate the E-value is just the length of the target sequence, or the sum of the query and target lengths. Is this so?

BLAST e-value • 622 views

updated 2.1 years ago by Mensur Dlakic ★ 27k • written 2.1 years ago by nyanovsky • 0

score 0 · Answer 1 · 2022-04-27

This is very easy to find by googling, because it is one of the cornerstones of sequence comparisons.

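In simplest terms: the E-value of a sequence comparison is the P-value of the same comparison multiplied by the number of sequences in the database. (Strictly speaking, BLAST scales the E-value by the total database length in residues rather than by the number of sequences, but the reasoning is the same.)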

Let's say that you compare A to B and you get an E-value of 1e-50. That's also the P-value, because your database size is 1. Now you put sequence B into a larger database that has 1 million sequences, and compare A against that database. The match to B in the context of the larger database will have an E-value of 1e-44 (1e-50 * 1e+6).
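The arithmetic in that example can be sketched in a few lines of Python. This is only an illustration of the simplified rule described here (E-value = pairwise value times the number of sequences); the function name scale_evalue is made up for this sketch and is not part of BLAST.

```python
import math

def scale_evalue(pairwise_evalue: float, n_sequences: int) -> float:
    """Simplified rule: multiply the one-vs-one value by the number of
    sequences in the database (hypothetical helper, not a BLAST API)."""
    return pairwise_evalue * n_sequences

# A vs B alone gives 1e-50; B placed in a database of 1 million sequences.
in_database = scale_evalue(1e-50, 1_000_000)
print(f"{in_database:.0e}")
```

The score of the A-B alignment is unchanged; only the size of the search space, and therefore the number of chance hits expected at that score, grows.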

When you multiply 3e-4 by the RefSeq size (I don't know what it is exactly, but an educated guess is at least 100 million sequences), you get an E-value above 30,000, which will not be displayed among the results.
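For a slightly more detailed picture, here is a rough sketch based on the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m is the query length and n is the database length in residues. Everything numeric below is an assumption for illustration: the lambda and K values are approximately what BLAST reports for gapped BLOSUM62 with gap costs 11/1, the sequence lengths are approximate, DB_RESIDUES is a hypothetical database size, and the edge-effect (effective length) corrections that BLAST applies are ignored. It also assumes that in a two-sequence comparison the search space is built from the subject sequence alone.

```python
import math

# Assumed, illustrative Karlin-Altschul parameters (not read from BLAST output).
LAMBDA = 0.267
K = 0.041

def expected_hits(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """E = K * m * n * exp(-lambda * S), without effective-length corrections."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def rescale(evalue, old_db_len, new_db_len):
    """Same query and same score: E changes linearly with the database length."""
    return evalue * new_db_len / old_db_len

SUBJECT_LEN = 151    # human neuroglobin, approximate length in residues
DB_RESIDUES = 4e10   # hypothetical total residue count of a large database

pairwise_e = 3e-4    # E-value reported in the question for the pairwise run
database_e = rescale(pairwise_e, SUBJECT_LEN, DB_RESIDUES)
print(f"{database_e:.1e}")
```

Under these assumptions the rescaled value lands in the tens of thousands, the same order of magnitude as the estimate above and far beyond BLAST's default E-value threshold of 10.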