Blast E-Value To Database Size
12.2 years ago
Hranjeev ★ 1.5k

Hi,

I'm thinking of splitting the database into smaller chunks and blasting my sequences against each chunk in a separate process. My only concern is the results (which I will merge later).

Would the resulting e-values be affected by database content when smaller subsets are used? I have a hunch that it would not matter once all the subset results are concatenated. Please correct me if I'm wrong.

blast statistics • 15k views
12.2 years ago
Neilfws 49k

The statistics of BLAST scores are described in this article. It's quite mathematics-heavy, but also quite readable; just take your time and re-read several times.

The short answer is yes: e-values depend on database size. Intuitively, a match of a given score is more likely to turn up by chance in a large database than in a small one, so the same alignment gets a larger e-value as the database grows.
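To make the dependence concrete: in the Karlin-Altschul statistics that BLAST uses, the expected number of chance hits with bit score S' is approximately E = m * n * 2^(-S'), where m is the query length and n is the total database length. The sketch below is only an illustration of that formula; the function name and the numbers are made up, and it ignores the effective-length corrections that BLAST applies.

```python
def evalue_from_bit_score(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    query_len (m) and db_len (n) are raw lengths in letters; BLAST itself
    uses slightly shorter "effective" lengths, which this sketch ignores.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: the same hit scored against the full database
# and against one of ten equally sized chunks.
full_db_letters = 1_000_000_000
chunk_letters = full_db_letters // 10

e_full = evalue_from_bit_score(50.0, 300, full_db_letters)
e_chunk = evalue_from_bit_score(50.0, 300, chunk_letters)

print(e_full, e_chunk, e_full / e_chunk)
```

Under this approximation the e-value is proportional to n, so a hit found in a chunk one tenth the size of the full database is reported with an e-value about ten times smaller than it would get against the full database.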

That said, it is possible to re-calculate e-values by combining the results when the database is split. This is implemented in, for example, mpiBLAST. It would be a good idea to study their website, code and publication to see how they handle the problem.
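As a rough illustration of the idea (not a description of how mpiBLAST actually does it), e-values from each chunk can be rescaled by the ratio of the full database length to the chunk length before the hits are merged. The sketch assumes the e-value is simply proportional to database length and ignores edge-effect corrections; all names and the data layout are hypothetical.

```python
def rescale_evalue(evalue, chunk_letters, full_letters):
    """Rescale a chunk e-value to the full database, assuming E is proportional to n."""
    return evalue * full_letters / chunk_letters


def merge_chunk_hits(chunk_results, full_letters):
    """Merge hits from several chunks into one list sorted by rescaled e-value.

    chunk_results: list of (chunk_letters, hits) pairs, where hits is a
    list of (subject_id, evalue) tuples reported against that chunk.
    """
    merged = []
    for chunk_letters, hits in chunk_results:
        for subject_id, evalue in hits:
            rescaled = rescale_evalue(evalue, chunk_letters, full_letters)
            merged.append((subject_id, rescaled))
    merged.sort(key=lambda hit: hit[1])  # best (smallest) e-value first
    return merged


# Hypothetical example: two chunks of different sizes.
example = [
    (100, [("seqA", 1e-5)]),
    (300, [("seqB", 1e-6)]),
]
print(merge_chunk_hits(example, full_letters=400))
```

Because the approximation leaves out the length adjustment, the rescaled values will not match a single full-database run exactly, but they should be comparable across chunks.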

See also the discussion of recalculating e-value in this paper or do a quick web search for "BLAST split database e-value calculate" - it's quite a widely-discussed issue.

Totally agree with the answer, but you can manually set the database size using the "-z" parameter. That way, you can split the db file into smaller pieces, run your queries and then merge the results.
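For reference, "-z" is the option in the legacy blastall program; in BLAST+ the corresponding option is -dbsize. A minimal sketch of building one BLAST+ command per chunk with the full database size passed in, assuming blastn and tabular output; the file names, chunk names and letter count are placeholders, and the commands are only printed, not executed.

```python
import shlex


def chunk_commands(query, chunk_dbs, full_db_letters, program="blastn"):
    """Build one BLAST+ command line per database chunk.

    Every command gets -dbsize set to the size of the FULL database, so the
    e-values reported for each chunk are computed as if the search had been
    run against the whole database.
    """
    commands = []
    for db in chunk_dbs:
        commands.append([
            program,
            "-query", query,
            "-db", db,
            "-dbsize", str(full_db_letters),
            "-outfmt", "6",          # tabular output, easy to concatenate
            "-out", f"{db}.tsv",
        ])
    return commands


# Placeholder names and size, for illustration only.
cmds = chunk_commands("queries.fasta", ["chunk_00", "chunk_01"], 1_000_000_000)
for cmd in cmds:
    print(shlex.join(cmd))
```

Each printed line can be run in its own process, for example with subprocess.run(cmd, check=True), and the resulting .tsv files concatenated afterwards.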

I think you also need to set the number of sequences in the database (N) so that the edge adjustment parameter (l or "ell") can be calculated. The adjustment is done for you if you use NOBLAST.

Try grep -v '^>' something.fasta | grep -o '[ACGTNacgtn]' | wc -l on your FASTA files before building the database. Note that this counts nucleotide letters only.
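If you would rather not depend on the alphabet, the same count can be done in a few lines of Python. This is a minimal sketch with a hypothetical function name; it counts every non-whitespace character on the sequence lines, so it works for protein FASTA too, but it will also count gap or stop characters if the file contains them.

```python
def fasta_stats(path):
    """Return (number of sequences, total number of letters) for a FASTA file."""
    n_seqs = 0
    n_letters = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue                 # skip blank lines
            if line.startswith(">"):
                n_seqs += 1              # header line starts a new record
            else:
                n_letters += len(line)   # sequence line: count its letters
    return n_seqs, n_letters


# Usage (the file name is a placeholder):
#   n_seqs, n_letters = fasta_stats("something.fasta")
```

For a database that has already been formatted, blastdbcmd -db <name> -info in BLAST+ should also report the number of sequences and the total number of letters.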

Thanks, much appreciated.

Oops, my hunch was wrong. Anyway, is there an easy way to count the total number of letters (N) in a database?

Can someone confirm whether the nr database is currently 5784003470 letters in size?

When I said "database size" I was referring to the total number of sequences in your database, not the total number of residues in it.
