Question: Distributed (parallel) BLAST. Calculating E-values and other statistics params.
4.0 years ago by
Belarus

Hi guys!

Imagine we are searching over distributed BLAST databases: slices of one big non-redundant database. We run a separate BLAST task over each slice with the same query and search parameters; everything is identical for each task except the database name. Each task then returns its results: a list of hits with alignment, E-value, and bit-score for its database slice. We need to display combined results, so we join all hits into one list and sort it by bit-score. The question is: HOW TO CALCULATE a SUMMARY E-value for every hit?

Formula for calculating E-value is:

E-value = Eff-space / 2^bit-score

The bit-score is independent of database size and stays the same for a given hit, no matter whether we search over the whole database or just a small piece of it.

I guess the summary E-value can be calculated as:

E-value summary = (Eff-space of piece 1 + Eff-space of piece 2 + ... + Eff-space of piece N) / 2^bit-score

where N is the number of slices (pieces)
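If it helps, the merge step can be sketched in a few lines of Python (a minimal sketch with invented numbers; the per-slice effective search spaces are assumed to come from each slice's BLAST report):

```python
def combined_evalue(bit_score, slice_eff_spaces):
    # Summary E-value as proposed above: sum the per-slice effective
    # search spaces, then apply E = eff_space / 2^bit_score once.
    total_space = sum(slice_eff_spaces)
    return total_space / 2 ** bit_score

# A hit with bit-score 40 found while searching three slices:
eff_spaces = [1.2e9, 0.9e9, 1.5e9]       # invented per-slice effective spaces
e = combined_evalue(40.0, eff_spaces)    # 3.6e9 / 2**40, roughly 0.0033
```

Since the bit-score of a hit does not change between slices, only the summed effective space matters for the recomputation.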

Please let me know if I am totally wrong, and give advice.

PS: Another question: can we somehow advise the searcher which E-value cutoff to use, knowing the database size and query length, so that at least one hit is returned? This question came from my own practice, when I used a small E-value cutoff while searching over a very big database with a small query sequence.
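One rough way to answer the PS (a hedged sketch using the naive eff-space = query length * database length approximation; real BLAST numbers differ somewhat because of the length adjustment): given the database size and query length, estimate what E-value a hit of a given bit-score would receive, and suggest a cutoff above that.

```python
def suggested_cutoff(query_len, db_len, min_bit_score):
    # E-value that a hit with bit-score `min_bit_score` would get under
    # the naive eff_space = m * n approximation; any useful -evalue
    # cutoff must be set above this value.
    eff_space = query_len * db_len
    return eff_space / 2 ** min_bit_score

# A 25 bp query against a 50 Gbase database, keeping hits of bit-score >= 40:
cutoff = suggested_cutoff(25, 50e9, 40.0)   # roughly 1.1, so -evalue 1e-5 would hide everything
```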

modified 4.0 years ago by Damian Kao15k • written 4.0 years ago by vladimir.khramkov10

Unfortunately, BLAST+ from NCBI does not have an input option for the number of sequences in the original database.

I have modified the source code of BLAST+ 2.3.0 to add this option, so it is now possible to use fragments of the database.

4.0 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

You can define an effective search space size in your BLASTs, so it will use that number for calculating E-values. I can't remember off the top of my head what the flag was; look through the manual or help.

Can you advise me how to choose the correct effective search space for a given database?

The effective search space is just length of query * length of database (not exactly, because there is a length adjustment step, but this is close enough). Take the sum of bases in your query sequences and multiply that by the sum of bases in the target sequences.
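This approximation, and the length-adjusted form that causes the discrepancy discussed below, can be sketched as follows (the length adjustment L and the number of database sequences are assumptions you would read from an actual BLAST report; this is not the exact NCBI computation):

```python
def naive_search_space(query_len, db_len):
    # The back-of-the-envelope value: m * n.
    return query_len * db_len

def adjusted_search_space(query_len, db_len, num_seqs, length_adjustment):
    # BLAST subtracts the length adjustment L from the query once and
    # from the database once per sequence, so for short queries against
    # databases with many sequences the effective space shrinks a lot.
    m = query_len - length_adjustment
    n = db_len - num_seqs * length_adjustment
    return m * n

# For a short query the two can easily differ severalfold:
naive = naive_search_space(30, 1_000_000)                    # 30,000,000
adjusted = adjusted_search_space(30, 1_000_000, 1000, 20)    # 10 * 980,000 = 9,800,000
```

With invented numbers like these, the naive product is already about 3x the adjusted space, which may explain the "5 times" gap reported below.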

In my case, (query length * database length) = 5 times the effective search space calculated by BLAST.

Sorry, but 5 times is not close enough.

I think when you are dealing with values like 1e-50, 5 times probably is close enough. Do you really need to resolve between 1e-50 and 1e-51? If you are just comparing the E-values among each other, then it doesn't really matter what the absolute E-values are, as long as they are calculated consistently.

If you really care about the absolute score, you might as well use the bit-score.

E-values like 1e-50 are attainable only for very long queries (1000 bp and more). I usually search with short queries (20-300 bp) and receive hits with E-values from 0.5 to 500.

4.0 years ago by
Göttingen, Germany
Manuel Landesfeind1.2k wrote:

I am not sure the "summary" calculation is that easy, because you already stated that the E-value depends on database-specific parameters (see http://www.ncbi.nlm.nih.gov/BLAST/tutorial). I guess the easiest thing would be to estimate the database-specific parameters (i.e., lambda and K) for the full database (BLAST can probably report them) and just recalculate the E-value for every match by formula (2) given in the link above.

A probably easier approach would be to distribute the workload by splitting the input sequences and matching them against the full database...

PS: You should be able to determine the "significance" threshold in terms of bitscore by the same formula.

1) In most cases the input query cannot be divided into separate sequences, because it already consists of only one sequence.

2) I also thought in this direction, i.e., estimating the database-specific parameters (lambda and K), but the problem is that BLAST over one big database is not the same as BLAST over all the pieces of this database. BLAST over all the pieces returns better results (more hits). For example, there is a parameter called "length adjustment". For a big database the length adjustment is larger, and for a small database it is smaller. This means that a BLAST search with a smaller length adjustment parameter will return more hits than a BLAST search over the big database with a larger length adjustment.

Why is "BLAST over one big database not the same as BLAST over all pieces" when referring to RAW alignment scores? If you consider only one match and sum up the substitution scores, gap penalties, etc., it does not matter whether the rest of the database is large or small. Given such a raw alignment score, as well as the database parameters lambda and K, you can calculate the E-value.

For sure, to do so you need to recalculate the raw alignment score from a given alignment (see "PS" below), but I think you are targeting something in that direction, aren't you? The efforts that you describe (e.g., splitting up the BLAST database, recalculating scores, etc., for a single query) would only make sense if you are planning to develop something like an "accelerated BLAST" in a highly distributed environment. At least from my point of view...

PS: Referring to your initial question stating that the "bit-score is independent of database size", and given the information in the link above, I would say this is incorrect. The bit-score is already normalized depending on lambda and K, and is thus database-specific. (Incorrect, see below.)

If you look at the BLAST source code, you will find the function which calculates lambda, K, and H.

As I am looking at gapped alignments, I will show you the function for the lambda, K, H calculation in the gapped case.

This is Blast_KarlinBlkGappedFill:

Blast_KarlinBlkGappedFill(Blast_KarlinBlk* kbp,
Int4 gap_open,
Int4 gap_extend,
Int4 decline_align,
const char* matrix_name)

This function returns a Blast_KarlinBlk structure with the lambda, K, logK, H, and C parameters filled in.

As you can see, this function does not use any database size information; its result depends entirely on the search parameters and the scoring matrix.

So, if lambda and K are independent of the database size, then the bit-score is also independent of the database size.
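For reference, the normalization in question is the standard Karlin-Altschul formula; a minimal sketch (the lambda/K values below are the usual gapped BLOSUM62 parameters, quoted only for illustration):

```python
import math

def bit_score(raw_score, lam, K):
    # The bit score depends only on the raw alignment score and the
    # lambda/K pair that Blast_KarlinBlkGappedFill derives from the
    # scoring parameters - the database size never enters.
    return (lam * raw_score - math.log(K)) / math.log(2)

# Raw score 100 with gapped BLOSUM62 parameters (lambda ~ 0.267, K ~ 0.041):
b = bit_score(100, 0.267, 0.041)   # about 43 bits, for any database size
```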

You are right. I think I misunderstood that - shame on me! :-/

But then let me ask: why is "BLAST over one big database not the same as BLAST over all pieces" when referring to bit-scores, given that the search parameters are equal? And if you have already dug into the source code of BLAST, calculating the E-value for a database of size N (see the section "Database searches") is even easier.

BLAST over one big database will find fewer hits than independent BLASTs over small pieces of this database.

I can explain this effect by two things:
1) The -evalue parameter, which tells BLAST not to return hits with an E-value greater than this parameter.

As you know, the E-value of a hit depends on the eff-space of the database where it was found. The bigger the database size and eff-space, the greater the E-values BLAST returns for the same hits with the same bit-score. A situation can therefore arise where BLAST does not return even an exact match, because its E-value is greater than the -evalue parameter. In my case, a short query has a hit with E-value = 2.2 over a small piece and E-value = 635 when I search over the whole big database.
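Plugging invented numbers into E = eff-space / 2^bit-score reproduces exactly this effect (the effective spaces below are made up to roughly match the 2.2 vs. 635 example):

```python
def evalue(eff_space, bit_score):
    return eff_space / 2 ** bit_score

bit = 20.0                       # same alignment, same bit score in both searches
e_slice = evalue(2.3e6, bit)     # about 2.2 for the small slice
e_whole = evalue(6.66e8, bit)    # about 635 for the whole database
# With the default -evalue 10, the identical alignment is reported for
# the slice but silently dropped for the full database.
```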

2) Another parameter which affects the search results (hits) returned by BLAST is the length adjustment.

The length adjustment tells BLAST not to look for hits near the right end of sequences - the edge effect.

For a big database BLAST uses a greater length adjustment than for a small one. This means that it may fail to find some hits (depending on the value of the length adjustment) when searching over the big database.