Inferring homology from BLAST scores/statistics
18 months ago

Hello,

I have some proteins with BLAST homologs, and I am trying to get a quantitative measure of the representativeness of each match. As I understand, everyone normally compares BLAST alignments using bit scores, as these are database-size independent. However (please correct me if I'm wrong) bit scores only describe the quality of the HSP itself, not how representative that HSP/bit score is of its parent protein.

Would I be barking up the wrong tree if I DIYed a score for comparison? One of the main reasons I'm asking is I'm not particularly hot on BLAST statistics (so this may all be unnecessary) and I know making up your own stats can be a bad idea.

My score would be something like:

(bitscore * perc_coverage)/log (evalue)


This would hopefully approximate to:

(quality of HSP * HSP representativeness of protein) / reliability of HSP quality


NB - I would take the log to stop differences in E-value massively biasing the final score.
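For anyone who wants to try the formula above on real hits, here is a minimal sketch of it. The function name `diy_score` and the example values are made up for illustration, and I am assuming base-10 for the log since the post does not specify one. Two edge cases need handling: BLAST can report an E-value of 0 for very strong hits (the log is undefined there), and an E-value of exactly 1 gives a zero denominator.

```python
import math

def diy_score(bitscore, perc_coverage, evalue):
    """Proposed score: (bitscore * perc_coverage) / log10(evalue).

    Returns None where the formula is undefined.
    """
    if evalue <= 0:
        return None  # log of 0 is undefined; BLAST may report 0.0 for strong hits
    log_e = math.log10(evalue)  # assumption: base 10 (the post just says "log")
    if log_e == 0:
        return None  # E-value of exactly 1 would divide by zero
    return (bitscore * perc_coverage) / log_e

# Hypothetical hits: (bitscore, coverage as a fraction, E-value)
strong = diy_score(200.0, 0.9, 1e-50)
weak = diy_score(40.0, 0.5, 1e-2)
print(strong, weak)
```

Note that for any E-value below 1 the log is negative, so the score comes out negative, and because the denominator grows in magnitude as the E-value shrinks, the more significant hit here ends up with the smaller absolute score. That may or may not be what was intended.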

18 months ago
Mensur Dlakic

As I understand, everyone normally compares BLAST alignments using bit scores, as these are database-size independent.

Bit-scores are independent of database size, but they are sequence length-dependent. Longer query sequences are expected to have higher bit-scores that still may not be statistically significant.

Whether you like BLAST E-value statistics or not, they take into account as many things as possible regarding the length and number of HSPs, their bit-scores, and also the database size. Hope you don't take this the wrong way, but I doubt you can come up with a measure that is more informative than the E-value and still reflects true relationships. Your measure may seem better on a limited number of examples, but neither bit-score nor coverage is globally accurate when it comes to assessing homologous relationships on a large scale.
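To make the relationship concrete, the standard Karlin-Altschul expression for a single HSP is E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. Below is a rough sketch of that formula only. The function name `expected_hits` and the lengths are hypothetical, and it uses raw lengths, whereas BLAST itself applies effective (edge-corrected) lengths, so the numbers will not match real BLAST output exactly.

```python
def expected_hits(bit_score, query_len, db_len):
    """E = m * n * 2**(-S'): chance HSPs expected at or above bit score S'.

    Simplification: raw lengths are used instead of BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

# Same bit score of 50 in three hypothetical searches
small_db = expected_hits(50.0, 300, 1_000_000)
large_db = expected_hits(50.0, 300, 1_000_000_000)
long_query = expected_hits(50.0, 3000, 1_000_000)

print(f"small db:   {small_db:.2e}")
print(f"large db:   {large_db:.2e}")
print(f"long query: {long_query:.2e}")
```

The bit score is identical in all three cases, yet the E-value grows in proportion to both database size and query length, which is why the same bit score can be significant for one search and not for another.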

Ahh perfect thanks for the reply - I thought I might be getting ahead of myself, so glad I sanity checked this!

Longer query sequences are expected to have higher bit-scores that still may not be statistically significant.

Out of interest/for my own learning, do you mean (i) longer queries, or (ii) longer query subsequences contained in a given HSP here? I'm guessing the second based on this, which says:

An alignment that is twice as long, e.g. 200 residues instead of 100 residues at the same evolutionary distance, will have a bit score that is twice as high.

(assuming by 'alignment' they mean HSPs)

My point was that bit-scores are biased by query length (your first choice), but that also extends to the expected alignment length (your second choice).