Inferring homology from BLAST scores/statistics
3.0 years ago

Hello,

I have some proteins with BLAST homologs, and I am trying to get a quantitative measure of the representativeness of each match. As I understand it, everyone normally compares BLAST alignments using bit scores, as these are database-size independent. However (please correct me if I'm wrong), bit scores only describe the quality of the HSP itself, not how representative that HSP/bit score is of its parent protein.

Would I be barking up the wrong tree if I DIYed a score for comparison? One of the main reasons I'm asking is I'm not particularly hot on BLAST statistics (so this may all be unnecessary) and I know making up your own stats can be a bad idea.

My score would be something like:

(bitscore * perc_coverage) / log(evalue)

This would hopefully approximate to:

(quality of HSP * HSP representativeness of protein) / reliability of HSP quality

NB - I would take the log to stop differences in E-value massively biasing the final score.
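
For concreteness, here is a minimal Python sketch of the score exactly as written above. The function name `diy_score`, the choice of log base 10, and the `evalue_floor` clamp are assumptions made for illustration (the formula above does not specify a base, and BLAST can report very small E-values as 0, for which the log is undefined).

```python
import math

def diy_score(bitscore, perc_coverage, evalue, evalue_floor=1e-180):
    """Proposed score: (bitscore * perc_coverage) / log10(evalue).

    evalue_floor is a hypothetical clamp so that an E-value of 0
    does not make the logarithm undefined.
    """
    log_e = math.log10(max(evalue, evalue_floor))
    if log_e == 0:
        # evalue == 1 gives log10(1) == 0, so the ratio is undefined
        raise ValueError("score is undefined when evalue == 1")
    return (bitscore * perc_coverage) / log_e

# Same bit score and coverage, two different E-values
print(diy_score(200, 0.9, 1e-50))
print(diy_score(200, 0.9, 1e-100))
```

Note that for E-values below 1 the logarithm is negative, so the score as written comes out negative, and a smaller E-value shrinks its magnitude rather than increasing it.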

Thanks for reading!

statistics blast • 1.3k views
3.0 years ago
Mensur Dlakic ★ 28k

As I understand, everyone normally compares blast alignments using bit scores, as these are database-size independent.

Bit-scores are independent of database size, but they are sequence length-dependent. Longer query sequences are expected to have higher bit-scores that still may not be statistically significant.

Whether you like BLAST E-value statistics or not, they take into account as many things as possible regarding the length and number of HSPs, their bit-scores, and also the database size. Hope you don't take this the wrong way, but I doubt you can come up with a measure that is more informative than the E-value and still reflects true relationships. Your measure may seem better on a limited number of examples, but neither bit-score nor coverage is globally accurate when it comes to assessing homologous relationships on a large scale.
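
To illustrate the length dependence, here is a small Python sketch based on the simplified Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the database length. The function name `expected_hits` and the example numbers are made up for illustration, and the sketch ignores the effective-length (edge-effect) corrections that BLAST applies, so treat it as an approximation rather than a reproduction of BLAST's output.

```python
def expected_hits(bit_score, query_len, db_len):
    """Simplified E-value: E = m * n * 2**(-S').

    Uses raw lengths; BLAST itself uses effective lengths,
    which this sketch does not attempt to compute.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

db_len = 1_000_000_000  # hypothetical database size in residues

# The same bit score is less significant for a longer query
short_query = expected_hits(40, 200, db_len)
long_query = expected_hits(40, 2000, db_len)
print(short_query, long_query)
```

Under this approximation, a tenfold longer query gives a tenfold larger E-value for an identical bit score, which is why a bit score alone does not say whether a hit is significant.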


Ahh perfect thanks for the reply - I thought I might be getting ahead of myself, so glad I sanity checked this!

Longer query sequences are expected to have higher bit-scores that still may not be statistically significant.

Out of interest/for my own learning, do you mean (i) longer queries, or (ii) longer query subsequences contained in a given HSP here? I'm guessing the second based on this, which says:

An alignment that is twice as long, e.g. 200 residues instead of 100 residues at the same evolutionary distance, will have a bit score that is twice as high.

(assuming by 'alignment' they mean HSPs)


My point was that bit-scores are biased by query length (your first choice), but that also extends to the expected alignment length (your second choice).
