Question

Help interpreting BLAST results? (Max score/Percent. Identity/E-values)

2

3.6 years ago

cia ▴ 20

Hi everyone! This is my first time using BLAST and doing comparative genomics, and I would appreciate some help. I understand in theory what these values represent (max score, percent identity, etc.), but when it comes to the practical part I have some problems. How can I actually tell whether these values are good or not? To be clearer, I will report some values below.

The query length is 370 aa. The max score and total score are both 340. The query cover is 97%. The E-value is 1e-103. The percent identity is 47.45%.

Now, I understand that this is a good alignment, but I would really appreciate it if someone could explain in a simple way HOW I can tell that the score I got is a high one, or simply how you would interpret these data. Also, how is the percent identity related to the matrix (BLOSUM62) I used? WHY is this a good value?

I know I probably asked some silly questions, but as I said, this is my first attempt and I'm trying to understand the basics. Thank you so much to everyone who helps.

alignment gene genome • 28k views

ADD COMMENT • link updated 2.2 years ago by DavidStreid ▴ 90 • written 3.6 years ago by cia ▴ 20

3

NCBI has several resources available on this page that should be useful. Statistics of sequence similarity scores is covered here.

ADD REPLY • link 3.6 years ago by GenoMax 142k

1

GenoMax's linked resources are all you should need, but the TL;DR is that these statistics tell you different things about how accurate/meaningful your alignment is. Coverage, for example, tells you whether you have a short or long alignment, and combined with identity it can tell you whether you have a long, low-identity match (e.g. perhaps orthologous genes) or a short, high-identity match (similar protein domains/active sites). The E-value is the number of matches scoring at least this well that you would expect to see purely by chance in a database of this size, so you want this number to be as low as possible. A lot of people/tools use a default cutoff of 1E-6, but this is pretty arbitrary.
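As a rough illustration of applying such a cutoff, here is a minimal Python sketch that parses BLAST+ tabular output (`-outfmt 6`, default columns) and keeps hits at or below an E-value threshold. The function names and the two example rows are made up for illustration; only the first row loosely mirrors the numbers in the question.

```python
# Column order of the default BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_hits(lines):
    """Turn tab-separated outfmt 6 lines into a list of dicts."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(OUTFMT6_COLUMNS, line.split("\t")))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


def filter_by_evalue(hits, max_evalue=1e-6):
    """Keep hits whose E-value is at or below the (arbitrary) cutoff."""
    return [h for h in hits if h["evalue"] <= max_evalue]


# Hypothetical rows, invented for this example.
EXAMPLE_ROWS = [
    "query1\thitA\t47.45\t359\t180\t4\t5\t363\t8\t362\t1e-103\t340",
    "query1\thitB\t31.20\t62\t40\t1\t120\t181\t15\t75\t0.002\t38.5",
]

kept = filter_by_evalue(parse_hits(EXAMPLE_ROWS))
print([h["sseqid"] for h in kept])
```

The cutoff is a parameter on purpose: as noted above, 1E-6 is a convention and not a rule.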

ADD REPLY • link 3.6 years ago by Joe 21k

0

Thank you so much to both of you. Joe, could you apply what you said to my example? In my specific case, how are these two parameters related? Again, sorry if it's a basic question, but this is all new to me, math is not my strong point, and concrete examples help me understand. Thank you so much for your time!!!

ADD REPLY • link 3.6 years ago by cia ▴ 20

score 4 · Answer 1 · 2020-09-25

4

3.6 years ago

Mensur Dlakic ★ 27k

All of these quantities are telling you something about the relationship between query and match sequences. The least informative among them is score, even though a high score generally means the sequences are more likely to be related. However, score is length-dependent, so a sequence that is 10000 residues long may have a score of 1000 with an unrelated sequence, while a sequence that is 300 residues long may have a score of 1000 with another, and that will actually be a true relationship.
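To make the length dependence concrete, here is a small Python sketch that divides a score by the alignment length. This per-residue ratio is not a statistic BLAST reports; it is only an illustrative way to compare the two cases above, under the assumption that the alignment spans the whole sequence.

```python
def score_per_residue(score, aligned_length):
    """Average score contributed by each aligned position."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return score / aligned_length


# The two cases from the paragraph above: same score, very different lengths.
long_sequence = score_per_residue(1000, 10000)   # spread thinly over many residues
short_sequence = score_per_residue(1000, 300)    # concentrated in few residues

print(f"10000 residues: {long_sequence:.2f} per residue")
print(f"  300 residues: {short_sequence:.2f} per residue")
```

The same total score is far more concentrated in the short alignment, which is why the raw score alone is hard to interpret.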

BLAST works with something called high-scoring segment pairs (HSPs), which boils down to making lots of local alignments and scoring each separately. The total score is derived from a combination of those scores. When max and total scores are the same, there is a single alignment between the two sequences rather than several separate pieces, which is usually good because it means that they can be aligned well without long insertions or deletions.

An E-value already accounts for sequence length and database size, and is usually more indicative of a true relationship than a raw score. As already explained, it represents the number of alignments scoring at least this well that would be expected by chance. In your case the E-value is zero for all practical purposes.
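For those who like to see the arithmetic, the standard relation between a bit score S' and the E-value is E = m · n · 2^(-S'), where m is the query length and n is the total database length. Below is a minimal Python sketch of that formula. The database size is a made-up number, and BLAST itself uses corrected "effective" lengths, so this will not reproduce the reported 1e-103 exactly; it only shows the order of magnitude involved.

```python
def evalue_from_bit_score(bit_score, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


QUERY_LEN = 370              # from the question
ASSUMED_DB_LEN = 100_000_000 # hypothetical database size in residues

for bits in (30, 50, 340):
    e = evalue_from_bit_score(bits, QUERY_LEN, ASSUMED_DB_LEN)
    print(f"bit score {bits:>3}: E ~ {e:.1e}")
```

For a fixed query and database, the E-value shrinks by half with every additional bit of score, so within one search the highest bit score always goes with the lowest E-value.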

Percent identity tells you how related the two sequences are in terms of evolutionary distance. Yours are fairly divergent, which likely means that they have been separated by long evolutionary history.

A coverage of 97% tells you that the two sequences have the same overall organization and length. Along with identical total and max scores, that indicates that the two proteins are likely to perform a very similar function, and possibly an identical function. If the coverage were 30%, it could mean that the two sequences share only a single protein domain, in which case they would be more likely to perform somewhat different functions.
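Query coverage is simply the fraction of query positions that fall inside at least one aligned segment. Here is a minimal Python sketch of that calculation; the function name and the alignment coordinates are hypothetical, chosen so that a 370-residue query comes out at about 97%.

```python
def query_coverage(hsp_ranges, query_len):
    """Percent of query positions covered by at least one HSP.

    hsp_ranges: list of (start, end) query coordinates, 1-based, inclusive.
    Overlapping ranges are counted only once.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(hsp_ranges):
        start = max(start, last_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / query_len


# One long alignment spanning nearly the whole query (made-up coordinates).
print(round(query_coverage([(5, 363)], 370), 1))

# A single shared domain covers far less of the query.
print(round(query_coverage([(120, 230)], 370), 1))
```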

All combined together: 1) percent identity tells us that your two sequences are very distant in terms of evolution (say, one was from yeast and another from human); 2) E-value tells us they are clearly related to each other; 3) scores and coverage tell us that they likely perform the same function.

ADD COMMENT • link 3.6 years ago by Mensur Dlakic ★ 27k

0

Thank you so much, your explanation is really detailed and is helping me a lot! If I'm not asking too much, could you elaborate on the percent identity? I mean, I get what it represents, but how do I interpret the value? To be clearer, why does a value of 47.45% tell me that they are fairly divergent? And am I correct when I say that percent identity is related to the matrix that we use (in this case BLOSUM62)?

Again, thank you so much!!!!

ADD REPLY • link 3.6 years ago by cia ▴ 20

1

The interpretation of percent identity is empirical - there are no prescribed values beyond very high identity that may indicate the same (sub)species. For reference, human and chimp proteins will typically have identity in mid- to high-90s; human and pig probably in low 90s; with chicken probably in mid- to high-80s. The point is that anything below 50% identity will be a distant relationship. I made an educated guess when saying yeast and human, but that could really be human and any invertebrates as well.

Percent identity is literally what it sounds like - the percentage of identical residues in the alignment. As such it is not related to the substitution matrix, though the creation of the alignment itself will be affected by which substitution matrix is used. BLAST also outputs a similarity value between the two sequences, and that is related to the substitution matrix. If you perform BLAST searches using BLOSUM62 and BLOSUM80, you may get two identical alignments (meaning two identical numbers for percent identity) that will have different similarity values. This is more likely to happen for relatively low-identity cases such as yours.
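The difference between identity and similarity can be sketched in a few lines of Python. Identity only asks whether two aligned residues are the same letter, while similarity (shown as "Positives" in BLAST output) also counts non-identical pairs that score above zero in the substitution matrix. The two aligned sequences below are invented, and the dictionary holds only a handful of BLOSUM62 entries needed for this toy example, not the full matrix.

```python
# A few off-diagonal BLOSUM62 scores (symmetric); NOT the full matrix.
BLOSUM62_SUBSET = {
    ("I", "V"): 3, ("I", "L"): 2, ("K", "R"): 2,
    ("D", "E"): 2, ("S", "T"): 1, ("A", "G"): 0, ("D", "L"): -4,
}


def substitution_score(a, b):
    """Look up a non-identical pair in either orientation."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]  # KeyError if the pair is not in the subset


def identity_and_positives(query, subject):
    """Return (percent identity, percent positives) over the alignment length."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    identical = positive = 0
    for a, b in zip(query, subject):
        if a == "-" or b == "-":
            continue  # gaps count toward length but never match
        if a == b:
            identical += 1
            positive += 1
        elif substitution_score(a, b) > 0:
            positive += 1
    n = len(query)
    return 100.0 * identical / n, 100.0 * positive / n


# Made-up aligned fragments.
pident, ppos = identity_and_positives("MKVLDSAI", "MRVIDTGL")
print(f"identity {pident:.1f}%, positives {ppos:.1f}%")
```

Swapping in a different matrix could change which pairs count as positives, but it cannot change the identity count for the same alignment.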

ADD REPLY • link 3.6 years ago by Mensur Dlakic ★ 27k

0

Again, thank you very much!!

ADD REPLY • link 3.6 years ago by cia ▴ 20

0

And am I correct when I say that percent identity is related to the matrix that we use?

Percent identity is telling you how many residues in your query are an identical match to the hit. Closely related sequences will have a much higher % identity.

ADD REPLY • link 3.6 years ago by GenoMax 142k

0

Would the coverage remain identical if domains get reshuffled?

ADD REPLY • link 3.6 years ago by Dunois ★ 2.5k

1

Unlikely, because the flanking sequences would contribute to the alignment and change the score somewhat.

ADD REPLY • link 3.6 years ago by Joe 21k

0

Would the max-score ever provide more information than the E-value? It seems that an expect-value that indicates statistical significance would mean the alignment is great. Plus, the lowest E-value always seems to have the highest max-score (please let me know if this isn't the case).

And thank you for your helpful explanation!

ADD REPLY • link 2.2 years ago by DavidStreid ▴ 90