Question

Identity, e-value or bitscore?

7


9.3 years ago

Leo ▴ 70

Hi everyone, I have a BLASTP result with an average identity of 35%. Is that an acceptable percentage?

If not, what is the minimal acceptable identity percentage?

Between identity percentage, e-value and bitscore, which one should we focus on to find the best match? Thanks.

blast annotation • 68k views

ADD COMMENT • link updated 6.1 years ago by alslonik ▴ 320 • written 9.3 years ago by Leo ▴ 70

18


9.3 years ago

Being Bioinformatician ▴ 250

The E-value indicates how likely it is that a given sequence match arose purely by chance: it is the number of matches with an equal or better score expected by chance in a database of that size. The lower the E-value, the less likely it is that the database match is a result of random chance, and therefore the more significant the match. An empirical interpretation of the E-value is as follows. If E < 1e-50 (or 1 × 10^-50), there is extremely high confidence that the database match reflects a homologous relationship. If E is between 1e-50 and 0.01, the match can be considered a result of homology. If E is between 0.01 and 10, the match is considered not significant, but may hint at a tentative remote homology relationship; additional evidence is needed to confirm it. If E > 10, the sequences under consideration are either unrelated or related so distantly that the relationship falls below the limit of detection of the current method.

Because the E-value is proportional to the database size, the E-value for a given sequence match increases as the database grows. The genuine evolutionary relationship between the two sequences remains constant, so this loss of credibility means that one may "lose" previously detected homologs as the database enlarges. Thus, an alternative to E-value calculations is needed.
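As a rough sketch of the empirical cutoffs above, the following Python snippet maps an E-value to the categories described and shows how the same alignment loses significance when the database grows. The function names and category labels are made up for illustration, and the linear rescaling assumes that the alignment score and query stay the same.

```python
def interpret_evalue(evalue: float) -> str:
    """Map a BLAST E-value to the empirical categories described above."""
    if evalue < 0:
        raise ValueError("E-values cannot be negative")
    if evalue < 1e-50:
        return "extremely high confidence of homology"
    if evalue < 0.01:
        return "likely homology"
    if evalue <= 10:
        return "not significant; possible remote homology"
    return "unrelated or beyond the limit of detection"


def rescale_evalue(evalue: float, old_db_size: float, new_db_size: float) -> float:
    """Rescale an E-value for a different database size.

    Assumes the same query and alignment score, so that the E-value
    changes in direct proportion to the database size.
    """
    return evalue * new_db_size / old_db_size


# Hypothetical E-values, one per category.
for e in (1e-80, 1e-5, 0.5, 50):
    print(f"{e:g}: {interpret_evalue(e)}")

# The same hit in a database ten times larger (hypothetical sizes).
grown = rescale_evalue(0.005, old_db_size=1e6, new_db_size=1e7)
print(f"{grown:g}: {interpret_evalue(grown)}")
```

The last two lines illustrate the "lost homolog" problem: a hit just under the 0.01 cutoff crosses it once the database is ten times larger, although the alignment itself has not changed.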

The bit score is another prominent statistical indicator reported alongside the E-value in BLAST output. It measures sequence similarity independently of query sequence length and database size and is normalized from the raw pairwise alignment score. The bit score (S') is determined by the following formula: S' = (λ × S − ln K) / ln 2, where λ is the Gumbel distribution constant, S is the raw alignment score, and K is a constant associated with the scoring matrix used. Clearly, the bit score (S') is linearly related to the raw alignment score (S). Thus, the higher the bit score, the more significant the match. The bit score provides a constant statistical indicator for searching databases of different sizes or for searching the same database at different times as it grows.
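The formula can be tried out with a small Python sketch, writing the bit score as S' to keep it distinct from the raw score S. The values of λ and K below are the ones commonly reported for gapped BLOSUM62 searches (gap open 11, gap extend 1) and are used here only as illustrative assumptions; the values that apply to a particular search are printed in the footer of the BLAST report. The helper `evalue_from_bits` uses the standard relation E = m × n × 2^(−S'), which is not part of the text above.

```python
import math

# Illustrative assumptions: typical gapped BLOSUM62 (11/1) parameters.
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score: float, lam: float = LAMBDA, k: float = K) -> float:
    """Bit score S' = (lambda * S - ln K) / ln 2 for a raw alignment score S."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def evalue_from_bits(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits, E = m * n * 2**(-S').

    query_len (m) and db_len (n) are lengths in residues; this ignores
    the edge-effect length corrections that BLAST applies.
    """
    return query_len * db_len * 2.0 ** (-bits)


bits = bit_score(100)  # hypothetical raw score
print(f"bit score: {bits:.1f}")

# Same bit score, two hypothetical database sizes (in residues).
for db_len in (1_000_000, 100_000_000):
    print(db_len, f"{evalue_from_bits(bits, query_len=300, db_len=db_len):.2e}")
```

The bit score stays the same in both searches while the E-value grows with the database, which is why the bit score is the easier quantity to compare across databases.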

ADD COMMENT • link 9.3 years ago by Being Bioinformatician ▴ 250

5


6.1 years ago

alslonik ▴ 320

Just because I always bump into this one when I google for the article to cite for the cutoff for inferring homology between two proteins (an important condition), here is the link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/ And here is the quotation:

The bit-score provides a better rule-of-thumb for inferring homology. For average length proteins, a bit score of 50 is almost always significant. A bit score of 40 is only significant (E() < 0.001) in searches of protein databases with fewer than 7000 entries. Increasing the score by 10 bits increases the significance 2^10=1000-fold, so 50 bits would be significant in a database with less than 7 million entries (10 times SwissProt, and within a factor of 3 of the largest protein databases). Thus, the NCBI Blast web site uses a color code of blue for alignment with scores between 40–50 bits; and green for scores between 50–80 bits. In the yeast vs human example, the alignments with less than 20% identity had scores ranging from 55 – 170 bits. Except for very long proteins and very large databases, 50 bits of similarity score will always be statistically significant and is a much better rule-of-thumb for inferring homology in protein alignments.
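The numbers in the quotation can be roughly reproduced with a short Python sketch. It relies on the standard relation E = m × n × 2^(−S') between bit score and E-value, which the quotation does not spell out, and on the assumption that both the query and the average database protein are about 400 residues long. That length is a guess chosen for illustration, not a figure from the article.

```python
def max_db_entries(bits: float,
                   evalue_cutoff: float = 1e-3,
                   query_len: int = 400,
                   mean_subject_len: int = 400) -> float:
    """Largest database (in entries) in which a hit stays below the cutoff.

    Solves E = m * n * 2**(-bits) for n, where m is the query length and
    n = entries * mean_subject_len. The default lengths are assumptions.
    """
    max_search_space = evalue_cutoff * 2.0 ** bits  # largest allowed m * n
    return max_search_space / (query_len * mean_subject_len)


for bits in (40, 50):
    print(f"{bits} bits: significant up to about {max_db_entries(bits):,.0f} entries")
```

Under these assumptions a 40-bit hit stays below E = 0.001 only in databases of a few thousand entries and a 50-bit hit in databases of a few million, in line with the figures quoted. Each extra 10 bits raises the limit by a factor of 2^10 = 1024.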

ADD COMMENT • link 6.1 years ago by alslonik ▴ 320

2


6.9 years ago

utsafar ▴ 80

First of all, keep in mind that BLAST hits are HSPs, which may cover only part of the query and subject sequences rather than their full length. So here I am only talking about the matched parts of your query (Q) and subject (S) sequences.

Identity: "the average identity percentage 35%" is meaningless because BLAST hits are independent. For example, a BlastP search with two hits, protein 1 against protein 2 with pident 55% and protein 1 against protein 3 with pident 15%, says that protein 1 is with high confidence a homolog of protein 2, but you must be more cautious about the homology between protein 1 and protein 3. Keep in mind that proteins are made of 20 different AAs, so if you align two unrelated protein sequences (or any other random AA sequences) of any length you will get about 5% random identity (for DNA and RNA sequences the random identity is 25%, since those are made of four different bases: A, T, C, G). There is another parameter in BlastP, ppos, which is based on similarity: ppos is pident plus the percentage of similar but not identical AA matches. Overall, I think two AA sequences with pident higher than 20% and ppos higher than 30% are close enough to be called homologs. For NA sequences I think pident of 40% and above is OK.

P-value: depends on the query and DB lengths, but I think a p-value lower than 10^-5 indicates a relationship.

BitScore: depends strongly on query length. Compare the bitscore with your qlen; I think if the bitscore of a hit is 0.7 of qlen or greater, the query and subject are close enough.
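The rules of thumb in this answer could be checked per hit with a small Python sketch like the one below. It assumes tabular BLAST+ output produced with `-outfmt "6 qseqid sseqid pident ppos evalue bitscore qlen"`, and it applies the 10^-5 cutoff to the `evalue` column, since that is the value BLAST reports in this format. The function name, the example rows and the decision to report each criterion separately are illustrative choices, not part of the answer.

```python
import csv
import io

# Column order must match the -outfmt string used for the search (assumed here).
COLUMNS = ["qseqid", "sseqid", "pident", "ppos", "evalue", "bitscore", "qlen"]


def check_hits(handle, min_pident=20.0, min_ppos=30.0,
               max_evalue=1e-5, min_bits_per_residue=0.7):
    """Evaluate each hit (HSP) independently against the thresholds above."""
    results = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        results.append({
            "query": hit["qseqid"],
            "subject": hit["sseqid"],
            "pident_ok": float(hit["pident"]) > min_pident,
            "ppos_ok": float(hit["ppos"]) > min_ppos,
            "evalue_ok": float(hit["evalue"]) < max_evalue,
            "bitscore_ok": float(hit["bitscore"])
                           >= min_bits_per_residue * int(hit["qlen"]),
        })
    return results


# Two made-up hits mirroring the protein 1 vs 2 and protein 1 vs 3 example.
example = io.StringIO(
    "prot1\tprot2\t55.0\t70.0\t1e-40\t150\t200\n"
    "prot1\tprot3\t15.0\t28.0\t0.5\t25\t200\n"
)
for result in check_hits(example):
    print(result)
```

Each hit is judged on its own rather than averaged with the others, which follows the point that BLAST hits are independent. For a real results file, pass an open file handle instead of the `io.StringIO` example.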

ADD COMMENT • link 6.9 years ago by utsafar ▴ 80

score 6 · Accepted Answer · 2016-04-18

6


9.3 years ago

agata88 ▴ 870

An identity of 35% means that 35% of the amino acids in the aligned region of your sequence match the database sequence. There is no such thing as an "acceptable percentage". It always depends on what you are looking for:

--- if you have an unknown protein sequence and would like to find homologous sequences, information about identity (even 35%) is valuable,

--- if you have a known protein and need to confirm the sequence, 35% identity is low and may suggest that something went wrong during your analysis.

The E-value is very important: the lower, the better.

Best,

Agata

ADD COMMENT • link 9.3 years ago by agata88 ▴ 870

1


@Leo has not provided enough information to draw meaningful conclusions. But it may be worth noting that the 35% identity could be over a critical part of the protein, e.g. an active site or binding site, so it may still be an important observation.
While the e-value is important, it is dependent on the size of the database being searched against. So that also should be kept in perspective.

ADD REPLY • link 9.3 years ago by GenoMax 152k