Question: Identity, E-value or bit score?
2
4.6 years ago by
Leo20
Leo20 wrote:

Hi everyone, I have a BLASTP result with an average identity of 35%. Is that an acceptable percentage?

If not what is the minimal acceptable identity percentage?

Between identity percentage, E-value and bit score, which one should we focus on to find the best match? Thanks.

blast annotation • 36k views
modified 18 months ago by alslonik150 • written 4.6 years ago by Leo20
5
4.6 years ago by
agata88800
Poland
agata88800 wrote:

An identity of 35% means that 35% of the aligned amino acids in your sequence match the database sequence. There is no such thing as an "acceptable percentage". It always depends on what you are looking for:
--- if you have an unknown protein sequence and you would like to find homologous sequences, information about identity (even 35%) is valuable,
--- if you have a known protein and you need to confirm the sequence, an identity of 35% is low and may suggest that something went wrong during your analysis.

The E-value is very important, the lower the better.

Best,

Agata

1

@Leo has not provided enough information to draw meaningful conclusions. But it may be worth noting that the 35% identity could be over a critical part of the protein e.g. an active site/binding site etc so it may still be an important observation.
While the e-value is important, it is dependent on the size of the database being searched against. So that also should be kept in perspective.

11
4.6 years ago by
India
Being Bioinformatician180 wrote:

The E-value provides information about the likelihood that a given sequence match is purely by chance. The lower the E-value, the less likely the database match is a result of random chance and therefore the more significant the match is. Empirical interpretation of the E-value is as follows. If E < 1e-50 (or 1 × 10^-50), there should be extremely high confidence that the database match is a result of homologous relationships. If E is between 1e-50 and 0.01, the match can be considered a result of homology. If E is between 0.01 and 10, the match is considered not significant, but may hint at a tentative remote homology relationship. Additional evidence is needed to confirm the tentative relationship. If E > 10, the sequences under consideration are either unrelated or related by extremely distant relationships that fall below the limit of detection with the current method.

Because the E-value is proportionally affected by the database size, an obvious problem is that as the database grows, the E-value for a given sequence match also increases. Because the genuine evolutionary relationship between the two sequences remains constant, the decrease in credibility of the sequence match as the database grows means that one may "lose" previously detected homologs as the database enlarges. Thus, an alternative to E-value calculations is needed.
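To make the database-size effect concrete, here is a minimal sketch based on the Karlin-Altschul relation E = K × m × n × e^(−λS). The function name and the example numbers are made up for illustration, and the default λ and K are only assumed to be typical values for gapped BLOSUM62 searches. BLAST itself also applies effective-length corrections that are ignored here.

```python
import math

def expected_hits(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S).

    lam and k are illustrative defaults (assumed values for a gapped
    BLOSUM62 search); real searches report their own parameters.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Hypothetical alignment: raw score 150, 300-residue query.
# The only thing that changes is the total database length.
small_db = expected_hits(150, 300, 1e8)
large_db = expected_hits(150, 300, 1e9)
print(f"E (small db) = {small_db:.3g}, E (10x larger db) = {large_db:.3g}")
```

The alignment is identical in both calls, yet the E-value is ten times larger against the database that is ten times bigger.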

A bit score is another prominent statistical indicator used in addition to the E-value in a BLAST output. The bit score measures sequence similarity independent of query sequence length and database size and is normalized based on the raw pairwise alignment score. The bit score (S') is determined by the following formula: S' = (λ × S − ln K) / ln 2, where λ is the Gumbel distribution constant, S is the raw alignment score, and K is a constant associated with the scoring matrix used. Clearly, the bit score (S') is linearly related to the raw alignment score (S). Thus, the higher the bit score, the more significant the match is. The bit score provides a constant statistical indicator for searching different databases of different sizes or for searching the same database at different times as the database enlarges.
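The conversion from raw score to bit score can be sketched in a few lines. This is only an illustration: the function name is invented, and the default λ and K are assumed values of the kind reported for gapped BLOSUM62 searches, not parameters taken from this thread.

```python
import math

def bit_score(raw_score, lam=0.267, k=0.041):
    """Bit score = (lambda * raw_score - ln K) / ln 2.

    lam and k are illustrative assumptions; use the values printed
    at the end of your own BLAST report.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)

# The relation is linear: every extra raw-score point adds the same
# number of bits, and the database size never enters the formula.
for raw in (100, 110, 120):
    print(raw, round(bit_score(raw), 1))
```

Because neither the query length nor the database size appears in the function, the same alignment gets the same bit score no matter when or where it is searched.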

3
18 months ago by
alslonik150
Israel
alslonik150 wrote:

I always bump into this thread when I google for the article to cite for the cutoff for inferring homology between two proteins (an important condition), so here is the link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/ And here is the quotation:

The bit-score provides a better rule-of-thumb for inferring homology. For average length proteins, a bit score of 50 is almost always significant. A bit score of 40 is only significant (E() < 0.001) in searches of protein databases with fewer than 7000 entries. Increasing the score by 10 bits increases the significance 2^10=1000-fold, so 50 bits would be significant in a database with less than 7 million entries (10 times SwissProt, and within a factor of 3 of the largest protein databases). Thus, the NCBI Blast web site uses a color code of blue for alignment with scores between 40–50 bits; and green for scores between 50–80 bits. In the yeast vs human example, the alignments with less than 20% identity had scores ranging from 55–170 bits. Except for very long proteins and very large databases, 50 bits of similarity score will always be statistically significant and is a much better rule-of-thumb for inferring homology in protein alignments.
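The numbers in the quotation can be roughly reproduced with the approximation E ≈ D × m × n × 2^(−b), where D is the number of database entries and m, n are the query and subject lengths. The sketch below is a simplification: the function name is invented, and the "average" protein length of 400 residues is an assumption chosen for illustration, not a figure from the article.

```python
import math

def min_bits(db_entries, query_len=400, subject_len=400, max_evalue=0.001):
    """Smallest bit score b with D * m * n * 2**(-b) <= max_evalue.

    Solving for b gives b = log2(D * m * n / max_evalue).
    The 400-residue lengths are an assumed 'average protein'.
    """
    return math.log2(db_entries * query_len * subject_len / max_evalue)

for entries in (7_000, 7_000_000):
    print(f"{entries:>9} entries: about {min_bits(entries):.1f} bits needed")
```

Under these assumptions, 7000 entries require roughly 40 bits and 7 million entries roughly 50 bits, in line with the rule of thumb quoted above.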

1
2.3 years ago by
utsafar70
utsafar70 wrote:

First of all, keep in mind that BLAST hits are HSPs (high-scoring segment pairs), which may cover only part of the query and subject sequences rather than their full length. So here I am only talking about the matched parts of your query (Q) and subject (S) sequences.

Identity: "the average identity percentage of 35%" is meaningless because BLAST hits are independent. For example, a BlastP run with two hits, protein 1 against protein 2 with `pident` 55% and protein 1 against protein 3 with `pident` 15%, says that protein 1 is with high confidence a homolog of protein 2, but you must be more cautious about the homology between protein 1 and protein 3. Keep in mind that proteins are made of 20 different AAs, so if you align two unrelated protein sequences (or any other random AA sequences) of any length you will get roughly 5% identity by chance (for DNA and RNA sequences the random identity is 25%, since those are made of four different bases: A, T, C, G). There is another parameter in BlastP, `ppos`, which is based on similarity. `ppos` is `pident` + (the percentage of similar but not identical AA matches). Overall, I think two AA sequences with `pident` higher than 20% and `ppos` higher than 30% are close enough to be called homologs. For NA sequences I think a `pident` of 40% and above is OK.
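The chance-identity figures can be checked with a small simulation. This is a sketch under simplifying assumptions: residues are drawn uniformly, and the two sequences are compared position by position without any alignment. Real proteins have skewed compositions, and an aligner picks the best-scoring region, so observed values will differ. All names below are made up for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
NUCLEOTIDES = "ACGT"

def random_identity(alphabet, length=10_000, seed=0):
    """Percent identity between two random, unaligned sequences."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    a = [rng.choice(alphabet) for _ in range(length)]
    b = [rng.choice(alphabet) for _ in range(length)]
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / length

print(f"protein: {random_identity(AMINO_ACIDS):.1f}%")
print(f"DNA:     {random_identity(NUCLEOTIDES):.1f}%")
```

With uniform sampling the expected values are 1/20 = 5% for proteins and 1/4 = 25% for nucleotides, and the simulated percentages should land close to those.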

P-value: depends on query and database lengths, but I think a p-value lower than 10^-5 indicates a relationship.

Bit score: depends strongly on query length. Compare the bit score with your `qlen`; I think that if the bit score of a hit is 0.7 × `qlen` or greater, the `query` and `subject` are close enough.
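As a rough illustration, the rules of thumb from this answer can be applied to tabular BLAST output. The sketch assumes the search was run with a custom column order such as `-outfmt "6 qseqid sseqid pident ppos qlen length evalue bitscore"`, treats the 10^-5 cutoff as applying to the `evalue` column (an assumption, since the tabular output reports E-values), and uses two invented hit lines. The function names and thresholds are illustrative defaults, not established standards.

```python
import csv
import io

# Assumed column order; it must match the -outfmt string used for the search.
COLUMNS = ["qseqid", "sseqid", "pident", "ppos", "qlen", "length", "evalue", "bitscore"]

def parse_hits(handle):
    """Yield one dict per tab-separated hit line, with numeric fields as floats."""
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        for key in COLUMNS[2:]:
            row[key] = float(row[key])
        yield row

def passes_rules(hit, min_pident=20.0, min_ppos=30.0,
                 max_evalue=1e-5, min_bits_per_residue=0.7):
    """Apply the personal thresholds suggested in the answer above."""
    return (hit["pident"] > min_pident
            and hit["ppos"] > min_ppos
            and hit["evalue"] < max_evalue
            and hit["bitscore"] >= min_bits_per_residue * hit["qlen"])

# Two invented hits for demonstration only.
EXAMPLE = (
    "q1\ts1\t55.0\t70.0\t200\t190\t1e-60\t210\n"
    "q1\ts2\t15.0\t28.0\t200\t80\t0.5\t25\n"
)
kept = [h["sseqid"] for h in parse_hits(io.StringIO(EXAMPLE)) if passes_rules(h)]
print(kept)
```

Only the first invented hit satisfies all four conditions. In practice each threshold should be adjusted to the question being asked rather than used as a fixed cutoff.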