Question

What is "pident" (percentage of identical matches) in the "Diamond" protein alignment ?

0

Entering edit mode

11 months ago

Alexander ▴ 220

Question: Would you be so kind to comment what is "pident" (percentage of identical matches) in the "Diamond" protein alignment ?

Thanks to Kirill Petrikov and colleagues seems the following is an answer:

Answer the following seems to be used (at least when "gapopen" = 0 ):

matches['dif_s'] = (matches['send']-matches['sstart'] - matches['mismatch'] + 1 ) / matches['length']*100
matches['dif_q'] = (matches['qend']-matches['qstart'] - matches['mismatch'] + 1 ) / matches['length']*100
matches['pident_2'] = matches[['dif_s', 'dif_q']].min(axis=1).round(1)

enter image description here

PS

"Diamond": https://github.com/bbuchfink/diamond

"Diamond" use example: https://www.kaggle.com/code/geraseva/diamond

"Diamond" manual: https://gensoft.pasteur.fr/docs/diamond/0.9.19/diamond_manual.pdf

Example from the original post with pident = 27.2

enter imag description here

protein alignment • 1.4k views

ADD COMMENT • link updated 11 months ago by LauferVA 4.2k • written 11 months ago by Alexander ▴ 220

1

Entering edit mode

Those should all be the same headings as explained in the NCBI Help doc here: https://www.ncbi.nlm.nih.gov/books/NBK279684/table/appendices.T.options_common_to_all_blast/

You should examine the actual alignment and it should become clear.

Score           Expect  Method                           Identities Positives   Gaps
297 bits(761)   2e-101  Compositional matrix adjust.    144/151(95%)    146/151(96%)    0/151(0%)

Query  1    MQINFENLTIGVFAKAAGVNVETIRFYQRKGLLPEPDKPYGSIRRYGEADVTRVRFVKSA  60
            MQ NFENLTIGVFAK AGVNVETIRFYQRKGLLPEPDKPYGSIRRYGEADVTRVRFVKSA
Sbjct  1    MQNNFENLTIGVFAKTAGVNVETIRFYQRKGLLPEPDKPYGSIRRYGEADVTRVRFVKSA  60

Query  61   QRLGFSLDEIAELLRLEDGTHCEEASGLAEHKLKDVREKMADLARMEAVLSELVCACHAR  120
            QRLGFSLDEIAELLRLEDGTHCEEASGLAEHKLKDVREKMADLARMEAVLSELVCACHAR
Sbjct  61   QRLGFSLDEIAELLRLEDGTHCEEASGLAEHKLKDVREKMADLARMEAVLSELVCACHAR  120

Query  121  KGNVSCPLIASLQDGTKLAASARGSHGVTTP  151
            +GNVSCPLIASLQDGTK AASAR  HG+TTP
Sbjct  121  QGNVSCPLIASLQDGTKFAASAREPHGLTTP  151

ADD REPLY • link 11 months ago by GenoMax 142k

0

Entering edit mode

Thank you for the reply ! I got seems to be even more puzzled here - we see "identities" 95% while in the example "pident" is 27.2 - so now I even do not quite understand what is going on - these proteins are really well (LOCALLY) aligned to each other (95%) , or not (27.2%) ? PS thanks for providing the link: https://www.ncbi.nlm.nih.gov/books/NBK279684/table/appendices.T.options_common_to_all_blast/ but I cannot find the exact definition of "pident" - there. The main question - "percentage" is taken relative to "what value" seems to be not mentioned there.

ADD REPLY • link 11 months ago by Alexander ▴ 220

0

Entering edit mode

I added the link for NCBI "book" since DIAMOND seems to follow the format/reporting structure for blast+ results. While the numbers shown in the answer below seem to add up you should actually examine the alignment (in addition to the summary line posted above) to make sure you are accounting for identities/gaps etc.

As for the example above, 144 AA are identical out of 151 in the HSP so that works out to (144 * 100)/150 = 95%. Percentage of identical matches over the reported HSP.

ADD REPLY • link 11 months ago by GenoMax 142k

0

Entering edit mode

how does it correspond to 27.2 ? or these are the other proteins ?

ADD REPLY • link 11 months ago by Alexander ▴ 220

0

Entering edit mode

That is the question @Vincent answer poses since it seems to correspond to something different.

You can also create an issue on https://github.com/bbuchfink/diamond/issues with your question that the developer may respond to.

ADD REPLY • link 11 months ago by GenoMax 142k

score 1 · Answer 1 · 2023-06-02

Alexander,

Edits based on additional data:

Revised my post based on the additional data you provided. In the new picture, column 4 is the total length of the sequence, and column 5 is the length of the mismatched sequence.

Looking at row 3, we have 88 mismatching positions of a total sequence length of 335. Thus, naively, a first guess could be:

pident0 = (335 - 88 / 335) = 247/335 = 73.7%            (0)

Interestingly, equation 0 gives a value is very close to pident1 and pident2, but not identical to either. Now consider the gap open column. If you subtract that from the length of the matching sequence, you now have 245/335, which exactly equals 73.1% - the value of pident1. Thus, we write equation 1 as follows:

pident1 = (Matching sequence  length - Open Gap sequence) / total sequence length                     (1)

Because repeating the steps I outline for each other row I checked gives the value pident1 in every case, I am fairly confident this is the correct answer to your question ... However, it does leave one remaining issue:

pident2 = ?                    (2)

Because the values of pident2 are very close to the value of pident1 in every row, its highly likely it is a very similar metric, but perhaps deals with a detail differently. For example, pident2 may deal with gapopen differently; or it may have to do with differences between q and s. I leave verification of which - and more importantly, consideration of what biological phenomena might be better be represented by pident1 contra pident2 (or vice versa) - in your capable hands.

Best,

VAL

Initial post: (now deprecated)

First, I want to commend you for going to the docs, the github page, and the example before posting - you're setting a great example for others. One last place to check is the manuscript itself, but it does not define this term, I checked.

Anywho, in the documentation, pident is represented as the percentage of all returned matches that are identical matches. However, it actually appears to be the ratio of identical matches to non-identical matches. I say this because, using your example:

identical matches = 340
non-identical matches = 1247
all matches = 340 + 1247 = 1587
pident (percentage identical) = 27.2

But, 340/1587 = 21.2% != 27.2% ... so it is evidently not (identical matches / all matches) x 100...

Rather, it appears to be pident = 340/1247 (identical matches/non-identical matches) x 100 = 27.2%.

This is a bit odd because, in the event identical matches exceed non-identical, pident will be > 100%. In the github page you linked above, the maintainers note the software is actively supported. As such, if this does not resolve the question, could be worth reaching out to them directly, but I think the above goes a long way.

Does that help?