Alexander,

**Edits based on additional data:**

Revised my post based on the additional data you provided. In the new picture, column 4 is the total length of the sequence, and column 5 is the length of the *mis*matched sequence.

Looking at row 3, we have 88 *mis*matching positions of a total sequence length of 335. Thus, naively, a first guess could be:

```
pident0 = (335 - 88 / 335) = 247/335 = 73.7% (0)
```

Interestingly, equation 0 gives a value is very close to `pident1`

and `pident2`

, but not identical to either. Now consider the gap open column. If you subtract that from the length of the matching sequence, you now have 245/335, which exactly equals 73.1% - the value of `pident1`

. Thus, we write equation 1 as follows:

```
pident1 = (Matching sequence length - Open Gap sequence) / total sequence length (1)
```

Because repeating the steps I outline for each other row I checked gives the value `pident1`

in every case, I am fairly confident this is the correct answer to your question ... However, it does leave one remaining issue:

```
pident2 = ? (2)
```

Because the values of `pident2`

are very close to the value of `pident1`

in every row, its highly likely it is a very similar metric, but perhaps deals with a detail differently. For example, `pident2`

may deal with `gapopen`

differently; or it may have to do with differences between q and s. I leave verification of which - and more importantly, consideration of what biological phenomena might be better be represented by `pident1`

contra `pident2`

(or *vice versa*) - in your capable hands.

Best,

VAL

**Initial post: (now deprecated)**

First, I want to commend you for going to the docs, the github page, and the example before posting - you're setting a great example for others. One last place to check is the manuscript itself, but it does not define this term, I checked.

Anywho, in the documentation, `pident`

is represented as the percentage of all returned matches that are identical matches. However, it actually appears to be the *ratio of identical matches to non-identical matches*. I say this because, using your example:

- identical matches = 340
- non-identical matches = 1247
- all matches = 340 + 1247 = 1587
- pident (percentage identical) = 27.2

But, 340/1587 = 21.2% != 27.2% ... so it is evidently *not* (identical matches / all matches) x 100...

Rather, it appears to be `pident`

= 340/1247 (**identical matches/non-identical matches**) x 100 = 27.2%.

This is a bit odd because, in the event identical matches exceed non-identical, `pident`

will be > 100%. In the github page you linked above, the maintainers note the software is actively supported. As such, if this does not resolve the question, could be worth reaching out to them directly, but I think the above goes a long way.

Does that help?

Those should all be the same headings as explained in the NCBI Help doc here: https://www.ncbi.nlm.nih.gov/books/NBK279684/table/appendices.T.options_common_to_all_blast/

You should examine the actual alignment and it should become clear.

Thank you for the reply ! I got seems to be even more puzzled here - we see "identities" 95% while in the example "pident" is 27.2 - so now I even do not quite understand what is going on - these proteins are really well (LOCALLY) aligned to each other (95%) , or not (27.2%) ? PS thanks for providing the link: https://www.ncbi.nlm.nih.gov/books/NBK279684/table/appendices.T.options_common_to_all_blast/ but I cannot find the exact definition of "pident" - there. The main question - "percentage" is taken relative to "what value" seems to be not mentioned there.

I added the link for NCBI "book" since DIAMOND seems to follow the format/reporting structure for blast+ results. While the numbers shown in the answer below seem to add up you should actually examine the alignment (in addition to the summary line posted above) to make sure you are accounting for identities/gaps etc.

As for the example above, 144 AA are identical out of 151 in the HSP so that works out to (144 * 100)/150 = 95%. Percentage of identical matches over the reported HSP.

how does it correspond to 27.2 ? or these are the other proteins ?

That is the question @Vincent answer poses since it seems to correspond to something different.

You can also create an issue on https://github.com/bbuchfink/diamond/issues with your question that the developer may respond to.