Consider these two sequences: the circadian locomoter output cycles kaput protein (CLK) orthologs from E. superba and D. melanogaster, respectively.
The UniProt pages (correctly?) indicate that both orthologs carry one bHLH domain and two PAS domains (note that the annotation sources differ: InterPro for E. superba and PROSITE for D. melanogaster).
I then tried annotating the domains again using hmmscan (https://www.ebi.ac.uk/Tools/hmmer/) against the Pfam database, and to my surprise it did not find the second PAS domain in the D. melanogaster sequence. (It did "correctly" identify 1x bHLH and 2x PAS domains in the E. superba sequence.)
This happened with the default search parameters, which include using the so-called Gathering Threshold as the cut-off for deciding whether a sequence belongs to a domain family. Changing this to an E-value cut-off (< 0.01, the default value for that option) "restores" the previously missing domain. I should also note that all isoforms of this sequence seem to have the same problem.
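For anyone who wants to reproduce the comparison locally rather than through the web form, the two cut-off modes could be sketched roughly as below. This assumes a local HMMER installation and a Pfam-A.hmm file already prepared with hmmpress. The file names and both helper functions are hypothetical, and the optional per-domain E-value is my own addition, not something taken from the web server settings.

```python
import shlex


def build_hmmscan_cmd(hmm_db, fasta, out_table, use_gathering=True,
                      seq_evalue=0.01, dom_evalue=None):
    """Return an hmmscan argument list for one of the two cut-off modes.

    hmm_db, fasta and out_table are hypothetical file paths.
    """
    cmd = ["hmmscan", "--domtblout", out_table]
    if use_gathering:
        # Per-family gathering thresholds stored in the Pfam profiles.
        cmd.append("--cut_ga")
    else:
        # E-value reporting threshold for whole-sequence hits.
        cmd += ["-E", str(seq_evalue)]
        if dom_evalue is not None:
            # Optional per-domain E-value threshold (assumption: caller's choice).
            cmd += ["--domE", str(dom_evalue)]
    cmd += [hmm_db, fasta]
    return cmd


def domains_from_domtblout(text):
    """Collect (family, env_from, env_to, domain_score) from --domtblout text."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        cols = line.split()
        # Columns 20 and 21 (1-based) are the envelope coordinates on the
        # query; column 14 is the bit score of this individual domain.
        hits.append((cols[0], int(cols[19]), int(cols[20]), float(cols[13])))
    return hits


if __name__ == "__main__":
    for gathering in (True, False):
        cmd = build_hmmscan_cmd("Pfam-A.hmm", "clk.fasta", "clk.domtbl",
                                use_gathering=gathering)
        print(shlex.join(cmd))
```

Running both commands on the same sequence and comparing the per-domain bit scores from `domains_from_domtblout` should show whether the second PAS hit is reported only under the E-value cut-off.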
My question is: why is this the case? The D. melanogaster CLK sequence is arguably the best-studied member of the CLK family. It has been used as bait many times to discover orthologs in other organisms (most of which, I presume, Pfam annotates correctly w.r.t. their domains). I don't know how the HMM profiles are constructed, but I presume this specific sequence contributed to the construction process. Why then does the hmmscan/Pfam combination, using default cut-offs, annotate this sequence incorrectly in comparison to its ortholog? Is there any way to fix this?
Edit: I am reading through the HMMER user guide, but it is tough going, and I'm not sure I'll find an answer in there (or, if I do find it, actually understand it).