Consider these two sequences:
These are circadian locomoter output cycles kaput protein (CLK) orthologs from E. superba and D. melanogaster, respectively.
The UniProt webpages (correctly?) indicate that both orthologs carry one bHLH domain and two PAS domains (note that the annotation sources differ: InterPro for E. superba and PROSITE for D. melanogaster).
I happened to try annotating the domains again using hmmscan (https://www.ebi.ac.uk/Tools/hmmer/) against the Pfam database, and to my surprise it did not find the second PAS domain in the D. melanogaster sequence. (It did "correctly" identify 1x bHLH and 2x PAS domains in the E. superba sequence.)
Now this happened with the default search parameters, which include using the so-called Gathering Threshold as the cut-off for deciding whether a sequence belongs to a domain family. Changing this to e-value (< 0.01, the default) "restores" the previously missing domain. I must also note that all isoforms of this sequence seem to have the same problem.
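For anyone reproducing this locally rather than through the EBI web form, the two cut-off modes correspond roughly to the hmmscan invocations sketched below. This assumes a local HMMER 3 install and a Pfam-A.hmm file prepared with hmmpress; the file names are placeholders, the helper function is hypothetical and only builds the argument lists, and the web form's E-value setting may not map one-to-one onto -E/--domE.

```python
import shlex


def hmmscan_command(cutoff, hmm_db="Pfam-A.hmm", fasta="clk.fasta", out="clk.domtbl"):
    """Build an hmmscan argument list for one of the two cut-off modes.

    All file names are placeholders; hmm_db must have been run through hmmpress.
    """
    cmd = ["hmmscan", "--domtblout", out]
    if cutoff == "gathering":
        # Use the curated per-family GA bit-score thresholds stored in the HMM file.
        cmd.append("--cut_ga")
    elif cutoff == "evalue":
        # Report sequences and domains with E-value <= 0.01 instead.
        cmd += ["-E", "0.01", "--domE", "0.01"]
    else:
        raise ValueError(f"unknown cutoff mode: {cutoff!r}")
    return cmd + [hmm_db, fasta]


if __name__ == "__main__":
    # Print both command lines so they can be copied into a shell.
    for mode in ("gathering", "evalue"):
        print(shlex.join(hmmscan_command(mode)))
```

Comparing the two resulting domain tables for the D. melanogaster sequence should show whether the second PAS hit is present only in the E-value run.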
My question is: why is this the case? The D. melanogaster CLK sequence is arguably the best-studied ortholog from the CLK family. This sequence has been used as bait many times to discover orthologs in other organisms (most of which I presume hmmscan + Pfam annotate correctly w.r.t. their domains). I don't know how the Pfam HMM profiles are constructed, but I presume this specific sequence contributed to the construction process.
Why then does the hmmscan + Pfam combination using default cut-offs annotate this sequence incorrectly in comparison to its ortholog? Is there any way to fix this?
Edit: I am reading through the hmmer user guide, but this is tough going, and I'm not sure if I'll find an answer in there (or if I do find it, actually understand it).
Hi lieven.sterck, thank you for the answer. I don't mean to be rude, but I don't think your answer addresses my question directly (tangentially perhaps, yes). I do urge you to re-read the OP, or am I misunderstanding something in your answer? (I apologize for this coming off as somewhat confrontational!!)
My answer (in condensed form) is mostly this
&
and to add: I don't think hmmscan + Pfam is what most people do; moreover, I believe UniProt also uses InterPro (can't find the ref on their website immediately, though). As for why it works on one of the sequences but not on the other: most likely (I don't know the exact details of this analysis/domain) the hit just fits within the threshold for one and not for the other.
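To make the "just fits within the threshold for one and not for the other" point concrete, here is a minimal sketch of the two inclusion rules. The bit scores, E-values and GA value are invented for illustration; they are not the real numbers for either CLK sequence or for any Pfam PAS family, and the sketch only models the per-domain threshold (Pfam also stores a per-sequence one).

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    name: str
    bit_score: float  # per-domain bit score
    evalue: float     # per-domain E-value


def passes_gathering(hit, ga_domain):
    # Gathering threshold: a fixed, curated bit-score cut-off for the family.
    return hit.bit_score >= ga_domain


def passes_evalue(hit, max_evalue=0.01):
    # E-value cut-off: depends on the score and on the search space size.
    return hit.evalue <= max_evalue


# Invented values, for illustration only.
GA_DOMAIN = 25.0
hits = [
    DomainHit("strong PAS-like hit", 40.0, 1e-9),
    DomainHit("borderline PAS-like hit", 20.0, 1e-3),
]

for hit in hits:
    print(hit.name, passes_gathering(hit, GA_DOMAIN), passes_evalue(hit))
```

In this toy case the borderline hit is dropped under the gathering threshold but kept under the E-value cut-off, which is the pattern described in the question.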
Domain (HMM) profiles are built from a multiple alignment of similar sequences; a profile does not reflect any one specific sequence.
and no offense taken ;)
I suggest running both sequences through InterProScan, which is the most comprehensive domain search you can do.