Why does the number of domains in these two orthologs differ? Why is hmmscan annotating one of these sequences incorrectly?
9 weeks ago
Dunois ▴ 620

Consider these two sequences:

These are circadian locomoter output cycles kaput protein (CLK) orthologs from E. superba and D. melanogaster respectively.

The UniProt webpages (correctly?) indicate that both orthologs carry 1 bHLH domain and 2 PAS domains (note that the annotation sources differ: InterPro for E. superba and PROSITE for D. melanogaster).

I happened to try and annotate the domains again using hmmscan (https://www.ebi.ac.uk/Tools/hmmer/) against the Pfam database, and to my surprise it did not find the second PAS domain in the D. melanogaster sequence. (It did "correctly" identify 1x bHLH and 2x PAS domains in the E. superba sequence.)

Now this happened with the default search parameters, which include using the so-called Gathering Threshold as the cut-off for deciding whether a sequence belongs to a domain family. Changing the cut-off to E-value (< 0.01, the default) "restores" the previously missing domain. I must also note that all isoforms of this sequence seem to have the same problem.

My question is: why is this the case? The D. melanogaster CLK sequence is arguably the best studied ortholog from the CLK family. This sequence has been used as a bait many times to discover orthologs in other organisms (most of which I presume hmmscan + Pfam annotate correctly w.r.t. their domains). I don't know how the Pfam HMM profiles are constructed but I presume this specific sequence contributed to the construction process.

Why then does the hmmscan + Pfam combination using default cut-offs annotate this sequence incorrectly in comparison to its ortholog? Is there any way to fix this?

Edit: I am reading through the HMMER user guide, but it is tough going, and I'm not sure I'll find an answer in there (or, if I do find it, actually understand it).

orthology pfam hmmer annotation domains • 407 views
9 weeks ago

If I run that D. melanogaster sequence through InterProScan (https://www.ebi.ac.uk/interpro/search/sequence/) it does find two PAS domains, but they seem to be different PAS domains (at least they have different PF domain IDs).

Also, keep in mind that InterPro is not a 'domain database' as such; it is rather an ensemble of many (all?) of the domain DBs around, integrated into one overarching database. Moreover, for some searches they specify their own cutoffs, which might differ from the ones used by the included DBs.

There are also different Pfam DBs, Pfam-A, Pfam-B ... perhaps the hmmscan at EBI does not use all of them?

Most people will likely use the InterPro-Interproscan route to annotate domains in proteins.

For your specific case I think it will be likely linked to different cut-off values used by different tools.


Hi lieven.sterck, thank you for the answer. I don't mean to be rude, but I don't think your answer addresses my question directly (tangentially perhaps, yes). I do urge you to re-read the OP again, or am I misunderstanding something in your answer? (I apologize for this coming off as somewhat confrontational!!)


My answer (in condensed form) is mostly this

There are also different Pfam DBs, Pfam-A, Pfam-B ... perhaps the hmmscan at EBI does not use all of them?

&

For your specific case I think it will be likely linked to different cut-off values used by different tools.

And to add: I don't think hmmscan + Pfam is what most people use. Moreover, I believe that UniProt also uses InterPro (I can't immediately find the reference on their website, though).

As for why it works on one of the sequences but not on the other: most likely (I don't know the exact details of this analysis/domain) the hit just makes it over the threshold for one sequence and not for the other.
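To make the "fits within the threshold for one and not for the other" point concrete, here is a minimal sketch. It assumes only that a Pfam gathering threshold is a per-family bit-score cut-off while the E-value cut-off is a single number applied to every hit. The class, function names, and all numbers are invented for illustration and are not taken from the actual CLK search.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    """One domain hit; the values used below are hypothetical."""
    name: str
    bit_score: float
    i_evalue: float


def passes_evalue(hit: DomainHit, max_evalue: float = 0.01) -> bool:
    # E-value cut-off: keep the hit if its E-value is below the limit.
    return hit.i_evalue < max_evalue


def passes_gathering(hit: DomainHit, ga_bits: float) -> bool:
    # Gathering-threshold cut-off: keep the hit if its bit score
    # reaches the family-specific threshold.
    return hit.bit_score >= ga_bits


# Hypothetical borderline hit and hypothetical family threshold.
hit = DomainHit("PAS (made-up numbers)", bit_score=21.5, i_evalue=0.004)
ga_for_family = 25.0

print("passes E-value cut-off:   ", passes_evalue(hit))
print("passes gathering cut-off: ", passes_gathering(hit, ga_for_family))
```

With these made-up numbers the same hit is kept under the E-value cut-off and dropped under the gathering threshold, which is the kind of situation described in the question.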

Domain (HMM) profiles are built from a multiple alignment of similar sequences; a profile does not reflect any one specific sequence.
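A toy illustration of that last point, assuming nothing about how Pfam actually builds its models: per-column residue frequencies from a tiny made-up alignment. A real profile HMM also models insertions, deletions, and background frequencies, so this is only a sketch of the "average over many sequences" idea. The function name and the sequences are hypothetical.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column.

    Gap characters ('-') are ignored when counting.
    """
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({r: n / total for r, n in counts.items()})
    return profile


# Made-up aligned sequences of equal length.
toy_alignment = ["MKLV", "MRLV", "MKIV"]
profile = column_frequencies(toy_alignment)

for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

No single input sequence matches the profile perfectly at every variable column, so a sequence can score lower against a model even if it belongs to the family.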

and no offense taken ;)


I suggest running both sequences through InterProScan, which is the most comprehensive domain search you can do.

9 weeks ago
Mensur Dlakic ★ 10k

Why then does the hmmscan + Pfam combination using default cut-offs annotate this sequence incorrectly in comparison to its ortholog? Is there any way to fix this?

Other than using their gathering thresholds, there is nothing you can fix with regard to Pfam HMMs.

HMMs are created from seed alignments, which do not include all sequences that have a given domain. Seed alignments are usually, though not always, created in such a way that they include most representative sequences. For "old" Pfam domains, meaning those with small accession numbers, the seed alignment could have been created a decade ago and does not necessarily include more recently added sequences with that domain. Even if the seed alignment is relatively recent, one of the two domains may be divergent enough not to get a statistically significant E-value, so it will be excluded. That is exactly why gathering thresholds were "invented" in the first place: to catch borderline cases that have good enough bit-scores but statistically insignificant E-values.

Two options: 1) build your own, more inclusive alignment and convert it to an HMM, which will hopefully score both domains so that they have statistically significant E-values; 2) increase the E-value threshold, say to 0.1 or 1, and manually inspect borderline hits that are just above the significance threshold. It could be that the E-value cutoff was 0.01 and the second domain had E=0.011. That would exclude it from showing up in automatic annotation, but manual inspection of the hits would hopefully show that it is a legitimate domain member.
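For option 2, a rough sketch of how the manual inspection could be organised. It assumes the search was run locally with a relaxed cut-off and written to a per-domain table (for example `hmmscan --domtblout hits.tbl -E 1 --domE 1 Pfam-A.hmm query.fasta`), and that the standard whitespace-separated `--domtblout` column order applies. The model names, accessions, and numbers in the sample text are made up, and the function names are hypothetical.

```python
def parse_domtblout(text):
    """Parse hmmscan --domtblout text into a list of per-domain dicts."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        f = line.split(maxsplit=22)  # field 22 is the free-text description
        hits.append({
            "model": f[0],
            "query": f[3],
            "i_evalue": float(f[12]),   # independent (per-domain) E-value
            "score": float(f[13]),      # per-domain bit score
            "ali_from": int(f[17]),     # alignment start on the query
            "ali_to": int(f[18]),       # alignment end on the query
        })
    return hits


def triage(hits, strict=0.01, relaxed=1.0):
    """Label each hit as significant, borderline (inspect by eye), or rejected."""
    labelled = []
    for h in hits:
        if h["i_evalue"] < strict:
            label = "significant"
        elif h["i_evalue"] < relaxed:
            label = "borderline"
        else:
            label = "rejected"
        labelled.append((label, h))
    return labelled


# Made-up table rows, for illustration only.
SAMPLE = """\
# target name  accession ... (header lines start with '#')
PAS_a PF00000.1 113 query - 1027 1e-20 70.0 0.1 1 2 1e-12 2e-12 45.0 0.0 1 110 250 360 249 362 0.95 hypothetical PAS model
PAS_b PF00000.2 104 query - 1027 0.009 18.0 0.0 1 1 0.010 0.011 17.5 0.0 5 80 90 165 88 170 0.80 hypothetical PAS model
PAS_c PF00000.3 120 query - 1027 2500 2.0 0.0 1 1 2900 3000 1.5 0.0 3 30 600 630 598 633 0.50 hypothetical PAS model
"""

for label, h in triage(parse_domtblout(SAMPLE)):
    print(f'{label:12s} {h["model"]:6s} {h["ali_from"]}-{h["ali_to"]} E={h["i_evalue"]:g}')
```

The "borderline" bucket is the one worth aligning and checking by eye; nothing here replaces that step.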


I just did a Pfam search with your second protein using E=1.0. Indeed, it annotates only one PAS domain, but the list of insignificant matches shows two other PAS domains. While one of them has E=3000 and is likely a fluke, there is a good-looking PAS domain at residues 90-165 that is equivalent to what was annotated in the other ortholog.

Hint: Right-click on the image and choose "Open in new tab" to see a larger and more legible image.

enter image description here
