Question: Dubious Pfam Domain Annotation Of A Protein
7.0 years ago by
Pappu1.9k wrote:

I searched a protein sequence for PfamA protein domain signatures with hmmsearch. Two related domains both have significant E-values (<1e-10), and the two hits overlap with each other. One domain is common in bacteria and the other in eukaryotes. The protein has only one domain, which spans the whole sequence. However, the Pfam database chose the domain with the lowest E-value.

I am wondering whether I can say the protein came from gene duplication of the two related domains, and whether it represents an intermediate sequence (common ancestor) bearing similarities to both domains. Also, if the two domains are orthologs, what is the point of giving them two names rather than merging the sequences to make one HMM?

hmmer • 3.2k views
modified 7.0 years ago • written 7.0 years ago by Pappu1.9k
7.0 years ago by
DG7.2k wrote:

I think you are confusing a few different things here, and some information is missing. What is missing from this discussion is whether the hits for the two domains occur at the same spot on the protein. Remember that a protein can be composed of many domains. If your hmmsearch is returning hits to different portions of your protein, the interpretation of those results is different than when you have two good hits that overlap.
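To make the "same spot or different portions" check concrete, here is a minimal Python sketch. It assumes you have already pulled the start and end coordinates of each domain hit out of your HMMER results. The `DomainHit` class, the family names, and all coordinates and E-values are hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    """One domain hit on the query protein (hypothetical container)."""
    family: str
    start: int      # 1-based, inclusive start on the protein
    end: int        # 1-based, inclusive end on the protein
    evalue: float


def overlap_length(a: DomainHit, b: DomainHit) -> int:
    """Number of residues covered by both hits (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def overlap_fraction(a: DomainHit, b: DomainHit) -> float:
    """Overlap as a fraction of the shorter of the two hits."""
    shorter = min(a.end - a.start + 1, b.end - b.start + 1)
    return overlap_length(a, b) / shorter


# Made-up example values, not taken from any real search.
hit_a = DomainHit("family_A", 5, 310, 1e-40)
hit_b = DomainHit("family_B", 12, 305, 1e-15)
hit_c = DomainHit("family_C", 320, 400, 1e-12)

for x, y in [(hit_a, hit_b), (hit_a, hit_c)]:
    print(x.family, y.family, overlap_length(x, y), overlap_fraction(x, y))
```

In this toy example family_A and family_B cover essentially the same region, while family_C sits on a separate part of the protein and would be read as a second, independent domain.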

Now, assuming we are talking about a situation where you have two good hits (as judged by the E-value) to the same region of your query protein sequence: if the two domains with hits are closely related to one another (part of the same Pfam family, etc.), this is exactly the same situation as running a BLAST search. There you expect to get multiple matches with good E-values in different species, both orthologs and paralogs. There are fewer domains than sequences, so you don't always see this, but it is still expected in many cases.
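If you want to reduce several good, overlapping hits to one reported domain per region, as the question describes Pfam doing by lowest E-value, a simple greedy rule is one way to sketch it. This is an illustration under that assumption and not Pfam's actual procedure. The function name `best_hit_per_region` and all hit values are hypothetical.

```python
def best_hit_per_region(hits):
    """Keep the lowest-E-value hit among hits that overlap each other.

    hits: list of (family, start, end, evalue) tuples with inclusive
    coordinates on the query protein.
    Greedy rule: visit hits from best to worst E-value and keep a hit
    only if it does not overlap any hit that was already kept.
    """
    kept = []
    for hit in sorted(hits, key=lambda h: h[3]):
        _, start, end, _ = hit
        # No overlap with k means this hit ends before k starts
        # or starts after k ends.
        if all(end < k[1] or start > k[2] for k in kept):
            kept.append(hit)
    # Report the surviving hits in order along the protein.
    return sorted(kept, key=lambda h: h[1])


# Made-up example values, not taken from any real search.
hits = [
    ("family_B", 12, 305, 1e-15),
    ("family_A", 5, 310, 1e-40),
    ("family_C", 320, 400, 1e-12),
]

for family, start, end, evalue in best_hit_per_region(hits):
    print(family, start, end, evalue)
```

Note that dropping family_B here only tidies the report. The protein still shows significant similarity to both families, which is the point made above.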

Also, it is very important to keep in mind that domains and proteins are two different "units", and we can talk about the evolution of domains and of whole proteins somewhat independently of one another. While related domains are probably often generated by gene duplication and divergence, there are a host of other processes at play driving that differentiation (exon shuffling, domain rearrangement, recombination, translocation, etc.).

In summary, the short answer is no: you cannot say that what you have represents an intermediate sequence between the two domains. All you know is that you have a domain in your protein sequence that shows strong evidence of being homologous to both domains in question, and that isn't unusual or unexpected in most cases.

written 7.0 years ago by DG7.2k

You may say that if they share the same domain they might be distantly related, having the same structure and hence function, but you cannot say that your protein came from gene duplication of the two related domains or that it is their intermediate sequence.

modified 7.0 years ago • written 7.0 years ago by User000440

Thanks, I updated my question. If the domains are homologous, what is the point of listing them separately?

modified 7.0 years ago • written 7.0 years ago by Pappu1.9k

Homologous doesn't mean identical. Most domains have some evolutionary relationship, and homologous domains may have very low sequence similarity, since structure is more conserved than sequence.

modified 7.0 years ago • written 7.0 years ago by User000440

As User000 said, homologous is not the same as identical. It is like asking why you would list homologous sequences separately in BLAST; it is the same thing. While domains are a "building block" of a protein, so to speak, they can also be considered independently in terms of their evolutionary history. Some domains have been determined to be homologous, in that they are related to one another, but they are still distinct domains.

written 7.0 years ago by DG7.2k

To put it another way, when building an HMM we select some homologous sequences which are not identical. For another HMM we also take homologous sequences similar to those in the first HMM. My question is how one should choose the boundary (BLOSUM identity or something similar) for including sequences in the HMMs. Why not build one HMM instead of two?

written 7.0 years ago by Pappu1.9k

Well, normally we use biological information to inform our choices about what to include in HMMs. An HMM is a model, a tool, designed to reflect something about biology, and used to answer specific questions. In the case of searching HMM domains from Pfam with HMMER, you are using it to answer questions about proteins based on Pfam domain definitions. Pfam has decided that those domains, while related, are distinct domains and should be treated as such in much the same way as we would treat paralogs as homologous but distinct entities.

written 7.0 years ago by DG7.2k

Have you read this paper:

written 7.0 years ago by Pappu1.9k

Yes, I have. If you believe that your particular domains should not be treated as two different domains, that is a separate issue. In general you are going to need some pretty good evidence to assert that, keeping in mind that PfamA families are manually curated. It is also the opposite of the issue raised in the paper you just linked to, where the authors claim that sequences that have undergone convergent evolution are being grouped together inappropriately.

written 7.0 years ago by DG7.2k