Question

Pfam Based Functional Annotation

11

14.1 years ago

Suk211 ★ 1.1k

I think in one of the earlier threads, Istvan already asked about the reliability of GO annotation. I was wondering whether any of you have experience with functional annotation based on the Pfam database. I want to functionally annotate a large peptide library, and the easiest way I can think of is to do a batch search of those peptides against the Pfam database. If you know a better approach, kindly share it.

Cheers

protein annotation • 9.2k views

Updated 14.0 years ago by Noyk ▴ 100 • written 14.1 years ago by Suk211 ★ 1.1k

Minor correction: it was Giovanni who asked that question.

Reply • 14.1 years ago by Istvan Albert 100k

Has anybody used Blast2GO, which maps InterPro and BLAST hits to GO terms?

Reply • 14.0 years ago by Noyk ▴ 100

It would probably be best if you asked this as a new question rather than adding it to the existing answers.

Reply • 14.0 years ago by Istvan Albert 100k

@noyk You'll get a detailed response from me if you post your question, as suggested by @Istvan Albert :)

Reply • 14.0 years ago by Eric Normandeau 11k

OK, will do that.

Reply • 14.0 years ago by Noyk ▴ 100

score 10 · Answer 1 · 2010-03-08

I think the Pfam approach may return something useful, but you need to be careful about how you interpret your results. Pfam is primarily a tool to assign sequences to protein families. It also does a good job of recognizing functional domains. It provides information about the usual function of the domains and family members, but I do not think it should be viewed as a tool to assign function directly, and I think the Pfam curators would agree with me. It makes an assignment based on sequence similarity and infers structural and functional similarity from that. These inferences may or may not be correct. There are several risks you need to keep in mind. Two big ones that stand out to me are:

Your sequences are all shorter than most protein domains. So you may get false negatives where if you had the full sequence, you might have hit a domain, but because you only have a fragment, the similarity is too weak to produce a hit.
You might get false positives because you match a domain but have a few key residues in your sequence mutated, and therefore the protein from which your sequence was derived actually does not perform the function assigned to that domain in Pfam.

You asked about direct experience. Mine is roughly 5 years old now, but at that time Pfam was one of the best tools to identify functional domains, and it was a good way to annotate sequences as long as I kept its limitations in mind. However, I was working with full-length sequences, not fragments. My gut instinct is that it will not perform as well on small fragments, but I have no direct experience to back that up, just the knowledge that your fragments are shorter than most domains.
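One way to see how much of a domain a short fragment actually matched is to look at model coverage in the per-domain table that HMMER's `hmmscan` writes with `--domtblout`. The sketch below is only an illustration: it assumes the standard whitespace-separated domtblout layout (model name in column 1, model length in column 3, query name in column 4, independent E-value in column 13, `hmm from`/`hmm to` in columns 16 and 17), and the sample line and its numbers are made up for the example.

```python
from typing import Iterable, Iterator, NamedTuple


class DomainHit(NamedTuple):
    family: str      # Pfam model name (target column in hmmscan output)
    query: str       # peptide identifier
    model_len: int   # length of the profile HMM (tlen)
    i_evalue: float  # independent E-value for this domain
    hmm_from: int    # first matched position on the model
    hmm_to: int      # last matched position on the model

    @property
    def model_coverage(self) -> float:
        """Fraction of the domain model covered by the alignment."""
        return (self.hmm_to - self.hmm_from + 1) / self.model_len


def parse_domtblout(lines: Iterable[str]) -> Iterator[DomainHit]:
    """Yield one DomainHit per non-comment line of a domtblout file."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split()
        yield DomainHit(
            family=f[0],
            query=f[3],
            model_len=int(f[2]),
            i_evalue=float(f[12]),
            hmm_from=int(f[15]),
            hmm_to=int(f[16]),
        )


# Made-up example line: a 40-residue peptide matching part of a 264-state model.
SAMPLE = (
    "Pkinase PF00069.28 264 pep1 - 40 1.2e-08 33.1 0.0 1 1 "
    "3.0e-11 2.5e-08 32.0 0.0 100 139 1 40 1 40 0.95 Protein kinase domain"
)

for hit in parse_domtblout(SAMPLE.splitlines()):
    print(f"{hit.query}\t{hit.family}\tcoverage={hit.model_coverage:.2f}")
```

A hit that covers only a small fraction of the model says the peptide resembles part of the domain, which is weaker evidence than a full-length domain match.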

Back when I did function assignment for a living, I considered it very risky to rely on one tool to make an assignment. And I never considered any assignment anything more than a hypothesis that could then be tested in the lab.
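To make the "don't rely on one tool" point concrete, here is a minimal sketch of a consensus rule: keep a label only when at least two independent sources agree, and treat it as a hypothesis either way. The tool names, the labels, and the threshold of two are assumptions chosen for illustration, not a recommended standard.

```python
from collections import Counter
from typing import Dict, Optional


def consensus_annotation(
    calls: Dict[str, Optional[str]], min_support: int = 2
) -> Optional[str]:
    """Return the most common label if at least `min_support` tools agree.

    `calls` maps a tool name to the label it assigned, or None when the
    tool produced no annotation for this sequence.
    """
    votes = Counter(label for label in calls.values() if label)
    if not votes:
        return None
    label, support = votes.most_common(1)[0]
    return label if support >= min_support else None


# Hypothetical calls for one peptide from three different sources.
calls = {"pfam": "protein kinase", "blast": "protein kinase", "other": None}
print(consensus_annotation(calls))
```

Sequences for which this returns `None` are the ones to inspect by hand rather than annotate automatically.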

Ram · Answer 2 · 2010-03-04

My experience with Pfam is limited, but I think relevant to your question.

I work on a human pathogen which has been entirely sequenced, so we know quite a bit about what's in it. In particular, I'm interested in one Pfam group (PF02009) that groups similar proteins from this pathogen.

The problem I have with the pfam group is that it groups several distinct groups of proteins. These proteins are related, I agree, however, at the level I'm comparing them (which is in detail), I would not jump to the conclusion that these proteins share the same function.

That brings me to the following comment on your question: looking for functional annotation is very vague. What detail of functional annotation are you looking for?

Do you want to know if these peptides belong to groups called "enzymes" or "receptors" or some kind of basic "building blocks", without any more detail?
Do you want to know if these peptides belong to a specific class of enzymes?
Do you want to know if these peptides belong to a specific sub-class of enzymes, going all the way down to the substrate specificity?

Another question I would have is regarding the length of your peptides. I recall one of my collaborators complaining about the fact that Pfam would not detect fragments that were too short. That was with Pfam2. I don't know how this is with Pfam3 though. So, you'll have to test this.

Depending on the answer to these questions (and many more) you may or may not want to only use Pfam. But in any case, Pfam could be a good start, if your peptides are not too short.

Another way that might be more relevant to short sequences would be to look at BLAST approaches (PSI- or PHI-BLAST in particular) to find what your peptides match to, and then look at the functional annotation of those hits (including whatever Pfam domains they may contain). I think this method would be more sensitive than the Pfam approach.
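A rough sketch of the "find what the peptides match, then look at the annotation of those hits" step, assuming the searches were saved in BLAST's 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The sequence identifiers and the E-value cutoff below are made up; very short peptides rarely reach small E-values, so the cutoff would need tuning.

```python
import csv
from typing import Dict, Iterable, Tuple


def best_hits(
    rows: Iterable[str], max_evalue: float = 1e-5
) -> Dict[str, Tuple[str, float]]:
    """Map each query to its (subject, bitscore) hit with the highest bitscore.

    Hits with an E-value above `max_evalue` are ignored.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for f in csv.reader(rows, delimiter="\t"):
        if not f or f[0].startswith("#"):
            continue
        query, subject = f[0], f[1]
        evalue, bitscore = float(f[10]), float(f[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Made-up tabular output for two peptides.
SAMPLE = [
    "pep1\tsubjA\t100.0\t20\t0\t0\t1\t20\t5\t24\t1e-8\t45.0",
    "pep1\tsubjB\t90.0\t20\t2\t0\t1\t20\t5\t24\t1e-6\t40.0",
    "pep2\tsubjC\t60.0\t15\t6\t0\t1\t15\t5\t19\t0.5\t20.0",
]

for query, (subject, score) in best_hits(SAMPLE).items():
    print(query, subject, score)
```

The subject identifiers returned here would then be looked up in whatever annotation source you trust (including the Pfam domains those proteins contain).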

score 2 · Answer 3 · 2010-03-06

14.1 years ago

Chris ★ 1.6k

You might also consider BLASTing against Swiss-Prot and transferring residue annotations.

Ram · Answer 4 · 2010-03-10

This review article may be helpful or at least interesting to you: "Automated protein function prediction -- the genomic challenge" (Friedberg 2006)

Here's a relevant excerpt:

Pfam is arguably the database of choice for those seeking order within the protein sequence universe. [...] As we shall see, Pfam annotation is used by function prediction programs, either by directly querying Pfam or by using umbrella databases that include Pfam information such as InterPro. SMART, CDD, and PRODOM are other databases consisting of multiple alignments of protein domains. All these databases have proteins arranged in homologous clusters, which, when possible, are annotated. These databases are often deferred to when producing homology-based annotation transfers. It should be emphasized that the use of these databases for homology transfer should be done with caution, as they annotate proteins on a domain level. A multi-domain query aligned to Pfam, for example, should be carefully checked for mis-annotations due to domain shuffling, as mentioned earlier. Also, the 'granularity' of these databases varies. For example, a single Pfam family may contain several proteins which perform the same enzymatic reaction on different substrates.

Ram · Answer 5 · 2010-04-09

Regarding the validity of Pfam predictions: some studies (e.g. GISMO (gene prediction) and CARMA (phylogenetic classification of environmental metagenomic samples)) have used Pfam domains as input to generate training sets for classification. The underlying assumption is that sequences with hits to known protein domains have a high probability of being real protein-coding regions. This is, at least in my opinion, very much justified, and it is also supported by the high precision of the resulting methods.