How to find a GO annotation file and do GO enrichment analysis?
9.4 years ago
jack ▴ 970

I have a list of differentially expressed genes for Paramecium tetraurelia, and I want to do a Gene Ontology enrichment analysis. There are two problems:

1) I couldn't find the GO annotation for Paramecium.

2) Assuming I can find the GO annotation for this organism, which tool is best for GO enrichment analysis?

I have seen one paper in which the authors say: "We conducted a domain search of the P. bursaria transcripts against the Pfam database release 26.0. Gene ontology (GO) terms were assigned to each transcript using the pfam2go conversion table". It's not clear to me how that is done.

Can somebody help me with this?

RNA-Seq genome R • 5.5k views
9.4 years ago
pld 5.1k

Pfam is a database of protein families. Each family is represented by a hidden Markov model, built with HMMER, that captures a conserved group of proteins. When proteins are conserved, we assume they are functionally similar. This is a general assumption that can break down in sequence- and species-specific ways, but on the whole it works.

So if you can establish that a family shares some set of functions, you can assume that any member of that family also has those functions. If that holds, predicting function becomes a problem of predicting which families a protein belongs to. This is what the authors did: they knew the functions of the families, so to infer the potential functions of their proteins they had to find the families those proteins might belong to.
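To make the pfam2go step concrete, here is a minimal Python sketch of how it could be scripted. It assumes the pfam2go table uses lines of the form `Pfam:PF00001 7tm_1 > GO:<term name> ; GO:0004930` with `!` comment lines, and that you have already collected the Pfam accessions hit by each transcript. The function names `parse_pfam2go` and `assign_go` are mine, not from the paper, and this is not necessarily how the authors did it.

```python
from collections import defaultdict


def parse_pfam2go(lines):
    """Parse pfam2go-style lines into {Pfam accession: set of GO IDs}.

    Assumed line format (an assumption, check your copy of the file):
        Pfam:PF00001 7tm_1 > GO:some term name ; GO:0004930
    Lines starting with '!' are treated as comments.
    """
    mapping = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        left, _, right = line.partition(" > ")
        if not right:
            continue  # not a mapping line
        pfam_acc = left.split()[0].replace("Pfam:", "")
        go_id = right.rsplit(";", 1)[-1].strip()
        mapping[pfam_acc].add(go_id)
    return dict(mapping)


def assign_go(domain_hits, pfam2go):
    """Give each transcript the union of GO terms of all its Pfam hits.

    domain_hits: {transcript ID: iterable of Pfam accessions}; a version
    suffix such as 'PF00001.21' is stripped before lookup.
    """
    result = {}
    for transcript, accessions in domain_hits.items():
        terms = set()
        for acc in accessions:
            terms |= pfam2go.get(acc.split(".")[0], set())
        result[transcript] = terms
    return result
```

A transcript whose domains have no entry in the table simply ends up with an empty set, which is worth counting before you go on to enrichment.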

As for the authors' data: when this kind of annotation is done, you should generally be able to find it in the supplemental information or, in some cases, by contacting the authors. Always check the supplement in these kinds of papers. For the paper I assume you're referring to, the information is in the supplement (see additional file 3).

Now, as for predicting function through homology: all methods take the same general form, but there are important distinctions. The idea is to infer function by finding which "thing" of known function matches your "thing" of unknown function. The two most common ways are BLAST and HMMER/Pfam. The idea is the same; with BLAST you assign functions through specific sequences (BLAST hits), and with HMMER/Pfam through protein families (as described above).

However, there are important differences. With BLAST you usually infer function from a single best hit, so your unknown is assigned all of the functions that specific protein has. With Pfam, you assign all of the functions of all significantly high-scoring Pfam hits. This seems trivial, but it can be important. Pfam only gives you the functions that proteins in a family share, whereas BLAST gives you the functions known for that protein in that specific species.
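The "single best hit" versus "all significant hits" distinction can be sketched in Python as follows. The BLAST part assumes the default 12-column tabular output (`-outfmt 6`: query, subject, percent identity, ..., E-value in column 11, bit score in column 12). The Pfam part assumes you have already reduced your HMMER output to `(query, Pfam accession, E-value)` tuples. The `1e-5` cutoffs and the function names are illustrative assumptions, not recommended values.

```python
def best_blast_hits(lines, max_evalue=1e-5):
    """Return {query: subject} keeping only the highest-bit-score hit.

    Assumes default BLAST tabular columns, where index 10 is the
    E-value and index 11 is the bit score.
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def significant_pfam_hits(hits, max_evalue=1e-5):
    """Return {query: set of Pfam accessions}, keeping EVERY hit that
    passes the cutoff rather than only the best one."""
    kept = {}
    for query, accession, evalue in hits:
        if evalue <= max_evalue:
            kept.setdefault(query, set()).add(accession)
    return kept
```

With BLAST each query ends up with one subject (and therefore that one protein's GO terms), while with Pfam a multi-domain protein keeps every family it matched.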

The key difference is "in that specific species": you may see contextual information specific to the species of the known protein. The kicker is that it can be hard to tell whether these "extra" terms are there because the protein does something unique in its host species or because that species simply has better or more complete annotations. Very few species have concerted efforts to annotate their genomes with GO terms (

If your species isn't phylogenetically close to a species with GO annotation efforts, I would use both approaches. BLAST your genes against, say, UniProt and collect GO terms from the best BLAST hit of each predicted peptide. I would also run HMMER on the predicted peptides and infer functions that way.
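If you do run both approaches, one possible way to merge them is to keep track of where each term came from, so the species-specific "extra" terms from BLAST stay visible. This is only a sketch: `combine_annotations` is a hypothetical helper, and both inputs are assumed to be `{gene: set of GO IDs}` dictionaries you built yourself.

```python
def combine_annotations(blast_go, pfam_go):
    """Merge two {gene: set of GO IDs} dicts, recording the source.

    For each gene, returns the terms supported by both methods and the
    terms that only one of the two methods produced.
    """
    combined = {}
    for gene in sorted(set(blast_go) | set(pfam_go)):
        from_blast = blast_go.get(gene, set())
        from_pfam = pfam_go.get(gene, set())
        combined[gene] = {
            "both": from_blast & from_pfam,
            "blast_only": from_blast - from_pfam,
            "pfam_only": from_pfam - from_blast,
        }
    return combined
```

Whether you then use the union, or only the terms in `both`, is a judgment call that depends on how conservative you want the annotation to be.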

Blast2GO is an option, but it is extremely slow if you don't buy the full version; it can take months to annotate a large set of genes/proteins. There are other tools available, as previously mentioned; see if those can help.

If you have any programming/database experience, you can easily write a few scripts to handle this. I prefer this approach because it is easier to integrate into other analyses (either on the transcriptome or later on).

Or, just use what someone else already did! The data you want is right there in the publication!

Entering edit mode

Hi Joe,

Regarding your very informative comment, I would like to ask about a few points:

  1. Does Blast2GO use both the BLAST and the HMMER/Pfam approaches?
  2. Are there currently tools other than Blast2GO that perform this task efficiently?

Thank you very much in advance!

9.4 years ago
dago ★ 2.8k

You could download the proteome of P. It's available on EMBL. Otherwise, you could annotate your proteins with Blast2GO.

Blast2GO also has a function for GO enrichment. Otherwise, you could use other tools that are listed in this post:

Gene Ontology Enrichment Of Non-Model Bacterial Genome
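For reference, what most GO enrichment tools compute for each term is an over-representation test. Below is a minimal Python sketch of a one-sided hypergeometric test using only the standard library. It assumes you already have a `{gene: set of GO IDs}` mapping and that the background is every gene you tested for differential expression. It is not the algorithm of any particular tool, it does not propagate terms up the GO hierarchy, and the function names are mine.

```python
from collections import Counter
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N genes from M, of which n carry the term."""
    total = comb(M, N)
    upper = min(n, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def go_enrichment(study_genes, background_genes, gene2go):
    """Return [(term, study count, background count, p-value)] sorted by p.

    Study genes missing from the background are ignored.
    """
    background = set(background_genes)
    study = set(study_genes) & background
    bg_counts, study_counts = Counter(), Counter()
    for gene in background:
        for term in gene2go.get(gene, ()):
            bg_counts[term] += 1
            if gene in study:
                study_counts[term] += 1
    results = []
    for term, k in study_counts.items():
        p = hypergeom_sf(k, len(background), bg_counts[term], len(study))
        results.append((term, k, bg_counts[term], p))
    return sorted(results, key=lambda r: r[3])
```

These are raw p-values. Since you test many terms at once, you would still need a multiple-testing correction before interpreting them.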

9.4 years ago
chemcehn ▴ 210

Did you try DAVID?

2.2 years ago

P. tetraurelia is now one of the genomes loaded in PANTHER: You can run an enrichment analysis directly from the GO homepage, which has a widget that submits to PANTHER.

This organism's annotations are available in AmiGO:

Or, from AmiGO's homepage -> Advanced Search -> Search annotations, and then filter by organism (use the blue More button to add the organism by name).

To retrieve annotations for most species that are not in AmiGO, we recommend using QuickGO to download the IEA (automatic) annotations that have gone through our QA pipeline. For Paramecium tetraurelia, taxon 5888, go to -> View GO Annotations -> use the blue Taxon button to add 5888 (white button) and Apply. I found 138,166 annotations, well within QuickGO's download limit of 2,000,000.
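Once the annotations are downloaded, turning them into a gene-to-GO mapping is a few lines of Python. This sketch assumes you exported the file in GAF format (tab-separated, `!` comment lines, object ID in column 2, qualifier in column 4, GO ID in column 5, taxon in column 13). If you chose a different export format the column positions will differ. `parse_gaf` is a hypothetical helper name.

```python
from collections import defaultdict


def parse_gaf(lines, taxon_id="5888"):
    """Build {gene product ID: set of GO IDs} from GAF-formatted lines.

    Skips comment lines, rows with a NOT qualifier, and rows whose
    taxon column does not include the requested taxon.
    """
    gene2go = defaultdict(set)
    wanted = "taxon:" + taxon_id
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue  # not a complete GAF row
        object_id, qualifier, go_id, taxon = cols[1], cols[3], cols[4], cols[12]
        if "NOT" in qualifier.split("|"):
            continue
        if wanted not in taxon.split("|"):
            continue
        gene2go[object_id].add(go_id)
    return dict(gene2go)
```

The IDs in this mapping are whatever the annotation source uses, so you will probably need to map them to the gene IDs in your differential expression results before running an enrichment.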

QuickGO's API is at


If you were looking for an organism that wasn't in PANTHER, we have a FAQ with a link to the Nature protocol paper's Box 2:

