8.2 years ago by
Washington University, St Louis, USA
I think you have answered your own question when you observe that there are many genes with multiple transcript/protein isoforms where each isoform has different GO annotations. This is because the Gene Ontology attaches terms from its three ontologies (molecular function, biological process and cellular component) to gene products, not genes. In other words, terms are associated with specific protein isoforms. In many cases people have information only at the gene-locus level (e.g., their expression arrays don't do a good job of measuring specific transcripts), or they have transcript-level data but map those transcripts to the gene-locus level rather than the protein isoform level. However, if you do have good transcript-level data, I would argue that it is better to map those transcripts to the corresponding protein isoforms (e.g., UniProt) and use those as input for your Gene Ontology analysis.
Most GO over-representation software will allow you to upload your own "total/complete" list from which your protein subset was derived. This will prevent the skewing of statistics that you are quite wisely concerned about. As an illustrative example, check out DAVID. Choose their 'Functional Annotation' (gene-annotation enrichment analysis) tool and you will see that you can upload many different types of transcript IDs or protein IDs for both your "gene list" and your "background" list of interest. Running their statistics will tell you which GO terms are over-represented in your subset of transcript/protein IDs relative to the total/background list. Most GO enrichment tools will follow this pattern. You can explore a list maintained by GO here.
All of this was a really long way of answering your first question: YES - it is appropriate to do transcript-level GO enrichment analysis.
For your second question, there must be many references for this. Unfortunately, it is so common now that most people don't really explain what they have done in their publications.
For your third question, given the above, I would not "collapse" different isoforms.
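To make the "subset versus your own background" idea concrete, here is a minimal sketch of an isoform-level over-representation test. It is not what DAVID or any particular tool does internally; it just assumes a one-sided hypergeometric test, and the transcript IDs, GO term labels and function names are all made-up placeholders.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def enrich(subset, background, annotations):
    """Return {term: p-value} for terms seen at least once in the subset.

    Every ID is an isoform/transcript ID; nothing is collapsed to genes.
    """
    background = set(background)
    subset = set(subset) & background  # the subset must come from the background
    term_to_ids = {}
    for iso in background:
        for term in annotations.get(iso, ()):
            term_to_ids.setdefault(term, set()).add(iso)
    results = {}
    for term, ids in term_to_ids.items():
        k = len(ids & subset)
        if k == 0:
            continue
        results[term] = hypergeom_upper_tail(k, len(background), len(ids), len(subset))
    return results


# Toy data (hypothetical): 10 isoforms in the background, 3 in the subset.
background = [f"T{i}" for i in range(1, 11)]
annotations = {
    "T1": {"GO:TERM_A"},
    "T2": {"GO:TERM_A"},
    "T3": {"GO:TERM_A", "GO:TERM_B"},
    "T4": {"GO:TERM_B"},
    "T5": {"GO:TERM_B"},
}
subset = ["T1", "T2", "T3"]

pvalues = enrich(subset, background, annotations)
for term, p in sorted(pvalues.items(), key=lambda item: item[1]):
    print(term, round(p, 4))
```

The point of the sketch is that the background is whatever list you supply, so the p-values are computed relative to the transcripts you actually measured. A real analysis would also correct for multiple testing, which is left out here.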
Your situation of not having a model organism creates a lot more challenges. I've never worked with Blast2GO. But I suppose that if you have a complete set of transcripts and get functional annotations for many of them from Blast2GO, then you should be able to build your own transcript-annotation database and use that for over-representation analysis of subsets of transcripts versus the total list. This will likely require custom analysis as opposed to tools like DAVID. I suggest you investigate Bioconductor packages like GOstats. They actually have a short vignette for your situation. This thread looks really helpful for someone trying to figure that vignette out for the first time.
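As a rough illustration of what "build your own transcript-annotation database" could look like, here is a small sketch that reads a two-column, tab-separated table of transcript ID and GO term and turns it into lookup tables. The file layout, the contig names and the GO labels are assumptions made up for the example; I don't know what Blast2GO actually exports, so you would have to adapt the parsing to your own file.

```python
import csv
import io
from collections import defaultdict


def load_annotations(handle):
    """Read 'transcript_id<TAB>go_id' rows into {transcript_id: {go_id, ...}}."""
    transcript_to_terms = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, short rows and comment/header lines starting with '#'.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        transcript_id, go_id = row[0].strip(), row[1].strip()
        if transcript_id and go_id:
            transcript_to_terms[transcript_id].add(go_id)
    return dict(transcript_to_terms)


def invert(transcript_to_terms):
    """Build the reverse index {go_id: {transcript_id, ...}}."""
    term_to_transcripts = defaultdict(set)
    for transcript_id, terms in transcript_to_terms.items():
        for go_id in terms:
            term_to_transcripts[go_id].add(transcript_id)
    return dict(term_to_transcripts)


# Hypothetical annotation table; a real one would be opened with open(path).
example_table = io.StringIO(
    "# transcript_id\tgo_id\n"
    "contig_001\tGO:TERM_A\n"
    "contig_001\tGO:TERM_B\n"
    "\n"
    "contig_002\tGO:TERM_A\n"
)

db = load_annotations(example_table)
term_index = invert(db)
print(len(db), "annotated transcripts;", len(term_index), "distinct terms")
```

Once you have a mapping like `db`, the complete set of transcripts serves as the "total list" and any subset of transcript IDs can be tested against it, whether in your own code or after reshaping the table into whatever input the GOstats vignette asks for.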