Looking for a tool to do GO enrichment analysis with BLAST2GO output(s)
5
2
Entering edit mode
7.2 years ago
jornranzijn ▴ 20

Hi all,

I have annotated a set of proteins with BLAST2GO. I've also annotated the complete proteome of the same specie. Now i want to check for enriched GO terms in my protein set compared to the whole proteome.

The BLAST2GO fischer exact test feature isn't working correctly (i guess) because it excludes proteins from the reference set that are also present in the test set.

I've checked multiple tools, DAVID and ontologizer for example, but i can't find an appropriate one. ArgriGO seems to accept custom annotations but has an upload limit of 4 MB which is to low for my data set.

Any help is appreciated!!

Jorn

GO enrichment Custom annotation • 8.1k views
0
Entering edit mode

I did GO enrichment by downloading GO cc, mf and bp lists, searching my GO's in the list using dictionary and counting a frequency of occurrence of each GO term and produce an enrichment histogram.

3
Entering edit mode
7.2 years ago

Most tools I have found do not work with custom annotations or other organisms than predefined (which is a shame imo). Certainly, nobody would ever want to do research on other organisms than mouse, human, or drosophila.

Edit: What is more, most tools do not allow you restrict the background set of genes, such that they will always use the complete genome, something which makes them rather undesirable in many cases (not possibly to restrict to relevant genes only, e.g. those represented by microarray probes, or expressed at least once in a study).

So there are two features you should look for in a GO-enrichment tool for non-model organisms:

• ability to define own GO annotation and
• ability to define own background gene set

The Bioconductor package GOstats however will work with non-model organisms, you only need an annotation data.frame with three columns: go_id, evidence code, gene-id. that should be easy to generate from the Blast2GO output, see:

http://www.bioconductor.org/packages/release/bioc/vignettes/GOstats/inst/doc/GOstatsForUnsupportedOrganisms.pdf

evidence code should possibly be set to be 'IEA' for automatic annotation pipelines like Blast2GO or InterPro.

1
Entering edit mode

I agree, even on the level of GOA there is a very small number of species with annotation. That's what function through orthology is for I suppose.

3
Entering edit mode
7.2 years ago
David Fredman ★ 1.1k

As suggested by Michael Dondrup, the R GOstats package enables you to do this.

I've put a convenience function to convert a Blast2GO ".annot" file into a GSEA gsc file for GOstats on github. You might find it useful, see here:

https://github.com/davfre/GOstatsPlus/blob/master/R/b2g-to-gsc.R (make gsc file from Blast2GO export or similar)

0
Entering edit mode

Hi David,

I am having trouble in the following steps :

> gene_IDs_of_interest = unique(unlist(geneIds(GO_gsc)),500)

> GO_BP = test_GO(gene_IDs_of_interest, ontology = "BP", gsc=GO_gsc, pval = 0.01)

[1] GOBPID    Pvalue    OddsRatio ExpCount  Count     Size      Term
<0 rows> (or 0-length row.names)
Warning message:

While checking back what my "gene_IDs_of_interest" is storing

> length(gene_IDs_of_interest)
[1]130

But when I am just calculating just by calling for the head ( that would be the first 6 in the list of "gene_IDs_of_interest")

>gene_IDs_of_interest = head(unique(unlist(geneIds(GO_gsc))))

then on running test_GO.R outputs  the test values. I suppose I am not defining the "gene_IDs_of_interest" (Sorry I am very new to R).

Just to have a little background, I am working on RNA-seq data of non model organism. Hence the "gene_ids" are  custom names. The "file.annt" was obtained from blast2go( with initial 650 unique transcript ids as input). Following is my goAllFrame (first few)

go_id evidence             gene_id
1    GO:0000003      ISA  comp561024_c1_seq1
2    GO:0002376      ISA  comp550730_c0_seq1
3    GO:0002376      ISA  comp557737_c4_seq4
4    GO:0002376      ISA  comp562807_c1_seq1
5    GO:0002376      ISA  comp562807_c1_seq2
6    GO:0003008      ISA  comp520469_c2_seq1

Could you please guide me where I am going wrong.

Thanks

1
Entering edit mode

Dear Anki,

Do I understand it correctly that your complete gene set consists of 650 transcripts, of which 130 are annotated with a GO term? Is that the complete transcriptome, or are these genes that you found to be differentially expressed?

In the example provided in the vignette you are refering to, I picked the first 500 annotated genes as a "toy" gene set in which to look for overrepresented terms. In your case, that would pick all 130 annotated genes in our background dataset, and compare them to the background of 130 genes, resulting in no overrepresentation - the empty result that you are observing. If you instead take 6 genes, then you observe some statistically overrepresented terms in that small set, compared to the background of 130 annotated genes.

It seems to me that your species has very few genes, and is very sparsely annotated (few BLAST hits to swissprot), or that you perhaps are using only a subset of the complete gene annotation as input? (the universe background should be all annotated genes, and "genes of interest" are expected to be a small subset defined either by prior hypothesis or e.g. a differential expression experiment)

1
Entering edit mode
7.2 years ago
pld 5.0k

http://www.geneontology.org/page/go-enrichment-analysis

There are several R packages. I like IPA (but it is closed source/expensive) and InnateDB. InnateDB is aimed at immunology but its connected to the whole GOA for several species, so it works well.

This is too good! The GO Consortium actually has a beta version GO enrichment tool on their front page! I guess they got tired of people asking lol.

0
Entering edit mode
7.2 years ago
poisonAlien ★ 3.1k

How about CPDB ? (works only for human mouse and yeast)

0
Entering edit mode
6.4 years ago
CanyTon ▴ 10