Question: Looking for a tool to do GO enrichment analysis with BLAST2GO output(s)
2
gravatar for jornranzijn
4.8 years ago by
jornranzijn20
jornranzijn20 wrote:

Hi all,

I have annotated a set of proteins with BLAST2GO. I've also annotated the complete proteome of the same specie. Now i want to check for enriched GO terms in my protein set compared to the whole proteome.

The BLAST2GO fischer exact test feature isn't working correctly (i guess) because it excludes proteins from the reference set that are also present in the test set.

I've checked multiple tools, DAVID and ontologizer for example, but i can't find an appropriate one. ArgriGO seems to accept custom annotations but has an upload limit of 4 MB which is to low for my data set.

Any help is appreciated!!

Thank you in advance,

Jorn

 

ADD COMMENTlink modified 4.1 years ago by CanyTon10 • written 4.8 years ago by jornranzijn20

I did GO enrichment by downloading GO cc, mf and bp lists, searching my GO's in the list using dictionary and counting a frequency of occurrence of each GO term and produce an enrichment histogram. 
 

ADD REPLYlink written 4.1 years ago by User000270
3
gravatar for Michael Dondrup
4.8 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

Most tools I have found do not work with custom annotations or other organisms than predefined (which is a shame imo). Certainly, nobody would ever want to do research on other organisms than mouse, human, or drosophila. 

Edit: What is more, most tools do not allow you restrict the background set of genes, such that they will always use the complete genome, something which makes them rather undesirable in many cases (not possibly to restrict to relevant genes only, e.g. those represented by microarray probes, or expressed at least once in a study).

So there are two features you should look for in a GO-enrichment tool for non-model organisms:

  • ability to define own GO annotation and
  • ability to define own background gene set 

The Bioconductor package GOstats however will work with non-model organisms, you only need an annotation data.frame with three columns: go_id, evidence code, gene-id. that should be easy to generate from the Blast2GO output, see:

http://www.bioconductor.org/packages/release/bioc/vignettes/GOstats/inst/doc/GOstatsForUnsupportedOrganisms.pdf

evidence code should possibly be set to be 'IEA' for automatic annotation pipelines like Blast2GO or InterPro. 

 

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Michael Dondrup46k
1

I agree, even on the level of GOA there is a very small number of species with annotation. That's what function through orthology is for I suppose.

ADD REPLYlink written 4.8 years ago by pld4.8k
3
gravatar for David Fredman
4.8 years ago by
David Fredman980
University of Bergen, Norway
David Fredman980 wrote:

As suggested by Michael Dondrup, the R GOstats package enables you to do this.

I've put a convenience function to convert a Blast2GO ".annot" file into a GSEA gsc file for GOstats on github. You might find it useful, see here:

https://github.com/davfre/GOstatsPlus/blob/master/vignettes/GOstatsPlus.Rmd (example)

https://github.com/davfre/GOstatsPlus/blob/master/R/b2g-to-gsc.R (make gsc file from Blast2GO export or similar)

ADD COMMENTlink modified 4.5 years ago • written 4.8 years ago by David Fredman980

Hi David,

I am having trouble in the following steps :

> gene_IDs_of_interest = unique(unlist(geneIds(GO_gsc)),500)

> GO_BP = test_GO(gene_IDs_of_interest, ontology = "BP", gsc=GO_gsc, pval = 0.01)
> head(summary(GO_BP))

[1] GOBPID    Pvalue    OddsRatio ExpCount  Count     Size      Term     
<0 rows> (or 0-length row.names)
Warning message:
No results met the specified criteria.  Returning 0-row data.frame

While checking back what my "gene_IDs_of_interest" is storing

> length(gene_IDs_of_interest)
[1]130

But when I am just calculating just by calling for the head ( that would be the first 6 in the list of "gene_IDs_of_interest")

>gene_IDs_of_interest = head(unique(unlist(geneIds(GO_gsc))))

then on running test_GO.R outputs  the test values. I suppose I am not defining the "gene_IDs_of_interest" (Sorry I am very new to R).

Just to have a little background, I am working on RNA-seq data of non model organism. Hence the "gene_ids" are  custom names. The "file.annt" was obtained from blast2go( with initial 650 unique transcript ids as input). Following is my goAllFrame (first few)

          go_id evidence             gene_id
1    GO:0000003      ISA  comp561024_c1_seq1
2    GO:0002376      ISA  comp550730_c0_seq1
3    GO:0002376      ISA  comp557737_c4_seq4
4    GO:0002376      ISA  comp562807_c1_seq1
5    GO:0002376      ISA  comp562807_c1_seq2
6    GO:0003008      ISA  comp520469_c2_seq1
 

Could you please guide me where I am going wrong.

Thanks

 

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by Anushka20
1

Dear Anki,

 

Do I understand it correctly that your complete gene set consists of 650 transcripts, of which 130 are annotated with a GO term? Is that the complete transcriptome, or are these genes that you found to be differentially expressed?

In the example provided in the vignette you are refering to, I picked the first 500 annotated genes as a "toy" gene set in which to look for overrepresented terms. In your case, that would pick all 130 annotated genes in our background dataset, and compare them to the background of 130 genes, resulting in no overrepresentation - the empty result that you are observing. If you instead take 6 genes, then you observe some statistically overrepresented terms in that small set, compared to the background of 130 annotated genes.

It seems to me that your species has very few genes, and is very sparsely annotated (few BLAST hits to swissprot), or that you perhaps are using only a subset of the complete gene annotation as input? (the universe background should be all annotated genes, and "genes of interest" are expected to be a small subset defined either by prior hypothesis or e.g. a differential expression experiment)

 

 

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by David Fredman980
1
gravatar for pld
4.8 years ago by
pld4.8k
United States
pld4.8k wrote:

http://www.geneontology.org/page/go-enrichment-analysis

There are several R packages. I like IPA (but it is closed source/expensive) and InnateDB. InnateDB is aimed at immunology but its connected to the whole GOA for several species, so it works well.

 

This is too good! The GO Consortium actually has a beta version GO enrichment tool on their front page! I guess they got tired of people asking lol.

ADD COMMENTlink written 4.8 years ago by pld4.8k
0
gravatar for poisonAlien
4.8 years ago by
poisonAlien2.8k
Asgard
poisonAlien2.8k wrote:

How about CPDB ? (works only for human mouse and yeast)

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by poisonAlien2.8k
0
gravatar for CanyTon
4.1 years ago by
CanyTon10
United States
CanyTon10 wrote:

May this help? http://bioinformatics.oxfordjournals.org/content/21/18/3674.long

ADD COMMENTlink written 4.1 years ago by CanyTon10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1761 users visited in the last hour