Question: How Do I Do Simple Go Term Lookup Given A Gene (Or Mrna) Identifier?
18
gravatar for Ryan Thompson
7.2 years ago by
Ryan Thompson3.2k
TSRI, La Jolla, CA
Ryan Thompson3.2k wrote:

Suppose I have a gene ID such as NM_030621, which happens to be one of the human mRNA splice variants of DICER1. How can I programmatically get a list of GO terms with which this ID is associated? Going through protein or gene IDs as intermediaries is acceptable.

I want to do this for a whole list of such IDs, all corresponding to RefSeq mRNA sequences. I don't want to do enrichment analysis or anything, just get a list of GO terms for each mRNA.

Preferred languages are Perl and R.

gene annotation database • 21k views
ADD COMMENTlink modified 2.8 years ago by adam.maikai380 • written 7.2 years ago by Ryan Thompson3.2k

Thanks for all the answers, everyone. I ended up using biomaRt in R.

ADD REPLYlink written 7.2 years ago by Ryan Thompson3.2k

Nice question, awesome answers, just added my suggestion using GOOSE.

ADD REPLYlink written 7.2 years ago by Khader Shameer17k
15
gravatar for Neilfws
7.2 years ago by
Neilfws47k
Sydney, Australia
Neilfws47k wrote:

Using R, you can do this with the biomaRt package. Here is a quick example:

library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
results <- getBM(attributes = c("refseq_dna", "go_molecular_function_id",
           "go_molecular_function__dm_name_1006"), filters = "refseq_dna",
           values = c("NM_030621"), mart = mart)

In the getBM() function, we specify which attributes to return (RefSeq ID, GO MF ID, GO MF Name), the filters to use for the query (RefSeq ID) and the filter values (NM_030621). You could also pass a list of values, e.g. from a data frame column as mydf$column.

Result:

        refseq_dna  go_molecular_function_id go_molecular_function__dm_name_1006
1       NM_030621                 GO:0005515                     protein binding
2       NM_030621                 GO:0008026     ATP-dependent helicase activity
3       NM_030621                 GO:0000166                  nucleotide binding
4       NM_030621                 GO:0000287               magnesium ion binding
5       NM_030621                 GO:0004386                   helicase activity
6       NM_030621                 GO:0004519               endonuclease activity
7       NM_030621                 GO:0016787                  hydrolase activity
8       NM_030621                 GO:0030145               manganese ion binding
9       NM_030621                 GO:0004525           ribonuclease III activity
10      NM_030621                 GO:0003725         double-stranded RNA binding
11      NM_030621                 GO:0005524                         ATP binding
12      NM_030621                 GO:0003723                         RNA binding
13      NM_030621                 GO:0003676                nucleic acid binding
14      NM_030621                 GO:0003677                         DNA binding

You could also use the BioMart website, where you can upload a list of IDs, specify attributes and filters and download results as delimited text files.

ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Neilfws47k

Thanks, that worked great. Biomart looks incredibly useful.

ADD REPLYlink written 7.2 years ago by Ryan Thompson3.2k

Neil, do you known why there are more GO identifiers on ensembl for this protein ( http://www.ensembl.org/Homo_sapiens/Transcript/GO?db=core;g=ENSG00000100697;r=14:95552566-95623759;t=ENST00000393063 ) ?

ADD REPLYlink written 7.2 years ago by Pierre Lindenbaum96k

I think the Ensembl page shows all three GO ontologies (BP, CC and MF) - the example above is only MF.

ADD REPLYlink written 7.2 years ago by Neilfws47k

When I try this, I find that 'refseq_dna', 'go_molecular_function_id', and 'go_molecular_function__dm_name_1006' are not valid attributes. Has something changed?

ADD REPLYlink written 2.8 years ago by Greg P70

But, in this way, would not you get limited to the RefSeq ID's which have Ensemble information in the database? 

ADD REPLYlink written 2.5 years ago by Prakki Rama2.0k

What if your organism is not in ensembl list anymore, like deinococcus radiodurans?

ADD REPLYlink written 6 months ago by xioli20130
7
gravatar for Pierre Lindenbaum
7.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum96k wrote:

My XSLT solution:

The stylesheet:


<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" href="http://www.w3.org/1999/XSL/Transform" rel="nofollow">http://www.w3.org/1999/XSL/Transform" "="" rel="nofollow">http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >

<xsl:output method="text" encoding="ISO-8859-1"/>

<xsl:template match="/">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="Dbtag[Dbtag_db='GeneID']">
  <xsl:for-each select="Dbtag_tag/Object-id/Object-id_id">
     <xsl:variable name="genid" select="."/>
    <xsl:variable name="url" select="concat('&lt;a href=" http:="" eutils.ncbi.nlm.nih.gov="" entrez="" eutils="" efetch.fcgi?db="gene&amp;retmode=xml&amp;id=',$genid)" "="" rel="nofollow">http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=xml&id=',$genid)" />
    <xsl:apply-templates select="document($url,//Other-source)" mode="go"/>
  </xsl:for-each>
</xsl:template>

<xsl:template match="Other-source" mode="go">
 <xsl:if test="Other-source_src/Dbtag/Dbtag_db='GO'">
 <xsl:value-of select="concat('GO:',Other-source_src/Dbtag/Dbtag_tag/Object-id/Object-id_id)"/>
<xsl:text>        </xsl:text>
<xsl:value-of select="Other-source_anchor"/>
<xsl:text>
</xsl:text>
 </xsl:if>
</xsl:template>

<xsl:template match="text()">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()" mode="go">
</xsl:template>

</xsl:stylesheet>

Apply this stylesheet with xsltproc for NM_030621:

xsltproc --novalid stylesheet.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_030621&rettype=xml"

Result:

GO:5524         ATP binding
GO:8026         ATP-dependent helicase activity
GO:3725         double-stranded RNA binding
GO:4519         endonuclease activity
GO:4386         helicase activity
GO:16787                hydrolase activity
GO:46872                metal ion binding
GO:166          nucleotide binding
GO:5515         protein binding
GO:4525         ribonuclease III activity
GO:6396         RNA processing
GO:1525         angiogenesis
GO:48754                branching morphogenesis of a tube
GO:35116                embryonic hindlimb morphogenesis
GO:31047                gene silencing by RNA
GO:30324                lung development
GO:31054                pre-microRNA processing
GO:30422                production of siRNA involved in RNA interference
GO:19827                stem cell maintenance
GO:30423                targeting of mRNA for destruction involved in RNA interference
GO:16442                RNA-induced silencing complex
GO:5737         cytoplasm
GO:5622         intracellular
ADD COMMENTlink written 7.2 years ago by Pierre Lindenbaum96k

Interesting solution!

ADD REPLYlink written 7.2 years ago by Istvan Albert ♦♦ 73k

Pierre could you share your experience/comments on when to go the xslt route for such tasks?

ADD REPLYlink written 7.2 years ago by Marcin Cieslik510

... hum, just a kind of feeling... :-) I first think of XSLT when I know that I can get the source of data as XML, when this source is not too large (must fit in memory), when speed is not an issue and I don't need a complicated processing (e.g. tokenizing, aggregation of data (mean/uniq/max/etc...) )

ADD REPLYlink written 7.2 years ago by Pierre Lindenbaum96k
7
gravatar for adam.maikai
2.8 years ago by
adam.maikai380
United States
adam.maikai380 wrote:

Using MyGene.info's new R/Bioconductor package mygene, this can be done simply with the query() function.

library(mygene)
res<-query('NM_030621', fields='go', species='human')$hits
lapply(res, as.list)

This returns a data.frame for CC, MF, and BP, matching score, and the gene id.

$go
$go$CC
$go$CC[[1]]
                                                  term         id evidence   pubmed
1 endoplasmic reticulum-Golgi intermediate compartment GO:0005793      IEA       NA
2                                              cytosol GO:0005829      TAS       NA
3                                         RISC complex GO:0016442      IDA 15973356
4                                                 axon GO:0030424      IEA       NA
5                                             dendrite GO:0030425      IEA       NA
6                                          growth cone GO:0030426      IEA       NA

$go$MF
$go$MF[[1]]
                         term   pubmed         id evidence
1 double-stranded RNA binding 12411504 GO:0003725      IDA
2           helicase activity       NA GO:0004386      IEA
3   ribonuclease III activity       NA GO:0004525      IDA
4             protein binding 12526743 GO:0005515      IPI
5                 ATP binding       NA GO:0005524      IEA
6               miRNA binding       NA GO:0035198      IEA
7           metal ion binding       NA GO:0046872      IEA

$go$BP
$go$BP[[1]]
                                                                   term         id evidence   pubmed
1  negative regulation of transcription from RNA polymerase II promoter GO:0000122      ISS       NA
2                                                          angiogenesis GO:0001525      IEA       NA
3                                            post-embryonic development GO:0009791      IEA       NA
4                                       zygote asymmetric cell division GO:0010070      IEA       NA
5                                                       gene expression GO:0010467      TAS       NA
6                     negative regulation of Schwann cell proliferation GO:0010626      ISS       NA
7                                positive regulation of gene expression GO:0010628      IEA       NA
8                   positive regulation of Schwann cell differentiation GO:0014040      ISS       NA
9                                                 stem cell maintenance GO:0019827      IEA       NA
10                             spinal cord motor neuron differentiation GO:0021522      IEA       NA
11                                                    nerve development GO:0021675      ISS       NA
12                           olfactory bulb interneuron differentiation GO:0021889      IEA       NA
13                                          cerebral cortex development GO:0021987      IEA       NA
14                                                     lung development GO:0030324      IEA       NA
15                     production of siRNA involved in RNA interference GO:0030422      IDA 15973356
16       targeting of mRNA for destruction involved in RNA interference GO:0030423      IMP 15973356
17                                                 pre-miRNA processing GO:0031054      IDA 15973356
18                                          hair follicle morphogenesis GO:0031069      IEA       NA
19                                   positive regulation of myelination GO:0031643      ISS       NA
20                           peripheral nervous system myelin formation GO:0032290      ISS       NA
21                                     embryonic hindlimb morphogenesis GO:0035116      IEA       NA
22             production of miRNAs involved in gene silencing by miRNA GO:0035196      ISS       NA
23                                        multicellular organism growth GO:0035264      IEA       NA
24                               regulation of viral genome replication GO:0045069      IEA       NA
25                                 regulation of neuron differentiation GO:0045664      IEA       NA
26                                                   mRNA stabilization GO:0048255      IEA       NA
27                                                   spleen development GO:0048536      IEA       NA
28                                   reproductive structure development GO:0048608      IEA       NA
29                        regulation of oligodendrocyte differentiation GO:0048713      IEA       NA
30                        branching morphogenesis of an epithelial tube GO:0048754      IEA       NA
31                                      neuron projection morphogenesis GO:0048812      ISS       NA
32                                                     spindle assembly GO:0051225      IEA       NA
33                                            defense response to virus GO:0051607      IEA       NA
34                                             regulation of cell cycle GO:0051726      IEA       NA
35                                      cardiac muscle cell development GO:0055013      IEA       NA
36                                  inner ear receptor cell development GO:0060119      IEA       NA
37                               intestinal epithelial cell development GO:0060576      IEA       NA
38                                  regulation of enamel mineralization GO:0070173      IEA       NA
39                                     hair follicle cell proliferation GO:0071335      IEA       NA
40                                   RNA phosphodiester bond hydrolysis GO:0090501      IDA       NA
41                  RNA phosphodiester bond hydrolysis, endonucleolytic GO:0090502      IDA       NA
42                  RNA phosphodiester bond hydrolysis, endonucleolytic GO:0090502      IEA       NA
43                       positive regulation of miRNA metabolic process GO:2000630      IEA       NA


$`_score`
$`_score`[[1]]
[1] 0.8782682

$`_id`
$`_id`[[1]]
[1] "23405"
ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by adam.maikai380
5
gravatar for Jandot
7.2 years ago by
Jandot370
GB
Jandot370 wrote:

Hey Ryan,

Just had a very quick look at this, so am not presenting the whole solution here. However, using the ruby-ensembl-api and given a

[?]

Result: [?] GO:0005488 GO:0005515 GO:0007275 GO:0005622 GO:0007155 [?]

This uses the Ensembl stable ID instead of the RefSeq accession, but you should be easily able to link one to the other.

ADD COMMENTlink written 7.2 years ago by Jandot370
4
gravatar for Dario Corrada
7.2 years ago by
Monza - ITALY
Dario Corrada70 wrote:

For my work I've used an R package called CORNA 1 and I find very useful functions for a batch retrieves. Here there is an example code chunck:

# This script retrieve a list of dataframes for crosslinks among RefSeqID, Ensembl IDs (transcript e gene), GO IDs (cellular component, molecular function e biological process)

library(CORNA);

biomart = "ensembl"; # BioMart DB (for a complete list of avalaible databases use "listMarts" function [biomaRt])
dataset = "mmusculus_gene_ensembl"; # dataset
# for a complete list of avalaible datasets use a chunck like this:
# library(biomaRt)
# mart <- useMart('ensembl')
# listDatasets(mart)
org="mmu"; # organism, e.g. mmu -> Mus Musculus

# crosslinks between RefSeq and Ensembl Transcript
refseq2ensembl_tran <-  BioMart2df.fun( biomart=biomart, dataset=dataset,
                                        col.old=c("refseq_dna", "ensembl_transcript_id"),
                                        col.new=c("refseq", "ensembl_transcript"));

# crosslinks between RefSeq and Ensembl Gene
refseq2ensembl_gene <-  BioMart2df.fun( biomart=biomart, dataset=dataset,
                                        col.old=c("refseq_dna", "ensembl_gene_id"),
                                        col.new=c("refseq", "ensembl_gene"));

# crosslinks between Ensembl Transcript and GObp
tran2gobp  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_biological_process_id"),
                                col.new=c("ensembl_transcript", "gobp"));

# crosslinks between Ensembl Transcript and GOcc
tran2gocc  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_cellular_component_id"),
                                col.new=c("ensembl_transcript", "gocc"));

# crosslinks between Ensembl Transcript and GOmf
tran2gomf  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_molecular_function_id"),
                                col.new=c("ensembl_transcript", "gomf"));

# links between GO id and term
corna.go2term <- GO2df.fun(url = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz");

# pathway information from KEGG
corna.gene2path <- KEGG2df.fun(org = org);
ADD COMMENTlink written 7.2 years ago by Dario Corrada70

Thank you very much for this link, because I just happen to be doing GO-term enrichment analysis on miRNA targets.

ADD REPLYlink written 7.2 years ago by Ryan Thompson3.2k
3
gravatar for Andrew Su
7.2 years ago by
Andrew Su4.7k
San Diego, CA
Andrew Su4.7k wrote:

Seems like a good task for BioMart's MartService. Check out their API. You specify your query using their Query XML syntax, and a good way to generate a template is to do the query using the GUI and then export the corresponding XML.

Check out also this previous question, where QuickGO and goatools are mentioned...

ADD COMMENTlink written 7.2 years ago by Andrew Su4.7k
2
gravatar for Pierre Lindenbaum
7.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum96k wrote:

I was also interested in using the mysql server for ensembl for this query. After doing some little reverse engineering with the SQL schema, I wrote the following SQL query:

query:

use homo_sapiens_core_48_36j;

select  distinct
    GENE_ID.stable_id as "ensembl.gene",
    RNA_ID.stable_id as "ensembl.transcript",
    PROT_ID.stable_id as "ensembl.translation",
    GO.acc as "go.acc",
    GO.name as "go.name",
    GOXREF.linkage_type as "evidence"
from

    ensembl_go_54.term as GO,
    external_db as EXTDB0,
    external_db as EXTDB1,
    object_xref as OX0,
    object_xref as OX1,
    xref as XREF0,
    xref as XREF1,
    transcript as RNA,
    transcript_stable_id as RNA_ID,
    gene as GENE,
    gene_stable_id as GENE_ID,
    translation as PROT,
    translation_stable_id as PROT_ID,
    go_xref as GOXREF
where
    XREF0.dbprimary_acc="NM_030621" and
    XREF0.external_db_id=EXTDB0.external_db_id and
    EXTDB0.db_name="RefSeq_dna" and
    OX0.xref_id=XREF0.xref_id and
    RNA.gene_id=GENE.gene_id and
    GENE.gene_id= GENE_ID.gene_id and
    RNA.transcript_id=OX0.ensembl_id and
    RNA_ID.transcript_id=RNA.transcript_id and
    PROT.transcript_id = RNA.transcript_id and
    OX1.ensembl_id=PROT.translation_id and
    PROT.translation_id=PROT_ID.translation_id and
    OX1.ensembl_object_type='Translation' and
    OX1.xref_id=XREF1.xref_id and
    GOXREF.object_xref_id=OX1.object_xref_id and 
    XREF1.external_db_id=EXTDB1.external_db_id and
    EXTDB1.db_name="GO" and
    GO.acc=XREF1.dbprimary_acc

order by GO.acc;

Execute:

mysql -A -h ensembldb.ensembl.org -u anonymous -P 5306 < query.sql

Result:

ensembl.gene    ensembl.transcript      ensembl.translation     go.acc  go.name evidence
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0000166      nucleotide binding      IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0001525      angiogenesis    IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003676      nucleic acid binding    IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003677      DNA binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003723      RNA binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003725      double-stranded RNA binding     IDA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004386      helicase activity       IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004519      endonuclease activity   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004525      ribonuclease III activity       IDA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005515      protein binding IPI
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005524      ATP binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005622      intracellular   NAS
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0006396      RNA processing  IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0008026      ATP-dependent helicase activity IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0016787      hydrolase activity      IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0019827      stem cell maintenance   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030324      lung development        IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030422      RNA interference, production of siRNA   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030423      RNA interference, targeting of mRNA for destruction     IEP
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0031047      gene silencing by RNA   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0035116      embryonic hindlimb morphogenesis        IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0035196      gene silencing by miRNA, production of miRNAs   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0048754      branching morphogenesis of a tube       IEA

Note: my previous solution as well as Neil's one don't list all the GO terms visible in http://www.ensembl.org/Homo_sapiens/Transcript/GO?db=core;g=ENSG00000100697;r=14:95552566-95623759;t=ENST00000393063 . This SQL looks probably better even if the result is still different with the web page (e.g.: I can see GO:0016442 on the web page, but not in my result ). The ensembl schema is complex, and I might have missed some links (left join ?) between the tables.

ADD COMMENTlink written 7.2 years ago by Pierre Lindenbaum96k

ok, it should work better with homo_sapiens_core_58_37c Thanks @joachimbaran http://twitter.com/joachimbaran/status/14981387076

ADD REPLYlink written 7.2 years ago by Pierre Lindenbaum96k
0
gravatar for Khader Shameer
7.2 years ago by
Manhattan, NY
Khader Shameer17k wrote:

You can get a quick look of annotation using GOOSE (GO Online SQL Environment) GOOSE Manual and a long list of sample queries here.

For a perl based solution I will try with Onto-perl. Solution for your exact question is here. Loonger list of examples here - a valuable reference that provided most of the anticipated use cases in GO and solution using Onto-Perl.

ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Khader Shameer17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1175 users visited in the last hour