Question: How Do I Do Simple Go Term Lookup Given A Gene (Or Mrna) Identifier?
18
gravatar for Ryan Thompson
8.6 years ago by
Ryan Thompson3.4k
TSRI, La Jolla, CA
Ryan Thompson3.4k wrote:

Suppose I have a gene ID such as NM_030621, which happens to be one of the human mRNA splice variants of DICER1. How can I programmatically get a list of GO terms with which this ID is associated? Going through protein or gene IDs as intermediaries is acceptable.

I want to do this for a whole list of such IDs, all corresponding to RefSeq mRNA sequences. I don't want to do enrichment analysis or anything, just get a list of GO terms for each mRNA.

Preferred languages are Perl and R.

gene annotation database • 27k views
ADD COMMENTlink modified 4.1 years ago by adam.maikai460 • written 8.6 years ago by Ryan Thompson3.4k

Thanks for all the answers, everyone. I ended up using biomaRt in R.

ADD REPLYlink written 8.6 years ago by Ryan Thompson3.4k

Nice question, awesome answers, just added my suggestion using GOOSE.

ADD REPLYlink written 8.6 years ago by Khader Shameer17k
15
gravatar for Neilfws
8.6 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

Using R, you can do this with the biomaRt package. Here is a quick example:

library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
results <- getBM(attributes = c("refseq_dna", "go_molecular_function_id",
           "go_molecular_function__dm_name_1006"), filters = "refseq_dna",
           values = c("NM_030621"), mart = mart)

In the getBM() function, we specify which attributes to return (RefSeq ID, GO MF ID, GO MF Name), the filters to use for the query (RefSeq ID) and the filter values (NM_030621). You could also pass a list of values, e.g. from a data frame column as mydf$column.

Result:

        refseq_dna  go_molecular_function_id go_molecular_function__dm_name_1006
1       NM_030621                 GO:0005515                     protein binding
2       NM_030621                 GO:0008026     ATP-dependent helicase activity
3       NM_030621                 GO:0000166                  nucleotide binding
4       NM_030621                 GO:0000287               magnesium ion binding
5       NM_030621                 GO:0004386                   helicase activity
6       NM_030621                 GO:0004519               endonuclease activity
7       NM_030621                 GO:0016787                  hydrolase activity
8       NM_030621                 GO:0030145               manganese ion binding
9       NM_030621                 GO:0004525           ribonuclease III activity
10      NM_030621                 GO:0003725         double-stranded RNA binding
11      NM_030621                 GO:0005524                         ATP binding
12      NM_030621                 GO:0003723                         RNA binding
13      NM_030621                 GO:0003676                nucleic acid binding
14      NM_030621                 GO:0003677                         DNA binding

You could also use the BioMart website, where you can upload a list of IDs, specify attributes and filters and download results as delimited text files.

ADD COMMENTlink modified 3 months ago by RamRS19k • written 8.6 years ago by Neilfws48k

Thanks, that worked great. Biomart looks incredibly useful.

ADD REPLYlink written 8.6 years ago by Ryan Thompson3.4k

Neil, do you known why there are more GO identifiers on ensembl for this protein ?

ADD REPLYlink modified 3 months ago by RamRS19k • written 8.6 years ago by Pierre Lindenbaum115k

I think the Ensembl page shows all three GO ontologies (BP, CC and MF) - the example above is only MF.

ADD REPLYlink written 8.6 years ago by Neilfws48k

When I try this, I find that 'refseq_dna', 'go_molecular_function_id', and 'go_molecular_function__dm_name_1006' are not valid attributes. Has something changed?

ADD REPLYlink written 4.2 years ago by Greg P70

But, in this way, would not you get limited to the RefSeq ID's which have Ensemble information in the database? 

ADD REPLYlink written 3.9 years ago by Prakki Rama2.2k

What if your organism is not in ensembl list anymore, like deinococcus radiodurans?

ADD REPLYlink written 23 months ago by xioli20130
7
gravatar for Pierre Lindenbaum
8.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum115k wrote:

My XSLT solution:

The stylesheet:


<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" href="http://www.w3.org/1999/XSL/Transform" rel="nofollow">http://www.w3.org/1999/XSL/Transform" "="" rel="nofollow">http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >

<xsl:output method="text" encoding="ISO-8859-1"/>

<xsl:template match="/">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="Dbtag[Dbtag_db='GeneID']">
  <xsl:for-each select="Dbtag_tag/Object-id/Object-id_id">
     <xsl:variable name="genid" select="."/>
    <xsl:variable name="url" select="concat('&lt;a href=" http:="" eutils.ncbi.nlm.nih.gov="" entrez="" eutils="" efetch.fcgi?db="gene&amp;amp;retmode=xml&amp;amp;id=',$genid)" "="" rel="nofollow">http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=xml&id=',$genid)" />
    <xsl:apply-templates select="document($url,//Other-source)" mode="go"/>
  </xsl:for-each>
</xsl:template>

<xsl:template match="Other-source" mode="go">
 <xsl:if test="Other-source_src/Dbtag/Dbtag_db='GO'">
 <xsl:value-of select="concat('GO:',Other-source_src/Dbtag/Dbtag_tag/Object-id/Object-id_id)"/>
<xsl:text>        </xsl:text>
<xsl:value-of select="Other-source_anchor"/>
<xsl:text>
</xsl:text>
 </xsl:if>
</xsl:template>

<xsl:template match="text()">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()" mode="go">
</xsl:template>

</xsl:stylesheet>

Apply this stylesheet with xsltproc for NM_030621:

xsltproc --novalid stylesheet.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_030621&rettype=xml"

Result:

GO:5524         ATP binding
GO:8026         ATP-dependent helicase activity
GO:3725         double-stranded RNA binding
GO:4519         endonuclease activity
GO:4386         helicase activity
GO:16787        hydrolase activity
GO:46872        metal ion binding
GO:166          nucleotide binding
GO:5515         protein binding
GO:4525         ribonuclease III activity
GO:6396         RNA processing
GO:1525         angiogenesis
GO:48754        branching morphogenesis of a tube
GO:35116        embryonic hindlimb morphogenesis
GO:31047        gene silencing by RNA
GO:30324        lung development
GO:31054        pre-microRNA processing
GO:30422        production of siRNA involved in RNA interference
GO:19827        stem cell maintenance
GO:30423        targeting of mRNA for destruction involved in RNA interference
GO:16442        RNA-induced silencing complex
GO:5737         cytoplasm
GO:5622         intracellular
ADD COMMENTlink modified 3 months ago by RamRS19k • written 8.6 years ago by Pierre Lindenbaum115k

Interesting solution!

ADD REPLYlink written 8.6 years ago by Istvan Albert ♦♦ 78k

Pierre could you share your experience/comments on when to go the xslt route for such tasks?

ADD REPLYlink written 8.6 years ago by Marcin Cieslik520

... hum, just a kind of feeling... :-) I first think of XSLT when I know that I can get the source of data as XML, when this source is not too large (must fit in memory), when speed is not an issue and I don't need a complicated processing (e.g. tokenizing, aggregation of data (mean/uniq/max/etc...) )

ADD REPLYlink written 8.6 years ago by Pierre Lindenbaum115k
7
gravatar for adam.maikai
4.1 years ago by
adam.maikai460
United States
adam.maikai460 wrote:

Using MyGene.info's new R/Bioconductor package mygene, this can be done simply with the query() function.

library(mygene)
res<-query('NM_030621', fields='go', species='human')$hits
lapply(res, as.list)

This returns a data.frame for CC, MF, and BP, matching score, and the gene id.

ADD COMMENTlink modified 3 months ago by RamRS19k • written 4.1 years ago by adam.maikai460
5
gravatar for Jandot
8.6 years ago by
Jandot370
GB
Jandot370 wrote:

Hey Ryan,

Just had a very quick look at this, so am not presenting the whole solution here. However, using the ruby-ensembl-api and given a

[?]

Result: [?] GO:0005488 GO:0005515 GO:0007275 GO:0005622 GO:0007155 [?]

This uses the Ensembl stable ID instead of the RefSeq accession, but you should be easily able to link one to the other.

ADD COMMENTlink written 8.6 years ago by Jandot370
4
gravatar for Dario Corrada
8.6 years ago by
Monza - ITALY
Dario Corrada70 wrote:

For my work I've used an R package called CORNA 1 and I find very useful functions for a batch retrieves. Here there is an example code chunk:

# This script retrieve a list of dataframes for crosslinks among RefSeqID, Ensembl IDs (transcript e gene), GO IDs (cellular component, molecular function e biological process)

library(CORNA);

biomart = "ensembl"; # BioMart DB (for a complete list of avalaible databases use "listMarts" function [biomaRt])
dataset = "mmusculus_gene_ensembl"; # dataset
# for a complete list of avalaible datasets use a chunck like this:
# library(biomaRt)
# mart <- useMart('ensembl')
# listDatasets(mart)
org="mmu"; # organism, e.g. mmu -> Mus Musculus

# crosslinks between RefSeq and Ensembl Transcript
refseq2ensembl_tran <-  BioMart2df.fun( biomart=biomart, dataset=dataset,
                                        col.old=c("refseq_dna", "ensembl_transcript_id"),
                                        col.new=c("refseq", "ensembl_transcript"));

# crosslinks between RefSeq and Ensembl Gene
refseq2ensembl_gene <-  BioMart2df.fun( biomart=biomart, dataset=dataset,
                                        col.old=c("refseq_dna", "ensembl_gene_id"),
                                        col.new=c("refseq", "ensembl_gene"));

# crosslinks between Ensembl Transcript and GObp
tran2gobp  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_biological_process_id"),
                                col.new=c("ensembl_transcript", "gobp"));

# crosslinks between Ensembl Transcript and GOcc
tran2gocc  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_cellular_component_id"),
                                col.new=c("ensembl_transcript", "gocc"));

# crosslinks between Ensembl Transcript and GOmf
tran2gomf  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_molecular_function_id"),
                                col.new=c("ensembl_transcript", "gomf"));

# links between GO id and term
corna.go2term <- GO2df.fun(url = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz");

# pathway information from KEGG
corna.gene2path <- KEGG2df.fun(org = org);
ADD COMMENTlink modified 3 months ago by RamRS19k • written 8.6 years ago by Dario Corrada70

Thank you very much for this link, because I just happen to be doing GO-term enrichment analysis on miRNA targets.

ADD REPLYlink written 8.6 years ago by Ryan Thompson3.4k
3
gravatar for Andrew Su
8.6 years ago by
Andrew Su4.8k
San Diego, CA
Andrew Su4.8k wrote:

Seems like a good task for BioMart's MartService. Check out their API. You specify your query using their Query XML syntax, and a good way to generate a template is to do the query using the GUI and then export the corresponding XML.

Check out also this previous question, where QuickGO and goatools are mentioned...

ADD COMMENTlink written 8.6 years ago by Andrew Su4.8k
2
gravatar for Pierre Lindenbaum
8.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum115k wrote:

I was also interested in using the mysql server for ensembl for this query. After doing some little reverse engineering with the SQL schema, I wrote the following SQL query: query:

use homo_sapiens_core_48_36j;

select  distinct
    GENE_ID.stable_id as "ensembl.gene",
    RNA_ID.stable_id as "ensembl.transcript",
    PROT_ID.stable_id as "ensembl.translation",
    GO.acc as "go.acc",
    GO.name as "go.name",
    GOXREF.linkage_type as "evidence"
from

    ensembl_go_54.term as GO,
    external_db as EXTDB0,
    external_db as EXTDB1,
    object_xref as OX0,
    object_xref as OX1,
    xref as XREF0,
    xref as XREF1,
    transcript as RNA,
    transcript_stable_id as RNA_ID,
    gene as GENE,
    gene_stable_id as GENE_ID,
    translation as PROT,
    translation_stable_id as PROT_ID,
    go_xref as GOXREF
where
    XREF0.dbprimary_acc="NM_030621" and
    XREF0.external_db_id=EXTDB0.external_db_id and
    EXTDB0.db_name="RefSeq_dna" and
    OX0.xref_id=XREF0.xref_id and
    RNA.gene_id=GENE.gene_id and
    GENE.gene_id= GENE_ID.gene_id and
    RNA.transcript_id=OX0.ensembl_id and
    RNA_ID.transcript_id=RNA.transcript_id and
    PROT.transcript_id = RNA.transcript_id and
    OX1.ensembl_id=PROT.translation_id and
    PROT.translation_id=PROT_ID.translation_id and
    OX1.ensembl_object_type='Translation' and
    OX1.xref_id=XREF1.xref_id and
    GOXREF.object_xref_id=OX1.object_xref_id and 
    XREF1.external_db_id=EXTDB1.external_db_id and
    EXTDB1.db_name="GO" and
    GO.acc=XREF1.dbprimary_acc

order by GO.acc;

Execute:

mysql -A -h ensembldb.ensembl.org -u anonymous -P 5306 < query.sql

Result:

ensembl.gene    ensembl.transcript      ensembl.translation     go.acc  go.name evidence
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0000166      nucleotide binding      IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0001525      angiogenesis    IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003676      nucleic acid binding    IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003677      DNA binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003723      RNA binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003725      double-stranded RNA binding     IDA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004386      helicase activity       IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004519      endonuclease activity   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004525      ribonuclease III activity       IDA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005515      protein binding IPI
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005524      ATP binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005622      intracellular   NAS
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0006396      RNA processing  IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0008026      ATP-dependent helicase activity IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0016787      hydrolase activity      IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0019827      stem cell maintenance   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030324      lung development        IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030422      RNA interference, production of siRNA   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030423      RNA interference, targeting of mRNA for destruction     IEP
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0031047      gene silencing by RNA   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0035116      embryonic hindlimb morphogenesis        IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0035196      gene silencing by miRNA, production of miRNAs   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0048754      branching morphogenesis of a tube       IEA

Note: my previous solution as well as Neil's one don't list all the GO terms visible here. This SQL looks probably better even if the result is still different with the web page (e.g.: I can see GO:0016442 on the web page, but not in my result ). The ensembl schema is complex, and I might have missed some links (left join ?) between the tables.

ADD COMMENTlink modified 3 months ago by RamRS19k • written 8.6 years ago by Pierre Lindenbaum115k

ok, it should work better with homo_sapiens_core_58_37c Thanks @joachimbaran http://twitter.com/joachimbaran/status/14981387076

ADD REPLYlink written 8.6 years ago by Pierre Lindenbaum115k
0
gravatar for Khader Shameer
8.6 years ago by
Manhattan, NY
Khader Shameer17k wrote:

You can get a quick look of annotation using GOOSE (GO Online SQL Environment) GOOSE Manual and a long list of sample queries here.

For a perl based solution I will try with Onto-perl. Solution for your exact question is here. Loonger list of examples here - a valuable reference that provided most of the anticipated use cases in GO and solution using Onto-Perl.

ADD COMMENTlink modified 8.6 years ago • written 8.6 years ago by Khader Shameer17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1206 users visited in the last hour