Question

Returning gene functional annotations with gene symbols as input (using mygene Bioconductor package)

0

Entering edit mode

6.8 years ago

Bohdan Khomtchouk ▴ 350

I'm working with the following R code where I'm returning gene ontology terms pertinent to the biological process (BP) category:

> source("https://bioconductor.org/biocLite.R")
> biocLite("mygene")
> xli <- c('BRCA1', 'BRCA2', 'SOX2', 'MYC')
> res <- queryMany(xli, scopes='symbol', fields=c('go'), species='rat')
Finished
> unlist(unlist(res$go.BP)[names(unlist(res$go.BP)) == 'term'], use.names = F)
  [1] "double-strand break repair via homologous recombination"                                                        
  [2] "double-strand break repair via homologous recombination"                                                        
  [3] "double-strand break repair via homologous recombination"                                                        
  [4] "DNA replication"                                                                                                
  [5] "DNA replication"                                                                                                
  .
  .
  .                                                           
  [378] "positive regulation of DNA biosynthetic process"                                                                
  [379] "positive regulation of response to DNA damage stimulus"                                                         
  [380] "positive regulation of response to DNA damage stimulus"                                                         
  [381] "positive regulation of ATP biosynthetic process"                                                                
  [382] "positive regulation of apoptotic signaling pathway"

Please see: Is there an R package that pulls up gene functional annotations with gene symbols as input?. But then when I try adding gene CARNS1 to xli:

> xli <- c('BRCA1', 'BRCA2', 'SOX2', 'MYC', 'CARNS1')
> res <- queryMany(xli, scopes='symbol', fields=c('go'), species='rat')
Finished
> unlist(unlist(res$go.BP)[names(unlist(res$go.BP)) == 'term'], use.names = F)
Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

Anyone know why this error happens? At first, I thought it was because of the scopes argument in queryMany(), but playing around with the parameters of http://mygene.info/doc/query_service.html#available-fields did not help clarify the issue.

Any suggestions would be appreciated.

P.S. I would also welcome suggestions and/or minimal working (code) examples for using other R packages to get the job done.

RNA-Seq ChIP-Seq R mygene • 2.7k views

ADD COMMENT • link updated 6.8 years ago by EagleEye 7.5k • written 6.8 years ago by Bohdan Khomtchouk ▴ 350

0

Entering edit mode

Hello Bohdan Khomtchouk!

It appears that your post has been cross-posted to another site: https://support.bioconductor.org/p/98138/

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

score 0 · Answer 1 · 2017-07-16

0

Entering edit mode

6.8 years ago

Samuel Brady ▴ 330

You may have noticed that NCBI pages have gene ontology annotations. For example, for CARNS1 you see ATP binding, etc.

If you go to that link and click "Full Report" at the top-left of this web page and then "Full Report (text)" you will see a text version of the web page, rather than html. This can be parsed in R using the following command:

carns1_ncbi <- readLines('https://www.ncbi.nlm.nih.gov/gene/57571?report=full_report&format=text')

(Ignore the warning message when you run this command, by the way.) In this case the URL is the URL obtained after clicking "Full Report (text)." Your carns1_ncbi variable now contains the entire text of the file, which you can parse for gene ontology terms.

The trick here is getting the URL, which requires getting the gene ID. The CARNS1 gene ID is 57571 (which is part of the URL), so if you have a lookup between gene symbol and gene ID you can replace 57571 in the URL above with whatever gene ID you're interested in.

Just a minor side note: I noticed you are interested in the rat species. All of your gene symbols are capitalized, which is not the convention for rat. I'm not sure if this is causing the issue you're having but just thought I would mention it.

ADD COMMENT • link 6.8 years ago by Samuel Brady ▴ 330

0

Entering edit mode

Hi Samuel, thanks for responding. Regarding capitalization, I'm just following the code convention of the prior Biostars post, but in reality my list of gene names is not capitalized (makes no difference, btw). Likewise, I'm posting a minimal working example here. In reality, my gene list can sometimes be composed of hundreds of genes, so your approach would not scale in my case, unfortunately. If it was just a few genes we were talking about, it would be feasible but not in my specific scenario... however, I would recommend keeping your answer live, as it's still potentially useful for smaller cases.

ADD REPLY • link 6.8 years ago by Bohdan Khomtchouk ▴ 350

0

Entering edit mode

Thanks Bohdan. This solution could scale up if you are able to programmatically convert your list of gene symbols into a list of gene IDs, then run a loop where you parse the NCBI web page for each gene ID, replacing the "57571" in the example code above with your gene ID. It probably wouldn't be too difficult to make a static gene ID/gene symbol lookup text file using David, Biomart, or bioDBnet.

ADD REPLY • link 6.8 years ago by Samuel Brady ▴ 330

score 0 · Answer 2 · 2017-07-16

0

Entering edit mode

6.8 years ago

EagleEye 7.5k

Have you tried GeneSCF 'prepare_database' module. It works fine for your task (non-R solution).

Example for 'Rattus norvegicus':

./prepare_database -db=GO -org=rgd

The above command downloads complete GO (BP, MF, CC separately) db as simple text file with Gene Symbols and Entrez Geneids in following location, 'geneSCF-tool/class/lib/db/rgd/'.

ADD COMMENT • link 6.8 years ago by EagleEye 7.5k

0

Entering edit mode

Great tool but I need something cross-platform and GeneSCF only works for the Linux system. Can you recommend any R solutions?