8.6 years ago
jolespin ▴ 150

I want to figure out how to use Python to get all genes associated with a GO term. I tried the BioMart Python API, but it's really weird, and "Retrieve All Genes Associated With A Go Term" is for R. I'm trying to use mygene, but GO is available only in the fields (output terms), not in the scopes (the input term); see "Gene Id Conversion Tool". I need to do this for ~6000 GO pathways. I got it to work using bioservices db2db, but I get timed out: even with a 5-second pause between searches I still hit the timeout. (If anyone wants to know how to use this... really useful.)

Does anyone know a tool in Python I can use to do this that won't time me out?

biomartian -d rnorvegicus_gene_ensembl  -i external_gene_name -o go_id | shuf -n 10
Lpcat1  GO:0005509
Klb GO:0005975
LOC498555   GO:0003735
Map3k12 GO:0046777
Hoxb1   GO:0045944
Cir1    GO:0006397
Rhoc    GO:0005525
Casr    GO:0060613
Cib1    GO:1900026
Onecut1 GO:0002064
7.8 years ago
Newgene ▴ 370

You can use the mygene Python module to query GO terms for matching genes:

import mygene
mg = mygene.MyGeneInfo()

To query just one GO term:

mg.query('GO:0023026', size=1000)

By default, it returns the first ten matched genes; to get more, set a higher size, such as 1000. Note that some GO terms have a very large list of associated genes, and you probably don't want to retrieve them all (not that useful anyway), so capping the returned gene list at 1000 is a reasonable setting (it also helps avoid timeouts).

With mygene, you can also query multiple GO terms in a batch:

mg.querymany(["GO:0023026", "GO:0002503"], scopes='go', size=1000)

In the returned result, each gene hit contains a "query" attribute whose value is the corresponding GO term.
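Because every hit carries its GO term in "query", the flat list can be regrouped into one gene list per term. The sketch below is only an illustration: group_hits_by_term is a made-up helper name, the sample records are hand-written rather than real query output, and it assumes that unmatched terms come back flagged with "notfound" and that gene hits expose "symbol" and "_id" keys.

```python
from collections import defaultdict


def group_hits_by_term(hits):
    """Map each GO term to the labels of the genes returned for it."""
    genes_by_term = defaultdict(list)
    for hit in hits:
        # Assumption: terms without any match are reported with "notfound".
        if hit.get("notfound"):
            continue
        # Prefer the gene symbol; fall back to the gene id if it is missing.
        label = hit.get("symbol", hit.get("_id"))
        genes_by_term[hit["query"]].append(label)
    return dict(genes_by_term)


# Hand-written records shaped like batch-query hits (not real data).
example_hits = [
    {"query": "GO:0023026", "_id": "id1", "symbol": "GENE_A"},
    {"query": "GO:0023026", "_id": "id2", "symbol": "GENE_B"},
    {"query": "GO:0002503", "_id": "id3"},
    {"query": "GO:9999999", "notfound": True},
]
grouped = group_hits_by_term(example_hits)
```

Terms with no match are simply left out of the resulting dictionary; add them back with an empty list if you need to see which ones returned nothing.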

You might also want to limit the number of GO terms per batch so that you don't overload the server.
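For a long list such as ~6000 GO terms, the batching advice could be sketched roughly like this. The helper names (chunked, query_in_batches, mygene_batch), the batch size of 100 and the one-second pause are illustrative assumptions, not recommended values; the querymany call itself is copied from the example above.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def query_in_batches(go_terms, run_batch, batch_size=100, pause=1.0):
    """Call `run_batch` on each chunk of GO terms and collect all hits."""
    batches = list(chunked(go_terms, batch_size))
    hits = []
    for index, batch in enumerate(batches):
        hits.extend(run_batch(batch))
        # Pause between requests, but not after the final batch.
        if index < len(batches) - 1:
            time.sleep(pause)
    return hits


def mygene_batch(batch):
    """Run one batch through mygene, using the same call as shown above."""
    import mygene  # imported here so the helpers above work without it

    mg = mygene.MyGeneInfo()
    return mg.querymany(batch, scopes='go', size=1000)
```

With a list of GO ids in go_terms, calling query_in_batches(go_terms, mygene_batch) would return one flat list of hits. Passing the query function as an argument also makes it easy to swap in a stub while testing.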


