Question

Annotating gene symbols with BioPython

8.8 years ago

bojingjia ▴ 10

I have a list of gene symbols, rather than gene IDs, and I would like to use Entrez to annotate my genes (i.e., grab the gene summary for each). I've seen example code that annotates genes by gene ID, and I figured it would be easy to search by gene symbol instead, but I'm having a lot of trouble debugging my code. Can anyone point me in the right direction?

Thanks in advance!

EDIT: I posted my code below. I think this is a really slow and dumb way of achieving what I want. The annotation currently prints out OK, and I'm still working on writing it to an Excel spreadsheet. I've come across the esummary function in Entrez, and I was wondering whether that might be a faster method?

annotation gene Entrez BioPython • 3.1k views

ADD COMMENT • link updated 17 months ago by Ram 43k • written 8.8 years ago by bojingjia ▴ 10

Please post your code.

ADD REPLY • link 8.8 years ago by steven ▴ 70

import sys

from Bio import Entrez
import xlrd


# *Always* tell NCBI who you are
Entrez.email = "john.doe@mail.com"


def retrieve_annotation(id_list):
    """Annotates Entrez Gene IDs using Bio.Entrez, in particular epost (to
    submit the data to NCBI) and esummary to retrieve the information.
    Returns a list of dictionaries with the annotations."""

    # Upload the Gene IDs (not the symbols) to the NCBI history server
    request = Entrez.epost("gene", id=",".join(id_list))
    try:
        result = Entrez.read(request)
    except RuntimeError as e:
        # FIXME: How to generate NAs instead of causing an error with invalid IDs?
        print("An error occurred while retrieving the annotations.")
        print("The error returned was %s" % e)
        sys.exit(-1)

    webEnv = result["WebEnv"]
    queryKey = result["QueryKey"]
    data = Entrez.esummary(db="gene", webenv=webEnv, query_key=queryKey)
    annotations = Entrez.read(data)

    print("Retrieved %d annotations for %d genes" % (len(annotations),
                                                     len(id_list)))
    return annotations


# Read data from Excel spreadsheet
# (note: xlrd 2.0 and later read only .xls files, not .xlsx)
wb = xlrd.open_workbook('C:/Users/user/geneSymbolsTest.xlsx')
sh = wb.sheet_by_index(0)
colA = sh.col_values(0)
colA.pop(0)  # drop the first row (presumably the column header)

# Convert entries to strings
symbol_list = [str(x) for x in colA]

# Search for the Gene ID of each symbol, then find annotation
id_list = []
for symbol in symbol_list:
    sterm = symbol + '[sym] AND "Mus musculus"[orgn]'
    handle = Entrez.esearch(db="gene", retmode="xml", term=sterm)
    record = Entrez.read(handle)
    handle.close()
    id_hits = record["IdList"]
    if not id_hits:
        # No gene matched this symbol; skip it instead of raising IndexError
        print("No Gene ID found for %s" % symbol)
        continue
    id_list.append(str(id_hits[0]))

annotation = retrieve_annotation(id_list)
print(type(annotation))
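
On the FIXME about invalid entries: one possible approach is to record "NA" for any symbol whose search returns no IDs, so that a bad symbol never reaches epost. The following is only a minimal sketch under that assumption. The helper names (`build_term`, `first_id_or_na`, `symbols_to_ids`, `entrez_search`) are hypothetical and not part of Biopython; only `Entrez.esearch` and `Entrez.read` are real calls. The search step is passed in as a function so the NA logic can be tried without network access. Note that the loop still makes one esearch request per symbol, which is probably where most of the time goes.

```python
def build_term(symbol, organism="Mus musculus"):
    """Build the Entrez Gene query for one symbol (same form as the loop above)."""
    return '%s[sym] AND "%s"[orgn]' % (symbol, organism)


def first_id_or_na(record, na="NA"):
    """Return the first Gene ID of a parsed esearch record, or `na` if it is empty."""
    id_hits = record.get("IdList", [])
    return str(id_hits[0]) if id_hits else na


def entrez_search(term):
    """Run one esearch against the gene database (needs Biopython and network).

    Entrez.email should be set before calling this.
    """
    # Imported here so the pure helpers in this sketch also work offline.
    from Bio import Entrez
    handle = Entrez.esearch(db="gene", term=term)
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


def symbols_to_ids(symbols, search=entrez_search):
    """Map each symbol to its first Gene ID, or "NA" when nothing matched.

    `search` is any callable that takes a query string and returns a
    dict-like record with an "IdList" key (hypothetical hook for testing).
    """
    return {symbol: first_id_or_na(search(build_term(symbol)))
            for symbol in symbols}
```

Only the entries that are not "NA" would then be passed on to `retrieve_annotation`, for example `[gid for gid in symbols_to_ids(symbol_list).values() if gid != "NA"]`.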

ADD REPLY • link 8.8 years ago by bojingjia ▴ 10
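
For the spreadsheet step mentioned in the question, a CSV file written with the standard `csv` module opens directly in Excel and avoids an extra dependency. This is a sketch, not a tested pipeline: it assumes each annotation is a dict-like object, and the field names `Name`, `Description` and `Summary` are an assumption about what the gene esummary records contain, so they may need adjusting. `write_annotations` is a hypothetical helper name, and the demo rows are made-up placeholders rather than real NCBI output.

```python
import csv
import io


def write_annotations(rows, out_file, fields=("Name", "Description", "Summary")):
    """Write one CSV line per annotation, using "NA" for missing fields.

    Keys that are not listed in `fields` are ignored.
    """
    writer = csv.DictWriter(out_file, fieldnames=list(fields),
                            restval="NA", extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Demo with placeholder rows, written to an in-memory buffer.
demo_rows = [
    {"Name": "GeneA", "Description": "desc", "Summary": "text", "Extra": 1},
    {"Name": "GeneB"},  # missing fields are filled with "NA"
]
buf = io.StringIO()
write_annotations(demo_rows, buf)
print(buf.getvalue())
```

To write a real file, the buffer can be replaced with `open("annotations.csv", "w", newline="")`, where the file name is again just an example.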