Question

How To Use Biopython To Parse Blast Output And Get Gene Symbol From Ncbi?

1

Entering edit mode

11.6 years ago

eke ▴ 10

Hello all. I am very, very new to Python/Biopython and am currently stuck.

I am using standalone BLAST via Bash. I have about 40k non-human sequences which I am blasting against the human genome, outputting as XML format. Included in the output are GI and RefSeq accession numbers. I believe that it is possible to query NCBI for various bits of information, among those official gene symbols or Entrez ID's. What I would like to do is for each record in the BLAST output, utilize the hit id (accession number) to query NCBI for the official gene symbol and/or Entrez ID. I want my output to be the original BLAST query, the hit id, and the gene symbol and Entrez ID per record.

I have been delving into the various Biopython resources and have managed to parse my BLAST output (using Bash; I am very new to Biopython and just haven't yet committed the time to parsing it in Biopython instead of Bash) to grab only the query and hit id per record, but I do not know how to convert/use this in terms of querying NCBI for gene symbols/Entrez IDs. Any help in the right direction would be appreciated.

biopython blast ncbi • 7.7k views

ADD COMMENT • link updated 11.6 years ago by Whetting ★ 1.6k • written 11.6 years ago by eke ▴ 10

3

Entering edit mode

I suggest you start by looking over the sections on the BLAST XML parser and the NCBI Entrez web interface in the Biopython tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html and come back with some specific questions.

ADD REPLY • link 11.6 years ago by Peter 6.0k

score 3 · Answer 1 · 2012-09-04

Hi, I wrote this long ago to do something similar. The code is very verbose, but it is relatively easy to follow (I hope). This snippet of code will go into GenBank and based on the accession number (which you say you have) it will download the GenBank record and print it to a temporary file. Using SeqIO you can open the file and get all the information you may need (in this case I am getting the name and the sequence)

    from genbankdownload import get_accession

    def download(accession_number):
        temp = open('file.temp','w')
        file = get_accession(accession_number, 'nucleotide','gb')
        print >> temp, file
        temp.close()
        record = SeqIO.read("file.temp",'gb')
        temp = open('file.temp','w')
        print >> temp, ">"+record.description
        print >> temp, record.seq
        temp.close()
    #to use just use this command. You can place it in a loop if you want...
    download("JN639839.1")

EDIT: my sincere apologies, like I mentioned this code is old, and was part of a bigger project. You will need to get genbankdownload.py from here: http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CDUQFjAD&url=http%3A%2F%2Fiysik.com%2Fuploads%2Fbioinformatics%2Fgenbankdownload.py&ei=E7RHUKn2NdG60QGSi4GYCw&usg=AFQjCNGbuF-frCqpQg3NQ0MkABHds_kuVg This python script was developed by Simon J. Greenhill. So I do not want to take credit for that part!!