Question: How to use Biopython to parse BLAST output and get gene symbol from NCBI?
eke wrote (8.1 years ago):

Hello all. I am very, very new to Python/Biopython and am currently stuck.

I am using standalone BLAST via Bash. I have about 40k non-human sequences which I am blasting against the human genome, with the output in XML format. The output includes GI and RefSeq accession numbers. I believe it is possible to query NCBI for various bits of information, among them official gene symbols and Entrez IDs. What I would like to do is, for each record in the BLAST output, use the hit ID (accession number) to query NCBI for the official gene symbol and/or Entrez ID. I want my output to be the original BLAST query, the hit ID, and the gene symbol and Entrez ID per record.

I have been delving into the various Biopython resources. So far I have parsed my BLAST output in Bash (I am very new to Biopython and have not yet committed the time to parsing it in Biopython instead) to grab only the query and hit ID per record, but I do not know how to use these to query NCBI for gene symbols or Entrez IDs. Any help in the right direction would be appreciated.

Tags: ncbi, biopython, blast

I suggest you start by looking over the sections on the BLAST XML parser and the NCBI Entrez web interface in the Biopython tutorial and come back with some specific questions.

(Comment by Peter, 8.1 years ago)
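
As a starting point for the BLAST XML parsing step mentioned in the comment above, here is a minimal sketch that pulls the query and hit identifiers out of each record. It assumes Biopython's `Bio.Blast.NCBIXML` parser is available; the file name `my_blast.xml` and the helper names `query_hit_pairs` and `write_table` are made up for illustration.

```python
import csv
import sys


def query_hit_pairs(blast_records):
    """Yield (query, hit_id, accession) for every hit of every BLAST record.

    Works on any iterable of objects shaped like the records returned by
    Bio.Blast.NCBIXML.parse (a .query string and a list of .alignments).
    """
    for record in blast_records:
        for alignment in record.alignments:
            yield record.query, alignment.hit_id, alignment.accession


def write_table(xml_path, out=sys.stdout):
    """Write one tab-separated line per query/hit pair found in xml_path."""
    # Imported here so the helper above can be used without Biopython installed.
    from Bio.Blast import NCBIXML

    writer = csv.writer(out, delimiter="\t")
    with open(xml_path) as handle:
        for row in query_hit_pairs(NCBIXML.parse(handle)):
            writer.writerow(row)
```

Calling `write_table("my_blast.xml")` would then print the query, hit ID, and accession for each hit. Queries without any hit produce no line.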
Whetting (Bethesda, MD) wrote (8.1 years ago):

Hi, I wrote this long ago to do something similar. The code is very verbose, but it is relatively easy to follow (I hope). Given an accession number (which you say you have), this snippet goes to GenBank, downloads the matching GenBank record, and prints it to a temporary file. Using SeqIO you can then open that file and get all the information you may need (in this case I am getting the description line and the sequence).

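    from Bio import SeqIO
    # genbankdownload is a separate script by Simon J. Greenhill (see the EDIT below)
    from genbankdownload import get_accession

    def download(accession_number):
        # Fetch the GenBank record as text and save it to a temporary file
        genbank_text = get_accession(accession_number, 'nucleotide', 'gb')
        with open('file.temp', 'w') as temp:
            print(genbank_text, file=temp)
        # Parse the saved record with SeqIO
        record = SeqIO.read('file.temp', 'gb')
        # Overwrite the temporary file with the record in FASTA style
        with open('file.temp', 'w') as temp:
            print(">" + record.description, file=temp)
            print(record.seq, file=temp)
    # To use it, call download() with an accession number. You can place the call in a loop if you want...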

EDIT: My sincere apologies. As I mentioned, this code is old and was part of a bigger project. You will need to get the genbankdownload module from here: This Python script was developed by Simon J. Greenhill, so I do not want to take credit for that part!
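
For the gene symbol and Entrez ID part of the question, which the snippet above does not cover, one possible route is Biopython's `Bio.Entrez` module: link the nucleotide accession to the Gene database with `elink`, then read the symbol from `esummary`. The sketch below is hedged in several ways. It assumes the hit accession is a transcript-level RefSeq record (a chromosome or contig accession would link to many genes), it assumes the parsed `esummary` result for the Gene database exposes the symbol under `DocumentSummarySet` / `DocumentSummary` / `Name`, and the helper names and the e-mail address are placeholders.

```python
def gene_ids_from_elink(linksets):
    """Collect Gene IDs from a parsed elink result (a list of link sets)."""
    ids = []
    for linkset in linksets:
        for linkdb in linkset.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in linkdb["Link"])
    return ids


def first_symbol(summary):
    """Return the gene symbol of the first document in a parsed esummary result.

    Assumption: the result layout is
    summary["DocumentSummarySet"]["DocumentSummary"][i]["Name"].
    """
    docs = summary["DocumentSummarySet"]["DocumentSummary"]
    return docs[0]["Name"] if docs else None


def lookup_gene(accession, email="you@example.org"):
    """Return (entrez_gene_id, symbol) for a nucleotide accession, or None."""
    # Imported here because it needs Biopython; the calls below need network access.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address with each request
    handle = Entrez.elink(dbfrom="nuccore", db="gene", id=accession)
    linksets = Entrez.read(handle)
    handle.close()

    gene_ids = gene_ids_from_elink(linksets)
    if not gene_ids:
        return None  # no gene linked to this accession

    handle = Entrez.esummary(db="gene", id=gene_ids[0])
    summary = Entrez.read(handle)
    handle.close()
    return gene_ids[0], first_symbol(summary)
```

With about 40k hits it would be worth caching results per accession and pausing between requests, since NCBI limits how many requests per second a client may send.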
