Question: Programmatically (Python) Navigate from BLASTp results to GenBank file
6 months ago, cross12tamu wrote:

I am using BioPython to run a BLASTp on some proteins of interest. For each HSP, I want to take the returned accession and then fetch the GenBank file that encodes that protein.

For a hopefully simple example: if I got the chaperone protein DnaK from E. coli K-12 MG1655 as a hit, I'd want to be able to backtrack to the gbk file for E. coli K-12 MG1655.

Many of the protein accession files do not have a clear "this links you to a GBK file" or... "this is the specific taxid"...

So, my question is: can I do what I am trying to do? Perhaps I don't understand these files as well as I need to, but I had hoped that I could run a BLASTp, look at the protein accession hits, and then parse the corresponding records for the extra information needed to navigate to the correct gbk file.

Any thoughts on my predicament? And of course, let me know if I need to provide more information or if I am not clear enough, and I will do so as promptly as possible.

And THANK YOU for spending your time helping/reading!

blast biopython python • 259 views
modified 6 months ago by massa.kassa.sc3na • written 6 months ago by cross12tamu
6 months ago, massa.kassa.sc3na wrote:


Entrez should be the way to go.

For example, the following command will return the corresponding GenBank record for protein NP_414555.1:

esearch -db protein -query NP_414555.1 | elink -db protein -target nuccore | efetch -format gb

Use protein IDs, not common names: IDs should be unique, but common names (like DnaK) are not.

BioPython has a module, Bio.Entrez, which provides an interface to Entrez; however, you'll need to construct the pipeline yourself.

If you want this for many HSPs and/or BLAST searches, I would recommend pooling only the sequence accessions first, then taking the non-redundant set of them and downloading only those (otherwise you could end up downloading the same genome many times).
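
A minimal sketch of that pooling step, assuming the hit accessions have already been pulled out of the BLAST results into one list per search. The function name and the two sample lists are hypothetical; the accessions themselves are the ones mentioned in this thread.

```python
def pool_accessions(hits_per_search):
    """Collapse per-search lists of hit accessions into one non-redundant list.

    The order of first appearance is kept so the result is reproducible.
    """
    seen = set()
    pooled = []
    for hits in hits_per_search:
        for acc in hits:
            acc = acc.strip()
            if acc and acc not in seen:
                seen.add(acc)
                pooled.append(acc)
    return pooled


# Hypothetical hit lists from two BLASTp searches.
search_a = ["NP_414555.1", "ADT73839.1", "WP_000907403.1"]
search_b = ["ADT73839.1", "NP_414555.1"]

unique_hits = pool_accessions([search_a, search_b])
print(unique_hits)
```

The same helper can be applied a second time to the nucleotide IDs after linking, since several proteins will often point at the same genome record.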

modified 6 months ago • written 6 months ago by massa.kassa.sc3na

Thank you for the feedback.

So with protein accessions (such as WP_000907403.1, EAW7713904.1, etc.) I can retrieve GenBank files?

When I try this in Python:

Entrez.efetch(db='nuccore', id=id_list[1], rettype="gbwithparts", retmode="text") where id_list[1] = ADT73839.1

it returns an HTTPError (Bad Request), which I presume is because it is not accepting the ID.

The IDs returned from the BLASTp are a wide range of accessions(?) from different databases(?).

I.e., the BLASTp results give me several different kinds of accessions...

Hopefully that helps clarify my predicament. Let me know if you (or anyone) need a sample of what I am querying, etc.

Thanks again!

written 6 months ago by cross12tamu

First things first: NCBI has many different databases. Here I've used nuccore and protein, each holding its respective records (nucleotide and protein). The nuccore db does not contain any proteins (protein accession numbers), so you need to give it a nucleotide accession number; in your case, that of the organism from which your protein originates, not the protein accession. (This is why we need elink: to link the databases.)
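
As a side note on where a nucleotide accession can come from: many protein records in GenPept format carry a /coded_by qualifier on their CDS feature that names the nucleotide record and coordinates. Not every record has one, so treat the following as a sketch under that assumption. The function names are hypothetical, and the GenPept text is assumed to have been fetched already (for instance from the protein db with rettype "gp").

```python
import re


def nucleotide_accession_from_coded_by(coded_by):
    """Pull the nucleotide accession.version out of a /coded_by value.

    Handles plain, complement(...) and join(...) locations; returns None
    when no accession.version followed by ':' is found.
    """
    match = re.search(r"((?:[A-Z]{2}_)?[A-Z]{0,6}\d+\.\d+):", coded_by)
    return match.group(1) if match else None


def coded_by_values(genpept_text):
    """Return all /coded_by qualifier strings found in GenPept flat-file text."""
    from io import StringIO
    from Bio import SeqIO  # Biopython, imported lazily: only this function needs it

    values = []
    for record in SeqIO.parse(StringIO(genpept_text), "genbank"):
        for feature in record.features:
            if feature.type == "CDS":
                values.extend(feature.qualifiers.get("coded_by", []))
    return values


# Made-up coordinates, for illustration only.
print(nucleotide_accession_from_coded_by("complement(CP002185.1:100..2016)"))
```

The accession recovered this way is a nucleotide accession, so it is the kind of ID that nuccore accepts.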

The posted command pipeline works for ADT73839.1 and retrieves CP002185.1 (E. coli).

What you need to do is read up on the BioPython documentation and replicate the pipeline with BioPython.

That is:

  1. create the esearch request, process it, and get the UIDs for the "protein" db

  2. create the elink request, process it, and get the UIDs for the "nuccore" db

  3. create the efetch request and actually download the data
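
The three steps above can be sketched with Bio.Entrez roughly as follows. This is an illustration rather than a tested pipeline: the function names and the email argument are hypothetical, rettype "gb" follows the suggestion in this thread, and a protein may link to zero or several nucleotide records, so the result should be checked.

```python
def extract_linked_ids(elink_result):
    """Collect the target UIDs from a parsed elink result (a list of link sets)."""
    ids = []
    for linkset in elink_result:
        for linkset_db in linkset.get("LinkSetDb", []):
            for link in linkset_db.get("Link", []):
                if link["Id"] not in ids:
                    ids.append(link["Id"])
    return ids


def protein_to_genbank(accession, email):
    """Return GenBank text for the nucleotide record(s) linked to a protein accession."""
    from Bio import Entrez  # Biopython, imported lazily: only this function needs it

    Entrez.email = email  # NCBI asks for a contact address with each request

    # 1. esearch: protein accession -> UIDs in the "protein" db
    handle = Entrez.esearch(db="protein", term=accession)
    protein_uids = Entrez.read(handle)["IdList"]
    handle.close()
    if not protein_uids:
        return ""

    # 2. elink: "protein" UIDs -> UIDs in the "nuccore" db
    handle = Entrez.elink(dbfrom="protein", db="nuccore", id=",".join(protein_uids))
    nuccore_uids = extract_linked_ids(Entrez.read(handle))
    handle.close()
    if not nuccore_uids:
        return ""

    # 3. efetch: download the GenBank flat file(s)
    handle = Entrez.efetch(db="nuccore", id=",".join(nuccore_uids),
                           rettype="gb", retmode="text")
    text = handle.read()
    handle.close()
    return text


# Example call (needs network access and Biopython), left commented out:
# gbk_text = protein_to_genbank("ADT73839.1", "you@example.org")
```

For many accessions it is probably better to run steps 1 and 2 for all of them first, de-duplicate the nuccore UIDs, and only then call efetch, so that each genome is downloaded once.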

There are some alternatives:

P.S.: I'm not sure that there is a gbwithparts format. Try something simple, e.g. gb; it might be what you want, and there is a possibility that gbwithparts would interfere with your query construction.

modified 6 months ago • written 6 months ago by massa.kassa.sc3na

I was able to construct a pipeline in BioPython using all of your recommendations! Thank you so much for your replies and the time you spent!


written 6 months ago by cross12tamu