Programmatically (Python) Navigate from BLASTp results to GenBank file
4.4 years ago
cross12tamu ▴ 10

I am using BioPython to run a BLASTp on some proteins of interest. From the HSPs, I want to take the returned accessions and then fetch the corresponding GenBank file for the sequence each protein is encoded on.

For a hopefully simple example: if I got the chaperone protein DnaK for E. coli K12MG1655 as a returned protein, I'd want to be able to track back to the gbk file for E. coli K12MG1655.

Many of the protein accession records do not have a clear "this links you to a GBK file" field, or even a "this is the specific taxid" field...

So, my question is: can I do what I am trying to do? Perhaps I don't quite understand these files as well as I need to, but I had hoped that I could run a BLASTp, see the protein accession hits, and then take those values and parse their records to retrieve the extra information needed to navigate to the correct gbk file.

Any thoughts on my predicament? And of course, let me know if I need to provide more information or if I am not clear enough; I will do so as promptly as possible.

And THANK YOU for spending your time helping/reading!

BLAST BioPython Python
4.4 years ago

Hi,

Entrez Direct should be the way to go (https://www.ncbi.nlm.nih.gov/books/NBK179288/).

For example, the following command will return the corresponding GenBank record for protein NP_414555.1:

esearch -db protein -query NP_414555.1 | elink -db protein -target nuccore | efetch -format gb

Use protein IDs, not their common names, as IDs should be unique while common names (like DnaK) are not.

In BioPython there is a module, Bio.Entrez, which provides an interface to Entrez; however, you'll need to construct the pipeline yourself.

If you want this for many HSPs and/or BLAST searches, I would recommend pooling the sequence accessions first, then taking the non-redundant set of them and downloading only those (otherwise you could end up downloading one genome many times).
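
A rough sketch of that pooling step (untested; the file name is a placeholder, and it assumes you saved the BLASTp output in XML format):

    from Bio.Blast import NCBIXML

    # placeholder path: wherever you saved your BLASTp output (XML / outfmt 5)
    with open("blastp_results.xml") as blast_handle:
        hit_accessions = set()                 # a set keeps the pool non-redundant
        for record in NCBIXML.parse(blast_handle):
            for alignment in record.alignments:
                hit_accessions.add(alignment.accession)

    # each accession in hit_accessions is then sent once through the
    # esearch -> elink -> efetch chain shown above
    print(sorted(hit_accessions))

Note that several different proteins can still come from the same genome, so after the elink step you may also want to de-duplicate the nuccore IDs before fetching.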


Thank you for the feedback.

So, with protein accessions (such as WP_000907403.1, EAW7713904.1, etc.), I can use them to retrieve GenBank files?

When I try this in Python:

    Entrez.efetch(db='nuccore', id=id_list[1], rettype="gbwithparts", retmode="text")

where id_list[1] is ADT73839.1, it returns an HTTPError (Bad Request), which I presume is because it is not accepting the ID.

The IDs returned from the BLASTp cover a wide range of accessions(?) from different databases(?).

I.e., the BLASTp results give me accessions in several different formats...

Hopefully that adds some clarity to my predicament. Let me know if you (or anyone) need a sample of what I am querying, etc.

Thanks again!


First things first: NCBI has many different databases; here I've used nuccore and protein, each holding the respective record type (nucleotide and protein records). The nuccore db does not contain any proteins (or protein accession numbers), so you need to give it a nucleotide accession number - in your case, that of the organism from which your protein originates, not the protein accession. (This is why we need elink: to link the two databases.)

The posted command pipeline works for ADT73839.1 and retrieves CP002185.1 (E. coli).

What you need to do is read up on the BioPython documentation and replicate the pipeline with Bio.Entrez.

That is (a rough sketch follows after the list):

  1. create an esearch request, process it, and get the UIDs from the "protein" db

  2. create an elink request, process it, and get the UIDs for the "nuccore" db

  3. create an efetch request and actually download the data
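
Something along these lines should work (a rough sketch only, untested; the e-mail address is a placeholder and you may want to filter the link sets differently):

    from Bio import Entrez

    Entrez.email = "you@example.com"     # NCBI asks for a contact address (placeholder)

    protein_acc = "ADT73839.1"           # one of the accessions from your BLASTp hits

    # 1. esearch: resolve the protein accession to a protein-db UID
    handle = Entrez.esearch(db="protein", term=protein_acc)
    protein_uids = Entrez.read(handle)["IdList"]
    handle.close()

    # 2. elink: link the protein UID(s) to the nuccore database
    handle = Entrez.elink(dbfrom="protein", db="nuccore", id=protein_uids)
    linksets = Entrez.read(handle)
    handle.close()
    # elink can return several link sets (e.g. protein_nuccore,
    # protein_nuccore_mrna, ...); filter on linkset["LinkName"] if needed
    nuccore_uids = [link["Id"]
                    for linkset in linksets[0].get("LinkSetDb", [])
                    for link in linkset["Link"]]

    # 3. efetch: download the GenBank record(s) for the linked nucleotide entries
    handle = Entrez.efetch(db="nuccore", id=",".join(nuccore_uids),
                           rettype="gb", retmode="text")
    genbank_text = handle.read()
    handle.close()

This uses the plain gb rettype; see also the postscript below.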

There are some alternatives as well.

P.S.: I'm not sure that the gbwithparts format exists; try something simple, e.g. gb. It might be what you want, and there is a possibility that gbwithparts is interfering with your query construction.


I was able to construct a pipeline with all of your recommendations using BioPython! Thank you so much for your replies and the time spent!

THANKS!

