I've got a large number of gene IDs from VectorBase (e.g. AAEL006343-PA and AAEL001710-PA). Some of these IDs have records in the NCBI Protein database.
I'm trying to use Biopython to get the gene description and other info from NCBI using the following code (for simplicity I've used a single ID here, but I would normally pass a list):
from Bio import Entrez
Entrez.email = "emailhere"  # placeholder: put your own email address here
handle = Entrez.efetch(db="protein", id="AAEL006343-PA", rettype="gb", retmode="text")
records = handle.read()  # GenBank flat text; Entrez.read() only parses XML
efetch fails with an HTTP error (Bad Request). I know the data exists, because with the ID 108877864 I get the result I want. However, 108877864 is NCBI's own ID for this protein. The only way I've found to convert AAEL006343-PA to 108877864 is via esearch, but I don't want to spam NCBI with hundreds of esearch queries.
Is there a way to do this ID conversion as a batch and without esearch?
You would not be spamming NCBI as long as you sign up for an NCBI API key and build an appropriate delay into your queries.
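A minimal sketch of such a throttled loop, assuming Biopython is installed and that Entrez.email and Entrez.api_key have already been set. The helper names (entrez_search, lookup_uids) and the 0.15 s default delay are illustrative choices, not anything NCBI prescribes; check NCBI's current usage guidelines for the rate that applies to you.

```python
import time


def entrez_search(term, db="protein"):
    """Run one esearch and return the matching NCBI UIDs as strings."""
    # Imported here so lookup_uids() can be tried with a stand-in search function.
    from Bio import Entrez

    handle = Entrez.esearch(db=db, term=term)
    result = Entrez.read(handle)
    handle.close()
    return list(result["IdList"])


def lookup_uids(ids, search=entrez_search, delay=0.15):
    """Map each external ID to the list of UIDs that a search for it returns.

    `delay` (seconds) is an assumed pause between requests; tune it to your limits.
    """
    mapping = {}
    for external_id in ids:
        mapping[external_id] = search(external_id)
        time.sleep(delay)
    return mapping
```

Calling lookup_uids(["AAEL006343-PA", "AAEL001710-PA"]) would then return a dictionary keyed by the VectorBase IDs; IDs with no NCBI record map to an empty list.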
I can do that and loop over 1000+ search calls, but surely there is a better and also quicker way to do this?
Perhaps you could download one of the annotation files from Vector Base and grep the info you need from it?
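As a rough sketch of what that could look like in Python rather than grep, assuming the annotation file is tab-delimited GFF3 with `key=value` pairs in the ninth column. The function names are hypothetical, and which attribute keys (if any) carry a description depends on the file you download.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def attributes_by_id(lines, wanted_ids):
    """Return {feature ID: attribute dict} for features whose ID is wanted."""
    wanted = set(wanted_ids)
    found = {}
    for line in lines:
        # Skip comments, directives, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a standard nine-column feature line
        attrs = parse_gff3_attributes(columns[8])
        if attrs.get("ID") in wanted:
            found[attrs["ID"]] = attrs
    return found
```

Since an open file is an iterable of lines, it can be passed directly as `lines`.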
The IDs I'm using come from the VectorBase basefeatures GFF. I've had another look at their website, but I can't find a file that provides an actual description of each gene, apart from GO terms for some features.
You can combine esearch and efetch using the "history server"; see the documentation for esearch and efetch. In Perl, the request is built with
WebEnv=<webenv string>&usehistory=y
so the Python equivalent must be similar. As I remember, esearch returns the WebEnv string, which you then need to use in efetch.
Edit - Genomax's solution seems much simpler for a one-off execution.
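For reference, the history-server approach might look roughly like this in Biopython. This is a sketch under assumptions: joining the IDs with OR into one search term is my own guess at how to batch them and has not been verified against these VectorBase IDs, and the function names and page size are illustrative.

```python
def build_term(ids):
    """Join several IDs into a single esearch term (assumed to work with OR)."""
    return " OR ".join(ids)


def page_starts(count, page_size):
    """Return the retstart offsets needed to page through `count` records."""
    return list(range(0, count, page_size))


def fetch_via_history(ids, email, db="protein", page_size=200):
    """Search once with usehistory="y", then efetch the stored result set in pages."""
    from Bio import Entrez  # requires Biopython

    Entrez.email = email
    handle = Entrez.esearch(db=db, term=build_term(ids), usehistory="y")
    search = Entrez.read(handle)
    handle.close()

    chunks = []
    for start in page_starts(int(search["Count"]), page_size):
        handle = Entrez.efetch(
            db=db,
            rettype="gb",
            retmode="text",
            retstart=start,
            retmax=page_size,
            webenv=search["WebEnv"],       # session string returned by esearch
            query_key=search["QueryKey"],  # identifies the stored result set
        )
        chunks.append(handle.read())
        handle.close()
    return "".join(chunks)
```

The returned text is a run of GenBank records, which Bio.SeqIO can parse if you wrap it in io.StringIO. Note that an OR-joined search does not tell you which input ID produced which record, so the mapping has to be recovered from the records themselves.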