Since you want to automate this, you'll have to use the NCBI E-utilities (the NCBI will blacklist users who script against the Entrez web interface). Fortunately, since you are using Biopython's Bio.Entrez (see "Chapter 8 Accessing NCBI’s Entrez databases" in the "Biopython Tutorial and Cookbook"), this is already taken care of.
To bulk fetch entries whose UIDs (GI numbers in this case) you don't know, you first have to get the UIDs. For this, use ESearch, specifying the 'protein' database and a query that finds the required entries. For example, to find the Homo sapiens (NCBI TaxId=9606) entries from RefSeq, the query is:
refseq[filter] AND txid9606[Organism]
This gives a result structure which contains the UIDs. EFetch takes a comma-separated list of UIDs, so extract the UIDs, construct the list, and then feed it to EFetch, specifying the required format, to get the data.
The following example Python script uses Biopython to fetch the proteins from Bovine papillomavirus 7 (NCBI TaxId=1001533) present in RefSeq in FASTA sequence format:
from Bio import Entrez
entrezDbName = 'protein'
ncbiTaxId = '1001533' # Bovine papillomavirus 7
Entrez.email = 'firstname.lastname@example.org'
# Find entries matching the query
entrezQuery = "refseq[filter] AND txid%s[Organism]" % ncbiTaxId
searchResultHandle = Entrez.esearch(db=entrezDbName, term=entrezQuery)
searchResult = Entrez.read(searchResultHandle)
searchResultHandle.close()
# Get the data as FASTA text.
uidList = ','.join(searchResult['IdList'])
entryData = Entrez.efetch(db=entrezDbName, id=uidList, rettype='fasta', retmode='text').read()
In this case the result is small, only 7 proteins, so fetching everything in a single step is reasonable. For taxa with larger numbers of entries, you will want to retrieve the entry data in chunks rather than in one go, to avoid time-outs, to limit the load on the NCBI's servers, and to allow for checkpoints and retries in your own code. See "8.15 Using the history and WebEnv" for details of how to use the history capabilities of E-utilities from Biopython to simplify this process.
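As a rough illustration of the chunked approach, here is a minimal sketch that searches once with usehistory="y" and then pages through the hits with retstart/retmax. The function names (batch_offsets, fetch_in_batches) and the batch size of 200 are my own choices for the example, not anything from the tutorial, and the Entrez module is passed in as an argument purely so the paging logic can be tried without network access. Note that ESearch only returns 20 UIDs by default, which is another reason to read the total from 'Count' and fetch via the history rather than via 'IdList' for larger taxa.

```python
def batch_offsets(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def fetch_in_batches(entrez, db, term, rettype="fasta", batch_size=200):
    """Search once using the history server, then fetch the hits in chunks.

    `entrez` is assumed to be the Bio.Entrez module (or any object offering
    the same esearch/read/efetch functions).
    """
    handle = entrez.esearch(db=db, term=term, usehistory="y")
    result = entrez.read(handle)
    handle.close()

    total = int(result["Count"])
    chunks = []
    for retstart, retmax in batch_offsets(total, batch_size):
        # Refer to the stored search via WebEnv/QueryKey instead of UIDs.
        handle = entrez.efetch(
            db=db,
            rettype=rettype,
            retmode="text",
            retstart=retstart,
            retmax=retmax,
            webenv=result["WebEnv"],
            query_key=result["QueryKey"],
        )
        chunks.append(handle.read())
        handle.close()
    return "".join(chunks)


# Example use (needs Biopython and network access):
# from Bio import Entrez
# Entrez.email = "firstname.lastname@example.org"
# fasta = fetch_in_batches(Entrez, "protein",
#                          "refseq[filter] AND txid1001533[Organism]")
```

For real use you would probably want to write each chunk to disk as it arrives and wrap the efetch call in a retry, so that a failure part-way through does not lose the earlier batches.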
Alternatively there are many other resources which provide the RefSeq data, and provide combined query and fetch capabilities. For example:
RefSeq is available from various public SRS servers (see Public SRS Installations). The EMBL-EBI's Linking to SRS guide documents how to use SRS via URLs, including using URLs as an API to SRS. Using SRS@EBI, the example above could be replaced with a call to the URL: http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-noSession+-view+FastaSeqs+[REFSEQP-NCBI_TaxId:1001533]
RefSeq is available on the main MRS server, and may be available on other MRS servers.
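For the SRS route, the URL quoted above can be built for any taxon with a one-line helper. This is only a sketch: the function name srs_fasta_url is hypothetical, and the URL pattern is simply copied from the example above with the TaxId substituted, so whether a given SRS server accepts it is an assumption.

```python
def srs_fasta_url(tax_id, base="http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"):
    """Build a wgetz URL requesting FASTA for RefSeq proteins of one taxon.

    The query pattern mirrors the SRS@EBI example URL; `base` can be pointed
    at another SRS installation.
    """
    return "%s?-noSession+-view+FastaSeqs+[REFSEQP-NCBI_TaxId:%s]" % (base, tax_id)


# Example use (needs network access and a reachable SRS server):
# from urllib.request import urlopen
# fasta = urlopen(srs_fasta_url(1001533)).read().decode()
```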
For an overview of using Python with web services see the Python section of the EMBL-EBI's "Introduction to Web Services". This includes links to the main documentation for the various tool-kits and tutorials for the most commonly used.
Written 8.6 years ago by Hamish • 3.1k; modified 12 months ago by RamRS ♦ 30k