Question: Refseq Proteins For A Given Taxid
8.6 years ago by
Chris1.6k wrote:


I've got the following problem:

Given an NCBI taxid, I'd like to bulk-download all RefSeq protein sequences for that species. The ftp server seems to provide fasta files for select species such as human. However, the majority seem to be concatenated in huge fasta files organised by vertebrate, invertebrate, ... Of course I could download all of those and parse them for taxid, but given the vast size this seems infeasible to me.

Now, at least I'm able to do that using Entrez via the NCBI homepage. However, I need to do this programmatically, preferably using Python/BioPython.

I've already found a way to retrieve _single_ sequences using a RefSeq accession, but it is very slow when iterating over 1000s of accessions:

from Bio import Entrez
# acc: some RefSeq accession, ver: its version
rec = Entrez.read(Entrez.esearch(db="protein", term="%s.%s" % (acc, ver)))
fasta = Entrez.efetch(db="protein", id=rec["IdList"][0], rettype="fasta").read()

Is there a similar way to bulk-retrieve all sequences for a given taxid?

Thanks, Chris

refseq protein biopython download • 8.4k views
8.6 years ago by
Hamish3.1k wrote:

Since you want to automate this, you'll have to use the NCBI E-utilities (the NCBI will blacklist users who script against the Entrez web interface). Fortunately, since you are using BioPython's Bio.Entrez (see the "Biopython Tutorial and Cookbook", Chapter 8, "Accessing NCBI’s Entrez databases"), this is already taken care of.

To bulk fetch entries whose UIDs (GI numbers in this case) you don't know, you first have to get the UIDs. For this you use ESearch, specifying the 'protein' database and a query to find the required entries. For example, to find the Homo sapiens (NCBI TaxId=9606) entries from RefSeq the query is:

refseq[filter] AND txid9606[Organism]

This gives a result structure which contains the UIDs. EFetch takes a comma-separated list of UIDs, so extract the UIDs, construct the list, and then feed this to EFetch, specifying the required format, to get the data.

The following example Python script uses BioPython to fetch the proteins from Bovine papillomavirus 7 (NCBI TaxId=1001533) present in RefSeq in fasta sequence format:

from Bio import Entrez

Entrez.email = 'you@example.org' # placeholder: NCBI asks for a contact address

entrezDbName = 'protein'
ncbiTaxId = '1001533' # Bovine papillomavirus 7

# Find entries matching the query
entrezQuery = "refseq[filter] AND txid%s[Organism]" % ncbiTaxId
searchResultHandle = Entrez.esearch(db=entrezDbName, term=entrezQuery)
searchResult = Entrez.read(searchResultHandle) # parse the XML result into a dict
searchResultHandle.close()

# Get the data.
uidList = ','.join(searchResult['IdList'])
entryData = Entrez.efetch(db=entrezDbName, id=uidList, rettype='fasta').read()
print(entryData)

In this case the result is small, only 7 proteins, so a single-step fetch is reasonable. Note that ESearch returns at most 20 UIDs by default unless retmax is raised. For taxa with larger numbers of entries, you will want to retrieve the entry data in chunks rather than in one go, to avoid issues with time-outs, to limit the load on the NCBI's servers and to allow for checkpoints and retries in your own code. See "8.15 Using the history and WebEnv" for details of how to use the history capabilities of E-utilities from BioPython to simplify this process.
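The chunked retrieval described above might look roughly like the following. This is a minimal sketch rather than a tested recipe: the function names, the batch size of 500 and the output path are illustrative assumptions, and the ESearch/EFetch calls follow the history pattern shown in the Biopython tutorial.

```python
def batch_starts(total, batch_size):
    """Return the retstart offsets needed to cover `total` records."""
    return list(range(0, total, batch_size))


def fetch_refseq_proteins(tax_id, out_path, email, batch_size=500):
    """Write all RefSeq proteins for one taxon to a FASTA file, in batches.

    Hypothetical helper: name, arguments and defaults are assumptions.
    """
    # Imported here so batch_starts() stays usable without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    query = "refseq[filter] AND txid%s[Organism]" % tax_id

    # usehistory="y" keeps the result set on NCBI's history server,
    # so the UIDs never have to be downloaded and re-sent.
    handle = Entrez.esearch(db="protein", term=query, usehistory="y")
    result = Entrez.read(handle)
    handle.close()
    total = int(result["Count"])

    with open(out_path, "w") as out:
        for start in batch_starts(total, batch_size):
            # Each call fetches one slice of the stored result set.
            handle = Entrez.efetch(
                db="protein",
                rettype="fasta",
                retmode="text",
                retstart=start,
                retmax=batch_size,
                webenv=result["WebEnv"],
                query_key=result["QueryKey"],
            )
            out.write(handle.read())
            handle.close()
    return total
```

A call such as fetch_refseq_proteins("9606", "human_refseq.fasta", "you@example.org") would then write the human RefSeq proteins to one file; each batch is also a natural point to add checkpoints or retries.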

Alternatively there are many other resources which provide the RefSeq data, and provide combined query and fetch capabilities. For example:

  1. RefSeq is available from various public SRS servers (see Public SRS Installations). The EMBL-EBI's Linking to SRS guide documents how to use SRS via URLs, including details of using URLs as an API to SRS. Using SRS@EBI, the example above could be replaced with a call to the URL:[REFSEQP-NCBI_TaxId:1001533]

  2. RefSeq is available on the main MRS server, and may be available on other MRS servers.

For an overview of using Python with web services see the Python section of the EMBL-EBI's "Introduction to Web Services". This includes links to the main documentation for the various tool-kits and tutorials for the most commonly used.


Thanks Hamish. I didn't know about the txid9606[Organism] construct. Now it works like a charm.

written 8.6 years ago by Chris1.6k

See also - you can discover other useful fields to filter on via einfo.

Here's an example fetching viruses in GenBank format,

Also, beware of chimeric records,

modified 6.6 years ago • written 6.6 years ago by Peter5.8k
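As an illustration of the EInfo suggestion in the comment above, the searchable fields of the protein database could be listed along these lines. This is a sketch, not part of the original comment: both function names are hypothetical, and the record layout (DbInfo, FieldList, Name, Description) is the one the Biopython tutorial shows for Entrez.einfo.

```python
def searchable_fields(einfo_record):
    """Map each search field's short name to its description."""
    fields = einfo_record["DbInfo"]["FieldList"]
    return {field["Name"]: field["Description"] for field in fields}


def protein_fields(email):
    """Ask EInfo which fields the 'protein' database can be searched on.

    Hypothetical helper: the name and argument are assumptions.
    """
    # Imported here so searchable_fields() works without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.einfo(db="protein")
    record = Entrez.read(handle)
    handle.close()
    return searchable_fields(record)
```

The short names returned this way are the ones that go in square brackets after a search term, as in txid9606[Organism].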

I don't know if this will still get any attention, but I am running into RuntimeError: Search Backend failed:

or no elements found. Any thoughts?

I've gotten around this by not using so many variables. It's a little less clean, but it works.

modified 3.0 years ago • written 3.0 years ago by yarmda40

It might be best to email the NCBI Entrez team with the full details - it sounds like something on their server is failing.

written 3.0 years ago by Peter5.8k
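If the "Search Backend failed" error turns out to be transient on the server side, wrapping the call in a small retry helper may be enough. This is a generic sketch: with_retries and its defaults are hypothetical and not part of Biopython, and whether retrying helps at all depends on what is actually failing.

```python
import time


def with_retries(func, attempts=3, delay=1.0, retry_on=(RuntimeError, IOError)):
    """Call func(); on a listed error wait and retry, doubling the delay.

    Hypothetical helper: the name and defaults are assumptions.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay)
            delay *= 2
```

With Biopython this could be used as with_retries(lambda: Entrez.read(Entrez.esearch(db="protein", term=query))), so that the request and the parsing step are repeated together.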
6.6 years ago by
London, UK
Giovanni M Dall'Olio27k wrote:

Using the recently released Entrez command-line utilities, you can use:

 esearch -db protein -query "refseq[filter] AND txid9606[Organism]" | efetch -format fasta > human.refseq.sequences

Alternatively, just look at the RefSeq FTP site.

8.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

Use the property srcdb_refseq

Search ncbi protein for

"Homo Sapiens"[ORGN] AND srcdb_refseq[Properties]

go to[ORGN]%20AND%20srcdb_refseq[Properties]

Send to/File/Fasta

You can also use NCBI ESearch/EFetch for the same query.
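For the ESearch/EFetch route, the same query can be expressed as plain E-utilities URLs. The sketch below only builds the URLs, using the Python standard library; the function names are hypothetical, while the base address and the db, term, retmax, id, rettype and retmode parameters are the standard E-utilities ones.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(organism, retmax=20):
    """Build an ESearch URL for the RefSeq proteins of one organism."""
    term = '"%s"[ORGN] AND srcdb_refseq[Properties]' % organism
    params = {"db": "protein", "term": term, "retmax": retmax}
    return EUTILS + "esearch.fcgi?" + urlencode(params)


def efetch_url(uids):
    """Build an EFetch URL returning FASTA text for a list of UIDs."""
    params = {
        "db": "protein",
        "id": ",".join(uids),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EUTILS + "efetch.fcgi?" + urlencode(params)
```

The XML returned by the first URL lists the matching UIDs, which are then passed to efetch_url(); for large result sets the history mechanism is the better fit.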
