Question: How to download all sequences of a list of proteins for a particular organism
17 months ago by
traviata10 wrote:

For every organism here, I would like to download the protein sequences of all genes whose /product qualifier matches "RNA polymerase subunit" or "ribosomal protein" and put all sequences in a single FASTA file, proteins.fa.

How can I do this using the Entrez Direct utilities in the terminal? I understand that I can parse the table in the link for the RefSeq IDs, but from there, I'm not sure where to go.

esearch efetch entrez direct • 888 views
17 months ago by
Kevin Blighe39k
Republic of Ireland
Kevin Blighe39k wrote:

You can use this Python script that I wrote just now. It requires BioPython, which, on my laptop, is stored under '/usr/local/lib/python2.7/dist-packages/' (hence I append that to the sys path at the beginning of the script).

import sys
import argparse

# Make BioPython importable when it lives outside the default module search path
sys.path.append('/usr/local/lib/python2.7/dist-packages/')

from Bio import Entrez

parser = argparse.ArgumentParser(description='Searches for protein sequences in the Title Word field ([TITL]) based on any provided key terms.\nSee here for further details:')
parser.add_argument('-e', action='store', dest='EmailAddress', required=True, help='Entrez requires your email address.')
parser.add_argument('-t', action='store', dest='SearchTerm', required=True, help='Requires a search term (wrap in double quotes).')

arguments = parser.parse_args()

# Entrez requires an email address with every request
Entrez.email = arguments.EmailAddress

SearchTerm = arguments.SearchTerm

#LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]"
LookupCommand = "refseq[FILTER] AND " + SearchTerm + "[TITL]"

# Search the protein database; by default esearch returns only the first 20 IDs (see retmax)
handle = Entrez.esearch(db='protein', term=LookupCommand)
results = Entrez.read(handle)
handle.close()

#Lookup the FASTA sequence for each protein by its GeneInfo Identifier (GI) number
for gi in results['IdList']:
    handle = Entrez.efetch(db='protein', id=gi, rettype='fasta')
    # Write the FASTA record to stdout so that it can be redirected to a file
    print(handle.read())
    handle.close()
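Fetching one record per request gets slow for long ID lists. As a rough sketch only, and not part of the script above, the IDs could instead be requested in batches. The helper names `chunked` and `fetch_fasta_batched` are hypothetical, and the batch size of 100 is an arbitrary assumption.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_fasta_batched(id_list, email, batch_size=100):
    """Return FASTA text for all IDs, sending `batch_size` IDs per request."""
    # Imported here so that chunked() can be used without BioPython installed
    from Bio import Entrez
    Entrez.email = email
    records = []
    for batch in chunked(id_list, batch_size):
        # efetch accepts several IDs as one comma-separated string
        handle = Entrez.efetch(db='protein', id=','.join(batch),
                               rettype='fasta', retmode='text')
        records.append(handle.read())
        handle.close()
    return ''.join(records)
```

With this, the loop over results['IdList'] would reduce to a single call such as fetch_fasta_batched(results['IdList'], arguments.EmailAddress), printed to stdout as before.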



Execute it as follows, giving the script's file name after python and your email address after -e:

python -e -t "RNA polymerase subunit" > protein.fa

python -e -t "ribosomal protein" >> protein.fa

sed -i '/^$/d' protein.fa

head protein.fa
>WP_098657443.1 RNA polymerase subunit sigma-70 [Bacillus toyonensis]
>WP_098657164.1 RNA polymerase subunit sigma-70 [Bacillus toyonensis]

The final sed command just deletes the empty lines that the Entrez fetch step leaves between records.
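If you would rather not depend on sed, the same clean-up could be done in Python before the text is written out. This is a minimal sketch; `drop_blank_lines` is a hypothetical helper name, and unlike sed '/^$/d' it also drops lines that contain only whitespace.

```python
def drop_blank_lines(text):
    """Return `text` with empty or whitespace-only lines removed."""
    kept = [line for line in text.splitlines() if line.strip()]
    # Re-join with newlines and keep a single trailing newline
    return '\n'.join(kept) + '\n' if kept else ''
```

Inside the fetch loop, that would mean writing drop_blank_lines(handle.read()) to stdout instead of the raw text.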

This script searches for your term in the [TITL] field in Entrez, which will contain the product name. If you want just human sequences, then uncomment the #LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]" line in the script, and comment out the other line beneath it.
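To avoid editing the script for each organism, the query string could be built by a small function instead. The sketch below is an illustration under assumptions: `build_query` is a hypothetical name, the taxonomy ID is supplied by the caller, and each term is wrapped in double quotes as a phrase, which the script above does not do.

```python
def build_query(search_terms, taxid=None):
    """Build an Entrez protein query for RefSeq records matching any title term.

    search_terms: list of product names to look for in the [TITL] field.
    taxid: optional NCBI taxonomy ID used to restrict the organism.
    """
    # Assumption: each term is quoted so that it is searched as a phrase
    titles = ' OR '.join('"{0}"[TITL]'.format(term) for term in search_terms)
    parts = ['refseq[FILTER]']
    if taxid is not None:
        parts.append('txid{0}[Organism]'.format(taxid))
    parts.append('(' + titles + ')')
    return ' AND '.join(parts)
```

The returned string can be passed as the term argument of Entrez.esearch, once per organism in your list, which also covers both product names in a single search.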

