Question: How to download all sequences of a list of proteins for a particular organism
For every organism here, I would like to download the protein sequences of all genes whose /product matches "RNA polymerase subunit" or "ribosomal protein" and put them all in a single FASTA file, proteins.fa.

How can I do this using the Entrez Direct utilities in the terminal? I understand that I can parse the table in the link for the RefSeq IDs, but from there, I'm not sure where to go.

You can use this Python script that I wrote just now. It requires Biopython, which, on my laptop, is installed under '/usr/local/lib/python2.7/dist-packages/' (hence I append that directory to sys.path at the beginning of the script).

import sys

# Make the local Biopython installation importable (adjust or remove for your system)
sys.path.append('/usr/local/lib/python2.7/dist-packages/')

import argparse

from Bio import Entrez

parser = argparse.ArgumentParser(description='Searches for protein sequences in the Title Word field ([TITL]) based on any provided key terms.\nSee here for further details:')
parser.add_argument('-e', action='store', dest='EmailAddress', required=True, help='Entrez requires your email address.')
parser.add_argument('-t', action='store', dest='SearchTerm', required=True, help='Requires a search term (wrap in double quotes).')

arguments = parser.parse_args()

# Entrez asks for a contact email address with every request
Entrez.email = arguments.EmailAddress

SearchTerm = arguments.SearchTerm

#LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]"
LookupCommand = "refseq[FILTER] AND " + SearchTerm + "[TITL]"

# Search the protein database and parse the XML result into a dictionary
handle = Entrez.esearch(db='protein', term=LookupCommand)
results = Entrez.read(handle)
handle.close()

#Lookup the FASTA sequence for each protein by its GeneInfo Identifier (GI) number
for gi in results['IdList']:
    handle = Entrez.efetch(db='protein', id=gi, rettype='fasta', retmode='text')
    print(handle.read())
    handle.close()

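One caveat about the loop above: esearch returns only the first 20 IDs unless you pass a larger retmax, and fetching one ID per request is slow for long lists. Below is a minimal sketch of batching the IDs, assuming efetch is given a comma-separated ID string. The helper names batches and fetch_fasta and the batch size of 200 are my own choices for illustration, not part of Biopython.

```python
def batches(ids, size=200):
    """Yield comma-joined strings of at most `size` IDs each."""
    for start in range(0, len(ids), size):
        yield ",".join(ids[start:start + size])


def fetch_fasta(ids, size=200):
    """Yield FASTA text for each batch of IDs (hypothetical helper)."""
    # Imported here so that batches() can be used without Biopython installed.
    from Bio import Entrez
    for id_string in batches(ids, size):
        handle = Entrez.efetch(db='protein', id=id_string,
                               rettype='fasta', retmode='text')
        yield handle.read()
        handle.close()
```

To use it, you would set Entrez.email as in the script, add a retmax value of your choosing to the esearch call, and then print each chunk yielded by fetch_fasta(results['IdList']).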

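Execute the script with the commands below, replacing SCRIPT.py with the filename you saved the script under and YOUR_EMAIL with your own email address:

python SCRIPT.py -e YOUR_EMAIL -t "RNA polymerase subunit" > protein.fa

python SCRIPT.py -e YOUR_EMAIL -t "ribosomal protein" >> protein.fa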

sed -i '/^$/d' protein.fa

head protein.fa
>WP_098657443.1 RNA polymerase subunit sigma-70 [Bacillus toyonensis]
>WP_098657164.1 RNA polymerase subunit sigma-70 [Bacillus toyonensis]

The final sed command just deletes the empty lines that the Entrez fetch step leaves in the output.
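If you would rather not depend on sed, the same clean-up can be done in Python before printing. This is only a sketch; drop_blank_lines is a name I made up, and it assumes the fetched FASTA arrives as a single text string.

```python
def drop_blank_lines(fasta_text):
    """Return fasta_text with empty and whitespace-only lines removed."""
    lines = [line for line in fasta_text.splitlines() if line.strip()]
    # Keep a single trailing newline so consecutive records concatenate cleanly.
    return ("\n".join(lines) + "\n") if lines else ""
```

In the script you would then write the cleaned text with sys.stdout.write(drop_blank_lines(handle.read())) instead of print, since print would add another newline.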

This script searches for your term in the [TITL] field in Entrez, which will contain the product name. If you want just human sequences, un-comment the #LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]" line in the script and comment out the LookupCommand line beneath it.
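Rather than commenting lines in and out, the query could be built from an optional taxonomy ID, which also makes it easy to loop over several organisms and both search terms. A rough sketch follows; build_query and the taxids list are hypothetical, and the query strings follow the same pattern as the two LookupCommand lines in the script.

```python
def build_query(search_term, taxid=None):
    """Build an Entrez protein query, optionally restricted to one taxon."""
    parts = ["refseq[FILTER]"]
    if taxid is not None:
        parts.append("txid{}[Organism]".format(taxid))
    parts.append(search_term + "[TITL]")
    return " AND ".join(parts)


terms = ["RNA polymerase subunit", "ribosomal protein"]
taxids = [9606]  # example list; 9606 is human, add your own organisms here

# One query per (organism, term) pair, ready to pass to Entrez.esearch
queries = [build_query(term, taxid) for taxid in taxids for term in terms]
```

Each string in queries can be passed as the term argument of Entrez.esearch in place of LookupCommand.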


