Question

How to download all sequences of a list of proteins for a particular organism

0

Entering edit mode

7.7 years ago

traviata ▴ 20

For every organism here https://www.ncbi.nlm.nih.gov/genome/browse/reference/#, I would like to download the protein sequence of all genes whose /product matches "RNA polymerase subunit" or "ribosomal protein" and put all sequences in a single fasta proteins.fa.

How can I do this using Entrez Direct utilities in terminal? I understand that I can parse the table in the link for the RefSeq IDs, but from there, I'm not sure where to go.

efetch esearch entrez direct • 3.7k views

ADD COMMENT • link updated 7.7 years ago by Kevin Blighe 89k • written 7.7 years ago by traviata ▴ 20

score 6 · Accepted Answer · 2017-10-25

You can use this Python script that I wrote just now (only tested on Python 2.7).

import sys
import argparse
from Bio import Entrez

parser = argparse.ArgumentParser(description='Searches for protein sequences in the Title Word field ([TITL]) based on any provided key terms.\nSee here for further details: http://cbsu.tc.cornell.edu/resources/seq_comp/pb607_introductory/entrez/ncbi_entrez.html')
parser.add_argument('-e', action='store', dest='EmailAddress', required=True, help='Entrez requires your email address.')
parser.add_argument('-t', action='store', dest='SearchTerm', required=True, help='Requires a search term (wrap in double quotes).')

arguments = parser.parse_args()

Entrez.email = arguments.EmailAddress

SearchTerm = arguments.SearchTerm

#LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]"
LookupCommand = "refseq[FILTER] AND " + SearchTerm + "[TITL]"

handle = Entrez.esearch(db='protein', term=LookupCommand)

results = Entrez.read(handle)

handle.close()

#Lookup the FASTA sequence for each protein by its GeneInfo Identifier (GI) number
for gi in results['IdList']:
    handle = Entrez.efetch(db='protein', id=gi, rettype='fasta')

    print handle.read()

    handle.close()

Execute it with

python ProteinFASTASearchByFASTATitle.py -e Me@MyEmail.com -t "RNA polymerase subunit" > protein.fa

python ProteinFASTASearchByFASTATitle.py -e Me@MyEmail.com -t "ribosomal protein" >> protein.fa

sed -i '/^$/d' protein.fa

head protein.fa
>WP_098657443.1 RNA polymerase subunit sigma-70 [Bacillus toyonensis]
MNQSYSSLNRDESLTRTINLGTTARSIGPLVKPEDENFEVKEIWNYKVLSKQESLNLFRRYKHGEKDLRE
YLFHVNIGLVLSIARKYKKKHPEIEFDDLVQEGNEGMLRAIEDFDPDLGYCFSTYAYCWIKKSMLGFICK
KKSGPFKIPNYVNQFNVKYVEIEDKYLQMHNRIPTVEEVVKELDVTREKVVRHNVYYNWVTTMTLDIDTI
NEDIGILNSFCNDNSAIPSTNEMIMEDLNYEIWIIFDEVLNPKQKMVLNLCFGLLDGEIHLHKEIAKALM
ITTERVSQLKDEAINRLKKCDYKDEIFNLLHAKLKVMDELNMA
>WP_098657164.1 RNA polymerase subunit sigma-70 [Bacillus toyonensis]
MKPATFTETVVLYEGMIVNQIKKLSIYQDHEEYYQCGLIGLWYAYERYEEGKGSFPAYAVITVRGYILER
LKKECIMQERYVCVGEYDEQFESEETGMRAQDFMSVLNKRERHIISERFFVGKKMGEIACEMGMTYYQVR
WIYRQALEKMRDSVKG

The final sed command just deletes empty lines, which the entrez fetch command produces.

This script searches for your term in the [TITL] field in Entrez, which will contain the product name (see here: http://cbsu.tc.cornell.edu/resources/seq_comp/pb607_introductory/entrez/ncbi_entrez.html ). If you want just human sequences, then un-comment the #LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]" line in the script, and comment out the other line beneath it.

Kevin