Question: How to download all sequences of a list of proteins for a particular organism
0
gravatar for traviata
2.1 years ago by
traviata10
traviata10 wrote:

For every organism here https://www.ncbi.nlm.nih.gov/genome/browse/reference/#, I would like to download the protein sequence of all genes whose /product matches "RNA polymerase subunit" or "ribosomal protein" and put all sequences in a single fasta proteins.fa.

How can I do this using Entrez Direct utilities in terminal? I understand that I can parse the table in the link for the RefSeq IDs, but from there, I'm not sure where to go.

esearch efetch entrez direct • 1.2k views
ADD COMMENTlink modified 2.1 years ago by Kevin Blighe52k • written 2.1 years ago by traviata10
6
gravatar for Kevin Blighe
2.1 years ago by
Kevin Blighe52k
Kevin Blighe52k wrote:

You can use this Python script that I wrote just now. It requires BioPython, which, on my laptop, is stored under '/usr/local/lib/python2.7/dist-packages/' (hence I append that to the sys path at the beginning of the script).

import sys
sys.path.append('/usr/local/lib/python2.7/dist-packages/')

import argparse

from Bio import Entrez

parser = argparse.ArgumentParser(description='Searches for protein sequences in the Title Word field ([TITL]) based on any provided key terms.\nSee here for further details: http://cbsu.tc.cornell.edu/resources/seq_comp/pb607_introductory/entrez/ncbi_entrez.html')
parser.add_argument('-e', action='store', dest='EmailAddress', required=True, help='Entrez requires your email address.')
parser.add_argument('-t', action='store', dest='SearchTerm', required=True, help='Requires a search term (wrap in double quotes).')

arguments = parser.parse_args()

Entrez.email = arguments.EmailAddress

SearchTerm = arguments.SearchTerm

#LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]"
LookupCommand = "refseq[FILTER] AND " + SearchTerm + "[TITL]"

handle = Entrez.esearch(db='protein', term=LookupCommand)

results = Entrez.read(handle)

handle.close()

#Lookup the FASTA sequence for each protein by its GeneInfo Identifier (GI) number
for gi in results['IdList']:
    handle = Entrez.efetch(db='protein', id=gi, rettype='fasta')

    print handle.read()

    handle.close()

Execute it with

python ProteinFASTASearchByFASTATitle.py -e Me@MyEmail.com -t "RNA polymerase subunit" > protein.fa

python ProteinFASTASearchByFASTATitle.py -e Me@MyEmail.com -t "ribosomal protein" >> protein.fa

sed -i '/^$/d' protein.fa

head protein.fa
>WP_098657443.1 RNA polymerase subunit sigma-70 [Bacillus toyonensis]
MNQSYSSLNRDESLTRTINLGTTARSIGPLVKPEDENFEVKEIWNYKVLSKQESLNLFRRYKHGEKDLRE
YLFHVNIGLVLSIARKYKKKHPEIEFDDLVQEGNEGMLRAIEDFDPDLGYCFSTYAYCWIKKSMLGFICK
KKSGPFKIPNYVNQFNVKYVEIEDKYLQMHNRIPTVEEVVKELDVTREKVVRHNVYYNWVTTMTLDIDTI
NEDIGILNSFCNDNSAIPSTNEMIMEDLNYEIWIIFDEVLNPKQKMVLNLCFGLLDGEIHLHKEIAKALM
ITTERVSQLKDEAINRLKKCDYKDEIFNLLHAKLKVMDELNMA
>WP_098657164.1 RNA polymerase subunit sigma-70 [Bacillus toyonensis]
MKPATFTETVVLYEGMIVNQIKKLSIYQDHEEYYQCGLIGLWYAYERYEEGKGSFPAYAVITVRGYILER
LKKECIMQERYVCVGEYDEQFESEETGMRAQDFMSVLNKRERHIISERFFVGKKMGEIACEMGMTYYQVR
WIYRQALEKMRDSVKG

The final sed command just deletes empty lines, which the entrez fetch command produces.

This script searches for your term in the [TITL] field in Entrez, which will contain the product name (see here: http://cbsu.tc.cornell.edu/resources/seq_comp/pb607_introductory/entrez/ncbi_entrez.html ). If you want just human sequences, then un-comment the #LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]" line in the script, and comment out the other line beneath it.

Kevin

ADD COMMENTlink modified 3 months ago • written 2.1 years ago by Kevin Blighe52k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1178 users visited in the last hour