Protein Sequence from Entrez by Taxonomic_ID: Very slow
5 months ago
The ▴ 180

I want to download a large number of protein sequences for a metaproteomics study, covering hundreds of genera (each containing multiple species). My esearch/efetch command is shown below. It is quite slow even though my university has a high-speed connection, and the connection often breaks partway through a download.

esearch -db "protein" -query "txid374666[Organism]" | efetch -format fasta > txid_374666.fasta

I then took the following Python code from a Biostars thread. It is also slow and sometimes fails with a "bad gateway" error. Can anybody suggest a faster way of downloading the sequences? Thanks.

from Bio import Entrez
import json
import time


def get_ids(response) -> list:
    """Return the ID list from a JSON-formatted esearch response."""
    j = json.loads(response.read())
    return list(j['esearchresult']['idlist'])


Entrez.email = "my.name@myuniv.edu"
RETMAX = 990000

txids = [187492]  # 100K sequences

for txid in txids:
    prids = get_ids(Entrez.esearch(db="Protein", term=f"txid{txid}[Organism]", retmax=RETMAX, retmode="json"))
    with open(f"taxid_{txid}.fasta", 'w') as file:
        start_time = time.time()
        for prid in prids:
            # print(json.loads(Entrez.esummary(db="Protein", id=prid, retmode="json").read())['result'][prid])
            # One efetch request is sent per protein ID
            fasta = Entrez.efetch(db="Protein", id=prid, rettype="fasta", retmode="text").read()
            file.write(fasta)

        print("--- %s minutes, %s proteins, taxid_%s ---" % ((time.time() - start_time) // 60, len(prids), txid))
5 months ago
GenoMax 141k

Please use NCBI datasets for this kind of workload.

An example download using the taxID from your post above:

datasets download genome taxon 187492 --include protein

At the moment this gets you 103 genomes in about 2 minutes.
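Since the question mentions hundreds of genera, here is a rough Python sketch of how the command above could be wrapped per taxID and the protein files merged into one FASTA. Treat it as a starting point, not a tested pipeline. The function names and the `txid_<id>.zip` / `txid_<id>.fasta` naming are made up for illustration. It also assumes the `datasets` binary is on your PATH, that `--filename` is used to name the archive, and that each genome's proteins sit in a `protein.faa` file under `ncbi_dataset/data/<assembly accession>/` inside the zip; check one archive by hand first.

```python
import subprocess
import zipfile
from pathlib import Path


def datasets_command(taxid: int, zip_path: Path) -> list[str]:
    """Build the datasets call shown above, plus --filename to name the archive."""
    return [
        "datasets", "download", "genome", "taxon", str(taxid),
        "--include", "protein",
        "--filename", str(zip_path),
    ]


def merge_protein_fasta(zip_path: Path, out_fasta: Path) -> int:
    """Concatenate every protein.faa in the archive; return how many files were merged."""
    merged = 0
    with zipfile.ZipFile(zip_path) as archive, open(out_fasta, "wb") as out:
        for name in sorted(archive.namelist()):
            # Assumed layout: ncbi_dataset/data/<assembly accession>/protein.faa
            if name.endswith("/protein.faa"):
                data = archive.read(name)
                out.write(data)
                # Make sure the next file starts on a fresh line
                if data and not data.endswith(b"\n"):
                    out.write(b"\n")
                merged += 1
    return merged


def download_taxon_proteins(taxid: int, workdir: Path = Path(".")) -> Path:
    """Download one taxon with the datasets CLI and merge its protein files."""
    zip_path = workdir / f"txid_{taxid}.zip"
    out_fasta = workdir / f"txid_{taxid}.fasta"
    # check=True raises CalledProcessError if the CLI exits with an error
    subprocess.run(datasets_command(taxid, zip_path), check=True)
    merge_protein_fasta(zip_path, out_fasta)
    return out_fasta
```

You would then loop over your list, e.g. `for txid in [187492, 374666]: download_taxon_proteins(txid)`. The merged file may contain the same protein more than once when several genomes share it, so you may want to deduplicate by accession afterwards.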


Thanks a ton
