efetch from NCBI E-utilities returns curl errors 400 and 500 and takes a very long time
8 months ago
Eugene • 0

I run this command to download ~4,000 sequences of the invA gene for taxonomy ID 28901. It works fine for smaller datasets, but for this large dataset it takes a very long time and never finishes:

esearch -db nuccore -query 'gbdiv BCT[PROP] AND ( invA[gene] ) AND txid28901[ORGN] ' | efetch -format gbc | xtract -insd CDS gene sub_sequence | sed 's/ /_/g' | awk '{ IGNORECASE=1; if ( $2 ~ /invA/ ) print $0 }' > file

The command generates tab-delimited output for every gene in every genome under taxonomy ID 28901. That output is very large (many genomes x ~4,000 genes in each), even though I only need the sequences of a single gene, invA, which I select afterwards with awk or grep.

Here are the errors I get:

curl: (22) The requested URL returned error: 400
ERROR:  curl command failed with: 22
-X POST https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi -d query_key=1&WebEnv=MCID_64dbfbab6b6e680dea649326&retstart=33000&retmax=100&db=nuccore&rettype=gbc&retmode=xml&api_key=xxx&tool=edirect&edirect=20.0&edirect_os=Linux
HTTP/1.1 400 Bad Request
 WARNING:  FAILURE
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -query_key 1 -WebEnv MCID_64dbfbab6b6e680dea649326 -retstart 33000 -retmax 100 -db nuccore -rettype gbc -retmode xml -api_key xxxx -tool edirect -edirect 20.0 -edirect_os Linux
EMPTY RESULT
SECOND ATTEMPT
curl: (22) The requested URL returned error: 500
 ERROR:  curl command failed with: 22
-X POST https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi -d query_key=1&WebEnv=MCID_64dbfbab6b6e680dea649326&retstart=34100&retmax=100&db=nuccore&rettype=gbc&retmode=xml&api_key=xxx&tool=edirect&edirect=20.0&edirect_os=Linux

Is there a way to speed this command up OR break this query into smaller chunks, so that it does not time out?

Thank you

--
Gene

NCBI efetch E-utilities • 543 views

I assume you are using an NCBI API key; otherwise this would not be working at all. If the query is timing out because of the large amount of data, there may not be much you can do.

Consider using NCBI datasets as an alternative to EntrezDirect (LINK).

"Is there a way to speed this command up OR break this query into smaller chunks"

You may also want to get the accession numbers of the records first and then submit the query in chunks, with a fixed number of records at a time.
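
A minimal sketch of that chunked approach, in bash. It assumes the accessions were first saved one per line to a file, and that `efetch` exits non-zero when a request fails (worth checking on your EDirect version). The function names `fetch_batch` and `run_in_chunks`, the file name and the chunk size are made up for illustration; they are not part of EDirect.

```shell
#!/usr/bin/env bash
# Sketch only: fetch records in small batches of accessions instead of
# paging through one very large history-server result.

# fetch_batch IDS: fetch one comma-separated batch of accessions and keep
# only the rows whose gene column (field 2) matches invA, case-insensitively.
# The body runs in a subshell so that pipefail stays local to this function.
# Assumption: efetch returns a non-zero status when the HTTP request fails.
fetch_batch() (
  set -o pipefail
  efetch -db nuccore -id "$1" -format gbc |
    xtract -insd CDS gene sub_sequence |
    awk -F '\t' 'tolower($2) ~ /inva/'
)

# run_in_chunks FILE SIZE CMD: read accessions (one per line) from FILE,
# group them SIZE at a time, and call CMD once per comma-joined group.
run_in_chunks() {
  local file=$1 size=$2 cmd=$3 batch
  xargs -n "$size" < "$file" | tr ' ' ',' |
    while read -r batch; do
      # stdin comes from /dev/null so CMD cannot swallow the remaining batches
      "$cmd" "$batch" < /dev/null || echo "FAILED: $batch" >&2
    done
}
```

Possible usage, with a hypothetical file name and chunk size: save the accession list with `esearch -db nuccore -query '...' | efetch -format acc > accessions.txt`, then run `run_in_chunks accessions.txt 200 fetch_batch > invA.tsv 2> failed.log`. Batches listed in failed.log can be retried later without repeating the whole search.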


Thank you! I used an NCBI API key. Using NCBI datasets seems a very good idea, but neither the command-line tool nor the NCBI web interface returns any results:

datasets download gene symbol invA --taxon 28901 --include gene,cds
Error: No genes found that match selection

I tried an E. coli gene as a positive control, used the species name instead of the taxID, and tried upper and lower case for the gene symbol, all with the same result. I will follow your suggestion and download all accessions first, then run efetch | xtract | awk for each accession separately.

PS: My esearch | efetch pipeline does generate output, but it does not seem to release the memory for the data it has generated. I got this error message on Ubuntu 22.04 with >100 GB RAM: ecommon.sh: xrealloc: cannot allocate 18446744072361744256 bytes
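
For the per-accession plan, a run that dies partway need not start over. Below is a rough bash sketch of a resumable loop, assuming a plain-text list of accessions (one per line). The names `pending_accessions` and `fetch_resumable`, and the idea of a separate "done" file, are my own illustration and not an EDirect feature. Results are appended to a file as they arrive rather than held in a shell variable.

```shell
#!/usr/bin/env bash
# Sketch only: process accessions one at a time and remember which ones
# succeeded, so a rerun skips them.

# pending_accessions ALL DONE: print accessions listed in ALL but not in DONE.
pending_accessions() {
  local all=$1 done_list=$2
  touch "$done_list"
  # -x whole-line match, -F fixed strings, -v invert; grep exits 1 when
  # nothing is pending, which is not an error here
  grep -vxFf "$done_list" "$all" || true
}

# fetch_resumable ALL DONE OUT CMD: run CMD on each pending accession,
# append its output to OUT, and record the accession in DONE on success.
fetch_resumable() {
  local all=$1 done_list=$2 out=$3 cmd=$4 acc
  pending_accessions "$all" "$done_list" |
    while read -r acc; do
      # stdin comes from /dev/null so CMD cannot consume the accession list
      if "$cmd" "$acc" < /dev/null >> "$out"; then
        echo "$acc" >> "$done_list"
      else
        echo "FAILED: $acc" >&2
      fi
    done
}
```

Here CMD would be a small function of your own that wraps efetch | xtract | awk for a single accession and returns non-zero on failure. Rerunning the same `fetch_resumable` call then only retries the accessions that are not yet in the done file.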
