How to avoid HTTP failures when using rentrez to fetch records for a long list of species names
4.3 years ago
lvogel ▴ 30

I have a list of almost 5000 species of interest for which I would like to download the sequences from GenBank to create a custom database. I've been using rentrez and following the tutorial, specifically this example:

snail_coi <- entrez_search(db="nuccore", term="COI[Gene] AND Gastropoda[ORGN]", use_history=TRUE)


My problem is that my species list is not monophyletic, so I can't just use a single search term as [ORGN]. Instead, I read the species list into R, convert it to a character vector, and loop through it with entrez_search, like this:

org_search <- list()  # results list must exist before the loop
i <- 1
while (i <= length(speciesvec)) {
  org_search[[i]] <- entrez_search(db="nuccore", term=paste(speciesvec[i], "AND COI[Gene]", sep=" "), use_history=TRUE)
  i <- i + 1
}


But usually after a couple hundred iterations or so, I get kicked out with a 502 Bad Gateway error. The error message says that this often happens when trying to download many records at once, and suggests using web history. I believe the problem is that I'm only adding entries to a list object, not creating an actual web history object. I'm running the command thousands of times instead of once, as in the example, but I can't think of any other way to do it. I appreciate any advice.
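One way to cut the number of requests is to combine several species into a single query with OR, so that 5000 names become a much smaller number of searches. A minimal sketch, assuming a batch size of 50 (an arbitrary choice; very long terms may be rejected and need smaller batches) and the hypothetical helper names `build_term` and `batch_terms`. The snail names are only example input, and the actual search call is left commented out because it needs network access:

```r
# Build one Entrez term for a batch of species names, e.g.
# ("Helix pomatia"[ORGN] OR "Arion ater"[ORGN]) AND COI[Gene]
build_term <- function(species, gene = "COI") {
  orgs <- paste0('"', species, '"[ORGN]', collapse = " OR ")
  paste0("(", orgs, ") AND ", gene, "[Gene]")
}

# Split the species vector into batches and build one term per batch.
# batch_size = 50 is an assumption, not an NCBI-documented limit.
batch_terms <- function(speciesvec, batch_size = 50) {
  groups <- split(speciesvec, ceiling(seq_along(speciesvec) / batch_size))
  vapply(groups, build_term, character(1), USE.NAMES = FALSE)
}

terms <- batch_terms(c("Helix pomatia", "Cepaea nemoralis", "Arion ater"),
                     batch_size = 2)
print(terms)

# With rentrez loaded, each term could then be searched once:
# library(rentrez)
# searches <- lapply(terms, function(tm)
#   entrez_search(db = "nuccore", term = tm, use_history = TRUE))
```

Each search result would then cover a whole batch of species, so the hits are no longer separated per species; if that matters, the organism can be recovered from the fetched records afterwards.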

Be careful: NCBI will soon move to an API key system (youtube link)

I do not know when this will be set up.

Also, you could try putting your process to sleep after every couple hundred iterations.
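The pausing idea can be combined with a retry, so that a single 502 does not end the whole run. This is only a sketch: `with_retry` is a hypothetical helper, and the retry count and wait times are illustrative values. The 0.4 s pause is chosen to stay under NCBI's limit of 3 requests per second for clients without an API key.

```r
# Call f(); on error wait and retry, doubling the wait each time.
# max_tries and base_wait are illustrative defaults, not NCBI recommendations.
with_retry <- function(f, max_tries = 5, base_wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < max_tries) Sys.sleep(base_wait * 2^(attempt - 1))
  }
  stop("all ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Usage with rentrez (needs network access, so commented out here):
# library(rentrez)
# org_search <- vector("list", length(speciesvec))
# for (i in seq_along(speciesvec)) {
#   org_search[[i]] <- with_retry(function()
#     entrez_search(db = "nuccore",
#                   term = paste(speciesvec[i], "AND COI[Gene]")))
#   Sys.sleep(0.4)  # pause between requests; 0.4 s is an assumed value
# }
```

Saving `org_search` to disk every so often (for example with `saveRDS`) would also let an interrupted run resume instead of starting over.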

@Bastien has already noted a need to create an API key for NCBI programmatic queries.

I think this could be done faster using blastdbcmd and a local copy of the nt BLAST database, if you have that available.

I hope OP has a good fiber connection (around 60GB for nt)

I got blastdbcmd to work for one id at a time, like in the example on the web page:

blastdbcmd -db nt -entry all -outfmt "%g %T" | \
awk ' { if ($2 == 9606) { print $1 } } ' | \
blastdbcmd -db nt -entry_batch - -out human_sequences.txt

But I have a list of almost 5000 species, and putting the above in a loop seems unfeasible.
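The filter above keeps a single taxid (9606); for a list of species the same idea works with a set-membership test instead of `==`, so no loop over blastdbcmd is needed. A sketch in R, assuming the `%g %T` output of the first blastdbcmd call has been saved to a file and that you already have NCBI taxids for your species. The file name, the `filter_ids` helper and the `wanted_taxids` values are hypothetical, and a small inline table stands in for the real file:

```r
# Keep the sequence ids whose taxid is in the wanted set.
# id_tax: data frame with columns id and taxid (two columns, as from "%g %T").
filter_ids <- function(id_tax, wanted_taxids) {
  id_tax$id[id_tax$taxid %in% wanted_taxids]
}

# Stand-in for: read.table("nt_ids_taxids.txt", col.names = c("id", "taxid"))
id_tax <- read.table(text = "111 9606
222 10090
333 6536
444 9606", col.names = c("id", "taxid"))

wanted_taxids <- c(9606, 6536)   # arbitrary example taxids
ids <- filter_ids(id_tax, wanted_taxids)
print(ids)

# The result could be written out and passed to blastdbcmd -entry_batch:
# writeLines(as.character(ids), "wanted_ids.txt")
```

This way the database is scanned once for the id/taxid table and once for the batch retrieval, regardless of how many species are on the list.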

How about getting the corresponding fasta file for nt here and then retrieving the sequences you need from it?

genomax: good idea. I'm downloading it now. I imagine I'll use the descriptions in the fasta headers to search for my species names, because I think the taxids aren't in there. Will have to parse it somehow. If I have trouble I'll post again. Thanks.
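Matching species names against the headers could look roughly like this. It assumes the common defline layout `>ACCESSION Genus species description`, which not every record follows, and the headers below are made up for illustration; `header_species` is a hypothetical helper name:

```r
# Extract the first two words after the accession as a candidate species name.
header_species <- function(headers) {
  desc <- sub("^>\\S+\\s+", "", headers)   # drop the ">ACCESSION " prefix
  words <- strsplit(desc, "\\s+")
  vapply(words, function(w) paste(head(w, 2), collapse = " "), character(1))
}

# Made-up example headers, not real GenBank records
headers <- c(">AB000001.1 Helix pomatia mitochondrial COI gene, partial cds",
             ">XY000002.1 Homo sapiens chromosome 1 sequence",
             ">CD000003.1 Arion ater cytochrome oxidase subunit I (COI) gene")
speciesvec <- c("Helix pomatia", "Arion ater")

keep <- header_species(headers) %in% speciesvec
print(headers[keep])
```

For the full nt file the headers would have to be processed in chunks (for instance with `readLines(con, n = ...)` on an open connection) rather than read into memory at once.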

It's working, except for the fact that in some places the file contains ^A characters, followed by what appears to be missing data. I would guess I should delete the whole nt file, and download it again, and see if that works.
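As far as I know, the ^A (Ctrl-A, byte 0x01) characters are not necessarily a sign of a corrupt download: NCBI's FASTA dumps use that byte to join the deflines of identical sequences that were merged into one entry. If that is what you are seeing, each header can be split on that byte before matching species names. A small sketch with a made-up header and a hypothetical helper name:

```r
# Split one merged FASTA header into its individual deflines.
split_deflines <- function(header) {
  parts <- strsplit(sub("^>", "", header), "\001", fixed = TRUE)[[1]]
  trimws(parts)
}

# Made-up header: two deflines joined by Ctrl-A ("\001")
header <- ">AB000001.1 Helix pomatia COI gene\001CD000003.1 Arion ater COI gene"
deflines <- split_deflines(header)
print(deflines)
```

Checking every defline of an entry, not just the first, avoids missing a species that only appears after a ^A.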

Thanks for the information. blastdbcmd looks like an interesting tool. I'll either figure out a way to use it, or just modify my current way with Sys.sleep, etc. Internet connection not so bad--last time I downloaded nt, it only took about a day or two. ;)