Hello everyone,
I'm having a problem with the R package rentrez that I am hoping somebody might be able to help me with.
I am trying to download all mitochondrial COI sequences from Genbank in .gb format, using entrez_search followed by entrez_fetch. The number of records returned by entrez_search is quite large (> 2 million), so I have been using a for loop in rentrez_fetch to download the sequences in smaller chunks. I used this approach some months ago for another gene (mitochondrial 12S rRNA; ~200,000 gb entries) and it worked, but now I keep getting output files that don't contain any sequence entries and instead have the following message (one error message per for loop itineration):
https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20131226/efetch.dtd"> <eFetchResult> <ERROR>Unable to obtain query #1</ERROR> </eFetchResult>
I've tried this a number of times over the past few days and it keeps happening. Sometimes the download appears to go ok for a number of itinerations of the for loop, but then it starts throwing the same error. Other times I only see the error message and no sequences at all. I have tried doing the snail coi example in the rentrez tutorial and it seemed to work ok. I also repeated the script I used for the 12S gene - the first 20,000 entries downloaded ok, but then the same error message started appearing. I have tried using sys.sleep to create a pause between loops, to lessen the demand on the NCBI server and reducing the chunks to 500 sequences, but I still get the same error. I've also been running the commands after 9pm US eastern time, as per the NCBI guidelines. My R script is:
mitochondrial_COI_seqs_170118<- entrez_search(db="nuccore", term="(((((((((COX1) OR cytochrome c oxidase I) OR COI) OR MT-CO1) OR MTCO1) OR CO I) OR mitochondrially encoded cytochrome c oxidase I) OR main subunit of cytochrome c oxidase) OR cytochrome c oxidase subunit I) AND Mitochondrion[Filter]", use_history=TRUE)
mitochondrial_COI_seqs_170118
for( seq_start in seq(1,2692598,10000)){recs <- entrez_fetch(db="nuccore", web_history=mitochondrial_COI_seqs_170118$web_history,
rettype="gb", retmax=10000, retstart=seq_start)
# if first in sequence create new file otherwise append to file
if(seq_start==1){
cat(recs, file="mitochondrial_COI_seqs_170118.gb")
}else{
cat(recs, file="mitochondrial_COI_seqs_170118.gb", append=TRUE)
}
# print status to console
cat(seq_start+9999, "sequences downloaded\n")
}
Does anyone have any suggestions? I wondered if I should be breaking the entrez_search step into chunks as well - is that possible? I'm also confused about the difference between using entrez_post and creating a web history within entrez_search with use_history=TRUE.
Thanks in advance :)