I am using biomaRt's getSequence() in R to pull some sequences from Ensembl.
Here is a relevant piece of code:
library(biomaRt)

mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
seq <- getSequence(id = M, type = "ensembl_gene_id",
                   seqType = "gene_flank", upstream = 1000, mart = mart)
What I am doing is taking a list of Ensembl IDs from stdin and fetching the 1000 bp upstream of each corresponding gene. M represents the first element of each tab-delimited line, which is the ID.
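For context, the parsing step looks roughly like this (the sample lines and variable names here are illustrative, not my real input):

```r
# Illustrative: extract the first tab-delimited field (the Ensembl ID) from each line
lines <- c("ENSG00000139618\tBRCA2", "ENSG00000141510\tTP53")  # sample input lines
M <- vapply(strsplit(lines, "\t"), `[[`, character(1), 1)
M  # "ENSG00000139618" "ENSG00000141510"
```

In the real script the lines come from stdin rather than a hard-coded vector.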
I have roughly 30,000 genes to process in total. The genes are not all fed in through stdin at once; instead, a simple bash script invokes the R script on separate sets of files, organized according to a criterion. Each file has 60-70 IDs on average.
The problem: the script stops downloading sequences after some time, say 2 or 3 hours, and I cannot figure out why. Does BioMart enforce a connection timeout for users?
In my testing it downloaded sequences just fine, up to 50 or 60 IDs in one run I'd say. I compared the output with what RSAT (which is based on a Perl script) returned, and it looked as I wanted it to.
I could not find anything about a timeout online, and no other Biostars posts describe anything similar. I've gone through the Bioconductor documentation and the getSequence() documentation but, as I said, found nothing.
EDIT: there are some posts online, but most of the solutions suggest breaking the queries down. In my case it's not that any one file has a large number of IDs; rather, there are many files, so it adds up.
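One workaround I am considering, in case the stall is just a dropped connection, is wrapping each getSequence() call in a simple retry with a pause between attempts. with_retries below is my own helper, not part of biomaRt:

```r
# Hypothetical retry helper: calls `expr_fn` up to `max_tries` times,
# sleeping `wait` seconds after each failure (e.g. a dropped BioMart connection).
with_retries <- function(expr_fn, max_tries = 3, wait = 5) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(expr_fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (i < max_tries) Sys.sleep(wait)
  }
  stop(result)  # all attempts failed; re-raise the last error
}

# Usage sketch (assumes `mart` and the ID vector `M` are already defined):
# seq <- with_retries(function() {
#   getSequence(id = M, type = "ensembl_gene_id",
#               seqType = "gene_flank", upstream = 1000, mart = mart)
# })
```

This would not explain why the downloads stop, but it might let the batch runs survive transient failures.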