I am using getSequence() from the biomaRt package in R to pull some sequences from EnsEMBL.
Here is a relevant piece of code:
library(biomaRt)
mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
seq <- getSequence(id=M[1], type="ensembl_gene_id", seqType="gene_flank", upstream=1000, mart=mart)
What I am doing is reading a list of EnsEMBL IDs from stdin and fetching the 1000 bp upstream of each corresponding gene. M[1] represents the first element of a tab-delimited line, which is the ID.
I have a total of approximately 30000 genes to process. The genes are not read from stdin all at once. Rather, I have a simple bash script that invokes the R script on separate sets of files, organized according to a criterion. The files contain, on average, 60-70 IDs.
The problem: the script stops downloading sequences after some time, say, 2 or 3 hours. I cannot figure out why. Does BioMart enforce a connection time-out for users?
In testing, it downloaded sequences just fine, up to 50 or 60 I'd say, in one run. I compared the output with the one returned by RSAT (which is based on a Perl script) and it looked as I wanted it to.
I could not find anything about time-outs online, nor any other posts on Biostars about something similar. I've gone through the Bioconductor documentation and the getSequence() documentation but, like I said, I could not find anything.
EDIT: there are some posts online, but most of the solutions suggest breaking the queries down. In my case, it's not that the files have a large number of IDs, but that there are many files, so it adds up.
I have split the files into 20 IDs each. I will come back and edit my comment after the script runs.
EDIT: I ran the script with 20 IDs in each file and ran into the same problem. I will split the files further and try again.
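For anyone trying the batching approach described above, here is a minimal sketch of batching IDs and retrying a failed request after a pause. The helper names with_retry and batch_ids are hypothetical (they are not part of biomaRt), and nothing here is confirmed to address the underlying cause of the stall; it only shows one way to keep a long run from dying on a single dropped request.

```r
# Hypothetical helper (not part of biomaRt): call fetch(ids) and retry
# after a pause if it raises an error.
with_retry <- function(fetch, ids, max_tries = 3, wait_seconds = 5) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(ids), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) {
      Sys.sleep(wait_seconds)
    }
  }
  stop("All ", max_tries, " attempts failed")
}

# Hypothetical helper: split a vector of IDs into batches of at most `size`.
batch_ids <- function(ids, size = 20) {
  split(ids, ceiling(seq_along(ids) / size))
}

# Intended use with the call from the question (assumes `mart` exists and
# `all_ids` is a character vector of gene IDs):
# fetch_flank <- function(ids) {
#   getSequence(id = ids, type = "ensembl_gene_id", seqType = "gene_flank",
#               upstream = 1000, mart = mart)
# }
# results <- lapply(batch_ids(all_ids), function(b) with_retry(fetch_flank, b))
```

Passing the fetch function as an argument keeps the retry logic independent of biomaRt, so it can be tried out with a stub before pointing it at the real server.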
Hi Flaviu,
OK - please keep me updated how it goes.
Best wishes
Ben
Even splitting into 10 IDs per file did not do the job. Since I have the gene locations (start and end sites), I ended up writing a script that extracts upstream sequences given chromosomal gene coordinates.
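A rough sketch of that kind of coordinate-based extraction, in base R only, might look like the following. This is not my actual script: the function names are made up for illustration, and it assumes 1-based inclusive coordinates, a strand given as "+" or "-", and the chromosome sequence already loaded as a single character string.

```r
# Hypothetical helper: reverse-complement a DNA string using base R only.
reverse_complement <- function(dna) {
  complemented <- chartr("ACGTacgt", "TGCAtgca", dna)
  paste(rev(strsplit(complemented, "", fixed = TRUE)[[1]]), collapse = "")
}

# Hypothetical helper: return up to `upstream` bases preceding a gene.
# Assumes 1-based, inclusive coordinates with gene_start <= gene_end.
# On the minus strand, "upstream" lies past gene_end on the reference,
# so that window is reverse-complemented.
get_upstream <- function(chrom_seq, gene_start, gene_end, strand,
                         upstream = 1000) {
  chrom_len <- nchar(chrom_seq)
  if (strand == "+") {
    from <- max(1, gene_start - upstream)
    to <- gene_start - 1
    if (to < from) return("")
    substr(chrom_seq, from, to)
  } else {
    from <- gene_end + 1
    to <- min(chrom_len, gene_end + upstream)
    if (to < from) return("")
    reverse_complement(substr(chrom_seq, from, to))
  }
}

# Toy chromosome, for illustration only.
chrom <- "AAACCCGGGTTTACGT"
plus_up <- get_upstream(chrom, gene_start = 7, gene_end = 9,
                        strand = "+", upstream = 3)
minus_up <- get_upstream(chrom, gene_start = 7, gene_end = 9,
                         strand = "-", upstream = 3)
```

The window is clipped at the chromosome ends, so genes close to either end return fewer than `upstream` bases rather than failing.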
Hi Flaviu,
I'm sorry to hear that you're still experiencing problems, but glad to see you've managed to find a workaround. We'll carry on looking into this.
Best wishes
Ben