Question: biomaRt getSequence() connection time-out
21 months ago, flaviuvadan wrote:

I am using getSequence() from the biomaRt package in R to pull some sequences from Ensembl.

Here is a relevant piece of code:

library(biomaRt)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# M[1] is the Ensembl gene ID taken from the current input line
seq <- getSequence(id = M[1], type = "ensembl_gene_id", seqType = "gene_flank", upstream = 1000, mart = mart)

What I am doing is reading a list of Ensembl IDs from stdin and trying to get the 1000 bp upstream of each corresponding gene. M[1] represents the first element in a tab-delimited line, which is the ID.

I have a total of approximately 30000 genes to process. The genes are not read from stdin all at once. Rather, I have a simple bash script that invokes the R script on separate sets of files, organized according to a criterion. The files contain, on average, 60-70 IDs each.

The problem: the script stops downloading sequences after some time, say, 2 or 3 hours, and I cannot figure out why. Does BioMart enforce a connection time-out for users?

In testing, a single run downloaded sequences just fine, up to 50 or 60 I'd say. I compared the output with the one returned by RSAT (which is based on a Perl script) and it looked as I wanted it to.

I could not find anything about time-outs online, and there are no other posts on Biostars about something similar. I have gone through the Bioconductor documentation and the getSequence() documentation but, as I said, I could not find anything.

EDIT: there are some posts online, but the majority of the solutions suggest breaking the queries down. In my case, it's not that the files have a large number of IDs, but that there are many files, so it adds up.
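Since the job spans thousands of requests over several hours, one common way to tolerate an occasional dropped connection is to retry a failed call after a growing pause. The following is only a minimal sketch of that idea, not something taken from this thread: `with_retry`, `max_tries` and `base_wait` are hypothetical names, and it assumes that a timed-out request surfaces as an ordinary R error.

```r
# Hypothetical helper: call fn() and retry with exponential backoff if it
# raises an error. Returns the first successful result.
with_retry <- function(fn, max_tries = 5, base_wait = 2) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt < max_tries) {
      # Wait base_wait, 2 * base_wait, 4 * base_wait, ... seconds
      wait <- base_wait * 2^(attempt - 1)
      message("Attempt ", attempt, " failed: ", conditionMessage(result),
              "; retrying in ", wait, " s")
      Sys.sleep(wait)
    }
  }
  stop("All ", max_tries, " attempts failed; last error: ",
       conditionMessage(result))
}

# Possible usage with biomaRt (needs network access, so left commented out):
# library(biomaRt)
# mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# flank <- with_retry(function() {
#   getSequence(id = M[1], type = "ensembl_gene_id",
#               seqType = "gene_flank", upstream = 1000, mart = mart)
# })
```

Passing the query as a function keeps the retry logic independent of biomaRt, so it can be exercised with a dummy function before running the real job.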

21 months ago, Ben_Ensembl wrote:

Hi Flaviu,

We can't see anything wrong with your query or with biomaRt itself, so we think you might need to reduce the number of IDs per file from 50-60 to maybe 20 and see if this improves things. Sequence queries are quite heavy compared to just filtering on genes. Please do let me know how you get on doing it like this.

Best wishes

Ben
Ensembl Helpdesk
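The suggestion above, fewer IDs per request, can also be applied inside a single R session instead of by re-splitting the input files. A minimal sketch, assuming the IDs are already in a character vector: `chunk_ids`, `fetch_in_chunks` and the `fetch` argument are hypothetical names, and the one-second pause between requests is an arbitrary choice, not an Ensembl recommendation.

```r
# Split a vector of IDs into consecutive chunks of at most `size` elements.
chunk_ids <- function(ids, size = 20) {
  split(ids, ceiling(seq_along(ids) / size))
}

# Call `fetch` (any function that takes a character vector of IDs and returns
# a data frame) once per chunk, pausing between requests, and stack the
# per-chunk results into one data frame.
fetch_in_chunks <- function(ids, fetch, size = 20, pause = 1) {
  chunks <- chunk_ids(unique(ids), size)
  results <- lapply(chunks, function(chunk) {
    out <- fetch(chunk)
    Sys.sleep(pause)
    out
  })
  do.call(rbind, unname(results))
}

# Possible usage with biomaRt (needs network access, so left commented out):
# library(biomaRt)
# mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# flanks <- fetch_in_chunks(ids, function(chunk) {
#   getSequence(id = chunk, type = "ensembl_gene_id",
#               seqType = "gene_flank", upstream = 1000, mart = mart)
# }, size = 20)
```

Creating the mart object once and reusing it for every chunk also avoids setting up a new connection for each input file.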


I have split the files into 20 IDs each. I will come back and edit my comment after the script runs.

EDIT: I ran the script with 20 IDs in each file and ran into the same problem. I have now split the files into 10 IDs each and will try again.

Reply written 21 months ago by flaviuvadan (modified 21 months ago)

Hi Flaviu,

OK - please keep me updated on how it goes.

Best wishes


Reply written 21 months ago by Ben_Ensembl

Even splitting into 10 IDs per file did not do the job. Since I have the gene locations (start and end sites), I ended up writing a script that extracts upstream sequences given chromosomal gene coordinates.
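The script itself was not posted. As a rough illustration of the coordinate-based approach, the sketch below computes the upstream window for a gene and cuts it out of a chromosome sequence held as a single string. Everything here is an assumption rather than the poster's code: the function names are hypothetical, coordinates are taken to be 1-based and inclusive, and strand follows the Ensembl convention of 1 or -1. For genes on the reverse strand, the upstream region lies after the gene end and is reverse-complemented.

```r
# Upstream window for each gene, given start/end and strand (1 or -1).
upstream_window <- function(start, end, strand, upstream = 1000) {
  stopifnot(all(strand %in% c(1, -1)), all(start <= end))
  is_plus <- strand == 1
  data.frame(
    # Forward strand: the bases just before the gene start, clamped at 1.
    # Reverse strand: the bases just after the gene end.
    flank_start = ifelse(is_plus, pmax(start - upstream, 1), end + 1),
    flank_end   = ifelse(is_plus, start - 1, end + upstream),
    strand      = strand
  )
}

# Reverse complement of one or more DNA strings (A, C, G, T, N only).
reverse_complement <- function(dna) {
  comp <- chartr("ACGTNacgtn", "TGCANtgcan", dna)
  vapply(strsplit(comp, ""), function(x) paste(rev(x), collapse = ""),
         character(1))
}

# Extract the upstream sequence for each gene from one chromosome string.
extract_upstream <- function(chrom_seq, start, end, strand, upstream = 1000) {
  win <- upstream_window(start, end, strand, upstream)
  flank <- substring(chrom_seq, win$flank_start, win$flank_end)
  ifelse(win$strand == 1, flank, reverse_complement(flank))
}

# Toy example on a 12-base "chromosome":
# extract_upstream("ACGTTGCAAGGC", start = c(5, 5), end = c(8, 8),
#                  strand = c(1, -1), upstream = 3)
```

Plain strings are fine for a toy example; for whole chromosomes, a package such as Biostrings is likely a better fit.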

Reply written 21 months ago by flaviuvadan

Hi Flaviu,

I'm sorry to hear that you're still experiencing problems, but glad to see you've managed to find a workaround. We'll carry on looking into this.

Best wishes


Reply written 21 months ago by Ben_Ensembl