Question: biomaRt getSequence() connection time-out
flaviuvadan wrote, 3.6 years ago:

I am using the biomaRt library in R to pull sequences from Ensembl with getSequence().

Here is a relevant piece of code:

mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
seq <- getSequence(id=M[1], type="ensembl_gene_id", seqType="gene_flank", upstream=1000, mart=mart)

What I am doing is reading a list of Ensembl IDs from stdin and fetching the 1000 bp upstream of each corresponding gene. M[1] represents the first element of a tab-delimited line, which is the ID.

I have a total of approximately 30000 genes to process. The genes are not all taken in from stdin at once; rather, a simple bash script invokes the R script on separate sets of files, organized according to a criterion. The files have, on average, 60-70 IDs each.

The problem: the script stops downloading sequences after some time, say 2 or 3 hours, and I cannot figure out why. Does BioMart impose a connection time-out on users?

In testing, it downloaded sequences just fine, up to 50 or 60 I'd say, in one run. I compared the output with that returned by RSAT (a Perl-based tool) and it looked as I wanted it to.

I could not find anything online about a time-out, and no other posts on Biostars describe something similar. I've gone through the Bioconductor documentation and the getSequence() documentation but, as I said, could not find anything.

EDIT: there are some posts online, but the majority of the solutions suggest breaking the queries down. In my case it's not that the files have a large number of IDs; rather, there are many files, so it adds up.
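One workaround for intermittent BioMart connection failures (an assumption on my part; biomaRt does not document a user time-out setting) is to wrap each query in a retry loop, so a single dropped connection does not kill a multi-hour run. A minimal sketch; fetch_with_retry is a hypothetical helper, not a biomaRt function:

```r
# Hypothetical helper (not part of biomaRt): retry a query a few times,
# pausing between attempts, instead of failing on the first error.
fetch_with_retry <- function(fetch, tries = 3, pause = 30) {
  for (i in seq_len(tries)) {
    res <- tryCatch(fetch(), error = function(e) NULL)
    if (!is.null(res)) return(res)
    if (i < tries) Sys.sleep(pause)  # back off before the next attempt
  }
  stop("query failed after ", tries, " attempts")
}

# Usage against BioMart (requires biomaRt and a network connection):
# mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# seq  <- fetch_with_retry(function()
#   getSequence(id = M[1], type = "ensembl_gene_id",
#               seqType = "gene_flank", upstream = 1000, mart = mart))
```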

ensembl sequence upstream R • 1.6k views
modified 3.6 years ago by Ben_Ensembl • written 3.6 years ago by flaviuvadan
Ben_Ensembl wrote, 3.6 years ago:

Hi Flaviu,

We can't see anything wrong with your query or with biomaRt itself, so we think you might need to reduce the number of IDs per file from 50-60 to maybe 20 and see if this improves the problem. Sequences are quite heavy queries compared to just filtering on genes. Please do let me know how you get on doing it like this.
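That suggestion can be sketched as follows (the IDs below are dummies for illustration, not real Ensembl genes, and the loop is an assumption about how one might apply it):

```r
# Minimal sketch: split a vector of IDs into chunks of at most 20,
# so each getSequence() call stays small.
ids    <- sprintf("ID%03d", 1:65)                  # dummy IDs for illustration
chunks <- split(ids, ceiling(seq_along(ids) / 20))
lengths(chunks)   # chunk sizes: 20 20 20 5

# Each chunk can then be queried separately, e.g.:
# for (chunk in chunks) {
#   res <- getSequence(id = chunk, type = "ensembl_gene_id",
#                      seqType = "gene_flank", upstream = 1000, mart = mart)
# }
```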

Best wishes

Ben Ensembl Helpdesk

written 3.6 years ago by Ben_Ensembl

I have split the files into 20 IDs each. I will come back and edit my comment after the script runs.

EDIT: ran the script with 20 IDs per file and ran into the same problem. I have now split the files into 10 IDs each and will try again.

modified 3.6 years ago • written 3.6 years ago by flaviuvadan

Hi Flaviu,

OK, please keep me updated on how it goes.

Best wishes


written 3.6 years ago by Ben_Ensembl

Even splitting into 10 IDs per file did not do the job. Since I have the gene locations (start and end sites), I ended up writing a script that extracts upstream sequences given chromosomal gene coordinates.
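For reference, the coordinate arithmetic such a script needs can be sketched as follows (the function name and the strand handling are assumptions on my part; the poster's actual script is not shown in the thread):

```r
# Hypothetical helper: compute the coordinates of the width-bp window
# immediately upstream of a gene, taking strand into account.
# Coordinates are 1-based and inclusive, as in Ensembl.
upstream_region <- function(start, end, strand, width = 1000) {
  if (strand == "+") {
    c(max(1, start - width), start - 1)   # window ends just before the gene start
  } else {
    c(end + 1, end + width)               # reverse strand: upstream lies past 'end'
  }
}

upstream_region(5000, 7000, "+")   # 4000 4999
upstream_region(5000, 7000, "-")   # 7001 8000
```

The max(1, ...) clamp keeps the window on the chromosome when a gene starts within the first 1000 bp.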

written 3.6 years ago by flaviuvadan

Hi Flaviu,

I'm sorry to hear that you're still experiencing problems, but I'm glad to see you've managed to find a workaround. We'll carry on looking into this.

Best wishes


written 3.6 years ago by Ben_Ensembl