Question: biomaRt getSequence() connection time-out
10 weeks ago, flaviuvadan wrote:

I am using getSequence() from the biomaRt package in R to pull some sequences from Ensembl.

Here is a relevant piece of code:

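library(biomaRt)
# Connect to the Ensembl human gene dataset
mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
# M[1] holds the Ensembl gene ID from the current input line; fetch its 1000 bp upstream flank
seq <- getSequence(id=M[1], type="ensembl_gene_id", seqType="gene_flank", upstream=1000, mart=mart)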

I read a list of Ensembl gene IDs from stdin and try to fetch the 1000 bp upstream of each gene. M[1] is the first field of a tab-delimited input line, which holds the ID.

I have approximately 30000 genes to process in total. The genes are not read from stdin all at once. Rather, I have a simple bash script that invokes the R script on separate sets of files, organized according to a criterion. The files contain, on average, 60-70 IDs each.
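For readers following along, the input handling described above might look roughly like the sketch below. It is not the poster's actual script: `read_ids` is a hypothetical helper name, and the only assumption taken from the post is that each line is tab-delimited with the gene ID in the first field. Blank lines and duplicate IDs are dropped.

```r
# Hypothetical helper (not from the original post): read tab-delimited lines
# from a connection and return the unique, non-empty values of the first field.
read_ids <- function(con) {
  lines <- readLines(con, warn = FALSE)
  lines <- lines[nzchar(trimws(lines))]            # skip blank lines
  fields <- strsplit(lines, "\t", fixed = TRUE)     # split each line on tabs
  ids <- trimws(vapply(fields, `[`, character(1), 1))
  unique(ids[nzchar(ids)])                          # drop empty and repeated IDs
}
```

In a script run with Rscript, this could be called as read_ids(file("stdin")).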

The problem: the script stops downloading sequences after some time, say 2 or 3 hours, and I cannot figure out why. Does BioMart enforce a connection time-out for users?

I did some testing, and a single run downloaded sequences just fine, up to 50 or 60 of them I'd say. I compared the output with the one returned by RSAT (which is based on a Perl script) and it looked the way I wanted it to.

I could not find anything about time-outs online, nor any other posts on Biostars about something similar. I have gone through the Bioconductor documentation and the getSequence() documentation but, as I said, I could not find anything.

EDIT: There are some posts online, but the majority of the solutions suggest breaking the queries down. In my case, the problem is not that the files have a large number of IDs but that there are many files, so it adds up.

Tags: ensembl, sequence, upstream, R
10 weeks ago, Ben_Ensembl (EMBL-EBI) wrote:

Hi Flaviu,

We can't see anything wrong with your query or with biomaRt itself, so we think you might need to reduce the number of IDs per file from 50-60 to maybe 20 and see if this improves things. Sequence queries are quite heavy compared to just filtering on genes. Please do let me know how you get on doing it like this.

Best wishes

Ben, Ensembl Helpdesk
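One way to try the smaller-batch suggestion without producing many small files is to chunk the IDs inside a single R session and retry a chunk when the request fails. The sketch below is only an illustration under stated assumptions: `chunk_ids`, `fetch_with_retry`, `fetch_all` and `biomart_fetcher` are hypothetical helper names, the retry counts and waits are arbitrary, and whether this avoids the stall reported in this thread is untested. The only biomaRt call used is getSequence() with the same arguments as in the question.

```r
# Split a vector of IDs into chunks of at most `chunk_size` elements.
chunk_ids <- function(ids, chunk_size = 20) {
  split(ids, ceiling(seq_along(ids) / chunk_size))
}

# Call fetch_fun(chunk); on error, wait and try again, up to `max_tries` times.
# Returns NULL if every attempt fails, so the caller can record the chunk.
fetch_with_retry <- function(chunk, fetch_fun, max_tries = 3, wait_secs = 5) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch_fun(chunk), error = function(e) {
      message("attempt ", attempt, " failed: ", conditionMessage(e))
      NULL
    })
    if (!is.null(result)) return(result)
    if (attempt < max_tries) Sys.sleep(wait_secs * attempt)  # wait longer each time
  }
  NULL
}

# Fetch every chunk; return the combined rows plus the IDs that never succeeded.
fetch_all <- function(ids, fetch_fun, chunk_size = 20, max_tries = 3, wait_secs = 5) {
  chunks <- chunk_ids(ids, chunk_size)
  results <- lapply(chunks, fetch_with_retry, fetch_fun = fetch_fun,
                    max_tries = max_tries, wait_secs = wait_secs)
  is_failed <- vapply(results, is.null, logical(1))
  list(
    sequences  = do.call(rbind, unname(results[!is_failed])),
    failed_ids = unlist(chunks[is_failed], use.names = FALSE)
  )
}

# Hypothetical wiring to biomaRt: builds a fetch function around one mart object.
# Defined here but not called, so the helpers above can be used without a network.
biomart_fetcher <- function(mart) {
  function(chunk) {
    biomaRt::getSequence(id = chunk, type = "ensembl_gene_id",
                         seqType = "gene_flank", upstream = 1000, mart = mart)
  }
}
```

With a mart created as in the question, a call such as fetch_all(ids, biomart_fetcher(mart)) would reuse one connection object for all chunks, and the failed_ids element shows which genes to re-run later.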


I have split the files into 20 IDs each. I will come back and edit my comment after the script runs.

EDIT: I ran the script with 20 IDs in each file and ran into the same problem. I have split the files into 20 IDs each and will try again.

modified 10 weeks ago • written 10 weeks ago by flaviuvadan

Hi Flaviu,

OK - please keep me updated how it goes.

Best wishes

Ben

written 10 weeks ago by Ben_Ensembl

Even splitting into 10 IDs per file did not do the job. Since I have the gene locations (start and end sites), I ended up writing a script that extracts upstream sequences given chromosomal gene coordinates.

written 9 weeks ago by flaviuvadan
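The poster's script is not shown, so the following is only a minimal sketch of the coordinate-based idea in base R. It assumes Ensembl-style coordinates (1-based, inclusive, strand given as 1 or -1) and a chromosome sequence already loaded as a single string; the function names are hypothetical. For a minus-strand gene the upstream region lies after the gene end on the reference, and it is reverse-complemented so the flank reads in the gene's own orientation.

```r
# Reverse-complement a DNA string (A/C/G/T/N, upper or lower case).
reverse_complement <- function(dna) {
  rev_chars <- rev(strsplit(dna, "", fixed = TRUE)[[1]])
  chartr("ACGTNacgtn", "TGCANtgcan", paste(rev_chars, collapse = ""))
}

# 1-based, inclusive reference coordinates of the upstream flank of a gene.
# strand is 1 (forward) or -1 (reverse); the interval is clipped to the chromosome.
upstream_interval <- function(start, end, strand, upstream = 1000, chr_length = Inf) {
  if (strand == 1) {
    c(from = max(1, start - upstream), to = start - 1)
  } else {
    c(from = end + 1, to = min(chr_length, end + upstream))
  }
}

# Extract the upstream flank from a chromosome sequence held as one string.
upstream_sequence <- function(chr_seq, start, end, strand, upstream = 1000) {
  iv <- upstream_interval(start, end, strand, upstream, nchar(chr_seq))
  if (iv[["from"]] > iv[["to"]]) return("")   # gene sits at the chromosome edge
  flank <- substr(chr_seq, iv[["from"]], iv[["to"]])
  if (strand == 1) flank else reverse_complement(flank)
}
```

For whole chromosomes, a dedicated sequence package would likely be more memory-efficient than plain strings, but the coordinate arithmetic stays the same.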

Hi Flaviu,

I'm sorry to hear that you're still experiencing problems, but glad to see you've managed to find a workaround. We'll carry on looking into this.

Best wishes

Ben

written 8 weeks ago by Ben_Ensembl