Question

biomaRt getSequence() connection time-out

6.9 years ago

flaviuvadan • 0

I am using getSequence() from the biomaRt package in R to pull sequences from Ensembl.

Here is a relevant piece of code:

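library(biomaRt)
# Connect to the Ensembl BioMart and select the human gene dataset
mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
# M[1] is the Ensembl gene ID parsed from the current input line;
# request the 1000 bp flanking the gene on its upstream side
seq <- getSequence(id=M[1], type="ensembl_gene_id", seqType="gene_flank", upstream=1000, mart=mart)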

I read a list of Ensembl gene IDs from stdin and try to retrieve the 1000 bp upstream of each gene. M[1] is the first field of a tab-delimited line, which is the ID.

I have approximately 30000 genes to process in total. They are not read from stdin all at once. Instead, a simple bash script invokes the R script on separate sets of files, organized according to a criterion. Each file holds 60-70 IDs on average.

The problem: the script stops downloading sequences after some time, say 2 or 3 hours, and I cannot figure out why. Does BioMart enforce a connection time-out for users?

In testing, a single run downloaded sequences just fine, up to 50 or 60 I'd say. I compared the output with the one returned by RSAT (which is based on a Perl script) and it looked as I wanted it to.

I could not find anything on time-out online and no other posts on Biostars about something similar. I've gone through the Bioconductor documentation and getSequence() documentation but, like I said, I could not find anything.

EDIT: there are some posts online, but most of the solutions suggest breaking the queries down. In my case, the issue is not that the files contain a large number of IDs, but that there are many files, so it adds up.
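For reference, the "break the queries down" approach, combined with retrying a failed request, could be sketched roughly as below. This is only an illustration and not a confirmed fix for the time-out: chunk_ids, with_retry, fetch_all and all_ids are hypothetical names, and the chunk size, number of tries and wait time are arbitrary assumptions. The biomaRt call is left commented out because it needs network access.

```r
# Hypothetical helper: split a vector of IDs into chunks of at most `size`.
chunk_ids <- function(ids, size = 20) {
  if (length(ids) == 0) return(list())
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Hypothetical helper: call fetch(chunk); on error, wait and try again,
# doubling the wait after each failed attempt.
with_retry <- function(fetch, chunk, max_tries = 3, wait = 5) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(chunk), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < max_tries) {
      message("Attempt ", attempt, " failed: ", conditionMessage(result))
      Sys.sleep(wait)
      wait <- wait * 2
    }
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Fetch every chunk in turn and stack the resulting data frames.
fetch_all <- function(ids, fetch, size = 20, max_tries = 3, wait = 5) {
  chunks <- chunk_ids(ids, size)
  parts <- lapply(chunks, function(chunk) {
    with_retry(fetch, chunk, max_tries = max_tries, wait = wait)
  })
  do.call(rbind, parts)
}

# Usage with biomaRt (all_ids is an assumed character vector of gene IDs):
# library(biomaRt)
# mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# upstream <- fetch_all(all_ids, function(chunk) {
#   getSequence(id = chunk, type = "ensembl_gene_id", seqType = "gene_flank",
#               upstream = 1000, mart = mart)
# })
```

Creating the mart object once and reusing it for every chunk avoids opening a new connection for each file, which may or may not matter here.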

R Ensembl Upstream sequence • 2.4k views

updated 6.9 years ago by Ben_Ensembl ★ 2.4k • written 6.9 years ago by flaviuvadan • 0

score 0 · Answer 1 · 2017-06-07

6.9 years ago

Ben_Ensembl ★ 2.4k

Hi Flaviu,

We can't see anything wrong with your query or with biomaRt itself, so we think you might need to reduce the number of IDs per file from 50-60 to maybe 20 and see if this improves things. Sequence retrieval is a much heavier query than just filtering on genes. Please do let me know how you get on doing it like this.

Best wishes

Ben Ensembl Helpdesk

6.9 years ago by Ben_Ensembl ★ 2.4k

I have split the files into 20 IDs each. I will come back and edit my comment after the script runs.

EDIT: I ran the script with 20 IDs in each file and ran into the same problem. I have now split the files into 10 IDs each and will try again.

6.9 years ago by flaviuvadan • 0

Hi Flaviu,

OK - please keep me updated on how it goes.

Best wishes

Ben

6.9 years ago by Ben_Ensembl ★ 2.4k

Even splitting into 10 IDs per file did not do the job. Since I have the gene locations (start and end sites), I ended up writing a script that extracts upstream sequences from the chromosomal gene coordinates.
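The script itself is not shown here, so the following is only a minimal sketch of the coordinate arithmetic such an approach involves. It assumes 1-based inclusive coordinates with start <= end, strand encoded as 1 or -1, and the chromosome held in memory as a single string; upstream_window, reverse_complement and upstream_sequence are hypothetical names. The result is not guaranteed to match what getSequence() returns for gene_flank.

```r
# Interval covering `upstream` bp before the gene's transcription-side start.
# Plus strand: the window ends just before `start`.
# Minus strand: the window begins just after `end`.
upstream_window <- function(start, end, strand, upstream = 1000) {
  stopifnot(all(strand %in% c(1, -1)), all(start <= end))
  plus <- strand == 1
  data.frame(
    win_start = ifelse(plus, pmax(start - upstream, 1), end + 1),
    win_end   = ifelse(plus, start - 1, end + upstream)
  )
}

# Reverse-complement a DNA string (used for minus-strand genes).
reverse_complement <- function(dna) {
  comp <- chartr("ACGTacgt", "TGCAtgca", dna)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Extract the upstream sequence for one gene from a chromosome string.
upstream_sequence <- function(chrom_seq, start, end, strand, upstream = 1000) {
  w <- upstream_window(start, end, strand, upstream)
  win_end <- min(w$win_end, nchar(chrom_seq))  # clamp at the chromosome end
  s <- substr(chrom_seq, w$win_start, win_end)
  if (strand == -1) reverse_complement(s) else s
}

# Toy example: a gene at positions 7-9 on a 12 bp "chromosome".
toy <- "AAACCCGGGTTT"
upstream_sequence(toy, start = 7, end = 9, strand = 1, upstream = 3)
upstream_sequence(toy, start = 7, end = 9, strand = -1, upstream = 3)
```

Windows that would run past either end of the chromosome are truncated rather than padded, so genes close to a chromosome end yield fewer than `upstream` bases.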

6.8 years ago by flaviuvadan • 0

Hi Flaviu,

I'm sorry to hear that you're still experiencing problems, but glad to see you've managed to find a workaround. We'll carry on looking into this.

Best wishes

Ben

6.8 years ago by Ben_Ensembl ★ 2.4k