Question: vector dimension limit in biomaRt
2.5 years ago by
difraiadomenico0 wrote:

Hi all, I'm having trouble retrieving all GO terms for a set of UniProt entries. I have a vector of nearly 12000 UniProt entries, some from Swiss-Prot, some from TrEMBL, and so on, and I want to retrieve the corresponding GO terms for all of them.

When I run the R command:

major_proteins_GO_terms = getBM(attributes = c("uniprotsptrembl","go_id","name_1006","namespace_1003"),filters = c('uniprotsptrembl'),values = unlist(major_proteins_ids),mart = ensembl)

I get results for only a fraction of them. The strange thing is that if I select the entries that were not found and launch another query like:

not_found_GO_terms = getBM(attributes = c("uniprotsptrembl","go_id","name_1006","namespace_1003"),filters = c('uniprotsptrembl'),values = not_found_proteins,mart = ensembl)

I can retrieve the GO terms I want. However, this result is also incomplete, and I have to repeat the query on the remaining entries until I have the GO terms for all of them. This seems very strange, and I would like to know if anyone knows how to deal with it.

Does anyone have any ideas? This happens with both Swiss-Prot and TrEMBL entries.

Thanks a lot!

biomart R • 1.0k views
ADD COMMENTlink modified 2.5 years ago by WouterDeCoster41k • written 2.5 years ago by difraiadomenico0

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar when you compose or edit a post.

ADD REPLYlink written 2.5 years ago by WouterDeCoster41k
2.5 years ago by
Mike Smith1.4k
EMBL Heidelberg / de.NBI
Mike Smith1.4k wrote:

This is probably because the BioMart back end tends to cope badly (but silently) with queries containing more than about 500 entries. Admittedly I don't think this is documented anywhere in biomaRt, so that's not easy to figure out. If you use the web interface it warns you next to the filter.

I am planning to implement something to handle this internally (or at least warn the user), but for now I suggest splitting your vector of IDs into a list of smaller vectors, and then using lapply() on that. You might need to do some deduplication on the results if you want the set of GO terms for the whole set of IDs.
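The splitting approach suggested above could be sketched roughly as follows. This is only an illustration: `chunk_ids`, `query_in_chunks` and `fetch_go_terms` are hypothetical helper names (not part of biomaRt), the chunk size of 500 is taken from the approximate limit mentioned above, and the attribute and filter names are copied from the original question.

```r
# Hypothetical helper: split a vector of IDs into chunks of at most
# `chunk_size` elements, dropping duplicate IDs first.
chunk_ids <- function(ids, chunk_size = 500) {
  ids <- unique(ids)
  unname(split(ids, ceiling(seq_along(ids) / chunk_size)))
}

# Hypothetical helper: run `query_fun` (a function taking a vector of IDs
# and returning a data.frame) on each chunk, stack the results and
# remove duplicated rows.
query_in_chunks <- function(ids, query_fun, chunk_size = 500) {
  chunks <- chunk_ids(ids, chunk_size)
  results <- lapply(chunks, query_fun)
  unique(do.call(rbind, results))
}

# Example wrapper around getBM(); it needs the biomaRt package and a
# network connection, and is only defined here, not called.
fetch_go_terms <- function(ids, mart) {
  query_in_chunks(ids, function(chunk) {
    biomaRt::getBM(attributes = c("uniprotsptrembl", "go_id",
                                  "name_1006", "namespace_1003"),
                   filters = "uniprotsptrembl",
                   values = chunk,
                   mart = mart)
  })
}
```

With the objects from the question, this would be called as `fetch_go_terms(unlist(major_proteins_ids), ensembl)`. Passing the query as a function keeps the chunking logic separate from the BioMart call, so it can be tried out on a dummy function first.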

ADD COMMENTlink written 2.5 years ago by Mike Smith1.4k

Thanks Mike! I'll follow the hint you gave me.

The strange thing is that it is always the same proteins that remain "unfound" at each search. For example, the first search always skips the same proteins. This does not seem to be random behavior.

However thanks for the hint!

Domenico

ADD REPLYlink written 2.5 years ago by difraiadomenico0

Yes, it almost feels deterministic sometimes, but I don't know exactly what drives it. Presumably the search takes the same amount of time every time you run it, and then it times out at the same point. There was a bit more discussion on the Bioconductor support site a while ago at https://support.bioconductor.org/p/86358/.

Would it be possible for you to share the list of 12,000 IDs with me? I've been working on a batch submission version of biomaRt today, but all the tests I've run so far with the current version return everything I'm expecting, so I can't actually replicate the behaviour you're currently seeing.

ADD REPLYlink written 2.5 years ago by Mike Smith1.4k

Yeah, sure, just tell me where to send the list.

ADD REPLYlink written 2.5 years ago by difraiadomenico0

Thanks, that'd be great. You can send it to grimbough [at] gmail [dot] com. Cheers.

ADD REPLYlink written 2.5 years ago by Mike Smith1.4k
Mike Smith1.4k wrote:

I've modified the getBM() function in biomaRt to submit queries in batches if the number of values exceeds 500. If you have multiple filters, each of which has more than 500 values, it should generate multiple mutually exclusive queries so that all combinations are run without breaking the 500 value limit. All of this is done internally, so existing biomaRt scripts shouldn't need to be changed. It will also display a progress bar so you can tell it is still proceeding. This is available from biomaRt version 2.33.1.

If anyone finds any issues with this, please let me know.
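To illustrate what "mutually exclusive queries covering all combinations" could look like for several filters, here is a rough sketch. It is not the code used inside biomaRt; `batch_filter_values` is a hypothetical helper that assumes the filter values are supplied as a named list with one vector per filter.

```r
# Hypothetical helper: given a named list of filter values, return a list
# of batches. Each batch is a named list holding one chunk (at most
# `max_size` values) per filter, and every combination of chunks across
# filters appears exactly once.
batch_filter_values <- function(values, max_size = 500) {
  # Chunk each filter's values independently.
  chunked <- lapply(values, function(v) {
    v <- unique(v)
    unname(split(v, ceiling(seq_along(v) / max_size)))
  })
  # One row per combination of chunk indices across the filters.
  grid <- expand.grid(lapply(chunked, seq_along))
  lapply(seq_len(nrow(grid)), function(i) {
    idx <- unlist(grid[i, , drop = FALSE])
    # Pick the idx-th chunk of each filter for this batch.
    Map(function(chunks, j) chunks[[j]], chunked, idx)
  })
}
```

For example, two filters with 1200 and 700 values would give 3 x 2 = 6 batches, none of which holds more than 500 values for any single filter. Each batch could then be passed as the `values` argument of a separate query and the results combined.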


You can test the code with the following example:

library(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl" )

Download a list of 20,000 Uniprot/TrEMBL IDs to use as our query values, and then submit the biomaRt query.

protein_ids <- read.table("http://msmith.de/data/20k_uniprot_ids.txt",
                          header = TRUE,
                          stringsAsFactors = FALSE,
                          sep = "\t")

GO_terms <- getBM(attributes = c("uniprotsptrembl",
                                 "go_id",
                                 "name_1006",
                                 "namespace_1003",
                                 "go_linkage_type"),
                      filters = c('uniprotsptrembl'),
                      values = protein_ids,
                      mart = ensembl)

If we run this with the current release version, we see that ~8% of the protein IDs were silently dropped from the return:

> packageVersion("biomaRt")
[1] ‘2.32.0’
> table(protein_ids[,1] %in% GO_terms$uniprotsptrembl)

FALSE  TRUE 
 1618 18382

Using the devel version this no longer happens:

> packageVersion("biomaRt")
[1] ‘2.33.1’
> table(protein_ids[,1] %in% GO_terms$uniprotsptrembl)

 TRUE 
20000
ADD COMMENTlink written 2.5 years ago by Mike Smith1.4k

Wow great Mike! Thanks! You still need the ID list?

ADD REPLYlink written 2.5 years ago by difraiadomenico0

If you want to use the devel version of biomaRt and report back here whether it's fixed the problem or not that would be perfect. Since it takes a few days for changes to propagate through the Bioc build system, I find the easiest way to install a devel package is with:

source("https://bioconductor.org/biocLite.R")
biocLite("Bioconductor-mirror/biomaRt")

Alternatively, if you don't want to install developmental code and risk messing things up, you can send me the ID list for testing.

ADD REPLYlink written 2.5 years ago by Mike Smith1.4k

Ok Mike, i've sent you a mail with those IDs!

Thanks for your help!

ADD REPLYlink written 2.5 years ago by difraiadomenico0

Hello, on the other hand, a file of 20,000 lines gives me an error: Error in getBM(attributes = c("refsnp_id", "ensembl_gene_stable_id", "ensembl_transcript_stable_id"), : The query to the BioMart webservice returned an invalid result: biomaRt expected a character string of length 1. Please report this to the mailing list. Execution halted

ADD REPLYlink written 4 months ago by amandinelecerfdefer20

No need to bump this old post. You're getting lots of responses from the Ensembl team at "BioMart : the BioMart webservice returned an invalid result", and I responded to your cross-posted query at https://support.bioconductor.org/p/121827/

ADD REPLYlink written 4 months ago by Mike Smith1.4k

Thank you for notifying me of your answer, I had not paid attention to this position. Thank you. I take note of your answer.

ADD REPLYlink written 4 months ago by amandinelecerfdefer20