Question

How to extract all attributes from BioMart?

5.4 years ago

Learner ▴ 280

I want to extract all attributes. Normally I do the following to extract two attributes, in this example "ensembl_gene_id" and "hgnc_symbol":

library("biomaRt")

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mapping <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), mart = ensembl)

But I want to extract all available attributes. Any ideas?

r biomart • 9.0k views

Here's my best guess. It runs into problems when run, as Ensembl seems to limit the number of attributes you can request in one query:

lapply(X = unique(listAttributes(ensembl)$page),
       FUN = function(attrib_page) {
           getBM(
               attributes = listAttributes(ensembl, page = attrib_page, what = "name"),
               mart = ensembl
           )
       })

updated 5.4 years ago by zx8754 11k • written 5.4 years ago by Ram 43k

Based on the vignette:

mapping <- getBM(attributes = listAttributes(ensembl), mart = ensembl)

5.4 years ago by WouterDeCoster 47k

This does not work. I get this error:

Error in getBM(attributes = listAttributes(ensembl), mart = ensembl) : 
  Invalid attribute(s): c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_peptide_id", "ensembl_exon_id", "description", "chromosome_name", "start_position", "end_position", "strand", "band", "transcript_start", "transcript_end", "transcription_start_site", "transcript_length", "transcript_tsl", "transcript_gencode_basic", "transcript_appris_pi", "external_gene_name", "external_gene_source", "external_transcript_name", "external_transcript_source_name", "transcript_count", "percentage_gc_content", "gene_biotype", "transcript_biotype", 
"source", "transcript_source", "status", "transcript_status", "version", "transcript_version", "phenotype_description", "source_name", "study_external_id", "go_id", "name_1006", "definition_1006", "go_linkage_type", "namespace_1003", "goslim_goa_accession", "goslim_goa_description", "arrayexpress", "chembl", "clone_based_ensembl_gene_name", "clone_based_ensembl_transcript_name", "clone_based_vega_gene_name", "clone_based_vega_transcript_name

updated 5.4 years ago by GenoMax 141k • written 5.4 years ago by Learner ▴ 280

Right, I was too fast, but we'll have an answer ready soon :)

5.4 years ago by WouterDeCoster 47k

score 6 · Answer 1 · 2018-12-06

It's really not a sensible idea to try and use biomaRt to do this. Literally nothing about the system (the biomaRt package and the BioMart framework) was designed to provide data on that scale.

If you want all the data in Ensembl for a particular species try getting it from the FTP site (http://ftp.ensembl.org/) - you'll have more success.

Persisting with trying to obtain all attributes using biomaRt will actually result in even more data than if you just downloaded it. There's duplication on every BioMart page, e.g. every page contains the gene ID, so you'll have that repeated 6 times in the example zx8754 presents. That's not so bad, but it'll actually be many more times, since BioMart returns completely normalised data for each query, as James MacDonald explained over on the BioC site (https://support.bioconductor.org/p/115867/#115875).
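To picture why the row count grows so quickly, here is a small offline sketch. It assumes that a flat result table needs one row per combination of attribute values; the IDs are made up for illustration and are not real Ensembl or GO identifiers.

```r
# Toy illustration (made-up IDs): one gene with 3 transcripts and 4 GO terms.
transcripts <- data.frame(
  gene_id = "geneA",
  transcript_id = c("tx1", "tx2", "tx3"),
  stringsAsFactors = FALSE
)
go_terms <- data.frame(
  gene_id = "geneA",
  go_id = c("GO:1", "GO:2", "GO:3", "GO:4"),
  stringsAsFactors = FALSE
)

# Putting both attributes in one flat table gives one row per
# transcript/GO-term combination: 3 * 4 rows for a single gene.
flat <- merge(transcripts, go_terms, by = "gene_id")
nrow(flat)
# [1] 12
```

Each extra multi-valued attribute multiplies the row count again, which is why asking for everything at once gets out of hand.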

Also consider that blindly asking for all attributes would return the nucleotide sequence for each entry over and over again, with sequences for the whole gene, each transcript (spliced and unspliced), the flanking and untranslated regions etc. This will be huge!

You're much better off either working out which attributes are pertinent to the task you're trying to do, or getting a data dump from another source and then reading that into R.
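As a rough sketch of the "read a dump into R" route, assuming the dump is a GTF file (one of the formats on the Ensembl FTP site): the two records below are made-up placeholders so the example runs offline, and get_attr is a hypothetical helper, not part of any package. For real files a dedicated importer such as rtracklayer::import is probably the better choice.

```r
# Two made-up GTF-style records (placeholder IDs and coordinates).
gtf_text <- paste(
  "#!example header line",
  "7\tsource\tgene\t100\t900\t.\t+\t.\tgene_id \"geneA\"; gene_name \"nameA\";",
  "7\tsource\tgene\t2000\t2500\t.\t-\t.\tgene_id \"geneB\"; gene_name \"nameB\";",
  sep = "\n"
)

# GTF is tab-separated with 9 columns; header lines start with '#'.
gtf <- read.delim(
  text = gtf_text, header = FALSE, comment.char = "#", quote = "",
  stringsAsFactors = FALSE,
  col.names = c("seqname", "source", "feature", "start", "end",
                "score", "strand", "frame", "attribute")
)

# Hypothetical helper: pull one key out of the free-text attribute column.
get_attr <- function(attribute, key) {
  sub(paste0('.*', key, ' "([^"]*)".*'), "\\1", attribute)
}

gtf$gene_id <- get_attr(gtf$attribute, "gene_id")
gtf$gene_name <- get_attr(gtf$attribute, "gene_name")
gtf[, c("seqname", "start", "end", "gene_id", "gene_name")]
```

With a real file you would pass its path to read.delim instead of text = gtf_text.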

score 4 · Answer 2 · 2018-12-06

5.4 years ago

zx8754 11k

Since we can't query attributes from multiple pages in a single call, we need to loop through the pages. Otherwise biomaRt gives this error:

Querying attributes from multiple attribute pages is not allowed. To see the attribute pages attributes belong to, use the function attributePages.

We have in total 2579 attributes:

# get attributes
x <- listAttributes(ensembl)

table(x$page)
# feature_page     homologs    sequences          snp  snp_somatic    structure 
#          195         2219           55           38           38           34 
dim(x)
# [1] 2579    3

Loop through pages, and get attributes. Getting all attributes from all pages will not work. We might as well go and download the whole biomart.

But we could make it work by using filters. For example, below I am querying hgnc_symbol == foxp2, and I am only requesting one attribute per page, i[ 1 ]:

res <- lapply(split(x$name, x$page), function(i){
  # what you want... but will not work
  # getBM(attributes = i, mart = ensembl)

  # but could work with filters. Here we are getting one attribute per page for one gene
  getBM(attributes = i[ 1 ], filters = "hgnc_symbol", values = "foxp2", mart = ensembl)
})

The result object for one gene and one attribute per page is already 7 Mb. Even if the query for all pages and all attributes did work, I doubt the result would fit in the memory of an average PC.

print(object.size(res), units = "Mb")
# 7 Mb
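To make it clearer what split(x$name, x$page) hands to lapply in the loop above, here is an offline sketch on a toy attribute table. x_toy is a stand-in for the real listAttributes(ensembl) output and only has four rows picked for illustration.

```r
# Toy stand-in for listAttributes(ensembl): attribute names and their pages.
x_toy <- data.frame(
  name = c("ensembl_gene_id", "hgnc_symbol", "ensembl_gene_id", "cdna"),
  page = c("feature_page", "feature_page", "sequences", "sequences"),
  stringsAsFactors = FALSE
)

# One character vector of attribute names per page.
by_page <- split(x_toy$name, x_toy$page)
names(by_page)
# [1] "feature_page" "sequences"

# i[ 1 ] inside the loop picks the first attribute of each page.
first_per_page <- vapply(by_page, function(i) i[1], character(1))
first_per_page
```

So each iteration of the loop sees only attributes from a single page, which is what keeps the query valid.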

@zx8754 brilliant! I just could not understand it completely: what is the filter for, and what is the value? Let's say I want to obtain all the entries for each gene (which would possibly be less computationally expensive). How can I use the filter and value to loop through the genes?

5.4 years ago by Learner ▴ 280

what is filter for and what is value?

The dataset fetched will be subset by filter == value, in this case hgnc_symbol == 'foxp2'. getBM usually works with attributes, filters and values, applying them to a global dataset. It subsets the global dataset by matching the values parameter against the filter named in the filters parameter, then fetches all attributes specified in the attributes parameter. In SQL terms, SELECT $attribute FROM mart WHERE $filter = $value. You're trying to run a SELECT * FROM mart, which is not allowed.
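The same idea can be mimicked offline on a plain data frame. This is only an analogy for how attributes, filters and values relate, not how getBM is implemented; mart_like and select_where are hypothetical names, and the gene IDs are placeholders.

```r
# Toy "mart": one row per gene (IDs are placeholders).
mart_like <- data.frame(
  hgnc_symbol = c("FOXP2", "BRCA1", "BRCA2"),
  ensembl_gene_id = c("id_1", "id_2", "id_3"),
  chromosome_name = c("7", "17", "13"),
  stringsAsFactors = FALSE
)

# SELECT attributes FROM data WHERE filter IN values
select_where <- function(data, attributes, filter, values) {
  data[data[[filter]] %in% values, attributes, drop = FALSE]
}

res <- select_where(
  mart_like,
  attributes = c("ensembl_gene_id", "chromosome_name"),
  filter = "hgnc_symbol",
  values = "FOXP2"
)
res
```

The filter decides which rows come back and the attributes decide which columns, so dropping the filter is the SELECT * case that the server refuses.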

5.4 years ago by Ram 43k

Something like below (not tested):

myGeneList <- c("foxp2", "brca1", "brca2")

res <- lapply(split(x$name, x$page), function(i){
  lapply(myGeneList, function(j){
    getBM(attributes = i[ 1 ], filters = "hgnc_symbol", values = j, mart = ensembl)
  })
})
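The nested loop above returns a list of lists, which can be awkward to work with. Here is a sketch of one way to label and combine the per-gene results. fetch_one is a hypothetical stand-in for the getBM call so that it runs offline, and pages is a toy replacement for split(x$name, x$page).

```r
# Hypothetical stand-in for getBM(); returns a one-row data frame per query.
fetch_one <- function(attribute, gene) {
  data.frame(gene = gene, attribute = attribute, stringsAsFactors = FALSE)
}

myGeneList <- c("foxp2", "brca1", "brca2")

# Toy replacement for split(x$name, x$page).
pages <- list(feature_page = "ensembl_gene_id", sequences = "cdna")

res <- lapply(pages, function(i) {
  per_gene <- lapply(myGeneList, function(j) fetch_one(i[1], j))
  names(per_gene) <- myGeneList   # label each result by gene
  do.call(rbind, per_gene)        # one data frame per page
})

names(res)
# [1] "feature_page" "sequences"
```

Note that the values argument of getBM also accepts a vector, so passing myGeneList in a single call per page may let you drop the inner loop altogether.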

5.4 years ago by zx8754 11k