Question: How to extract all attributes from BioMart?
Learner 180 wrote:

I want to extract all attributes. Normally I do the following to extract two attributes, in this example "ensembl_gene_id" and "hgnc_symbol":

library("biomaRt")

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mapping <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), mart = ensembl)

But I wish to extract all available attributes, any ideas?

biomart R • 826 views
modified 10 months ago by walkerbrian2470 • written 10 months ago by Learner 180

Here's my best guess. It runs into problems, though, as Ensembl seems to limit the number of attributes you can request in one query:

lapply(
  X = unique(listAttributes(ensembl)$page),
  FUN = function(attrib_page) {
    getBM(
      attributes = listAttributes(ensembl, page = attrib_page, what = "name"),
      mart = ensembl
    )
  }
)
modified 10 months ago by zx87548.2k • written 10 months ago by RamRS24k

Based on the vignette:

mapping <- getBM(attributes = listAttributes(ensembl), mart = ensembl)
written 10 months ago by WouterDeCoster41k

That does not work; I got this error:

Error in getBM(attributes = listAttributes(ensembl), mart = ensembl) : 
  Invalid attribute(s): c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_peptide_id", "ensembl_exon_id", "description", "chromosome_name", "start_position", "end_position", "strand", "band", "transcript_start", "transcript_end", "transcription_start_site", "transcript_length", "transcript_tsl", "transcript_gencode_basic", "transcript_appris_pi", "external_gene_name", "external_gene_source", "external_transcript_name", "external_transcript_source_name", "transcript_count", "percentage_gc_content", "gene_biotype", "transcript_biotype", 
"source", "transcript_source", "status", "transcript_status", "version", "transcript_version", "phenotype_description", "source_name", "study_external_id", "go_id", "name_1006", "definition_1006", "go_linkage_type", "namespace_1003", "goslim_goa_accession", "goslim_goa_description", "arrayexpress", "chembl", "clone_based_ensembl_gene_name", "clone_based_ensembl_transcript_name", "clone_based_vega_gene_name", "clone_based_vega_transcript_name
modified 10 months ago by genomax73k • written 10 months ago by Learner 180

Right, I was too fast, but we'll have an answer ready soon :)

written 10 months ago by WouterDeCoster41k
EMBL Heidelberg / de.NBI
Mike Smith1.4k wrote:

It's really not a sensible idea to try and use biomaRt to do this. Literally nothing about the system (the biomaRt package and the BioMart framework) was designed to provide data on that scale.

If you want all the data in Ensembl for a particular species try getting it from the FTP site (http://ftp.ensembl.org/) - you'll have more success.

Persisting with trying to obtain all attributes using biomaRt will actually result in even more data than if you just downloaded it. There's duplication across the BioMart pages, e.g. every page contains the gene ID, so you'll have that repeated 6 times in the example zx8754 presents. That's not so bad, but it'll actually be many more times than that, since BioMart returns completely normalised data for each query, as James MacDonald explained over on the BioC site (https://support.bioconductor.org/p/115867/#115875).
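To make the row multiplication concrete, here is a minimal offline sketch. It assumes each result row holds one combination of the requested attribute values, and it uses made-up tables (GENE_A, TX_*, GO_TERM_* are placeholders, not real identifiers) rather than a real BioMart query:

```r
# Made-up one-to-many relations for a single gene (placeholder IDs)
transcripts <- data.frame(gene = "GENE_A", transcript = paste0("TX_", 1:3))
go_terms    <- data.frame(gene = "GENE_A", go_id = paste0("GO_TERM_", 1:4))

# Flattening both relations into one table pairs every transcript
# with every GO term, so the gene ID is repeated on every row
flat <- merge(transcripts, go_terms, by = "gene")

nrow(flat)  # 3 transcripts * 4 GO terms = 12 rows
```

Each further one-to-many attribute multiplies the row count again, which is why asking for everything at once grows so quickly.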

Also consider that blindly asking for all attributes would return the nucleotide sequence for each entry over and over again, with sequences for the whole gene, each transcript (spliced and unspliced), the flanking and untranslated regions etc. This will be huge!

You're much better off either working out which attributes are pertinent to the task you're trying to do, or getting a data dump from another source and then reading that into R.
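As a rough sketch of the "data dump" route, assuming the dump is a GTF annotation file fetched from the FTP site (the exact file depends on species and release and is not specified here), base R can read the nine standard GTF columns directly. The helper name read_gtf and the tiny demo record are made up for illustration:

```r
# The nine standard GTF columns
gtf_cols <- c("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attribute")

# Hypothetical helper: read a GTF file, skipping '#' header lines
read_gtf <- function(path) {
  read.delim(path, header = FALSE, comment.char = "#", quote = "",
             col.names = gtf_cols, stringsAsFactors = FALSE)
}

# Tiny made-up file so the sketch runs without any download
demo_file <- tempfile(fileext = ".gtf")
writeLines(c(
  "#!genome-build example",
  paste("7", "example_source", "gene", "100", "200", ".", "+", ".",
        'gene_id "GENE0001"; gene_name "demo";', sep = "\t")
), demo_file)

gtf   <- read_gtf(demo_file)
genes <- gtf[gtf$feature == "gene", ]

# Pull the gene_id value out of the free-text attribute column
genes$gene_id <- sub('.*gene_id "([^"]+)".*', "\\1", genes$attribute)
```

For real files a dedicated parser (for example from Bioconductor) is likely the more robust choice; this only shows that the dump is an ordinary tab-separated table.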

written 10 months ago by Mike Smith1.4k
London
zx87548.2k wrote:

As we can't query attributes from multiple pages in a single call, we need to loop through the pages. Trying to mix pages gives this error:

Querying attributes from multiple attribute pages is not allowed. To see the attribute pages attributes belong to, use the function attributePages.
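One way to check up front whether a set of attributes can go into a single query is to look for a page that contains all of them. The sketch below works on a small hand-made table shaped like the output of listAttributes (columns name and page); the helper common_pages is hypothetical, and the rows are a stand-in rather than real query output:

```r
# Hand-made stand-in for listAttributes(ensembl): one row per attribute/page pair
attr_table <- data.frame(
  name = c("ensembl_gene_id", "hgnc_symbol", "ensembl_gene_id", "cdna"),
  page = c("feature_page", "feature_page", "sequences", "sequences"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pages on which ALL requested attributes are available
common_pages <- function(attrs, attr_table) {
  pages_per_attr <- lapply(attrs, function(a) {
    attr_table$page[attr_table$name == a]
  })
  Reduce(intersect, pages_per_attr)
}

ok  <- common_pages(c("ensembl_gene_id", "hgnc_symbol"), attr_table)
bad <- common_pages(c("hgnc_symbol", "cdna"), attr_table)

ok           # "feature_page": these two can share one query
length(bad)  # 0: no common page, so the query must be split
```

With a live connection, the real table from listAttributes(ensembl) could be passed in place of the stand-in.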

In total there are 2579 attributes:

# get attributes
x <- listAttributes(ensembl)

table(x$page)
# feature_page     homologs    sequences          snp  snp_somatic    structure 
#          195         2219           55           38           38           34 
dim(x)
# [1] 2579    3

Loop through the pages and get the attributes. Getting all attributes from all pages will not work; we might as well go and download the whole BioMart.

But we could make it work by using filters. For example, below I am querying hgnc_symbol == foxp2, and only one attribute per page, i[ 1 ]:

res <- lapply(split(x$name, x$page), function(i){
  # what you want... but will not work
  # getBM(attributes = i, mart = ensembl)

  # but could work with filters. Here we are getting one attribute per page for one gene
  getBM(attributes = i[ 1 ], filters = "hgnc_symbol", values = "foxp2", mart = ensembl)
})

The result for one gene and one attribute per page is already 7 Mb. Even if the query for all pages and all attributes did work, I doubt the result would fit in an average PC's memory.

print(object.size(res), units = "Mb")
# 7 Mb
written 10 months ago by zx87548.2k

@zx8754 brilliant! I just could not understand it completely: what is the filter for, and what is the value? Let's say I want to obtain all the entries for each gene (which will possibly be less computationally expensive); how can I use the filter and value to loop through it?

written 10 months ago by Learner 180

what is filter for and what is value?

The dataset fetched will be subset by filter == value, in this case hgnc_symbol == 'foxp2'. getBM usually works with attributes, filters and values, applying them on a global dataset. It subsets the global dataset by matching the value parameter to the filter attribute, then fetches all attributes specified in the attribute parameter. In SQL terms, SELECT $attribute FROM mart WHERE $filter = $value. You're trying to run a SELECT * FROM mart, which is not allowed.
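The SELECT ... WHERE analogy can be tried offline on an ordinary data frame. This is only an illustration of the semantics, not of how biomaRt works internally; select_where is a hypothetical helper and the gene IDs in the table are placeholders:

```r
# Small local table standing in for the "global dataset" (IDs are placeholders)
mart_like <- data.frame(
  ensembl_gene_id = c("GENE_A", "GENE_B", "GENE_C"),
  hgnc_symbol     = c("FOXP2", "BRCA1", "BRCA2"),
  chromosome_name = c("7", "17", "13"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring getBM's argument names:
# keep rows where `filter` matches one of `values`, return `attributes`
select_where <- function(data, attributes, filter, values) {
  data[data[[filter]] %in% values, attributes, drop = FALSE]
}

# Roughly: SELECT hgnc_symbol, chromosome_name FROM mart
#          WHERE hgnc_symbol IN ('FOXP2', 'BRCA1')
res_local <- select_where(
  mart_like,
  attributes = c("hgnc_symbol", "chromosome_name"),
  filter     = "hgnc_symbol",
  values     = c("FOXP2", "BRCA1")
)
```

The arguments line up with getBM(attributes = ..., filters = ..., values = ..., mart = ...): the filter names the column to match on, and the values are what it is matched against.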

modified 10 months ago • written 10 months ago by RamRS24k

Something like below (not tested):

myGeneList <- c("foxp2", "brca1", "brca2")

res <- lapply(split(x$name, x$page), function(i){
  lapply(myGeneList, function(j){
    getBM(attributes = i[ 1 ], filters = "hgnc_symbol", values = j, mart = ensembl)
  })
})
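Since the values argument of getBM accepts a vector, the inner loop over genes can probably be folded into a single query per page. The sketch below is one way to write that; first_attr_per_page and its query_fun argument are made up for illustration, and the demo uses a stand-in function instead of a live getBM call so it runs without a network connection:

```r
# Hypothetical wrapper: one query per attribute page, all genes at once.
# query_fun defaults to biomaRt::getBM but can be swapped for a stand-in.
first_attr_per_page <- function(attr_table, genes, mart,
                                query_fun = biomaRt::getBM) {
  lapply(split(attr_table$name, attr_table$page), function(page_attrs) {
    query_fun(attributes = page_attrs[1], filters = "hgnc_symbol",
              values = genes, mart = mart)
  })
}

# Offline demo with a stand-in for getBM (assumption: same argument names)
fake_getBM <- function(attributes, filters, values, mart) {
  data.frame(attribute = attributes, n_values = length(values),
             stringsAsFactors = FALSE)
}
demo_attrs <- data.frame(name = c("a", "b", "c"),
                         page = c("p1", "p1", "p2"),
                         stringsAsFactors = FALSE)
demo <- first_attr_per_page(demo_attrs, c("foxp2", "brca1", "brca2"),
                            mart = NULL, query_fun = fake_getBM)

# With a live connection the call would look like:
# res <- first_attr_per_page(x, myGeneList, mart = ensembl)
```

The result is a list with one data frame per page, each covering all genes, rather than a nested list per page and gene.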
written 10 months ago by zx87548.2k