Question: How to extract all attributes from BioMart?
Learner 110 wrote:

I want to extract all attributes. Normally I do the following to extract two attributes, in this example "ensembl_gene_id" and "hgnc_symbol":

library("biomaRt")

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mapping <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), mart = ensembl)

But I wish to extract all available attributes. Any ideas?

biomart R • 171 views
modified 8 days ago by walkerbrian2470 • written 9 days ago by Learner 110

Here's my best guess. It runs into problems, though, as Ensembl seems to limit the number of attributes you can request in one query:

# one getBM() call per attribute page, requesting every attribute on that page
lapply(X = unique(listAttributes(ensembl)$page),
       FUN = function(attrib_page) {
         getBM(
           attributes = listAttributes(ensembl, page = attrib_page, what = "name"),
           mart = ensembl
         )
       })
modified 9 days ago by zx8754 (6.1k) • written 9 days ago by RamRS (19k)

Based on the vignette:

mapping <- getBM(attributes = listAttributes(ensembl), mart = ensembl)
written 9 days ago by WouterDeCoster (35k)

That does not work. I got this error:

Error in getBM(attributes = listAttributes(ensembl), mart = ensembl) : 
  Invalid attribute(s): c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_peptide_id", "ensembl_exon_id", "description", "chromosome_name", "start_position", "end_position", "strand", "band", "transcript_start", "transcript_end", "transcription_start_site", "transcript_length", "transcript_tsl", "transcript_gencode_basic", "transcript_appris_pi", "external_gene_name", "external_gene_source", "external_transcript_name", "external_transcript_source_name", "transcript_count", "percentage_gc_content", "gene_biotype", "transcript_biotype", 
"source", "transcript_source", "status", "transcript_status", "version", "transcript_version", "phenotype_description", "source_name", "study_external_id", "go_id", "name_1006", "definition_1006", "go_linkage_type", "namespace_1003", "goslim_goa_accession", "goslim_goa_description", "arrayexpress", "chembl", "clone_based_ensembl_gene_name", "clone_based_ensembl_transcript_name", "clone_based_vega_gene_name", "clone_based_vega_transcript_name
modified 9 days ago by genomax (59k) • written 9 days ago by Learner 110

Right, I was too fast, but we'll have an answer ready soon :)

written 9 days ago by WouterDeCoster (35k)
EMBL Heidelberg / de.NBI
Mike Smith (950) wrote:

It's really not a sensible idea to try and use biomaRt to do this. Literally nothing about the system (the biomaRt package and the BioMart framework) was designed to provide data on that scale.

If you want all the data in Ensembl for a particular species try getting it from the FTP site (http://ftp.ensembl.org/) - you'll have more success.

Persisting with trying to obtain all attributes using biomaRt will actually result in even more data than if you just downloaded it. There's duplication on every BioMart page, e.g. every page contains the gene ID, so you'll have that repeated 6 times in the example zx8754 presents. That's not so bad, but it'll actually be many more times, since BioMart returns completely normalised data for each query, as James MacDonald explained over on the BioC site (https://support.bioconductor.org/p/115867/#115875).
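
To give a feel for how quickly the row count can grow, here is a small offline sketch. It does not query BioMart: the two tables and their IDs are made up, and it assumes (as I read the linked explanation) that multi-valued attributes come back as one row per combination.

# Toy tables with made-up IDs, not real Ensembl data:
# one gene with 3 transcripts and 4 GO terms.
transcripts <- data.frame(
  gene_id       = "gene1",
  transcript_id = paste0("tx", 1:3)
)
go_terms <- data.frame(
  gene_id = "gene1",
  go_id   = paste0("GO:", 1:4)
)

# Joining on gene_id pairs every transcript with every GO term.
combined <- merge(transcripts, go_terms, by = "gene_id")
nrow(combined)
# [1] 12

Each extra multi-valued attribute multiplies the row count again, which is why asking for everything at once gets out of hand.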

Also consider that blindly asking for all attributes would return the nucleotide sequence for each entry over and over again, with sequences for the whole gene, each transcript (spliced and unspliced), the flanking and untranslated regions etc. This will be huge!

You're much better off either working out which attributes are pertinent to the task you're trying to do, or getting a data dump from another source and then reading that into R.
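
One way to narrow down the pertinent attributes is to search the attribute table by keyword. The helper below is only a sketch: find_attributes() is a hypothetical name, and toy_attributes is a made-up stand-in for the three-column data frame (name, description, page) that listAttributes() returns, so the example runs without a connection.

# Hypothetical helper: keep rows whose name or description matches a pattern.
find_attributes <- function(attribute_table, pattern) {
  hits <- grepl(pattern, attribute_table$name, ignore.case = TRUE) |
    grepl(pattern, attribute_table$description, ignore.case = TRUE)
  attribute_table[hits, , drop = FALSE]
}

# Made-up stand-in for listAttributes(ensembl)
toy_attributes <- data.frame(
  name        = c("ensembl_gene_id", "hgnc_symbol", "go_id", "chromosome_name"),
  description = c("Gene stable ID", "HGNC symbol", "GO term accession",
                  "Chromosome/scaffold name"),
  page        = "feature_page"
)

gene_hits <- find_attributes(toy_attributes, "gene")
gene_hits$name
# [1] "ensembl_gene_id"

With a live connection the same call would be find_attributes(listAttributes(ensembl), "gene"), and the matching names can then be passed to getBM().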

written 8 days ago by Mike Smith (950)
London
zx8754 (6.1k) wrote:

We can't query attributes from multiple pages in a single call. biomaRt stops with this error:

Querying attributes from multiple attribute pages is not allowed. To see the attribute pages attributes belong to, use the function attributePages.

So we need to loop through the pages.

We have in total 2579 attributes:

# get attributes
x <- listAttributes(ensembl)

table(x$page)
# feature_page     homologs    sequences          snp  snp_somatic    structure 
#          195         2219           55           38           38           34 
dim(x)
# [1] 2579    3

Loop through pages, and get attributes. Getting all attributes from all pages will not work. We might as well go and download the whole biomart.

But we could make it work by using filters. For example, below I query hgnc_symbol == foxp2, and I request only one attribute per page, i[ 1 ]:

res <- lapply(split(x$name, x$page), function(i){
  # what you want... but will not work
  # getBM(attributes = i, mart = ensembl)

  # but could work with filters. Here we are getting one attribute per page for one gene
  getBM(attributes = i[ 1 ], filters = "hgnc_symbol", values = "foxp2", mart = ensembl)
})

Even for one gene and one attribute per page, the result object is 7 Mb. Even if the query for all pages and all attributes did work, I doubt the result would fit in the memory of an average PC.

print(object.size(res), units = "Mb")
# 7 Mb
written 9 days ago by zx8754 (6.1k)

@zx8754 brilliant! I just could not understand it completely: what is the filter for, and what is the value? Let's say I want to obtain all the entries for each gene (which would possibly be less computationally expensive). How can I use the filter and value to loop through the genes?

written 9 days ago by Learner 110

what is filter for and what is value?

The dataset fetched will be subset by filter == value, in this case hgnc_symbol == 'foxp2'. getBM usually works with attributes, filters and values, applying them to a global dataset. It subsets the global dataset by matching the values parameter against the field named in filters, then fetches all attributes specified in the attributes parameter. In SQL terms, SELECT $attribute FROM mart WHERE $filter = $value. You're trying to run a SELECT * FROM mart, which is not allowed.

modified 8 days ago • written 8 days ago by RamRS (19k)
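
The filter == value idea can be tried offline on a toy table. This is only an illustration of the semantics described above, not how biomaRt is implemented: toy_mart and toy_getBM() are made-up names, the IDs are invented, and the unique() call is meant to mimic getBM()'s default of returning unique rows.

# Made-up stand-in for the "global dataset"
toy_mart <- data.frame(
  hgnc_symbol     = c("FOXP2", "FOXP2", "BRCA1"),
  ensembl_gene_id = c("g1", "g1", "g2"),
  transcript_id   = c("t1", "t2", "t3")
)

# SELECT attributes FROM mart WHERE filters IN (values)
toy_getBM <- function(attributes, filters, values, mart) {
  keep <- mart[[filters]] %in% values
  unique(mart[keep, attributes, drop = FALSE])
}

foxp2_rows <- toy_getBM(
  attributes = c("ensembl_gene_id", "transcript_id"),
  filters    = "hgnc_symbol",
  values     = "FOXP2",
  mart       = toy_mart
)
nrow(foxp2_rows)
# [1] 2

Only the two FOXP2 rows come back, and only the two requested columns.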

Something like below (not tested):

myGeneList <- c("foxp2", "brca1", "brca2")

res <- lapply(split(x$name, x$page), function(i){
  lapply(myGeneList, function(j){
    getBM(attributes = i[ 1 ], filters = "hgnc_symbol", values = j, mart = ensembl)
  })
})
written 9 days ago by zx8754 (6.1k)
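
A possible variant, also untested against the live service: the values argument of getBM() accepts a vector, so the inner loop over genes can be dropped and each page queried once for the whole gene list. fetch_per_page() and its query_fun argument are hypothetical names introduced here; query_fun defaults to biomaRt::getBM and exists only so the function can be tried with a stand-in.

# Hypothetical wrapper: one query per attribute page for all genes at once.
# attribute_table is expected to look like the output of listAttributes().
fetch_per_page <- function(attribute_table, genes, mart,
                           query_fun = biomaRt::getBM) {
  pages <- split(attribute_table$name, attribute_table$page)
  lapply(pages, function(page_attributes) {
    query_fun(
      attributes = page_attributes[1],  # first attribute of the page only
      filters    = "hgnc_symbol",
      values     = genes,               # vector of symbols, one query
      mart       = mart
    )
  })
}

# Intended use (needs a connection):
# res <- fetch_per_page(x, myGeneList, mart = ensembl)

Note that the rows for all genes come back together in each result, so you may want to request an identifying attribute as well in order to tell the genes apart.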