Question: Biomart Annotation
int11ap1 (Barcelona) wrote, 6.2 years ago:

Good evening,

I have a vector (in R) of probes from an Affymetrix microarray. For each probe I would like to find the Ensembl ID, the gene name (HGNC), the gene length and the GC content, using the biomaRt package in R. This is what I have so far:

library(biomaRt)

# Connect to the Ensembl human gene dataset
data <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
# Finding Ensembl IDs
ensemblids <- getBM(attributes=c("ensembl_gene_id"), filters=c("affy_hg_u133a"), values=probes, mart=data)
# Finding gene name (hgnc), gene length and GC-content
dframe <- getBM(attributes=c("hgnc_symbol", "percentage_gc_content"), filters=c("ensembl_gene_id"), values=ensemblids$ensembl_gene_id, mart=data)

However, as you can see, I only obtain the gene name and the GC content, because I cannot find any attribute that gives the gene length. Do you know how to solve this? One more thing: my vector contains 22,000 probes, but ensemblids contains only 16,000 Ensembl IDs. Why is that?

Thanks in advance.

Emily_Ensembl (EMBL-EBI) wrote, 6.2 years ago:

Neil is right. There isn't a 1:1 mapping between Affy probes and Ensembl IDs. Several probes can map to the same gene, particularly if that gene is quite large. Depending on your chip, probes may not map to genes at all. Another source of confusion may be the way that we handle probes in our database. We don't use the files from Affy that state which probe goes with which gene. Instead, we map the sequences of their probes to the genome and see where they map to genes. This may also lead to us reporting different genes for a probe than Affy does. There's a help page that explains this.
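To see how this plays out for a given probe list, it can help to request the probe ID alongside the gene ID and then count. The following is only a rough sketch on invented toy data: the probe and gene IDs are made up, the `mapping` data frame stands in for a real query result, and the commented-out `getBM` call assumes that `affy_hg_u133a` is requested as an attribute as well as used as a filter.

```r
# Toy data only: probe and gene IDs below are invented for illustration.
probes <- c("probe_1", "probe_2", "probe_3", "probe_4", "probe_5")

# Stand-in for the result of a query such as (not run here):
#   getBM(attributes = c("affy_hg_u133a", "ensembl_gene_id"),
#         filters = "affy_hg_u133a", values = probes, mart = data)
mapping <- data.frame(
  affy_hg_u133a   = c("probe_1", "probe_2", "probe_3"),
  ensembl_gene_id = c("GENE_A", "GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)

# Probes that returned no gene at all
unmapped <- setdiff(probes, mapping$affy_hg_u133a)

# Number of distinct genes, which can be smaller than the number of probes
n_genes <- length(unique(mapping$ensembl_gene_id))

# How many probes hit each gene
probes_per_gene <- table(mapping$ensembl_gene_id)
```

In this toy case five probes collapse to two genes: two probes share a gene and two have no match, which is the same kind of shrinkage as going from 22,000 probes to 16,000 IDs.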


Thank you, Emily!

Reply by int11ap1, 6.2 years ago.

Neilfws (Sydney, Australia) wrote, 6.2 years ago:

1) You can get a data frame containing all attributes like this:

attrs <- listAttributes(data)

Then grep for attributes whose name contains "length". CDS Length might be useful?

attrs[grep("length", attrs$name),]
#            name description
# 149  cds_length  CDS Length
# 1764 cds_length  CDS Length

2) The short, unsatisfying answer is that for various reasons, not every HGNC symbol maps directly to an Ensembl Gene ID. I'm sure Emily_Ensembl can tell you more about that.


Hi Neilfws, using cds_length (which is not actually what I am looking for: gene length != CDS length) I get an error:

Error in getBM(attributes = c("hgnc_symbol", "cds_length", "percentage_gc_content"), :
  Query ERROR: caught BioMart::Exception::Usage: Attributes from multiple attribute pages are not allowed

Reply by int11ap1, 6.2 years ago.

There are different sections that you can get attributes from. To see how this is structured, have a look at the BioMart browser tool.

We don't actually have gene length as an attribute, but you can get the start and end coordinates, then just do some arithmetic. The start and end are in the same section as the other attributes you need, so you can get everything you need in a single query.
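As a minimal sketch of that arithmetic, assuming the coordinates come back in columns named `start_position` and `end_position` and that they are 1-based and inclusive (hence the `+ 1`): the `genes` data frame below is invented toy data standing in for a real query result, and the commented-out `getBM` call is one guess at what the single query might look like.

```r
# Stand-in for the result of a single query such as (not run here):
#   getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "start_position",
#                        "end_position", "percentage_gc_content"),
#         filters = "ensembl_gene_id",
#         values = ensemblids$ensembl_gene_id, mart = data)
# All IDs, symbols, coordinates and GC values below are invented.
genes <- data.frame(
  ensembl_gene_id       = c("GENE_A", "GENE_B"),
  hgnc_symbol           = c("SYM_A", "SYM_B"),
  start_position        = c(1000, 5000),
  end_position          = c(2000, 5499),
  percentage_gc_content = c(41.2, 55.0),
  stringsAsFactors = FALSE
)

# Gene length as the genomic span from start to end, both ends included
genes$gene_length <- genes$end_position - genes$start_position + 1
```

Note that this is the span of the gene on the genome, introns included, so it will generally be larger than the CDS length.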

Reply by Emily_Ensembl, 6.2 years ago.

Google that error; it's quite common and means that you're trying to query tables that are not linked. You'll need to do 2 separate queries, then merge the results.
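A rough sketch of the "two queries, then merge" approach, using invented toy data in place of the two `getBM` results. It assumes both queries also request `ensembl_gene_id` so that there is a shared column to join on; that choice of key is an assumption here, not something stated above.

```r
# Stand-in for query 1 (gene-level attributes). All values are invented.
gene_info <- data.frame(
  ensembl_gene_id       = c("GENE_A", "GENE_B", "GENE_C"),
  hgnc_symbol           = c("SYM_A", "SYM_B", "SYM_C"),
  percentage_gc_content = c(41.2, 55.0, 47.8),
  stringsAsFactors = FALSE
)

# Stand-in for query 2 (attributes from a different page, e.g. cds_length).
# A gene can appear more than once here, e.g. once per transcript.
cds_info <- data.frame(
  ensembl_gene_id = c("GENE_A", "GENE_A", "GENE_B"),
  cds_length      = c(900, 1200, 300),
  stringsAsFactors = FALSE
)

# Join on the shared ID; all.x = TRUE keeps genes with no CDS row (as NA)
merged <- merge(gene_info, cds_info, by = "ensembl_gene_id", all.x = TRUE)
```

With `all.x = TRUE`, genes missing from the second result are kept with `NA` rather than silently dropped, which makes gaps easier to spot.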

Reply by Neilfws, 6.2 years ago.