Biomart for getting human RNA binding proteins
9 months ago
iibrams07 ▴ 10

I would like to retrieve human RNA binding proteins from Ensembl by means of Biomart.

How can I do that, i.e. what should the R command look like? I tried the following, which did not work.

library(biomaRt)
ensembl = useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
results <- getBM(attributes=c("ensembl_gene_id","hgnc_symbol","transcript_biotype"),filters = c("transcript_biotype"), values=list("RNA_binding"), mart=ensembl)


Thanks.

9 months ago
Mike Smith ★ 2.0k

Here's one way to do this using biomaRt, based on Michael Dondrup's information that you need to use GO ID GO:0003723.

library(biomaRt)
## I'm using the uswest mirror as the main Ensembl site is very slow today
human <- useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl", mirror = "uswest")
results <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "ensembl_transcript_id", "transcript_biotype"),
                 filters = c("go"),
                 values = list("GO:0003723"),
                 mart = human)

dim(results)
#> [1] 5671    4

head(results)
#>   ensembl_gene_id hgnc_symbol ensembl_transcript_id transcript_biotype
#> 1 ENSG00000262860      LSM14A       ENST00000575811     protein_coding
#> 2 ENSG00000275051      PIWIL1       ENST00000613226     protein_coding
#> 3 ENSG00000275051      PIWIL1       ENST00000632888     protein_coding
#> 4 ENSG00000262860      LSM14A       ENST00000570462     protein_coding
#> 5 ENSG00000278229       RPS17       ENST00000617731     protein_coding
#> 6 ENSG00000262156    APOBEC3A       ENST00000623492     protein_coding


That finds all transcripts of genes that are directly annotated with the GO ID. You might also be interested in genes that are annotated with a child of "RNA binding" in the GO hierarchy. For that we can use the filter go_parent_term instead. You can see that this returns more results and that all of our first hits can be found in this second set of results (as you'd expect).

results2 <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "ensembl_transcript_id", "transcript_biotype"),
                  filters = c("go_parent_term"),
                  values = list("GO:0003723"),
                  mart = human)

dim(results2)
#> [1] 8077    4

table(results$ensembl_gene_id %in% results2$ensembl_gene_id)
#>
#> TRUE
#> 5671


Many thanks. It worked fine. I might need a slight modification. If I wanted to include only rows with a unique ENSG ID, how should I change the attributes of getBM? I tried "ensembl_gene_id = unique" but it did not work.


I think it depends on exactly how much information you want. The information you get back from Ensembl tends to be "unique rows", meaning each distinct combination of the requested attributes gets its own row. So in the example above, if a gene has two transcripts it will appear twice. Similarly, if it is annotated with two biotypes it will also appear twice.

If you're really just interested in the Ensembl IDs of genes annotated with that GO term, you can ask for just the ensembl_gene_id attribute, e.g.

rna_binding_genes <- getBM(attributes = c("ensembl_gene_id"),
                           filters = c("go"),
                           values = list("GO:0003723"),
                           mart = human)

## How many gene IDs does this return?
dim(rna_binding_genes)
#> [1] 1725    1

## Are they all unique?
length(unique(rna_binding_genes$ensembl_gene_id))
#> [1] 1725
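
If you have already downloaded the transcript-level table and would rather not query Ensembl again, a rough alternative is to collapse it in R. The sketch below uses a small toy data frame (results_toy, a made-up name) built from the rows of the head() output shown earlier in the thread, so it runs without a network connection; it is an illustration of the idea, not the only way to do it.

```r
# Toy stand-in for the transcript-level table returned by getBM().
# The rows are copied from the head() output earlier in the thread;
# 'results_toy' is a hypothetical name used only for this example.
results_toy <- data.frame(
  ensembl_gene_id = c("ENSG00000262860", "ENSG00000275051", "ENSG00000275051",
                      "ENSG00000262860", "ENSG00000278229", "ENSG00000262156"),
  hgnc_symbol = c("LSM14A", "PIWIL1", "PIWIL1", "LSM14A", "RPS17", "APOBEC3A"),
  ensembl_transcript_id = c("ENST00000575811", "ENST00000613226", "ENST00000632888",
                            "ENST00000570462", "ENST00000617731", "ENST00000623492"),
  transcript_biotype = "protein_coding",
  stringsAsFactors = FALSE
)

# Drop the transcript-level columns first, then keep one row per gene.
gene_cols <- c("ensembl_gene_id", "hgnc_symbol")
genes <- unique(results_toy[, gene_cols])
rownames(genes) <- NULL

nrow(genes)
#> [1] 4
```

Dropping the transcript columns before calling unique() matters: with them still present every row is already distinct, so nothing would be removed.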


What bothers me a bit now is the difference in gene counts between the web-interface (3467) and biomaRt (1725). Possibly the web-interface counts the genes differently?


This is because there are two possible filters, both of which look like "Search by GO ID".

The 1725 results come from picking "GO ID(s)":

The 3467 hits are the result of choosing "GO Term Accession":

It's equivalent to the filters go vs go_parent_term in the biomaRt examples above, and it is super confusing!

The difference is that the second version finds genes like ENSG00000210049, which is annotated with GO:0030533 (triplet codon-amino acid adaptor activity), a child of GO:0003723 (RNA binding). The first doesn't return a hit like that.
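
To see which genes are picked up only through child terms, you can compare the two sets of gene IDs. Below is a minimal sketch with a hypothetical helper, only_via_children(), run on toy ID vectors so it works offline; the assignment of the toy IDs to each set is illustrative, apart from ENSG00000210049, which is the example discussed above.

```r
# Hypothetical helper: gene IDs returned by the parent-term filter
# that are absent from the direct-annotation filter.
only_via_children <- function(direct_ids, parent_term_ids) {
  setdiff(unique(parent_term_ids), unique(direct_ids))
}

# Toy vectors standing in for the ensembl_gene_id columns of the two queries.
direct_ids <- c("ENSG00000262860", "ENSG00000275051")
parent_term_ids <- c(direct_ids, "ENSG00000210049")

only_via_children(direct_ids, parent_term_ids)
#> [1] "ENSG00000210049"
```

With the real tables from the answer above, the equivalent call would be only_via_children(results$ensembl_gene_id, results2$ensembl_gene_id).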

9 months ago

One could use the GO annotation for the molecular_function term RNA_binding. You can then simply download the gene list from AmiGO: http://amigo.geneontology.org/amigo/term/GO:0003723 and set the organism filter to Homo sapiens.

If you want to download from Ensembl Biomart instead, use this GO-term in the Filter settings of Biomart. The gene list might be slightly different between these two approaches but it should not be more than a handful.


Thanks. I need to use Ensembl. For Dataset I choose Human genes (GRCh38.p13). In the Filters, under Gene Ontology, I select GO Term Name and, following your suggestion, enter RNA_binding. For Attributes I select Gene stable ID and Gene name. I then click on Results. What I get is just an empty dataset consisting of two columns with the corresponding column names. I do not see why this is the case.


You need to use the GO ID, i.e. GO:0003723, not the name.


When I enter the ID GO:0003723 in the GO Term Accession filter, I get a list of only 10 genes. So, again, something is going wrong. Is it not easier to do this in R? As for the AmiGO option, it is good, but I need ENSGs as gene IDs, which are not used in AmiGO.


You need to press the Count button to see how many results there really are. There are over 3400 genes. Only the preview is limited to 10 rows by default, and this can be changed. When you download the results, you will get all of them.