Question: How to get just protein_coding genes using biomart in R
3.9 years ago, M K (United States) wrote:

Dear all,

I would like help extracting only the protein_coding genes from a gene expression file using biomaRt. What I have is a file of expression values for all mouse (mm10) genes, identified by Ensembl gene names, and I need to get rid of the non-coding genes and pseudogenes.

sequencing rna-seq R • 2.2k views
3.9 years ago, cyril-cros wrote:

You can go to Ensembl BioMart and select the following attributes in the Gene section: Gene type, Transcript type. "protein_coding" is the value you want. Then just run something like `grep "protein_coding" biomartResults.txt` on the exported results and you should be set.


I already have my file with all the genes in it, and I want to use R to keep only the protein_coding genes from that file.

written 3.9 years ago by M K

You can really use anything for that. If you want to do it in R:

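# Assumes library(biomaRt) is loaded, ensembl = useMart("ensembl"), and that
# shouldImport (logical) and saveFile (path to an .Rda file) are defined beforehand.
if (!shouldImport || !file.exists(saveFile)){
  print("Querying Biomart for protein coding genes")
  ensemblMouse = useDataset("mmusculus_gene_ensembl",mart=ensembl)
  mouseProteinCodingGenes = getBM(attributes=c("ensembl_gene_id","external_gene_name","description"), filters='biotype', values=c('protein_coding'), mart=ensemblMouse)
  save(mouseProteinCodingGenes, file=saveFile)  # cache the result (assumed; this part was cut off in the post)
} else {
  print("Loading genes from savefile")
  load(saveFile)  # assumed: restores mouseProteinCodingGenes from the cache
}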

The only useful part is the one that queries Ensembl; the rest just saves the result of your Biomart query to a file so it can be loaded again (querying Biomart takes a bit of time). biomaRt is the R library you want: you specify which mart you are using, then use getBM to request the listed attributes for every entry whose filters match the terms in values. listAttributes() does what its name says: it lists the attributes available in a mart.
The rest is just data frame manipulation.

modified 3.9 years ago • written 3.9 years ago by cyril-cros
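
The "data frame manipulation" mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: both data frames are toy stand-ins with invented IDs and values, `expr` plays the role of the expression table and is assumed to have an `ensembl_gene_id` column, and `mouseProteinCodingGenes` stands in for the result of the getBM call.

```r
# Toy stand-in for the getBM() result (IDs are invented for illustration).
mouseProteinCodingGenes <- data.frame(
  ensembl_gene_id    = c("toyGene1", "toyGene3"),
  external_gene_name = c("GeneA", "GeneC"),
  stringsAsFactors   = FALSE
)

# Toy stand-in for the user's expression table (values are invented).
expr <- data.frame(
  ensembl_gene_id  = c("toyGene1", "toyGene2", "toyGene3", "toyGene4"),
  sample1          = c(10.5, 0.0, 3.2, 7.1),
  stringsAsFactors = FALSE
)

# Keep only the rows whose gene ID appears in the protein-coding list.
isCoding   <- expr$ensembl_gene_id %in% mouseProteinCodingGenes$ensembl_gene_id
exprCoding <- expr[isCoding, , drop = FALSE]

print(exprCoding)
```

If the expression file is keyed by gene symbols rather than Ensembl IDs, the same comparison would be made against the `external_gene_name` column instead.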

I ran your R code and it worked, but I had to replace ensembl=useMart("ensembl") with

myMart <- useMart("ENSEMBL_MART_ENSEMBL",dataset="mmusculus_gene_ensembl", host="")

because of some changes in the Ensembl proxy. The remaining issue is that I couldn't read the .Rda file. Is there any way to save this result as text? What I need to do is use the merge function in R to merge it with my own file, so that only the protein_coding genes remain in mine.

written 3.9 years ago by M K
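
One possible way to do what is asked above, sketched under assumptions: the getBM result is an ordinary data frame, so it can be written out as tab-separated text with write.table and joined to the expression table with merge. The output path, the toy data, and the `ensembl_gene_id` key column in the expression table are all hypothetical.

```r
# Toy stand-ins (invented IDs and values); in practice these would be the
# getBM() result and the user's own expression table.
mouseProteinCodingGenes <- data.frame(
  ensembl_gene_id    = c("toyGene1", "toyGene3"),
  external_gene_name = c("GeneA", "GeneC"),
  stringsAsFactors   = FALSE
)
expr <- data.frame(
  ensembl_gene_id  = c("toyGene1", "toyGene2", "toyGene3"),
  sample1          = c(10.5, 0.0, 3.2),
  stringsAsFactors = FALSE
)

# Save the gene list as tab-separated text instead of .Rda.
outFile <- tempfile(fileext = ".txt")  # hypothetical output path
write.table(mouseProteinCodingGenes, file = outFile, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back and inner-join on the shared ID column; genes absent from the
# protein-coding list are dropped because merge() defaults to all = FALSE.
codingFromText <- read.delim(outFile, stringsAsFactors = FALSE)
exprCoding <- merge(expr, codingFromText, by = "ensembl_gene_id")

print(exprCoding)
```

The merged table keeps the expression columns and gains the annotation columns, so the gene names come along with the filtered rows.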

Thanks. Worked perfectly!

written 10 months ago by SmallChess