Retrieve All Genes Contained Within A Specific Chromosomal Region Using R And Biomart
1
8
Entering edit mode
11.1 years ago
secretjess ▴ 210

EDIT: Now fixed, using filters = c("chromosome_name","start","end","biotype") instead. Question still stands though if anyone knows?

Original question: I've written some code which I think should take chromosomal coordinates (e.g. X:1-200000) and return protein coding genes (e.g. PLCXD1 and LINC00108) within that region on NCBI36.

I'm happy for it to return a gene only partially spanning the region or only genes contained entirely within that region at this stage, but I have no idea what it is actually returning.

Here's some simplified code of what I'm doing:

rm(list=ls())
library("biomaRt")
ensembl54=useMart("ENSEMBL_MART_ENSEMBL", host="may2009.archive.ensembl.org/biomart/martservice/", dataset="hsapiens_gene_ensembl")

chr.region = c("18:19092052-30289323","16:77462979-77500915","X:146419715-146584279","4:32776556-33393589","2:187947442-188506618")

entrez.ids=vector() 
entrez.count=vector()
all.results=data.frame() 

for (i in 1:length(chr.region)){
  filterlist=list(chr.region[i],"protein_coding")

  results=getBM(attributes = c("hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position"), 
               filters = c("chromosomal_region","biotype"), 
               values = filterlist, 
               mart = ensembl54)

  results$region = chr.region[i]
  all.results=rbind(all.results,results)

  ids=unique(results$entrezgene) 
  ids <- ids[!is.na(ids)] 

  entrez.ids[i]=paste(ids, sep=",", collapse=",") 
  entrez.count[i]=unique(length(ids)) 
}

write.csv(all.results, file="all_results.csv",row.names=F)

Many thanks for any advice.

biomart bioconductor ensembl • 11k views
ADD COMMENT
1
Entering edit mode

Could you post a few lines from the csv file that you are writing ?

ADD REPLY
0
Entering edit mode

Hi Sudeep, here are two lines:

hgnc_symbol    entrezgene    chromosome_name    start_position    end_position    region
PARD6G    84552    18    76016114    76106388    18:19092052-30289323
ADNP2    22850    18    75967903    75999217    18:19092052-30289323

My question is now: if the imput region is between 19092052 and 30289323 why does the chromosomal_region filter return a gene at 75967903? Basically I am unsure what that filter is for and I'm just curious :)

ADD REPLY
0
Entering edit mode

Your code looks fine now that you have added the filters. Do you have any other doubt, or question?

ADD REPLY
0
Entering edit mode

Thanks Giovanni - I'm just wondering what the chromosomal_region filter does?

ADD REPLY
8
Entering edit mode
11.1 years ago
Sudeep ★ 1.7k

Hi the issue was that chromosomal_region should be given as 18:19092052:30289323 and not 18:19092052-30289323

try this you will get the results you want

filterlist <- list("18:19092052:30289323","protein_coding")

results=getBM(attributes = c("hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position"),filters = c("chromosomal_region","biotype"),values = filterlist, mart = ensembl54)

when you give incorrectly formatted chromosomal_region like 18:19092052-30289323 biomart takes the first value separated by ":" which in this case is 18 (chromosome) and returns all the protein coding genes on Chr.18

ADD COMMENT
0
Entering edit mode

Ahha, makes sense - many thanks! :)

ADD REPLY

Login before adding your answer.

Traffic: 3111 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6