Question: How to get all genes for a specific list of regions in R preferably using Biomart
1
gravatar for sebastianzeki0
3.0 years ago by
United Kingdom
sebastianzeki0120 wrote:

I have a list of homo sapiens genomic loci. I would like to see what genes lie within the region of my loci, say give or take 1Mb.

I have written a script in R and am trying to use Biomart to do this.

 

So far I have my list called filterlist

list("1:4864876:5864876", "1:14283067:15283067", "1:21786817:22786817", 
    "1:33465769:34465769", "1:45300539:46300539", "1:54333815:55333815", 
    "1:65114236:66114236", "1:75194833:76194833", "1:86037468:87037468", 
    "1:96462256:97462256", "1:105259436:106259436", "1:116234756:117234756", 
    "1:120842170:121842170", "1:145808064:146808064", "1:155459582:156459582", 
    "1:166112356:167112356", "1:174453227:175453227", "1:185347260:186347260", 
    "1:194299241:195299241", "1:205731116:206731116")

 

and my code which admittedly is a bit of a cut and paste job but which doesn't seem to give me any output at all

library("biomaRt")
listMarts(host="www.ensembl.org")
ensembl = useMart(biomart = "ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl", host = "jul2015.archive.ensembl.org")
filters = listFilters(ensembl)
filterlist<-as.list(HMapSamp$MergeCol)
results=getBM(attributes = c("hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position"),filters = c("chromosomal_region","biotype"),values = filterlist, mart = ensembl)

 

I just want to get the names genes lying within (or intersecting with) the regions in my list

 

ADD COMMENTlink modified 3.0 years ago by Emily_Ensembl16k • written 3.0 years ago by sebastianzeki0120
9
gravatar for Giovanni M Dall'Olio
3.0 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

It's very simple in R, but I would use the Annotation packages rather than biomaRt, so you don't have to send a query through the wires each time. First of all, install Homo sapiens from Bioconductor:

> source("https://bioconductor.org/biocLite.R")
> biocLite("Homo.sapiens")
> library(Homo.sapiens)

This will also install a TxDb object for homo sapiens, called TxDb.Hsapiens.UCSC.hg19.knownGene.

To get all the gene coordinates, use the genes() function:

> genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

To get the intersection with your coordinates, you must first convert them to a GenomicRanges list. The easiest way is to convert the list to a dataframe, and then use the makeGRangesFromDataFrame function, which should have already been loaded when with Homo.sapiens. Remember to append the prefix 'chr' to your chromosome names. Coordinates should be 1-based.

> library(dplyr)
> mycoords.gr = lapply(mycoords.list, function (x) {res=strsplit(x, ':')}) %>%
   unlist %>%
   as.numeric %>%
   matrix(ncol=3, byrow=T) %>%
   as.data.frame %>%
   select(chrom=V1, start=V2, end=V3) %>%
   mutate(chrom=paste0('chr', chrom)) %>%
   makeGRangesFromDataFrame

> mycoords.gr
GRanges object with 20 ranges and 0 metadata columns:
       seqnames                 ranges strand
          <Rle>              <IRanges>  <Rle>
   [1]     chr1   [ 4864876,  5864876]      *
   [2]     chr1   [14283067, 15283067]      *
   [3]     chr1   [21786817, 22786817]      *
   [4]     chr1   [33465769, 34465769]      *
   [5]     chr1   [45300539, 46300539]      *
   ...      ...                    ...    ...
  [16]     chr1 [166112356, 167112356]      *
  [17]     chr1 [174453227, 175453227]      *
  [18]     chr1 [185347260, 186347260]      *
  [19]     chr1 [194299241, 195299241]      *
  [20]     chr1 [205731116, 206731116]      *
  -------

At this point you can simply use subsetByOverlaps or mergeByOverlaps, to get the genes of interest:

> subsetByOverlaps(genes(TxDb.Hsapiens.UCSC.hg19.knownGene), mycoords.gr)
GRanges object with 188 ranges and 1 metadata column:
            seqnames                 ranges strand   |     gene_id
               <Rle>              <IRanges>  <Rle>   | <character>
  100126349     chr1 [166123980, 166124035]      -   |   100126349
  100129405     chr1 [155715559, 155720673]      +   |   100129405
  100132406     chr1 [145209111, 146467744]      +   |   100132406
  100288142     chr1 [144146811, 146467744]      +   |   100288142
  100302117     chr1 [117214371, 117214449]      +   |   100302117
        ...      ...                    ...    ... ...         ...
       9674     chr1 [175126123, 175162229]      -   |        9674
       9829     chr1 [ 65720148,  65881552]      +   |        9829
       9910     chr1 [174128552, 174964445]      +   |        9910
       9923     chr1 [ 22778344,  22857650]      +   |        9923
        998     chr1 [ 22379120,  22419436]      +   |         998

The gene_id column in this dataframe contains the Entrez ID of the human gene overlapping the coordinates. If you also want the gene symbols, or any other ID, you can get it from org.Hs.db:

> as.data.frame(org.Hs.egSYMBOL) %>% head
  gene_id symbol
1       1   A1BG
2       2    A2M
3       3  A2MP1
4       9   NAT1
5      10   NAT2
6      11   NATP


Just type "org.Hs.eg" and hit tab to see all the possible Entrez Gene to ID conversion available.

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Giovanni M Dall'Olio26k
2
gravatar for Emily_Ensembl
3.0 years ago by
Emily_Ensembl16k
EMBL-EBI
Emily_Ensembl16k wrote:

There's a slight error in your query as you include biotype as a filter, but don't specify any value for it. The following would work and get you only coding genes:

results=getBM(attributes = c("hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position","gene_biotype"),filters = c("chromosomal_region","biotype"),values = list(chromosomal_region=filterlist,biotype="protein_coding"), mart = ensembl)

Or, to get all genes, coding or non:

results=getBM(attributes = c("hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position","gene_biotype"),filters = c("chromosomal_region"),values = list(chromosomal_region=filterlist,), mart = ensembl)
ADD COMMENTlink written 3.0 years ago by Emily_Ensembl16k
1
gravatar for Alex Reynolds
3.0 years ago by
Alex Reynolds26k
Seattle, WA USA
Alex Reynolds26k wrote:

A command-line approach with a toolkit like BEDOPS will likely run faster.

Within R, convert the list to a vector and write it to a text file:

> l <- list("1:4864876:5864876", "1:14283067:15283067", "1:21786817:22786817", "1:33465769:34465769", "1:45300539:46300539", "1:54333815:55333815", "1:65114236:66114236", "1:75194833:76194833", "1:86037468:87037468", "1:96462256:97462256", "1:105259436:106259436", "1:116234756:117234756", "1:120842170:121842170", "1:145808064:146808064", "1:155459582:156459582", "1:166112356:167112356", "1:174453227:175453227", "1:185347260:186347260", "1:194299241:195299241", "1:205731116:206731116")
> write(unlist(l), file="elements.txt")

In a command-line shell, replace colons with tabs and prepend with a common chromosome name prefix, writing a sorted BED file called elements.bed:

$ tr ':' '\t' < elements.txt | awk '{ print "chr"$0; }' - | sort-bed - > elements.bed

Grab a set of gene annotations, converted to BED with gtf2bed:

$ wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    | awk '$8 == "gene"' - \
    > gencode.v19.genes.bed

This is just an example. Use whichever gene annotations you prefer, here.

If your elements are already padded to 1Mb intervals (which may be the case, on a second read of your question), then just use bedmap to map them to genes:

$ bedmap --echo --echo-map-id-uniq --delim '\t' elements.bed gencode.v19.genes.bed > elements_with_gene_IDs.bed

If you need to pad elements, you can use bedmap --range and map the padded elements to genes:

$ bedmap --range 500000 --echo --echo-map-id-uniq --delim '\t' elements.bed gencode.v19.genes.bed > elements_with_gene_IDs_within_1Mb.bed
ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Alex Reynolds26k
0
gravatar for cyril-cros
3.0 years ago by
cyril-cros840
France
cyril-cros840 wrote:

Here is a solution which does not use R but BEDtools (http://bedtools.readthedocs.org/en/latest/), which is great for manipulating sets of genomic intervals. Also, you may want to keep information about the strand of your genes of interest.

  1. Go the UCSC table browser, download the human genome annotation in BED format.
  2. Convert your genomic intervals to the bed format. "1:166112356:167112356" becomes "chr1 166112356 167112356" (tab separated).
  3. Use `bedtools slop` to extend these intervals. You can get your locii plus 500Kb upstream and downstream of each one for example. You need a file with the size of each chromosome.
  4. Use `bedtools intersect` on the human annotation and your extended intervals.

You now have all the human genes that fall in your intervals, plus their position. Use `cut -f4` to keep only the column with their names.

Alternative, may be quicker:

Go to https://genome.ucsc.edu/cgi-bin/hgTables (UCSC table browser). There is a "Region" line, and a "define regions" button; click on it. You can use it with your positions, but you don't have a simple way to get 1Mb around your loci.

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by cyril-cros840
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1095 users visited in the last hour