Question

Find gene using nucleotide sequence position in R

1

Entering edit mode

6.7 years ago

swamyvinny ▴ 20

Hi, I am trying to find what gene, if any, a specific nucleotide positions belong to. For example I'd like to know what gene nucleotide 100004651 is on. I'm using R currently and would be grateful if anyone can point me towards any packages that could help.

I'm working with the human genome if that helps at all, and also have information on which chromsome the position belongs to and which strand(+/-)

Thanks in advance

genome R sequence • 4.8k views

ADD COMMENT • link updated 6.7 years ago by Charles Plessy ★ 2.9k • written 6.7 years ago by swamyvinny ▴ 20

0

Entering edit mode

use bedtools to intersect coordinates with gene coordinates

ADD REPLY • link 6.7 years ago by cpad0112 21k

score 5 · Answer 1 · 2017-08-24

5

Entering edit mode

6.7 years ago

Emily 23k

You could use biomaRt, filtering by your loci.

ADD COMMENT • link 6.7 years ago by Emily 23k

0

Entering edit mode

Since I recently did this I can easily share you some code (random position in human genome)

library("biomaRt")
mart = useMart("ensembl", dataset="hsapiens_gene_ensembl")

all.genes <- getBM(
    attributes=c("ensembl_gene_id","hgnc_symbol","chromosome_name","start_position","end_position"),
    filters=c("chromosome_name", "start", "end"),
    values=list(chromosome="5", start="1565669",end="1565670"),
    mart=mart)

ADD REPLY • link 6.7 years ago by WouterDeCoster 47k

0

Entering edit mode

This is exactly what I need! thank you so much

ADD REPLY • link 6.7 years ago by swamyvinny ▴ 20

0

Entering edit mode

Glad that it was useful. I have moved my comment to an answer so you can accept it and mark this thread as closed.

Upvote|Bookmark|Accept

ADD REPLY • link 6.7 years ago by WouterDeCoster 47k

score 0 · Answer 2 · 2017-08-24

Here is an internal function that I am about to add to Bioconductor's CAGEr package, to solve the same problem.

#' ranges2genes
#' 
#' Assign gene symbol(s) to Genomic Ranges.
#' 
#' @param ranges Genomics Ranges object, for example extracted from a
#'               RangedSummarizedExperiment object with the \code{rowRanges}
#'               command.
#' 
#' @param genes A \code{\link{GRanges}} object containing \code{gene_name} metadata.
#' 
#' @return A character vector of same length as the GRanges object, indicating
#'         one gene symbol or a comma-separated list of gene symbols for each
#'         range.
#' 
#' @importFrom GenomicRanges findOverlaps
#' @importFrom S4Vectors List Rle unstrsplit
#' @importFrom IRanges extractList
#' 
#' @examples
#' # Example for Biostars
#' # Imagine that nucleotides are represented by a GRanges object called "gr"
#' # Download GENCODE from Bioc's annotation hub:
#' ah <- AnnotationHub::AnnotationHub()
#' gff <- ah[["AH49556"]]
#' Instead one can also use rtracklayer::import.gff, etc...
#' 
#' Annotate the GRanges:
#' gr$geneSymbols <- ranges2genes(gr, gff)

ranges2genes <- function(ranges, genes) {
  if (is.null(genes$gene_name))
    stop("Annotation must contain ", dQuote("gene_name"), " metdata.")
  gnames <- findOverlaps(ranges, genes)
  gnames <- as(gnames, "List")
  gnames <- extractList(genes$gene_name, gnames)
  gnames <- unique(gnames)
  gnames <- unstrsplit(gnames, ";")
  Rle(gnames)
}