Question: Finding All The Genes That Are In A Given Set Of Genomic Co-Ordinates Using R
6.9 years ago by
Ankur100 wrote:


I've got a list of chromosome segments that are amplified or deleted in each sample, with the chromosome, start and end coordinates of each segment. How can I, using R, get the identifiers of all the genes that fall within those segments?


R cnv • 1.9k views
modified 6.9 years ago by Steve Lianoglou5.1k • written 6.9 years ago by Ankur100

Which organism?

written 6.9 years ago by Neilfws49k
6.9 years ago by
Sydney, Australia
Neilfws49k wrote:

I'll assume for now that this is human data.

Short answer: biomaRt.

A brief example:

library(biomaRt)

# connect to the Ensembl human gene dataset
mart.hs <- useMart("ensembl", "hsapiens_gene_ensembl")
# example: chromosome 22, start = 20000000, end = 20100000

genes <- getBM(attributes = c("hgnc_symbol", "chromosome_name", "start_position", "end_position"),
               filters    = c("chromosomal_region"), 
               values     = c("22:20000000:20100000"), 
               mart       = mart.hs)

#   hgnc_symbol chromosome_name start_position end_position
#1        ARVCF              22       19957419     20004331
#2       TANGO2              22       20004537     20053449
#3        DGCR8              22       20067755     20099400
#4       TRMT2A              22       20099389     20104915
#5       MIR185              22       20020662     20020743
#6                           22       20050503     20058045
#7                           22       20052075     20053228
#8      MIR3618              22       20073269     20073356
#9      MIR1306              22       20073581     20073665
#10                          22       20098344     20099398
written 6.9 years ago by Neilfws49k

The trouble though is that I've got a dataframe full of segments and samples and I need to apply that function to each row so I can get a vector of gene symbols for each row; biomaRt doesn't seem to be very good at doing that.

written 6.9 years ago by Ankur100

That is not a problem at all. You just need to paste() together the chromosomes, starts and ends and supply that vector as the values argument to getBM().
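
A minimal sketch of that suggestion, assuming a data frame named segs with columns sample, chrom, start and end (all hypothetical names, as is the query_regions helper). It builds one "chromosome:start:end" string per row. format() is used because paste() alone can render large numeric coordinates in scientific notation (for example 2e+07). The biomaRt call is wrapped in a function, since it needs the package and network access.

```r
# Hypothetical segment table; the column names are assumptions.
segs <- data.frame(
  sample = c("S1", "S1", "S2"),
  chrom  = c("22", "22", "1"),
  start  = c(20000000, 21000000, 1000000),
  end    = c(20100000, 21100000, 1100000),
  stringsAsFactors = FALSE
)

# One "chr:start:end" string per row, with scientific notation disabled.
regions <- paste(
  segs$chrom,
  format(segs$start, scientific = FALSE, trim = TRUE),
  format(segs$end,   scientific = FALSE, trim = TRUE),
  sep = ":"
)

# Hypothetical helper: query Ensembl for the genes in a vector of regions.
query_regions <- function(regions, mart) {
  biomaRt::getBM(
    attributes = c("hgnc_symbol", "chromosome_name",
                   "start_position", "end_position"),
    filters    = "chromosomal_region",
    values     = regions,
    mart       = mart
  )
}

# Usage (requires biomaRt and a network connection):
# mart.hs <- biomaRt::useMart("ensembl", "hsapiens_gene_ensembl")
# genes   <- query_regions(regions, mart.hs)
```

A single call with the whole vector returns the genes for all regions together, without saying which region each gene came from. To keep a separate vector of symbols per row, something like lapply(regions, function(r) query_regions(r, mart.hs)$hgnc_symbol) would query one region at a time, at the cost of one request per segment.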

written 6.9 years ago by Neilfws49k

Perhaps, instead of doing the set operations in R, export the gene table for whole chromosomes to a file that you can do lookups on, and use a dedicated set-operation tool such as bedmap (part of BEDOPS) to map genes to your segments of interest. See the bedmap documentation for more.

modified 6.9 years ago • written 6.9 years ago by Alex Reynolds31k
6.9 years ago by
Steve Lianoglou5.1k wrote:

The basic outline would look like this:

  1. Store your amplified/deleted segments in a GRanges object (from the GenomicRanges package) named gr. You can annotate each range in this object with the sample ID it comes from, if necessary.
  2. Load the appropriate TxDb object (called TranscriptDb in older versions of the GenomicFeatures package) for your organism and reference, or create your own from a set of annotations you care about (read the vignette for this package to learn how to do so).
  3. Get the features you want out of your TxDb object, then call subsetByOverlaps(features, gr).

More concretely, assuming you're working in human:

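library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)  # hg19 UCSC known-gene annotations
tx <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
interesting <- subsetByOverlaps(tx, gr)  # gr: the GRanges of your segments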

Now interesting has all of the transcripts that overlap (in any way) with the segments in gr. Read through the documentation available for GenomicRanges (and likely IRanges) to tune how you want to consider overlaps, or how to use the lower level findOverlaps methods to better tune the results/output from these queries.
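
Putting the outline together, here is one possible end-to-end sketch. It assumes hg19 coordinates and a hypothetical data frame segs whose chromosome names use the UCSC "chr" prefix. It uses genes() rather than transcripts() so that each hit carries a gene identifier (Entrez gene IDs in this TxDb package), and findOverlaps() to keep track of which segment each gene overlaps.

```r
library(GenomicRanges)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

# Hypothetical segment table; the column names are assumptions.
segs <- data.frame(
  sample = c("S1", "S1", "S2"),
  chrom  = c("chr22", "chr22", "chr1"),
  start  = c(20000000L, 21000000L, 1000000L),
  end    = c(20100000L, 21100000L, 1100000L),
  stringsAsFactors = FALSE
)

# Step 1: segments as a GRanges object, keeping the sample column as metadata.
gr <- makeGRangesFromDataFrame(segs, keep.extra.columns = TRUE)

# Steps 2 and 3: gene-level ranges from the TxDb, then overlaps with the segments.
gn   <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
hits <- findOverlaps(gr, gn)

# One character vector of gene IDs per segment (empty if nothing overlaps).
genes_by_segment <- split(
  gn$gene_id[subjectHits(hits)],
  factor(queryHits(hits), levels = seq_along(gr))
)
```

The list genes_by_segment is in the same order as the rows of segs, so it can be attached back to the data frame or to gr as an extra column if that is convenient.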

written 6.9 years ago by Steve Lianoglou5.1k