Finding All The Genes That Are In A Given Set Of Genomic Co-Ordinates Using R
2
2
Entering edit mode
7.3 years ago
Ankur ▴ 100

Hi,

So I've got a list of chromosome segments amplified and deleted by sample with the chromosome, start and end co-ordinates. How can I, using R, get hold of identifiers for all the genes that fall within those segments?

Cheers.

cnv r • 2.0k views
1
Entering edit mode

Which organism?

4
Entering edit mode
7.3 years ago
Neilfws 49k

I'll assume for now that this is human data.

A brief example:

library(biomaRt)
mart.hs <- useMart("ensembl", "hsapiens_gene_ensembl")
# example: chromosome 22, start = 20000000, end = 20100000

genes <- getBM(attributes = c("hgnc_symbol", "chromosome_name", "start_position", "end_position"),
filters    = c("chromosomal_region"),
values     = c("22:20000000:20100000"),
mart       = mart.hs)

genes
#   hgnc_symbol chromosome_name start_position end_position
#1        ARVCF              22       19957419     20004331
#2       TANGO2              22       20004537     20053449
#3        DGCR8              22       20067755     20099400
#4       TRMT2A              22       20099389     20104915
#5       MIR185              22       20020662     20020743
#6                           22       20050503     20058045
#7                           22       20052075     20053228
#8      MIR3618              22       20073269     20073356
#9      MIR1306              22       20073581     20073665
#10                          22       20098344     20099398

0
Entering edit mode

The trouble though is that I've got a dataframe full of segments and samples and I need to apply that function to each row so I can get a vector of gene symbols for each row; biomaRt doesn't seem to be very good at doing that.

0
Entering edit mode

That is not a problem at all. You just need to paste() together the chromosomes, starts and ends and supply that vector as the values argument to getBM().

0
Entering edit mode

Perhaps instead of using R to do set operations, export the gene table for whole chromosomes to a file that you can do lookups on. Use a dedicated set operation tool like bedmap to map genes to your segments of interest. More about bedmap over here.

2
Entering edit mode
7.3 years ago

The basic outline would look like this:

1. Store your amplified/deleted segments in a GRanges object (from the GenomicRanges package) named gr. You can annotate each range in this object with the sampleID it comes from, if this is necessary.
2. Load up the appropriate TranscriptDb object (GenomicFeatures package) for your organism and reference, or create your own from a set of annotations you care about (read the vignette for this package to learn how to do so).
3. Get the relevant features you want out of your txdb objects, and subsetByOverlaps(features, gr) .

More concretely, assuming your working in human:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
tx <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
interesting <- subsetByOverlaps(tx, gr)


Now interesting has all of the transcripts that overlap (in any way) with the segments in gr. Read through the documentation available for GenomicRanges (and likely IRanges) to tune how you want to consider overlaps, or how to use the lower level findOverlaps methods to better tune the results/output from these queries.