map subsetByOverlaps back to input coordinates
1
0
Entering edit mode
3.3 years ago
brtawe • 0

I am trying to find overlap between my own set of genomic coordinates and genes, as shown below. I am able to get get the subset of genes that overlap using subsetByOverlaps, however, how do I map the genes back to the coordinates that I used to subset? I want to know which of my input coordinates overlapped with each gene(s)

I included a metadata column in my coordinates containing an ID, but that ID is missing from the subset output

mycoords.gr = as.data.frame(input) %>%
 makeGRangesFromDataFrame(keep.extra.columns=TRUE) #make GRanges form Bed file

genes <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene) #extract GRanges for genes
gene_list <- subsetByOverlaps(genes, mycoords.gr) #subset genes that overlap with my coordinates

gene_result = as(gene_list, "data.frame") #convert GRanges object to data frame

Thanks for any help!

alignment genome • 943 views
ADD COMMENT
1
Entering edit mode
3.3 years ago

Example data.

library("GenomicFeatures")

my_coords <- structure(list(seqnames = structure(c(1L, 1L, 1L, 1L), class = "factor", .Label = "I"), 
    start = c(5L, 10L, 15L, 20L), end = c(8L, 13L, 18L, 23L), 
    width = c(4L, 4L, 4L, 4L), strand = structure(c(1L, 2L, 1L, 
    2L), class = "factor", .Label = c("+", "-", "*"))), class = "data.frame", row.names = c(NA, 
-4L))
my_coords <- makeGRangesFromDataFrame(my_coords)

> my_coords
GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        I       5-8      +
  [2]        I     10-13      -
  [3]        I     15-18      +
  [4]        I     20-23      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

genes <- structure(list(seqnames = c("I", "I"), start = c(1, 21), end = c(9, 
30), strand = c("+", "-"), name = c("ENSG006069", "ENSG006241"
)), class = "data.frame", row.names = c(NA, -2L))
genes <- makeGRangesFromDataFrame(genes, keep.extra.columns=TRUE)

> genes
GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand |        name
         <Rle> <IRanges>  <Rle> | <character>
  [1]        I       1-9      + |  ENSG006069
  [2]        I     21-30      - |  ENSG006241

I find plyranges::join_overlap_left_directed is simpler than the base GenomicRanges method for this task.

library("plyranges")

overlaps <-  join_overlap_left_directed(my_coords, genes)

> overlaps
GRanges object with 4 ranges and 1 metadata column:
      seqnames    ranges strand |        name
         <Rle> <IRanges>  <Rle> | <character>
  [1]        I       5-8      + |  ENSG006069
  [2]        I     10-13      - |        <NA>
  [3]        I     15-18      + |        <NA>
  [4]        I     20-23      - |  ENSG006241
ADD COMMENT
0
Entering edit mode

That works great, thanks for the help!

ADD REPLY

Login before adding your answer.

Traffic: 2937 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6