Question

Annotate Granges Object With Gene-Symbol

1

Entering edit mode

10.5 years ago

tnw01a ▴ 10

Hi,

I am struggling with the seemingly very simple task to map genomic coordinates to a gene symbol (such as TP53).

This is what I did: I've read a bedgraph file containing read coverage information into R using rtracklayer's import function.

The result is the following:


> some_granges_object
GRanges with 667858 ranges and 1 metadata column:
                 seqnames             ranges strand   |     score
                    <rle>          <iranges>  <rle>   | <numeric>
       [1]           chr1 [x, x]      *   |       155
       [2]           chr1 [x, x]      *   |       154
       [3]           chr1 [x, x]      *   |       235
       [4]           chr1 [x, x]      *   |       237
       [5]           chr1 [x, x]      *   |       236
       ...            ...                ...    ... ...       ...
  [667854] chrUn_gl000241     [x, x]      *   |       224
  [667855] chrUn_gl000241     [x, x]      *   |       210
  [667856] chrUn_gl000241     [x, x]      *   |       204
  [667857] chrUn_gl000241     [x, x]      *   |       209
  [667858] chrUn_gl000241     [x, x]      *   |       160
  ---
  seqlengths:
                    chrM                  chr1 ...        chrUn_gl000249
                      NA                    NA ...                    NA

In this particular sample there are only a few regions covered by reads (maybe 100 genes?) and I would like to infer the gene names as soon as I have a region above a certain coverage value (lets say 150).

Ideally I would like to add a few more columns to the GRanges object with an Entrez gene Id and a common gene symbol.

I've tried to use VariantAnnotation, GenomicFeatures and am stuck.

bioconductor • 9.2k views

ADD COMMENT • link 10.5 years ago by tnw01a ▴ 10

0

Entering edit mode

Are there any real coordinates in that GRanges object? All I can see in your example is [x, x].

ADD REPLY • link 10.5 years ago by Neilfws 49k

0

Entering edit mode

Yes, I just removed the coordinates for privacy reasons...

ADD REPLY • link 10.5 years ago by tnw01a ▴ 10

1

Entering edit mode

I don't think the coordinates would be giving away too much information, to be honest :)

ADD REPLY • link 10.5 years ago by Neilfws 49k

score 1 · Answer 1 · 2013-11-07

1

Entering edit mode

10.5 years ago

Michael 54k

Possibly, the most straight-forward way would include the following steps:

Get or make the complete genome annotation as a GFF file with Ensembl GeneIDs (including gene symbols for some genes, if possible) and import via package rtracklayer
calculate overlaps of some_granges_objects_at... with the gene annotation regions using findOverlaps
alternatively: use package biomaRt to map geneIDs to gene symbols, if you didn't include gene symbols in step one

Please let us know if you need more specific help with any of those steps.

ADD COMMENT • link 10.5 years ago by Michael 54k

0

Entering edit mode

I already tried using findOverlaps:

transcriptsByGene = transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
overlaps = findOverlaps(some_granges_object, transcriptsByGene)

But I couldn't figure out what to do with the resulting data structure:

> overlaps
Hits of length 722867
queryLength: 667858
subjectLength: 22932
       queryHits subjectHits 
        <integer>   <integer> 
 1             71         608 
 2             72         608 
 3             73         608 
 4             74         608 
 5             75         608

ADD REPLY • link 10.5 years ago by tnw01a ▴ 10

0

Entering edit mode

FYI, it looks like some of your queries don't actually overlap a gene. You might want to use nearest() in that case. For extracting the indices from overlaps, you can use queryHits(overlaps) and subjectHits(overlaps) (have a look at help('HitsList-class') if you're unsure what the numbers there mean).

ADD REPLY • link 10.5 years ago by Devon Ryan 104k