Annotate Granges Object With Gene-Symbol
1
1
Entering edit mode
10.5 years ago
tnw01a ▴ 10

Hi,

I am struggling with the seemingly very simple task to map genomic coordinates to a gene symbol (such as TP53).

This is what I did: I've read a bedgraph file containing read coverage information into R using rtracklayer's import function.

The result is the following:


> some_granges_object
GRanges with 667858 ranges and 1 metadata column:
                 seqnames             ranges strand   |     score
                    <rle>          <iranges>  <rle>   | <numeric>
       [1]           chr1 [x, x]      *   |       155
       [2]           chr1 [x, x]      *   |       154
       [3]           chr1 [x, x]      *   |       235
       [4]           chr1 [x, x]      *   |       237
       [5]           chr1 [x, x]      *   |       236
       ...            ...                ...    ... ...       ...
  [667854] chrUn_gl000241     [x, x]      *   |       224
  [667855] chrUn_gl000241     [x, x]      *   |       210
  [667856] chrUn_gl000241     [x, x]      *   |       204
  [667857] chrUn_gl000241     [x, x]      *   |       209
  [667858] chrUn_gl000241     [x, x]      *   |       160
  ---
  seqlengths:
                    chrM                  chr1 ...        chrUn_gl000249
                      NA                    NA ...                    NA

In this particular sample there are only a few regions covered by reads (maybe 100 genes?) and I would like to infer the gene names as soon as I have a region above a certain coverage value (lets say 150).

Ideally I would like to add a few more columns to the GRanges object with an Entrez gene Id and a common gene symbol.

I've tried to use VariantAnnotation, GenomicFeatures and am stuck.

bioconductor • 9.2k views
ADD COMMENT
0
Entering edit mode

Are there any real coordinates in that GRanges object? All I can see in your example is [x, x].

ADD REPLY
0
Entering edit mode

Yes, I just removed the coordinates for privacy reasons...

ADD REPLY
1
Entering edit mode

I don't think the coordinates would be giving away too much information, to be honest :)

ADD REPLY
1
Entering edit mode
10.5 years ago
Michael 54k

Possibly, the most straight-forward way would include the following steps:

  1. Get or make the complete genome annotation as a GFF file with Ensembl GeneIDs (including gene symbols for some genes, if possible) and import via package rtracklayer
  2. calculate overlaps of some_granges_objects_at... with the gene annotation regions using findOverlaps
  3. alternatively: use package biomaRt to map geneIDs to gene symbols, if you didn't include gene symbols in step one

Please let us know if you need more specific help with any of those steps.

ADD COMMENT
0
Entering edit mode

I already tried using findOverlaps:

transcriptsByGene = transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
overlaps = findOverlaps(some_granges_object, transcriptsByGene)

But I couldn't figure out what to do with the resulting data structure:

> overlaps
Hits of length 722867
queryLength: 667858
subjectLength: 22932
       queryHits subjectHits 
        <integer>   <integer> 
 1             71         608 
 2             72         608 
 3             73         608 
 4             74         608 
 5             75         608 
ADD REPLY
0
Entering edit mode

FYI, it looks like some of your queries don't actually overlap a gene. You might want to use nearest() in that case. For extracting the indices from overlaps, you can use queryHits(overlaps) and subjectHits(overlaps) (have a look at help('HitsList-class') if you're unsure what the numbers there mean).

ADD REPLY

Login before adding your answer.

Traffic: 3040 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6