How to avoid annotating multiple genes to a single CpG locus?
4.1 years ago
Mathias ▴ 90

I am downloading TCGA DNA methylation data using TCGAbiolinks. The rowData() output looks a bit messy: all Ensembl IDs that a CpG overlaps are joined together in one cell, and the same goes for the HGNC symbols and gene types.

I just want to add one column with protein-coding gene HGNC symbols to my data. This is my biomaRt query:

               mart=ensembl) # ensembl human genes

I'm using GRanges findOverlaps to find the genes in which the CpGs are located. When I join the genes to my CpGs and use the table function to inspect the result, I can see that a lot of CpGs are 'located in multiple genes'.

For example, one CpG is located in PCDHGC(5,4,3) and so on. I end up with about 5K genes in which about 20K CpGs are located at least twice; some CpGs are annotated to over 20 different genes.

I'm already filtering on protein-coding genes and using HGNC symbols. Are there other filters I can use to find exactly one gene at a specific locus?
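To make the multi-gene problem concrete, here is a minimal sketch of counting how many genes each CpG falls into. The coordinates, gene symbols, and probe names are made-up toy values, not real annotation; only `GRanges`, `IRanges`, and `countOverlaps` are actual GenomicRanges/IRanges functions.

```r
library(GenomicRanges)

# Toy gene models (hypothetical coordinates and symbols, for illustration only)
genes <- GRanges(
  seqnames = "chr5",
  ranges   = IRanges(start = c(100, 150, 500), end = c(400, 300, 700)),
  symbol   = c("GENE_A", "GENE_B", "GENE_C")
)

# Toy CpG positions as width-1 ranges (hypothetical probe names)
cpgs <- GRanges(
  seqnames = "chr5",
  ranges   = IRanges(start = c(200, 350, 600, 900), width = 1),
  probe    = c("cg01", "cg02", "cg03", "cg04")
)

# Number of genes overlapping each CpG, in the order of 'cpgs'
genes_per_cpg <- countOverlaps(cpgs, genes)
names(genes_per_cpg) <- cpgs$probe

# Distribution of hit counts: how many CpGs hit 0, 1, 2, ... genes
table(genes_per_cpg)

# Probes that fall into more than one gene
multi_hit_probes <- names(genes_per_cpg)[genes_per_cpg > 1]
```

In this toy set only `cg01` lands in two genes, because `GENE_B` lies entirely inside `GENE_A`; that is the same situation as the nested PCDHG genes, just on a smaller scale.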

4.1 years ago
Mike Smith ★ 1.9k

I think this is non-trivial. If you look at your PCDHGC example (Ensembl gene ENSG00000240184, region 5:141475207-141513719), Ensembl has multiple genes spanning the same region. I don't know the reasoning behind why these are separate genes rather than transcripts, but your biomaRt query is always going to return all of them as they have unique gene IDs, and then a CpG locus in that region will intersect all of them.

I guess you could run findOverlaps() on the list of genes first, to find examples like this, and then only keep the largest version in each group, but you'll obviously lose some level of information doing this.
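A rough sketch of that idea, under some assumptions: the genes are held in a `GRanges` with a `symbol` metadata column (a hypothetical name), overlapping genes are grouped with `reduce()` rather than a gene-vs-gene `findOverlaps()`, and strand is ignored when grouping. The coordinates are toy values, not real annotation.

```r
library(GenomicRanges)

# Toy gene models (hypothetical coordinates and symbols)
genes <- GRanges(
  seqnames = "chr5",
  ranges   = IRanges(start = c(100, 150, 500), end = c(400, 300, 700)),
  symbol   = c("GENE_A", "GENE_B", "GENE_C")
)
cpgs <- GRanges(
  seqnames = "chr5",
  ranges   = IRanges(start = c(200, 350, 600, 900), width = 1)
)

# Merge overlapping genes into clusters, ignoring strand
clusters <- reduce(genes, ignore.strand = TRUE)

# Cluster index for every gene (each gene lies within exactly one cluster)
cluster_id <- findOverlaps(genes, clusters,
                           select = "first", ignore.strand = TRUE)

# Within each cluster, put the widest gene first and keep only that one
ord  <- order(cluster_id, -width(genes))
keep <- ord[!duplicated(cluster_id[ord])]
genes_unique <- genes[sort(keep)]

# Each CpG now overlaps at most one gene
hits <- findOverlaps(cpgs, genes_unique)
cpgs$symbol <- NA_character_
cpgs$symbol[queryHits(hits)] <- genes_unique$symbol[subjectHits(hits)]
```

Note that `reduce()` chains overlaps: if A overlaps B and B overlaps C, all three end up in one cluster even when A and C do not touch, so only one of them survives. That is part of the information loss mentioned above.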


