I am downloading TCGA DNA methylation data using TCGAbiolinks. The rowData() output is a bit messy: all Ensembl IDs that a CpG overlaps are joined together in one cell, and the same goes for the HGNC symbols and gene types.
I just want to add a single column to my data containing the HGNC symbols of protein-coding genes. This is my biomaRt query:
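library(biomaRt)
# Assumed setup: the mart was created elsewhere and the exact build/version is not shown here
ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
genes <- getBM(attributes = c("hgnc_symbol", "chromosome_name", "start_position", "end_position"),
               filters = c("chromosome_name", "biotype"),
               values = list(chromosome_name = c(1:22, "X", "Y"), biotype = "protein_coding"),
               mart = ensembl) # Ensembl human genes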
I'm using findOverlaps() on GRanges objects to find the genes in which the CpGs are located. When I join the genes to my CpGs and inspect the result with table(), I can see that a lot of CpGs are 'located in multiple genes'.
For example, one CpG is located in PCDHGC5, PCDHGC4, PCDHGC3 and so on. I end up with about 20k CpGs that each appear at least twice (i.e. overlap two or more genes), spread over about 5k genes; some CpGs are annotated to over 20 different genes.
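To illustrate the overlap step, here is a minimal sketch with toy data. The probe IDs, gene symbols and coordinates are made up for illustration and are not my real data; the object names are placeholders too.

```r
library(GenomicRanges)

# Toy CpG positions (hypothetical probe IDs and coordinates)
cpgs <- GRanges(
  seqnames = c("chr5", "chr5", "chr1"),
  ranges   = IRanges(start = c(150, 450, 100), width = 1),
  probe    = c("cg_a", "cg_b", "cg_c")
)

# Toy gene models (hypothetical): GENE_A and GENE_B overlap each other on chr5
genes <- GRanges(
  seqnames = c("chr5", "chr5", "chr1"),
  ranges   = IRanges(start = c(100, 120, 50), end = c(500, 200, 300)),
  symbol   = c("GENE_A", "GENE_B", "GENE_C")
)

# One hit per (CpG, gene) pair, so a CpG inside two genes yields two hits
hits <- findOverlaps(cpgs, genes)

# Join genes to CpGs: one row per overlap
annot <- data.frame(
  probe  = cpgs$probe[queryHits(hits)],
  symbol = genes$symbol[subjectHits(hits)]
)

# Count how many genes each CpG was assigned to
genes_per_cpg <- table(annot$probe)

# CpGs annotated to more than one gene
multi <- names(genes_per_cpg)[as.vector(genes_per_cpg) > 1]
```

In this toy case only cg_a ends up in `multi`, because it falls inside both GENE_A and GENE_B; that is the same pattern I see at a much larger scale in the real data.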
I'm already filtering on protein-coding genes and using HGNC symbols. Are there other filters I can use to get exactly one gene at a specific locus?