Question

Convert Ensembl Gene IDs to Gene Symbols in R

0

18 months ago

Amr ▴ 160

I tried to convert Ensembl gene IDs to gene symbols using biomaRt and the annotation package org.Hs.eg.db in R, as well as the biotools website, but some genes were not converted to symbols.

Why were some genes not converted, and is there a better solution?

Thanks

Ensemble R Gene-Symbols biomart • 1.3k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 18 months ago by Amr ▴ 160

GenoMax · Answer 1 · 2022-09-29

1

18 months ago

Marco Pannone ▴ 790

Can you give an example? It might be pseudogenes or transcripts not currently annotated.

ADD COMMENT • link 18 months ago by Marco Pannone ▴ 790

1

'ENSG00000276171','ENSG00000269227','ENSG00000280816','ENSG00000236269',
'ENSG00000182109','ENSG00000261254','ENSG00000278882','ENSG00000273689',
'ENSG00000277573','ENSG00000278939','ENSG00000252817','ENSG00000216109',
'ENSG00000280316','ENSG00000260766','ENSG00000239373'

ADD REPLY • link updated 18 months ago by GenoMax 141k • written 18 months ago by Amr ▴ 160

0

If you simply google some of them, you can see that these Ensembl IDs refer to transcripts from non-coding or unannotated regions.

ADD REPLY • link 18 months ago by Marco Pannone ▴ 790

0

So, they have no symbols, right?

ADD REPLY • link 18 months ago by Amr ▴ 160

1

So naturally, you are not going to see them associated with any "Gene Symbol" when doing the conversion from "Ensembl ID".

ADD REPLY • link 18 months ago by Marco Pannone ▴ 790

1

Novel genes used to be assigned temporary cryptic placeholder symbols like AC010680.1 or LINC02050 or C1orf43. They recently stopped doing that in favor of just using Ensembl IDs, since those symbols were not particularly helpful. There was a blog post somewhere about this, but I can't find it.

ADD REPLY • link 18 months ago by igor 13k

0

But how can I see their symbols in the unnormalized data? How were those symbols obtained?

ADD REPLY • link 18 months ago by Amr ▴ 160

0

I do not know why you are mentioning normalization now, since it has nothing to do with the ID of a transcript in your dataset. Intuitively, though, if a transcript comes from a coding region you would expect it to also have a "Gene Symbol". If the transcript comes from a non-coding region, you would not expect it to map to any "Gene Symbol". Transcripts from non-coding regions still have an "Ensembl ID" (for example, see here how this is possible: https://www.ensembl.org/info/genome/genebuild/ncrna.html).

I tried my best to explain it in the simplest way, so I hope it is clear. But I would recommend doing some more reading, because these are pretty basic and straightforward concepts.

ADD REPLY • link 18 months ago by Marco Pannone ▴ 790

score 1 · Answer 2 · 2022-09-29

The biomaRt package is the best way to do it, I believe, although I personally am using an equivalent package in Python. However, the principle is the same:

Some genes don't have names, especially if they're newly predicted, and you just have to identify them with their Ensembl Gene ID (ENSG#). Occasionally, if you google around, you might find a name for some of them, but they aren't in the BioMart database.

Some of the unnamed ones may receive names in future releases of BioMart.
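For what it's worth, here is a minimal Python sketch of the "fall back to the Ensembl ID" idea described above. It is only an illustration, not what biomaRt or any other package does internally. The function name `label_genes`, the lookup table, and the IDs and symbols in it are all made-up placeholders; in practice the table would come from whatever ID-to-symbol export you already have.

```python
from typing import Dict, Mapping, Optional, Sequence


def label_genes(
    ensembl_ids: Sequence[str],
    id_to_symbol: Mapping[str, Optional[str]],
) -> Dict[str, str]:
    """Return one display label per gene: its symbol if present, else its Ensembl ID."""
    labels: Dict[str, str] = {}
    for gene_id in ensembl_ids:
        symbol = id_to_symbol.get(gene_id)
        # Treat a missing entry, None, and an empty/blank string the same way:
        # the gene has no usable symbol, so keep the Ensembl ID as its label.
        if symbol and symbol.strip():
            labels[gene_id] = symbol.strip()
        else:
            labels[gene_id] = gene_id
    return labels


# Hypothetical lookup table (placeholder IDs and symbols, not real annotations).
example_lookup = {
    "ENSG_EXAMPLE_1": "SYMBOL_A",
    "ENSG_EXAMPLE_2": "",
    "ENSG_EXAMPLE_3": None,
}

labels = label_genes(
    ["ENSG_EXAMPLE_1", "ENSG_EXAMPLE_2", "ENSG_EXAMPLE_3", "ENSG_EXAMPLE_4"],
    example_lookup,
)

# Genes whose label is still the Ensembl ID are the ones without a symbol.
unnamed = [gene_id for gene_id, label in labels.items() if gene_id == label]

print(labels)
print(unnamed)
```

Keeping the list of unnamed IDs makes it easy to re-check them against a later annotation release, in case some of them have been given names by then.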