Question: Why can't many ENSEMBL gene IDs be converted to Entrez IDs using org.Hs.eg.db in R?
biock wrote, 3 months ago:

Hello!
I'm analyzing some RNA-seq data with edgeR. According to the edgeR manual, we can use the org.Hs.egENSEMBL map in the org.Hs.eg.db package (version 3.7.0) to convert ENSEMBL gene IDs (ENSGxxxxxx) to Entrez IDs. However, I found that many ENSEMBL gene IDs cannot be found in egENSEMBL: it contains 30292 ENSEMBL ID records, whereas the GENCODE GRCh38 annotation file contains 58721 ENSEMBL gene IDs. Should I exclude the genes that are not in egENSEMBL from the downstream differential expression analysis, as the edgeR manual does?

Thank you!

Code from the edgeR manual (I use egENSEMBL instead of egREFSEQ in my pipeline):

# y is a DGEList object; keep only the genes whose RefSeq ID maps to an Entrez ID
library(org.Hs.eg.db)
idfound <- y$genes$RefSeqID %in% mappedRkeys(org.Hs.egREFSEQ)
y <- y[idfound,]
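
For the egENSEMBL variant, it may help to tally how many IDs map before dropping anything. The lines below are only a rough sketch, not code from the edgeR manual: the IDs are made-up placeholders so that it runs in plain R, and `y$genes$EnsemblID` in the comments is a hypothetical column name for wherever the ENSEMBL IDs are stored in the DGEList.

```r
# Toy stand-ins (made-up IDs, for illustration only). In a real session,
# assuming org.Hs.eg.db is installed, these would instead be:
#   library(org.Hs.eg.db)
#   mapped_keys <- mappedRkeys(org.Hs.egENSEMBL)
#   gene_ids    <- y$genes$EnsemblID   # hypothetical column name
mapped_keys <- c("ENSG_toy_1", "ENSG_toy_2", "ENSG_toy_4")
gene_ids    <- c("ENSG_toy_1", "ENSG_toy_2", "ENSG_toy_3", "ENSG_toy_5")

# Same membership test as in the snippet above
idfound <- gene_ids %in% mapped_keys

# Tally and list what would be lost before filtering
n_mapped     <- sum(idfound)
n_unmapped   <- sum(!idfound)
unmapped_ids <- gene_ids[!idfound]

cat("mapped:", n_mapped, " unmapped:", n_unmapped, "\n")
```

Looking at `unmapped_ids` before running `y <- y[idfound,]` shows which genes the filter would discard.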
Jean-Karim Heriche (EMBL Heidelberg, Germany) wrote, 3 months ago:

Ensembl and RefSeq are fundamentally different resources. Ensembl is an annotation of a reference genome, whereas RefSeq is a collection of sequences with annotations. They differ in particular in how they define a gene. In Ensembl, a gene is an annotated locus on the reference assembly. In RefSeq, it seems to be an extra attribute assigned to transcript sequences. How RefSeq assigns genes to sequences has never been clear to me. In general, I wouldn't recommend mixing references, i.e. either work with Ensembl or work with RefSeq.
