Question: R AnnotationDbi returning NA for pseudogenes during gene id annotation
8 weeks ago, sg197 (10) wrote:

I'm using "AnnotationDbi" and "" in R to annotate the gene IDs of my DE genes with both their gene symbol and Entrez ID. My original GTF file was downloaded from Ensembl, so I'm using Ensembl as the keytype.

It works for some of the genes, but for many it returns NA instead of a symbol and ID. I've googled a few of the gene IDs that return NA, and according to their Ensembl pages they are all pseudogenes (I'm not sure whether this is a coincidence).

Are pseudogenes not included in AnnotationDbi? If so, does anyone know another way to annotate gene IDs with their gene symbol and Entrez ID?


8 weeks ago, Kevin Blighe (52k) wrote:

You can create a lookup table via biomaRt:

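library(biomaRt)

mart <- useMart('ensembl', dataset = 'mmusculus_gene_ensembl')

# Attribute names match the column headers of the output shown below
lookuptable <- getBM(
  mart = mart,
  attributes = c(
    'ensembl_gene_id', 'entrezgene_id',
    'mgi_symbol', 'gene_biotype'),
  uniqueRows = TRUE)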

Now randomly select 20 rows from the table:

lookuptable[sample(1:nrow(lookuptable), 20),]

         ensembl_gene_id entrezgene_id    mgi_symbol           gene_biotype
23767 ENSMUSG00000031626        234214        Sorbs2         protein_coding
51795 ENSMUSG00000077380            NA       Gm22661                 snoRNA
47872 ENSMUSG00000039220         52040       Ppp1r10         protein_coding
20037 ENSMUSG00000100353            NA       Gm28859 unprocessed_pseudogene
50314 ENSMUSG00000098690            NA       Gm27316               misc_RNA
40352 ENSMUSG00000020327         67112         Fgf22         protein_coding
9819  ENSMUSG00000101704            NA       Gm21878 unprocessed_pseudogene
33302 ENSMUSG00000043331        258825       Olfr975         protein_coding
33862 ENSMUSG00000021481         26919        Zfp346         protein_coding
15008 ENSMUSG00000087333            NA       Gm13652                 lncRNA
35358 ENSMUSG00000083260            NA       Gm13460   processed_pseudogene
1304  ENSMUSG00000070858        381633        Gm1673         protein_coding
42353 ENSMUSG00000106333            NA 4930428O21Rik                 lncRNA
9893  ENSMUSG00000104133            NA 4921511E07Rik                    TEC
22670 ENSMUSG00000102334            NA       Gm18865   processed_pseudogene
37763 ENSMUSG00000115123            NA       Gm33299                 lncRNA
35981 ENSMUSG00000004814         56221         Ccl24         protein_coding
42510 ENSMUSG00000062400         68484      Krtap6-5         protein_coding
32556 ENSMUSG00000110703            NA       Gm45821                 lncRNA
12942 ENSMUSG00000056270        109314          Prr9         protein_coding
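
To attach these annotations to your DE genes, something along the following lines should work. This is only a minimal sketch: the data frame `de` and its `log2FC` values are hypothetical, and the small `lookuptable` here just reuses four rows from the output above so that the example runs without querying Ensembl.

```r
# Stand-in for the biomaRt lookup table, using four rows from the output above
lookuptable <- data.frame(
  ensembl_gene_id = c('ENSMUSG00000031626', 'ENSMUSG00000100353',
                      'ENSMUSG00000083260', 'ENSMUSG00000039220'),
  entrezgene_id = c(234214, NA, NA, 52040),
  mgi_symbol = c('Sorbs2', 'Gm28859', 'Gm13460', 'Ppp1r10'),
  gene_biotype = c('protein_coding', 'unprocessed_pseudogene',
                   'processed_pseudogene', 'protein_coding'),
  stringsAsFactors = FALSE)

# Hypothetical DE results; the log2FC values are made up for illustration
de <- data.frame(
  ensembl_gene_id = c('ENSMUSG00000100353', 'ENSMUSG00000031626',
                      'ENSMUSG00000083260'),
  log2FC = c(1.8, -2.1, 0.9),
  stringsAsFactors = FALSE)

# Left join: keep every DE gene, even when no annotation is found
annotated <- merge(de, lookuptable, by = 'ensembl_gene_id', all.x = TRUE)

# Cross-tabulate biotype against missing Entrez ID
table(annotated$gene_biotype, is.na(annotated$entrezgene_id))
```

As in the random sample above, the pseudogenes here have an MGI symbol but no Entrez ID. In the full table, an Ensembl ID can appear more than once if it maps to several Entrez IDs, so the merged result may have more rows than `de`.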

