Question

R AnnotationDbi returning NA for pseudogenes during gene id annotation


4.6 years ago

sg197 ▴ 40

I'm using "AnnotationDbi" and "org.Mm.eg.db" in R to annotate the gene IDs of my DE genes with both their gene symbol and Entrez ID. My original GTF file was downloaded from Ensembl, so I'm using ENSEMBL as the keytype.

This works for some of the genes, but for many it returns NA instead of a symbol and ID. I've googled a couple of the gene IDs that return NA, and according to their Ensembl pages they are all pseudogenes (not sure if this is a coincidence?).

Are pseudogenes not included in AnnotationDbi? If so, does anyone know another way to annotate gene IDs with their gene symbol and Entrez ID?

Thanks

R AnnotationDbi pseudogene DESeq2 ensembl • 2.4k views

updated 4.2 years ago by Fratam ▴ 50 • written 4.6 years ago by sg197 ▴ 40

score 1 · Answer 1 · 2020-02-16

I had the same problem while working on my human RNA-seq data: lincRNAs and other non-protein-coding transcripts were annotated as NA. I solved it by switching to a different database in my script and adapting the call so that it works with the new database.

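library(AnnotationDbi)
library(org.Hs.eg.db)

# Gene symbol looked up by Ensembl gene ID in the org.Hs.eg.db package
res$symbol <- mapIds(org.Hs.eg.db,
                     keys=ens.str,
                     column="SYMBOL",
                     keytype="ENSEMBL",
                     multiVals="first")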

This code gave me NA and "?" in my HTML report. I changed it to this:

library(EnsDb.Hsapiens.v75)

# Gene symbol looked up by Ensembl gene ID in the EnsDb package
res$symbol <- mapIds(EnsDb.Hsapiens.v75, keys=ens.str, column="SYMBOL", keytype="GENEID", multiVals="first")

# Transcript biotype (e.g. lncRNA, protein_coding, etc.)
res$Txbiotype <- mapIds(EnsDb.Hsapiens.v75, keys=ens.str, column="TXBIOTYPE", keytype="GENEID", multiVals="first")

Note that I changed the database from org.Hs.eg.db to EnsDb.Hsapiens.v75. That change alone required switching keytype="ENSEMBL" to keytype="GENEID", even though I was working with the same original annotation.

You can try the analogous database for Mus musculus, which should be EnsDb.Mmusculus.v79.
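For the mouse case, the same calls might look like the sketch below. It is an assumption-laden example rather than a tested recipe: the three Ensembl IDs are hypothetical stand-ins for your own keys (e.g. the row names of your results), the data frame name annot is made up for illustration, and whether a given ID is present in release 79 has not been checked here. The lookup only runs if the EnsDb.Mmusculus.v79 package is installed.

```r
# Hypothetical Ensembl mouse gene IDs; in practice use your own keys,
# e.g. the row names of the DE results table.
ens.str <- c("ENSMUSG00000031626", "ENSMUSG00000100353", "ENSMUSG00000087333")

# Run the lookup only if the Bioconductor annotation package is available.
if (requireNamespace("EnsDb.Mmusculus.v79", quietly = TRUE)) {
  library(AnnotationDbi)
  library(EnsDb.Mmusculus.v79)

  # Same mapIds() pattern as above, with keytype "GENEID" for EnsDb objects.
  annot <- data.frame(
    ensembl = ens.str,
    symbol  = mapIds(EnsDb.Mmusculus.v79, keys = ens.str, column = "SYMBOL",
                     keytype = "GENEID", multiVals = "first"),
    entrez  = mapIds(EnsDb.Mmusculus.v79, keys = ens.str, column = "ENTREZID",
                     keytype = "GENEID", multiVals = "first"),
    biotype = mapIds(EnsDb.Mmusculus.v79, keys = ens.str, column = "GENEBIOTYPE",
                     keytype = "GENEID", multiVals = "first")
  )
  print(annot)
}
```

Genes without an Entrez ID will still show NA in the entrez column, but the symbol and biotype columns should be filled for any ID that exists in that Ensembl release.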

score 0 · Answer 2 · 2019-10-10

You can create a lookup table via biomaRt:

library(biomaRt)
mart <- useMart('ensembl', dataset = 'mmusculus_gene_ensembl')

lookuptable <- getBM(
  mart = mart,
  attributes = c(
    'ensembl_gene_id',
    'entrezgene_id',
    'mgi_symbol',
    'gene_biotype'),
  uniqueRows = TRUE)

Now randomly select 20 rows from the table:

lookuptable[sample(1:nrow(lookuptable), 20),]

         ensembl_gene_id entrezgene_id    mgi_symbol           gene_biotype
23767 ENSMUSG00000031626        234214        Sorbs2         protein_coding
51795 ENSMUSG00000077380            NA       Gm22661                 snoRNA
47872 ENSMUSG00000039220         52040       Ppp1r10         protein_coding
20037 ENSMUSG00000100353            NA       Gm28859 unprocessed_pseudogene
50314 ENSMUSG00000098690            NA       Gm27316               misc_RNA
40352 ENSMUSG00000020327         67112         Fgf22         protein_coding
9819  ENSMUSG00000101704            NA       Gm21878 unprocessed_pseudogene
33302 ENSMUSG00000043331        258825       Olfr975         protein_coding
33862 ENSMUSG00000021481         26919        Zfp346         protein_coding
15008 ENSMUSG00000087333            NA       Gm13652                 lncRNA
35358 ENSMUSG00000083260            NA       Gm13460   processed_pseudogene
1304  ENSMUSG00000070858        381633        Gm1673         protein_coding
42353 ENSMUSG00000106333            NA 4930428O21Rik                 lncRNA
9893  ENSMUSG00000104133            NA 4921511E07Rik                    TEC
22670 ENSMUSG00000102334            NA       Gm18865   processed_pseudogene
37763 ENSMUSG00000115123            NA       Gm33299                 lncRNA
35981 ENSMUSG00000004814         56221         Ccl24         protein_coding
42510 ENSMUSG00000062400         68484      Krtap6-5         protein_coding
32556 ENSMUSG00000110703            NA       Gm45821                 lncRNA
12942 ENSMUSG00000056270        109314          Prr9         protein_coding
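A lookup table like this can then be joined back onto a results table. The sketch below uses base R only and toy data: res and its log2FoldChange values are hypothetical stand-ins for real DE results, the column names symbol, entrez, biotype and label are made up for illustration, and the small lookuptable reuses three rows from the output above (with one deliberately duplicated to show how duplicates are handled).

```r
# Toy stand-in for a DE results table keyed by Ensembl gene ID
# (the fold-change values are invented for illustration).
res <- data.frame(
  log2FoldChange = c(1.8, -2.1, 0.9),
  row.names = c("ENSMUSG00000031626", "ENSMUSG00000100353", "ENSMUSG00000087333")
)

# Toy stand-in for the getBM() output; the first gene appears twice on purpose.
lookuptable <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000031626", "ENSMUSG00000100353",
                      "ENSMUSG00000087333", "ENSMUSG00000031626"),
  entrezgene_id   = c(234214, NA, NA, 234214),
  mgi_symbol      = c("Sorbs2", "Gm28859", "Gm13652", "Sorbs2"),
  gene_biotype    = c("protein_coding", "unprocessed_pseudogene",
                      "lncRNA", "protein_coding"),
  stringsAsFactors = FALSE
)

# match() returns the first matching row per ID, so duplicated Ensembl IDs
# in the lookup table do not duplicate rows in res.
idx <- match(rownames(res), lookuptable$ensembl_gene_id)
res$symbol  <- lookuptable$mgi_symbol[idx]
res$entrez  <- lookuptable$entrezgene_id[idx]
res$biotype <- lookuptable$gene_biotype[idx]

# Fall back to the Ensembl ID wherever no symbol is available.
no_symbol <- is.na(res$symbol) | res$symbol == ""
res$label <- ifelse(no_symbol, rownames(res), res$symbol)

print(res)
```

Using match() rather than merge() keeps the row order and row count of res unchanged, which matters if the table is later passed on to plotting or reporting code.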

Kevin