Question

A large portion of gene ID cannot be mapped when running the "bitr" command of the "clusterProfiler" package

0

Entering edit mode

12 months ago

sunyeping ▴ 110

Hello everyone,

When I run the bitr command, I get warning message saying that 58.9% of input gene IDs are fail to map..., like the following:

library(ggplot2)
library(clusterProfiler)
> gs.up = bitr(up, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
'select()' returned 1:1 mapping between keys and columns
Warning message:
In bitr(up, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db") :
  58.9% of input gene IDs are fail to map...

The input genes are:

> up
   [1] "KRT14"        "S100A8"       "AC114760.2"   "S100A9"       "KRT6A"       
   [6] "KRT16"        "RPL13A"       "TRAC"         "MALAT1"       "RPL27A"      
  [11] "RPL7"         "ACTB"         "RPS2"         "S100A2"       "RPL21"       
  [16] "TMSB4X"       "RPL31"        "PSMA2"        "CD52"         "S100A4"      
  [21] "RPS17"        "CD3D"         "PFN1"         "RPS20"        "MYL12A"      
  [26] "CTSD"         "GZMA"         "B2M"          "RPS10"        "ARPC4"       
  [31] "CORO1A"       "CCL3"         "CKLF"         "NDUFB8"       "TMBIM4"      
  [36] "SH3BGRL3"     "ID2"          "IL32"         "RPL23"        "ARF6"        
  [41] "ARPC1B"       "CCL4"         "RGS1"         "TMSB10"       "CFL1"        
  [46] "IL2RG"        "RPS11"        "TRGC2"        "IFNG"         "EVL"         
  [51] "GMFG"         "ACTG1"        "RPS18"        "ATP5PO"       "RPLP2"       
  [56] "VAMP2"        "GZMB"         "RPL15"        "S100A6"       "CD63"        
  [61] "CD2"          "CYBA"         "RPL34"        "NDUFA11"      "CD69"        
  [66] "CAPG"         "LGALS1"       "RPL3"         "MZT2B"        "NEAT1"       
  [71] "HCST"         "LSP1"         "CHMP4A"       "APOBEC3C"     "CCL5"

Do you have any suggestions on how to address these problems?

With many thanks!

scRNA-seq Seurat GSEA • 1.7k views

ADD COMMENT • link updated 12 months ago by rpolicastro 13k • written 12 months ago by sunyeping ▴ 110

0

Entering edit mode

What is the source of the gene symbols you're trying to convert, i.e. what genome annotation and steps were taken to generate this?

ADD REPLY • link 12 months ago by rpolicastro 13k

0

Entering edit mode

The gene symbols are from a Seurat object.

ADD REPLY • link 12 months ago by sunyeping ▴ 110

0

Entering edit mode

You need to provide more information than this otherwise we won't be able to answer the question. There are multiple genome annotation sources, multiple annotation versions per species, and software used to generate the count matrices can report the features in different ways.

ADD REPLY • link 12 months ago by rpolicastro 13k

0

Entering edit mode

The dataset is in rds format. It was download from DISCO, a single-cell database （Publication: https://doi.org/10.1093/nar/gkab1020）. Accordint to this DISCO paper, the datasets deposited in this database were retrived as raw counts (fastq, SRA, or bam files) from datases (GEO, ArrayExpress, GSA, and other public resources) and preprocessed. The SRA and bam files were converted to fastq file. The data were demultiplexed, and then reads tagged with a cell barcode and UMI were mapped to the human reference genome assembly hg38 using STAR (version 020201) , and then assigned to the corresponding gene (Ensembl 93) with featureCounts (version 2.0.2).

ADD REPLY • link 12 months ago by sunyeping ▴ 110

0

Entering edit mode

Great, thanks for the additional info 👍

One last question, are there any genes in ENSEMBL gene ID format (e.g. ENSG0000001)?

head(up[grep(up, pattern="^ENSG")])

ADD REPLY • link 12 months ago by rpolicastro 13k

0

Entering edit mode

> up[grep(up, pattern="^ENSG")]
character(0)

ADD REPLY • link 12 months ago by sunyeping ▴ 110

score 1 · Accepted Answer · 2023-04-23

The reference they used was ENSEMBL and only gene symbols/names are present, so your best bet to fix or troubleshoot the issue is to query ENSEMBL directly using the same annotation version they used (93). I'll use the biomaRt package in R for this.

library("biomaRt")

up <- c("KRT14", "S100A8", "AC114760.2", "S100A9", "KRT6A", "KRT16", 
"RPL13A", "TRAC", "MALAT1", "RPL27A", "RPL7", "ACTB", "RPS2", 
"S100A2", "RPL21", "TMSB4X", "RPL31", "PSMA2", "CD52", "S100A4", 
"RPS17", "CD3D", "PFN1", "RPS20", "MYL12A", "CTSD", "GZMA", "B2M", 
"RPS10", "ARPC4", "CORO1A", "CCL3", "CKLF", "NDUFB8", "TMBIM4", 
"SH3BGRL3", "ID2", "IL32", "RPL23", "ARF6", "ARPC1B", "CCL4", 
"RGS1", "TMSB10", "CFL1", "IL2RG", "RPS11", "TRGC2", "IFNG", 
"EVL", "GMFG", "ACTG1", "RPS18", "ATP5PO", "RPLP2", "VAMP2", 
"GZMB", "RPL15", "S100A6", "CD63", "CD2", "CYBA", "RPL34", "NDUFA11", 
"CD69", "CAPG", "LGALS1", "RPL3", "MZT2B", "NEAT1", "HCST", "LSP1", 
"CHMP4A", "APOBEC3C", "CCL5")

ensembl <- useEnsembl(biomart="genes", dataset="hsapiens_gene_ensembl", version=93)

symbol_to_entrez <- getBM(
  attributes=c("external_gene_name", "ensembl_gene_id", "entrezgene"),
  filters="external_gene_name", values=up, mart=ensembl)

You now have a table mapping your gene symbols to Entrez IDs.

  external_gene_name ensembl_gene_id entrezgene
1             LGALS1 ENSG00000100097       3956
2                B2M ENSG00000273686        567
3               TRAC ENSG00000277734         NA
4              KRT14 ENSG00000186847       3861
5              KRT16 ENSG00000186832       3868
6               CFL1 ENSG00000172757       1072

From the input genes you provided there should only be 5 genes without a mapping to an Entrez ID.

> symbol_to_entrez[is.na(symbol_to_entrez$entrezgene), ]
   external_gene_name ensembl_gene_id entrezgene
3                TRAC ENSG00000277734         NA
17         AC114760.2 ENSG00000272211         NA
52              TRGC2 ENSG00000227191         NA
70              NEAT1 ENSG00000245532         NA
73             MALAT1 ENSG00000251562         NA

When you want to move forward with your analysis you can pull the info your need from the table.

gs.up <- unique(symbol_to_entrez[!is.na(symbol_to_entrez$entrezgene), ]$entrezgene)