A large portion of gene ID cannot be mapped when running the "bitr" command of the "clusterProfiler" package
1
0
Entering edit mode
12 months ago
sunyeping ▴ 110

Hello everyone,

When I run the bitr command, I get warning message saying that 58.9% of input gene IDs are fail to map..., like the following:

library(ggplot2)
library(clusterProfiler)
> gs.up = bitr(up, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
'select()' returned 1:1 mapping between keys and columns
Warning message:
In bitr(up, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db") :
  58.9% of input gene IDs are fail to map...

The input genes are:

> up
   [1] "KRT14"        "S100A8"       "AC114760.2"   "S100A9"       "KRT6A"       
   [6] "KRT16"        "RPL13A"       "TRAC"         "MALAT1"       "RPL27A"      
  [11] "RPL7"         "ACTB"         "RPS2"         "S100A2"       "RPL21"       
  [16] "TMSB4X"       "RPL31"        "PSMA2"        "CD52"         "S100A4"      
  [21] "RPS17"        "CD3D"         "PFN1"         "RPS20"        "MYL12A"      
  [26] "CTSD"         "GZMA"         "B2M"          "RPS10"        "ARPC4"       
  [31] "CORO1A"       "CCL3"         "CKLF"         "NDUFB8"       "TMBIM4"      
  [36] "SH3BGRL3"     "ID2"          "IL32"         "RPL23"        "ARF6"        
  [41] "ARPC1B"       "CCL4"         "RGS1"         "TMSB10"       "CFL1"        
  [46] "IL2RG"        "RPS11"        "TRGC2"        "IFNG"         "EVL"         
  [51] "GMFG"         "ACTG1"        "RPS18"        "ATP5PO"       "RPLP2"       
  [56] "VAMP2"        "GZMB"         "RPL15"        "S100A6"       "CD63"        
  [61] "CD2"          "CYBA"         "RPL34"        "NDUFA11"      "CD69"        
  [66] "CAPG"         "LGALS1"       "RPL3"         "MZT2B"        "NEAT1"       
  [71] "HCST"         "LSP1"         "CHMP4A"       "APOBEC3C"     "CCL5"        

Do you have any suggestions on how to address these problems?

With many thanks!

scRNA-seq Seurat GSEA • 1.7k views
ADD COMMENT
0
Entering edit mode

What is the source of the gene symbols you're trying to convert, i.e. what genome annotation and steps were taken to generate this?

ADD REPLY
0
Entering edit mode

The gene symbols are from a Seurat object.

ADD REPLY
0
Entering edit mode

You need to provide more information than this otherwise we won't be able to answer the question. There are multiple genome annotation sources, multiple annotation versions per species, and software used to generate the count matrices can report the features in different ways.

ADD REPLY
0
Entering edit mode

The dataset is in rds format. It was download from DISCO, a single-cell database (Publication: https://doi.org/10.1093/nar/gkab1020). Accordint to this DISCO paper, the datasets deposited in this database were retrived as raw counts (fastq, SRA, or bam files) from datases (GEO, ArrayExpress, GSA, and other public resources) and preprocessed. The SRA and bam files were converted to fastq file. The data were demultiplexed, and then reads tagged with a cell barcode and UMI were mapped to the human reference genome assembly hg38 using STAR (version 020201) , and then assigned to the corresponding gene (Ensembl 93) with featureCounts (version 2.0.2).

ADD REPLY
0
Entering edit mode

Great, thanks for the additional info 👍

One last question, are there any genes in ENSEMBL gene ID format (e.g. ENSG0000001)?

head(up[grep(up, pattern="^ENSG")])
ADD REPLY
0
Entering edit mode
> up[grep(up, pattern="^ENSG")]
character(0)
ADD REPLY
1
Entering edit mode
12 months ago

The reference they used was ENSEMBL and only gene symbols/names are present, so your best bet to fix or troubleshoot the issue is to query ENSEMBL directly using the same annotation version they used (93). I'll use the biomaRt package in R for this.

library("biomaRt")

up <- c("KRT14", "S100A8", "AC114760.2", "S100A9", "KRT6A", "KRT16", 
"RPL13A", "TRAC", "MALAT1", "RPL27A", "RPL7", "ACTB", "RPS2", 
"S100A2", "RPL21", "TMSB4X", "RPL31", "PSMA2", "CD52", "S100A4", 
"RPS17", "CD3D", "PFN1", "RPS20", "MYL12A", "CTSD", "GZMA", "B2M", 
"RPS10", "ARPC4", "CORO1A", "CCL3", "CKLF", "NDUFB8", "TMBIM4", 
"SH3BGRL3", "ID2", "IL32", "RPL23", "ARF6", "ARPC1B", "CCL4", 
"RGS1", "TMSB10", "CFL1", "IL2RG", "RPS11", "TRGC2", "IFNG", 
"EVL", "GMFG", "ACTG1", "RPS18", "ATP5PO", "RPLP2", "VAMP2", 
"GZMB", "RPL15", "S100A6", "CD63", "CD2", "CYBA", "RPL34", "NDUFA11", 
"CD69", "CAPG", "LGALS1", "RPL3", "MZT2B", "NEAT1", "HCST", "LSP1", 
"CHMP4A", "APOBEC3C", "CCL5")

ensembl <- useEnsembl(biomart="genes", dataset="hsapiens_gene_ensembl", version=93)

symbol_to_entrez <- getBM(
  attributes=c("external_gene_name", "ensembl_gene_id", "entrezgene"),
  filters="external_gene_name", values=up, mart=ensembl)

You now have a table mapping your gene symbols to Entrez IDs.

  external_gene_name ensembl_gene_id entrezgene
1             LGALS1 ENSG00000100097       3956
2                B2M ENSG00000273686        567
3               TRAC ENSG00000277734         NA
4              KRT14 ENSG00000186847       3861
5              KRT16 ENSG00000186832       3868
6               CFL1 ENSG00000172757       1072

From the input genes you provided there should only be 5 genes without a mapping to an Entrez ID.

> symbol_to_entrez[is.na(symbol_to_entrez$entrezgene), ]
   external_gene_name ensembl_gene_id entrezgene
3                TRAC ENSG00000277734         NA
17         AC114760.2 ENSG00000272211         NA
52              TRGC2 ENSG00000227191         NA
70              NEAT1 ENSG00000245532         NA
73             MALAT1 ENSG00000251562         NA

When you want to move forward with your analysis you can pull the info your need from the table.

gs.up <- unique(symbol_to_entrez[!is.na(symbol_to_entrez$entrezgene), ]$entrezgene)
ADD COMMENT

Login before adding your answer.

Traffic: 2155 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6