When I convert Ensembl IDs to gene symbols, why are lots of genes duplicated?
2.5 years ago
Zahra ▴ 110

Hi all, I have raw counts of samples in a data frame. The row names are Ensembl IDs, and I want to convert them to gene symbols, so I ran the code below.

library(TCGAbiolinks)

# Query HTSeq raw counts for TCGA-COAD tumor and normal samples
query <- GDCquery(project = "TCGA-COAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  sample.type = c("Primary Tumor", "Solid Tissue Normal"),
                  experimental.strategy = "RNA-Seq")

GDCdownload(query)

query.counts.colon <- GDCprepare(query)

# Count matrix: rows are Ensembl gene IDs, columns are samples
ColonMatrix <- as.data.frame(SummarizedExperiment::assay(query.counts.colon))

ens <- row.names(ColonMatrix)

> length(ens)
[1] 56602


# Convert Ensembl IDs to gene symbols with org.Hs.eg.db
library(org.Hs.eg.db)
ens_to_symbol <- mapIds(
  org.Hs.eg.db,
  keys = ens,
  column = 'SYMBOL',
  keytype = 'ENSEMBL')

# Same conversion with biomaRt
library(biomaRt)
mart <- useDataset('hsapiens_gene_ensembl', useMart('ensembl'))
ens_to_symbol_biomart <- getBM(
  filters = 'ensembl_gene_id',
  attributes = c('ensembl_gene_id', 'hgnc_symbol'),
  values = ens,
  mart = mart)


# Left-join so every input Ensembl ID keeps a row, even without a symbol
ens_to_symbol_biomart <- merge(
  x = as.data.frame(ens),
  y = ens_to_symbol_biomart,
  by.x = 'ens',
  by.y = 'ensembl_gene_id',
  all.x = TRUE)
head(ens_to_symbol_biomart)

              ens hgnc_symbol
1 ENSG00000000003      TSPAN6
2 ENSG00000000005        TNMD
3 ENSG00000000419        DPM1
4 ENSG00000000457       SCYL3
5 ENSG00000000460    C1orf112
6 ENSG00000000938         FGR

But when I check for duplicated gene symbols, I find this:

> table(duplicated(ens_to_symbol_biomart$hgnc_symbol))
FALSE  TRUE 
38446 18156

I don't know the reason for these duplicates. Should I remove the duplicated rows? Thanks for any help.

Ensembl raw TCGA counts RNA-Seq • 1.4k views

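You might have duplicates in your Ensembl ID list. Or there might be many Ensembl IDs with no gene symbol (blank or NA); `duplicated()` treats every blank after the first as a duplicate, so those rows alone can inflate the count. I don't think there are 56k coding genes in the human genome.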
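
To tell these cases apart, it may help to count them separately. Here is a minimal sketch on a toy table. The `mapping` data frame and its IDs and symbols are made up for illustration, and I'm assuming that unmapped IDs show up as either `NA` or an empty string, so the code checks both.

```r
# Toy stand-in for the merged mapping table (hypothetical values, not real output)
mapping <- data.frame(
  ens = c("ENSG01", "ENSG02", "ENSG03", "ENSG04", "ENSG05", "ENSG06"),
  hgnc_symbol = c("GENE_A", "", "", NA, NA, "GENE_A"),
  stringsAsFactors = FALSE
)

# What the naive check reports: repeated NA and "" are counted as duplicates too
n_naive <- sum(duplicated(mapping$hgnc_symbol))

# 1. Are any Ensembl IDs themselves repeated?
n_dup_ids <- sum(duplicated(mapping$ens))

# 2. How many rows have no symbol at all (NA or empty string)?
no_symbol <- is.na(mapping$hgnc_symbol) | mapping$hgnc_symbol == ""
n_no_symbol <- sum(no_symbol)

# 3. Among rows that do have a symbol, how many repeat an earlier symbol?
n_true_dup <- sum(duplicated(mapping$hgnc_symbol[!no_symbol]))

print(c(naive = n_naive, dup_ids = n_dup_ids,
        no_symbol = n_no_symbol, true_dup = n_true_dup))
```

On the real table, swap `mapping` for `ens_to_symbol_biomart`. If most of the flagged rows fall under `n_no_symbol`, the duplicates are mostly IDs without a symbol rather than genuinely repeated genes.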
