Question

Unable to match Ensembl IDs from TCGA-related CNV datasets

2

3.1 years ago

dc37 ▴ 20

Hi everyone,

I'm working on copy number data from TCGA. I downloaded the "Gene Level Copy Number Scores" data using the TCGAbiolinks R package and the following code:

library(TCGAbiolinks)

query_cnv <- GDCquery(project = "TCGA-KICH",
                  data.category = "Copy Number Variation",
                  data.type = "Gene Level Copy Number Scores")
GDCdownload(query_cnv)
data <- GDCprepare(query_cnv)

Everything works great. I get a nice data frame whose first three columns are "Gene.Symbol", "Gene.ID" and "Cytoband". To make the analysis easier and to be able to merge with data from other sources such as RNA-seq, I tried to convert the Ensembl gene IDs contained in Gene.Symbol to HUGO symbols using biomaRt.

library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- gsub(".\\.","\\1",data$Gene.Symbol)
geneIDs <- getBM(filters = "ensembl_gene_id", attributes = c("ensembl_gene_id","hgnc_symbol"), values = genes, mart = mart)

However, out of 19729 distinct Ensembl IDs, I only get 3269 matches.

What is surprising is that, according to the GDC docs (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/), this dataset should contain a CNV score for each gene, so I would expect many more matches, at least for protein-coding genes.

When I searched for the Ensembl IDs that biomaRt did not find, I got zero hits from both Ensembl and NCBI (examples: "ENSG000000081221" "ENSG000000081314" "ENSG000000676014" "ENSG000000783616" "ENSG000000788015"). So it looks like these IDs do not exist in any database. Where are they coming from?

Did I miss something? Is it normal to have so few protein-coding genes in this kind of dataset? Should I process the data differently for this analysis?

Any suggestions or comments will be really helpful.

Ensembl Number TCGA TCGABiolinks Variation Copy • 1.5k views

updated 3.1 years ago by Kevin Blighe 87k • written 3.1 years ago by dc37 ▴ 20

score 5 · Accepted Answer · 2021-04-03

The IDs as you present them do not exist.

The problem is this line of code, which is not doing what I think you expect it to do:

genes <- gsub(".\\.","\\1",data$Gene.Symbol)
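A quick way to see what that pattern actually does is to run it on a couple of versioned IDs. This is only a sketch: the version suffixes (.21, .14) are assumed for illustration, inferred from the broken IDs quoted in the question, and the helper name is_ensg is made up here. It relies on human Ensembl gene IDs being "ENSG" followed by 11 digits.

```r
# Two versioned IDs; the ".21" and ".14" suffixes are assumed for illustration
ids <- c("ENSG00000008128.21", "ENSG00000067606.14")

# Original call: ".\\." matches one character plus a literal dot, and "\\1"
# points at a capture group that was never defined, so the match is dropped
broken <- gsub(".\\.", "\\1", ids)

# Intended behaviour: remove only the dot and the trailing version digits
fixed <- sub("\\.[0-9]+$", "", ids)

# Hypothetical sanity check: "ENSG" followed by exactly 11 digits
is_ensg <- function(x) grepl("^ENSG[0-9]{11}$", x)

print(broken)
print(fixed)
print(is_ensg(broken))
print(is_ensg(fixed))
```

Running a format check like is_ensg() on the cleaned IDs before querying any annotation source is a cheap way to catch this kind of problem early.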

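The pattern ".\\." matches any single character followed by a literal dot, and "\\1" refers to a capture group that the pattern never defines, so each match is replaced with nothing. The last digit of the ID and the dot are deleted, and the version digits end up glued onto the rest of the ID, which is where IDs such as "ENSG000000081221" come from. What you need to do instead is strip the version suffix (the dot and the digits after it) from each Ensembl ID. This works: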

library(TCGAbiolinks)
query_cnv <- GDCquery(project = "TCGA-KICH",
                  data.category = "Copy Number Variation",
                  data.type = "Gene Level Copy Number Scores")
GDCdownload(query_cnv)
data <- GDCprepare(query_cnv)
data <- data.frame(data)

ens <- sub('\\.[0-9]*$', '', data$Gene.Symbol)

library(org.Hs.eg.db)
ens_to_symbol <- mapIds(
  org.Hs.eg.db,
  keys = ens,
  column = 'SYMBOL',
  keytype = 'ENSEMBL')
head(ens_to_symbol)
ENSG00000008128 ENSG00000008130 ENSG00000067606 ENSG00000078369 ENSG00000078808 
       "CDK11A"          "NADK"         "PRKCZ"          "GNB1"          "SDF4" 
ENSG00000107404 
         "DVL1"


library(biomaRt)
mart <- useDataset('hsapiens_gene_ensembl', useMart('ensembl'))
ens_to_symbol_biomart <- getBM(
  filters = 'ensembl_gene_id',
  attributes = c('ensembl_gene_id', 'hgnc_symbol'),
  values = ens,
  mart = mart)
ens_to_symbol_biomart <- merge(
  x = as.data.frame(ens),
  y =  ens_to_symbol_biomart ,
  by.y = 'ensembl_gene_id',
  all.x = TRUE,
  by.x = 'ens')

head(ens_to_symbol_biomart)
              ens hgnc_symbol
1 ENSG00000000003      TSPAN6
2 ENSG00000000005        TNMD
3 ENSG00000000419        DPM1
4 ENSG00000000457       SCYL3
5 ENSG00000000460    C1orf112
6 ENSG00000000938         FGR
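
Since the goal in the question is to merge the CNV table with other data, here is a minimal sketch of attaching the symbols back to the original rows. It uses toy stand-ins rather than real GDC or biomaRt output: the objects cnv and lookup, the score column, the version suffixes and the ID "ENSG99999999999" are all made up for illustration. It uses match() instead of merge() so the CNV rows keep their original order, and it treats an empty hgnc_symbol as missing, on the assumption that biomaRt reports genes without a symbol as "".

```r
# Toy CNV table (hypothetical values, not GDC output)
cnv <- data.frame(
  Gene.Symbol = c("ENSG00000000460.15", "ENSG00000000003.13", "ENSG99999999999.1"),
  score = c(1, 0, -1),
  stringsAsFactors = FALSE
)

# Toy lookup table shaped like the getBM() result above
lookup <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000460"),
  hgnc_symbol = c("TSPAN6", "C1orf112"),
  stringsAsFactors = FALSE
)

# Strip the version suffix, then look up each ID; unmatched IDs become NA
cnv$ens <- sub("\\.[0-9]+$", "", cnv$Gene.Symbol)
cnv$hgnc_symbol <- lookup$hgnc_symbol[match(cnv$ens, lookup$ensembl_gene_id)]

# Assumption: an empty string means "no symbol", so treat it as missing
cnv$hgnc_symbol[!is.na(cnv$hgnc_symbol) & cnv$hgnc_symbol == ""] <- NA

# Count how many rows received a symbol
n_mapped <- sum(!is.na(cnv$hgnc_symbol))
print(cnv)
print(n_mapped)
```

Note that match() only returns the first hit, so if the lookup table has several rows for the same Ensembl ID, only the first symbol is kept.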

Kevin

===============