Unable to match Ensembl IDs from TCGA-related CNV datasets
3.1 years ago
dc37 ▴ 20

Hi everyone,

I'm working on copy number data from TCGA. I downloaded the "Gene Level Copy Number Scores" data using the TCGAbiolinks R package and the following code:

library(TCGAbiolinks)

query_cnv <- GDCquery(project = "TCGA-KICH",
                  data.category = "Copy Number Variation",
                  data.type = "Gene Level Copy Number Scores")
GDCdownload(query_cnv)
data <- GDCprepare(query_cnv)

Everything works great. I get a nice data frame whose first three columns are "Gene.Symbol" / "Gene.ID" / "Cytoband". To make the analysis easier and to be able to merge with data from other sources such as RNA-Seq, I tried to convert the Ensembl gene IDs contained in Gene.Symbol to HUGO symbols using BioMart.

library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- gsub(".\\.","\\1",data$Gene.Symbol)
geneIDs <- getBM(filters = "ensembl_gene_id", attributes = c("ensembl_gene_id","hgnc_symbol"), values = genes, mart = mart)

However, out of 19729 distinct Ensembl IDs, I only get 3269 matches.

What is surprising is that, according to the GDC docs (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/), this dataset should contain the CNV associated with each gene, so I would expect rather more matches for coding regions.

When I searched for the Ensembl IDs that BioMart did not find, I got zero hits from both Ensembl and NCBI (examples: "ENSG000000081221" "ENSG000000081314" "ENSG000000676014" "ENSG000000783616" "ENSG000000788015"). So it is as if these IDs did not exist in any database. Where are they coming from?

Did I miss something? Is it normal to have so few protein-coding genes in this kind of dataset? Should I process such data differently for the analysis?

Any suggestions or comments will be really helpful.

3.1 years ago

The IDs as you present them do not exist.

The problem is this line of code, which is not doing what [I believe] you believe it's doing:

genes <- gsub(".\\.","\\1",data$Gene.Symbol)
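To make the failure concrete, here is a minimal sketch on a single ID. The versioned ID below is an assumption: its ".21" suffix is inferred from the garbled "ENSG000000081221" quoted in the question, not read from the downloaded data. The pattern ".\\." matches any one character followed by a literal dot, and because the pattern has no capture group, "\\1" expands to nothing, so the last digit of the ID and the dot are deleted while the version digits stay glued on.

```r
# Illustrative versioned ID; the ".21" suffix is an assumption inferred from
# the garbled ID "ENSG000000081221" quoted in the question.
versioned <- "ENSG00000008128.21"

# Original pattern: "." matches ANY character and "\\." a literal dot, so the
# match is "8." (last digit of the ID plus the dot). There is no capture
# group, so "\\1" is empty and the match is simply deleted.
broken <- gsub(".\\.", "\\1", versioned)
print(broken)

# Anchored pattern: a literal dot followed by digits at the end of the string,
# which removes only the version suffix.
fixed <- sub("\\.[0-9]*$", "", versioned)
print(fixed)
```

With R's default regex engine, the first call should reproduce the first non-existent ID from the question, and the second should return the plain Ensembl gene ID.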

You need to remove the final digits from each Ensembl ID after the dot. This works:

library(TCGAbiolinks)
query_cnv <- GDCquery(project = "TCGA-KICH",
                  data.category = "Copy Number Variation",
                  data.type = "Gene Level Copy Number Scores")
GDCdownload(query_cnv)
data <- GDCprepare(query_cnv)
data <- data.frame(data)

# strip the version suffix (the dot and the trailing digits) from each Ensembl ID
ens <- sub('\\.[0-9]*$', '', data$Gene.Symbol)

library(org.Hs.eg.db)
ens_to_symbol <- mapIds(
  org.Hs.eg.db,
  keys = ens,
  column = 'SYMBOL',
  keytype = 'ENSEMBL')
head(ens_to_symbol)
ENSG00000008128 ENSG00000008130 ENSG00000067606 ENSG00000078369 ENSG00000078808 
       "CDK11A"          "NADK"         "PRKCZ"          "GNB1"          "SDF4" 
ENSG00000107404 
         "DVL1"


library(biomaRt)
mart <- useDataset('hsapiens_gene_ensembl', useMart('ensembl'))
ens_to_symbol_biomart <- getBM(
  filters = 'ensembl_gene_id',
  attributes = c('ensembl_gene_id', 'hgnc_symbol'),
  values = ens,
  mart = mart)
ens_to_symbol_biomart <- merge(
  x = as.data.frame(ens),
  y = ens_to_symbol_biomart,
  by.x = 'ens',
  by.y = 'ensembl_gene_id',
  all.x = TRUE)

head(ens_to_symbol_biomart)
              ens hgnc_symbol
1 ENSG00000000003      TSPAN6
2 ENSG00000000005        TNMD
3 ENSG00000000419        DPM1
4 ENSG00000000457       SCYL3
5 ENSG00000000460    C1orf112
6 ENSG00000000938         FGR
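To check how many IDs are still without a symbol after the merge, something along these lines may help. It is only a sketch on a toy table: the data frame and the two placeholder IDs are hypothetical stand-ins for the real merged result, and it assumes an unmatched ID shows up either as NA (no row returned, filled in by the left join) or as an empty string (row returned without an HGNC symbol).

```r
# Toy stand-in for the merged table; the last two IDs are hypothetical
# placeholders, not real Ensembl IDs.
merged <- data.frame(
  ens = c("ENSG00000000003", "ENSG_HYPOTHETICAL_1", "ENSG_HYPOTHETICAL_2"),
  hgnc_symbol = c("TSPAN6", "", NA),
  stringsAsFactors = FALSE)

# An ID counts as unmatched if the symbol is NA (no row came back for it)
# or an empty string (a row came back without a symbol).
unmatched <- is.na(merged$hgnc_symbol) | merged$hgnc_symbol == ""

n_unmatched <- sum(unmatched)
unmatched_ids <- merged$ens[unmatched]

print(n_unmatched)
print(unmatched_ids)
```

On the real data, the same two lines applied to ens_to_symbol_biomart would give the IDs worth inspecting by hand.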

Kevin

===============


Thank you so much for your reply! Indeed, it was just that dumb mistake in the gsub call (I'm ashamed not to have checked that part). With your code it now works perfectly (only 289 unmatched out of 19729 distinct IDs). Thank you very much for your kind help.


No problem, dc37. We all make slight errors / mistakes.
