Hi everyone,
I'm working on copy number data from TCGA. I downloaded the gene-level copy number variation data ("Gene Level Copy Number Scores") using the TCGAbiolinks R package and the following code:
library(TCGAbiolinks)
query_cnv <- GDCquery(project = "TCGA-KICH",
data.category = "Copy Number Variation",
data.type = "Gene Level Copy Number Scores")
GDCdownload(query_cnv)
data <- GDCprepare(query_cnv)
Everything works great. I get a nice data frame whose first three columns are "Gene.Symbol", "Gene.ID" and "Cytoband". To facilitate the analysis and to be able to merge it with data from other sources such as RNA-Seq, I tried to convert the Ensembl gene IDs contained in Gene.Symbol to HUGO symbols using biomaRt.
library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- gsub(".\\.","\\1",data$Gene.Symbol)
geneIDs <- getBM(filters = "ensembl_gene_id", attributes = c("ensembl_gene_id","hgnc_symbol"), values = genes, mart = mart)
However, out of 19729 distinct Ensembl IDs, I only get 3269 matches.
What is surprising is that, according to the GDC docs (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/), this dataset should contain the CNV associated with each gene, so I would expect many more matches, at least for the protein-coding genes.
When I search for the Ensembl IDs that biomaRt did not find, I get zero hits from both Ensembl and NCBI (examples: "ENSG000000081221" "ENSG000000081314" "ENSG000000676014" "ENSG000000783616" "ENSG000000788015"). So it is as if these IDs do not exist in any database. Where are they coming from?
Did I miss something? Is it normal to have so few protein-coding genes in this kind of dataset? Should I process the data differently for this kind of analysis?
Any suggestions or comments will be really helpful.
Thank you so much for your reply! Indeed, it was just a dumb mistake in the gsub call (I'm ashamed not to have checked that part). Now, with your code, it works perfectly (only 289 unmatched out of 19729 distinct IDs). Thank you very much for your kind help.
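For anyone landing here later: the code from the reply is not reproduced in this thread, so what follows is only a sketch of the kind of gsub mistake being discussed, not the exact fix that was posted. The pattern ".\\." matches any single character followed by a literal dot, so it removes the dot together with the last digit of the ID and glues the version number onto what is left. The two input IDs below are assumptions, chosen so that the broken pattern reproduces two of the mangled IDs quoted in the question.

```r
# Illustrative versioned Ensembl IDs (assumed values, not taken from the dataset).
ids <- c("ENSG00000008128.21", "ENSG00000008130.14")

# Pattern from the question: "." is any character, "\\." is a literal dot.
# With no capture group, "\\1" is empty, so the dot and the character
# before it are deleted and the version digits stay attached to the ID.
broken <- gsub(".\\.", "\\1", ids)

# One possible correction: drop the dot and the trailing version number.
fixed <- sub("\\.[0-9]+$", "", ids)

print(broken)
print(fixed)
```

The stripped IDs in `fixed` are what you would then pass as `values` to getBM with the "ensembl_gene_id" filter.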
No problem, dc37. We all make slight errors / mistakes.