I'm working on copy number data from TCGA. I downloaded the "Gene Level Copy Number Scores" data using the TCGAbiolinks R package and the following code:
library(TCGAbiolinks)
query_cnv <- GDCquery(project = "TCGA-KICH",
                      data.category = "Copy Number Variation",
                      data.type = "Gene Level Copy Number Scores")
GDCdownload(query_cnv)
data <- GDCprepare(query_cnv)
Everything works great. I get a nice data frame whose first three columns are "Gene.Symbol", "Gene.ID" and "Cytoband". To facilitate the analysis and to be able to merge data from other sources such as RNA-Seq, I tried to convert the Ensembl gene IDs contained in Gene.Symbol to HUGO symbols using BioMart:
library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- gsub(".\\.","\\1",data$Gene.Symbol)
geneIDs <- getBM(filters = "ensembl_gene_id",
                 attributes = c("ensembl_gene_id","hgnc_symbol"),
                 values = genes, mart = mart)
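For reference, here is a minimal sketch of what the version-stripping step is meant to do, assuming Gene.Symbol holds versioned Ensembl IDs of the form "ENSG...", a dot, and a version number. The two IDs below are illustrative values chosen for the sketch, not rows copied from my data frame, and the anchored pattern is just one possible way to write the step.

```r
# Illustrative versioned Ensembl gene IDs (assumed format: ID, dot, version)
versioned <- c("ENSG00000008128.21", "ENSG00000008130.14")

# Remove the trailing dot and version number, and nothing else.
# "\\." matches a literal dot, "[0-9]+$" matches the digits up to the end.
stripped <- sub("\\.[0-9]+$", "", versioned)

print(stripped)
```

The result should be the unversioned ID only, which is what the "ensembl_gene_id" filter of getBM expects.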
However, out of 19729 distinct Ensembl IDs, I only get 3269 matches.
What is surprising is that, according to the GDC docs (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/), this dataset should contain the CNV associated with each gene, so I would expect more matches, at least for coding regions.
When I searched for the Ensembl IDs that BioMart did not find, I got zero results from both Ensembl and NCBI (examples: "ENSG000000081221" "ENSG000000081314" "ENSG000000676014" "ENSG000000783616" "ENSG000000788015"). It looks as if these IDs do not exist in any database. So where are they coming from?
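One quick sanity check on these IDs, sketched under the assumption that a human Ensembl gene ID is "ENSG" followed by exactly 11 digits (15 characters in total). The vector holds the five example IDs quoted above; the variable names are only for illustration.

```r
# The five example IDs that returned no hit in Ensembl or NCBI
ids <- c("ENSG000000081221", "ENSG000000081314", "ENSG000000676014",
         "ENSG000000783616", "ENSG000000788015")

# Assumed canonical format: "ENSG" + exactly 11 digits
well_formed <- grepl("^ENSG[0-9]{11}$", ids)

# Show each ID with its length and whether it matches the assumed format
print(data.frame(id = ids, n_char = nchar(ids), well_formed = well_formed))
```

If the IDs turn out to be longer than 15 characters, that may point to the ID cleaning step (for example, a version suffix being merged into the ID instead of removed) rather than to the dataset itself.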
Did I miss something? Is it normal to have so few protein-coding genes in this kind of dataset? Should I process such data differently for the analysis?
Any suggestions or comments would be really helpful.