Hello everybody. I am running some analyses in these days on TCGA data on different types of tumors (BRCA, COAD) with TCGAbiolinks package and I noticed that when I download gene expression data (legacy data = hg19) I have always the same gene that is missing in the gene IDs, which is UHRF1. I used this code:
query.exp <- GDCquery(project = "TCGA-BRCA", legacy = TRUE, data.category = "Gene expression", data.type = "Gene expression quantification", platform = "Illumina HiSeq", file.type = "results", experimental.strategy = "RNA-Seq", sample.type = c("Solid Tissue Normal", "Primary Tumor")) # Collect the barcodes of only TP samples dataTP <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"TP") # Collect the barcodes of only NT samples dataNT <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"NT") # Filter: dataNT.filt <- dataNT[1:50] # keep only 50 normals dataTP.filt <- dataTP[1:50]# keep only 50 tumors exp.barcodes <- c(dataNT.filt, dataTP.filt) query.exp <- GDCquery(project = TCGAprj, legacy = TRUE, data.category = "Gene expression", data.type = "Gene expression quantification", platform = "Illumina HiSeq", file.type = "results", barcode = exp.barcodes, experimental.strategy = "RNA-Seq") GDCdownload(query.exp) BRCA.exp <- GDCprepare(query.exp, save = TRUE, summarizedExperiment = TRUE, save.filename = "BRCAexp.rda") # Prepare expression matrix with geneID in the rows and samples (barcode) in the columns # rsem.genes.results as values BRCAMatrix <- assay(BRCA.exp,"raw_count")
If I inspect BRCAMatrix searching for UHRF1 gene, there is no trace of it (in this matrix, it should be named: "UHRF1|29128" ).
If I perform normalization and then try to search for UHRF1 gene (that now should lack the " |29128" after its name) I get this:
dataNorm <- TCGAanalyze_Normalization(tabDF = BRCA.exp, geneInfo = geneInfo, method = "gcContent") > dataNorm["UHRF1", 1] > Error in dataNorm["UHRF1", 1] : subscript out of bounds
I should expect the corresponding value after normalization but there is no trace of this gene. I also searched for UHRF1 aliases (I.e., ICBP90, RNF106, TDRD22, HUHRF1, hUHRF1, Np95, NP95, HuNp95, HNP95, HNp95...) but I wasn't able to find it. I don't know if this is a bug in the code or a problem with GDC, but I am also thinking if there are other genes that are missing... Can somebody help me? Maybe I am doing something wrong or there is another "name" or ID that I should search for.
ps I also wrote a post in the issue section of TCGAbiolinks github page: https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/454
EDIT 12.03.21: The gene is missing only when using legacy data (hg19), with GCh38 the gene is present.