TCGAbiolinks -> Gene ID totally missing in downloaded data.. A bug of the package or a problem with GDC data?
0
0
Entering edit mode
7 months ago
miky.zo ▴ 60

Hello everybody. I am running some analyses in these days on TCGA data on different types of tumors (BRCA, COAD) with TCGAbiolinks package and I noticed that when I download gene expression data (legacy data = hg19) I have always the same gene that is missing in the gene IDs, which is UHRF1. I used this code:

query.exp <- GDCquery(project = "TCGA-BRCA",
legacy = TRUE,
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "results",
experimental.strategy = "RNA-Seq",
sample.type = c("Solid Tissue Normal", "Primary Tumor"))

# Collect the barcodes of only TP samples
dataTP <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"TP")

# Collect the barcodes of only NT samples
dataNT <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"NT")

# Filter:
dataNT.filt <- dataNT[1:50] # keep only 50 normals
dataTP.filt <- dataTP[1:50]# keep only 50 tumors

exp.barcodes <- c(dataNT.filt, dataTP.filt)

query.exp <- GDCquery(project = TCGAprj,
legacy = TRUE,
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "results",
barcode = exp.barcodes,
experimental.strategy = "RNA-Seq")

BRCA.exp <- GDCprepare(query.exp,
save = TRUE,
summarizedExperiment = TRUE,
save.filename = "BRCAexp.rda")

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
BRCAMatrix <- assay(BRCA.exp,"raw_count")


If I inspect BRCAMatrix searching for UHRF1 gene, there is no trace of it (in this matrix, it should be named: "UHRF1|29128" ).

If I perform normalization and then try to search for UHRF1 gene (that now should lack the " |29128" after its name) I get this:

dataNorm <- TCGAanalyze_Normalization(tabDF = BRCA.exp,
geneInfo = geneInfo,
method = "gcContent")

> dataNorm["UHRF1", 1]
> Error in dataNorm["UHRF1", 1] : subscript out of bounds


I should expect the corresponding value after normalization but there is no trace of this gene. I also searched for UHRF1 aliases (I.e., ICBP90, RNF106, TDRD22, HUHRF1, hUHRF1, Np95, NP95, HuNp95, HNP95, HNp95...) but I wasn't able to find it. I don't know if this is a bug in the code or a problem with GDC, but I am also thinking if there are other genes that are missing... Can somebody help me? Maybe I am doing something wrong or there is another "name" or ID that I should search for.

ps I also wrote a post in the issue section of TCGAbiolinks github page: https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/454

Thank you!

EDIT 12.03.21: The gene is missing only when using legacy data (hg19), with GCh38 the gene is present.

TCGA GDC TCGAbiolinks R • 575 views