TCGAbiolinks -> Gene ID totally missing in downloaded data.. A bug of the package or a problem with GDC data?
0
0
Entering edit mode
7 months ago
miky.zo ▴ 60

Hello everybody. I am running some analyses in these days on TCGA data on different types of tumors (BRCA, COAD) with TCGAbiolinks package and I noticed that when I download gene expression data (legacy data = hg19) I have always the same gene that is missing in the gene IDs, which is UHRF1. I used this code:

query.exp <- GDCquery(project = "TCGA-BRCA",
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "results",
                      experimental.strategy = "RNA-Seq",
                      sample.type = c("Solid Tissue Normal", "Primary Tumor"))

# Collect the barcodes of only TP samples
dataTP <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"TP")

# Collect the barcodes of only NT samples
dataNT <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"NT")

# Filter:
dataNT.filt <- dataNT[1:50] # keep only 50 normals
dataTP.filt <- dataTP[1:50]# keep only 50 tumors

exp.barcodes <- c(dataNT.filt, dataTP.filt)

query.exp <- GDCquery(project = TCGAprj,
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "results",
                      barcode = exp.barcodes,
                      experimental.strategy = "RNA-Seq")

GDCdownload(query.exp)

BRCA.exp <- GDCprepare(query.exp, 
                       save = TRUE, 
                       summarizedExperiment = TRUE, 
                       save.filename = "BRCAexp.rda")

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
BRCAMatrix <- assay(BRCA.exp,"raw_count")

If I inspect BRCAMatrix searching for UHRF1 gene, there is no trace of it (in this matrix, it should be named: "UHRF1|29128" ).

If I perform normalization and then try to search for UHRF1 gene (that now should lack the " |29128" after its name) I get this:

dataNorm <- TCGAanalyze_Normalization(tabDF = BRCA.exp,
                                      geneInfo = geneInfo,
                                      method = "gcContent") 

> dataNorm["UHRF1", 1]
> Error in dataNorm["UHRF1", 1] : subscript out of bounds

I should expect the corresponding value after normalization but there is no trace of this gene. I also searched for UHRF1 aliases (I.e., ICBP90, RNF106, TDRD22, HUHRF1, hUHRF1, Np95, NP95, HuNp95, HNP95, HNp95...) but I wasn't able to find it. I don't know if this is a bug in the code or a problem with GDC, but I am also thinking if there are other genes that are missing... Can somebody help me? Maybe I am doing something wrong or there is another "name" or ID that I should search for.

ps I also wrote a post in the issue section of TCGAbiolinks github page: https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/454

Thank you!

EDIT 12.03.21: The gene is missing only when using legacy data (hg19), with GCh38 the gene is present.

TCGA GDC TCGAbiolinks R • 575 views
ADD COMMENT

Login before adding your answer.

Traffic: 1734 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6