Question

Deleted:TCGAbiolinks -> Gene ID totally missing in downloaded data.. A bug of the package or a problem with GDC data?

0

Entering edit mode

4.3 years ago

mario.red8976 ▴ 140

Hello everybody. I am running some analyses in these days on TCGA data on different types of tumors (BRCA, COAD) with TCGAbiolinks package and I noticed that when I download gene expression data (legacy data = hg19) I have always the same gene that is missing in the gene IDs, which is UHRF1. I used this code:

query.exp <- GDCquery(project = "TCGA-BRCA",
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "results",
                      experimental.strategy = "RNA-Seq",
                      sample.type = c("Solid Tissue Normal", "Primary Tumor"))

# Collect the barcodes of only TP samples
dataTP <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"TP")

# Collect the barcodes of only NT samples
dataNT <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"NT")

# Filter:
dataNT.filt <- dataNT[1:50] # keep only 50 normals
dataTP.filt <- dataTP[1:50]# keep only 50 tumors

exp.barcodes <- c(dataNT.filt, dataTP.filt)

query.exp <- GDCquery(project = TCGAprj,
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "results",
                      barcode = exp.barcodes,
                      experimental.strategy = "RNA-Seq")

GDCdownload(query.exp)

BRCA.exp <- GDCprepare(query.exp, 
                       save = TRUE, 
                       summarizedExperiment = TRUE, 
                       save.filename = "BRCAexp.rda")

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
BRCAMatrix <- assay(BRCA.exp,"raw_count")

If I inspect BRCAMatrix searching for UHRF1 gene, there is no trace of it (in this matrix, it should be named: "UHRF1|29128" ).

If I perform normalization and then try to search for UHRF1 gene (that now should lack the " |29128" after its name) I get this:

dataNorm <- TCGAanalyze_Normalization(tabDF = BRCA.exp,
                                      geneInfo = geneInfo,
                                      method = "gcContent") 

> dataNorm["UHRF1", 1]
> Error in dataNorm["UHRF1", 1] : subscript out of bounds

I should expect the corresponding value after normalization but there is no trace of this gene. I also searched for UHRF1 aliases (I.e., ICBP90, RNF106, TDRD22, HUHRF1, hUHRF1, Np95, NP95, HuNp95, HNP95, HNp95...) but I wasn't able to find it. I don't know if this is a bug in the code or a problem with GDC, but I am also thinking if there are other genes that are missing... Can somebody help me? Maybe I am doing something wrong or there is another "name" or ID that I should search for.

ps I also wrote a post in the issue section of TCGAbiolinks github page: https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/454

Thank you!

EDIT 12.03.21: The gene is missing only when using legacy data (hg19), with GCh38 the gene is present.

TCGA GDC TCGAbiolinks R • 1.4k views

ADD COMMENT • link 4.3 years ago by mario.red8976 ▴ 140