Error in GDCprepare
1
0
Entering edit mode
3.4 years ago
Kasthuri ▴ 290

I am working with RNA seq data from TCGA using TCGAbiolinks. Everything has been working fine so far and suddenly I get this error:

> RnaseqSE <- GDCprepare(query)
|=================================================================================| 100%
Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column


I don't think I changed anything in the code. Also, as suggested here, I changed the query to:

 RnaseqSE <- GDCprepare(query,
save=TRUE,
save.filename = "Gene_Expression_Quantification.rda",
summarizedExperiment = FALSE)


and I get it running, however, my next command fails,

Matrix.C1 <- assay(RnaseqSE,"raw_count")
Error in (function (classes, fdef, mtable)  :
unable to find an inherited method for function ‘assay’ for signature ‘"data.frame", "character"’


Any help would be appreciated. Thanks!

TCGA • 4.3k views
3
Entering edit mode
3.4 years ago

Could you please write the query, otherwise its hard to speculate. Further at times errors occur using this package owing to updates. Make sure you update it by typing the following command in R/Rstudio.  devtools::install_github(repo = "BioinformaticsFMRP/TCGAbiolinks") 

0
Entering edit mode

Thanks! I updated it and still have problems. Here is the full query - it is exactly as posted in their manual

library(TCGAbiolinks)
library(SummarizedExperiment)

# You can define a list of samples to query and download providing relative TCGA barcodes.
listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07","TCGA-BH-A1FC-11A-32R-A13Q-07",
"TCGA-A7-A13G-11A-51R-A13Q-07","TCGA-BH-A0DK-11A-13R-A089-07",
"TCGA-E9-A1RH-11A-34R-A169-07","TCGA-BH-A0AU-01A-11R-A12P-07",
"TCGA-C8-A1HJ-01A-11R-A13Q-07","TCGA-A7-A13D-01A-13R-A12P-07",
"TCGA-A2-A0CV-01A-31R-A115-07","TCGA-AQ-A0Y5-01A-11R-A14M-07")

# Query platform Illumina HiSeq with a list of barcode
query <- GDCquery(project = "TCGA-BRCA",
data.category = "Gene expression",
data.type = "Gene expression quantification",
experimental.strategy = "RNA-Seq",
platform = "Illumina HiSeq",
file.type = "results",
barcode = listSamples,
legacy = TRUE)

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
RnaseqSE <- GDCprepare(query)

Matrix.C1 <- assay(RnaseqSE,"raw_count")

1
Entering edit mode

Hi,

Few things.

Coming specifically to your query, the returned object of GDCPrepare is a data.frame (not an object of any other class). Thus you cannot apply assay function. Raw counts are present in the data.frame along with the other elements. To extract them use following:  Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)] 

However, I think you should be aware you are working with legacy data and thus it contains the older samples from TCGA portal. There exists an issue for the particular command. However now it has been updated to GDCportal which contains more samples and the package has been more optimised for that. So if you want you work with the more updated data, use the following, it works:

library(SummarizedExperiment)
query <- GDCquery(project = "TCGA-BRCA",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = 'HTSeq - Counts',
barcode = listSamples,
legacy = F)
RnaseqSE <- GDCprepare(query)
Matrix.C1 <- assays(RNaseqSE)


If you remove listSamples argument, you will get all the samples and not these particular ones.

0
Entering edit mode

Hey noorpratap.singh, for your code, you may want to just highlight it and then click the 101 010 button. The backticks are not needed!

0
Entering edit mode

0
Entering edit mode

Thanks a lot, noorpratap.singh. However, I get errors down the line, now.

listSamples <- Cluster.1.df\$Sample ID

library(SummarizedExperiment)

query <- GDCquery(project = "TCGA-BRCA",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = 'HTSeq - Counts',
barcode = listSamples,
legacy = F)

RnaseqSE <- GDCprepare(query)
Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)]
Norm.Mat.C1 <- TCGAanalyze_Normalization(tabDF = Matrix.C1, geneInfo =  geneInfo)
I Need about  0 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]
Step 1 of 4: newSeqExpressionSet ...
Step 2 of 4: withinLaneNormalization ...
Error in names(y) <- 1:length(y) :
'names' attribute [2] must be the same length as the vector [0]
Timing stopped at: 0.004 0 0.004

0
Entering edit mode

Ok, I figured out. If we are using the new query as suggested, we need to use geneInfo = geneInfoHT option in the TCGAanalyze_Normalization function. Thus, it should be,

Norm.Mat.C1 <- TCGAanalyze_Normalization(tabDF = Matrix.C1, geneInfo =  geneInfoHT)


However, using this updated protocol results in fewer samples being analyzed in the differential gene expression analysis downstream (TCGAanalyze_DEA) and as a result, we get a lot less differentially expressed genes.

Your first solution is much better - that is, using Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)] however, we need to set summarizedExperiment = FALSE in the original query GDCprepare. Therefore, I think the best solution is:

library(TCGAbiolinks)
library(SummarizedExperiment)

# You can define a list of samples to query and download providing relative TCGA barcodes.
listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07","TCGA-BH-A1FC-11A-32R-A13Q-07",
"TCGA-A7-A13G-11A-51R-A13Q-07","TCGA-BH-A0DK-11A-13R-A089-07",
"TCGA-E9-A1RH-11A-34R-A169-07","TCGA-BH-A0AU-01A-11R-A12P-07",
"TCGA-C8-A1HJ-01A-11R-A13Q-07","TCGA-A7-A13D-01A-13R-A12P-07",
"TCGA-A2-A0CV-01A-31R-A115-07","TCGA-AQ-A0Y5-01A-11R-A14M-07")

# Query platform Illumina HiSeq with a list of barcode
query <- GDCquery(project = "TCGA-BRCA",
data.category = "Gene expression",
data.type = "Gene expression quantification",
experimental.strategy = "RNA-Seq",
platform = "Illumina HiSeq",
file.type = "results",
barcode = listSamples,
legacy = TRUE)

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
RnaseqSE <- GDCprepare(query, summarizedExperiment = FALSE)
Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)]

Norm.Mat.C1 <- TCGAanalyze_Normalization(tabDF = Matrix.C1, geneInfo =  geneInfo)


This should give us the desired matrix for further downstream differential expression analysis.

0
Entering edit mode

Can you double-confirm that this solves the problem? If so, I will move noorpratap's first comment to become an answer, which you can then up-vote and/or accept, if you wish. Threads with no accepted answers are 'bumped' back to the top of the list by the Biostars bot every now and then.

0
Entering edit mode

Yes, it works for me and I processed my data. Hopefully, they shouldn't change the code at TCGAbiolinks.

1
Entering edit mode

Okay, thanks. I have now moved this entire thread to an answer. TCGAbiolinks is liable to change, I feel. It has had to change over the years due to the fact that the very data that it accesses has changed. Maybe some day it will all become stable!