Question: Error in GDCprepare
0
gravatar for Kasthuri
11 weeks ago by
Kasthuri240
United States
Kasthuri240 wrote:

I am working with RNA seq data from TCGA using TCGAbiolinks. Everything has been working fine so far and suddenly I get this error:

> RnaseqSE <- GDCprepare(query)
|=================================================================================| 100%
Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column

I don't think I changed anything in the code. Also, as suggested here, I changed the query to:

 RnaseqSE <- GDCprepare(query,
                 save=TRUE,
                 save.filename = "Gene_Expression_Quantification.rda",
                 summarizedExperiment = FALSE)

and I get it running, however, my next command fails,

Matrix.C1 <- assay(RnaseqSE,"raw_count")
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘assay’ for signature ‘"data.frame", "character"’

Any help would be appreciated. Thanks!

tcga • 277 views
ADD COMMENTlink modified 11 weeks ago • written 11 weeks ago by Kasthuri240
3
gravatar for noorpratap.singh
11 weeks ago by
India
noorpratap.singh270 wrote:

Could you please write the query, otherwise its hard to speculate. Further at times errors occur using this package owing to updates. Make sure you update it by typing the following command in R/Rstudio. devtools::install_github(repo = "BioinformaticsFMRP/TCGAbiolinks")

ADD COMMENTlink written 11 weeks ago by noorpratap.singh270

Thanks! I updated it and still have problems. Here is the full query - it is exactly as posted in their manual

library(TCGAbiolinks)
library(SummarizedExperiment)

    # You can define a list of samples to query and download providing relative TCGA barcodes.
    listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07","TCGA-BH-A1FC-11A-32R-A13Q-07",
                     "TCGA-A7-A13G-11A-51R-A13Q-07","TCGA-BH-A0DK-11A-13R-A089-07",
                     "TCGA-E9-A1RH-11A-34R-A169-07","TCGA-BH-A0AU-01A-11R-A12P-07",
                     "TCGA-C8-A1HJ-01A-11R-A13Q-07","TCGA-A7-A13D-01A-13R-A12P-07",
                     "TCGA-A2-A0CV-01A-31R-A115-07","TCGA-AQ-A0Y5-01A-11R-A14M-07")

    # Query platform Illumina HiSeq with a list of barcode 
    query <- GDCquery(project = "TCGA-BRCA", 
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      experimental.strategy = "RNA-Seq",
                      platform = "Illumina HiSeq",
                      file.type = "results",
                      barcode = listSamples, 
                      legacy = TRUE)

    # Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
    GDCdownload(query)

    # Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
    # rsem.genes.results as values
    RnaseqSE <- GDCprepare(query)

    Matrix.C1 <- assay(RnaseqSE,"raw_count")
ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Kasthuri240
1

Hi,

Few things.

Coming specifically to your query, the returned object of GDCPrepare is a data.frame (not an object of any other class). Thus you cannot apply assay function. Raw counts are present in the data.frame along with the other elements. To extract them use following: Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)]

However, I think you should be aware you are working with legacy data and thus it contains the older samples from TCGA portal. There exists an issue for the particular command. However now it has been updated to GDCportal which contains more samples and the package has been more optimised for that. So if you want you work with the more updated data, use the following, it works:

library(SummarizedExperiment)
query <- GDCquery(project = "TCGA-BRCA", 
              data.category = "Transcriptome Profiling",
              data.type = "Gene Expression Quantification",
              workflow.type = 'HTSeq - Counts',
              barcode = listSamples, 
              legacy = F)
GDCdownload(query, method = 'api')
RnaseqSE <- GDCprepare(query)
Matrix.C1 <- assays(RNaseqSE)

If you remove listSamples argument, you will get all the samples and not these particular ones.

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by noorpratap.singh270

Hey noorpratap.singh, for your code, you may want to just highlight it and then click the 101 010 button. The backticks are not needed!

ADD REPLYlink written 11 weeks ago by Kevin Blighe41k

Updated. Thanks for the advise.

ADD REPLYlink written 11 weeks ago by noorpratap.singh270

Thanks a lot, noorpratap.singh. However, I get errors down the line, now.

listSamples <- Cluster.1.df$`Sample ID`

library(SummarizedExperiment)

query <- GDCquery(project = "TCGA-BRCA", 
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = 'HTSeq - Counts',
                  barcode = listSamples, 
                  legacy = F)  

 GDCdownload(query, method = 'api')
 RnaseqSE <- GDCprepare(query)
 Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)]
 Norm.Mat.C1 <- TCGAanalyze_Normalization(tabDF = Matrix.C1, geneInfo =  geneInfo)
    I Need about  0 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]  
    Step 1 of 4: newSeqExpressionSet ...
    Step 2 of 4: withinLaneNormalization ...
    Error in names(y) <- 1:length(y) : 
      'names' attribute [2] must be the same length as the vector [0]
    Timing stopped at: 0.004 0 0.004
ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Kasthuri240

Ok, I figured out. If we are using the new query as suggested, we need to use geneInfo = geneInfoHT option in the TCGAanalyze_Normalization function. Thus, it should be,

Norm.Mat.C1 <- TCGAanalyze_Normalization(tabDF = Matrix.C1, geneInfo =  geneInfoHT)

However, using this updated protocol results in fewer samples being analyzed in the differential gene expression analysis downstream (TCGAanalyze_DEA) and as a result, we get a lot less differentially expressed genes.

Your first solution is much better - that is, using Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)] however, we need to set summarizedExperiment = FALSE in the original query GDCprepare. Therefore, I think the best solution is:

library(TCGAbiolinks)
library(SummarizedExperiment)

# You can define a list of samples to query and download providing relative TCGA barcodes.
listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07","TCGA-BH-A1FC-11A-32R-A13Q-07",
                 "TCGA-A7-A13G-11A-51R-A13Q-07","TCGA-BH-A0DK-11A-13R-A089-07",
                 "TCGA-E9-A1RH-11A-34R-A169-07","TCGA-BH-A0AU-01A-11R-A12P-07",
                 "TCGA-C8-A1HJ-01A-11R-A13Q-07","TCGA-A7-A13D-01A-13R-A12P-07",
                 "TCGA-A2-A0CV-01A-31R-A115-07","TCGA-AQ-A0Y5-01A-11R-A14M-07")

# Query platform Illumina HiSeq with a list of barcode 
query <- GDCquery(project = "TCGA-BRCA", 
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  experimental.strategy = "RNA-Seq",
                  platform = "Illumina HiSeq",
                  file.type = "results",
                  barcode = listSamples, 
                  legacy = TRUE)

# Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
GDCdownload(query)

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
RnaseqSE <- GDCprepare(query, summarizedExperiment = FALSE)
Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)]

Norm.Mat.C1 <- TCGAanalyze_Normalization(tabDF = Matrix.C1, geneInfo =  geneInfo)

This should give us the desired matrix for further downstream differential expression analysis.

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Kasthuri240

Can you double-confirm that this solves the problem? If so, I will move noorpratap's first comment to become an answer, which you can then up-vote and/or accept, if you wish. Threads with no accepted answers are 'bumped' back to the top of the list by the Biostars bot every now and then.

ADD REPLYlink written 11 weeks ago by Kevin Blighe41k

Yes, it works for me and I processed my data. Hopefully, they shouldn't change the code at TCGAbiolinks.

ADD REPLYlink written 11 weeks ago by Kasthuri240
1

Okay, thanks. I have now moved this entire thread to an answer. TCGAbiolinks is liable to change, I feel. It has had to change over the years due to the fact that the very data that it accesses has changed. Maybe some day it will all become stable!

ADD REPLYlink written 11 weeks ago by Kevin Blighe41k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1792 users visited in the last hour