Question: Error in GDCprepare
0
gravatar for Kasthuri
14 months ago by
Kasthuri260
United States
Kasthuri260 wrote:

I am working with RNA seq data from TCGA using TCGAbiolinks. Everything has been working fine so far and suddenly I get this error:

> RnaseqSE <- GDCprepare(query)
|=================================================================================| 100%
Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column

I don't think I changed anything in the code. Also, as suggested here, I changed the query to:

 RnaseqSE <- GDCprepare(query,
                 save=TRUE,
                 save.filename = "Gene_Expression_Quantification.rda",
                 summarizedExperiment = FALSE)

and I get it running, however, my next command fails,

Matrix.C1 <- assay(RnaseqSE,"raw_count")
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘assay’ for signature ‘"data.frame", "character"’

Any help would be appreciated. Thanks!

tcga • 1.1k views
ADD COMMENTlink modified 14 months ago • written 14 months ago by Kasthuri260
3
gravatar for noorpratap.singh
14 months ago by
University of Maryland
noorpratap.singh290 wrote:

Could you please write the query, otherwise its hard to speculate. Further at times errors occur using this package owing to updates. Make sure you update it by typing the following command in R/Rstudio. devtools::install_github(repo = "BioinformaticsFMRP/TCGAbiolinks")

ADD COMMENTlink written 14 months ago by noorpratap.singh290

Thanks! I updated it and still have problems. Here is the full query - it is exactly as posted in their manual

library(TCGAbiolinks)
library(SummarizedExperiment)

    # You can define a list of samples to query and download providing relative TCGA barcodes.
    listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07","TCGA-BH-A1FC-11A-32R-A13Q-07",
                     "TCGA-A7-A13G-11A-51R-A13Q-07","TCGA-BH-A0DK-11A-13R-A089-07",
                     "TCGA-E9-A1RH-11A-34R-A169-07","TCGA-BH-A0AU-01A-11R-A12P-07",
                     "TCGA-C8-A1HJ-01A-11R-A13Q-07","TCGA-A7-A13D-01A-13R-A12P-07",
                     "TCGA-A2-A0CV-01A-31R-A115-07","TCGA-AQ-A0Y5-01A-11R-A14M-07")

    # Query platform Illumina HiSeq with a list of barcode 
    query <- GDCquery(project = "TCGA-BRCA", 
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      experimental.strategy = "RNA-Seq",
                      platform = "Illumina HiSeq",
                      file.type = "results",
                      barcode = listSamples, 
                      legacy = TRUE)

    # Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
    GDCdownload(query)

    # Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
    # rsem.genes.results as values
    RnaseqSE <- GDCprepare(query)

    Matrix.C1 <- assay(RnaseqSE,"raw_count")
ADD REPLYlink modified 14 months ago • written 14 months ago by Kasthuri260
1

Hi,

Few things.

Coming specifically to your query, the returned object of GDCPrepare is a data.frame (not an object of any other class). Thus you cannot apply assay function. Raw counts are present in the data.frame along with the other elements. To extract them use following: Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)]

However, I think you should be aware you are working with legacy data and thus it contains the older samples from TCGA portal. There exists an issue for the particular command. However now it has been updated to GDCportal which contains more samples and the package has been more optimised for that. So if you want you work with the more updated data, use the following, it works:

library(SummarizedExperiment)
query <- GDCquery(project = "TCGA-BRCA", 
              data.category = "Transcriptome Profiling",
              data.type = "Gene Expression Quantification",
              workflow.type = 'HTSeq - Counts',
              barcode = listSamples, 
              legacy = F)
GDCdownload(query, method = 'api')
RnaseqSE <- GDCprepare(query)
Matrix.C1 <- assays(RNaseqSE)

If you remove listSamples argument, you will get all the samples and not these particular ones.

ADD REPLYlink modified 14 months ago • written 14 months ago by noorpratap.singh290

Hey noorpratap.singh, for your code, you may want to just highlight it and then click the 101 010 button. The backticks are not needed!

ADD REPLYlink written 14 months ago by Kevin Blighe56k

Updated. Thanks for the advise.

ADD REPLYlink written 14 months ago by noorpratap.singh290

Thanks a lot, noorpratap.singh. However, I get errors down the line, now.

listSamples <- Cluster.1.df$`Sample ID`

library(SummarizedExperiment)

query <- GDCquery(project = "TCGA-BRCA", 
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = 'HTSeq - Counts',
                  barcode = listSamples, 
                  legacy = F)  

 GDCdownload(query, method = 'api')
 RnaseqSE <- GDCprepare(query)
 Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)]
 Norm.Mat.C1 <- TCGAanalyze_Normalization(tabDF = Matrix.C1, geneInfo =  geneInfo)
    I Need about  0 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]  
    Step 1 of 4: newSeqExpressionSet ...
    Step 2 of 4: withinLaneNormalization ...
    Error in names(y) <- 1:length(y) : 
      'names' attribute [2] must be the same length as the vector [0]
    Timing stopped at: 0.004 0 0.004
ADD REPLYlink modified 14 months ago • written 14 months ago by Kasthuri260

Ok, I figured out. If we are using the new query as suggested, we need to use geneInfo = geneInfoHT option in the TCGAanalyze_Normalization function. Thus, it should be,

Norm.Mat.C1 <- TCGAanalyze_Normalization(tabDF = Matrix.C1, geneInfo =  geneInfoHT)

However, using this updated protocol results in fewer samples being analyzed in the differential gene expression analysis downstream (TCGAanalyze_DEA) and as a result, we get a lot less differentially expressed genes.

Your first solution is much better - that is, using Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)] however, we need to set summarizedExperiment = FALSE in the original query GDCprepare. Therefore, I think the best solution is:

library(TCGAbiolinks)
library(SummarizedExperiment)

# You can define a list of samples to query and download providing relative TCGA barcodes.
listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07","TCGA-BH-A1FC-11A-32R-A13Q-07",
                 "TCGA-A7-A13G-11A-51R-A13Q-07","TCGA-BH-A0DK-11A-13R-A089-07",
                 "TCGA-E9-A1RH-11A-34R-A169-07","TCGA-BH-A0AU-01A-11R-A12P-07",
                 "TCGA-C8-A1HJ-01A-11R-A13Q-07","TCGA-A7-A13D-01A-13R-A12P-07",
                 "TCGA-A2-A0CV-01A-31R-A115-07","TCGA-AQ-A0Y5-01A-11R-A14M-07")

# Query platform Illumina HiSeq with a list of barcode 
query <- GDCquery(project = "TCGA-BRCA", 
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  experimental.strategy = "RNA-Seq",
                  platform = "Illumina HiSeq",
                  file.type = "results",
                  barcode = listSamples, 
                  legacy = TRUE)

# Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
GDCdownload(query)

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
RnaseqSE <- GDCprepare(query, summarizedExperiment = FALSE)
Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)]

Norm.Mat.C1 <- TCGAanalyze_Normalization(tabDF = Matrix.C1, geneInfo =  geneInfo)

This should give us the desired matrix for further downstream differential expression analysis.

ADD REPLYlink modified 14 months ago • written 14 months ago by Kasthuri260

Can you double-confirm that this solves the problem? If so, I will move noorpratap's first comment to become an answer, which you can then up-vote and/or accept, if you wish. Threads with no accepted answers are 'bumped' back to the top of the list by the Biostars bot every now and then.

ADD REPLYlink written 14 months ago by Kevin Blighe56k

Yes, it works for me and I processed my data. Hopefully, they shouldn't change the code at TCGAbiolinks.

ADD REPLYlink written 14 months ago by Kasthuri260
1

Okay, thanks. I have now moved this entire thread to an answer. TCGAbiolinks is liable to change, I feel. It has had to change over the years due to the fact that the very data that it accesses has changed. Maybe some day it will all become stable!

ADD REPLYlink written 14 months ago by Kevin Blighe56k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1658 users visited in the last hour