4.3 years ago
Kasthuri ▴ 290
I am working with RNA seq data from TCGA using TCGAbiolinks. Everything has been working fine so far and suddenly I get this error:
> RnaseqSE <- GDCprepare(query)
|=================================================================================| 100%
Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column
I don't think I changed anything in the code. Also, as suggested here, I changed the query to:
RnaseqSE <- GDCprepare(query, save=TRUE, save.filename = "Gene_Expression_Quantification.rda", summarizedExperiment = FALSE)
and I get it running, however, my next command fails,
Matrix.C1 <- assay(RnaseqSE,"raw_count")
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘assay’ for signature ‘"data.frame", "character"’
Any help would be appreciated. Thanks!
Thanks! I updated it and still have problems. Here is the full query - it is exactly as posted in their manual
Coming specifically to your query: with summarizedExperiment = FALSE, the object returned by GDCprepare is a data.frame, not a SummarizedExperiment. That is why the assay function cannot be applied to it. The raw counts are present in the data.frame along with the other elements. To extract them, use the following:
Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)]
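To see what the seq(1, ncol(...), 3) indexing does, here is a minimal sketch on a toy data.frame. It assumes three columns per sample in the order raw_count, scaled_estimate, transcript_id. The toy column names and values are made up for illustration, so it is worth checking colnames(RnaseqSE) on the real object before relying on a stride of 3.

```r
# Toy stand-in for the data.frame returned with summarizedExperiment = FALSE.
# ASSUMPTION: three columns per sample (raw_count, scaled_estimate, transcript_id);
# the names and values below are invented for illustration only.
toy <- data.frame(
  raw_count_S1       = c(10, 0, 25),
  scaled_estimate_S1 = c(1e-6, 0, 3e-6),
  transcript_id_S1   = c("t1", "t2", "t3"),
  raw_count_S2       = c(7, 2, 30),
  scaled_estimate_S2 = c(8e-7, 1e-7, 4e-6),
  transcript_id_S2   = c("t1", "t2", "t3"),
  row.names = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)

# Every third column, starting at the first: the raw counts under the assumed layout.
idx <- seq(1, ncol(toy), 3)
counts <- toy[, idx, drop = FALSE]

# Alternative that does not depend on column order: select by name pattern.
counts_by_name <- toy[, grepl("^raw_count", colnames(toy)), drop = FALSE]

print(colnames(counts))
print(colnames(counts_by_name))
```

Selecting by name is a little more robust than selecting by position, because it keeps working if an extra leading column shifts the stride.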
However, you should be aware that you are working with legacy data, which contains the older samples from the TCGA portal. There is an existing issue for that particular command. The data have since moved to the GDC portal, which contains more samples, and the package is better optimised for it. So, if you want to work with the more up-to-date data, use the following, which works:
If you remove listSamples argument, you will get all the samples and not these particular ones.
Hey noorpratap.singh, for your code, you may want to just highlight it and then click the 101 010 button. The backticks are not needed!
Updated. Thanks for the advice.
Thanks a lot, noorpratap.singh. However, I get errors down the line, now.
Ok, I figured it out. If we are using the new query as suggested, we need to use the geneInfo = geneInfoHT option in the TCGAanalyze_Normalization function. Thus, it should be:
However, using this updated protocol results in fewer samples being analyzed in the downstream differential gene expression analysis (TCGAanalyze_DEA) and, as a result, we get far fewer differentially expressed genes.
Your first solution is much better - that is, using Matrix.C1 <- RnaseqSE[,seq(1,ncol(RnaseqSE),3)]. However, we need to set summarizedExperiment = FALSE in the original GDCprepare call. Therefore, I think the best solution is:
This should give us the desired matrix for further downstream differential expression analysis.
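Before passing the extracted columns downstream, it may help to confirm that only numeric count columns were picked up. The sketch below uses a hypothetical helper, as_count_matrix, which is not part of TCGAbiolinks. It is shown on made-up toy data and simply stops early if a non-numeric column was selected.

```r
# Hypothetical helper (NOT part of TCGAbiolinks): coerce a count data.frame
# to a numeric matrix and fail early if a non-numeric column slipped in.
as_count_matrix <- function(df) {
  non_numeric <- !vapply(df, is.numeric, logical(1))
  if (any(non_numeric)) {
    stop("Non-numeric columns selected: ",
         paste(colnames(df)[non_numeric], collapse = ", "))
  }
  m <- as.matrix(df)
  storage.mode(m) <- "double"  # keeps dim and dimnames, forces double storage
  m
}

# Toy counts, invented for illustration: genes in rows, samples in columns.
toy_counts <- data.frame(S1 = c(10, 0, 25), S2 = c(7, 2, 30),
                         row.names = c("geneA", "geneB", "geneC"))
m <- as_count_matrix(toy_counts)
print(dim(m))
```

On the real data this would be called as as_count_matrix(Matrix.C1). An error here would suggest that the stride of 3 does not match the actual column layout.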
Can you double-confirm that this solves the problem? If so, I will move noorpratap's first comment to become an answer, which you can then up-vote and/or accept, if you wish. Threads with no accepted answers are 'bumped' back to the top of the list by the Biostars bot every now and then.
Yes, it works for me and I processed my data. Hopefully they won't change the code in TCGAbiolinks.
Okay, thanks. I have now moved this entire thread to an answer. TCGAbiolinks is liable to change, I feel. It has had to change over the years due to the fact that the very data that it accesses has changed. Maybe some day it will all become stable!