Question: Multiple TCGA expression datasets
george.wiggins (New Zealand) wrote, 19 days ago:

Put simply, I want to get expression data for a given set of genes across, say, all solid tumours in TCGA. I could download each dataset manually, pull out the genes of interest, and then do my analysis.

What I was hoping for is an easy way to implement this in R. Ideally it would be quick, since I only have a few genes and so wouldn't need to fetch each entire dataset.

cindy.perscheid (Hasso Plattner Institute, Potsdam, Germany) wrote, 19 days ago:

As far as I know, the GDC only lets you query data by parameters such as study group, sample type, platform, and specific barcodes, because the files are only available sample-wise (I suppose the same holds for the other sources where you can download the data). I am afraid you will have to implement a custom solution yourself: first download the data (e.g. according to your criteria, only tumor samples of specific study groups), then do the preprocessing, and finally select the genes you want to keep. This can be automated in an R script (I did this myself, only without selecting specific genes), but of course it takes some time to set up if you want to keep it flexible.

Best, Cindy
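
The "select the genes you want to keep" step from sample-wise files can be sketched as follows. This is only an illustration under stated assumptions: the tab-separated "gene, value" layout, the sample names, and the function name `extract_genes` are hypothetical and do not reflect the actual GDC file format.

```python
import csv
import io


def extract_genes(sample_files, genes_of_interest):
    """Return {gene: {sample: value}} for the requested genes only.

    sample_files maps a sample name to an open text handle containing
    tab-separated "gene<TAB>value" lines (a hypothetical layout).
    """
    wanted = set(genes_of_interest)
    table = {gene: {} for gene in genes_of_interest}
    for sample, handle in sample_files.items():
        for row in csv.reader(handle, delimiter="\t"):
            # Skip short lines, comment lines, and genes outside the panel.
            if len(row) < 2 or row[0].startswith("#") or row[0] not in wanted:
                continue
            table[row[0]][sample] = float(row[1])
    return table


# Two tiny in-memory "files" standing in for downloaded per-sample tables.
demo_files = {
    "SAMPLE-A": io.StringIO("BRCA1\t10.5\nTP53\t3.0\nGAPDH\t250.0\n"),
    "SAMPLE-B": io.StringIO("BRCA1\t7.25\nTP53\t4.5\nGAPDH\t300.0\n"),
}
panel = extract_genes(demo_files, ["BRCA1", "TP53"])
print(panel)
```

Each file is still read in full, which matches the point above: the filtering happens after download, not at query time.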


Thanks Cindy,
Did you download the data through the GDC data portal or with an R package? If the latter, which one?

I know cBioPortal can do gene-based queries on TCGA data. However, at least through the web interface, you can select multiple studies but only get the mutation or CNV status of your genes of interest, not RNA expression data. My current implementation, which is messy to say the least, uses the cgdsr package and imports the expression data set of each study I want for my given gene panel. This may be the best I can manage for now.

Reply written 19 days ago by george.wiggins
cindy.perscheid (Hasso Plattner Institute, Potsdam, Germany) wrote, 19 days ago:

I used the TCGAbiolinks R package: https://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html#gdcquery:_searching_tcga_open-access_data

I implemented a Python wrapper that creates an R script, which in turn generates an expression data set ready for downstream ML analysis. It uses TCGAbiolinks for customized downloads: for example, if I want multiple data types from the same study, I just invoke my wrapper with the corresponding parameter and the queries are generated automatically in R. The script also does some basic preprocessing such as filtering, normalization/log-transformation, and discretization (all in R). I plan to extend this to more tools and make it more flexible so you can design your own preprocessing pipeline, but currently I have just one basic workflow. For this, I found TCGAbiolinks quite handy, since I did not have to study and query the GDC API myself.
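
A minimal sketch of what such a wrapper might look like, not the actual implementation described above. The names `R_TEMPLATE`, `r_vector`, and `build_r_script` are hypothetical. The generated R code uses the TCGAbiolinks functions `GDCquery`, `GDCdownload`, and `GDCprepare`; the query values (including the `workflow.type` default) and the `gene_name` column in `rowData` are assumptions that should be checked against the current TCGAbiolinks vignette.

```python
# Hypothetical template; the R query values are assumptions, not verified
# against a particular GDC data release.
R_TEMPLATE = """\
library(TCGAbiolinks)
library(SummarizedExperiment)

genes <- c({genes})
projects <- c({projects})

results <- lapply(projects, function(project) {{
  # Query values below are assumptions; adjust them to the current GDC release.
  query <- GDCquery(
    project = project,
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "{workflow}",
    sample.type = "Primary Tumor"
  )
  GDCdownload(query)
  se <- GDCprepare(query)
  # Keep only the genes of interest (assumes a gene_name column in rowData).
  se[rowData(se)$gene_name %in% genes, ]
}})
names(results) <- projects
"""


def r_vector(values):
    """Format Python strings as the contents of an R character vector."""
    return ", ".join('"{}"'.format(v) for v in values)


def build_r_script(projects, genes, workflow="STAR - Counts"):
    """Fill the R template with the requested projects and gene panel."""
    return R_TEMPLATE.format(
        genes=r_vector(genes),
        projects=r_vector(projects),
        workflow=workflow,
    )


script = build_r_script(["TCGA-BRCA", "TCGA-LUAD"], ["BRCA1", "TP53"])
print(script)
```

The generated script could be saved to a file and run with `Rscript`. It still downloads every sample of each project and subsets to the gene panel only afterwards.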
