Question

Multiple TCGA expression datasets

6.4 years ago

george.wiggins ▴ 10

Simply put, I want to get expression data for a given set of genes for, say, all solid tumours in TCGA. I could download each dataset manually, pull out the genes of interest, and then do my analysis.

What I was hoping for is an easy way to implement this in R. Ideally this would be quick, since I only have a few genes and so shouldn't need to fetch the entire dataset.

R • 2.0k views

score 0 · Answer 1 · 2017-11-21

6.4 years ago

cindy.perscheid ▴ 100

As far as I know, the GDC only lets you query data by parameters such as study group, sample type, platform, and specific barcodes, because the files are only available sample-wise (I suppose the same holds for the other sources where you can download the data). I am afraid you will have to implement a custom solution yourself: first download the data (e.g. according to your criteria, only tumor samples of specific study groups), do the preprocessing, and then select the genes you want to keep. This can be automated in an R script (I did this myself, only without selecting specific genes), but of course it takes some time to set up if you want to keep it flexible.
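To make the "download, then keep tumor samples, then select genes" idea concrete, here is a minimal sketch of the filtering step only. It assumes the per-study expression matrices have already been downloaded and loaded into plain dictionaries; the function names, the toy values, and the barcodes are made up for illustration. The one TCGA-specific fact it relies on is that the fourth barcode field starts with a sample-type code (01-09 tumor, 10-19 normal).

```python
def is_tumour(barcode):
    """Return True if the TCGA barcode's sample-type code (4th field) is 01-09."""
    return int(barcode.split("-")[3][:2]) < 10


def select_genes(expression_by_study, gene_panel, tumour_only=True):
    """Keep only panel genes (and optionally tumour samples) from every study.

    expression_by_study: {study: {gene: {sample_barcode: value}}}
    Returns long-format rows so all studies end up in one table.
    """
    panel = set(gene_panel)
    rows = []
    for study, matrix in expression_by_study.items():
        for gene, samples in matrix.items():
            if gene not in panel:
                continue
            for barcode, value in samples.items():
                if tumour_only and not is_tumour(barcode):
                    continue
                rows.append(
                    {"study": study, "gene": gene, "sample": barcode, "value": value}
                )
    return rows


# Toy stand-in for already-downloaded matrices (invented barcodes and values).
toy = {
    "TCGA-BRCA": {
        "BRCA1": {"TCGA-A1-0001-01A": 5.2, "TCGA-A1-0001-11A": 4.0},
        "TP53": {"TCGA-A1-0001-01A": 7.9},
    },
    "TCGA-LUAD": {
        "BRCA1": {"TCGA-B2-0002-01A": 3.1},
        "EGFR": {"TCGA-B2-0002-01A": 9.4},
    },
}

panel_rows = select_genes(toy, ["BRCA1", "EGFR"])
```

The long format (one row per study, gene, and sample) is just one convenient choice here, since it lets studies with different sample sets be stacked without aligning columns.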

Best, Cindy

Thanks Cindy,
Did you download the data through the GDC data portal or with an R package? If the latter, which one?

I know cBioPortal can do gene-based queries on TCGA data. However, at least through the web interface, you can only select multiple studies and get the mutation or CNV status of your genes of interest, not RNA expression data. My current implementation, which is messy to say the least, uses the cgdsr package to import the expression data for my gene panel from each study I want. This may be the best I can manage for now.

ADD REPLY • link 6.4 years ago by george.wiggins ▴ 10

score 0 · Answer 2 · 2017-11-21

I used the TCGAbiolinks R package: https://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html#gdcquery:_searching_tcga_open-access_data

I implemented a Python wrapper that creates an R script to generate an expression data set that is ready for downstream ML analysis. The generated script uses TCGAbiolinks for customized downloads: for example, if I want multiple data types from the same study, I just invoke my wrapper with the corresponding parameter and the queries are generated automatically in R. It also does some basic preprocessing such as filtering, normalization/log-transformation, and discretization (all in R). I plan to extend this to more tools and make it more flexible so that you can design your own preprocessing pipeline, but currently I have just one basic workflow. For this, I found TCGAbiolinks quite handy, as I did not have to study and query the GDC API myself.
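For anyone wanting to try the same pattern, a rough sketch of such a wrapper could look like the following. This is not the actual wrapper described above, just an illustration: `build_r_script` and `r_vector` are hypothetical names, and the emitted R code assumes the `GDCquery` / `GDCdownload` / `GDCprepare` workflow from TCGAbiolinks with the "STAR - Counts" workflow type and gene symbols available in `rowData(se)$gene_name`. Both of those may need adjusting for your TCGAbiolinks and GDC data release.

```python
def r_vector(values):
    """Render a list of strings as an R character vector literal."""
    return "c(" + ", ".join('"{}"'.format(v) for v in values) + ")"


def build_r_script(projects, genes, workflow="STAR - Counts"):
    """Return R source that downloads each project and keeps only panel genes."""
    lines = [
        "library(TCGAbiolinks)",
        "library(SummarizedExperiment)",
        "genes <- " + r_vector(genes),
        "for (project in " + r_vector(projects) + ") {",
        "  query <- GDCquery(",
        "    project = project,",
        '    data.category = "Transcriptome Profiling",',
        '    data.type = "Gene Expression Quantification",',
        '    workflow.type = "{}",'.format(workflow),
        '    sample.type = "Primary Tumor"',
        "  )",
        "  GDCdownload(query)",
        "  se <- GDCprepare(query)",
        "  # Assumption: gene symbols live in rowData(se)$gene_name.",
        "  panel <- se[rowData(se)$gene_name %in% genes, ]",
        '  saveRDS(panel, paste0(project, "_panel.rds"))',
        "}",
    ]
    return "\n".join(lines) + "\n"


# Example: generate a script for two projects and a two-gene panel.
script = build_r_script(["TCGA-BRCA", "TCGA-LUAD"], ["BRCA1", "EGFR"])
```

Writing `script` to a file and running it with `Rscript` would then do the actual downloads. Note that the full per-project data is still fetched before subsetting, which matches the point above that the files are only available sample-wise.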