Tutorial: retrieve full TCGA datasets from cBioportal with R
gravatar for TriS
4.2 years ago by
United States, Buffalo
TriS4.3k wrote:

cBioportal (http://www.cbioportal.org/index.do) revolutionized the way we look at genomics data. it offers a great interface to visualize TCGA data, integrate them, correlate alterations with survival etc etc. numerous investigators now utilize TCGA data to develop hypotheses or validate their results and the easiest interface to use is, indeed, cBioportal.

for the non-super-lay user, data are accessible also at TCGA firehose (https://gdac.broadinstitute.org/) but this tutorial is not for those data.

I have found online numerous people asking how to download full datasets from cBioportal and no clear answer is available. although cBioportal allows to download data from their website, there still has a limitation on the number of genes you can get.

therefore, the purpose of this short tutorial is to give you a quick n' dirty tweak to get full RNASeq data. the process is valid for other datasets too, this tutorial will show RNASeq data and leaves to the user the fun of expanding this to other data.

required stuff:

the first few steps are described in the link above, I add them here too just for completing the guide. so, load the library and create the cgdsr object

mycgds <- CGDS("http://www.cbioportal.org/public-portal/")

then we need to chose which cancer study we want. I'll go with prostate cancer. the getCancerStudies() function will return a matrix with a list of the studies. the prostate cancer I want is in line 117, therefore:

prad_study <- getCancerStudies(mycgds)[117,1]

and I want to keep all patients:

pca_case_list <- getCaseLists(mycgds,prad_study)[1,1]

now, I need the list of all genes in the genome. one (of the million) ways of doing this is to get the RSEM_genes_normalized__data from the prostate cancer dataset at Firehose (see above) and extract the gene names from the first column. here I changed the dataset name to PRAD.txt (note, the following command is for UNIX/Mac terminal):

awk '{if(NR>1) print $1}' PRAD.txt | cut -d '|' -f1 | sort | uniq | grep -v "?" > gene_names.txt

now you can save this in a variable

rows_pca <- read.table("gene_names.txt")$V1
[1] 20502

now the little trick... the cgdsr package allows you to download up to ~500 genes...so what we do it to download 500 genes at the time :)

i <- 1
while(i <= length(rows_pca)){
  if(i == 1){
    z_pca <- matrix()
  k <- i+499
  temp <- getProfileData(mycgds, rows_pca[i:k], "prad_tcga_rna_seq_v2_mrna_median_Zscores",pca_case_list)
  message(paste("done for genes -->",i," to ",k))
  if(dim(temp)[1] != 0){
    z_pca <- cbind(z_pca,  temp)
  i <- k+1

now, data here have genes as columns and samples as rows, we transpose it and since there will be a number of rows with NAs, we can remove them.

z_pca <- t(z_pca)
rem_pca <- as.numeric(which(apply(z_pca,1,function(x) sumis.na(x))) == ncol(z_pca)))
z_pca <- z_pca[-rem_pca,]
[1] 18015   491

now, you don't need to do this last step, but if you want format the columns using the 12 digit code, you can by doing this:

colnames(z_pca) <- substr(colnames(z_pca),1,12)

and voila', now you have all the z-scores used in cBioportal :)

comments and feedback are always welcome!

ADD COMMENTlink modified 20 months ago by raffael.azevedo30 • written 4.2 years ago by TriS4.3k

Very helpful example, thanks.

ADD REPLYlink written 4.2 years ago by devking0

Hi, I have one question about how to obtain case and control data from TCGA database.

ADD REPLYlink written 3.9 years ago by Ben50

you can't get normals form cBioportal, you need to use TCGA Firehose

ADD REPLYlink written 3.9 years ago by TriS4.3k

But,please, dont forget these will be all processed datasets. cBioPortal also uses processed data.

ADD REPLYlink written 3.7 years ago by realnewbie20

they are not the same "processed". cBiportal gives you expression data in tumor compared to normal. TCGA firehose gives you level 3 TCGA RNASeq data (aligned, mapped, normalized RPKM/FPKM gene/exon/etc level), which cBioportal does not.

ADD REPLYlink written 3.7 years ago by TriS4.3k

I am new to the TCGA and in the comments above, you mention that this line is for UNIX/MAC operating systems:

 awk '{if(NR>1) print $1}' PRAD.txt | cut -d '|' -f1 | sort | uniq | grep -v "?" > gene_names.txt

This is such a great tutorial and I am wondering if there is an equivalent for the Windows environment. Also, are there R statements to download the RSEM_genes_normalized__data directly from the Broad (a line before writing them out to a file) and write out the file that would allow Windows users to finish the tutorial?

ADD REPLYlink written 2.2 years ago by dmbergau30

Hi There,

Thank you for putting up this tutorial. It is very helpful. However, I have errors that I need support with.

1.) The part with downloading all patients. I used the pca_case_list <- getCaseLists(mycgds,prad_study)[1,1], here I get a result value

   *[1] "mixed_allen_2018_all"*

Then applying this command

temp <- getProfileData(mycgds, rows_pca[i:k], "prad_tcga_rna_seq_v2_mrna_median_Zscores",pca_case_list)

I get the following error

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 1 did not have 251 elements

This is certainly the fact that there is not getGeneticProfiles value in the getProfileData function. The matrix of the getGeneticProfiles is a matrix with many ids. I would appreciate if someone could guide me here.

2.) The second point is that the remove method in the following code

z_pca <- t(z_pca)
rem_pca <- as.numeric(which(apply(z_pca,1,function(x) sumis.na(x))) == ncol(z_pca))
z_pca <- z_pca[-rem_pca,]

gives me the following error:

Error in sumis.na(x) : could not find function "sumis.na"

Kindly could someone provide me help as to what to do here in order to get the code working?



ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by akshayb0420

I was able to rectify one part of the error Error in sumis.na(x) : could not find function "sumis.na"

Here the actual syntax for the code is rem_pca <- as.numeric(which(apply(z_pca,1,function(x) sumis.na(x))) == ncol(z_pca)))

Also, I had forgotten to load/install package dplyr.

ADD REPLYlink written 2.2 years ago by akshayb0420

Thank you, my friend! You solved three weeks of headache.

ADD REPLYlink written 20 months ago by raffael.azevedo30
gravatar for TriS
3.9 years ago by
United States, Buffalo
TriS4.3k wrote:

Recently cBioportal allowed users to download full datasets, including mutations, clinical, methylation data etc etc... to retrieve use the "Summary" link on the right of the tumor type name and on the top part of the page you will be able to see a link saying "Download dataset"...click on that and you'll be golden :)

ADD COMMENTlink written 3.9 years ago by TriS4.3k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1900 users visited in the last hour