Tutorial: retrieve full TCGA datasets from cBioportal with R
9
gravatar for TriS
2.1 years ago by
TriS3.5k
United States, Buffalo
TriS3.5k wrote:

cBioportal (http://www.cbioportal.org/index.do) revolutionized the way we look at genomics data. it offers a great interface to visualize TCGA data, integrate them, correlate alterations with survival etc etc. numerous investigators now utilize TCGA data to develop hypotheses or validate their results and the easiest interface to use is, indeed, cBioportal.

for the non-super-lay user, data are accessible also at TCGA firehose (https://gdac.broadinstitute.org/) but this tutorial is not for those data.

I have found online numerous people asking how to download full datasets from cBioportal and no clear answer is available. although cBioportal allows to download data from their website, there still has a limitation on the number of genes you can get.

therefore, the purpose of this short tutorial is to give you a quick n' dirty tweak to get full RNASeq data. the process is valid for other datasets too, this tutorial will show RNASeq data and leaves to the user the fun of expanding this to other data.

required stuff:

the first few steps are described in the link above, I add them here too just for completing the guide. so, load the library and create the cgdsr object

library(cgdsr)
mycgds <- CGDS("http://www.cbioportal.org/public-portal/")

then we need to chose which cancer study we want. I'll go with prostate cancer. the getCancerStudies() function will return a matrix with a list of the studies. the prostate cancer I want is in line 117, therefore:

prad_study <- getCancerStudies(mycgds)[117,1]

and I want to keep all patients:

pca_case_list <- getCaseLists(mycgds,prad_study)[1,1]

now, I need the list of all genes in the genome. one (of the million) ways of doing this is to get the RSEM_genes_normalized__data from the prostate cancer dataset at Firehose (see above) and extract the gene names from the first column. here I changed the dataset name to PRAD.txt (note, the following command is for UNIX/Mac terminal):

awk '{if(NR>1) print $1}' PRAD.txt | cut -d '|' -f1 | sort | uniq | grep -v "?" > gene_names.txt

now you can save this in a variable

rows_pca <- read.table("gene_names.txt")$V1
length(rows_pca)
[1] 20502

now the little trick... the cgdsr package allows you to download up to ~500 genes...so what we do it to download 500 genes at the time :)

i <- 1
while(i <= length(rows_pca)){
  if(i == 1){
    z_pca <- matrix()
  }
  k <- i+499
  temp <- getProfileData(mycgds, rows_pca[i:k], "prad_tcga_rna_seq_v2_mrna_median_Zscores",pca_case_list)
  message(paste("done for genes -->",i," to ",k))
  if(dim(temp)[1] != 0){
    z_pca <- cbind(z_pca,  temp)
  }
  i <- k+1
}

now, data here have genes as columns and samples as rows, we transpose it and since there will be a number of rows with NAs, we can remove them.

z_pca <- t(z_pca)
rem_pca <- as.numeric(which(apply(z_pca,1,function(x) sumis.na(x))) == ncol(z_pca)))
z_pca <- z_pca[-rem_pca,]
dim(z_pca)
[1] 18015   491

now, you don't need to do this last step, but if you want format the columns using the 12 digit code, you can by doing this:

colnames(z_pca) <- substr(colnames(z_pca),1,12)

and voila', now you have all the z-scores used in cBioportal :)

comments and feedback are always welcome!

ADD COMMENTlink modified 18 days ago by akshayb0420 • written 2.1 years ago by TriS3.5k

Very helpful example, thanks.

ADD REPLYlink written 2.1 years ago by devking0

Hi, I have one question about how to obtain case and control data from TCGA database.

ADD REPLYlink written 21 months ago by Ben50

you can't get normals form cBioportal, you need to use TCGA Firehose

ADD REPLYlink written 21 months ago by TriS3.5k

But,please, dont forget these will be all processed datasets. cBioPortal also uses processed data.

ADD REPLYlink written 19 months ago by realnewbie10

they are not the same "processed". cBiportal gives you expression data in tumor compared to normal. TCGA firehose gives you level 3 TCGA RNASeq data (aligned, mapped, normalized RPKM/FPKM gene/exon/etc level), which cBioportal does not.

ADD REPLYlink written 19 months ago by TriS3.5k

I am new to the TCGA and in the comments above, you mention that this line is for UNIX/MAC operating systems:

 awk '{if(NR>1) print $1}' PRAD.txt | cut -d '|' -f1 | sort | uniq | grep -v "?" > gene_names.txt

This is such a great tutorial and I am wondering if there is an equivalent for the Windows environment. Also, are there R statements to download the RSEM_genes_normalized__data directly from the Broad (a line before writing them out to a file) and write out the file that would allow Windows users to finish the tutorial?

ADD REPLYlink written 5 weeks ago by dmbergau0
0
gravatar for TriS
21 months ago by
TriS3.5k
United States, Buffalo
TriS3.5k wrote:

Recently cBioportal allowed users to download full datasets, including mutations, clinical, methylation data etc etc... to retrieve use the "Summary" link on the right of the tumor type name and on the top part of the page you will be able to see a link saying "Download dataset"...click on that and you'll be golden :)

ADD COMMENTlink written 21 months ago by TriS3.5k
0
gravatar for akshayb04
18 days ago by
akshayb0420
Austin, Texas
akshayb0420 wrote:

Hi There,

Thank you for putting up this tutorial. It is very helpful. However, I have errors that I need support with.

1.) The part with downloading all patients. I used the pca_case_list <- getCaseLists(mycgds,prad_study)[1,1], here I get a result value

   *[1] "mixed_allen_2018_all"*

Then applying this command

temp <- getProfileData(mycgds, rows_pca[i:k], "prad_tcga_rna_seq_v2_mrna_median_Zscores",pca_case_list)

I get the following error

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 1 did not have 251 elements

This is certainly the fact that there is not getGeneticProfiles value in the getProfileData function. The matrix of the getGeneticProfiles is a matrix with many ids. I would appreciate if someone could guide me here.

2.) The second point is that the remove method in the following code

z_pca <- t(z_pca)
rem_pca <- as.numeric(which(apply(z_pca,1,function(x) sumis.na(x))) == ncol(z_pca))
z_pca <- z_pca[-rem_pca,]
dim(z_pca)

gives me the following error:

Error in sumis.na(x) : could not find function "sumis.na"

Kindly could someone provide me help as to what to do here in order to get the code working?

Thanks,

akshayb04

ADD COMMENTlink modified 18 days ago • written 18 days ago by akshayb0420

I was able to rectify one part of the error Error in sumis.na(x) : could not find function "sumis.na"

Here the actual syntax for the code is rem_pca <- as.numeric(which(apply(z_pca,1,function(x) sumis.na(x))) == ncol(z_pca)))

Also, I had forgotten to load/install package dplyr.

ADD REPLYlink written 17 days ago by akshayb0420
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1095 users visited in the last hour