Question: From TCGA to GDC (Genomic Data Commons)
4.3 years ago, rbrtdambrosio wrote:

Hello,

I was using TCGA data for colon adenocarcinoma (COAD). Specifically, I was using the "IlluminaGA_RNASeqV2" and "IlluminaHiSeq_RNASeqV2" platforms.

Level 3 data were available for COAD on those platforms, and I was using the raw counts from the rsem.genes.results files.

Now that TCGA has moved under the Genomic Data Commons (GDC), I'm struggling to retrieve the same information. I would like to understand how to download from https://gdc-portal.nci.nih.gov/ the same data that were available from TCGA.

I was using TCGAbiolinks, but it no longer seems to work. Any suggestions for an R library to import GDC data?

Thanks

gdc tcga R • 5.9k views
modified 4.3 years ago by Mike • written 4.3 years ago by rbrtdambrosio

I agree that the transition is very confusing, not least because of the way the gdc-portal displays data files for downloading.

Anyway, have you tried Firehose? On the landing page, in the row for COAD, under the Data column, click Browse. The pop-up window that opens should give you what you are looking for.

The file naming is a bit different now, but you should be able to make sense of it. I haven't used the R library, but the Firehose site has its own download client (similar to wget). An R package is described there as well.

written 4.3 years ago by Amitm

Thanks for the suggestion. I will give Firehose a try.

written 4.3 years ago by rbrtdambrosio

Hello Amit, Do we get access to protected data in Firebrowse?

written 3.7 years ago by enigmargs

Nope. I think that would be GDC.

written 3.7 years ago by Amitm
4.3 years ago, Cyriac Kandoth (Memorial Sloan Kettering, New York, USA) wrote:

Nearly all TCGA data/results can be found at Broad Institute's Firehose pipelines. Get raw data/results here or browse the web-based UIs at MSKCC's cbioportal.org or Broad's firebrowse.org.

NOTE: This is just a temporary solution, while I figure out how to use the GDC via CLI. :)

There is a convenient python script to download raw data/results. Download it as follows:

mkdir scripts
curl -o scripts/firehose_get_latest.zip http://gdac.broadinstitute.org/runs/code/firehose_get_latest.zip
unzip -d scripts scripts/firehose_get_latest.zip

Here is how to use that tool to download the normalized per-gene expression estimates from RNA-seq data:

./scripts/firehose_get -b -only Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data data latest

It creates a folder structure with gzipped tarballs in separate tumor-type subfolders. Unpack all the tarballs:

mkdir rna_seq
for file in stddata__*/*/*/*RSEM*.Level_3*.tar.gz; do tar -zxf "$file" -C rna_seq; done

Rename the resulting subfolders to just the tumor type codes, using some in-line Perl and bash:

ls -d rna_seq/gdac* | perl -ne 'chomp; ($t)=m/gdac.broadinstitute.org_(\w+)/; print "mv $_ rna_seq/$t\n"' | bash
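
As an alternative to generating mv commands and piping them to bash, the same rename can be sketched in plain bash. This is only a sketch: rename_cohort_dirs is a hypothetical helper name, and it assumes the unpacked folders follow the gdac.broadinstitute.org_<COHORT>.<rest> pattern seen above. It keeps everything between the prefix and the first dot as the cohort code, and skips a folder rather than overwriting an existing target.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename gdac.broadinstitute.org_<COHORT>.<rest> folders to <COHORT>.
# Assumes the folder naming pattern produced by the unpacking step above.
rename_cohort_dirs() {
    local root="$1" dir name cohort
    for dir in "$root"/gdac.broadinstitute.org_*; do
        [ -d "$dir" ] || continue                   # glob matched nothing, or not a folder
        name="${dir##*/}"                           # strip the leading path
        cohort="${name#gdac.broadinstitute.org_}"   # drop the fixed prefix
        cohort="${cohort%%.*}"                      # keep the text before the first dot
        if [ -e "$root/$cohort" ]; then
            echo "skipping $dir: $root/$cohort already exists" >&2
            continue
        fi
        mv -- "$dir" "$root/$cohort"
    done
}
```

It would be called as: rename_cohort_dirs rna_seq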

Delete the separate colon/rectal cohorts, leaving behind only the combined cohort COADREAD:

rm -rf rna_seq/{COAD,READ}

There are also KIPAN (KICH+KIRC+KIRP) and GBMLGG (GBM+LGG), but keep them, they're interesting. The per-gene RNA-expression estimates are now in these files:

rna_seq/*/*.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt
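
To check that a download worked, it can help to pull a single gene out of one of these files. The sketch below assumes the files are tab-delimited, that the first line holds the sample barcodes, and that the first column of each data row looks like SYMBOL|EntrezID; those layout details are assumptions here, so inspect a file with head first. gene_row is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: gene_row FILE SYMBOL
# Prints the first (sample barcode) line plus the row whose gene symbol equals SYMBOL.
# Assumes tab-delimited input with "SYMBOL|EntrezID" in the first column.
gene_row() {
    local file="$1" symbol="$2"
    awk -F'\t' -v sym="$symbol" '
        NR == 1 { print; next }      # keep the line with the sample barcodes
        { split($1, id, "|") }       # "SYMBOL|EntrezID" -> id[1] is the symbol
        id[1] == sym { print }       # print only the requested gene
    ' "$file"
}
```

For example, gene_row rna_seq/COADREAD/<data file> TP53 would print the barcode line and the TP53 row, where <data file> stands for the file matched by the pattern above.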
modified 2.9 years ago • written 4.3 years ago by Cyriac Kandoth

Hi Cyriac,

Do you know the difference between Illumina RNASeqV2, Illumina RNASeq, and the current RNA-seq data on GDC?

I see several types at the link, but on the GDC portal there is only one kind of RNA-seq data.

Thanks

written 4.3 years ago by bxia

v2 reports RSEM, the other reports RPKM. GDC only reports RSEM.

modified 4.3 years ago • written 4.3 years ago by Cyriac Kandoth

Thanks.

Then what is the relationship between RSEM and HTSeq?

I saw that GDC has HTSeq-Counts, HTSeq-FPKM, and HTSeq-FPKM-UQ.

I thought RSEM generates calculated expression values, while HTSeq gives the raw counts.

I am very confused about what data I am downloading...

It also looks like, if I use the HTSeq data from the new GDC portal, I have to combine the files myself, since each file is downloaded into its own separate folder. Has anyone seen a merged HTSeq file?

modified 4.3 years ago • written 4.3 years ago by bxia
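
On combining per-sample files by hand: a minimal sketch, assuming each download folder holds one gzipped, two-column file (gene ID, tab, count) whose name ends in .htseq.counts.gz. The file suffix, the layout, and the merge_counts helper name are all assumptions for illustration. Columns are labelled with the folder names, and any summary rows present in the inputs are carried through unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge_counts ROOT
# Merges ROOT/<folder>/*.htseq.counts.gz into one tab-delimited gene-by-sample table.
# Assumes each input has two tab-separated columns: gene ID and count.
merge_counts() {
    local root="$1" f
    for f in "$root"/*/*.htseq.counts.gz; do
        [ -f "$f" ] || continue
        # Emit long format: folder name, gene ID, count
        gzip -dc "$f" |
            awk -F'\t' -v s="$(basename "$(dirname "$f")")" '{ print s "\t" $1 "\t" $2 }'
    done | awk -F'\t' '
        !($1 in seen_s) { seen_s[$1] = 1; samples[++ns] = $1 }   # sample order as read
        !($2 in seen_g) { seen_g[$2] = 1; genes[++ng] = $2 }     # gene order as read
        { count[$2, $1] = $3 }
        END {
            printf "gene_id"
            for (j = 1; j <= ns; j++) printf "\t%s", samples[j]
            printf "\n"
            for (i = 1; i <= ng; i++) {
                printf "%s", genes[i]
                for (j = 1; j <= ns; j++) {
                    key = genes[i] SUBSEP samples[j]
                    printf "\t%s", ((key in count) ? count[key] : "NA")   # NA if gene missing
                }
                printf "\n"
            }
        }'
}
```

It would be called as merge_counts download_dir > merged_counts.txt, where download_dir is whatever folder the downloads were saved to. The whole table is held in memory, which is fine for a sketch but may be slow for very large cohorts.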

Sorry, I was wrong. GDC runs their own RNA-seq pipeline defined here, which appears to report FPKM.

written 4.3 years ago by Cyriac Kandoth

AFAIK, the old TCGA RNA-seq V1 analysis used BWA, while the new V2 analysis uses MapSplice. All V1 data were reprocessed as V2, so there should be only one kind of RNA-seq data (submitted by UNC).

written 4.3 years ago by genomax
4.3 years ago, tiagochst wrote:

TCGAbiolinks has been updated to search, download, and prepare data from the GDC data portal.

The new vignette is already on Bioconductor: https://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html

written 4.3 years ago by tiagochst
4.3 years ago, Mike (UK) wrote:

TCGAbiolinks has now been updated; the "TCGAquery" function was replaced by "GDCquery".

The functions TCGAquery, TCGAdownload, TCGAprepare, TCGAquery_maf, TCGAquery_clinical were replaced by GDCquery, GDCdownload, GDCprepare, GDCquery_maf, GDCquery_clinical.

It can access both the GDC and the GDC Legacy Archive.

https://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html#gdcquery-searching-tcga-open-access-data

written 4.3 years ago by Mike