From TCGA to GDC (Genomic Data Commons)
6.4 years ago

Hello,

I was using TCGA data for colon adenocarcinoma (COAD). Specifically, I was using the "IlluminaGA_RNASeqV2" and "IlluminaHiSeq_RNASeqV2" platforms.

For COAD, level 3 data were available for those platforms. I was using the raw counts from the rsem.genes.results files.

Now that TCGA has moved to the Genomic Data Commons (GDC), I'm struggling to retrieve the same information. I would like to understand how to download from https://gdc-portal.nci.nih.gov/ the same data that were available from TCGA.

I was using TCGAbiolinks, but it no longer seems to work. Any suggestions for an R library to import GDC data?

Thanks

R TCGA gdc • 7.5k views

I agree that the transition is very confusing, not least because of the way the gdc-portal displays data files for downloading.

Anyway, have you tried Firehose? On the landing page, in the row for COAD, under the Data column, click Browse. The pop-up window that opens should give you what you are looking for.

The file naming is a bit different now, but you should be able to make sense of it. I haven't used the R library, but the Firehose site has its own download client (similar to wget). An R package is described there as well.


Thanks for the suggestion. I will give Firehose a try.


Nope. I think that would be GDC.

6.4 years ago

Nearly all TCGA data/results can be found at Broad Institute's Firehose pipelines. Get raw data/results here or browse the web-based UIs at MSKCC's cbioportal.org or Broad's firebrowse.org.

NOTE: This is just a temporary solution, while I figure out how to use the GDC via CLI. :)

mkdir scripts
# Save firehose_get_latest.zip (from the Broad GDAC site) into scripts/ before this step
unzip -d scripts scripts/firehose_get_latest.zip


Here is how to use that tool to download the normalized per-gene expression estimates from RNA-seq data:

./scripts/firehose_get -b -only Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data data latest


It creates a folder structure with gzipped tarballs in separate tumor-type subfolders. Unpack all the tarballs:

mkdir rna_seq
for file in stddata__*/*/*/*RSEM*.Level_3*.tar.gz; do tar -zxf $file -C rna_seq; done


Rename the resulting subfolders to just the tumor type codes, using some in-line Perl and bash:

ls -d rna_seq/gdac* | perl -ne 'chomp; ($t)=m/gdac.broadinstitute.org_(\w+)/; print "mv $_ rna_seq/$t\n"' | bash
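
For anyone who prefers to avoid Perl, the rename step can be sketched in plain shell. This is only a sketch: the helper names tumor_code and rename_plan are made up for illustration, and it assumes the unpacked folders are named gdac.broadinstitute.org_<COHORT>.<rest>, as in the command above. It prints the mv commands rather than running them, so the plan can be checked first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the tumor type code from a Firehose folder name,
# e.g. gdac.broadinstitute.org_COAD.Merge_... -> COAD
tumor_code() {
    local name
    name=$(basename "$1")
    name=${name#gdac.broadinstitute.org_}   # drop the fixed prefix
    printf '%s\n' "${name%%.*}"             # keep the text before the first dot
}

# Hypothetical helper: print (not run) one mv command per unpacked folder.
rename_plan() {
    local dir
    for dir in "$1"/gdac.broadinstitute.org_*; do
        [ -d "$dir" ] || continue           # nothing matched, or not a folder
        printf 'mv %s %s/%s\n' "$dir" "$1" "$(tumor_code "$dir")"
    done
}

rename_plan rna_seq
```

If the printed commands look right, pipe the output to bash, as in the Perl version.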


Delete the separate colon/rectal cohorts, leaving behind only the combined cohort COADREAD:

rm -rf rna_seq/{COAD,READ}


There are also KIPAN (KICH+KIRC+KIRP) and GBMLGG (GBM+LGG), but keep them, they're interesting. The per-gene RNA-expression estimates are now in these files:

rna_seq/*/*.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt
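
A quick sanity check on those files is to count the sample columns per cohort. The sketch below assumes each file is tab-separated, with the first header line holding one label column followed by one column per sample; sample_count is a made-up helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of sample columns in an expression matrix,
# i.e. tab-separated fields in the first line minus the leading label column.
sample_count() {
    head -n 1 "$1" | awk -F '\t' '{ print NF - 1 }'
}

# Print "<cohort><TAB><number of samples>" for every unpacked cohort.
for f in rna_seq/*/*RSEM_genes_normalized__data.data.txt; do
    [ -f "$f" ] || continue                 # skip if the glob matched nothing
    printf '%s\t%s\n' "$(basename "$(dirname "$f")")" "$(sample_count "$f")"
done
```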


Hi Cyriac,

Do you know the difference between Illumina RNASeqV2, Illumina RNASeq, and the current RNA-seq data on GDC?

I see several types at the link, but on the GDC portal there is only one kind of RNA-seq data.

Thanks


v2 reports RSEM, the other reports RPKM. GDC only reports RSEM.


Thanks,

Then what is the relationship between RSEM and HTSeq?

I saw that GDC has HTSeq-Counts, HTSeq-FPKM, and HTSeq-FPKM-UQ.

I thought RSEM generates estimated expression values, and HTSeq gives the raw counts.

And it looks like if I use the HTSeq data from the new GDC portal, I have to combine the files myself, since they are downloaded separately, folder by folder. Has anyone seen a merged HTSeq file?
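
If no merged file turns up, combining the per-sample files by hand is not too bad. Below is a rough sketch, not a GDC tool: merge_counts is a made-up helper, and it assumes each (already decompressed) file is tab-separated with a gene ID in column 1 and a count in column 2. It also assumes the htseq-count summary rows, whose IDs start with "__" (for example __no_feature), should be dropped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge two-column count files into one gene-by-sample matrix.
# Usage: merge_counts file1 [file2 ...] > merged.tsv
merge_counts() {
    awk -F '\t' -v OFS='\t' '
        FNR == 1   { n++; name[n] = FILENAME }      # each input file becomes a column
        $1 ~ /^__/ { next }                         # skip htseq-count summary rows
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++g] = $1 }  # keep first-seen gene order
            count[$1, n] = $2
        }
        END {
            printf "gene_id"
            for (j = 1; j <= n; j++) printf "%s%s", OFS, name[j]
            printf "\n"
            for (i = 1; i <= g; i++) {
                printf "%s", order[i]
                for (j = 1; j <= n; j++) {
                    if ((order[i], j) in count) printf "%s%s", OFS, count[order[i], j]
                    else                        printf "%s%s", OFS, "NA"   # gene absent in this file
                }
                printf "\n"
            }
        }' "$@"
}
```

Column headers are the input file names, so you would still need to map those back to TCGA barcodes using the sample sheet or metadata downloaded alongside the files.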


Sorry, I was wrong. GDC runs their own RNA-seq pipeline defined here, which appears to report FPKM.


AFAIK the TCGA RNA-seq V1 analysis (old) used BWA, while the V2 analysis (new) uses MapSplice. All V1 data were reprocessed as V2, so there should be only one kind of RNA-seq data (submitted by UNC).

6.4 years ago
tiagochst ▴ 70

6.4 years ago
Mike ★ 1.8k

TCGAbiolinks has now been updated: the "TCGAquery" function has been replaced with "GDCquery".

It can access both the GDC and the GDC Legacy Archive.