Tutorial:TCGA transcriptome data to R (DESeq2)
Entering edit mode
2.2 years ago
Barry Digby ★ 1.3k

This is a frequently asked question, so here is a robust method to fully recapitulate the counts given by TCGA and analyse them using DESeq2.

! Updated to reflect Data Release 32.0

GDC Product: Data - GENCODE v36 Release

Release Date: March 29, 2022

Why the long way? Tanya and I noticed via TCGA-Biolinks and Firehose did not generate the full count matrix. ~5-10% of genes were missing to begin with. (more on this at the end of the post)

GDC Portal

1. Add files to cart

First, you will need to go to the GDC Portal and add all of the files you want to download to your cart:

  1. In the left panel, select the dataset you want. This example is for TCGA-PRAD gene expression quantification using STAR.
  2. Add all files to cart


2. Download Manifest, Sample Sheet


GDC Client

1. Install gdc-client tool:

wget https://gdc.cancer.gov/files/public/file/gdc-client_v1.6.1_Ubuntu_x64.zip
unzip gdc-client_v1.6.1_Ubuntu_x64.zip
sudo mv gdc-client /usr/bin && rm gdc-client_v1.6.1_Ubuntu_x64.zip
gdc-client --help

Please check if newer versions are available as the post ages: GDC Data Transfer Tool Client.

2. Download files using the client:

gdc-client download -m gdc_manifest_20220427_075342.txt -d TCGA-PRAD/

You will now have a directory for each patient under TCGA-PRAD/:

$ tree TCGA-PRAD/ | head
├── 0007888f-8d96-4c01-8251-7fef6cc71596
│   ├── 88215dd0-5841-44f1-9393-eefd8238cbb3.rna_seq.augmented_star_gene_counts.tsv
│   └── logs
│       └── 88215dd0-5841-44f1-9393-eefd8238cbb3.rna_seq.augmented_star_gene_counts.tsv.parcel
├── 00d6b6a9-179f-4b2a-826f-f8326197b17d
│   ├── 9c56eacc-c71b-48b5-8747-008fd1105f89.rna_seq.augmented_star_gene_counts.tsv
│   └── logs
│       └── 9c56eacc-c71b-48b5-8747-008fd1105f89.rna_seq.augmented_star_gene_counts.tsv.parcel
├── 012f91be-31a2-4e38-9922-2ee6f9b3328c

Generate count matrix in R

Note: This step is memory intensive

Please uncomment the lines below if you wish to select only protein_coding genes (recommended).


generate_count_mat <- function(path, pattern) {
    files = list.files(path, pattern, full.names = TRUE, recursive=TRUE, include.dirs=TRUE)
    mat = as.data.frame(do.call(cbind, lapply(files, function(x) fread(x, stringsAsFactors = FALSE))))
    mat <- mat[-c(1:4),]
    #gene_type <- as.character(mat[,3])
    rownames(mat) = mat[,1]
    mat = as.data.frame(mat[, seq(4, ncol(mat), 9)])
    #mat$gene_type <- gene_type
    #mat <- subset(mat, mat$gene_type == "protein_coding")
    #mat <- mat[,-c(ncol(mat))]

# stage raw counts
mrna_counts <- generate_count_mat("/data/TCGA-PRAD", "\\.rna_seq.augmented_star_gene_counts.tsv$")

Uncommenting the lines above results in a count matrix with 19962 protein coding genes. Without filtering, the resulting count matrix has 60660 genes, composed of:

    14 IG_C_gene
      9 IG_C_pseudogene
     37 IG_D_gene
     18 IG_J_gene
      3 IG_J_pseudogene
      1 IG_pseudogene
    145 IG_V_gene
    187 IG_V_pseudogene
  16901 lncRNA
   1881 miRNA
   2212 misc_RNA
      2 Mt_rRNA
     22 Mt_tRNA
     48 polymorphic_pseudogene
  10167 processed_pseudogene
  19962 protein_coding
     18 pseudogene
      8 ribozyme
     47 rRNA
    497 rRNA_pseudogene
     49 scaRNA
      1 scRNA
    943 snoRNA
   1901 snRNA
      5 sRNA
   1057 TEC
    500 transcribed_processed_pseudogene
    138 transcribed_unitary_pseudogene
    939 transcribed_unprocessed_pseudogene
      2 translated_processed_pseudogene
      1 translated_unprocessed_pseudogene
      6 TR_C_gene
      4 TR_D_gene
     79 TR_J_gene
      4 TR_J_pseudogene
    106 TR_V_gene
     33 TR_V_pseudogene
     98 unitary_pseudogene
   2614 unprocessed_pseudogene
      1 vault_RNA

Fetch Metadata

You might have noticed the count matrix has non-sensical column names. The count matrix was generated based on list.files() thus, we must re-arrange the sample sheet to match the same ordering prior to appending Sample.ID as column names.

This step utilises the Sample Sheet Downloaded in the GDC Portal step.

file_names <- list.files("/data/TCGA-PRAD", "\\.rna_seq.augmented_star_gene_counts.tsv$", full.names = FALSE, recursive = TRUE, include.dirs = FALSE)
file_names <- sub(".*/", "", file_names)
samplesheet <- read.table("gdc_sample_sheet.2022-04-27.tsv", header=T, sep="\t")
samplesheet <- samplesheet[match(file_names, samplesheet$File.Name),]
colnames(mrna_counts) <- samplesheet$Sample.ID
meta <- subset(samplesheet, select=c(Sample.ID, Sample.Type))
rownames(meta) <- NULL
meta <- column_to_rownames(mrna_meta, var="Sample.ID")

If you want to compare Tumor vs. Normal, filter to remove samples that are neither:

meta$Sample.Type <- gsub("Primary Tumor", "Tumor", meta$Sample.Type,)
meta$Sample.Type <- gsub("Solid Tissue Normal", "Normal", meta$Sample.Type)
meta$Sample.Type <- as.factor(meta$Sample.Type)
> levels(meta$Sample.Type)
[1] "Metastatic" "Normal"     "Tumor"

Remove the Metastatic sample from both the counts matrix and metadata:

metastatic_key <- rownames(meta)[which(meta$Sample.Type == "Metastatic")]
mrna_counts <- mrna_counts[names(mrna_counts)[!names(mrna_counts) %in% metastatic_key]]
meta <- subset(meta, !(rownames(meta) %in% metastatic_key))
table(colnames(mrna_counts) == rownames(meta))
> table(colnames(mrna_counts) == rownames(meta))


Note: If you want to perform more complex contrasts, download the clinical sample sheet in the GDC cart checkout page.


dds <- DESeqDataSetFromMatrix(mrna_counts, colData = meta, design = ~ Sample.Type)
dds$Sample.Type <- relevel(dds$Sample.Type, ref = "Normal")
dds <- DESeq(dds)

## DESeq2 results
res <- results(dds, alpha = 0.05, name = "Sample.Type_Tumor_vs_Normal")
res_df <- as.data.frame(res)

## Function to split results
get_upregulated <- function(df){
    key <- intersect(rownames(df)[which(df$log2FoldChange>=1)],

    results <- as.data.frame((df)[which(rownames(df) %in% key),])

get_downregulated <- function(df){
  key <- intersect(rownames(df)[which(df$log2FoldChange<=-1)],

  results <- as.data.frame((df)[which(rownames(df) %in% key),])

up_reg <- get_upregulated(res_df)
down_reg <- get_downregulated(res_df)

Appending Gene Names to results

The STAR counts in release 32 come with a gene_name column we can use to annotate our results (or use BiomaRt..). Read one of the raw files in and use this as an ensembl_gene_id_version to gene_name map.

$ head TCGA-PRAD/0007888f-8d96-4c01-8251-7fef6cc71596/88215dd0-5841-44f1-9393-eefd8238cbb3.rna_seq.augmented_star_gene_counts.tsv 
# gene-model: GENCODE v36
gene_id gene_name   gene_type   unstranded  stranded_first  stranded_second tpm_unstranded  fpkm_unstranded fpkm_uq_unstranded
N_unmapped          1119332 1119332 1119332         
N_multimapping          4161493 4161493 4161493         
N_noFeature         3363885 28590771    28547217            
N_ambiguous         5128483 1241035 1244999         
ENSG00000000003.15  TSPAN6  protein_coding  3473    1768    1705    54.5637 16.3336 16.0655
ENSG00000000005.6   TNMD    protein_coding  2   2   0   0.0966  0.0289  0.0284
ENSG00000000419.13  DPM1    protein_coding  1591    827 764 93.9366 28.1198 27.6583
ENSG00000000457.14  SCYL3   protein_coding  1263    938 957 13.0767 3.9145  3.8502


GDC Portal Method:

> dim(mrna_counts)
[1] 60660   553

GDC Query Method:


mrna_query <- GDCquery(project = "TCGA-PRAD",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",
                      experimental.strategy = "RNA-Seq")
GDCdownload(mrna_query, method = "api", files.per.chunk = 100,
            directory = "~/Desktop/TCGA/mRNA/")
mrna_df <- GDCprepare(mrna_query, directory = "~/Desktop/TCGA/mRNA/")
# extract what you need from mrna_df$ to add to metadata
mrna_meta <- mrna_df$sample
mrna_meta <- cbind(mrna_meta, mrna_df$definition)
mrna_df <- assay(mrna_df)

> dim(mrna_df)
[1] 56602   551

You are instantly missing 6.5% of genes compared to the gdc-client method - apparently firehose performs worse.

For the sake of reproducibility, one should be able to obtain the full TCGA gene set and decide for themselves what to discard.

Full credit to Tanya Aneichyk for raising these discrepancies in the first place, and to Dónal O'Shea for pointing out the need for preserving ordering using file.names() before appending Sample.ID.

TCGA GDC DESeq2 • 6.7k views
Entering edit mode

When I performed using TCGABiolinks I'm getting the exact count. Can you cross-check with updated pipeline?


mrna_query <- GDCquery(project = "TCGA-PRAD",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "STAR - Counts")

GDCdownload(mrna_query, method = "api")

mrna_df <- GDCprepare(mrna_query)

mrna_df <- assay(mrna_df)


[1] 60661   553
Entering edit mode

Seems like the discrepancy is fixed

Entering edit mode

.htseq.counts are not available in the updated gdc-portal website. They provide only STAR - Counts. I think this workflow needs change as per STAR - Counts as the file formats are different.

Entering edit mode

Thanks for the heads up.

Entering edit mode

Seems to be a minor tweak - I just have to change the column indexing to grab the unstranded column from the STAR file. I will update the documentation to reflect the augmented_star_gene_counts.tsv nomenclature. Thanks again.

Entering edit mode

Hey this is exactly what I was looking for since the genes I am interested in seem to be out of all the online TCGA seq datasets. However, I don't think my computer can handle this downloading all of this (I'm a noobie too). Would it be possible to host the results online somewhere or transfer them to me?

Entering edit mode

I haven't tested this code, but there is a corner case in TCGA/CTPAC-3 or any other projects that have RPPA protein array data. As you might know, all libraries are constructed in the aliquot-level, and RPPA just needs a lot more materials so the analysis center actually combined a few samples to a single aliquot, instead of the regular way (split one sample into multiple aliquot). As a consequence of that, you might see sometimes the sample_id, sample_type, etc are comma delimited value array instead a simple enum.

Entering edit mode

I didn't know about this. If you can point to a specific example I can add an addendum as an additional comment/answer. These types of tutorials benefit from inputs from the community, thanks for mentioning this.

Entering edit mode

If you just select all CPTAC-3 RNA-Seq Genomic-aligned BAMs, add them to the GDC cart and download the "sample sheet", you will see ~200 BAMs generated from combined aliquots. One example is 8912abbb-e37b-4bee-8244-55e30f2fc4bc generated from an aliquot that is a combination of C3N-00498-04 and C3N-00498-02, both are "Primary Tumor" sample_type and belong to the same case C3N-00498


Login before adding your answer.

Traffic: 2716 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6