Tutorial:TCGA transcriptome data to R (DESeq2)
0
11
Entering edit mode
11 months ago
Barry Digby ★ 1.0k

This is a frequently asked question, so here is a robust method to fully recapitulate the counts given by TCGA and analyse them using DESeq2.

# ! Updated to reflect Data Release 32.0

GDC Product: Data - GENCODE v36 Release

Release Date: March 29, 2022

Why the long way? Tanya and I noticed via TCGA-Biolinks and Firehose did not generate the full count matrix. ~5-10% of genes were missing to begin with. (more on this at the end of the post)

## GDC Portal

#### 1. Add files to cart

First, you will need to go to the GDC Portal and add all of the files you want to download to your cart:

1. In the left panel, select the dataset you want. This example is for TCGA-PRAD gene expression quantification using STAR.
2. Add all files to cart

## GDC Client

#### 1. Install gdc-client tool:

wget https://gdc.cancer.gov/files/public/file/gdc-client_v1.6.1_Ubuntu_x64.zip
unzip gdc-client_v1.6.1_Ubuntu_x64.zip
sudo mv gdc-client /usr/bin && rm gdc-client_v1.6.1_Ubuntu_x64.zip
gdc-client --help


Please check if newer versions are available as the post ages: GDC Data Transfer Tool Client.

mkdir TCGA-PRAD


You will now have a directory for each patient under TCGA-PRAD/:

$tree TCGA-PRAD/ | head TCGA-PRAD/ ├── 0007888f-8d96-4c01-8251-7fef6cc71596 │ ├── 88215dd0-5841-44f1-9393-eefd8238cbb3.rna_seq.augmented_star_gene_counts.tsv │ └── logs │ └── 88215dd0-5841-44f1-9393-eefd8238cbb3.rna_seq.augmented_star_gene_counts.tsv.parcel ├── 00d6b6a9-179f-4b2a-826f-f8326197b17d │ ├── 9c56eacc-c71b-48b5-8747-008fd1105f89.rna_seq.augmented_star_gene_counts.tsv │ └── logs │ └── 9c56eacc-c71b-48b5-8747-008fd1105f89.rna_seq.augmented_star_gene_counts.tsv.parcel ├── 012f91be-31a2-4e38-9922-2ee6f9b3328c  ## Generate count matrix in R Note: This step is memory intensive Please uncomment the lines below if you wish to select only protein_coding genes (recommended). library(data.table) generate_count_mat <- function(path, pattern) { files = list.files(path, pattern, full.names = TRUE, recursive=TRUE, include.dirs=TRUE) mat = as.data.frame(do.call(cbind, lapply(files, function(x) fread(x, stringsAsFactors = FALSE)))) mat <- mat[-c(1:4),] #gene_type <- as.character(mat[,3]) rownames(mat) = mat[,1] mat = as.data.frame(mat[, seq(4, ncol(mat), 9)]) #mat$gene_type <- gene_type
#mat <- subset(mat, mat$gene_type == "protein_coding") #mat <- mat[,-c(ncol(mat))] return(mat) } # stage raw counts mrna_counts <- generate_count_mat("/data/TCGA-PRAD", "\\.rna_seq.augmented_star_gene_counts.tsv$")


Uncommenting the lines above results in a count matrix with 19962 protein coding genes. Without filtering, the resulting count matrix has 60660 genes, composed of:

    14 IG_C_gene
9 IG_C_pseudogene
37 IG_D_gene
18 IG_J_gene
3 IG_J_pseudogene
1 IG_pseudogene
145 IG_V_gene
187 IG_V_pseudogene
16901 lncRNA
1881 miRNA
2212 misc_RNA
2 Mt_rRNA
22 Mt_tRNA
48 polymorphic_pseudogene
10167 processed_pseudogene
19962 protein_coding
18 pseudogene
8 ribozyme
47 rRNA
497 rRNA_pseudogene
49 scaRNA
1 scRNA
943 snoRNA
1901 snRNA
5 sRNA
1057 TEC
500 transcribed_processed_pseudogene
138 transcribed_unitary_pseudogene
939 transcribed_unprocessed_pseudogene
2 translated_processed_pseudogene
1 translated_unprocessed_pseudogene
6 TR_C_gene
4 TR_D_gene
79 TR_J_gene
4 TR_J_pseudogene
106 TR_V_gene
33 TR_V_pseudogene
98 unitary_pseudogene
2614 unprocessed_pseudogene
1 vault_RNA


You might have noticed the count matrix has non-sensical column names. The count matrix was generated based on list.files() thus, we must re-arrange the sample sheet to match the same ordering prior to appending Sample.ID as column names.

This step utilises the Sample Sheet Downloaded in the GDC Portal step.

library(tidyverse)
file_names <- list.files("/data/TCGA-PRAD", "\\.rna_seq.augmented_star_gene_counts.tsv$", full.names = FALSE, recursive = TRUE, include.dirs = FALSE) file_names <- sub(".*/", "", file_names) samplesheet <- read.table("gdc_sample_sheet.2022-04-27.tsv", header=T, sep="\t") samplesheet <- samplesheet[match(file_names, samplesheet$File.Name),]
colnames(mrna_counts) <- samplesheet$Sample.ID meta <- subset(samplesheet, select=c(Sample.ID, Sample.Type)) rownames(meta) <- NULL meta <- column_to_rownames(mrna_meta, var="Sample.ID")  If you want to compare Tumor vs. Normal, filter to remove samples that are neither: meta$Sample.Type <- gsub("Primary Tumor", "Tumor", meta$Sample.Type,) meta$Sample.Type <- gsub("Solid Tissue Normal", "Normal", meta$Sample.Type) meta$Sample.Type <- as.factor(meta$Sample.Type) levels(meta$Sample.Type)

> levels(meta$Sample.Type) [1] "Metastatic" "Normal" "Tumor"  Remove the Metastatic sample from both the counts matrix and metadata: metastatic_key <- rownames(meta)[which(meta$Sample.Type == "Metastatic")]
mrna_counts <- mrna_counts[names(mrna_counts)[!names(mrna_counts) %in% metastatic_key]]
meta <- subset(meta, !(rownames(meta) %in% metastatic_key))
table(colnames(mrna_counts) == rownames(meta))

> table(colnames(mrna_counts) == rownames(meta))

TRUE
550


Note: If you want to perform more complex contrasts, download the clinical sample sheet in the GDC cart checkout page.

## DESeq2

dds <- DESeqDataSetFromMatrix(mrna_counts, colData = meta, design = ~ Sample.Type)
dds$Sample.Type <- relevel(dds$Sample.Type, ref = "Normal")
dds <- DESeq(dds)

## DESeq2 results
res <- results(dds, alpha = 0.05, name = "Sample.Type_Tumor_vs_Normal")
res_df <- as.data.frame(res)

## Function to split results
get_upregulated <- function(df){
key <- intersect(rownames(df)[which(df$log2FoldChange>=1)], rownames(df)[which(df$pvalue<=0.05)])

results <- as.data.frame((df)[which(rownames(df) %in% key),])
return(results)
}

get_downregulated <- function(df){
key <- intersect(rownames(df)[which(df$log2FoldChange<=-1)], rownames(df)[which(df$pvalue<=0.05)])

results <- as.data.frame((df)[which(rownames(df) %in% key),])
return(results)
}

up_reg <- get_upregulated(res_df)
down_reg <- get_downregulated(res_df)


## Appending Gene Names to results

The STAR counts in release 32 come with a gene_name column we can use to annotate our results (or use BiomaRt..). Read one of the raw files in and use this as an ensembl_gene_id_version to gene_name map.

$head TCGA-PRAD/0007888f-8d96-4c01-8251-7fef6cc71596/88215dd0-5841-44f1-9393-eefd8238cbb3.rna_seq.augmented_star_gene_counts.tsv # gene-model: GENCODE v36 gene_id gene_name gene_type unstranded stranded_first stranded_second tpm_unstranded fpkm_unstranded fpkm_uq_unstranded N_unmapped 1119332 1119332 1119332 N_multimapping 4161493 4161493 4161493 N_noFeature 3363885 28590771 28547217 N_ambiguous 5128483 1241035 1244999 ENSG00000000003.15 TSPAN6 protein_coding 3473 1768 1705 54.5637 16.3336 16.0655 ENSG00000000005.6 TNMD protein_coding 2 2 0 0.0966 0.0289 0.0284 ENSG00000000419.13 DPM1 protein_coding 1591 827 764 93.9366 28.1198 27.6583 ENSG00000000457.14 SCYL3 protein_coding 1263 938 957 13.0767 3.9145 3.8502  ## Discrepancies #### GDC Portal Method: > dim(mrna_counts) [1] 60660 553  #### GDC Query Method: library(TCGAbiolinks) library(SummarizedExperiment) library(biomaRt) mrna_query <- GDCquery(project = "TCGA-PRAD", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "HTSeq - Counts", experimental.strategy = "RNA-Seq") GDCdownload(mrna_query, method = "api", files.per.chunk = 100, directory = "~/Desktop/TCGA/mRNA/") mrna_df <- GDCprepare(mrna_query, directory = "~/Desktop/TCGA/mRNA/") # extract what you need from mrna_df$ to add to metadata
mrna_meta <- mrna_df$sample mrna_meta <- cbind(mrna_meta, mrna_df$definition)
mrna_df <- assay(mrna_df)

dim(mrna_df)

> dim(mrna_df)
[1] 56602   551


You are instantly missing 6.5% of genes compared to the gdc-client method - apparently firehose performs worse.

For the sake of reproducibility, one should be able to obtain the full TCGA gene set and decide for themselves what to discard.

Full credit to Tanya Aneichyk for raising these discrepancies in the first place, and to Dónal O'Shea for pointing out the need for preserving ordering using file.names() before appending Sample.ID.

TCGA GDC DESeq2 • 3.0k views
1
Entering edit mode

When I performed using TCGABiolinks I'm getting the exact count. Can you cross-check with updated pipeline?

library(TCGAbiolinks)
library(SummarizedExperiment)
library(dplyr)
library(DT)

data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "STAR - Counts")

mrna_df <- GDCprepare(mrna_query)

mrna_df <- assay(mrna_df)

dim(mrna_df)

[1] 60661   553

0
Entering edit mode

Seems like the discrepancy is fixed

0
Entering edit mode

Hey this is exactly what I was looking for since the genes I am interested in seem to be out of all the online TCGA seq datasets. However, I don't think my computer can handle this downloading all of this (I'm a noobie too). Would it be possible to host the results online somewhere or transfer them to me?

0
Entering edit mode

.htseq.counts are not available in the updated gdc-portal website. They provide only STAR - Counts. I think this workflow needs change as per STAR - Counts as the file formats are different.

0
Entering edit mode

0
Entering edit mode

Seems to be a minor tweak - I just have to change the column indexing to grab the unstranded column from the STAR file. I will update the documentation to reflect the augmented_star_gene_counts.tsv nomenclature. Thanks again.

0
Entering edit mode

I haven't tested this code, but there is a corner case in TCGA/CTPAC-3 or any other projects that have RPPA protein array data. As you might know, all libraries are constructed in the aliquot-level, and RPPA just needs a lot more materials so the analysis center actually combined a few samples to a single aliquot, instead of the regular way (split one sample into multiple aliquot). As a consequence of that, you might see sometimes the sample_id, sample_type, etc are comma delimited value array instead a simple enum.

0
Entering edit mode

0
Entering edit mode

If you just select all CPTAC-3 RNA-Seq Genomic-aligned BAMs, add them to the GDC cart and download the "sample sheet", you will see ~200 BAMs generated from combined aliquots. One example is 8912abbb-e37b-4bee-8244-55e30f2fc4bc generated from an aliquot that is a combination of C3N-00498-04 and C3N-00498-02, both are "Primary Tumor" sample_type and belong to the same case C3N-00498