Differences between TCGAbiolinks and cBioPortal
3 months ago
jomagrax ▴ 40


I'm exploring and integrating the TCGA-LUAD transcriptomic and genomic data, and I'm trying to do so both with TCGAbiolinks in R and with cBioPortal.

With TCGAbiolinks I access the data as follows, based on the vignette (https://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/analysis.html#TCGAvisualize:_Visualize_results_from_analysis_functions_with_TCGA%E2%80%99s_data):

Transcriptomic data

library(TCGAbiolinks)
library(SummarizedExperiment)  # provides assay()

query <- GDCquery(
  # legacy = T,
  project = "TCGA-LUAD",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts",
  experimental.strategy = "RNA-Seq"
)

GDCdownload(query, method = "api",files.per.chunk = 1000, directory = "/home/arantxa/proyects/itziar/FIS")
LUAD <- GDCprepare(query = query, directory = "/home/arantxa/proyects/itziar/FIS")

LUADMatrix <- assay(LUAD,"unstranded") 

# For gene expression, if you need a correlation boxplot and an AAIC plot to define outliers, you can run
LUAD.RNAseq_CorOutliers <- TCGAanalyze_Preprocessing(LUAD)

# Rename the samples so they can be matched to the genomic data:
# drop the last 12 characters of each barcode (str_sub() is from the stringr package)
colnames(LUAD.RNAseq_CorOutliers) <- stringr::str_sub(colnames(LUAD.RNAseq_CorOutliers), start = 1, end = -13)
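
To make the trimming step above concrete, here is a minimal sketch in base R on made-up barcodes (they are illustrative only, not real aliquots). It assumes the standard 28-character TCGA aliquot barcode, in which the first 12 characters identify the patient, the first 16 identify the sample and vial, and characters 14-15 encode the sample type.

```r
# Hypothetical aliquot barcodes, for illustration only
barcodes <- c(
  "TCGA-AA-0001-01A-11R-A000-07",
  "TCGA-AA-0001-11A-11R-A000-07",
  "TCGA-AA-0002-01A-11R-A000-07"
)

patient_id  <- substr(barcodes, 1, 12)   # patient level
sample_id   <- substr(barcodes, 1, 16)   # sample + vial level
sample_type <- substr(barcodes, 14, 15)  # "01" = primary tumor, "11" = solid tissue normal

# Trimming can produce duplicated names if a sample has more than one aliquot,
# so it is worth checking before using the trimmed IDs as column names
has_duplicates <- any(duplicated(sample_id))

# Number of samples of each type
type_counts <- table(sample_type)
```

Checking `has_duplicates` and `type_counts` on the real column names shows whether normal-tissue samples or repeated aliquots are inflating the cohort relative to what another portal reports.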

Genomic data

query <- GDCquery(
  project = "TCGA-LUAD", 
  data.category = "Simple Nucleotide Variation", 
  access = "open", 
  legacy = FALSE,  # drop this line if your TCGAbiolinks version does not accept a legacy argument
  data.type = "Masked Somatic Mutation", 
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)
GDCdownload(query, directory = "/home/arantxa/proyects/itziar/FIS")
maf <- GDCprepare(query, directory = "/home/arantxa/proyects/itziar/FIS")
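
When comparing "number of samples with a mutation" across sources, it helps to count distinct samples rather than MAF rows, since one sample can carry several variants in the same gene. The sketch below uses a toy table, not the real `maf` object; `Hugo_Symbol` and `Tumor_Sample_Barcode` are standard MAF columns, while the barcodes and the `maf_toy` name are made up for illustration.

```r
# Toy MAF-like table with the two standard MAF columns used here
maf_toy <- data.frame(
  Hugo_Symbol = c("KRAS", "KRAS", "EGFR", "TP53", "TP53"),
  Tumor_Sample_Barcode = c(
    "TCGA-AA-0001-01A-11D-A000-08",
    "TCGA-AA-0001-01A-11D-A000-08",  # second KRAS variant in the same sample
    "TCGA-AA-0002-01A-11D-A000-08",
    "TCGA-AA-0001-01A-11D-A000-08",
    "TCGA-AA-0003-01A-11D-A000-08"
  ),
  stringsAsFactors = FALSE
)

# Reduce the aliquot barcode to the 16-character sample barcode
maf_toy$sample_id <- substr(maf_toy$Tumor_Sample_Barcode, 1, 16)

# Count distinct mutated samples per gene (not rows)
samples_per_gene <- sapply(
  split(maf_toy$sample_id, maf_toy$Hugo_Symbol),
  function(ids) length(unique(ids))
)
```

The same two steps should apply to the data frame returned by `GDCprepare()`, assuming it keeps these column names.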

But then, when I try to access TCGA LUAD on cBioPortal, I find three different datasets (Firehose, Nature and PanCancer Atlas). The number of samples with specific mutations, etc., doesn't match the TCGAbiolinks cohort. Also, I can't compare transcriptomic data across the different datasets.

So my questions are:

1. Where does this difference come from?
2. What is the best way to explore this dataset on cBioPortal, which would be my first choice?

R TCGAbiolinks cBioportal • 351 views
3 months ago
Zhenyu Zhang ★ 1.1k

They are different analyses of the same raw data. The GDC version is normally the most recent; however, the GDC does not perform batch-effect correction. So, if you are not comparing TCGA-LUAD with other projects, your best bet is the data from the latest publication that performs batch-effect correction.
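
A quick way to see how far two versions of the cohort diverge is to compare their sample identifiers directly. The following is a minimal sketch with made-up IDs; it assumes the cBioPortal study lists samples in the 15-character form (for example `TCGA-AA-0001-01`), so the GDC-derived barcodes are cut to the same length before comparing. The vector names are hypothetical.

```r
# Hypothetical sample IDs from each source, for illustration only
gdc_samples  <- c("TCGA-AA-0001-01A", "TCGA-AA-0002-01A", "TCGA-AA-0003-01A")
cbio_samples <- c("TCGA-AA-0001-01", "TCGA-AA-0002-01", "TCGA-AA-0004-01")

# Bring the GDC barcodes to the same 15-character resolution (assumption above)
gdc_key <- substr(gdc_samples, 1, 15)

shared    <- intersect(gdc_key, cbio_samples)  # present in both sources
only_gdc  <- setdiff(gdc_key, cbio_samples)    # only in the GDC download
only_cbio <- setdiff(cbio_samples, gdc_key)    # only in the cBioPortal study
```

Restricting any count comparison to `shared` separates differences caused by cohort membership from differences caused by the analysis pipeline.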


