Differences between TCGAbiolinks and cBioPortal
5 months ago
jomagrax ▴ 40

Hi,

I'm exploring and integrating the TCGA LUAD transcriptomic and genomic data. I'm trying to do so both with TCGAbiolinks in R and with cBioPortal.

With TCGAbiolinks I access the data this way (https://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/analysis.html#TCGAvisualize:_Visualize_results_from_analysis_functions_with_TCGA%E2%80%99s_data):

Transcriptomic data

library(TCGAbiolinks)
library(SummarizedExperiment)  # provides assay()
library(stringr)               # provides str_sub()

# Query the STAR gene-level RNA-Seq counts for TCGA-LUAD
query <- GDCquery(project = "TCGA-LUAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "STAR - Counts",
                  experimental.strategy = "RNA-Seq")

GDCdownload(query, method = "api", files.per.chunk = 1000, directory = "/home/arantxa/proyects/itziar/FIS")
LUAD <- GDCprepare(query = query, directory = "/home/arantxa/proyects/itziar/FIS")

# Unstranded count matrix (genes in rows, samples in columns)
LUADMatrix <- assay(LUAD, "unstranded")

# For gene expression, if you need to see a boxplot correlation and AAIC plot to define outliers, you can run
LUAD.RNAseq_CorOutliers <- TCGAanalyze_Preprocessing(LUAD)

# Drop the last 12 characters of each barcode so the sample names match the genomic data
colnames(LUAD.RNAseq_CorOutliers) <- str_sub(colnames(LUAD.RNAseq_CorOutliers), start = 1, end = -13)
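
The barcode truncation above depends on the fixed layout of TCGA aliquot barcodes. The following is a minimal base-R sketch of what each cut keeps. The barcodes are made up for illustration (the `XX` site code and the plate fields are placeholders, not real samples). The 15-character form is, to my knowledge, the sample ID format cBioPortal uses for TCGA studies, but that is worth verifying against the study you load.

```r
# Hypothetical 28-character TCGA aliquot barcodes (not real samples)
barcodes <- c("TCGA-XX-0001-01A-11R-0000-07",
              "TCGA-XX-0002-11A-11R-0000-07")

# Dropping the last 12 characters (same effect as str_sub(x, 1, -13))
# keeps project-site-participant-sample+vial
sample_vial <- substr(barcodes, 1, nchar(barcodes) - 12)

# Shorter identifiers that are often used when matching across resources
sample_id  <- substr(barcodes, 1, 15)  # sample level, without the vial letter
patient_id <- substr(barcodes, 1, 12)  # participant level

# Characters 14-15 encode the sample type (01 = primary tumor, 11 = solid tissue normal)
sample_type <- substr(barcodes, 14, 15)

print(data.frame(sample_vial, sample_id, patient_id, sample_type))
```

Matching at the wrong level (for example aliquot versus sample) is one simple way for counts to disagree between two tools, so it helps to make the level explicit before comparing.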

Genomic data

query <- GDCquery(
  project = "TCGA-LUAD", 
  data.category = "Simple Nucleotide Variation", 
  access = "open", 
  legacy = FALSE, 
  data.type = "Masked Somatic Mutation", 
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)
GDCdownload(query, directory = "/home/arantxa/proyects/itziar/FIS")
maf <- GDCprepare(query, directory = "/home/arantxa/proyects/itziar/FIS")
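
To compare "number of samples with a specific mutation" against another resource, it helps to count distinct samples per gene rather than MAF rows. The sketch below uses a toy data frame with the standard MAF columns `Hugo_Symbol` and `Tumor_Sample_Barcode`; the barcodes are invented, and I am assuming the real `maf` object returned above exposes the same two columns.

```r
# Toy MAF-like table (hypothetical barcodes); one row per called variant
maf_toy <- data.frame(
  Hugo_Symbol = c("KRAS", "KRAS", "TP53", "TP53", "EGFR"),
  Tumor_Sample_Barcode = c("TCGA-XX-0001-01A-11D-0000-08",
                           "TCGA-XX-0001-01A-11D-0000-08",
                           "TCGA-XX-0001-01A-11D-0000-08",
                           "TCGA-XX-0002-01A-11D-0000-08",
                           "TCGA-XX-0003-01A-11D-0000-08"),
  stringsAsFactors = FALSE
)

# Reduce aliquot barcodes to 15-character sample IDs
maf_toy$sample_id <- substr(maf_toy$Tumor_Sample_Barcode, 1, 15)

# Keep one row per (gene, sample) pair so that a sample carrying
# several variants in the same gene is counted only once
unique_pairs <- unique(maf_toy[, c("Hugo_Symbol", "sample_id")])

# Number of distinct mutated samples per gene
mutated_samples <- table(unique_pairs$Hugo_Symbol)
print(mutated_samples)
```

Counting variant rows instead of distinct samples would report KRAS twice for the first sample here, which is one way per-gene numbers can drift apart between tools.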

But when I try to access LUAD TCGA on cBioPortal, I find three different datasets (Firehose, Nature and PanCancer Atlas), and the number of samples with specific mutations, etc. doesn't add up with the TCGAbiolinks cohort. Also, I can't compare transcriptomic data across the different datasets.

So my questions are:

1. Where does this difference come from?
2. What is the best way to explore this dataset on cBioPortal, as it would be my first choice?

R TCGAbiolinks cBioportal • 435 views
5 months ago
Zhenyu Zhang ★ 1.2k

They are different analyses of the same raw data. The GDC release is normally the most recent; however, the GDC does not perform batch-effect correction. So if you are not comparing TCGA-LUAD with other projects, your best bet is the data from the latest publication that performs batch-effect correction.
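
A quick way to see how far two releases diverge is to compare their sample ID lists directly. This is only a sketch: both vectors are hypothetical placeholders standing in for IDs exported from the GDC cohort and from a cBioPortal study, and it assumes both have already been reduced to the same 15-character sample level.

```r
# Hypothetical sample IDs from two sources (placeholders, not real data)
gdc_samples   <- c("TCGA-XX-0001-01", "TCGA-XX-0002-01", "TCGA-XX-0003-01")
study_samples <- c("TCGA-XX-0002-01", "TCGA-XX-0003-01", "TCGA-XX-0004-01")

shared     <- intersect(gdc_samples, study_samples)  # present in both
only_gdc   <- setdiff(gdc_samples, study_samples)    # missing from the study
only_study <- setdiff(study_samples, gdc_samples)    # missing from the GDC cohort

cat("shared:", length(shared),
    "| only GDC:", length(only_gdc),
    "| only study:", length(only_study), "\n")
```

Restricting any mutation or expression comparison to the `shared` samples separates differences caused by cohort membership from differences caused by the analysis pipeline.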
