Question on GDC Legacy RNAseq data

8.2 years ago

kvougas • 0

Hello everyone,

I have the following issue.

I downloaded TCGA RNAseq Legacy data (level 3, which I think means normalized) using the TCGAbiolinks Bioconductor package. Within different files of the same project I find discrepancies, and I would like someone to explain them.

Specifically

The TCGAbiolinks query was:

query <- GDCquery(project = project, data.category = "Gene expression", data.type = "Gene expression quantification", platform = "Illumina HiSeq", legacy = TRUE)

Project: TCGA-BRCA

File/case: unc.edu.e6dbaf07-3551-4c73-a2f2-f1bea4fa8e72.1989506.rsem.genes.normalized_results
Sample:
gene_id        normalized_count
?|100130426    0
?|100133144    13.6068
?|100134869    12.0568

File/case: UNCID_421458.TCGA-BH-A0BW-01A-11R-A115-07.110527_UNC10-SN254_0224_AD0CPKABXX.2.trimmed.annotated.gene.quantification.txt
Sample:
gene           raw_counts    median_length_normalized    RPKM
?|100130426    0             0                           0
?|100133144    189           7.7218543046                1.1683197549
?|100134869    139           4.3601003764                0.6511684249

In the first case I get only normalized counts, while in the second case I get raw counts, median_length_normalized and RPKM. Say that I want to compare gene expression between files 1 and 2. What should I do? I don't think it would be wise to compare normalized counts against raw counts.
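To make the mismatch concrete, here is a minimal Python sketch (plain standard library, not part of TCGAbiolinks) that sorts downloaded file names into the two kinds shown above, so that only like is compared with like. The suffixes are copied from the two example file names; the group labels and the `group_by_kind` helper are hypothetical names chosen for illustration, and other suffixes may well exist in the same download.

```python
from collections import defaultdict

# Suffixes taken from the two example files in this post.
# The labels on the right are illustrative shorthand, not GDC terminology.
SUFFIX_TO_KIND = {
    ".rsem.genes.normalized_results": "rsem_normalized",
    ".gene.quantification.txt": "raw_counts_rpkm",
}


def group_by_kind(filenames):
    """Group file names by suffix; anything unrecognised goes to 'other'."""
    groups = defaultdict(list)
    for name in filenames:
        kind = next(
            (label for suffix, label in SUFFIX_TO_KIND.items() if name.endswith(suffix)),
            "other",
        )
        groups[kind].append(name)
    return dict(groups)


files = [
    "unc.edu.e6dbaf07-3551-4c73-a2f2-f1bea4fa8e72.1989506.rsem.genes.normalized_results",
    "UNCID_421458.TCGA-BH-A0BW-01A-11R-A115-07.110527_UNC10-SN254_0224_AD0CPKABXX.2"
    ".trimmed.annotated.gene.quantification.txt",
]

groups = group_by_kind(files)
for kind, names in groups.items():
    print(kind, len(names))
```

Grouping like this only separates the files; it does not make the two kinds of values comparable with each other.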

Sorry if this question is really basic but I am just starting to find my way around...

Thanks in advance

RNA-Seq TCGA GDC Normalization TCGAbiolincs • 2.1k views
