4.4 years ago
kvougas

Hello everyone,

I have the following issue.

I downloaded TCGA RNA-seq legacy data (level 3, which I think means normalized) using the TCGAbiolinks Bioconductor package. Within different files of the same project I find discrepancies, and I would like someone to explain them.


The TCGAbiolinks query was:

    query <- GDCquery(project = project, data.category = "Gene expression", data.type = "Gene expression quantification", platform = "Illumina HiSeq", legacy = TRUE)

Project: TCGA-BRCA

  1. File/case:
     Sample:

       gene_id        normalized_count
       ?|100130426    0
       ?|100133144    13.6068
       ?|100134869    12.0568

  2. File/case: UNCID_421458.TCGA-BH-A0BW-01A-11R-A115-07.110527_UNC10-SN254_0224_AD0CPKABXX.2.trimmed.annotated.gene.quantification.txt
     Sample:

       gene           raw_counts    median_length_normalized    RPKM
       ?|100130426    0             0                           0
       ?|100133144    189           7.7218543046                1.1683197549
       ?|100134869    139           4.3601003764                0.6511684249

In the first case I get only normalized counts, while in the second case I get raw counts, median_length_normalized and RPKM. Say that I want to compare gene expression between files 1 and 2: what should I do? I don't think it would be wise to compare normalized counts against raw counts.
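
To make the mismatch concrete, here is a minimal R sketch that reads the rows quoted above and lines them up by gene ID. It assumes the files are tab-separated (I have not confirmed this), and the object names (file1_text, file2_text, side_by_side) are only for illustration. It only puts the columns side by side; it does not make the values comparable.

```r
# Rows copied from the two files quoted in this post.
# Assumption: the real files are tab-separated with a single header line.
file1_text <- paste(
  "gene_id\tnormalized_count",
  "?|100130426\t0",
  "?|100133144\t13.6068",
  "?|100134869\t12.0568",
  sep = "\n"
)
file2_text <- paste(
  "gene\traw_counts\tmedian_length_normalized\tRPKM",
  "?|100130426\t0\t0\t0",
  "?|100133144\t189\t7.7218543046\t1.1683197549",
  "?|100134869\t139\t4.3601003764\t0.6511684249",
  sep = "\n"
)

# Parse each layout into a data frame.
file1 <- read.delim(text = file1_text, stringsAsFactors = FALSE)
file2 <- read.delim(text = file2_text, stringsAsFactors = FALSE)

# The gene ID column has a different name in each layout,
# so the join key is given separately for each table.
side_by_side <- merge(file1, file2, by.x = "gene_id", by.y = "gene")

# One row per gene: normalized_count from file 1 next to
# raw_counts, median_length_normalized and RPKM from file 2.
print(side_by_side)
```

For the same gene, file 1 has one value column and file 2 has three, and none of the column names match, which is why I am unsure which columns (if any) can be compared directly.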

Sorry if this question is really basic but I am just starting to find my way around...

Thanks in advance

