TCGAbiolinks: which normalization before differential expression analysis (legacy=TRUE vs. legacy=FALSE)
11 months ago
erica.fary • 0

Dear All,

I am following the TCGAbiolinks tutorial for conducting differential expression analysis on TCGA data ("TCGAanalyze: Analyze data from TCGA" section). I have 2 questions about it.

1) I don't understand the following: with legacy=TRUE data (platform = "Illumina HiSeq", file.type = "results"), the tutorial normalizes to correct for gene length (TCGAanalyze_Normalization with default parameters), but with legacy=FALSE data (workflow.type = "HTSeq - Counts"), it normalizes to correct for GC content (TCGAanalyze_Normalization with method = "gcContent"). What is the reason for that? Do you have any explanation?

2) If I want to use the TCGAanalyze_DEA function with pipeline=limma, should I use the same normalization methods as for pipeline=edgeR? If not, which one should I use for the legacy=FALSE and legacy=TRUE data, respectively?

I hope you can help. Thanks in advance!

Erica

TCGAbiolinks limma TCGA RNA-seq normalization • 754 views
11 months ago
fracarb8 ▴ 810

If you look at the Query tab, it says:

There are two available sources to download GDC data using TCGAbiolinks:
- GDC Legacy Archive: provides access to an unmodified copy of data that was previously stored in CGHub and in the TCGA Data Portal hosted by the TCGA Data Coordinating Center (DCC), in which uses as references GRCh37 (hg19) and GRCh36 (hg18).
- GDC harmonized database: data available was harmonized against GRCh38 (hg38) using GDC Bioinformatics Pipelines which provides methods to the standardization of biospecimen and clinical data.


That means that legacy refers to the data as it was originally provided to them, and that it is not harmonized (e.g. not everything is normalised and scaled to be comparable between projects). You need to look in the documentation (or somewhere on the portal) for the protocols they used, so that you know which data were normalised, scaled, or left raw.

The best approach would be to download the raw counts using GDCquery. Based on this post, it should be possible.
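
For reference, a minimal sketch of what such a raw-count download might look like. The project name (TCGA-BRCA) is only an illustrative assumption, and the workflow.type value is the one mentioned in the question; the accepted values depend on the TCGAbiolinks and GDC release, so check the package documentation for your version before relying on this.

```r
# Sketch only: needs Bioconductor's TCGAbiolinks and network access to the GDC.
library(TCGAbiolinks)
library(SummarizedExperiment)

# Query the harmonized (hg38) database for unnormalized read counts.
# "TCGA-BRCA" is an arbitrary example project (assumption, not from the thread).
query <- GDCquery(
  project       = "TCGA-BRCA",
  data.category = "Transcriptome Profiling",
  data.type     = "Gene Expression Quantification",
  workflow.type = "HTSeq - Counts"  # as named in the question; may differ in newer releases
)

# Download the files matched by the query, then assemble them into one object.
GDCdownload(query)
se <- GDCprepare(query)

# Extract the gene-by-sample count matrix.
counts <- assay(se)
dim(counts)
```

Starting from raw counts like these means any normalization applied afterwards is one you chose yourself, rather than one inherited from the original submission.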