Question

Normalization method for GSEA from the Broad Institute

0

4.5 years ago

Rob ▴ 180

Hi friends

I have HTSeq raw count data for 19,000 coding genes across 300 samples in two groups (treatment and control). I want to run gene set enrichment analysis with GSEA from the Broad Institute. Do I need to normalize my data before running GSEA, and if so, how should I normalize it for this purpose?

Thanks

RNA-Seq • 4.8k views

ADD COMMENT • link updated 24 months ago by benformatics 4.1k • written 4.5 years ago by Rob ▴ 180

score 5 · Accepted Answer · 2021-01-16

5

4.5 years ago

Kevin Blighe 89k

Hi,

As input to the Broad Institute's GSEA program, you should use any type of expression data that is [properly] normalised such that cross-sample differences can be faithfully gauged. This can mean using any of these:

normalised RNA-seq counts via DESeq2's 'geometric' normalisation, edgeR's TMM method, et cetera
normalised + transformed RNA-seq expression levels, such as variance-stabilised (vst) or regularised log (rlog) expression levels from DESeq2, or log2 CPMs from edgeR
normalised microarray data via RMA, GC-RMA, MAS5, neqc, et cetera

This excludes raw counts and the following types of expression levels: FPKM, RPKM, TPM, et cetera.
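
As a rough illustration of what a 'geometric' (median-of-ratios) normalisation does, here is a minimal base-R sketch on a made-up count matrix. It is not DESeq2's own code, and the object names (counts, log_geo_means, size_factors, norm_counts) are hypothetical; in practice you would let DESeq2 do this for you.

```r
# Toy raw-count matrix: 4 genes (rows) x 3 samples (columns); values are made up.
counts <- matrix(c(10, 20, 40,
                   100, 200, 400,
                   5, 10, 20,
                   50, 100, 200),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

# Per-gene geometric mean across samples, computed on the log scale.
log_geo_means <- rowMeans(log(counts))

# Only genes with a finite log geometric mean (no zero counts) are usable.
usable <- is.finite(log_geo_means)

# Size factor per sample: median ratio of its counts to the gene-wise geometric means.
size_factors <- apply(counts[usable, , drop = FALSE], 2, function(sample_counts) {
  exp(median(log(sample_counts) - log_geo_means[usable]))
})

# Divide each sample (column) by its size factor to get normalised counts.
norm_counts <- sweep(counts, 2, size_factors, "/")
```

In this toy example the three samples differ only by sequencing depth, so after normalisation every column of norm_counts is identical, which is the cross-sample comparability that GSEA needs.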

More information here:

Kevin

ADD COMMENT • link 4.5 years ago by Kevin Blighe 89k

0

Hi, I am trying to normalize my count data for GSEA, but these errors come up. What is the solution?

data <- read.csv("myData.csv")
data <- matrix(data)
vst(data)

Error in vst(data) : less than 'nsub' rows,
  it is recommended to use varianceStabilizingTransformation directly


rlog(data)
Error in DESeqDataSet(se, design = design, ignoreRank) : 
  'list' object cannot be coerced to type 'double'

ADD REPLY • link 4.5 years ago by Rob ▴ 180

1

Assuming that the object data holds raw counts, you should first normalise it via:

dds <- DESeqDataSetFromMatrix(
  countData = data,
  colData = coldata,
  design= ~ Group)
dds <- DESeq(dds)

data does not have to be a data matrix.

coldata should be a data frame that represents the metadata for data, and its rows should be perfectly aligned with the columns of data. You should have at least one column, in this case 'Group', that represents treatment and control.
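
A minimal sketch of preparing these two inputs, using base R only. The file contents, sample names (s1 to s4) and object names (count_mat, coldata) are made up for illustration; the temporary CSV stands in for your own myData.csv, and I am assuming the first column holds the gene IDs.

```r
# Write a tiny made-up count table to a temporary CSV (stand-in for myData.csv).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("gene,s1,s2,s3,s4",
             "geneA,10,12,30,28",
             "geneB,0,3,1,2",
             "geneC,100,90,95,110"), csv_path)

# Read with gene IDs as row names, then convert to an integer matrix.
# as.matrix() keeps the genes x samples shape, whereas matrix() applied to a
# data frame yields a list-mode object, which is what triggers the
# "'list' object cannot be coerced to type 'double'" error.
count_df <- read.csv(csv_path, row.names = 1, check.names = FALSE)
count_mat <- as.matrix(count_df)
storage.mode(count_mat) <- "integer"

# Sample metadata: one row per sample, in the same order as the count columns.
coldata <- data.frame(
  Group = factor(c("control", "control", "treatment", "treatment")),
  row.names = colnames(count_mat)
)

# Guard against misaligned metadata before building the DESeqDataSet.
stopifnot(identical(rownames(coldata), colnames(count_mat)))
```

count_mat and coldata can then be passed as countData and colData in the call above.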

Then we can transform these normalised counts:

varStabilised <- vst(dds, blind = FALSE)
regularisedLog <- rlog(dds, blind = FALSE)

The normalised + transformed expression levels will then be accessible via:

assay(varStabilised)
assay(regularisedLog)

ADD REPLY • link 4.5 years ago by Kevin Blighe 89k

0

Thanks Kevin. What I understand from your answer is: first I run the DESeq workflow on the raw count data, then the result (dds), holding the normalized counts, goes into vst() or rlog():

varStabilised <- vst(dds, blind = FALSE)

Then this varStabilised will be my normalized data for GSEA. Am I getting this correctly?

ADD REPLY • link 4.5 years ago by Rob ▴ 180

1

Yep, but you can also use the normalised counts, accessible via:

counts(dds, normalized = TRUE)

The distribution of the normalised counts differs from that of the variance-stabilised expression levels, but this is not a problem because GSEA is based on ranking.
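
Whichever of the two you pick, the GSEA desktop program needs the matrix as an expression file plus a phenotype file. Here is a minimal base-R sketch that writes a .gct and a .cls file, following the layout of those formats as I understand them from the GSEA documentation. The matrix expr and the vector groups are made-up stand-ins for your own normalised data and sample labels.

```r
# Made-up normalised expression matrix: 3 genes x 4 samples.
expr <- matrix(c(20.5, 22.1, 60.3, 58.7,
                 1.0, 2.5, 1.2, 1.9,
                 190.0, 185.4, 192.2, 201.8),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("geneA", "geneB", "geneC"),
                               c("s1", "s2", "s3", "s4")))
groups <- c("control", "control", "treatment", "treatment")

# GCT file: version line, dimensions line, header line, then one row per gene
# with a NAME column and a Description column ("na" used as a placeholder).
gct_path <- tempfile(fileext = ".gct")
gct_header <- paste(c("NAME", "Description", colnames(expr)), collapse = "\t")
gct_rows <- apply(cbind(rownames(expr), "na", expr), 1, paste, collapse = "\t")
writeLines(c("#1.2",
             paste(nrow(expr), ncol(expr), sep = "\t"),
             gct_header,
             gct_rows), gct_path)

# CLS file: "<n samples> <n classes> 1", then "# <class names>", then one
# label per sample in the same order as the GCT columns.
cls_path <- tempfile(fileext = ".cls")
class_names <- unique(groups)
writeLines(c(paste(length(groups), length(class_names), 1),
             paste("#", paste(class_names, collapse = " ")),
             paste(groups, collapse = " ")), cls_path)
```

The sample order in the .cls file has to match the column order in the .gct file, for the same reason that coldata has to be aligned with the count matrix.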

ADD REPLY • link 4.5 years ago by Kevin Blighe 89k

0

Thanks for the explanation. When I used this, I got an error:

counts(dds, normalized = TRUE)
Error in .local(object, ...) : 
  first calculate size factors, add normalizationFactors, or set normalized=FALSE

How can I solve this?

ADD REPLY • link 4.5 years ago by Rob ▴ 180

1

You have evidently not yet run DESeq(dds)

ADD REPLY • link 4.5 years ago by Kevin Blighe 89k

1

Thanks Kevin, it was helpful.

ADD REPLY • link 4.5 years ago by Rob ▴ 180