Question: GSEA from broad institute normalization method
Rob wrote:

Hi friends,

I have HTSeq raw count data for 19000 coding genes and 300 samples in two groups (treatment and control). I want to do gene set enrichment analysis with GSEA from the Broad Institute. Do I need to normalize my data before GSEA? If so, how should I normalize it for this purpose?

Thanks

rna-seq • 147 views
modified 7 weeks ago by Kevin Blighe • written 7 weeks ago by Rob
Kevin Blighe wrote:

Hi,

As input to the Broad Institute's GSEA program, you should use any type of expression data that is [properly] normalised such that cross-sample differences can be faithfully gauged. This can mean using any of these:

  • normalised RNA-seq counts via DESeq2's 'geometric' (median-of-ratios) normalisation, edgeR's TMM method, et cetera
  • normalised + transformed RNA-seq expression levels, such as variance-stabilised (vst) or regularised log (rlog) expression levels from DESeq2, or log2 CPMs from edgeR
  • normalised microarray data via RMA, GC-RMA, MAS5, neqc, et cetera

This does not mean raw counts, nor any of these types of expression levels: FPKM, RPKM, TPM, et cetera.
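
To make the first option more concrete, here is a minimal base-R sketch of the median-of-ratios idea behind DESeq2's 'geometric' normalisation. The toy counts and the function name median_of_ratios are made up for illustration; for real data, DESeq2's own estimateSizeFactors() is the thing to use.

```r
# Toy raw counts (hypothetical): 4 genes x 3 samples.
# Sample s2 is s1 sequenced twice as deeply.
counts <- matrix(
  c( 10,  20, 12,
    100, 200, 90,
     50, 100, 55,
      8,  16,  0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3"))
)

# Illustrative helper, not a DESeq2 function.
median_of_ratios <- function(counts) {
  log_counts <- log(counts)
  # Log of each gene's geometric mean across samples;
  # this is -Inf for any gene with a zero count, so such genes are skipped.
  log_geo_means <- rowMeans(log_counts)
  usable <- is.finite(log_geo_means)
  # Size factor = median, over usable genes, of count / geometric mean.
  apply(log_counts[usable, , drop = FALSE], 2, function(col) {
    exp(median(col - log_geo_means[usable]))
  })
}

size_factors <- median_of_ratios(counts)

# Normalised counts: divide each sample (column) by its size factor.
normalised <- sweep(counts, 2, size_factors, "/")
```

In this toy example the size factor of s2 comes out twice that of s1, so the two columns agree after normalisation and can be compared across samples.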

More information here:

Kevin

modified 7 weeks ago • written 7 weeks ago by Kevin Blighe

Hi, I am trying to normalize my count data to do GSEA, but these errors come up. What is the solution?

data <- read.csv("myData.csv")
data <- matrix(data)
vst(data)

Error in vst(data) : less than 'nsub' rows,
  it is recommended to use varianceStabilizingTransformation directly


rlog(data)
Error in DESeqDataSet(se, design = design, ignoreRank) : 
  'list' object cannot be coerced to type 'double'
modified 7 weeks ago • written 7 weeks ago by Rob

Assuming that data holds raw counts, you should first normalise it via:

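library(DESeq2)

# Build the DESeq2 object from raw counts plus sample metadata
dds <- DESeqDataSetFromMatrix(
  countData = data,
  colData = coldata,
  design = ~ Group)
# Estimate size factors and dispersions, then fit the model
dds <- DESeq(dds)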

data does not have to be a matrix; a data frame of counts works too.

coldata should be a data frame that holds the metadata for data, and its rows should be perfectly aligned with the columns of data. You should have at least one column, in this case 'Group', that represents treatment and control.
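
As a rough sketch of how these two inputs could be prepared in base R: the toy table below stands in for what read.csv() returns, assuming the first column holds gene IDs, and every gene and sample name in it is hypothetical. It also shows a likely cause of the errors above: calling matrix() on a data frame produces a list, not a numeric matrix.

```r
# Hypothetical count table, shaped like the result of read.csv():
# first column = gene IDs, remaining columns = one sample each.
raw <- data.frame(
  gene   = c("geneA", "geneB", "geneC"),
  ctrl_1 = c(10L, 0L, 250L),
  ctrl_2 = c(12L, 3L, 300L),
  trt_1  = c(40L, 1L, 120L),
  trt_2  = c(35L, 0L, 110L)
)

# matrix() on a data frame gives a list-type object with one
# element per column, which vst() and rlog() cannot work with.
bad <- matrix(raw)
is.list(bad)

# Instead: drop the ID column, convert, and keep the IDs as row names.
count_matrix <- as.matrix(raw[, -1])
rownames(count_matrix) <- raw$gene
storage.mode(count_matrix) <- "integer"

# Sample metadata: one row per sample, in the same order as the
# columns of the count matrix. 'Group' matches design = ~ Group.
coldata <- data.frame(
  Group = factor(c("control", "control", "treatment", "treatment"),
                 levels = c("control", "treatment")),
  row.names = colnames(count_matrix)
)

# Fail early if samples and metadata are out of step.
stopifnot(identical(rownames(coldata), colnames(count_matrix)))
```

count_matrix and coldata can then be passed as countData and colData.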

Then we can transform these normalised counts:

varStabilised <- vst(dds, blind = FALSE)
regularisedLog <- rlog(dds, blind = FALSE)

The normalised + transformed expression levels will then be accessible via:

assay(varStabilised)
assay(regularisedLog)
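
To get such a matrix into the GSEA desktop program, one option is to write it out as a GCT expression file plus a CLS phenotype file. The following is a minimal base-R sketch under a few assumptions: the small matrix expr stands in for assay(varStabilised), the sample and gene names are hypothetical, and write_gct and write_cls are illustrative helpers, not functions from any package.

```r
# Toy stand-in for assay(varStabilised): genes in rows, samples in columns.
expr <- matrix(
  c(5.1, 5.3, 7.9, 8.2,
    9.0, 9.1, 6.2, 6.0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"),
                  c("ctrl_1", "ctrl_2", "trt_1", "trt_2"))
)
# One label per sample, in column order; labels must not contain spaces.
group <- c("control", "control", "treatment", "treatment")

# GCT layout: "#1.2", then "<genes> TAB <samples>", then a tab-separated
# table whose first two columns are NAME and Description.
write_gct <- function(expr, path) {
  con <- file(path, open = "w")
  on.exit(close(con))
  writeLines(c("#1.2", paste(nrow(expr), ncol(expr), sep = "\t")), con)
  tab <- data.frame(NAME = rownames(expr), Description = "na",
                    expr, check.names = FALSE)
  write.table(tab, con, sep = "\t", quote = FALSE, row.names = FALSE)
}

# CLS layout: "<samples> <classes> 1", then "# class names",
# then one class label per sample.
write_cls <- function(group, path) {
  classes <- unique(group)
  writeLines(c(
    paste(length(group), length(classes), 1),
    paste("#", paste(classes, collapse = " ")),
    paste(group, collapse = " ")
  ), path)
}

gct_path <- file.path(tempdir(), "expression.gct")
cls_path <- file.path(tempdir(), "phenotype.cls")
write_gct(expr, gct_path)
write_cls(group, cls_path)
```

The same helpers would work on counts(dds, normalized = TRUE), since that is also a genes-by-samples matrix.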
modified 7 weeks ago • written 7 weeks ago by Kevin Blighe

Thanks Kevin. What I understand from your answer is: first I run the DESeq workflow on the raw count data, then the result (dds), which holds the normalized counts, goes into vst() or rlog():

varStabilised <- vst(dds, blind = FALSE)

Then this varStabilised will be my normalized data for GSEA. Am I getting this correctly?

written 6 weeks ago by Rob

Yep, but you can also use the normalised counts, accessible via:

counts(dds, normalized = TRUE)

The distribution of the normalised counts differs from that of the variance-stabilised expression levels, but this is not a problem because GSEA is based on ranking.

modified 6 weeks ago • written 6 weeks ago by Kevin Blighe

Thanks for the explanation. When I used this, I got an error:

counts(dds, normalized = TRUE)
Error in .local(object, ...) : 
  first calculate size factors, add normalizationFactors, or set normalized=FALSE

How can I solve this?

written 6 weeks ago by Rob

You have evidently not yet run DESeq(dds).
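
The error message itself says what is missing: size factors. DESeq() estimates them as its first step, and if only normalised counts are needed, estimateSizeFactors() on its own is enough. A minimal sketch, assuming DESeq2 is installed (the block skips itself otherwise) and using DESeq2's simulated example data rather than the data from this thread:

```r
# Runs only if DESeq2 is installed; the counts are simulated.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  suppressPackageStartupMessages(library(DESeq2))

  # Simulated raw counts: 200 genes x 6 samples.
  dds <- makeExampleDESeqDataSet(n = 200, m = 6)

  # Size factors must exist before counts(dds, normalized = TRUE) works.
  dds <- estimateSizeFactors(dds)
  norm_counts <- counts(dds, normalized = TRUE)

  # Normalised counts are the raw counts divided by each
  # sample's size factor.
  manual <- sweep(counts(dds), 2, sizeFactors(dds), "/")
  stopifnot(isTRUE(all.equal(norm_counts, manual)))
}
```

If the object has already been through DESeq(dds), the size factors are stored in it and the counts() call works directly.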

written 6 weeks ago by Kevin Blighe

Thanks Kevin. It was helpful.

written 6 weeks ago by Rob